VISION TRANSFORMER-BASED DEEP LEARNING TECHNIQUES FOR IMPROPER FACE MASK-WEARING DETECTION

MIST Central Library Repository

dc.contributor.author A. S. M. MUNTAHEEN
dc.date.accessioned 2025-12-03T13:14:06Z
dc.date.available 2025-12-03T13:14:06Z
dc.date.issued 2024-03
dc.identifier.uri http://dspace.mist.ac.bd:8080/xmlui/handle/123456789/1051
dc.description VISION TRANSFORMER-BASED DEEP LEARNING TECHNIQUES FOR IMPROPER FACE MASK-WEARING DETECTION en_US
dc.description.abstract The COVID-19 pandemic caused a global catastrophe that remarkably affected individual lives and society, as well as the economy. The world has adopted numerous defenses against this contagious disease, and wearing face masks is one of the most crucial. Effective prevention relies on proper face mask use, yet less than 25% of individuals adhere to correct usage. The prevalent methods for face mask detection involve image processing, machine learning, and deep learning; notably, the Vision Transformer (ViT) base model has outperformed traditional deep learning models and made a significant impact in various domains. The application of the ViT model to face mask detection, however, remains largely unexplored. This paper proposes applying a recent deep learning-based image classification model, the ViT model, to automatically detect improper face mask-wearing, i.e., whether face masks are being worn correctly or not. The ViT base model shows a significant impact on incorrect face mask detection. The experiment has been conducted on a large custom dataset consisting of 203,780 digital images with three class labels, namely ‘With Mask’ (when people are wearing the mask properly), ‘Without Mask’ (when people are not wearing the mask), and ‘Incorrect Mask’ (when people are not wearing the mask properly), to train, validate, and test a ViT model that can correctly classify the use of face masks. The results show that the accuracy achieved with the pre-trained ViT model is remarkably high. Furthermore, the same experiment has been conducted with a Convolutional Neural Network (CNN) model on the same dataset with the same three class labels. The CNN model's results were then compared with the ViT model's results. The findings indicate that the ViT model exhibited faster training times and higher training accuracy, making it a more time-efficient option for incorrect face mask identification.
This advantage in training time can be crucial for real-time applications and scenarios requiring quick response and decision-making. These findings contribute to advancing the field of image classification and offer valuable insights for future research and the development of improved image classification systems. Extended experiments involve five distinct CNN architectures—Xception, MobileNetV2, VGG16, Inception, and ResNet-50—utilizing a smaller dataset consisting of 2,079 digital images with the same three class labels. The results demonstrate that the ViT model outperforms all other models. In conclusion, this study establishes that the ViT model achieves the highest accuracy among all evaluated models. en_US
dc.language.iso en en_US
dc.title VISION TRANSFORMER-BASED DEEP LEARNING TECHNIQUES FOR IMPROPER FACE MASK-WEARING DETECTION en_US
dc.type Thesis en_US

