| dc.description.abstract |
The COVID-19 pandemic caused a global catastrophe that profoundly affected individual lives, society, and the economy. The world has taken numerous defenses against this contagious disease, and wearing face masks is one of the most crucial defense mechanisms. Effective prevention relies on proper face mask use, yet less than 25% of individuals adhere to correct usage. Prevalent methods for face mask detection involve image processing, machine learning, and deep learning; notably, the Vision Transformer (ViT) base model has outperformed traditional deep learning models and made a significant impact in various domains. However, the ViT model has not yet been explored for face mask detection. This
paper proposes applying a recent deep learning-based image classification model, the ViT model, to automatically detect improper face mask wearing, i.e., whether face masks are being worn correctly or not. The ViT base model proves highly effective at detecting incorrect face mask use. The experiment was conducted on a large custom dataset of 203,780 digital images with three class labels, namely 'With Mask' (the mask is worn properly), 'Without Mask' (no mask is worn), and 'Incorrect Mask' (the mask is worn improperly), to train, validate, and test a ViT model that correctly classifies face mask use. The
results show that the accuracy achieved with the pre-trained ViT model is remarkably high. Furthermore, the same experiment was conducted with a Convolutional Neural Network (CNN) model on the same dataset with the same three class labels, and the results of the CNN and ViT models were then compared.
The findings indicate that the ViT model exhibited faster training times and higher training
accuracy, making it a more time-efficient option for incorrect face mask identification. This
advantage in training time can be crucial for real-time applications and scenarios requiring
quick response and decision-making. These findings contribute to advancing the field of
image classification and offer valuable insights for future research and the development of
improved image classification systems. Extended experiments involved five distinct CNN architectures (Xception, MobileNetV2, VGG16, Inception, and ResNet50) on a smaller dataset of 2,079 digital images with the same three class labels. The results demonstrate that the ViT model outperforms all other models. In
conclusion, this study establishes that the ViT model achieves the highest accuracy among
the other evaluated models. |
en_US |