Face Expression Recognition via transformer-based classification models
Abstract
Facial Expression Recognition (FER) tasks have been widely studied in the literature because they have many applications. The rapid development of deep learning computer vision algorithms, especially transformer-based classification models, makes it hard to select the most appropriate model. Using a complex model may increase accuracy but also increases inference time, which is crucial in near real-time applications. On the other hand, small models may not yield the desired results. In this study, we examined the performance of five relatively small transformer-based image classification models on FER tasks. We used vanilla ViT, PiT, Swin, DeiT, and CrossViT, considering their trainable parameter counts and architectures. Each model has 20-30M trainable parameters, which makes it relatively small. Moreover, each model has a different architecture. For example, CrossViT processes the image using multi-scale patches, while PiT introduces convolution layers and pooling into the vanilla ViT architecture. We obtained all results on two widely used FER datasets: CK+ and KDEF. We observed that the PiT model achieves the best accuracy scores, 0.9513 and 0.9090, on the CK+ and KDEF datasets, respectively.
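Below is a minimal sketch, not the authors' code, of how the five small transformer backbones named in the abstract could be instantiated with a FER classification head using the `timm` library. The specific `timm` model variants and the 7-class emotion head are assumptions; the abstract only states that each model has roughly 20-30M trainable parameters.

```python
# Sketch: instantiate the five small transformer classifiers for FER.
# Model variants and the 7-class head are assumptions, not taken from the paper.
import timm

MODEL_NAMES = [
    "vit_small_patch16_224",         # vanilla ViT
    "pit_s_224",                     # PiT: adds convolution and pooling to ViT
    "swin_tiny_patch4_window7_224",  # Swin: shifted-window attention
    "deit_small_patch16_224",        # DeiT: data-efficient training recipe
    "crossvit_small_240",            # CrossViT: multi-scale patch branches
]

NUM_CLASSES = 7  # assumed number of emotion categories for CK+/KDEF setups

for name in MODEL_NAMES:
    model = timm.create_model(name, pretrained=True, num_classes=NUM_CLASSES)
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{name}: {n_params / 1e6:.1f}M trainable parameters")
```

Each created model can then be fine-tuned on the FER dataset with a standard cross-entropy training loop; the trainable parameter count printed above is one way to verify the 20-30M range mentioned in the abstract.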