Face Expression Recognition via transformer-based classification models

dc.contributor.authorArslanoglu, M. Cihad
dc.contributor.authorAcar, Hüseyin
dc.contributor.authorAlbayrak, Abdulkadir
dc.date.accessioned2025-02-22T14:17:59Z
dc.date.available2025-02-22T14:17:59Z
dc.date.issued2024
dc.departmentDicle Üniversitesien_US
dc.description.abstractFacial Expression Recognition (FER) has been widely studied in the literature because it has many applications. The rapid development of deep learning and computer vision algorithms, especially transformer-based classification models, makes it hard to select the most appropriate model. Using a complex model may increase accuracy, but it also increases inference time, which is crucial in near real-time applications. On the other hand, small models may not give the desired results. This study examines the accuracy and data-processing-time performance of five relatively small transformer-based image classification algorithms on FER tasks. The models used are the vanilla Vision Transformer (ViT), Pooling-based Vision Transformer (PiT), Shifted Windows Transformer (Swin), Data-efficient image Transformer (DeiT), and Cross-attention Vision Transformer (CrossViT), selected with their trainable parameter counts and architectures in mind. Each model has 20-30M trainable parameters, which is relatively small, and each has a different architecture: for example, CrossViT processes the image with multi-scale patches, while PiT adds convolution layers and pooling techniques to the vanilla ViT. Model performance is evaluated on the CK+48 and KDEF datasets, which are well known and widely used in the literature. All models performed similarly to results reported in the literature. The PiT model, which includes both Convolutional Neural Network (CNN) and Transformer layers, achieved the best accuracy scores, 0.9513 and 0.9090 on the CK+48 and KDEF datasets, respectively. This shows that CNN layers boost the performance of transformer-based models and help them learn the CK+48 and KDEF datasets more efficiently. The Swin Transformer obtained the worst accuracy score, 0.9080, on the CK+48 dataset and a near-worst score, 0.8434, on the KDEF dataset.
The Swin Transformer and PiT exhibit the worst and best image-processing performance in terms of time spent, respectively, which also makes the PiT model suitable for real-time applications. Moreover, the PiT model required the fewest training epochs (25 for CK+48 and 83 for KDEF) to reach these performances.en_US
dc.identifier.doi10.17694/bajece.1486140
dc.identifier.endpage223en_US
dc.identifier.issn2147-284X
dc.identifier.issue3en_US
dc.identifier.startpage214en_US
dc.identifier.trdizinid1278353en_US
dc.identifier.urihttps://doi.org/10.17694/bajece.1486140
dc.identifier.urihttps://search.trdizin.gov.tr/tr/yayin/detay/1278353
dc.identifier.urihttps://hdl.handle.net/11468/30186
dc.identifier.volume12en_US
dc.indekslendigikaynakTR-Dizin
dc.language.isoenen_US
dc.relation.ispartofBalkan Journal of Electrical and Computer Engineeringen_US
dc.relation.publicationcategoryMakale - Ulusal Hakemli Dergi - Kurum Öğretim Elemanıen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.snmzKA_TR_20250222
dc.subjectClassificationen_US
dc.subjectTransformersen_US
dc.subjectViTen_US
dc.subjectFERen_US
dc.titleFace Expression Recognition via transformer-based classification modelsen_US
dc.typeArticleen_US