Face Expression Recognition via transformer-based classification models
dc.contributor.author | Arslanoglu, M. Cihad | |
dc.contributor.author | Acar, Hüseyin | |
dc.contributor.author | Albayrak, Abdulkadir | |
dc.date.accessioned | 2025-02-22T14:17:59Z | |
dc.date.available | 2025-02-22T14:17:59Z | |
dc.date.issued | 2024 | |
dc.department | Dicle Üniversitesi | en_US |
dc.description.abstract | Facial Expression Recognition (FER) has been widely studied in the literature because it has many applications. The rapid development of deep learning and computer vision algorithms, especially transformer-based classification models, makes it difficult to select the most appropriate model. Using a complex model may increase accuracy but also increases inference time, which is crucial in near-real-time applications; on the other hand, small models may not give the desired results. This study examines the accuracy and per-sample processing time of five relatively small transformer-based image classification models for FER tasks. The models, chosen with their trainable parameter counts and architectures in mind, are the vanilla Vision Transformer (ViT), Pooling-based Vision Transformer (PiT), Shifted Windows Transformer (Swin), Data-efficient image Transformer (DeiT), and Cross-attention Vision Transformer (CrossViT). Each model has 20-30M trainable parameters, which is relatively small, and each has a different architecture: for example, CrossViT processes the image with multi-scale patches, while PiT introduces convolution layers and pooling techniques into the vanilla ViT. Model performance is evaluated on the CK+48 and KDEF datasets, which are well known and widely used in the literature. All models exhibit performance similar to results reported in the literature. The PiT model, which includes both Convolutional Neural Network (CNN) and Transformer layers, achieved the best accuracy scores, 0.9513 and 0.9090 on the CK+48 and KDEF datasets, respectively, suggesting that CNN layers boost the performance of Transformer-based models and help them learn these datasets more efficiently. The Swin Transformer scored 0.9080, the worst accuracy on CK+48, and 0.8434, nearly the worst on KDEF. In terms of per-image processing time, the Swin Transformer and PiT are the slowest and fastest models, respectively, which also makes PiT suitable for real-time applications. Moreover, PiT reaches these accuracies in the fewest training epochs: 25 for CK+48 and 83 for KDEF. | en_US |
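The accuracy comparison reported in the abstract can be tabulated in a short sketch. This is only an illustration of the comparison, not code from the paper; it includes only the two models whose scores the abstract states explicitly (PiT and Swin), since the other models' exact numbers are not given here.

```python
# Accuracy scores explicitly reported in the abstract (per dataset).
# Scores for ViT, DeiT, and CrossViT are not stated in the abstract,
# so they are deliberately omitted rather than guessed.
reported = {
    "PiT":  {"CK+48": 0.9513, "KDEF": 0.9090},  # best on both datasets
    "Swin": {"CK+48": 0.9080, "KDEF": 0.8434},  # worst on CK+48, near-worst on KDEF
}

def best_model(scores, dataset):
    """Return the model name with the highest reported accuracy on a dataset."""
    return max(scores, key=lambda name: scores[name][dataset])

for ds in ("CK+48", "KDEF"):
    print(f"{ds}: best reported model is {best_model(reported, ds)}")
```

Run as-is, this prints PiT as the best reported model for both datasets, matching the abstract's conclusion.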
dc.identifier.doi | 10.17694/bajece.1486140 | |
dc.identifier.endpage | 223 | en_US |
dc.identifier.issn | 2147-284X | |
dc.identifier.issue | 3 | en_US |
dc.identifier.startpage | 214 | en_US |
dc.identifier.trdizinid | 1278353 | en_US |
dc.identifier.uri | https://doi.org/10.17694/bajece.1486140 | |
dc.identifier.uri | https://search.trdizin.gov.tr/tr/yayin/detay/1278353 | |
dc.identifier.uri | https://hdl.handle.net/11468/30186 | |
dc.identifier.volume | 12 | en_US |
dc.indekslendigikaynak | TR-Dizin | |
dc.language.iso | en | en_US |
dc.relation.ispartof | Balkan Journal of Electrical and Computer Engineering | en_US |
dc.relation.publicationcategory | Makale - Ulusal Hakemli Dergi - Kurum Öğretim Elemanı | en_US |
dc.rights | info:eu-repo/semantics/openAccess | en_US |
dc.snmz | KA_TR_20250222 | |
dc.subject | Classification | en_US |
dc.subject | Transformers | en_US |
dc.subject | ViT | en_US |
dc.subject | FER | en_US |
dc.title | Face Expression Recognition via transformer-based classification models | en_US |
dc.type | Article | en_US |