Parallelizing CNN and Transformer Encoders for Audio Based Emotion Recognition in English Language

Adidarma, Adam Satria (2023) Parallelizing CNN and Transformer Encoders for Audio Based Emotion Recognition in English Language. Other thesis, Institut Teknologi Sepuluh Nopember.

05111942000001-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only until 1 October 2023.



Artificial intelligence (AI) has had a significant impact on various industries and sectors of society, with the adoption of AI growing 37% from 2018 to 2019, according to a Gartner report. Speech emotion recognition (SER) is a subfield of AI that focuses on recognizing the emotional aspects of speech, separate from the semantic content. Emotions play a crucial role in human communication and have been the subject of increasing research in recent years. While current studies on emotion detection often focus on visual modalities, such as facial expressions, emotion is a multimodal concept that requires the study of visual, tactile, vocal, and physiological indicators. SER can be applied in various contexts, including call centers, education, marketing, psychology, and healthcare.
This study proposes an approach to implementing an SER system using a parallelized Convolutional Neural Network (CNN) model with a Transformer Encoder Block, which classifies a range of emotions: anger, disgust, fear, happiness, neutral, and sadness. The performance of this system is evaluated on publicly available English audio emotion datasets: CREMA-D, RAVDESS, and SAVEE. Mel-Frequency Cepstral Coefficients (MFCC) were used to extract features from the audio. The CNN model with Transformer Encoder Block is compared against several machine learning architectures, including the original LeNet CNN, a Convolutional Recurrent Neural Network (CRNN), and a Support Vector Machine (SVM), to determine the most effective approach for SER. The performance of each model varies significantly depending on the dataset used. Among the models compared, the Transformer Encoder and CNN model showed the best overall performance across all datasets, with an average accuracy of 74%. The Transformer Encoder Block is effective at processing long sequences of data, especially in natural language processing, while CNNs excel at capturing local patterns and relationships in data, making them well suited to analyzing the spectrogram-like representations generated from audio signals.
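The parallel design described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the thesis code: the layer sizes, number of MFCC coefficients, and pooling choices are assumptions made for the example, and the two branches (a CNN over the MFCC matrix and a Transformer encoder over its time frames) are simply concatenated before a linear classifier over the six emotion classes.

```python
import torch
import torch.nn as nn

class ParallelCNNTransformer(nn.Module):
    """Hypothetical sketch: CNN and Transformer encoder branches run in
    parallel on the same MFCC input; their features are concatenated."""

    def __init__(self, n_mfcc=40, n_classes=6, d_model=64):
        super().__init__()
        # CNN branch: captures local time-frequency patterns in the MFCC matrix
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),                       # -> 32 * 4 * 4 = 512 features
        )
        # Transformer branch: each time frame becomes a token of n_mfcc dims,
        # projected to d_model before the encoder
        self.proj = nn.Linear(n_mfcc, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Classifier over the concatenated branch outputs
        self.fc = nn.Linear(512 + d_model, n_classes)

    def forward(self, mfcc):                      # mfcc: (batch, n_mfcc, frames)
        cnn_feat = self.cnn(mfcc.unsqueeze(1))    # add a channel dimension
        tokens = self.proj(mfcc.transpose(1, 2))  # (batch, frames, d_model)
        trans_feat = self.encoder(tokens).mean(dim=1)  # average-pool over time
        return self.fc(torch.cat([cnn_feat, trans_feat], dim=1))

model = ParallelCNNTransformer()
logits = model(torch.randn(2, 40, 100))  # dummy batch of 2 MFCC matrices
print(logits.shape)                      # torch.Size([2, 6])
```

Averaging the Transformer tokens over time is one common way to obtain a fixed-size utterance embedding; the actual thesis may use a different pooling or fusion strategy.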

Item Type: Thesis (Other)
Uncontrolled Keywords: Artificial Intelligence, Convolutional Neural Network, Emotion Detection, Speech Emotion Recognition, Transformer Encoder
Subjects: Q Science > Q Science (General) > Q325.5 Machine learning.
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Adam Satria Adidarma
Date Deposited: 19 Jul 2023 08:11
Last Modified: 19 Jul 2023 08:11
