Emotion Recognition from Human Speech Using the Convolutional Neural Network (CNN) Method

Maharani, Diaz (2024) Pengenalan Emosi Melalui Suara Manusia Menggunakan Metode Convolutional Neural Network (CNN). Other thesis, Institut Teknologi Sepuluh Nopember.

5003201165-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only

Abstract

Emotion plays an important role in effective communication. Speech Emotion Recognition (SER) is the task of automatically identifying emotions from speech. This study uses a Convolutional Neural Network (CNN) model and a transfer learning model, Residual Network 50 (ResNet50), to classify emotions in the RAVDESS audio dataset, covering seven emotions: calm, happy, sad, angry, fearful, disgusted, and surprised. After confirming that the emotion classes were balanced, Mel-Frequency Cepstral Coefficients (MFCC) were extracted from the audio data as the feature representation. The resulting MFCC values were rendered as images and used as model input. Using the best hyperparameter combination found through hyperparameter tuning, the CNN model with a learning rate of 0.00001 and dropout of 0.5 achieved a train accuracy of 0.815116, train loss of 0.489793, validation accuracy of 0.632558, and validation loss of 1.120799. The ResNet50 model with a learning rate of 0.0001 and dropout of 0.2 performed better, with a train accuracy of 0.998837, train loss of 0.037026, validation accuracy of 0.651163, and validation loss of 0.994415. The large difference between training and validation accuracy indicates that both models overfit. To investigate this overfitting, the three best-predicted emotions (calm, angry, and surprised) were selected and the models were retrained on them. After this class reduction, ResNet50 again outperformed the CNN model, with a train accuracy of 0.991848, train loss of 0.079528, validation accuracy of 0.891304, and validation loss of 0.319050. Although there are still indications of overfitting, ResNet50 is better at predicting new data. Reducing the number of classes increased test accuracy for both models by up to 20%.
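The overfitting claim in the abstract rests on the difference between training and validation accuracy. The sketch below recomputes that difference from the figures reported above. It is only an illustration and not code from the thesis: the names `REPORTED`, `generalization_gap`, and `flag_overfitting` are hypothetical, and the 0.10 threshold is an arbitrary assumption chosen for the example, not a criterion stated by the author.

```python
# Metrics copied from the abstract; keys are (model, number of emotion classes).
# The abstract does not report 3-class figures for the CNN, so none are listed.
REPORTED = {
    ("CNN", 7): {"train_acc": 0.815116, "val_acc": 0.632558},
    ("ResNet50", 7): {"train_acc": 0.998837, "val_acc": 0.651163},
    ("ResNet50", 3): {"train_acc": 0.991848, "val_acc": 0.891304},
}


def generalization_gap(metrics):
    """Return train accuracy minus validation accuracy."""
    return metrics["train_acc"] - metrics["val_acc"]


def flag_overfitting(metrics, threshold=0.10):
    """Flag a run whose gap exceeds the threshold.

    The 0.10 default is an assumption made for this sketch only.
    """
    return generalization_gap(metrics) > threshold


for (model, n_classes), metrics in REPORTED.items():
    gap = generalization_gap(metrics)
    flagged = flag_overfitting(metrics)
    print(f"{model} ({n_classes} classes): gap={gap:.3f}, flagged={flagged}")
```

With the reported figures, the gap is about 0.18 for the seven-class CNN, about 0.35 for the seven-class ResNet50, and about 0.10 for the three-class ResNet50, which matches the abstract's remark that signs of overfitting remain after class reduction, though they are much smaller.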

Item Type: Thesis (Other)
Uncontrolled Keywords: Convolutional Neural Network, MFCC, Speech emotion recognition, Transfer learning
Subjects: Q Science
Q Science > Q Science (General)
Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines.
Divisions: Faculty of Science and Data Analytics (SCIENTICS) > Statistics > 49201-(S1) Undergraduate Thesis
Depositing User: Diaz Maharani
Date Deposited: 09 Aug 2024 05:27
Last Modified: 09 Aug 2024 05:27
URI: http://repository.its.ac.id/id/eprint/115020