Speech Emotion Recognition Berbasis CNN dan LSTM dengan Mengaplikasikan Teknik Augmentasi Data [CNN- and LSTM-Based Speech Emotion Recognition Applying Data Augmentation Techniques]

Yafi, Bill Harit (2023) Speech Emotion Recognition Berbasis CNN dan LSTM dengan Mengaplikasikan Teknik Augmentasi Data. Other thesis, Institut Teknologi Sepuluh Nopember.

05111940000114-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only until 1 October 2025.

Abstract

This study applies deep learning to build a Speech Emotion Recognition (SER) system that identifies emotion from audio data. The system combines two widely used neural network architectures, the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). The proposed model is trained on publicly available audio datasets (CREMA-D, RAVDESS, and SAVEE), with Mel-Frequency Cepstral Coefficients (MFCCs) as the feature representation. Audio data augmentation techniques such as added Gaussian noise, time stretching, and pitch shifting are used to create new variations of the training data, improving the model's ability to recognize emotion across varied speech. The CNN-LSTM model is compared against well-known CNN-based baselines, Residual Network (ResNet) and the Visual Geometry Group network (VGGNet). The models are evaluated on accuracy and training time, with F1-score, precision, and recall used for a deeper analysis. The proposed model outperforms the other models in accuracy, F1-score, recall, and precision on the CREMA-D, RAVDESS, and SAVEE datasets, with an average accuracy of 74.9%, F1-score of 0.73, precision of 0.75, and recall of 0.73. After augmented data is added, performance improves to an average accuracy of 76.4%, F1-score of 0.75, precision of 0.77, and recall of 0.75. These results indicate that a CNN-LSTM model combined with data augmentation can be an effective choice for developing speech emotion detection systems.
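
The abstract reports accuracy, precision, recall, and F1-score but does not say how the per-class values are averaged. The following is a minimal sketch of how such figures could be computed for a multi-class emotion classifier, assuming macro averaging (an unweighted mean over classes). The function name `classification_metrics` and the emotion labels are illustrative assumptions, not taken from the thesis.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the abstract does not state
    which averaging scheme the thesis uses.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Per-class true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Illustrative labels only; these are not results from the thesis.
y_true = ["angry", "angry", "happy", "happy", "sad", "sad"]
y_pred = ["angry", "happy", "happy", "happy", "sad", "angry"]
metrics = classification_metrics(y_true, y_pred)
```

Under macro averaging every emotion class counts equally regardless of how many samples it has. A weighted average would give different figures on imbalanced datasets, so the reported values should be read with the thesis's own averaging scheme in mind.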

Item Type: Thesis (Other)
Uncontrolled Keywords: Convolutional Neural Network, Data Augmentation, Long Short-Term Memory, Speech Emotion Recognition
Subjects: Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines.
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Bill Harit Yafi
Date Deposited: 29 Nov 2023 01:48
Last Modified: 29 Nov 2023 01:48
URI: http://repository.its.ac.id/id/eprint/101269