BERT+ViT: Multimodal Human Emotion Recognition Using Text and Audio

Fortunela, Khaela (2024) BERT+ViT: Pengenalan Emosi Manusia Multimodal Menggunakan Teks dan Audio [BERT+ViT: Multimodal Human Emotion Recognition Using Text and Audio]. Other thesis, Institut Teknologi Sepuluh Nopember.

05111940000057-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only until 1 July 2026.

Abstract

Human emotion recognition has played an important role in the development of natural language processing (NLP) in recent years. This research proposes an emotion recognition method that combines text features and acoustic features using deep learning. The study uses the MELD dataset, which consists of video and text taken from an English-language sitcom. For the acoustic features, Mel-Frequency Cepstral Coefficients (MFCC) are extracted from the audio and used as input to a Vision Transformer (ViT) model. For the text features, BERT is used to capture the contextual meaning of the input text. Each of these transformer-based approaches yields a model capable of recognizing human emotions. Each model outputs a probability for every emotion label, and these probabilities are then used as input to a classification algorithm that fuses the two approaches. The evaluation compares the performance of the two individual models against a model trained on the combined probability values from both. The implemented fusion method, a Hoeffding Tree, achieves a weighted F1-score of 0.60 on the MELD dataset, higher than the BERT model (weighted F1-score 0.47) and the ViT model (weighted F1-score 0.30). This is a relative increase of 27.66% over the BERT model and twice the weighted F1-score of the ViT model.
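The fusion step and the reported comparison can be illustrated with a minimal sketch. This is not the thesis's code: the function names `fuse_probabilities`, `weighted_f1`, and `relative_gain` are hypothetical, and the probability vectors and labels in the usage lines are invented toy values. The sketch assumes the fusion input is simply the concatenation of the two models' per-label probability vectors, which the abstract does not state explicitly, and it leaves out the Hoeffding Tree classifier itself.

```python
from collections import Counter


def fuse_probabilities(text_probs, audio_probs):
    """Build one feature vector from two per-label probability vectors.

    Assumption: late fusion by plain concatenation; the abstract only says
    that both sets of probabilities are fed to the fusion classifier.
    """
    return list(text_probs) + list(audio_probs)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0
        score += f1 * count / total
    return score


def relative_gain(new_score, base_score):
    """Relative improvement of new_score over base_score, as a fraction."""
    return (new_score - base_score) / base_score


# Toy usage with invented values (not taken from the thesis).
features = fuse_probabilities([0.7, 0.3], [0.4, 0.6])
toy_f1 = weighted_f1(["joy", "joy", "neutral", "neutral"],
                     ["joy", "neutral", "neutral", "neutral"])

# Arithmetic behind the figures reported in the abstract.
gain_over_bert = relative_gain(0.60, 0.47)  # about 0.2766, i.e. 27.66%
ratio_to_vit = 0.60 / 0.30                  # 2.0, i.e. twice the ViT score
```

In a full pipeline, a vector like `features` would be passed, together with the true emotion label, to a Hoeffding Tree classifier for training; which library the thesis used for that step is not stated in the abstract.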

Item Type: Thesis (Other)
Uncontrolled Keywords: BERT, Classification Algorithm, MFCC, Natural Language Processing (NLP), Transformer, Vision Transformer
Subjects: T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Khaela Fortunela
Date Deposited: 19 Feb 2024 03:31
Last Modified: 19 Feb 2024 03:31
URI: http://repository.its.ac.id/id/eprint/107475
