Klasifikasi Multilabel Ujaran Berbahaya pada Data Twitter Menggunakan Model Transformer

Sari, Nadya Permata (2024) Klasifikasi Multilabel Ujaran Berbahaya pada Data Twitter Menggunakan Model Transformer. Other thesis, Institut Teknologi Sepuluh Nopember.

Text: 5025201015-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only until 1 October 2026.

Abstract

Media sosial, khususnya Twitter, telah menjadi salah satu platform komunikasi digital terbesar dengan jutaan pengguna aktif di seluruh dunia. Namun, perkembangan pesat media sosial juga membawa tantangan baru, termasuk penyebaran konten berbahaya seperti ujaran berbahaya, yang memiliki potensi untuk memicu kekerasan dan merusak hubungan sosial. Aspek-aspek yang terkandung dalam ujaran berbahaya adalah sosial, historis, dehumanisasi, tuduhan, serangan terhadap wanita dan anak perempuan, pertanyaan tentang loyalitas kelompok, dan ancaman terhadap suatu kelompok. Penelitian Tugas Akhir yang dilakukan bertujuan untuk mengembangkan sistem klasifikasi multilabel dengan menggunakan pendekatan Transformer guna mengidentifikasi ujaran berbahaya dalam berbagai aspek pada data teks Twitter berbahasa Inggris.
Klasifikasi multilabel dilakukan menggunakan beberapa model Transformer, seperti BERT, GPT2, XLNet, RoBERTa, DistilBERT, ELECTRA, dan DeBERTa. Tahap pertama adalah pengambilan data teks Twitter berbahasa Inggris lalu dibersihkan dan diproses sebelum digunakan untuk pelatihan model. Selanjutnya, dilakukan pseudo labeling untuk mendapatkan label pada keseluruhan dataset yang belum memiliki label. Pembobotan nilai ujaran berbahaya dilakukan menggunakan Weighted Sum Model (WSM). Pengujian dilakukan dengan beberapa skenario seperti pengujian confidence level saat pseudo labeling, perbandingan performa model Transformer, hyperparameter tuning untuk model Transformer terbaik, perbandingan dengan metode supervised Machine Learning konvensional, dan klasifikasi pengguna penyebar ujaran berbahaya.
Data terlabel yang digunakan untuk klasifikasi multilabel adalah data hasil dari proses pseudo labeling dengan confidence level di atas 85%. Proses klasifikasi multilabel terbaik didapatkan saat menggunakan model Transformer BERT. Variasi Transformer BERT terbaik yang digunakan adalah BERT-Base dengan hyperparameter berupa hidden layers bernilai 12, batch size bernilai 16, learning rate bernilai 1e-5, dan jenis optimizer Adam yang memperoleh nilai hamming loss 0,031, subset accuracy 0,810, dan accuracy 0,965. Pengujian dengan perbandingan metode supervised Machine Learning konvensional juga menunjukkan model Transformer memberikan nilai yang lebih baik pada metrik evaluasi hamming loss, subset accuracy, dan accuracy.
===========================================================================================================
Social media, especially Twitter, has become one of the largest digital communication platforms, with millions of active users worldwide. However, its rapid growth also brings new challenges, including the spread of harmful content such as dangerous speech, which can incite violence and damage social relations. Dangerous speech can target particular social groups based on characteristics such as religion and gender. The aspects of dangerous speech considered are social, historical, dehumanization, accusation, attacks on women and girls, questioning of group loyalty, and threats to a group. This final-project (Tugas Akhir) research aims to develop a multilabel classification system that uses Transformer models to identify these aspects of dangerous speech in English-language Twitter text.
Multilabel classification is carried out using several Transformer models, including BERT, GPT-2, XLNet, RoBERTa, DistilBERT, ELECTRA, and DeBERTa. First, English-language Twitter text is collected, then cleaned and preprocessed before being used for model training. Next, pseudo labeling is applied to assign labels to the portion of the dataset that is still unlabeled. Dangerous speech scores are weighted using the Weighted Sum Model (WSM). Testing covers several scenarios: varying the confidence level used in pseudo labeling, comparing the performance of the Transformer models, hyperparameter tuning for the best Transformer model, comparison with conventional supervised machine learning methods, and classification of users who spread dangerous speech.
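The pseudo labeling and Weighted Sum Model (WSM) steps can be illustrated with a minimal Python sketch. Everything specific here is an assumption, since the abstract gives no details: the function names, the short aspect identifiers, the rule that a tweet is kept only when every per-label decision is confident (with confidence taken as max(p, 1 - p)), and the uniform weights are all hypothetical.

```python
from typing import Dict, List, Optional

# The seven aspects named in the abstract; these short identifiers are illustrative.
ASPECTS = [
    "social", "historical", "dehumanization", "accusation",
    "attack_on_women_and_girls", "group_loyalty", "threat_to_group",
]


def pseudo_label(probs: List[float], min_confidence: float = 0.85) -> Optional[List[int]]:
    """Return 0/1 labels if every per-label decision is confident enough, else None.

    Assumption: the confidence of one binary decision is max(p, 1 - p),
    and a tweet is kept only if all of its labels exceed the threshold.
    """
    labels = []
    for p in probs:
        if max(p, 1.0 - p) <= min_confidence:
            return None  # too uncertain: leave this tweet unlabeled
        labels.append(1 if p >= 0.5 else 0)
    return labels


def weighted_sum_score(labels: List[int], weights: Dict[str, float]) -> float:
    """Weighted Sum Model: score = sum over aspects of weight_i * label_i."""
    return sum(weights[name] * x for name, x in zip(ASPECTS, labels))


# Hypothetical uniform weights; the actual WSM weights are not given in the abstract.
UNIFORM_WEIGHTS = {name: 1.0 / len(ASPECTS) for name in ASPECTS}

# Made-up per-aspect probabilities from some already-trained classifier.
confident = pseudo_label([0.97, 0.02, 0.91, 0.05, 0.01, 0.03, 0.95])
uncertain = pseudo_label([0.97, 0.60, 0.91, 0.05, 0.01, 0.03, 0.95])
score = weighted_sum_score(confident, UNIFORM_WEIGHTS)
```

In this toy run the first tweet receives a full label vector, while the second is left unlabeled because one probability (0.60) is too close to the decision boundary. Raising or lowering `min_confidence` trades the amount of pseudo-labeled data against label noise, which is the kind of trade-off the confidence-level test scenario examines.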
The labeled data used for multilabel classification comes from the pseudo labeling process, keeping labels with a confidence level above 85%. The best multilabel classification results are obtained with BERT. The best-performing variant is BERT-Base with 12 hidden layers, a batch size of 16, a learning rate of 1e-5, and the Adam optimizer, which achieves a Hamming loss of 0.031, a subset accuracy of 0.810, and an accuracy of 0.965. The comparison with conventional supervised machine learning methods also shows that the Transformer model scores better on Hamming loss, subset accuracy, and accuracy.
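For readers unfamiliar with the reported metrics, the following pure-Python sketch computes them on a toy example. Hamming loss and subset accuracy follow their standard definitions. The abstract does not define its "accuracy" figure, so the example-based (Jaccard) accuracy shown here is only one common multilabel definition, included as an assumption. The function names and the toy data are illustrative and not taken from the thesis.

```python
from typing import List

Labels = List[List[int]]  # one row of 0/1 labels per tweet


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    total = sum(len(row) for row in y_true)
    return wrong / total


def subset_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of tweets whose entire label vector is predicted exactly."""
    exact = sum(row_t == row_p for row_t, row_p in zip(y_true, y_pred))
    return exact / len(y_true)


def example_based_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Mean per-tweet Jaccard overlap (assumed definition of 'accuracy')."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(row_t, row_p) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(row_t, row_p) if t == 1 or p == 1)
        # Two empty label sets agree perfectly, so score them as 1.0.
        scores.append(1.0 if union == 0 else inter / union)
    return sum(scores) / len(scores)


# Toy data with three labels per tweet (the thesis uses seven aspects).
y_true = [[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]]
y_pred = [[1, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1]]

hl = hamming_loss(y_true, y_pred)             # 2 wrong slots out of 12
sa = subset_accuracy(y_true, y_pred)          # 2 exact rows out of 4
acc = example_based_accuracy(y_true, y_pred)  # (1 + 1 + 0.5 + 0.5) / 4
```

Subset accuracy is the strictest of the three because a single wrong label makes the whole tweet count as incorrect, which is consistent with the reported subset accuracy (0.810) being lower than the reported accuracy (0.965).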

Item Type: Thesis (Other)
Uncontrolled Keywords: Klasifikasi Multilabel, Ujaran Berbahaya, Multilabel Classification, Dangerous Speech, Transformer, Twitter
Subjects: Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines.
Q Science > QA Mathematics > QA336 Artificial Intelligence
Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science)
T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Information Technology > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Nadya Permata Sari
Date Deposited: 31 Jul 2024 06:34
Last Modified: 31 Jul 2024 06:34
URI: http://repository.its.ac.id/id/eprint/110300
