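Atribusi Penulis Pada Dangerous User Di Twitter Menggunakan Pendekatan Tanpa Pengawasan Dengan IndoBERTweet [Authorship Attribution of Dangerous Users on Twitter Using an Unsupervised Approach with IndoBERTweet]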

Rosyadi, Muhammad Hafidh (2025) Atribusi Penulis Pada Dangerous User Di Twitter Menggunakan Pendekatan Tanpa Pengawasan Dengan IndoBERTweet. Other thesis, Institut Teknologi Sepuluh Nopember.

Text: 5025211013-Undergraduate_Thesis.pdf - Accepted Version (8MB)
Restricted to Repository staff only

Abstract

The growing use of social media platforms such as Twitter has created new challenges in mitigating the spread of dangerous speech, which can deepen social polarization and motivate intergroup violence. This study proposes an unsupervised method for attributing the authorship of hate speech to dangerous users on Twitter. The experiment began with the collection and cleaning of 14,878 Indonesian-language tweets posted between January 2020 and May 2025 by 11 users previously classified as dangerous users. Further cleaning comprised lowercasing the text; removing links, symbols, special characters, HTML entities, and stopwords; and replacing hashtags with their constituent words. Feature extraction covered writing style (stylistic features) based on the occurrence of each user's linguistic characteristics, TF–IDF features, and embedding representations from IndoBERTweet. Clustering was then performed with four unsupervised algorithms, namely KMeans, DBSCAN, Hierarchical Clustering, and Gaussian Mixture Model (GMM), and evaluated with normalized mutual information (NMI) and the adjusted Rand index (ARI), with reported scores of NMI = 0.4445 and ARI = 0.3132. The experimental results show that the combination of further cleaning, IndoBERTweet embeddings, and the GMM algorithm gives the best performance. These findings underline the importance of modern contextual representations and tailored cleaning techniques for capturing the distinctive characteristics of a user's writing. Based on these results, the study recommends further development through style-aware fine-tuning of the IndoBERTweet model to achieve higher authorship attribution performance in the dynamic setting of social media platforms.
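
The cleaning steps and the two evaluation metrics named in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the tiny stopword set, the CamelCase rule for splitting hashtags, and the arithmetic-mean normalization used for NMI are all hypothetical choices made for the example.

```python
import html
import re
from collections import Counter
from math import comb, log

# Hypothetical, deliberately tiny Indonesian stopword set (illustration only).
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def _split_hashtag(match):
    """Replace '#TolakKekerasan' with 'Tolak Kekerasan' (assumed CamelCase rule)."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", match.group(1))


def clean_tweet(text):
    """Apply the cleaning steps listed in the abstract, in an assumed order."""
    text = html.unescape(text)                          # decode HTML entities
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"#(\w+)", _split_hashtag, text)      # hashtags -> words
    text = text.lower()                                 # normalize to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # symbols, special chars
    return " ".join(t for t in text.split() if t not in STOPWORDS)


def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from pair counts in the contingency table."""
    n = len(true_labels)
    if n < 2:
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(true_labels, pred_labels):
    """NMI with arithmetic-mean normalization (an assumption for this sketch)."""
    n = len(true_labels)
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    mutual_info = sum(
        c / n * log(c * n / (count_true[i] * count_pred[j]))
        for (i, j), c in Counter(zip(true_labels, pred_labels)).items()
    )
    entropy_true = -sum(c / n * log(c / n) for c in count_true.values())
    entropy_pred = -sum(c / n * log(c / n) for c in count_pred.values())
    mean_entropy = (entropy_true + entropy_pred) / 2
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


if __name__ == "__main__":
    print(clean_tweet("Ini &amp; itu #TolakKekerasan https://t.co/abc Yang PENTING!!!"))
    authors = [0, 0, 1, 1]    # ground-truth author per tweet
    clusters = [1, 1, 0, 0]   # cluster labels; the label names themselves do not matter
    print(adjusted_rand_index(authors, clusters), normalized_mutual_info(authors, clusters))
```

Both metrics compare cluster assignments with the known author of each tweet and ignore how clusters are numbered, which is why they suit an unsupervised setup where labels are used only for evaluation.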

Item Type: Thesis (Other)
Uncontrolled Keywords: Dangerous Speech, Unsupervised Learning, IndoBERTweet, NMI, ARI
Subjects: P Language and Literature > P Philology. Linguistics
Q Science > QA Mathematics > QA278.55 Cluster analysis
Q Science > QA Mathematics > QA278.5 Principal components analysis. Factor analysis. Correspondence analysis (Statistics)
Q Science > QA Mathematics > QA336 Artificial Intelligence
Q Science > QA Mathematics > QA76.9.A25 Computer security. Digital forensic. Data encryption (Computer science)
Q Science > QA Mathematics > QA76.9.D343 Data mining. Querying (Computer science)
Q Science > QA Mathematics > QA9.58 Algorithms
T Technology > TA Engineering (General). Civil engineering (General) > TA1637 Image processing--Digital techniques. Image analysis--Data processing.
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Muhammad Hafidh Rosyadi
Date Deposited: 28 Jul 2025 09:24
Last Modified: 29 Jul 2025 02:00
URI: http://repository.its.ac.id/id/eprint/122257
