Klasifikasi Multi-label Dangerous User Berdasarkan Fitur Struktural Twitter

Parwata, Anak Agung Yatestha (2024) Klasifikasi Multi-label Dangerous User Berdasarkan Fitur Struktural Twitter. Other thesis, Institut Teknologi Sepuluh Nopember.

5025201234-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only until 1 April 2026.


Abstract

Dangerous speech is speech that can increase the risk that a person or group will commit crimes against another person or group. It has seven aspects: social context, historical context, dehumanization, accusation, attacks on women and children, group loyalty, and threats against a group. Dangerous speech is widespread on social media such as Twitter. A dangerous user is a user whose tweets contain a higher percentage of dangerous speech and abusive language than of neutral speech and hate speech.
This study builds a model that performs multi-label classification of dangerous users from structural features extracted from a user graph formed by the similarity of Twitter posts. The structural features are degree centrality, eigenvector centrality, betweenness centrality, closeness centrality, and clustering coefficient. The work proceeds in three stages: tweet data processing, structural feature extraction from the tweet-similarity network, and classification of users with multi-label dangerous speech topics. Tweet data processing consists of tweet preprocessing, topic modeling with LDA, and pseudo-labeling of dangerous speech aspects with IndoBERTweet, Naïve Bayes, and Logistic Regression models. For feature extraction, a graph is built in which the nodes are Twitter users and the edges carry the cosine similarity of their posts; nodes and edges are then selected by thresholding, and the features are computed with the networkx library. Users are finally classified with Logistic Regression, Naïve Bayes, K-nearest Neighbors, Support Vector Classifier, Decision Tree, and Random Forest.
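
The graph-based feature extraction described above can be illustrated with a minimal sketch. It is not the thesis code: the per-user vectors, the user names, and the helper functions `cosine_similarity`, `build_similarity_graph`, and `structural_features` are hypothetical, and computing the centralities on the unweighted graph is an assumption, since the abstract does not say whether edge weights were used. Only the networkx calls and the 0.149 cutoff come from the abstract.

```python
import math
from itertools import combinations

import networkx as nx

# Hypothetical toy data: one term-weight vector per user (not from the thesis).
user_vectors = {
    "user_a": [1.0, 1.0, 0.0, 0.0],
    "user_b": [1.0, 0.8, 0.1, 0.0],
    "user_c": [0.9, 1.0, 0.0, 0.2],
    "user_d": [0.0, 0.1, 1.0, 1.0],
    "user_e": [0.2, 0.0, 1.0, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_graph(vectors, threshold):
    """Users become nodes; an edge is kept only if similarity exceeds the threshold."""
    graph = nx.Graph()
    graph.add_nodes_from(vectors)
    for u, v in combinations(vectors, 2):
        sim = cosine_similarity(vectors[u], vectors[v])
        if sim > threshold:
            graph.add_edge(u, v, weight=sim)
    return graph


def structural_features(graph):
    """Compute the five structural features named in the abstract for every node.

    Assumption: all features are computed on the unweighted graph.
    """
    degree = nx.degree_centrality(graph)
    eigenvector = nx.eigenvector_centrality(graph, max_iter=1000)
    betweenness = nx.betweenness_centrality(graph)
    closeness = nx.closeness_centrality(graph)
    clustering = nx.clustering(graph)
    return {
        node: {
            "degree_centrality": degree[node],
            "eigenvector_centrality": eigenvector[node],
            "betweenness_centrality": betweenness[node],
            "closeness_centrality": closeness[node],
            "clustering_coefficient": clustering[node],
        }
        for node in graph
    }


# 0.149 is the cosine-similarity cutoff reported in the abstract.
graph = build_similarity_graph(user_vectors, threshold=0.149)
features = structural_features(graph)
```

Each row of `features` could then serve as the input vector for one user in a multi-label classifier. The average-retweet filter on nodes is omitted here because the toy data carries no retweet counts.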
The best pseudo-labels for dangerous speech aspects came from a combined model of IndoBERTweet, Naïve Bayes, and Logistic Regression, with accuracy above 80%. In topic modeling, the coherence score alone favored 6 topics, but sub-topics still formed at that setting, so 19 topics were used, based on the coherence score and the absence of new sub-topics. The best thresholding method requires each user to have an average retweet count above 1 and a cosine similarity value above 0.149. The best dangerous-user classifier was Naïve Bayes using degree centrality, closeness centrality, and eigenvector centrality features, with a Hamming loss of 0.555.
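
To make the reported metric concrete, here is a minimal sketch of Hamming loss under its standard definition: the fraction of individual label slots, over all users and all labels, that are predicted incorrectly. The function name and the toy label matrices are illustrative, and it is an assumption that the thesis used this standard definition. Under it, a value of 0.555 would mean that roughly 55.5% of user-label pairs were misclassified.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots where the prediction differs from the truth.

    y_true and y_pred are lists of equal-length binary label rows,
    one row per user and one column per label.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("every row pair must have the same number of labels")
        for true_label, pred_label in zip(true_row, pred_row):
            total += 1
            wrong += true_label != pred_label
    if total == 0:
        raise ValueError("at least one label is required")
    return wrong / total


# Hypothetical example: 2 users, 3 labels, 2 of the 6 slots are wrong.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
```

Lower is better: a perfect classifier scores 0, and a classifier that gets every label wrong scores 1.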

Item Type: Thesis (Other)
Uncontrolled Keywords: Dangerous speech, Dangerous user, Multi-label classification, Topic Modelling, Multi-label
Subjects: T Technology > T Technology (General) > T385 Visualization--Technique
T Technology > T Technology (General) > T57.5 Data Processing
T Technology > T Technology (General) > T57.8 Nonlinear programming. Support vector machine. Wavelets. Hidden Markov models.
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Anak Agung Yatestha Parwata
Date Deposited: 09 Feb 2024 14:19
Last Modified: 09 Feb 2024 14:19
URI: http://repository.its.ac.id/id/eprint/106584
