Ekstraksi Gejala Dan Entitas Bernama Pada Klasifikasi Tweet Demam Berdarah Dengue Menggunakan Transformer IndoBERT Multi-Layer Perceptron

Azzahra, Naura Jasmine (2025) Ekstraksi Gejala Dan Entitas Bernama Pada Klasifikasi Tweet Demam Berdarah Dengue Menggunakan Transformer IndoBERT Multi-Layer Perceptron. Other thesis, Institut Teknologi Sepuluh Nopember.

5026211005-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only


Abstract

Demam Berdarah Dengue (DBD) merupakan penyakit endemik di Indonesia yang dapat dimonitor melalui media sosial seperti Twitter. Penelitian ini mengembangkan dan mengevaluasi model klasifikasi serta ekstraksi entitas bernama (Named Entity Recognition/NER) dari tweet terkait DBD menggunakan arsitektur IndoBERT dan Multi-Layer Perceptron (MLP). Dataset utama terdiri atas 27.465 tweet berbahasa Indonesia yang berasal dari dua sumber: 6.899 tweet diperoleh dari penelitian Anggraeni et al. (2024) mencakup Januari–Agustus 2023, sisanya dikumpulkan melalui scraping menggunakan Selenium Twitter Scraper berdasarkan kata kunci spesifik sepanjang September 2023 hingga Juni 2025. Setelah pembersihan, diperoleh 11.689 tweet bersih.
Sebanyak 2.250 tweet digunakan untuk membangun model klasifikasi tiga kelas: tidak relevan, relevan-gejala DBD, dan relevan-informasi atau diskusi DBD. Dari total tersebut, 743 tweet divalidasi oleh lima dokter berbagai spesialisasi (anak, THT-KL, kulit dan kelamin, radiologi, dan umum). Model dilatih menggunakan dua skema data, yaitu tervalidasi dan campuran (tervalidasi + anotasi peneliti). Masing-masing arsitektur model (MLP, CNN, LSTM, CNN-LSTM) diuji melalui 16 eksperimen hyperparameter untuk memperoleh konfigurasi terbaik, sebelum dibandingkan. Hasil evaluasi menunjukkan bahwa model CNN-LSTM dengan konfigurasi terbaik memiliki performa tertinggi di kedua skema data. Pada dataset tervalidasi, model mencapai F1-score 0,913; pada dataset campuran, F1-score meningkat menjadi 0,926, menunjukkan generalisasi dan stabilitas tinggi.
Model NER dikembangkan dengan pendekatan fine-tuning pada IndoBERT sebagai backbone. Proses ini memanfaatkan model pre-trained yang disesuaikan dengan domain medis dan gaya bahasa media sosial. NER dirancang untuk mengekstraksi entitas dari tweet relevan-gejala DBD (label 1), menggunakan skema BIOE untuk gejala (SYM) dan pendekatan single entity untuk waktu (TIME), tanggal (DATE), usia (AGE), jenis kelamin (GEN), dan lokasi (LOC). Sebanyak 556 tweet dilabeli secara manual dan dievaluasi bersama 217 tweet validasi terpisah.
Model IndoBERT Base dengan arsitektur 3-layer MLP menunjukkan hasil terbaik: akurasi 98,79%, macro F1-score 0,8345, dan weighted F1-score 0,9883. Performa tiap entitas meliputi: O (F1 0,9944), B-SYM (0,9027), I-SYM (0,8918), I-TIME (0,7847), I-DATE (0,8889), I-AGE (0,5714), dan I-LOC (0,8077). Sebagai pembanding, dua varian IndoBERT Large juga diuji dengan fine-tuning berbeda. Model dengan 4-layer MLP dan focal loss menghasilkan weighted F1-score 0,8636, sedangkan model 2-layer MLP dengan cross-entropy loss mencapai 0,8911. IndoBERT Base MLP tetap unggul dalam stabilitas (10 epoch) dan akurasi keseluruhan.
Hasil ekstraksi menunjukkan bahwa demam, sakit kepala, dan penurunan trombosit merupakan gejala paling dominan. Analisis durasi gejala menggunakan pendekatan Miao et al. menunjukkan pola temporal konsisten dengan literatur klinis: fase akut 0,5–1,5 minggu dan fase pemulihan 1–4 minggu. Penelitian ini memberi kontribusi metodologis dalam pengembangan model IndoBERT yang di-fine-tune untuk ekstraksi informasi kesehatan dari media sosial, serta mendemonstrasikan potensi NLP untuk pemantauan DBD berbasis entitas klinis secara real-time.

======================================================================================================================================

Dengue Hemorrhagic Fever (DHF) is an endemic disease in Indonesia that can be monitored through social media platforms such as Twitter. This study develops and evaluates classification and named entity recognition (NER) models for DHF-related tweets using the IndoBERT architecture and a Multi-Layer Perceptron (MLP). The primary dataset consists of 27,465 Indonesian-language tweets from two sources: 6,899 tweets were obtained from a previous study by Anggraeni et al. (2024) covering the period from January to August 2023, while the rest were collected through scraping using Selenium Twitter Scraper with specific keywords from September 2023 to June 2025. After preprocessing, 11,689 clean tweets were obtained.
A total of 2,250 tweets were used to build a three-class classification model: non-relevant, DHF symptom-relevant, and DHF information/discussion-relevant. Of these, 743 tweets were validated by five medical doctors from various specializations (pediatrics, ENT, dermatology-venereology, radiology, and general practice). The classification model was trained using two data schemes: validated and mixed (validated + researcher-annotated). Each model architecture (MLP, CNN, LSTM, CNN-LSTM) was tested across 16 hyperparameter tuning experiments to obtain the optimal configuration before comparison. Evaluation results show that the CNN-LSTM model with optimal configuration achieved the highest performance in both data schemes. On the validated dataset, the model achieved an F1-score of 0.913, and on the mixed dataset, the F1-score improved to 0.926, indicating strong generalization and stability.
The NER model was developed by fine-tuning IndoBERT as the backbone. This process leveraged a pre-trained model that was adapted to the medical domain and to the informal language style of social media. The NER model was designed to extract entities from symptom-relevant tweets (label 1), using the BIOE annotation scheme for symptom entities (SYM) and a single-entity approach for the other categories: time (TIME), date (DATE), age (AGE), gender (GEN), and location (LOC). A total of 556 tweets were manually labeled and evaluated alongside a separate validation set of 217 tweets.
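The tagging scheme described above can be illustrated with a minimal Python sketch. The abstract does not spell out the exact tag conventions, so several points here are assumptions: the span format `(start, end_exclusive, label)`, the helper name `tag_tokens`, the choice to tag a one-token symptom as `B-SYM`, and the reading of "single-entity" as one `I-` tag for every token of a non-symptom entity. The example tweet is invented for illustration and is not taken from the dataset.

```python
from typing import List, Tuple

# Hypothetical span format: (start index, end index exclusive, entity label).
Span = Tuple[int, int, str]


def tag_tokens(n_tokens: int, spans: List[Span]) -> List[str]:
    """Assign one tag per token; tokens outside every span get "O"."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if label == "SYM":
            # BIOE for symptoms: B- on the first token, E- on the last,
            # I- on anything in between.
            for i in range(start, end):
                tags[i] = "I-SYM"
            tags[end - 1] = "E-SYM"
            # Assumption: B- wins for a one-token symptom.
            tags[start] = "B-SYM"
        else:
            # Assumed single-entity convention: one I- tag per token.
            for i in range(start, end):
                tags[i] = f"I-{label}"
    return tags


# Invented example: "already 3 days of very high fever in Surabaya"
tokens = ["sudah", "3", "hari", "demam", "tinggi", "sekali", "di", "Surabaya"]
spans = [(1, 3, "TIME"), (3, 6, "SYM"), (7, 8, "LOC")]
tags = tag_tokens(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In practice the tags would also need to be aligned to IndoBERT's subword tokens before fine-tuning; that step is omitted here.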
The IndoBERT Base model with a 3-layer MLP architecture achieved the best results: 98.79% accuracy, macro F1-score of 0.8345, and weighted F1-score of 0.9883. Entity-level performance includes: O (F1 0.9944), B-SYM (0.9027), I-SYM (0.8918), I-TIME (0.7847), I-DATE (0.8889), I-AGE (0.5714), and I-LOC (0.8077). For comparison, two variants of IndoBERT Large were also tested using different fine-tuning strategies. The model with a 4-layer MLP and focal loss yielded a weighted F1-score of 0.8636, while the model with a 2-layer MLP and cross-entropy loss achieved 0.8911. IndoBERT Base MLP remained superior in convergence stability (10 epochs) and overall accuracy.
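The gap between the macro and weighted F1-scores can be checked with a short calculation. The sketch below takes the unweighted mean of the seven per-label F1-scores listed above. That this mean matches the reported macro F1-score suggests, though the abstract does not state it, that the macro average was taken over exactly these seven labels. The variable names are illustrative.

```python
# Per-label F1-scores as reported in the abstract for IndoBERT Base + 3-layer MLP.
per_label_f1 = {
    "O": 0.9944,
    "B-SYM": 0.9027,
    "I-SYM": 0.8918,
    "I-TIME": 0.7847,
    "I-DATE": 0.8889,
    "I-AGE": 0.5714,
    "I-LOC": 0.8077,
}

# Macro F1: every label counts equally, regardless of how many tokens carry it.
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

print(round(macro_f1, 4))  # 0.8345, matching the reported macro F1-score
```

A weighted F1-score instead weights each label by its token count. Because most tokens in a tweet are tagged O, the weighted score (0.9883) stays close to the O score (0.9944), while the macro score is pulled down by weaker labels such as I-AGE.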
The extraction results indicate that fever, headache, and platelet decline were the most frequently mentioned symptoms. Further analysis of symptom duration using the approach by Miao et al. revealed temporal patterns consistent with clinical literature: the acute phase lasted between 0.5 and 1.5 weeks, and the recovery phase ranged from 1 to 4 weeks. This study contributes methodologically to the development of fine-tuned IndoBERT models for extracting health information from Indonesian-language social media and demonstrates the potential of NLP in real-time DHF monitoring based on clinical entity extraction.

Item Type: Thesis (Other)
Uncontrolled Keywords: Twitter, Demam Berdarah Dengue, Klasifikasi Teks, Named Entity Recognition, Ekstraksi Gejala, Transformer, Pengenalan Entitas Bernama, Symptoms Extraction, Natural Language Processing
Subjects: Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science)
R Medicine > R Medicine (General) > R858 Deep Learning
R Medicine > RC Internal medicine > RC137 Dengue
T Technology > T Technology (General)
T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Information System > 57201-(S1) Undergraduate Thesis
Depositing User: Naura Jasmine Azzahra
Date Deposited: 26 Jul 2025 07:46
Last Modified: 26 Jul 2025 07:46
URI: http://repository.its.ac.id/id/eprint/122264
