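Hermawan, Alfado Rafly (2025) Peningkatan Kualitas Anotasi Gejala Long COVID Pada Media Sosial X Berbasis RAG-KG [Improving the Quality of Long COVID Symptom Annotation on the Social Media Platform X Using RAG-KG]. Master's thesis, Institut Teknologi Sepuluh Nopember.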
Full text: 6026231041-Master_Thesis.pdf (Accepted Version, 3MB). Restricted to repository staff only.
Abstract
Long-COVID symptoms vary widely and in complex ways, which often leads to suboptimal management of the condition. These symptoms can persist for a long time and are often not directly linked to the COVID-19 infection itself. Because some individuals with mild symptoms do not require hospitalization, Long-COVID symptoms frequently go unrecorded in official medical records. Social media text has become an important source of information about emerging health conditions because of its broad and diverse coverage, and making use of it can deepen understanding of such conditions. This makes natural language processing (NLP) techniques crucial. To address this challenge, this study proposes an innovative framework that combines an early detection model, a Large Language Model (LLM), and a Knowledge Graph (KG) to improve Named Entity Recognition (NER) annotation of Long-COVID data. The early detection model, built on the BERT architecture, achieved its best performance with an F1-score of 89.9% on the validation data and 76% on the test data, which allows it to filter for relevant information and minimize the impact of noise before further annotation. The annotation stage uses LLMs such as GPT-4o-mini and LLaMA 3.1 8B; the hierarchical phrase+ontology approach was optimal for GPT-4o-mini, and the hierarchical phrase approach was optimal for LLaMA 3.1 8B. In addition, embedding models trained on the medical domain were shown to improve the performance of Retrieval-Augmented Generation (RAG) in NER annotation. Overall, integrating the Knowledge Graph through the RAG approach improved the extraction of Long-COVID symptoms, raising the F1-score by 6.31% for GPT-4o-mini and 3.27% for LLaMA 3.1 8B. Although the increased selectivity of the models reduced misidentification (false positives), it also raised the percentage of texts that were not evaluated.
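The three-stage flow described in the abstract (relevance filter, knowledge-graph retrieval, LLM annotation prompt) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the toy triples, the function names, the keyword check standing in for the BERT filter, and the phrase match standing in for embedding-based retrieval. No model is called; the sketch only builds the prompt an LLM might receive.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy knowledge graph as (subject, relation, object) triples.
TOY_KG: List[Triple] = [
    ("brain fog", "is_a", "neurological symptom"),
    ("fatigue", "is_a", "general symptom"),
    ("shortness of breath", "is_a", "respiratory symptom"),
]


def is_relevant(text: str) -> bool:
    """Stand-in for the early detection model: a keyword check, not BERT."""
    lowered = text.lower()
    return "long covid" in lowered or "long-covid" in lowered


def retrieve_triples(text: str, kg: List[Triple]) -> List[Triple]:
    """Stand-in for embedding retrieval: keep triples whose subject phrase
    appears verbatim in the text."""
    lowered = text.lower()
    return [triple for triple in kg if triple[0] in lowered]


def build_prompt(text: str, triples: List[Triple]) -> str:
    """Assemble an annotation prompt that includes the retrieved KG context."""
    context = "\n".join(f"- {s} {r} {o}" for s, r, o in triples)
    return (
        "Knowledge graph context:\n"
        f"{context}\n\n"
        "Label every Long-COVID symptom mentioned in this post:\n"
        f"{text}"
    )


def annotate(text: str, kg: List[Triple]) -> Optional[str]:
    """Filter first; only relevant posts reach the retrieval and prompt steps."""
    if not is_relevant(text):
        return None  # filtered out as noise before annotation
    return build_prompt(text, retrieve_triples(text, kg))


if __name__ == "__main__":
    post = "Long COVID month six: brain fog and fatigue every day"
    print(annotate(post, TOY_KG))
```

In a real system the filter would be a fine-tuned classifier, retrieval would rank graph entries by embedding similarity, and the prompt would be sent to the LLM; the sketch only shows how the stages hand data to one another.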
Item Type: | Thesis (Masters) |
---|---|
Uncontrolled Keywords: | gejala long covid, large language model, knowledge graph, BERT, named entity recognition, long covid symptoms |
Subjects: | R Medicine > R Medicine (General) > R858 Deep Learning; R Medicine > RA Public aspects of medicine > RA644.C67 COVID-19 (Disease); R Medicine > RA Public aspects of medicine > RA653.5 Epidemics; T Technology > T Technology (General) > T57.5 Data Processing |
Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Information System > 59101-(S2) Master Thesis |
Depositing User: | Alfado Rafly Hermawan |
Date Deposited: | 24 Jun 2025 07:30 |
Last Modified: | 24 Jun 2025 07:30 |
URI: | http://repository.its.ac.id/id/eprint/119248 |