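Apalwan, Haliza Nur Kamila (2026) Pengembangan dan Analisis Evaluatif Dataset Sintetik untuk Sistem AI Kesehatan Mental EMILIA [Development and Evaluative Analysis of a Synthetic Dataset for the EMILIA Mental Health AI System]. Project Report. [s.n], [s.l.]. (Unpublished)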
Text: 5025231038-Project_Report.pdf (Accepted Version, 2MB). Restricted to repository staff only.
Abstract
Limited access to affordable, easily reachable psychological services is one of the factors behind mental health problems among Indonesian adolescents. To address this limitation, this study uses a Large Language Model (LLM) to generate a synthetic dataset of question-answer pairs, guided by psychology documents. The dataset is evaluated with Natural Language Inference (NLI) to measure semantic consistency, hallucination level, and answer relevance. In addition, topic modeling with BERTopic is used to identify the main psychological themes in the dataset, with trials comparing two embedding models, IndoBERT and SBERT. The results show that the LLM can produce data with high semantic consistency: an entailment probability of 0.85, an average hallucination score of 3.39, and an answer relevance of 3.19. With BERTopic, the SBERT embedding model outperformed IndoBERT, reaching a coherence of 0.55 and a diversity of 0.58. The study concludes that LLMs are effective for generating synthetic datasets in the mental health domain, and that SBERT embeddings are the better choice with BERTopic for exploring thematic structure in unstructured question-answer data.
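The abstract reports an entailment probability and a topic diversity score without stating how either was computed. The sketch below shows one common way such figures are obtained, under stated assumptions: the entailment probability is taken as the softmax probability of the entailment class averaged over all question-answer pairs, and diversity is taken as the fraction of unique words among the top-k words of all topics. The function names, the `entailment_index` default, and the toy data are hypothetical and are not taken from the report; the label order in particular varies between NLI models.

```python
import math


def entailment_probability(logits, entailment_index=0):
    """Softmax over one NLI logit vector; return the entailment class probability.

    entailment_index is an assumption: check the label order of the NLI model used.
    """
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    return exps[entailment_index] / sum(exps)


def mean_entailment(all_logits, entailment_index=0):
    """Average entailment probability over a list of logit vectors."""
    if not all_logits:
        raise ValueError("all_logits must not be empty")
    probs = [entailment_probability(row, entailment_index) for row in all_logits]
    return sum(probs) / len(probs)


def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic."""
    words = [word for topic in topics for word in topic[:top_k]]
    if not words:
        raise ValueError("topics must contain at least one word")
    return len(set(words)) / len(words)


# Hypothetical toy inputs, not data from the report.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_topics = [["cemas", "stres", "tidur"], ["stres", "sekolah", "teman"]]

print(mean_entailment(toy_logits))
print(topic_diversity(toy_topics, top_k=3))
```

A diversity close to 1 means the topics share few top words, while a value close to 0 means the topics largely repeat the same vocabulary. The report's own evaluation pipeline may define both metrics differently.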
| Item Type: | Monograph (Project Report) |
|---|---|
| Uncontrolled Keywords: | Mental Health, Synthetic Dataset, Large Language Model, Natural Language Inference, BERTopic |
| Subjects: | T Technology > T Technology (General) > T57.5 Data Processing |
| Divisions: | Faculty of Industrial Technology > Informatics Engineering > 55201-(S1) Undergraduate Thesis |
| Depositing User: | Haliza Nur Kamila Apalwan |
| Date Deposited: | 05 Jan 2026 08:32 |
| Last Modified: | 05 Jan 2026 08:32 |
| URI: | http://repository.its.ac.id/id/eprint/129270 |
