Transformer-Based Multi-Label Classification of International Classification of Diseases Codes for Diabetes and Generative Diagnosis Using Fine-Tuning of a Large Language Model

Izzaddien, Yusril Falih (2026) Klasifikasi Multi-Label Kode International Classification of Diseases pada Penyakit Diabetes Berbasis Transformer dan Diagnosis Generatif Menggunakan Finetunning Large Language Model. Masters thesis, Institut Teknologi Sepuluh Nopember.

6025241026-Master_Thesis.pdf - Accepted Version
Restricted to Repository staff only


Abstract

Automatic classification of International Classification of Diseases (ICD-10) codes for diabetes from Electronic Health Records (EHR) is challenging because clinical documents are long, medical terminology is complex, and the multi-label distribution is imbalanced. In addition, clinical diagnosis prediction with large language models (LLMs) is prone to hallucination when it is not supported by structured information. This study proposes a two-stage pipeline that integrates ICD code classification with guided clinical diagnosis generation. The study uses the MIMIC-III dataset, comprising 10,000 EHR records focused on diabetes and covering 25 ICD-10 codes in the E10–E13 range. In the first stage, raw EHRs are summarized using Chain-of-Thought (CoT) into a 512-token representation and classified using ClinicalBERT with a global adaptive threshold scheme. In the second stage, the classification results are used as a structured prompt for clinical diagnosis generation with a fine-tuned LLaMA 3 8B, with context enriched by CoT summaries, clinical terminology obtained through retrieval-augmented generation, and TF-IDF keywords. Experimental results show that ClinicalBERT with CoT improves F1-micro by 2.67% and F1-macro by 6.54% compared to the baseline. In the diagnosis generation task, the proposed approach improves ClinicalBERTScore by 4.30% and semantic similarity by 4.12%, reaching a ClinicalBERTScore (F1) of 0.777. The approach produces accurate and efficient clinical diagnoses with a model that has relatively few parameters.
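
The abstract does not define the global adaptive threshold scheme. As a minimal sketch of one plausible reading, the Python below assumes a single threshold shared by all labels, chosen on validation data to maximize F1-micro, and then reports F1-micro and F1-macro. The function names, candidate grid, and toy probabilities are illustrative assumptions and are not taken from the thesis.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def f1_scores(y_true: Matrix, y_pred: Matrix) -> Tuple[float, float]:
    """Return (F1-micro, F1-macro) for binary multi-label matrices."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro pools the counts over labels; macro averages per-label F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels
    return micro, macro


def binarize(probs: Matrix, threshold: float) -> List[List[int]]:
    """Apply one threshold to every label probability."""
    return [[int(p >= threshold) for p in row] for row in probs]


def pick_global_threshold(y_true: Matrix, probs: Matrix,
                          candidates: Sequence[float]) -> float:
    """Assumed scheme: keep the candidate with the highest F1-micro."""
    best_t, best_micro = candidates[0], -1.0
    for t in candidates:
        micro, _ = f1_scores(y_true, binarize(probs, t))
        if micro > best_micro:
            best_t, best_micro = t, micro
    return best_t


# Toy validation set: 4 records, 3 hypothetical labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
probs = [[0.90, 0.15, 0.10],
         [0.32, 0.60, 0.15],
         [0.70, 0.45, 0.10],
         [0.25, 0.10, 0.35]]
candidates = [i / 10 for i in range(1, 10)]

best_t = pick_global_threshold(y_true, probs, candidates)
micro, macro = f1_scores(y_true, binarize(probs, best_t))
print(f"threshold={best_t:.1f} F1-micro={micro:.3f} F1-macro={macro:.3f}")
```

In this toy case the rare third label is only recovered when the threshold drops below 0.35, which illustrates why a tuned threshold can matter for F1-macro on imbalanced label sets. A per-label threshold would be a different scheme from the single shared threshold sketched here.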

Item Type: Thesis (Masters)
Uncontrolled Keywords: Chain-of-Thought, Multi-label Classification, International Classification of Diseases, Retrieval-Augmented Generation, Transformer.
Subjects: Q Science > QA Mathematics > QA336 Artificial Intelligence
Q Science > QA Mathematics > QA76.76.E95 Expert systems
Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science)
R Medicine > R Medicine (General) > R858 Deep Learning
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55101-(S2) Master Thesis
Depositing User: Yusril Falih Izzaddien
Date Deposited: 30 Jan 2026 06:53
Last Modified: 30 Jan 2026 06:53
URI: http://repository.its.ac.id/id/eprint/131300
