Tabular Data Generation on a Liver Cirrhosis Dataset Using Conditional and Unconditional GAN Methods

Ardyanto, Sandhika Surya (2025) Tabular Data Generation pada Dataset Sirosis Hati dengan Metode Conditional dan Unconditional GAN. Other thesis, Institut Teknologi Sepuluh Nopember.

5025211022-Undergraduate_Thesis.pdf
Restricted to Repository staff only


Abstract

Health care is one of the main areas that technological advances aim to address. Treatment begins with diagnosis, and a central challenge in diagnosing chronic diseases such as liver cirrhosis is that early symptoms are often hard to distinguish from those of other common illnesses. Collecting medical data, which forms part of electronic health records (EHR), raises privacy concerns because the records contain sensitive information such as patients' medical histories. Synthetic data that closely resembles real patient records can help address the shortage of training data. A Generative Adversarial Network (GAN) consists of two competing models: one generates synthetic data intended to be indistinguishable from real data, while the other tries to tell synthetic and real data apart. Because medical datasets contain both numerical and discrete features, the models use loss functions designed for mixed data types to capture the strong correlations between features in complex datasets and produce synthetic data that is as realistic as possible. This research applies MedGAN, MedBGAN, MedWGAN, and CTGAN to a liver cirrhosis patient dataset to evaluate how realistic the generated data is. Evaluation uses statistical measures, namely the Pearson correlation coefficient (PCC), Cramer's V, the Kolmogorov-Smirnov test (KS-Test), and Jensen-Shannon divergence (JSD), together with a machine learning efficacy approach. For numerical data, MedBGAN gave the best results, with PCC at 98.83% and KS-Test at 18.25%. For discrete data, CTGAN performed best, with Cramer's V at 4.1%, JSD mean at 0.2%, and JSD standard deviation at 0.2%. Classification performance was also best with CTGAN, with a highest accuracy of 44.4%.
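The statistical measures named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation: the function names are hypothetical, the toy inputs are invented for illustration, and the abstract does not say how the per-column scores are aggregated into the reported percentages, so only the underlying primitives are shown.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    gap = 0.0
    for v in set(real) | set(synth):
        cdf_r = sum(1 for r in real if r <= v) / len(real)
        cdf_s = sum(1 for s in synth if s <= v) / len(synth)
        gap = max(gap, abs(cdf_r - cdf_s))
    return gap


def jsd(real, synth):
    """Jensen-Shannon divergence (base 2, range 0..1) between category frequencies."""
    cats = set(real) | set(synth)
    cr, cs = Counter(real), Counter(synth)
    p = {c: cr[c] / len(real) for c in cats}
    q = {c: cs[c] / len(synth) for c in cats}
    m = {c: 0.5 * (p[c] + q[c]) for c in cats}

    def kl(a, b):
        # Terms with a[c] == 0 contribute nothing to the KL divergence.
        return sum(a[c] * math.log2(a[c] / b[c]) for c in cats if a[c] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cramers_v(x, y):
    """Cramer's V between two categorical columns (no bias correction)."""
    n = len(x)
    joint = Counter(zip(x, y))
    cx, cy = Counter(x), Counter(y)
    chi2 = 0.0
    for a in cx:
        for b in cy:
            expected = cx[a] * cy[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(cx), len(cy)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy columns, invented for illustration only (not taken from the thesis dataset).
real_numeric = [1.1, 2.0, 2.9, 4.2, 5.0]
synth_numeric = [1.0, 2.2, 3.1, 3.9, 5.3]
real_discrete = ["yes", "no", "no", "yes", "no"]
synth_discrete = ["yes", "no", "yes", "yes", "no"]

print("KS statistic:", ks_statistic(real_numeric, synth_numeric))
print("JSD:", jsd(real_discrete, synth_discrete))
print("PCC:", pearson(real_numeric, synth_numeric))
print("Cramer's V:", cramers_v(real_discrete, synth_discrete))
```

A lower KS statistic or JSD means the synthetic column's distribution is closer to the real one. PCC and Cramer's V measure association between two columns, so a comparison of real and synthetic data would typically compute them on the same column pairs in both datasets and compare the results.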

Item Type: Thesis (Other)
Uncontrolled Keywords: Cramer’s V, CTGAN, JSD, KS-Test, Liver Cirrhosis, Machine Learning Efficacy, MedBGAN, MedGAN, MedWGAN, PCC.
Subjects: T Technology > T Technology (General) > T57.5 Data Processing
T Technology > T Technology (General) > T58.5 Information technology. IT--Auditing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Sandhika Surya Ardyanto
Date Deposited: 29 Jul 2025 05:25
Last Modified: 29 Jul 2025 05:25
URI: http://repository.its.ac.id/id/eprint/122441
