Random Forest with SHAP for an Interpretable Classification Model on Genomic Data (Case Study: Prostate Cancer Severity Classification)

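Iko, Muhammad Al-haddad P. (2025) Random Forest dengan SHAP untuk Model Klasifikasi yang Interpretable pada Data Genomik (Studi Kasus: Klasifikasi Tingkat Keparahan Kanker Prostat) [Random Forest with SHAP for an Interpretable Classification Model on Genomic Data (Case Study: Prostate Cancer Severity Classification)]. Master's thesis, Institut Teknologi Sepuluh Nopember.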

Abstract

Random forest is an effective ensemble algorithm for analyzing high-dimensional genomic data, yet its interpretability is often limited by its black-box nature. This study combines random forest with Shapley Additive Explanations (SHAP) to improve model transparency and interpretability on two genomic datasets: the TCGA-PRAD DNA methylation dataset for binary classification (early vs. advanced) and the GSE35988 gene expression dataset for multi-class classification (benign, localized, metastatic). With feature selection retaining the top 100 genes, the models achieved an accuracy of 0.683 and an AUC of 0.780 on the binary classification task, and an accuracy of 0.917 and an AUC of 0.971 on the multi-class classification task. SHAP analysis identified genes such as CCND1, CAPN2, and ITPR3 in the binary model, and POLA, TLR10, and CFHR in the multi-class model, as the features that consistently contributed most to the predictions and that showed biologically relevant patterns associated with prostate cancer progression. Beyond describing feature influence at both the global and local levels, SHAP also supports interpreting the direction of each gene's contribution, which strengthens the understanding of the molecular mechanisms underlying differences between stages. Overall, integrating random forest with SHAP yields models that are accurate and stable and also provides actionable interpretations that support biomarker identification, biological validation, and clinical decision-making in prostate cancer risk stratification.
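As an illustration of the local, signed contributions mentioned in the abstract, the following is a minimal sketch of exact Shapley values computed for one sample of a toy scoring function. It is not the thesis's pipeline and does not use the SHAP library or a random forest: the function `toy_model`, the feature names `gene_a`, `gene_b`, `gene_c`, their values, and the all-zero reference sample are hypothetical and chosen only to keep the arithmetic easy to follow. Practical SHAP implementations typically average over a background dataset rather than a single reference sample.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample.

    Averages each feature's marginal contribution over every feature
    ordering. Cost grows factorially, so this only suits a few features.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)        # start from the reference sample
        previous = model(current)
        for name in order:
            current[name] = x[name]     # switch this feature to its observed value
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(f):
    # Hypothetical score with one interaction term (gene_a * gene_c).
    return 0.2 + 0.5 * f["gene_a"] - 0.3 * f["gene_b"] + 0.4 * f["gene_a"] * f["gene_c"]


baseline = {"gene_a": 0.0, "gene_b": 0.0, "gene_c": 0.0}   # assumed reference sample
sample = {"gene_a": 1.0, "gene_b": 1.0, "gene_c": 0.5}     # hypothetical observation

phi = shapley_values(toy_model, sample, baseline)
for name, value in phi.items():
    direction = "raises" if value > 0 else "lowers"
    print(f"{name}: {value:+.3f} ({direction} the score)")
```

The sign of each value gives the direction of that feature's contribution for this sample, and the values sum to the difference between the model output for the sample and for the reference. Averaging absolute values over many samples is one common way to obtain a global ranking.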

Item Type: Thesis (Masters)
Uncontrolled Keywords: genomic data, model interpretability, prostate cancer, random forest, SHAP
Subjects: Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines.
Divisions: Faculty of Mathematics, Computation, and Data Science > Statistics > 49101-(S2) Master Thesis
Depositing User: Muhammad Al-haddad P. Iko
Date Deposited: 15 Jan 2026 07:30
Last Modified: 15 Jan 2026 07:30
URI: http://repository.its.ac.id/id/eprint/129656
