Cindana, Adinda Prilly (2025) Model-Based Clustering For Differentially Expressed Genes Identification Of RNA-SEQ Count Data. Other thesis, Institut Teknologi Sepuluh Nopember.
Text
5003201126-Undergraduate_Thesis.pdf - Accepted Version (5MB). Restricted to Repository staff only until 1 April 2027.
Abstract
Colorectal cancer (CRC), a malignancy that develops in the colon or rectum, is the third most commonly diagnosed cancer worldwide and the second most common cancer among men in Indonesia. Identifying genes involved in CRC progression offers insight into potential diagnostic markers and therapeutic targets. This study applies model-based clustering to RNA-Seq data to analyze differentially expressed genes (DEGs) in CRC. Using the GSE104836 dataset, which includes paired CRC and non-tumor tissue samples, we employ a model based on the negative binomial distribution to handle the variability of RNA-Seq data. This approach clusters genes while distinguishing true differential expression (DE) from random variation, aiding the identification of gene groups associated with CRC. The analysis identified four clusters of genes based on expression profiles, optimized using evaluation metrics including the silhouette score (0.3928), the Calinski-Harabasz index (22084.96), and the Davies-Bouldin index (1.12). The dataset of 36,188 genes was divided into cluster 1 (2,176 genes), cluster 2 (17,068 genes), cluster 3 (3,016 genes), and cluster 4 (13,928 genes). Cluster 4, which has the smallest L^2 norm (0.030), corresponds to the non-DEG group. Cluster 3 contains genes with higher expression in the first condition (normal), while clusters 1 and 2 contain genes with higher expression in the second condition (CRC). Thus, the algorithm identifies 3,016 downregulated genes, 19,244 upregulated genes, and 13,928 non-DEGs. The top 30 significant upregulated genes identified by DESeq2, a conventional DE analysis tool, were found in cluster 1, while the top 30 significant downregulated genes identified by DESeq2 were found in cluster 3.
This study not only contributes to computational biology by applying model-based clustering for DE analysis to complex biological data but also serves as an accessible introduction to RNA-Seq, gene expression clustering, and DEG identification by highlighting clusters of coregulated genes with distinct expression profiles. The identification of 19,244 upregulated and 3,016 downregulated genes offers a valuable resource for researchers to pinpoint key genes involved in CRC progression. Notably, the top 30 significant genes in both categories provide a focused list of candidates for further experimental validation. These insights have the potential to guide the discovery of new diagnostic biomarkers and therapeutic targets for CRC. By deepening the understanding of gene expression dynamics in CRC, this approach may facilitate earlier detection, enhance molecular diagnostic accuracy, and support the development of targeted therapies, ultimately improving patient care and outcomes.
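
The up/down/non-DEG totals in the abstract follow directly from the four cluster sizes and the direction assigned to each cluster. The short Python sketch below only re-derives that arithmetic; it is not the thesis's clustering code, and the names `cluster_sizes`, `cluster_direction`, and `totals` are illustrative assumptions rather than identifiers from the thesis.

```python
from collections import Counter

# Cluster sizes as reported in the abstract (genes per cluster).
cluster_sizes = {1: 2176, 2: 17068, 3: 3016, 4: 13928}

# Direction assigned to each cluster in the abstract, relative to CRC:
# clusters 1 and 2 are higher in CRC, cluster 3 is higher in normal tissue,
# and cluster 4 (smallest L^2 norm) is the non-DEG group.
cluster_direction = {
    1: "upregulated",
    2: "upregulated",
    3: "downregulated",
    4: "non-DEG",
}

# Sum the gene counts per direction.
totals = Counter()
for cluster, n_genes in cluster_sizes.items():
    totals[cluster_direction[cluster]] += n_genes

# The four clusters should partition the full set of 36,188 genes.
total_genes = sum(cluster_sizes.values())

for label in ("upregulated", "downregulated", "non-DEG"):
    print(f"{label}: {totals[label]}")
print(f"total: {total_genes}")
```

Running the sketch reproduces the 19,244 upregulated, 3,016 downregulated, and 13,928 non-DEG genes stated above, and confirms that the cluster sizes add up to the 36,188 genes in the dataset.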
Item Type: | Thesis (Other) |
---|---|
Uncontrolled Keywords: | Differential expression analysis, Cluster analysis, Gene expression, RNA-Seq, Clustering |
Subjects: | H Social Sciences > HA Statistics |
Divisions: | Faculty of Science and Data Analytics (SCIENTICS) > Statistics > 49201-(S1) Undergraduate Thesis |
Depositing User: | Adinda Prilly Cindana |
Date Deposited: | 04 Feb 2025 01:07 |
Last Modified: | 04 Feb 2025 01:07 |
URI: | http://repository.its.ac.id/id/eprint/118013 |