Temporal Topic Mining of arXiv Scientific Literature Using Topic Modeling

Viery Kristenzky Tampubolon, Unedo (2026) Penambangan Topik Secara Temporal Pada Literatur Ilmiah Arxiv Menggunakan Topic Modeling. Other thesis, Institut Teknologi Sepuluh Nopember.

5025221116-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only

Abstract

Topic modeling is an unsupervised Natural Language Processing (NLP) method that summarizes text data as groups of related words. It supports tasks such as text classification and information retrieval. This study mines scientific literature from arXiv to identify temporally relevant topics from 2000 to 2025. The dataset covers Computer Science, Mathematics, and Physics, and four topic modeling methods are applied: BERTopic, Dynamic Topic Modeling (DTM), Latent Dirichlet Allocation (LDA), and TopicGPT. The resulting topics are analyzed temporally to observe how topic trends change over time. Data were collected by scraping the titles and abstracts of articles and journal papers from the arXiv website.

In the experiments, BERTopic showed the best overall performance in both the static and dynamic evaluations. In the static evaluation, BERTopic achieved the highest global Topic Quality (TQ) score, reaching up to 0.8531, with coherence scores of 0.7095, 0.7280, and 0.7398 for Computer Science (CS), Mathematics, and Physics, respectively, and topic diversity (IRBO) scores of 0.9893, 0.9809, and 0.9873. In the dynamic evaluation, BERTopic also achieved the highest Temporal Topic Quality (TTQ) score in every subject: 0.4917 for CS, 0.5214 for Mathematics, and 0.5157 for Physics. It also dominated the topic evolution category (72.17% to 78.87%), which reflects very consistent topic identity across time periods.

Meanwhile, LDA showed progressive adaptation in the per-time-slice evaluation, DTM recorded the lowest static performance but remained competitive in the dynamic evaluation, and TopicGPT produced high-quality topics in the static evaluation but with lower evolutionary coherence.

Item Type: Thesis (Other)
Uncontrolled Keywords: arXiv, BERTopic, Dynamic Topic Modeling, Latent Dirichlet Allocation, Scientific Literature, Temporal Analysis, TopicGPT, Topic Modeling
Subjects: Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science)
Q Science > QA Mathematics > QA76.9.D343 Data mining. Querying (Computer science)
T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Unedo Viery Kristenzky Tampubolon
Date Deposited: 15 Jun 2026 01:27
Last Modified: 15 Jun 2026 01:27
URI: http://repository.its.ac.id/id/eprint/133783
