Pelabelan Klaster Artikel Ilmiah Menggunakan Topic Rank dan Maximum Common Subgraph

Nurilham, Adhi (2018) Pelabelan Klaster Artikel Ilmiah Menggunakan Topic Rank dan Maximum Common Subgraph. Masters thesis, Institut Teknologi Sepuluh Nopember.

5116201047-Master_Thesis.pdf - Published Version
Restricted to Repository staff only

Abstract

Clustering methods make it easier to group scientific articles. Cluster labeling is needed to identify the key phrases that represent the topics discussed in a group of scientific articles. Some clusters of scientific articles need to be merged because they still share similar topics, and merging them yields better cluster labels. Topic similarity can be represented by shared word relations modeled as a graph. This research proposes a cluster labeling method for scientific articles in which clusters are merged based on the structural similarity of their graph representations. The proposed method consists of: (1) grouping scientific articles with the K-Means++ clustering method; (2) extracting candidate phrases with Frequent Phrase Mining (FPM); (3) constructing a graph whose vertices are the words that make up the phrases and whose edges are word relations based on Word2Vec; (4) merging clusters by measuring cluster similarity based on the Maximum Common Subgraph (MCS) structure; (5) labeling the merged clusters with the TopicRank method. The proposed method is evaluated on 2 scientific article datasets that differ in their degree of cluster separation and cohesion. Topic coherence is used as the evaluation measure, quantifying how closely related the topics of a cluster's labels are within that cluster. The test results show that the dataset with high cluster separation and cohesion (homogeneous) yields higher topic coherence for the merged cluster labels. Using co-occurrence word relations to build the cluster graph representation yields better topic coherence than using Word2Vec word relations. This is because co-occurrence word relations are frequency-based and therefore represent the majority topic of a cluster.
==============
Unstructured scientific articles can benefit from a clustering method that groups them based on topic similarity.
Cluster labeling on the resulting clusters is required to discover the key phrases that best represent the topics covered. Several clusters still need to be merged because they share similar topics, and merging them gives better cluster labels. In addition to word occurrences, topic similarity can also be represented by semantic relations between words, which can be modeled with a graph. This research proposes a cluster labeling method for scientific articles that includes a cluster merging process based on a graph model, with the aim of providing labels that are more representative of cluster topics. The graph model approach is chosen because it can map the relationships between words and hence represent the semantic information of the text. The proposed method has several stages. First, the K-Means++ clustering method is applied to a collection of scientific articles. Second, for each cluster, phrase extraction is performed using Frequent Phrase Mining to obtain word tokens that can form representative phrases for the cluster topics. The acquired word tokens are used as input for constructing a graph representation of the cluster. After that, clusters are merged based on cluster graph similarity using the Maximum Common Subgraph (MCS) method. Then, cluster labeling is performed on the merged clusters using the TopicRank method. The proposed method is evaluated on 2 datasets based on the topic coherence score of the merged cluster labels, using a Word2Vec-based graph model and a co-occurrence-based graph model. The results show that the homogeneous dataset 1 yields better results than the heterogeneous dataset 2. In addition, the co-occurrence-based graph produces preferable results in the cluster merging process.
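
The graph-based merging step can be illustrated with a small sketch. It is not the thesis implementation, and the abstract does not state the exact similarity formula. The sketch assumes that (a) every vertex is a unique word, so the maximum common subgraph of two cluster graphs reduces to the edges they share, (b) an edge links two words that co-occur in the same candidate phrase, and (c) similarity is the size of the common subgraph divided by the size of the larger graph, a common normalization for MCS-based similarity. The function names `cooccurrence_graph` and `mcs_similarity` and the sample phrases are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(phrases, min_count=1):
    """Build an undirected word graph from a cluster's candidate phrases.

    Vertices are words. An edge links two words that appear together in
    at least `min_count` phrases (assumed co-occurrence relation).
    Returns (vertices, edges), with each edge stored as a sorted word pair.
    """
    vertices = set()
    pair_counts = Counter()
    for phrase in phrases:
        words = sorted(set(phrase.lower().split()))
        vertices.update(words)
        # Sorted input means each pair is emitted in a canonical order.
        pair_counts.update(combinations(words, 2))
    edges = {pair for pair, count in pair_counts.items() if count >= min_count}
    return vertices, edges


def mcs_similarity(graph_a, graph_b):
    """Shared edges divided by the edge count of the larger graph.

    With unique word labels, the maximum common subgraph is simply the
    set of edges present in both graphs (assumption stated above).
    """
    _, edges_a = graph_a
    _, edges_b = graph_b
    larger = max(len(edges_a), len(edges_b))
    if larger == 0:
        return 0.0
    return len(edges_a & edges_b) / larger


# Hypothetical candidate phrases for two clusters.
graph_a = cooccurrence_graph(["topic model", "latent topic model", "cluster label"])
graph_b = cooccurrence_graph(["topic model", "cluster label", "graph kernel"])
print(mcs_similarity(graph_a, graph_b))  # prints 0.5
```

Under these assumptions, two clusters would be merged when their similarity exceeds a chosen threshold. Replacing the co-occurrence rule with a Word2Vec-based relation would change only how the edge set is built, not the similarity computation.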

Item Type: Thesis (Masters)
Uncontrolled Keywords: Cluster Labeling, Graph Theory, TopicRank, Frequent Phrase Mining, Maximum Common Subgraph
Subjects: T Technology > T Technology (General) > T57.5 Data Processing
T Technology > T Technology (General) > T58.6 Management information systems
Divisions: Faculty of Information and Communication Technology > Informatics > (S2) Master Theses
Depositing User: Adhi Nurilham
Date Deposited: 03 Aug 2018 01:59
Last Modified: 03 Aug 2018 01:59
URI: http://repository.its.ac.id/id/eprint/57999
