Term Similarity Measurement Based on Co-Occurrence and Inverse Class Frequency for Arabic Thesaurus Development

Yunianto, Dika Rizky (2017) Pengukuran Kemiripan Term Berbasis Co-Occurrence dan Inverse Class Frequency Pada Pengembangan Thesaurus Bahasa Arab. Masters thesis, Institut Teknologi Sepuluh Nopember.

5115201007-Master_Theses.pdf - Published Version


Abstract

A thesaurus is a useful tool for query expansion in document retrieval. It is a dictionary built by examining the similarity between terms. In automatic thesaurus construction, one way to measure term similarity is a statistical approach based on the terms in the documents of a corpus. Several Arabic thesauri have been built with statistical approaches. One such approach is the co-occurrence technique, which considers how frequently terms appear together. Term similarity in thesaurus construction should depend not only on how informative a term is with respect to a document, but also on how informative it is with respect to a cluster. The corpus documents are collected and then preprocessed to obtain a list of terms. The TF-IDF value of each term is computed and used as a feature for clustering the documents. The clustered documents then serve as the basis for computing the Inverse Class Frequency (ICF). The TF-ICF value is used to compute the cluster weight in the co-occurrence technique, a calculation that takes into account the joint occurrence of two terms. The cluster weight incorporating TF-ICF serves as the term similarity value for building the thesaurus. Evaluation of the thesaurus built with the proposed method yielded a highest precision of 76.7%, a highest recall of 81.8%, and an F-measure of 54.1%.
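The TF-ICF and cluster-weight steps described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the formulas, so the ICF form used here (1 + log of the number of clusters divided by the number of clusters containing the term) and the asymmetric, min-based cluster weight are assumptions chosen for illustration. The function names `tf_icf` and `cluster_weight`, the toy tokens, and the cluster labels are all hypothetical, and the clustering step itself is assumed to have been done already.

```python
import math
from collections import Counter


def tf_icf(docs, labels):
    """Return {cluster: {term: TF-ICF weight}} for tokenized, clustered documents.

    docs   -- list of token lists (already preprocessed)
    labels -- cluster id of each document (assumed to come from a prior clustering step)
    """
    # Term frequency per cluster: how often each term occurs in that cluster's documents.
    tf = {}
    for tokens, label in zip(docs, labels):
        tf.setdefault(label, Counter()).update(tokens)

    n_clusters = len(tf)
    # Class frequency: number of clusters in which each term appears at least once.
    cf = Counter(term for counts in tf.values() for term in counts)

    # Assumed ICF form (not taken from the thesis): 1 + log(n_clusters / cf).
    return {
        label: {
            term: count * (1.0 + math.log(n_clusters / cf[term]))
            for term, count in counts.items()
        }
        for label, counts in tf.items()
    }


def cluster_weight(weights, term_a, term_b):
    """Asymmetric similarity of term_b to term_a (illustrative assumption).

    The joint weight of the two terms in a cluster is taken as the smaller of
    their TF-ICF weights; the sum of joint weights is normalized by the total
    weight of term_a across all clusters.
    """
    joint = sum(min(w.get(term_a, 0.0), w.get(term_b, 0.0)) for w in weights.values())
    total = sum(w.get(term_a, 0.0) for w in weights.values())
    return joint / total if total else 0.0


# Hypothetical toy corpus: transliterated tokens and made-up cluster labels.
docs = [
    ["kitab", "qalam"],
    ["kitab", "qalam", "madrasah"],
    ["bahr", "samak"],
]
labels = [0, 0, 1]

weights = tf_icf(docs, labels)
for pair in [("kitab", "qalam"), ("kitab", "madrasah"), ("kitab", "bahr")]:
    print(pair, cluster_weight(weights, *pair))
```

In this sketch, terms that occur together in the same clusters with similar weights score close to 1, and terms that never share a cluster score 0. Because the normalization uses only the first term's total weight, the measure is not symmetric, so each term can have its own ranked list of related terms.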

Item Type: Thesis (Masters)
Uncontrolled Keywords: Co-occurrence technique, Inverse Class Frequency, term similarity, Arabic thesaurus
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science. EDP
Divisions: Faculty of Information Technology > Informatics Engineering > 55101-(S2) Master Thesis
Depositing User: Dika Rizky Yunianto
Date Deposited: 03 Apr 2017 06:41
Last Modified: 06 Mar 2019 03:51
URI: http://repository.its.ac.id/id/eprint/2899
