Query Expansion Berbasis Statistik dan Semantik dalam Pencarian Dokumen Berbahasa Arab (Statistical and Semantic-Based Query Expansion in Arabic Document Search)

Maryamah, Maryamah (2022) Query Expansion Berbasis Statistik dan Semantik dalam Pencarian Dokumen Berbahasa Arab. Doctoral thesis, Institut Teknologi Sepuluh Nopember.

Text
05111860010012-Disertation.pdf - Accepted Version
Restricted to Repository staff only until 1 April 2024.

Abstract

Indonesia is a country with a Muslim-majority population, and Islamic law is therefore accommodated in national legislation. Drafting such laws requires reference books (documents) as a basis for consideration. The public also needs answers grounded in relevant book references. Relevant references can be found by submitting an appropriate query to a document search system, so both the law-drafting team (the users) and the public need a set of relevant terms to use as a query. However, formulating a relevant query is not easy. Query expansion (QE) is a method that helps reformulate the initial query by adding words or terms to improve the search results.
Adding candidate terms with QE in document search requires a proper understanding of context. The social conditions addressed by present-day searches on Islamic law differ greatly from those of the society in which the reference documents were written. As a result, many proofs (dalil) or rulings of Islamic law are less relevant when applied directly to current problems, so the search requires an understanding of problem contexts or terminology similar to the problem at hand, possibly across fields. A system using hybrid QE based on statistical and semantic information, able to provide candidate terms with cross-field terminology, is therefore needed to increase the relevance of Arabic document search results.
This research proposes a new hybrid QE method that combines statistical and semantic information to find candidate terms that are relevant to the sentence context of the user's query in Arabic document search. The proposed system consists of the following main stages: translation using a combination of dictionaries, Google Translate, and word embedding; statistical QE using pseudo-relevance feedback with combined term extraction; semantic QE using Wikipedia knowledge (BabelNet embedding) and word embedding methods (BERT and Word2Vec); combination of the statistical and semantic QE into hybrid QE; filtering to minimize less relevant candidate terms; and the final document search. Two types of documents are used in this research: static documents, which come from a reference dataset of Arabic books and do not grow over time, and dynamic documents, which come from forum results in the form of questions and references and keep growing over time.
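The combination of statistical and semantic QE followed by filtering can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method of the thesis: the function names, the weighted sum controlled by `alpha`, the fixed `threshold` filter, and the toy two-dimensional vectors, which stand in for real BERT, Word2Vec, or BabelNet embeddings. The toy tokens are transliterated placeholders rather than Arabic text.

```python
import math
from collections import Counter


def prf_scores(query_terms, ranked_docs, top_k=2):
    """Statistical side: relative frequency of non-query terms
    in the top_k documents of an initial retrieval (pseudo-relevance feedback)."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        counts.update(t for t in doc if t not in query_terms)
    total = sum(counts.values()) or 1
    return {t: c / total for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_scores(query_terms, embeddings):
    """Semantic side: highest cosine similarity between a candidate and any query term."""
    scores = {}
    for term, vec in embeddings.items():
        if term in query_terms:
            continue
        sims = [cosine(vec, embeddings[q]) for q in query_terms if q in embeddings]
        if sims:
            scores[term] = max(sims)
    return scores


def hybrid_expand(query_terms, ranked_docs, embeddings,
                  alpha=0.5, threshold=0.3, n_terms=2):
    """Assumed combination rule: weighted sum of both scores,
    then drop candidates below the threshold and keep the best n_terms."""
    stat = prf_scores(query_terms, ranked_docs)
    sem = semantic_scores(query_terms, embeddings)
    combined = {t: alpha * stat.get(t, 0.0) + (1 - alpha) * sem.get(t, 0.0)
                for t in set(stat) | set(sem)}
    kept = sorted(((t, s) for t, s in combined.items() if s >= threshold),
                  key=lambda pair: (-pair[1], pair[0]))
    return list(query_terms) + [t for t, _ in kept[:n_terms]]


# Hypothetical toy data: tokenized documents in initial-retrieval order.
docs = [["zakat", "nisab", "wealth"],
        ["zakat", "nisab", "charity"],
        ["prayer", "fasting"]]
vectors = {"zakat": [1.0, 0.0], "nisab": [0.9, 0.1], "charity": [0.8, 0.6],
           "wealth": [0.6, 0.8], "prayer": [0.0, 1.0]}

expanded = hybrid_expand(["zakat"], docs, vectors)
print(expanded)  # ['zakat', 'nisab', 'charity']
```

In this toy run, "nisab" ranks first because it scores well on both sides, while "prayer" is filtered out: it never appears in the feedback documents and its vector is orthogonal to the query term.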
The proposed translation method, which combines dictionaries, Google Translate, and word embedding, achieves better results than Google Translate, with a BLEU score of 0.47. The proposed statistical QE provides candidate terms that share the query's domain characteristics and context, so its document search results are higher than those of the baseline methods, with results of 21%, 14%, 42%, 42%, 45%, and 58% for Precision at 5 documents (P5) and at 10 documents (P10), Recall, Mean Reciprocal Rank (MRR), and Success Rate (SR5K for 5 documents and SR10K for 10 documents), respectively. The proposed semantic QE method also yields higher results, because it is very effective for searches involving words or problems that are not recognized in the documents from the original query alone, with values of 23%, 16%, 50%, 46%, 60%, and 76% for P5, P10, Recall, MRR, SR5K, and SR10K, respectively. The proposed hybrid QE method retrieves relevant documents more effectively still, with results of 27%, 19%, 61%, 52%, 78%, and 90% for P5, P10, Recall, MRR, SR5K, and SR10K, respectively. The experiments and analysis show that the proposed method can provide candidate terms relevant to the sentence context of the user's query, which improves the performance and relevance of the document search system, especially in the social-religious domain.
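For readers unfamiliar with the retrieval metrics reported above, the following minimal sketch shows how they are commonly computed. The exact definitions used in the thesis are not given in this record, so two points are assumptions: success at k is taken to mean that at least one relevant document appears in the top k, and MRR and Success Rate are taken to be averages over queries. All document identifiers are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall(ranked, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked if d in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def success_at_k(ranked, relevant, k):
    """Assumed definition: 1 if any relevant document is in the top k, else 0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


def mean(values):
    """Average of per-query values (used for MRR and Success Rate)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical runs: (ranked result list, set of relevant documents) per query.
runs = [
    (["d3", "d1", "d7", "d2", "d9"], {"d1", "d2", "d5"}),
    (["d4", "d6"], {"d8"}),
]

p5 = precision_at_k(*runs[0], k=5)                    # 2 of the top 5 are relevant
mrr = mean(reciprocal_rank(r, rel) for r, rel in runs)
sr5 = mean(success_at_k(r, rel, 5) for r, rel in runs)
print(p5, mrr, sr5)
```

The first query finds its first relevant document at rank 2 and the second finds none, so the two reciprocal ranks are 0.5 and 0, and only one of the two queries counts as a success at 5.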

Item Type:	Thesis (Doctoral)
Uncontrolled Keywords:	Query Expansion Hybrid, Statistical QE, Semantic QE, Document Search, Static and Dynamic Documents
Subjects:	Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines; R Medicine > R Medicine (General) > R858 Deep Learning; Z Bibliography. Library Science. Information Resources > ZA Information resources > Z699.5 Information storage and retrieval systems
Divisions:	Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55001-(S3) PhD Thesis (Comp Science)
Depositing User:	Maryamah Maryamah
Date Deposited:	09 Feb 2022 06:36
Last Modified:	09 Feb 2022 06:36
URI:	http://repository.its.ac.id/id/eprint/93321
