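Syaifudin, Mohamad Fahmi (2025) Entity Matching Menggunakan Large Language Model dan Knowledge Graph untuk Mencari Redundansi Fitur dalam Software Specification [Entity Matching Using a Large Language Model and a Knowledge Graph to Find Feature Redundancy in Software Specifications]. Master's thesis, Institut Teknologi Sepuluh Nopember.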
Text
6026231014-Master_Thesis.pdf (5MB) - Accepted Version. Restricted to Repository staff only until 1 April 2027.
Abstract
In the Software Development Lifecycle (SDLC), various supporting documents are produced, including the software specification, which describes the features being built. A common problem in software development is that new requirements emerge that resemble functionality already provided by existing modules. Entity matching (EM), which determines whether two entity descriptions refer to the same underlying entity, can address this by detecting redundancy between features in different software products or modules. Traditional EM approaches are generally rule-based or rely on machine learning or deep learning. This study proposes an EM approach that uses a Large Language Model (LLM) and a Knowledge Graph (KG). The datasets are drawn from the Odoo user guide, the Zoho Commerce user guide, and public user requirement documents. Entities and relations are extracted and converted into vector representations through a knowledge embedding process. Neo4j stores the KG, and its vector index is used to retrieve the entities or relations relevant to EM. Both sentence-based and graph-based matching methods are evaluated. Sentence-based matching achieved an F1-score of 0.762, higher than the graph-based method's maximum F1-score of 0.471. However, graph-based matching achieved the higher Mean Reciprocal Rank (MRR): 0.660 versus 0.477 for the sentence-based approach. Another advantage of the graph-based method is that it does not depend on the choice of word embedding: its F1-score stayed within the range of 0.4 to 0.5 across the embedding models tested.
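The abstract reports two metrics that rank the methods differently, so it may help to see how each is commonly computed. The following is a minimal sketch of the textbook definitions of F1-score and Mean Reciprocal Rank, not the thesis's own evaluation code: the function names, the pair-based F1 formulation, and the feature labels in the example are all assumptions made for illustration.

```python
from typing import Hashable, Sequence, Set, Tuple

Pair = Tuple[str, str]


def f1_score(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """F1 over sets of (feature_a, feature_b) pairs judged to be matches."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    gold: Sequence[Set[Hashable]],
) -> float:
    """Average of 1/rank of the first correct candidate for each query.

    A query whose ranking contains no correct candidate contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, correct in zip(rankings, gold):
        for position, candidate in enumerate(ranked, start=1):
            if candidate in correct:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical feature labels, used only to exercise the functions.
predicted_pairs = {("invoice", "billing"), ("cart", "wishlist")}
gold_pairs = {("invoice", "billing"), ("cart", "basket")}
print(f1_score(predicted_pairs, gold_pairs))

candidate_rankings = [["billing", "payment"], ["wishlist", "basket"]]
correct_candidates = [{"billing"}, {"basket"}]
print(mean_reciprocal_rank(candidate_rankings, correct_candidates))
```

Under these definitions, F1 scores the final match or no-match decisions, while MRR only rewards placing a correct candidate near the top of a ranked list. A method can therefore rank candidates well and still make weaker binary decisions, which is consistent with the graph-based method leading on MRR while trailing on F1.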
Item Type: | Thesis (Masters) |
---|---|
Uncontrolled Keywords: | Software requirement, Software Specification, Entity Matching, Neo4j, LLM, RAG, Knowledge Graph |
Subjects: | T Technology > T Technology (General) > T57.8 Nonlinear programming. Support vector machine. Wavelets. Hidden Markov models; T Technology > T Technology (General) > T58.62 Decision support systems |
Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Information System > 59101-(S2) Master Thesis |
Depositing User: | Mohamad Fahmi Syaifudin |
Date Deposited: | 02 Feb 2025 06:59 |
Last Modified: | 02 Feb 2025 06:59 |
URI: | http://repository.its.ac.id/id/eprint/117535 |