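Perbandingan Kinerja Sistem Tanya Jawab Hukum Indonesia Dengan LLM Open-Source Berbasis RAG (Performance Comparison of Indonesian Legal Question-Answering Systems Using RAG-Based Open-Source LLMs)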

Prasetya, Darren (2025) Perbandingan Kinerja Sistem Tanya Jawab Hukum Indonesia Dengan LLM Open-Source Berbasis RAG. Other thesis, Institut Teknologi Sepuluh Nopember.

Text
5025211162-Undergraduate_Thesis.pdf - Accepted Version

Abstract

Legal issues often pose significant challenges for laypeople, many of whom lack the knowledge or skills needed to address them. Automated question-answering systems are therefore frequently proposed as a solution. One promising approach is Retrieval-Augmented Generation (RAG), which integrates legal information retrieval with a Large Language Model (LLM). However, no existing study has explored the application of RAG to Indonesian law. This research aims to develop a RAG system specifically for Indonesian law using the Belgian Statutory Article Retrieval Dataset (BSARD), translated and mapped to Indonesian legal articles sourced from https://peraturan.go.id. The study proceeds in three stages: preparing and compiling a complete dataset of Indonesian legislation, designing a mapping between legal questions and the relevant articles, and comparing the results of open-source models trained on Indonesian in a RAG application for question answering on Indonesian law. The research uses open-source models trained on Indonesian such as Cendol 7B, Qwen2.5 8B, SEA-LION3.5 8B, and SahabatAI Gemma2 9B.
The comparison shows that SahabatAI Gemma2 9B achieves the highest performance on the lexical metrics, with a BLEU score of 12.09, ROUGE-1 of 39.69, ROUGE-2 of 17.58, and ROUGE-L of 29.17. On the Embedding Cosine Similarity metric, Qwen2.5 8B achieves the highest score, 94.1. In the evaluation using the set of LLM-as-a-Judge metrics, SEA-LION3.5 8B performs best in three categories: accuracy, hallucination, and usefulness. Specifically, it receives the highest accuracy score from both LLM judges, GPT-4o (61.13) and DeepSeek R1 (73.59), as well as the highest hallucination score (72.81) and usefulness score (71.77). For the relevance metric, SahabatAI scores highest with 79.61.
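As a rough illustration of what the lexical and embedding metrics above measure, the following minimal Python sketch computes ROUGE-N and ROUGE-L F1 from whitespace tokens and the cosine similarity of two vectors. It is not the thesis's evaluation code: the function names, the lowercase whitespace tokenization, and the 0 to 1 score range are assumptions made here for illustration (the reported scores appear to be on a 0 to 100 scale), and standard ROUGE or BLEU packages apply their own tokenization and options.

```python
from collections import Counter
import math


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """F1 over overlapping n-grams (assumed: lowercase whitespace tokens)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref = ngrams(reference.lower().split())
    cand = ngrams(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 based on the longest common subsequence (LCS) of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # Dynamic-programming table: dp[i][j] = LCS length of ref[:i] and cand[:j].
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r_tok in enumerate(ref, start=1):
        for j, c_tok in enumerate(cand, start=1):
            if r_tok == c_tok:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(cand)]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In this sketch, a generated answer would be passed as `candidate` and the gold answer as `reference`; for the embedding metric, both texts would first be encoded by an embedding model (not shown, and not specified in the abstract) before calling `cosine_similarity`.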

Item Type:	Thesis (Other)
Uncontrolled Keywords:	Indonesian Legal Dataset, Legal Question Answering, Large Language Model, Retrieval-Augmented Generation, Information Retrieval
Subjects:	T Technology > T Technology (General) > T11 Technical writing. Scientific Writing; T Technology > T Technology (General) > T57.5 Data Processing
Divisions:	Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User:	Darren Prasetya
Date Deposited:	27 Jul 2025 07:06
Last Modified:	27 Jul 2025 07:06
URI:	http://repository.its.ac.id/id/eprint/121575
