Large Language Models and Parallel Computing for Contradiction Detection in Legislative Documents

Benito, Davian (2024) Large Language Model dan Komputasi Paralel untuk Deteksi Kontradiksi pada Dokumen Perundangan [Large Language Models and Parallel Computing for Contradiction Detection in Legislative Documents]. Other thesis, Institut Teknologi Sepuluh Nopember.

5025201220-Undergraduate_Thesis.pdf - Accepted Version (3MB)
Restricted to Repository staff only until 1 October 2026.

Abstract

A contradiction occurs when a pair of statements conflict with each other within the same context. Contradictions can arise in many areas of life, including a country's legal system. The variety of authorized institutions and the large number of new regulations enacted each year make contradictions between clauses of statutory documents possible. One real case was a contradiction between a General Election Commission regulation (PKPU) and a Minister of Home Affairs regulation (Permendagri) that set different conditions prohibiting political party membership. Analysis of that case found that the contradiction caused legal uncertainty when the regulations were implemented. A system that can detect contradictions accurately and quickly is therefore needed.
The stages of this research comprise collecting statutory regulation data, creating a dataset, training a contradiction detection model using parallel computing, and evaluating and analyzing the test results. The data collection stage consists of web scraping the Legal Documentation and Information Network (JDIH) website of the General Election Commission (KPU) and extracting clause texts to obtain the regulation data. The dataset creation stage then applies summarization and topic modeling to group representative regulation clauses by document topic. Training labels are produced with two techniques: manual annotation and synthetic data generation using ChatGPT. A pseudo-labeling process using an LLM is then carried out to enlarge the labeled training set. The model training stage covers several LLMs, namely IndoBERT, mBERT, and XLM-RoBERTa, as well as versions of each model fine-tuned on the SNLI and IndoNLI datasets. Training is parallelized on the GPU-A100 and TPU-V2 accelerators available on Google Colab. The final stage evaluates the test results in terms of accuracy and time performance.
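The pseudo-labeling step described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the confidence threshold, the label set, and the stub classifier standing in for a fine-tuned LLM are all assumptions made for the example.

```python
# Sketch of pseudo-labeling: keep only model predictions whose confidence
# exceeds a threshold, and add them to the labeled training pool.
# Threshold value and stub model are illustrative assumptions.

LABELS = ["entailment", "neutral", "contradiction"]

def pseudo_label(pairs, predict_proba, threshold=0.9):
    """Return (premise, hypothesis, label) triples for confident predictions."""
    labeled = []
    for premise, hypothesis in pairs:
        probs = predict_proba(premise, hypothesis)  # one probability per class
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            labeled.append((premise, hypothesis, LABELS[best]))
    return labeled

# Stub standing in for a fine-tuned NLI model.
def stub_model(premise, hypothesis):
    return [0.02, 0.03, 0.95] if "not" in hypothesis else [0.40, 0.35, 0.25]

pairs = [
    ("Members may join a political party.", "Members may not join a political party."),
    ("Voting opens at 7 am.", "Ballots are counted publicly."),
]
# Only the first pair clears the 0.9 threshold and is kept as a new label.
print(pseudo_label(pairs, stub_model))
```

Low-confidence pairs are simply dropped rather than labeled, which is the usual trade-off in pseudo-labeling: fewer but cleaner automatically generated training examples.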
The results show that the TPU-V2 with a large batch size had the best time performance, at 82 seconds per epoch. In terms of accuracy, XLM-RoBERTa with a batch size of 64 achieved the best metrics: an accuracy of 0.98229 and a recall of 0.97865. In this research, the regulation texts were extracted manually by copy-pasting. Manual annotation was then carried out on legislative amendment documents to obtain the initial training data. LLM development was carried out with parallel computing on the TPU-V2 processing unit, and using a large batch size on the TPU-V2 was found to improve LLM training time.
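The reported accuracy and recall can be reproduced from predicted and true labels as sketched below. The label lists here are made-up illustrations, not the thesis's evaluation data; recall is computed as a macro average over classes, one plausible reading of the reported figure.

```python
# Accuracy and macro-averaged recall over NLI-style labels.
# The example label lists are illustrative, not the thesis's data.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred):
    """Average per-class recall: correct predictions of class c / true members of c."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)
        recalls.append(tp / support)
    return sum(recalls) / len(classes)

y_true = ["contradiction", "contradiction", "neutral", "entailment"]
y_pred = ["contradiction", "neutral", "neutral", "entailment"]
print(accuracy(y_true, y_pred))      # 0.75
print(macro_recall(y_true, y_pred))  # (0.5 + 1.0 + 1.0) / 3 ≈ 0.8333
```

On imbalanced legal data, macro recall penalizes a model that misses the rare contradiction class even when overall accuracy stays high, which is why both metrics are worth reporting.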

Item Type: Thesis (Other)
Uncontrolled Keywords: Legislation, Contradiction Detection, Large Language Model, Parallel Computing, Topic Modeling
Subjects: T Technology > T Technology (General) > T11 Technical writing. Scientific Writing
T Technology > T Technology (General) > T385 Visualization--Technique
T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Davian Benito
Date Deposited: 02 Aug 2024 02:21
Last Modified: 02 Aug 2024 02:21
URI: http://repository.its.ac.id/id/eprint/111817
