Peringkasan Teks Menggunakan Large Language Model dengan Metode MapReduce dan Clustering

Aminy, Moh. Rosy Haqqy (2025) Peringkasan Teks Menggunakan Large Language Model dengan Metode MapReduce dan Clustering. Other thesis, Institut Teknologi Sepuluh Nopember.

Text: 5025211012-Undergraduate_Thesis.pdf - Accepted Version (7MB)
Restricted to Repository staff only

Abstract

The rapid growth of digital information has produced an explosion in the volume of text documents available online, ranging from news articles and financial reports to scientific papers. This situation drives the need for automatic summarization systems that can present important information concisely and in an easy-to-understand form. Applying Large Language Model (LLM)-based summarization to long documents still faces several challenges, including the limited input length. A new strategy is therefore needed that can handle this limitation without sacrificing summary quality.
This research proposes an LLM-based method for summarizing long texts that combines the MapReduce and Clustering approaches. The Qwen2.5-7B Instruct model is used with instruction-based Low-Rank Adaptation fine-tuning. A long document is split into chunks, which are converted into embeddings and grouped with the K-Means algorithm. The clustering results are used to select the most relevant representative chunks, which are summarized in the Map stage. These partial summaries are then merged and summarized again in the Reduce stage to produce a single final summary. This approach makes it possible to process documents that exceed the LLM's token limit, as illustrated in the sketch below.
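The following Python sketch outlines the chunk, embedding, K-Means, Map, and Reduce steps described above. The embed and summarize callables are hypothetical stand-ins for the embedding model and the fine-tuned Qwen2.5-7B Instruct model, and the chunk size and cluster count are illustrative values rather than the configuration used in the thesis.

# Illustrative sketch of the chunk -> embed -> cluster -> Map -> Reduce flow.
# embed() and summarize() are hypothetical wrappers around an embedding model
# and the fine-tuned LLM; they are not the thesis implementation.
import numpy as np
from sklearn.cluster import KMeans

def summarize_long_document(text, embed, summarize, chunk_size=2000, n_clusters=8):
    # Split the long document into fixed-size chunks (a token-aware splitter
    # would be used in practice).
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Convert each chunk into an embedding vector and cluster with K-Means.
    vectors = np.array([embed(c) for c in chunks])
    k = min(n_clusters, len(chunks))
    kmeans = KMeans(n_clusters=k, n_init="auto", random_state=0).fit(vectors)

    # Pick the chunk closest to each centroid as the cluster representative.
    representatives = []
    for label in range(k):
        idx = np.where(kmeans.labels_ == label)[0]
        dists = np.linalg.norm(vectors[idx] - kmeans.cluster_centers_[label], axis=1)
        representatives.append(chunks[idx[np.argmin(dists)]])

    # Map stage: summarize each representative chunk independently.
    partial_summaries = [summarize(c) for c in representatives]

    # Reduce stage: merge the partial summaries and summarize them again
    # into a single final summary.
    return summarize("\n".join(partial_summaries))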
The summarization results are evaluated with the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics, which are widely used in natural language processing. ROUGE measures the similarity between a system-generated summary and a reference summary, from n-gram overlap up to the longest common subsequence. Experimental evaluation on banking-sector documents from the BRI dataset shows that the MapReduce and Clustering method significantly outperforms the direct truncation approach: ROUGE-1 increases from 0.320 to 0.416, ROUGE-2 from 0.090 to 0.118, and ROUGE-L from 0.168 to 0.219.
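For reference, ROUGE-1/2/L F1 scores of the kind reported above can be computed with the rouge-score Python package, as sketched below. The exact tokenization, stemming, and aggregation settings used in the thesis evaluation are not stated in this abstract, so the configuration shown is an assumption.

# Sketch of ROUGE-1/2/L scoring with the rouge-score package; the exact
# settings used in the thesis evaluation are not specified here.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_f1(reference, candidate):
    # score() compares the reference summary with the system summary and
    # returns precision/recall/F1 for each requested ROUGE variant.
    scores = scorer.score(reference, candidate)
    return {name: s.fmeasure for name, s in scores.items()}

# Corpus-level results are typically reported as the average of per-document F1.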

Item Type: Thesis (Other)
Uncontrolled Keywords: Clustering, Large Language Model, MapReduce, ROUGE, Summarization.
Subjects: T Technology > T Technology (General)
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Moh. Rosy Haqqy Aminy
Date Deposited: 28 Jul 2025 10:27
Last Modified: 28 Jul 2025 10:28
URI: http://repository.its.ac.id/id/eprint/122225
