NER with Data Augmentation and Text Generation-based Fine-Tuning using Large Language Models

Andrian, Andrian (2025) NER with Data Augmentation and Text Generation-based Fine-Tuning using Large Language Models. Other thesis, Institut Teknologi Sepuluh Nopember.

Text
5025211079-Undergraduate_Thesis.pdf - Accepted Version

Download (9MB)

Abstract

The development of Named Entity Recognition (NER) faces the challenge of limited labeled data, which impacts model performance in recognizing name, location, and organization entities. Token-classification methods, as used in BERT and IndoBERT, require large amounts of data to reach optimal performance, while instruction-fine-tuned Large Language Models (LLMs) offer an alternative with better generalization. This research proposes combining data augmentation with text-generation approaches, Knowledge Entity Extraction (KEE) and Entity-to-Text Generation, to increase data variety for both training methods. The tested models comprise IndoBERT as the token-classification baseline, along with LLAMA-3 SahabatAI, SEA-LION, and LLAMA 3.1 for both training methods. The larger models are quantized with 4-bit NormalFloat (NF4) and optimized with LoRA (Low-Rank Adaptation) for efficiency. The data consist of 3,150 Indonesian-language tweets about natural disasters, 1,000 of which serve as test data, plus a Bali dataset of 120 traditional stories (4,343 training and 2,229 test instances) with narrative-character entities for cross-domain generalization evaluation. Evaluation uses the F1-score from the Seqeval library, testing method consistency on informal social-media language versus traditional narrative language. Results show that instruction fine-tuning of LLMs yields the highest and most consistent performance across domains. SahabatAI achieves the highest F1-scores, 0.7192 on Twitter and 0.8044 on Bali, surpassing IndoBERT (0.711); SEA-LION and LLAMA 3.1 score 0.7175 and 0.7081, respectively. In the augmentation-generator evaluation, KEE augmentation with SahabatAI as generator performs better (0.7968) than with GPT-4o Mini (0.7831), with 98.1% output consistency. IndoBERT degrades under data augmentation, whereas the LLMs improve consistently. The KEE method is effective for organization entities (177% improvement), while Entity-to-Text Generation excels for location entities (212%). The research demonstrates that combining LLMs with instruction fine-tuning and data augmentation improves NER performance on Indonesian social-media text, with effective cross-domain transfer.
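The abstract names two text-generation augmentation strategies, KEE and Entity-to-Text Generation, without detailing their prompts. Below is a minimal, hypothetical sketch of how such prompt-based augmentation could look; the prompt wording, function names, and the exact KEE pipeline are illustrative assumptions, not the thesis's actual templates.

# Hypothetical sketch (not the thesis code): prompt construction for the two
# augmentation strategies named in the abstract. Prompt wording is an
# illustrative paraphrase; the actual templates and KEE pipeline may differ.

def entity_to_text_prompt(entities: dict) -> str:
    """Entity-to-Text Generation: ask an LLM to write a new labeled sentence
    that uses the given entities, in the style of the target domain."""
    listing = "; ".join(f"{label}: {', '.join(values)}" for label, values in entities.items())
    return (
        "Write one new Indonesian tweet about a natural disaster that mentions "
        f"exactly these entities ({listing}), then list each entity with its "
        "BIO tag."
    )

def kee_prompt(sentence: str) -> str:
    """Knowledge Entity Extraction (KEE), as we read the abstract: extract the
    entities from a sentence together with brief background knowledge, to seed
    new sentence variants."""
    return (
        "Extract all name, location, and organization entities from the tweet "
        "below, add one line of background knowledge per entity, and write a "
        f"paraphrased variant of the tweet:\n{sentence}"
    )

print(entity_to_text_prompt({"LOC": ["Semarang"], "ORG": ["BNPB"]}))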
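The abstract also describes quantizing the larger LLMs with 4-bit NormalFloat (NF4), attaching LoRA adapters for efficient instruction fine-tuning, and evaluating with Seqeval's F1-score. The sketch below shows a typical Hugging Face transformers/peft/seqeval setup for that combination; the model identifier and all hyperparameters are assumptions, not the thesis's reported settings.

# Minimal sketch, assuming a standard Hugging Face stack: NF4 4-bit
# quantization plus LoRA adapters for instruction fine-tuning, and Seqeval's
# entity-level F1 for evaluation. Model ID and hyperparameters are
# illustrative, not the thesis's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from seqeval.metrics import f1_score

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed stand-in; the thesis also uses SahabatAI and SEA-LION

# 4-bit NormalFloat (NF4) quantization, as named in the abstract
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# LoRA (Low-Rank Adaptation): train small adapter matrices instead of the
# full weights; rank/alpha/targets here are common defaults, not thesis values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Seqeval scores F1 at the entity level over BIO-tagged sequences,
# so a partially matched span counts as an error. Toy example:
gold = [["B-PER", "O", "B-LOC", "I-LOC", "O"]]
pred = [["B-PER", "O", "B-LOC", "O", "O"]]
print(f1_score(gold, pred))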

Item Type: Thesis (Other)
Uncontrolled Keywords: Knowledge Enhanced Entity, Entity-to-Text Generation, Named Entity Recognition, Token Classification, Instruction fine-tuning
Subjects: T Technology > T Technology (General) > T57.5 Data Processing
T Technology > T Technology (General) > T58.62 Decision support systems
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Andrian Andrian
Date Deposited: 28 Jul 2025 02:09
Last Modified: 28 Jul 2025 02:09
URI: http://repository.its.ac.id/id/eprint/121883
