Classification of Protagonist and Antagonist Character Types in Nusantara Folktales Using Named Entity Recognition (NER)

Ravelia, Rayssa (2025) Klasifikasi Tipe Tokoh Protagonis dan Antagonis Dalam Cerita Rakyat Nusantara Menggunakan Named Entity Recognition (NER). Other thesis, Institut Teknologi Sepuluh Nopember.

Abstract

Nusantara folktales possess rich narrative structures and play an important role in cultural preservation. However, manual analysis of their structure and characters requires significant time and effort. Therefore, an automated approach is needed to identify and classify character types in folktales, such as protagonists, antagonists, and others.

This study proposes a multi-stage pipeline consisting of three main components: Named Entity Recognition (NER), alias clustering, and character type classification. The input data consists of folktales in paragraph-form narratives, which are processed through a character entity extraction stage using NER models. The extracted entities are then grouped using alias clustering, which combines multiple similarity distance methods with word sense mapping to unify various character mentions that refer to the same entity. The final stage produces output in the form of character type classification (protagonist, antagonist, or others), based on sentence-level and character-level analysis using lexical approaches, classical machine learning, and BERT-based deep learning models.
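The alias clustering stage described above can be illustrated with a minimal sketch. It assumes a greedy, single-pass grouping over Jaro-Winkler similarity, with a small dictionary standing in for word sense mapping. The function names, the `sense_map` contents, and the sample mentions are hypothetical and are not taken from the thesis; the default threshold of 0.85 is the value the abstract reports as best.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def normalize(mention: str, sense_map: dict) -> str:
    """Lowercase a mention and map each token to its canonical sense (assumed scheme)."""
    return " ".join(sense_map.get(tok, tok) for tok in mention.lower().split())


def cluster_aliases(mentions, sense_map, threshold=0.85):
    """Greedily attach each mention to the first cluster whose representative is similar enough."""
    clusters = []  # list of (normalized representative, member mentions)
    for mention in mentions:
        key = normalize(mention, sense_map)
        for representative, members in clusters:
            if jaro_winkler(key, representative) >= threshold:
                members.append(mention)
                break
        else:
            clusters.append((key, [mention]))
    return [members for _, members in clusters]


# Hypothetical mentions and a hypothetical sense map (illustration only).
sense_map = {"bunda": "ibu"}
mentions = ["Malin Kundang", "Malin Kundan", "Ibu Malin", "Bunda Malin"]
clusters = cluster_aliases(mentions, sense_map)
print(clusters)
```

In this sketch the spelling variant is merged by string similarity alone, while "Bunda Malin" joins "Ibu Malin" only because the sense map rewrites "bunda" to "ibu" before comparison. A greedy pass is order-dependent, which is one plausible source of the inconsistent groupings that any threshold-based approach can produce.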

Based on the experimental results, the best-performing NER model, CahyaBERT 1.5G, achieved a macro F1-score of 0.8940. For the alias clustering stage, the best similarity distance method was Jaro-Winkler with a threshold of 0.85, combined with word sense mapping, resulting in a macro F1-score of 0.6038. In the character classification stage, the best performance was achieved by the Random Forest Normalized model at the sentence level using a majority voting mechanism, with a macro F1-score of 0.9069. Although the pipeline demonstrates promising classification performance, challenges remain in the alias clustering stage, where several character mentions failed to be grouped consistently. Overall, this approach provides a valuable early contribution to the digitalization of folktales and the development of Indonesian-language NLP.
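The sentence-level majority voting mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: each sentence that mentions a character has already received a predicted label from some classifier, and ties are resolved in favor of the label seen first. The function name, the tie-breaking rule, and the sample predictions are hypothetical and do not come from the thesis.

```python
from collections import Counter, defaultdict


def majority_vote(sentence_predictions):
    """Aggregate per-sentence labels into one label per character.

    sentence_predictions: iterable of (character, predicted_label) pairs,
    one pair per sentence in which the character is mentioned.
    """
    votes = defaultdict(Counter)
    for character, label in sentence_predictions:
        votes[character][label] += 1
    # most_common(1) returns the label with the highest count; on a tie,
    # Counter keeps the label that was encountered first (assumed rule).
    return {
        character: counts.most_common(1)[0][0]
        for character, counts in votes.items()
    }


# Hypothetical sentence-level predictions (illustration only).
predictions = [
    ("Bawang Putih", "protagonist"),
    ("Bawang Merah", "antagonist"),
    ("Bawang Putih", "others"),
    ("Bawang Putih", "protagonist"),
    ("Bawang Merah", "antagonist"),
]
result = majority_vote(predictions)
print(result)
```

Voting over sentences lets a single misclassified sentence be outvoted by the rest, which is one reason a sentence-level model can outperform a single character-level prediction.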

Item Type: Thesis (Other)
Uncontrolled Keywords: folktales, Named Entity Recognition, alias clustering, character type classification, character types, cultural digitalization, Natural Language Processing
Subjects: Q Science > QA Mathematics > QA336 Artificial Intelligence
T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Rayssa Ravelia
Date Deposited: 27 Jul 2025 02:49
Last Modified: 27 Jul 2025 02:49
URI: http://repository.its.ac.id/id/eprint/121648
