Identifikasi Tokoh dan Penokohan pada Teks Cerita Berbahasa Bali Menggunakan Pendekatan Berbasis Aturan dan Supervised Machine Learning

Bimantara, I Made Satria (2024) Identifikasi Tokoh dan Penokohan pada Teks Cerita Berbahasa Bali Menggunakan Pendekatan Berbasis Aturan dan Supervised Machine Learning. Masters thesis, Institut Teknologi Sepuluh Nopember.

6025221014-Master_Thesis.pdf - Accepted Version
Restricted to Repository staff only until 1 October 2026.

Abstract

Tokoh pada teks cerita merupakan entitas yang dapat menerima dan melakukan aksi. Tokoh merupakan komponen utama sebelum pengembangan elemen pada teks cerita lainnya dapat dibuat, seperti plot, alur, latar, dan penokohan. Pengarang dapat menggambarkan tokoh ke dalam peran tertentu, misalnya protagonis atau antagonis. Proses mendeteksi kemunculan tokoh pada suatu segmen teks disebut sebagai identifikasi tokoh. Mayoritas penelitian identifikasi tokoh hanya berfokus pada karya sastra barat umumnya berbahasa Inggris karena ditunjang dengan kakas bantu yang melimpah (high resource language). Padahal, terdapat banyak karya sastra yang tidak dibuat dengan bahasa Inggris, seperti teks cerita berbahasa Bali (Satua Bali), sebagai salah satu low resource language. Identifikasi tokoh dan penokohan pada Satua Bali berbasis komputasi secara otomatis belum dilakukan karena masih terdapat tantangan. Pertama, tokoh pada teks dapat muncul dengan beragam bentuk selain person proper noun, misalnya frasa nomina dan frasa pronomina. Kedua, tokoh pada teks cerita fiksi tidak hanya terdiri dari entitas manusia saja, melainkan entitas lainnya juga dapat menjadi tokoh, misalnya non-human animate entities, inanimate object, makhluk fantasi, dan entitas tokoh yang spesifik. Ketiga, pengarang inkonsisten dalam menyebut nama suatu tokoh dengan menggunakan berbagai alias (misalnya Mr. Sherlock Holmes, Sherlock, Mr. Holmes), sehingga tugas pengelompokkan nama alias (alias clustering) diperlukan. Oleh karena itu, identifikasi tokoh dengan menggunakan NER berbasis aturan yang hanya mengidentifikasi entitas PERSON saja sebagai tokoh tidak cukup. Keempat, narasi yang rumit dan penokohan yang kompleks seringkali ditampilkan pada Satua Bali, sehingga memperumit penokohan suatu tokoh. Kelima, minimnya kakas bantu dan belum ada model state-of-the-art untuk tugas identifikasi dan klasifikasi penokohan pada teks Satua Bali.
Penelitian ini mengembangkan kerangka kerja untuk proses identifikasi tokoh dan penokohan pada Satua Bali berbasis komputasi. Terdapat dua modul utama yang dikembangkan, yaitu: 1) modul identifikasi tokoh dan pengelompokkan nama alias; serta 2) modul klasifikasi penokohan tokoh. Penelitian ini menganotasi sebanyak 120 teks Satua Bali bernama “Balinese Story Texts Dataset” sebagai ground truth dalam pengembangan kerangka kerja dengan melibatkan validasi dari ahli. NER hybrid bernama SatuaNER berbasis Binary Particle Swarm Optimization – Conditional Random Fields (BPSO-CRF) diusulkan untuk bisa mengenali lima kategori nama entitas tokoh (CHAR NE). Setiap frasa yang dilabeli sebagai CHAR NE oleh SatuaNER selanjutnya diidentifikasi sebagai tokoh. Algoritma berbasis aturan digunakan untuk mengelompokkan setiap nama alias yang merujuk pada entitas tokoh yang sama ke dalam suatu kelompok tokoh dengan menggunakan lima metode pairwise distance string similarity, yaitu: Ratcliff-Obershelp Algorithm, Jaccard similarity, Sorensen-dice similarity, Jaro-distance similarity, dan Jaro-winkler. Klasifikasi penokohan dari setiap kelompok tokoh ke dalam protagonis atau antagonis dilakukan menggunakan konteks kalimat yang merujuk atau merepresentasikan semua nama alias pada kelompok tokoh tersebut. Fitur lexicon dan word vector kemudian diekstraksi dari konteks kalimat tersebut. Enam model supervised machine learning (ML), yaitu k-Nearest Neighbor (KNN), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), Gaussian Naïve Bayes (GNB), dan Multilayer Perceptron (MLP) dievaluasi sebagai model benchmark untuk mengklasifikasikan penokohan kelompok tokoh ke dalam protagonis atau antagonis. Performa model-model pada kedua modul dievaluasi dengan recall, precision, dan F1-score.
Performa model yang diusulkan dibandingkan dengan model baseline. Performa F1-score dari SatuaNER ketika dievaluasi dengan 10-fold cross-validation dan data uji masing-masing sebesar 95,4% dan 95,53%. Evaluasi pada data uji menunjukkan peningkatan performa SatuaNER sebesar 7,42% dan 6,01% terhadap Hidden Markov Model (HMM). Performa pretrained SatuaNER juga lebih baik apabila dibandingkan dengan NER berbasis aturan dalam mengidentifikasi tokoh pada level kalimat. Pretrained SatuaNER menghasilkan performa rata-rata F1-score sebesar 95,08% dan 96,04%, sedangkan NER berbasis aturan hanya 63,79% ketika dievaluasi dengan 10-fold cross-validation. Terdapat peningkatan performa sebesar 27,63%. Model alias clustering yang diterapkan terhadap hasil identifikasi tokoh pada SatuaNER dengan sorensen-dice similarity dan jaro-distance similarity memberikan performa rata-rata F1-score sebesar 70,69% dan 71,72%. Terjadi peningkatan performa rata-rata F1-score tertinggi dari model alias clustering yang diusulkan terhadap baseline sebesar 26,91% dan 27,77% ketika dievaluasi pada 120 teks cerita. Evaluasi performa dari enam model supervised ML untuk klasifikasi penokohan menunjukkan bahwa peningkatan performa rata-rata F1-score tertinggi dihasilkan model SVM dari yang awalnya (baseline) sebesar 52,92% menjadi 68,45% dengan menggunakan fitur FastText word embedding dan menerapkan tahapan coreference resolution terhadap konteks kalimat kelompok tokoh. Performa F1-score dari model supervised ML untuk klasifikasi penokohan lebih unggul apabila dibandingkan dengan performa F1-score dari model pengklasifikasi berbasis aturan sederhana yang menggunakan leksikon berbahasa Bali yang menghasilkan 51,48%.
===========================================================
Characters in story texts are entities that can perform and receive actions. Characters are the main component that must be established before other story elements, such as plot, storyline, setting, and characterization, can be developed. Authors can cast characters in particular roles, for example protagonist or antagonist. The process of detecting the occurrence of characters in a text segment is called character identification. Most character identification research focuses only on Western literary works, generally in English, because English is supported by abundant tools (a high-resource language). However, many literary works are not written in English, such as Balinese story texts (Satua Bali), Balinese being a low-resource language. Automatic computational identification of characters and characterization in Satua Bali has not yet been carried out because several challenges remain. First, characters can appear in the text in forms other than person proper nouns, for example noun phrases and pronoun phrases. Second, characters in fictional story texts are not limited to human entities; other entities can also be characters, for example non-human animate entities, inanimate objects, fantasy creatures, and specific character entities. Third, authors refer to a character inconsistently, using various aliases (for example Mr. Sherlock Holmes, Sherlock, Mr. Holmes), so grouping aliases (alias clustering) is necessary. Therefore, character identification using rule-based NER that identifies only PERSON entities as characters is not sufficient. Fourth, Satua Bali often present complicated narratives and complex characterizations, which makes determining a character's characterization harder. Fifth, tools are scarce, and there is no state-of-the-art model for character identification and characterization classification in Satua Bali texts.
This research develops a computational framework for character identification and characterization in Satua Bali. Two main modules are developed: 1) a character identification and alias clustering module; and 2) a character characterization classification module. This research annotates 120 Satua Bali texts, named the "Balinese Story Texts Dataset", as ground truth for developing the framework, with validation by experts. A hybrid NER called SatuaNER, based on Binary Particle Swarm Optimization with Conditional Random Fields (BPSO-CRF), is proposed to recognize five categories of character named entities (CHAR NE). Each phrase labeled as CHAR NE by SatuaNER is then identified as a character. A rule-based algorithm is used to group the aliases that refer to the same character entity into a character group using five pairwise distance string similarity methods: the Ratcliff-Obershelp algorithm, Jaccard similarity, Sorensen-Dice similarity, Jaro-distance similarity, and Jaro-Winkler. The characterization of each character group is classified as protagonist or antagonist using the context sentences that refer to or represent all aliases in that character group. Lexicon and word vector features are then extracted from these context sentences. Six supervised machine learning (ML) models, namely k-Nearest Neighbor (KNN), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), Gaussian Naïve Bayes (GNB), and Multilayer Perceptron (MLP), are evaluated as benchmark models for classifying character groups as protagonist or antagonist. The performance of the models in both modules is evaluated using recall, precision, and F1-score.
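As an illustration of alias clustering with pairwise string similarity, the following is a minimal sketch, not the thesis's actual algorithm. The character-bigram tokenization, the greedy grouping strategy, the 0.5 threshold, and all function names are assumptions made for this example. Python's `difflib.SequenceMatcher` is used as a stand-in for Ratcliff-Obershelp, which its documentation describes as a closely related algorithm.

```python
from difflib import SequenceMatcher


def bigrams(text):
    """Return the set of character bigrams of a lowercased string."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard similarity on character bigrams (tokenization is an assumption)."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def sorensen_dice(a, b):
    """Sorensen-Dice similarity on character bigrams (tokenization is an assumption)."""
    x, y = bigrams(a), bigrams(b)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0


def ratcliff_obershelp(a, b):
    """Approximate Ratcliff-Obershelp similarity via difflib's matching ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_aliases(names, similarity, threshold=0.5):
    """Greedy grouping: add a name to the first cluster that contains a
    sufficiently similar member, otherwise start a new cluster.
    The threshold value is hypothetical."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(similarity(name, member) >= threshold for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    aliases = ["Mr. Sherlock Holmes", "Sherlock", "Mr. Holmes", "Watson"]
    for group in cluster_aliases(aliases, sorensen_dice):
        print(group)
```

Passing a different function, such as `jaccard` or `ratcliff_obershelp`, as the `similarity` argument swaps the metric without changing the grouping logic; a suitable threshold would need to be tuned per metric.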
The performance of the proposed models is compared with baseline models. SatuaNER achieves F1-scores of 95.4% and 95.53% when evaluated with 10-fold cross-validation and on test data, respectively. Evaluation on the test data shows that SatuaNER improves on the Hidden Markov Model (HMM) by 7.42% and 6.01%. The pretrained SatuaNER also outperforms rule-based NER in identifying characters at the sentence level. Pretrained SatuaNER yields average F1-scores of 95.08% and 96.04%, whereas rule-based NER reaches only 63.79% when evaluated with 10-fold cross-validation. There is a performance increase of 27.63%. The alias clustering model applied to SatuaNER's character identification results with Sorensen-Dice similarity and Jaro-distance similarity gives average F1-scores of 70.69% and 71.72%. The highest increases in average F1-score of the proposed alias clustering model over the baseline are 26.91% and 27.77% when evaluated on 120 story texts. The evaluation of the six supervised ML models for characterization classification shows that the largest increase in average F1-score is achieved by the SVM model, from a baseline of 52.92% to 68.45%, using FastText word embedding features and applying a coreference resolution stage to the context sentences of the character groups. The supervised ML models for characterization classification also achieve a higher F1-score than a simple rule-based classifier that uses a Balinese lexicon, which yields 51.48%.
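For readers unfamiliar with the metrics reported above, the following minimal sketch shows how precision, recall, and F1-score follow from counts of true positives, false positives, and false negatives. The function name and the counts in the usage example are hypothetical and are not taken from the thesis.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from raw counts.

    tp: predictions that match the ground truth
    fp: predictions with no match in the ground truth
    fn: ground-truth items that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```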

Item Type: Thesis (Masters)
Uncontrolled Keywords: identifikasi tokoh, klasifikasi tokoh, dataset teks cerita berbahasa Bali, supervised machine learning, Hybrid Binary Particle Swarm Optimization – Conditional Random Fields (BPSO-CRF), alias clustering berbasis pairwise distance string similarity, leksikon sentimen berbahasa Bali (BaliSentiLex), characters identification, character classification, Balinese story text dataset, pairwise distance string similarity-based alias clustering, Balinese Sentiment Lexicon (BaliSentiLex)
Subjects: Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines.
Q Science > Q Science (General) > Q337.3 Swarm intelligence
Q Science > QA Mathematics > QA76.9.D343 Data mining. Querying (Computer science)
Q Science > QA Mathematics > QA9.58 Algorithms
T Technology > T Technology (General) > T57.5 Data Processing
T Technology > T Technology (General) > T57.8 Nonlinear programming. Support vector machine. Wavelets. Hidden Markov models.
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55101-(S2) Master Thesis
Depositing User: I Made Satria Bimantara
Date Deposited: 01 Aug 2024 04:14
Last Modified: 01 Aug 2024 04:14
URI: http://repository.its.ac.id/id/eprint/111416
