Ekstraksi Sumber Informasi dalam Artikel Berita Berbahasa Indonesia [Information Source Extraction in Indonesian-Language News Articles]

ANSYAH, ADI SURYA SUWARDI (2025) Ekstraksi Sumber Informasi dalam Artikel Berita Berbahasa Indonesia. Masters thesis, Institut Teknologi Sepuluh Nopember.


Abstract

Native advertising is often difficult to identify because it closely resembles ordinary news articles. One key indicator is the absence of information sources, or reliance on a single source perspective. A model is therefore needed that can extract information sources from news articles using Named Entity Recognition (NER). While NER can extract information sources, it cannot yet recognize identical entities written in different forms, such as aliases, abbreviations, or typographical errors. This study proposes a method for extracting unique information sources by integrating NER based on the XLNet+BiLSTM+CRF architecture with identical-entity classification using a Convolutional Neural Network (CNN). The integration begins by extracting information sources with NER, then classifies pairs of extracted entities to determine whether they are identical. This classification uses several features, including string similarity, word segmentation, and semantic and contextual similarity. The results of the identical-entity classification are then used to group the unique information sources.
The proposed model proved effective in extracting unique information sources. In testing, NER alone produced 3,029 information source entities; after identical-entity classification, this number fell to 1,082, close to the ground truth of 1,137 entities. NER with the XLNet+BiLSTM+CRF architecture achieved an F1-score of 92.84% for information source extraction, higher than the BERT Linear baseline (91.74%). The F1-score for identical-entity identification also reached 92.84%, outperforming a model that uses only XLNet features (89%). Integrating the two models yielded an average unique information source coverage of 97.40%. These results indicate that the proposed model is highly promising as an additional feature for native advertising detection, improving media transparency and offering readers better protection.
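The grouping step described in the abstract (pairwise identical-entity decisions, then clustering into unique sources) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `looks_identical` is a hypothetical stand-in that uses a string-similarity ratio and an acronym check in place of the thesis's CNN classifier, the 0.85 threshold is arbitrary, and union-find is one plausible way to turn pairwise decisions into groups, not necessarily the method used in the thesis.

```python
from difflib import SequenceMatcher
from itertools import combinations


def is_acronym(short: str, long: str) -> bool:
    """Return True if `short` equals the initials of the words in `long`."""
    initials = "".join(word[0] for word in long.split())
    return len(short) > 1 and short.lower() == initials.lower()


def looks_identical(a: str, b: str, threshold: float = 0.85) -> bool:
    """Hypothetical stand-in for the CNN pair classifier (assumption)."""
    if is_acronym(a, b) or is_acronym(b, a):
        return True
    # Character-level similarity catches typos and minor spelling variants.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def group_unique_sources(mentions, same=looks_identical):
    """Cluster mentions with union-find, driven by pairwise decisions."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if same(mentions[i], mentions[j]):
            parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for idx, mention in enumerate(mentions):
        groups.setdefault(find(idx), []).append(mention)
    return list(groups.values())


# Illustrative mentions, as an NER step might return them (made-up input).
mentions = [
    "Komisi Pemberantasan Korupsi",
    "KPK",
    "Komisi Pemberantasan Korupsii",
    "Joko Widodo",
    "Bank Indonesia",
]
groups = group_unique_sources(mentions)
```

In this sketch the three spellings of the same institution fall into one group, so five extracted mentions reduce to three unique sources, which mirrors the kind of reduction the abstract reports between raw NER output and unique sources. Swapping `same` for a trained classifier's prediction function would leave the grouping logic unchanged.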

Item Type:	Thesis (Masters)
Uncontrolled Keywords:	Source Information, NER, Identical-Entity Detection, CNN
Subjects:	T Technology > T Technology (General) > T57.5 Data Processing
Divisions:	Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55101-(S2) Master Thesis
Depositing User:	Adi Surya Suwardi Ansyah
Date Deposited:	04 Aug 2025 11:52
Last Modified:	02 Jul 2026 08:38
URI:	http://repository.its.ac.id/id/eprint/125195
