Ekstraksi Sumber Informasi dalam Artikel Berita Berbahasa Indonesia (Information Source Extraction in Indonesian-Language News Articles)

ANSYAH, ADI SURYA SUWARDI (2025) Ekstraksi Sumber Informasi dalam Artikel Berita Berbahasa Indonesia. Masters thesis, Institut Teknologi Sepuluh Nopember.

Text
6025221037-Master_Thesis.pdf - Accepted Version
Restricted to Repository staff only

Abstract

Native advertising is often difficult to identify because it closely resembles ordinary news articles. One key indicator is the absence of information sources or the use of only a single perspective. A model capable of extracting information sources from news articles, such as Named Entity Recognition (NER), is therefore needed. While NER can extract information sources, it still struggles to recognize identical entities written in different forms, such as aliases, abbreviations, or typographical errors.
This study proposes a method for extracting unique information sources by integrating NER based on the XLNet+BiLSTM+CRF architecture with identical entity classification using a Convolutional Neural Network (CNN). The integration process begins by extracting information sources with NER and then classifies pairs of extracted entities to determine whether they are identical. This classification leverages several features, including string similarity, word segmentation, and semantic and contextual similarity. The identical entity classification results are then used to cluster the extracted entities into unique information sources.
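The last two steps of this pipeline (pairwise identical-entity classification followed by clustering) can be illustrated with a minimal sketch. It is not the thesis implementation: `is_identical` is a hypothetical stand-in for the CNN classifier that uses only string similarity and a simple acronym check, the 0.8 threshold is arbitrary, the mentions are made-up examples, and grouping with union-find is an assumption, since the abstract does not say how the clustering is performed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def acronym(text: str) -> str:
    """First letter of each whitespace-separated word, e.g. 'a b c' -> 'abc'."""
    return "".join(word[0] for word in text.split())


def is_identical(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical stand-in for the CNN pair classifier (string features only)."""
    a, b = a.lower().strip(), b.lower().strip()
    # Abbreviation case: one mention is the acronym of the other.
    if a == acronym(b) or b == acronym(a):
        return True
    # Alias / typo case: character-level similarity above an assumed threshold.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def cluster_sources(mentions: list[str]) -> list[list[str]]:
    """Group mentions into unique sources by merging every identical pair."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if is_identical(mentions[i], mentions[j]):
            parent[find(j)] = find(i)

    clusters: dict[int, list[str]] = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())


# Illustrative NER output containing a typo variant and an abbreviation.
mentions = [
    "Joko Widodo",
    "Joko Widdo",
    "Komisi Pemberantasan Korupsi",
    "KPK",
    "Sri Mulyani",
]
for group in cluster_sources(mentions):
    print(group)
```

Merging pairs transitively means that two mentions end up in the same cluster even when the classifier links them only through a third mention. That is convenient for chains of aliases, but it also lets a single false positive merge two otherwise distinct sources.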
The proposed model proved effective in extracting unique information sources. Testing showed that using NER alone produced 3,029 information source entities, but after identical entity classification, this number was reduced to 1,082, which closely matches the ground truth of 1,137. The NER architecture based on XLNet+BiLSTM+CRF achieved an F1-score of 92.84% for information source extraction, surpassing the BERT Linear baseline (91.74%). The F1-score for identical entity identification also reached 92.84%, outperforming the single-feature XLNet model (89%). The integration of both models resulted in an average unique information source coverage of 97.40%. These results indicate that the proposed model is highly promising as an additional feature for native advertising detection, thereby enhancing media transparency and reader protection.

Item Type:	Thesis (Masters)
Uncontrolled Keywords:	Source Information, NER, Identical-Entity Detection, CNN
Subjects:	T Technology > T Technology (General) > T57.5 Data Processing
Divisions:	Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55101-(S2) Master Thesis
Depositing User:	Adi Surya Suwardi Ansyah
Date Deposited:	04 Aug 2025 11:52
Last Modified:	04 Aug 2025 11:52
URI:	http://repository.its.ac.id/id/eprint/125195
