DETEKSI TOPIK VIRAL PADA WEBSITE MEDIA BERITA ONLINE MENGGUNAKAN ALGORITMA REFINED TERM FREQUENCY INVERSE DOCUMENT FREQUENCY DAN LATENT DIRICHLET ALLOCATION

Sierra, Evelyn (2023) DETEKSI TOPIK VIRAL PADA WEBSITE MEDIA BERITA ONLINE MENGGUNAKAN ALGORITMA REFINED TERM FREQUENCY INVERSE DOCUMENT FREQUENCY DAN LATENT DIRICHLET ALLOCATION. Other thesis, Institut Teknologi Sepuluh Nopember.

05111940000111-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only until 3 October 2025.

Abstract

The growing volume of data produced by online media, particularly around viral news, makes extracting relevant knowledge a complex challenge. One method for detecting topics in news data is keyword extraction with Term Frequency-Inverse Document Frequency (TF-IDF), which computes word weights in information retrieval. However, TF-IDF is limited in how accurately it can identify viral topics because it does not account for time or reader attention. This research therefore uses Refined TF-IDF (TA TF-IDF), which incorporates time and attention factors.

The system consists of four stages. In the data collection stage, news data from the websites and Instagram social media of the online media outlets detik.com and cnnindonesia.com are gathered with the Scrapy library. The collected data then go through data cleaning, including POS tagging, followed by keyword extraction using TF-IDF and then Refined TF-IDF (TA TF-IDF). Finally, topics are detected through topic modeling with KMeans and Latent Dirichlet Allocation (LDA). The experiments in this Final Project follow the same order: a comparison of news sourced from news websites against news sourced from Instagram, a comparison of cleaning results with and without POS tagging, and a comparison of keyword extraction with TA TF-IDF against TF-IDF. The news filtered by the extracted keywords is then modeled with KMeans and LDA. Analysis and validation against Google Trends 2022 indicate that the topics extracted with TA TF-IDF are twice as accurate as those extracted with TF-IDF.
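
The abstract does not give the TA TF-IDF formula, so the following is only a minimal Python sketch of the general idea: compute a plain TF-IDF weight per term, then scale each article's weights by a time factor and an attention factor. The function names (`tf_idf`, `ta_weighted`), the exponential half-life decay, the log-damped attention count, and the sample articles are all assumptions made for illustration, not the method used in the thesis.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def ta_weighted(weights, ages_days, attention, half_life_days=7.0):
    """Scale TF-IDF weights by time and attention.

    ASSUMPTION: this scaling is hypothetical and is not the thesis formula.
    ages_days: age of each article in days; attention: e.g. likes or views.
    """
    scaled = []
    for doc_weights, age, att in zip(weights, ages_days, attention):
        time_factor = 0.5 ** (age / half_life_days)  # newer articles stay near 1
        attention_factor = math.log1p(att)           # dampen very large counts
        factor = time_factor * attention_factor
        scaled.append({t: w * factor for t, w in doc_weights.items()})
    return scaled


# Hypothetical tokenized articles (already cleaned).
docs = [
    ["banjir", "jakarta", "hujan"],
    ["banjir", "bandung"],
    ["pemilu", "jakarta", "kandidat"],
]
base = tf_idf(docs)
refined = ta_weighted(base, ages_days=[0, 14, 7], attention=[1000, 10, 100])
```

Under these assumptions, a term from a fresh, widely read article ends up with a larger weight than the same TF-IDF score from an old article with little engagement, which is the intuition behind adding time and attention to the ranking of candidate viral keywords.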

Item Type: Thesis (Other)
Uncontrolled Keywords: News Website, Viral Topic, Hot Topic, Refined Term Frequency Inverse Document Frequency, Latent Dirichlet Allocation
Subjects: Q Science > Q Science (General) > Q325.5 Machine learning.
T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Evelyn Sierra
Date Deposited: 31 Jul 2023 06:31
Last Modified: 31 Jul 2023 06:31
URI: http://repository.its.ac.id/id/eprint/101116
