Performance Analysis of Oversampling and Ensemble Learning Methods on Imbalanced Data

Niza, Yulia (2022) Analisis Kinerja Metode Oversampling dan Ensemble Learning pada Data Tidak Seimbang. Other thesis, Institut Teknologi Sepuluh Nopember.

05111840000053-Undergraduate_Thesis.pdf - Published Version
Restricted to Repository staff only until 1 October 2024.

Abstract

Imbalanced (unbalanced) data is a condition in which the number of samples differs substantially between the majority class and the minority class. The main problem is that the minority class often carries the information that matters most for prediction, yet it tends to be misclassified as the majority class. Handling imbalanced data is important because, in many real-world classification tasks, a misclassification has serious consequences. One way to address the problem is oversampling. The choice of classification method can also affect the resulting performance. For this reason, several single-learning and ensemble-learning classifiers are used to compare the performance of the resulting methods.
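
To make the idea of oversampling concrete, the following is a minimal sketch of random oversampling, which the abbreviation ROS later in this abstract presumably denotes. It is not taken from the thesis: the function name `random_oversample`, the `seed` parameter, and the toy data are illustrative assumptions, and only the Python standard library is used.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class has as many samples as the largest class.

    Hypothetical helper for illustration; not the thesis implementation.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the majority class

    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # All original samples that belong to this class.
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Toy data (assumed): four majority samples, one minority sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 4 samples
```

Oversampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data. SMOTE and Borderline-SMOTE differ in that they synthesize new minority samples by interpolating between neighbours instead of duplicating existing ones.
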
This study analyzes the performance of oversampling and ensemble learning methods based on an evaluation of the resulting models. The results show that oversampling can improve classification of the minority class, and that ensemble learning performs fairly well on some of the tested datasets. Evaluation is based on the confusion matrix. On the binary datasets, performance is measured by recall: the highest recall is 0.6883 on the Haberman's Survival dataset (KNN classifier with Borderline-SMOTE oversampling), 0.8391 on the COVID-19 dataset (KNN with SMOTE), and 0.9476 on the Credit Card Fraud dataset (XGBoost with ROS). On the multiclass Car Evaluation dataset, performance is measured by F1-score, and the highest value is 0.9989, obtained with XGBoost and Borderline-SMOTE. The performance of each oversampling and ensemble learning method depends on the characteristics of the individual dataset.
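
The recall and F1-score figures above are derived from confusion-matrix counts. As a hedged illustration of the standard definitions (recall = TP / (TP + FN), precision = TP / (TP + FP), F1 = 2 · precision · recall / (precision + recall)), the sketch below computes them for one positive class. The function names and the toy labels are assumptions for illustration, and the abstract does not state how per-class scores were averaged for the multiclass dataset.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the minority class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall_precision_f1(y_true, y_pred, positive=1):
    """Compute recall, precision and F1 for the `positive` class.

    A score is reported as 0.0 when its denominator is zero.
    """
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Toy labels (assumed): 4 minority (1) and 6 majority (0) samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

recall, precision, f1 = recall_precision_f1(y_true, y_pred)
print(recall, precision, f1)
```

Recall is a natural choice for the binary datasets because it measures how many true minority samples are found, which is the failure mode the abstract describes: minority samples being labelled as majority.
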

Item Type: Thesis (Other)
Additional Information: RSIf 006.31 Niz a-1
Uncontrolled Keywords: Classification, Ensemble Learning, Oversampling, Unbalanced Data
Subjects: Q Science > Q Science (General) > Q325.5 Machine learning.
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: - Davi Wah
Date Deposited: 18 Apr 2023 02:34
Last Modified: 18 Apr 2023 02:34
URI: http://repository.its.ac.id/id/eprint/97880
