Fitriatmo, Galih (2025) Analisis Perbandingan Feature-based dan Image-based untuk Deteksi Deepfake Speech Menggunakan Teknik Machine Learning dan Deep Learning. Other thesis, Institut Teknologi Sepuluh Nopember.
5003211087-Undergraduate_Thesis.pdf - Accepted Version (6MB). Restricted to Repository staff only.
Abstract
The advancement of artificial intelligence (AI) technology has led to various innovations, one of which is deepfake speech, which uses deep learning algorithms such as Text-to-Speech (TTS) and Voice Conversion (VC) to generate synthetic speech that is nearly identical to a natural human voice. This technology has positive impacts in various fields, but it also poses significant risks, including voice-based fraud and the spread of misinformation. Effective methods for detecting fake speech are therefore needed to mitigate potential misuse of the technology. This study compares two classification approaches for detecting deepfake speech: feature-based classifiers and image-based classifiers. The feature-based approach uses numerical features, namely Chroma (1-12), Mel-Frequency Cepstral Coefficients (1-20), spectral centroid, spectral spread, spectral rolloff, zero crossing rate, and root mean square energy, which are classified using Support Vector Machine (SVM) and Random Forest (RF) models. The image-based approach instead uses Mel-spectrogram features, which are classified using CNN and CNN-LSTM models. The results show that the CNN model achieves the best classification performance, with an accuracy of 91.43% and a spoofed recall of 88.56%, at a training time of 1,243.26 seconds, a feature extraction time of 0.0082 seconds per file, and a prediction time of 0.1148 seconds per instance. The CNN-LSTM model follows with an accuracy of 89.76%, a spoofed recall of 85.13%, a training time of 1,524.06 seconds, a feature extraction time of 0.0081 seconds per file, and a prediction time of 0.1184 seconds per instance.
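The numerical features named above are standard audio descriptors. As a purely illustrative sketch (not the thesis's own code, which is not available here), three of them can be computed from a raw waveform with NumPy alone; the sample rate and test tone below are arbitrary assumptions:

```python
import numpy as np

def zero_crossing_rate(x):
    # Fraction of consecutive sample pairs whose signs differ.
    return np.mean(np.abs(np.diff(np.sign(x))) > 0)

def rms_energy(x):
    # Root mean square of the waveform amplitudes.
    return np.sqrt(np.mean(x ** 2))

def spectral_centroid(x, sr):
    # Magnitude-weighted mean frequency of the spectrum.
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

if __name__ == "__main__":
    sr = 16000                                 # assumed sample rate
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 440.0 * t)       # 1 s test tone at 440 Hz
    print(spectral_centroid(tone, sr))         # close to 440 Hz
```

In the feature-based pipeline, such per-file numbers (together with chroma and MFCC vectors) would be stacked into a feature matrix and passed to an SVM or Random Forest classifier.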
In contrast, the SVM model demonstrates the best computational efficiency, with the shortest training time of only 2.1486 seconds, a feature extraction time of 0.0285 seconds per file, and the fastest prediction time of 0.0010 seconds per instance, although its accuracy (67.66%) and spoofed recall (53.51%) are lower. The Random Forest model achieves an accuracy of 80.85% and a spoofed recall of 74.91%, with a training time of 5.9914 seconds, a feature extraction time of 0.0285 seconds per file, and a prediction time of 0.0046 seconds per instance. Overall, the image-based classifiers, particularly the CNN model, are superior in classification accuracy and generalization capability, while the feature-based classifiers, especially SVM, offer better efficiency in terms of processing time.
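The Mel-spectrogram input used by the image-based classifiers can likewise be sketched without audio libraries. The following minimal version (again illustrative only; the frame size, hop length, and number of mel bands are assumptions, not the thesis's settings) frames the signal, applies a Hann window, takes the power spectrum, and projects it onto a triangular mel filterbank:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mel_spectrogram(x, sr, n_fft=512, hop=128, n_mels=40):
    # Frame, window, FFT, then project the power spectra onto the filterbank.
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    win = np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames * win, axis=1)) ** 2
    return mel_filterbank(n_mels, n_fft, sr) @ power.T  # (n_mels, n_frames)
```

The resulting two-dimensional array is what the CNN and CNN-LSTM models consume as an image-like input, typically after a log (dB) scaling step.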
| Item Type: | Thesis (Other) |
|---|---|
| Uncontrolled Keywords: | Deepfake speech, Deep learning, Feature-based classifier, Image-based classifier, Machine learning |
| Subjects: | Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines; Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science); T Technology > TA Engineering (General). Civil engineering (General) > TA1637 Image processing--Digital techniques. Image analysis--Data processing; T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK5102.9 Signal processing; T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK7882.S65 Automatic speech recognition |
| Divisions: | Faculty of Science and Data Analytics (SCIENTICS) > Statistics > 49201-(S1) Undergraduate Thesis |
| Depositing User: | Galih Fitriatmo |
| Date Deposited: | 31 Jul 2025 10:17 |
| Last Modified: | 31 Jul 2025 10:17 |
| URI: | http://repository.its.ac.id/id/eprint/124723 |