Edgar, Edrickson (2026) Pendekatan Multi-Dimensi Berbasis Formant Untuk Deteksi Ucapan Hasil Generasi AI [A Formant-Based Multi-Dimensional Approach for Detecting AI-Generated Speech]. Other thesis, Institut Teknologi Sepuluh Nopember.
Text: 5024221036-Undergraduate_Thesis.pdf - Accepted Version (2MB). Restricted to repository staff only.
Abstract
As high-quality text-to-speech (TTS) systems become increasingly accessible, verifying audio authenticity is a growing challenge across verification applications. Current deep learning approaches achieve high accuracy but lack interpretability for non-technical users. This work presents an exploratory, formant-centered, multi-dimensional framework that combines geometric vowel space analysis, spectral envelope characterization, and temporal derivative analysis, using manually validated thresholds rather than machine learning. We introduce novel acoustic signatures, including tau-shaped F1-F2 distributions and catastrophic F3-F4 spectral spikes, that distinguish synthetic from authentic speech. Evaluation on 100 controlled samples (20 authentic from the TIMIT corpus, 80 synthetic from 4 commercial TTS systems), used for both threshold discovery and validation, reveals system-specific fingerprints with effect sizes exceeding 100× for key metrics. Our threshold-based scoring system, which combines 34 formant-derived features, enables visual evidence generation and TTS source attribution while maintaining relatively fast processing (15-30 seconds per sample on standard hardware), addressing the explainability gap in audio forensics. Results show 78.0% accuracy with 98.3% precision (a false positive rate of only 5%) on the development dataset and an area under the ROC curve of 0.9481. Performance varies across TTS systems (75-85% detection for XTTS variants, 65-70% for ElevenLabs and Cartesia). This proof of concept establishes the feasibility of formant-based detection through manual threshold tuning.
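The reported figures can be cross-checked with a little arithmetic. The sketch below is not taken from the thesis: it assumes synthetic speech is the positive class, and the confusion-matrix counts are hypothetical values inferred from the abstract's percentages and its 20/80 sample split, not numbers the author reports. The function name `binary_metrics` is likewise illustrative.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute standard binary-classification metrics from raw counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


# ASSUMPTION: counts inferred from the abstract, not reported by the author.
# 80 synthetic samples -> tp + fn = 80; 20 authentic samples -> fp + tn = 20.
# A 5% false positive rate on 20 authentic samples implies fp = 1, tn = 19.
# 78% accuracy on 100 samples then implies tp = 78 - 19 = 59, so fn = 21.
metrics = binary_metrics(tp=59, fn=21, fp=1, tn=19)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, precision comes out at 59/60, which rounds to the quoted 98.3%, and the implied overall recall is 59/80 (about 74%), which falls inside the 65-85% per-system detection range given in the abstract.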
| Item Type: | Thesis (Other) |
|---|---|
| Uncontrolled Keywords: | synthetic speech detection, formant analysis, text-to-speech, explainable audio forensics, deepfake detection, acoustic feature extraction. |
| Subjects: | T Technology > TK Electrical engineering. Electronics. Nuclear engineering > TK5102.9 Signal processing. |
| Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Computer Engineering > 90243-(S1) Undergraduate Thesis |
| Depositing User: | Edrickson Edgar |
| Date Deposited: | 22 Jan 2026 02:08 |
| Last Modified: | 22 Jan 2026 02:08 |
| URI: | http://repository.its.ac.id/id/eprint/130011 |
