Natural Indonesian Speech Synthesis Using a Feed Forward Transformer to Improve the Naturalness of Synthesis

Arifianto, Muhammad (2023) Sintesis Suara Bahasa Indonesia Natural dengan Feed Forward Transformer untuk Meningkatkan Kualitas Alamiah Sintesis. Other thesis, Institut Teknologi Sepuluh Nopember.

02311940000026-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only until 1 October 2025.

Abstract

Speech synthesis has become an increasingly important research topic in speech technology and is already used in everyday applications such as Apple Siri and Google Voice Search. Neural network-based Text-to-Speech (TTS) technology now produces high-quality, natural-sounding voices. Research on Indonesian speech synthesis nevertheless remains limited, even though TTS plays an important role in human-computer interaction and the naturalness of the synthesized voice is a key factor in user acceptance. To improve the naturalness of synthesized speech, this study uses a Feed Forward Transformer approach: the existing FastPitch model with the WaveGlow vocoder serves as the base model, and it is updated to FastPitch with the HiFi-GAN vocoder. Two main models are developed, a single-speaker model and a multi-speaker model. The experimental results show that the single-speaker model achieves good natural voice quality, with a high Mean Opinion Score (MOS) and satisfactory objective evaluation results. The multi-speaker model, however, yields lower voice quality, with worse Mel Cepstral Distortion (MCD), Short-Time Objective Intelligibility (STOI), and MOS values than the base model. In conclusion, this research improves the naturalness of Indonesian speech synthesis for the single-speaker model by combining the Feed Forward Transformer approach with the HiFi-GAN vocoder, while multi-speaker modeling still needs further work to reach better quality. The findings have the potential to support the development of more advanced speech technology and applications in the future.
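
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how MCD and MOS are commonly computed. The abstract does not state how the thesis computed them, so everything here is an assumption: the function names are hypothetical, the MCD formula is the widely used definition in dB with the 0th (energy) coefficient skipped, and the reference and synthesized frames are assumed to be already time-aligned (for example by dynamic time warping).

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned mel-cepstral frames.

    Assumption: frames are already aligned one-to-one, and index 0 of each
    frame is the energy coefficient, which is excluded from the distance.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need the same non-zero number of frames in both inputs")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared_diff)
    return total / len(ref_frames)


def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings (typically on a 1-5 scale)."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Illustrative values only, not data from the thesis.
    reference = [[0.0, 1.0, 0.5], [0.0, 0.8, 0.4]]
    synthesized = [[0.0, 0.9, 0.5], [0.0, 0.8, 0.6]]
    print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.3f} dB")
    print(f"MOS: {mean_opinion_score([4, 5, 3, 4]):.2f}")
```

A lower MCD means the synthesized spectrum is closer to the reference, while a higher MOS means listeners rated the speech as more natural. STOI is omitted from this sketch because it requires a full signal-processing pipeline rather than a short formula.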

Item Type: Thesis (Other)
Uncontrolled Keywords: Speech synthesis, Bahasa Indonesia, Feed Forward Transformer, FastPitch, HiFi-GAN
Subjects: T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Industrial Technology and Systems Engineering (INDSYS) > Physics Engineering > 30201-(S1) Undergraduate Thesis
Depositing User: Muhammad Arifianto
Date Deposited: 27 Nov 2023 03:55
Last Modified: 27 Nov 2023 03:55
URI: http://repository.its.ac.id/id/eprint/104068
