Pembacaan Gerak Bibir Menggunakan Fitur Sudut dengan Deep Learning

Aminah, Luthfia Cucu (2025) Pembacaan Gerak Bibir Menggunakan Fitur Sudut dengan Deep Learning. Masters thesis, Institut Teknologi Sepuluh Nopember.

6022232014-Master_Thesis.pdf - Accepted Version (53 MB)
Restricted to Repository staff only

Abstract

Lipreading is a technology that can recognize a person's speech without involving sound. Using artificial intelligence, it can be applied in various fields, such as security, communication in noisy environments, and assistive tools for the deaf. However, the implementation of lipreading still faces challenges, such as variations in lip shape and movement, orientation angles, lighting, and dataset limitations. Therefore, this research was conducted to investigate the effect of data augmentation (interpolated versus original data) on speech recognition accuracy, to determine the additional features that contribute most to improving recognition accuracy, and to identify the model architecture that yields the highest recognition accuracy. The methodology involves six main stages: (1) collecting video datasets and labeling them, (2) pre-processing the data through frame extraction, landmark annotation using MediaPipe, and angle calculation from the X and Y coordinates, (3) selecting data based on 300 original samples, plus 100 interpolated samples and 100 additional original samples, (4) selecting the best angle features based on vertical and horizontal mouth openings, (5) selecting deep learning model architectures ranging from CNN alone to combinations of CNN with LSTM, Bi-LSTM, GRU, and Bi-GRU, and (6) evaluating classification performance. The results show that data interpolation increased speech recognition accuracy to 79.36%, compared with only 76.07% when adding original data. For feature selection, the chosen facial regions and their reference points are: inner lip vertical (21 and 31), a combination of outer and inner lip vertical (0), upper right cheek (all points combined), right jaw (80), upper left cheek (all points combined), left jaw (74), inner lip horizontal (0), nose (41), chin (77), and the midpoint between the eyes (110).
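The angle features above are computed from the X and Y coordinates of MediaPipe face landmarks. The exact formula is not given in this abstract; a minimal sketch of one common approach, assuming each feature is the angle at a reference landmark between rays to two other landmarks (the function name `landmark_angle` and the example points are assumptions, not the thesis's actual definition):

```python
import numpy as np

def landmark_angle(p_ref, p_a, p_b):
    """Angle in degrees at p_ref between the rays toward p_a and p_b,
    where each point is an (x, y) landmark coordinate."""
    v1 = np.asarray(p_a, dtype=float) - np.asarray(p_ref, dtype=float)
    v2 = np.asarray(p_b, dtype=float) - np.asarray(p_ref, dtype=float)
    # Cosine of the angle between the two vectors, clipped to avoid
    # domain errors from floating-point round-off.
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

For instance, `landmark_angle((0, 0), (1, 0), (0, 1))` returns 90.0, since the rays along the x and y axes are perpendicular.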
The best architecture combined 2 CNN layers with 1 Bi-GRU layer, achieving the highest accuracy of 98.1%. This study shows that angle-based features with a CNN + Bi-GRU architecture can effectively improve the performance of lipreading systems.
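The abstract reports that adding interpolated samples improved accuracy more than adding further original samples, but does not specify the interpolation scheme. A minimal sketch, assuming new training sequences are synthesized by linearly blending two same-shaped feature sequences from the same class (the function `interpolate_sequences` and the blending factor `alpha` are illustrative assumptions):

```python
import numpy as np

def interpolate_sequences(seq_a, seq_b, alpha=0.5):
    """Synthesize a new (frames x features) sequence by linear
    interpolation between two same-shaped sequences: a convex
    blend (1 - alpha) * seq_a + alpha * seq_b."""
    seq_a = np.asarray(seq_a, dtype=float)
    seq_b = np.asarray(seq_b, dtype=float)
    if seq_a.shape != seq_b.shape:
        raise ValueError("sequences must have the same shape")
    return (1.0 - alpha) * seq_a + alpha * seq_b
```

Blending only within a class keeps the synthetic sample's label unambiguous, which is one plausible way such augmentation could preserve label consistency.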

Item Type: Thesis (Masters)
Uncontrolled Keywords: Angular Features, Deep Learning, Lip Reading, Fitur Sudut, Pembacaan Gerak Bibir
Subjects: Q Science > Q Science (General) > Q337.5 Pattern recognition systems
Q Science > QA Mathematics > QA336 Artificial Intelligence
Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science)
T Technology > T Technology (General)
T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7882.P3 Pattern recognition systems
T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7882.S65 Automatic speech recognition.
T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7895.S65 Speech recognition systems
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Electrical Engineering > 20101-(S2) Master Thesis
Depositing User: Luthfia Cucu Aminah
Date Deposited: 21 Jul 2025 02:58
Last Modified: 21 Jul 2025 02:58
URI: http://repository.its.ac.id/id/eprint/120182
