Kusnanti, Eka Alifia (2025) Tune-A-Video Dengan Dual Attention Pada Aplikasi Text-to-Video Untuk Semantic Editing. Masters thesis, Institut Teknologi Sepuluh Nopember.
Text
6025231021-Master_Thesis.pdf - Accepted Version (6MB). Restricted to Repository staff only until 1 April 2027.
Abstract
Generative AI has gained significant attention in the computer vision community, particularly with advances in diffusion-based algorithms for image generation. One key achievement is the ability to transform static images into dynamic video content, enabling richer visual expression by converting artwork, photos, or other image material into video. However, this technology faces challenges in maintaining temporal consistency and semantic alignment, especially in videos with many objects or occlusions. To address these issues, a Dual Attention architecture is applied, which facilitates deeper interaction between semantic features and the noise representation and enables more effective integration of features from different domains. As a result, the feature representation becomes more specific, semantic consistency is maintained throughout the video, and the video output stays aligned with the text. Experiments demonstrate that the Dual Attention method produces superior video quality compared to previous methods, achieving a frame consistency score of 94.81 and a textual alignment score of 32.38. These scores indicate significant improvements in visual quality and alignment with text descriptions, particularly compared to the baseline method, Tune-A-Video, which achieved a frame consistency of 92.40 and a textual alignment of 27.58. Additionally, the Dual Attention method enhances visual detail, provides smoother frame-to-frame transitions, and improves alignment between video elements and text relative to other methods such as Pix2Video (frame consistency 94.40, textual alignment 29.00) and Video-P2P (frame consistency 94.37, textual alignment 24.72). Overall, the approach shows advantages in both quantitative metrics and perceived visual quality.
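Metrics like the frame consistency and textual alignment scores quoted above are commonly computed as cosine similarities over CLIP embeddings: frame consistency as the mean similarity between consecutive frame embeddings, and textual alignment as the mean similarity between each frame embedding and the prompt embedding. The thesis does not specify its exact protocol, so the following is only a minimal sketch under that common convention, with random placeholder vectors standing in for real CLIP features:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def frame_consistency(frame_embs):
    # mean cosine similarity between consecutive frame embeddings
    sims = [cosine(frame_embs[i], frame_embs[i + 1])
            for i in range(len(frame_embs) - 1)]
    return sum(sims) / len(sims)

def textual_alignment(frame_embs, text_emb):
    # mean cosine similarity between each frame embedding and the prompt embedding
    sims = [cosine(f, text_emb) for f in frame_embs]
    return sum(sims) / len(sims)

# placeholder vectors standing in for CLIP image/text features
rng = np.random.default_rng(0)
frames = [rng.normal(size=512) for _ in range(8)]
text = rng.normal(size=512)

fc = frame_consistency(frames)
ta = textual_alignment(frames, text)
```

Published scores are usually these similarities scaled by 100, which matches the magnitude of the numbers reported above.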
Item Type: | Thesis (Masters) |
---|---|
Uncontrolled Keywords: | Dual Attention, Diffusion Models, Semantic Editing, Text-to-Video, Tune-A-Video |
Subjects: | Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines; Q Science > QA Mathematics > QA336 Artificial Intelligence; Q Science > QA Mathematics > QA76.6 Computer programming; Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science); T Technology > T Technology (General) > T385 Visualization--Technique; T Technology > T Technology (General) > T57.5 Data Processing |
Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55101-(S2) Master Thesis |
Depositing User: | Eka Alifia Kusnanti |
Date Deposited: | 02 Feb 2025 13:51 |
Last Modified: | 02 Feb 2025 13:51 |
URI: | http://repository.its.ac.id/id/eprint/117543 |