Nurhidayah, Fadilla Rizky (2025) Eksperimen Text-to-Video Generation menggunakan Model Wan2.1-T2V-14B dengan Fine-tuning. Project Report. [s.n.], [s.l.]. (Unpublished)
Full text: 5025211110-Project_Report.pdf (Accepted Version, 962 kB). Restricted to repository staff only.
Abstract
Text-to-video generation is a generative technology that enables the creation of high-quality videos from textual descriptions using diffusion models. This internship project explores the application of LoRA (Low-Rank Adaptation) fine-tuning to the Wan2.1-T2V-14B model in order to generate videos with domain-specific styles. Experiments compared the baseline model against a fine-tuned model trained on image datasets categorized into three domains: Old Book Illustrations, Natural Scenes, and Artistic Styles. The methodology uses the diffusion-pipe framework to implement LoRA fine-tuning and ComfyUI for inference and testing. Evaluation combines quantitative metrics, including CLIP Score for semantic alignment and Temporal Consistency for visual stability across frames, with qualitative assessments of style accuracy and overall visual quality. The experimental results show that LoRA fine-tuning significantly improves performance over the baseline model, with a 53.6% increase in Style Accuracy, a 20.0% increase in CLIP Score, and a 3.6% improvement in Temporal Consistency. The generated videos successfully adopt domain-specific visual characteristics while maintaining temporal consistency and good motion quality. This study contributes to the advancement of text-to-video generation as a parameter-efficient solution for domain adaptation in creative content creation, education, and multimedia applications.
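The report's full evaluation pipeline is not reproduced here, but the two quantitative metrics named in the abstract are commonly computed as cosine similarities over embeddings: CLIP Score as the mean text-to-frame similarity, and Temporal Consistency as the mean similarity between consecutive frames. The sketch below illustrates this with plain NumPy vectors standing in for real CLIP embeddings; the function names and vector inputs are assumptions for illustration, not the report's actual implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_score(text_embedding, frame_embeddings):
    """Semantic alignment: mean text-frame cosine similarity over all frames."""
    return float(np.mean([cosine_sim(text_embedding, f) for f in frame_embeddings]))

def temporal_consistency(frame_embeddings):
    """Visual stability: mean cosine similarity between consecutive frames."""
    sims = [cosine_sim(frame_embeddings[i], frame_embeddings[i + 1])
            for i in range(len(frame_embeddings) - 1)]
    return float(np.mean(sims))
```

In practice the embeddings would come from a CLIP image/text encoder applied to each generated frame and to the prompt; a higher Temporal Consistency indicates less flicker between frames, which is why its improvement (3.6%) is smaller in magnitude than the style-related gains.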
| Field | Value |
|---|---|
| Item Type | Monograph (Project Report) |
| Uncontrolled Keywords | Diffusion Model, Fine-tuning, Generative AI, LoRA, Parameter-Efficient Learning, Text-to-Video Generation, Wan2.1-T2V-14B |
| Subjects | T Technology > T Technology (General) > T57.5 Data Processing |
| Divisions | Faculty of Information Technology > Informatics Engineering > 55201-(S1) Undergraduate Thesis |
| Depositing User | Fadilla Rizky Nurhidayah |
| Date Deposited | 16 Jul 2025 05:37 |
| Last Modified | 16 Jul 2025 05:37 |
| URI | http://repository.its.ac.id/id/eprint/119857 |