Pandya, Duevano Fairuz (2025) Pembangkitan Deskripsi Gambar Fashion Berbahasa Inggris Dengan Metode Bootstrapping Language-Image Pre-Training. Other thesis, Institut Teknologi Sepuluh Nopember.
Full text: 5025211052-Undergraduate_Thesis.pdf (Accepted Version, 9 MB). Restricted to Repository staff only.
Abstract
Automatic image captioning is the process of generating text from an image. This thesis addresses a limitation of the multimodal Bootstrapping Language-Image Pre-training (BLIP-2) model, which, despite being accurate after fine-tuning on the Fashion Captioning Dataset (FACAD), can only describe a single clothing item per image. To generate comprehensive descriptions, this study proposes a multi-stage workflow: first, isolating each clothing item with the Mask R-CNN segmentation model; second, generating an individual description for each item with BLIP-2; and finally, fusing these descriptions into a coherent, natural sentence using a Large Language Model (LLM). When evaluated, the proposed model achieved low scores on standard quantitative metrics (ROUGE, BLEU, CIDEr, and SPICE), which was traced not to the model's performance but to a fundamental misalignment with the test data, including a domain shift, vocabulary mismatches, and ground-truth annotations that were never designed for multi-object descriptions. Because the automated metrics proved unreliable, the primary evidence of effectiveness came from subjective evaluation, in which 76.9% of questionnaire responses favored the descriptions generated by the proposed model over both the ground truth and the baseline BLIP-2 model. This thesis demonstrates the effectiveness of the proposed method for generating comprehensive fashion descriptions and, more importantly, highlights the need for new evaluation benchmarks with consistent ground-truth annotations specifically designed to measure the quality of comprehensive, multi-object descriptions.
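To make the three-stage workflow in the abstract concrete, below is a minimal Python sketch under stated assumptions: it uses a COCO-pretrained Mask R-CNN from torchvision and the public Salesforce/blip2-opt-2.7b checkpoint, whereas the thesis uses a BLIP-2 model fine-tuned on FACAD, and the fusion prompt shown is illustrative rather than the author's. The confidence threshold and generation settings are likewise assumptions.

```python
# Hedged sketch of the segment -> caption -> fuse pipeline described in the abstract.
# Checkpoints and the fusion prompt are illustrative assumptions, not the thesis's exact setup.
import torch
from PIL import Image
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Stage 1: instance segmentation to isolate each clothing item
# (COCO-pretrained weights here; a clothing-specific model would be used in practice).
segmenter = maskrcnn_resnet50_fpn(weights="DEFAULT").eval().to(device)

# Stage 2: BLIP-2 captioner (the thesis fine-tunes this on FACAD; the public checkpoint is a stand-in).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

def describe_items(image_path: str, score_threshold: float = 0.7) -> list[str]:
    """Return one BLIP-2 caption per detected instance in the image."""
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        detections = segmenter([to_tensor(image).to(device)])[0]

    captions = []
    for box, score in zip(detections["boxes"], detections["scores"]):
        if score < score_threshold:
            continue
        # Crop the detected instance and caption it in isolation.
        crop = image.crop(tuple(box.int().tolist()))
        inputs = processor(images=crop, return_tensors="pt").to(device, dtype)
        with torch.no_grad():
            out = captioner.generate(**inputs, max_new_tokens=40)
        captions.append(processor.decode(out[0], skip_special_tokens=True).strip())
    return captions

# Stage 3: fuse the per-item captions into one coherent sentence with an LLM.
# The prompt wording is a hypothetical example; any instruction-tuned LLM could be substituted.
def build_fusion_prompt(captions: list[str]) -> str:
    items = "; ".join(captions)
    return (
        "Combine the following clothing-item descriptions into one natural, "
        f"coherent sentence describing the whole outfit: {items}"
    )
```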
| Item Type: | Thesis (Other) |
|---|---|
| Uncontrolled Keywords: | blip-2, deskripsi gambar fashion, deskripsi multi-objek, large language models, segmentasi objek individu, fashion image captioning, instance segmentation, multi-object captioning |
| Subjects: | Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science); T Technology > T Technology (General) > T57.5 Data Processing |
| Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis |
| Depositing User: | Duevano Fairuz Pandya |
| Date Deposited: | 28 Jul 2025 09:29 |
| Last Modified: | 28 Jul 2025 09:29 |
| URI: | http://repository.its.ac.id/id/eprint/122267 |