Mahmudin, Riyan (2026) Generasi Caption Citra Makanan Multi-Gaya Melalui Integrasi Style Adapter Dan Contrastive Learning Pada InstructBLIP. Masters thesis, Institut Teknologi Sepuluh Nopember.
Text: 6025231060_Master_Thesis.pdf - Accepted Version (5MB). Restricted to Repository staff only.
Abstract
Food image captioning requires generating descriptions that vary in style, from brief summaries to detailed aesthetic analyses, while maintaining semantic accuracy. InstructBLIP has demonstrated strong visual instruction-following ability through its instruction-aware Q-Former mechanism. However, the Q-Former applies a single global adaptation function, which opens an opportunity to explore new approaches to the diversity of visual emphasis demanded by different instruction styles. This research proposes an InstructBLIP architecture integrated with two synergistic components: Parallel Style Adapters, which introduce style-specific feature transformations, and a Style-Conditioned Contrastive Loss, which regularizes adapter outputs to prevent representation drift. A key finding of this study is architectural complementarity: adapters alone degrade performance due to representation instability, while the contrastive loss alone conflicts with the generative objective. Combined, however, the two achieve metric improvements (CIDEr: 181.4 vs. the 179.9 baseline, +0.8%; SPICE: +1.9%), with more substantial gains for structured caption styles (Ingredient: +3.1% CIDEr). Human evaluation of the proposed model's caption quality is consistent with the automatic metrics for the Short and Dishware styles: the automatic metrics improve (Short: +0.5% CIDEr, +1.3% SPICE; Dishware: +0.4% CIDEr, +1.8% SPICE) and human evaluation agrees (Short: +1.1%; Dishware: +2.5%). For the other three styles, however, there is a significant inconsistency: automatic metrics show improvements (Ingredient: +0.5% CIDEr, +1.3% SPICE; Detail: -0.6% CIDEr, +2% SPICE; Detail+Aesthetic: +0.7% CIDEr, +1.8% SPICE), while human evaluation shows declines (Ingredient: -1.7%; Detail: -5.8%; Detail+Aesthetic: -2.1%). This inconsistency reflects a fundamental difference between the two evaluation approaches.
Automatic metrics such as CIDEr and SPICE measure the similarity between a model-generated caption and the ground truth, while human evaluation assesses a different aspect: the model's ability to follow instructions in the expected style, regardless of how closely the output matches the reference caption.
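The two components named in the abstract can be sketched in isolation. The following is a minimal NumPy illustration, not the thesis code: all names, dimensions, and the exact loss formulation are assumptions. It shows (a) a bottleneck-style adapter with a residual connection, of which one instance per caption style would run in parallel, and (b) an InfoNCE-style contrastive loss that pulls adapter outputs of the same style together and pushes different styles apart.

```python
import numpy as np

def style_adapter(features, W_down, W_up):
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project,
    then a residual add. One (W_down, W_up) pair per caption style."""
    h = np.maximum(features @ W_down, 0.0)  # down-projection + ReLU
    return features + h @ W_up              # up-projection + residual

def style_contrastive_loss(z, styles, tau=0.1):
    """InfoNCE-style style-conditioned contrastive loss (a common
    formulation; the thesis may differ in detail). Same-style pairs
    are positives, all other samples are negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / tau                               # scaled cosine sims
    n = len(styles)
    loss, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and styles[j] == styles[i]]
        if not positives:
            continue
        denom = sum(np.exp(sim[i, j]) for j in range(n) if j != i)
        loss += -np.mean([np.log(np.exp(sim[i, j]) / denom) for j in positives])
        count += 1
    return loss / max(count, 1)

# Toy usage with hypothetical sizes: 4 samples, feature dim 8, bottleneck 4.
rng = np.random.default_rng(0)
d, bottleneck = 8, 4
W_down = rng.normal(size=(d, bottleneck)) * 0.1
W_up = rng.normal(size=(bottleneck, d)) * 0.1
feats = rng.normal(size=(4, d))
adapted = style_adapter(feats, W_down, W_up)
loss = style_contrastive_loss(adapted, ["short", "short", "detail", "detail"])
print(adapted.shape, float(loss))
```

In training, this contrastive term would be added (with a weighting coefficient) to the language-modeling objective, which is what the abstract means by the loss "regularizing" the adapter outputs rather than replacing the generative objective.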
| Item Type: | Thesis (Masters) |
|---|---|
| Uncontrolled Keywords: | Image Captioning, Large Language Model, LAVIS Library, Visual Instruction Tuning |
| Subjects: | Q Science > QA Mathematics > QA76.6 Computer programming. |
| Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55101-(S2) Master Thesis |
| Depositing User: | Riyan Mahmudin |
| Date Deposited: | 29 Jan 2026 08:03 |
| Last Modified: | 29 Jan 2026 08:03 |
| URI: | http://repository.its.ac.id/id/eprint/131069 |
