Generating Food Image Descriptions Based on Aesthetic Aspects Using an Encoder-Decoder CNN-LSTM Model

Santosa, Arif Nugraha (2025) Pembangkitan Deskripsi Citra Makanan Berdasarkan Aspek Estetika Menggunakan Encoder-Decoder CNN-LSTM Model. Other thesis, Institut Teknologi Sepuluh Nopember.

Text: 5025211048-Undergraduate_Thesis.pdf - Accepted Version (9MB)

Abstract

This study develops an Image Aesthetic Captioning system that generates aesthetic descriptions of food images. The system has two main stages: Single-Aspect Captioning, based on a CNN-LSTM architecture, which generates captions for six visual aesthetic aspects; and Unsupervised Text Summarization, which uses a Denoising Autoencoder (DAE) model to merge these captions into a single coherent narrative. The CNN-LSTM model was pretrained on the MS COCO 2017 dataset filtered to the food domain. The fine-tuning dataset was collected from a photography community platform and processed through text cleaning, informativeness measurement using an informativeness score inspired by TF-IDF, and topic mining using LDA, followed by manual aspect mapping of the raw captions. Evaluation results show that the Single-Aspect Captioning model achieved its highest scores on the general impression aspect (BLEU-1: 0.2768), the composition aspect (BLEU-1: 0.1993, METEOR: 0.0655), and the color & light aspect (BLEU-1: 0.1793, METEOR: 0.0639). The DAE model composed the final narrative with a BLEU-1 score of 0.4167 and a ROUGE-L score of 0.2244. To improve coherence, a retouch step was applied using an instruction-tuned Large Language Model (LLM) with structured prompts. The retouched captions improved on most metrics, with BLEU-1 rising to 0.4323, ROUGE-L to 0.2431, and METEOR reaching 0.1314. Although the CIDEr score decreased slightly, the retouch approach proved effective in producing aesthetic captions that are informative, well-structured, and representative of all visual aspects.
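For readers unfamiliar with the BLEU-1 figures reported above, the following is a minimal sketch of how a sentence-level BLEU-1 score is commonly computed: clipped unigram precision multiplied by a brevity penalty. It is not the thesis's evaluation code. The function name `bleu1`, the whitespace tokenization, and the lowercasing are assumptions made for illustration, and the example sentences are invented.

```python
import math
from collections import Counter


def bleu1(candidate: str, references: list[str]) -> float:
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    Assumption: tokens are lowercase whitespace-separated words. A real
    evaluation toolkit may tokenize and aggregate differently.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # Clip each candidate word count by its maximum count in any one reference.
    cand_counts = Counter(cand)
    max_ref_counts = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty uses the reference length closest to the candidate
    # length (ties resolved toward the shorter reference).
    ref_len = min((len(r) for r in refs),
                  key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


if __name__ == "__main__":
    # Hypothetical caption pair, not taken from the thesis dataset.
    print(bleu1("the colors are vibrant", ["the colors are very vibrant"]))
```

Scores reported for a whole test set are usually computed at corpus level, pooling counts over all captions, so averaging this per-sentence value would not necessarily reproduce the numbers in the abstract.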

Item Type: Thesis (Other)
Uncontrolled Keywords: Image Aesthetic Captioning, CNN-LSTM, Denoising Autoencoder, Large Language Model, Food Image, Aesthetic Description, Aesthetic Captioning.
Subjects: N Fine Arts > N Visual arts (General) For photography, see TR
Q Science > QA Mathematics > QA336 Artificial Intelligence
Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science)
T Technology > T Technology (General) > T385 Visualization--Technique
T Technology > T Technology (General) > T57.5 Data Processing
T Technology > TR Photography
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Arif Nugraha Santosa
Date Deposited: 28 Jul 2025 08:05
Last Modified: 28 Jul 2025 08:05
URI: http://repository.its.ac.id/id/eprint/122201
