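Pembangkitan Caption Pada Citra Lingkungan Lalu Lintas Perkotaan Berbasis Object Relation Transformer Dan Beam Search Size 2 [Caption Generation for Urban Traffic Environment Images Based on Object Relation Transformer and Beam Search Size 2]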

Setiyono, Javier Nararya Aqsa (2025) Pembangkitan Caption Pada Citra Lingkungan Lalu Lintas Perkotaan Berbasis Object Relation Transformer Dan Beam Search Size 2. Other thesis, Institut Teknologi Sepuluh Nopember.

Text
5025211245-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only

Abstract

The rising number of traffic accidents in Indonesia has driven the development of technologies that support driving safety. One such technology is Advanced Driver Assistance Systems (ADAS), which base their decisions on visual attention. This study develops a traffic image captioning method that combines the Object Relation Transformer architecture with a Beam Search decoding strategy of size 2 to generate accurate, contextual descriptions of images. The strength of the Object Relation Transformer lies in its ability to capture spatial relationships among objects through a geometric attention mechanism, which strengthens semantic understanding of the image. Beam Search Size 2, in turn, keeps the two most probable partial sentences at each decoding step, which allows a broader exploration of sentence structures than greedy decoding.
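To make the decoding strategy concrete, the following is a minimal sketch of beam search with a beam width of 2. It is not the thesis implementation: the function name `beam_search`, the `step_fn` interface, the `<s>`/`</s>` tokens, and the toy probability table are all assumptions made for illustration, and no length normalization is applied.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_size=2, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(tokens) must return a dict mapping each possible next token
    to a strictly positive probability (hypothetical interface).
    """
    # Each hypothesis is (cumulative log-probability, token list).
    beams = [(0.0, [start_token])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the beam_size best hypotheses at this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[0])[1]


# Toy next-token table (made-up numbers, not model output).
TOY_MODEL = {
    ("<s>",): {"a": 0.6, "the": 0.4},
    ("<s>", "a"): {"car": 0.5, "bus": 0.5},
    ("<s>", "the"): {"car": 0.9, "bus": 0.1},
}


def toy_step(tokens):
    # Any prefix not in the table ends the caption.
    return TOY_MODEL.get(tuple(tokens), {"</s>": 1.0})


greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1)
beam2 = beam_search(toy_step, "<s>", "</s>", beam_size=2)
print(greedy, beam2)
```

On this toy table, width 1 (greedy decoding) commits to "a" and ends with a sequence of probability 0.30, while width 2 also keeps "the" alive and finds "the car" with probability 0.36. This is the kind of gain a wider beam can offer, at the cost of scoring more hypotheses per step.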
The traffic environment image data were sourced from the Berkeley DeepDrive dataset and manually annotated with two descriptive captions per image by the author. Visual feature extraction was performed using a pretrained ResNet-101 to generate deep object representations. Text data were processed using the Stanza tokenizer and converted into indices based on a constructed vocabulary. The modeling process integrated visual features and spatial relations into the Transformer architecture, and descriptions were generated using Beam Search Size 2.
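The step of converting tokenized captions into indices can be sketched as follows. This is an illustration under stated assumptions: a plain whitespace split stands in for the Stanza tokenizer, the special tokens (`<pad>`, `<s>`, `</s>`, `<unk>`) and the `min_count` parameter are hypothetical, and the example captions are invented rather than taken from the thesis annotations.

```python
from collections import Counter

# Hypothetical special tokens; the thesis does not state which ones it uses.
SPECIALS = ["<pad>", "<s>", "</s>", "<unk>"]


def build_vocab(tokenized_captions, min_count=1):
    """Map every sufficiently frequent token to an integer index."""
    counts = Counter(tok for cap in tokenized_captions for tok in cap)
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {tok: i for i, tok in enumerate(SPECIALS + words)}


def encode(tokens, vocab):
    """Convert a token list to indices, wrapping it in start/end markers."""
    unk = vocab["<unk>"]
    return [vocab["<s>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["</s>"]]


# Invented captions; str.split() is a stand-in for the Stanza tokenizer.
captions = [
    "a car stops at the red light".split(),
    "a bus passes the car".split(),
]
vocab = build_vocab(captions)
ids = encode("a truck passes the car".split(), vocab)
print(ids)
```

Words that never appeared in the training captions ("truck" here) fall back to the `<unk>` index, so the model always receives a valid index sequence.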
Evaluation used the BLEU, ROUGE-L, and BERTScore metrics, where higher scores indicate closer agreement between the generated descriptions and the reference captions. The Object Relation Transformer model with Beam Search Size 2 achieved BLEU-1 to BLEU-4 scores of 34.48, 15.30, 8.92, and 5.66, respectively, a ROUGE-L of 18.55, and a BERTScore of 74.25. In comparison, the CNN-LSTM-based Show and Tell model obtained BLEU-1 to BLEU-4 scores of 21.88, 7.10, 3.70, and 2.27, respectively, a ROUGE-L of 14.49, and a BERTScore of 70.51.
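For readers unfamiliar with BLEU-N, the sketch below computes a sentence-level score from clipped n-gram precision and the brevity penalty, using two references per candidate to mirror the two captions per image. It is an assumption-laden illustration: the abstract does not say which evaluation tool was used, reported scores are typically corpus-level and scaled to 0-100, and the sentences here are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each n-gram counts at most as often
    as it appears in the reference that contains it most."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def bleu(candidate, references, max_n):
    """Unsmoothed sentence-level BLEU with uniform weights up to max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (ties: shorter).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Invented example sentences, not thesis data.
candidate = "a car stops at the light".split()
references = [
    "a car stops at the red light".split(),
    "the car waits at a red light".split(),
]
bleu1 = bleu(candidate, references, 1)
bleu2 = bleu(candidate, references, 2)
print(round(100 * bleu1, 2), round(100 * bleu2, 2))
```

Because BLEU-N takes a geometric mean over n-gram orders 1 to N, the score can only stay the same or fall as N grows, which matches the decreasing BLEU-1 to BLEU-4 pattern reported for both models.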
In terms of efficiency, the Object Relation Transformer model with Beam Search Size 2 required approximately 39 seconds for inference on 125 images, equivalent to 0.318 seconds per image. Although it is slower than the CNN-LSTM-based Show and Tell model, which processes 53.41 images per second (approximately 0.019 seconds per image), the added computation time is justified by descriptions that are more accurate and contextual. The model therefore remains relevant for use in intelligent transportation systems.
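The relationship between the reported latency and throughput figures can be checked with a few lines of arithmetic. The variable names below are illustrative, and the only inputs are the numbers quoted in the abstract.

```python
# Figures quoted in the abstract.
ORT_SECONDS_PER_IMAGE = 0.318      # Object Relation Transformer, beam size 2
LSTM_IMAGES_PER_SECOND = 53.41     # Show and Tell (CNN-LSTM)
NUM_IMAGES = 125

# Total time implied by the per-image latency.
ort_total_seconds = ORT_SECONDS_PER_IMAGE * NUM_IMAGES

# Latency and throughput are reciprocals of each other.
ort_images_per_second = 1.0 / ORT_SECONDS_PER_IMAGE
lstm_seconds_per_image = 1.0 / LSTM_IMAGES_PER_SECOND

# How many times longer the Transformer takes per image.
slowdown = ORT_SECONDS_PER_IMAGE / lstm_seconds_per_image

print(f"ORT total for {NUM_IMAGES} images: {ort_total_seconds:.2f} s")
print(f"ORT throughput: {ort_images_per_second:.2f} images/s")
print(f"CNN-LSTM latency: {lstm_seconds_per_image:.4f} s/image")
print(f"Slowdown factor: {slowdown:.1f}x")
```

Under these figures, 0.318 seconds per image corresponds to about 39.75 seconds for 125 images, so the "approximately 39 seconds" total appears to be rounded, and the Transformer model is roughly 17 times slower per image than the CNN-LSTM baseline.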

Item Type:	Thesis (Other)
Uncontrolled Keywords:	Advanced Driver Assistance Systems, Beam Search Size 2, Geometric Attention, Image Captioning, Object Relation Transformer.
Subjects:	Q Science > Q Science (General); Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines; Q Science > QA Mathematics > QA336 Artificial Intelligence; Q Science > QA Mathematics > QA402.6 Transportation problems (Programming); T Technology > T Technology (General); T Technology > TA Engineering (General). Civil engineering (General); T Technology > TA Engineering (General). Civil engineering (General) > TA1637 Image processing--Digital techniques. Image analysis--Data processing; T Technology > TE Highway engineering. Roads and pavements > TE228.3 Intelligent transportation systems; T Technology > TL Motor vehicles. Aeronautics. Astronautics > TL152.8 Vehicles, Remotely piloted. Autonomous vehicles.
Divisions:	Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User:	Javier Nararya Aqsa S
Date Deposited:	31 Jul 2025 02:05
Last Modified:	31 Jul 2025 02:05
URI:	http://repository.its.ac.id/id/eprint/124011
