Implementasi Agentic Retrieval Augmented Generation AI Inklusif Berbasis Model Context Protocol Dan Multimodal Inference Engine Menggunakan Arsitektur Gemma Dan CLIP Vision Transformer

Harsanti, Firania Putri (2026) Implementasi Agentic Retrieval Augmented Generation AI Inklusif Berbasis Model Context Protocol Dan Multimodal Inference Engine Menggunakan Arsitektur Gemma Dan CLIP Vision Transformer. Other thesis, Institut Teknologi Sepuluh Nopember.

5025221023-Undergraduate_Thesis.pdf - Accepted Version
Restricted to Repository staff only


Abstract

Informasi visual krusial bagi mobilitas mandiri, namun teknologi asistif saat ini sering terkendala privasi data akibat ketergantungan pada layanan cloud. Penelitian ini mengimplementasikan Agentic Retrieval Augmented Generation (RAG) berbasis Model Context Protocol (MCP) menggunakan arsitektur lokal Gemma-2b dan CLIP. Dengan dataset orisinal dari kampus ITS dan trotoar Surabaya, evaluasi sistem dilakukan melalui metrik BLEU, ROUGE-L, serta analisis latensi. Pengujian menunjukkan bahwa optimasi Projection Adapter meningkatkan skor BLEU-4 dari 11,88% menjadi 15,77% dan ROUGE-L sebesar 34,29%. Arsitektur MCP menghasilkan performa paling unggul dibandingkan model baseline di berbagai kondisi, dengan capaian pada skenario outdoor (BLEU-4 16,65%; ROUGE-L 46,08%), skenario indoor (BLEU-4 8,77%; ROUGE-L 33,57%), serta skenario street atau Trotoar Surabaya (BLEU-4 7,64%; ROUGE-L 28,21%). Meski menambah latensi pada lingkungan padat, sistem menghasilkan efisiensi respons signifikan pada skenario street sebesar 10,75 detik, lebih cepat dibanding baseline VLM (15,17 detik). Integrasi MCP efektif memitigasi halusinasi visual melalui verifikasi metadata lokal, sehingga menghasilkan solusi asistif yang presisi dan aman secara mandiri (on-premise).
================================================================================================================================================
Visual information is crucial for independent mobility, yet current assistive technologies are often constrained by data privacy concerns stemming from their dependence on cloud services. This research implements Agentic Retrieval Augmented Generation (RAG) based on the Model Context Protocol (MCP), using the Gemma-2b and CLIP architectures running locally. Using an original dataset collected on the ITS campus and Surabaya sidewalks, the system was evaluated with the BLEU and ROUGE-L metrics and a latency analysis. Testing shows that Projection Adapter optimization increased the BLEU-4 score from 11.88% to 15.77% and ROUGE-L by 34.29%. The MCP architecture outperformed the baseline models across conditions: outdoor scenario (BLEU-4 16.65%; ROUGE-L 46.08%), indoor scenario (BLEU-4 8.77%; ROUGE-L 33.57%), and street (Surabaya sidewalk) scenario (BLEU-4 7.64%; ROUGE-L 28.21%). Although it adds latency in dense environments, the system responded in 10.75 seconds in the street scenario, faster than the VLM baseline (15.17 seconds). MCP integration effectively mitigates visual hallucinations through local metadata verification, yielding a precise, secure, on-premise assistive solution.
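
For readers unfamiliar with the ROUGE-L metric reported above, the following is a minimal sketch of how such a score is commonly computed: the longest common subsequence (LCS) between a reference caption and a generated caption is turned into precision, recall, and an F-measure. The whitespace tokenization, lowercasing, `beta = 1.0` weighting, function names, and sample captions are all assumptions made for illustration; the abstract does not state which ROUGE-L implementation or settings the thesis used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure between two captions.

    Assumptions (not from the thesis): lowercase whitespace tokens,
    beta = 1.0 so precision and recall are weighted equally.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical captions, not taken from the thesis dataset.
score = rouge_l(
    "a sidewalk with a parked motorcycle",
    "a parked motorcycle on the sidewalk",
)
print(f"ROUGE-L: {score:.2%}")
```

Because ROUGE-L rewards in-order word overlap rather than exact n-gram matches, it is typically more forgiving than BLEU-4 for captions that are correct but phrased differently, which is consistent with the ROUGE-L figures above being higher than the BLEU-4 figures.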

Item Type: Thesis (Other)
Uncontrolled Keywords: Agentic AI, Image Captioning, Model Context Protocol (MCP), Multimodal RAG, Gemma, CLIP
Subjects: T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis
Depositing User: Firania Putri Harsanti
Date Deposited: 30 Jan 2026 10:28
Last Modified: 30 Jan 2026 10:28
URI: http://repository.its.ac.id/id/eprint/131341
