Harsanti, Firania Putri (2026) Implementasi Agentic Retrieval Augmented Generation AI Inklusif Berbasis Model Context Protocol Dan Multimodal Inference Engine Menggunakan Arsitektur Gemma Dan CLIP Vision Transformer. Other thesis, Institut Teknologi Sepuluh Nopember.
Text: 5025221023-Undergraduate_Thesis.pdf - Accepted Version (12MB). Restricted to repository staff only.
Abstract
Visual information is crucial for independent mobility, yet current assistive technologies are often constrained by data privacy concerns arising from their dependence on cloud services. This research implements Agentic Retrieval Augmented Generation (RAG) based on the Model Context Protocol (MCP), using locally run Gemma-2b and CLIP architectures. The system was evaluated on an original dataset collected on the ITS campus and on Surabaya sidewalks, using BLEU and ROUGE-L metrics together with latency analysis. Testing shows that Projection Adapter optimization raised the BLEU-4 score from 11.88% to 15.77% and achieved a ROUGE-L of 34.29%. The MCP architecture outperformed the baseline models across conditions: outdoor scenarios (BLEU-4 16.65%; ROUGE-L 46.08%), indoor scenarios (BLEU-4 8.77%; ROUGE-L 33.57%), and street (Surabaya sidewalk) scenarios (BLEU-4 7.64%; ROUGE-L 28.21%). Although MCP adds latency in dense environments, the system responded in 10.75 seconds in the street scenario, faster than the VLM baseline (15.17 seconds). MCP integration effectively mitigates visual hallucinations through local metadata verification, yielding a precise and secure on-premise assistive solution.
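The abstract reports its headline results as before/after pairs. As a reading aid, the short sketch below works out the absolute and relative changes those pairs imply. It uses only the figures quoted in the abstract; the helper `relative_change` and the variable names are illustrative and are not taken from the thesis.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures as reported in the abstract (variable names are illustrative).
bleu4_before, bleu4_after = 11.88, 15.77      # BLEU-4 (%), before/after Projection Adapter optimization
latency_baseline, latency_mcp = 15.17, 10.75  # seconds, street scenario: VLM baseline vs. MCP system

bleu4_gain_points = bleu4_after - bleu4_before
bleu4_gain_relative = relative_change(bleu4_before, bleu4_after)
latency_saving_seconds = latency_baseline - latency_mcp
latency_change_relative = relative_change(latency_baseline, latency_mcp)

print(f"BLEU-4: +{bleu4_gain_points:.2f} points ({bleu4_gain_relative:+.1f}% relative)")
print(f"Street latency: {latency_saving_seconds:.2f} s faster ({latency_change_relative:+.1f}% relative)")
```

Read this way, the Projection Adapter gain is about 3.89 BLEU-4 points, and the street-scenario response is about 4.42 seconds faster than the VLM baseline. These are derived from the quoted numbers only and say nothing about statistical significance.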
| Item Type: | Thesis (Other) |
|---|---|
| Uncontrolled Keywords: | Agentic AI, Image Captioning, Model Context Protocol (MCP), Multimodal RAG, Gemma, CLIP |
| Subjects: | T Technology > T Technology (General) > T57.5 Data Processing |
| Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Informatics Engineering > 55201-(S1) Undergraduate Thesis |
| Depositing User: | Firania Putri Harsanti |
| Date Deposited: | 30 Jan 2026 10:28 |
| Last Modified: | 30 Jan 2026 10:28 |
| URI: | http://repository.its.ac.id/id/eprint/131341 |