Sudjatmiko, Farrell Rafee (2026) Comprehensive Desktop Application for Deep Learning Optimization on Jetson Architecture with Advanced Tensor Core Alignment and Memory Bandwidth Optimization. Other thesis, Institut Teknologi Sepuluh Nopember.
Text: 5024221011_Undergraduate-Thesis.pdf - Accepted Version (4MB). Restricted to Repository staff only.
Abstract
Deep learning models for video-based analysis often achieve high accuracy at the cost of significant computational overhead, which limits their deployment on resource-constrained and heterogeneous hardware platforms. This thesis investigates a systematic optimization framework for pre-trained convolutional neural networks, with a particular focus on cross-platform inference efficiency and hardware-aware execution. Rather than retraining models, the proposed approach emphasizes post-training optimization techniques that adapt existing models to diverse execution environments. An optimization application was developed using an Electron-based graphical interface with a Python backend, enabling automated model conversion, optimization, and benchmarking across multiple platforms. The system supports both CPU- and GPU-based execution paths and integrates precision scaling, structured channel pruning, weight clustering, and computation graph transformations including operator fusion, constant folding, and batch normalization folding. For GPU-enabled platforms, the framework further exploits TensorRT optimization and FP16 Tensor Core acceleration. Experiments were conducted using a pre-trained MobileNet-based student engagement detection model evaluated on a self-recorded video dataset. Three alternative architectures were initially considered; however, only MobileNet demonstrated stable accuracy across platforms and was therefore selected for optimization. Performance was evaluated on four representative devices: an NVIDIA Jetson Orin Nano, an x86-based ASUS laptop, an Apple MacBook Air with M1 architecture, and a legacy Intel-based Lenovo laptop. Evaluation metrics included inference latency, throughput, memory consumption, and classification performance. The results demonstrate that hardware-aware optimization significantly improves inference performance while preserving model accuracy. On the Jetson platform, TensorRT FP16 optimization enabled real-time inference by effectively utilizing Tensor Cores. On x86 CPUs, TensorFlow Lite FP16 conversion provided substantial speedups through software-level optimization. Apple Silicon devices exhibited unstable FP32 execution, but achieved reliable real-time performance after FP16 conversion. In contrast, optimization had limited impact on legacy CPU hardware, indicating architectural bottlenecks rather than model inefficiency. Overall, this work highlights the importance of aligning model optimization strategies with underlying hardware characteristics. The proposed framework provides a practical and extensible solution for deploying deep learning models across heterogeneous platforms, enabling efficient inference without retraining and extending the usable lifespan of existing hardware.
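The abstract's CPU-side optimization path, post-training FP16 conversion with TensorFlow Lite, can be illustrated with a short sketch. The snippet below is illustrative only: it uses a stock ImageNet-pretrained MobileNetV2 and an assumed output filename as stand-ins for the thesis's engagement-detection model and conversion pipeline, which are not part of this record.

```python
# Minimal sketch of post-training FP16 conversion with TensorFlow Lite.
# The model and file name here are assumptions standing in for the thesis's
# pre-trained MobileNet-based engagement-detection model.
import tensorflow as tf

# Load a pre-trained MobileNet as a placeholder for the engagement model.
model = tf.keras.applications.MobileNetV2(weights="imagenet")

# Convert to TensorFlow Lite, asking the converter to store weights in FP16.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()

# Write the optimized model to disk for later benchmarking.
with open("mobilenet_fp16.tflite", "wb") as f:
    f.write(tflite_fp16)
```

In this scheme the weights are stored in half precision while activations remain FP32 at runtime, which is why the conversion needs no calibration data and leaves classification accuracy essentially unchanged, consistent with the results reported above.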
==========================================================================================================================================
Deep learning models for video analysis generally achieve high accuracy but demand substantial computational resources, which limits their deployment on devices with widely varying computational capabilities. This research examines a systematic optimization framework for pre-trained convolutional neural network models, focusing on cross-platform inference efficiency and the exploitation of hardware characteristics. The approach does not involve retraining the model; instead, it emphasizes post-training optimization so that existing models can run efficiently on a range of architectures. An optimization application was developed using an Electron-based graphical interface with a Python backend to support automated model conversion, optimization, and benchmarking. The system supports CPU- and GPU-based execution and integrates optimization techniques such as numerical precision reduction, structured pruning, weight clustering, and computation graph transformations covering operator fusion, constant folding, and batch normalization folding. For platforms with GPU acceleration, the system also exploits TensorRT optimization and Tensor Core-based FP16 execution. Experiments were conducted using a pre-trained MobileNet model for student engagement classification on video data. Several alternative architectures were tested in the early stage, but only MobileNet showed stable accuracy across platforms and was therefore selected as the primary model for optimization. Evaluation was carried out on four representative devices: an NVIDIA Jetson Orin Nano, an x86 CPU-based ASUS laptop, an Apple MacBook Air with the M1 architecture, and a Lenovo laptop with a legacy CPU. The evaluation covered inference time, throughput, memory usage, and classification performance. The experimental results show that optimization aligned with hardware characteristics significantly improves inference performance without reducing model accuracy. On the Jetson platform, TensorRT FP16 optimization enabled real-time inference through Tensor Core utilization. On the x86 platform, TensorFlow Lite FP16 conversion accelerated inference through software-level optimization. On Apple Silicon, FP32 execution was unstable, but real-time performance was achieved after conversion to FP16. Conversely, optimization had a limited impact on the legacy CPU device, indicating that the performance limitation stems from the hardware architecture rather than model complexity. Overall, this research underscores the importance of optimization approaches that consider the fit between the model and the hardware. The developed framework provides a practical and flexible solution for deploying deep learning inference across platforms without retraining, while extending the usable lifespan of available computing hardware.
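The latency, throughput, and memory figures referenced in both abstracts are typically gathered with a timed loop over the converted model. The sketch below shows one plausible way to measure mean latency and FPS with the TensorFlow Lite interpreter; the model path, warm-up count, run count, and dummy input are assumptions for illustration, not the thesis's actual benchmarking code.

```python
# Minimal sketch of a latency/throughput benchmark for a converted model.
# File name and iteration counts are illustrative assumptions.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="mobilenet_fp16.tflite")
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

# Dummy frame standing in for a preprocessed video frame (e.g. 1x224x224x3).
frame = np.random.rand(*input_detail["shape"]).astype(np.float32)

# Warm-up runs so one-time allocation costs do not skew the timing.
for _ in range(10):
    interpreter.set_tensor(input_detail["index"], frame)
    interpreter.invoke()

# Timed runs: mean latency in milliseconds and throughput in frames per second.
runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(input_detail["index"], frame)
    interpreter.invoke()
    _ = interpreter.get_tensor(output_detail["index"])
elapsed = time.perf_counter() - start

print(f"mean latency: {1000 * elapsed / runs:.2f} ms, throughput: {runs / elapsed:.1f} FPS")
```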
| Item Type: | Thesis (Other) |
|---|---|
| Uncontrolled Keywords: | Deep Learning Optimisation, Student Engagement Detection, Mixed Precision, Pruning, Cross-Hardware Deployment, TensorRT, Optimasi Deep Learning, Deteksi Keterlibatan Mahasiswa, Presisi Campuran, Pruning, Deploy Lintas Perangkat |
| Subjects: | T Technology > T Technology (General) > T57.8 Nonlinear programming. Support vector machine. Wavelets. Hidden Markov models. |
| Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Computer Engineering > 90243-(S1) Undergraduate Thesis |
| Depositing User: | Sudjatmiko Farrell Rafee |
| Date Deposited: | 20 Jan 2026 02:56 |
| Last Modified: | 20 Jan 2026 02:56 |
| URI: | http://repository.its.ac.id/id/eprint/129773 |