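Al Ghozi, Ihsan Kamil (2025) Perbandingan Performa Teknologi Data Lake Dan Data Warehouse Dalam Pengelolaan Data Stack Overflow Menggunakan Arsitektur Hybrid Hub-and-Spoke Dan Kimball Method [Performance Comparison of Data Lake and Data Warehouse Technologies in Managing Stack Overflow Data Using the Hybrid Hub-and-Spoke Architecture and the Kimball Method]. Other thesis, Institut Teknologi Sepuluh Nopember.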
5026211117-Undergraduate_Thesis.pdf - Accepted Version. Restricted to repository staff only (10MB).
Abstract
Advances in information technology require organizations to manage large-scale data efficiently in order to support data-driven decision-making. Two dominant architectures, the structured Data Warehouse and the flexible Data Lake, often leave organizations unsure which to choose because few studies compare their performance directly. This study compares the performance of a Data Warehouse built with the Kimball Method against a Data Lake built on a Hybrid Hub-and-Spoke architecture. The evaluation uses four key metrics: system build time, ETL duration, Queries per Second (QPS), and storage efficiency. Both systems are implemented on the Stack Overflow dataset. The Data Warehouse is built on SQL Server 2019 with ETL in Pentaho, while the Data Lake uses MinIO, Python-based ETL, and DuckDB as the analytical engine. QPS is tested systematically with a Python-based test harness. The results show that the Data Lake significantly outperforms the Data Warehouse in QPS and storage efficiency, thanks to the columnar Parquet format and the vectorized DuckDB engine. The Data Warehouse, by contrast, completes ETL faster when the source data is already structured, whereas Data Lake ETL takes longer during the initial ingestion phase (XML to Parquet conversion). The study concludes that the Data Lake is the more performant solution for modern analytical workloads, and it offers practical guidance for organizations selecting the architecture that best fits their needs and data characteristics.
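The abstract mentions a Python-based test harness for measuring QPS but does not reproduce it. The following is a minimal sketch of how such a measurement could look, not the thesis's actual harness. The function `measure_qps`, the helper `run_sqlite`, the `posts` table, and the sample queries are all hypothetical, and an in-memory SQLite database stands in for the real engines (SQL Server 2019 and DuckDB) so that the example runs with the standard library alone.

```python
import sqlite3
import time
from typing import Callable, Sequence


def measure_qps(run_query: Callable[[str], object],
                queries: Sequence[str],
                repetitions: int = 50) -> float:
    """Run every query `repetitions` times; return completed queries per second."""
    if not queries or repetitions <= 0:
        raise ValueError("need at least one query and one repetition")
    start = time.perf_counter()
    completed = 0
    for _ in range(repetitions):
        for sql in queries:
            run_query(sql)  # engine-specific callable supplied by the caller
            completed += 1
    elapsed = time.perf_counter() - start
    if elapsed <= 0:  # guard against a timer too coarse to register the run
        return float("inf")
    return completed / elapsed


# Stand-in engine (assumption): in-memory SQLite, NOT the engines from the thesis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO posts (score) VALUES (?)",
                 [(i % 10,) for i in range(1000)])


def run_sqlite(sql: str):
    """Execute one query and fetch all rows so the full result is materialized."""
    return conn.execute(sql).fetchall()


# Hypothetical analytical queries used only to exercise the harness.
sample_queries = [
    "SELECT COUNT(*) FROM posts",
    "SELECT AVG(score) FROM posts",
]
qps = measure_qps(run_sqlite, sample_queries)
print(f"{qps:.1f} queries per second")
```

Passing the engine as a callable keeps the timing loop identical for both systems, so only the `run_query` implementation would change between a Data Warehouse run and a Data Lake run.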
Item Type: | Thesis (Other) |
---|---|
Uncontrolled Keywords: | Data Lake, Data Warehouse, ETL Performance, QPS, 4-step Kimball Method, Hub-and-Spoke, DuckDB |
Subjects: | Q Science > QA Mathematics > QA76.9.D37 Data warehousing; Q Science > QA Mathematics > QA76.9D338 Data integration; T Technology > T Technology (General) > T57.5 Data Processing; Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4450 Databases |
Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Information System > 57201-(S1) Undergraduate Thesis |
Depositing User: | Ihsan Kamil Al Ghozi |
Date Deposited: | 22 Jul 2025 03:54 |
Last Modified: | 22 Jul 2025 03:54 |
URI: | http://repository.its.ac.id/id/eprint/120442 |