Pratama, Raditya (2025) Perancangan dan Implementasi Scalable dan ACID-Compliant Data Lakehouse [Design and Implementation of a Scalable and ACID-Compliant Data Lakehouse]. Other thesis, Institut Teknologi Sepuluh Nopember.
5027211047-Undergraduate_Thesis.pdf - Accepted Version (6MB). Restricted to Repository staff only until 1 April 2027.
Abstract
In recent years, the demand for efficient storage and processing systems capable of handling large volumes of data and complex analytics has significantly increased. The data lakehouse architecture, which combines the strengths of data lakes (storage of diverse data formats) and data warehouses (structured data processing), has emerged as an innovative solution to meet these needs. This architecture is designed to comply with the ACID principles (Atomicity, Consistency, Isolation, and Durability), ensuring data integrity and consistency at scale. Atomicity ensures transactions are processed entirely or not at all; consistency guarantees that each transaction moves the database between valid states; isolation ensures that concurrent transactions do not interfere with each other; and durability ensures that committed changes persist even in the event of a system failure. This research develops and evaluates a data lakehouse architecture integrating Docker, MinIO, MariaDB, Hive Metastore, Trino, Apache Iceberg, and Metabase. The implementation uses MinIO for object storage, Hive Metastore for metadata management, Apache Iceberg to provide ACID-compliant tables, Trino for data analytics, and Metabase for data visualization. All components are deployed in an isolated Docker environment to ensure scalability and ease of management. The evaluation focuses on scalability, data processing efficiency, ease of system management, metadata consistency, and ACID compliance. The results demonstrate that horizontal scaling is more effective than vertical scaling: adding worker nodes to the Trino cluster reduced query execution time by up to 19.8% on a 10 GB dataset, with a more evenly distributed workload. The system also satisfies all ACID compliance criteria, ensuring data integrity in various scenarios. Furthermore, the data lakehouse architecture outperforms PostgreSQL in data processing efficiency, achieving up to 84.8% faster average query execution times across all datasets.
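The abstract's claim that Apache Iceberg provides ACID-compliant tables rests on a simple mechanism: every commit produces an immutable snapshot, and the table becomes the new snapshot only through an atomic swap of a single metadata pointer, with stale writers rejected and retried (optimistic concurrency). The following is a minimal, hypothetical Python sketch of that pattern, not code from the thesis; the class and method names (`TablePointer`, `commit`) are illustrative only:

```python
import threading

class TablePointer:
    """Toy model of Iceberg-style atomicity: the table's committed state
    is one immutable snapshot, replaced via a compare-and-swap of a
    single pointer. Readers therefore never observe a partial write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = ()  # immutable tuple = current committed state

    def read(self):
        # Always returns a complete, committed snapshot.
        return self._snapshot

    def commit(self, expected, new_rows):
        # Compare-and-swap: succeed only if nobody committed in between.
        with self._lock:
            if self._snapshot is not expected:
                return False  # conflict: caller must re-read and retry
            self._snapshot = expected + tuple(new_rows)
            return True

table = TablePointer()
snap = table.read()
assert table.commit(snap, ["row1"])      # first writer wins
assert not table.commit(snap, ["row2"])  # stale snapshot is rejected
snap2 = table.read()                     # retry: re-read, then commit
assert table.commit(snap2, ["row2"])
print(table.read())  # -> ('row1', 'row2')
```

In the real system the pointer swap is performed by the catalog (here, Hive Metastore backed by MariaDB), while the snapshots live as files in object storage (MinIO); the sketch collapses all of that into one in-process object to show only the concurrency-control idea.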
Item Type: | Thesis (Other) |
---|---|
Uncontrolled Keywords: | ACID Compliance, Apache Iceberg, Data Lakehouse, Data Processing Efficiency, Docker, Metabase, Metadata Management, MinIO, Scalability, Trino |
Subjects: | T Technology > T Technology (General); T Technology > T Technology (General) > T57.5 Data Processing; T Technology > T Technology (General) > T58.64 Information resources management; T Technology > T Technology (General) > T58.8 Productivity. Efficiency |
Divisions: | Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Information Technology > 59201-(S1) Undergraduate Thesis |
Depositing User: | Raditya Pratama |
Date Deposited: | 23 Jan 2025 02:03 |
Last Modified: | 23 Jan 2025 02:03 |
URI: | http://repository.its.ac.id/id/eprint/116688 |