Integrasi Machine Learning Utility dalam Pengembangan Model Sintesis Data Tabular Menggunakan Loss Function Learning

Nur, Muhammad Rizqi (2024) Integrasi Machine Learning Utility dalam Pengembangan Model Sintesis Data Tabular Menggunakan Loss Function Learning. Masters thesis, Institut Teknologi Sepuluh Nopember.

6026221012-Master_Thesis.pdf - Accepted Version
Restricted to Repository staff only

Abstract

Since the enactment of the General Data Protection Regulation (GDPR) and the Indonesian Personal Data Protection Law, companies must be careful when sharing data in order to protect user privacy. Synthetic data is one alternative solution to this problem. Various deep learning-based tabular data synthesis models have been proposed, with machine learning utility (ML utility) as the main evaluation metric. ML utility is measured by training a model on synthetic data and testing it on real data. However, previous models did not integrate ML utility into their training to produce the same ML utility, because ML utility cannot be calculated directly from a dataset. This research aims to integrate ML utility into data synthesizers using loss function learning, that is, by modelling the ML utility loss function with a deep learning model. A transformer-based model is developed to estimate ML utility and is then integrated into four data synthesis models by backpropagating the error between the estimated ML utility and the expected ML utility. This research uses TVAE, LCT-GAN, TabDDPM, and REaLTabFormer as the base synthesizers and CatBoost as the predictive model for evaluating ML utility. Three datasets are used, covering regression, binary classification, and multiclass classification. Integrating ML utility improved ML utility on all three datasets for LCT-GAN and on two datasets for REaLTabFormer. This improvement had a significant impact on distribution similarity, the difference in correlation between features, and the distance to the closest real record (privacy), but the direction of the impact was not consistent. The Pearson correlations between the change in ML utility and the changes in Jensen-Shannon Divergence (JSD) and Wasserstein Distance (WD) were always negative, indicating that the distribution becomes more similar to the original data as ML utility increases.
Inspection of dataset samples showed that in datasets where ML utility improves, the correlation of the features with the target variable also improves, even in a dataset where the overall correlation deteriorates. Sample inspection also showed that some data rows individually become more similar to the original data, which means that privacy is less well preserved. In addition to ML utility integration, this study showed that patterns between rows in a dataset can be learned, so better data synthesizers can be developed based on these two findings.
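The ML utility measurement described above (train on synthetic data, test on real data) can be illustrated with a minimal sketch. It rests on assumptions that are not from the thesis: a one-feature least-squares line stands in for the CatBoost predictive model, R² stands in for the score, and the function names and toy rows are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (stand-in for CatBoost)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def r2_score(y_true, y_pred):
    """Coefficient of determination; 1.0 is a perfect fit, can be negative."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def ml_utility_tstr(synthetic, real):
    """Train on synthetic rows, score on real rows. Each row is (feature, target)."""
    a, b = fit_line([x for x, _ in synthetic], [y for _, y in synthetic])
    preds = [a * x + b for x, _ in real]
    return r2_score([y for _, y in real], preds)


# Hypothetical toy data: the real rows follow y = 2x + 1.
real = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
good_synth = [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0)]  # preserves the relationship
bad_synth = [(0.5, 6.0), (1.5, 4.0), (2.5, 2.0)]   # reverses the slope

good = ml_utility_tstr(good_synth, real)
bad = ml_utility_tstr(bad_synth, real)
print(good, bad)
```

Because the score depends on fitting a model first, it is not a direct function of the synthetic dataset, which is the motivation the abstract gives for estimating it with a learned model.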
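The privacy measure mentioned above, the distance to the closest real record, can be sketched as follows. This is an illustration under assumptions, not the thesis's implementation: Euclidean distance on already-scaled numeric features is assumed, and the function name and toy rows are hypothetical.

```python
import math


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, return the smallest distance to any real row.

    Assumption: Euclidean distance on numeric, already-scaled features.
    """
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


# Hypothetical toy rows with two numeric features each.
real_rows = [(0.0, 0.0), (3.0, 4.0)]
synthetic_rows = [(0.0, 0.0), (3.0, 0.0), (6.0, 8.0)]

dcr = distance_to_closest_record(synthetic_rows, real_rows)
print(dcr)
```

A distance of zero means a synthetic row reproduces a real row exactly, so smaller distances indicate weaker privacy, which matches the abstract's reading of rows becoming individually more similar to the original data.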

Item Type: Thesis (Masters)
Uncontrolled Keywords: loss function learning, machine learning utility, tabular data synthesis, transformer regression
Subjects: Q Science > Q Science (General) > Q325.5 Machine learning. Support vector machines.
Q Science > Q Science (General) > Q325.78 Back propagation
Q Science > QA Mathematics > QA336 Artificial Intelligence
Q Science > QA Mathematics > QA76.87 Neural networks (Computer Science)
Q Science > QA Mathematics > QA76.9.D343 Data mining. Querying (Computer science)
T Technology > T Technology (General)
T Technology > T Technology (General) > T57.5 Data Processing
Divisions: Faculty of Intelligent Electrical and Informatics Technology (ELECTICS) > Information System > 59101-(S2) Master Thesis
Depositing User: Muhammad Rizqi Nur
Date Deposited: 09 Aug 2024 07:37
Last Modified: 09 Aug 2024 07:37
URI: http://repository.its.ac.id/id/eprint/112451