A Comparative Study of Generating Synthetic Tabular Data Using CTGAN and DP-CTGAN : Fidelity and Risk Assessment

Bastola, Namita

dc.contributor.author	Bastola, Namita
dc.contributor.department	fi=Tietotekniikan laitos\|en=Department of Computing\|
dc.contributor.faculty	fi=Teknillinen tiedekunta\|en=Faculty of Technology\|
dc.contributor.studysubject	fi=Information and Communication Technology\|en=Information and Communication Technology\|
dc.date.accessioned	2025-12-16T22:04:15Z
dc.date.available	2025-12-16T22:04:15Z
dc.date.issued	2025-12-10
dc.description.abstract	The rapid growth of digital records in the medical field offers unprecedented opportunities for research and data-driven healthcare analytics. However, the sensitive nature of these data raises significant privacy concerns that constrain direct data sharing and collaborative research. Synthetic data generation has been explored as a practical approach to these challenges: it produces realistic artificial datasets that closely preserve the key statistical properties of the original data while mitigating privacy risk. This thesis examines synthetic tabular data generation with the Conditional Tabular Generative Adversarial Network (CTGAN) proposed by Xu et al., which is designed to generate synthetic tabular data, and with DP-CTGAN, which further integrates differential privacy and is trained under varying privacy budgets. Each model is trained across multiple independent runs, and the resulting synthetic datasets are stored in a structured directory for evaluation. Fidelity is assessed with quantitative metrics: Hellinger distance, pairwise correlation (Spearman and Cramér’s V), and SDV metrics, namely Total Variation Distance for categorical variables and the Kolmogorov-Smirnov (KS) statistic for continuous variables. Privacy exposure is measured with single-out risk, an attack-based framework that simulates an adversary’s likelihood of re-identifying records, evaluated on the synthetic dataset from each run. The thesis also analyzes feature-wise fidelity, run-to-run model variability across multiple trainings, and the inherent trade-off between privacy and fidelity for each model. The results show that CTGAN generates higher-fidelity synthetic tabular data than the DP-CTGAN variants, because the privacy mechanism degrades fidelity.
DP-CTGAN models are nevertheless suitable when privacy is a major concern, at some cost in data quality. This work can inform guidelines for generating synthetic data and for collaboratively analyzing healthcare data while maintaining both privacy and data quality.
dc.format.extent	68
dc.identifier.olddbid	211686
dc.identifier.oldhandle	10024/194705
dc.identifier.uri	https://www.utupub.fi/handle/11111/17051
dc.identifier.urn	URN:NBN:fi-fe20251216120254
dc.language.iso	eng
dc.rights	fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.\|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.\|
dc.rights.accessrights	avoin
dc.source.identifier	https://www.utupub.fi/handle/10024/194705
dc.subject	Synthetic Data Generation, Medical Tabular Datasets, Differential Privacy, Generative Adversarial Networks, CTGAN, DPCTGAN, Fidelity, Single-Out Risk, Model Stability
dc.title	A Comparative Study of Generating Synthetic Tabular Data Using CTGAN and DP-CTGAN : Fidelity and Risk Assessment
dc.type.ontasot	fi=Diplomityö\|en=Master's thesis\|
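
The abstract names three distribution-level fidelity measures: Hellinger distance, Total Variation Distance, and the KS statistic. The following is a minimal sketch of what each one computes when a real column is compared with its synthetic counterpart. It is not the thesis's code and does not use the SDV library; the function names and the toy columns are hypothetical, and applying Hellinger distance to category frequencies is an assumption made here for illustration. All three values lie in [0, 1], with 0 meaning the two columns have identical empirical distributions.

```python
import math
from bisect import bisect_right
from collections import Counter


def category_probs(values, categories):
    """Relative frequency of each category, in the order given."""
    counts = Counter(values)
    n = len(values)
    return [counts[c] / n for c in categories]


def total_variation_distance(real, synth):
    """Half the L1 distance between the two category frequency vectors."""
    cats = sorted(set(real) | set(synth))
    p = category_probs(real, cats)
    q = category_probs(synth, cats)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def hellinger_distance(real, synth):
    """Hellinger distance between the two category frequency vectors
    (assumption: computed on discrete frequencies)."""
    cats = sorted(set(real) | set(synth))
    p = category_probs(real, cats)
    q = category_probs(synth, cats)
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)


def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest absolute gap between the
    empirical CDFs of the two samples."""
    r, s = sorted(real), sorted(synth)
    points = sorted(set(r) | set(s))
    return max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in points
    )


# Toy columns, invented for illustration only.
real_cat = ["A", "A", "B", "B"]
synth_cat = ["A", "B", "B", "B"]
real_num = [1, 2, 3, 4]
synth_num = [3, 4, 5, 6]

tvd = total_variation_distance(real_cat, synth_cat)
hd = hellinger_distance(real_cat, synth_cat)
ks = ks_statistic(real_num, synth_num)
print(tvd, hd, ks)
```

Lower values indicate that the synthetic column follows the real one more closely, so a fidelity comparison between models would look at how these distances change as the privacy budget is tightened.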

Files

Name: Bastola_Namita_Thesis.pdf
Size: 1.23 MB
Format: Adobe Portable Document Format

Collections

Master's theses (pro gradu and diploma theses) and advanced studies theses (full texts)