A Comparative Study of Generating Synthetic Tabular Data Using CTGAN and DP-CTGAN : Fidelity and Risk Assessment

dc.contributor.authorBastola, Namita
dc.contributor.departmentfi=Tietotekniikan laitos|en=Department of Computing|
dc.contributor.facultyfi=Teknillinen tiedekunta|en=Faculty of Technology|
dc.contributor.studysubjectfi=Information and Communication Technology|en=Information and Communication Technology|
dc.date.accessioned2025-12-16T22:04:15Z
dc.date.available2025-12-16T22:04:15Z
dc.date.issued2025-12-10
dc.description.abstractThe rapid growth of digital records in the medical field offers unprecedented opportunities for research and data-driven healthcare analytics. However, the sensitive nature of these data raises significant privacy concerns that constrain direct data sharing and collaborative research. Synthetic data generation has been explored as a practical approach to these challenges: it produces realistic artificial datasets that closely preserve the key statistical properties of the original data while mitigating privacy risk. This thesis examines synthetic tabular data generation with the Conditional Tabular Generative Adversarial Network (CTGAN) proposed by Xu et al., which is designed to generate synthetic tabular data, and further integrates differential privacy under varying privacy budgets (DPCTGAN). Each model is trained over multiple independent runs, and the resulting synthetic datasets are stored in a structured directory for evaluation. Fidelity is assessed with quantitative metrics: Hellinger distance, pairwise correlation (Spearman and Cramér’s V), and SDV metrics such as Total Variation Distance for categorical variables and the Kolmogorov-Smirnov (KS) statistic for continuous variables. To measure privacy exposure, the study adopts single-out risk, an attack-based framework that simulates how likely an adversary is to re-identify individuals, and applies it to the synthetic dataset from each run. The thesis also analyzes feature-wise fidelity, run-to-run model variability across multiple trainings, and the inherent trade-off between data privacy and fidelity for each model. The results show that CTGAN generates higher-quality synthetic tabular data than the DPCTGAN model variants, because privacy control degrades fidelity.
DPCTGAN models remain applicable when privacy is the major concern, at the cost of some quality. This work can inform collaborative guidelines for generating synthetic data and analyzing healthcare data while maintaining both privacy and data quality.
dc.format.extent68
dc.identifier.olddbid211686
dc.identifier.oldhandle10024/194705
dc.identifier.urihttps://www.utupub.fi/handle/11111/17051
dc.identifier.urnURN:NBN:fi-fe20251216120254
dc.language.isoeng
dc.rightsfi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.|
dc.rights.accessrightsopen
dc.source.identifierhttps://www.utupub.fi/handle/10024/194705
dc.subjectSynthetic Data Generation, Medical Tabular Datasets, Differential Privacy, Generative Adversarial Networks, CTGAN, DPCTGAN, Fidelity, Single-Out Risk, Model Stability
dc.titleA Comparative Study of Generating Synthetic Tabular Data Using CTGAN and DP-CTGAN : Fidelity and Risk Assessment
dc.type.ontasotfi=Diplomityö|en=Master's thesis|

Files

Name:
Bastola_Namita_Thesis.pdf
Size:
1.23 MB
Format:
Adobe Portable Document Format