A Comparative Study of Generating Synthetic Tabular Data Using CTGAN and DP-CTGAN : Fidelity and Risk Assessment
Bastola, Namita (2025-12-10)
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
Open access
The permanent address of this publication is:
https://urn.fi/URN:NBN:fi-fe20251216120254
Abstract
The rapid growth of digital records in the medical field offers unprecedented opportunities for research and data-driven healthcare analytics. However, the sensitive nature of these data raises significant privacy concerns that constrain direct data sharing and collaborative research. Synthetic data generation has been explored as a practical approach to these challenges: it produces realistic artificial datasets that closely preserve the key statistical properties of the original data while mitigating privacy risk.
This thesis examines synthetic tabular data generation with the Conditional Tabular Generative Adversarial Network (CTGAN) proposed by Xu et al., a model designed for tabular data, and with a variant that integrates differential privacy (DP-CTGAN) under varying privacy budgets. Each model is trained across multiple independent runs, and the resulting synthetic datasets are stored in a structured directory for evaluation. Fidelity is assessed with quantitative metrics: Hellinger distance, pairwise correlation (Spearman and Cramér’s V), and SDV metrics such as Total Variation Distance for categorical variables and the Kolmogorov-Smirnov (KS) statistic for continuous variables. Privacy exposure is measured with singling-out risk, an attack-based framework that simulates an adversary’s likelihood of re-identifying individuals, computed for the synthetic dataset from each run.
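To make the fidelity metrics named above concrete, the following is a minimal sketch of three of them (Hellinger distance, Total Variation Distance, and the two-sample KS statistic) written from their textbook definitions in plain Python. It is not the thesis's implementation and does not use the SDV library; all function names and the toy columns are hypothetical and chosen only for illustration.

```python
import math
from bisect import bisect_right
from collections import Counter


def category_probs(values, categories):
    """Relative frequency of each category in `values`, in a fixed order."""
    counts = Counter(values)
    n = len(values)
    return [counts.get(c, 0) / n for c in categories]


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    0 means identical distributions, 1 means no overlapping support.
    """
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2)


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so checking every observed value finds the maximum gap.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


# Toy columns (hypothetical): one categorical, one continuous.
real_sex = ["F", "F", "M", "M"]
synth_sex = ["F", "M", "M", "M"]
cats = ["F", "M"]
p = category_probs(real_sex, cats)
q = category_probs(synth_sex, cats)

real_age = [30, 40, 50, 60]
synth_age = [35, 45, 55, 65]

print("TVD:", total_variation_distance(p, q))
print("Hellinger:", hellinger_distance(p, q))
print("KS:", ks_statistic(real_age, synth_age))
```

For all three measures, lower values indicate that the synthetic column follows the real column more closely, so a per-feature table of these distances is one way to compare runs and models.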
Furthermore, the thesis analyzes feature-wise fidelity, run-to-run variability across multiple training runs, and the inherent trade-off between privacy and fidelity for each model. The results show that CTGAN generates higher-quality synthetic tabular data than the DP-CTGAN variants, because the privacy mechanism degrades fidelity. DP-CTGAN models remain applicable when privacy is the primary concern, at the cost of some data quality. This work can inform collaborative guidelines for generating synthetic data and analyzing healthcare data while maintaining both privacy and data quality.
