Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

dc.contributor.author: Montoya Perez, Ileana
dc.contributor.author: Movahedi, Parisa
dc.contributor.author: Nieminen, Valtteri
dc.contributor.author: Airola, Antti
dc.contributor.author: Pahikkala, Tapio
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=terveysteknologia|en=Health Technology|
dc.contributor.organization-code: 1.2.246.10.2458963.20.28696315432
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 457851720
dc.converis.url: https://research.utu.fi/converis/portal/Publication/457851720
dc.date.accessioned: 2025-08-27T22:48:32Z
dc.date.available: 2025-08-27T22:48:32Z
dc.description.abstract: <p><strong>Background</strong> Synthetic data have been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data while protecting the privacy of the individual subjects. Differential Privacy (DP) is currently considered the gold standard approach for balancing this trade-off.</p><p><strong>Objectives</strong> The aim of this study is to investigate the trustworthiness of group differences discovered by independent sample tests on DP-synthetic data. The evaluation is carried out in terms of the tests' Type I and Type II errors. The former quantifies the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries.</p><p><strong>Methods</strong> We evaluate the Mann–Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n = 500) and a cardiovascular dataset (n = 70,000), as well as from bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms.</p><p><strong>Conclusion</strong> A large portion of the evaluation results showed dramatically inflated Type I errors, especially at privacy levels of ϵ ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy.
A DP Smoothed Histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested, but it required a large original dataset size and a modest privacy budget (ϵ ≥ 5) in order to have reasonable Type II error levels.</p>
dc.format.pagerange: 35
dc.format.pagerange: 51
dc.identifier.eissn: 2511-705X
dc.identifier.jour-issn: 0026-1270
dc.identifier.olddbid: 202843
dc.identifier.oldhandle: 10024/185870
dc.identifier.uri: https://www.utupub.fi/handle/11111/50514
dc.identifier.url: https://doi.org/10.1055/a-2385-1355
dc.identifier.urn: URN:NBN:fi-fe2025082785868
dc.language.iso: en
dc.okm.affiliatedauthor: Montoya Perez, Ileana
dc.okm.affiliatedauthor: Movahedi, Parisa
dc.okm.affiliatedauthor: Nieminen, Valtteri
dc.okm.affiliatedauthor: Airola, Antti
dc.okm.affiliatedauthor: Pahikkala, Tapio
dc.okm.discipline: 112 Statistics and probability [en_GB]
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 112 Tilastotiede [fi_FI]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Georg Thieme Verlag
dc.publisher.country: Germany [en_GB]
dc.publisher.country: Saksa [fi_FI]
dc.publisher.country-code: DE
dc.relation.doi: 10.1055/a-2385-1355
dc.relation.ispartofjournal: Methods of Information in Medicine
dc.relation.issue: 1-2
dc.relation.volume: 63
dc.source.identifier: https://www.utupub.fi/handle/10024/185870
dc.title: Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?
dc.year.issued: 2024

Files

Name: a-2385-1355.pdf
Size: 1.46 MB
Format: Adobe Portable Document Format