Benchmarking Evaluation Protocols for Classifiers Trained on Differentially Private Synthetic Data

dc.contributor.author: Movahedi, Parisa
dc.contributor.author: Nieminen, Valtteri
dc.contributor.author: Perez, Ileana Montoya
dc.contributor.author: Daafane, Hiba
dc.contributor.author: Sukhwal, Dishant
dc.contributor.author: Pahikkala, Tapio
dc.contributor.author: Airola, Antti
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=terveysteknologia|en=Health Technology|
dc.contributor.organization-code: 1.2.246.10.2458963.20.28696315432
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code: 2610303
dc.converis.publication-id: 457862048
dc.converis.url: https://research.utu.fi/converis/portal/Publication/457862048
dc.date.accessioned: 2025-08-27T23:49:02Z
dc.date.available: 2025-08-27T23:49:02Z
dc.description.abstract: Differentially private (DP) synthetic data has emerged as a potential solution for sharing sensitive individual-level biomedical data. DP generative models offer a promising approach for generating realistic synthetic data that aims to maintain the original data's central statistical properties while ensuring privacy by limiting the risk of disclosing sensitive information about individuals. However, the issue regarding how to assess the expected real-world prediction performance of machine learning models trained on synthetic data remains an open question. In this study, we experimentally evaluate two different model evaluation protocols for classifiers trained on synthetic data. The first protocol employs solely synthetic data for downstream model evaluation, whereas the second protocol assumes limited DP access to a private test set consisting of real data managed by a data curator. We also propose a metric for assessing how well the evaluation results of the proposed protocols match the real-world prediction performance of the models. The assessment measures both the systematic error component indicating how optimistic or pessimistic the protocol is on average and the random error component indicating the variability of the protocol's error. The results of our study suggest that employing the second protocol is advantageous, particularly in biomedical health studies where the precision of the research is of utmost importance. Our comprehensive empirical study offers new insights into the practical feasibility and usefulness of different evaluation protocols for classifiers trained on DP-synthetic data.
dc.format.pagerange: 118637
dc.format.pagerange: 118648
dc.identifier.jour-issn: 2169-3536
dc.identifier.olddbid: 204680
dc.identifier.oldhandle: 10024/187707
dc.identifier.uri: https://www.utupub.fi/handle/11111/53268
dc.identifier.url: https://ieeexplore.ieee.org/document/10643135
dc.identifier.urn: URN:NBN:fi-fe2025082786528
dc.language.iso: en
dc.okm.affiliatedauthor: Movahedi, Parisa
dc.okm.affiliatedauthor: Nieminen, Valtteri
dc.okm.affiliatedauthor: Montoya Perez, Ileana
dc.okm.affiliatedauthor: Daafane, Hiba
dc.okm.affiliatedauthor: Sukhwal, Dishant
dc.okm.affiliatedauthor: Pahikkala, Tapio
dc.okm.affiliatedauthor: Airola, Antti
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 3111 Biomedicine (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.discipline: 3111 Biolääketieteet (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.publisher.country: United States (en_GB)
dc.publisher.country: Yhdysvallat (USA) (fi_FI)
dc.publisher.country-code: US
dc.publisher.place: PISCATAWAY
dc.relation.doi: 10.1109/ACCESS.2024.3446913
dc.relation.ispartofjournal: IEEE Access
dc.relation.volume: 12
dc.source.identifier: https://www.utupub.fi/handle/10024/187707
dc.title: Benchmarking Evaluation Protocols for Classifiers Trained on Differentially Private Synthetic Data
dc.year.issued: 2024

Files

Name: Benchmarking_Evaluation_Protocols_for_Classifiers_Trained_on_Differentially_Private_Synthetic_Data.pdf
Size: 2.88 MB
Format: Adobe Portable Document Format