Comparison of missing data handling methods for variant pathogenicity predictors

Särkkä, Mikko; Myöhänen, Sami; Marinov, Kaloyan; Saarinen, Inka; Lahti, Leo; Fortino, Vittorio; Paananen, Jussi

dc.contributor.author	Särkkä, Mikko
dc.contributor.author	Myöhänen, Sami
dc.contributor.author	Marinov, Kaloyan
dc.contributor.author	Saarinen, Inka
dc.contributor.author	Lahti, Leo
dc.contributor.author	Fortino, Vittorio
dc.contributor.author	Paananen, Jussi
dc.contributor.organization	fi=data-analytiikka\|en=Data-analytiikka\|
dc.contributor.organization-code	1.2.246.10.2458963.20.68940835793
dc.converis.publication-id	504736398
dc.converis.url	https://research.utu.fi/converis/portal/Publication/504736398
dc.date.accessioned	2026-01-21T12:35:30Z
dc.date.available	2026-01-21T12:35:30Z
dc.description.abstract	Modern clinical genetic tests utilize next-generation sequencing (NGS) approaches to comprehensively analyze genetic variants from patients. Out of millions of variants, clinically relevant variants that match the patient's phenotype must be identified accurately and rapidly. As manual evaluation is not a feasible option for meeting the speed and volume requirements of clinical genetic testing, automated solutions are needed. Various machine learning (ML), artificial intelligence (AI), and in silico variant pathogenicity predictors have been developed to solve this challenge. These solutions rely on comprehensive data and struggle with sparse genetic annotations. Therefore, careful treatment of missing data is necessary, and the selected methods can strongly affect accuracy, reliability, speed, and associated computational costs. We present an open-source framework called AMISS that can be used to evaluate the performance of different methods for handling missing genetic variant data in the context of variant pathogenicity prediction. Using AMISS, we evaluated 14 methods for handling missing values. The performance of these methods varied substantially in terms of precision, computational costs, and other attributes. Overall, simpler imputation methods, and specifically mean imputation, performed best.
dc.identifier.eissn	2631-9268
dc.identifier.jour-issn	2631-9268
dc.identifier.olddbid	212706
dc.identifier.oldhandle	10024/195724
dc.identifier.uri	https://www.utupub.fi/handle/11111/53057
dc.identifier.url	https://doi.org/10.1093/nargab/lqaf133
dc.identifier.urn	URN:NBN:fi-fe202601217078
dc.language.iso	en
dc.okm.affiliatedauthor	Lahti, Leo
dc.okm.discipline	112 Statistics and probability	en_GB
dc.okm.internationalcopublication	not an international co-publication
dc.okm.internationality	International publication
dc.okm.type	A1 ScientificArticle
dc.publisher	Oxford University Press (OUP)
dc.publisher.country	United Kingdom	en_GB
dc.publisher.country	Britannia	fi_FI
dc.publisher.country-code	GB
dc.relation.articlenumber	lqaf133
dc.relation.doi	10.1093/nargab/lqaf133
dc.relation.ispartofjournal	NAR Genomics and Bioinformatics: Nucleic Acids Research Genomics and Bioinformatics
dc.relation.issue	4
dc.relation.volume	7
dc.source.identifier	https://www.utupub.fi/handle/10024/195724
dc.title	Comparison of missing data handling methods for variant pathogenicity predictors
dc.year.issued	2025

Files

Name: lqaf133.pdf
Size: 2.41 MB
Format: Adobe Portable Document Format

Collections

Self-archived publications