Comparison of missing data handling methods for variant pathogenicity predictors

dc.contributor.author: Särkkä, Mikko
dc.contributor.author: Myöhänen, Sami
dc.contributor.author: Marinov, Kaloyan
dc.contributor.author: Saarinen, Inka
dc.contributor.author: Lahti, Leo
dc.contributor.author: Fortino, Vittorio
dc.contributor.author: Paananen, Jussi
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 504736398
dc.converis.url: https://research.utu.fi/converis/portal/Publication/504736398
dc.date.accessioned: 2026-01-21T12:35:30Z
dc.date.available: 2026-01-21T12:35:30Z
dc.description.abstract: Modern clinical genetic tests use next-generation sequencing (NGS) to comprehensively analyze genetic variants from patients. Out of millions of variants, the clinically relevant ones that match the patient's phenotype must be identified accurately and rapidly. Because manual evaluation cannot meet the speed and volume requirements of clinical genetic testing, automated solutions are needed. Various machine learning (ML), artificial intelligence (AI), and in silico variant pathogenicity predictors have been developed to address this challenge. These solutions rely on comprehensive data and struggle with sparse genetic annotations. Careful treatment of missing data is therefore necessary, and the choice of method can strongly affect accuracy, reliability, speed, and computational cost. We present an open-source framework called AMISS that can be used to evaluate the performance of different methods for handling missing genetic variant data in the context of variant pathogenicity prediction. Using AMISS, we evaluated 14 methods for handling missing values. The performance of these methods varied substantially in terms of precision, computational cost, and other attributes. Overall, simpler imputation methods, and mean imputation in particular, performed best.
dc.identifier.eissn: 2631-9268
dc.identifier.jour-issn: 2631-9268
dc.identifier.olddbid: 212706
dc.identifier.oldhandle: 10024/195724
dc.identifier.uri: https://www.utupub.fi/handle/11111/53057
dc.identifier.url: https://doi.org/10.1093/nargab/lqaf133
dc.identifier.urn: URN:NBN:fi-fe202601217078
dc.language.iso: en
dc.okm.affiliatedauthor: Lahti, Leo
dc.okm.discipline: 112 Statistics and probability [en_GB]
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 3111 Biomedicine [en_GB]
dc.okm.discipline: 112 Tilastotiede [fi_FI]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 3111 Biolääketieteet [fi_FI]
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Oxford University Press (OUP)
dc.publisher.country: United Kingdom [en_GB]
dc.publisher.country: Britannia [fi_FI]
dc.publisher.country-code: GB
dc.relation.articlenumber: lqaf133
dc.relation.doi: 10.1093/nargab/lqaf133
dc.relation.ispartofjournal: NAR Genomics and Bioinformatics: Nucleic Acids Research Genomics and Bioinformatics
dc.relation.issue: 4
dc.relation.volume: 7
dc.source.identifier: https://www.utupub.fi/handle/10024/195724
dc.title: Comparison of missing data handling methods for variant pathogenicity predictors
dc.year.issued: 2025

Files

Name: lqaf133.pdf
Size: 2.41 MB
Format: Adobe Portable Document Format