Dealing with dimensionality: the application of machine learning to multi-omics data

dc.contributor.author: Feldner-Busztin Dylan
dc.contributor.author: Nisantzis Panos F.
dc.contributor.author: Edmunds Shelley J.
dc.contributor.author: Boza Gergely
dc.contributor.author: Racimo Fernando
dc.contributor.author: Gopalakrishnan Shyam
dc.contributor.author: Limborg Morten T.
dc.contributor.author: Lahti Leo
dc.contributor.author: de Polavieja Gonzalo G.
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 178948715
dc.converis.url: https://research.utu.fi/converis/portal/Publication/178948715
dc.date.accessioned: 2025-08-28T00:18:46Z
dc.date.available: 2025-08-28T00:18:46Z
dc.description.abstract: Motivation: Machine learning (ML) methods are motivated by the need to automate information extraction from large datasets in order to support human users in data-driven tasks. This is an attractive approach for integrative joint analysis of vast amounts of omics data produced in next generation sequencing and other -omics assays. A systematic assessment of the current literature can help to identify key trends and potential gaps in methodology and applications. We surveyed the literature on ML multi-omic data integration and quantitatively explored the goals, techniques and data involved in this field. We were particularly interested in examining how researchers use ML to deal with the volume and complexity of these datasets. Results: Our main finding is that the methods used are those that address the challenges of datasets with few samples and many features. Dimensionality reduction methods are used to reduce the feature count alongside models that can also appropriately handle relatively few samples. Popular techniques include autoencoders, random forests and support vector machines. We also found that the field is heavily influenced by the use of The Cancer Genome Atlas dataset, which is accessible and contains many diverse experiments. Availability and implementation: All data and processing scripts are available at this GitLab repository: https://gitlab.com/polavieja_lab/ml_multi-omics_review/ or in Zenodo: https://doi.org/10.5281/zenodo.7361807. Supplementary information: Supplementary data are available at Bioinformatics online.
dc.identifier.eissn: 1367-4811
dc.identifier.jour-issn: 1367-4803
dc.identifier.olddbid: 205503
dc.identifier.oldhandle: 10024/188530
dc.identifier.uri: https://www.utupub.fi/handle/11111/54907
dc.identifier.url: https://doi.org/10.1093/bioinformatics/btad021
dc.identifier.urn: URN:NBN:fi-fe2023032132634
dc.language.iso: en
dc.okm.affiliatedauthor: Lahti, Leo
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A2 Scientific Article
dc.publisher: OXFORD UNIV PRESS
dc.publisher.country: United Kingdom (en_GB)
dc.publisher.country: Britannia (fi_FI)
dc.publisher.country-code: GB
dc.relation.articlenumber: btad021
dc.relation.doi: 10.1093/bioinformatics/btad021
dc.relation.ispartofjournal: Bioinformatics
dc.relation.issue: 2
dc.relation.volume: 39
dc.source.identifier: https://www.utupub.fi/handle/10024/188530
dc.title: Dealing with dimensionality: the application of machine learning to multi-omics data
dc.year.issued: 2023

Files

Name: btad021.pdf
Size: 4.1 MB
Format: Adobe Portable Document Format