Machine learning approaches in microbiome research: challenges and best practices

dc.contributor.author: Papoutsoglou Georgios
dc.contributor.author: Tarazona Sonia
dc.contributor.author: Lopes Marta B.
dc.contributor.author: Klammsteiner Thomas
dc.contributor.author: Ibrahimi Eliana
dc.contributor.author: Eckenberger Julia
dc.contributor.author: Novielli Pierfrancesco
dc.contributor.author: Tonda Alberto
dc.contributor.author: Simeon Andrea
dc.contributor.author: Shigdel Rajesh
dc.contributor.author: Béreux Stéphane
dc.contributor.author: Vitali Giacomo
dc.contributor.author: Tangaro Sabina
dc.contributor.author: Lahti Leo
dc.contributor.author: Temko Andriy
dc.contributor.author: Claesson Marcus J.
dc.contributor.author: Berland Magali
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 181463346
dc.converis.url: https://research.utu.fi/converis/portal/Publication/181463346
dc.date.accessioned: 2025-08-28T02:52:33Z
dc.date.available: 2025-08-28T02:52:33Z
dc.description.abstract: Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection, pipeline creation, and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrate that the use of compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm, in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression, coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots, can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
dc.identifier.eissn: 1664-302X
dc.identifier.olddbid: 209864
dc.identifier.oldhandle: 10024/192891
dc.identifier.uri: https://www.utupub.fi/handle/11111/49705
dc.identifier.url: https://doi.org/10.3389/fmicb.2023.1261889
dc.identifier.urn: URN:NBN:fi-fe2025082788470
dc.language.iso: en
dc.okm.affiliatedauthor: Lahti, Leo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 1183 Plant biology, microbiology, virology [en_GB]
dc.okm.discipline: 3111 Biomedicine [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 1183 Kasvibiologia, mikrobiologia, virologia [fi_FI]
dc.okm.discipline: 3111 Biolääketieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A2 Scientific Article
dc.publisher: Frontiers Research Foundation
dc.publisher.country: Switzerland [en_GB]
dc.publisher.country: Sveitsi [fi_FI]
dc.publisher.country-code: CH
dc.relation.articlenumber: 1261889
dc.relation.doi: 10.3389/fmicb.2023.1261889
dc.relation.ispartofjournal: Frontiers in microbiology
dc.relation.volume: 14
dc.source.identifier: https://www.utupub.fi/handle/10024/192891
dc.title: Machine learning approaches in microbiome research: challenges and best practices
dc.year.issued: 2023

Files

Name: fmicb-14-1261889.pdf
Size: 4.09 MB
Format: Adobe Portable Document Format