Lifestyle factors in the biomedical literature: An ontology and comprehensive resources for named entity recognition

dc.contributor.author: Nourani, Esmaeil
dc.contributor.author: Koutrouli, Mikaela
dc.contributor.author: Xie, Yijia
dc.contributor.author: Vagiaki, Danai
dc.contributor.author: Pyysalo, Sampo
dc.contributor.author: Nastou, Katerina
dc.contributor.author: Brunak, Søren
dc.contributor.author: Jensen, Lars Juhl
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 458967337
dc.converis.url: https://research.utu.fi/converis/portal/Publication/458967337
dc.date.accessioned: 2025-08-27T22:57:24Z
dc.date.available: 2025-08-27T22:57:24Z
dc.description.abstract: <p>Motivation</p><p>Although lifestyle factors (LSFs) are increasingly acknowledged as shaping individual health trajectories, particularly in chronic diseases, they have still not been systematically described in the biomedical literature. This is in part because no named entity recognition (NER) system exists that can comprehensively detect all types of LSFs in text. The task is challenging due to the inherent diversity of LSFs, the lack of a comprehensive LSF classification for dictionary-based NER, and the lack of a corpus for deep learning-based NER.</p><p>Results</p><p>We present a novel lifestyle factor ontology (LSFO), which we used to develop a dictionary-based system for recognition and normalization of LSFs. Additionally, we introduce a manually annotated corpus for LSFs (LSF200) suitable for training and evaluation of NER systems, and use it to train a transformer-based system. Evaluating both NER systems on the corpus revealed an F-score of 64% for the dictionary-based system and 76% for the transformer-based system. Large-scale application of these systems to PubMed abstracts and PMC Open Access articles identified over 300 million mentions of LSFs in the biomedical literature.</p><p>Availability and implementation</p><p>LSFO, the annotated LSF200 corpus, and the LSFs detected in PubMed and PMC-OA articles by both NER systems are available under open licenses via the following GitHub repository: https://github.com/EsmaeilNourani/LSFO-expansion. This repository contains links to two associated GitHub repositories and a Zenodo project related to the study. LSFO is also available at BioPortal: https://bioportal.bioontology.org/ontologies/LSFO.</p>
dc.identifier.eissn: 1367-4811
dc.identifier.jour-issn: 1367-4803
dc.identifier.olddbid: 203101
dc.identifier.oldhandle: 10024/186128
dc.identifier.uri: https://www.utupub.fi/handle/11111/50707
dc.identifier.url: https://doi.org/10.1093/bioinformatics/btae613
dc.identifier.urn: URN:NBN:fi-fe2025082789998
dc.language.iso: en
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 3111 Biomedicine [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 3111 Biolääketieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Oxford University Press (OUP)
dc.publisher.country: United Kingdom [en_GB]
dc.publisher.country: Britannia [fi_FI]
dc.publisher.country-code: GB
dc.relation.articlenumber: btae613
dc.relation.doi: 10.1093/bioinformatics/btae613
dc.relation.ispartofjournal: Bioinformatics
dc.relation.issue: 11
dc.relation.volume: 40
dc.source.identifier: https://www.utupub.fi/handle/10024/186128
dc.title: Lifestyle factors in the biomedical literature: An ontology and comprehensive resources for named entity recognition
dc.year.issued: 2024

Files

Name: btae613.pdf
Size: 2.23 MB
Format: Adobe Portable Document Format