LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations

dc.contributor.author: Nourani, Esmaeil
dc.contributor.author: Makri, Evangelia-Mantelena
dc.contributor.author: Mao, Xiqing
dc.contributor.author: Pyysalo, Sampo
dc.contributor.author: Brunak, Søren
dc.contributor.author: Nastou, Katerina
dc.contributor.author: Jensen, Lars Juhl
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 477997547
dc.converis.url: https://research.utu.fi/converis/portal/Publication/477997547
dc.date.accessioned: 2025-08-28T01:08:03Z
dc.date.available: 2025-08-28T01:08:03Z
dc.description.abstract: Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware models such as transformers are required to extract and classify these relations into specific relation types. However, no comprehensive LSF-disease RE system existed, nor a corpus suitable for developing one. We present LSD600 (available at https://zenodo.org/records/13952449), the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5027 diseases and 6930 LSF entities. We evaluated LSD600's quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multilabel RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications. Database URL: https://zenodo.org/records/13952449.
dc.identifier.eissn: 1758-0463
dc.identifier.jour-issn: 1758-0463
dc.identifier.olddbid: 207078
dc.identifier.oldhandle: 10024/190105
dc.identifier.uri: https://www.utupub.fi/handle/11111/50278
dc.identifier.url: https://doi.org/10.1093/database/baae129
dc.identifier.urn: URN:NBN:fi-fe2025082791500
dc.language.iso: en
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Oxford University Press (OUP)
dc.publisher.country: United Kingdom [en_GB]
dc.publisher.country: Britannia [fi_FI]
dc.publisher.country-code: GB
dc.relation.articlenumber: baae129
dc.relation.doi: 10.1093/database/baae129
dc.relation.ispartofjournal: Database: The Journal of Biological Databases and Curation
dc.relation.volume: 2025
dc.source.identifier: https://www.utupub.fi/handle/10024/190105
dc.title: LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations
dc.year.issued: 2025
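
The abstract reports F-scores for a multilabel relation extraction task. As a rough illustration of how such a score can be computed, the sketch below treats each (abstract, LSF entity, disease entity, relation type) tuple as one item and computes a micro-averaged F-score over exact matches. This is only an assumption for illustration: the record does not state which averaging scheme the authors used, and the relation type names and example tuples here are invented, not taken from LSD600.

```python
from typing import Iterable, Set, Tuple

# One relation instance: (abstract_id, lsf_entity, disease_entity, relation_type).
# In a multilabel setting the same entity pair may appear with several types;
# each (pair, type) combination is counted as a separate item.
Relation = Tuple[str, str, str, str]


def micro_f_score(gold: Iterable[Relation], predicted: Iterable[Relation]) -> float:
    """Micro-averaged F-score over exact-match relation tuples."""
    gold_set: Set[Relation] = set(gold)
    pred_set: Set[Relation] = set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations; the type labels are made up for this example.
gold = [
    ("doc1", "smoking", "lung cancer", "causes"),
    ("doc1", "exercise", "diabetes", "prevents"),
    ("doc2", "alcohol", "cirrhosis", "causes"),
]
predicted = [
    ("doc1", "smoking", "lung cancer", "causes"),
    ("doc1", "exercise", "diabetes", "treats"),  # wrong relation type
    ("doc2", "alcohol", "cirrhosis", "causes"),
]

score = micro_f_score(gold, predicted)
print(f"micro F-score: {score:.3f}")
```

Here two of three predicted relations match the gold annotations, so precision and recall are both 2/3 and the F-score is also 2/3. A prediction with the right entity pair but the wrong relation type counts as both a false positive and a false negative, which is one reason typed RE scores tend to be lower than co-occurrence detection scores.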

Files

Name: baae129.pdf
Size: 9.26 MB
Format: Adobe Portable Document Format