RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature

dc.contributor.author: Nastou, Katerina
dc.contributor.author: Mehryary, Farrokh
dc.contributor.author: Ohta, Tomoko
dc.contributor.author: Luoma, Jouni
dc.contributor.author: Pyysalo, Sampo
dc.contributor.author: Jensen, Lars Juhl
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=tietotekniikan laitos|en=Department of Computing|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code: 1.2.246.10.2458963.20.85312822902
dc.converis.publication-id: 458222413
dc.converis.url: https://research.utu.fi/converis/portal/Publication/458222413
dc.date.accessioned: 2025-08-28T02:55:43Z
dc.date.available: 2025-08-28T02:55:43Z
dc.description.abstract: In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
dc.identifier.jour-issn: 1758-0463
dc.identifier.olddbid: 209937
dc.identifier.oldhandle: 10024/192964
dc.identifier.uri: https://www.utupub.fi/handle/11111/50001
dc.identifier.url: https://doi.org/10.1093/database/baae095
dc.identifier.urn: URN:NBN:fi-fe2025082792547
dc.language.iso: en
dc.okm.affiliatedauthor: Mehryary, Farrokh
dc.okm.affiliatedauthor: Luoma, Jouni
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 3111 Biomedicine [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 3111 Biolääketieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: OXFORD UNIV PRESS
dc.publisher.country: United Kingdom [en_GB]
dc.publisher.country: Britannia [fi_FI]
dc.publisher.country-code: GB
dc.publisher.place: OXFORD
dc.relation.articlenumber: baae095
dc.relation.doi: 10.1093/database/baae095
dc.relation.ispartofjournal: Database: The Journal of Biological Databases and Curation
dc.relation.volume: 2024
dc.source.identifier: https://www.utupub.fi/handle/10024/192964
dc.title: RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature
dc.year.issued: 2024

Files

Name: baae095.pdf
Size: 5.35 MB
Format: Adobe Portable Document Format