RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature

Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl

RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature

Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl

RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature

Nastou, Katerina

Mehryary, Farrokh

Ohta, Tomoko

Luoma, Jouni

Pyysalo, Sampo

Jensen, Lars Juhl

Katso/Avaa

baae095.pdf (5.353Mb)

Lataukset:

OXFORD UNIV PRESS

doi:10.1093/database/baae095

URI

https://doi.org/10.1093/database/baae095

Näytä kaikki kuvailutiedot

Julkaisun pysyvä osoite on:
https://urn.fi/URN:NBN:fi-fe2025082792547

Tiivistelmä

In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.

Kokoelmat

Rinnakkaistallenteet [29335]