Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations

Miranda-Escalada Antonio; Mehryary Farrokh; Luoma Jouni; Estrada-Zavala Darryl; Gasco Luis; Pyysalo Sampo; Valencia Alfonso; Krallinger Martin

dc.contributor.author	Miranda-Escalada Antonio
dc.contributor.author	Mehryary Farrokh
dc.contributor.author	Luoma Jouni
dc.contributor.author	Estrada-Zavala Darryl
dc.contributor.author	Gasco Luis
dc.contributor.author	Pyysalo Sampo
dc.contributor.author	Valencia Alfonso
dc.contributor.author	Krallinger Martin
dc.contributor.organization	fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code	1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code	2610301
dc.converis.publication-id	182074928
dc.converis.url	https://research.utu.fi/converis/portal/Publication/182074928
dc.date.accessioned	2025-08-27T22:40:27Z
dc.date.available	2025-08-27T22:40:27Z
dc.description.abstract	It is getting increasingly challenging to efficiently exploit drug-related information described in the growing amount of scientific literature. Indeed, for drug–gene/protein interactions, the challenge is even bigger, considering the scattered information sources and types of interactions. However, their systematic, large-scale exploitation is key for developing tools, impacting knowledge fields as diverse as drug design or metabolic pathway research. Previous efforts in the extraction of drug–gene/protein interactions from the literature did not address these scalability and granularity issues. To tackle them, we have organized the DrugProt track at BioCreative VII. In the context of the track, we have released the DrugProt Gold Standard corpus, a collection of 5000 PubMed abstracts, manually annotated with granular drug–gene/protein interactions. We have proposed a novel large-scale track to evaluate the capacity of natural language processing systems to scale to the range of millions of documents, and generate with their predictions a silver standard knowledge graph of 53 993 602 nodes and 19 367 406 edges. Its use exceeds the shared task and points toward pharmacological and biological applications such as drug discovery or continuous database curation. Finally, we have created a persistent evaluation scenario on CodaLab to continuously evaluate new relation extraction systems that may arise. Thirty teams from four continents, which involved 110 people, sent 107 submission runs for the Main DrugProt track, and nine teams submitted 21 runs for the Large Scale DrugProt track. Most participants implemented deep learning approaches based on pretrained transformer-like language models (LMs) such as BERT or BioBERT, reaching precision and recall values as high as 0.9167 and 0.9542 for some relation types. Finally, some initial explorations of the applicability of the knowledge graph have shown its potential to explore the chemical–protein relations described in the literature, or chemical compound–enzyme interactions.
dc.identifier.eissn	1758-0463
dc.identifier.jour-issn	1758-0463
dc.identifier.olddbid	202589
dc.identifier.oldhandle	10024/185616
dc.identifier.uri	https://www.utupub.fi/handle/11111/47689
dc.identifier.url	https://doi.org/10.1093/database/baad080
dc.identifier.urn	URN:NBN:fi-fe2025082785775
dc.language.iso	en
dc.okm.affiliatedauthor	Mehryary, Farrokh
dc.okm.affiliatedauthor	Luoma, Jouni
dc.okm.affiliatedauthor	Pyysalo, Sampo
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.discipline	113 Tietojenkäsittely ja informaatiotieteet	fi_FI
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A1 ScientificArticle
dc.publisher	Oxford University Press
dc.publisher.country	United Kingdom	en_GB
dc.publisher.country	Britannia	fi_FI
dc.publisher.country-code	GB
dc.relation.articlenumber	baad080
dc.relation.doi	10.1093/database/baad080
dc.relation.ispartofjournal	Database: The Journal of Biological Databases and Curation
dc.relation.volume	2023
dc.source.identifier	https://www.utupub.fi/handle/10024/185616
dc.title	Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations
dc.year.issued	2023

Files

Name: baad080.pdf
Size: 7.8 MB
Format: Adobe Portable Document Format

Collections

Rinnakkaistallenteet (self-archived publications)