STRING-ing together protein complexes: Corpus and methods for extracting physical protein interactions from the biomedical literature

dc.contributor.author: Mehryary, Farrokh
dc.contributor.author: Nastou, Katerina
dc.contributor.author: Ohta, Tomoko
dc.contributor.author: Jensen, Lars Juhl
dc.contributor.author: Pyysalo, Sampo
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 457893544
dc.converis.url: https://research.utu.fi/converis/portal/Publication/457893544
dc.date.accessioned: 2025-08-28T02:48:58Z
dc.date.available: 2025-08-28T02:48:58Z
dc.description.abstract: MOTIVATION: Understanding biological processes relies heavily on curated knowledge of physical interactions between proteins. Yet, a notable gap remains between the information stored in databases of curated knowledge and the plethora of interactions documented in the scientific literature. RESULTS: To bridge this gap, we introduce ComplexTome, a manually annotated corpus designed to facilitate the development of text-mining methods for the extraction of complex formation relationships among biomedical entities targeting the downstream semantics of the physical interaction sub-network of the STRING database. This corpus comprises 1,287 documents with ∼3,500 relationships. We train a novel relation extraction model on this corpus and find that it can highly reliably identify physical protein interactions (F1-score = 82.8%). We additionally enhance the model's capabilities through unsupervised trigger word detection and apply it to extract relations and trigger words for these relations from all open publications in the domain literature. This information has been fully integrated into the latest version of the STRING database. AVAILABILITY AND IMPLEMENTATION: We provide the corpus, code, and all results produced by the large-scale runs of our systems on biomedical literature via Zenodo https://doi.org/10.5281/zenodo.8139716, GitHub https://github.com/farmeh/ComplexTome_extraction, and the latest version of the STRING database https://string-db.org/. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
dc.identifier.eissn: 1367-4811
dc.identifier.jour-issn: 1367-4803
dc.identifier.olddbid: 209753
dc.identifier.oldhandle: 10024/192780
dc.identifier.uri: https://www.utupub.fi/handle/11111/49390
dc.identifier.url: https://doi.org/10.1093/bioinformatics/btae552
dc.identifier.urn: URN:NBN:fi-fe2025082788433
dc.language.iso: en
dc.okm.affiliatedauthor: Mehryary, Farrokh
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 3111 Biomedicine [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 3111 Biolääketieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Oxford University Press
dc.publisher.country: United Kingdom [en_GB]
dc.publisher.country: Britannia [fi_FI]
dc.publisher.country-code: GB
dc.relation.articlenumber: btae552
dc.relation.doi: 10.1093/bioinformatics/btae552
dc.relation.ispartofjournal: Bioinformatics
dc.relation.issue: 9
dc.relation.volume: 40
dc.source.identifier: https://www.utupub.fi/handle/10024/192780
dc.title: STRING-ing together protein complexes: Corpus and methods for extracting physical protein interactions from the biomedical literature
dc.year.issued: 2024

Files

Name: btae552.pdf
Size: 1.59 MB
Format: Adobe Portable Document Format