CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes

dc.contributor.author: Nastou, Katerina
dc.contributor.author: Koutrouli, Mikaela
dc.contributor.author: Pyysalo, Sampo
dc.contributor.author: Jensen, Lars Juhl
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 458834899
dc.converis.url: https://research.utu.fi/converis/portal/Publication/458834899
dc.date.accessioned: 2025-08-27T22:49:29Z
dc.date.available: 2025-08-27T22:49:29Z
dc.description.abstract: Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus. Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature. Availability and implementation: All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
dc.identifier.eissn: 2635-0041
dc.identifier.olddbid: 202875
dc.identifier.oldhandle: 10024/185902
dc.identifier.uri: https://www.utupub.fi/handle/11111/50522
dc.identifier.url: https://doi.org/10.1093/bioadv/vbae116
dc.identifier.urn: URN:NBN:fi-fe2025082789927
dc.language.iso: en
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Oxford University Press (OUP)
dc.publisher.country: United Kingdom (en_GB)
dc.publisher.country: Britannia (fi_FI)
dc.publisher.country-code: GB
dc.publisher.place: OXFORD
dc.relation.articlenumber: vbae116
dc.relation.doi: 10.1093/bioadv/vbae116
dc.relation.ispartofjournal: Bioinformatics Advances
dc.relation.issue: 1
dc.relation.volume: 4
dc.source.identifier: https://www.utupub.fi/handle/10024/185902
dc.title: CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes
dc.year.issued: 2024

Files

Name: vbae116.pdf
Size: 606.46 KB
Format: Adobe Portable Document Format