CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes

Nastou, Katerina; Koutrouli, Mikaela; Pyysalo, Sampo; Jensen, Lars Juhl

dc.contributor.author	Nastou, Katerina
dc.contributor.author	Koutrouli, Mikaela
dc.contributor.author	Pyysalo, Sampo
dc.contributor.author	Jensen, Lars Juhl
dc.contributor.organization	Data Analytics
dc.contributor.organization-code	1.2.246.10.2458963.20.68940835793
dc.converis.publication-id	458834899
dc.converis.url	https://research.utu.fi/converis/portal/Publication/458834899
dc.date.accessioned	2025-08-27T22:49:29Z
dc.date.available	2025-08-27T22:49:29Z
dc.description.abstract	Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus. Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature. Availability and implementation: All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
dc.identifier.eissn	2635-0041
dc.identifier.olddbid	202875
dc.identifier.oldhandle	10024/185902
dc.identifier.uri	https://www.utupub.fi/handle/11111/50522
dc.identifier.url	https://doi.org/10.1093/bioadv/vbae116
dc.identifier.urn	URN:NBN:fi-fe2025082789927
dc.language.iso	en
dc.okm.affiliatedauthor	Pyysalo, Sampo
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.discipline	113 Tietojenkäsittely ja informaatiotieteet	fi_FI
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A1 ScientificArticle
dc.publisher	Oxford University Press (OUP)
dc.publisher.country	United Kingdom	en_GB
dc.publisher.country	Britannia	fi_FI
dc.publisher.country-code	GB
dc.publisher.place	OXFORD
dc.relation.articlenumber	vbae116
dc.relation.doi	10.1093/bioadv/vbae116
dc.relation.ispartofjournal	Bioinformatics Advances
dc.relation.issue	1
dc.relation.volume	4
dc.source.identifier	https://www.utupub.fi/handle/10024/185902
dc.title	CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes
dc.year.issued	2024

Files

Name: vbae116.pdf
Size: 606.46 KB
Format: Adobe Portable Document Format

Collections

Self-archived publications