S1000: a better taxonomic name corpus for biomedical information extraction

Luoma Jouni; Nastou Katerina; Ohta Tomoko; Toivonen Harttu; Pafilis Evangelos; Jensen Lars Juhl; Pyysalo Sampo

dc.contributor.author	Luoma Jouni
dc.contributor.author	Nastou Katerina
dc.contributor.author	Ohta Tomoko
dc.contributor.author	Toivonen Harttu
dc.contributor.author	Pafilis Evangelos
dc.contributor.author	Jensen Lars Juhl
dc.contributor.author	Pyysalo Sampo
dc.contributor.organization	fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code	2610301
dc.converis.publication-id	180376416
dc.converis.url	https://research.utu.fi/converis/portal/Publication/180376416
dc.date.accessioned	2025-08-27T23:11:15Z
dc.date.available	2025-08-27T23:11:15Z
dc.description.abstract	Motivation: The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora. Results: We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), both for deep learning and dictionary-based methods. Availability and implementation: All resources introduced in this study are available under open licenses from https://jensenlab.org/resources/s1000/. The webpage contains links to a Zenodo project and three GitHub repositories associated with the study.
dc.identifier.eissn	1367-4811
dc.identifier.jour-issn	1367-4803
dc.identifier.olddbid	203563
dc.identifier.oldhandle	10024/186590
dc.identifier.uri	https://www.utupub.fi/handle/11111/39678
dc.identifier.url	https://doi.org/10.1093/bioinformatics/btad369
dc.identifier.urn	URN:NBN:fi-fe2025082786118
dc.language.iso	en
dc.okm.affiliatedauthor	Luoma, Jouni
dc.okm.affiliatedauthor	Pyysalo, Sampo
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.discipline	113 Tietojenkäsittely ja informaatiotieteet	fi_FI
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A1 ScientificArticle
dc.publisher	OXFORD UNIV PRESS
dc.publisher.country	United Kingdom	en_GB
dc.publisher.country	Britannia	fi_FI
dc.publisher.country-code	GB
dc.relation.articlenumber	btad369
dc.relation.doi	10.1093/bioinformatics/btad369
dc.relation.ispartofjournal	Bioinformatics
dc.relation.issue	6
dc.relation.volume	39
dc.source.identifier	https://www.utupub.fi/handle/10024/186590
dc.title	S1000: a better taxonomic name corpus for biomedical information extraction
dc.year.issued	2023

Files

Name: btad369.pdf
Size: 767.65 KB
Format: Adobe Portable Document Format

Collections

Self-archived publications