S1000: a better taxonomic name corpus for biomedical information extraction

dc.contributor.author: Luoma Jouni
dc.contributor.author: Nastou Katerina
dc.contributor.author: Ohta Tomoko
dc.contributor.author: Toivonen Harttu
dc.contributor.author: Pafilis Evangelos
dc.contributor.author: Jensen Lars Juhl
dc.contributor.author: Pyysalo Sampo
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code: 2610301
dc.converis.publication-id: 180376416
dc.converis.url: https://research.utu.fi/converis/portal/Publication/180376416
dc.date.accessioned: 2025-08-27T23:11:15Z
dc.date.available: 2025-08-27T23:11:15Z
dc.description.abstract: Motivation: The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora. Results: We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), both for deep learning and dictionary-based methods. Availability and implementation: All resources introduced in this study are available under open licenses from https://jensenlab.org/resources/s1000/. The webpage contains links to a Zenodo project and three GitHub repositories associated with the study.
dc.identifier.eissn: 1367-4811
dc.identifier.jour-issn: 1367-4803
dc.identifier.olddbid: 203563
dc.identifier.oldhandle: 10024/186590
dc.identifier.uri: https://www.utupub.fi/handle/11111/39678
dc.identifier.url: https://doi.org/10.1093/bioinformatics/btad369
dc.identifier.urn: URN:NBN:fi-fe2025082786118
dc.language.iso: en
dc.okm.affiliatedauthor: Luoma, Jouni
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: OXFORD UNIV PRESS
dc.publisher.country: United Kingdom (en_GB)
dc.publisher.country: Britannia (fi_FI)
dc.publisher.country-code: GB
dc.relation.articlenumber: btad369
dc.relation.doi: 10.1093/bioinformatics/btad369
dc.relation.ispartofjournal: Bioinformatics
dc.relation.issue: 6
dc.relation.volume: 39
dc.source.identifier: https://www.utupub.fi/handle/10024/186590
dc.title: S1000: a better taxonomic name corpus for biomedical information extraction
dc.year.issued: 2023

Files

Name: btad369.pdf
Size: 767.65 KB
Format: Adobe Portable Document Format