Improving dictionary-based named entity recognition with deep learning

Nastou, Katerina; Koutrouli, Mikaela; Pyysalo, Sampo; Jensen, Lars Jyhl

dc.contributor.author	Nastou, Katerina
dc.contributor.author	Koutrouli, Mikaela
dc.contributor.author	Pyysalo, Sampo
dc.contributor.author	Jensen, Lars Jyhl
dc.contributor.organization	fi=data-analytiikka\|en=Data-analytiikka\|
dc.contributor.organization-code	1.2.246.10.2458963.20.68940835793
dc.converis.publication-id	457882809
dc.converis.url	https://research.utu.fi/converis/portal/Publication/457882809
dc.date.accessioned	2025-08-28T01:37:46Z
dc.date.available	2025-08-28T01:37:46Z
dc.description.abstract	Motivation: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. Results: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, such as the STRING database, with only a minor drop in recall (0.6%).
dc.format.pagerange	ii52
dc.identifier.eissn	1367-4811
dc.identifier.jour-issn	1367-4803
dc.identifier.olddbid	207811
dc.identifier.oldhandle	10024/190838
dc.identifier.uri	https://www.utupub.fi/handle/11111/57232
dc.identifier.url	https://doi.org/10.1093/bioinformatics/btae402
dc.identifier.urn	URN:NBN:fi-fe2025082787792
dc.language.iso	en
dc.okm.affiliatedauthor	Pyysalo, Sampo
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A1 ScientificArticle
dc.publisher	Oxford University Press
dc.publisher.country	United Kingdom	en_GB
dc.publisher.country	Britannia	fi_FI
dc.publisher.country-code	GB
dc.relation.doi	10.1093/bioinformatics/btae402
dc.relation.ispartofjournal	Bioinformatics
dc.relation.issue	2 Supplement
dc.relation.volume	40
dc.source.identifier	https://www.utupub.fi/handle/10024/190838
dc.title	Improving dictionary-based named entity recognition with deep learning
dc.year.issued	2024

Files

Name: btae402.pdf
Size: 1.18 MB
Format: Adobe Portable Document Format

Collections

Self-archived publications (Rinnakkaistallenteet)