Improving dictionary-based named entity recognition with deep learning

dc.contributor.author: Nastou, Katerina
dc.contributor.author: Koutrouli, Mikaela
dc.contributor.author: Pyysalo, Sampo
dc.contributor.author: Jensen, Lars Jyhl
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 457882809
dc.converis.url: https://research.utu.fi/converis/portal/Publication/457882809
dc.date.accessioned: 2025-08-28T01:37:46Z
dc.date.available: 2025-08-28T01:37:46Z
dc.description.abstract: Motivation: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. Results: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%).
dc.format.pagerange: ii45
dc.format.pagerange: ii52
dc.identifier.eissn: 1367-4811
dc.identifier.jour-issn: 1367-4803
dc.identifier.olddbid: 207811
dc.identifier.oldhandle: 10024/190838
dc.identifier.uri: https://www.utupub.fi/handle/11111/57232
dc.identifier.url: https://doi.org/10.1093/bioinformatics/btae402
dc.identifier.urn: URN:NBN:fi-fe2025082787792
dc.language.iso: en
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 3111 Biomedicine [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 3111 Biolääketieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Oxford University Press
dc.publisher.country: United Kingdom [en_GB]
dc.publisher.country: Britannia [fi_FI]
dc.publisher.country-code: GB
dc.relation.doi: 10.1093/bioinformatics/btae402
dc.relation.ispartofjournal: Bioinformatics
dc.relation.issue: 2 Supplement
dc.relation.volume: 40
dc.source.identifier: https://www.utupub.fi/handle/10024/190838
dc.title: Improving dictionary-based named entity recognition with deep learning
dc.year.issued: 2024

Files

Name: btae402.pdf
Size: 1.18 MB
Format: Adobe Portable Document Format