A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora

Aleksi Vesanto; Asko Nivala; Tapio Salakoski; Hannu Salmi; Filip Ginter

dc.contributor.author	Aleksi Vesanto
dc.contributor.author	Asko Nivala
dc.contributor.author	Tapio Salakoski
dc.contributor.author	Hannu Salmi
dc.contributor.author	Filip Ginter
dc.contributor.organization	fi=kieli- ja puheteknologia|en=Language and Speech Technology|
dc.contributor.organization-code	1.2.246.10.2458963.20.47465613983
dc.converis.publication-id	20563854
dc.converis.url	https://research.utu.fi/converis/portal/Publication/20563854
dc.date.accessioned	2025-08-28T01:55:04Z
dc.date.available	2025-08-28T01:55:04Z
dc.description.abstract	We present software for retrieving and exploring duplicated text passages in historical text corpora with low-quality OCR. The system combines NCBI BLAST, a tool created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910.
dc.format.pagerange	333
dc.identifier.isbn	978-91-7685-601-7
dc.identifier.issn	1650-3686
dc.identifier.olddbid	208277
dc.identifier.oldhandle	10024/191304
dc.identifier.uri	https://www.utupub.fi/handle/11111/57685
dc.identifier.url	http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf
dc.identifier.urn	URN:NBN:fi-fe2021042716764
dc.language.iso	en
dc.okm.affiliatedauthor	Vesanto, Aleksi
dc.okm.affiliatedauthor	Nivala, Asko
dc.okm.affiliatedauthor	Salakoski, Tapio
dc.okm.affiliatedauthor	Salmi, Hannu
dc.okm.affiliatedauthor	Ginter, Filip
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.internationalcopublication	not an international co-publication
dc.okm.internationality	International publication
dc.okm.type	A4 Conference Article
dc.publisher.country	Sweden	en_GB
dc.publisher.country	Ruotsi	fi_FI
dc.publisher.country-code	SE
dc.publisher.place	Gothenburg
dc.relation.conference	Nordic Conference of Computational Linguistics
dc.relation.ispartofseries	NEALT Proceedings Series
dc.relation.volume	29
dc.source.identifier	https://www.utupub.fi/handle/10024/191304
dc.title	A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora
dc.title.book	Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
dc.year.issued	2017

Files

Name: ecp17131049.pdf
Size: 377.78 KB
Format: Adobe Portable Document Format
Description: Final version

Collections

Self-archived publications