A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora

dc.contributor.authorAleksi Vesanto
dc.contributor.authorAsko Nivala
dc.contributor.authorTapio Salakoski
dc.contributor.authorHannu Salmi
dc.contributor.authorFilip Ginter
dc.contributor.organizationfi=Turun ihmistieteiden tutkijakollegium (TIAS)|en=Turku Institute for Advanced Studies (TIAS)|
dc.contributor.organizationfi=kieli- ja puheteknologia|en=Language and Speech Technology|
dc.contributor.organizationfi=kulttuurihistoria|en=Cultural History|
dc.contributor.organization-code1.2.246.10.2458963.20.19695555680
dc.contributor.organization-code1.2.246.10.2458963.20.47465613983
dc.contributor.organization-code1.2.246.10.2458963.20.78639161450
dc.converis.publication-id20563854
dc.converis.urlhttps://research.utu.fi/converis/portal/Publication/20563854
dc.date.accessioned2025-08-28T01:55:04Z
dc.date.available2025-08-28T01:55:04Z
dc.description.abstractWe present software for retrieving and exploring duplicated text passages in historical text corpora with low-quality OCR. The system combines NCBI BLAST, a tool created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910.
dc.format.pagerange330
dc.format.pagerange333
dc.identifier.isbn978-91-7685-601-7
dc.identifier.issn1650-3686
dc.identifier.olddbid208277
dc.identifier.oldhandle10024/191304
dc.identifier.urihttps://www.utupub.fi/handle/11111/57685
dc.identifier.urlhttp://www.ep.liu.se/ecp/131/049/ecp17131049.pdf
dc.identifier.urnURN:NBN:fi-fe2021042716764
dc.language.isoen
dc.okm.affiliatedauthorVesanto, Aleksi
dc.okm.affiliatedauthorNivala, Asko
dc.okm.affiliatedauthorSalakoski, Tapio
dc.okm.affiliatedauthorSalmi, Hannu
dc.okm.affiliatedauthorGinter, Filip
dc.okm.discipline113 Computer and information sciencesen_GB
dc.okm.discipline615 History and archaeologyen_GB
dc.okm.discipline113 Tietojenkäsittely ja informaatiotieteetfi_FI
dc.okm.discipline615 Historia ja arkeologiafi_FI
dc.okm.internationalcopublicationnot an international co-publication
dc.okm.internationalityInternational publication
dc.okm.typeA4 Conference Article
dc.publisher.countrySwedenen_GB
dc.publisher.countryRuotsifi_FI
dc.publisher.country-codeSE
dc.publisher.placeGothenburg
dc.relation.conferenceNordic Conference of Computational Linguistics
dc.relation.ispartofseriesNEALT Proceedings Series
dc.relation.volume29
dc.source.identifierhttps://www.utupub.fi/handle/10024/191304
dc.titleA System for Identifying and Exploring Text Repetition in Large Historical Document Corpora
dc.title.bookProceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
dc.year.issued2017

Files

Name:
ecp17131049.pdf
Size:
377.78 KB
Format:
Adobe Portable Document Format
Description:
Final version