A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora
| dc.contributor.author | Aleksi Vesanto | |
| dc.contributor.author | Asko Nivala | |
| dc.contributor.author | Tapio Salakoski | |
| dc.contributor.author | Hannu Salmi | |
| dc.contributor.author | Filip Ginter | |
| dc.contributor.organization | Turku Institute for Advanced Studies (TIAS) | |
| dc.contributor.organization | Language and Speech Technology | |
| dc.contributor.organization | Cultural History | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.19695555680 | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.47465613983 | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.78639161450 | |
| dc.converis.publication-id | 20563854 | |
| dc.converis.url | https://research.utu.fi/converis/portal/Publication/20563854 | |
| dc.date.accessioned | 2025-08-28T01:55:04Z | |
| dc.date.available | 2025-08-28T01:55:04Z | |
| dc.description.abstract | We present software for retrieving and exploring duplicated text passages in historical text corpora with low-quality OCR. The system combines NCBI BLAST, software created for comparing and aligning biological sequences, with the Solr search and indexing engine, providing a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910. | |
| dc.format.pagerange | 330 | |
| dc.format.pagerange | 333 | |
| dc.identifier.isbn | 978-91-7685-601-7 | |
| dc.identifier.issn | 1650-3686 | |
| dc.identifier.olddbid | 208277 | |
| dc.identifier.oldhandle | 10024/191304 | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/57685 | |
| dc.identifier.url | http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf | |
| dc.identifier.urn | URN:NBN:fi-fe2021042716764 | |
| dc.language.iso | en | |
| dc.okm.affiliatedauthor | Vesanto, Aleksi | |
| dc.okm.affiliatedauthor | Nivala, Asko | |
| dc.okm.affiliatedauthor | Salakoski, Tapio | |
| dc.okm.affiliatedauthor | Salmi, Hannu | |
| dc.okm.affiliatedauthor | Ginter, Filip | |
| dc.okm.discipline | 113 Computer and information sciences | en_GB |
| dc.okm.discipline | 615 History and archaeology | en_GB |
| dc.okm.discipline | 113 Tietojenkäsittely ja informaatiotieteet | fi_FI |
| dc.okm.discipline | 615 Historia ja arkeologia | fi_FI |
| dc.okm.internationalcopublication | not an international co-publication | |
| dc.okm.internationality | International publication | |
| dc.okm.type | A4 Conference Article | |
| dc.publisher.country | Sweden | en_GB |
| dc.publisher.country | Ruotsi | fi_FI |
| dc.publisher.country-code | SE | |
| dc.publisher.place | Gothenburg | |
| dc.relation.conference | Nordic Conference of Computational Linguistics | |
| dc.relation.ispartofseries | NEALT Proceedings Series | |
| dc.relation.volume | 29 | |
| dc.source.identifier | https://www.utupub.fi/handle/10024/191304 | |
| dc.title | A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora | |
| dc.title.book | Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden | |
| dc.year.issued | 2017 |
Files
- Name: ecp17131049.pdf
- Size: 377.78 KB
- Format: Adobe Portable Document Format
- Description: Final version