A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora

Aleksi Vesanto; Asko Nivala; Tapio Salakoski; Hannu Salmi; Filip Ginter

Permanent address

https://urn.fi/URN:NBN:fi-fe2021042716764

Abstract

We present a software system for retrieving and exploring duplicated text passages in historical text corpora with low-quality OCR. The system combines NCBI BLAST, a tool created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910.

Series

NEALT Proceedings Series
