A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora

Hannu Salmi; Asko Nivala; Tapio Salakoski; Aleksi Vesanto; Filip Ginter

URI

http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf

The permanent address of this publication is:
https://urn.fi/URN:NBN:fi-fe2021042716764

Abstract

We present a software system for retrieving and exploring duplicated text passages in low-quality OCR historical text corpora. The system combines NCBI BLAST, a software tool created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910.
