OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches

dc.contributor.author: Kanerva, Jenna
dc.contributor.author: Ledins, Cassandra
dc.contributor.author: Käpyaho, Siiri
dc.contributor.author: Ginter, Filip
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 506501669
dc.converis.url: https://research.utu.fi/converis/portal/Publication/506501669
dc.date.accessioned: 2026-01-21T12:32:57Z
dc.date.available: 2026-01-21T12:32:57Z
dc.description.abstract: Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
dc.format.pagerange: 38
dc.format.pagerange: 47
dc.identifier.isbn: 978-9908-53-121-2
dc.identifier.olddbid: 212643
dc.identifier.oldhandle: 10024/195661
dc.identifier.uri: https://www.utupub.fi/handle/11111/52844
dc.identifier.url: https://aclanthology.org/2025.resourceful-1.8/
dc.identifier.urn: URN:NBN:fi-fe202601216003
dc.language.iso: en
dc.okm.affiliatedauthor: Kanerva, Jenna
dc.okm.affiliatedauthor: Ledins, Cassandra
dc.okm.affiliatedauthor: Käpyaho, Siiri
dc.okm.affiliatedauthor: Ginter, Filip
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Estonia (en_GB)
dc.publisher.country: Viro (fi_FI)
dc.publisher.country-code: EE
dc.relation.conference: Resources and Representations for Under-Resourced Languages and Domains
dc.source.identifier: https://www.utupub.fi/handle/10024/195661
dc.title: OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches
dc.title.book: Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
dc.year.issued: 2025

Files

Name: 2025.resourceful-1.8.pdf
Size: 593.99 KB
Format: Adobe Portable Document Format