Building the Penitentiary Document Corpus (PeDoCo) for NLP: Balancing Data Complexity and Uniform Data Structure

dc.contributor.author Kupari, Hanna-Mari
dc.contributor.author Korkiakangas, Timo
dc.contributor.author Laippala, Veronika
dc.contributor.organization fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization-code 1.2.246.10.2458963.20.36764574459
dc.converis.publication-id 485024197
dc.converis.url https://research.utu.fi/converis/portal/Publication/485024197
dc.date.accessioned 2025-08-28T02:41:21Z
dc.date.available 2025-08-28T02:41:21Z
dc.description.abstract This paper describes the process of creating a TEI XML corpus of late medieval Latin documents for NLP tasks from books in PDF format. The documents of the Apostolic Penitentiary (a tribunal of the Catholic Church responsible for granting absolutions, dispensations, and indulgences) were originally published as printed books. For the purposes of this corpus, they were derived from the PDF files used for proofreading before printing. These editions, containing 1,511 documents and 211,398 words, are designed by and for human scholars engaged in close reading. As a result, they encode implicit semantic information through typographical features such as page layout and italics, which are sometimes inconsistent. Although human readers, equipped with holistic understanding, can interpret such variations, NLP tools require unambiguous, text-only input.
We report in detail the process of transforming the PDF editions into a structured, machine-readable, and openly accessible corpus. Our approach combines a rule-based workflow using regular expressions with close reading and manual corrections. Such conversion procedures, which are highly time-consuming and require in-depth knowledge of medieval Latin philology and manuscript studies, are regrettably seldom made explicit, despite their vital role in ensuring the reproducibility and scalability of research.
dc.identifier.olddbid 209523
dc.identifier.oldhandle 10024/192550
dc.identifier.uri https://www.utupub.fi/handle/11111/46880
dc.identifier.url https://doi.org/10.5617/dhnbpub.12301
dc.identifier.urn URN:NBN:fi-fe2025082788355
dc.language.iso en
dc.okm.affiliatedauthor Kupari, Hanna-Mari
dc.okm.affiliatedauthor Laippala, Veronika
dc.okm.discipline 6121 Languages en_GB
dc.okm.discipline 615 History and archaeology en_GB
dc.okm.discipline 6121 Kielitieteet fi_FI
dc.okm.discipline 615 Historia ja arkeologia fi_FI
dc.okm.internationalcopublication not an international co-publication
dc.okm.internationality International publication
dc.okm.type A4 Conference Article
dc.publisher.country Estonia en_GB
dc.publisher.country Viro fi_FI
dc.publisher.country-code EE
dc.relation.conference Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2025)
dc.relation.doi 10.5617/dhnbpub.12301
dc.relation.ispartofjournal Digital Humanities in the Nordic and Baltic Countries Publications
dc.relation.volume 7
dc.source.identifier https://www.utupub.fi/handle/10024/192550
dc.title Building the Penitentiary Document Corpus (PeDoCo) for NLP: Balancing Data Complexity and Uniform Data Structure
dc.title.book Digital Humanities in the Nordic and Baltic Countries Publications 7 (2)
dc.year.issued 2025

Files

Name: Kupari-et-al-Building the Penitentiary Document Corpus-144.pdf
Size: 1.11 MB
Format: Adobe Portable Document Format