Building the Penitentiary Document Corpus (PeDoCo) for NLP: Balancing Data Complexity and Uniform Data Structure

Kupari, Hanna-Mari; Korkiakangas, Timo; Laippala, Veronika

dc.contributor.author	Kupari, Hanna-Mari
dc.contributor.author	Korkiakangas, Timo
dc.contributor.author	Laippala, Veronika
dc.contributor.organization	fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization-code	1.2.246.10.2458963.20.36764574459
dc.converis.publication-id	485024197
dc.converis.url	https://research.utu.fi/converis/portal/Publication/485024197
dc.date.accessioned	2025-08-28T02:41:21Z
dc.date.available	2025-08-28T02:41:21Z
dc.description.abstract	This paper describes the process of creating a TEI XML corpus of late medieval Latin documents for NLP tasks from books in PDF format. The documents of the Apostolic Penitentiary (a tribunal of the Catholic Church responsible for granting absolutions, dispensations, and indulgences) have been originally published as printed books. For the purposes of this corpus, they were derived from PDF files used for proofreading before printing. These editions, containing 1,511 documents and 211,398 words, are designed by and for human scholars engaged in close reading. As a result, they encode implicit semantic information through typographical features such as page layout and italics, which are sometimes inconsistent. Although human readers, equipped with holistic understanding, can interpret such variations, NLP tools require unambiguous, text-only input. We report in detail the process of transforming the PDF editions into a structured, machine-readable, and openly accessible corpus. Our approach combines a rule-based workflow using regular expressions with close reading and manual corrections. Such conversion procedures, which are highly time-consuming and require in-depth knowledge of medieval Latin philology and manuscript studies, are regrettably seldom made explicit, despite their vital role in ensuring the reproducibility and scalability of research.
dc.identifier.olddbid	209523
dc.identifier.oldhandle	10024/192550
dc.identifier.uri	https://www.utupub.fi/handle/11111/46880
dc.identifier.url	https://doi.org/10.5617/dhnbpub.12301
dc.identifier.urn	URN:NBN:fi-fe2025082788355
dc.language.iso	en
dc.okm.affiliatedauthor	Kupari, Hanna-Mari
dc.okm.affiliatedauthor	Laippala, Veronika
dc.okm.discipline	6121 Languages	en_GB
dc.okm.internationalcopublication	not an international co-publication
dc.okm.internationality	International publication
dc.okm.type	A4 Conference Article
dc.publisher.country	Estonia	en_GB
dc.publisher.country	Viro	fi_FI
dc.publisher.country-code	EE
dc.relation.conference	Digital Humanities in the Nordic and Baltic Countries Conference
dc.relation.doi	10.5617/dhnbpub.12301
dc.relation.ispartofjournal	Digital Humanities in the Nordic and Baltic Countries Publications
dc.relation.volume	7
dc.source.identifier	https://www.utupub.fi/handle/10024/192550
dc.title	Building the Penitentiary Document Corpus (PeDoCo) for NLP: Balancing Data Complexity and Uniform Data Structure
dc.title.book	Digital Humanities in the Nordic and Baltic Countries Publications 7 (2)
dc.year.issued	2025

Files

Name: Kupari-et-al-Building the Penitentiary Document Corpus-144.pdf
Size: 1.11 MB
Format: Adobe Portable Document Format

Collections

Self-archived publications