
Building the Penitentiary Document Corpus (PeDoCo) for NLP: Balancing Data Complexity and Uniform Data Structure

Kupari, Hanna-Mari; Korkiakangas, Timo; Laippala, Veronika

doi:10.5617/dhnbpub.12301
URI
https://doi.org/10.5617/dhnbpub.12301
The permanent address of this publication is:
https://urn.fi/URN:NBN:fi-fe2025082788355
Abstract

This paper describes the process of creating a TEI XML corpus of late medieval Latin documents for NLP tasks from books in PDF format. The documents of the Apostolic Penitentiary (a tribunal of the Catholic Church responsible for granting absolutions, dispensations, and indulgences) were originally published as printed books. For the purposes of this corpus, the texts were derived from the PDF files used for proofreading before printing. These editions, containing 1,511 documents and 211,398 words, were designed by and for human scholars engaged in close reading. As a result, they encode implicit semantic information through typographical features such as page layout and italics, and these features are sometimes applied inconsistently. Although human readers, equipped with holistic understanding, can interpret such variations, NLP tools require unambiguous, text-only input.


We report in detail the process of transforming the PDF editions into a structured, machine-readable, and openly accessible corpus. Our approach combines a rule-based workflow using regular expressions with close reading and manual corrections. Such conversion procedures, which are highly time-consuming and require in-depth knowledge of medieval Latin philology and manuscript studies, are regrettably seldom made explicit, despite their vital role in ensuring the reproducibility and scalability of research.
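To give a rough idea of what a rule-based, regular-expression step in such a workflow might look like, here is a minimal Python sketch. It is not the authors' pipeline: the input conventions (a header line of the form "No. <number>" opening each document, and italics marked as _..._ in the extracted text), the function names, and the sample text are all assumptions invented for illustration. Only the output elements, `div`, `p`, and `hi` with `rend="italic"`, follow common TEI usage.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input conventions (assumptions, not taken from the paper):
# each document begins with a line "No. <number>", and italics extracted
# from the PDF are marked as _..._ in the plain text.
DOC_HEADER = re.compile(r"^No\. (\d+)\s*$", re.MULTILINE)
ITALIC = re.compile(r"_([^_]+)_")


def split_documents(text):
    """Split raw text into (number, body) pairs at each header line."""
    # re.split with one capture group yields
    # [preamble, number1, body1, number2, body2, ...]
    parts = DOC_HEADER.split(text)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def body_to_element(number, body):
    """Wrap one document in <div><p>, turning _..._ into <hi rend="italic">."""
    div = ET.Element("div", {"type": "document", "n": number})
    p = ET.SubElement(div, "p")
    pos = 0
    last = None  # most recently created <hi>, whose tail receives the next text
    for match in ITALIC.finditer(body):
        chunk = body[pos:match.start()]
        if last is None:
            p.text = (p.text or "") + chunk
        else:
            last.tail = (last.tail or "") + chunk
        last = ET.SubElement(p, "hi", {"rend": "italic"})
        last.text = match.group(1)
        pos = match.end()
    rest = body[pos:]
    if last is None:
        p.text = (p.text or "") + rest
    else:
        last.tail = (last.tail or "") + rest
    return div


def convert(text):
    """Build a TEI-style <body> element from the raw extracted text."""
    tei_body = ET.Element("body")
    for number, doc in split_documents(text):
        tei_body.append(body_to_element(number, doc))
    return tei_body


if __name__ == "__main__":
    # Invented sample text, used only to exercise the functions.
    sample = "No. 1\nPetrus _clericus_ petit absolutionem.\nNo. 2\nJohannes supplicat.\n"
    print(ET.tostring(convert(sample), encoding="unicode"))
```

A sketch like this only covers the mechanical part. Where the typography is inconsistent, a rule of this kind will misfire, which is where the close reading and manual correction described above come in.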

Turku University Library | University of Turku
julkaisut@utu.fi | Privacy | Accessibility statement
 

 
