Från dialektinspelning till talspråkskorpus – beskrivning av ett korpusbygge [From dialect recording to spoken-language corpus: a description of building a corpus]

Lisa Södergård; Therese Leinonen

University of Helsinki

Pre-print

Permanent link

https://urn.fi/URN:NBN:fi-fe2021042714566

Abstract

The Talko corpus of Swedish spoken in Finland is a new research tool consisting of audio files linked to annotation, i.e., transcriptions on two parallel levels and part-of-speech tagging. The corpus is searchable through a web-based interface. The recordings were made in 2005–2008 in all parts of Swedish-language Finland. They have been transcribed in a broad phonetic transcription as well as in a standard orthographic transcription. The part-of-speech tagging is done with TreeTagger, trained on the Stockholm-Umeå Corpus of written Swedish. The automatically produced part-of-speech tags are manually corrected for subsets of the data, and the manually corrected data are subsequently added to the training data. This will gradually improve the result of the automatic tagging and compensate for differences between spoken and written Swedish and between Finland-Swedish and Sweden-Swedish.

Series

Nordica Helsingiensia
