Från dialektinspelning till talspråkskorpus – beskrivning av ett korpusbygge (From dialect recording to spoken-language corpus: a description of building a corpus)

University of Helsinki
Pre-print

Abstract

The Talko corpus of Swedish spoken in Finland is a new research tool consisting of audio files linked to annotation, i.e., transcriptions on two parallel levels and part-of-speech tagging. The corpus is searchable through a web-based interface. The recordings were made in 2005–2008 in all parts of Swedish-language Finland. They have been transcribed in a broad phonetic transcription as well as in a standard orthographic transcription. The part-of-speech tagging is done with TreeTagger, trained on the Stockholm-Umeå Corpus of written Swedish. The automatically produced part-of-speech tags are manually corrected for subsets of the data, and the manually corrected data are subsequently added to the training data. This will gradually improve the result of the automatic tagging and compensate for differences between spoken and written Swedish and between Finland-Swedish and Sweden-Swedish.

Series

Nordica Helsingiensia
