Creating a parallel Finnish—Easy Finnish dataset from news articles

dc.contributor.author: Dmitrieva Anna
dc.contributor.author: Konovalova Aleksandra
dc.contributor.organization: fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code: 2602100
dc.converis.publication-id: 180195017
dc.converis.url: https://research.utu.fi/converis/portal/Publication/180195017
dc.date.accessioned: 2025-08-28T00:26:40Z
dc.date.available: 2025-08-28T00:26:40Z
dc.description.abstract: Modern natural language processing tasks such as text simplification or summarization are typically formulated as monolingual machine translation tasks. This requires appropriate datasets to train, tune, and evaluate the models. This paper describes the creation of a parallel Finnish–Easy Finnish dataset from the Yle News archives. The dataset contains 1919 manually verified pairs of articles, each containing an article in Easy Finnish (selkosuomi) and a corresponding article from Standard Finnish news. Standard Finnish texts total 687555 words, and Easy Finnish texts have 106733 words. This new aligned resource was created automatically based on the Yle News archives from the Language Bank of Finland (Kielipankki) and manually checked by a human expert. The dataset is available for download from Kielipankki. This resource will allow for more effective Easy Language research and for creating applications for automatic simplification and/or summarization of Finnish texts.
dc.identifier.isbn: 978-84-1302-228-4
dc.identifier.olddbid: 205715
dc.identifier.oldhandle: 10024/188742
dc.identifier.uri: https://www.utupub.fi/handle/11111/56894
dc.identifier.url: https://macocu.eu/static/media/proceedings.37b7e88ce3dbab99adf9.pdf#page=27
dc.identifier.urn: URN:NBN:fi-fe2025082791022
dc.language.iso: en
dc.okm.affiliatedauthor: Konovalova, Aleksandra
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Spain (en_GB)
dc.publisher.country: Espanja (fi_FI)
dc.publisher.country-code: ES
dc.relation.conference: Workshop on Open Community-Driven Machine Translation
dc.source.identifier: https://www.utupub.fi/handle/10024/188742
dc.title: Creating a parallel Finnish—Easy Finnish dataset from news articles
dc.title.book: Proceedings of the 1st Workshop on Open Community-Driven Machine Translation
dc.year.issued: 2023

Files

Name: DmitrievaKonovalova.pdf
Size: 222.29 KB
Format: Adobe Portable Document Format