Toxicity Detection in Finnish Using Machine Translation

dc.contributor.authorEskelinen Anni
dc.contributor.authorSilvala Laura
dc.contributor.authorGinter Filip
dc.contributor.authorPyysalo Sampo
dc.contributor.authorLaippala Veronika
dc.contributor.organizationfi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organizationfi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code1.2.246.10.2458963.20.56461112866
dc.contributor.organization-code1.2.246.10.2458963.20.68940835793
dc.converis.publication-id380758462
dc.converis.urlhttps://research.utu.fi/converis/portal/Publication/380758462
dc.date.accessioned2025-08-28T02:53:22Z
dc.date.available2025-08-28T02:53:22Z
dc.description.abstractDue to the popularity of social media platforms and the sheer amount of user-generated content online, the automatic detection of toxic language has become crucial in the creation of a friendly and safe digital space. Previous work has mostly focused on English, leaving many lower-resource languages behind. In this paper, we present novel resources for toxicity detection in Finnish by introducing two new datasets: a machine-translated toxicity dataset for Finnish based on the widely used English Jigsaw dataset, and a smaller test set of Suomi24 discussion forum comments originally written in Finnish and manually annotated following the definitions of the labels that were used to annotate the Jigsaw dataset. We show that machine translating the training data to Finnish provides better toxicity detection results than using the original English training data and zero-shot cross-lingual transfer with XLM-R, even with our newly annotated dataset from Suomi24.
dc.format.pagerange685
dc.format.pagerange695
dc.identifier.isbn978-99-1621-999-7
dc.identifier.issn1736-8197
dc.identifier.jour-issn1736-8197
dc.identifier.olddbid209878
dc.identifier.oldhandle10024/192905
dc.identifier.urihttps://www.utupub.fi/handle/11111/49789
dc.identifier.urlhttps://aclanthology.org/2023.nodalida-1.68.pdf
dc.identifier.urnURN:NBN:fi-fe2025082792529
dc.language.isoen
dc.okm.affiliatedauthorGinter, Filip
dc.okm.affiliatedauthorEskelinen, Anni
dc.okm.affiliatedauthorLaippala, Veronika
dc.okm.affiliatedauthorPyysalo, Sampo
dc.okm.discipline113 Computer and information sciencesen_GB
dc.okm.discipline113 Tietojenkäsittely ja informaatiotieteetfi_FI
dc.okm.internationalcopublicationnot an international co-publication
dc.okm.internationalityInternational publication
dc.okm.typeA4 Conference Article
dc.publisher.countryEstoniaen_GB
dc.publisher.countryVirofi_FI
dc.publisher.country-codeEE
dc.publisher.placeFaroe Islands
dc.relation.conferenceNordic Conference on Computational Linguistics
dc.relation.ispartofjournalNEALT proceedings series
dc.relation.ispartofseriesNEALT Proceedings Series
dc.relation.volume52
dc.source.identifierhttps://www.utupub.fi/handle/10024/192905
dc.titleToxicity Detection in Finnish Using Machine Translation
dc.title.bookThe 24th Nordic Conference on Computational Linguistics (NoDaLiDa 2023)
dc.year.issued2023

Files

Name:
2023.nodalida-1.68.pdf
Size:
295.61 KB
Format:
Adobe Portable Document Format