Toxicity Detection in Finnish Using Machine Translation
| dc.contributor.author | Eskelinen Anni | |
| dc.contributor.author | Silvala Laura | |
| dc.contributor.author | Ginter Filip | |
| dc.contributor.author | Pyysalo Sampo | |
| dc.contributor.author | Laippala Veronika | |
| dc.contributor.organization | fi=data-analytiikka|en=Data Analytics| | |
| dc.contributor.organization | fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies| | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.56461112866 | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.68940835793 | |
| dc.converis.publication-id | 380758462 | |
| dc.converis.url | https://research.utu.fi/converis/portal/Publication/380758462 | |
| dc.date.accessioned | 2025-08-28T02:53:22Z | |
| dc.date.available | 2025-08-28T02:53:22Z | |
| dc.description.abstract | Due to the popularity of social media platforms and the sheer amount of user-generated content online, the automatic detection of toxic language has become crucial in the creation of a friendly and safe digital space. Previous work has mostly focused on English, leaving many lower-resource languages behind. In this paper, we present novel resources for toxicity detection in Finnish by introducing two new datasets: a machine-translated toxicity dataset for Finnish based on the widely used English Jigsaw dataset, and a smaller test set of Suomi24 discussion forum comments originally written in Finnish and manually annotated following the definitions of the labels that were used to annotate the Jigsaw dataset. We show that machine translating the training data to Finnish provides better toxicity detection results than using the original English training data and zero-shot cross-lingual transfer with XLM-R, even with our newly annotated dataset from Suomi24. | |
| dc.format.pagerange | 685 | |
| dc.format.pagerange | 695 | |
| dc.identifier.isbn | 978-99-1621-999-7 | |
| dc.identifier.issn | 1736-8197 | |
| dc.identifier.jour-issn | 1736-8197 | |
| dc.identifier.olddbid | 209878 | |
| dc.identifier.oldhandle | 10024/192905 | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/49789 | |
| dc.identifier.url | https://aclanthology.org/2023.nodalida-1.68.pdf | |
| dc.identifier.urn | URN:NBN:fi-fe2025082792529 | |
| dc.language.iso | en | |
| dc.okm.affiliatedauthor | Ginter, Filip | |
| dc.okm.affiliatedauthor | Eskelinen, Anni | |
| dc.okm.affiliatedauthor | Laippala, Veronika | |
| dc.okm.affiliatedauthor | Pyysalo, Sampo | |
| dc.okm.discipline | 113 Computer and information sciences | en_GB |
| dc.okm.discipline | 113 Tietojenkäsittely ja informaatiotieteet | fi_FI |
| dc.okm.internationalcopublication | not an international co-publication | |
| dc.okm.internationality | International publication | |
| dc.okm.type | A4 Conference Article | |
| dc.publisher.country | Estonia | en_GB |
| dc.publisher.country | Viro | fi_FI |
| dc.publisher.country-code | EE | |
| dc.publisher.place | Faroe Islands | |
| dc.relation.conference | Nordic Conference on Computational Linguistics | |
| dc.relation.ispartofjournal | NEALT proceedings series | |
| dc.relation.ispartofseries | NEALT Proceedings Series | |
| dc.relation.volume | 52 | |
| dc.source.identifier | https://www.utupub.fi/handle/10024/192905 | |
| dc.title | Toxicity Detection in Finnish Using Machine Translation | |
| dc.title.book | The 24th Nordic Conference on Computational Linguistics (NoDaLiDa 2023) | |
| dc.year.issued | 2023 |