Machine Translation and Toxicity Detection in Finnish: A FinBERT Approach

dc.contributor.author: Eskelinen, Anni
dc.contributor.department: fi=Tietotekniikan laitos|en=Department of Computing|
dc.contributor.faculty: fi=Teknillinen tiedekunta|en=Faculty of Technology|
dc.contributor.studysubject: fi=Tietotekniikka|en=Information and Communication Technology|
dc.date.accessioned: 2025-08-25T21:03:15Z
dc.date.available: 2025-08-25T21:03:15Z
dc.date.issued: 2025-08-18
dc.description.abstract: In the age of social media, users generate an overwhelming amount of content, making automated content moderation essential for maintaining safe online spaces. While English dominates much of the internet, the need for content moderation extends to smaller languages, such as Finnish, where resources and tools for automatic toxicity detection are still limited. This thesis investigates the feasibility of building an effective Finnish toxicity detection model using unified datasets created through machine translation as a form of cross-lingual transfer. The thesis builds on previous work that introduced a toxicity detection model for Finnish and two Finnish toxicity datasets: a machine-translated Jigsaw dataset and a manually annotated test set built from Suomi24 comments. FinBERT, a Finnish pre-trained transformer-based model, is fine-tuned on machine-translated data and evaluated on a new manually annotated corpus created for this thesis. The thesis explores how well data from other cultures works in the Finnish context, whether models generalize across datasets, and how safe and useful the models can be in practical use. The thesis uses both quantitative experiments and qualitative analyses, such as error examination and prediction explainability using integrated gradients. Despite differences in cultural context, language, and label distributions, the results show that unified translated datasets can support the development of robust models. The best-performing model outperformed the existing model, although it tended to prioritize recall over precision, occasionally flagging non-toxic content as toxic. While the resulting model is not a replacement for human moderators, it can serve as a valuable aid in moderation workflows and data preprocessing.
Alongside its theoretical contributions, the thesis offers practical resources: a new Finnish toxicity detection model, a new manually annotated test set, and the machine-translated datasets, as well as code for unifying datasets, model training, and inference.
dc.format.extent: 80
dc.identifier.olddbid: 199812
dc.identifier.oldhandle: 10024/182839
dc.identifier.uri: https://www.utupub.fi/handle/11111/10592
dc.identifier.urn: URN:NBN:fi-fe2025082584405
dc.language.iso: eng
dc.rights: fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.|
dc.rights.accessrights: open
dc.source.identifier: https://www.utupub.fi/handle/10024/182839
dc.subject: natural language processing, language technology, artificial intelligence, machine learning, toxicity detection, machine translation
dc.title: Machine Translation and Toxicity Detection in Finnish: A FinBERT Approach
dc.type.ontasot: fi=Diplomityö|en=Master's thesis|

Files

Name: Anni_Eskelinen_thesis.pdf
Size: 1.93 MB
Format: Adobe Portable Document Format