Machine Translation and Toxicity Detection in Finnish: A FinBERT Approach

Eskelinen, Anni

dc.contributor.author	Eskelinen, Anni
dc.contributor.department	fi=Tietotekniikan laitos|en=Department of Computing|
dc.contributor.faculty	fi=Teknillinen tiedekunta|en=Faculty of Technology|
dc.contributor.studysubject	fi=Tietotekniikka|en=Information and Communication Technology|
dc.date.accessioned	2025-08-25T21:03:15Z
dc.date.available	2025-08-25T21:03:15Z
dc.date.issued	2025-08-18
dc.description.abstract	In the age of social media, users generate an overwhelming amount of content, making automated content moderation essential for maintaining safe online spaces. While English dominates much of the internet, the need for content moderation extends to smaller languages such as Finnish, where resources and tools for automatic toxicity detection are still limited. This thesis investigates the feasibility of building an effective Finnish toxicity detection model using unified datasets created through machine translation as a form of cross-lingual transfer. The thesis builds on previous work that introduced a toxicity detection model for Finnish and two Finnish toxicity datasets: a machine-translated Jigsaw dataset and a manually annotated test set built from Suomi24 comments. FinBERT, a Finnish pre-trained transformer-based model, is fine-tuned on machine-translated data and evaluated on a new manually annotated corpus created for the purposes of the thesis. The thesis explores how well data from other cultures works in the Finnish context, whether models generalize across datasets, and how safe and useful the models can be in practical use. It combines quantitative experiments with qualitative analyses, such as error examination and prediction explainability using integrated gradients. Despite differences in cultural context, language, and label distributions, the results show that unified translated datasets can support the development of robust models. The best-performing model achieved competitive results that surpassed those of the existing model, although it tended to prioritize recall over precision, occasionally flagging non-toxic content as toxic. While the resulting model is not a replacement for human moderators, it can serve as a valuable aid in moderation workflows and data preprocessing.
Alongside its theoretical contributions, the thesis offers practical resources: a new Finnish toxicity detection model, a new manually annotated test set, the machine-translated datasets, and code for unifying datasets, model training, and inference.
dc.format.extent	80
dc.identifier.olddbid	199812
dc.identifier.oldhandle	10024/182839
dc.identifier.uri	https://www.utupub.fi/handle/11111/10592
dc.identifier.urn	URN:NBN:fi-fe2025082584405
dc.language.iso	eng
dc.rights	fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.|
dc.rights.accessrights	open
dc.source.identifier	https://www.utupub.fi/handle/10024/182839
dc.subject	natural language processing, language technology, artificial intelligence, machine learning, toxicity detection, machine translation
dc.title	Machine Translation and Toxicity Detection in Finnish: A FinBERT Approach
dc.type.ontasot	fi=Diplomityö|en=Master's thesis|

Files

Name: Anni_Eskelinen_thesis.pdf
Size: 1.93 MB
Format: Adobe Portable Document Format

Collections

Master's theses, diploma theses, and advanced studies theses (full texts)