Pretraining Large Language Models for Finnish

dc.contributor.author: Luukkonen, Risto
dc.contributor.department: fi=Tietotekniikan laitos|en=Department of Computing|
dc.contributor.faculty: fi=Teknillinen tiedekunta|en=Faculty of Technology|
dc.contributor.studysubject: fi=Tietojenkäsittelytieteet|en=Computer Science|
dc.date.accessioned: 2025-06-03T21:05:45Z
dc.date.available: 2025-06-03T21:05:45Z
dc.date.issued: 2025-05-23
dc.description.abstract: Transformers have revolutionized the field of natural language processing (NLP) and remain the dominant large language model (LLM) architecture. However, openly available models have offered limited support for low-resource languages, with most work focusing on high-resource languages such as English, where pretraining datasets consist of hundreds of billions or even trillions of words. This work investigates the challenges of training a language model specifically for Finnish, a language spoken by less than 0.1% of the world's population. The training dataset is compiled from a combination of web crawls, news, social media, and eBooks. This work explores two distinct approaches to pretraining: (1) training a family of monolingual models from scratch (186M to 13B parameters), named FinGPT, and (2) continual pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176-billion-parameter model we call BLUUMI. To evaluate model performance, this work introduces FIN-bench, a Finnish adaptation of BIG-bench, a widely used benchmark for language model evaluation. The models are then evaluated using FIN-bench, along with additional assessments of toxicity and bias.
dc.format.extent: 89
dc.identifier.olddbid: 198640
dc.identifier.oldhandle: 10024/181678
dc.identifier.uri: https://www.utupub.fi/handle/11111/20306
dc.identifier.urn: URN:NBN:fi-fe2025060359507
dc.language.iso: eng
dc.rights: fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.|
dc.rights.accessrights: avoin (open access)
dc.source.identifier: https://www.utupub.fi/handle/10024/181678
dc.subject: Transformer, Large Language Model, Natural Language Processing, Distributed computing, Megatron-LM, DeepSpeed
dc.title: Pretraining Large Language Models for Finnish
dc.type.ontasot: fi=Pro gradu -tutkielma|en=Master's thesis|
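
The subject keywords above indicate that the training itself relied on Megatron-LM and DeepSpeed for distributed computing. As a rough, illustrative sketch of the continual-pretraining approach described in the abstract (further training an existing multilingual checkpoint on a mix of its original data and Finnish), the following Python snippet uses the Hugging Face Trainer API. The small bloom-560m checkpoint and the data file names are hypothetical stand-ins, not the thesis's actual 176B Megatron-DeepSpeed setup.

# Minimal sketch of continual pretraining framed as a causal-LM training run.
# bloom-560m and the data files are illustrative stand-ins only.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Hypothetical mixed corpus: a sample of the original multilingual data
# plus Finnish text, mirroring the data mix the abstract describes.
dataset = load_dataset(
    "text", data_files={"train": ["original_mix.txt", "finnish.txt"]}
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bloom-finnish-continual",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

At the 176-billion-parameter scale of BLUUMI this single-process recipe does not apply; tensor and pipeline parallelism of the kind Megatron-DeepSpeed provides become necessary, which is why those frameworks appear among the subject keywords.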

Files

Name: Luukkonen_Risto_Thesis.pdf
Size: 3.69 MB
Format: Adobe Portable Document Format