Pretraining Large Language Models for Finnish

dc.contributor.author: Luukkonen, Risto
dc.contributor.department: fi=Tietotekniikan laitos|en=Department of Computing|
dc.contributor.faculty: fi=Teknillinen tiedekunta|en=Faculty of Technology|
dc.contributor.studysubject: fi=Tietojenkäsittelytieteet|en=Computer Science|
dc.date.accessioned: 2025-06-03T21:05:45Z
dc.date.available: 2025-06-03T21:05:45Z
dc.date.issued: 2025-05-23
dc.description.abstract: Transformers have revolutionized the field of natural language processing (NLP) and remain the dominant large language model (LLM) architecture. However, openly available models have offered limited support for low-resource languages, with most work focusing on high-resource languages such as English, where pretraining datasets consist of hundreds of billions or even trillions of words. This work investigates the challenges of training a language model specifically for Finnish, a language spoken by less than 0.1% of the world's population. The training dataset is compiled from a combination of web crawls, news, social media, and eBooks. This work explores two distinct approaches to pretraining: (1) training a family of monolingual models from scratch (186M to 13B parameters), named FinGPT, and (2) continual pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176-billion-parameter model we call BLUUMI. To evaluate model performance, this work introduces FIN-bench, a Finnish adaptation of BIG-bench, a widely used benchmark for language model evaluation. The models are then evaluated using FIN-bench, along with additional assessments of toxicity and bias.
dc.format.extent: 89
dc.identifier.olddbid: 198640
dc.identifier.oldhandle: 10024/181678
dc.identifier.uri: https://www.utupub.fi/handle/11111/20306
dc.identifier.urn: URN:NBN:fi-fe2025060359507
dc.language.iso: eng
dc.rights: fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.|
dc.rights.accessrights: avoin (open access)
dc.source.identifier: https://www.utupub.fi/handle/10024/181678
dc.subject: Transformer, Large Language Model, Natural Language Processing, Distributed computing, Megatron-LM, DeepSpeed
dc.title: Pretraining Large Language Models for Finnish
dc.type.ontasot: fi=Pro gradu -tutkielma|en=Master's thesis|
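
The subject keywords above indicate that the training itself relied on Megatron-LM and DeepSpeed for distributed computing. As a rough, illustrative sketch of the continual-pretraining approach described in the abstract (further training an existing multilingual checkpoint on a mix of its original data and Finnish), the following Python snippet uses the Hugging Face Trainer API. The small bloom-560m checkpoint and the data file names are hypothetical stand-ins, not the thesis's actual 176B Megatron-DeepSpeed setup.

# Minimal sketch of continual pretraining framed as a causal-LM training run.
# bloom-560m and the data files are illustrative stand-ins only.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Hypothetical mixed corpus: a sample of the original multilingual data
# plus Finnish text, mirroring the data mix the abstract describes.
dataset = load_dataset(
    "text", data_files={"train": ["original_mix.txt", "finnish.txt"]}
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bloom-finnish-continual",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

At the 176-billion-parameter scale of BLUUMI this single-process recipe does not apply; tensor and pipeline parallelism of the kind Megatron-DeepSpeed provides become necessary, which is why those frameworks appear among the subject keywords.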

Files

Name: Luukkonen_Risto_Thesis.pdf
Size: 3.69 MB
Format: Adobe Portable Document Format