Pretraining Large Language Models for Finnish
| Field | Value |
| --- | --- |
| dc.contributor.author | Luukkonen, Risto |
| dc.contributor.department | fi=Tietotekniikan laitos; en=Department of Computing |
| dc.contributor.faculty | fi=Teknillinen tiedekunta; en=Faculty of Technology |
| dc.contributor.studysubject | fi=Tietojenkäsittelytieteet; en=Computer Science |
| dc.date.accessioned | 2025-06-03T21:05:45Z |
| dc.date.available | 2025-06-03T21:05:45Z |
| dc.date.issued | 2025-05-23 |
| dc.description.abstract | Transformers have revolutionized the field of natural language processing (NLP) and remain the dominant large language model (LLM) architecture. However, openly available models have offered limited support for low-resource languages, with most work focusing on high-resource languages such as English, where pretraining datasets consist of hundreds of billions or even trillions of words. This work investigates the challenges of training a language model specifically for Finnish, a language spoken by less than 0.1% of the world's population. The training dataset is compiled from a combination of web crawls, news, social media, and eBooks. This work explores two distinct approaches to pretraining: (1) training a family of monolingual models from scratch (186M to 13B parameters), named FinGPT, and (2) continual pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176-billion-parameter model we call BLUUMI. To evaluate model performance, this work introduces FIN-bench, a Finnish adaptation of BIG-bench, a widely used benchmark for language model evaluation. Models are then evaluated using FIN-bench, along with additional assessments of toxicity and bias. |
| dc.format.extent | 89 |
| dc.identifier.olddbid | 198640 |
| dc.identifier.oldhandle | 10024/181678 |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/20306 |
| dc.identifier.urn | URN:NBN:fi-fe2025060359507 |
| dc.language.iso | eng |
| dc.rights | fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.; en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited. |
| dc.rights.accessrights | avoin |
| dc.source.identifier | https://www.utupub.fi/handle/10024/181678 |
| dc.subject | Transformer, Large Language Model, Natural Language Processing, Distributed computing, Megatron-LM, DeepSpeed |
| dc.title | Pretraining Large Language Models for Finnish |
| dc.type.ontasot | fi=Pro gradu -tutkielma; en=Master's thesis |
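
The continual-pretraining approach summarized in the abstract (extending BLOOM with Finnish while replaying part of its original data mix to limit catastrophic forgetting) can be sketched at small scale as follows. This is a minimal illustration, not the thesis pipeline: the actual work used Megatron-LM and DeepSpeed (per dc.subject) at 176B-parameter scale, and the checkpoint name, file paths, and 50/50 mixing ratio below are placeholder assumptions.

```python
# Minimal sketch of continual pretraining: resume causal-LM training of a
# multilingual BLOOM checkpoint on a mix of Finnish and original-style data.
# NOTE: illustrative only -- the thesis used Megatron-LM/DeepSpeed at 176B
# scale; the small checkpoint, file names, and mixing ratio are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset, interleave_datasets

model_name = "bigscience/bloom-560m"  # small stand-in for the 176B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpora: mix new Finnish text with data resembling the
# original pretraining distribution to limit catastrophic forgetting.
finnish = load_dataset("text", data_files="finnish_corpus.txt", split="train")
original = load_dataset("text", data_files="original_mix.txt", split="train")
mixed = interleave_datasets([finnish, original], probabilities=[0.5, 0.5])

def tokenize(batch):
    # Truncate to a fixed context length; the collator handles padding.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = mixed.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bluumi-sketch",
                           per_device_train_batch_size=1,
                           max_steps=1000),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal LM) objectives.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```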