Pretraining Large Language Models for Finnish

Luukkonen, Risto

Luukkonen, Risto (2025-05-23)

View/Open

Luukkonen_Risto_Thesis.pdf (3.693Mb)

This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.

Open access

The permanent address of this publication is:
https://urn.fi/URN:NBN:fi-fe2025060359507

Abstract

Transformers have revolutionized the field of natural language processing (NLP) and remain the dominant large language model (LLM) architecture. However, openly available models have had limited support for low-resource languages, with most work focusing on high-resource languages such as English, where pretraining datasets consist of hundreds of billions or even trillions of words. This work investigates the challenges of training a language model specifically for Finnish, a language spoken by less than 0.1% of the world's population. The training dataset is compiled from a combination of web crawls, news, social media and eBooks. This work explores two distinct approaches to pretraining: (1) training a family of monolingual models from scratch (186M to 13B parameters) named FinGPT and (2) continual pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish data, resulting in a 176-billion-parameter model we call BLUUMI. To evaluate model performance, this work introduces FIN-bench, a Finnish adaptation of BIG-bench, a widely used benchmark for language model evaluation. Models are then evaluated using FIN-bench, along with additional assessments for toxicity and bias.

Collections

Master's theses (pro gradu and diploma theses) and advanced studies theses (full texts) [9745]