Poro 34B and the Blessing of Multilinguality

dc.contributor.author: Luukkonen, Risto
dc.contributor.author: Burdge, Jonathan
dc.contributor.author: Zosa, Elaine
dc.contributor.author: Talman, Aarne
dc.contributor.author: Komulainen, Ville
dc.contributor.author: Hatanpää, Väinö
dc.contributor.author: Sarlin, Peter
dc.contributor.author: Pyysalo, Sampo
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 506554658
dc.converis.url: https://research.utu.fi/converis/portal/Publication/506554658
dc.date.accessioned: 2026-01-21T14:45:28Z
dc.date.available: 2026-01-21T14:45:28Z
dc.description.abstract: The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
dc.format.pagerange: 367
dc.format.pagerange: 382
dc.identifier.isbn: 978-9908-53-109-0
dc.identifier.issn: 1736-8197
dc.identifier.jour-issn: 1736-8197
dc.identifier.olddbid: 213663
dc.identifier.oldhandle: 10024/196681
dc.identifier.uri: https://www.utupub.fi/handle/11111/55752
dc.identifier.url: https://aclanthology.org/2025.nodalida-1.40/
dc.identifier.urn: URN:NBN:fi-fe202601216877
dc.language.iso: en
dc.okm.affiliatedauthor: Luukkonen, Risto
dc.okm.affiliatedauthor: Komulainen, Ville
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 6121 Languages (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.discipline: 6121 Kielitieteet (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Estonia (en_GB)
dc.publisher.country: Viro (fi_FI)
dc.publisher.country-code: EE
dc.relation.conference: Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies
dc.relation.ispartofjournal: NEALT proceedings series
dc.relation.volume: 57
dc.source.identifier: https://www.utupub.fi/handle/10024/196681
dc.title: Poro 34B and the Blessing of Multilinguality
dc.title.book: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
dc.year.issued: 2025

Files

Name: 2025.nodalida-1.40.pdf
Size: 286.4 KB
Format: Adobe Portable Document Format