An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

dc.contributor.authorBurchell, Laurie
dc.contributor.authorDe Gibert Bonet, Ona
dc.contributor.authorArefyev, Nikolay
dc.contributor.authorAulamo, Mikko
dc.contributor.authorBañón, Marta
dc.contributor.authorChen, Pinzhen
dc.contributor.authorFedorova, Mariia
dc.contributor.authorGuillou, Liane
dc.contributor.authorHaddow, Barry
dc.contributor.authorHajič, Jan
dc.contributor.authorHelcl, Jindřich
dc.contributor.authorHenriksson, Erik
dc.contributor.authorKlimaszewski, Mateusz
dc.contributor.authorKomulainen, Ville
dc.contributor.authorKutuzov, Andrey
dc.contributor.authorKytöniemi, Joona
dc.contributor.authorLaippala, Veronika
dc.contributor.authorMæhlum, Petter
dc.contributor.authorMalik, Bhavitvya
dc.contributor.authorMehryary, Farrokh
dc.contributor.authorMikhailov, Vladislav
dc.contributor.authorMoghe, Nikita
dc.contributor.authorMyntti, Amanda
dc.contributor.authorO’Brien, Dayyán
dc.contributor.authorOepen, Stephan
dc.contributor.authorPal, Proyag
dc.contributor.authorPiha, Jousia
dc.contributor.authorPyysalo, Sampo
dc.contributor.authorRamírez-Sánchez, Gema
dc.contributor.authorSamuel, David
dc.contributor.authorStepachev, Pavel
dc.contributor.authorTiedemann, Jörg
dc.contributor.authorVariš, Dušan
dc.contributor.authorVojtěchová, Tereza
dc.contributor.authorZaragoza-Bernabeu, Jaume
dc.contributor.organizationfi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organizationfi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization-code1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code1.2.246.10.2458963.20.68940835793
dc.converis.publication-id505515303
dc.converis.urlhttps://research.utu.fi/converis/portal/Publication/505515303
dc.date.accessioned2026-01-21T14:50:10Z
dc.date.available2026-01-21T14:50:10Z
dc.description.abstractTraining state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
dc.format.pagerange17452
dc.format.pagerange17485
dc.identifier.issn0736-587X
dc.identifier.jour-issn0736-587X
dc.identifier.olddbid213763
dc.identifier.oldhandle10024/196781
dc.identifier.urihttps://www.utupub.fi/handle/11111/55765
dc.identifier.urlhttps://doi.org/10.18653/v1/2025.acl-long.854
dc.identifier.urnURN:NBN:fi-fe202601216997
dc.language.isoen
dc.okm.affiliatedauthorHenriksson, Erik
dc.okm.affiliatedauthorKomulainen, Ville
dc.okm.affiliatedauthorKytöniemi, Joona
dc.okm.affiliatedauthorLaippala, Veronika
dc.okm.affiliatedauthorMehryary, Farrokh
dc.okm.affiliatedauthorMyntti, Amanda
dc.okm.affiliatedauthorPiha, Jousia
dc.okm.affiliatedauthorPyysalo, Sampo
dc.okm.discipline113 Computer and information sciencesen_GB
dc.okm.discipline6121 Languagesen_GB
dc.okm.discipline113 Tietojenkäsittely ja informaatiotieteetfi_FI
dc.okm.discipline6121 Kielitieteetfi_FI
dc.okm.internationalcopublicationinternational co-publication
dc.okm.internationalityInternational publication
dc.okm.typeA4 Conference Article
dc.publisher.countryUnited Statesen_GB
dc.publisher.countryYhdysvallat (USA)fi_FI
dc.publisher.country-codeUS
dc.relation.conferenceAnnual Meeting of the Association for Computational Linguistics
dc.relation.doi10.18653/v1/2025.acl-long.854
dc.relation.ispartofjournalAnnual Meeting of the Association for Computational Linguistics
dc.source.identifierhttps://www.utupub.fi/handle/10024/196781
dc.titleAn Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
dc.title.bookProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
dc.year.issued2025

Files

Name:
2025.acl-long.854.pdf
Size:
882.21 KB
Format:
Adobe Portable Document Format