An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
| dc.contributor.author | Burchell, Laurie | |
| dc.contributor.author | De Gibert Bonet, Ona | |
| dc.contributor.author | Arefyev, Nikolay | |
| dc.contributor.author | Aulamo, Mikko | |
| dc.contributor.author | Bañón, Marta | |
| dc.contributor.author | Chen, Pinzhen | |
| dc.contributor.author | Fedorova, Mariia | |
| dc.contributor.author | Guillou, Liane | |
| dc.contributor.author | Haddow, Barry | |
| dc.contributor.author | Hajič, Jan | |
| dc.contributor.author | Helcl, Jindřich | |
| dc.contributor.author | Henriksson, Erik | |
| dc.contributor.author | Klimaszewski, Mateusz | |
| dc.contributor.author | Komulainen, Ville | |
| dc.contributor.author | Kutuzov, Andrey | |
| dc.contributor.author | Kytöniemi, Joona | |
| dc.contributor.author | Laippala, Veronika | |
| dc.contributor.author | Mæhlum, Petter | |
| dc.contributor.author | Malik, Bhavitvya | |
| dc.contributor.author | Mehryary, Farrokh | |
| dc.contributor.author | Mikhailov, Vladislav | |
| dc.contributor.author | Moghe, Nikita | |
| dc.contributor.author | Myntti, Amanda | |
| dc.contributor.author | O’Brien, Dayyán | |
| dc.contributor.author | Oepen, Stephan | |
| dc.contributor.author | Pal, Proyag | |
| dc.contributor.author | Piha, Jousia | |
| dc.contributor.author | Pyysalo, Sampo | |
| dc.contributor.author | Ramírez-Sánchez, Gema | |
| dc.contributor.author | Samuel, David | |
| dc.contributor.author | Stepachev, Pavel | |
| dc.contributor.author | Tiedemann, Jörg | |
| dc.contributor.author | Variš, Dušan | |
| dc.contributor.author | Vojtěchová, Tereza | |
| dc.contributor.author | Zaragoza-Bernabeu, Jaume | |
| dc.contributor.organization | Data Analytics | |
| dc.contributor.organization | Digital Language Studies, Chinese, French, German, Italian, Spanish | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.36764574459 | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.68940835793 | |
| dc.converis.publication-id | 505515303 | |
| dc.converis.url | https://research.utu.fi/converis/portal/Publication/505515303 | |
| dc.date.accessioned | 2026-01-21T14:50:10Z | |
| dc.date.available | 2026-01-21T14:50:10Z | |
| dc.description.abstract | Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value. | |
| dc.format.pagerange | 17452 | |
| dc.format.pagerange | 17485 | |
| dc.identifier.issn | 0736-587X | |
| dc.identifier.jour-issn | 0736-587X | |
| dc.identifier.olddbid | 213763 | |
| dc.identifier.oldhandle | 10024/196781 | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/55765 | |
| dc.identifier.url | https://doi.org/10.18653/v1/2025.acl-long.854 | |
| dc.identifier.urn | URN:NBN:fi-fe202601216997 | |
| dc.language.iso | en | |
| dc.okm.affiliatedauthor | Henriksson, Erik | |
| dc.okm.affiliatedauthor | Komulainen, Ville | |
| dc.okm.affiliatedauthor | Kytöniemi, Joona | |
| dc.okm.affiliatedauthor | Laippala, Veronika | |
| dc.okm.affiliatedauthor | Mehryary, Farrokh | |
| dc.okm.affiliatedauthor | Myntti, Amanda | |
| dc.okm.affiliatedauthor | Piha, Jousia | |
| dc.okm.affiliatedauthor | Pyysalo, Sampo | |
| dc.okm.discipline | 113 Computer and information sciences | en_GB |
| dc.okm.discipline | 6121 Languages | en_GB |
| dc.okm.discipline | 113 Tietojenkäsittely ja informaatiotieteet | fi_FI |
| dc.okm.discipline | 6121 Kielitieteet | fi_FI |
| dc.okm.internationalcopublication | international co-publication | |
| dc.okm.internationality | International publication | |
| dc.okm.type | A4 Conference Article | |
| dc.publisher.country | United States | en_GB |
| dc.publisher.country | Yhdysvallat (USA) | fi_FI |
| dc.publisher.country-code | US | |
| dc.relation.conference | Annual Meeting of the Association for Computational Linguistics | |
| dc.relation.doi | 10.18653/v1/2025.acl-long.854 | |
| dc.relation.ispartofjournal | Annual Meeting of the Association for Computational Linguistics | |
| dc.source.identifier | https://www.utupub.fi/handle/10024/196781 | |
| dc.title | An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT) | |
| dc.title.book | Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) | |
| dc.year.issued | 2025 |