A New Massive Multilingual Dataset for High-Performance Language Technologies

dc.contributor.author: de Gibert, Ona
dc.contributor.author: Nail, Graeme
dc.contributor.author: Arefyev, Nikolay
dc.contributor.author: Bañón, Marta
dc.contributor.author: van der Linde, Jelmer
dc.contributor.author: Ji, Shaoxiong
dc.contributor.author: Zaragoza-Bernabeu, Jaume
dc.contributor.author: Aulamo, Mikko
dc.contributor.author: Ramírez-Sánchez, Gema
dc.contributor.author: Kutuzov, Andrey
dc.contributor.author: Pyysalo, Sampo
dc.contributor.author: Oepen, Stephan
dc.contributor.author: Tiedemann, Jörg
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 457541413
dc.converis.url: https://research.utu.fi/converis/portal/Publication/457541413
dc.date.accessioned: 2025-08-28T00:11:21Z
dc.date.available: 2025-08-28T00:11:21Z
dc.description.abstract: We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
dc.format.pagerange: 1116-1128
dc.identifier.isbn: 978-2-493814-10-4
dc.identifier.issn: 2522-2686
dc.identifier.jour-issn: 2522-2686
dc.identifier.olddbid: 205343
dc.identifier.oldhandle: 10024/188370
dc.identifier.uri: https://www.utupub.fi/handle/11111/54280
dc.identifier.url: https://aclanthology.org/2024.lrec-main.100
dc.identifier.urn: URN:NBN:fi-fe2025082790922
dc.language.iso: en
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.affiliatedauthor: Ji, Shaoxiong
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 6121 Languages [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 6121 Kielitieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: France [en_GB]
dc.publisher.country: Ranska [fi_FI]
dc.publisher.country-code: FR
dc.relation.conference: Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)
dc.relation.ispartofjournal: LREC Proceedings
dc.source.identifier: https://www.utupub.fi/handle/10024/188370
dc.title: A New Massive Multilingual Dataset for High-Performance Language Technologies
dc.title.book: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
dc.year.issued: 2024

Files

Name: PyysaloEtAl2024ANewMassiveMultilingualDataset.pdf
Size: 1.23 MB
Format: Adobe Portable Document Format