A New Massive Multilingual Dataset for High-Performance Language Technologies

de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg

dc.contributor.author	de Gibert, Ona
dc.contributor.author	Nail, Graeme
dc.contributor.author	Arefyev, Nikolay
dc.contributor.author	Bañón, Marta
dc.contributor.author	van der Linde, Jelmer
dc.contributor.author	Ji, Shaoxiong
dc.contributor.author	Zaragoza-Bernabeu, Jaume
dc.contributor.author	Aulamo, Mikko
dc.contributor.author	Ramírez-Sánchez, Gema
dc.contributor.author	Kutuzov, Andrey
dc.contributor.author	Pyysalo, Sampo
dc.contributor.author	Oepen, Stephan
dc.contributor.author	Tiedemann, Jörg
dc.contributor.organization	fi=data-analytiikka\|en=Data-analytiikka\|
dc.contributor.organization-code	1.2.246.10.2458963.20.68940835793
dc.converis.publication-id	457541413
dc.converis.url	https://research.utu.fi/converis/portal/Publication/457541413
dc.date.accessioned	2025-08-28T00:11:21Z
dc.date.available	2025-08-28T00:11:21Z
dc.description.abstract	We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
dc.format.pagerange	1128
dc.identifier.isbn	978-2-493814-10-4
dc.identifier.issn	2522-2686
dc.identifier.jour-issn	2522-2686
dc.identifier.olddbid	205343
dc.identifier.oldhandle	10024/188370
dc.identifier.uri	https://www.utupub.fi/handle/11111/54280
dc.identifier.url	https://aclanthology.org/2024.lrec-main.100
dc.identifier.urn	URN:NBN:fi-fe2025082790922
dc.language.iso	en
dc.okm.affiliatedauthor	Pyysalo, Sampo
dc.okm.affiliatedauthor	Ji, Shaoxiong
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A4 Conference Article
dc.publisher.country	France	en_GB
dc.publisher.country	Ranska	fi_FI
dc.publisher.country-code	FR
dc.relation.conference	Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)
dc.relation.ispartofjournal	LREC Proceedings
dc.source.identifier	https://www.utupub.fi/handle/10024/188370
dc.title	A New Massive Multilingual Dataset for High-Performance Language Technologies
dc.title.book	Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
dc.year.issued	2024

Files

Name: PyysaloEtAl2024ANewMassiveMultilingualDataset.pdf
Size: 1.23 MB
Format: Adobe Portable Document Format

Collections

Self-archived publications