A New Massive Multilingual Dataset for High-Performance Language Technologies

Ona de Gibert1, Graeme Nail2, Nikolay Arefyev3, Marta Bañón4, Jelmer van der Linde2, Shaoxiong Ji1, Jaume Zaragoza-Bernabeu4, Mikko Aulamo1, Gema Ramírez-Sánchez4, Andrey Kutuzov3, Sampo Pyysalo5, Stephan Oepen3 and Jörg Tiedemann1

University of Helsinki, Finland1; University of Edinburgh, UK2; University of Oslo, Norway3; Prompsit, Spain4; University of Turku, Finland5
{ona.degibert, shaoxiong.ji, mikko.aulamo, joerg.tiedemann}@helsinki.fi1, {graeme.nail, jelmer.vanderlinde}@ed.ac.uk2, {nikolare, andreku, oe}@ifi.uio.no3, {mbanon, jzaragoza, gramirez}@prompsit.com4, sampo.pyysalo@utu.fi5

Abstract
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Keywords: Parallel Corpus, Monolingual Corpus, Low-resource Languages, Pre-training Datasets

1. Introduction
The development of Large Language Models (LLMs) pre-trained on ever-increasing amounts of text, combined with the ongoing advancements in Machine Translation (MT), has made the need for vast amounts of high-quality textual data more pressing than ever. Since the acquisition of large text corpora is a challenge, most works focus on the pre-processing of previously released corpora with new methods, such as stricter textual filters or the removal of biased or explicit content. In this work, we present a massive, brand-new dataset for language modeling and MT training based on web crawls produced by the Internet Archive,1 used for the first time at this scale to create multilingual text corpora, and from CommonCrawl.2 Under the umbrella of the High Performance Language Technologies (HPLT) project3 (Aulamo et al., 2023), we obtained access to the web crawls (1.85 PB of data in total at the current stage of the project), then downloaded and processed them to create monolingual and parallel corpora with rich metadata: the HPLT language resources. We release the collection under the permissive CC0 license4 through our project website5 and OPUS6 (Tiedemann, 2012). We also publish the open-source tools and pipelines used for processing huge web archive data packages, so that our real use case can serve as an example for others inside and outside the research community.

1 https://archive.org/
2 https://commoncrawl.org/
3 https://hplt-project.org/
Software and tools are released through GitHub.7

Our contributions can be summarized as follows:

• monoHPLT: Monolingual datasets covering 75 languages and over 5.6 trillion tokens.
• biHPLT: Parallel datasets covering 18 language pairs and over 96 million sentence pairs.
• multiHPLT: Synthetic datasets obtained by pivoting our parallel datasets through English, covering 171 language pairs and 157 million sentence pairs.
• Bitextor (Esplà-Gomis et al., 2016) models: 22 MT models for fast translation and bilingual document alignment covering 9 languages.
• Bicleaner AI (Zaragoza-Bernabeu et al., 2022) models: 9 new Bicleaner models for sentence pair scoring.
• Scripts and tools for managing, downloading and processing large amounts of web-crawled corpora.

4 We do not own any of the text from which these data have been extracted. We release the data under a specific takedown policy, whereby any user can ask us to remove their data.
5 https://hplt-project.org/datasets/
6 https://opus.nlpl.eu/
7 https://github.com/hplt-project

The rest of the paper is organized as follows. Section 2 provides an overview of previous work on constructing corpora for pre-training. Section 3 describes the acquisition of the presented resources. Section 4 presents the introduced language resources in detail. Finally, Section 5 concludes our work and discusses future lines of research.

2. Related Work
The development of LLMs and highly multilingual MT systems demands large amounts of high-quality data. The scale of training data required by these models makes it effectively impossible to use only curated samples; instead, the common solution to gathering sufficient data is to source it from the Internet. The compilation of text corpora from the Web, both monolingual and bilingual, has been going on for a long time (Kilgarriff and Grefenstette, 2003). While some noteworthy efforts focus on language-specific curated datasets, such as C4 in English (Dodge et al., 2021) and WuDaoCorpora in Chinese (Yuan et al., 2021), the capacity of models in the field has grown, leading to a move towards large multilingual collections.

Regarding monolingual resources, one of the most used sources is CommonCrawl (CC), produced by a non-profit organization that has published a collection of monthly multilingual web snapshots since 2011. Due to its size and noisy nature, there have been multiple efforts at processing CC data to compile cleaned versions: the multilingual OSCAR corpus (Suárez et al., 2019), as well as the English corpora Pile-CC (Gao et al., 2020), C4 (Dodge et al., 2021) and its multilingual counterpart mC4 (Xue et al., 2021). Other well-known multilingual corpora for language modeling include the recent BigScience ROOTS corpus (Laurençon et al., 2022), covering 59 languages from a diverse set of sources; CulturaX (Nguyen et al., 2023), a cleaned multilingual dataset in 167 languages; MADLAD-400 (Kudugunta et al., 2023), a large audited dataset in 419 languages; Glot500 (ImaniGooghari et al., 2023), a corpus covering 511 languages; and SERENGETI (Adebara et al., 2023), a dataset in 517 African languages. Bapna et al. (2022) built a massively multilingual dataset in over 1,500 languages; however, they did not release it publicly.

For parallel corpora, the largest publicly available bitext collection is OPUS (Tiedemann, 2012).
The collection includes several large multilingual corpora, such as ParaCrawl (Bañón et al., 2020), whose current version 9 covers 42 languages with English-centric sentence pairs; CCMatrix (Schwenk et al., 2021), obtained from CC; and the recent NLLB data (Costa-jussà et al., 2022), which aims at covering as many language pairs as possible.

[Figure 1: a flowchart showing data download from IA and CC crawls feeding two branches: the Monotextor pipeline (encoding fixing, language identification, cleaning, deduplication) producing the monolingual datasets, and the Bitextor pipeline (sharding, text extraction, sentence splitting, translation, document alignment, sentence alignment, encoding fixing, rule-based cleaning, sentence pair scoring, deduplication) producing the parallel datasets.]
Figure 1: General overview of the HPLT acquisition and processing pipeline.

When dealing with web-crawled corpora, concerns arise regarding the original sources of the data and their level of noisiness. Several works have addressed this issue (Kreutzer et al., 2022; Abadji et al., 2022) and have led researchers to further explore their own datasets and develop new metadata schemes, such as adding genre labels (Laippala et al., 2022; Kuzman et al., 2023), or to include extended annotations such as length, noise and adult content tags (Abadji et al., 2022). The HPLT language resources also contain additional paragraph-level metadata; see subsection 4.1 for more detail.

3. From Raw Data to Refined Corpora
The management and processing of large datasets each introduce their own challenges. In this section, we provide a detailed account of the methods, techniques, and considerations employed to collect the raw data and transform it into the corpora presented in Section 4. A general overview of the pipeline is depicted in Figure 1.

Data download
Data acquisition in HPLT relies on two main sources of web crawls: the Internet Archive and Common Crawl. The national High-Performance Computing (HPC) storage resources of Sigma28 and CESNET9 were used to download and pre-process web crawls from these two sources. The downloading scripts are published in the HPLT git repository.10 They enable parallelized data downloading while automatically verifying downloads and retrying failed ones after a back-off period. These features are vital for downloading large file collections such as web crawls. For the current data release, we have downloaded three large web crawls from the Internet Archive (IA), named WIDE15, WIDE16 and WIDE17, along with the CC-MAIN-2022-40 (CC40) crawl from Common Crawl. These crawls occupy a total of 1850 TB and are stored in WARC (Web Archive) format.11 More data will be made available in future releases.

8 https://www.sigma2.no/data-storage
9 https://www.cesnet.cz/
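The actual download scripts are the ones published in the HPLT git repository; purely as an illustration of the behaviour just described (parallel downloads, verification, and retries after a back-off period), a simplified sketch could look as follows. All function names, parameters and the checksum-based verification are our own assumptions, not the project's code, and a real script would stream files to disk rather than hold them in memory.

```python
# A simplified, hypothetical downloader illustrating parallel downloads,
# verification, and retries with a back-off period. Not the HPLT scripts.
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client, assumed available


def fetch_with_retries(url, expected_sha256=None, max_retries=5, backoff_seconds=30):
    """Download one file, optionally verify its checksum, and retry on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, timeout=600)
            response.raise_for_status()
            data = response.content
            if expected_sha256 and hashlib.sha256(data).hexdigest() != expected_sha256:
                raise ValueError(f"checksum mismatch for {url}")
            return data
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds * attempt)  # back off before retrying


def download_all(urls, n_workers=16):
    """Fetch many WARC files in parallel with a bounded worker pool."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fetch_with_retries, urls))
```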
Crawl (collection)            CC40      IA WIDE15   IA WIDE16   IA WIDE17   Total
# WARC files                  80 000    361 431     754 143     662 381     1 857 955
# files after warc2text       384 360   1 490 152   1 955 584   2 403 058   6 233 154
Compressed text size, TB      8.4       19          42          18          87.4
Uncompressed text size, TB    18.04     38.15       130.82      43.65       230.7
# text files                  127 853   495 512     977 792     798 811     2 399 968

Table 1: Sizes of the raw texts extracted from the crawls. 'CC' stands for 'Common Crawl', 'IA' stands for 'Internet Archive'.

Text Extraction
WARC files contain many types of data besides written text: images, sound, video, etc. In order to extract raw texts and conduct preliminary language identification, the downloaded crawls were processed by the warc2text tool from the Bitextor pipeline.12 warc2text finds documents containing text in some natural language and performs fast preliminary filtering of undesirable documents based on their URL or HTML tags; more thorough filtering happens at the next stages. From the remaining documents, it extracts raw unformatted text and performs initial, document-level language detection. Running whitespace is normalized, and the text is split into paragraph-like segments as defined by HTML block elements.
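warc2text itself is a compiled tool from the Bitextor ecosystem; the snippet below is only a rough Python sketch of the kind of processing described above, assuming the third-party warcio and langdetect packages as stand-ins for its WARC parsing and language identification. The URL filter, the crude markup stripper and all names are illustrative and do not reflect the behaviour of the actual tool.

```python
# A minimal, illustrative sketch (not warc2text itself): iterate over a WARC
# file, keep HTML responses, filter on URLs, strip markup, normalize running
# whitespace and run document-level language identification.
import re

from warcio.archiveiterator import ArchiveIterator  # third-party: warcio
from langdetect import detect                        # third-party: langdetect

BLOCKED_URL_SUBSTRINGS = (".jpg", ".png", ".mp4", "/robots.txt")  # toy URL filter

TAG_RE = re.compile(r"<[^>]+>")
SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style)[^>]*>.*?</\1>")


def extract_plain_text(html_bytes):
    """Very crude markup stripping; real extractors honour HTML block elements."""
    text = html_bytes.decode("utf-8", errors="replace")
    text = SCRIPT_STYLE_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()  # normalize running whitespace


def iter_documents(warc_path):
    """Yield (url, language, text) for text-bearing HTML responses in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            if any(s in url for s in BLOCKED_URL_SUBSTRINGS):
                continue  # fast preliminary filtering based on the URL
            headers = record.http_headers
            content_type = headers.get_header("Content-Type") if headers else ""
            if not content_type or "html" not in content_type:
                continue  # keep only HTML documents
            text = extract_plain_text(record.content_stream().read())
            if len(text) < 200:
                continue  # drop documents with (almost) no running text
            yield url, detect(text), text  # document-level language detection
```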