Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

dc.contributor.author: Nakamura, Taishi
dc.contributor.author: Mishra, Mayank
dc.contributor.author: Tedeschi, Simone
dc.contributor.author: Chai, Yekun
dc.contributor.author: Stillerman, Jason T.
dc.contributor.author: Friedrich, Felix
dc.contributor.author: Yadav, Prateek
dc.contributor.author: Laud, Tanmay
dc.contributor.author: Chien, Vu Minh
dc.contributor.author: Zhuo, Terry Yue
dc.contributor.author: Misra, Diganta
dc.contributor.author: Bogin, Ben
dc.contributor.author: Vu, Xuan-Son
dc.contributor.author: Karpinska, Marzena
dc.contributor.author: Dantuluri, Arnav Varma
dc.contributor.author: Kusa, Wojciech
dc.contributor.author: Furlanello, Tommaso
dc.contributor.author: Yokota, Rio
dc.contributor.author: Muennighoff, Niklas
dc.contributor.author: Pai, Suhas
dc.contributor.author: Adewumi, Tosin
dc.contributor.author: Laippala, Veronika
dc.contributor.author: Yao, Xiaozhe
dc.contributor.author: Junior, Adalberto Barbosa
dc.contributor.author: Drozd, Aleksandr
dc.contributor.author: Clive, Jordan
dc.contributor.author: Gupta, Kshitij
dc.contributor.author: Chen, Liangyu
dc.contributor.author: Sun, Qi
dc.contributor.author: Tsui, Ken
dc.contributor.author: Moustafa-Fahmy, Nour
dc.contributor.author: Monti, Nicolo
dc.contributor.author: Dang, Tai
dc.contributor.author: Luo, Ziyang
dc.contributor.author: Bui, Tien-Tung
dc.contributor.author: Navigli, Roberto
dc.contributor.author: Mehta, Virendra
dc.contributor.author: Blumberg, Matthew
dc.contributor.author: May, Victor
dc.contributor.author: Nguyen, Hiep
dc.contributor.author: Pyysalo, Sampo
dc.contributor.organization: Data Analytics
dc.contributor.organization: Digital Language Studies, Chinese, French, German, Italian, Spanish
dc.contributor.organization-code: 1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 508764398
dc.converis.url: https://research.utu.fi/converis/portal/Publication/508764398
dc.date.accessioned: 2026-04-24T19:26:20Z
dc.description.abstract: Pretrained language models are an integral part of AI applications, but their high training cost limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models face challenges such as limited multilingual capabilities, the risk of catastrophic forgetting during continual pretraining, and the high cost of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B-parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.
dc.format.pagerange: 656-678
dc.identifier.isbn: 979-8-89176-197-1
dc.identifier.uri: https://www.utupub.fi/handle/11111/59209
dc.identifier.url: https://aclanthology.org/2025.coling-industry.56/
dc.identifier.urn: URN:NBN:fi-fe2026022315625
dc.language.iso: en
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 113 Computing and information sciences [fi_FI]
dc.okm.discipline: 6121 Languages [en_GB]
dc.okm.discipline: 6121 Linguistics [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: United States [en_GB]
dc.publisher.country: United States (USA) [fi_FI]
dc.publisher.country-code: US
dc.relation.conference: International Conference on Computational Linguistics
dc.title: Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
dc.title.book: Proceedings of the 31st International Conference on Computational Linguistics : Industry Track
dc.year.issued: 2025

Files

Name: nakamura_etal_2025.pdf
Size: 759.44 KB
Format: Adobe Portable Document Format