Scaling Data-Constrained Language Models

dc.contributor.author: Muennighoff, Niklas
dc.contributor.author: Rush, Alexander M.
dc.contributor.author: Barak, Boaz
dc.contributor.author: Le Scao, Teven
dc.contributor.author: Piktus, Aleksandra
dc.contributor.author: Tazi, Nouamane
dc.contributor.author: Pyysalo, Sampo
dc.contributor.author: Wolf, Thomas
dc.contributor.author: Raffel, Colin
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 492253404
dc.converis.url: https://research.utu.fi/converis/portal/Publication/492253404
dc.date.accessioned: 2025-08-27T22:19:52Z
dc.date.available: 2025-08-27T22:19:52Z
dc.description.abstract: The current trend of scaling language models involves increasing both parameter count and training data set size. Extrapolating this trend suggests that training data set size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches to mitigating data scarcity, including augmenting the training data set with code data or removing commonly used filters. Models and data sets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
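
The abstract's central quantitative claim is that repeated tokens contribute exponentially decaying value, so effective data saturates as epochs grow. The Python sketch below illustrates that idea only; the function effective_data, the saturation constant r_star, and the printed numbers are illustrative assumptions, not the paper's fitted scaling law or constants.

```python
import math

def effective_data(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Effective token count when a fixed unique corpus is repeated.

    Epochs beyond the first contribute exponentially decaying value, so the
    effective data saturates at unique_tokens * (1 + r_star). The constant
    r_star is an illustrative assumption, not the paper's fitted value.
    """
    repetitions = max(epochs - 1.0, 0.0)  # passes over the data beyond the first
    return unique_tokens + unique_tokens * r_star * (1.0 - math.exp(-repetitions / r_star))

# Example: 100B unique tokens repeated for 4 epochs still retains most of the
# value of 400B unique tokens, while at 40 epochs the marginal value of
# further repetition is close to zero.
for epochs in (1, 4, 10, 40):
    print(epochs, f"{effective_data(100e9, epochs) / 1e9:.1f}B effective tokens")
```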
dc.identifier.eissn: 1533-7928
dc.identifier.jour-issn: 1532-4435
dc.identifier.olddbid: 201995
dc.identifier.oldhandle: 10024/185022
dc.identifier.uri: https://www.utupub.fi/handle/11111/40921
dc.identifier.url: https://www.jmlr.org/papers/v26/24-1000.html
dc.identifier.urn: URN:NBN:fi-fe2025082789636
dc.language.iso: en
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 Scientific article
dc.publisher: MICROTOME PUBL
dc.publisher.country: United States (en_GB)
dc.publisher.country: Yhdysvallat (USA) (fi_FI)
dc.publisher.country-code: US
dc.publisher.place: BROOKLINE
dc.relation.articlenumber: 53
dc.relation.ispartofjournal: Journal of Machine Learning Research
dc.relation.volume: 26
dc.source.identifier: https://www.utupub.fi/handle/10024/185022
dc.title: Scaling Data-Constrained Language Models
dc.year.issued: 2025

Files

Name: 24-1000.pdf
Size: 2.08 MB
Format: Adobe Portable Document Format