FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

dc.contributor.author: Henriksson, Erik
dc.contributor.author: Tarkka, Otto
dc.contributor.author: Ginter, Filip
dc.contributor.organization: Data Analytics
dc.contributor.organization: Digital Language Studies, Chinese, French, German, Italian, Spanish
dc.contributor.organization-code: 1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 506553763
dc.converis.url: https://research.utu.fi/converis/portal/Publication/506553763
dc.date.accessioned: 2026-01-21T12:20:48Z
dc.date.available: 2026-01-21T12:20:48Z
dc.description.abstract: Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
dc.format.pagerange: 258-268
dc.identifier.isbn: 978-9908-53-109-0
dc.identifier.issn: 1736-8197
dc.identifier.jour-issn: 1736-8197
dc.identifier.olddbid: 212367
dc.identifier.oldhandle: 10024/195385
dc.identifier.uri: https://www.utupub.fi/handle/11111/51692
dc.identifier.url: https://aclanthology.org/2025.nodalida-1.27/
dc.identifier.urn: URN:NBN:fi-fe202601216859
dc.language.iso: en
dc.okm.affiliatedauthor: Henriksson, Erik
dc.okm.affiliatedauthor: Tarkka, Otto
dc.okm.affiliatedauthor: Ginter, Filip
dc.okm.discipline: 113 Computer and information sciences
dc.okm.discipline: 6121 Languages
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Estonia
dc.publisher.country-code: EE
dc.relation.conference: Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies
dc.relation.ispartofjournal: NEALT proceedings series
dc.relation.volume: 57
dc.source.identifier: https://www.utupub.fi/handle/10024/195385
dc.title: FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
dc.title.book: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
dc.year.issued: 2025

Files

Name: 2025.nodalida-1.27.pdf
Size: 236.49 KB
Format: Adobe Portable Document Format