FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

dc.contributor.author: Henriksson, Erik
dc.contributor.author: Tarkka, Otto
dc.contributor.author: Ginter, Filip
dc.contributor.organization: Data Analytics
dc.contributor.organization: Digital Language Studies, Chinese, French, German, Italian, Spanish
dc.contributor.organization-code: 1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 506553763
dc.converis.url: https://research.utu.fi/converis/portal/Publication/506553763
dc.date.accessioned: 2026-01-21T12:20:48Z
dc.date.available: 2026-01-21T12:20:48Z
dc.description.abstract: Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
dc.format.pagerange: 258-268
dc.identifier.isbn: 978-9908-53-109-0
dc.identifier.issn: 1736-8197
dc.identifier.jour-issn: 1736-8197
dc.identifier.olddbid: 212367
dc.identifier.oldhandle: 10024/195385
dc.identifier.uri: https://www.utupub.fi/handle/11111/51692
dc.identifier.url: https://aclanthology.org/2025.nodalida-1.27/
dc.identifier.urn: URN:NBN:fi-fe202601216859
dc.language.iso: en
dc.okm.affiliatedauthor: Henriksson, Erik
dc.okm.affiliatedauthor: Tarkka, Otto
dc.okm.affiliatedauthor: Ginter, Filip
dc.okm.discipline: 113 Computer and information sciences
dc.okm.discipline: 6121 Languages
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Estonia
dc.publisher.country-code: EE
dc.relation.conference: Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies
dc.relation.ispartofjournal: NEALT proceedings series
dc.relation.volume: 57
dc.source.identifier: https://www.utupub.fi/handle/10024/195385
dc.title: FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
dc.title.book: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
dc.year.issued: 2025

Files

Name: 2025.nodalida-1.27.pdf
Size: 236.49 KB
Format: Adobe Portable Document Format