Towards better structured and less noisy Web data: Oscar with Register annotations

dc.contributor.author: Laippala Veronika
dc.contributor.author: Salmela Anna
dc.contributor.author: Rönnqvist Samuel
dc.contributor.author: Aji Alham Fikri
dc.contributor.author: Chang Li-Hsin
dc.contributor.author: Dhifallah Asma
dc.contributor.author: Goulart Larissa
dc.contributor.author: Kortelainen Henna
dc.contributor.author: Pàmies Marc
dc.contributor.author: Prina Dutra Deise
dc.contributor.author: Skantsi Valtteri
dc.contributor.author: Sutawika Lingtang
dc.contributor.author: Pyysalo Sampo
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=englannin kieli, klassilliset kielet ja monikielinen käännösviestintä|en=English, Classics and Multilingual Translation Studies|
dc.contributor.organization: fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code: 1.2.246.10.2458963.20.22758552511
dc.contributor.organization-code: 1.2.246.10.2458963.20.56461112866
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code: 2602100
dc.converis.publication-id: 177823149
dc.converis.url: https://research.utu.fi/converis/portal/Publication/177823149
dc.date.accessioned: 2025-08-27T23:29:13Z
dc.date.available: 2025-08-27T23:29:13Z
dc.description.abstract: Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification, that is, identifying whether the texts are, e.g., forum discussions, lyrical or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used Oscar dataset. Additionally, we evaluate the model against eight new languages, showing that the performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning text files originating from Web crawls from remains of boilerplate and other elements not belonging to the main text of the Web page. The register labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.
dc.format.pagerange: 215
dc.format.pagerange: 221
dc.identifier.issn: 2951-2093
dc.identifier.jour-issn: 2951-2093
dc.identifier.olddbid: 204054
dc.identifier.oldhandle: 10024/187081
dc.identifier.uri: https://www.utupub.fi/handle/11111/52103
dc.identifier.url: https://aclanthology.org/2022.wnut-1.23/
dc.identifier.urn: URN:NBN:fi-fe202301142857
dc.language.iso: en
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.affiliatedauthor: Salmela, Anna
dc.okm.affiliatedauthor: Rönnqvist, Samuel
dc.okm.affiliatedauthor: Chang, Li-Hsin
dc.okm.affiliatedauthor: Dhifallah, Asma
dc.okm.affiliatedauthor: Kortelainen, Henna
dc.okm.affiliatedauthor: Skantsi, Valtteri
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 6121 Languages [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 6121 Kielitieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Korea, Republic of (South Korea) [en_GB]
dc.publisher.country: Korean tasavalta (Etelä-Korea) [fi_FI]
dc.publisher.country-code: KR
dc.relation.conference: International Conference on Computational Linguistics
dc.relation.ispartofjournal: International Conference on Computational Linguistics
dc.relation.ispartofseries: International Conference on Computational Linguistics
dc.relation.volume: 29
dc.relation.volume: 4
dc.source.identifier: https://www.utupub.fi/handle/10024/187081
dc.title: Towards better structured and less noisy Web data: Oscar with Register annotations
dc.title.book: Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)
dc.year.issued: 2022

Files

Name: 2022.wnut-1.23.pdf
Size: 375.41 KB
Format: Adobe Portable Document Format