From Web Crawl to Clean Register-Annotated Corpora

dc.contributor.authorLaippala Veronika
dc.contributor.authorRönnqvist Samuel
dc.contributor.authorHellström Saara
dc.contributor.authorLuotolahti Juhani
dc.contributor.authorRepo Liina
dc.contributor.authorSalmela Anna
dc.contributor.authorSkantsi Valtteri
dc.contributor.authorPyysalo Sampo
dc.contributor.organizationfi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organizationfi=kieli- ja puheteknologia|en=Language and Speech Technology|
dc.contributor.organization-code1.2.246.10.2458963.20.47465613983
dc.contributor.organization-code1.2.246.10.2458963.20.56461112866
dc.contributor.organization-code2602100
dc.converis.publication-id51216717
dc.converis.urlhttps://research.utu.fi/converis/portal/Publication/51216717
dc.date.accessioned2022-02-25T16:09:34Z
dc.date.available2022-02-25T16:09:34Z
dc.description.abstractThe web presents unprecedented opportunities for large-scale collection of text in many languages. However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents. In this paper, we evaluate a multilingual approach to this end. Our starting points are the Swedish and French Common Crawl datasets gathered for the 2017 CoNLL shared task, particularly the URLs. We 1) fetch HTML pages based on the URLs and run boilerplate removal, 2) train a classifier to further clean out undesired text fragments, and 3) annotate text registers. We compare boilerplate removal against the CoNLL texts, and find an improvement. For the further cleaning of undesired material, the best results are achieved using Multilingual BERT with monolingual fine-tuning. However, our results are promising also in a cross-lingual setting, without fine-tuning on the target language. Finally, the register annotations show that most of the documents belong to a relatively small set of registers, which are relatively similar in the two languages. A number of additional flags in the annotation are, however, necessary to reflect the wide range of linguistic variation associated with the documents.
dc.format.pagerange14
dc.format.pagerange22
dc.identifier.isbn979-10-95546-68-9
dc.identifier.olddbid170291
dc.identifier.oldhandle10024/153401
dc.identifier.urihttps://www.utupub.fi/handle/11111/29335
dc.identifier.urlhttps://www.aclweb.org/anthology/2020.wac-1.3
dc.identifier.urnURN:NBN:fi-fe2021042820899
dc.language.isoen
dc.okm.affiliatedauthorLaippala, Veronika
dc.okm.affiliatedauthorRönnqvist, Samuel
dc.okm.affiliatedauthorHellström, Saara
dc.okm.affiliatedauthorLuotolahti, Matti
dc.okm.affiliatedauthorRepo, Liina
dc.okm.affiliatedauthorSalmela, Anna
dc.okm.affiliatedauthorSkantsi, Valtteri
dc.okm.affiliatedauthorPyysalo, Sampo
dc.okm.discipline113 Computer and information sciencesen_GB
dc.okm.discipline6121 Languagesen_GB
dc.okm.discipline113 Tietojenkäsittely ja informaatiotieteetfi_FI
dc.okm.discipline6121 Kielitieteetfi_FI
dc.okm.internationalcopublicationnot an international co-publication
dc.okm.internationalityInternational publication
dc.okm.typeA4 Conference Article
dc.publisher.countryFranceen_GB
dc.publisher.countryRanskafi_FI
dc.publisher.country-codeFR
dc.relation.conferenceWeb as Corpus Workshop
dc.relation.ispartofseriesProceedings of the Web as Corpus Workshop
dc.relation.volume12
dc.source.identifierhttps://www.utupub.fi/handle/10024/153401
dc.titleFrom Web Crawl to Clean Register-Annotated Corpora
dc.title.bookProceedings of the 12th Web as Corpus Workshop
dc.year.issued2020

Files

Name:
2020.wac-1.3.pdf
Size:
441.73 KB
Format:
Adobe Portable Document Format