From Web Crawl to Clean Register-Annotated Corpora

Laippala Veronika; Rönnqvist Samuel; Hellström Saara; Luotolahti Juhani; Repo Liina; Salmela Anna; Skantsi Valtteri; Pyysalo Sampo

dc.contributor.author	Laippala Veronika
dc.contributor.author	Rönnqvist Samuel
dc.contributor.author	Hellström Saara
dc.contributor.author	Luotolahti Juhani
dc.contributor.author	Repo Liina
dc.contributor.author	Salmela Anna
dc.contributor.author	Skantsi Valtteri
dc.contributor.author	Pyysalo Sampo
dc.contributor.organization	fi=kieli- ja käännöstieteiden laitos\|en=School of Languages and Translation Studies\|
dc.contributor.organization	fi=kieli- ja puheteknologia\|en=Language and Speech Technology\|
dc.contributor.organization-code	1.2.246.10.2458963.20.56461112866
dc.converis.publication-id	51216717
dc.converis.url	https://research.utu.fi/converis/portal/Publication/51216717
dc.date.accessioned	2022-02-25T16:09:34Z
dc.date.available	2022-02-25T16:09:34Z
dc.description.abstract	The web presents unprecedented opportunities for large-scale collection of text in many languages. However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents. In this paper, we evaluate a multilingual approach to this end. Our starting points are the Swedish and French Common Crawl datasets gathered for the 2017 CoNLL shared task, particularly the URLs. We 1) fetch HTML pages based on the URLs and run boilerplate removal, 2) train a classifier to further clean out undesired text fragments, and 3) annotate text registers. We compare boilerplate removal against the CoNLL texts, and find an improvement. For the further cleaning of undesired material, the best results are achieved using Multilingual BERT with monolingual fine-tuning. However, our results are promising also in a cross-lingual setting, without fine-tuning on the target language. Finally, the register annotations show that most of the documents belong to a relatively small set of registers, which are relatively similar in the two languages. A number of additional flags in the annotation are, however, necessary to reflect the wide range of linguistic variation associated with the documents.
dc.format.pagerange	22
dc.identifier.isbn	979-10-95546-68-9
dc.identifier.olddbid	170291
dc.identifier.oldhandle	10024/153401
dc.identifier.uri	https://www.utupub.fi/handle/11111/29335
dc.identifier.url	https://www.aclweb.org/anthology/2020.wac-1.3
dc.identifier.urn	URN:NBN:fi-fe2021042820899
dc.language.iso	en
dc.okm.affiliatedauthor	Laippala, Veronika
dc.okm.affiliatedauthor	Rönnqvist, Samuel
dc.okm.affiliatedauthor	Hellström, Saara
dc.okm.affiliatedauthor	Luotolahti, Matti
dc.okm.affiliatedauthor	Repo, Liina
dc.okm.affiliatedauthor	Salmela, Anna
dc.okm.affiliatedauthor	Skantsi, Valtteri
dc.okm.affiliatedauthor	Pyysalo, Sampo
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.internationalcopublication	not an international co-publication
dc.okm.internationality	International publication
dc.okm.type	A4 Conference Article
dc.publisher.country	France	en_GB
dc.publisher.country	Ranska	fi_FI
dc.publisher.country-code	FR
dc.relation.conference	Web as Corpus Workshop
dc.relation.ispartofseries	Proceedings of the Web as Corpus Workshop
dc.relation.volume	12
dc.source.identifier	https://www.utupub.fi/handle/10024/153401
dc.title	From Web Crawl to Clean Register-Annotated Corpora
dc.title.book	Proceedings of the 12th Web as Corpus Workshop
dc.year.issued	2020

Files

Name: 2020.wac-1.3.pdf
Size: 441.73 KB
Format: Adobe Portable Document Format

Collections

Self-archived publications