Register identification from the unrestricted open Web using the Corpus of Online Registers of English

Laippala Veronika; Rönnqvist Samuel; Oinonen Miika; Kyröläinen Aki-Juhani; Salmela Anna; Biber Douglas; Egbert Jesse; Pyysalo Sampo

dc.contributor.author	Laippala Veronika
dc.contributor.author	Rönnqvist Samuel
dc.contributor.author	Oinonen Miika
dc.contributor.author	Kyröläinen Aki-Juhani
dc.contributor.author	Salmela Anna
dc.contributor.author	Biber Douglas
dc.contributor.author	Egbert Jesse
dc.contributor.author	Pyysalo Sampo
dc.contributor.organization	fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization	fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code	1.2.246.10.2458963.20.56461112866
dc.converis.publication-id	177822802
dc.converis.url	https://research.utu.fi/converis/portal/Publication/177822802
dc.date.accessioned	2025-08-27T23:26:46Z
dc.date.available	2025-08-27T23:26:46Z
dc.description.abstract	This article examines the automatic identification of Web registers, that is, text varieties such as news articles and reviews. Most studies have focused on corpora restricted to include only preselected classes with well-defined characteristics. These corpora feature only a subset of documents found on the unrestricted open Web, for which register identification has been particularly difficult because the range of linguistic variation on the Web is known to be substantial. As part of this study, we present the first open release of the Corpus of Online Registers of English (CORE), which is drawn from the unrestricted open Web and, currently, is the largest collection of manually annotated Web registers. Furthermore, we demonstrate that the CORE registers can be automatically identified with competitive results, with the best performance being an F1-score of 68% with the deep learning model BERT. The best performance was achieved using two modeling strategies. The first one involved modeling the registers using propagated register labels, that is, repeating the main register label along with its corresponding subregister label in a multilabel model. In the second one, we explored how the length of the document affects model performance, discovering that the beginning provided superior classification accuracy. Overall, the current study presents a systematic approach for the automatic identification of a large number of Web registers from the unrestricted Web, hence providing new pathways for future studies.
dc.identifier.eissn	1574-0218
dc.identifier.jour-issn	1574-020X
dc.identifier.olddbid	203978
dc.identifier.oldhandle	10024/187005
dc.identifier.uri	https://www.utupub.fi/handle/11111/51783
dc.identifier.url	https://link.springer.com/article/10.1007/s10579-022-09624-1
dc.identifier.urn	URN:NBN:fi-fe202301142856
dc.language.iso	en
dc.okm.affiliatedauthor	Laippala, Veronika
dc.okm.affiliatedauthor	Rönnqvist, Samuel
dc.okm.affiliatedauthor	Oinonen, Miika
dc.okm.affiliatedauthor	Kyröläinen, Aki
dc.okm.affiliatedauthor	Salmela, Anna
dc.okm.affiliatedauthor	Pyysalo, Sampo
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A1 ScientificArticle
dc.publisher	Springer
dc.publisher.country	Netherlands	en_GB
dc.publisher.country	Alankomaat	fi_FI
dc.publisher.country-code	NL
dc.relation.doi	10.1007/s10579-022-09624-1
dc.relation.ispartofjournal	Language Resources and Evaluation
dc.source.identifier	https://www.utupub.fi/handle/10024/187005
dc.title	Register identification from the unrestricted open Web using the Corpus of Online Registers of English
dc.year.issued	2022

Files

Name: s10579-022-09624-1.pdf
Size: 1.42 MB
Format: Adobe Portable Document Format

Collections

Self-archived publications