Register identification from the unrestricted open Web using the Corpus of Online Registers of English

dc.contributor.author: Laippala Veronika
dc.contributor.author: Rönnqvist Samuel
dc.contributor.author: Oinonen Miika
dc.contributor.author: Kyröläinen Aki-Juhani
dc.contributor.author: Salmela Anna
dc.contributor.author: Biber Douglas
dc.contributor.author: Egbert Jesse
dc.contributor.author: Pyysalo Sampo
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code: 1.2.246.10.2458963.20.56461112866
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code: 2602100
dc.converis.publication-id: 177822802
dc.converis.url: https://research.utu.fi/converis/portal/Publication/177822802
dc.date.accessioned: 2025-08-27T23:26:46Z
dc.date.available: 2025-08-27T23:26:46Z
dc.description.abstract: This article examines the automatic identification of Web registers, that is, text varieties such as news articles and reviews. Most studies have focused on corpora restricted to include only preselected classes with well-defined characteristics. These corpora feature only a subset of documents found on the unrestricted open Web, for which register identification has been particularly difficult because the range of linguistic variation on the Web is known to be substantial. As part of this study, we present the first open release of the Corpus of Online Registers of English (CORE), which is drawn from the unrestricted open Web and, currently, is the largest collection of manually annotated Web registers. Furthermore, we demonstrate that the CORE registers can be automatically identified with competitive results, with the best performance being an F1-score of 68% with the deep learning model BERT. The best performance was achieved using two modeling strategies. The first one involved modeling the registers using propagated register labels, that is, repeating the main register label along with its corresponding subregister label in a multilabel model. In the second one, we explored how the length of the document affects model performance, discovering that the beginning provided superior classification accuracy. Overall, the current study presents a systematic approach for the automatic identification of a large number of Web registers from the unrestricted Web, hence providing new pathways for future studies.
dc.identifier.eissn: 1574-0218
dc.identifier.jour-issn: 1574-020X
dc.identifier.olddbid: 203978
dc.identifier.oldhandle: 10024/187005
dc.identifier.uri: https://www.utupub.fi/handle/11111/51783
dc.identifier.url: https://link.springer.com/article/10.1007/s10579-022-09624-1
dc.identifier.urn: URN:NBN:fi-fe202301142856
dc.language.iso: en
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.affiliatedauthor: Rönnqvist, Samuel
dc.okm.affiliatedauthor: Oinonen, Miika
dc.okm.affiliatedauthor: Kyröläinen, Aki
dc.okm.affiliatedauthor: Salmela, Anna
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 6121 Languages [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 6121 Kielitieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: Springer
dc.publisher.country: Netherlands [en_GB]
dc.publisher.country: Alankomaat [fi_FI]
dc.publisher.country-code: NL
dc.relation.doi: 10.1007/s10579-022-09624-1
dc.relation.ispartofjournal: Language Resources and Evaluation
dc.source.identifier: https://www.utupub.fi/handle/10024/187005
dc.title: Register identification from the unrestricted open Web using the Corpus of Online Registers of English
dc.year.issued: 2022

Files

Name: s10579-022-09624-1.pdf
Size: 1.42 MB
Format: Adobe Portable Document Format