Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents

dc.contributor.author: Laippala Veronika
dc.contributor.author: Egbert Jesse
dc.contributor.author: Biber Douglas
dc.contributor.author: Kyröläinen Aki-Juhani
dc.contributor.organization: fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code: 1.2.246.10.2458963.20.56461112866
dc.converis.publication-id: 53306232
dc.converis.url: https://research.utu.fi/converis/portal/Publication/53306232
dc.date.accessioned: 2025-08-27T23:38:27Z
dc.date.available: 2025-08-27T23:38:27Z
dc.description.abstract: The Internet offers great possibilities for many scientific disciplines that utilize text data. However, the potential of online data can be limited by the lack of information on the genre or register of the documents, as register (whether a text is, e.g., a news article or a recipe) is arguably the most important predictor of linguistic variation (see Biber in Corpus Linguist Linguist Theory 8:9-37, 2012). Despite having received significant attention in recent years, the modeling of online registers has faced a number of challenges, and previous studies have presented contradictory results. In particular, these have concerned (1) the extent to which registers can be automatically identified in a large, unrestricted corpus of web documents and (2) the stability of the models, specifically the kinds of linguistic features that achieve the best performance while reflecting the registers instead of corpus idiosyncrasies. Furthermore, although the linguistic properties of registers vary importantly in a number of ways that may affect their modeling, this variation is often bypassed. In this article, we tackle these issues. We model online registers in the largest available corpus of online registers, the Corpus of Online Registers of English (CORE). Additionally, we evaluate the stability of the models towards corpus idiosyncrasies, analyze the role of different linguistic features in them, and examine how individual registers differ in these two aspects. We show that (1) competitive classification performance on a large-scale, unrestricted corpus can be achieved through a combination of lexico-grammatical features, (2) the inclusion of grammatical information improves the stability of the model, whereas many of the previously best-performing feature sets are less stable, and (3) registers can be placed on a continuum based on the discriminative importance of lexis and grammar. These register-specific characteristics can explain the variation observed in previous studies concerning the automatic identification of online registers and the importance of different linguistic features for them. Thus, our results offer explanations for the jungle-likeness of online data and provide essential information on online registers for all studies using online data.
dc.identifier.eissn: 1574-0218
dc.identifier.jour-issn: 1574-020X
dc.identifier.olddbid: 204345
dc.identifier.oldhandle: 10024/187372
dc.identifier.uri: https://www.utupub.fi/handle/11111/52564
dc.identifier.url: https://link.springer.com/article/10.1007/s10579-020-09519-z
dc.identifier.urn: URN:NBN:fi-fe2021042824797
dc.language.iso: en
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.discipline: 6122 Literature studies (en_GB)
dc.okm.discipline: 6122 Kirjallisuuden tutkimus (fi_FI)
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A1 ScientificArticle
dc.publisher: SPRINGER
dc.publisher.country: Netherlands (en_GB)
dc.publisher.country: Alankomaat (fi_FI)
dc.publisher.country-code: NL
dc.relation.doi: 10.1007/s10579-020-09519-z
dc.relation.ispartofjournal: Language Resources and Evaluation
dc.source.identifier: https://www.utupub.fi/handle/10024/187372
dc.title: Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents
dc.year.issued: 2021

Files

Name:
Laippala2021_Article_ExploringTheRoleOfLexisAndGram.pdf
Size:
1.74 MB
Format:
Adobe Portable Document Format
Description:
Publisher's PDF