From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations

dc.contributor.author: Henriksson, Erik
dc.contributor.author: Myntti, Amanda
dc.contributor.author: Hellström, Saara
dc.contributor.author: Erten-Johansson, Selcen
dc.contributor.author: Eskelinen, Anni
dc.contributor.author: Repo, Liina
dc.contributor.author: Laippala, Veronika
dc.contributor.organization: fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization: fi=tietotekniikan laitos|en=Department of Computing|
dc.contributor.organization-code: 1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code: 1.2.246.10.2458963.20.85312822902
dc.converis.publication-id: 477916769
dc.converis.url: https://research.utu.fi/converis/portal/Publication/477916769
dc.date.accessioned: 2025-08-27T23:53:09Z
dc.date.available: 2025-08-27T23:53:09Z
dc.description.abstract: In corpus linguistics, registers–language varieties suited to different contexts–have traditionally been defined by their situations of use, yet recent studies reveal significant situational variation within registers. Previous quantitative studies, however, have been limited to English, leaving this variation in other languages largely unexplored. To address this gap, we apply a quantitative situational analysis to a large multilingual web register corpus, using large language models (LLMs) to annotate texts in English, Finnish, French, Swedish, and Turkish for 23 situational parameters. Using clustering techniques, we identify six situational text types, such as “Advice”, “Opinion” and “Marketing”, each characterized by distinct situational features. We explore the relationship between these text types and traditional register categories, finding partial alignment, though no register maps perfectly onto a single cluster. These results support the quantitative approach to situational analysis and are consistent with earlier findings for English. Cross-linguistic comparisons show that language accounts for only a small part of situational variation within registers, suggesting registers are situationally similar across languages. This study demonstrates the utility of LLMs in multilingual register analysis and deepens our understanding of situational variation within registers.
dc.format.pagerange: 308
dc.format.pagerange: 318
dc.identifier.isbn: 979-8-89176-181-0
dc.identifier.olddbid: 204797
dc.identifier.oldhandle: 10024/187824
dc.identifier.uri: https://www.utupub.fi/handle/11111/53474
dc.identifier.url: https://doi.org/10.18653/v1/2024.nlp4dh-1.30
dc.identifier.urn: URN:NBN:fi-fe2025082786567
dc.language.iso: en
dc.okm.affiliatedauthor: Henriksson, Erik
dc.okm.affiliatedauthor: Myntti, Amanda
dc.okm.affiliatedauthor: Hellström, Saara
dc.okm.affiliatedauthor: Erten Johansson, Selcen
dc.okm.affiliatedauthor: Eskelinen, Anni
dc.okm.affiliatedauthor: Repo, Liina
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.discipline: 6121 Languages (en_GB)
dc.okm.discipline: 6121 Kielitieteet (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: United States (en_GB)
dc.publisher.country: Yhdysvallat (USA) (fi_FI)
dc.publisher.country-code: US
dc.relation.conference: International Conference on Natural Language Processing for Digital Humanities
dc.relation.doi: 10.18653/v1/2024.nlp4dh-1.30
dc.source.identifier: https://www.utupub.fi/handle/10024/187824
dc.title: From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations
dc.title.book: Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
dc.year.issued: 2024

Files

Name: 2024.nlp4dh-1.30.pdf
Size: 729.84 KB
Format: Adobe Portable Document Format