Analyzing register variation in web texts through automatic segmentation

dc.contributor.author: Henriksson, Erik
dc.contributor.author: Hellström, Saara
dc.contributor.author: Laippala, Veronika
dc.contributor.organization: fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization: fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code: 1.2.246.10.2458963.20.56461112866
dc.contributor.organization-code: 1.2.246.10.2458963.20.36764574459
dc.converis.publication-id: 508751027
dc.converis.url: https://research.utu.fi/converis/portal/Publication/508751027
dc.date.accessioned: 2026-04-24T17:49:38Z
dc.description.abstract: This study introduces a novel method for analyzing register variation in web texts through classification-based register segmentation. While traditional text-linguistic register analysis treats web documents as single units, we present a recursive binary segmentation approach that automatically identifies register shifts within web documents without labeled segment data, using a ModernBERT classifier fine-tuned on full web documents. Manual evaluation shows our approach to be reliable, and our experimental results reveal that register segmentation leads to more accurate register classification, helps models learn more distinct register categories, and produces text units with more consistent linguistic characteristics. The approach offers new insights into document-internal register variation in online discourse.
dc.format.pagerange: 19
dc.format.pagerange: 7
dc.identifier.isbn: 979-8-89176-234-3
dc.identifier.uri: https://www.utupub.fi/handle/11111/59092
dc.identifier.url: https://doi.org/10.18653/v1/2025.nlp4dh-1.2
dc.identifier.urn: URN:NBN:fi-fe2026022315578
dc.language.iso: en
dc.okm.affiliatedauthor: Henriksson, Erik
dc.okm.affiliatedauthor: Hellström, Saara
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 6121 Languages [en_GB]
dc.okm.discipline: 6121 Kielitieteet [fi_FI]
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: United States [en_GB]
dc.publisher.country: Yhdysvallat (USA) [fi_FI]
dc.publisher.country-code: US
dc.relation.conference: International Conference on Natural Language Processing for Digital Humanities
dc.relation.doi: 10.18653/v1/2025.nlp4dh-1.2
dc.title: Analyzing register variation in web texts through automatic segmentation
dc.title.book: Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
dc.year.issued: 2025

Files

Name: 2025.nlp4dh-1.2-3.pdf
Size: 1.57 MB
Format: Adobe Portable Document Format