Analyzing register variation in web texts through automatic segmentation
| dc.contributor.author | Henriksson, Erik | |
| dc.contributor.author | Hellström, Saara | |
| dc.contributor.author | Laippala, Veronika | |
| dc.contributor.organization | fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish| | |
| dc.contributor.organization | fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies| | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.56461112866 | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.36764574459 | |
| dc.converis.publication-id | 508751027 | |
| dc.converis.url | https://research.utu.fi/converis/portal/Publication/508751027 | |
| dc.date.accessioned | 2026-04-24T17:49:38Z | |
| dc.description.abstract | This study introduces a novel method for analyzing register variation in web texts through classification-based register segmentation. While traditional text-linguistic register analysis treats web documents as single units, we present a recursive binary segmentation approach that automatically identifies register shifts within web documents without labeled segment data, using a ModernBERT classifier fine-tuned on full web documents. Manual evaluation shows our approach to be reliable, and our experimental results reveal that register segmentation leads to more accurate register classification, helps models learn more distinct register categories, and produces text units with more consistent linguistic characteristics. The approach offers new insights into document-internal register variation in online discourse. | |
| dc.format.pagerange | 19 | |
| dc.format.pagerange | 7 | |
| dc.identifier.isbn | 979-8-89176-234-3 | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/59092 | |
| dc.identifier.url | https://doi.org/10.18653/v1/2025.nlp4dh-1.2 | |
| dc.identifier.urn | URN:NBN:fi-fe2026022315578 | |
| dc.language.iso | en | |
| dc.okm.affiliatedauthor | Henriksson, Erik | |
| dc.okm.affiliatedauthor | Hellström, Saara | |
| dc.okm.affiliatedauthor | Laippala, Veronika | |
| dc.okm.discipline | 113 Computer and information sciences | en_GB |
| dc.okm.discipline | 113 Tietojenkäsittely ja informaatiotieteet | fi_FI |
| dc.okm.discipline | 6121 Languages | en_GB |
| dc.okm.discipline | 6121 Kielitieteet | fi_FI |
| dc.okm.internationalcopublication | not an international co-publication | |
| dc.okm.internationality | International publication | |
| dc.okm.type | A4 Conference Article | |
| dc.publisher.country | United States | en_GB |
| dc.publisher.country | Yhdysvallat (USA) | fi_FI |
| dc.publisher.country-code | US | |
| dc.relation.conference | International Conference on Natural Language Processing for Digital Humanities | |
| dc.relation.doi | 10.18653/v1/2025.nlp4dh-1.2 | |
| dc.title | Analyzing register variation in web texts through automatic segmentation | |
| dc.title.book | Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities | |
| dc.year.issued | 2025 |