Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora

Myntti, Amanda; Repo, Liina; Freyermuth, Elian; Kanner, Antti; Laippala, Veronika; Henriksson, Erik

dc.contributor.author	Myntti, Amanda
dc.contributor.author	Repo, Liina
dc.contributor.author	Freyermuth, Elian
dc.contributor.author	Kanner, Antti
dc.contributor.author	Laippala, Veronika
dc.contributor.author	Henriksson, Erik
dc.contributor.organization	Digital Language Studies, Chinese, French, German, Italian, Spanish
dc.contributor.organization	School of Languages and Translation Studies
dc.contributor.organization	Department of Computing
dc.contributor.organization-code	1.2.246.10.2458963.20.85312822902
dc.converis.publication-id	477956278
dc.converis.url	https://research.utu.fi/converis/portal/Publication/477956278
dc.date.accessioned	2025-08-27T20:46:11Z
dc.date.available	2025-08-27T20:46:11Z
dc.description.abstract	Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.
dc.identifier.isbn	979-8-89176-181-0
dc.identifier.olddbid	200194
dc.identifier.oldhandle	10024/183221
dc.identifier.uri	https://www.utupub.fi/handle/11111/45924
dc.identifier.urn	URN:NBN:fi-fe2025082789003
dc.language.iso	en
dc.okm.affiliatedauthor	Myntti, Amanda
dc.okm.affiliatedauthor	Repo, Liina
dc.okm.affiliatedauthor	Freyermuth, Elian
dc.okm.affiliatedauthor	Kanner, Antti
dc.okm.affiliatedauthor	Laippala, Veronika
dc.okm.affiliatedauthor	Henriksson, Erik
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A4 Conference Article
dc.publisher.country	United States	en_GB
dc.publisher.country-code	US
dc.relation.conference	International Conference on Natural Language Processing for Digital Humanities
dc.relation.doi	10.18653/v1/2024.nlp4dh-1.38
dc.source.identifier	https://www.utupub.fi/handle/10024/183221
dc.title	Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora
dc.title.book	Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
dc.year.issued	2024

Files

Name: Intersecting_Register_and_Genre__Understanding_the_Contents_of_Web_Crawled_Corpora.pdf
Size: 729.68 KB
Format: Adobe Portable Document Format

Collections

Self-archived publications