Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora

dc.contributor.author: Myntti, Amanda
dc.contributor.author: Repo, Liina
dc.contributor.author: Freyermuth, Elian
dc.contributor.author: Kanner, Antti
dc.contributor.author: Laippala, Veronika
dc.contributor.author: Henriksson, Erik
dc.contributor.organization: fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization: fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization: fi=tietotekniikan laitos|en=Department of Computing|
dc.contributor.organization-code: 1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code: 1.2.246.10.2458963.20.85312822902
dc.contributor.organization-code: 2602100
dc.converis.publication-id: 477956278
dc.converis.url: https://research.utu.fi/converis/portal/Publication/477956278
dc.date.accessioned: 2025-08-27T20:46:11Z
dc.date.available: 2025-08-27T20:46:11Z
dc.description.abstract: Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.
dc.identifier.isbn: 979-8-89176-181-0
dc.identifier.olddbid: 200194
dc.identifier.oldhandle: 10024/183221
dc.identifier.uri: https://www.utupub.fi/handle/11111/45924
dc.identifier.urn: URN:NBN:fi-fe2025082789003
dc.language.iso: en
dc.okm.affiliatedauthor: Myntti, Amanda
dc.okm.affiliatedauthor: Repo, Liina
dc.okm.affiliatedauthor: Freyermuth, Elian
dc.okm.affiliatedauthor: Kanner, Antti
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.affiliatedauthor: Henriksson, Erik
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 6121 Languages [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.discipline: 6121 Kielitieteet [fi_FI]
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: United States [en_GB]
dc.publisher.country: Yhdysvallat (USA) [fi_FI]
dc.publisher.country-code: US
dc.relation.conference: International Conference on Natural Language Processing for Digital Humanities
dc.relation.doi: 10.18653/v1/2024.nlp4dh-1.38
dc.source.identifier: https://www.utupub.fi/handle/10024/183221
dc.title: Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora
dc.title.book: Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
dc.year.issued: 2024

Files

Name: Intersecting_Register_and_Genre__Understanding_the_Contents_of_Web_Crawled_Corpora.pdf
Size: 729.68 KB
Format: Adobe Portable Document Format