Hyppää sisältöön
    • Suomeksi
    • In English
  • Suomeksi
  • In English
  • Kirjaudu
Näytä aineisto 
  •   Etusivu
  • 3. UTUCris-artikkelit
  • Rinnakkaistallenteet
  • Näytä aineisto
  •   Etusivu
  • 3. UTUCris-artikkelit
  • Rinnakkaistallenteet
  • Näytä aineisto
JavaScript is disabled for your browser. Some features of this site may not work without it.

Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora

Myntti, Amanda; Repo, Liina; Freyermuth, Elian; Kanner, Antti; Laippala, Veronika; Henriksson, Erik

Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora

Myntti, Amanda
Repo, Liina
Freyermuth, Elian
Kanner, Antti
Laippala, Veronika
Henriksson, Erik
Katso/Avaa
Intersecting_Register_and_Genre__Understanding_the_Contents_of_Web_Crawled_Corpora.pdf (729.6Kb)
Lataukset: 

doi:10.18653/v1/2024.nlp4dh-1.38
Näytä kaikki kuvailutiedot
Julkaisun pysyvä osoite on:
https://urn.fi/URN:NBN:fi-fe2025082789003
Tiivistelmä

Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.

Kokoelmat
  • Rinnakkaistallenteet [27094]

Turun yliopiston kirjasto | Turun yliopisto
julkaisut@utu.fi | Tietosuoja | Saavutettavuusseloste
 

 

Tämä kokoelma

JulkaisuajatTekijätNimekkeetAsiasanatTiedekuntaLaitosOppiaineYhteisöt ja kokoelmat

Omat tiedot

Kirjaudu sisäänRekisteröidy

Turun yliopiston kirjasto | Turun yliopisto
julkaisut@utu.fi | Tietosuoja | Saavutettavuusseloste