Beyond the English web: Zero-shot cross-lingual and lightweight monolingual classification of registers

dc.contributor.authorRepo Liina
dc.contributor.authorSkantsi Valtteri
dc.contributor.authorRönnqvist Samuel
dc.contributor.authorHellström Saara
dc.contributor.authorOinonen Miika
dc.contributor.authorSalmela Anna
dc.contributor.authorBiber Douglas
dc.contributor.authorEgbert Jesse
dc.contributor.authorPyysalo Sampo
dc.contributor.authorLaippala Veronika
dc.contributor.organizationfi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organizationfi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code1.2.246.10.2458963.20.56461112866
dc.contributor.organization-code1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code2602100
dc.converis.publication-id66505697
dc.converis.urlhttps://research.utu.fi/converis/portal/Publication/66505697
dc.date.accessioned2022-10-28T13:15:37Z
dc.date.available2022-10-28T13:15:37Z
dc.description.abstractWe explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse the classification results, finding that certain registers continue to pose challenges, in particular for cross-lingual transfer.
dc.format.pagerange183
dc.format.pagerange191
dc.identifier.isbn978-1-954085-04-6
dc.identifier.olddbid180872
dc.identifier.oldhandle10024/163966
dc.identifier.urihttps://www.utupub.fi/handle/11111/36397
dc.identifier.urlhttps://aclanthology.org/2021.eacl-srw.24.pdf
dc.identifier.urnURN:NBN:fi-fe2021093048653
dc.language.isoen
dc.okm.affiliatedauthorRepo, Liina
dc.okm.affiliatedauthorSkantsi, Valtteri
dc.okm.affiliatedauthorRönnqvist, Samuel
dc.okm.affiliatedauthorHellström, Saara
dc.okm.affiliatedauthorOinonen, Miika
dc.okm.affiliatedauthorSalmela, Anna
dc.okm.affiliatedauthorPyysalo, Sampo
dc.okm.affiliatedauthorLaippala, Veronika
dc.okm.discipline113 Computer and information sciences (en_GB)
dc.okm.discipline6121 Languages (en_GB)
dc.okm.discipline113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.discipline6121 Kielitieteet (fi_FI)
dc.okm.internationalcopublicationinternational co-publication
dc.okm.internationalityInternational publication
dc.okm.typeA4 Conference Article
dc.publisher.countryUnited States (en_GB)
dc.publisher.countryYhdysvallat (USA) (fi_FI)
dc.publisher.country-codeUS
dc.relation.conferenceEuropean Chapter of the Association for Computational Linguistics
dc.source.identifierhttps://www.utupub.fi/handle/10024/163966
dc.titleBeyond the English web: Zero-shot cross-lingual and lightweight monolingual classification of registers
dc.title.bookProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
dc.year.issued2021

Files

Name:
2021.eacl-srw.24.pdf
Size:
294.57 KB
Format:
Adobe Portable Document Format
Description:
Publisher's PDF