Multilingual and Zero-Shot is Closing in on Monolingual Web Register Classification

dc.contributor.author: Rönnqvist Samuel
dc.contributor.author: Skantsi Valtteri
dc.contributor.author: Oinonen Miika
dc.contributor.author: Laippala Veronika
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code: 1.2.246.10.2458963.20.56461112866
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code: 2602100
dc.converis.publication-id: 56911747
dc.converis.url: https://research.utu.fi/converis/portal/Publication/56911747
dc.date.accessioned: 2022-10-28T14:07:24Z
dc.date.available: 2022-10-28T14:07:24Z
dc.description.abstract: This article studies register classification of documents from the unrestricted web, such as news articles or opinion blogs, in a multilingual setting, exploring both the benefit of training on multiple languages and the capabilities for zero-shot cross-lingual transfer. While the wide range of linguistic variation found on the web poses challenges for register classification, recent studies have shown that good levels of cross-lingual transfer from the extensive English CORE corpus to other languages can be achieved. In this study, we show that training on multiple languages 1) benefits languages with limited amounts of register-annotated data, 2) on average achieves performance on par with monolingual models, and 3) greatly improves upon previous zero-shot results in Finnish, French and Swedish. The best results are achieved with the multilingual XLM-R model. As data, we use the CORE corpus series featuring register-annotated data from the unrestricted web.
dc.format.pagerange: 157
dc.format.pagerange: 165
dc.identifier.isbn: 978-91-7929-614-8
dc.identifier.issn: 1650-3686
dc.identifier.jour-issn: 1650-3686
dc.identifier.olddbid: 186398
dc.identifier.oldhandle: 10024/169492
dc.identifier.uri: https://www.utupub.fi/handle/11111/38221
dc.identifier.url: https://ep.liu.se/en/conference-article.aspx?series=ecp&issue=178&Article_No=16
dc.identifier.urn: URN:NBN:fi-fe2021093048931
dc.language.iso: en
dc.okm.affiliatedauthor: Rönnqvist, Samuel
dc.okm.affiliatedauthor: Skantsi, Valtteri
dc.okm.affiliatedauthor: Oinonen, Miika
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.discipline: 113 Computer and information sciences [en_GB]
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet [fi_FI]
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Sweden [en_GB]
dc.publisher.country: Ruotsi [fi_FI]
dc.publisher.country-code: SE
dc.relation.conference: Nordic Conference on Computational Linguistics
dc.relation.ispartofjournal: Linköping Electronic Conference Proceedings
dc.relation.ispartofseries: Linköping Electronic Conference Proceedings
dc.relation.volume: 178
dc.source.identifier: https://www.utupub.fi/handle/10024/169492
dc.title: Multilingual and Zero-Shot is Closing in on Monolingual Web Register Classification
dc.title.book: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
dc.year.issued: 2021

Files

Name:
ecp2021178016.pdf
Size:
378.02 KB
Format:
Adobe Portable Document Format
Description:
Publisher's PDF