The birth of Romanian BERT

dc.contributor.author: Stefan Dumitrescu
dc.contributor.author: Andrei-Marius Avram
dc.contributor.author: Sampo Pyysalo
dc.contributor.organization: Language and Speech Technology (fi: kieli- ja puheteknologia)
dc.contributor.organization-code: 1.2.246.10.2458963.20.47465613983
dc.converis.publication-id: 51797331
dc.converis.url: https://research.utu.fi/converis/portal/Publication/51797331
dc.date.accessioned: 2022-10-28T13:53:39Z
dc.date.available: 2022-10-28T13:53:39Z
dc.description.abstract: Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, and an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that contains information on how to obtain the corpus, how to fine-tune and use the model in production (with practical examples), and how to fully replicate the evaluation process.
dc.format.pagerange: 4324–4328
dc.identifier.isbn: 978-1-952148-90-3
dc.identifier.jour-issn: 0736-587X
dc.identifier.olddbid: 185018
dc.identifier.oldhandle: 10024/168112
dc.identifier.uri: https://www.utupub.fi/handle/11111/40903
dc.identifier.url: https://www.aclweb.org/anthology/2020.findings-emnlp.387/
dc.identifier.urn: URN:NBN:fi-fe2021042824112
dc.language.iso: en
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: United States (en_GB)
dc.publisher.country: Yhdysvallat (USA) (fi_FI)
dc.publisher.country-code: US
dc.relation.conference: Empirical Methods in Natural Language Processing
dc.relation.doi: 10.18653/v1/2020.findings-emnlp.387
dc.relation.ispartofjournal: Annual Meeting of the Association for Computational Linguistics
dc.source.identifier: https://www.utupub.fi/handle/10024/168112
dc.title: The birth of Romanian BERT
dc.title.book: Findings of the Association for Computational Linguistics: EMNLP 2020
dc.year.issued: 2020

Files

Name: 2020.findings-emnlp.387.pdf
Size: 199.26 KB
Format: Adobe Portable Document Format
Description: Publisher's PDF