The birth of Romanian BERT
| dc.contributor.author | Stefan Dumitrescu | |
| dc.contributor.author | Andrei-Marius Avram | |
| dc.contributor.author | Sampo Pyysalo | |
| dc.contributor.organization | Language and Speech Technology | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.47465613983 | |
| dc.converis.publication-id | 51797331 | |
| dc.converis.url | https://research.utu.fi/converis/portal/Publication/51797331 | |
| dc.date.accessioned | 2022-10-28T13:53:39Z | |
| dc.date.available | 2022-10-28T13:53:39Z | |
| dc.description.abstract | Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process. | |
| dc.format.pagerange | 4324-4328 | |
| dc.identifier.isbn | 978-1-952148-90-3 | |
| dc.identifier.jour-issn | 0736-587X | |
| dc.identifier.olddbid | 185018 | |
| dc.identifier.oldhandle | 10024/168112 | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/40903 | |
| dc.identifier.url | https://www.aclweb.org/anthology/2020.findings-emnlp.387/ | |
| dc.identifier.urn | URN:NBN:fi-fe2021042824112 | |
| dc.language.iso | en | |
| dc.okm.affiliatedauthor | Pyysalo, Sampo | |
| dc.okm.discipline | 113 Computer and information sciences | en_GB |
| dc.okm.internationalcopublication | international co-publication | |
| dc.okm.internationality | International publication | |
| dc.okm.type | A4 Conference Article | |
| dc.publisher.country | United States | en_GB |
| dc.publisher.country-code | US | |
| dc.relation.conference | Empirical Methods in Natural Language Processing | |
| dc.relation.doi | 10.18653/v1/2020.findings-emnlp.387 | |
| dc.relation.ispartofjournal | Annual Meeting of the Association for Computational Linguistics | |
| dc.source.identifier | https://www.utupub.fi/handle/10024/168112 | |
| dc.title | The birth of Romanian BERT | |
| dc.title.book | Findings of the Association for Computational Linguistics: EMNLP 2020 | |
| dc.year.issued | 2020 |
Files
- Name: 2020.findings-emnlp.387.pdf
- Size: 199.26 KB
- Format: Adobe Portable Document Format
- Description: Publisher's PDF