Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations

dc.contributor.author: Nuutinen, Emil
dc.contributor.author: Rastas, Iiro
dc.contributor.author: Ginter, Filip
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 506499977
dc.converis.url: https://research.utu.fi/converis/portal/Publication/506499977
dc.date.accessioned: 2026-01-21T12:32:04Z
dc.date.available: 2026-01-21T12:32:04Z
dc.description.abstract: We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset, and more generally of the MT method, through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, and the performance of downstream trained QA models. In all these evaluations, we find that the method of transfer is not only simple to use but also produces consistently better translated data. Given its good performance on the SQuAD dataset, it is likely that the method can be used to translate other similar span-annotated datasets for other tasks and languages as well. All code and data are available under an open license: data at HuggingFace TurkuNLP/squad_v2_fi, code on GitHub TurkuNLP/squad2-fi, and model at HuggingFace TurkuNLP/bert-base-finnish-cased-squad2.
dc.format.pagerange: 424
dc.format.pagerange: 432
dc.identifier.isbn: 978-9908-53-109-0
dc.identifier.issn: 1736-8197
dc.identifier.jour-issn: 1736-8197
dc.identifier.olddbid: 212623
dc.identifier.oldhandle: 10024/195641
dc.identifier.uri: https://www.utupub.fi/handle/11111/52882
dc.identifier.url: https://aclanthology.org/2025.nodalida-1.46/
dc.identifier.urn: URN:NBN:fi-fe202601215982
dc.language.iso: en
dc.okm.affiliatedauthor: Nuutinen, Emil
dc.okm.affiliatedauthor: Rastas, Iiro
dc.okm.affiliatedauthor: Ginter, Filip
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Estonia (en_GB)
dc.publisher.country: Viro (fi_FI)
dc.publisher.country-code: EE
dc.relation.conference: Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies
dc.relation.ispartofjournal: NEALT proceedings series
dc.relation.volume: 57
dc.source.identifier: https://www.utupub.fi/handle/10024/195641
dc.title: Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations
dc.title.book: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
dc.year.issued: 2025

Files

Name: 2025.nodalida-1.46.pdf
Size: 250.9 KB
Format: Adobe Portable Document Format