Textual Paraphrase Dataset for Deep Language Modelling

dc.contributor.authorKanerva Jenna
dc.contributor.authorGinter Filip
dc.contributor.authorChang Li-Hsin
dc.contributor.authorSkantsi Valtteri
dc.contributor.authorKilpeläinen Jemina
dc.contributor.authorKupari Hanna-Mari
dc.contributor.authorPiirto Aurora
dc.contributor.authorSaarni Jenna
dc.contributor.authorSevón Maija
dc.contributor.authorTarkka Otto
dc.contributor.organizationfi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organizationfi=kieli- ja käännöstieteiden laitos|en=School of Languages and Translation Studies|
dc.contributor.organization-code1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code2602100
dc.contributor.organization-code2610301
dc.converis.publication-id176823863
dc.converis.urlhttps://research.utu.fi/converis/portal/Publication/176823863
dc.date.accessioned2025-08-27T21:48:35Z
dc.date.available2025-08-27T21:48:35Z
dc.description.abstract<p>The Turku Paraphrase Corpus is a dataset of over 100,000 Finnish paraphrase pairs. During corpus creation, we strove to gather challenging paraphrase pairs that are better suited to testing the capabilities of natural language understanding models. The paraphrases are both selected and classified manually, so as to minimise lexical overlap and to provide examples that are as structurally and lexically different as possible. An important distinguishing feature of the corpus is that most of the paraphrase pairs are extracted and distributed in their native document context, rather than in isolation. The primary application of the dataset is the development and evaluation of deep language models, and representation learning in general.</p>
dc.format.pagerange343
dc.format.pagerange348
dc.identifier.eisbn978-3-031-17258-8
dc.identifier.isbn978-3-031-17257-1
dc.identifier.issn1611-2482
dc.identifier.olddbid201170
dc.identifier.oldhandle10024/184197
dc.identifier.urihttps://www.utupub.fi/handle/11111/47755
dc.identifier.urlhttps://doi.org/10.1007/978-3-031-17258-8_27
dc.identifier.urnURN:NBN:fi-fe2022112967709
dc.language.isoen
dc.okm.affiliatedauthorKanerva, Jenna
dc.okm.affiliatedauthorGinter, Filip
dc.okm.affiliatedauthorChang, Li-Hsin
dc.okm.affiliatedauthorSkantsi, Valtteri
dc.okm.affiliatedauthorKilpeläinen, Jemina
dc.okm.affiliatedauthorKupari, Hanna-Mari
dc.okm.affiliatedauthorPiirto, Aurora
dc.okm.affiliatedauthorSaarni, Jenna
dc.okm.affiliatedauthorSevón, Maija
dc.okm.affiliatedauthorTarkka, Otto
dc.okm.discipline113 Computer and information sciencesen_GB
dc.okm.discipline6121 Languagesen_GB
dc.okm.discipline113 Tietojenkäsittely ja informaatiotieteetfi_FI
dc.okm.discipline6121 Kielitieteetfi_FI
dc.okm.internationalcopublicationnot an international co-publication
dc.okm.internationalityInternational publication
dc.okm.typeA3 Book
dc.publisherSpringer
dc.publisher.countrySwitzerlanden_GB
dc.publisher.countrySveitsifi_FI
dc.publisher.country-codeCH
dc.publisher.isbn978-81-322;978-3-540;978-3-642;978-3-662;978-3-7908;978-3-8274;978-3-8347;978-90-481;978-94-007;978-94-009;978-94-010;978-94-011;978-94-015;978-94-017;978-94-024;978-0-387;978-0-8176;978-1-4419;978-1-4612;978-1-4613;978-1-4614;978-1-4615;978-1-4684;978-1-4757;978-1-4899;978-1-4939;978-1-5041;978-3-319;978-1-4020;978-0-85729;978-1-4471;978-1-84628;978-1-84800;978-1-84882;978-1-84996;978-1-85233;978-3-211;978-3-7091;978-4-431;978-3-322;978-3-409;978-3-531;978-3-658;978-3-663;978-3-8100;978-981-287;978-981-10;978-981-13;978-3-030;978-981-32;978-981-15;978-981-16;978-981-329;978-981-334;978-981-336;978-3-031;978-981-19;
dc.relation.doi10.1007/978-3-031-17258-8_27
dc.relation.ispartofseriesCognitive Technologies
dc.source.identifierhttps://www.utupub.fi/handle/10024/184197
dc.titleTextual Paraphrase Dataset for Deep Language Modelling
dc.title.bookEuropean Language Grid: A Language Technology Platform for Multilingual Europe
dc.year.issued2022

Files

Name: 978-3-031-17258-8_27.pdf
Size: 344.27 KB
Format: Adobe Portable Document Format