Semantic search as extractive paraphrase span detection

Kanerva, Jenna

dc.contributor.author	Kanerva, Jenna
dc.contributor.organization	fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code	1.2.246.10.2458963.20.68940835793
dc.contributor.organization-code	2610301
dc.converis.publication-id	386822908
dc.converis.url	https://research.utu.fi/converis/portal/Publication/386822908
dc.date.accessioned	2025-08-27T23:59:09Z
dc.date.available	2025-08-27T23:59:09Z
dc.description.abstract	In this paper, we approach the problem of semantic search by introducing the task of paraphrase span detection: given a segment of text as a query phrase, the task is to identify its paraphrase in a given document. This is the same modelling setup as is typically used in extractive question answering. While current work on paraphrasing has focused almost exclusively on sentence-level approaches, the novel span detection approach makes it possible to retrieve a segment of arbitrary length. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs, which include their original document context, our paraphrase span detection approach achieves an exact match of 88.73 and outperforms widely adopted sentence-level retrieval baselines (lexical similarity as well as BERT and SBERT sentence embeddings) by more than 20 pp in terms of exact match and 11 pp in terms of token-level F-score. This demonstrates a strong advantage of modelling paraphrase retrieval as span extraction rather than as the commonly used sentence similarity, since sentence-level approaches are clearly suboptimal for applications where the retrieval targets are not guaranteed to be full sentences. Even when the evaluation is limited to sentence-level retrieval targets only, the span detection model still outperforms the sentence-level baselines by more than 4 pp in terms of exact match and almost 6 pp in terms of F-score. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available. © 2024, The Author(s).
dc.identifier.eissn	1574-0218
dc.identifier.jour-issn	1574-020X
dc.identifier.olddbid	204979
dc.identifier.oldhandle	10024/188006
dc.identifier.uri	https://www.utupub.fi/handle/11111/53724
dc.identifier.url	https://doi.org/10.1007/s10579-023-09715-7
dc.identifier.urn	URN:NBN:fi-fe2025082786640
dc.language.iso	en
dc.okm.affiliatedauthor	Kanerva, Jenna
dc.okm.affiliatedauthor	Kitti, Hanna
dc.okm.affiliatedauthor	Chang, Li-Hsin
dc.okm.affiliatedauthor	Ginter, Filip
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.discipline	113 Tietojenkäsittely ja informaatiotieteet	fi_FI
dc.okm.internationalcopublication	not an international co-publication
dc.okm.internationality	International publication
dc.okm.type	A1 ScientificArticle
dc.publisher	Springer Science and Business Media B.V.
dc.publisher.country	Netherlands	en_GB
dc.publisher.country	Alankomaat	fi_FI
dc.publisher.country-code	NL
dc.relation.doi	10.1007/s10579-023-09715-7
dc.relation.ispartofjournal	Language Resources and Evaluation
dc.source.identifier	https://www.utupub.fi/handle/10024/188006
dc.title	Semantic search as extractive paraphrase span detection
dc.year.issued	2024

Files

Name: s10579-023-09715-7.pdf
Size: 1.44 MB
Format: Adobe Portable Document Format

Collections

Self-archived publications