An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora

dc.contributor.author Hans Moen
dc.contributor.author Laura-Maria Peltonen
dc.contributor.author Henry Suhonen
dc.contributor.author Hanna-Maria Matinolli
dc.contributor.author Riitta Mieronkoski
dc.contributor.author Kirsi Telen
dc.contributor.author Kirsi Terho
dc.contributor.author Tapio Salakoski
dc.contributor.author Sanna Salanterä
dc.contributor.organization fi=hoitotieteen laitos|en=Department of Nursing Science|
dc.contributor.organization fi=kieli- ja puheteknologia|en=Language and Speech Technology|
dc.contributor.organization fi=tyks, vsshp|en=tyks, varha|
dc.contributor.organization-code 1.2.246.10.2458963.20.27201741504
dc.contributor.organization-code 1.2.246.10.2458963.20.47465613983
dc.contributor.organization-code 2607400
dc.converis.publication-id 44203057
dc.converis.url https://research.utu.fi/converis/portal/Publication/44203057
dc.date.accessioned 2022-10-28T14:37:05Z
dc.date.available 2022-10-28T14:37:05Z
dc.description.abstract We present our work towards developing a system that should find, in a large text corpus, contiguous phrases expressing a meaning similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised, and uses a combination of a semantic word n-gram model, a statistical language model, and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different order. As data we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that the use of distributional semantic models trained with uni-, bi-, and trigrams seems to work better than a more traditional unigram model.
dc.format.pagerange 131
dc.format.pagerange 139
dc.identifier.isbn 978-91-7929-995-8
dc.identifier.issn 1650-3686
dc.identifier.jour-issn 1650-3686
dc.identifier.olddbid 189298
dc.identifier.oldhandle 10024/172392
dc.identifier.uri https://www.utupub.fi/handle/11111/44347
dc.identifier.url https://www.aclweb.org/anthology/W19-6114/
dc.identifier.urn URN:NBN:fi-fe2021042827307
dc.language.iso en
dc.okm.affiliatedauthor Moen, Hans
dc.okm.affiliatedauthor Peltonen, Laura-Maria
dc.okm.affiliatedauthor Suhonen, Henry
dc.okm.affiliatedauthor Matinolli, Hanna-Maria
dc.okm.affiliatedauthor Rosio, Riitta
dc.okm.affiliatedauthor Telen, Kirsi
dc.okm.affiliatedauthor Terho, Kirsi
dc.okm.affiliatedauthor Salakoski, Tapio
dc.okm.affiliatedauthor Salanterä, Sanna
dc.okm.affiliatedauthor Dataimport, tyks, vsshp
dc.okm.discipline 113 Computer and information sciences en_GB
dc.okm.discipline 113 Tietojenkäsittely ja informaatiotieteet fi_FI
dc.okm.internationalcopublication not an international co-publication
dc.okm.internationality International publication
dc.okm.type A4 Conference Article
dc.publisher.country Sweden en_GB
dc.publisher.country Ruotsi fi_FI
dc.publisher.country-code SE
dc.relation.conference Nordic Conference on Computational Linguistics
dc.relation.ispartofjournal Linköping Electronic Conference Proceedings
dc.relation.ispartofseries NEALT Proceedings Series
dc.relation.volume 42
dc.source.identifier https://www.utupub.fi/handle/10024/172392
dc.title An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora
dc.title.book Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa)
dc.year.issued 2019

Files

Name:
W19-6114.pdf
Size:
163.52 KB
Format:
Adobe Portable Document Format
Description:
Publisher's PDF