Building Question-Answer Data Using Web Register Identification

dc.contributor.author: Eskelinen Anni
dc.contributor.author: Myntti Amanda
dc.contributor.author: Henriksson Erik
dc.contributor.author: Pyysalo Sampo
dc.contributor.author: Laippala Veronika
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization-code: 1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 404724872
dc.converis.url: https://research.utu.fi/converis/portal/Publication/404724872
dc.date.accessioned: 2025-08-28T00:47:52Z
dc.date.available: 2025-08-28T00:47:52Z
dc.description.abstract: This article introduces a resource-efficient method for developing question-answer (QA) datasets by extracting QA pairs from web-scale data using machine learning (ML). Our method benefits from recent advances in web register (genre) identification and consists of two ML steps with an additional post-processing step. First, using XLM-R and the multilingual CORE web register corpus series with categories such as QA Forum, we train a multilingual classifier to retrieve documents that are likely to contain QA pairs from web-scale data. Second, we develop a NER-style token classifier to identify the QA text spans within these documents. To this end, we experiment with training on a semi-synthetic dataset built on top of the English LFQA, a small set of manually cleaned web QA pairs in English and Finnish, and a Finnish web QA pair dataset cleaned using ChatGPT. The evaluation of our pipeline demonstrates its capability to efficiently retrieve a substantial volume of QA pairs. While the approach is adaptable to any language given the availability of language models and extensive web data, we showcase its efficiency in English and Finnish, developing the first open, non-synthetic and non-machine translated QA dataset for Finnish – Turku WebQA – comprising over 200,000 QA pairs.
dc.format.pagerange: 2595
dc.format.pagerange: 2611
dc.identifier.eisbn: 978-2-493814-10-4
dc.identifier.issn: 2522-2686
dc.identifier.jour-issn: 2522-2686
dc.identifier.olddbid: 206433
dc.identifier.oldhandle: 10024/189460
dc.identifier.uri: https://www.utupub.fi/handle/11111/45912
dc.identifier.url: https://aclanthology.org/2024.lrec-main.234.pdf
dc.identifier.urn: URN:NBN:fi-fe2025082791251
dc.language.iso: en
dc.okm.affiliatedauthor: Eskelinen, Anni
dc.okm.affiliatedauthor: Myntti, Amanda
dc.okm.affiliatedauthor: Henriksson, Erik
dc.okm.affiliatedauthor: Pyysalo, Sampo
dc.okm.affiliatedauthor: Laippala, Veronika
dc.okm.discipline: 6121 Languages (en_GB)
dc.okm.discipline: 6121 Kielitieteet (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Italy (en_GB)
dc.publisher.country: Italia (fi_FI)
dc.publisher.country-code: IT
dc.relation.conference: Language Resources and Evaluation
dc.relation.ispartofjournal: LREC Proceedings
dc.relation.ispartofseries: LREC Proceedings
dc.source.identifier: https://www.utupub.fi/handle/10024/189460
dc.title: Building Question-Answer Data Using Web Register Identification
dc.title.book: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
dc.year.issued: 2024

Files

Name: 2024.lrec-main.234.pdf
Size: 348.2 KB
Format: Adobe Portable Document Format