Building Question-Answer Data Using Web Register Identification

Eskelinen, Anni

dc.contributor.author	Eskelinen, Anni
dc.contributor.organization	fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization	fi=digitaalinen kielentutkimus, espanja, italia, kiina, ranska, saksa|en=Digital Language Studies, Chinese, French, German, Italian, Spanish|
dc.contributor.organization-code	1.2.246.10.2458963.20.36764574459
dc.contributor.organization-code	1.2.246.10.2458963.20.68940835793
dc.converis.publication-id	404724872
dc.converis.url	https://research.utu.fi/converis/portal/Publication/404724872
dc.date.accessioned	2025-08-28T00:47:52Z
dc.date.available	2025-08-28T00:47:52Z
dc.description.abstract	This article introduces a resource-efficient method for developing question-answer (QA) datasets by extracting QA pairs from web-scale data using machine learning (ML). Our method benefits from recent advances in web register (genre) identification and consists of two ML steps with an additional post-processing step. First, using XLM-R and the multilingual CORE web register corpus series with categories such as QA Forum, we train a multilingual classifier to retrieve documents that are likely to contain QA pairs from web-scale data. Second, we develop a NER-style token classifier to identify the QA text spans within these documents. To this end, we experiment with training on a semi-synthetic dataset built on top of the English LFQA, a small set of manually cleaned web QA pairs in English and Finnish, and a Finnish web QA pair dataset cleaned using ChatGPT. The evaluation of our pipeline demonstrates its capability to efficiently retrieve a substantial volume of QA pairs. While the approach is adaptable to any language given the availability of language models and extensive web data, we showcase its efficiency in English and Finnish, developing the first open, non-synthetic and non-machine translated QA dataset for Finnish – Turku WebQA – comprising over 200,000 QA pairs.
dc.format.pagerange	2611
dc.identifier.eisbn	978-2-493814-10-4
dc.identifier.issn	2522-2686
dc.identifier.jour-issn	2522-2686
dc.identifier.olddbid	206433
dc.identifier.oldhandle	10024/189460
dc.identifier.uri	https://www.utupub.fi/handle/11111/45912
dc.identifier.url	https://aclanthology.org/2024.lrec-main.234.pdf
dc.identifier.urn	URN:NBN:fi-fe2025082791251
dc.language.iso	en
dc.okm.affiliatedauthor	Eskelinen, Anni
dc.okm.affiliatedauthor	Myntti, Amanda
dc.okm.affiliatedauthor	Henriksson, Erik
dc.okm.affiliatedauthor	Pyysalo, Sampo
dc.okm.affiliatedauthor	Laippala, Veronika
dc.okm.discipline	6121 Languages	en_GB
dc.okm.discipline	6121 Kielitieteet	fi_FI
dc.okm.internationalcopublication	not an international co-publication
dc.okm.internationality	International publication
dc.okm.type	A4 Conference Article
dc.publisher.country	Italy	en_GB
dc.publisher.country	Italia	fi_FI
dc.publisher.country-code	IT
dc.relation.conference	Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)
dc.relation.ispartofjournal	LREC Proceedings
dc.relation.ispartofseries	LREC Proceedings
dc.source.identifier	https://www.utupub.fi/handle/10024/189460
dc.title	Building Question-Answer Data Using Web Register Identification
dc.title.book	Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
dc.year.issued	2024

Files

Name: 2024.lrec-main.234.pdf
Size: 348.2 KB
Format: Adobe Portable Document Format

Collections

Self-archived publications