A Quest for Paradigm Coverage: The Story of Nen

Muradoğlu Saliha; Suominen Hanna; Evans Nicholas

dc.contributor.author	Muradoğlu Saliha
dc.contributor.author	Suominen Hanna
dc.contributor.author	Evans Nicholas
dc.contributor.organization	fi=tietotekniikan laitos|en=Department of Computing|
dc.contributor.organization-code	1.2.246.10.2458963.20.85312822902
dc.converis.publication-id	387427538
dc.converis.url	https://research.utu.fi/converis/portal/Publication/387427538
dc.date.accessioned	2025-08-28T01:12:09Z
dc.date.available	2025-08-28T01:12:09Z
dc.description.abstract	Language documentation aims to collect a representative corpus of the language. Nevertheless, the question of how to quantify the comprehensiveness of the collection persists. We propose leveraging computational modelling to provide a supplementary metric to address this question in a low-resource language setting. We apply our proposed methods to the Papuan language Nen, which is actively being described and documented. Given the enormity of the task of language documentation, we focus on one subdomain, namely Nen verbal morphology. This study examines four verb types: copula, positional, middle, and transitive. We propose model-based paradigm generation for each verb type as a new way to measure completeness, where accuracy is analogous to the coverage of the paradigm. We contrast the paradigm attestation within the corpus (constructed from fieldwork data) with the accuracy of the paradigm generated by Transformer models trained for inflection. This analysis is extended by extrapolating from the established learning curve to predict the quantity of data required to generate a complete paradigm correctly. We also explore the correlation between high-frequency morphosyntactic features and model accuracy. High-frequency feature combinations tend to correlate positively with model accuracy, but not consistently: some low-frequency morphosyntactic features also reach high accuracy. Our results show that model coverage is significantly higher for the middle and transitive verbs but not the positional verb. This is an interesting finding, as the positional verb paradigm is the smallest of the four.
dc.format.pagerange	85
dc.identifier.isbn	978-1-959429-60-9
dc.identifier.olddbid	207189
dc.identifier.oldhandle	10024/190216
dc.identifier.uri	https://www.utupub.fi/handle/11111/50818
dc.identifier.url	https://aclanthology.org/2023.fieldmatters-1.9/
dc.identifier.urn	URN:NBN:fi-fe2025082791537
dc.language.iso	en
dc.okm.affiliatedauthor	Suominen, Hanna
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.discipline	113 Tietojenkäsittely ja informaatiotieteet	fi_FI
dc.okm.internationalcopublication	international co-publication
dc.okm.internationality	International publication
dc.okm.type	A4 Conference Article
dc.publisher.country	United States	en_GB
dc.publisher.country	Yhdysvallat (USA)	fi_FI
dc.publisher.country-code	US
dc.relation.conference	Workshop on NLP Applications to Field Linguistics
dc.relation.doi	10.18653/v1/2023.fieldmatters-1.9
dc.source.identifier	https://www.utupub.fi/handle/10024/190216
dc.title	A Quest for Paradigm Coverage: The Story of Nen
dc.title.book	Proceedings of the Second Workshop on NLP Applications to Field Linguistics
dc.year.issued	2023

Files

Name: 2023.fieldmatters-1.9(2).pdf
Size: 1.08 MB
Format: Adobe Portable Document Format

Collections

Self-archived publications