This is a self-archived, parallel-published version of an original article. This version may differ from the original in pagination and typographic details. When using, please cite the original.

AUTHOR: Saliha Muradoglu, Hanna Suominen, and Nicholas Evans
TITLE: A Quest for Paradigm Coverage: The Story of Nen
YEAR: 2023
DOI: 10.18653/v1/2023.fieldmatters-1.9
VERSION: Publisher’s PDF
CITATION: Saliha Muradoglu, Hanna Suominen, and Nicholas Evans. 2023. A Quest for Paradigm Coverage: The Story of Nen. In Proceedings of the Second Workshop on NLP Applications to Field Linguistics, pages 74–85, Dubrovnik, Croatia. Association for Computational Linguistics.
LICENSE: CC BY

Proceedings of the Second Workshop on NLP Applications to Field Linguistics, pages 74–85, May 6, 2023. ©2023 Association for Computational Linguistics

A Quest for Paradigm Coverage: The Story of Nen

Saliha Muradoğlu♣♠, Hanna Suominen♣♢, Nicholas Evans♣♠
♣The Australian National University (ANU)
♢University of Turku
♠ARC Centre of Excellence for the Dynamics of Language (CoEDL)
Firstname.Lastname@anu.edu.au

Abstract

Language documentation aims to collect a representative corpus of the language. Nevertheless, the question of how to quantify the comprehensiveness of the collection persists. We propose leveraging computational modelling to provide a supplementary metric to address this question in a low-resource language setting. We apply our proposed methods to the Papuan language Nen, which is actively being described and documented. Given the enormity of the task of language documentation, we focus on one subdomain, namely Nen verbal morphology. This study examines four verb types: copula, positional, middle, and transitive. We propose model-based paradigm generation for each verb type as a new way to measure completeness, where accuracy is analogous to the coverage of the paradigm.
We contrast the paradigm attestation within the corpus (constructed from fieldwork data) with the accuracy of the paradigm generated by Transformer models trained for inflection. This analysis is extended by extrapolating from the learning curve established to provide predictions for the quantity of data required to generate a complete paradigm correctly. We also explore the correlation between high-frequency morphosyntactic features and model accuracy. We see a positive correlation between high-frequency feature combinations and model accuracy, but this is only sometimes the case: we also see high accuracy for low-frequency morphosyntactic features. Our results show that model coverage is significantly higher for the middle and transitive verbs but not the positional verb. This is an interesting finding, as the positional verb paradigm is the smallest of the four.

1 Introduction

A key question in studying language is: when do we have enough data to fully understand the system? This is especially important in language documentation. As Himmelmann (1998) states, ‘the aim of language documentation is to provide a comprehensive record of the linguistic practices characteristic of a given speech community.’ Bird (2015) extends this by asking, ‘If a comprehensive record is unattainable in principle, is there a consensus on what an adequate record looks like. How would you quantify it?’ Honouring their formulation, Baird et al. (2022) label this the ‘Himmelman-Bird’ problem.[1] In their paper, the authors strive to explore this Himmelman-Bird problem for the inventory of phonemes, which are the subdomain of language with the smallest and hence most frequently occurring units. They set the bar even lower by simply requiring that at least one allophone of each phoneme occur.
They then examine how much text it might take to capture a language’s entire phoneme inventory, drawing on a sample of 137 distinct languages, some with additional dialectal or register variety, taking the total to 158 speech varieties. Full ‘coverage’ is achieved, for a given domain of language (say, its phoneme inventory) and a given corpus, if there is at least one incidence of each relevant unit (in this case, each phoneme) in that corpus.

[1] This is akin to the problem of corpus representativity.

Here we strive to follow a similar route for morphemes and their respective allomorphs, while still posing the problem in its simplest and hence most easily satisfied form: we look just at verbs, and we restrict ourselves to one representative lexeme (the commonest) in each of the four main morphological classes (see below).

The goal of collecting a representative sample has permeated many fields, from biology to sociology. Researchers have explored the idea of having a gold standard process for collecting all required components to describe a system. For example, if we wanted to gather all the phonemes for English, the ‘Rainbow Passage’ by Fairbanks (1960) may be chosen. The first four lines of the passage capture all phonemes for English. In morphology, we can discuss the idea of collecting all principal parts (Finkel and Stump, 2007) to construct the entire paradigm.

This idea presents itself as a promising solution to the difficulty faced by low-resource languages and, more specifically, language documentation. However, one caveat is the system knowledge required for designing such a task. For example, how might a linguist know all the phonemes before beginning their in-field analysis and recordings? Accordingly, we make the distinction between heuristic and attestation coverage.
The first refers to the discovery stage of a language, leading to a sketching of the dimensions of its design space (the logical space of all its possibilities in a particular domain, such as verbal inflections) through discovering the dimensions where it encodes contrasts (say ‘dual number’, ‘future imperative’, ‘imperfect aspect’), and mapping out the ways these interact (say ‘future imperfective dual imperative’, as in Nen nandowabe ‘you two should be talking later on!’ (Evans, 2019)). The latter describes the scenario where a description exists, and the aim is to collect examples of language within the denoted design space.

The concept of a ‘whole language’ is so vast and heterogeneous that it is not operationally useful for many linguistic or practical purposes. To explore this question, we consider a particular component of language, inflectional morphology on the verb. We base our study on modelling morphological inflection in the Nen language and examine the attestation coverage observed in the transcribed natural spoken corpus and inflection models built on the same data.

In this paper, we address the following questions: (1) How can we test the degree to which a linguistic subsystem exhibits coverage in a given corpus? (2) How does the model coverage compare with the corpus? (3) Does corpus frequency relate to model accuracy? (4) Can we use model-based learning curves to predict the data required for complete coverage?

We propose a test case for the model that asks it to predict a complete paradigm, i.e. the complete multidimensional array of inflected forms. English is too morphologically impoverished to furnish a good example (the best is with the copula to be: {am, (art), is, are; was, were; (to) be; being}). Our results indicate that the generalisations afforded by the Transformer model yield better coverage than the natural corpus.
Furthermore, we explore two separate correlations of the high-dimensional axes of Nen verbs: the undergoer and agent combinations, and the agent and Tense, Aspect, and Mood (TAM) combinations. While frequent features tend to be captured correctly by the model, surprisingly, so are some low-frequency forms. Finally, we use learning curves to predict the data needed for 100% coverage.

2 Related Work

To our knowledge, only two prior computational studies of Nen exist. Muradoglu et al. (2020) present a finite-state description, while Muradoğlu et al. (2020) explore the use of neural architectures to model Nen verbal morphology. The latter is based on two high-performing submissions in the SIGMORPHON–CoNLL 2017 Shared Task (Cotterell et al., 2017). Between the two approaches, the finite-state description achieves a higher accuracy across the corpus. However, we note that the accuracies reported are not directly comparable given the ongoing development of the corpus.

Despite the performance difference, we opt to use a neural approach to enlist the aid of its generalising ability. Moreover, the statistical nature of these models makes their intersection with corpus linguistics an object of interest. Specifically, we use a Transformer-based model (Vaswani et al., 2017). Transformers have been successful in capturing complexities of phonological and morphological details (Pimentel et al., 2021; Kodner et al., 2022), often achieving state-of-the-art performance. Over the years, the inflection task has been extended to many languages, including other complex morphological systems such as Murrinh-Patha, Kunwinjku and Seneca.

3 The Nen Language

Nen is a Papuan language of the Morehead-Maro (or Yam) family (Evans, 2017). It is spoken as a native language in the village of Bimadbn in the Western Province of Papua New Guinea (Evans, 2015, 2019). Most Nen speakers are multilingual, typically speaking several of the neighbouring languages.
Verbs in Nen are notoriously complicated and are described as the most complicated word class in Nen (Evans, 2015, 2019). They can be grouped in several ways, either as prefixing and ambifixing or by further breaking down the inflection patterns. Prefixing verbs consist of the copula (and its derivatives ‘go’/‘come’/‘have’), ‘to walk’ and positional verbs. Another distinguishing feature of prefixing verbs is the lack of infinitives. Both ambifixing and middle verbs form infinitives through suffixing -s to the verb stem. In this study, we have listed the prefixing verb lemmas as the verb stem.

Ambifixing verbs can be separated into middle and transitive verbs. Here, we separate the verb types beyond the prefixing and ambifixing categories as the corresponding paradigms are distinct. We provide details for the verbs we track below.

3.1 Copula

The copula is a special case for our test, in that we test the generation of a partial paradigm, as the model would have seen several forms of the copula. We note that this verb, together with its directional counterparts ‘come’ and ‘go’, is the most frequent verb type in the corpus. The come/go paradigms are built using the copula with the addition of directional prefixes. The copula paradigm consists of 40 unique forms. See Evans (2014) for the full paradigm.

3.2 Positional

Verbs in the positional class fall into two main types: posture and position proper (Evans, 2015). For example, mängr ‘be lying in a jumble’ and érningr ‘be in hiding’, or spatial position in relation to some frame of reference, like pingr ‘to be high (typically inanimate)’. So far, 45 verbs have been recorded. Verbs of this class have special stative suffixes: -ngr (non-dual) and -aran (dual). They exhibit properties of prefixing verbs: they do not have infinitives and cannot form present imperatives (Evans, 2014).

3.3 Middle

Middle and transitive verbs have the same TAM paradigm.
Aside from valency, the distinction between the two is that the middle verbs have a dummy prefix with no semantic meaning other than to note that they are middle verbs. This prefix does not mark an argument like other verb types. In rare cases, middle verbs use the undergoer prefix slot to index large plurals. Example verbs of this type include owabs ‘to speak’ or anḡs ‘to return’. Both these verbs are ambifixing, but the prefixal slot is restricted to {n-} (α-series), {k-} (β-series), {g-} (γ-series).

3.4 Transitive

By contrast, transitive verbs utilize both prefixes and suffixes to mark person and number. Examples of this verb type include yis ‘to plant’ and waprs ‘to do’.

These verbs allow for full prefixing and suffixing possibilities. The prefix set is divided through the use of the same arbitrary labels α, β, and γ as the middle verbs. Instead of the middle verb marker, transitive verbs allow for person/number undergoer marking. These dummy indices do not carry specific semantic values until they are unified with other TAM markings on the verb.

Evans (2016) provides the canonical paradigms for the undergoer prefixes, thematics and desinences. Suffixes are constructed by combining the corresponding thematic and the desinence. The future imperative construction is a special case, where an additional future imperative prefix is required (Evans, 2015).

3.5 Directional

Following the undergoer prefixes, a directional prefix slot is available. This can be filled with {-n-} ‘towards’ or {-ng-} ‘away’, or left empty to convey a directionally neutral meaning. Consider the copula verb m ‘to be’: when marked for direction, the resultant forms are y-n-m ‘(s)he is coming (towards speaker)’ and y-ng-m ‘(s)he is going (away from speaker)’. Note the speaker-centric frame of reference.

4 Data

The Nen corpus is made up of 44 individual texts that were naturalistically recorded in the field.
This amounts to approximately 8 hours of spoken text or over 30,000 words. This is filtered to over 6,000 verb instances representing 2,282 forms. Some of these forms are the same, with different feature combinations due to syncretism or polysemy. For example, the sequence yn- can be parsed in two ways. It can either mean the prefix yn-, coding first person nonsingular undergoer for the α series, or y-n, the third singular undergoer with the ventive (towards) directional. Each of these instances is treated separately to expose the model to all possible meanings.

Figure 1: The coverage growth for four verb types in Nen, reported as a function of annotation units (within corpus), where ‘annotation units’ are audibly demarcated units in the flow of speech (typically by pause breaks). In our corpus, on average there is one verb per annotation unit, making annotation units a reasonable proxy of how often we would expect verbs to occur. The corpus accounts follow akingr ‘to be standing’ for the positional, owabs ‘to speak’ for the middle and räms ‘to do/give’ for the transitive. The confidence bands reported on the model results are calculated based on a 4-partition variance. The full Nen corpus currently consists of 6,446 annotation units. The starting point is 1,079 as this roughly corresponds to 382 (100 train + 282 dev) instances.

A large portion of the texts in the corpus are coconut interviews.[2] These typically involve so-called biographical questions (parent names, place of birth, etc.) and questions about coconut trees that belong to the interviewee. This type of text was chosen as it can include a variety of tenses (whether someone has planted or will plant a coconut tree) and is a topic that easily inspires conversation from locals. Although these do not constitute a genre in the traditional sense, they do exhibit characteristic features, such as a high token count of the verb yis ‘to plant’ and third person non-past copula ym.
The remaining texts range from anecdotal stories and folk tales to other narratives and procedural explanations.

[2] See Evans (2020) for more details.

5 Experiment

We contrast the corpus-based account of the Nen verbal paradigm with that modelled by a Transformer model (Wu et al., 2021). Our study is conducted in two parts: first, we follow the attestation coverage of the paradigm for one representative verb for each type in the corpus. Second, we train Transformer models to generate a complete paradigm for an unseen (barring the copula) verb for each type with incremental amounts of data. We establish a learning/coverage curve for each method (Anzanello and Fogliatto, 2011; Viering and Loog, 2022). We use the term coverage here to mean the percentage of cells observed in the corpus or correctly predicted by the models out of the entire language design space.

5.1 Corpus-based Account

Here we present a corpus account of paradigm coverage. For each of our four verb types, we follow the trajectory of the lexeme.[3] As it happens, the top three verbs by frequency are the copula (most frequent, at 80.46 IPT (items per thousand)[4]), the middle verb owabs ‘to speak’ (second most frequent lexeme in the corpus, 6.83 IPT) and the transitive verb räms ‘to do/give’ (third most frequent lexeme in the corpus, 6.46 IPT). We then have to descend some way down the frequency list before reaching our highest-frequency positional verb, namely akingr ‘to be standing’ (16th most frequent lexeme, 1.83 IPT).

[3] Where a lexeme is a ‘dictionary word’, i.e. the citation form of a word used in a dictionary, and uniting all its inflected forms. Thus the lexeme run unites the inflected forms run, runs, ran and running. In Nen the number of inflected forms per lexeme is much larger, as we shall see below.

[4] The more common metric is IPM (items per million), but given that the size of the Nen corpus is in the order of thousands, we report these figures in IPT.
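The two quantities used in this section, items per thousand and cumulative attestation coverage, can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the representation of an annotation unit as a list of paradigm-cell labels, and the toy data are all assumptions made for the example.

```python
from typing import Iterable, List, Sequence, Set


def items_per_thousand(lexeme_tokens: int, corpus_tokens: int) -> float:
    """Relative frequency of a lexeme, scaled to items per thousand (IPT)."""
    return 1000.0 * lexeme_tokens / corpus_tokens


def attestation_curve(
    units: Sequence[Iterable[str]], paradigm_cells: Set[str]
) -> List[float]:
    """Cumulative percentage of paradigm cells attested after each unit.

    `units` is the corpus in its fixed text order; each element lists the
    paradigm cells (hypothetical labels here) of the tracked lexeme that
    occur in that annotation unit.
    """
    seen: Set[str] = set()
    curve: List[float] = []
    for cells_in_unit in units:
        # Only cells that belong to the design space count towards coverage.
        seen.update(c for c in cells_in_unit if c in paradigm_cells)
        curve.append(100.0 * len(seen) / len(paradigm_cells))
    return curve


# Toy data: a four-cell paradigm tracked over five annotation units.
toy_cells = {"A", "B", "C", "D"}
toy_units = [[], ["A"], ["A", "B"], ["X"], ["D"]]
toy_curve = attestation_curve(toy_units, toy_cells)
```

Because the curve only records first attestations, repeated forms (as with the high-frequency copula) leave it flat, which is why corpus coverage grows slowly for large paradigms.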
For our four verbs, we then collate all distinct forms of the verb in question, tracking where in the corpus each is encountered. For example, for the verb akingr, the first form yakingr is encountered at the 223rd annotation unit, the second, ynakiaran, at the 242nd, and so on. The texts within the corpus are concatenated, and the same order of the texts is preserved for each analysis.

The copula verb m is included in both training and test since it makes up a large portion of the existing corpus and accounts for the five most frequent forms. It is the most frequent lexeme (80.46 IPT). This scenario can be seen as a more straightforward case, as 62.5% of the copula paradigm (without the directional prefix) is attested in the complete 2,000-instance training data. So the model needs to reproduce these forms with the directional prefixes. The remaining three verb types are not encountered at training time, barring the stem.

5.2 Model-based Account

We train models as in the inflection task of the SIGMORPHON shared tasks (Kodner et al., 2022), with tags identifying morphosyntactic categories. The system is asked to produce the inflected form given the lemma and morphosyntactic tags. For example, ⟨owabs, V;IPFV.NPHD;1SGA;M;α, nowabtan⟩ or the English equivalent ⟨talk, V;V.PTCP;PRS[5], talking⟩.

We additionally account for the copy bias reported by Liu and Hulden (2022) by including the three[6] (see Section 5.2.2 for details) lemmas considered during test time in the training set.

Each model is trained using a character-level Transformer (Wu et al., 2021). This model has been used as the neural baseline for the SIGMORPHON shared task on morphological inflection.[7]

We train models based on a Zipfian sampling strategy, as corpora obey Zipf’s law at all sample sizes (Baayen, 2001; Blevins et al., 2017).
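The source does not spell out how the Zipfian sample is drawn, so the following is only one plausible reading: hold out the rarest forms as a fixed dev set, then draw training instances without replacement with probability proportional to corpus frequency. The function names, the weighted-key sampling method (Efraimidis and Spirakis style) and the toy counts are assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple


def split_dev_by_rarity(
    freqs: Dict[str, int], dev_size: int
) -> Tuple[List[str], Dict[str, int]]:
    """Hold out the `dev_size` least frequent forms; the rest form the pool."""
    ranked = sorted(freqs, key=lambda form: (freqs[form], form))
    dev = ranked[:dev_size]
    pool = {form: freqs[form] for form in ranked[dev_size:]}
    return dev, pool


def frequency_weighted_sample(
    pool: Dict[str, int], k: int, seed: int = 0
) -> List[str]:
    """Draw k distinct forms, favouring frequent ones.

    Each form gets the key u ** (1 / count) with u uniform in [0, 1);
    keeping the k largest keys yields a weighted sample without replacement.
    """
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / count), form)
             for form, count in pool.items() if count > 0]
    keyed.sort(reverse=True)
    return [form for _, form in keyed[:k]]


# Toy counts (not corpus figures): one very frequent form and a long tail.
toy_freqs = {"ym": 400, "yis": 60, "nowabtan": 12, "yakingr": 3,
             "ynakiaran": 2, "epingr": 1, "yakatan": 1}
toy_dev, toy_pool = split_dev_by_rarity(toy_freqs, dev_size=2)
toy_train = frequency_weighted_sample(toy_pool, k=3, seed=1)
```

Growing `k` in fixed steps while keeping the dev set unchanged mirrors the incremental training sizes used for the learning curve; a fixed seed keeps the samples reproducible across runs.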
The dev set is determined as the least frequent 282 forms and is kept the same for every experiment. The distribution is calculated from the existing corpus study (Muradoğlu, 2017). We train at intervals of 100 training samples, ranging from 100 to 2,000 instances.

[5] Present participle.

[6] Since the model is already exposed to the copula during training time, it does not need to be included again.

[7] Model parameters follow Wu et al. (2021).

Prior work has explored the difference between random and Zipfian sampling. For example, Muradoğlu et al. (2020) examined the difference and reported that random selection yielded better results (or a faster coverage rate). However, given our research question, what random sampling means for language documentation is unclear. With many of the corpora assembled by field linguists built upon a combination of standard field method practices and anthropological story gathering, the type of data collected is hardly random. As such, the model results presented in this paper are based on Zipfian sampling.

5.2.1 Design of Test

We propose a modified test case to measure paradigm coverage of the model. A lexeme is chosen for each verb type and tested for each cell or unique morphosyntactic description (MSD).

The choice of lexeme is motivated by how regular its inflection is, given its particular phonotactics. With the purpose of testing generalisability, it follows that our case study verbs are regular. We note the limitations of this approach, namely the variation of morphs across certain phonological properties of the stem (e.g., vowel harmony).

Given resource and access limitations, we have utilised the finite-state grammar for Nen (Muradoglu et al., 2020) to generate full paradigms for the positional and transitive verbs; these paradigms are later examined by a language expert. The middle verb test is based on a full paradigm that was previously verified with Nen speakers.
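The test design above can be sketched as a function that queries a predictor once per paradigm cell and reports the percentage of cells produced correctly. This is a hedged illustration: `suffix_predict` is a deliberately naive stand-in for a trained inflection model, and the English paradigms and tag strings other than V;V.PTCP;PRS are made up for the example.

```python
from typing import Callable, Dict


def paradigm_coverage(
    lemma: str,
    gold: Dict[str, str],
    predict: Callable[[str, str], str],
) -> float:
    """Percentage of paradigm cells (MSDs) whose predicted form is correct."""
    correct = sum(1 for msd, form in gold.items()
                  if predict(lemma, msd) == form)
    return 100.0 * correct / len(gold)


# Hypothetical stand-in for a model: attach a fixed suffix per MSD.
TOY_SUFFIXES = {"V;NFIN": "", "V;3;SG;PRS": "s",
                "V;PST": "ed", "V;V.PTCP;PRS": "ing"}


def suffix_predict(lemma: str, msd: str) -> str:
    return lemma + TOY_SUFFIXES[msd]


# Gold paradigms: 'talk' is fully regular, 'run' is not.
TALK = {"V;NFIN": "talk", "V;3;SG;PRS": "talks",
        "V;PST": "talked", "V;V.PTCP;PRS": "talking"}
RUN = {"V;NFIN": "run", "V;3;SG;PRS": "runs",
       "V;PST": "ran", "V;V.PTCP;PRS": "running"}

talk_coverage = paradigm_coverage("talk", TALK, suffix_predict)
run_coverage = paradigm_coverage("run", RUN, suffix_predict)
```

The contrast between the two toy lexemes illustrates why a regular test verb is chosen: a system that has only learned the general pattern covers the regular paradigm fully but misses cells that depend on stem-specific alternations.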
The full copula paradigm and its directional variants are sourced from the forthcoming grammar of Nen.

In a sense, our suggested test for coverage is similar to the wug test in the SIGMORPHON shared tasks (Kodner et al., 2022), but rather than general production processes of nonce words, we are interested in generating complete paradigms.

Figure 2: Bubble plot showcasing the frequency correlation between TAM and agent person number; reported numbers are the percentage of the corpus with the TAM/agent features. Navy lines indicate available cells described by the language design space. Note that the second and third persons typically display syncretism except in the perfective past. See Appendix A for details on TAM categories. The darker the colour (towards a blood orange), the more proficiency the model displays. Conversely, the lighter the colour (orange), the more the model struggles to produce a correct form with the corresponding features.

5.2.2 Meet the Verbs

m ‘to be’
The copula paradigm consists of 40 unique forms. The come/go paradigms are built using the copula with the addition of directional prefixes.

pingr (n-du)/piaran (du) ‘to be high/elevated’
Depending on the vowel of the stem (‘i’ in this case), the 2|3nsg prefix is e-, e.g., epingr ‘you two/they two are up high’.

armbs ‘to climb’
As with all middle verbs, armbs begins with a vowel. It is somewhat similar to the most common middle verb in the corpus, owabs ‘to speak’, with a shared b before the infinitive marker -s. In addition to exhibiting regular inflection, the forms have been verified by native Nen speakers.

wambaes ‘to sniff’
There are a few key points to note for this verb. When verb infinitives end with a diphthong (e.g.
ae) before the final s, the diphthong is shortened in the non-dual (e.g., wakaes ‘to look at’ but yakatan ‘I look at him/her’), but in the dual the full diphthong is present, along with a dual-marking -w- which only occurs in such environments, e.g., yawakataewn ‘I look at the two of them’, yakataewm ‘we two look at him/her’.

The most notable verb that is similar in phonological structure is wakaes ‘to see’. The corpus contains 36 unique forms for wakaes.

6 Results and Discussion

A full paradigm for one verb is unlikely to be encountered in natural speech or language learning contexts (Chan, 2008; Blevins and Blevins, 2009). Although the focus of this paper is not language learning, the sparsity of paradigm coverage observed in these contexts is equally relevant here. Based on various well-known corpora, Chan (2008) shows that languages with larger verbal paradigms exhibit lower coverage. Most notably, the only language with full coverage of its verbal paradigm is English, which only has six verbal forms. By contrast, Finnish has 365 verb forms and only a 40.3% saturation, even though the corpus size is almost double that of the English counterpart (2.1 million words compared to the Brown corpus of 1.2 million words).

Figure 3: Relativised bubble plot of Actor and Undergoer person number for Nen. The navy blue blocks mark the semantically disallowed combinations or, in the case of first person acting on first person, meanings that are achieved through reflexive constructions. The darker the colour (towards a purple), the more accurate the model is. Conversely, the lighter the colour (lavender), the more the model struggles to produce a correct form with the corresponding features.

Muradoğlu (2017) reports on the bleak data requirements to record each cell of the transitive verb in Nen. Here we have utilised the power of Transformer models to leverage abstraction and statistical learning. Figure 1 shows that the model based
on the corpus does significantly better in terms of coverage. This suggests that while each combination might not be present in the corpus, the relevant information is. This typically parallels a mechanism utilised by field linguists to bootstrap the mapping of a linguistic paradigm, since going through a complete paradigm for one particular verb is implausible. Instead, the circumstantial context primes language informants to showcase verbs of different semantic domains. The field linguist typically obtains part of the paradigm (either through elicitation or by natural means) for each verb. These fragments likely allow for a reconstruction of the entire paradigm. Dimensional independence allows the linguist to fill out parts of the paradigm. This task has been described as the paradigm cell filling problem (PCFP; Ackerman et al., 2009; Silfverberg and Hulden, 2018; Liu and Hulden, 2020).

Figure 1 shows the paradigm coverage across the four verb types in question. We contrast model-based coverage with a corpus-based account. In both instances, we follow the trajectory of one representative verb. For the model, the four test verbs are detailed in Section 5.2.2. The corpus coverage curve follows akingr ‘to be standing’ for the positional, owabs ‘to speak’ for the middle, and räms ‘to do/give’ for the transitive verb. The model and the corpus both follow m ‘to be’, since the copula verb is one entity.

The most noticeable behaviour shown in Figure 1 is the fluctuation across models trained with different training sizes. Although, in general, the growth is positive, we see a significant difference across each step. One explanation might be the skew within the samples added. In other words, the added examples negatively influence the generalisations built by the model. Another might be the model's sensitivity to initial training data and data order.
To account for the statistical variation, we report confidence bands for each verb type by measuring the variation in accuracy, dividing the test case for each verb into four random partitions. The partitions are randomly sampled as the test file is constructed in paradigmatic order. If the partitioning is performed sequentially, we might observe bias in one part of the paradigm, yielding large error margins.

Table 1: Extrapolated values based on the learning curve for both corpus and model-based coverage. The corpus’s training size has been omitted as it does not bear any particular meaning. The numbers presented are rounded to the nearest thousand.

             Corpus                          Model
             Annotation units  # of words    Training size  Annotation units  # of words
All          –                 –             198,000        560,000           2,610,000
Transitive   154,000           716,000       34,000         97,000            451,000
Middle       44,000            205,000       4,000          12,000            55,000
Positional   40,000            188,000       3,000          10,000            45,000
Copula       11,000            53,000        3,000          10,000            46,000

The model shows greater coverage for the transitive, middle and copula verb types than the corpus account. Interestingly, the growth curve shows that the model-based account for positional verbs does worse than the corpus account. This is because the learning curve for the positional verb fluctuates substantially. The best-performing model for positional verbs is obtained with only 900 training examples (or 3,339 annotation units) at 16.5% coverage, compared with the corpus account of akingr at 9% across the whole corpus. Given that the paradigm of the positional verb is the smallest among the four, we would have expected coverage to be high. A possible explanation for this might be that there are few instances of positional verbs in the corpus (26 distinct forms across seven lexemes) and, thus, in the training set. We also observe looping errors as described in Shcherbakov et al. (2020), particularly for training sets below 1,000 instances.
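The four-partition confidence band described at the start of this passage can be sketched as follows. The source only states that the band is based on a 4-partition variance, so reporting the mean and the sample standard deviation of the per-partition accuracies is an assumption, as are the function name and the toy outcomes.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def partition_accuracy_band(
    outcomes: Sequence[bool], n_parts: int = 4, seed: int = 0
) -> Tuple[float, float]:
    """Mean and spread of accuracy over random partitions of one test set.

    `outcomes` holds one True/False per paradigm cell, in paradigmatic
    order. Shuffling first avoids each partition covering only one region
    of the paradigm, which is the bias the text warns about.
    """
    shuffled = list(outcomes)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled cells round-robin into n_parts partitions.
    parts: List[List[bool]] = [shuffled[i::n_parts] for i in range(n_parts)]
    accuracies = [100.0 * sum(part) / len(part) for part in parts]
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Toy outcomes: the first half of the paradigm is right, the second wrong.
toy_outcomes = [True] * 8 + [False] * 8
toy_mean, toy_spread = partition_accuracy_band(toy_outcomes)
```

With the toy outcomes above, a sequential split would give partitions at 100%, 100%, 0% and 0%, which is the large error margin that random partitioning is meant to reduce.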
We describe the coverage growth relative to annotation units to fully capture the data requirements for paradigm representation. The texts are segmented into annotation units to retain some of the contextual information surrounding the verb in question. These units are typically one complete sentence and most commonly correspond to a segment in ELAN (Sloetjes and Wittenburg, 2008). On average, there are 4.7 words per intonation unit, one of which is usually a verb. With 6,446 annotation units across the corpus, on average, for every 2.88 units, there is a distinct form encountered.

The model paradigm coverage is contrasted with that from the Nen spoken corpus. We make a point of situating the required data size for training the model (i.e., train + dev) in units that relate to the corpus, to help highlight the distillation process. Typically, the model training size is measured in the number of instances. However, when collating a data set for a specific natural language processing (NLP) task, such as morphological inflection, the corpus is filtered from total words (assuming transcription exists) and later further distilled from tokens to types.

To address our third question, we analyse the frequency of the verb features along the TAM/Actor and Actor/Undergoer dimensions. We expect a strong correlation between highly frequent features in the corpus and the model accuracy for that slot. Figures 2 and 3 show the frequency of feature bundles. In both figures, the size of the bubbles corresponds to the frequency of the two sets of features in question (TAM and Actor, or Actor and Undergoer). The saturation of the bubble shows how successful the model is in capturing the particular feature combination. The darker the bubble, the more likely the model will produce the correct corresponding form. These results are based on the model trained with the entire training set available (2,000 instances).
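The two quantities plotted in Figures 2 and 3, the corpus share of a feature bundle (bubble size) and the model accuracy on it (bubble saturation), can be tabulated as sketched below. The function name, the tuple encoding of a bundle and the toy data are assumptions; the labels are illustrative and not taken from the Nen tag set.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def bundle_frequency_and_accuracy(
    corpus_bundles: Iterable[Hashable],
    test_results: Iterable[Tuple[Hashable, bool]],
) -> Dict[Hashable, Tuple[float, float]]:
    """Map each tested bundle to (% of corpus instances, % correct)."""
    counts = Counter(corpus_bundles)
    total = sum(counts.values())
    outcomes: Dict[Hashable, List[bool]] = defaultdict(list)
    for bundle, correct in test_results:
        outcomes[bundle].append(correct)
    return {
        bundle: (100.0 * counts[bundle] / total,          # bubble size
                 100.0 * sum(results) / len(results))     # bubble saturation
        for bundle, results in outcomes.items()
    }


# Toy data: one frequent bundle and one rare bundle (hypothetical labels).
FREQUENT = ("IPFV.NPST", "3SG")
RARE = ("IPFV.IMP", "2PL")
toy_corpus = [FREQUENT, FREQUENT, FREQUENT, RARE]
toy_results = [(FREQUENT, True), (FREQUENT, False), (RARE, True)]
toy_table = bundle_frequency_and_accuracy(toy_corpus, toy_results)
```

A table of this shape makes it straightforward to spot bundles that are rare in the corpus yet generated accurately, which is the pattern discussed for low-frequency feature combinations.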
As expected, both figures show a correlation between the bubble size (corpus frequency) and saturation (model accuracy). Nevertheless, there are cases where the corpus frequency is low, but the model proves to be proficient in producing the correct form. One such example is the imperfective imperative (ipfv.imp): the second person plural actor (which requires a prefix of the α series and the -tang suffix) makes up 0.29% of the training data, but the model produces the correct form more than 66% of the time. One explanation might lie in the rule’s complexity and in the fact that the chosen test verbs do not trigger allomorphic variants.

We note the morphophonological element of inflecting. While we have tried to choose regular verbs, they still exhibit a phonological layer. It is hard to disentangle such effects. One possible future direction would be to choose a list of verbs across the categories presented here which exhibit the full range of phonological phenomena observed in Nen, for example, verbs that might trigger vowel harmony and the consequent allomorphs.

We further our analysis by providing a predictive quantity of data needed to reach 100% accuracy. We utilise scipy-based (Virtanen et al., 2020) extrapolation by treating the resultant coverage curve as a learning curve. The predictions presented here are optimistic; to ensure that the predictions are based on monotonically increasing functions, we require that $A(AU') > A(AU)$ whenever $AU' > AU$, where $A$ is the accuracy and $AU$ is the number of annotation units. Given the predictions’ variability, the numbers are rounded to the nearest thousand.

Table 1 shows that the amount of data needed for the model to reach full coverage is significantly less than a corpus-based account. In some cases, such as the transitive and middle verb, the estimated quantity is over four times less. We expect these paradigms to benefit the most from generalising as they typically display regular inflection.
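The extrapolation step can be sketched as follows. The source uses scipy and does not state the functional form of the fitted curve, so two substitutions are made here and should be read as assumptions: the curve is taken to be logarithmic, A(AU) = a + b ln(AU), and it is fitted with closed-form least squares from the standard library instead of scipy. Monotonicity is enforced by keeping only points that exceed every earlier accuracy.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (annotation units, coverage in %)


def strictly_increasing_points(
    units: Sequence[float], coverage: Sequence[float]
) -> List[Point]:
    """Keep only points whose coverage exceeds all earlier coverage values."""
    kept: List[Point] = []
    best = float("-inf")
    for au, cov in zip(units, coverage):
        if cov > best:
            kept.append((au, cov))
            best = cov
    return kept


def fit_log_curve(points: Sequence[Point]) -> Tuple[float, float]:
    """Least-squares fit of coverage = a + b * ln(units)."""
    xs = [math.log(au) for au, _ in points]
    ys = [cov for _, cov in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def units_for_target(a: float, b: float, target: float = 100.0) -> float:
    """Annotation units at which the fitted curve reaches `target` coverage."""
    return math.exp((target - a) / b)


# Synthetic curve (not paper data) with one dip that the filter removes.
toy_units = [1000, 2000, 3000, 4000, 6000]
toy_cov = [-50 + 15 * math.log(u) for u in toy_units]
toy_cov[2] = toy_cov[1] - 5.0
toy_points = strictly_increasing_points(toy_units, toy_cov)
toy_a, toy_b = fit_log_curve(toy_points)
toy_needed = units_for_target(toy_a, toy_b)
```

Dropping the dips before fitting is what makes such predictions optimistic: the fitted curve only ever sees the best coverage reached so far, so the estimate is better read as a lower bound on the data required.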
Additionally, the paradigm size for both the transitive and the middle verb is substantial.

It is tempting to draw parallels between language learning and the analysis presented here. However, we remind readers that we base our predictions on one representative verb and focus on attestation coverage rather than heuristic coverage. Furthermore, we note that heuristic coverage would require a vastly larger quantity of data. In addition, the numbers here are for one verb only and do not extend to other parts of speech.

7 Conclusion

We propose ‘coverage’ as a new way to measure the comprehensiveness of a corpus for morphological paradigms. Here we present its application to Nen verbal morphology. This methodology can be extended to include other parts of speech or languages.

Our results show that using deep learning approaches, more specifically the Transformer architecture (Gillioz et al., 2020; Lin et al., 2022), allows us to exploit the generalisable parts of a paradigm and thus achieve higher coverage. The model-based account yielded higher attestation for three of the four verbs considered. In an ideal setting, each inflection feature for each word would be observed and recorded naturally. However, this is an impossible feat in real life. Using statistics-based modelling like the Transformer model allows us to synthesise forms based on examples encountered in the training data. As a result, the existing corpus can account for more of the system than a simple count within the corpus would suggest.

We have explored the basis of the conventional wisdom that higher frequency yields better model performance. This largely holds: we observe a positive correlation between high-frequency feature combinations and model accuracy. However, we also see that the model can correctly generate less frequent feature combinations. We provide data quantity estimates based on the learning curves generated.
These predictions are meant only as a guide rather than anything definitive, as they present an optimistic case defined by the enforcement of monotonicity. The extension of our proposed methodology to other languages with diverse morphological characteristics remains an open direction for future work.

Limitations

One major limitation of the study presented here is the microscopic tracking of one representative verb. As mentioned earlier, one potential solution is to track several verbs of each inflection type. These might be chosen based on phonological behaviour, allowing us to account for allomorphy. Another difficulty to note is the generalisability of parts of the paradigm. By using a neural approach, we wish to leverage the generalisability of the system, but to cover fully even a subsection of language such as verbal morphology, direct exposure to the exceptions is sometimes needed.

Ethics Statement

Data on Nen were gathered by Evans under the projects Language and Social Cognition (ANU Aries protocol 2008/253), Languages of Southern New Guinea (ANU Aries protocol 2011/313) and The Wellsprings of Linguistic Diversity (ANU Aries Protocol 2014/224). Nen data are lodged on open access in the PARADISEC archive.

References

Farrell Ackerman, James P. Blevins, and Robert Malouf. 2009. Parts and Wholes. Implicative Patterns in Inflectional Paradigms. In Analogy in Grammar: Form and Acquisition, pages 54–82. Oxford University Press.
Michel Jose Anzanello and Flavio Sanson Fogliatto. 2011. Learning curve models and applications: Literature review and research directions. International Journal of Industrial Ergonomics, 41(5):573–583.
R. Harald Baayen. 2001. Word Frequencies, pages 1–38. Springer Netherlands, Dordrecht, The Netherlands.
Louise Baird, Nicholas Evans, and Simon J. Greenhill. 2022. Blowing in the wind: Using ‘north wind and the sun’ texts to sample phoneme inventories. Journal of the International Phonetic Association, 52(3):453–494.
Steven Bird.
2015. Email. Resource Network for Linguistic Diversity Discussion List.
James P. Blevins and Juliette Blevins. 2009. Analogy in Grammar: Form and Acquisition. Oxford University Press.
James P. Blevins, Petar Milin, and Michael Ramscar. 2017. The zipfian paradigm cell filling problem. In Perspectives on Morphological Organization, pages 139–158. Brill, Leiden, The Netherlands.
Erwin Chan. 2008. Structures and distributions in morphology learning. Ph.D. thesis, University of Pennsylvania, PA, USA.
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver. Association for Computational Linguistics.
Nicholas Evans. 2014. Positional verbs in Nen. Oceanic Linguistics, 53(2):225–255.
Nicholas Evans. 2015. Valency in Nen. In Andrej Malchukov and Bernard Comrie, editors, Volume 2 Case Studies from Austronesia, the Pacific, the Americas, and Theoretical Outlook, pages 1069–1116. De Gruyter Mouton, Berlin, München, Boston.
Nicholas Evans. 2016. Inflection in Nen. In Matthew Baerman, editor, The Oxford Handbook of Inflection, pages 543–575. Oxford University Press, USA.
Nicholas Evans. 2017. Quantification in Nen, pages 571–607. Springer International Publishing, Cham.
Nicholas Evans. 2019. Waiting for the Word: Distributed Deponency and the Semantic Interpretation of Number in the Nen Verb. Morphological Perspectives. Papers In Honour of Greville G. Corbett, pages 100–123.
Nicholas Evans. 2020. One thousand and one coconuts: Growing memories in Southern New Guinea. The Contemporary Pacific, 32(1):72–96.
Grant Fairbanks. 1960. Voice and Articulation Drillbook, Second edition. Harper & Row, New York, NY, USA.
Raphael Finkel and Gregory Stump. 2007. Principal Parts and Morphological Typology. Morphology, 17(1):39–75.
Anthony Gillioz, Jacky Casas, Elena Mugellini, and Omar Abou Khaled. 2020. Overview of the Transformer-based models for NLP tasks. In 2020 15th Conference on Computer Science and Information Systems (FedCSIS), pages 179–183.
Nikolaus P Himmelmann. 1998. Documentary and Descriptive Linguistics. Linguistics, 36(1):161–196.
Jordan Kodner, Salam Khalifa, Khuyagbaatar Batsuren, Hossep Dolatian, Ryan Cotterell, Faruk Akkus, Antonios Anastasopoulos, Taras Andrushko, Aryaman Arora, Nona Atanalov, Gábor Bella, Elena Budianskaya, Yustinus Ghanggo Ate, Omer Goldman, David Guriel, Simon Guriel, Silvia Guriel-Agiashvili, Witold Kieraś, Andrew Krizhanovsky, Natalia Krizhanovsky, Igor Marchenko, Magdalena Markowska, Polina Mashkovtseva, Maria Nepomniashchaya, Daria Rodionova, Karina Scheifer, Alexandra Sorova, Anastasia Yemelina, Jeremiah Young, and Ekaterina Vylomova. 2022. SIGMORPHON–UniMorph 2022 shared task 0: Generalization and typologically diverse morphological inflection. In Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 176–203, Seattle, Washington. Association for Computational Linguistics.
Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of Transformers. AI Open, 3:111–132.
Ling Liu and Mans Hulden. 2020. Leveraging principal parts for morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 153–161, Online. Association for Computational Linguistics.
Ling Liu and Mans Hulden. 2022. Can a Transformer pass the wug test? tuning copying bias in neural morphological inflection models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 739–749, Dublin, Ireland.
Association for Computational Linguistics.
Saliha Muradoglu, Nicholas Evans, and Hanna Suominen. 2020. To compress or not to compress? A finite-state approach to Nen verbal morphology. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 207–213, Online. Association for Computational Linguistics.
Saliha Muradoğlu. 2017. When is enough enough? A corpus-based study of verb inflection in a morphologically rich language (Nen). Masters thesis, The Australian National University.
Saliha Muradoğlu, Nicholas Evans, and Ekaterina Vylomova. 2020. Modelling verbal morphology in Nen. In Proceedings of the The 18th Annual Workshop of the Australasian Language Technology Association, pages 43–53, Virtual Workshop. Australasian Language Technology Association.
Tiago Pimentel, Maria Ryskina, Sabrina J. Mielke, Shijie Wu, Eleanor Chodroff, Brian Leonard, Garrett Nicolai, Yustinus Ghanggo Ate, Salam Khalifa, Nizar Habash, Charbel El-Khaissi, Omer Goldman, Michael Gasser, William Lane, Matt Coler, Arturo Oncevay, Jaime Rafael Montoya Samame, Gema Celeste Silva Villegas, Adam Ek, Jean-Philippe Bernardy, Andrey Shcherbakov, Aziyana Bayyr-ool, Karina Sheifer, Sofya Ganieva, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Andrew Krizhanovsky, Natalia Krizhanovsky, Clara Vania, Sardana Ivanova, Aelita Salchak, Christopher Straughn, Zoey Liu, Jonathan North Washington, Duygu Ataman, Witold Kieraś, Marcin Woliński, Totok Suhardijanto, Niklas Stoehr, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Richard J. Hatcher, Emily Prud’hommeaux, Ritesh Kumar, Mans Hulden, Botond Barta, Dorina Lakatos, Gábor Szolnok, Judit Ács, Mohit Raj, David Yarowsky, Ryan Cotterell, Ben Ambridge, and Ekaterina Vylomova. 2021. SIGMORPHON 2021 shared task on morphological reinflection: Generalization across languages.
In Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–259, Online. Association for Computational Linguistics.
Andrei Shcherbakov, Saliha Muradoglu, and Ekaterina Vylomova. 2020. Exploring looping effects in RNN-based architectures. In Proceedings of the The 18th Annual Workshop of the Australasian Language Technology Association, pages 115–120, Virtual Workshop. Australasian Language Technology Association.
Miikka Silfverberg and Mans Hulden. 2018. An encoder-decoder approach to the paradigm cell filling problem. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2883–2889, Brussels, Belgium. Association for Computational Linguistics.
Han Sloetjes and Peter Wittenburg. 2008. Annotation by category-ELAN and ISO DCR. In 6th international Conference on Language Resources and Evaluation (LREC 2008).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Tom Viering and Marco Loog. 2022. The Shape of Learning Curves: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20.
Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272.
Shijie Wu, Ryan Cotterell, and Mans Hulden. 2021. Applying the transformer to character-level transduction. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1901–1907, Online. Association for Computational Linguistics.

A Appendix: Inflection categories

IPFV.FIMP: Future Imperfective
IPFV.IMP: Imperfective Imperative
IPFV.MIMP: Mediated imperative
IPFV.NPHD: Imperfective Nonprehodiernal
IPFV.YPST: Imperfective Yesterday Past
IPFV.RMPST: Imperfective Remote Past
NEUT.PRIM: Neutral Primordial
NEUT.PRET: Neutral Preterite
NEUT.PIRR: Neutral Irrealis
PFV.IMP: Perfective Imperative
PFV.FUT: Perfective Future
PFV.PST: Perfective Past