Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 47–54 Brussels, Belgium, November 1, 2018. c©2018 Association for Computational Linguistics 47 Mind the Gap: Data Enrichment in Dependency Parsing of Elliptical Constructions Kira Droganova∗ Filip Ginter† Jenna Kanerva† Daniel Zeman∗ ∗Charles University, Faculty of Mathematics and Physics †University of Turku, Department of Future Technologies {droganova,zeman}@ufal.mff.cuni.cz {figint,jmnybl}@utu.fi Abstract In this paper, we focus on parsing rare and non-trivial constructions, in particular ellip- sis. We report on several experiments in enrichment of training data for this specific construction, evaluated on five languages: Czech, English, Finnish, Russian and Slovak. These data enrichment methods draw upon self-training and tri-training, combined with a stratified sampling method mimicking the structural complexity of the original treebank. In addition, using these same methods, we also demonstrate small improvements over the CoNLL-17 parsing shared task winning sys- tem for four of the five languages, not only re- stricted to the elliptical constructions. 1 Introduction Dependency parsing of natural language text may seem like a solved problem, at least for resource- rich languages and domains, where state-of-the- art parsers attack or surpass 90% labeled attach- ment score (LAS) (Zeman et al., 2017). However, certain syntactic phenomena such as coordination and ellipsis are notoriously hard and even state- of-the-art parsers could benefit from better mod- els of these constructions. Our work focuses on one such construction that combines both coor- dination and ellipsis: gapping, an omission of a repeated predicate which can be understood from context (Coppock, 2001). For example, in Mary won gold and Peter bronze, the second instance of the verb is omitted, as the meaning is evident from the context. In dependency parsing this cre- ates a situation where the parent node is missing (omitted verb won) while its dependents are still present (Peter and bronze). In the Universal De- pendencies annotation scheme (Nivre et al., 2016) gapping constructions are analyzed by promoting one of the orphaned dependents to the position of its missing parent, and connecting all remain- ing core arguments to that promoted one with the orphan relation (see Figure 1). Therefore the de- pendency parser must learn to predict relations be- tween words that should not usually be connected. Gapping has been studied extensively in theoreti- cal works (Johnson, 2009, 2014; Lakoff and Ross, 1970; Sag, 1976). However, it received almost no attention in NLP works, neither concerned with parsing nor with corpora creation. Among the re- cent papers, Kummerfeld and Klein (2017) pro- posed a one-endpoint-crossing graph parser able to recover a range of null elements and trace types, and Schuster (Schuster et al., 2018) proposed two methods to recover elided predicates in sentences with gapping. The aforementioned lack of corpora that would pay attention to gapping, as well as natural relative rarity of gapping, leads to its un- derrepresentation in training corpora: they do not provide enough examples for the parser to learn gapping. Therefore we investigate methods of en- riching the training data with new material from large raw corpora. The present work consist of two parts. In the first part, we experiment on enriching data in gen- eral, without a specific focus on gapping construc- tions. This part builds upon self-training and tri- training related work known from the literature, but also develops and tests a stratified approach for selecting a structurally balanced subcorpus. In the second part, we focus on elliptical sentences, com- paring general enrichment of training data with en- richment using elliptical sentences artificially con- structed by removal of a coordinated element. 2 Data 2.1 Languages and treebanks For the parsing experiments we selected five tree- banks from the Universal Dependencies (UD) col- 48 (a) Marie won gold and Peter won bronze nsubj obj conj cc nsubj obj (b) Marie won gold and Peter bronze nsubj obj conj cc orphan Figure 1: UD representation of a sentence with repeated verb (a), and with an omitted verb in a gapping construc- tion (b). lection (Nivre et al., 2016). We experiment with the following treebanks: UD Czech, UD English, UD Finnish, UD Russian-SynTagRus, and UD Slovak. With the exception of UD Russian- SynTagRus, all our experiments are based on UD release 2.0. This UD release was used in the CoNLL-17 Shared Task on Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2017), giving us a point of comparison to the state-of-the-art. For UD Russian-SynTagRus, we use UD release 2.1, which has a considerably improved annotation of elliptic sentences. For English, which has only a few elliptical sentences in the original treebank, we also utilize in testing a set of elliptical sentences gathered by Schuster et al. (2018). This selection of data strives to maximize the amount of elliptical constructions present in the treebanks (Droganova and Zeman, 2017), while also covering different modern languages and pro- viding variation. Decisions are based on the work by Droganova and Zeman (2017) who collected statistics on elliptical constructions that are explic- itly marked with orphan relation within the UD treebanks. Relatively high number of elliptical constructions within chosen treebanks is the prop- erty of the treebanks rather than the languages. 2.2 Additional material Automatic parses As an additional data source in our parsing experiments, we use the multilin- gual raw text collection by Ginter et al. (2017). This collection includes web crawl data for 45 languages automatically parsed using the UDPipe parser (Straka and Strakova´, 2017) trained on the UD version 2.0 treebanks. For Russian, where we use newer version of the treebank, we reparsed the raw data with UDPipe model trained on the corre- sponding treebank version to agree with the tree- bank data in use. As our goal is to use the web crawled data to enrich the official training data in the parsing ex- periments, we want to ensure the quality of the automatically parsed data. To achieve this, we apply a method that stands between the standard self-training and tri-training techniques. In self- training, the labeled training data (L) is iteratively enriched with unlabeled data (U ) automatically la- beled with the same learning system (L = L+Ul), whereas in tri-training (Zhou and Li, 2005) there are three different learning systems, A, B and C, and the labeled data for the system A is enriched with instances from U on which the two other sys- tems agree, therefore La = L+(Ub∩Uc). Differ- ent variations of these methods have been success- fully applied in dependency parsing, for example (McClosky et al., 2006; Søgaard and Rishøj, 2010; Li et al., 2014; Weiss et al., 2015). In this work we use two parsers (A andB) to process the unlabeled crawl data, and then the sentences where these two parsers fully agree are used to enrich the training data for the system A, i.e. La = L + (Ua ∩ Ub). Therefore the method can be seen as a form of ex- panded self-training or limited tri-training. A sim- ilar technique is successfully used for example by Sagae and Tsujii (2007) in parser domain adapta- tion and Bjo¨rkelund et al. (2014) in general pars- ing. In our experiments the main parser used in fi- nal experiments as well as labeling the crawl data, is the neural graph-based Stanford parser (Dozat et al., 2017), the winning and state-of-the-art sys- tem from the CoNLL-17 Shared Task (Zeman et al., 2017). The secondary parser for labeling the crawl data is UDPipe, a neural transition-based parser, as these parses are already provided to- gether with the crawl data. Both of these parsers include their own part-of-speech tagger, which is trained together (but not jointly) with the depen- dency parser in all our experiments. In the fi- nal self-training web crawl datasets we then keep only deduplicated sentences with identical part- of-speech and dependency analyses. All results reported in this paper are measured on gold to- kenization, and the parser hyperparameters are those used for these systems in the CoNLL-17 49 Shared Task. Artificial treebanks on elliptical constructions For specifically experimenting on elliptical con- structions, we additionally include data from the semi-automatically constructed artificial tree- banks by Droganova et al. (2018). These treebanks simulate gapping by removing words in particular coordination constructions, providing data for ex- perimenting with the otherwise very rare construc- tion. For English and Finnish the given datasets are manually curated for grammaticality and flu- ency, whereas for Czech the quality relies on the rules developed for the process. For Russian and Slovak, which are not part of the original artifi- cial treebank release, we create automatically con- structed artificial datasets by running the pipeline developed for the Czech language. Size of the ar- tificial data is shown in Table 1. Token Sentence Czech 50K 2876 English 7.3K 421 Finnish 13K 1000 Russian 87K 5000 Slovak 7.1 564 Table 1: The size of the artificial data 3 Experiments First, we set out to evaluate the overall quality of the trees in the raw enrichment dataset produced by our self-training variant by parsing and filtering web crawl data. In our baseline experiments we train parsers (Dozat et al., 2017) using purely the new self-training data. From the full self-training dataset we sample datasets comparable to the sizes of the original treebanks to train parsers. These parsers are then evaluated using the original test set of the corresponding treebank. This gives us an overall estimate of the self-training data quality compared to the original treebanks. 3.1 Tree sampling Predictably, our automatically selected self- training data is biased towards short, simple sen- tences where the parsers are more likely to agree. Long sentences are in turn often composed of sim- ple coordinated item lists. To rectify this bias, we employ a sampling method which aims to more closely follow the distribution of the original tree- bank compared to randomly sampling sentences from the full self-training data. We base the sam- pling on two features of every tree: the number of tokens, and the number of unique dependency re- lation types divided by the number of tokens. The latter accounts for tree complexity, as it penalizes trees where the same relation type is repeated too many times, and it specifically allows us to down- sample the long coordinated item lists where the ratio drops much lower than average. We of course take into account that a relation type can naturally occur more than once in a sentence, and that it is not ideal to force the ratio close to 1.0. However, as the sampling method tries to mimic the distri- bution from the original treebank, it should to pick the correct variance while discarding the extremes. The sampling procedure proceeds as follows: First, we divide the space of the two features, length and complexity, into buckets and estimate from the treebank training data the target distri- bution, and the expected number of trees to be sampled in each bucket. Then we select from the full self-training dataset the appropriate number of trees into each bucket. Since the web crawl data is heavily skewed, it is not possible to ob- tain a sufficient number of sampled trees in the ex- act desired distribution, because many rare length– complexity combinations are heavily underrepre- sented in the data. We therefore run the sampling procedure in several iterations, until the desired number of trees have been obtained. This results in a distribution closer to, although not necessarily fully matching, the original treebank. To evaluate the impact of this sampling proce- dure, we compare it to two baselines. RandomS randomly selects the exact same number of sen- tences as the above-mentioned Identical sampling procedure. This results in a dataset which is con- siderably smaller in terms of tokens, because the web crawl data (on which the two parsers agree) is heavily biased towards short trees. To make sure our evaluation is not affected by simply using less data in terms of tokens, we also provide the Ran- domT baseline, where trees are randomly selected until the same number of tokens is reached as in the Identical sample. Here we are able to evaluate the quality of the sampled data, not its bulk. In Table 2 we see that, as expected, when sam- pling the same amount of sentences as in the train- ing section of the original treebank, the RandomS sampling produces datasets considerably smaller in terms of tokens, whereas RandomT results in 50 Language Random T Random S Identical TB Czech 102K/982K 68K/611K 68K/982K 68K/1175K English 18K/183K 13K/102K 13K/183K 13K/205K Finnish 17K/144K 12K/92K 12K/144K 12K/163K Russian 73K/694K 49K/431K 49K/694K 49K/871K Slovak 11K/83K 8K/58K 8K/83K 8K/81K Table 2: Training data sizes after each sampling strategy compared to the original treebank training section (TB), sentences/tokens. datasets considerably larger in terms of trees when the same amount of tokens as in the RandomS dataset is sampled. This confirms the assumption that parsers tend to agree on shorter sentences in the web crawl data, introducing the bias towards them. On the other hand, when the same number of sentences is selected as in the RandomS sam- pling and the original treebank, the Identical sam- pling strategy results in dataset much closer to the original treebank in terms of tokens. Parsing results for the different sampling strate- gies are shown in Table 3. Except for Slovak, the results follow an intuitively expectable pattern: the sample with the least tokens results in the worst score, and of the two samples with the same num- ber of tokens, the one which follows the treebank distribution receives the better score. Surprisingly, for Slovak the sampling strategy which mimics the treebank distribution receives a score almost 3pp lower than the one with random sampling of the same amount of tokens. A possible explana- tion is given in the description of the Slovak tree- bank which mentions that it consists of sentences on which two annotators agreed, and is biased to- wards short and simple sentences. The data is thus not representative of the language use, pos- sibly causing the effect. Lacking a better explana- tion for the time being, we also add the RandomT sampling dataset into our experiments for Slovak. Overall, the parsing results on the automatically selected data are surprisingly good, lagging only several percent points behind parsers trained on the manually annotated treebanks. 3.2 Enrichment In this section, we test the overall suitability of the sampled trees as an additional data for pars- ing. We produce training data composed of the original treebank training section, and a progres- sively increasing number of sampled trees: 20%, 100%, and 200% (relative to the treebank train- ing data size, i.e. +100% sample doubles the to- tal amount of training data). The parsing results Language Random T Random S Identical TB Czech 88.50% 88.18% 88.77% 91.20% English 83.67% 82.86% 84.18% 86.94% Finnish 82.67% 80.69% 83.01% 87.89% Russian 91.28% 90.85% 91.49% 93.35% Slovak 85.02% 83.67% 82.35% 86.04% Table 3: Results of the baseline parsing experiments, using only automatically collected data, reported in terms of LAS%. Random T: random sample, same amount of tokens as in the Random S samples; Random S: random sample, same amount of sentences as in the original treebanks; Identical: identical sample, imitates the distribution of trees in the original treebanks. For comparison, the TB column shows the LAS of a parser trained on the original treebank training data. Language TB +20% +100% +200% Czech 91.20% 91.13% 90.98% 90.72% English 86.94% 87.32% 87.43% 87.29% Finnish 87.89% 87.83% 88.24% 88.32% Russian 93.35% 93.38% 93.22% 93.08% Slovak 86.04% 87.89% 88.36% 88.36% Slovak T 86.04% 88.14% 88.57% 88.77% Table 4: Enriching treebank data with identical sam- ple from automatic data, LAS%. TB: original tree- bank (baseline experiment; the scores are better than re- ported in the CoNLL-17 Shared Task because we eval- uate on gold segmentation while the shared task sys- tems are evaluated on predicted segmentation); +20% – +200%: size of the identical sample used to enrich the treebank data (with respect to the original treebank size). Slovak T: enriching Slovak treebank with ran- dom tokens sample instead of identical. are shown in Table 4. Positively, for all languages except Czech, we can improve the overall pars- ing accuracy, for Slovak by as much as 2.7pp, which is a rather non-trivial improvement. In gen- eral, the smaller the treebank, the larger the ben- efit. With the exception of Slovak, the improve- ments are relatively modest, in the less than half- a-percent range. Nevertheless, since our baseline is the winning parser of the CoNLL-17 Shared Task, these constitute improvements over the cur- rent state-of-the-art. Based on these experiments, 51 we can conclude that self-training data extracted from web crawl seem to be suitable material for enriching the training data for parsing, and in next section we continue to test whether the same data and methods can be used to increase occurrences of a rare linguistic construction to make it more learnable for parsers. 3.3 Ellipsis Our special focus point is that of parsing elliptic constructions. We therefore test whether increas- ing the number of elliptical sentences in the train- ing data improves the parsing accuracy of these constructions, without sacrificing the overall pars- ing accuracy. We follow the same data enrichment methods as used above in general domain and proceed to select elliptical sentences (recognized through the orphan relation) from the same self- training data automatically produced from web crawl (Section 2.2). We then train parsers using a combination of the ellipsis subset and the orig- inal training section for each language. We en- rich Czech, Russian and Slovak training data with elliptical sentences, progressively increasing their size by 5%, 10% and 15%. For Finnish, only 5% of elliptical sentences was available in the filtered web crawl data, and for English not a single sen- tence. The experiments showed mixed results (Ta- ble 5). For Russian and Slovak the accuracy of the dependencies involved in gapping is improved by web crawl enrichment, whereas the results for Czech remained largely the same and Finnish slightly decreased (column Web crawl). Unfortu- nately, for Slovak and Finnish, we cannot draw firm conclusions due to the small number of or- phan relations in the test set. For English, even the treebank results are very low: the parser predicts only very few orphan relations (recall 1.71%) and the web crawl data contains no orphans on which the two parsers could agree, thus making it impos- sible to enrich the data using this method. Clearly, English requires a different strategy, and we will return to it shortly. Positively, none of the lan- guages substantially suffered in terms of overall LAS when adding extra elliptical sentences into the training data. For Slovak, we can even see a significant improvement in overall parsing accu- racy, in line with the experiments in Section 3.1. Increasing the proportion of orphan sentences in the training data has the predictable effect of in- creasing the orphan F-score and decreasing the overall LAS of the parser. These differences are nevertheless only very minor and can only be ob- served for Czech and Russian which have suffi- cient number of orphan relation examples in the test set. For Slovak, with 18 examples, we can- not draw any conclusions, and for English and Finnish, there is not a sufficient number of orphan examples in the filtered web crawl data to allow us to vary the proportion. For all languages, we also experiment with the artificial elliptic sentence dataset of Droganova et al. (2018), described earlier in Section 2.2. For Czech, English and Finnish, the dataset contains semi-automatically produced, and in the case of English and Finnish, also manually validated in- stances of elliptic sentences. For Slovak and Rus- sian, we replicate the procedure of Droganova et al., sans the manual validation, obtaining artifi- cial orphan datasets for all the five languages un- der study. Subsequently, we train parsers using a combination of sentences from the artificial tree- bank and the original training set. The results of this experiments are in Table 5, column Artificial. Compared to web crawl, the artificial data results in a lower performance on orphans for Czech, Slo- vak and Russian, and higher for Finnish, but once again keeping in mind the small size of Finnish and Slovak test set, it is difficult to come to a firm conclusion. Clearly, though, the web crawl data does not perform substantially worse than the ar- tificial data, even though it is gathered fully au- tomatically. A very substantial improvement is achieved on English, where the web crawl data fails to deliver even a single orphan example, whereas the artificial data gains recall of 9.62%. This offers us an opportunity to once again try to obtain orphan examples for English from the web crawl data, since this time we can train the parsers on the combination of the original tree- bank and the artificial data, hopefully resulting in parsers which are in fact able to predict at least some orphan relations, which in turn can result in new elliptic sentences from the web crawl data. As seen from Table 5, the artificial data increases the orphan F-score from 3.36% to 17.18% relative to training only on the treebank, and we are there- fore able to obtain a parser which is at least by the order of magnitude comparable to the other four languages in parsing accuracy of elliptic construc- tions. We observe no loss in terms of the over- 52 Language All Treebank Web crawl +5/+10/+15% Artificial LAS O Pre O Rec O F LAS O Pre O Rec O F LAS O Pre O Rec O F Czech 418 91.20% 54.84% 56.94% 55.87% 91.22% 48.96% 61.72% 54.60% 91.15% 51.79% 58.85% 55.10%91.11% 49.80% 60.05% 54.45% 91.06% 50.19% 62.68% 55.74% English 2+466 86.94% 100.00% 1.71% 3.36% — — — — 86.95% 80.36% 9.62% 17.18% Finnish 43 87.89% 66.67% 32.56% 43.75% 87.76% 48.15% 30.23% 37.14% 88.04% 54.76% 53.49% 54.12% Russian 138 93.35% 44.57% 29.71% 35.65% 93.50% 42.86% 39.13% 40.91% 93.20% 33.14% 40.58% 36.48%93.41% 38.26% 41.30% 39.72% 93.42% 40.69% 42.75% 41.70% Slovak 18 86.04% 60.00% 16.67% 26.09% 87.90% 36.36% 22.22% 27.59% 87.80% 37.50% 16.67% 23.08%87.76% 33.33% 16.67% 22.22% 87.80% 30.77% 22.22% 25.81% Table 5: Enriching treebank data with elliptical sentences. All: number of orphan labels in the test data; Treebank: original treebank (baseline experiment); Web crawl: Enriching the original treebank with the elliptical sentences extracted from the automatically parsed web crawl data; Artificial: Enriching the original treebank with the artificial ellipsis treebank; LAS, %: overall parsing accuracy; O Prec (orphan precision): number of correct orphan nodes divided by the number of all predicted orphan nodes; O Rec (orphan recall): number of correct orphan nodes divided by the number of gold-standard orphan nodes; O F (Orphan F-score): F- measure restricted to the nodes that are labeled as orphan : 2PR / (P+R). For English, the orphan P/R/F scores are evaluated on a dataset of the two orphan relations in the original test section, combined with 466 English elliptic sentences of Schuster et al. (2018). The extra sentences are not used in the LAS column, so as to preserve comparability of overall LAS scores across the various runs. all LAS, demonstrating that it is in fact possible to achieve a substantial improvement in parsing of a rare, non-trivial construction without sacrificing the overall performance. Using the web data self-training filtering pro- cedure with two parsers trained on the tree- bank+artificial data, we can now repeat the exper- iment with enriching parser training data with or- phan relations, results of which are shown in Ta- ble 6. We test the following models: • original UD English v.2.0 treebank; • original UD English v.2.0 treebank com- bined with the artificial sentences; • original UD English v.2.0 treebank com- bined with the artificial sentences and web crawl dataset; size progressively increased by 5%, 10% and 15%. Here we use the original UD English v.2.0 treebank extended with the artificial sentences to train the models (Sec- tion 2.2) that produce the web crawl data for English. The best orphan F-score of 36%, more than ten times higher compared to using the original tree- bank, is obtained by enriching the training data with 15% elliptic sentences from the artificial and filtered web data. The orphan F-score of 36% is on par with the other languages and, positively, the overall LAS of the parser remains essentially un- changed — the parser does not sacrifice anything Model LAS O Precision O Recall O F-score Treebank 86.94% 100% 1.71% 3.36% Artificial 86.95% 80.36% 9.62% 17.18% Art.+Web 5% 86.72% 86.11% 19.87% 32.29% Art.+Web 10% 86.68% 78.36% 22.44% 34.88% Art.+Web 15% 87.07% 84.38% 23.08% 36.24% Table 6: Enriching the English treebank data with elliptical sentences. LAS, %: overall parsing accu- racy; O Precision (orphan precision): number of cor- rect orphan labels divided by the number of all pre- dicted orphan nodes; O Recall (orphan recall): num- ber of correct orphan labels divided by the number of gold-standard orphan nodes; O F-score (orphan F- score): F-measure restricted to the nodes that are la- beled as orphan : 2PR / (P+R). For English, the or- phan P/R/F scores are evaluated on a dataset of the two orphan relations in the original test set, combined with 466 English elliptic sentences of Schuster et al. (2018). The extra sentences are not used in the LAS column, so as to preserve comparability of overall LAS scores across the various runs. This is necessary since ellip- tic sentences are typically syntactically more complex and would therefore skew overall parser performance evaluation. in order to gain the improvement on orphan re- lations. These English results therefore not only explore the influence of the number of elliptical sentences on the parsing accuracy, but also test a method applicable in the case where the tree- bank does not contain almost any elliptical con- structions and results in parsers that only generate the relation very rarely. 53 4 Conclusions We have explored several methods of enrich- ing training data for dependency parsers, with a specific focus on rare phenomena such as ellip- sis (gapping). This focused enrichment leads to mixed results. On one hand, for several languages we did not obtain a significant improvement of the parsing accuracy of ellipsis, possibly in part owing to the small number of testing examples. On the other hand, though, we have demonstrated that for English ellipsis parsing accuracy can be improved from single digit numbers to performance on par with the other languages. We have also validated the method of constructing artificial elliptical ex- amples as a mean to enrich parser training data. Additionally, we have shown that useful training data can be obtained using web crawl data and a self-training or tri-training style method, even though the two parsers in question differ substan- tially in their overall performance. Finally, we have shown that this parser train- ing data enrichment can lead to improvements of general parser accuracy, improving upon the state of the art for all but one language. The improve- ment was especially notable for Slovak. Czech was the only treebank not benefiting from this ad- ditional data, likely owing to the fact that is is an already very large, and homogenous treebank. As part of these experiments, we have introduced and demonstrated the effectiveness of a stratified sam- pling method which corrects for the skewed dis- tribution of sentences selected in the web filtering experiments. Acknowledgments The work was partially supported by the grant 15-10472S of the Czech Science Foundation (GACˇR), the GA UK grant 794417, Academy of Finland, and Nokia Foundation. Computational resources were provided by CSC - IT Center for Science, Finland. References Anders Bjo¨rkelund, O¨zlem C¸etinog˘lu, Agnieszka Falen´ska, Richa´rd Farkas, Thomas Mu¨ller, Wolf- gang Seeker, and Zsolt Sza´nto´. 2014. Self-training for Swedish Dependency Parsing – Initial Results and Analysis. In Proceedings of the Fifth Swedish Language Technology Conference (SLTC 2014). Elizabeth Coppock. 2001. Gapping: In defense of deletion. In Proceedings of the Chicago Linguistics Society, volume 37, pages 133–148. Timothy Dozat, Peng Qi, and Christopher D Manning. 2017. Stanford’s Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task. Proceed- ings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30. Kira Droganova and Daniel Zeman. 2017. Elliptic Constructions: Spotting Patterns in UD Treebanks. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), 135, pages 48–57. Kira Droganova, Daniel Zeman, Jenna Kanerva, and Filip Ginter. 2018. Parse Me if You Can: Artifi- cial Treebanks for Parsing Experiments on Ellipti- cal Constructions. Proceedings of the 11th Interna- tional Conference on Language Resources and Eval- uation (LREC 2018). Filip Ginter, Jan Hajicˇ, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings. LINDAT/CLARIN digi- tal library at the Institute of Formal and Applied Linguistics (U´FAL), Faculty of Mathematics and Physics, Charles University. Kyle Johnson. 2009. Gapping is not (VP) ellipsis. Lin- guistic Inquiry, 40(2):289–328. Kyle Johnson. 2014. Gapping. Jonathan K Kummerfeld and Dan Klein. 2017. Parsing with Traces: An O(n4) Algorithm and a Structural Representation. arXiv preprint arXiv:1707.04221. George Lakoff and John Robert Ross. 1970. Gapping and the order of constituents. Progress in linguis- tics: A collection of papers, 43:249. Zhenghua Li, Min Zhang, and Wenliang Chen. 2014. Ambiguity-aware ensemble training for semi- supervised dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), volume 1, pages 457–467. David McClosky, Eugene Charniak, and Mark John- son. 2006. Effective self-training for parsing. In Proceedings of the main conference on human lan- guage technology conference of the North American Chapter of the Association of Computational Lin- guistics, pages 152–159. Association for Computa- tional Linguistics. Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin- ter, Yoav Goldberg, Jan Hajicˇ, Christopher Man- ning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of the 10th In- ternational Conference on Language Resources and 54 Evaluation (LREC 2016), pages 1659–1666, Por- torozˇ, Slovenia. Ivan Sag. 1976. Deletion and Logical Form. MIT. PhD dissertation. Kenji Sagae and Junichi Tsujii. 2007. Dependency parsing and domain adaptation with LR models and parser ensembles. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Lan- guage Processing and Computational Natural Lan- guage Learning (EMNLP-CoNLL). Sebastian Schuster, Joakim Nivre, and Christopher D. Manning. 2018. Sentences with Gapping: Parsing and Reconstructing Elided Predicates. In Proceed- ings of the 16th Annual Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (NAACL 2018). Anders Søgaard and Christian Rishøj. 2010. Semi- supervised dependency parsing using generalized tri-training. In Proceedings of the 23rd Interna- tional Conference on Computational Linguistics, pages 1065–1073. Association for Computational Linguistics. Milan Straka and Jana Strakova´. 2017. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics. David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured Training for Neural Net- work Transition-Based Parsing. In Proceedings of ACL 2015, pages 323–333. Daniel Zeman, Martin Popel, Milan Straka, Jan Hajicˇ, Joakim Nivre, Filip Ginter, Juhani Luoto- lahti, Sampo Pyysalo, Slav Petrov, Martin Pot- thast, Francis Tyers, Elena Badmaeva, Memduh Go¨krmak, Anna Nedoluzhko, Silvie Cinkova´, jr. Jan Hajicˇ, Jaroslava Hlava´cˇova´, Va´clava Kettnerova´, Zdenˇka Uresˇova´, Jenna Kanerva, Stina Ojala, Anna Missila¨, Christopher Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Le- ung, Marie-Catherine de Marneffe, Manuela San- guinetti, Maria Simi, Hiroshi Kanayama, Valeria de Paiva, Kira Droganova, He´ctor Martı´nez Alonso, C¸ag˘r C¸o¨ltekin, Umut Sulubacak, Hans Uszkor- eit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Str- nadova´, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonc¸a, Tatiana Lando, Rattima Nitisaroj, and Josie Li. 2017. CoNLL 2017 shared task: Mul- tilingual parsing from raw text to universal depen- dencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Uni- versal Dependencies, pages 1–19, Stroudsburg, PA, USA. Charles University, Association for Computa- tional Linguistics. Zhi-Hua Zhou and Ming Li. 2005. Tri-training: Ex- ploiting unlabeled data using three classifiers. IEEE Transactions on knowledge and Data Engineering, 17(11):1529–1541.