Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 216–228
November 16, 2024. ©2024 Association for Computational Linguistics
216
Improving Latin Dependency Parsing by Combining Treebanks and
Predictions
Hanna-Mari Kupari, Erik Henriksson, Veronika Laippala, Jenna Kanerva
TurkuNLP, University of Turku, Finland
{hanna-mari.kupari, erik.henriksson, mavela, jmnybl}@utu.fi
Abstract
This paper introduces new models designed to
improve the morpho-syntactic parsing of the
five largest Latin treebanks in the Universal
Dependencies (UD) framework. First, using
two state-of-the-art parsers, Trankit and Stanza,
along with our custom UD tagger, we train
new models on the five treebanks both indi-
vidually and by combining them into novel
merged datasets. We also test the models on
the CIRCSE test set. In an additional experi-
ment, we evaluate whether this set can be accu-
rately tagged using the novel LASLA corpus
(https://github.com/CIRCSE/LASLA). Sec-
ond, we aim to improve the results by combin-
ing the predictions of different models through
an atomic morphological feature voting sys-
tem. The results of our two main experiments
demonstrate significant improvements, partic-
ularly for the smaller treebanks, with LAS
scores increasing by 16.10 and 11.85%-points
for UDante and Perseus, respectively (Gamba
and Zeman, 2023a). Additionally, the voting
system for morphological features (FEATS)
brings improvements, especially for the smaller
Latin treebanks: Perseus 3.15% and CIRCSE
2.47%-points. Tagging the CIRCSE set with
our custom model using the LASLA model im-
proves POS 6.71 and FEATS 11.04%-points
compared to our best-performing UD PROIEL
model. Our results show that larger datasets
and ensemble predictions can significantly im-
prove performance.
1 Introduction
In recent years, significant progress has been made
in morpho-syntactic dependency parsing for Latin,
an advancement that greatly benefits a wide range
of research in the humanities. Linguistically tagged
corpora are crucial, as lemmatized corpora, for in-
stance, are valuable also for historians searching
for sources within databases. The Universal De-
pendencies (UD) framework plays a key role by or-
ganizing linguistic analysis into machine-readable
databases with columns in tab-separated value ta-
bles. These CoNLL-U formatted treebanks pro-
vide essential information on lemmas, parts of
speech, morphological features, syntactic roles,
and dependency relations. In the realm of Latin
treebanks notable recent developments include the
morphological harmonization of the five largest
Latin treebanks (ITTB, LLCT, Perseus, PROIEL,
and UDante1), a significant milestone reached by
Gamba and Zeman (2023a) as a continuation of
earlier work on syntactic harmonization (Gamba
and Zeman, 2023b).
Additionally, there have been many efforts to
enhance the performance of Latin parsing tools.
These include the EvaLatin campaigns Sprugnoli
et al., 2022, 2024, as well as the application of GPT
models for part-of-speech (POS) tagging (Stüssi
and Ströbel, 2024). Despite these advancements,
there remains potential for further improvement,
particularly in syntactic parsing. For instance, the
highest Labeled Attachment Score (LAS) reported
by Gamba and Zeman (2023a) is 64.87% for the
UDante and 59.43% for Perseus.
In the present study, we leverage the recently
released harmonized treebanks (Gamba and Zeman,
2023a) to further enhance automatic parsing. Our
focus is on the five largest established treebanks in
the UD format, ensuring that our results are reliably
comparable to previous studies. Our models can
also easily be applied to parse new text corpora.
To achieve our goal, we employ two approaches:
First, we train new parser models using these har-
monized treebanks, along with two state-of-the-
art parsers —Stanza (Qi et al., 2020) and Trankit
(Nguyen et al., 2021)— as well as a custom UD
tagger by fine-tuning a BERT-based Latin language
model (Ströbel, 2022) following the architecture
of Devlin et al. (2019). The parsing models are
trained using both individual and diverse merged
1https://universaldependencies.org/la/
217
treebanks.
Second, we investigate whether combining pre-
dictions from our newly trained models in a voting
system targeting part-of-speech (POS) and mor-
phological features (FEATS) tags improves perfor-
mance. Our hypothesis is that selecting the most
common prediction from the different models en-
hances the results in a ’majority vote wins’ sce-
nario.
Third, we use the voting setup of the differ-
ent models to analyze how unanimous the various
parser models are in their POS predictions. This
provides insight into which tasks are accurately
tagged and offers potential for identifying prevail-
ing issues in the annotation guidelines.
Upon the publication of this paper,2 all data,
code, and results, as well as the models, will
be made openly and freely accessible for non-
commercial use. These resources include clear
instructions, designed to be easily used by scholars
who may not be familiar with language technology
but wish to experiment with their own texts.
2 Previous work
The first Latin BERT model by Bamman and Burns
(2020) provided the state-of-the-art POS scores of
its time (Perseus 94.3%, PROIEL 98.2%, ITTB
98.8%). Similarly, Nehrdich and Hellwig (2022)
reported very competitive LAS scores for the previ-
ous releases of the treebanks using a biaffine parser
on top of a Latin BERT (ITTB 92.99%, PROIEL
86.34% and PERSEUS 80.16%).
There have been some trials with merging
existing treebanks into larger training datasets.
Nehrdich and Hellwig (2022) combined the ITTB,
Perseus, and PROIEL treebanks, while Smith et al.
(2018) trained a single model for all ancient lan-
guages, including three Latin treebanks. Addition-
ally, Kondratyuk and Straka (2019) combined all
the UD treebanks into a single multilingual dataset
and trained a model for all UD languages. While
these studies demonstrated the potential for improv-
ing performance by merging training data from
multiple treebanks, the first reports only a single
experiment, and the latter two do not focus specif-
ically on Latin, leaving room for further experi-
ments. The challenge of selecting and combining
treebanks is also brought to attention in the latest
EvaLatin Campaign (Sprugnoli et al., 2024).
Merging treebanks for training models has not
2https://github.com/HannaKoo/Latin-Parsing
been widely explored, likely because the develop-
ers of the treebanks have varied interpretations of
the UD guidelines since the treebanks have been
composed at different points in time (with contin-
uous updates regarding the annotation guidelines).
These discrepancies in annotations has compli-
cated combining them into larger merged training
datasets. The work of Gamba and Zeman (2023a)
focuses on the harmonisation of the datasets, and
they train models using only the individual tree-
banks.
Combining the predictions of several models
through voting has been tested in many studies.
E.g. early pioneering work by Zeman and Žabokrt-
ský (2005) applied majority voting for four parsers
for Czech, reporting improvements of 2%-points
in dependency relation prediction. Combining
parser outputs has also been used by Passarotti
and Dell’Orletta (2010) to improve the parsing of
the ITTB treebank. More recent work by Stoeckel
et al. (2020) developed an ensemble classifier by
applying a voting model on top of several POS
taggers. Their voting model was designed to learn
which predictions to trust in different contexts.
3 Data
There are five Latin UD treebanks used for training:
the Index Thomisticus Treebank (ITTB) (Passarotti,
2019), the Late Latin Charter Treebank (LLCT)
(Cecchini et al., 2020b), Perseus (Bamman and
Crane, 2011), PROIEL (Haug and Jøhndal, 2008),
and UDante (Cecchini et al., 2020a). For a concise
numerical comparison of these Latin UD treebanks
and a detailed description of their contents, see
3. For a general overview, see Gamba and Zeman
(2023b).
The efforts of Gamba and Zeman (2023a) are
crucial for merging the treebanks and serve as
a foundation of our model training. These har-
monized treebanks are accessible at a GitHub-
repository 4. For a concise numerical overview
and a brief description of the treebanks used in this
study, refer to Table 1.
3https://universaldependencies.org/treebanks/
la-comparison.html
4https://github.com/fjambe/Latin-variability/
tree/main/morpho_harmonization/
morpho-harmonized-treebanks
218
3.1 CIRCSE test set
The novel sixth UD Latin treebank, CIRCSE5, con-
sists solely of a test set because of its small size
along the UD guidelines. This test set is valuable
for evaluating our models because it differs from
the established larger treebanks, which predomi-
nantly feature texts from the middle ages. For in-
stance, the ITTB and LLCT together contain 692K
tokens, whereas Perseus focuses on Classical texts
with a total of only 29K tokens. CIRCSE is also
distinct in genre, featuring a total of 13,294 tokens
of tragedy: Hercules Furens (7,714 tokens, 555
sentences) and Agamemnon (5,580 tokens, 409 sen-
tences) by Seneca (c. 4 BC – AD 65), along with
the treatise Germania (5,674 tokens, 299 sentences)
by Tacitus (c. AD 56 – c. 120).
3.2 Merged treebanks
Merging treebanks presents challenges not only
due to potential differences in annotation guidelines
but also because of the linguistic variation they
reflect. The five treebanks span several millennia
and cover a wide range of genres, factors that can
influence the performance of models trained on
them. One of the key research questions we explore
is whether, for example, the inclusion of a large
amount of medieval Latin training data affects the
parsing results for Classical Latin.
In addition to merging all the training datasets,
we combine the individual treebanks into five the-
matically organized merged treebanks, as shown in
Table ??, based on a holistic understanding of the
nature of the different Latin UD datasets. We also
experiment with merged sets focused on specific
time periods, drawing on a heuristic understand-
ing of historical linguistics and the evolution of
the Latin language. The goal is to compile sets
that support one another, rather than confuse the
models with training data that is too varied or even
contradictory. Beyond linguistic considerations, to
address machine learning challenges and mitigate
the risk of overfitting—particularly when working
with datasets from unequally sized and heteroge-
neous treebanks—the merged training sets were
constructed by iteratively concatenating one-fifth
of each individual treebank, ordered from smallest
to largest, into the new datasets.
5https://universaldependencies.org/treebanks/
la_circse/index.html
3.3 The Corpus Corporum monolingual
training set
While most of our experiments are based on the
widely applied Stanza and Trankit parsers (see Sec-
tion 4), neither of them support using a dedicated
pre-trained Latin language model. Therefore, we
also experiment with our custom tagger utilizing
a language model trained on Latin data only (see
Section 4.3). The language model (Ströbel, 2022)
has been produced by using the Corpus Corporum
dataset (Roelli, 2014). This dataset contains a con-
siderably large portion of patristic texts from the
Patrologia Latina (8.4 M words). For a concise
overview of the texts currently included in this
database see the listing on the project website 6.
The previous work of Bamman and Burns (2020)
with a monolingual model for POS tagging is pro-
duced with a very large dataset of 642.7M tokens
that includes for example Latin Wikipedia of 16M
tokens. This provides obvious problems as to reli-
able quality of the training data, since contributions
to Vicipaedia are not subject to expert language
check and the RoBERTa Latin model by Ströbel
(2022) is focused to solve this very issue.
3.4 The LASLA dataset
Since texts from the Classical period are underrep-
resented in the UD treebanks, we conduct a small
experiment using the non-UD LASLA dataset,
which lacks dependency parsing annotation. In
terms of POS tagging, lemmatization and morphol-
ogy, the 1.8M-token LASLA dataset is notably
large, created through a joint effort by members of
the LiLa and LASLA teams.7 We use the LASLA
corpus as a basis to make our own train, dev, and
test sets for a small-scale experiment aimed at im-
proving our custom model for the POS and mor-
phological analysis of the CIRCSE test set. Our
modification of the CoNLL-U Plus formatted files
excludes the texts in the CIRCSE test set (see 3.1)
and removes non-relevant fields. The larger files
are split and concatenated in random order.
4 Methods
In our aim to improve morpho-syntactic parsing
tools for Latin, we use two different methods: train-
ing new models and experimenting with a voting
system. Our first task is the training of new parser
6https://mlat.uzh.ch/browser?path=/
7https://github.com/CIRCSE/LASLA?tab=
readme-ov-file
219
Token counts or words in datasets
Dataset Short Description Train Dev Test Total
CIRCSE Seneca’s tragedies and Tacitus’ treatise - - 19 483 19 483
ITTB Texts of Thomas of Aquinas, 13th century 392 017 29 968 29 920 451 905
LLCT 8th century legal charters from Tuscany 194 193 24 195 24 079 242 467
Perseus Classical auctors e.g. Caesar and Ovid 16 859 1 566 11 149 29 574
PROIEL Classical auctors and New Testament 172 261 13 955 14 114 200 330
· Classical E.g. Cicero and Palladius 76 647 - - 76 647
· Vulgate Jerome’s Vulgate 95 614 7 123 - 102 737
UDante Works of Dante Alighieri, 13th-14th century 30 567 11 689 13 502 55 758
CC Massive Corpus Corporum text database 162 M
LASLA Classical Latin database 1 856 296 32 756 35413 1 856 296
Table 1: Overview of the used datasets for train, dev and test. We have spilt PROIEL to include Classical secular
texts and Vulgate. For Perseus, where the original release does not include a separate development set for parameter
optimization, we created one by dividing the train set. The UD CIRCSE treebank only contains a test set due to its
size. The Corpus Corporum dataset is the basis for the monolingual BERT (Ströbel, 2022) used for our custom
model UD tagger. Our modification of the LASLA database (https://github.com/CIRCSE/LASLA/tree/main)
is used in an experiment to improve the results of the CIRCSE test set.
Training data ITTB LLCT Perseus PROIEL UDante Tokens in total
Classical Latin 9% 91% 205 K
Late and Medieval Latin 62% 32% 6% 683 K
Later and Christian Latin 54% 28% 13% 5% 785 K
Merged 48% 25% 2% 21% 5% 887 K
Table 2: Overview of the merged treebanks used for training Stanza and Trankit and fine-tuning the custom model.
models based on the newest treebanks described
in Tables 1 and 2. For full morpho-syntactic pars-
ing, we apply the commonly used Trankit (Nguyen
et al., 2021) and Stanza (Qi et al., 2020) toolkits.
As neither Trankit nor Stanza support the usage
of a custom pretrained language model, we also
experiment with a custom part-of-speech and mor-
phological tagger trained on top of a monolingual
Latin language model (Ströbel, 2022) following the
task-specific fine-tuning of Devlin et al. (2019).
4.1 Trankit
Trankit (Nguyen et al., 2021) is a light-weight trans-
former based toolkit, which provides a trainable
pipeline for morpho-syntactic parsing. It reports
outperforming prior multilingual NLP pipelines
over sentence segmentation, POS and FEAT tag-
ging as well as in dependency parsing while main-
taining competitive performance for tokenization,
multi-word token expansion, and lemmatization
over 90 UD treebanks. It is based on training
adapter modules (Houlsby et al., 2019; Pfeiffer
et al., 2020) on top of the multilingual pretrained
XLM-R language model (Conneau et al., 2020).
The parser is designed to be efficient in multilin-
gual usage (shared multilingual language model),
while still giving state-of-the-art results for individ-
ual treebanks (treebank-specific adaptors).
4.2 Stanza
Stanza (Qi et al., 2020) is a trainable, language-
agnostic neural pipeline for morpho-syntactic pars-
ing. Stanza includes a Bi-LSTM encoder capable
of utilizing pre-trained word embeddings, and uses
the biaffine neural dependency parser by Dozat
and Manning (2017). This is the same parser that
Gamba and Zeman (2023a) employed. We use stan-
dard model training in order to have a model that
matches the Trankit training to ensure a reliable
comparison between the models.
4.3 Custom tagger with a Latin language
model
Earlier studies, e.g. Pyysalo et al. (2021); Bam-
man and Burns (2020), have shown that for certain
languages the usage of a dedicated monolingual
language model may result in better performance
compared to multilingual models or not using a
220
pretrained language model at all. While neither
Trankit nor Stanza support the usage of a custom
pretrained language model, we implement a POS
and morphological tagger by fine-tuning a mono-
lingual Latin language model. As a pretrained lan-
guage model we use the pstroe/roberta-base-latin-
v38 pretrained on the Corpus Corporum Latin text
collection (see Section 3.3). The tagger jointly
predicts the POS and morphological features by
adding a task-specific token classification layer on
top of the pretrained language model, following the
architecture of Devlin et al. (2019). The classifica-
tion layer is trained on treebank data updating also
the weights of the original language model.
4.4 Voting
In POS tag and FEATS predictions voting we run a
simple majority vote of the three parsers (Trankit,
Stanza, and Custom tagger), for each treebank se-
lecting the generally best performing model of each
parser. In a tie situation, the voting defaults to
Trankit which generally receives the best individ-
ual scores. The voting script does not take into
account the fact that the numerically highest scores
for POS and UFEATS might come from different
models, and our preference is for overall best re-
sults.
For POS tags, the possible voting scenarios
when using three parsers are cases where all three
agree, two outvote the third one and all parsers dis-
agree. When analysing the model predictions for
the Perseus treebank, in 86% of tokens the three
parsers agree on UPOS, in 13% of tokens there is
a majority agreement, and only in a bit more than
1% all three parsers disagree on UPOS.
However, in terms of morphological features the
same agreement rates on Perseus are 59%, 31%,
and 10% respectively, when voting on the level
of full feature analyses — the entire FEATS field
that consists of several categories such as number
and tense. The large variation in predicted feature
combinations therefore increases the percentage
of tokens where there is no majority consensus
available (10%).
To be able to at least partially account for these
tokens as well, for morphological features we pro-
ceed the voting in two steps. First, the voting is
done on the level of full feature analysis (e.g. for
nouns this means that all the diverse elements in
the category, such as case, number and gender), but
8https://huggingface.co/pstroe/
roberta-base-latin-cased3
in cases where we are not able to find a majority
vote, we continue to the second option of voting
on category level. In the second step, the feature
analyses are split into individual (category, value)
-pairs, and for each category we run the majority
voting of values predicted for that particular cate-
gory. To avoid the situation where the final analysis
is a union of different categories predicted by three
parsers, we obtain the categories from the default
Trankit parser, therefore in practice only voting val-
ues for Trankit predicted categories. It should also
be noted that the LASLA model for CIRCSE is not
included in the vote, as it would require a close
reading of potentially non-UD-style morphological
annotations, which the script does not consider.
5 Results
The performance of the trained models is summa-
rized in Table 3, which presents the results for the
five largest established treebanks. Additionally,
the outcomes specific to the CIRCSE treebank are
detailed in Table 6 and Table 7. The findings un-
derscore the importance of selecting optimal tree-
banks for training, as discussed by Sprugnoli et al.
(2024). While the prevailing trend in training large
language models has been to utilize increasingly
larger datasets, our results indicate a different effect.
Specifically, the Perseus treebank shows signifi-
cant improvement when trained with the Classical
dataset, indicating that quality of data is more criti-
cal than quantity, challenging the assumption that
"more is better". The effects of this improvement
are highlighted in Table 8.
The complete set of metrics is available on the
project’s GitHub page9 and the all CoNLL-U for-
matted treebanks respectively10. In this paper, we
report and discuss the scores for tokenization, POS,
morphological features (FEATS), lemmatization,
and syntax, including both the unlabeled attach-
ment score (UAS) and labeled attachment score
(LAS). For the custom tagger, only the UPOS and
FEATS results are relevant. All metrics were gen-
erated using the UD evaluation tools, based on the
CoNLL 2018 shared task script11.
In the results presented below we discuss the
9https://anonymous.4open.science/r/
Latin-Parsing-627B//Results/Evaluation_metrics/
eval_table.tsv
10https://anonymous.4open.science/r/
Latin-Parsing-43B5/Results/conllu_files
11https://github.com/UniversalDependencies/
tools/blob/master/eval.py
221
Compilation of Results Tasks:
Treebank and model Tokens UPOS UFeats Lemmas UAS LAS
ITTB
Stanza 100.00 98.64 96.16 99.05 88.50 86.61
Trankit 99.99 98.99 97.52 97.63 92.09 90.71
Trankit Late and Christian 100.00 99.05 97.61 97.87 91.86 90.52
Trankit Five Merged 99.99 99.07 97.55 97.82 91.90 90.41
Custom tagger Late and Christian - 98.72 96.61 - - -
LLCT
Stanza 100.00 99.61 96.95 98.07 95.85 94.83
Trankit 99.99 99.66 97.36 96.50 96.15 95.37
Trankit Late and Medieval 99.99 99.66 97.18 96.69 96.46 95.51
Custom tagger - 99.14 95.67 - - -
Perseus
Stanza 99.94 89.44 80.17 80.97 69.75 61.93
Stanza Classical 99.92 90.09 81.33 85.89 75.28 68.29
Trankit Classical 99.74 90.50 83.25 74.60 77.89 71.28
Trankit Five Merged 99.79 91.83 80.94 76.55 77.72 70.59
Custom tagger Classical - 89.58 82.58 - - -
Custom tagger Five Merged - 89.66 78.43 - - -
PROIEL
Stanza 99.99 97.22 92.14 96.63 78.12 74.56
Trankit 99.87 97.29 92.77 89.37 84.09 80.97
Trankit Five Merged 99.88 97.30 92.96 89.24 83.94 80.92
Custom tagger Five Merged - 96.44 91.64 - - -
UDante
Stanza 99.65 89.98 81.00 86.94 68.37 59.15
Trankit Five Merged 99.66 91.46 84.42 77.50 79.63 73.42
Custom tagger Five Merged - 89.91 82.24 - - -
Table 3: A compilation of the most important F1-scores. The best score for each treebanks is in bold.
most relevant numbers and some case study ex-
amples. In Table 9 we also include the previous
state-of-the-art outcomes from two recent studies.
Our state-of-the-art results demonstrate improve-
ments in POS-tagging of 8.41 %-points for Perseus,
7.78 for PROIEL, and 5.93 for UDante compared to
the findings of Stüssi and Ströbel (2024). Addition-
ally, our results show an improvement in LAS of
11.85%-points for Perseus and 16.10%-points for
UDante compared to Gamba and Zeman (2023a).
All numerically highest F1 scores achieved by
the models are in the Table 3. The effects of the
merging of training data set for training are in Ta-
ble 8. The results of the majority vote win for POS
and FEATS are in Table 5.
5.1 Tokenization
Tokenization results have very little room for im-
provement, the best models already obtaining an F1
score of 100 % for ITTB and LLCT with individual
training. From close reading we find that the only
aspect of tokenization that requires improvement is
the prediction of multi-word tokens (MWTs). This
issue arises from the complete absence or inclu-
sion of only a few trivial MWTs in these corpora.
E.g. the ITTB train set contains only instances of
nonne ’isn’t it?’, which is clearly insufficient for
effectively training the models on something as
complex as Latin enclitics). Upon close reading
the output, we identified predictions that are sig-
nificantly off. For instance, in the Perseus corpus
parsed by Stanza, the word pulsabantque’and beat’
is incorrectly tokenized as "pullaaa" and "que" in-
stead of the correct "pulsabant" and "que,"
The tokenization of the CIRCSE test set
achieved a perfect accuracy of 100.00% with the
Stanza PROIEL model. However, this test set lacks
punctuation, leading to poor performance in the
222
task of sentence segmentation across all models.
Several of our models were unable to segment sen-
tences and attempted to dependency parse the entire
dataset as a single 19K words long sentence. To
address this, we experimented using a crude fix of
adding a full stop at the end of each sentence using
a script, and assigned a mock HEAD-tag pointing
to the last word of each sentence, resembling the
use of GS segmentation. For further details and
results of this experiment, see Table 7.
5.2 Part-of-Speech (POS)
Overall, the results for POS tagging have for a long
time highly accurate and for most treebanks can
only be marginally improved.
All the results of the POS vote are written in a
new ConLL-U-styled tsv-table that first includes
the winner of the majority vote, the predicted forms
in the following order: Trankit, Stanza and custom
model.12 After that a column indicates the results
of the vote being either unanimous, two-to-one or
even. The resulting file13 includes also a column
that indicates if the result of the vote is correct,
this information is especially informative for close
reading. Scholars are able to form a general idea
of what kind of tasks the parsers are capable of
predicting and can especially focus on the difficul-
ties and understand if there is an underlying trend
that could be fixed (i.e. relating to the annotation
guidelines).
The most interesting cases are the ones with dis-
persed results and here we will highlight some case
examples. From the ITTB treebank we find a case
with the word necesse-esse ’necessarily existent’
with POS predictions: ADJ, VERB, AUX where
our custom model gets it right according to the GS
of the morphological harmonization, but the earlier
realise tags this as NOUN. From LLCT we find
instances like decimas (from phrase per quadrag-
inta annos abuerunt consuetudo offertas et decimas
dare ad predicta ecclesia) as ADJ, NOUN, NUM,
where Stanza gets the POS tag of ’tithes’ right.
There are a lot of expressions of date, for example
in mense december where one instance the vote is
even for december resulting as NOUN, ADJ, NUM
while all other instances in the test dataset get it
unanimously right as ADJ. From the expression
adfinis terra ’boundaries of the land’ adfinis as
ADJ, NOUN, ADP when Trankit gets it correct.
The Perseus and UDante outputs have substantially
12Results/conllu_files/voted_extended
13Results/conllu_files/gold_extended
PROIEL Correct % Wrong %
Unanimous 13 295 99% 132 0.98%
Two to one 463 75% 154 25%
Dispersed 15 44% 19 55%
Table 4: An example of the accuracies of the voting on
POS tagging in PROIEL
more even votes than the other five established
treebanks. These include iuro NOUN, VERB,
ADJ from the phrase per flumina iuro (swear by
the rivers), also we find Aeoliis as ADJ, PROPN,
ADP where Trankit gets it correct. From PROIEL
promissa as NOUN, ADJ, VERB from the expres-
sion ceterorum que promissa which is easy to un-
derstand, since the participe promissum ’promises’
and we would also imagine this being difficult for
Latin students, but Trankit is correct with NOUN.
From UDante the phrase praedictis finibus ’of the
aforementioned borders’ where the participle is
predicted as DET, VERB, ADJ and only Stanza is
correct. An example of voting accuracy in PROIEL
in Table 4.
For CIRCSE, the best UD framework based
model part-of-speech tagging result comes from
Stanza trained on PROIEL at 84.46%, but other
models are close. However, our small experiment
with the LASLA model does bring an improvement
of 6.71%-points (UPOS 91.17%) hinting that the
results for many other out of genre texts from the
Classical period might be considerably improved
with larger training data.
5.3 UFeats
The morphological analysis results seem to vary
greatly between different treebanks, from ITTB
reaches already a very impressive result of 97.61%
but for UDante only at 84.42%. This seems to
follow the trend, that when there is enough of in-
domain training data, the results have very little
room for improvement. The best UD framework
based CIRCSE morphological analysis is achieved
with the Stanza PROIEL model (59.48%) as was
for POS. Surprisingly using the LASLA model
gives an improvement of 11.04%-points (UFeats
70.52%).
5.4 Lemmas
Accurate automatic lemmatization is a very rele-
vant task for a highly inflected language like Latin.
The results have a high amount of variation across
different treebanks but overall Stanza models seem
to consistently outperform on this task. The re-
223
sults for ITTB comes from the Stanza individually
trained model at an impressive 99.05% as well as
for LLCT 98.07%, PROIEL 89.24% and UDante
86.94%. For Perseus the best score 85.89% is pro-
duced by using the Stanza Classical model.
The best lemmatization score for CIRCSE is
Stanza Five Merged 78.00%.
5.5 Unlabeled Attachment Score: UAS
Latin papers on automatic parsing usually report
the unlabelled attachment scores (UAS) along with
labeled attachment score (LAS). The UAS metric
means the percentage of words that are assigned
the correct head in the sentence. The results syn-
tactic tagging vary greatly. The Perseus treebank
benefits from only seeing training data from its own
time period. On the contrary, the same does not
apply for UDante, which benefits greatly from the
merged training data and obtains a 79.63% score
with Trankit Five Merged (66.79% on UDante).
For CIRCSE the best score is only 51.29% by
the Trankit Five Merged model, this is understand-
able considering how far the training model data is
as genre for parsing the tragedies. Adding the punc-
tuation with a very coarse simple full stop addition
at the end of each sentence makes this dataset much
easier for models to syntactically parse, this alone
leads to a 59.16% with above mentioned model.
5.6 Labeled Attachment Score: LAS
For the second metric on syntax, the Labeled At-
tachment Score LAS, the results are in line with
UAS findings. The LAS score is the percentage
of words that are assigned both to the correct head
and the correct dependency label. The results in
Table 3 show that the results tend to be dependent
on the amount of similar training data.
The LAS score of the CIRCSE test set shows
the true nature and difficulty of out of domain Latin
syntax parsing. Our experiment reflects the more
of a real life situation with parsing new data and
our best score is 44.54% from Trankit Five Merged.
The altered punctuation yields a 50.91% score on
same model. The EvaLatin2024 (Sprugnoli et al.,
2024) results reach 77.41% for prose and 75.75%
poetry. The task performance is not comparable
since for the shared task included the use of train
and dev datasets and had only the dependency pars-
ing task. Straka et al. (2024) report leveraging the
GS morphological annotation as an additional input
for the parser.
5.7 Voting results
The results of the voting experiment are reported
in Table 5, giving the baseline scores for the three
parsers (Trankit, Stanza, custom tagger), and the
majority voting results. In addition to this, we
also report Oracle score to illustrate the theoret-
ical upper bound for voting when it is based on
these three parsing models, i.e. the accuracy of a
hypothetical voting system that is always able to
select the best option among the predictions. Based
on the results by Zeman and Žabokrtský (2005)
we expected a possible an increase of roughly two
percentage points. The improvement of the voting
results is reported in 5 and ranges from 0.00% to
+0.89 for POS tagging and for FEATS from +0.09%
to +3.15%.
6 Conclusion and future studies
The task of full morpho-syntactic parsing across
the five largest established treebanks comprises 30
subtasks, of which 8 are best performed by the
Trankit Five Merged model. This model demon-
strates particular strength in part-of-speech label-
ing. Additionally, Stanza’s lemmatization capa-
bilities are noteworthy, consistently achieving the
highest numerical values across all five treebanks.
Overall it can be stated that merging the avail-
able five Latin UD datasets is very beneficial espe-
cially when it comes to smaller treebanks and out
of domain parsing. With our experiments, by us-
ing thematically compiled and everything merged
datasets, we are able to set a new state of the art for
many morpho-syntactic parsing tasks. The average
improvement of our final results are reported in Ta-
ble 9. Our initial results of morphological features
are even further improved by using the FEATS
atomic voting system especially on the smaller tree-
banks. The results reaching +3.15 %-points.
Future studies should first focus on addressing
the issues related to the treatment of multi-word
tokens. One approach could involve ensuring that
the five established treebanks strictly adhere to cur-
rent guidelines, such as avoiding the splitting of
enclitics (e.g., -que ’and’) into separate tokens. Ad-
ditionally, the introduced voting system could be
further refined and applied to a gold-standard pre-
tokenized input, followed by a detailed numerical
error analysis and close reading. This enables deter-
mining the specific morphological annotation tasks
that our current models succeed upon. Such analy-
sis could also determine whether observed errors
224
ITTB LLCT Perseus PROIEL UDante CIRCSE
UPOS UFeat UPOS UFeat UPOS UFeat UPOS UFeat UPOS UFeat UPOS UFeat
Trankit 99.07 97.55 99.63 97.15 91.83 80.94 97.30 92.96 91.46 84.42 83.21 57.76
Stanza 98.64 96.15 99.61 96.96 90.81 82.03 97.14 92.18 89.85 80.92 84.47 56.85
Custom 98.72 96.61 99.14 95.67 89.58 82.58 96.44 91.64 89.91 82.24 79.72 55.29
Majority 99.07 97.64 99.64 97.32 92.72 85.73 97.78 93.98 91.73 85.25 85.25 60.23
Change +0.00 +0.09 +0.01 +0.17 +0.89 +3.15 +0.48 +1.02 +0.27 +0.83 +0.78 +2.47
Oracle 99.60 99.01 99.82 98.46 96.11 92.64 98.83 96.98 94.19 90.69 90.22 65.31
Table 5: Results of the majority voting system compared to the three individual models used in voting. Oracle
stands for a theoretical upper bound for voting of always selecting the best option among the predictions.
suggest the need for further harmonization of the
treebanks themselves or are these cases difficult to
grammatically analyze as such?
On one hand, many tasks are successfully accom-
plished using a single treebank for training, devel-
opment, and testing, as demonstrated by the ITTB
data, which does not require the inclusion of addi-
tional treebanks for improving performance. This
highlights the importance of incorporating new gen-
res across a broad time span into the UD Latin tree-
bank family, ensuring that the training data is suffi-
ciently diverse, comprehensive and large enough.
While the development of novel gold-standard an-
notated datasets offers significant benefits, it is also
highly demanding in terms of human resources. We
hope that our high-performing models will facili-
tate the annotation of these datasets by providing
accurate predictions that serve as a strong starting
point for manual corrections, thereby easing the
process.
On the other hand, one of the conclusions drawn
from our diverse merged training sets is that the
notion of "Latin is Latin" does not hold true. It is
well established that medieval Latin is distinctly
different from Classical Latin. In practical terms
most scholars often identify themselves as experts
in one or the other. However, a possible future
study could investigate the specific attributes in a
treebank’s training data that make a parser model
particularly adept at Classical or medieval Latin.
Another conclusion from our experiments is that
the accuracy of parsing Latin from the Classical
period (broadly defined) is diminished when the
model is exposed to medieval training data. This
warrants further exploration to define the character-
istics that distinguish the two and will shed more
light into computational historical linguistics. One
study could be the evolution of medieval Latin and
the extent to which medieval treebanks reflect pre-
serving features of Classical Latin, analyzed by auc-
tor and decade. It might reveal how well and what
ways medieval writers were competent in Classi-
cal Latin. Another potential research direction is
to investigate why parsing the UDante treebank
appears less selective, with all five merged mod-
els performing well. This raises the question of
whether users of Latin from this late medieval pe-
riod were equally accustomed and influenced by
reading both Classical and medieval authors. Alter-
natively, this phenomenon might be explained by
the size of the training data, where additional exam-
ples contribute to improved results, as our LASLA
experiment in the CIRCSE test set show.
7 Limitations
Firstly, the harmonization of UD Latin syntactic
annotation (Gamba and Zeman, 2023b) and mor-
phological annotation (Gamba and Zeman, 2023a)
has been taken as a given and we have not sub-
jected the annotations to any closer examination.
As suggested by the case study sample finding of
necesse-esse ’necessarily existent’ (as discussed in
the Section 5.2) the training datasets might include
seldom errors from automatic processing. Sec-
ondly, the data in the LASLA corpus14 has not
been examined for any potential divergences from
the UD framework. We don’t inspect the results
from the reserved test set we have set aside for
possible further experiments on the LASLA corpus
based model with our custom model. This would
need more resources and we leave this for the fu-
ture, since our focus only on one experiment of the
CIRCSE test set.
Acknowledgements
The authors wish to acknowledge CSC – IT Center
for Science, Finland, for computational resources
as well as The Emil Aaltonen Foundation for grant
"Exploring linguistic variation in medieval Latin
using computational methods" for Hanna-Mari Ku-
pari 2022-2024.
14https://github.com/CIRCSE/LASLA/tree/main
225
References
Rie Kubota Ando and Tong Zhang. 2005. A framework
for learning predictive structures from multiple tasks
and unlabeled data. Journal of Machine Learning
Research, 6:1817–1853.
Galen Andrew and Jianfeng Gao. 2007. Scalable train-
ing of L1-regularized log-linear models. In Proceed-
ings of the 24th International Conference on Machine
Learning, pages 33–40.
David Bamman and Patrick J. Burns. 2020. Latin bert:
A contextual language model for classical philology.
Preprint, arXiv:2009.10053.
David Bamman and Gregory Crane. 2011. The an-
cient greek and latin dependency treebanks. In Lan-
guage Technology for Cultural Heritage, pages 79–
98, Berlin, Heidelberg. Springer Berlin Heidelberg.
Flavio M Cecchini, Rachele Sprugnoli, Giovanni
Moretti, and Marco Passarotti. 2020a. Udante: First
steps towards the universal dependencies treebank of
dante’s latin works. Accademia University Press.
Flavio Massimiliano Cecchini, Timo Korkiakangas, and
Marco Passarotti. 2020b. A new Latin treebank for
Universal Dependencies: Charters between Ancient
Latin and Romance languages. In Proceedings of the
Twelfth Language Resources and Evaluation Confer-
ence, pages 933–942, Marseille, France. European
Language Resources Association.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,
Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Edouard Grave, Myle Ott, Luke Zettle-
moyer, and Veselin Stoyanov. 2020. Unsupervised
cross-lingual representation learning at scale. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 8440–
8451, Online. Association for Computational Lin-
guistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
4171–4186, Minneapolis, Minnesota. Association for
Computational Linguistics.
Timothy Dozat and Christopher D. Manning. 2017.
Deep biaffine attention for neural dependency pars-
ing. In International Conference on Learning Repre-
sentations (ICLR).
Federica Gamba and Daniel Zeman. 2023a. Latin mor-
phology through the centuries: Ensuring consistency
for better language processing. In Proceedings of
the Ancient Language Processing Workshop, pages
59–67, Varna, Bulgaria. INCOMA.
Federica Gamba and Daniel Zeman. 2023b. Univer-
salising latin universal dependencies: a harmonisa-
tion of latin treebanks in ud. In Proceedings of the
Sixth Workshop on Universal Dependencies (UDW,
GURT/SyntaxFest 2023), pages 7–16, Stroudsburg,
PA, USA. Association for Computational Linguistics.
Dag Trygve Truslew Haug and Marius L. Jøhndal. 2008.
Creating a parallel treebank of the old indo-european
bibletranslations.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski,
Bruna Morrone, Quentin De Laroussilhe, Andrea
Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019.
Parameter-efficient transfer learning for nlp. In In-
ternational conference on machine learning, pages
2790–2799. PMLR.
Dan Kondratyuk and Milan Straka. 2019. 75 languages,
1 model: Parsing Universal Dependencies univer-
sally. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
2779–2795, Hong Kong, China. Association for Com-
putational Linguistics.
Sebastian Nehrdich and Oliver Hellwig. 2022. Accu-
rate dependency parsing and tagging of Latin. In
Proceedings of the Second Workshop on Language
Technologies for Historical and Ancient Languages,
pages 20–25, Marseille, France. European Language
Resources Association.
Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Vey-
seh, and Thien Huu Nguyen. 2021. Trankit: A light-
weight transformer-based toolkit for multilingual nat-
ural language processing. In Proceedings of the 16th
Conference of the European Chapter of the Associa-
tion for Computational Linguistics: System Demon-
strations, pages 80–90, Online. Association for Com-
putational Linguistics.
Marco Passarotti. 2019. The Project of the Index
Thomisticus Treebank, pages 299–320. De Gruyter
Saur, Berlin, Boston.
Marco Passarotti and Felice Dell’Orletta. 2010. Im-
provements in parsing the index Thomisticus tree-
bank. revision, combination and a feature model for
medieval Latin. In Proceedings of the Seventh In-
ternational Conference on Language Resources and
Evaluation (LREC’10), Valletta, Malta. European
Language Resources Association (ELRA).
Jonas Pfeiffer, Ivan Vulic´, Iryna Gurevych, and Se-
bastian Ruder. 2020. MAD-X: An Adapter-Based
Framework for Multi-Task Cross-Lingual Transfer.
In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 7654–7673, Online. Association for Computa-
tional Linguistics.
Sampo Pyysalo, Jenna Kanerva, Antti Virtanen, and
Filip Ginter. 2021. WikiBERT models: Deep trans-
fer learning for many languages. In Proceedings
226
of the 23rd Nordic Conference on Computational
Linguistics (NoDaLiDa), pages 1–10, Reykjavik, Ice-
land (Online). Linköping University Electronic Press,
Sweden.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and
Christopher D Manning. 2020. Stanza: A python
natural language processing toolkit for many human
languages. arXiv.org.
Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015.
Yara parser: A fast and accurate dependency parser.
Computing Research Repository, arXiv:1503.06733.
Version 2.
Philipp Roelli. 2014. The corpus corporum, a new open
latin text repository and tool. Archivum Latinitatis
Medii Aevi, 72(1):289–304.
Aaron Smith, Bernd Bohnet, Miryam de Lhoneux,
Joakim Nivre, Yan Shao, and Sara Stymne. 2018. 82
treebanks, 34 models: Universal Dependency parsing
with multi-treebank models. In Proceedings of the
CoNLL 2018 Shared Task: Multilingual Parsing from
Raw Text to Universal Dependencies, pages 113–123,
Brussels, Belgium. Association for Computational
Linguistics.
Rachele Sprugnoli, Federica Iurescia, and Marco Pas-
sarotti. 2024. Overview of the EvaLatin 2024 evalu-
ation campaign. In Proceedings of the Third Work-
shop on Language Technologies for Historical and
Ancient Languages (LT4HALA) @ LREC-COLING-
2024, pages 190–197, Torino, Italia. ELRA and
ICCL.
Rachele Sprugnoli, Marco Passarotti, Flavio Massim-
iliano Cecchini, Margherita Fantoli, and Giovanni
Moretti. 2022. Overview of the EvaLatin 2022 eval-
uation campaign. In Proceedings of the Second
Workshop on Language Technologies for Historical
and Ancient Languages, pages 183–188, Marseille,
France. European Language Resources Association.
Manuel Stoeckel, Alexander Henlein, Wahed Hemati,
and Alexander Mehler. 2020. Voting for POS tag-
ging of Latin texts: Using the flair of FLAIR to better
ensemble classifiers by example of Latin. In Proceed-
ings of LT4HALA 2020 - 1st Workshop on Language
Technologies for Historical and Ancient Languages,
pages 130–135, Marseille, France. European Lan-
guage Resources Association (ELRA).
Milan Straka, Jana Straková, and Federica Gamba. 2024.
úfal latinpipe at evalatin 2024: Morphosyntactic anal-
ysis of latin. Preprint, arXiv:2404.05839.
Phillip Benjamin Ströbel. 2022. Roberta base latin
cased v2.
Elina Stüssi and Phillip Ströbel. 2024. Part-of-speech
tagging of 16th-century Latin with GPT. In Proceed-
ings of the 8th Joint SIGHUM Workshop on Com-
putational Linguistics for Cultural Heritage, Social
Sciences, Humanities and Literature (LaTeCH-CLfL
2024), pages 196–206, St. Julians, Malta. Association
for Computational Linguistics.
Daniel Zeman and Zdeneˇk Žabokrtský. 2005. Improv-
ing parsing accuracy by combining diverse depen-
dency parsers. In Proceedings of the Ninth Inter-
national Workshop on Parsing Technology, pages
171–178, Vancouver, British Columbia. Association
for Computational Linguistics.
227
A Appendix
CIRCSE test set results Tasks:
Model Name Tokens UPOS UFeats Lemmas UAS LAS
Stanza PROIEL 100.00 84.46 59.48 72.37 48.18 41.38
Trankit PROIEL 99.24 81.50 55.39 60.08 49.44 41.92
Custom Perseus - 76.29 47.79 - - -
Custom PROIEL - 79.72 55.29 - - -
Custom Five Merged - 81.30 57.11 - - -
Custom Classical - 80.84 56.53 - - -
LASLA - 91.17 70.52 - - -
Stanza Classical 100.00 84.37 56.79 73.36 49.64 43.03
Stanza Five Merged 99.98 82.56 51.23 78.00 47.00 40.14
Trankit Classical 99.71 83.08 57.09 62.87 50.57 43.06
Trankit Five Merged 99.82 83.21 57.76 68.15 51.29 44.54
Table 6: The results of the CIRCSE test set. For models trained on individual treebank data only the results for
PROIEL are given for all models, since both Stanza and Trankit Perseus models failed to run because of severe
sentence segmentation issues.
CIRCSE altered test set Tasks
Automatically added punctuation Tokens UPOS UFeats Lemmas UAS LAS
Stanza ITTB 99.98 81.64 56.32 73.32 50.49 41.53
Stanza LLCT 99.99 75.41 40.54 56.13 37.24 25.18
Stanza PROIEL 100.00 79.98 62.06 74.20 46.17 38.59
Stanza Perseus 99.93 83.96 57.26 70.16 46.75 38.43
Stanza Classical 100.00 85.81 59.54 75.46 54.20 46.93
Stanza Five Merged 100.00 83.94 54.33 79.58 53.89 46.60
Trankit Classical 99.78 85.21 59.75 65.44 56.61 48.43
Trankit Late and Christian 99.80 84.85 58.20 66.69 54.53 45.41
Trankit Late and Medieval 99.74 82.68 55.35 63.18 51.99 42.52
Trankit Five Merged 99.79 87.05 61.39 71.73 59.16 50.91
Table 7: The effects to the performance of the different models with the added punctuation to the CIRCSE gold
standard test set. The results are not comparable to the UD released test set and given in italics.
228
Effects of merged treebanks in training Tasks:
Treebank and model Tokens UPOS UFeats Lemmas UAS LAS
ITTB
Custom tagger - 98.66 96.50 - - -
Improvement from Late and Christian - 0.06 0.11 - - -
LLCT
Trankit 99.99 99.66 97.36 96.50 96.15 95.37
Improvement from Late and Medieval 0.00 0.00 -0.18 0.19 0.31 0.14
Perseus
Stanza 99.94 89.44 80.17 80.97 69.75 61.93
Improvement from Classical 0.02 0.65 0.96 4.92 5.53 6.36
Trankit 99.46 88.90 77.98 63.99 74.08 66.97
Improvement from Classical 0.28 1.60 5.27 10.61 3.81 4.31
Custom tagger - 86.29 76.17 - - -
Improvement from Classical - 3.29 6.41 - - -
PROIEL
Custom tagger - 96.42 91.26 - - -
Improvement from Five Merged - 0.02 0.38 - - -
UDante
Trankit 99.50 91.17 80.71 73.89 75.92 68.65
Improvement from Five Merged 0.16 0.29 3.71 3.61 3.71 4.77
Custom tagger - 87.43 75.84 - - -
Improvement from Five Merged - 2.48 6.40 - - -
Average improvement 0.15 1.19 2.78 4.84 3.34 3.90
Table 8: The most important results of the merging of diverse training data.
Tasks: POS UFEATS UAS LAS
Treebank Our highest Change Our highest Change Our highest Change Our highest Change
ITTB 99.07 4.19 97.64 1.49 92.09 -0.19 90.71 2.42
LLCT 99.66 5.16 97.36 0.55 96.46 0.38 95.51 0.60
PERSEUS 92.72 8.41 85.73 7.87 77.89 8.92 71.28 11.85
PROIEL 97.78 7.78 93.98 1.26 84.09 -0.82 80.97 -0.28
UDante 91.73 5.93 85.25 5.95 79.63 12.84 80.97 16.10
Average change 6.29 3.42 4.23 6.14
Table 9: Summary of our best F1 scores. The ones produced by the voting system are given in a bold typeset. The
change as percentage points to the most recent POS tagging study by Stüssi and Ströbel (2024). For ITTB the
best score 99.07% is predicted by Trankit Five Merged (in experimenting with a GPT model on POS tagging the
best results reported by Stüssi and Ströbel (2024) is 94.88 produced on GPT-4 train1000). The same applies for
Perseus as well 91.83% (84.31 on GPT-4 train2000), PROIEL at 97.30% (90.00 on GPT-4 train5000) and UDante
91.46% (85.8 on GPT-4 train200). For LLCT the best score 99.66% (94.5 on GPT-4 train1000) is produced by
the Trankit individually trained model. For UAS and LAS the results are compared to best numbers reported by
Gamba and Zeman (2023a). They have accomplished this using jackknifing technique. In this the training data is
divided into n parts, where n-1 parts are used to train a model to annotate the remaining nth part. When rotating
this n times, we receive a version of the whole training data with predicted annotations, which can be used during
final model training. Therefore, the final model is trained using predicted annotations, in this case the dependency
parsing model is trained using predicted morphology and lemmas.