Proceedings of the 28th International Conference on Computational Linguistics, pages 904–914, Barcelona, Spain (Online), December 8-13, 2020

Exploring Cross-sentence Contexts for Named Entity Recognition with BERT

Jouni Luoma, TurkuNLP group, University of Turku, Turku, Finland, jouni.a.luoma@utu.fi
Sampo Pyysalo, TurkuNLP group, University of Turku, Turku, Finland, sampo.pyysalo@utu.fi

Abstract

Named entity recognition (NER) is frequently addressed as a sequence classification task with each input consisting of one sentence of text. It is nevertheless clear that useful information for NER is often found also elsewhere in text. Recent self-attention models like BERT can both capture long-distance relationships in input and represent inputs consisting of several sentences. This creates opportunities for adding cross-sentence information in natural language processing tasks. This paper presents a systematic study exploring the use of cross-sentence information for NER using BERT models in five languages. We find that adding context as additional sentences to BERT input systematically increases NER performance. Including multiple sentences in input samples also allows us to study the predictions of the same sentences in different contexts. We propose a straightforward method, Contextual Majority Voting (CMV), to combine these different predictions and demonstrate this to further increase NER performance. Evaluation on established datasets, including the CoNLL'02 and CoNLL'03 NER benchmarks, demonstrates that our proposed approach can improve on the state-of-the-art NER results on English, Dutch, and Finnish, achieves the best reported BERT-based results on German, and is on par with other BERT-based approaches in Spanish. We release all methods implemented in this work under open licenses.
1 Introduction

Named entity recognition (NER) approaches have evolved through various methodological phases, broadly including rule/knowledge-based, unsupervised, feature engineering and supervised learning, and feature inferring approaches (Yadav and Bethard, 2018; Li et al., 2020a). The use of cross-sentence information in some form has been a normal part of many NER methods in the former categories, but its role has diminished with the current feature-inferring deep learning based approaches. Rule/knowledge-based approaches such as that of Mikheev et al. (1998) typically match strings to lexicons and similar domain knowledge sources, possibly going through the text multiple times with refinement based on entities found in earlier passes. Later, manually engineered features were used to incorporate information from the surrounding text, whole documents, data sets and also from external sources. The number of different features and classifiers grew over the years, and it was common for the features to encode cross-sentence information in some form, as for example in Krishnan and Manning (2006). Dense representations of text such as word, character, string and subword embeddings first started to appear in NER methods as additional features given to classifiers (Collobert et al., 2011). Step by step, feature engineering has been demoted to a lesser role, as the most recent deep learning approaches learn to create meaningful and context-sensitive representations of text by pre-training with vast amounts of unlabelled data. These contextual representations are often used directly as features for existing NER architectures or fine-tuned with labelled data to match a certain task.
In recent years, the development of Natural Language Processing (NLP) in general and NER in particular has been greatly influenced by deep transfer learning methods capable of creating contextual representations of text, to the extent that many of the state-of-the-art NER systems mainly differ from one another in how these contextual representations are created (Peters et al., 2018; Devlin et al., 2018; Akbik et al., 2018; Baevski et al., 2019). Using such models, sequence tagging tasks are often approached one sentence at a time, essentially discarding any information available in the broader surrounding context, and there has been little recent study of the use of cross-sentence context – sentences around the sentence of interest – to improve sequence tagging performance.

In this paper, we present a comprehensive exploration of the use of cross-sentence context for named entity recognition, focusing on the recent BERT deep transfer learning models (Devlin et al., 2018) based on self-attention and the transformer architecture (Vaswani et al., 2017). BERT uses a fixed-size window that limits the amount of text that can be input to the model at one time. The model's maximum window size, or maximum sequence length, is fixed during pre-training, with 512 wordpieces being a common choice. This window fits dozens of typical sentences of input at a time, allowing us to include extensive sentence context. Here, we first study the effect of predicting tags for individual sentences when they are moved around the window, surrounded by their original document context from the source data.

(This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.)
Second, we utilize different predictions for the same sentences to potentially further improve performance, combining predictions using majority voting, adapting an approach already used in early NER implementations (Tjong Kim Sang et al., 2000; Van Halteren et al., 2001; Florian et al., 2003). We evaluate these approaches on five languages, contrasting NER results using BERT without cross-sentence information, sentences in context, and aggregation using Contextual Majority Voting (CMV) on well-established benchmark datasets. We show that using sentences in context consistently improves NER results on all of the tested languages, and that CMV further improves the results in most cases. Comparing performance to the current state-of-the-art NER results in the five languages, we find that our approach establishes new state-of-the-art results for English, Dutch, and Finnish, achieves the best BERT-based results on German, and effectively matches the performance of a BERT-based method in Spanish.

2 Related work

The state-of-the-art in NER has recently moved from approaches using word/character representations and manually engineered features (Passos et al., 2014; Chiu and Nichols, 2016) toward approaches directly utilizing deep learning-based contextual representations (Akbik et al., 2018; Peters et al., 2018; Devlin et al., 2018; Baevski et al., 2019), adding few manually engineered features, if any. While successful in terms of NER performance, these approaches have tended to predict tags for one sentence at a time, discarding information from surrounding sentences. One recent method taking sentence context into account is that of Akbik et al. (2019), which addresses a weakness of an earlier contextual string embedding method (Akbik et al., 2018), specifically the issue of rare word representations occurring in underspecified contexts. Akbik et al.
(2019) make the intuitive assumption that such occurrences happen when a named entity is expected to be known to the reader, i.e. the name is either introduced earlier in the text or is general in-domain knowledge. Their approach is to maintain a memory of contextual representations of each unique word/string in text and to pool the contextual embeddings of a string occurring in text with the contextual embeddings of the same string earlier in the text. This pooled contextual embedding is then concatenated with the current contextual embedding to produce the final embedding used in classification. Another recent approach taking broader context into account for NER was proposed by Luo et al. (2020), where in addition to token representations, sentence- and document-level representations are also calculated and used for classification with a CRF model. A sliding window is used by Wu and Dredze (2019) so that part of the input is preserved as context when the window is moved forward in the text. Baevski et al. (2019) state that they use longer paragraphs in pre-training their model, but the paper does not mention whether such longer paragraphs are also used in fine-tuning the model or predicting tags for NER. Some other approaches, such as that of Liu et al. (2019a), include explicit global information in the form of e.g. gazetteers. Also, some approaches formulate NER as a span finding task instead of sequence labelling (Banerjee et al., 2019; Li et al., 2020b). These approaches would likely allow the use of longer sequences, but the incorporation of cross-sentence information is not explicitly proposed by the authors. In the paper introducing BERT, Devlin et al.
(2018) write in the description of their NER evaluation that "we include the maximal document context provided by the data." However, no detailed description of how this inclusion was implemented is provided, and some NER implementations using BERT have struggled to reproduce the results of the paper.[1,2] The addition of document context to NER using BERT is also discussed by Virtanen et al. (2019), who fill each input sample with the following sentences and use the first sentence in each sample for predictions, thus only introducing context appearing after the sentence of interest in the source text. Of the related work discussed above, our approach most closely resembles that of Virtanen et al. (2019), which in turn aims to directly follow Devlin et al. (2018). By contrast to the other studies discussed above, we do not introduce extra features or embeddings representing cross-sentence information or incorporate extra information in addition to that captured by the BERT model. Instead, we directly utilize the BERT architecture and rely on self-attention and voting to combine predictions for sentences in different contexts.

3 Data

The data used in this study consists of pre-trained BERT models and NER datasets for five different languages. We aimed to use monolingual BERT models, as numerous recent studies have suggested that well-constructed language-specific models outperform multilingual ones (Virtanen et al., 2019; Vries et al., 2019; Le et al., 2020). We selected the following language-specific pre-trained BERT models for our study, focusing on languages that also have established benchmark data for NER:

• BERTje base, Cased for Dutch (Vries et al., 2019)[3]
• BERT-Large, Cased (Whole Word Masking) for English[4]
• FinBERT base, Cased for Finnish (Virtanen et al., 2019)[5]
• German BERT, Cased for German[6]
• BETO, Cased for Spanish (Cañete et al., 2020)[7]

For comparison purposes we also tested multilingual BERT[8] with the Spanish language.
Of the models introduced above, all except German BERT and multilingual BERT used the Whole Word Masking variant of the Masked Language Model objective in pre-training instead of the method introduced in the original paper (Devlin et al., 2018). Whole Word Masking was introduced by the developers of BERT after the original paper was published. In this pre-training objective, all of the tokens corresponding to one word in text are masked, instead of completely random tokens, which often leaves some of the tokens in multi-token words unmasked. We aimed to apply sufficiently large, widely-used benchmark datasets for evaluating NER results, assessing our methods primarily on the CoNLL'02 and CoNLL'03 shared task named entity recognition datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), which cover four of our five target languages. For the fifth language, Finnish, we use two recently published named entity recognition corpora (Ruokolainen et al., 2019; Luoma et al., 2020)[9,10]. These two Finnish datasets are annotated in a compatible way, and for this study they are combined into a single corpus by simple concatenation, following Luoma et al. (2020).
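As an illustration of the Whole Word Masking objective described above, the following sketch groups wordpieces into whole words using BERT's "##" continuation convention and masks all pieces of a selected word together. This is a simplified illustration, not the original BERT pre-training code (which also replaces some selections with random tokens or leaves them unchanged):

```python
import random

def whole_word_mask(wordpieces, mask_prob=0.15):
    """Group wordpiece indices into whole words (pieces with a '##'
    prefix continue the previous word), then make the masking decision
    per word and apply it to all of the word's pieces."""
    words = []
    for i, piece in enumerate(wordpieces):
        if piece.startswith("##") and words:
            words[-1].append(i)  # continuation piece of previous word
        else:
            words.append([i])    # start of a new word
    masked = list(wordpieces)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"  # mask every piece of the word
    return masked
```

By contrast, the original masking objective samples individual wordpieces, so a word like "play ##ing" may end up with only one of its two pieces masked.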
[1] https://github.com/google-research/bert/issues/581
[2] https://github.com/google-research/bert/issues/569
[3] https://github.com/wietsedv/bertje
[4] https://github.com/google-research/bert
[5] https://github.com/TurkuNLP/FinBERT
[6] https://deepset.ai/german-bert
[7] https://github.com/dccuchile/beto
[8] https://github.com/google-research/bert
[9] https://github.com/mpsilfve/finer-data
[10] https://github.com/TurkuNLP/turku-ner-corpus

Table 1: Key statistics of the NER data sets

Tokens        English   German    Spanish   Dutch     Finnish
Train         203,621   206,931   264,715   202,644   342,924
Development    51,362    51,444    52,923    37,687    31,872
Test           46,435    51,943    51,533    68,875    67,425

Entities      English   German    Spanish   Dutch     Finnish
Train          23,499    11,851    18,798    13,344    27,026
Development     5,942     4,833     4,352     2,616     2,286
Test            5,648     3,673     3,559     3,941     5,129

All of the NER datasets define separate training, development and test sets, and we follow the given subdivision for each. The training sets for each language are used for fine-tuning the corresponding BERT model for NER, the development sets are used for evaluation in hyperparameter selection, and the test sets are only used in final experiments for evaluating models trained with the selected hyperparameters. As previous studies vary in whether they combine the development data with the training data for training a final model, we also report results where models are trained on a combined training and development set for the final test experiments. The datasets for the CoNLL shared task languages contain four different classes of named entities: Person (PER), Organization (ORG), Location (LOC) and Miscellaneous (MISC). The Finnish NER datasets also use the PER, ORG, and LOC types along with three others: Product (PROD), Event (EVENT), and Date (DATE).
For implementation purposes, we converted all the datasets to the same format prior to the experiments: the character encoding of each file was converted to UTF-8, and the NER labelling scheme was converted to IOB2 (Ratnaparkhi and Marcus, 1998), also for corpora that were originally in the IOB scheme (Ramshaw and Marcus, 1995). By contrast to the older IOB scheme, in the IOB2 scheme the label for the first token of a named entity is always marked with a B-prefix (e.g. B-PER), even if the previous token is not part of a named entity. The key statistics for the NER datasets are presented in Table 1. Finally, we note that all the datasets except CoNLL'02 Spanish provide information on document boundaries using special -DOCSTART- tokens at the start of each new document.

4 Methods

As the starting point for exploring cross-sentence information for NER using BERT, we use a NER pipeline implementation introduced by Virtanen et al. (2019) that closely follows the straightforward approach presented by Devlin et al. (2018). Here, the last layer of the pre-trained BERT model is followed by a single time-distributed dense layer which is fine-tuned together with the pre-trained BERT model weights to produce the softmax probabilities of NER tags for input tokens. No modelling of tag transition probabilities or any additional processing to validate tag sequences is used. In our implementation, exactly one example is constructed for each sentence of the corpus unless the sentence is so long that it does not fit within the maximum sequence length.[11] The sentence is placed at the beginning of the BERT window and the following sentences from the corpus are used to fill the window (up to the maximum sequence length), with special separator ([SEP]) tokens separating the sentences. Partial sentences are used to fill up the BERT examples. As a special case, the sentences used for filling the window for the last sentences in the input data are picked by wrapping back to the beginning of the corpus.
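The IOB-to-IOB2 label conversion described above can be sketched as follows. This is a minimal illustration of the scheme difference, not the authors' conversion script:

```python
def iob_to_iob2(tags):
    """Convert an IOB-tagged sentence to IOB2: the first token of every
    entity gets a B- prefix even when no same-type entity precedes it."""
    converted = []
    prev_type = None
    for tag in tags:
        if tag.startswith("I-"):
            cur_type = tag[2:]
            # In IOB, an I- tag starts a new entity unless the previous
            # token continued an entity of the same type.
            if prev_type != cur_type:
                tag = "B-" + cur_type
        prev_type = tag[2:] if tag != "O" else None
        converted.append(tag)
    return converted

print(iob_to_iob2(["I-PER", "I-PER", "O", "I-ORG"]))
# ['B-PER', 'I-PER', 'O', 'B-ORG']
```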
This approach creates situations where some input samples contain sentences from different original documents, if the documents were next to one another in the corpus. For this reason, we also implemented document-wise wrapping of sentences if the input data had document boundaries marked with -DOCSTART- tokens. We used this information to build input samples by filling the sentences at the end of one document with sentences from the beginning of that same document, instead of the next sentences in the original data. In this case only full sentences are added to each input sample, and padding ([PAD]) tokens are used to fill the empty space if the next sentence in the input data does not fit into the window (Figure 1b). Constructing inputs in this way implies that the same sentences from the original data occur in different positions and with varying (sizes of) left and right contexts in different samples. We wanted to examine the predictions in different contexts more closely to see if there are consistent effects on tag prediction quality depending on the starting position of a sentence inside a context.

Figure 1: Illustration of various input representations for sequence labelling tasks. a) One sentence per example (Single), b) including following sentences (First, CMV), c) including preceding and following sentences (Sentence in context). CMV combines predictions for the same sentence (e.g. S2 in b) in various positions and contexts. The empty square stands for special separator symbols (e.g. [CLS], [SEP] and [PAD] for BERT); a light background color is used to represent special symbols and incomplete sentences in c).

[11] In this special case the long sentence is split to produce multiple input sequences that are treated as sentences for the rest of the implementation.
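The document-wise input construction described above can be sketched as follows. This is a simplified sketch, assuming sentences have already been tokenized into wordpieces; it shows the document-wise variant where only full sentences are added, wrapping back to the document start, and [PAD] fills the remainder:

```python
def build_examples(sentences, max_len=512):
    """Build one input token list per sentence of a document: the
    sentence of interest first, then following sentences (wrapping
    back to the document start), separated by [SEP], padded with [PAD].
    `sentences` is a list of wordpiece lists for one document."""
    examples = []
    n = len(sentences)
    for i in range(n):
        tokens = ["[CLS]"] + sentences[i] + ["[SEP]"]
        j = (i + 1) % n  # next sentence, wrapping to document start
        # Add only full sentences that still fit in the window.
        while j != i and len(tokens) + len(sentences[j]) + 1 <= max_len:
            tokens += sentences[j] + ["[SEP]"]
            j = (j + 1) % n
        tokens += ["[PAD]"] * (max_len - len(tokens))
        examples.append(tokens)
    return examples
```

In the corpus-wide variant without document boundaries, the remaining space would instead be filled with a partial next sentence rather than padding.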
One challenge here was how to consistently measure performance with different contexts: sentences are of different lengths, and as they are added to input samples, the beginning of the window is the only place where the starting locations of sentences would align. Also, the number of sentences that fit into the window varies substantially. For this reason, it is not possible, for example, to always pick the Nth sentence to study, as there is no guarantee that one will exist in all examples. To address this issue and build input samples for testing predictions at different locations, we placed the sentence of interest to start at a specified location inside the window, and filled the window in both directions with the sentences before/after the sentence of interest in the original data. We tested starting positions of the sentence of interest from 1 (0 being the [CLS] token) up to the maximum sequence length (512 wordpieces) at intervals of 32 wordpieces. If the sentence of interest was longer than the space between a starting position and the maximum sequence length, the starting position for that particular sentence was moved backwards to fit the sentence in the window.

Ensembles of classifiers are commonly used to improve classification performance at various tasks, and it seems reasonable to assume that predictions for the same input sentences in different positions and contexts create an ensemble-like construct. This is not an ensemble in the conventional sense, as the number of predictions we get for each sentence varies. We evaluate two different variations of combining the results from multiple predictions in different contexts. The first approach is to assign labels to sentences in each location first, and then take a majority vote of the assigned labels. The second approach is to add together the softmax probabilities of predictions in different contexts, and then take the argmax of the sum.
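The two combination variants described above can be sketched as follows, assuming the per-context predictions for a sentence are available as softmax probability arrays of shape (num_tokens, num_labels). This is a minimal sketch of the combination step only, not the authors' released implementation:

```python
import numpy as np
from collections import Counter

def cmv_labels(per_context_probs):
    """Variant 1: assign a label in each context, then take a per-token
    majority vote over the assigned labels."""
    votes = [p.argmax(axis=-1) for p in per_context_probs]
    num_tokens = len(votes[0])
    return [Counter(v[t] for v in votes).most_common(1)[0][0]
            for t in range(num_tokens)]

def cmv_probs(per_context_probs):
    """Variant 2: sum the softmax distributions over contexts, then
    take the argmax of the summed distribution per token."""
    return np.sum(per_context_probs, axis=0).argmax(axis=-1).tolist()
```

Note that the number of arrays per sentence varies from sentence to sentence, which is why this is not an ensemble in the conventional sense.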
For simplicity, we here term both Contextual Majority Voting (CMV), as they are variations of the same underlying idea. The implementation uses only predictions of tokens in whole sentences, not ones in partial sentences that may appear in input examples.

For fine-tuning the pre-trained BERT models, we largely follow the process introduced in (Devlin et al., 2018). We use the maximum sequence length of 512 in all experiments to include maximal cross-sentence context, the Adam optimizer (Kingma and Ba, 2014) (β1 = 0.9, β2 = 0.999, ε = 1e-6) with warmup over 10% of samples, linear learning rate decay, a weight decay rate of 0.01, and norm clipping at 1.0. Sample weights are used for inputs so that the special tokens [CLS] and [PAD] are given zero weight and everything else weight 1 when calculating the loss (sparse categorical cross-entropy).

Figure 2: NER performance on the development set measured with CMV and at different sentence starting locations, for (a) English, (b) Dutch, (c) German, (d) Spanish (mBERT and BETO), and (e) Finnish. The lower curves show mean performance over the whole hyperparameter range, and the upper curves the results with the best hyperparameters (mean of 5 repetitions) for each location. The flat dashed lines show the best CMV results.
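The warmup and linear decay schedule described above can be sketched as a simple function of the training step. This is an illustrative sketch of the schedule shape only (the peak learning rate shown is one of the grid values, not a fixed choice):

```python
def lr_schedule(step, total_steps, peak_lr=2e-5, warmup_fraction=0.1):
    """Linear warmup over the first 10% of steps from 0 to peak_lr,
    then linear decay back to 0 by the end of training."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```

In practice this schedule would be passed to the optimizer once per update step; the weight decay and norm clipping mentioned above are configured separately on the optimizer.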
We select hyperparameters with an exhaustive search of the grid proposed by Devlin et al., modified to skip batch size 32 and add batch sizes 2 and 4 instead, as our initial experiments indicated better performance with smaller batch sizes. That is, the grid search is done over the following parameter ranges:

• Learning rate: 2e-5, 3e-5, 5e-5
• Batch size: 2, 4, 8, 16
• Epochs: 1, 2, 3, 4

We repeated each experiment 5 times with every hyperparameter combination. The best hyperparameters were selected based on the mean of exact mention-level F1 scores, as evaluated against the development set using a Python implementation of the standard conlleval evaluation script. As a reference, we use a BERT model which is fine-tuned using only single sentences from the input data. For this baseline, predictions are also made on the basis of single sentences (see Figure 1a).

5 Results

Based on initial development set results, of the variations of this method (see Section 4) we decided to focus only on CMV using examples constructed document-wise. The exception here is the Spanish CoNLL dataset, for which document boundary information was not available. Further, as the differences between the CMV variations were found not to be large, we decided to only consider the variant that first assigns labels and then votes between the labels. The effect of the starting location of the sentence of interest and the effect of the CMV method on development data is illustrated in Figure 2.
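The exact mention-level F1 used for model selection above can be sketched as follows. This is a simplified conlleval-style implementation for IOB2 input (the official script additionally reports per-type scores and handles stray I- tags slightly differently):

```python
def extract_mentions(tags):
    """Extract (start, end, type) entity spans from an IOB2 tag list."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (
                tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return set(spans)

def mention_f1(gold, pred):
    """Exact-match mention-level F1 over parallel tag sequences: a
    predicted mention counts only if span and type both match."""
    g, p = extract_mentions(gold), extract_mentions(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```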
Our initial expectation was that placing the sentence of interest near the middle of the sequence would generally yield the best performance. However, while this effect can be observed e.g. for English (Figure 2a), the pattern does not hold in all cases, although in most cases performance does improve when moving the starting position away from either end of the context window. The problem was that the performance in the middle of the context did not appear to be stable enough to pick a reliable starting position to use at prediction time. This can be seen in Figure 2, where the results for different starting locations tend to vary without a clear central optimum.

Table 2: NER results for different methods and languages (standard deviation in parentheses)

                   Precision      Recall         F1             F1 train+dev
English, CMV       93.06 (0.25)   93.78 (0.08)   93.42 (0.12)   93.57 (0.33)
English, First     93.15 (0.15)   93.73 (0.04)   93.44 (0.06)   93.74 (0.25)
English, Single    91.12 (0.25)   92.28 (0.23)   91.70 (0.24)   91.94 (0.15)
Dutch, CMV         93.12 (0.26)   93.26 (0.18)   93.19 (0.21)   93.49 (0.23)
Dutch, First       93.03 (0.65)   93.38 (0.38)   93.21 (0.51)   93.39 (0.26)
Dutch, Single      91.57 (0.35)   91.49 (0.41)   91.53 (0.37)   91.92 (0.30)
Finnish, CMV       92.91 (0.18)   94.42 (0.13)   93.66 (0.13)   93.78 (0.26)
Finnish, First     92.56 (0.14)   94.24 (0.08)   93.39 (0.10)   93.65 (0.26)
Finnish, Single    90.74 (0.10)   92.11 (0.24)   91.42 (0.16)   91.97 (0.21)
German, CMV        86.91 (0.31)   84.38 (0.32)   85.63 (0.30)   87.31 (0.27)
German, First      86.37 (0.39)   84.07 (0.10)   85.21 (0.22)   86.91 (0.11)
German, Single     85.55 (0.20)   81.81 (0.31)   83.64 (0.21)   85.67 (0.25)
Spanish, CMV       87.80 (0.25)   87.98 (0.18)   87.89 (0.21)   87.97 (0.21)
Spanish, First     86.71 (0.31)   87.41 (0.28)   87.06 (0.28)   87.27 (0.25)
Spanish, Single    87.43 (0.53)   87.90 (0.34)   87.66 (0.43)   87.52 (0.41)
S-mBERT, CMV       87.25 (0.50)   88.67 (0.46)   87.95 (0.47)   88.32 (0.26)
S-mBERT, First     86.92 (0.40)   87.88 (0.44)   87.40 (0.42)   87.54 (0.25)
S-mBERT, Single    87.19 (0.28)   87.81 (0.26)   87.50 (0.26)   87.57 (0.29)
The results for Dutch (Figure 2b) deviated the most from our expectations, and a possible reason for this was later found in the source data: the sentence order of the documents inside the original Dutch-language dataset has been randomized for copyright reasons. To test whether randomizing the sentence order of documents has an effect on results, we tested this with the other languages. However, in our initial experiments, randomizing sentences inside each document did not result in a significant performance drop on any of the tested languages.

The final test set results for models trained with the best hyperparameter combinations found using the development sets are summarized in Table 2. We report precision, recall and F1-score for models trained only on the training dataset, and additionally F1-scores for models trained with combined training and development sets using the same hyperparameters. For each language/BERT model pair, we report performance for the baseline using only a single sentence per window (Single), the approach where sentences from the following context are included but only predictions for the first sentence in each window are used (First), and, finally, performance with CMV (see also Figure 1). These results show that BERT NER predictions systematically benefit from access to cross-sentence context. For all tested languages except Spanish, models that are fine-tuned and tested with samples containing context outperform models which do not use any context and rely only on single sentences. What is not directly seen from Table 2 is that the results with the method First generally outperform the results with the method Single, and similarly the method CMV generally outperforms the method First. Both English and Dutch perform well with the method First, and for Spanish the method Single also performs well.
One thing to note is that the English and Dutch results with CMV outperform the method First with the hyperparameters that produced the best result for the method First. However, the final results for CMV were simply not as good with the hyperparameters that produced the best performance for CMV on the development data.

Table 3: NER result comparison to the state of the art

                Our F1   Our F1 (t+d)   Current BERT                   Current SOTA
English         93.44    93.74          93.47 (Liu et al., 2019b)      93.5 (Baevski et al., 2019)
Dutch           93.21    93.49          90.94 (Wu and Dredze, 2019)    92.69 (Straková et al., 2019)
Finnish         93.66    93.78          93.11 (Luoma et al., 2020)     93.11 (Luoma et al., 2020)
German          85.63    87.31          82.82 (Wu and Dredze, 2019)    88.32 (Akbik et al., 2018)
Spanish         87.89    87.97          88.43 (Cañete et al., 2020)    89.72 (Conneau et al., 2020)
Spanish, mBERT  87.95    88.32          88.43 (Cañete et al., 2020)    89.72 (Conneau et al., 2020)

To further evaluate the performance of the CMV method, we checked the results of each fine-tuned model on the development set during the hyperparameter search. There were 48 hyperparameter combinations to evaluate for each model. For English, German, Spanish and Finnish, the CMV method outperformed the method First for every hyperparameter combination when calculating the results as the mean of mention-level F1 scores from 5 repetitions. For Spanish, this includes both the experiments with the Spanish monolingual model and the experiments with the multilingual model. The only exception was Dutch, for which CMV outperformed the method First in 41 cases out of 48. The fact that sentences in the Dutch data are in randomized order may contribute to this. In total, the CMV method improved the results over the method First in 281 cases out of 288. In the same fashion, we evaluated the difference in performance between the method Single and the method First on the development set.
The method First outperformed the method Single for every hyperparameter combination for every tested language.

In Table 3 we compare the results using cross-sentence context with the current state of the art in NER for the languages studied here. We are able to establish a new state-of-the-art result for three languages, English, Dutch and Finnish, as well as improve the best BERT-based score on German. These results benefit from using the combined training and development set in final model training. The previous state of the art is also surpassed on Dutch and Finnish when only the training set is used for the final model. On Spanish our results fall slightly below the reported state of the art. It was perhaps somewhat surprising that multilingual BERT outperformed the dedicated Spanish-language BERT model, failing to replicate the results of Cañete et al. (2020), who reported that the Spanish model outperformed that of Wu and Dredze (2019), who had previously reached the best Spanish BERT performance using multilingual BERT. Despite this minor discrepancy, we find that both the simple approach of including following sentences as context and CMV are very effective, allowing a straightforward BERT NER model to achieve state-of-the-art performance with only a few modifications of the representation.

6 Discussion

The results presented here are, as far as we know, the first systematic study of how cross-sentence information can be utilized with BERT for NER, and the methods presented here form a good starting point for discussion and further research on the subject. Contextual Majority Voting is straightforward to implement in existing BERT-based systems, as the actual model and associated infrastructure are not modified. It is quite probable that similar ways of including cross-sentence information or majority voting structures may be beneficial with other attention-based models as well.
The computational overhead for the required pre- and postprocessing of the samples is very modest, but increasing the maximum sequence length in fine-tuning, e.g. from 128 to 512, to fit more sentences in one sample does come with the tradeoff of increased computational cost.

One aspect deserving more study is how prediction performance is affected if sentences are not repeated, or are repeated fewer times, in examples during prediction. Reducing or entirely avoiding repetition would allow for more efficient use of the model while still providing context for sentences, which might be a reasonable compromise between performance and computational efficiency for large-scale practical applications. A further possibility for future research would be to explore weighted majority voting. Our results lend some support to the idea that predictions made for tokens around the center of the window are generally more reliable than predictions for tokens near its edges, where context is limited on one side of the token. Giving higher weight to predictions in the middle of the sequence could potentially help further improve the performance of the aggregation approach. Another aspect for future work would be to study the effect of the context and sentence order. Our preliminary tests with randomized sentence order within the same documents showed minimal effect on performance. Is it enough to have context from the same document? Would the situation change drastically if random sentences from the whole input data were used instead? Finally, the incorporation of transition probabilities or other processing to check tag sequences for illegal transitions would likely improve performance further.

7 Conclusions

We have presented a comprehensive evaluation of the effect of including cross-sentence context for named entity recognition with BERT and introduced a simple and easy-to-implement approach for the task using majority voting.
The proposed method established new state-of-the-art results in named entity recognition for three languages and is near the state-of-the-art for two other languages, demonstrating how simple ideas may boost the performance of even very strong models. We release all methods implemented in this work under open licenses from https://github.com/jouniluoma/bert-ner-cmv.

Acknowledgements

We wish to thank the CSC – IT Center for Science, Finland, for generous computational resources. This work was funded in part by the Academy of Finland.

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649.

Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Pratyay Banerjee, Kuntal Kumar Pal, Murthy Devarakonda, and Chitta Baral. 2019. Knowledge guided named entity recognition for biomedical text.

José Cañete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez. 2020. Spanish pre-trained BERT model and evaluation data. To appear in PML4DC at ICLR 2020.

Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370, December.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, November.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online, July. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 168–171.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Vijay Krishnan and Christopher D. Manning. 2006. An effective two-stage model for exploiting non-local dependencies in named entity recognition. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 1121–1128, Sydney, Australia, July. Association for Computational Linguistics.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised language model pre-training for French. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2479–2490, Marseille, France, May. European Language Resources Association.

Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2020a.
A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020b. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5849–5859, Online, July. Association for Computational Linguistics.

Tianyu Liu, Jin-Ge Yao, and Chin-Yew Lin. 2019a. Towards improving neural named entity recognition with gazetteers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5301–5307, Florence, Italy, July. Association for Computational Linguistics.

Yijin Liu, Fandong Meng, Jinchao Zhang, Jinan Xu, Yufeng Chen, and Jie Zhou. 2019b. GCDT: A global context enhanced deep transition architecture for sequence labeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Ying Luo, Fengshun Xiao, and Hai Zhao. 2020. Hierarchical contextualized representation for named entity recognition. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8441–8448. AAAI Press.

Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, and Sampo Pyysalo. 2020. A broad-coverage corpus for Finnish named entity recognition. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4615–4624, Marseille, France, May. European Language Resources Association.

Andrei Mikheev, Claire Grover, and Marc Moens. 1998. Description of the LTG system used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 – May 1, 1998.
Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 78–86, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).

Lance Ramshaw and Mitch Marcus. 1995. Text chunking using transformation-based learning. In Third Workshop on Very Large Corpora.

Adwait Ratnaparkhi and Mitchell P. Marcus. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, USA. AAI9840230.

Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, and Krister Lindén. 2019. A Finnish news corpus for named entity recognition. Language Resources and Evaluation, pages 1–26.

Jana Straková, Milan Straka, and Jan Hajič. 2019. Neural architectures for nested NER through linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5326–5331, Florence, Italy, July. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.

Erik F. Tjong Kim Sang, Walter Daelemans, Hervé Déjean, Rob Koeling, Yuval Krymolowski, Vasin Punyakanok, and Dan Roth. 2000. Applying system combination to base noun phrase identification. In COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task. In Proceedings of the 6th Conference on Natural Language Learning - COLING-02.
Hans Van Halteren, Jakub Zavrel, and Walter Daelemans. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics, 27(2):199–229.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish.

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. arXiv:1912.09582 [cs], December.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Vikas Yadav and Steven Bethard. 2018. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2145–2158, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.