Using transformers for page count prediction based on bibliographic metadata University of Turku Department of Computing Master of Science (Tech) Thesis Data Analytics April 2025 Teo Kekäläinen Supervisors: Prof. Leo Lahti Ph.D. Ville Laitinen The originality of this thesis has been checked in accordance with the University of Turku quality assurance system using the Turnitin OriginalityCheck service. UNIVERSITY OF TURKU Department of Computing Teo Kekäläinen: Using transformers for page count prediction based on biblio- graphic metadata Master of Science (Tech) Thesis, 52 p. Data Analytics April 2025 Bibliographic data science is a field of digital humanities which aims to enable the usage of bibliographic data for quantitative research. Due to inconsistencies, bib- liographic data needs to be harmonized before it can be used for research. A part of the harmonization process is estimating the page count of documents based on short text descriptions which contain numbers, words and abbreviations. This the- sis proposes a new approach for page count estimation which takes advantage of natural language processing to convert the text descriptions into vector form by using a pre-trained encoder-only transformer model. The vectors are then used to predict the page count by using an artificial neural network which is attached to the encoder-only model. Two experiments were done in this thesis by training two machine learning models. First, a model was fine-tuned for page count prediction using a harmonized subset of the Finnish national bibliography, Fennica, which contained both the page count descriptions and numerical page counts. The second experiment was to use another harmonized subset of Fennica, to fine-tune the encoder-only part of the model using the masked language modeling task, which was done by using only the page count descriptions from the second dataset. After masked language modeling fine-tuning, the whole model consisting of the encoder-only model and the attached artificial neural network, was fine-tuned for page count prediction using the first dataset. Both models were able to predict the page count of documents but had worse accu- racy when predicting high page counts. The model that was first fine-tuned using masked language modeling performed better than the model that was only fine-tuned for page count prediction. The experiments show that encoder-only models are able to predict the page count of documents, and that masked language modeling can be used to improve page count prediction performance. Keywords: bibliographic data science, NLP, machine learning, regression TURUN YLIOPISTO Tietotekniikan laitos Teo Kekäläinen: Using transformers for page count prediction based on biblio- graphic metadata Diplomityö, 52 s. Data-Analytiikka Huhtikuu 2025 Bibliografinen datatiede on digitaalisten ihmistieteiden ala, jonka tavoitteena on mahdollistaa bibliografisen datan käyttö kvantitatiiviseen tutkimukseen. Bibliogra- fisen datan hyödyntäminen tutkimuksessa vaatii datan harmonisointia datan sisältä- mien epäjohdonmukaisuuksien vuoksi. Dokumenttien sivumäärän arviointi lyhyiden tekstikuvausten pohjalta on osa harmonisointiprosessia. Tämä opinnäytetyö ehdot- taa luonnollisen kielen käsittelyä hyödyntävää lähestymistapaa, jossa transformer- arkkitehtuuriin perustuvaa esikoulutettua encoder-mallia käytetään sivumääräku- vausten muuntamiseen vektorimuotoon. 
Tämän jälkeen dokumenttien sivumäärää ennustetaan vektoreiden pohjalta hyödyntämällä encoder-malliin liitettyä keinote- koista neuroverkkoa. Työssä tehtiin kaksi koetta kouluttamalla kaksi koneoppimismallia. Ensimmäinen malli hienosäädettiin sivumäärän ennustamiseen käyttämällä Suomen kansallisesta bibliografiasta, Fennicasta, johdettua tietoaineistoa. Tietoaineisto sisälsi sekä teks- timuodossa olevat sivumääräkuvaukset että aiemmin arvioidut sivumäärät, joten ainestoa pystyttiin käyttämään ohjattuun oppimiseen. Toisessa kokeessa käytetiin toista Fennicasta otettua tietoaineistoa hienosäätämällä encoder-mallia sivumäärä- kuvausten pohjalta käyttämällä masked language modeling -tehtävää. Masked lan- guage modeling -hienosäädön jälkeen toinen malli hienosäädettiin sivumäärän en- nustamiseen käyttämällä ensimmäistä tietoainestoa. Molemmat mallit pystyivät ennustamaan dokumenttien sivumäärää, mutta niiden ennustuskyky heikkeni suurilla sivumäärillä. Masked language modeling -tehtävään hienosäädetty malli oli parempi sivumäärän ennustamisessa, kuin pelkkään sivu- määrän ennustamiseen hienosäädetty malli. Kokeet osoittavat, että encoder-malleilla pystytään ennustamaan dokumenttien si- vumäärää ja, että masked language modeling -hienosäätö pystyy parantamaan si- vumäärän ennustustarkkuutta. Asiasanat: bibliografinen data tiede, NLP, koneoppiminen, regressio Contents 1 Introduction 1 1.1 Research objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Machine learning and natural language processing 7 2.1 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.1 Deep neural networks . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.2 ANN training . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3 Natural language processing . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Transformer architecture . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.6 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.6.1 Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.6.2 Downstream tasks . . . . . . . . . . . . . . . . . . . . . . . . 24 2.6.3 BERT variants . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3 Data inspection and processing 27 3.1 Regression dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.1 Data processing . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2 MLM dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 i 4 Training the pure regression model 33 4.1 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2 Model architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.4 Training setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5 Training the MLM fine-tuned model 38 5.1 MLM fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2 Regression fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6 Conclusion 43 6.1 Model comparison . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 43 6.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 6.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.4 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 References 53 ii List of Figures 2.1 ANN with one hidden layer, n inputs and two outputs . . . . . . . . 11 2.2 Artificial neuron with n inputs and n weights . . . . . . . . . . . . . 12 2.3 A deep neural network with n hidden layers and two outputs . . . . . 13 2.4 Convolution operation done on a 2D input . . . . . . . . . . . . . . . 14 2.5 Simplified illustration of the transformer architecture inspired by Fig- ure 1 in Vaswani et al. [9] The model consists of an encoder and a decoder stack which have N encoder and decoder blocks. The encoder stack produces a sequence of contextual representations of the input, which the decoder uses alongside the previously generated output to generate more tokens. For simplicity, the residual connections and layer normalization included in the encoder and decoder blocks are not shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.6 Simple example of masked language modeling . . . . . . . . . . . . . 24 3.1 Count of documents by publication decade in the regression dataset . 29 3.2 Count of documents by publication decade in the MLM dataset . . . 31 4.1 Architecture of the whole regression model showing the regression head attached to the DistilBERT model. The first number in brackets shows the amount of inputs and the second the amount of outputs for the fully-connected layers. . . . . . . . . . . . . . . . . . . . . . . 35 iii 4.2 An example of tokenizing an input, then converting the tokens to input ids. Note that the input ids list has two more elements than the tokenized input due to the [CLS] and [SEP] tokens being added. . 36 4.3 The predictions of the regression model vs. the page count values on the validation set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.1 The predictions of the MLM fine-tuned model vs. the page count values on the validation set. It can be seen that there were some large outliers in AE for documents that had a page count that was near one. In particular, the largest AE of 539 is shown in the middle- left side of the figure. . . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.1 Predictions and page counts for both the pure regression (graph A) and MLM fine-tuned (graph B) models on the test set. The MLM fine-tuned model had both a lower MAE and MSE than the pure regression model. The MLM fine-tuned model was more accurate on high page count documents in particular. . . . . . . . . . . . . . . . . 45 6.2 Distribution of the predictions’ AE for the test set. The pure regres- sion model is shown in blue and MLM fine-tuned model is shown in orange. The 29 predictions for the pure-regression model and the 23 predictions for the MLM fine-tuned model that had an AE higher than 50 were excluded from the graph. The MLM fine-tuned model had more predictions where AE  10. . . . . . . . . . . . . . . . . . . 46 iv List of Tables 3.1 Some example values of the pagecount and pagecount_orig fields . . 27 3.2 Comparisons of the original "pagecount_orig" and cleaned "page- count_orig" fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
30 6.1 Results for both models on the test and validation regression sets. The best metrics achieved on both validation and test set are bolded. The MLM fine-tuned model performed better on the test set, even though its performance on the validation set was worse. . . . . . . . . 44 v 1 Introduction Bibliographic data are structured information that describes documents [1]. The in- formation can describe the form, context or content of the documents, e.g. the type, language or title of a document. The documents can be in any form or medium. In this thesis, the focus will be on bibliographic data that describes printed documents such as books, maps or continuous publications. Bibliographic data science [2] is an emerging field of digital humanities. It aims to enable the use of bibliographic data for quantitative analysis. It is focused on computational methods of harmonizing bibliographic data, and of integrating bib- liographic data from multiple sources. The harmonization is done by, for example, removing spelling errors and standardizing terms. It is also possible to derive new features from the existing features such as the estimated amount of paper consumed by printing each document [3] which can be derived from a document’s page count or physical dimensions. From a methodological perspective, it is also important that the methods used are scalable because collections of bibliographic data, bibliogra- phies, can contain millions of entries. Generally, most research on historical literature has focused on studying full texts, but bibliographic data can also be used as a research object. One advantage of bibliographic data is that it is often more structured, standardized and smaller in terms of size than full texts. It can also be used in combination with full text collections to provide additional context. [2] Bibliographic data has been used to CHAPTER 1. INTRODUCTION 2 study e.g. the development of different book printing formats [2], the relation be- tween book prices and demand in eighteenth-century Britain [4], and the history of book-printing in Finland and Sweden [3]. Using bibliographic data for research isn’t simple, however. It requires both understanding the context of the data and dealing with issues in data quality and consistency. When bibliographic data were originally catalogued, the goal was to maintain as much information as possible about the original document, which is why inconsistencies such as misspellings were not fixed. Bibliographic data is of- ten entered manually into databases which can lead to further inconsistencies [5]. Bibliographies also generally consist of data that has been catalogued over a long period, which can lead to differences in the standards used for the entries. Finally, the reasons why the data were originally collected can vary, which might lead to biases in the data. All of these aforementioned things need to be considered when using bibliographic data as a basis for quantitative research. [2] The Finnish national bibliography, Fennica [6], is a bibliographic database con- taining information on documents published in Finland. The earliest documents described by the data were published in 1488. As of the time of writing this thesis, the database has over 1.2 million entries. The first Finnish national bibliography, which described literature published between the years 1544-1877, was made in the 1870s. Later the bibliography was extended to include entries up to the 1900s. 
Converting the national bibliography to electronic form began in 1978. Nowadays, almost all the printed bibliographies and index cards have been transferred to the Fennica dataset. The two datasets used in this thesis are harmonized subsets [7] of the Fennica dataset provided by the Turku Data Science Group. The Fennica dataset uses MARC21 formats, which are maintained by the United States Library of Congress. MARC21 formats are standards for the representa- tion and communication of bibliographic data and related information in machine- 1.1 RESEARCH OBJECTIVES 3 readable form [8]. The formats exist for five types of data: bibliographic, holdings, authority, classification, and community information. The MARC21 format for bib- liographic data specifies the format for describing, retrieving, and controlling bibli- ographic materials. There are different bibliographic data specifications for books, serials, computer files, maps, music, visual materials and mixed materials. A MARC record has three sections: the leader, the directory and the variable fields. The leader defines the parameters for processing the record. The directory describes the tag, starting location and length of each field in the record. Finally, the data content of the record is in the variable fields which can be split into variable control fields and variable data fields. Each field is identified by a three-character tag. Fields can be grouped based on the first character of the tag which represents the function of the data contained within the fields, for example, the fields starting with 3 contain the physical description of documents in the bibliographic MARC specification. The remaining two characters of the tag describe the type of infor- mation stored in the field. The information in the variable data fields is stored in coded subfields. For example, the main variable field this thesis is concerned with is the field 300a, meaning the subfield a of field 300. [8] 1.1 Research objectives The MARC21 300a field contains a short text description of the length of a docu- ment. The text description can be just a single number like "54", but it can also contain words or page ranges e.g. "5 pages, 5-45, 3 images". The entries can also be written in different languages. In the Fennica dataset, most entries are written in Finnish, but there are also entries written in other languages such as Swedish and English. The values also often use abbreviations such as "p." or "s." to describe the page count in English and Finnish respectively. Finally, the entries can contain Roman numerals which can appear alongside Arabic numerals in a single entry. 1.1 RESEARCH OBJECTIVES 4 To be able to use the page count for quantitative analysis, the descriptions from the 300a field need to be mapped to numbers, which are estimates of the page count of the document. This has been done before [3] by parsing different values from the entries, then using them to calculate a page count estimate. The problem with this approach, however, is that it requires a lot of manual effort because rules such as regular expressions need to be defined to parse the values from the 300a field. The different languages, abbreviations and other inconsistencies make this a difficult task because the values need to either be harmonized or the rules for parsing them need to be robust enough to take every form of a term into account. 
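To illustrate the kind of manually defined rules this approach relies on, the following sketch applies a single, deliberately simplified regular-expression rule to a few 300a-style strings. The pattern, the helper name and the assumption that the largest matched number approximates the page count are illustrative only; they are not the harmonization rules used in [3].

import re
from typing import Optional

# Illustrative only: one hand-written rule for Arabic page numbers followed by
# the abbreviations "s" (Finnish) or "p" (English).
PAGE_PATTERN = re.compile(r"(\d+)\s*(?:s|p)\b")

def estimate_page_count(entry: str) -> Optional[int]:
    # Crude assumption: the largest matched number approximates the page count.
    matches = [int(number) for number in PAGE_PATTERN.findall(entry.lower())]
    return max(matches) if matches else None

print(estimate_page_count("[8], 56 s"))                # 56
print(estimate_page_count("54"))                       # None: a bare number needs another rule
print(estimate_page_count("5 pages, 5-45, 3 images"))  # None: "pages" and ranges need further rules

Even this small sketch already misses bare numbers, Roman numerals and page ranges, which illustrates why the rule set has to keep growing as new forms of entries appear.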
Also, when new bibliographic data from another source is added, it can often require writing new rules since the new data might have different standards or entries written in new languages. Instead of manually parsing a set of features from text, it is possible to use natural language processing (NLP) techniques, which are covered in section 2.3, to produce a numerical, vector representation of the text. One way of doing this is using encoder-only transformer [9] models, which are covered in sections 2.5 and 2.6. These models could allow for skipping the manual parsing process by producing a vector representation of each MARC21 300a entry, which could then be used as an input to an artificial neural network (section 2.2) to predict the page count of each document. This approach would take advantage of machine learning (ML) to learn the relations between the input and output automatically instead of using a manually defined function. There are data that includes both the MARC21 300a entries and the previous page count estimates, which can be used for training the artificial neural network using a supervised learning approach. There are also data which have the MARC 300a field, but where the page count estimate is incorrect due to a problem with the harmonization function. Using these data to fine-tune the encoder-only model using the masked language modeling 1.2 THESIS STRUCTURE 5 (MLM) task, covered in section 2.6.1, could help the model produce more accu- rate representations of the MARC21 300a entries thus improving the page count prediction performance. The research questions of this thesis can be formulated as follows: RQ1 Can a transformer encoder model be used to predict the page count of a document based on the value of the MARC21 300a field? RQ2 Can page count prediction performance be improved by fine-tuning the model for the masked language modeling task using unannotated data? This approach would be more scalable than the previous approach, since it could easily be applied on large datasets without as much manual effort such as writing regular expressions or standardizing the data. Moreover, there are multilingual models, which can be used to generate the vector representations of the entries. Using multilingual models could help with processing data that is written in multiple languages. 1.2 Thesis structure Chapter 2 is an introduction to the basics of machine learning and natural language processing. The goal of the chapter is to provide enough background information so that the process of training a transformer model can be understood. This includes explanations of the basic NLP data pre-processing steps needed, and an explanation of artificial neural networks. The chapter also explains the attention mechanism, which is a key component of the transformer models, before explaining the original transformer architecture and the BERT models used in this thesis. Chapter 3 is an introduction to the datasets used in this thesis. The chapter describes the size and features of the dataset, and the processing done to the data. 1.2 THESIS STRUCTURE 6 The data processing includes filtering the data and cleaning the MARC21 300a entries. Chapter 4 is dedicated to the training of the first model without MLM fine- tuning. This model will be referred to as the pure regression model since it does not take advantage of MLM fine-tuning. The model is trained for the regression task with the page count being the target variable while using the MARC21 300a field to make predictions. 
The chapter also explains the reasoning behind model selection and describes the model's architecture. The chapter ends by visualizing and describing the results on the validation set of the regression dataset.

Chapter 5 describes the process of fine-tuning the second model, which will be referred to as the MLM fine-tuned model, for both the MLM task and the regression task. The chapter also explains the reasoning behind MLM fine-tuning and contains the results obtained on the validation set of the regression dataset.

Chapter 6 concludes the thesis. The chapter starts by comparing and analyzing the performance of the two models on the test set of the regression dataset. Then, the chapter answers the research questions, addresses the limitations of the results, and proposes directions for further research. Finally, the chapter ends by summarizing the thesis.

2 Machine learning and natural language processing

This chapter introduces the relevant concepts in machine learning (ML) and natural language processing (NLP) that are needed to understand the process of training and using a transformer model for regression. The chapter also explains artificial neural networks and their training process, the attention mechanism which is used by the transformer models, the original full encoder-decoder transformer architecture, and the encoder-only BERT model, a variant of which is used in this thesis.

2.1 Machine learning

Machine learning (ML) is a branch of artificial intelligence (AI). An often-used definition for machine learning is the one by Mitchell [10]: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." The advantage of ML is that it allows for solving tasks that are too complicated to be solved manually, i.e. tasks where it would be impractical or even impossible to manually define rules for all possible cases, such as the page count prediction done in this thesis. [11, p. 99]

ML algorithms, which are also called models, learn by processing examples that consist of a set of features. For example, in the datasets used in this thesis, each example is a set of bibliographic metadata which describes a single document. This set includes features such as the title of the document, the year the document was published and the primary language of the document. Each example can be represented as a vector of set length where each entry is a feature. These vectors can then be collected into a so-called design matrix so that each row of the matrix is an example, and each column is a feature. [11, p. 98-99, 106]

Supervised learning is a branch of ML, where the goal is to predict the value of some target variable based on the available features. In supervised learning, the desired values of the target are available and used for training the model, i.e. there is a ground truth that the predictions of the algorithm can be compared to. [11, p. 105] Mathematically, a supervised learning algorithm can be represented as a function $f(X_i) = \hat{Y}_i$, with $\hat{Y}_i \approx Y_i$, that takes the vector of features $X_i$ as input and produces an estimate $\hat{Y}_i$ of the real target value $Y_i$ for each example $i$ in the dataset. Supervised learning requires annotated data, which means that the data needs to include the value of the target variable, i.e.
the data is a set of input-output pairs where the input is the set of features passed to the algorithm and the goal of the algorithm is to produce the correct output. Annotation is typically done manually, which can be a time-consuming process; therefore there are fewer annotated than unannotated data available [12].

The two most common supervised learning tasks are classification and regression. In classification, the goal is to predict the value of some discrete variable, i.e. assign each example a label that belongs to a finite set of labels [11, p. 100]. For example, learning to detect spam email is a binary classification task, meaning that each email is assigned one of two labels: "spam" or "not spam". It is also possible for an example to be given multiple labels; this type of classification is called multi-label classification. A common way to evaluate classification performance is measuring the accuracy of the model, which is simply the proportion of examples where the model predicted the correct label. Alternatively, the error rate of the model, which is the proportion of predictions that are incorrect, can also be used. [11, p. 103-104]

Regression means predicting a numerical value based on the input features [11, p. 101]. The page count prediction done in this thesis is an example of a regression task. Common performance metrics, which will also be used in this thesis for measuring regression performance, are mean squared error (MSE, equation 2.1) and mean absolute error (MAE, equation 2.2). They are defined as

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2   (2.1)

MAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|   (2.2)

where $Y_i$ is the actual value of the target variable and $\hat{Y}_i$ is the value predicted by the ML model for each example $i$. For these metrics, a good model would get a value that is as close to zero as possible, since the metrics measure the error in the predictions.

Unlike supervised learning, unsupervised learning does not require annotated data. Instead of comparing the model's output to a ground truth, unsupervised learning methods aim to learn useful properties about the structure of the dataset without having the ground truth available. Clustering is an example of an unsupervised learning task, where the goal is to form some predefined number of clusters out of similar examples. [11, p. 146-150]

Self-supervised learning is a type of unsupervised learning where labels are generated automatically from unannotated data. This can mean, for example, predicting some part of the input based on other parts of the input, or corrupting a part of the input and then predicting the value of the corrupted part. The masked language modeling (MLM) [13] task used in this thesis is a self-supervised learning task, since it consists of predicting the value of a masked word based on the other words in a text sequence. [12]

Generally, in ML, the goal is not to obtain the best performance on the data used to train the model, but to obtain the best performance on new data that the model has not seen before. A model's ability to perform well on unseen data is known as generalization. Generalization can be measured by splitting the dataset into a training set and a test set. The training set is then used to train the model and the test set is used to estimate the model's generalization after training. It is very important that the model does not see any of the test set examples during training, because then the performance on the test set does not reflect generalization anymore. [11, p. 110]
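The following short sketch makes the two metrics and the holdout idea concrete. The numbers are made up for illustration and the split is a plain random shuffle; it is not the split used for the thesis datasets.

import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy targets and predictions standing in for page counts.
y_true = [12, 48, 300]
y_pred = [10, 50, 240]
print(mse(y_true, y_pred))  # 1202.67: large errors dominate because they are squared
print(mae(y_true, y_pred))  # 21.33: the average absolute error in pages

# A simple 80/20 train-test split of example indices.
indices = list(range(1000))
random.seed(0)
random.shuffle(indices)
train_indices, test_indices = indices[:800], indices[800:]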
The capacity of an ML model, meaning its ability to fit different functions, needs to be adjusted based on the data and task. If a model's capacity is too high, it will perform far better on the training set than the test set, which is known as overfitting. If a model's capacity is not high enough, however, it will underfit, meaning that the model is not able to perform well enough on the training set [11, p. 110-114]. If a model's capacity is optimal, it will obtain good performance on both the training and test set.

One way to improve a model's generalization is to use regularization. Regularization means any modifications made to a model that aim to reduce the generalization error without reducing the training error. [11, p. 120]

Hyperparameters are settings that control an ML model's behaviour. The hyperparameters are not changed by the ML algorithm itself during training; instead they are set before training. Many hyperparameters, such as the number of layers in an artificial neural network, affect the model's capacity. These hyperparameters cannot be learned by the ML algorithm during training, since that would always lead to selecting values that result in the model having maximum capacity, and the model overfitting the training data. Hyperparameters are often optimized by splitting another subset from the training data, which is called the validation set. Hyperparameter optimization can then be done by training a model with different hyperparameters on the training set, and then evaluating the model's performance on the validation set to select the hyperparameters that obtained the best performance. Once optimal hyperparameters have been selected, the generalization performance can be evaluated on the test set. [11, p. 120-121]

2.2 Artificial neural networks

An artificial neural network (ANN) is an ML model made of interconnected layers of processing units which are called artificial neurons. The connections between the layers are weighted and the model learns by adjusting the values of these weights during training [14, p. 5]. The simplest ANN consists of an input layer and an output layer.

Figure 2.1: ANN with one hidden layer, n inputs and two outputs

ANNs also often have one or more hidden layers between the input and output layers. The number of hidden layers and neurons per layer are hyperparameters that are adjusted based on the nature and complexity of the target task. Feedforward neural networks are ANNs where the data passes only in one direction: from the first layer to the last layer. [14, p. 21-23] Figure 2.1 shows a fully-connected feedforward neural network with a single hidden layer and two outputs. Fully-connected means that each artificial neuron of a layer is connected to every neuron of the subsequent layer.

Figure 2.2: Artificial neuron with n inputs and n weights

The original artificial neuron is the McCulloch and Pitts neuron, which was introduced in 1943. It is a mathematical function that takes in some number of inputs $x_1, \ldots, x_n$ and produces a single output $y$. Each input has a weight $w_i$ associated with it. The inputs are multiplied by the weights and summed. An activation threshold or bias $\theta$ is then subtracted from the sum of the weighted inputs. Generally, the activation threshold can be seen as the limit which the weighted sum of inputs needs to reach for the neuron to be activated and produce an output.
The weighted sum of the inputs minus the activation threshold is known as the activation potential $u$ [14, p. 11-13]:

u = \sum_{i=1}^{n} w_i x_i - \theta   (2.3)

The activation potential is passed on to the activation function $g$, which generally limits the value to be in some range, and the result $y$ is the output of the neuron [14, p. 12]. Figure 2.2 shows an artificial neuron with n inputs and n weights.

The input layer of an ANN is responsible for receiving the input data. The data is also often normalized or standardized to help with mathematical precision. Most of the processing in an ANN occurs in the hidden layers, where patterns associated with the data and the target task are extracted. Finally, the output layer is responsible for producing the final output of the network. The size of the output layer is adjusted based on the task. For example, when using an ANN for regression, the output layer only has a single neuron and the output of the neuron is the model's prediction. For classification tasks, the number of neurons in the output layer is equal to the number of target labels. [14, p. 21-23]

Figure 2.3: A deep neural network with n hidden layers and two outputs

2.2.1 Deep neural networks

A deep neural network (DNN) is an ANN with multiple hidden layers. Representation learning methods, such as deep learning (DL) using DNNs, are used to automatically learn representations from raw data (see figure 2.3 for a schematic representation). DL builds multiple levels of representations from the raw data, with each level having a slightly higher abstraction level. The main advantage of deep learning is that it reduces the amount of manual feature engineering [15].

A convolutional neural network (CNN) is a type of DNN that has at least one layer which uses the convolution operation [11, p. 330]. The convolution operation works by moving a convolution kernel, which consists of weights that are learned during training, over the input. Figure 2.4 shows how the convolution operation is applied on a 3x3 input to produce a 2x2 output. During each step of the operation, the element-wise product of a part of the input and the weights of the kernel is computed, then the values of the element-wise product are summed to get the output for each step. CNNs are generally used on input data that is in n-dimensional form. For example, images can be represented as a 3D array that consists of the RGB values of each pixel in the 2D image. The details of CNNs are outside of the scope of this thesis; for more information on CNNs, see chapter 9 of the Deep Learning textbook by Goodfellow et al. [11].

Figure 2.4: Convolution operation done on a 2D input

Recurrent neural networks (RNN) are another type of DNN. RNNs are generally used for processing sequential data such as text, which can be seen as a sequence of words. RNNs work by processing the input one element at a time while also maintaining a hidden state that contains information about the elements that were processed earlier. The hidden state is updated at each step and given as an additional input to the next step of the model. The final hidden state of an RNN can be used as a representation of the entire input sequence, although RNNs can have difficulties with longer inputs, since the impact of earlier elements decreases as the distance between the first elements and the current element increases. [15]
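As a brief illustration of the recurrent update described above, the sketch below processes a short sequence one element at a time and carries a hidden state forward. The dimensions, the random weights and the tanh nonlinearity are arbitrary choices for the example, not a description of any particular RNN architecture.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_x = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(5, input_dim))  # five input elements, e.g. word vectors
h = np.zeros(hidden_dim)                    # initial hidden state
for x_t in sequence:
    # The new hidden state combines the current element with information
    # carried over from the elements processed earlier.
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h)  # the final hidden state can be used to represent the whole sequence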
More advanced RNN architectures such as the long short-term memory (LSTM) architecture [16] are better at processing longer inputs. For this thesis, the details of RNN architectures are not relevant. For more information on RNNs, see chapter 10 of the Deep Learning textbook by Goodfellow et al. [11].

2.2.2 ANN training

Although types of ANNs, such as the self-organizing maps [17], can be used for unsupervised learning, this thesis focuses on using ANNs for supervised learning. ANNs are trained for supervised learning by adjusting the network's parameters, i.e. the weights and biases, to optimize the value of some objective function on the training set. [15] The objective function is either minimized or maximized, depending on the function. For example, when training an ANN for regression, the model's parameters are adjusted so that the value of the MSE loss is minimized.

The parameter adjustment works by calculating the gradient of the cost function with respect to the model parameters. If the goal is to minimize the objective function, e.g. when using MSE, then the cost function is the same as the objective function. However, if the goal is to maximize the objective function, such as classification accuracy, then the cost function is the negative of the objective function. The gradient of the cost function is calculated by using a method called back-propagation [18], which works by calculating the gradient starting from the outputs of the model and then working backwards through the layers. [15]

After calculating the gradient, model parameters are adjusted in the opposite direction of the gradient, since that direction is where the cost function's value decreases the most. The size of the adjustment is determined by a hyperparameter called learning rate which is a positive real number. The learning rate can be static or adjusted during training depending on the optimization method. [11, p. 85-86]

The simple optimization method of adjusting the parameters based on the gradient of the cost function is called gradient descent. Usually, instead of calculating the gradient of all outputs at once, the training is done in batches which is known as stochastic gradient descent. [15] More advanced optimization methods, such as the Adam optimizer [19], which involve additional hyperparameters beyond just the learning rate, are also used for training ANNs.

Two common methods of regularization for ANNs, that are also used in this thesis, are dropout [20] and weight decay. Dropout works by temporarily removing some artificial neurons and all of their connections from the network during training. The probability for neurons to be removed is a hyperparameter. Weight decay means that the model will prefer weights with smaller values. The amount of preference for smaller weights is determined by another hyperparameter. [11, p. 119]
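A minimal PyTorch sketch of this training procedure is shown below: mini-batches, an MSE cost function, gradients computed by back-propagation, an Adam optimizer with weight decay, and a dropout layer. The toy network, the random data and all hyperparameter values are placeholders chosen for the example; they are not the model or settings used later in this thesis.

import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(64, 10)   # 64 toy examples with 10 features
y = torch.randn(64, 1)    # toy numerical targets

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Dropout(p=0.1),    # dropout regularization
    nn.Linear(32, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

for epoch in range(5):
    for start in range(0, len(X), 16):              # mini-batches (stochastic gradient descent)
        xb, yb = X[start:start + 16], y[start:start + 16]
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)               # objective function to be minimized
        loss.backward()                             # gradients via back-propagation
        optimizer.step()                            # adjust parameters against the gradient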
2.3 Natural language processing

Hirschberg and Manning [21] define natural language processing (NLP) as "the subfield of computer science concerned with using computational techniques to learn, understand, and produce human language content". Common NLP tasks include e.g. machine translation, part-of-speech (POS) tagging and text classification. Machine translation is an example of a sequence-to-sequence task where both the input and output of a model is a sequence of text. POS tagging means labeling each word in a sequence based on the type of word, such as verb, adjective or noun. POS tagging is an example of a sequence labeling task where each part of a sequence is given its own label. Finally, text classification refers to classifying a text sequence, e.g. a sentence. Text classification is an example of a sequence level task whereas POS tagging is an example of a token level task. The page count prediction task done in this thesis is an example of text regression, which is a sequence level task.

Tokenization is a necessary data pre-processing step in NLP. It is the process of splitting a sequence of text into a sequence of typographic units that are called tokens. Tokens are often just words, but words can also be split into multiple tokens, which is called subword tokenization. Tokenization can be done in many ways, starting with simply splitting text based on whitespace. A more advanced approach is to use a dictionary to match character sequences as tokens. Dictionary-based tokenization methods can struggle with out-of-dictionary tokens, since the size of the dictionary is limited. [22, p. 185]

One of the key challenges in NLP is finding a way to represent text in a numerical format so that it can be used as the input for an ML model. The simplest way to represent a token is to represent it as a one-hot vector, meaning a vector where a single value is one and every other value is zero. These vectors are N-dimensional, where N is the size of the dictionary, and the non-zero element is the element i, where i is the index of the token in the dictionary. The problem with this approach, however, is that the resulting vectors are very sparse, which makes computation inefficient. Also, the representations cannot be used to measure the similarity of tokens and they do not take context into account. [23]

Word embeddings, which are dense vectors of real numbers, are a more sophisticated method to represent the meaning of words or tokens. The advantage of word embeddings is that they are more efficient computationally due to their density, and that they can be used to compare the semantic similarity of words. In theory, words that have a similar meaning should have embeddings that are close to each other. [23] Word embeddings can be categorized into static and contextual embeddings. Static embedding methods, such as CBOW [24] and GloVe [25], assign each word a single representation that is not changed when training a model for a downstream task such as text classification. More recent methods, such as BERT [13] and ELMo [26], however, can produce contextual embeddings, which are adjusted during training to better reflect the context of the training data. [23]
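The contrast between sparse one-hot vectors and dense embeddings can be illustrated with a few lines of code. The tiny vocabulary, the whitespace tokenization and the randomly initialized embedding table below are all made up for the example; in practice the embedding values would be learned during training.

import numpy as np

vocabulary = ["15", "s", "kuvitettu", "sivua", "[UNK]"]   # tiny made-up dictionary
index = {token: i for i, token in enumerate(vocabulary)}

def one_hot(token):
    vec = np.zeros(len(vocabulary))                       # N-dimensional, N = dictionary size
    vec[index.get(token, index["[UNK]"])] = 1.0           # unknown tokens map to [UNK]
    return vec

tokens = "15 sivua".split()                               # naive whitespace tokenization
print([one_hot(t) for t in tokens])                       # sparse vectors with no notion of similarity

embedding_table = np.random.default_rng(0).normal(size=(len(vocabulary), 3))
print(embedding_table[index["sivua"]])                    # a dense 3-dimensional embedding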
2.4 Attention

Attention is a mechanism which allows ML models to learn what parts of the input are relevant during training. It is based on assigning weights to different parts of the input that show how relevant the parts of the input are for the target task. Originally, a mechanism similar to attention was proposed for computer vision by Larochelle and Hinton in 2010 [27], and the term attention was popularized by Mnih et al. in 2015 [28]. However, in this thesis the focus is on using attention for NLP tasks. In NLP, attention was originally used for machine translation by Bahdanau et al. in 2015 [29], but since then it has been used for various other tasks [30]. Attention is the key mechanism behind the transformer models used in this thesis [9].

Attention works by mapping a sequence of key vectors K to a distribution of weights. The keys represent the input sequence in some way, such as word embeddings. It is also common to use a query element q that defines how input elements should be emphasized. A vector e of energy scores is calculated from the keys and query by using a compatibility function f, so that each energy score $e_i$ in e represents the relevance of a key $k_i$ in K:

e = f(q, K)

The energy scores are transformed to a distribution of attention weights a by passing them to a distribution function g, a = g(e). The weights produced by the attention mechanism reflect the relevance of each element in the input to the target task, with respect to the query and keys. Many models also use an additional input sequence for computing attention. This sequence is known as the values V, which can be seen as another representation of the input data represented by the keys, with each element of V corresponding to only one element of K. The values can be combined with the attention weights a to produce a set of weighted representations of the values Z. These weighted representations can then be merged to produce a compact representation of the input known as the context vector c. [30]

The compactness of the context vector makes it easier to represent longer sequential inputs, which was one of the major downsides of using RNN hidden states to represent inputs, because the context vector will focus on the relevant elements of the input instead of assigning the same importance to all the elements in an input sequence. Since the context vector condenses information, it also takes fewer computational resources to process it compared to the original representation such as a sequence of word embeddings. In this sense, attention can also be seen as a way to condense inputs into a compact form. [30]

Self-attention means computing attention based on only the input sequence, meaning that both the query and the keys are taken from the same sequence. The most common approach to self-attention works by applying multiple steps of attention to an input sequence. On each step, a different element of the input sequence is used as the query. This approach leads to contextual embeddings that reflect the relevance of each element in the input to all the other elements of the input. Self-attention in particular helps with the problem of representing longer inputs, since it allows each element of the input to affect the representations produced by the attention mechanism. [30]

Multi-head attention refers to using multiple attention functions which are computed in parallel on the same input, with each attention head having its own set of learnable weights for projecting the keys, queries and values. The context vectors produced by the different attention heads are merged together to produce the final representation at the end of each step. Multi-head attention is particularly suitable for ambiguous data, e.g. for data where words can have multiple meanings, because multi-head attention allows the representations produced by the attention mechanism to combine information from different interpretations of the input. [30]
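The sketch below walks through these steps numerically for one query: energy scores from a compatibility function, a softmax as the distribution function, and a context vector built from the weighted values. The dot product used as the compatibility function and the scaling by the square root of the dimensionality follow the common scaled dot-product formulation; the shapes and random vectors are arbitrary.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4                               # dimensionality of the key, value and query vectors
K = rng.normal(size=(6, d))         # keys: one vector per input element
V = rng.normal(size=(6, d))         # values: another representation of the same elements
q = rng.normal(size=d)              # query

e = K @ q / np.sqrt(d)              # energy scores e = f(q, K), here a scaled dot product
a = softmax(e)                      # attention weights a = g(e), a distribution over the input
Z = a[:, None] * V                  # weighted representations of the values
c = Z.sum(axis=0)                   # context vector: a compact representation of the input

print(a.round(3))
print(c.round(3))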
2.5 Transformer architecture

The transformer ANN architecture, introduced in 2017 by Vaswani et al. [9], is based on using multi-head self-attention without recurrence or convolution, which reduces the amount of sequential computation, thus increasing the parallelizability of the model. The increased parallelizability allows more efficient training using large amounts of data. Models based on the transformer architecture have achieved state-of-the-art results in many NLP tasks by taking advantage of a transfer learning approach, where the models are first trained on a large amount of unannotated data using a self-supervised task and then fine-tuned on smaller amounts of annotated, domain-specific data for the desired downstream task such as text classification. [31]

The full transformer architecture consists of two parts: the encoder stack and the decoder stack, which are shown in figure 2.5. The encoder and decoder stacks consist of some number of encoder and decoder blocks, respectively. The number of blocks in each stack varies from model to model. In the original transformer architecture, the encoder stack consisted of six identical encoder blocks. The encoder stack produces a sequence of contextual representations of the input tokens and the decoder generates tokens based on the encoder's output and the previously generated tokens. [9]

Figure 2.5: Simplified illustration of the transformer architecture inspired by Figure 1 in Vaswani et al. [9] The model consists of an encoder and a decoder stack which have N encoder and decoder blocks. The encoder stack produces a sequence of contextual representations of the input, which the decoder uses alongside the previously generated output to generate more tokens. For simplicity, the residual connections and layer normalization included in the encoder and decoder blocks are not shown.

The encoder blocks contain two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward neural network. Each sub-layer also has a residual connection around it which is followed by layer normalization. The residual connection means that the output of each sub-layer is a combination of the input of the sub-layer and the output of the computation done in the sub-layer, i.e. the output of the multi-head attention or feedforward network. On the first encoder block the self-attention mechanism is computed over the input of the model, and on the other blocks self-attention is calculated based on the output of the previous encoder block. The output of the encoder stack is a sequence of token representations, with each input token having an n-dimensional, contextual vector representation. [9]

The decoder stack consists of N identical decoder blocks, which also contain the same self-attention layers, fully-connected layers and residual connections as the encoder blocks. However, the decoder blocks also have an additional attention sub-layer which performs multi-head attention on the output of the encoder stack, i.e. the contextual representations produced by the final encoder block. The self-attention layer of the decoder stack calculates attention over the previously generated output, which is masked so that the attention is calculated based only on the earlier tokens in the output sequence. In the original architecture, the output of the decoder stack is a sequence of scores for the next token; these scores are turned into a probability distribution by using the softmax function. The token that is generated can then be sampled from the probability distribution. [9]

The transformer also has two embedding layers that share the same set of weights, one for the encoder stack and one for the decoder stack, that use learned embeddings to transform the output and input tokens to vector representations.
Since the transformer uses neither convolution nor recurrence, special positional encodings need to be added to the learned embeddings of the encoder and decoder stacks so that the model can make use of the order of tokens in the input sequence. These positional encodings contain information about the relative or absolute position of tokens in the sequence and have the same dimensionality as the input embeddings so that they can be summed together. [9]

Transformer-based models can be split into three categories based on which parts of the transformer architecture they use: encoder-only, decoder-only and encoder-decoder models. Decoder-only models, such as the Generative Pre-Trained Transformer (GPT) models [32], are generally used for language generation. Decoder-only models are pre-trained using the self-supervised language modeling task, where the goal is to predict the value of a token based on its preceding tokens. Encoder-only models, such as BERT [13], are pre-trained using the MLM task, and they are used for downstream tasks related to language understanding, e.g. sequence classification and question answering. Encoder-decoder models, such as T5 [33], contain both parts of the transformer architecture and they are generally used for sequence-to-sequence downstream tasks, such as machine translation. Encoder-decoder models are pre-trained using denoising tasks where the goal is to reconstruct a text sequence which has been corrupted in some way. [31]

2.6 BERT

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer model that was introduced in 2019. When BERT was introduced, it achieved state-of-the-art results in eleven NLP tasks. The model was pre-trained using the masked language modeling (MLM) task, an example of which is shown in figure 2.6, and the next sentence prediction (NSP) task, using two English text corpora containing a total of 3.3 billion words. BERT can be fine-tuned for various downstream tasks by using smaller amounts of annotated data. [13]

BERT uses a form of subword tokenization, called WordPiece [34], with a dictionary length of 30,000. Two special tokens are also added to each input sequence: the [CLS] token and the [SEP] token. The [CLS] token is a special classification token that is added to the start of each input sequence. The final contextual representation of the [CLS] token can be used to represent the meaning of the whole input sequence in sentence-level tasks. The [SEP] token is used to separate sentences from one another. [13]

2.6.1 Pre-training

BERT uses bidirectional self-attention, which allows the model to pay attention to the whole input sequence, limited only by the length of the context window. Using masked language modeling (MLM) for pre-training was the main innovation that allowed using bidirectional self-attention. MLM was required because models trained using the regular LM task couldn't use bidirectional attention, since that would allow the model to see the token being predicted, thus trivializing the prediction process.

Figure 2.6: Simple example of masked language modeling

During the MLM pre-training process, 15% of the tokens were masked. Out of the masked tokens, 80% were replaced by the special [MASK] token, 10% were replaced by a random token and 10% were left unchanged. Instead of just using the [MASK] token, the other alterations were made to reduce the difference between pre-training and fine-tuning, because the [MASK] tokens are not used in downstream tasks. [13]
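The masking scheme can be sketched in a few lines operating on a list of token ids. The 15% masking rate and the 80/10/10 proportions follow [13]; the id values are placeholders, and the sketch ignores details such as leaving the special [CLS] and [SEP] tokens unmasked.

import random

random.seed(0)
MASK_ID, VOCAB_SIZE = 103, 30000     # placeholder values, not the real dictionary

def mask_for_mlm(token_ids, mask_prob=0.15):
    """Return (input_ids, labels) for masked language modeling."""
    input_ids, labels = list(token_ids), []
    for i, token in enumerate(token_ids):
        if random.random() < mask_prob:                      # select roughly 15% of tokens
            labels.append(token)                             # the model must recover this token
            roll = random.random()
            if roll < 0.8:
                input_ids[i] = MASK_ID                       # 80%: replace with [MASK]
            elif roll < 0.9:
                input_ids[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
        else:
            labels.append(-100)                              # unmasked positions are ignored in the loss
    return input_ids, labels

print(mask_for_mlm([101, 5650, 7064, 2015, 102]))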
The NSP task was used for pre-training because it allows the model to learn the relationship between two sentences, benefiting tasks such as question answering (QA). Each training example for NSP consists of two sentences: sentence A and sentence B. The sentences are chosen so that 50% of the time sentence B is the sentence that follows sentence A in the original text, and 50% of the time sentence B is just a randomly chosen sentence from the corpus. The model then predicts whether sentence B follows sentence A or not. [13]

2.6.2 Downstream tasks

There are two approaches to using BERT for downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach the pre-trained BERT is used to produce contextual representations of the input without doing any fine-tuning. These representations are then used as the input for another model. [13]

In the fine-tuning approach BERT is initialized with the parameters from pre-training. Then a fully connected feed-forward network is attached to the top of the BERT architecture. This attached network is generally called a head, e.g. a classification head in the case of classification. The outputs of BERT are pooled and then fed to this network. The outputs and inputs of this network are adjusted based on the task. For example, in text classification the network takes the contextual representation of the [CLS] token as input and has an output for each target label. Each output produces a real number called a logit. To get the class prediction, the softmax function is used over these logits to turn them into a probability distribution which shows the probability of each label. In token-level tasks, the input representations of all tokens are used as the input of the attached network. The objective function used during fine-tuning is also selected based on the downstream task. During training, the parameters of both the BERT model and the head ANN are adjusted, although some parameters can also be frozen depending on the approach. [13]

2.6.3 BERT variants

There are two variants of the original BERT model: large and base. The large version contains 24 encoder blocks (also called transformer blocks) whereas the base version contains 12 encoder blocks. The base model has a total of 110 million parameters and 12 attention heads, and the large version has 340 million parameters with 16 attention heads. [13]

The BERT team also released a multilingual version of BERT that supports 104 languages. The multilingual version was trained on each language's Wikipedia. The supported languages were chosen based on the size of their Wikipedias. [35] There are also monolingual BERT-based models for languages other than English, such as FinBERT for Finnish [36], CamemBERT for French [37] and BETO for Spanish [38]. These monolingual models have been trained on text written in their target languages.

DistilBERT is a smaller version of BERT that was trained using knowledge distillation, in which a smaller student model is trained to reproduce the behaviour of a larger teacher model. DistilBERT retained 97% of the performance of the original BERT-base model on the General Language Understanding Evaluation (GLUE) benchmark while being 40% smaller in terms of parameters and 60% faster at inference time [39]. This thesis uses the distilled version of the multilingual BERT model.
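As a small illustration of the feature-based approach from section 2.6.2, the sketch below loads the multilingual DistilBERT checkpoint mentioned above and extracts the contextual representation of the [CLS] token for one page count description. It assumes the transformers and torch libraries are installed and downloads the pre-trained weights on first use.

import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# The tokenizer adds the [CLS] and [SEP] tokens automatically.
batch = tokenizer("[8], 56 s", return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# last_hidden_state has shape (batch, tokens, 768); position 0 is the [CLS] token,
# whose contextual representation can stand for the whole sequence.
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)   # torch.Size([1, 768])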
3 Data inspection and processing

In this chapter, the datasets used in this thesis are examined. The chapter contains descriptions of the two datasets, and an explanation of how the data were processed to create the final datasets: the regression dataset and the MLM dataset.

3.1 Regression dataset

The first dataset is a harmonized subset of the full Fennica dataset described in chapter 1. This dataset contains bibliographic metadata describing documents published between the years 1488 and 1800. In total, the original dataset has 19,152 examples before doing any filtering such as removing null values. The first dataset will be referred to as the regression dataset, since it will be used to fine-tune a model for the regression task.

The dataset has a total of 34 features, such as the title, publication year, language and type of the document. The only two features used for this thesis are "pagecount_orig" and "pagecount". The "pagecount_orig" field contains the value of the MARC21 300a field, which is a short text description of a document's page count, and the "pagecount" field contains the curated page count estimate for each document, which is an integer. Table 3.1 shows five examples taken from the dataset. It can be seen that the length and complexity of the "pagecount_orig" field varies from example to example.

Table 3.1: Some example values of the pagecount and pagecount_orig fields

  pagecount_orig                                      pagecount
  "[8] s., s. 481-544, [4] s."                        76
  "[8], 56 s"                                         64
  "4 s"                                               4
  "[26] s."                                           26
  "X, 255, XXII s., [3] karttalehteä (taitettuna) :"  284

Out of the 19,152 documents described in the unfiltered dataset, 18,393 are books, 733 are maps, 20 are continuous publications and 6 are music. The described documents are written in 18 different languages, with Swedish, Latin and Finnish being the most popular, having 8,329, 6,976 and 2,958 documents written in each language, respectively. The mean page count of all the documents in the dataset is 34 pages with a standard deviation of 126. Figure 3.1 shows the distribution of documents by publication decade. The figure shows that most of the documents were published near the end of the 18th century, with barely any documents being published before the 17th century. The last decade, 1800, also has far fewer entries because the dataset only contains information on documents released during the first year of that decade.

Figure 3.1: Count of documents by publication decade in the regression dataset

3.1.1 Data processing

To prepare the final regression dataset, the data was first filtered by only selecting the two relevant features: "pagecount" and "pagecount_orig". Then all the examples that had a missing value for either feature were removed. This removed a total of 159 examples, resulting in a dataset with 18,992 examples.

After removing missing values, the duplicate examples with regard to the values of the selected features were removed, i.e. the dataset was filtered so that each pair of values for "pagecount_orig" and "pagecount" appears only once in the dataset. These duplicates aren't necessarily duplicates in the sense that they describe the same documents, but having multiple of the same input-output pairs in the data would skew the results, because then it would be possible to have the same input-output pairs appear in both the test and training set, which would make the results overly optimistic. Removing these duplicates left a total of 3,583 examples out of the original 18,992. Most of the removed examples had very simple "pagecount_orig" values such as "1 p." and low page count values.
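A sketch of these filtering steps, assuming the harmonized subset is available as a pandas data frame, is shown below. The file name is hypothetical, the column names follow the text above, and the manual checks described in the next paragraphs (the 11,080-page outlier and the page counts of 0 and 1) are left out.

import pandas as pd

# Hypothetical file name; the two field names are the ones used in this chapter.
df = pd.read_csv("fennica_regression_subset.csv")

df = df[["pagecount_orig", "pagecount"]]   # keep only the two relevant features
df = df.dropna()                           # drop examples missing either value
df = df.drop_duplicates()                  # keep each input-output pair only once
print(len(df))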
One outlier that had a page count of 11,080 was removed from the dataset because it affected the loss during training, since its page count was over 8,000 pages larger than the second largest page count in the dataset. Next, the examples with a page count of 1 were checked, because in the second dataset, which will be used for MLM fine-tuning, there were a lot of incorrect page count values of 1. The dataset contained eight examples that had a page count of 1. Out of these eight, four were removed because the "pagecount_orig" field described a higher page count, e.g. an example with a "pagecount_orig" value of "201 [se on 199] s, 1 s" (in English: "201 [it is 199] p., 1 p.") was removed. Finally, the 42 values that had a page count of 0 were removed, since it is not possible to have a printed document with 0 pages.
After filtering the data, the next step was to clean the values of the "pagecount_orig" field. This was done by writing a custom function, which first removed all the extra whitespace. Then dots, commas, square brackets, question marks, colons and semicolons were removed. Finally, the strings were converted to lowercase. The values were cleaned to reduce the amount of noise in the data. Table 3.2 shows the difference between the original and cleaned values of the "pagecount_orig" field for a few examples.

Table 3.2: Comparisons of the original "pagecount_orig" and cleaned "pagecount_orig" fields

Original                      Cleaned
"[48] s."                     "48 s"
"[2] s."                      "2 s"
"[4], 16 s."                  "4 16 s"
"[26] s."                     "26 s"
"[4] s., s. 461-476"          "4 s s 461-476"

After pre-processing, the data was shuffled and then split into training, validation and test sets using an 80-10-10 split. The final regression dataset had a training subset of 2,828 examples, a test subset of 354 examples and a validation subset of 354 examples. The final regression dataset included only two features: "pagecount_orig" and "pagecount".
3.2 MLM dataset
The second dataset is also a harmonized subset of the Fennica dataset. This dataset will be referred to as the MLM dataset. The dataset consists of documents published between the years 1809 and 1917, which means that the dataset has no overlap with the regression dataset, since the earliest documents of the MLM dataset were published 9 years after the last documents in the regression dataset. The unfiltered MLM dataset contains a total of 66,890 entries. Figure 3.2 shows the number of documents published in each decade in the dataset. The figure shows that most of the documents were published in the last four decades of the dataset's time window.

Figure 3.2: Count of documents by publication decade in the MLM dataset

The MLM dataset contains a total of 21 features, including both the "pagecount_orig" and "pagecount" features that were also in the regression dataset. The values of the "pagecount" feature, however, are incorrect in the MLM dataset, which is why the dataset cannot be used for regression fine-tuning. The reason for the incorrect page count values is that the function used to harmonize the data did not work correctly. The values of the "pagecount_orig" feature can still be used for MLM fine-tuning, though, since MLM is a self-supervised task which can be done with unlabeled data.
For the MLM dataset, only the "pagecount_orig" feature was chosen. After discarding all the missing values, the dataset had 60,998 examples left. After that, the entries were cleaned with the same function that was used for the regression dataset in section 3.1.1.
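A minimal sketch of a cleaning function implementing the steps described in section 3.1.1 (collapsing extra whitespace, removing the listed punctuation and lowercasing) is shown below; the thesis' own implementation may differ in details, but the sketch reproduces the cleaned values shown in Table 3.2.

```python
# Sketch of the cleaning steps described in section 3.1.1: collapse extra
# whitespace, remove dots, commas, square brackets, question marks, colons
# and semicolons, and lowercase the string.
import re

def clean_pagecount_orig(text: str) -> str:
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
    text = re.sub(r"[.,\[\]?:;]", "", text)    # strip the listed punctuation
    return text.lower()

print(clean_pagecount_orig("[4] s., s. 461-476"))  # -> "4 s s 461-476"
```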
After cleaning the "pagecount_orig" values, all duplicates were removed, leaving the dataset with 15,536 examples.
Finally, to avoid data leakage, the MLM dataset was filtered by removing all values that also appeared in the regression dataset, i.e. all values of the cleaned "pagecount_orig" field that were also in the regression dataset. After filtering out these values, the dataset was left with 15,260 examples. After processing, the dataset was shuffled and divided into a training set of 13,734 examples and a test set of 1,526 examples using a 90-10 training-test split. The validation set was omitted because the model's hyperparameters will not be tuned for MLM fine-tuning.
4 Training the pure regression model
This chapter focuses on the training of the first model, which is fine-tuned on the regression dataset. The chapter explains the reasoning behind model selection, the data pre-processing done before training, the training setup, the hyperparameter optimization process, and finally, the results achieved on the validation set of the regression dataset.
4.1 Model selection
The model chosen for this thesis is the "distilbert-base-multilingual-cased" model (https://huggingface.co/distilbert/distilbert-base-multilingual-cased). A multilingual model was chosen because the values of the "pagecount_orig" field are written in multiple languages such as Finnish, English and Swedish. The fact that most of the entries were written in Finnish also limited the selection to models that support Finnish.
The model is a distilled version of the BERT base multilingual model [35]. The model has 134M parameters compared to the 177M parameters of the original multilingual BERT model, and the model is about twice as fast on average for inference and training. The distilled model was chosen for this thesis because the goal is not to obtain the best possible performance, but to instead see if encoder-only models are viable for the page count prediction task. The distilled model's performance should be close to the performance of the original model [39]. The lower number of parameters can also prevent overfitting, because the regression dataset does not contain that many examples in total. Finally, the model was fine-tuned locally using a single Nvidia RTX 4070 GPU. Therefore, it is beneficial to use a model that doesn't require as much computational power.
4.2 Model architecture
The model used for regression, shown in figure 4.1, consists of a custom regression head and the pre-trained DistilBERT model. The regression head has two fully-connected layers with a dropout layer between them. The first fully-connected layer has 768 inputs and 768 outputs and the second fully-connected layer has 768 inputs and a single output. The last fully-connected layer uses linear activation whereas the first uses the ReLU activation function. The model was implemented in Python by using the PyTorch and Transformers libraries.
The DistilBERT model produces a contextual representation, which is a 768-dimensional vector, for each token in the input sequence. The contextual representation of the special [CLS] token is used as the input of the regression head. The predictions produced by the model are scaled up by multiplying each prediction by 100. The predictions are scaled up because without scaling, the model failed to predict page counts of more than 400.
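The architecture described above can be sketched as follows using PyTorch and the Transformers library. This is an illustration of the described design (768-dimensional hidden size, ReLU, dropout, a single linear output and the scaling by 100), not the exact implementation used in this thesis; the example input strings are made up.

```python
# Minimal sketch of the regression model: DistilBERT with a two-layer head
# that reads the [CLS] representation and scales its output by 100.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PageCountRegressor(nn.Module):
    def __init__(self, model_name="distilbert-base-multilingual-cased",
                 dropout=0.1, scale=100.0):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.dim  # 768 for DistilBERT
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),  # first fully-connected layer, ReLU
            nn.ReLU(),
            nn.Dropout(dropout),        # dropout between the two layers
            nn.Linear(hidden, 1),       # second layer, linear activation
        )
        self.scale = scale

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                # [CLS] representation
        return self.head(cls).squeeze(-1) * self.scale   # scaled prediction

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = PageCountRegressor()
batch = tokenizer(["8 56 s", "x 255 xxii s 3 karttalehteä taitettuna"],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    print(model(batch["input_ids"], batch["attention_mask"]))
```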
A few different ways of scaling the predictions were tried, such as min-max normalization, but multiplying by 100 was chosen due to its simplicity and effectiveness. The scaling is done at the end of the model's forward pass, so that the loss is calculated based on the scaled values. After scaling, the predictions are also rounded to the nearest integer, because the page count is an integer.

Figure 4.1: Architecture of the whole regression model showing the regression head attached to the DistilBERT model. The first number in brackets shows the number of inputs and the second the number of outputs for the fully-connected layers.

4.3 Data preprocessing
To pre-process the data for training the model, the values of the cleaned "pagecount_orig" feature were tokenized and then the tokens were converted into a list of input ids. Each unique token in the model's dictionary has its own input id. The ids are then given to the model as input. Figure 4.2 shows the tokenization process for a single example. The tokenization was done by using the DistilBERT model's tokenizer from the Transformers library. Finally, when the inputs were passed to the model, they were padded dynamically so that the length of every input vector was the same for all the examples in a single batch. This is done by adding special [PAD] tokens to the examples, which are then masked by using an attention mask that marks them with a value of zero, meaning that the model will not pay attention to the [PAD] tokens during training.

Figure 4.2: An example of tokenizing an input, then converting the tokens to input ids. Note that the input ids list has two more elements than the tokenized input due to the [CLS] and [SEP] tokens being added.

4.4 Training setup
The model was trained using the training set and the performance was evaluated on the validation set after each epoch. For regularization, early stopping with a patience of 3 was used, which means that training was stopped if the model's performance on the validation set did not improve for 3 consecutive epochs. After training, the model checkpoint that performed the best on the validation set was loaded. The model was trained using a batch size of 16 due to GPU memory limitations. A weight decay of 0.01 and a dropout probability of 0.1 were also used for regularization.
The hyperparameter optimization was done by using the Optuna library [40]. After trying out optimizing the learning rate, batch size, weight decay, activation function and dropout probability, only the learning rate was chosen for the final hyperparameter optimization, since changing it had a larger effect on the model's performance than changing the other hyperparameters. The AdamW optimizer [41] was used with a β1 value of 0.9 and a β2 value of 0.99. These values control the exponential decay rates of the first and second moment estimates that AdamW uses to adapt the parameter updates during training.
The final hyperparameter optimization consisted of 15 trials where Optuna was used to suggest learning rates between 10⁻⁵ and 10⁻⁴. Each trial consisted of training the model for 10 epochs, or until early stopping was triggered, i.e. until the MSE loss on the validation set stopped improving. After training, the model's performance was evaluated on the validation set. The best hyperparameters were chosen by taking the hyperparameters used in the trial that achieved the smallest MSE. After hyperparameter optimization, the final model was trained for 10 epochs using the best performing learning rate from the trials, which was 2.96 × 10⁻⁵.
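The learning-rate search described above can be sketched with Optuna as follows. The `train_and_evaluate` function is a hypothetical placeholder for the training loop (batch size 16, weight decay 0.01, early stopping with patience 3), which is not shown here; only the Optuna-specific part is illustrated.

```python
# Sketch of a 15-trial learning-rate search on a log scale between 1e-5
# and 1e-4, minimizing the validation-set MSE.
import optuna

def train_and_evaluate(learning_rate: float) -> float:
    """Hypothetical stand-in for the fine-tuning loop: train the regression
    model with the given learning rate and return the validation MSE."""
    raise NotImplementedError

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True)
    return train_and_evaluate(lr)  # Optuna minimizes the returned MSE

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=15)
print(study.best_params)
```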
4.5 Results
The final model achieved an MSE of 1,728 and an MAE of 16.40 on the validation set. The median absolute error (AE) of the predictions was 5 and the 3rd quartile AE was 13. This shows that most of the predictions were close, but that there were some outliers which increased the mean absolute and squared errors. The highest absolute error was 333, and there were a total of 14 predictions that had an absolute error of more than 100.

Figure 4.3: The predictions of the regression model vs. the page count values on the validation set.

Figure 4.3 shows the difference between the model's predictions and the real page count values. The figure shows that there were some large errors in predictions for documents that had a low page count. The figure also shows that the validation set only has a few documents with a page count greater than 800. The low number of high page count documents makes it difficult to judge the model's ability to predict high page count values.
5 Training the MLM fine-tuned model
This chapter explains the training process of the second model, which is first fine-tuned on the MLM task before fine-tuning it on the regression task. The chapter explains the MLM fine-tuning process and the reasoning behind the MLM fine-tuning, summarizes the regression fine-tuning and hyperparameter optimization processes, and analyses the metrics the MLM fine-tuned model achieved on the validation set of the regression dataset.
5.1 MLM fine-tuning
The goal of fine-tuning the model using the MLM task is to help with domain adaptation. The multilingual DistilBERT model used in this thesis was pre-trained on large amounts of general textual data, but the values of the "pagecount_orig" field can differ a lot from general text because they contain a lot of numbers, abbreviations and specific terminology. Fine-tuning for the MLM task may help the DistilBERT model to adapt to the bibliographic data domain. The MLM fine-tuning will only affect the base DistilBERT model, whereas the regression head will only be fine-tuned using the regression dataset.
The approach of first fine-tuning the model on the MLM task before regression fine-tuning was inspired by the Universal Language Model Fine-tuning (ULMFiT) approach [42], introduced by Howard and Ruder in 2018. In the ULMFiT approach, a pre-trained language model was fine-tuned on domain-specific unannotated data using the language modeling (LM) task. After LM fine-tuning, the model was then fine-tuned for text classification using a smaller amount of annotated data.
The approach taken in this thesis differs from the ULMFiT approach in a few ways. First, the original ULMFiT approach used an older language model based on the LSTM architecture, whereas in this thesis a transformer model is used. The second difference is that this thesis uses the bidirectional MLM task instead of the unidirectional LM task used by ULMFiT. The usage of MLM instead of LM is enabled by using the encoder-only architecture [13]. Finally, ULMFiT was aimed at improving text classification performance whereas in this thesis the goal is to improve text regression performance.
The dataset used for the MLM fine-tuning, which was originally introduced in section 3.2, only has the cleaned values of the "pagecount_orig" field. During data pre-processing, the values of the "pagecount_orig" field were tokenized and turned into input ids using the DistilBERT model's tokenizer, similarly to section 4.3. After tokenization, the input ids of each example were copied so that they could be used as labels during training. Finally, 15% of all the tokens were masked using the data collator from the Transformers library. The masking process replaced the input ids of the original tokens with the id of the special [MASK] token.
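The masking step described above can be sketched with the Transformers data collator as follows. This is an illustration rather than the thesis' exact code, and the example strings are made up to resemble cleaned "pagecount_orig" values.

```python
# Sketch of preparing MLM batches: tokenize the cleaned descriptions and let
# the data collator mask 15% of the tokens and build the matching labels.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

examples = ["8 56 s", "x 255 xxii s 3 karttalehteä taitettuna"]
features = [tokenizer(text) for text in examples]

batch = collator(features)   # pads the examples, masks tokens, creates labels
print(batch["input_ids"])    # some ids replaced by the [MASK] token id
print(batch["labels"])       # original ids at masked positions, -100 elsewhere
```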
To train the model for the MLM task, a learning rate of 2e-5 was used alongside a weight decay of 0.01 and a batch size of 16. The hyperparameters were not optimized for the MLM task, since the MLM fine-tuning is only done to improve regression performance, and the larger size of the MLM dataset made training the model slower, which would have slowed down the hyperparameter optimization process. The training was done with the model's performance being evaluated on the MLM dataset's test set after each epoch. Early stopping with a patience of 3 was used during training and the best performing parameters were loaded at the end of training. The model was trained for 11 epochs in total, but the best performance on the test set was achieved after 8 epochs. The loss metric used during training was cross-entropy loss.
5.2 Regression fine-tuning
After MLM fine-tuning, the model was fine-tuned for regression similarly to section 4.4, using the regression dataset. The model architecture was exactly the same as for the pure regression model except that the MLM fine-tuned DistilBERT was used instead of the original multilingual DistilBERT model. The same approach was used for hyperparameter optimization, which consisted of 15 trials where the Optuna library was used to suggest learning rates between 10⁻⁵ and 10⁻⁴. The optimized learning rate used for training the final model for 10 epochs was 2.52 × 10⁻⁵. Other than the learning rate, all the hyperparameters were the same as for the pure regression model: a weight decay of 0.01, a batch size of 16 and a dropout probability of 0.1, and the AdamW optimizer was used with β1 = 0.9 and β2 = 0.99.
5.3 Results
For the MLM task the metric used to evaluate the model's performance was perplexity, which is the exponential of the cross-entropy loss. Before MLM fine-tuning, the model achieved a perplexity of 716.51 on the MLM dataset's test set, which dropped down to 5.71 after training. The significant drop in perplexity shows that the MLM fine-tuning made the DistilBERT model adapt to the domain of bibliographic data, at least when it comes to the MLM task.
For the regression task, the MLM fine-tuned model achieved an MSE of 2,540 and an MAE of 17.34 on the validation set of the regression dataset. These results are worse than what the pure regression model achieved, although the final comparison will be done in the next chapter, where both models will be evaluated on the test set. Although the MLM fine-tuned model's median AE of 5 was the same and the 3rd quartile AE of 10 was lower than what the pure regression model achieved, the MSE and MAE were both worse. One of the reasons for this is that the largest AE of the MLM fine-tuned model was 539, and the model had 13 predictions with an AE greater than 100. The results show that the MLM fine-tuned model had bigger outliers in terms of AE, which is why the MSE of 2,540 in particular is a lot worse than the MSE of 1,728 the pure regression model achieved. Due to the small size of the validation set, outliers can have a large effect on MSE.

Figure 5.1: The predictions of the MLM fine-tuned model vs. the page count values on the validation set. It can be seen that there were some large outliers in AE for documents that had a page count near one. In particular, the largest AE of 539 is shown in the middle-left side of the figure.
Figure 5.1 shows the predictions compared to the page counts on the validation set. The figure shows that the prediction that had the biggest AE (539) was for a document with a very low page count. The example had a "pagecount_orig" value of "8 s 655 se on 639 s 1 s" ("8 p. 655 it is 639 p. 1 p." in English) and a page count of 8. The value of the page count feature for this example might be incorrect, since the description contains more numbers than just 8. If the value is incorrect, the MLM fine-tuned model's prediction of 547 pages might be closer than the pure regression model's prediction of 295 pages.
The results obtained in this chapter show that the MLM fine-tuning did not lead to improved performance on the validation set. However, the final effectiveness of the MLM fine-tuning will be shown in the next chapter, where the performance of the MLM fine-tuned model and the regression model are compared on the test set of the regression dataset.
6 Conclusion
The chapter starts by comparing the two models' performance on the test set of the regression dataset. The chapter also contains discussion about the results, limitations of the results, suggestions for future research and a final summary of the thesis.
6.1 Model comparison
Table 6.1 shows the metrics for both models on both the validation and test subsets of the regression dataset. The models were evaluated only once on the test set after both models had been trained, so that the decisions made when training the models were not affected by the results obtained on the test set. On the test set, the MLM fine-tuned model achieved an MSE of 3,597 and an MAE of 17.50. The pure regression model achieved an MSE of 4,971 and an MAE of 21.11. The MLM fine-tuned model performed worse on the validation set, but outperformed the pure regression model on the test set. Both models had a noticeably worse MSE on the test set, and the pure regression model's MAE was also noticeably worse on the test set than on the validation set.
Both models had a median AE of 5 on the test set, which is the exact same value they achieved on the validation set. The 3rd quartile AE was 12 for the pure regression model and 11 for the MLM fine-tuned model on the test set. The 3rd quartile AE values achieved on the test set were also close to the values achieved on the validation set. This shows that the main difference between the test and validation set performance is that the test set had more outliers with a large AE. The differences could also be explained by the fact that the models' hyperparameters were optimized using the validation set, which led to the models slightly overfitting the validation set.

Table 6.1: Results for both models on the test and validation regression sets. The best metrics achieved on both the validation and test set are bolded. The MLM fine-tuned model performed better on the test set, even though its performance on the validation set was worse.
Model               Regression dataset subset    MSE      MAE
Pure regression     Validation                   1,728    16.40
MLM fine-tuned      Validation                   2,540    17.34
Pure regression     Test                         4,971    21.11
MLM fine-tuned      Test                         3,597    17.50

The largest AE on the test set was 915 for the pure regression model and 829 for the MLM fine-tuned model; this example can be seen in the bottom right corner of the graphs for both models in figure 6.1. The example had a "pagecount_orig" value of "s 943-" and a page count value of 943. The page count for this example should have been simple to predict; however, it was by far the largest AE for both models, being 465 greater than the second largest AE for the pure regression model, and 517 greater for the MLM fine-tuned model. The hyphen in the "pagecount_orig" value is most likely the reason why the prediction was off by such a large amount, although it is hard to know precisely how the models make predictions. The hyphens were not removed when cleaning the data because the values of the "pagecount_orig" feature contain page ranges, such as "250-300 p.", which have a different meaning to just having two page numbers back-to-back, e.g. "250 300 p.". However, perhaps more careful cleaning of the data, such as removing hyphens that are not between two numbers, would improve model performance.

Figure 6.1: Predictions and page counts for both the pure regression (graph A) and MLM fine-tuned (graph B) models on the test set. The MLM fine-tuned model had both a lower MAE and MSE than the pure regression model. The MLM fine-tuned model was more accurate on high page count documents in particular.

Figure 6.1 shows a side-by-side comparison of the two models' predictions on the test set. The figure shows that the test set had more examples with a higher page count than the validation set, and that both models struggled more with predicting higher page count values. However, the MLM fine-tuned model performed better on the high page count predictions, which is why it obtained better metrics. The results also show that the MLM fine-tuned model generalized better, because it obtained better results on the test set and had a smaller difference between test set and validation set performance.
Figure 6.2 shows the distribution of AE for the test set predictions of both models. The figure shows that the MLM fine-tuned model had slightly more predictions with a low AE, although the regression fine-tuned model also had a low AE for most of its predictions. In total, out of the 354 test set predictions, the MLM fine-tuned model had 263 and the pure regression model had 247 predictions where AE ≤ 10.

Figure 6.2: Distribution of the predictions' AE for the test set. The pure regression model is shown in blue and the MLM fine-tuned model is shown in orange. The 29 predictions for the pure regression model and the 23 predictions for the MLM fine-tuned model that had an AE higher than 50 were excluded from the graph. The MLM fine-tuned model had more predictions where AE ≤ 10.

In general, the small size of the test and validation sets makes it difficult to evaluate model performance, since just a few predictions being off by a large amount will affect the MSE noticeably. For example, if the example with the highest AE is removed from the results of both models, the MSE drops down to 2,614 for the pure regression model and down to 1,660 for the MLM fine-tuned model. This is a good demonstration of how MSE is affected by outliers, especially when the dataset is small.
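As a rough check of this effect, the contribution of the single largest error to the pure regression model's test MSE can be estimated from the 354 test examples and the largest AE of 915 reported above:

```latex
% Contribution of the largest outlier to the pure regression model's test MSE
\frac{915^{2}}{354} \approx 2365,
\qquad
\text{observed drop: } 4971 - 2614 = 2357.
```

The small remaining difference comes from the denominator changing from 354 to 353 when the example is removed, so one prediction alone accounts for almost half of the reported MSE.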
Removing the example would lower the MAE of the pure regression model to 18.58 and the MLM fine-tuned model's MAE to 15.20, which shows that MAE is more robust to outliers than MSE.
6.2 Discussion
The first research question of this thesis was "Can a transformer encoder model be used to predict the page count of a document based on the value of the MARC21 300a field?". The results obtained on the test set show that the models trained in this thesis can reliably predict the page count of documents that have fewer than around 600 pages, but struggle to predict the page count of longer documents. The validation set only had a few high page count documents, which could explain why the final models were less accurate at predicting high page count values, because the models' hyperparameters were optimized using the validation set. The results show that a transformer encoder model could potentially achieve better performance. For example, a larger model trained with more data could achieve better results.
The results were also pessimistic because of the removal of duplicates that was done in section 3.1.1. For real world bibliographic data, it is likely that there would be some duplicate values of the MARC21 300a field, particularly for shorter documents. For example, there could be multiple examples that have a MARC21 300a field value such as "2 pages" or "2 p." for documents that have a page count of two. The reason the duplicate values were removed in this thesis was to avoid having overly optimistic results due to the same "pagecount_orig" and page count values appearing in both the test and training sets of the regression dataset.
The second research question was "Can page count prediction performance be improved by fine-tuning the model for the masked language modeling task using unannotated data?". The MLM fine-tuned model obtained noticeably better metrics on the test set, even though it performed worse on the validation set. This shows that the MLM fine-tuned model had better generalization, meaning that it performed better on unseen data. Based on the experiments of this thesis, it can be concluded that MLM fine-tuning can improve a model's regression performance, at least if the amount of annotated data that can be used for regression fine-tuning is limited.
6.3 Limitations
The size of the validation and test sets might threaten the validity of the results, since a few outliers can skew the metrics. Model performance on the test set was evaluated only once, but for the validation set, there was some variance in the metrics for models trained using the same hyperparameters. Therefore, to get more accurate results, more data should be used both for training the model and for evaluating model performance.
The datasets used in this thesis were both subsets of the Finnish national bibliography, Fennica. Most of the MARC21 300a entries, which were used as the input, were written in Finnish. The experiment results might not reflect the general applicability of transformer models to page count prediction. For example, a model trained on data taken from another national bibliography, with entries written mostly in another language and potentially with other standards, may perform differently. Also, the documents described by the datasets were published between the years 1488 and 1917, which means that most of the bibliographic data was catalogued a long time ago.
More recently catalogued bibliographic data may use different standards for page count entries, which can affect the effectiveness of the approach taken in this thesis. This limitation could be addressed by using a more diverse dataset which contains examples from multiple bibliographies.
Another potential limitation is that the annotated data used for regression fine-tuning did not describe documents that were longer than 1,714 pages. The one outlier example that had a page count of 11,080 was removed due to how it affected the training loss. Generally, the models were less accurate at predicting higher page counts. Therefore, the models used in this thesis might be more applicable to types of documents that have a lower page count. Using a more balanced dataset, which contains more examples with a high page count, for regression fine-tuning could remove this limitation.
Finally, even though some examples that had an incorrect page count value were removed from the regression dataset, there might still have been some incorrect values left in the dataset which affected the results negatively. For example, the validation set had the example with a MARC21 300a value of "8 s 655 se on 639 s 1 s" and a page count of 8. The page count value appears to be incorrect, since the MARC21 300a value also contains other page count values. The MLM fine-tuned model had its worst AE on this example, which affected the metrics obtained on the validation set. More careful processing of the annotated data could potentially lead to improved results.
6.4 Future research
The simplest approach for future research would be using more annotated data for fine-tuning the model for regression. Having more data would reduce the uncertainty of the validation and test set results while also potentially improving performance. In particular, having more examples with a higher page count could lead to better performance. More data could be obtained by integrating data from multiple bibliographies into a single dataset. This would allow the evaluation of models on entries that are written in various languages. Trying the approach of this thesis on data which describes more recent documents could also be interesting, since it would show how the cataloging standards affect model performance.
The model selection in this thesis was limited by the small size of the regression dataset and the lack of computational resources, which meant that a smaller model had to be selected. Another limiting factor was that most of the MARC21 300a entries were written in Finnish, which is why a multilingual model that supported Finnish was chosen. Without these limitations, there are many different encoder-only models that could be used. If more data were obtained, some larger models such as the large version of the BERT model could be used. There are also other, more recent encoder-only models that could be used, such as ModernBERT [43].
With enough unannotated data, the full pre-training of an encoder-only model could be tried. However, this would require a large amount of computational resources and might not be worth it just for page count prediction. The MLM fine-tuning done in this thesis mimicked the typical MLM pre-training of an encoder-only model, and it improved the model's performance on the downstream regression task. Full pre-training could potentially lead to even more improvement.
Finally, in cases where the amount of annotated data is limited, more sophisticated training methods could be tried. These methods include freezing some layers of the base encoder-only model either permanently or for some amount of time, so that the parameters of the regression head can be adjusted without the encoder-only model starting to overfit the training data. Another solution to having less data would be to try more efficient models, which can obtain good performance with less computation and memory usage [44]. However, the more efficient models might not support Finnish, which is why data written in other languages might be needed to take advantage of them.
6.5 Summary
In this thesis, the research objective was to predict the page count of documents based on the value of the MARC21 300a field, which is a bibliographic data field that describes the length of a document. The values of the field are short text descriptions that vary from single numbers to long descriptions containing many abbreviations, words and even Roman numerals. Due to the varying standards used for describing the documents, the task of mapping these text values to numerical page counts is non-trivial.
The approach chosen for this thesis was to use an encoder-only transformer model to produce contextual vector representations of the MARC21 300a entries. These vectors were then passed to a regression head, a simple artificial neural network with two fully-connected layers, that produced the final page count predictions. The advantage of this approach is that it doesn't require manual feature engineering, such as parsing different values from the text descriptions. Using an encoder-only model also allowed taking advantage of a transfer learning approach, since the model had been pre-trained on a large amount of data by the model's authors. The transfer learning was important, since the amount of annotated data used in this thesis was limited.
Two harmonized subsets of the Finnish national bibliography, Fennica, were used in this thesis. The first dataset, which was called the regression dataset since it was used for regression, contained descriptions of documents released between the years 1488 and 1800. This dataset contained both the MARC21 300a values and previously estimated page counts for each document. The other dataset, called the MLM dataset since it was used for masked language modeling (MLM), contained metadata on documents published between the years 1809 and 1917. Many of the page count estimates of the second dataset were incorrect, which is why the MLM dataset could not be used for regression fine-tuning. However, an experiment where the second dataset was used to fine-tune the base encoder-only model using the MLM task was done to see if unannotated data could be used to improve regression performance.
To evaluate the effectiveness of MLM fine-tuning, a model was fine-tuned only on the first dataset for the regression task. This model was called the pure regression model and its performance was compared to the performance of the MLM fine-tuned model, which was fine-tuned for both the MLM task and the regression task. The final result was that the MLM fine-tuned model performed better on the test set of the regression dataset. However, both models had some outlier predictions that had a large absolute error (AE), and the small size of the regression dataset led to these outliers having a large effect on the performance metrics.
In particular, both models performed worse on documents that had a page count greater than around 600.
Overall, it can be concluded that the approach of using an encoder-only model for page count prediction is viable. The results of this thesis might not be applicable to other bibliographies, since most of the MARC21 300a entries used for the experiments of this thesis were written in Finnish, and the latest documents described by the datasets were published in 1917. For future research, using more data or other encoder-only models should be considered. The approach could also be tried on data from other bibliographies.
References
[1] T. Umerle, G. Colavizza, E. Herden, et al., “An Analysis of the Current Bibliographical Data Landscape in the Humanities. A Case for the Joint Bibliodata Agendas of Public Stakeholders”, May 2022, Publisher: Zenodo. doi: 10.5281/ZENODO.6559857.
[2] L. Lahti, J. Marjanen, H. Roivainen, and M. Tolonen, “Bibliographic Data Science and the History of the Book (c. 1500–1800)”, Cataloging & Classification Quarterly, vol. 57, no. 1, pp. 5–23, Jan. 2019, issn: 0163-9374, 1544-4554. doi: 10.1080/01639374.2018.1543747.
[3] M. Tolonen, L. Lahti, H. Roivainen, and J. Marjanen, “A Quantitative Approach to Book-Printing in Sweden and Finland, 1640–1828”, Historical Methods: A Journal of Quantitative and Interdisciplinary History, vol. 52, no. 1, pp. 57–78, Jan. 2019, issn: 0161-5440, 1940-1906. doi: 10.1080/01615440.2018.1526657.
[4] I. Tiihonen, L. Lahti, and M. Tolonen, “Print culture and economic constraints: A quantitative analysis of book prices in eighteenth-century Britain”, Explorations in Economic History, vol. 94, p. 101614, Oct. 2024, issn: 00144983. doi: 10.1016/j.eeh.2024.101614. (visited on 02/08/2025).
[5] M. Tolonen, J. Marjanen, H. Roivainen, and L. Lahti, “Scaling Up Bibliographic Data Science”, Digital Humanities in the Nordic and Baltic Countries Publications, vol. 2, no. 1, pp. 450–456, May 2019, issn: 2704-1441. doi: 10.5617/dhnbpub.11118.
[6] Fennica – the Finnish National Bibliography | Kansalliskirjasto. [Online]. Available: https://www.kansalliskirjasto.fi/en/services/fennica-finnish-national-bibliography (visited on 03/12/2025).
[7] T. D. S. Group, Fennica metadata conversions: Statistical monitoring and analysis, Feb. 2025. [Online]. Available: https://fennica-fennica.2.rahtiapp.fi/ (visited on 03/23/2025).
[8] The MARC 21 Formats: Background and Principles. [Online]. Available: https://www.loc.gov/marc/96principl.html (visited on 03/12/2025).
[9] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention Is All You Need”, Aug. 2023. doi: 10.48550/arXiv.1706.03762.
[10] T. M. Mitchell, Machine learning (McGraw-Hill series in Computer Science). New York: McGraw-Hill, 2013, isbn: 978-0-07-042807-2 978-0-07-115467-3.
[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[12] J. Gui, T. Chen, J. Zhang, et al., “A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 9052–9071, Dec. 2024, issn: 1939-3539. doi: 10.1109/TPAMI.2024.3415112.
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, May 2019. doi: 10.48550/arXiv.1810.04805.
[14] I. N. Da Silva, D. Hernane Spatti, R. Andrade Flauzino, L. H. B. Liboni, and S. F. Dos Reis Alves, Artificial Neural Networks. Cham: Springer International Publishing, 2017. doi: 10.1007/978-3-319-43162-8.
[15] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature, vol. 521, no. 7553, pp. 436–444, May 2015, issn: 0028-0836, 1476-4687. doi: 10.1038/nature14539.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory”, Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997, issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
[17] T. Kohonen, “The self-organizing map”, Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990, issn: 00189219. doi: 10.1109/5.58325.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors”, Nature, Oct. 1986. doi: 10.1038/323533a0.
[19] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, arXiv:1412.6980 [cs], Jan. 2017. doi: 10.48550/arXiv.1412.6980.
[20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting”, Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014. [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html.
[21] J. Hirschberg and C. Manning, “Advances in natural language processing”, Science (New York, N.Y.), vol. 349, pp. 261–266, Jul. 2015. doi: 10.1126/science.aaa8685.
[22] J. Eisenstein, Natural Language Processing. The MIT Press, Oct. 2019, isbn: 978-0-262-04284-0.
[23] Q. Jiao and S. Zhang, “A Brief Survey of Word Embedding and Its Recent Development”, in 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China: IEEE, Mar. 2021, pp. 1697–1701, isbn: 978-1-7281-8028-1. doi: 10.1109/IAEAC50856.2021.9390956.
[24] T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 [cs], Sep. 2013. doi: 10.48550/arXiv.1301.3781.
[25] J. Pennington, R. Socher, and C. Manning, “Glove: Global Vectors for Word Representation”, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162.
[26] M. Peters, M. Neumann, M. Iyyer, et al., “Deep Contextualized Word Representations”, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana: Association for Computational Linguistics, 2018, pp. 2227–2237. doi: 10.18653/v1/N18-1202.
[27] H. Larochelle and G. E. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine”, in Advances in Neural Information Processing Systems, J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., vol. 23, Curran Associates, Inc., 2010.
[28] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention”, in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14, Montreal, Canada: MIT Press, 2014, pp. 2204–2212.
[29] D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, May 2016. doi: 10.48550/arXiv.1409.0473.
[30] A. Galassi, M. Lippi, and P. Torroni, “Attention in Natural Language Processing”, IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 10, pp. 4291–4308, Oct. 2021, issn: 2162-2388. doi: 10.1109/TNNLS.2020.3019893.
[31] B. Min, H. Ross, E. Sulem, et al., “Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey”, ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, Feb. 2024, issn: 0360-0300, 1557-7341. doi: 10.1145/3605943.
[32] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training”, 2018.
[33] C. Raffel, N. Shazeer, A. Roberts, et al., “Exploring the limits of transfer learning with a unified text-to-text transformer”, Journal of Machine Learning Research, vol. 21, Jun. 2020.
[34] Y. Wu, M. Schuster, Z. Chen, et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Oct. 2016. doi: 10.48550/arXiv.1609.08144.
[35] Bert/multilingual.md at master · google-research/bert. [Online]. Available: https://github.com/google-research/bert/blob/master/multilingual.md (visited on 03/09/2025).
[36] A. Virtanen, J. Kanerva, R. Ilo, et al., “Multilingual is not enough: BERT for Finnish”, Dec. 2019. doi: 10.48550/arXiv.1912.07076.
[37] L. Martin, B. Muller, P. J. O. Suárez, et al., “CamemBERT: A Tasty French Language Model”, 2020. doi: 10.18653/v1/2020.acl-main.645.
[38] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, and J. Pérez, “Spanish Pre-trained BERT Model and Evaluation Data”, Aug. 2023. doi: 10.48550/arXiv.2308.02976.
[39] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter”, Mar. 2020, arXiv:1910.01108 [cs]. doi: 10.48550/arXiv.1910.01108.
[40] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework”, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
[41] I. Loshchilov and F. Hutter, Decoupled Weight Decay Regularization, Jan. 2019. doi: 10.48550/arXiv.1711.05101.
[42] J. Howard and S. Ruder, Universal Language Model Fine-tuning for Text Classification, May 2018. doi: 10.48550/arXiv.1801.06146.
[43] B. Warner, A. Chaffin, B. Clavié, et al., Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, Dec. 2024. doi: 10.48550/arXiv.2412.13663.
[44] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient Transformers: A Survey”, ACM Computing Surveys, vol. 55, no. 6, pp. 1–28, Jun. 2023, issn: 0360-0300, 1557-7341. doi: 10.1145/3530811.