How Far Can a Few Shots Take? Exploring Few-Shot Learning in Finnish Text Classification Through Sentence Transformer Fine-Tuning

University of Turku
Department of Computing
Master’s Thesis
Computer Science
May 2025
Anna Salmela

The originality of this thesis has been checked in accordance with the University of Turku quality assurance system using the Turnitin OriginalityCheck service.

UNIVERSITY OF TURKU
Department of Computing

Anna Salmela: How Far Can a Few Shots Take? Exploring Few-Shot Learning in Finnish Text Classification Through Sentence Transformer Fine-Tuning

Master’s Thesis, 53 p., 2 app. p.
Computer Science
May 2025

With natural language processing solutions on the rise, language models are getting larger, with the number of parameters measured in billions, while using more and more data. In addition to this, both training a text classification model and using it later for inference can require significant computational resources. Fine-tuning has long been an effective way of adapting language models to specific domains, but it usually requires significant amounts of labelled data to succeed. In this thesis, I examine the capabilities of few-shot learning by fine-tuning sentence embedding models for text classification with artificially restricted datasets created from benchmarked Finnish data to see how well considerably lighter models with fewer data perform compared to state-of-the-art solutions.

As the main method, this thesis explores few-shot learning by using the SetFit library as a way to fine-tune sentence embedding models for text classification. SetFit enables the use of extremely small datasets for training, and dataset sizes of 8, 16, 32 and 64 samples per label are tested.
The analysis includes comparing the results from several fine-tuned models, including both monolingual and multilingual sentence embedding models, with varying tasks: multilabel register (genre) classification, multilabel toxicity detection, multiclass news category classification and multiclass discussion forum topic classification.

Even though state-of-the-art results are not reached by fine-tuning sentence embedding models, SetFit shows promise especially in the multiclass prediction tasks. While the benchmark results are higher, SetFit achieves decent model performance with smaller datasets. In some cases, it looks like 32 or even 16 examples per label might be enough to get the most out of this method. Of the different sentence embedding models tested, the 125M parameter monolingual Finnish one fares the best in all tasks when fine-tuned with SetFit.

The results of this thesis are promising for use cases where the amount of data and computational resources are limited. To my knowledge, this is the first time SetFit has been studied with Finnish data. Previously, Finnish few-shot classification has been tested with the aid of large language models, thus requiring significant computational resources. Compared to these methods, SetFit is very light to use and could lower the experimentation threshold for text classification tasks.

Keywords: natural language processing, few-shot learning, text classification, SetFit, sentence embeddings, Sentence Transformers

Contents

1 Introduction
2 Theoretical Background
  2.1 Language Models and Transfer Learning
  2.2 Word and Sentence Embeddings
    2.2.1 Word Embeddings
    2.2.2 Sentence Embeddings
  2.3 Text Classification and Few-Shot Learning
    2.3.1 SetFit
3 Methodology
  3.1 Data and Classification Tasks
    3.1.1 FinCORE
    3.1.2 Finnish Jigsaw Toxicity Challenge Dataset
    3.1.3 Yle Corpus
    3.1.4 Ylilauta Corpus
  3.2 Experimental Setup
    3.2.1 Evaluation
4 Results
  4.1 FinCORE
  4.2 Toxicity Challenge Dataset
  4.3 Yle Corpus
  4.4 Ylilauta Corpus
5 Discussion
6 Conclusion
References
Appendices
  A FinCORE Register Distribution

1 Introduction

In recent years, the field of natural language processing (NLP) has seen a lot of development in numerous areas. Recently, as a backlash to the ever-growing resource needs of large language models, smaller and more efficient foundational model architectures have gained ground [1]. Still, at the centre of all language model training remains the question of available data, which can be difficult to source especially for lower-resource languages or data-sparse domains. In this thesis, I will examine how well considerably lighter models with fewer data perform compared to state-of-the-art solutions in Finnish text classification.

Different kinds of text classification problems are at the core of NLP research. Fine-tuning foundational Transformer models has proved a powerful solution for many domains, and in some cases has already reached near human-level performance [2], [3].
However, Transformer model fine-tuning and inference take up considerable computational resources that might not be readily available to all. End-users often have to balance computational capacity and data security against model accuracy, since most existing solutions require significant computational resources, and the models are often trained and used for inference in the public cloud [4]–[6]. Lighter models that can be easily run for inference on-device might be useful for cases where high privacy is necessary, or if one desires to restrict resource consumption for sustainability reasons.

In addition to computational resources, most language models require a lot of data even for fine-tuning. This alone can create a problem in data-sparse languages and domains. The limited amount of data might be due to copyrights and other use restrictions (e.g. for data privacy reasons in the medical field) or to the poor or noisy quality of data parsed from the Internet, or there might simply be limited amounts of data in the designated domain. Even if data resources are available, annotation costs can be high. Although current trends seem to favour the quantity of data over its quality, research indicates that in order to get high-quality results, one must use high-quality data that is suitable for the task and domain at hand [7]–[12]. Methods that require fewer data and computational resources could possibly lower the AI experimentation and adoption threshold.

If data availability is a problem, what if there was a way to fine-tune language models with fewer data? These kinds of few-shot learning methods do exist, but many rely on large language models (LLMs) and thus do not answer the question of limited computational resources, as they are quite expensive when used for inference.
The Hugging Face library SetFit addresses both issues by harnessing the capabilities of Sentence Transformers and contrastive learning, and proposes a method for fine-tuning Sentence Transformers for text classification purposes with as few as eight examples per label [13]. Its authors also report success in multilingual experiments, which is echoed by many others [7], [14]–[17].

The issue with Finnish NLP is that there are not many labelled datasets, and only a few reported benchmarks for text classification [2], [9], [18]. Some datasets have been created by machine-translating labelled English data [12], [18]. There has been success in text classification in Finnish with fine-tuning both monolingual Finnish base models [2], [12] as well as multilingual models [9]. These scenarios, however, have utilised large datasets, and there are very few examples of research on few-shot text classification in Finnish. The existing ones have used LLMs for in-context learning [3], [18]. Fine-tuning Sentence Transformers for Finnish few-shot text classification has not been researched at all.

The goal of this thesis is to benchmark the SetFit few-shot fine-tuning method for Finnish text classification, as well as to understand which kinds of tasks it is the most suitable for. As this research will be the first reported benchmark for fine-tuning Sentence Transformers for few-shot classification in Finnish with SetFit, it will provide insights on how the method works with different kinds of Sentence Transformer models and data formulations. It will also add to the existing pool of Finnish few-shot experimentation. If the method works, it will lower the threshold for NLP experimentation on lower-end devices and in data-sparse domains, even in Finnish.

With these goals in mind, I will conduct the research by focusing on the following research questions:

1. How well do the fine-tuned Sentence Transformer models perform with Finnish tasks compared to the reported benchmarks? Can the SetFit method reach results comparable to the state-of-the-art?
2. Is there a difference in performance when fine-tuning Finnish Sentence Transformer models versus multilingual Sentence Transformer models with Finnish data?
3. Is there a difference in performance when fine-tuning Sentence Transformers with multilabel classification versus multiclass classification in Finnish?

My hypothesis regarding the first research question is that even though SetFit seems like a promising solution, it will not reach state-of-the-art results. This hypothesis is backed up by the results from the original authors [13]. It remains to be seen how much the SetFit method’s results differ from the state-of-the-art. If the difference is minimal, SetFit could offer a good alternative for text classification fine-tuning as it enables faster training with fewer data.

Regarding the second research question, there are successes in Finnish text classification in favour of both monolingual [2], [12] and multilingual models [9], [19], although the differences between the two approaches have not been large. Therefore, I do not have a clear hypothesis for this research question.

For the third research question, my hypothesis is that multiclass classification tasks with one class per example will fare better than multilabel tasks, where one example might belong to many different classes. This is due to SetFit’s fine-tuning process, which has the purpose of distancing the different classes in the embedding vector space [13]. This should be easier if the classes are distinct from one another.

This thesis will be structured in the following manner:

Chapter 2 Theoretical Background: I will introduce the main concepts used in this thesis.
These include machine learning, language models, transfer learning, word and sentence embedding representations, text classification and few-shot learning, as well as the main method for the experiments conducted in this research, the Hugging Face library SetFit.

Chapter 3 Methodology: I will introduce the data used in the experiments as well as the experimental setup, including the task definition and outline, as well as the evaluation methods used to compare the SetFit performance to the original benchmarks.

Chapter 4 Results: In this chapter, I will go through the classification results of the fine-tuned Sentence Transformer models as well as look deeper into the learning process when fine-tuning models with SetFit. I will compare different sizes of datasets with all the tasks and models I have chosen to test.

Chapter 5 Discussion: I will gather my findings and draw up some generalisations based on the results presented in Chapter 4. In addition, I will examine the results of the analysis process and the observations that I have made based on it. This chapter will also include any limitations or problems that have arisen during the research.

Chapter 6 Conclusion: This chapter will sum up the thesis and introduce ideas for further research.

2 Theoretical Background

In this chapter, I will introduce the idea behind modern neural network architectures and take a look at utilising neural networks in natural language processing in the sense of language models and transfer learning. I will also touch on the special case of few-shot learning when fine-tuning pretrained language models.

The first steps in neural network research were taken as early as 1943 by McCulloch and Pitts [20] with the idea of modelling the neural activity of the human brain. However, the first artificial neural network was created in 1958 by Rosenblatt [21].
This network model is called a perceptron, and it consists of a single layer that imitates the behaviour of a neuron. A perceptron calculates a weighted sum of the inputs fed to it and feeds it through a threshold function before outputting a result, such as probabilities for different classes in a classification task. Even though Rosenblatt’s invention has only a single layer, perceptrons can be stacked in order to create a multilayer perceptron with one or more hidden layers between the input and output layers. This kind of multilayer network is often called a deep neural network. The perceptron’s simple architecture is the basis of modern neural network solutions.

When speaking of machine learning, it is sensible to make the distinction between supervised and unsupervised learning. Supervised learning has a certain predefined goal, be it a correct translation, an accurate audio transcript or correctly predicted labels in a classification task. In unsupervised learning, such truth does not exist, but the machine is left to find patterns in the given training data on its own. An example of an unsupervised machine learning task is topic modelling [22]. This thesis will focus on supervised machine learning, although Sentence Transformers are suitable for unsupervised tasks as well [23]. The tasks chosen to evaluate the performance of few-shot learning will be classification tasks, which are supervised by nature.

2.1 Language Models and Transfer Learning

Language models are an essential part of natural language processing. Training a machine learning model from scratch every time you need to change or update your task is not usually a great idea, as training a well-performing model often requires considerable computational resources, data, and time. Often, the smart solution is to utilise previous knowledge in the form of pre-trained models.
In the context of NLP, these pre-trained models are called language models, which calculate probability distributions of words in a language.

There are several different types of architecture and levels of complexity for language models. At its simplest, a language model may refer to an n-gram model with likelihoods of n-length sequences of words; for example, a bi-gram model calculates the probability of sequences of two consecutive words. This type of language model is purely statistical and does not take the surrounding context into account [22]. Neural language models were created as a context-aware alternative to statistical language models. Examples of neural models are Recurrent Neural Networks, which introduced a hidden state that serves as memory [24], and Transformers [25], which are built on the attention mechanism. On the more complex end of the spectrum are general-purpose large language models with billions of parameters, such as GPT [26]–[28], BLOOM [29] and LLaMA [30]. The field of language models is rapidly evolving, and new types of breakthroughs are continuously introduced.

Language models can be fine-tuned for specific tasks while preserving the previous knowledge of a language the model has already learnt. This is called transfer learning. In order to utilise transfer learning, you must first choose a base model for fine-tuning. Then you need to fine-tune your model with data specifically curated for the task at hand. A language model can be fine-tuned for text classification (such as toxicity detection, register classification and spam detection), part-of-speech tagging, named entity recognition, and many other tasks. A model will only learn from the data provided to it, and it will not have the capability to apply the information in ways unknown to it.
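As a small illustration of the statistical end of the spectrum described in this section, the following sketch estimates bi-gram probabilities from raw counts. The toy corpus and the function name `bigram_probabilities` are illustrative assumptions and not part of the thesis experiments; the estimate is a plain relative frequency without smoothing.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sentences):
    """Estimate P(next word | previous word) by relative frequency."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence boundary markers let the model score first and last words.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for previous, following in zip(tokens, tokens[1:]):
            pair_counts[previous][following] += 1
    return {
        previous: {
            following: count / sum(followers.values())
            for following, count in followers.items()
        }
        for previous, followers in pair_counts.items()
    }


# Hypothetical toy corpus, chosen only to keep the counts easy to follow.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
probs = bigram_probabilities(corpus)
print(probs["the"])
```

Because the model only looks at the immediately preceding word, it cannot use any wider context, which is the limitation that neural language models address.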
A pitfall that one might encounter while fine-tuning for a specific task is overfitting: a phenomenon where a model adapts too closely to the target task and the training data it has received [22]. In fine-tuning, this is closely related to catastrophic forgetting, where the model forgets previously learnt information as it is replaced with the new domain knowledge. Overfitting leads to models that do not perform well with unseen data. There are several ways to prevent overfitting, including using a designated validation set for parameter tuning in addition to training and test datasets, using cross-validation, or using dropout, where randomly selected units are dropped from the neural network during training [22].

Even though there are some datasets for training machine learning models for specific tasks, Finnish language model resources are somewhat limited. There are some monolingual Finnish general-purpose language models such as FinBERT [2], used previously in e.g. [9], [12], and a Finnish version of a sentence BERT, Finnish Paraphrase [31], which is trained on the basis of FinBERT. Another solution with more options is to try fine-tuning a multilingual model such as XLM-RoBERTa [32] or multilingual M-BERT [33]. Large language models also provide multilingual opportunities, but in this thesis, I will focus on smaller-scale Transformer models.

According to Rönnqvist et al. [19], multilingual models might have an advantage over monolingual models, although Eskelinen et al. [12] have shown that with a sufficient amount of data, fine-tuned monolingual models outperform multilingual models in monolingual tasks. In this thesis, I will utilise transfer learning by fine-tuning existing sentence embedding models for certain classification tasks.
I will compare the performance of both monolingual Finnish and multilingual embedding models with the reported state-of-the-art benchmarks that exist at the time of writing. With the constant advances in the field of NLP, there might already be solutions that outperform the benchmarks that I have used in this thesis.

2.2 Word and Sentence Embeddings

"Embedding" is an umbrella term for a numerical representation of language, images, audio, and data in general: essentially, an embedding is a vector. In the context of natural language processing (NLP), embeddings most often refer to words, sentences, or texts mapped into a vector space with predefined dimensions [22]. Embeddings derived from text can be used for measuring similarity between different words or documents, text classification, document clustering or grouping, or feature extraction for further NLP tasks.

The notion of modelling language with vectorised words or sentences is nothing new. Most language models use vocabularies that map words into numbers. A similar thought can be applied to word and sentence embeddings, although they also model semantic relationships within the language. A famous example of this is the word vector arithmetic "King" - "Man" + "Woman", which results in "Queen", introduced by Mikolov et al. in 2013 [34] and visualised in Figure 2.1. With sentence embedding models, we are able to retain more context and even complex semantic structures.

Figure 2.1: Representation of words in a vector space

Embedding models are language models aimed at grouping semantically similar words or sentences closer together in the vector space, and correspondingly creating distance between semantically distant words or sentences [13]. The output of an embedding model is a vector, a dense numerical representation of text that can be used in downstream machine learning tasks.
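The "King" - "Man" + "Woman" arithmetic and the idea of comparing embeddings by angle can be sketched with a few hand-made vectors. The two-dimensional toy vectors below are assumptions made purely for illustration (real embeddings are learned and have hundreds of dimensions), and `analogy` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors: axis 0 loosely encodes "royalty", axis 1 "gender".
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "prince": [0.8, 0.7],
    "apple": [-0.7, 0.1],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vectors[word]))


result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice when embeddings of texts with different lengths are compared.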
In this section, I will briefly introduce some of the most influential advances. The models are often validated using a similarity metric to measure the distance in the vector space by calculating either the angle or the distance between the embedding vectors being compared [35].

2.2.1 Word Embeddings

As mentioned above, Mikolov et al. [34] introduced the idea of capturing semantic relations between word vectors, which they refer to as analogies. Their Word2vec models improved the results from previous neural network architectures, using unsupervised CBOW and Skip-gram methods to create vector representations of words from large datasets. These models were evaluated with semantic and syntactic word similarity tasks, and performed especially well compared to previous models in semantic tasks while being much more computationally efficient.

Global Vectors (GloVe) [35] combine the local context implementation used in the Word2vec Skip-gram model with global matrix factorisation. Pennington et al. criticise the two methods for poor generalisability: each performs well in either statistical tasks or semantic analogy tasks but fails in the other. As a solution for this problem, they propose an unsupervised global log-bilinear regression model architecture that utilises word co-occurrences and captures global corpus statistics. The resulting model is somewhat lighter than Word2vec models and outperforms them in accuracy in word similarity, word analogy and named entity recognition tasks.

The aforementioned models create an embedding vector for each word form, which results in several vectors per word in morphologically complex languages. FastText [36] takes word morphology into account: instead of assigning each word a distinct vector, its authors propose a model where a word is represented as a sum of character n-gram vectors. The model is an extension of the Skip-gram model introduced in [34] and outperforms the previous implementation.
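The subword idea behind FastText can be sketched by extracting the character n-grams of a word. The boundary markers `<` and `>` and the n-gram lengths of 3 to 6 follow the setup described in the FastText paper; the function name `char_ngrams` and the Finnish example words are illustrative assumptions. In the actual model, each n-gram has a learned vector and the word vector is their sum, which this sketch does not reproduce.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.add(marked[start:start + n])
    return grams


# "talo" (house) and its inessive form "talossa" (in the house).
base = char_ngrams("talo")
inflected = char_ngrams("talossa")
shared = base & inflected
print(sorted(shared))
```

The shared n-grams show how an inflected form that is rare or missing in the training corpus can still reuse information learned from other forms of the same word.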
Languages such as Finnish, with its 15 cases for nouns, can benefit from models that take word morphology into account, since some variations of a word might appear rarely or not at all in a training corpus. Because of the character-level representation of words, the FastText model is able to use data more robustly and thus requires fewer data and can better model infrequent words or even out-of-vocabulary words.

2.2.2 Sentence Embeddings

Previously, sentence embedding vectors had been obtained by mapping word embedding vectors to sentences. Skip-Thought [37] applies the idea of unsupervised word embedding model training to sentences with a modified Skip-gram implementation: instead of using a word to predict surrounding words, its authors tried predicting surrounding sentences from a given sentence. This performed better than the previous implementations that had been utilising word embeddings to map sentences.

Conneau et al. and their InferSent model [38] provide a supervised alternative for sentence embedding retrieval. They compare multiple neural network architectures and propose a bi-directional long short-term memory (LSTM) model as a solution for retrieving sentence vectors. As training data, they used the SNLI corpus [39] with sentence pairs labelled with "entailment", "contradiction" and "neutral" tags. They found that sentence embeddings work well in transfer learning tasks, and that the proposed solution is more computationally efficient compared to the unsupervised methods.

The SNLI data used in InferSent training is also used in the training process for Universal Sentence Encoder [40]. Cer et al. found that the Transformer architecture [25] is optimal for training sentence embedding models for transfer learning. They evaluated the model by providing embeddings from the models to task-specific deep neural networks.
They also tested training a deep averaging network architecture where input embeddings are averaged before passing through the network; this implementation is computationally cheaper but does not perform as well as the Transformer-based models.

Although Transformer language models such as BERT [33] and RoBERTa [41] can be used to derive sentence embeddings by, for example, averaging the last hidden layer into a fixed-size vector, the results are usually worse than those of even basic word embedding models such as GloVe [23]. Sentence Transformers (SBERT) [23] is based on the BERT architecture and uses Siamese networks to derive sentence embeddings while retaining semantic meaning. This means that the model is fine-tuned by comparing the pooling outputs of two sentences and updating the model’s weights to capture the semantic meaning. A visualisation of the model architecture can be viewed in Figure 2.2.

Figure 2.2: SBERT architecture [23]

SBERT was a significant improvement over earlier sentence embedding models, and compared to the original BERT, the computational cost is very low when computing an embedding for a sentence [23]. This is important especially when trying to find the most similar sentence pairs in a large dataset, since embedding similarity must be calculated for each sentence pair. The model is fine-tuned with the Natural Language Inference (NLI) datasets SNLI [39] and MNLI [42] and validated with both supervised and unsupervised tasks, including fine-tuning the SBERT model with a regression objective function. Contrary to [38] and [40], Reimers and Gurevych [23] do not recommend SBERT for transfer learning, even though it surpasses the previous sentence embedding models in transfer learning tasks. Instead, they suggest that the original BERT should be used for the purpose.

NLI datasets have been a solid base for Sentence Transformer training.
These kinds of datasets are annotated with information on whether two phrases contain positive, negative or neutral textual entailment [39]. In addition to NLI datasets, other common ways to train a Sentence Transformer are by using paraphrase or machine translation datasets [31].

2.3 Text Classification and Few-Shot Learning

Text classification is a fundamental NLP task, where a given text sample is assigned one or multiple predefined classes. Classification tasks can be divided into binary classification with two available classes, such as spam detection (spam or not spam), multiclass classification with more than two available classes, such as sentiment detection (positive, negative, or neutral sentiment), and multilabel classification, such as emotion detection, where each sample can be assigned several coexisting classes [22].

Text classification can be approached with more traditional supervised machine learning methods, such as support vector machines or naive Bayes algorithms [22], but recently, the state-of-the-art results in classification tasks have been achieved with fine-tuned Transformer models [9], [12]. Although LLMs can also be used for zero-shot and few-shot text classification tasks, smaller models fine-tuned with larger datasets still often achieve the best results [43].

Traditional classification tasks may require large amounts of data in order to train a well-performing machine learning model. Textual datasets are often imbalanced, and it is not always possible to acquire an adequate number of data points in order to teach the model to reliably recognise the correct class, be it due to issues in sampling or the cost of annotations. This can be the case in online register classification [9], [19], [44], toxicity and cyber harassment detection [12], [45] as well as different tagging tasks such as part-of-speech tagging and named entity recognition [46], among others.
When a dataset is imbalanced with classes that are under-represented, the model will be biased towards the major classes in the dataset. Optimising the overall accuracy of the model might lead to ignoring the minor classes. Strategies for dealing with imbalanced datasets include, for example, data augmentation for under-represented classes with generative language models [47], using a different loss function (dice loss instead of binary cross-entropy) [46] and converting classification tasks into entailment prediction tasks [48].

Few-shot learning approaches are options to consider when the amount of data is limited or when the data contains classes with few examples [13]. "Shots" are used to describe the number of input-output pair examples introduced in training the model; thus, few-shot means training a model to perform a task with only a few examples of data points. Another case is zero-shot classification, where the model is used to predict cases in unseen domains or languages. This could, for example, mean natural language inference (NLI) tasks [39], [42], [49], or data in a language that has not been used in training or fine-tuning the model used for predictions [19].

When you have some labelled data, there are several ways you can approach your classification problem. Often, a simple solution could be to use your small dataset to fine-tune a pre-trained language model such as RoBERTa [41] or BERT [33]. However, if your dataset is minimal or very imbalanced, achieving good classification results might not be possible [11], [13], [15]. You could also try in-context learning, parameter-efficient fine-tuning or pattern exploiting training [50]. According to Tunstall et al. [13], these approaches can be impractical, as they often rely on large language models with billions of parameters, such as GPT-3 [26] and GPT-4 [28]. This means that the required computing resources are significant and not accessible to everyone.
In addition, parameter-efficient fine-tuning and pattern exploiting training tasks require manually generated prompts and are thus dependent on the quality of the prompt engineering.

Despite previous successes in Finnish NLP [2], [9], [12], [18], [31], [51]–[53], there is very little research on few-shot text classification in Finnish. The existing research is limited to Kortesalmi’s [3] comparison of an LLM-reliant in-context learning method to traditional machine learning algorithms and a fine-tuned Transformer model, and a few-shot evaluation of the Finnish GPT (FinGPT) presented by Luukkonen et al. [18]. One of the goals of this thesis is to explore this area of Finnish NLP and add to the existing research.

Figure 2.3: SetFit performance compared to fine-tuned RoBERTa-large in customer review sentiment classification [13]

2.3.1 SetFit

As mentioned above, many few-shot solutions rely on large language models and are thus quite resource heavy. But what if you are in a situation where you lack computational resources in addition to data?

Tunstall et al. propose a solution to a data-sparse task with SetFit [13], a few-shot fine-tuning framework based on Sentence Transformers [23], an architecture not originally intended for transfer learning. Sentence Transformers have previously been used for text classification by applying a logistic regression function on top of the sentence embeddings retrieved from the model [54], [55], but SetFit introduces the idea of fine-tuning any given Sentence Transformer model in a Siamese manner with the classification task in mind. Because SetFit supports any Sentence Transformer, it provides multilingual support if the user wants to use a multilingual model. On top of all this, Tunstall et al. claim that only 8 training examples are needed for a fine-tuned SetFit model to perform at a competitive level compared to a RoBERTa-large fine-tuned with the full dataset.
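A few-shot setting with a fixed number of examples per label, such as the 8 examples mentioned above, can be simulated by restricting a larger labelled dataset. The following is a minimal sketch under stated assumptions: the helper `sample_per_label`, the toy data and the seed are hypothetical, and this is not necessarily how the restricted datasets in this thesis were created.

```python
import random
from collections import defaultdict


def sample_per_label(examples, n, seed=0):
    """Pick up to n (text, label) pairs per label, reproducibly for a given seed."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        # Labels with fewer than n examples contribute everything they have.
        subset.extend(rng.sample(pool, min(n, len(pool))))
    return subset


# Hypothetical toy data: three labels with 20 examples each.
full = [
    (f"text {i} about {label}", label)
    for label in ("sports", "politics", "culture")
    for i in range(20)
]
few_shot = sample_per_label(full, n=8)
print(len(few_shot))
```

Fixing the seed keeps the restricted dataset identical across runs, which makes results obtained with different models comparable.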
SetFit’s performance is evaluated against a fine-tuned RoBERTa-large and several other few-shot methods: the pattern-exploiting training (PET) based ADAPET [56] and Perfect [57], and the parameter-efficient fine-tuning (PEFT) based T-Few [58]. The evaluation covers several test datasets, including emotion recognition, sentiment detection, spam detection and news topic detection [13]. On average, the SetFit method outperforms both ADAPET and Perfect, is comparable to the 3B-parameter T-Few model with n = 8 examples while being 27 times smaller, and ends up outperforming it with n = 64 examples [13]. Figure 2.3 displays the performance against a fine-tuned RoBERTa: while never quite reaching the accuracy achieved by a RoBERTa model fine-tuned with the full customer review sentiment dataset, the SetFit fine-tuned MPNet model outperforms the RoBERTa model when training data is scarce.

Compared to the T-Few method, which performs on a similar level, SetFit is faster in both training and inference [13]. With n = 8 examples per label, SetFit takes about 30 seconds to train. T-Few takes over 20 times longer than that and requires more GPU memory to do it. In addition, SetFit needs significantly less disk storage space: between 26 and 163 times less than T-Few with the tested models [13]. These factors add to the attractiveness of the SetFit method for real-world applications.

Multilingual Experiments

Tunstall et al. also perform multilingual experiments with a multilingual MPNet model¹ compared to a cross-lingual XLM-RoBERTa-base² and ADAPET. The evaluation is done with the Multilingual Amazon Reviews Corpus, which contains reviews in English, Japanese, German, French, Spanish, and Chinese [59]. SetFit outperforms both methods with limited data, but its performance is weaker than that of an XLM-RoBERTa model fine-tuned with the full dataset [13].

SetFit has been used successfully in many low-resource language research tasks.
These include offensive content detection in Tamil [16], radiological text classification in Danish [17], few-shot classification benchmarking in Polish [15], legal judgment prediction in Korean [14], and Arabic dialect sentiment analysis [60]. It also shows promise in Dutch protest tweet classification [7], legal text classification [11] and financial domain text classification [8]. Even though certain experiments are done with a larger number of data samples [7], [14], [16], [17], [60], performance comparable to fine-tuned Transformers or other few-shot methods is achieved in true few-shot settings as well [8], [11], [15]. SetFit sometimes even outperforms a fine-tuned Transformer model [11], [16]. In [15], SetFit comes in second place when benchmarking few-shot methods for various classification tasks in Polish but achieves significantly lower results than in-context learning with GPT-3.5. Loukas et al. [8] found that by combining SetFit with representative samples chosen by a human expert, they were able to surpass state-of-the-art results in the financial domain.

¹ huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
² huggingface.co/xlm-roberta-base

There is no previous research or benchmarking for SetFit in Finnish few-shot classification. The multilingual benchmarks do not include Finnish or other languages from the Finno-Ugric language family in their datasets, even though the multilingual Sentence Transformer model used in the experiments in [13] does include Finnish in its training data [61]. This thesis aims to expand this field of research into Finnish language tasks and to assess the suitability of few-shot classification, and SetFit in particular, for a relatively low-resource language.

Training

The SetFit training process is divided into two steps, described as follows [13]:

1. Fine-tune a Sentence Transformer with contrastive sentence pairs in a Siamese manner
2. Train a classification head with data that is generated by the Sentence Transformer that is fine-tuned in the first step

Figure 2.4: SetFit fine-tuning process [13]

Figure 2.4 illustrates the steps of the Sentence Transformer fine-tuning and classification head training. In the first step, the Sentence Transformer model is fine-tuned in a Siamese manner. This is achieved by essentially augmenting the data with contrastive pairs: sets of positive pairs (randomly chosen examples with the same label) and negative pairs (randomly chosen examples with different labels) are generated for each label in the dataset. In the fine-tuning phase, the model generates embeddings for each pair it receives, and its weights are updated depending on the pair type: the embeddings of positive pairs are pulled closer together and the embeddings of negative pairs are pushed further apart.

This contrastive training method enables the use of small datasets, since it compares pairs instead of individual examples. For example, a classification task with eight examples from each of two discrete classes will result in $\binom{8}{2} \cdot 2 = 56$ positive pairs and $8 \cdot 8 = 64$ negative pairs, augmenting the dataset from 16 examples up to 120 unique pairs. The number of unique pairs grows quadratically with the number of examples.

After fine-tuning the Sentence Transformer model, a classification head is trained on top of it. The training data is encoded by the fine-tuned Sentence Transformer, and the resulting embeddings together with the labels form the training set. In [13], a logistic regression model is used as the classification head. Since few-shot training might be unstable [62], [63], the authors created 10 random train splits in order to estimate the model’s performance.
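The pair arithmetic in the example above can be checked with a few lines of Python. This is a minimal sketch that enumerates every unique pair; the function name `build_pairs` and the toy data are hypothetical, and the SetFit library itself may sample pairs rather than enumerate all of them.

```python
from itertools import combinations, product


def build_pairs(examples_by_label):
    """Enumerate all unique positive and negative pairs (illustrative only)."""
    positives, negatives = [], []
    labels = sorted(examples_by_label)
    # Positive pairs: two different examples that share a label.
    for label in labels:
        positives.extend(combinations(examples_by_label[label], 2))
    # Negative pairs: one example from each of two different labels.
    for label_a, label_b in combinations(labels, 2):
        negatives.extend(
            product(examples_by_label[label_a], examples_by_label[label_b])
        )
    return positives, negatives


# Toy data: two classes with eight examples each, as in the text.
toy = {
    "class_a": [f"a_{i}" for i in range(8)],
    "class_b": [f"b_{i}" for i in range(8)],
}
positives, negatives = build_pairs(toy)
print(len(positives), len(negatives), len(positives) + len(negatives))
```

With two classes of eight examples each, the sketch yields 56 positive and 64 negative pairs, 120 in total, which matches the figures given in the text.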
3 Methodology

In this chapter, I will introduce the data used in the study, as well as the corresponding benchmarks achieved with said data. Following that, I will outline the experimental setup of the thesis.

3.1 Data and Classification Tasks

For the experiments, I will use pre-existing, monolingual Finnish text classification corpora with reported fine-tuning benchmarks. I have chosen both mono-label and multilabel corpora, with the "simplest" corpora containing only 10 unique classes [64], [65], and the most complex one containing 39 different classes with the possibility of multiple classes per example [9]. Another variable in the corpora is the style of language, which varies from more formal news texts [9], [65] to informal or outright offensive language [9], [12], [64]. Some of the corpora also contain machine translations [9], [12] of varying quality and transcripts of spoken language [9].

3.1.1 FinCORE

The FinCORE corpus¹ consists of 10 754 Finnish web-crawled texts labelled into nine main registers and 30 subregisters (genres) [9]. One text can have more than one assigned register, which makes it a hybrid. There are 810 hybrids in total in the corpus, of which 581 are in the training set.

¹ https://github.com/TurkuNLP/FinCORE_full

Figure 3.1: FinCORE register distribution

| Register | N | F1-score |
|---|---|---|
| Narrative | 754 | 0.86 |
| Opinion | 284 | 0.78 |
| Informational description | 362 | 0.72 |
| Informational persuasion | 252 | 0.77 |
| Interactive discussion | 231 | 0.84 |
| How-to/instructions | 117 | 0.71 |
| Lyrical | 5 | 0.13 |
| Machine-translated/generated | 276 | 0.98 |
| Spoken | 18 | 0.64 |

Table 3.1: FinCORE main register test set classification results

The largest main register in the corpus is "Narrative" with 3 956 texts comprising 34.32 % of the whole corpus. On the other hand, the smallest main register, "Lyrical", only has 25 texts labelled to it. All examples that are tagged with a subregister have the label for the corresponding main register as well.
A visualisation of the label distribution can be viewed in Figure 3.1 as well as in Appendix A.

Versions of the dataset have been used in several benchmarks [19], [44], [66], the most recent being Skantsi & Laippala [9], who achieved F1-scores of 0.78 with monolingual FinBERT and 0.79 with multilingual XLM-RoBERTa (XLM-R) when using all the register labels, including hybrid examples. They also reported a CNN baseline of 0.6. In [44], Laippala et al. report an AUC score of 83.8 when training with only six main register labels: Narrative, Opinion, Informational description, Interactive discussion, Informational persuasion, and How-to.

Skantsi and Laippala [9] observe some variation in register-specific classification results. Machine-translated texts achieved the highest F-score as a category, 0.98, followed by Narrative, the largest main register, with an F-score of 0.86. The second largest main register, Informational Description (n=1 719, 14.91% of the full corpus), gets a considerably lower F-score of 0.72, but some main registers get high scores even with fewer examples, such as Interactive Discussion with an F-score of 0.84 (n=1 081, 9.38% of the full corpus).

The authors note that many registers with a high number of examples are predicted well, such as Personal blogs, News reports and Descriptions with intent to sell. Examples of high-performing subregisters with a low number of examples are Sports reports, Question-answer forums, Research articles, Job descriptions, Reviews, Discussion Forums and Religious texts. All of these are examples of structurally consistent registers, which might explain good performance even with proportionally low coverage [9].

Conversely, registers with few examples, such as Reports, FAQs, Poems, Course materials, Advice and Informational blogs, do not perform well. On the other hand, some registers have many examples and still underperform, e.g. Magazine/online articles and Community blogs.
This could be explained by the lack of clear structure or characteristics and by inner variation within the register [10].

3.1.2 Finnish Jigsaw Toxicity Challenge Dataset

As a second multilabel task, I am using TurkuNLP’s machine-translated version of the Jigsaw Toxicity dataset², specifically the version translated from English to Finnish with DeepL. The corpus consists of 223 549 comments from Wikipedia talk page discussions and is labelled with six categories: "Toxicity", "Severe toxicity", "Threat", "Obscene", "Insult", and "Identity attack". The majority of the comments have not been assigned any of these labels, and comments without any labels are treated as a seventh, "Clean", category [12]. The full label distribution can be viewed in Table 3.2.

| Label | Train | Test | Total |
|---|---|---|---|
| Identity attack | 1 405 | 712 | 2 117 (0.95 %) |
| Insult | 7 877 | 3 427 | 11 304 (5.05 %) |
| Obscene | 8 449 | 3 691 | 12 140 (5.43 %) |
| Severe toxicity | 1 595 | 367 | 1 962 (0.88 %) |
| Threat | 478 | 211 | 689 (0.31 %) |
| Toxicity | 15 924 | 6 090 | 22 014 (9.85 %) |
| Clean | 143 346 | 57 735 | 201 081 (89.95 %) |

Table 3.2: Toxicity dataset label distribution

² https://huggingface.co/datasets/TurkuNLP/jigsaw_toxicity_pred_fi

Eskelinen et al. [12] report an F-score of 0.66 for FinBERT and 0.65 for XLM-RoBERTa, which are comparable to the original 0.69 F-score of the fine-tuned English BERT model. From now on, I will use FinBERT’s F-score, the higher of the two Finnish results, as the benchmark for this task.

The authors highlight that the task’s ambiguity might influence the classification results. Even after weighting the labels to favour the smaller classes, they observed a large number of misclassifications into the "Clean" category, which accounts for over 89 % of the dataset and is the largest of the seven categories. This might also be due to the difficulty of the task or the subjectivity involved in annotation. Additionally, they suggest that subtle differences in nuance introduced by machine translation may also contribute to these errors.
Another significant observation in their study is the frequent misclassification of examples from the "Severe toxicity" and "Threat" categories, which were under-predicted.

3.1.3 Yle Corpus

The Yle corpus [65] has previously been used to evaluate the Finnish BERT model, FinBERT [2]. The dataset is created with Sampo Pyysalo’s tools³ and contains 120 000 news articles, each labelled with one of the 10 most frequent topics, with 12 000 examples per label. Virtanen et al. [2] report a 91.76% accuracy on a text classification task with FinBERT uncased, fine-tuned with a balanced dataset of 100 000 examples.

³ https://github.com/spyysalo/yle-corpus

3.1.4 Ylilauta Corpus

Similar to the Yle corpus, the Ylilauta corpus [64] used in the FinBERT evaluation [2] is labelled into 10 classes, one class per example, and is created with Sampo Pyysalo’s tools⁴. In contrast to the Yle corpus’ formal news texts, the Ylilauta corpus consists of 120 000 examples from a Finnish online discussion forum, from which Virtanen et al. [2] have chosen the most frequent categories, with 12 000 examples per label. They report an 82.20% accuracy when fine-tuning FinBERT uncased with 100 000 examples balanced by class.

The authors note that this corpus performs considerably better with the monolingual FinBERT than with the multilingual BERT, which might stem from the data used in the models’ training: whereas the FinBERT training material contains informal Finnish, the training material of their comparison models, fastText embedding models and multilingual BERT, does not.

3.2 Experimental Setup

The results from Tunstall et al. [13] are promising in several aspects. First, they offer a solution for limited amounts of data, thus alleviating the cost of annotation. Secondly, their method uses a fraction of the resources that fine-tuning a traditional Transformer model requires, making the training process more accessible on lower-end devices.
Lastly, they present performance close to that of the fine-tuned Transformer models in several different classification tasks. In this thesis, I will try to replicate their success and test Finnish classification tasks with artificially restricted datasets.

For the choice of sentence embedding models to be fine-tuned, I have used the monolingual Finnish paraphrase model⁵ [31] as a baseline. As preliminary criteria, I have used the following:

1. The model must support the Finnish language.
2. The model must be around 1GB or preferably smaller.

⁴ https://github.com/spyysalo/ylilauta-corpus
⁵ https://huggingface.co/TurkuNLP/sbert-cased-finnish-paraphrase

To guide my choice in the sea of embedding models, I used Hugging Face’s Massive Text Embedding Benchmark leaderboard⁶ for English tasks to choose three models: multilingual-e5-small⁷, paraphrase-multilingual-MiniLM-L12-v2⁸, and paraphrase-multilingual-mpnet-base-v2⁹, the last of which has also been used in the multilingual section of the original paper. Since the leaderboard does not explicitly include the Finnish language, this might have left out some potential models. In addition, I decided to test paraphrase-xlm-r-multilingual-v1¹⁰, since it is based on the cross-lingual XLM-RoBERTa that has fared well in the benchmarks. In the end, I chose to compare five different models, which are described in more detail in Table 3.3. For clarity, from now on I will refer to the models with the aliases listed in the table.

⁶ https://huggingface.co/spaces/mteb/leaderboard
⁷ https://huggingface.co/intfloat/multilingual-e5-small
⁸ https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
⁹ https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
¹⁰ https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1

| Model | Alias | Parameters | Size | MTEB |
|---|---|---|---|---|
| sbert-cased-finnish-paraphrase | FinSBERT | 125M | 0.50GB | - |
| multilingual-e5-small | e5-small | 118M | 0.44GB | 124 |
| paraphrase-multilingual-MiniLM-L12-v2 | MiniLM | 118M | 0.44GB | 148 |
| paraphrase-multilingual-mpnet-base-v2 | MPNet | 278M | 1.04GB | 143 |
| paraphrase-xlm-r-multilingual-v1 | XLM-R | 278M | 1.11GB | - |

Table 3.3: Sentence embedding models chosen for the experiments

As per Tunstall et al. [13], I will run my experiments by fine-tuning the embedding models for one epoch with a fixed learning rate and no further hyperparameter optimisation. For my experiments, I have chosen to use a learning rate of 2e-5, which is the default learning rate provided by the SetFit library. To compare performance across different training set sizes, I will divide the training set for each task into sets of 8, 16, 32 and 64 examples per class label. In multilabel tasks, each label might have fewer or more matches, depending on the label distribution in the dataset. In order to take variation into account, each sample size will have 10 randomised training sets that will be used when fine-tuning the different embedding models.

In the fine-tuning step, I will use the validation sets provided with the datasets, or in the case of the toxicity challenge dataset, I will split the training set with the same set sizes as Eskelinen et al. [12] before further division into the different training sets. This might not be necessary, since I am only going to train each model for one epoch, and I am not going to carry out any hyperparameter optimisation or, for example, use early stopping to prevent overfitting. The final results of the experiment will be reported based on the models’ predictions on the provided test datasets. Something to note is that several classes in the FinCORE corpus have fewer than 64 or even 32 examples: this means that it is impossible to create unique, balanced training sets with as many examples.
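The sampling scheme described above can be sketched as follows. This is an illustrative sketch for the mono-label case only, not the implementation used in the experiments: the names `sample_few_shot` and `make_splits` and the toy data are hypothetical, and multilabel sampling would need extra care because one example can match several labels. Taking a prefix of a seeded shuffle is a design choice of this sketch; it caps small classes at their available size and makes each smaller set a subset of the larger ones drawn with the same seed.

```python
import random
from collections import defaultdict


def sample_few_shot(examples, n_per_label, seed):
    """Pick up to n_per_label (text, label) examples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)
    subset = []
    for label in sorted(by_label):
        pool = list(by_label[label])
        rng.shuffle(pool)
        # Classes with fewer than n_per_label examples contribute all they have.
        subset.extend((text, label) for text in pool[:n_per_label])
    return subset


def make_splits(examples, sizes=(8, 16, 32, 64), n_splits=10):
    """Create n_splits randomised training sets for every sample size."""
    return {
        n: [sample_few_shot(examples, n, seed) for seed in range(n_splits)]
        for n in sizes
    }


# Toy data: one large class and one class with fewer than 32 examples.
toy_examples = [(f"large_{i}", "Narrative") for i in range(100)]
toy_examples += [(f"small_{i}", "Lyrical") for i in range(20)]
splits = make_splits(toy_examples)
largest = splits[64][0]
print(sum(1 for _, label in largest if label == "Lyrical"))
```

In the toy data, the small class never reaches 32 or 64 examples, which mirrors the FinCORE limitation noted above.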
As mentioned in 2.3.1, the SetFit training process consists of two steps: fine-tuning the sentence embedding model and training a classification head on top of the fine-tuned model. As a baseline, I will use the results of training only the classification head with the smallest datasets, without fine-tuning the sentence embedding models, as has been done in previous studies using Sentence Transformers for text classification [54], [55]. I will then compare the results from fine-tuning with datasets of different sizes to the original benchmarks as well as to the baseline results to evaluate the models’ performance. To validate my implementation, I have tested the emotion recognition task from [13] and received results comparable to the ones in the original paper.

All the experiments have been run on the Puhti supercomputer of the Finnish IT Center for Science (CSC)¹¹. The full code implementation of the experiments is available on GitHub¹².

¹¹ https://csc.fi/en/
¹² https://github.com/annsaln/setfit-classifier

3.2.1 Evaluation

| | Positive | Negative |
|---|---|---|
| Predicted positive | True positive (TP) | False positive (FP) |
| Predicted negative | False negative (FN) | True negative (TN) |

Table 3.4: Confusion matrix

Text classification tasks can be evaluated with several different metrics. One of the simplest ways to evaluate a model’s performance is to use a confusion matrix to establish the model’s precision, recall and accuracy [22]. Table 3.4 demonstrates the composition of the values used in the calculation, and the formulas for each of these metrics are the following:

$$\text{precision} = \frac{TP}{TP + FP} \tag{3.1}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{3.2}$$

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3.3}$$

Simply put, accuracy measures the percentage of examples the model has labelled correctly. Accuracy can work well in situations where the dataset is balanced. Such is the case with the Yle and Ylilauta corpora, for which accuracy has been used as an evaluation metric [2].
In cases where the data is imbalanced, as in the Toxicity challenge dataset, accuracy might seem high even when the model only predicts the one class that comprises nearly 90% of the dataset. Because of the nature of the accuracy metric, it is rarely used in text classification evaluation [22].

Instead, it is useful to take a look at a model’s precision (P) and recall (R) metrics. A common way to measure a model’s performance is to use both to calculate their balanced representation, an F-score [22]:

$$F_\beta = \frac{(\beta^2 + 1)PR}{\beta^2 P + R} \tag{3.4}$$

The parameter $\beta$ denotes the relative weight one desires to put on the precision and recall of the model. Values below 1 will favour precision, whereas values above 1 will favour recall. The simplest way is to use $\beta = 1$ to create an F1-score with equal weighting of precision and recall:

$$F_1 = \frac{2PR}{P + R} \tag{3.5}$$

In this research, I will mainly use the F1-score as an evaluation metric to validate each model. Some of the tasks I have chosen do not have a benchmark that uses this metric, but I have chosen to use it for all experiments for ease of comparison in the scope of this work.

4 Results

In this chapter, I will present the results that were obtained in the experiments described in Chapter 3. I will compare the sentence embedding models’ performance across different classification tasks and with training sets of different sizes. To establish a baseline for each task and model, I have chosen to use the SetFit library to train only the classification head with the smallest datasets.

The different models are evaluated by inspecting the mean micro-averaged F1-score of 10 unique training instances, as well as the F1-score standard deviation (SD) and the mean difference to the baseline (BL). This is done to take data variation into account, even though the models themselves are trained with an equal amount of exposure to different labels whenever it is allowed by the training data label distribution.
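The F-score of Equation 3.4 and the micro-averaged evaluation described above can be sketched in plain Python. This is a minimal illustration with hypothetical function names and made-up counts and scores, not the evaluation code of the thesis; in particular, the use of the sample standard deviation is an assumption, as the text does not specify which variant is reported.

```python
from statistics import mean, stdev


def precision_recall(tp, fp, fn):
    """Equations 3.1 and 3.2, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Equation 3.4; beta = 1 gives the F1-score of Equation 3.5."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


def micro_f1(counts_by_label):
    """Micro-average: pool TP, FP and FN over all labels, then compute F1."""
    tp = sum(c[0] for c in counts_by_label.values())
    fp = sum(c[1] for c in counts_by_label.values())
    fn = sum(c[2] for c in counts_by_label.values())
    return f_beta(*precision_recall(tp, fp, fn))


# Made-up (TP, FP, FN) counts for two labels of very different sizes.
toy_counts = {"Narrative": (8, 2, 2), "Lyrical": (1, 1, 3)}
micro = micro_f1(toy_counts)

# Made-up micro F1-scores from three training instances of one model.
run_scores = [0.50, 0.52, 0.54]
summary = (mean(run_scores), stdev(run_scores))
print(round(micro, 4), summary)
```

Because the counts are pooled before the score is computed, large labels dominate the micro average, which is worth keeping in mind when reading the label-level tables alongside the overall scores.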
The FinCORE dataset especially has classes with so few examples that it is impossible to create a training set with 64 or even 32 examples of certain labels. The training sets are referred to by the number of examples per label, and the 10 training sets per sample size remain the same across all model experiments.

The tasks include two multilabel datasets, FinCORE and the Jigsaw toxicity challenge dataset. The multilabel models are evaluated by assigning every label whose sigmoid output equals or exceeds the threshold of 0.5. The Yle and Ylilauta tasks use the maximum value of the softmax output to assign a predicted label to an example.

4.1 FinCORE

I have chosen to split the FinCORE task into two parts: (a) the full corpus with all the subregister and main register labels and (b) a more restricted approach with main register labels only. This is done due to the complexity of the corpus, and also to compare the method’s performance on the same dataset at varying granularity. Both of these approaches are true multilabel tasks, meaning that one example might have several true labels assigned to it. The averaged model results can be found in Tables 4.1 and 4.2.
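The two label assignment rules described above, a 0.5 sigmoid threshold for the multilabel tasks and the softmax maximum for the multiclass tasks, can be illustrated with a small standard-library sketch. The function names, the raw scores and the label subsets are assumptions made for illustration; the classification head used in the experiments may expose probabilities directly rather than raw scores.

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_multilabel(scores, labels, threshold=0.5):
    """Assign every label whose sigmoid output is at or above the threshold."""
    return [label for label, s in zip(labels, scores) if sigmoid(s) >= threshold]


def predict_multiclass(scores, labels):
    """Assign the single label with the highest softmax output."""
    probabilities = softmax(scores)
    return labels[probabilities.index(max(probabilities))]


# Hypothetical raw scores for one example in each kind of task.
toxicity_labels = ["Toxicity", "Threat", "Insult"]
multilabel_prediction = predict_multilabel([2.0, -1.0, 0.0], toxicity_labels)

topic_labels = ["topic_a", "topic_b", "topic_c"]
multiclass_prediction = predict_multiclass([0.2, 1.5, -0.3], topic_labels)
print(multilabel_prediction, multiclass_prediction)
```

Under the threshold rule an example can receive several labels or none at all, whereas the softmax rule always yields exactly one label.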
| Model | Metric | baseline | n = 8 | n = 16 | n = 32 | n = 64 |
|---|---|---|---|---|---|---|
| FinSBERT | F1 | 0.34 | 0.46 | 0.50 | 0.51 | 0.53 |
| | SD | 0.01 | 0.02 | 0.01 | 0.01 | 0.2 |
| MiniLM | F1 | 0.27 | 0.33 | 0.38 | 0.40 | 0.43 |
| | SD | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 |
| e5-small | F1 | 0.0002 | 0.39 | 0.46 | 0.50 | 0.52 |
| | SD | 0.0002 | 0.02 | 0.02 | 0.01 | 0.01 |
| XLM-R | F1 | 0.30 | 0.38 | 0.39 | 0.39 | 0.42 |
| | SD | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| MPNet | F1 | 0.23 | 0.35 | 0.37 | 0.39 | 0.43 |
| | SD | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |

Table 4.1: FinCORE classification results for different sample sizes

| Model | Metric | baseline | n = 8 | n = 16 | n = 32 | n = 64 |
|---|---|---|---|---|---|---|
| FinSBERT | F1 | 0.24 | 0.42 | 0.58 | 0.64 | 0.68 |
| | SD | 0.04 | 0.04 | 0.02 | 0.01 | 0.01 |
| MiniLM | F1 | 0.12 | 0.13 | 0.32 | 0.48 | 0.54 |
| | SD | 0.04 | 0.05 | 0.04 | 0.02 | 0.01 |
| e5-small | F1 | 0.0 | 0.04 | 0.47 | 0.63 | 0.68 |
| | SD | 0.0 | 0.04 | 0.06 | 0.02 | 0.01 |
| XLM-R | F1 | 0.17 | 0.24 | 0.46 | 0.52 | 0.56 |
| | SD | 0.03 | 0.05 | 0.03 | 0.01 | 0.01 |
| MPNet | F1 | 0.07 | 0.19 | 0.43 | 0.50 | 0.55 |
| | SD | 0.03 | 0.06 | 0.02 | 0.02 | 0.02 |

Table 4.2: FinCORE main register classification results for different sample sizes

In both tasks, TurkuNLP’s monolingual FinSBERT model reaches the best F1-scores when evaluated with the test set. In the main-registers-only task, the multilingual e5-small is tied for first place, whereas in the full label set task, it comes a close second. What makes the two models’ performance differ is their learning curve: FinSBERT reaches the highest baseline results, whereas e5-small’s baseline is practically zero. This might indicate that e5-small greatly benefits from task-specific fine-tuning, even with just a few examples.

For some reason, the task with all labels included gets more stable results with fewer examples. This being a multilabel task, it might be that some classes get more exposure than others. It might also be a result of intraclass variance, as observed in [10].
In the end, with the largest sample size datasets, all models reach higher scores in the main register task. No model, however, comes particularly close to the original benchmark, an F1-score of 0.69 with the full label set.

All models in Table 4.3 are trained with n = 64 examples per label.

| Register | Metric | FinSBERT | MiniLM | e5-small | XLM-R | MPNet |
|---|---|---|---|---|---|---|
| Narrative (n=754) | F1 | 0.74 | 0.64 | 0.73 | 0.65 | 0.62 |
| | SD | 0.02 | 0.1 | 0.01 | 0.01 | 0.04 |
| | BL | 0.19 | 0.07 | 0.0 | 0.12 | 0.01 |
| | Δ | 0.56 | 0.57 | 0.73 | 0.53 | 0.61 |
| Opinion (n=284) | F1 | 0.60 | 0.55 | 0.65 | 0.54 | 0.55 |
| | SD | 0.02 | 0.02 | 0.03 | 0.02 | 0.02 |
| | BL | 0.30 | 0.19 | 0.0 | 0.26 | 0.14 |
| | Δ | 0.30 | 0.35 | 0.65 | 0.28 | 0.41 |
| Informational description (n=362) | F1 | 0.59 | 0.46 | 0.58 | 0.47 | 0.49 |
| | SD | 0.03 | 0.03 | 0.03 | 0.01 | 0.04 |
| | BL | 0.14 | 0.05 | 0.0 | 0.09 | 0.01 |
| | Δ | 0.44 | 0.41 | 0.58 | 0.38 | 0.48 |
| Informational persuasion (n=252) | F1 | 0.62 | 0.46 | 0.64 | 0.49 | 0.48 |
| | SD | 0.02 | 0.02 | 0.03 | 0.02 | 0.03 |
| | BL | 0.19 | 0.11 | 0.0 | 0.13 | 0.04 |
| | Δ | 0.44 | 0.36 | 0.64 | 0.36 | 0.48 |
| Interactive discussion (n=231) | F1 | 0.71 | 0.53 | 0.71 | 0.58 | 0.59 |
| | SD | 0.02 | 0.03 | 0.02 | 0.04 | 0.03 |
| | BL | 0.14 | 0.02 | 0.0 | 0.09 | 0.003 |
| | Δ | 0.57 | 0.52 | 0.71 | 0.49 | 0.58 |
| How-to / instructions (n=117) | F1 | 0.51 | 0.43 | 0.57 | 0.44 | 0.44 |
| | SD | 0.03 | 0.02 | 0.03 | 0.02 | 0.02 |
| | BL | 0.19 | 0.08 | 0.0 | 0.20 | 0.03 |
| | Δ | 0.33 | 0.35 | 0.57 | 0.23 | 0.42 |
| Lyrical (n=5) | F1 | 0.24 | 0.11 | 0.37 | 0.12 | 0.15 |
| | SD | 0.07 | 0.05 | 0.13 | 0.05 | 0.06 |
| | BL | 0.18 | 0.10 | 0.0 | 0.08 | 0.12 |
| | Δ | 0.07 | 0.01 | 0.37 | 0.04 | 0.03 |
| Machine translated/generated (n=276) | F1 | 0.95 | 0.68 | 0.91 | 0.71 | 0.68 |
| | SD | 0.01 | 0.05 | 0.01 | 0.03 | 0.04 |
| | BL | 0.59 | 0.33 | 0.0 | 0.40 | 0.29 |
| | Δ | 0.36 | 0.35 | 0.91 | 0.31 | 0.38 |
| Spoken (n=18) | F1 | 0.24 | 0.10 | 0.21 | 0.11 | 0.13 |
| | SD | 0.04 | 0.10 | 0.03 | 0.03 | 0.03 |
| | BL | 0.05 | 0.01 | 0.0 | 0.06 | 0.0 |
| | Δ | 0.19 | 0.10 | 0.21 | 0.05 | 0.13 |

Table 4.3: FinCORE main registers label specific average results per class and best model sample

The label-specific evaluation brings some additional insight into the model comparison, as seen in Table 4.3. None of the models performs as well as the benchmark, except in the "Lyrical" class, where three of the models outperform the benchmark (F1 = 0.13 [9]). However, the standard deviation for this class is high, ranging from 0.05 to 0.13, and the similarly sized class "Spoken" gets significantly worse results than the benchmark (F1 = 0.64 [9]) with all the models.

While FinSBERT and e5-small perform in very similar ways at their best, there are some slight differences on the register level when comparing the top-performing instances. FinSBERT seems to do slightly better in the registers "Narrative", "Informational description", "Machine translated / generated" and "Spoken", whereas e5-small outperforms it in the registers "Opinion", "Informational persuasion", "How-to / instructions", and "Lyrical". The largest differences are in the classification results of the "Lyrical", "How-to / instructions", and "Opinion" registers, all in favour of e5-small.

The greatest difference to the baseline when training with the FinSBERT model is observed with the classes "Narrative", "Interactive discussion" and "Informational description". Of these three, "Narrative" and "Informational description" are classes where FinSBERT also outperforms the rest of the models. Overall, e5-small has the largest difference to the baseline, since its baseline is 0.0 for all the classes.

4.2 Toxicity Challenge Dataset

Like the FinCORE dataset, the Toxicity challenge dataset is a multilabel dataset. One exception to the rule is that the "Clean" label does not co-occur with the rest of the labels but instead indicates the lack of any of the other labels. Another peculiarity of this dataset is that it has been machine-translated from English to Finnish. The classification results for this task can be viewed in Table 4.4.
The results are low compared to the original benchmark (F1-score = 0.66 [12]), and the standard deviation is the highest of all the tasks across all models. This might be an indicator of the difficulty of the task. The model that performs the best on average is FinSBERT with 64 examples per label, with a mean F1-score of 0.53 (SD = 0.06). Out of these instances, the best model got an F1-score of 0.62, which is not too far from the benchmark. The best model overall is, surprisingly, a clear outlier: a MiniLM model trained with only 16 examples per label, with an F1-score of 0.67.

| Model | Metric | baseline | n = 8 | n = 16 | n = 32 | n = 64 |
|---|---|---|---|---|---|---|
| FinSBERT | F1 | 0.26 | 0.36 | 0.35 | 0.44 | 0.53 |
| | SD | 0.04 | 0.09 | 0.06 | 0.05 | 0.06 |
| MiniLM | F1 | 0.21 | 0.43 | 0.49 | 0.48 | 0.40 |
| | SD | 0.03 | 0.10 | 0.09 | 0.06 | 0.16 |
| e5-small | F1 | 0.10 | 0.13 | 0.25 | 0.37 | 0.36 |
| | SD | 0.0 | 0.02 | 0.05 | 0.03 | 0.14 |
| XLM-R | F1 | 0.26 | 0.40 | 0.39 | 0.40 | 0.44 |
| | SD | 0.06 | 0.13 | 0.06 | 0.11 | 0.17 |
| MPNet | F1 | 0.20 | 0.35 | 0.39 | 0.42 | 0.43 |
| | SD | 0.03 | 0.13 | 0.10 | 0.07 | 0.17 |

Table 4.4: Toxicity classification results for different sample sizes (F1-score and standard deviation)

Something to note is that all models except MPNet experience some kind of regression while the training set size increases: even if the best performance is achieved with the largest training set, the score drops at some point when switching from a smaller dataset to a larger one.
| Label | Metric | FinSBERT | MiniLM | e5-small | XLM-R | MPNet |
|---|---|---|---|---|---|---|
| Sample size | | n = 64 | n = 16 | n = 32 | n = 64 | n = 64 |
| Identity attack (n=712) | F1 | 0.016 | 0.03 | 0.0 | 0.01 | 0.0 |
| | SD | 0.03 | 0.03 | 0.0 | 0.03 | 0.01 |
| | BL | 0.13 | 0.16 | 0.0 | 0.16 | 0.22 |
| | Δ | -0.12 | -0.14 | 0.0 | -0.15 | -0.22 |
| Insult (n=3 427) | F1 | 0.31 | 0.32 | 0.18 | 0.23 | 0.22 |
| | SD | 0.06 | 0.05 | 0.01 | 0.07 | 0.06 |
| | BL | 0.19 | 0.19 | 0.10 | 0.19 | 0.20 |
| | Δ | 0.13 | 0.14 | 0.08 | 0.04 | 0.01 |
| Obscene (n=3 691) | F1 | 0.32 | 0.36 | 0.19 | 0.24 | 0.22 |
| | SD | 0.06 | 0.06 | 0.01 | 0.07 | 0.06 |
| | BL | 0.18 | 0.18 | 0.11 | 0.18 | 0.19 |
| | Δ | 0.14 | 0.19 | 0.08 | 0.06 | 0.04 |
| Severe toxicity (n=367) | F1 | 0.08 | 0.18 | 0.0 | 0.0 | 0.0 |
| | SD | 0.06 | 0.08 | 0.0 | 0.0 | 0.0 |
| | BL | 0.06 | 0.09 | 0.0 | 0.09 | 0.12 |
| | Δ | 0.02 | 0.10 | 0.0 | -0.09 | -0.12 |
| Threat (n=211) | F1 | 0.04 | 0.04 | 0.0 | 0.0 | 0.0 |
| | SD | 0.03 | 0.05 | 0.0 | 0.0 | 0.0 |
| | BL | 0.12 | 0.14 | 0.0 | 0.14 | 0.21 |
| | Δ | -0.08 | -0.10 | 0.0 | -0.14 | -0.21 |
| Toxicity (n=6 090) | F1 | 0.38 | 0.38 | 0.27 | 0.33 | 0.32 |
| | SD | 0.06 | 0.04 | 0.01 | 0.08 | 0.08 |
| | BL | 0.23 | 0.22 | 0.17 | 0.22 | 0.22 |
| | Δ | 0.15 | 0.16 | 0.10 | 0.10 | 0.10 |
| Clean (n=57 735) | F1 | 0.71 | 0.59 | 0.59 | 0.60 | 0.59 |
| | SD | 0.05 | 0.10 | 0.04 | 0.30 | 0.29 |
| | BL | 0.38 | 0.24 | 0.0 | 0.24 | 0.19 |
| | Δ | 0.33 | 0.36 | 0.59 | 0.36 | 0.39 |

Table 4.5: Toxicity label level results from the best performing batches

As we see in Table 4.5, performance is not consistent across the categories: with most models, the "Identity attack", "Severe toxicity" and "Threat" classes are severely under-predicted. In fact, there is some indication of catastrophic forgetting, since the baseline results in these classes are in many cases higher than the results of the best-performing models. Such is the case for "Identity attack" with the FinSBERT, MiniLM, XLM-R and MPNet models, "Severe toxicity" with the XLM-R and MPNet models, and "Threat" with the FinSBERT, MiniLM, XLM-R and MPNet models.
Since the e5-small baselines are already 0.0 in these classes, there is no unlearning to do. This might also be related to the regression experienced when training with larger datasets.

The largest difference within a category is in the prediction results for "Severe toxicity", where MiniLM reaches an F1-score of 0.18 and FinSBERT 0.08, while the rest of the models regress to or stay at an F1-score of 0.0. "Clean" is another category with a clear best performer: FinSBERT gets roughly 0.10 points higher F1-scores than the other models. The greatest positive difference to the baseline can be observed in the "Clean" category with all models, where they also reach their highest label-level F1-scores.

As the highest F1-scores go consistently to the "Clean" category, one could argue that this method succeeds mainly in differentiating clean texts from toxic ones. This might be due to SetFit’s training procedure: since SetFit uses contrastive learning for fine-tuning the embedding model, it will create distance between examples that belong to different classes. Because the "Clean" label does not co-occur with any other labels, it should be easier to single out in the vector space. Figure 4.1 illustrates the evolution of FinSBERT models fine-tuned with increasing amounts of data: while the baseline model’s embeddings show all the classes jumbled in one mass, the blue "Clean" category starts to separate from the rest of the classes as the models are fine-tuned with more data. However, one can observe that the distinction between "Clean" examples and the rest is not perfect even when fine-tuning with the largest sample size. The data used for creating these embeddings is incremental, meaning that all the training data points from the smaller datasets are included in the larger datasets as well.

Figure 4.1: FinSBERT embeddings fine-tuned with Toxicity challenge data
Label combination            Predictions
Clean                        35 584
Insult, Obscene, Toxicity    25 648
Toxicity                     1 791
-                            846
Insult, Toxicity             109

Table 4.6: Test set predictions from the best instance of e5-small

A notable difference to the other tasks here is the poor performance of e5-small. While it performs on par with the best model on the FinCORE and Yle datasets, here it gets the worst results of all and seems to have trouble learning. The baseline prediction F1-score for four out of seven classes is 0.0, and even after fine-tuning, the results leave much to be desired. With the best instance of e5-small, the "Identity attack", "Severe toxicity" and "Threat" classes are never once predicted when evaluating the test set, and the predictions are basically split between "Clean" and the "Insult"+"Obscene"+"Toxicity" combination, as shown in Table 4.6. When compared to the true test set label distribution shown in Table 3.2, it is clear that while the "Clean" category remains the largest predicted class, it is under-predicted, and the labels in the second-largest prediction category are severely over-predicted.

All tested models seem to prefer the two topmost label combinations shown in Table 4.6 at the expense of others, even though not as strongly as e5-small. With MiniLM, the predictions are more evenly distributed, but it looks like the model starts preferring the toxic label categories as the dataset size grows: with the 64-example iteration, a model that had previously predicted most of the examples (correctly) to the "Clean" class starts to prefer the same label combination as above, and only 5 504 examples go to the "Clean" category.

4.3 Yle Corpus

Unlike the two previous tasks, the Yle task is a multiclass classification task rather than a multilabel one. Each example receives only one label out of the ten balanced classes. The results of the experiments with this dataset can be viewed in Table 4.7.
Model       Metric  baseline  n = 8     n = 16    n = 32    n = 64
FinSBERT    F1      0.58      0.75      0.83      0.85      0.86
            SD      0.03      0.02      0.01      0.01      0.004
MiniLM      F1      0.73      0.78      0.80      0.81      0.83
            SD      0.02      0.02      0.01      0.01      0.004
e5-small    F1      0.75      0.81      0.84      0.85      0.85
            SD      0.01      0.02      0.01      0.01      0.01
XLM-R       F1      0.58      0.79      0.81      0.83      0.82
            SD      0.02      0.01      0.01      0.01      0.01
MPNet       F1      0.71      0.80      0.82      0.83      0.82
            SD      0.02      0.02      0.01      0.01      0.01

Table 4.7: Yle classification results for different sample sizes

The performance across different models is surprisingly similar and stable with this task. While FinSBERT reaches the best overall F1-score of 0.86, the rest are not far behind. The standard deviation is also very low, especially compared to the Toxicity task results. The previously reported benchmark of 0.92 accuracy [2] is not far away, and here it would seem that the SetFit method succeeds fairly well even with a smaller amount of data. This is supported by the fact that the XLM-R and MPNet performance plateaus at 32 examples per label, and the other models gain only slight improvements when the dataset size is doubled to 64.

However, it should be noted that the baseline F1-scores are also quite high, and reasonably good results are achievable even without fine-tuning the sentence embedding models. The FinSBERT and XLM-R models benefit the most from fine-tuning as their baseline results are lower, but the other models gain only a moderate increase in their performance. This contrasts with the results from the FinCORE and Toxicity tasks, where these two models reached considerably higher baseline results than the rest.

The class-specific results seen in Table 4.8 reflect the results in Table 4.7. The performance is relatively stable across all classes with all models, with only minor differences.
FinSBERT outperforms the other models slightly in all categories but one, the greatest difference being in the class "Talous"1. FinSBERT and e5-small get nearly identical F1-scores in the categories "Koulutus ja kasvatus"2, "Liikenne ja kuljetus"3, "Onnettomuudet"4, "Terveys"5, and "Urheilu"6. The highest F1-scores go to "Urheilu" with all models, ranging from 0.94 to 0.97. The class’s baseline results are also rather high, even without fine-tuning the sentence embedding model. These observations are in line with the remarks in [9], where it is noted that sports reports are predicted with high accuracy even when the model is trained with a small number of examples. According to the authors, this might be due to the often formulaic structure of the texts. "Talous" receives the worst F1-scores with all the models, ranging from 0.71 to 0.79.

1 Economy (author’s translation)
2 Education and upbringing (author’s translation)
3 Traffic and transport (author’s translation)
4 Accidents (author’s translation)
5 Health (author’s translation)
6 Sports (author’s translation)

Label                            Metric  FinSBERT  MiniLM    e5-small  XLM-R     MPNet
Sample size                              n = 64    n = 64    n = 64    n = 32    n = 32
Koulutus ja kasvatus (n=1 000)   F1      0.90      0.89      0.90      0.88      0.89
                                 SD      0.005     0.005     0.01      0.01      0.01
                                 BL      0.70      0.84      0.81      0.70      0.78
                                 Δ       0.20      0.04      0.09      0.18      0.11
Liikenne ja kuljetus (n=1 000)   F1      0.84      0.81      0.84      0.81      0.82
                                 SD      0.13      0.01      0.01      0.01      0.01
                                 BL      0.61      0.71      0.68      0.56      0.69
                                 Δ       0.23      0.09      0.16      0.25      0.13
Kulttuuri (n=1 000)              F1      0.87      0.82      0.86      0.82      0.83
                                 SD      0.01      0.01      0.01      0.02      0.01
                                 BL      0.49      0.67      0.72      0.46      0.58
                                 Δ       0.38      0.15      0.14      0.36      0.25
Luonto (n=1 000)                 F1      0.86      0.79      0.84      0.80      0.80
                                 SD      0.01      0.0       0.01      0.01      0.01
                                 BL      0.55      0.70      0.73      0.55      0.69
                                 Δ       0.31      0.09      0.11      0.25      0.11
Onnettomuudet (n=1 000)          F1      0.86      0.84      0.86      0.82      0.83
                                 SD      0.02      0.02      0.01      0.03      0.03
                                 BL      0.63      0.70      0.74      0.61      0.73
                                 Δ       0.24      0.15      0.12      0.21      0.10
Politiikka (n=1 000)             F1      0.82      0.78      0.80      0.78      0.78
                                 SD      0.02      0.02      0.02      0.02      0.02
                                 BL      0.55      0.70      0.72      0.56      0.71
                                 Δ       0.27      0.08      0.08      0.22      0.07
Rikokset (n=1 000)               F1      0.88      0.84      0.86      0.85      0.85
                                 SD      0.01      0.01      0.01      0.01      0.01
                                 BL      0.62      0.75      0.74      0.63      0.76
                                 Δ       0.25      0.09      0.12      0.21      0.09
Talous (n=1 000)                 F1      0.79      0.71      0.76      0.74      0.73
                                 SD      0.01      0.02      0.02      0.01      0.02
                                 BL      0.45      0.56      0.65      0.43      0.58
                                 Δ       0.34      0.13      0.11      0.31      0.15
Terveys (n=1 000)                F1      0.86      0.83      0.86      0.83      0.83
                                 SD      0.01      0.01      0.01      0.01      0.02
                                 BL      0.54      0.74      0.77      0.64      0.75
                                 Δ       0.33      0.09      0.09      0.19      0.08
Urheilu (n=1 000)                F1      0.97      0.94      0.97      0.95      0.95
                                 SD      0.01      0.01      0.005     0.01      0.01
                                 BL      0.69      0.86      0.88      0.68      0.80
                                 Δ       0.28      0.08      0.08      0.27      0.15

Table 4.8: Yle classification label level results

The class "Kulttuuri"7 appears to have benefitted the most from fine-tuning when comparing the final results to the baseline. With FinSBERT, the difference to the baseline is 0.38, and the class ends up being among the best performing ones. XLM-R and MPNet also have "Kulttuuri" as the class where the most improvement can be seen. With MiniLM, "Kulttuuri" is accompanied by "Onnettomuudet", whereas e5-small has the largest difference to the baseline in the class "Liikenne ja kuljetus", as the Δ rows of Table 4.8 show. The smallest improvement can be seen in "Koulutus ja kasvatus" with the FinSBERT, MiniLM and XLM-R models, "Urheilu" with e5-small and "Politiikka" with MPNet.

4.4 Ylilauta Corpus

Similarly to the Yle task, the Ylilauta task consists of multiclass classification into ten balanced classes. The main difference here is the register: the Yle corpus contains news texts from ten different categories, whereas the Ylilauta corpus is gathered from a discussion forum and its ten most common discussion topic themes.
Unlike the Yle corpus, the Ylilauta corpus also contains informal language, and the texts might be less formulaic than news texts in general. The classification results for this task can be viewed in Table 4.9.

7 Culture (author’s translation)

Model       Metric  baseline  n = 8     n = 16    n = 32    n = 64
FinSBERT    F1      0.41      0.54      0.70      0.73      0.74
            SD      0.02      0.04      0.01      0.01      0.004
MiniLM      F1      0.45      0.47      0.51      0.54      0.57
            SD      0.02      0.02      0.02      0.01      0.01
e5-small    F1      0.48      0.50      0.57      0.59      0.61
            SD      0.02      0.03      0.02      0.01      0.01
XLM-R       F1      0.44      0.51      0.58      0.60      0.60
            SD      0.02      0.02      0.01      0.01      0.01
MPNet       F1      0.52      0.54      0.59      0.61      0.60
            SD      0.02      0.02      0.01      0.01      0.01

Table 4.9: Ylilauta classification results for different sample sizes

Based on the results in Table 4.9, it looks like FinSBERT outperforms the rest of the models by a large margin with an F1-score of 0.74. The rest of the models perform at a similar level, with F1-scores around 0.60. While the previously reported accuracy of 0.82 with FinBERT [2] is not reached, the FinSBERT model does perform relatively well considering the small number of examples compared to the original 10 000 per class. Even though most of the models reach their best performance with the largest dataset, the improvement when sizing up from 32 to 64 examples per class isn’t enormous even with models that do benefit from the larger dataset. XLM-R and MPNet reach their best scores with only 32 examples per class. The standard deviations, while on average higher than with the Yle task, are quite low.

All models receive similar baseline F1-scores, with a 0.11 difference between the best (MPNet) and worst (FinSBERT) performing model. Not all models gain much from fine-tuning: for example, MPNet gains only a 0.09 point increase over its baseline score.
However, FinSBERT benefits clearly from fine-tuning, with a 0.33 difference to the baseline.

The class-level classification results can be viewed in Table 4.10. The best performing class overall is "Ajoneuvot"8 with a FinSBERT F1-score of 0.84. The other models succeed best in the class "Televisio"9, with F1-scores ranging from 0.71 to 0.74. The class "Hikky"10 proved to be the most difficult one to predict for all models except MiniLM, which performed the worst in the class "Sota"11.

The greatest improvement over the baseline score can be observed in the class "Penkkiurheilu"12, with FinSBERT gaining a 0.43 increase over the baseline with fine-tuning. The other models benefit the most from fine-tuning in the classes "Pelit"13 and "Sota". On the other hand, all the models see the least improvement in the class "Seksuaalisuus"14, except for XLM-R, which learns the least in "Ajoneuvot".

The largest in-category variation is in the classes "Sota" and "Seksuaalisuus", with F1-scores ranging from 0.48 to 0.72 in "Sota", and from 0.49 to 0.70 in "Seksuaalisuus". Conversely, the smallest variation can be observed within the classes "Televisio" and "Politiikka"15, where the F1-scores from all models are within a 0.10 range.

8 Vehicles (author’s translation)
9 Television (author’s translation)
10 Hikikomori (author’s translation)
11 War (author’s translation)
12 Spectator sports (author’s translation)
13 Games (author’s translation)
14 Sexuality (author’s translation)
15 Politics (author’s translation)

Label                            Metric  FinSBERT  MiniLM    e5-small  XLM-R     MPNet
Sample size                              n = 64    n = 64    n = 64    n = 32    n = 32
Ajoneuvot (n=1 000)              F1      0.84      0.65      0.70      0.67      0.70
                                 SD      0.01      0.02      0.01      0.02      0.02
                                 BL      0.51      0.56      0.60      0.58      0.63
                                 Δ       0.34      0.09      0.10      0.09      0.07
Hikky (n=1 000)                  F1      0.65      0.49      0.50      0.47      0.49
                                 SD      0.01      0.01      0.02      0.02      0.03
                                 BL      0.34      0.39      0.38      0.32      0.42
                                 Δ       0.30      0.10      0.12      0.15      0.07
Kuntosali (n=1 000)              F1      0.73      0.54      0.57      0.59      0.58
                                 SD      0.01      0.02      0.01      0.02      0.02
                                 BL      0.42      0.38      0.46      0.40      0.48
                                 Δ       0.31      0.16      0.10      0.19      0.11
Muoti (n=1 000)                  F1      0.73      0.57      0.57      0.60      0.61
                                 SD      0.01      0.01      0.02      0.01      0.02
                                 BL      0.40      0.43      0.39      0.47      0.53
                                 Δ       0.32      0.14      0.17      0.13      0.09
Pelit (n=1 000)                  F1      0.73      0.56      0.67      0.63      0.62
                                 SD      0.01      0.01      0.01      0.02      0.02
                                 BL      0.32      0.39      0.51      0.38      0.47
                                 Δ       0.40      0.17      0.17      0.25      0.14
Penkkiurheilu (n=1 000)          F1      0.77      0.63      0.68      0.66      0.69
                                 SD      0.02      0.01      0.01      0.02      0.02
                                 BL      0.34      0.51      0.57      0.48      0.57
                                 Δ       0.43      0.12      0.12      0.18      0.12
Politiikka (n=1 000)             F1      0.75      0.65      0.65      0.65      0.67
                                 SD      0.01      0.01      0.02      0.04      0.02
                                 BL      0.43      0.53      0.54      0.48      0.60
                                 Δ       0.32      0.12      0.11      0.17      0.07
Seksuaalisuus (n=1 000)          F1      0.70      0.49      0.53      0.54      0.54
                                 SD      0.02      0.02      0.02      0.02      0.02
                                 BL      0.44      0.42      0.44      0.40      0.50
                                 Δ       0.26      0.07      0.09      0.14      0.05
Sota (n=1 000)                   F1      0.72      0.48      0.53      0.53      0.53
                                 SD      0.01      0.03      0.02      0.03      0.03
                                 BL      0.45      0.35      0.29      0.33      0.40
                                 Δ       0.27      0.13      0.24      0.20      0.13
Televisio (n=1 000)              F1      0.80      0.71      0.74      0.72      0.73
                                 SD      0.01      0.01      0.01      0.02      0.01
                                 BL      0.46      0.58      0.56      0.55      0.68
                                 Δ       0.34      0.12      0.18      0.17      0.06

Table 4.10: Ylilauta label specific classification results

5 Discussion

The hypothesis for my first research question, "How well do the fine-tuned Sentence Transformer models perform with Finnish tasks compared to the reported benchmarks? Can the SetFit method reach results comparable to the state-of-the-art?", was that Sentence Transformer models fine-tuned by SetFit would not reach state-of-the-art results. As evidenced in Chapter 4, the hypothesis was confirmed. This supports the results from earlier research, indicating that SetFit might not be the best solution for fine-tuning a model when the highest possible accuracy is necessary.
However, the results from this research show promise for more data-sparse tasks: if one needs a classification model for a clearly defined task with only a few existing examples, SetFit could be a solution worth testing. In this chapter, I will further discuss the possible benefits, drawbacks and limitations of the method that I have encountered while doing this research, as well as considerations for anyone interested in testing SetFit themselves.

The second research question concerned comparing the different Sentence Transformer models and whether there would be a difference in performance between multilingual and monolingual Finnish models when classifying Finnish text. In Chapter 4, it was shown that the monolingual FinSBERT was a stable performer and the best in all tasks. The difference to the next-best models varied: it was smallest in the Yle task and largest with the Toxicity challenge data, where FinSBERT led the second-best model by 0.09 points. Of the multilingual models, e5-small seems promising but fails in the Toxicity challenge task. Despite this, it performed well and on par with FinSBERT in the complex multilabel FinCORE task. Figure 5.1 compares these two models’ performance across the different tasks. FinSBERT clearly fares better, especially in tasks that contain informal language, such as the Toxicity challenge dataset or the Ylilauta corpus.

Figure 5.1: Comparison of monolingual FinSBERT and multilingual e5-small performances when fine-tuned with SetFit, highlighted area represents standard deviation

Another point of interest was whether there would be a difference in the SetFit performance between multilabel and multiclass classification tasks, the hypothesis being that multiclass tasks would perform better than multilabel tasks.
Using this method of fine-tuning, multilabel tasks such as FinCORE and the Toxicity challenge clearly receive worse results than multiclass tasks such as the Yle and Ylilauta corpora. However, there was some variation, and it would seem that the quality or format of the language also has an effect. The more neutral FinCORE and Yle tasks received overall better scores than the more informal Toxicity and Ylilauta tasks, regardless of task composition. Even though SetFit’s performance in the multilabel tasks was somewhat lacking, it could be argued that it might work as an alternative solution in multiclass labelling, even if state-of-the-art results are not reached.

This being a thesis about few-shot classification, how much data is needed, then? Even though SetFit is aimed at few-shot classification, according to the results of this research, more data in general leads to better performance. Previous research has seen success with larger datasets [7], [14], [16], [17], [60], and Tunstall et al. recommend increasing the amount of data rather than training time for performance improvements [13]. However, looking at Figure 5.1, an elbow in the learning curve can be observed when increasing the amount of data, and the performance gains diminish after 32 or even 16 examples. As shown in Figure 5.1, certain tasks seem more data-hungry than others. Yle performance is quite stable already with 16 examples per label, whereas the Toxicity challenge learning curves of the two models presented contradict each other: it looks like FinSBERT might still have learnt something from a larger dataset, but e5-small performance has already dropped after increasing the number of examples per label to 64 (please do note the large standard deviation).
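The elbow described above can be made concrete with a small calculation over the FinSBERT rows of Tables 4.7 and 4.9. The following is only a sketch: the helper names `marginal_gains` and `elbow`, and in particular the 0.03 F1 threshold for a "worthwhile" gain, are assumptions chosen for illustration and are not part of the thesis methodology.

```python
def marginal_gains(scores):
    """Return (from_n, to_n, gain) for each step in samples per label.

    scores: dict mapping samples-per-label -> mean F1-score.
    """
    sizes = sorted(scores)
    return [
        (a, b, round(scores[b] - scores[a], 2))
        for a, b in zip(sizes, sizes[1:])
    ]


def elbow(scores, min_gain=0.03):
    """Smallest sample size after which every further step gains less
    than `min_gain` F1 (the threshold is an illustrative assumption)."""
    gains = marginal_gains(scores)
    for i, (from_n, _, _) in enumerate(gains):
        if all(min_gain > gain for _, _, gain in gains[i:]):
            return from_n
    return max(scores)


# Mean F1-scores of FinSBERT taken from Table 4.7 (Yle) and Table 4.9 (Ylilauta).
yle_finsbert = {8: 0.75, 16: 0.83, 32: 0.85, 64: 0.86}
ylilauta_finsbert = {8: 0.54, 16: 0.70, 32: 0.73, 64: 0.74}

print(marginal_gains(yle_finsbert))
print(elbow(yle_finsbert), elbow(ylilauta_finsbert))
```

With these inputs and this threshold, the sketch places the elbow at 16 examples per label for Yle and at 32 for Ylilauta; a different threshold would move these points.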
The sales pitch of "only 8 examples per label" [13] doesn’t seem to hold, at least with these example tasks, although if 16 or 32 examples per label is what it takes to achieve decent results, that isn’t far off either. This number is considerably lower than the amount of data used in fine-tuning the benchmark models.

In addition to lower data requirements, the other benefit of SetFit is its lighter computational resource needs compared to many other fine-tuning methods. For example, when fine-tuning with the seven-label Toxicity challenge dataset with 16 samples per label, the training time averages 57.4 seconds per instance with MiniLM and 1.94 minutes with the larger XLM-R model. The e5-small model turned out to require the longest compute time, but at least with training set sizes smaller than 32 examples per label, the difference to the other models isn’t too large. MiniLM is the fastest to train overall and might be worth considering in situations where the available computational resources are severely limited.

Figure 5.2: SetFit average training times with different numbers of classes and samples per label

Due to SetFit’s data augmentation feature, training time grows relatively quickly in relation to the number of labels and the number of samples per label. Although training with the smallest dataset size was very fast with all the tasks, the time increased quite a bit as the amount of data grew, as evidenced in Figure 5.2. This could probably be countered by optimising the number of learning steps, which could also help reduce overfitting. In this research, I did not utilise an early stopping callback to return the best model but trained each instance for one full epoch like Tunstall et al. [13].
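One way to see why training time grows quickly is to count how many distinct sentence pairs a balanced multiclass training set can produce. The sketch below is a simplified combinatorial count, not a model of SetFit's actual pair sampler, which may draw fewer or repeated pairs depending on the configured sampling strategy; the function name `unique_pair_counts` is hypothetical.

```python
from math import comb


def unique_pair_counts(num_classes, samples_per_class):
    """Count distinct unordered pairs in a balanced multiclass training set.

    Positive pairs share a class, negative pairs do not. This is a
    simplification for illustration, not SetFit's sampling procedure.
    """
    total_examples = num_classes * samples_per_class
    positive = num_classes * comb(samples_per_class, 2)
    negative = comb(total_examples, 2) - positive
    return positive, negative


# Ten balanced classes, as in the Yle and Ylilauta tasks.
for n in (8, 16, 32, 64):
    positive, negative = unique_pair_counts(10, n)
    print(n, positive, negative)
```

The number of possible pairs grows roughly quadratically: doubling the samples per label roughly quadruples the pair count, which is in line with the steep growth in training time noted above.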
Other options to reduce training time could be downsampling the number of examples, for example by manually curating the training sets as in [8], or setting SetFit to generate a smaller number of sentence pairs for training.

There doesn’t seem to be a trend towards favouring larger models; rather, the small, 125M parameter monolingual FinSBERT model fares the best in all of the tasks, and the second-place e5-small is even smaller with only 118M parameters. The MPNet and XLM-R models, despite being more than double the size, fall short in performance. The size difference between the base model FinBERT and the Sentence Transformer FinSBERT isn’t very large, so in order to reach the best performance with the datasets used in this thesis, fine-tuning FinBERT instead of FinSBERT is advisable. As I did not test fine-tuning FinBERT in a few-shot setting, the two methods’ performance cannot be compared in a straightforward manner. In the case where data is sparse, it might be worth testing fine-tuning a Sentence Transformer model with SetFit.

All models were trained for one full epoch, but the authors of [13] suggest that shorter training might yield better results and reduce overfitting. This is promising for lower-end devices and would reduce the computational resources needed for training even further. However, testing this hypothesis is beyond the scope of this thesis and would make interesting research in the future.

In this research, in most of the tasks, the difference to the state-of-the-art results remained quite large. All the Sentence Transformer models were fine-tuned for one full epoch with a stable learning rate. The authors of [13] used a learning rate of 1e-3, but I opted for the library default of 2e-5. At the time of my own experiments, some of the features in the experimental setup of [13] had already been deprecated from the SetFit library.
Such is the case of the sampling strategy, which I chose to optimise based on the FinCORE task performance and training time with FinSBERT. This differs from the original paper since the library itself had already deprecated the original sampling method. This might mean that the results are not necessarily comparable to the ones in the original paper and that with further parameter optimisation, better results might have been obtained. It could be interesting to, for example, choose a smaller sample size such as 16 samples per label and optimise the learning rate and training steps based on that.

Choosing the training sets manually to be representative in the manner of [8] could also lead to better results. In this experiment, the training sets were randomised and might have been inconsistent in quality. However, there is no indication of this when comparing the performance across different models: there are no particular training sets that consistently yield better or worse results than the others. In general, the results were surprisingly stable. Of course, in a data-sparse scenario, it might make sense to ensure the quality of each labelled example. In the case of FinCORE, I chose to include hybrid examples with several labels in the dataset. This could have been avoided by choosing only examples that represent a single main register and a possible sub-register. The same can be said for the Toxicity challenge dataset, where it would have been possible to choose only the "purest" representations of each class, with a minimum number of coinciding labels. However, running the experiments in the current way might have assisted the Sentence Transformer fine-tuning by bringing often co-occurring categories closer together in the vector space. This could also be one reason for the strange preference for the Insult+Obscene+Toxicity combination in the Toxicity challenge task.
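The idea that co-occurring labels may be drawn together during contrastive fine-tuning can be illustrated with a small pairing sketch. It assumes, purely for illustration, that two multilabel examples form a positive (similar) pair whenever they share at least one label and a negative pair otherwise; the thesis does not state the exact pairing rule of the SetFit version used, and the function name `multilabel_pairs` and the toy examples are hypothetical.

```python
from itertools import combinations


def multilabel_pairs(examples):
    """Split all unordered example pairs into positives and negatives.

    examples: list of (text, set_of_labels).
    Assumed rule (illustrative only): a pair is positive if the two
    label sets share at least one label, otherwise negative.
    """
    positives, negatives = [], []
    for (text_a, labels_a), (text_b, labels_b) in combinations(examples, 2):
        pair = (text_a, text_b)
        if labels_a & labels_b:
            positives.append(pair)
        else:
            negatives.append(pair)
    return positives, negatives


# Toy examples using label names from the Toxicity challenge task.
toy_examples = [
    ("t1", {"Clean"}),
    ("t2", {"Clean"}),
    ("t3", {"Toxicity"}),
    ("t4", {"Insult", "Obscene", "Toxicity"}),
    ("t5", {"Obscene", "Toxicity"}),
]

positives, negatives = multilabel_pairs(toy_examples)
print(len(positives), len(negatives))
```

Under this assumed rule, an example labelled only "Toxicity" is treated as similar to one labelled "Insult", "Obscene" and "Toxicity", while "Clean" examples are only ever paired positively with each other. This would be consistent with both the separation of "Clean" and the clustering of co-occurring toxic labels discussed above.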
6 Conclusion

In this thesis, I set out to find out whether fine-tuning Sentence Transformer models with SetFit could achieve few-shot text classification performance on par with the state-of-the-art results in Finnish, and to understand the capabilities of the method by comparing monolingual and multilingual models across a range of text classification tasks, including multilabel and multiclass classification.

My first research question was: "How well do the fine-tuned Sentence Transformer models perform with Finnish tasks compared to the reported benchmarks? Can the SetFit method reach results comparable to the state-of-the-art?" My initial hypothesis was that, following the results by Tunstall et al. [13], state-of-the-art results would not be achieved. This hypothesis was confirmed: I was unable to reach previous state-of-the-art results with SetFit. However, the results are promising for data-sparse classification tasks and lower-end computing: one can achieve decent classification results with a small amount of data and few computational resources.

The second research question was: "Is there a difference in performance when fine-tuning Finnish Sentence Transformer models versus multilingual Sentence Transformer models with Finnish data?" I found that the SetFit performance was linked to the model selection. The monolingual 125M parameter FinSBERT was the best in all tasks, followed by the multilingual e5-small. This indicates that the monolingual Sentence Transformer has an advantage in this kind of few-shot learning, and that bigger models do not necessarily yield better results with this method, some of the models being double the size of the best performing ones.

The third and final research question was: "Is there a difference in performance when fine-tuning Sentence Transformers with multilabel classification versus multiclass classification in Finnish?"
and my hypothesis was that SetFit would succeed better in multiclass tasks than in multilabel tasks. The hypothesis was confirmed, as the SetFit method showed promise with multiclass tasks, nearing state-of-the-art results with considerably smaller datasets than in the original benchmarks of the Yle and Ylilauta tasks in Virtanen et al. [2]. With SetFit, the Yle task reached a maximum F1-score of 0.86 (compared to the originally reported benchmark of 0.92 accuracy), and with the Ylilauta dataset, a maximum F1-score of 0.74 was reached (compared to the original benchmark of 0.82 accuracy). However, with the multilabel FinCORE and Toxicity challenge tasks, the results left much to be desired. This is especially the case with the Toxicity challenge dataset, where each model appeared to overfit to favour certain label combinations.

Even though the results of this research show promise, any generalisations must be taken with a grain of salt. For now, there is no comparison with fine-tuning the state-of-the-art base models on training sets of the corresponding size, meaning that no direct comparison between them and SetFit can be made. The results from this research could also be improved: for comparison’s sake, the training arguments were not optimised, and the models might have achieved better results with further optimisation. I also did not compare SetFit to other few-shot methods, and there might be other methods more suitable for situations where the data is limited but the computational resources are not. As state-of-the-art results were not reached in any of the tasks tested, it might also make sense just to fine-tune e.g. a FinBERT Transformer model when there is enough data available.

This thesis laid some basic groundwork for validating few-shot methods in Finnish-language text classification. In the future, there are many paths to explore in few-shot learning research.
For example, it could be very interesting to see SetFit used to its full capabilities with proper parameter optimisation by, for example, restricting the training dataset size and optimising other parameters, such as the learning rate and the number of training steps. Another thing to consider would be the use of more curated datasets. In this thesis, the data was randomly sampled and might not have been the best representation. According to [8], samples chosen by a human expert yielded good results. For such small datasets, this wouldn’t be an impossible task. A third possible option for future research would be to compare SetFit with other few-shot methods and possibly test out the Finnish benchmarking dataset introduced in [18] to gain an understanding of SetFit’s performance compared to other existing methods.

In the age of GenAI, one can easily generate any number of examples with just a prompt. While this kind of synthetically generated data is certainly a possibility, there still remain fields of research that a generative model might not have been exposed to at any great length. Such is the case with many high-security domains, as well as highly specific research areas. In addition to this, one must take into account the possible bias, inaccuracies and other limitations when using machine-generated data. With the constant advances in the field of NLP, the situation might, however, change quite rapidly. At the time of writing this thesis, SetFit can be considered a cost-efficient solution for Finnish text classification, especially when data is scarce and on-premise processing is needed.

References

[1] S. Subramanian, V. Elango, and M. Gungor, Small language models (slms) can still pack a punch: A survey, 2025. arXiv: 2501.05465 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2501.05465.
[2] A. Virtanen, J. Kanerva, R. Ilo, et al., “Multilingual is not enough: BERT for Finnish”, CoRR, vol. abs/1912.07076, 2019. arXiv: 1912.07076. [Online].
Available: http://arxiv.org/abs/1912.07076.
[3] V. Kortesalmi, “Sentiment analysis with language models on finnish workplace well-being surveys”, M.S. thesis, University of Helsinki, 2024. [Online]. Available: http://hdl.handle.net/10138/577817.
[4] S. Samsi, D. Zhao, J. McDonald, et al., From words to watts: Benchmarking the energy costs of large language model inference, 2023. arXiv: 2310.03003 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2310.03003.
[5] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in NLP”, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez, Eds., Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 3645–3650. doi: 10.18653/v1/P19-1355. [Online]. Available: https://aclanthology.org/P19-1355/.
[6] A. Boulemtafes, A. Derhab, and Y. Challal, “A review of privacy-preserving techniques for deep learning”, Neurocomputing, vol. 384, pp. 21–45, 2020, issn: 0925-2312. doi: https://doi.org/10.1016/j.neucom.2019.11.041. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231219316431.
[7] M. Loerakker, L. Müter, and M. Schraagen, “Fine-tuning language models on Dutch protest event tweets”, in Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024), A. Hürriyetoğlu, H. Tanev, S. Thapa, and G. Uludoğan, Eds., St. Julians, Malta: Association for Computational Linguistics, 2024, pp. 6–23. [Online]. Available: https://aclanthology.org/2024.case-1.2/.
[8] L. Loukas, I. Stogiannidis, P. Malakasiotis, and S. Vassos, “Breaking the bank with ChatGPT: Few-shot text classification for finance”, in Proceedings of the Fifth Workshop on Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting, C.-C. Chen, H. Takamura, P. Mathur, R. Sawhney, H.-H. Huang, and H.-H. Chen, Eds., 2023, pp. 74–80. [Online]. Available: https://aclanthology.org/2023.finnlp-1.7/.
[9] V. Skantsi and V. Laippala, “Analyzing the unrestricted web: The finnish corpus of online registers”, Nordic Journal of Linguistics, vol. 1, no. 1, 2023, issn: 15024717. doi: 10.1017/S0332586523000021.
[10] V. Laippala, J. Egbert, D. Biber, and A.-J. Kyröläinen, “Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents”, Language Resources and Evaluation, vol. 55, no. 3, pp. 757–788, Sep. 2021, issn: 1574-020X. doi: 10.1007/s10579-020-09519-z. [Online]. Available: https://doi.org/10.1007/s10579-020-09519-z.
[11] B. Kilic, F. Bex, and A. Gatt, “Contrast is all you need”, in ASAIL 2023 - Automated Semantic Analysis of Information in Legal Text, F. Lagioia, J. Mumford, D. Odekerken, and H. Westermann, Eds., ser. CEUR Workshop Proceedings, CEUR, 2023, pp. 72–82.
[12] A. Eskelinen, L. Silvala, F. Ginter, S. Pyysalo, and V. Laippala, “Toxicity detection in Finnish using machine translation”, in Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), T. Alumäe and M. Fishel, Eds., Tórshavn, Faroe Islands: University of Tartu Library, May 2023, pp. 685–697. [Online]. Available: https://aclanthology.org/2023.nodalida-1.68.
[13] L. Tunstall, N. Reimers, U. E. S. Jo, et al., “Efficient Few-Shot Learning Without Prompts”, 2022. arXiv: 2209.11055. [Online]. Available: http://arxiv.org/abs/2209.11055.
[14] A. S. Kwak, C. Jeong, J. W. Lim, and B. Min, A korean legal judgment prediction dataset for insurance disputes, 2024. arXiv: 2401.14654 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2401.14654.
[15] T. Hadeliya and D. Kajtoch, Evaluation of few-shot learning for classification tasks in the polish language, 2024. arXiv: 2404.17832 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2404.17832.
[16] K. Pannerselvam, S. Rajiakodi, S. Thavareesan, S. Thangasamy, and K. Ponnusamy, “SetFit: A robust approach for offensive content detection in Tamil-English code-mixed conversations using sentence transfer fine-tuning”, in Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages, B. R. Chakravarthi, R. Priyadharshini, A. K. Madasamy, et al., Eds., St. Julian’s, Malta: Association for Computational Linguistics, 2024, pp. 35–42. [Online]. Available: https://aclanthology.org/2024.dravidianlangtech-1.6/.
[17] V. Beliveau, H. Kaas, M. Prener, et al., Classification of radiological text in small and imbalanced datasets in a non-english language, 2024. arXiv: 2409.20147 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2409.20147.
[18] R. Luukkonen, V. Komulainen, J. Luoma, et al., “FinGPT: Large generative models for a small language”, in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds., Singapore: Association for Computational Linguistics, Dec. 2023, pp. 2710–2726. doi: 10.18653/v1/2023.emnlp-main.164. [Online]. Available: https://aclanthology.org/2023.emnlp-main.164/.
[19] S. Rönnqvist, V. Skantsi, M. Oinonen, and V. Laippala, “Multilingual and Zero-Shot is Closing in on Monolingual Web Register Classification”, Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 157–165, 2021. [Online]. Available: https://aclanthology.org/2021.nodalida-main.16.
[20] W. Mcculloch and W. Pitts, “A logical calculus of ideas immanent in nervous activity”, Bulletin of Mathematical Biophysics, vol. 5, pp. 127–147, 1943.
[21] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain”, eng, Psychological review, vol. 65, no. 6, pp. 386–408, 1958, issn: 0033-295X. doi: 10.1037/h0042519.
[22] D. Jurafsky and J. H. Martin, Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, N.J.: Pearson Prentice Hall, 2009.
[23] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, CoRR, vol. abs/1908.10084, 2019. arXiv: 1908.10084. [Online]. Available: http://arxiv.org/abs/1908.10084.
[24] A. Sherstinsky, “Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network”, CoRR, vol. abs/1808.03314, 2018. arXiv: 1808.03314. [Online]. Available: http://arxiv.org/abs/1808.03314.
[25] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention Is All You Need”, Advances in Neural Information Processing Systems, vol. 2017-Decem, no. Nips, pp. 5999–6009, Jul. 2017, issn: 10495258. arXiv: 1706.03762. [Online]. Available: http://arxiv.org/abs/1706.03762.
[26] T. B. Brown, B. Mann, N. Ryder, et al., Language models are few-shot learners, 2020. arXiv: 2005.14165 [cs.CL].
[27] G. Yenduri, R. M, C. S. G, et al., Generative pre-trained transformer: A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions, 2023. arXiv: 2305.10435 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2305.10435.
[28] OpenAI, J. Achiam, S. Adler, et al., Gpt-4 technical report, 2024. arXiv: 2303.08774 [cs.CL].
[29] B. Workshop, T. L. Scao, et al., Bloom: A 176b-parameter open-access multilingual language model, 2023. arXiv: 2211.05100 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2211.05100.
[30] H. Touvron, T. Lavril, G. Izacard, et al., Llama: Open and efficient foundation language models, 2023. arXiv: 2302.13971 [cs.CL]. [Online]. Available: https://arxiv.org/abs/2302.13971.
[31] J. Kanerva, F. Ginter, L. H. Chang, et al., “Towards diverse and contextually anchored paraphrase modeling: A dataset and baselines for Finnish”, Natural Language Engineering, vol. 34, no. 6, pp.
1–35, 2023, issn: 14698110. doi: 10.1017/S1351324923000086. REFERENCES 59 [32] A. Conneau, K. Khandelwal, N. Goyal, et al., “Unsupervised Cross-lingual Representation Learning at Scale”, CoRR, pp. 31–38, Nov. 2019. doi: 10. 18653/v1/p19-4007. arXiv: 1911.02116. [Online]. Available: http://arxiv. org/abs/1911.02116. [33] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 1, no. Mlm, pp. 4171–4186, Oct. 2018. arXiv: 1810.04805. [Online]. Available: http://arxiv.org/abs/1810.04805. [34] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space”, CoRR, vol. abs/1301.3781, 2013. [Online]. Available: http://dblp.uni- trier.de/db/journals/corr/corr1301. html#abs-1301-3781. [35] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation”, in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. [Online]. Available: http://www.aclweb. org/anthology/D14-1162. [36] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vec- tors with subword information”, Transactions of the Association for Com- putational Linguistics, vol. 5, L. Lee, M. Johnson, and K. Toutanova, Eds., pp. 135–146, 2017. doi: 10.1162/tacl_a_00051. [Online]. Available: https: //aclanthology.org/Q17-1010/. [37] R. Kiros, Y. Zhu, R. Salakhutdinov, et al., Skip-thought vectors, 2015. arXiv: 1506.06726 [cs.CL]. [Online]. Available: https://arxiv.org/abs/1506. 06726. REFERENCES 60 [38] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data”, in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, M. 
Palmer, R. Hwa, and S. Riedel, Eds., Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 670–680. doi: 10.18653/v1/D17-1070. [Online]. Available: https://aclanthology. org/D17-1070/. [39] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference”, in Proceedings of the 2015 Con- ference on Empirical Methods in Natural Language Processing, L. Màrquez, C. Callison-Burch, and J. Su, Eds., Lisbon, Portugal: Association for Computa- tional Linguistics, Sep. 2015, pp. 632–642. doi: 10.18653/v1/D15- 1075. [Online]. Available: https://aclanthology.org/D15-1075/. [40] D. Cer, Y. Yang, S. Kong, et al., “Universal Sentence Encoder”, CoRR, vol. abs/1803.11175, 2018. arXiv: 1803.11175. [Online]. Available: http://arxiv.org/abs/1803. 11175. [41] Y. Liu, M. Ott, N. Goyal, et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, no. 1, 2019, issn: 2331-8422. arXiv: 1907.11692. [On- line]. Available: http://arxiv.org/abs/1907.11692. [42] A. Williams, N. Nangia, and S. Bowman, “A broad-coverage challenge cor- pus for sentence understanding through inference”, in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent, Eds., New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 1112–1122. doi: 10.18653/v1/N18- 1101. [Online]. Available: https://aclanthology.org/N18-1101/. REFERENCES 61 [43] A. Edwards and J. Camacho-Collados, “Language models for text classifica- tion: Is in-context learning enough?”, in Proceedings of the 2024 Joint Inter- national Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds., Torino, Italia: ELRA and ICCL, May 2024, pp. 10 058–10 072. [Online]. 
Available: https://aclanthology.org/2024. lrec-main.879/. [44] V. Laippala, R. Kyllönen, J. Egbert, D. Biber, and S. Pyysalo, “Toward Mul- tilingual Identification of Online Registers”, in Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, Finland: Linköping Univer- sity Electronic Press, 2019, pp. 292–297. [Online]. Available: https://www. aclweb.org/anthology/W19-6130. [45] M. Tolba, S. Ouadfel, and S. Meshoul, “Hybrid ensemble approaches to on- line harassment detection in highly imbalanced data”, Expert Systems with Applications, vol. 175, p. 114 751, 2021, issn: 0957-4174. doi: https://doi. org/10.1016/j.eswa.2021.114751. [Online]. Available: https://www. sciencedirect.com/science/article/pii/S0957417421001925. [46] X. Li, X. Sun, Y. Meng, J. Liang, F. Wu, and J. Li, Dice loss for data- imbalanced nlp tasks, 2020. arXiv: 1911.02855 [cs.CL]. [47] S. Shaikh, S. M. Daudpota, A. S. Imran, and Z. Kastrati, “Towards improved classification accuracy on highly imbalanced text dataset using deep neural language models”, Applied Sciences, vol. 11, no. 2, 2021, issn: 2076-3417. doi: 10.3390/app11020869. [Online]. Available: https://www.mdpi.com/2076- 3417/11/2/869. [48] S. Wang, H. Fang, M. Khabsa, H. Mao, and H. Ma, Entailment as few-shot learner, 2021. arXiv: 2104.14690 [cs.CL]. REFERENCES 62 [49] M. Lewis, Y. Liu, N. Goyal, et al., “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension”, in Proceedings of the 58th Annual Meeting of the Association for Computa- tional Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds., Online: Association for Computational Linguistics, Jul. 2020, pp. 7871–7880. doi: 10.18653/v1/2020.acl- main.703. [Online]. Available: https:// aclanthology.org/2020.acl-main.703/. [50] T. Schick and H. Schütze, Exploiting cloze questions for few shot text classifi- cation and natural language inference, 2021. arXiv: 2001.07676 [cs.CL]. [51] J. 
Luoma, M. Oinonen, M. Pyykönen, V. Laippala, and S. Pyysalo, “A broad- coverage corpus for Finnish named entity recognition”, Proceedings of the 12th Language Resources and Evaluation Conference, no. 5, pp. 4615–4624, 2020. [Online]. Available: https://aclanthology.org/2020.lrec-1.567. [52] J. Luotolahti, J. Kanerva, V. Laippala, S. Pyysalo, and F. Ginter, “Towards universal web parsebanks”, in Proceedings of the Third International Confer- ence on Dependency Linguistics (Depling 2015), J. Nivre and E. Hajičová, Eds., Uppsala, Sweden: Uppsala University, Uppsala, Sweden, Aug. 2015, pp. 211–220. [Online]. Available: https://aclanthology.org/W15-2124/. [53] J. Kanerva, F. Ginter, N. Miekka, A. Leino, and T. Salakoski, “Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task”, in Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium: Association for Computa- tional Linguistics, 2018, pp. 133–142, isbn: 9781948087827. doi: 10.18653/ v1/K18-2013. [Online]. Available: http://aclweb.org/anthology/K18- 2013. REFERENCES 63 [54] C. S. Perone, R. P. Silveira, and T. S. Paula, “Evaluation of sentence embed- dings in downstream and linguistic probing tasks”, CoRR, vol. abs/1806.06259, 2018. arXiv: 1806.06259. [Online]. Available: http://arxiv.org/abs/1806. 06259. [55] G. Piao, “Scholarly text classification with sentence bert and entity embed- dings”, in Trends and Applications in Knowledge Discovery and Data Mining, M. Gupta and G. Ramakrishnan, Eds., Cham: Springer International Publish- ing, 2021, pp. 79–87, isbn: 978-3-030-75015-2. [56] D. Tam, R. R. Menon, M. Bansal, S. Srivastava, and C. Raffel, “Improving and simplifying pattern exploiting training”, in Proceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. 
Yih, Eds., Online and Punta Cana, Domini- can Republic: Association for Computational Linguistics, Nov. 2021, pp. 4980– 4991. doi: 10.18653/v1/2021.emnlp-main.407. [Online]. Available: https: //aclanthology.org/2021.emnlp-main.407/. [57] R. Karimi Mahabadi, L. Zettlemoyer, J. Henderson, et al., “Prompt-free and efficient few-shot learning with language models”, in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds., Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 3638–3652. doi: 10.18653/v1/2022.acl- long.254. [Online]. Available: https:// aclanthology.org/2022.acl-long.254/. [58] H. Liu, D. Tam, M. Muqeeth, et al., Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022. arXiv: 2205.05638 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2205.05638. REFERENCES 64 [59] P. Keung, Y. Lu, G. Szarvas, and N. A. Smith, “The multilingual Amazon reviews corpus”, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds., Online: Association for Computational Linguistics, Nov. 2020, pp. 4563–4568. doi: 10.18653/v1/2020.emnlp-main.369. [Online]. Available: https://aclanthology.org/2020.emnlp-main.369/. [60] E. Fsih, S. Kchaou, R. Boujelbane, and L. Hadrich-Belguith, “Benchmark- ing transfer learning approaches for sentiment analysis of Arabic dialect”, in Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), H. Bouamor, H. Al-Khalifa, K. Darwish, et al., Eds., Abu Dhabi, United Arab Emirates (Hybrid): Association for Computational Linguistics, 2022, pp. 431–435. doi: 10.18653/v1/2022.wanlp-1.44. [Online]. Available: https://aclanthology.org/2022.wanlp-1.44/. [61] N. Reimers and I. Gurevych, Making monolingual sentence embeddings multi- lingual using knowledge distillation, 2020. 
Appendix A FinCORE Register Distribution

Register name                                 Abbreviation      N
Narrative (main)                              NA            3 956
News reports / news blogs                     NE            1 359
Personal blog                                 PB            1 160
Community blog                                CB              374
Sports reports                                SR              357
Magazine / Online article                     OA              342
Story                                         FC              121
Travel blog                                   TB               82
Historical article                            HA               79
Informational description (main)              IN            1 719
Description of a thing                        DT              550
Encyclopedia articles                         EN              238
Description of a person                       DP              142
Information blog                              IB              125
Report                                        RP              121
Legal terms / conditions                      LT              114
Research article                              RA               78
Course material                               CM               61
Job description                               JD               47
FAQs                                          FA               23
Opinion (main)                                OP            1 399
Reviews                                       RV              554
Religious text/sermon                         RS              405
Opinion blog                                  OB              363
Advice                                        AV               31
Machine translated / generated texts (main)   MT            1 388
Informational persuasion (main)               IP            1 334
Description with intent to sell               DS            1 145
News-opinion blog / editorial                 EB               97
Interactive discussion (main)                 ID            1 081
Discussion forums                             DF              749
Question-answer forum                         QA               91
How-to / instructions (main)                  HI              549
Recipe                                        RE               45
Spoken (main)                                 SP               75
Interview                                     IT               50
Formal speech                                 FS               25
Lyrical (main)                                LY               25
Poem                                          PO               25