Distinguishing Noise and Main Text Content from Web-Sourced Plain Text Documents Using Sequential Neural Networks
Salmela, Anna (2022-05-04)
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
Open access
The permanent address of this publication is:
https://urn.fi/URN:NBN:fi-fe2022053039480
Abstract
Boilerplate removal and the identification of the actual textual content is a crucial step in web corpus creation. However, existing methods do not always filter out the noise perfectly and are often not applicable to plain text corpora. In this thesis, I will develop machine learning methods to identify the main textual content in plain text documents. I will utilize transfer learning and pretrained language models as a base for training monolingual models with French and Swedish data, as well as a multilingual model with French, Swedish, English, Finnish, German and Spanish data. I will compare two machine learning architectures based on the XLM-RoBERTa language model: first, a classification model built on top of the pretrained XLM-RoBERTa model, and second, a model using an additional Long Short-Term Memory (LSTM) network layer. I will show that the LSTM layer improves the classification performance of the XLM-RoBERTa model and that the resulting multilingual model performs well even with data in unseen languages. I will perform a further analysis of the results and show that the results of boilerplate detection with the trained models differ across text varieties. Certain types of text documents, such as lyrical texts or discussion forum texts, pose challenges for boilerplate detection, and it would be beneficial for future research to focus on gathering data that has been difficult to clean.