High Quality NLP Data Pipelines

The publication is subject to copyright regulations. The work may be read and printed for personal use. Commercial use is prohibited.

Online publication

DOI

Abstract

In an era where companies use Natural Language Processing (NLP) to guide business processes, the importance of high quality data pipelines cannot be overstated. Customer interactions produce vast amounts of data containing crucial information about the customers. This thesis explores the factors behind high quality NLP data pipelines, their challenges, and best practices for building robust applications that deliver reliable and actionable insights. Producing valuable insights through NLP requires an automated, scalable and dependable data pipeline that can produce inference results for various types of interaction texts while maintaining data quality.

Through a literature review, this thesis examines the requirements and steps for producing such pipelines at industrial scale, uncovering seven key factors that the system should aim to fulfill. Considering aspects ranging from ways of working to data governance, it compiles an end-to-end list of factors to apply when building an NLP data pipeline. These factors suggest that the system must be scalable, dependable, automated, monitored and well documented, with its data quality validated and evaluated, while being developed in an effective, iterative way.

This thesis emphasizes the importance of data quality, which is assessed through various validation techniques and data quality dimensions that can identify deficiencies or issues in the resulting data. As the thesis aims for industrial-scale results, it also explores data governance and documentation by means of data lineage and metadata. The literature review provides a frame for the implementation: at the end of the thesis, a high quality NLP data pipeline is developed according to the high quality factors identified from the literature. The pipeline uses pre-built models to extract topics and sentiments from interaction texts.
Ultimately, this thesis sets out to provide a comprehensive guide for practitioners and researchers to design and implement high quality NLP data pipelines.
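To make the abstract's pipeline concrete, the following is a minimal, hypothetical sketch of the stages it describes: validate incoming interaction texts, run topic and sentiment inference, and collect rejected records for quality monitoring. All names are illustrative, and a simple keyword-based classifier stands in for the pre-built NLP models the thesis uses; it is not the thesis's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    """One inference result flowing out of the pipeline."""
    text: str
    topic: str
    sentiment: str

def validate_input(text: str) -> bool:
    """Data quality gate: completeness (non-empty) and validity (plausible length)."""
    return bool(text and text.strip()) and len(text) < 10_000

# Stand-ins for pre-built topic and sentiment models (illustrative only).
TOPIC_KEYWORDS = {"billing": ["invoice", "charge"], "delivery": ["shipping", "late"]}
NEGATIVE_WORDS = {"late", "bad", "unhappy"}

def infer(text: str) -> Insight:
    """Assign a topic and a sentiment label to a single interaction text."""
    lowered = text.lower()
    topic = next((t for t, kws in TOPIC_KEYWORDS.items()
                  if any(k in lowered for k in kws)), "other")
    sentiment = "negative" if any(w in lowered for w in NEGATIVE_WORDS) else "positive"
    return Insight(text, topic, sentiment)

def run_pipeline(texts):
    """Validate, infer, and split results from rejected records.

    Rejected records are returned separately so a monitoring step can
    track data quality metrics, as the thesis's factors recommend.
    """
    results, rejected = [], []
    for t in texts:
        if not validate_input(t):
            rejected.append(t)
            continue
        results.append(infer(t))
    return results, rejected

results, rejected = run_pipeline(["The shipping was late and I am unhappy.", ""])
```

In a production setting, `validate_input` would be replaced by a fuller set of data quality dimension checks, and `infer` by calls to the actual pre-built models, but the validate-infer-monitor shape of the flow stays the same.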
