High Quality NLP Data Pipelines
Torikka, Juuso (2025-01-31)
The publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
open access
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe202502059785
Abstract
In an era where companies utilize Natural Language Processing to guide business
processes, the importance of high quality data pipelines cannot be overstated. Customer interactions can produce vast amounts of data containing crucial information
about the customers. This thesis explores the factors of high quality NLP data
pipelines, their challenges, and best practices for building robust applications that deliver reliable and actionable insights.
Producing valuable insights through NLP requires an automated, scalable, and dependable data pipeline that can produce inference results for various types of interaction texts while maintaining data quality. Drawing on the literature, this thesis explores the requirements and steps for producing such pipelines at an industrial scale and uncovers seven key factors that the system should aim to fulfill. The thesis considers factors ranging from ways of working to data governance and compiles an end-to-end list of factors to apply when building an NLP data pipeline. These factors suggest that the system must be scalable, dependable, automated, monitored, and well documented, with its data quality validated and evaluated, while being developed in an effective, iterative way.
This thesis emphasizes the importance of data quality. Data quality is assessed through various validation techniques and data quality dimensions which
can identify deficiencies or issues within the resulting data. As this thesis aims
for industrial scale results, it also explores the aspects of data governance and
documentation by means of data lineage and metadata.
A literature review provides a framework for the implementation. At the end of
this thesis, a high quality NLP data pipeline is developed according to the high
quality factors identified from the literature. The pipeline makes use of pre-built
models to extract topics and sentiments from interaction texts.
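As a rough illustration of what such a pre-built-model inference step could look like (a sketch under assumptions, not the pipeline developed in this thesis), the Python snippet below applies an off-the-shelf Hugging Face sentiment model to two hypothetical interaction texts; the model choice and the example texts are assumptions made for illustration only.

    # Hypothetical sketch of pre-built-model inference on interaction texts;
    # not the thesis's actual implementation.
    from transformers import pipeline

    # Load an off-the-shelf sentiment-analysis model (default checkpoint assumed).
    sentiment = pipeline("sentiment-analysis")

    # Made-up example interaction texts.
    interaction_texts = [
        "The support agent resolved my issue quickly.",
        "I waited for an hour and nobody answered.",
    ]

    # Each prediction carries a label (e.g. POSITIVE/NEGATIVE) and a confidence score.
    for text, prediction in zip(interaction_texts, sentiment(interaction_texts)):
        print(f"{prediction['label']:>8} ({prediction['score']:.2f})  {text}")

In a production pipeline this step would typically be wrapped with the validation, monitoring, and documentation factors discussed above rather than run as a standalone script.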
Ultimately, this thesis sets out to provide a comprehensive guide for practitioners and researchers to design and implement high quality NLP data pipelines.