Automated classification of receipts and invoices along with document extraction
Lehtonen, Roope (2020-11-24)
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
open
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2020120399232
Abstract
Companies might receive dozens or even hundreds of receipts and invoices per day.
Keeping them all organized consumes many working hours: invoices must be paid on
time and receipts must be archived properly. This research aims to reduce the amount of
manual labor that organizing requires by automating the classification.
I am writing this thesis in collaboration with my workplace, a company called
Eneroc Ltd. Document classification was consuming too many of the company's working
hours, so the company created a system to automate the process. The existing system
uses a text-based approach that searches for specific keywords in the documents. The
system works rather well, but the company wanted to find out whether a more modern
approach could outperform it and add more features to the process.
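
For illustration only, a keyword search of the kind described above could look roughly like the following sketch. It is not Eneroc's actual system: the keyword lists, the labels, and the function name `keyword_classify` are all assumptions made up for this example.

```python
import re

# Hypothetical keyword lists; the real system's keywords are not given in the abstract.
KEYWORDS = {
    "invoice": ("invoice", "due date", "payment terms"),
    "receipt": ("receipt", "cash", "change"),
}

def keyword_classify(text):
    """Return the label whose keywords occur most often, or None if none occur."""
    lowered = text.lower()
    counts = {
        label: sum(
            len(re.findall(r"\b" + re.escape(word) + r"\b", lowered))
            for word in words
        )
        for label, words in KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

A rule set like this is easy to read and maintain, but it depends entirely on the chosen keywords, which is one reason a learned model may be worth comparing against it.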
The goal of this research is to find out whether a machine-learning-based approach can be
used to classify documents into invoices and receipts. In addition to the classification, the
approach should also be able to collect key information from the documents. This thesis
describes the workflow of creating a machine-learning-based solution to tackle this
challenge.
The research resulted in an application that takes in invoices and receipts in PDF format.
The system trains a k-nearest neighbors (k-NN) model with training data that was created
during the research. The model is then used to classify different parts of new PDF
files into predefined categories. The key information is extracted from these categories.
The k-NN model was validated with k-fold cross-validation. The validation showed that
the model performs correctly. Some preprocessing was also introduced during the work,
which further improved the results. The good results with the k-NN model suggest that
using a proper machine learning solution would be worthwhile.
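
As a minimal sketch of the technique named above, the following pure-Python code classifies a point by majority vote among its k nearest neighbors and estimates accuracy with k-fold cross-validation. It is not the thesis implementation: the two features per text block (relative vertical position and share of digit characters), the category names, and the toy data are assumptions chosen only to make the example run.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    # Order training indices by Euclidean distance to x.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def k_fold_accuracy(X, y, n_folds=5, k=3):
    """Mean accuracy over n_folds folds; fold f holds out samples with index % n_folds == f."""
    scores = []
    for f in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == f]
        train_idx = [i for i in range(len(X)) if i % n_folds != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return sum(scores) / n_folds

# Hypothetical toy data: (relative vertical position, share of digit characters).
X = [(0.05, 0.10), (0.08, 0.05), (0.10, 0.12), (0.06, 0.08), (0.09, 0.11),
     (0.90, 0.80), (0.95, 0.85), (0.88, 0.75), (0.92, 0.90), (0.93, 0.82)]
y = ["header"] * 5 + ["total"] * 5

mean_accuracy = k_fold_accuracy(X, y, n_folds=5, k=3)
```

Each sample is held out exactly once, so the mean accuracy reflects how the model handles data it was not trained on. Real document features would be noisier than this toy set, and the choice of k and of the distance measure would need tuning.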
The final classification between receipts and invoices, as well as the key information
extraction, is done based on the classified document parts. This works rather well for
classification and simple key information extraction, but more complex key information
extraction, such as product list extraction, still requires more work.
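
One way the step above could work is sketched below: the document label and the simple key fields are both derived from the categories assigned to the document's parts. The abstract does not state the actual rules or category names, so `INVOICE_ONLY`, the category strings, and both function names are hypothetical.

```python
# Hypothetical categories assumed to appear only on invoices.
INVOICE_ONLY = {"due_date", "reference_number", "bank_account"}

def classify_document(parts):
    """parts: list of (category, text) pairs produced by the part classifier."""
    categories = {category for category, _ in parts}
    # Assumed rule: any invoice-only category marks the document as an invoice.
    return "invoice" if categories & INVOICE_ONLY else "receipt"

def extract_key_info(parts, wanted=("total", "date", "due_date")):
    """Return the first text found for each wanted category."""
    info = {}
    for category, text in parts:
        if category in wanted and category not in info:
            info[category] = text
    return info

# Example input, as a part classifier might label an invoice's text blocks.
invoice_parts = [
    ("date", "24.11.2020"),
    ("due_date", "8.12.2020"),
    ("total", "120.00"),
]
```

Single-valued fields such as a total or a date fit this lookup pattern well. A product list spans many related parts, which is consistent with the abstract's note that it needs more work.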
The research showed that a machine learning solution can be used to classify documents
into invoices and receipts, and also to collect key information from the documents. The
created application is not yet ready for deployment, but it gives a good foundation for
future development. The research also shows which steps to take next and where to focus
when improving the system.