Automated classification of receipts and invoices along with document extraction

Lehtonen, Roope

2020-11-24

Master's thesis

Computer Science

Lehtonen_Roope_opinnayte.pdf

1.03 MB

open access

This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.

Downloads: 2081

Permanent address

https://urn.fi/URN:NBN:fi-fe2020120399232

Abstract

Companies may receive dozens or even hundreds of receipts and invoices per day. Keeping them organized consumes many working hours: invoices must be paid on time and receipts must be archived properly. This research aims to reduce that manual labor through automated classification. I am writing this thesis in collaboration with my workplace, a company called Eneroc Ltd. Document classification was consuming too many of the company's working hours, so it built a system to automate the process. The existing system uses a text-based approach that searches the documents for specific keywords. It works rather well, but the company wanted to find out whether a more modern approach could outperform it and add more features to the process. The goal of this research is to find out whether a machine learning based approach can classify documents into invoices and receipts. In addition to classification, the approach should be able to collect key information from the documents. This thesis describes the workflow of creating a machine learning based solution to this challenge. The research resulted in an application that takes invoices and receipts in PDF format as input. The system trains a k-nearest neighbors (k-NN) model on training data that was created during the research. The model is then used to classify the different parts of new PDF files into predefined categories, and the key information is extracted from these categories. The k-NN model was validated with k-fold cross-validation, which showed that the model performs correctly. Preprocessing was also introduced during the process, which further improved the results. The good results with the k-NN model imply that a proper machine learning solution would be profitable.
The final classification between receipts and invoices, as well as the key information extraction, is based on the classified document parts. This works rather well for classification and for simple key information extraction, but more complex extraction, such as extracting the product list, still requires more work. The research showed that a machine learning solution can be used to classify documents into invoices and receipts and to collect key information from them. The resulting application is not yet ready for deployment, but it provides a good foundation for future development. The research also indicates which steps to take next and where to focus when improving the system.
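
As an illustration of the two techniques the abstract names, the following minimal sketch shows a k-NN classifier scored with k-fold cross-validation. It is not the thesis implementation: the two features (relative vertical position on the page and fraction of digit characters), the category names `header` and `total_line`, the synthetic data, and all function names are assumptions made only for this example.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Return the majority label among the k training points nearest to x."""
    ranked = sorted(
        (math.dist(point, x), label) for point, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def k_fold_accuracy(X, y, n_folds=5, k=3, seed=0):
    """Mean k-NN accuracy over n_folds disjoint validation folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def make_toy_blocks(n_per_class=30, seed=1):
    """Synthetic text-block features (hypothetical, not from the thesis).

    Each block is (relative vertical position, fraction of digit characters).
    """
    rng = random.Random(seed)
    centers = {"header": (0.10, 0.10), "total_line": (0.85, 0.60)}
    X, y = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            X.append((cx + rng.uniform(-0.08, 0.08),
                      cy + rng.uniform(-0.08, 0.08)))
            y.append(label)
    return X, y


X, y = make_toy_blocks()
print("mean cross-validated accuracy:", k_fold_accuracy(X, y))
print("predicted category:", knn_predict(X, y, (0.12, 0.10)))
```

The toy clusters are deliberately well separated, so the score here says nothing about performance on real invoices and receipts; the sketch only shows how each document part is held out once and classified by its nearest labeled neighbors.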
