Detecting and Analyzing Text Reuse with BLAST

dc.contributor.author: Vesanto, Aleksi
dc.contributor.department: Department of Future Technologies
dc.contributor.faculty: Faculty of Science and Engineering
dc.contributor.studysubject: Computer Science
dc.date.accessioned: 2019-01-31T22:00:22Z
dc.date.available: 2019-01-31T22:00:22Z
dc.date.issued: 2019-01-15
dc.description.abstract: In this thesis I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse by leveraging BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biological sequences, such as DNA and protein sequences. I explain the original BLAST algorithm in depth by going through it step by step. I also describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST text reuse detection method by comparing it against the previous state of the art and show that the proposed method outperforms it by a large margin. I apply the method to a dataset of 3 million documents from scanned Finnish newspapers and journals that have been converted to text using OCR (Optical Character Recognition) software. I sort the results into three categories: everyday text reuse, long-term reuse, and viral news. I describe each category, provide examples, and propose a novel method of calculating a virality score for the clusters.
dc.format.extent: 70
dc.identifier.olddbid: 163519
dc.identifier.oldhandle: 10024/146706
dc.identifier.uri: https://www.utupub.fi/handle/11111/14357
dc.identifier.urn: URN:NBN:fi-fe201901313724
dc.language.iso: eng
dc.rights: This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
dc.rights.accessrights: open
dc.source.identifier: https://www.utupub.fi/handle/10024/146706
dc.subject: Text reuse, NLP, Bioinformatics, sequence alignment, OCR
dc.title: Detecting and Analyzing Text Reuse with BLAST
dc.type.ontasot: Master's thesis

Files

Name: Vesanto_Aleksi_opinnayte.pdf
Size: 3.33 MB
Format: Adobe Portable Document Format