Detecting and Analyzing Text Reuse with BLAST
| dc.contributor.author | Vesanto, Aleksi | |
| dc.contributor.department | fi=Tulevaisuuden teknologioiden laitos|en=Department of Future Technologies| | |
| dc.contributor.faculty | fi=Luonnontieteiden ja tekniikan tiedekunta|en=Faculty of Science and Engineering| | |
| dc.contributor.studysubject | fi=Tietojenkäsittelytiede|en=Computer Science| | |
| dc.date.accessioned | 2019-01-31T22:00:22Z | |
| dc.date.available | 2019-01-31T22:00:22Z | |
| dc.date.issued | 2019-01-15 | |
| dc.description.abstract | In this thesis I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse by leveraging BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biological sequences, such as DNA and protein sequences. I explain the original BLAST algorithm step by step and describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST-based text reuse detection method by comparing it against the previous state of the art and show that the proposed method outperforms it by a large margin. I apply the method to a dataset of 3 million documents from scanned Finnish newspapers and journals, which have been converted to text using OCR (Optical Character Recognition) software. I group the results into three categories: everyday text reuse, long-term reuse, and viral news. I describe each category with examples and propose a novel method of calculating a virality score for the clusters. | |
| dc.format.extent | 70 | |
| dc.identifier.olddbid | 163519 | |
| dc.identifier.oldhandle | 10024/146706 | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/14357 | |
| dc.identifier.urn | URN:NBN:fi-fe201901313724 | |
| dc.language.iso | eng | |
| dc.rights | fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.| | |
| dc.rights.accessrights | open | |
| dc.source.identifier | https://www.utupub.fi/handle/10024/146706 | |
| dc.subject | Text reuse, NLP, Bioinformatics, sequence alignment, OCR | |
| dc.title | Detecting and Analyzing Text Reuse with BLAST | |
| dc.type.ontasot | fi=Pro gradu -tutkielma|en=Master's thesis| | |