A Comparative Study of Three Transformer-Based Models for Embedding-Powered XGBoost Code Clone Detection

dc.contributor.author: Talvitie, Lauri
dc.contributor.department: fi=Tietotekniikan laitos|en=Department of Computing|
dc.contributor.faculty: fi=Teknillinen tiedekunta|en=Faculty of Technology|
dc.contributor.studysubject: fi=Tietotekniikka|en=Information and Communication Technology|
dc.date.accessioned: 2026-06-15T19:32:17Z
dc.date.issued: 2026-05-29
dc.description.abstract: Code cloning is a common yet potentially harmful practice in software development that can degrade maintainability and increase the need for debugging. The first objective of this thesis is to investigate the different types of code clones and the current approaches used to detect them. The second objective is to compare the performance of three transformer-based models in detecting code clones. The study investigates how two smaller models specialized for code-related tasks perform against a larger general-purpose large language model (LLM). The research methods comprise a literature review and method development. The literature review establishes a foundation on the current state of code clone detection, including the clone types and the different detection approaches. For the method development, a code clone detection pipeline is constructed in which CodeT5+, GraphCodeBERT, and Llama 3.2 1B generate code embeddings that are then used to train an XGBoost binary classifier. The results indicate that the code-specific models, CodeT5+ and GraphCodeBERT, perform significantly better than the larger general-purpose LLM, Llama 3.2 1B. The results show that pre-training data plays a more crucial role in a model's performance than the sheer size of the model.
dc.format.extent: 74
dc.identifier.uri: https://www.utupub.fi/handle/11111/61984
dc.identifier.urn: URN:NBN:fi-fe2026061569405
dc.language.iso: eng
dc.rights: fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.|
dc.rights.accessrights: avoin
dc.subject: code clone detection
dc.subject: transformer model
dc.subject: code embeddings
dc.title: A Comparative Study of Three Transformer-Based Models for Embedding-Powered XGBoost Code Clone Detection
dc.type.ontasot: fi=Diplomityö|en=Master's thesis|

Files

Name: Talvitie_Lauri_opinnayte.pdf
Size: 4.13 MB
Format: Adobe Portable Document Format