A Comparative Study of Three Transformer-Based Models for Embedding-Powered XGBoost Code Clone Detection

Open access
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
Online publication

Abstract

Code cloning is a common yet potentially harmful practice in software development that can degrade maintainability and increase the need for debugging. The first objective of this thesis is to investigate the different types of code clones and the current approaches used to detect them. The second objective is to compare the performance of three transformer-based models in detecting code clones. The study investigates how two smaller models specialized for code-related tasks perform against a larger general-purpose large language model (LLM). The research methods are a literature review and method development. The literature review establishes a foundation on the current state of code clone detection, including the clone types and the different detection approaches. For the method development, a code clone detection pipeline is constructed in which CodeT5+, GraphCodeBERT, and Llama 3.2 1B generate code embeddings that are then used to train an XGBoost binary classifier. The results indicate that the code-specific models, CodeT5+ and GraphCodeBERT, perform significantly better than the larger general-purpose LLM, Llama 3.2 1B. This suggests that pre-training data plays a more crucial role in a model's performance than model size alone.
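
The classification stage of such a pipeline can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the abstract does not state how the two embeddings of a code pair are combined, so the `pair_features` helper (concatenation plus absolute difference and element-wise product), the `train_clone_classifier` name, and the hyperparameter values are all assumptions. The embeddings themselves are assumed to have been produced beforehand by one of the three models.

```python
from typing import List, Sequence


def pair_features(emb_a: Sequence[float], emb_b: Sequence[float]) -> List[float]:
    """Combine two code embeddings into one feature vector for a pair classifier.

    Assumption: the features are [a, b, |a - b|, a * b]; the abstract does not
    specify how the embeddings of a pair are combined.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    abs_diff = [abs(x - y) for x, y in zip(emb_a, emb_b)]
    product = [x * y for x, y in zip(emb_a, emb_b)]
    return list(emb_a) + list(emb_b) + abs_diff + product


def train_clone_classifier(features, labels):
    """Fit a binary XGBoost classifier on pair features (1 = clone, 0 = not a clone)."""
    # Imported lazily so pair_features stays usable without xgboost installed.
    import numpy as np
    from xgboost import XGBClassifier

    # Hyperparameters are illustrative placeholders, not values from the thesis.
    model = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
    model.fit(np.asarray(features, dtype=np.float32), np.asarray(labels))
    return model
```

Under these assumptions, each labeled code pair is turned into one row with `pair_features`, and the resulting rows and labels are passed to `train_clone_classifier`. Repeating this once per embedding model would allow the three models to be compared on the same classifier setup.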
