A Comparative Study of Three Transformer-Based Models for Embedding-Powered XGBoost Code Clone Detection

Talvitie, Lauri

2026-05-29

Master's thesis

Information Technology

Talvitie_Lauri_opinnayte.pdf

4.13 MB

open

This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.

Downloads: 26

Permanent address

https://urn.fi/URN:NBN:fi-fe2026061569405

Abstract

Code cloning is a common yet potentially harmful practice in software development that can degrade maintainability and increase the need for debugging. The first objective of this thesis is to investigate the different types of code clones and the current approaches used to detect them. The second objective is to compare the performance of three transformer-based models in detecting code clones. The study investigates how two smaller models specialized for code-related tasks perform against a larger general-purpose large language model (LLM). The research methods comprise a literature review and method development. The literature review establishes the current state of code clone detection, including the clone types and the different detection approaches. For the method development, a code clone detection pipeline is constructed in which CodeT5+, GraphCodeBERT, and Llama 3.2 1B generate code embeddings that are then used to train an XGBoost binary classifier. The results indicate that the code-specific models, CodeT5+ and GraphCodeBERT, perform significantly better than the larger general-purpose LLM, Llama 3.2 1B. The results show that pre-training data plays a more crucial role in a model's performance than the sheer size of the model.
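
The pipeline described in the abstract (embed each code fragment, then train a binary classifier on pairs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: `toy_embed` is a hypothetical hashed bag-of-tokens stand-in for the transformer encoders (CodeT5+, GraphCodeBERT, Llama 3.2 1B), the way two embeddings are combined in `pair_features` is an assumption the abstract does not specify, and the XGBoost hyperparameters are placeholders.

```python
import hashlib

import numpy as np


def toy_embed(code: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a transformer encoder.

    Builds a hashed bag-of-tokens vector and L2-normalises it. A real
    pipeline would return the model's code embedding here instead.
    """
    vec = np.zeros(dim, dtype=np.float32)
    for token in code.split():
        index = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[index] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def pair_features(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Combine two embeddings into one feature vector (assumed scheme)."""
    return np.concatenate([a, b, np.abs(a - b), a * b])


def build_dataset(pairs, labels, embed=toy_embed):
    """Turn (code_a, code_b) pairs and 0/1 labels into arrays X and y."""
    X = np.stack([pair_features(embed(left), embed(right)) for left, right in pairs])
    y = np.asarray(labels, dtype=np.int64)
    return X, y


def train_classifier(X: np.ndarray, y: np.ndarray):
    """Fit an XGBoost binary classifier; hyperparameters are illustrative."""
    from xgboost import XGBClassifier  # imported lazily; requires xgboost

    clf = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
    clf.fit(X, y)
    return clf
```

Swapping `toy_embed` for a function that returns embeddings from one of the three models leaves the rest of the sketch unchanged, which is what allows the encoders to be compared under an identical classifier setup.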

Keywords: code clone detection, transformer model, code embeddings
