Large Language Model Performance in Automatic Assessment on an Introductory Programming Course
| dc.contributor.author | Rytilahti, Juuso | |
| dc.contributor.author | Kaila, Erkki | |
| dc.contributor.author | Ingman, Valtteri | |
| dc.contributor.organization | fi=ohjelmistotekniikka|en=Software Engineering| | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.71310837563 | |
| dc.converis.publication-id | 508258622 | |
| dc.converis.url | https://research.utu.fi/converis/portal/Publication/508258622 | |
| dc.date.accessioned | 2026-04-24T20:06:40Z | |
| dc.description.abstract | <p>Large language models (LLMs) are a potential solution for reducing the significant assessment workload in large courses with numerous assignments. However, the quality of automated assessment may not match evaluation by teachers or other experts. In this paper, we examine the automated assessment of programming-related tasks in a large-scale introductory programming course. The study is structured into two parts: first, we examine how reliably LLMs can assess the tasks compared to course teachers when provided with the same rubric. Second, we examine whether a simple autonomous agent pipeline, mimicking a review board, can improve the assessment outcome. The study was conducted on a university-level introductory Python course with more than 500 students. We chose a total of four programming-related assignments from the four final weeks of the course. First, we provided the selected LLM models with the student answers accompanied by an evaluation rubric and a simple prompt, and recorded the resulting scores and feedback comments. Second, we built a pipeline of autonomous agents taking the different roles of a review board and used the student submissions as input for the pipeline, again recording the scores and comments. In the article, we discuss the feasibility and performance of the given approaches. We also provide a detailed analysis comparing the results of the two approaches with the teacher-assessed results, and discuss the differences in the results and their likely reasons. Finally, we outline potential directions for future work.<br></p> | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/59406 | |
| dc.identifier.url | https://ceur-ws.org/Vol-4181/paper02.pdf | |
| dc.identifier.urn | URN:NBN:fi-fe2026042333198 | |
| dc.language.iso | en | |
| dc.okm.affiliatedauthor | Rytilahti, Juuso | |
| dc.okm.affiliatedauthor | Kaila, Erkki | |
| dc.okm.affiliatedauthor | Ingman, Valtteri | |
| dc.okm.discipline | 113 Computer and information sciences | en_GB |
| dc.okm.discipline | 113 Tietojenkäsittely ja informaatiotieteet | fi_FI |
| dc.okm.internationalcopublication | not an international co-publication | |
| dc.okm.internationality | International publication | |
| dc.okm.type | A4 Conference Article | |
| dc.publisher.country | Germany | en_GB |
| dc.publisher.country | Saksa | fi_FI |
| dc.publisher.country-code | DE | |
| dc.relation.conference | Annual Doctoral Symposium of Computer Science | |
| dc.relation.ispartofjournal | CEUR Workshop Proceedings | |
| dc.relation.volume | 4181 | |
| dc.title | Large Language Model Performance in Automatic Assessment on an Introductory Programming Course | |
| dc.title.book | Proceedings of the Annual Doctoral Symposium of Computer Science 2025 (TKTP 2025), Helsinki, Finland, June, 2025 | |
| dc.year.issued | 2026 |