GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

dc.contributor.author: Luo, Hengyu
dc.contributor.author: Li, Zihao
dc.contributor.author: Attieh, Joseph
dc.contributor.author: Devkota, Sawal
dc.contributor.author: de Gibert, Ona
dc.contributor.author: Huang, Xu
dc.contributor.author: Ji, Shaoxiong
dc.contributor.author: Lin, Peiqin
dc.contributor.author: Mantina, Bhavani Sai Praneeth Varma
dc.contributor.author: Sreenidhi, Ananda
dc.contributor.author: Vázquez, Raúl
dc.contributor.author: Wang, Mengjie
dc.contributor.author: Yusofi, Samea
dc.contributor.author: Yuan, Fei
dc.contributor.author: Tiedemann, Jörg
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 506505289
dc.converis.url: https://research.utu.fi/converis/portal/Publication/506505289
dc.date.accessioned: 2026-01-21T14:53:11Z
dc.date.available: 2026-01-21T14:53:11Z
dc.description.abstract: <p>Large language models (LLMs) are advancing at an unprecedented pace globally, and regions are increasingly adopting these models for applications in their primary languages. Evaluating these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are inconsistent across benchmarks and disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this critical challenge of fragmented and inconsistent multilingual evaluation, we introduce GlotEval, a unified and lightweight framework that systematically integrates 27 benchmarks under a standardized ISO 639-3 language identifier system, allowing seamless incorporation of new benchmarks. Supporting nine key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, intrinsic evaluation, instruction following, and reasoning) spanning dozens to hundreds of languages, GlotEval uniquely enables language-specific, cross-benchmark analysis and non-English-centric evaluation at a scale previously impractical for many researchers. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval’s applicability for multilingual and language-specific evaluations.</p>
dc.format.pagerange: 602-614
dc.identifier.isbn: 979-8-89176-334-0
dc.identifier.olddbid: 213835
dc.identifier.oldhandle: 10024/196853
dc.identifier.uri: https://www.utupub.fi/handle/11111/56002
dc.identifier.url: https://aclanthology.org/2025.emnlp-demos.43/
dc.identifier.urn: URN:NBN:fi-fe202601216057
dc.language.iso: en
dc.okm.affiliatedauthor: Ji, Shaoxiong
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: United States (en_GB)
dc.publisher.country: Yhdysvallat (USA) (fi_FI)
dc.publisher.country-code: US
dc.relation.conference: Empirical Methods in Natural Language Processing
dc.relation.doi: 10.18653/v1/2025.emnlp-demos.43
dc.source.identifier: https://www.utupub.fi/handle/10024/196853
dc.title: GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models
dc.title.book: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing : System Demonstrations
dc.year.issued: 2025
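The abstract describes unifying 27 benchmarks under a standardized ISO 639-3 language identifier system. As an illustrative sketch only (the mapping and function below are hypothetical and not GlotEval's actual code), normalizing heterogeneous benchmark language tags to ISO 639-3 might look like this:

```python
# Hypothetical subset of an ISO 639-2/BCP 47 -> ISO 639-3 mapping,
# for demonstration only; a real system would use the full registry.
ISO_639_3 = {
    "en": "eng", "fi": "fin", "zh": "zho", "sw": "swa",
    "eng": "eng", "fin": "fin", "zho": "zho", "swa": "swa",
}

def normalize_lang(tag: str) -> str:
    """Map a two- or three-letter language tag to an ISO 639-3 code,
    stripping region subtags such as 'en-US' or 'zh_CN'."""
    base = tag.lower().split("-")[0].split("_")[0]
    try:
        return ISO_639_3[base]
    except KeyError:
        raise ValueError(f"unknown language tag: {tag!r}")

print(normalize_lang("en-US"))  # -> eng
print(normalize_lang("fin"))    # -> fin
```

Normalizing every benchmark's tags to one scheme in this way is what allows language-specific results to be compared across benchmarks that originally used different labeling conventions.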

Files

Name: 2025.emnlp-demos.43.pdf
Size: 658.1 KB
Format: Adobe Portable Document Format