Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs

dc.contributor.author: Laato, Joonatan
dc.contributor.author: Kanerva, Jenna
dc.contributor.author: Loehr, John
dc.contributor.author: Lummaa, Virpi
dc.contributor.author: Ginter, Filip
dc.contributor.organization: fi=data-analytiikka|en=Data-analytiikka|
dc.contributor.organization: fi=ekologia ja evoluutiobiologia|en=Ecology and Evolutionary Biology|
dc.contributor.organization-code: 1.2.246.10.2458963.20.20415010352
dc.contributor.organization-code: 1.2.246.10.2458963.20.68940835793
dc.converis.publication-id: 470957606
dc.converis.url: https://research.utu.fi/converis/portal/Publication/470957606
dc.date.accessioned: 2025-08-27T22:40:24Z
dc.date.available: 2025-08-27T22:40:24Z
dc.description.abstract: We perform a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated post-WWII from Finnish Eastern Karelia. Our research objective is two-fold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as a proxy variable indicating the degree of social integration of refugees in their new environment. Second, we aim to evaluate several alternative ways to approach this task, comparing a number of generative models and a supervised learning approach, to gain broader insight into the relative merits of these approaches and their applicability in similar studies. We find that the best generative model (GPT-4) is roughly on par with human performance, at an F-score of 88.8%. Interestingly, the best open generative model (Llama-3-70B-Instruct) reaches almost the same performance, at an F-score of 87.7%, demonstrating that open models are becoming a viable alternative for some practical tasks even on non-English data. Additionally, we test a supervised learning alternative, in which we fine-tune a Finnish BERT model (FinBERT) on training data generated by GPT-4. With this method, we achieve an F-score of 84.1% with as few as 6k interviews, rising to 86.3% with 30k interviews. Such an approach is particularly appealing when computational resources are limited or when there is a large volume of data to process.
dc.identifier.jour-issn: 1613-0073
dc.identifier.olddbid: 202587
dc.identifier.oldhandle: 10024/185614
dc.identifier.uri: https://www.utupub.fi/handle/11111/47705
dc.identifier.url: https://ceur-ws.org/Vol-3834/paper52.pdf
dc.identifier.urn: URN:NBN:fi-fe2025082785773
dc.language.iso: en
dc.okm.affiliatedauthor: Laato, Joonatan
dc.okm.affiliatedauthor: Kanerva, Jenna
dc.okm.affiliatedauthor: Lummaa, Virpi
dc.okm.affiliatedauthor: Ginter, Filip
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: Germany (en_GB)
dc.publisher.country: Saksa (fi_FI)
dc.publisher.country-code: DE
dc.relation.conference: Computational Humanities Research
dc.relation.ispartofjournal: CEUR Workshop Proceedings
dc.source.identifier: https://www.utupub.fi/handle/10024/185614
dc.title: Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs
dc.title.book: Proceedings of the Computational Humanities Research Conference 2024 (CHR 2024), Aarhus, Denmark, December 4-6, 202
dc.year.issued: 2024

Files

Name: paper52.pdf
Size: 587.05 KB
Format: Adobe Portable Document Format