Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs

Laato, Joonatan; Kanerva, Jenna; Loehr, John; Lummaa, Virpi; Ginter, Filip

dc.contributor.author	Laato, Joonatan
dc.contributor.author	Kanerva, Jenna
dc.contributor.author	Loehr, John
dc.contributor.author	Lummaa, Virpi
dc.contributor.author	Ginter, Filip
dc.contributor.organization	fi=data-analytiikka|en=Data Analytics|
dc.contributor.organization	fi=ekologia ja evoluutiobiologia|en=Ecology and Evolutionary Biology|
dc.contributor.organization-code	1.2.246.10.2458963.20.68940835793
dc.converis.publication-id	470957606
dc.converis.url	https://research.utu.fi/converis/portal/Publication/470957606
dc.date.accessioned	2025-08-27T22:40:24Z
dc.date.available	2025-08-27T22:40:24Z
dc.description.abstract	We performed a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated after WWII from Finnish Eastern Karelia. Our research objective is twofold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as a proxy variable indicating the degree of social integration of the refugees in their new environment. Second, we aim to evaluate several alternative approaches to this task, comparing a number of generative models and a supervised learning approach, to gain broader insight into their relative merits and their applicability in similar studies. We find that the best generative model (GPT-4) is roughly on par with human performance, at an F-score of 88.8%. Interestingly, the best open generative model (Llama-3-70B-Instruct) reaches almost the same performance, at an F-score of 87.7%, demonstrating that open models are becoming a viable alternative for some practical tasks even on non-English data. Additionally, we test a supervised learning alternative in which we fine-tune a Finnish BERT model (FinBERT) on training data generated by GPT-4. With this method, we achieved an F-score of 84.1% with as few as 6k interviews, rising to 86.3% with 30k interviews. Such an approach would be particularly appealing when computational resources are limited or there is a large volume of data to process.
dc.identifier.olddbid	202587
dc.identifier.oldhandle	10024/185614
dc.identifier.uri	https://www.utupub.fi/handle/11111/47705
dc.identifier.url	https://ceur-ws.org/Vol-3834/paper52.pdf
dc.identifier.urn	URN:NBN:fi-fe2025082785773
dc.language.iso	en
dc.okm.affiliatedauthor	Laato, Joonatan
dc.okm.affiliatedauthor	Kanerva, Jenna
dc.okm.affiliatedauthor	Lummaa, Virpi
dc.okm.affiliatedauthor	Ginter, Filip
dc.okm.discipline	113 Computer and information sciences	en_GB
dc.okm.discipline	113 Tietojenkäsittely ja informaatiotieteet	fi_FI
dc.okm.internationalcopublication	not an international co-publication
dc.okm.internationality	International publication
dc.okm.type	A4 Conference Article
dc.publisher.country	Germany	en_GB
dc.publisher.country	Saksa	fi_FI
dc.publisher.country-code	DE
dc.relation.conference	Computational Humanities Research
dc.relation.ispartofjournal	CEUR Workshop Proceedings
dc.source.identifier	https://www.utupub.fi/handle/10024/185614
dc.title	Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs
dc.title.book	Proceedings of the Computational Humanities Research Conference 2024 (CHR 2024), Aarhus, Denmark, December 4-6, 2024
dc.year.issued	2024

Files

Name: paper52.pdf
Size: 587.05 KB
Format: Adobe Portable Document Format

Collections

Self-archived publications