Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs
| dc.contributor.author | Laato, Joonatan | |
| dc.contributor.author | Kanerva, Jenna | |
| dc.contributor.author | Loehr, John | |
| dc.contributor.author | Lummaa, Virpi | |
| dc.contributor.author | Ginter, Filip | |
| dc.contributor.organization | fi=data-analytiikka; en=Data Analytics | |
| dc.contributor.organization | fi=ekologia ja evoluutiobiologia; en=Ecology and Evolutionary Biology | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.20415010352 | |
| dc.contributor.organization-code | 1.2.246.10.2458963.20.68940835793 | |
| dc.converis.publication-id | 470957606 | |
| dc.converis.url | https://research.utu.fi/converis/portal/Publication/470957606 | |
| dc.date.accessioned | 2025-08-27T22:40:24Z | |
| dc.date.available | 2025-08-27T22:40:24Z | |
| dc.description.abstract | We perform a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated after WWII from Finnish Eastern Karelia. Our research objective is twofold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as a proxy variable indicating the degree of social integration of refugees in their new environment. Second, we aim to evaluate several alternative ways to approach this task, comparing a number of generative models and a supervised learning approach, to gain broader insight into the relative merits of these approaches and their applicability in similar studies. We find that the best generative model (GPT-4) is roughly on par with human performance, at an F-score of 88.8%. Interestingly, the best open generative model (Llama-3-70B-Instruct) reaches almost the same performance, at an F-score of 87.7%, demonstrating that open models are becoming a viable alternative for some practical tasks even on non-English data. Additionally, we test a supervised learning alternative, in which we fine-tune a Finnish BERT model (FinBERT) on training data generated by GPT-4. This method achieves an F-score of 84.1% with as few as 6k interviews, rising to 86.3% with 30k interviews. Such an approach is particularly appealing when computational resources are limited or there is a large volume of data to process. | |
| dc.identifier.jour-issn | 1613-0073 | |
| dc.identifier.olddbid | 202587 | |
| dc.identifier.oldhandle | 10024/185614 | |
| dc.identifier.uri | https://www.utupub.fi/handle/11111/47705 | |
| dc.identifier.url | https://ceur-ws.org/Vol-3834/paper52.pdf | |
| dc.identifier.urn | URN:NBN:fi-fe2025082785773 | |
| dc.language.iso | en | |
| dc.okm.affiliatedauthor | Laato, Joonatan | |
| dc.okm.affiliatedauthor | Kanerva, Jenna | |
| dc.okm.affiliatedauthor | Lummaa, Virpi | |
| dc.okm.affiliatedauthor | Ginter, Filip | |
| dc.okm.discipline | 113 Computer and information sciences | en_GB |
| dc.okm.discipline | 113 Tietojenkäsittely ja informaatiotieteet | fi_FI |
| dc.okm.internationalcopublication | not an international co-publication | |
| dc.okm.internationality | International publication | |
| dc.okm.type | A4 Conference Article | |
| dc.publisher.country | Germany | en_GB |
| dc.publisher.country | Saksa | fi_FI |
| dc.publisher.country-code | DE | |
| dc.relation.conference | Computational Humanities Research | |
| dc.relation.ispartofjournal | CEUR Workshop Proceedings | |
| dc.source.identifier | https://www.utupub.fi/handle/10024/185614 | |
| dc.title | Extracting Social Connections from Finnish Karelian Refugee Interviews Using LLMs | |
| dc.title.book | Proceedings of the Computational Humanities Research Conference 2024 (CHR 2024), Aarhus, Denmark, December 4-6, 202 | |
| dc.year.issued | 2024 |