Zero-Shot Approach to Redacting Personally Identifiable Information in Medical Reports Using Large Language Models
Myntti, Amanda (2025-06-24)
This publication is subject to copyright. The work may be read and printed for personal use. Commercial use is prohibited.
Open access
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2025063076147
Abstract
The presence of personally identifiable information in large data collections poses a significant barrier to their effective use. Specifically in the health care domain, a myriad of text documents filled with sensitive information — such as clinical notes and electronic health records — are created each day. While these document collections are valuable for machine learning, whether as the target of analysis or as training data for models, their usage is profoundly limited by the information they contain.
Natural Language Processing, and specifically Large Language Models, are at the forefront of current tools best suited for sensitive data redaction. In this thesis, a language model-based zero-shot approach to redacting personally identifiable information is implemented and applied to a synthetic medical corpus. The study investigates the effectiveness of this method across clinical text data in English, Finnish, and Spanish. The results show that despite clear differences in the capabilities of tested models, the zero-shot redaction method is incapable of reliably detecting personal information in electronic health records.