Zero-Shot Approach to Redacting Personally Identifiable Information in Medical Reports Using Large Language Models

Myntti, Amanda (2025-06-24)

The publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.

The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2025063076147

Abstract

The presence of personally identifiable information in large data collections poses a significant barrier to their effective use. In the health care domain in particular, a large number of text documents containing sensitive information, such as clinical notes and electronic health records, are created each day. While these document collections are valuable for machine learning, whether as the target of analysis or as training data for models, their use is severely limited by the sensitive information they contain.

Natural language processing, and large language models in particular, are among the tools currently best suited to redacting sensitive data. In this thesis, a language-model-based zero-shot approach to redacting personally identifiable information is implemented and applied to a synthetic medical corpus. The study investigates the effectiveness of this method on clinical text data in English, Finnish, and Spanish. The results show that, despite clear differences in the capabilities of the tested models, the zero-shot redaction method cannot reliably detect personal information in electronic health records.
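
To make the idea of zero-shot redaction concrete, the following is a minimal sketch and not the thesis's actual implementation. It assumes a model that receives only an instruction and the report (no labeled examples) and returns the identifying strings it found, one per line. The prompt wording, the function names, the `[REDACTED]` mask, and the `fake_model` stand-in are all hypothetical and chosen for illustration only.

```python
from typing import Callable

# Hypothetical zero-shot instruction: no examples are given to the model.
PROMPT_TEMPLATE = (
    "Identify every piece of personally identifiable information in the "
    "medical report below. Return one item per line, copied verbatim.\n\n"
    "Report:\n{report}\n"
)


def build_prompt(report: str) -> str:
    """Insert the report text into the zero-shot instruction."""
    return PROMPT_TEMPLATE.format(report=report)


def redact(report: str, model: Callable[[str], str],
           mask: str = "[REDACTED]") -> str:
    """Mask every string that the model reports as identifying."""
    response = model(build_prompt(report))
    spans = {line.strip() for line in response.splitlines() if line.strip()}
    # Replace longer spans first so a short span cannot split a longer one.
    for span in sorted(spans, key=len, reverse=True):
        report = report.replace(span, mask)
    return report


def fake_model(prompt: str) -> str:
    """Stand-in for a real language model call (assumption, fixed output)."""
    return "Jane Doe\n1980-02-14"


if __name__ == "__main__":
    text = "Patient Jane Doe, born 1980-02-14, reports chest pain."
    print(redact(text, fake_model))
```

In this sketch, anything the model fails to list stays in the text unmasked, which is one way to picture the kind of unreliable detection the abstract reports.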

Collections

Master's theses and diploma theses, and advanced-studies theses (full texts)