Development of a Medical Question-Answering System Using Large Language Models with Retrieval Augmented Generation and Prompt Engineering

Tahir, Hammad

(25.06.2025)

View/Open

Tahir_Hammad_Ahmed_Thesis.pdf (661.2Kb)

This publication is protected by copyright. The work may be read and printed for personal use. Use for commercial purposes is prohibited.

Open access

The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2025063075869

Abstract

This thesis develops and evaluates a Medical Question-Answering System (MQAS) that combines Retrieval-Augmented Generation (RAG) with advanced prompt engineering techniques to provide accurate, explainable, and safe answers to medical questions. Healthcare professionals often struggle to quickly access reliable medical information, and existing AI systems frequently hallucinate or lack transparency in their reasoning. To address these challenges, this research implements a RAG pipeline using GPT-4o-mini, MedEmbed-small-v0.1 embeddings, and Pinecone vector storage, enhanced with DSPy-based prompt engineering. The study compares three prompt engineering approaches on the PubMedQA_instruction dataset: zero-shot Chain of Thought (CoT), few-shot CoT, and ensembled few-shot CoT. Evaluation combines quantitative RAGAS metrics with qualitative assessments from licensed physicians. Results demonstrate that few-shot CoT consistently outperforms the other approaches, achieving higher scores in answer relevancy (0.9514), faithfulness (0.7317), and answer correctness (0.6243). Clinician evaluations further support the system’s effectiveness, with particularly high ratings for explainability (0.905) and clinical accuracy (0.89). The findings suggest that structured prompting techniques, particularly few-shot CoT, can significantly enhance the performance of medical question-answering systems by improving both factual accuracy and reasoning transparency. This research contributes to the advancement of healthcare AI by demonstrating a GDPR-compliant approach to developing medical QA systems that earn clinician trust through explainable, accurate responses. Limitations include a small clinician sample size and API budget constraints. Even so, this work establishes a foundation for future research into cost-effective prompting strategies and real-world clinical implementations that could meaningfully improve healthcare information access.
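
The retrieve-then-prompt flow described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the thesis implementation: the bag-of-words `embed` function stands in for MedEmbed-small-v0.1, an in-memory list stands in for Pinecone, no language model is called, and every function name, the prompt wording, and the sample passages are hypothetical. The `majority_vote` helper shows one possible reading of "ensembled" few-shot CoT, which is also an assumption.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question (in-memory stand-in for a vector store)."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_few_shot_cot_prompt(question, contexts, demos):
    """Assemble a few-shot Chain-of-Thought prompt: instruction, worked demos, then the new question."""
    parts = [
        "Answer the medical question using only the context. "
        "Reason step by step, then state a final answer."
    ]
    for d in demos:
        parts.append(
            f"Question: {d['question']}\nReasoning: {d['reasoning']}\nAnswer: {d['answer']}"
        )
    context_block = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    parts.append(f"Context:\n{context_block}\nQuestion: {question}\nReasoning:")
    return "\n\n".join(parts)


def majority_vote(answers):
    """Pick the most frequent final answer across several sampled generations (assumed ensembling rule)."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sample data, for illustration only.
PASSAGES = [
    "Metformin lowers blood glucose in type 2 diabetes.",
    "Statins reduce LDL cholesterol levels.",
    "Aspirin inhibits platelet aggregation.",
]
DEMOS = [
    {
        "question": "Do statins reduce LDL cholesterol?",
        "reasoning": "The context states that statins reduce LDL cholesterol levels.",
        "answer": "yes",
    }
]

if __name__ == "__main__":
    question = "Does metformin lower blood glucose?"
    contexts = retrieve(question, PASSAGES, k=1)
    print(build_few_shot_cot_prompt(question, contexts, DEMOS))
```

Under this sketch, the zero-shot CoT variant corresponds to calling `build_few_shot_cot_prompt` with an empty `demos` list, and the few-shot variant adds worked demonstrations before the new question.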

Collections

Master's theses (pro gradu and diploma theses) and advanced-studies theses (full texts) [9740]