AI-Powered Retrieval of Medical Literature and Health Data with Vector Databases: Developing a Custom Search Assistant for Medical Research

Open access
This publication is subject to copyright. The work may be read and printed for personal use. Commercial use is prohibited.

Online publication

DOI

Abstract

The rapid growth of online information makes locating precise and relevant biomedical literature increasingly difficult. Researchers often spend significant time filtering useful content from large volumes of unrelated material, making the process inefficient and demanding, particularly in scientific research. This thesis presents the development of a Retrieval-Augmented Generation (RAG) system for biomedical literature retrieval. The system evaluates three models, GPT, SBERT, and BioBERT, on biosignal-related topics such as ECG, EEG, and PPG. Queries are divided into two categories, expert-formulated and layman-formulated questions, and the objective is to assess how accurately each model interprets and answers each type. In addition to literature retrieval, the system retrieves and visualizes relevant biosignal datasets from PhysioNet, helping researchers both understand studies and explore the associated data. Research papers are sourced from PubMed and stored in a database along with their embeddings for efficient access.

Both quantitative and qualitative evaluations were conducted. Quantitative performance was initially evaluated using DOI and dataset matching; because these metrics proved insufficient, embedding-based similarity, TF-IDF similarity, and keyword overlap were added to the evaluation framework. Qualitative performance was assessed using an LLM-as-Judge framework. A custom test set was created containing ground-truth research papers, expert and layman queries, and corresponding dataset URLs from PhysioNet. Results indicate that GPT achieved the highest answer quality (overall 0.88) and strongest semantic alignment (embedding composite 0.60). SBERT demonstrated nearly identical embedding performance (0.60) and the most stable lexical retrieval performance in TF-IDF (0.50). BioBERT consistently showed lower performance (embedding composite 0.30, TF-IDF 0.17, overall 0.73).
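The three quantitative retrieval metrics named above could be sketched roughly as follows. This is a minimal illustration only: the function names, the smoothed-idf formula, and the Jaccard-set reading of "keyword overlap" are assumptions for the sketch, not the thesis implementation, and embeddings are taken as precomputed vectors rather than produced by GPT, SBERT, or BioBERT.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (e.g. embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def tfidf_similarity(doc_a, doc_b, corpus):
    """TF-IDF cosine similarity; idf is smoothed so shared terms keep weight."""
    n = len(corpus)
    docs = [d.lower().split() for d in corpus]
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    vocab = sorted(df)

    def vec(text):
        toks = text.lower().split()
        c = Counter(toks)
        return [
            (c[t] / len(toks)) * (1 + math.log((1 + n) / (1 + df[t])))
            for t in vocab
        ]

    return cosine(vec(doc_a), vec(doc_b))

def keyword_overlap(doc_a, doc_b):
    """Jaccard overlap of token sets, one plausible 'keyword overlap' measure."""
    sa, sb = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

corpus = [
    "ecg arrhythmia detection with deep learning",
    "eeg sleep staging from physionet recordings",
    "ppg heart rate estimation from wearable sensors",
]
# Identical texts score 1.0; disjoint vocabularies score 0.0.
print(tfidf_similarity(corpus[0], corpus[0], corpus))
print(keyword_overlap(corpus[0], corpus[2]))
```

In a full pipeline each retrieved answer would be compared against the ground-truth paper with all three scores, so that lexical metrics (TF-IDF, keyword overlap) complement the embedding-based semantic score.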
Considering the potential evaluation bias toward GPT-family models (specifically, that GPT was used both for final answer generation and for qualitative assessment under the LLM-as-Judge framework), SBERT was selected as the most reliable standalone retrieval model. Using GPT for both generation and evaluation may introduce systematic bias, as models from the same family can exhibit alignment in reasoning patterns and stylistic preferences, potentially inflating performance estimates.
