Bachelor's Thesis Amar Ribic
Investigating the Evaluation Landscape of Medical Information Retrieval
Evaluation practices in Medical Information Retrieval (IR) and Retrieval-Augmented Generation (RAG) research appear heterogeneous, with varying datasets, metrics, and reporting standards across studies. This fragmentation complicates comparison between systems and may hinder cumulative scientific progress. A structured analysis of the evaluation landscape is therefore needed.
This thesis systematically investigates evaluation practices in Medical IR and Medical RAG research published between 2020 and 2025. Specifically, it examines which datasets and evaluation metrics are used and how datasets can be taxonomized with respect to task type, medical subdomain, data source, and availability. It tests two expectations: that most datasets are used only rarely, which would indicate a large, fragmented evaluation landscape with little reuse, and that a small subset of evaluation metrics dominates current practice rather than metric usage being broadly distributed.
A systematic scoping review will be conducted following PRISMA-ScR guidelines. Relevant literature will be identified through structured searches in PubMed, DBLP, and Scopus. After applying predefined inclusion and exclusion criteria, metadata on tasks, datasets, availability, sample size, document characteristics, and evaluation metrics will be extracted using a predefined coding scheme. The extracted data will be analyzed both quantitatively and qualitatively to test these expectations and characterize the degree of heterogeneity in current evaluation practices.
The thesis aims to provide a structured overview of the evaluation landscape in Medical IR and RAG research, offering an empirical basis for assessing how consistent and reusable current evaluation practices actually are.
Research Questions:
RQ1: Which datasets are used to evaluate Medical IR and Medical RAG systems, and does a long tail of rarely reused datasets characterize current evaluation practices?
RQ2: How can these datasets be taxonomized with respect to task type, medical subdomain, data source, and availability?
RQ3: Does a small subset of evaluation metrics dominate current practice, or is metric usage more broadly distributed across studies?
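The quantitative side of RQ1 and RQ3 could be operationalized roughly as in the following minimal sketch. It assumes each included study has been coded as a record with a list of datasets and a list of metrics. The record layout, the function names (`usage_counts`, `single_use_share`, `top_k_share`), and the toy records are hypothetical illustrations, not the thesis's actual coding scheme or data.

```python
from collections import Counter


def usage_counts(studies, field):
    """Count in how many studies each item (dataset or metric) appears.

    An item is counted at most once per study, even if listed twice.
    """
    counts = Counter()
    for study in studies:
        counts.update(set(study[field]))
    return counts


def single_use_share(counts):
    """Fraction of distinct items that appear in exactly one study (RQ1)."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def top_k_share(counts, k):
    """Fraction of all usages covered by the k most frequent items (RQ3)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


# Hypothetical coded records, for illustration only.
studies = [
    {"id": "S1", "datasets": ["dataset_a", "dataset_b"], "metrics": ["nDCG@10", "Recall@10"]},
    {"id": "S2", "datasets": ["dataset_a"], "metrics": ["nDCG@10", "MRR"]},
    {"id": "S3", "datasets": ["dataset_c"], "metrics": ["nDCG@10"]},
    {"id": "S4", "datasets": ["dataset_d", "dataset_a"], "metrics": ["Accuracy", "nDCG@10"]},
]

dataset_counts = usage_counts(studies, "datasets")
metric_counts = usage_counts(studies, "metrics")

print(f"Share of datasets used by a single study: {single_use_share(dataset_counts):.2f}")
print(f"Share of metric usages covered by the top metric: {top_k_share(metric_counts, 1):.2f}")
```

A high single-use share would be consistent with a long tail of rarely reused datasets, while a high top-k share for small k would suggest that a few metrics dominate. Which thresholds count as "high" is a design choice that this sketch leaves open.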
| Attribute | Value |
|---|---|
| Title (de) | |
| Title (en) | Investigating the Evaluation Landscape of Medical Information Retrieval |
| Project | |
| Type | Bachelor's Thesis |
| Status | started |
| Student | Amar Ribic |
| Advisor | Fabian Karl |
| Supervisor | Prof. Dr. Florian Matthes |
| Start Date | 13.04.2026 |
| Sebis Contributor Agreement signed on | 08.04.2026 |
| Checklist filled | Yes |
| Submission date | 13.08.2026 |