FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain
Large Language Models tend to struggle when dealing with specialized domains. While all aspects of evaluation hold importance, factuality is the most critical one. At the same time, reliable fact-checking tools and data sources are essential for hallucination mitigation. We address these issues by providing a comprehensive fact-checking benchmark, FActBench, covering four generation tasks and six state-of-the-art Large Language Models (LLMs) for the medical domain. We use two state-of-the-art fact-checking techniques: Chain-of-Thought (CoT) prompting and Natural Language Inference (NLI). Our experiments show that the fact-checking scores acquired through the unanimous voting of both techniques correlate best with domain-expert evaluation.
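The unanimous-voting aggregation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, verdict labels, and the fallback label for disagreement are all assumptions for the sake of the example.

```python
def unanimous_vote(cot_verdict: str, nli_verdict: str) -> str:
    """Combine the verdicts of two fact-checking techniques.

    A claim receives a final label only when both the CoT-prompting
    verdict and the NLI verdict agree; otherwise it is flagged as
    unresolved (label names are illustrative assumptions).
    """
    if cot_verdict == nli_verdict:
        return cot_verdict
    return "unresolved"


# Example usage: aggregate per-claim verdicts from both techniques.
claims = [
    ("supported", "supported"),      # both agree -> "supported"
    ("supported", "refuted"),        # disagreement -> "unresolved"
    ("refuted", "refuted"),          # both agree -> "refuted"
]
final_labels = [unanimous_vote(c, n) for c, n in claims]
```

Only claims on which both techniques agree contribute a definitive label; the paper reports that scores derived this way correlate best with domain-expert evaluation.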
| Attribute | Value |
|---|---|
| Address | Odense, Denmark |
| Authors | Anum Afzal, Juraj Vladika |
| Citation | Anum Afzal, Juraj Vladika, and Florian Matthes. 2025. FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain. In Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025), pages 93–102, Southern Denmark University, Odense, Denmark. Association for Computational Linguistics. |
| Key | Af25b |
| Research project | |
| Title | FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain |
| Type of publication | Conference |
| Year | 2025 |
| Team members | Anum Afzal, Juraj Vladika |
| Publication URL | https://aclanthology.org/2025.icnlsp-1.11.pdf |
| Project | |
| Acronym | ICNLSP |