
ThReadMed-QA Benchmark Reveals Limitations of LLMs

Global AI Watch · Editorial Team · 3 min read · arXiv cs.CL (NLP/LLMs)

Key Points

  • New multi-turn benchmark for medical dialogues introduced.
  • Highlights degradation in model accuracy across conversation turns.
  • Impacts AI's reliability in critical healthcare information.

The ThReadMed-QA benchmark was introduced to evaluate medical question answering in the multi-turn dialogues that characterize real patient-physician interactions. It comprises 2,437 conversation threads containing 8,204 question-answer pairs sourced from the r/AskDocs platform, and it is used to assess advanced large language models (LLMs) such as GPT-5 and Claude Haiku. Even the best-performing state-of-the-art models achieved only 41.2% accuracy in providing fully correct answers to the more complex multi-turn questions.

The implications of these findings are significant for the deployment of LLMs in medical contexts. The benchmark underscores the need for models to maintain accuracy over iterative exchanges, a critical requirement in healthcare settings. The high rate of error escalation, as captured by the benchmark's two new metrics, the Conversational Consistency Score (CCS) and the Error Propagation Rate (EPR), makes it difficult to rely on these AI systems for trustworthy medical advice and suggests a risk to patient care if they are used without caution.
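The article does not give formal definitions of CCS or EPR, so the sketch below is only an illustration of the general idea: grading each turn of a thread as correct or incorrect, measuring how accuracy changes with turn depth, and measuring how often an early error is followed by further errors. All function names, data shapes, and metric definitions here are assumptions for illustration, not the paper's actual methodology.

```python
from typing import List

def per_turn_accuracy(threads: List[List[bool]]) -> List[float]:
    """Fraction of fully correct answers at each turn index, across threads.

    `threads` is a list of conversation threads; each thread is a list of
    booleans marking whether the model's answer at that turn was fully correct.
    (Hypothetical representation; the benchmark's real grading scheme may differ.)
    """
    max_turns = max(len(t) for t in threads)
    accuracy = []
    for turn in range(max_turns):
        graded = [t[turn] for t in threads if len(t) > turn]
        accuracy.append(sum(graded) / len(graded))
    return accuracy

def error_propagation_rate(threads: List[List[bool]]) -> float:
    """Stand-in for an 'error propagation' style metric.

    Of all turns that come after at least one incorrect answer in the same
    thread, return the fraction that are themselves incorrect.
    """
    after_error, propagated = 0, 0
    for thread in threads:
        seen_error = False
        for correct in thread:
            if seen_error:
                after_error += 1
                if not correct:
                    propagated += 1
            if not correct:
                seen_error = True
    return propagated / after_error if after_error else 0.0

# Example: three short threads with per-turn correctness flags.
threads = [
    [True, True, False, False],   # error at turn 3 persists at turn 4
    [True, False, True],          # model recovers after an error
    [False, False, False],        # compounding errors from the first turn
]
print(per_turn_accuracy(threads))       # accuracy tends to fall with turn index
print(error_propagation_rate(threads))  # share of post-error turns also wrong
```

Under these assumed definitions, a falling per-turn accuracy curve and a high propagation rate would correspond to the degradation across conversation turns that the benchmark reports.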

Source: arXiv cs.CL (NLP/LLMs)
