MedEvidence
Can Large Language Models Match the Conclusions of Systematic Reviews?
Systematic reviews (SRs), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone of evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, and medical-specialist models across a range of sizes (from 7B to 671B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. The visualization at the top of this page highlights the key challenges of our dataset.
Figure 1: MedEvidence curation pipeline. (1) We extract systematic reviews conducted by Cochrane that are indexed in PubMed, filtering for only those reviews where all sourced studies are indexed in PubMed (with at least an abstract available). (2) Human annotators review the SR abstract and examine the "Main Results" subsection for statements that statistically compare an intervention with a control group. Conclusions are converted into question-answer pairs. (3) Human annotators parse the "Analysis" section of the SR's appendix to identify the cited studies relevant to the conclusion. (4) Annotators verify that the available content from the sourced studies (abstract or full text) provides sufficient numerical or statistical detail to answer the question.
Dataset Description
MedEvidence Statistics: MedEvidence contains a total of 284 systematically human-curated questions derived from 100 systematic reviews with 329 referenced individual articles, of which 114 have full text available. The benchmark covers topics from 10 medical specialties (e.g., public health, surgery, family medicine); five different outcome effects (higher, lower, no difference, uncertain effect, insufficient data); and three broad levels of concordance between the source papers and the correct answer (full agreement, no agreement, mixed agreement). The distributions of these key characteristics are shown in Figure 2 below.
Figure 2: Key statistical characteristics of the questions in MedEvidence. (a) shows the dataset distribution stratified by medical specialty. (b) presents the distribution stratified by outcome effect. (c) shows the distribution stratified by source concordance with the expert-assessed treatment outcome effect (i.e. the correct answer).
Data Format: MedEvidence is grouped by question; each question includes core data for evaluation, metadata, and content details for the relevant sources. The core data consists of: a human-generated question of the form "Is [quantity of medical outcome] higher, lower, or the same when comparing [intervention] to [control]?"; the taxonomized answer to the question (higher, lower, no difference, uncertain effect, insufficient data); and the list of relevant studies (sources) used by the review authors to perform the analysis, identified by their unique PubMed IDs. We additionally provide the following metadata: the systematic review from which the question was extracted; the publication year of the systematic review; the authors' confidence in their analysis, also referred to as the 'evidence certainty' (high, moderate, low, very low, or n/a if not provided); a Boolean indicating whether full text is available and needed to answer the question; the exact fractional source concordance; and the medical specialty associated with the question. Separately, for each source, we provide the unique PubMed ID, title, publication date if available, and content (full text if available, abstract otherwise).
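For concreteness, a single record could look roughly like the sketch below. The field names are illustrative placeholders rather than the exact keys of the released files.

```python
# Illustrative sketch of one MedEvidence question record.
# Field names are hypothetical; see the released dataset for the exact schema.
example_record = {
    # Core data
    "question": (
        "Is [quantity of medical outcome] higher, lower, or the same "
        "when comparing [intervention] to [control]?"
    ),
    "answer": "no difference",         # higher | lower | no difference | uncertain effect | insufficient data
    "source_pmids": ["12345678", "23456789"],
    # Metadata
    "review_pmid": "34567890",         # systematic review the question was extracted from
    "review_year": 2021,
    "evidence_certainty": "moderate",  # high | moderate | low | very low | n/a
    "full_text_needed": False,         # whether full text is available and needed
    "source_concordance": 0.5,         # fraction of sources agreeing with the answer
    "specialty": "family medicine",
    # Per-source content
    "sources": [
        {
            "pmid": "12345678",
            "title": "Example trial of intervention vs. control",
            "date": "2015-03",
            "content": "<full text if available, abstract otherwise>",
        },
    ],
}
```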
Benchmarking LLM Performance
LLM Selection: We selected 24 LLMs across different configurations, including a variety of sizes (from 7B to 671B), reasoning and non-reasoning capabilities, commercial and non-commercial licensing, and medical fine-tuning. This selection includes GPT-o1, DeepSeek R1, OpenThinker2, GPT-4.1, Qwen3, Llama 4, HuatuoGPT-o1, OpenBioLLM, and more.
Prompting Setup: We evaluated all models in a zero-shot setting, prompting them to first provide a rationale for their answer, followed by an 'answer' field containing only one option from the list of five valid treatment outcome effects (higher, lower, no difference, uncertain effect, or insufficient data). We then utilized two setups. First, to assess the models' "natural" behavior, we used a basic setup, providing minimal guidance in the prompt beyond specifying the required response format, and supplying the abstracts or full text of the relevant studies as context. However, LLMs may not natively understand how to handle multiple levels of evidence, which can lead to unfair evaluations. Thus, in the second setup, we used an expert-guided prompt that instructs the LLM to summarize the study design and study population, and to assign a grade of evidence based on established definitions of grades of recommendation. For both cases, if the input exceeded the LLM's context window, we used multi-step refinement (via LangChain's RefineDocumentsChain) to iteratively refine the answer based on a sequence of article chunks.
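As a rough illustration of the basic zero-shot setup, the sketch below assembles a prompt from a question and its source studies. The template wording is a simplified stand-in, not the exact prompt used in our experiments; the expert-guided setup additionally asks for study-design and population summaries plus a grade of evidence.

```python
# Minimal sketch of the basic zero-shot prompt; the wording is illustrative,
# not the exact prompt used in our experiments.
OPTIONS = ["higher", "lower", "no difference", "uncertain effect", "insufficient data"]

BASIC_PROMPT = """You are given the following studies:

{study_contexts}

Question: {question}

First provide a rationale for your answer, then end with a line of the form
answer: <one of: {options}>
"""


def build_basic_prompt(question: str, studies: list[str]) -> str:
    """Assemble the zero-shot prompt from a question and its source study texts."""
    contexts = "\n\n".join(f"Study {i + 1}:\n{text}" for i, text in enumerate(studies))
    return BASIC_PROMPT.format(
        study_contexts=contexts,
        question=question,
        options=", ".join(OPTIONS),
    )
```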
LLM Evaluation: Model performance was evaluated using accuracy based on an exact match between the answer field and the ground truth. Model outputs were lower-cased and stripped of whitespace before comparison. If no 'answer' field was provided, or if its content was not an exact rule-based match with the correct answer, the output was deemed incorrect. Confidence intervals (CIs) were calculated via bootstrap (95%, N=1000).
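A minimal sketch of this scoring procedure, assuming a simple list of per-question outputs, is shown below; the actual evaluation code may differ in detail.

```python
import random

VALID_ANSWERS = {"higher", "lower", "no difference", "uncertain effect", "insufficient data"}


def is_correct(model_answer: str | None, ground_truth: str) -> bool:
    """Exact rule-based match after lower-casing and stripping whitespace.
    A missing or invalid 'answer' field counts as incorrect."""
    if model_answer is None:
        return False
    normalized = model_answer.strip().lower()
    return normalized in VALID_ANSWERS and normalized == ground_truth.strip().lower()


def bootstrap_accuracy_ci(correct: list[bool], n_boot: int = 1000, alpha: float = 0.05):
    """Accuracy with a percentile bootstrap confidence interval (95% by default)."""
    accuracy = sum(correct) / len(correct)
    resampled = sorted(
        sum(random.choices(correct, k=len(correct))) / len(correct) for _ in range(n_boot)
    )
    lower = resampled[int((alpha / 2) * n_boot)]
    upper = resampled[int((1 - alpha / 2) * n_boot) - 1]
    return accuracy, (lower, upper)
```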
Figure 3: (a) Average model accuracy (and 95% CI) on MedEvidence, overlaid on the percentage of questions where the model provided valid output (additional details in Appendix Section E). (b) Average recall by ground-truth treatment outcome effect, aggregated across all models (with overall 95% CI).
Figure 4: (a) Accuracy as a function of evidence certainty shows a monotonically increasing trend. (b) Accuracy as a function of source concordance, defined as the percentage of relevant sources that agree with the final systematic review (SR) answer, also exhibits a monotonically increasing trend.
Findings
As shown in Figure 3(a), even frontier models such as DeepSeek V3 and GPT-4.1 achieve relatively low average accuracies of 62.40% (95% CI: 56.35, 68.45) and 60.40% (95% CI: 54.30, 66.50), respectively, far from saturating our benchmark. We identify four key factors that influence model performance on our benchmark: (1) token length, (2) dependence on the treatment outcome effect, (3) inability to assess the quality of evidence, and (4) lack of skepticism toward low-quality findings. Additionally, we found that (5) medical fine-tuning does not improve performance, and (6) model size shows diminishing returns beyond 70 billion parameters. Here, we highlight two of these key findings:
Model performance is dependent on treatment outcome effect. Figure 3(b) shows the per-class recall stratified by treatment outcome effect. Overall, all models perform best on questions where the correct answer corresponds to higher or lower effects—cases where a strong stance can be taken. They are slightly less successful on no difference and insufficient data questions, where a definitive conclusion is available but there is no clear preference for either treatment. Performance is lowest on the most ambiguous class, uncertain effect. Notably, we find that models are generally reluctant to express uncertainty, often committing to a more certain outcome that appears plausible instead.
Model performance improves with increasing levels of evidence. We leverage the evidence certainty levels reported by experts in each systematic review (SR). As shown in Figure 4(a), the overall ability of models to match SR conclusions improves as the level of evidence increases. We therefore explore whether model performance is also associated with the level of source concordance. As shown in Figure 4(b), models' ability to match human conclusions increases as the proportion of sources agreeing with the correct answer increases (e.g., DeepSeek V3 achieves 92.45% accuracy at 100% source agreement vs. 41.21% at 0% source agreement). This suggests that, unlike human experts, current LLMs struggle to critically evaluate the quality of evidence and to remain skeptical of results. We observe that this behavior persists even when models are prompted (using the expert-guided prompt) to consider study design, population, and level of evidence.
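For reference, source concordance and concordance-stratified accuracy can be computed along the following lines; this is a simplified sketch with hypothetical field names, not the analysis code behind Figure 4(b).

```python
def source_concordance(source_answers: list[str], sr_answer: str) -> float:
    """Fraction of relevant sources whose individual conclusion agrees with the SR answer."""
    return sum(answer == sr_answer for answer in source_answers) / len(source_answers)


def accuracy_by_concordance(records: list[dict], n_bins: int = 5) -> dict[int, float]:
    """Bin questions by source concordance and report model accuracy per bin.
    Each record is assumed to carry 'concordance' (0-1) and 'correct' (bool) fields."""
    bins: dict[int, list[bool]] = {i: [] for i in range(n_bins)}
    for record in records:
        index = min(int(record["concordance"] * n_bins), n_bins - 1)
        bins[index].append(record["correct"])
    return {i: sum(flags) / len(flags) for i, flags in bins.items() if flags}
```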
Conclusion
Benchmarks drive advancements by providing a standard to measure progress and enabling researchers to identify weaknesses in current approaches. While LLMs are already deployed for scientific synthesis, their failure modes remain insufficiently characterized. In this work, we present MedEvidence, a benchmark derived from gold-standard medical systematic reviews. We use MedEvidence to characterize the performance of 24 LLMs and find that, unlike human experts, LLMs struggle with uncertain evidence and fail to exhibit skepticism toward studies with design flaws. Consequently, given the same studies, frontier LLMs fail to match the conclusions of systematic reviews in at least 37% of evaluated cases. We release MedEvidence to enable researchers to track progress.
Citation
@misc{polzak2025largelanguagemodelsmatch,
      title={Can Large Language Models Match the Conclusions of Systematic Reviews?},
      author={Christopher Polzak and Alejandro Lozano and Min Woo Sun and James Burgess and Yuhui Zhang and Kevin Wu and Serena Yeung-Levy},
      year={2025},
      eprint={2505.22787},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.22787},
}