Without a doubt, LLMs represent a potential paradigm shift for clinical medicine. Yet, despite the excitement surrounding these tools, there is limited empirical research comparing them. LLMs are not without their limitations, prompting critical considerations: How accurate are LLMs when faced with real-life clinical scenarios? Which clinical citations, if any, are they using to generate their answers? Is it safe to use LLMs despite documented issues like racial and gender biases? And if we deem their use appropriate in certain contexts, how do we pick the best LLM for the job?
LLMs are not monolithic; they possess distinct capabilities and weaknesses and are trained on different datasets. In our research, we have found a nearly 4x difference in pairwise win rates between top-performing and bottom-performing LLMs. Because of this, LLMs should not be treated as ‘one-size-fits-all’ solutions. Instead, different models may be more appropriate for different types of questions. For instance, Gemini Flash Thinking 2.0 delivers the most accurate summarization of diagnostic and management recommendations for cardiology questions, while GPT-4o excels at neuroanatomy.
Another challenge clinicians face is that, although LLMs have improved in accuracy over the past decade, they are still not 100% accurate in every case.
Systematic evaluation platforms are beginning to emerge to aid this process. One platform, SourceCheckup, evaluates whether the sources cited by LLMs exist, and whether they support the LLM’s response. Another platform, MedArena.ai, developed by Stanford University researchers (full disclosure, I am one of them), offers a free framework to directly compare responses from various LLMs (including models from OpenAI, Google, Meta, Perplexity, and Anthropic) on user-defined clinical tasks.
The goal of such evaluation is to bring empirical rigor to the practice of medicine with LLMs; in effect, to run a clinical trial of LLM efficacy. These insights will allow researchers to identify the best use cases for each model and, ultimately, to build an application layer that automatically selects the right tool for the clinical task at hand.
We are experiencing one of the most dynamic times in clinical medicine. As LLMs continue to evolve, they will likely become further embedded in our practice. Proactive engagement and critical evaluation by the clinical community itself will be essential to ensuring that these powerful tools are harnessed effectively and responsibly for the benefit of our patients.
Which LLMs do you find most helpful for your specialty? Share in the comments.
Angela Zhang, MD, PhD is an internal medicine resident at UCSF and a graduate of the Stanford MSTP MD-PhD program.