PITTSBURGH - Using reasoning large language models (LLMs) for radiology report labeling may raise your hospital’s electricity bill with little to no improvement in accuracy, suggest findings presented June 10 at the Society for Imaging Informatics in Medicine (SIIM) annual meeting.
In his talk, Dharmam Savani, a software engineer from St. Jude Children’s Research Hospital in Indianapolis, IN, presented his team’s results showing that LLMs with reasoning capabilities do not improve chest x-ray report labeling accuracy. However, they use significantly more energy than standard LLMs.
Dharmam Savani presents his team's findings at SIIM 2026 showing that reasoning LLMs may not meaningfully improve radiology report labeling over standard LLMs, but do use more energy.
“Reasoning LLMs do not improve accuracy for chest x-ray labeling,” Savani said. “They just burn more energy at scale.”
Reasoning LLMs are being explored for their step-by-step reasoning. In this “thinking out loud” approach, the model works through intermediate steps before it generates a response, with the goal of reducing hallucinations.
However, Savani noted a lack of data on whether reasoning LLMs can improve radiology report labeling. In 2024, he and colleagues published research showing that smaller fine-tuned LLMs are more sustainable than larger general-purpose LLMs.
Savani's team studied the tradeoffs between labeling accuracy and energy use for reasoning and standard LLMs.
The researchers used 3,660 de-identified chest x-ray reports from the Indiana University dataset. They also used radiologist-annotated labels for 13 diseases as reference standards.
The team used four reasoning LLMs from the DeepSeek-R1 series: 1.5B, 7B, 14B, and 32B. It compared the performance of these models with that of size-matched standard LLMs from the Qwen 2.5 series and with the CheXpert rule-based labeler.
The standard LLMs consistently matched or outperformed the reasoning LLMs across most metrics, Savani said. He also said the reasoning LLMs showed only about a 1.5% relative macro-F1 gain at the 14B scale while needing two to three times more energy than the standard LLMs.
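For readers unfamiliar with the metric, the sketch below shows how a macro-F1 score and a relative gain are typically computed. The per-label counts are hypothetical and are not taken from the study, and the function names are illustrative only. Macro-F1 is the unweighted mean of the per-label F1 scores, so each of the labeled diseases counts equally regardless of how common it is.

```python
def f1(tp, fp, fn):
    """F1 for one label from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-label F1 scores."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in counts) / len(counts)


def relative_gain(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Hypothetical (tp, fp, fn) counts for three labels; NOT data from the study.
standard_counts = [(80, 20, 20), (30, 10, 30), (50, 25, 25)]
reasoning_counts = [(82, 20, 18), (30, 10, 30), (50, 25, 25)]

standard_score = macro_f1(standard_counts)
reasoning_score = macro_f1(reasoning_counts)
print(f"standard macro-F1:  {standard_score:.3f}")
print(f"reasoning macro-F1: {reasoning_score:.3f}")
print(f"relative gain:      {relative_gain(reasoning_score, standard_score):.2f}%")
```

As a rough sense of scale, and assuming a baseline macro-F1 of 0.68 purely for illustration, a relative gain of about 1.5% corresponds to an absolute improvement of roughly 0.01.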
The standard LLMs mostly outperformed CheXpert (F1 = 0.55), with F1 scores of 0.47 (1.5B), 0.66 (7B), 0.68 (14B), and 0.80 (32B).
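To make "mostly" concrete, the short sketch below subtracts the CheXpert F1 from each reported standard-LLM F1. The scores are the ones quoted above; the variable and function names are illustrative assumptions, not anything from the study.

```python
# F1 scores as reported in the talk.
CHEXPERT_F1 = 0.55
standard_f1 = {"1.5B": 0.47, "7B": 0.66, "14B": 0.68, "32B": 0.80}


def margin_over_baseline(scores, baseline):
    """Absolute F1 difference of each model size versus the baseline."""
    return {size: round(score - baseline, 2) for size, score in scores.items()}


margins = margin_over_baseline(standard_f1, CHEXPERT_F1)
beats_baseline = [size for size, margin in margins.items() if margin > 0]

for size, margin in margins.items():
    print(f"{size}: {margin:+.2f}")
print("Sizes above CheXpert:", beats_baseline)
```

By these figures, the three larger standard models score above the rule-based labeler, while the 1.5B model falls below it.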
Finally, Savani reported that the standard LLMs were up to four times more efficient than rule-based tools and up to six times more efficient than reasoning models.
He said that the simplicity of labeling tasks in medical imaging could explain how the LLMs performed relative to one another in the study.
“It’s not similar to the mathematical problems that are really complex,” Savani told AuntMinnie. “It’s very simple; we don’t need to spend a lot of energy on top of that.”
Savani said the team is interested in studying how reasoning models perform in different tasks and with larger datasets.

