SIIM: LLMs can reliably perform numerical tasks in radiology

PITTSBURGH – Large language models (LLMs) can be relied upon for radiologic numerical tasks, according to research presented June 10 at the Society for Imaging Informatics in Medicine (SIIM) annual meeting. 

In his presentation, Ali Nowroozi, MD, from the University of California, San Francisco, showed how LLMs performed on numerical extraction and judgment tasks, with some models consistently achieving accuracies above 95%.

“Reasoning models complete radiological numerical tasks without mathematical errors, and non-mathematical errors are generally more common in all models,” Nowroozi said. 

While LLMs are able to reliably perform text-based tasks, they struggle with numerical tasks like mathematics. Nowroozi said there is a lack of data on how these models perform on numerical tasks within radiology. 

Ali Nowroozi, MD, presents his findings at SIIM 2026 on how LLMs perform in numerical tasks within radiology.

He and fellow researchers tested the performance of several LLMs on radiology numerical tasks, including the following: Llama 3.1 8B, DeepSeek R1-distilled Llama 8B, OpenAI o1-mini, and OpenAI GPT 5-mini. 

The team defined six tasks for the models: three for extraction and three for judgment. The extraction tasks were the minimum T-score from DEXA reports, the maximum common bile duct diameter from ultrasound reports, and the maximum lung nodule size from CT reports. The judgment tasks were whether a PET report described a highly hypermetabolic region, whether a patient was osteoporotic based on a DEXA report, and whether a patient had a dilated common bile duct based on an ultrasound report.
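To make the distinction between an extraction task and a judgment task concrete, here is a minimal rule-based sketch of the DEXA pair. It is not the researchers' method or prompt. The report text, function names, and regular expression are hypothetical, and the cutoff is an assumption based on the widely used WHO densitometric criterion (T-score of -2.5 or lower), since the presentation's exact threshold was not reported here.

```python
import re
from typing import Optional

# Assumption: WHO densitometric criterion for osteoporosis (T-score <= -2.5).
# The threshold actually used in the study is not stated in this article.
OSTEOPOROSIS_T_SCORE = -2.5

# Hypothetical pattern: "T-score", an optional connector, then a signed decimal.
T_SCORE_PATTERN = re.compile(
    r"T-score\s*(?:of|is|:|=)?\s*([+-]?\d+(?:\.\d+)?)", re.IGNORECASE
)


def extract_min_t_score(report: str) -> Optional[float]:
    """Extraction task: return the lowest T-score mentioned, or None if absent."""
    # Normalize the Unicode minus sign so float() can parse the value.
    normalized = report.replace("\u2212", "-")
    scores = [float(value) for value in T_SCORE_PATTERN.findall(normalized)]
    return min(scores) if scores else None


def is_osteoporotic(report: str) -> Optional[bool]:
    """Judgment task: compare the lowest T-score against the assumed cutoff."""
    lowest = extract_min_t_score(report)
    if lowest is None:
        return None  # No T-score found, so no judgment can be made.
    return lowest <= OSTEOPOROSIS_T_SCORE


# Hypothetical report text, written for illustration only.
SAMPLE_REPORT = (
    "Lumbar spine (L1-L4): T-score -2.7. "
    "Left femoral neck: T-score -1.9. "
    "Total hip: T-score: -2.1."
)

if __name__ == "__main__":
    print(extract_min_t_score(SAMPLE_REPORT))
    print(is_osteoporotic(SAMPLE_REPORT))
```

The judgment step requires both pulling the right number out of the text and applying medical knowledge (the cutoff) to it, which is where the knowledge-based errors described in the talk can arise even when the arithmetic itself is correct.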

Nowroozi reported the following findings: 

  • While Llama showed variable performance in the extraction tasks (accuracy range, 86% to 98.7%), the other models consistently achieved accuracies above 95%. 

  • GPT 5-mini had the highest minimum (“lowest”) accuracy across the judgment tasks, compared with o1-mini (91.7%), DeepSeek-distilled Llama (91.7%), and Llama (62%).

  • The o1-mini and GPT 5-mini models achieved perfect accuracy in detecting osteoporosis. These models also did not commit any mathematical errors. 

  • Answer-only output formats reduced the performance of Llama and DeepSeek-distilled Llama, but not that of o1-mini or GPT 5-mini.

Nowroozi said that simpler models not based on reinforcement learning could achieve acceptable performance depending on the task. 

And while state-of-the-art models did not show obvious mathematical errors in the study, Nowroozi cautioned that these models could still hallucinate or perform tasks inaccurately, with medical knowledge-based errors being common. 

Future directions include having more raters for subjective evaluation of errors and extending this approach to more difficult tasks. 

Check out AuntMinnie’s full coverage of SIIM 2026 here.
