LLM pipeline could improve imaging workflows

Jun 16, 2026

A zero-shot large language model (LLM) pipeline could accurately generate structured oncologic histories, though performance depends on which LLM is used, according to research published June 16 in Radiology.

A team led by Karan Jani, MD, from the Mallinckrodt Institute of Radiology in St. Louis, found that an LLM pipeline using GPT-5-mini achieved the best performance in overall completeness of gathering oncologic history for patients while also leading to cost savings per radiologist.

“Our solution provides a flexible LLM pipeline that efficiently extracts oncologic summaries without manual fine-tuning or information retrieval,” the Jani team wrote.

Radiologists continue to explore the potential of LLMs to improve their workflows in several ways. These include structured reporting, impression generation, image protocoling, incidental follow-up, and summarizing clinical history.

Radiologists rely on clinical histories during image interpretation. But retrieving these details from electronic health records (EHRs) can be time-intensive and inefficient. Previous LLM studies used manual prompt tuning, which can interrupt clinical workflows.

Jani and colleagues studied whether a zero-shot approach could help in this area. Here, LLMs complete tasks without further training or access to specific data examples.

The team collected retrospective EHR data for thoracic oncology patients receiving treatment between 2018 and 2024. Twenty surveyed radiologists selected 10 clinical parameters for summarization.

From there, the team developed a retrieval-augmented generation pipeline to filter clinical data. The researchers tested versions of the pipeline built on three LLMs: GPT-4o mini, o3-mini, and GPT-5-mini (all OpenAI). These models produced structured summaries using a defined data schema and a zero-shot prompt approach.
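The filter-then-prompt flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's code: the `notes` table, the two schema keys, and the helper names are hypothetical, SQLite stands in for whatever data store the team used, and the model call is replaced by a stub function so the example runs offline.

```python
import json
import sqlite3

# Hypothetical schema covering two clinical parameters; the study used 10.
SCHEMA = {
    "sites_of_metastatic_disease": "list of strings",
    "treatment_history": "list of strings",
}


def make_demo_db():
    """Build a tiny in-memory stand-in for an EHR notes table (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (mrn TEXT, note_date TEXT, note_text TEXT)")
    conn.executemany(
        "INSERT INTO notes VALUES (?, ?, ?)",
        [
            ("A1", "2021-03-02", "Lung adenocarcinoma with liver metastasis."),
            ("A1", "2021-04-10", "Started carboplatin and pemetrexed."),
            ("B2", "2022-01-05", "Unrelated patient."),
        ],
    )
    return conn


def fetch_context(conn, mrn):
    """Filter step: retrieve only the given patient's notes, oldest first."""
    rows = conn.execute(
        "SELECT note_date, note_text FROM notes WHERE mrn = ? ORDER BY note_date",
        (mrn,),
    ).fetchall()
    return "\n".join(f"[{date}] {text}" for date, text in rows)


def build_prompt(context, schema):
    """Zero-shot prompt: instructions and schema only, no worked examples."""
    return (
        "Summarize the oncologic history from the notes below.\n"
        "Return JSON with exactly these keys: " + json.dumps(schema) + "\n"
        "Use null when the notes do not state a value.\n\n"
        "Notes:\n" + context
    )


def summarize(conn, mrn, llm):
    """Run filter -> prompt -> model call, then check the keys against the schema."""
    prompt = build_prompt(fetch_context(conn, mrn), SCHEMA)
    summary = json.loads(llm(prompt))
    missing = set(SCHEMA) - set(summary)
    if missing:
        raise ValueError(f"summary is missing keys: {sorted(missing)}")
    return summary


def fake_llm(prompt):
    """Stub standing in for a real model call; returns a fixed JSON string."""
    return json.dumps(
        {
            "sites_of_metastatic_disease": ["liver"],
            "treatment_history": ["carboplatin", "pemetrexed"],
        }
    )


if __name__ == "__main__":
    print(summarize(make_demo_db(), "A1", fake_llm))
```

Passing the model in as a plain callable keeps the sketch independent of any particular vendor API, which is one way to swap models in and out for comparison.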

Image shows an application pipeline with example output using retrieval-augmented generation to filter clinical data for processing by an LLM. For a given medical record number, data filters retrieve relevant context via structured query language (SQL), and an LLM is prompted to generate a structured summary according to the data schema provided via a secure application programming interface (API). Clinical parameters, including sites of metastatic disease and treatment history, are displayed in a user interface as shown here. Image credit: RSNA

The researchers also manually evaluated summaries against retrieved clinical text. They based their completeness reference standard on the output of GPT-5-mini. Finally, they timed two radiology residents summarizing the clinical parameters, which set a 240-second benchmark.

From a dataset of 2,433 patients, the researchers randomly selected 50 patients for LLM summarization.

While GPT-5-mini achieved the highest average completeness score, GPT-4o mini had the fastest processing time per summary. The accuracy of o3-mini was similar to that of GPT-5-mini.

Comparison of LLM pipeline performance using different LLMs
Measure	GPT-4o mini	o3-mini	GPT-5-mini
Accuracy	93.3%	97.3%	97.9%
Completeness	62%	66.7%	95.5%
Cost per summary	$0.004	$0.154	$0.030
Time per summary	12.7 seconds	404 seconds	221 seconds
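For a rough sense of scale, the table's per-summary times can be set against the 240-second resident benchmark. The short Python sketch below does that arithmetic and scales the per-summary cost to a hypothetical batch of 1,000 summaries. The direct subtraction is an assumption made here for illustration; the study's own projection may use a different method, and the function names are invented.

```python
BENCHMARK_SECONDS = 240  # resident benchmark reported in the study

# Per-summary time (seconds) and cost (US dollars) from the table above.
MODELS = {
    "GPT-4o mini": {"time_s": 12.7, "cost_usd": 0.004},
    "o3-mini": {"time_s": 404, "cost_usd": 0.154},
    "GPT-5-mini": {"time_s": 221, "cost_usd": 0.030},
}


def seconds_vs_benchmark(time_s):
    """Positive means faster than the benchmark; negative means slower."""
    return BENCHMARK_SECONDS - time_s


def batch_cost(cost_usd, n_summaries=1000):
    """Cost of a hypothetical batch, assuming a flat per-summary price."""
    return cost_usd * n_summaries


if __name__ == "__main__":
    for name, stats in MODELS.items():
        delta = seconds_vs_benchmark(stats["time_s"])
        cost = batch_cost(stats["cost_usd"])
        print(f"{name}: {delta:+.1f} s vs benchmark, ${cost:.2f} per 1,000 summaries")
```

On these figures, o3-mini's 404 seconds is slower than the 240-second benchmark, while the other two models are faster.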

“Ideally, these time savings project to annual incremental revenue of $19,525 to $70,942 per radiologist, with costs of $30.26 to $32.46,” the researchers wrote. These numbers apply when radiologists only read oncologic cross-sectional imaging.

The study authors highlighted the pipeline’s potential to improve radiologist efficiency and accuracy while also leading to cost savings for practices.

“This modular framework can be adapted to various radiology indications and chart summarization tasks in other medical specialties,” they added.

The authors wrote that future studies will pilot the LLM pipeline in a clinical setting. Those studies will examine the pipeline’s effect on radiologist efficiency and accuracy, with the aim of increasing sample sizes, refining cost-benefit analyses, and gathering radiologist feedback.

The team’s current study could “come to represent an important transition point” since it transformed a “highly anticipated idea” into an executable workflow question for radiology, according to an accompanying editorial written by Bahram Mohajer, MD, and Christian Terwiesch, PhD, from the University of Pennsylvania in Philadelphia.

The two highlighted the study as being a “credible proof of concept” for a use case in radiology.

“It shows that clinically relevant oncologic context can be programmatically retrieved, structured, and summarized with performance that is now good enough to justify rigorous pilot implementation,” Mohajer and Terwiesch wrote.

“The next studies should measure net interpretation time after verification, diagnostic accuracy, report quality, radiologist trust, cognitive load, and downstream clinical use of the generated context,” they added. “They should evaluate fragmented records, broader disease domains, and prospective human-in-the-loop design.”
