A specialized large language model (LLM) fine-tuned for summarizing free-text radiology findings performed better than GPT-4o in generating radiology report summaries across diverse report types, according to a study published August 26 in Radiology.
The research highlights how a general-purpose model such as GPT-4o may be inferior to an LLM designed for this task, according to first authors Sunyi Zheng, PhD, of Tianjin Medical University Cancer Institute and Hospital in Tianjin, China, and Nannan Zhao, PhD, of Cancer Hospital of China Medical University in Shenyang, and colleagues.
"Using a dedicated LLM, rather than a general-purpose model like GPT-4o, may produce more accurate report summaries that align with the preferences of medical experts," the authors wrote. In a retrospective study, they compared their model, LLM-RadSum, developed from Meta’s open-source Llama2 (about 1.5 trillion words) and data from five cancer and general hospitals.
The local LLM was trained and evaluated using 1,062,466 CT and MRI radiology reports covering six anatomic regions (956,219 in the training set and 106,247 in the internal test set), along with an external test set of 17,091 reports. Chest was the most common examination category in both test sets, Zheng, Zhao, and colleagues noted.
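The article doesn't include implementation code, but the core task (generating an impression from a findings section with a fine-tuned causal LLM) can be sketched. The example below, using Hugging Face transformers, is purely illustrative: the checkpoint name, prompt template, and decoding settings are hypothetical assumptions, not details from the study.

```python
# Illustrative sketch only: prompting a (hypothetical) fine-tuned Llama 2
# checkpoint to turn a findings section into an impression. The model ID,
# prompt format, and decoding settings are assumptions, not study details.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/llm-radsum"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

findings = (
    "Chest CT: 12-mm spiculated nodule in the right upper lobe. "
    "No mediastinal lymphadenopathy. No pleural effusion."
)
prompt = f"Findings:\n{findings}\n\nImpression:\n"  # assumed template

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding for reproducible summaries
)
# Decode only the newly generated tokens (the impression).
impression = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(impression)
```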
Drawing on 1,800 randomly selected reports assessed by senior radiologists and clinicians, the researchers highlighted the following findings:
- LLM-RadSum achieved a higher F1 score (the harmonic mean of recall and precision; see the sketch after this list) on summarized reports than GPT-4o (0.58 vs. 0.30, p < 0.001). This advantage held across anatomic regions, both modalities, sexes, age groups, and impression lengths (all p < 0.001).
- In generating impressions from findings at CT and MRI, the specialized LLM outperformed GPT-4o (p < 0.001), with higher median F1 scores (CT: 0.53 vs. 0.31; MRI: 0.69 vs. 0.28).
- In addition, the model achieved median F1 scores of 0.55 for male patients and 0.61 for female patients, compared with GPT-4o's 0.31 and 0.29, respectively (all p < 0.001).
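For context, the F1 score cited above is the harmonic mean of precision and recall; for summaries it is typically computed over the token overlap between the generated impression and the radiologist-written reference, as in ROUGE-1 F1. The study's exact metric implementation isn't specified in this article, so the following is only a minimal token-level sketch:

```python
# Minimal token-overlap F1 between a generated impression and the
# radiologist-written reference (in the spirit of ROUGE-1 F1).
# Purely illustrative; the study's exact metric may differ.
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    # Tokens shared between the two texts, counted with multiplicity.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)  # shared / generated length
    recall = overlap / len(ref_tokens)     # shared / reference length
    return 2 * precision * recall / (precision + recall)  # harmonic mean

print(token_f1(
    "12 mm right upper lobe nodule, recommend follow-up ct",
    "right upper lobe nodule measuring 12 mm; recommend follow-up ct",
))  # ~0.74 for this pair
```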
Overall, 81.5% (1,467) of outputs from the specialized model met the standards of senior radiologists and clinicians in four key aspects: factual consistency, impression coherence, medical safety, and clinical use.
For example, on factual consistency, 88.9% (1,601) of the impressions LLM-RadSum generated were rated "completely consistent" with the original report, compared with 43.1% (775) for GPT-4o, according to the group.
[Figure: Model comparisons by human evaluation, based on 1,800 reports randomly selected from the five hospitals that make up the internal and external test sets. (A-D) Component bar graphs show the evaluation results of the specialized large language model (LLM-RadSum) and GPT-4o (OpenAI) for factual consistency, impression coherence, medical safety, and clinical use. Caption and graphic courtesy of RSNA.]
Those 1,467 outputs (81.5%) could be signed off directly without safety issues, whereas 74.7% (1,345) of GPT-4o's results had minor errors that needed to be corrected with minimal edits before signing.
In an accompanying editorial, Merel Huisman, MD, PhD, of Radboud University Medical Center in the Netherlands, said the study raises strategic questions for hospital leaders: "Should hospitals pursue local domain adaptation independently where permissible under local regulations, use the latest -- possibly more powerful but general -- proprietary models within a privacy-preserving framework, collaborate with startups, or wait until industry leaders have fully integrated bespoke tools into their products?"
Huisman said LLM-RadSum should be seen as a research output rather than a step toward a usable, generalizable medical device.
The group plans to deploy the model in real-world clinical settings.