LLMs show promise in interventional radiology education

Will Morton, Associate Editor, AuntMinnie.com

Large language models (LLMs) have the potential to enhance patient education in interventional radiology, a research team in Berlin has reported.

The finding is from a study that evaluated the performance of four advanced LLMs on frequently asked questions related to transarterial periarticular embolization (TAPE), CT-guided high-dose-rate (HDR) brachytherapy, and bleomycin electrosclerotherapy (BEST), noted lead authors Bogdan Levita, MD, and Semil Eminovic, MD, of Charité-Universitätsmedizin Berlin, and colleagues.

“DeepSeek-V3 and ChatGPT-4o demonstrated a strong performance in answering questions related to TAPE, BEST, and CT-HDR brachytherapy, highlighting their potential for patient education and communication improvement,” the group wrote. The study was published on October 13 in CVIR Endovascular.

TAPE is a procedure used to treat chronic joint pain; BEST is used in the treatment of vascular malformations; and CT-HDR brachytherapy is a minimally invasive procedure for liver tumors. All three are emerging procedures in interventional radiology, yet require comprehensive patient education and informed consent, which can be time- and resource-intensive in clinical practice, the researchers explained.

In this study, the group evaluated the potential of four LLMs to enhance clinical workflows and patient comprehension, and assessed the risks associated with the approach.

First, two radiology residents created 107 questions based on commonly asked patient questions encountered in daily clinical practice before TAPE (35), CT-HDR brachytherapy (34), and BEST (36). Questions included, for instance, “What medication do I have to stop taking before a TAPE and for how long?” or “What happens if I cannot withstand the radiation during CT-HDR brachytherapy and have to interrupt the treatment?”

All questions were presented to ChatGPT-4o, DeepSeek-V3, OpenBioLLM-8b, and BioMistral-7b, with responses independently assessed by two board-certified radiologists. Accuracy was rated on a 5-point Likert scale.

According to the results, DeepSeek-V3 attained the highest mean scores for BEST (4.49) and CT-HDR (4.24) and demonstrated comparable performance to ChatGPT-4o for TAPE-related questions (DeepSeek-V3 [4.20] vs. ChatGPT-4o [4.17]).

In contrast, OpenBioLLM-8b's performance (3.51 on BEST questions, 3.32 on CT-HDR questions, and 3.34 on TAPE questions) and BioMistral-7b's performance (2.92 on BEST questions, 3.03 on CT-HDR questions, and 3.33 on TAPE questions) were significantly worse, the researchers reported.

“DeepSeek-V3 and ChatGPT-4o demonstrated strong performance by delivering accurate and understandable responses, while the medically pretrained OpenBioLLM-8b and BioMistral-7b performed significantly worse across all procedures and produced potentially hazardous answers,” the researchers wrote.

For instance, BioMistral-7b provided inaccurate information regarding radiation exposure by incorrectly asserting that the radiation dose would be similar to or lower than that of a chest x-ray in all three procedures, the group noted.

While the findings demonstrate that LLMs can’t yet substitute for comprehensive medical consultations, they do provide evidence that such models could play an increasing role in radiology and patient care, the authors concluded.

“Future research should validate these findings, incorporate patient feedback, and evaluate LLM integration into clinical workflows, particularly retrieval-augmented and fine-tuned models aligned with [interventional radiology] guidelines,” they wrote.

The full study is available here.
