Are LLMs effective without specific training?

Wednesday, November 29 | 10:00 a.m. - 10:10 a.m. | W3-SSIN05-4 | S402

Researchers seeking to speed up the development of large language model (LLM) applications are exploring where and how to use a zero-shot approach, in which a model is applied without task-specific training, as this Wednesday session will cover.

Zero-shot use, meaning no domain-specific training, may enable automatic processing of unstructured radiology reports with minimal development effort, according to presenter Viel Sandfort, MD. Sandfort’s study sought to determine how well LLMs can derive the appropriate Coronary Artery Disease Reporting and Data System (CAD-RADS) category from unstructured coronary CT angiography (CCTA) reports.

For the study, 400 fully anonymized CCTA reports from four hospitals across three U.S. regions (East, Midwest, and Southwest) were obtained from AI image data provider Segmed. CAD-RADS categories were determined by three cardiovascular experts, and any mention of CAD-RADS in the reports was manually removed to avoid bias.

ChatGPT (GPT-3.5-turbo) and two LLMs derived from Meta AI’s publicly available LLaMA models (Stanford Alpaca 7B and 13B) were used for the analysis. The prompt given to each of the three LLMs included the CCTA report, the CAD-RADS definitions, and an instruction to output a CAD-RADS category. Of note, no prompt optimization was performed.
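As a rough illustration of what such a zero-shot setup can look like (not the authors’ actual prompt or code), the sketch below assembles a prompt from a report, a condensed set of CAD-RADS definitions, and an output instruction, then queries GPT-3.5-turbo through the OpenAI Python client. The prompt wording, the abbreviated definitions, and the `assign_cad_rads` helper are assumptions made for illustration only.

```python
# Minimal zero-shot sketch (illustrative only; not the study's prompt or code).
# Assumes the OpenAI Python client (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Condensed CAD-RADS definitions for illustration; the study supplied the full definitions.
CAD_RADS_DEFINITIONS = """\
CAD-RADS 0: no stenosis (0%)
CAD-RADS 1: minimal stenosis (1-24%)
CAD-RADS 2: mild stenosis (25-49%)
CAD-RADS 3: moderate stenosis (50-69%)
CAD-RADS 4: severe stenosis (70-99%)
CAD-RADS 5: total occlusion (100%)
"""

def assign_cad_rads(report_text: str) -> str:
    """Ask the model for a CAD-RADS category, zero-shot (no examples, no fine-tuning)."""
    prompt = (
        "You are given a coronary CT angiography report and the CAD-RADS definitions.\n\n"
        f"CAD-RADS definitions:\n{CAD_RADS_DEFINITIONS}\n"
        f"Report:\n{report_text}\n\n"
        "Output only the single most appropriate CAD-RADS category (0-5)."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for a classification-style task
    )
    return response.choices[0].message.content.strip()
```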

Among the findings, ChatGPT showed good agreement with expert-derived scores, with a Cohen’s kappa of 0.67 (CI 0.62-0.72), and significantly higher performance than both Alpaca models. Alpaca 13B in turn showed significantly higher performance than Alpaca 7B, with Cohen’s kappa of 0.44 (CI 0.40-0.49) vs. 0.23 (CI 0.19-0.28), respectively.
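For readers less familiar with the metric, Cohen’s kappa measures agreement between two raters beyond what chance alone would produce. The short sketch below, using scikit-learn and made-up labels rather than the study’s data, shows how model-assigned and expert-assigned CAD-RADS categories would be compared.

```python
# Illustrative only: compares hypothetical model vs. expert CAD-RADS labels.
# Cohen's kappa corrects raw agreement for the agreement expected by chance.
from sklearn.metrics import cohen_kappa_score

expert_labels = [0, 1, 2, 3, 4, 2, 1, 0, 3, 5]   # made-up expert consensus categories
model_labels  = [0, 1, 2, 3, 3, 2, 1, 0, 4, 5]   # made-up model outputs

kappa = cohen_kappa_score(expert_labels, model_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```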

“OpenAI’s ChatGPT showed good performance in assigning CAD-RADS categories to unstructured coronary CTA reports while both locally run ALPACA models showed lower performance compared to ChatGPT, with the 13B model performing better than the 7B model,” Sandfort et al. concluded, adding that the method used for this assessment could also be useful for other reporting and data systems.

LLM-assisted CAD-RADS categorization could be deployed rapidly for quality improvement, administrative analyses, billing, research, and report quality control, and could accelerate the development of clinical natural language processing models, according to Sandfort. Stop by this Wednesday morning session to learn more about the zero-shot LLM study.
