Generative AI improves clinical decision-making in the ED

Kate Madden Yee, Senior Editor, AuntMinnie.com

Generative AI software improves clinical decision-making "in alignment with expert evidence-based guidelines" in the emergency department (ED), researchers have reported.

The study results suggest that AI could reduce variability among ED physicians when it comes to ordering imaging exams, noted a team led by MD/PhD candidate Michael Yao of the University of Pennsylvania in Philadelphia. The group's findings were published August 4 in Nature: Communications Medicine.

"Emergency room doctors often need to quickly decide which medical scans, such as x-rays or CT scans, to order for patients," the group noted. "However, these decisions can vary significantly among doctors … Our results show that AI tools can accurately choose the right scans for patients and can also be helpful assistants for clinicians."

Diagnostic imaging is key to working up patients who present to the ED, but ordering appropriate studies in this context can be tricky, due to "a high degree of variability among healthcare providers," the investigators wrote. Recent research has explored whether generative AI and large language models (LLMs) can be used to recommend appropriate diagnostic imaging studies, but whether the technology can produce recommendations that align with medical guidelines is unclear, "especially given the limited diagnostic information available in acute care settings," they explained.

Yao and colleagues developed an algorithm that used LLMs (Claude Sonnet-3.5 and Meta Llama 3) to create imaging study recommendations consistent with the American College of Radiology's (ACR) Appropriateness Criteria. The work drew on RadCases, a dataset of more than 1,500 annotated case summaries that outline common patient presentations.

The study found that "LLMs achieve better accuracy with regard to image ordering in the ED without significant changes to the rate of missed imaging, the rate of unnecessary imaging, or number of recommended imaging studies," the group reported, writing that the research demonstrated that "LLMs can be leveraged by clinicians as a [clinical decision support] assistant to improve the accuracy of ordered imaging studies without significantly affecting the [rate of unnecessary imaging] or the [rate of missed imaging] in … [an] acute care environment."

Compared with clinicians, Claude Sonnet-3.5 and Llama 3 achieve the same or better (a) accuracy scores; (b) false positive rates (i.e., the rate at which a patient received at least one unnecessary imaging recommendation); (c) false negative rates (i.e., the rate at which a patient should have received an imaging workup but did not); and (d) F1 scores. (e) However, we observe that Claude Sonnet-3.5 orders a greater number of recommended imaging studies compared to clinicians. (f) According to the Dice-Sørensen Coefficient (DSC) metric, Claude Sonnet-3.5 and Llama 3 order imaging studies that are more similar to one another than to clinicians (two-sample, two-tailed homoscedastic t-test; p = 2.19 × 10⁻²⁴). Error bars in (e, f) represent ± 95% CI over n = 117 independent patient cases. (CL): Claude Sonnet-3.5 and Llama 3 pairwise DSC metric. Figure and caption courtesy of Nature: Communications Medicine under a Creative Commons Attribution 4.0 International License.
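For readers who want to see how the case-level metrics named in the caption work, here is a minimal Python sketch. It is not the study's code. The function names, the toy cases, and two conventions are assumptions made for illustration: a case counts as a false negative only when imaging was warranted and nothing was ordered, and two empty order sets are treated as identical under the Dice-Sørensen coefficient.

```python
def dice_sorensen(a, b):
    """Dice-Sørensen coefficient between two sets of imaging orders:
    2 * |A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty order sets are treated as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def case_level_rates(ordered, appropriate):
    """Per-case false positive and false negative rates.

    ordered      -- one set of ordered studies per patient case
    appropriate  -- one set of guideline-appropriate studies per case
    Both rates are fractions of all cases (an assumed denominator).
    """
    if len(ordered) != len(appropriate) or not ordered:
        raise ValueError("need two non-empty lists of equal length")
    n = len(ordered)
    # False positive: at least one ordered study is not guideline-appropriate.
    fp = sum(1 for o, g in zip(ordered, appropriate) if set(o) - set(g))
    # False negative (assumed reading): imaging was warranted, none was ordered.
    fn = sum(1 for o, g in zip(ordered, appropriate) if set(g) and not set(o))
    return {"false_positive_rate": fp / n, "false_negative_rate": fn / n}


# Hypothetical toy cases, not data from the study.
ordered = [{"CT head"}, {"XR chest", "CT chest"}, set()]
appropriate = [{"CT head"}, {"XR chest"}, {"US abdomen"}]

print(case_level_rates(ordered, appropriate))
print(dice_sorensen(ordered[1], appropriate[1]))
```

In this toy example, the second case contributes a false positive (an extra chest CT) and the third a false negative (no imaging ordered although an ultrasound was warranted).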

"We hope this work can support faster, more accurate decision-making and reduce unnecessary tests," the investigators concluded.

The complete study can be found here.
