Shawn Sun, MD, will share details of a study that took 70 gastrointestinal and genitourinary imaging cases from a radiology textbook, converted the case images and history into standardized prompts, and fed the prompts into both the ChatGPT-4 and ChatGPT-3.5 large language models.
The top-one and top-three accuracies were defined as the percentage of ChatGPT-generated responses that matched the original diagnosis and the complete differential provided in the original literature, according to Sun et al.
While both generations of ChatGPT were able to produce a differential diagnosis from prompts containing descriptive radiological findings, the responses showed only limited agreement with the expert literature, though ChatGPT-4 achieved a statistically significant improvement in top-one diagnosis accuracy over ChatGPT-3.5.
An additional differential diagnosis score was defined as the proportion of differentials that matched the original literature's answers for each case. The top-one and top-three accuracies for ChatGPT-3.5 versus ChatGPT-4 were 35.7% compared with 51.4% (p = 0.031) and 7.1% compared with 10% (p = 0.27), respectively. The average differential diagnosis score for ChatGPT-3.5 versus ChatGPT-4 was 42.4% compared with 44.7% (p = 0.39). ChatGPT-3.5 and ChatGPT-4 hallucinated 38.2% versus 18.8% (p = 0.0012) of the references provided and generated 23 versus four false statements, respectively.
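To make the metric definitions concrete, the top-k accuracy and per-case differential score described above could be computed along the following lines. This is an illustrative sketch, not the authors' code; the data structure and field names are hypothetical.

```python
# Hypothetical per-case records: the correct diagnosis, the model's
# ranked differential, and the literature's complete differential.
def top_k_accuracy(cases, k):
    """Percentage of cases whose correct diagnosis appears in the
    model's top-k differential."""
    hits = sum(1 for c in cases if c["correct_dx"] in c["model_ddx"][:k])
    return 100.0 * hits / len(cases)

def differential_score(case):
    """Percentage of the literature differential reproduced in the
    model's differential, for one case."""
    matched = sum(1 for dx in case["literature_ddx"] if dx in case["model_ddx"])
    return 100.0 * matched / len(case["literature_ddx"])

# Toy example with two made-up cases (diagnoses abbreviated A-G).
cases = [
    {"correct_dx": "A", "model_ddx": ["A", "B", "C"], "literature_ddx": ["A", "B"]},
    {"correct_dx": "D", "model_ddx": ["E", "D", "F"], "literature_ddx": ["D", "G"]},
]
print(top_k_accuracy(cases, 1))  # 50.0
print(top_k_accuracy(cases, 3))  # 100.0
```

Averaging `differential_score` across all 70 cases would yield the study's average differential diagnosis score.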
Sun et al also observed that hallucinations were more common in the citations the algorithm produced than in the statements it made, and that the hallucination rate improved with ChatGPT-4 compared with ChatGPT-3.5.
Get all the details by sitting in on this Tuesday afternoon talk.