The latest version of ChatGPT has passed a radiology board-type exam, yet the language model's "hallucinations" raise concerns over its reliability, according to a study published May 16 in Radiology.
Dr. Rajesh Bhayana of the University of Toronto and colleagues tested ChatGPT-4 -- a recently released paid version of the artificial intelligence (AI) large language model (LLM) -- on a multiple-choice, text-only test that matched the style, content, and difficulty of the Canadian Royal College and American Board of Radiology exams. ChatGPT-4 achieved a score of 81%, but the chatbot's wrong answers raised concerns.
"We were initially surprised by ChatGPT's accurate and confident answers to some challenging radiology questions, but then equally surprised by some very illogical and inaccurate assertions," Bhayana said, in a news release from RSNA.
ChatGPT has potential as a tool in medical practice and education, but its performance in radiology remains unclear, the authors noted. OpenAI released ChatGPT, based on GPT-3.5, in November 2022; the GPT-4 version followed in March 2023.
The researchers tested ChatGPT-3.5 first, with results of the study also published May 16 in Radiology. The exam consisted of 150 text-only questions designed to assess the chatbot's ability to perform "lower-order thinking" involving knowledge recall and basic understanding and "higher-order thinking" involving descriptions of imaging findings and applying concepts.
ChatGPT-3.5 answered 69% of questions correctly (with 70% considered a passing score) and performed better on questions requiring lower-order thinking than on those requiring higher-order thinking, according to the findings.
On the same exam, ChatGPT-4 performed better. While it showed no improvement over ChatGPT-3.5 on lower-order questions, the newer version scored markedly higher on questions requiring higher-order thinking (81% vs. 60%), according to the findings.
"Our study demonstrates an impressive improvement in performance of ChatGPT in radiology over a short time period, highlighting the growing potential of LLMs in this context," the researchers wrote.
As for what accounted for the improved performance, Bhayana suggested in an interview with AuntMinnie.com that ChatGPT-4 may have been trained on additional data and that its reasoning capabilities appear to have been enhanced, though OpenAI has not published these details.
Nonetheless, both ChatGPT-3.5 and ChatGPT-4 consistently used confident language, even when incorrect, which raises questions about ChatGPT's reliability for information gathering, the researchers wrote. This tendency is particularly dangerous if users rely solely on the chatbot for information, especially novices who may not recognize confident but incorrect responses as inaccurate, they noted.
"ChatGPT's dangerous tendency to produce inaccurate responses, termed 'hallucinations,' is less frequent in GPT-4 but still limits usability in medical education and practice at present," Bhayana and colleagues wrote.
Overall, the researchers described the rapid advancement of these models as "exciting" and suggested that applications built on ChatGPT-4 with radiology-specific fine-tuning should be explored further.