Is ChatGPT too ‘smart’ for its own good?

ChatGPT shows promise in answering radiology patients' questions, yet the complexity of its responses may complicate its use as a patient education tool, according to a group at the University of Pennsylvania.

Researchers led by Emile Gordon, MD, tested ChatGPT's ability to answer common imaging-related questions and further examined the effect of asking the chatbot to simplify its responses. While largely accurate, ChatGPT's responses were uniformly complex, they found.

“None of the responses reached the eighth-grade readability recommended for patient-facing materials,” the group wrote. The study was published October 18 in the Journal of the American College of Radiology.

The American College of Radiology (ACR) has prioritized effective patient communication in radiology and encourages its improvement, the authors wrote. ChatGPT has garnered attention as a potential tool in this regard. For instance, studies suggest that it could be useful for answering questions on breast cancer screening, the group noted.

However, its role in addressing patient imaging-related questions remains unexplored, they added.

To that end, the researchers asked ChatGPT 22 imaging-related questions deemed important to patients, covering safety, the radiology report, the procedure, preparation before imaging, and the meaning of medical terms and the roles of medical staff. The questions were posed with and without this follow-up prompt: “Provide an accurate and easy-to-understand response that is suited for an average person.”
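The paper summarized here does not describe how the queries were submitted, so purely as an illustration of the paired with/without-prompt design, a minimal sketch using the openai Python package (v1.x) and an assumed gpt-3.5-turbo model might look like this:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The simplifying follow-up prompt quoted in the study.
SIMPLIFY_PROMPT = (
    "Provide an accurate and easy-to-understand response "
    "that is suited for an average person."
)

def ask(question: str, simplify: bool = False) -> str:
    """Send one patient question, optionally appending the simplifying prompt."""
    content = f"{question} {SIMPLIFY_PROMPT}" if simplify else question
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; the study does not name the model version
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

# Example: compare the two response styles for one imaging-related question.
question = "Is the radiation from a CT scan dangerous?"
plain = ask(question)
simplified = ask(question, simplify=True)
```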

Four experienced radiologists evaluated the answers, while two patient advocates reviewed the responses for patient readability. Readability was assessed using the Flesch-Kincaid Grade Level (FKGL).
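For reference, FKGL maps a passage to a U.S. school grade from its average sentence length and average syllables per word. A minimal sketch of the calculation, using a rough vowel-group heuristic for syllable counting (an approximation; the study does not describe its tooling), might look like this:

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels; not a true syllabifier.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

# Example: a score near 13 corresponds to college-level text,
# well above the eighth-grade target for patient materials.
print(round(fkgl("The radiation dose from a single CT scan is generally low."), 1))
```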

According to the results, ChatGPT provided an accurate response 83% of the time without the simplifying prompt and 87% of the time with it, although this difference was not statistically significant.

In addition, although its responses were almost always at least partially relevant (99%), the proportion considered fully relevant rose significantly, from 67% to 80%, when the prompt accompanied the questions. Prompting also improved response consistency from 72% to 86%, the researchers wrote.

Finally, they found that the average FKGL of ChatGPT's responses was high at 13.6 and was essentially unchanged by the prompt (13), well above the recommended eighth-grade level.

“As it currently stands, the high complexity of the responses clouds the promise of true patient access to health information,” the group wrote.

The exploratory study underscores the potential of ChatGPT to streamline time-consuming tasks in patient health education, yet it also highlights challenges related to readability and the potential risk of presenting misleading information to patients, the group wrote.

“Exploring strategies such as effective prompt engineering will contribute to optimizing ChatGPT's output, ensuring its safety and effectiveness for patient use,” the group concluded.

The researchers disclosed that they used ChatGPT to enhance the clarity of select portions of the text, with the authors reviewing and editing the content as needed.

