ChatGPT-4 not reliable in cancer patient messaging

ChatGPT-4 is not a reliable source for answering patients’ questions regarding cancer, a study published April 24 in The Lancet Digital Health found.

Researchers led by Danielle Bitterman, MD, from Mass General Brigham in Boston, MA, found that ChatGPT-4 generated acceptable messages to patients without any additional editing by radiation oncologists 58% of the time, and 7% of responses generated by GPT-4 were deemed unsafe by the radiation oncologists if left unedited.

“Taking the collective evidence as a whole, I would still consider generative AI for patient messaging at its current stage to be experimental,” Bitterman said. “It is not clear yet whether these models are effective at addressing clinician burn-out, and more work is needed to establish its safety when used as with a human-in-the-loop.”

Medical specialties, including radiology and radiation oncology, continue to explore the potential of large language models such as ChatGPT. Proponents of the technology say that ChatGPT and other such models could help alleviate administrative and documentation responsibilities, which could in turn mitigate physician burnout.

The researchers noted that electronic health record (EHR) vendors have adopted generative AI algorithms to aid clinicians in drafting messages to patients. However, they also pointed out that the efficiency, safety, and clinical impact of these algorithms aren’t well known.

Bitterman and colleagues used GPT-4 to generate 100 scenarios about patients with cancer, each with an accompanying patient question. No questions from actual patients were used for the study. Six radiation oncologists manually responded to the queries, and GPT-4 separately generated its own responses to the same questions.

Then, the researchers provided the same radiation oncologists with the GPT-generated responses for review and editing. The radiation oncologists did not know whether GPT-4 or a human had written the responses. In 31% of cases, the radiation oncologists believed that a GPT-generated response had been written by a human.

The study found that on average, physician-drafted responses were shorter than the GPT-generated responses. GPT-4 also included more educational background for patients but did not give as much directive instruction.

The physicians reported that GPT assistance improved their perceived efficiency and deemed the generated responses to be safe in 82.1% of cases. They also indicated that the generated responses were acceptable to send to a patient without any further editing in 58.3% of cases.

However, if left unedited, 7.1% of GPT-generated responses could pose a risk to patients and 0.6% of responses could pose a risk of death. The researchers highlighted that this was often because GPT-4’s responses did not urgently instruct patients to seek immediate medical care.

Finally, the team reported that GPT-generated responses edited by physicians were more similar in length and content to the unedited GPT-generated responses than to the manual responses.

Bitterman said that the GPT-assisted responses were similar to the large language model draft responses, that is, the responses generated by GPT-4 before editing. This suggests physicians might adopt the large language model’s reasoning, raising the risk that model-assisted messaging could influence clinical recommendations, she added.

“This emphasizes the need for multi-level approaches to evaluation and safety addressing the large language model itself, the human interacting with it, and the human-large language model system as a whole,” Bitterman said.

She said that the next step is to work with patients to understand how they perceive the use of large language models in their care and what they think of the different responses.

“We are also investigating how biases in large language models impact the safety and quality of their responses,” Bitterman said.

