ChatGPT gives 'mostly' appropriate responses for breast pathology

ChatGPT could be an accessible source of information for women waiting to discuss core-needle breast biopsy results with their healthcare practitioners, a study published January 3 in the American Journal of Roentgenology found.

Researchers led by Eniola Oluyemi, MD, from Johns Hopkins University in Baltimore, MD, found that the chatbot provided “mostly appropriate” responses to various questions regarding breast pathologic diagnoses, as rated by reviewers.

“The reviewers’ overall agreement ranged from 88% to 96% regarding responses’ accuracy, consistency, definition provided, and clinical significance conveyed,” Oluyemi and colleagues wrote.

It’s been one year since the first iteration of ChatGPT was released to the public. Since then, medical researchers, including radiologists, have studied the potential use of this and other chatbots in clinical and patient-facing settings.

Along with that, the 21st Century Cures Act provides patients with immediate access to their radiology and pathology reports. With this in mind, patients could look to online sources such as publicly available large language models to help better understand their imaging and pathology results.

The Oluyemi team studied whether ChatGPT version 3.5 (ChatGPT-3.5) gives appropriate responses to patients’ questions about a variety of pathologic diagnoses encountered on breast core-needle biopsy, a procedure commonly performed to evaluate suspicious findings detected on mammography.

The team queried ChatGPT-3.5 on 14 pathologic diagnoses found on breast core-needle biopsy, including four benign, seven high-risk, and three malignant. Three reviewers from Johns Hopkins independently rated their agreement with responses given by the chatbot, taking the following qualities into consideration: accuracy, consistency, definition provided, clinical significance conveyed, and management recommendation provided.
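For illustration, a query workflow along these lines could be reproduced programmatically. The sketch below is a minimal example, assuming access to OpenAI's Python client and an API key; the prompt wording and diagnosis list are hypothetical, and the study itself used the ChatGPT-3.5 interface rather than the API.

```python
# Illustrative sketch only: shows how questions like those in the study could be
# posed to a GPT-3.5 model programmatically. Prompts and diagnoses here are
# hypothetical examples, not the study's actual queries.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

diagnoses = ["fibroadenoma", "atypical ductal hyperplasia", "invasive ductal carcinoma"]
questions = [
    "What does a diagnosis of {dx} on a breast core-needle biopsy mean?",
    "What is the clinical significance of {dx}?",
    "What is the recommended management for {dx}?",
]

for dx in diagnoses:
    for template in questions:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": template.format(dx=dx)}],
        )
        print(dx, "->", response.choices[0].message.content)
```

In the study, responses gathered in this manner were then rated independently by three reviewers against the five criteria listed above.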

The reviewers mostly agreed with ChatGPT-3.5’s responses. This included high agreement with the chatbot’s accuracy, consistency, and definitions provided. Additionally, there was moderate agreement with the clinical significance and management recommendation responses.

Reviewer agreement with ChatGPT-3.5 responses

Assessment                    Overall   Benign   High-risk   Malignant
Accuracy                      95%       92%      95%         100%
Consistency                   93%       92%      90%         100%
Definition                    93%       92%      95%         89%
Clinical significance         88%       83%      85%         100%
Management recommendations    71%       83%      52%         100%

The researchers also found that no reviewer agreed with ChatGPT’s management recommendation for two high-risk lesions, atypical ductal hyperplasia and radial scar. In their comments, the reviewers noted that ChatGPT’s responses lacked any mention of surgical excision.

Finally, the team reported the following interrater reliability values, expressed as kappa: -0.01 for accuracy, 0.31 for consistency, -0.04 for definitions, 0.28 for clinical significance, and 0.45 for management recommendations. Kappa values near zero or below indicate agreement among reviewers no better than chance, while values in the 0.3 to 0.5 range indicate only fair to moderate agreement.
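The article does not specify which kappa statistic was used. As a hedged illustration, a multi-rater statistic such as Fleiss' kappa could be computed from three reviewers' agree/disagree ratings, as in the sketch below; the ratings shown are made up for demonstration and are not the study's data.

```python
# Hypothetical illustration: Fleiss' kappa for three reviewers rating 14
# diagnoses as agree (1) or disagree (0) with ChatGPT's response.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = diagnoses, columns = reviewers; values are made-up example ratings.
ratings = np.array([
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1],
    [1, 1, 1], [1, 1, 1], [0, 0, 1], [1, 1, 1], [1, 1, 1],
    [1, 0, 0], [1, 1, 1], [1, 1, 1], [1, 1, 1],
])

# aggregate_raters converts per-rater labels into per-category counts per subject,
# which is the input format fleiss_kappa expects.
counts, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(counts))
```

Negative or near-zero values, as reported for accuracy and definitions, would mean the reviewers agreed with each other about as often as chance alone would predict.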

The study authors highlighted that these findings are consistent with previous research demonstrating ChatGPT’s ability to give appropriate responses for most questions asked about breast cancer prevention and screening.

“Nonetheless, patients must subsequently crosscheck information received from [large language models] with their healthcare practitioners, especially regarding management recommendations,” they added.

The study can be found in its entirety in the American Journal of Roentgenology.
