ChatGPT-4 matches radiologists in flagging errors in reports

Apr 16, 2024

ChatGPT-4 matched radiologists when tasked with detecting errors in imaging reports, suggesting the technology may be useful for improving reporting accuracy, according to a study published April 16 in Radiology.

A team at University Hospital Cologne in Germany compared the performance of ChatGPT-4 and six radiologists in detecting common errors in 200 radiology reports (x-ray, CT, and MRI) and found the large language model (LLM) was not only “on par” with radiologists but also required significantly less processing time.

“Most strikingly, we found that the average reading time for a report was for the fastest radiologist 25 seconds on average and for ChatGPT-4 it was only 3.5 seconds,” said lead author Roman Gertz, MD, in an interview with AuntMinnie.com.

Preliminary radiology reports are typically drafted by residents and subsequently reviewed and approved by board-certified radiologists. This legally necessary process increases accuracy, yet errors may occur due to resident-to-attending discrepancies, speech recognition inaccuracies, and high workload, Gertz and senior author Jonathan Kottlors, MD, explained.

In this study, the researchers intentionally inserted 150 errors from five error categories (omission, insertion, spelling, side confusion, and “other”) into 100 of the 200 reports and tasked ChatGPT-4, two senior radiologists, two attending physicians, and two residents with detecting these errors.

ChatGPT-4’s detection rate was 82.7% (124 of 150), while the average detection rates were 89.3% for senior radiologists (134 of 150) and 80% for attending radiologists and radiology residents (120 of 150), the researchers found.
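The percentages above follow directly from the reported counts. As a quick check, here is a minimal Python sketch that recomputes them; the variable and function names are illustrative assumptions, and only the counts come from the article.

```python
# Errors detected out of the 150 inserted errors, as reported in the article.
TOTAL_ERRORS = 150
detected = {
    "ChatGPT-4": 124,
    "Senior radiologists": 134,
    "Attending radiologists and residents": 120,
}


def detection_rate(found: int, total: int = TOTAL_ERRORS) -> float:
    """Return the detection rate as a percentage, rounded to one decimal place."""
    return round(100 * found / total, 1)


# Hypothetical helper structure: reader group -> detection rate in percent.
rates = {reader: detection_rate(count) for reader, count in detected.items()}

for reader, rate in rates.items():
    print(f"{reader}: {rate}%")
```

The recomputed values agree with the figures quoted from the study.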

In addition, ChatGPT-4 required less processing time per radiology report than even the fastest human reader, and its use resulted in a lower mean correction cost per report than that of the most cost-efficient radiologist, Gertz and Kottlors noted.

The group has been exploring the use of ChatGPT in radiology applications for more than a year and is “still shocked” by its performance, given that the LLM’s developer, OpenAI, has kept a lid on the data it used to train the model, Gertz said.

The study also suggests that ChatGPT-4 could potentially serve as a teaching tool for residents who might not have access to senior radiologists by providing a “feedback loop” for them to learn from their mistakes, Gertz said.

Ultimately, the study shows that the advanced text-processing capabilities of LLMs such as GPT-4 have the potential to enhance the report generation process, the researchers said. Yet Kottlors noted that this was a “proof-of-concept” study and that significant hurdles will need to be overcome before LLMs can be implemented in hospital departments.

For instance, hospitals that use the technology will need dedicated software integrated into their systems to ensure the privacy of clinical data, he suggested.

“At the moment, there’s no commercial software available,” he said.