ChatGPT-4 performs well in identifying incidental findings on CT via a process called single-shot learning, according to Canadian research published January 10 in the American Journal of Roentgenology.
A team led by Rajesh Bhayana, MD, from Toronto General Hospital in Canada found that the large language model achieved a perfect F1 score for incidental adrenal nodules and highlighted that their results show that such models can be applied flexibly in medical settings.
“Since incidental findings are commonly mismanaged, automatic identification in reports could improve management by increasing visibility or partially automating workup,” the Bhayana team wrote.
While it’s common for clinically important incidental imaging findings to be reported, they can be overlooked or not managed properly.
Large language models such as ChatGPT have shown high performance in completing tasks after training with few examples, also known as few-shot learning. Similarly, previous studies have explored their promise after training with just one example, known as single-shot learning.
Bhayana and colleagues highlighted that as these models become more implemented into electronic medical record (EMR) software, they could be used more flexibly than traditional natural language processing models.
For their study, the researchers tested ChatGPT-4’s performance with single-shot learning for identifying incidental findings in radiology reports, which contain medical jargon. They randomly selected 1,000 radiology reports for abdominal CT exams taken from the EMR, all of which were performed in 2018.
After potentially identifiable data were removed, a radiologist with years of post-training experience and a licensed physician with four years of experience independently reviewed the reports to establish the reference standard. From there, an additional radiologist with 10 years of experience resolved discrepancies.
The team manually reviewed the reports for the presence of the following: history of malignancy or definite new malignancy on exams, an adrenal nodule measuring less than 1 cm, a pancreatic cystic lesion, and vascular calcification. It considered adrenal nodules to be incidental in the absence of known malignancy. The team also used F1 scores to measure the predictive performance of ChatGPT-4, with a score of 1 indicating perfect precision.
ChatGPT-4 achieved F1 scores of 1.00 for incidental adrenal nodules, 0.91 for pancreatic cystic lesions, and 0.99 for vascular calcifications, using radiology reports as the reference.
The study authors suggested that as large language models continue to gain ground in their use within EMRs, their flexibility could present new opportunities for patient care improvement.
A previous study found that a natural language processing model manually trained on over 4,000 medical reports achieved an F1 score of 0.89 for new adrenal nodules. The study authors also highlighted that ChatGPT-4’s performance is comparable to those of task-specific natural language processing models in extracting findings from radiology reports.
“Automatic identification of incidental findings in radiology reports could improve patient care by highlighting the findings to referring clinicians, automating management, or facilitating population health initiatives,” they wrote.
The study can be found in its entirety here.