AI devices vary widely in lung cancer detection

In a head-to-head comparison, seven commercial AI devices demonstrated substantial variability in their ability to detect lung cancer on chest x-rays, according to a study published May 19 in Radiology.

The findings reveal clinically meaningful differences across all key performance metrics, including sensitivity, specificity, and positive predictive value, and raise questions about how AI devices should be selected for deployment, noted lead author Ahmed Maiter, MD, of Sheffield Teaching Hospitals NHS Foundation Trust in the U.K., and colleagues. 

"A lack of information regarding comparative performance risks the selection of inferior devices, which could waste resources, adversely impact clinical workflows, and hinder advancement in the field," the authors wrote. 

The radiology landscape is becoming increasingly crowded with AI devices, with more than 300 products now available. Choosing among devices built for a similar purpose is a growing challenge, the researchers noted. Effective selection requires an understanding of how devices perform in their intended patient populations and clinical settings and how performance differs among devices, yet comparative evidence remains limited, they added.

To bridge the gap, the researchers developed a dataset of 5,235 posteroanterior x-rays from 5,235 patients acquired for any indication at a single U.K. center between July 2020 and February 2021. The median patient age was 60 years; 53.4% of patients were female and 79.4% were white. Confirmed cancer with a tumor visible on x-ray was present in 1.4% of patients.

The group tested devices from seven manufacturers on each x-ray: Annalise Enterprise CXR (Harrison.ai, Australia), ChestView (Gleamer, France), InferRead DR Chest (InferVision, China), TechCare Chest (Milvue, France), ChestEye (Oxipit, Lithuania), qXR (Qure.ai, India), and Rayscape CXR (Rayscape, Romania).

Cropped secondary capture examples. These are illustrative and not intended to imply superiority or inferiority of any device. (A) Posteroanterior radiograph in a 46-year-old female patient. The device correctly identified a right lower lobe nodule projected below the right hemidiaphragm and hilar lymphadenopathy. (B) Posteroanterior radiograph in an 86-year-old female patient with a classic Golden S sign highly suggestive of cancer. Three devices did not identify any findings. (C) The output from one device for the same radiograph as in B. The device placed a contour around the area of abnormality but mislabeled it as segmental collapse, and there are no other elements in the output to raise suspicion of cancer. (D) Posteroanterior radiograph in a 60-year-old male patient -- a case of confirmed lung cancer that was not deemed visible in retrospect. The device identified multiple false-positive abnormalities. (E) Posteroanterior radiograph in a 77-year-old female patient with two right lower lobe nodules. The device mislabeled the abnormality as infection -- a diagnostic term that could incorrectly influence clinical management. (F) Posteroanterior radiograph in a 77-year-old female patient with a right hilar tumor. Most of the lungs have been labeled by the device, with excessive overlap of the abnormality that pragmatically represents an incorrect result. All annotations shown were produced by the devices. LL = lung lesion, LO = lung opacity, PO = pleural other, TBC = tuberculosis. Image: RSNA.

According to the results, the area under the receiver operating characteristic curve varied from 0.80 to 0.94 across devices. Sensitivity ranged from 20.8% to 77.8%, specificity from 58.9% to 98.4%, and positive predictive value from 1.5% to 28.4%, with significant differences observed in 39 of 44 pairwise comparisons. In addition, device classification results showed minimal agreement, with a Fleiss κ of 0.24. Compared with radiologist reports, three devices detected more tumors and four detected fewer, while the number of additional false-positive results for tumor detection ranged from 10 to 2,039.
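
The low positive predictive values reported here are partly a consequence of how rare visible tumors were in the dataset (1.4%). As a rough illustration, the sketch below applies Bayes' rule to show how positive predictive value depends on sensitivity, specificity, and prevalence. Only the 1.4% prevalence comes from the study; the sensitivity and specificity pairs are hypothetical values chosen for illustration and do not correspond to any device tested, and the function name is likewise an assumption.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Return P(cancer | positive result) using Bayes' rule.

    All three arguments are proportions between 0 and 1.
    """
    # Expected fraction of all patients who have cancer and are flagged.
    true_positive_rate = sensitivity * prevalence
    # Expected fraction of all patients who do not have cancer but are flagged.
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Prevalence of visible tumors reported in the study.
PREVALENCE = 0.014

# Hypothetical operating points (NOT taken from any device in the study).
hypothetical_devices = [
    ("lower specificity", 0.70, 0.90),
    ("higher specificity", 0.70, 0.98),
]

for label, sensitivity, specificity in hypothetical_devices:
    ppv = positive_predictive_value(sensitivity, specificity, PREVALENCE)
    print(f"{label}: sensitivity={sensitivity:.0%}, "
          f"specificity={specificity:.0%}, PPV={ppv:.1%}")
```

Under these assumptions, holding sensitivity fixed and raising specificity from 90% to 98% moves the positive predictive value from roughly 9% to roughly 33%, which suggests why even modest differences in specificity can translate into large differences in false-positive workload when cancer is uncommon.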

“Compared with radiologist reports, three devices helped detect more cancerous tumors, whereas the other four devices helped detect fewer tumors, indicating that some devices could add more benefit to diagnostic pathways than others,” the group wrote. 

Future work should compare the impact of different devices on radiologists' diagnostic accuracy and reporting behavior, patient outcomes, and health care service delivery, the researchers concluded. 

In an accompanying editorial, Cornelia Schaefer-Prokop, MD, and Steven Schalekamp, MD, PhD, both of Radboud University Medical Centre in Nijmegen, the Netherlands, wrote that the study illustrates the importance but also the challenges of benchmarking the performance of various AI tools. 

“Although it is important to realize that there is wide variability in the performance of AI products, it is equally important to analyze specifically how tools differ and what the underlying reasons for performance differences are,” they wrote. 

Ultimately, a common framework for benchmarking will need to be developed by the scientific community and professional societies to allow for safe and reproducible comparison of AI tools in a realistic setting, Schaefer-Prokop and Schalekamp wrote. 

The full study is available here.
