Expert highlights potential of open cloud datasets

Jun 19, 2023


Sharing open datasets in the cloud for AI development can allow users to spend more time on data analysis rather than acquisition, according to a June 14 presentation delivered at the Society for Imaging Informatics in Medicine (SIIM) annual meeting.

Data wranglers in imaging and other fields likely know that 80% of their time is spent finding, cleaning, transforming, and otherwise optimizing data as opposed to actually "doing cool stuff" with it, said Erin Chu, DVM, PhD, of Amazon Web Services (AWS).

"The whole reason my team exists is because we believe that sharing open data in the cloud lets people spend more time analyzing and innovating on that data rather than acquiring data," she said.

Chu leads AWS's open data team, which operates the Registry of Open Data, a digital catalog of all open data on AWS. Almost all of it sits in object-based storage "S3 buckets" that can be accessed through native interfaces. The goal of the registry is to bring users as close to the data as possible, according to Chu. Currently, users who go to the registry see the actual S3 buckets, which they can mine themselves in the console, with third-party tools, or in the command line interface. They are also given usage examples, publications, and articles on how people are starting to leverage these data.
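For readers curious what mining a bucket directly might look like, the following is a minimal sketch in Python, not something shown in the presentation. It assumes the bucket allows anonymous listing and is reachable at the standard virtual-hosted S3 address; some buckets may require a regional endpoint. The bucket name and prefix in the comments are hypothetical placeholders, and the helper names are invented for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used in S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def build_list_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 request URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_public_objects(bucket, prefix="", max_keys=10):
    """Fetch one page of object keys and sizes without AWS credentials."""
    url = build_list_url(bucket, prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_listing(resp.read())


# Hypothetical usage (bucket name and prefix are placeholders, not real datasets):
# for key, size in list_public_objects("example-open-data-bucket", "images/"):
#     print(key, size)
```

The AWS command line interface offers a comparable route: `aws s3 ls --no-sign-request s3://<bucket-name>/` lists a public bucket without credentials.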

A second focus of Chu's team is the Open Data Sponsorship Program, an application-based program that covers the cost of storing and distributing high-value, high-impact data, with clients including the Imaging Data Commons, NYUMets, FastMRI, and the Emory Breast Imaging Dataset (EMBED), she said.

"And one thing to keep in mind here, these data are owned by the people who manage these data," she told session attendees.

To create successful open datasets, the data first needs to be optimized for analysis, Chu said. Whether the data is stored on a hard drive or accessed in place, it's important to determine whether it is ready for analysis as provided or whether users will need more time to transform and process it.

In addition, "building a community around the data" is important, she added. Once you have a community, you want to be able to encourage the use of the data in different ways.

"What you need is that critical mass of people to say, 'Hey, what can we do with this data? How can we reuse it?' and how can we establish the standards and the metadata around optimizing this data for reuse in the future?" she said.

Ultimately, the most difficult challenge in medical imaging is the need for a gold standard deidentification process, Chu said. To date, the gold standard is human review. That's fine with a thousand images, but not with 20,000 or 200,000 images, which is the scale AWS is approaching on some of its big projects, she said.

"We have seen some benchmarking using AI for deidentification, but I still don't think that those have been accepted in practice," she concluded.