June 16, 2022 | Jorge Cardoso, a researcher at King’s College London and CTO at the London AI Centre, and NVIDIA have made 100,000 synthetic brain MRI images freely available to healthcare researchers. They believe the dataset will accelerate the diagnosis and understanding of dementia, aging, and other brain diseases.
Last summer, researchers at King’s College London announced the Synthetic Brain Project, focused on building deep learning models to synthesize artificial 3D MRI images of human brains. The models were developed by King’s and NVIDIA data scientists and engineers as part of The London Medical Imaging & AI Centre for Value Based Healthcare, research funded by UK Research and Innovation and a Wellcome Flagship Programme (in collaboration with University College London). The project was one of the first run on NVIDIA’s Cambridge-1, the United Kingdom’s most powerful supercomputer.
The value of synthetic data is twofold, Cardoso told Diagnostics World. First, he said, “The ability to generate synthetic data allows us to better understand the underlying anatomy of disease.” And second: “Synthetic data allows us to share data that was otherwise behind hospital firewalls, which will speed up the advancement of scientific research and healthcare,” he said.
It is the second point—access to healthcare data—that could truly speed up our disease understanding.
The model that NVIDIA and King’s used is a GAN, or generative adversarial network, the same type of network behind thispersondoesnotexist.com. “It basically has a good cop and a bad cop,” explained Dr. Mona Flores, global head of medical AI at NVIDIA. “You have the one side trying to produce something that looks like an MRI compared to a real one. The discriminator looks at it and says, ‘Oh no, this is too far from reality. Go and try again.’ And you keep going back and forth… It keeps spitting out new images and comparing them to real images until you get to the point where they are good enough and close enough to the real image. However, now they are not associated with any specific patient. They are naturally anonymized.”
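For readers unfamiliar with the mechanics, here is a minimal sketch of that adversarial loop in PyTorch. It is a generic toy on flat vectors, not the team’s 3D MRI architecture (which is not detailed in this article); all dimensions and hyperparameters are illustrative assumptions.

```python
# Minimal GAN sketch: a generator ("counterfeiter") and a discriminator
# ("the cop") trained against each other, as Flores describes.
import torch
import torch.nn as nn

LATENT, DATA = 64, 256  # hypothetical latent and data dimensions

generator = nn.Sequential(
    nn.Linear(LATENT, 128), nn.ReLU(),
    nn.Linear(128, DATA), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(DATA, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1),  # real-vs-fake logit
)

bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator: learn to score real samples high and fakes low.
    fake = generator(torch.randn(n, LATENT)).detach()
    d_loss = bce(discriminator(real_batch), ones) + bce(discriminator(fake), zeros)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: produce samples the discriminator can no longer reject.
    fake = generator(torch.randn(n, LATENT))
    g_loss = bce(discriminator(fake), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

Repeating `train_step` over many minibatches of real data is the “back and forth” in Flores’s description: training stops when the generator’s outputs are close enough to real images, yet correspond to no actual patient.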
Natural anonymization means more data can be shared with more researchers. “We realized that these generative models can be a very important way to share private data without all the risks of privacy rights,” Cardoso said. “You’re not actually giving someone’s brain [image] away. You’re giving a model. You’re removing all the privacy concerns but making these data available.”
This approach to synthetic, naturally anonymized data could be extremely useful in a variety of applications, according to Flores. “One of the most dominant use cases for synthetic data is actually training AI models,” she said. “AI algorithms that would go and detect the anomaly, that would detect the rare disease, the thing that looks different. You’re not actually trying to detect that from the data itself. You’re using the synthetic data as a means to train an AI model that is able to go and diagnose disease or classify it or segment it.”
“For instance, let’s say you want to train a model that is able to tell if there is a tumor in the brain and is able to draw a margin around it,” Flores proposed. “But let’s say you don’t have enough data. All of these models take a lot of data and a lot of labeling. You have to get so many different MRIs and you have to actually do manual labeling and then you feed it into a model that’s able to do segmentation… I can generate images of different MRIs that have that tumor, but they have it in a different location, or perhaps the background is different, they have something else in the image. Now I’ve augmented my dataset with all of this synthetic data, so I can go and train the segmentation model and it will actually perform well because it did see the tumor in all the synthetic data.”
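In code, the augmentation step Flores describes can be as simple as pooling real and generated image/label pairs into one training set. The following is a hedged sketch with small random tensors standing in for labeled MRI volumes; the dataset sizes and shapes are illustrative assumptions, not the team’s.

```python
# Augmenting a small real dataset with synthetic image/label pairs
# before training a segmentation model.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins: 30 "real" labeled volumes vs. 500 synthetic pairs
# (tiny 16^3 volumes to keep the toy example lightweight).
real = TensorDataset(torch.randn(30, 1, 16, 16, 16),
                     torch.randint(0, 2, (30, 1, 16, 16, 16)))
synthetic = TensorDataset(torch.randn(500, 1, 16, 16, 16),
                          torch.randint(0, 2, (500, 1, 16, 16, 16)))

# Shuffling the combined pool interleaves real and synthetic cases
# within each batch fed to the segmentation model.
loader = DataLoader(ConcatDataset([real, synthetic]), batch_size=4, shuffle=True)
for images, masks in loader:
    pass  # forward/backward pass of the segmentation model goes here
```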
Cardoso presents another use case capitalizing on the model’s computational power. “For example, if you’ve seen a bunch of brains with a brain tumor and you’ve seen a bunch of brains with a stroke and you’ve maybe seen two to three brains that have tumors and stroke—because it’s less common—then the model can generate loads of new brains with both tumors and strokes. They can mix information they’ve seen before.”
The dataset of 100,000 MRI images will be shared in three ways. Health Data Research UK, a national repository, plans to host the brain images from the project. The dataset will also be posted to figshare, though Cardoso admitted it has been “a bit tricky” to upload the half-terabyte dataset. Finally, the dataset will be shared via an academic torrent.
The team is also planning a simple web app that lets users plug in their own parameters (“I want to have a 54-year-old female with a brain that is atrophied,” Cardoso proposed) and see newly generated images. The web app will be hosted on Hugging Face, a machine learning hub.
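The app itself has not been released, but a parameter-driven demo of this kind is often wired up with Gradio, the library behind many Hugging Face demos. The sketch below is entirely hypothetical: the function name, the covariates, and the placeholder output are assumptions for illustration, not the team’s code.

```python
# Hypothetical parameter-driven demo in the style the article describes.
import gradio as gr
import numpy as np

def generate_brain(age, sex, atrophy):
    # Placeholder only: a real implementation would condition the released
    # generative model on these covariates (sex is ignored here) and return
    # a synthetic MRI slice.
    rng = np.random.default_rng(int(age) * 100 + int(atrophy * 10))
    return (rng.random((256, 256)) * 255).astype(np.uint8)

demo = gr.Interface(
    fn=generate_brain,
    inputs=[
        gr.Slider(18, 95, value=54, label="Age"),
        gr.Radio(["female", "male"], value="female", label="Sex"),
        gr.Slider(0.0, 1.0, value=0.5, label="Brain atrophy"),
    ],
    outputs=gr.Image(label="Synthetic MRI slice"),
)

demo.launch()
```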
The models themselves will also be shared so researchers can run them on their own computers and generate their own data. Finally, the group plans to release the software so that similar models can be trained on new datasets. “For example, if you’re doing lung research and you have 10,000 lung datasets and you want to release your data without releasing your data, you can just train this model on your data and you can release your model,” Cardoso said.
Training a model takes time. Cardoso’s team used the Cambridge-1 supercomputer for two weeks to create this set of 100,000 images. For researchers with more standard GPUs, “You can generate one image every 10-20 seconds; that means you can generate 100,000 images in a few months if you wanted to,” Cardoso estimates. “It’s not hardware bound; it’s time bound.”
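That estimate can be sanity-checked with a quick back-of-envelope calculation, assuming a single GPU generating continuously, which real-world workflows with shared hardware and storage overheads rarely achieve; hence the more conservative “few months” figure.

```python
# Back-of-envelope: pure compute time to generate 100,000 images
# at the quoted 10-20 seconds per image on a single standard GPU.
images = 100_000
for seconds_per_image in (10, 20):
    days = images * seconds_per_image / 86_400  # 86,400 seconds per day
    print(f"{seconds_per_image} s/image -> about {days:.0f} GPU-days")
# 10 s/image -> about 12 GPU-days; 20 s/image -> about 23 GPU-days,
# before accounting for shared hardware, I/O, and downtime.
```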
However, Cardoso acknowledged that Cambridge-1 did enable exploration in whole new ways.
An NVIDIA DGX SuperPOD, Cambridge-1 packs 640 NVIDIA A100 Tensor Core GPUs, each with enough memory to process one or two of the team’s massive volumes, which comprise roughly 16 million voxels (3D pixels) apiece.
The team also used MONAI, an open-source AI framework for medical imaging that includes domain-specific data loaders, metrics, GPU-accelerated transforms, and an optimized workflow engine. The software’s smart caching and multi-node scaling can reportedly accelerate jobs up to 10x.
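As an illustration, here is a minimal MONAI data pipeline using the cached loading described above; the file paths are placeholders, and the specific transforms are a generic assumption rather than the project’s actual preprocessing.

```python
# Minimal MONAI pipeline: domain-specific loading plus smart caching.
from monai.data import CacheDataset, DataLoader
from monai.transforms import (
    Compose, EnsureChannelFirstd, LoadImaged, ScaleIntensityd,
)

# Placeholder file list; each dict maps a key to a NIfTI volume on disk.
files = [{"image": f"brains/subject_{i:05d}.nii.gz"} for i in range(100)]

transforms = Compose([
    LoadImaged(keys="image"),            # medical-imaging-aware file loader
    EnsureChannelFirstd(keys="image"),   # move channel dim to front
    ScaleIntensityd(keys="image"),       # normalize voxel intensities
])

# CacheDataset precomputes and holds deterministic transform results in
# memory, one source of the speedups MONAI reports.
ds = CacheDataset(data=files, transform=transforms, cache_rate=1.0)
loader = DataLoader(ds, batch_size=2, shuffle=True, num_workers=4)
```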
“The amount of GPUs and the type of GPUs that Cambridge-1 has is very special. It allows us to train models with sufficient size.” Models have parameters, Cardoso explained; the more parameters a model has, the more complex the patterns it can learn, but larger models are also more taxing for the system.
“With Cambridge-1, we managed to scale these models really, really large. And it got to a point where they started performing really, really well. We didn’t have to do a lot of engineering tricks to make these models work. They just worked. The hardware allowed the research to happen without having to contort ourselves to the will of the hardware. The hardware worked as a tool.”
Thus far, workable datasets for brain images have been relatively small. For instance, in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), Cardoso estimates that there are 3,000-4,000 subjects. “That data is incredibly valuable!” he emphasized. “We are using that data to [inform] our models.”
But because those are the data available, Cardoso argues that exploration has been limited. “Researchers have refrained from training really advanced models because they don’t have the amount of data that those models would require,” he said. “They tend to limit themselves technologically to what can be run on a few hundred, a few thousand datasets.”
In releasing a dataset of 100,000 images, Cardoso hopes to open opportunities for much greater exploration.
Finding the Limits of Synthetic Data
If more data is better, why not use Cambridge-1 to generate 500,000 images? One million? More?
“If it was real data, my answer would have been the more data you have, the better any downstream model will work. With synthetic data, we don’t know yet the answer,” Cardoso answered. “Can you generate one million useful synthetic images from an input of 50,000 real images?”
Generative models are computationally powerful and adept at mixing characteristics they have seen before, Cardoso said, and they can create more complex data than the data that was acquired. “It’s not just replicating the characteristics of the original data, even though it is preserving them.”
But does the synthetic data contain more variability? “Do you actually learn more about humans from these synthetic datasets than you do from the 50,000? We do not know. The images that you generate are all different, but that doesn’t mean that they contain more information than your original dataset did. We do not know the answer to those questions.”
These questions are being actively explored. The NIH is currently funding research into the role of synthetic data, NVIDIA’s Flores said: “Where can it be used? Where should it be used? What are the pitfalls of using it?” So many companies are working on synthetic data, she said, “and every one of them does it differently!”
Cardoso has great expectations for this technology. “You can imagine that if you somehow manage to transform all of this in a brain simulator—which is our ultimate goal—one of the things we’re looking at is trying to understand how patients with Alzheimer’s disease will progress,” he said, or perhaps multiple sclerosis or another degenerative disease. He imagines being able to model the brain at various stages of disease or on various drugs to visualize disease progression. While these are hypothetical use cases so far, he is hopeful. “It’s just learning patterns from data. The patterns are there; the model is learning those patterns.”