January 3, 2023 | In a paper published late last year in Nature Communications, researchers from Penn Medicine and Intel Labs report on what they call the largest global federated learning effort to date to develop an accurate and generalizable machine learning model for detecting glioblastoma borders. The work has important implications both clinically and as a model for future large-scale federated learning projects.
“We’re hoping this can be a flagship study that helps encourage folks to realize that practical federations are plausible,” Jason Martin, a Principal Engineer in the Security Solutions Lab and manager of the Secure Intelligence Team at Intel Labs, told Diagnostics World. Martin and Spyridon Bakas, Assistant Professor of Radiology and of Pathology and Laboratory Medicine at the Perelman School of Medicine, University of Pennsylvania, are senior authors on the paper (DOI: 10.1038/s41467-022-33407-5), which provides further updates and resources from the federated learning project they launched in 2020.
Tumor Segmentation as the Model Organism
Clinically, the study’s focus was to create a machine learning (ML) model that could accurately distinguish the boundaries of sub-compartments within a glioblastoma tumor: the enhancing tumor, the tumor core, and the whole tumor. “Detecting these sub-compartment boundaries,” the authors write, “is a critical first step towards further quantifying and assessing this heterogeneous rare disease and ultimately influencing clinical decision-making.”
A machine learning model of this scope is a multi-parametric, multi-class learning problem, one that requires expert clinicians to follow a manual annotation protocol across multi-parametric magnetic resonance imaging (mpMRI) scans. Because glioblastoma is a rare disease (with an incidence rate of about 3 per 100,000 people), training such a model at a single site, or even within a single healthcare network, would be impractical. There simply wouldn’t be enough data.
Bakas is intimately familiar with that scarcity. He has led the International Brain Tumor Segmentation (BraTS) challenge since 2017. The annual challenges, which began in 2012, evaluate methods for segmenting brain tumors in MRI scans, and he has seen firsthand how difficult it is to assemble a centralized dataset to serve as the community benchmark for BraTS.
A federated learning approach offered a solution. After running a feasibility study with BraTS participants, Bakas and Martin launched a federation to tackle the problem. Along the way, they built and released the software infrastructure to facilitate the project, which includes the Federated Tumor Segmentation (FeTS) platform; OpenFL, an open-source framework for federated learning (GitHub; DOI: 10.1088/1361-6560/ac97d9); and GaNDLF, a Generally Nuanced Deep Learning Framework for Scalable End-to-End Clinical Workflows in Medical Imaging (pronounced “Gandalf”; arXiv).
In May 2020, the federation included 30 groups that had committed research effort, not just data, to the project. By the time of the final paper, it had grown to 71 distinct sites contributing 25,256 MRI scans from 6,314 patients, aged 7 to 94.
The study represents the largest and most diverse dataset of glioblastoma patients ever considered in the literature, the authors claim, but Bakas emphasizes that size isn’t the only strength of the approach. “Data size does matter, but data alone does not guarantee success. It’s very important to pay attention to the labels and the quality of the data and the quality of the labels that our algorithm will learn from.”
Thus the first step for the final 71 groups in the federation was to identify their local data and begin preprocessing. “Whenever you’re doing a training procedure like this, you need to take the data you’re going to use, and you need to annotate it,” Martin said. “These participants were excited enough that all of them [devoted] local annotators—radiologists or assistants—from their institution that annotated the data they contributed... In this particular study, the annotators were assisted by an initial model. The annotation procedure was actually to run a model and then correct it, which isn’t quite the same as the heavy lift of doing it from scratch,” he adds.
Once the data were annotated, model training began. For the training process, each collaborator connected to a central aggregation server at the University of Pennsylvania to retrieve the public initial model—based on 231 cases from 16 sites.
Collaborating sites used their local data to train that model before sending their locally trained models back to the aggregation server. These models were combined by averaging their parameters, and the resulting consensus model was sent back out to the participating collaborators for another round of training. Each cycle is called a “federated round”; in this case, the researchers observed no further meaningful changes after 42 rounds and stopped training after 73 rounds in total.
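To make the mechanics of a federated round concrete, here is a minimal sketch in plain Python/NumPy of the aggregation step, averaging the locally trained parameters returned by each site. The function and variable names are hypothetical, this is not the project's actual OpenFL code, and weighting the average by each site's case count is an assumption for illustration:

```python
import numpy as np

def aggregate_round(site_updates):
    """One federated round: combine locally trained parameters.

    site_updates: list of (weights, n_cases) tuples, where `weights` is a
    dict mapping layer names to NumPy arrays returned by one site after
    local training, and `n_cases` is that site's number of training cases.
    Returns new consensus weights as a case-weighted average (an assumption;
    a plain unweighted mean would simply drop the case counts).
    """
    total_cases = sum(n for _, n in site_updates)
    layer_names = site_updates[0][0].keys()
    return {
        name: sum(w[name] * (n / total_cases) for w, n in site_updates)
        for name in layer_names
    }

# Hypothetical usage: two sites return their locally trained weights.
site_a = ({"conv1": np.full((3, 3), 0.2)}, 120)
site_b = ({"conv1": np.full((3, 3), 0.4)}, 60)
consensus = aggregate_round([site_a, site_b])
print(consensus["conv1"][0, 0])  # ~0.267, pulled toward the larger site
```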
To validate the final consensus model, 20% of the cases from each contributing institution were withheld from the training rounds and used to validate the model locally. Complete datasets from six additional sites were also excluded from training entirely and used for out-of-sample validation.
The researchers compared the performance of the public initial model, which had an average Dice similarity coefficient (DSC) of 0.66 across the three sub-compartments, to the performance of the final consensus model. Validated against the local validation sets as well as the data from the six sites that did not take part in training at all, the final consensus model improved by 27%, 33%, and 16% across the three sub-compartments.
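For readers unfamiliar with the metric, the Dice similarity coefficient measures the voxel-wise overlap between a predicted segmentation and the expert annotation, from 0 (no overlap) to 1 (perfect agreement). A minimal sketch, with hypothetical NumPy arrays standing in for the binary masks of one sub-compartment:

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-7):
    """Dice similarity coefficient between two binary masks.

    DSC = 2 * |pred AND truth| / (|pred| + |truth|). The small `eps`
    avoids division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

# Hypothetical example: a 4x4 slice of a tumor-core mask.
prediction = np.array([[0, 1, 1, 0],
                       [0, 1, 1, 0],
                       [0, 0, 0, 0],
                       [0, 0, 0, 0]])
annotation = np.array([[0, 1, 1, 0],
                       [0, 1, 0, 0],
                       [0, 0, 0, 0],
                       [0, 0, 0, 0]])
print(round(dice_coefficient(prediction, annotation), 3))  # ~0.857
```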
“The consensus model—the final federated-trained model—performed 33% better on the validation data than the model that was trained on the small public dataset,” Martin says. “That’s the improvement globally versus what you could do if you trained on the public datasets.”
The consensus model has been returned to the participating institutions, and some of them may tweak it further, Bakas predicts. “When you get a model that’s gained knowledge of all of these diverse populations and you want to apply it to your own institution or your own hospital or your own health systems, then you need it to be personalized to your attending patient population,” he says. “So yes, it has gained knowledge from all these diverse data, but it is also focused on the particular attending population.”
But Bakas, Martin, and the rest of the authors see value in this project far beyond the glioblastoma clinical implications, including insights that can pave the way for more successful complex and large-scale FL studies. “The lessons learned from this study with such a global footprint are invaluable and can be applied to a broad array of clinical scenarios with the potential for great impact to rare diseases and underrepresented populations,” the authors write.
The team learned a great deal over the past two years about the challenges, and the potential, of large-scale federated learning, Martin says. He and Bakas both report that the glioblastoma community was eager to be part of the study, including many clinicians who had never before built or trained machine learning models.
“I did not expect to see the amount of coordination that we needed to do. That was one of the two main pain points of the complete study: the amount of coordination and the amount of time needed to go over multiple… with multiple people around the globe and around the clock,” Bakas says.
That sort of coordination takes time. “You’ll see the timeline we’re operating on here,” Martin says, referring to the time between the federation launch and the published paper. “It could feel long, but underlying that was a lot of conversations with the data custodians and in some cases the InfoSec departments at some institutions.” Privacy often has a personal, emotional component, Martin adds, and it was a priority for the Intel Labs team to be empathetic to that.
Once the stakeholders were on board and felt assured of the privacy and security of the federated learning model, there were other lessons to be learned as well.
Data prep and annotation needed to be addressed. As models returned to be aggregated, Martin says, sometimes a dataset was clearly an outlier. “Say a participant is doing their data preprocessing different from all of the other institutions, it kind of stands out in the metrics in a way that allows you to flag it and say, ‘Hey, you’re doing your data prep different,’” he explains. Because the data themselves never left the home institutions, the researchers had to get on the phone with the federation member and walk through the data preparation process to identify reasons for the discrepancy.
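A hedged sketch of the kind of check Martin describes (the site names, scores, and threshold here are hypothetical, not the project's actual tooling): if one site's metric sits far outside the spread of its peers, flag it for a follow-up conversation rather than silently folding it into the aggregate.

```python
import statistics

def flag_outlier_sites(site_scores, z_threshold=2.5):
    """Flag sites whose validation score is far from the federation's norm.

    site_scores: dict mapping site name to a validation metric (e.g. mean
    Dice score on that site's local hold-out cases). Returns the sites whose
    score deviates from the mean by more than `z_threshold` standard
    deviations, so a human can follow up on preprocessing differences.
    """
    scores = list(site_scores.values())
    mean = statistics.mean(scores)
    stdev = statistics.pstdev(scores)
    if stdev == 0:
        return []
    return [site for site, s in site_scores.items()
            if abs(s - mean) / stdev > z_threshold]

# Hypothetical scores: one site's preprocessing differs and it stands out.
scores = {"site_01": 0.82, "site_02": 0.79, "site_03": 0.81,
          "site_04": 0.80, "site_05": 0.31}
print(flag_outlier_sites(scores, z_threshold=1.5))  # ['site_05']
```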
The process was valuable, Martin says, for both groups. For the data owners, this may be the first opportunity they had to compare their workflows to others. “It’s something they didn’t have before: sort of, ‘How am I doing on my data science compared to my peers?’” he says.
It also offered priceless feedback for the federation. “As we went along, as these issues cropped up during the large-scale federation, it was an opportunity to add checks to the tooling. I’m sure we haven’t found everything, but hopefully these things become easier and more scalable in the future—less human interaction and debugging, less time on the phone with a particular institution saying, ‘Can you look at this data?’”
Think about data preprocessing as a specification, Martin suggests. The challenge becomes: “How do you write a good specification for people who have to independently implement it across a distributed architecture?”
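To make the specification idea concrete, here is a minimal, hypothetical sketch of an automated check a site could run on each case before local training. The field names and expected values are illustrative assumptions, though they echo common BraTS-style conventions (four MRI modalities, co-registration to a 240x240x155 template at 1 mm isotropic resolution, and skull stripping):

```python
# Hypothetical preprocessing specification check for one mpMRI case.
REQUIRED_MODALITIES = {"t1", "t1ce", "t2", "flair"}
EXPECTED_SPACING_MM = (1.0, 1.0, 1.0)   # isotropic resampling
EXPECTED_SHAPE = (240, 240, 155)        # BraTS-style template grid

def check_case(case):
    """Return a list of human-readable violations for one case.

    `case` is a dict like {"modalities": {...}, "spacing": (...),
    "shape": (...), "skull_stripped": bool}. An empty list means the
    case conforms to this (hypothetical) specification.
    """
    problems = []
    missing = REQUIRED_MODALITIES - set(case["modalities"])
    if missing:
        problems.append(f"missing modalities: {sorted(missing)}")
    if tuple(case["spacing"]) != EXPECTED_SPACING_MM:
        problems.append(f"voxel spacing {case['spacing']} != {EXPECTED_SPACING_MM}")
    if tuple(case["shape"]) != EXPECTED_SHAPE:
        problems.append(f"volume shape {case['shape']} != {EXPECTED_SHAPE}")
    if not case.get("skull_stripped", False):
        problems.append("scan does not appear to be skull-stripped")
    return problems

# Hypothetical usage at a collaborating site, before local training:
case = {"modalities": {"t1", "t2", "flair"}, "spacing": (1.0, 1.0, 1.0),
        "shape": (240, 240, 155), "skull_stripped": True}
for issue in check_case(case):
    print("FLAG:", issue)   # FLAG: missing modalities: ['t1ce']
```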
Intel has already incorporated the tools and “sanity checks” that have emerged from this project into OpenFL. “We’re trying to build community around [OpenFL],” he says. “I think it has all the core elements: a strong software engineering team at Intel robust-ifying it for production capability. But [we’re] hoping that it will also be a trustable open-source project that others can use and contribute to.”
OpenFL is used by the Frontier Development Lab, in which NASA, the Mayo Clinic, and Intel are studying the effect of cosmic rays on human life; Montefiore, which used OpenFL to gather data from multiple hospitals to predict ARDS deaths in COVID-19 patients; and Aster DM Healthcare, which linked three hospital clusters to train a model on chest X-rays to detect pneumonia.
Members of the glioblastoma federation have also made many suggestions for the next problems to tackle. “We’ve currently opened up this collaborating network of participating sites to pursue further studies that ask different kinds of questions,” Bakas says. He lists patient prognosis, molecular markers, and tumor recurrence as possible areas of focus.
The wealth of options is prompting conversations about creating a persistent federation, Martin says, which will bring its own new challenges and opportunities both technically and on the governance side. “Rather than building a new workload for every experiment, let’s set up the infrastructure, leave it up and running, and build a governance model around how you get an experiment on it.”
Martin is also intrigued to explore how a federated learning model influences our understanding of bias in the data.
“We did some experiments [looking at] how did the consensus model do compared to each institution going it themselves, training on their own data with nothing else… We wanted to make sure they weren’t actually harmed by the federation—as in the model they could train themselves wasn’t better than the consensus model,” he says. “It’s one of the areas I’m most excited about. In the world of machine learning, often we concern ourselves mainly with accuracy metrics, which are very important. But sometimes we misinterpret a drop in accuracy as a failure rather than a signal that there’s something important going on.”
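A minimal sketch of that comparison, using hypothetical per-site numbers: for each institution, contrast the score of a model trained only on its own data with the consensus model's score on the same local validation cases, and flag any site the federation did not help.

```python
def federation_benefit(local_scores, consensus_scores):
    """Compare each site's local-only model to the consensus model.

    Both arguments map site name -> mean Dice score on that site's own
    validation cases. Returns the sites where the consensus model scored
    lower, i.e. sites that were not helped by joining the federation.
    """
    harmed = {}
    for site, local in local_scores.items():
        consensus = consensus_scores[site]
        if consensus < local:
            harmed[site] = round(local - consensus, 3)
    return harmed

# Hypothetical numbers, for illustration only.
local_only = {"site_01": 0.71, "site_02": 0.64, "site_03": 0.78}
consensus  = {"site_01": 0.80, "site_02": 0.75, "site_03": 0.77}
print(federation_benefit(local_only, consensus))  # {'site_03': 0.01}
```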
The model is generalizable, Bakas notes, but in a sense too generalizable: once it has gained knowledge from all of these diverse populations, it still needs to be personalized to each institution’s own attending patient population. Some participating institutions, for example, serve a predominantly white population, while others serve a predominantly African American population. Upon circulation of the final consensus model, federated learning enables each participating collaborator to personalize that consensus model to its own attending patient population.