By Deborah Borfitz
July 8, 2020 | Connecting information from electronic health records (EHRs) over time can tell a better story of patients’ current and future health than the “black box” of deep learning. The novel sequencing approach being proposed by researchers in the Laboratory of Computer Science at Massachusetts General Hospital (MGH) employs an algorithm for “exploiting the temporal information in EHRs that is distorted by layers of administrative and healthcare system processes,” according to assistant professor Hossein Estiri, Ph.D., lead author of a new study in Cell Patterns (DOI: 10.1016/j.patter.2020.100051) where the method was put to the test.
The so-called “transitive sequential pattern mining” approach mines temporal sequences from medication records and diagnosis codes in EHRs, which is “one way of translating events for machines in which time is embedded in that translation,” Estiri says. The chronology of relevant characteristics matters more than the “sum of data elements.”
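To make the idea concrete, here is a minimal Python sketch of mining ordered pairs from one patient's dated records. It is an illustration only, not the study's implementation: the function name `mine_transitive_pairs`, the `(date, code)` input format, the choice to keep only the first occurrence of each code, and the decision to skip same-day pairs are all assumptions made for this example. "Transitive" is read here as pairing every earlier record with every later one, not just adjacent records.

```python
from datetime import date
from itertools import combinations


def mine_transitive_pairs(records):
    """Return the set of ordered (earlier, later) code pairs for one patient.

    records: iterable of (date, code) tuples (hypothetical input format).
    """
    # Assumption of this sketch: keep only the first time each code appears,
    # so a diagnosis repeated over years is counted once.
    first_seen = {}
    for when, code in sorted(records):
        first_seen.setdefault(code, when)

    # Order the distinct codes by the date they were first recorded.
    ordered = sorted(first_seen.items(), key=lambda item: item[1])

    pairs = set()
    # Pair every earlier code with every later code, not only neighbours.
    for (code_a, date_a), (code_b, date_b) in combinations(ordered, 2):
        if date_a < date_b:  # skip same-day pairs, whose order is unknown
            pairs.add((code_a, code_b))
    return pairs


# Made-up records for a single illustrative patient.
example_records = [
    (date(2015, 3, 1), "coronary artery disease"),
    (date(2016, 7, 9), "chest pain"),
    (date(2016, 7, 9), "metoprolol"),
    (date(2018, 1, 20), "heart failure"),
    (date(2019, 1, 1), "chest pain"),  # repeat; the first occurrence wins
]

pairs = mine_transitive_pairs(example_records)
for earlier, later in sorted(pairs):
    print(f"{earlier} -> {later}")
```

Each resulting pair carries its order with it, so "coronary artery disease then chest pain" is a different feature from "chest pain then coronary artery disease."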
The paper also introduces a machine learning “pipeline” that is capable of engineering predictive features without the need for expert involvement to model diseases and health outcomes. The pipeline was quickly deployed when the pandemic hit to model COVID-19 outcomes, Estiri says, and could enable a similar rapid response to future public health emergencies.
The problem in making sense of data in EHRs is that the systems weren’t built for research but for billing and communications purposes, says Estiri, so the information doesn’t necessarily reflect the real health status of patients. Physicians who want to order a lab test for a disease may need to enter a diagnosis code, for example, but that string of numbers doesn’t indicate if the patient does or doesn’t have the suspected disease. The date associated with the lab test—or any other record for that matter—also doesn’t necessarily align with disease onset. And any one diagnosis might be repeated many times in a patient’s history over years.
Systemic data quality problems like these make it difficult to leverage EHRs to address pressing health issues, he says. Connecting information on patients' medications and diagnoses over time, however, can more accurately compute the likelihood that patients may actually have an underlying disease.
As described in Cell Patterns, the computational approach provides a specification for sequential pattern mining and a formal dimensionality reduction procedure, minimize sparsity and maximize relevance (MSMR), that integrates feature selection into the classification task. In other words, says Estiri, the analysis was limited to the most commonly occurring and clinically relevant diagnosis codes and prescribed medications.
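The two-step idea behind MSMR can be sketched roughly as follows. This is a simplified stand-in, not the published procedure: it drops features seen in too few patients (minimize sparsity) and then ranks the survivors by plain mutual information with the outcome label (maximize relevance). The function names, the `min_prevalence` and `top_k` parameters, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def select_features(patient_features, labels, min_prevalence=0.05, top_k=2):
    """Pick up to top_k features that are common enough and informative.

    patient_features: list of sets, one set of features per patient.
    labels: list of 0/1 outcomes aligned with patient_features.
    """
    n = len(patient_features)
    counts = Counter(f for feats in patient_features for f in feats)

    # Step 1 (minimize sparsity): drop features present in too few patients.
    frequent = [f for f, c in counts.items() if c / n >= min_prevalence]

    # Step 2 (maximize relevance): score each survivor against the outcome.
    scored = []
    for f in frequent:
        xs = [int(f in feats) for feats in patient_features]
        scored.append((mutual_information(xs, labels), f))

    # Highest score first; ties broken by name so the result is deterministic.
    scored.sort(key=lambda s: (-s[0], str(s[1])))
    return [f for _, f in scored[:top_k]]


# Made-up cohort: each set holds one patient's ordered-pair features.
cad_cp = ("coronary artery disease", "chest pain")
gout_imm = ("gout", "immunization")
toy_features = [
    {cad_cp, gout_imm},
    {cad_cp},
    {cad_cp, ("rare code", "other rare code")},
    {gout_imm},
    set(),
    {("chest pain", "coronary artery disease")},
]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 = outcome of interest, invented here

selected = select_features(toy_features, toy_labels, min_prevalence=0.3, top_k=1)
print(selected)
```

In this toy cohort the rare pairs are removed by the prevalence filter before any relevance scoring happens, which is the point of ordering the two steps this way: the expensive scoring runs only on features common enough to matter.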
Telling Stories
As the recent study points out, coronary artery disease followed by chest pain in the medical record is more useful for predicting the development of heart failure than either of the factors on their own or in a different order. “The computer sorts through thousands of patients and can find sequences that physicians would likely never identify on their own as relevant, but actually are associated with the disease,” Estiri says.
Clinical experts identified multiple other sequences that match a common clinical narrative among patients with heart failure—e.g., heart failure followed by the medication metoprolol or (less obviously) topical anti-infectives followed by unspecified kidney failure. In some cases, the sequences might generate hypotheses for clinical relationships not previously appreciated.
The MSMR and sequencing algorithms together improve not only phenotype prediction (by over 13%) compared to standard “atemporal” representations of discrete EHR data, but also computational classification of patient cohorts with a certain disease (by 4%), Estiri says. Because they draw on information recorded before heart failure was even observed in the medical records, predictive sequences identified by the new computational model may serve as novel disease markers.
Most of the algorithm’s sequential features are interpretable, since the long-term goal is to create a practical tool for physicians in the clinic, he continues. The research team has also designed a dashboard that graphically displays the progression of record pairs that were identified as important for classification or prediction.
“When applied to heart failure, many of the predictive sequences identified were recognizable to clinicians as common sequences of events for heart failure patients,” says Estiri. Others were not as obvious, although an association with heart failure could still be identified.
He cited the example of gout followed by an encounter for immunization. While neither diagnosis code directly references heart failure, the algorithm links the two codes in this order and raises a red flag. Risk factors for gout—chronic kidney disease, diabetes mellitus and cardiovascular disease—are also risk factors for heart disease. And even without specific details of which immunizations were given, clinicians know that immunizations are frequently recommended to patients with poorly controlled diabetes mellitus, alcohol use disorder, or cardiovascular disease—all risk factors for heart failure.
One possible use case for the predictive feature of the tool is to give physicians a probability score on the diagnosis they’re considering based on clinically relevant sequential patterns in a patient’s record, says Estiri. This could be particularly valuable as a decision support tool in under-resourced settings where patients see healthcare providers less often, to identify those at risk of developing a particular disease and recommend that they come in for evaluation.
The MGH researchers' next planned steps are to enhance the algorithm to look at sequences of more than two EHR events and, eventually, to extend the computational exercise vertically across a network of datasets for even better prediction and classification, Estiri says. The team will also explore other potential use cases, such as uncovering the cascade of symptoms in currently hard-to-predict cancers.