August 9, 2022 | Researchers in Germany have developed a web-based tool, powered by machine learning, which extracts disease subtypes from large pools of patient data. The goal is more precise and robust predictions about molecular signatures that can serve as a starting point for investigating disease heterogeneity, says Josch Konstantin Pauling, Ph.D., chair of experimental bioinformatics at the Technical University of Munich (TUM).
Pauling is part of a research group known as LipiTUM—Computational Systems Medicine on Lipids and Metabolism. The new computational method is called MoSBi (molecular signature identification using biclustering) and it combines the results of existing algorithms—using transcriptomics, proteomics, and metabolomics data—with synthetic datasets to pinpoint the underlying molecular mechanisms and allow the development of better targeted treatments, he explains.
MoSBi can be applied to large patient cohorts in which molecular data have been measured for each patient, stratifying the patients and thereby identifying subtypes of many different diseases, Pauling says. Practitioners with no prior knowledge of bioinformatics can analyze the molecular clinical data online.
“By integrating the predictions of multiple algorithms, MoSBi can overcome the specifics of individual algorithms and reduce the need to adjust parameters of each algorithm,” says LipiTUM doctoral candidate Tim Rose. As demonstrated in a study recently published in Proceedings of the National Academy of Sciences (DOI: 10.1073/pnas.2118210119), the MoSBi ensemble method achieved the best results overall, whereas other algorithms scored very high in specific scenarios but failed in others.
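To make the ensemble idea concrete, here is a minimal Python sketch of one way predictions from several biclustering algorithms could be combined. It is an illustration under stated assumptions, not MoSBi's actual implementation: the Jaccard similarity on covered cells, the 0.5 threshold, the majority-vote consensus, and all function names and patient and lipid labels are hypothetical choices made for this example.

```python
from itertools import combinations, product


def cells(bicluster):
    """Return the set of (sample, molecule) cells covered by a bicluster."""
    samples, molecules = bicluster
    return set(product(samples, molecules))


def jaccard(a, b):
    """Jaccard similarity of two biclusters, measured on the cells they cover."""
    ca, cb = cells(a), cells(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


def similarity_network(biclusters, threshold=0.5):
    """Link biclusters whose cell overlap reaches the (assumed) threshold."""
    edges = {i: set() for i in range(len(biclusters))}
    for i, j in combinations(range(len(biclusters)), 2):
        if jaccard(biclusters[i], biclusters[j]) >= threshold:
            edges[i].add(j)
            edges[j].add(i)
    return edges


def connected_groups(edges):
    """Group linked biclusters with a depth-first search (connected components)."""
    seen, groups = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.append(node)
            stack.extend(edges[node] - seen)
        groups.append(sorted(group))
    return groups


def consensus(biclusters, group):
    """Keep samples and molecules found in more than half of a group's biclusters."""
    def majority(index):
        counts = {}
        for i in group:
            for item in biclusters[i][index]:
                counts[item] = counts.get(item, 0) + 1
        return {item for item, n in counts.items() if n > len(group) / 2}
    return majority(0), majority(1)


# Hypothetical output of three different algorithms on the same cohort.
predicted = [
    ({"p1", "p2", "p3"}, {"lipidA", "lipidB"}),        # algorithm 1
    ({"p1", "p2", "p3", "p4"}, {"lipidA", "lipidB"}),  # algorithm 2
    ({"p7", "p8"}, {"lipidC"}),                        # algorithm 3
]

network = similarity_network(predicted, threshold=0.5)
groups = connected_groups(network)
ensemble = [consensus(predicted, g) for g in groups]
```

In this toy case the first two predictions overlap strongly and are merged into one consensus subgroup, while the third stands alone. The design point is that agreement across algorithms, rather than the tuning of any single one, decides which subgroups are kept.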
“Furthermore, we developed a network visualization of the results that can facilitate an intuitive interpretation,” Rose continues. To date, biclustering algorithms have lacked visualizations that can be easily and broadly understood.
In principle, MoSBi analysis of molecular datasets works for a small number of patients, but confidence in the results improves when more samples are included, says Pauling. It is an unsupervised learning technique that finds novel patterns in datasets but cannot output performance metrics for an analysis, meaning “each result needs to be investigated carefully and confirmed with additional experiments and expert knowledge.”
The software is open source and can be run on any computer, so sensitive patient information does not need to leave the institution conducting the research project, he notes.
The first biclustering algorithm was developed in 2000 and many others subsequently emerged with different aims, says Rose. The method is not as popular as clustering, which finds subgroups of samples in a dataset by searching for similarities over all measured molecules.
“In contrast, biclustering additionally identifies molecules that characterize the similarity of sample groups,” Rose says. The clustering of samples and molecular features simultaneously requires more interpretation of the results. MoSBi solves this problem, in addition to providing a scalable solution for visualizing the results, making “biclustering applicable for everyone.”
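The difference between the two views can be shown with a small Python sketch. The patients, molecules, and values below are invented for illustration, and the code only compares distances; it is not a biclustering algorithm and does not reflect how MoSBi works internally.

```python
import math

# Hypothetical measurements: rows are patients, columns are molecules m1..m6.
molecules = ["m1", "m2", "m3", "m4", "m5", "m6"]
patients = {
    "p1": [9, 9, 1, 8, 1, 8],
    "p2": [9, 9, 8, 1, 8, 1],
    "p3": [1, 1, 1, 8, 1, 8],
}


def distance(a, b, subset=None):
    """Euclidean distance between two patients, optionally on a molecule subset."""
    columns = [molecules.index(m) for m in (subset or molecules)]
    return math.sqrt(sum((patients[a][c] - patients[b][c]) ** 2 for c in columns))


# Clustering view: compare patients over all measured molecules.
global_p1_p2 = distance("p1", "p2")
global_p1_p3 = distance("p1", "p3")

# Biclustering view: restrict the comparison to molecules m1 and m2,
# where p1 and p2 share the same pattern.
local_p1_p2 = distance("p1", "p2", subset=["m1", "m2"])
```

Over all six molecules, p1 looks closer to p3 than to p2, so a clustering method would tend to separate p1 and p2. Restricted to m1 and m2, p1 and p2 are identical, which is the kind of sample-plus-molecule pattern a bicluster is meant to capture.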
Researchers can now use MoSBi to analyze data in previously published studies and establish novel connections that can benefit future studies, says Pauling. They might also combine cohorts from multiple studies to search for subgroups that are conserved across studies.
For the latest PNAS study, the LipiTUM research team simulated various characteristics of molecular data to investigate the performance of MoSBi and other biclustering algorithms. This simulation workflow can also be adapted to evaluate future algorithms, Pauling says.
In a prior study, the LipiTUM team demonstrated the steps for analyzing clinical data with MoSBi and the benefits of combining predictions of patient subgroups and molecular signatures (Journal of Lipid Research, DOI: 10.1016/j.jlr.2021.100104). Specifically, the researchers matched their predictions to nonalcoholic fatty liver disease (NAFLD) subgroups and identified lipid biomarkers that supported robust classification of the patients, says Rose.
“Such markers have the potential to differentiate patients at different stages of the disease and therefore adapt treatments,” he adds. “This shows how computational methods can benefit clinical research to get a better disease understanding and aid doctors for treatment decisions.”
At present, the team is involved in several collaborations within TUM and with external research institutes in which MoSBi is being applied to gain a new perspective on data, says Pauling. “The LipiTUM group is also developing other complementary methods to gain mechanistic understanding of diseases, for example [to] extract disease mechanisms from lipid data [Metabolites, DOI: 10.3390/metabo11080488].”
Their objective is to combine computational methods to streamline the interpretation of clinical datasets, Pauling says. “We are currently working on the large-scale application of MoSBi and other computational methods using publicly accessible disease data to obtain new insights similar to those we showed previously for NAFLD. Additionally, since MoSBi can be applied to any kind of molecular data, we will also apply it to nonclinical data such as in crop-based life sciences, a current focus at the TUM School of Life Sciences.”