August 20, 2025 | WASHINGTON, D.C.— Efforts to couple patient-level data with machine learning (ML) are going to accelerate “100-plus-fold” over the next decade, dramatically improving evidence-based decision-making and demonstrably improving patient care. In the diagnostics field, supervised learning that leverages classification currently represents the bulk of this work, and automated frameworks can even suggest the algorithms best suited to the task at hand, according to Hooman Rashidi, M.D., professor and associate dean of AI in medicine at the University of Pittsburgh Medical Center.
Editor’s Note: Updates were made to this story on 8/21 to give additional details and clarity.
Rashidi was speaking about MILO (Machine Intelligence Learning Optimizer), an on-premises auto-ML framework, during his opening keynote yesterday at the Next Generation Dx Summit on guidelines for using artificial intelligence (AI) and ML in point-of-care (POC) testing. He is co-developer of the MILO platform, originally developed at the University of California, which automates the building and deployment of supervised predictive ML models and has been successfully validated and licensed to several industry partners.
“By default, we tell people not to make any assumptions about what algorithms [e.g., neural network, logistic regression, or K-nearest neighbors] are the best,” he says. “[MILO] is almost like a gladiator stadium where you let them fight for your data or your study and then you figure out which pipeline is the best.”
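MILO’s internals are not public, but the “gladiator” idea itself is easy to illustrate. The minimal sketch below, written against scikit-learn with its bundled breast cancer dataset as a stand-in, pits three of the algorithm families Rashidi names against the same data and lets cross-validation declare a winner:

```python
# Minimal sketch of the "gladiator stadium" idea: several candidate
# algorithms compete on the same data, and cross-validation picks the
# winner. Illustrative only; this is not MILO's actual implementation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "neural_network": MLPClassifier(max_iter=2000),
}

# Each contender "fights" via 5-fold cross-validated ROC AUC.
for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)
    score = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean ROC AUC = {score:.3f}")
```

An auto-ML run would go further, tuning each contender’s hyperparameters rather than accepting the defaults used here.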
Users need only follow a simple, four-step process of uploading their data, picking their target, evaluating the basic statistics, and building their models to see how those algorithms are performing, says Rashidi. “If you can attach a file into an email, you can work with this.”
In just over 30 minutes, Rashidi walked the audience through the paces, showing how MILO works “under the hood” in real studies and practice. Besides the MILO Auto-ML platform, which automates supervised machine learning and data cleaning, he also showcased generative AI as a complementary solution for educating and training users of AI- and ML-powered diagnostic devices, whether or not those devices are used at the point of care.
One of the biggest challenges when talking about AI with colleagues, administrators, and regulators has been the misperception that it is “one big thing,” says Rashidi, although people are now starting to understand that AI represents distinct entities. His talk focused on “narrow artificial intelligence,” where, as he puts it, “we teach the machine ... the machine doesn’t think for itself.” This stands in contrast to artificial general intelligence and hypothetical artificial superintelligence that surpasses human intelligence in all respects.
Rashidi homed in on non-generative AI, specifically classification tasks that revolve around “classes” such as positive versus negative sepsis or the ability to predict different grades of cancer.
Major differences exist between AI models for POC and non-POC devices, Rashidi shares. Notably, the setting of use for POC tests is more variable with more non-specialist operators, possibly patients, reading the results. The potential for error with POC devices also requires that the algorithms be “robust and interpretable for a much larger audience ... [than] within a very controlled laboratory setting.”
The regulatory pathway to market ranges from a low-risk CLIA (Clinical Laboratory Improvement Amendments) waiver to a medium-risk 510(k) or a higher-risk De Novo or Premarket Approval from the U.S. Food and Drug Administration (FDA). But the endgame in all cases needs to be an AI tool whose benefits outweigh the risks it poses, he continues.
Many people are trying to apply best practices and implementation guidelines around data collection, usability and human factors, cybersecurity and privacy, and post-market surveillance that monitors performance for affected end users, says Rashidi. Among the most recent of these is the FDA-recommended “predetermined change control plan” (PCCP), which establishes in advance how an AI model will be retrained in response to data drift and other changing elements.
“Certain automated AI tools are now available to ensure our embedded algorithms in diagnostic devices adhere to some of these guidelines better and quicker than ever before,” he says. His presentation described two such tools available on MILO (free to use for educational and research purposes), one for data pre-processing (i.e., data cleaning) and the other for training and deploying the classification model.
For purposes of cleaning and preparing data for analysis, the goals are to ensure the integrity, completeness, value, and reproducibility of the data. The best way to do that currently is manually, with a team of people, which can easily take weeks, Rashidi says.
With MILO’s data preprocessing tool, what he terms a “virtual machine learning, statistics, and software engineering team,” the job can be done in a matter of minutes. Rashidi’s example was a synthetic breast cancer dataset with numerous missing values, some columns of text needing conversion to numeric values, and multicollinearity issues.
Rashidi explains how the clean-up process within MILO’s Automated Preprocessing Tool—a “human-in-the-loop simple seven-step process”—automatically imputes missing values with synthetic values, uses one-hot encoding to convert text into the appropriate numerical values, and minimizes multicollinearity issues, “all while following ML best practices” and in a matter of minutes rather than days or weeks. “The machine is helping you, but you are still driving it,” says Rashidi, “and you don’t have to accept all its recommendations.”
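For a concrete picture of those steps, here is a hedged sketch in pandas and scikit-learn; the file name and the 0.95 correlation cutoff are invented for illustration, and this is not MILO’s actual preprocessing code:

```python
# Sketch of the cleanup described: impute missing values, one-hot
# encode text columns, and drop one feature from any highly correlated
# (multicollinear) pair. File name and threshold are hypothetical.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("breast_cancer_synthetic.csv")  # hypothetical file

# 1. Impute missing values: median for numeric, mode for text columns.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# 2. One-hot encode text columns into numeric indicator columns.
df = pd.get_dummies(df, columns=list(cat_cols), dtype=int)

# 3. Reduce multicollinearity: drop the second feature of any pair
#    whose absolute Pearson correlation exceeds 0.95.
corr = df.corr().abs()
to_drop = set()
for i in range(len(corr.columns)):
    for j in range(i + 1, len(corr.columns)):
        if corr.iloc[i, j] > 0.95:
            to_drop.add(corr.columns[j])
df = df.drop(columns=sorted(to_drop))
```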
Besides needing to clean the data, people often start with a single dataset that must be split in two: one set for training and initial validation, and a second, held-out set for follow-up generalization testing, he adds. Once that single file has been turned into two separate files, Rashidi notes, the data are ready for ML modeling, “which the MILO auto-ML is well suited to handle with ease.”
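Continuing the sketch, turning the one cleaned file into those two files might look as follows; the 80/20 stratified split and the “target” column name are assumptions for illustration, not MILO’s rules:

```python
# Split the cleaned dataset into a training/initial-validation file and
# a held-out generalization test file. Proportions are illustrative.
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)
train_df.to_csv("train_validation.csv", index=False)
test_df.to_csv("heldout_test.csv", index=False)
```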
Using MILO’s auto-ML framework, building a machine learning model on the cleaned-up data is “super-fast,” says Rashidi. Here, his example was building a sepsis predictor. Two burn sepsis datasets with associated lab and clinical findings were used: a 500-patient set for the training and initial validation step, followed by a 204-patient sample for the secondary generalization testing.
The entire process, as he describes it, follows a simple, four-step approach. First, the initial training/validation and secondary testing datasets are uploaded so that the training data can be mapped to the target of interest (e.g., sepsis positive and negative cases). Next, a “data science 101”-level univariate statistical analysis is conducted and visualized.
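A “data science 101” pass over the training set, continuing the sketch above, could be as simple as per-class summaries plus a univariate significance screen; the Mann-Whitney test here is one illustrative choice, not necessarily MILO’s:

```python
# Univariate look at the training data: per-class summary statistics
# and a nonparametric test of each feature against the target.
from scipy.stats import mannwhitneyu

print(train_df.groupby("target").describe())

for col in train_df.columns.drop("target"):
    pos = train_df.loc[train_df["target"] == 1, col]
    neg = train_df.loc[train_df["target"] == 0, col]
    _, p = mannwhitneyu(pos, neg)
    print(f"{col}: Mann-Whitney p = {p:.3g}")
```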
The third step is to select the algorithms and feature selectors, which leads to a “wheel of pipelines” upon the start of training, with hundreds of individual spokes within the wheel representing thousands of eventually optimized ML models, Rashidi continues. The final step is to evaluate the ML models in terms of “how they end up serving your needs as the investigator or practitioner.”
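A rough stand-in for that wheel, continuing the sketch, is to cross a few feature selectors with a few algorithms and grid-search every resulting pipeline; MILO’s actual search space is far larger, and the selectors, models, and parameter grids below are illustrative scikit-learn choices:

```python
# Cross (feature selector x algorithm) into competing pipelines, tune
# each by grid search, then judge the winner on the held-out test set.
from itertools import product
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = train_df.drop(columns="target"), train_df["target"]
X_test, y_test = test_df.drop(columns="target"), test_df["target"]

selectors = {"kbest": SelectKBest(f_classif, k=10), "pca": PCA(n_components=10)}
models = {
    "logreg": (LogisticRegression(max_iter=5000), {"model__C": [0.1, 1, 10]}),
    "rf": (RandomForestClassifier(), {"model__n_estimators": [100, 300]}),
}

results = {}
for (s_name, selector), (m_name, (model, grid)) in product(
    selectors.items(), models.items()
):
    pipe = Pipeline(
        [("scale", StandardScaler()), ("select", selector), ("model", model)]
    )
    search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc").fit(X_train, y_train)
    results[f"{s_name}+{m_name}"] = search
    print(f"{s_name}+{m_name}: cross-validated ROC AUC = {search.best_score_:.3f}")

# Step four: evaluate the best pipeline on data it has never seen.
best = max(results.values(), key=lambda s: s.best_score_)
print(f"held-out ROC AUC = {best.score(X_test, y_test):.3f}")
```

The last two lines mirror the evaluation step: the surviving model is judged on the held-out generalization set, not on the data it was trained against.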
Testing of the model could begin right away within an information management system or point-of-care device or platform, says Rashidi. In a clinical trial context, batch testing could also be done in parallel with the interpretation of patient test results. “These can be auto filled for you if it’s uploaded as a model within your framework and then you basically now make the prediction.”
MILO has been successfully used in numerous studies, including several acute kidney injury (AKI) studies in burn populations, Rashidi shares. AKI has traditionally been diagnosed based on criteria in the Kidney Disease: Improving Global Outcomes, or KDIGO, guidelines, which include changing creatinine levels and urine outputs that can take days to identify. Moreover, sensitivities were found to be “really poor, especially in our burns population.” This has led to greater reliance on newer biomarkers such as neutrophil gelatinase-associated lipocalin (NGAL), used in Europe and, most recently, the U.S.
The big question then became whether NGAL’s diagnostic accuracy could be improved with machine learning. Once NGAL was combined with the B-type natriuretic peptide (BNP) biomarker, creatinine, and urine output, “sensitivities and accuracies improved drastically, into the mid-90s,” he says. In follow-up studies, Rashidi and his colleagues were able to show similar improvement in trauma patients.
Not only did ML drastically improve diagnostic precision for AKI, but since NGAL is part of the process, serial measures of creatinine and urine outputs are no longer required. The combination gets to answers in a fraction of the time, in line with POC approaches as shown in a subsequent, multicenter follow-up study.
Rashidi is also particularly enthusiastic about the education and training capabilities that AI can assist with, especially for end users of various diagnostic devices. It could find just as much utility in helping people prepare for inspections, flagging where the laboratory is out of compliance and the types of questions regulators may ask. However, the type of AI that would be of greatest benefit in these educational and training realms is “not the MILOs of the world but rather our new generative AI large language model [LLM] frameworks.”
While revolutionary, text-based large language models (LLMs), most notably general chatbots like ChatGPT and Gemini, have “limitations,” Rashidi says. “If you had questions about your own private policies and procedures, for example, that were never accessible to the internet, ... how would they be able to give you good, accurate results if your stuff was never available ... for them to learn from?”
Cloud-based machine learning tools are being promoted as a cost-effective way to adopt and integrate AI capabilities into various applications, but “once people get hooked ... API costs can shoot through the roof,” he warns. Rashidi is therefore a fan of the hybrid approach where vendor partners are sought for activities with a reasonable return on investment (ROI) and, when reasonable ROI is not there, open-source or “home-brewed” versions may become better suited.
The big worries beyond direct costs are cybersecurity and data usage on the internet, says Rashidi, limitations he believes can be overcome by keeping some LLMs local and customized. Two practical approaches to LLM customization are “fine-tuning” the model and a technique called “retrieval-augmented generation” (RAG), which can also be combined.
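Retrieval-augmented generation is straightforward to sketch. The toy example below uses TF-IDF retrieval so it runs with no model downloads; a production on-premises setup would substitute a vector embedding model and a local LLM, and ask_local_llm is a hypothetical placeholder rather than a real API:

```python
# Minimal RAG loop: retrieve the most relevant private documents, then
# pass them to a (hypothetical) local LLM as grounding context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Policy A: POC glucose meters are QC-checked every 24 hours.",
    "Policy B: Operator competency is reassessed every six months.",
    "Policy C: Out-of-range QC results trigger device lockout.",
]

vectorizer = TfidfVectorizer(stop_words="english").fit(documents)
doc_vectors = vectorizer.transform(documents)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    sims = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    return [documents[i] for i in sims.argsort()[::-1][:k]]

question = "How often is QC checked on the glucose meters?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = ask_local_llm(prompt)  # hypothetical call to an on-prem model
print(prompt)
```

Fine-tuning, by contrast, bakes the private knowledge into the model’s weights; as Rashidi notes, the two techniques can also be used together.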
One such solution is something he has developed called “Pitt-GPT-Plus,” a custom local LLM framework that is similar in many ways to ChatGPT but tailored to one’s needs and fully on-premises, so nothing goes to the cloud.
The framework was used to ingest 24 papers related to hematology (i.e., for diagnosing leukemia and lymphoma). As part of the exercise, Rashidi established his negative and positive controls by asking the model about something it initially knew nothing about but was subsequently trained to know: the best soccer players in the world. After passing the negative and positive control tests, the model was asked a clinical question, specifically whether someone had chronic or accelerated myeloid leukemia based on criteria found in those 24 uploaded papers, and from which of the documents the information was pulled. Reliable, verifiable source attribution not only improves the accuracy of the results but also increases end users’ trust in the model, he says.
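The control-and-citation idea can be approximated by extending the retrieval sketch above: return document identifiers alongside the retrieved text so answers can cite their sources, and treat an off-topic question as the negative control. The 0.1 similarity threshold is invented for illustration:

```python
# Source-attributed retrieval with a built-in negative control, reusing
# the vectorizer, doc_vectors, and documents from the RAG sketch above.
def retrieve_with_sources(question: str, k: int = 2, min_sim: float = 0.1):
    sims = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    return [
        (f"doc_{i}", documents[i], sims[i])
        for i in sims.argsort()[::-1][:k]
        if sims[i] >= min_sim
    ]  # an empty list means the model should answer "I don't know"

# Negative control: an off-topic question should retrieve nothing.
assert retrieve_with_sources("Who are the best soccer players?") == []

# On-topic query: each hit carries the document it was pulled from.
for doc_id, text, sim in retrieve_with_sources("How often is QC checked?"):
    print(f"[{doc_id}] (similarity {sim:.2f}) {text}")
```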
All told, the benefits of using MILO Auto-ML and the custom chatbots include enhanced efficiency and reduced errors or “confabulations” (i.e., hallucinations), respectively, says Rashidi. That also makes it more cost-effective and scalable institutionally while offering better data security.
Regardless of which AI platform is used, he adds, keep in mind that many end users are not going to be familiar with major performance standards for AI, especially generative AI. These include terms such as perplexity score, BLEU (bilingual evaluation understudy), and ROUGE (recall-oriented understudy for gisting evaluation). “They may not be as straightforward as a classification model ... based on a confusion-matrix-based performance measure which is more familiar to a larger audience.”
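To make the contrast concrete, here is a toy comparison, assuming scikit-learn and NLTK are available: sensitivity and specificity fall directly out of a confusion matrix, while a generative-text metric such as BLEU scores n-gram overlap against a reference answer:

```python
# Confusion-matrix metrics for a classifier vs. BLEU for generated text.
# All data below are toy values for illustration.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from sklearn.metrics import confusion_matrix

# Classification (e.g., sepsis positive vs. negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))  # recall on the positive class
print("specificity:", tn / (tn + fp))

# Generative text: n-gram overlap between candidate and reference.
reference = "qc is checked every 24 hours".split()
candidate = "qc is checked daily".split()
print("BLEU:", sentence_bleu([reference], candidate,
                             smoothing_function=SmoothingFunction().method1))
```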
Ultimately, understanding AI requires a fundamental knowledge of its application in healthcare, says Rashidi. To that end, he and his colleagues recently published a seven-part AI review series featuring contributions from more than 40 global experts working at the intersection of AI and medicine (Modern Pathology, DOI: 10.1016/j.modpat.2024.100673).