This may already be happening with a wide array of products used to help treat serious illnesses such as heart disease and cancer. A recent STAT investigation found that only 73 of 161 AI products approved by the federal Food and Drug Administration publicly disclosed the amount of data used to validate the product, and just seven reported the racial makeup of their study populations. Even the sources of the data were almost never given.
Those findings were echoed in a paper by Stanford researchers who highlighted the lack of prospective studies, or studies that examine future outcomes, conducted on even higher-risk AI products cleared by the FDA. They also noted that most AI devices were evaluated at a small number of sites and that only a tiny fraction reported how the AI performed in different demographic groups.
“We would like the AI to work responsibly and reliably for different patients in different hospitals,” said James Zou, a professor of biomedical data science at Stanford and co-author of the paper. “So it’s especially important to be able to evaluate and test the algorithm across these diverse kinds of data.”
The review conducted by the University of Cambridge found that many studies not only lacked external validation, but also neglected to specify the data sources used or details on how their AI models were trained. All but 62 of the more than 400 papers failed to pass an initial quality screening based on those omissions and other lapses.
Even those that survived the initial screening suffered from multiple shortcomings: 55 of those 62 papers were found to be at high risk of bias due to a variety of problems, including reliance on public datasets in which many images suspected to represent Covid-19 were not confirmed to be positive cases. A few AI models trained to diagnose adult Covid-19 cases on chest X-rays were tested on images of pediatric patients with pneumonia.
“The [pediatric images] were often of children below the age of 5, who have massive anatomical differences compared to adults, so it is absolutely no surprise that these models had really good results in picking out Covid versus non-Covid,” said Selby. “The patients looked completely different on the chest X-ray regardless of Covid status.”
The researchers found significant flaws with papers published on preprint servers as well as those published in journals that impose more scrutiny through peer review. The peer-review process can fail for a variety of reasons, including reviewers who lack deep knowledge of machine learning methodology, or a bias toward prominent institutions or companies that leads to superficial reviews of their papers. A larger problem is a lack of consensus standards for evaluating machine learning research in medicine, although that is beginning to change. The University of Cambridge researchers used a methodology checklist known as CLAIM, which establishes a common set of criteria for authors and reviewers.
“We tried in our paper to point out the necessity of the checklists,” Selby said. “It makes people question, ‘Have we addressed this issue? Have we thought about that?’ They may realize themselves that they could build a better model with a bit more thought and time.”
Among the papers that Selby and his colleagues found to present a high risk of bias was one published in Nature from researchers at Icahn School of Medicine at Mount Sinai in New York.
The paper found that an AI model for diagnosing Covid-19 on chest CT scans performed well on a common accuracy measure, with an area under the curve of 0.92, and equaled the performance of a senior thoracic radiologist. A press release that accompanied the paper’s publication said the tool “could help hospitals across the world quickly detect the virus, isolate patients, and prevent it from spreading during this pandemic.”
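For readers unfamiliar with the metric, area under the curve (AUC) can be read as the probability that a model gives a randomly chosen positive case a higher score than a randomly chosen negative one. The Python sketch below illustrates that definition only; the labels, scores, and function name are made up for illustration and are not drawn from the Mount Sinai paper.

```python
def auc(labels, scores):
    """Chance that a random positive case outscores a random negative one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = disease present, 0 = disease absent.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
example_auc = auc(labels, scores)  # 8 of 9 positive/negative pairs ranked correctly
```

An AUC of 1.0 means every positive case was ranked above every negative one, and 0.5 is no better than chance. The number by itself says nothing about whether the test cases resemble the patients a tool will later encounter.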
But the University of Cambridge researchers flagged the paper for a high risk of bias due to its small sample size of 424 Covid-positive patients spread across datasets used to train, tune, and test the AI. The data were obtained from 18 medical centers in China, but it was unclear which centers provided the data on the positive and negative cases, which raises the possibility that the AI could simply be detecting differences in scanning methods and equipment, rather than in the physiology of the patients. The Cambridge researchers also noted that performance was not tested on an independent dataset to verify its ability to reliably recognize the illness in different groups of patients.
The paper did acknowledge the study’s small sample size and the need for additional data to test the AI in different patient populations, but the research team did not respond to a request for additional comment.
Time constraints may explain, if not excuse, some of the problems found with AI models developed for Covid-19. But similar methodological flaws are common in a wide swath of machine learning research. Pointing out these lapses has become its own subgenre of medical research, with many papers and editorials calling for better evaluation models and urging researchers to be more transparent about their methods.
The inability to replicate findings is especially problematic, eroding trust in AI and undermining efforts to deploy it in clinical care.
A recent review of 511 machine learning studies across multiple fields found that the ones produced in health care were particularly hard to replicate because the underlying code and datasets were seldom disclosed. The review, conducted by MIT researchers, found that only about 23% of machine learning studies in health care used multiple datasets to establish their results, compared to 80% in the adjacent field of computer vision, and 58% in natural language processing.
It is an understandable gap, given the privacy restrictions in health care and the difficulty of accessing data that spans multiple institutions. But those constraints make it harder for AI developers in health care to obtain enough data to develop meaningful models in the first place, and harder still to publicly disclose their sources so findings can be replicated.
Google recently announced an app that uses AI to analyze skin conditions but declined to publicly disclose the sources of data used to create the model. A spokesperson explained that some of the datasets are licensed from third parties or donated by users and that the company could not publish the data under the terms of its agreements.
McDermott, the MIT researcher, said these structural barriers must be overcome to ensure that the effects of these tools can be fully evaluated and understood. He noted a number of ways to share data without undermining privacy or intellectual property, such as use of a federated learning method in which institutions can jointly develop models without exchanging their data. Others are also using synthetic data — or data modeled on real patients — to help preserve privacy.
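One widely used form of federated learning is federated averaging, in which each institution trains on its own records and shares only the resulting model parameters. The toy Python sketch below shows the idea under heavy simplifying assumptions: a one-parameter linear model, two hypothetical hospitals, and invented data. The function names are illustrative, and real deployments typically add protections such as secure aggregation that are not shown here.

```python
def local_update(weight, data, lr=0.01, epochs=20):
    """Run gradient descent on one site's private (x, y) pairs.

    Only the updated weight leaves the site; the raw data never does.
    """
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_average(site_data, rounds=50):
    """Coordinate training across sites by averaging their local weights."""
    weight = 0.0
    total = sum(len(d) for d in site_data)
    for _ in range(rounds):
        # Each site starts from the current shared weight and trains locally.
        local = [(local_update(weight, d), len(d)) for d in site_data]
        # The coordinator averages the weights, weighted by each site's sample count.
        weight = sum(w * n for w, n in local) / total
    return weight


# Hypothetical private datasets; both happen to follow y = 2 * x.
hospital_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
hospital_b = [(1.5, 3.0), (2.5, 5.0)]
shared_weight = federated_average([hospital_a, hospital_b])
```

In this sketch the coordinator only ever sees one number per hospital per round, yet the shared weight converges toward the value that fits both datasets.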
McDermott said careful scrutiny of machine learning tools, and the data used to train them, is particularly important because they are making correlations that are hard, if not impossible, for humans to independently verify.
It is also important to consider the time-locked nature of AI models when they are evaluated. A model trained on one set of data that is then deployed in an ever-changing world is not guaranteed to work in the same way. The effects of diseases on patients can change, and so can the methods of treating them.
“We should inherently be more skeptical of any claims of long-term generalizability and stability of the results over time,” McDermott said. “A static regulatory paradigm where we say, ‘OK, this algorithm gets a stamp of approval and now you can go do what you want with it forever and ever’ — that feels dangerous to me.”