Khushi Dahiya
Building trust into AI

Hospitals are sitting on the most valuable dataset in the world

No. 04 · April 2026 · 10 min read

When Epic, one of the biggest EHR companies in the US, deployed its sepsis prediction model to hundreds of hospitals, it proved to be a bust. At Michigan Medicine it missed 67% of sepsis patients. Across BJC Healthcare's network of 9 hospitals it performed worst for the patients with the most severe comorbidities, the people who needed it the most. On top of that, it flagged 18% of all hospitalized patients as at risk. So many false positives that nurses just started ignoring the alerts entirely.

This isn't an isolated case. Google built a retinopathy screening tool that worked in their labs and fell apart in Thai clinics because the images were dimly lit and looked nothing like the well-controlled ones the model was trained on. IBM's Watson for Oncology learned oncology from a single institution, Memorial Sloan Kettering, and ended up recommending unsafe treatments when deployed to other hospitals. These are some of the biggest, best-funded efforts in healthcare ML and they all hit the same wall.

So what's going on?

If you look closely at these failures, a pattern emerges around the training data. Epic didn't reveal much about its dataset, but from what it did disclose, the model was trained on 405,000 records from 3 health systems. Researchers found that the model couldn't generalize to severe sepsis cases, which tend to present more heterogeneously, suggesting the training data wasn't sufficient to cover those patterns. Google's data came from controlled lab conditions that didn't reflect real clinic environments. IBM trained its system on synthetic cases and protocols from a single cancer center. In every case the model worked in development and collapsed when it met the real world.

The natural question is: why can't we just train on real, diverse data from multiple hospitals?

There's strong evidence that this would actually help. A 2024 study tested deep learning models across four ICU datasets from the US and Europe and found that training on multiple sources produced "considerably more robust" models compared to single-dataset training. A brain tumor study across 71 institutions on 6 continents found a 33% improvement in tumor boundary detection when training was distributed across sites. The more hospitals contribute to the training data, the better the models perform.

But doing this in practice runs into serious barriers.

Patient records are heavily protected by regulations like HIPAA and by each institution's own policies. The data use agreements hospitals need to sign before sharing anything can take months to negotiate and sometimes fall through entirely. Getting two institutions to agree on data sharing terms is hard. Getting dozens to do it is a different challenge altogether.

That said, data sharing governance isn't a dead end. Research networks like PCORnet have actually cracked the code on cross-institutional collaboration at scale. It took them a decade and $460 million in funding from PCORI, but they've built a network of 79 health system sites covering 47 million patients, with shared data standards and master data sharing agreements in place. They've run over 300 studies. When there's real institutional will behind it, large-scale healthcare data collaboration is possible.

But even PCORnet can only go so far. Their infrastructure is set up so that data never leaves the hospital. Researchers send queries, each site runs them locally, and aggregate results come back. For comparing whether drug A or drug B has better outcomes, that works. You're counting. For training a model on combined patient records from 79 sites, it doesn't. The system was built for a different era of clinical research, one that didn't involve ML.
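
To make that distinction concrete, here's a minimal sketch of the kind of distributed query this infrastructure handles well. The site names, records, and query predicate are all invented for illustration; the point is that each site returns only aggregate counts, and the coordinator just adds them up. Training a model has no equivalent "add up the answers" step.

```python
# Sketch of a distributed count-and-aggregate query, the workload
# PCORnet-style networks are designed for. All data here is invented.

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    records: list[dict]  # patient records, held locally and never shared

    def run_query(self, predicate) -> dict:
        """Run the query locally; return only aggregate numbers."""
        matched = [r for r in self.records if predicate(r)]
        return {"site": self.name,
                "n_patients": len(self.records),
                "n_matched": len(matched)}

def aggregate(results: list[dict]) -> dict:
    """The coordinator only ever sees per-site counts, never records."""
    return {"total_patients": sum(r["n_patients"] for r in results),
            "total_matched": sum(r["n_matched"] for r in results)}

# Hypothetical question: how many patients on drug A were readmitted?
sites = [
    Site("hospital_a", [{"drug": "A", "readmitted": True},
                        {"drug": "B", "readmitted": False}]),
    Site("hospital_b", [{"drug": "A", "readmitted": False}]),
]
query = lambda r: r["drug"] == "A" and r["readmitted"]
print(aggregate([s.run_query(query) for s in sites]))
```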

Federated learning is the most prominent attempt at cross-institutional ML training. Instead of moving patient data to a central location, each hospital trains a copy of the model on its own data locally. They send the model's updated parameters to a central server, which combines them into a single improved model and sends it back for another round of training. The process repeats until the model converges. Patient data never leaves the hospital. Only model parameters do.
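
Concretely, one round of that loop looks something like the sketch below. It's a toy FedAvg-style setup, assuming a linear model and invented hospital datasets, with NumPy vectors standing in for model parameters; real deployments wrap this in far more machinery, but the shape of the exchange is the same.

```python
# Toy sketch of federated averaging (FedAvg-style). NumPy vectors stand in
# for model parameters; the hospital datasets and local training step are
# hypothetical placeholders, not a real clinical model.

import numpy as np

def local_update(global_params, local_data, lr=0.1, epochs=1):
    """A hospital trains on its own data and returns only new parameters."""
    params = global_params.copy()
    X, y = local_data
    for _ in range(epochs):
        preds = X @ params                 # toy linear model
        grad = X.T @ (preds - y) / len(y)  # mean-squared-error gradient
        params -= lr * grad
    return params, len(y)                  # parameters + local sample count

def federated_round(global_params, hospital_datasets):
    """The server averages parameter updates, weighted by each site's size."""
    updates = [local_update(global_params, data) for data in hospital_datasets]
    total = sum(n for _, n in updates)
    return sum(params * (n / total) for params, n in updates)

# Three hospitals, each holding its own (features, outcomes) data locally.
rng = np.random.default_rng(0)
hospitals = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]
params = np.zeros(3)
for _ in range(10):                        # repeat until the model converges
    params = federated_round(params, hospitals)
print(params)
```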

In theory this is elegant. In practice, deploying it is a different story. Every participating hospital needs local GPU infrastructure for model training along with technical staff to manage it. Most hospitals have neither. Different hospitals record data differently, serve different patient populations, and follow different clinical practices, so achieving the alignment needed to train a single model takes serious work. The whole process can also get bottlenecked by a single slow hospital, since all sites need to finish training before the model can update. A recent review found that out of all the FL healthcare research published, only 10 studies actually ran across real distributed clinical environments. The practical barriers are well documented.

And even after all that effort, federated learning is still vulnerable to data leakage through the model parameters themselves. Sweden's privacy authority shut down a hospital FL project over exactly this.

So here's the current state of things. Hospital EHR systems across the US hold records on hundreds of millions of patients. There is clear evidence that training across this data produces better models. The institutional frameworks for collaboration have come further than most people realize. But there is still no way to actually compute on patient records across institutions without exposing them.

The cost of leaving this unsolved is measured in lives. 95% of rare diseases still lack effective treatments, largely because patient populations are spread so thin across hospitals that no single institution has enough cases to do meaningful research. Sepsis kills 11 million people a year globally, and the most widely deployed prediction model for it is failing the patients who need it most. Better models trained on broader data could catch patient deterioration earlier, improve cancer treatment outcomes, and prevent hospital readmissions that cost lives and billions of dollars. These aren't hypothetical use cases. They're problems that real patients face every day in real hospitals, and the data to address them already exists.

That is what makes hospital data the most valuable training dataset in medicine. And until we build systems that let hospitals contribute to model training without putting patient privacy at risk, it stays locked up.