Khushi Dahiya
Building trust into AI

The biggest bottleneck in enterprise ML isn't models

No. 03 · April 2026 · 7 min read

I spent the last year or so building ML systems at a large enterprise SaaS platform. Agents, RAG, semantic search. The platform had tens of millions of documents. Employee records, payroll, tax and compliance data. Sensitive stuff.

The models were fine. The infra was fine. The team was good. The problem was that we had almost no way to learn from how people were actually using what we built.

To be clear, that's not a complaint. When your platform holds millions of people's employment records, you don't just pipe that into a training set. Client agreements exist for a reason. Privacy regulations exist for a reason. Those restrictions are protecting real people and that matters.

But it puts you in a weird position as an ML engineer. You're building systems to help real users but you can't observe real usage. You can't see how people phrase their questions. You can't study which searches return nothing. You can't build feedback loops because the feedback itself contains the data you're not allowed to look at.

So you do what everyone does. You write synthetic queries. You generate training data based on what you think users might ask. You put together eval sets that seem reasonable but that you know, deep down, aren't grounded in anything real. You ship it. It works okay.
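To make that concrete, here's a minimal sketch of what that workflow tends to look like. Everything in it is hypothetical: the queries, the document ids, and the `retrieve` callable are stand-ins for whatever your stack actually uses.

```python
# A hand-written eval set: queries engineers *think* users will ask.
# None of these are grounded in real usage; that's the whole problem.
SYNTHETIC_EVAL = [
    {"query": "how do I download my W-2", "expected_doc": "payroll/w2-access"},
    {"query": "parental leave policy", "expected_doc": "benefits/parental-leave"},
    {"query": "update my direct deposit", "expected_doc": "payroll/direct-deposit"},
]

def recall_at_k(retrieve, eval_set, k=5):
    """Fraction of queries whose expected document shows up in the top-k results.

    `retrieve(query, k)` is a placeholder for whatever retrieval call you have.
    """
    hits = 0
    for case in eval_set:
        results = retrieve(case["query"], k=k)
        if case["expected_doc"] in results:
            hits += 1
    return hits / len(eval_set)

# Usage with a stub retriever, just to show the shape of the thing:
# recall_at_k(lambda q, k: ["payroll/w2-access"], SYNTHETIC_EVAL, k=5)
```

The number this produces looks like an eval score, but it only measures performance on queries someone imagined, not queries anyone actually asked.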

This is not a niche problem. It's everywhere.

Samsung found this out the hard way when their engineers started pasting proprietary source code and internal meeting notes into ChatGPT to get help with debugging and documentation. Three separate leaks in under a month. Samsung's response was to ban ChatGPT entirely. That's the current state of the art in enterprise data protection: don't let the data near AI at all.

IBM's latest research shows that 72% of CEOs believe proprietary data is the key to unlocking generative AI value, but half admit their data environments can't actually support their AI ambitions. They know the data is the moat. They just can't use it.

And the numbers on the deployment side tell the same story. 79% of enterprises have adopted AI agents in some form, but only 11% have them in production. That gap isn't about model capability. It's about trust. Companies don't trust that their sensitive data will stay private once it enters an ML pipeline.

The workarounds are all some version of the same compromise. You either use synthetic data that doesn't capture the real distribution, or you anonymize production data so aggressively that it loses the signal you needed, or you just don't build the ML feature at all. I've done all three.
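The second option is worth seeing on the page. Below is a hypothetical, deliberately blunt redaction pass, the kind you end up with when nothing identifiable is allowed to leave production. The patterns and the example query are made up for illustration.

```python
import re

# Hypothetical "scrub everything" redaction rules.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email addresses
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # US SSN format
    (re.compile(r"\b\d+(\.\d+)?\b"), "[NUM]"),                # any number at all
    (re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"), "[NAME]"),   # naive two-capitalized-words "name"
]

def redact(text: str) -> str:
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

query = "Why is Priya Sharma's March bonus 2400 instead of 3000?"
print(redact(query))
# -> "Why is [NAME]'s March bonus [NUM] instead of [NUM]?"
# The PII is gone, but so is most of what made the query useful for
# understanding the payroll issue it was about.
```

The redacted text is safe to look at and nearly useless to learn from. That's the trade you keep making.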

There's always this gap though. Between what you shipped and what you could have shipped if there were a way to safely learn from production without compromising anyone's privacy. You want to tune retrieval but you need real documents. You want human feedback but the feedback contains PII. You want to understand failure modes but the failed queries are just as sensitive as the successful ones.

Healthcare companies are sitting on patient data that could train diagnostic models that save lives. Banks have transaction histories that could transform fraud detection. Legal firms have case documents that would power the best retrieval systems in the world. HR platforms have workforce data that could predict retention and compensation trends across entire industries. The data is right there. The models are ready. There's just no safe bridge between them.

The industry talks a lot about foundation models and training compute and architecture choices. In my experience the thing that actually determines whether an enterprise ML system is useful to a real person is data access. Not volume. Access. Can you safely learn from real usage without breaking the trust that users put in you when they handed over their data?

Both sides of that tension are completely valid. That's why it's hard. That's also why I think it's one of the more interesting problems to work on right now.