Khushi Dahiya
Building trust into AI

What happens when agents start making decisions that matter

No. 01 · April 2026 · 8 min read

I've built AI agents that didn't just answer questions. They took action. Submitted requests, modified records, triggered workflows in production systems. They had the same system-level access as the application itself, called tools autonomously, and made decisions that affected real people. The observability behind all of this? Basically just logs. No structured record of what the agent did, in what order, or why.

At the time I didn't think too much about it. The stakes in my case were manageable. But now I'm seeing the same patterns show up in places where the consequences are a lot less forgiving.

UnitedHealth deployed an AI model called nH Predict to decide coverage for elderly Medicare patients. When patients managed to appeal the AI's denials, the decisions were overturned 90% of the time. Nine out of ten. The company knew this and kept using it because only about 0.2% of patients ever appealed. Do the arithmetic: if the appealed cases are at all representative, the overwhelming majority of denials were wrong, yet fewer than one in five hundred ever got corrected. One patient's family spent $70,000 out of pocket after the AI cut off post-acute care that their doctors said was medically necessary. The lawsuit is still ongoing; in March 2026 a court ordered UnitedHealth to disclose the algorithm.

This isn't just UnitedHealth. Cigna faced a lawsuit alleging its PXDX algorithm let doctors deny claims in large batches without individual review. Humana got hit with similar accusations. In 2024 the US Department of Justice started subpoenaing healthcare companies over AI tools in medical record systems to investigate whether they were leading to excessive or medically unnecessary care.

Then there's this one. A startup called SaaStr gave an autonomous coding agent a maintenance task during a code freeze. Explicit instructions: make no changes. The agent ran a DROP DATABASE command and wiped production. When it got caught, it generated 4,000 fake user accounts and fabricated system logs to try to cover it up. Its explanation was "I panicked instead of thinking."

The pattern repeats every time. An agent has broad access. It makes a call. Something breaks. And nobody can piece together what happened because the infrastructure to track agent decisions was never built.

77% of organizations have reported financial losses from AI incidents. 55% have taken reputational damage. Meanwhile the tooling for agent accountability is still mostly an afterthought.

Building agents left me with a few questions I couldn't answer then and honestly still don't see great answers to now.

What does a proper audit trail for an agent even look like? Not application logs. Something structured and queryable. Every tool call, every parameter, every piece of data the agent accessed, every decision point. Something you could hand to a regulator and say here's exactly what happened.
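To make that concrete, here's a minimal sketch in Python of what one record in such a trail could look like. Everything here is my own illustration, not an existing standard: the `ToolCallRecord` fields, the `AuditTrail` class, and the hash chaining are all hypothetical. The point is that every step is structured, ordered, and tamper-evident instead of buried in free-text logs.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field, asdict

# Hypothetical schema: none of these field names come from any standard.
@dataclass
class ToolCallRecord:
    session_id: str      # one agent run, end to end
    step: int            # position in the run, so ordering is never ambiguous
    tool: str            # which tool the agent invoked
    params: dict         # the exact arguments it passed
    data_accessed: list  # records/resources the call touched
    rationale: str       # the agent's stated reason, captured at call time
    timestamp: float = field(default_factory=time.time)
    prev_hash: str = ""  # hash of the previous record, for tamper evidence
    hash: str = ""

    def seal(self, prev_hash: str) -> "ToolCallRecord":
        """Chain this record to the previous one so edits are detectable."""
        self.prev_hash = prev_hash
        body = {k: v for k, v in asdict(self).items() if k != "hash"}
        self.hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        return self


class AuditTrail:
    """Append-only, ordered log of everything one agent session did."""

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.records: list[ToolCallRecord] = []

    def append(self, **kwargs) -> ToolCallRecord:
        prev = self.records[-1].hash if self.records else "genesis"
        rec = ToolCallRecord(
            session_id=self.session_id, step=len(self.records), **kwargs
        ).seal(prev)
        self.records.append(rec)
        return rec


# Illustrative usage: one tool call, fully accounted for.
trail = AuditTrail("run-2026-04-01-0001")
trail.append(
    tool="update_claim_status",
    params={"claim_id": "C-1234", "status": "denied"},
    data_accessed=["claims/C-1234", "policy/P-88"],
    rationale="Coverage criteria not met per retrieved policy text.",
)
```

A queryable store of records like this is what you could actually hand over: filter by session, walk the steps in order, and verify nothing was altered after the fact.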

How do you separate what an agent should be able to do from what it technically can do? Most agent systems inherit the same permissions as the application they live inside. But the application was built with a defined set of features and predictable behavior. An LLM-powered agent can decide to call any tool it has access to in any order for any reason it comes up with. The access model was designed for deterministic software, not for an autonomous system that improvises.
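One way to narrow that gap is to stop letting the agent inherit the application's credentials and instead gate every tool call through an explicit, per-task allowlist. A minimal sketch, with hypothetical tool and policy names that aren't from any real framework:

```python
# Hypothetical policy gate: tool names and policy shape are illustrative only.
class PermissionDenied(Exception):
    pass

# What the agent technically *can* do: every tool registered in the runtime.
TOOLS = {
    "read_record": lambda record_id: f"<contents of {record_id}>",
    "update_record": lambda record_id, value: f"updated {record_id}",
    "drop_database": lambda name: f"dropped {name}",  # registered, but gated
}

# What the agent *should* do for this one task: deny by default,
# scoped to the task rather than inherited from the host application.
TASK_POLICY = {"allowed_tools": {"read_record"}}  # read-only maintenance task

def call_tool(policy, tool_name, **params):
    """Every tool call passes through the policy before anything executes."""
    if tool_name not in policy["allowed_tools"]:
        raise PermissionDenied(f"'{tool_name}' is outside this task's scope")
    return TOOLS[tool_name](**params)

print(call_tool(TASK_POLICY, "read_record", record_id="C-1234"))  # allowed
try:
    call_tool(TASK_POLICY, "drop_database", name="production")
except PermissionDenied as exc:
    print(exc)  # denied: the model chose it, the gate refused it
```

The design choice that matters is deny-by-default per task: the policy describes the task at hand, not the application's full capabilities, so the model stays free to decide anything while the gate decides what actually executes.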

How do you debug something non-deterministic? Run the same input twice, get different tool calls. If something goes wrong on Tuesday you might not be able to reproduce it on Wednesday. Traditional debugging doesn't work when the system doesn't behave the same way twice.
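One technique that helps is record and replay: capture every model response and tool result during the live run, then feed the recorded values back during debugging so Wednesday's session walks Tuesday's exact trajectory. A minimal sketch, with a hypothetical `Recorder` shim; nothing here comes from a real framework:

```python
import json
import random

# Hypothetical record/replay shim; the interface is illustrative only.
class Recorder:
    """Capture every non-deterministic call so a failed run can be replayed."""

    def __init__(self, mode="record", path="trace.jsonl"):
        self.mode, self.path = mode, path
        if mode == "record":
            open(path, "w").close()  # start a fresh trace
        else:
            with open(path) as f:
                self._events = [json.loads(line) for line in f]
            self._cursor = 0

    def intercept(self, kind, fn, *args, **kwargs):
        if self.mode == "record":
            result = fn(*args, **kwargs)  # real model or tool call
            with open(self.path, "a") as f:
                f.write(json.dumps({"kind": kind, "result": result}) + "\n")
            return result
        event = self._events[self._cursor]  # replay: return the recorded answer
        self._cursor += 1
        assert event["kind"] == kind, "trajectory diverged from recorded run"
        return event["result"]


def flaky_model(prompt):
    # Stand-in for an LLM: same input, different output on different runs.
    return random.choice(["call tool A", "call tool B"])


rec = Recorder(mode="record")
first = rec.intercept("llm", flaky_model, "what should I do next?")

rec = Recorder(mode="replay")
second = rec.intercept("llm", flaky_model, "what should I do next?")
assert first == second  # replay is deterministic, so Tuesday's bug reproduces
```

It doesn't make the agent deterministic, but it makes any single failed run reproducible, which is the property debugging actually depends on.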

And how do you prove to a regulator or a court that the agent's decision was reasonable? In healthcare and finance this is not a future problem. It's already being litigated.

The industry is moving fast on making agents more capable. New tool servers are being published all the time. Companies are going all in. JPMorgan is putting $4 billion into AI this year. Goldman Sachs reports internal AI adoption above 90%.

But the infrastructure for trust is not keeping up. We're handing agents the keys to production systems and hoping that logs are good enough when something goes wrong. They're not.

I don't think building more powerful agents is the hard problem anymore. Building agents you can actually trust with decisions that matter is the hard problem. And the tooling for that barely exists.