What a risk desk taught me about evaluating AI agents

I have spent more than a decade deciding which trades to let through and which to halt, and I still do. Agents came later. The instinct that transfers most cleanly isn't technical at all. It's the habit of asking what an automated system does on its worst day, not its average one.

An AI agent in production is a risk surface. The failure modes that matter are almost never the ones in the demo. They live in the long tail: the inputs nobody scripted, the state nobody expected, the single path through the code that only runs once something else has already gone slightly wrong. That is the lens I bring to building agents, and it's the same discipline that keeps a book intact through the shocks that don't announce themselves in advance.

The demo is the average day

Every agent looks good in the demo. That is what a demo is for. You hand it the inputs you had in mind when you built it, it walks the path you designed, and it returns the answer you were hoping to show off. A demo is the average day with good lighting.

A risk desk does not get to live on the average day. The whole job is the other days. You can run clean books for a year and watch a single afternoon undo all of it, so the afternoon is where your attention goes. The industry's standard measure, value at risk, will tell you roughly what a normal bad day costs. It says almost nothing about the day that actually takes you out, because that day was sitting outside the model the entire time. Agent evaluation has the same blind spot. A benchmark score is an average over the cases someone thought to include, and it is perfectly quiet about the case nobody did.

Forty-five minutes

On the morning of August 1, 2012, a trading firm called Knight Capital pushed new software to its servers and missed one of the eight. On that eighth machine, a piece of dead code, unused since 2003, woke up, switched on by a flag that the new release had quietly repurposed for something else. Nothing crashed. No alarm in the building announced that the system was broken, because by its own logic it wasn't. It was doing precisely what the old instructions told it to do, as fast as it could, which came to about four million orders in roughly forty-five minutes. By the time someone pulled the plug at 10:15, the firm was holding around seven billion dollars in positions it never meant to take and was down something like 440 million dollars. It nearly ended the company in under an hour.

I think about Knight Capital whenever someone walks me through an agent. Not because the technology rhymes, it doesn't, but because the failure does. An autonomous system, acting on stale assumptions, confidently, at machine speed, with no person in the loop fast enough to matter and no limit that bit before the damage was done. That is not really a story about bad code. The code did what it said. It is a story about the missing controls, the ones that exist precisely because someday your system will do exactly the wrong thing, and do it well.

What the desk actually does

People imagine risk management is about prediction. It mostly isn't. You are not going to predict the afternoon that goes wrong; if you could, it wouldn't go wrong. What you can do is decide, in advance and in cold blood, how much damage any single mistake is allowed to cause. That is the real craft, and almost all of it carries straight over to agents.

A desk runs on limits. A position can only grow so large before the system refuses it, no matter how good the trade looks from the inside. An agent needs the same reflex: a cap on tokens per call, a hard ceiling on spend per day, a turn limit the model cannot talk its way past. I wire these in before the feature even works, because an uncapped agent pointed at an API is a bad night waiting to happen, and the bad night comes on its own schedule rather than yours.
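Here is a minimal sketch of what those three limits might look like in Python. Every name in it (AgentLimits, LimitExceeded, the default numbers) is hypothetical and chosen for illustration, and it assumes the caller can estimate a call's cost before making it. The point is only that the check lives outside the model and runs before the call, so nothing the model says can move it.

```python
from dataclasses import dataclass


class LimitExceeded(Exception):
    """Raised when a call would breach a hard limit."""


@dataclass
class AgentLimits:
    # Illustrative defaults only; real numbers depend on the workload.
    max_tokens_per_call: int = 4_000
    max_spend_per_day: float = 50.0
    max_turns: int = 12
    spent_today: float = 0.0
    turns_used: int = 0

    def check(self, requested_tokens: int, estimated_cost: float) -> None:
        """Refuse the call up front if any limit would be breached."""
        if requested_tokens > self.max_tokens_per_call:
            raise LimitExceeded("token cap per call exceeded")
        if self.turns_used >= self.max_turns:
            raise LimitExceeded("turn limit reached")
        if self.spent_today + estimated_cost > self.max_spend_per_day:
            raise LimitExceeded("daily spend ceiling would be exceeded")

    def record(self, actual_cost: float) -> None:
        """Book the cost of a completed call against the daily budget."""
        self.spent_today += actual_cost
        self.turns_used += 1
```

The intended pattern is to call check() before every model call and record() after it. A refused call raises instead of returning a value, so the surrounding code cannot quietly carry on past the limit.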

A desk has a kill switch, and someone whose job is to reach for it. An agent should have one too: a single place you can stand to stop every model call at once, tested to actually work on the day you are not calm. A desk puts four eyes on anything that matters, so nothing large ships on one person's word. The agent version is a human in the loop for actions that are expensive or hard to reverse, with full autonomy saved for the ones that are cheap and safe to undo. The whole skill is being honest about which is which.
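A rough sketch of both controls, under stated assumptions: KillSwitch, AgentHalted, and run_action are hypothetical names, the approve callback stands in for whatever review process a team actually uses, and the reversible flag is something a person sets honestly in advance, not something the agent decides for itself.

```python
import threading
from typing import Callable


class AgentHalted(Exception):
    """Raised when work is attempted after the kill switch is tripped."""


class KillSwitch:
    """One shared object that every model call and action checks first."""

    def __init__(self) -> None:
        self._halted = threading.Event()

    def trip(self) -> None:
        self._halted.set()

    def guard(self) -> None:
        if self._halted.is_set():
            raise AgentHalted("kill switch is tripped")


def run_action(
    name: str,
    action: Callable[[], str],
    reversible: bool,
    switch: KillSwitch,
    approve: Callable[[str], bool],
) -> str:
    """Run an action only if the switch is clear and, when the action is
    hard to undo, a human has approved it."""
    switch.guard()
    if not reversible and not approve(name):
        return f"{name}: held for human review"
    return action()
```

Because every path goes through guard(), a single call to trip() stops all of them at once, and threading.Event keeps that safe when several workers share the switch. A test that trips the switch and asserts that nothing runs afterwards is a cheap way to check it still works on the day you need it.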

And a desk assumes failure as a starting condition. Every control on it exists because someone, somewhere, already lived through the thing it now prevents. The useful question was never "will this break." It was "when it breaks, how will I find out, and what will it have cost me by then."

The dangerous failure is the quiet one

A crash is a gift. It tells you exactly where and when it broke, and then it stops. The failures that hurt are the ones that keep their voices down: the agent that hands back a confident, well-formatted, completely wrong answer and proceeds to the next step as if nothing happened.

Multi-step agents make this sharper, because the errors compound. A small misread at step two becomes a shaky assumption at step four becomes a polished, authoritative wrong answer by step nine, and each step on its own looked fine. That is the part worth losing sleep over. The system is not malfunctioning. It is functioning beautifully, in the direction of the wrong answer. Knight Capital again, in miniature, every time it happens.
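The arithmetic behind that compounding is easy to sketch. The 95 percent per-step figure below is an illustrative assumption, not a measurement, and so is the idea that steps fail independently. Real errors tend to be correlated, which can make the picture better or worse.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step is right, assuming each step succeeds
    independently with the same probability (a simplifying assumption)."""
    return per_step ** steps


# Assumed per-step reliability of 95%, chosen only for illustration.
for steps in (1, 4, 9):
    print(steps, round(chain_success(0.95, steps), 3))
```

Under those assumptions a nine-step chain comes out right about 63 percent of the time, even though each individual step looks 95 percent reliable when inspected alone.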

So the job is to make the quiet failures loud. Instrument the steps, not only the final output. Record what the agent decided and why, so when something looks off you can point to the exact step that produced it instead of squinting at one opaque call. Design the thing to degrade on purpose: when the model is unreachable, or the output fails a check, fall back to a calm, safe default instead of guessing. An agent that says "I could not do this safely, so I stopped" is worth more than one that is right most of the time and silently catastrophic the rest.
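One way to sketch that in Python, with every name hypothetical: each step appends a record of what it produced and why it was accepted or rejected, and any error or failed check returns a fixed safe default instead of a guess. The produce and check callables stand in for a real model call and a real validator, and the logger name is an arbitrary choice.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger("agent.steps")

SAFE_DEFAULT = "I could not do this safely, so I stopped."


@dataclass
class StepRecord:
    name: str
    output: str
    ok: bool
    reason: str


@dataclass
class Trace:
    steps: List[StepRecord] = field(default_factory=list)


def run_step(
    trace: Trace,
    name: str,
    produce: Callable[[], str],
    check: Callable[[str], bool],
) -> str:
    """Run one step, record the outcome, and degrade to a safe default."""
    try:
        output = produce()
    except Exception as exc:  # broad on purpose: any failure means stop
        trace.steps.append(StepRecord(name, "", False, f"error: {exc}"))
        logger.warning("step %s failed: %s", name, exc)
        return SAFE_DEFAULT
    if not check(output):
        trace.steps.append(StepRecord(name, output, False, "failed check"))
        logger.warning("step %s output failed its check", name)
        return SAFE_DEFAULT
    trace.steps.append(StepRecord(name, output, True, "passed check"))
    return output
```

When a final answer looks off, the trace shows which step was the first with ok set to False, or the last one that passed before things drifted. In a real pipeline the caller would also stop the chain once it sees the safe default, so the fallback text is never fed to the next step as if it were data.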

Building like you'd run a desk

None of this is pessimism. A good risk operator is not a doomsayer. They are the reason the desk gets to keep taking risk at all. The limits and the kill switch and the worst-day question are exactly what let you ship the ambitious thing, because you have already settled what it is not allowed to cost you.

That is the posture I want around agents. Less "look what it can do" on a good day, more "here is what it does on its worst one, and here is why I can live with that." Build the controls before the capability. Assume the stale code is sitting on the eighth server. Ask the worst-day question out loud, while there is still time to answer it on your own terms.