Runaway tool-call recursion, silent context-window inflation, retry storms, sub-agent fan-out without a budget, prompt-cache misses on long sessions, and unguarded model upgrades. Each has a named control pattern. The taxonomy plus a €1 → €10k spend ladder lets you set the right alarm at the right tier — before prod traffic finds the loop your demo missed.

A reasonably built AI agent burned €40,000 in eight hours. The team had built a tool-calling loop that worked beautifully on every test case. The first prod traffic pattern hit a recursive structure no test had simulated: the loop fired tool calls into a sub-agent, the sub-agent fanned out to its own sub-agents, and by the time the morning standup noticed, the cost dashboard had eaten the engineering budget for the quarter.

Six failure modes, six counters, one spend ladder. The taxonomy is the deliverable; the controls follow from it.

The six failure modes

1. Runaway tool-call recursion

The agent calls a tool that produces output that triggers another tool call that produces output that triggers another tool call. Counter: hard step caps. Every loop has a max-iterations guard. When it hits, stop and emit a structured error.
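A minimal sketch of the step cap. The `next_action` callback, the `StepLimitExceeded` error type, and the default of 25 steps are illustrative assumptions, not a specific framework's API:

```python
class StepLimitExceeded(Exception):
    """Structured error emitted when the loop hits its max-iterations guard."""
    def __init__(self, max_steps, last_tool):
        super().__init__(f"agent loop exceeded {max_steps} steps (last tool: {last_tool})")
        self.max_steps = max_steps
        self.last_tool = last_tool

def run_agent_loop(next_action, max_steps=25):
    """Run the tool-calling loop with a hard step cap.

    `next_action(step)` is a hypothetical callback returning (tool_name, done).
    Returns the number of steps taken; raises StepLimitExceeded at the cap.
    """
    for step in range(max_steps):
        tool_name, done = next_action(step)
        if done:
            return step + 1
    raise StepLimitExceeded(max_steps, tool_name)
```

The point is that the cap is unconditional: no tool result, however plausible-looking, can extend the loop past it.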

2. Silent context-window inflation

Each tool call appends its result to the context window. After 50 calls, you are paying for a 200K-token prompt on every model call. Counter: token budget guard per loop step. When the context exceeds a threshold, force a summarisation step.
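One way to sketch the budget guard, assuming an OpenAI-style message list. The 4-characters-per-token estimate and the `summarise` callback are stand-ins — real code would use the model's tokenizer and an actual summarisation call:

```python
def estimate_tokens(messages):
    # Rough heuristic (~4 chars/token); swap in the model tokenizer in production.
    return sum(len(m["content"]) for m in messages) // 4

def enforce_token_budget(messages, summarise, budget=30_000):
    """If the context exceeds the budget, collapse the transcript to a summary.

    Keeps the system prompt (messages[0]) intact so the cacheable prefix survives.
    """
    if estimate_tokens(messages) <= budget:
        return messages
    summary = summarise(messages[1:])
    return [messages[0],
            {"role": "user", "content": f"Summary of prior steps: {summary}"}]
```

Run this check on every loop iteration, not just at the end — the failure mode is gradual, so the guard has to be too.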

3. Retry storms on transient errors

A flaky tool returns 503; the agent retries; the loop hits the same 503; the agent retries again; multiplied across 20 parallel sessions. Counter: exponential backoff with a circuit breaker. After three failures on the same tool in five minutes, the breaker trips and the loop fails fast.
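The breaker described above can be sketched like this. The injectable `clock` and `sleep` parameters are there for testability; the three-failures-in-five-minutes policy mirrors the text:

```python
import time

class CircuitBreaker:
    """Trips after `max_failures` failures within `window` seconds."""
    def __init__(self, max_failures=3, window=300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window
        self.clock = clock
        self.failures = []  # timestamps of recent failures

    def record_failure(self):
        now = self.clock()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)

    @property
    def open(self):
        now = self.clock()
        return sum(1 for t in self.failures if now - t < self.window) >= self.max_failures

def call_with_backoff(tool, breaker, retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry with exponential backoff; fail fast once the breaker trips."""
    for attempt in range(retries):
        if breaker.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            return tool()
        except Exception:
            breaker.record_failure()
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("retries exhausted")
```

Note the breaker is shared state: all 20 parallel sessions hit the same instance, so the third failure anywhere trips it for everyone.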

4. Sub-agent fan-out without a budget

A parent agent spawns child agents to parallelise work. Each child can spawn its own children. Counter: a fan-out semaphore — a shared budget on the total number of live agents across the whole tree at any one time. When the budget is full, new agents wait.
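A minimal sketch using a bounded semaphore; the budget of 3 and the `FanOutBudget` name are illustrative. The key property is that parents and grandchildren draw from the same pool, so recursion cannot multiply past the cap:

```python
import threading
import time

class FanOutBudget:
    """Shared budget of live agents; spawns block until a slot frees up."""
    def __init__(self, max_live):
        self._sem = threading.BoundedSemaphore(max_live)

    def spawn(self, work):
        with self._sem:   # acquired on entry, released when `work` returns
            return work()
```

A quick way to convince yourself it holds: run ten workers against a budget of three and record peak concurrency — it never exceeds the budget.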

5. Prompt-cache misses on long sessions

Prompt caching only works when the cacheable prefix actually matches. Long-running agent sessions accumulate dynamic state in the prefix and silently break caching, causing unit costs to balloon. Counter: cache-key audit. Log cache hit rate per session; alert when it drops below threshold.
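The audit can be a small accumulator fed from each API response's usage metadata. The `alert` callback, the 50% threshold, and the minimum-sample guard are assumptions — tune them to your workload:

```python
class CacheAudit:
    """Track prompt-cache hit rate per session; fire `alert` when it degrades."""
    def __init__(self, alert, threshold=0.5, min_tokens=1000):
        self.alert = alert            # hypothetical alerting hook
        self.threshold = threshold
        self.min_tokens = min_tokens  # don't alert on tiny samples
        self.cached = 0
        self.total = 0

    @property
    def hit_rate(self):
        return self.cached / self.total if self.total else 1.0

    def record(self, cached_tokens, prompt_tokens):
        """Call once per model call with token counts from the usage metadata."""
        self.cached += cached_tokens
        self.total += prompt_tokens
        if self.total >= self.min_tokens and self.hit_rate < self.threshold:
            self.alert(self.hit_rate)
```

The moment a timestamp, session ID, or mutable tool list creeps into the cached prefix, this counter drops off a cliff — which is exactly the signal you want.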

6. Unguarded model upgrades

A new model release is auto-routed by the SDK; the new model uses 2x the tokens for the same task; nobody notices until the bill arrives. Counter: pin model versions in production. Upgrade behind a feature flag with cost regression measured on a representative workload.
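A sketch of the pin-plus-flag routing. The model names are hypothetical placeholders; the hash-based bucketing just gives each session a sticky, deterministic assignment so cost regressions are attributable:

```python
import hashlib

PINNED_MODEL = "prod-model-2024-06"      # hypothetical pinned version
CANDIDATE_MODEL = "prod-model-2025-01"   # new release under evaluation

def choose_model(session_id, rollout_pct=5, flag_enabled=True):
    """Pin by default; route a fixed % of sessions to the candidate model.

    Bucketing is deterministic per session, so the same session always
    sees the same model and per-model cost can be compared cleanly.
    """
    if not flag_enabled:
        return PINNED_MODEL
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return CANDIDATE_MODEL if bucket < rollout_pct else PINNED_MODEL
```

Compare cost-per-task between the two cohorts before raising `rollout_pct` — a model that uses 2x the tokens shows up in the candidate cohort long before it shows up on the invoice.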

Most agent disasters happen at the third deploy, not the first — because production traffic exposes loops your demo never hit.

The spend ladder

Set alarms at every tier:

  • €1/session — surprising; investigate the next morning.
  • €10/session — page someone; the loop is broken or the workload is mis-shaped.
  • €100/session — auto-pause the loop and notify the operator.
  • €1,000/session — auto-kill, do not resume.
  • €10,000/day platform-wide — automated freeze on agent loops; only the on-call engineer can lift it.
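The per-session tiers reduce to a simple escalation lookup. The action names here are assumed labels for whatever your alerting stack actually does (the €10k/day platform freeze sits outside this function, since it aggregates across sessions):

```python
# Per-session tiers from the ladder above (EUR thresholds, assumed action names).
SESSION_TIERS = [
    (1,    "investigate"),  # surprising; look into it the next morning
    (10,   "page"),         # page someone
    (100,  "auto_pause"),   # pause the loop, notify the operator
    (1000, "auto_kill"),    # kill the loop, do not resume
]

def session_action(cost_eur):
    """Return the highest tier action triggered by a session's spend, or None."""
    action = None
    for threshold, name in SESSION_TIERS:
        if cost_eur >= threshold:
            action = name
    return action
```

Run it against the running cost counter on every loop step, so a session crossing €100 pauses mid-loop rather than at the next billing export.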

The cost of building the controls is small. The cost of the disaster they prevent is the kind that ends careers. Build them before you ship the third deploy.
