A field-tested taxonomy: the eval gap, prompt-as-spec drift, the orphaned pilot, the vendor-locked POC, and eight more, each with a diagnostic signal and a single-line counter. Operators get a vocabulary to challenge vendors with. Almost every doomed AI project fails in one of these ways, and three of them account for most of the catastrophic failures.
The first AI project a company ships is rarely the one that fails. The second one is. The pilot was scoped tight, owned by the same person who advocated for it, and shipped before anyone could change their mind. The follow-on project picks up everyone's expectations and runs into one of the failure modes below.
A taxonomy. Twelve named failure modes, each with a diagnostic signal and a counter. Three of them — the eval gap, the orphaned pilot, and prompt-as-spec drift — cause about 80% of the damage we see. Most operators recognise their own organisation in at least one.
The big three
1. The eval gap
Symptom: the team cannot answer the question 'is the AI's output worse this week than last week?' Diagnostic: ask anyone on the project to show you the regression dashboard. If they do not have one, you are flying blind. Counter: invest in eval infrastructure before scale. Even a 200-example golden dataset run weekly catches more drift than no eval at all.
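To make that counter concrete, here is a minimal sketch of a weekly golden-set check, assuming a JSONL file of input/expected pairs and a `run_model` wrapper around whatever model call you actually make. The file layout, the exact-match scoring, and the 2% tolerance are illustrative assumptions, not prescriptions.

```python
# Minimal weekly regression check against a golden dataset.
# Assumptions: golden.jsonl holds {"input": ..., "expected": ...}
# records, and run_model() wraps your real model call.
import json
import statistics

def run_model(prompt: str) -> str:
    """Placeholder: swap in your actual model call."""
    raise NotImplementedError

def score(example: dict) -> float:
    # Exact match is the crudest useful metric; replace as needed.
    output = run_model(example["input"])
    return 1.0 if output.strip() == example["expected"].strip() else 0.0

def weekly_check(golden_path: str, last_week: float, tolerance: float = 0.02) -> float:
    with open(golden_path) as f:
        examples = [json.loads(line) for line in f]
    current = statistics.mean(score(ex) for ex in examples)
    if current < last_week - tolerance:
        raise RuntimeError(
            f"Drift: golden-set score {current:.3f}, down from {last_week:.3f} last week"
        )
    return current
```

Crude, but it answers 'is this week worse than last week?' with a number instead of a shrug.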
2. The orphaned pilot
Symptom: the pilot ships, the champion who built it moves on, and the system slowly degrades because nobody owns it day-to-day. Diagnostic: ask 'who is on call when this breaks?' and listen for awkward silence. Counter: every AI system needs a named operator before it goes to prod. If you cannot name them, do not ship.
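If you want the named-operator rule enforced rather than remembered, a ship gate can do it. A minimal sketch, assuming your deploy pipeline reads a service manifest; the manifest shape and the `on_call_owner` field are illustrative assumptions.

```python
# Refuse to deploy a service whose manifest does not name a person.
# The manifest format and field names are assumptions for illustration.
def assert_owned(manifest: dict) -> None:
    owner = manifest.get("on_call_owner", "")
    if not owner or owner.lower() in {"tbd", "the team", "everyone"}:
        raise SystemExit(
            f"{manifest.get('service', '?')}: no named operator; refusing to ship"
        )

assert_owned({"service": "support-triage-llm", "on_call_owner": "a.kovacs"})
```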
3. Prompt-as-spec drift
Symptom: the prompt has been edited 30 times by 4 different people and nobody can articulate what it currently does. Diagnostic: ask for the system prompt and watch how many caveats they add when reading it to you. Counter: version your prompts, change-log every edit, and require a passing eval regression run before any prompt change ships.
The prompt is your spec. If you would not let an unsigned spec ship, do not let an unsigned prompt ship.
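A sketch of what 'version your prompts' can look like in practice, under assumptions: every change carries an author and a one-line changelog, and nothing goes live without a passing eval run. `PromptRegistry` and `release` are illustrative names, not a real library.

```python
# Versioned prompts with a change log and an eval gate.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

@dataclass(frozen=True)
class PromptVersion:
    version: int
    text: str
    author: str
    changelog: str  # what changed and why, in one line
    released_at: datetime

@dataclass
class PromptRegistry:
    versions: list[PromptVersion] = field(default_factory=list)

    def release(self, text: str, author: str, changelog: str,
                eval_passes: Callable[[str], bool]) -> PromptVersion:
        # The gate: a prompt change that fails eval regression never ships.
        if not eval_passes(text):
            raise ValueError("Eval regression failed; prompt change blocked")
        pv = PromptVersion(len(self.versions) + 1, text, author, changelog,
                           datetime.now(timezone.utc))
        self.versions.append(pv)
        return pv

    def current(self) -> PromptVersion:
        return self.versions[-1]
```

Now 'what does the prompt currently do?' has an answer: read the changelog from version 1 forward.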
The other nine
- The vendor-locked POC. You built it on a vendor SDK that only works with one model. Switching costs are silently load-bearing.
- The data-tax surprise. Six months in, you discover the data layer needs more cleanup than the AI layer. Budget blows up.
- The 'one more eval' loop. The team keeps adding evals instead of shipping. The eval suite never feels complete; the system never reaches prod.
- The autonomous-agent burn. An agent loop with no spend cap discovers an infinite recursion. €40k in 8 hours is a real number. (A minimal cap sketch follows this list.)
- The trust collapse. The system gets one bad output in front of an executive. Six months of work is now under review, regardless of whether the rest of the output is fine.
- The feature-creep pilot. Scope expanded mid-build. The pilot is now a platform. Neither shipped.
- The integration-tax shock. Connecting to your CRM, your auth, and your warehouse turned out to take longer than the AI work.
- The handover hole. The build team handed over to the ops team without a runbook. Ops cannot debug what they did not build.
- The compliance retrofit. Legal review happens after the system is live. Anything they reject costs 5x what it would have during build.
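On the autonomous-agent burn in particular, the counter is cheap to build, and the sketch below shows the shape of it, assuming you can estimate cost per call. Two hard stops, a step cap and a spend cap, both of which halt the loop rather than log a warning. The helper names (`step`, `cost_of`) and the €25 figure are illustrative.

```python
# An agent loop with two hard stops: a step cap against infinite
# recursion and a spend cap against runaway cost. Names are illustrative.
class BudgetExceeded(RuntimeError):
    pass

def run_agent(task, step, cost_of, max_steps: int = 50, max_eur: float = 25.0):
    """step() runs one agent iteration; cost_of() estimates its cost in EUR."""
    spent, result = 0.0, None
    for _ in range(max_steps):            # step cap: no unbounded loops
        result = step(task, result)
        spent += cost_of(result)
        if spent > max_eur:               # spend cap: halt, don't warn
            raise BudgetExceeded(f"Agent spent €{spent:.2f}, cap is €{max_eur:.2f}")
        if getattr(result, "done", False):
            return result
    return result
```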
Using the taxonomy
The point of named failure modes is not to memorise twelve gotchas. It is to give your project a vocabulary. When a vendor says "this never fails", you can ask which of the twelve they mean. When a stakeholder asks "is this risky?", you can show them the named risks and the counters you have in place.
Operators with a vocabulary make better calls. Vendors without one tend to oversell.