Five messy, real-world workbooks (close pack, AP ageing, sales commission, headcount plan, FP&A consolidation) tested against Claude, GPT-5, and Gemini on formula-error detection, broken references, and tab-to-tab consistency. Claude wins on long-context, multi-sheet reasoning; Gemini wins on cell-level formula checks.
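
To make "formula-error detection" and "broken references" concrete: the mechanical first step any such comparison needs is to pull every formula out of a workbook (so a model sees formulas rather than cached values) and to flag the unambiguous breakage, such as #REF! errors, up front. Below is a minimal Python sketch of that step using openpyxl; it illustrates the kind of pre-processing involved, not the harness from the engagement, and the ap_ageing.xlsx file name is hypothetical.

```python
# Minimal sketch: extract formulas from a workbook and flag #REF! breakage.
# Assumes openpyxl is installed (pip install openpyxl); names are illustrative.
from openpyxl import load_workbook


def extract_formulas(path: str) -> dict[str, list[tuple[str, str]]]:
    """Map each sheet name to its (cell coordinate, formula) pairs."""
    # The default load keeps formulas as strings ("=SUM(...)") rather than
    # the cached values a data_only=True load would return.
    wb = load_workbook(path)
    formulas: dict[str, list[tuple[str, str]]] = {}
    for ws in wb.worksheets:
        cells = []
        for row in ws.iter_rows():
            for cell in row:
                # Plain string formulas only; array formulas are skipped here.
                if isinstance(cell.value, str) and cell.value.startswith("="):
                    cells.append((cell.coordinate, cell.value))
        formulas[ws.title] = cells
    return formulas


def broken_references(formulas: dict[str, list[tuple[str, str]]]):
    """Yield (sheet, cell, formula) for formulas containing #REF! errors."""
    for sheet, cells in formulas.items():
        for coord, formula in cells:
            if "#REF!" in formula:
                yield sheet, coord, formula


if __name__ == "__main__":
    fs = extract_formulas("ap_ageing.xlsx")  # hypothetical workbook
    for sheet, coord, formula in broken_references(fs):
        print(f"{sheet}!{coord}: {formula}")
```

From there, the per-sheet formula lists (plus any cross-sheet references) are what you would serialize into each model's context, and the #REF! scan gives you deterministic ground truth to score formula-error detection against.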

We're expanding this piece into a full long-form article in the coming weeks. We publish each insight once the engagement it draws from has settled enough that we can name the trade-offs honestly: field notes ship when the work has stopped surprising us, not while a pattern is still proving itself in production.

Want the long-form version when it lands? Or would you rather skip ahead and talk through the same questions for your own company?
