Most SMB teams either skip evals entirely or buy a $50k enterprise observability platform. The middle path: a golden dataset, three eval modes (deterministic, model-graded, human spot-check), and a weekly regression report, built on open tooling for under $5k a year. Four metrics matter; seven look serious but waste your time.

Most SMB teams approach LLM evals one of two ways. The first is to skip them entirely — ship and pray. The second is to buy a $50,000-a-year enterprise observability platform that has more features than the team will ever use. There is a middle path that almost nobody picks: build the eval pipeline you actually need on $5,000 a year of open tooling. Below: the structure, the four metrics that matter, and the seven that do not.

The structure

Three components. Each one is unglamorous. Each one earns its keep.

1. A golden dataset

200-500 examples that represent the production traffic distribution. Each example is a pair of (input, expected output) plus a notes column for what makes the answer good. This dataset is the single most valuable asset in the eval pipeline — guard it like you guard your customer database. If you can only do one thing this quarter, build the golden dataset.
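As a concrete sketch, one way to store it is a JSONL file with one example per line. The field names below are illustrative, not a required schema:

```python
import json

# Illustrative golden example; the field names are a suggestion, not a required schema.
EXAMPLE = {
    "input": "Customer email: 'I was charged twice for my March invoice.'",
    "expected_output": '{"category": "billing", "summary": "Duplicate charge on March invoice"}',
    "notes": "Must classify as billing and name the duplicate charge; no refund promises.",
}

def load_golden(path: str = "golden.jsonl") -> list[dict]:
    # One JSON object per line; keep the file in version control next to the evals.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```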

2. Three eval modes

  1. Deterministic checks. Does the output contain the required field? Does it parse as valid JSON? Does it stay within the token budget? Cheap, fast, run on every commit. (A sketch follows this list.)
  2. Model-graded. Use a smaller model (Haiku 4.5 works) to grade outputs against the golden dataset on a rubric. Run weekly, not per-commit. (A grader sketch also follows.)
  3. Human spot-check. A senior person reviews 10-20 outputs per week. Catches the failure modes the other two modes are blind to.
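A minimal sketch of the deterministic checks, assuming the system emits JSON strings; the required fields and token budget are placeholders for your own spec:

```python
import json

REQUIRED_FIELDS = {"category", "summary"}   # placeholder fields; use your own spec
MAX_OUTPUT_TOKENS = 800                     # placeholder budget

def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); swap in a real tokenizer if you have one.
    return len(text) // 4

def deterministic_checks(output: str) -> dict[str, bool]:
    try:
        parsed = json.loads(output)
        valid_json = True
        has_fields = isinstance(parsed, dict) and REQUIRED_FIELDS.issubset(parsed)
    except json.JSONDecodeError:
        valid_json, has_fields = False, False
    return {
        "valid_json": valid_json,
        "required_fields_present": has_fields,
        "within_token_budget": approx_tokens(output) <= MAX_OUTPUT_TOKENS,
    }
```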
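And a minimal sketch of the model-graded mode, assuming the Anthropic Python SDK and the golden-example format above; the rubric, 1-5 scale, and model id are illustrative assumptions, not prescriptions:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = (
    "Score the candidate answer from 1 to 5 against the expected answer and the notes. "
    "5 = correct and faithful, 1 = wrong or fabricated. Reply with the number only."
)

def grade(example: dict, candidate: str) -> int:
    prompt = (
        f"{RUBRIC}\n\n"
        f"Input: {example['input']}\n"
        f"Expected: {example['expected_output']}\n"
        f"What makes the answer good: {example['notes']}\n"
        f"Candidate: {candidate}"
    )
    response = client.messages.create(
        model="claude-haiku-4-5",  # model id is an assumption; use whichever small model you grade with
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    # Sketch only: in practice, guard against non-numeric replies before casting.
    return int(response.content[0].text.strip())
```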

3. A weekly regression report

Sent to a Slack channel every Monday morning. Three lines per metric: this week, last week, threshold. Anyone on the team should be able to look at it and answer the question "is the AI working as well as last week?" If they cannot, the report is wrong.
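One way to lay the report out (a metric name followed by its three lines), assuming you already have this-week and last-week numbers per metric; the metric names and thresholds are invented for the example:

```python
# Slack-friendly weekly report: each metric gets this week, last week, threshold.
# Assumes higher is better; invert the comparison for latency p95 and cost per task.
def format_report(metrics: dict[str, tuple[float, float, float]]) -> str:
    lines = []
    for name, (this_week, last_week, threshold) in metrics.items():
        flag = "" if this_week >= threshold else "  <-- below threshold"
        lines.append(f"{name}{flag}")
        lines.append(f"  this week:  {this_week:.2f}")
        lines.append(f"  last week:  {last_week:.2f}")
        lines.append(f"  threshold:  {threshold:.2f}")
    return "\n".join(lines)

# Post the string to Slack however you already do it (incoming webhook, bot, etc.).
print(format_report({
    "task_success_rate": (0.91, 0.93, 0.90),
    "faithfulness":      (0.88, 0.90, 0.85),
}))
```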

Four metrics that matter

  • Task success rate — does the output meet the spec? This is the only metric that answers "is the system working?".
  • Faithfulness to source — for RAG-style systems, does the output stick to the retrieved context, or hallucinate? Especially important for regulated outputs.
  • Latency p95 — the slow tail. Average latency lies; p95 is what users actually feel.
  • Cost per successful task — not cost per call, cost per success. A cheaper model that fails twice as often is not cheaper (a worked example follows this list).
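A worked example with made-up numbers, showing how cost per call and cost per successful task can point in opposite directions:

```python
CALLS = 1_000

# Made-up numbers: model_b is cheaper per call but fails far more often.
models = {
    "model_a": {"cost_per_call": 0.004, "success_rate": 0.90},
    "model_b": {"cost_per_call": 0.002, "success_rate": 0.40},
}

for name, m in models.items():
    total_cost = CALLS * m["cost_per_call"]
    successes = CALLS * m["success_rate"]
    print(f"{name}: ${total_cost:.2f} total, ${total_cost / successes:.4f} per successful task")

# model_a: $4.00 total, $0.0044 per successful task
# model_b: $2.00 total, $0.0050 per successful task  <- the "cheaper" model costs more per success
```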

Seven metrics that look serious but waste your time

  • Token count by itself. Without success rate context, meaningless.
  • ROUGE / BLEU scores for anything other than translation. They were designed for a different problem.
  • Embedding similarity as a primary metric. It rewards superficial matches and misses semantic failures.
  • Total cost. Without volume context, you cannot tell if it is high or low.
  • Average response time. p95 is the metric. Average hides everything.
  • Number of tool calls. It only matters when cost or latency is suffering.
  • Coverage of the prompt library. Vanity metric. The model can pass every prompt and still ship bad output to users.

Adoption arc

Weeks 1-2: build the golden dataset. Weeks 3-4: add deterministic checks to CI. Weeks 5-8: build the model-graded eval and the weekly report. Week 9 onwards: the human spot-check is a permanent commitment, not a project. The whole arc is one engineer at roughly 30% of their time over two months, plus ~$200/month of eval inference once it is running. The tooling and inference come in under $5k for the year; the engineer time is the real investment.

If you are tempted to skip the golden dataset because it is unglamorous, that is the moment to commit to building it. Every team we have seen that skipped this step paid for it later — usually six months after a model upgrade quietly broke the system and nobody noticed.
