Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces

#ai #evaluation #observability #typescript

Your eval suite is only as good as the cases in it, and almost nobody talks about where those cases come from. We argue endlessly about deterministic checks versus model-as-judge, about CI gates and drift thresholds — and then we feed all of that machinery a dataset of twelve examples someone made up in an afternoon. The scoring is rigorous. The corpus is fiction.

Here is the opinion I will defend: the hardest part of agent evaluation is not the scorer, it's the dataset. A perfect judge over an unrepresentative set of inputs gives you a confident green checkmark that means nothing. You can have flawless assertions and still ship a broken agent, because your eval set never contained the input that breaks it. The corpus is the test. Everything else is plumbing.

Why hand-written eval cases rot

When teams build their first eval set, they sit down and imagine inputs. "A user asks to summarize a ticket." "A user asks for a refund." These cases share a fatal property: they are what the engineer imagined a user would do, written by the same person who wrote the prompt. They encode the happy path twice — once in the agent, once in the test — and then congratulate each other for agreeing.

Real users do not behave like your imagination. They paste 8,000 tokens of Slack history into a one-line field. They ask three questions at once. They use your product for something you never designed. They write in a second language they are barely comfortable in. None of that is in your hand-authored set, which means none of it is gated, which means every one of those inputs is a live grenade in production that your "comprehensive eval suite" has never seen.

Synthetic data generation ("ask GPT to write me 100 test queries") feels like a fix and isn't. The model tends to generate from the same distribution your prompt already handles well. You get 100 variations of the easy case and few, if any, of the weird ones, because the weird ones are weird precisely because no model would think to generate them. Synthetic sets inflate your case count and your confidence without touching your actual risk.

The source of truth is your production traces

The only honest source of eval cases is the distribution you actually serve: production traffic. Your users are running a continuous, adversarial, free fuzzing campaign against your agent every single day. The job is to capture that, find the inputs that matter, and promote them into permanent regression cases.

This is exactly why I treat tracing and evaluation as one workflow instead of two products. AgentLens captures the full execution trace of every production run — the resolved input the model actually saw after template interpolation, every tool call with its arguments, the raw outputs, the final answer. That trace store is not just a debugging tool; it is the raw material your eval set is mined from. agent-eval is the other half: it takes a case, runs the deterministic checks and the model-as-judge rubric, and returns a pass/fail verdict you can gate on. The pairing matters because AgentLens decides which cases are worth testing, and agent-eval decides whether the agent passes them. A scorer with no pipeline from production is a scorer grading fiction; a trace store you never promote into evals is an archive nobody reads.

The loop looks like this: a production run gets a bad outcome (a thumbs-down, a support escalation, a failed downstream action). That trace is the most valuable test case you will ever have, because it is a real failure that really happened. You capture its resolved input, attach the corrected expected behavior, and it becomes a permanent case. Now that exact failure can never silently regress again.
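It helps to write down what counts as "a bad outcome" as code rather than leaving it to whoever is on call. What follows is a minimal sketch, assuming a hypothetical RunRecord shape; the field names and the flagReasons and promotionCandidates helpers are mine for illustration, not anything AgentLens exposes.

```typescript
// Hypothetical shape for a finished production run. These field names are
// assumptions for illustration, not part of any AgentLens API.
interface RunRecord {
  traceId: string;
  userFeedback?: "up" | "down";
  escalatedToSupport: boolean;
  downstreamActionFailed: boolean;
}

type FlagReason = "thumbs_down" | "support_escalation" | "downstream_failure";

// Return every reason a run qualifies for promotion. An empty array means
// the run showed none of the three bad-outcome signals.
function flagReasons(run: RunRecord): FlagReason[] {
  const reasons: FlagReason[] = [];
  if (run.userFeedback === "down") reasons.push("thumbs_down");
  if (run.escalatedToSupport) reasons.push("support_escalation");
  if (run.downstreamActionFailed) reasons.push("downstream_failure");
  return reasons;
}

// Keep only flagged runs, paired with their reasons, so a reviewer can
// decide which to promote and what the corrected behavior should be.
function promotionCandidates(
  runs: RunRecord[],
): { traceId: string; reasons: FlagReason[] }[] {
  return runs
    .map((run) => ({ traceId: run.traceId, reasons: flagReasons(run) }))
    .filter((candidate) => candidate.reasons.length > 0);
}
```

Keeping the reasons alongside the trace ID means the reviewer who attaches the corrected behavior already knows why the run was flagged.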

Here is the harness that turns a flagged trace into a frozen regression case:

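import { getTrace } from "agentlens";
import { evaluate, assert } from "agent-eval";
import { writeFileSync, readFileSync } from "node:fs";
// Assumption: runAgent is your own agent entry point; adjust the path to your project.
import { runAgent } from "./agent";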

interface GoldenCase {
  id: string;
  sourceTraceId: string;     // provenance: which real run this came from
  input: unknown;            // the RESOLVED input, exactly as the model saw it
  policy: string;            // the judge rubric this case must satisfy
  mustContain?: string[];    // deterministic anchors from the corrected answer
  mustNotContain?: string[]; // things the bad run did that we now forbid
}

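// Promote a flagged production trace into a permanent eval case.
// `anchors` carries the corrected expected behavior for the deterministic checks.
async function promoteTrace(
  traceId: string,
  policy: string,
  anchors: Pick<GoldenCase, "mustContain" | "mustNotContain"> = {},
): Promise<GoldenCase> {
  const trace = await getTrace(traceId);

  // Critical: freeze the RESOLVED input, not the template. The whole reason
  // this run failed may live in the interpolated context, not your prompt.
  const resolvedInput = trace.steps.find((s) => s.kind === "model")?.input;
  // Compare against null/undefined so a legitimately empty input still counts.
  if (resolvedInput == null) throw new Error(`no model input in trace ${traceId}`);

  const golden: GoldenCase = {
    id: `case_${traceId.slice(0, 8)}`,
    sourceTraceId: traceId,
    input: resolvedInput,
    policy,
    ...anchors,
  };

  // Start from an empty set on first use instead of crashing on a missing file.
  let set: GoldenCase[] = [];
  try {
    set = JSON.parse(readFileSync("./goldens.json", "utf8"));
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code !== "ENOENT") throw err;
  }
  // Re-promoting the same trace replaces its case rather than duplicating it.
  set = set.filter((g) => g.sourceTraceId !== traceId);
  set.push(golden);
  writeFileSync("./goldens.json", JSON.stringify(set, null, 2));
  return golden;
}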

// Run the curated set. Every case here is a real failure we refuse to repeat.
async function runRegressionSet(): Promise<void> {
  const set: GoldenCase[] = JSON.parse(readFileSync("./goldens.json", "utf8"));

  const results = await Promise.all(
    set.map(async (g) => {
      const output = await runAgent(g.input);
      const report = await evaluate({
        input: g.input,
        output,
        checks: [
          assert.contains(g.mustContain ?? []),
          assert.notContains(g.mustNotContain ?? []),
          assert.judge({ criterion: g.policy, threshold: 0.7 }),
        ],
      });
      return { id: g.id, source: g.sourceTraceId, ...report };
    }),
  );

  const failed = results.filter((r) => !r.passed);
  for (const f of failed) {
    // Provenance pays off: jump straight back to the original incident.
    console.error(`FAIL ${f.id}  (regressed from trace ${f.source})`);
  }
  if (failed.length > 0) process.exit(1);
}

The detail that earns its keep is sourceTraceId. Every case carries a pointer back to the real run it came from. When a case fails six months later, you are not staring at a synthetic input wondering what it was supposed to prove — you open the original AgentLens trace and see the actual incident that motivated it. Your eval set becomes a museum of every real bug you've ever fixed, and the gate's job is to make sure none of them come back.

Curation is a discipline, not a one-time export

Mining traces is not "dump everything into the eval set." A set of 50,000 cases that takes four hours to run is a set nobody runs. Curation means actively managing the corpus:

Stratify by outcome. Deliberately oversample failures and edge inputs. A set that mirrors production exactly is mostly easy cases and tells you almost nothing per dollar of judge spend. You want the hard tail over-represented.
Deduplicate by behavior, not by string. Ten traces that all trip the same tool-selection bug are one case, not ten. Cluster on the failure mode and keep the clearest representative.
Demote cases that have stopped catching anything. When a capability has been rock-solid for months, move its cases to a nightly suite and keep the per-commit gate lean and fast.
Track coverage as a real metric. Which user intents, which tools, which input shapes have zero eval cases? Those gaps are exactly where your next production incident is hiding.
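
Deduplication and coverage tracking are mechanical enough to sketch. This is one possible shape, not a prescribed one: the Candidate type and its failureMode and intent labels are hypothetical and assumed to come from human triage or a clustering step you run yourself, and "clearest representative" is approximated as the shortest input, which is a stand-in you should replace with your own judgment.

```typescript
// Hypothetical metadata attached to each candidate case. Nothing here is an
// AgentLens or agent-eval API; the labels are assumed to exist already.
interface Candidate {
  traceId: string;
  failureMode: string; // e.g. "wrong_tool_selected"
  intent: string;      // e.g. "refund"
  inputChars: number;  // size of the resolved input
}

// Deduplicate by behavior: keep one representative per failure mode.
// Assumption: the shortest input is the easiest one to debug later.
function dedupeByFailureMode(candidates: Candidate[]): Candidate[] {
  const best = new Map<string, Candidate>();
  for (const c of candidates) {
    const current = best.get(c.failureMode);
    if (!current || c.inputChars < current.inputChars) {
      best.set(c.failureMode, c);
    }
  }
  return Array.from(best.values());
}

// Coverage: which known intents have no eval case at all?
function uncoveredIntents(knownIntents: string[], cases: Candidate[]): string[] {
  const covered = new Set(cases.map((c) => c.intent));
  return knownIntents.filter((intent) => !covered.has(intent));
}
```

Run the coverage check against the deduplicated set, not the raw candidates, so the gaps it reports match what the gate actually executes.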

The takeaway

Stop pouring engineering effort into a more sophisticated scorer on top of a dataset you invented. The leverage is in the corpus. Build the pipeline that turns real production failures — captured as AgentLens traces — into permanent agent-eval cases, and your suite stops being a record of what you imagined could go wrong and becomes a record of what actually did. That is the only eval set that gets stronger every week instead of staler.

Your users are writing your test cases for you, every day, for free. The only question is whether you're capturing them — or letting them expire into a log file you'll never query.