Paulo Victor Leite Lima Gomes

Posted on Jun 16

agent evaluation is becoming the new test pyramid

#aiagents #evaluation #testing #platformengineering

We are starting to rediscover testing, but with more tool calls.

AWS published a post last week about Agent-EvalKit, an open-source toolkit for evaluating AI agents. The interesting part is not that another eval framework exists. We have plenty of those already, and half of them seem to be born with a leaderboard attached.

The interesting part is the shape of the problem it admits.

For normal software, you can often test the output and learn something useful. Give a function an input. Check the return value. Mock the slow thing. Assert the behavior. Add a regression test when it breaks.

Agents make that much less satisfying.

An agent can produce the right-looking answer for the wrong reason. It can call the wrong tool, ignore an empty result, invent the missing value, and still write a beautifully formatted response. It can accidentally skip the verification step that made the workflow trustworthy. It can get lucky on the final answer while quietly doing something you would never want as a habit.

That is the annoying thing about agents: the answer is not the only behavior.

The path matters.

output tests are not enough

I understand why teams start with output tests.

They are familiar. They are cheap to explain. They map nicely to product expectations: the user asked this, the agent answered that, the answer was good or bad.

But agents are not just text generators once we give them tools. They become small distributed systems with a language model in the middle. They read state, choose tools, pass parameters, interpret responses, make follow-up calls, write files, update tickets, open pull requests, and sometimes decide that silence from a tool is enough evidence to continue.

If you only check the final response, you miss the failure modes that matter most.

Imagine a travel agent that returns a neat itinerary with flights, weather, exchange rates, and attraction details. The final answer is readable. The structure is useful. The tone is confident.

Now inspect the trace and discover that the currency tool returned nothing, the weather lookup failed, and the agent filled the gaps from vibes.

The user-facing answer was not the test. It was the cover story.

This is why the AWS example is useful. Their demo agent had high response quality but terrible faithfulness. In plain English: it sounded good while making things up when tools returned empty or incomplete data.

That is exactly the kind of bug output-only testing will flatter.

the new unit is the trace

The next useful testing unit for agents is not the prompt, and it is not the final message.

It is the run.

A run contains the input, model messages, tool calls, tool outputs, intermediate state, final response, timing, failures, retries, and maybe cost. That is the thing you evaluate because that is the thing that actually happened.
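To make that list concrete, here is a minimal sketch of a run record in Python. The class and field names are my own assumptions for illustration. They are not the schema of Agent-EvalKit or any other toolkit.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """One tool invocation inside a run (hypothetical shape)."""
    name: str                     # which tool the agent chose
    arguments: dict[str, Any]     # parameters the agent passed
    output: Any = None            # what the tool returned
    error: Optional[str] = None   # set when the call failed
    duration_ms: float = 0.0      # timing
    retries: int = 0


@dataclass
class Run:
    """Everything that happened for one user request (hypothetical shape)."""
    user_input: str
    messages: list[dict[str, str]] = field(default_factory=list)  # model messages
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_response: str = ""
    cost_usd: Optional[float] = None  # optional, not every setup tracks cost

    def failed_or_empty_calls(self) -> list[ToolCall]:
        """Tool calls that errored or returned nothing usable."""
        return [
            c for c in self.tool_calls
            if c.error or c.output in (None, "", [], {})
        ]
```

Even a record this small lets you ask whether the final response leaned on a tool call that failed or came back empty.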

This sounds heavier than a unit test because it is.

But we went through a version of this before. Unit tests were never enough for distributed systems. We added integration tests, contract tests, synthetic checks, tracing, canaries, chaos experiments, and production monitoring because the behavior we cared about lived between components.

Agents push us into the same place, just with a softer and more annoying component in the loop.

The model is not deterministic enough to treat like a normal function. The tools are not decorative enough to ignore. The prompt is not complete enough to be the whole spec. The final answer is not honest enough to be the whole evidence trail.

So the trace becomes the test artifact.

Did the agent call the right tool? Did it pass the right parameters? Did it notice when the tool returned empty data? Did it distinguish known facts from guesses? Did it use the cheaper path when the expensive one was unnecessary? Did it stop when policy said stop?

Those are test questions.

They just do not look like expect(result).toEqual(...).
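A rough sketch of what they can look like instead, in Python. It assumes the trace is a plain dict with a `tool_calls` list, and that the agent reports missing data in a structured `reported_gaps` field. Both are assumptions made for illustration, not a real framework's format.

```python
def check_trace(trace: dict, expected_tool: str, expected_args: dict) -> list[str]:
    """Return human-readable problems found in one recorded run."""
    problems = []
    calls = trace.get("tool_calls", [])

    # Did the agent call the right tool with the right parameters?
    matching = [c for c in calls if c["name"] == expected_tool]
    if not matching:
        problems.append(f"expected tool {expected_tool!r} was never called")
    elif not any(c["arguments"] == expected_args for c in matching):
        problems.append(f"{expected_tool!r} was called with unexpected arguments")

    # Did it notice when a tool returned empty data?
    # Any falsy output (None, "", [], {}) counts as empty in this sketch.
    reported = set(trace.get("reported_gaps", []))
    for call in calls:
        if not call.get("output") and call["name"] not in reported:
            problems.append(
                f"{call['name']!r} returned nothing and the agent did not report the gap"
            )

    return problems
```

An empty list means the run passed these two checks. Anything else is a sentence you can paste into a bug report.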

this is a platform feature

I do not think most product teams should build this from scratch.

That is not because they are incapable. It is because the work is tedious in exactly the way platform work is tedious: instrumentation, fixtures, synthetic cases, replay, trace storage, evaluator prompts, thresholds, reporting, CI integration, and enough history to see whether the agent got better or worse after a change.

You can absolutely hack together a notebook that scores a handful of examples.

That is not the same as an evaluation system.

An evaluation system needs to survive normal engineering life. Prompts change. Tools change. Models change. Schemas change. Product behavior changes. One team wants faithfulness. Another cares about tool parameter accuracy. Another cares about latency and cost. The security team wants to know whether the agent touched the wrong capability. The support team has real examples from customers that should become test cases.

This is where I think agent platforms will mature quickly.

The model picker is not enough. The chat UI is not enough. The workflow builder is not enough.

If the agent can take actions, the platform needs to make those actions measurable.

evals are not a scoreboard

The least useful version of this is a dashboard that says your agent is 87.3 percent good.

That number may be interesting, but it is not very actionable by itself. Good against what? Which tools? Which failure modes? Which customer scenarios? Which version of the prompt? Which model? Which hidden assumption?

Evaluation becomes useful when it points back to an engineering change.

This is one of the smarter parts of the Agent-EvalKit framing: the report is supposed to produce code-level recommendations, not just abstract scores. In the AWS example, the practical fix was not "make the agent better." It was closer to "add guardrails for empty tool results and improve error handling along the paths where the agent fabricates facts."

That is the difference between a metric and feedback.

A metric tells you faithfulness is low.

Feedback tells you where the agent loses contact with reality.

I want evaluation systems that create the second thing.
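As an illustration of what a guardrail for empty tool results could look like, here is a small Python wrapper. It is my own sketch, not the fix from the AWS post, and the returned dict shape is an assumption.

```python
from typing import Any, Callable


def guard_empty(tool: Callable[..., Any], name: str) -> Callable[..., dict]:
    """Wrap a tool so empty results and errors become explicit statuses."""

    def wrapped(*args: Any, **kwargs: Any) -> dict:
        try:
            result = tool(*args, **kwargs)
        except Exception as exc:
            # Surface the failure instead of letting the agent see silence.
            return {"tool": name, "status": "error", "data": None,
                    "note": f"{name} failed: {exc}. Do not guess this value."}
        if result in (None, "", [], {}):
            return {"tool": name, "status": "empty", "data": None,
                    "note": f"{name} returned no data. Do not guess this value."}
        return {"tool": name, "status": "ok", "data": result, "note": ""}

    return wrapped
```

The point of the wrapper is that "nothing came back" reaches the model as a stated fact rather than as a gap it is free to fill.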

production will still surprise you

There is a trap here, because engineers love turning messy things into gates.

I am not against gates. If an agent workflow is important, there should be thresholds. A regression in faithfulness, tool accuracy, latency, or policy compliance should block a release the same way a broken test blocks a release.
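A gate like that can be a very small piece of code. The sketch below assumes an earlier step has already produced one aggregate score per metric. The metric names and threshold values are placeholders, not recommendations.

```python
# Hypothetical thresholds: "min" means the score must not fall below the limit,
# "max" means it must not rise above it.
THRESHOLDS = {
    "faithfulness": ("min", 0.90),
    "tool_accuracy": ("min", 0.95),
    "policy_compliance": ("min", 1.00),
    "p95_latency_s": ("max", 8.0),
}


def gate(scores: dict[str, float]) -> list[str]:
    """Return the reasons a release should be blocked (empty list means pass)."""
    failures = []
    for metric, (kind, limit) in THRESHOLDS.items():
        value = scores.get(metric)
        if value is None:
            # A missing score blocks the release too. Silence is not a pass.
            failures.append(f"{metric}: missing score")
        elif kind == "min" and value < limit:
            failures.append(f"{metric}: {value} is below minimum {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{metric}: {value} is above maximum {limit}")
    return failures
```

In CI, the job would print the failures and exit non-zero when the list is not empty.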

But agent evaluation will not end at CI.

The weird cases will come from production. Users will ask things your synthetic data did not cover. Tools will return malformed data. Vendor APIs will degrade. Someone will add a new capability and accidentally change the search path. A model upgrade will improve the average answer and break one important edge case. A prompt edit will reduce hallucinations while making the agent annoyingly cautious.

That means the loop has to continue after deployment.

Real traces should feed new test cases. Rejected outputs should become examples. Incident analysis should add scenarios. Human review should calibrate the evaluator instead of being replaced by it. The test set should become a living artifact of what the organization has learned.
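One way to make "real traces feed new test cases" mechanical is to freeze a rejected run into a replayable case. This Python sketch assumes traces are plain dicts and that cases live in a JSON Lines file. The field names are hypothetical.

```python
import json


def trace_to_regression_case(trace: dict, reason: str, expectation: str) -> dict:
    """Turn a rejected production run into a regression case."""
    return {
        "input": trace["user_input"],
        # Recorded tool outputs become fixtures, so the case can be replayed
        # without calling live tools.
        "tool_fixtures": [
            {"name": c["name"], "arguments": c["arguments"], "output": c.get("output")}
            for c in trace.get("tool_calls", [])
        ],
        "rejected_response": trace.get("final_response", ""),
        "reason": reason,            # why a human rejected the output
        "expectation": expectation,  # what the agent should have done instead
    }


def append_case(path: str, case: dict) -> None:
    """Append one case to a JSON Lines file, one case per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
```

The `reason` and `expectation` fields are where human review gets recorded, which is what keeps the set a record of what the team learned rather than a pile of logs.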

This is where the test pyramid metaphor is useful, but only if we do not take it too literally.

Agent evaluation probably needs layers: cheap deterministic checks, code-based assertions, LLM-as-judge scoring, trace inspection, human review, production monitoring, and regression suites built from real failures.

Not every workflow needs all of that.

But serious workflows need more than "the answer looked fine."
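The layering can be sketched as a list of checks ordered from cheapest to most expensive, stopping at the first layer that fails so the costly ones only run on traces that survived the cheap ones. The trace shape and the two example checks below are assumptions. An LLM-as-judge layer would be one more function with the same signature.

```python
from typing import Callable

# A check takes a trace dict and returns the problems it found.
Check = Callable[[dict], list[str]]


def has_final_response(trace: dict) -> list[str]:
    """Cheap deterministic check: the agent said something at all."""
    return [] if trace.get("final_response", "").strip() else ["empty final response"]


def no_tool_errors(trace: dict) -> list[str]:
    """Code-based assertion: no tool call ended in an error."""
    return [
        f"tool {c['name']!r} failed: {c['error']}"
        for c in trace.get("tool_calls", [])
        if c.get("error")
    ]


def evaluate_in_layers(trace: dict, layers: list[tuple[str, Check]]) -> dict:
    """Run layers in order and stop at the first one that reports problems."""
    for name, check in layers:
        problems = check(trace)
        if problems:
            return {"passed": False, "failed_layer": name, "problems": problems}
    return {"passed": True, "failed_layer": None, "problems": []}
```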

what i would start with

If I were introducing this in a team, I would not start with a grand universal eval platform.

I would pick one agent workflow that already matters.

Then I would define three things:

the final outcome that must be correct
the tool behavior that must be trustworthy
the failure mode that would embarrass us in production

For a support agent, that might be: answer grounded in retrieved docs, no invented policy, and escalation when confidence is low.

For a coding agent, it might be: tests run before the PR, no files outside scope, and no dependency changes without explicit instruction.

For an operations agent, it might be: read-only diagnosis by default, approved command list, and clear refusal when the requested action is unsafe.

Then I would capture traces and build a small regression set around those expectations.
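For the coding agent example, a first regression check could be as plain as this Python sketch. The tool names (`run_tests`, `write_file`, `open_pull_request`) and the trace shape are hypothetical, and the dependency file list is only a starting point.

```python
from pathlib import PurePosixPath

DEPENDENCY_FILES = {"requirements.txt", "pyproject.toml", "package.json", "go.mod"}


def check_coding_run(tool_calls: list[dict], allowed_dir: str,
                     deps_allowed: bool = False) -> list[str]:
    """Check one coding-agent run against the three expectations."""
    problems = []
    names = [c["name"] for c in tool_calls]

    # Tests must run before the PR is opened.
    if "open_pull_request" in names:
        pr_index = names.index("open_pull_request")
        if "run_tests" not in names[:pr_index]:
            problems.append("pull request opened before tests were run")

    for call in tool_calls:
        if call["name"] != "write_file":
            continue
        path = PurePosixPath(call["arguments"]["path"])
        # No files outside scope. The comparison is lexical, so any ".."
        # segment is rejected outright rather than resolved.
        if ".." in path.parts or not path.is_relative_to(allowed_dir):
            problems.append(f"wrote outside {allowed_dir}: {path}")
        # No dependency changes without explicit instruction.
        if path.name in DEPENDENCY_FILES and not deps_allowed:
            problems.append(f"changed dependency file: {path}")

    return problems
```

It is not elegant, but it is the kind of check that tells you on the first day whether the agent opens pull requests without running the tests.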

The first version does not need to be elegant. It needs to be honest.

Once the team can see where the agent cheats, guesses, skips, overreaches, or wastes money, the next platform requirements become obvious.

the punchline

Agents are forcing evaluation to grow up.

Checking the final answer was fine when the agent was basically a chat box. It is not fine when the agent can inspect systems, call tools, write files, and make decisions that other people treat as work.

The mature question is no longer only "did it answer correctly?"

It is also "did it get there in a way we trust?"

That question needs traces, tool-call checks, faithfulness metrics, regression suites, production feedback, and reports that point to actual fixes.

In other words, it needs the boring testing culture we already learned to need everywhere else.

The agent era does not make tests obsolete.

It makes the test artifact bigger.

references

AWS: Evaluate AI agents systematically with Agent-EvalKit
