There is a specific kind of incident that no alert ever fires for, and it is the one I trust least. Nothing crashed. No exception, no 500, no failed health check. The agent ran every day, returned answers every time, and stayed green on every dashboard you own. And yet, over six weeks, it got measurably worse — and you found out from a customer, not a monitor.
That is drift, and it is the failure mode I think the industry is least prepared for. We have gotten good at catching the cliff: the agent throws, the tool 500s, the JSON won't parse, CI goes red. We are still terrible at catching the slope: answer quality bleeding out two percent a week while every system reports perfect health. Crashes are loud and self-announcing. Drift is silent by construction, and that silence is exactly why it wins.
Here is the opinion I will defend: drift is not an outlier problem, it's a baseline problem. You cannot detect decay by looking at any single run, because a single run looks completely fine. Drift only exists as a change in a distribution over time — so if you are not continuously scoring production and trending the score, you are structurally incapable of seeing it. Not unlucky. Incapable.
Why your code didn't change but your behavior did
The thing that makes drift so disorienting is that it violates our deepest instinct: if the code didn't change, the behavior didn't change. For agents, that is just wrong. Your agent decays while your git history sits perfectly still:
- The model moves under you. You pinned gpt-4o, but a pinned model name is not a pinned model: an alias like that gets repointed to newer snapshots over time, and providers can re-tune behind a stable string. Your prompt is byte-for-byte identical and your outputs shifted anyway.
- The world moves under your prompt. Your few-shot examples were written against March's reality. It is now September. Users ask about products and edge cases that did not exist when you froze the prompt, and the agent improvises: worse, but fluently.
- Your dependencies and inputs move. A retrieval index gets re-embedded; a tool renames a field; your user base grows into a new locale. The agent was never broken for the inputs you tested — it's that the data and traffic it actually serves have drifted away from them, and it keeps running while confidently citing slightly-wrong results.
Not one of these shows up in a code diff. Not one throws. Every one degrades what your users actually experience. This is why "we'll notice if it breaks" is a fantasy — the most expensive agent regressions don't break anything.
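Since none of this lands in a code diff, one option is to give each run its own diffable record. What follows is a minimal sketch, not anything agent-eval or AgentLens ships: the RunProvenance shape, its field names, and the snapshot and index labels are all hypothetical placeholders, and it assumes your provider reports a resolved model identifier in its response. FNV-1a stands in for a real hash; a production version would more likely use SHA-256.

```typescript
// Hypothetical record: these field names are illustrative and are not part of
// agent-eval or AgentLens.
interface RunProvenance {
  modelRequested: string; // the model string you sent
  modelReported: string;  // identifier the provider's response reported (assumed available)
  promptHash: string;     // hash of the fully resolved prompt
  indexVersion: string;   // label for the retrieval index / embedding build
}

// FNV-1a (32-bit): a small non-cryptographic hash, enough to notice a change.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

// Names of the provenance fields that differ between two runs.
function provenanceDiff(a: RunProvenance, b: RunProvenance): (keyof RunProvenance)[] {
  const keys: (keyof RunProvenance)[] = [
    "modelRequested",
    "modelReported",
    "promptHash",
    "indexVersion",
  ];
  return keys.filter((k) => a[k] !== b[k]);
}

const resolvedPrompt = "You are a support agent. Answer using the retrieved documents.";

// Placeholder values: "snapshot-a" / "index-v1" are made up for the example.
const earlierRun: RunProvenance = {
  modelRequested: "gpt-4o",
  modelReported: "snapshot-a",
  promptHash: fnv1a(resolvedPrompt),
  indexVersion: "index-v1",
};
const laterRun: RunProvenance = {
  ...earlierRun,
  modelReported: "snapshot-b",
  indexVersion: "index-v2",
};

// Same requested model, same prompt hash, yet two things moved.
console.log(provenanceDiff(earlierRun, laterRun)); // ["modelReported", "indexVersion"]
```

Stored next to each score, a record like this turns "nothing changed" into a claim you can actually check when a window starts to slide.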
A baseline is the only thing drift is measured against
To detect drift you need two things: a baseline — what "normal" scored like over a trusted window — and a continuous signal, the same score computed the same way on live traffic. Drift is the gap between them, measured statistically, not by eyeball.
The naive version is a single threshold: "alert if quality drops below 0.8." That catches the cliff and misses the slope. A score that walks from 0.91 to 0.82 over five weeks never trips an absolute floor, yet it has lost nearly a tenth of its quality. You are not looking for low; you are looking for moving — a different statistical question, and it needs the baseline.
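To make the floor-versus-slope point concrete, here is a small sketch. The endpoints (0.91 and 0.82) and the 0.8 floor come from the paragraph above; the intermediate weekly values are invented for illustration, and a least-squares slope is just one simple way to ask "is it moving?", not the detector this post builds.

```typescript
// Weekly mean scores. Only the endpoints are from the text; the values in
// between are made up for the example.
const startScore = 0.91;
const endScore = 0.82;
const weeklyMeans: number[] = [startScore, 0.895, 0.88, 0.86, 0.84, endScore];
const FLOOR = 0.8;

// The naive check: how many weeks fell below the absolute floor?
const floorAlerts = weeklyMeans.filter((s) => s < FLOOR).length;

// Ordinary least-squares slope of score against week index (0, 1, 2, ...).
function slopePerWeek(ys: number[]): number {
  const n = ys.length;
  const xBar = (n - 1) / 2;
  const yBar = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  ys.forEach((y, x) => {
    num += (x - xBar) * (y - yBar);
    den += (x - xBar) * (x - xBar);
  });
  return num / den;
}

const weeklySlope = slopePerWeek(weeklyMeans);
const relativeDrop = (startScore - endScore) / startScore;

console.log(floorAlerts);                       // 0: the floor never fires
console.log(weeklySlope.toFixed(3));            // about -0.018 per week
console.log((relativeDrop * 100).toFixed(1));   // about 9.9 (% of quality lost)
```

Zero floor alerts and a tenth of the quality gone: the two checks answer different questions, and only the second one sees the slope.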
This is where evaluation and observability stop being separate concerns and become one workflow — because you need both a thing that scores and a thing that remembers the route. I run agent-eval to score and gate the agent's output: deterministic checks where it can, a model-as-judge rubric where it must, and crucially it persists each verdict so a series of scores exists to trend at all. And I run AgentLens to capture the trace behind every scored run — every model and tool step, the resolved inputs the model actually saw after interpolation, the raw outputs that came back. The pairing is the whole point: agent-eval tells you the score is drifting; AgentLens tells you which step started drifting. A drift alert with no trace behind it is just a number falling on a chart with no way to ask why — and "quality is down 6% this month, cause unknown" isn't an actionable signal, it's an anxiety generator.
Here is a drift detector over a rolling window of scored production runs. The scores come from agent-eval; each run's traceId points back into AgentLens so a flagged window is one click from the evidence:
```typescript
import { queryScoredRuns } from "agent-eval";

interface ScoredRun {
  runId: string;
  traceId: string; // -> AgentLens: the full route that produced this score
  score: number; // agent-eval rubric verdict, 0..1
  at: number; // epoch ms
}

interface DriftReport {
  drifting: boolean;
  baselineMean: number;
  recentMean: number;
  deltaPct: number; // how far recent has moved from baseline
  zScore: number; // is the move bigger than normal run-to-run noise?
  sampleTraceIds: string[]; // worst recent runs, for AgentLens drill-in
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Population standard deviation around a precomputed mean.
function stdev(xs: number[], mu: number): number {
  return Math.sqrt(mean(xs.map((x) => (x - mu) ** 2)));
}

// Compare a recent window against a trusted baseline window.
// Drift = the recent mean has moved further than baseline NOISE explains.
function detectDrift(baseline: ScoredRun[], recent: ScoredRun[]): DriftReport {
  // An empty window would turn every statistic below into NaN.
  if (baseline.length === 0 || recent.length === 0) {
    throw new Error("detectDrift needs at least one run in each window");
  }

  const baseScores = baseline.map((r) => r.score);
  const recentScores = recent.map((r) => r.score);
  const baselineMean = mean(baseScores);
  const recentMean = mean(recentScores);
  // Fall back to a tiny value so a perfectly flat baseline cannot divide by zero.
  const baselineSd = stdev(baseScores, baselineMean) || 1e-9;

  // Standard error of the recent window's mean, scaled by baseline noise.
  // This asks: is this gap real, or just the sample size talking?
  const se = baselineSd / Math.sqrt(recentScores.length);
  const zScore = (recentMean - baselineMean) / se;
  const deltaPct = ((recentMean - baselineMean) / baselineMean) * 100;

  // Flag when quality dropped AND the drop is statistically meaningful.
  // z < -3 ~ a one-sided drop well outside normal run-to-run wobble.
  const drifting = zScore < -3 && deltaPct < -2;

  // Copy before sorting so the caller's array is left untouched.
  const sampleTraceIds = [...recent]
    .sort((a, b) => a.score - b.score)
    .slice(0, 5)
    .map((r) => r.traceId);

  return { drifting, baselineMean, recentMean, deltaPct, zScore, sampleTraceIds };
}

// Roll the windows forward continuously, not on deploy.
async function checkProductionDrift(): Promise<DriftReport> {
  const baseline = await queryScoredRuns({ from: "-30d", to: "-7d" });
  const recent = await queryScoredRuns({ from: "-7d", to: "now" });
  return detectDrift(baseline, recent);
}
```
Two design decisions carry the whole approach.
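It compares against baseline noise, not an absolute floor. The zScore is the entire trick. Every agent's scores wobble run-to-run; that is normal nondeterminism, not decay. By dividing the drop by the standard error of the recent window's mean (baseline spread over the square root of the window size), you only fire when the move is bigger than the agent's own natural jitter. A 3% dip on a noisy agent is nothing; the same dip on a rock-steady one is a five-alarm signal. An absolute threshold cannot tell those apart. The deltaPct < -2 guard covers the opposite failure: with enough samples even a trivial dip becomes statistically significant, so the detector also demands a move big enough to matter.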
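Here is that comparison worked through with numbers, as a sketch. The z formula is the one detectDrift uses; the baseline mean, the two standard deviations, and the window size of 50 are invented for illustration, and the dip is set at 3% so that it also clears the detector's deltaPct < -2 guard.

```typescript
// Same formula as the detector: the gap between means, divided by the standard
// error of the recent window's mean (baseline spread / sqrt(window size)).
function zAgainstBaseline(
  baselineMean: number,
  baselineSd: number,
  recentMean: number,
  recentCount: number,
): number {
  const se = baselineSd / Math.sqrt(recentCount);
  return (recentMean - baselineMean) / se;
}

// Illustrative numbers: both agents average 0.90 and both dip 3% to 0.873.
const typicalMean = 0.9;
const dippedMean = 0.873;
const windowSize = 50;

// Only the run-to-run spread differs between the two agents.
const noisyZ = zAgainstBaseline(typicalMean, 0.15, dippedMean, windowSize);
const steadyZ = zAgainstBaseline(typicalMean, 0.02, dippedMean, windowSize);

console.log(noisyZ.toFixed(2));  // about -1.27: inside normal wobble, no alert
console.log(steadyZ.toFixed(2)); // about -9.55: far past z < -3, alert
```

Identical dip, opposite verdicts. The only input that changed is how much each agent normally wobbles, which is exactly the information an absolute floor throws away.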
It emits sampleTraceIds, not just a verdict. A boolean drifting: true is where most homegrown detectors stop, and it's why they get ignored — nobody can act on it. By attaching the five worst recent runs' trace IDs, the alert carries its own evidence: you open those AgentLens traces and read the resolved inputs and tool outputs that produced the low scores. That is the difference between "quality is down, somebody investigate" and "quality is down, and here is the retrieval step that started returning stale documents."
Segment your baseline or it will lie to you
One trap worth calling out, because it produces the most confusing drift incidents: a healthy aggregate can hide a brutal per-segment collapse. Your overall score slips from 0.90 to about 0.88, easy to wave off as wobble, while your Spanish-language traffic quietly craters from 0.88 to 0.61, masked because it's only 8% of volume and the other 92% is fine. The aggregate is technically accurate and completely useless.
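The masking is plain weighted-average arithmetic, sketched below. The 8% share and the 0.88 to 0.61 collapse come from the paragraph above; the 0.902 score for the remaining traffic is an assumption chosen so the starting aggregate lands near 0.90, and that traffic is assumed to stay flat.

```typescript
interface SegmentScore {
  name: string;
  share: number;  // fraction of total traffic
  before: number; // mean score in the baseline window
  after: number;  // mean score in the recent window
}

const segmentScores: SegmentScore[] = [
  { name: "es", share: 0.08, before: 0.88, after: 0.61 },
  // Assumed: everything else scores 0.902 and does not move.
  { name: "everything else", share: 0.92, before: 0.902, after: 0.902 },
];

// Traffic-weighted mean of whichever score the caller picks.
const weightedMean = (pick: (s: SegmentScore) => number): number =>
  segmentScores.reduce((sum, s) => sum + s.share * pick(s), 0);

const aggregateBefore = weightedMean((s) => s.before);
const aggregateAfter = weightedMean((s) => s.after);
const aggregatePct = ((aggregateAfter - aggregateBefore) / aggregateBefore) * 100;
const spanishPct = ((0.61 - 0.88) / 0.88) * 100;

console.log(aggregateBefore.toFixed(3), aggregateAfter.toFixed(3)); // 0.900 0.879
console.log(aggregatePct.toFixed(1)); // about -2.4 (% move in the aggregate)
console.log(spanishPct.toFixed(1));   // about -30.7 (% move in the segment)
```

A collapse of roughly 31% in one segment shows up as a move of about two points overall, right at the edge of what an aggregate check would bother to flag.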
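So slice the baseline along the dimensions that actually vary (language, tool path, user tier, intent) and run the same drift check per slice. The ScoredRun shape above carries none of those fields, so segmentBy has to derive its key from attributes you attach to each run or look up through its traceId.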
```typescript
async function driftBySegment(segmentBy: (r: ScoredRun) => string) {
  const baseline = await queryScoredRuns({ from: "-30d", to: "-7d" });
  const recent = await queryScoredRuns({ from: "-7d", to: "now" });

  // Bucket runs by segment key, computing the key once per run.
  const group = (runs: ScoredRun[]) => {
    const m = new Map<string, ScoredRun[]>();
    for (const r of runs) {
      const key = segmentBy(r);
      const bucket = m.get(key);
      if (bucket) bucket.push(r);
      else m.set(key, [r]);
    }
    return m;
  };

  const baseGroups = group(baseline);
  const recentGroups = group(recent);

  for (const [seg, recentRuns] of recentGroups) {
    const baseRuns = baseGroups.get(seg);
    // Need signal on both sides to call it: a thin baseline gives a
    // meaningless noise estimate, a thin recent window a meaningless mean.
    if (!baseRuns || baseRuns.length < 20 || recentRuns.length < 20) continue;

    const report = detectDrift(baseRuns, recentRuns);
    if (report.drifting) {
      console.warn(
        `DRIFT [${seg}] ${report.baselineMean.toFixed(3)} -> ` +
          `${report.recentMean.toFixed(3)} (${report.deltaPct.toFixed(1)}%) ` +
          `traces: ${report.sampleTraceIds.join(", ")}`,
      );
    }
  }
}
```
The most dangerous drift hides inside an average. Segmenting the baseline drags it into the light, and the per-segment trace IDs tell you, via AgentLens, exactly which step does badly on those inputs.
What to do Monday
You don't need a statistics PhD or a platform team to start — you need a baseline and a trend:
- Score production continuously, not just on deploy. Sample real traffic, run it through your agent-eval rubric, and persist every verdict with its AgentLens trace ID. Without a series of scores there is no trend, and without a trend there is no drift detection — just hope.
- Trend against a baseline window, and compare to noise, not a floor. Alert on statistically significant movement. You are hunting the slope, not the cliff.
- Segment the baseline — per-language, per-tool, per-tier — to find the collapse before the aggregate smothers it.
- Make every drift alert carry trace IDs. A signal you can't drill into is one your team learns to ignore. The score names the symptom; the trace names the cause.
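For the first item, a sketch of what "sample and persist" can look like. Everything here is an assumption for illustration: hashToUnit, shouldScore, recordVerdict, scoreWithRubric, and the in-memory verdictLog are hypothetical names, not agent-eval or AgentLens APIs, and the fixed 0.9 score stands in for a real rubric call. Hashing the run ID makes the sampling decision repeatable, so the same run is always in or out of the scored set.

```typescript
// Mirrors the ScoredRun fields used earlier: a score is only useful for
// drift detection if it is stored with its trace ID and timestamp.
interface PersistedVerdict {
  runId: string;
  traceId: string;
  score: number; // 0..1
  at: number;    // epoch ms
}

// FNV-1a (32-bit) mapped into [0, 1). Non-cryptographic; a stronger hash
// would spread IDs more evenly.
function hashToUnit(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 0x100000000;
}

// Deterministic sampling: the same run ID always gets the same answer.
function shouldScore(runId: string, sampleRate: number): boolean {
  return hashToUnit(runId) < sampleRate;
}

// In-memory stand-in for whatever store actually holds your verdicts.
const verdictLog: PersistedVerdict[] = [];

function recordVerdict(runId: string, traceId: string, score: number, at: number = Date.now()): void {
  if (score < 0 || score > 1) throw new RangeError("score must be within 0..1");
  verdictLog.push({ runId, traceId, score, at });
}

// Hypothetical placeholder for the real rubric evaluation.
function scoreWithRubric(_runId: string): number {
  return 0.9;
}

const incomingRuns = [
  { runId: "run-1001", traceId: "trace-a1" },
  { runId: "run-1002", traceId: "trace-a2" },
  { runId: "run-1003", traceId: "trace-a3" },
];

for (const run of incomingRuns) {
  if (!shouldScore(run.runId, 0.5)) continue; // sample roughly half of traffic
  recordVerdict(run.runId, run.traceId, scoreWithRubric(run.runId));
}

console.log(`${verdictLog.length} of ${incomingRuns.length} runs scored and persisted`);
```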
The agents are not going to crash on their way down. They will keep answering, keep returning 200s, keep looking healthy, and get quietly worse until the decay is large enough for a human to notice — the most expensive possible detector. Score the output with agent-eval, keep the route with AgentLens, trend the two against a baseline, and you catch the slope while it's still two percent instead of explaining the cliff to a customer who found it first.
Top comments (1)
I think the concept of evals is interesting but quite challenging, particularly when we need the baseline "truth" to compare agentic output against. How do you think we can structure systems to identify this baseline truth? Or maybe another way to phrase this is — what sort of metrics are we looking at when testing the efficacy of modern agentic systems? Successful tool calls? Correct actions? etc?