Conformal prediction silently breaks under drift - and how to make it hold

#python #datascience #statistics #machinelearning

Conformal prediction is the easiest way to put a calibrated uncertainty band around any model: wrap a point predictor, and you get intervals with a finite-sample coverage guarantee — no distributional assumptions. It's deservedly popular.

There's a catch that bites in production: that guarantee is marginal and it assumes exchangeability. The moment your data drifts — almost any time series, any online-serving setting — exchangeability is gone, and split-conformal silently stops delivering the coverage it promises. No error, just a band that's quietly too narrow.

Here's the failure, then a fix that actually holds, with runnable code.

The failure, measured

Target 90% intervals. Residuals whose spread drifts upward over time (a textbook variance shift: the noise scale changes, which is not the same thing as covariate shift). Calibrate split-conformal on the first chunk and let it run:

import numpy as np
rng = np.random.default_rng(0)

T, alpha, W = 4000, 0.10, 500                  # 90% target; W = calibration window
scale = 1.0 + 3.0 * (np.arange(T) / T) ** 2    # residual spread drifts upward
score = np.abs(rng.standard_normal(T) * scale) # nonconformity = |residual|

q = np.quantile(score[:W], 1 - alpha)          # frozen calibration quantile
static = score[W:] <= q
print(round(static.mean(), 3))                 # -> 0.579

58% coverage where you asked for 90%, and in the last quarter of the run, deep into the drift, it's 35%. A dashboard reporting "90% prediction intervals" would be overstating coverage by 32 points overall and 55 points in the tail, with nothing flagging it.
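The 35% tail figure is not printed by the snippet above. A minimal sketch of how it could be measured, assuming "last quarter" means the final 25% of time steps (t >= 0.75 * T). The article does not spell out the exact slice, so the printed value may differ slightly from the one quoted:

```python
import numpy as np

rng = np.random.default_rng(0)
T, alpha, W = 4000, 0.10, 500
scale = 1.0 + 3.0 * (np.arange(T) / T) ** 2
score = np.abs(rng.standard_normal(T) * scale)

q = np.quantile(score[:W], 1 - alpha)    # frozen calibration quantile
covered = score[W:] <= q                 # one flag per evaluated step
t = np.arange(W, T)                      # time index of each flag

tail = t >= int(0.75 * T)                # assumption: "last quarter" = final 25% of steps
overall_cov = covered.mean()
tail_cov = covered[tail].mean()
print(round(overall_cov, 3), round(tail_cov, 3))
```

Reporting the tail separately matters because the cumulative number averages in the early, still-healthy part of the run.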

Why it breaks, and the two things you have to fix

There are two distinct ways drift kills coverage, and they need different fixes:

1. The score scale goes stale. Your calibration scores were collected when residuals were small; now they're large. The frozen quantile is simply too small.
2. The miscoverage rate drifts. Even with a reasonable scale, the realized error rate wanders away from α.

Adaptive Conformal Inference (Gibbs & Candès, 2021) fixes #2 directly. It treats the target miscoverage as a control variable and runs a feedback loop: after each step, nudge α_t up if you've been covering too often, down if you've been missing too often.

alpha_t = alpha_t + gamma * (alpha - err_t)     # err_t = 1 if the point fell outside

A miss pushes α_t down → you use a higher quantile → wider next interval. It's a thermostat for coverage, and it gives a long-run coverage guarantee with no exchangeability assumption.
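To make the sign convention concrete, here is the update as a small function. The name `aci_step` and the example values are illustrative only, not taken from any library:

```python
def aci_step(alpha_t, alpha, gamma, covered):
    """One ACI update: err_t is 1 on a miss and 0 on a cover."""
    err_t = 0.0 if covered else 1.0
    return alpha_t + gamma * (alpha - err_t)

alpha, gamma = 0.10, 0.02
after_miss = aci_step(alpha, alpha, gamma, covered=False)   # 0.10 - 0.02 * 0.90
after_cover = aci_step(alpha, alpha, gamma, covered=True)   # 0.10 + 0.02 * 0.10
print(round(after_miss, 3), round(after_cover, 3))          # -> 0.082 0.102
```

With α = 0.10, a miss moves α_t nine times further than a cover does, so the loop is in balance at roughly one miss per nine covers, which is the 10% miss rate you asked for.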

But ACI adapts the level, not the scale. Point it at a frozen calibration set and it helps a lot but hits a ceiling — once residuals exceed the largest score it ever saw, even α_t → 0 (the widest interval it can form) isn't wide enough. You also have to let the scores track the current regime, e.g. with a rolling window.
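The ceiling can be checked directly. If α_t is clamped to stay inside (0, 1), the widest band a frozen calibration set can supply is bounded by its largest score, so any later score above that value is a miss no matter what α_t does. A rough sketch on the same synthetic data (treating "late drift" as the final 25% of steps is my assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
T, W = 4000, 500
scale = 1.0 + 3.0 * (np.arange(T) / T) ** 2
score = np.abs(rng.standard_normal(T) * scale)

ceiling = score[:W].max()                 # widest radius a frozen pool can supply
late = score[int(0.75 * T):]              # assumption: late drift = final 25% of steps
unreachable = (late > ceiling).mean()     # misses no choice of alpha_t can prevent
print(round(ceiling, 3), round(unreachable, 3))
```

The second number is a floor on the late-run miss rate for any level-only method on this calibration set, which is why the scores themselves have to move.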

Measured, same setup, four ways:

method	overall coverage	coverage in late-drift tail
static split-conformal	0.579	0.347
ACI only (frozen calibration)	0.864	0.786
rolling window only	0.862	0.859
rolling window + ACI	0.900	0.904

Neither piece is enough alone. The rolling window supplies the right scale; ACI supplies the guarantee. Together they land on target (0.900 overall, 0.904 in the tail), even in the part of the series where the static method had collapsed to 35%.

a, hold = alpha, []
for t in range(W, T):
    pool = score[t - W:t]                              # rolling -> tracks the new scale
    a_eff = min(max(a, 1e-3), 1 - 1e-3)
    covered = score[t] <= np.quantile(pool, 1 - a_eff)
    hold.append(covered)
    a += 0.02 * (alpha - (0.0 if covered else 1.0))   # ACI feedback on miscoverage
print(round(np.mean(hold), 3))                         # -> 0.900
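The loop above covers only the combined method. One way to get all four rows from a single function is to put the two ingredients behind flags. This is my reconstruction, not necessarily the exact benchmark behind the table: the `run` helper, its `rolling` and `adapt` flags, and the last-quarter tail slice are assumptions, so the printed numbers may not match the table to the last digit.

```python
import numpy as np

def run(score, W, alpha, gamma=0.02, rolling=True, adapt=True):
    """Return one covered/missed flag per step t >= W."""
    a, hold = alpha, []
    for t in range(W, len(score)):
        # rolling: last W scores; otherwise the frozen first W scores
        pool = score[t - W:t] if rolling else score[:W]
        a_eff = min(max(a, 1e-3), 1 - 1e-3)          # keep the quantile level valid
        covered = bool(score[t] <= np.quantile(pool, 1 - a_eff))
        hold.append(covered)
        if adapt:                                    # ACI feedback on miscoverage
            a += gamma * (alpha - (0.0 if covered else 1.0))
    return np.array(hold)

rng = np.random.default_rng(0)
T, alpha, W = 4000, 0.10, 500
scale = 1.0 + 3.0 * (np.arange(T) / T) ** 2
score = np.abs(rng.standard_normal(T) * scale)

tail = np.arange(W, T) >= int(0.75 * T)              # assumption: tail = final 25% of steps
results = {}
for name, rolling, adapt in [
    ("static split-conformal", False, False),
    ("ACI only (frozen calibration)", False, True),
    ("rolling window only", True, False),
    ("rolling window + ACI", True, True),
]:
    hold = run(score, W, alpha, rolling=rolling, adapt=adapt)
    results[name] = (hold.mean(), hold[tail].mean())
    print(f"{name:32s} {hold.mean():.3f} {hold[tail].mean():.3f}")
```

Keeping both switches in one function makes it harder for the four variants to drift apart in small ways (clamping, window alignment) that would muddy the comparison.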

Three things that matter in practice

The score function decides how close you get to conditional coverage. |y − ŷ| gives you marginal coverage with a constant-width band. If your noise is heteroscedastic and you want bands that are closer to locally right, normalize the score, |y − ŷ| / σ̂(x), or use Conformalized Quantile Regression (CQR), where the score is the signed distance from y to the nearer predicted quantile, max(q̂_lo(x) − y, y − q̂_hi(x)). The formal guarantee stays marginal either way, since distribution-free methods cannot guarantee exact conditional coverage in general. What the choice changes is whether wide intervals show up where the data is actually noisy.
Coverage is a usable drift signal, but a noisy one. Rolling empirical coverage drifting away from 1 − α is a cheap, model-agnostic drift detector. Just remember it's a Bernoulli mean: its standard error is sqrt(c(1−c)/n), so over a 100-point window a 90%-coverage estimate has a standard error of 3 points, or roughly ±6 points at two standard errors. Trigger on sustained deviation, not one short window.
Pick γ for your drift speed. Larger γ tracks faster but makes interval widths jumpier; smaller γ is smoother but lags. 0.01–0.05 is a sane starting range; tune against your realized coverage trace, not in the abstract.
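The standard-error arithmetic from the second point, plus one way to act on "sustained deviation". The helper names, the tolerance, and the patience value are illustrative assumptions, not recommendations:

```python
import numpy as np

def coverage_se(c, n):
    """Standard error of an empirical coverage c over n independent hits."""
    return float(np.sqrt(c * (1.0 - c) / n))

def rolling_coverage(covered, window):
    """Mean of the covered flags over each full sliding window."""
    flags = np.asarray(covered, dtype=float)
    return np.convolve(flags, np.ones(window) / window, mode="valid")

def sustained_breach(roll, target, tol, patience):
    """True once `patience` consecutive windows sit more than `tol` from target."""
    run = 0
    for value in roll:
        run = run + 1 if abs(value - target) > tol else 0
        if run >= patience:
            return True
    return False

se = coverage_se(0.90, 100)
print(round(se, 3))                       # -> 0.03
```

A tolerance of about two standard errors is one plausible starting point. Overlapping windows are strongly correlated, so `patience` is best judged relative to the window length rather than as a count of independent checks.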

The takeaway

A guarantee that assumes exchangeability is not a guarantee in production — it's an assumption wearing a guarantee's clothes. What makes ACI worth reaching for is that it drops the assumption and replaces it with a feedback loop you can actually verify online: watch the realized coverage, and let it correct itself. If you serve intervals anywhere a too-narrow band is expensive, that self-correction is the difference between a number you can trust and one that quietly lies as the world moves.

I work on reliability and verification for numerical and AI systems: calibration, drift, and "does the guarantee actually hold under load" tooling. The benchmark above is fully runnable; I'm happy to compare notes if you're putting conformal methods into production.

Top comments (1)

Maya Andersson • Jun 18

The marginal-vs-conditional distinction is exactly the thing people skip, and the 58% number makes it concrete. Worth naming the two escape hatches for readers who want to go further: adaptive conformal inference (Gibbs and Candes) adjusts alpha online from realized coverage, and weighted / nonexchangeable conformal (Barber et al) reweights the calibration scores. ACI is usually the lighter lift inside a serving loop. The same failure shows up in LLM-as-judge calibration, which is my corner: you fix a judge threshold on one slice of traffic, the input distribution moves, and your '90% agreement with humans' quietly becomes 70% with nothing thrown. We re-estimate the calibration quantile on a rolling window for that reason. One question on your setup: are you measuring coverage on a rolling window too, or cumulatively? Cumulative coverage can look fine long after the tail has gone bad.