A model with R-squared near 0 can still give valid 90% prediction intervals - here's why (and the catch)

#python #datascience #statistics #machinelearning

I recently calibrated a recovery-rate model that had only two weak features. Its point accuracy was almost nothing — R² basically zero. I expected its uncertainty estimates to be junk too. They weren't: the 90% conformal prediction intervals covered ~89% of held-out outcomes. Valid, just wide.

That surprised me enough to nail it down, because it contradicts a belief a lot of us carry around: "my model isn't accurate, so I can't trust its uncertainty." For split conformal prediction, that's backwards. Here's the precise statement, a runnable demo, and the one caveat that actually bites.

Coverage is a property of the procedure, not the model

Split conformal prediction gives a distribution-free, finite-sample marginal coverage guarantee:

P( Y ∈ Ĉ(X) ) ≥ 1 − α

and it holds for any point model, as long as the calibration and test data are exchangeable. The model is a black box. You fit it however you like, then on a held-out calibration set of size n you take the ⌈(n+1)(1−α)⌉-th smallest absolute residual (roughly the (1−α) quantile, nudged up to account for the finite sample), and that value becomes the half-width of your intervals.
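The calibration step on its own is only a few lines. Here is a minimal sketch, assuming absolute-residual scores; the helper name `conformal_quantile` is mine, not from any library, and returning infinity when the calibration set is too small for the requested level is the standard convention rather than something the demo below needs.

```python
import numpy as np

def conformal_quantile(residuals, alpha):
    """Half-width for split conformal intervals from calibration residuals.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest absolute residual,
    or infinity when that rank exceeds n (too few calibration points).
    """
    residuals = np.abs(np.asarray(residuals, dtype=float))
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(residuals)[k - 1]

# Toy usage: three calibration residuals, 50% target coverage.
q = conformal_quantile([3.0, -1.0, 2.0], alpha=0.5)
lower, upper = 10.0 - q, 10.0 + q   # interval around a point prediction of 10.0
print(q, lower, upper)
```

With 1,500 calibration points and α = 0.10, the rank works out to ⌈1501 × 0.9⌉ = 1351, so the half-width is the 1351st smallest residual.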

Nowhere does that construction require the model to be good. A bad model just has large residuals, so the calibration quantile is large, so the intervals are wide — wide enough to still cover at the stated rate. Accuracy doesn't buy you validity; it buys you efficiency (narrower intervals at the same coverage).
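To see the claim without any modeling library in the way, here is a small simulation sketch. Everything in it is an illustrative assumption: the "model" always predicts 0, the outcomes are heavy-tailed Student-t draws, and the sizes (200 calibration, 200 test, 500 trials) are arbitrary. For continuous scores the expected coverage is exactly k/(n+1), which the average should sit close to.

```python
import numpy as np

def one_trial(rng, n_cal=200, n_test=200, alpha=0.10):
    """Calibrate a constant-zero predictor once and return its test coverage."""
    # Heavy-tailed outcomes with mean 0; predicting 0 explains essentially nothing.
    y_cal = rng.standard_t(df=3, size=n_cal)
    y_test = rng.standard_t(df=3, size=n_test)
    res = np.sort(np.abs(y_cal - 0.0))           # calibration residuals
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))  # finite-sample rank (181 here)
    q = res[k - 1]                               # k <= n_cal for these sizes
    return np.mean(np.abs(y_test - 0.0) <= q)

rng = np.random.default_rng(0)
coverages = np.array([one_trial(rng) for _ in range(500)])
mean_coverage = coverages.mean()
print(f"mean coverage over 500 trials: {mean_coverage:.3f}")
print(f"theoretical k/(n+1):           {181 / 201:.3f}")
```

Any single trial can land a couple of points above or below 90%, since the guarantee is about the average over calibration and test draws, not about every individual split.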

The demo (numbers are reproducible, seed fixed)

Same dataset and target, three models from strong to useless, target coverage 90%:

model	R²	marginal coverage	mean interval width
gradient boosting	0.741	0.895	5.39
weak linear (1 noisy feature)	0.061	0.905	10.39
predict-the-mean	−0.000	0.907	10.83

All three land at ~90% coverage. The only thing that changes is width: the good model's intervals are half as wide. That's the whole story in one table — validity is constant, efficiency tracks accuracy.

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(20260617)
n = 6000
X = rng.normal(size=(n, 5))
group = rng.integers(0, 3, size=n)               # subgroup label, deliberately NOT a feature
y = X @ np.array([2.0, -1.5, 1.0, 0.5, -0.8]) + 1.5 * group + rng.normal(size=n)

# 3000 train / 1500 calibration / 1500 test
s = lambda a: (a[:3000], a[3000:4500], a[4500:])
Xtr, Xcal, Xte = s(X); ytr, ycal, yte = s(y); _, _, gte = s(group)
ALPHA = 0.10

def conformal(model, label):
    model.fit(Xtr, ytr)
    res = np.abs(ycal - model.predict(Xcal))         # calibration residuals
    k = int(np.ceil((len(res) + 1) * (1 - ALPHA)))   # finite-sample rank
    q = np.sort(res)[min(k, len(res)) - 1]           # calibration quantile (rank clamped to n)
    pred = model.predict(Xte)
    covered = (yte >= pred - q) & (yte <= pred + q)
    r2 = 1 - np.sum((yte - pred)**2) / np.sum((yte - yte.mean())**2)
    gcov = {int(g): round(covered[gte == g].mean(), 3) for g in np.unique(gte)}
    print(f"{label}: R2={r2:6.3f} cov={covered.mean():.3f} width={2*q:5.2f} group={gcov}")

conformal(GradientBoostingRegressor(random_state=0), "strong")

class Weak(LinearRegression):
    # Linear model that only ever sees the last feature column
    def fit(self, X, y): return super().fit(X[:, 4:5], y)
    def predict(self, X): return super().predict(X[:, 4:5])
conformal(Weak(), "weak  ")

# Predict-the-mean baseline (third row of the table)
conformal(DummyRegressor(strategy="mean"), "mean  ")

The catch: marginal ≠ conditional

Here's the part you can't skip. The guarantee is marginal — averaged over the whole distribution. It says nothing about coverage within a subgroup. Watch what the same run reports per subgroup:

model	marginal	group 0	group 1	group 2
strong GBM	0.895	0.835	0.985	0.857
predict-the-mean	0.907	0.889	0.933	0.897

The strong model has the worse conditional coverage: groups 0 and 2 sit at 83–86% while group 1 is over-covered at 98%. In this dataset the group offset is not one of the features, so the residuals are shifted differently in each group. A single global residual quantile produces constant-width intervals that can't adapt to that, so it robs the hard groups to pay the easy one. (The mean-only model looks more uniform here only because its residuals are wide enough that the same group shift matters less in relative terms. That is luck, not virtue.)

If your decisions are made per-subgroup — per region, per asset class, per customer segment — marginal coverage is not enough, and a high overall number can hide silent under-coverage where it matters. The fixes are Mondrian / group-conditional conformal (calibrate a separate quantile per group) or a normalized/locally-weighted nonconformity score so interval width adapts.
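Here is a minimal sketch of the group-conditional (Mondrian) fix, on assumptions of my own rather than the demo's exact setup: synthetic data with the same kind of hidden group offset, a plain least-squares point model that never sees the group, and larger sample sizes so the per-group numbers are stable. The only change from the global recipe is that each group gets its own calibration quantile.

```python
import numpy as np

def conformal_quantile(residuals, alpha):
    """ceil((n + 1) * (1 - alpha))-th smallest residual, or inf if n is too small."""
    n = len(residuals)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(residuals)[k - 1]

rng = np.random.default_rng(0)
n, alpha = 30000, 0.10
X = rng.normal(size=(n, 2))
group = rng.integers(0, 3, size=n)                 # withheld from the model
y = X @ np.array([2.0, -1.5]) + 1.5 * group + rng.normal(size=n)

tr, cal, te = slice(0, 10000), slice(10000, 20000), slice(20000, None)

# Point model: least squares on the features plus an intercept, fit on train only.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A[tr], y[tr], rcond=None)
res = np.abs(y - A @ coef)                         # absolute residuals everywhere

# One global quantile versus one quantile per group, both from calibration data.
q_global = conformal_quantile(res[cal], alpha)
q_group = {g: conformal_quantile(res[cal][group[cal] == g], alpha) for g in range(3)}

global_cov, mondrian_cov = {}, {}
for g in range(3):
    r = res[te][group[te] == g]                    # test residuals in this group
    global_cov[g] = float(np.mean(r <= q_global))
    mondrian_cov[g] = float(np.mean(r <= q_group[g]))
    print(f"group {g}: global={global_cov[g]:.3f} mondrian={mondrian_cov[g]:.3f} "
          f"half-width={q_group[g]:.2f}")
```

The price is that each quantile is estimated from fewer calibration points, so small groups get noisier (and, below roughly 1/α points, infinitely wide) intervals. The guarantee also only covers groups you chose up front.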

What to take away

A weak model gives you wide but honest intervals, not invalid ones. "The model is bad so the uncertainty is meaningless" is the wrong instinct — wide intervals are the correct signal that the model doesn't know much.
The genuinely dangerous case is the opposite: a confident-looking narrow interval whose coverage is a lie. That happens not from low accuracy but from a broken exchangeability assumption — distribution drift between calibration and deployment. (That failure mode, and adaptive conformal as the fix, is a separate write-up.)
Always check conditional coverage on the groups you actually act on. The marginal number is necessary, not sufficient.
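That last check is easy to automate. A minimal sketch, with a hypothetical helper `coverage_by_group` and a flagging rule that is my assumption, not a standard: a group is flagged when its empirical coverage falls more than two binomial standard errors below the target.

```python
import numpy as np

def coverage_by_group(covered, groups, target):
    """Per-group empirical coverage with a rough under-coverage flag.

    covered: boolean array, True where the interval contained the outcome.
    groups:  array of group labels, same length as covered.
    target:  nominal coverage, e.g. 0.90.
    """
    covered = np.asarray(covered, dtype=bool)
    groups = np.asarray(groups)
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        n = int(mask.sum())
        cov = float(covered[mask].mean())
        se = np.sqrt(target * (1 - target) / n)    # binomial SE at the target rate
        report[g.item()] = {
            "coverage": cov,
            "n": n,
            "under_covered": bool(cov < target - 2 * se),  # assumed 2-SE rule
        }
    return report

# Toy usage: group 0 covered 90/100 times, group 1 only 70/100 times.
covered = np.array([True] * 90 + [False] * 10 + [True] * 70 + [False] * 30)
groups = np.array([0] * 100 + [1] * 100)
report = coverage_by_group(covered, groups, target=0.90)
for g, row in report.items():
    print(g, row)
```

Here the pooled coverage is 80%, but even a healthy-looking pooled number would not tell you which group is carrying the misses. The per-group rows do.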

Conformal prediction is one of the few tools that gives you a real guarantee with almost no assumptions. Just remember which guarantee it gives — coverage over the whole distribution — and verify the rest yourself.