DEV Community

Whatsonyourmind

Your bandit's exploration floor probably violates its own floor

Most multi-armed bandit and A/B allocation systems add a minimum exploration weight: every arm should get at least, say, 5% of traffic, so no variant is ever fully starved and you keep collecting data on all of them. The guarantee sounds simple (p_i >= f for every arm) and the implementation looks even simpler:

import numpy as np

def clip_renorm(w, f):
    p = np.maximum(w, f)   # raise anything below the floor up to it
    return p / p.sum()     # renormalize so probabilities sum to 1

This is wrong, and it fails silently. The renormalization step pushes the floored arms back below the floor.

Why clip-then-renormalize breaks

Clipping raises the small weights up to f, which makes the total exceed 1. Dividing by that total then scales everything down — including the arms you just clipped to f. So they land below f again, and the floor you advertised is not the floor you enforce.
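To make the mechanism concrete, here is a minimal sketch that reports the clipped total S and the share f / S that every clipped arm ends up with. The helper name `clip_renorm_floor` and the example weights are illustrative assumptions, not part of the original routine.

```python
import numpy as np

def clip_renorm_floor(w, f):
    """Return (S, f / S): the total after clipping and the share a clipped arm receives."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                          # assume the weights live on the simplex
    clipped_total = np.maximum(w, f).sum()   # S > 1 as soon as any weight is below f
    return clipped_total, f / clipped_total

# Illustrative weights: two arms sit below the 10% floor.
total, effective = clip_renorm_floor([0.70, 0.20, 0.05, 0.05], f=0.10)
print(f"clipped total = {total:.4f}, effective floor = {effective:.4f}")
```

Any arm that was clipped receives f / S, so the enforced floor shrinks by exactly the factor by which clipping inflated the total.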

Concrete case — 4 arms, a confident winner, floor f = 0.10:

w   = [0.94, 0.02, 0.02, 0.02]   floor = 0.10
clip-renorm -> [0.7581, 0.0806, 0.0806, 0.0806]   min = 0.0806  ❌ (< 0.10)

The three starved arms each get 8.06%, not the 10% you promised. And it isn't an edge case. Over 100,000 random peaky weight vectors (Dirichlet, α=0.3, n=4, f=0.10):

clip-and-renormalize violated the floor 97.2% of the time — worst arm seen: 7.69% against a 10% floor.

The floor leaks whenever any arm's weight falls below f, which is exactly what happens when one arm dominates, that is, when the bandit is exploiting.
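The 7.69% figure matches the worst case you would expect from the arithmetic: the clipped total is largest when one arm holds essentially all the weight, giving 1 + (n - 1) * f, so the enforced floor cannot drop below f / (1 + (n - 1) * f). The sketch below checks that bound for n = 4 and f = 0.10; `worst_case_floor` is a hypothetical helper name, and the bound assumes non-negative weights summing to 1 with f <= 1.

```python
import numpy as np

def clip_renorm(w, f):
    p = np.maximum(w, f)
    return p / p.sum()

def worst_case_floor(n, f):
    # One arm holds ~all the weight and the other n - 1 arms are clipped up to f,
    # so the clipped total is 1 + (n - 1) * f.
    return f / (1.0 + (n - 1) * f)

n, f = 4, 0.10
bound = worst_case_floor(n, f)
near_vertex = np.array([1.0, 0.0, 0.0, 0.0])    # extreme "confident winner"
observed = clip_renorm(near_vertex, f).min()
print(f"bound = {bound:.4f}, observed = {observed:.4f}")
```

With these values both numbers come out at about 0.0769, in line with the worst arm reported above.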

The fix: one affine map onto the simplex

Instead of clipping, mix the learned weights with the uniform floor. Put the weights on the simplex (sum(w) = 1), then:

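def additive_simplex(w, f):
    w = np.asarray(w, dtype=float)      # accept lists as well as arrays
    w = w / w.sum()                     # put the weights on the simplex
    return f + (1.0 - len(w) * f) * w   # floor plus a scaled copy of w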

Each output is f plus a non-negative term (given w_i >= 0 and n*f <= 1), so p_i >= f holds, and the total is n*f + (1 - n*f)*1 = 1 by construction. No renormalization is needed, so nothing gets dragged back under the floor. Because the map is affine with a positive slope whenever n*f < 1, it also preserves the ordering and relative spacing of w, so you don't distort the policy you learned. Same case as above:

additive-simplex -> [0.664, 0.112, 0.112, 0.112]   min = 0.112  ✅

Over the same 100,000 vectors it violated the floor 0.00% of the time.
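One way to read the affine map is as a mixture: with probability eps = n * f pick an arm uniformly, otherwise follow w. The sketch below checks that the two forms agree on the 4-arm example and that the output sums to 1 and clears the floor. `uniform_mixture` is a hypothetical helper introduced only for this comparison.

```python
import numpy as np

def additive_simplex(w, f):
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - len(w) * f) * w

def uniform_mixture(w, eps):
    # Hypothetical helper: uniform with probability eps, the learned weights otherwise.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return eps / len(w) + (1.0 - eps) * w

w, f = [0.94, 0.02, 0.02, 0.02], 0.10
p = additive_simplex(w, f)
q = uniform_mixture(w, eps=len(w) * f)

print(np.round(p, 3))
print("same as mixture:", np.allclose(p, q))
print("sums to 1:", np.isclose(p.sum(), 1.0), "| clears floor:", p.min() >= f)
```

Seen this way, a floor of f on n arms costs n * f of total traffic, which is a convenient number to quote when choosing f.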

The one guard you do need

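The map needs n * f <= 1: you can't promise four arms a 30% floor each (that's 120%). Past that point the coefficient 1 - n*f goes negative, which inverts the ordering of the arms, breaks the floor, and for large enough f produces negative weights. Handle it explicitly: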

def exploration_floor(w, f):
    n = len(w)
    if f < 0:
        raise ValueError("floor must be non-negative")
    if n * f >= 1.0:
        return np.full(n, 1.0 / n)          # floor is infeasible -> uniform
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - n * f) * w

That's the whole correct primitive: a non-negativity check, an infeasible-floor fallback to uniform, and the affine mix.
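A short usage sketch exercising the three paths, using the 4-arm example from earlier. The function is repeated so the snippet runs on its own, and the `unguarded` line is only an illustration of what the bare affine map would return for an infeasible floor.

```python
import numpy as np

def exploration_floor(w, f):
    # Same function as above, repeated so this snippet is self-contained.
    n = len(w)
    if f < 0:
        raise ValueError("floor must be non-negative")
    if n * f >= 1.0:
        return np.full(n, 1.0 / n)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return f + (1.0 - n * f) * w

w = [0.94, 0.02, 0.02, 0.02]

feasible = exploration_floor(w, 0.10)     # affine mix
infeasible = exploration_floor(w, 0.30)   # 4 * 0.30 > 1, so uniform fallback

# Without the guard, f = 0.30 makes the coefficient (1 - n*f) negative,
# and the dominant arm ends up with the smallest share.
unguarded = 0.30 + (1.0 - 4 * 0.30) * np.asarray(w)

try:
    exploration_floor(w, -0.05)
    rejected = False
except ValueError:
    rejected = True

print("feasible:  ", np.round(feasible, 3))
print("infeasible:", infeasible)
print("unguarded: ", np.round(unguarded, 3))
print("negative floor rejected:", rejected)
```

Falling back to uniform is one reasonable policy for an infeasible floor; raising an error instead would also be defensible if a misconfigured floor should never pass silently.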

Why it actually matters

The exploration floor isn't cosmetic. It's what bounds worst-case regret and guarantees you keep collecting data on every arm — the property a lot of bandit regret arguments lean on, and often a fairness/SLA requirement too ("no variant ever drops below X%"). A floor that's silently 7.7% instead of 10% means the guarantee you reported to stakeholders, and any bound that depends on it, doesn't hold. The bug is invisible because the output still sums to 1 and still looks floored — the smallest number is just quietly too small.

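To reproduce the violation rate quoted above:

import numpy as np

rng = np.random.default_rng(0)
f, n, trials = 0.10, 4, 100_000
viol = 0
for _ in range(trials):
    w = rng.dirichlet(np.ones(n) * 0.3)   # peaky weights on the simplex
    p = np.maximum(w, f)                  # clip ...
    p = p / p.sum()                       # ... then renormalize
    if p.min() < f - 1e-12:               # floor missed, beyond float tolerance
        viol += 1
print(f"clip-renorm floor violations: {viol / trials:.1%}")   # ~97%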

I ran into this reviewing a Thompson-sampling weighting routine and proposed the additive-simplex version (plus the two guards) as a fix upstream. If your bandit or weighted-experiment layer clips-then-renormalizes to enforce a minimum, it's worth a one-line check: does the smallest probability it emits actually clear the floor?
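That check can be as small as the sketch below, which could sit in a unit test or behind a debug flag. `assert_floor` and its tolerance are assumptions for illustration, not part of any existing library.

```python
import numpy as np

def assert_floor(p, f, tol=1e-12):
    """Hypothetical runtime check: p must sum to 1 and every entry must be >= f."""
    p = np.asarray(p, dtype=float)
    if not np.isclose(p.sum(), 1.0):
        raise AssertionError(f"probabilities sum to {p.sum():.6f}, not 1")
    if p.min() < f - tol:
        raise AssertionError(f"min probability {p.min():.4f} is below floor {f:.4f}")
    return p

w = np.array([0.94, 0.02, 0.02, 0.02])
f = 0.10

mixed = f + (1.0 - len(w) * f) * w    # affine mix: passes the check
assert_floor(mixed, f)

clipped = np.maximum(w, f)            # clip-then-renormalize: fails the check
clipped = clipped / clipped.sum()
try:
    assert_floor(clipped, f)
    caught = False
except AssertionError as err:
    caught = True
    print("floor check failed:", err)
```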
