How much edge is left in the final 48h of a Polymarket? I measured price convergence on 18.6M snapshots

#python #datascience #trading #statistics

A question I kept circling when I started treating prediction markets as a trading venue: by the time a market is obviously heading one way, is the move already priced in — or is there still a systematic gap between price and outcome you can lean on?

"Buy the favorite at 0.85, collect the 0.15" sounds like free money until you account for the markets where the favorite loses. So I stopped eyeballing it and computed it.

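To make the "free money" arithmetic concrete, here is a minimal sketch of the expected profit per share when you buy at a given price. The function name is my own, and it assumes a share pays exactly 1 on a win and 0 on a loss, with no fees or slippage.

```python
def expected_pnl_per_share(price: float, win_rate: float) -> float:
    """Expected profit of buying one share at `price` that pays 1 with probability `win_rate`."""
    # Win:  receive 1 after paying `price` -> +(1 - price)
    # Lose: receive 0 after paying `price` -> -price
    # Algebraically this simplifies to win_rate - price.
    return win_rate * (1 - price) - (1 - win_rate) * price


# Assumption for illustration: the favorite is bought at 0.85.
for q in (0.80, 0.85, 0.90):
    print(f"win rate {q:.2f}: expected P&L per share = {expected_pnl_per_share(0.85, q):+.3f}")
```

The break-even win rate equals the price you paid, so the 0.15 is only "collected" if favorites priced at 0.85 win more than 85% of the time.
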
The data

I've been archiving Polymarket order books every 15 minutes since late March 2026. As of mid-June that's ~18.6M price snapshots across ~18.6k markets, most of them resolved, over ~77 days. Prediction markets are one of the few venues where you get the probability estimate, the full path it took, and the realized 0/1 outcome, so convergence is actually measurable instead of hand-waved.

The measurement: how far is price from truth, as a function of time-to-resolution?

For every resolved market, line up each snapshot by hours-to-resolution, then measure the mean absolute error between price and the eventual outcome in each time bucket:

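import pandas as pd

# df: one row per snapshot of a resolved market, with columns
# (market_id, ts, price_yes, resolved_outcome[0/1], resolve_ts)
df["hrs_to_res"] = (df["resolve_ts"] - df["ts"]).dt.total_seconds() / 3600
df["abs_err"]    = (df["price_yes"] - df["resolved_outcome"]).abs()

# Right-closed buckets (0, 1], (1, 6], ...; include_lowest=True keeps a snapshot
# taken exactly at resolution (hrs_to_res == 0) in the first bucket.
# Snapshots taken after resolution are negative, get window = NaN,
# and are dropped by the groupby.
bins = [0, 1, 6, 12, 24, 48, 96, 168, 1e9]  # last edge is effectively open-ended
df["window"] = pd.cut(df["hrs_to_res"], bins, include_lowest=True)

# observed=True: report only buckets that actually contain snapshots
conv = df.groupby("window", observed=True).agg(
    mae     = ("abs_err", "mean"),
    brier   = ("abs_err", lambda s: (s**2).mean()),  # abs_err == |p - outcome|
    n_snaps = ("abs_err", "size"),
)
print(conv)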

That's the whole thing. No model, no backtest — just a descriptive stat that most people assume the shape of instead of computing.
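One design choice worth flagging: the aggregation above weights every snapshot equally, so a market with many snapshots in a bucket counts more than one with few. A possible variant, sketched here as an assumption rather than the method behind the numbers in this post, averages within each market first and then across markets. The function name and the toy rows are made up for illustration.

```python
import pandas as pd

BINS = [0, 1, 6, 12, 24, 48, 96, 168, 1e9]  # same buckets as above


def market_weighted_convergence(df: pd.DataFrame) -> pd.DataFrame:
    """MAE and Brier per time bucket, with each market counted once per bucket."""
    d = df.copy()
    d["hrs_to_res"] = (d["resolve_ts"] - d["ts"]).dt.total_seconds() / 3600
    d["abs_err"] = (d["price_yes"] - d["resolved_outcome"]).abs()
    d["sq_err"] = d["abs_err"] ** 2
    d["window"] = pd.cut(d["hrs_to_res"], BINS, include_lowest=True)
    d = d.dropna(subset=["window"])  # drop snapshots taken after resolution

    # Step 1: average within each (bucket, market) pair
    per_market = d.groupby(["window", "market_id"], observed=True)[["abs_err", "sq_err"]].mean()
    # Step 2: average those per-market values within each bucket
    return per_market.groupby(level="window", observed=True).agg(
        mae=("abs_err", "mean"),
        brier=("sq_err", "mean"),
        n_markets=("abs_err", "size"),
    )


# Synthetic toy rows: market "a" has three snapshots, market "b" has one.
resolve = pd.Timestamp("2026-06-01 12:00")
toy = pd.DataFrame({
    "market_id": ["a", "a", "a", "b"],
    "ts": [resolve - pd.Timedelta(minutes=m) for m in (15, 30, 45, 30)],
    "price_yes": [0.9, 0.9, 0.9, 0.7],
    "resolved_outcome": [1, 1, 1, 0],
    "resolve_ts": [resolve] * 4,
})
weighted = market_weighted_convergence(toy)
print(weighted)
```

If the two curves disagree noticeably in a bucket, that bucket is being driven by a handful of heavily sampled markets.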

What the curve actually shows

Error shrinks fast in the last 24h, but it is not zero at the buzzer. There's a non-trivial residual right up to resolution, and that residual is concentrated in exactly the set of markets where "the obvious favorite" flips. Those tail markets are where a naive favorite-buying strategy quietly bleeds.

The 24–48h bucket is the interesting one for a trader. Most of the information has arrived and spreads have tightened, yet realized error is still meaningfully above the last-hour bucket. That gap is the thing to interrogate: is it edge, or is it just the variance of unresolved coin-flips that look settled?
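One way to start interrogating that gap is to compute the realized P&L of naive favorite-buying per bucket, with a standard error across markets. This is a rough sketch under loud assumptions: the NO price is taken as 1 - price_yes (no spread), fills happen at the snapshot price, fees are ignored, and markets are treated as independent. The function name and the toy rows are made up.

```python
import numpy as np
import pandas as pd

BINS = [0, 1, 6, 12, 24, 48, 96, 168, 1e9]


def favorite_pnl_by_window(df: pd.DataFrame) -> pd.DataFrame:
    """Mean per-share P&L of buying the favorite, per time bucket, one value per market."""
    d = df.copy()
    d["hrs_to_res"] = (d["resolve_ts"] - d["ts"]).dt.total_seconds() / 3600
    d["window"] = pd.cut(d["hrs_to_res"], BINS, include_lowest=True)
    d = d.dropna(subset=["window"])

    # The favorite is whichever side is priced at or above 0.5
    fav_is_yes = d["price_yes"] >= 0.5
    d["fav_price"] = np.where(fav_is_yes, d["price_yes"], 1 - d["price_yes"])
    d["fav_won"] = np.where(fav_is_yes, d["resolved_outcome"], 1 - d["resolved_outcome"])
    d["pnl"] = d["fav_won"] - d["fav_price"]  # payout (1 or 0) minus price paid

    # Average within each market first, so snapshots of one market are not
    # counted as independent observations
    per_market = d.groupby(["window", "market_id"], observed=True)["pnl"].mean()
    out = per_market.groupby(level="window", observed=True).agg(["mean", "std", "count"])
    out.columns = ["mean_pnl", "std_pnl", "n_markets"]
    out["stderr"] = out["std_pnl"] / np.sqrt(out["n_markets"])
    return out


# Synthetic toy rows, all 36h before resolution: two favorites win, one flips.
resolve = pd.Timestamp("2026-06-03 00:00")
toy = pd.DataFrame({
    "market_id": ["a", "b", "c"],
    "ts": [resolve - pd.Timedelta(hours=36)] * 3,
    "price_yes": [0.9, 0.2, 0.8],
    "resolved_outcome": [1, 0, 0],
    "resolve_ts": [resolve] * 3,
})
pnl = favorite_pnl_by_window(toy)
print(pnl)
```

A mean P&L within a couple of standard errors of zero is consistent with plain variance. A mean clearly away from zero is still not proof of edge, because this ignores fills and every caveat listed below.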

The caveats people skip (and why I'm posting the method, not a strategy)

1. Survivorship / resolution-time leakage. resolve_ts is known only ex-post. If you bucket by hours-to-resolution you're implicitly conditioning on a market that did resolve when it did — fine for a descriptive convergence curve, fatal if you turn it into a live signal without a causal "what did I know at time t" cut.

2. Liquidity ≠ midprice. The curve above uses last-trade / mid. The order book (I store depth too) tells a different story near the buzzer: getting size off at the "converged" price is often impossible, and that illiquidity is precisely where the residual error lives.

3. Resolution risk is not modeled. Some of that terminal error is genuine ambiguity or dispute, not mispricing. You can't arb a market whose resolution criteria themselves are the uncertain part.
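For the liquidity caveat (item 2), a minimal sketch of walking one side of an order book to get the average fill price for a given size. The book format here, a list of (price, size) ask levels, is a hypothetical shape chosen for illustration and not the schema of any particular API. The numbers are made up.

```python
from typing import List, Optional, Tuple


def avg_fill_price(asks: List[Tuple[float, float]], size: float) -> Tuple[Optional[float], float]:
    """Walk ask levels from cheapest up; return (average price paid, quantity filled)."""
    remaining = size
    cost = 0.0
    for price, qty in sorted(asks):  # cheapest level first
        take = min(remaining, qty)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    filled = size - remaining
    # None signals an empty book (or zero size): there is no fill price at all
    return (cost / filled if filled > 0 else None), filled


# Made-up book: thin at the top, deeper only at worse prices.
asks = [(0.97, 50), (0.98, 100), (0.995, 500)]
avg, filled = avg_fill_price(asks, 200)
print(f"best ask 0.97, average fill for 200 shares: {avg:.5f} (filled {filled})")
```

Comparing this average fill against the mid or last trade for a realistic size shows how much of the apparent residual you could actually capture.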

The takeaway

I'm not claiming an edge. I'm claiming the measurement is cheap and most people eyeball it instead of computing it. If you trade these, the convergence curve plus an honest point-in-time cut is the first thing worth building — before any strategy logic.
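As a starting point for that point-in-time cut, here is a sketch that buckets by hours to the market's scheduled end instead of the ex-post resolve_ts. The column `scheduled_end_ts` is hypothetical: it stands for whatever end date was published and visible at snapshot time, and I'm assuming your data has something like it. The toy row is made up.

```python
import pandas as pd

BINS = [0, 1, 6, 12, 24, 48, 96, 168, 1e9]  # same buckets as the convergence curve


def add_point_in_time_window(df: pd.DataFrame, end_col: str = "scheduled_end_ts") -> pd.DataFrame:
    """Bucket snapshots using only information available at snapshot time."""
    d = df.copy()
    d["hrs_to_sched_end"] = (d[end_col] - d["ts"]).dt.total_seconds() / 3600
    # Snapshots taken after the scheduled end are negative and land in no
    # bucket (NaN); overdue markets need separate handling.
    d["window_pit"] = pd.cut(d["hrs_to_sched_end"], BINS, include_lowest=True)
    return d


# Synthetic toy row: a market that resolved 30h after its scheduled end.
toy = pd.DataFrame({
    "market_id": ["m1"],
    "ts": pd.to_datetime(["2026-06-01 00:00"]),
    "scheduled_end_ts": pd.to_datetime(["2026-06-01 10:00"]),
    "resolve_ts": pd.to_datetime(["2026-06-02 16:00"]),
})
toy = add_point_in_time_window(toy)

# Ex-post bucket, for comparison only
toy["hrs_to_res"] = (toy["resolve_ts"] - toy["ts"]).dt.total_seconds() / 3600
toy["window_expost"] = pd.cut(toy["hrs_to_res"], BINS, include_lowest=True)
print(toy[["hrs_to_sched_end", "window_pit", "hrs_to_res", "window_expost"]])
```

The same snapshot lands in different buckets under the two definitions, which is exactly the leakage the first caveat describes.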

Reproduce it

The data behind this is open:

Free API, no signup: api.protodex.io — /stats, /markets, /market/{id}, /prices, /orderbook, /categories. Enough to rebuild the convergence curve yourself.
Full historical archive (18.7M+ snapshots, depth-level order books, resolved outcomes — the version I run this analysis on): Polymarket Quant Toolkit dataset.
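A minimal sketch for poking at the free API with only the standard library. Only the host and the endpoint paths come from the list above. The https scheme, the JSON response format, and the helper names are my assumptions, and I make no claims here about query parameters or response fields.

```python
import json
import urllib.request

BASE_URL = "https://api.protodex.io"  # host from the post; https scheme is an assumption


def endpoint_url(path: str) -> str:
    """Build a full URL for one of the listed paths, e.g. '/stats' or '/market/123'."""
    return BASE_URL + "/" + path.lstrip("/")


def fetch_json(path: str, timeout: float = 30.0):
    """GET an endpoint and parse the body as JSON (assumes the API returns JSON)."""
    with urllib.request.urlopen(endpoint_url(path), timeout=timeout) as resp:
        return json.load(resp)


# Example usage (left commented out because it needs network access):
# stats = fetch_json("/stats")
# print(stats)
```

Inspect the raw responses first and map their fields onto the (market_id, ts, price_yes, resolved_outcome, resolve_ts) layout before running the convergence code.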

What's your experience with the last-48h window — real residual edge, or just liquidity-trapped variance? I'd genuinely like to compare notes in the comments.