DEV Community

Yogesh23012001


I expected the cheaper model to be cheaper. It cost 8.6× more.

Reasoning tokens as a hidden cost factor

I'd routed the same one-word prompt to Claude Haiku and to Gemini 2.5 Flash. Flash has the lower per-token price, so this should have been an easy win. It wasn't. Flash is a thinking model: before it answered "Paris," it spent a few dozen tokens reasoning, and reasoning is billed as output. Haiku answered in 4 tokens. Flash spent about 28 — lower unit price, far more units, ~8.6× the bill per request. I only caught it because I'd instrumented every call: tokens, cost, latency, written to Postgres. And I instrument every call because that's what you do when you've spent years keeping payment systems honest.
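To make the arithmetic concrete, here is a minimal sketch of how a lower unit price can still produce a larger bill once reasoning tokens are counted as output. The model names and prices are hypothetical placeholders, not either provider's real rates, and the split of the roughly 28 tokens into 4 answer tokens plus 24 reasoning tokens is an assumption, so the ratio it prints will not match the 8.6× above.

```python
from decimal import Decimal

# Hypothetical output prices in USD per 1M tokens (placeholders, not real rates).
PRICE_PER_MILLION = {
    "model_a": Decimal("4.00"),  # non-thinking model, higher unit price
    "model_b": Decimal("2.50"),  # thinking model, lower unit price
}


def output_cost(model: str, answer_tokens: int, reasoning_tokens: int = 0) -> Decimal:
    """Cost of one response; reasoning tokens are billed as output."""
    billed_tokens = answer_tokens + reasoning_tokens
    return PRICE_PER_MILLION[model] * billed_tokens / Decimal(1_000_000)


# Assumed split: 4 answer tokens either way, 24 extra reasoning tokens for model_b.
cost_a = output_cost("model_a", answer_tokens=4)
cost_b = output_cost("model_b", answer_tokens=4, reasoning_tokens=24)
ratio = cost_b / cost_a

print(f"model_a: ${cost_a}  model_b: ${cost_b}  ratio: {ratio}x")
```

Under these made-up prices the model with the lower unit price still costs several times more per request, because the number of billed tokens grows faster than the price falls.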

That instinct is the whole point. For two and a half years I built cross-border real-time payments at NPCI — the kind of system where a rounding error is an incident and a downstream going dark is a Saturday night. This year I built an LLM gateway, and I kept reaching for the same tools. That's not a coincidence.

AI infrastructure looks like a new field. Underneath, it's the systems work backend engineers have always done. A model API is a downstream dependency: it's slow, it's occasionally down, it rate-limits you, and it bills you per call. You have integrated a dependency exactly like that before — a payment processor, a KYC provider, a partner bank. The model isn't magic. It's a new kind of expensive, flaky downstream, and the hard parts — reliability, cost control, failover — are the parts we already solved.

The gateway needed a circuit breaker per provider, and I'd built exactly that at NPCI — on a payments platform where every partner integration was a downstream in a different country, each with its own SLA and its own definition of "up." My fault-tolerant Go processors wrapped those partners in circuit breakers and rate limiters for one reason: when a partner started failing, the worst move was to keep hammering it while your own worker pools filled with stuck calls. So you trip, fail fast, and let a trickle of traffic probe for recovery. CLOSED, OPEN, HALF_OPEN. Wiring those same three states around Anthropic and Gemini, the only thing that changed was the noun — "partner bank" became "model provider."
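The three states can be sketched in a few lines of Python. This is an illustration of the general pattern, not the gateway's actual code: the class name, the thresholds, and the injectable clock are all assumptions, and a production breaker would also need locking and a timeout for probes that never report back.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"        # normal traffic
    OPEN = "OPEN"            # fail fast, send nothing
    HALF_OPEN = "HALF_OPEN"  # one probe request is testing recovery


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so the behaviour can be tested
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.CLOSED:
            return True
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                return False  # still cooling down: fail fast
            self.state = State.HALF_OPEN
            return True       # this caller becomes the probe
        return False          # HALF_OPEN: a probe is already in flight

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()


# One breaker per provider, so one failing provider does not block the other.
breakers = {"anthropic": CircuitBreaker(), "gemini": CircuitBreaker()}
```

A caller would check `allow_request()` before each provider call and report the outcome with `record_success()` or `record_failure()`.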

The gateway needed to meter every call into Postgres. That's an audit log, and payments runs on audit logs. Across 20+ Go services I'd keyed everything on idempotency so a retried request never double-counted; the cost log keys on request_id for the identical reason. And money was always fixed-precision, never a float — which is why cost_usd is a NUMERIC, not a float. You don't approximate money, and you don't approximate spend.
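A rough sketch of that idea, using Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere. The table name, the columns other than request_id and cost_usd, and the helper functions are hypothetical. SQLite has no exact decimal type, so the cost is stored as text here; in Postgres the column would be NUMERIC as described above.

```python
import sqlite3
from decimal import Decimal

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
conn.execute(
    """
    CREATE TABLE cost_log (
        request_id    TEXT PRIMARY KEY,   -- idempotency key
        model         TEXT NOT NULL,
        output_tokens INTEGER NOT NULL,
        cost_usd      TEXT NOT NULL       -- NUMERIC in Postgres
    )
    """
)


def log_cost(request_id: str, model: str, output_tokens: int, cost_usd: Decimal) -> bool:
    """Write one row per request_id; return False if it was already logged."""
    before = conn.total_changes
    conn.execute(
        "INSERT INTO cost_log (request_id, model, output_tokens, cost_usd) "
        "VALUES (?, ?, ?, ?) ON CONFLICT(request_id) DO NOTHING",
        (request_id, model, output_tokens, str(cost_usd)),
    )
    conn.commit()
    return conn.total_changes > before


def total_spend() -> Decimal:
    """Sum costs with Decimal arithmetic, never floats."""
    rows = conn.execute("SELECT cost_usd FROM cost_log")
    return sum((Decimal(row[0]) for row in rows), Decimal("0"))


first = log_cost("req-1", "model_b", 28, Decimal("0.000070"))
retry = log_cost("req-1", "model_b", 28, Decimal("0.000070"))  # retried request
print(first, retry, total_spend())
```

The retried write is a no-op, so the total reflects one request rather than two.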

Retries versus the breaker — transient versus sustained — was muscle memory. In payments a careless retry is a double-debit, so one timeout means an idempotent retry, never a blind resubmit; sustained failure means you stop sending. I wrote that distinction into financial workflows for years. Rewriting it for an LLM call, the cost of getting it wrong fell from someone's money to a wasted token — but the shape of the problem didn't move an inch.
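The transient half of that distinction might look like the sketch below: a bounded retry with exponential backoff that reuses one request ID across attempts, so the downstream (or the cost log) can deduplicate. The `TransientError` class, the `send` callable, and the default delays are assumptions made for illustration.

```python
import time
import uuid


class TransientError(Exception):
    """Hypothetical marker for timeouts and other retryable failures."""


def call_with_retry(send, payload, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient failures with backoff, reusing one request ID."""
    request_id = str(uuid.uuid4())  # same ID on every attempt: idempotent retry
    for attempt in range(1, max_attempts + 1):
        try:
            return send(request_id, payload)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller count a real failure
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

Only the error that escapes after the final attempt would be reported to a circuit breaker, which keeps a single timeout from counting as sustained failure.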

I won't pretend it was all familiar. A few things were genuinely new, and they're all about the model itself. Token economics is a real discipline — the 8.6× surprise up top doesn't exist in any database I've ever billed for. A model is non-deterministic in a way a datastore isn't: the same input can return a different output, so you can't assert on exact strings, and "testing" becomes evals and distributions instead of equality. And a model quietly spending your output budget on its own reasoning has no equivalent in a payment switch. That's the new surface area — but it's a few weeks of new sitting on years of foundation, not a new career.
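As a small illustration of "evals and distributions instead of equality", the sketch below checks properties of each response and reports a pass rate over repeated samples. The specific checks, the sample outputs, and the `generate` callable standing in for a model call are all hypothetical.

```python
import itertools


def check_response(text: str) -> bool:
    """Property checks instead of exact string equality."""
    mentions_answer = "paris" in text.lower()
    is_short = len(text.split()) < 10
    return mentions_answer and is_short


def pass_rate(generate, n: int = 20) -> float:
    """Sample the model n times and return the fraction of passing responses."""
    passed = sum(check_response(generate()) for _ in range(n))
    return passed / n


# Fake model that cycles through plausible outputs, including one refusal.
outputs = itertools.cycle(
    ["Paris", "Paris.", "The capital is Paris.", "I cannot answer that."]
)
rate = pass_rate(lambda: next(outputs), n=20)
print(f"pass rate: {rate:.2f}")
```

The test then asserts on a threshold, for example that the pass rate stays above some agreed floor, rather than on any single string.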

So if you're a backend engineer wondering whether your distributed-systems experience transfers to AI: it doesn't just transfer — it's the scarce part. Anyone can call an LLM API in an afternoon. Far fewer people can make that call reliable when the provider degrades, cheap when the token math turns against you, and observable when someone asks you to explain the bill. That's not AI expertise. That's the work you've already been doing, with a model attached.

The gateway is live at https://llm-gateway-python.onrender.com, and the code is on GitHub: https://github.com/Yogesh23012001/llm-gateway-python/tree/main.

Top comments (10)

HARD IN SOFT OUT

A payments engineer walks into an AI gateway meeting.

"We need to retry failed LLM calls," says the PM.

The engineer says "Sure — idempotently, with exponential backoff, circuit breakers, and a cost audit log."

The PM says "That's overkill."

The engineer says "That's how I kept your credit card from getting double‑charged."

The AI hallucinated an excuse. The database didn't.

This is the kind of post that should be required reading for everyone rushing to "just add AI."

The 8.6× surprise is brutal — and perfect. Everyone talks about per‑token pricing. Almost no one talks about token count variance between models. Flash thinking out loud before saying "Paris" is exactly the kind of hidden cost that eats budgets while the dashboard looks green.

Two things I'd add to your already excellent framework:

  1. The cost surprise gets worse with temperature. You ran at temp 0.0, I assume? At higher temps, thinking models don't just reason — they explore reasoning paths, and each explored path is billed as output. I've seen the same prompt cost 2x vs 20x depending on temperature, same model. That's a lever most people don't know they're pulling.

  2. The "non‑determinism breaks testing" point deserves its own post. You mentioned evals vs equality assertions. In payments, you'd test assert balance == expected. In LLMs, you test assert response contains "Paris" AND response length < 10 AND refusal=false. That shift from exact to probabilistic testing is, for many teams, the hardest mental model change — harder than circuit breakers or token economics.

One small suggestion: your post focuses on detecting the 8.6x surprise (good observability). I'd love to see a follow‑up on automated mitigation — e.g., a circuit breaker that trips when cost per request exceeds 3x the 7‑day moving average, or a router that learns which prompt types are cheaper on which models. That turns detection into action.

Solid read. Thanks for writing it.

Yogesh23012001

I built detection, not action. The cost-aware breaker is low-hanging fruit precisely because the pieces exist: the gateway already has a circuit breaker (it trips on provider failures today) and already writes cost-per-request to Postgres, so "trip when cost > 3× the 7-day moving average" is just a new trip condition on an existing mechanism plus one query. The learned cheaper-model-per-prompt-type router is the bigger swing: that's real cost-based routing, and it's where this goes if I keep pulling the thread.
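A minimal sketch of that trip condition, assuming the "one query" returns one average cost-per-request value per day. The function name, its signature, and the strict greater-than comparison are assumptions, not the gateway's implementation.

```python
from decimal import Decimal


def should_trip(current_cost: Decimal, daily_avg_costs: list[Decimal],
                multiplier: Decimal = Decimal("3")) -> bool:
    """Trip when a request costs more than `multiplier` x the 7-day average."""
    window = daily_avg_costs[-7:]  # most recent seven daily averages
    if not window:
        return False  # no baseline yet, so never trip on cost alone
    baseline = sum(window, Decimal("0")) / len(window)
    return current_cost > multiplier * baseline


history = [Decimal("0.000020")] * 7
print(should_trip(Decimal("0.000070"), history))  # above 3x the baseline
print(should_trip(Decimal("0.000060"), history))  # exactly 3x, not above
```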

Thanks — this is the most useful comment I've gotten; half my follow-up backlog is now your comment.

Alex Shev

This is exactly why per-token pricing is only half the cost model. The cheaper model can become expensive if it needs more retries, longer prompts, heavier routing logic, or more supervision to get the same outcome. In practice I would measure cost per accepted answer, not cost per raw call.

Mustafa ERBAY

One of the biggest misconceptions in AI is that model expertise matters more than systems expertise. In production, the hard problems are often the same ones we’ve dealt with for years: reliability, observability, cost control, retries, failover, and auditing. The model is new. Distributed systems are not.

Nazar Boyko

Since you're already metering every call into Postgres, it's worth splitting reasoning_tokens into its own column instead of folding it into output_tokens, because they respond to completely different levers. The part that compounds your 8.6x: reasoning tokens don't get the prompt-cache discount a stable input prefix does, so on repeated calls a "cheap" thinking model keeps paying full freight while a non-thinking one rides the cache. At scale the unit-price comparison gets even more misleading than the single request suggests.

Mudassir Khan

the "reasoning is billed as output" detail is the one that keeps biting people switching to thinking models. we ran into the exact same thing — moved a classification task to a thinking model expecting it to catch edge cases, didn't check the thinking budget config. the reasoning chain on a "is this message spam?" call was hitting 800 output tokens before the actual 5 token answer.

ended up turning off extended thinking for any task with eval variance under 0.3. saved 60% on that pipeline.

are you setting a max_thinking_tokens budget or letting the model decide on its own?

Yogesh23012001

The 800 tokens for a 5-token spam verdict is exactly the failure mode. I'm setting a budget, thinking_budget=2048, on the 2.5 models, added on top of the caller's max_tokens so thinking can't starve the answer (I went with a budget rather than 0 since 2.5 Pro rejects 0, though Flash accepts it).
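The "added on top" arithmetic can be sketched as below, assuming the provider's output cap counts reasoning tokens as well as the answer. The function name and the dictionary keys are illustrative and not tied to a specific SDK call.

```python
def gemini_limits(caller_max_tokens: int, thinking_budget: int = 2048) -> dict:
    """Reserve the thinking budget on top of the caller's answer budget."""
    if thinking_budget < 0:
        raise ValueError("thinking_budget must be non-negative")
    return {
        "thinking_budget": thinking_budget,
        # The cap covers reasoning plus the answer, so the answer keeps the
        # full allowance the caller asked for.
        "max_output_tokens": caller_max_tokens + thinking_budget,
    }


limits = gemini_limits(caller_max_tokens=256)
print(limits)
```

A per-request override would simply pass a different `thinking_budget`, for example a smaller one for low-variance classification tasks.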

But your eval-variance gating is sharper — that's app-level knowledge a gateway doesn't have. A flat budget is just a safety net; turning thinking off for low-variance tasks is the real win. Ideal is both: safe default at the gateway + a per-request override. Curious whether you set that 0.3 line offline per task-type or measure it live?

Andrii Krugliak

Same trap got me, and I only noticed because I was logging tokens and cost per call to Postgres. A thinking model bills its reasoning as output, so on short prompts the cheaper unit price ends up the more expensive call. The only number I trust now is cost per finished task, not per token.

FastAnchor_io

This is exactly why per-token pricing alone is misleading. The thinking/reasoning tokens are the silent budget killer — Flash spent 28 tokens thinking before answering, and those get billed at output rates. One thing I've found useful: when comparing models, track cost-per-useful-response, not cost-per-token. A $0.10/1M model that needs 3 retries and 200 reasoning tokens can easily cost more than a $0.40/1M model that gets it right in one shot. Your instrumentation setup paying for itself here is a great lesson.

HamTek

That's cool!