Why Your Agents Are Silently Burning Tokens (And How to Stop Them)

#ai #agents #infrastructure #production

You deployed a coding agent last month. It runs autonomously, pulls tickets, files PRs, answers Slack questions. It's genuinely useful.

Then the bill arrived.

The agent consumed more API spend than you planned. You don't know why. It hits the model 30–50 times per ticket. Some of those calls are slow retries. Some are redundant context re-reads. Some might be the agent re-reading the system prompt or tool descriptions on every invocation.

By the time you noticed, the damage was done.

This is the most common problem I see in production agent deployments right now, and it's not a model issue. It's an infrastructure issue. The models are working as intended. The problem is that your team has no visibility into what the agent is spending, no way to isolate costs by agent or team, and no circuit breaker to stop a runaway agent before it burns a month's budget in a day.

The Silent Cost Problem

When you run a single API call from a script, the cost is obvious: one call, one charge. But agents are different. Agents are loops:

Agent reads the task.
Agent decides what tool to call.
Agent calls the tool.
Agent re-reads the output.
Agent decides next step or calls the model again.
Repeat 15–50 times per task.

Each step hits an LLM. Each hit costs tokens. If your agent re-reads the system prompt on every turn (a real pattern I see in production), or re-reads tool descriptions, or re-reads previous messages, that cost compounds invisibly. A bug that causes 2 extra reads per call becomes 30 to 100 extra reads per completed task at 15–50 calls per task.
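To make the compounding concrete, here is a minimal sketch of how input tokens add up across one task. It assumes every call resends the system prompt plus the full history so far, which is typical for stateless chat APIs; prompt caching or history trimming would change the numbers. The function name and the token counts are illustrative assumptions, not measurements.

```python
def total_input_tokens(turns: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Input tokens sent over one task, assuming each call resends the
    system prompt plus everything accumulated in earlier turns."""
    total = 0
    history = 0
    for _ in range(turns):
        total += system_tokens + history  # what this call sends
        history += tokens_per_turn        # reply + tool output appended to history
    return total


# Illustrative numbers only: 2,000-token system prompt, 500 new tokens per turn.
short_task = total_input_tokens(turns=15, system_tokens=2000, tokens_per_turn=500)
long_task = total_input_tokens(turns=50, system_tokens=2000, tokens_per_turn=500)
```

Under these assumptions the 15-turn task sends 82,500 input tokens and the 50-turn task sends 712,500. The turn count grows about 3.3x, the input tokens about 8.6x, which is why a few extra turns per task show up on the bill far out of proportion.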

The problem surfaces too late: you see the bill, not the calls.

Why This Breaks Teams

Reddit communities focused on agent infrastructure document this pattern clearly. The teams making money with agents (email-to-CRM routing, FAQ support, resume parsing) are not the ones building broad, multi-step autonomous systems. They're the ones that:

Built cost controls from day one.
Set a monthly budget ceiling and monitored how close they were to it.
Logged which agent, which user, which task triggered each call.
Could answer: "Why did this task cost $1.50 instead of $0.30?"

Teams that skip this step deploy an agent, it works fine for a week, then either:

It discovers an edge case that causes looping (re-trying the same tool call, or re-reading context endlessly).
One user accidentally triggers it 100 times in a row.
A bug in the tool integration causes the agent to call a tool, get a partial response, and retry 20 times.

By then, the cost is locked in. You can't retroactively understand it without detailed logs.
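One cheap defense against the looping failures above is a guard inside the agent loop that refuses identical tool calls after a few repeats and caps total calls per task. The sketch below is an assumption about how you might wire this in; `LoopGuard` and its limits are hypothetical names and values, not part of any framework.

```python
import json
from collections import Counter


class LoopGuard:
    """Per-task guard: blocks repeated identical tool calls and runaway loops."""

    def __init__(self, max_repeats: int = 3, max_calls: int = 50):
        self.max_repeats = max_repeats  # identical (tool, args) calls allowed
        self.max_calls = max_calls      # total tool calls allowed per task
        self.seen = Counter()
        self.total = 0

    def allow(self, tool: str, args: dict) -> bool:
        # Serialize args with sorted keys so equal dicts produce the same key.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        self.total += 1
        return self.seen[key] <= self.max_repeats and self.total <= self.max_calls


guard = LoopGuard(max_repeats=3, max_calls=50)
decisions = [guard.allow("search_tickets", {"query": "login bug"}) for _ in range(5)]
# The first three identical calls pass; the fourth and fifth are refused.
```

Create one guard per task, and when `allow` returns False, stop the task and log why instead of letting the agent try again.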

What You Need (And Probably Don't Have Yet)

To run agents in production, you need:

Per-agent cost tracking: Not total spend, but spend per agent, per user, per task. If the support agent is cheaper than the coding agent, you need to know that. If one user's runs cost 10x more than another's, you need to know that too.
Virtual keys and team isolation: You can't give every developer direct access to your LLM provider console. They'll burn budget on experiments. You need team-level or per-agent API keys that sit behind a gateway, so you can track and limit what each team spends.
Budget controls: A budget ceiling per team, per agent, per month. When the agent approaches the limit, it should alert you (not crash your system, but alert). When it hits hard limits, it should stop accepting new tasks.
Spend visibility in the UI: You need a dashboard where you can see, in real time or near-real time:
- Total spend this month
- Spend per agent
- Spend per team
- Average cost per task
- Cost trend (is it growing? why?)
Detailed call logs: Not just "the agent ran 10 times," but "the agent ran 10 times, made 400 LLM calls, averaged 40 calls per run, here's the distribution of call types."
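As a rough illustration of per-agent, per-user, per-task tracking, here is a minimal in-memory ledger. Everything in it is an assumption made for the example: the `CallRecord` fields, the `spend_by` helper, and especially the per-token prices, which are placeholders you would replace with your provider's actual rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder prices in USD per million tokens; NOT real provider rates.
PRICE_IN_PER_M = 3.00
PRICE_OUT_PER_M = 15.00


@dataclass(frozen=True)
class CallRecord:
    agent_id: str
    team_id: str
    user_id: str
    task_id: str
    input_tokens: int
    output_tokens: int


def cost_usd(r: CallRecord) -> float:
    """Cost of a single LLM call under the placeholder prices above."""
    return (r.input_tokens * PRICE_IN_PER_M + r.output_tokens * PRICE_OUT_PER_M) / 1_000_000


def spend_by(records: list[CallRecord], field: str) -> dict[str, float]:
    """Total spend grouped by any record field, e.g. 'agent_id' or 'task_id'."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[getattr(r, field)] += cost_usd(r)
    return dict(totals)


records = [
    CallRecord("coding-agent", "platform", "alice", "T-1", 100_000, 10_000),
    CallRecord("coding-agent", "platform", "bob", "T-2", 200_000, 0),
    CallRecord("support-agent", "support", "alice", "T-3", 10_000, 1_000),
]
per_agent = spend_by(records, "agent_id")
per_user = spend_by(records, "user_id")
```

In production this would live in a database fed by your gateway or client wrapper, but the idea is the same: attach the IDs at call time so that any breakdown is a group-by later.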

If you're missing any of these, you're running blind. Your agent could be wasting 50% of its budget on inefficient patterns, and you'd only find out when the bill came.

How Production Teams Handle This

The coding agent that LiteLLM built internally (the one covering 30% of their engineering backlog) uses a specific pattern:

Brain + Sandbox split: The reasoning loop runs in a persistent pod. The execution runs in ephemeral sandboxes. This reduces re-boots and context re-reads.
Clear tool interface: Structured tool definitions, not prose descriptions that the model has to re-read every turn.
Cost tracking at the gateway: Every LLM call routes through the gateway, so every call is logged with the agent ID, team ID, and task ID. No guessing later.
Budget per agent, enforced: The agent knows its cost ceiling. It can check its remaining budget before taking on a task.

The result: predictable cost, observable behavior, and the ability to debug why a task cost more than expected.
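The "agent knows its cost ceiling" idea can be sketched as a small budget object the agent consults before starting a task. This is not LiteLLM's implementation, only a hypothetical illustration; the class name, the 80% alert fraction, and the queue-for-approval behavior are assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentBudget:
    ceiling_usd: float
    alert_fraction: float = 0.8  # alert once this share of the ceiling is spent
    spent_usd: float = 0.0
    pending: deque = field(default_factory=deque)  # tasks waiting for approval

    @property
    def remaining_usd(self) -> float:
        return max(self.ceiling_usd - self.spent_usd, 0.0)

    def record(self, cost_usd: float) -> None:
        """Add the cost of a completed call or task to the running total."""
        self.spent_usd += cost_usd

    def should_alert(self) -> bool:
        return self.spent_usd >= self.alert_fraction * self.ceiling_usd

    def submit(self, task_id: str, estimated_cost_usd: float) -> bool:
        """Return True if the task may start now; otherwise queue it."""
        if estimated_cost_usd <= self.remaining_usd:
            return True
        self.pending.append(task_id)
        return False


budget = AgentBudget(ceiling_usd=100.0)
budget.record(85.0)                        # past the 80% alert line
can_start_small = budget.submit("T-10", estimated_cost_usd=10.0)
can_start_large = budget.submit("T-11", estimated_cost_usd=20.0)
```

Here the small task fits in the remaining $15 and starts, while the larger one is parked in `pending` for a human to approve. The estimate can be crude (for example, the agent's historical average cost per task); the point is that the check happens before the spend, not after.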

The Infrastructure Gap

This is why agent platforms exist. They're not just frameworks for writing agent logic (that's what LangGraph and similar agent frameworks handle). They're the operational layer that lets your team:

Run agents across different runtimes without rewriting the integration.
Manage sessions and memory so agents don't lose context on restarts.
Track costs and enforce budgets.
See what's happening in real time.
Revoke access to a runaway agent without deploying.

If you're building agents but you don't have a way to do this, you're building toward a cost explosion. The agent works fine until the day it doesn't, and by then you've burned budget you can't recover.

Where to Start

If you're running agents in production right now:

Audit your current spend: Pull your last month of API bills. How much did agents cost? If you don't know which agents cost what, you need visibility first.
Add cost tracking to every agent call: Before you deploy more agents, instrument every model call with the agent ID, task ID, and team ID. This is foundational.
Set a budget ceiling, today: Decide: "We'll spend $X per agent per month." Make it a hard number. Set up an alert at 80%. At 100%, the agent stops accepting new tasks. It shouldn't crash; it should queue them and wait for approval.
Log tool calls too: Not just LLM calls, but what tools the agent triggered and whether they succeeded. A tool that fails and causes retries is a budget leak waiting to happen.
Review the call distribution weekly: Take 10 minutes each Friday to ask: "Which agents cost the most? Why?" If one agent is averaging 60 calls per task instead of 20, there's a pattern to fix.
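The weekly review in the last step can be mostly automated. The sketch below assumes your call log can be reduced to one `(agent_id, task_id)` pair per LLM call; the function names, the baseline of 20 calls per task, and the 2x flagging factor are illustrative assumptions.

```python
from collections import defaultdict


def avg_calls_per_task(call_log: list[tuple[str, str]]) -> dict[str, float]:
    """call_log holds one (agent_id, task_id) entry per LLM call."""
    calls: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for agent_id, task_id in call_log:
        calls[agent_id][task_id] += 1
    return {agent: sum(tasks.values()) / len(tasks) for agent, tasks in calls.items()}


def flag_outliers(averages: dict[str, float], baseline: float = 20.0, factor: float = 2.0) -> list[str]:
    """Agents whose average calls per task exceed baseline * factor."""
    return sorted(agent for agent, avg in averages.items() if avg > baseline * factor)


# Toy log: the coding agent averages 60 calls per task, the support agent 20.
log = [("coding-agent", "T-1")] * 60 + [("coding-agent", "T-2")] * 60 + [("support-agent", "T-3")] * 20
averages = avg_calls_per_task(log)
needs_review = flag_outliers(averages)
```

With this toy log only the coding agent is flagged. An average hides a lot, so once an agent is flagged, look at its most expensive individual tasks rather than the mean.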

If you're building a production agent platform for your team, or evaluating platforms to run agents on, ask:

Does it let me set budgets per agent?
Can I see spend broken down by agent, team, and task?
Can I isolate API keys so not every developer has provider console access?
Can I add cost checks into my agent's decision loop?

These are not nice-to-have features. They're the infrastructure that separates agents that work reliably from agents that work until they don't.

The silent cost problem is solvable. It just requires infrastructure, visibility, and discipline from day one.

What's your team's approach to agent cost control? I'd be curious to hear what works in your production deployments, especially if you've hit the "bill surprise" yourself and found a way to fix it.