I've seen teams burn through their entire AI budget in weeks. Not because they built the wrong thing. Because they never looked at how each request flows through their pipeline.
That's the hidden cost of AI agents. It's not the API pricing page. It's the architecture decisions you make before you ship.
Here's what I've learned running production LLM pipelines that process 10,000+ jobs daily, and how to fix the leaks before they drain your budget.
The Three Cost Leaks Nobody Talks About
Most teams focus on the wrong thing. They obsess over per-token pricing when the real money bleeds from three structural problems.
Leak one: uniform model routing. Every request goes to the same expensive model because it's simpler to code. I've seen systems call GPT-4 to extract a date from a string. That's a regex job with an LLM-shaped price tag.
Leak two: synchronous everything. Each request opens a fresh connection, waits for a response, and holds resources idle. When you're processing thousands of jobs, the latency tax compounds into a cost tax.
Leak three: no caching. The same document gets re-embedded, the same prompt gets re-evaluated, the same extraction gets re-run. Every repeat call is pure waste.
After addressing these three leaks, the same workload can cost far less without changing any business logic.
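Leak one is usually the easiest to plug: try a deterministic parser first and pay for a model call only when it comes back empty. Here is a minimal sketch for the date example, assuming the dates of interest are ISO-style (`YYYY-MM-DD`). The helper name `extractIsoDate` and the regex are illustrative, not taken from a specific codebase.

```typescript
// Hypothetical helper: pull the first ISO-8601 calendar date (YYYY-MM-DD)
// out of free text without calling a model.
const ISO_DATE = /\b(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b/;

function extractIsoDate(text: string): string | null {
  const match = ISO_DATE.exec(text);
  if (!match) return null;

  // Reject impossible dates such as 2024-02-31 by round-tripping through Date.
  const [, y, m, d] = match;
  const parsed = new Date(Date.UTC(Number(y), Number(m) - 1, Number(d)));
  const valid =
    parsed.getUTCFullYear() === Number(y) &&
    parsed.getUTCMonth() === Number(m) - 1 &&
    parsed.getUTCDate() === Number(d);

  return valid ? match[0] : null;
}
```

Only inputs where this returns `null` would need to reach a model at all.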
Batch Everything That Doesn't Need Real-Time
The single most effective cost move I've made was adopting OpenAI's Batch API for non-urgent workloads.
Here's the tradeoff: batch jobs return within a 24-hour window instead of seconds, but they cost 50% less. For any pipeline that processes data overnight, runs scheduled extractions, or handles background enrichment, this is free money.
```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// JobData, buildPrompt, and parseResponse are app-specific and defined elsewhere.

// Before: individual API calls for each job
async function processJob(jobData: JobData) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: buildPrompt(jobData) }]
  });
  return parseResponse(response);
}

// After: batch processing for non-urgent jobs
async function processBatch(jobs: JobData[]) {
  const requests = jobs.map(job => ({
    custom_id: job.id,
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: buildPrompt(job) }]
    }
  }));

  // The Batch API expects JSONL: one JSON request object per line,
  // not a single JSON document wrapping all requests.
  const jsonl = requests.map(r => JSON.stringify(r)).join('\n');

  const batchFile = await openai.files.create({
    file: new File([jsonl], 'batch.jsonl'),
    purpose: 'batch'
  });
  const batchJob = await openai.batches.create({
    input_file_id: batchFile.id,
    endpoint: '/v1/chat/completions',
    completion_window: '24h'
  });
  return batchJob.id; // Poll this later for results
}
```
For a job platform I work on, description extraction runs through the Batch API. Jobs submitted during the day get results by morning. The cost difference is substantial.
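Once a batch completes, the results arrive as a JSONL output file with one line per request, and lines are matched back to jobs by `custom_id` rather than by position. Here is a minimal sketch of that matching step, assuming the documented output line shape (`custom_id`, `response.status_code`, `response.body`, `error`). The names `parseBatchOutput` and `BatchResult` are hypothetical, and downloading the file (for example with `openai.batches.retrieve` followed by `openai.files.content`) is left out so the parsing logic stands on its own.

```typescript
// One line of a Batch API output file, reduced to the fields used here.
interface BatchOutputLine {
  custom_id: string;
  response: { status_code: number; body: any } | null;
  error: { code: string; message: string } | null;
}

type BatchResult =
  | { ok: true; content: string }
  | { ok: false; reason: string };

// Hypothetical helper: index results by custom_id, because output order
// is not guaranteed to match input order.
function parseBatchOutput(jsonl: string): Map<string, BatchResult> {
  const results = new Map<string, BatchResult>();

  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue; // tolerate a trailing newline

    const row = JSON.parse(line) as BatchOutputLine;
    const content = row.response?.body?.choices?.[0]?.message?.content;

    if (row.response?.status_code === 200 && typeof content === 'string') {
      results.set(row.custom_id, { ok: true, content });
    } else {
      const status = row.response?.status_code ?? 'unknown';
      results.set(row.custom_id, {
        ok: false,
        reason: row.error?.message ?? `status ${status}`,
      });
    }
  }
  return results;
}
```

Jobs that come back with `ok: false` are natural candidates for a synchronous retry or an escalation to a stronger model.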
Model Selection Per Task
Not every task needs a frontier model. I run a multi-tier routing system that assigns requests based on complexity.
Rules I apply in production:
- GPT-4o-mini handles most traffic: extraction, classification, simple generation. It's fast and cheap.
- DeepSeek V4 Flash handles structured-output tasks where I need reliable JSON without paying frontier-model prices. It costs roughly 1/23 as much as GPT-4.1, with comparable quality on well-defined schemas.
- GPT-4o only enters for tasks requiring nuanced reasoning, multi-step analysis, or when smaller models fail quality gates.
Here's the router pattern I use:
```typescript
type TaskComplexity = 'simple' | 'medium' | 'complex';

function selectModel(task: TaskComplexity): string {
  switch (task) {
    case 'simple':
      return 'gpt-4o-mini'; // $0.15/1M input tokens
    case 'medium':
      return 'deepseek-chat'; // ~$0.14/1M input, 23x cheaper than GPT-4.1
    case 'complex':
      return 'gpt-4o'; // Only when smaller models fail
  }
}
```
The trick is having a quality gate that escalates failed outputs. If GPT-4o-mini returns malformed JSON or misses a required field, the system retries on the next tier up. That way you're not guessing which model to use. The data decides.
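A minimal sketch of such a gate, assuming the expected output is a JSON object with a known set of required fields. The names `ModelCaller`, `passesGate`, and `runWithEscalation` are hypothetical, and the model call is injected as a function so the gate logic does not depend on any particular SDK.

```typescript
// Hypothetical signature for "call this model with this prompt".
type ModelCaller = (model: string, prompt: string) => Promise<string>;

// Gate: output must be a JSON object containing every required field, non-null.
function passesGate(raw: string, requiredFields: string[]): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      return false;
    }
    const record = parsed as Record<string, unknown>;
    return requiredFields.every(
      (field) => record[field] !== undefined && record[field] !== null,
    );
  } catch {
    return false; // malformed JSON
  }
}

// Try tiers cheapest-first; escalate when a call throws or its output fails the gate.
async function runWithEscalation(
  prompt: string,
  tiers: string[],
  requiredFields: string[],
  call: ModelCaller,
): Promise<{ model: string; output: string }> {
  for (const model of tiers) {
    try {
      const output = await call(model, prompt);
      if (passesGate(output, requiredFields)) return { model, output };
    } catch {
      // API error: fall through to the next tier
    }
  }
  throw new Error('No tier produced output that passed the quality gate');
}
```

Returning the model name alongside the output makes it possible to log which tier actually served each job.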
Fallback Chains Prevent Costly Retries
When an LLM call fails, most teams retry the same model. That's expensive and often pointless, because the failure is frequently model-specific.
I build fallback chains that try the cheapest model first and escalate to expensive ones only when necessary.
```typescript
// Assumes `openai` points at an OpenAI-compatible gateway that serves every
// model in the chain. OpenAI's own endpoint does not serve 'deepseek-chat',
// so without a gateway that entry needs its own client and base URL.
async function callWithFallback(prompt: string, models: string[]) {
  for (const model of models) {
    try {
      const response = await openai.chat.completions.create({
        model,
        messages: [{ role: 'user', content: prompt }],
        temperature: 0.1 // Low temperature for more consistent output across tiers
      });
      return response.choices[0].message.content;
    } catch (error) {
      // Log the cause so repeated failures on one model stay visible
      console.warn(`Model ${model} failed, trying next in chain`, error);
    }
  }
  throw new Error('All models in fallback chain failed');
}

// Usage: try cheap models first
const result = await callWithFallback(prompt, [
  'gpt-4o-mini',
  'deepseek-chat',
  'gpt-4o'
]);
```
This pattern keeps costs predictable. The cheap models succeed the vast majority of the time. The expensive ones only trigger for the edge cases.
Caching at Every Layer
Most teams cache at the database level. They miss the bigger wins.
Prompt result caching. If two jobs produce the same prompt (same input data, same task), the second call should return cached output. I use a simple key-value store with the prompt hash as the key.
Embedding caching. For RAG pipelines, the same documents get embedded repeatedly. Cache the embedding vectors by document hash. The first call pays the full cost. Every subsequent call costs a cache lookup.
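Prompt result caching and embedding caching reduce to the same pattern: hash the input, look it up, and compute only on a miss. Here is a minimal sketch, with an in-memory `Map` standing in for the key-value store. The names `cacheKey` and `withCache` are hypothetical, and the `namespace` argument is an assumption meant to keep entries from different models or prompt versions from colliding.

```typescript
import { createHash } from 'node:crypto';

// Stable cache key: SHA-256 over a namespace (e.g. model name) plus the input text.
function cacheKey(namespace: string, input: string): string {
  return createHash('sha256')
    .update(namespace)
    .update('\0') // separator so ("a", "b") and ("ab", "") hash differently
    .update(input)
    .digest('hex');
}

// Wrap an expensive async function so identical inputs are computed only once.
function withCache<T>(
  namespace: string,
  compute: (input: string) => Promise<T>,
  store: Map<string, Promise<T>> = new Map(),
): (input: string) => Promise<T> {
  return (input) => {
    const key = cacheKey(namespace, input);
    const hit = store.get(key);
    if (hit) return hit;

    // Cache the promise itself so concurrent identical requests share one call.
    const pending = compute(input).catch((err) => {
      store.delete(key); // do not cache failures
      throw err;
    });
    store.set(key, pending);
    return pending;
  };
}
```

In production the `Map` would likely be swapped for a shared store such as Redis, with an expiry suited to how often the underlying documents change.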
Model selection caching. If a specific input pattern consistently fails on GPT-4o-mini and succeeds on DeepSeek, cache that mapping. The system learns which model works for which input signature without re-testing every time.
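One way to sketch that mapping is a small lookup that remembers, per input signature, the cheapest tier that last succeeded. What counts as a "signature" is an assumption here (for example a source type plus a length bucket), and the class name `TierMemory` is hypothetical.

```typescript
// Remembers, per input signature, the tier that last passed the quality gate,
// so later requests with the same signature skip tiers already known to fail.
class TierMemory {
  private readonly tiers: string[];
  private readonly startIndex = new Map<string, number>();

  constructor(tiers: string[]) {
    this.tiers = tiers; // ordered cheapest-first
  }

  // Tiers worth trying for this signature, cheapest remaining first.
  tiersFor(signature: string): string[] {
    return this.tiers.slice(this.startIndex.get(signature) ?? 0);
  }

  // Record which tier finally produced acceptable output.
  recordSuccess(signature: string, model: string): void {
    const index = this.tiers.indexOf(model);
    if (index >= 0) this.startIndex.set(signature, index);
  }
}
```

As written, an entry never moves back down to a cheaper tier, so a real system would probably expire entries or re-test the cheap tier occasionally.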
Monitoring That Tells You What's Broken
You can't fix what you don't measure. I track three metrics per pipeline:
- Cost per job. Not per token. Per job. This is the number that matters to the business.
- Model distribution. What percentage of requests hit each tier. If GPT-4o is handling a large share of traffic, that's a red flag.
- Fallback rate. How often does the cheap model fail and escalate? A high fallback rate means your routing rules need tuning.
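All three numbers can be derived from a log of individual model calls. The sketch below assumes one record per call, with a hypothetical `CallRecord` shape, and treats any job with more than one call as having escalated. Both are assumptions, not a description of a specific monitoring stack.

```typescript
// Hypothetical log record: one entry per model call.
interface CallRecord {
  jobId: string;
  model: string;
  costUsd: number;
}

interface PipelineMetrics {
  costPerJob: number; // total spend divided by distinct jobs
  modelShare: Record<string, number>; // fraction of calls served by each model
  fallbackRate: number; // fraction of jobs that needed more than one call
}

function summarize(calls: CallRecord[]): PipelineMetrics {
  const callsPerJob = new Map<string, number>();
  const callsPerModel = new Map<string, number>();
  let totalCost = 0;

  for (const call of calls) {
    callsPerJob.set(call.jobId, (callsPerJob.get(call.jobId) ?? 0) + 1);
    callsPerModel.set(call.model, (callsPerModel.get(call.model) ?? 0) + 1);
    totalCost += call.costUsd;
  }

  const jobCount = callsPerJob.size;
  if (jobCount === 0) return { costPerJob: 0, modelShare: {}, fallbackRate: 0 };

  let escalatedJobs = 0;
  callsPerJob.forEach((n) => {
    if (n > 1) escalatedJobs += 1;
  });

  const modelShare: Record<string, number> = {};
  callsPerModel.forEach((n, model) => {
    modelShare[model] = n / calls.length;
  });

  return {
    costPerJob: totalCost / jobCount,
    modelShare,
    fallbackRate: escalatedJobs / jobCount,
  };
}
```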
Sentry catches errors. LogRocket shows user impact. But a simple dashboard tracking these three numbers catches the cost leaks before they become emergencies.
When to Spend More
Here's the counterintuitive part. Sometimes you should spend more, not less.
If a cheap model produces output that requires human review or rework, the cost of fixing bad output often exceeds the savings from the cheap call. I've seen teams save pennies on an API call and lose dollars in human labor fixing the result.
The rule: measure output quality alongside cost. If your fallback rate to expensive models stays low, the cheap tier is working. If it climbs, your routing needs adjustment, not just cost cutting.
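The pennies-versus-dollars tradeoff can be made concrete by folding expected rework into the per-job cost. This is a sketch with made-up numbers. The function name, the rates, and the dollar amounts are all hypothetical and are there only to show the arithmetic.

```typescript
// Expected cost of one job = API cost + (share of outputs needing rework × cost of that rework).
function effectiveCostPerJob(
  apiCostUsd: number,
  reworkRate: number,
  reworkCostUsd: number,
): number {
  return apiCostUsd + reworkRate * reworkCostUsd;
}

// Hypothetical numbers, for illustration only.
const cheap = effectiveCostPerJob(0.001, 0.10, 2.0); // 10% of outputs need a $2 human fix
const strong = effectiveCostPerJob(0.02, 0.01, 2.0); // 1% of outputs need the same fix
```

With these inputs the cheap tier works out to about $0.20 per job and the stronger tier to about $0.04, even though the stronger model's API call is 20 times more expensive.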
If your team is building AI agents that need to handle production volume without surprise bills, that's the kind of thing I help with. Happy to compare notes on what's working in your pipeline.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.
Top comments (1)
Good breakdown. The fourth leak I would add: retries and silent fallbacks. A timeout that retries twice turns one request into three, and most teams never see it because it shows up as 'more requests', not 'a bug'. Our worst month was not tokens, it was a retry loop on a flaky downstream that billed us for the same completion three times. What finally made all of these legible was per-request cost attribution on the trace itself: tag each span with model, input/output tokens, cache-hit, and retry-count, then group spend by route. You stop guessing which leak is biggest before you start optimizing. On leak one, how are you picking the routing tier, a small classifier up front or rules on input length?