LLM agents are great at deciding what to do and unreliable at computing it. Ask one to allocate traffic across five variants, price tail risk, or solve a scheduling constraint and you'll get a confident, plausible, subtly-wrong number — tokens burned included.
The fix usually isn't a better prompt. It's the same instinct that gave us the calculator: move the deterministic math out of the probabilistic engine.
The tell
You have a determinism problem the moment your agent's output needs to be:
- reproducible — same inputs → same answer, every run,
- auditable — someone can check why it's 0.62 and not 0.61, or
- correct under adversarial inputs — a fat-tailed return, an infeasible constraint.
An LLM gives you none of those for free. A tool call does.
What to offload (and a cheap test for each)
- "Which variant should I ship?" → a multi-armed / contextual bandit. The agent picks the question; Thompson sampling picks the allocation. Test: ask your agent to allocate 1,000 users across 4 arms with the same conversion counts, twice. Different answers? Offload it.
- "Is this metric anomalous?" → score the series against a baseline; don't eyeball it inside the context window.
- "What's the 95% VaR / CVaR?" → Monte Carlo paths, not a vibe.
- "Schedule these tasks under these limits" → an LP/MIP solver. LLMs can't reliably satisfy hard constraints; solvers can't violate them.
The pattern
Expose the math as MCP tools so the agent calls them like any other tool — intent stays in the model, the number comes from code:
// agent decides intent; the tool computes the answer
const alloc = await callTool("optimize_contextual", {
arms: variants, // [{ id, name }]
context: userFeatures, // segment, prior_open_rate, hour_of_day
history: pastRewards
});
// `alloc` is reproducible, sub-millisecond, and you can show your work
Two design details that bite people:
- Delayed reward. If reward trickles in (email opens over hours), set a fixed attribution window before crediting an arm — otherwise the bandit over-exploits early openers and collapses variant diversity.
-
Cold start. Start each arm on a
Beta(1,1)prior (or an informed prior from past campaigns) so exploration doesn't die on run one.
When not to offload
Determinism is a constraint, and constraints have cost. If the task is genuinely fuzzy — summarizing a doc, routing an intent, drafting copy — keep it in the model. A rule of thumb that's served me well:
If you'd want a unit test for the output, it belongs in a tool, not a prompt.
If you want a batteries-included set of these as MCP tools — bandits, forecasting, Monte Carlo, optimization, anomaly/risk — I maintain OraClaw (npx -y @oraclaw/mcp-server; 11 of the tools are free, no key). But the pattern matters more than the tool — wire in whatever solver you like. Disclosure: I built it.
Top comments (0)