Determinism as a feature: when to let your agent call a math API instead of reasoning

#llm #mcp #ai #agents

LLM agents are great at deciding what to do and unreliable at computing it. Ask one to allocate traffic across five variants, price tail risk, or solve a scheduling constraint and you'll get a confident, plausible, subtly wrong number, tokens burned included.

The fix usually isn't a better prompt. It's the same instinct that gave us the calculator: move the deterministic math out of the probabilistic engine.

The tell

You have a determinism problem the moment your agent's output needs to be:

reproducible — same inputs → same answer, every run,
auditable — someone can check why it's 0.62 and not 0.61, or
correct under adversarial inputs — a fat-tailed return, an infeasible constraint.

An LLM gives you none of those for free. A tool call does.

What to offload (and a cheap test for each)

"Which variant should I ship?" → a multi-armed / contextual bandit. The agent picks the question; Thompson sampling picks the allocation. Test: ask your agent to allocate 1,000 users across 4 arms with the same conversion counts, twice. Different answers? Offload it. (Thompson sampling is itself randomized, so seed the sampler if you need identical runs.)
"Is this metric anomalous?" → score the series against a baseline; don't eyeball it inside the context window.
"What's the 95% VaR / CVaR?" → Monte Carlo paths, not a vibe.
"Schedule these tasks under these limits" → an LP/MIP solver. LLMs can't reliably satisfy hard constraints; a solver either returns a schedule that satisfies them or tells you the problem is infeasible.

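To make the bandit test above concrete, here is a minimal sketch of a seeded Thompson sampling allocator. It is not OraClaw's implementation: every name in it (`mulberry32`, `sampleGamma`, `allocate`, the `Arm` shape) is an assumption for illustration, and it assumes a Beta(1,1) prior per arm. The point it shows is narrow: a randomized method becomes repeatable once the random source is seeded.

```typescript
// Illustrative sketch only; names and shapes are assumptions, not a real library API.
type Arm = { id: string; conversions: number; trials: number };

// Small seeded PRNG (mulberry32): same seed, same stream of numbers in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal draw via Box-Muller.
function sampleNormal(rand: () => number): number {
  const u = 1 - rand(); // in (0, 1], avoids log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * rand());
}

// Gamma(shape, 1) draw via Marsaglia-Tsang; valid for shape >= 1,
// which holds here because a Beta(1,1) prior adds 1 to each count.
function sampleGamma(shape: number, rand: () => number): number {
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = sampleNormal(rand);
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    const u = 1 - rand();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(a, b) draw as a ratio of two gamma draws.
function sampleBeta(a: number, b: number, rand: () => number): number {
  const x = sampleGamma(a, rand);
  const y = sampleGamma(b, rand);
  return x / (x + y);
}

// Assign each user to the arm with the highest sampled conversion rate.
function allocate(arms: Arm[], users: number, seed: number): Map<string, number> {
  const rand = mulberry32(seed);
  const counts = new Map<string, number>();
  for (const arm of arms) counts.set(arm.id, 0);
  for (let i = 0; i < users; i++) {
    let bestId = "";
    let bestDraw = -1;
    for (const arm of arms) {
      const draw = sampleBeta(1 + arm.conversions, 1 + arm.trials - arm.conversions, rand);
      if (draw > bestDraw) {
        bestDraw = draw;
        bestId = arm.id;
      }
    }
    counts.set(bestId, (counts.get(bestId) ?? 0) + 1);
  }
  return counts;
}

// Usage: the "ask twice" test, run against code instead of a prompt.
const arms: Arm[] = [
  { id: "A", conversions: 30, trials: 400 },
  { id: "B", conversions: 45, trials: 400 },
  { id: "C", conversions: 38, trials: 400 },
  { id: "D", conversions: 41, trials: 400 },
];
const first = allocate(arms, 1000, 42);
const second = allocate(arms, 1000, 42);
console.log(Array.from(first.entries()), Array.from(second.entries()));
```

Change the seed and the split changes; keep it fixed and the two runs match exactly. That is the property the twice-over test is checking for.
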
The pattern

Expose the math as MCP tools so the agent calls them like any other tool — intent stays in the model, the number comes from code:

// agent decides intent; the tool computes the answer
const alloc = await callTool("optimize_contextual", {
  arms: variants,          // [{ id, name }]
  context: userFeatures,   // segment, prior_open_rate, hour_of_day
  history: pastRewards
});
// `alloc` is reproducible, sub-millisecond, and you can show your work
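
The snippet above leaves `callTool` undefined. In a real setup your MCP client provides it; the sketch below is a local stand-in so you can see the shape of the split. The registry, `dispatch`, and the `score_anomaly` tool (a plain z-score against a baseline, using population standard deviation) are all hypothetical names, not OraClaw's or the MCP SDK's API.

```typescript
// Hypothetical stand-in for an MCP client: a name -> handler registry.
type ToolArgs = Record<string, unknown>;
type ToolHandler = (args: ToolArgs) => unknown;

const registry = new Map<string, ToolHandler>();

// Synchronous lookup and call; throws on an unknown tool name.
function dispatch(name: string, args: ToolArgs): unknown {
  const handler = registry.get(name);
  if (!handler) throw new Error(`unknown tool: ${name}`);
  return handler(args);
}

// Async wrapper so call sites can `await callTool(...)` as in the snippet above.
async function callTool(name: string, args: ToolArgs): Promise<unknown> {
  return dispatch(name, args);
}

// Example deterministic tool: z-score of a value against a baseline series.
registry.set("score_anomaly", (args) => {
  const baseline = args["baseline"] as number[];
  const value = args["value"] as number;
  if (baseline.length < 2) throw new Error("need at least 2 baseline points");
  const mean = baseline.reduce((s, x) => s + x, 0) / baseline.length;
  const variance = baseline.reduce((s, x) => s + (x - mean) ** 2, 0) / baseline.length;
  if (variance === 0) throw new Error("baseline has zero variance");
  return { z: (value - mean) / Math.sqrt(variance) };
});

// Usage: the agent chooses the tool and arguments; the code computes the number.
callTool("score_anomaly", { baseline: [10, 12, 11, 9, 13], value: 20 }).then((result) =>
  console.log(result)
);
```

The handler is an ordinary function, so it can be unit tested without a model in the loop.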

Two design details that bite people:

Delayed reward. If reward trickles in (email opens over hours), set a fixed attribution window before crediting an arm — otherwise the bandit over-exploits early openers and collapses variant diversity.
Cold start. Start each arm on a Beta(1,1) prior (or an informed prior from past campaigns) so exploration doesn't die on run one.
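
Both details fit in one small function. The sketch below is one reading of them, under assumptions I'm making up for illustration: timestamps are epoch milliseconds, a send is only scored once its window has closed, an open counts only if it lands inside the window, and every arm starts at Beta(1,1). The names (`SendEvent`, `posteriors`) are hypothetical.

```typescript
// Illustrative sketch; event shape and names are assumptions.
type SendEvent = { armId: string; sentAt: number; openedAt: number | null };
type BetaParams = { alpha: number; beta: number };

function posteriors(
  armIds: string[],
  events: SendEvent[],
  windowMs: number,
  now: number
): Map<string, BetaParams> {
  const out = new Map<string, BetaParams>();
  // Cold start: every arm begins at Beta(1,1), even with no data yet.
  for (const id of armIds) out.set(id, { alpha: 1, beta: 1 });

  for (const e of events) {
    const params = out.get(e.armId);
    if (!params) continue; // event for an arm we are not tracking
    // Delayed reward: skip sends whose attribution window is still open,
    // so fast openers are not credited before slow ones had their chance.
    if (now - e.sentAt < windowMs) continue;
    const openedInWindow = e.openedAt !== null && e.openedAt - e.sentAt <= windowMs;
    if (openedInWindow) params.alpha += 1;
    else params.beta += 1;
  }
  return out;
}

// Usage: 24-hour window, evaluated 48 hours after the first send.
const HOUR = 3_600_000;
const result = posteriors(
  ["A", "B"],
  [
    { armId: "A", sentAt: 0, openedAt: 1 * HOUR },          // opened in window: success
    { armId: "A", sentAt: 0, openedAt: 30 * HOUR },         // opened too late: failure
    { armId: "A", sentAt: 40 * HOUR, openedAt: 41 * HOUR }, // window still open: skipped
  ],
  24 * HOUR,
  48 * HOUR
);
console.log(Array.from(result.entries()));
```

Arm B has no sends and still comes back as Beta(1,1), so it keeps getting explored instead of being starved on the first run.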

When not to offload

Determinism is a constraint, and constraints have cost. If the task is genuinely fuzzy — summarizing a doc, routing an intent, drafting copy — keep it in the model. A rule of thumb that's served me well:

If you'd want a unit test for the output, it belongs in a tool, not a prompt.

If you want a batteries-included set of these as MCP tools — bandits, forecasting, Monte Carlo, optimization, anomaly/risk — I maintain OraClaw (npx -y @oraclaw/mcp-server; 11 of the tools are free, no key). But the pattern matters more than the tool — wire in whatever solver you like. Disclosure: I built it.
