Whatsonyourmind

Determinism as a feature: when to let your agent call a math API instead of reasoning

LLM agents are great at deciding what to do and unreliable at computing it. Ask one to allocate traffic across five variants, price tail risk, or solve a scheduling constraint and you'll get a confident, plausible, subtly-wrong number — tokens burned included.

The fix usually isn't a better prompt. It's the same instinct that gave us the calculator: move the deterministic math out of the probabilistic engine.

The tell

You have a determinism problem the moment your agent's output needs to be:

  • reproducible — same inputs → same answer, every run,
  • auditable — someone can check why it's 0.62 and not 0.61, or
  • correct under adversarial inputs — a fat-tailed return, an infeasible constraint.

An LLM gives you none of those for free. A tool call backed by deterministic (or seeded) code does.

What to offload (and a cheap test for each)

  1. "Which variant should I ship?" → a multi-armed / contextual bandit. The agent picks the question; Thompson sampling, run in code with a fixed seed, picks the allocation. Test: ask your agent to allocate 1,000 users across 4 arms with the same conversion counts, twice. Different answers? Offload it.
  2. "Is this metric anomalous?" → score the series against a baseline; don't eyeball it inside the context window.
  3. "What's the 95% VaR / CVaR?" → Monte Carlo paths, not a vibe.
  4. "Schedule these tasks under these limits" → an LP/MIP solver. LLMs can't reliably satisfy hard constraints; a solver either returns a schedule that satisfies them or reports that none exists.
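
To make the first item concrete, here is a minimal sketch of a seeded Thompson sampling allocation. It is an illustration, not OraClaw's implementation: the names `thompsonAllocate`, `sampleBeta`, and `mulberry32`, the Beta(1,1) prior, and the choice to assign users one posterior draw at a time are all assumptions. Thompson sampling is random by design, so the fixed seed is what makes the answer repeatable.

```typescript
// Small seeded PRNG (mulberry32) so the same seed gives the same draws.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Gamma(shape, 1) for an integer shape: a sum of `shape` exponential draws.
function sampleGammaInt(shape: number, rng: () => number): number {
  let sum = 0;
  for (let i = 0; i < shape; i++) sum -= Math.log(1 - rng()); // 1 - u is in (0, 1]
  return sum;
}

// Beta(a, b) for integer a and b, built from two gamma draws.
function sampleBeta(a: number, b: number, rng: () => number): number {
  const x = sampleGammaInt(a, rng);
  const y = sampleGammaInt(b, rng);
  return x / (x + y);
}

interface Arm {
  id: string;
  successes: number; // e.g. conversions
  failures: number; // e.g. non-conversions
}

// Assign each user to the arm with the highest posterior draw.
function thompsonAllocate(arms: Arm[], users: number, seed: number): Record<string, number> {
  if (arms.length === 0) throw new Error("need at least one arm");
  const rng = mulberry32(seed);
  const counts: Record<string, number> = {};
  for (const arm of arms) {
    if (!Number.isInteger(arm.successes) || !Number.isInteger(arm.failures) ||
        arm.successes < 0 || arm.failures < 0) {
      throw new Error(`arm ${arm.id}: counts must be non-negative integers`);
    }
    counts[arm.id] = 0;
  }
  for (let u = 0; u < users; u++) {
    let best = "";
    let bestDraw = -1;
    for (const arm of arms) {
      // Beta(1, 1) prior plus observed counts (assumed prior, see lead-in).
      const draw = sampleBeta(1 + arm.successes, 1 + arm.failures, rng);
      if (draw > bestDraw) {
        bestDraw = draw;
        best = arm.id;
      }
    }
    counts[best] += 1;
  }
  return counts;
}

// Usage: 1,000 users, 4 arms, fixed seed. Rerunning prints the same allocation.
const arms: Arm[] = [
  { id: "A", successes: 20, failures: 230 },
  { id: "B", successes: 40, failures: 210 },
  { id: "C", successes: 25, failures: 225 },
  { id: "D", successes: 15, failures: 235 },
];
console.log(thompsonAllocate(arms, 1000, 42));
```

Run it twice with the same counts and seed and the allocation is identical, which is exactly the property the agent-only version fails to give you.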

The pattern

Expose the math as MCP tools so the agent calls them like any other tool — intent stays in the model, the number comes from code:

// agent decides intent; the tool computes the answer
const alloc = await callTool("optimize_contextual", {
  arms: variants,          // [{ id, name }]
  context: userFeatures,   // segment, prior_open_rate, hour_of_day
  history: pastRewards
});
// `alloc` is reproducible, sub-millisecond, and you can show your work
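
For a feel of what sits on the other side of that call, here is a minimal local stand-in: a registry that maps a tool name to plain code. This is not the MCP SDK and not how OraClaw is built. `callToolLocal`, the tool name `score_anomaly`, and the z-score rule with a default threshold of 3 are assumptions for illustration, and a real MCP call would be asynchronous and go over a transport.

```typescript
interface AnomalyArgs {
  baseline: number[]; // historical values of the metric
  value: number; // the new observation to score
  threshold?: number; // |z| above this counts as anomalous (assumed default: 3)
}

interface AnomalyResult {
  mean: number;
  std: number;
  z: number;
  anomalous: boolean;
}

// Deterministic scoring: the same baseline and value always give the same z-score.
function scoreAnomaly({ baseline, value, threshold = 3 }: AnomalyArgs): AnomalyResult {
  if (baseline.length < 2) throw new Error("need at least two baseline points");
  const mean = baseline.reduce((s, x) => s + x, 0) / baseline.length;
  const variance =
    baseline.reduce((s, x) => s + (x - mean) ** 2, 0) / (baseline.length - 1);
  const std = Math.sqrt(variance);
  if (std === 0) throw new Error("baseline has zero variance");
  const z = (value - mean) / std;
  return { mean, std, z, anomalous: Math.abs(z) > threshold };
}

// Hypothetical registry: tool name -> handler.
type ToolHandler = (args: any) => unknown;
const registry = new Map<string, ToolHandler>();
registry.set("score_anomaly", scoreAnomaly);

// Local, synchronous stand-in for the article's `callTool`.
function callToolLocal(name: string, args: unknown): unknown {
  const handler = registry.get(name);
  if (!handler) throw new Error(`unknown tool: ${name}`);
  return handler(args);
}

// Usage: the agent supplies the intent (which tool, which inputs); code supplies the number.
const verdict = callToolLocal("score_anomaly", {
  baseline: [10, 12, 11, 9, 10, 11, 10, 12],
  value: 25,
}) as AnomalyResult;
console.log(verdict.anomalous, verdict.z.toFixed(2));
```

Because the result carries `mean`, `std`, and `z`, anyone reviewing the decision can see why the value was flagged instead of trusting a bare yes or no.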

Two design details that bite people:

  • Delayed reward. If reward trickles in (email opens over hours), set a fixed attribution window before crediting an arm — otherwise the bandit over-exploits early openers and collapses variant diversity.
  • Cold start. Start each arm on a Beta(1,1) prior (or an informed prior from past campaigns) so exploration doesn't die on run one.
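
The delayed-reward point can be sketched as a crediting step that ignores any send whose window is still open. This is one possible reading of "fixed attribution window", not a prescribed implementation: the `Send` shape, the function name `creditMaturedSends`, and the 6-hour window in the usage example are assumptions.

```typescript
interface Send {
  armId: string;
  sentAt: number; // epoch milliseconds
  openedAt?: number; // epoch milliseconds, absent if never opened
}

interface ArmCounts {
  successes: number;
  failures: number;
}

// Credit only sends whose attribution window has fully elapsed.
// Inside a matured window, an open counts as a success; anything else is a failure.
function creditMaturedSends(sends: Send[], now: number, windowMs: number): Map<string, ArmCounts> {
  const counts = new Map<string, ArmCounts>();
  for (const send of sends) {
    if (now - send.sentAt < windowMs) continue; // window still open: defer, do not count yet
    const current = counts.get(send.armId) ?? { successes: 0, failures: 0 };
    const openedInWindow =
      send.openedAt !== undefined &&
      send.openedAt >= send.sentAt &&
      send.openedAt - send.sentAt <= windowMs;
    if (openedInWindow) current.successes += 1;
    else current.failures += 1;
    counts.set(send.armId, current);
  }
  return counts;
}

// Usage with an assumed 6-hour window, evaluated 10 hours after the first send.
const HOUR = 60 * 60 * 1000;
const sends: Send[] = [
  { armId: "A", sentAt: 0, openedAt: 1 * HOUR }, // opened in window: success
  { armId: "A", sentAt: 0, openedAt: 8 * HOUR }, // opened too late: failure
  { armId: "A", sentAt: 0 }, // never opened: failure
  { armId: "B", sentAt: 2 * HOUR, openedAt: 3 * HOUR }, // opened in window: success
  { armId: "B", sentAt: 7 * HOUR, openedAt: 7.5 * HOUR }, // window not elapsed: skipped
];
console.log(creditMaturedSends(sends, 10 * HOUR, 6 * HOUR));
```

Skipping immature sends means every arm is judged on the same amount of waiting time, so a fast opener on a recent send does not count before the slow openers on that send have had their chance. The resulting counts can be added to a Beta(1,1) prior as `1 + successes` and `1 + failures`.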

When not to offload

Determinism is a constraint, and constraints have cost. If the task is genuinely fuzzy — summarizing a doc, routing an intent, drafting copy — keep it in the model. A rule of thumb that's served me well:

If you'd want a unit test for the output, it belongs in a tool, not a prompt.
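
As one illustration of that rule, a 95% VaR / CVaR estimate is an output you would want to pin down with a test. The sketch below is a seeded Monte Carlo estimate under a deliberately simple assumption: one-period returns are normally distributed, which understates fat tails. The names `simulateRisk` and `mulberry32` and every parameter value are hypothetical.

```typescript
// Small seeded PRNG (mulberry32) so the same seed gives the same paths.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Standard normal draw via the Box-Muller transform.
function sampleStandardNormal(rng: () => number): number {
  const u1 = 1 - rng(); // in (0, 1], keeps log() finite
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

interface RiskEstimate {
  valueAtRisk: number; // loss at the chosen confidence quantile
  conditionalVaR: number; // mean loss at or beyond that quantile
}

// Simulate one-period returns and read VaR / CVaR off the sorted losses.
function simulateRisk(
  meanReturn: number,
  volatility: number,
  paths: number,
  confidence: number,
  seed: number
): RiskEstimate {
  if (paths < 1) throw new Error("need at least one path");
  const rng = mulberry32(seed);
  const losses: number[] = [];
  for (let i = 0; i < paths; i++) {
    const simulatedReturn = meanReturn + volatility * sampleStandardNormal(rng);
    losses.push(-simulatedReturn); // a loss is a negative return
  }
  losses.sort((x, y) => x - y);
  const cut = Math.min(paths - 1, Math.max(0, Math.ceil(confidence * paths) - 1));
  const tail = losses.slice(cut);
  return {
    valueAtRisk: losses[cut],
    conditionalVaR: tail.reduce((s, x) => s + x, 0) / tail.length,
  };
}

// Usage: the kind of checks you would put in a unit test.
const first = simulateRisk(0.0005, 0.02, 10000, 0.95, 7);
const second = simulateRisk(0.0005, 0.02, 10000, 0.95, 7);
console.log(first.valueAtRisk === second.valueAtRisk); // same seed, same number
console.log(first.conditionalVaR >= first.valueAtRisk); // tail mean is never below the quantile
```

Neither check is possible when the number comes out of a prompt, which is the practical meaning of the rule above.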


If you want a batteries-included set of these as MCP tools — bandits, forecasting, Monte Carlo, optimization, anomaly/risk — I maintain OraClaw (npx -y @oraclaw/mcp-server; 11 of the tools are free, no key). But the pattern matters more than the tool — wire in whatever solver you like. Disclosure: I built it.
