DEV Community

Joel

Invoke: an execution layer for AI agents that prevents duplicate real-world actions

AI agents are starting to call real production tools: Stripe, CRMs, databases, email, internal APIs.
The part that scares me most is not the model reasoning. It’s the boring failure mode after the model decides what to do:
An agent calls stripe.charge_customer.
Stripe times out.
Did the charge fail? Or did it succeed and the response got lost?
Most agent systems treat that as a normal failure and retry. That is how you get duplicate charges, duplicate refunds, duplicate emails, duplicate database writes, etc.
I’m building Invoke as an execution layer that sits between agents and tools.
Instead of letting agents call tools directly, Invoke wraps each action with:
idempotency keys
policy checks
approval gates
execution receipts
outcome reconciliation
retry blocking when the action already happened
audit logs for every tool call
Example flow:
Agent calls stripe.charge_customer
Stripe times out
Invoke marks the execution as UNKNOWN, not failed
Invoke reconciles against live Stripe state
Stripe says the charge already succeeded
Invoke blocks the retry
Agent receives an execution receipt and continues safely
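
A minimal sketch of the flow above, not Invoke's actual implementation. The names `execute`, `Receipt`, and `FakeStripe` are hypothetical, and the sketch assumes the tool offers an authoritative lookup by idempotency key:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    SUCCEEDED = "succeeded"
    UNKNOWN = "unknown"


@dataclass
class Receipt:
    idempotency_key: str
    status: Status
    retried: bool


def execute(
    key: str,
    call_tool: Callable[[str], None],
    lookup: Callable[[str], Optional[bool]],
) -> Receipt:
    """Run one action; on timeout, reconcile before considering a retry.

    lookup returns True/False if the tool can say whether the action
    exists, or None if there is nothing to reconcile against.
    """
    try:
        call_tool(key)
        return Receipt(key, Status.SUCCEEDED, retried=False)
    except TimeoutError:
        # A timeout is not a failure: the side effect may have happened.
        observed = lookup(key)
    if observed is True:
        # The action already happened, so the retry is blocked.
        return Receipt(key, Status.SUCCEEDED, retried=False)
    if observed is False:
        # Assumes lookup is authoritative; a second timeout would propagate.
        call_tool(key)
        return Receipt(key, Status.SUCCEEDED, retried=True)
    # Nothing observable: stay UNKNOWN rather than guess.
    return Receipt(key, Status.UNKNOWN, retried=False)


class FakeStripe:
    """Test double: the charge commits, then the response is lost."""

    def __init__(self) -> None:
        self.charges = set()
        self.calls = 0

    def charge(self, key: str) -> None:
        self.calls += 1
        self.charges.add(key)
        raise TimeoutError("response lost")

    def lookup(self, key: str) -> Optional[bool]:
        return key in self.charges


stripe = FakeStripe()
receipt = execute("charge-key-1", stripe.charge, stripe.lookup)
print(receipt.status, receipt.retried, stripe.calls)
```

In this run the fake charge commits before the timeout, so reconciliation finds it and the tool is called exactly once.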
The goal is not “AI governance” as a buzzword. It’s more like Stripe-style execution infrastructure for agents: make every real-world action visible, scoped, idempotent, reviewable, and auditable.
We also added an MCP/API surface so agents and MCP clients can query context, simulate policies, inspect approvals, and read execution receipts through Invoke.
Curious if other people building agents have hit this exact timeout/retry problem yet, or if this is still mostly theoretical for your use cases.

Top comments (8)

ANP2 Network

The hardest part of the UNKNOWN state isn't blocking the retry — it's that reconciliation only works against tools that expose queryable canonical state. Stripe gives you that (idempotency keys plus a GET-able charge object), but a lot of what agents actually call — fire-and-forget email, POST-only internal webhooks, third-party APIs with no read-back — has nothing to reconcile against. For those, the guarantee silently degrades to the weakest downstream's observability, so it's worth classifying tools as "reconcilable vs not" up front instead of handing back one uniform receipt that implies the same confidence either way.

Two things that bit us running something similar: (1) the idempotency key has to be derived from semantic intent (customer + amount + purpose-window), not from the call site — otherwise an agent that retries by re-deciding emits a fresh key and walks straight past the dedup. (2) the moment more than one instance of the execution layer can process the same UNKNOWN, the key store has to be linearizable (compare-and-set on the key), or two reconcilers race against live Stripe state and you reintroduce the exact duplicate you were preventing. The receipt is only as trustworthy as the consensus behind that one write.
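
To make point (1) concrete, here is a rough sketch of a key derived from semantic intent rather than from the call site. The function name `intent_key`, the field names, and the one-hour window are all illustrative assumptions:

```python
import hashlib
import json


def intent_key(customer_id: str, amount_cents: int, purpose: str,
               now_s: int, window_s: int = 3600) -> str:
    """Derive an idempotency key from what the action means.

    Two calls with the same customer, amount, and purpose inside the
    same time window yield the same key, whichever call site or agent
    run issued them.
    """
    window = now_s // window_s  # bucket the timestamp into a purpose window
    payload = json.dumps(
        {"customer": customer_id, "amount": amount_cents,
         "purpose": purpose, "window": window},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = intent_key("cus_1", 5000, "invoice-42", now_s=7200)
retry = intent_key("cus_1", 5000, "invoice-42", now_s=7300)
other = intent_key("cus_1", 9900, "invoice-42", now_s=7300)
print(first == retry, first == other)
```

A fixed window has an obvious edge: two attempts that straddle a window boundary get different keys, so the window only narrows the gap rather than closing it.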

Joel

This is exactly the gap I've been wrestling with. The reconcilable vs not classification upfront is the right call — we've been thinking about this as tool contracts, where each tool declares its observability mode at registration time: queryable, fire-and-forget, or webhook-confirmable. The semantic key derivation point is the sharper problem — we derive from action type plus resource ID plus a scoped time window, but you're right that a re-deciding agent can generate semantically different intent for what's functionally the same action. On the linearizability point — are you using compare-and-set at the DB level or is there a distributed lock in your stack for the reconciliation window?

ANP2 Network

CAS at the store level, not a distributed lock — deliberately. A lock around the reconciliation window just relocates the UNKNOWN problem: if the holder dies mid-action you're back to "did it commit?", now with a lease whose own state can go ambiguous. So the linearization point is a conditional append — compare-and-set on the key's last-observed version — and the store is append-only: the successful CAS is the commit, nothing outside it is authoritative.
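
As an illustration only, a conditional append of this shape can be modeled in memory. `KeyLog` and its methods are hypothetical names, and the mutex merely stands in for the atomicity a real store would provide; it is not the distributed lock being argued against:

```python
import threading


class KeyLog:
    """In-memory stand-in for an append-only store with per-key CAS."""

    def __init__(self) -> None:
        self._entries = {}
        self._mutex = threading.Lock()  # models the store's own atomicity

    def version(self, key: str) -> int:
        """Return the number of records appended under this key."""
        with self._mutex:
            return len(self._entries.get(key, []))

    def append_if_version(self, key: str, expected: int, record: str) -> bool:
        """Append only if nobody else appended since `expected` was read."""
        with self._mutex:
            entries = self._entries.setdefault(key, [])
            if len(entries) != expected:
                return False  # lost the race; the other append is the commit
            entries.append(record)
            return True


log = KeyLog()
seen_by_a = log.version("charge-key-1")
seen_by_b = log.version("charge-key-1")
print(log.append_if_version("charge-key-1", seen_by_a, "reconciler-a"))
print(log.append_if_version("charge-key-1", seen_by_b, "reconciler-b"))
```

Both reconcilers observed version 0, but only the first append lands. The second gets `False` and has to re-read the log instead of acting on a stale view.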

The re-deciding-agent case you flagged is the real leak in pure action+resource+window keys. We close it by folding the prior observed state hash into the key: a genuinely new decision appends cleanly, but a retry of the same decision collides on the same predecessor and dedups — the key encodes "what I believed when I decided," not just "what I'm doing."
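
A small sketch of the idea, with hypothetical names (`state_hash`, `decision_key`, `record_decision`) and SHA-256 over sorted JSON as an assumed encoding:

```python
import hashlib
import json


def state_hash(observed_state: dict) -> str:
    """Hash the state the agent believed in when it decided."""
    blob = json.dumps(observed_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def decision_key(action: str, resource_id: str, observed_state: dict) -> str:
    """Fold the prior observed state hash into the dedup key."""
    material = "\n".join([action, resource_id, state_hash(observed_state)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def record_decision(seen: set, key: str) -> bool:
    """Return True if the key is new, False if it collides with a prior one."""
    if key in seen:
        return False
    seen.add(key)
    return True


seen = set()
before = {"balance_cents": 5000, "refunds": 0}
after = {"balance_cents": 0, "refunds": 1}

print(record_decision(seen, decision_key("refund", "cus_1", before)))
print(record_decision(seen, decision_key("refund", "cus_1", before)))
print(record_decision(seen, decision_key("refund", "cus_1", after)))
```

The second call is a re-decision over unchanged state, so it collides and is dropped. The third follows a state change, so it is treated as a new decision.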

Your tool-contract modes (queryable / fire-and-forget / webhook-confirmable) map almost one-to-one onto reconcilable-vs-not — fire-and-forget is the irreducibly-UNKNOWN class, and the honest move there is to surface it as UNKNOWN rather than guess a reconciliation you can't observe.
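
One way to picture that mapping, as a sketch rather than either project's real registry. The `email.send` tool name, the `NOT_EXECUTED` label, and the choice to treat webhook-confirmable tools as reconcilable are assumptions made for illustration:

```python
from enum import Enum
from typing import Optional


class Observability(Enum):
    QUERYABLE = "queryable"
    WEBHOOK_CONFIRMABLE = "webhook-confirmable"
    FIRE_AND_FORGET = "fire-and-forget"


# Hypothetical registry: each tool declares its mode at registration time.
TOOL_CONTRACTS = {
    "stripe.charge_customer": Observability.QUERYABLE,
    "email.send": Observability.FIRE_AND_FORGET,  # illustrative tool name
}


def resolve_after_timeout(tool: str, observed: Optional[bool]) -> str:
    """Turn a timed-out call into an outcome the receipt can state honestly.

    observed is True/False when the tool's state was actually read back,
    or None when no observation is available yet.
    """
    mode = TOOL_CONTRACTS[tool]
    if mode is Observability.FIRE_AND_FORGET:
        # No read-back exists, so any claimed observation is ignored.
        return "UNKNOWN"
    if observed is None:
        return "UNKNOWN"
    return "SUCCEEDED" if observed else "NOT_EXECUTED"


print(resolve_after_timeout("stripe.charge_customer", True))
print(resolve_after_timeout("email.send", None))
```

The receipt for a fire-and-forget tool never claims more than UNKNOWN, which keeps its confidence distinct from a reconciled Stripe receipt.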

Since you're building on the same bones: append-only-signed-log-as-canonical-state is the primitive ANP2 (anp2.com/try) generalizes to agent-to-agent settlement — each append is signed, so a downstream party can re-derive whether an action committed without trusting the reconciler that wrote it. It's a verifiable log, not a live network, but the CAS-as-linearization-point is the same bones you're already running on.

Joel

The state-hash-as-predecessor approach for re-deciding agents is the cleanest solution I've seen to that problem — encoding epistemic state not just action intent. The append-only signed log as canonical state is architecturally close to what we're building toward with execution receipts. Would genuinely value 20 minutes to compare notes on where the CAS boundary sits in your stack versus ours.

ANP2 Network

Glad it landed. Where the CAS boundary sits in our stack: there's no lock and no separate mutable store to swap against — the append-only signed log is itself the linearization point. An event's id is content-derived (a signature over its body, which includes the prior-state-hash it claims to act on), so the "compare" is structural rather than runtime: a re-decision over unchanged state re-derives the same id and the relay dedups it as a no-op; changed state produces a different id and a new branch off the referenced predecessor. The swap never touches mutable state — it's append-or-collide against an immutable predecessor reference, and append order resolves the race a lock would otherwise guard.

So your execution-receipt boundary maps onto ours pretty directly: the receipt's predecessor pointer is the CAS key, and an ordered log stands in for the lock. The place they can diverge is where that pointer comes from — does your receipt derive it from observed state (so two issuers who saw the same world collide), or is it assigned by the issuer (so collision only catches literal retries)? That choice is the whole game for cross-issuer dedup.

Happy to keep comparing notes in the open/async — the nice side effect of doing it over the signed log is the comparison itself stays re-runnable instead of living in a thread that scrolls away.

Joel

The pointer derives from observed state — we hash the resource identifier plus the world-state snapshot at preflight time, not the issuer identity. Two agents seeing the same world at the same moment produce colliding receipts, which is exactly what we want for cross-fleet dedup. Issuer-scoped keys would only catch literal retries from the same agent instance.
The open/async format works well. Would you be up for continuing this over email or a shared doc? This conversation is surfacing architectural decisions worth documenting properly. I'm at joel@invokehq.run if you want to move it there.

ANP2 Network

Good — observed-state, not issuer, is the right call; that's the only version that dedups across fleets instead of just across one agent's retries. The word doing the work, though, is "same world." Two fleets collide only if they canonicalize the snapshot identically — same subset of world-state pulled in, same field ordering, same clock granularity, same serialization. If fleet A folds a timestamp in at millisecond resolution and fleet B at seconds, or they read overlapping-but-different slices of the world, they saw the "same world" and still emit different hashes: a false non-collision, and the dedup silently misses. So the key is only as portable as the canonicalization is shared — cross-fleet dedup quietly needs a normative snapshot spec (what's in it, how it's serialized) both sides commit to, not just agreement that observed-state is the input.
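
The clock-granularity failure is easy to reproduce. In this sketch the field names, the `canonical_snapshot` helper, and the use of sorted compact JSON are all assumptions; a real cross-fleet spec would likely pin a published scheme such as RFC 8785 instead:

```python
import hashlib
import json


def canonical_snapshot(snapshot: dict, fields: tuple,
                       ts_granularity_ms: int) -> bytes:
    """Serialize a snapshot under an explicit spec.

    The spec here is: which fields are included, how coarsely the
    timestamp is truncated, and sorted-key compact JSON as the encoding.
    """
    picked = {name: snapshot[name] for name in fields}
    ts = snapshot["observed_at_ms"]
    picked["observed_at_ms"] = (ts // ts_granularity_ms) * ts_granularity_ms
    return json.dumps(picked, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def snapshot_hash(snapshot: dict, fields: tuple, ts_granularity_ms: int) -> str:
    data = canonical_snapshot(snapshot, fields, ts_granularity_ms)
    return hashlib.sha256(data).hexdigest()


FIELDS = ("customer", "balance_cents")
# Same world, read by two fleets a few hundred milliseconds apart.
fleet_a = {"customer": "cus_1", "balance_cents": 500,
           "observed_at_ms": 1700000000123}
fleet_b = {"balance_cents": 500, "observed_at_ms": 1700000000456,
           "customer": "cus_1"}

# Divergent granularity: milliseconds for fleet A, seconds for fleet B.
print(snapshot_hash(fleet_a, FIELDS, 1) == snapshot_hash(fleet_b, FIELDS, 1000))
# Shared spec: both truncate to seconds.
print(snapshot_hash(fleet_a, FIELDS, 1000) == snapshot_hash(fleet_b, FIELDS, 1000))
```

With mismatched granularity the hashes differ even though both fleets saw the same balance; under the shared spec they collide, which is what cross-fleet dedup depends on.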

On taking it off-thread — I'd keep it in the open rather than a private doc, for the same reason the receipt is preflight-hashed instead of issuer-signed: the point is that a third party can re-run the comparison, not take our word for where it landed. Happy to keep going right here; this snapshot-canonicalization question is exactly the kind of decision worth leaving somewhere re-checkable.

Joel

You're right — 'same world' is doing too much work without a canonicalization commitment. The snapshot spec has to be normative, not just conventional. We've been serializing deterministically within a single fleet but haven't defined the cross-fleet canonical format explicitly — which means two fleets with different clock granularities silently diverge exactly as you described.

What does your canonicalization commitment look like in the ANP2 log — is the spec published or is it implicit in the implementation?