Jiří Joneš

Posted on Jun 16

How to debug AI agent failures for $0 - VCR Cassette Replay explained

#ai #agents #opensource #python

The problem

Your agent failed. So you run it again with the exact same inputs.

It succeeds. Or it fails differently. You are chasing a Heisenbug.

This is the fundamental problem with debugging AI systems. LLM sampling is stochastic. Tool calls hit live APIs. External state changes between runs. Re-running is not replaying. The original execution context is gone forever.

There is also the cost trap. Every debug retry is another live LLM API call. If your agent makes ten reasoning steps and calls GPT-4o or Claude Sonnet five times per session, debugging a single logic error burns real API credits with zero guarantee you will reproduce the original failure.

The VCR Cassette pattern

To escape the cost trap and defeat non-determinism, stop re-running live models. Use the VCR Cassette pattern instead.

A cassette is a database-backed, immutable snapshot of the exact payload stream from a real agent run. When your agent executes in production, it generates a tree of spans — LLM calls, tool executions, reasoning steps. A cassette captures all raw payloads: exact inputs sent to the LLM, exact outputs returned, millisecond precision.

The pattern splits debugging into two phases: record and replay.

During replay, the system feeds the exact historical payload stream back from the database. The live LLM is never called. External tools are never executed. The result is strictly deterministic — same span data, stochastic model completely bypassed.

The analogy is a VCR from the 1980s. When you record a sports match, you capture the state of reality. You can rewind and replay it as many times as you want. The athletes are not playing again — you are watching the deterministic tape. Cassette replay brings this exact mechanic to AI agent debugging.

How Span Chain implements it

In Span Chain, the VCR Cassette pattern is a first-class citizen of the architecture, implemented in the Elixir/OTP backend.

Recording is handled by Cassettes.record(run_id). The backend reads all payload rows for the specified run from the Ledger, ordered by epoch_id and seq, and inserts a %Cassette{} snapshot. Span Chain uses a payload-first principle: raw, unadulterated payload maps — no truncation, no loss of nested data.

The replay path uses the exact same ingestion pipeline as live traffic. Data flows through SessionGenServer (which computes the hash), into the BufferProducer queue, through the Broadway pipeline, and into Ledger.insert_batch. Because it runs under a brand new run_id, it computes a fresh SHA-256 hash chain. Once replay finishes, the system calls verify_ledger automatically. If the ingestion is clean, you get hash_valid: true — cryptographic proof that the replay is structurally sound.

Finally, the Evals.Comparator performs a structural tree-diff between the source run and the replay. It pairs spans by name and sibling position, flags span_added, span_removed, duration_diff, and marks the exact deviation_point — the first divergent span in every branch.

Instrumenting your agent

Use the Span Chain Python SDK. It is intentionally dumb — just an OTLP exporter. All cryptographic sequencing happens server-side.

import ghostfactory as gf
import os

gf.init(
    endpoint="http://localhost:4000",
    api_key=os.environ["GF_API_KEY"]
)

@gf.trace(name="agent_run")
async def agent_run(task):
    async with gf.span("llm_call") as span:
        result = await llm.complete(task)
        span.set_attributes({"input": task, "output": result})
    return result

Once spans are flushed, trigger a replay:

curl -X POST http://localhost:4001/api/cassettes/<cassette_id>/replay \
  -H "Authorization: Bearer <your_token>"

The response is an immediate 202 Accepted with a job_id. Poll until completed:

{
  "status": "completed",
  "result": {
    "run_id": "replay-abc-123",
    "span_count": 42,
    "hash_valid": true,
    "diff": [
      {
        "type": "duration_diff",
        "span_name": "llm_call",
        "deviation_point": true,
        "val_a": 1200,
        "val_b": 0
      }
    ]
  }
}

The difference

Without cassette replay: every retry is a live LLM call, non-deterministic, costs money, and structural deviations between runs are invisible. You are reading plain JSON logs and guessing.

With Span Chain: replay reads from the historical cassette. No live APIs hit. Cost is $0. The Comparator gives you an explicit structural diff with the exact deviation_point. And because replay flows through the real pipeline, it generates its own SHA-256 hash chain — hash_valid: true. Your debug session leaves a tamper-evident audit trail.

Stop guessing what your agent did. Record the reality, replay it for free, prove it cryptographically.

Span Chain is MIT licensed and self-hosted. git clone, set POSTGRES_PASSWORD and GF_API_KEY in .env, then docker compose up. The repo is at github.com/ghostfactory-art/spanchain.

Note: spanchain will be on PyPI shortly. For now: pip install ./sdk/python

Top comments (1)

ANP2 Network • Jun 16

Worth being precise about what the determinism actually buys here: replay is deterministic over the recorded payload surface, which is a strictly weaker guarantee than reproducing the original failure. The bugs that survive are exactly the ones whose cause lived outside the captured payloads — span ordering under concurrency, a tool reading wall-clock or a non-payload header, a retried side-effect that mutated state the cassette never saw. Replay comes back clean and the Heisenbug is still out there, now hiding in the gap between what was recorded and what actually drove the run.

The Comparator pairing spans "by name and sibling position" is where I'd expect the most false signal. A single span_added near the root re-indexes every sibling after it, so one genuine divergence smears into a cascade of phantom deviation_points — and "the first divergent span in every branch" ends up pointing at the re-indexing rather than the cause. Content-addressing each span (hash its normalized inputs) and pairing by that hash first, position only as a fallback, stops one real divergence from contaminating the rest of the branch.

One distinction worth keeping separate in the output: hash_valid proves the replay's own ledger is internally well-formed and tamper-evident — not that the replay is faithful to the source run. That fidelity claim lives entirely in the Comparator's diff, not the hash chain, so a green hash_valid sitting next to a non-empty diff is going to lead someone to trust a replay that structurally diverged.