Mike Czerwinski

Posted on Jun 21

Anthropic measured the human side. Five operators are building the agent side.

#ai #llmops #agents #operatordiscipline

I joined dev.to a few days ago because I'd run out of places to argue this stuff out. After months of building a framework (operator discipline as an orthogonal axis to autonomy, locked decisions with status fields, drift detection, supersession trails), the only thing I was sure of was that internal coherence isn't proof of anything. Frameworks survive by surviving other people, not by surviving the author.

So I started publishing. Today the framework finally hit something outside my own head.

What Anthropic measured

On June 16, Anthropic Economic Research published "Agentic coding and persistent returns to expertise." About 400,000 interactive Claude Code sessions. About 235,000 people. October 2025 to April 2026. Expertise patterns, delegation patterns, success patterns.

The central finding, in their own words:

"The greater domain expertise a person brings to a session, the more work Claude does per instruction."

"Success is determined by how well a person understands the problem they are trying to solve, not whether they're trained in coding."

Anthropic did not measure operator discipline directly. It measured the closest empirical neighbor: expertise as a multiplier on agentic work.

Expert-rated sessions show about 2.4× as many Claude actions per prompt as novice-rated sessions, and roughly 5× the text output. The signal is not simply "knows how to code." The signal is "understands the problem well enough to steer the agent." That overlaps with the axis I argued for in my first post on dev.to: vibe coding is not a level, it's an orthogonal axis to autonomy. My stronger claim was that L1 + High discipline outperforms L5 + Low discipline over time. Anthropic does not measure that claim directly, but it gives the human side of the axis something measurable.

What the report does not try to answer is the agent-side question: what kind of state, memory, governance, and transition rules have to exist so that the work compounds across sessions instead of being reconstructed every time. Its scope is interactive Claude Code usage — what work is done, who does it, whether the session succeeds — and it explicitly leaves out large parts of non-interactive/headless usage and does not measure downstream real-world outcomes.

That gap is what the practitioner cluster is circling from the other direction.

What the cluster is building

Five other operators on this platform have been pushing on the agent-side question from different starting points this week:

Rapls on status fields and append-only decision logs.
Scarab Systems on governed baselines and deterministic enforcement.
NOVAInetwork (@0xdevc) on quorum as a substitute for operator discipline at scale.
Raffaele Zarrelli (@sarracin0) on structural pressure when the loop is slow.
Brian Hall on the deterministic gate — and now with an open-source reference architecture (faramesh-core, MPL-2.0).

The short version of the cluster: five different starting points, one architectural conclusion — the LLM proposes, deterministic rules enforce, humans authorize transitions, and the rules live outside the agent's reasoning loop.
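One way to picture that conclusion is a transition table that lives outside the agent. This is only a minimal sketch under stated assumptions: the status names, the `ALLOWED` table, and the `apply_transition` function are hypothetical and are not taken from any of the five projects.

```python
# Hypothetical sketch: the agent may propose a transition, a fixed rule table
# decides whether the transition is legal, and only a human may authorize it.

# Legal (current status, target status) pairs. The table is plain data,
# so the agent's reasoning cannot change it.
ALLOWED = {
    ("proposed", "accepted"),
    ("accepted", "locked"),
    ("accepted", "superseded"),
    ("locked", "superseded"),
}


def apply_transition(current: str, target: str, authorized_by: str) -> str:
    """Return the new status, or raise if the rules or the authority check fail."""
    # Deterministic enforcement: illegal transitions are rejected outright.
    if (current, target) not in ALLOWED:
        raise ValueError(f"transition {current} -> {target} is not allowed")
    # Human authorization: the agent can suggest, but cannot commit.
    if authorized_by != "human":
        raise PermissionError("only a human may authorize a transition")
    return target


new_status = apply_transition("proposed", "accepted", authorized_by="human")
```

The point of the shape is that both checks are ordinary code and data, so neither depends on what the model happens to conclude in a given session.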

That's the agent-side scaffolding that sits outside the Anthropic report's scope.

Two halves of the same answer

Anthropic measured what happens when humans bring expertise into the loop. The cluster I spent today reading and writing with is building architecture for what happens when that expertise has to survive across sessions, tools, and agents. Same axis, two directions, a fuller picture.

Official research from Anthropic, independent practitioners on dev.to, both pointing at adjacent parts of the same problem. Not the same claim. Not the same layer. But the same direction.

That's not a viral take. That's an early convergence signal.

I came here to test the framework against operators who actually ship with it. The framework didn't collapse on contact. It got sharper. The peers who pushed back named gaps I hadn't seen. And one of the biggest labs in the room published the human-side measurement while we were doing it.

Two independent signals converging from different directions, in the same week, on the same problem space. That's not the framework being right. It's the field starting to coalesce.

It's a good Sunday to close the loop.

Operator discipline is no longer just a personal workflow. It is starting to look like an axis, a measurement problem, and an architecture. Whatever comes next has to be built, measured, and governed.

https://www.anthropic.com/research/claude-code-expertise

Top comments (5)

Raffaele Zarrelli • Jun 21

The framing I keep coming back to here: Anthropic measured expertise as a per-session multiplier, expert in the chair, 2.4x actions per prompt. What the cluster is building is the thing that turns a per-session multiplier into a compounding one. If the expertise re-enters the operator's head every session, it's rented. If it survives as state the next session can read and the agent can be governed against, it's owned. The agent-side scaffolding is the conversion mechanism, not just an adjacent layer.

On the part you tagged me with (structural pressure when the loop is slow): that's also where the report's scope ends. Its data is interactive Claude Code, expert present every prompt, fast loop. The unmeasured case is the slow loop, business and ops work spaced over days, where the expert is not in the chair each turn. There the expertise has to be written down or it's just gone, and nothing in the session punishes you for skipping it. So discipline stops being a personality trait and has to become structure.

Useful to be read this precisely. Where do you think the first real disagreement inside the cluster lands, the enforcement layer or the read path?

Mike Czerwinski • Jun 21

The read path. The enforcement layer reads like consensus from the outside — everyone in the cluster lands on the same shape (deterministic gate, LLM never the seat, operator owns transitions) and the differences are mostly about how aggressive the gate is and what it gates on. Brian's hard line on proxy-outside-reasoning, Scarab's governed baseline that itself evolves, NOVA's quorum substituting for operator authority — these are tunings of the same architecture.

The read path is where the disagreement is buried and hasn't surfaced yet. Status filter (filtered-out reads as absent) versus governed baseline (still readable, marked) versus quorum-aggregated (multiple weak signals, present as confidence) versus content-addressed re-runnable proof (read is itself a verification step): those are four genuinely different theories of what "present in the store" means, and they have different implications for cold start, for adversarial drift, for how an agent acts on partial information. Most of us have been talking about the write side: how decisions get in, transition, age out. The read side is where the framework choices actually show, and the moment somebody insists on one reading semantics over another, the cluster's apparent convergence is going to split.

Your rented-vs-owned framing puts a name on what makes this matter. A read path that produces stale-as-absent feels owned, because nothing claims to be there that isn't honest. A read path that produces stale-as-uncertain feels rented, because the operator carries the verification cost every time. That's the choice with operational consequences, and I don't think anyone in the cluster has made it explicitly yet.

Raffaele Zarrelli • Jun 22

Then let me make the choice, since you're right nobody has. I built on stale-as-absent for the live read: the agent acts only on the current set, nothing claims to be present that isn't honest. But pure stale-as-absent has a failure mode that mirrors the rented one. If a superseded decision just vanishes, you lose the why, and the agent re-proposes the thing you already rejected. You stop paying verification on read and start paying it on write, re-litigating settled questions at cold start. So the read I trust is stale-as-absent for what the agent acts on, plus a cheap one-hop path to "why is this absent". The supersession trail stays inspectable in the same file, it just doesn't enter the live set by default. In cowork-os that is the exact shape: decisions carry a status, the live read is the current set, superseded rows stay in the file so a human can see and correct, and the agent gets pointed at the trail when a proposal smells already-closed.

On adversarial drift, stale-as-absent is the more robust default because absence is hard to forge. Stale-as-uncertain is the attacker's friend: flood the store with weak signals, everything reads as uncertain, and the operator pays verification forever, which is rented by another name. So back at you: does the supersession trail count as present in your read semantics, or is it a separate store the live read is never allowed to touch?

Mike Czerwinski • Jun 22

Same store, two read modes. Live read = status ∈ {accepted, locked}; the supersession trail lives in the same file with a replaced_by pointer on every supersede. The agent's default query never sees superseded rows — but they're one hop away when a proposal smells already-closed. No second store, no separate inspection surface. The trail is part of the record, just not part of the live set.
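A minimal sketch of the two read modes, assuming an in-memory list of rows. The `status` and `replaced_by` fields and the live set {accepted, locked} come from the description above; the `Decision` class, the `live_read` and `trail` functions, and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Statuses that count as "present" for the agent's default query.
LIVE_STATUSES = {"accepted", "locked"}


@dataclass
class Decision:
    id: str
    text: str
    status: str                       # "accepted", "locked", or "superseded"
    replaced_by: Optional[str] = None  # set when the row is superseded


def live_read(store):
    """Default read mode: superseded rows are treated as absent."""
    return [d for d in store if d.status in LIVE_STATUSES]


def trail(store, decision_id):
    """Second read mode: follow replaced_by pointers from one row onward."""
    by_id = {d.id: d for d in store}
    chain, seen = [], set()
    current = by_id.get(decision_id)
    # The seen set guards against a malformed store with a pointer cycle.
    while current is not None and current.id not in seen:
        chain.append(current)
        seen.add(current.id)
        current = by_id.get(current.replaced_by) if current.replaced_by else None
    return chain


store = [
    Decision("D1", "Use SQLite for the decision log", "superseded", replaced_by="D2"),
    Decision("D2", "Use an append-only file for the decision log", "locked"),
    Decision("D3", "Agents may not edit status fields", "accepted"),
]

live_ids = [d.id for d in live_read(store)]      # D2 and D3 only
d1_trail = [d.id for d in trail(store, "D1")]    # D1, then D2
```

Both functions read the same list, which is the whole point: the trail is not a second store, just a second query.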

That makes "why is this absent" a path the schema knows about, not a human courtesy. When the agent proposes X, the gate checks prior superseded rows whose replaced_by chain terminates near X, and points it at the trail before write. Cold-start re-litigation costs one lookup, not a re-debate.
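As a rough sketch of that pre-write check: "terminates near X" is not defined in the thread, so this example assumes the simplest possible stand-in, an exact match on a hypothetical `topic` field. The `Row` class and the `chain_end` and `gate_check` functions are illustrative names as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    id: str
    topic: str                         # hypothetical stand-in for "near X"
    status: str
    replaced_by: Optional[str] = None


def chain_end(rows_by_id, row):
    """Follow replaced_by pointers until the chain stops (or loops)."""
    seen = set()
    while row.replaced_by in rows_by_id and row.id not in seen:
        seen.add(row.id)
        row = rows_by_id[row.replaced_by]
    return row


def gate_check(rows, proposal_topic):
    """Return ids of superseded rows whose chain ends on the proposal's topic.

    A non-empty result means the proposal may already be settled, so the
    caller should point the agent at those rows before any write happens.
    """
    rows_by_id = {r.id: r for r in rows}
    return [
        r.id
        for r in rows
        if r.status == "superseded"
        and chain_end(rows_by_id, r).topic == proposal_topic
    ]


rows = [
    Row("D1", "storage", "superseded", replaced_by="D2"),
    Row("D2", "storage", "locked"),
    Row("D3", "auth", "accepted"),
]

storage_hits = gate_check(rows, "storage")  # the superseded row D1
auth_hits = gate_check(rows, "auth")        # nothing superseded on this topic
```

A real gate would need a better notion of nearness than string equality, but the cost profile is the same: one lookup over the trail instead of a fresh debate.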

Your adversarial framing is the sharper half. Credit is yours, and I'm running with it. I had chosen stale-as-absent for ergonomics: fewer ghost decisions in the live read. "Absence is hard to forge; uncertainty is the attacker's friend" reframes the default as a security property, not just hygiene. Flooded weak signals can't drag absence toward present; they can drag uncertainty anywhere they want.

Open edge for you: who owns the replaced_by write? In cowork-os, is supersede a human-authored transition, or can the agent propose the link and a deterministic rule confirm it? Mine's humans-only on that edge for now — feels load-bearing — but I'm not sure it should stay there.

Raffaele Zarrelli • Jun 22

On who owns the replaced_by write: I'd split what "owns" bundles together. Proposal and confirmation are different authorities, and only one of them is dangerous. The agent should own the proposal, it just did the reasoning that produced X, so it is the cheapest place to draft what X replaces. The risk is not the draft, it is the confirm, because supersede is a removal from the live set, the exact transition we just agreed needs friction (removing protection, not adding it).

So in cowork-os the supersede is agent-proposed inside the Memory Update step, but it lands as a visible diff in the decisions file (a status plus a replaced_by line, human-readable), and the authority lives in that visibility, not in a synchronous human gate. Then scope the confirm by blast radius: a deterministic rule auto-confirms the low-blast-radius supersedes (typo, narrow scope), and the load-bearing ones stay human-confirmed. Humans-only-flat is the failure mode, because under volume the load-bearing supersede gets the same rubber-stamp as the trivial one, so the gate stops protecting exactly where it matters. Repo if it helps: cowork-os (decisions carry a status, Memory Update writes the transition).
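To make the blast-radius split concrete, here is a hedged sketch of a routing rule. It is not the cowork-os implementation: the `SupersedeProposal` fields, the `kind` labels, the `dependents` count, and the thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SupersedeProposal:
    old_id: str
    new_id: str
    kind: str        # hypothetical label: "typo", "narrow", "foundational", ...
    dependents: int  # hypothetical count of other decisions that cite old_id


# Assumed policy: only small, isolated changes may skip the human.
AUTO_CONFIRM_KINDS = {"typo", "narrow"}
MAX_AUTO_DEPENDENTS = 0


def route(proposal: SupersedeProposal) -> str:
    """Decide who confirms an agent-proposed supersede."""
    low_blast_radius = (
        proposal.kind in AUTO_CONFIRM_KINDS
        and proposal.dependents <= MAX_AUTO_DEPENDENTS
    )
    return "auto_confirm" if low_blast_radius else "human_confirm"


typo_fix = route(SupersedeProposal("D1", "D2", kind="typo", dependents=0))
cited_typo = route(SupersedeProposal("D3", "D4", kind="typo", dependents=3))
foundational = route(SupersedeProposal("D5", "D6", kind="foundational", dependents=0))
```

Under these assumptions a typo fix on an uncited decision is confirmed by rule, while anything cited elsewhere or labeled foundational still waits for a human, so human attention is spent where the consequences are.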

Question back: is your humans-only edge flat, or does it already read blast radius? If it is flat, the typo-supersede and the foundational-supersede pay the same human cost, which is the consequence-blind trap one level up, on the write side this time.