Mike Czerwinski

Posted on Jun 21

Vibe coding is not a level. It's an axis.

#ai #llm #productivity #architecture

Karpathy gave us vibe coding: "see stuff, say stuff, run stuff, copy and paste stuff, and it mostly works." Since then, the industry has kept trying to turn it into a tidy autonomy ladder — Level 0, Level 1, all the way up to fully autonomous development.

That ladder is useful. It is also incomplete.

It measures one thing: how much of the building you delegate to AI.

But two people can delegate the same amount and get radically different outcomes. One compounds. The other accumulates entropy. Same autonomy level. Different operating system.

That's the missing axis: operator discipline.

By operator discipline I mean one thing: how much of your work survives the session boundary as inspectable state.

What the vertical axis measures

The autonomy ladder — inspired by Karpathy, reinforced by recent writing on AI-assisted development, and repeated in a dozen industry variants — measures one vertical: how much of the work you direct the model to own, and how fluent you are at directing that delegation.

L0: no AI
L1: AI as autocomplete
L2: intent-driven (you specify the what, AI fills the how)
L3: collaborative pair-programming
L4: semi-autonomous (AI executes multi-step tasks, you review)
L5: fully autonomous (AI owns the loop)

Each step is a skill ladder inside one domain — building software. You climb by getting better at prompts, decomposition, code review-at-speed, and tolerance for non-determinism.

This is real and worth measuring. It's just not the only axis.

The horizontal axis most maps underweight

Here's the question the vertical can't answer:

Two developers are both at Level 4. One ships features that compound — the codebase gets cleaner, their operating context gets sharper, their next prompt does more with less. The other ships features that decay — the codebase grows entropy, their trust in the model degrades, every new prompt is a fresh negotiation.

Same vibe coding level. Different outcomes. What's the difference?

It's not skill at building. It's how the person relates to the tool over time.

Some maps name fragments of this — trust, verification, code review burden, the "perception–action gap" between knowing AI code can be wrong and being able to actually catch it. Those are real and worth reading. But they tend to live as caveats inside the autonomy story, not as a second axis with its own structure.

So let me try to draw the axis directly.

A small concrete example, since the abstraction needs one. For about three months I kept re-explaining the same architecture decision to the model every few sessions. Each time it would respectfully suggest the alternative I'd already rejected. Each time I'd argue it down again. The work felt fine in any single session. Across those months it was exhausting.

Then I started writing those decisions down in a separate store, with a status field: proposed → accepted → locked. Once a decision is locked, the model is told not to relitigate it without an explicit unlock.
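A minimal sketch of that lifecycle in Python, to make the "status field" concrete. The names Decision, Status, advance, unlock, and may_relitigate are illustrative assumptions, not the actual store described here. The choice to reopen an unlocked decision as "proposed" is also an assumption, since the post only says an unlock must be explicit.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    LOCKED = "locked"


# Allowed forward transitions. LOCKED has no entry, so it cannot advance.
_NEXT = {Status.PROPOSED: Status.ACCEPTED, Status.ACCEPTED: Status.LOCKED}


@dataclass
class Decision:
    title: str
    status: Status = Status.PROPOSED
    history: list = field(default_factory=list)  # earlier statuses, oldest first

    def advance(self) -> None:
        """Move one step along proposed -> accepted -> locked."""
        if self.status not in _NEXT:
            raise ValueError(f"'{self.title}' is locked; unlock it explicitly first")
        self.history.append(self.status)
        self.status = _NEXT[self.status]

    def unlock(self, reason: str) -> None:
        """Reopen a locked decision. A non-empty reason is required."""
        if self.status is not Status.LOCKED:
            raise ValueError("only a locked decision can be unlocked")
        if not reason.strip():
            raise ValueError("unlocking requires a stated reason")
        self.history.append(self.status)
        # Assumption: an unlocked decision goes back to 'proposed'.
        self.status = Status.PROPOSED


def may_relitigate(decision: Decision) -> bool:
    """True unless the decision is locked."""
    return decision.status is not Status.LOCKED
```

The point of the sketch is that reopening a decision takes a deliberate call with a reason, rather than happening by default whenever a new session starts.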

The relitigation stopped. The work got calmer. The codebase started moving in one direction instead of wobbling.

Nothing about my vibe coding level changed. What changed was that a decision became a piece of state instead of a thing I had to defend live.

That's the axis. Not "are you good at prompting" — how much of your context is a state machine, vs. how much is reconstructed from scratch each session.

The 2×6 matrix

If autonomy is L0–L5 and operator discipline is Low/High, you get twelve cells. The diagonal that matters isn't "low everything → high everything." It's the cross-axis claim:

L1 + High operator discipline > L5 + Low operator discipline over any time horizon longer than a sprint.

Three sample cells:

L3 + Low: fast and brittle. Codebase entropy rising. Trust in the model is high in any given session and degrades across sessions because nothing about wrongness ever feeds back.
L3 + High: fast and stable. Trust calibrated by sampling. Wrongness feeds back into the persistent context as a constraint, so the next session is starting from a better prior.
L5 + Low: maximum velocity into maximum mess. This is the failure mode every honest writeup of autonomous agents eventually admits to — locally-sensible decisions that miss global constraints, with no substrate to catch the drift.

The claim is that the second axis dominates the first over time. I think it's right. It's testable. If you've watched two equally fluent AI users diverge over six months, you've already seen the pattern.

What operator discipline actually is

I'll describe what I personally run — not as the right answer, but so you have something concrete to disagree with.

A persona file the model loads each session: identity, communication preferences, hard rules, things that previously caused friction. Updated when a session reveals a new edge case.

Three append-only stores. Decisions have a lifecycle (proposed → accepted → locked). Threads are active workstreams, each with current step, blocker, and next action. Notes are atomic facts with source-anchoring — every fact carries provenance: which email, which call, which file, which line.

A capture habit. Decisions go into the store the same turn they happen, not as a post-session recap. Recaps drift. Live captures don't.

Locked decisions stop the death-by-second-guessing loop. Source-anchoring removes one easy path to hallucination — the model is less likely to confidently restate a "fact" when the workflow forces provenance into view.
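To make "append-only" and "source-anchoring" concrete, here is a small sketch, assuming a JSON Lines file as the storage format. The post does not say how the stores are implemented, so Note, AppendOnlyStore, and the field names source and locator are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class Note:
    fact: str
    source: str   # which email, call, or file the fact came from
    locator: str  # where inside the source, e.g. a line number


class AppendOnlyStore:
    """Writes one JSON object per line and never rewrites earlier lines."""

    def __init__(self, path):
        self.path = Path(path)

    def append(self, note: Note) -> None:
        # Provenance is mandatory: a fact with no source is rejected.
        if not note.source.strip():
            raise ValueError("refusing to store a fact without provenance")
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(note)) + "\n")

    def read_all(self) -> list:
        """Return every stored note in the order it was written."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [Note(**json.loads(line)) for line in f if line.strip()]
```

The store exposes no update or delete method, so a correction has to be a new entry rather than an edit of an old one. That is one plausible reading of "append-only", not a description of the real setup.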

None of this is novel architecture. The novelty is that it's written down and enforced, not implied. It's a state machine, not a prompt trick.
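The state lives in the stores, but each session it still has to reach the model somehow. One way that could look, as a sketch only: a function that assembles the persona text and the currently locked decisions into a session preamble. The function name, the dict shape for decisions, and the wording of the instruction line are assumptions for illustration.

```python
def build_session_preamble(persona: str, decisions: list) -> str:
    """Combine persona text and locked decisions into one preamble string.

    `decisions` is assumed to be a list of dicts with 'title' and 'status'.
    Only decisions whose status is 'locked' are included.
    """
    locked = [d["title"] for d in decisions if d["status"] == "locked"]
    lines = [
        persona.strip(),
        "",
        "Locked decisions (do not reopen without an explicit unlock):",
    ]
    # Fall back to a visible placeholder when nothing is locked yet.
    lines += [f"- {title}" for title in locked] or ["- (none)"]
    return "\n".join(lines)
```

Because the preamble is generated from the store instead of typed by hand, the text the model sees changes only when the recorded state changes.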

Whatever your autonomy level, you can be high or low on this. That's the axis.

What I'm not claiming

I'm not claiming that discipline beats fluency at every level. The two multiply. An L1 user with high discipline still moves slower than an L4 user with high discipline.

I'm not claiming the autonomy ladder is wrong. It's real and worth climbing.

What I am claiming: the map has two axes, and most of the public conversation has been about one of them. If "more AI" hasn't translated into "more leverage" for you, the answer might not be a smarter model. It might be the axis you weren't measuring.

What does your operator discipline look like? What's captured as state, what's reconstructed every session? Curious to hear concrete setups in the comments — especially ones that disagree with mine.

— Mike

Top comments (5)

Gamya • Jun 22

The two-axis framing really clicked for me—especially the claim that L1 + high operator discipline compounds better than L5 + low discipline over time. That matches something I've been noticing but hadn't been able to articulate clearly: two people using the same tools diverging significantly over months, and the difference not really being about prompting skill at all.
The decision lifecycle (proposed → accepted → locked) is the part I keep thinking about. The relitigation problem you describe—re-explaining the same architectural decision every few sessions—is exactly the kind of invisible tax that makes AI-assisted work feel exhausting without a clear reason why. Making a decision a piece of state instead of something you have to defend live every time is such a simple fix that I'm surprised it isn't talked about more.
Really glad my post connected with this one — yours gives the operational answer to the question I left open. 🌸

Mike Czerwinski • Jun 22

"Two people same tools diverging over months" is the empirical pattern the Anthropic paper just measured at population scale — persistent returns to expertise across 400k sessions. Hearing the same shape from your own practice is part of why this is starting to feel like a thing the field is coalescing on, not a take. Your judgment-bottleneck framing and the decision store are the same thesis from opposite ends — yours names what costs, mine names where it accumulates. Glad the pieces meet in the middle.

Raffaele Zarrelli • Jun 21

This is the clearest articulation of the second axis I have read, and the L1+High beats L5+Low claim matches what I have watched happen. I run almost the same stores (a persona file, decisions with proposed/accepted/locked, append-only) but for non-code work: marketing, sales, product calls, where sessions are days apart instead of minutes. That gap is where I would push on one rule. You say recaps drift and live capture does not, and inside a tight build loop I agree completely. But when the work has no fast feedback loop, the thing that drifts is not the recap; it is the operator. Nothing in the session pressures me to write anything down, so live capture quietly lapses and I do not notice for a week. What saved it was making capture a hard boundary the task cannot close without: a forced end-of-task step, precisely because there is no compiler or failing test reminding me. So the refinement I would add: live capture is the right default, but it only fires reliably when the work itself punishes you for skipping it. When the loop is slow, what makes your discipline fire on time: the habit, or something structural in the workflow?

Mike Czerwinski • Jun 21

Turns out the structural-pressure point is what I'd been calling "discipline": sloppy on my part. Tight build loop, the compiler IS the pressure. Capture lapses when nothing's flashing red. The real variable isn't whether to live-capture; it's how loud your feedback loop is.

What makes mine fire on time depends on what I'm running. In code, the work reminds me — failing tests, drift alarms in the hook, structural review on commit. In ops (vendor work, compliance, sales calls), somebody else's calendar does what tests do for code: JPK_V7M, RAS, contract review windows. The pressure isn't mine, which is the only reason it works. In research, weekly agile retro is the only structural beat that fires; the exploration itself has no internal alarm. Three different mechanisms, same principle: the alarm lives outside the operator, or it doesn't fire.

Forced end-of-task close is the missing piece for slow-loop work. Turning the structure into the close step removes the question of whether to capture at all. Stealing it. The refinement to my original framing: live capture is the right default when the loop punishes skipping. When it doesn't, the workflow has to do the punishing instead.

This is the pushback I joined the platform for. Built the stores in isolation; the only way to know if they generalize or just fit my desk is to put them in front of other operators.

Raffaele Zarrelli • Jun 21

'The alarm lives outside the operator, or it doesn't fire' is the cleanest version of this I have seen, and it survives the failure mode too: even an external alarm only works if the operator cannot quietly silence it. The dismissed calendar reminder and the skipped retro are the slow-loop equivalents of muting the test suite. For the forced close to hold on slow-loop work, I think it has to be load-bearing for the next session, not just a ritual at the end of this one. The Memory Update survives in my setup only because the next task is useless without it, so skipping it punishes me immediately the next time I sit down, which is the one pressure that actually comes from me. That is the closest I have gotten to building the alarm into the work instead of bolting it on. On putting the stores in front of other operators to see if they generalize: that is exactly why I put mine in the open, so here is a second operator's stores to diff against yours. In cowork-os, the decision lifecycle and the assumptions lane are the parts most worth comparing. Where did your three mechanisms disagree most with how another operator would close the same task?