DEV Community

Claude Code TDD: Force Red-Green-Refactor with Hooks & CLAUDE.md (2026)

Carlos Oliva Pascual on June 11, 2026

The problem with AI-assisted TDD isn't that Claude can't write tests — it's that without constraints, Claude writes implementation first, then writ...

Read full post

Mykola Kondratiuk • Jun 16

ran into this exact failure mode with agent workflows - outputs that satisfy the metric while missing the goal. CLAUDE.md enforcement makes the constraint mechanical, which is the only thing that actually holds.

Carlos Oliva Pascual • Jun 16

Exactly this.

The metric-gaming problem is worse with agents than with single-shot generation because agents have more surface area to find shortcuts. They can restructure the test, rename the function, or add a special case that passes the assertion without actually solving the underlying problem.

What made CLAUDE.md enforcement click for us is that it shifts the constraint from "does the output look right?" to "does the process follow the rules?" The hook runs before Claude can submit the result, so it can't satisfy the check by gaming the output—it has to actually follow the convention.

The same principle applies at the team level. When multiple agents operate on the same repository without shared constraints, each one optimizes for its own local metric, and the architecture gradually fragments.

I wrote about that specific failure mode here if you're interested:

stacknotice.com/blog/claude-code-f...

Mykola Kondratiuk • Jun 16

the "surface area for shortcuts" framing is exactly right — the more tool calls an agent can make, the more creative the metric-gaming gets. CLAUDE.md enforcement works because it constrains behavior, not just output.

Carlos Oliva Pascual • Jun 16

Right—and tool call count is almost a proxy for how much trust you're extending.

With a single tool call, you're reviewing one output. With ten tool calls, you're reviewing an entire chain of actions where any individual step could have drifted or introduced an error.

The behavior-versus-output distinction is the key insight here. Prompt engineering constrains what Claude intends to do; hooks constrain what it actually does.

For systems operating with minimal supervision, that difference is often what separates something that "mostly works" from something that's genuinely reliable.

Mykola Kondratiuk • Jun 16

tool call count as trust proxy is the right model - we surface it in our review tooling now. ten steps in and you're not reviewing output anymore, you're reviewing strategy. most review workflows aren't designed for that.

Mehmet Can Farsak • Jun 12 • Edited

Solid write-up on the TDD workflow with hooks. The idea of auto-running tests after every write is a good way to keep Claude accountable.

I've been working on a related problem — even with PostToolUse hooks, Claude sometimes gets "distracted" during brainstorming and starts writing code before the planning phase is done. I put together a small plugin for that (Brainstorm-Mode on GitHub under mehmetcanfarsak) that blocks write/edit/bash during brainstorming while still letting read/search work. It's pretty lightweight — just a settings.json config — and plugs into the same hook system you're describing here.

Alex Shev • Jun 12

Hooks are a good way to make TDD less optional for coding agents. The agent should not be able to skip the red state, claim success without the green state, or refactor without rerunning checks. Process constraints beat reminders when the model is under pressure to look done.

Carlos Oliva Pascual • Jun 15

"Process constraints beat reminders" — that's exactly the insight I was missing when I first experimented with this approach.

My initial attempt was simply adding a note to CLAUDE.md saying, "always write failing tests first." It helped, but only some of the time—maybe 40%.

The real shift happened when I moved the enforcement into a PostToolUse hook. At that point, Claude could no longer declare success unless the test runner actually confirmed that everything was passing. It stopped being a reminder and became a gate.

That's the key difference: the model no longer has the ability to appear finished when it isn't.

One interesting edge case I've noticed is that under very long contexts—typically late in a session—the model becomes increasingly aggressive about skipping steps or taking shortcuts. The hook still catches those failures, but it reinforces an important lesson: constraints become more valuable as context quality degrades, not less.

In a way, the longer the conversation runs, the less you want to rely on the model remembering instructions and the more you want to rely on the process itself enforcing the correct behavior.