The problem with AI-assisted TDD isn't that Claude can't write tests — it's that without constraints, Claude writes implementation first, then writes tests that match that implementation. You get 100% test coverage and zero confidence the tests catch anything.
This guide shows how to configure Claude Code so TDD isn't a guideline you might forget — it's the only workflow available.
Full guide: stacknotice.com/blog/claude-code-tdd-workflow-2026
The Fix: CLAUDE.md Rules
## Development workflow — STRICT TDD (always follow this order)
1. **RED**: Write a failing test first. Run `npm test` — confirm it FAILS.
Do NOT write implementation before this step.
2. **GREEN**: Write the minimum code to make the test pass.
Run `npm test` — confirm ALL tests pass.
3. **REFACTOR**: Clean up without changing behavior.
Run tests after every change.
**Rules:**
- Never write implementation without a failing test first.
- Never write more implementation than the current test requires.
- One cycle at a time: RED → GREEN → REFACTOR before the next feature.
Hooks: Auto-Run Tests After Every Write
// .claude/settings.json
{
"hooks": {
"PostToolUse": [
{
"matcher": "Write|Edit",
"hooks": [
{
"type": "command",
"command": "npm test -- --reporter=verbose 2>&1 | tail -20"
}
]
}
]
}
}
Every time Claude writes a file, your test suite runs automatically. Claude immediately sees whether the test is red or green — no manual npm test needed.
A Real TDD Cycle
Prompt:
Build a `validatePassword` function. Follow TDD per CLAUDE.md:
write the test first, confirm RED, then implement.
Requirements: 8+ chars, 1 uppercase, 1 number.
Returns { valid: boolean, errors: string[] }
Claude writes the test first:
// src/lib/password.test.ts
describe('validatePassword', () => {
it('returns valid for a strong password', () => {
expect(validatePassword('SecurePass1')).toEqual({ valid: true, errors: [] })
})
it('fails when too short', () => {
const result = validatePassword('Sh0rt')
expect(result.valid).toBe(false)
expect(result.errors).toContain('Password must be at least 8 characters')
})
it('fails when no uppercase', () => {
const result = validatePassword('lowercase1')
expect(result.errors).toContain('Password must contain at least one uppercase letter')
})
it('fails when no number', () => {
const result = validatePassword('NoNumbers')
expect(result.errors).toContain('Password must contain at least one number')
})
})
Hook fires → test output in context: RED (module not found).
Claude writes minimum implementation:
// src/lib/password.ts
export function validatePassword(password: string) {
const errors: string[] = []
if (password.length < 8) errors.push('Password must be at least 8 characters')
if (!/[A-Z]/.test(password)) errors.push('Password must contain at least one uppercase letter')
if (!/[0-9]/.test(password)) errors.push('Password must contain at least one number')
return { valid: errors.length === 0, errors }
}
Hook fires → GREEN. All 4 tests pass.
Testing React Components
// src/components/PasswordInput.test.tsx
describe('PasswordInput', () => {
it('renders input with label', () => {
render(<PasswordInput label="Password" />)
expect(screen.getByLabelText('Password')).toBeInTheDocument()
})
it('toggles password visibility', async () => {
const user = userEvent.setup()
render(<PasswordInput label="Password" />)
const input = screen.getByLabelText('Password')
expect(input).toHaveAttribute('type', 'password')
await user.click(screen.getByRole('button', { name: /show password/i }))
expect(input).toHaveAttribute('type', 'text')
})
it('shows strength indicator', async () => {
const user = userEvent.setup()
render(<PasswordInput label="Password" />)
await user.type(screen.getByLabelText('Password'), 'StrongPass1')
expect(screen.getByText('Strong')).toBeInTheDocument()
})
})
Write test → confirm RED → implement → GREEN. Same cycle, same discipline.
Where TDD Proves Its Value: Regressions
Three weeks later, add a special character requirement:
Add: at least one special character. Write the failing test first.
Claude adds one test → it fails → Claude adds one line to the validator → all 5 tests pass. The existing tests are a regression net. If the new code breaks the uppercase check, you see it immediately.
Handling Claude Skipping TDD
If mid-session Claude writes implementation before tests:
Stop. You wrote implementation before the test.
Delete `src/lib/feature.ts`. Write the test first, confirm RED, then implement.
Explicit course corrections work reliably. If it keeps happening, /compact to clear context.
Pre-commit Hook
Block commits when tests fail:
# .husky/pre-commit
#!/bin/sh
npm test -- --run
Commit your .claude/settings.json alongside CLAUDE.md — every team member gets the same TDD enforcement automatically.
Full guide with Route Handler testing, component testing patterns, and multi-step examples: stacknotice.com/blog/claude-code-tdd-workflow-2026
Top comments (11)
ran into this exact failure mode with agent workflows - outputs that satisfy the metric while missing the goal. CLAUDE.md enforcement makes the constraint mechanical, which is the only thing that actually holds.
Exactly this.
The metric-gaming problem is worse with agents than with single-shot generation because agents have more surface area to find shortcuts. They can restructure the test, rename the function, or add a special case that passes the assertion without actually solving the underlying problem.
What made CLAUDE.md enforcement click for us is that it shifts the constraint from "does the output look right?" to "does the process follow the rules?" The hook runs before Claude can submit the result, so it can't satisfy the check by gaming the output—it has to actually follow the convention.
The same principle applies at the team level. When multiple agents operate on the same repository without shared constraints, each one optimizes for its own local metric, and the architecture gradually fragments.
I wrote about that specific failure mode here if you're interested:
stacknotice.com/blog/claude-code-f...
the "surface area for shortcuts" framing is exactly right — the more tool calls an agent can make, the more creative the metric-gaming gets. CLAUDE.md enforcement works because it constrains behavior, not just output.
Right—and tool call count is almost a proxy for how much trust you're extending.
With a single tool call, you're reviewing one output. With ten tool calls, you're reviewing an entire chain of actions where any individual step could have drifted or introduced an error.
The behavior-versus-output distinction is the key insight here. Prompt engineering constrains what Claude intends to do; hooks constrain what it actually does.
For systems operating with minimal supervision, that difference is often what separates something that "mostly works" from something that's genuinely reliable.
tool call count as trust proxy is the right model - we surface it in our review tooling now. ten steps in and you're not reviewing output anymore, you're reviewing strategy. most review workflows aren't designed for that.
Solid write-up on the TDD workflow with hooks. The idea of auto-running tests after every write is a good way to keep Claude accountable.
I've been working on a related problem — even with PostToolUse hooks, Claude sometimes gets "distracted" during brainstorming and starts writing code before the planning phase is done. I put together a small plugin for that (Brainstorm-Mode on GitHub under mehmetcanfarsak) that blocks write/edit/bash during brainstorming while still letting read/search work. It's pretty lightweight — just a settings.json config — and plugs into the same hook system you're describing here.
Hooks are a good way to make TDD less optional for coding agents. The agent should not be able to skip the red state, claim success without the green state, or refactor without rerunning checks. Process constraints beat reminders when the model is under pressure to look done.
"Process constraints beat reminders" — that's exactly the insight I was missing when I first experimented with this approach.
My initial attempt was simply adding a note to CLAUDE.md saying, "always write failing tests first." It helped, but only some of the time—maybe 40%.
The real shift happened when I moved the enforcement into a PostToolUse hook. At that point, Claude could no longer declare success unless the test runner actually confirmed that everything was passing. It stopped being a reminder and became a gate.
That's the key difference: the model no longer has the ability to appear finished when it isn't.
One interesting edge case I've noticed is that under very long contexts—typically late in a session—the model becomes increasingly aggressive about skipping steps or taking shortcuts. The hook still catches those failures, but it reinforces an important lesson: constraints become more valuable as context quality degrades, not less.
In a way, the longer the conversation runs, the less you want to rely on the model remembering instructions and the more you want to rely on the process itself enforcing the correct behavior.
Some comments may only be visible to logged-in visitors. Sign in to view all comments.