Software engineering became software architecture.

#agents #ai #architecture #automation

Highly opinionated, based on my personal experience dogfooding my own setup.
Not a prescription. I'm scratching the surface, with a lot left to learn.

Is coding solved? I think so. But what's left to us? If there's something every developer can still refer to — even those who don't follow the buzz around agentic engineering, then harness engineering, and now loop engineering — it would be architecture. You could say we've been promoted, or soon will be, to manager of a team of agents. But the one part of the job that still speaks to each of us, the part we should keep doing ourselves, is software architecture.

I'm not claiming the agent writes clean, typed, tested code out of the box — it doesn't, and I'll get to how you set that up. But that setup is mostly a one-time job plus a little maintenance, and once it's done, coding is basically solved. What's really left to us is the architectural decisions — the ones that separate good quality from good quality plus a design that lasts. The taste. The judgment about what should exist and where the boundaries go. That layer is still on our side... or so I thought.

The reality, at least for me, is a bit different. Even though I knew it wasn't a good idea, I'd already handed that layer to the model — inside teatree, the code factory I build and run for myself. It wasn't planned. It just happened — and I'm still watching it slip further away from me.

"Coding is solved" assumes one thing I'll set aside here: a complete spec going in. I treat the spec I hand the agent as bulletproof — making sure it actually is, and asking questions when it isn't, is my job, just not this post's subject. Now let me start from the beginning.

What a gate is, and what it does

A gate, the way I'll use the word here, is a deterministic check that returns a pass-or-fail verdict and blocks the commit on a fail. Same input, same verdict, every time — no matter which model ran or what context was loaded. Run one on every commit and the output can only move one way: toward whatever the gate calls "good." That's convergence. Tech debt piles up when nothing pushes back. A gate pushes back, and you can rely on it to.
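A minimal sketch of that definition, to make the shape concrete. The `Verdict` type, the `no_debug_print` rule and the `run_gates` runner are hypothetical names made up for illustration, not teatree's actual hooks. The point is only that a gate is a pure function of the committed content and that a fail turns into a non-zero exit code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str = ""


# A gate depends only on the content it is given: no clock, no network, no model.
Gate = Callable[[str], Verdict]


def no_debug_print(source: str) -> Verdict:
    """Hypothetical gate: fail if any line starts with a print( call."""
    for number, line in enumerate(source.splitlines(), start=1):
        if line.strip().startswith("print("):
            return Verdict(False, f"line {number}: stray print()")
    return Verdict(True)


def run_gates(source: str, gates: list[Gate]) -> int:
    """Return an exit code: 0 lets the commit through, 1 blocks it."""
    failures = [v for v in (gate(source) for gate in gates) if not v.passed]
    for failure in failures:
        print(f"BLOCKED: {failure.reason}")
    return 1 if failures else 0
```

Calling `run_gates` twice on the same source gives the same exit code both times, which is the whole property: the verdict does not depend on who or what produced the code.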

This matters more in the agent era, not less. The agent out-produces my reading — it writes more code, faster, than I would by hand, and more than I can carefully read. Convergence is what keeps that volume from becoming instant debt. Drop the gates and the same speed that ships features ships mess at the same rate. The gates aren't perfect, but they're necessary.

I work in Python, so my gates are ruff, ty and tach, run through prek, plus a stack of project-specific hooks. teatree — the personal code factory I'm dogfooding — is a Django project that turns a ticket URL into a merged PR (in theory). Its pre-commit pipeline runs more than sixty hooks in a numbered sequence. Most fire on every commit, a few only at push or in CI. The prose below leans on a handful of them:

| Gate | What it blocks |
| --- | --- |
| Safety guards | commits and pushes that shouldn't happen at all: a merged branch, a public leak, a secret |
| Lint and structure | rule, boundary and duplication violations: `ruff` (every rule on), `tach`, `import-linter`, a 500-line file cap |
| Types | type errors and warnings: `ty`, with warnings failing the commit |
| Gate guards | silent relaxations of the gates themselves |
| Evals (token-free) | behaviour regressions: skill-triggers, pinned-regressions |

The rest (formatting fixers, lockfile sync, doc generators, dependency audits, a conventional-commit check) do the unglamorous tidying you'd expect. Two of the gates above run stricter than people usually set them: ruff runs with every rule on, exceptions justified one by one, and ty fails the commit on a warning instead of letting it scroll past. Coverage, which the table leaves out, sits behind a hard floor, with a stricter per-module floor on the newer code, so one module can't rot quietly while the project number stays green. None of it is advisory. A failing gate is a failing commit.
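The two-level coverage floor is easy to picture as a check over a coverage report. This is a sketch under assumptions: the floor values, the module name `newer_module` and the function `coverage_failures` are all invented for illustration, and the percentages are passed in directly rather than read from a real report.

```python
# Hypothetical floors; the real numbers are not given in this post.
PROJECT_FLOOR = 80.0
MODULE_FLOORS = {"newer_module": 95.0}


def coverage_failures(project_pct: float, module_pct: dict[str, float]) -> list[str]:
    """Return one message per floor that is not met; empty means the gate passes."""
    failures = []
    if project_pct < PROJECT_FLOOR:
        failures.append(f"project: {project_pct:.1f}% < {PROJECT_FLOOR:.1f}%")
    for module, floor in MODULE_FLOORS.items():
        # A module missing from the report counts as 0% rather than a free pass.
        pct = module_pct.get(module, 0.0)
        if pct < floor:
            failures.append(f"{module}: {pct:.1f}% < {floor:.1f}%")
    return failures
```

With a project total of 88% and `newer_module` at 90%, the project floor is met but the module floor is not, so the commit still fails. That is the case a single project-wide number would let through.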

The ratchet only turns one way

A check you — or your agent — can quietly widen or switch off stops being a constraint. The cheap way out of a red check is to widen the ignore list or lower the floor. So one of my gate guards watches for exactly that: it fails any commit that touches a lint-ignore list, a coverage floor, an omit pattern, or reaches for --no-cov or --no-verify, and tells you to fix the underlying issue instead. The escape hatches are themselves gated, so convergence stays one-directional.
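One way a guard like that could work is to scan the staged diff for changed lines that mention a guarded setting. The sketch below is an assumption about the mechanism, not the real hook: the pattern list (`fail_under`, `omit`, `ignore`, `per-file-ignores`) is my guess at what such settings are called in a typical Python config, and `relaxations` is a hypothetical name.

```python
import re

# Hypothetical pattern list: flags and config keys that relax a gate.
GUARDED = re.compile(
    r"--no-cov|--no-verify|\bfail_under\b|\bomit\b|\bignore\b|\bper-file-ignores\b"
)


def relaxations(diff: str) -> list[str]:
    """Return every added or removed diff line that touches a guarded setting."""
    hits = []
    for line in diff.splitlines():
        if line.startswith(("+++ ", "--- ")):
            continue  # file headers, not content
        if line.startswith(("+", "-")) and GUARDED.search(line):
            hits.append(line)
    return hits
```

A non-empty result fails the commit. Note that this crude version flags any touch, including raising a floor, which matches the "touches" wording above and leaves the judgment call to a human.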

A second guard works on structure, not lint. A 500-line file cap and a cap on module-level functions don't force every oversized file under the limit at once. A file already over the line stays — but it can only shrink. A commit that grows it is blocked, and newly crossing a cap is blocked outright. Structure ratchets the same way coverage does.
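The ratchet rule fits in a few lines. A sketch, assuming the guard can see a file's line count before and after the commit; `ratchet_ok` is a hypothetical name, and `old_lines=None` stands for a file that did not exist before.

```python
from typing import Optional

CAP = 500  # the 500-line file cap from the text


def ratchet_ok(old_lines: Optional[int], new_lines: int, cap: int = CAP) -> bool:
    """True if the commit may proceed for this file."""
    if new_lines <= cap:
        return True  # under the cap: always fine
    if old_lines is None or old_lines <= cap:
        return False  # newly crossing the cap: blocked outright
    return new_lines <= old_lines  # already over: may hold or shrink, never grow
```

So a 700-line legacy file can go to 650 but not to 701, and a 480-line file cannot become 510. Nothing forces the big file under the cap today, but every commit that touches it can only move it in one direction.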

But the durable thing isn't the rule. It's where the rule lives. A rule kept in prose — a skill, a note, a thing I have to remember — isn't reliable, because prose gets read inconsistently or not at all. The same rule as a hook is as reliable as your unit tests. My blueprint, the one document that records the system's current shape, puts it plainly: durability comes from enforcement encoded in code and structure, not prose that decays. You can push prose to hold more strictly than that — but it's a harder problem, and a later post.

So a lot of tooling that predates the agentic era just became more necessary than ever. The kind I lean on more lately is the structural kind, what I've seen called architectural fitness functions: deterministic tests that each check a property of the whole module graph rather than a single line. tach enforces dependency direction (which module is allowed to import which: the module DAG, the layering). A chokepoint registry maps each dangerous primitive to the one module allowed to call it: every outbound network call, say, has to go through a single egress module, so a raw HTTP call made anywhere else fails the check. I reach for these because of volume. "Remember not to do X" loses when the agent writes more than I can read. A structural test that makes the violation impossible doesn't depend on anyone reading anything. It's also declarative, so one line catches the whole class, including code that doesn't exist yet.
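Here is roughly what a chokepoint registry could look like, as a sketch rather than the real thing. I'm assuming the dangerous primitives can be identified by import name, and the registry contents (`requests`, `urllib.request`, an owner module called `egress`) and the function names are invented for illustration. It uses only the standard-library `ast` module.

```python
import ast

# Hypothetical registry: dangerous import -> the one module allowed to use it.
CHOKEPOINTS = {"requests": "egress", "urllib.request": "egress"}


def _imported_names(node: ast.AST) -> list[str]:
    """Dotted names an import statement brings in; empty for other nodes."""
    if isinstance(node, ast.Import):
        return [alias.name for alias in node.names]
    if isinstance(node, ast.ImportFrom) and node.module:
        return [node.module] + [f"{node.module}.{a.name}" for a in node.names]
    return []


def violations(module_name: str, source: str) -> list[str]:
    """List every import of a registered primitive outside its owner module."""
    found = []
    for node in ast.walk(ast.parse(source)):
        names = _imported_names(node)
        for primitive, owner in CHOKEPOINTS.items():
            hit = any(n == primitive or n.startswith(primitive + ".") for n in names)
            if hit and module_name != owner:
                found.append(
                    f"{module_name}:{node.lineno} uses {primitive}; only {owner} may"
                )
    return found
```

Each registry entry is one line, and it covers every file the check is run over, including files written after the entry was added. That is the declarative property: the rule names a class of violation, not a list of known offenders.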

The part nothing gates

Every gate above checks the code. None of them checks whether the design is right.

ruff will tell you a function is too complex. It won't tell you the function shouldn't exist, or belongs in a different module, or that the boundary it sits behind is in the wrong place. ty catches a type error and waves through a wrong abstraction that happens to type-check. Coverage tells you the code is exercised, not that it's the right code to exercise. You can have all four legs from Part 0 in place — the model, the harness, the deterministic constraints (the gates this post is about), and the skills — pass every gate, and still ship a clean, fully-typed, fully-covered implementation of the wrong architecture.

Architecture has no deterministic gate. No fixed rule returns the same pass or fail every time on whether a design is right.

The closest thing in my setup is a design companion — a checklist that fires before any code touches the core surfaces (the CLI, the core models, the scanners, the overlay base class, a backend protocol) and makes the agent reason through nine checks first:

Layout — blueprint alignment, component boundaries, dependency direction
Contracts — FSM phase boundaries (which moves from one workflow phase to the next are legal), extension-point contracts, behavior preservation
Under change — test surface, resilience invariants, identity and key normalization

Of those nine, exactly one is backed by a real gate: dependency direction, which tach enforces. The other eight produce design questions and nothing more — no verdict behind them. Someone still has to reason about whether the answer is right, and no hook I can write resolves that. It isn't a gap I haven't gotten to yet. With fixed rules, it's the shape of the problem.
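Written down as data, the imbalance is hard to miss. The nine check names and the tach backing come from the list above; the `Check` record and its `gate` field are a hypothetical schema I'm using only to make the count visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Check:
    group: str
    name: str
    gate: Optional[str] = None  # tool that returns a verdict, if there is one


CHECKS = [
    Check("Layout", "blueprint alignment"),
    Check("Layout", "component boundaries"),
    Check("Layout", "dependency direction", gate="tach"),
    Check("Contracts", "FSM phase boundaries"),
    Check("Contracts", "extension-point contracts"),
    Check("Contracts", "behavior preservation"),
    Check("Under change", "test surface"),
    Check("Under change", "resilience invariants"),
    Check("Under change", "identity and key normalization"),
]

# Checks with a deterministic verdict behind them, and checks without one.
gated = [c for c in CHECKS if c.gate]
open_questions = [c for c in CHECKS if not c.gate]
```

Everything in `open_questions` produces a design question and nothing more: there is no tool to put in the `gate` field.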

But fixed rules aren't the only kind of check. There's another kind — non-deterministic, grading behaviour rather than asserting a line — and that's where Part 2 goes. The tidy conclusion, "no gate, so it stays human," leans on a binary that doesn't hold, so I'd rather leave that door open than slam it here.

I never decided to hand it over

Here's the part I got wrong about my own setup.

I assumed those eight verdict-less checks would route to me. I'm the human, architecture is the human's job, so the companion fires and I sign off. They don't route to me, and the reason is mundane. The companion fires on every change to a core surface. Signing off on each pass means the agent stops and waits for me every few minutes. That doesn't scale — and I never sat down and decided it didn't. I just stopped doing it, the way you stop reading a dialog box you've seen a hundred times.

So the call defaults to the agent. It makes the architectural decision, writes the code, and only pulls me in when it flags its own uncertainty. Which means the agent is already making most of the architectural decisions in this system — not because I reasoned my way to delegating them, but because the alternative was an interruption I couldn't sustain.

Found by use, not by spec

If you can't gate whether the architecture is right, you find out the only other way: you run the thing and watch it break.

My own README is blunt about it — not a stable product, expected to break, expected to change shape, dogfooded daily on real work. A design flaw in a system you don't use is a hypothesis. A design flaw in a system you depend on every day is a stalled ticket, and a stalled ticket is impossible to ignore. Battle-testing isn't a phase after the design. It is the design process — the only honest signal I have about architecture.

teatree's current shape wasn't specced up front. It grew, in this order: it started as one monolithic skill, ac-multitask — take a ticket, run it end to end. I split that into about eight lifecycle skills, one per phase, and those became the t3-* skill system. A unified t3 CLI with a finite-state machine pulled them together. Then it became a Django extension — models, migrations, real persistence. Then the inversion: the whole thing became the Django project itself, with the overlays demoted to lightweight packages on top. Then I packaged it as a Claude plugin. Each shape change came from hitting a wall with the previous one, not from a plan that anticipated the wall.

The blueprint is where the current shape is written down — but after use confirmed it, as a record of what survived, not a spec dictated before the first line.

So what's the job now?

So the agent has the volume of architectural decisions now. It doesn't yet have the judgment to catch its own bad ones.

It still makes beginner mistakes — a boundary in the wrong place, a decision heading the wrong direction — the kind any experienced developer has made once and learned to spot on sight. I catch some of them by reading the agent's reasoning as it goes. Part of that is plain developer experience. The other part is knowing this particular model — where it oversells a fix, where it quietly papers over something it couldn't do, the kind of task it's already failed three times.

So if I've handed off the code and most of the first-pass calls, what's left? The part around them. Building the gates so the convergent work converges without me. Shaping what the agent reasons against — the blueprint, the boundaries, the chokepoints — so its default lands closer to right. Reading the reasoning on the decisions that matter, and catching the ones it gets wrong. Deciding what gets built now and what waits — the product calls, made even when there's no spec written down.

That last part is product work, not engineering: I'm the product manager here as much as the developer. I'm still at the keyboard all day, just not writing code — I read what the agent produces and write back what to do next. (Until that turns into talking out loud, which it will.)

How long that holds, I don't know. The model keeps narrowing the gap, and the day it catches its own architectural mistakes, the job moves again — the way it just moved from writing code to shaping the thing that writes it. I'd rather watch that line move than pretend it's holding still. That's most of why I'm writing this down.