Hi, I'm Ryan, CTO at airCloset.
Disclaimer: "cortex" in this article is the internal codename for an AI platform built in-house at airCloset. It ...
For further actions, you may consider blocking this person and/or reporting abuse
The write-time vs read-time inference distinction maps cleanly onto something I keep running into on the content side of AI too.
When AI generates a draft, that's write-time inference — it happens once, a human reviews it, corrections get made, it gets frozen into reviewed content. That process is fine and powerful. But then the challenge is the distribution step: getting that content to 15 platforms, in multiple languages, with the right metadata per-platform. If you let AI infer those steps on every publish ("guess what format this platform expects"), you get exactly the problem you describe — unverified inference running on every query.
The determinism fix applies there too: structure the platform requirements as explicit facts (not inferred from the platform's vibes), let AI operate over that structured knowledge at publish-time. The harness confines hallucination to where it's acceptable.
"Mastering AI is not about giving it freedom — it's about confining its output to a predictable range" — this is going on my wall. The word 'confining' is doing more work than 'prompting' or 'fine-tuning' ever will.
The sunk-cost point in the trial section is also underrated. Code-graph was 2 months in and working. Throwing it out anyway because the architectural truth pointed the other direction is the kind of decision that separates harness architects from prompt engineers.
The content-distribution mapping caught my eye—it's the exact same failure mode playing out in a completely different domain.
The "harness architects vs prompt engineers" distinction is sharper than what I had in the post: the real line isn't about optimizing a prompt; it's about changing what we allow the model to handle in the first place.
The decision to scrap the code-graph wasn't about tweaking the prompts to be smarter. It was facing the reality that we were asking the model to infer something that should have been managed by a deterministic architecture. Overcoming that sunk-cost fallacy is painful, but it's what forces you to look at the system, not the prompt.
Glad "confining" landed. That word does the heavy lifting precisely because prompting and fine-tuning only try to influence the model's behavior, whereas a harness actually shapes the boundaries around it.
"Shapes the boundaries around it" vs "influences the behavior inside it" — that's a genuinely useful distinction for explaining to non-engineers why prompt engineering has a ceiling. The harness argument becomes: some failure modes need architectural answers, not better wording.
The content-distribution parallel actually runs deeper than I initially framed it — every platform API has its own schema, rate limits, metadata requirements. If you let the system infer those at publish-time, you get exactly the silent variance you described: it works until it doesn't, and the failure is invisible until the content is already gone to the wrong place. The harness answer is the same: structure what can be structured, let inference operate only inside that.
Thanks for the series — it's one of the cleaner end-to-end AI architecture write-ups I've read.
"Some failure modes need architectural answers, not better wording" is the line. And "structure what can be structured, let inference operate only inside that" says the principle in fewer words than I managed.
Honestly, that closer means a lot. Six posts was a long stretch, and responses like this are what make it worth writing. Thanks for the read, and for the framing.
That really means a lot — six posts is a serious commitment, and the consistency of thinking across all of them showed. "Structure what can be structured, let inference operate only inside that" is exactly the kind of principle that travels well. It'll stick.
Looking forward to whatever you write next.
This is a strong articulation of the boundary I keep seeing in production AI systems: do not ask the model to rediscover facts the system can make explicit.
The write-time vs read-time inference distinction is especially useful. Let AI help generate annotations, graph edges, review dimensions, or policy descriptions, but freeze and review those artifacts before they become runtime facts. Then the live agent is operating over inspected structure instead of re-inferring the world on every request.
That also makes failures easier to test: graph wrong, retrieval path wrong, review dimension wrong, or generated conclusion wrong are very different fixes.
Right — and that last point (failures separable by class) is one of the most underrated benefits of the write-time/read-time split. Once inference is frozen at write-time, "the graph is wrong" and "the retrieval missed" and "the live agent reasoned badly" stop being one fuzzy bucket called "the AI got it wrong" and become four distinct categories with four distinct fixes. That distinction is what makes the system testable at all. Without the split, every failure is some opaque mixture of "context was off" and "model was off" with no clean way to bisect.
Exactly. Once the failure classes are separable, the system can stop treating every miss as a mysterious model-quality problem.
That separation also changes how I’d build the regression set. I’d want fixtures for graph construction errors, retrieval/path errors, review-dimension errors, schema/format errors, and final-generation errors. Then a model or prompt change is not just “better/worse”; it shows which layer improved or regressed.
That is the part I like about the write-time/read-time split: it makes the harness itself testable.
Right — and "which layer improved or regressed" is the metric that suddenly becomes meaningful once the layers are separated. Without that, "the new model is X% better" is a black-box claim that nobody can reproduce or trust. With layer-level fixtures, a model bump that improves final generation by 10% but degrades retrieval by 3% is something you can actually catch and reason about. The harness becomes testable because each layer has a tractable contract; the whole "AI quality" problem stops being one monolithic thing.
"Build rails first" hit different. Ten years in test automation and I never saw it as laying runway for AI until now🤣
Glad it landed 🤣 Ten years of test automation is exactly the kind of foundation work AI ends up depending on. The deep reason: a thick set of existing tests is what lets AI know deterministically what counts as correct behavior. Without it, "did I break something?" is a question AI can only answer by inference — with the tests, it's a question with a deterministic answer. The rails were always there; the framing changes when AI shows up to run on them.
"Tests as determinism boundaries for AI" — that's the framing I was missing. Gonna steal that for the next post. 🤝
Steal away 🤝
big repos are the real test. stuffing context just creates noise, the model stops actually understanding and starts pattern matching. treating retrieval as a system design problem is the right call.
Right — and that's exactly why pushing the inference to write-time matters. If the model is asked to reason about everything live, it's the noise that breaks it down into pattern matching. But if you let AI generate annotations, structure, dependency edges, etc. at write-time, review them, and freeze them as deterministic context, the live agent never has to handle the noisy "raw repo" — it operates over the curated artifact instead. Retrieval as a design problem solves the noise problem upstream, before any reasoning happens.