There is a pattern starting to show up everywhere now.
A team uses an AI coding agent. The agent moves fast. It writes code, rewrites tests, updat...
For further actions, you may consider blocking this person and/or reporting abuse
This resonates deeply. The backend API section hit home — I've seen AI agents update both the handler response shape AND the test assertions in the same pass, making the change invisible to review. The "outside the conversation" framing is key: the question isn't "does it work?" but "did the contract survive?" That's a different diagnostic altogether.
Yes — that exact pattern is one of the clearest examples of why this needs to be treated as a diagnostic problem, not just a review problem.
When the handler response shape and the test assertion move together in the same pass, the test can stop acting like independent evidence. It may still pass, but it is no longer proving that the original contract survived.
That’s the piece I keep coming back to: a green test is only meaningful if we know what claim it was supposed to protect.
“Did the contract survive?” is exactly the better question. Because once the contract, implementation, and validation all move at the same time, the change can look clean while the repo has quietly lost the thing the test was meant to preserve.
the 'claim' framing is the right mental model. one practical approach: write the test's invariant as a comment first, then implement. if the comment survives the AI edit, the test's purpose is documented. if not, the diff itself flags the drift. it's a lightweight version of design-by-contract that works without tooling changes.
Yes — I think that is a strong lightweight version of the “claim” idea.
Writing the invariant first gives the test a visible purpose before the implementation starts moving. Then if an AI edit weakens or deletes that invariant, the diff itself becomes evidence that the proof boundary moved.
Where I’m pushing this further with Scarab is that the claim does not only live inside one test comment.
In a real repo, claims are spread across tests, docs, config, generated artifacts, runtime behavior, fixtures, CI, and public API surfaces. A test may preserve one claim while a doc quietly drifts. A config file may start compensating for a runtime issue. A generated artifact may get treated like source truth. A test may still pass while the reason it exists has changed.
So yes: comment-first invariants are a great human-scale practice.
The deeper problem I’m working on is repo-scale claim alignment.
Not just: “Did the test still pass?”
But: “Is the thing this test claims to prove still owned by the right layer, still matched by the code, still reflected by the docs, and still preserved after the repair?”
That is the difference between a useful local habit and a diagnostic layer. One helps the developer notice drift in a single edit. The other tries to make those claim/proof relationships visible across the repo before the repair gets made.
Your “lightweight design-by-contract” framing is exactly the right instinct. Scarab is basically asking what that looks like when the contract is not just a function precondition, but the repo’s own truth surfaces.
The invariant-first approach resonates. One practical extension: write the invariant as a separate file that both the test AND the AI agent read — the agent uses it as its contract constraint, and the test validates the constraint held. If the agent edits the invariant, the diff tells you exactly what promise changed.
Yes — I like that direction a lot. Pulling the invariant out of the test gives the promise its own surface, which makes the diff much more meaningful.
The next question for me is: what gives that invariant authority?
If the agent can edit the invariant, the test, and the implementation in the same pass, then the promise can still move with the patch. That is better than hiding the claim inside the test body, but it still leaves the deeper ownership question open.
So I think the useful distinction is:
That’s the part I keep coming back to. The contract surface is a great start, but the repo still needs a way to tell when the contract itself has drifted.
That is the right framing — visibility vs authority. The invariant gains authority through process: protected path requiring PR review, agent proposes but cannot push directly, reviewer signs off. The promise moves from comment to contract. Two-person rule on a file.
Yes — visibility vs authority is exactly the split.
A protected invariant file plus CODEOWNERS / required review gets much closer to the right shape, because now the promise is not just text sitting near a test. It becomes a governed surface. The agent can propose a change, but it cannot quietly move the promise by itself.
The next layer I’d want is: what evidence does the reviewer need before signing off on that invariant change?
Because the two-person rule gives the contract social/process authority, but the review still needs an evidence standard. If the invariant changes, what source proves that the promise should move: public API behavior, docs, regression history, runtime evidence, schema contract, migration state, user-facing behavior?
That feels like the useful progression:
visible claim → protected claim → evidence-backed claim.
Once the invariant is protected and its movement requires evidence, it starts acting less like a comment and more like an actual repo contract.
That three-layer progression is exactly the right ladder — and I think the evidence hierarchy is where most teams will get stuck.
The question "what source proves the promise should move" is the right one, but I'd add a corollary: not all evidence buckets carry the same weight, and the hierarchy matters as much as the list.
Public API behavior and schema contracts are hard evidence — they're externally observable and breaking them has user-facing consequences. Regression history and runtime evidence are softer — they capture what happened, not what was promised. Docs and migration state are the softest — they're human-authored and often lag reality.
So the practical shape is: the reviewer's decision tree needs to distinguish between "the contract moved because the external world changed" (API deprecation, upstream schema change) vs "the contract moved because we rewrote the test to match new behavior" (which is the exact drift you're trying to catch).
The first is legitimate contract evolution; the second is a red flag that requires the reviewer to ask "was the old behavior wrong, or did we just make the claim easier to satisfy?"
That distinction — why the invariant moved, not just what evidence accompanied the move — feels like the real diagnostic signal.
The visible → protected → evidence-backed progression is a clean three-act structure. For the evidence layer, I think the minimum viable standard is a PR checklist that cannot be skipped:
If the PR description cannot answer all three, the reviewer rejects. That turns evidence from "nice to have" into "precondition to merge." The promise moves from a file to a process to a gate.