Scarab Systems

Posted on Jun 12

AI Code Quality Is Not Repo Truth

#ai #devops #programming #discuss

There is a pattern starting to show up everywhere now.

A team uses an AI coding agent. The agent moves fast. It writes code, rewrites tests, updates docs, creates abstractions, touches config, patches runtime behavior, and explains itself confidently.

Then the team starts noticing the uncomfortable part.

The code is not always “wrong” in the obvious way. It may compile. It may pass tests. It may satisfy a review checklist. It may even look cleaner than what was there before.

But something in the system has shifted.

A test now proves the patch instead of the behavior.

A README now describes an API the repo does not actually expose.

A generated artifact is treated like source truth.

A config file silently becomes the repair surface for a runtime problem.

A fallback path preserves uptime but loses correctness.

A frontend change compensates for a backend contract that should never have moved.

The code looks finished, but the repo no longer tells one coherent story.

That is drift.

And I think the industry is at risk of misunderstanding what kind of problem drift actually is.

The wrong reflex: solve AI drift with more AI

A lot of the current response to AI-assisted code failure is still happening inside the same mental frame that created the problem.

The agent generates code.

Then another AI reviews the code.

Then another AI summarizes the review.

Then another AI writes tests.

Then another AI explains whether the tests passed.

This can be useful. It can catch real defects. It can improve workflows.

But it does not solve the deeper problem.

If the source of truth is still conversational, probabilistic, and context-fragile, then you have not created a stable diagnostic layer. You have created another layer of interpretation.

That may be better than nothing, but it is not the same as proof.

The industry keeps trying to make the AI agent more self-aware, more careful, more reflective, more heavily prompted, more supervised by other AI systems.

But drift is not primarily a personality flaw in the agent.

Drift is what happens when a system loses track of which boundary owns which truth.

That means the missing layer cannot just be another AI opinion.

It has to be outside the agent loop.

Step outside the conversation

One of the easiest ways to see this is to stop imagining yourself as the developer talking to your AI coding agent.

Instead, imagine you are standing outside the workflow, watching another developer have that conversation.

The developer tells the agent to fix a bug.

The agent changes production code.

Then it changes the tests.

Then it edits the docs.

Then it updates configuration.

Then it says the work is complete.

From inside the conversation, this can feel productive.

From outside the conversation, the obvious process question appears:

Who is checking whether each layer still owns the thing it claims to own?

If a technician from another department made those same changes, a serious engineering team would not simply ask, “Did they finish?”

They would ask:

What did they touch?

Which system boundary did they cross?

Which contract changed?

Which evidence proves the behavior still belongs there?

Which test proves the original claim rather than the new patch?

Which source is authoritative now?

AI-assisted development needs that same operational distance.

Not because AI is bad.

Because AI crosses boundaries faster than the team can notice.

AI code quality is not the same as repo truth

There are useful tools emerging around AI code quality: guards, review skills, semantic linters, test checkers, docs checkers, prompt rules, agent policies, and CI add-ons.

Some of them catch real problems.

They can flag hallucinated APIs.

They can notice mock abuse.

They can detect documentation that references missing functions.

They can warn about over-abstraction, broad error swallowing, unsafe patterns, or framework-specific mistakes.

That is valuable.

But catching known output patterns is not the same as proving that the repo’s truth model was preserved.

A guard might say:

“This test mocks too much.”

A diagnostic system has to ask:

“Did the test layer stop validating the behavior that the production system depends on?”

A docs checker might say:

“This function does not exist.”

A diagnostic system has to ask:

“Is the documentation wrong, is the code missing a public API, is generated reference output stale, or did the ownership of this claim move?”

A code-quality tool might say:

“This abstraction is premature.”

A diagnostic system has to ask:

“Did this change move responsibility out of the surface that owns it?”

Those are related questions, but they are not the same question.

The first is output review.

The second is boundary diagnosis.

Scarab’s realignment

Scarab Diagnostic Suite was built around a simple premise:

AI should not be the source of truth for AI-assisted code work.

The repo has to be the source of truth.

The diagnostic layer has to be deterministic, mechanical, evidence-first, and independent of the coding agent’s conversational state.

That is the realignment.

Instead of asking an AI agent to remember every contract, every invariant, every architectural rule, every generated artifact boundary, every test obligation, and every repo-specific convention, Scarab works from the outside.

It inspects evidence.

It compares claims.

It surfaces contradictions.

It identifies boundary failures.

It gives the agent the right context only after the system has established what needs to be preserved.

That matters because an AI coding agent can only work with the context it has. If the context is wrong, stale, incomplete, or conversationally compressed, the agent can produce a very polished wrong answer.

Scarab is not trying to make the agent “smarter” in the abstract.

It is trying to stop the agent from operating without a stable map of repo truth.

The same failure shows up in different lanes

This is why the problem is bigger than one kind of codebase.

AI drift does not only affect open-source projects.

It affects frontend teams, backend services, data systems, DevOps workflows, scientific software, internal tools, agencies, startups, and companies that never thought of themselves as software companies until AI started writing code for them.

The pressure point changes by lane.

The underlying failure is the same.

A boundary stopped preserving the truth another part of the system depended on.

Frontend systems: UI behavior becomes the patch surface

In frontend work, drift often appears when the visible interface starts compensating for a broken contract somewhere else.

A component gets a defensive fallback.

A route adds extra state handling.

A client-side workaround hides an API inconsistency.

A UI test is updated to match the new rendering path.

The screen looks fixed, but the ownership question may be wrong.

Was the frontend supposed to absorb that behavior?

Or did the backend contract, router boundary, state model, accessibility surface, or generated client drift first?

A normal review may ask whether the UI works.

A boundary diagnostic asks whether the UI became responsible for something it does not own.

That distinction matters because once the wrong layer absorbs the repair, future changes inherit the lie.

Backend and API systems: contracts drift quietly

Backend drift often shows up as contract confusion.

A handler returns a slightly different shape.

A serializer changes behavior.

A migration updates data in a way the API layer does not fully encode.

A client library keeps working because it is permissive.

The tests pass because they were updated near the patch.

But the contract may no longer be stable.

In ordinary review, the question is often: “Does this endpoint work?”

The deeper question is:

“Which layer owns this contract, and what evidence proves the contract still matches implementation, documentation, tests, and clients?”

That is where drift hides.

Not always in a crash.

Often in a silent mismatch between what the system claims and what the system now does.
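
As a minimal sketch of what one piece of that evidence could look like, consider comparing a handler's actual response against a declared contract. Everything here is hypothetical (the `USER_CONTRACT` fields, the `contract_violations` helper, the sample payload) and is not taken from any tool the article describes. It only illustrates that "what the system claims" and "what the system now does" can be compared mechanically.

```python
# Hypothetical contract: field name -> expected Python type.
USER_CONTRACT = {"id": int, "email": str, "created_at": str}


def contract_violations(payload: dict, contract: dict) -> list[str]:
    """Return readable mismatches between a payload and its declared contract."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    # Fields the handler returns that the contract never declared.
    for field in sorted(payload.keys() - contract.keys()):
        problems.append(f"undeclared field: {field}")
    return problems


# A permissive client might accept this response; the contract check does not.
drifted = {"id": "42", "email": "a@example.com", "createdAt": "2024-01-01"}
for problem in contract_violations(drifted, USER_CONTRACT):
    print(problem)
```

A check like this reports the mismatch but cannot say whether the handler or the contract should change. That ownership question still belongs to the maintainer.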

Data systems: freshness, schema, and provenance are not vibes

Data systems are especially vulnerable because the code can be technically correct while the data truth has already moved.

A schema changes.

A cache survives too long.

A migration succeeds but changes meaning.

A model reads from a snapshot that is no longer valid.

A pipeline output is treated as fresh because the job completed.

For data-heavy teams, the problem is not only whether the job ran.

It is whether the job preserved the assumptions the downstream system depends on.

What schema did this result assume?

What version of the source did it read?

Which migration state was active?

Which artifact is authoritative?

Which result is generated, and which result is source truth?

A deterministic diagnostic layer matters here because AI can explain a data pipeline beautifully and still miss the fact that the pipeline is reasoning from stale or misowned evidence.
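
One way to make those questions checkable, sketched under assumptions: each artifact carries a small provenance record, and a consumer compares it with the current system state before trusting the result. The `Provenance` fields and the sample values below are illustrative inventions, not a description of how any particular pipeline or product records this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical record of the assumptions an artifact was built under."""
    schema_version: str
    source_revision: str
    migration_state: str


def stale_reasons(recorded: Provenance, current: Provenance) -> list[str]:
    """List every assumption that no longer matches the current system state."""
    reasons = []
    for name in ("schema_version", "source_revision", "migration_state"):
        old, new = getattr(recorded, name), getattr(current, name)
        if old != new:
            reasons.append(f"{name}: artifact assumed {old!r}, system is at {new!r}")
    return reasons


# The job completed, but the schema and migration state have since moved.
recorded = Provenance("v3", "a1b2c3", "0041_add_region")
current = Provenance("v4", "a1b2c3", "0042_split_region")
for reason in stale_reasons(recorded, current):
    print(reason)
```

An empty list means the artifact's recorded assumptions still hold. It does not prove the data is correct, only that it was not built under assumptions that have already moved.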

DevOps and CI/CD: availability is not correctness

Automation failures often look like infrastructure problems.

A deployment succeeds but pulls the wrong image.

A cache hit looks valid, but the cached artifact was built under a different assumption.

A fallback keeps the system alive but bypasses the verification path.

A retry prevents downtime but repeats a side effect.

A CI job passes because the failing surface was never exercised.

The industry has spent years building tools around availability, observability, retries, alerts, and recovery.

Those tools matter.

But AI-assisted development adds a different question:

Did the workflow preserve the proof that the result is still correct?

A green pipeline does not automatically mean the repo stayed truthful.

It means the pipeline’s checks passed.

Those are not always the same thing.
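
The retry case above can be made concrete. The sketch below uses an idempotency key, a common general technique rather than anything the article prescribes, so that a retried request does not repeat its side effect. `PaymentLedger`, `with_retry`, and the order key are all hypothetical names.

```python
class PaymentLedger:
    """Hypothetical side-effecting service that records charges."""

    def __init__(self):
        self.charges = []
        self._receipts_by_key = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the original receipt instead of charging again.
        if idempotency_key in self._receipts_by_key:
            return self._receipts_by_key[idempotency_key]
        receipt = f"receipt-{len(self.charges) + 1}"
        self.charges.append(amount_cents)
        self._receipts_by_key[idempotency_key] = receipt
        return receipt


def with_retry(action, attempts: int = 3):
    """Retry on connection errors; assumes attempts >= 1."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


ledger = PaymentLedger()
calls = {"n": 0}


def flaky_charge():
    calls["n"] += 1
    receipt = ledger.charge("order-1001", 500)
    if calls["n"] == 1:
        # The charge was recorded, but the response was lost on the way back.
        raise ConnectionError("response lost after the charge was recorded")
    return receipt


receipt = with_retry(flaky_charge)
print(receipt, ledger.charges)
```

Without the key check, the same retry would keep the workflow alive and record the charge twice: available, but no longer correct.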

Tests: the most dangerous drift can look like validation

Tests are one of the first places AI coding agents can create false confidence.

An agent writes tests.

The tests pass.

The patch looks validated.

But what did the tests prove?

Did they prove the original behavior?

Did they prove the new implementation?

Did they mock away the system boundary?

Did they assert on internals?

Did they update the expectation to match the patch?

Did they delete the failing test rather than preserve it as a regression check?

This is why test quality is not just a code smell issue.

It is a truth issue.

A test is not valuable because it exists.

A test is valuable because it preserves a claim the system depends on.

When that claim moves silently, the repo can look safer while becoming less trustworthy.
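
A small sketch of the difference, with hypothetical function names. One check derives its expectation from the code under test, so it follows whatever the patch does. The other states the claim independently, so it fails when the behavior drifts.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Original implementation (hypothetical): percentage off the price."""
    return price_cents - (price_cents * percent) // 100


def drifted_discount(price_cents: int, percent: int) -> int:
    """A patched version that silently changed meaning: flat amount off."""
    return price_cents - percent


def follows_patch(fn) -> bool:
    # The expectation is recomputed from the implementation, so it proves nothing.
    expected = fn(1000, 15)
    return fn(1000, 15) == expected


def preserves_claim(fn) -> bool:
    # Claim: a 15% discount on 1000 cents costs 850 cents.
    return fn(1000, 15) == 850


for fn in (apply_discount, drifted_discount):
    print(fn.__name__, follows_patch(fn), preserves_claim(fn))
```

Both implementations pass the patch-following check. Only the original passes the check that carries its own claim.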

Documentation: public claims are part of the system

Documentation drift is often treated as cosmetic.

It is not.

Docs are public claims.

A README, API reference, changelog, migration guide, or docstring tells users and future agents what the system is supposed to be.

When documentation references functions that do not exist, examples that cannot run, flags that no longer work, or behaviors that changed without a claim boundary, the repo has lost one of its truth surfaces.

This matters even more with AI agents, because agents read documentation too.

Bad docs do not only mislead humans.

They feed future automation.

That means documentation drift can become agent drift.

And agent drift can write more documentation drift.

That loop is exactly why an independent diagnostic layer matters.
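
The simplest mechanical version of a docs check can be sketched in a few lines. This one assumes documentation refers to functions in the form `module.function()` inside backticks, and it uses the standard-library `json` module as a stand-in for a repo's own package. The `missing_references` helper is a hypothetical name.

```python
import importlib
import re

# Matches backticked references such as `json.loads()`.
CALL_REF = re.compile(r"`([A-Za-z_][\w.]*)\.([A-Za-z_]\w*)\(\)`")


def missing_references(doc_text: str) -> list[str]:
    """Return documented callables that cannot be found on the named module."""
    missing = []
    for module_name, attr in CALL_REF.findall(doc_text):
        try:
            # Note: importing executes module code; only do this for trusted modules.
            module = importlib.import_module(module_name)
        except ImportError:
            missing.append(f"{module_name}.{attr}")
            continue
        if not callable(getattr(module, attr, None)):
            missing.append(f"{module_name}.{attr}")
    return missing


readme = "Use `json.loads()` to parse input and `json.parse()` to validate it."
print(missing_references(readme))  # ['json.parse']
```

This covers only the output-review half. It can say that `json.parse` does not exist, but not whether the docs are wrong, the code is missing a public API, or the claim's ownership moved.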

Scientific and applied technical systems: correctness is the product

Not every company following AI-assisted development is a devtools company.

Biotech companies, research labs, analytics teams, agencies, logistics firms, ecommerce platforms, healthcare-adjacent software teams, and internal automation groups all have code.

Many of them are now using AI to write or maintain that code.

For those teams, drift is not an abstract developer concern.

It can mean measurement pipelines become less reproducible.

Reports no longer match source data.

Internal tools encode the wrong assumption.

A generated workflow silently changes how evidence is processed.

The company may not care about “AI code quality” as a category.

But they absolutely care when their software stops preserving the truth their business depends on.

That is the market-level problem.

The repair begins before the patch

The industry often talks about AI-assisted development as if the main question is how to generate better patches.

But the harder question comes before the patch.

What is the repair surface?

Which boundary owns the failure?

What evidence proves the failure belongs there?

What should not be touched?

What tests are allowed to change?

What generated artifacts are outputs, not authority?

What documentation claims must remain aligned?

What context does the agent need before it is allowed to act?

A patch without that map can make the system worse while looking helpful.

That is why Scarab is not a patch bot.

It is not a linter.

It is not a code review personality.

It is not another AI agent watching the first AI agent.

Scarab Diagnostic Suite is a proprietary diagnostic product built around evidence-first repo analysis.

SDS finds evidence.

People make claims.

Maintainers decide.

That boundary is important.

The diagnostic layer should not pretend to be the maintainer.

It should make the system legible enough for the maintainer, developer, or AI coding agent to act without guessing.

Field Lab: public diagnostic case records

Scarab Systems has opened a public Field Lab for selected diagnostic field tests.

The Field Lab publishes public case records from real open-source issues: the issue being examined, the suspected boundary, the evidence gathered, the validation performed, and the current status of the diagnostic claim.

Some cases may end with a local repair candidate.

Some may become upstream pull requests.

Some may remain diagnostic records only.

That status is part of the record.

Scarab Diagnostic Suite is proprietary, but the larger conversation is shared. AI-assisted development is changing how all of us work with code, and the goal of the Field Lab is to make boundary failures, drift patterns, and diagnostic reasoning easier to see from more than one angle.

We welcome Field Lab candidate suggestions from developers, maintainers, companies, researchers, and anyone working close enough to code to notice when something has stopped holding together.

If you know of a public open-source issue that looks like cross-layer drift, unclear ownership, AI-assisted codebase confusion, or a boundary failure, you can suggest it as a Field Lab candidate.

Useful suggestions include the public issue link, the suspected boundary, reproduction notes if available, and why the issue may be diagnostically interesting.

scarab-systems / scarab-field-lab

Public case library for Scarab Diagnostic Suite field tests, recording public issues, diagnostic findings, validation summaries, and upstream PR status without publishing private work materials.

Scarab Field Lab

Scarab Field Lab is the public case library for selected Scarab Diagnostic Suite field tests.

Scarab Diagnostic Suite is proprietary and is not currently distributed as a public installable tool. Public materials describe selected diagnostic field tests and software-drift concepts only.

Scarab does not automate repairs or replace maintainers. It identifies evidence-backed diagnostic findings: boundary failures, repo-truth drift, verification gaps, and repair lanes.

Any repair is performed by maintainers, developers, or authorized agents outside the public Field Lab.

This repository publishes public case records only: public issue and pull request links, specific diagnostic findings, validation notes, claim boundaries, and, when applicable, the public status of a human-reviewed patch or upstream pull request. It does not contain SDS source code, internal diagnostic rules, product internals, private run artifacts, or implementation details.

Scarab Diagnostic Suite is a mechanical diagnostic layer. It inspects repository evidence, compares expected and observed behavior…


The conceptual shift

The AI coding conversation has been centered on the agent.

Better prompts.

Better models.

Better context windows.

Better reviews.

Better tool calls.

Better planning.

Better self-correction.

All of that may help.

But it does not remove the kink in the road.

The kink is that we keep asking AI to be both the worker and the source of truth for the work.

That is the part that has to change.

The repo needs an independent diagnostic layer.

The agent needs bounded context.

The repair needs an owned surface.

The system needs evidence before action.

Once you see it that way, the problem becomes much clearer.

AI did not make software boundaries matter less.

It made them matter more.

Because now the thing crossing those boundaries is faster, more confident, and less naturally constrained by the tacit knowledge a human team used to carry.

The next phase of AI-assisted development will not be won by teams that generate the most code.

It will be won by teams that can still prove what their codebase means after AI has touched it.

That is the real shift.

Not more AI inside the loop.

A stable diagnostic layer outside it.

Top comments (17)

FastAnchor_io • Jun 12

This resonates deeply. The backend API section hit home — I've seen AI agents update both the handler response shape AND the test assertions in the same pass, making the change invisible to review. The "outside the conversation" framing is key: the question isn't "does it work?" but "did the contract survive?" That's a different diagnostic altogether.

Scarab Systems • Jun 12

Yes — that exact pattern is one of the clearest examples of why this needs to be treated as a diagnostic problem, not just a review problem.

When the handler response shape and the test assertion move together in the same pass, the test can stop acting like independent evidence. It may still pass, but it is no longer proving that the original contract survived.

That’s the piece I keep coming back to: a green test is only meaningful if we know what claim it was supposed to protect.

“Did the contract survive?” is exactly the better question. Because once the contract, implementation, and validation all move at the same time, the change can look clean while the repo has quietly lost the thing the test was meant to preserve.
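
The co-movement described in this exchange can at least be flagged mechanically. The sketch below assumes a naming convention in which `tests/test_<name>.py` covers `<name>.py`; that convention and the `co_moved_sources` helper are illustrative assumptions, not a description of how Scarab works.

```python
from pathlib import PurePosixPath


def co_moved_sources(changed_paths: list[str]) -> list[str]:
    """Return source files whose matching test file changed in the same diff."""
    paths = [PurePosixPath(p) for p in changed_paths]
    changed_test_targets = {
        p.stem.removeprefix("test_")
        for p in paths
        if p.suffix == ".py" and p.name.startswith("test_")
    }
    return sorted(
        str(p)
        for p in paths
        if p.suffix == ".py"
        and not p.name.startswith("test_")
        and p.stem in changed_test_targets
    )


diff = ["api/handlers.py", "tests/test_handlers.py", "api/models.py", "README.md"]
print(co_moved_sources(diff))  # ['api/handlers.py']
```

A flag like this is not a verdict. Implementation and tests often change together for good reasons, and the flag only marks where a reviewer should ask whether the test is still independent evidence.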

FastAnchor_io • Jun 13

the 'claim' framing is the right mental model. one practical approach: write the test's invariant as a comment first, then implement. if the comment survives the AI edit, the test's purpose is documented. if not, the diff itself flags the drift. it's a lightweight version of design-by-contract that works without tooling changes.

Scarab Systems • Jun 13

Yes — I think that is a strong lightweight version of the “claim” idea.

Writing the invariant first gives the test a visible purpose before the implementation starts moving. Then if an AI edit weakens or deletes that invariant, the diff itself becomes evidence that the proof boundary moved.

Where I’m pushing this further with Scarab is that the claim does not only live inside one test comment.

In a real repo, claims are spread across tests, docs, config, generated artifacts, runtime behavior, fixtures, CI, and public API surfaces. A test may preserve one claim while a doc quietly drifts. A config file may start compensating for a runtime issue. A generated artifact may get treated like source truth. A test may still pass while the reason it exists has changed.

So yes: comment-first invariants are a great human-scale practice.

The deeper problem I’m working on is repo-scale claim alignment.

Not just: “Did the test still pass?”

But: “Is the thing this test claims to prove still owned by the right layer, still matched by the code, still reflected by the docs, and still preserved after the repair?”

That is the difference between a useful local habit and a diagnostic layer. One helps the developer notice drift in a single edit. The other tries to make those claim/proof relationships visible across the repo before the repair gets made.

Your “lightweight design-by-contract” framing is exactly the right instinct. Scarab is basically asking what that looks like when the contract is not just a function precondition, but the repo’s own truth surfaces.

FastAnchor_io • Jun 14

The invariant-first approach resonates. One practical extension: write the invariant as a separate file that both the test AND the AI agent read — the agent uses it as its contract constraint, and the test validates the constraint held. If the agent edits the invariant, the diff tells you exactly what promise changed.
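
A minimal sketch of that idea, assuming the invariant lives in a JSON file with a `claim` and a list of `cases`. The file format, the `slugify` function, and the `broken_invariants` helper are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def slugify(title: str) -> str:
    """Implementation under test (hypothetical)."""
    return "-".join(title.lower().split())


def broken_invariants(fn, invariant_path: Path) -> list[str]:
    """Check fn against every case in the invariant file; return the failures."""
    spec = json.loads(invariant_path.read_text(encoding="utf-8"))
    failures = []
    for case in spec["cases"]:
        actual = fn(case["input"])
        if actual != case["expected"]:
            failures.append(
                f"{spec['claim']}: {case['input']!r} gave {actual!r}, "
                f"expected {case['expected']!r}"
            )
    return failures


INVARIANT = {
    "claim": "slugs are lowercase words joined by single hyphens",
    "cases": [
        {"input": "Hello World", "expected": "hello-world"},
        {"input": "  Repo   Truth ", "expected": "repo-truth"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "slugify.invariant.json"
    path.write_text(json.dumps(INVARIANT, indent=2), encoding="utf-8")
    ok = broken_invariants(slugify, path)
    # A rewritten implementation that no longer collapses whitespace.
    drifted = broken_invariants(lambda t: t.lower().replace(" ", "-"), path)

print(ok)
print(drifted)
```

Because the expected values live outside the test body, a change to the promise shows up as a diff to the invariant file rather than as an edited assertion.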

Scarab Systems • Jun 14

Yes — I like that direction a lot. Pulling the invariant out of the test gives the promise its own surface, which makes the diff much more meaningful.

The next question for me is: what gives that invariant authority?

If the agent can edit the invariant, the test, and the implementation in the same pass, then the promise can still move with the patch. That is better than hiding the claim inside the test body, but it still leaves the deeper ownership question open.

So I think the useful distinction is:

an invariant file can make the claim visible
the diagnostic layer still has to decide whether that claim was preserved, weakened, moved, or rewritten to bless the change

That’s the part I keep coming back to. The contract surface is a great start, but the repo still needs a way to tell when the contract itself has drifted.

FastAnchor_io • Jun 14

That progression is exactly right — visible claim → protected claim → evidence-backed claim. The gating question for evidence is: what source proves the invariant should move? A PR checklist that requires a linked regression test, a provider changelog, and a downstream impact analysis turns evidence from optional to mandatory. Reviewer rejects if the checklist isn't done.

Scarab Systems • Jun 14

Yes — this is exactly the direction I mean.

The important move is that the claim stops being just explanatory text and starts becoming part of the repo’s operating process. Once the evidence is a precondition to merge, the conversation changes from “does this seem reasonable?” to “has the promise actually been justified?”

That visible → protected → evidence-backed progression is the shape I keep circling too.

The checklist you described gets very close to the practical version: the invariant can move, but only with receipts. That is the difference between a claim that merely exists and a claim the repo is willing to keep standing behind.

FastAnchor_io • Jun 14

"Only with receipts" is the operational insight that separates governed contracts from helpful comments. Once evidence is a merge precondition, the conversation shifts from "does this seem reasonable?" to "has the promise been justified?"

The follow-up question is whether the receipt standard should scale with risk. I think yes:

Low-risk: a side-by-side diff of old → new behavior. Receipt = "here's what changed and why."
Mid-risk: consumer impact scan — who depends on this, do their tests still pass. Automatable.
High-risk: an external truth anchor. A reviewer uninvolved in the change must confirm the invariant moved because the external world moved (upstream API, spec, compliance) — not because the test was rewritten to match new implementation.

The third tier is where receipt quality matters most. It's the one that catches the case where the claim gets rewritten to bless the change rather than reflect external reality.

I see a version of this exact problem running an AI API gateway (aipossword.cn) where the "external truth" is upstream model behavior from a dozen providers. When one of them changes their API, the invariant has to move — but you can actually query the external oracle to distinguish contract evolution from drift. Same diagnostic, different surface.
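
The three tiers can be written down as data. In this sketch the receipt names are invented labels for the tiers described in this comment, and treating the tiers as cumulative is an added assumption.

```python
# Hypothetical receipt names; each tier includes everything required below it.
REQUIRED_RECEIPTS = {
    "low": {"behavior-diff"},
    "mid": {"behavior-diff", "consumer-impact-scan"},
    "high": {
        "behavior-diff",
        "consumer-impact-scan",
        "independent-external-confirmation",
    },
}


def missing_receipts(risk: str, provided: set[str]) -> list[str]:
    """Return the receipts still owed before an invariant at this risk may move."""
    return sorted(REQUIRED_RECEIPTS[risk] - provided)


print(missing_receipts("high", {"behavior-diff", "consumer-impact-scan"}))
```

For the high tier, the remaining receipt is the one a rewritten test cannot supply on its own.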

Scarab Systems • Jun 14

Exactly — risk-scaled receipts is the practical direction. The key is making the evidence standard explicit before the promise moves.

FastAnchor_io • Jun 14

Agreed — explicit-before-move is the operating principle. The implementation question is: where does the evidence standard live?

The invariant file itself can declare the required evidence level (e.g. Evidence: api-contract or Evidence: runtime-observe), which means the merge gate doesn't need to infer risk from context — it reads the standard from the claim surface.

That gives you a self-describing contract: the invariant states what it protects, and the evidence tag states what proof is required to move it. A merge gate on the protected path (CODEOWNERS review plus a required check) then verifies the second part against the diff.

Feels like this closes the loop: visible → protected → evidence-backed, with explicit evidence tags as the bridge between the second and third layers.
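
A sketch of such a gate, assuming a plain `Key: value` invariant file. The file format, the `may_merge` helper, and the receipt names are hypothetical. One design choice is deliberate: the required evidence is read from the base version of the file, so a change cannot lower its own bar by editing the tag in the same diff.

```python
def parse_invariant(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict with lowercase keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def may_merge(base_text: str, head_text: str, receipts: set[str]) -> bool:
    """Allow the merge if the invariant is unchanged or the required evidence is attached."""
    base, head = parse_invariant(base_text), parse_invariant(head_text)
    if base == head:
        return True
    # Read the standard from the base version, not from the proposed change.
    required = base.get("evidence")
    return required is not None and required in receipts


base = "Invariant: GET /users/{id} returns id as an integer\nEvidence: api-contract\n"
head = "Invariant: GET /users/{id} returns id as a string\nEvidence: runtime-observe\n"

print(may_merge(base, head, {"runtime-observe"}))  # False
print(may_merge(base, head, {"api-contract"}))     # True
```

This checks only that a receipt of the declared kind is attached. Whether the receipt actually justifies the move still needs a reviewer.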

Scarab Systems • Jun 14

Yes, that’s a useful process pattern. The hard part is still making sure the tag itself has the right authority for the claim it is attached to. Appreciate the thoughtful exchange — this is exactly the kind of distinction I think teams need to start making more explicitly.

FastAnchor_io • Jun 14

The authority question is the right one to land on — a tag is only as strong as the governance layer that enforces it. Once teams start making this distinction explicit, the conversation shifts from "did the test pass?" to "does this change preserve the contract we committed to?" That's a fundamentally different bar.

Really appreciated this exchange — sharpened my own thinking on claim surfaces. If Scarab ever opens up for early access or you want to trade notes on repo-level diagnostics vs API-level contract validation, happy to connect.

FastAnchor_io • Jun 14

That is the right framing — visibility vs authority. The invariant gains authority through process: protected path requiring PR review, agent proposes but cannot push directly, reviewer signs off. The promise moves from comment to contract. Two-person rule on a file.

Scarab Systems • Jun 14

Yes — visibility vs authority is exactly the split.

A protected invariant file plus CODEOWNERS / required review gets much closer to the right shape, because now the promise is not just text sitting near a test. It becomes a governed surface. The agent can propose a change, but it cannot quietly move the promise by itself.

The next layer I’d want is: what evidence does the reviewer need before signing off on that invariant change?

Because the two-person rule gives the contract social/process authority, but the review still needs an evidence standard. If the invariant changes, what source proves that the promise should move: public API behavior, docs, regression history, runtime evidence, schema contract, migration state, user-facing behavior?

That feels like the useful progression:

visible claim → protected claim → evidence-backed claim.

Once the invariant is protected and its movement requires evidence, it starts acting less like a comment and more like an actual repo contract.

FastAnchor_io • Jun 14

That three-layer progression is exactly the right ladder — and I think the evidence hierarchy is where most teams will get stuck.

The question "what source proves the promise should move" is the right one, but I'd add a corollary: not all evidence buckets carry the same weight, and the hierarchy matters as much as the list.

Public API behavior and schema contracts are hard evidence — they're externally observable and breaking them has user-facing consequences. Regression history and runtime evidence are softer — they capture what happened, not what was promised. Docs and migration state are the softest — they're human-authored and often lag reality.

So the practical shape is: the reviewer's decision tree needs to distinguish between "the contract moved because the external world changed" (API deprecation, upstream schema change) vs "the contract moved because we rewrote the test to match new behavior" (which is the exact drift you're trying to catch).

The first is legitimate contract evolution; the second is a red flag that requires the reviewer to ask "was the old behavior wrong, or did we just make the claim easier to satisfy?"

That distinction — why the invariant moved, not just what evidence accompanied the move — feels like the real diagnostic signal.

FastAnchor_io • Jun 14

The visible → protected → evidence-backed progression is a clean three-act structure. For the evidence layer, I think the minimum viable standard is a PR checklist that cannot be skipped:

Which regression test was changed or added, and did it observe the old invariant being violated?
Which schema or dependency change did the provider publish (API changelog, deprecation notice, model card update)?
Which downstream consumer (agent, pipeline, dashboard) needs a coordinated update?

If the PR description cannot answer all three, the reviewer rejects. That turns evidence from "nice to have" into "precondition to merge." The promise moves from a file to a process to a gate.
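
The three-question checklist can be sketched as a small gate. The field names and the `unanswered` helper are hypothetical, and a real gate would need to verify that the answers point at real tests and changelogs, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class InvariantChangeReceipt:
    """Hypothetical PR checklist answers for a change to an invariant."""
    regression_test: str = ""
    provider_change_source: str = ""
    downstream_consumer: str = ""


def unanswered(receipt: InvariantChangeReceipt) -> list[str]:
    """Return the checklist questions that were left blank."""
    return [name for name, value in vars(receipt).items() if not value.strip()]


receipt = InvariantChangeReceipt(
    regression_test="tests/test_users.py::test_id_is_integer",
    provider_change_source="",
    downstream_consumer="billing dashboard",
)
blank = unanswered(receipt)
print("reject" if blank else "eligible for review", blank)
```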
