I Spent 3 Weeks Auditing My Team’s AI-Generated Code. Here Is What We Found.

#ai #agents #programming #productivity

The codebase we built in record time looks functional. It passes CI. It has 91% test coverage. It is also quietly becoming unmaintainable in ways that are hard to see until you look very closely.

Six months ago, my team adopted AI-assisted development aggressively, and our velocity numbers became something we were genuinely proud of. We were shipping features in a third of the time. The product was growing. The engineering org was happy. In May, we started hitting a wall: a series of production bugs that were unusually difficult to diagnose, refactoring work that should have taken four days and turned into a four-week project, and a new hire who told me, candidly, that she had never seen a codebase she found so hard to reason about despite the code itself being syntactically clean. We spent the first three weeks of June doing a structured audit. The technical debt we found was not the kind that shows up in a linter or a dependency vulnerability scanner. It was architectural and behavioral, the kind that only becomes visible under the specific pressure of trying to change something.

The codebase that nobody actually understands
I want to be precise about what I mean by AI-generated technical debt because the term is getting used loosely right now.

The code my team wrote over the past year is not bad code in the obvious sense. It does not have egregious algorithmic errors. It handles most edge cases. The tests pass. The functions are named reasonably well. If you read any individual function in isolation, it reads as competent Python.

The problem is not at the function level. It is at the level of how the system holds together across functions, modules, and services.

When AI coding assistants generate code, they optimize for the problem as described in the prompt. They solve the immediate question. They do this extremely well. What they do not do — because they cannot, given how they work — is reason about the broader system consequences of the solution they are producing. They do not know what changed last sprint. They do not know that this service is being deprecated in six months. They do not know that the data model this function depends on was designed as a temporary workaround and was supposed to be replaced eight months ago.

The code that results from generating against the immediate problem, without the context of the broader system, is locally coherent and globally fragile.

The four specific patterns we found
Pattern 1: The function that works but that no one can explain
We found eighteen instances across the codebase of what I started calling “oracle functions” — functions that produce correct output and have passing tests but where none of the engineers who reviewed the PR at the time could give a confident explanation of exactly why the implementation works for all inputs.

Here is a simplified example of the category:

def normalize_user_event_sequence(events: list[dict]) -> list[dict]:
    seen = {}
    result = []
    for event in sorted(events, key=lambda e: (e.get("timestamp", 0), e.get("id", ""))):
        key = (event.get("user_id"), event.get("event_type"), event.get("timestamp", 0) // 3600)
        if key not in seen or event.get("priority", 0) > seen[key].get("priority", 0):
            seen[key] = event
    for key in sorted(seen.keys()):
        result.append(seen[key])
    return result
This function was generated by the AI, passed its tests, and went into production. But when I asked three engineers to explain the bucketing logic — specifically, why the timestamp is divided by 3600 and what happens at bucket boundaries when two events from the same user occur in the same hour window — nobody could give me a confident answer. Including the engineer who merged it.

The function might be correct. Or there might be an edge case at bucket boundaries that has not surfaced yet because the test suite did not cover it. We do not know. That is the problem.
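One way to turn the bucket-boundary question into something checkable is a small probe like the one below. It copies the function above and feeds it hand-built events. The `make_event` helper, the user and event names, and the sample timestamps are illustrative assumptions, not code from our repository.

```python
def normalize_user_event_sequence(events: list[dict]) -> list[dict]:
    # Copied from the example above so the probe is self-contained.
    seen = {}
    result = []
    for event in sorted(events, key=lambda e: (e.get("timestamp", 0), e.get("id", ""))):
        key = (event.get("user_id"), event.get("event_type"), event.get("timestamp", 0) // 3600)
        if key not in seen or event.get("priority", 0) > seen[key].get("priority", 0):
            seen[key] = event
    for key in sorted(seen.keys()):
        result.append(seen[key])
    return result


def make_event(event_id: str, timestamp: int, priority: int = 0) -> dict:
    # Hypothetical helper: same user and event type, only time and priority vary.
    return {
        "id": event_id,
        "user_id": "u1",
        "event_type": "click",
        "timestamp": timestamp,
        "priority": priority,
    }


# Two events one second apart, straddling the hour boundary (3599 vs 3600).
straddling = normalize_user_event_sequence([make_event("a", 3599), make_event("b", 3600)])

# Two events 59 minutes apart that fall inside the same hour bucket.
same_bucket = normalize_user_event_sequence([make_event("c", 3600), make_event("d", 7199)])

print([e["id"] for e in straddling])   # both events are kept
print([e["id"] for e in same_bucket])  # only the earlier event is kept
```

The probe shows that the buckets are fixed clock hours, not a sliding window: events one second apart are both kept if they straddle an hour boundary, while events almost an hour apart collapse into one if they share a bucket. Whether that is the intended behavior is exactly the question a reviewer should be able to answer before merging.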

The specific AI behavior that produces this pattern: AI tools are very good at writing code that passes the tests provided. They are not good at generating tests that cover their own edge cases, because generating comprehensive edge case tests requires understanding the full behavior space of the implementation, not just the happy path specified in the prompt.

The fix is not to reject AI-generated implementations. It is to require that the engineer who merges the code can explain the implementation in their own words, without looking at it, to a colleague who is not familiar with the problem. If they cannot do that, the code does not merge.

Pattern 2: Test coverage theater
Our overall test coverage number is 91%. This number is nearly useless as a quality signal for the AI-generated portions of the codebase.

The problem is that AI coding tools, when asked to write tests, generate tests that cover the code that exists rather than tests that cover the behavior that should exist. The distinction matters enormously.

# What AI generated as "tests" for the payment processing module

def test_process_payment_returns_success():
    result = process_payment(amount=100, currency="USD", user_id="u123")
    assert result["status"] == "success"

def test_process_payment_with_zero_amount():
    result = process_payment(amount=0, currency="USD", user_id="u123")
    assert result["status"] == "success"  # This should probably not succeed

def test_process_payment_stores_record():
    process_payment(amount=100, currency="USD", user_id="u123")
    records = get_payment_records("u123")
    assert len(records) == 1
Three passing tests, 94% line coverage on the function, and none of these tests would catch:

What happens when the payment provider API returns a 503
What happens when the same request is retried and idempotency is not handled
What happens when amount is negative
What happens when currency is not in the supported list
What happens when user_id does not exist in the database
AI-generated tests are optimized for coverage metrics. They test the code that exists. A useful test suite tests the contract that should exist, including the failure modes that are not yet in the code.
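For contrast, here is a minimal sketch of what testing the contract could look like. Everything in it is an assumption made for illustration: the `process_payment` stand-in (including its extra `idempotency_key`, `charge`, and `ledger` parameters), the `SUPPORTED_CURRENCIES` set, and the `ProviderUnavailable` exception are hypothetical and do not mirror our real module. The unknown-user case is left out to keep the sketch short.

```python
SUPPORTED_CURRENCIES = {"USD", "EUR"}  # assumed list, for illustration only


class ProviderUnavailable(Exception):
    """Raised when the (hypothetical) payment provider answers 503."""


def process_payment(amount, currency, user_id, idempotency_key, charge, ledger):
    """Hypothetical stand-in for the real function, written to the contract."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if currency not in SUPPORTED_CURRENCIES:
        raise ValueError(f"unsupported currency: {currency}")
    if idempotency_key in ledger:
        return ledger[idempotency_key]  # retried request: do not charge again
    if charge(amount, currency, user_id) == 503:
        raise ProviderUnavailable("payment provider returned 503")
    ledger[idempotency_key] = {"status": "success", "amount": amount, "user_id": user_id}
    return ledger[idempotency_key]


def raises(expected, func, *args):
    """Return True if func(*args) raises the expected exception type."""
    try:
        func(*args)
    except expected:
        return True
    return False


def ok_charge(amount, currency, user_id):
    return 200  # fake provider that always accepts


def down_charge(amount, currency, user_id):
    return 503  # fake provider that is always unavailable


def test_rejects_zero_and_negative_amounts():
    for amount in (0, -5):
        assert raises(ValueError, process_payment, amount, "USD", "u123", "k1", ok_charge, {})


def test_rejects_unsupported_currency():
    assert raises(ValueError, process_payment, 100, "XXX", "u123", "k1", ok_charge, {})


def test_provider_outage_is_not_reported_as_success():
    ledger = {}
    assert raises(ProviderUnavailable, process_payment, 100, "USD", "u123", "k1", down_charge, ledger)
    assert ledger == {}  # nothing is recorded for a failed charge


def test_retry_with_same_key_charges_once():
    calls = []

    def counting_charge(amount, currency, user_id):
        calls.append(amount)
        return 200

    ledger = {}
    first = process_payment(100, "USD", "u123", "k1", counting_charge, ledger)
    second = process_payment(100, "USD", "u123", "k1", counting_charge, ledger)
    assert first == second and len(calls) == 1
```

Each test names a failure mode from the list above instead of a line of the implementation, so it keeps its value if the implementation is regenerated or rewritten.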

The fix we implemented: any AI-generated test suite goes through a mandatory “adversarial review” before merge. A different engineer from the one who wrote the feature spends twenty minutes trying to break the implementation and adds tests for anything they find that is not already covered.

Pattern 3: Invisible coupling discovered during refactoring
This was the most expensive pattern we found, and it is the one that directly caused the four-week refactoring project I mentioned.

When we decided to extract our notification service into a standalone microservice, we expected the work to take roughly four days: identify the notification-related code, move it, update the callers, deploy. Standard extraction refactoring.

What we found was that notification logic had been deeply woven into modules that had no obvious relationship to notifications. The user profile module was calling a notification helper directly. The event ingestion pipeline had notification logic embedded in three separate places with slightly different implementations. The billing module was constructing notification payloads using a local utility function that duplicated the logic in the notification module, with a subtle difference in how it handled timezone offsets.

Every one of these entanglements was introduced by an AI coding assistant responding to a prompt along the lines of “when this event occurs, send a notification to the user.” The AI generated the most direct solution to the prompt, which is to add the notification call in the function that processes the event. It did this correctly, each time, for each feature that needed notifications. What it did not do is recognize that a pattern was emerging that would eventually make the notification system impossible to extract cleanly.

The pattern has a name in architecture literature: it is called implicit coupling, and it is the natural consequence of solving local optimization problems without a global view of the system.

The fix at the code review level: require that any PR that introduces a dependency on a service or module that was not previously a dependency of the modified module be reviewed by a senior engineer. The AI will add dependencies freely. A human reviewer needs to explicitly approve each new one.
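A review rule like this is easier to enforce when new dependencies are surfaced automatically. The sketch below uses Python's standard `ast` module to compare the imports of a file before and after a change. The function names and the sample module names (`billing`, `notifications`) are hypothetical, and it only tracks absolute imports at top-level package granularity, so treat it as a starting point and not as our actual tooling.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level package names imported by a module's source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped here.
            found.add(node.module.split(".")[0])
    return found


def new_dependencies(before: str, after: str) -> set[str]:
    """Packages imported in the new version of a file but not in the old one."""
    return imported_modules(after) - imported_modules(before)


# Illustrative before/after contents of a billing module.
before = "import json\nfrom billing.models import Invoice\n"
after = before + "from notifications.helpers import send_email\n"

print(sorted(new_dependencies(before, after)))  # ['notifications']
```

A check like this could run in CI and label the PR for senior review whenever the result is non-empty. The approval decision itself stays with a human.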

Pattern 4: Error handling that looks thorough but is not
AI tools generate error handling that is syntactically comprehensive and semantically shallow. They produce try/except blocks. They log errors. They return appropriate error codes. But the error handling they generate tends to treat all errors in a category as equivalent when the correct behavior often differs substantially based on the specific error.

# AI-generated error handling pattern we found repeatedly

async def fetch_user_data(user_id: str) -> dict | None:
    try:
        response = await http_client.get(f"/users/{user_id}")
        response.raise_for_status()
        return response.json()
    except httpx.HTTPStatusError as e:
        logger.error("HTTP error fetching user %s: %s", user_id, e)
        return None
    except httpx.RequestError as e:
        logger.error("Request error fetching user %s: %s", user_id, e)
        return None
This catches both 404s (user does not exist) and 500s (upstream service is broken) and treats them identically: log, return None. The callers of this function cannot distinguish between “the user does not exist” and “we could not reach the service.” These require completely different handling. A 404 should result in a clean error to the client. A 500 should result in a retry, or a circuit breaker trip, or an alert. Collapsing them both into return None makes the system appear to handle errors gracefully while actually hiding failures that require operational response.
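A minimal sketch of the alternative is to give callers distinct exceptions for "does not exist" and "could not be served." To keep the example runnable without a network or third-party packages, it takes a hypothetical async `get` callable that returns a `(status_code, payload)` tuple in place of a real httpx client. The exception names are also assumptions. With httpx, the same split could be made by inspecting `e.response.status_code` on `HTTPStatusError`.

```python
import asyncio


class UserNotFound(Exception):
    """The upstream service answered and the user does not exist."""


class UpstreamUnavailable(Exception):
    """The upstream service failed or could not be reached; a retry may help."""


async def fetch_user_data(user_id: str, get) -> dict:
    # `get` is a hypothetical async callable: path -> (status_code, payload).
    try:
        status, payload = await get(f"/users/{user_id}")
    except ConnectionError as exc:
        raise UpstreamUnavailable(f"could not reach user service for {user_id}") from exc
    if status == 404:
        raise UserNotFound(user_id)
    if status >= 500:
        raise UpstreamUnavailable(f"user service returned {status} for {user_id}")
    if status >= 400:
        raise ValueError(f"unexpected client error {status} for {user_id}")
    return payload


# Fake transports standing in for the real HTTP client.
async def found(path):
    return 200, {"id": "u123"}


async def missing(path):
    return 404, None


async def broken(path):
    return 503, None


async def unreachable(path):
    raise ConnectionError("connection refused")


def outcome(get):
    """Run one fetch and report either the payload or the exception type name."""
    try:
        return asyncio.run(fetch_user_data("u123", get))
    except (UserNotFound, UpstreamUnavailable) as exc:
        return type(exc).__name__


for transport in (found, missing, broken, unreachable):
    print(transport.__name__, "->", outcome(transport))
```

With this shape, a caller can map `UserNotFound` to a clean client error and route `UpstreamUnavailable` to a retry, a circuit breaker, or an alert, which is the distinction that `return None` erases.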

The audit process that surfaced all of this
We did not find these patterns by reading code. We found them by applying structured pressure to the codebase.

The most revealing exercise was the “change test.” We picked a feature that had been built during the AI-assisted sprint and tried to make a specific, bounded change: add a new field to a data model. We documented every file that required modification, every test that broke, and every unexpected side effect. The radius of a change is a direct measure of coupling. Features built with disciplined architecture tend to have small change radii. The AI-assisted features had change radii that were consistently larger than expected.

The second exercise was “explain it without looking.” For the functions that tested at high coverage but that the team was uncertain about, we asked the engineer who built each feature to explain the implementation to someone unfamiliar with it. When they needed to consult the code to finish the explanation, we flagged that function for deeper review.

The third exercise was a dependency audit. We mapped every cross-module import that was introduced during the AI-assisted sprint and looked for patterns that indicated implicit coupling rather than designed dependency.
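The dependency mapping can be roughly approximated with static analysis. The sketch below builds an import graph from module sources and then lists every module that depends on a target, directly or transitively, as a static proxy for the change radius described above. It assumes flat module names and in-memory sources, and the sample modules are invented for illustration. Our actual audit was more manual than this.

```python
import ast
from collections import deque


def build_import_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each module name to the other in-repo modules it imports."""
    graph = {}
    for name, source in sources.items():
        imported = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                imported.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module)
        # Keep only imports that point at modules inside the audited set.
        graph[name] = {m for m in imported if m in sources and m != name}
    return graph


def change_radius(graph: dict[str, set[str]], target: str) -> set[str]:
    """Every module that depends on `target`, directly or transitively."""
    affected, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for module, deps in graph.items():
            if current in deps and module not in affected:
                affected.add(module)
                queue.append(module)
    affected.discard(target)  # a cycle should not make the target its own dependent
    return affected


# Invented sources loosely shaped like the notification entanglement.
sources = {
    "notifications": "import smtplib\n",
    "profile": "import notifications\n",
    "billing": "from notifications import send\n",
    "ingestion": "import billing\n",
    "reports": "import json\n",
}

graph = build_import_graph(sources)
print(sorted(change_radius(graph, "notifications")))  # ['billing', 'ingestion', 'profile']
```

A static graph like this only sees explicit imports. Duplicated logic, such as the billing module's private copy of the payload builder, will not appear in it, which is why the change test remained the more revealing exercise.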

What we are doing differently
We have not stopped using AI coding tools. We use Cursor for most greenfield implementation work, GitHub Copilot for inline completion during exploratory sessions, and Claude Code for heavier refactoring tasks where the full codebase context matters. What has changed is the governance around how we use them.

New code that an AI assistant generates does not merge unless the engineer can explain it, specifically, to a level where a colleague could implement the same solution independently from the explanation alone.

AI-generated tests are treated as a starting point, not a completion. The adversarial review is now a required step in our code review process for any AI-assisted PR.

New dependencies introduced by AI-generated code require explicit senior engineer approval. The AI will add them freely. We need a human to consciously decide each time that the coupling is acceptable.

We have a standing weekly exercise where one engineer picks a module and attempts a bounded change without looking at the implementation details first. If the change is harder than it should be, we investigate why.

None of this is anti-AI. All of it is standard engineering discipline that we should have been applying consistently regardless of how the code was generated. The AI-assisted sprint revealed the gaps in our process by filling those gaps with locally correct but globally fragile solutions, faster than our review process was equipped to catch.

For individual engineers trying to build the audit muscle independently, these are the tools that have been most useful on my team. Sourcegraph, for navigating large codebases and tracing where a specific function or pattern is used across the entire repository, which is invaluable for discovering the coupling patterns described above. CodeClimate, for tracking maintainability trends over time rather than point-in-time snapshots, which is the only way to see whether your codebase is getting harder or easier to change. GitHub’s code scanning (powered by CodeQL), for security debt specifically: it catches a meaningful fraction of the vulnerability patterns that AI-generated code tends to introduce without ceremony.

One practical note for anyone who is currently job searching or keeping themselves interview-ready alongside this kind of team-level work: the skills this audit required (reading unfamiliar code critically, tracing dependencies, identifying coupling) are the exact skills companies are now testing in technical interviews. The interview bar has shifted away from pure algorithmic output toward code review and architectural judgment, and understanding what a specific company tests in its current loops matters for how you allocate prep time. For that calibration, a combination of recent Glassdoor interview reports, Blind threads from the past six months, and PracHub is the most efficient way to understand whether a company is running algorithmic rounds, code review rounds, or a mix, without wasting weeks preparing for the wrong format.

The honest version
The velocity numbers were real. We shipped more features in twelve months than we had in any comparable period. The product is better for it. Some of the AI-generated code is excellent and will be in the codebase for years with no problems.

But the maintainability cliff is also real, and it arrived on a timeline that caught us off guard. The new hire’s observation, that the codebase was hard to reason about despite being syntactically clean, is the most precise description of AI-generated technical debt I have encountered. It looks fine. It reads fine. It passes review. And then you try to change something and you discover that the system as a whole has properties that none of the individual pieces would suggest.

The teams that are going to do this well are the ones that treat AI coding tools the way senior engineers already treat junior engineers: with trust, with expectation, and with the judgment to know when to override the output. The tools are excellent at what they do. What they do is not the full job.