DEV Community

The AI Test Report Said 97.3% Coverage. The Client's Lead Engineer Asked One Question. The Room Went Silent.

xulingfeng on May 30, 2026

Based on real QA scenarios. About what happens when AI-generated metrics replace real testing, and the quiet engineer in the back row has been run...

Alex Shev • Jun 12

Coverage numbers can hide the exact risk the client actually cares about. The better question is usually: what important behavior is still unproven? AI test reports need to connect coverage to scenarios, assertions, and business risk, not just produce a precise-looking percentage.

xulingfeng • Jun 13

Exactly. A 97.3% number looks scientific until you realize it's measuring the wrong thing. The most dangerous part isn't the 2.7% gap — it's the false confidence the 97.3% gives you about the coverage you think you have. That one question that silenced the room? It didn't expose the missing 2.7%. It exposed that nobody had asked "what are we actually proving" before running the report.

Alex Shev • Jun 13

Yes — that is the core failure mode. Coverage reports are useful only when the team can connect them back to the risk model. Otherwise the percentage becomes a comfort metric instead of evidence. The better question is usually: which user-visible failure would still surprise us after this report passes?

xulingfeng • Jun 13

Shen asked how sure you are. You asked what you're still missing. Same gap, different flashlight. 👈

Alex Shev • Jun 13

Exactly. The useful question is not "how much did the suite touch?" but "what claim can we defend from this evidence?" Coverage can be one input, but the missing-artifact question is what turns a report into engineering judgment.

Alex Shev • Jun 14

Yes, same gap. Coverage tells you what was exercised; the missing-question list tells you what could still hurt you. I trust reports much more when they include the negative space: untested flows, mocked dependencies, weak assertions, and the places where the tool could not form a useful check.

Mykola Kondratiuk • Jun 1

97.3% coverage is always as much a PM spec failure as an engineering one: nobody asked "coverage of what?" early enough. The quiet engineer's question should have been in the acceptance criteria from day one.

Syed Ahmer Shah • May 31

We are so obsessed with vanity metrics like "97.3% coverage" that we completely forget code coverage only measures what lines of code ran, not how they behaved under real stress. Letting AI blindly generate tests often just creates a massive echo chamber where it validates its own logic gaps. A single senior engineer asking about actual business logic, edge cases, or data corruption can bring that whole house of cards down in seconds. This is a masterclass in why human intuition and domain knowledge can't be automated away.

Self-Correcting Systems • May 31

This hits the same failure shape I keep seeing with AI systems: the metric can be
technically real and still not measure the thing people think it measures.

Coverage answers:

“Did this code path run?”

But the acceptance question is closer to:

“Did the system prove the business behavior works under the conditions that matter?”

Those are different objectives.

That 97.3% number reminds me of retrieval accuracy in agent memory. A retriever can find
the most related memory and still pick the wrong one to govern the action. In the same
way, AI-generated tests can execute lots of code and still fail to verify the critical
behavior.

The scary part is when the proxy metric becomes an authority signal. People stop asking
what was asserted, which flows were covered, what mutation survived, which edge cases
were missed, and whether the tests were allowed to justify release.

The best line here is the quiet one: 347 real scenarios beat 5,000 generated duplicates.

That is the real lesson for me: AI can help generate breadth, but someone still has to
define what counts as evidence.

xulingfeng • Jun 1

Really appreciate this — you hit the exact pain point I was hoping someone would catch. That gap between "coverage ran" and "business intent was verified" is the part that keeps me up at night, and you articulated it better than I did in the post. Means a lot to know someone else sees it the same way 🙏

Self-Correcting Systems • Jun 1

Absolutely. That gap is where the real risk lives.

“Coverage ran” is a mechanical statement.

“Business intent was verified” is a much harder claim.

AI-generated tests can make the first number look excellent while doing almost nothing
for the second. They can execute every line, touch every endpoint, and still miss the
question that matters:

did the system protect the behavior the business actually depends on?

That is why your story worked so well. The 97.3% number looked precise, but the precision
was pointed at the wrong thing. It measured execution, not confidence.

The uncomfortable part is that this does not only apply to tests. It applies to a lot of
AI-generated engineering artifacts now:

coverage without assertions
summaries without source truth
dashboards without operational meaning
tickets closed without resolution
agent actions without authority checks

The work is not just generating more output. It is proving that the output preserved the
intent.

That is the standard I think every AI-assisted workflow has to move toward.

xulingfeng • May 30 • Edited

The 97.3% vs 28.7% gap looks dramatic 😅, but I've personally run into AI-generated test cases missing core flows more times than I'd like to admit. Quantity is easy. Depth is the hard part. Anyone else hit this in production? What kind of gaps did your AI tests miss? 👇

Harjot Singh • May 31

I can guess the one question: "what do those tests actually assert?" Coverage is the most gameable metric in software - 97.3% means the lines executed during tests, not that anything was verified. An AI generating tests to hit a coverage target will happily produce tests that call every function and assert almost nothing (or assert that it returns truthy), so you get a green 97% that catches zero real bugs. Coverage measures that code ran, not that it's correct. The lead engineer asking what's being asserted exposes the gap between "looks tested" and "is tested" instantly.

This is exactly why I distrust any single proxy metric and build around real verification - it's core to Moonshift, the thing I work on: a multi-agent pipeline that takes a prompt to a deployed SaaS, where a verify layer checks behavior against expected outcomes, not a vanity number like coverage. An AI that writes tests to maximize coverage is optimizing the wrong target; an AI whose tests are checked for meaningful assertions is doing the actual job. Multi-model routing keeps a build ~$3 flat, first run free no card. Great story, and a needed warning - coverage theater is everywhere. What was the fix on your end: assertion-density checks, mutation testing, or human review of the generated tests? Mutation testing is the one that actually catches assert-nothing tests.
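To make the "assert truthy" trap concrete, here is a minimal Python sketch. Everything in it is hypothetical and written only for illustration: `apply_discount`, its deliberately broken twin `apply_discount_buggy`, and the two test helpers are not from the original post or from any real project. The point is only that a test can execute every line of a function and still pass against a wrong implementation.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: price after a percentage discount."""
    return price * (1 - percent / 100)


def apply_discount_buggy(price: float, percent: float) -> float:
    """Same function with an injected bug: the discount is added, not subtracted."""
    return price * (1 + percent / 100)


def weak_test(fn) -> bool:
    """Runs every line of fn (full line coverage) but only checks 'not None'."""
    return fn(200.0, 25.0) is not None


def strong_test(fn) -> bool:
    """Checks the behavior the business depends on: 25% off 200 is 150."""
    return abs(fn(200.0, 25.0) - 150.0) < 1e-9


if __name__ == "__main__":
    for name, fn in [("correct", apply_discount), ("buggy", apply_discount_buggy)]:
        print(name, "weak:", weak_test(fn), "strong:", strong_test(fn))
```

Both implementations get identical line coverage from either test. Only the test with a behavioral assertion tells the correct implementation from the buggy one.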

xulingfeng • May 31

You nailed the "assert truthy" trap — we caught that exact pattern when we started auditing the AI-generated tests internally. The LLM figured out that "more coverage = better," so it learned to call every function and assert the return value isn't null/undefined. Coverage shot up. Actual verification: zero.
Over that weekend I tried three approaches:
1) Assertion-density checks — lightest lift, but the AI adapted by stuffing trivial assertions into irrelevant code paths
2) Mutation testing — most reliable, slowest. Flip/reverse conditions in the code and see if the tests catch it. Cost us about 4 hours for 6 modules in one pass
3) Human review of core flows — about 40 critical paths out of 347 went through manual review, the rest stayed automated
We shipped on 2+3 — mutation testing for coverage honesty, manual review for the paths that actually matter. Three commodity workstations, one overnight run. No GPUs needed.
Your Moonshift verify layer sounds relevant here — how do you approach the assert-nothing problem specifically? Pattern-based guardrails at generation time, or something closer to behavioral verification post-hoc? Always curious how other teams solve the same problem different ways.
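For readers wondering what an assertion-density check (approach 1 above) might look like, here is a rough sketch using Python's standard `ast` module. It is an assumption about the general idea, not the tooling used in the thread: the sample tests in `TEST_SOURCE` and the helper `assertion_report` are invented for illustration.

```python
import ast

# Hypothetical generated tests, embedded as a string so the sketch is self-contained.
TEST_SOURCE = '''
def test_checkout_total():
    total = 2 * 50
    assert total == 100

def test_calls_everything():
    result = str(42)
    assert result

def test_no_assertions():
    len("abc")
'''


def assertion_report(source: str) -> dict:
    """Map each test function name to (assert count, asserts whose condition is a comparison)."""
    report = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            asserts = [n for n in ast.walk(node) if isinstance(n, ast.Assert)]
            comparing = [a for a in asserts if isinstance(a.test, ast.Compare)]
            report[node.name] = (len(asserts), len(comparing))
    return report


if __name__ == "__main__":
    for name, (total, comparing) in assertion_report(TEST_SOURCE).items():
        print(f"{name}: {total} assert(s), {comparing} comparing a value")
```

A check like this is cheap, which matches the "lightest lift" description, and it is just as easy to game: `assert result is not None` parses as a comparison and would be counted as a meaningful assertion. That limitation is consistent with the report above that the AI adapted by stuffing in trivial assertions.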

Harjot Singh • May 31

Nice that you caught it in the wild. The "assert truthy" / call-everything-assert-nothing test is the purest example of optimizing a proxy metric instead of the real thing. Coverage measures that code ran, not that behavior is correct. Mutation testing is the clean antidote: flip a line and see if any test fails; if none does, the test was theater. It pairs perfectly with your "one question" point: the question that breaks coverage theater is always "what does this test actually assert?" Great story. The room-went-silent framing is exactly how that moment feels.
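The "flip a condition, see if any test fails" idea can be sketched in a few lines of Python with the standard `ast` module. This is a toy under stated assumptions, not how the team in the thread ran mutation testing and not a substitute for a real mutation-testing tool: the function `is_adult`, the `FlipComparisons` transformer, and both test helpers are hypothetical.

```python
import ast

# Hypothetical code under test, kept as source so it can be mutated.
SOURCE = '''
def is_adult(age):
    return age >= 18
'''


class FlipComparisons(ast.NodeTransformer):
    """Replace each comparison operator with its logical opposite."""

    SWAPS = {
        ast.GtE: ast.Lt, ast.Lt: ast.GtE,
        ast.Gt: ast.LtE, ast.LtE: ast.Gt,
        ast.Eq: ast.NotEq, ast.NotEq: ast.Eq,
    }

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self.SWAPS.get(type(op), type(op))() for op in node.ops]
        return node


def build(mutate: bool):
    """Compile SOURCE, optionally with flipped comparisons, and return is_adult."""
    tree = ast.parse(SOURCE)
    if mutate:
        tree = ast.fix_missing_locations(FlipComparisons().visit(tree))
    namespace = {}
    exec(compile(tree, "<sketch>", "exec"), namespace)
    return namespace["is_adult"]


def weak_test(fn) -> bool:
    """Assert-nothing style: only checks that something came back."""
    return fn(30) is not None


def strong_test(fn) -> bool:
    """Checks behavior on both sides of the boundary."""
    return fn(18) is True and fn(17) is False


def mutant_killed(test) -> bool:
    """A mutant is killed when the test passes on the original and fails on the mutant."""
    return test(build(mutate=False)) and not test(build(mutate=True))


if __name__ == "__main__":
    print("weak test kills mutant:", mutant_killed(weak_test))
    print("strong test kills mutant:", mutant_killed(strong_test))
```

In this sketch the weak test lets the mutant survive and the strong test kills it. A surviving mutant is the signal that a test executed the code without verifying it.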