Daniel Nwaneri

Posted on Jun 17 • Edited on Jun 18

Claude Code Wrote the PR. Here's What the Code Review Actually Caught.

#ai #security #discuss #webdev

Everyone is shipping AI-generated code right now. Most of it is going straight to main.

Quick verdict: Qodo catches production-grade bugs in AI-generated code before they ship. Claude Code generated a Stripe webhook handler that passed TypeScript, looked clean, and had six real bugs — an ack-before-processing pattern that would silently drop fulfilled orders, no replay protection, a non-atomic rate limiter, a DoS-prone body read, a timing-unsafe signature compare, and a shared rate-limit bucket for null IPs. Qodo flagged all six in 90 seconds. Two of them I hadn't planted; the review reasoned them out from how the code behaves at runtime, not what it says.

I'm not going to tell you that shipping AI-generated code straight to main is always wrong. A lot of it is fine. But I've been building production systems on Cloudflare Workers for six years, and I know exactly how "fine" can turn into a 2am incident. The subtle bugs (the ones that pass a quick read, pass TypeScript, pass your linter) are the ones that hurt.

So I ran an experiment. I asked Claude Code to generate a Stripe webhook handler for a Cloudflare Worker. I did not edit it. I did not second-guess it. I opened a PR and let Qodo run a code review on it.

This is what happened.

the setup

The feature: a Cloudflare Worker that receives Stripe webhooks, validates HMAC signatures, rate-limits by IP using KV, and processes checkout.session.completed and payment_intent.payment_failed events.

That's a real thing. It's the kind of feature an AI tool generates confidently and completely. Looks clean. Passes TypeScript strict mode. The logic flow makes sense on a first read.

It also had six bugs.

Here's the repo: github.com/dannwaneri/stripe-webhook-worker

The code is 173 lines of TypeScript. Signature validation, rate limiting, event dispatch, KV writes. Nothing exotic. Exactly the kind of thing you'd ship on a deadline without a second look.

the code Claude generated

The entry point looks like this:

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const ip = request.headers.get('CF-Connecting-IP') ?? 'unknown';

    const limited = await checkRateLimit(ip, env);
    if (limited) {
      return new Response(JSON.stringify({ error: 'Rate limit exceeded' }), { status: 429 });
    }

    const rawBody = await request.text();
    const signature = request.headers.get('stripe-signature');

    if (!signature) {
      return new Response(JSON.stringify({ error: 'Missing stripe-signature header' }), { status: 400 });
    }

    const isValid = await validateSignature(rawBody, signature, env.STRIPE_WEBHOOK_SECRET);
    if (!isValid) {
      return new Response(JSON.stringify({ error: 'Invalid signature' }), { status: 401 });
    }

    // ... parse JSON, dispatch event

    ctx.waitUntil(processEvent(event, env));

    return new Response(JSON.stringify({ received: true }), { status: 200 });
  },
};

Reads cleanly. Validates the signature. Rate-limits. Returns 200. Done.

Except no.

running qodo

Installing Qodo took about five minutes. Go to the GitHub Marketplace, install the app, scope it to the repo. Then connect at app.qodo.ai. Once that's linked, you trigger a review by commenting on the PR:

/agentic_review

Ninety seconds later, Qodo's response appeared in the PR thread. Six bugs. Zero rule violations.

what it found

Finding 1: ack before processing (action required — reliability)

This was the top-priority finding, and it's the one that would have caused a real incident.

The code calls ctx.waitUntil(processEvent(event, env)) and immediately returns 200 OK. Stripe sees the 2xx and stops retrying. But processEvent runs in the background — if it fails (KV timeout, unhandled exception, runtime termination), Stripe never knows. The order goes unfulfilled. No alert fires. The customer waits.

Qodo's fix: either await processEvent(event, env) and return non-2xx on failure so Stripe retries, or persist the event to a durable queue before returning 200, then process with retries separately.
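A minimal sketch of the first option, assuming a processEvent that rejects on failure. The ackAfterProcessing helper and its plain { status, body } return shape are hypothetical, chosen so the example runs outside a Worker; in the handler you would wrap the result in new Response(body, { status }).

```typescript
// Hypothetical helper: acknowledge only after processing has succeeded.
// `processEvent` stands in for the handler's own processing function.
type Ack = { status: number; body: string };

async function ackAfterProcessing(processEvent: () => Promise<void>): Promise<Ack> {
  try {
    // Await the work instead of handing it to ctx.waitUntil().
    await processEvent();
    return { status: 200, body: JSON.stringify({ received: true }) };
  } catch {
    // A non-2xx response tells Stripe the delivery failed, so it retries later.
    return { status: 500, body: JSON.stringify({ error: 'Processing failed' }) };
  }
}
```

Because Stripe retries failed deliveries, the same event can arrive more than once, so this pattern works best when processing is idempotent (for example, keyed on the event ID).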

I knew there was no try/catch in processEvent. I hadn't framed it as an acknowledgment problem — that framing is sharper and explains the real-world failure mode directly.

Finding 2: no replay protection (action required — security)

The validateSignature function parses Stripe's t=timestamp from the header and uses it to reconstruct the signed payload. That's correct. What it doesn't do is check whether the timestamp is recent.

Stripe's own documentation says to reject any webhook where the timestamp is more than five minutes old. Without that check, a valid captured webhook can be replayed indefinitely. Same valid signature, same event ID, processed again.

The fix is two lines:

const age = Math.floor(Date.now() / 1000) - Number(timestamp);
if (!Number.isFinite(age) || age > 300) return false; // reject malformed timestamps and events older than 5 minutes
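To make the check concrete, here is a small sketch of the parsing and freshness test in isolation. The function names, the injected nowSeconds parameter, and the sample values are illustrative assumptions, not code from the repo. The t=...,v1=... layout follows Stripe's documented Stripe-Signature header format.

```typescript
// Illustrative helpers (hypothetical names, not from the repo).
const TOLERANCE_SECONDS = 300; // five minutes

// Pull the `t=` value out of a header shaped like "t=1718600000,v1=abc...".
// Returns null when the timestamp is missing or not a plain integer.
function parseTimestamp(signatureHeader: string): number | null {
  for (const part of signatureHeader.split(',')) {
    const [key, value] = part.trim().split('=');
    if (key === 't' && value !== undefined && /^\d+$/.test(value)) {
      return Number(value);
    }
  }
  return null;
}

// True only when the event is no older than the tolerance window.
// Passing `nowSeconds` in (instead of calling Date.now()) keeps this testable.
function isFresh(timestamp: number, nowSeconds: number): boolean {
  const age = nowSeconds - timestamp;
  return age <= TOLERANCE_SECONDS;
}
```

In the handler, a null from parseTimestamp or a false from isFresh would both lead to the same outcome as a bad signature: reject the request.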

Finding 3: non-atomic rate limiting (remediation recommended — reliability)

The rate limiter reads the current count from KV, checks if it's under the limit, then writes the incremented count back. Two concurrent requests both read count = 0, both pass the check, both write count = 1. Under any real burst the rate limiter is trivially bypassed.

The correct implementation uses Durable Objects for atomic counters, or pushes the rate logic to Cloudflare's native rate limiting API.
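The lost update is easy to reproduce without any real concurrency. The sketch below uses an in-memory Map as a stand-in for KV (an assumption for illustration; the real KV API is asynchronous) and interleaves two requests by hand. The LIMIT value and function names are hypothetical.

```typescript
// In-memory stand-in for a KV namespace (illustration only).
const store = new Map<string, number>();
const LIMIT = 1; // hypothetical: allow one request per window

// Step 1 of the racy limiter: read the current count.
function readCount(key: string): number {
  return store.get(key) ?? 0;
}

// Step 2 of the racy limiter: check the (possibly stale) value, then write it back + 1.
function checkAndWrite(key: string, seen: number): boolean {
  if (seen >= LIMIT) return false; // rejected
  store.set(key, seen + 1);
  return true; // allowed
}

// Two requests interleave: both read before either writes.
const seenA = readCount('rl:1.2.3.4');
const seenB = readCount('rl:1.2.3.4');
const allowedA = checkAndWrite('rl:1.2.3.4', seenA);
const allowedB = checkAndWrite('rl:1.2.3.4', seenB);
// Both requests are allowed, and the stored count ends at 1 instead of 2.

// Atomic alternative: read, check, and write as one uninterruptible step.
// This is the kind of guarantee a Durable Object counter is meant to provide.
function incrementIfAllowed(key: string): boolean {
  const current = store.get(key) ?? 0;
  if (current >= LIMIT) return false;
  store.set(key, current + 1);
  return true;
}
```

The point of the second function is the shape, not the Map: whichever primitive you pick has to make the check and the increment a single operation.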

Finding 4: body read before header check (remediation recommended — security)

The code reads rawBody = await request.text() before checking whether the stripe-signature header even exists. That means any request without a signature — a scanner, a bot, a misconfigured service — forces the worker to consume and buffer the full request body before being rejected.

For most requests that's noise. For a large payload flood it's a real DoS surface. The fix is to check the header first.
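The reordering is small. A sketch, using a minimal structural type in place of the Workers Request so it runs anywhere; RequestLike, BodyResult, and readVerifiableBody are hypothetical names.

```typescript
// Minimal shape of the parts of Request this sketch needs (hypothetical type).
interface RequestLike {
  headers: { get(name: string): string | null };
  text(): Promise<string>;
}

type BodyResult =
  | { ok: true; signature: string; rawBody: string }
  | { ok: false; status: number; error: string };

async function readVerifiableBody(request: RequestLike): Promise<BodyResult> {
  // Cheap check first: if the header is missing, the body is never read.
  const signature = request.headers.get('stripe-signature');
  if (!signature) {
    return { ok: false, status: 400, error: 'Missing stripe-signature header' };
  }
  // Only now pay the cost of buffering the body.
  const rawBody = await request.text();
  return { ok: true, signature, rawBody };
}
```

This only filters out requests with no header at all. A request that carries a bogus signature still gets its body buffered, so a size cap is a separate measure to consider.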

Finding 5: timing-unsafe signature compare (advisory — security)

The computed HMAC is compared to the expected hash with ===. JavaScript string comparison short-circuits on the first mismatched character, which leaks timing information an attacker can use to recover the expected hash byte-by-byte.

The fix is crypto.subtle.timingSafeEqual on the raw byte arrays before hex-encoding:

const computedBytes = new Uint8Array(mac);
const expectedBytes = hexToBytes(expectedHash);
return computedBytes.length === expectedBytes.length &&
  crypto.subtle.timingSafeEqual(computedBytes, expectedBytes);
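The snippet relies on a hexToBytes helper that isn't shown. One possible version is sketched below (an assumption, not necessarily what the repo uses). It returns an empty array for malformed input, so the length check above fails closed. Note that crypto.subtle.timingSafeEqual is a Cloudflare Workers extension rather than part of the standard Web Crypto API.

```typescript
// Hypothetical helper: decode a hex string into bytes.
// Malformed input (odd length or non-hex characters) yields an empty array,
// which can never match a 32-byte HMAC-SHA256 digest.
function hexToBytes(hex: string): Uint8Array {
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    return new Uint8Array(0);
  }
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    // Each pair of hex characters becomes one byte.
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}
```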

This is a real CVE class. Qodo ranked it advisory.

Finding 6: shared 'unknown' IP bucket (advisory — reliability)

When CF-Connecting-IP is null, the fallback is the string 'unknown'. Every request that arrives without the header — health checks, misconfigured proxies, certain load balancer configurations — shares the same rate limit bucket. One noisy service can lock out all other headerless traffic.

I did not put this one in the code intentionally. Qodo caught it by reading the fallback on line 15, cross-referencing the rate limiter, and reasoning about what happens at runtime with a null header. That's not a pattern matcher. That's contextual analysis.
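Qodo's suggested remedy for this one isn't quoted here, so the following is only one possible approach: treat a missing header as its own case instead of folding it into a shared key. The function name and the choice to return null are assumptions; the rl: prefix matches the rate-limit keys the handler already writes.

```typescript
// Hypothetical helper: derive the rate-limit key, or null when no client IP is known.
// Returning null forces the caller to decide explicitly (reject, skip limiting,
// or apply a separate policy) rather than sharing one 'unknown' bucket.
function rateLimitKey(ip: string | null): string | null {
  const trimmed = ip?.trim();
  return trimmed ? `rl:${trimmed}` : null;
}
```

A caller would then branch on the null case before touching the limiter, which makes the headerless path a visible decision instead of an accidental default.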

what surprised me

Two things.

The prioritization surprised me more than the findings did. Timing-unsafe comparison — the CVE-class security bug — ranked advisory. The architectural reliability issue ranked first. That's a judgment call, not a checklist. Qodo's reasoning: if the timing attack succeeds, an attacker can forge requests. But if the ack-before-processing architecture silently drops fulfilled orders, that's production-down-right-now. I don't entirely agree with the weighting, but I understand the reasoning and it's defensible.

Two findings weren't bugs I planted. Finding 4 (body before header) and Finding 6 ('unknown' IP bucket) — both required the review to understand what the code does, not just what it says. The 'unknown' bucket catch in particular required multi-line reasoning — fallback value on line 15, rate limiter logic in a separate function, runtime behavior with a missing header. That's what Qodo calls the Context Engine: the codebase is indexed so reviews understand architecture, not just the diff.

What it missed is worth naming. The KV namespace is reused for two semantically different key types: rl:* keys with a 60-second TTL and order:* keys with no TTL. If you ever add a TTL policy to the namespace globally, order records start expiring. Qodo didn't catch this — it would require knowing the intent of the two key types, not just observing they share a namespace. That's a fair miss. It's also exactly the kind of thing that bites you six months later when someone touches the KV config.

the generation / review distinction

Claude Code generated this. It generated it well — the code is structured, typed, readable, and handles the happy path correctly. That's what generation tools are for.

Qodo reviewed it. It found six bugs, two of them action-required, without knowing I'd planted any of them. It surfaced findings I didn't anticipate. It prioritized by real-world impact, not severity labels.

These are different jobs. Cursor and Claude are good at one. Qodo is built for the other. The reason this matters specifically for AI-generated code: AI tools write confidently. They don't flag their own assumptions. They don't know what they don't know about your production environment. The code looks reviewed because it looks clean.

Qodo is an AI code review platform. It runs as parallel agents on each PR — separate agents for critical issues, duplicated logic, breaking changes, ticket compliance, and rule enforcement, each running independently. The Context Engine indexes your codebase so it can reason about cross-file implications and architectural consistency, not just the lines in the diff. What came back on this PR wasn't a list of style nits. It was a structural critique of how the handler handles failure.

That's the gap between generating and reviewing. The PR looked fine. It wasn't.

takeaway

Run the code review. Not because you don't trust the tool that generated it. Because the tool that generated it isn't the right tool for the job.

Six bugs in 173 lines. Two of them action-required. One I hadn't thought of. That's not a failure of the generator — it's an argument for the review step.

If you're shipping AI-generated PRs without a structured review pass, you're not moving faster. You're just moving the incident to later.

The full code is at github.com/dannwaneri/stripe-webhook-worker. Qodo runs on the free tier for public repos — qodo.ai.

If you want to go deeper on AI code review, Qodo's AI Code Review Academy has a few useful reads:

What is AI code review — how it works and what it catches
Reviewing AI-generated code — common patterns and pitfalls
AI code review tools comparison — side-by-side feature breakdown

Sponsored by Qodo.

This article was written with AI assistance for research and editing. All arguments, examples, and opinions are my own.

Top comments (11)

leob • Jun 17

Yeah that's amazing ... so, would Qodo use different "models" (LLMs), or the same models but trained differently, how does that work?

Daniel Nwaneri • Jun 18 • Edited

From what I can tell, it uses foundation models (Claude, GPT-5-class). The differentiation is in how they orchestrate multiple specialized agents and index your codebase for context, not in the underlying weights.

leob • Jun 18 • Edited

So the difference is not in the underlying capabilities, but really in how you utilize them ...

I'm asking because, if the LLM is able to find those bugs (when orchestrated and directed by Qodo), you'd think that that same LLM should be capable of not making those bugs in the first place when generating the code! :-)

Guess it goes to show that it matters a lot what you're asking of an LLM in affecting what it does or 'can do', even though the fundamental capabilities are all there ... maybe an LLM can be "really good" only at one clearly defined task at the same time, compared to the human brain which just more naturally does multiple things simultaneously?

Daniel Nwaneri • Jun 18

Generation is autocomplete: the model optimizes for the next plausible token. Review is inversion: the model looks for where "plausible" breaks down at runtime. Same weights, opposing objectives.

Your "one task at a time" framing is close but I'd put it differently: it's not capacity, it's optimization direction. A model writing code isn't asking "where could this fail?" It's asking "what comes next?" Switch the prompt, switch the question.

The human parallel holds: same dev, same brain, writes a bug at 2pm and catches it in review at 4pm. The question is whether you'd actually want a generator that paused mid-write to second-guess itself.

leob • Jun 18 • Edited

Right, so in the end the difference is in the context that you feed into it ...

Still baffles me that we now have these enormous and opaque artificial "brains", and nobody really understands how it's doing its magic, but we're somehow getting good at coaxing it into doing what we want ;-)

P.S. but with something like Qodo, is it only about the different context that they're supplying to the LLM, as in, a clever prompt? ;-)

Or would they do some additional 'training' on the model, creating a new variant of it? (at this point I realize I might be talking total nonsense, lol)

(well I'm asking something which nobody might have the answer to, because Qodo is probably not disclosing their "secret sauce" ...)

To use the "human brain" analogy again:

You might give that (human) developer, who has to do code reviews, but has little experience with it (yeah okay, this is just fictitious ...) a detailed checklist telling him/her how to do code reviews - or, you might send that developer on a 1 week course, to learn best practices and basic principles of doing code reviews - where the checklist is analogous to a "prompt", or 'context', while the 1 week course would be "additional training of the model" (assuming that the latter is even technically possible at all ...)

Daniel Nwaneri • Jun 18

The checklist analogy is closer than you're giving yourself credit for. Most of what tools like Qodo do is retrieval and orchestration: figure out which code is relevant, package it with structured review instructions, dispatch specialized agents per concern. The underlying model stays the same.

Fine-tuning (your "1-week course") is expensive and goes stale fast as codebases evolve. RAG and prompt engineering age better because the context is dynamic. You don't retrain the model; you get better at telling it what to look at and what to ask.

The opaque brain does the same thing with a better briefing packet. That's most of the magic.

leob • Jun 18

Thanks for clarifying, yes that makes sense!

Sloan the DEV Moderator • Jun 17

Hey, this article appears to have been generated with the assistance of ChatGPT or possibly some other AI tool.

We allow our community members to use AI assistance when writing articles as long as they abide by our guidelines. Please review the guidelines and edit your post to add a disclaimer.

Failure to follow these guidelines could result in DEV admin lowering the score of your post, making it less visible to the rest of the community. Or, if upon review we find this post to be particularly harmful, we may decide to unpublish it completely.

We hope you understand and take care to follow our guidelines going forward!

FrancisTRᴅᴇᴠ (っ◔◡◔)っ • Jun 17

Issue resolved. Thanks Daniel!

Benjamin Nguyen • Jun 18

I find your post interesting because I did not know that you had issues with Claude.

Daniel Nwaneri • Jun 18

The opposite, actually. Claude generated the code well. Qodo reviewed it and found six bugs in what Claude produced. The issue was with the generated code, not the tool.
