The Reflective Gap: Notes on AI Code Review in an Era of Breakneck Production

May 19, 2026

There’s a particular kind of silence that descends when an AI code reviewer finishes a pass on a non-trivial diff. The findings have severities. The findings have line numbers. The findings have rationales that read as confident, structured, and adequately formatted. And yet, if you run the same review again — same model, same prompt, same diff — you will get a different set of findings. Not entirely different. Not random. But different enough that the question “did we review this code?” no longer has a binary answer.

This is the asymmetry I keep coming back to as we build our own native code review. It isn’t a flaw to be patched out of the next model release. It’s the operating regime, and any honest treatment of AI code review has to start there.

Why we built our own rather than buying

When we started looking at the AI code review category, the pattern was depressingly uniform. The category is dominated by vendors that, when you strip away the dashboards and the GitHub App glue, are thin orchestration shims over the same three or four frontier models the rest of us have API access to. The differentiation lives in prompt scaffolding, diff chunking heuristics, and a feedback UI — not in any proprietary intelligence about your code. The benchmark data makes this clear: on the OpenSSF CVE benchmark, the spread between tools is largely a function of how aggressively they tune for recall versus precision, with CodeRabbit landing around 59% accuracy and a 36% F1 score, while DeepSource — which prepends a deterministic static analysis pass before the LLM ever runs — tops out at 84.51% F1. The deltas are real, but they’re deltas in plumbing, not in the underlying reasoning engine.

There’s a Veracode dataset that crystallized our skepticism: across more than 100 LLMs tested on 80 security-sensitive coding tasks, 45% of AI-generated code introduced an OWASP Top 10 vulnerability, and the pass rate did not improve across testing cycles from 2025 into 2026, despite successive rounds of vendor announcements. If the underlying model is the floor and the ceiling of the analysis, paying a per-seat premium for someone else to write the system prompt felt like paying rent on infrastructure we were already running.

So we built a native code review skill: a mixture-of-models composition that runs the diff through multiple frontier models with task-specialized prompts, then aggregates and classifies findings into a HIGH / MEDIUM / LOW severity taxonomy. The mixture isn’t ensemble-for-ensemble’s-sake. Different models have meaningfully different failure modes on different classes of bugs — one will catch a subtle aliasing issue in async code that another sails past, and vice versa — and disagreement between models is itself signal. When all three flag the same finding at HIGH severity, our confidence is meaningfully higher than when only one does. We surface that disagreement rather than hide it.
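The aggregation step can be sketched roughly as follows. This is a minimal illustration, not the production implementation: the `Finding` fields, the model names, and the exact-match grouping key of (file, line, category) are all assumptions. In practice two models rarely report the same bug with identical line numbers, so real matching has to be fuzzier.

```python
from collections import defaultdict
from dataclasses import dataclass

SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


@dataclass(frozen=True)
class Finding:
    model: str     # hypothetical model identifier
    file: str
    line: int
    category: str  # hypothetical label, e.g. "aliasing"
    severity: str  # "HIGH", "MEDIUM", or "LOW"


def aggregate(findings, total_models):
    """Group findings by (file, line, category) and report agreement per group."""
    groups = defaultdict(list)
    for f in findings:
        groups[(f.file, f.line, f.category)].append(f)

    merged = []
    for (file, line, category), group in groups.items():
        models = {f.model for f in group}
        # Keep the most severe rating any model assigned to this group.
        top = max(group, key=lambda f: SEVERITY_RANK[f.severity])
        merged.append({
            "file": file,
            "line": line,
            "category": category,
            "severity": top.severity,
            "agreement": len(models) / total_models,
            "models": sorted(models),
        })
    # Highest severity first, then highest agreement.
    merged.sort(key=lambda m: (-SEVERITY_RANK[m["severity"]], -m["agreement"]))
    return merged


findings = [
    Finding("model_a", "queue.py", 42, "aliasing", "HIGH"),
    Finding("model_b", "queue.py", 42, "aliasing", "HIGH"),
    Finding("model_c", "queue.py", 42, "aliasing", "MEDIUM"),
    Finding("model_b", "auth.py", 10, "injection", "HIGH"),
]
report = aggregate(findings, total_models=3)
for item in report:
    print(item["file"], item["severity"], f"{item['agreement']:.2f}", item["models"])
```

The finding that only one model raised stays in the report with its low agreement score attached, which is the point: disagreement is shown, not filtered out.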

The stochasticity problem

Here’s the part the vendor marketing pages will never put above the fold: even when you set temperature to zero, AI code review is not deterministic.

The reasons are now well-documented. A 2025 study at Penn State and Comcast AI tested five hosted LLMs across eight tasks in supposedly deterministic configurations and found accuracy variations of up to 15% across runs, with the gap between best-case and worst-case performance reaching as high as 70%. A separate study focused specifically on code review — testing GPT-4o, GPT-4o mini, Claude, and LLaMA at temperature zero — found that all four models produced varied assessments on identical inputs. Thinking Machines Lab traced the deeper cause: the dominant source isn’t the obvious culprit (floating-point non-associativity in GPU reductions), but the fact that hosted inference systems batch incoming requests dynamically. The same prompt, arriving at a different moment of server load, ends up in a different batch composition, and the kernels used in standard LLM forward passes are not batch-invariant. Same input, different batch, different output.

What this means in practice is that an AI code review is a sample from a distribution of possible reviews, not a deterministic function of the diff. Run it once and you’ve taken one draw. The finding it surfaced as HIGH severity may or may not have appeared on another draw. The bug it missed may have been caught on a different one. There is no way, looking at a single review run, to know whether you’ve burrowed down a rabbit hole worth exploring or one that another run would have routed around entirely.
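One way to make this concrete is to rerun the review several times and measure how much the runs agree. The sketch below assumes each run has already been reduced to a set of finding identifiers. The identifier strings and the data are made up, and matching findings across runs is the hard part this glosses over.

```python
from itertools import combinations


def recurrence(runs):
    """Fraction of runs in which each finding appeared."""
    counts = {}
    for run in runs:
        for finding in run:
            counts[finding] = counts.get(finding, 0) + 1
    return {finding: n / len(runs) for finding, n in counts.items()}


def mean_jaccard(runs):
    """Average pairwise overlap between runs (1.0 means every run was identical)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Three hypothetical runs of the same review on the same diff.
runs = [
    {"sql-injection@db.py:88", "race@worker.py:17"},
    {"sql-injection@db.py:88"},
    {"sql-injection@db.py:88", "unchecked-none@api.py:5"},
]

rates = recurrence(runs)
overlap = mean_jaccard(runs)
for finding, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{rate:.2f}  {finding}")
print(f"mean pairwise Jaccard: {overlap:.2f}")
```

A finding with a recurrence near 1.0 is stable across draws. A finding that shows up in one run out of three is exactly the kind a single run would have surfaced or missed by chance.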

This is uncomfortable because it breaks the mental model engineers carry over from static analysis. When clippy flags a Vec::clone() in a hot loop, you can rerun clippy a hundred times and get the same warning. The signal is repeatable. The absence of a warning means something definite. With probabilistic review, neither is true. The presence of a finding is contingent. The absence of a finding is also contingent. You have collapsed a Bayesian inference into a yes/no artifact and called it a code review.

But human reviews are not the answer either

The tempting move at this point is to say: fine, AI review is stochastic and unreliable, so we’ll fall back to the artisanal practice of careful human review. The data doesn’t support this either.

The classic Fagan-inspection literature pegs human code review at roughly 60% defect detection on average — and that’s the high-water mark, achieved under formal inspection conditions that almost nobody actually runs anymore. An empirical study on the SmartSHARK dataset traced 187 distinct bugs that had been missed by reviewers across 173 buggy pull requests in 77 GitHub projects, with semantic bugs alone accounting for over half. The dominant categories of misses — semantic errors, build issues, compatibility, concurrency — are precisely the categories where reviewers were attentive but constrained by what a diff view can actually reveal.

The mechanisms are well-understood. Reviewers context-switch into a PR after their own feature work, after meetings, after incident response. They don’t have the spare bandwidth to deeply understand the full implications of every change. Confirmation bias makes them look for the bugs they expect rather than the ones present. Anchoring biases their assessment toward the first reading of the code. Decision fatigue degrades the third review of the afternoon relative to the first of the morning. And the social dynamics that overlay all of this — the implicit pressure not to push back on a senior engineer, the rubber-stamping culture that emerges under deadline pressure, the reluctance to nitpick a colleague — turn a process that is nominally adversarial into one that is collaborative in ways that work against defect detection.

A human review is also a sample. It’s a sample drawn from a different distribution than an AI review — biased by different heuristics, attuned to different patterns, constrained by different forms of attention — but it is a sample nonetheless. The fantasy that a human reviewer constitutes a deterministic check on code quality is exactly that: a fantasy. There is no reflective guarantee, AI or human, that all issues are identified and none are missed.

The asymmetry at play

Here is where the real risk surfaces, and it has nothing to do with whether AI or human review is “better.”

Code production rates have risen sharply. GitClear’s analysis of 2025 GitHub data showed the average developer checking in 75% more code than they did in 2022. Microsoft has acknowledged that 30% of code in some repositories is now AI-generated. Sonar’s 2026 developer survey puts the broader number at 42% of code being AI-generated or AI-assisted, with developers predicting that will cross 50% by 2027. Whatever the exact figure, the trajectory is unambiguous: the rate at which code enters codebases has accelerated dramatically, and most of that acceleration is downstream of generative AI.

The review side has not seen a comparable jump. Cortex’s 2026 benchmark report found that incidents per pull request rose 23.5% year-over-year, even as PRs per author increased by 20%. Code is shipping faster. The quality of review has not kept pace. AppSec Santa’s 2026 study found 25.1% of AI-generated code samples contained a confirmed security vulnerability when tested against OWASP Top 10 categories, and Aikido Security attributes 1 in 5 enterprise breaches in 2026 to AI-generated code. Georgia Tech’s Vibe Security Radar tracked CVEs traceable to AI coding tools climbing from 6 in January 2026 to 15 in February to 35 in March. The growth pattern, more than doubling each month, is the part to pay attention to, not the absolute count.

The shape of the problem is this: production has industrialized, but review has not. An artisanal coder producing 200 lines of carefully considered code in a day occupied a different risk regime than the same engineer producing 2,000 lines of agent-generated code in the same window. The volume under review has changed. If your review process, whether AI, human, or hybrid, provides no reflective guarantee that all issues are identified, the rate at which un-reviewed-with-confidence code enters production scales linearly with how much code you’re shipping. We have multiplied the volume of code by 5x or 10x and pretended that review capacity can keep up by buying a per-seat license.
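The linear scaling is simple arithmetic. The sketch below is illustrative only: the defect density of 10 per thousand lines is a made-up number, and the 60% detection rate is borrowed from the inspection figure cited earlier and held constant as volume grows, which is probably generous.

```python
def escaped_defects(lines_shipped, defects_per_kloc, detection_rate):
    """Expected defects that get past review for a given volume of code."""
    introduced = lines_shipped * defects_per_kloc / 1000
    return introduced * (1 - detection_rate)


# Hypothetical inputs: same defect density, same review quality, 10x the code.
DEFECTS_PER_KLOC = 10
DETECTION_RATE = 0.60

artisanal = escaped_defects(200, DEFECTS_PER_KLOC, DETECTION_RATE)
industrial = escaped_defects(2000, DEFECTS_PER_KLOC, DETECTION_RATE)

print(f"200 lines/day:  {artisanal:.1f} escaped defects")
print(f"2000 lines/day: {industrial:.1f} escaped defects")
```

Nothing about the review got worse in this example. The escaped count grows tenfold only because the volume did.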

How are we supposed to review then?

I don’t want to end this with a tidy synthesis, because the situation doesn’t admit one. But a few observations have shaped how we think about it.

The right framing is not AI versus human. It is: what is the variance of my review process, and is that variance proportionate to the volume I’m shipping? A single AI run is a sample. A single human review is a sample. Two samples from two different distributions, looked at together, give you more coverage than either alone. That is not because AI catches what humans miss in some heroic complementary way, but because two draws from differently biased distributions cover more of the bug space, in expectation, than one draw from either.
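The coverage argument has a simple form if you assume the reviewers miss bugs independently of one another. That assumption is the optimistic case: reviewers who share blind spots will gain less than this suggests. The detection rates below are hypothetical.

```python
from math import prod


def combined_detection(rates):
    """Probability that at least one reviewer catches a given bug,
    assuming each reviewer's misses are independent of the others'."""
    return 1.0 - prod(1.0 - p for p in rates)


# Hypothetical per-reviewer detection rates for one class of bug.
human_only = combined_detection([0.6])
human_plus_ai = combined_detection([0.6, 0.5])
human_plus_three_ai = combined_detection([0.6, 0.5, 0.5, 0.5])

print(f"human only:          {human_only:.2f}")
print(f"human + one AI pass: {human_plus_ai:.2f}")
print(f"human + three AI:    {human_plus_three_ai:.2f}")
```

Each extra pass helps, but the combined rate only approaches 1.0 and never reaches it, which is the same point as the absence of any guarantee.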

The right operational move, then, is to make the sampling explicit. Run multiple AI passes (different models, different prompts, different temperatures) and treat the union of findings as the candidate set, not any single pass. Weight findings by inter-model agreement rather than treating each pass as authoritative. Reserve human attention for the HIGH-severity findings that show up across passes, and for the architectural questions the diff view obscures from any reviewer, AI or human. Accept that even this is not a guarantee: it’s a tighter confidence interval over the same fundamentally probabilistic inference, not a deterministic check.
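One possible routing policy for that triage is sketched below. The destination names and the 0.5 agreement threshold are assumptions chosen for illustration, not a recommendation. Note that a HIGH finding raised by a single pass is still surfaced, since the union of findings is the candidate set.

```python
def route(severity, passes_flagging, total_passes, agreement_threshold=0.5):
    """Decide where a merged finding goes. Thresholds are illustrative."""
    if not 0 < passes_flagging <= total_passes:
        raise ValueError("passes_flagging must be between 1 and total_passes")
    agreement = passes_flagging / total_passes

    if severity == "HIGH" and agreement >= agreement_threshold:
        return "human_review"  # scarce human attention goes here first
    if severity == "HIGH" or agreement >= agreement_threshold:
        return "pr_comment"    # surfaced to the author, not escalated
    return "log_only"          # kept for later analysis of recurring misses


print(route("HIGH", 3, 3))
print(route("HIGH", 1, 3))
print(route("LOW", 3, 3))
print(route("LOW", 1, 3))
```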

And calibrate your shipping rate to the actual variance of your review pipeline, not to the variance you wish it had. The most expensive thing in software right now is not the cost of writing code. It is the cost of the bugs we shipped because we treated a probabilistic review as a deterministic one, and assumed that because we ran a review, we had reviewed the code.

Those are not the same statement. They have never been the same statement. The acceleration of AI-assisted production has only made the gap between them more expensive.

Beyond Boundaries - Expanding the horizons of knowledge
