ai engineering-management software-delivery code-review

Your PR Needs a Burden of Proof Now


AI-assisted teams cannot rely on the diff alone anymore. As code gets cheaper to produce, every pull request needs explicit proof that the change is safe.

A pull request used to carry its own credibility.

Not perfectly, obviously. Bad code still got merged. Sloppy reviews still happened. But the basic trust model was stable enough that most teams could function: a developer wrote the code, a reviewer read the diff, maybe glanced at the tests, and made a judgment call.

That model is breaking.

AI-assisted teams can now generate more code than reviewers can safely trust from the diff alone. The problem is not that the code is always bad. The problem is that the reviewer can no longer assume the diff itself is the whole story. As code gets cheaper to produce, proof gets more expensive to skip.

I wrote recently that the spec is the product now and that engineering productivity is not AI ROI. This is the review layer of the same shift. If implementation gets cheaper upstream, the organization either gets much better at verification or it quietly moves the cost downstream into review tax, rework, and production risk.

Why the old PR trust model worked

The old model was never rigorous, but it was legible.

Most code was written at human speed. The author usually understood every line because they had typed every line. Diff size was constrained by the amount of time it took to build the thing. Reviewers used the code itself as a proxy for the thinking behind it.

You could look at a pull request and infer a lot:

  • whether the author seemed to understand the subsystem
  • whether edge cases had been considered
  • whether tests looked intentional or stapled on at the end
  • whether the blast radius seemed understood
  • whether the change felt like careful work or optimistic work

That was never foolproof. It was just good enough because the economics of code generation limited how much unfinished thinking could get poured into a PR at once.

AI changes those economics.

Now a developer can arrive with a 900-line diff that is internally consistent, stylistically clean, and still under-verified in all the places that matter. The code can look more finished than the underlying thought process actually was. That is the part a lot of teams miss.

The issue is not merely that AI makes mistakes. Human developers make mistakes too. The issue is that AI-assisted code can be produced faster than a reviewer can reconstruct intent, assumptions, and test coverage from the diff alone. The reviewer is being asked to underwrite risk with less trustworthy signals.

The diff got cheaper. Trust did not.

A lot of review habits were built around a simple assumption: if someone took the time to produce this code, they probably did at least some of the reasoning required to make it safe.

That assumption weakens when code generation gets easier.

When implementation is cheap, you see more of these patterns:

  • larger speculative diffs
  • cleaner-looking code with thinner understanding behind it
  • tests that prove the happy path but not the behavior you actually care about
  • refactors that compile but quietly change contracts
  • reviewers approving based on plausibility because full reconstruction would take too long

That last one is the real organizational danger.

Reviewers are still held responsible for quality, but the cost of earning confidence has gone up. So the team starts cutting corners in invisible ways. Approvals become more about pattern recognition than verification. Senior engineers spend more time reverse-engineering what the change is supposed to do. PR turnaround slows down, but nobody calls it review debt. They call it “we’re a little backed up this week.”

This is why AI-assisted review cannot just be “read the code more carefully.” That scales terribly. If the author can generate code in minutes and the reviewer needs an hour to regain confidence, you do not have leverage. You have a transfer of labor.
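The "happy path but not the behavior you actually care about" pattern is easy to illustrate. A minimal sketch, using a hypothetical `apply_discount` helper whose informal contract is "reject unknown codes, never go negative":

```python
# Hypothetical example: a discount helper whose contract is
# "reject unknown codes, and never return a negative total".
def apply_discount(total: float, code: str) -> float:
    rates = {"SAVE10": 0.10, "SAVE50": 0.50}
    if code not in rates:
        raise ValueError(f"unknown discount code: {code}")
    return round(total * (1 - rates[code]), 2)

# The kind of test that often arrives with generated code:
# it passes, and it proves almost nothing about the contract.
def test_happy_path():
    assert apply_discount(100.0, "SAVE10") == 90.0

# The tests a reviewer actually needs: the failure mode and the edge.
def test_unknown_code_rejected():
    try:
        apply_discount(100.0, "BOGUS")
    except ValueError:
        pass  # expected: the contract says unknown codes fail loudly
    else:
        raise AssertionError("unknown code was silently accepted")

def test_zero_total_stays_non_negative():
    assert apply_discount(0.0, "SAVE50") == 0.0

test_happy_path()
test_unknown_code_rejected()
test_zero_total_stays_non_negative()
```

A reviewer who sees only the first test has no way to know whether the other two behaviors were ever considered. That gap is exactly what the evidence layer below is meant to close.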

What the reviewer increasingly needs is evidence

The diff is still useful. It is just no longer enough.

A good PR now needs to show its work.

That usually means some combination of:

  • tests that prove the behavior changed the way you think it did
  • screenshots or video for UI changes
  • logs or traces for operational changes
  • demo notes that explain the path exercised
  • contract checks for integrations and schema-sensitive work
  • risk notes that say what could still be wrong
  • rollout or rollback notes for changes with operational blast radius
  • known unknowns so the reviewer is not forced to guess where the uncertainty sits

None of this is bureaucratic decoration. It is the evidence layer that lets a reviewer inspect outcomes instead of trying to mentally simulate the entire change from code alone.

If a PR changes a billing workflow, I want to know what cases were exercised, what logs were inspected, what failure mode was intentionally not solved yet, and how the change would be rolled back if it misbehaves. If it changes a frontend settings page, I want screenshots, state transitions, and the exact edge cases covered. If it changes an API contract, I want to see contract tests or compatibility notes, not a cheerful comment that says “tested locally.”

That is the new burden of proof. The author is not just submitting code. They are making a case that the code is safe enough to merge.

Without proof standards, senior engineers become cleanup crews

This is where the management failure shows up.

A team adopts AI tools. PR count goes up. Diff volume goes up. Local coding speed looks great. Leadership starts hearing that productivity is improving.

Then the senior people get slower.

Not because they got more cautious. Because they became the quality backstop for work that now arrives faster and less legibly than before.

You see it in a few familiar symptoms:

  • staff and principal engineers drowning in review queues
  • PR comments turning into mini design reviews because the intent was never made explicit
  • merge decisions getting delayed on anything risky because nobody has enough evidence to be confident
  • post-merge cleanup work rising even though initial implementation looked fast
  • juniors appearing highly productive while the actual cost is being absorbed by whoever has the judgment to spot what is missing

That is fake productivity. The output looks better in the upstream dashboard because the downstream bill is being paid by a smaller group of expensive humans.

I would be especially suspicious of any AI rollout where commit volume and PR throughput are up, but review latency, rework, or incident follow-up are quietly getting worse. That is usually not a high-performing system. It is a system hiding verification cost in senior bandwidth.

If you do not set explicit proof standards, your best engineers become cleanup crews for plausible-looking work.

Set a review contract, not a vibe

The practical fix is not “be stricter in review.” That produces moral pressure, not operating clarity.

What works better is an explicit review contract: for this kind of change, here is the evidence the author is expected to bring before a reviewer says yes.

The contract should scale with risk.

Low-risk changes

For copy updates, isolated refactors, or small internal tooling changes, keep it light:

  • short summary of what changed
  • passing automated tests or lint checks
  • note confirming no contract or behavior change

Medium-risk changes

For normal feature work or behavior changes inside a known boundary:

  • targeted tests for the changed behavior
  • screenshots, logs, or demo notes showing the path exercised
  • brief note on affected edge cases and what was intentionally not verified

High-risk changes

For billing, auth, migrations, infra, integrations, or anything with real blast radius, the reviewer should not have to guess:

  • explicit test evidence
  • contract checks or compatibility notes
  • rollout plan and rollback plan
  • monitoring or alerting notes
  • known unknowns and failure modes
  • reviewer guidance on where to focus

That does not need to become a 14-section enterprise template. In most teams, this fits in a few bullets. It just needs to make the author carry enough of the verification load that the reviewer is not forced to recreate the whole change from scratch.

A simple PR template can go a long way:

## What changed

## Why this approach

## Evidence
- Tests run:
- Screenshots / demo:
- Logs / traces:
- Contract checks:

## Risk
- Blast radius:
- Known unknowns:
- Rollback plan:

You can tighten or loosen that by repo, by team, or by change class. The important thing is that the norm is explicit.
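The norm can even be enforced mechanically. A sketch of a CI-style check, assuming the PR body and a risk label arrive as inputs and the section names mirror the template above (the tier-to-section mapping here is illustrative, not prescriptive):

```python
# Sketch: enforce the review contract by risk tier.
# Assumptions: risk is "low" | "medium" | "high", and the PR body
# uses the template headings shown earlier in this post.
REQUIRED_SECTIONS = {
    "low": ["## What changed"],
    "medium": ["## What changed", "## Evidence"],
    "high": ["## What changed", "## Evidence", "## Risk"],
}

def missing_sections(pr_body: str, risk: str) -> list[str]:
    """Return the template sections the author still owes for this tier."""
    # Unknown tiers fall back to the strictest requirements.
    required = REQUIRED_SECTIONS.get(risk, REQUIRED_SECTIONS["high"])
    return [section for section in required if section not in pr_body]

body = """## What changed
Refactored the settings page.

## Evidence
- Tests run: pytest tests/settings
"""
print(missing_sections(body, "medium"))  # []
print(missing_sections(body, "high"))    # ['## Risk']
```

A check like this does not judge the quality of the evidence, only its presence. That is still useful: it moves the argument from "be more diligent" to a visible, failing status check.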

Review is shifting from code inspection to evidence inspection

This does not mean code review disappears. It means the center of gravity changes.

Reviewers will still read code. They still need to catch bad abstractions, missing guardrails, and architectural nonsense. But increasingly the valuable question is not “can I inspect every line closely enough to feel smart?” It is “has the author provided enough evidence that this change behaves as claimed, with a risk profile we understand?”

That is a different skill on both sides.

Authors need to learn how to package proof, not just generate output.

Reviewers need to get sharper about what evidence actually increases confidence and what is just ceremonial attachment. Ten screenshots do not help if the real risk is a broken migration. A green test suite does not help much if none of the tests exercise the changed contract. Evidence only matters when it closes a real uncertainty.
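What "exercise the changed contract" can mean in practice is worth making concrete. A minimal sketch, assuming a hypothetical invoice payload and a hand-written list of fields downstream consumers depend on:

```python
# Hypothetical contract check: the fields downstream consumers rely on
# must survive a refactor, even if the internal structure changes.
REQUIRED_FIELDS = {"id": str, "amount_cents": int, "currency": str}

def check_contract(payload: dict) -> list[str]:
    """Return human-readable violations; an empty list means compatible."""
    violations = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            violations.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return violations

# A refactor that renames amount_cents fails loudly here, even while
# a green-but-irrelevant unit test suite keeps passing.
old_shape = {"id": "inv_1", "amount_cents": 1999, "currency": "USD"}
new_shape = {"id": "inv_1", "amount": 19.99, "currency": "USD"}
print(check_contract(old_shape))  # []
print(check_contract(new_shape))  # ['missing field: amount_cents']
```

The point is not this particular helper; it is that evidence attached to a contract-changing PR should target the contract, not whatever was easiest to test.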

This is one reason I think AI-assisted teams will eventually become much more explicit about review tiers, merge criteria, and deployment proof. Not because process people won an argument. Because the old informal trust model depended on code being expensive enough that the diff itself carried more signal.

It does not anymore.

The author owns the first layer of verification

This is the norm I would want every engineering org to state plainly: the person opening the PR owns the first layer of proof.

Not QA. Not the tech lead. Not the reviewer who happens to know the service best.

The author.

That is the only way this scales. If the author can use AI to generate implementation quickly, they can also use that same speed to generate tests, collect screenshots, capture logs, write risk notes, and package the case for merge. If they do not, the savings are fake.

The goal is not to make every PR heavyweight. It is to make trust legible again.

Teams are not losing the ability to generate code. They are losing the ability to know, at review time, whether that code deserves to ship.

The diff no longer carries enough proof on its own.