It was one of those days where everything clicks. You know the kind—the code just flows, the architecture makes sense in your head, every problem has an obvious solution. I coded through meetings, coded through lunch, coded until the office emptied out.
By evening I’d opened 6 pull requests. 13,000 lines of code. Features that had been on the roadmap for weeks, suddenly done. All working. All tested. I went home feeling like a machine.
Two weeks later, I checked the status. One merged.
The other five sat in the review queue, aging. My teammates were buried in their own AI-assisted output. Everyone was generating code at unprecedented rates. Nobody had time to review anyone else’s work. The PRs accumulated comments like “will review tomorrow” that never came.
The code wasn’t the problem. It worked. The review queue was the problem. And I started to realize: I hadn’t actually shipped anything. I’d just created a backlog.

The kind of PR that makes reviewers sigh
So I Looked Into It. Yikes.
Turns out I wasn’t alone. Not even close.
PR review times have jumped 91% at companies using AI coding tools. Teams that used to handle 10-15 PRs per week are now drowning in 50-100. Graphite’s research puts it bluntly: “Code review is the new bottleneck.” (No kidding.)
And here’s the fun part: a CodeRabbit study found AI-generated PRs have 10.83 issues per PR compared to 6.45 for human-written code. That’s 68% more issues. Forty-five percent of AI-generated apps contain exploitable OWASP vulnerabilities. The code compiles. It just might also let someone steal your user database. Whoops.
But wait, it gets better. A METR randomized trial found developers using AI were 19% slower on average—while believing they were 20% faster. We’re literally vibing ourselves into inefficiency. (I explored this more in The Slot Machine in Your IDE—turns out AI coding tools hit the same dopamine buttons as slot machines. Fun!)
The Stack Overflow 2025 survey captured our collective frustration: 66% of developers say AI code is “almost right, but not quite.” Not wrong. Almost right. Which means you have to read every single line hunting for the one subtle bug hiding in otherwise correct-looking code. It’s like Where’s Waldo, except Waldo is a SQL injection vulnerability.
So yeah. AI made writing code trivially easy. Validating it? That’s the hard part now.
The Real Bottleneck
Here’s a secret: writing code was never the bottleneck. Not really. Not for any developer past their first year.
The bottleneck was always understanding the problem well enough to know what to write. Navigating existing systems without breaking them. Getting stakeholders to agree on what “done” means. (Spoiler: they never agree.) Making sure what you built actually solved the right problem and not just the problem you understood.
AI hasn’t touched any of that. What it has done is remove the mechanical constraint of typing. You describe what you want and have code in seconds instead of hours.
But here’s the thing about constraints: when you remove one, another becomes visible. The constraint was always there—it was just hiding.
Now the constraint is review. Testing. Integration. Maintenance. The “almost right but not quite” code has to be read, understood, and verified by humans before it ships. And humans can only review so much before their eyes glaze over and they start approving things they shouldn’t. (Ask me how I know.)
The teams winning with AI aren’t the ones generating the most code. They’re the ones who’ve built systems to validate it without everyone burning out.
Smaller PRs. Yes, Really.
I know, I know. “Make smaller PRs” is the “eat your vegetables” of engineering advice. Everyone knows it. Nobody does it. (I wrote a small PR once. It was 2019. I still talk about it.)
But hear me out.
Stacked PRs—breaking large changes into small, dependent pull requests—sound like overhead. More PRs to open. More reviews to request. More merge conflicts to handle. More opportunities to mess something up.
But the data doesn’t lie. Small PRs (under 400 lines) get reviewed faster. Large PRs turn reviewers into zombies. They skim. They leave fewer comments. They rubber-stamp with “LGTM” just to clear the queue so they can get back to their own work. (I’ve done this. You’ve done this. We’ve all done this.)
When AI generates 2,000 lines for a feature, the temptation is to ship it all as one glorious mega-PR. Resist. Split it into a stack of 4-5 smaller changes. Each one reviewable in ten minutes instead of an hour. Each one with a focused scope that’s actually auditable.
Tools like Graphite and Mergify handle stack dependencies automatically now. They run CI in parallel across the stack. They auto-rebase when earlier PRs merge. The overhead that made stacking painful has largely been automated away.
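If you want a guardrail while you build the habit, a simple CI check can enforce the budget. Here is a minimal sketch in Python, assuming the job runs in a checked-out repo with the base branch fetched; the script name, branch name, and the 400-line threshold are illustrative choices, not a standard.

# check_pr_size.py -- a minimal sketch of a CI gate for the 400-line budget.
# Assumes it runs in a checked-out repo where the PR's base branch is fetched;
# the threshold and branch name below are illustrative.
import subprocess
import sys

MAX_CHANGED_LINES = 400
BASE_BRANCH = "origin/main"  # adjust to your repo's default branch

def changed_lines(base: str) -> int:
    """Sum added + deleted lines between the base branch and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t", 2)
        # Binary files report "-" instead of counts; skip them.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    total = changed_lines(BASE_BRANCH)
    if total > MAX_CHANGED_LINES:
        print(f"PR changes {total} lines (budget: {MAX_CHANGED_LINES}). Split it into a stack.")
        sys.exit(1)
    print(f"PR size OK: {total} lines changed.")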
The teams moving fastest aren’t generating the biggest changes. They’re shipping the smallest changes that are independently valuable. It’s less dramatic. It’s way more effective.
Use More Than One AI Reviewer (Trust Me)
Here’s something I got wrong for months: I was using one AI reviewer.
A general-purpose AI code reviewer is like a general practitioner. Fine for routine checkups. But when you’ve got a weird rash on your authentication system, you want a specialist.
+---------------------------------------------------------------+
|                     MULTI-AGENT CODE REVIEW                    |
+---------------------------------------------------------------+

                          PR Submitted
                                |
                                v
+-------------+  +-------------+  +-------------+  +-------------+
|  Security   |  | Performance |  | Architecture|  |    UI/UX    |
|    Agent    |  |    Agent    |  |    Agent    |  |    Agent    |
+-------------+  +-------------+  +-------------+  +-------------+
| - OWASP     |  | - N+1       |  | - Deps      |  | - A11y      |
| - Secrets   |  | - Memory    |  | - Coupling  |  | - Patterns  |
| - Auth      |  | - Perf      |  | - Modules   |  | - Design    |
+-------------+  +-------------+  +-------------+  +-------------+
       |                |                |                |
       +----------------+-------+--------+----------------+
                                |
                                v
                       +-----------------+
                       |  Human Review   |
                       |  (filtered PR)  |
                       +-----------------+

Specialized AI agents catch the boring stuff so humans can focus on what matters.
The teams with the best results use multiple specialized AI agents, each obsessed with its own domain. A security agent that knows OWASP vulnerabilities and goes absolutely feral when it sees hardcoded credentials. A performance agent that catches N+1 queries and goes “um, actually” about your algorithm choices. An architecture agent that enforces dependency direction and module boundaries. A UI/UX agent that yells about accessibility.
LeadDev research found each specialized agent catches 70-80% of issues in its domain. Stack them, and the vast majority of mechanical problems get caught before a human ever sees the code.
This isn’t about replacing human review. It’s about filtering noise. When a PR reaches a human reviewer, it should already be free of obvious issues. Security vulnerabilities caught. Performance problems flagged. Style guide violations fixed. The human reviewer can spend their precious attention on the questions that actually require human judgment—not debating bracket placement for the fortieth time.
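To make the idea concrete, here is a rough sketch of fanning one diff out to specialized reviewers. `call_model` is a stand-in for whatever LLM client you actually use, and the agent names and checklists simply mirror the diagram above; none of this is a specific product's API.

# multi_agent_review.py -- a minimal sketch of running several specialized
# reviewer prompts over the same diff. `call_model` is a placeholder for
# whatever LLM client you use; agent names and focus areas are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReviewAgent:
    name: str
    focus: str  # what this agent is allowed to comment on

AGENTS = [
    ReviewAgent("security", "OWASP Top 10, hardcoded secrets, auth/session handling"),
    ReviewAgent("performance", "N+1 queries, memory churn, algorithmic complexity"),
    ReviewAgent("architecture", "dependency direction, coupling, module boundaries"),
    ReviewAgent("ui_ux", "accessibility (WCAG), design-system patterns"),
]

def review(diff: str, call_model: Callable[[str], str]) -> dict[str, str]:
    """Run every specialized agent over the same diff and collect findings."""
    findings = {}
    for agent in AGENTS:
        prompt = (
            f"You are a {agent.name} reviewer. Only report issues related to: "
            f"{agent.focus}.\n\nDiff:\n{diff}"
        )
        findings[agent.name] = call_model(prompt)
    return findings

if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        # Dummy model so the sketch runs standalone; swap in a real client.
        return "no issues found"
    print(review("diff --git a/app.py b/app.py ...", fake_model))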
Let the Robots Triage
Not all code changes are created equal. (Controversial, I know.)
A PR that updates button styling is different from a PR that touches the authentication system. (One of these will get you paged at 3am. The other is CSS. Who cares.) A new utility function is different from a database migration. Treating them the same—applying identical review rigor to both—is how you either burn out your senior engineers or let critical bugs slip through. Pick your poison.
+---------------------------------------------------------------+
|                     RISK-BASED PR ROUTING                      |
+---------------------------------------------------------------+

 RISK FACTORS       LOW       MEDIUM      HIGH
 ----------------------------------------------------
 Files changed      1-3       4-10        10+
 Core systems       None      Utils       Auth/Pay
 DB changes         None      Index       Migration
 Test coverage      >80%      50-80%      <50%
 AI generated       <30%      30-70%      >70%
                     |          |           |
                     v          v           v
                +---------+-----------+-------------+
                |  Fast   | Standard  |   Senior    |
                |  track  |   human   |   review    |
                +---------+-----------+-------------+

Let the robots decide who needs to look at what.
Score PRs automatically. How many files changed? Does it touch core systems like auth, payments, or data models? Any database schema changes? What’s the test coverage? What percentage was AI-generated? (Higher AI percentage = higher risk. Sorry, Claude.)
Low-risk changes—small diffs, high coverage, no core system touches—can be fast-tracked. AI review might be sufficient. Medium-risk changes get standard human review. High-risk changes—anything touching auth, payments, data, or with low test coverage—require senior engineer eyes. Maybe architecture review.
The routing doesn’t require human judgment. It’s automated. Human attention goes where human attention is actually needed.
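A scoring function like this does not need to be fancy. Here is a minimal Python sketch of the routing table above; the weights and thresholds are illustrative, so tune them against your own incident history.

# risk_router.py -- a minimal sketch of the routing table above.
# Weights and thresholds are illustrative, not a benchmark.
from dataclasses import dataclass

@dataclass
class PRStats:
    files_changed: int
    touches_core: bool          # auth, payments, data models
    has_db_migration: bool
    test_coverage: float        # 0.0 - 1.0
    ai_generated_ratio: float   # 0.0 - 1.0

def route(pr: PRStats) -> str:
    score = 0
    score += 2 if pr.files_changed > 10 else (1 if pr.files_changed > 3 else 0)
    score += 2 if pr.touches_core else 0
    score += 2 if pr.has_db_migration else 0
    score += 2 if pr.test_coverage < 0.5 else (1 if pr.test_coverage < 0.8 else 0)
    score += 2 if pr.ai_generated_ratio > 0.7 else (1 if pr.ai_generated_ratio > 0.3 else 0)

    if score >= 5:
        return "senior-review"    # senior engineer, maybe architecture review
    if score >= 2:
        return "standard-review"  # normal human review
    return "fast-track"           # AI review may be sufficient

if __name__ == "__main__":
    css_tweak = PRStats(2, False, False, 0.9, 0.2)
    auth_change = PRStats(12, True, True, 0.4, 0.8)
    print(route(css_tweak))    # fast-track
    print(route(auth_change))  # senior-review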
Microsoft and Google report that roughly 25% of their code is now AI-generated. You can’t have senior engineers reviewing 25% of all code. You need systems that route attention intelligently, or those senior engineers will quit. (They have options.)
Ship Fast, Validate in Production
Here’s the uncomfortable truth I had to accept: you can’t catch everything in review. You just can’t. (My CEO tries really hard though. Yes, he does code reviews. No, he shouldn’t.) No matter how many AI agents you stack, no matter how thorough your senior reviewers are, some bugs only show up in production. With real users. Doing real user things.
+---------------------------------------------------------------+
|                      PROGRESSIVE ROLLOUT                       |
+---------------------------------------------------------------+

     DEPLOY                  VALIDATE                  RELEASE
 ---------------------------------------------------------------
 +--------+   +----------+   +--------+   +--------+   +--------+
 | Merged |-->| Dogfood  |-->| Canary |-->| Expand |-->|  Full  |
 |  (0%)  |   | (4-8wks) |   |  (1%)  |   |(5-25%) |   | (100%) |
 +--------+   +----------+   +--------+   +--------+   +--------+
     |             |             |            |            |
  Feature       Internal      Monitor      Gradual      General
   flag          users        errors      expansion   Availability
  disabled      exercise      closely
                (suffer)

 <-------------- INSTANT ROLLBACK (flip flag) --------------->

Deploy immediately, release gradually. Your coworkers are the first guinea pigs.
Feature flags change everything. Instead of waiting until code is “ready” (a subjective judgment that blocks flow forever), you deploy immediately with the feature disabled. No user sees it. No risk to production. But the code is there, exercising your deployment pipeline, living in your production environment.
Then you dogfood. Your own team uses the feature for four to eight weeks. Real usage. Real edge cases. The kind of bugs that no test suite catches because they require actual humans doing actual human things to trigger them. (Like clicking the button three times really fast while on a slow connection. Users love doing that.)
Then you canary. One percent of users. Watch your metrics. Errors? Roll back instantly—no code deployment needed, just flip the flag. Everything stable? Expand to five percent. Twenty-five. Full rollout. Users are your best testers anyway. They’ll find bugs your QA team couldn’t dream of, usually within the first five minutes, usually on a Friday.
This pattern—deploy immediately, release gradually—decouples shipping code from exposing users to bugs. You’re not blocking the PR queue waiting for perfect confidence. You’re building confidence incrementally through controlled exposure.
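The mechanics are simpler than they sound. Here is a minimal sketch of a deterministic percentage rollout; most teams lean on a feature-flag service for this, but the core trick is the same: hash each user into a stable bucket so expanding from 1% to 25% only ever adds users. The flag and user names are made up for illustration.

# rollout.py -- a minimal sketch of a deterministic percentage rollout.
# The same user always lands in the same bucket, so raising the percentage
# from 1% to 5% to 25% only ever adds users, never flip-flops them.
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0  # 0.00 - 99.99
    return bucket < rollout_percent

if __name__ == "__main__":
    # Canary at 1%, then expand by raising the percentage -- no redeploy needed.
    for pct in (0, 1, 5, 25, 100):
        enabled = sum(is_enabled("new-checkout", f"user-{i}", pct) for i in range(10_000))
        print(f"{pct:>3}% rollout -> {enabled} of 10,000 users see the feature")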
When Microsoft built Windows NT, 200 developers dogfooded daily builds. When someone’s code broke the build, they felt the consequences immediately. That feedback loop created better code than any review process could.
The best AI-assisted teams have rebuilt this loop. They deploy continuously but release strategically. Failure rates drop 68% with AI-powered progressive delivery. Not bad.
What We’re Still Good For
Here’s what AI can’t do. Not yet. Maybe not ever. (Famous last words, but I’ll risk it.)
Architecture decisions require understanding trade-offs across time horizons AI doesn’t grasp. Will this design scale to 10x users? Will this abstraction make sense when requirements inevitably evolve? Is this the right coupling between systems? These questions need judgment about the future, about organizational dynamics, about constraints that aren’t in the code. AI can generate five different architectures. It can’t tell you which one you’ll regret in two years. (I wrote more about this in Architecture for the Age of AI.)
UX polish and accessibility need human judgment. Research found AI passes WCAG accessibility standards only 33% of the time. The details that make software feel good—the micro-interactions, the error messages that actually help instead of making users want to throw their laptop, the edge cases where design guidance doesn’t exist—these require humans who use software and understand how it feels.
Mentorship matters more than ever. When juniors can generate code without understanding it, the risk of skill atrophy becomes acute. “Over-reliance on AI poses serious risks,” warns CIO—“junior developers may see their core skills and engineering intuition atrophy.” The developers who understand the why behind the code need to actively teach it. (Understanding why AI coding agents have limitations helps explain why human mentorship remains irreplaceable.)
This is where human review time should go. Not catching semicolons. Not enforcing import order. Not arguing about whether that variable name is descriptive enough. Let the AI agents handle that. (I’ll admit it—I sometimes enjoy a good nitpick. Especially on junior PRs. It’s a character flaw. I’m working on it.) Humans should ask: Is this the right architecture? Does this make sense for where we’re going? Does this junior developer understand what they just shipped?
The Art of the Follow-Up Ticket
When small bugs appear in review, the instinct is to block the PR until it’s fixed.
Resist. (I’m serious.)
Create a follow-up ticket. Merge the PR. Keep flow moving.
This isn’t about shipping broken code. It’s about recognizing that perfection is the enemy of progress. The code is already better than what’s in production. The bug is minor. Blocking the entire PR—forcing context switches, creating review debt, watching the queue grow while everyone argues about edge cases—costs more than the bug.
The mindset shift: iteration is usage. Shipping gives you more information about real problems than review ever will. The bug you fix after merge might not even be the bug users notice. Better to learn quickly what actually matters than to polish forever in the dark.
MIT Sloan research says smart teams reserve 15% of sprint capacity for debt remediation. The follow-up tickets get worked. They just don’t block flow.
(If this feels scary, it’s because we’ve been trained to treat merging as final. It’s not. That’s what feature flags are for.)
Putting It All Together
Here’s how it fits together:
+---------------------------------------------------------------+
|                  AI-ASSISTED DEVELOPMENT FLOW                  |
+---------------------------------------------------------------+

 1. GENERATION
    Human Intent --> AI Code --> Auto-Split Stack (<=400 LoC)

 2. AUTOMATED REVIEW
    Security --> Performance --> Architecture --> UI/UX Agents
    + Risk Scoring + Best Practice Checks

 3. HUMAN REVIEW (Risk-Based)
    - Low Risk  --> Fast-track (AI may suffice)
    - Medium    --> Standard human review
    - High Risk --> Senior + Architecture review
    While waiting: docs, diagrams, tooling (don't just sit there)

 4. MERGE & DEPLOY
    Stack-Aware Queue --> CI --> Feature Flag (disabled)

 5. VALIDATION
    Dogfood (4-8 wks) --> Canary (1% -> 25%) --> Full Release

 6. FEEDBACK LOOP
    - Bugs  --> Follow-up tickets (don't block)
    - Learn --> Update prompts & rules
    - Debt  --> 15% capacity reserved
The whole flow. Print it out. Tape it to your monitor. Ignore 80% of it like everyone else does with process diagrams.
Human specifies intent. AI generates code. The change auto-splits into a stack of small PRs, none exceeding 400 lines. (The AI handles this. You’re welcome.)
Each PR runs through layered AI review. Security. Performance. Architecture. UI/UX. Automated best practices. Risk scoring. The obvious issues get caught before any human wastes time on them.
Based on risk score, the PR routes to the right level of human review. Low-risk changes might not need human eyes at all. High-risk changes get senior attention.
While waiting for review, developers don’t sit idle. They pivot to documentation, architecture diagrams, developer tooling, test infrastructure—tasks where AI excels and review burden is lower. Never be blocked. (Easier said than done, I know.)
Once approved, the PR enters a stack-aware merge queue. CI runs in parallel. The code deploys behind a feature flag, disabled.
Dogfooding begins. Your team suffers through the feature for weeks. Issues surface. Follow-up tickets capture them. The flow continues.
Canary rollout starts. One percent. Five percent. Twenty-five. Full GA.
Every bug teaches the system. AI prompts get updated. Review rules get refined. The system compounds its learning over time.
Validation Is the New Velocity
The teams winning with AI aren’t measuring lines of code generated. They’re measuring time from idea to production. Time from PR opened to merged. Bugs per feature shipped. Customer-facing incidents per quarter.
They’ve accepted that AI makes generation trivially easy. Any developer can prompt their way to thousands of lines per day. But that code is “almost right, but not quite.” It accumulates technical debt at 8x the normal rate. It fails silently instead of crashing loudly.
The new engineering excellence isn’t about writing faster. It’s about building systems that catch the problems AI creates. Stacked PRs that keep changes small. Multi-agent review that catches domain-specific issues. Risk scoring that routes attention where it’s needed. Feature flags that enable safe rollout. Dogfooding that catches real-world failures. Follow-up tickets that keep flow moving. Human focus on what AI genuinely can’t do.
The question isn’t how much code you can generate. It’s how fast you can ship code that actually works.
Validation is the new velocity.
References
- Graphite. “Introducing Graphite Agent and Pricing.” graphite.com
- CodeRabbit. “State of AI vs Human Code Generation Report.” coderabbit.ai
- METR. “Early 2025 AI Experienced OS Dev Study.” metr.org
- Stack Overflow. “2025 Developer Survey - AI.” survey.stackoverflow.co
- LeadDev. “Writing Code Was Never the Bottleneck.” leaddev.com
- InfoQ/GitClear. “AI Code Technical Debt.” infoq.com
- Nielsen Norman Group. “AI Design Tools Not Ready.” nngroup.com
- CIO. “AI Coding Assistants: Wave Goodbye to Junior Developers.” cio.com
- MIT Sloan Management Review. “How to Manage Tech Debt in the AI Era.” sloanreview.mit.edu