Modern Code Review


The Evolution of Code Review

To understand why modern software teams review code, we must first trace the history of the practice.

The First Wave: The Era of Formal Inspections

Code review was not always the seamless, online, asynchronous process it is today. In 1976, IBM researcher Michael Fagan formalized a rigorous, highly structured process known as Fagan inspections or Formal Inspections (Fagan 1976).

During the 1970s and 1980s, testing software was incredibly expensive. To prevent bugs from making it to production, Fagan devised a methodology that operated much like a formal court proceeding. A typical formal inspection required printing out physical copies of the source code and gathering three to six developers in a conference room. Participants were assigned strict, defined roles:

  • The Moderator managed the meeting and controlled the pace.
  • The Reader narrated the code line-by-line, explaining the logic so the original author could hear their own code interpreted by a third party.
  • The Reviewers meticulously checked the logic against predefined checklists.

This method was highly effective for its primary goal: early defect detection. Studies showed that these rigorous inspections could catch a massive percentage of software flaws. However, formal inspections had a fatal flaw: they were excruciatingly slow. One study noted that up to 20% of the entire development interval was wasted simply trying to schedule these inspection meetings. As the software industry shifted toward agile development, continuous integration, and globally distributed teams, gathering five engineers in a room to read paper printouts became impossible to scale.

The Paradigm Shift: The Rise of Modern Code Review (MCR)

To adapt to the need for speed, the software industry abandoned the conference room and moved code review to the web. This marked the birth of Modern Code Review (MCR).

Modern Code Review is fundamentally different from formal inspections. It is defined by three core characteristics: it is informal, it is tool-based, and it is asynchronous (Bacchelli and Bird 2013; Rigby and Bird 2013). Instead of scheduling a meeting, a developer today finishes a unit of work and submits a pull request (or patch) to a code review tool like GitHub, Gerrit, or Microsoft’s CodeFlow. Reviewers are notified via email or a messaging app, and they examine the diff (the specific lines of code that were added or deleted) on their own time, leaving comments directly in the margins of the code.

The “Defect-Finding” Fallacy

If you walk into any software company today and ask a developer, “Why do you review code?”, most of them will give you a very simple, straightforward answer: “To find bugs early”.

It is a logical assumption. Software engineers write code, humans make mistakes, and therefore we need other humans to inspect that code to catch those mistakes before they reach the user. But in the modern software engineering landscape, this assumption is actually a profound misconception. To understand what teams are actually doing, we must dismantle what we call the “Defect-Finding” Fallacy.

Expectations vs. Empirical Reality

Because MCR evolved directly from formal inspections, management and developers carried over the exact same expectations: they believed they were still primarily hunting for bugs. Extensive surveys reveal that “finding defects” remains the number one cited motivation for conducting code reviews (Bacchelli and Bird 2013).

However, when software engineering researchers mined the databases of review tools across Microsoft, Google, and open-source projects, they uncovered a stark contradiction: only 14% to 25% of code review comments actually point out functional defects (Bacchelli and Bird 2013; Czerwonka et al. 2015; Beller et al. 2014). Furthermore, the bugs that are found are rarely deep architectural flaws; they are overwhelmingly minor, low-level logic errors (Bacchelli and Bird 2013).

If 75% to 85% of the time spent reviewing code isn’t fixing bugs, what exactly are software engineers doing? Research has identified that modern code review has evolved into a highly collaborative, socio-technical communication network focused on three non-functional categories:

1. Maintainability and Code Improvement Roughly 75% of the issues fixed during MCR are related to evolvability, readability, and maintainability (Beller et al. 2014; Mäntylä and Lassenius 2009). Reviewers spend the bulk of their time suggesting better coding practices, removing dead code, enforcing team style guidelines, and asking the author to improve documentation. Card-sort analyses of these maintainability comments reveal a consistent breakdown (Bacchelli and Bird 2013; Mäntylä and Lassenius 2009):

  • Comments, naming, and styles (~22% of all review comments) — requests to rename a variable, add a docstring, or fix a formatting violation.
  • Organization of code (~16%) — suggestions to extract a method, move a class, or restructure a module so its responsibility is clearer.
  • Alternative solutions for long-term maintenance (~9%) — proposals of an entirely different approach the author hadn’t considered, usually motivated by future flexibility rather than immediate correctness.

2. Knowledge Transfer and Mentorship Code review operates as a bidirectional educational tool. Junior developers learn best practices by having their code critiqued, while reviewers actively learn about new features and unfamiliar areas of the system by reading someone else’s code.

3. Shared Code Ownership and Team Awareness By requiring at least one other person to read and approve a change, teams ensure there are “backup developers” who understand the architecture. It acts as a forcing function to dilute rigid, individual ownership and binds the team together through a shared sense of collective responsibility.

Divergent Perspectives: Are Review Comments Actually Useful?

If only a small fraction of comments are defect-related, are the rest at least changing the code? Empirical answers are mixed. Re-examining the Bacchelli & Bird dataset, only about one third of all review comments are deemed useful by the original author (Bacchelli and Bird 2013). The remaining two thirds are dismissed as misunderstandings, bikeshedding, out-of-scope refactoring suggestions, or stylistic disagreements the author rejects.

This creates a tension that the field has not fully resolved. On one hand, the act of submitting code for review reliably improves quality through the Ego Effect (discussed below) even when individual comments are ignored. On the other hand, if the median comment is unactionable, the cost-effectiveness of large review rituals becomes harder to defend. High-performing teams respond by raising the signal-to-noise ratio of comments — automating style enforcement so humans can focus on substantive issues, and training reviewers to distinguish must-fix concerns from optional preferences (Google’s “unresolved” vs. “resolved” comment types, discussed later, are one such mechanism).

How Much Code Must a Reviewer Actually Understand?

A second nuance from Bacchelli & Bird’s dataset: different review outcomes demand vastly different depths of code understanding (Bacchelli and Bird 2013). Catching a real functional defect or proposing an alternative architectural solution requires a complete mental model of the change. By contrast, avoiding build breaks or tracking the rationale of a decision can be done with low or no understanding of the code itself — a glance at the commit message and the CI status is enough.

One plausible explanation for the comment distribution discussed earlier is that the easy review outcomes (style, documentation, formatting) have a much lower cognitive entry price than defect detection, so they appear more often even when reviewers care equally about deeper concerns. This is also a strong argument for automating those easy outcomes through linters and static analysis: doing so frees reviewers’ scarce deep-understanding budget for the outcomes only humans can deliver.

Pause and recall — without scrolling back: What fraction of MCR comments point out functional defects? What three sub-categories make up the long-term-maintenance majority of the rest? What fraction of all comments does the original author judge useful? If any answer doesn’t come quickly, that’s exactly the signal that re-reading is needed before moving on.

Cognitive Factors

Achieving any of the goals of MCR requires a reviewer to accomplish one monumental task: actually understanding the code they are reading. The human brain has strict biological limits regarding how much abstract logic it can hold in its working memory (Letovsky 1987). When software teams ignore these limits, the code review process breaks down entirely.

The Brain on Code: Letovsky and the CRCM

In 1987, Stanley Letovsky proposed a foundational model suggesting that programmers act as “knowledge-based understanders”, using an assimilation process to combine raw code with their existing knowledge base to construct a mental model (Letovsky 1987).

Recent studies extended this specifically for MCR, creating the Code Review Comprehension Model (CRCM) (Gonçalves et al. 2025). A reviewer must simultaneously hold a mental model of the existing software system, the proposed changes, and the ideal solution. Because this comparative comprehension is incredibly taxing, reviewers use opportunistic strategies instead of reading top-to-bottom (Gonçalves et al. 2025):

  1. Linear Reading: Used mostly for very small changes (under 175 lines). The reviewer reads from the first changed file to the last.
  2. Difficulty-Based Reading: Reviewers prioritize. Some use an easy-first approach (skimming and approving documentation/renames to reduce cognitive load), while others use a core-based approach (searching for the core change and tracing data flow outward).
  3. Chunking: For massive PRs, reviewers break the code down into logical “chunks”, reviewing commit-by-commit or looking exclusively at automated tests first to understand intent.

The Quantitative Limits of Human Attention

Empirical studies across open-source projects and industry giants like Microsoft and Cisco have identified rigid numerical limits to human code comprehension (Cohen et al. 2006; Bacchelli and Bird 2013; Sadowski et al. 2018).

The 400-Line Rule

A reviewer’s effectiveness drops precipitously once a pull request exceeds 200 to 400 lines of code (LOC) (Cohen et al. 2006; Shah 2026). When hit with a massive PR (a “code bomb”), reviewers are overwhelmed. In a study of 212,687 PRs across 82 open-source projects, researchers found that 66% to 75% of all defects are detected within PRs that are between 200 and 400 LOC (Mariotto et al. 2025). Beyond this threshold, defect discovery plummets.

The 60-Minute Clock

Review sessions should never exceed 60 to 90 minutes (Cohen et al. 2006; Blakely and Boles 1991). After roughly an hour of staring at a diff, the reviewer experiences cognitive fatigue and defect discovery drops to near zero (Dunsmore et al. 2000).

The Speed Limit

Combining these limits dictates that developers should review code at a rate of 200 to 500 lines of code per hour (Cohen et al. 2006). Reviewing faster than this causes the reviewer to miss architectural details (Kemerer and Paulk 2009).

The Scarcity of Reviewer Attention

These per-session limits compound into a daily attention budget that is much smaller than most teams realize. An empirical study of Microsoft’s CodeFlow tool compared the time reviewers had the application open against the time they were actively interacting with it (Czerwonka et al. 2015). The result was striking: although the review tool stayed open on a developer’s screen for an average of 5 to 6 hours per workday, the actual active interaction time — typing, clicking, navigating diffs — added up to only about 30 minutes per developer per day (Czerwonka et al. 2015).

The remaining hours were spent with the tool in the background while the developer worked on their own code, attended meetings, or simply context-switched. The implication is sobering: each individual review must fit into a tiny daily slice of focused attention. Teams that flood their reviewers with three- and four-hundred-line PRs are not getting six hours of analysis per reviewer; they are competing for half an hour. This is the empirical foundation behind the bystander effect documented in larger review groups: adding a fourth or fifth reviewer does not multiply scrutiny — it disperses the already-tiny attention budget across more people, each of whom assumes someone else will read carefully (Sadowski et al. 2018; Rigby and Bird 2013). Microsoft’s empirical sweet spot is two reviewers; Google’s is one, with strict ownership and readability gates compensating for the smaller crowd.

Divergent Perspectives: Is LOC the Only Metric?

Some researchers argue that measuring Lines of Code is too blunt. A 400-line change consisting entirely of a well-documented class interface requires very little effort to review compared to a 50-line patch altering a complex parallel-processing algorithm (Cohen et al. 2006). Additionally, a rigorous experiment by Baum et al. could not reliably conclude that the order in which code changes are presented to a reviewer influences review efficiency, challenging some cognitive load hypotheses.

Engineering Around the Brain: Stacking

To build massive features without exceeding cognitive limits, high-performing teams utilize Stacked Pull Requests (Greiler 2020). Instead of submitting one monolithic feature, developers decompose the work into small, atomic, dependent units (e.g., PR 1 for database tables, PR 2 for API logic, PR 3 for UI). This perfectly aligns with cognitive dynamics, keeping every PR under the 400-line limit and allowing reviewers to process them in optimal 30-to-60-minute sessions.

Socio-Technical Factors

Because software is a virtual product, critiquing code is a direct evaluation of a developer’s thought process, making it an inherently social and emotional event.

The Accountability Shift: From “Me” to “We”

The simple existence of a code review policy alters behavior through the “Ego Effect”. Knowing peers will scrutinize their work acts as an intrinsic motivator, driven by personal standards, professional integrity, pride, and reputation maintenance (Cohen et al. 2006).

During the review itself, accountability shifts from the individual to the collective. Once a reviewer approves a change, they become equally responsible for it, shifting the language from “my code” to “our system” (Alami et al. 2025).

The Emotional Rollercoaster: Coping with Critique

Receiving critical feedback triggers strong emotional responses. Developers must engage in emotional self-regulation using several coping strategies (Alami et al. 2025):

  • Reframing: Reinterpreting the intent of the feedback and decoupling personal identity from the code (“This isn’t an attack; it’s just a mistake”).
  • Dialogic Regulation: Initiating direct, offline conversations to clarify intent and shift back to shared problem-solving.
  • Defensiveness: Advocating for the original code to self-protect, which carries a high risk of escalating conflict.
  • Avoidance: Deliberately choosing not to invite overly “picky” reviewers to limit exposure to stress.

Conflict and the “Bikeshedding” Anti-Pattern

Bikeshedding (nitpicking) occurs when reviewers obsess over trivial, subjective details like formatting while overlooking serious flaws. High-performing teams actively suppress this by implementing automated linters and static analysis tools to enforce style guidelines automatically, preferring to be “reprimanded by a robot”.

Tone is frequently lost in text-based communication; over 66% of non-technical emails in certain open-source projects contained uncivil features. To counteract this, modern teams explicitly train for communication, using questioning over dictating, and occasionally adopting an “Emoji Code” to convey friendly intent.

Bias and the Limits of Anonymity

The socio-technical fabric is susceptible to human biases regarding race, gender, and seniority. For example, when women use gender-identifiable names and profile pictures on open-source platforms like GitHub, their pull request acceptance rates drop compared to peers with gender-neutral profiles (Terrell et al. 2017).

To combat this, organizations have experimented with Anonymous Author Code Review. A large-scale field experiment at Google tested this by building a browser extension that hid the author’s identity and avatar inside their internal tool. Across more than 5,000 code reviews, reviewers correctly guessed the author’s identity in 77% of non-readability reviews (Murphy-Hill et al. 2022). They used contextual clues—such as specific ownership boundaries, programming style, or prior offline conversations—to deduce who wrote the code. While anonymization did not slow down review speed and reduced the focus on power dynamics, “guessability” proved to be an unavoidable reality of highly collaborative engineering (Murphy-Hill et al. 2022).

Writing Reviewable Code

So far we have examined what reviewers do (mostly maintainability comments, rarely deep defect hunting), what slows them down (working-memory limits, scarce daily attention), and the social dynamics that surround the activity. Each of these framings places the burden on the reviewer. But code review is a two-sided contract: a reviewer can only be as effective as the code permits. Authors who design their code to minimize cognitive load, make assumptions explicit, and isolate change hand their reviewer the same kind of leverage a well-written paper hands a peer reviewer.

This section covers five authoring practices, each one targeting a specific cognitive lever the reviewer struggles against:

  1. Design by Contractmake assumptions explicit, so the reviewer reads a checkable specification instead of guessing intent from variable names.
  2. Assertionsmake assumptions executable, so violations fail at the site of the bug rather than three subsystems away.
  3. Guard clausesflatten control flow, so the reviewer holds one path in working memory at a time, not four.
  4. Chunking through named abstractionscompress working-memory load, so the reviewer can move past a verified block as a single concept.
  5. The Boy Scout Ruleprevent quality drift, so each commit pays down debt instead of accumulating it.

These are deliberately narrow in scope — broader treatments of code comprehension, refactoring, and code beacons live in their own chapters.

Design by Contract: Pre- and Post-Conditions

Originally introduced by Bertrand Meyer as the unifying principle behind the Eiffel language, Design by Contract (DbC) treats every function, method, or module as a formal agreement between the caller and the implementation (Meyer 1988):

  • A pre-condition documents what the function assumes about its inputs and the surrounding state. The caller is responsible for satisfying it.
  • A post-condition documents what the function guarantees about its return value and any state changes. The implementation is responsible for delivering it.
  • An invariant documents a property that must hold before and after every public operation of an object.

Together these form the visible contract of the module. Clients reason about behavior using only the contract; everything else is implementation detail that can change freely. A useful analogy: the contract is what the caller sees and depends on; the implementation is what the caller is deliberately prevented from depending on, so it can evolve freely.

For reviewers, explicit contracts are transformative. Reading an unannotated function, a reviewer must mentally reconstruct what the author meant the function to accept and produce — often from variable names alone. With pre- and post-conditions written down, that ambiguity collapses into a checkable specification. The reviewer can now ask three concrete questions instead of one fuzzy one: (1) Are the pre-conditions reasonable for every caller? (2) Does the implementation actually deliver the post-conditions on every path? (3) Are there edge cases where neither is true? These are precisely the questions empirical studies identify as the most effective for catching real defects (Bacchelli and Bird 2013).

Failing Fast with Assertions

A contract is only as useful as its enforcement. Assertions turn pre-conditions, post-conditions, and invariants from documentation into executable checks that fail loudly the moment an assumption is violated. They sit inside the function — close to the code they describe — and disappear from production builds when compiled out, so they cost nothing at runtime in release mode.

def apply_discount(price: int, discount: float) -> float:
    assert price >= 0, "Invalid price"
    assert 0 <= discount <= 100, "Invalid discount percentage"
    return price * (1 - discount / 100)

Run with python -O my_program.py to strip assertions in production.

#include <cassert>

double applyDiscount(int price, double discount) {
    assert(price >= 0);
    assert(discount >= 0 && discount <= 100);
    return price * (1 - discount / 100);
}

Compile with g++ -DNDEBUG main.cpp -o my_program to strip assertions in production.

Assertions follow the fail-fast principle: a bug that violates an assumption surfaces immediately, at the site of the violation, with a stack trace pointing at the broken contract — instead of silently corrupting state and exploding three subsystems away. For the reviewer, every assertion is also a beacon (see Code Beacons) that makes the author’s intent inspectable without having to trace the surrounding logic.

A note on when not to use assertions: assertions express programmer-error invariants — “this can never happen if my code is correct.” They are not the right tool for user-error or runtime conditions — invalid configuration, missing files, malformed network responses — which can absolutely happen and need graceful handling, not a stripped-out crash. The next section on guard clauses covers the latter case; the two patterns coexist for different purposes.

Guard Clauses: Flattening Nested Conditionals

A second cognitive lever the author controls is nesting depth. Controlled experiments show that perceived readability drops sharply as nesting deepens (Johnson et al. 2019), and earlier complexity-metric research established that branching depth correlates with defect density (McCabe 1976; Campbell 2017). Human working memory has to track every open conditional simultaneously, so a function nested four levels deep is roughly four times as expensive to hold in mind as the same logic flattened.

The cheapest refactoring against this is the guard clause: handle each invalid or edge case at the top of the function, return early, and let the “normal” path live at the function’s base indentation.

# Before — the happy path is buried four levels deep.
def apply_discount(price: int, discount_percent: float) -> float | None:
    if price >= 0:
        if 0 <= discount_percent <= 100:
            discount_amount = price * (discount_percent / 100)
            final_price = price - discount_amount
            return final_price
        else:
            logger.error(f"Invalid discount: {discount_percent}")
            return None
    else:
        logger.error(f"Invalid price: {price}")
        return None
# After — guard clauses peel off the edge cases. The happy path is flat.
def apply_discount(price: int, discount_percent: float) -> float | None:
    if price < 0:
        logger.error(f"Invalid price: {price}")
        return None
    if not 0 <= discount_percent <= 100:
        logger.error(f"Invalid discount: {discount_percent}")
        return None

    discount_amount = price * (discount_percent / 100)
    return price - discount_amount

The two versions are behaviorally identical, but the second hands the reviewer two cheap, self-contained checks at the top and a single linear computation at the bottom. The reviewer never has to mentally page-fault out of the happy path to remember which else branch they are in.

Chunking Through Meaningful Abstractions

Working memory holds roughly four chunks of information at once (Gobet and Clarkson 2004). A function that fits on one screen is one chunk; a function that scrolls is many. Authors give reviewers the gift of chunking by extracting named sub-procedures whose name lets the reviewer move past them without inspecting their body.

Compare two implementations of the same invoice-generation logic. The inline annotations on Version A show how each block maps to a named helper in Version B:

# Version A — every step inlined. The reviewer must hold all of it.
def process_order_and_generate_invoice(order_data, customer_info, pricing_rules):
    # --- input validation (becomes _validate_order_data) ---
    if not order_data or not customer_info:
        raise ValueError("Missing order or customer data.")
    if 'items' not in order_data or not isinstance(order_data['items'], list):
        raise ValueError("Order must contain a list of items.")

    # --- subtotal with bulk discount (becomes _calculate_subtotal) ---
    subtotal = 0
    for item in order_data['items']:
        base_price = pricing_rules.get(item['product_id'], 0)
        subtotal += base_price * item['quantity']
        if item['quantity'] > 10:
            subtotal -= (base_price * item['quantity']) * 0.05

    # --- tax with location and exemption rules (becomes _calculate_tax) ---
    tax_rate = 0.0825
    if customer_info.get('is_tax_exempt', False):
        tax_amount = 0
    else:
        if customer_info.get('location') == 'Metropolis':
            tax_rate = 0.10
        tax_amount = subtotal * tax_rate

    total_amount = subtotal + tax_amount
    # ...
# Version B — each block above becomes one named line.
def process_order_and_generate_invoice(order_data, customer_info, pricing_rules):
    _validate_order_data(order_data, customer_info)
    subtotal = _calculate_subtotal(order_data['items'], pricing_rules)
    tax_amount = _calculate_tax(subtotal, customer_info)
    total_amount = subtotal + tax_amount
    # ...

The chunks haven’t disappeared — they’re still real code that the reviewer can drill into when needed. Here’s one of them in isolation:

def _calculate_tax(subtotal: float, customer_info: dict) -> float:
    if customer_info.get('is_tax_exempt', False):
        return 0
    tax_rate = 0.10 if customer_info.get('location') == 'Metropolis' else 0.0825
    return subtotal * tax_rate

What changed between the two versions is the reviewer’s path. A reviewer who trusts _calculate_tax can verify the orchestration in seconds, then drill into one helper at a time. A reviewer who doesn’t trust it can do the same drill, but only for the one helper they care about — the others stay closed. The extraction creates what Ousterhout calls a deep module: a simple interface hiding meaningful complexity (Ousterhout 2021).

The practical rule of thumb: if a function does not fit on one screen, the reader will lose the context they had at the top. Extract methods aggressively, even for code used only once, so that each level of abstraction reads like a sentence rather than a paragraph.

The Boy Scout Rule and the Broken-Window Effect

The final authoring habit is the Boy/Girl Scout Rule popularized by Robert C. Martin: always leave the campground module cleaner than you found it (Martin 2008). Whenever you touch a file for a feature change, take the opportunity to remove a dead import, rename a misleading variable, or split a function that has grown past one screen. Each commit is a tiny refactoring on top of its functional change.

The empirical argument for this habit borrows a metaphor from the broken-windows theory in criminology. The original urban-policing application of that theory has been heavily critiqued, but the metaphor turned out to translate well to software: a recent empirical study of technical debt found that developers who modify a module already containing many code smells are significantly more likely to introduce additional smells in their own change (Levén et al. 2024). Technical debt compounds because each new author silently lowers their personal standards to match the surrounding mess; a clean module exerts the opposite pressure.

For reviewers, this evidence informs one of the hardest decisions in MCR: when should I push back on a cleanup that wasn’t strictly required? If the surrounding code is visibly degrading, accepting small, well-scoped cleanups is consistent with the Levén finding — each one is a broken window repaired before it spreads. If the cleanup would balloon the PR past the 400-line threshold or pull in unrelated concerns, the better move is to request a follow-up PR — preserving both the stacking discipline and the cleanup intent.

Reflection task — pick a real function before moving on. Open the file you most recently wrote or reviewed. (1) Does the function state a checkable pre-condition (assertion, type hint, or comment)? (2) Does it use guard clauses, or is the happy path buried inside nested conditionals? (3) Does it fit on one screen without scrolling? Write down your answer for each — the act of judging against the three criteria is what makes them stick. Whichever criterion you answered “no” to is the cheapest reviewability improvement you can make in your next commit.

Retrieval check — without scrolling up, answer: What is the difference between an assertion and a guard clause? What cognitive limit does chunking through named helpers respect? Which empirical finding underwrites the Boy Scout Rule? If any answer is fuzzy, return to that subsection before moving on — actively recalling material once outperforms re-reading it several times (Roediger and Karpicke 2006).

Code Review at Google

Imagine a software company where more than 25,000 developers submit over 20,000 source code changes every workday into a single monolithic repository (or monorepo) (Sadowski et al. 2018; Potvin and Levenberg 2016). To maintain order, Google enforces a mandatory, highly optimized code review process revolving around four key pillars: education, maintaining norms, gatekeeping, and accident prevention.

When Sadowski et al. interviewed Google engineers about the origin of this process, defect detection was conspicuously absent from the answers. The practice was introduced “to force developers to write code that other developers could understand” — readability first, defects later (Sadowski et al. 2018). As the authors summarize:

Expectations for code review at Google do not center around problem solving. Reviewing was introduced at Google to ensure code readability and maintainability. Today’s developers also perceive this educational aspect, in addition to maintaining norms, tracking history, gatekeeping, and accident prevention. Defect finding is welcomed but not the only focus. (Sadowski et al. 2018)

This is a deliberate inversion of the Bacchelli–Bird “expectation vs. reality” gap discussed earlier: Google never adopted the bug-hunting expectation in the first place.

The Twin Pillars: Ownership and Readability

Google enforces two highly unique concepts dictating who is allowed to approve code:

1. Ownership (Gatekeeping) Every directory in Google’s codebase has explicit “owners”. While anyone can propose a change, it cannot be merged unless an official owner of that specific directory reviews and approves it.

2. Readability (Maintaining Norms) Google has strict, mandatory coding styles for every language. “Readability” is an internal certification developers earn by consistently submitting high-quality code. If an author lacks Readability certification for a specific language, their code must be approved by a reviewer who has it (Sadowski et al. 2018).

The Tool and the Workflow: Enter “Critique”

Google manages this volume using an internal centralized web tool called Critique. The lifecycle of a proposed change (a Changelist or CL) is highly structured:

  1. Creating and Previewing: Critique automatically runs the code through Tricorder, which executes over 110 automated static analyzers to catch formatting errors and run tests before a human ever sees it.
  2. Mailing it Out: The author selects reviewers, aided by a recommendation algorithm.
  3. Commenting: Reviewers leave threaded comments, distinguishing between unresolved comments (mandatory fixes) and resolved comments (optional tips).
  4. Addressing Feedback: The author makes fixes and uploads a new snapshot for easy comparison.
  5. LGTM: Once all comments are addressed and Ownership/Readability requirements are met, the reviewer marks the change with LGTM (Looks Good To Me).

The Statistics: Small, Fast, and Focused

Despite strict rules, Google’s empirical data shows a remarkably fast process (Sadowski et al. 2018):

  • Size Matters: Over 35% of all CLs modify only a single file, and 10% modify just a single line of code. The median size is merely 24 lines.
  • The Power of One: More than 75% of code changes at Google have only one single reviewer.
  • Blink-and-You-Miss-It Speed: The median wait time for initial feedback is under an hour, and the median time to get a change completely approved is under 4 hours. Over 80% of all changes require at most one iteration of back-and-forth before approval.

Developing as a Code Reviewer

Effective code review is a learned skill, not a credential one acquires by joining a team. Industry experience at organizations with deep review cultures suggests that newly onboarded reviewers typically need several months — often the better part of a year — before their review throughput and defect-detection rate approach those of established team members. Google, for example, runs a multi-month Readability mentorship in each language before a new engineer is allowed to approve changes alone (Sadowski et al. 2018). The bottleneck is not tool fluency — modern review tools are simple to learn in a day. The bottleneck is the slow accumulation of two things that no tool can grant: system context (the modules, conventions, and historical decisions that make a change reasonable or alarming) and defect intuition (the trained eye for the kinds of mistakes that look plausible but are not).

Three habits accelerate this curve faster than passive exposure:

1. Develop “rigorous criteria” rather than impressions. Novice reviewers often approve a change because nothing in it jumps out as wrong. Expert reviewers approve because every part of the change has survived an explicit checklist: Are pre-conditions documented? Is each error path tested? Does the change preserve the module’s invariants under concurrent access? Writing your personal checklist down — and revising it after every escaped defect you encounter — is among the most actionable training practices reported in studies of high-performing review cultures (Cohen et al. 2006).

2. Train your “critical eye” for corner cases. Real defects rarely live in the happy path; they live in the cases the author did not think to write a test for. Classic input-domain testing teaches that defects cluster around boundary conditions (empty containers, zero, off-by-one, integer overflow), null or absent values, concurrent and ordering hazards, and partial-failure recovery paths (Beizer 1990). Mäntylä & Lassenius’s review-defect taxonomy is consistent with this: most caught defects fall into “evolvability,” “code organization,” and “functional” categories that frequently surface at exactly these boundaries (Mäntylä and Lassenius 2009). When you read a diff, pause at every branch and ask: what input could send execution down this path? What inputs are intended? What inputs would the author have hated to think about?

3. Use the contract as your worksheet. As argued in Writing Reviewable Code, explicit pre- and post-conditions transform a fuzzy “does this look right?” review into three answerable questions. Even when the author has not written the contract down, you can write it down in your head — or in a review comment — and then verify each clause against the implementation. This converts review from impression-driven scanning into specification-driven analysis, a practice that fits naturally inside Letovsky’s comprehension model and its modern code-review extension, the CRCM (Letovsky 1987; Gonçalves et al. 2025).

Retrieval check — close the page and answer in your own words: (1) Roughly how long does it typically take a new reviewer to reach team-average effectiveness, and what two things slow the curve? (2) What three habits speed it up? (3) Why is “specification-driven analysis” stronger than “does this look right?” If any of these are fuzzy, scroll back — and notice that the act of trying to answer is itself the strongest study move available, much stronger than re-reading (Roediger and Karpicke 2006).

Application task — schedule it now. On your next pull request: write the post-condition of your most complex changed function in plain English at the top of the PR description. On the next review you do: do the same exercise for the function you find hardest to understand, then compare your version with the author’s. Most disagreements in code review trace to a contract the two of you had silently disagreed about — surfacing it converts argument into specification.

The AI Paradigm Shift

For decades, the peer code review process served as the primary quality gate in software engineering. Built on the assumption that writing code is a slow, scarce, human endeavor, a reviewer could reasonably maintain cognitive focus over a colleague’s daily output. However, the advent of Large Language Models (LLMs) and autonomous AI coding agents has violently disrupted this assumption. We are entering an era where code is abundant, cheap, and generated at a velocity designed to outpace human reading limits.

This chapter explores the third wave of code review evolution: the integration of generative AI. We will examine how AI transitions from a simple tool to an autonomous agent, the surprising empirical realities regarding its impact on productivity, the acute security risks it introduces, and why human accountability remains irreplaceable.

From Static Analysis to Agentic Coding

The earliest forms of Automated Code Review (ACR) relied on rule-based static analysis tools (e.g., PMD, SonarQube). While effective at catching simple formatting errors, these tools were rigid, lacked contextual understanding, and generated high volumes of false positives.

The introduction of LLMs has catalyzed a profound paradigm shift. Modern AI review tools evaluate code semantically rather than just syntactically. The literature categorizes this new era of AI assistance into two distinct workflows:

  1. Vibe Coding: An intuitive, prompt-based, conversational workflow where a human developer remains strictly in the loop, guiding the AI step-by-step through ideation and experimentation.
  2. Agentic Coding: A highly autonomous paradigm where AI agents (e.g., Claude Code, SWE-agent, GitHub Copilot) plan, execute, test, and iterate on complex tasks with minimal human intervention, automatically packaging their work into Pull Requests (PRs).

Empirical evidence shows agentic tools are highly capable. In an industrial deployment at Atlassian, the RovoDev Code Reviewer analyzed over 1,900 repositories, automatically generating comments that led directly to code resolutions 38.7% of the time, while reducing the overall PR cycle time by 30.8% and decreasing human reviewer workload by 35.6% (Tantithamthavorn et al. 2026). Similarly, an analysis of 567 PRs generated autonomously by Claude Code across open-source projects revealed that 83.8% of these Agentic-PRs were ultimately accepted and merged by human maintainers, with nearly 55% merged as-is without any further modifications (Watanabe et al. 2025).

Divergent Perspectives: The Productivity Paradox

A dominant narrative in the software industry is that AI drastically accelerates development. However, rigorous empirical studies present a sharply Divergent Perspective, revealing a “productivity paradox” when dealing with complex, real-world systems.

While AI excels at generating boilerplate and tests, reviewing and integrating AI code is proving to be a massive cognitive bottleneck.

  • The 19% Slowdown: A 2025 randomized controlled trial (RCT) by METR evaluated experienced open-source developers working on real issues in their own repositories. Developers forecasted that using early-2025 frontier AI models (like Claude 3.7 Sonnet) would speed them up by 24%. The empirical reality? Developers using AI tools actually took 19% longer to complete their tasks (METR 2025).
  • The Tech Debt Trap: A separate 2025 study evaluating the adoption of the Cursor LLM agent found that while it caused a transient, short-term increase in development velocity, it simultaneously caused a significant, persistent increase in code complexity (41%) and static analysis warnings (30%) (He et al. 2025). Over time, this degradation in code quality acted as a major factor causing a long-term velocity slowdown.

Because agents frequently generate “over-mocked” tests or fail to grasp complex, project-specific invariants, human reviewers must expend significant mental effort debugging AI logic. Reviewing shifts from understanding a human peer’s rationale to auditing a machine’s probabilistic output.

The “Rubber Stamp” Risk and AI Hallucinations

As AI generates massive blocks of code, human reviewers are hit with unprecedented cognitive fatigue. This leads to the Rubber Stamp Effect: reviewers see a massive PR that passes automated linting and unit testing, assume it is valid, and grant an “LGTM” (Looks Good To Me) approval without actually reading the syntax.

Rubber stamping AI code alters a project’s risk profile because AI mistakes do not look like human mistakes. While human errors are often obvious logic gaps or syntax faults, LLMs hallucinate code that looks highly plausible and authoritative but is functionally incorrect or deeply insecure. When discussing the ability of peer review to catch functional defects, the software engineering community frequently refers to Linus’s Law: “Given enough eyeballs, all bugs are shallow” (Raymond 1999). This concept is often used to justify broad, broadcast-based open-source code reviews (like those historically done on the Linux Kernel mailing lists). Modern empirical research (like the findings in the blog post) actively challenges the absolute truth of Linus’s Law by showing that even with many “eyeballs”, architectural bugs are rarely caught in MCR.

Security Vulnerabilities in AI-Generated Code

Extensive literature reviews confirm that LLMs frequently introduce critical security vulnerabilities (Nong et al. 2024).

  • “Stupid Bugs” and Memory Leaks: LLMs are prone to generating naive single-line mistakes. They frequently mishandle memory, leading to null pointer dereferences (CWE-476), buffer overflows, and use-after-free vulnerabilities.
  • Data Poisoning: Because LLMs are trained on unverified public repositories (e.g., GitHub), they can internalize insecure patterns. Threat actors can execute data poisoning attacks by injecting malicious code snippets into training data, causing the LLM to autonomously suggest insecure encryption protocols or backdoored logic to developers.
  • Self-Repair Blind Spots: While advanced LLMs can sometimes fix up to 60% of insecure code written by other models, they exhibit “self-repair blind spots” and perform poorly when asked to detect and fix vulnerabilities in their own generated code.

The Social Disruption: Emotion and Accountability

The integration of AI disrupts the socio-technical fabric of code review. Code review is not just a technical gate; it is a space for mentorship, shared accountability, and social validation.

The Loss of Reciprocity: Accountability is a social contract. One cannot hold an LLM socially or morally accountable. When an LLM reviews code, the shared team accountability transitions strictly back to the individual developer (Alami et al. 2025). As one developer noted, “You cannot blame or hold the LLM accountable”.

Emotional Neutrality vs. Meaningfulness: AI drastically reduces the emotional taxation of code reviews. LLM feedback is consistently polite, objective, and neutral, which eliminates the defensive responses or “bikeshedding” conflict that occurs between humans. However, this emotional sterilization comes at a cost. Developers derive psychological meaningfulness, “joy”, and professional validation from having respected peers validate their code (Alami et al. 2025). Replacing peers with a “faceless chat box” strips the software engineering role of its relational warmth and identity-affirming properties.

The Future: From Syntax-Checking to Outcome-Verification

To safely harness AI without succumbing to the Rubber Stamp effect, the software engineering paradigm must evolve.

  1. The Human-in-the-Loop Imperative: The consensus across modern literature is that AI should be implemented as an AI-primed co-reviewer rather than a replacement. AI should handle the first-pass triage—formatting, basic bug detection, and linting—while human engineers retain authority over architectural context, business logic, and security validation.
  2. The Shift to Preview Environments: Because reading thousands of lines of AI-generated syntax is biologically impossible for a human reviewer to do accurately, the artifact of review must change. We are shifting from a syntax-first culture to an outcome-first culture (Signadot 2024). Reviewing AI-authored code requires spinning up ephemeral, isolated “backend preview environments” where reviewers can actively execute and validate the behavior of the code, rather than passively reading text files. As the industry moves forward, the new standard becomes: “If you cannot preview it, you cannot ship it”.

Practice This

Use the flashcards to retrieve the empirical limits and review vocabulary, then use the quiz to make review decisions about PR size, reviewer cognition, reviewable code, Google-scale workflow, and AI-generated changes.

Modern Code Review Flashcards

Formal inspections, modern asynchronous review, cognitive limits, socio-technical dynamics, reviewable code, Google-scale review, and AI-era review risks.

Difficulty: Basic

How did formal inspections differ from modern code review?

Difficulty: Basic

What is the defect-finding fallacy in Modern Code Review?

Difficulty: Basic

Name three major non-defect functions of code review.

Difficulty: Intermediate

What is the Code Review Comprehension Model (CRCM) asking a reviewer to hold in mind?

Difficulty: Advanced

What practical limits should shape review size and speed?

Difficulty: Intermediate

Why do stacked pull requests help review quality?

Difficulty: Intermediate

How do bikeshedding and linters relate?

Difficulty: Advanced

What are five authoring practices that make code more reviewable?

Difficulty: Advanced

How do assertions and guard clauses differ?

Difficulty: Intermediate

What are Google’s two approval gates in code review?

Difficulty: Advanced

Why can adding more reviewers reduce accountability?

Difficulty: Advanced

Why does AI-generated code shift review toward outcome verification?

Modern Code Review Quiz

Apply modern code-review research to PR size, reviewer cognition, socio-technical dynamics, reviewable-code practices, Google-scale workflow, and AI-era review.

Difficulty: Basic

Which statement best distinguishes formal inspections from Modern Code Review?

Correct Answer:
Difficulty: Intermediate

Your manager says, “If only 14% to 25% of review comments find functional defects, code review is mostly waste.” What is the strongest response?

Correct Answer:
Difficulty: Intermediate

A teammate submits a 1,200-line feature PR touching database migrations, backend rules, and UI. They say one large PR is easier because reviewers see the whole feature at once. What should you recommend?

Correct Answer:
Difficulty: Advanced

Which strategies fit the Code Review Comprehension Model for a non-trivial PR? Select all that apply.

Correct Answers:
Difficulty: Intermediate

An author wants to make a complex function more reviewable before opening a PR. Which changes are aligned with the chapter? Select all that apply.

Correct Answers:
Difficulty: Advanced

In apply_discount, a check rejects a user-entered discount of 150% and returns a validation error. Elsewhere, assert subtotal >= 0 documents an invariant after pricing. Which statement is most accurate?

Correct Answer:
Difficulty: Intermediate

In Google’s review process, why might one change require both an owner approval and a readability approval?

Correct Answer:
Difficulty: Advanced

An AI agent opens a 2,000-line PR that passes unit tests. The reviewer feels pressure to approve because the code looks polished and CI is green. What is the safest review posture?

Correct Answer:

References

  1. (Alami et al. 2025): Adam Alami, Nathan Cassee, Thiago Rocha Silva, Elda Paja, and Neil A. Ernst (2025) “Engagement in Code Review: Emotional, Behavioral, and Cognitive Dimensions in Peer vs. LLM Interactions,” ACM Transactions on Software Engineering and Methodology (TOSEM).
  2. (Bacchelli and Bird 2013): Alberto Bacchelli and Christian Bird (2013) “Expectations, outcomes, and challenges of modern code review,” International Conference on Software Engineering (ICSE). IEEE, pp. 712–721.
  3. (Beizer 1990): Boris Beizer (1990) Software Testing Techniques. 2nd ed. Van Nostrand Reinhold.
  4. (Beller et al. 2014): Moritz Beller, Alberto Bacchelli, Andy Zaidman, and Elmar Juergens (2014) “Modern code reviews in open-source projects: Which problems do they fix?,” Working Conference on Mining Software Repositories (MSR). ACM, pp. 202–211.
  5. (Blakely and Boles 1991): Frank W. Blakely and Mark E. Boles (1991) “A Case Study of Code Inspections,” Hewlett-Packard Journal, 42(4), pp. 58–63.
  6. (Campbell 2017): G. Ann Campbell (2017) Cognitive complexity–a new way of measuring understandability. SonarSource.
  7. (Cohen et al. 2006): Jason Cohen, Steven Teleki, and Eric Brown (2006) Best Kept Secrets of Peer Code Review. SmartBear Software.
  8. (Czerwonka et al. 2015): Jacek Czerwonka, Michaela Greiler, and Jack Tilford (2015) “Code Reviews Do Not Find Bugs: How the Current Code Review Best Practice Slows Us Down,” International Conference on Software Engineering (ICSE). IEEE, pp. 27–28.
  9. (Dunsmore et al. 2000): Alastair Dunsmore, Marc Roper, and Murray Wood (2000) “Object-Oriented Inspection in the Face of Delocalisation,” International Conference on Software Engineering (ICSE). ACM, pp. 467–476.
  10. (Fagan 1976): Michael E. Fagan (1976) “Design and code inspections to reduce errors in program development,” IBM Systems Journal, 15(3), pp. 182–211.
  11. (Gobet and Clarkson 2004): Fernand Gobet and Gary E. Clarkson (2004) “Chunks in expert memory: evidence for the magical number four ... or is it two?,” Memory.
  12. (Gonçalves et al. 2025): Pavlína Wurzel Gonçalves, Pooja Rani, Margaret-Anne Storey, Diomidis Spinellis, and Alberto Bacchelli (2025) “Code Review Comprehension: Reviewing Strategies Seen Through Code Comprehension Theories,” International Conference on Program Comprehension (ICPC).
  13. (Greiler 2020): Michaela Greiler (2020) “Stacked pull requests: make code reviews faster, easier, and more effective.”
  14. (He et al. 2025): Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu (2025) “Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects,” International Conference on Mining Software Repositories (MSR).
  15. (Johnson et al. 2019): John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif (2019) “An Empirical Study Assessing Source Code Readability in Comprehension,” International Conference on Software Maintenance and Evolution (ICSME). IEEE, pp. 513–523.
  16. (Kemerer and Paulk 2009): Chris F. Kemerer and Mark C. Paulk (2009) “The Impact of Design and Code Reviews on Software Quality: An Empirical Study Based on PSP Data,” IEEE Transactions on Software Engineering (TSE), 35(4), pp. 534–550.
  17. (Letovsky 1987): Stanley Letovsky (1987) “Cognitive processes in program comprehension,” Journal of Systems and Software, 7(4), pp. 325–339.
  18. (Levén et al. 2024): William Levén, Hampus Broman, Terese Besker, and Richard Torkar (2024) “The broken windows theory applies to technical debt,” Empirical Software Engineering, 29(4).
  19. (METR 2025): METR (2025) “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.”
  20. (Mäntylä and Lassenius 2009): Mika V. Mäntylä and Casper Lassenius (2009) “What Types of Defects Are Really Discovered in Code Reviews?,” IEEE Transactions on Software Engineering (TSE), 35(3), pp. 430–448.
  21. (Mariotto et al. 2025): Luca Mariotto, Christian Medeiros Adriano, René Eichhorn, Daniel Burgstahler, and Holger Giese (2025) “From Assessment to Enhancement of Pull Requests at Scale: Aligning Code Reviews with Developer Competencies Using Large Language Models,” International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 478–487.
  22. (Martin 2008): Robert C. Martin (2008) Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall.
  23. (McCabe 1976): Thomas J. McCabe (1976) “A complexity measure,” IEEE Transactions on Software Engineering (TSE), SE-2(4), pp. 308–320.
  24. (Meyer 1988): Bertrand Meyer (1988) Object-Oriented Software Construction. Prentice Hall.
  25. (Murphy-Hill et al. 2022): Emerson Murphy-Hill, Jillian Dicker, Margaret Morrow Hodges, Carolyn D. Egelman, Ciera Jaspan, Lan Cheng, Elizabeth Kammer, Ben Holtz, Matthew A. Jorde, Andrea Knight Dolan, and Collin Green (2022) “Engineering Impacts of Anonymous Author Code Review: A Field Experiment,” IEEE Transactions on Software Engineering (TSE), 48(7), pp. 2495–2509.
  26. (Nong et al. 2024): Yu Nong, Mohammed Aldeen, Long Cheng, Hongxin Hu, Feng Chen, and Haipeng Cai (2024) “From Vulnerabilities to Remediation: A Systematic Literature Review of LLMs in Code Security,” arXiv preprint arXiv:2412.15004.
  27. (Ousterhout 2021): John K. Ousterhout (2021) A Philosophy of Software Design. 2nd ed. Yaknyam Press.
  28. (Potvin and Levenberg 2016): Rachel Potvin and Josh Levenberg (2016) “Why Google Stores Billions of Lines of Code in a Single Repository,” Communications of the ACM, 59(7), pp. 78–87.
  29. (Raymond 1999): Eric S. Raymond (1999) “The Cathedral and the Bazaar,” Knowledge, Technology & Policy, 12(3), pp. 23–49.
  30. (Rigby and Bird 2013): Peter C. Rigby and Christian Bird (2013) “Convergent Contemporary Software Peer Review Practices,” Joint Meeting on Foundations of Software Engineering (ESEC/FSE). ACM, pp. 202–212.
  31. (Roediger and Karpicke 2006): Henry L. Roediger and Jeffrey D. Karpicke (2006) “Test-enhanced learning: Taking memory tests improves long-term retention,” Psychological Science, 17, pp. 249–255.
  32. (Sadowski et al. 2018): Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli (2018) “Modern Code Review: A Case Study at Google,” International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). ACM, pp. 181–190.
  33. (Shah 2026): Molisha Shah (2026) “Code Review Best Practices That Actually Scale.” Augment Code.
  34. (Signadot 2024): Signadot (2024) “Traditional Code Review Is Dead. What Comes Next?”
  35. (Tantithamthavorn et al. 2026): Chakkrit Tantithamthavorn, Andy Wong, Michael Gupta, and et al. (2026) “RovoDev Code Reviewer: A Large-Scale Online Evaluation of LLM-based Code Review Automation at Atlassian,” International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE/ACM.
  36. (Terrell et al. 2017): Josh Terrell, Andrew Kofink, Justin Middleton, Clarissa Rainear, Emerson Murphy-Hill, Chris Parnin, and Jon Stallings (2017) “Gender differences and bias in open source: Pull request acceptance of women versus men,” PeerJ Computer Science, 3, p. e111.
  37. (Watanabe et al. 2025): Miku Watanabe, Hao Li, Yutaro Kashiwa, Brittany Reid, Hajimu Iida, and Ahmed E. Hassan (2025) “On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub,” arXiv preprint arXiv:2509.14745.