Test Doubles Tutorial — Print View

1

The Test That Lied: A Test That Passes Today and Fails Tomorrow

🎯 Goal: See a test that passes today but is fundamentally broken, then carve out a seam where you can later substitute a controlled clock.

🧠 Skills you’ll gain: Recognize when a real collaborator (clock, HTTP, database) makes a test non-deterministic, and introduce a seam (a parameter the test can swap out) so future tests can stay deterministic.

📣 The kind of test that ships green and rots overnight. Imagine you’re on the QuestForge team. A daily-quest event is scheduled for April 28, 2026. A teammate writes a test on April 28 that asserts is_today_event_day("2026-04-28") returns True. Test passes. PR merges. Next day — without a single code change — the same test fails on CI. Why? Because the test depends on the wall clock, not on what the function should do. That hidden dependency is what test doubles exist to control.

🧭 What you already know — and what’s about to shift

From Testing Foundations you know how to write a strong oracle, choose partition + boundary inputs, and avoid peeking at private state. From TDD you know the Red-Green-Refactor rhythm. Every example so far has had one thing in common: the function under test was self-contained. Pass it inputs, observe the output, done.

Real code is rarely like that. Real functions talk to collaborators — clocks, network APIs, databases, payment gateways, email services. Each of those collaborators turns a deterministic test into a flaky test, a slow test, or — worst — a test that appears green but actually never exercised the behavior you cared about. This entire tutorial is about that problem.

📖 New vocabulary (visible glossary)

Term	Meaning
System Under Test (SUT)	The code being tested. Here: `is_today_event_day`.
Collaborator	Anything the SUT calls into. Here: `datetime.now()`.
Indirect input	A value the SUT receives from a collaborator (rather than from its caller). Here: today’s date from the clock.
Seam	A point where you can substitute a collaborator at test time without changing production behavior. We’re about to introduce one.
Dependency Injection	The technique: pass the collaborator in as a parameter instead of hard-coding it. Meszaros, Dependency Injection, p.678.

🌍 The same vocabulary in another language

These terms come from xUnit Test Patterns (Meszaros, 2007). They’re language-agnostic. JavaScript+Jest, Java+Mockito, C#+Moq, Ruby+RSpec — all use the same words for the same roles. What changes between languages is the syntax of how you express a stub or a mock. The role doesn’t change.

⚙️ Task — three small moves:

1. Read quest_service.py and test_quest_service.py. The test asserts that is_today_event_day("2026-04-28") returns True. Predict: will it pass today? Will it pass tomorrow? Why?
2. Run the test (▶ button). It passes today (April 28, 2026) 🎉, but that green dot is misleading. Tomorrow the same code with no changes will fail.
3. Refactor is_today_event_day to accept a clock parameter (default datetime.datetime) and call clock.now() in the body. The starter file uses from datetime import datetime, so switch it to import datetime (as the Step 2 version of quest_service.py does); otherwise the name datetime.datetime will not resolve. This creates the seam, but you don’t use it yet. Step 2 will show how a stub fills it.

flowchart LR
    subgraph before["BEFORE — no seam"]
        direction TB
        S1["is_today_event_day(date_str)"]:::sut
        S1 --> C1["datetime.now()<br/>📅 wall clock"]:::bad
    end
    subgraph after["AFTER — seam introduced"]
        direction TB
        S2["is_today_event_day(date_str, clock)"]:::sut
        S2 --> C2["clock.now()<br/>↑ caller decides<br/>what clock"]:::good
    end
    before --> after
    classDef sut fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef bad fill:#ffebee,stroke:#c62828,color:#b71c1c

💡 Pedagogical move: concept over syntax. Your code change is a single parameter (clock) and its default value. The point isn’t the syntax; the point is the idea: “this function used to depend on the wall clock; now its caller decides what ‘now’ means.” That idea is the foundation of every test double in this tutorial.

🧠 Why a default value? Why not just require the parameter?

Two reasons:

1. Backwards compatibility. Other code that calls is_today_event_day(date_str) keeps working, because the default clock=datetime.datetime reproduces the old behavior exactly.
2. Pedagogy. The original test (assert is_today_event_day("2026-04-28") is True) keeps passing without modification. The seam is available to any test that needs it; tests that don’t need it ignore it. That’s what a non-intrusive seam means.

In Java, which has no default parameter values, the equivalent is usually an overloaded constructor that falls back to the real clock. In JavaScript, an options object with a default. The concept (“let the caller swap this dependency”) is what carries over.
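As a minimal sketch of the backwards-compatibility point, the snippet below calls the refactored function both ways. The function body follows the refactor described in the task; `_NoonOnEventDay` is a hypothetical stand-in invented for this illustration, not one of the tutorial's helpers.

```python
import datetime


def is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    """Return True if the clock's current date equals event_date_str (YYYY-MM-DD)."""
    today = clock.now().strftime("%Y-%m-%d")
    return today == event_date_str


class _NoonOnEventDay:
    """Hypothetical stand-in clock, for illustration only."""

    def now(self) -> datetime.datetime:
        return datetime.datetime(2026, 4, 28, 12, 0)


# Old call style: no clock passed, so the default (the real wall clock) is used.
uses_wall_clock = is_today_event_day("2026-04-28")

# New call style: the caller decides what "now" means.
uses_seam = is_today_event_day("2026-04-28", clock=_NoonOnEventDay())
day_after = is_today_event_day("2026-04-29", clock=_NoonOnEventDay())
```

The first call still depends on the day it runs; the other two give the same answer on any day, which is the property later steps rely on.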

🔭 Coming in Step 2: You created a seam. Now we’ll actually use it — by passing in a FrozenClock object that always says it’s Tuesday. Same SUT, same test shape, but now fully deterministic.

Starter files

quest_service.py

"""QuestForge — daily quest event service."""
from datetime import datetime


def is_today_event_day(event_date_str: str) -> bool:
    """Return True if today is the event date.

    event_date_str is in YYYY-MM-DD format.

    ⚠️ This function calls datetime.now() directly. Tests that pin a
    specific date will pass on that date and fail on every other day.
    That hidden non-determinism is what we're about to fix.
    """
    today = datetime.now().strftime("%Y-%m-%d")
    return today == event_date_str

test_quest_service.py

"""Test for is_today_event_day.

⚠️ This test was written on 2026-04-28. It passes today.
Read it carefully — would it pass *tomorrow*? Why or why not?
"""
from quest_service import is_today_event_day


def test_april_28_is_event_day():
    # Today (2026-04-28) IS the event day, so this should return True.
    assert is_today_event_day("2026-04-28") is True

Why Test Doubles? — Knowledge Check

Min. score: 80%

1. Which of these collaborators are likely to make a test flaky (sometimes pass, sometimes fail without code changes)? (select all that apply)

datetime.now() — the system clock
Right. The clock changes every microsecond — any test that pins a specific date or time becomes a wall-clock dependency. That’s the canonical flaky-test recipe.
An HTTP call to a third-party weather API
Right. Third-party APIs go down, rate-limit, change their JSON shape, and time out. Every one of those failures is invisible from the test code itself.
A function that reverses a list in memory
In-memory list reversal is deterministic — same input, same output, every time. No flakiness. This is the kind of operation that can be tested with no double at all.
A query against a remote database
Right. Remote databases add latency, can be unavailable on CI, and their state can drift between test runs. Same flakiness risk as the HTTP call.

Flakiness comes from collaborators that the test cannot fully control: wall clocks, network calls, remote databases, file systems, randomness. Pure in-memory operations (list reversal, arithmetic) are deterministic and don’t need a double.
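The same seam idea applies to randomness, one of the uncontrollable collaborators listed above. The sketch below is illustrative only: `roll_loot` and its loot tiers are hypothetical and not part of QuestForge. It assumes the SUT accepts its random source as a parameter, so a test can pass a seeded `random.Random` instead of the shared module-level generator.

```python
import random


def roll_loot(rng=random) -> str:
    """Hypothetical SUT: picks a loot tier using whatever RNG the caller supplies."""
    return rng.choice(["common", "rare", "epic"])


# Two generators seeded identically produce identical sequences,
# so a test that injects one sees the same rolls on every run.
rng_a = random.Random(7)
rng_b = random.Random(7)
first_run = [roll_loot(rng_a) for _ in range(5)]
second_run = [roll_loot(rng_b) for _ in range(5)]
```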

2. What is an indirect input to the System Under Test?

Any input passed via keyword argument instead of positional
The keyword/positional distinction is just Python syntax. Indirect input is about where the value comes from — the caller’s arguments versus a collaborator the SUT calls into.
A value the SUT receives from a collaborator (rather than from the caller’s arguments)
Right. The SUT’s direct inputs are its parameters; indirect inputs are values it gets by calling a collaborator. datetime.now() is the canonical indirect input — the SUT pulls it in, no caller passed it. Controlling indirect inputs is exactly what stubs are for.
An argument that’s transformed before being used (e.g., str.lower())
Transformation doesn’t change whether an input is direct or indirect. str.lower() operates on a value the caller passed in — still direct. Indirect inputs are pulled from collaborators behind the public signature.
A global variable defined in another module
Module-level globals can act as indirect inputs (since they aren’t part of the call signature), but they aren’t the defining example. The textbook indirect input is a value pulled from a collaborator’s method call — like clock.now().

Indirect input = a value the SUT obtains from a collaborator rather than from its caller. clock.now(), db.fetch_user(id), api.get_weather() — each returns an indirect input that the SUT then uses. Stubs control these.
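To make the direct/indirect distinction concrete, here is a small sketch. `quest_banner` and `NoonClock` are hypothetical names used only for this illustration; the point is which value arrives through the signature and which is pulled from a collaborator.

```python
from datetime import datetime


class NoonClock:
    """Hypothetical stub clock: always reports 12:00 on 2026-04-28 (a Tuesday)."""

    def now(self) -> datetime:
        return datetime(2026, 4, 28, 12, 0)


def quest_banner(player_name: str, clock) -> str:
    """Hypothetical SUT with one direct input and one indirect input."""
    # Direct input: player_name arrives through the call signature.
    # Indirect input: the weekday is pulled from the clock collaborator.
    weekday = clock.now().strftime("%A")
    return f"{player_name}, your {weekday} quest awaits"


banner = quest_banner("Ayla", NoonClock())
```

Because the clock is stubbed, the test controls both inputs and can pin the full string.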

3. (Spaced review — Testing Foundations) A test asserts result is not None after refactoring the SUT to accept a clock parameter. Is that a strong oracle?

Yes — the test passes, so the refactor is verified
Tests passing only tells you that their assertions held. is not None holds for any non-None value, including ones that violate the spec. Same Liar-test family from Testing Foundations Step 3.
No — is not None is a weak oracle. It would pass for any non-None return, including False, [], or even a wrong date string. Pin the exact expected value with ==
Right. is not None is the canonical weak oracle — it accepts any non-None return. Pair it with the seam refactor and the test still verifies almost nothing. Pin the exact expected value with == (or is True/is False for booleans).
Yes — is not None is the recommended assertion for boolean-returning functions
There’s no special rule for boolean-returning functions. The strong oracle for booleans is is True / is False — is not None is strictly weaker (it accepts True, False, and every other non-None value).
It’s irrelevant — once you introduce a seam, oracle strength stops mattering
Oracle strength matters in every test, regardless of whether you’re using a real collaborator or a double. A strong oracle paired with a stub is what makes a test simultaneously deterministic and meaningful. Doubles don’t replace strong oracles; they enable them.

Oracle strength is independent of whether collaborators are doubled. is not None is the canonical weak oracle in any context. Even after you replace a real clock with a stub, the assertion still has to pin exactly what the spec mandates.
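A short sketch of why the weak oracle is dangerous even with a controlled clock. The function below is deliberately broken for illustration (day and month are swapped in the format string), and `FixedClock` is a hypothetical stub; neither is part of the tutorial's files.

```python
import datetime


class FixedClock:
    """Hypothetical stub clock pinned to 2026-04-28."""

    def now(self) -> datetime.datetime:
        return datetime.datetime(2026, 4, 28, 12, 0)


def buggy_is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    """Deliberately broken variant: the format string swaps day and month."""
    today = clock.now().strftime("%Y-%d-%m")  # bug: yields "2026-28-04"
    return today == event_date_str


result = buggy_is_today_event_day("2026-04-28", clock=FixedClock())

weak_oracle_passes = result is not None   # the bug slips through
strong_oracle_passes = result is True     # the bug is caught
```

The stub makes the run deterministic; only the strong oracle makes it meaningful.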

4. Why is dependency injection the right move before introducing any test doubles?

It’s a Python convention required by pytest
Pytest doesn’t require dependency injection. The technique pre-dates pytest by decades. The reason to do it is design, not framework compliance.
It creates the seam the doubles will use later. Without an injectable dependency, you can’t substitute a controlled version at test time
Right. Dependency Injection (Meszaros p.678) is the pattern that makes substitution possible. Once a collaborator is a parameter, any test can pass in a stub, spy, or mock. Without that seam, your only option is module-level patching — heavier and easier to get wrong.
It improves runtime performance
Performance is a non-issue at this scale. The benefit of DI is testability: the SUT becomes a unit you can isolate from its collaborators.
It’s only needed when you’re using unittest.mock — for hand-rolled stubs you can patch globals instead
Hand-rolled stubs use the same seam as unittest.mock doubles — both pass an object in at the parameter level (or replace it via patching). DI is universally useful regardless of which double-style you reach for.

Dependency Injection is the design move that makes test doubles possible. Pass the collaborator as a parameter; now any test can substitute a controlled version. (Same principle in Java with constructor injection, in C# with interfaces, in JavaScript with options-object patterns. The pattern is language-agnostic.)

2

Hand-Rolled Stub: A Clock That Always Says Tuesday

🎯 Goal: Replace a real collaborator (clock, HTTP API) with a hand-written Test Stub (a tiny class that returns canned values) and watch a flaky test become deterministic.

🧠 Skills you’ll gain: Recognize a Test Stub as a role (Meszaros, p.529), implement one in plain Python without any library, and pick canned values that drive the SUT down a specific behavior partition.

🧭 Bridge from Step 1. You created a seam in is_today_event_day by making the clock a parameter. DailyQuestService(clock, api) applies the same idea at the constructor: it accepts its collaborators as parameters. Now we’ll use the seam by passing in objects that always answer the same way. That’s a stub.

📖 The verbatim teaching sentence

“Mock is a tool class; stub, spy, and mock are test-design roles. Same in Python, JavaScript, and Java — the role is what matters; the class name is just syntax.”

Read that twice. Most confusion about test doubles in Python comes from conflating Python’s unittest.mock.Mock class with the conceptual Mock role. They’re not the same thing. We’ll dismantle that confusion in Step 4. For now, lock in this: the role is the question; the syntax is the answer.

📖 What is a Test Stub? (Meszaros, xUnit Test Patterns, p.529)

A Test Stub replaces a collaborator with a hand-controlled object that answers questions with canned values. It does not record what was asked of it; it does not enforce a contract. It just answers.

flowchart LR
    T["Test"]:::test --> S["DailyQuestService<br/>(SUT)"]:::sut
    S -->|"clock.now()"| C1["FrozenClock<br/>📅 STUB<br/><i>always returns<br/>April 28, noon</i>"]:::stub
    S -->|"api.fetch_quests(...)"| C2["StubQuestApiClient<br/>📋 STUB<br/><i>always returns<br/>the canned quest list</i>"]:::stub
    T -.->|"asserts on return value"| S
    classDef test fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef sut fill:#fff3e0,stroke:#e65100,color:#bf360c
    classDef stub fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20

Notice what the test asserts on: the SUT’s return value, not the stubs. That’s state verification — we observe the result of calling the SUT, not whether it talked to anyone. Stubs make state verification possible by removing the variability the real collaborators would have introduced.

⚙️ Task — two small moves:

Read the worked example test_tuesday_picks_tuesday_quest. The FrozenClock, the StubQuestApiClient, and the assertion are all written for you. Predict the test’s outcome before running. Then run it — green.
Fill in the assertion in test_thursday_picks_thursday_quest. The clock is frozen to a Thursday; the canned API quests include a Thursday entry. Compute the expected value from the spec — don’t run-and-paste. Replace "FILL_IN_HERE" with the exact title the SUT should return.

💡 The conceptual move. A stub answers questions — it doesn’t decide what those answers should be. You decide. Your decision drives the SUT down whichever behavior branch the test is meant to exercise. The canned quest list and the frozen weekday together form a precise input partition; the assertion locks in what the SUT does for that partition.
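The same two stubs can steer the SUT into its other partitions just by changing the canned data. The sketch below repeats FrozenClock, StubQuestApiClient, and DailyQuestService from the starter files so it runs on its own; the two test names are hypothetical additions, not part of the exercise.

```python
from datetime import datetime


class FrozenClock:
    """Stub clock, as in clock.py."""

    def __init__(self, fixed_dt: datetime):
        self._fixed_dt = fixed_dt

    def now(self) -> datetime:
        return self._fixed_dt


class StubQuestApiClient:
    """Stub API client, as in test_quest_service.py."""

    def __init__(self, canned_quests: list[dict]):
        self._canned = canned_quests

    def fetch_quests(self, user_id: str) -> list[dict]:
        return self._canned


class DailyQuestService:
    """Same logic as quest_service.py, repeated so the snippet runs alone."""

    def __init__(self, clock, api):
        self._clock = clock
        self._api = api

    def daily_quest_title(self, user_id: str) -> str:
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"


def test_empty_quest_list_falls_back():
    # Partition: the API answers, but with no quests at all.
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))  # a Tuesday
    service = DailyQuestService(clock, StubQuestApiClient([]))
    assert service.daily_quest_title("u123") == "No quests today"


def test_no_entry_for_today_falls_back():
    # Partition: quests exist, but none is scheduled for the frozen weekday.
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))  # a Tuesday
    api = StubQuestApiClient([{"weekday": "Monday", "title": "Slay the Slime Lord"}])
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u123") == "No quests today"
```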

📖 Why we wrote `StubQuestApiClient` as a class with one method, not as a function

DailyQuestService calls self._api.fetch_quests(user_id) — it expects a fetch_quests method on the api object. So our stub must be an object with that method. A function alone wouldn’t have a .fetch_quests attribute.

In Python this is duck typing: any object with a fetch_quests(self, user_id) method that returns a list of quest dicts is acceptable. The real QuestApiClient does it. Our stub does it. The SUT can’t tell them apart — that’s the whole point.

In Java, you’d give both classes a common interface. In TypeScript, you’d type the parameter as { fetchQuests: (userId: string) => Quest[] }. The mechanism differs; the idea (stub satisfies the same contract as the real collaborator) is universal.
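If you want the duck-typed contract written down in Python, one option is `typing.Protocol`. This is an optional sketch, not something the tutorial's starter files use; the name `QuestApi` is hypothetical.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class QuestApi(Protocol):
    """Hypothetical name for the contract DailyQuestService relies on."""

    def fetch_quests(self, user_id: str) -> list[dict]: ...


class StubQuestApiClient:
    """Note: no inheritance from QuestApi; matching the shape is enough."""

    def __init__(self, canned_quests: list[dict]):
        self._canned = canned_quests

    def fetch_quests(self, user_id: str) -> list[dict]:
        return self._canned


stub = StubQuestApiClient([])

# With runtime_checkable, isinstance() only checks that the method exists;
# it does not check argument or return types.
satisfies_contract = isinstance(stub, QuestApi)
plain_function_satisfies = isinstance(lambda user_id: [], QuestApi)
```

The stub passes the check without subclassing anything, while a bare function does not, which matches the reasoning above about why the stub is a class with one method.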

🧠 Stub vs Fake — the cousin you'll meet briefly

A Fake Object (Meszaros p.551) is the next-of-kin to a stub: a working but lightweight implementation. Where StubQuestApiClient returns the same canned list no matter what user_id is passed, a FakeQuestApiClient could keep an in-memory dict of {user_id: [quests]} and return different lists for different users.

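class FakeQuestApiClient:
    """A Fake: a working in-memory stand-in for the real API client."""

    def __init__(self):
        self._data = {}  # user_id -> list of quest dicts

    def add_quests_for(self, user_id, quests):
        self._data[user_id] = quests

    def fetch_quests(self, user_id):
        # Unknown users get an empty list, like a user with no quests.
        return self._data.get(user_id, [])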

When to reach for a Fake instead of a Stub: when one canned answer isn’t enough — typically when multiple SUTs share the collaborator, or when the test sequence depends on state that the stub would have to manually thread.

We won’t use Fakes in the worked exercises (one canned list per test is plenty here), but it’s worth knowing they exist. Step 6’s decision guide covers when each one fits.
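For readers curious what using the Fake looks like, here is a brief sketch. FakeQuestApiClient, FrozenClock, and DailyQuestService mirror the code shown in this step; the user ids and quest assignments are made up for the illustration.

```python
from datetime import datetime


class FakeQuestApiClient:
    def __init__(self):
        self._data = {}

    def add_quests_for(self, user_id, quests):
        self._data[user_id] = quests

    def fetch_quests(self, user_id):
        return self._data.get(user_id, [])


class FrozenClock:
    def __init__(self, fixed_dt: datetime):
        self._fixed_dt = fixed_dt

    def now(self) -> datetime:
        return self._fixed_dt


class DailyQuestService:
    def __init__(self, clock, api):
        self._clock = clock
        self._api = api

    def daily_quest_title(self, user_id: str) -> str:
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"


# One fake instance serves different answers to different users.
fake = FakeQuestApiClient()
fake.add_quests_for("u1", [{"weekday": "Tuesday", "title": "Find the Lost Amulet"}])
fake.add_quests_for("u2", [{"weekday": "Tuesday", "title": "Defeat the Dragon"}])

service = DailyQuestService(FrozenClock(datetime(2026, 4, 28, 12, 0)), fake)
title_u1 = service.daily_quest_title("u1")
title_u2 = service.daily_quest_title("u2")
title_unknown = service.daily_quest_title("u3")  # never registered with the fake
```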

🌍 The same idea in another language

FrozenClock is just a class with a hard-coded method. Every language has a way to write that.

JavaScript (no framework):

const frozenClock = {
  now: () => new Date('2026-04-28T12:00:00')
};

Java:

Clock frozenClock = Clock.fixed(
  Instant.parse("2026-04-28T12:00:00Z"),
  ZoneOffset.UTC
);

Same role; different syntax. Frameworks (unittest.mock, Jest, Mockito) generate these objects more concisely — but that’s boilerplate reduction, not a different idea.

🔭 Coming in Step 3: A stub answers questions. What if your SUT’s interesting behavior is the call it makes, like a complete_quest that should call ledger.credit(user_id, gold)? That’s where a Test Spy comes in.

Starter files

clock.py

"""Reusable test helper: a clock that always says it's `fixed_dt`."""
from datetime import datetime


class FrozenClock:
    """A stub clock — always returns the datetime it was constructed with."""

    def __init__(self, fixed_dt: datetime):
        self._fixed_dt = fixed_dt

    def now(self) -> datetime:
        return self._fixed_dt

quest_api.py

"""The REAL HTTP client — don't call this in tests.

Instantiating QuestApiClient and calling fetch_quests() would actually
hit the network. Tests that exercise `DailyQuestService` should pass
a stub instead.
"""
import urllib.request
import json


class QuestApiClient:
    def fetch_quests(self, user_id: str) -> list[dict]:
        url = f"https://questforge.example.com/quests/{user_id}"
        with urllib.request.urlopen(url) as r:
            return json.loads(r.read())

quest_service.py

"""QuestForge — daily quest service.

DailyQuestService takes a clock and an API client as constructor
parameters (Dependency Injection). At test time we pass in stubs;
in production the caller passes the real ones.
"""
import datetime


def is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    today = clock.now().strftime("%Y-%m-%d")
    return today == event_date_str


class DailyQuestService:
    """Picks today's daily quest title for a user."""

    def __init__(self, clock, api):
        self._clock = clock
        self._api = api

    def daily_quest_title(self, user_id: str) -> str:
        """Return today's quest title, or 'No quests today' if none match."""
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"

test_quest_service.py

"""Step 2 — Hand-rolled stubs for DailyQuestService.

Two stubs are used here. FrozenClock is imported from clock.py.
StubQuestApiClient is defined right below — because it's a regular
class, not anything special. (Step 4 will show that `unittest.mock`
generates the same conceptual object in a single line — but the *idea*
is what we're locking in here, not the syntax.)
"""
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService


class StubQuestApiClient:
    """A Test Stub (Meszaros, p.529) — returns canned quests regardless of user_id."""

    def __init__(self, canned_quests: list[dict]):
        self._canned = canned_quests

    def fetch_quests(self, user_id: str) -> list[dict]:
        return self._canned


# ===== WORKED EXAMPLE 1 — fully written =====
# Read carefully. Predict the assertion's outcome BEFORE running.
def test_tuesday_picks_tuesday_quest():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))   # 2026-04-28 is a Tuesday
    api = StubQuestApiClient([
        {"weekday": "Monday",    "title": "Slay the Slime Lord"},
        {"weekday": "Tuesday",   "title": "Find the Lost Amulet"},
        {"weekday": "Wednesday", "title": "Defeat the Dragon"},
    ])
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u123") == "Find the Lost Amulet"


# ===== FADED EXAMPLE 2 — student fills in the expected value =====
# The stub class, the FrozenClock, and the canned data are all provided.
# YOUR JOB: replace "FILL_IN_HERE" with the EXACT title the SUT should return.
# Compute it from the spec; don't run-and-paste.
def test_thursday_picks_thursday_quest():
    clock = FrozenClock(datetime(2026, 4, 30, 12, 0))   # 2026-04-30 is a Thursday
    api = StubQuestApiClient([
        {"weekday": "Monday",   "title": "Slay the Slime Lord"},
        {"weekday": "Thursday", "title": "Battle the Lich King"},
        {"weekday": "Sunday",   "title": "Save the Princess"},
    ])
    service = DailyQuestService(clock, api)
    # TODO — pin the exact title with `==` (strong oracle, Testing Foundations Step 3).
    assert service.daily_quest_title("u456") == "FILL_IN_HERE"

Test Stubs — Knowledge Check

Min. score: 80%

1. Which best describes a Test Stub?

A real implementation that’s been simplified for performance
That’s closer to a Fake Object (Meszaros p.551) — a working but lightweight implementation, like an in-memory database. A Stub doesn’t ‘work’ in the usual sense; it just returns the canned answer it was given.
An object that answers questions with canned values — feeding controlled indirect inputs to the SUT
Right. A Test Stub (Meszaros p.529) provides controlled indirect inputs — it answers the SUT’s questions with values you chose, so the SUT’s behavior under those inputs is what gets tested.
An object that records every method call so the test can verify them later
That describes a Test Spy (Meszaros p.538), the topic of Step 3. A spy adds call recording on top of stub-like behavior — but a stub on its own doesn’t track calls.
An object that throws exceptions on every call to detect missing error handling
That’s a specific use of a stub (the side_effect=ConnectionError pattern from Step 4), but it’s not the defining role. The defining role is providing canned answers; raising exceptions is just one kind of canned answer.

Stub = canned answers. The SUT calls the stub; the stub returns whatever the test configured. Used to control what the SUT receives, not to inspect what the SUT does. (Step 3 covers the latter — that’s a Spy.)
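A canned answer can also be a failure. The hand-rolled sketch below uses a hypothetical `UnreachableQuestApiClient` whose only behavior is to raise ConnectionError, which drives DailyQuestService (logic repeated from quest_service.py) into its error-handling branch.

```python
from datetime import datetime


class FrozenClock:
    def __init__(self, fixed_dt: datetime):
        self._fixed_dt = fixed_dt

    def now(self) -> datetime:
        return self._fixed_dt


class UnreachableQuestApiClient:
    """Hypothetical stub whose canned answer is an outage."""

    def fetch_quests(self, user_id: str) -> list[dict]:
        raise ConnectionError("simulated outage")


class DailyQuestService:
    def __init__(self, clock, api):
        self._clock = clock
        self._api = api

    def daily_quest_title(self, user_id: str) -> str:
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"


service = DailyQuestService(
    FrozenClock(datetime(2026, 4, 28, 12, 0)),
    UnreachableQuestApiClient(),
)
title_during_outage = service.daily_quest_title("u123")
```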

2. Why is hardcoded datetime.now() (used directly inside the SUT) not a stub?

Because datetime.now() is a function, and a stub must be a class
A stub doesn’t have to be a class — it just has to satisfy the contract the SUT expects. The defining property is control, not type. A function or a lambda can stub a function-shaped collaborator perfectly well.
Because the test cannot control what datetime.now() returns. A stub is under the test’s control — the wall clock is not
Right. The defining property of a stub is that the test controls what the stub returns. The wall clock changes every microsecond and is shared across processes — the test has zero control over it. That’s exactly why we replaced it with a FrozenClock.
Because datetime.now() is too fast — stubs must add latency
Latency is irrelevant to the stub vs not-stub distinction. Stubs are typically faster than the real thing because they skip work, but the defining property is control, not speed.
Because Python’s standard library functions can’t be doubled
Python’s standard library is no harder to double than your own code: datetime.datetime can be swapped out through an injectable parameter, module attributes can be patched, etc. The reason datetime.now() is the opposite of a stub is that the test can’t control what it returns; nothing about Python prohibits doubling it.

Stub = under the test’s control. datetime.now() is the opposite — the wall clock is shared, mutable, and impossible for the test to pin. Replacing it with FrozenClock(...) is what makes the indirect input controllable.
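The remark above that a function or lambda can stub a function-shaped collaborator can be sketched as follows. `is_event_day` is a hypothetical variant of the SUT that takes a callable instead of a clock object; it is not the tutorial's function.

```python
from datetime import datetime


def is_event_day(event_date_str: str, now=datetime.now) -> bool:
    """Hypothetical variant of the SUT whose collaborator is a plain callable."""
    return now().strftime("%Y-%m-%d") == event_date_str


def frozen_now() -> datetime:
    """Function-shaped stub: always returns noon on 2026-04-28."""
    return datetime(2026, 4, 28, 12, 0)


on_event_day = is_event_day("2026-04-28", now=frozen_now)
on_other_day = is_event_day("2026-04-29", now=frozen_now)
```

What makes `frozen_now` a stub is that the test chose its return value, not whether it is a class or a function.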

3. (Spaced review — Testing Foundations Step 3) A teammate writes:

assert service.daily_quest_title("u123") is not None

after stubbing the clock and the API. Is the assertion strong?

Yes — the test passes, so the SUT must be returning the right title
Tests passing only tells you the assertion held. is not None holds for any non-None value, including ones that violate the spec. The Liar test from Testing Foundations Step 3 still applies — being inside a stubbed test doesn’t make it stronger.
No — is not None is a weak oracle. It accepts any non-None return — including the wrong title, an empty string, or False. Pin the exact value with ==
Right. Stubbing collaborators makes the test deterministic; it doesn’t make weak oracles strong. is not None accepts wrong values just as readily as right ones. Pin the exact expected title with ==.
Yes — is not None is the recommended assertion when stubbing dependencies
There’s no special rule for assertions in stubbed tests. Stubs control inputs; oracles check outputs. The two are independent design dimensions, exactly as Testing Foundations Step 5 spelled out.
It’s strong if the SUT’s return type is documented as a string
Documentation doesn’t make is not None precise. The function returns one specific string per partition — pinning that exact string with == is the strong oracle. is not None is a structural assertion (“some object came back”), not a behavioral one (“the right object came back”).

Stubs and strong oracles solve independent problems. Stubs make indirect inputs controllable; oracles make assertions precise. You need both. Putting a weak oracle inside a stubbed test is a Liar test wearing a stub’s clothes.

4. When would a Fake Object (in-memory implementation) be a better choice than a Test Stub?

When the test only needs to control one canned return value
One canned answer is exactly what a Stub is for. A Fake’s added complexity (an in-memory store, mutating state) is overkill when you only need one return value.
When the SUT calls the collaborator multiple times across a test sequence and expects different stateful answers each time (e.g., adding a quest, then fetching it back)
Right. A Fake’s value is consistent stateful behavior across a test sequence. If the SUT does api.add_quest(...) then api.fetch_quests(...) and expects to see the added quest back, a Stub would have to be manually re-configured between calls — a Fake just works.
When the test needs to verify that the SUT actually called the collaborator
That’s a Spy or a Mock (Step 3 / Step 5), not a Fake. A Fake doesn’t track calls — it just behaves like a simplified version of the real collaborator.
Whenever you’re testing a service class — Stubs are only for free functions
Stub vs Fake has nothing to do with whether you’re testing a class or a function. The choice is about how much state the test needs the double to manage; the SUT’s shape is irrelevant.

Stub: one canned answer per call. Fake: working in-memory implementation, useful when the SUT needs consistent stateful behavior across multiple calls (add → fetch → update → fetch again, etc.). Step 6’s decision guide covers when each fits.

3

Hand-Rolled Spy: Did the Ledger Actually Get the Gold?

🎯 Goal: Verify that the SUT called a collaborator with the right arguments, even when the SUT itself returns nothing observable. Implement a Test Spy in plain Python, and pin exactly the right amount of detail in the assertion.

🧠 Skills you’ll gain: Recognize a Test Spy as a role (Meszaros, p.538): a stub that also records calls. Write spy assertions that are strong enough to catch bugs but loose enough to survive harmless refactors.

🧭 Bridge from Step 2. A stub answers the SUT’s questions. A spy also records what the SUT did. The new conceptual move:

	Stub (Step 2)	Spy (Step 3)
What the test asserts on	The SUT’s return value	The recorded calls on the spy
What the SUT looks like	A function that returns something	Often a method that returns `None` (fire-and-forget)
Verification kind	State verification (Meszaros, p.462)	State verification of the spy — Step 5 will introduce the third kind

The new collaborator is RewardLedger — its job is to credit gold to a user. The SUT calls ledger.credit(user_id, gold) and that’s the only observable effect. The SUT itself returns nothing useful — the call to credit IS the contract. To verify it, we need a spy.

📖 What is a Test Spy? (Meszaros, xUnit Test Patterns, p.538)

A Test Spy behaves like a stub and records every call made to it. The test runs the SUT, then inspects the spy’s recorded-call list. Same SUT/collaborator structure as Step 2; what changes is what the test asserts on.

flowchart LR
    T["Test"]:::test --> S["DailyQuestService"]:::sut
    S -->|"clock.now()"| C1["FrozenClock<br/>📅 STUB"]:::stub
    S -->|"api.fetch_quests(...)"| C2["StubQuestApiClient<br/>📋 STUB"]:::stub
    S -->|"ledger.credit(u1, 100)"| C3["SpyLedger<br/>🎙️ SPY<br/><i>records every call</i>"]:::spy
    T -.->|"asserts on spy.calls"| C3
    classDef test fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef sut fill:#fff3e0,stroke:#e65100,color:#bf360c
    classDef stub fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef spy fill:#f3e5f5,stroke:#6a1b9a,color:#4a148c

Notice the test now asserts on spy.calls, not on the SUT’s return value. The contract being verified is “the SUT called credit with these arguments”.

📖 The hard part isn’t writing the spy — it’s writing the assertion

A spy is even simpler than a stub: a class with a list and an append. The interesting test-design move is how much of each call to pin.

Assertion	What still passes (i.e., what it misses)	Pattern
`assert len(spy.calls) >= 0`	Everything. Always passes. Liar test.	Weak — same family as `result is not None` from Testing Foundations
`assert spy.calls == [("u1", 100, "2026-04-28T12:00:00Z", {"meta": "blob"})]`	Nothing. Breaks if the SUT later calls credit with cleaner arguments — even when the contract is unchanged. Brittle.	Over-specified
`assert spy.calls == [("u1", 100)]`	Nothing that violates the spec. It catches a wrong user_id, a wrong gold amount, no call at all, and two calls. Goldilocks.	Strong, behaviorally-bounded

Same lesson as Testing Foundations Step 4: assert on exactly what the spec says — no less, no more. The spec for complete_quest: “credit the user the gold for the completed quest.” That maps to a 2-tuple (user_id, gold). Anything beyond that is over-specification; anything less is a Liar.
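The difference between the Liar row and the Goldilocks row can be sketched with a hand-rolled spy. The SpyLedger below is one plausible shape (a list plus an append, recording each call as a (user_id, gold) tuple); the two `complete_quest` functions are hypothetical stand-ins for the SUT, one correct and one deliberately broken.

```python
class SpyLedger:
    """Hand-rolled spy sketch: records each credit call as a (user_id, gold) tuple."""

    def __init__(self):
        self.calls = []

    def credit(self, user_id: str, gold: int) -> None:
        self.calls.append((user_id, gold))


def complete_quest(ledger, user_id: str, gold: int) -> None:
    """Hypothetical stand-in for the SUT: its only effect is one ledger call."""
    ledger.credit(user_id, gold)


def forgetful_complete_quest(ledger, user_id: str, gold: int) -> None:
    """Deliberately broken variant: never touches the ledger."""
    return None


good_spy, bad_spy = SpyLedger(), SpyLedger()
complete_quest(good_spy, "u1", 100)
forgetful_complete_quest(bad_spy, "u1", 100)

# The Liar assertion holds even for the broken SUT.
liar_passes_for_bug = len(bad_spy.calls) >= 0

# The exact 2-tuple assertion separates the two.
goldilocks_passes_for_good = good_spy.calls == [("u1", 100)]
goldilocks_passes_for_bug = bad_spy.calls == [("u1", 100)]
```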

⚙️ Task — three small moves:

Read test_complete_quest_LIAR_oracle. The assertion is assert len(spy.calls) >= 0 — it always passes, regardless of whether the SUT called the spy at all. Add a Python comment above the assertion explaining (in your own words) why this is a Liar test — use the phrase “Liar test” or “weak oracle”. Don’t change the assertion; the test stays a Liar so the lesson is preserved.
Read and run test_complete_quest_credits_correct_gold — fully written, pins the exact 2-tuple. This is the Goldilocks shape.
Fill in the assertion in test_award_streak_bonus_5_days. The streak-bonus rule: 10 gold per day, capped at 100. The test passes days=5. Compute the gold; pin the call.
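The streak-bonus rule in the third task reads as a one-line formula. The helper below is a hypothetical sketch of that rule, not the service's actual method, and it deliberately uses inputs other than the exercise's so the computation for days=5 stays yours.

```python
def streak_bonus_gold(days: int) -> int:
    """Hypothetical helper: 10 gold per day, capped at 100."""
    return min(10 * days, 100)


below_cap = streak_bonus_gold(3)    # 3 * 10 = 30, under the cap
at_cap = streak_bonus_gold(10)      # 10 * 10 = 100, exactly the cap
above_cap = streak_bonus_gold(14)   # 14 * 10 = 140, clipped to 100
```

The three inputs cover the partitions on either side of the cap and the boundary itself.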

📖 Why fire-and-forget methods need spies

complete_quest returns None. From the SUT’s caller’s perspective, nothing happens — the function is “void”. Yet the SUT did do something important: it told the ledger to credit gold. Without a spy, that work is invisible to the test.

A spy makes invisible side effects visible. The idea is the same in every language: Java mocks (Mockito.verify(...)), JavaScript spies (jest.fn() + expect(spy).toHaveBeenCalledWith(...)), Python’s unittest.mock recorded calls. For a fire-and-forget method whose only effect is a call on a collaborator, recording that call is the most direct way to test it.

🌍 The same idea in another language

JavaScript with Jest:

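const ledger = { credit: jest.fn() };   // credit is a function spy
// Assumes a JS port of DailyQuestService; clock and api are doubles built elsewhere.
const service = new DailyQuestService(clock, api, ledger);
service.completeQuest('u1', 'Slay the Slime Lord');
expect(ledger.credit).toHaveBeenCalledWith('u1', 100);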

Java with Mockito:

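RewardLedger spy = mock(RewardLedger.class);   // also acts as a spy
// Assumes a Java port of DailyQuestService; clock and api are doubles built elsewhere.
DailyQuestService service = new DailyQuestService(clock, api, spy);
service.completeQuest("u1", "Slay the Slime Lord");
verify(spy).credit("u1", 100);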

Same role; different syntax. The hand-rolled SpyLedger class makes the recording mechanism visible; framework spies (Step 4) hide the boilerplate.

🔭 Coming in Step 4: Hand-rolling spies gets repetitive — you’re writing the same self.calls.append(...) boilerplate every time. Python’s unittest.mock.Mock generates the entire SpyLedger class for you in a single line. But it’s the same conceptual object — just less typing.
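
As a small preview of that equivalence, the sketch below drives a hand-rolled SpyLedger and a bare Mock with the same call and then inspects each recording. It uses only call_args_list and call from unittest.mock; the side-by-side loop is just an illustration, not the Step 4 exercise.

```python
from unittest.mock import Mock, call


class SpyLedger:
    """Hand-rolled spy: the recording mechanism is written out explicitly."""
    def __init__(self):
        self.calls = []

    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


hand_rolled = SpyLedger()
generated = Mock()  # no class definition needed

# Drive both doubles with the same collaborator call.
for ledger in (hand_rolled, generated):
    ledger.credit("u1", 100)

# Hand-rolled: the recording is a plain list of tuples.
print(hand_rolled.calls)                 # [('u1', 100)]
# Generated: Mock keeps the recording in call_args_list.
print(generated.credit.call_args_list)   # [call('u1', 100)]
```

Both recordings carry the same information; Mock simply stores it as call objects instead of bare tuples.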

Starter files

reward_ledger.py

"""The real reward ledger — would persist gold to a database in production."""


class RewardLedger:
    def credit(self, user_id: str, gold: int) -> None:
        # In production: writes a credit row to the rewards database.
        raise NotImplementedError(
            "Don't call the real ledger in tests — pass a SpyLedger instead."
        )

quest_service.py

"""QuestForge — daily quest service with reward ledger collaborator."""
import datetime


QUEST_REWARDS = {
    "Slay the Slime Lord":   100,
    "Find the Lost Amulet":  150,
    "Battle the Lich King":  250,
    "Defeat the Dragon":     500,
}


def is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    today = clock.now().strftime("%Y-%m-%d")
    return today == event_date_str


class DailyQuestService:
    """Picks today's quest, completes quests, and awards streak bonuses."""

    def __init__(self, clock, api, ledger=None):
        self._clock = clock
        self._api = api
        self._ledger = ledger

    def daily_quest_title(self, user_id: str) -> str:
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"

    def complete_quest(self, user_id: str, quest_title: str) -> None:
        """Credit the user the gold for the completed quest. Returns None."""
        gold = QUEST_REWARDS.get(quest_title, 0)
        self._ledger.credit(user_id, gold)

    def award_streak_bonus(self, user_id: str, days: int) -> None:
        """Award 10 gold per streak day, capped at 100. Returns None."""
        gold = min(days * 10, 100)
        self._ledger.credit(user_id, gold)

test_quest_service.py

"""Step 3 — Hand-rolled spies for fire-and-forget collaborator calls.

A spy is a stub that ALSO records calls. The interesting test-design
move isn't writing the spy — it's writing the assertion. Pin exactly
what the spec mandates: no less (Liar), no more (over-specified).
"""
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService


class StubQuestApiClient:
    def __init__(self, canned_quests):
        self._canned = canned_quests
    def fetch_quests(self, user_id):
        return self._canned


class SpyLedger:
    """A Test Spy (Meszaros, p.538) — records every credit() call."""
    def __init__(self):
        self.calls = []
    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


# ===== WORKED EXAMPLE 1 — the Liar test =====
# This assertion ALWAYS passes — even if the SUT never called the spy.
# YOUR JOB: add a Python comment ABOVE the assertion explaining (in
# your own words) why this is a "Liar test" / "weak oracle".
# Don't change the assertion — keep the Liar visible for comparison.
def test_complete_quest_LIAR_oracle():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.complete_quest("u1", "Slay the Slime Lord")
    # TODO — add a comment HERE explaining the Liar pattern.
    assert len(spy.calls) >= 0


# ===== WORKED EXAMPLE 2 — Goldilocks =====
# Pins exactly the (user_id, gold) the spec mandates. Read and run.
def test_complete_quest_credits_correct_gold():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.complete_quest("u1", "Slay the Slime Lord")
    # Slay the Slime Lord rewards 100 gold (per QUEST_REWARDS in quest_service.py).
    assert spy.calls == [("u1", 100)]


# ===== FADED EXAMPLE 3 — student writes the expected call =====
# The SUT is `award_streak_bonus(user_id, days)`.
# Spec: 10 gold per day, capped at 100.
# YOUR JOB: replace the placeholder gold value with the correct one
# for `days=5`. Compute it from the spec.
def test_award_streak_bonus_5_days():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.award_streak_bonus("u2", 5)
    # TODO — replace 999 with the correct gold for a 5-day streak.
    assert spy.calls == [("u2", 999)]

Solution

test_quest_service.py

"""Step 3 solution — Liar named, Goldilocks read, Faded filled in."""
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService


class StubQuestApiClient:
    def __init__(self, canned_quests):
        self._canned = canned_quests
    def fetch_quests(self, user_id):
        return self._canned


class SpyLedger:
    def __init__(self):
        self.calls = []
    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


def test_complete_quest_LIAR_oracle():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.complete_quest("u1", "Slay the Slime Lord")
    # Liar test / weak oracle: len() of any list is always >= 0,
    # so this assertion holds even if the SUT never called the spy.
    # Same Liar-test family as `result is not None` from Testing
    # Foundations Step 3 — looks productive, verifies nothing.
    assert len(spy.calls) >= 0


def test_complete_quest_credits_correct_gold():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.complete_quest("u1", "Slay the Slime Lord")
    assert spy.calls == [("u1", 100)]


def test_award_streak_bonus_5_days():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.award_streak_bonus("u2", 5)
    # 5 days × 10 gold = 50 (well below the cap of 100).
    assert spy.calls == [("u2", 50)]

Three moves in this step:

Liar named: a comment above assert len(spy.calls) >= 0 explains why it always passes (the assertion is structurally trivial — len of any list is non-negative). The Liar stays in the file as a cautionary example, not a test that gets fixed.
Goldilocks read: assert spy.calls == [("u1", 100)] pins exactly what the spec mandates — one call with two arguments.
Faded filled in: 5 days × 10 gold = 50 (under the 100-gold cap). The strong oracle pins the exact 2-tuple.
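
The 5-day case sits comfortably inside the uncapped partition. A quick way to see where the cap starts to matter is to tabulate the rule at and around the boundary. The helper below is a standalone sketch that mirrors the formula in award_streak_bonus; the name streak_bonus is made up for this illustration.

```python
def streak_bonus(days):
    """Same rule as award_streak_bonus: 10 gold per day, capped at 100."""
    return min(days * 10, 100)


# One value inside the uncapped range, then the boundary and its neighbours.
bonuses = {days: streak_bonus(days) for days in (5, 9, 10, 11, 30)}
print(bonuses)  # {5: 50, 9: 90, 10: 100, 11: 100, 30: 100}
```

Days 10 and 11 are the natural boundary inputs if you later add spy tests for the capped case.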

Test Spies — Knowledge Check

Min. score: 80%

1. What is the defining role of a Test Spy that distinguishes it from a Test Stub?

A spy is faster than a stub because it doesn’t compute return values
Speed isn’t the distinction. Spies and stubs are both lightweight in-memory objects. The difference is what the test inspects after the SUT runs.
A spy records every call made to it so the test can later inspect the recorded list. (A spy can also act as a stub by returning canned values, but the recording is what makes it a spy.)
Right. A Spy (Meszaros, p.538) is a stub that also records calls. The test asserts on the recorded calls — that’s what enables verification of fire-and-forget collaborator interactions.
A spy raises exceptions on every call to ensure error paths are exercised
That’s a specific use of a stub or spy (set side_effect to an exception, as Step 4 will show). It’s not the defining property — it’s just one configurable behavior.
A spy is a runtime debugging tool, not a test double
Test spies are absolutely test doubles, not runtime tools. The terminology comes from xUnit Test Patterns (Meszaros, 2007). Don’t confuse “spy” in the testing sense with “spyware” in the security sense — they happen to share a metaphor but are unrelated concepts.

Spy = stub + call recording. The test asserts on the recorded call list (spy.calls), which is how we verify that the SUT did something — even when “did something” leaves no observable return value.
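
A minimal sketch of "stub + call recording" in one object follows. SpyQuestApiClient is a hypothetical double invented for this illustration; the tutorial's own files keep the stub (StubQuestApiClient) and the spy (SpyLedger) separate.

```python
class SpyQuestApiClient:
    """Hypothetical double: canned answers (stub role) plus a call log (spy role)."""
    def __init__(self, canned_quests):
        self._canned = canned_quests
        self.calls = []

    def fetch_quests(self, user_id):
        self.calls.append(user_id)   # spy part: remember the call
        return self._canned          # stub part: canned indirect input


api = SpyQuestApiClient([{"weekday": "Tuesday", "title": "Find the Lost Amulet"}])
quests = api.fetch_quests("u1")

print(quests[0]["title"])  # Find the Lost Amulet
print(api.calls)           # ['u1']
```

Whether a test treats this object as a stub or as a spy depends on which of the two it asserts on.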

2. (Spaced review — Testing Foundations Step 3) A teammate asserts:

assert len(spy.calls) >= 0

and points out the test passes. Is this assertion useful?

Yes — passing tests prove the SUT works
A passing test tells you only that its assertions held. len(any_list) >= 0 is a property of Python lists, not of the SUT, so passing this assertion proves nothing about the SUT’s behavior. Same Liar-test family as result is not None from Testing Foundations Step 3, ported to spy assertions.
No — this is a structurally trivial assertion (len of any list is >= 0). It would pass even if the SUT never called the spy. Liar test.
Right. The assertion holds for an empty list, a list of correct calls, a list of wrong calls — every list. The test passes regardless of behavior, which is the textbook Liar test. The fix: pin the exact expected call list with ==.
Yes — len(...) >= 0 is the recommended starting assertion for spy-based tests
There’s no such recommendation. Starting weak and “strengthening later” is how Liar tests get committed to main and forgotten. Always pin the exact expected call list from the start.
No — but only because the assertion should use is True/is False instead
is True/is False is for boolean returns. len(...) >= 0 would still be a Liar even if you wrote (len(...) >= 0) is True — the underlying expression is structurally trivial. The fix is to assert on the recorded calls themselves, not on len().

The Liar pattern is independent of the assertion operator. The issue is the assertion’s expression — len(...) >= 0 is structurally trivial. Replace it with assert spy.calls == [...] pinning the exact expected call.

3. Which spy assertion is brittle (would break under a harmless internal refactor)?

assert spy.calls == [("u1", 100)]
This pins exactly the (user_id, gold) the spec mandates. If the SUT later changes how it formats internal log strings, this test still passes — because it doesn’t reference internal-state details. Goldilocks, not brittle.
assert spy.calls == [("u1", 100, "2026-04-28T12:00:00Z", {"meta": "req-id-abc"})]
Right. This pins a 4-tuple including a timestamp and a request-ID metadata dict — neither of which is in the spec for credit. If the SUT is later refactored to drop the metadata or change the timestamp format (without changing the user/gold contract), this test breaks for the wrong reason. Over-specified, brittle.
assert ("u1", 100) in spy.calls
in spy.calls is under-specified in the other direction (extra calls would still pass), but it isn’t brittle — it tolerates harmless changes. Brittle assertions break when the underlying contract is preserved; under-specified assertions miss bugs the contract was supposed to catch. Different problem.
assert spy.calls[0] == ("u1", 100)
Indexing [0] is just a way to access the first call. It pins what we want (user_id, gold) and ignores everything else. Not brittle. (Slightly less idiomatic than full-list equality, but not the over-specified case.)

Brittle = pins details outside the spec. The 4-tuple includes a timestamp and a metadata dict that aren’t part of the credit contract — they’re internals. A pure refactor that drops the metadata would break this test even though credit(user_id, gold) is still being called correctly. (Same family as the internal-coupling brittleness from Testing Foundations Step 4.)
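
The under-specified options from this question can be compared against a concrete bug. The sketch below simulates a hypothetical double-credit defect by calling the spy twice directly (no SUT involved) and evaluates three of the assertion expressions.

```python
class SpyLedger:
    def __init__(self):
        self.calls = []

    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


spy = SpyLedger()
# Simulate a hypothetical bug: the SUT credits the same reward twice.
spy.credit("u1", 100)
spy.credit("u1", 100)

membership_passes = ("u1", 100) in spy.calls      # True: the extra call slips through
first_call_passes = spy.calls[0] == ("u1", 100)   # True: only the first call is checked
exact_list_passes = spy.calls == [("u1", 100)]    # False: the duplicate is caught

print(membership_passes, first_call_passes, exact_list_passes)  # True True False
```

Only the full-list equality fails on the duplicate, which is why it is the preferred shape when the spec implies exactly one call.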

4. (Spaced review — Step 2) Stub vs Spy in one sentence:

A stub is hand-rolled; a spy uses unittest.mock
Both can be hand-rolled or generated. Step 4 will show that unittest.mock generates either role from the same Mock class — the role isn’t determined by the library.
A stub provides canned answers to the SUT’s questions; a spy records the SUT’s calls so the test can inspect them later
Right. Stub = canned answers (control indirect input). Spy = record-and-inspect (verify indirect output). Same SUT/collaborator structure; different question being asked of the test.
A stub is for read operations; a spy is for write operations
Read/write isn’t the distinction — many real collaborators do both, and the choice of stub or spy depends on what the test wants to verify, not on whether the underlying call is a read or a write.
A stub is faster than a spy
Performance is a non-distinction. The choice between stub and spy is about what behavior the test verifies, not about how fast the double runs.

Stub: "control what the SUT receives." Spy: "observe what the SUT did." Same role-vs-syntax distinction as Step 2 — these are test-design roles, independent of whether you hand-roll them or generate them with a library (Step 4 incoming).

4

Meet `unittest.mock`: Same Roles, Less Typing

🎯 Goal: Re-recognize the stubs and spies you wrote by hand in Steps 2-3 — but now generated by unittest.mock.Mock in a single line. See three syntactic forms of the same conceptual stub side-by-side. 🧠 Skills you’ll gain: Read Mock(return_value=...) and recognize it as a stub. Read a Mock with assert_called_once_with(...) and recognize it as a spy. Use side_effect to simulate collaborator failures.

🧭 Bridge from Steps 2-3. You wrote StubQuestApiClient and SpyLedger by hand. The recording boilerplate (self.calls.append(...)) gets repetitive. Python’s unittest.mock.Mock is a class that generates the same conceptual object on demand:

Set api.fetch_quests.return_value = [...] → api.fetch_quests(...) returns that list. (Stub.)
Set api.fetch_quests.side_effect = ConnectionError → api.fetch_quests(...) raises. (Failing stub.)
Call api.fetch_quests("u1") → Mock auto-records the call; api.fetch_quests.assert_called_once_with("u1") checks the recording. (Spy.)

One class, three roles — depending on what the test asks of it. The role isn’t determined by the class; it’s determined by what the test does with it.
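
The three bullets above can be condensed into one runnable sketch. It exercises the Mock objects directly, without DailyQuestService, so the only thing on display is the double's behavior; the variable names are illustrative.

```python
from unittest.mock import Mock

# Role 1, stub: a canned answer for fetch_quests.
api = Mock()
api.fetch_quests.return_value = [{"weekday": "Tuesday", "title": "Find the Lost Amulet"}]
stubbed = api.fetch_quests("u1")

# Role 2, failing stub: the call raises instead of returning.
failing_api = Mock()
failing_api.fetch_quests.side_effect = ConnectionError
try:
    failing_api.fetch_quests("u1")
    raised = False
except ConnectionError:
    raised = True

# Role 3, spy: the call made to `api` above was recorded automatically.
api.fetch_quests.assert_called_once_with("u1")

print(stubbed[0]["title"], raised)  # Find the Lost Amulet True
```

Note that `api` plays two roles in the same script: nothing about the object changed between the stub use and the spy use, only the question asked of it.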

📖 The verbatim teaching sentence — louder this time

“Mock is a tool class; stub, spy, and mock are test-design roles. Same in Python, JavaScript, and Java — the role is what matters; the class name is just syntax.”

unittest.mock.Mock is the most overloaded class name in Python testing. It is not a “Mock object” in Meszaros’ sense (Step 5 will introduce that role). It’s a tool — a configurable double that can play stub, spy, or mock depending on how the test uses it.

⚠️ Why this matters for your career

Reading other people’s tests, you’ll see Mock everywhere. Most uses are stubs in disguise (Mock(return_value=...)). When someone says “I added a mock for the database,” nine times out of ten they actually added a stub. Recognizing the role behind the class name is the difference between parroting Mock syntax and understanding what the test verifies.

⚙️ Task — read four tests, fill in one slot:

Read test_a_handrolled_stub — the Step 2 hand-rolled style for comparison.
Read test_b_mock_return_value — same SUT, same role, generated by Mock. Confirm both pass and verify the same behavior.
Read test_c_mock_as_spy — the same Mock class, now playing the spy role. Notice: nothing about Mock changes between Test B and Test C — only what the test does with it.
Fill in test_d_side_effect_simulates_api_failure — replace the placeholder exception class. Read DailyQuestService.daily_quest_title to find which exception it catches; use that class. The concept: a stub doesn’t have to return — it can raise, simulating a real-world failure path.

📖 `return_value` vs `side_effect` — concept-level contrast

Attribute	What it does	When to reach for it
`mock.return_value = X`	Calls return `X` (a canned answer)	The collaborator should succeed; you want to drive the SUT down a happy-path partition.
`mock.side_effect = Exception`	Calls raise the exception	The collaborator should fail; you want to drive the SUT down its error-handling branch.
`mock.side_effect = [a, b, c]`	First call returns `a`, second `b`, third `c`	The collaborator returns different values across the test sequence.
`mock.side_effect = my_function`	Calls invoke `my_function(*args, **kwargs)` and return its result	The return value depends dynamically on the arguments.

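Both attributes are configurations of the same Mock object, and they answer different test-design questions. They are not fully independent, though: when both are set, side_effect takes precedence, and return_value is used only if side_effect is None or the side-effect callable returns unittest.mock.DEFAULT.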
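
The rows of the table can be tried out in a few lines. This is a sketch using standalone Mock objects; reward_for is a made-up helper for the callable case, not something from quest_service.py.

```python
from unittest.mock import Mock

# return_value: the same canned answer on every call.
fetch = Mock(return_value=["quest"])
same_each_time = [fetch("u1"), fetch("u2")]

# side_effect as a list: one item per call, in order.
# (A further call after the list is exhausted raises StopIteration.)
seq = Mock(side_effect=["a", "b", "c"])
sequenced = [seq(), seq(), seq()]


# side_effect as a callable: the result is computed from the arguments.
def reward_for(user_id):  # hypothetical helper for this sketch
    return {"u1": 100}.get(user_id, 0)


dyn = Mock(side_effect=reward_for)
computed = [dyn("u1"), dyn("u2")]

# Both attributes set at once: side_effect is the one that is used.
both = Mock(return_value="from return_value", side_effect=["from side_effect"])
winner = both()

print(same_each_time, sequenced, computed, winner)
```

The last case is worth remembering when a test configures return_value and then wonders why it is ignored: a leftover side_effect is the usual cause.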

📖 What about `monkeypatch`?

pytest’s monkeypatch fixture is another way to swap a collaborator at test time — particularly useful when the collaborator is a module-level function or constant that the SUT imports, rather than a constructor parameter:

def test_with_monkeypatch(monkeypatch):
    # Replace QUEST_REWARDS at the module level for this one test only.
    # monkeypatch automatically restores it after the test.
    monkeypatch.setattr("quest_service.QUEST_REWARDS", {"Slay the Slime Lord": 9999})
    spy = Mock()
    service = DailyQuestService(FrozenClock(...), Mock(), spy)
    service.complete_quest("u1", "Slay the Slime Lord")
    spy.credit.assert_called_once_with("u1", 9999)

monkeypatch.setattr(target, value) replaces target with value. After the test, monkeypatch restores the original — automatically. The auto-cleanup is what makes monkeypatch safe: a manual replacement that you forgot to restore would leak into every subsequent test.

Conceptually, monkeypatch.setattr is a stub — you’re feeding the SUT a controlled value. Same role; different syntactic vehicle. Use it when the seam is at module level rather than at constructor level.

Step 5 will use the heavier unittest.mock.patch (decorator/context manager) for the same purpose — and explore the canonical pitfall: where in the namespace to patch.

🌍 The same idea in another language

JavaScript with Jest:

const api = { fetchQuests: jest.fn().mockReturnValue([...]) };  // stub
// OR
const api = { fetchQuests: jest.fn().mockImplementation(() => { throw new Error('boom'); }) };  // failing stub (Jest's counterpart to side_effect)

Java with Mockito:

QuestApiClient api = mock(QuestApiClient.class);
when(api.fetchQuests(anyString())).thenReturn(List.of(...));  // stub
// OR
when(api.fetchQuests(anyString())).thenThrow(new ConnectionException());  // failing stub

Same conceptual moves: tell the double “return X” or “raise X.” The names of the methods differ across libraries — the roles don’t.

🔭 Coming in Step 5: Mock can also play the third role — Mock Object in Meszaros’ strict sense (behavior verification). To see it cleanly, we need one more idea: patch(), and where in the namespace to patch. That’s the #1 Python-mocking pitfall.

Starter files

test_quest_service.py

"""Step 4 — unittest.mock generates the same conceptual objects you wrote by hand.

Four tests below, all testing the same SUT (DailyQuestService). They
differ only in HOW the double is constructed and what role it plays.
Read them as a side-by-side comparison.
"""
from unittest.mock import Mock
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService


# Hand-rolled stub class (Step 2 style) — kept for direct comparison.
class StubQuestApiClient:
    def __init__(self, canned_quests):
        self._canned = canned_quests
    def fetch_quests(self, user_id):
        return self._canned


# ===== TEST A — Hand-rolled stub (Step 2 style) =====
def test_a_handrolled_stub():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = StubQuestApiClient([
        {"weekday": "Tuesday", "title": "Find the Lost Amulet"},
    ])
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "Find the Lost Amulet"


# ===== TEST B — Mock with return_value (same ROLE: stub) =====
# `Mock()` creates an auto-magic object. Setting
# `api.fetch_quests.return_value = [...]` configures what
# `api.fetch_quests(anything)` returns. Functionally equivalent to
# the StubQuestApiClient class above — just no class definition.
def test_b_mock_return_value():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    api.fetch_quests.return_value = [
        {"weekday": "Tuesday", "title": "Find the Lost Amulet"},
    ]
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "Find the Lost Amulet"


# ===== TEST C — Mock used as a SPY (different ROLE, same class) =====
# Watch this carefully: `Mock` is the same class as Test B's. But
# we're using it as a SPY — recording the call to `credit` and
# asserting on the recording afterwards. The role isn't determined
# by the class; it's determined by what we DO with it.
def test_c_mock_as_spy():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    api.fetch_quests.return_value = []   # api still acts as stub
    ledger = Mock()                       # ledger plays SPY
    service = DailyQuestService(clock, api, ledger)
    service.complete_quest("u1", "Slay the Slime Lord")
    # Mock auto-records every call; `assert_called_once_with` checks the recording.
    # This is identical in spirit to: assert ledger.calls == [("u1", 100)]
    # — just generated automatically.
    ledger.credit.assert_called_once_with("u1", 100)


# ===== TEST D — fill in the side_effect =====
# The SUT catches ConnectionError and returns "No quests today".
# Use side_effect to make the stub RAISE that exception instead of returning.
# YOUR JOB: replace `ValueError` (the wrong exception) with the right one.
# Read DailyQuestService.daily_quest_title in quest_service.py to confirm
# which exception class is caught.
def test_d_side_effect_simulates_api_failure():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    # TODO: replace ValueError with the exception class the SUT catches.
    api.fetch_quests.side_effect = ValueError
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "No quests today"

Solution

test_quest_service.py

"""Step 4 solution — side_effect set to ConnectionError."""
from unittest.mock import Mock
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService


class StubQuestApiClient:
    def __init__(self, canned_quests):
        self._canned = canned_quests
    def fetch_quests(self, user_id):
        return self._canned


def test_a_handrolled_stub():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = StubQuestApiClient([
        {"weekday": "Tuesday", "title": "Find the Lost Amulet"},
    ])
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "Find the Lost Amulet"


def test_b_mock_return_value():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    api.fetch_quests.return_value = [
        {"weekday": "Tuesday", "title": "Find the Lost Amulet"},
    ]
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "Find the Lost Amulet"


def test_c_mock_as_spy():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    api.fetch_quests.return_value = []
    ledger = Mock()
    service = DailyQuestService(clock, api, ledger)
    service.complete_quest("u1", "Slay the Slime Lord")
    ledger.credit.assert_called_once_with("u1", 100)


def test_d_side_effect_simulates_api_failure():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    # The SUT's daily_quest_title catches ConnectionError specifically.
    api.fetch_quests.side_effect = ConnectionError
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "No quests today"

daily_quest_title has try: ... except ConnectionError: return "No quests today". Setting side_effect = ConnectionError makes api.fetch_quests(...) raise that exception when called — driving the SUT down its error-handling branch.

Setting side_effect = ValueError would NOT match the SUT’s except clause — the exception would propagate up and the test would fail with an unexpected ValueError. The class has to match the exception the SUT is prepared to catch.
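
Both outcomes can be observed side by side. The function below is a trimmed-down stand-in for DailyQuestService.daily_quest_title that keeps only the error-handling path (no clock, no weekday matching); it is an assumption made to keep the sketch self-contained.

```python
from unittest.mock import Mock


def daily_quest_title(api, user_id):
    """Simplified stand-in: same try/except shape as the real method."""
    try:
        quests = api.fetch_quests(user_id)
    except ConnectionError:
        return "No quests today"
    return quests[0]["title"] if quests else "No quests today"


api = Mock()

# Matching exception: the SUT's except clause handles it.
api.fetch_quests.side_effect = ConnectionError
handled = daily_quest_title(api, "u1")

# Non-matching exception: nothing catches it, so it escapes to the caller.
api.fetch_quests.side_effect = ValueError
try:
    daily_quest_title(api, "u1")
    escaped = None
except ValueError as exc:
    escaped = type(exc).__name__

print(handled, escaped)  # No quests today ValueError
```

In a pytest run, the second case is what shows up as a test error with a ValueError traceback rather than as an assertion failure.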

unittest.mock — Knowledge Check

Min. score: 80%

1.

api = Mock()
api.fetch_quests.return_value = [{"weekday": "Tuesday", "title": "..."}]

What role is api playing here?

Mock — because the variable name api and the class Mock are both used
This is the most common confusion in Python testing. The class is Mock, but the role is determined by how the test uses the object — not by the class name. Here, api is configured to return a canned value; that’s a stub role.
Stub — it answers fetch_quests(...) calls with a canned value, providing controlled indirect input to the SUT
Right. return_value configures a canned answer; the SUT receives that answer as indirect input. Same role as StubQuestApiClient from Step 2 — just generated by Mock instead of declared as a class. (Yes, Mock also records calls, but here the test never asserts on them. The role is determined by the test’s intent.)
Spy — every call to a Mock is automatically recorded
Mock objects do auto-record calls, so the capability is there — but role is determined by what the test uses. This test only configures return_value and asserts on the SUT’s return value (state verification). No call assertions are made on api, so its spy capability is unused — it’s playing stub.
Fake — it has a working in-memory implementation
A Fake (Meszaros, p.551) has a working but lightweight implementation — typically with internal state (an in-memory dict, for example). Mock has no internal logic; it just returns whatever you configured. So this isn’t a Fake.

Mock(return_value=X) is the framework’s way of writing what you wrote by hand as class StubX: def method(self): return X. Same role; less typing. The class is Mock; the role is stub. (Verbatim teaching sentence in action.)

2. When should you reach for side_effect instead of return_value?

Never — they’re interchangeable; pick whichever reads better
They are not interchangeable. return_value always returns the same canned answer; side_effect lets the answer vary by call (or raise an exception, or be computed from arguments). Different behaviors, different test-design uses.
When the collaborator should raise an exception (so you can test the SUT’s error-handling branch), or return different values across calls, or compute the return dynamically from the call’s arguments
Right. side_effect covers three patterns return_value cannot: (1) raise on call → exercise the SUT’s except branch; (2) iterable → return different values on consecutive calls; (3) callable → compute return value from the args. Each one corresponds to a distinct test-design need.
When you want the test to be slower (side_effect adds latency)
Speed is a non-issue at this scale. The choice between return_value and side_effect is about behavioral capability, not performance.
When return_value doesn’t exist on the version of unittest.mock you’re using
Both have been in unittest.mock since at least Python 3.3. Versioning isn’t the reason to prefer one over the other.

return_value: one canned answer for every call. side_effect: dynamic — exception-raising, sequenced returns, or computed-from-args. Pick based on what the test needs the collaborator to do, not by what looks shorter.

3. A teammate writes:

ledger.credit.assrt_called_once_with("u1", 100)   # typo

and the test passes. What actually happened?

Mock corrected the typo internally and called the right assert method
Mock has no auto-correct mechanism. It also has no idea you intended assert_called_once_with — to Mock, assrt_called_once_with is just another attribute name to auto-create.
Mock auto-created assrt_called_once_with as a new mock method (because Mock creates attributes on access) — so the line just creates a new auto-mocked attribute and calls it. No actual assertion ran. The test silently passes regardless of behavior.
Right. This is the typo trap — one of the most dangerous Mock pitfalls. Every attribute access on a vanilla Mock returns a new child Mock; calling .assrt_called_once_with(...) on that child Mock just records another call, returns a new Mock, and produces no assertion. Step 5 introduces autospec=True as one defense (it restricts attribute access to the real object’s interface).
Mock raised an AttributeError and pytest caught it as a passing test
There’s no AttributeError because Mock auto-creates attributes. That’s the whole problem — the failure mode is silent.
Python’s interpreter detected the typo and warned via stderr
Python doesn’t warn about typo’d method names — to the language, assrt_called_once_with is a perfectly valid attribute name. Static analyzers (mypy, pylint) might flag it; the runtime won’t.

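The typo trap. Mock’s auto-attribute behavior, convenient for quickly stubbing nested attribute chains, also silently swallows misspelled assert_* method names: the test passes, but the assertion never ran. How much protection you get depends on the Python version. Since Python 3.10, a Mock created without unsafe=True raises AttributeError for attribute names that start with assert, assret, asert, aseert, or assrt, so this exact misspelling is caught there; other misspellings (for example asser_called_once_with) still slip through. Step 5’s autospec=True is one defense; running a static checker, or simply double-checking the spelling of assert_called_once_with, may be another.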
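
The silent failure mode is easy to reproduce. The sketch below uses the misspelling asser_called_once_with, which (as an assumption that holds for current Python releases) is not on Mock's list of guarded prefixes, and then shows Mock(spec=...) rejecting an unknown attribute. The RewardLedger class here is a local stand-in with the same credit signature as the tutorial's.

```python
from unittest.mock import Mock


class RewardLedger:
    """Local stand-in with the same interface as the tutorial's ledger."""
    def credit(self, user_id, gold):
        raise NotImplementedError


loose = Mock()

# credit is never called, yet the misspelled "assertion" raises nothing:
# it just creates a child Mock and calls it.
leftover = loose.credit.asser_called_once_with("u1", 100)
silently_passed = isinstance(leftover, Mock)

# The correctly spelled assertion does its job and fails.
try:
    loose.credit.assert_called_once_with("u1", 100)
    real_assert_failed = False
except AssertionError:
    real_assert_failed = True

# A Mock built with spec= only allows attributes the real class has.
strict = Mock(spec=RewardLedger)
try:
    strict.credti("u1", 100)  # misspelled method name
    spec_rejected = False
except AttributeError:
    spec_rejected = True

print(silently_passed, real_assert_failed, spec_rejected)  # True True True
```

The spec= parameter shown here is a lighter relative of the autospec=True option that Step 5 introduces.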

4. (Spaced review — TDD) During the Red-Green-Refactor cycle, when do you typically introduce a Mock?

Before Red — Mocks must exist before the test is written
There’s nothing to mock until you write the test — and the test names which collaborators it needs to control. Setting up Mocks before the test exists is putting the cart before the horse.
During Red — the failing test you write is the moment you decide which double to use, because the choice is part of the test design
Right. The Red phase is where you design the test — including which collaborators to double and what role each should play. Green just makes the SUT pass; Refactor improves the code under a green safety net. The double choice is a Red-phase test-design decision.
During Refactor only — Mocks are exclusively a code-cleanup tool
Mocks aren’t a refactor-only tool. They’re a test-design tool that supports refactoring (by making behavior verifiable in isolation) — but the choice happens during Red.
Never — TDD forbids Mocks
TDD doesn’t forbid Mocks; it just emphasizes that the test drives design. Mocks are one of the design moves available — used judiciously when the SUT genuinely depends on collaborators.

Red is the test-design moment. Choosing stub/spy/mock/fake/no-double is a Red-phase decision because it shapes both the test’s structure and (often) the production design that emerges in Green. (Step 6 covers when not to double — also a Red-phase decision.)

5. Why is it important that pytest’s monkeypatch fixture automatically restores the original value?

It makes monkeypatch faster than unittest.mock
Speed is irrelevant. The benefit is correctness across a test suite, not microseconds per test.
Without auto-restore, a patched module attribute would leak into every subsequent test in the suite — silently breaking tests that don’t even know they’re using a patched value. Auto-restore makes the patch a test-local effect.
Right. Test isolation is non-negotiable: a test that mutates global state and forgets to clean up corrupts every test that runs after it. monkeypatch (and unittest.mock.patch as a context manager / decorator) automate the cleanup, so you can’t forget.
It’s a Python 3.11+ feature for memory management
monkeypatch has been in pytest for many years; it’s not a Python 3.11 feature. And cleanup is a correctness concern, not a memory-management one.
It’s only needed when you’re patching __builtins__
monkeypatch can patch any attribute — module functions, class methods, instance attributes, dictionary entries. It’s not limited to __builtins__.

Test isolation. A test that patches a module attribute and forgets to restore it leaves a time bomb for every subsequent test. monkeypatch and with patch(...) both handle restoration for you; manual setattr/delattr does not. Always prefer the framework-managed forms.
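A minimal sketch of the difference, using only the standard library. The `settings` object and its `daily_limit` attribute are hypothetical stand-ins for a module-level value. `patch.object` used as a context manager restores the original on exit, while the manual assignment does not.

```python
import types
from unittest.mock import patch

# Hypothetical stand-in for a module that holds a global setting.
settings = types.SimpleNamespace(daily_limit=3)


def framework_managed_test():
    # patch.object restores the original value when the block exits.
    with patch.object(settings, "daily_limit", 99):
        assert settings.daily_limit == 99


def manual_test():
    # Manual patch with no cleanup: the change outlives the test.
    settings.daily_limit = 99
    assert settings.daily_limit == 99


framework_managed_test()
after_managed = settings.daily_limit   # original value is back

manual_test()
after_manual = settings.daily_limit    # patched value leaked out
print(after_managed, after_manual)
```

Any test that runs after `manual_test` and reads `settings.daily_limit` now sees the leaked value, which is the time bomb described above.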

5

Where to Patch — The #1 Python Pitfall, and Why autospec Defends You

🎯 Goal: Recognize and fix the wrong-namespace patch — the most common Python-mocking failure mode. See autospec=True defend you against the typo trap and signature drift. 🧠 Skills you’ll gain: Pick the correct patch() target string by tracing the SUT’s import. Recognize when a Mock has no spec and the risks that come with that.

🧭 Bridge from Step 4. Step 4 passed Mocks in as constructor arguments: DailyQuestService(clock, api, ledger) accepts the doubles directly. Sometimes that’s not possible: the SUT might call a module-level function directly, with no constructor parameter to swap. Then we use unittest.mock.patch() and confront the canonical Python pitfall: in which namespace does the patch belong?

📖 The new SUT — `celebrate_milestone`

Look at quest_service.py. There’s a new method celebrate_milestone(user_id, days) that calls send_push(...) from push_notifier. The import line in quest_service.py is:

from push_notifier import send_push

That single line is the source of most where-to-patch confusion in Python. After this import, send_push is bound in quest_service’s namespace. The quest_service module now has its own reference to the function, separate from push_notifier’s.

flowchart LR
    subgraph push_mod["push_notifier module"]
        P_DEF["send_push<br/>= &lt;real function&gt;"]:::neutral
    end
    subgraph quest_mod["quest_service module"]
        Q_REF["send_push<br/>= &lt;ref to real function&gt;"]:::neutral
        Q_USE["celebrate_milestone<br/>calls send_push(...)<br/>looks up 'send_push' HERE"]:::sut
        Q_REF -.->|"looked up in<br/>this namespace"| Q_USE
    end
    P_DEF -->|"from push_notifier import send_push<br/>copies the reference"| Q_REF
    classDef neutral fill:#fafafa,stroke:#bdbdbd,color:#424242
    classDef sut fill:#fff3e0,stroke:#e65100,color:#bf360c

📜 The rule

Patch where the SUT looks up the name — not where it was originally defined.

celebrate_milestone does send_push(...). Python finds that name by looking it up in quest_service’s namespace (the importing module). So the patch target is "quest_service.send_push", not "push_notifier.send_push". Patching the latter rebinds the name only inside push_notifier; quest_service still holds its own reference to the real function, so the SUT never sees the mock.
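The rule can be checked in a few lines without touching the tutorial files. The sketch below builds two throwaway modules in memory; the names `demo_push_notifier`, `demo_quest_service`, and `celebrate` are hypothetical stand-ins chosen so they cannot collide with the real starter files.

```python
import sys
import types
from unittest.mock import patch

# Build two throwaway modules in memory so the sketch runs without files.
# The "demo_" names are hypothetical stand-ins for push_notifier / quest_service.
push_mod = types.ModuleType("demo_push_notifier")
exec("def send_push(user_id, message):\n    return 'real'\n", push_mod.__dict__)
sys.modules["demo_push_notifier"] = push_mod

quest_mod = types.ModuleType("demo_quest_service")
exec(
    "from demo_push_notifier import send_push\n"
    "def celebrate(user_id):\n"
    "    return send_push(user_id, 'streak!')\n",
    quest_mod.__dict__,
)
sys.modules["demo_quest_service"] = quest_mod

# Wrong namespace: rebinding the name in the defining module is invisible
# to celebrate(), which resolves send_push in demo_quest_service.
with patch("demo_push_notifier.send_push") as wrong:
    wrong_result = quest_mod.celebrate("u1")
wrong_was_called = wrong.called

# Right namespace: the name celebrate() actually looks up is replaced.
with patch("demo_quest_service.send_push") as right:
    quest_mod.celebrate("u1")
right_was_called = right.called

print(wrong_result, wrong_was_called, right_was_called)
```

In the first block the real function still runs and the mock records nothing; in the second the mock intercepts the call.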

Part A — Predict and fix the patch target

⚙️ Task: open test_celebrate.py. The patch target is currently wrong. Run the test (it fails). Read the failure carefully — mock_send was never called, even though the SUT did run celebrate_milestone. That’s the signature of a wrong-namespace patch.

Then fix it: change the patch target string to the right one. Re-run.

💡 Pedagogical note. Your fix is one string change. The conceptual move is naming where the SUT looks the name up. That insight ports to JavaScript (CommonJS’ const { y } = require('x') has the same trap) and Java (static imports have a similar effect). Once you internalize the rule, you stop being trapped by the syntax.

Part B — autospec is a design guardrail, not a syntactic flourish

Read the second pair of tests in the file: test_loose_mock_accepts_wrong_call and test_autospec_rejects_wrong_call. Both run successfully — but they verify very different things.

	Loose Mock (no spec)	Autospec’d Mock
Setup	`with patch("X") as m:`	`with patch("X", autospec=True) as m:`
What `m(wrong_args)` does	Silently records the call	Raises `TypeError` because the real function’s signature is enforced
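What `m.assrt_called_once_with(...)` (typo) does	On Python 3.9 and earlier, silently auto-creates an attribute and returns yet another Mock. Since Python 3.10, a plain Mock raises `AttributeError` for attribute names starting with `assrt` (and a few other common misspellings of `assert`).	Not what `autospec` is for: it defends primarily against call-signature drift, not assertion-method typos. Use linters / mypy for the typo defense.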
When you’d want it	Quick exploratory test where signature isn’t a concern	Default-safe habit for any patched callable — catches signature drift the moment a teammate’s refactor breaks the contract

The pedagogical takeaway: autospec=True is a design guardrail. It says “make this Mock as strict as the real thing it’s replacing.” Without it, your test silently accepts calls that the real function would reject — until production catches it for you, which is the worst place to find out.

📖 Behavior verification — the third kind

Steps 2 and 3 used state verification: stubs feed inputs, the test asserts on the SUT’s return value or on the spy’s recorded list. The SUT’s internal call sequence was incidental.

test_celebrate_milestone_sends_push (after you fix the patch target) is different. The SUT returns None. Nothing in its observable state changes. The call itself is the entire contract. We assert that mock_send was called once with specific arguments. That’s behavior verification (Meszaros, p.468).

A Mock configured with call assertions is, in Meszaros’ strict sense, a Mock Object (p.544). The role isn’t “what class did you instantiate” — it’s “what does the test verify, and how?”
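A small sketch of the two verification styles side by side. The functions `streak_bonus` and `notify_on_milestone` are simplified, hypothetical versions of the service methods, written as free functions so the example is self-contained.

```python
from unittest.mock import Mock


def streak_bonus(days: int) -> int:
    """Returns a value, so a plain assert on the result is enough."""
    return min(days * 10, 100)


def notify_on_milestone(days: int, send) -> None:
    """Returns None; the outbound call to `send` is the whole contract."""
    if days % 7 == 0:
        send(f"{days}-day streak!")


# State verification: assert on what came back.
assert streak_bonus(3) == 30

# Behavior verification: assert on the call that went out.
send = Mock()
notify_on_milestone(14, send)
send.assert_called_once_with("14-day streak!")

# The negative case matters too: no milestone, no call.
quiet_send = Mock()
notify_on_milestone(5, quiet_send)
quiet_send.assert_not_called()
```

With nothing returned and no state to inspect, the recorded call is the only evidence the test has, which is why the assertion targets the Mock.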

🌍 The same idea in another language

JavaScript with Jest (CommonJS): Same trap exists.

// questService.js
const { sendPush } = require('./pushNotifier');
function celebrateMilestone(...) { sendPush(...); }

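jest.mock('./pushNotifier') works because Jest hoists the call and intercepts at the require boundary, before the consumer destructures. Replacing a property on the exports object afterwards (for example with jest.spyOn) does not reach a consumer that has already destructured sendPush into its own binding. It is the same family of problem.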

Java with Mockito static imports: Less prone to this since Java imports are class-level and Mockito patches at the type level. But PowerMock for static methods has its own where-to-patch dance.

The general lesson, language-independent: a call resolves its name in the namespace of the module that uses it, not the one that defined it. Patch there.

🧠 The typo trap and `autospec` — the precise truth

A common claim: “autospec catches typos like assrt_called_once_with.” Half-true. Here’s the precise picture.

autospec=True constrains the Mock to the spec of the patched object — its arguments, its attributes (if it’s a class), its method signatures. For attribute access, autospec does restrict the Mock to attributes the real object has — but assert_* methods are part of the Mock’s interface, not the real object’s. So mock.assrt_called_once_with may or may not be caught depending on Python version and exact patching shape.

The reliable defense against assrt_called_once_with typos: mypy or pylint, not autospec. Don’t rely on autospec for typo prevention.
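Because the behavior varies, it can be worth probing your own interpreter. A small sketch follows; the helper `typo_is_caught` and the local `send_push` stand-in are illustrative and not part of the tutorial's files.

```python
from unittest.mock import Mock, create_autospec


def send_push(user_id: str, message: str) -> None:
    """Local stand-in with the same signature as the tutorial's send_push."""


def typo_is_caught(mock) -> bool:
    """Return True if the misspelled assertion raises instead of passing."""
    try:
        mock.assrt_called_once_with("u1", "hi")
    except AttributeError:
        return True
    return False


loose_caught = typo_is_caught(Mock())
spec_caught = typo_is_caught(create_autospec(send_push))
print(f"loose Mock catches typo: {loose_caught}")
print(f"autospec'd function catches typo: {spec_caught}")
```

On Python 3.10 and later the loose Mock line should report True, because Mock rejects attribute names beginning with `assrt`; on older versions expect False. Other misspellings and other patching shapes may behave differently, so treat the output as a probe of your setup.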

The reliable defense against signature drift (calling send_push("u1") when the real function needs send_push("u1", "msg")): autospec catches this immediately. That’s the use case worth the keystrokes.
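A sketch of what signature enforcement does and does not cover, using `create_autospec` directly instead of `patch(..., autospec=True)`. The local `send_push` and the helper `raises_type_error` are assumptions for illustration.

```python
from unittest.mock import Mock, create_autospec


def send_push(user_id: str, message: str) -> None:
    """Local stand-in with the same signature as the tutorial's send_push."""


def raises_type_error(fn, *args, **kwargs) -> bool:
    """Call fn and report whether the call was rejected with TypeError."""
    try:
        fn(*args, **kwargs)
    except TypeError:
        return True
    return False


loose = Mock()
strict = create_autospec(send_push)

# Drift 1: a required argument was dropped.
loose_missing = raises_type_error(loose, "u1")
strict_missing = raises_type_error(strict, "u1")

# Drift 2: a keyword argument was renamed.
loose_renamed = raises_type_error(loose, "u1", msg="hi")
strict_renamed = raises_type_error(strict, "u1", msg="hi")

# Not covered: argument *types*. Only the call shape is checked.
strict_wrong_types = raises_type_error(strict, 123, None)

print(loose_missing, strict_missing, loose_renamed, strict_renamed, strict_wrong_types)
```

The autospec'd double rejects both kinds of drift, while the loose Mock accepts everything. Type annotations are not enforced at call time, so a call with the right shape but wrong types still goes through.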

🔭 Coming in Step 6: You can build any of the three roles and you know the patching pitfalls. The harder skill is choosing which one, and choosing none at all when over-mocking would make the test brittle.

Starter files

push_notifier.py

"""The real push-notification service — would call APNS / FCM in production."""


def send_push(user_id: str, message: str) -> None:
    # In production: dispatches a real push notification.
    # The print is a teaching aid — if you see this in test output,
    # the patch DIDN'T intercept and the real function ran.
    print(f"📲 REAL send_push fired: user={user_id!r}, message={message!r}")

quest_service.py

"""QuestForge — daily quest service with milestone celebration."""
import datetime
from push_notifier import send_push


QUEST_REWARDS = {
    "Slay the Slime Lord":   100,
    "Find the Lost Amulet":  150,
    "Battle the Lich King":  250,
    "Defeat the Dragon":     500,
}


def is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    today = clock.now().strftime("%Y-%m-%d")
    return today == event_date_str


class DailyQuestService:
    def __init__(self, clock, api, ledger=None):
        self._clock = clock
        self._api = api
        self._ledger = ledger

    def daily_quest_title(self, user_id: str) -> str:
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"

    def complete_quest(self, user_id: str, quest_title: str) -> None:
        gold = QUEST_REWARDS.get(quest_title, 0)
        self._ledger.credit(user_id, gold)

    def award_streak_bonus(self, user_id: str, days: int) -> None:
        gold = min(days * 10, 100)
        self._ledger.credit(user_id, gold)

    def celebrate_milestone(self, user_id: str, days: int) -> None:
        """When a streak hits a multiple of 7, send a push notification."""
        if days % 7 == 0:
            send_push(user_id, f"🎉 {days}-day streak!")

test_celebrate.py

"""Step 5 — Where-to-patch and autospec.

Three tests below. Tests B and C are correct as-is and demonstrate
autospec's value. Test A's PATCH TARGET IS WRONG — fix it.
"""
from unittest.mock import Mock, patch
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService


def _service():
    return DailyQuestService(FrozenClock(datetime(2026, 4, 28, 12, 0)), Mock(), Mock())


# ===== TEST A — Part A: patch target is WRONG. Fix it. =====
# Run this test as-is. It FAILS — `mock_send.assert_called_once_with(...)`
# complains the mock was never called. That's the symptom of a
# wrong-namespace patch: the real send_push ran, the mock did nothing.
# YOUR JOB: change the patch target string from "push_notifier.send_push"
# to the correct one. Read `quest_service.py`'s import line — the SUT
# looks the name up in *which* namespace?
def test_celebrate_milestone_sends_push():
    service = _service()
    # ← FIX THE STRING BELOW. It's wrong.
    with patch("push_notifier.send_push") as mock_send:
        service.celebrate_milestone("u1", 7)
    mock_send.assert_called_once_with("u1", "🎉 7-day streak!")


# ===== TEST B — Part B: a LOOSE Mock accepts a wrong-signature call =====
# The real send_push takes 2 arguments (user_id, message).
# Without autospec, the Mock will silently accept a 1-argument call.
# Watch what gets through.
def test_loose_mock_accepts_wrong_call():
    with patch("quest_service.send_push") as mock_send:
        # Imagine a teammate's refactor that drops the message arg
        # (real production bug). The Mock has no spec — it accepts.
        mock_send("u1")  # Real send_push REQUIRES 2 args; Mock doesn't care.
    # The recorded call passes assertion. The bug slipped through.
    mock_send.assert_called_once_with("u1")


# ===== TEST C — Part B: autospec REJECTS the wrong-signature call =====
# With autospec=True, the Mock matches the real function's signature.
# Calling it with the wrong number of arguments raises TypeError.
def test_autospec_rejects_wrong_call():
    with patch("quest_service.send_push", autospec=True) as mock_send:
        try:
            mock_send("u1")  # Same bad call as Test B — autospec catches it
            assert False, "autospec should have raised TypeError"
        except TypeError as e:
            # autospec correctly rejected the call. The signature was enforced.
            print(f"✅ autospec caught it: {e}")

Solution

test_celebrate.py

"""Step 5 solution — patch target fixed to where the SUT looks up the name."""
from unittest.mock import Mock, patch
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService


def _service():
    return DailyQuestService(FrozenClock(datetime(2026, 4, 28, 12, 0)), Mock(), Mock())


def test_celebrate_milestone_sends_push():
    service = _service()
    # quest_service.py does `from push_notifier import send_push`.
    # That binds the name in quest_service's namespace — so we patch THERE.
    with patch("quest_service.send_push") as mock_send:
        service.celebrate_milestone("u1", 7)
    mock_send.assert_called_once_with("u1", "🎉 7-day streak!")


def test_loose_mock_accepts_wrong_call():
    with patch("quest_service.send_push") as mock_send:
        mock_send("u1")
    mock_send.assert_called_once_with("u1")


def test_autospec_rejects_wrong_call():
    with patch("quest_service.send_push", autospec=True) as mock_send:
        try:
            mock_send("u1")
            assert False
        except TypeError as e:
            print(f"✅ autospec caught it: {e}")

The patch target is "quest_service.send_push", NOT "push_notifier.send_push". The reason:

quest_service.py does from push_notifier import send_push.
After that import, send_push is bound in quest_service’s namespace.
When celebrate_milestone calls send_push(...), Python looks up send_push in quest_service’s namespace.
patch("push_notifier.send_push") only replaces the binding in push_notifier’s namespace — but quest_service already has its own reference, so the patch has no effect.

Tests B and C demonstrate the autospec defense: a loose Mock accepts any call signature, while autospec=True enforces the real function’s signature and raises TypeError on a mismatch.

Where to Patch + autospec — Knowledge Check

Min. score: 80%

1. quest_service.py does:

from push_notifier import send_push

and celebrate_milestone calls send_push(...). Which patch target intercepts the call?

patch("push_notifier.send_push") — patch where the function is defined
Patches the binding in push_notifier’s namespace — but quest_service already has its own reference (created by the from ... import line). The SUT’s call ignores the patched binding and uses the local reference. Real function runs; mock is never called. Test fails (or worse, passes silently if no mock-call assertion).
patch("quest_service.send_push") — patch where the SUT looks up the name
Right. After from push_notifier import send_push, the name send_push is bound in quest_service’s namespace. The SUT’s send_push(...) call resolves there. Patching that exact namespace replaces the SUT’s reference — the patch intercepts.
Either one works; both refer to the same function
They refer to the same underlying function object but they are distinct namespace bindings. Patching one does not affect the other. This is the entire essence of the where-to-patch trap.
Neither — from X import Y makes the function un-patchable
It’s absolutely patchable — you just have to patch the right namespace. Python’s from ... import doesn’t disable patching; it just creates a binding the patch has to target precisely.

The rule: patch where the SUT looks up the name, not where it was defined. After from X import Y, the name Y is bound in the importing module — that’s where the SUT will resolve it. The same principle applies to JavaScript CommonJS, Java static imports, and any language with import scoping.

2. What does autospec=True primarily defend against?

Typos in assert_* method names like assrt_called_once_with
Half-myth. autospec constrains the Mock to the real object’s attributes; assert_* methods are part of Mock’s interface, not the real function’s. Whether autospec catches assrt_called_once_with depends on subtle interactions in different Python versions. The reliable typo defense is mypy/pylint.
Calling the patched function with the wrong number or names of arguments — autospec enforces the real function’s signature, raising TypeError if you violate it
Right. With autospec=True, the Mock’s __call__ enforces the patched function’s signature. mock_send("u1") for a function that needs (user_id, message) raises TypeError immediately. This catches signature-drift bugs that a loose Mock would silently accept.
Slow tests — autospec speeds up Mock construction
Autospec is slower than a loose Mock (it inspects the real object’s signature on construction). The benefit is correctness, not speed.
Forgetting to call mock.reset_mock() between tests
reset_mock and autospec are independent concerns. Autospec is about call signatures; reset_mock is about clearing recorded state between assertions.

autospec=True is the default-safe habit for patched callables: it makes the mock as strict as the real thing it’s replacing. Signature drift (a common refactoring bug) gets caught immediately. Use it unless you have a reason not to.

3. (Spaced review) Match each Meszaros pattern to its book page:

Test Stub p.529, Test Spy p.538, Mock Object p.544
Right. xUnit Test Patterns (Meszaros, 2007): Test Stub p.529, Test Spy p.538, Mock Object p.544 — the canonical references. The umbrella pattern Test Double sits at p.522.
Test Stub p.522, Test Spy p.529, Mock Object p.538
Close, but the page numbers are shifted. Test Double (the umbrella) is at p.522; the three specific roles start at p.529 (Stub), p.538 (Spy), and p.544 (Mock Object).
All three on the same page (p.522)
Each of the five double types has its own dedicated section. The book is a pattern catalog, so every pattern gets its own entry.
The book doesn’t define them — Fowler does
Fowler’s “Mocks Aren’t Stubs” article popularized the taxonomy, but the canonical book-form references are Meszaros’. Both are worth reading; Meszaros is the encyclopedia, Fowler is the essay.

Meszaros (2007) is the canonical reference. The page numbers come up regularly in code reviews, design discussions, and Stack Overflow answers — knowing them lets you point to the right chapter when the team is debating which double to use.

4. (Spaced review — Step 4) A Mock is patched in for the SUT’s collaborator. The test asserts mock.method.assert_called_once_with("u1", 100). What role is this Mock playing?

Stub — the collaborator returns a Mock object
Stub provides canned input to the SUT. This test isn’t using the Mock to feed an answer in — it’s verifying a call went out. Wrong direction.
Spy — the test asserts on what the SUT did (the recorded call), inspecting after the fact
Defensible. The assert_called_* style is post-execution inspection of recorded calls, which is closer to a Spy. (Some authors put assert_called_* cleanly in the Spy camp.)
Mock Object — the test sets a strict expectation on the call
Also defensible. The single-call expectation assert_called_once_with(...) IS a strict expectation on a specific interaction — Meszaros’ Mock Object territory. (Some authors put assert_called_* in the Mock Object camp.)
Either Spy or Mock Object — the boundary depends on whether the expectation is configured up-front (Mock Object) or inspected after the fact via assert_called_* (Spy-leaning)
Right. The line between Spy and Mock Object is fuzzier in unittest.mock than in Meszaros’ original taxonomy because the same Mock class can do either. What decides the role is how the test uses the Mock (expectation set up-front vs. calls inspected afterwards), not which class was instantiated. Step 4’s lesson, “the role isn’t determined by the class,” applies again here.

unittest.mock blurs the Spy/Mock-Object line that Meszaros drew crisply. Both are forms of behavior verification; they differ mainly in whether the expectation is set up-front (mockist style) or read after-the-fact (spy style). For your day-to-day work: don’t worry too much about which side of the line you’re on — worry about whether the test actually verifies the contract.
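To make the up-front vs. after-the-fact distinction concrete, here is a sketch with one hand-rolled Mock Object and one unittest.mock spy. `ExpectingLedger` and the free function `award_streak_bonus` are hypothetical, written for this illustration only.

```python
from unittest.mock import Mock


def award_streak_bonus(ledger, user_id: str, days: int) -> None:
    """Same rule as DailyQuestService.award_streak_bonus, as a free function."""
    ledger.credit(user_id, min(days * 10, 100))


class ExpectingLedger:
    """Hypothetical hand-rolled Mock Object: expectation first, then exercise."""

    def __init__(self):
        self._expected = None
        self._calls = 0

    def expect_credit(self, user_id, gold):
        self._expected = (user_id, gold)

    def credit(self, user_id, gold):
        # Fails at the moment of the call if it differs from the expectation.
        assert (user_id, gold) == self._expected, f"unexpected credit {(user_id, gold)}"
        self._calls += 1

    def verify(self):
        assert self._calls == 1, f"expected 1 credit call, got {self._calls}"


# Mock Object style: declare the expectation up-front.
strict_ledger = ExpectingLedger()
strict_ledger.expect_credit("u1", 30)
award_streak_bonus(strict_ledger, "u1", 3)
strict_ledger.verify()

# Spy style: exercise first, inspect the recording afterwards.
spy_ledger = Mock()
award_streak_bonus(spy_ledger, "u1", 3)
spy_ledger.credit.assert_called_once_with("u1", 30)
```

Both tests verify the same contract. The practical difference is where a violation surfaces: inside the SUT's call for the up-front style, at the assertion line for the spy style.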

5. (Spaced review — Step 4 typo trap) What’s the most reliable defense against typos like mock.assrt_called_once_with(...) silently passing?

Always use autospec=True
Autospec primarily catches call-signature drift: wrong number or names of arguments to the patched callable (it does not check argument types). Whether it catches typos in assert_* methods is version-dependent and not reliable. Don’t lean on autospec for this.
Run a static type checker (mypy / pyright) or linter (pylint) — it’ll flag the missing attribute on Mock. Pair that with code review.
Right. mypy / pyright understand Mock’s typing and flag missing methods. pylint catches the typo statically. Code review catches what tooling misses. This combination is robust — autospec adds defense-in-depth but isn’t sufficient on its own.
Memorize the spelling of every assert_* method
Memorization is fragile and doesn’t help when you’re tired or rushed. Static tooling is what scales — let the computer remember the right spelling.
Use Mock(spec_set=True) — it makes Mock immutable
spec_set=True blocks setting new attributes (so m.foo = ... would fail). It doesn’t reliably block reading nonexistent attributes (so m.assrt_called_once_with(...) may still slip through depending on the spec). Use mypy/pyright.

Static tooling > runtime defense for spelling. mypy / pyright understand unittest.mock’s type stubs and catch typos like assrt_called_once_with at edit time, before the test ever runs.

6

When NOT to Use a Double — The Decision Guide

🎯 Goal: Build the judgment to not reach for a double when a real collaborator would do. Recognize over-mocking as brittleness. Apply a decision guide to five real scenarios. 🧠 Skills you’ll gain: Diagnose an over-mocked test by reading it. Classify scenarios by which double (or none) fits. Recognize the “mock what you own” heuristic and the Adapter wrap-and-mock pattern.

🧭 The whole arc, in one sentence. A test double is a tool you reach for when a real collaborator would make the test flaky, slow, or unable to verify the right thing. It is not a default. It is not a sign of professionalism. It is not a coverage strategy. The right number of doubles for many tests is zero.

📖 The decision flow

flowchart TD
    A["What does this test need to verify?"]:::neutral --> B{"Does the SUT have collaborators<br/>worth doubling?<br/>(slow/flaky/unavailable)"}
    B -->|"No — pure function"| NO["No double<br/>Just call it"]:::good
    B -->|"Yes"| C{"Does the test need to control<br/>the SUT's indirect input?"}
    C -->|"Yes — control input"| STUB["Stub<br/>(canned answers)"]:::good
    C -->|"No — verify a call happened"| D{"Inspect after the fact<br/>or set up-front?"}
    D -->|"After"| SPY["Spy<br/>(record + assert)"]:::good
    D -->|"Up-front strict"| MOCK["Mock Object<br/>(behavior verification)"]:::good
    B -->|"Yes — but stateful + multi-call"| FAKE["Fake<br/>(in-memory implementation)"]:::good
    B -->|"Third-party library<br/>you don't own"| ADAPT["Wrap in an Adapter<br/>then double the adapter"]:::warn
    classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef warn fill:#fff3e0,stroke:#e65100,color:#bf360c
    classDef neutral fill:#fafafa,stroke:#bdbdbd,color:#424242

📖 Three antipatterns to recognize on sight

Antipattern	Symptom	Why it happens	Fix
Over-mocking	Every internal helper is mocked; the test asserts only on the mocks.	“Isolation feels safe; more mocks = more tested.”	Mock at the architectural boundary (HTTP, DB, clock), not at every internal function.
Mocking what you don’t own	A third-party library’s API is mocked directly, scattered across many tests.	The library is brittle and the team doesn’t want to wait for real responses.	Wrap the third-party in an Adapter (Adapter pattern); mock the Adapter. The third-party’s internals stay invisible to your tests.
Coverage chasing	Every line of the SUT runs in some test, but assertions are weak (`is not None`) or mocked-on-mocks.	Coverage is misread as a quality signal.	Stronger oracles, real collaborators where possible, fewer tests that test more meaningfully. Coverage ≠ correctness (Testing Foundations Step 3).
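A minimal sketch of the wrap-and-mock fix from the second row. The adapter class `HttpJsonClient`, the function `fetch_quest_title`, and the `example.invalid` URL are all hypothetical; the standard library's `urllib.request` stands in for whichever third-party HTTP library a project actually uses.

```python
import json
import urllib.request
from unittest.mock import create_autospec


class HttpJsonClient:
    """Hypothetical adapter: the only place that touches the HTTP library."""

    def get_json(self, url: str) -> dict:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)


def fetch_quest_title(client: HttpJsonClient, quest_id: str) -> str:
    """SUT: depends on the adapter we own, not on the HTTP library."""
    data = client.get_json(f"https://example.invalid/quests/{quest_id}")
    return data["title"]


# In the test, double the adapter. No network, no patching of the library.
client = create_autospec(HttpJsonClient, instance=True)
client.get_json.return_value = {"title": "Defeat the Dragon"}

title = fetch_quest_title(client, "q1")

assert title == "Defeat the Dragon"
client.get_json.assert_called_once_with("https://example.invalid/quests/q1")
```

If the HTTP library is swapped later, only `HttpJsonClient` changes. Tests that double the adapter keep working because they never mention the library.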

Part 1 — Read the over-mocked vs clean tests

Open xp_calculator.py. The function compute_total_xp(quests) is pure: it takes a list, computes a number, returns it. No clock, no HTTP, no database. No collaborators worth doubling. Yet test_xp_overmocked.py mocks every internal helper.

⚙️ Task 1: read both test_xp_overmocked.py and test_xp_clean.py. In test_xp_clean.py, fill in the placeholder docstring at the top with your one-line answer to: “What did the over-mocked version mock unnecessarily, and what did that cost?”

📖 What the over-mocked test actually verifies (look only after writing your answer)

Look at test_xp_overmocked.py. The mocks intercept _filter_completed, _apply_multipliers, and _sum_xp. With those internals replaced by Mocks returning canned values, the test only verifies that compute_total_xp calls the helpers in some order and returns the last one’s result. That’s not the spec. The spec is “given these quest dicts, return the total XP.”

Worse: if a teammate refactors the internals (rename _apply_multipliers to _apply_modifiers; merge two helpers into one; inline a helper away entirely), every one of those changes preserves the function’s behavior — but breaks the over-mocked test. Brittleness without protection. The clean test never breaks under those refactors because it asserts on the spec, not on the implementation choreography.

Same lesson as Testing Foundations Step 4 (“test behavior, not implementation”), now applied to mocks instead of internal state access. The principle is one principle.
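The brittleness can be reproduced in a few lines. The sketch below assumes a hypothetical refactor in which all three helpers are inlined; the in-memory module name `demo_xp_calculator` is a stand-in so the real `xp_calculator.py` is left alone.

```python
import sys
import types
from unittest.mock import patch

# Hypothetical refactor: same behavior, helpers inlined away.
refactored = types.ModuleType("demo_xp_calculator")
exec(
    "def compute_total_xp(quests):\n"
    "    return sum(q['xp'] * q.get('multiplier', 1)\n"
    "               for q in quests if q.get('completed'))\n",
    refactored.__dict__,
)
sys.modules["demo_xp_calculator"] = refactored

quests = [
    {"title": "Slay", "xp": 50, "completed": True, "multiplier": 2},
    {"title": "Find", "xp": 30, "completed": False, "multiplier": 1},
    {"title": "Defeat", "xp": 100, "completed": True, "multiplier": 1},
]

# The clean test's assertion still holds after the refactor.
clean_result = refactored.compute_total_xp(quests)

# The over-mocked test's setup cannot even start: the helper is gone.
try:
    with patch("demo_xp_calculator._apply_multipliers"):
        pass
    overmocked_setup_works = True
except AttributeError:
    overmocked_setup_works = False

print(clean_result, overmocked_setup_works)
```

The behavior is unchanged, yet the over-mocked test fails at patch time because it is coupled to a helper name that no longer exists.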

Part 2 — Classify five scenarios

Open scenarios.py. For each of the five scenarios, set the variable to the best single recommendation from this list:

"no_double"   "stub"   "spy"   "mock"   "fake"   "adapter"

The validator accepts any defensible answer for each scenario (some scenarios have more than one defensible answer — e.g., spy and mock are often interchangeable for a single outbound call). It rejects clearly wrong choices.

🧰 Quick decision rubric (use, don't memorize)
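A minimal sketch of such a rubric, written as a function that follows the decision flow earlier in this step. The function name `recommend_double` and its keyword flags are assumptions made for illustration; real decisions need more judgment than a handful of booleans can carry.

```python
def recommend_double(
    *,
    has_collaborator: bool,
    third_party: bool = False,
    stateful_multi_call: bool = False,
    needs_input_control: bool = False,
    strict_up_front: bool = False,
) -> str:
    """Return one of the labels used in scenarios.py (illustrative only)."""
    if not has_collaborator:
        return "no_double"   # pure function: just call it
    if third_party:
        return "adapter"     # wrap what you don't own, then double the wrapper
    if stateful_multi_call:
        return "fake"        # in-memory implementation across many calls
    if needs_input_control:
        return "stub"        # canned answers for indirect input
    # Otherwise the test verifies an outbound call.
    return "mock" if strict_up_front else "spy"


print(recommend_double(has_collaborator=False))
print(recommend_double(has_collaborator=True, needs_input_control=True))
print(recommend_double(has_collaborator=True, stateful_multi_call=True))
```

The order of the checks mirrors the flowchart: rule out the no-double case first, then the structural answers (adapter, fake), and only then choose among stub, spy, and mock.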

🌍 The same decision in another language

The decision is purely about test design, not about syntax. JavaScript, Java, C#, Ruby, Go — every language with serious testing culture has the same five-or-so doubles, the same antipatterns, and the same heuristic: only mock what you own; only mock what’s actually a collaborator; pure functions don’t need doubles.

The frameworks differ; the design judgment doesn’t.

Part 3 — Forward pointers

You now have the conceptual vocabulary to read any test in any modern Python codebase and recognize what role each double is playing — even when the author called everything a “mock.” That recognition transfers across languages.

🔭 Where this leads in the rest of the curriculum:

solid.yml — Dependency Inversion makes doubles trivial: define an interface, have the SUT depend on it, swap implementations at test time. Most painful mocks are caused by skipped DIP.
observer-pattern.yml — the Observer pattern is essentially a spy made into a permanent design feature.
TDD with doubles — the next natural sequel: TDD where the SUT has collaborators from the start. Red phase becomes “decide what to double, then write the failing test.”

🪞 Recalibrate. Look back at Step 1 — the test that passed today and would have failed tomorrow. Your toolkit now has six things to do instead of “ship and pray”:

Recognize a flaky/slow/opaque collaborator (Step 1).
Inject the collaborator as a parameter (Step 1).
Substitute a stub when you need to control input (Step 2, Meszaros p.529).
Substitute a spy when you need to verify a call (Step 3, p.538).
Reach for unittest.mock when boilerplate gets tedious (Step 4) — but recognize the role you’re playing.
Use patch() carefully — where the SUT looks the name up — and prefer autospec=True (Step 5).

And the seventh, just learned: sometimes the right answer is no double at all. That judgment is what makes you good at this.

Starter files

xp_calculator.py

"""A PURE function for computing XP earned across quests.

No collaborators. No clock. No HTTP. No database.
Helper functions are private (underscore prefix) — implementation detail.
"""


def _filter_completed(quests: list[dict]) -> list[dict]:
    return [q for q in quests if q.get("completed")]


def _apply_multipliers(quests: list[dict]) -> list[tuple[str, int]]:
    return [(q["title"], q["xp"] * q.get("multiplier", 1)) for q in quests]


def _sum_xp(items: list[tuple[str, int]]) -> int:
    return sum(xp for _title, xp in items)


def compute_total_xp(quests: list[dict]) -> int:
    """Return the total XP earned from completed quests, with multipliers applied.

    Each quest is a dict with keys: title (str), xp (int), completed (bool),
    and an optional multiplier (int, default 1).
    """
    completed = _filter_completed(quests)
    with_multipliers = _apply_multipliers(completed)
    return _sum_xp(with_multipliers)

test_xp_overmocked.py

"""SMELL — every internal helper is mocked. Read this and recoil.

Notice what's actually verified: nothing about the SUT's behavior.
The mocks made up the answer; the SUT just orchestrated them.
"""
from unittest.mock import patch
from xp_calculator import compute_total_xp


def test_total_xp_overmocked_brittle():
    with patch("xp_calculator._filter_completed") as mock_filter, \
         patch("xp_calculator._apply_multipliers") as mock_apply, \
         patch("xp_calculator._sum_xp") as mock_sum:
        mock_filter.return_value = "<canned>"
        mock_apply.return_value = "<canned>"
        mock_sum.return_value = 200

        result = compute_total_xp([{"completed": True, "xp": 50}])

        assert result == 200
        # The "test" passes whether or not the SUT correctly filters,
        # multiplies, or sums — because we mocked all three.
        # If a teammate renames _apply_multipliers, this test breaks
        # for the WRONG reason (refactor, not behavior change).

test_xp_clean.py

"""Clean: no doubles. compute_total_xp is a pure function — exercise it directly."""
# TODO: in your own words, in ONE LINE, answer the question below.
# The validator just checks that this docstring is no longer empty.
"""The over-mocked version mocked: ___ FILL IN ___
What that cost: ___ FILL IN ___"""

from xp_calculator import compute_total_xp


def test_total_xp_for_two_completed_quests():
    quests = [
        {"title": "Slay",   "xp":  50, "completed": True,  "multiplier": 2},
        {"title": "Find",   "xp":  30, "completed": False, "multiplier": 1},
        {"title": "Defeat", "xp": 100, "completed": True,  "multiplier": 1},
    ]
    # 50*2 + (Find skipped: not completed) + 100*1 = 200
    assert compute_total_xp(quests) == 200


def test_total_xp_for_no_completed_quests():
    quests = [{"title": "Skip", "xp": 999, "completed": False}]
    assert compute_total_xp(quests) == 0

scenarios.py

"""Classify each scenario by the BEST single recommendation.

Allowed values:
  "no_double" — the SUT is pure (or close enough); call it directly
  "stub"      — control indirect input with canned values
  "spy"       — verify a fire-and-forget call after the fact
  "mock"      — strict behavior verification of a single contract call
  "fake"      — stateful in-memory implementation across multiple calls
  "adapter"   — wrap a third-party library, then double the adapter
"""

# Scenario 1: A pure function `compute_tax(price: float, rate: float) -> float`
# that returns price * rate. No collaborators.
SCENARIO_1_BEST = "FILL_IN"

# Scenario 2: A function `is_coupon_expired(coupon)` that calls datetime.now()
# internally to compare against `coupon.expires_at`. We want a deterministic test.
SCENARIO_2_BEST = "FILL_IN"

# Scenario 3: `process_order(order)` POSTs to a payment gateway. The test must
# verify the gateway was called exactly once with the right amount.
SCENARIO_3_BEST = "FILL_IN"

# Scenario 4: A `UserRepository` reads/writes user records to Postgres.
# The SUT under test does many round-trips: register a user, then look them up,
# then update their email, then look them up again. Tests run on CI without a DB.
SCENARIO_4_BEST = "FILL_IN"

# Scenario 5: Throughout the codebase, many modules call `requests.get(...)`
# directly. Patching `requests` everywhere is fragile; the tests are slow.
SCENARIO_5_BEST = "FILL_IN"

Solution

test_xp_clean.py

"""Clean: no doubles. compute_total_xp is a pure function.

The over-mocked version mocked every internal helper
(_filter_completed, _apply_multipliers, _sum_xp).

What that cost: the test verified nothing about the SUT's behavior, only
that the mocked helpers were called in some order. Any pure refactor
(renaming a helper, inlining one) would break the test even though
behavior is unchanged.
"""

from xp_calculator import compute_total_xp


def test_total_xp_for_two_completed_quests():
    quests = [
        {"title": "Slay",   "xp":  50, "completed": True,  "multiplier": 2},
        {"title": "Find",   "xp":  30, "completed": False, "multiplier": 1},
        {"title": "Defeat", "xp": 100, "completed": True,  "multiplier": 1},
    ]
    assert compute_total_xp(quests) == 200


def test_total_xp_for_no_completed_quests():
    quests = [{"title": "Skip", "xp": 999, "completed": False}]
    assert compute_total_xp(quests) == 0

scenarios.py

"""Classification of five scenarios."""

# Pure function — call it directly, no double needed.
SCENARIO_1_BEST = "no_double"

# Clock dependency — control indirect input via a stub.
SCENARIO_2_BEST = "stub"

# Fire-and-forget outbound call — verify it via spy or mock.
# ("spy" or "mock" both defensible — they overlap heavily in unittest.mock.)
SCENARIO_3_BEST = "mock"

# Stateful round-trip across many calls — Fake is the right tool.
# (Stub would need re-configuration between every call.)
SCENARIO_4_BEST = "fake"

# Third-party library used across many modules — Adapter pattern.
# Wrap `requests` in your own class; mock the adapter; never patch
# `requests` directly (don't mock what you don't own).
SCENARIO_5_BEST = "adapter"

Scenario 1 — pure function: compute_tax(price, rate) -> price * rate has zero collaborators. Just call it. Adding a double would be pure ceremony — slower, harder to read, no benefit.

Scenario 2 — clock dependency: the canonical stub use case. Inject a FrozenClock-style stub (or use Mock(return_value=...) if you’ve moved on from hand-rolling) so the test pins a specific date.
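A minimal sketch of the stub approach for Scenario 2. The scenario only says that `is_coupon_expired` calls `datetime.now()` internally, so the `clock` parameter, the `Coupon` dataclass, and the `FrozenClock` class below are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Coupon:
    # Hypothetical coupon type; the scenario only names `expires_at`.
    expires_at: datetime


class FrozenClock:
    """Hand-rolled stub: always reports the instant it was built with."""

    def __init__(self, fixed: datetime):
        self._fixed = fixed

    def now(self) -> datetime:
        return self._fixed


def is_coupon_expired(coupon: Coupon, clock=datetime) -> bool:
    # Assumed seam: production callers omit `clock` and get datetime.now().
    return clock.now() > coupon.expires_at


def test_coupon_is_expired_the_day_after():
    coupon = Coupon(expires_at=datetime(2026, 4, 28, 23, 59))
    clock = FrozenClock(datetime(2026, 4, 29, 0, 0))
    assert is_coupon_expired(coupon, clock) is True


def test_coupon_is_valid_before_expiry():
    coupon = Coupon(expires_at=datetime(2026, 4, 28, 23, 59))
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    assert is_coupon_expired(coupon, clock) is False
```

Both tests pin the date themselves, so they give the same result on any day they run.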

Scenario 3 — verify the payment-gateway call: spy or mock both work. unittest.mock’s Mock + assert_called_once_with blurs the line; either label is defensible. The test verifies the call (a behavior verification), so this is fundamentally a Mock-Object-role scenario in Meszaros’ strict sense.
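One way the Scenario 3 verification could look. The scenario gives only `process_order(order)`, so the injected `gateway` parameter, its `charge` method, and the shape of the order dict are hypothetical.

```python
from unittest.mock import Mock


def process_order(order: dict, gateway) -> None:
    # Hypothetical SUT: charges the order total through an injected gateway.
    gateway.charge(amount=order["total"], currency=order["currency"])


def test_process_order_charges_gateway_once():
    # spec=["charge"] limits the double to the one method we expect to exist.
    gateway = Mock(spec=["charge"])

    process_order({"total": 4200, "currency": "USD"}, gateway)

    # Fails if charge was never called, called twice, or called with other args.
    gateway.charge.assert_called_once_with(amount=4200, currency="USD")
```

The assertion runs after the SUT has been exercised, which is why the same code can fairly be labeled a spy or a mock.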

Scenario 4 — stateful Postgres round-trip: Fake is the right tool. A stub would need separate canned answers for every call in the sequence (write, read, update, read again) — tedious and wrong-shaped. An in-memory dict-backed FakeUserRepository “just works” across the sequence.
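A sketch of what a dict-backed fake might look like for Scenario 4. The method names `save` and `find` and the `change_email` SUT are assumptions; the scenario names only `UserRepository` and the register, look up, update, look up sequence.

```python
class FakeUserRepository:
    """In-memory stand-in for the Postgres-backed repository."""

    def __init__(self):
        self._users = {}

    def save(self, user_id: str, email: str) -> None:
        self._users[user_id] = {"id": user_id, "email": email}

    def find(self, user_id: str):
        user = self._users.get(user_id)
        # Return a copy so callers cannot mutate the stored record directly.
        return dict(user) if user else None


def change_email(repo, user_id: str, new_email: str) -> None:
    # Hypothetical SUT: reads the user, then writes the updated record.
    if repo.find(user_id) is None:
        raise KeyError(user_id)
    repo.save(user_id, new_email)


def test_register_lookup_update_lookup():
    repo = FakeUserRepository()

    repo.save("u1", "old@example.com")
    assert repo.find("u1")["email"] == "old@example.com"

    change_email(repo, "u1", "new@example.com")
    assert repo.find("u1")["email"] == "new@example.com"
```

The fake remembers writes between calls, so the test needs no per-call canned answers.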

Scenario 5 — third-party library: Adapter pattern. Wrap requests in your own thin class (e.g., HttpClient), have all your modules depend on HttpClient, then mock HttpClient. The third-party stays invisible to your tests. This is the “only mock what you own” heuristic in action — Hynek Schlawack’s classic essay covers this well, and Meszaros covers it as the Test Adapter pattern (informally).
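A minimal sketch of the adapter idea for Scenario 5. `HttpClient` is the name used in the paragraph above; its `get_json` method, the `fetch_quest_title` SUT, and the placeholder URL are assumptions for illustration.

```python
from unittest.mock import create_autospec


class HttpClient:
    """Thin adapter you own; the only place that touches `requests`."""

    def get_json(self, url: str) -> dict:
        import requests  # third-party; imported lazily so tests never need it

        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.json()


def fetch_quest_title(http: HttpClient, quest_id: int) -> str:
    # Hypothetical SUT: depends on the adapter, never on `requests` itself.
    data = http.get_json(f"https://api.example.test/quests/{quest_id}")
    return data["title"]


def test_fetch_quest_title_uses_adapter():
    # The double mirrors HttpClient's real method signatures.
    http = create_autospec(HttpClient, instance=True)
    http.get_json.return_value = {"title": "Slay"}

    assert fetch_quest_title(http, 7) == "Slay"
    http.get_json.assert_called_once_with("https://api.example.test/quests/7")
```

Because the test doubles `HttpClient`, it never imports or patches `requests`, and the adapter's one-method surface is all it has to keep in sync.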

Decision Guide — Synthesis Quiz

Min. score: 80%

1. A test mocks every internal helper of the SUT and asserts only on the mocks’ return values. Which antipattern is this?

Behavior verification — the test checks how the SUT works
This is over-mocking, not behavior verification. Behavior verification (Meszaros p.468) is one call against an architectural-boundary collaborator — not every internal helper. Mocking internals couples the test to implementation choreography rather than to the spec.
Over-mocking. The test verifies orchestration, not behavior. A pure refactor that renames or merges any internal helper breaks the test even though behavior is unchanged
Right. Mocks should sit at architectural boundaries (HTTP, DB, clock, notifier) — not at every internal helper. Mocking internals creates brittle tests: any refactor that preserves behavior but rearranges helpers breaks the test for the wrong reason. Same lesson as Testing Foundations Step 4 (“behavior, not implementation”), in mock-shaped clothing.
Solitary unit testing — the canonical and recommended style
Solitary testing means “isolate the SUT from external collaborators (DBs, clocks, networks).” It does not mean “mock every internal helper.” Internal helpers belong to the SUT’s own module — mocking them is over-mocking. Solitary doesn’t endorse this.
Liar test — the assertions don’t actually run
Liar tests have weak oracles (is not None). The over-mocked test’s assertions ARE running and are technically strong (== against a canned value). The problem is what they assert about — implementation details, not the spec.

Mock at the architectural boundary; let internal helpers be real. The line “this collaborator is worth doubling” runs through the boundary between your code and the unpredictable world (clock, HTTP, DB, queue) — not through every function-call edge inside your own module.

2. (Cumulative review) Match each scenario to the best single double:

A: A pure function that adds two integers
B: A function that calls datetime.now() to decide an expiration
C: A function that POSTs to a payment gateway, fire-and-forget
D: A function that round-trips with a Postgres user table 5 times

A: stub, B: stub, C: mock, D: fake
A is wrong. A pure integer-adding function has no collaborator — there’s no place to plug a stub. Doubling it is pure ceremony with no benefit.
A: no_double, B: stub, C: mock (or spy), D: fake
Right. A: pure function → no double. B: clock → stub. C: outbound call → mock or spy (interchangeable in unittest.mock). D: stateful round-trip → fake.
A: mock, B: mock, C: mock, D: mock — all are mocks
Conflating Mock the class with Mock the role. Pure functions don’t need any double; clock stubs return canned values (stub role), not strict expectations (mock role); stateful round-trips need fakes.
A: spy, B: spy, C: spy, D: spy — spies are universally safe
Spies record calls. A pure function doesn’t make outbound calls (nothing to record). A clock-dependency test wants to control input (stub), not observe output. Spy isn’t universally safe; it’s specifically for fire-and-forget output verification.

The rubric: pure → no double; non-deterministic → stub; outbound call → spy/mock; stateful sequence → fake. Memorize the rubric shape (the diagram in the instructions); the words follow.

3. “Don’t mock what you don’t own.” What does this rule actually mean?

Never use unittest.mock — only roll your own classes
unittest.mock is fine — you can use it on objects you own. The rule is about what you mock, not which library you use.
When you depend on a third-party library, wrap it in your own thin Adapter class first; then mock the Adapter, not the third-party. Your tests stay decoupled from the third-party’s internals
Right. Wrap third-party libraries in your own Adapter (Adapter pattern) so your code depends on your type. Then mock that type. Benefits: tests don’t break when the third-party releases a new version; the mock surface is tiny and stable; you can swap the underlying library if needed. Hynek Schlawack’s essay “Don’t Mock What You Don’t Own” lays this out crisply.
Only mock objects you instantiated yourself in the test
Object-instance ownership isn’t the rule. The rule is about interface ownership — whose contract you’re depending on.
Don’t share mocks between test files
Sharing mocks across test files is its own concern (often a bad idea), but it’s unrelated to the “mock what you own” rule.

"Mock what you own" is shorthand for "depend on interfaces you control, then mock those interfaces." The Adapter pattern from the classic design-patterns literature is exactly the maneuver this rule recommends.

4. (Spaced review — TDD) During Red-Green-Refactor, when do you typically decide which double to use?

Refactor — you start with real collaborators and double them later
Refactor changes structure under a green safety net. Choosing a double mid-refactor would change what the test verifies, which violates the safety net principle.
Red — choosing the double is part of test design, which happens when you write the failing test
Right. Red is the test-design moment. The choice of stub vs spy vs mock vs fake vs no-double shapes both the test’s structure AND (often) the production design that emerges in Green. Choosing late means rewriting the test.
Green — you add doubles when the test is red and you need to make it pass
Green is just “make the failing test pass with the smallest code change.” Adding a double during Green would mean modifying the test, which corrupts the discipline (you’re chasing the test rather than letting it drive).
It doesn’t matter which phase — doubles are an implementation detail
It does matter. The double choice is a test-design decision that affects what the test verifies and how the production code is shaped. Treating it as an implementation detail leads to over-mocking and brittle suites.

Choosing a double is part of test design; test design happens in Red. Same lesson as Testing Foundations Step 5: input choice and oracle strength are independent test-design dimensions, both decided when you write the test. Add "choice of double" as a third independent dimension.

5. (Spaced review: Step 5) Why is autospec=True almost always worth reaching for when you patch a callable?

It runs the patched function in a separate process for safety
No process isolation involved. autospec is a runtime introspection of the patched object’s signature.
It enforces the real callable’s signature on the Mock — so the moment a teammate’s refactor changes the production signature, the test’s calls to the mock raise TypeError immediately, instead of silently accepting drift
Right. autospec is a design guardrail: “make the mock as strict as the real thing.” Signature drift is a common refactoring bug, and autospec catches it the moment the test runs. The cost is a few extra characters; the benefit is protection against that whole class of bugs.
It catches typos in assert_* method names reliably
Half-myth. autospec primarily enforces call signatures, not assertion-method spelling. The reliable typo defense is mypy/pylint.
It’s required by the Mock library — without it, patches don’t apply
Patches work without autospec; they just don’t enforce signatures. autospec is an opt-in strict mode, not a requirement.

Default-safe habit: use autospec=True whenever you’re patching a callable. It costs nothing at edit time, catches a real-world bug class at test time, and makes refactoring safer in the long run.
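A small sketch of the guardrail in action. It uses `create_autospec`, the function that `patch(..., autospec=True)` relies on to build its mock, so that the example stays in one file. `send_reward` is a hypothetical production callable.

```python
from unittest.mock import create_autospec


def send_reward(player_id: str, gold: int) -> None:
    # Hypothetical production callable standing in for a real collaborator.
    raise RuntimeError("would hit the network")


def test_autospec_rejects_a_call_with_the_wrong_signature():
    strict = create_autospec(send_reward)

    strict("p1", gold=10)  # matches the real signature, so it is accepted
    strict.assert_called_once_with("p1", gold=10)

    try:
        strict("p1")  # missing `gold`: the real function would reject this too
    except TypeError:
        rejected = True
    else:
        rejected = False
    assert rejected
```

A plain `Mock()` would have accepted the second call silently, which is the drift the habit is meant to catch.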

Test Doubles — Stubs, Spies, and Mocks

The Test That Lied: A Test That Passes Today and Fails Tomorrow

🧭 What you already know — and what’s about to shift

📖 New vocabulary (visible glossary)

Solution

Why Test Doubles? — Knowledge Check

Hand-Rolled Stub: A Clock That Always Says Tuesday

📖 The verbatim teaching sentence

📖 What is a Test Stub? (Meszaros, xUnit Test Patterns, p.529)

Solution

Test Stubs — Knowledge Check

Hand-Rolled Spy: Did the Ledger Actually Get the Gold?

📖 What is a Test Spy? (Meszaros, xUnit Test Patterns, p.538)

📖 The hard part isn’t writing the spy — it’s writing the assertion

Solution

Test Spies — Knowledge Check

Meet `unittest.mock`: Same Roles, Less Typing

📖 The verbatim teaching sentence — louder this time

⚠️ Why this matters for your career

📖 return_value vs side_effect — concept-level contrast

Solution

unittest.mock — Knowledge Check

Where to Patch — The #1 Python Pitfall, and Why autospec Defends You

📖 The new SUT — celebrate_milestone

📜 The rule

Part A — Predict and fix the patch target

Part B — autospec is a design guardrail, not a syntactic flourish

Solution

Where to Patch + autospec — Knowledge Check

When NOT to Use a Double — The Decision Guide

📖 The decision flow

📖 Three antipatterns to recognize on sight

Part 1 — Read the over-mocked vs clean tests

Part 2 — Classify five scenarios

Part 3 — Forward pointers

Solution

Decision Guide — Synthesis Quiz
