Test Doubles — Stubs, Spies, and Mocks
Learn to test code that depends on a clock, an HTTP service, a database, or a notification system — without actually hitting them. Concepts first; the Python+pytest syntax is provided so you can focus on the test-design decisions.
The Test That Lied: A Test That Passes Today and Fails Tomorrow
Why this matters
Some tests ship green and rot on a schedule. A teammate writes a test on April 28 asserting is_today_event_day("2026-04-28") returns True, the PR merges, and the next day — without a single code change — CI turns red. The hidden dependency is the wall clock; the test never really verified the function’s behavior. Recognizing those uncontrolled collaborators (clocks, HTTP, databases) and carving out a seam to substitute them is the foundation every other test-double technique builds on.
🎯 You will learn to
- Diagnose when a real collaborator makes a test non-deterministic
- Apply Dependency Injection to introduce a seam the test can swap out
- Analyze the difference between a test that passes and one that actually verifies behavior
📐 Two panes: production code is on the left; tests are on the right. Files prefixed test_ route to the right pane automatically; everything else lands on the left.
🧭 What you already know — and what’s about to shift
From Testing Foundations you know how to write a strong oracle, choose partition + boundary inputs, and avoid peeking at private state. From TDD you know the Red-Green-Refactor rhythm. Every example so far has had one thing in common: the function under test was self-contained. Pass it inputs, observe the output, done.
Real code is rarely like that. Real functions talk to collaborators — clocks, network APIs, databases, payment gateways, email services. Each of those collaborators turns a deterministic test into a flaky test, a slow test, or — worst — a test that appears green but actually never exercised the behavior you cared about. This entire tutorial is about that problem.
📖 New vocabulary (visible glossary)
| Term | Meaning |
|---|---|
| System Under Test (SUT) | The code being tested. Here: is_today_event_day. |
| Collaborator | Anything the SUT calls into. Here: datetime.now(). |
| Indirect input | A value the SUT receives from a collaborator (rather than from its caller). Here: today’s date from the clock. |
| Seam | A point where you can substitute a collaborator at test time without changing production behavior. We’re about to introduce one. |
| Dependency Injection | The technique: pass the collaborator in as a parameter instead of hard-coding it. (Meszaros, Dependency Injection.) |
🌍 The same vocabulary in another language
These terms come from xUnit Test Patterns (Meszaros, 2007). They’re language-agnostic. JavaScript+Jest, Java+Mockito, C#+Moq, Ruby+RSpec — all use the same words for the same roles. What changes between languages is the syntax of how you express a stub or a mock. The role doesn’t change.
📋 The full Meszaros taxonomy (preview)
You’ll meet four named test doubles in this tutorial — Stub, Spy, Mock, and Fake — plus one you’ll see in passing:
| Role | What it does | First encountered in |
|---|---|---|
| Dummy | A placeholder object that’s never actually used. Passed only to satisfy a constructor or method signature when the test doesn’t care about that collaborator. | Step 5’s _service(Mock(), Mock()) helper — those args are dummies. |
| Stub | Returns canned indirect inputs to the SUT. The SUT reads from it; the test doesn’t verify how. | Step 2 — a FrozenClock that always returns the same datetime. |
| Spy | Records the SUT’s outgoing calls so the test can assert on them later. | Step 3 — a ledger spy that captures (user_id, gold) tuples. |
| Mock (Meszaros sense — the “noun”) | A spy + behavior verification: the test sets expectations up-front, and the mock fails if they aren’t met. | Step 4 — unittest.mock + assert_called_once_with. |
| Fake | A working alternate implementation, simpler than production (e.g., an in-memory database for a test). | Step 6 — when stubs/spies become unwieldy. |
Five roles, one taxonomy. The role is determined by how the test uses the object, not by what class instantiated it.
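Below is a minimal sketch of the Dummy role. It assumes a trimmed copy of DailyQuestService that keeps only complete_quest; TrimmedQuestService and RecordingLedger are illustrative names, not part of the tutorial's files. complete_quest never touches the clock or the API client, so bare object() placeholders can fill those constructor slots. The ledger plays the spy role.

```python
# Illustrative trimmed copy of the tutorial's SUT: only complete_quest is kept.
QUEST_REWARDS = {"Slay the Slime Lord": 100}


class TrimmedQuestService:
    def __init__(self, clock, api, ledger):
        self._clock = clock
        self._api = api
        self._ledger = ledger

    def complete_quest(self, user_id, quest_title):
        gold = QUEST_REWARDS.get(quest_title, 0)
        self._ledger.credit(user_id, gold)


class RecordingLedger:
    """Spy role: records every credit() call for later assertions."""

    def __init__(self):
        self.calls = []

    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


# Dummy role: bare placeholders that only satisfy the constructor signature.
# Calling dummy_clock.now() or dummy_api.fetch_quests() would raise AttributeError.
dummy_clock = object()
dummy_api = object()

ledger = RecordingLedger()
service = TrimmedQuestService(dummy_clock, dummy_api, ledger)
service.complete_quest("u1", "Slay the Slime Lord")
```

If the SUT ever started calling one of those placeholders, the test would fail loudly with an AttributeError. That is exactly what you want from a dummy.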
⚙️ Task — three small moves:
- Read quest_service.py and test_quest_service.py. The test asserts that is_today_event_day("2026-04-28") is True. The test was written on 2026-04-28 and merged green that day.
  ✏️ Predict before you run. What happens when you run test_april_28_is_event_day today?
  - (a) Pass: the function returns True whenever its argument is a valid date string.
  - (b) Pass: the date string in the assertion ("2026-04-28") matches the value stored in the test, so equality holds.
  - (c) Fail: is_today_event_day("2026-04-28") returns False because the function compares against today’s wall clock, which is no longer 2026-04-28.
  - (d) Error: the function raises an exception because 2026-04-28 is in the past.
  Commit to a letter. Then run the test.
  Reveal (after committing)
  (c) is the answer. The trap is (b). Students who haven’t yet thought about where the function gets “today” from assume both sides of the == come from the same source. They don’t. The left side comes from datetime.now() (the wall clock); the right side is a hardcoded string. Two different sources, two different rates of change. The test rotted overnight.
- Run the test. The FAIL is the lesson: the test was correct on the day it was written, and the world changed beneath it. Tests that depend on the wall clock matching a specific date rot on a schedule.
- Refactor is_today_event_day to accept a clock parameter (default: the real datetime.datetime class). This creates the seam, but you don’t use it yet. Adding the seam alone won’t fix test_april_28_is_event_day, because it still calls is_today_event_day("2026-04-28") without injecting a clock. Don’t be alarmed when that one test stays red after the refactor: the gate tests below check the seam itself, not the original test. Step 2 will use the seam to control the clock so the test is deterministic.
```mermaid
flowchart LR
    subgraph before["BEFORE: no seam"]
        direction TB
        S1["is_today_event_day(date_str)"]:::sut
        S1 --> C1["datetime.now()<br/>📅 wall clock"]:::bad
    end
    subgraph after["AFTER: seam introduced"]
        direction TB
        S2["is_today_event_day(date_str, clock)"]:::sut
        S2 --> C2["clock.now()<br/>↑ caller decides<br/>what clock"]:::good
    end
    before --> after
    classDef sut fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef bad fill:#ffebee,stroke:#c62828,color:#b71c1c
```
💡 Concept over syntax. Your code change is a single keyword (clock) and one default. The point is the idea — “this function used to depend on the wall clock; now its caller decides what ‘now’ means.” That’s the foundation of every test double in this tutorial. (The default value clock=datetime.datetime keeps existing call sites working — the seam is non-intrusive.)
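Here is one possible shape for the refactor. It assumes the file keeps its from datetime import datetime import, so the default is the datetime.datetime class itself. StandInClock is a hypothetical preview of the kind of object Step 2 builds.

```python
from datetime import datetime


def is_today_event_day(event_date_str: str, clock=datetime) -> bool:
    """Return True if clock.now() falls on event_date_str (YYYY-MM-DD).

    The default is the real datetime class, so existing callers that pass
    only the date string still read the wall clock.
    """
    today = clock.now().strftime("%Y-%m-%d")
    return today == event_date_str


class StandInClock:
    """Hypothetical stand-in: any object with a now() method fits the seam."""

    def __init__(self, fixed_dt: datetime):
        self._fixed_dt = fixed_dt

    def now(self) -> datetime:
        return self._fixed_dt


# Old call style still works and still consults the wall clock.
wall_clock_answer = is_today_event_day("1999-01-01")  # False on any present-day machine

# New call style: the caller decides what "now" means.
april_28 = StandInClock(datetime(2026, 4, 28, 12, 0))
assert is_today_event_day("2026-04-28", clock=april_28) is True
assert is_today_event_day("2026-04-29", clock=april_28) is False
```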
🔭 Coming in Step 2: You created a seam. Now we’ll actually use it by passing in a FrozenClock object that always says it’s Tuesday. Same SUT, same test shape, but now fully deterministic.

```python
"""QuestForge: daily quest event service."""
from datetime import datetime


def is_today_event_day(event_date_str: str) -> bool:
    """Return True if today is the event date.

    event_date_str is in YYYY-MM-DD format.

    ⚠️ This function calls datetime.now() directly. Tests that pin a
    specific date will pass on that date and fail on every other day.
    That hidden non-determinism is what we're about to fix.
    """
    today = datetime.now().strftime("%Y-%m-%d")
    return today == event_date_str
```

```python
"""Test for is_today_event_day.

⚠️ This test was written on 2026-04-28 and passed that day.
Today, unless the calendar still reads 2026-04-28, it FAILS:
`is_today_event_day("2026-04-28")` returns False because the wall
clock no longer matches the hardcoded date. That failure is the
lesson: a test that depends on `datetime.now()` matching a specific
string rots the moment the date passes. Step 2 will fix it by
*controlling* the clock instead of asking the OS.
"""
from quest_service import is_today_event_day


def test_april_28_is_event_day():
    # Test author assumed today would always be 2026-04-28 when this ran.
    # Reality: this test passes on exactly one calendar day.
    assert is_today_event_day("2026-04-28") is True
```
Step 1 — Knowledge Check
Min. score: 80%
1. Which of these collaborators are likely to make a test flaky (sometimes pass, sometimes fail without code changes)? (select all that apply)
Flakiness comes from collaborators that the test cannot fully control: wall clocks, network calls, remote databases, file systems, randomness. Pure in-memory operations (list reversal, arithmetic) are deterministic and don’t need a double.
2. What is an indirect input to the System Under Test?
Indirect input = a value the SUT obtains from a collaborator rather than
from its caller. clock.now(), db.fetch_user(id), api.get_weather() —
each returns an indirect input that the SUT then uses. Stubs control these.
3. (Spaced review — Testing Foundations) A test asserts result is not None after refactoring the SUT to accept a clock parameter. Is that a strong oracle?
Oracle strength is independent of whether collaborators are doubled.
is not None is the canonical weak oracle in any context. Even after
you replace a real clock with a stub, the assertion still has to pin
exactly what the spec mandates.
4. Why is dependency injection the right move before introducing any test doubles?
Dependency Injection is the design move that makes test doubles possible. Pass the collaborator as a parameter; now any test can substitute a controlled version. (Same principle in Java with constructor injection, in C# with interfaces, in JavaScript with options-object patterns. The pattern is language-agnostic.)
Hand-Rolled Stub: A Clock That Always Says Tuesday
Why this matters
A seam is only useful if you have something to plug into it. The simplest something is a Test Stub — a tiny hand-written class that always answers questions the same way. Hand-rolling one (in plain Python, no library) makes the role visible: a stub is just a controlled answer to a question. Once you’ve built one yourself, every framework-generated stub you meet later is just less typing for the same idea.
🎯 You will learn to
- Apply the Test Stub role (Meszaros) by writing one in plain Python
- Analyze how canned values drive the SUT down a specific behavior partition
- Evaluate state verification — asserting on the SUT’s return value, not on the stubs
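🧭 Bridge from Step 1. You created a seam: is_today_event_day(event_date_str, clock) now accepts its clock as a parameter. DailyQuestService(clock, api) applies the same Dependency Injection to both of its collaborators. Now we’ll use the seam by passing in objects that always answer the same way. That’s a stub.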
📖 The verbatim teaching sentence
“Mock is a tool class; stub, spy, and mock are test-design roles. Same in Python, JavaScript, and Java: the role is what matters; the class name is just syntax.”
Read that twice. Most confusion about test doubles in Python comes from conflating Python’s unittest.mock.Mock class with the conceptual Mock role. They’re not the same thing. We’ll dismantle that confusion in Step 4. For now, lock in this: the role is the question; the syntax is the answer.
📖 What is a Test Stub? (Meszaros, xUnit Test Patterns)
A Test Stub replaces a collaborator with a hand-controlled object that answers questions with canned values. It does not record what was asked of it; it does not enforce a contract. It just answers.
```mermaid
flowchart LR
    T["Test"]:::test --> S["DailyQuestService<br/>(SUT)"]:::sut
    S -->|"clock.now()"| C1["FrozenClock<br/>📅 STUB<br/><i>always returns<br/>April 28, noon</i>"]:::stub
    S -->|"api.fetch_quests(...)"| C2["StubQuestApiClient<br/>📋 STUB<br/><i>always returns<br/>the canned quest list</i>"]:::stub
    T -.->|"asserts on return value"| S
    classDef test fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef sut fill:#fff3e0,stroke:#e65100,color:#bf360c
    classDef stub fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
```
Notice what the test asserts on: the SUT’s return value, not the stubs. That’s state verification — we observe the result of calling the SUT, not whether it talked to anyone. Stubs make state verification possible by removing the variability the real collaborators would have introduced.
⚙️ Task — three moves, getting progressively harder:
- Read the worked example test_tuesday_picks_tuesday_quest. The FrozenClock, the StubQuestApiClient, and the assertion are all written for you. Predict the test’s outcome before running. Then run it: green.
- Fill in the assertion in test_thursday_picks_thursday_quest. The clock is frozen to a Thursday; the canned API quests include a Thursday entry. Compute the expected value from the spec; don’t run-and-paste. Replace "FILL_IN_HERE" with the exact title the SUT should return.
- ✍️ Write your own test: test_friday_with_no_friday_quest_returns_no_quests_today. Use a Friday clock (datetime(2026, 5, 1, 12, 0)) and a canned list with no Friday entry, and assert == "No quests today". No scaffold: wire up the stubs yourself.
💡 The conceptual move. A stub answers questions — it doesn’t decide what those answers should be. You decide. Your decision drives the SUT down whichever behavior branch the test is meant to exercise. The canned quest list and the frozen weekday together form a precise input partition; the assertion locks in what the SUT does for that partition.
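Here is a sketch of that idea across every branch of daily_quest_title. It assumes trimmed local copies of FrozenClock, StubQuestApiClient, and DailyQuestService so the block runs on its own. FailingQuestApiClient is a hypothetical extra stub that answers with a canned failure instead of data; xUnit Test Patterns calls this stub variant a Saboteur.

```python
from datetime import datetime


class FrozenClock:
    def __init__(self, fixed_dt):
        self._fixed_dt = fixed_dt

    def now(self):
        return self._fixed_dt


class StubQuestApiClient:
    def __init__(self, canned_quests):
        self._canned = canned_quests

    def fetch_quests(self, user_id):
        return self._canned


class FailingQuestApiClient:
    """Hypothetical stub variant: its canned answer is an exception."""

    def fetch_quests(self, user_id):
        raise ConnectionError("simulated outage")


class DailyQuestService:
    """Trimmed copy of the tutorial's SUT (daily_quest_title only)."""

    def __init__(self, clock, api):
        self._clock = clock
        self._api = api

    def daily_quest_title(self, user_id):
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"


TUESDAY = FrozenClock(datetime(2026, 4, 28, 12, 0))  # 2026-04-28 is a Tuesday
QUESTS = [{"weekday": "Tuesday", "title": "Find the Lost Amulet"}]

# One canned answer per partition; each assertion pins the exact expected title.
partitions = [
    (StubQuestApiClient(QUESTS), "Find the Lost Amulet"),  # weekday matches
    (StubQuestApiClient([{"weekday": "Monday", "title": "Slay the Slime Lord"}]),
     "No quests today"),                                   # no weekday match
    (StubQuestApiClient([]), "No quests today"),           # empty quest list
    (FailingQuestApiClient(), "No quests today"),          # API failure
]
for api, expected in partitions:
    assert DailyQuestService(TUESDAY, api).daily_quest_title("u1") == expected
```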
📖 Why we wrote `StubQuestApiClient` as a class with one method, not as a function
DailyQuestService calls self._api.fetch_quests(user_id) — it expects a fetch_quests method on the api object. So our stub must be an object with that method. A function alone wouldn’t have a .fetch_quests attribute.
In Python this is duck typing: any object with a fetch_quests(self, user_id) method that returns a list of quest dicts is acceptable. The real QuestApiClient does it. Our stub does it. The SUT can’t tell them apart — that’s the whole point.
In Java, you’d give both classes a common interface. In TypeScript, you’d type the parameter as { fetchQuests: (userId: string) => Quest[] }. The mechanism differs; the idea (stub satisfies the same contract as the real collaborator) is universal.
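If you want the Python contract spelled out the way Java or TypeScript would, typing.Protocol can do it. This is a sketch: QuestSource is a hypothetical name, and the tutorial's code doesn't need it, since duck typing already works at runtime.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class QuestSource(Protocol):
    """Hypothetical contract shared by the real client and any stub."""

    def fetch_quests(self, user_id: str) -> list[dict]: ...


class StubQuestApiClient:
    def __init__(self, canned_quests: list[dict]):
        self._canned = canned_quests

    def fetch_quests(self, user_id: str) -> list[dict]:
        return self._canned


class NotAQuestSource:
    def get_quests(self, user_id: str) -> list[dict]:  # wrong method name
        return []


stub_fits = isinstance(StubQuestApiClient([]), QuestSource)
wrong_fits = isinstance(NotAQuestSource(), QuestSource)
```

runtime_checkable only checks that a fetch_quests attribute exists. Matching argument and return types is left to a static type checker such as mypy.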
🧠 Stub vs Fake — the cousin you'll meet briefly
A Fake Object (Meszaros) is the next-of-kin to a stub: a working but lightweight implementation. Where StubQuestApiClient returns the same canned list no matter what user_id is passed, a FakeQuestApiClient could keep an in-memory dict of {user_id: [quests]} and return different lists for different users.
```python
class FakeQuestApiClient:
    def __init__(self):
        self._data = {}

    def add_quests_for(self, user_id, quests):
        self._data[user_id] = quests

    def fetch_quests(self, user_id):
        return self._data.get(user_id, [])
```
When to reach for a Fake instead of a Stub: when one canned answer isn’t enough — typically when multiple SUTs share the collaborator, or when the test sequence depends on state that the stub would have to manually thread.
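Here is a sketch of where the Fake earns its keep. It assumes trimmed copies of FrozenClock and DailyQuestService. Two users get different quest lists from the same collaborator, and the test changes the fake's state mid-sequence. A single StubQuestApiClient ignores user_id, so it can't express either scenario.

```python
from datetime import datetime


class FrozenClock:
    def __init__(self, fixed_dt):
        self._fixed_dt = fixed_dt

    def now(self):
        return self._fixed_dt


class FakeQuestApiClient:
    """In-memory Fake: a separate quest list per user."""

    def __init__(self):
        self._data = {}

    def add_quests_for(self, user_id, quests):
        self._data[user_id] = quests

    def fetch_quests(self, user_id):
        return self._data.get(user_id, [])


class DailyQuestService:
    """Trimmed copy of the tutorial's SUT (daily_quest_title only)."""

    def __init__(self, clock, api):
        self._clock = clock
        self._api = api

    def daily_quest_title(self, user_id):
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"


fake = FakeQuestApiClient()
fake.add_quests_for("u1", [{"weekday": "Tuesday", "title": "Find the Lost Amulet"}])
fake.add_quests_for("u2", [{"weekday": "Tuesday", "title": "Defeat the Dragon"}])
service = DailyQuestService(FrozenClock(datetime(2026, 4, 28, 12, 0)), fake)  # a Tuesday

u1_title = service.daily_quest_title("u1")
u2_title = service.daily_quest_title("u2")
unknown_title = service.daily_quest_title("u9")  # no entry, so the fake answers []

fake.add_quests_for("u1", [])  # state changes mid-test; the next fetch sees it
u1_after_reset = service.daily_quest_title("u1")
```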
We won’t use Fakes in the worked exercises (one canned list per test is plenty here), but it’s worth knowing they exist. Step 6’s decision guide covers when each one fits.
🌍 The same idea in another language
FrozenClock is just an object whose now() method returns a fixed value. Every language has a way to write that.
JavaScript (no framework):
```javascript
const frozenClock = {
  now: () => new Date('2026-04-28T12:00:00')
};
```
Java:
```java
Clock frozenClock = Clock.fixed(
    Instant.parse("2026-04-28T12:00:00Z"),
    ZoneOffset.UTC
);
```
Same role; different syntax. Frameworks (unittest.mock, Jest, Mockito) generate these objects more concisely — but that’s boilerplate reduction, not a different idea.
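For comparison, here is the Python framework version of the same stub. It is a sketch using unittest.mock.Mock (Step 4 covers it properly), plugged into the clock seam from Step 1. The role is still Stub, because the test configures an answer and never asserts on how the clock was used.

```python
from datetime import datetime
from unittest.mock import Mock


def is_today_event_day(event_date_str, clock=datetime):
    return clock.now().strftime("%Y-%m-%d") == event_date_str


# Mock builds an object whose now() returns the configured value:
# the same Stub role as FrozenClock, without writing a class.
frozen_clock = Mock()
frozen_clock.now.return_value = datetime(2026, 4, 28, 12, 0)

is_event_day = is_today_event_day("2026-04-28", clock=frozen_clock)
```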
🔭 Coming in Step 3: A stub answers questions. What if your SUT’s interesting behavior is the call it makes, like a complete_quest that should call ledger.credit(user_id, gold)? That’s where the Test Spy comes in.

```python
"""Reusable test helper: a clock that always says it's `fixed_dt`."""
from datetime import datetime


class FrozenClock:
    """A stub clock: always returns the datetime it was constructed with."""

    def __init__(self, fixed_dt: datetime):
        self._fixed_dt = fixed_dt

    def now(self) -> datetime:
        return self._fixed_dt
```

```python
"""The REAL HTTP client: don't call this in tests.

Instantiating QuestApiClient and calling fetch_quests() would actually
hit the network. Tests that exercise `DailyQuestService` should pass
a stub instead.
"""
import urllib.request
import json


class QuestApiClient:
    def fetch_quests(self, user_id: str) -> list[dict]:
        url = f"https://questforge.example.com/quests/{user_id}"
        with urllib.request.urlopen(url) as r:
            return json.loads(r.read())
```

```python
"""QuestForge: daily quest service.

DailyQuestService takes a clock and an API client as constructor
parameters (Dependency Injection). At test time we pass in stubs;
in production the caller passes the real ones.
"""
import datetime


def is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    today = clock.now().strftime("%Y-%m-%d")
    return today == event_date_str


class DailyQuestService:
    """Picks today's daily quest title for a user."""

    def __init__(self, clock, api):
        self._clock = clock
        self._api = api

    def daily_quest_title(self, user_id: str) -> str:
        """Return today's quest title, or 'No quests today' if none match."""
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"
```

```python
"""Step 2: Hand-rolled stubs for DailyQuestService.

Two stubs are used here. FrozenClock is imported from clock.py.
StubQuestApiClient is defined right below, because it's a regular
class, not anything special. (Step 4 will show that `unittest.mock`
generates the same conceptual object in a single line, but the *idea*
is what we're locking in here, not the syntax.)
"""
from datetime import datetime

from clock import FrozenClock
from quest_service import DailyQuestService


class StubQuestApiClient:
    """A Test Stub (Meszaros, http://xunitpatterns.com/Test%20Stub.html): returns canned quests regardless of user_id."""

    def __init__(self, canned_quests: list[dict]):
        self._canned = canned_quests

    def fetch_quests(self, user_id: str) -> list[dict]:
        return self._canned


# ===== WORKED EXAMPLE 1: fully written =====
# Read carefully. Predict the assertion's outcome BEFORE running.
def test_tuesday_picks_tuesday_quest():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))  # 2026-04-28 is a Tuesday
    api = StubQuestApiClient([
        {"weekday": "Monday", "title": "Slay the Slime Lord"},
        {"weekday": "Tuesday", "title": "Find the Lost Amulet"},
        {"weekday": "Wednesday", "title": "Defeat the Dragon"},
    ])
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u123") == "Find the Lost Amulet"


# ===== FADED EXAMPLE 2: student fills in the expected value =====
# The stub class, the FrozenClock, and the canned data are all provided.
# YOUR JOB: replace "FILL_IN_HERE" with the EXACT title the SUT should return.
# Compute it from the spec; don't run-and-paste.
def test_thursday_picks_thursday_quest():
    clock = FrozenClock(datetime(2026, 4, 30, 12, 0))  # 2026-04-30 is a Thursday
    api = StubQuestApiClient([
        {"weekday": "Monday", "title": "Slay the Slime Lord"},
        {"weekday": "Thursday", "title": "Battle the Lich King"},
        {"weekday": "Sunday", "title": "Save the Princess"},
    ])
    service = DailyQuestService(clock, api)
    # TODO: pin the exact title with `==` (strong oracle, Testing Foundations Step 3).
    assert service.daily_quest_title("u456") == "FILL_IN_HERE"
```
Step 2 — Knowledge Check
Min. score: 80%
1. Which best describes a Test Stub?
Stub = canned answers. The SUT calls the stub; the stub returns whatever the test configured. Used to control what the SUT receives, not to inspect what the SUT does. (Step 3 covers the latter — that’s a Spy.)
2. Why is hardcoded datetime.now() (used directly inside the SUT) not a stub?
Stub = under the test’s control. datetime.now() is the opposite —
the wall clock is shared, mutable, and impossible for the test to
pin. Replacing it with FrozenClock(...) is what makes the
indirect input controllable.
3. (Spaced review — Testing Foundations Step 3) A teammate writes:
assert service.daily_quest_title("u123") is not None
Stubs and strong oracles solve independent problems. Stubs make indirect inputs controllable; oracles make assertions precise. You need both. Putting a weak oracle inside a stubbed test is a Liar test wearing a stub’s clothes.
4. When would a Fake Object (in-memory implementation) be a better choice than a Test Stub?
Stub: one canned answer per call. Fake: working in-memory implementation, useful when the SUT needs consistent stateful behavior across multiple calls (add → fetch → update → fetch again, etc.). Step 6’s decision guide covers when each fits.
5. Pick the right tool for the test.
Your notify_user(user_id) function calls email_gateway.send(user_id, "Welcome") and returns nothing. The test must verify that the email was sent to user "u1" exactly once with the welcome subject. The real email_gateway.send actually delivers an email — you cannot run it in tests.
Which test double is the right tool? (One choice from Step 1’s vocabulary table.)
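Spy. When the SUT calls a collaborator for a side effect (no meaningful return value the SUT acts on), the test needs to record the call and assert on it afterward; that’s the spy role. Skeleton (assuming notify_user accepts the gateway as an injected parameter, the same kind of seam as in Step 1):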
```python
def test_welcomes_new_user():
    spy = SpyEmailGateway()
    notify_user("u1", gateway=spy)
    assert spy.calls == [("u1", "Welcome")]
```
Compare the wrong choices: a stub answers a question the SUT asked; a fake provides a working alternate; the real one sends a real email. Step 3 will show you how to hand-roll spies of this exact shape.
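Here is a runnable version of that skeleton. It assumes notify_user accepts the gateway as a keyword parameter. The bodies of SpyEmailGateway and notify_user are a hypothetical sketch built from the question's description.

```python
class SpyEmailGateway:
    """Test Spy: records each send() instead of delivering an email."""

    def __init__(self):
        self.calls = []

    def send(self, user_id, subject):
        self.calls.append((user_id, subject))


def notify_user(user_id, gateway):
    """Hypothetical SUT: fire-and-forget, returns None."""
    gateway.send(user_id, "Welcome")


def test_welcomes_new_user():
    spy = SpyEmailGateway()
    notify_user("u1", gateway=spy)
    # One comparison pins the recipient, the subject, and the call count.
    assert spy.calls == [("u1", "Welcome")]
```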
Hand-Rolled Spy: Verifying Indirect Outputs
Why this matters
Plenty of real methods return None and do their work as a side effect — ledger.credit(user_id, gold), notifier.send(...), cache.invalidate(...). A stub can’t help: there’s no return value to assert on. You need a Test Spy that records calls so the test can ask, after the fact, did the SUT actually credit the right user the right amount? The hard part isn’t writing the spy — it’s pinning exactly the right amount of detail in the assertion: enough to catch real bugs, loose enough to survive harmless refactors.
🎯 You will learn to
- Apply the Test Spy role (Meszaros) by writing one in plain Python
- Evaluate “Goldilocks” assertions that pin only what the spec demands
- Analyze why fire-and-forget methods are invisible without a spy
🧭 Bridge from Step 2. A stub answers the SUT’s questions. A spy also records what the SUT did. The new conceptual move:
| Aspect | Stub (Step 2) | Spy (Step 3) |
|---|---|---|
| What the test asserts on | The SUT’s return value | The recorded calls on the spy |
| What the SUT looks like | A function that returns something | Often a method that returns None (fire-and-forget) |
| Verification kind | State Verification | State verification of the spy — Step 5 will introduce the third kind |
The new collaborator is RewardLedger — its job is to credit gold to a user. The SUT calls ledger.credit(user_id, gold) and that’s the only observable effect. The SUT itself returns nothing useful — the call to credit IS the contract. To verify it, we need a spy.
📖 What is a Test Spy? (Meszaros, xUnit Test Patterns)
A Test Spy behaves like a stub and records every call made to it. The test runs the SUT, then inspects the spy’s recorded-call list. Same SUT/collaborator structure as Step 2; what changes is what the test asserts on.
```mermaid
flowchart LR
    T["Test"]:::test --> S["DailyQuestService"]:::sut
    S -->|"clock.now()"| C1["FrozenClock<br/>📅 STUB"]:::stub
    S -->|"api.fetch_quests(...)"| C2["StubQuestApiClient<br/>📋 STUB"]:::stub
    S -->|"ledger.credit(u1, 100)"| C3["SpyLedger<br/>🎙️ SPY<br/><i>records every call</i>"]:::spy
    T -.->|"asserts on spy.calls"| C3
    classDef test fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
    classDef sut fill:#fff3e0,stroke:#e65100,color:#bf360c
    classDef stub fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef spy fill:#f3e5f5,stroke:#6a1b9a,color:#4a148c
```
Notice the test now asserts on spy.calls, not on the SUT’s return value. The contract being verified is “the SUT called credit with these arguments”.
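Because a spy is a stub that also records, one object can play both parts. The sketch below uses a hypothetical SpyQuestApiClient that hands back canned quests and remembers which user_id the SUT asked about. Trimmed copies of FrozenClock and DailyQuestService are included so the block runs on its own.

```python
from datetime import datetime


class FrozenClock:
    def __init__(self, fixed_dt):
        self._fixed_dt = fixed_dt

    def now(self):
        return self._fixed_dt


class SpyQuestApiClient:
    """Hypothetical stub + spy: canned answer, plus a record of who asked."""

    def __init__(self, canned_quests):
        self._canned = canned_quests
        self.requested_user_ids = []

    def fetch_quests(self, user_id):
        self.requested_user_ids.append(user_id)  # spy part: record the call
        return self._canned                       # stub part: canned answer


class DailyQuestService:
    """Trimmed copy of the tutorial's SUT (daily_quest_title only)."""

    def __init__(self, clock, api):
        self._clock = clock
        self._api = api

    def daily_quest_title(self, user_id):
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"


api = SpyQuestApiClient([{"weekday": "Tuesday", "title": "Find the Lost Amulet"}])
service = DailyQuestService(FrozenClock(datetime(2026, 4, 28, 12, 0)), api)

title = service.daily_quest_title("u1")
assert title == "Find the Lost Amulet"    # assertion on the SUT's return value
assert api.requested_user_ids == ["u1"]   # the SUT asked about the right user, once
```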
📖 The hard part isn’t writing the spy — it’s writing the assertion
A spy is even simpler than a stub: a class with a list and an append. The interesting test-design move is how much of each call to pin.
| Assertion | What still passes (i.e., what it misses) | Pattern |
|---|---|---|
| assert len(spy.calls) >= 0 | Everything. Always passes. Liar test. | Weak: same family as result is not None from Testing Foundations |
| assert spy.calls == [("u1", 100, "2026-04-28T12:00:00Z", {"meta": "blob"})] | Nothing. But it breaks if the SUT later calls credit with cleaner arguments, even when the contract is unchanged. Brittle. | Over-specified |
| assert spy.calls == [("u1", 100)] | Nothing the spec cares about. A wrong user_id, a wrong gold amount, a missing call, or an extra call all fail it, while harmless refactors still pass. Goldilocks. | Strong, behaviorally-bounded |
Same lesson as Testing Foundations Step 4: assert on exactly what the spec says — no less, no more. The spec for complete_quest: “credit the user the gold for the completed quest.” That maps to a 2-tuple (user_id, gold). Anything beyond that is over-specification; anything less is a Liar.
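To see the difference concretely, here is a sketch with two versions of the crediting logic: a correct one and a hypothetical regression that credits twice. For brevity, both are plain functions that take the ledger, rather than the tutorial's DailyQuestService method.

```python
QUEST_REWARDS = {"Slay the Slime Lord": 100}


class SpyLedger:
    def __init__(self):
        self.calls = []

    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


def complete_quest_correct(ledger, user_id, quest_title):
    ledger.credit(user_id, QUEST_REWARDS.get(quest_title, 0))


def complete_quest_buggy(ledger, user_id, quest_title):
    # Hypothetical regression: the same credit is issued twice.
    for _ in range(2):
        ledger.credit(user_id, QUEST_REWARDS.get(quest_title, 0))


def run(sut):
    """Return (liar_passes, goldilocks_passes) for one SUT variant."""
    spy = SpyLedger()
    sut(spy, "u1", "Slay the Slime Lord")
    liar_passes = len(spy.calls) >= 0
    goldilocks_passes = spy.calls == [("u1", 100)]
    return liar_passes, goldilocks_passes


correct_result = run(complete_quest_correct)  # (True, True)
buggy_result = run(complete_quest_buggy)      # (True, False): only Goldilocks notices
```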
⚙️ Task — four moves:
- Read test_complete_quest_LIAR_oracle. The assertion is assert len(spy.calls) >= 0, which always passes, regardless of whether the SUT called the spy at all. Add a Python comment above the assertion explaining (in your own words) why this is a Liar test; use the phrase “Liar test” or “weak oracle”. Don’t change the assertion; the test stays a Liar so the lesson is preserved.
- Read and run test_complete_quest_credits_correct_gold. It is fully written and pins the exact 2-tuple. This is the Goldilocks shape.
- Fill in the assertion in test_award_streak_bonus_5_days. The streak-bonus rule: 10 gold per day, capped at 100. The test passes days=5. Compute the gold; pin the call.
- ✍️ Write your own test: test_award_streak_bonus_caps_at_100_for_long_streaks. Use days=12 (above the cap). Wire up SpyLedger + DailyQuestService and pin spy.calls == [("u3", 100)]. No scaffold.
📖 Why fire-and-forget methods need spies
complete_quest returns None. From the SUT’s caller’s perspective, nothing happens — the function is “void”. Yet the SUT did do something important: it told the ledger to credit gold. Without a spy, that work is invisible to the test.
A spy makes invisible side effects visible. The idea is the same in every language: Java mocks (Mockito.verify(...)), JavaScript spies (jest.fn() + expect(spy).toHaveBeenCalledWith(...)), and Python’s unittest.mock recorded calls. Recording the outgoing call is the most direct way to test a fire-and-forget method; the main alternative is a Fake whose resulting state the test inspects afterward.
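Here is a small demonstration of that invisibility. It assumes a trimmed stand-in for DailyQuestService.award_streak_bonus; StreakService is a hypothetical name. Both calls return None, so the return value can't tell a capped bonus from an uncapped one. Only the spy's record can.

```python
class SpyLedger:
    def __init__(self):
        self.calls = []

    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


class StreakService:
    """Hypothetical trimmed stand-in for DailyQuestService.award_streak_bonus."""

    def __init__(self, ledger):
        self._ledger = ledger

    def award_streak_bonus(self, user_id, days):
        gold = min(days * 10, 100)  # 10 gold per day, capped at 100
        self._ledger.credit(user_id, gold)


spy = SpyLedger()
service = StreakService(spy)

short_result = service.award_streak_bonus("u7", 3)
long_result = service.award_streak_bonus("u8", 15)

# The return values carry no information: both are None.
assert short_result is None and long_result is None
# The spy's record is the only observable evidence of what happened.
assert spy.calls == [("u7", 30), ("u8", 100)]
```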
🌍 The same idea in another language
JavaScript with Jest:
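```javascript
// Assumes a JS port of DailyQuestService; clock and api are stubs built earlier in the test.
const ledger = { credit: jest.fn() };  // jest.fn() records every call, acting as a spy
const service = new DailyQuestService(clock, api, ledger);
service.completeQuest('u1', 'Slay the Slime Lord');
expect(ledger.credit).toHaveBeenCalledWith('u1', 100);
```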
Java with Mockito:
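```java
// Assumes a Java DailyQuestService; clock and api are stubs built earlier in the test.
RewardLedger ledger = mock(RewardLedger.class);  // a Mockito mock, used here in the spy role
DailyQuestService service = new DailyQuestService(clock, api, ledger);
service.completeQuest("u1", "Slay the Slime Lord");
verify(ledger).credit("u1", 100);
```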
Same role; different syntax. The hand-rolled SpyLedger class makes the recording mechanism visible; framework spies (Step 4) hide the boilerplate.
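The Python counterpart can be sketched with unittest.mock.Mock standing in for the ledger in the spy role. QuestCompleter is a hypothetical trimmed stand-in for DailyQuestService.complete_quest.

```python
from unittest.mock import Mock, call

QUEST_REWARDS = {"Slay the Slime Lord": 100}


class QuestCompleter:
    """Hypothetical trimmed stand-in for DailyQuestService.complete_quest."""

    def __init__(self, ledger):
        self._ledger = ledger

    def complete_quest(self, user_id, quest_title):
        gold = QUEST_REWARDS.get(quest_title, 0)
        self._ledger.credit(user_id, gold)


ledger = Mock()  # Mock creates ledger.credit on first access and records its calls
QuestCompleter(ledger).complete_quest("u1", "Slay the Slime Lord")

ledger.credit.assert_called_once_with("u1", 100)          # raises AssertionError on mismatch
assert ledger.credit.call_args_list == [call("u1", 100)]  # the raw recorded calls
```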
🔭 Coming in Step 4: Hand-rolling spies gets repetitive, because you’re writing the same self.calls.append(...) boilerplate every time. Python’s unittest.mock.Mock generates the equivalent of the entire SpyLedger class for you in a single line. But it’s the same conceptual object, just with less typing.

```python
"""The real reward ledger: would persist gold to a database in production."""


class RewardLedger:
    def credit(self, user_id: str, gold: int) -> None:
        # In production: writes a credit row to the rewards database.
        raise NotImplementedError(
            "Don't call the real ledger in tests; pass a SpyLedger instead."
        )
```

```python
"""QuestForge: daily quest service with reward ledger collaborator."""
import datetime

QUEST_REWARDS = {
    "Slay the Slime Lord": 100,
    "Find the Lost Amulet": 150,
    "Battle the Lich King": 250,
    "Defeat the Dragon": 500,
}


def is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    today = clock.now().strftime("%Y-%m-%d")
    return today == event_date_str


class DailyQuestService:
    """Picks today's quest, completes quests, and awards streak bonuses."""

    def __init__(self, clock, api, ledger=None):
        self._clock = clock
        self._api = api
        self._ledger = ledger

    def daily_quest_title(self, user_id: str) -> str:
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"

    def complete_quest(self, user_id: str, quest_title: str) -> None:
        """Credit the user the gold for the completed quest. Returns None."""
        gold = QUEST_REWARDS.get(quest_title, 0)
        self._ledger.credit(user_id, gold)

    def award_streak_bonus(self, user_id: str, days: int) -> None:
        """Award 10 gold per streak day, capped at 100. Returns None."""
        gold = min(days * 10, 100)
        self._ledger.credit(user_id, gold)
```

```python
"""Step 3: Hand-rolled spies for fire-and-forget collaborator calls.

A spy is a stub that ALSO records calls. The interesting test-design
move isn't writing the spy; it's writing the assertion. Pin exactly
what the spec mandates: no less (Liar), no more (over-specified).
"""
from datetime import datetime

from clock import FrozenClock
from quest_service import DailyQuestService


class StubQuestApiClient:
    def __init__(self, canned_quests):
        self._canned = canned_quests

    def fetch_quests(self, user_id):
        return self._canned


class SpyLedger:
    """A Test Spy (Meszaros, http://xunitpatterns.com/Test%20Spy.html): records every credit() call."""

    def __init__(self):
        self.calls = []

    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


# ===== WORKED EXAMPLE 1: the Liar test =====
# This assertion ALWAYS passes, even if the SUT never called the spy.
# YOUR JOB: add a Python comment ABOVE the assertion explaining (in
# your own words) why this is a "Liar test" / "weak oracle".
# Don't change the assertion; keep the Liar visible for comparison.
def test_complete_quest_LIAR_oracle():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.complete_quest("u1", "Slay the Slime Lord")
    # TODO: add a comment HERE explaining the Liar pattern.
    assert len(spy.calls) >= 0


# ===== WORKED EXAMPLE 2: Goldilocks =====
# Pins exactly the (user_id, gold) the spec mandates. Read and run.
def test_complete_quest_credits_correct_gold():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.complete_quest("u1", "Slay the Slime Lord")
    # Slay the Slime Lord rewards 100 gold (per QUEST_REWARDS in quest_service.py).
    assert spy.calls == [("u1", 100)]


# ===== FADED EXAMPLE 3: student writes the expected call =====
# The SUT is `award_streak_bonus(user_id, days)`.
# Spec: 10 gold per day, capped at 100.
# YOUR JOB: replace the placeholder gold value with the correct one
# for `days=5`. Compute it from the spec.
def test_award_streak_bonus_5_days():
    spy = SpyLedger()
    service = DailyQuestService(
        FrozenClock(datetime(2026, 4, 28, 12, 0)),
        StubQuestApiClient([]),
        spy,
    )
    service.award_streak_bonus("u2", 5)
    # TODO: replace 999 with the correct gold for a 5-day streak.
    assert spy.calls == [("u2", 999)]
```
Step 3 — Knowledge Check
Min. score: 80%
1. What is the defining role of a Test Spy that distinguishes it from a Test Stub?
Spy = stub + call recording. The test asserts on the recorded
call list (spy.calls), which is how we verify that the SUT
did something — even when “did something” leaves no observable
return value.
2. (Spaced review — Testing Foundations Step 3) A teammate asserts:
assert len(spy.calls) >= 0
The Liar pattern is independent of the assertion operator. The
issue is the assertion’s expression — len(...) >= 0 is
structurally trivial. Replace it with assert spy.calls == [...]
pinning the exact expected call.
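To see why the Liar oracle verifies nothing, run both assertions against a SUT that is deliberately broken. This is a minimal sketch. SpyLedger here is a hypothetical re-creation of the hand-rolled spy (its real definition lives earlier in the tutorial). complete_quest_broken is a hypothetical buggy stand-in, not DailyQuestService.

```python
class SpyLedger:
    """Hypothetical re-creation of a hand-rolled spy: records every credit() call."""

    def __init__(self):
        self.calls = []

    def credit(self, user_id, gold):
        self.calls.append((user_id, gold))


def complete_quest_broken(ledger, user_id, quest_title):
    """Hypothetical buggy SUT: it forgets to credit the ledger at all."""
    return None


spy = SpyLedger()
complete_quest_broken(spy, "u1", "Slay the Slime Lord")

# Liar oracle: a list length is never negative, so this is True for ANY SUT.
liar_passes = len(spy.calls) >= 0

# Exact oracle: pins the one call the spec mandates, so the missing credit shows up.
exact_passes = spy.calls == [("u1", 100)]
```

The Liar assertion holds whether or not the bug exists, so it can never fail. Of the two, only the exact assertion notices the missing credit.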
3. Which spy assertion is brittle (would break under a harmless internal refactor)?
Brittle = pins details outside the spec. The 4-tuple includes a
timestamp and a metadata dict that aren’t part of the credit
contract — they’re internals. A pure refactor that drops the
metadata would break this test even though credit(user_id, gold)
is still being called correctly. (Same family as the
internal-coupling brittleness from Testing Foundations Step 4.)
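You can reproduce the brittleness from question 3 with a spy that records a 4-tuple. This is a minimal sketch under stated assumptions. DetailedSpyLedger, the metadata dict, and the two complete_quest_v* versions are hypothetical: they are modeled on the quiz's description, not on quest_service.py. The contract-level check projects each recorded call onto (user_id, gold) before it compares anything.

```python
from datetime import datetime

FIXED_TIME = datetime(2026, 4, 28, 12, 0)


class DetailedSpyLedger:
    """Hypothetical spy that records everything the SUT passes to credit()."""

    def __init__(self):
        self.calls = []

    def credit(self, user_id, gold, timestamp=None, metadata=None):
        self.calls.append((user_id, gold, timestamp, metadata))


def complete_quest_v1(ledger, user_id):
    # Original implementation: also passes internal metadata.
    ledger.credit(user_id, 100, FIXED_TIME, {"source": "quest"})


def complete_quest_v2(ledger, user_id):
    # Pure refactor: metadata dropped, credit(user_id, gold) contract unchanged.
    ledger.credit(user_id, 100, FIXED_TIME)


def brittle_check(spy):
    # Pins the full 4-tuple, internals included.
    return spy.calls == [("u1", 100, FIXED_TIME, {"source": "quest"})]


def contract_check(spy):
    # Keeps only the (user_id, gold) pair the spec cares about.
    return [(user_id, gold) for user_id, gold, *_internals in spy.calls] == [("u1", 100)]


results = {}
for version, sut in [("v1", complete_quest_v1), ("v2", complete_quest_v2)]:
    spy = DetailedSpyLedger()
    sut(spy, "u1")
    results[version] = (brittle_check(spy), contract_check(spy))
```

Both checks accept v1. Only the contract-level check survives the harmless v2 refactor.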
4. (Spaced review — Step 2) Stub vs Spy in one sentence:
Stub: "control what the SUT receives." Spy: "observe what the SUT did." Same role-vs-syntax distinction as Step 2 — these are test-design roles, independent of whether you hand-roll them or generate them with a library (Step 4 incoming).
Library Doubles with unittest.mock: Same Roles, Less Typing
Why this matters
Hand-rolling stubs and spies makes the roles visible, but it gets repetitive — every spy is the same self.calls.append(...) boilerplate. Python’s unittest.mock.Mock collapses that into a single line. The catch: it’s the same class whether the test uses it as a stub, spy, or mock — the role is determined entirely by what the test does with the object. Once you can read a Mock and name its role on sight, framework syntax stops being a vocabulary barrier between you and other people’s tests.
🎯 You will learn to
- Recognize a Mock(return_value=...) as a stub and a Mock with assert_called_once_with(...) as a spy
- Apply side_effect to simulate collaborator failures
- Analyze why “to mock” (verb) and “a Mock” (Meszaros noun) are different things
🧭 Bridge from Steps 2-3. You wrote StubQuestApiClient and SpyLedger by hand. The recording boilerplate (self.calls.append(...)) gets repetitive. Python’s unittest.mock.Mock is a class that generates the same conceptual object on demand:
- Set api.fetch_quests.return_value = [...] → api.fetch_quests(...) returns that list. (Stub.)
- Set api.fetch_quests.side_effect = ConnectionError → api.fetch_quests(...) raises. (Failing stub.)
- Call api.fetch_quests("u1") → Mock auto-records the call; api.fetch_quests.assert_called_once_with("u1") checks the recording. (Spy.)
One class, three roles — depending on what the test asks of it. The role isn’t determined by the class; it’s determined by what the test does with it.
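Here is one Mock object moving through all three roles in sequence, using only the standard library. The quest dictionary mirrors the tutorial's examples. The error message is an arbitrary placeholder.

```python
from unittest.mock import Mock, call

api = Mock()

# Role 1, stub: one canned answer for every call.
api.fetch_quests.return_value = [{"weekday": "Tuesday", "title": "Find the Lost Amulet"}]
stubbed = api.fetch_quests("u1")

# Role 2, failing stub: side_effect makes the next call raise instead.
api.fetch_quests.side_effect = ConnectionError("quest service down")
try:
    api.fetch_quests("u1")
    raised = False
except ConnectionError:
    raised = True

# Role 3, spy: the same object has been recording every call all along,
# including the one that raised.
recorded = api.fetch_quests.call_args_list
api.fetch_quests.assert_called_with("u1")
```

None of the three roles needed a different class. Only the configuration and the final assertion changed.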
📖 The verbatim teaching sentence — louder this time
“Mock is a tool class; stub, spy, and mock are test-design roles. Same in Python, JavaScript, and Java — the role is what matters; the class name is just syntax.”
unittest.mock.Mock is the most overloaded class name in Python testing. It is not a “Mock object” in Meszaros’ sense (Step 5 will introduce that role). It’s a tool — a configurable double that can play stub, spy, or mock depending on how the test uses it.
⚠️ Why this matters for your career
Reading other people’s tests, you’ll see Mock everywhere. Most uses are stubs in disguise (Mock(return_value=...)). When someone says “I added a mock for the database,” nine times out of ten they actually added a stub. Recognizing the role behind the class name is the difference between parroting Mock syntax and understanding what the test verifies.
🔤 “Mock” as a verb vs. “a Mock” as a noun
English makes this trap worse. Two senses you’ll hear in the wild:
| Form | What it means | Example |
|---|---|---|
| “to mock” (verb) | Replace any collaborator with any test double — colloquial, role-agnostic. | “Let’s mock the database” — could mean stub, spy, fake, or unittest.mock.Mock. |
| “a Mock” (noun, Meszaros) | Specifically a behavior-verifying double with up-front expectations. | “Use a Mock when you need to assert the email service was called exactly once.” |
When a teammate says “we mocked the API,” you don’t know which role they used until you read the test. The verb is loose; the noun is specific. In this tutorial, we use the noun (Meszaros) form. When you talk about your own tests, naming the role — “I stubbed the clock,” “I spied on the ledger,” “I added a mock for the gateway” — communicates more than “I mocked it.”
⚙️ Task: read three tests, fill in one, then write one:
- Read test_a_handrolled_stub: the Step 2 hand-rolled style, for comparison.
- Read test_b_mock_return_value: same SUT, same role, generated by Mock. Confirm that both pass and verify the same behavior.
- Read test_c_mock_as_spy: the same Mock class, now playing the spy role. Notice that nothing about Mock changes between Test B and Test C. Only what the test does with it changes.
- Fill in test_d_side_effect_simulates_api_failure: replace the placeholder exception class. Read DailyQuestService.daily_quest_title to find which exception it catches, and use that class.
- ✍️ Write test_e_award_streak_bonus_with_mock_spy. Use Mock() (not SpyLedger) as the ledger and call award_streak_bonus("u9", 7). Then verify with ledger.credit.assert_called_once_with("u9", 70). Call that helper directly rather than wrapping it in assert: it returns None, so assert ledger.credit.assert_called_once_with(...) would always fail. This is the same spy role as Step 3 with different syntax, and cementing role-vs-class is the whole point.
📖 return_value vs side_effect — concept-level contrast
| Attribute | What it does | When to reach for it |
|---|---|---|
| mock.return_value = X | Calls return X (a canned answer) | The collaborator should succeed; you want to drive the SUT down a happy-path partition. |
| mock.side_effect = Exception | Calls raise the exception | The collaborator should fail; you want to drive the SUT down its error-handling branch. |
| mock.side_effect = [a, b, c] | First call returns a, second b, third c (an exception in the list is raised instead of returned) | The collaborator returns different values across the test sequence. |
| mock.side_effect = my_function | Calls invoke my_function(*args, **kwargs) and return its result | The return value depends dynamically on the arguments. |
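Both attributes configure the same Mock object, and they answer different test-design questions. If both are set, side_effect takes precedence. Its exception is raised, or its result is returned, unless that result is the sentinel unittest.mock.DEFAULT, which defers to return_value.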
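This sketch covers the table rows above plus the precedence rule, using only unittest.mock. The titles and gold values are illustrative. reward_lookup, both, and deferring are hypothetical names.

```python
from unittest.mock import DEFAULT, Mock

# Sequenced answers: the first call fails, the second succeeds.
# (A third call would raise StopIteration because the list is used up.)
api = Mock()
api.fetch_quests.side_effect = [ConnectionError("timeout"), ["Slay the Slime Lord"]]
try:
    api.fetch_quests("u1")
    first_call_raised = False
except ConnectionError:
    first_call_raised = True
second_result = api.fetch_quests("u1")

# Computed from arguments: the function receives the same args as the call.
reward_lookup = Mock(side_effect=lambda title: {"Slay the Slime Lord": 100}.get(title, 0))
slime_gold = reward_lookup("Slay the Slime Lord")
unknown_gold = reward_lookup("Pet the Cat")

# Precedence: with both set, side_effect decides the result...
both = Mock(return_value="from return_value", side_effect=lambda: "from side_effect")
both_result = both()

# ...unless the side_effect function returns DEFAULT, which defers to return_value.
deferring = Mock(return_value="from return_value", side_effect=lambda: DEFAULT)
deferring_result = deferring()
```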
📖 What about `monkeypatch`?
pytest’s monkeypatch fixture is another way to swap a collaborator at test time — particularly useful when the collaborator is a module-level function or constant that the SUT imports, rather than a constructor parameter:
from unittest.mock import Mock
from clock import FrozenClock
from quest_service import DailyQuestService

def test_with_monkeypatch(monkeypatch):
    # Replace QUEST_REWARDS at the module level for this one test only.
    # monkeypatch automatically restores it after the test.
    monkeypatch.setattr("quest_service.QUEST_REWARDS", {"Slay the Slime Lord": 9999})
    spy = Mock()
    service = DailyQuestService(FrozenClock(...), Mock(), spy)  # clock unused by complete_quest
    service.complete_quest("u1", "Slay the Slime Lord")
    spy.credit.assert_called_once_with("u1", 9999)
monkeypatch.setattr(target, value) replaces target with value. After the test, monkeypatch restores the original — automatically. The auto-cleanup is what makes monkeypatch safe: a manual replacement that you forgot to restore would leak into every subsequent test.
Conceptually, monkeypatch.setattr is a stub — you’re feeding the SUT a controlled value. Same role; different syntactic vehicle. Use it when the seam is at module level rather than at constructor level.
Step 5 will use the heavier unittest.mock.patch (decorator/context manager) for the same purpose — and explore the canonical pitfall: where in the namespace to patch.
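To see what the auto-cleanup buys you, here is a hand-rolled approximation of the save/swap/restore cycle that monkeypatch.setattr performs. swapped_attr is a hypothetical helper, not pytest's implementation: pytest records each change and undoes it at fixture teardown. A SimpleNamespace stands in for the quest_service module.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Hypothetical stand-in for the quest_service module and its global table.
quest_service = SimpleNamespace(QUEST_REWARDS={"Slay the Slime Lord": 100})


@contextmanager
def swapped_attr(target, name, value):
    """Hypothetical helper: swap an attribute, and ALWAYS put the original back."""
    original = getattr(target, name)
    setattr(target, name, value)
    try:
        yield
    finally:
        setattr(target, name, original)  # runs even if the test body raises


with swapped_attr(quest_service, "QUEST_REWARDS", {"Slay the Slime Lord": 9999}):
    inside = quest_service.QUEST_REWARDS["Slay the Slime Lord"]
after = quest_service.QUEST_REWARDS["Slay the Slime Lord"]

# A failing test body still gets the original value restored.
try:
    with swapped_attr(quest_service, "QUEST_REWARDS", {}):
        raise AssertionError("test failed mid-way")
except AssertionError:
    pass
after_failure = quest_service.QUEST_REWARDS["Slay the Slime Lord"]
```

The try/finally is the part people forget when they patch by hand. Without it, a failing test leaves 9999 in place for every test that runs afterwards.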
🌍 The same idea in another language
JavaScript with Jest:
const api = { fetchQuests: jest.fn().mockReturnValue([...]) }; // stub
// OR
const api = { fetchQuests: jest.fn().mockImplementation(() => { throw new Error('boom'); }) }; // failing stub (Jest's counterpart to side_effect)
Java with Mockito:
QuestApiClient api = mock(QuestApiClient.class);
when(api.fetchQuests(anyString())).thenReturn(List.of(...)); // stub
// OR
when(api.fetchQuests(anyString())).thenThrow(new ConnectionException()); // failing stub
Same conceptual moves: tell the double “return X” or “raise X.” The names of the methods differ across libraries — the roles don’t.
🔭 Coming in Step 5: Mock can also play the third role — Mock Object in Meszaros’ strict sense (behavior verification). To see it cleanly, we need one more idea: patch(), and where in the namespace to patch. That’s the #1 Python-mocking pitfall.
"""Step 4 — unittest.mock generates the same conceptual objects you wrote by hand.
Four tests below, all testing the same SUT (DailyQuestService). They
differ only in HOW the double is constructed and what role it plays.
Read them as a side-by-side comparison.
"""
from unittest.mock import Mock
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService
# Hand-rolled stub class (Step 2 style) — kept for direct comparison.
class StubQuestApiClient:
    def __init__(self, canned_quests):
        self._canned = canned_quests

    def fetch_quests(self, user_id):
        return self._canned
# ===== TEST A — Hand-rolled stub (Step 2 style) =====
def test_a_handrolled_stub():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = StubQuestApiClient([
        {"weekday": "Tuesday", "title": "Find the Lost Amulet"},
    ])
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "Find the Lost Amulet"
# ===== TEST B — Mock with return_value (same ROLE: stub) =====
# `Mock()` creates an auto-magic object. Setting
# `api.fetch_quests.return_value = [...]` configures what
# `api.fetch_quests(anything)` returns. Functionally equivalent to
# the StubQuestApiClient class above — just no class definition.
def test_b_mock_return_value():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    api.fetch_quests.return_value = [
        {"weekday": "Tuesday", "title": "Find the Lost Amulet"},
    ]
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "Find the Lost Amulet"
# ===== TEST C — Mock used as a SPY (different ROLE, same class) =====
# Watch this carefully: `Mock` is the same class as Test B's. But
# we're using it as a SPY — recording the call to `credit` and
# asserting on the recording afterwards. The role isn't determined
# by the class; it's determined by what we DO with it.
def test_c_mock_as_spy():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    api.fetch_quests.return_value = []  # api still acts as stub
    ledger = Mock()                     # ledger plays SPY
    service = DailyQuestService(clock, api, ledger)
    service.complete_quest("u1", "Slay the Slime Lord")
    # Mock auto-records every call; `assert_called_once_with` checks the recording.
    # This is identical in spirit to: assert ledger.calls == [("u1", 100)]
    # with a hand-rolled spy, except that the recording is generated automatically.
    ledger.credit.assert_called_once_with("u1", 100)
# ===== TEST D — fill in the side_effect =====
# The SUT catches ConnectionError and returns "No quests today".
# Use side_effect to make the stub RAISE that exception instead of returning.
# YOUR JOB: replace `ValueError` (the wrong exception) with the right one.
# Read DailyQuestService.daily_quest_title in quest_service.py to confirm
# which exception class is caught.
def test_d_side_effect_simulates_api_failure():
    clock = FrozenClock(datetime(2026, 4, 28, 12, 0))
    api = Mock()
    # TODO: replace ValueError with the exception class the SUT catches.
    api.fetch_quests.side_effect = ValueError
    service = DailyQuestService(clock, api)
    assert service.daily_quest_title("u1") == "No quests today"
Step 4 — Knowledge Check
Min. score: 80%
1.
api = Mock()
api.fetch_quests.return_value = [{"weekday": "Tuesday", "title": "..."}]
What role is api playing here?
Mock(return_value=X) is the framework’s way of writing what
you wrote by hand as class StubX: def method(self): return X.
Same role; less typing. The class is Mock; the role is stub.
(Verbatim teaching sentence in action.)
2. When should you reach for side_effect instead of return_value?
return_value: one canned answer for every call.
side_effect: dynamic — exception-raising, sequenced returns,
or computed-from-args. Pick based on what the test needs the
collaborator to do, not by what looks shorter.
3. A teammate writes:
ledger.credit.assrt_called_once_with("u1", 100) # typo
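The typo trap. Mock’s auto-attribute behavior, convenient for
quickly stubbing nested attribute chains, can also swallow typos
in assert_* method names. The misspelled name becomes just another
auto-created Mock, the test passes, and the assertion never ran.
Python has narrowed this trap: since 3.5, unknown names starting
with assert or assret raise AttributeError. Which other misspellings
are caught depends on your Python version. Step 5’s autospec=True
is one defense; spelling assert_called_once_with carefully is another.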
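You can check the auto-attribute trap on whatever Python version you run. This sketch uses only unittest.mock, and crdit and assert_called_once_wiht are deliberate, hypothetical typos. It sticks to the assert-prefix case, which Mock has refused since Python 3.5. Other misspellings vary by version, so the sketch leaves them out.

```python
from unittest.mock import Mock

ledger = Mock()

# A typo in the collaborator's method name is accepted silently:
# Mock auto-creates "crdit" on first access and records the call there.
ledger.crdit("u1", 100)
crdit_calls = ledger.crdit.call_count    # the typo'd attribute saw the call
credit_calls = ledger.credit.call_count  # the real method name never did

# A misspelled helper that starts with "assert" is refused (Python 3.5+),
# so this typo fails loudly instead of passing silently.
try:
    ledger.credit.assert_called_once_wiht("u1", 100)
    assert_typo_caught = False
except AttributeError:
    assert_typo_caught = True
```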
4. (Spaced review — TDD) During the Red-Green-Refactor cycle, when do you typically introduce a Mock?
Red is the test-design moment. Choosing stub/spy/mock/fake/no-double is a Red-phase decision because it shapes both the test’s structure and (often) the production design that emerges in Green. (Step 6 covers when not to double — also a Red-phase decision.)
5. Why is it important that pytest’s monkeypatch fixture automatically restores the original value?
Test isolation. A test that patches a module attribute and
forgets to restore it leaves a time bomb for every subsequent
test. monkeypatch and with patch(...) both handle restoration
for you; manual setattr/delattr does not. Always prefer the
framework-managed forms.
Where to Patch — The #1 Python Pitfall, and Why autospec Defends You
Why this matters
The single most common Python-mocking bug is patching the wrong namespace. Your test runs, no error is raised, but mock_send was never called and the real send_push ran behind the scenes. The rule is one sentence — patch where the SUT looks the name up, not where it was defined — but the trap catches everyone at least once. Pair that with autospec=True (a guardrail that makes your Mock as strict as the real callable it’s replacing) and you’ve defused two of the production-only failure modes of unittest.mock.
🎯 You will learn to
- Apply the rule “patch where the SUT looks up the name” to pick the right patch() target
- Evaluate when autospec=True is needed to defend against signature drift
- Analyze behavior verification (Meszaros) versus the state verification of Steps 2-3
🧭 Bridge from Step 4. Step 4 used Mocks at constructor parameters — DailyQuestService(clock, api, ledger) accepts the doubles directly. Sometimes that’s not possible: the SUT might call a module-level function directly, with no constructor parameter to swap. Then we use unittest.mock.patch() — and confront the canonical Python pitfall: where in the namespace does the patch belong?
📖 The new SUT — celebrate_milestone
Look at quest_service.py. There’s a new method celebrate_milestone(user_id, days) that calls send_push(...) from push_notifier. The import line in quest_service.py is:
from push_notifier import send_push
That single line is the source of every where-to-patch confusion in Python. After this import, send_push is bound in quest_service’s namespace. The quest_service module now has its own reference to the function — separate from push_notifier’s.
flowchart LR
subgraph push_mod["push_notifier module"]
P_DEF["send_push<br/>= <real function>"]:::neutral
end
subgraph quest_mod["quest_service module"]
Q_REF["send_push<br/>= <ref to real function>"]:::neutral
Q_USE["celebrate_milestone<br/>calls send_push(...)<br/>looks up 'send_push' HERE"]:::sut
Q_REF -.->|"looked up in<br/>this namespace"| Q_USE
end
P_DEF -->|"from push_notifier import send_push<br/>copies the reference"| Q_REF
classDef neutral fill:#fafafa,stroke:#bdbdbd,color:#424242
classDef sut fill:#fff3e0,stroke:#e65100,color:#bf360c
📜 The rule
Patch where the SUT looks up the name — not where it was originally defined.
celebrate_milestone does send_push(...). Python finds that name by looking it up in quest_service’s namespace (the importing module). So the patch target is "quest_service.send_push", not "push_notifier.send_push". Patching the latter does nothing — quest_service already has its own reference.
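You can watch the rule in action without any files on disk. This sketch is a miniature built on several assumptions:
- It builds two in-memory modules with types.ModuleType. The _demo names are hypothetical.
- It simulates the from-import by copying the reference.
- It uses patch.object on the module objects because they aren't registered in sys.modules.

With real modules, patch("quest_service.send_push") plays the same role as the second patch.object call.

```python
import types
from unittest.mock import patch

# Hypothetical in-memory stand-in for push_notifier.py.
push_notifier = types.ModuleType("push_notifier_demo")
real_calls = []


def _real_send_push(user_id, message):
    real_calls.append((user_id, message))


push_notifier.send_push = _real_send_push

# Hypothetical in-memory stand-in for quest_service.py. Assigning the attribute
# does what "from push_notifier import send_push" does: it binds the name in
# quest_service's OWN namespace, pointing at the same function object.
quest_service = types.ModuleType("quest_service_demo")
quest_service.send_push = push_notifier.send_push
exec(
    "def celebrate_milestone(user_id, days):\n"
    "    if days % 7 == 0:\n"
    "        send_push(user_id, f'{days}-day streak!')\n",
    quest_service.__dict__,  # so celebrate_milestone looks names up HERE
)

# Wrong target: patch where send_push was DEFINED.
with patch.object(push_notifier, "send_push") as wrong_mock:
    quest_service.celebrate_milestone("u1", 7)
wrong_mock_called = wrong_mock.called      # the mock was bypassed...
real_calls_after_wrong = list(real_calls)  # ...and the real function ran

# Right target: patch where celebrate_milestone LOOKS IT UP.
with patch.object(quest_service, "send_push") as right_mock:
    quest_service.celebrate_milestone("u1", 7)
right_mock.assert_called_once_with("u1", "7-day streak!")
```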
Part A — Predict and fix the patch target
⚙️ Task: open test_celebrate.py. The patch target is currently wrong. Run the test (it fails). Read the failure carefully — mock_send was never called, even though the SUT did run celebrate_milestone. That’s the signature of a wrong-namespace patch.
Then fix it: change the patch target string to the right one. Re-run.
💡 Pedagogical note. Your fix is one string change. The conceptual move is naming where the SUT looks the name up. That insight ports to JavaScript (CommonJS’ const { y } = require('x') has the same trap) and Java (static imports have a similar effect). Once you internalize the rule, you stop being trapped by the syntax.
Part B — autospec is a design guardrail, not a syntactic flourish
Read the other two tests in the file, test_loose_mock_accepts_wrong_call and test_autospec_rejects_wrong_call. Both run successfully, but they verify very different things.
| Concern | Loose Mock (no spec) | Autospec’d Mock |
|---|---|---|
| Setup | with patch("X") as m: | with patch("X", autospec=True) as m: |
| What m(wrong_args) does | Silently records the call | Raises TypeError because the real function’s signature is enforced |
| What m.assrt_called_once_with(...) (typo) does | Depending on the Python version, may silently auto-create the attribute and return yet another Mock | Raises AttributeError, because the autospecced object has only the real function’s attributes plus Mock’s assert_* helpers. Treat this as a bonus; signature drift is the main reason to use autospec. |
| When you’d want it | Quick exploratory test where signature isn’t a concern | Default-safe habit for any patched callable; catches signature drift the moment a teammate’s refactor breaks the contract |
The pedagogical takeaway: autospec=True is a design guardrail. It says “make this Mock as strict as the real thing it’s replacing.” Without it, your test silently accepts calls that the real function would reject — until production catches it for you, which is the worst place to find out.
📖 Behavior verification — the third kind
Steps 2 and 3 used state verification: stubs feed inputs, the test asserts on the SUT’s return value or on the spy’s recorded list. The SUT’s internal call sequence was incidental.
test_celebrate_milestone_sends_push (after you fix the patch target) is different. The SUT returns None. Nothing in its observable state changes. The call itself is the entire contract. We assert that mock_send was called once with specific arguments. That’s behavior verification (Meszaros).
A Mock configured with call assertions is, in Meszaros’ strict sense, a Mock Object. The role isn’t “what class did you instantiate” — it’s “what does the test verify, and how?”
| Role | What the test verifies | Verification kind |
|---|---|---|
| Stub | The SUT’s return value (driven by canned indirect inputs) | State |
| Spy | The recorded call list, after the fact | State (of the spy) |
| Mock Object | The interaction itself, often with strict expectations | Behavior |
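unittest.mock records calls and lets you check them afterwards, which is closer to a spy. To make the Meszaros Mock Object concrete, here is a hypothetical hand-rolled one. Expectations are declared before the SUT runs, an unexpected call fails immediately, and verify() fails if an expected call never happened. MockNotifier and the injected-notifier celebrate_milestone are illustrative stand-ins, not part of quest_service.py.

```python
class MockNotifier:
    """Hypothetical hand-rolled Mock Object (Meszaros sense)."""

    def __init__(self):
        self._pending = []  # expected calls not yet received

    def expect_send(self, user_id, message):
        self._pending.append((user_id, message))

    def send_push(self, user_id, message):
        received = (user_id, message)
        if received not in self._pending:
            # Strict: anything not expected fails on the spot.
            raise AssertionError(f"unexpected call: send_push{received!r}")
        self._pending.remove(received)

    def verify(self):
        if self._pending:
            raise AssertionError(f"expected but never called: {self._pending!r}")


def celebrate_milestone(notifier, user_id, days):
    """Stand-in SUT shaped like DailyQuestService.celebrate_milestone, notifier injected."""
    if days % 7 == 0:
        notifier.send_push(user_id, f"{days}-day streak!")


# 1. Set expectations up front, 2. exercise the SUT, 3. verify the interaction.
notifier = MockNotifier()
notifier.expect_send("u1", "7-day streak!")
celebrate_milestone(notifier, "u1", 7)
notifier.verify()

# A missing call is caught by verify().
silent = MockNotifier()
silent.expect_send("u1", "5-day streak!")
celebrate_milestone(silent, "u1", 5)  # 5 is not a multiple of 7: nothing sent
try:
    silent.verify()
    missing_detected = False
except AssertionError:
    missing_detected = True

# An unexpected call fails immediately, during the exercise step.
strict = MockNotifier()
strict.expect_send("u1", "7-day streak!")
try:
    celebrate_milestone(strict, "u2", 14)
    unexpected_detected = False
except AssertionError:
    unexpected_detected = True
```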
🌍 The same idea in another language
JavaScript with Jest (CommonJS): Same trap exists.
// questService.js
const { sendPush } = require('./pushNotifier');
function celebrateMilestone(...) { sendPush(...); }
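jest.mock('./pushNotifier') works here because Jest hoists the call above the require statements. It swaps in the mocked module before questService.js destructures it. The trap shows up with jest.spyOn(pushNotifier, 'sendPush') instead. spyOn replaces the property on the module object, but questService.js already copied the original function into its own sendPush binding at require time, so the spy never sees the call. This is the same family of problem.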
Java with Mockito static imports: Less prone to this since Java imports are class-level and Mockito patches at the type level. But PowerMock for static methods has its own where-to-patch dance.
The general lesson, language-independent: a name is looked up in the namespace of the module that uses it, and an import creates a new binding there. Patch that binding.
🧠 The typo trap and `autospec` — the precise truth
A common claim: “autospec catches typos like assrt_called_once_with.” That is true, but it undersells what autospec is for. Here’s the precise picture.
autospec=True constrains the Mock to the spec of the patched object: its arguments, its attributes (if it’s a class), and its method signatures. For attribute access, an autospecced Mock exposes only the attributes the real object has, plus Mock’s own assert_* helpers. A misspelled helper such as assrt_called_once_with is neither, so accessing it raises AttributeError. A plain Mock has no spec to check against, so whether it refuses the same misspelling depends on your Python version.
So autospec does double as a typo guard, but treat that as a bonus rather than the reason to use it.
The reliable defense against signature drift (calling send_push("u1") when the real function needs send_push("u1", "msg")): autospec catches this immediately. That’s the use case worth the keystrokes.
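You can check both defenses directly with create_autospec, the function that patch(..., autospec=True) uses to build its replacement. In this sketch, send_push is a hypothetical local copy of the real signature, so nothing is patched.

```python
from unittest.mock import Mock, create_autospec


def send_push(user_id: str, message: str) -> None:
    """Hypothetical local copy of push_notifier.send_push's signature."""
    raise RuntimeError("the real function should never run in this sketch")


loose = Mock()
strict = create_autospec(send_push)

# Signature drift: a refactor drops the message argument.
loose("u1")  # accepted and recorded, no complaint
try:
    strict("u1")
    drift_caught = False
except TypeError:
    drift_caught = True  # the real signature needs both arguments

# The correct call goes through and is recorded as usual.
strict("u1", "7-day streak!")
strict.assert_called_once_with("u1", "7-day streak!")

# Misspelled helper: the autospecced object has only the real function's
# attributes plus Mock's assert_* helpers, so this name does not exist.
typo_caught = not hasattr(strict, "assrt_called_once_with")
```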
🔭 Coming in Step 6: You can build any of the three roles and you know the patching pitfalls. The harder skill is choosing which one, and choosing none at all when over-mocking would make the test brittle.
"""The real push-notification service — would call APNS / FCM in production."""
def send_push(user_id: str, message: str) -> None:
    # In production: dispatches a real push notification.
    # The print is a teaching aid: if you see this in test output,
    # the patch DIDN'T intercept and the real function ran.
    print(f"📲 REAL send_push fired: user={user_id!r}, message={message!r}")
"""QuestForge — daily quest service with milestone celebration."""
import datetime
from push_notifier import send_push
QUEST_REWARDS = {
    "Slay the Slime Lord": 100,
    "Find the Lost Amulet": 150,
    "Battle the Lich King": 250,
    "Defeat the Dragon": 500,
}

def is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    today = clock.now().strftime("%Y-%m-%d")
    return today == event_date_str

class DailyQuestService:
    def __init__(self, clock, api, ledger=None):
        self._clock = clock
        self._api = api
        self._ledger = ledger

    def daily_quest_title(self, user_id: str) -> str:
        try:
            quests = self._api.fetch_quests(user_id)
        except ConnectionError:
            return "No quests today"
        if not quests:
            return "No quests today"
        weekday = self._clock.now().strftime("%A")
        for quest in quests:
            if quest["weekday"] == weekday:
                return quest["title"]
        return "No quests today"

    def complete_quest(self, user_id: str, quest_title: str) -> None:
        gold = QUEST_REWARDS.get(quest_title, 0)
        self._ledger.credit(user_id, gold)

    def award_streak_bonus(self, user_id: str, days: int) -> None:
        gold = min(days * 10, 100)
        self._ledger.credit(user_id, gold)

    def celebrate_milestone(self, user_id: str, days: int) -> None:
        """When a streak hits a multiple of 7, send a push notification."""
        if days % 7 == 0:
            send_push(user_id, f"🎉 {days}-day streak!")
"""Step 5 — Where-to-patch and autospec.
Three tests below. Tests B and C are correct as-is and demonstrate
autospec's value. Test A's PATCH TARGET IS WRONG — fix it.
"""
from unittest.mock import Mock, patch
from datetime import datetime
from clock import FrozenClock
from quest_service import DailyQuestService
def _service():
    return DailyQuestService(FrozenClock(datetime(2026, 4, 28, 12, 0)), Mock(), Mock())
# ===== TEST A — Part A: patch target is WRONG. Fix it. =====
# Run this test as-is. It FAILS — `mock_send.assert_called_once_with(...)`
# complains the mock was never called. That's the symptom of a
# wrong-namespace patch: the real send_push ran, the mock did nothing.
# YOUR JOB: change the patch target string from "push_notifier.send_push"
# to the correct one. Read `quest_service.py`'s import line — the SUT
# looks the name up in *which* namespace?
def test_celebrate_milestone_sends_push():
    service = _service()
    # ← FIX THE STRING BELOW. It's wrong.
    with patch("push_notifier.send_push") as mock_send:
        service.celebrate_milestone("u1", 7)
        mock_send.assert_called_once_with("u1", "🎉 7-day streak!")
# ===== TEST B — Part B: a LOOSE Mock accepts a wrong-signature call =====
# The real send_push takes 2 arguments (user_id, message).
# Without autospec, the Mock will silently accept a 1-argument call.
# Watch what gets through.
def test_loose_mock_accepts_wrong_call():
    with patch("quest_service.send_push") as mock_send:
        # Imagine a teammate's refactor that drops the message arg
        # (real production bug). The Mock has no spec, so it accepts.
        mock_send("u1")  # Real send_push REQUIRES 2 args; Mock doesn't care.
        # The recorded call passes assertion. The bug slipped through.
        mock_send.assert_called_once_with("u1")
# ===== TEST C — Part B: autospec REJECTS the wrong-signature call =====
# With autospec=True, the Mock matches the real function's signature.
# Calling it with the wrong number of arguments raises TypeError.
def test_autospec_rejects_wrong_call():
    with patch("quest_service.send_push", autospec=True) as mock_send:
        try:
            mock_send("u1")  # Same bad call as Test B; autospec catches it
            assert False, "autospec should have raised TypeError"
        except TypeError as e:
            # autospec correctly rejected the call. The signature was enforced.
            print(f"✅ autospec caught it: {e}")
Step 5 — Knowledge Check
Min. score: 80%
1. quest_service.py does:
from push_notifier import send_push
celebrate_milestone calls send_push(...). Which patch target intercepts the call?
The rule: patch where the SUT looks up the name, not where it
was defined. After from X import Y, the name Y is bound in the
importing module — that’s where the SUT will resolve it. The same
principle applies to JavaScript CommonJS, Java static imports, and
any language with import scoping.
2. What does autospec=True primarily defend against?
autospec=True is the default-safe habit for patched callables:
it makes the mock as strict as the real thing it’s replacing.
Signature drift (the most common refactoring bug) gets caught
immediately. Use it unless you have a reason not to.
3. What’s the relationship between Test Double (the umbrella name) and Stub / Spy / Mock / Fake / Dummy?
Test Double is the umbrella — five specialized roles below it. When you say “I added a mock,” you’re naming the Mock Object role within the Test Double umbrella, not the umbrella itself. See Meszaros’ Test Double for the full taxonomy.
4. (Spaced review — Step 4) A Mock is patched in for the SUT’s collaborator. The test asserts mock.method.assert_called_once_with("u1", 100). What role is this Mock playing?
unittest.mock blurs the Spy/Mock-Object line that Meszaros drew
crisply. Both are forms of behavior verification; they differ
mainly in whether the expectation is set up-front (mockist style)
or read after-the-fact (spy style). For your day-to-day work:
don’t worry too much about which side of the line you’re on —
worry about whether the test actually verifies the contract.
5. (Spaced review — Steps 1 & 2) In Step 1 you injected clock=datetime.datetime as a parameter of is_today_event_day (Dependency Injection). In this step you patched "quest_service.send_push" via unittest.mock.patch. When is each technique the right choice?
Two techniques for two situations:
DI when the SUT can take the collaborator as a parameter (Step 1’s
clock=datetime.datetime). Cleanest, most testable.
patch() when the SUT imports the name at module level and you
can’t change that without disrupting other callers (Step 5’s
quest_service.send_push). Heavier, but works when DI doesn’t.
The same role-vs-syntax distinction from Step 4 applies: stub/spy/mock
are roles; DI and patch() are delivery vehicles for those roles.
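The DI route needs no patching machinery at all, which sets it apart from the patch() route. In this minimal sketch, is_today_event_day mirrors the function in quest_service.py. The FrozenClock here is a hypothetical re-creation, because the tutorial's clock.py isn't shown in this section.

```python
import datetime


def is_today_event_day(event_date_str: str, clock=datetime.datetime) -> bool:
    # Same shape as quest_service.is_today_event_day: the clock is a parameter
    # whose default is the real wall clock.
    return clock.now().strftime("%Y-%m-%d") == event_date_str


class FrozenClock:
    """Hypothetical minimal stub clock: now() always returns one fixed instant."""

    def __init__(self, instant: datetime.datetime):
        self._instant = instant

    def now(self) -> datetime.datetime:
        return self._instant


clock = FrozenClock(datetime.datetime(2026, 4, 28, 12, 0))
same_day = is_today_event_day("2026-04-28", clock=clock)
next_day = is_today_event_day("2026-04-29", clock=clock)
```

Both results are the same no matter which day the test runs, which is exactly what Step 1's original test lacked.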
6. (Spaced review — Step 4 typo trap) What’s the most reliable defense against typos like mock.assrt_called_once_with(...) silently passing?
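Prefer a defense that doesn’t depend on anyone spelling correctly.
An autospecced mock (patch(..., autospec=True) or create_autospec)
exposes only the real object’s attributes plus Mock’s assert_*
helpers, so a misspelled helper raises AttributeError when the test runs.
Don’t count on mypy or pyright alone here: unittest.mock’s type
stubs generally let any attribute access on a Mock through.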
When NOT to Use a Double — The Decision Guide
Why this matters
A test double is a tool — not a default, not a sign of professionalism, not a coverage strategy. The right number of doubles for many tests is zero. Reaching for Mock reflexively produces brittle tests that break under harmless refactors and assert on choreography instead of behavior. This step builds the judgment to not reach for a double when a real collaborator would do — the capstone skill that separates “mocks everything” from “mocks at the right boundary.”
🎯 You will learn to
- Evaluate an over-mocked test and diagnose where it broke from the spec
- Apply a decision guide to classify scenarios as no-double / stub / spy / mock / fake / adapter
- Analyze the “mock what you own” heuristic and the Adapter wrap-and-mock pattern
🧭 The whole arc, in one sentence. A test double is a tool you reach for when a real collaborator would make the test flaky, slow, or unable to verify the right thing.
📖 The decision flow
flowchart TD
A["What does this test need to verify?"]:::neutral --> B{"Does the SUT have collaborators<br/>worth doubling?<br/>(slow/flaky/unavailable)"}
B -->|"No — pure function"| NO["No double<br/>Just call it"]:::good
B -->|"Yes"| C{"Do you control the test's input<br/>via a collaborator?"}
C -->|"Yes — control input"| STUB["Stub<br/>(canned answers)"]:::good
C -->|"No — verify a call happened"| D{"Inspect after the fact<br/>or set up-front?"}
D -->|"After"| SPY["Spy<br/>(record + assert)"]:::good
D -->|"Up-front strict"| MOCK["Mock Object<br/>(behavior verification)"]:::good
B -->|"Yes — but stateful + multi-call"| FAKE["Fake<br/>(in-memory implementation)"]:::good
B -->|"Third-party library<br/>you don't own"| ADAPT["Wrap in an Adapter<br/>then double the adapter"]:::warn
classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
classDef warn fill:#fff3e0,stroke:#e65100,color:#bf360c
classDef neutral fill:#fafafa,stroke:#bdbdbd,color:#424242
📖 Three antipatterns to recognize on sight
| Antipattern | Symptom | Why it happens | Fix |
|---|---|---|---|
| Over-mocking | Every internal helper is mocked; the test asserts only on the mocks. | “Isolation feels safe; more mocks = more tested.” | Mock at the architectural boundary (HTTP, DB, clock), not at every internal function. |
| Mocking what you don’t own | A third-party library’s API is mocked directly, scattered across many tests. | The library is brittle and the team doesn’t want to wait for real responses. | Wrap the third-party in an Adapter (Adapter pattern); mock the Adapter. The third-party’s internals stay invisible to your tests. |
| Coverage chasing | Every line of the SUT runs in some test, but assertions are weak (is not None) or mocked-on-mocks. | Coverage is misread as a quality signal. | Stronger oracles, real collaborators where possible, fewer tests that test more meaningfully. Coverage ≠ correctness (Testing Foundations Step 3). |
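Here is a minimal sketch of the Adapter fix, with every name hypothetical:
- VendorPushSDK plays the third-party library you don't own.
- PushAdapter is the narrow interface you do own.
- celebrate is a stand-in SUT.

Only the adapter is doubled, and Mock(spec=PushAdapter) also makes a misspelled method name fail instead of passing silently.

```python
from unittest.mock import Mock


class VendorPushSDK:
    """Hypothetical third-party client you don't own, with a wide, awkward API."""

    def post_notification(self, payload: dict, *, retries: int = 3, region: str = "us") -> dict:
        raise RuntimeError("would hit the vendor's network")


class PushAdapter:
    """Your own narrow interface; the only class that knows VendorPushSDK exists."""

    def __init__(self, sdk: VendorPushSDK):
        self._sdk = sdk

    def notify(self, user_id: str, message: str) -> None:
        self._sdk.post_notification({"to": user_id, "body": message}, region="eu")


def celebrate(notifier: PushAdapter, user_id: str, days: int) -> None:
    """Stand-in SUT: depends on the adapter, never on the SDK."""
    if days % 7 == 0:
        notifier.notify(user_id, f"{days}-day streak!")


# Double the interface you own. spec= makes misspelled methods fail loudly.
notifier = Mock(spec=PushAdapter)
celebrate(notifier, "u1", 14)
notifier.notify.assert_called_once_with("u1", "14-day streak!")
```

If the vendor changes its API, only PushAdapter changes and the SUT's tests stay put. The adapter itself is typically covered by a few integration tests against the real library, instead of being mocked in every test.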
Part 1 — Read the over-mocked vs clean tests
Open xp_calculator.py. The function compute_total_xp(quests) is pure: it takes a list, computes a number, returns it. No clock, no HTTP, no database. No collaborators worth doubling. Yet test_xp_overmocked.py mocks every internal helper.
⚙️ Task 1: read both test_xp_overmocked.py and test_xp_clean.py. In test_xp_clean.py, uncomment the docstring at the top and fill in your one-line answer to: “What did the over-mocked version mock unnecessarily — and what did that cost?”
📖 What the over-mocked test actually verifies (look only after writing your answer)
Look at test_xp_overmocked.py. The mocks intercept _filter_completed, _apply_multipliers, and _sum_xp. With those internals replaced by Mocks returning canned values, the test only verifies that compute_total_xp calls the helpers in some order and returns the last one’s result. That’s not the spec. The spec is “given these quest dicts, return the total XP.”
Worse: if a teammate refactors the internals (rename _apply_multipliers to _apply_modifiers; merge two helpers into one; inline a helper away entirely), every one of those changes preserves the function’s behavior — but breaks the over-mocked test. Brittleness without protection. The clean test never breaks under those refactors because it asserts on the spec, not on the implementation choreography.
Same lesson as Testing Foundations Step 4 (“test behavior, not implementation”), now applied to mocks instead of internal state access. It is one principle, not two.
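You can reproduce in miniature the claim that the over-mocked test protects nothing. xp_calculator.py isn't shown in this section, so this sketch builds a hypothetical stand-in module from a source string. The quest dict shape ("xp", "completed", "multiplier") and the helper bodies are assumptions. _sum_xp carries a deliberately planted bug. The helpers are patched with patch.object, mirroring the over-mocked style.

```python
import types
from unittest.mock import patch

# Hypothetical stand-in for xp_calculator.py (the real file isn't shown here).
XP_SOURCE = '''
def _filter_completed(quests):
    return [q for q in quests if q["completed"]]

def _apply_multipliers(quests):
    return [q["xp"] * q["multiplier"] for q in quests]

def _sum_xp(values):
    return sum(values[1:])  # PLANTED BUG: silently drops the first quest

def compute_total_xp(quests):
    return _sum_xp(_apply_multipliers(_filter_completed(quests)))
'''
xp = types.ModuleType("xp_calculator_demo")
exec(XP_SOURCE, xp.__dict__)

QUESTS = [
    {"xp": 100, "completed": True, "multiplier": 1.0},
    {"xp": 50, "completed": True, "multiplier": 2.0},
    {"xp": 999, "completed": False, "multiplier": 1.0},
]


def overmocked_test():
    # Every helper is replaced, so the buggy _sum_xp never runs.
    with patch.object(xp, "_filter_completed", return_value=["f"]), \
            patch.object(xp, "_apply_multipliers", return_value=["m"]), \
            patch.object(xp, "_sum_xp", return_value=200) as fake_sum:
        assert xp.compute_total_xp(QUESTS) == 200
        fake_sum.assert_called_once_with(["m"])


def clean_test():
    # Spec-level oracle: 100 * 1.0 + 50 * 2.0 = 200; the incomplete quest is excluded.
    assert xp.compute_total_xp(QUESTS) == 200


def passes(test):
    try:
        test()
        return True
    except AssertionError:
        return False


overmocked_passed = passes(overmocked_test)
clean_passed = passes(clean_test)
```

The over-mocked test passes because the buggy helper never runs. The spec-level test fails and exposes the bug.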
Part 2 — Classify five scenarios
Open scenarios.py. For each of the five scenarios, set the variable to the best single recommendation from this list:
"no_double" "stub" "spy" "mock" "fake" "adapter"
The validator accepts any defensible answer for each scenario (some scenarios have more than one defensible answer — e.g., spy and mock are often interchangeable for a single outbound call). It rejects clearly wrong choices.
🧰 Quick decision rubric (use, don't memorize)
| If the SUT… | Reach for… |
|---|---|
| …is a pure function — same input always yields same output, no collaborators | No double |
| …calls a clock, a remote service, or any non-deterministic source | Stub |
| …needs to verify a fire-and-forget outbound call (e.g., notifier.send(...)) | Spy or Mock |
| …needs to round-trip with a stateful collaborator (write then read) | Fake |
| …calls a third-party library you don’t own | Adapter wrapper → double the adapter |
| …is just simple math/string/list manipulation | No double (don’t make work) |
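Two rows of the rubric in code, as a minimal sketch with hypothetical names (FixedClock, greeting, InMemoryUserRepo, change_email) chosen so the sketch does not answer scenarios.py for you. The stub only feeds canned input to the SUT; the fake keeps real state across calls, so a read-then-write sequence behaves the way it would against a real store.

```python
from datetime import datetime
from typing import Optional


class FixedClock:
    """Stub: returns the canned time it was built with, every time."""

    def __init__(self, now: datetime) -> None:
        self._now = now

    def now(self) -> datetime:
        return self._now


def greeting(clock) -> str:
    # Hypothetical SUT: the clock is injected, so the test controls "now".
    return "Good morning" if clock.now().hour < 12 else "Good afternoon"


class InMemoryUserRepo:
    """Fake: a working, dict-backed stand-in for a database repository."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def save(self, user_id: str, email: str) -> None:
        self._rows[user_id] = email

    def find_email(self, user_id: str) -> Optional[str]:
        return self._rows.get(user_id)


def change_email(repo, user_id: str, new_email: str) -> None:
    # Hypothetical SUT: reads, then writes, through whatever repository it is given.
    if repo.find_email(user_id) is None:
        raise KeyError(user_id)
    repo.save(user_id, new_email)


# Stub in action: the result no longer depends on when the test runs.
assert greeting(FixedClock(datetime(2026, 4, 28, 9, 0))) == "Good morning"
assert greeting(FixedClock(datetime(2026, 4, 28, 15, 0))) == "Good afternoon"

# Fake in action: a write-then-read sequence works with no database on CI.
repo = InMemoryUserRepo()
repo.save("u1", "old@example.com")
change_email(repo, "u1", "new@example.com")
assert repo.find_email("u1") == "new@example.com"
```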
🌍 The same decision in another language
The decision is purely about test design, not about syntax. JavaScript, Java, C#, Ruby, Go — every language with serious testing culture has the same five-or-so doubles, the same antipatterns, and the same heuristic: only mock what you own; only mock what’s actually a collaborator; pure functions don’t need doubles.
The frameworks differ; the design judgment doesn’t.
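“Only mock what you own” usually takes the shape of an adapter: a thin class you own wraps the third-party call, the SUT depends on a narrow interface, and tests double that interface instead of the library. A minimal sketch, assuming a hypothetical weather service; the /weather path and the temp_c JSON field are placeholders, and the adapter's network code is never executed here, so the block runs offline with only the standard library.

```python
import json
import urllib.request
from typing import Protocol
from urllib.parse import quote


class WeatherSource(Protocol):
    """The narrow interface the SUT depends on (an interface we own)."""

    def temperature_c(self, city: str) -> float: ...


class HttpWeatherAdapter:
    """Adapter we own: the only class that touches the HTTP library.

    The /weather path and the "temp_c" field are hypothetical placeholders.
    """

    def __init__(self, base_url: str) -> None:
        self._base_url = base_url

    def temperature_c(self, city: str) -> float:
        url = f"{self._base_url}/weather?city={quote(city)}"
        with urllib.request.urlopen(url) as resp:
            return float(json.load(resp)["temp_c"])


def packing_advice(weather: WeatherSource, city: str) -> str:
    # SUT: knows nothing about urllib, URLs, or JSON.
    return "Bring a coat" if weather.temperature_c(city) < 10 else "Pack light"


class StubWeather:
    """Test double for WeatherSource: canned temperature, no network."""

    def __init__(self, temp_c: float) -> None:
        self._temp_c = temp_c

    def temperature_c(self, city: str) -> float:
        return self._temp_c


assert packing_advice(StubWeather(3.0), "Oslo") == "Bring a coat"
assert packing_advice(StubWeather(25.0), "Lisbon") == "Pack light"
```

Changing the HTTP library or the endpoint would then touch HttpWeatherAdapter only; packing_advice and its tests would stay as they are.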
Part 3 — Forward pointers
You now have the conceptual vocabulary to read any test in any modern Python codebase and recognize what role each double is playing — even when the author called everything a “mock.” That recognition transfers across languages.
🔭 Where this leads in the rest of the curriculum:
- SOLID Tutorial — Dependency Inversion makes doubles trivial: define an interface, have the SUT depend on it, swap implementations at test time. Most painful mocks are caused by skipped DIP.
- TDD — the next natural sequel: TDD where the SUT has collaborators from the start. Red phase becomes “decide what to double, then write the failing test.”
🪞 Recalibrate. Look back at Step 1 — the test that passed today and would have failed tomorrow. Your toolkit now has six things to do instead of “ship and pray”:
- Recognize a flaky/slow/opaque collaborator (Step 1).
- Inject the collaborator as a parameter (Step 1).
- Substitute a stub when you need to control input (Step 2).
- Substitute a spy when you need to verify a call (Step 3).
- Reach for unittest.mock when boilerplate gets tedious (Step 4) — but recognize the role you’re playing.
- Use patch() carefully — where the SUT looks the name up — and prefer autospec=True (Step 5).
And the seventh, just learned: sometimes the right answer is no double at all. That judgment is what makes you good at this.
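Why autospec=True is worth the habit: a minimal sketch, assuming a hypothetical Notifier class. patch(..., autospec=True) builds its replacement with create_autospec, which is called directly here so the block needs no module to patch. A plain Mock accepts any call, including ones the real object would reject; the autospecced double copies the real signatures and attribute names, so a wrong call or a misspelled method fails loudly.

```python
from unittest.mock import Mock, create_autospec


class Notifier:
    """Hypothetical collaborator; the real one would talk to an external service."""

    def send(self, user_id: str, message: str) -> None:
        raise NotImplementedError


# Plain Mock: happily accepts a call the real send() would reject (no message).
loose = Mock()
loose.send("u42")  # no error, so a stale test keeps passing

# Autospecced double: mirrors a real Notifier instance's attributes and signatures.
strict = create_autospec(Notifier, instance=True)
strict.send("u42", "Quest complete!")
strict.send.assert_called_once_with("u42", "Quest complete!")

try:
    strict.send("u42")  # wrong arity: 'message' is missing
    bad_call_rejected = False
except TypeError:
    bad_call_rejected = True

try:
    strict.sned("u42", "typo")  # misspelled method name
    typo_rejected = False
except AttributeError:
    typo_rejected = True

print("bad call rejected:", bad_call_rejected)
print("typo rejected:", typo_rejected)
```

Both flags print True, while the same two mistakes against loose would have passed silently.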
```python
# xp_calculator.py
"""A PURE function for computing XP earned across quests.
No collaborators. No clock. No HTTP. No database.
Helper functions are private (underscore prefix) — implementation detail.
"""


def _filter_completed(quests: list[dict]) -> list[dict]:
    return [q for q in quests if q.get("completed")]


def _apply_multipliers(quests: list[dict]) -> list[tuple[str, int]]:
    return [(q["title"], q["xp"] * q.get("multiplier", 1)) for q in quests]


def _sum_xp(items: list[tuple[str, int]]) -> int:
    return sum(xp for _title, xp in items)


def compute_total_xp(quests: list[dict]) -> int:
    """Return the total XP earned from completed quests, with multipliers applied.
    Each quest is a dict with keys: title (str), xp (int), completed (bool),
    and an optional multiplier (int, default 1).
    """
    completed = _filter_completed(quests)
    with_multipliers = _apply_multipliers(completed)
    return _sum_xp(with_multipliers)
```
```python
# test_xp_overmocked.py
"""SMELL — every internal helper is mocked. Read this and recoil.
Notice what's actually verified: nothing about the SUT's behavior.
The mocks made up the answer; the SUT just orchestrated them.
"""
from unittest.mock import patch

from xp_calculator import compute_total_xp


def test_total_xp_overmocked_brittle():
    with patch("xp_calculator._filter_completed") as mock_filter, \
         patch("xp_calculator._apply_multipliers") as mock_apply, \
         patch("xp_calculator._sum_xp") as mock_sum:
        mock_filter.return_value = "<canned>"
        mock_apply.return_value = "<canned>"
        mock_sum.return_value = 200
        result = compute_total_xp([{"completed": True, "xp": 50}])
        assert result == 200
    # The "test" passes whether or not the SUT correctly filters,
    # multiplies, or sums — because we mocked all three.
    # If a teammate renames _apply_multipliers, this test breaks
    # for the WRONG reason (refactor, not behavior change).
```
```python
# test_xp_clean.py
"""Clean: no doubles. compute_total_xp is a pure function — exercise it directly."""

# TODO: in your own words, in ONE LINE, answer the question below.
# The validator just checks that this docstring is no longer empty.
"""The over-mocked version mocked: ___ FILL IN ___
What that cost: ___ FILL IN ___"""

from xp_calculator import compute_total_xp


def test_total_xp_for_two_completed_quests():
    quests = [
        {"title": "Slay", "xp": 50, "completed": True, "multiplier": 2},
        {"title": "Find", "xp": 30, "completed": False, "multiplier": 1},
        {"title": "Defeat", "xp": 100, "completed": True, "multiplier": 1},
    ]
    # 50*2 + (Find skipped: not completed) + 100*1 = 200
    assert compute_total_xp(quests) == 200


def test_total_xp_for_no_completed_quests():
    quests = [{"title": "Skip", "xp": 999, "completed": False}]
    assert compute_total_xp(quests) == 0
```
```python
# scenarios.py
"""Classify each scenario by the BEST single recommendation.
Allowed values:
    "no_double" — the SUT is pure (or close enough); call it directly
    "stub"      — control indirect input with canned values
    "spy"       — verify a fire-and-forget call after the fact
    "mock"      — strict behavior verification of a single contract call
    "fake"      — stateful in-memory implementation across multiple calls
    "adapter"   — wrap a third-party library, then double the adapter
"""

# Scenario 1: A pure function `compute_tax(price: float, rate: float) -> float`
# that returns price * rate. No collaborators.
SCENARIO_1_BEST = "FILL_IN"

# Scenario 2: A function `is_coupon_expired(coupon)` that calls datetime.now()
# internally to compare against `coupon.expires_at`. We want a deterministic test.
SCENARIO_2_BEST = "FILL_IN"

# Scenario 3: `process_order(order)` POSTs to a payment gateway. The test must
# verify the gateway was called exactly once with the right amount.
SCENARIO_3_BEST = "FILL_IN"

# Scenario 4: A `UserRepository` reads/writes user records to Postgres.
# The SUT under test does many round-trips: register a user, then look them up,
# then update their email, then look them up again. Tests run on CI without a DB.
SCENARIO_4_BEST = "FILL_IN"

# Scenario 5: Throughout the codebase, many modules call `requests.get(...)`
# directly. Patching `requests` everywhere is fragile; the tests are slow.
SCENARIO_5_BEST = "FILL_IN"
```
Step 6 — Knowledge Check
Min. score: 80%
1. A test mocks every internal helper of the SUT and asserts only on the mocks’ return values. Which antipattern is this?
Mock at the architectural boundary; let internal helpers be real. The line “this collaborator is worth doubling” runs through the boundary between your code and the unpredictable world (clock, HTTP, DB, queue) — not through every function-call edge inside your own module.
2. (Cumulative review) Match each scenario to the best single double:
- A: A pure function that adds two integers
- B: A function that calls datetime.now() to decide an expiration
- C: A function that POSTs to a payment gateway, fire-and-forget
- D: A function that round-trips with a Postgres user table 5 times
The rubric: pure → no double; non-deterministic → stub; outbound call → spy/mock; stateful sequence → fake. Internalize the shape of the rubric (the decision table in Part 2); the words follow.
3. “Don’t mock what you don’t own.” What does this rule actually mean?
"Mock what you own" is shorthand for "depend on interfaces you control, then mock those interfaces." The Adapter pattern from the design-patterns literature is exactly the maneuver this rule recommends: wrap the third-party library in a thin class you own, then double that class.
4. (Spaced review — TDD) During Red-Green-Refactor, when do you typically decide which double to use?
Choosing a double is part of test design; test design happens in Red. Same lesson as Testing Foundations Step 5: input choice and oracle strength are independent test-design dimensions, both decided when you write the test. Add "choice of double" as a third independent dimension.
5. (Spaced review — Step 3) Step 3’s test_complete_quest_LIAR_oracle was left in the file intentionally — assert len(spy.calls) >= 0 passes regardless of behavior, and Step 3 asked you to comment on it rather than fix it. Why keep a known-broken test in the file?
Most testing tutorials only show good tests. Real codebases have both. Keeping a Liar in the file alongside a Goldilocks test trains the eye to discriminate — a skill students need on day 1 of a real job, where most tests they read will be imperfect. (Same reasoning behind Step 6’s test_xp_overmocked.py — kept in the file as a recognizable bad example, not deleted.)
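The Liar is easy to miss because the line still looks like a check. A minimal sketch, assuming a hypothetical complete_quest(notifier, quest_id) and a list-based spy (the actual Step 3 code may differ): len(spy.calls) >= 0 holds for every list, so it passes even when the SUT never notifies, while the Goldilocks assertion pins the exact call.

```python
class SpyNotifier:
    """Spy: records outbound calls for later inspection."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, str]] = []

    def send(self, quest_id: str, message: str) -> None:
        self.calls.append((quest_id, message))


def complete_quest(notifier, quest_id: str) -> None:
    # Hypothetical correct SUT: notifies exactly once.
    notifier.send(quest_id, "Quest complete!")


def complete_quest_broken(notifier, quest_id: str) -> None:
    # Hypothetical bug: forgets to notify at all.
    pass


def liar_oracle(spy: SpyNotifier) -> bool:
    return len(spy.calls) >= 0  # true for ANY list, so it checks nothing


def goldilocks_oracle(spy: SpyNotifier, quest_id: str) -> bool:
    return spy.calls == [(quest_id, "Quest complete!")]


for sut in (complete_quest, complete_quest_broken):
    spy = SpyNotifier()
    sut(spy, "q1")
    print(sut.__name__, "liar:", liar_oracle(spy), "goldilocks:", goldilocks_oracle(spy, "q1"))
```

The liar oracle reports True for both versions; only the Goldilocks oracle reports False for complete_quest_broken.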
6. (Spaced review — Step 5) Why is autospec=True worth reaching for almost every time you patch a callable?
Default-safe habit: use autospec=True whenever you’re patching a callable. It costs nothing at edit time, makes the test fail when the code calls the double with arguments the real signature would reject (or touches an attribute the real object doesn't have), and makes refactoring safer in the long run.