Testing Foundations with pytest

Build the core testing skills you need BEFORE learning TDD: write meaningful assertions, choose test cases using partitions and boundaries, and test behavior instead of implementation.

Why Test? The Bug That Got Away

Why this matters

Imagine you’ve kept your Duolingo streak alive for 100 days straight. You open the app expecting the 💯 badge — and it shows you 🔥 instead. One missing = sign in the badge logic, and the milestone you actually earned silently disappeared. The code runs cleanly, prints no error, and a million 100-day-streakers feel slightly betrayed. That is what tests prevent.

🎯 You will learn to

Apply pytest’s pass/fail loop: read a failing test, understand what it expects, and fix production code until it passes.
Analyze what a test specifies about a function’s behavior versus what it merely happens to observe.

🧭 Heads-up — a shift coming. By the end of this tutorial you’ll think about tests differently than most beginners do: not as “checking your homework” but as executable specifications of behavior. Notice the shift as it happens.

💡 Why test?

Many students think testing is about finding bugs after you write code. That’s half the story. Tests also:

Specify behavior — a test says “this function should do X”
Prevent regressions — a regression is a bug that comes back after being fixed; once a passing test guards a behavior, any future change that breaks that behavior immediately fails the test
Enable fearless refactoring — change code confidently because the suite catches breakage immediately

Think of tests as a safety net: once a test passes, it stays in place to catch you. If a future change breaks the behavior the test guards, the test fails — the regression is caught before users feel it.

🔍 Predict first

Don’t run anything yet. Open streaks.py and read it.

What will streak_badge(150) return? (deep into 💯 territory)
And streak_badge(50)? (in the 🔥 zone)
And streak_badge(100) — exactly on the line between 🔥 and 💯?

Hold those three predictions in your head.

📂 What you have

Two files are already set up for you:

streaks.py — the production code (with a real bug).
test_streaks.py — three tests, already written for you. Each is a Python function whose name starts with test_. That naming is how pytest finds and runs them. Each body calls streak_badge and asserts what it should return. (In Step 2 you’ll write your own from scratch.)

⚙️ Task:

Read test_streaks.py. What behavior is each test checking? Notice the third test pins down streak_badge(100) — the spec says 100 days and up earns 💯.
Run the tests (Run button). One test will fail. That’s a win 🎯 — the test just caught a real bug. Read the failure carefully: pytest tells you exactly which assertion failed and what value came back instead.
Fix streaks.py so all three tests pass. Don’t touch the test file — production code is what we change; tests describe what the code should do.
Run again. Three passing tests. The fix is now permanently guarded by the test — if anyone ever reverts to the old comparison, the safety net catches it instantly.

That whole loop is the rhythm you’ll see in every later step:

flowchart LR
    predict["1. Predict<br/>(don't run yet)"]:::neutral
    red["2. Run pytest<br/>see RED ✗"]:::bad
    fix["3. Fix streaks.py<br/>(production code, not the test)"]:::neutral
    green["4. Run pytest<br/>see GREEN ✓"]:::good
    guard["5. Test guards behavior<br/>future regressions caught"]:::good
    predict --> red --> fix --> green --> guard
    classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef bad fill:#ffebee,stroke:#c62828,color:#b71c1c
    classDef neutral fill:#fafafa,stroke:#bdbdbd,color:#424242

🎯 Why this bug matters (read after solving)

The bug lives at exactly 100 days — the line between 🔥 and 💯. That’s no coincidence. Bugs love boundaries — the values where behavior changes. They’re the natural home of off-by-one errors (> vs >=, < vs <=). You’ll hunt boundaries systematically in Step 2.

🧭 Pause — name what just happened. You ran a test, read a failure, fixed code, and confirmed it with a re-run. In one sentence: what did that test specify about streak_badge? Use the words “specification” or “behavior” rather than “check.” Then go one level deeper: why does writing the assertion first (before seeing whether the code passes) mean the test reflects intended behavior rather than observed behavior? What would change if you wrote the assertion after reading the output?

🔭 Coming in Step 2: Not all inputs are equally useful for finding bugs. The streak bug at exactly day 100 wasn’t a coincidence — bugs cluster at boundaries, the values where one behavior turns into another. You’ll learn how to find them systematically before they ship.

Starter files

streaks.py

def streak_badge(days: int) -> str:
    """Pick the streak badge for a daily-app streak (Duolingo / Snapchat / BeReal style).

    Spec:
      days >= 100 -> "💯"   (century club)
      days >=  30 -> "🔥"   (on fire)
      days >=   7 -> "⚡"   (lit week)
      days >=   1 -> "✨"   (just started)
      else        -> ""     (no streak)
    """
    if days > 100:
        return "💯"
    if days >= 30:
        return "🔥"
    if days >= 7:
        return "⚡"
    if days >= 1:
        return "✨"
    return ""

test_streaks.py

"""Tests for streaks.streak_badge — pre-written for you in this step.
In Step 2 you'll write your own from scratch."""
import pytest
from streaks import streak_badge


def test_well_above_century_is_diamond():
    # 150 days is deep in the 💯 range — this should never be in doubt.
    assert streak_badge(150) == "💯"


def test_inside_fire_range_is_fire():
    # 50 days is comfortably in the 🔥 range (30-99).
    assert streak_badge(50) == "🔥"


def test_exactly_at_century_boundary_is_diamond():
    # The spec says: 100 days and up earns 💯.
    # 100 is the *boundary* — the value where 🔥 turns into 💯.
    # Boundary bugs (off-by-one) love values like this. (More in Step 2.)
    assert streak_badge(100) == "💯"

Step 1 — Knowledge Check

Min. score: 80%

1. A teammate says: “I only write tests after I finish all my code, to check for bugs.” What is the main limitation of this approach?

There is no real limitation — writing tests after code is the industry standard, and the test suite ends up identical either way since you cover the same behaviors
This claim ignores the order in which tests are written. Tests written after the code reflect what the code happens to do — including any bugs the author baked in. Tests written before (or alongside) reflect what the code should do. The two suites look identical only when the implementation already happens to be correct.
Post-hoc tests verify what you think the code does, not what it should do. They also can’t guide your design during development
Exactly. Post-hoc tests are written against code that already exists, so they can only confirm what the implementation does — including any bugs baked in. They also can’t guide design during development that already happened.
Writing tests afterward is fine as long as you achieve 100% coverage, which ensures all behavior is verified
Coverage measures which lines ran, not whether their behavior was checked. A weak oracle (assert result is not None) hits 100% coverage and verifies almost nothing. You’ll see this exact failure mode in Step 3.
The main problem is that it takes slightly longer because you have to context-switch back to code you already wrote
Context-switching is a minor cost compared with the real issue — post-hoc tests can only confirm your implementation matches itself, not that it matches the spec. They also can’t act as a safety net during the development that already happened.

Post-hoc tests verify your implementation rather than intended behavior. They also cannot serve as a safety net during development because they don’t exist yet.

2. What does this code do?

assert len(result) == 3, "Expected 3 items"

It prints ‘Expected 3 items’ to the console, then continues execution normally
assert is not a print statement. The string after the comma is the failure message — it only appears when the assertion is False and Python raises an AssertionError.
It checks that result has 3 elements; if not, it raises an AssertionError with the message
Right. The condition is evaluated; if True, execution continues silently. If False, Python raises AssertionError with the message string as the error detail.
It creates a new list called result and populates it with exactly 3 items
assert doesn’t create or modify variables — it just evaluates a condition. The result here must already exist when assert runs, or you’d get a NameError first.
It evaluates the condition but silently continues even if the condition is False
A False assertion does not silently continue — it raises AssertionError. (Python’s -O flag does strip asserts, but in normal execution they always raise on failure.)

assert condition, message evaluates the condition. If True, nothing happens and execution continues. If False, Python raises an AssertionError with the message.

3. Dijkstra wrote: “Testing can show the presence of bugs, but never their absence.” Applied to automated test suites, this means:

Tests are only useful if you already suspect a bug — otherwise they’re wasted effort
Tests aren’t just useful when you already suspect a bug — they document expected behavior and catch regressions you didn’t anticipate. The Dijkstra quote is about the ceiling of what tests guarantee, not about when to bother writing them.
Passing tests confirm that tested behaviors work as specified, but cannot rule out bugs in untested behaviors or inputs
Exactly. Passing tests show the tested behaviors work as specified. They cannot show that untested behaviors work — the guarantee only extends as far as the suite covers.
If all tests pass, the code is safe to deploy without further review
Even with a full passing suite, untested behaviors could still harbor bugs. Tests prove only what they test. Code review, static analysis, and other practices complement testing precisely because no suite is exhaustive.
Tests guarantee correctness once you reach 100% line coverage
Coverage measures which lines ran, not whether their behavior was verified. Full coverage with weak oracles (is not None everywhere) passes for almost any return value and proves almost nothing. Dijkstra’s quote applies regardless of coverage level.

Tests confirm the behaviors they test, but cannot guarantee zero bugs overall. Dijkstra’s observation means: your suite is only as trustworthy as the inputs it exercises and the oracles it uses. A suite with three strong tests knows three things; a suite with three weak tests knows almost nothing.

4. The “safety net” metaphor for testing means:

Tests should be run automatically on every commit, like a security camera scanning each code change
This describes continuous integration (running tests automatically on every commit) — that’s about the trigger, not the metaphor. The safety-net image is about what each passing test does for you: it stays in place to catch later breakage.
Each passing test stays in place to catch regressions — you can refactor and add features fearlessly because the net catches breakage
Right. Each passing test permanently guards that behavior against future regressions — that’s the safety net. If a future change breaks a guarded behavior, the test fails before users feel it.
You should aim for 100% test coverage as a safety baseline before each release
The safety-net image isn’t about a coverage number. A test suite with 100% line coverage but weak oracles is full of holes; a smaller suite with strong oracles on the critical paths is a tighter net. Coverage and the safety-net principle are different ideas (more in Step 4).
Tests should be written in a strict mechanical order: unit tests first, then integration, then end-to-end
This describes the testing pyramid (a way of layering test types), not the safety-net metaphor. The safety net is about passing tests as permanent guards against regression, regardless of what level of the pyramid they sit at.

The safety-net metaphor captures the key psychological benefit of testing: each passing test stays in place to catch regressions you might otherwise introduce. With a thick safety net of tests, you can refactor and add features fearlessly — if you break something, the net catches it before users do.

Choosing What to Test: Partitions & Boundaries

Why this matters

That streak_badge bug at exactly day 100 from Step 1 wasn’t random — it lived at a boundary, the value where one behavior turns into another. Bugs cluster at boundaries, so guessing inputs misses them. This step teaches you to find those boundary values systematically, before they ship.

🎯 You will learn to

Apply equivalence partitioning to divide a function’s input space into meaningful groups.
Analyze numeric specs to pinpoint the boundary values where off-by-one bugs hide.
Create your own pytest tests from scratch — test_ prefix, AAA shape, single assertion.

🔍 Retrieve first. Scan the three tests you inherited in Step 1 (test_streaks.py). Each test calls streak_badge and asserts something with ==. Notice the shape of each — same structure, different inputs. You’re about to write tests just like these.

📝 The shape of a pytest test

A pytest test is just a function whose name starts with test_, containing one or more plain assert statements. Here’s the shape on a different function so you can see the pattern without seeing today’s answer:

# The function under test (in some module):
def add(a: int, b: int) -> int:
    return a + b

# The pytest test for it:
def test_add_two_positives():
    assert add(2, 3) == 5

Three things to notice:

The test is just a regular function — no class, no boilerplate.
The body calls the function under test and asserts the expected return value with ==.
The test name reads like a one-line bug report (“add_two_positives FAILED” tells the next reader exactly what broke).

pytest convention: both the file name and function names must start with test_.

Every test has three parts — Arrange (set up inputs), Act (call the function), Assert (verify the result). For the boundary tests below, all three sit on a single line each: the input string is the Arrange, the call to squad_name_valid(...) is the Act, and is True / is False is the Assert.

💡 The principle: equivalence partitions and boundaries

An equivalence partition is a set of inputs that should behave the same. Boundaries are the values where partitions meet — and where most bugs live (remember the > 100 vs >= 100 streak bug from Step 1).

Today’s function: squad_name_valid(name) — checking if a Fortnite / Roblox / Discord squad name is the right length. Rule: 3 ≤ len ≤ 12 characters.

🔍 Before writing any code: Looking only at the spec (3 ≤ len ≤ 12), list the 4 input lengths you would test. Don’t run anything. For each one, write a single word explaining why this specific length matters more than its neighbor. Hold your list — check it against the disclosure below after writing your tests.

⚙️ Task (test_squad.py): Three worked tests are provided so you can see the pattern from multiple angles before writing your own. Read all three first, then write three more.

💬 Self-explain first (do this before writing): Read the three provided tests carefully. Why did the author pick length 5 for “valid representative”, 2 for “just below min”, and 12 for “boundary at max valid”? What is the same about all three tests, and what is different? Articulating both sides primes you to make your own.

Now write three more tests. The three stubs in the file name what each test must check; you decide the input string and the expected return value.

Test name	What partition or boundary it pins down
`test_boundary_min_valid`	the smallest length the spec says is valid
`test_too_long_just_above_max`	one length past the upper bound
`test_empty_string`	the empty string

For each, decide from the spec 3 ≤ len ≤ 12:

What concrete input string has the right length?
Should squad_name_valid return True or False for it? (Read the rule — don’t guess.)
Then write the assertion using the same is True / is False pattern as the worked examples.

💡 Strong oracles on a Boolean return: squad_name_valid returns True/False. assert squad_name_valid("epic") is True is strong (identity comparison — only True itself passes). assert squad_name_valid("epic") with no comparison is weak — 1, "yes", or any truthy value would slip through. (You’ll generalize this idea — strong vs. weak assertions — to any return type in Step 3.)

📖 Quick aside: is True vs == True

is checks object identity (same object in memory); == checks equality (same value). For Booleans these almost always agree, but is True is strictly stricter — only the literal True object passes. If a function were (incorrectly) refactored to return 1 or "yes" instead of True:

Assertion	Result
`assert result is True`	✗ fails — `1 is True` is `False`
`assert result == True`	✓ passes — `1 == True` is `True`
`assert result` (no comparison)	✓ passes — `1` is truthy

For a function whose contract says “returns a Boolean”, use is True / is False — the test then catches both wrong values and wrong types. (For non-Boolean returns, prefer == with the exact expected value — that’s Step 3.)

📐 Reveal — check your 4 input lengths (open AFTER you've written them)

The 4 critical lengths sit exactly where partitions transition:

flowchart LR
    L2["len 2<br/>❌ reject"]:::bad
    L3["len 3<br/>✅ accept"]:::good
    Mid["...middle of valid<br/>partition..."]:::neutral
    L12["len 12<br/>✅ accept"]:::good
    L13["len 13<br/>❌ reject"]:::bad
    L2 --> L3 --> Mid --> L12 --> L13
    classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef bad fill:#ffebee,stroke:#c62828,color:#b71c1c
    classDef neutral fill:#fafafa,stroke:#bdbdbd,color:#757575

Length	Expected	What this catches
2	reject	A `< 3` written as `<= 3` (off-by-one below)
3	accept	A `<= 3` written as `< 3`
12	accept	A `<= 12` written as `< 12`
13	reject	A `< 13` written as `<= 13` (off-by-one above)

The middle of the valid partition isn’t in the list — one representative there is enough. The same heuristic works for any numeric range: lengths, ages, prices, retry counts.

📖 Equivalence partitioning — the deeper “why”

The input space splits into three regions, each with the same expected behavior:

flowchart LR
    A["<b>too short</b><br/>len 0, 1, 2<br/>↦ reject"]:::bad
    B["<b>valid</b><br/>len 3 ... 12<br/>↦ accept"]:::good
    C["<b>too long</b><br/>len 13+<br/>↦ reject"]:::bad
    A --- B --- C
    classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef bad fill:#ffebee,stroke:#c62828,color:#b71c1c

If "a" (length 1) is rejected, "ab" (length 2) probably is too — same partition, same expected behavior. So one representative per partition is enough for the middle of the partition. Spend your test budget on the boundaries instead — that’s where > 12 vs >= 12 bugs hide.

Heuristic for any range [min, max]:

Partition the input space.
Pick one representative per partition.
Test every boundary — last invalid before each transition, first valid after.

📖 Test names ARE documentation

Notice that good test names describe the behavior they verify: test_valid_representative, test_boundary_max_valid, test_too_long_just_above_max. A failing test should read like a one-line bug report: “boundary_max_valid FAILED — assert False is True”. If you can read your test names without opening the code and still know what the suite covers, your tests double as documentation.

Anti-example: test_1, test_squad, test_works. These tell the next reader nothing.

📖 Why pytest beats raw `assert`

Raw assert halts at the first failure; you only learn about one bug at a time. pytest discovers all tests, runs them all, names each one, and shows the exact mismatched value when one fails — e.g. assert False is True. No classes, no boilerplate — just functions starting with test_.

🔗 Connect to your own code. Think of the last function you wrote before this tutorial. What inputs did you test it with? Apply the partition + boundary method: identify the partitions in that function’s input space and name at least one boundary you probably didn’t test. If you weren’t testing at all before this tutorial, name what your first test for that function would be.

🔭 Coming in Step 3: The is True / is False move you used here is one example of a strong oracle — an assertion that pins exactly the expected value. Step 3 generalizes this to any return type — strings, numbers, lists, dicts — and shows the three flavors of weak oracle that look productive but verify almost nothing.

Starter files

squad.py

def squad_name_valid(name: str) -> bool:
    """Return True if and only if len(name) is between 3 and 12 inclusive
    (typical gaming-platform username rule — Fortnite / Roblox / Discord-style)."""
    return 3 <= len(name) <= 12

test_squad.py

"""Partition & boundary tests for squad_name_valid.

Three worked examples are provided. Read them, see the pattern, then
write three more tests for the remaining boundaries and edges.
"""
import pytest
from squad import squad_name_valid


# --- Worked example 1: a representative valid input (middle of valid partition) ---
# `is True` is the strong-oracle form for a Boolean return — only `True` itself passes.
def test_valid_representative():
    assert squad_name_valid("ninja") is True    # length 5


# --- Worked example 2: just below the valid minimum (boundary at len == 2) ---
# This catches a `< 3` bug that the spec says should be `<= 3`.
def test_too_short_just_below_min():
    assert squad_name_valid("xs") is False      # length 2


# --- Worked example 3: at the upper boundary of the valid partition ---
# This catches a `< 12` bug that the spec says should be `<= 12`.
# NOTE: the spec says length 12 is VALID. Read it, don't guess.
def test_boundary_max_valid():
    assert squad_name_valid("epicgamerlol") is True   # length 12


# --- TODO 1: smallest length the spec calls valid ---
# Hint: the spec says `3 <= len <= 12`. What's the SMALLEST length that's valid?
# Pick any string of that length, then assert `is True`.
# def test_boundary_min_valid():
#     ...


# --- TODO 2: one length past the upper bound ---
# Hint: the partner of test_boundary_max_valid. The spec says length 12 is valid;
# what's the first length that should be REJECTED above it?
# def test_too_long_just_above_max():
#     ...


# --- TODO 3: the empty string ---
# Before writing: which partition does "" belong to? Is it a separate
# partition or the extreme of an existing one? Write your answer as a comment
# above the test, then assert the expected behavior.
# def test_empty_string():
#     ...

Step 2 — Knowledge Check

Min. score: 80%

1. Which file name will pytest automatically discover as a test file?

testing_squad.py
pytest discovers files starting with test_ (note the underscore) or ending with _test.py. testing_squad.py matches neither pattern — pytest will silently skip it.
test_squad.py
Correct. test_squad.py starts with test_ and ends with .py — it matches the discovery rule. Functions inside must also start with test_.
check_squad.py
There’s no convention check_* in pytest. The discovery rule is test_* prefix or *_test.py suffix only.
squad_tests.py
Close — but pytest looks for *_test.py (singular) or test_* prefix. squad_tests.py (plural “tests”) doesn’t match either.

pytest discovers files whose names start with test_ or end with _test.py. Functions inside must also start with test_.

2. A spec reads: “A discount applies for orders of strictly more than $50 and up to $500 inclusive.” Which four values are the most important boundaries to test?

$0, $50, $100, $500 — round numbers at each end of the range
$0 and $100 sit deep inside the no-discount and yes-discount partitions — they don’t pin down where the boundaries are. The spec defines the discount cutoffs at $50 (> flips) and $500 (≤ flips); your boundary tests must straddle those exact values.
$50, $51, $500, $501 — one on each side of each boundary (just-below and just-above)
Correct. $50/$51 straddles the > $50 flip and $500/$501 straddles the ≤ $500 flip — one just-below and one at-or-just-above each boundary, which is exactly where > vs >= bugs surface.
$1, $25, $250, $1000 — one sample from each quartile to get wide coverage
This is “spread your bets across the input range” — useful for a smoke test, useless for catching off-by-one bugs. Off-by-one bugs only manifest at the boundary itself; quartile samples will all behave correctly even when the comparison operator is wrong.
$49.99, $50.01, $499.99, $500.01 — cents around each boundary to check decimal precision
Decimal-precision boundaries are a separate class of bug (and rarely the dominant risk for whole-dollar specs). The primary off-by-one risk is > vs >= and <= vs <, which lives at the integer boundary. Test 50/51 and 500/501 first; reach for cents only if the spec actually distinguishes them.

Test just on each side of every boundary: 50/51 (where > 50 flips) and 500/501 (where ≤ 500 flips). Catches the canonical off-by-one (>= 50 vs > 50).

3. A developer’s test suite has 12 tests for squad_name_valid. Every test uses a name of length 5, 6, 7, or 8. All tests pass. Can you trust the suite?

Yes — 12 tests is plenty of coverage, and they all exercise the function with realistic inputs
Test count is not a measure of test quality. 12 tests in the same partition is roughly equivalent to one test repeated 12 times — they all hit the same code path and miss the same bugs. One boundary test catches more than ten middle-of-partition tests.
No — every test is in the middle of the same partition. Boundaries (2, 3, 12, 13) and invalid inputs are untested, so off-by-one bugs would slip through
Right. All 12 tests cluster in the middle of the valid partition (lengths 5–8). The boundaries at 2, 3, 12, and 13 are entirely untested — a len < 12 bug instead of len <= 12 would pass every test and slip to production.
Yes — as long as the tests pass, the function is guaranteed correct for all inputs in the tested range
Passing tests prove the function works for the inputs tested, never for inputs the suite never exercises. The classic boundary bug at length 12 (len < 12 instead of len <= 12) sits exactly in the gap this suite leaves uncovered.
No — 12 tests is too few for any real function; you need at least 50 tests to be confident
There’s no magic test-count threshold for confidence. A 5-test suite with one test per partition + boundary will catch more bugs than a 50-test suite that all clusters in the middle. Quality of input choice beats quantity.

The tests cover the middle of one partition only. A bug like len < 12 instead of len <= 12 would pass all 12 tests but fail in production at length 12. One test per boundary catches far more bugs than many tests clustered in the easy middle.

4. An equivalence partition is:

A set of inputs that the function processes in parallel on multi-core hardware
This is conflating partition (a testing concept — a group of inputs with the same expected behavior) with parallelism (a runtime concept — work happening simultaneously). The two ideas have nothing in common; the word “partition” is overloaded across CS subfields.
A set of inputs that should all produce the same kind of behavior, so one representative usually suffices for the middle of the set
Correct. Inputs that all produce the same kind of behavior form one equivalence partition — testing one representative per partition is enough for the middle, and you spend your test budget on the boundaries between partitions.
A set of inputs that are mathematically equal (e.g., 1.0 and 1)
Mathematical equality (1.0 == 1) is about value, not about behavioral expectation. Equivalence partitioning groups inputs that should behave the same under the function — different values can still be in the same partition (e.g. "alice" and "bob" are both length-5 valid usernames).
A sub-range of inputs created by dividing the domain into equal-sized chunks
This describes equal-width binning (a statistics concept). Equivalence partitions are defined by behavioral boundary — wherever the function’s output classification flips — not by chopping the domain into uniform chunks.

Equivalence partitions group inputs by expected behavior, not by value. If "abc" is accepted, "abcd" almost certainly is too — same partition. Spend your test budget on the boundaries between partitions, not the middles.

5. (Spaced review — Step 1) Recall the bug in streak_badge: days > 100 instead of days >= 100. Classify this bug.

A type error — days should be a float but is being compared as an integer
The type is not the issue here: days is compared as an integer in both the spec (>= 100) and the buggy code (> 100). The bug is in the comparison operator.
A boundary bug — it fails only at one exact value (100) on the edge between two partitions
Correct. The bug only surfaces at exactly days == 100 — the single boundary value where the > 100 operator gives the wrong answer. That one-specific-value signature is the definition of a boundary bug.
A logic error — the wrong branch of the if statement is being taken
It’s tempting to call any wrong-if bug a logic error, but that’s too coarse. Boundary bug is the more precise classification: this bug only manifests at one specific value (100), which is the diagnostic signature of a > vs >= mistake. That precision is what this step is about.
A regression — the function worked correctly before someone introduced the bug
Regression describes how a bug enters a codebase (re-introducing something previously fixed), not what kind of bug it is. The bug here might or might not be a regression — but its type is a boundary bug.

Boundary bugs manifest only at the exact value where partitions meet. days > 100 works for 101, 150, 365 — but fails at 100 itself. This is exactly the kind of bug that Boundary Value Analysis is designed to expose, and it is why you test values just on each side of every boundary in a spec.

Oracle Strength: Strong, Weak, and the Liar Test

Why this matters

In Step 2 you wrote assert squad_name_valid("epic") is True. That’s a strong oracle on a Boolean: only the True singleton satisfies it, so any wrong return — False, 1, "yes" — fails the test. For richer return types (numbers, strings, lists, dicts), it’s much easier to write an assertion that looks productive but lets wrong answers slip through. This step makes the difference between strong and deceptively weak oracles concrete.

🎯 You will learn to

Analyze an assertion to spot the three weak-oracle anti-patterns: presence, type, and single-field.
Apply the strong-oracle form (assert result == <exact expected value>) to any return type so wrong values fail loudly.
Evaluate whether a passing test actually verifies the spec or merely looks like it does.

Today’s function returns something richer than a Boolean — a dict. Open loot.py and read the spec. build_loot_card(name, qty, rarity) returns a five-field dict: name, qty, rarity, label, is_rare. The test surface is bigger now — and that’s exactly where weak oracles get tempting.

🔍 Predict first. Open test_loot.py. Three tests are written and all three pass against the current code. Don’t run them yet. For each one, ask: “If a bug made build_loot_card return a slightly wrong dict, would this assertion catch it?” Hold your three answers — you’ll check them against the table below.

📖 Oracle strength — three flavors of weak

The oracle is the assertion that decides pass/fail. The same function call can be checked at very different strengths. Watch the same input — build_loot_card("Healing Potion", 3, "common") — under four assertions:

Strength	Assertion	What still passes (i.e., what it misses)
Weak — presence	`assert "name" in result`	Any dict with a `name` key. `{"name": "Wrong Name", ...}` passes.
Weak — type	`assert isinstance(result, dict)`	Any dict whatsoever. `{}` passes.
Weak — single-field	`assert result["is_rare"] is False`	The other four fields could all be wrong.
Strong — full equality	`assert result == {"name": "Healing Potion", "qty": 3, "rarity": "common", "label": "3× Common Healing Potion", "is_rare": False}`	Only the exact spec-mandated dict satisfies it.

Each weak form is satisfying to write — the test reports PASS — and each verifies almost nothing. That’s the Liar test anti-pattern: an assertion that looks like a test but lies about how thoroughly the function was checked. Rushed engineers and AI assistants gravitate to weak oracles because they almost always pass. The cost shows up later, when a real bug ships and the passing test couldn’t have caught it.

Notice what the table holds constant: same function, same inputs. Only the assertion varies. That’s the dimension you’re learning here — and it lives independently of which inputs to pick (Step 2’s lesson). A great test gets both right.

⚙️ Task — strengthen the three weak oracles (file: test_loot.py):

Each test starts with a different flavor of weak oracle. Your job for each:

Read the spec in loot.py — the docstring lists the five fields and the rule for each.
Compute what the dict should be for the test’s specific inputs (compute label and is_rare yourself from the rule).
Replace the weak assertion with assert result == { ... } pinning all five spec-mandated fields.

💬 Required: Above each new strong oracle, add a Python comment in this form:

# Weak version (___) would also pass for: ___

Name the flavor of the original weak oracle (presence / type / single-field) and a specific wrong dict the weak oracle would have accepted. This forces the Liar-test pattern into your hands — you can’t write the comment without seeing what the weak form misses.

🧠 Why a *dict* makes the contrast visible (and an int doesn't)

Imagine the function returned a single integer — say 3. The weak forms are still definable (assert result is not None, assert isinstance(result, int)), but the strong form (assert result == 3) feels trivial: of course you write the answer.

A dict has structure. The output has five fields, each with its own correctness condition. That structure is what makes weak oracles tempting and deceptive: an assert "name" in result looks like real testing — there’s a key reference, a substantive-looking check — but it accepts thousands of different wrong dicts. The richer the return type, the more disciplined the oracle has to be. Dicts, lists, and formatted strings are where weak oracles do the most damage in real codebases.

📖 Why pytest beats raw assert

Raw assert halts at the first failure; you only learn about one bug at a time. pytest discovers all tests, runs them all, names each one, and shows the exact mismatched value when one fails — e.g. assert {...} == {...}, with the differing keys highlighted. For a dict-returning function, that diff is gold: you immediately see which field is wrong, which is far more debuggable than a generic AssertionError.

🔭 Coming in Step 4: Strong oracles beat weak ones — but is the strongest possible oracle always the right answer? You’ll see what happens when “I pinned the entire output” goes a step too far, and how the right oracle sits exactly on the spec, no less and no more.

Starter files

loot.py

"""Loot card generator — Diablo / Borderlands / Genshin Impact style."""


def build_loot_card(name: str, qty: int, rarity: str) -> dict:
    """Create the inventory card for a piece of loot.

    Spec (the public contract — what callers can rely on):
      name    -> the input name, unchanged
      qty     -> the input qty, unchanged
      rarity  -> the input rarity, lowercased
      label   -> "{qty}× {Rarity-capitalized} {name}"
      is_rare -> True if and only if rarity is "rare", "epic", or "legendary"
    """
    normalized = rarity.lower()
    return {
        "name": name,
        "qty": qty,
        "rarity": normalized,
        "label": f"{qty}× {rarity.capitalize()} {name}",
        "is_rare": normalized in {"rare", "epic", "legendary"},
    }

test_loot.py

"""Tests for build_loot_card — three tests, three flavors of WEAK oracle.

Each test calls build_loot_card(...) with specific inputs and currently
PASSES. Each starts with a different flavor of weak oracle that lets
wrong implementations slip through. Your job: rewrite each as a STRONG
oracle that pins all five spec-mandated fields with `==`.

The spec is in loot.py.
"""
import pytest
from loot import build_loot_card


def test_common_potion_card():
    result = build_loot_card("Healing Potion", 3, "common")
    # WEAK ORACLE — flavor: PRESENCE.
    # This passes for any dict that has a `name` key — including
    # {"name": "Wrong Name", "qty": 0, ...}. It verifies almost nothing.
    # TODO: replace with `assert result == { ... }` pinning all 5 fields.
    # TODO (required): add a comment above the new assert in this form:
    #   # Weak version (presence) would also pass for: <a specific wrong dict>
    assert "name" in result


def test_rare_sword_card():
    result = build_loot_card("Vorpal Sword", 1, "rare")
    # WEAK ORACLE — flavor: TYPE.
    # Any dict at all passes this — including {} or a totally wrong dict.
    # TODO: replace with `assert result == { ... }` pinning all 5 fields.
    # TODO (required): add a comment above the new assert in this form:
    #   # Weak version (type) would also pass for: <a specific wrong dict>
    assert isinstance(result, dict)


def test_legendary_drop_card():
    result = build_loot_card("Excalibur", 1, "legendary")
    # WEAK ORACLE — flavor: SINGLE-FIELD.
    # The other four fields could all be wrong and this still passes.
    # TODO: replace with `assert result == { ... }` pinning all 5 fields.
    # TODO (required): add a comment above the new assert in this form:
    #   # Weak version (single-field) would also pass for: <a specific wrong dict>
    assert result["is_rare"] is True

Solution

test_loot.py

"""Tests for build_loot_card — strong oracles."""
import pytest
from loot import build_loot_card


def test_common_potion_card():
    result = build_loot_card("Healing Potion", 3, "common")
    # Weak version (presence) would also pass for: {"name": "Wrong", "qty": 0, "rarity": "wrong", "label": "wrong", "is_rare": True}
    assert result == {
        "name": "Healing Potion",
        "qty": 3,
        "rarity": "common",
        "label": "3× Common Healing Potion",
        "is_rare": False,
    }


def test_rare_sword_card():
    result = build_loot_card("Vorpal Sword", 1, "rare")
    # Weak version (type) would also pass for: {} or {"anything": "at all"}
    assert result == {
        "name": "Vorpal Sword",
        "qty": 1,
        "rarity": "rare",
        "label": "1× Rare Vorpal Sword",
        "is_rare": True,
    }


def test_legendary_drop_card():
    result = build_loot_card("Excalibur", 1, "legendary")
    # Weak version (single-field) would also pass for: {"name": "wrong", "qty": 99, "rarity": "wrong", "label": "wrong", "is_rare": True}
    assert result == {
        "name": "Excalibur",
        "qty": 1,
        "rarity": "legendary",
        "label": "1× Legendary Excalibur",
        "is_rare": True,
    }

Each weak oracle was a different flavor of Liar test:

presence: "name" in result — passes for any dict with a name key
type: isinstance(result, dict) — passes for any dict whatsoever
single-field: result["is_rare"] is True — passes if 4 of 5 fields are wrong The strong form pins the entire spec-mandated dict, so any wrong field fails the test. (Coming in Step 4: a tension. Full-dict equality is the right answer when the spec and the implementation match exactly — but it can over-specify when the implementation evolves. Step 4 shows the upper bound.)

Step 3 — Knowledge Check

Min. score: 80%

1. Two tests check the same build_loot_card call. Which assertion is strongest (most likely to catch a bug)? A.

def test_a():
    result = build_loot_card("Healing Potion", 3, "common")
    assert "name" in result

def test_b():
    result = build_loot_card("Healing Potion", 3, "common")
    assert result == {
        "name": "Healing Potion", "qty": 3, "rarity": "common",
        "label": "3× Common Healing Potion", "is_rare": False,
    }

A — checking presence is more flexible because it doesn’t break on output changes
Flexibility-to-future-changes is exactly the wrong success criterion. A should fail when qty, rarity, label, or is_rare is wrong — but it doesn’t, because the assertion only references name. The reason A ‘survives’ is that it’s verifying almost nothing, not that it’s well-designed.
B — it pins every spec-mandated field. A passes for any dict with a name key, including completely wrong contents
Right. B is the strong oracle: only the exact spec-mandated dict satisfies it. A is the presence weak oracle: any dict with a name key — including {"name": "Wrong", "qty": 0, ...} — passes. Liar-test territory.
Both are equally strong — both succeed for the correct call
Two assertions on the same call can differ wildly in bug-catching power. A passes for thousands of wrong dicts; B passes for exactly one (the correct one). ‘Both succeed for the correct call’ just means both don’t false-positive — it doesn’t mean both catch bugs.
Neither — strong oracles always need multiple assert statements, not a single ==
Strong oracles aren’t about count of assertions — they’re about precision. One precise assert result == {...} pins the entire output exactly. Adding more checks doesn’t strengthen the oracle; it just adds noise.

B is a strong oracle: only the exact expected dict satisfies it. A is the presence weak oracle — any dict with a name key passes, including ones where qty, rarity, label, and is_rare are all wrong.

2. For build_loot_card("Excalibur", 1, "legendary"), which assertion is the weakest (catches the fewest bugs)?

assert isinstance(result, dict)
Right. isinstance(result, dict) is the type weak oracle — any dict at all passes, including {} and dicts with completely wrong values. It’s the canonical Liar test for dict-returning functions.
assert result["is_rare"] is True
This is the single-field weak oracle — better than isinstance (it at least checks one specific field), but the other four fields could all be wrong. Stronger than option 0, weaker than option 3.
assert result == { ... } (full dict, all 5 fields)
Full-dict equality is the strong oracle — only the exact spec-mandated dict satisfies it. (Step 4 will show the ceiling on this — but for now, this is the strongest of the four options.)
assert result["name"] == "Excalibur" and result["qty"] == 1
Two-field == is stronger than isinstance (which checks no values at all). It’s still weaker than full-dict equality because it lets rarity, label, and is_rare be wrong — but it’s not the weakest of the four.

isinstance(result, dict) accepts any dict whatsoever — {} passes, a totally wrong dict passes. This is the canonical type weak oracle. The other options check at least one value; only isinstance checks no values at all.

3. (Spaced review — Step 2) A teammate writes assert squad_name_valid("epic") (no is True). The test passes against the current code. Why is this oracle still weak?

It is not weak — the test passes, so the oracle is doing its job
Currently passing is a weaker claim than correctly testing. A bare assert squad_name_valid("epic") passes today only because squad_name_valid happens to return True. The day someone refactors it to return the name string "epic" (still truthy!), the oracle will still pass — but now it’s verifying nothing meaningful. Same Liar-test family as the dict checks above.
It is weak because it would also pass for any truthy return — 1, "yes", or even [1, 2] would all satisfy it. Use is True to pin the exact Boolean
Right. A bare assert <expr> passes for any truthy return — 1, "epic", [True] would all pass, even though none of them are True. is True uses identity comparison and accepts only the True singleton. It’s the Boolean-return version of full-dict equality: pin the exact value.
It is weak because pytest assertions need an explicit message string after the comma
pytest assertion messages are optional. The oracle’s strength has nothing to do with whether you supplied a failure message; it has everything to do with which return values would fail the assertion. Bare assert <expr> is satisfied by every truthy value, which is the weakness here.
It is weak because the input string is too short to be representative
The input length isn’t the issue — "epic" is a perfectly valid length-4 input. The oracle weakness is structural: assert <expr> (no comparison) accepts any truthy return value, regardless of the input.

A bare assert <expr> succeeds for any truthy value — same Liar-test family as the dict checks. is True (identity comparison) is the strong form for Boolean returns; == {full dict} is the strong form for dicts. Same lesson, different shapes.

4. (Spaced review — Step 1) A passing test stays in the suite. Why?

Symmetry — every code change should ship with a matching test addition or removal
There’s no symmetry rule. The purpose of a test is to guard a specific behavior; the question is whether that behavior is still worth guarding (it almost always is). Symmetry between code changes and test deletions isn’t a real principle.
Each passing test is a permanent guard against regressions: if a future change re-introduces the bug, the test fails before users see it
Right. The safety-net idea from Step 1: each passing test stays in place to catch the regression no one anticipated. Delete the test and you cut a hole in the net at exactly the spot a future bug would land.
Removing tests after a feature ships keeps the suite small and fast
Speed is sometimes a reason to split test suites (fast unit tests vs. slow integration), but never a reason to delete a passing test that guards real behavior. The cost of running it is tiny; the cost of a regression slipping through is large.
Tests get noisy over time and need to be deleted to keep the signal clean
Test noise is a real problem, but the fix is usually to strengthen weak oracles or remove duplicates — not to delete passing tests for shipped behavior. A test guarding shipped behavior is the opposite of noise.

The safety-net principle: each passing test permanently guards its behavior. Deleting it cuts a hole in the net. If someone later re-introduces the bug, there’s nothing to catch it.

Test Behavior, Not Implementation

Why this matters

Step 3 said: strong oracles beat weak ones — pin the exact value. That’s true, but only up to a ceiling: the spec. Going below the spec is a weak oracle (Step 3’s lesson). Going above it — asserting on things the spec doesn’t mandate — is the over-specification trap, and it produces tests that break during clean refactors. The cure is to assert on exactly what the spec says, no more, no less.

🎯 You will learn to

Analyze a test for two species of “above the spec” — internal coupling (peeking at private state) and over-specification (pinning unmandated output fields).
Apply the Refactoring Litmus Test: a pure refactor with unchanged behavior should never break a well-written test.
Evaluate test smells like Excessive Setup as feedback on the production design, not as a problem to hide in a helper.

This step covers both halves of “above the spec”:

(a) Internal coupling — the test peeks at private state (obj._tracks). A pure rename of the internal attribute breaks the test even though no observable behavior changed.
(b) Over-specification — the test pins output fields the spec doesn’t mandate (e.g., a full-dict equality that includes a created_at timestamp the spec never promised). Adding a new internal-but-public field breaks the test even though every spec-mandated field is still correct.

Both are species of the same disease: tests verifying the implementation rather than the contract. The cure is the same: assert on exactly what the spec says, no more, no less.

Part A — Internal coupling (the rename experiment)

⚙️ Task (test_brittle_audit.py): Four tests for a PlaylistQueue (think Spotify / Apple Music queue). All four currently pass. You’ll discover which are brittle (break on pure refactoring even when behavior is unchanged) and which are robust (survive any refactoring that preserves the public behavior).

Read the four tests in test_brittle_audit.py. Before running anything: classify each test — does it access internal state (looks inside the object) or only the public interface (calls methods that don’t start with _)? Write your classification as a comment next to each test.
Run the suite as-is — all four tests pass. Good. Now do the experiment:
Refactor the production code without changing behavior: in playlist.py, rename the private attribute self._tracks to self._queue (everywhere — the constructor and the five methods). There are exactly 6 occurrences; use find/replace to catch all of them. The class’s public behavior is unchanged: add, total_duration, track_count, titles, durations still produce the same outputs.
Before re-running: predict how many tests will fail and which ones.
Re-run the suite. The tests that fail are brittle — they coupled to the implementation detail (the attribute name). The ones that survived only touched the public API. Compare to your prediction. Whether you were right or wrong: write one sentence tracing the causal chain — from “I renamed _tracks” to “exactly these tests fail.” The explanation should work without running the code.
Rewrite each broken test using only the public API — methods that don’t start with _. The public surface of PlaylistQueue is: add, track_count(), titles(), durations(), total_duration. Anything starting with _ is internal and off-limits to tests. When all four pass against the refactored code, your suite is robust.

📦 Two Python tools used in this step: `@dataclass` and `@property`

@dataclass — auto-generated value objects

playlist.py stores each track as a Track instance declared with @dataclass(frozen=True):

from dataclasses import dataclass

@dataclass(frozen=True)
class Track:
    title: str
    duration_seconds: int

@dataclass reads the annotated fields and auto-generates __init__, __repr__, and __eq__. frozen=True makes instances immutable — a Track can’t have its title changed after creation, and two Tracks with identical fields compare equal with == out of the box.

Without @dataclass you’d write all this by hand:

class Track:
    def __init__(self, title: str, duration_seconds: int) -> None:
        self.title = title
        self.duration_seconds = duration_seconds
    def __eq__(self, other): ...
    def __repr__(self): ...

Same result, far more boilerplate.

@property — a method that reads like an attribute

PlaylistQueue.total_duration is declared with @property:

@property
def total_duration(self) -> int:
    return sum(t.duration_seconds for t in self._tracks)

Because of @property, callers write queue.total_duration (no parentheses) instead of queue.total_duration(). Use @property for derived values — ones that are computed from stored state rather than stored themselves — that read naturally as a noun.

Contrast with track_count(), titles(), and durations(), which are regular methods. Rule of thumb: if the value feels like a fixed attribute of the object (total duration is a property of the queue’s current state), make it a @property. If it feels like an action or a lookup with side effects, keep it a method.

You’ll see @dataclass and @property again in the TDD tutorial — where ScoringEvent, BattleReport, and total_damage follow the same patterns.

💡 Why this matters: When a test only touches the public API, the production code stays free to evolve internally. The experiment you just ran is a live demonstration of the Refactoring Litmus Test (expand below to name what you discovered).

💡 This principle extends beyond classes. For top-level functions: the “public contract” is the return value. Don’t assert on intermediate variables or module-level state the function happens to touch internally — those are implementation details too, just without the _ prefix signal. Assert on what callers observe: the return value.

🔬 The Refactoring Litmus Test — name what you just discovered

If you refactor the internals of a function and all tests still pass → your tests are robust. If tests break after a pure refactoring (no behavior change) → they’re testing implementation.

That breakage is the symptom; the fix is to rewrite the tests, not to revert the refactor.

Both types of test were checking the same observable behavior: the track was added. They differed only in how they verified it. The brittle test peeked at implementation details (_tracks[0].title). The robust test used the public interface (titles()). Compare that to this pair:

# 🚨 BRITTLE — peeks at private state
assert board._scores[0] == ("alice", 1000)

# ✅ ROBUST — uses the public API
assert board.top_player() == "alice"

The brittle version breaks the moment _scores is renamed, restructured, or replaced — even if the top-player behavior is unchanged. The robust version only breaks when the behavior itself changes — which is exactly when you want it to fail.

📊 What the experiment reveals — expand after completing step 5

The rename changed the implementation but not the public behavior, yet only the robust tests survive:

flowchart TB
    subgraph before["BEFORE — all tests pass"]
        direction LR
        b1["Brittle test<br/>queue._tracks[0].title"]:::brittle
        b2["Robust test<br/>queue.titles()"]:::robust
        b1 --> bp1["✓"]:::good
        b2 --> bp2["✓"]:::good
    end
    subgraph after["AFTER — _tracks renamed to _queue"]
        direction LR
        a1["Brittle test<br/>queue._tracks[0].title"]:::brittle
        a2["Robust test<br/>queue.titles()"]:::robust
        a1 --> ap1["✗ AttributeError"]:::bad
        a2 --> ap2["✓ still passes"]:::good
    end
    before --> after
    classDef brittle fill:#fff3e0,stroke:#e65100,color:#bf360c
    classDef robust fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef bad fill:#ffebee,stroke:#c62828,color:#b71c1c

📖 Arrange-Act-Assert (AAA) — the structure of a clean test

def test_total_duration_sums_track_lengths():
    # Arrange — set up the world
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    queue.add("Vampire", 218)

    # Act — read the ONE derived value under test
    result = queue.total_duration   # property — no ()

    # Assert — verify the observable outcome
    assert result == 393

Every robust test fits this shape. If you can’t separate Arrange from Act cleanly, the function under test is doing too much.

🚩 When Arrange dominates — the Excessive Setup smell

You just learned the AAA shape. The size of each section is itself a signal — and the Arrange section is the loudest.

Here’s a test that compiles, runs, and passes. Read it, then ask: what’s wrong?

def test_checkout_succeeds_for_valid_card():
    # Arrange — 22 lines
    db = InMemoryDatabase(); db.connect()
    user = User(id=1, name="Alex", email="a@x.io")
    db.users.insert(user)
    address = Address(user_id=1, line1="221B Baker St", country="UK")
    db.addresses.insert(address)
    card = Card(user_id=1, last4="4242", expiry="12/30")
    db.cards.insert(card)
    cart = Cart(user_id=1); db.carts.insert(cart)
    item = Item(sku="A1", name="Vinyl", price=20.0)
    db.items.insert(item); cart.add(item)
    tax_service = FakeTaxService(rate=0.08)
    payment_gateway = StubGateway(approves=True)
    email_service = NullEmailService()
    audit_log = InMemoryAuditLog()
    fraud_check = AlwaysPassFraudCheck()
    inventory = StubInventory(in_stock=True)
    feature_flags = FlagSet(enable_new_taxes=False)

    # Act — 1 line
    result = checkout(user.id, payment_gateway, tax_service, email_service,
                      audit_log, fraud_check, inventory, feature_flags)

    # Assert — 1 line
    assert result.status == "ok"

The Assert is fine. The Act is a single call. The Arrange is the problem — eight collaborators stubbed and three database tables seeded just to verify one outcome.

This is the Excessive Setup smell. Every dependency checkout reaches forces a corresponding fixture. Whenever you find yourself building elaborate scaffolding before you can call the function under test, the test is telling you something — but it isn’t telling you to write better tests. It’s telling you to fix the production code.

🪞 Tests are also a design tool, not just a verifier. A bloated Arrange section is the production code asking for refactoring. Your test file is a mirror — its size, shape, and friction reflect the design choices on the other side.

The wrong reflex is to hide the setup in a setup_world() helper. The lines disappear from the test file but the coupling stays. Now the smell is invisible, which is worse than visible — the next engineer never sees the warning sign.

The right reflex is to listen. checkout is doing too much. Split it: a compute_total(cart, tax) that needs two collaborators, a charge(payment_gateway, total) that needs one, plus a thin orchestrator. Each piece is then testable with a 2-line Arrange:

def test_total_includes_tax():
    # Arrange
    cart = Cart(items=[Item(price=20.0)])
    tax = FakeTaxService(rate=0.08)

    # Act
    total = compute_total(cart, tax)

    # Assert
    assert total == 21.60

Same domain. Same kind of assertion. Different production design — and the test difficulty plummets.

✍️ Active prompt (write your answer before reading on): a teammate’s PR adds a test with 40 lines of Arrange before a single assert. Do you (a) approve it because the assertion is correct, (b) ask them to extract a setup_world() helper, or (c) push back on the production code changes that drove the dependency explosion? Hold your answer — the wrap-up quiz revisits exactly this scenario.

Part B — Over-specification (the upper bound of oracle strength)

In Step 3 you wrote assert result == {full dict} to make the oracle as strong as possible. That was right for that spec. Now watch what happens when the implementation grows a new output field that the spec never mentioned.

The same build_loot_card(name, qty, rarity) from Step 3 is back in loot.py — but the production team has added a created_at timestamp to the returned dict for analytics. The spec hasn’t changed. Every field a caller relies on is still computed correctly. But the test from Step 3 — written with full-dict equality — now fails:

# Step-3-style test (full dict equality):
def test_legendary_drop():
    result = build_loot_card("Excalibur", 1, "legendary")
    assert result == {
        "name": "Excalibur", "qty": 1, "rarity": "legendary",
        "label": "1× Legendary Excalibur", "is_rare": True,
    }
# ✗ FAILS — result now also has "created_at": 1730000000

The assertion was too strong. It pinned the entire output, including fields the spec never promised. That extra precision is the over-specification trap: the test breaks during clean refactors that don’t change observable behavior.

⚙️ Task (test_loot_overspec.py): Two tests use full-dict equality. Run them — they fail against the new build_loot_card even though every spec-mandated field is correct. Rewrite each test to assert on exactly the spec-mandated fields (name, qty, rarity, label, is_rare) and not on created_at. When the same refactor (adding a new field) ships next month, your suite stays green.

💡 The rule of thumb: re-read the spec. List the fields it explicitly mandates. Assert on each one with ==. Don’t full-equality the whole dict unless the spec promises exactly that shape and nothing else — and most specs don’t.

📐 The rule of "no less, no more" — visualized

flowchart TB
    spec["✅ THE SPEC<br/>(what callers can rely on)"]:::good
    weak["❌ Weak oracle<br/>(asserts LESS than the spec)<br/>misses real bugs"]:::bad
    strong["✅ Right oracle<br/>(asserts EXACTLY the spec)<br/>catches real bugs, survives refactors"]:::good
    overspec["❌ Over-specified oracle<br/>(asserts MORE than the spec —<br/>private state OR unmandated fields)<br/>breaks on clean refactors"]:::bad
    weak --- strong --- overspec
    classDef good fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
    classDef bad fill:#ffebee,stroke:#c62828,color:#b71c1c

“Strong” isn’t a one-way arrow. The right oracle sits exactly on the spec — anything beyond it is just as harmful as anything below it.

🎓 Coverage ≠ quality

Suite A — 100% line coverage, weak oracle:

def test_total_duration_runs():
    q = PlaylistQueue(); q.add("Espresso", 175); q.add("Vampire", 218)
    assert q.total_duration is not None   # passes for any non-None return

Suite B — 80% coverage, strong oracle:

def test_total_duration_sums_track_lengths():
    q = PlaylistQueue(); q.add("Espresso", 175); q.add("Vampire", 218)
    assert q.total_duration == 393

If a bug makes total_duration() return 0, Suite A still passes (0 is not None). Suite B catches it. Coverage measures which lines ran, not whether you checked their behavior. The same logic explains why Step 4’s brittle tests passed before the rename: running the assertion is not the same as verifying the right thing.

Starter files

playlist.py

from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    title: str
    duration_seconds: int


class PlaylistQueue:
    """A Spotify/Apple-Music-style queue: add tracks, ask for total duration."""

    def __init__(self) -> None:
        self._tracks: list[Track] = []

    def add(self, title: str, duration_seconds: int) -> None:
        self._tracks.append(Track(title, duration_seconds))

    @property
    def total_duration(self) -> int:
        return sum(t.duration_seconds for t in self._tracks)

    def track_count(self) -> int:
        return len(self._tracks)

    def titles(self) -> list[str]:
        return [t.title for t in self._tracks]

    def durations(self) -> tuple[int, ...]:
        """Public, ordered, immutable view of per-track durations (seconds)."""
        return tuple(t.duration_seconds for t in self._tracks)

test_brittle_audit.py

"""AUDIT: All four tests pass. Two are brittle — discover which by
renaming `_tracks` to `_queue` in playlist.py and re-running."""
import pytest
from playlist import PlaylistQueue


def test_add_track_updates_count():
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    assert queue.track_count() == 1


def test_add_track_internal_list():
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    assert queue._tracks[0].title == "Espresso"
    assert queue._tracks[0].duration_seconds == 175


def test_total_duration_sums_track_lengths():
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    queue.add("Vampire", 218)
    assert queue.total_duration == 393


def test_internal_list_length():
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    queue.add("Vampire", 218)
    assert len(queue._tracks) == 2

loot.py

"""Loot card generator — same function as Step 3, but the
implementation has been extended with a `created_at` analytics field.

The SPEC has not changed: callers rely on name, qty, rarity, label,
and is_rare. The new `created_at` is internal — it exists for
analytics and is NOT part of the public contract.
"""
import time


def build_loot_card(name: str, qty: int, rarity: str) -> dict:
    """Create the inventory card for a piece of loot.

    Spec (the public contract — what callers rely on):
      name    -> the input name
      qty     -> the input qty
      rarity  -> the input rarity, lowercased
      label   -> "{qty}× {Rarity-capitalized} {name}"
      is_rare -> True if and only if rarity is "rare", "epic", or "legendary"

    The returned dict ALSO carries a `created_at` field for
    analytics. That field is NOT part of the spec — its presence
    and value are implementation details and must not be asserted on.
    """
    normalized = rarity.lower()
    return {
        "name": name,
        "qty": qty,
        "rarity": normalized,
        "label": f"{qty}× {rarity.capitalize()} {name}",
        "is_rare": normalized in {"rare", "epic", "legendary"},
        "created_at": int(time.time()),
    }

test_loot_overspec.py

"""OVER-SPECIFICATION AUDIT: these two tests over-specify the output.

Each one full-equality-checks the entire returned dict, including
the `created_at` analytics field that the spec never promised. As a
result both tests FAIL against the current `build_loot_card` — even
though every spec-mandated field is correct.

Your job: rewrite each test to assert on EXACTLY the spec-mandated
fields (name, qty, rarity, label, is_rare) and NOT on `created_at`.
When the implementation evolves (timestamps change every second),
your tests must still go green.
"""
import pytest
from loot import build_loot_card


def test_common_potion_has_correct_card():
    result = build_loot_card("Healing Potion", 3, "common")
    # OVER-SPECIFIED — full-equality pins `created_at` (not in spec).
    # TODO: rewrite as field-by-field assertions on spec-mandated keys.
    assert result == {
        "name": "Healing Potion",
        "qty": 3,
        "rarity": "common",
        "label": "3× Common Healing Potion",
        "is_rare": False,
    }


def test_legendary_drop_has_correct_card():
    result = build_loot_card("Excalibur", 1, "legendary")
    # OVER-SPECIFIED — same problem as above.
    # TODO: rewrite as field-by-field assertions on spec-mandated keys.
    assert result == {
        "name": "Excalibur",
        "qty": 1,
        "rarity": "legendary",
        "label": "1× Legendary Excalibur",
        "is_rare": True,
    }

Solution

test_brittle_audit.py

"""AUDIT: Fixed brittle tests — behavior not implementation."""
import pytest
from playlist import PlaylistQueue


def test_add_track_updates_count():
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    assert queue.track_count() == 1


def test_add_track_via_public_api():
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    assert "Espresso" in queue.titles()
    assert queue.durations()[0] == 175


def test_total_duration_sums_track_lengths():
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    queue.add("Vampire", 218)
    assert queue.total_duration == 393


def test_track_count_via_public_api():
    queue = PlaylistQueue()
    queue.add("Espresso", 175)
    queue.add("Vampire", 218)
    assert queue.track_count() == 2

test_loot_overspec.py

"""OVER-SPECIFICATION AUDIT — solved."""
import pytest
from loot import build_loot_card


def test_common_potion_has_correct_card():
    result = build_loot_card("Healing Potion", 3, "common")
    # Assert ONLY on the spec-mandated fields — anything outside
    # the spec is an implementation detail and must not be pinned.
    assert result["name"] == "Healing Potion"
    assert result["qty"] == 3
    assert result["rarity"] == "common"
    assert result["label"] == "3× Common Healing Potion"
    assert result["is_rare"] is False


def test_legendary_drop_has_correct_card():
    result = build_loot_card("Excalibur", 1, "legendary")
    assert result["name"] == "Excalibur"
    assert result["qty"] == 1
    assert result["rarity"] == "legendary"
    assert result["label"] == "1× Legendary Excalibur"
    assert result["is_rare"] is True

Two fixes, one shared lesson — test the spec, no more, no less.

Part A: replace direct ._tracks access with public API calls (titles(), durations(), track_count()). The duration assertion still holds — but now via durations()[0] instead of _tracks[0].duration_seconds, so the rename experiment leaves it green.

Part B: replace full-dict equality with field-by-field equality on the spec-mandated fields only. created_at is in the returned dict but NOT in the spec, so we don’t pin it — and the test stays green every time created_at changes (every second, in fact).

Step 4 — Knowledge Check

Min. score: 80%

1. A UserProfile class stores user data. Two tests check that a name is stored after creation. Which test is more robust? Test A:

def test_a():
    profile = UserProfile("alice")
    assert profile._data["name"] == "alice"

Test B:

def test_b():
    profile = UserProfile("alice")
    assert profile.get_name() == "alice"

Test A — directly reading the storage is the most reliable way to verify the value was saved
Direct storage access (_data) is exactly the brittleness we saw with _tracks. The underscore signals ‘implementation detail.’ The day someone renames _data to _store, or changes the structure, Test A breaks even though the observable behavior (you can retrieve the name you stored) is unchanged.
Test B — it verifies behavior through the public API; Test A breaks if the storage key or structure changes internally
Right. Test B verifies through the public contract (get_name()). Internal storage can be restructured freely without breaking it. Test A is coupled to _data — a pure rename breaks it even when behavior is unchanged.
Both are equally robust since both assertions check the same underlying fact
They differ fundamentally under refactoring. Test A is coupled to the dict structure _data["name"]. Test B is coupled only to the observable contract: ‘after creating a profile with “alice”, get_name() returns “alice”.’ One breaks during internal restructuring; the other survives it.
Neither — storing a name should be tested by checking the return value of the constructor itself
Constructors in Python return None. Testing through the constructor’s return value isn’t the right path. The observable behavior is what you can read back via the public interface — that’s what get_name() provides.

Test A peeks at _data, an internal implementation detail. If someone refactors UserProfile to store data differently, Test A breaks — even though the behavior is unchanged. Test B tests through the public method get_name(), which is the contract the class makes with callers. The Refactoring Litmus Test: does it survive a pure internal refactor? Test B does. Test A does not.

2. You refactor a function’s internal algorithm (from bubble sort to quicksort) without changing its return value. Two of your tests break. What does this tell you?

The refactoring introduced a bug — changing the algorithm likely changed the output in a subtle way
The premise of the question is “without changing its return value” — i.e., behavior is unchanged. So a behavior bug is ruled out. The test failures must be telling you something other than “the new code is wrong” — and that something is: the broken tests were verifying how the function works, not what it does.
The broken tests are testing implementation details rather than behavior. They should be rewritten to assert on outputs
Correct. Tests breaking on a pure refactor are the diagnostic signature of brittle tests — they’re coupled to the implementation (internal variable names, algorithm choice) rather than the observable behavior. Rewrite them; keep the refactor.
You should revert the refactoring since passing tests must never be broken, even temporarily
Reverting a sound refactor because the tests are too tightly coupled inverts who’s in charge. The refactor is correct; the tests are the problem. Keep the refactor, fix the tests to assert on observable behavior, and you end up with cleaner code and a stronger suite.
The tests are too strict — you should lower the assertion precision or add tolerance margins
Loosening assertions or adding tolerance is exactly the wrong direction — that hides the brittleness rather than fixing it. The tests aren’t too strict in what they assert; they’re asserting on the wrong thing (implementation, not behavior). Rewrite, don’t dilute.

If the function’s behavior (inputs → outputs) is unchanged but tests break, those tests are coupled to the implementation. The fix is to rewrite the tests to assert on observable behavior, not internal details.

3. A test you wrote needs 40 lines of setup code before the single assert statement. What is this test telling you?

Nothing — large setups are fine as long as the assertion verifies the right thing in the end
Tests are also a design signal, not just a verifier. A test that needs 40 lines of setup is shouting that the function under test has too many dependencies — every collaborator the function reaches forces a corresponding fixture. Ignoring the smell keeps the pain.
The Arrange section is too big — this is the Excessive Setup smell, signaling that the production code has too many dependencies. Refactor the design
Correct. Excessive Setup is a design smell pointing at the production code — too many dependencies crammed into one function. The test is feedback; the fix belongs in the production design, not in hiding the setup behind a helper.
You should move the setup into a helper function so the test file looks shorter; the underlying design is fine
A helper hides the smell without addressing it. The test file looks shorter, but the underlying coupling is unchanged — the helper just centralizes the setup-heavy fact. The right response is to listen to what the test is telling you about the production design.
You should add more assertions to make the heavy setup feel justified — one assert is not enough for 40 lines of arrangement
Adding more assertions doesn’t compensate for hard-to-arrange tests; it makes them slower to run, harder to read, and more prone to verifying multiple unrelated things in one function. The number of assertions should match the number of behaviors under test, not the size of the Arrange block.

Excessive Setup is a test smell. When Arrange dominates, the function under test is too coupled. Fix the production code, not the test — the test is architectural feedback. Hiding it in a helper just hides the pain.

4. Two suites test the same function. Suite A has 100% line coverage but every assertion is assert result is not None. Suite B has 80% line coverage but every assertion checks an exact expected value. Which statement is correct?

Suite A is stronger — 100% coverage objectively beats 80%, regardless of assertion quality
Coverage is a necessary condition, not a sufficient one. 100% coverage with assert result is not None exercises every line but verifies almost nothing — any non-None return value passes, including 0, "", [], the wrong number entirely. A higher number doesn’t beat a meaningful assertion.
Suite B is stronger — coverage measures only which lines ran, not whether their behavior was verified. A weak oracle with full coverage catches no bugs
Right. Coverage measures which lines ran; strong oracles determine whether the behavior at those lines was actually verified. Suite A ran everything and checked almost nothing. Suite B ran 80% and checked every result precisely.
Both are equally strong — both execute the function, which is all that matters for catching bugs
The two suites do not catch the same bugs. Suite A passes for any non-None return — a bug that flips a return value from 42 to 0 slips through silently. Suite B catches that same bug because the value differs from the asserted expected one. Same lines exercised, vastly different bug-catching power.
Neither is strong — industry standard is ≥95% coverage AND mutation testing, so anything less is unacceptable
Mutation testing is a great deeper technique, but the question isn’t asking which industry metric is best — it’s asking which of these two suites is stronger. Suite B answers that even before mutation testing enters the picture, because it exposes the foundational issue: weak oracles on full coverage verify nothing.

A suite can run every line and still verify nothing if its assertions are weak. Coverage is a necessary ceiling (you cannot test what you never ran) but it is not sufficient for quality. Strong oracles on the critical paths beat weak oracles everywhere.

5. A function build_user_profile(name, age) has this spec:

Returns a dict with name (input), age (input), and is_adult (True if and only if age ≥ 18).

The current implementation also returns a cached_at timestamp for internal caching. The spec doesn’t mention cached_at. Which assertion is right — strong on the spec but not over-specified?

assert build_user_profile("alice", 30) is not None — survives any future change
is not None is the weak-oracle Liar — any dict, including a wrong one, passes. The ‘survives any future change’ framing is exactly the trap: a test that survives every change because it asserts almost nothing isn’t robust, it’s empty.
assert build_user_profile("alice", 30) == {"name": "alice", "age": 30, "is_adult": True, "cached_at": 1730000000} — pins everything
This is the over-specification trap. cached_at is not in the spec — it’s an internal caching detail. Pinning it means the test fails every time the cache fires (i.e., every second), even though every spec-mandated field is correct.
assert result["name"] == "alice", assert result["age"] == 30, assert result["is_adult"] is True — one assert per spec-mandated field
Right. Three asserts, one per spec-mandated field. cached_at is unspecified, so we don’t pin it. The test stays green when the cache implementation evolves and red when any spec-mandated field regresses.
assert isinstance(build_user_profile("alice", 30), dict) — the type guarantee is enough
Type checks are medium oracles — they catch wrong types, never wrong values. build_user_profile("alice", 30) could return {"name": "bob", "age": 0, "is_adult": False} and isinstance(..., dict) is happy. The right oracle pins each spec-mandated field.

The right oracle sits exactly on the spec — no less (weak/medium), no more (over-specified). Field-by-field equality on spec-mandated fields catches real regressions and survives changes to unspecified internal fields like cached_at.

6. (Spaced review — Step 3) Which assertion is the strongest oracle for compute_total([1.50, 2.00])?

assert compute_total([1.50, 2.00]) is not None
is not None is the canonical weak oracle. Any non-None return passes — 0.0, "banana", []. It tells you the function returned something, not whether it returned the right thing.
assert isinstance(compute_total([1.50, 2.00]), float)
Type checks catch type errors only. The function could return 42.0 (wrong sum) and isinstance is happy. Step 3 named this the medium-strength oracle — better than is not None, still wrong-value-blind.
assert compute_total([1.50, 2.00]) > 0
> 0 is a property check, not a value check. The function could return 0.01, 42.0, 1000000.0 — all positive, all wrong. Property assertions are useful when you genuinely don’t know the exact expected value, but for a deterministic function whose answer you can compute, pin the value.
assert compute_total([1.50, 2.00]) == 3.50
Exactly. == 3.50 is the only strong oracle here — any wrong return value (0.0, 1.5, 42.0) would fail it. The other options accept large families of wrong answers.

A strong oracle pins the exact expected value — only 3.50 satisfies it. The others would still pass if the function returned 0.0, 42.0, or any positive float — exactly the Liar test anti-pattern from Step 3.

7. (Spaced review — Step 2) For a function charge_fee(amount) with rule “fee is 2% if amount is 100 or more, else free”, which test pair exposes the most bugs?

amount = 50 and amount = 200 — one value on each side, well away from the boundary
$50 and $200 are both deep inside their partitions — both will pass under either > 100 or >= 100, since 50 is clearly free and 200 clearly charges the fee. Boundary bugs only manifest at the boundary; values away from it never expose them.
amount = 99 and amount = 100 — one just below and one at the boundary, catching > vs >= off-by-one errors
Right. 99 and 100 straddle the >= 100 threshold. The canonical boundary bug here is > 100 vs >= 100 — it produces the wrong answer at exactly amount = 100 and nowhere else.
amount = 0 and amount = 1000000 — extreme values at the edges of the representable range
Extreme-value tests catch overflow / representation bugs (rare in this kind of spec) but completely miss the off-by-one risk at $100. The most common bug here — confusing >= 100 with > 100 — produces correct behavior at $0 and $1,000,000. You’d ship the bug.
amount = 1 and amount = 2 — two adjacent low values to check precision at the small end
$1 and $2 are both in the same partition (free), so they verify the same path twice and never approach the boundary. This is the same anti-pattern as the Step 2 quiz: many tests crowded in one partition is roughly one test repeated.

The boundary is at 100. Testing 99 and 100 catches the canonical off-by-one (> 100 vs >= 100). Picking values far from the boundary misses exactly the bugs that Boundary Value Analysis is designed to expose.

8. (Spaced review — Step 2) A test reads assert is_age_valid(18) == True. A colleague says to use is True instead. What is the reason?

# Option A (current):
assert is_age_valid(18) == True

# Option B:
assert is_age_valid(18) is True

No difference — == True and is True behave identically for Boolean-returning functions
They behave the same if the function returns exactly True or False. But if someone refactors is_age_valid to return 1 (truthy but not Boolean), == True still passes (1 == True is True in Python) while is True fails. For a function whose contract is ‘returns Boolean’, that’s the gap you want covered.
is True catches the case where the function returns a truthy non-Boolean (like 1 or "yes") — == True accepts 1 == True, but 1 is True fails
Correct. is True uses identity — only the exact True singleton passes. == True uses equality — 1 == True evaluates to True in Python, so truthy non-Booleans slip through. For specs that say ‘returns Boolean’, is True is the precise form.
is True is faster at runtime, which matters in large test suites
Runtime performance is not the reason. The reason is oracle precision: is checks identity (is this the exact True object?), while == checks equality (is this value equal to True?). For booleans these differ on truthy non-Boolean values.
== True is a syntax error in pytest; is True is the required form
== True is valid pytest syntax. The distinction is about precision, not syntax. Both forms compile and run; only their bug-catching power differs.

is True uses identity comparison — only the literal True object passes. == True uses equality — 1, True, and any value where v == True all pass. For functions whose spec says “returns a Boolean”, use is True / is False so the test catches both wrong values and wrong return types.

Putting It All Together

Why this matters

Steps 1–4 each isolated one dimension of test design: behavior specification, partition choice, oracle strength, and testing the spec no-more-no-less. Real test design weaves all four together on every new function you encounter. This step lets you fuse them on a brand-new spec — designing a complete suite from scratch and feeling the four skills compose.

🎯 You will learn to

Create a complete test suite for an unfamiliar function from scratch — partitions, representative inputs, and strong oracles.
Evaluate your own suite against deliberately broken implementations to confirm each partition is actually probed.

✍️ Before reading on, write your own recap. In one or two sentences each, answer from memory (no scrolling back):

What did Step 1 teach you about what tests are for?
What did Step 2 teach you about which inputs to pick?
What did Step 3 teach you about the assertion?
What did Step 4 teach you about what to assert — and what NOT to assert?

Write all four sentences before expanding the disclosure below — the comparison is only useful if you retrieved first, not read first.

Once you’ve written your four sentences, expand the box below and compare. If your version names the same ideas in different words, you’ve consolidated the schema. If a step is fuzzy, that’s where to revisit.

📖 Compare with our recap

Step 1 — what tests are for: tests are executable specifications of behavior and a safety net against regressions, not “checking your homework.”
Step 2 — which inputs to pick: partition the input space, then test the boundaries between partitions — the off-by-one zone where most bugs live.
Step 3 — the assertion: oracle strength is one independent dimension. A strong oracle pins exactly what the spec mandates; weak oracles pass for almost any return.
Step 4 — what to assert against: the spec, no less and no more. Don’t peek at private state (internal coupling), and don’t pin output fields the spec doesn’t mandate (over-specification). Robust tests survive refactors.

The skill underneath all four: making the gap between what code does and what it should do visible and automatic.

⚙️ Final challenge — streaming.py defines streaming_price(price, plan) — the kind of pricing logic Spotify, Netflix, and YouTube Premium actually run:

`plan`	Discount
`"student"`	50% off
`"family"`	30% off
anything else	none

🔒 You are writing tests for a fixed function — don’t modify streaming.py. The validator runs your tests not against the streaming.py you can see, but against a hidden reference implementation plus three deliberately broken versions (one with no student discount, one with no family discount, one that returns 0 for unknown plans). To get full credit, your suite must:

pass against the reference (your assertions match the spec), AND

fail against each broken version (your tests actually probe each partition).

That’s the working definition of “your tests cover the partitions” — they catch bugs in each one. If a check fails, the message names which broken version your suite missed, so you know which partition to add a test for.

In test_streaming.py, design a test suite from scratch:

Articulate first (before any code): at the top of test_streaming.py, write a comment listing the partitions you see in the spec, like this:
```
# Partitions of plan:
# 1) ...
# 2) ...
```
The validator will check that this comment exists with at least two named partitions before it grades your tests. (This is the part most engineers skip — and it’s where most bugs slip through.)
Pick a representative input for each partition.
For each input, compute the expected return value and write a test with a strong oracle (an exact == on the computed value, not an is not None check).

You are now applying everything from Steps 1–4: behavior specification (1), partitions (2), oracle strength (3), and testing the spec — no more, no less (4).

💡 No numeric range, so no boundary values — but partitions still apply. Step 2’s boundary heuristic needed an ordered domain: lengths, ages, scores. Here plan is categorical — "student", "family", anything else — no numeric ordering, so there are no >= / > comparison operators and therefore no off-by-one boundary values to probe. But equivalence partitioning still applies: you test one representative per category. This is a Separation of two ideas you’ve used together: boundaries are a special case of partitioning that kicks in only when the domain is ordered.

Ask yourself: for streaming_price, are there any “edge-of-category” inputs worth testing beyond the three named categories? What about an unexpected string like "premium", or an empty string ""? These are the categorical equivalents of boundary probing — checking the edges of the decision logic for inputs the spec doesn’t explicitly name.

💡 Two-parameter functions: When a function takes two parameters, partition each dimension independently, then pick deliberate combinations — not all combinations (that grows exponentially), but enough to represent each partition at least once. Here, price has no spec-defined constraints, so any representative value (e.g., 20) works across all plan tests. If price had its own threshold (e.g., “discount only for orders ≥ $5”), you’d apply boundary testing to that dimension too.

💡 Floating-point equality: When the expected value is computed by multiplication (e.g., 20 * 0.50), standard == usually works for simple fractions, but for arbitrary floats use assert result == pytest.approx(expected) to avoid rounding surprises (e.g., assert streaming_price(13.99, "student") == pytest.approx(6.995)).

🪞 Recalibrate: At the start of Step 1 you rated your confidence (1–10) for designing a test suite from scratch. Re-rate yourself now. The gap between those numbers is what you actually learned — the feeling of progress is unreliable; the gap is data.

🧭 Threshold check — compare then and now: look back at the first test you encountered in Step 1. What did that test specify about the function? Now look at the tests you just wrote. What do they specify? Write one sentence naming what changed in how you think about what a test is for. Then explain why that shift matters for the next function you write — what will you do differently tomorrow that you wouldn’t have done before this tutorial?

🪞 Two independent dimensions of test design

Across this tutorial, two separate dimensions of test design have been mixed together. Naming them apart makes both clearer:

flowchart LR
    subgraph Dim1["DIMENSION 1 — what to test (input choice)"]
        direction TB
        D1A["Boundaries<br/>partition transitions"]
        D1B["Representative<br/>middle of partition"]
        D1C["Special cases<br/>empty, None, zero"]
    end
    subgraph Dim2["DIMENSION 2 — how strong the assertion (oracle)"]
        direction TB
        D2A["Strong<br/>== exact value"]
        D2B["Medium<br/>type / range check"]
        D2C["Weak<br/>is not None"]
    end
    Dim1 -.->|"a good test<br/>gets BOTH right"| Dim2

A test can be strong on input choice (boundary-aware) but weak on oracle (is not None) — and vice versa. Excellence is the cross-product: pick a meaningful input and assert the precise expected outcome. That’s why the streaming-price task above checks both partitions covered AND oracles strong.

🧰 When to reach for which technique (a quick decision guide)

You’ll meet new functions in the wild. Use this to decide which testing tool to pull out:

If the function…	Reach for…	Pattern from
Takes a numeric input with a valid range (`min ≤ x ≤ max`)	Boundary value analysis — test `min-1, min, max, max+1`	Step 2
Takes an input from a small set of categories (`"student"`, `"family"`, …)	Equivalence partitioning — one test per category	Step 2 + Step 5
Returns a value (vs. mutates state)	Strong-oracle equality — `assert result == expected`	Step 3
Returns a float computed by multiplication/division	`pytest.approx` — `assert result == pytest.approx(expected)` to avoid floating-point rounding surprises	Step 3 + real projects
Should raise an exception for certain inputs	`pytest.raises` — `with pytest.raises(ValueError): func(bad_input)`	Next tutorial
Returns a dict / record	Field-by-field equality on spec-mandated fields only — `assert result["price"] == 5` for each field the spec names. Don’t full-equality the whole dict (over-specification: it breaks when an unrelated field gets added)	Step 4
Returns a list	Collection equality — `assert result == [1, 2, 3]`; for order-independent: `assert sorted(result) == sorted(expected)`	Step 3 + real projects
Mutates an object’s state	Public API behavior tests — `obj.observable() == expected`	Step 4
Has internal state you’re tempted to peek at	Don’t. Add a public method instead, then test through it	Step 4
Is “trivial” and you think it doesn’t need a test	It deserves at least one regression test — today’s trivial is tomorrow’s surprise dependency	from research

Most real functions hit several rows at once. Apply them all.

🎲 Want unguided practice on a different shape of function?

The graded exercise above is streaming_price. Once you’ve completed it, try the same approach on one of these self-graded problems — copy the function below into a fresh file (e.g. practice.py) and write your own tests in test_practice.py. There’s no validator here; judge your suite yourself against the partitioning + strong-oracle checklist you used above.

# Option A — numeric boundaries (more like Step 2)
def shipping_fee(weight_kg: float) -> int:
    """Free if 0 < weight <= 1; $5 if 1 < weight <= 10; $20 above."""
    if weight_kg <= 0: return 0
    if weight_kg <= 1: return 0
    if weight_kg <= 10: return 5
    return 20

# Option B — state-changing (more like Step 4)
class StreakCounter:
    def __init__(self) -> None: self._n: int = 0
    def increment(self) -> None: self._n += 1
    def value(self) -> int: return self._n

For Option A, your partitions are numeric ranges; boundary value analysis from Step 2 is the dominant tool. For Option B, the function under test mutates state, so each test follows the behavior, not implementation pattern from Step 4 (assert through value(), never reach for _n).

🚀 What's next — pytest features you'll meet in your next project

You now have the foundations of testing. The pytest features below build on what you’ve learned — they don’t replace it. None of them are needed for what you just did, but you’ll see them everywhere in real codebases:

Feature	What it solves	When you’ll want it
`@pytest.fixture` + `conftest.py`	Repeated Arrange logic across many tests (e.g. database connection, sample objects, mock services)	When two tests start with the same 5 lines of setup.
`@pytest.mark.parametrize`	A family of similar tests on different inputs — one function, many cases	When you’d otherwise copy-paste the same test for `test_age_18`, `test_age_19`, `test_age_20`. The boundary-and-partition logic from Step 2 fits this perfectly.
`unittest.mock` / `pytest-mock`	Testing code that calls external services (HTTP, database, file I/O) without actually hitting them	When the function under test would otherwise require network or disk to run.
`pytest-cov` (coverage)	Measuring which lines of production code your tests execute	When you suspect a partition is missing — coverage shows untested branches. (Reminder from Step 4: coverage ≠ quality.)
Property-based testing (`hypothesis`)	Auto-generating thousands of inputs to find edge cases your boundary tests missed	When the input space is too large for case-by-case enumeration.

Next pedagogical step: the Test-Driven Development (TDD) tutorial — where you write the test before the production code, and let failing tests drive the design. Everything from this tutorial (oracle strength, partitions, behavior testing) becomes a foundation that TDD layers a discipline on top of.

For a different next step — the same testing concepts applied to a whole React app through a real browser — see the Playwright Tutorial. It picks up exactly where this one leaves off: AAA becomes navigate-interact-assert, partitions become user-path scenarios, oracle strength shows up in toHaveText vs toBeVisible, and the behavior vs implementation concept gets a tactile workout against UI refactors.

Where to apply these in your own work: every new function you write deserves at least one boundary test and one partition representative test, with a strong oracle, through the public API. That’s the four skills of this tutorial in 30 seconds per function — and it pays for itself the first time a refactor would have shipped a regression.

Starter files

streaming.py

def streaming_price(price: float, plan: str) -> float:
    """Apply a streaming-service plan discount.

    student -> 50% off  (Spotify Student / YouTube Premium Student style)
    family  -> 30% off  (Spotify Family / Apple Music Family style)
    other   -> no discount  (Individual, free, etc.)
    """
    if plan == "student":
        return price * 0.50
    if plan == "family":
        return price * 0.70
    return price

test_streaming.py

"""Design your own test suite for streaming_price.

Apply what you've learned:
  - pytest conventions (function names start with test_)
  - strong oracles (assert exact expected values, not 'is not None')
  - partition the input space (student / family / other)
"""
import pytest
from streaming import streaming_price

# TODO: Write at least 3 tests covering all three partitions of plan.

Step 5 — Knowledge Check

Min. score: 80%

1. (Spaced review — all steps) A function ships free for orders ≥ $50 and charges $5 otherwise. Which test pair is the single most important?

order = $25 and order = $200 — one in the middle of each partition
$25 and $200 are deep inside their partitions. They both behave correctly under either >= 50 or > 50, since 25 clearly charges and 200 clearly ships free. Boundary bugs only surface at the boundary itself.
order = $49.99 and order = $50 — one just below the boundary, one exactly at it
Right. $49.99 and $50 straddle the >= $50 threshold — the same test-pair structure as the streak bug at day 100. Two values straddling a threshold expose the > vs >= family of bugs.
order = $0 and order = $1,000,000 — the extreme edges
$0 and $1,000,000 catch a different category of bugs (representation issues, overflow). They miss the dominant off-by-one risk at $50, which is where 99% of bugs in this kind of spec live. Boundary tests come first; extreme-value tests come second.
order = $50 only — one boundary test is enough
One boundary test isn’t enough; two values straddling the boundary are what expose > versus >=. With only order = $50, the test cannot tell whether the function is using the right operator because it observes only one side of the threshold.

The boundary is $50. Testing $49.99 (just below) and $50 (exactly at) catches the classic >= vs > off-by-one bug — the same family as the streak_badge bug from Step 1.

2. A teammate adds assert result is not None for calculate_total() and says, “Great — the function works.” What’s the right response?

Agree — passing tests prove the function works.
Tests passing only tells you what their assertions held. assert result is not None holds for any non-None value, including buggy ones. The teammate is reading the PASS result and skipping the question of what was actually verified.
That’s a weak oracle. It passes for any non-None value (including 0.0 or "banana"). Replace with assert result == <expected_value>.
Exactly. The fix is to pin the exact expected value with ==. is not None only tells you the function returned something — the strong oracle tells you it returned the right thing.
Disagree because we don’t have 100% line coverage yet.
Coverage isn’t the issue here — the issue is oracle strength. The function might be running with full coverage and returning the wrong value entirely; this assertion couldn’t tell you. Push back on the assertion, not on coverage.
Disagree — the test is fine but the function name should change.
The function name might also need work, but that’s a separate critique. The immediate problem is that the test doesn’t verify the function’s behavior in any meaningful way — even a perfect function name wouldn’t fix a is not None oracle.

Weak oracles look productive but verify nothing. The fix is the strong oracle: pin the exact expected value.

3. You need to write your first test for a new function parse_username(s). The spec: accept usernames of length 3–20, reject everything else. What is your first step when designing the test suite?

Open parse_username’s source code and write assertions that confirm each code path runs
Starting from the implementation makes your tests confirm what the code does, not what it should do — the same post-hoc trap from Step 1. Start from the spec: identify the partitions, then find the boundaries.
Write assert parse_username('alice') is not None — if the function runs without crashing, it works
is not None is the canonical weak oracle: it passes for any non-None return, including wrong values. You’ve replaced it with strong == assertions precisely to avoid this.
Identify the partitions and boundaries: the valid range [3, 20] has two boundary pairs; pick inputs at each boundary and one representative per partition
Exactly. Partition s by length: too-short (< 3), valid (3–20), too-long (> 20). The boundaries sit at lengths 2, 3, 20, 21. Pick one representative per partition plus those four boundary values — that’s the Step 2 method applied cold.
Start with a single test using a random username and see if it passes
A single random test is equivalent to one middle-of-partition sample — it misses all boundaries. The skill from Step 2 is to choose inputs deliberately, not arbitrarily.

Start from the spec, not the implementation. Partition s by length: too-short (< 3), valid (3–20), too-long (> 20). Test boundary values 2, 3, 20, 21 and one representative in the middle. That’s the Step 2 method applied directly to a new function — no peeking at the implementation required.

4. (Spaced review — Step 4) You rename a private attribute _cache to _store in a class without changing any public method behavior. Three tests break immediately. What does this tell you?

The rename introduced a bug — renaming always risks changing behavior
The question specifies ‘without changing any public method behavior’ — behavior is unchanged by definition. Failures after a pure rename signal the tests themselves are the problem, not the rename.
The three broken tests were accessing _cache directly — they were testing internal implementation, not observable behavior
Correct. Tests that read or write private members (_cache, _tracks, _data) break the moment those internals are renamed or restructured — even when no behavior changes. This is the Refactoring Litmus Test: if a pure internal refactor breaks tests, those tests are brittle.
You should revert the rename to restore the passing tests
Reverting a sound refactor because tests are brittle inverts the correct response. The refactor is valid; the tests are the problem. Rewrite the broken tests to verify behavior through the public API.
Private attributes must never be renamed after the first release
There’s no such rule — private attributes can and should be renamed freely as designs evolve. The issue is not the rename but the tests that depend on the old name instead of the public contract.

A pure refactor (no behavior change) should never break well-written tests. If it does, the broken tests were coupled to internal implementation details. Rewrite them to assert on observable behavior through the public API — then the same refactor (or any future one) leaves the suite green.

Testing Foundations with pytest

Why Test? The Bug That Got Away

Why this matters

🎯 You will learn to

💡 Why test?

🔍 Predict first

📂 What you have

Solution

Step 1 — Knowledge Check

Choosing What to Test: Partitions & Boundaries

Why this matters

🎯 You will learn to

📝 The shape of a pytest test

💡 The principle: equivalence partitions and boundaries

📖 Equivalence partitioning — the deeper “why”

Solution

Step 2 — Knowledge Check

Oracle Strength: Strong, Weak, and the Liar Test

Why this matters

🎯 You will learn to

📖 Oracle strength — three flavors of weak

Solution

Step 3 — Knowledge Check

Test Behavior, Not Implementation

Why this matters

🎯 You will learn to

Part A — Internal coupling (the rename experiment)

📦 Two Python tools used in this step: @dataclass and @property

🚩 When Arrange dominates — the Excessive Setup smell

Part B — Over-specification (the upper bound of oracle strength)

🎓 Coverage ≠ quality

Solution

Step 4 — Knowledge Check

Putting It All Together

Why this matters

🎯 You will learn to

🪞 Two independent dimensions of test design

Solution

Step 5 — Knowledge Check

📦 Two Python tools used in this step: `@dataclass` and `@property`