TDD with pytest — Dragon Dice Battle
Live the Red-Green-Refactor rhythm by growing a fantasy dice-scoring engine from scratch — seven tiny tests, one design pressure at a time, plus a final transfer cycle (FizzBuzz) that proves the rhythm carries to a brand-new problem. Tests drive design, refactors stay safe, and the rules of the game emerge as code.
Cycle 1 — RED: Write the Failing Test
Why this matters
RED is the moment TDD looks weirdest: you deliberately write a test that cannot pass yet, and you make the failure happen on purpose. That inversion is the threshold concept — a failing test is the goal, not the accident, because it’s the first place where the spec gets pinned down before any implementation exists. Learning to read a failure for the right reason is the foundation everything else in this tutorial sits on.
🎯 You will learn to
- Apply the four-part pytest test shape (import, define, arrange-act, assert) to translate a one-sentence spec into a runnable failing test
- Analyze a pytest failure and distinguish a right-reason RED (ImportError / AssertionError on the assertion you wrote) from a wrong-reason RED (typo / missing colon)
- Evaluate why a surprise green on a brand-new test should be treated as a Liar test until proven otherwise
Prerequisite: Testing Foundations — pytest discovery, assert, partitions, behavior-not-implementation. If those feel new, do that one first.
What you’re building — Dragon Dice
Dragon Dice is a (fictional) tabletop combat game. The mechanic is simple: a player rolls a handful of six-sided dice, and certain face values and combinations trigger named combat events — Dragon Flame, Lightning Spark, Goblin Swarm, and so on — each worth a damage number. A turn’s roll is just a Python list of dice values, e.g. [1, 1, 1, 1, 5].
Two kinds of scoring happen on every roll:
- Singles — a 1 becomes one Dragon Flame (100 damage); a 5 becomes one Lightning Spark (50). Other face values, on their own, score nothing.
- Triples (combos) — three matching dice trigger a bigger event that consumes its dice. Three 1s become one Dragon Blast (1000) instead of three Dragon Flames; three 2s become a Goblin Swarm; and so on. Whatever the combos don’t consume keeps scoring as singles, so [1, 1, 1, 1, 5] produces one Dragon Blast (consuming three 1s) plus a leftover Dragon Flame plus a Lightning Spark — for 1150 total damage. The full ruleset is in the table further down.
Your goal across the seven Dragon-Dice cycles is to grow a score(dice) function that turns any roll into a BattleReport — its total_damage and the ordered tuple of ScoringEvents it produced. You will not look at the full ruleset and write it all at once. TDD adds one rule at a time, each one earned by a test that demands it. After cycle 7 an eighth transfer cycle reapplies the same rhythm to a totally unrelated problem (FizzBuzz), as proof the discipline carries beyond this domain.
Test-Driven Development in one minute
TDD is a design technique that uses tests as the medium of pressure. You write code in short cycles of three phases:
| Phase | What you do | Why this phase exists |
|---|---|---|
| 🔴 RED | Write one failing test that names a behavior you want | Forces the interface and expected behavior to be decided before any logic exists |
| 🟢 GREEN | Write the smallest code that makes the test pass | Resists speculative design; only build what a test demands |
| 🔵 REFACTOR | Improve the code while all tests stay green | The safety net lets you reshape structure without fear of regression |
Each phase of cycle 1 is its own tutorial step so the rhythm becomes a felt sequence, not a slogan. From cycle 2 onward, each cycle is one step containing all three phases.
Why a failing test is the goal of RED

Most testing intuition is the opposite: green = good, red = bad. TDD inverts that for the first run of every cycle. If you write a brand-new test against code that doesn’t exist yet — and pytest reports PASSED — something is wrong. Maybe the import silently failed. Maybe the assertion is vacuous. Maybe you’re running an old cached version. A surprise green is a Liar test until proven otherwise; the wizard is right to block it.
A failing test is not a bug — RED is the expected starting state of every cycle. But the failure has to come from the behavior under test, not from a typo:
- Right reason — ImportError, AttributeError, a value-mismatch on the assertion you wrote. The test correctly says “this behavior does not exist yet.”
- Wrong reason — SyntaxError, missing colon, misspelled test_ prefix. The test never ran. You’ve learned nothing about the unit, only about your typing.
Students commonly delete a failing test to make the bar green. We’re leaning into that discomfort instead. Learning to read the failure is what TDD trains.
🤔 But why test-first? Why not just write the code, then test it?
The honest answer: most developers’ instinct is to write the code first. That is the habit TDD is replacing — and it deserves a real argument, not just a style claim.
A small concrete scenario. Suppose you skip the test and just write score() directly. You’re confident it’s right; you eyeball-check it in a REPL with score([1]) and score([1, 5]), see plausible numbers, ship. Two weeks later your teammate adds a triple 1s = Dragon Blast rule by inserting an elif branch that fires before the per-die loop. The elif only matches exactly [1, 1, 1]; rolls like [1, 1, 1, 5] silently fall through and score wrong.
With a test-first cycle, the triple 1s test would have run against an empty score() and forced the question “what does the spec say happens with leftover singles?” before the elif was written. Without the test, the bug ships and surfaces only when a player notices their score is off — if they notice at all.
The general pattern. Code-first writes a function and then asks “what should it do?” Test-first writes a behavioral commitment and then asks “what’s the simplest code that delivers that?” The first habit lets implementation choices smuggle themselves into your sense of what the spec was. The second prevents that — the spec is on disk, in code, before any implementation can pollute it. (Janzen & Saiedian’s ICSE 2007 study of 230+ programmers: even programmers who tried test-first once kept reverting to code-first afterward; the habit is that sticky. Naming it here, so you can notice it in yourself, is half the work.)
So you might still resist test-first today. Notice the resistance. The goal of these seven cycles is to give you the felt experience of small-step rhythm — after which you’ll be choosing test-first because it works, not because we said so. (And per Fucci et al. 2017: even if you sometimes write the code an instant before the test, the granularity and rhythm are where TDD’s measured benefits come from. So don’t worry about being a purist; worry about being incremental.)
The shape of every pytest test
Every pytest test you write has the same four-part structure:
| Part | What it does |
|---|---|
| Import the unit under test | Tells Python what code you’ll call |
| Define a function whose name starts with test_ | pytest only discovers functions matching this pattern |
| Arrange + act | Set up any input and call the unit |
| Assert an observable property of the result | Pin down one thing the spec promises |
The pattern generalizes; the specifics (what to import, call, and assert) come from the spec — and only from the spec.
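As one concrete instance of the four-part shape, here is a complete test against a small hypothetical unit (cart_total is an illustrative name, not part of the dragon-dice code you are about to write):

```python
# Hypothetical unit under test. In a real project this function would live in
# its own module, and part 1 of the test would be an import such as
# `from cart import cart_total`; it is defined inline so the example runs alone.
def cart_total(prices):
    return sum(prices)


def test_empty_cart_has_zero_total():  # part 2: name starts with test_, so pytest discovers it
    total = cart_total([])             # part 3: arrange + act on the simplest input
    assert total == 0                  # part 4: assert one observable property from the spec
```

Rename the function to drop the test_ prefix and pytest silently skips it: discovery is by naming convention, not by registration.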
The dragon dice rules (reference for all seven cycles)
| Roll | Event | Damage |
|---|---|---|
| Single 1 | Dragon Flame | 100 |
| Single 5 | Lightning Spark | 50 |
| Triple 1 | Dragon Blast | 1000 |
| Triple 2 | Goblin Swarm | 200 |
| Triple 3 | Orc Charge | 300 |
| Triple 4 | Troll Smash | 400 |
| Triple 5 | Lightning Storm | 500 |
| Triple 6 | Demon Strike | 600 |
Triples consume three dice; leftover 1s and 5s still score as singles. Dice are integers 1–6. Today you implement only the empty-roll case. Six more cycles add the rest, and a final transfer cycle (a different problem entirely) proves the rhythm carries.
Commit after every step (the safety-net habit)
The editor has a Git Graph view next to it and an embedded terminal that accepts a small set of shell commands (git, python, pytest, plus &&/||/; chains). Commit at the end of each step with a short message naming the phase (RED:, GREEN:, REFACTOR:, Cycle N:). Two reasons it earns its keep:
- Atomic safety net. Every commit is a known-green state you can git reset --hard back to if a refactor goes sideways. Beck’s discipline: never refactor on top of uncommitted code.
- Visible history. The Git Graph view shows your DAG growing one node per phase — a literal picture of “Red, Green, Refactor, Red, Green, Refactor…” that mirrors what your editor just did.
Cycle 1’s three steps each give you the exact command to type. From Cycle 2 onwards, the commit prompt only suggests the message — you write the git add <files> && git commit -m "..." yourself. (Always stage the specific files you touched, e.g. git add scorer.py test_scorer.py. Avoid git add -A — it sweeps in junk you didn’t mean to commit.)
Your test list (Canon TDD step 1)
Kent Beck’s Canon TDD (December 2023) starts with a written list of behaviors you want the code to have — before writing any tests. The list isn’t a contract; it’s a thinking tool. New behaviors get appended as they occur to you; ones you finish get struck through; ones that turn out to be already-implemented (the bonus mixed-dice test in cycle 4, the bonus leftover guardrail in cycle 5) get a checkmark with no code change.
Here are the first three items, in the order the cycles will tackle them:
- ☐ Cycle 1 — Empty roll → no damage, no events
- ☐ Cycle 2 — A single 1 → one Dragon Flame event
- ☐ Cycle 3 — A single 5 → one Lightning Spark event
- ☐ Cycle 4 — …
More items appear as we work through them — Beck’s discipline is to not pre-resolve them all. Pick the next item, turn only that one into a runnable test, make it pass, optionally refactor, repeat. He warns explicitly against converting every list item up front (“leads to rework and depression”) and against mixing refactor into making a test pass (“wearing two hats simultaneously”). The platform’s step-by-step structure enforces both disciplines for you.
Cycle 1’s spec
An empty roll produces a battle report with zero damage and no events.
That sentence names everything you need: a function score, a return value with total_damage and events attributes. Translate it into a pytest test using the four-part shape.
Your task
- In test_scorer.py (right pane), fill in the three sub-goal comments. Leave scorer.py empty — its code belongs to the GREEN step.
- Predict the category of failure you’ll see — ImportError, AttributeError, or AssertionError? Write it down.
- Click Run. Compare the actual failure to your prediction.
Reveal — what we expected (open after running)
ImportError: cannot import name 'score' (or ModuleNotFoundError if scorer.py is empty). That IS the deliverable — RED for the right reason.
Why `()` and not `[]`? (open if you wondered)
Tuples are immutable — they can’t be mutated by accident, and they’re safe dataclass defaults (cycle 2 uses that). Every test in this tutorial that pins down events uses a tuple.
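A quick demonstration of the difference (BadReport and OkReport are throwaway names for this sketch):

```python
from dataclasses import dataclass

# A mutable list default is rejected outright at class-definition time...
try:
    @dataclass(frozen=True)
    class BadReport:
        events: list = []  # dataclasses raises ValueError here
except ValueError:
    mutable_default_rejected = True

# ...while an immutable tuple default is safe to share between instances.
@dataclass(frozen=True)
class OkReport:
    events: tuple = ()

assert mutable_default_rejected
assert OkReport().events == ()
```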
📦 Commit your progress
Before moving on, lock this step into the safety net. In the embedded terminal:
git add test_scorer.py && git commit -m "RED: failing test for empty roll"
# Cycle 1 RED phase — DO NOT WRITE PRODUCTION CODE HERE YET.
#
# The next tutorial step (Cycle 1 GREEN) is where the BattleReport
# class and the score function are introduced. Right now we are
# only writing the failing test on the right.
"""Cycle 1 RED — write the first failing test.
The sub-goals below describe the PURPOSE of each line you need to add,
not the syntax. Translate the spec ("an empty roll has no damage and no
events") into pytest assertions yourself. If you get stuck, consult the
rules table and the references in the instructions panel.
"""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
    # Sub-goal: call the unit under test on the simplest input the spec mentions
    # — and capture the result so the next two lines can inspect it.
    # Sub-goal: pin down what the spec says about damage in this case.
    # Sub-goal: pin down what the spec says about events in this case.
    pass
Cycle 1 — GREEN: Make It Pass
Why this matters
The instinct on GREEN is to “build it right” — anticipate the next cycle, generalize early, reach for the elegant abstraction. That instinct is the single most common way TDD degrades into test-after. The GREEN rule asks for the smallest code that satisfies this one test, even if it looks embarrassingly trivial — because every line you write without a test demanding it is a guess, not a discovery.
🎯 You will learn to
- Apply the GREEN rule by writing the smallest code that satisfies the current failing test (no speculative branches, no premature abstraction)
- Analyze a pytest test as a contract that prescribes the unit’s interface (@dataclass(frozen=True), @property, default tuple) line-by-line
- Evaluate a candidate GREEN against the Transformation Priority Premise — preferring lower-cost transformations (constant → variable) over higher-cost ones (loop / class)
The GREEN rule: write the smallest code that makes the failing test pass. Anything more is speculative design — code with no test demanding it.
The test is your contract
Every line of the test is an obligation your code must satisfy:
| Line of the test | What your code must provide |
|---|---|
| from scorer import score | A score name in scorer.py |
| report = score([]) | score returns something |
| assert report.total_damage == 0 | That something exposes total_damage, equal to 0 |
| assert report.events == () | …and events, equal to () |
Three Python tools, in this new context
You already know these — what’s new is why this test forces you to reach for them:
- @dataclass(frozen=True) — gets you free __init__/__eq__/__repr__, and the per-field structural __eq__ is exactly what makes report.events == (...) work in cycle 2. (Also hashable, which we lean on later.)
- @property — needed because the test reads report.total_damage as an attribute, not report.total_damage(). The test’s grammar is the constraint; @property is the tool that fits it.
- Default tuple — events: tuple = () lets the empty report be built with no arguments, and because a tuple is immutable it is a legal dataclass default (a list would be rejected).
That’s it. The test wrote the spec for you; these tools are the smallest Python primitives that satisfy it. (dataclasses · property if you want a refresher.)
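Here are the tools assembled in a neutral Cart shape (Cart and LineItem are illustrative names; translating the shape to the dragon-dice naming is your job in this step):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: immutable value object with structural __eq__
class LineItem:
    name: str
    price: int


@dataclass(frozen=True)
class Cart:
    items: tuple = ()  # immutable default; an empty cart needs no arguments

    @property  # read as cart.total, not cart.total()
    def total(self) -> int:
        return sum(item.price for item in self.items)


assert Cart().total == 0
assert Cart((LineItem("apple", 3),)) == Cart((LineItem("apple", 3),))  # structural equality
```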
The Transformation Priority Premise — why “smallest” beats “best”
Robert Martin’s TPP lists code transformations from simplest to most complex: nothing → constant → variable → conditional → loop. The rule: always pick the simpler transformation that passes the current failing test, even when you “know” a more general one is coming.
For cycle 1, the test only mentions the empty case. You do not need a loop yet — the empty-tuple default already produces 0 damage. The loop arrives when a test (cycle 4) actually demands it.
Your task
- In scorer.py (left pane), replace each sub-goal comment with the matching line. Re-read the test for the contract.
- Before you click Run, identify one way your code could be wrong. (A misplaced default? A forgotten decorator? A method where the test reads an attribute?) Run, then check whether your prediction matched.
- Resist any “improvement” beyond what the test demands — the next step is REFACTOR, and it only earns work that has somewhere to go.
🛟 Stuck? Common shapes that fail (open if pytest is red)
- events: list = [] — Python rejects mutable defaults in dataclasses with ValueError. What immutable alternative matches the test’s events == () assertion?
- Forgetting @property — without it, report.total_damage is a bound method object, not a number; the assertion fails in a weird way.
- @dataclass without frozen=True — passes cycle 1, but cycle 2’s tuple comparisons of value objects need the structural __eq__ that frozen dataclasses provide.
- if not dice: ... — speculative branching. The empty-tuple default already handles the empty case.
📦 Commit your progress
🔍 Before you commit, glance at the gutter. The +/~/- markers in the left margin of each editor pane show what changed since your last commit (the RED step). The diff should be exactly the production code you just wrote — nothing else. If you see surprises, investigate before staging.
Then, in the embedded terminal:
git add scorer.py test_scorer.py && git commit -m "GREEN: empty BattleReport with zero damage"
"""Cycle 1 GREEN — smallest code that turns the failing test green.
The sub-goals below describe the PURPOSE of each line you need to add,
not the syntax. Re-read test_scorer.py to recover the contract: a name
to export, a return value, two attributes on the return value with
specific values. The Cart example in the instructions shows the toolkit
shape; you must translate it to the dragon-dice naming yourself.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    # Sub-goal: declare the storage that the test reads as `report.events`.
    # Hint: the test compares this to `()`, which already tells you the
    # type and the default value. (See the dataclasses docs link.)

    @property
    def total_damage(self) -> int:
        # Sub-goal: derive the total from whatever events the report holds.
        # Hint: with an empty-tuple default for events, an aggregate built-in
        # over an empty sequence already produces the value the test asserts.
        pass


def score(dice: list[int]) -> BattleReport:
    # Sub-goal: hand back the kind of object the test reads attributes on.
    # Hint: ignore `dice` for now — no test makes a claim about non-empty
    # rolls yet, so any branching on it would be speculative design.
    pass
"""Cycle 1 — first failing test (carried over from the RED step)."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])
    assert report.total_damage == 0
    assert report.events == ()
Cycle 1 — REFACTOR: The Pause That Counts
Why this matters
Beginners skip REFACTOR when they “don’t see anything to clean up” — and that habit is exactly how TDD silently decays into write-test-then-write-code. REFACTOR is a phase you enter every cycle, with a deliberate look-around through a checklist; the answer “nothing this time” is a fine outcome, but skipping the look is not. Today’s cycle 1 has almost nothing to clean — that’s why it’s the right moment to install the discipline of looking anyway.
🎯 You will learn to
- Apply the five-line REFACTOR checklist (duplication, names, test names, magic constants, imports) as a deliberate pause at the end of every cycle
- Evaluate when “nothing to clean this time” is the correct outcome — and notice that entering and looking is the discipline, not finding something
- Analyze a quiz question on the rhythm to confirm RED-GREEN-REFACTOR is now reasoned about, not just slogan-recited
The discipline: REFACTOR is a phase you enter every cycle — even when the answer is “nothing to clean this time.” Entering and looking is the discipline. Skipping the look is the failure mode that quietly degrades TDD into test-after.
The REFACTOR checklist (you’ll re-use this every cycle)
| Category | Question to ask | Your cycle 1 answer |
|---|---|---|
| Duplication | Two pieces of code expressing the same idea? | _____ |
| Names | Do names describe what they mean, not how they work? | _____ |
| Test names | Does each test name read as a behavior sentence? | _____ |
| Magic constants | Unexplained numbers or strings? | _____ |
| Imports | Conventional order, no dead imports? | _____ |
Fill the right column from your code before opening the reveal. The discipline is the looking, not the finding.
Your task
- Re-read your code with the checklist. Spend 30 seconds — don’t rush.
- Make any tiny improvement you spot (e.g., a module docstring); keep the bar green.
- Open the reveal below to compare your answers. Then take the first quiz.
Reveal — one possible cycle-1 answer column
| Category | Cycle 1 answer |
|---|---|
| Duplication | No — only one piece of code |
| Names | BattleReport, total_damage, events, score — all domain words |
| Test names | test_empty_roll_has_zero_damage_and_no_events — long but unambiguous |
| Magic constants | None yet |
| Imports | Just from dataclasses import dataclass — clean |
For cycle 1, every row is “fine.” That’s a real outcome of a REFACTOR phase — and recognizing it without skipping the look is the win.
Why REFACTOR is the most-skipped phase
Martin Fowler calls skipping refactor “the most common way to screw up TDD.” Field studies of student and professional practice agree: developers treat the green bar as the finish line. Within a few cycles, duplication accumulates and the test suite ages — exactly because nobody paused at REFACTOR to look. By making “enter the phase even when there’s nothing to do” a habit now, you defend against that drift for the rest of the tutorial.
📦 Commit your progress
Before moving on, lock this step into the safety net. In the embedded terminal:
git add scorer.py test_scorer.py && git commit -m "REFACTOR: cycle 1 (nothing to clean)"
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self) -> int:
        return sum(event.damage for event in self.events)


def score(dice: list[int]) -> BattleReport:
    return BattleReport()
"""Cycle 1 — first failing test, now green."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])
    assert report.total_damage == 0
    assert report.events == ()
Step 3 — Knowledge Check
Min. score: 80%

1. What is the correct order of phases in a single TDD cycle?
RED → GREEN → REFACTOR. The order is load-bearing. RED proves the test reveals missing behavior. GREEN proves the code now satisfies that behavior. REFACTOR improves the code while the safety net (the green test) protects you.
2. You write a new test and click Run. The bar is immediately green without you having to write any production code. What is the most defensible interpretation?
Both halves matter. Sometimes a test passes immediately because a prior refactor generalized behavior — that is a positive outcome, and the test still earns its place by documenting that the behavior is intended. Other times the test is vacuously true. The way to tell is to break the production code on purpose: a real test will catch the break.
3. In the GREEN phase, you have two implementations in mind: one is a hardcoded if dice == []: return BattleReport(), the other is a generic loop. The hardcoded version passes the current test. What does TDD discipline say to do?
This is the Transformation Priority Premise in action: prefer simpler transformations. The hardcoded version is fine as long as a future test will challenge it. That challenge — and the design pressure it creates — is exactly what cycle 4 of this tutorial will hand you.
4. Why is REFACTOR worth entering even when there is nothing to clean up?
Field studies report that “skip refactor when it looks fine” is exactly how test-driven discipline degrades into test-after over a few weeks. Every cycle is a checkpoint where you ask “what do I see now?” Most cycles, the answer is mostly fine. But you only know that because you looked.
5. A teammate says: “TDD is just unit testing — write your tests first instead of last.” What is the most accurate correction?
This is the threshold concept: TDD is design first, testing second. The test forces you to decide the public interface (function name, parameters, return shape) before any logic exists. The REFACTOR phase is where the design that emerged under pressure gets shaped intentionally. Calling it “unit testing in reverse order” misses both halves.
Cycle 2 — Single 1 → Dragon Flame
Why this matters
Cycle 1 walked the rhythm one phase at a time. Cycle 2 packs all three phases into one step — and immediately tests the hardest TDD discipline of all: allow the hard-code. The first GREEN for “a 1 is a Dragon Flame” should look ugly (if dice == [1]:) because one example is not enough information to choose the right shape. Refactor toward duplication, not before it.
🎯 You will learn to
- Apply the full RED-GREEN-REFACTOR rhythm as a single packaged cycle, translating a one-sentence spec into a test, the smallest passing code, and a deliberate REFACTOR pause
- Analyze why the first GREEN is allowed (and expected) to look ugly — one example is not enough information to choose the right shape
- Evaluate the “refactor toward duplication, not before it” rule against the temptation to generalize early
Spec: a single die showing 1 creates a Dragon Flame event worth 100 damage.
From now on, each cycle is one step with three tasks (RED → GREEN → REFACTOR). Same discipline as cycle 1 — tighter packaging.
Your task
- 🔴 RED — add test_single_one_creates_dragon_flame_event in test_scorer.py. From the spec (“a single die showing 1 creates a Dragon Flame event worth 100 damage”), translate into pytest assertions yourself. The four-part shape from cycle 1 still applies; the rules table at the top names the event and damage. Predict the failure category (ImportError? AttributeError? AssertionError?) before running.
- 🟢 GREEN — pick the smallest code that turns the test green. Resist any abstraction beyond what cycle 2’s single test demands. After you’ve made your choice, open the reveal below to compare.
- 🔵 REFACTOR — walk the cycle-1 checklist. Resist generalizing; cycle 4 will earn the loop.
Reveal — what we expected for RED (open after running)
ImportError: cannot import name 'ScoringEvent' — the test forces you to name the event class before writing it. That’s the design pressure of test-first thinking.
Reveal — one shape for the smallest GREEN (open after you've tried)
A hardcoded if dice == [1]: branch returning a BattleReport with one ScoringEvent. Yes, it’s ugly. Yes, you can see how cycle 3 will duplicate it. That’s the point — wait for the second example.
Why “allow the hard-code” is a TDD discipline. The instinct is to extract a rule, write a loop, build the abstraction now. TDD asks you to wait for the test that demands it. A speculative loop is a guess at the right shape; a loop refactor pulled by cycle 4’s test is a discovery. Refactor toward duplication, not before it.
🪞 Pause (10 seconds, after green): what did the test force you to name before any code existed? Hold your answer; the cycle-3 reveal will compare.
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py test_scorer.py) and write a short message — recommended: Cycle 2: single 1 = Dragon Flame.
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self) -> int:
        return sum(event.damage for event in self.events)


def score(dice: list[int]) -> BattleReport:
    return BattleReport()
"""Cycles 1–2 — adding the Dragon Flame behavior."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])
    assert report.total_damage == 0
    assert report.events == ()
# TODO (RED): import ScoringEvent from scorer
# TODO (RED): write test_single_one_creates_dragon_flame_event
# score([1]) should return a report with total_damage == 100
# and events == (ScoringEvent("Dragon Flame", (1,), 100),)
Cycle 3 — Single 5 → Lightning Spark
Why this matters
Cycle 3 is the same shape as cycle 2 with different values — and that’s exactly why it matters. Two near-identical hardcoded branches make the duplication impossible to miss; the trap is that your hands will itch to extract a loop right now. Don’t. Refactoring with only two data points is still guessing. Cycle 4’s test will provide the third point — and the loop refactor it earns will be a discovery, not a guess.
🎯 You will learn to
- Apply Variation Theory by writing a second test with the same shape as cycle 2 (only the values change) and observing what the contrast makes visible
- Evaluate when deliberately keeping ugly code is the disciplined move — refactoring under-informed is worse than not refactoring
- Analyze how the visible duplication will be the design pressure that earns the cycle 4 refactor
Spec: a single die showing 5 creates a Lightning Spark event worth 50 damage.
Same shape as cycle 2, different values. The duplication this creates is intentional — cycle 4’s test will earn the right to fix it.
Your task
- 🔴 RED — add test_single_five_creates_lightning_spark_event, structured exactly like cycle 2’s test but with the Lightning Spark values.
- 🟢 GREEN — add a second hardcoded if dice == [5]: branch. Resist the urge to write a loop or dict lookup.
- 🔵 REFACTOR — walk the checklist. The duplication is now visible; the right move is to note it and write nothing. No test demands the loop yet.
Why deliberately keeping ugly code is the disciplined move. You can clearly see duplication. Refactoring it now would be guessing at the right shape with one too few data points. Cycle 4’s test will provide the second data point — and the loop refactor it earns is a discovery, not a guess. Refactor toward duplication, not before it.
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py test_scorer.py) and write a short message — recommended: Cycle 3: single 5 = Lightning Spark.
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self) -> int:
        return sum(event.damage for event in self.events)


def score(dice: list[int]) -> BattleReport:
    if dice == [1]:
        return BattleReport((
            ScoringEvent("Dragon Flame", (1,), 100),
        ))
    return BattleReport()
"""Cycles 1–3 — adding the Lightning Spark behavior."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])
    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])
    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )
# TODO (RED): write test_single_five_creates_lightning_spark_event
# score([5]) should return total_damage == 50 with
# events == (ScoringEvent("Lightning Spark", (5,), 50),)
Cycle 4 — Repeated Singles → First Real Refactor
Why this matters
Cycle 4 is the first design-breaking test of the tutorial — neither dice == [1] nor dice == [5] matches [1, 1], so the cheapest patch (a third hardcoded branch) is globally expensive even when it’s locally small. This is also where the safety-net argument becomes load-bearing: the previous three green tests are what allow you to replace the hardcoded branches with a loop without fear. The mutation move at the end of the cycle proves those tests actually catch the regressions you think they do.
🎯 You will learn to
- Apply the first real refactor under safety — replacing hardcoded branches with a loop while three green tests guard the change
- Evaluate competing GREEN options (third hardcoded branch vs. loop) by predicting which is cheaper across the next two cycles
- Apply the mutation move (mutate a line, watch a test fail, revert) to verify the safety net actually catches regressions
Spec: score([1, 1]) returns total damage 200 with two Dragon Flame events.
The first design-breaking test. Neither dice == [1] nor dice == [5] matches [1, 1] — the duplication you noted in cycle 3 just demanded payment.
Your task
- 🔴 RED — add test_two_ones_create_two_dragon_flames asserting damage 200 and two Flame events. Run; predict the failure.
- 🟢 GREEN — you have two options:
  - Option A: a third hardcoded branch for dice == [1, 1].
  - Option B: replace the hardcoded branches with a for loop over each die.

  Pick one. Before you implement, predict: which option will be cheaper over the next two or three cycles? Don’t peek ahead — predict from what you know now. Implement your choice, run, and revisit your prediction.
- 🔵 REFACTOR + mutation check — re-run; the previous three tests are your safety net. Then prove they actually catch regressions with the ten-second mutation move: temporarily change a line in scorer.py (e.g., if die == 1: → if die == 99:), rerun pytest, watch a test fail, then revert. A test that doesn’t fail when the production code breaks is a Liar test — it’s not pinning down the behavior you think it is.
- 🟢 Bonus check — add test_one_and_five_create_two_different_events (mixed dice [1, 5] → one Flame + one Spark). Predict whether you’ll need to change scorer.py. The new loop should handle this for free — but the test makes that promise explicit.
🪞 Pause (after green): in one sentence, what did the passing test results just tell you that you’d otherwise have had to verify by hand? Hold your answer; the cycle quiz returns to it.
Reveal — what happens when option A wins (open after running)
A third hardcoded branch passes cycle 4. But the bonus mixed-dice case ([1, 5]) needs a fourth branch — and cycle 5 (triple 1s) cannot be satisfied by any hardcoded branch because the structure has to change. The loop refactor still has to happen, only now you have more code to delete first. Locally smallest (one new if) is globally largest.
Why the mutation move matters
A passing test means one of two things: (a) the code is correct, or (b) the test is vacuous and would pass against any code. The Liar test smell (Codurance taxonomy) is silent — pytest reports green either way. The 10-second mutation move — break the production code, watch the test fail, revert — is the cheap, durable defense. Use it whenever a test passes for a reason you didn’t fully expect (especially the bonus mixed-dice test, which passes “for free” thanks to the loop).
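The mutation move fits in a few lines. Here is a self-contained toy sketch — add_damage and its test are hypothetical stand-ins for illustration, not part of scorer.py:

```python
# Toy illustration of the 10-second mutation move.
def add_damage(a: int, b: int) -> int:
    # Mutation step: temporarily change `a + b` to `a - b`,
    # rerun, watch the test fail, then revert.
    return a + b

def test_add_damage():
    assert add_damage(100, 50) == 150

test_add_damage()  # passes against the correct line; fails against the mutant
```

If the test stays green while the mutant is in place, the test is a Liar — it never exercised the line you broke.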
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py and test_scorer.py) and write a short message — recommended: Cycle 4: per-die loop + mixed-dice guardrail.
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
def score(dice: list[int]) -> BattleReport:
if dice == [1]:
return BattleReport((
ScoringEvent("Dragon Flame", (1,), 100),
))
if dice == [5]:
return BattleReport((
ScoringEvent("Lightning Spark", (5,), 50),
))
return BattleReport()
"""Cycles 1–4 — repeated singles force the first refactor."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_two_ones_create_two_dragon_flames
# score([1, 1]) should return total_damage == 200 and two
# Dragon Flame events in events
#
# TODO (Bonus, after the loop refactor): add
# test_one_and_five_create_two_different_events
# score([1, 5]) should return total_damage == 150 with
# one Dragon Flame followed by one Lightning Spark
Step 6 — Knowledge Check
Min. score: 80%
1. You noticed obvious duplication after cycle 3 but did not refactor it. Cycle 4’s test then forced the refactor. Why is this sequence (notice → wait → refactor when test demands) better than (notice → refactor immediately)?
2. You replaced the hardcoded branches with a for loop and ran pytest. All four tests passed. What did the test suite just do for you that you would otherwise have had to do manually?
Tests enable change. The four passing test results are not just “tests passed” —
they are “the empty case still works, single 1 still works, single 5 still
works, and [1, 1] works for the first time.” Without the suite, you would
have to convince yourself of each line by reading.
3. A teammate looks at your cycle-4 GREEN code and says: “Why didn’t you just add a third if dice == [1, 1]: branch? It would have passed the test.” What is the most accurate rebuttal?
The TDD GREEN rule is “smallest code that passes the failing test” given the current direction of the design. A third hardcoded branch is locally minimal but globally wasteful — you’ll throw it away in cycle 5 anyway. The loop is the smallest durable change.
4. Why is “RED for the right reason” still the right framing in cycle 4, even though we are now adding a test on top of an already-passing suite?
“RED for the right reason” applies to every cycle. In cycle 1 the right reason is
usually ImportError. In later cycles it is typically an assertion failure
showing the expected tuple of events versus what the current code actually
produced. Either way, the failure has to come from the behavior under test, not
from a typo.
Cycle 5 — Triple 1s → Dragon Blast (Design Moment)
Why this matters
The cycle-4 per-die loop walks each die in isolation — it has no way to know that the other two 1s exist when it processes the first. The triple-1 test cannot be satisfied by editing a branch or tweaking the loop body; the structure has to change from “iterate dice in order” to “count faces, then decide what to emit.” This is the threshold concept: tests force structural change, not just lines of code. And the previous five tests survive a full body rewrite of score — because they assert on observable behavior, not on internals.
🎯 You will learn to
- Analyze why a per-die loop is structurally incapable of satisfying a triple-combo test — and why this earns a Counter-based count-then-emit shape
- Evaluate the Refactoring Litmus Test: which property of the previous tests allowed them to survive a full rewrite of score?
- Apply the same mutation move from cycle 4 to a leftover-bookkeeping line, confirming the new structure’s invariants are pinned down by tests
Spec: three 1s in a roll combine into one Dragon Blast (1000 damage) instead of three Dragon Flames.
The pivot moment. Cycle 4’s per-die loop walks each die independently — it has no way to know that the other two 1s exist when it processes the first. This test cannot be satisfied by tweaking a branch; the structure has to change. A design-breaking test.
Your task
- 🔴 RED — add test_three_ones_create_dragon_blast_instead_of_three_flames. Predict what kind of failure pytest will show (ImportError, AttributeError, AssertionError — and what the message will likely contain). Run.
- 🟢 GREEN — before reaching for code: open the per-die loop in score(). Spend 90 seconds writing a one-sentence answer to: what about the loop’s structure makes this test impossible to satisfy with a local edit? Then make the structural change.
- 🔵 REFACTOR — re-run; the previous five tests survived a full body rewrite of score.
- 🟢 Bonus guardrail — your GREEN code subtracts the consumed dice (counts[1] -= 3) so leftovers still score as singles. That behavior is currently implicit — no test would catch a future refactor that forgets it. Add test_dragon_blast_plus_leftover_flame_and_spark (score([1, 1, 1, 1, 5]) → one Blast, one leftover Flame, one Spark = 1150 damage). It should pass for free; verify with the cycle-4 mutation move (mutate counts[1] -= 3 to counts[1] -= 4, watch the new test fail, revert).
Reveal — one shape that handles per-face-count thinking (open after you've tried)
Stop iterating dice in order. Count how many of each face appeared (collections.Counter), then decide what to emit. Combos consume dice; leftovers still score as singles.
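The counting half of that shape is standard-library Counter arithmetic. A minimal sketch (using the bonus-guardrail roll as data; this is the mechanism only, not the full score() rewrite):

```python
from collections import Counter

# Count how many of each face appeared — the per-face-count view.
counts = Counter([1, 1, 1, 1, 5])
assert counts[1] == 4 and counts[5] == 1

# Combos consume dice: after emitting one Dragon Blast, three 1s are gone...
counts[1] -= 3
assert counts[1] == 1  # ...and the leftover 1 still scores as a single
```

Once the roll is a Counter, “three matching dice” becomes a per-face threshold check instead of something a per-die loop has to remember across iterations.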
🪞 Pause (after green): you just replaced the entire body of score — yet all five previous tests still pass. Spend 30 seconds writing down: why? What property of the previous tests allowed them to survive a full body rewrite of score?
Compare your answer — the property that survived
They assert on observable behavior (total_damage == 100, events == (event,)), not on internals (which loop, which variable name). Behavior tests survive structural rewrites; implementation-tests don’t. This is the Refactoring Litmus Test — and it’s the rule that travels: write tests against contracts, not against shapes.
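The contrast can be made concrete with a hypothetical double function (a stand-in for illustration, not part of scorer.py): an assertion pinned to internals breaks under a behavior-preserving rewrite, while a contract assertion survives.

```python
def double(x: int) -> int:
    total = x * 2  # internal detail: the local variable name
    return total

# Implementation-test (fragile): pinned to an internal name, not the contract.
# Renaming `total` to `result` breaks this even though behavior is unchanged.
assert "total" in double.__code__.co_varnames

# Behavior test (robust): survives any rewrite that preserves the contract.
assert double(21) == 42
```

The first assertion would fail the Refactoring Litmus Test the moment anyone touches the body; the second keeps guarding the behavior through every rewrite.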
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py and test_scorer.py) and write a short message — recommended: Cycle 5: triple 1s = Dragon Blast (Counter).
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
def score(dice: list[int]) -> BattleReport:
events = []
for die in dice:
if die == 1:
events.append(ScoringEvent("Dragon Flame", (1,), 100))
if die == 5:
events.append(ScoringEvent("Lightning Spark", (5,), 50))
return BattleReport(tuple(events))
"""Cycles 1–5 — triple 1s break the per-die loop."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_three_ones_create_dragon_blast_instead_of_three_flames
# score([1, 1, 1]) should return total_damage == 1000 with
# a single Dragon Blast event whose dice_used is (1, 1, 1)
Step 7 — Knowledge Check
Min. score: 80%
1. What makes cycle 5’s test a design-breaking test, as opposed to a normal “add a branch” test?
A design-breaking test is one no local edit can satisfy. The cycle-4 loop is fundamentally per-die; combos are fundamentally per-face-count. You cannot patch your way from one to the other — you have to re-shape the algorithm. That re-shaping is what the test extracts as design pressure.
2. After the count-then-emit refactor, all five previous tests still passed. What does that tell you about the previous tests?
This is the Refactoring Litmus Test concept. Robust tests assert on what the
unit does (report.total_damage == 100, report.events == (event,)), not
on how it does it (assert "for die in dice" in inspect.getsource(score)).
The first survives a rewrite; the second breaks at the slightest refactor.
3. Why does cycle 5’s GREEN code subtract from counts[1] after emitting the Dragon Blast event?
“Combos consume dice, leftovers still score as singles” is one of the rules. The subtraction isn’t a Python quirk — it is the domain rule made explicit in the code. The bonus guardrail test you write at the end of this cycle pins down exactly this leftover behavior.
4. A teammate suggests “let’s just add if Counter(dice)[1] >= 3: ... at the top of the existing per-die loop.” Why is that not the same as the count-then-emit refactor we did?
The refactor is not “add Counter on top of the existing loop.” It is “replace the per-die view with the per-face-count view.” Mixing both is the worst of both worlds: now the function maintains two parallel models of the input, and every future change has to update both. The structural shift was the point.
Cycle 6 — Goblin Swarm → Discover Rule Objects (Big Refactor)
Why this matters
One combo branch in score is fine. Adding a second one — next to the first — makes the duplication ugly enough that “add another if” feels obviously wrong. That ugliness is the design pressure; what it earns is the rule object abstraction (ComboRule + SingleRule with apply()). The Open-Closed Principle stops being a slogan: new behavior is now new data, not new branches. This is the cycle where students stop pattern-matching TDD and start listening to the test.
🎯 You will learn to
- Apply listening to the test — recognize that a duplicate combo branch is the test telling you the structure is wrong
- Create a rule-object abstraction (ComboRule + SingleRule with a uniform apply() interface) under the safety net of seven green tests
- Evaluate the resulting design against the Open-Closed Principle — new behavior added as data, not as branches
Spec: three 2s combine into a Goblin Swarm (200 damage).
Cycle 6 is structurally the most important cycle in this tutorial. The current code handles one combo (Dragon Blast). Cycle 6 will give you a second — and the design pressure of having two will teach you the right abstraction.
The cycle has three phases. Do them in order.
Phase 1 — 🔴 RED
Add test_three_twos_create_goblin_swarm. Mirror the shape of test_three_ones_create_dragon_blast_instead_of_three_flames — only the dice value, the event name, and the damage change. Run.
Phase 2 — 🟢 GREEN (deliberately ugly)
What is the smallest change that turns this test green? Pick it. Type it out. Don’t refactor yet. Run.
Reveal — one shape (open after you've made it green)
A second if counts[2] >= 3: block right next to the first, with the right name, dice, and damage. Yes, the duplication is now visible. That’s the whole point.
Phase 3 — 🔵 REFACTOR (the discovery)
Look at the two combo blocks side by side:
if counts[1] >= 3:
events.append(ScoringEvent("Dragon Blast", (1, 1, 1), 1000))
counts[1] -= 3
if counts[2] >= 3:
events.append(ScoringEvent("Goblin Swarm", (2, 2, 2), 200))
counts[2] -= 3
A — Identify what varies
Write down: what is the same and what is different between the two blocks? (Mental notes are fine.)
Compare your answer
Same: the shape — if counts[X] >= N: emit one event with X repeated N times; counts[X] -= N.
Different: four things — the die value, the count threshold, the event name, the damage.
If your answer captured those four things (your names may differ), it’s right. If you have more than four, look for which two collapse into one. If you have fewer, look for which one is hiding two.
B — Name the entity
Two examples are the minimum needed to see a pattern. The four things that vary are fields of an entity that doesn’t yet have a name. What would you call it? (One that holds: a die value, a count, a name, a damage.) Pick a name; we’ll use ComboRule below.
C — Sketch the entity
A ComboRule carries the four fields and does the work the if-block currently does. The behavior: detect the combo, emit one event, decrement the counts. Move that into a method on the entity. What should the method’s signature be? (Hint: it has to read and mutate the Counter, and return the events it produced — possibly an empty list.)
Write the class header before reading on. Pick a method name that describes what it does to the counts.
Compare your answer — one shape that works
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
if counts[self.die] >= self.count:
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] -= self.count
return events
The method is called apply because it applies the rule to a counter and returns whichever events that produces. Returning a list (possibly empty) generalizes cleanly: cycle 7 will need a single apply() call to emit zero or more events from one input.
D — Replace the blocks with data
Declare the two combo rules as data outside score(). Replace the two if-blocks inside score() with a single iteration over the tuple. The combos are now configuration, not code. Run pytest.
Compare your answer — what `score()` looks like after
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
)
def score(dice: list[int]) -> BattleReport:
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
# ... singles loops still here for now ...
return BattleReport(tuple(events))
All eight tests still pass — the refactor preserved every observable behavior. That’s the Refactoring Litmus Test: behavior-level tests survive structural rewrites.
E — Apply the same recognition to singles
Look at the two for-loops at the bottom of score() (Dragon Flame, Lightning Spark). Same kind of duplication, one field shorter. Apply the same recognition you just did on combos — extract a SingleRule with its own apply(counts) method, declare a SINGLE_RULES tuple, and replace both loops with one iteration. Run pytest. If it goes green, you’ve parallel-transferred the pattern in one shot. If not, debug — that’s the only feedback you need.
F — Cash in the OCP win: add the four remaining combos as data
🪞 Predict first: how many lines inside score() will you change to add four new triple combos (Triple 3 → Orc Charge 300, Triple 4 → Troll Smash 400, Triple 5 → Lightning Storm 500, Triple 6 → Demon Strike 600)? Hold the number.
Now do it: append four rows to COMBO_RULES. Then add one parametrized test (@pytest.mark.parametrize) covering all four. Run.
Why parametrize beats a for-loop inside one test
@pytest.mark.parametrize runs the function once per row, reporting each row as a separate test result. A for loop inside a single test stops at the first failure, hiding everything after it. The parametrize idiom is the right Python answer to “N tests of the same shape” — DRY tests that still report separate failures.
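A minimal parametrize sketch — face_damage here is a hypothetical helper invented for illustration, not part of scorer.py:

```python
import pytest

# Hypothetical helper: singles damage per face (0 for non-scoring faces).
def face_damage(face: int) -> int:
    return {1: 100, 5: 50}.get(face, 0)

# One decorated function, four reported test results. A failure in one
# row does not hide the rows after it.
@pytest.mark.parametrize(
    "face, expected",
    [(1, 100), (5, 50), (2, 0), (6, 0)],
)
def test_face_damage(face, expected):
    assert face_damage(face) == expected
```

pytest reports each row with its parameters in the test id (e.g. test_face_damage[1-100]), so a red row tells you exactly which input broke.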
Why this matters (read after green)
What you just did has a name: listening to the test. The pain of imagining six more hardcoded combo branches was a design signal — the structure no longer fit the problem. The cure was structural extraction: pull the varying parts into data, leave the constant shape as code.
You also just applied the Open-Closed Principle: score() is now closed for modification but open for extension. Phase F made the payoff concrete — four new combos cost zero edits to score(). New behavior arrives as data, not as new branches. score() will not change again for the rest of the tutorial.
And you discovered the right abstraction at the right moment — two examples. One would have been a guess; six would have been six branches you’d have to delete. Refactor toward duplication, not before it, and not after it has rotted (Rule of Two).
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py and test_scorer.py) and write a short message — recommended: Cycle 6: rule objects + all six combos as data.
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
def score(dice: list[int]) -> BattleReport:
counts = Counter(dice)
events = []
if counts[1] >= 3:
events.append(ScoringEvent("Dragon Blast", (1, 1, 1), 1000))
counts[1] -= 3
for _ in range(counts[1]):
events.append(ScoringEvent("Dragon Flame", (1,), 100))
for _ in range(counts[5]):
events.append(ScoringEvent("Lightning Spark", (5,), 50))
return BattleReport(tuple(events))
"""Cycles 1–6 — Goblin Swarm forces the rule-object refactor."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_three_twos_create_goblin_swarm
# score([2, 2, 2]) should return total_damage == 200 with
# a Goblin Swarm event whose dice_used is (2, 2, 2)
#
# TODO (Phase F, after the rule-object refactor): write a parametrized
# test_other_triples_create_combo_events using @pytest.mark.parametrize
# that covers the four remaining triples — Orc Charge (300), Troll Smash
# (400), Lightning Storm (500), Demon Strike (600). And append the
# matching ComboRule rows to COMBO_RULES in scorer.py.
Step 8 — Knowledge Check
Min. score: 80%1. What does “listening to the test” mean as a refactoring heuristic?
“Listen to the test” is a foundational TDD heuristic: when adding a test or a
branch hurts disproportionately, the structure is telling you something. In
cycle 6, the pain of imagining six more if counts[X] >= 3: branches was the
signal. The fix was a structural shift to data, not more code.
2. Why is putting rules in a COMBO_RULES tuple of ComboRule instances better than keeping them as if counts[X] >= 3: branches inside score()?
The Open-Closed Principle (OCP) says modules should be open for extension and
closed for modification. The rule-object refactor is the OCP in five lines:
score() no longer changes when you add a new triple combo — only the data
does. Phase F of this cycle demonstrated the payoff by adding four combos at once.
3. A teammate says: “You should have done the rule-object refactor in cycle 5, when you first introduced combos. Why wait?” What is the most defensible answer?
Two data points are the minimum for seeing the right shape of an abstraction.
With only Dragon Blast (cycle 5), you’d be guessing whether the variation is
“the die that triggers” or “the count required” or “the name.” Cycle 6’s
Goblin Swarm provides the second data point, and only then is ComboRule(die,
count, name, damage) a discovery rather than a guess. Refactor toward
duplication, not before it.
4. After the rule-object refactor, all seven previous tests still pass. Why is this expected, and what does it confirm about the previous tests?
The seven previous tests asserted on report.total_damage, report.events,
and ScoringEvent instances. None of them asserted on internal structure.
That is precisely why they survived a refactor that replaced the entire
implementation strategy. The Refactoring Litmus Test: behavior tests survive
structural change; implementation tests don’t.
Cycle 7 — Six 1s → Two Dragon Blasts (Hidden Bug)
Why this matters
Every previous combo test used exactly three of a face — so every previous combo test passed and hid a bug. Six 1s should be two Blasts (2000 damage); your current ComboRule.apply emits one Blast plus three Flames (1300). Line coverage said the if ran, but a line being executed is not the same as a line being right for all relevant inputs. This is the gap between coverage and boundary-value analysis — and it’s the cycle where you experience first-hand the kind of bug TDD literature reports: a defect the developer doesn’t know exists in code they wrote themselves.
🎯 You will learn to
- Apply boundary-value analysis to predict where existing tests under-pin a behavior (exactly N covered; 2N and beyond not)
- Analyze the gap between line coverage and behavioral correctness — coverage locates under-tested code; it does not measure correctness
- Create a fix to ComboRule.apply using // and %= so it emits zero-or-more combos per call with correct leftover bookkeeping
Spec: score([1, 1, 1, 1, 1, 1]) produces two Dragon Blasts (2000 damage).
🪞 Predict first (don’t open the reveal yet). Look at your ComboRule.apply and trace through six 1s by hand. Write down the damage your current code produces. The whole pedagogical value of this step depends on the order: predict before peeking.
Reveal — what the current code actually does (open AFTER tracing)
The if counts[1] >= 3: runs once. It emits one Blast and counts[1] -= 3 leaves counts[1] == 3. Those three 1s fall through to SingleRule, emitting three Flames. Total: 1000 + 300 = 1300 damage.
But six 1s should be two Blasts → 2000 damage. The code is wrong — and no previous test caught it, because every prior combo test used exactly three of a face.
This is the kind of bug TDD literature reports: a defect the developer doesn’t know exists in code they wrote themselves. The test surfaces it.
Your task
- 🔴 RED — add test_six_ones_create_two_dragon_blasts asserting 2000 damage and two Blast events. Run; see the wrong events tuple.
- 🟢 GREEN — fix ComboRule.apply so it can emit zero or more combos per call, with the correct leftover bookkeeping. Before you code, write the formula on paper for: how many full combos do n dice of one face produce? How many leftover dice?
- 🔵 REFACTOR — re-run. Especially gratifying: cycle 5’s leftover guardrail still passes — the fix only changed behavior on cases no prior test pinned down.
Why this matters: coverage vs. boundary thinking
Every previous combo test used exactly count dice (three 1s, three 2s, etc.). The bug only manifests at 2 × count and beyond. Line coverage told you the if ran. It didn’t tell you the line was right for all relevant inputs.
That’s the gap between coverage and boundary-value analysis: every behavior has boundaries (0, exactly N, 2N, between N and 2N) and a healthy suite probes each. Coverage is a locator of under-tested code; it isn’t a measure of correctness. The rule that travels: if a behavior isn’t on the test list, code for it isn’t earned.
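The leftover formula is two integer operations. A sketch of the arithmetic only (one way to express it, not the tutorial’s final ComboRule.apply):

```python
# How many full combos do n dice of one face produce, and how many are left over?
# divmod(n, combo_size) answers both at once: n // combo_size and n % combo_size.
combo_size = 3

assert divmod(3, combo_size) == (1, 0)  # exactly N: one combo, no leftovers
assert divmod(5, combo_size) == (1, 2)  # between N and 2N: one combo, two leftovers
assert divmod(6, combo_size) == (2, 0)  # 2N — the boundary no prior test probed
```

Those three inputs are exactly the boundary partitions the text names: at the boundary, between boundaries, and at the first multiple the old single-if code got wrong.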
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py and test_scorer.py) and write a short message — recommended: Cycle 7: fix multi-combo bug (six 1s = two Blasts).
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
if counts[self.die] >= self.count:
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] -= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
ComboRule(3, 3, "Orc Charge", 300),
ComboRule(4, 3, "Troll Smash", 400),
ComboRule(5, 3, "Lightning Storm", 500),
ComboRule(6, 3, "Demon Strike", 600),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def score(dice: list[int]) -> BattleReport:
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""Cycles 1–7 — surfacing the multi-combo edge case."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
@pytest.mark.parametrize(
"roll, expected_event",
[
([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
],
)
def test_other_triples_create_combo_events(roll, expected_event):
report = score(roll)
assert report.total_damage == expected_event.damage
assert report.events == (expected_event,)
# TODO (RED): write test_six_ones_create_two_dragon_blasts
# score([1, 1, 1, 1, 1, 1]) should return total_damage == 2000
# and events containing TWO Dragon Blast ScoringEvents
Step 9 — Knowledge Check
Min. score: 80%
1. The “one Dragon Blast for any number of 1s ≥ 3” bug was sitting in your code from cycle 5 onward — three cycles. Why did no previous test catch it?
Boundary thinking. Every previous combo test gave the rule exactly count
dice. The bug was on the boundary where counts[X] >= 2 * count. Without a test
on that boundary, the code would have shipped wrong — and the developer would
have been completely confident in it. This is exactly the kind of failure mode
empirical TDD studies report.
2. What is the most defensible general lesson from cycle 7 about test-suite design?
Coverage told us we executed the line. It did not tell us whether the line was right for all relevant inputs. Boundary-value analysis (the partition tradition you encountered in Foundations) complements coverage: every behavior has a “just-at-the-boundary,” “just-past-the-boundary,” and “well-past-the-boundary” input — and a healthy suite probes each.
3. Cycle 5’s bonus leftover guardrail ([1, 1, 1, 1, 5]) still passes after cycle 7’s refactor. Why? Be precise.
This is a subtle but important point: a refactor does not have to change every
output. The two implementations of apply produce the same answer for any
input where the old answer was correct — and only the new answer is correct
where the old was wrong. That is what makes the cycle-7 fix a generalisation
rather than a replacement.
4. A teammate looks at cycle 7’s fix and says: “This bug only existed because we extracted ComboRule. If we’d kept the inline if counts[1] >= 3: from cycle 5, we wouldn’t have had this issue.” Are they right?
The bug is the algorithm’s bug, not the abstraction’s. Whether the if/-=
sits in score() or in ComboRule.apply makes no difference to its behavior.
The lesson is honest: TDD does not prevent every bug; it surfaces them when
tests probe the right inputs. The rule-object refactor was about future-
proofing the design, not correctness of cycle 5’s logic.
Transfer Cycle — TDD on FizzBuzz (Different Domain, Same Rhythm)
Why this matters
Seven Dragon-Dice cycles risk teaching you “TDD works because dice are compositional.” The transfer cycle disproves that — same rhythm, totally different domain (FizzBuzz), and no scaffolding from us: you write your own test list (Canon TDD step 1), you order the items, you drive the cycles. Janzen & Saiedian’s “residual effect” predicts that this is where the rhythm finally feels natural — earned, not preached. The compression of seven Dragon-Dice cycles into ~four FizzBuzz mini-cycles is itself the test of mastery.
🎯 You will learn to
- Create your own Canon TDD test list for an unfamiliar problem, ordered from simplest to most design-breaking
- Apply RED-GREEN-REFACTOR to FizzBuzz with no instructor scaffolding — driving the cycles yourself in compressed form
- Evaluate the structural parallels between each FizzBuzz move and the Dragon-Dice cycle it mirrors (Variation Theory generalization)
You’ve completed seven cycles of TDD on dice scoring. The risk: “TDD only works because Dragon Dice is a naturally compositional domain.” The way to disprove that risk is to apply the same rhythm to a completely unrelated problem — right now, in compressed form.
The classic spec — FizzBuzz
fizzbuzz(n) returns a list of strings of length n. For each integer i from 1 to n:
- If i is a multiple of 15 → "FizzBuzz"
- else if i is a multiple of 3 → "Fizz"
- else if i is a multiple of 5 → "Buzz"
- else → str(i)
So fizzbuzz(5) == ["1", "2", "Fizz", "4", "Buzz"].
Canon TDD step 1 — write your own test list
Before reading any further, take 60 seconds and write your own test list. What behaviors does the spec define? Order them from simplest to most design-breaking — the ordering cycles 1–7 of Dragon Dice followed implicitly. You did it implicitly throughout the tutorial; now you do it explicitly.
📋 One possible test list (open ONLY after you've written your own — 60 seconds first)
A natural ordering, simplest first:
- ☐ fizzbuzz(0) == []
- ☐ fizzbuzz(1) == ["1"]
- ☐ fizzbuzz(2) == ["1", "2"]
- ☐ fizzbuzz(3)[-1] == "Fizz"
- ☐ fizzbuzz(5)[-1] == "Buzz"
- ☐ fizzbuzz(15)[-1] == "FizzBuzz"
- ☐ fizzbuzz(-1) raises ValueError
Compare with your list. Did you have the same items? In the same order? (The reflection at the bottom of this step asks you to map each item to its Dragon-Dice parallel — don’t peek there yet either.)
Your task — drive the cycles yourself (~10–15 minutes)
Pick the simplest unimplemented item from your list. Convert only that one item into a runnable test in test_fizzbuzz.py. Make it pass with the simplest code (TPP — start with constants, not loops; the tests will force the loop when ready). Refactor on green. Pick the next item. Repeat.
Don’t try to handle all rules at once. One test at a time. (Beck’s Canon TDD is explicit on this — converting all list items to tests up-front “leads to rework and depression.”)
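If you want to see the shape of the very first mini-cycle before driving your own, here is one sketch, assuming you start with the fizzbuzz(0) == [] item (both "files" are shown in one snippet for readability):

```python
# --- test_fizzbuzz.py, written FIRST: this is the RED step ---
def test_zero_returns_empty_list():
    assert fizzbuzz(0) == []

# Running pytest now fails: fizzbuzz does not exist yet. That is the
# right-reason RED. Then, and only then, write the smallest GREEN:

# --- fizzbuzz.py (TPP: a constant, not a loop) ---
def fizzbuzz(n):
    return []

test_zero_returns_empty_list()  # now green
```

The next item on your list, fizzbuzz(1) == ["1"], is what forces the constant to grow; resist writing the loop before a test demands it.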
What you’re doing here
You are applying what you learned. There is no instructor-provided RED test, no GREEN scaffold, no REFACTOR checklist. The cycle discipline is now yours. If the rhythm feels familiar — that’s the threshold concept doing its work.
🛟 Stuck? (Open only after at least 5 minutes of trying)
The hard test is multiple of 15. Before reading further, ask yourself: which earlier Dragon-Dice cycle had a test that the previous structure couldn’t satisfy with a local edit? What was the move there?
Hint without the answer: trace by hand what your current code returns for i=15. Why? Then ask what you’d change.
If you've named the structural pressure yourself — open for two known options
- Order matters: check i % 15 == 0 first, then % 3, then % 5. Simplest TPP move.
- String concatenation: build up the result — start empty; if divisible by 3, append "Fizz"; if by 5, append "Buzz"; if still empty, use str(i).
Either passes the test list. Pick one; if a future requirement makes the other fit better, you’ll refactor toward it.
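Both options, sketched minimally for a single value i (label_ordered and label_concat are hypothetical helper names; your fizzbuzz would call one of them in its loop):

```python
def label_ordered(i: int) -> str:
    # Option 1: order matters, the most specific check comes first.
    if i % 15 == 0:
        return "FizzBuzz"
    if i % 3 == 0:
        return "Fizz"
    if i % 5 == 0:
        return "Buzz"
    return str(i)

def label_concat(i: int) -> str:
    # Option 2: build the string up; "FizzBuzz" falls out of 3 and 5 composing.
    out = ""
    if i % 3 == 0:
        out += "Fizz"
    if i % 5 == 0:
        out += "Buzz"
    return out or str(i)

# Both satisfy the test list, including the multiple-of-15 case.
assert label_ordered(15) == label_concat(15) == "FizzBuzz"
assert label_ordered(7) == label_concat(7) == "7"
```

Note how option 2 makes the 15 case disappear as a special case entirely; that is the "compositional" smell worth remembering for future specs.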
Reflection (after green — this is the heart of the step)
Compare the FizzBuzz cycles you just did with the Dragon Dice arc. Write your answers before opening the reveal.
- Which Dragon-Dice cycle’s RED moment does FizzBuzz’s “multiple of 3” test echo?
- Which Dragon-Dice cycle does FizzBuzz’s multiple of 15 test parallel?
- What was different?
- What was the same? (Try to name 3–4 invariants of the rhythm.)
Compare your invariants — order doesn't matter, but check each is in your version somewhere
Items that should appear in your “same” list:
- The rhythm itself (RED → GREEN → REFACTOR, one test at a time)
- Test-list discipline (Canon TDD step 1 — a list before any tests)
- RED-as-success (the failing test is the deliverable, not a problem)
- Refactor-toward-duplication (Rule of Two; wait for the second example)
- TPP — smallest transformation that passes the failing test
- Allow-the-ugly-first-GREEN (don’t pre-design the abstraction)
If your answer captured the rhythm and the discipline, you have the threshold concept. TDD is a rhythm, not a problem-specific technique — you just demonstrated it on a problem with no shared code with Dragon Dice.
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (fizzbuzz.py and test_fizzbuzz.py) and write a short message — recommended: Transfer cycle: FizzBuzz via TDD.
# Empty by design — TDD says: write a failing test first.
# Build this file up, one test cycle at a time.
"""Your TDD cycles for FizzBuzz.
Pick the simplest test case from your test list. Write it.
Watch it fail. Make it pass with the simplest code (TPP).
Refactor on green. Pick the next one. Repeat.
Suggested first test: fizzbuzz(0) == []
"""
import pytest
from fizzbuzz import fizzbuzz
# Write your tests below — one per behavior in your list.
Step 10 — Knowledge Check
Min. score: 80%
1. FizzBuzz’s multiple of 15 test most directly parallels which Dragon-Dice cycle — and why?
Cycle 5 is the right parallel. The signature of a design-breaking test: previous cycles produced code that looks correct, but the new test cannot be satisfied by a local edit — the structure has to change. For Dragon Dice that meant moving from per-die iteration to count-then-emit. For FizzBuzz that meant either reordering the conditionals (so % 15 is checked first) or switching to a compositional string builder. Either way, the existing structure had to accept a behavior the previous tests didn’t anticipate. That feeling — “I can’t tweak this; I have to restructure” — is a TDD rhythm invariant, not a Dragon-Dice quirk.
2. A teammate, who has never done TDD, pushes back: “FizzBuzz is just trivia — the real work is the algorithm. Why did you spend 12 minutes on the cycles instead of 30 seconds typing the obvious solution?” Pick the strongest reply.
The honest answer combines two things: (a) for small problems with experienced devs, the time cost of TDD is roughly the same as ad-hoc coding — the same number of tests get written either way, just in a different order; (b) the durable benefit is the test suite as a regression net plus the rhythm of small steps preventing speculative complexity. On a toy, the gain is small. On long-lived code modified by multiple people it’s the difference between a 39–91% drop in pre-release defects (Microsoft/IBM, Williams et al. 2008) and not. TDD isn’t a religion — it’s a tradeoff with strong empirical evidence in specific contexts (long-lived, multi-author, behavior-rich domains). FizzBuzz lets you practice the rhythm cheaply, where the stakes are low.
3. Compare doing FizzBuzz now to what doing it before the Dragon Dice tutorial would have been like. What’s the most likely difference?
The Janzen & Saiedian study (230+ programmers): the residual effect of having actually practiced TDD is preference for it afterward — not because it was preached, but because the rhythm came to feel natural. A FizzBuzz attempt today should feel categorically different from a same-instructions attempt before this tutorial: not just because you know what to type, but because the rhythm itself — pause for the test, allow the ugly first GREEN, wait for two examples — now has a felt naturalness. That felt naturalness is the threshold concept stuck in your hands, not just your head. The Dragon-Dice-specific tools (@dataclass, Counter) didn’t transfer; they were never the lesson. The rhythm transferred — that’s the whole lesson.
The Big Picture — Seven Cycles and a Transfer
Why this matters
The cycles taught the rhythm one beat at a time; this step asks whether you can hear the whole song. You’ll synthesize the journey from memory before any reveals, recalibrate your own confidence in writing, and probe whether the discipline transfers to a real piece of code from your own work — not “I’d write more tests” but a specific bug it would have caught. The final quiz is mixed retrieval across all seven cycles, the way Bjork’s spacing principle predicts will make the rhythm last.
🎯 You will learn to
- Analyze the seven-cycle journey by recalling, from memory, three design moves and the test that forced each one
- Evaluate your own confidence to apply Red-Green-Refactor unaided on a problem you haven’t seen before
- Apply the rhythm to one specific piece of your own code — naming what TDD would have prevented in concrete terms
Seven Dragon-Dice cycles. Then an eighth on a totally different problem — FizzBuzz — driven by you with no scaffolding. Every line in your final scorer.py is justified by a test; every line in your fizzbuzz.py is too; and the rhythm that produced both is the same rhythm.
🪞 Synthesise yourself (≈5 min, before opening any reveals or taking the quiz)
The recap material — takeaways, journey table, anti-patterns, empirical case — is collapsed below. You only get one shot at synthesising while it’s still fresh. Do this part with your editor scrolled away from scorer.py.
(1) Recall three design moves, from memory. Name three cycles and, for each, the design move the test forced. Don’t say “the loop refactor” — say which test broke the previous structure and why.
(2) Pick the cycle that surprised you. Which cycle’s RED moment changed how you thought about a structural choice? Why? (One sentence.)
(3) Confidence recalibration — write a number on a sticky note (or in chat). On a 1–5 scale: “I could apply Red-Green-Refactor to a problem I haven’t seen before, this week, without this tutorial open.” Pick a number; anchor it in writing. Re-rating from a half-remembered number isn’t recalibration. We’ll re-check after the quiz.
(4) Transfer probe. Name one specific piece of code or project of yours — a class assignment, a side project, a past bug — where the rhythm you just learned would have helped, and what specifically it would have prevented. (“It would have caught X” is concrete; “I’d write more tests” is not.)
Then take the quiz below — before opening any of the reveals.
The reveals after the quiz are for comparison, not for study. Treat them like an answer key: open them after committing to your own answers.
Reveal — fill-in-the-blank journey table (open after recall #1)
Cover the right column and predict the lesson for each cycle from memory. Then read across.
| Cycle | Behavior | Design move | Lesson |
|---|---|---|---|
| 1 | Empty roll | First class + function | RED for the right reason |
| 2 | Single 1 | ScoringEvent introduced | Allow the ugly first GREEN |
| 3 | Single 5 | Second hardcoded branch | Refactor toward duplication |
| 4 | Repeated singles | Per-die loop | First real refactor; tests enable change |
| 5 | Mixed dice | (no production change) | Free pass — verify with the mutation move |
| 6 | Triple 1s | Counter, count-then-emit | Design-breaking test; structural shift |
| 7 | Combo + leftovers | (no production change) | Guardrail tests for implicit correctness |
| 8 | Triple 2s | Rule objects | Listening to the test; Open-Closed |
| 9 | Other triples | Append data | Refactor pays off; parametrize |
| 10 | Six 1s | // and %= | Hidden edge case; boundary > coverage |
| 11 | Invalid dice | pytest.raises | Robustness is first-class |
| 12 | Summary | Method on BattleReport | Behavior on the existing object |
| Transfer | FizzBuzz | (different domain) | The rhythm transfers — TDD isn’t problem-specific |
Reveal — six takeaways that travel
- TDD is design, not testing. The test is the contract; the implementation emerges under its pressure.
- Refactor toward duplication, not before it (Rule of Two). One example is a guess at the shape; two makes the variation visible; three or more is duplication that has rotted. Cycle 6’s timing was Rule of Two — and it generalizes to every refactor you’ll do.
- Tests enable change. Behavior-level assertions survive structural rewrites; implementation-coupled tests don’t.
- Coverage ≠ correctness; complement with boundary-value analysis (zero, exactly N, 2N, between).
- Listen to the test. Pain in writing a test usually points at the production code.
- If a behavior isn’t on the test list, code for it isn’t earned. Speculative scaffolding (validation, error handling, hypothetical inputs) waits until a test demands it.
Reveal — when TDD shines, when it's overkill
Pick one. One minute each. Which would TDD have helped less?
(a) You’re writing a function that classifies an image as cat-or-dog by calling a pretrained model. The output is a probability, judged by humans on edge cases.
(b) You’re writing a function that adds a new currency to a payment processor. The behavior is precisely specified.
Compare your answer
TDD shines on (b): new features with clear behavioral requirements; complex logic with branching cases; long-lived code modified by multiple people; API design; domains where regressions hurt (payments, scoring, calculations).
TDD is overkill on (a): one-off throwaway scripts; exploratory prototyping; UI layout; non-binary outcomes (ML accuracy, image recognition); Jupyter research.
Even on (a), some tests pay off — the question is whether to write them first. Kent Beck: “the discipline of working strictly test-first is valuable but not necessarily something you want to do all the time.”
Reveal — TDD anti-pattern taxonomy (cover the right column; predict the antidote)
| Level | Anti-pattern | What it looks like | Antidote (predict before reading) |
|---|---|---|---|
| I | The Liar | Test passes but asserts vacuously (isinstance(x, int) only) | Cycle 4’s mutation move |
| I | The Nitpicker | Asserts on private attributes / implementation details | Assert on observable behavior |
| II | Success Against All Odds | New test passes immediately, with no investigation | Verify with mutation |
| II | Skip-the-Refactor | Stop at green; never enter REFACTOR | Make the look mandatory |
| III | The Giant | One test asserts dozens of behaviors | One behavior per test |
| III | Excessive Setup | 30+ lines of fixture before one assertion | Decouple production code |
| IV | The Mockery | More mock setup than test logic | Listen — the design is wrong |
| IV | Modify-the-Test | AI rewrites the test to match buggy code | Own the spec yourself |
Higher level = more architectural smell. Listen to the test.
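A Level-I Liar in miniature, with the mutation move that exposes it (total_damage here is a hypothetical, deliberately broken function, not the tutorial's scorer):

```python
def total_damage(dice):
    # A "mutant": deliberately wrong implementation that always returns 0.
    return 0

# The Liar passes against the mutant: it asserts nothing about the value.
liar_passes = isinstance(total_damage([1, 1]), int)
assert liar_passes  # green, but meaningless

# An honest assertion pins the actual value, so it fails against the mutant.
honest_passes = (total_damage([1, 1]) == 200)
assert not honest_passes  # the mutation move: the mutant must be killed
```

If a test stays green while you sabotage the production code, the test was lying; that check is exactly what mutation-testing tools automate.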
Reveal — the empirical case for TDD
| Study | Finding |
|---|---|
| Microsoft & IBM (Nagappan et al., 2008) | 39–91% decrease in pre-release defect density in TDD teams |
| Same studies | 15–35% longer initial development; offset by reduced debugging |
| Erdogmus et al. (2005) | Test-first students wrote more tests AND were more productive per test |
| Janzen & Saiedian (ICSE 2007) | Even programmers who resisted test-first adopted it more after exposure — the Residual Effect |
| Fucci et al. (2017) | TDD’s benefit comes from granularity + uniformity, not strict test-first ordering — your seven tiny cycles embody both |
Caveat: mixed for solo programmers on short tasks. Strongest in team settings, with CI, on long-lived systems.
Reveal — what to learn next (the same rhythm, scaled up)
- Fixtures (@pytest.fixture) for reusable setup of objects, DBs, mock APIs
- Mocks, fakes, stubs — with a strong default toward fakes over mocks
- Property-based testing with Hypothesis — score(any list of 1–6) should always satisfy invariants
- Mutation testing with mutmut or cosmic-ray — automate the cycle-4 mutation move across the whole suite
- The Outside-In / Double-Loop pattern (Percival, Obey the Testing Goat) — high-level acceptance tests drive unit tests
Each lives inside the same Red-Green-Refactor rhythm you just internalised.
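The property-based idea can be sketched without installing anything; Hypothesis's @given automates the input generation and shrinks failing cases for you. (fizzbuzz here is a hypothetical minimal implementation, included only so the invariants have something to run against.)

```python
import random

def fizzbuzz(n):
    # Minimal reference implementation for the invariant checks below.
    out = []
    for i in range(1, n + 1):
        s = ("Fizz" if i % 3 == 0 else "") + ("Buzz" if i % 5 == 0 else "")
        out.append(s or str(i))
    return out

# Invariants that must hold for ANY n: the property-based mindset.
for _ in range(100):
    n = random.randint(0, 200)
    result = fizzbuzz(n)
    assert len(result) == n                                          # length
    assert all(result[i - 1] == "FizzBuzz"
               for i in range(15, n + 1, 15))                        # every 15th
    assert all(not (r.isdigit() and int(r) % 3 == 0) for r in result)  # no leaks
```

Example-based tests pin specific points; properties pin whole regions of the input space. They complement, not replace, the tests you wrote in the cycles.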
🪞 Recalibrate (after the quiz)
Re-rate confidence on the same 1–5 prompt. Look at your sticky. The gap is the data — feelings of progress are unreliable; the gap is signal.
And revisit your transfer probe answer: is the code you named still the right next place to apply this, or did the quiz/recap shift it? Whichever piece of code you end up picking — start it RED.
# The full, seven-cycle implementation lives here. Use this step's
# editor to scroll through what you built — every line is justified by
# a test in test_scorer.py. There is no speculative code.
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
number_of_combos = counts[self.die] // self.count
for _ in range(number_of_combos):
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] %= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
ComboRule(3, 3, "Orc Charge", 300),
ComboRule(4, 3, "Troll Smash", 400),
ComboRule(5, 3, "Lightning Storm", 500),
ComboRule(6, 3, "Demon Strike", 600),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def score(dice: list[int]) -> BattleReport:
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""All seven cycles, all green. Read it as a contract."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
@pytest.mark.parametrize(
"roll, expected_event",
[
([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
],
)
def test_other_triples_create_combo_events(roll, expected_event):
report = score(roll)
assert report.total_damage == expected_event.damage
assert report.events == (expected_event,)
def test_six_ones_create_two_dragon_blasts():
report = score([1, 1, 1, 1, 1, 1])
assert report.total_damage == 2000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
Step 11 — Knowledge Check
Min. score: 80%
1. Which single statement most accurately captures TDD?
The threshold concept of the tutorial: TDD is design, not testing. Cycle 6’s rule-object refactor emerged under cycle-6’s test pressure, not from a whiteboard. The test suite is a (valuable) byproduct.
2. For which scenario is TDD most beneficial?
Payment processing has every property TDD rewards: complex logic, strict correctness, multiple developers, long life. Microsoft/IBM measured 39–91% defect-density reductions exactly here.
3. A test file contains:
def test_user_creation():
user = User("Alice", "alice@example.com")
assert user._name == "Alice"
assert user._email == "alice@example.com"
The Nitpicker. Leading underscores signal “private — implementation detail.” Tests that touch private state break on every refactor. Assert on observable behavior via the public API.
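The behavior-level rewrite, sketched (User here is a hypothetical class with a public read API; the point is that the test no longer touches underscore-prefixed state):

```python
class User:
    def __init__(self, name: str, email: str):
        self._name = name      # private storage: free to change later
        self._email = email

    @property
    def name(self) -> str:
        return self._name

    @property
    def email(self) -> str:
        return self._email

def test_user_creation():
    user = User("Alice", "alice@example.com")
    # Assert through the public API: survives a rename of _name/_email,
    # a switch to __slots__, or storage in a dict.
    assert user.name == "Alice"
    assert user.email == "alice@example.com"

test_user_creation()
```

The test now reads as a contract on behavior, so any internal refactor that preserves the contract keeps it green.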
4. The “one Dragon Blast for any count ≥ 3” bug lived in cycles 5 and 6 before cycle 7 surfaced it. Why didn’t line coverage catch it?
Coverage tells you the line ran, not that it was right. Every prior combo
test used exactly count dice; nothing probed 2 × count. Boundary thinking
is the discipline that catches it. Coverage is a locator of under-tested
code, not a measure of correctness.
5. (Design a test list from a spec.) A teammate hands you this spec for a function is_valid_password(password: str) -> bool:
Returns True iff all of these hold: length is at least 8 and at most 64; contains at least one digit (0–9); contains at least one symbol from !@#$%; otherwise returns False. The empty string is invalid.
Apply Canon TDD step 1 — write a test list, in the order you’d implement it. Which list below best honours TDD discipline (simplest-first, boundary-aware, behavior-named)?
Why option A wins: it follows the same shape as Dragon Dice’s seven cycles — simplest case first (typical valid input), then each rule rejected in isolation (no digit, no symbol, too short), then boundaries on the length range (7 invalid, 8 valid, 64 valid, 65 invalid — the boundary-pair discipline from the Testing Foundations prerequisite), then the empty-string edge. Each test name reads like a one-line bug report.
Why there’s no single right answer: your exact ordering may differ (e.g., empty-string first as the simplest-of-all). The skill being practiced here is generating a structured plan from a spec — the same skill you’ll use on every function you TDD outside this tutorial. The four candidate lists above contrast different wrong-answer patterns: random-instead-of-systematic, wrong-rules, and the Giant antipattern.
If you found yourself drafting your own list before scrolling to the options — even better. That’s exactly Canon TDD step 1 in your hands.
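Option A's boundary-pair discipline, sketched as code (is_valid_password here is a hypothetical reference implementation consistent with the spec above):

```python
def is_valid_password(password: str) -> bool:
    # Reference implementation of the spec: length 8..64, >=1 digit, >=1 symbol.
    return (8 <= len(password) <= 64
            and any(c.isdigit() for c in password)
            and any(c in "!@#$%" for c in password))

# Boundary pairs on length (7/8 and 64/65), holding the other rules satisfied.
base = "a1!"                                   # digit + symbol present
assert not is_valid_password(base + "a" * 4)   # length 7: just short
assert is_valid_password(base + "a" * 5)       # length 8: just valid
assert is_valid_password(base + "a" * 61)      # length 64: just valid
assert not is_valid_password(base + "a" * 62)  # length 65: just long

# Each rule rejected in isolation, plus the empty-string edge.
assert not is_valid_password("abcdefg!")       # no digit
assert not is_valid_password("abcdefg1")       # no symbol
assert not is_valid_password("")               # empty: invalid
```

Notice the shape: a helper that satisfies every other rule, so each assertion varies exactly one thing. That is the same one-behavior-per-test discipline the Giant anti-pattern violates.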