TDD with pytest — Dragon Dice Battle
Live the Red-Green-Refactor rhythm by growing a fantasy dice-scoring engine from scratch — seven tiny tests, one design pressure at a time, plus a final transfer cycle (FizzBuzz) that proves the rhythm carries to a brand-new problem. Tests drive design, refactors stay safe, and the rules of the game emerge as code.
Cycle 1 — RED: Write the Failing Test
Why this matters
RED is the moment TDD looks weirdest: you deliberately write a test that cannot pass yet, and you make the failure happen on purpose. That inversion is the threshold concept — a failing test is the goal, not the accident, because it’s the first place where the spec gets pinned down before any implementation exists. Learning to read a failure for the right reason is the foundation everything else in this tutorial sits on.
🎯 You will learn to
- Apply the four-part pytest test shape (import, define, arrange-act, assert) to translate a one-sentence spec into a runnable failing test
- Analyze a pytest failure and distinguish a right-reason RED (ImportError / AssertionError on the assertion you wrote) from a wrong-reason RED (typo / missing colon)
- Evaluate why a surprise green on a brand-new test should be treated as a Liar test until proven otherwise
Prerequisite: Testing Foundations — pytest discovery, assert, partitions, behavior-not-implementation. If those feel new, do that one first.
What you’re building — Dragon Dice
Dragon Dice is a (fictional) tabletop combat game. The mechanic is simple: a player rolls a handful of six-sided dice, and certain face values and combinations trigger named combat events — Dragon Flame, Lightning Spark, Goblin Swarm, and so on — each worth a damage number. A turn’s roll is just a Python list of dice values, e.g. [1, 1, 1, 1, 5].
Two kinds of scoring happen on every roll:
- Singles — a 1 becomes one Dragon Flame (100 damage); a 5 becomes one Lightning Spark (50). Other face values, on their own, score nothing.
- Triples (combos) — three matching dice trigger a bigger event that consumes its dice. Three 1s become one Dragon Blast (1000) instead of three Dragon Flames; three 2s become a Goblin Swarm; and so on. Whatever the combos don’t consume keeps scoring as singles, so [1, 1, 1, 1, 5] produces one Dragon Blast (consuming three 1s) plus a leftover Dragon Flame plus a Lightning Spark — for 1150 total damage. The full ruleset is in the table further down.
Your goal across the seven Dragon-Dice cycles is to grow a score(dice) function that turns any roll into a BattleReport — its total_damage and the ordered tuple of ScoringEvents it produced. You will not look at the full ruleset and write it all at once. TDD adds one rule at a time, each one earned by a test that demands it. After cycle 7 an eighth transfer cycle reapplies the same rhythm to a totally unrelated problem (FizzBuzz), as proof the discipline carries beyond this domain.
Test-Driven Development in one minute
TDD is a design technique that uses tests as the medium of pressure. You write code in short cycles of three phases:
| Phase | What you do | Why this phase exists |
|---|---|---|
| 🔴 RED | Write one failing test that names a behavior you want | Forces the interface and expected behavior to be decided before any logic exists |
| 🟢 GREEN | Write the smallest code that makes the test pass | Resists speculative design; only build what a test demands |
| 🔵 REFACTOR | Improve the code while all tests stay green | The safety net lets you reshape structure without fear of regression |
Each phase of cycle 1 is its own tutorial step so the rhythm becomes a felt sequence, not a slogan. From cycle 2 onward, each cycle is one step containing all three phases.
Why a failing test is the goal of RED

Most testing intuition is the opposite: green = good, red = bad. TDD inverts that for the first run of every cycle. If you write a brand-new test against code that doesn’t exist yet — and pytest reports PASSED — something is wrong. Maybe the import silently failed. Maybe the assertion is vacuous. Maybe you’re running an old cached version. A surprise green is a Liar test until proven otherwise; the wizard is right to block it.
A failing test is not a bug — RED is the expected starting state of every cycle. But the failure has to come from the behavior under test, not from a typo:
- Right reason — ImportError, AttributeError, a value-mismatch on the assertion you wrote. The test correctly says “this behavior does not exist yet.”
- Wrong reason — SyntaxError, missing colon, misspelled test_ prefix. The test never ran. You’ve learned nothing about the unit, only about your typing.
Students commonly delete a failing test to make the bar green. We’re leaning into that discomfort instead. Learning to read the failure is what TDD trains.
🤔 But why test-first? Why not just write the code, then test it?
The honest answer: most developers’ instinct is to write the code first. That is the habit TDD is replacing — and it deserves a real argument, not just a style claim.
A small concrete scenario. Suppose you skip the test and just write score() directly. You’re confident it’s right; you eyeball-check it in a REPL with score([1]) and score([1, 5]), see plausible numbers, ship. Two weeks later your teammate adds a triple 1s = Dragon Blast rule by inserting an elif branch that fires before the per-die loop. The elif only matches exactly [1, 1, 1]; rolls like [1, 1, 1, 5] silently fall through and score wrong.
With a test-first cycle, the triple 1s test would have run against an empty score() and forced the question “what does the spec say happens with leftover singles?” before the elif was written. Without the test, the bug ships and surfaces only when a player notices their score is off — if they notice at all.
The general pattern. Code-first writes a function and then asks “what should it do?” Test-first writes a behavioral commitment and then asks “what’s the simplest code that delivers that?” The first habit lets implementation choices smuggle themselves into your sense of what the spec was. The second prevents that — the spec is on disk, in code, before any implementation can pollute it. (Janzen & Saiedian’s ICSE 2007 study of 230+ programmers: even programmers who tried test-first once kept reverting to code-first afterward; the habit is that sticky. Naming it here, so you can notice it in yourself, is half the work.)
So you might still resist test-first today. Notice the resistance. The goal of these seven cycles is to give you the felt experience of small-step rhythm — after which you’ll be choosing test-first because it works, not because we said so. (And per Fucci et al. 2017: even if you sometimes write the code an instant before the test, the granularity and rhythm are where TDD’s measured benefits come from. So don’t worry about being a purist; worry about being incremental.)
The shape of every pytest test
Every pytest test you write has the same four-part structure:
| Part | What it does |
|---|---|
| Import the unit under test | Tells Python what code you’ll call |
| Define a function whose name starts with test_ | pytest only discovers functions matching this pattern |
| Arrange + act | Set up any input and call the unit |
| Assert an observable property of the result | Pin down one thing the spec promises |
The pattern generalizes; the specifics (what to import, call, and assert) come from the spec — and only from the spec.
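As one concrete instance of the four-part shape, here is a complete test against a small hypothetical unit (cart_total is an illustrative name, not part of the dragon-dice code you are about to write):

```python
# Hypothetical unit under test. In a real project this function would live in
# its own module, and part 1 of the test would be an import such as
# `from cart import cart_total`; it is defined inline so the example runs alone.
def cart_total(prices):
    return sum(prices)


def test_empty_cart_has_zero_total():  # part 2: name starts with test_, so pytest discovers it
    total = cart_total([])             # part 3: arrange + act on the simplest input
    assert total == 0                  # part 4: assert one observable property from the spec
```

Rename the function to drop the test_ prefix and pytest silently skips it: discovery is by naming convention, not by registration.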
The dragon dice rules (reference for all seven cycles)
| Roll | Event | Damage |
|---|---|---|
| Single 1 | Dragon Flame | 100 |
| Single 5 | Lightning Spark | 50 |
| Triple 1 | Dragon Blast | 1000 |
| Triple 2 | Goblin Swarm | 200 |
| Triple 3 | Orc Charge | 300 |
| Triple 4 | Troll Smash | 400 |
| Triple 5 | Lightning Storm | 500 |
| Triple 6 | Demon Strike | 600 |
Triples consume three dice; leftover 1s and 5s still score as singles. Dice are integers 1–6. Today you implement only the empty-roll case. Six more cycles add the rest, and a final transfer cycle (a different problem entirely) proves the rhythm carries.
Commit after every step (the safety-net habit)
The editor has a Git Graph view next to it and an embedded terminal that accepts a small set of shell commands (git, python, pytest, plus &&/||/; chains). Commit at the end of each step with a short message naming the phase (RED:, GREEN:, REFACTOR:, Cycle N:). Two reasons it earns its keep:
- Atomic safety net. Every commit is a known-green state you can git reset --hard back to if a refactor goes sideways. Beck’s discipline: never refactor on top of uncommitted code.
- Visible history. The Git Graph view shows your DAG growing one node per phase — a literal picture of “Red, Green, Refactor, Red, Green, Refactor…” that mirrors what your editor just did.
Cycle 1’s three steps each give you the exact command to type. From Cycle 2 onwards, the commit prompt only suggests the message — you write the git add <files> && git commit -m "..." yourself. (Always stage the specific files you touched, e.g. git add scorer.py test_scorer.py. Avoid git add -A — it sweeps in junk you didn’t mean to commit.)
Your test list (Canon TDD step 1)
Kent Beck’s Canon TDD (December 2023) starts with a written list of behaviors you want the code to have — before writing any tests. The list isn’t a contract; it’s a thinking tool. New behaviors get appended as they occur to you; ones you finish get struck through; ones that turn out to be already-implemented (the bonus mixed-dice test in cycle 4, the bonus leftover guardrail in cycle 5) get a checkmark with no code change.
Here are the first three items, in the order the cycles will tackle them:
- ☐ Cycle 1 — Empty roll → no damage, no events
- ☐ Cycle 2 — A single 1 → one Dragon Flame event
- ☐ Cycle 3 — A single 5 → one Lightning Spark event
- ☐ Cycle 4 — …
More items appear as we work through them — Beck’s discipline is to not pre-resolve them all. Pick the next item, turn only that one into a runnable test, make it pass, optionally refactor, repeat. He warns explicitly against converting every list item up front (“leads to rework and depression”) and against mixing refactor into making a test pass (“wearing two hats simultaneously”). The platform’s step-by-step structure enforces both disciplines for you.
Cycle 1’s spec
An empty roll produces a battle report with zero damage and no events.
That sentence names everything you need: a function score, a return value with total_damage and events attributes. Translate it into a pytest test using the four-part shape.
Your task
- In test_scorer.py (right pane), fill in the three sub-goal comments. Leave scorer.py empty — its code belongs to the GREEN step.
- Predict the category of failure you’ll see — ImportError, AttributeError, or AssertionError? Write it down.
- Click Run. Compare the actual failure to your prediction.
Reveal — what we expected (open after running)
ImportError: cannot import name 'score' (or ModuleNotFoundError if scorer.py is empty). That IS the deliverable — RED for the right reason.
Why `()` and not `[]`? (open if you wondered)
Tuples are immutable — they can’t be mutated by accident, and they’re safe dataclass defaults (cycle 2 uses that). Every test in this tutorial that pins down events uses a tuple.
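A quick demonstration of the difference (BadReport and OkReport are throwaway names for this sketch):

```python
from dataclasses import dataclass

# A mutable list default is rejected outright at class-definition time...
try:
    @dataclass(frozen=True)
    class BadReport:
        events: list = []  # dataclasses raises ValueError here
except ValueError:
    mutable_default_rejected = True

# ...while an immutable tuple default is safe to share between instances.
@dataclass(frozen=True)
class OkReport:
    events: tuple = ()

assert mutable_default_rejected
assert OkReport().events == ()
```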
📦 Commit your progress
Before moving on, lock this step into the safety net. In the embedded terminal:
git add test_scorer.py && git commit -m "RED: failing test for empty roll"
# Cycle 1 RED phase — DO NOT WRITE PRODUCTION CODE HERE YET.
#
# The next tutorial step (Cycle 1 GREEN) is where the BattleReport
# class and the score function are introduced. Right now we are
# only writing the failing test on the right.
"""Cycle 1 RED — write the first failing test.
The sub-goals below describe the PURPOSE of each line you need to add,
not the syntax. Translate the spec ("an empty roll has no damage and no
events") into pytest assertions yourself. If you get stuck, consult the
rules table and the references in the instructions panel.
"""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
    # Sub-goal: call the unit under test on the simplest input the spec mentions
    # — and capture the result so the next two lines can inspect it.
    # Sub-goal: pin down what the spec says about damage in this case.
    # Sub-goal: pin down what the spec says about events in this case.
    pass
Cycle 1 — GREEN: Make It Pass
Why this matters
The instinct on GREEN is to “build it right” — anticipate the next cycle, generalize early, reach for the elegant abstraction. That instinct is the single most common way TDD degrades into test-after. The GREEN rule asks for the smallest code that satisfies this one test, even if it looks embarrassingly trivial — because every line you write without a test demanding it is a guess, not a discovery.
🎯 You will learn to
- Apply the GREEN rule by writing the smallest code that satisfies the current failing test (no speculative branches, no premature abstraction)
- Analyze a pytest test as a contract that prescribes the unit’s interface (@dataclass(frozen=True), @property, default tuple) line-by-line
- Evaluate a candidate GREEN against the Transformation Priority Premise — preferring lower-cost transformations (constant → variable) over higher-cost ones (loop / class)
The GREEN rule: write the smallest code that makes the failing test pass. Anything more is speculative design — code with no test demanding it.
The test is your contract
Every line of the test is an obligation your code must satisfy:
| Line of the test | What your code must provide |
|---|---|
| from scorer import score | A score name in scorer.py |
| report = score([]) | score returns something |
| assert report.total_damage == 0 | That something exposes total_damage, equal to 0 |
| assert report.events == () | …and events, equal to () |
Three Python tools, in this new context
You already know these — what’s new is why this test forces you to reach for them:
- @dataclass(frozen=True) — gets you free __init__/__eq__/__repr__, and the per-field structural __eq__ is exactly what makes report.events == (...) work in cycle 2. (Also hashable, which we lean on later.)
- @property — needed because the test reads report.total_damage as an attribute, not report.total_damage(). The test’s grammar is the constraint; @property is the tool that fits it.
- Default tuple — events: tuple = () lets the empty report be built with no arguments, and because a tuple is immutable it is a legal dataclass default (a list would be rejected).
That’s it. The test wrote the spec for you; these tools are the smallest Python primitives that satisfy it. (dataclasses · property if you want a refresher.)
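Here are the tools assembled in a neutral Cart shape (Cart and LineItem are illustrative names; translating the shape to the dragon-dice naming is your job in this step):

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: immutable value object with structural __eq__
class LineItem:
    name: str
    price: int


@dataclass(frozen=True)
class Cart:
    items: tuple = ()  # immutable default; an empty cart needs no arguments

    @property  # read as cart.total, not cart.total()
    def total(self) -> int:
        return sum(item.price for item in self.items)


assert Cart().total == 0
assert Cart((LineItem("apple", 3),)) == Cart((LineItem("apple", 3),))  # structural equality
```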
The Transformation Priority Premise — why “smallest” beats “best”
Robert Martin’s TPP lists code transformations from simplest to most complex: nothing → constant → variable → conditional → loop. The rule: always pick the simpler transformation that passes the current failing test, even when you “know” a more general one is coming.
For cycle 1, the test only mentions the empty case. You do not need a loop yet — the empty-tuple default already produces 0 damage. The loop arrives when a test (cycle 4) actually demands it.
Your task
- In scorer.py (left pane), replace each sub-goal comment with the matching line. Re-read the test for the contract.
- Before you click Run, identify one way your code could be wrong. (A misplaced default? A forgotten decorator? A method where the test reads an attribute?) Run, then check whether your prediction matched.
- Resist any “improvement” beyond what the test demands — the next step is REFACTOR, and it only earns work that has somewhere to go.
🛟 Stuck? Common shapes that fail (open if pytest is red)
- events: list = [] — Python rejects mutable defaults in dataclasses with ValueError. What immutable alternative matches the test’s events == () assertion?
- Forgetting @property — without it, report.total_damage is a bound method object, not a number; the assertion fails in a weird way.
- @dataclass without frozen=True — passes cycle 1, but cycle 2’s tuple comparisons of value objects need the structural __eq__ that frozen dataclasses provide.
- if not dice: ... — speculative branching. The empty-tuple default already handles the empty case.
📦 Commit your progress
🔍 Before you commit, glance at the gutter. The +/~/- markers in the left margin of each editor pane show what changed since your last commit (the RED step). The diff should be exactly the production code you just wrote — nothing else. If you see surprises, investigate before staging.
Then, in the embedded terminal:
git add scorer.py test_scorer.py && git commit -m "GREEN: empty BattleReport with zero damage"
"""Cycle 1 GREEN — smallest code that turns the failing test green.
The sub-goals below describe the PURPOSE of each line you need to add,
not the syntax. Re-read test_scorer.py to recover the contract: a name
to export, a return value, two attributes on the return value with
specific values. The Cart example in the instructions shows the toolkit
shape; you must translate it to the dragon-dice naming yourself.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    # Sub-goal: declare the storage that the test reads as `report.events`.
    # Hint: the test compares this to `()`, which already tells you the
    # type and the default value. (See the dataclasses docs link.)

    @property
    def total_damage(self) -> int:
        # Sub-goal: derive the total from whatever events the report holds.
        # Hint: with an empty-tuple default for events, an aggregate built-in
        # over an empty sequence already produces the value the test asserts.
        pass


def score(dice: list[int]) -> BattleReport:
    # Sub-goal: hand back the kind of object the test reads attributes on.
    # Hint: ignore `dice` for now — no test makes a claim about non-empty
    # rolls yet, so any branching on it would be speculative design.
    pass
"""Cycle 1 — first failing test (carried over from the RED step)."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])
    assert report.total_damage == 0
    assert report.events == ()
Cycle 1 — REFACTOR: The Pause That Counts
Why this matters
Beginners skip REFACTOR when they “don’t see anything to clean up” — and that habit is exactly how TDD silently decays into write-test-then-write-code. REFACTOR is a phase you enter every cycle, with a deliberate look-around through a checklist; the answer “nothing this time” is a fine outcome, but skipping the look is not. Today’s cycle 1 has almost nothing to clean — that’s why it’s the right moment to install the discipline of looking anyway.
🎯 You will learn to
- Apply the five-line REFACTOR checklist (duplication, names, test names, magic constants, imports) as a deliberate pause at the end of every cycle
- Evaluate when “nothing to clean this time” is the correct outcome — and notice that entering and looking is the discipline, not finding something
- Analyze a quiz question on the rhythm to confirm RED-GREEN-REFACTOR is now reasoned about, not just slogan-recited
The discipline: REFACTOR is a phase you enter every cycle — even when the answer is “nothing to clean this time.” Entering and looking is the discipline. Skipping the look is the failure mode that quietly degrades TDD into test-after.
The REFACTOR checklist (you’ll re-use this every cycle)
| Category | Question to ask | Your cycle 1 answer |
|---|---|---|
| Duplication | Two pieces of code expressing the same idea? | _____ |
| Names | Do names describe what they mean, not how they work? | _____ |
| Test names | Does each test name read as a behavior sentence? | _____ |
| Magic constants | Unexplained numbers or strings? | _____ |
| Imports | Conventional order, no dead imports? | _____ |
Fill the right column from your code before opening the reveal. The discipline is the looking, not the finding.
Your task
- Re-read your code with the checklist. Spend 30 seconds — don’t rush.
- Make any tiny improvement you spot (e.g., a module docstring); keep the bar green.
- Open the reveal below to compare your answers. Then take the first quiz.
Reveal — one possible cycle-1 answer column
| Category | Cycle 1 answer |
|---|---|
| Duplication | No — only one piece of code |
| Names | BattleReport, total_damage, events, score — all domain words |
| Test names | test_empty_roll_has_zero_damage_and_no_events — long but unambiguous |
| Magic constants | None yet |
| Imports | Just from dataclasses import dataclass — clean |
For cycle 1, every row is “fine.” That’s a real outcome of a REFACTOR phase — and recognizing it without skipping the look is the win.
Why REFACTOR is the most-skipped phase
Martin Fowler calls skipping refactor “the most common way to screw up TDD.” Field studies of student and professional practice agree: developers treat the green bar as the finish line. Within a few cycles, duplication accumulates and the test suite ages — exactly because nobody paused at REFACTOR to look. By making “enter the phase even when there’s nothing to do” a habit now, you defend against that drift for the rest of the tutorial.
📦 Commit your progress
Before moving on, lock this step into the safety net. In the embedded terminal:
git add scorer.py test_scorer.py && git commit -m "REFACTOR: cycle 1 (nothing to clean)"
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self) -> int:
        return sum(event.damage for event in self.events)


def score(dice: list[int]) -> BattleReport:
    return BattleReport()
"""Cycle 1 — first failing test, now green."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])
    assert report.total_damage == 0
    assert report.events == ()
Step 3 — Knowledge Check
Min. score: 80%

1. What is the correct order of phases in a single TDD cycle?
RED → GREEN → REFACTOR. The order is load-bearing. RED proves the test reveals missing behavior. GREEN proves the code now satisfies that behavior. REFACTOR improves the code while the safety net (the green test) protects you.
2. You write a new test and click Run. The bar is immediately green without you having to write any production code. What is the most defensible interpretation?
Both halves matter. Sometimes a test passes immediately because a prior refactor generalized behavior — that is a positive outcome, and the test still earns its place by documenting that the behavior is intended. Other times the test is vacuously true. The way to tell is to break the production code on purpose: a real test will catch the break.
3. In the GREEN phase, you have two implementations in mind: one is a hardcoded if dice == []: return BattleReport(), the other is a generic loop. The hardcoded version passes the current test. What does TDD discipline say to do?
This is the Transformation Priority Premise in action: prefer simpler transformations. The hardcoded version is fine as long as a future test will challenge it. That challenge — and the design pressure it creates — is exactly what cycle 4 of this tutorial will hand you.
4. Why is REFACTOR worth entering even when there is nothing to clean up?
Field studies report that “skip refactor when it looks fine” is exactly how test-driven discipline degrades into test-after over a few weeks. Every cycle is a checkpoint where you ask “what do I see now?” Most cycles, the answer is mostly fine. But you only know that because you looked.
5. A teammate says: “TDD is just unit testing — write your tests first instead of last.” What is the most accurate correction?
This is the threshold concept: TDD is design first, testing second. The test forces you to decide the public interface (function name, parameters, return shape) before any logic exists. The REFACTOR phase is where the design that emerged under pressure gets shaped intentionally. Calling it “unit testing in reverse order” misses both halves.
Cycle 2 — Single 1 → Dragon Flame
Why this matters
Cycle 1 walked the rhythm one phase at a time. Cycle 2 packs all three phases into one step — and immediately tests the hardest TDD discipline of all: allow the hard-code. The first GREEN for “a 1 is a Dragon Flame” should look ugly (if dice == [1]:) because one example is not enough information to choose the right shape. Refactor toward duplication, not before it.
🎯 You will learn to
- Apply the full RED-GREEN-REFACTOR rhythm as a single packaged cycle, translating a one-sentence spec into a test, the smallest passing code, and a deliberate REFACTOR pause
- Analyze why the first GREEN is allowed (and expected) to look ugly — one example is not enough information to choose the right shape
- Evaluate the “refactor toward duplication, not before it” rule against the temptation to generalize early
Spec: a single die showing 1 creates a Dragon Flame event worth 100 damage.
From now on, each cycle is one step with three tasks (RED → GREEN → REFACTOR). Same discipline as cycle 1 — tighter packaging.
Your task
- 🔴 RED — add test_single_one_creates_dragon_flame_event in test_scorer.py. From the spec (“a single die showing 1 creates a Dragon Flame event worth 100 damage”), translate into pytest assertions yourself. The four-part shape from cycle 1 still applies; the rules table at the top names the event and damage. Predict the failure category (ImportError? AttributeError? AssertionError?) before running.
- 🟢 GREEN — pick the smallest code that turns the test green. Resist any abstraction beyond what cycle 2’s single test demands. After you’ve made your choice, open the reveal below to compare.
- 🔵 REFACTOR — walk the cycle-1 checklist. Resist generalizing; cycle 4 will earn the loop.
Reveal — what we expected for RED (open after running)
ImportError: cannot import name 'ScoringEvent' — the test forces you to name the event class before writing it. That’s the design pressure of test-first thinking.
Reveal — one shape for the smallest GREEN (open after you've tried)
A hardcoded if dice == [1]: branch returning a BattleReport with one ScoringEvent. Yes, it’s ugly. Yes, you can see how cycle 3 will duplicate it. That’s the point — wait for the second example.
Why “allow the hard-code” is a TDD discipline. The instinct is to extract a rule, write a loop, build the abstraction now. TDD asks you to wait for the test that demands it. A speculative loop is a guess at the right shape; a loop refactor pulled by cycle 4’s test is a discovery. Refactor toward duplication, not before it.
🪞 Pause (10 seconds, after green): what did the test force you to name before any code existed? Hold your answer; the cycle-3 reveal will compare.
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py test_scorer.py) and write a short message — recommended: Cycle 2: single 1 = Dragon Flame.
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self) -> int:
        return sum(event.damage for event in self.events)


def score(dice: list[int]) -> BattleReport:
    return BattleReport()
"""Cycles 1–2 — adding the Dragon Flame behavior."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])
    assert report.total_damage == 0
    assert report.events == ()
# TODO (RED): import ScoringEvent from scorer
# TODO (RED): write test_single_one_creates_dragon_flame_event
# score([1]) should return a report with total_damage == 100
# and events == (ScoringEvent("Dragon Flame", (1,), 100),)
Cycle 3 — Single 5 → Lightning Spark
Why this matters
Cycle 3 is the same shape as cycle 2 with different values — and that’s exactly why it matters. Two near-identical hardcoded branches make the duplication impossible to miss; the trap is that your hands will itch to extract a loop right now. Don’t. Refactoring with only two data points is still guessing. Cycle 4’s test will provide the third point — and the loop refactor it earns will be a discovery, not a guess.
🎯 You will learn to
- Apply Variation Theory by writing a second test with the same shape as cycle 2 (only the values change) and observing what the contrast makes visible
- Evaluate when deliberately keeping ugly code is the disciplined move — refactoring under-informed is worse than not refactoring
- Analyze how the visible duplication will be the design pressure that earns the cycle 4 refactor
Spec: a single die showing 5 creates a Lightning Spark event worth 50 damage.
Same shape as cycle 2, different values. The duplication this creates is intentional — cycle 4’s test will earn the right to fix it.
Your task
- 🔴 RED — add test_single_five_creates_lightning_spark_event, structured exactly like cycle 2’s test but with the Lightning Spark values.
- 🟢 GREEN — add a second hardcoded if dice == [5]: branch. Resist the urge to write a loop or dict lookup.
- 🔵 REFACTOR — walk the checklist. The duplication is now visible; the right move is to note it and write nothing. No test demands the loop yet.
Why deliberately keeping ugly code is the disciplined move. You can clearly see duplication. Refactoring it now would be guessing at the right shape with one too few data points. Cycle 4’s test will provide the second data point — and the loop refactor it earns is a discovery, not a guess. Refactor toward duplication, not before it.
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py test_scorer.py) and write a short message — recommended: Cycle 3: single 5 = Lightning Spark.
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self) -> int:
        return sum(event.damage for event in self.events)


def score(dice: list[int]) -> BattleReport:
    if dice == [1]:
        return BattleReport((
            ScoringEvent("Dragon Flame", (1,), 100),
        ))
    return BattleReport()
"""Cycles 1–3 — adding the Lightning Spark behavior."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])
    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])
    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )
# TODO (RED): write test_single_five_creates_lightning_spark_event
# score([5]) should return total_damage == 50 with
# events == (ScoringEvent("Lightning Spark", (5,), 50),)
Cycle 4 — Repeated Singles → First Real Refactor
Why this matters
Cycle 4 is the first design-breaking test of the tutorial — neither dice == [1] nor dice == [5] matches [1, 1], so the cheapest patch (a third hardcoded branch) is globally expensive even when it’s locally small. This is also where the safety-net argument becomes load-bearing: the previous three green tests are what allow you to replace the hardcoded branches with a loop without fear. The mutation move at the end of the cycle proves those tests actually catch the regressions you think they do.
🎯 You will learn to
- Apply the first real refactor under safety — replacing hardcoded branches with a loop while three green tests guard the change
- Evaluate competing GREEN options (third hardcoded branch vs. loop) by predicting which is cheaper across the next two cycles
- Apply the mutation move (mutate a line, watch a test fail, revert) to verify the safety net actually catches regressions
Spec: score([1, 1]) returns total damage 200 with two Dragon Flame events.
The first design-breaking test. Neither dice == [1] nor dice == [5] matches [1, 1] — the duplication you noted in cycle 3 just demanded payment.
Your task
- 🔴 RED — add test_two_ones_create_two_dragon_flames asserting damage 200 and two Flame events. Run; predict the failure.
- 🟢 GREEN — you have two options:
  - Option A: a third hardcoded branch for dice == [1, 1].
  - Option B: replace the hardcoded branches with a for loop over each die.

  Pick one. Before you implement, predict: which option will be cheaper over the next two or three cycles? Don’t peek ahead — predict from what you know now. Implement your choice, run, and revisit your prediction.
- 🔵 REFACTOR + mutation check — re-run; the previous three tests are your safety net. Then prove they actually catch regressions with the ten-second mutation move: temporarily change a line in scorer.py (e.g., if die == 1: → if die == 99:), rerun pytest, watch a test fail, then revert. A test that doesn’t fail when the production code breaks is a Liar test — it’s not pinning down the behavior you think it is.
- 🟢 Bonus check — add test_one_and_five_create_two_different_events (mixed dice [1, 5] → one Flame + one Spark). Predict whether you’ll need to change scorer.py. The new loop should handle this for free — but the test makes that promise explicit.
🪞 Pause (after green): in one sentence, what did the passing test results just tell you that you’d otherwise have had to verify by hand? Hold your answer; the cycle quiz returns to it.
Reveal — what happens when option A wins (open after running)
A third hardcoded branch passes cycle 4. But the bonus mixed-dice case ([1, 5]) needs a fourth branch — and cycle 5 (triple 1s) cannot be satisfied by any hardcoded branch because the structure has to change. The loop refactor still has to happen, only now you have more code to delete first. Locally smallest (one new if) is globally largest.
Why the mutation move matters
A passing test means one of two things: (a) the code is correct, or (b) the test is vacuous and would pass against any code. The Liar test smell (Codurance taxonomy) is silent — pytest reports green either way. The 10-second mutation move — break the production code, watch the test fail, revert — is the cheap, durable defense. Use it whenever a test passes for a reason you didn’t fully expect (especially the bonus mixed-dice test, which passes “for free” thanks to the loop).
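The mutation move fits in a few lines. Here is a self-contained toy sketch — add_damage and its test are hypothetical stand-ins for illustration, not part of scorer.py:

```python
# Toy illustration of the 10-second mutation move.
def add_damage(a: int, b: int) -> int:
    # Mutation step: temporarily change `a + b` to `a - b`,
    # rerun, watch the test fail, then revert.
    return a + b

def test_add_damage():
    assert add_damage(100, 50) == 150

test_add_damage()  # passes against the correct line; fails against the mutant
```

If the test stays green while the mutant is in place, the test is a Liar — it never exercised the line you broke.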
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py and test_scorer.py) and write a short message — recommended: Cycle 4: per-die loop + mixed-dice guardrail.
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
def score(dice: list[int]) -> BattleReport:
if dice == [1]:
return BattleReport((
ScoringEvent("Dragon Flame", (1,), 100),
))
if dice == [5]:
return BattleReport((
ScoringEvent("Lightning Spark", (5,), 50),
))
return BattleReport()
"""Cycles 1–4 — repeated singles force the first refactor."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_two_ones_create_two_dragon_flames
# score([1, 1]) should return total_damage == 200 and two
# Dragon Flame events in events
#
# TODO (Bonus, after the loop refactor): add
# test_one_and_five_create_two_different_events
# score([1, 5]) should return total_damage == 150 with
# one Dragon Flame followed by one Lightning Spark
Step 6 — Knowledge Check
Min. score: 80%
1. You noticed obvious duplication after cycle 3 but did not refactor it. Cycle 4’s test then forced the refactor. Why is this sequence (notice → wait → refactor when test demands) better than (notice → refactor immediately)?
2. You replaced the hardcoded branches with a for loop and ran pytest. All four tests passed. What did the test suite just do for you that you would otherwise have had to do manually?
Tests enable change. The four passing test results are not just “tests passed” —
they are “the empty case still works, single 1 still works, single 5 still
works, and [1, 1] works for the first time.” Without the suite, you would
have to convince yourself of each line by reading.
3. A teammate looks at your cycle-4 GREEN code and says: “Why didn’t you just add a third if dice == [1, 1]: branch? It would have passed the test.” What is the most accurate rebuttal?
The TDD GREEN rule is “smallest code that passes the failing test” given the current direction of the design. A third hardcoded branch is locally minimal but globally wasteful — you’ll throw it away in cycle 5 anyway. The loop is the smallest durable change.
4. Why is “RED for the right reason” still the right framing in cycle 4, even though we are now adding a test on top of an already-passing suite?
“RED for the right reason” applies to every cycle. In cycle 1 the right reason is
usually ImportError. In later cycles it is typically an assertion failure
showing the expected tuple of events versus what the current code actually
produced. Either way, the failure has to come from the behavior under test, not
from a typo.
Cycle 5 — Triple 1s → Dragon Blast (Design Moment)
Why this matters
The cycle-4 per-die loop walks each die in isolation — it has no way to know that the other two 1s exist when it processes the first. The triple-1 test cannot be satisfied by editing a branch or tweaking the loop body; the structure has to change from “iterate dice in order” to “count faces, then decide what to emit.” This is the threshold concept: tests force structural change, not just lines of code. And the previous five tests survive a full body rewrite of score — because they assert on observable behavior, not on internals.
🎯 You will learn to
- Analyze why a per-die loop is structurally incapable of satisfying a triple-combo test — and why this earns a Counter-based count-then-emit shape
- Evaluate the Refactoring Litmus Test: which property of the previous tests allowed them to survive a full rewrite of score?
- Apply the same mutation move from cycle 4 to a leftover-bookkeeping line, confirming the new structure’s invariants are pinned down by tests
Spec: three 1s in a roll combine into one Dragon Blast (1000 damage) instead of three Dragon Flames.
The pivot moment. Cycle 4’s per-die loop walks each die independently — it has no way to know that the other two 1s exist when it processes the first. This test cannot be satisfied by tweaking a branch; the structure has to change. A design-breaking test.
Your task
- 🔴 RED — add test_three_ones_create_dragon_blast_instead_of_three_flames. Predict what kind of failure pytest will show (ImportError, AttributeError, AssertionError — and what the message will likely contain). Run.
- 🟢 GREEN — before reaching for code: open the per-die loop in score(). Spend 90 seconds writing a one-sentence answer to: what about the loop’s structure makes this test impossible to satisfy with a local edit? Then make the structural change.
- 🔵 REFACTOR — re-run; the previous five tests survived a full body rewrite of score.
- 🟢 Bonus guardrail — your GREEN code subtracts the consumed dice (counts[1] -= 3) so leftovers still score as singles. That behavior is currently implicit — no test would catch a future refactor that forgets it. Add test_dragon_blast_plus_leftover_flame_and_spark (score([1, 1, 1, 1, 5]) → one Blast, one leftover Flame, one Spark = 1150 damage). It should pass for free; verify with the cycle-4 mutation move (mutate counts[1] -= 3 to counts[1] -= 4, watch the new test fail, revert).
Reveal — one shape that handles per-face-count thinking (open after you've tried)
Stop iterating dice in order. Count how many of each face appeared (collections.Counter), then decide what to emit. Combos consume dice; leftovers still score as singles.
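The counting half of that shape is standard-library Counter arithmetic. A minimal sketch (using the bonus-guardrail roll as data; this is the mechanism only, not the full score() rewrite):

```python
from collections import Counter

# Count how many of each face appeared — the per-face-count view.
counts = Counter([1, 1, 1, 1, 5])
assert counts[1] == 4 and counts[5] == 1

# Combos consume dice: after emitting one Dragon Blast, three 1s are gone...
counts[1] -= 3
assert counts[1] == 1  # ...and the leftover 1 still scores as a single
```

Once the roll is a Counter, “three matching dice” becomes a per-face threshold check instead of something a per-die loop has to remember across iterations.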
🪞 Pause (after green): you just replaced the entire body of score — yet all five previous tests still pass. Spend 30 seconds writing down: why? What property of the previous tests allowed them to survive a full body rewrite of score?
Compare your answer — the property that survived
They assert on observable behavior (total_damage == 100, events == (event,)), not on internals (which loop, which variable name). Behavior tests survive structural rewrites; implementation-tests don’t. This is the Refactoring Litmus Test — and it’s the rule that travels: write tests against contracts, not against shapes.
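The contrast can be made concrete with a hypothetical double function (a stand-in for illustration, not part of scorer.py): an assertion pinned to internals breaks under a behavior-preserving rewrite, while a contract assertion survives.

```python
def double(x: int) -> int:
    total = x * 2  # internal detail: the local variable name
    return total

# Implementation-test (fragile): pinned to an internal name, not the contract.
# Renaming `total` to `result` breaks this even though behavior is unchanged.
assert "total" in double.__code__.co_varnames

# Behavior test (robust): survives any rewrite that preserves the contract.
assert double(21) == 42
```

The first assertion would fail the Refactoring Litmus Test the moment anyone touches the body; the second keeps guarding the behavior through every rewrite.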
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py and test_scorer.py) and write a short message — recommended: Cycle 5: triple 1s = Dragon Blast (Counter).
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
def score(dice: list[int]) -> BattleReport:
events = []
for die in dice:
if die == 1:
events.append(ScoringEvent("Dragon Flame", (1,), 100))
if die == 5:
events.append(ScoringEvent("Lightning Spark", (5,), 50))
return BattleReport(tuple(events))
"""Cycles 1–5 — triple 1s break the per-die loop."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_three_ones_create_dragon_blast_instead_of_three_flames
# score([1, 1, 1]) should return total_damage == 1000 with
# a single Dragon Blast event whose dice_used is (1, 1, 1)
Step 7 — Knowledge Check
Min. score: 80%
1. What makes cycle 5’s test a design-breaking test, as opposed to a normal “add a branch” test?
A design-breaking test is one no local edit can satisfy. The cycle-4 loop is fundamentally per-die; combos are fundamentally per-face-count. You cannot patch your way from one to the other — you have to re-shape the algorithm. That re-shaping is what the test extracts as design pressure.
2. After the count-then-emit refactor, all five previous tests still passed. What does that tell you about the previous tests?
This is the Refactoring Litmus Test concept. Robust tests assert on what the
unit does (report.total_damage == 100, report.events == (event,)), not
on how it does it (assert "for die in dice" in inspect.getsource(score)).
The first survives a rewrite; the second breaks at the slightest refactor.
3. Why does cycle 5’s GREEN code subtract from counts[1] after emitting the Dragon Blast event?
“Combos consume dice, leftovers still score as singles” is one of the rules. The subtraction isn’t a Python quirk — it is the domain rule made explicit in the code. The bonus guardrail test you write at the end of this cycle pins down exactly this leftover behavior.
4. A teammate suggests “let’s just add if Counter(dice)[1] >= 3: ... at the top of the existing per-die loop.” Why is that not the same as the count-then-emit refactor we did?
The refactor is not “add Counter on top of the existing loop.” It is “replace the per-die view with the per-face-count view.” Mixing both is the worst of both worlds: now the function maintains two parallel models of the input, and every future change has to update both. The structural shift was the point.
Cycle 6 — Goblin Swarm → Discover Rule Objects (Big Refactor)
Why this matters
One combo branch in score is fine. Adding a second one — next to the first — makes the duplication ugly enough that “add another if” feels obviously wrong. That ugliness is the design pressure; what it earns is the rule object abstraction (ComboRule + SingleRule with apply()). The Open-Closed Principle stops being a slogan: new behavior is now new data, not new branches. This is the cycle where students stop pattern-matching TDD and start listening to the test.
🎯 You will learn to
- Apply listening to the test — recognize that a duplicate combo branch is the test telling you the structure is wrong
- Create a rule-object abstraction (ComboRule + SingleRule with a uniform apply() interface) under the safety net of seven green tests
- Evaluate the resulting design against the Open-Closed Principle — new behavior added as data, not as branches
Spec: three 2s combine into a Goblin Swarm (200 damage).
Cycle 6 is structurally the most important cycle in this tutorial. The current code handles one combo (Dragon Blast). Cycle 6 will give you a second — and the design pressure of having two will teach you the right abstraction.
The cycle has three phases. Do them in order.
Phase 1 — 🔴 RED
Add test_three_twos_create_goblin_swarm. Mirror the shape of test_three_ones_create_dragon_blast_instead_of_three_flames — only the dice value, the event name, and the damage change. Run.
Phase 2 — 🟢 GREEN (deliberately ugly)
What is the smallest change that turns this test green? Pick it. Type it out. Don’t refactor yet. Run.
Reveal — one shape (open after you've made it green)
A second if counts[2] >= 3: block right next to the first, with the right name, dice, and damage. Yes, the duplication is now visible. That’s the whole point.
Phase 3 — 🔵 REFACTOR (the discovery)
Look at the two combo blocks side by side:
if counts[1] >= 3:
events.append(ScoringEvent("Dragon Blast", (1, 1, 1), 1000))
counts[1] -= 3
if counts[2] >= 3:
events.append(ScoringEvent("Goblin Swarm", (2, 2, 2), 200))
counts[2] -= 3
A — Identify what varies
Write down: what is the same and what is different between the two blocks? (Mental notes are fine.)
Compare your answer
Same: the shape — if counts[X] >= N: emit one event with X repeated N times; counts[X] -= N.
Different: four things — the die value, the count threshold, the event name, the damage.
If your answer captured those four things (your names may differ), it’s right. If you have more than four, look for which two collapse into one. If you have fewer, look for which one is hiding two.
B — Name the entity
Two examples are the minimum needed to see a pattern. The four things that vary are fields of an entity that doesn’t yet have a name. What would you call it? (One that holds: a die value, a count, a name, a damage.) Pick a name; we’ll use ComboRule below.
C — Sketch the entity
A ComboRule carries the four fields and does the work the if-block currently does. The behavior: detect the combo, emit one event, decrement the counts. Move that into a method on the entity. What should the method’s signature be? (Hint: it has to read and mutate the Counter, and return the events it produced — possibly an empty list.)
Write the class header before reading on. Pick a method name that describes what it does to the counts.
Compare your answer — one shape that works
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
if counts[self.die] >= self.count:
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] -= self.count
return events
The method is called apply because it applies the rule to a counter and returns whichever events that produces. Returning a list (possibly empty) generalizes cleanly: cycle 7 will need a single apply() call to emit zero or more events from one input.
D — Replace the blocks with data
Declare the two combo rules as data outside score(). Replace the two if-blocks inside score() with a single iteration over the tuple. The combos are now configuration, not code. Run pytest.
Compare your answer — what `score()` looks like after
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
)
def score(dice: list[int]) -> BattleReport:
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
# ... singles loops still here for now ...
return BattleReport(tuple(events))
All eight tests still pass — the refactor preserved every observable behavior. That’s the Refactoring Litmus Test: behavior-level tests survive structural rewrites.
E — Apply the same recognition to singles
Look at the two for-loops at the bottom of score() (Dragon Flame, Lightning Spark). Same kind of duplication, one field shorter. Apply the same recognition you just did on combos — extract a SingleRule with its own apply(counts) method, declare a SINGLE_RULES tuple, and replace both loops with one iteration. Run pytest. If it goes green, you’ve parallel-transferred the pattern in one shot. If not, debug — that’s the only feedback you need.
F — Cash in the OCP win: add the four remaining combos as data
🪞 Predict first: how many lines inside score() will you change to add four new triple combos (Triple 3 → Orc Charge 300, Triple 4 → Troll Smash 400, Triple 5 → Lightning Storm 500, Triple 6 → Demon Strike 600)? Hold the number.
Now do it: append four rows to COMBO_RULES. Then add one parametrized test (@pytest.mark.parametrize) covering all four. Run.
Why parametrize beats a for-loop inside one test
@pytest.mark.parametrize runs the function once per row, reporting each row as a separate test result. A for loop inside a single test stops at the first failure, hiding everything after it. The parametrize idiom is the right Python answer to “N tests of the same shape” — DRY tests that still report separate failures.
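A minimal parametrize sketch — face_damage here is a hypothetical helper invented for illustration, not part of scorer.py:

```python
import pytest

# Hypothetical helper: singles damage per face (0 for non-scoring faces).
def face_damage(face: int) -> int:
    return {1: 100, 5: 50}.get(face, 0)

# One decorated function, four reported test results. A failure in one
# row does not hide the rows after it.
@pytest.mark.parametrize(
    "face, expected",
    [(1, 100), (5, 50), (2, 0), (6, 0)],
)
def test_face_damage(face, expected):
    assert face_damage(face) == expected
```

pytest reports each row with its parameters in the test id (e.g. test_face_damage[1-100]), so a red row tells you exactly which input broke.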
Why this matters (read after green)
What you just did has a name: listening to the test. The pain of imagining six more hardcoded combo branches was a design signal — the structure no longer fit the problem. The cure was structural extraction: pull the varying parts into data, leave the constant shape as code.
You also just applied the Open-Closed Principle: score() is now closed for modification but open for extension. Phase F made the payoff concrete — four new combos cost zero edits to score(). New behavior arrives as data, not as new branches. score() will not change again for the rest of the tutorial.
And you discovered the right abstraction at the right moment — two examples. One would have been a guess; six would have been six branches you’d have to delete. Refactor toward duplication, not before it, and not after it has rotted (Rule of Two).
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py and test_scorer.py) and write a short message — recommended: Cycle 6: rule objects + all six combos as data.
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
def score(dice: list[int]) -> BattleReport:
counts = Counter(dice)
events = []
if counts[1] >= 3:
events.append(ScoringEvent("Dragon Blast", (1, 1, 1), 1000))
counts[1] -= 3
for _ in range(counts[1]):
events.append(ScoringEvent("Dragon Flame", (1,), 100))
for _ in range(counts[5]):
events.append(ScoringEvent("Lightning Spark", (5,), 50))
return BattleReport(tuple(events))
"""Cycles 1–6 — Goblin Swarm forces the rule-object refactor."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_three_twos_create_goblin_swarm
# score([2, 2, 2]) should return total_damage == 200 with
# a Goblin Swarm event whose dice_used is (2, 2, 2)
#
# TODO (Phase F, after the rule-object refactor): write a parametrized
# test_other_triples_create_combo_events using @pytest.mark.parametrize
# that covers the four remaining triples — Orc Charge (300), Troll Smash
# (400), Lightning Storm (500), Demon Strike (600). And append the
# matching ComboRule rows to COMBO_RULES in scorer.py.
Step 8 — Knowledge Check
Min. score: 80%1. What does “listening to the test” mean as a refactoring heuristic?
“Listen to the test” is a foundational TDD heuristic: when adding a test or a
branch hurts disproportionately, the structure is telling you something. In
cycle 6, the pain of imagining six more if counts[X] >= 3: branches was the
signal. The fix was a structural shift to data, not more code.
2. Why is putting rules in a COMBO_RULES tuple of ComboRule instances better than keeping them as if counts[X] >= 3: branches inside score()?
The Open-Closed Principle (OCP) says modules should be open for extension and
closed for modification. The rule-object refactor is the OCP in five lines:
score() no longer changes when you add a new triple combo — only the data
does. Phase F of this cycle demonstrated the payoff by adding four combos at once.
3. A teammate says: “You should have done the rule-object refactor in cycle 5, when you first introduced combos. Why wait?” What is the most defensible answer?
Two data points are the minimum for seeing the right shape of an abstraction.
With only Dragon Blast (cycle 5), you’d be guessing whether the variation is
“the die that triggers” or “the count required” or “the name.” Cycle 6’s
Goblin Swarm provides the second data point, and only then is ComboRule(die,
count, name, damage) a discovery rather than a guess. Refactor toward
duplication, not before it.
4. After the rule-object refactor, all seven previous tests still pass. Why is this expected, and what does it confirm about the previous tests?
The seven previous tests asserted on report.total_damage, report.events,
and ScoringEvent instances. None of them asserted on internal structure.
That is precisely why they survived a refactor that replaced the entire
implementation strategy. The Refactoring Litmus Test: behavior tests survive
structural change; implementation tests don’t.
Cycle 7 — Six 1s → Two Dragon Blasts (Hidden Bug)
Why this matters
Every previous combo test used exactly three of a face — so every previous combo test passed and hid a bug. Six 1s should be two Blasts (2000 damage); your current ComboRule.apply emits one Blast plus three Flames (1300). Line coverage said the if ran, but a line being executed is not the same as a line being right for all relevant inputs. This is the gap between coverage and boundary-value analysis — and it’s the cycle where you experience first-hand the kind of bug TDD literature reports: a defect the developer doesn’t know exists in code they wrote themselves.
🎯 You will learn to
- Apply boundary-value analysis to predict where existing tests under-pin a behavior (exactly N covered; 2N and beyond not)
- Analyze the gap between line coverage and behavioral correctness — coverage locates under-tested code; it does not measure correctness
- Create a fix to ComboRule.apply using // and %= so it emits zero-or-more combos per call with correct leftover bookkeeping
Spec: score([1, 1, 1, 1, 1, 1]) produces two Dragon Blasts (2000 damage).
🪞 Predict first (don’t open the reveal yet). Look at your ComboRule.apply and trace through six 1s by hand. Write down the damage your current code produces. The whole pedagogical value of this step depends on the order: predict before peeking.
Reveal — what the current code actually does (open AFTER tracing)
The if counts[1] >= 3: runs once. It emits one Blast and counts[1] -= 3 leaves counts[1] == 3. Those three 1s fall through to SingleRule, emitting three Flames. Total: 1000 + 300 = 1300 damage.
But six 1s should be two Blasts → 2000 damage. The code is wrong — and no previous test caught it, because every prior combo test used exactly three of a face.
This is the kind of bug TDD literature reports: a defect the developer doesn’t know exists in code they wrote themselves. The test surfaces it.
Your task
- 🔴 RED — add test_six_ones_create_two_dragon_blasts asserting 2000 damage and two Blast events. Run; see the wrong events tuple.
- 🟢 GREEN — fix ComboRule.apply so it can emit zero or more combos per call, with the correct leftover bookkeeping. Before you code, write the formula on paper for: how many full combos do n dice of one face produce? How many leftover dice?
- 🔵 REFACTOR — re-run. Especially gratifying: cycle 5’s leftover guardrail still passes — the fix only changed behavior on cases no prior test pinned down.
Why this matters: coverage vs. boundary thinking
Every previous combo test used exactly count dice (three 1s, three 2s, etc.). The bug only manifests at 2 × count and beyond. Line coverage told you the if ran. It didn’t tell you the line was right for all relevant inputs.
That’s the gap between coverage and boundary-value analysis: every behavior has boundaries (0, exactly N, 2N, between N and 2N) and a healthy suite probes each. Coverage is a locator of under-tested code; it isn’t a measure of correctness. The rule that travels: if a behavior isn’t on the test list, code for it isn’t earned.
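The leftover formula is two integer operations. A sketch of the arithmetic only (one way to express it, not the tutorial’s final ComboRule.apply):

```python
# How many full combos do n dice of one face produce, and how many are left over?
# divmod(n, combo_size) answers both at once: n // combo_size and n % combo_size.
combo_size = 3

assert divmod(3, combo_size) == (1, 0)  # exactly N: one combo, no leftovers
assert divmod(5, combo_size) == (1, 2)  # between N and 2N: one combo, two leftovers
assert divmod(6, combo_size) == (2, 0)  # 2N — the boundary no prior test probed
```

Those three inputs are exactly the boundary partitions the text names: at the boundary, between boundaries, and at the first multiple the old single-if code got wrong.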
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (scorer.py and test_scorer.py) and write a short message — recommended: Cycle 7: fix multi-combo bug (six 1s = two Blasts).
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
if counts[self.die] >= self.count:
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] -= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
ComboRule(3, 3, "Orc Charge", 300),
ComboRule(4, 3, "Troll Smash", 400),
ComboRule(5, 3, "Lightning Storm", 500),
ComboRule(6, 3, "Demon Strike", 600),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def score(dice: list[int]) -> BattleReport:
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""Cycles 1–7 — surfacing the multi-combo edge case."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
@pytest.mark.parametrize(
"roll, expected_event",
[
([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
],
)
def test_other_triples_create_combo_events(roll, expected_event):
report = score(roll)
assert report.total_damage == expected_event.damage
assert report.events == (expected_event,)
# TODO (RED): write test_six_ones_create_two_dragon_blasts
# score([1, 1, 1, 1, 1, 1]) should return total_damage == 2000
# and events containing TWO Dragon Blast ScoringEvents
Step 9 — Knowledge Check
Min. score: 80%
1. The “one Dragon Blast for any number of 1s ≥ 3” bug was sitting in your code from cycle 5 onward — three cycles. Why did no previous test catch it?
Boundary thinking. Every previous combo test gave the rule exactly count
dice. The bug was on the boundary where counts[X] >= 2 * count. Without a test
on that boundary, the code would have shipped wrong — and the developer would
have been completely confident in it. This is exactly the kind of failure mode
empirical TDD studies report.
2. What is the most defensible general lesson from cycle 7 about test-suite design?
Coverage told us we executed the line. It did not tell us whether the line was right for all relevant inputs. Boundary-value analysis (the partition tradition you encountered in Foundations) complements coverage: every behavior has a “just-at-the-boundary,” “just-past-the-boundary,” and “well-past-the-boundary” input — and a healthy suite probes each.
3. Cycle 5’s bonus leftover guardrail ([1, 1, 1, 1, 5]) still passes after cycle 7’s refactor. Why? Be precise.
This is a subtle but important point: a refactor does not have to change every
output. The two implementations of apply produce the same answer for any
input where the old answer was correct — and only the new answer is correct
where the old was wrong. That is what makes the cycle-7 fix a generalisation
rather than a replacement.
4. A teammate looks at cycle 7’s fix and says: “This bug only existed because we extracted ComboRule. If we’d kept the inline if counts[1] >= 3: from cycle 5, we wouldn’t have had this issue.” Are they right?
The bug is the algorithm’s bug, not the abstraction’s. Whether the if/-=
sits in score() or in ComboRule.apply makes no difference to its behavior.
The lesson is honest: TDD does not prevent every bug; it surfaces them when
tests probe the right inputs. The rule-object refactor was about future-
proofing the design, not correctness of cycle 5’s logic.
Transfer Cycle — TDD on FizzBuzz (Different Domain, Same Rhythm)
Why this matters
Seven Dragon-Dice cycles risk teaching you “TDD works because dice are compositional.” The transfer cycle disproves that — same rhythm, totally different domain (FizzBuzz), and no scaffolding from us: you write your own test list (Canon TDD step 1), you order the items, you drive the cycles. Janzen & Saiedian’s “residual effect” predicts that this is where the rhythm finally feels natural — earned, not preached. The compression of seven Dragon-Dice cycles into ~four FizzBuzz mini-cycles is itself the test of mastery.
🎯 You will learn to
- Create your own Canon TDD test list for an unfamiliar problem, ordered from simplest to most design-breaking
- Apply RED-GREEN-REFACTOR to FizzBuzz with no instructor scaffolding — driving the cycles yourself in compressed form
- Evaluate the structural parallels between each FizzBuzz move and the Dragon-Dice cycle it mirrors (Variation Theory generalization)
You’ve completed seven cycles of TDD on dice scoring. The risk: “TDD only works because Dragon Dice is a naturally compositional domain.” The way to disprove that risk is to apply the same rhythm to a completely unrelated problem — right now, in compressed form.
The classic spec — FizzBuzz
fizzbuzz(n) returns a list of strings of length n. For each integer i from 1 to n:
- If i is a multiple of 15 → "FizzBuzz"
- else if i is a multiple of 3 → "Fizz"
- else if i is a multiple of 5 → "Buzz"
- else → str(i)
So fizzbuzz(5) == ["1", "2", "Fizz", "4", "Buzz"].
Canon TDD step 1 — write your own test list
Before reading any further, take 60 seconds and write your own test list. What behaviors does the spec define? Order them from simplest to most design-breaking — the ordering cycles 1–7 of Dragon Dice followed implicitly. You did it implicitly throughout the tutorial; now you do it explicitly.
📋 One possible test list (open ONLY after you've written your own — 60 seconds first)
A natural ordering, simplest first:
- ☐ fizzbuzz(0) == []
- ☐ fizzbuzz(1) == ["1"]
- ☐ fizzbuzz(2) == ["1", "2"]
- ☐ fizzbuzz(3)[-1] == "Fizz"
- ☐ fizzbuzz(5)[-1] == "Buzz"
- ☐ fizzbuzz(15)[-1] == "FizzBuzz"
- ☐ fizzbuzz(-1) raises ValueError
Compare with your list. Did you have the same items? In the same order? (The reflection at the bottom of this step asks you to map each item to its Dragon-Dice parallel — don’t peek there yet either.)
Your task — drive the cycles yourself (~10–15 minutes)
Pick the simplest unimplemented item from your list. Convert only that one item into a runnable test in test_fizzbuzz.py. Make it pass with the simplest code (TPP — start with constants, not loops; the tests will force the loop when ready). Refactor on green. Pick the next item. Repeat.
Don’t try to handle all rules at once. One test at a time. (Beck’s Canon TDD is explicit on this — converting all list items to tests up-front “leads to rework and depression.”)
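If you want to see the shape of the very first mini-cycle before driving your own, here is one sketch, assuming you start with the fizzbuzz(0) == [] item (both "files" are shown in one snippet for readability):

```python
# --- test_fizzbuzz.py, written FIRST: this is the RED step ---
def test_zero_returns_empty_list():
    assert fizzbuzz(0) == []

# Running pytest now fails: fizzbuzz does not exist yet. That is the
# right-reason RED. Then, and only then, write the smallest GREEN:

# --- fizzbuzz.py (TPP: a constant, not a loop) ---
def fizzbuzz(n):
    return []

test_zero_returns_empty_list()  # now green
```

The next item on your list, fizzbuzz(1) == ["1"], is what forces the constant to grow; resist writing the loop before a test demands it.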
What you’re doing here
You are applying what you learned. There is no instructor-provided RED test, no GREEN scaffold, no REFACTOR checklist. The cycle discipline is now yours. If the rhythm feels familiar — that’s the threshold concept doing its work.
🛟 Stuck? (Open only after at least 5 minutes of trying)
The hard test is multiple of 15. Before reading further, ask yourself: which earlier Dragon-Dice cycle had a test that the previous structure couldn’t satisfy with a local edit? What was the move there?
Hint without the answer: trace by hand what your current code returns for i=15. Why? Then ask what you’d change.
If you've named the structural pressure yourself — open for two known options
- Order matters: check i % 15 == 0 first, then % 3, then % 5. Simplest TPP move.
- String concatenation: build up the result — start empty; if divisible by 3, append "Fizz"; if by 5, append "Buzz"; if still empty, use str(i).
Either passes the test list. Pick one; if a future requirement makes the other fit better, you’ll refactor toward it.
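Both options, sketched minimally for a single value i (label_ordered and label_concat are hypothetical helper names; your fizzbuzz would call one of them in its loop):

```python
def label_ordered(i: int) -> str:
    # Option 1: order matters, the most specific check comes first.
    if i % 15 == 0:
        return "FizzBuzz"
    if i % 3 == 0:
        return "Fizz"
    if i % 5 == 0:
        return "Buzz"
    return str(i)

def label_concat(i: int) -> str:
    # Option 2: build the string up; "FizzBuzz" falls out of 3 and 5 composing.
    out = ""
    if i % 3 == 0:
        out += "Fizz"
    if i % 5 == 0:
        out += "Buzz"
    return out or str(i)

# Both satisfy the test list, including the multiple-of-15 case.
assert label_ordered(15) == label_concat(15) == "FizzBuzz"
assert label_ordered(7) == label_concat(7) == "7"
```

Note how option 2 makes the 15 case disappear as a special case entirely; that is the "compositional" smell worth remembering for future specs.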
Reflection (after green — this is the heart of the step)
Compare the FizzBuzz cycles you just did with the Dragon Dice arc. Write your answers before opening the reveal.
- Which Dragon-Dice cycle’s RED moment does FizzBuzz’s “multiple of 3” test echo?
- Which Dragon-Dice cycle does FizzBuzz’s multiple of 15 test parallel?
- What was different?
- What was the same? (Try to name 3–4 invariants of the rhythm.)
Compare your invariants — order doesn't matter, but check each is in your version somewhere
Items that should appear in your “same” list:
- The rhythm itself (RED → GREEN → REFACTOR, one test at a time)
- Test-list discipline (Canon TDD step 1 — a list before any tests)
- RED-as-success (the failing test is the deliverable, not a problem)
- Refactor-toward-duplication (Rule of Two; wait for the second example)
- TPP — smallest transformation that passes the failing test
- Allow-the-ugly-first-GREEN (don’t pre-design the abstraction)
If your answer captured the rhythm and the discipline, you have the threshold concept. TDD is a rhythm, not a problem-specific technique — you just demonstrated it on a problem with no shared code with Dragon Dice.
📦 Commit your progress
Before moving on, commit this cycle. Stage only the files you actually changed (fizzbuzz.py and test_fizzbuzz.py) and write a short message — recommended: Transfer cycle: FizzBuzz via TDD.
# Empty by design — TDD says: write a failing test first.
# Build this file up, one test cycle at a time.
"""Your TDD cycles for FizzBuzz.
Pick the simplest test case from your test list. Write it.
Watch it fail. Make it pass with the simplest code (TPP).
Refactor on green. Pick the next one. Repeat.
Suggested first test: fizzbuzz(0) == []
"""
import pytest
from fizzbuzz import fizzbuzz
# Write your tests below — one per behavior in your list.
Step 10 — Knowledge Check
Min. score: 80%
1. FizzBuzz’s multiple of 15 test most directly parallels which Dragon-Dice cycle — and why?
Cycle 5 is the right parallel. The signature of a design-breaking test: previous cycles produced code that looks correct, but the new test cannot be satisfied by a local edit — the structure has to change. For Dragon Dice that meant moving from per-die iteration to count-then-emit. For FizzBuzz that meant either reordering the conditionals (so % 15 is checked first) or switching to a compositional string builder. Either way, the existing structure had to accept a behavior the previous tests didn’t anticipate. That feeling — “I can’t tweak this; I have to restructure” — is a TDD rhythm invariant, not a Dragon-Dice quirk.
2. A teammate, who has never done TDD, pushes back: “FizzBuzz is just trivia — the real work is the algorithm. Why did you spend 12 minutes on the cycles instead of 30 seconds typing the obvious solution?” Pick the strongest reply.
The honest answer combines two things: (a) for small problems with experienced devs, the time cost of TDD is roughly the same as ad-hoc coding — the same number of tests get written either way, just in a different order; (b) the durable benefit is the test suite as a regression net plus the rhythm of small steps preventing speculative complexity. On a toy, the gain is small. On long-lived code modified by multiple people it’s the difference between a 39–91% drop in pre-release defects (Microsoft/IBM, Williams et al. 2008) and not. TDD isn’t a religion — it’s a tradeoff with strong empirical evidence in specific contexts (long-lived, multi-author, behavior-rich domains). FizzBuzz lets you practice the rhythm cheaply, where the stakes are low.
3. Compare doing FizzBuzz now to what doing it before the Dragon Dice tutorial would have been like. What’s the most likely difference?
The Janzen & Saiedian study (230+ programmers): the residual effect of having actually practiced TDD is preference for it afterward — not because it was preached, but because the rhythm came to feel natural. A FizzBuzz attempt today should feel categorically different from a same-instructions attempt before this tutorial: not just because you know what to type, but because the rhythm itself — pause for the test, allow the ugly first GREEN, wait for two examples — now has a felt naturalness. That felt naturalness is the threshold concept stuck in your hands, not just your head. The Dragon-Dice-specific tools (@dataclass, Counter) didn’t transfer; they were never the lesson. The rhythm transferred — that’s the whole lesson.
The Big Picture — Seven Cycles and a Transfer
Why this matters
The cycles taught the rhythm one beat at a time; this step asks whether you can hear the whole song. You’ll synthesize the journey from memory before any reveals, recalibrate your own confidence in writing, and probe whether the discipline transfers to a real piece of code from your own work — not “I’d write more tests” but a specific bug it would have caught. The final quiz is mixed retrieval across all seven cycles, the way Bjork’s spacing principle predicts will make the rhythm last.
🎯 You will learn to
- Analyze the seven-cycle journey by recalling, from memory, three design moves and the test that forced each one
- Evaluate your own confidence to apply Red-Green-Refactor unaided on a problem you haven’t seen before
- Apply the rhythm to one specific piece of your own code — naming what TDD would have prevented in concrete terms
Seven Dragon-Dice cycles. Then an eighth on a totally different problem — FizzBuzz — driven by you with no scaffolding. Every line in your final scorer.py is justified by a test; every line in your fizzbuzz.py is too; and the rhythm that produced both is the same rhythm.
🪞 Synthesise yourself (≈5 min, before opening any reveals or taking the quiz)
The recap material — takeaways, journey table, anti-patterns, empirical case — is collapsed below. You only get one shot at synthesising while it’s still fresh. Do this part with your editor scrolled away from scorer.py.
(1) Recall three design moves, from memory. Name three cycles and, for each, the design move the test forced. Don’t say “the loop refactor” — say which test broke the previous structure and why.
(2) Pick the cycle that surprised you. Which cycle’s RED moment changed how you thought about a structural choice? Why? (One sentence.)
(3) Confidence recalibration — write a number on a sticky note (or in chat). On a 1–5 scale: “I could apply Red-Green-Refactor to a problem I haven’t seen before, this week, without this tutorial open.” Pick a number; anchor it in writing. Re-rating from a half-remembered number isn’t recalibration. We’ll re-check after the quiz.
(4) Transfer probe. Name one specific piece of code or project of yours — a class assignment, a side project, a past bug — where the rhythm you just learned would have helped, and what specifically it would have prevented. (“It would have caught X” is concrete; “I’d write more tests” is not.)
Then take the quiz below — before opening any of the reveals.
The reveals after the quiz are for comparison, not for study. Treat them like an answer key: open them after committing to your own answers.
Reveal — fill-in-the-blank journey table (open after recall #1)
Cover the right column and predict the lesson for each cycle from memory. Then read across.
| Cycle | Behavior | Design move | Lesson |
|---|---|---|---|
| 1 | Empty roll | First class + function | RED for the right reason |
| 2 | Single 1 | ScoringEvent introduced | Allow the ugly first GREEN |
| 3 | Single 5 | Second hardcoded branch | Refactor toward duplication |
| 4 | Repeated singles | Per-die loop | First real refactor; tests enable change |
| 5 | Mixed dice | (no production change) | Free pass — verify with the mutation move |
| 6 | Triple 1s | Counter, count-then-emit | Design-breaking test; structural shift |
| 7 | Combo + leftovers | (no production change) | Guardrail tests for implicit correctness |
| 8 | Triple 2s | Rule objects | Listening to the test; Open-Closed |
| 9 | Other triples | Append data | Refactor pays off; parametrize |
| 10 | Six 1s | // and %= | Hidden edge case; boundary > coverage |
| 11 | Invalid dice | pytest.raises | Robustness is first-class |
| 12 | Summary | Method on BattleReport | Behavior on the existing object |
| Transfer | FizzBuzz | (different domain) | The rhythm transfers — TDD isn’t problem-specific |
Reveal — six takeaways that travel
- TDD is design, not testing. The test is the contract; the implementation emerges under its pressure.
- Refactor toward duplication, not before it (Rule of Two). One example is a guess at the shape; two makes the variation visible; three or more is duplication that has rotted. Cycle 6’s timing was Rule of Two — and it generalizes to every refactor you’ll do.
- Tests enable change. Behavior-level assertions survive structural rewrites; implementation-coupled tests don’t.
- Coverage ≠ correctness; complement with boundary-value analysis (zero, exactly N, 2N, between).
- Listen to the test. Pain in writing a test usually points at the production code.
- If a behavior isn’t on the test list, code for it isn’t earned. Speculative scaffolding (validation, error handling, hypothetical inputs) waits until a test demands it.
Reveal — when TDD shines, when it's overkill
Pick one. One minute each. Which would TDD have helped less?
(a) You’re writing a function that classifies an image as cat-or-dog by calling a pretrained model. The output is a probability, judged by humans on edge cases.
(b) You’re writing a function that adds a new currency to a payment processor. The behavior is precisely specified.
Compare your answer
TDD shines on (b): new features with clear behavioral requirements; complex logic with branching cases; long-lived code modified by multiple people; API design; domains where regressions hurt (payments, scoring, calculations).
TDD is overkill on (a): one-off throwaway scripts; exploratory prototyping; UI layout; non-binary outcomes (ML accuracy, image recognition); Jupyter research.
Even on (a), some tests pay off — the question is whether to write them first. Kent Beck: “the discipline of working strictly test-first is valuable but not necessarily something you want to do all the time.”
Reveal — TDD anti-pattern taxonomy (cover the right column; predict the antidote)
| Level | Anti-pattern | What it looks like | Antidote (predict before reading) |
|---|---|---|---|
| I | The Liar | Test passes but asserts vacuously (isinstance(x, int) only) | Cycle 4’s mutation move |
| I | The Nitpicker | Asserts on private attributes / implementation details | Assert on observable behavior |
| II | Success Against All Odds | New test passes immediately, with no investigation | Verify with mutation |
| II | Skip-the-Refactor | Stop at green; never enter REFACTOR | Make the look mandatory |
| III | The Giant | One test asserts dozens of behaviors | One behavior per test |
| III | Excessive Setup | 30+ lines of fixture before one assertion | Decouple production code |
| IV | The Mockery | More mock setup than test logic | Listen — the design is wrong |
| IV | Modify-the-Test | AI rewrites the test to match buggy code | Own the spec yourself |
Higher level = more architectural smell. Listen to the test.
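A Level-I Liar in miniature, with the mutation move that exposes it (total_damage here is a hypothetical, deliberately broken function, not the tutorial's scorer):

```python
def total_damage(dice):
    # A "mutant": deliberately wrong implementation that always returns 0.
    return 0

# The Liar passes against the mutant: it asserts nothing about the value.
liar_passes = isinstance(total_damage([1, 1]), int)
assert liar_passes  # green, but meaningless

# An honest assertion pins the actual value, so it fails against the mutant.
honest_passes = (total_damage([1, 1]) == 200)
assert not honest_passes  # the mutation move: the mutant must be killed
```

If a test stays green while you sabotage the production code, the test was lying; that check is exactly what mutation-testing tools automate.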
Reveal — the empirical case for TDD
| Study | Finding |
|---|---|
| Microsoft & IBM (Nagappan et al., 2008) | 39–91% decrease in pre-release defect density in TDD teams |
| Same studies | 15–35% longer initial development; offset by reduced debugging |
| Erdogmus et al. (2005) | Test-first students wrote more tests AND were more productive per test |
| Janzen & Saiedian (ICSE 2007) | Even programmers who resisted test-first adopted it more after exposure — the Residual Effect |
| Fucci et al. (2017) | TDD’s benefit comes from granularity + uniformity, not strict test-first ordering — your seven tiny cycles embody both |
Caveat: mixed for solo programmers on short tasks. Strongest in team settings, with CI, on long-lived systems.
Reveal — what to learn next (the same rhythm, scaled up)
- Fixtures (@pytest.fixture) for reusable setup of objects, DBs, mock APIs
- Mocks, fakes, stubs — with a strong default toward fakes over mocks
- Property-based testing with Hypothesis — score(any list of 1–6) should always satisfy invariants
- Mutation testing with mutmut or cosmic-ray — automate the cycle-4 mutation move across the whole suite
- The Outside-In / Double-Loop pattern (Percival, Obey the Testing Goat) — high-level acceptance tests drive unit tests
Each lives inside the same Red-Green-Refactor rhythm you just internalised.
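The property-based idea can be sketched without installing anything; Hypothesis's @given automates the input generation and shrinks failing cases for you. (fizzbuzz here is a hypothetical minimal implementation, included only so the invariants have something to run against.)

```python
import random

def fizzbuzz(n):
    # Minimal reference implementation for the invariant checks below.
    out = []
    for i in range(1, n + 1):
        s = ("Fizz" if i % 3 == 0 else "") + ("Buzz" if i % 5 == 0 else "")
        out.append(s or str(i))
    return out

# Invariants that must hold for ANY n: the property-based mindset.
for _ in range(100):
    n = random.randint(0, 200)
    result = fizzbuzz(n)
    assert len(result) == n                                          # length
    assert all(result[i - 1] == "FizzBuzz"
               for i in range(15, n + 1, 15))                        # every 15th
    assert all(not (r.isdigit() and int(r) % 3 == 0) for r in result)  # no leaks
```

Example-based tests pin specific points; properties pin whole regions of the input space. They complement, not replace, the tests you wrote in the cycles.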
🪞 Recalibrate (after the quiz)
Re-rate confidence on the same 1–5 prompt. Look at your sticky. The gap is the data — feelings of progress are unreliable; the gap is signal.
And revisit your transfer probe answer: is the code you named still the right next place to apply this, or did the quiz/recap shift it? Whichever piece of code you end up picking — start it RED.
# The full, seven-cycle implementation lives here. Use this step's
# editor to scroll through what you built — every line is justified by
# a test in test_scorer.py. There is no speculative code.
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self) -> int:
return sum(event.damage for event in self.events)
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
number_of_combos = counts[self.die] // self.count
for _ in range(number_of_combos):
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] %= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts: Counter) -> list[ScoringEvent]:
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
ComboRule(3, 3, "Orc Charge", 300),
ComboRule(4, 3, "Troll Smash", 400),
ComboRule(5, 3, "Lightning Storm", 500),
ComboRule(6, 3, "Demon Strike", 600),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def score(dice: list[int]) -> BattleReport:
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""All seven cycles, all green. Read it as a contract."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
@pytest.mark.parametrize(
"roll, expected_event",
[
([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
],
)
def test_other_triples_create_combo_events(roll, expected_event):
report = score(roll)
assert report.total_damage == expected_event.damage
assert report.events == (expected_event,)
def test_six_ones_create_two_dragon_blasts():
report = score([1, 1, 1, 1, 1, 1])
assert report.total_damage == 2000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
Step 11 — Knowledge Check
Min. score: 80%
1. Which single statement most accurately captures TDD?
The threshold concept of the tutorial: TDD is design, not testing. Cycle 6’s rule-object refactor emerged under cycle-6’s test pressure, not from a whiteboard. The test suite is a (valuable) byproduct.
2. For which scenario is TDD most beneficial?
Payment processing has every property TDD rewards: complex logic, strict correctness, multiple developers, long life. Microsoft/IBM measured 39–91% defect-density reductions exactly here.
3. A test file contains:
def test_user_creation():
user = User("Alice", "alice@example.com")
assert user._name == "Alice"
assert user._email == "alice@example.com"
The Nitpicker. Leading underscores signal “private — implementation detail.” Tests that touch private state break on every refactor. Assert on observable behavior via the public API.
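The behavior-level rewrite, sketched (User here is a hypothetical class with a public read API; the point is that the test no longer touches underscore-prefixed state):

```python
class User:
    def __init__(self, name: str, email: str):
        self._name = name      # private storage: free to change later
        self._email = email

    @property
    def name(self) -> str:
        return self._name

    @property
    def email(self) -> str:
        return self._email

def test_user_creation():
    user = User("Alice", "alice@example.com")
    # Assert through the public API: survives a rename of _name/_email,
    # a switch to __slots__, or storage in a dict.
    assert user.name == "Alice"
    assert user.email == "alice@example.com"

test_user_creation()
```

The test now reads as a contract on behavior, so any internal refactor that preserves the contract keeps it green.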
4. The “one Dragon Blast for any count ≥ 3” bug lived in cycles 5 and 6 before cycle 7 surfaced it. Why didn’t line coverage catch it?
Coverage tells you the line ran, not that it was right. Every prior combo
test used exactly count dice; nothing probed 2 × count. Boundary thinking
is the discipline that catches it. Coverage is a locator of under-tested
code, not a measure of correctness.
5. (Design a test list from a spec.) A teammate hands you this spec for a function is_valid_password(password: str) -> bool:
Returns True iff all of these hold: length is at least 8 and at most 64; contains at least one digit (0–9); contains at least one symbol from !@#$%; otherwise returns False. The empty string is invalid.
Apply Canon TDD step 1 — write a test list, in the order you’d implement it. Which list below best honours TDD discipline (simplest-first, boundary-aware, behavior-named)?
Why option A wins: it follows the same shape as Dragon Dice’s seven cycles — simplest case first (typical valid input), then each rule rejected in isolation (no digit, no symbol, too short), then boundaries on the length range (7 invalid, 8 valid, 64 valid, 65 invalid — the boundary-pair discipline from the Testing Foundations prerequisite), then the empty-string edge. Each test name reads like a one-line bug report.
Why there’s no single right answer: your exact ordering may differ (e.g., empty-string first as the simplest-of-all). The skill being practiced here is generating a structured plan from a spec — the same skill you’ll use on every function you TDD outside this tutorial. The four candidate lists above contrast different wrong-answer patterns: random-instead-of-systematic, wrong-rules, and the Giant antipattern.
If you found yourself drafting your own list before scrolling to the options — even better. That’s exactly Canon TDD step 1 in your hands.
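Option A's boundary-pair discipline, sketched as code (is_valid_password here is a hypothetical reference implementation consistent with the spec above):

```python
def is_valid_password(password: str) -> bool:
    # Reference implementation of the spec: length 8..64, >=1 digit, >=1 symbol.
    return (8 <= len(password) <= 64
            and any(c.isdigit() for c in password)
            and any(c in "!@#$%" for c in password))

# Boundary pairs on length (7/8 and 64/65), holding the other rules satisfied.
base = "a1!"                                   # digit + symbol present
assert not is_valid_password(base + "a" * 4)   # length 7: just short
assert is_valid_password(base + "a" * 5)       # length 8: just valid
assert is_valid_password(base + "a" * 61)      # length 64: just valid
assert not is_valid_password(base + "a" * 62)  # length 65: just long

# Each rule rejected in isolation, plus the empty-string edge.
assert not is_valid_password("abcdefg!")       # no digit
assert not is_valid_password("abcdefg1")       # no symbol
assert not is_valid_password("")               # empty: invalid
```

Notice the shape: a helper that satisfies every other rule, so each assertion varies exactly one thing. That is the same one-behavior-per-test discipline the Giant anti-pattern violates.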