TDD with pytest — Dragon Dice Battle
Live the Red-Green-Refactor rhythm by growing a fantasy dice-scoring engine from scratch — twelve tiny tests, one design pressure at a time. Tests drive design, refactors stay safe, and the rules of the game emerge as code.
Cycle 1 — RED: Write the Failing Test
Prerequisite: Testing Foundations — pytest discovery,
assert, partitions, behavior-not-implementation. If those feel new, do that one first.
You’ll grow a small dice-scoring engine over 15 steps using only Test-Driven Development.
Test-Driven Development in one minute
TDD is a design technique that uses tests as the medium of pressure. You write code in short cycles of three phases:
| Phase | What you do | Why this phase exists |
|---|---|---|
| 🔴 RED | Write one failing test that names a behavior you want | Forces the interface and expected behavior to be decided before any logic exists |
| 🟢 GREEN | Write the smallest code that makes the test pass | Resists speculative design; only build what a test demands |
| 🔵 REFACTOR | Improve the code while all tests stay green | The safety net lets you reshape structure without fear of regression |
Each phase of cycle 1 is its own tutorial step so the rhythm becomes a felt sequence, not a slogan. From cycle 2 onward, each cycle is one step containing all three phases.
Why a failing test is the goal of RED
A failing test is not a bug — RED is the expected starting state of every cycle. But the failure has to come from the behavior under test, not from a typo:
- Right reason —
ImportError,AttributeError, a value-mismatch on the assertion you wrote. The test correctly says “this behavior does not exist yet.” - Wrong reason —
SyntaxError, missing colon, misspelledtest_prefix. The test never ran. You’ve learned nothing about the unit, only about your typing.
Students commonly delete a failing test to make the bar green. We’re leaning into that discomfort instead. Learning to read the failure is what TDD trains.
The shape of every pytest test
Every pytest test you write has the same four-part structure:
| Part | What it does |
|---|---|
| Import the unit under test | Tells Python what code you’ll call |
Define a function whose name starts with test_ |
pytest only discovers functions matching this pattern |
| Arrange + act | Set up any input and call the unit |
| Assert an observable property of the result | Pin down one thing the spec promises |
The pattern generalises; the specifics (what to import, call, and assert) come from the spec — and only from the spec.
The dragon dice rules (reference for all 12 cycles)
| Roll | Event | Damage |
|---|---|---|
| Single 1 | Dragon Flame | 100 |
| Single 5 | Lightning Spark | 50 |
| Triple 1 | Dragon Blast | 1000 |
| Triple 2 | Goblin Swarm | 200 |
| Triple 3 | Orc Charge | 300 |
| Triple 4 | Troll Smash | 400 |
| Triple 5 | Lightning Storm | 500 |
| Triple 6 | Demon Strike | 600 |
Triples consume three dice; leftover 1s and 5s still score as singles. Dice are integers 1–6. Today you implement only the empty-roll case. Eleven more cycles add the rest.
Cycle 1’s spec
An empty roll produces a battle report with zero damage and no events.
That sentence names everything you need: a function score, a return value with total_damage and events attributes. Translate it into a pytest test using the four-part shape.
Your task
- In
test_scorer.py(right pane), fill in the three sub-goal comments. Leavescorer.pyempty — its code belongs to the GREEN step. - Predict the exact failure message you’ll see, then click Run.
- Expected output:
ImportError: cannot import name 'score'(orModuleNotFoundErrorifscorer.pyis empty). That IS the deliverable — RED for the right reason.
Detail: why a tuple, not a list
The empty events case is written as () (an empty tuple), not [] (an empty list). Tuples are immutable — they can’t be mutated by accident, and they’re safe dataclass defaults (cycle 2 uses that). Every test in this tutorial that pins down events uses a tuple.
References
- pytest — Get started
- Python —
assertstatement - If you get stuck on the test, click Show Solution below the editor for one possible form.
Self-test before clicking Next
Close the page in your head and answer:
- What two pieces of information does pytest’s failure message give you?
- Why is “RED for the right reason” worth distinguishing from “RED for any reason”?
- If you had to write a test for a
multiply(a, b)function from scratch, what four parts would it contain?
# Cycle 1 RED phase — DO NOT WRITE PRODUCTION CODE HERE YET.
#
# The next tutorial step (Cycle 1 GREEN) is where the BattleReport
# class and the score function are introduced. Right now we are
# only writing the failing test on the right.
"""Cycle 1 RED — write the first failing test.
The sub-goals below describe the PURPOSE of each line you need to add,
not the syntax. Translate the spec ("an empty roll has no damage and no
events") into pytest assertions yourself. If you get stuck, consult the
rules table and the references in the instructions panel.
"""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
# Sub-goal: call the unit under test on the simplest input the spec mentions
# — and capture the result so the next two lines can inspect it.
# Sub-goal: pin down what the spec says about damage in this case.
# Sub-goal: pin down what the spec says about events in this case.
pass
Cycle 1 — GREEN: Make It Pass
The GREEN rule: write the smallest code that makes the failing test pass. Anything more is speculative design — code with no test demanding it.
The test is your contract
Every line of the test is an obligation your code must satisfy:
| Line of the test | What your code must provide |
|---|---|
from scorer import score |
A score name in scorer.py |
report = score([]) |
score returns something |
assert report.total_damage == 0 |
That something exposes total_damage, equal to 0 |
assert report.events == () |
…and events, equal to () |
Four obligations. Anything you write that does not contribute to one of those four is speculative design — it has no test demanding it, so it has no test guarding it, so it has no business being there yet.
Three Python tools the scaffold uses
Three building blocks you’ll need. None are TDD-specific — they’re standard Python tools you’ll reach for many times.
@dataclassauto-writes__init__,__repr__, and__eq__from your field declarations. You write the field names and types; Python writes the bookkeeping. The auto-generated__eq__is what makesassert report.events == (...)work field-by-field. (docs)frozen=Trueadded to@dataclass(frozen=True)makes instances immutable — fields can’t be reassigned after construction. Frozen dataclasses are also hashable. We rely on this in cycle 2, where test assertions compare tuples ofScoringEventinstances. (docs)@propertyturns a method into an attribute read. Calling sites usereport.total_damage(no parens) but Python runs the function body each time. The test readstotal_damagelike a field; the@propertydecorator keeps that grammar consistent. (docs)
The Transformation Priority Premise — why “smallest” beats “best”
Robert Martin’s TPP lists code transformations from simplest to most complex: nothing → constant → variable → conditional → loop. The rule: always pick the simpler transformation that passes the current failing test, even when you “know” a more general one is coming.
For cycle 1, the test only mentions the empty case. You do not need a loop yet — the empty-tuple default already produces 0 damage. The loop arrives when a test (cycle 4) actually demands it.
Your task
- In
scorer.py(left pane), replace each sub-goal comment with the matching line. Re-read the test for the contract. - Predict what pytest will show. Run.
- Resist any “improvement” beyond what the test demands — the next step is REFACTOR, and it only earns work that has somewhere to go.
Common mistakes to avoid
events: list = []— Python rejects mutable defaults in dataclasses withValueError. What immutable alternative matches the test’sevents == ()assertion?- Forgetting
@property— without it,report.total_damageis a bound method object, not a number; the assertion fails in a weird way. @dataclasswithoutfrozen=True— passes cycle 1, but cycle 2’s tuple comparisons of value objects need the structural__eq__that frozen dataclasses provide.if not dice: ...— speculative branching. The empty-tuple default already handles the empty case.
If you get genuinely stuck, click Show Solution below the editor for one possible form.
Self-test before clicking Next
- What three things does
@dataclasswrite for you that you’d otherwise type by hand? - Why is
@propertya better fit than a plain method fortotal_damagegiven how the test reads it? - What does
frozen=Truebuy us, and which later cycle will first rely on it? - What four obligations does the cycle-1 test impose — derived from reading the test alone?
"""Cycle 1 GREEN — smallest code that turns the failing test green.
The sub-goals below describe the PURPOSE of each line you need to add,
not the syntax. Re-read test_scorer.py to recover the contract: a name
to export, a return value, two attributes on the return value with
specific values. The Cart example in the instructions shows the toolkit
shape; you must translate it to the dragon-dice naming yourself.
"""
from dataclasses import dataclass
@dataclass(frozen=True)
class BattleReport:
# Sub-goal: declare the storage that the test reads as `report.events`.
# Hint: the test compares this to `()`, which already tells you the
# type and the default value. (See the dataclasses docs link.)
@property
def total_damage(self):
# Sub-goal: derive the total from whatever events the report holds.
# Hint: with an empty-tuple default for events, an aggregate built-in
# over an empty sequence already produces the value the test asserts.
pass
def score(dice):
# Sub-goal: hand back the kind of object the test reads attributes on.
# Hint: ignore `dice` for now — no test makes a claim about non-empty
# rolls yet, so any branching on it would be speculative design.
pass
"""Cycle 1 — first failing test (carried over from the RED step)."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
Cycle 1 — REFACTOR: The Pause That Counts
The discipline: REFACTOR is a phase you enter every cycle — even when the answer is “nothing to clean this time.” Entering and looking is the discipline. Skipping the look is the failure mode that quietly degrades TDD into test-after.
The REFACTOR checklist (you’ll re-use this every cycle)
| Category | Question to ask | Cycle 1 answer |
|---|---|---|
| Duplication | Two pieces of code expressing the same idea? | No — only one piece of code |
| Names | Do names describe what they mean, not how they work? | BattleReport, total_damage, events, score — all domain words |
| Test names | Does each test name read as a behavior sentence? | test_empty_roll_has_zero_damage_and_no_events — long but unambiguous |
| Magic constants | Unexplained numbers or strings? | None yet |
| Imports | Conventional order, no dead imports? | Just from dataclasses import dataclass — clean |
For cycle 1, every row is “fine.” That’s a real outcome of a REFACTOR phase — and recognising it without skipping the look is the win.
Your task
- Re-read your code with the checklist. Spend 30 seconds — don’t rush.
- Make any tiny improvement you spot (e.g., a module docstring); keep the bar green.
- Take the first quiz below, focused on the rhythm you just lived through.
Why REFACTOR is the most-skipped phase
Martin Fowler calls skipping refactor “the most common way to screw up TDD.” Field studies of student and professional practice agree: developers treat the green bar as the finish line. Within a few cycles, duplication accumulates and the test suite ages — exactly because nobody paused at REFACTOR to look. By making “enter the phase even when there’s nothing to do” a habit now, you defend against that drift for the rest of the tutorial.
from dataclasses import dataclass
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def score(dice):
return BattleReport()
"""Cycle 1 — first failing test, now green."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
First Quiz — The Red-Green-Refactor Rhythm
Min. score: 80%1. What is the correct order of phases in a single TDD cycle?
RED → GREEN → REFACTOR. The order is load-bearing. RED proves the test reveals missing behavior. GREEN proves the code now satisfies that behavior. REFACTOR improves the code while the safety net (the green test) protects you.
2. You write a new test and click Run. The bar is immediately green without you having to write any production code. What is the most defensible interpretation?
Both halves matter. Sometimes a test passes immediately because a prior refactor generalised behavior — that is a positive outcome, and the test still earns its place by documenting that the behavior is intended. Other times the test is vacuously true. The way to tell is to break the production code on purpose: a real test will catch the break.
3. In the GREEN phase, you have two implementations in mind: one is a hardcoded if dice == []: return BattleReport(), the other is a generic loop. The hardcoded version passes the current test. What does TDD discipline say to do?
This is the Transformation Priority Premise in action: prefer simpler transformations. The hardcoded version is fine as long as a future test will challenge it. That challenge — and the design pressure it creates — is exactly what cycle 4 of this tutorial will hand you.
4. Why is REFACTOR worth entering even when there is nothing to clean up?
Field studies report that “skip refactor when it looks fine” is exactly how test-driven discipline degrades into test-after over a few weeks. Every cycle is a checkpoint where you ask “what do I see now?” Most cycles, the answer is mostly fine. But you only know that because you looked.
5. A teammate says: “TDD is just unit testing — write your tests first instead of last.” What is the most accurate correction?
This is the threshold concept: TDD is design first, testing second. The test forces you to decide the public interface (function name, parameters, return shape) before any logic exists. The REFACTOR phase is where the design that emerged under pressure gets shaped intentionally. Calling it “unit testing in reverse order” misses both halves.
Cycle 2 — Single 1 → Dragon Flame
Spec: a single die showing 1 creates a Dragon Flame event worth 100 damage.
From now on, each cycle is one step with three tasks (RED → GREEN → REFACTOR). Same discipline as cycle 1 — tighter packaging.
Your task
- 🔴 RED — add
test_single_one_creates_dragon_flame_eventintest_scorer.py. The assertion should pin downtotal_damage == 100and oneScoringEvent("Dragon Flame", (1,), 100)inevents. Update the import to bring inScoringEvent. Run — predict the failure, check. - 🟢 GREEN — introduce a
ScoringEventdataclass and the smallest change toscore. The minimum here is a hardcodedif dice == [1]:branch — that’s deliberate. - 🔵 REFACTOR — walk the cycle-1 checklist. Resist generalising; cycle 4 will earn the loop.
Expected RED: ImportError: cannot import name 'ScoringEvent' — the test forces you to name the event class before writing it. That’s the design pressure of test-first thinking.
Why “allow the hard-code” is a TDD discipline
The instinct is to extract a rule, write a loop, build the abstraction now. TDD asks you to wait for the test that demands it. A speculative loop is a guess at the right shape; a loop refactor pulled by cycle 4’s test is a discovery. Refactor toward duplication, not before it.
If you’re genuinely stuck, click Show Solution below the editor.
from dataclasses import dataclass
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def score(dice):
return BattleReport()
"""Cycles 1–2 — adding the Dragon Flame behavior."""
from scorer import score
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
# TODO (RED): import ScoringEvent from scorer
# TODO (RED): write test_single_one_creates_dragon_flame_event
# score([1]) should return a report with total_damage == 100
# and events == (ScoringEvent("Dragon Flame", (1,), 100),)
Cycle 3 — Single 5 → Lightning Spark
Spec: a single die showing 5 creates a Lightning Spark event worth 50 damage.
Same shape as cycle 2, different values. The duplication this creates is intentional — cycle 4’s test will earn the right to fix it.
Your task
- 🔴 RED — add
test_single_five_creates_lightning_spark_event, structured exactly like cycle 2’s test but with the Lightning Spark values. - 🟢 GREEN — add a second hardcoded
if dice == [5]:branch. Resist the urge to write a loop or dict lookup. - 🔵 REFACTOR — walk the checklist. The duplication is now visible; the right move is to note it and write nothing. No test demands the loop yet.
Why deliberately keeping ugly code is the disciplined move
You can clearly see duplication. Refactoring it now would be guessing at the right shape with one too few data points. Cycle 4’s test will provide the second data point — and the loop refactor it earns is a discovery, not a guess. Refactor toward duplication, not before it.
Click Show Solution below the editor if you need one form of the test + GREEN code.
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def score(dice):
if dice == [1]:
return BattleReport((
ScoringEvent("Dragon Flame", (1,), 100),
))
return BattleReport()
"""Cycles 1–3 — adding the Lightning Spark behavior."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
# TODO (RED): write test_single_five_creates_lightning_spark_event
# score([5]) should return total_damage == 50 with
# events == (ScoringEvent("Lightning Spark", (5,), 50),)
Cycle 4 — Repeated Singles → First Real Refactor
Spec:
score([1, 1])returns total damage 200 with two Dragon Flame events.
The first design-breaking test. Neither dice == [1] nor dice == [5] matches [1, 1] — the duplication you noted in cycle 3 just demanded payment.
Your task
- 🔴 RED — add
test_two_ones_create_two_dragon_flamesasserting damage 200 and two Flame events. Run; predict the failure. - 🟢 GREEN — you have two options. Pick the better one:
- Option A: a third hardcoded branch for
dice == [1, 1]. Passes — and makes duplication worse, with a fourth branch coming next cycle. - Option B: replace the hardcoded branches with a
forloop over each die. Passes cycle 4 and retires the cycle-3 duplication in one move.
- Option A: a third hardcoded branch for
- 🔵 REFACTOR — re-run; the previous three tests are your safety net. Note what just happened: you rewrote the body of
scoreand knew within a second that you didn’t break anything.
The threshold-concept moment: tests enable change
You just rewrote the body of score — two hardcoded branches → generic per-die loop — and confirmed in one second that nothing broke. Without the test suite you’d have had to reason from scratch: “Does this still handle empty? Single 1? Single 5?”
That’s what “tests enable change” means. Not “prevent change.” Not “slow change.” Enable — they replace fear-driven manual reasoning with a mechanical check. This is the cycle that earns the safety-net quiz below.
Click Show Solution if you need to see one form of option B.
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def score(dice):
if dice == [1]:
return BattleReport((
ScoringEvent("Dragon Flame", (1,), 100),
))
if dice == [5]:
return BattleReport((
ScoringEvent("Lightning Spark", (5,), 50),
))
return BattleReport()
"""Cycles 1–4 — repeated singles force the first refactor."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_two_ones_create_two_dragon_flames
# score([1, 1]) should return total_damage == 200 and two
# Dragon Flame events in events
Cycle 4 Quiz — Tests as Design Pressure
Min. score: 80%1. You noticed obvious duplication after cycle 3 but did not refactor it. Cycle 4’s test then forced the refactor. Why is this sequence (notice → wait → refactor when test demands) better than (notice → refactor immediately)?
The test that demands the refactor is also the test that confirms the refactor preserved behavior. Refactoring on aesthetic instinct alone is gambling — you might pick a shape that the next test makes ugly anyway. The discipline is: let the tests speak first, then change the structure.
2. You replaced the hardcoded branches with a for loop and ran pytest. All four tests passed. What did the test suite just do for you that you would otherwise have had to do manually?
Tests enable change. The four green checkmarks are not just “tests passed” —
they are “the empty case still works, single 1 still works, single 5 still
works, and [1, 1] works for the first time.” Without the suite, you would
have to convince yourself of each line by reading.
3. A teammate looks at your cycle-4 GREEN code and says: “Why didn’t you just add a third if dice == [1, 1]: branch? It would have passed the test.” What is the most accurate rebuttal?
The TDD GREEN rule is “smallest code that passes the failing test” given the current direction of the design. A third hardcoded branch is locally minimal but globally wasteful — you’ll throw it away in cycle 5 or cycle 6 anyway. The loop is the smallest durable change.
4. (Spaced review — Cycle 1) Why is “RED for the right reason” still the right framing in cycle 4, even though we are now adding a test on top of an already-passing suite?
“RED for the right reason” applies to every cycle. In cycle 1 the right reason is
usually ImportError. In later cycles it is typically an assertion failure
showing the expected tuple of events versus what the current code actually
produced. Either way, the failure has to come from the behavior under test, not
from a typo.
Cycle 5 — Mixed Dice (No Production Change)
Spec:
score([1, 5])produces a mix of one Dragon Flame and one Lightning Spark.
Your task
- 🔴 RED — add
test_one_and_five_create_two_different_events. Predict what pytest will show this time, then run. - 🟢 GREEN — most likely outcome: all five tests pass without any production change. Cycle 4’s loop already handles mixed dice independently. The new test costs zero code — it just documents the intended behavior.
- 🔵 REFACTOR — verify the “free pass” with the ten-second mutation move (next collapsible).
The mutation move (verify the test isn’t vacuous — do this!)
A test that passes immediately can mean two things:
| Why? | What it means |
|---|---|
| A previous refactor already generalised the behavior | ✅ Positive — design is healthy |
| The assertion is vacuously true / missing / wrong oracle | ❌ Defect — a Liar test |
Tell them apart in 10 seconds: break the production code on purpose. Open scorer.py, temporarily change if die == 5: to if die == 999:. Run pytest. The new test should now fail — proving it actually checks the behavior. Restore the line.
This single move is the strongest defence against AI-generated Liar tests. Make it a habit; we’ll come back to it.
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def score(dice):
events = []
for die in dice:
if die == 1:
events.append(ScoringEvent("Dragon Flame", (1,), 100))
if die == 5:
events.append(ScoringEvent("Lightning Spark", (5,), 50))
return BattleReport(tuple(events))
"""Cycles 1–5 — adding mixed dice."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
# TODO (RED): write test_one_and_five_create_two_different_events
# score([1, 5]) should return total_damage == 150 with
# one Dragon Flame followed by one Lightning Spark
Cycle 6 — Triple 1s → Dragon Blast (Design Moment)
Spec: three 1s in a roll combine into one Dragon Blast (1000 damage) instead of three Dragon Flames.
The pivot moment. Cycle 4’s per-die loop walks each die independently — it has no way to know that the other two 1s exist when it processes the first. This test cannot be satisfied by tweaking a branch; the structure has to change. A design-breaking test.
Your task
- 🔴 RED — add
test_three_ones_create_dragon_blast_instead_of_three_flames. The current loop produces three Flames at 300 damage; the test demands one Blast at 1000. Run, see the value-mismatch. - 🟢 GREEN — the structural shift: stop iterating dice in order, count how many of each face appeared (
collections.Counter), then decide what to emit. Combos consume dice; leftovers still score as singles. - 🔵 REFACTOR — re-run; the previous five tests survived a full body rewrite of
score. That’s the safety net working.
Why this was a design-breaking test (and why the previous tests still pass)
A design-breaking test cannot be satisfied by adding a branch — only by reshaping structure. The cycle-4 per-die loop was fundamentally per-die; combos are fundamentally per-face-count. You couldn’t patch your way from one to the other.
Yet all five previous tests still pass after the rewrite. That’s because they assert on observable behavior (total_damage == 100, events == (event,)), not on internals (which loop, which variable name). Behavior tests survive structural rewrites; implementation-tests don’t. This is the Refactoring Litmus Test.
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def score(dice):
events = []
for die in dice:
if die == 1:
events.append(ScoringEvent("Dragon Flame", (1,), 100))
if die == 5:
events.append(ScoringEvent("Lightning Spark", (5,), 50))
return BattleReport(tuple(events))
"""Cycles 1–6 — triple 1s break the per-die loop."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_three_ones_create_dragon_blast_instead_of_three_flames
# score([1, 1, 1]) should return total_damage == 1000 with
# a single Dragon Blast event whose dice_used is (1, 1, 1)
Cycle 6 Quiz — Design-Breaking Tests
Min. score: 80%1. What makes cycle 6’s test a design-breaking test, as opposed to a normal “add a branch” test?
A design-breaking test is one no local edit can satisfy. The cycle-4 loop is fundamentally per-die; combos are fundamentally per-face-count. You cannot patch your way from one to the other — you have to re-shape the algorithm. That re-shaping is what the test extracts as design pressure.
2. After the count-then-emit refactor, all five previous tests still passed. What does that tell you about the previous tests?
This is the Refactoring Litmus Test concept. Robust tests assert on what the
unit does (report.total_damage == 100, report.events == (event,)), not
on how it does it (assert "for die in dice" in inspect.getsource(score)).
The first survives a rewrite; the second breaks at the slightest refactor.
3. Why does cycle 6’s GREEN code subtract from counts[1] after emitting the Dragon Blast event?
“Combos consume dice, leftovers still score as singles” is one of the rules. The subtraction isn’t a Python quirk — it is the domain rule made explicit in the code. Cycle 7’s test will verify exactly this leftover behavior.
4. (Spaced review — Cycle 4) A teammate suggests “let’s just add if Counter(dice)[1] >= 3: ... at the top of the existing per-die loop.” Why is that not the same as the count-then-emit refactor we did?
The refactor is not “add Counter on top of the existing loop.” It is “replace the per-die view with the per-face-count view.” Mixing both is the worst of both worlds: now the function maintains two parallel models of the input, and every future change has to update both. The structural shift was the point.
Cycle 7 — Combo with Leftover Singles
Spec:
score([1, 1, 1, 1, 5])produces one Dragon Blast plus one leftover Flame and a Lightning Spark.
Cycle 6’s counts[1] -= 3 already handles this — but until a test asserts it, the leftover behavior is only implicitly correct. A future refactor could break it silently. Cycle 7 turns implicit correctness into an explicit guardrail.
Your task
- 🔴 RED — add
test_dragon_blast_plus_leftover_flame_and_spark. Predict whether it passes immediately, then run. - 🟢 GREEN — likely no production change. The cycle-6 design already produces this output.
- 🔵 REFACTOR + mutation check — apply the cycle-5 mutation move required this time: change
counts[1] -= 3tocounts[1] -= 4, run, confirm cycle 7 fails, restore. That proves the test is a real guardrail.
Why guardrail tests matter
The most common way a TDD suite degrades over time is implicit correctness silently rotting into implicit incorrectness. A behavior the code does today, but no test asserts, can disappear in a refactor — and nobody notices until a customer reports the bug. The discipline: when you spot behavior that “just works,” ask if a future refactor could break it. If yes, write the test that protects it.
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def score(dice):
counts = Counter(dice)
events = []
if counts[1] >= 3:
events.append(ScoringEvent("Dragon Blast", (1, 1, 1), 1000))
counts[1] -= 3
for _ in range(counts[1]):
events.append(ScoringEvent("Dragon Flame", (1,), 100))
for _ in range(counts[5]):
events.append(ScoringEvent("Lightning Spark", (5,), 50))
return BattleReport(tuple(events))
"""Cycles 1–7 — adding the leftover-singles guardrail."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
# TODO (RED): write test_dragon_blast_plus_leftover_flame_and_spark
# score([1, 1, 1, 1, 5]) should return total_damage == 1150
# with a Dragon Blast, then a Dragon Flame, then a Lightning Spark
Cycle 8 — Goblin Swarm → Rule Objects (Big Refactor)
Spec: three 2s combine into a Goblin Swarm (200 damage).
Before writing anything, imagine where if counts[2] >= 3: would go. Now imagine six more triple combos (cycles 9–14) each adding their own if-and-subtract block. Painful? That pain is the design pressure. Cycle 8 is when the structure has to change.
Your task
- 🔴 RED — add
test_three_twos_create_goblin_swarm, structured like cycle 6’s test. Run; the existing code knows nothing about 2s. - 🟢 GREEN — you have two choices:
- Option A: another
if counts[2] >= 3:block. Local minimum, multiplies the pain five-fold across cycles 9–14. - Option B: extract the rule shape into data. Pull combo-rule logic into a
ComboRuleclass with anapply(counts)method; same forSingleRule. Then drivescorefrom two registries.
- Option A: another
- 🔵 REFACTOR — option B is the refactor. Re-run; the seven prior tests are your safety net.
Listening to the test & the Open-Closed Principle
The discipline of listening to the test: the difficulty of imagining six more if-blocks was a signal — the structure no longer fit. The cure is structural extraction.
You just applied the Open-Closed Principle: score is now closed for modification (you won’t edit it again) but open for extension (cycle 9 will add four rules by appending data). Adding behavior = adding data, not editing code.
Refactor toward duplication, not before. Two data points (Dragon Blast + Goblin Swarm) are now enough to see the right shape; with one, you’d be guessing.
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def score(dice):
counts = Counter(dice)
events = []
if counts[1] >= 3:
events.append(ScoringEvent("Dragon Blast", (1, 1, 1), 1000))
counts[1] -= 3
for _ in range(counts[1]):
events.append(ScoringEvent("Dragon Flame", (1,), 100))
for _ in range(counts[5]):
events.append(ScoringEvent("Lightning Spark", (5,), 50))
return BattleReport(tuple(events))
"""Cycles 1–8 — Goblin Swarm forces the rule-object refactor."""
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
# TODO (RED): write test_three_twos_create_goblin_swarm
# score([2, 2, 2]) should return total_damage == 200 with
# a Goblin Swarm event whose dice_used is (2, 2, 2)
Cycle 8 Quiz — Listening to the Test
Min. score: 80%1. What does “listening to the test” mean as a refactoring heuristic?
“Listen to the test” is a foundational TDD heuristic: when adding a test or a
branch hurts disproportionately, the structure is telling you something. In
cycle 8, the pain of imagining six more if counts[X] >= 3: branches was the
signal. The fix was a structural shift to data, not more code.
2. Why is putting rules in a COMBO_RULES tuple of ComboRule instances better than keeping them as if counts[X] >= 3: branches inside score()?
The Open-Closed Principle (OCP) says modules should be open for extension and
closed for modification. The rule-object refactor is the OCP in five lines:
score() no longer changes when you add a new triple combo — only the data
does. Cycle 9 will demonstrate the payoff by adding four combos at once.
3. A teammate says: “You should have done the rule-object refactor in cycle 6, when you first introduced combos. Why wait?” What is the most defensible answer?
Two data points are the minimum for seeing the right shape of an abstraction.
With only Dragon Blast (cycle 6), you’d be guessing whether the variation is
“the die that triggers” or “the count required” or “the name.” Cycle 8’s
Goblin Swarm provides the second data point, and only then is ComboRule(die,
count, name, damage) a discovery rather than a guess. Refactor toward
duplication, not before it.
4. (Spaced review — Cycle 6) After the rule-object refactor, all seven previous tests still pass. Why is this expected, and what does it confirm about the previous tests?
The seven previous tests asserted on report.total_damage, report.events,
and ScoringEvent instances. None of them asserted on internal structure.
That is precisely why they survived a refactor that replaced the entire
implementation strategy. The Refactoring Litmus Test: behavior tests survive
structural change; implementation tests don’t.
Cycle 9 — Add the Rest of the Combo Rules
Spec: the four remaining triples — Triple 3 → Orc Charge (300), Triple 4 → Troll Smash (400), Triple 5 → Lightning Storm (500), Triple 6 → Demon Strike (600).
This cycle exists to prove cycle 8’s refactor paid off. Four new behaviors → no edits to score().
Your task
- 🔴 RED — write one parametrised test (
@pytest.mark.parametrize) covering all four triples. Each row runs as a separate pytest result. - 🟢 GREEN — append four rows to
COMBO_RULES. Do not touchscore(). That’s the proof. - 🔵 REFACTOR — re-run; check duplication (none — rules are data) and that all 12 results pass.
Why parametrize beats a for-loop inside one test
@pytest.mark.parametrize runs the function once per row, reporting each as a separate test result. A for loop inside a single test stops at the first failure, hiding everything after it. The parametrise idiom is the right Python answer to “N tests of the same shape” — DRY tests that still report separate failures.
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts):
events = []
if counts[self.die] >= self.count:
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] -= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts):
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def score(dice):
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""Cycles 1–9 — adding the four remaining triple combos."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
# TODO (RED): write a parametrized test_other_triples_create_combo_events
# that covers triples of 3, 4, 5, 6 with their respective events
Cycle 10 — Six 1s → Two Dragon Blasts (Hidden Bug)
Spec:
score([1, 1, 1, 1, 1, 1])produces two Dragon Blasts (2000 damage).
Predict first. Before writing a test, look at your ComboRule.apply and trace through six 1s by hand. What does the current code actually produce?
Reveal the hidden bug (open after you’ve predicted)
The if counts[1] >= 3: runs once. It emits one Blast and counts[1] -= 3 leaves counts[1] == 3. Those three 1s fall through to SingleRule, emitting three Flames. Total: 1000 + 300 = 1300 damage.
But six 1s should be two Blasts → 2000 damage. The code is wrong — and no previous test caught it, because every prior combo test used exactly three of a face.
This is the kind of bug TDD literature reports: a defect the developer doesn’t know exists in code they wrote themselves. The test surfaces it.
Your task
- 🔴 RED — add
test_six_ones_create_two_dragon_blastsasserting 2000 damage and two Blast events. Run; see the wrong events tuple. - 🟢 GREEN — fix
ComboRule.applyso it can emit zero or more combos per call, with the correct leftover bookkeeping. Hint: floor-division (//) and modulo (%) are the right operators. - 🔵 REFACTOR — re-run. Especially gratifying: cycle 7’s leftover guardrail still passes, because
4 % 3 == 1matches4 - 3 == 1for that input. The fix only changed behavior on cases no prior test pinned down.
What this teaches about coverage vs. boundary thinking
Every previous combo test used exactly count dice (three 1s, three 2s, etc.). The bug only manifests at 2 × count and beyond. Line coverage told you the if ran. It didn’t tell you the line was right for all relevant inputs.
That’s the gap between coverage and boundary-value analysis: every behavior has boundaries (0, exactly N, 2N, between N and 2N) and a healthy suite probes each. Coverage is a locator of under-tested code; it isn’t a measure of correctness.
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts):
events = []
if counts[self.die] >= self.count:
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] -= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts):
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
ComboRule(3, 3, "Orc Charge", 300),
ComboRule(4, 3, "Troll Smash", 400),
ComboRule(5, 3, "Lightning Storm", 500),
ComboRule(6, 3, "Demon Strike", 600),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def score(dice):
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""Cycles 1–10 — surfacing the multi-combo edge case."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
@pytest.mark.parametrize(
"roll, expected_event",
[
([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
],
)
def test_other_triples_create_combo_events(roll, expected_event):
report = score(roll)
assert report.total_damage == expected_event.damage
assert report.events == (expected_event,)
# TODO (RED): write test_six_ones_create_two_dragon_blasts
# score([1, 1, 1, 1, 1, 1]) should return total_damage == 2000
# and events containing TWO Dragon Blast ScoringEvents
Cycle 10 Quiz — TDD Finds Bugs You Didn't Know You Had
Min. score: 80%1. The “one Dragon Blast for any number of 1s ≥ 3” bug was sitting in your code from cycle 6 onward — four cycles. Why did no previous test catch it?
Boundary thinking. Every previous combo test gave the rule exactly count
dice. The bug was on the boundary where counts[X] >= 2 * count. Without a test
on that boundary, the code would have shipped wrong — and the developer would
have been completely confident in it. This is exactly the kind of failure mode
empirical TDD studies report.
2. What is the most defensible general lesson from cycle 10 about test-suite design?
Coverage told us we executed the line. It did not tell us whether the line was right for all relevant inputs. Boundary-value analysis (the partition tradition you encountered in Foundations) complements coverage: every behavior has a “just-at-the-boundary,” “just-past-the-boundary,” and “well-past-the- boundary” input — and a healthy suite probes each.
3. Cycle 7’s guardrail ([1, 1, 1, 1, 5]) still passes after cycle 10’s refactor. Why? Be precise.
This is a subtle but important point: a refactor does not have to change every
output. The two implementations of apply produce the same answer for any
input where the old answer was correct — and only the new answer is correct
where the old was wrong. That is what makes the cycle-10 fix a generalisation
rather than a replacement.
4. (Spaced review — Cycle 8) A teammate looks at cycle 10’s fix and says: “This bug only existed because we extracted ComboRule. If we’d kept the inline if counts[1] >= 3: from cycle 6, we wouldn’t have had this issue.” Are they right?
The bug is the algorithm’s bug, not the abstraction’s. Whether the if/-=
sits in score() or in ComboRule.apply makes no difference to its behavior.
The lesson is honest: TDD does not prevent every bug; it surfaces them when
tests probe the right inputs. The rule-object refactor was about future-
proofing the design, not correctness of cycle 6’s logic.
Cycle 11 — Invalid Dice Raise an Error
Spec:
score([1, 2, 9])raisesValueError("Die values must be between 1 and 6"). Dice are integers 1–6.
So far every test was a happy-path test. Happy-path-only suites are the single most common smell in student-written test code (per testing-education research). TDD applies to error behavior just as much as to success — pytest.raises is the idiom.
Your task
- 🔴 RED — add
test_invalid_die_value_raises_errorusingpytest.raises(ValueError, match=...). The currentscoresilently ignores the 9, so the test fails because no exception is raised. - 🟢 GREEN — add a
validate_dice(dice)helper that raisesValueErrorfor any die outside 1–6; call it at the top ofscore. - 🔵 REFACTOR — checklist; constants like
VALID_DICEkeep the rule explicit.
Why match= matters & happy-path-only is dangerous
Without match=, the test would accept any ValueError — even one with no message, or the wrong message. With it, the test pins down the contract: the function raises and explains why. That’s an oracle for the error case, not just for success.
The deeper lesson: line coverage doesn’t catch missing error handling. A function that silently processes invalid input passes every happy-path assertion. Only an invalid-input test asks “does the boundary actually error?”
from collections import Counter
from dataclasses import dataclass
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts):
events = []
number_of_combos = counts[self.die] // self.count
for _ in range(number_of_combos):
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] %= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts):
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
ComboRule(3, 3, "Orc Charge", 300),
ComboRule(4, 3, "Troll Smash", 400),
ComboRule(5, 3, "Lightning Storm", 500),
ComboRule(6, 3, "Demon Strike", 600),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def score(dice):
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""Cycles 1–11 — adding input validation."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
@pytest.mark.parametrize(
"roll, expected_event",
[
([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
],
)
def test_other_triples_create_combo_events(roll, expected_event):
report = score(roll)
assert report.total_damage == expected_event.damage
assert report.events == (expected_event,)
def test_six_ones_create_two_dragon_blasts():
report = score([1, 1, 1, 1, 1, 1])
assert report.total_damage == 2000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
# TODO (RED): write test_invalid_die_value_raises_error
# score([1, 2, 9]) should raise ValueError("Die values must be between 1 and 6")
# Use `with pytest.raises(ValueError, match="..."): score([1, 2, 9])`
Cycle 12 — Human-Readable Battle Summary
Spec:
report.summary()returns a one-line string like"Dragon Blast (1000) + Lightning Spark (50) = 1050". An empty report returns"No damage.".
Twelve cycles in. Cycle 12 is the only one that adds behavior to BattleReport itself rather than to score. Where a behavior lives is a design decision — putting summary on the report keeps formatting close to the data.
Your task
- 🔴 RED — add
test_battle_report_has_human_readable_summary. Run; expectAttributeError. - 🟢 GREEN — add a
summarymethod toBattleReportthat handles empty events as a special case and joinsname (damage)segments with" + "for non-empty. - 🔵 REFACTOR — final checklist. Re-run; all 16+ tests should be green. You just completed twelve TDD cycles.
A small design choice: method vs. property?
Properties signal cheap, side-effect-free derived values. summary() is cheap, but the empty-case branch and the f-string assembly do nontrivial work. Convention: methods for derived strings; properties for derived numbers and booleans. We chose a method. (Either would work — but be consistent within a project.)
from collections import Counter
from dataclasses import dataclass
VALID_DICE = {1, 2, 3, 4, 5, 6}
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts):
events = []
number_of_combos = counts[self.die] // self.count
for _ in range(number_of_combos):
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] %= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts):
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
ComboRule(3, 3, "Orc Charge", 300),
ComboRule(4, 3, "Troll Smash", 400),
ComboRule(5, 3, "Lightning Storm", 500),
ComboRule(6, 3, "Demon Strike", 600),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def validate_dice(dice):
for die in dice:
if die not in VALID_DICE:
raise ValueError("Die values must be between 1 and 6")
def score(dice):
validate_dice(dice)
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""Cycles 1–12 — adding the human-readable summary."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
@pytest.mark.parametrize(
"roll, expected_event",
[
([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
],
)
def test_other_triples_create_combo_events(roll, expected_event):
report = score(roll)
assert report.total_damage == expected_event.damage
assert report.events == (expected_event,)
def test_six_ones_create_two_dragon_blasts():
report = score([1, 1, 1, 1, 1, 1])
assert report.total_damage == 2000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_invalid_die_value_raises_error():
with pytest.raises(ValueError, match="Die values must be between 1 and 6"):
score([1, 2, 9])
# TODO (RED): write test_battle_report_has_human_readable_summary
# score([1, 1, 1, 5]).summary() should equal
# "Dragon Blast (1000) + Lightning Spark (50) = 1050"
The Big Picture — Lessons from Twelve Cycles
Twelve cycles. Twelve tests. One small domain model with combos, leftovers, validation, and a summary method. Scroll through your final scorer.py — every line is justified by a test. No speculative code, no dead branch. The tests are the byproduct; the design is the primary product.
The takeaways that travel
- TDD is design, not testing. The test is the contract; the implementation emerges under its pressure.
- Refactor toward duplication, not before it. Two examples reveal the right shape; one is a guess.
- Tests enable change. Behavior-level assertions survive structural rewrites; implementation-coupled tests don’t.
- Coverage ≠ correctness. Boundary thinking (cycle 10) catches what coverage misses.
- Listen to the test. Pain in writing a test usually points at the production code.
Take the final knowledge check at the end of this step. The sections below review the whole journey, the anti-pattern taxonomy, the empirical case for TDD, and what to learn next.
The journey in one table
| Cycle | Behavior | Design move | Lesson |
|---|---|---|---|
| 1 | Empty roll | First class + function | RED for the right reason |
| 2 | Single 1 | ScoringEvent introduced |
Allow the ugly first GREEN |
| 3 | Single 5 | Second hardcoded branch | Refactor toward duplication |
| 4 | Repeated singles | Per-die loop | First real refactor; tests enable change |
| 5 | Mixed dice | (no production change) | Free pass — verify with the mutation move |
| 6 | Triple 1s | Counter, count-then-emit |
Design-breaking test; structural shift |
| 7 | Combo + leftovers | (no production change) | Guardrail tests for implicit correctness |
| 8 | Triple 2s | Rule objects | Listening to the test; Open-Closed |
| 9 | Other triples | Append data | Refactor pays off; parametrize |
| 10 | Six 1s | // and %= |
Hidden edge case; boundary > coverage |
| 11 | Invalid dice | pytest.raises |
Robustness is first-class |
| 12 | Summary | Method on BattleReport |
Behavior on the existing object |
When TDD shines, when it’s overkill
TDD shines on: new features with clear behavioral requirements; complex logic with branching cases; long-lived code modified by multiple people; API design (the test forces a caller’s perspective); domains where regressions hurt (payments, scoring, calculations).
TDD is overkill on: one-off throwaway scripts; exploratory prototyping where the problem isn’t yet defined; UI layout (visual correctness); non-binary outcomes (ML accuracy, image recognition); Jupyter research explorations.
Even in the second list, some tests pay off — the question is whether to write them first. Kent Beck: “the discipline of working strictly test-first is valuable but not necessarily something you want to do all the time.”
TDD anti-pattern taxonomy
| Level | Anti-pattern | What it looks like | Antidote |
|---|---|---|---|
| I | The Liar | Test passes but asserts vacuously (isinstance(x, int) only) |
Cycle 5’s mutation move |
| I | The Nitpicker | Asserts on private attributes / implementation details | Assert on observable behavior |
| II | Success Against All Odds | New test passes immediately, with no investigation | Verify with mutation |
| II | Skip-the-Refactor | Stop at green; never enter REFACTOR | Make the look mandatory |
| III | The Giant | One test asserts dozens of behaviors (mirrors a God Object) | One behavior per test |
| III | Excessive Setup | 30+ lines of fixture before one assertion | Decouple production code |
| IV | The Mockery | More mock setup than test logic | Listen — the design is wrong |
| IV | Modify-the-Test | AI rewrites the test to match buggy code | Own the spec yourself |
The pattern: higher the level, more it’s an architectural smell. Listen to the test.
The empirical case for TDD
| Study | Finding |
|---|---|
| Microsoft & IBM (Williams et al., 2008) | 39–91% decrease in pre-release defect density in TDD teams |
| Same studies | 15–35% longer initial development; offset by reduced debugging |
| Erdogmus et al. (2005) | Test-first students wrote more tests AND were more productive per test |
| Janzen & Saiedian (ICSE 2007) | Mature programmers exposed to TDD significantly more likely to prefer TDD later — the Residual Effect |
| Fucci et al. (2017) | TDD’s benefit comes from granularity + uniformity, not strict test-first ordering — your twelve tiny cycles embody both |
Caveat: mixed for solo programmers on short tasks. Strongest in team settings, with CI, on long-lived systems.
What to learn next (the same rhythm, scaled up)
- Fixtures (
@pytest.fixture) for reusable setup of objects, DBs, mock APIs - Mocks, fakes, stubs — with a strong default toward fakes over mocks
- Property-based testing with Hypothesis —
score(any list of 1–6)should always satisfy invariants - Mutation testing with
mutmutorcosmic-ray— automate the cycle-5 mutation move across the whole suite - The Outside-In / Double-Loop pattern (Percival, Obey the Testing Goat) — high-level acceptance tests drive unit tests
Each lives inside the same Red-Green-Refactor rhythm you just internalised. They scale the discipline; they don’t replace it.
# The full, twelve-cycle implementation lives here. Use this step's
# editor to scroll through what you built — every line is justified by
# a test in test_scorer.py. There is no speculative code.
from collections import Counter
from dataclasses import dataclass
VALID_DICE = {1, 2, 3, 4, 5, 6}
@dataclass(frozen=True)
class ScoringEvent:
name: str
dice_used: tuple
damage: int
@dataclass(frozen=True)
class BattleReport:
events: tuple = ()
@property
def total_damage(self):
return sum(event.damage for event in self.events)
def summary(self):
if not self.events:
return "No damage."
event_text = " + ".join(
f"{event.name} ({event.damage})" for event in self.events
)
return f"{event_text} = {self.total_damage}"
@dataclass(frozen=True)
class ComboRule:
die: int
count: int
name: str
damage: int
def apply(self, counts):
events = []
number_of_combos = counts[self.die] // self.count
for _ in range(number_of_combos):
dice_used = tuple([self.die] * self.count)
events.append(ScoringEvent(self.name, dice_used, self.damage))
counts[self.die] %= self.count
return events
@dataclass(frozen=True)
class SingleRule:
die: int
name: str
damage: int
def apply(self, counts):
events = []
for _ in range(counts[self.die]):
events.append(ScoringEvent(self.name, (self.die,), self.damage))
return events
COMBO_RULES = (
ComboRule(1, 3, "Dragon Blast", 1000),
ComboRule(2, 3, "Goblin Swarm", 200),
ComboRule(3, 3, "Orc Charge", 300),
ComboRule(4, 3, "Troll Smash", 400),
ComboRule(5, 3, "Lightning Storm", 500),
ComboRule(6, 3, "Demon Strike", 600),
)
SINGLE_RULES = (
SingleRule(1, "Dragon Flame", 100),
SingleRule(5, "Lightning Spark", 50),
)
def validate_dice(dice):
for die in dice:
if die not in VALID_DICE:
raise ValueError("Die values must be between 1 and 6")
def score(dice):
validate_dice(dice)
counts = Counter(dice)
events = []
for rule in COMBO_RULES:
events.extend(rule.apply(counts))
for rule in SINGLE_RULES:
events.extend(rule.apply(counts))
return BattleReport(tuple(events))
"""All twelve cycles, all green. Read it as a contract."""
import pytest
from scorer import score, ScoringEvent
def test_empty_roll_has_zero_damage_and_no_events():
report = score([])
assert report.total_damage == 0
assert report.events == ()
def test_single_one_creates_dragon_flame_event():
report = score([1])
assert report.total_damage == 100
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_single_five_creates_lightning_spark_event():
report = score([5])
assert report.total_damage == 50
assert report.events == (
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_two_ones_create_two_dragon_flames():
report = score([1, 1])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Dragon Flame", (1,), 100),
)
def test_one_and_five_create_two_different_events():
report = score([1, 5])
assert report.total_damage == 150
assert report.events == (
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_ones_create_dragon_blast_instead_of_three_flames():
report = score([1, 1, 1])
assert report.total_damage == 1000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_dragon_blast_plus_leftover_flame_and_spark():
report = score([1, 1, 1, 1, 5])
assert report.total_damage == 1150
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Flame", (1,), 100),
ScoringEvent("Lightning Spark", (5,), 50),
)
def test_three_twos_create_goblin_swarm():
report = score([2, 2, 2])
assert report.total_damage == 200
assert report.events == (
ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
)
@pytest.mark.parametrize(
"roll, expected_event",
[
([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
],
)
def test_other_triples_create_combo_events(roll, expected_event):
report = score(roll)
assert report.total_damage == expected_event.damage
assert report.events == (expected_event,)
def test_six_ones_create_two_dragon_blasts():
report = score([1, 1, 1, 1, 1, 1])
assert report.total_damage == 2000
assert report.events == (
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
)
def test_invalid_die_value_raises_error():
with pytest.raises(ValueError, match="Die values must be between 1 and 6"):
score([1, 2, 9])
def test_battle_report_has_human_readable_summary():
report = score([1, 1, 1, 5])
assert report.summary() == (
"Dragon Blast (1000) + Lightning Spark (50) = 1050"
)
Final Quiz — The Big Picture
Min. score: 80%1. Which single statement most accurately captures TDD?
The threshold concept of the tutorial: TDD is design, not testing. Cycle 8’s rule-object refactor emerged under cycle-8’s test pressure, not from a whiteboard. The test suite is a (valuable) byproduct.
2. For which scenario is TDD most beneficial?
Payment processing has every property TDD rewards: complex logic, strict correctness, multiple developers, long life. Microsoft/IBM measured 39–91% defect-density reductions exactly here.
3. A test file contains:
def test_user_creation():
user = User("Alice", "alice@example.com")
assert user._name == "Alice"
assert user._email == "alice@example.com"
The Nitpicker. Leading underscores signal “private — implementation detail.” Tests that touch private state break on every refactor. Assert on observable behavior via the public API.
4. The “one Dragon Blast for any count ≥ 3” bug lived in cycles 6–10. Why didn’t line coverage catch it?
Coverage tells you the line ran, not that it was right. Every prior combo
test used exactly count dice; nothing probed 2 × count. Boundary thinking
is the discipline that catches it. Coverage is a locator of under-tested
code, not a measure of correctness.