1

Cycle 1 — RED: Write the Failing Test

Prerequisite: Testing Foundations (pytest discovery, assert, partitions, behavior-not-implementation). If those feel new, complete that tutorial first.

You’ll grow a small dice-scoring engine over 15 steps using only Test-Driven Development.

Test-Driven Development in one minute

TDD is a design technique that uses tests as the medium of pressure. You write code in short cycles of three phases:

| Phase | What you do | Why this phase exists |
| --- | --- | --- |
| 🔴 RED | Write one failing test that names a behavior you want | Forces the interface and expected behavior to be decided before any logic exists |
| 🟢 GREEN | Write the smallest code that makes the test pass | Resists speculative design; only build what a test demands |
| 🔵 REFACTOR | Improve the code while all tests stay green | The safety net lets you reshape structure without fear of regression |

Each phase of cycle 1 is its own tutorial step so the rhythm becomes a felt sequence, not a slogan. From cycle 2 onward, each cycle is one step containing all three phases.

Why a failing test is the goal of RED

A failing test is not a bug — RED is the expected starting state of every cycle. But the failure has to come from the behavior under test, not from a typo:

  • Right reason: ImportError, AttributeError, a value-mismatch on the assertion you wrote. The test correctly says “this behavior does not exist yet.”
  • Wrong reason: SyntaxError, missing colon, misspelled test_ prefix. The test never ran. You’ve learned nothing about the unit, only about your typing.
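
To make the two kinds of RED concrete, here is a minimal sketch. The module name empty_scorer_demo and the helper red_reason are hypothetical stand-ins, not part of the tutorial's files; the empty module object plays the role of a scorer.py that defines no score yet.

```python
import sys
import types

# Hypothetical stand-in for a scorer.py that exists but defines nothing yet.
sys.modules["empty_scorer_demo"] = types.ModuleType("empty_scorer_demo")


def red_reason(test_source):
    """Classify why a snippet of test code fails (illustrative helper)."""
    try:
        code = compile(test_source, "<test>", "exec")
    except SyntaxError:
        # The file never parsed, so no test ran at all.
        return "wrong reason: SyntaxError"
    try:
        exec(code, {})
    except ImportError:
        # The code ran far enough to report that the behavior is missing.
        return "right reason: ImportError"
    return "no failure"


missing_behavior = "from empty_scorer_demo import score\n"
typo = "def test_empty_roll()\n    pass\n"  # missing colon after the parentheses

print(red_reason(missing_behavior))  # right reason: ImportError
print(red_reason(typo))              # wrong reason: SyntaxError
```

The first snippet fails because of what the unit lacks; the second fails before Python can even look at the unit.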

Students commonly delete a failing test to make the bar green. We’re leaning into that discomfort instead. Learning to read the failure is what TDD trains.

The shape of every pytest test

Every pytest test you write has the same four-part structure:

| Part | What it does |
| --- | --- |
| Import the unit under test | Tells Python what code you’ll call |
| Define a function whose name starts with test_ | pytest only discovers functions matching this pattern |
| Arrange + act | Set up any input and call the unit |
| Assert an observable property of the result | Pin down one thing the spec promises |

The pattern generalises; the specifics (what to import, call, and assert) come from the spec — and only from the spec.
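
As an illustration of the four parts on an unrelated unit, here is a small sketch. The function to_fahrenheit and its module name are hypothetical; in a real project the unit would live in its own file and part 1 would be an import.

```python
# Part 1 would normally be an import such as `from temperature import to_fahrenheit`.
# To keep this sketch self-contained, the hypothetical unit is defined inline.
def to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


# Part 2: a function whose name starts with test_ so pytest discovers it.
def test_freezing_point_is_32_fahrenheit():
    # Part 3: arrange the input and act on the unit.
    result = to_fahrenheit(0)

    # Part 4: assert one observable property of the result.
    assert result == 32
```

The test name reads as a behavior sentence, and the single assertion checks what the caller can observe, not how the conversion is computed.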

The dragon dice rules (reference for all 12 cycles)

| Roll | Event | Damage |
| --- | --- | --- |
| Single 1 | Dragon Flame | 100 |
| Single 5 | Lightning Spark | 50 |
| Triple 1 | Dragon Blast | 1000 |
| Triple 2 | Goblin Swarm | 200 |
| Triple 3 | Orc Charge | 300 |
| Triple 4 | Troll Smash | 400 |
| Triple 5 | Lightning Storm | 500 |
| Triple 6 | Demon Strike | 600 |

Triples consume three dice; leftover 1s and 5s still score as singles. Dice are integers 1–6. Today you implement only the empty-roll case. Eleven more cycles add the rest.

Cycle 1’s spec

An empty roll produces a battle report with zero damage and no events.

That sentence names everything you need: a function score, a return value with total_damage and events attributes. Translate it into a pytest test using the four-part shape.

Your task

  1. In test_scorer.py (right pane), fill in the three sub-goal comments. Leave scorer.py empty — its code belongs to the GREEN step.
  2. Predict the exact failure message you’ll see, then click Run.
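  3. Expected output: ImportError: cannot import name 'score'. The starter scorer.py exists but contains only comments, so the module is found and the name is missing; you would see ModuleNotFoundError only if scorer.py did not exist at all. That IS the deliverable: RED for the right reason.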

Detail: why a tuple, not a list

The empty events case is written as () (an empty tuple), not [] (an empty list). Tuples are immutable: they can’t be mutated by accident, and they’re safe dataclass defaults (the GREEN step uses that). Every test in this tutorial that pins down events uses a tuple.

Self-test before clicking Next

Close the page in your head and answer:

  • What two pieces of information does pytest’s failure message give you?
  • Why is “RED for the right reason” worth distinguishing from “RED for any reason”?
  • If you had to write a test for a multiply(a, b) function from scratch, what four parts would it contain?
Starter files
scorer.py
# Cycle 1 RED phase — DO NOT WRITE PRODUCTION CODE HERE YET.
#
# The next tutorial step (Cycle 1 GREEN) is where the BattleReport
# class and the score function are introduced. Right now we are
# only writing the failing test on the right.
test_scorer.py
"""Cycle 1 RED — write the first failing test.

The sub-goals below describe the PURPOSE of each line you need to add,
not the syntax. Translate the spec ("an empty roll has no damage and no
events") into pytest assertions yourself. If you get stuck, consult the
rules table and the references in the instructions panel.
"""
from scorer import score


def test_empty_roll_has_zero_damage_and_no_events():
    # Sub-goal: call the unit under test on the simplest input the spec mentions
    #           — and capture the result so the next two lines can inspect it.

    # Sub-goal: pin down what the spec says about damage in this case.

    # Sub-goal: pin down what the spec says about events in this case.
    pass
2

Cycle 1 — GREEN: Make It Pass

The GREEN rule: write the smallest code that makes the failing test pass. Anything more is speculative design — code with no test demanding it.

The test is your contract

Every line of the test is an obligation your code must satisfy:

| Line of the test | What your code must provide |
| --- | --- |
| from scorer import score | A score name in scorer.py |
| report = score([]) | score returns something |
| assert report.total_damage == 0 | That something exposes total_damage, equal to 0 |
| assert report.events == () | …and events, equal to () |

Four obligations. Anything you write that does not contribute to one of those four is speculative design — it has no test demanding it, so it has no test guarding it, so it has no business being there yet.

Three Python tools the scaffold uses

Three building blocks you’ll need. None are TDD-specific — they’re standard Python tools you’ll reach for many times.

  • @dataclass auto-writes __init__, __repr__, and __eq__ from your field declarations. You write the field names and types; Python writes the bookkeeping. The auto-generated __eq__ is what makes assert report.events == (...) work field-by-field. (docs)
  • frozen=True, as in @dataclass(frozen=True), makes instances immutable: fields can’t be reassigned after construction. Frozen dataclasses (with the default eq=True) are also hashable. Cycle 2 leans on this immutability: its assertions compare tuples of ScoringEvent instances, and a frozen value cannot change after it is built. (docs)
  • @property turns a method into an attribute read. Calling sites use report.total_damage (no parens) but Python runs the function body each time. The test reads total_damage like a field; the @property decorator keeps that grammar consistent. (docs)
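
The three tools can be seen working together in a small sketch. The Cart class, its prices field, and its total property are hypothetical names chosen for illustration; translating the shape to the dragon-dice names remains the exercise.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Cart:
    # Immutable default: an empty tuple is safe to share between instances.
    prices: tuple = ()

    @property
    def total(self):
        # Read as cart.total (no parentheses); recomputed on every access.
        return sum(self.prices)


empty = Cart()
full = Cart((250, 100))

print(empty.total)               # 0
print(full.total)                # 350
print(full == Cart((250, 100)))  # True, thanks to the generated __eq__

try:
    full.prices = ()             # frozen=True forbids reassignment
except FrozenInstanceError:
    print("Cart is immutable")
```

Note that sum over an empty tuple already returns 0, so the empty case needs no special branch.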

The Transformation Priority Premise — why “smallest” beats “best”

Robert Martin’s TPP lists code transformations from simplest to most complex: nothing → constant → variable → conditional → loop. The rule: always pick the simpler transformation that passes the current failing test, even when you “know” a more general one is coming.

For cycle 1, the test only mentions the empty case. You do not need a loop yet — the empty-tuple default already produces 0 damage. The loop arrives when a test (cycle 4) actually demands it.

Your task

  1. In scorer.py (left pane), replace each sub-goal comment with the matching line. Re-read the test for the contract.
  2. Predict what pytest will show. Run.
  3. Resist any “improvement” beyond what the test demands — the next step is REFACTOR, and it only earns work that has somewhere to go.

Common mistakes to avoid

  • events: list = [] — Python rejects mutable defaults in dataclasses with ValueError. What immutable alternative matches the test’s events == () assertion?
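  • Forgetting @property: without it, report.total_damage is a bound method object, not a number, so report.total_damage == 0 evaluates to False and the assertion fails with a confusing message.
  • @dataclass without frozen=True: passes cycle 1, and cycle 2’s tuple comparisons would still work, because the structural __eq__ comes from @dataclass itself. What you lose is immutability and hashability: a non-frozen report or event can be reassigned after construction, which undermines its role as a value object.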
  • if not dice: ... — speculative branching. The empty-tuple default already handles the empty case.
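
The first two mistakes are easy to reproduce in isolation. The sketch below uses hypothetical class names (Report, ReportWithoutProperty) purely for demonstration.

```python
from dataclasses import dataclass


def define_report_with_list_default():
    # Defining the class is enough to trigger the error; no instance is needed.
    @dataclass
    class Report:
        events: list = []

    return Report


try:
    define_report_with_list_default()
    list_default_error = None
except ValueError as exc:
    list_default_error = str(exc)

print(list_default_error)


class ReportWithoutProperty:
    def total_damage(self):  # plain method: @property was forgotten
        return 0


report = ReportWithoutProperty()
print(report.total_damage == 0)    # False: a bound method is not the number 0
print(report.total_damage() == 0)  # True, but the test never calls it
```

The ValueError is raised at class-definition time, which means the test module would fail on import before any assertion runs.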

If you get genuinely stuck, click Show Solution below the editor for one possible form.

Self-test before clicking Next

  • What three things does @dataclass write for you that you’d otherwise type by hand?
  • Why is @property a better fit than a plain method for total_damage given how the test reads it?
  • What does frozen=True buy us, and which later cycle will first rely on it?
  • What four obligations does the cycle-1 test impose — derived from reading the test alone?
Starter files
scorer.py
"""Cycle 1 GREEN — smallest code that turns the failing test green.

The sub-goals below describe the PURPOSE of each line you need to add,
not the syntax. Re-read test_scorer.py to recover the contract: a name
to export, a return value, two attributes on the return value with
specific values. The Cart example in the instructions shows the toolkit
shape; you must translate it to the dragon-dice naming yourself.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    # Sub-goal: declare the storage that the test reads as `report.events`.
    # Hint: the test compares this to `()`, which already tells you the
    # type and the default value. (See the dataclasses docs link.)

    @property
    def total_damage(self):
        # Sub-goal: derive the total from whatever events the report holds.
        # Hint: with an empty-tuple default for events, an aggregate built-in
        # over an empty sequence already produces the value the test asserts.
        pass


def score(dice):
    # Sub-goal: hand back the kind of object the test reads attributes on.
    # Hint: ignore `dice` for now — no test makes a claim about non-empty
    # rolls yet, so any branching on it would be speculative design.
    pass
test_scorer.py
"""Cycle 1 — first failing test (carried over from the RED step)."""
from scorer import score


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()
3

Cycle 1 — REFACTOR: The Pause That Counts

The discipline: REFACTOR is a phase you enter every cycle — even when the answer is “nothing to clean this time.” Entering and looking is the discipline. Skipping the look is the failure mode that quietly degrades TDD into test-after.

The REFACTOR checklist (you’ll re-use this every cycle)

| Category | Question to ask | Cycle 1 answer |
| --- | --- | --- |
| Duplication | Two pieces of code expressing the same idea? | No, only one piece of code |
| Names | Do names describe what they mean, not how they work? | BattleReport, total_damage, events, score: all domain words |
| Test names | Does each test name read as a behavior sentence? | test_empty_roll_has_zero_damage_and_no_events (long but unambiguous) |
| Magic constants | Unexplained numbers or strings? | None yet |
| Imports | Conventional order, no dead imports? | Just from dataclasses import dataclass (clean) |

For cycle 1, every row is “fine.” That’s a real outcome of a REFACTOR phase — and recognising it without skipping the look is the win.

Your task

  1. Re-read your code with the checklist. Spend 30 seconds — don’t rush.
  2. Make any tiny improvement you spot (e.g., a module docstring); keep the bar green.
  3. Take the first quiz below, focused on the rhythm you just lived through.

Why REFACTOR is the most-skipped phase

Martin Fowler calls skipping refactor “the most common way to screw up TDD.” Field studies of student and professional practice agree: developers treat the green bar as the finish line. Within a few cycles, duplication accumulates and the test suite ages — exactly because nobody paused at REFACTOR to look. By making “enter the phase even when there’s nothing to do” a habit now, you defend against that drift for the rest of the tutorial.

Starter files
scorer.py
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


def score(dice):
    return BattleReport()
test_scorer.py
"""Cycle 1 — first failing test, now green."""
from scorer import score


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()
4

Cycle 2 — Single 1 → Dragon Flame

Spec: a single die showing 1 creates a Dragon Flame event worth 100 damage.

From now on, each cycle is one step with three tasks (RED → GREEN → REFACTOR). Same discipline as cycle 1 — tighter packaging.

Your task

  1. 🔴 RED — add test_single_one_creates_dragon_flame_event in test_scorer.py. The assertion should pin down total_damage == 100 and one ScoringEvent("Dragon Flame", (1,), 100) in events. Update the import to bring in ScoringEvent. Run — predict the failure, check.
  2. 🟢 GREEN — introduce a ScoringEvent dataclass and the smallest change to score. The minimum here is a hardcoded if dice == [1]: branch — that’s deliberate.
  3. 🔵 REFACTOR — walk the cycle-1 checklist. Resist generalising; cycle 4 will earn the loop.

Expected RED: ImportError: cannot import name 'ScoringEvent' — the test forces you to name the event class before writing it. That’s the design pressure of test-first thinking.

Why “allow the hard-code” is a TDD discipline

The instinct is to extract a rule, write a loop, build the abstraction now. TDD asks you to wait for the test that demands it. A speculative loop is a guess at the right shape; a loop refactor pulled by cycle 4’s test is a discovery. Refactor toward duplication, not before it.

If you’re genuinely stuck, click Show Solution below the editor.

Starter files
scorer.py
from dataclasses import dataclass


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


def score(dice):
    return BattleReport()
test_scorer.py
"""Cycles 1–2 — adding the Dragon Flame behavior."""
from scorer import score


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


# TODO (RED): import ScoringEvent from scorer
# TODO (RED): write test_single_one_creates_dragon_flame_event
#             score([1]) should return a report with total_damage == 100
#             and events == (ScoringEvent("Dragon Flame", (1,), 100),)
5

Cycle 3 — Single 5 → Lightning Spark

Spec: a single die showing 5 creates a Lightning Spark event worth 50 damage.

Same shape as cycle 2, different values. The duplication this creates is intentional — cycle 4’s test will earn the right to fix it.

Your task

  1. 🔴 RED — add test_single_five_creates_lightning_spark_event, structured exactly like cycle 2’s test but with the Lightning Spark values.
  2. 🟢 GREEN — add a second hardcoded if dice == [5]: branch. Resist the urge to write a loop or dict lookup.
  3. 🔵 REFACTOR — walk the checklist. The duplication is now visible; the right move is to note it and write nothing. No test demands the loop yet.

Why deliberately keeping ugly code is the disciplined move

You can clearly see duplication. Refactoring it now would be guessing at the right shape with one too few data points. Cycle 4’s test will provide the second data point — and the loop refactor it earns is a discovery, not a guess. Refactor toward duplication, not before it.

Click Show Solution below the editor if you need one form of the test + GREEN code.

Starter files
scorer.py
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


def score(dice):
    if dice == [1]:
        return BattleReport((
            ScoringEvent("Dragon Flame", (1,), 100),
        ))
    return BattleReport()
test_scorer.py
"""Cycles 1–3 — adding the Lightning Spark behavior."""
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


# TODO (RED): write test_single_five_creates_lightning_spark_event
#             score([5]) should return total_damage == 50 with
#             events == (ScoringEvent("Lightning Spark", (5,), 50),)
6

Cycle 4 — Repeated Singles → First Real Refactor

Spec: score([1, 1]) returns total damage 200 with two Dragon Flame events.

The first design-breaking test. Neither dice == [1] nor dice == [5] matches [1, 1] — the duplication you noted in cycle 3 just demanded payment.

Your task

  1. 🔴 RED — add test_two_ones_create_two_dragon_flames asserting damage 200 and two Flame events. Run; predict the failure.
  2. 🟢 GREEN — you have two options. Pick the better one:
    • Option A: a third hardcoded branch for dice == [1, 1]. Passes — and makes duplication worse, with a fourth branch coming next cycle.
    • Option B: replace the hardcoded branches with a for loop over each die. Passes cycle 4 and retires the cycle-3 duplication in one move.
  3. 🔵 REFACTOR — re-run; the previous three tests are your safety net. Note what just happened: you rewrote the body of score and knew within a second that you didn’t break anything.

The threshold-concept moment: tests enable change

You just rewrote the body of score — two hardcoded branches → generic per-die loop — and confirmed in one second that nothing broke. Without the test suite you’d have had to reason from scratch: “Does this still handle empty? Single 1? Single 5?”

That’s what “tests enable change” means. Not “prevent change.” Not “slow change.” Enable — they replace fear-driven manual reasoning with a mechanical check. This is the cycle that earns the safety-net quiz below.
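
A compressed sketch of that moment: two implementations, one set of behavior checks. Events are reduced to (name, damage) pairs here to keep the example short, and the helper passes is a hypothetical stand-in for running pytest.

```python
def score_hardcoded(dice):
    # Cycle-3 shape: one branch per known roll.
    if dice == [1]:
        return (("Dragon Flame", 100),)
    if dice == [5]:
        return (("Lightning Spark", 50),)
    return ()


def score_loop(dice):
    # Cycle-4 shape: one decision per die.
    events = []
    for die in dice:
        if die == 1:
            events.append(("Dragon Flame", 100))
        if die == 5:
            events.append(("Lightning Spark", 50))
    return tuple(events)


# The same behavior checks, phrased as (input, expected output) pairs.
EARLIER_CHECKS = [
    ([], ()),
    ([1], (("Dragon Flame", 100),)),
    ([5], (("Lightning Spark", 50),)),
]
CYCLE_4_CHECK = ([1, 1], (("Dragon Flame", 100), ("Dragon Flame", 100)))


def passes(score, checks):
    return all(score(dice) == expected for dice, expected in checks)


print(passes(score_hardcoded, EARLIER_CHECKS))   # True
print(passes(score_loop, EARLIER_CHECKS))        # True: the rewrite broke nothing
print(passes(score_hardcoded, [CYCLE_4_CHECK]))  # False: RED before the rewrite
print(passes(score_loop, [CYCLE_4_CHECK]))       # True
```

Because the checks only look at inputs and outputs, they apply unchanged to both bodies.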

Click Show Solution if you need to see one form of option B.

Starter files
scorer.py
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


def score(dice):
    if dice == [1]:
        return BattleReport((
            ScoringEvent("Dragon Flame", (1,), 100),
        ))
    if dice == [5]:
        return BattleReport((
            ScoringEvent("Lightning Spark", (5,), 50),
        ))
    return BattleReport()
test_scorer.py
"""Cycles 1–4 — repeated singles force the first refactor."""
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


# TODO (RED): write test_two_ones_create_two_dragon_flames
#             score([1, 1]) should return total_damage == 200 and two
#             Dragon Flame events in events
7

Cycle 5 — Mixed Dice (No Production Change)

Spec: score([1, 5]) produces a mix of one Dragon Flame and one Lightning Spark.

Your task

  1. 🔴 RED — add test_one_and_five_create_two_different_events. Predict what pytest will show this time, then run.
  2. 🟢 GREEN — most likely outcome: all five tests pass without any production change. Cycle 4’s loop already handles mixed dice independently. The new test costs zero code — it just documents the intended behavior.
  3. 🔵 REFACTOR: verify the “free pass” with the ten-second mutation move described in the next section.

The mutation move (verify the test isn’t vacuous — do this!)

A test that passes immediately can mean two things:

| Why the test passed | What it means |
| --- | --- |
| A previous refactor already generalised the behavior | ✅ Positive: design is healthy |
| The assertion is vacuously true / missing / wrong oracle | ❌ Defect: a Liar test |

Tell them apart in 10 seconds: break the production code on purpose. Open scorer.py, temporarily change if die == 5: to if die == 999:. Run pytest. The new test should now fail — proving it actually checks the behavior. Restore the line.

This single move is the strongest defence against AI-generated Liar tests. Make it a habit; we’ll come back to it.
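
The mutation move can be simulated without editing a file. In this sketch the factory make_score and its spark_face parameter are hypothetical devices that stand in for the temporary edit; events are reduced to (name, damage) pairs.

```python
def make_score(spark_face=5):
    """Build a per-die scorer; spark_face=999 simulates the temporary mutation."""
    def score(dice):
        events = []
        for die in dice:
            if die == 1:
                events.append(("Dragon Flame", 100))
            if die == spark_face:
                events.append(("Lightning Spark", 50))
        return tuple(events)
    return score


def check_one_and_five(score):
    # Stand-in for the body of the cycle-5 test.
    assert score([1, 5]) == (("Dragon Flame", 100), ("Lightning Spark", 50))


def fails(check, score):
    try:
        check(score)
    except AssertionError:
        return True
    return False


print(fails(check_one_and_five, make_score()))     # False: healthy code passes
print(fails(check_one_and_five, make_score(999)))  # True: the mutant is caught
```

A test that still passed against the mutant would be asserting nothing about the Lightning Spark rule.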

Starter files
scorer.py
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


def score(dice):
    events = []

    for die in dice:
        if die == 1:
            events.append(ScoringEvent("Dragon Flame", (1,), 100))
        if die == 5:
            events.append(ScoringEvent("Lightning Spark", (5,), 50))

    return BattleReport(tuple(events))
test_scorer.py
"""Cycles 1–5 — adding mixed dice."""
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


# TODO (RED): write test_one_and_five_create_two_different_events
#             score([1, 5]) should return total_damage == 150 with
#             one Dragon Flame followed by one Lightning Spark
8

Cycle 6 — Triple 1s → Dragon Blast (Design Moment)

Spec: three 1s in a roll combine into one Dragon Blast (1000 damage) instead of three Dragon Flames.

The pivot moment. Cycle 4’s per-die loop walks each die independently — it has no way to know that the other two 1s exist when it processes the first. This test cannot be satisfied by tweaking a branch; the structure has to change. A design-breaking test.

Your task

  1. 🔴 RED — add test_three_ones_create_dragon_blast_instead_of_three_flames. The current loop produces three Flames at 300 damage; the test demands one Blast at 1000. Run, see the value-mismatch.
  2. 🟢 GREEN — the structural shift: stop iterating dice in order, count how many of each face appeared (collections.Counter), then decide what to emit. Combos consume dice; leftovers still score as singles.
  3. 🔵 REFACTOR — re-run; the previous five tests survived a full body rewrite of score. That’s the safety net working.

Why this was a design-breaking test (and why the previous tests still pass)

A design-breaking test cannot be satisfied by adding a branch — only by reshaping structure. The cycle-4 per-die loop was fundamentally per-die; combos are fundamentally per-face-count. You couldn’t patch your way from one to the other.

Yet all five previous tests still pass after the rewrite. That’s because they assert on observable behavior (total_damage == 100, events == (event,)), not on internals (which loop, which variable name). Behavior tests survive structural rewrites; implementation-tests don’t. This is the Refactoring Litmus Test.
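
The per-face-count view is what collections.Counter provides. A minimal sketch, using an arbitrary example roll:

```python
from collections import Counter

roll = [1, 5, 1, 1]
counts = Counter(roll)

print(counts[1])  # 3: all three 1s are visible at once, wherever they sit in the roll
print(counts[5])  # 1
print(counts[6])  # 0: a missing face counts as zero instead of raising KeyError

# Consuming a combo is plain arithmetic on the count.
if counts[1] >= 3:
    counts[1] -= 3

print(counts[1])  # 0: no 1s are left over to score as singles
```

The per-die loop never had this whole-roll view, which is why no added branch could rescue it.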

Starter files
scorer.py
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


def score(dice):
    events = []

    for die in dice:
        if die == 1:
            events.append(ScoringEvent("Dragon Flame", (1,), 100))
        if die == 5:
            events.append(ScoringEvent("Lightning Spark", (5,), 50))

    return BattleReport(tuple(events))
test_scorer.py
"""Cycles 1–6 — triple 1s break the per-die loop."""
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_one_and_five_create_two_different_events():
    report = score([1, 5])

    assert report.total_damage == 150
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


# TODO (RED): write test_three_ones_create_dragon_blast_instead_of_three_flames
#             score([1, 1, 1]) should return total_damage == 1000 with
#             a single Dragon Blast event whose dice_used is (1, 1, 1)
9

Cycle 7 — Combo with Leftover Singles

Spec: score([1, 1, 1, 1, 5]) produces one Dragon Blast plus one leftover Flame and a Lightning Spark.

Cycle 6’s counts[1] -= 3 already handles this — but until a test asserts it, the leftover behavior is only implicitly correct. A future refactor could break it silently. Cycle 7 turns implicit correctness into an explicit guardrail.

Your task

  1. 🔴 RED — add test_dragon_blast_plus_leftover_flame_and_spark. Predict whether it passes immediately, then run.
  2. 🟢 GREEN — likely no production change. The cycle-6 design already produces this output.
  3. 🔵 REFACTOR + mutation check: the cycle-5 mutation move is required this time. Change counts[1] -= 3 to counts[1] -= 4, run, confirm the cycle-7 test fails, then restore the line. That proves the test is a real guardrail.

Why guardrail tests matter

The most common way a TDD suite degrades over time is implicit correctness silently rotting into implicit incorrectness. A behavior the code does today, but no test asserts, can disappear in a refactor — and nobody notices until a customer reports the bug. The discipline: when you spot behavior that “just works,” ask if a future refactor could break it. If yes, write the test that protects it.
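
The sketch below shows why the existing cycle-6 test is not enough to protect the leftover behavior. The factory make_score and its consumed parameter are hypothetical devices that simulate the mutation; events are reduced to (name, damage) pairs.

```python
from collections import Counter


def make_score(consumed=3):
    """Cycle-6 style scorer; consumed=4 simulates the counts[1] -= 4 mutation."""
    def score(dice):
        counts = Counter(dice)
        events = []
        if counts[1] >= 3:
            events.append(("Dragon Blast", 1000))
            counts[1] -= consumed
        # Multiplying a list by zero or a negative number yields an empty list.
        events.extend([("Dragon Flame", 100)] * counts[1])
        events.extend([("Lightning Spark", 50)] * counts[5])
        return tuple(events)
    return score


healthy = make_score()
mutant = make_score(consumed=4)

blast_only = (("Dragon Blast", 1000),)
blast_flame_spark = (
    ("Dragon Blast", 1000),
    ("Dragon Flame", 100),
    ("Lightning Spark", 50),
)

print(mutant([1, 1, 1]) == blast_only)                # True: cycle 6 misses the mutant
print(mutant([1, 1, 1, 1, 5]) == blast_flame_spark)   # False: cycle 7 catches it
print(healthy([1, 1, 1, 1, 5]) == blast_flame_spark)  # True
```

Only the roll with a fourth 1 exercises the leftover path, so only a test on that roll can guard it.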

Starter files
scorer.py
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


def score(dice):
    counts = Counter(dice)
    events = []

    if counts[1] >= 3:
        events.append(ScoringEvent("Dragon Blast", (1, 1, 1), 1000))
        counts[1] -= 3

    for _ in range(counts[1]):
        events.append(ScoringEvent("Dragon Flame", (1,), 100))

    for _ in range(counts[5]):
        events.append(ScoringEvent("Lightning Spark", (5,), 50))

    return BattleReport(tuple(events))
test_scorer.py
"""Cycles 1–7 — adding the leftover-singles guardrail."""
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_one_and_five_create_two_different_events():
    report = score([1, 5])

    assert report.total_damage == 150
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_ones_create_dragon_blast_instead_of_three_flames():
    report = score([1, 1, 1])

    assert report.total_damage == 1000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


# TODO (RED): write test_dragon_blast_plus_leftover_flame_and_spark
#             score([1, 1, 1, 1, 5]) should return total_damage == 1150
#             with a Dragon Blast, then a Dragon Flame, then a Lightning Spark
10

Cycle 8 — Goblin Swarm → Rule Objects (Big Refactor)

Spec: three 2s combine into a Goblin Swarm (200 damage).

Before writing anything, imagine where if counts[2] >= 3: would go. Now imagine the four remaining triples from the rules table each adding their own if-and-subtract block. Painful? That pain is the design pressure. Cycle 8 is when the structure has to change.

Your task

  1. 🔴 RED — add test_three_twos_create_goblin_swarm, structured like cycle 6’s test. Run; the existing code knows nothing about 2s.
  2. 🟢 GREEN — you have two choices:
    • Option A: another if counts[2] >= 3: block. A local minimum: the same block would be repeated for each of the four remaining triples.
    • Option B: extract the rule shape into data. Pull combo-rule logic into a ComboRule class with an apply(counts) method; same for SingleRule. Then drive score from two registries.
  3. 🔵 REFACTOR — option B is the refactor. Re-run; the seven prior tests are your safety net.

Listening to the test & the Open-Closed Principle

The discipline of listening to the test: the difficulty of imagining four more if-blocks was a signal that the structure no longer fit. The cure is structural extraction.

You just applied the Open-Closed Principle: score is now closed for modification (you won’t edit it again) but open for extension (cycle 9 will add four rules by appending data). Adding behavior = adding data, not editing code.

Refactor toward duplication, not before. Two data points (Dragon Blast + Goblin Swarm) are now enough to see the right shape; with one, you’d be guessing.
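
One possible shape for the rule objects is sketched below; it is not necessarily the tutorial's own solution. Only ComboRule, SingleRule, and apply(counts) come from the text above. The field names, the registry names COMBO_RULES and SINGLE_RULES, and the function score_events (which returns the events tuple directly instead of a BattleReport) are assumptions made to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class ComboRule:
    face: int
    name: str
    damage: int
    size: int = 3

    def apply(self, counts):
        # Emit one combo event and consume its dice, if enough are present.
        if counts[self.face] < self.size:
            return []
        counts[self.face] -= self.size
        return [ScoringEvent(self.name, (self.face,) * self.size, self.damage)]


@dataclass(frozen=True)
class SingleRule:
    face: int
    name: str
    damage: int

    def apply(self, counts):
        # Emit one event per remaining die of this face.
        events = [ScoringEvent(self.name, (self.face,), self.damage)
                  for _ in range(counts[self.face])]
        counts[self.face] = 0
        return events


# Registries (assumed names): adding a rule means appending a line here.
COMBO_RULES = (
    ComboRule(1, "Dragon Blast", 1000),
    ComboRule(2, "Goblin Swarm", 200),
)
SINGLE_RULES = (
    SingleRule(1, "Dragon Flame", 100),
    SingleRule(5, "Lightning Spark", 50),
)


def score_events(dice):
    counts = Counter(dice)
    events = []
    for rule in COMBO_RULES + SINGLE_RULES:  # combos first, then leftovers
        events.extend(rule.apply(counts))
    return tuple(events)


print(score_events([2, 2, 2]))
print(score_events([1, 1, 1, 1, 5]))
```

In this shape the scoring function contains no face numbers at all; every rule-specific detail lives in the two registries.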

Starter files
scorer.py
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


def score(dice):
    counts = Counter(dice)
    events = []

    if counts[1] >= 3:
        events.append(ScoringEvent("Dragon Blast", (1, 1, 1), 1000))
        counts[1] -= 3

    for _ in range(counts[1]):
        events.append(ScoringEvent("Dragon Flame", (1,), 100))

    for _ in range(counts[5]):
        events.append(ScoringEvent("Lightning Spark", (5,), 50))

    return BattleReport(tuple(events))
test_scorer.py
"""Cycles 1–8 — Goblin Swarm forces the rule-object refactor."""
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_one_and_five_create_two_different_events():
    report = score([1, 5])

    assert report.total_damage == 150
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_ones_create_dragon_blast_instead_of_three_flames():
    report = score([1, 1, 1])

    assert report.total_damage == 1000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


def test_dragon_blast_plus_leftover_flame_and_spark():
    report = score([1, 1, 1, 1, 5])

    assert report.total_damage == 1150
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


# TODO (RED): write test_three_twos_create_goblin_swarm
#             score([2, 2, 2]) should return total_damage == 200 with
#             a Goblin Swarm event whose dice_used is (2, 2, 2)
11

Cycle 9 — Add the Rest of the Combo Rules

Spec: the four remaining triples — Triple 3 → Orc Charge (300), Triple 4 → Troll Smash (400), Triple 5 → Lightning Storm (500), Triple 6 → Demon Strike (600).

This cycle exists to prove cycle 8’s refactor paid off. Four new behaviors → no edits to score().

Your task

  1. 🔴 RED — write one parametrised test (@pytest.mark.parametrize) covering all four triples. Each row runs as a separate pytest result.
  2. 🟢 GREEN — append four rows to COMBO_RULES. Do not touch score(). That’s the proof.
  3. 🔵 REFACTOR — re-run; check duplication (none — rules are data) and that all 12 results pass.

Why parametrize beats a for-loop inside one test

@pytest.mark.parametrize runs the function once per row, reporting each as a separate test result. A for loop inside a single test stops at the first failure, hiding everything after it. The parametrise idiom is the right Python answer to “N tests of the same shape” — DRY tests that still report separate failures.
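
The difference in failure reporting can be imitated without pytest. The sketch below is a plain-Python stand-in, not pytest's machinery: TRIPLE_DAMAGE is a hypothetical lookup with two values broken on purpose, and the two functions only mimic "one test containing a loop" versus "one result per case".

```python
# Hypothetical scorer stand-in; the entries for 4 and 6 are wrong on purpose.
TRIPLE_DAMAGE = {3: 300, 4: 999, 5: 500, 6: 0}

CASES = [(3, 300), (4, 400), (5, 500), (6, 600)]


def check(die, expected):
    assert TRIPLE_DAMAGE[die] == expected, f"triple {die}"


def loop_in_one_test():
    """All cases share one test body, so the first failure ends the run."""
    failures = []
    try:
        for die, expected in CASES:
            check(die, expected)
    except AssertionError:
        failures.append(die)  # the loop never reaches the later cases
    return failures


def one_result_per_case():
    """Each case is run and reported on its own, as parametrize does."""
    failures = []
    for die, expected in CASES:
        try:
            check(die, expected)
        except AssertionError:
            failures.append(die)
    return failures


print(loop_in_one_test())
print(one_result_per_case())
```

The loop version reports only the failure for triple 4; the broken value for triple 6 stays hidden until the first one is fixed. The per-case version surfaces both in a single run.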

Starter files
scorer.py
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


@dataclass(frozen=True)
class ComboRule:
    die: int
    count: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        if counts[self.die] >= self.count:
            dice_used = tuple([self.die] * self.count)
            events.append(ScoringEvent(self.name, dice_used, self.damage))
            counts[self.die] -= self.count

        return events


@dataclass(frozen=True)
class SingleRule:
    die: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        for _ in range(counts[self.die]):
            events.append(ScoringEvent(self.name, (self.die,), self.damage))

        return events


COMBO_RULES = (
    ComboRule(1, 3, "Dragon Blast", 1000),
    ComboRule(2, 3, "Goblin Swarm", 200),
)

SINGLE_RULES = (
    SingleRule(1, "Dragon Flame", 100),
    SingleRule(5, "Lightning Spark", 50),
)


def score(dice):
    counts = Counter(dice)
    events = []

    for rule in COMBO_RULES:
        events.extend(rule.apply(counts))

    for rule in SINGLE_RULES:
        events.extend(rule.apply(counts))

    return BattleReport(tuple(events))
test_scorer.py
"""Cycles 1–9 — adding the four remaining triple combos."""
import pytest
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_one_and_five_create_two_different_events():
    report = score([1, 5])

    assert report.total_damage == 150
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_ones_create_dragon_blast_instead_of_three_flames():
    report = score([1, 1, 1])

    assert report.total_damage == 1000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


def test_dragon_blast_plus_leftover_flame_and_spark():
    report = score([1, 1, 1, 1, 5])

    assert report.total_damage == 1150
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_twos_create_goblin_swarm():
    report = score([2, 2, 2])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
    )


# TODO (RED): write a parametrized test_other_triples_create_combo_events
#             that covers triples of 3, 4, 5, 6 with their respective events
12

Cycle 10 — Six 1s → Two Dragon Blasts (Hidden Bug)

Spec: score([1, 1, 1, 1, 1, 1]) produces two Dragon Blasts (2000 damage).

Predict first. Before writing a test, look at your ComboRule.apply and trace through six 1s by hand. What does the current code actually produce?

Reveal the hidden bug (open after you’ve predicted)

The if counts[self.die] >= self.count: check in ComboRule.apply runs only once. For six 1s it emits one Blast, and counts[self.die] -= self.count leaves counts[1] == 3. Those three 1s fall through to SingleRule, which emits three Flames. Total: 1000 + 300 = 1300 damage.

But six 1s should be two Blasts → 2000 damage. The code is wrong — and no previous test caught it, because every prior combo test used exactly three of a face.

This is the kind of bug TDD literature reports: a defect the developer doesn’t know exists in code they wrote themselves. The test surfaces it.
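
The hand trace can be replayed as a short script. It inlines the logic of the starter ComboRule.apply and SingleRule.apply for die 1 only, and uses plain (name, damage) tuples instead of ScoringEvent to keep the sketch small.

```python
from collections import Counter

counts = Counter([1, 1, 1, 1, 1, 1])
events = []

# ComboRule.apply logic for Dragon Blast (die=1, count=3).
if counts[1] >= 3:          # an `if` runs its body at most once
    events.append(("Dragon Blast", 1000))
    counts[1] -= 3          # 6 - 3 leaves three 1s behind

# SingleRule.apply logic for Dragon Flame: one event per leftover 1.
for _ in range(counts[1]):
    events.append(("Dragon Flame", 100))

total = sum(damage for _, damage in events)
print(events)
print(total)
```

One Blast plus three Flames, 1300 in total, where the spec asks for two Blasts and 2000.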

Your task

  1. 🔴 RED — add test_six_ones_create_two_dragon_blasts asserting 2000 damage and two Blast events. Run; see the wrong events tuple.
  2. 🟢 GREEN — fix ComboRule.apply so it can emit zero or more combos per call, with the correct leftover bookkeeping. Hint: floor-division (//) and modulo (%) are the right operators.
  3. 🔵 REFACTOR — re-run. Especially gratifying: cycle 7’s leftover guardrail still passes, because 4 % 3 == 1 matches 4 - 3 == 1 for that input. The fix only changed behavior on cases no prior test pinned down.

What this teaches about coverage vs. boundary thinking

Every previous combo test used exactly count dice (three 1s, three 2s, etc.). The bug only manifests at 2 × count and beyond. Line coverage told you the if ran. It didn’t tell you the line was right for all relevant inputs.

That’s the gap between coverage and boundary-value analysis: every behavior has boundaries (0, exactly N, 2N, between N and 2N) and a healthy suite probes each. Coverage is a locator of under-tested code; it isn’t a measure of correctness.
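
A quick sweep over die counts shows where the two implementations part ways. Both helpers are hypothetical reductions of ComboRule.apply to pure arithmetic: each returns a (number_of_combos, leftover) pair for a combo size of 3.

```python
def combos_with_if(n, size=3):
    """Starter behaviour: at most one combo, then subtract."""
    if n >= size:
        return 1, n - size
    return 0, n


def combos_with_divmod(n, size=3):
    """Fixed behaviour: as many combos as fit, remainder left over."""
    return divmod(n, size)


disagreements = [
    n for n in range(10)
    if combos_with_if(n) != combos_with_divmod(n)
]
print(disagreements)
```

The two versions agree on every count from 0 to 5, which covers every input the earlier tests used, and diverge from 6 upward. A test at 2 × count is the cheapest probe that separates them.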

Starter files
scorer.py
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


@dataclass(frozen=True)
class ComboRule:
    die: int
    count: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        if counts[self.die] >= self.count:
            dice_used = tuple([self.die] * self.count)
            events.append(ScoringEvent(self.name, dice_used, self.damage))
            counts[self.die] -= self.count

        return events


@dataclass(frozen=True)
class SingleRule:
    die: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        for _ in range(counts[self.die]):
            events.append(ScoringEvent(self.name, (self.die,), self.damage))

        return events


COMBO_RULES = (
    ComboRule(1, 3, "Dragon Blast", 1000),
    ComboRule(2, 3, "Goblin Swarm", 200),
    ComboRule(3, 3, "Orc Charge", 300),
    ComboRule(4, 3, "Troll Smash", 400),
    ComboRule(5, 3, "Lightning Storm", 500),
    ComboRule(6, 3, "Demon Strike", 600),
)

SINGLE_RULES = (
    SingleRule(1, "Dragon Flame", 100),
    SingleRule(5, "Lightning Spark", 50),
)


def score(dice):
    counts = Counter(dice)
    events = []

    for rule in COMBO_RULES:
        events.extend(rule.apply(counts))

    for rule in SINGLE_RULES:
        events.extend(rule.apply(counts))

    return BattleReport(tuple(events))
test_scorer.py
"""Cycles 1–10 — surfacing the multi-combo edge case."""
import pytest
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_one_and_five_create_two_different_events():
    report = score([1, 5])

    assert report.total_damage == 150
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_ones_create_dragon_blast_instead_of_three_flames():
    report = score([1, 1, 1])

    assert report.total_damage == 1000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


def test_dragon_blast_plus_leftover_flame_and_spark():
    report = score([1, 1, 1, 1, 5])

    assert report.total_damage == 1150
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_twos_create_goblin_swarm():
    report = score([2, 2, 2])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
    )


@pytest.mark.parametrize(
    "roll, expected_event",
    [
        ([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
        ([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
        ([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
        ([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
    ],
)
def test_other_triples_create_combo_events(roll, expected_event):
    report = score(roll)

    assert report.total_damage == expected_event.damage
    assert report.events == (expected_event,)


# TODO (RED): write test_six_ones_create_two_dragon_blasts
#             score([1, 1, 1, 1, 1, 1]) should return total_damage == 2000
#             and events containing TWO Dragon Blast ScoringEvents
13

Cycle 11 — Invalid Dice Raise an Error

Spec: score([1, 2, 9]) raises ValueError("Die values must be between 1 and 6"). Dice are integers 1–6.

So far every test was a happy-path test. Happy-path-only suites are the single most common smell in student-written test code (per testing-education research). TDD applies to error behavior just as much as to success — pytest.raises is the idiom.

Your task

  1. 🔴 RED — add test_invalid_die_value_raises_error using pytest.raises(ValueError, match=...). The current score silently ignores the 9, so the test fails because no exception is raised.
  2. 🟢 GREEN — add a validate_dice(dice) helper that raises ValueError for any die outside 1–6; call it at the top of score.
  3. 🔵 REFACTOR — checklist; constants like VALID_DICE keep the rule explicit.

Why match= matters & happy-path-only is dangerous

Without match=, the test would accept any ValueError — even one with no message, or the wrong message. With it, the test pins down the contract: the function raises and explains why. That’s an oracle for the error case, not just for success.

The deeper lesson: line coverage doesn’t catch missing error handling. A function that silently processes invalid input passes every happy-path assertion. Only an invalid-input test asks “does the boundary actually error?”
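
It helps to know that match= is a regular expression that pytest tests against the exception's string form with re.search. The sketch below imitates that check in plain Python; raises_value_error_matching is a hypothetical helper that returns a boolean, whereas pytest itself fails the test instead.

```python
import re

MESSAGE = "Die values must be between 1 and 6"


def validate_dice(dice):
    for die in dice:
        if die not in {1, 2, 3, 4, 5, 6}:
            raise ValueError(MESSAGE)


def raises_value_error_matching(pattern, dice):
    """Roughly what pytest.raises(ValueError, match=pattern) checks."""
    try:
        validate_dice(dice)
    except ValueError as error:
        return re.search(pattern, str(error)) is not None
    return False  # no exception at all: the RED state for this cycle


right_message = raises_value_error_matching("between 1 and 6", [1, 2, 9])
wrong_message = raises_value_error_matching("must be positive", [1, 2, 9])
no_exception = raises_value_error_matching("between 1 and 6", [1, 2, 3])
print(right_message, wrong_message, no_exception)
```

Because the pattern is a regex searched anywhere in the message, a substring is enough to match. If a message ever contains characters such as parentheses or dots, wrapping it in re.escape keeps them literal.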

Starter files
scorer.py
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


@dataclass(frozen=True)
class ComboRule:
    die: int
    count: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        number_of_combos = counts[self.die] // self.count

        for _ in range(number_of_combos):
            dice_used = tuple([self.die] * self.count)
            events.append(ScoringEvent(self.name, dice_used, self.damage))

        counts[self.die] %= self.count

        return events


@dataclass(frozen=True)
class SingleRule:
    die: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        for _ in range(counts[self.die]):
            events.append(ScoringEvent(self.name, (self.die,), self.damage))

        return events


COMBO_RULES = (
    ComboRule(1, 3, "Dragon Blast", 1000),
    ComboRule(2, 3, "Goblin Swarm", 200),
    ComboRule(3, 3, "Orc Charge", 300),
    ComboRule(4, 3, "Troll Smash", 400),
    ComboRule(5, 3, "Lightning Storm", 500),
    ComboRule(6, 3, "Demon Strike", 600),
)

SINGLE_RULES = (
    SingleRule(1, "Dragon Flame", 100),
    SingleRule(5, "Lightning Spark", 50),
)


def score(dice):
    counts = Counter(dice)
    events = []

    for rule in COMBO_RULES:
        events.extend(rule.apply(counts))

    for rule in SINGLE_RULES:
        events.extend(rule.apply(counts))

    return BattleReport(tuple(events))
test_scorer.py
"""Cycles 1–11 — adding input validation."""
import pytest
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_one_and_five_create_two_different_events():
    report = score([1, 5])

    assert report.total_damage == 150
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_ones_create_dragon_blast_instead_of_three_flames():
    report = score([1, 1, 1])

    assert report.total_damage == 1000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


def test_dragon_blast_plus_leftover_flame_and_spark():
    report = score([1, 1, 1, 1, 5])

    assert report.total_damage == 1150
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_twos_create_goblin_swarm():
    report = score([2, 2, 2])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
    )


@pytest.mark.parametrize(
    "roll, expected_event",
    [
        ([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
        ([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
        ([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
        ([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
    ],
)
def test_other_triples_create_combo_events(roll, expected_event):
    report = score(roll)

    assert report.total_damage == expected_event.damage
    assert report.events == (expected_event,)


def test_six_ones_create_two_dragon_blasts():
    report = score([1, 1, 1, 1, 1, 1])

    assert report.total_damage == 2000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


# TODO (RED): write test_invalid_die_value_raises_error
#             score([1, 2, 9]) should raise ValueError("Die values must be between 1 and 6")
#             Use `with pytest.raises(ValueError, match="..."): score([1, 2, 9])`
14

Cycle 12 — Human-Readable Battle Summary

Spec: report.summary() returns a one-line string like "Dragon Blast (1000) + Lightning Spark (50) = 1050". An empty report returns "No damage.".

Twelve cycles in. Cycle 12 is the only one that adds behavior to BattleReport itself rather than to score. Where a behavior lives is a design decision — putting summary on the report keeps formatting close to the data.

Your task

  1. 🔴 RED — add test_battle_report_has_human_readable_summary. Run; expect AttributeError.
  2. 🟢 GREEN — add a summary method to BattleReport that handles empty events as a special case and joins name (damage) segments with " + " for non-empty.
  3. 🔵 REFACTOR — final checklist. Re-run; all 16+ tests should be green. You just completed twelve TDD cycles.

A small design choice: method vs. property?

Properties signal cheap, side-effect-free derived values. summary() is also cheap and side-effect-free, but it branches on the empty case and assembles a formatted string, which reads more like an action than an attribute. The convention used here: methods for derived strings; properties for derived numbers and booleans. We chose a method. (Either would work, but be consistent within a project.)
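
The difference shows up at the call site. In this sketch Report is a hypothetical, smaller cousin of BattleReport that stores bare damage numbers, so its summary format is simpler than the one in the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:  # hypothetical, for illustration only
    damages: tuple = ()

    @property
    def total(self):
        # Read like an attribute: report.total
        return sum(self.damages)

    def summary(self):
        # Called like an action: report.summary()
        if not self.damages:
            return "No damage."
        parts = " + ".join(str(damage) for damage in self.damages)
        return f"{parts} = {self.total}"


report = Report((1000, 50))
print(report.total)
print(report.summary())
print(Report().summary())
```

Whichever form you pick, the test fixes it: a test that calls report.summary() will fail with a TypeError if summary is later turned into a property.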

Starter files
scorer.py
from collections import Counter
from dataclasses import dataclass


VALID_DICE = {1, 2, 3, 4, 5, 6}


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)


@dataclass(frozen=True)
class ComboRule:
    die: int
    count: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        number_of_combos = counts[self.die] // self.count

        for _ in range(number_of_combos):
            dice_used = tuple([self.die] * self.count)
            events.append(ScoringEvent(self.name, dice_used, self.damage))

        counts[self.die] %= self.count

        return events


@dataclass(frozen=True)
class SingleRule:
    die: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        for _ in range(counts[self.die]):
            events.append(ScoringEvent(self.name, (self.die,), self.damage))

        return events


COMBO_RULES = (
    ComboRule(1, 3, "Dragon Blast", 1000),
    ComboRule(2, 3, "Goblin Swarm", 200),
    ComboRule(3, 3, "Orc Charge", 300),
    ComboRule(4, 3, "Troll Smash", 400),
    ComboRule(5, 3, "Lightning Storm", 500),
    ComboRule(6, 3, "Demon Strike", 600),
)

SINGLE_RULES = (
    SingleRule(1, "Dragon Flame", 100),
    SingleRule(5, "Lightning Spark", 50),
)


def validate_dice(dice):
    for die in dice:
        if die not in VALID_DICE:
            raise ValueError("Die values must be between 1 and 6")


def score(dice):
    validate_dice(dice)

    counts = Counter(dice)
    events = []

    for rule in COMBO_RULES:
        events.extend(rule.apply(counts))

    for rule in SINGLE_RULES:
        events.extend(rule.apply(counts))

    return BattleReport(tuple(events))
test_scorer.py
"""Cycles 1–12 — adding the human-readable summary."""
import pytest
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_one_and_five_create_two_different_events():
    report = score([1, 5])

    assert report.total_damage == 150
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_ones_create_dragon_blast_instead_of_three_flames():
    report = score([1, 1, 1])

    assert report.total_damage == 1000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


def test_dragon_blast_plus_leftover_flame_and_spark():
    report = score([1, 1, 1, 1, 5])

    assert report.total_damage == 1150
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_twos_create_goblin_swarm():
    report = score([2, 2, 2])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
    )


@pytest.mark.parametrize(
    "roll, expected_event",
    [
        ([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
        ([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
        ([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
        ([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
    ],
)
def test_other_triples_create_combo_events(roll, expected_event):
    report = score(roll)

    assert report.total_damage == expected_event.damage
    assert report.events == (expected_event,)


def test_six_ones_create_two_dragon_blasts():
    report = score([1, 1, 1, 1, 1, 1])

    assert report.total_damage == 2000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


def test_invalid_die_value_raises_error():
    with pytest.raises(ValueError, match="Die values must be between 1 and 6"):
        score([1, 2, 9])


# TODO (RED): write test_battle_report_has_human_readable_summary
#             score([1, 1, 1, 5]).summary() should equal
#             "Dragon Blast (1000) + Lightning Spark (50) = 1050"
15

The Big Picture — Lessons from Twelve Cycles

Twelve cycles. Twelve tests. One small domain model with combos, leftovers, validation, and a summary method. Scroll through your final scorer.py — every line is justified by a test. No speculative code, no dead branch. The tests are the byproduct; the design is the primary product.

The takeaways that travel

  1. TDD is design, not testing. The test is the contract; the implementation emerges under its pressure.
  2. Refactor toward duplication, not before it. Two examples reveal the right shape; one is a guess.
  3. Tests enable change. Behavior-level assertions survive structural rewrites; implementation-coupled tests don’t.
  4. Coverage ≠ correctness. Boundary thinking (cycle 10) catches what coverage misses.
  5. Listen to the test. Pain in writing a test usually points at the production code.

Take the final knowledge check at the end of this step. The sections below review the whole journey, the anti-pattern taxonomy, the empirical case for TDD, and what to learn next.

The journey in one table

| Cycle | Behavior | Design move | Lesson |
|---|---|---|---|
| 1 | Empty roll | First class + function | RED for the right reason |
| 2 | Single 1 | ScoringEvent introduced | Allow the ugly first GREEN |
| 3 | Single 5 | Second hardcoded branch | Refactor toward duplication |
| 4 | Repeated singles | Per-die loop | First real refactor; tests enable change |
| 5 | Mixed dice | (no production change) | Free pass; verify with the mutation move |
| 6 | Triple 1s | Counter, count-then-emit | Design-breaking test; structural shift |
| 7 | Combo + leftovers | (no production change) | Guardrail tests for implicit correctness |
| 8 | Triple 2s | Rule objects | Listening to the test; Open-Closed |
| 9 | Other triples | Append data | Refactor pays off; parametrize |
| 10 | Six 1s | // and %= | Hidden edge case; boundary > coverage |
| 11 | Invalid dice | pytest.raises | Robustness is first-class |
| 12 | Summary | Method on BattleReport | Behavior on the existing object |

When TDD shines, when it’s overkill

TDD shines on: new features with clear behavioral requirements; complex logic with branching cases; long-lived code modified by multiple people; API design (the test forces a caller’s perspective); domains where regressions hurt (payments, scoring, calculations).

TDD is overkill on: one-off throwaway scripts; exploratory prototyping where the problem isn’t yet defined; UI layout (visual correctness); non-binary outcomes (ML accuracy, image recognition); Jupyter research explorations.

Even in the second list, some tests pay off — the question is whether to write them first. Kent Beck: “the discipline of working strictly test-first is valuable but not necessarily something you want to do all the time.”

TDD anti-pattern taxonomy

| Level | Anti-pattern | What it looks like | Antidote |
|---|---|---|---|
| I | The Liar | Test passes but asserts vacuously (isinstance(x, int) only) | Cycle 5’s mutation move |
| I | The Nitpicker | Asserts on private attributes / implementation details | Assert on observable behavior |
| II | Success Against All Odds | New test passes immediately, with no investigation | Verify with mutation |
| II | Skip-the-Refactor | Stop at green; never enter REFACTOR | Make the look mandatory |
| III | The Giant | One test asserts dozens of behaviors (mirrors a God Object) | One behavior per test |
| III | Excessive Setup | 30+ lines of fixture before one assertion | Decouple production code |
| IV | The Mockery | More mock setup than test logic | Listen: the design is wrong |
| IV | Modify-the-Test | AI rewrites the test to match buggy code | Own the spec yourself |

The pattern: the higher the level, the more the anti-pattern is an architectural smell rather than a testing mistake. Listen to the test.
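
The Liar and its antidote can be seen side by side. The sketch assumes the mutation move means deliberately breaking the production code and checking that a test notices; total_damage is a hypothetical stand-in for the scorer, and the two "tests" return booleans instead of asserting so that both outcomes can be printed.

```python
def total_damage(dice):
    """Hypothetical stand-in: 100 per 1, 50 per 5."""
    return 100 * dice.count(1) + 50 * dice.count(5)


def mutant_total_damage(dice):
    """Deliberately broken copy: the 5s are ignored."""
    return 100 * dice.count(1)


def liar_test(fn):
    # Vacuous: only checks the type of the result.
    return isinstance(fn([1, 5]), int)


def behaviour_test(fn):
    # Pins the observable value.
    return fn([1, 5]) == 150


for fn in (total_damage, mutant_total_damage):
    print(fn.__name__, liar_test(fn), behaviour_test(fn))
```

The vacuous check passes for both the real function and the mutant, so it protects nothing. The value assertion passes for the real function and fails for the mutant, which is the evidence that it tests behavior.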

The empirical case for TDD

| Study | Finding |
|---|---|
| Microsoft & IBM (Nagappan, Maximilien, Bhat & Williams, 2008) | 39–91% decrease in pre-release defect density in TDD teams |
| Same studies | 15–35% longer initial development; offset by reduced debugging |
| Erdogmus et al. (2005) | Test-first students wrote more tests AND were more productive per test |
| Janzen & Saiedian (ICSE 2007) | Mature programmers exposed to TDD significantly more likely to prefer TDD later (the Residual Effect) |
| Fucci et al. (2017) | TDD’s benefit comes from granularity + uniformity, not strict test-first ordering; your twelve tiny cycles embody both |

Caveat: the evidence is mixed for solo programmers on short tasks. It is strongest in team settings, with CI, on long-lived systems.

What to learn next (the same rhythm, scaled up)

  • Fixtures (@pytest.fixture) for reusable setup of objects, DBs, mock APIs
  • Mocks, fakes, stubs — with a strong default toward fakes over mocks
  • Property-based testing with Hypothesis — score(any list of 1–6) should always satisfy invariants
  • Mutation testing with mutmut or cosmic-ray — automate the cycle-5 mutation move across the whole suite
  • The Outside-In / Double-Loop pattern (Percival, Obey the Testing Goat) — high-level acceptance tests drive unit tests

Each lives inside the same Red-Green-Refactor rhythm you just internalised. They scale the discipline; they don’t replace it.
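
To give a feel for the property-based idea, here is a hand-rolled sketch that uses only the standard library. It is not Hypothesis, which would generate and shrink inputs for you; total_damage is a compact stand-in for score(dice).total_damage built from the damage values in this tutorial, and the two invariants are examples chosen for illustration.

```python
import random
from collections import Counter

TRIPLE_DAMAGE = {1: 1000, 2: 200, 3: 300, 4: 400, 5: 500, 6: 600}
SINGLE_DAMAGE = {1: 100, 5: 50}


def total_damage(dice):
    """Compact stand-in for score(dice).total_damage."""
    total = 0
    for die, n in Counter(dice).items():
        triples, leftover = divmod(n, 3)
        total += triples * TRIPLE_DAMAGE[die]
        total += leftover * SINGLE_DAMAGE.get(die, 0)
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
for _ in range(200):
    dice = [rng.randint(1, 6) for _ in range(rng.randint(0, 8))]
    shuffled = dice[:]
    rng.shuffle(shuffled)

    # Invariant 1: the order of the dice never changes the total.
    assert total_damage(shuffled) == total_damage(dice)
    # Invariant 2: damage is never negative and is a multiple of 50.
    assert total_damage(dice) >= 0 and total_damage(dice) % 50 == 0

print("200 random rolls satisfied both invariants")
```

Instead of pinning one input to one output, each assertion states something that must hold for every valid roll. That complements the example-based tests from the twelve cycles; it does not replace them.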

Starter files
scorer.py
# The full, twelve-cycle implementation lives here. Use this step's
# editor to scroll through what you built — every line is justified by
# a test in test_scorer.py. There is no speculative code.
from collections import Counter
from dataclasses import dataclass


VALID_DICE = {1, 2, 3, 4, 5, 6}


@dataclass(frozen=True)
class ScoringEvent:
    name: str
    dice_used: tuple
    damage: int


@dataclass(frozen=True)
class BattleReport:
    events: tuple = ()

    @property
    def total_damage(self):
        return sum(event.damage for event in self.events)

    def summary(self):
        if not self.events:
            return "No damage."

        event_text = " + ".join(
            f"{event.name} ({event.damage})" for event in self.events
        )

        return f"{event_text} = {self.total_damage}"


@dataclass(frozen=True)
class ComboRule:
    die: int
    count: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        number_of_combos = counts[self.die] // self.count

        for _ in range(number_of_combos):
            dice_used = tuple([self.die] * self.count)
            events.append(ScoringEvent(self.name, dice_used, self.damage))

        counts[self.die] %= self.count

        return events


@dataclass(frozen=True)
class SingleRule:
    die: int
    name: str
    damage: int

    def apply(self, counts):
        events = []

        for _ in range(counts[self.die]):
            events.append(ScoringEvent(self.name, (self.die,), self.damage))

        return events


COMBO_RULES = (
    ComboRule(1, 3, "Dragon Blast", 1000),
    ComboRule(2, 3, "Goblin Swarm", 200),
    ComboRule(3, 3, "Orc Charge", 300),
    ComboRule(4, 3, "Troll Smash", 400),
    ComboRule(5, 3, "Lightning Storm", 500),
    ComboRule(6, 3, "Demon Strike", 600),
)

SINGLE_RULES = (
    SingleRule(1, "Dragon Flame", 100),
    SingleRule(5, "Lightning Spark", 50),
)


def validate_dice(dice):
    for die in dice:
        if die not in VALID_DICE:
            raise ValueError("Die values must be between 1 and 6")


def score(dice):
    validate_dice(dice)

    counts = Counter(dice)
    events = []

    for rule in COMBO_RULES:
        events.extend(rule.apply(counts))

    for rule in SINGLE_RULES:
        events.extend(rule.apply(counts))

    return BattleReport(tuple(events))
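
The combo-before-single ordering in `score` depends on integer division and remainder over the per-face counts. The sketch below isolates that arithmetic in a standalone function. `split_triples` is a hypothetical helper written for illustration, not part of `scorer.py`, and it assumes every combo needs exactly three dice, as in `COMBO_RULES`.

```python
from collections import Counter


def split_triples(dice):
    """Hypothetical helper: split a roll into full triples and leftover dice.

    Returns two dicts keyed by die face: how many complete triples that face
    forms, and how many dice of that face are left over afterwards.
    """
    counts = Counter(dice)
    triples = {}
    leftovers = {}

    for face, n in sorted(counts.items()):
        # divmod gives the same pair as (n // 3, n % 3) in one call.
        full, rest = divmod(n, 3)
        if full:
            triples[face] = full
        if rest:
            leftovers[face] = rest

    return triples, leftovers


triples, leftovers = split_triples([1, 1, 1, 1, 5])
print(triples)    # {1: 1}
print(leftovers)  # {1: 1, 5: 1}
```

For the roll `[1, 1, 1, 1, 5]`, the four ones form one triple with one die left over, which matches the Dragon Blast, Dragon Flame, Lightning Spark sequence the tests expect.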
test_scorer.py
"""All twelve cycles, all green. Read it as a contract."""
import pytest
from scorer import score, ScoringEvent


def test_empty_roll_has_zero_damage_and_no_events():
    report = score([])

    assert report.total_damage == 0
    assert report.events == ()


def test_single_one_creates_dragon_flame_event():
    report = score([1])

    assert report.total_damage == 100
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_single_five_creates_lightning_spark_event():
    report = score([5])

    assert report.total_damage == 50
    assert report.events == (
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_two_ones_create_two_dragon_flames():
    report = score([1, 1])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Dragon Flame", (1,), 100),
    )


def test_one_and_five_create_two_different_events():
    report = score([1, 5])

    assert report.total_damage == 150
    assert report.events == (
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_ones_create_dragon_blast_instead_of_three_flames():
    report = score([1, 1, 1])

    assert report.total_damage == 1000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


def test_dragon_blast_plus_leftover_flame_and_spark():
    report = score([1, 1, 1, 1, 5])

    assert report.total_damage == 1150
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Flame", (1,), 100),
        ScoringEvent("Lightning Spark", (5,), 50),
    )


def test_three_twos_create_goblin_swarm():
    report = score([2, 2, 2])

    assert report.total_damage == 200
    assert report.events == (
        ScoringEvent("Goblin Swarm", (2, 2, 2), 200),
    )


@pytest.mark.parametrize(
    "roll, expected_event",
    [
        ([3, 3, 3], ScoringEvent("Orc Charge", (3, 3, 3), 300)),
        ([4, 4, 4], ScoringEvent("Troll Smash", (4, 4, 4), 400)),
        ([5, 5, 5], ScoringEvent("Lightning Storm", (5, 5, 5), 500)),
        ([6, 6, 6], ScoringEvent("Demon Strike", (6, 6, 6), 600)),
    ],
)
def test_other_triples_create_combo_events(roll, expected_event):
    report = score(roll)

    assert report.total_damage == expected_event.damage
    assert report.events == (expected_event,)


def test_six_ones_create_two_dragon_blasts():
    report = score([1, 1, 1, 1, 1, 1])

    assert report.total_damage == 2000
    assert report.events == (
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
        ScoringEvent("Dragon Blast", (1, 1, 1), 1000),
    )


def test_invalid_die_value_raises_error():
    with pytest.raises(ValueError, match="Die values must be between 1 and 6"):
        score([1, 2, 9])


def test_battle_report_has_human_readable_summary():
    report = score([1, 1, 1, 5])

    assert report.summary() == (
        "Dragon Blast (1000) + Lightning Spark (50) = 1050"
    )
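
The assertions above compare `report.events` against freshly constructed `ScoringEvent` objects. That works because `@dataclass` generates a field-by-field `__eq__`, and `frozen=True` additionally blocks attribute assignment. A minimal standalone sketch of both properties follows. The `Event` class here is a stand-in defined only for this example, with the same fields as `ScoringEvent`.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Event:
    """Stand-in for ScoringEvent, defined here so the sketch is self-contained."""
    name: str
    dice_used: tuple
    damage: int


first = Event("Dragon Flame", (1,), 100)
second = Event("Dragon Flame", (1,), 100)

# Two distinct objects with equal fields compare equal.
same_value = first == second
same_object = first is second

# Frozen dataclasses reject attribute assignment after construction.
try:
    first.damage = 0
except FrozenInstanceError:
    was_mutated = False
else:
    was_mutated = True

print(same_value, same_object, was_mutated)  # True False False
```

Value equality is what lets each test state the expected events as plain data instead of inspecting attributes one by one.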