Debugging Python: From Symptom to Fix
Learn debugging as a distinct, learnable skill — not as accidental tinkering. You'll work through three real Python bugs (recursive boundary, data representation, temporal ordering) using a hypothesis-driven process and the time-travel debugger's breakpoints, conditional breakpoints, watch expressions, and history scrubber. Ends with an interleaved triage drill and an independent transfer challenge.
The Debugging Process
🎯 Goal: Apply the 7-stage debugging cycle to a tiny off-by-one bug.
flowchart TD
A[1. Symptom — what's wrong?] --> B[2. Predict — what should the state be?]
B --> C[3. Evidence — collect data with the right tool]
C --> D[4. Hypothesis — one sentence cause]
D --> E[5. Localize — first wrong line]
E --> F[6. Fix — minimal change]
F --> G[7. Verify — rerun ALL tests]
No edit happens until stage 6. That’s the central discipline.
Why this matters & what you'll learn
Debugging is a systematic, learnable process — not a vibe. Most engineers default to tinkering (edit, run, hope, repeat) and the bug eventually goes away without them learning what was wrong. The 7-stage cycle above replaces tinkering with a discipline you can repeat on any bug. Walking through it once on a tiny off-by-one anchors the cycle before you face anything harder.
You will learn to:
- Apply the 7-stage hypothesis-driven cycle to a small failing test.
- Distinguish fault, error, and failure — and trace one to the next.
- Evaluate why the local-verification trap (only rerunning the failing test) hides regressions.
📖 Recap from lecture: the four phases of debugging
Lecture 10 framed debugging as a systematic process with four phases:
- Investigating symptoms to reproduce the bug
- Locating the faulty code
- Determining the root cause of the bug
- Implementing and verifying a fix
Inside that frame, each phase has its own moves. The 7-stage cycle is the zoomed-in version of those four phases — same process, more resolution. The four phases tell you what to do; the seven stages tell you how.
| Lecture phase | This tutorial’s stages |
|---|---|
| 1. Investigate symptoms / Reproduce | Symptom + Predict + Evidence |
| 3. Determine root cause | Hypothesis |
| 2. Locate the faulty code | Localize |
| 4. Implement & verify fix | Fix + Verify |
🐞 Lecture vocabulary: fault vs error vs failure
The lecture distinguished three terms that get sloppily blurred in everyday speech:
| Term | Definition | Where it lives |
|---|---|---|
| Fault | The erroneous location in the code (e.g., range(1, ...) skipping index 0). | In source code. |
| Error | An incorrect program state during execution (e.g., the loop variable i starts at the wrong value). | In memory at runtime. |
| Failure | The observed outside behavior (e.g., greet(["Ada", "Linus", "Grace"]) returns "Hello, Linus, Grace!" instead of including Ada). | What the user / test sees. |
Flow: Fault → (program execution) → Error → (error reaches the system boundary) → Failure.
A useful question the lecture leaves you with: “How can we prevent this error from becoming a failure?” — assertions and defensive checks are exactly that prevention. The bug you’re about to fix demonstrates this chain end-to-end.
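Here is a minimal sketch of that chain on the greet bug this step uses — illustrative only, not one of the exercise files. The assertion is one example of the prevention the lecture asks about: it stops the error at the function boundary, before it escapes as a failure.

```python
# Illustrative sketch (not an exercise file): the fault -> error -> failure
# chain, plus an assertion that stops the error before it leaves the function.
def greet(names: list[str]) -> str:
    parts: list[str] = ["Hello"]
    for i in range(1, len(names)):  # FAULT: the range starts at 1, skipping index 0
        parts.append(names[i])      # ERROR: parts never receives names[0]
    # Defensive check: the greeting plus every name should be present.
    # It fires here, before the wrong string reaches any caller as a FAILURE.
    assert len(parts) == len(names) + 1, "a name was skipped"
    return ", ".join(parts) + "!"

greet(["Ada", "Linus", "Grace"])  # AssertionError: a name was skipped
```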
📋 Reproducing the bug — what the lecture said about Step 1
The lecture spent extra time on the first phase (“Reproduce the bug”) because everything downstream depends on it. Two pieces to reproduce:
- Problem environment — the setting in which the bug occurs: hardware, OS, settings, runtime dependencies, software versions. Try to re-create it on a different machine.
- Problem history — the steps needed to recreate the failure: the sequence of data inputs, user interactions, communications with other components. Plus timing, randomness, physical influences.
And whenever possible, write an automated bug reproduction test — a test that fails on the bug and passes after the fix. Run it repeatedly during debugging so “did I fix it yet?” is one click, not five minutes of manual reproduction. After the fix, keep the test in the suite for regression testing — re-running existing tests after later code changes to make sure the bug doesn’t sneak back in.
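A bug-reproduction test can be tiny. A sketch for the greet bug — the test name is ours, not part of the tutorial files:

```python
# Sketch of a bug-reproduction test: it fails while the bug exists, passes
# after the fix, then stays in the suite as a regression test.
from greet import greet

def test_reproduces_skipped_first_name() -> None:
    # The smallest input that shows the reported failure.
    assert greet(["Solo"]) == "Hello, Solo!"
```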
In this tutorial the bug reproduction is already automated for you (the failing pytest test is the reproduction). Notice that we never click “I think I fixed it” without re-running the test — that’s the lecture’s discipline in action.
Reference: Andreas Zeller, Why Programs Fail – A Guide to Systematic Debugging (2009).
📂 What you have
Two files: greet.py (production code, has a bug) and test_greet.py (three pytest tests, one of which fails). Don’t run anything yet.
🔍 1. Symptom — predict, then run
Open greet.py. Read it. Predict what each of these returns:
greet(["Ada", "Linus", "Grace"])greet([])greet(["Solo"])
Now click Run. Read the failing assertion — the mismatch is the symptom. State it in your own words.
🧠 2. Predict the state
Before opening the debugger, predict: at the moment the loop body first executes, what should i be? What is names[i] supposed to be? Hold the answer.
🔬 3. Evidence — your first breakpoint
A breakpoint is already set on line 4 (the for line). Click Debug (next to Run). Execution pauses before the marked line runs. The Variables tab shows names. The Watch tab is empty — add i to it (you’ll see <not yet defined> since the loop hasn’t started).
Now click Step Over (F10) once. The loop has started one iteration. Look at i in Watch. Look at names[i]. Compare with your prediction.
🔎 4. Hypothesis (one sentence)
Don’t fix yet. Write your hypothesis as a single sentence — what is wrong and where it lives.
Compare with a sample sentence
*"The loop starts at index 1, so `names[0]` is never appended to `parts`."* Did yours name *which iteration* is wrong and *what consequence* follows? That's the schema.📍 5. Localize
Three candidates: the test, the return, the range(...). Pick the first divergence — the earliest line whose behavior contradicts your hypothesis. Justify in one sentence why the other two are not it.
🩹 6. Minimal fix
Now you may edit. Smallest possible change. Don’t refactor the whole function. Don’t add a special case for empty lists. Just fix the iteration range.
✅ 7. Verify
Click Run. All three tests must pass — the one that was failing AND the two that already passed. Verification means no regressions. Confusing those is the local-verification trap.
def greet(names: list[str]) -> str:
    parts: list[str] = ["Hello"]
    for i in range(1, len(names)):
        parts.append(names[i])
    return ", ".join(parts) + "!"
from greet import greet

def test_three_names_all_appear() -> None:
    assert greet(["Ada", "Linus", "Grace"]) == "Hello, Ada, Linus, Grace!"

def test_empty_list_just_says_hello() -> None:
    assert greet([]) == "Hello!"

def test_single_name_appears() -> None:
    assert greet(["Solo"]) == "Hello, Solo!"
Step 1 — Knowledge Check
Min. score: 80%
1. A teammate says: “I added print(repr(x)) and saw the value had a leading space.”
Which stage of the debugging cycle is this?
Adding instrumentation and observing values is evidence collection (stage 3). The hypothesis comes after you have evidence — and the fix and verification come later still. Naming the stage you’re in helps you avoid skipping straight to fixing.
2. A student fixes their failing test, runs pytest test_failing.py (just that one file) and sees green. They mark the bug fixed and move on. What stage did they skip?
Verification means rerunning the entire test suite — including tests that previously passed. A fix in one place can introduce a regression somewhere else, and that’s exactly the kind of regression a quick “did the failing test go green?” check will miss.
3. A debugger user types len(parts) into the Watch panel during a paused session and sees 2, when they expected 3. Which stage of the cycle is this?
Reading a watched value during a pause is evidence collection. Predict happens upstream (before the run); Localize and Verify happen downstream (after a hypothesis or fix). Naming the stage you’re in is what keeps the cycle from collapsing into tinkering.
4. total(items) returns $5 too high for one user. You discover the discount-loading function reads the wrong database column, so that user’s discount is never applied.
Which is the symptom and which is the cause?
The symptom is what you observe (the wrong total). The cause is the reason it happens (the discount-loading function reading the wrong column). Symptom-patching — e.g., inserting a special if user_id == BAD_USER: total -= 5 check — would make one test green without fixing the underlying bug, and would fail on any other user affected by the same column read.
Debugger Tour
🎯 Goal: Build minimum tool fluency. Each section below pairs a debugging question with the smallest tool move that answers it. There’s no bug to fix — tour.py runs correctly.
Click Debug (not Run) to start each section.
Why this matters & what you'll learn
Tools subordinate to questions, not the other way around. If you learn debugger features as a feature menu, you’ll forget them; if you learn each one as the answer to a specific debugging question, they stick. This step pairs six common questions with the smallest tool move that answers each — on correct code — so when a real bug forces the question, the move is already in your fingers.
You will learn to:
- Apply six debugger moves (breakpoint, hover, watch, conditional breakpoint, call stack, history scrubber) to answer specific questions.
- Analyze which question each tool actually answers — and which it doesn’t.
1. “Where is execution right now?” → Breakpoint
Click the gutter next to line 8 in tour.py (the line total += score). A breakpoint marker appears — that’s the breakpoint you’ll edit later.
Click Debug. Execution pauses before line 8 runs; the debugger reports the current paused line, and sighted users also see an arrow marker in the gutter. The current line is highlighted.
2. “What does this variable hold right now?” → Variables tab + hover
Look at the Variables tab. You’ll see locals like score and total. Each value has a type badge (int, list, dict).
Now hover over score in the editor. A tooltip shows the value. The same trick works on any identifier in the source — no need to dig through the panel.
3. “What value will an expression have at this point?” → Watch
Open the Watch tab. Click ➕ and add total + score. The expression evaluates as if it ran right now. Click Step Over (F10). The value updates.
Watches are how you ask “what would len(items) * factor be at this exact moment?” without editing the program to add a print.
4. “Which iteration first violates an invariant?” → Conditional breakpoint
Right-click the breakpoint marker you placed on line 8 → Edit Breakpoint → enter score < 0 as the condition. Click Continue (F5).
Execution flies through every iteration where score >= 0 and pauses only at the iteration where score < 0 (line 8). That’s the iteration where the invariant first fails.
Without conditional breakpoints, you’d step 9 times through normal iterations to reach the one you care about. With one, the debugger does the filtering.
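If you ever need the same filtering without debugger support, Python’s built-in breakpoint() behind a guard is the code-level stand-in — a sketch, not something tour.py asks you to add:

```python
# Code-level stand-in for a conditional breakpoint (sketch only).
scores = [95.0, 88.0, 92.0, 72.0, 81.0, 78.0, 98.0, 95.0, 91.0, -3.0, 55.0]
total = 0.0
for score in scores:
    if score < 0:     # the same predicate the conditional breakpoint uses
        breakpoint()  # drops into pdb only on the suspicious iteration
    total += score
```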
5. “How did we get here?” → Call Stack
Open the Call Stack tab. You’ll see process_scores → main. Click each frame to inspect that scope’s locals. The stack tells the story of how this line got executed.
For recursive code, the stack is a vertical history of decisions. You’ll use it heavily in Case 1.
6. “What was this variable BEFORE this line ran?” → History scrubber
Drag the History scrubber backward by 5-10 ticks. Watch total rewind in the Variables tab. Drag forward — it advances. The debugger switches from live execution to a rewound history state; sighted users also see the gutter marker change appearance.
This is the time-travel feature. You can move to any moment in the program’s history without restarting. You’ll drill it deliberately in the Backward Tour before Case 3.
🪞 Reflect
Close the editor. From memory, list the six moves. For each, name the debugging question it answers. If you can’t, that move isn’t yet yours — flag it for revisit.
Carry this forward: for any new debugger feature you encounter, name the question it answers. If you can’t, you don’t need it yet.
# Tour program — no bug. Exercise the debugger UI here.
def compute_score(raw: list[int]) -> float:  # extra hover target; not called in main
    return sum(raw) / len(raw)

def process_scores(scores: list[float]) -> float:
    total: float = 0
    for score in scores:
        total += score
    return total / len(scores)

def main() -> float:
    raw: list[tuple[str, list[int]]] = [
        ("Ada", [95, 88, 92]),
        ("Linus", [72, 81, 78]),
        ("Grace", [98, 95, 91]),
        ("Alan", [-3, 55, 70]),  # negative — used by §4
        ("Margaret", [85, 89, 87]),
    ]
    # Flatten every raw score into one stream so the §4 conditional
    # breakpoint (score < 0) has a negative value to pause on.
    scores: list[float] = []
    for _name, raw_scores in raw:
        scores.extend(float(value) for value in raw_scores)
    average = process_scores(scores)
    print(f"average score: {average:.2f}")
    return average

main()
Step 2 — Knowledge Check
Min. score: 80%
1. “I want to know which iteration of a 10,000-item loop is the first one to break the invariant.” Which tool answers it?
Conditional breakpoints filter. The condition runs at every loop pass; the debugger pauses only when it’s true.
2. “I want to inspect what total was 5 lines ago.” Which tool answers it?
Time-travel. The scrubber lets you slide back through any moment in the run without re-executing. (You’ll drill backward localization specifically in the Backward Tour before Case 3.)
3. The tour file’s def enroll(student, students=[]) lights up the ↔ aliasing badge across calls. Why?
Default argument values are evaluated exactly once, at function-definition time. The students=[] creates one list, bound to the function as its default. Every subsequent call that doesn’t override the parameter reuses that same list. Standard fix: def enroll(student, students=None): students = students if students is not None else []. The ↔ badge is the time-travel debugger’s way of pointing at exactly this aliasing — saving you 30 minutes of head-scratching.
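The trap and its standard fix, as a runnable sketch you can paste into a scratch file:

```python
# Mutable default argument: the default list is created once, at def-time.
def enroll(student: str, students: list[str] = []) -> list[str]:
    students.append(student)  # mutates the one shared default list
    return students

print(enroll("Ada"))    # ['Ada']
print(enroll("Linus"))  # ['Ada', 'Linus'] — the "fresh" default was reused

# Standard fix: a None sentinel, and a new list per call.
def enroll_fixed(student: str, students: list[str] | None = None) -> list[str]:
    students = students if students is not None else []
    students.append(student)
    return students

print(enroll_fixed("Ada"))    # ['Ada']
print(enroll_fixed("Linus"))  # ['Linus']
```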
Case 1 — Maze Pathfinder (Boundary Bug)
🎯 Goal: A maze has a valid 10-step path from S to G, but the pathfinder returns None when called with max_steps=10. Find why.
📋 Open debugging_log.md and fill each field as you work. The first time, the log carries you stage by stage. Cases 2 and 3 fade this scaffolding — by Case 3 you’ll name three of the stages yourself. Committing each stage to writing is the difference between thinking the cycle and doing the cycle.
Why this matters & what you'll learn
Boundary bugs — off-by-one in range, slice indices, comparison operators, loop sentinels — are the most common shape of algorithmic bug, and they hide in plain sight because nine of ten test cases pass. This case forces the discipline you just learned (the 7-stage cycle) onto a recursive boundary bug, so the cycle has to handle a real call stack before you internalize it.
You will learn to:
- Apply the full 7-stage cycle to a recursive boundary bug, writing each stage in the debugging log.
- Analyze recursive execution by walking the Call Stack tab to read frame-by-frame state.
- Evaluate which of two adjacent if checks is the first divergence between intended and actual behavior.
📂 What you have
A small delivery robot has a battery measured in grid steps. find_path(maze, max_steps) should return a path if one exists using at most max_steps moves, otherwise None.
Three pytest tests in test_pathfinder.py:
- test_tiny_maze_found_with_extra_budget — passes.
- test_path_rejected_when_battery_too_small — passes (max_steps=9, no 9-step path).
- test_path_found_when_battery_limit_is_exact — fails (max_steps=10, but a 10-step path exists).
1. Symptom — run and read
Click Run. Read the failing assertion. State the symptom in one sentence: expected what / got what.
2. Predict before debugging
Open pathfinder.py. Read _dfs carefully — especially the two checks at the top of the function:
if steps_used >= max_steps:
    return None
if current == goal:
    return path.copy()
Predict: at the moment a recursive call has just stepped onto the goal cell using exactly the budget, what are steps_used and max_steps? Which of the two checks above runs first? What does it return?
3. Set evidence — breakpoint and watches
Set a breakpoint at the top of _dfs (the steps_used = len(path) - 1 line). In the Watch tab, add at least the values your prediction depends on. Add more if you want orientation (e.g., current, goal, current == goal).
4. Drive
Click Debug. Continue (F5) advances to each next pause — repeat until current == goal is True in the Watch tab. Don’t fix yet.
As recursion deepens, the Call Stack tab grows. Click any frame to see that level’s locals — this is how you read recursion in a debugger.
5. Compare prediction to observation
When current == goal is True in the Watch tab, look at steps_used and max_steps.
- What did you predict steps_used would be at the moment the goal cell is reached?
- What does the debugger show?
- If they differ, complete this sentence before continuing: “My model assumed ___, but the code computes steps_used as len(path) - 1, which means ___.”
⚠️ Click only AFTER you've written your prediction — what the comparison typically reveals
Most students predict `steps_used = 9` (the nine moves *leading to* the goal). The actual value is `10` — because the goal cell has already been appended to `path` before this recursive call starts, so `len(path) - 1` counts the goal cell itself as a step. If your prediction was wrong, that gap is the heart of the bug.
Which conditional fires first when _dfs runs on this call — the cutoff or the goal check?
That is the first divergence between intended behavior (“we reached the goal, return the path”) and actual behavior (“we hit the budget, return None”).
6. Hypothesis
Write your one-sentence hypothesis. Format: *"\<which check\> \<does what\> \<when\>."*
⚠️ Click only AFTER you've written your hypothesis — compare with a sample sentence
*"The cutoff check rejects exact-budget arrivals before the goal check can accept them."* Did yours name the *check* and the *timing*? If so, you have the schema for a debugging hypothesis: a specific code element doing the wrong thing at a specific moment.7. Minimal fix
Edit _dfs so the goal check runs before the cutoff check.
🪞 Reflect — before you verify
Bug family: Off-by-one boundaries hide in range, slice indices, comparison operators, loop sentinels, array bounds. Name one place in your own code where this exact shape could appear.
Cycle stage: Which stage was hardest on this case — Predict, Evidence, or Hypothesis? Name it.
If it was Predict: recursive code is hard to predict because you’d need to mentally simulate the whole call stack. The debugger’s Call Stack tab is built for exactly that gap.
If it was Hypothesis: the schema that helped was “which check does what when.” That schema transfers to every boundary bug you’ll meet.
8. Verify
Click Run. All three tests must pass — including test_path_rejected_when_battery_too_small. If that one breaks, your fix is too aggressive.
# Mazes used by the pathfinder case.
# Shortest valid path from S to G is exactly 10 steps.
BATTERY_LIMIT_MAZE: list[str] = [
    "#########",
    "#S..#..G#",
    "#.#.#.#.#",
    "#.#...#.#",
    "#.#####.#",
    "#.......#",
    "#########",
]

# Sanity maze whose shortest path is 2 steps.
TINY_MAZE: list[str] = [
    "#####",
    "#S.G#",
    "#####",
]
"""Depth-first maze pathfinder."""
from collections.abc import Iterator
Position = tuple[int, int]
Maze = list[str]
def find_marker(maze: Maze, marker: str) -> Position:
for row_index, row in enumerate(maze):
col_index = row.find(marker)
if col_index != -1:
return row_index, col_index
raise ValueError(f"marker {marker!r} not found")
def is_open(maze: Maze, position: Position) -> bool:
row, col = position
return maze[row][col] != "#"
def neighbors(maze: Maze, position: Position) -> Iterator[Position]:
"""Yield neighbors in a deterministic order so traces are repeatable."""
row, col = position
for next_position in [
(row, col + 1), # east
(row + 1, col), # south
(row, col - 1), # west
(row - 1, col), # north
]:
if is_open(maze, next_position):
yield next_position
def find_path(maze: Maze, max_steps: int) -> list[Position] | None:
"""Return a path from S to G using at most max_steps moves.
A path includes both the start and goal positions, so:
steps_used == len(path) - 1
"""
start = find_marker(maze, "S")
goal = find_marker(maze, "G")
return _dfs(
maze=maze,
current=start,
goal=goal,
max_steps=max_steps,
path=[start],
seen={start},
)
def _dfs(
maze: Maze,
current: Position,
goal: Position,
max_steps: int,
path: list[Position],
seen: set[Position],
) -> list[Position] | None:
steps_used = len(path) - 1
# Stop searching when the path has used the available battery budget.
if steps_used >= max_steps:
return None
if current == goal:
return path.copy()
for next_position in neighbors(maze, current):
if next_position in seen:
continue
seen.add(next_position)
path.append(next_position)
result = _dfs(maze, next_position, goal, max_steps, path, seen)
if result is not None:
return result
path.pop()
seen.remove(next_position)
return None
from maze_data import BATTERY_LIMIT_MAZE, TINY_MAZE
from pathfinder import find_path

def test_tiny_maze_found_with_extra_budget() -> None:
    path = find_path(TINY_MAZE, max_steps=3)
    assert path is not None
    assert len(path) - 1 == 2

def test_path_rejected_when_battery_too_small() -> None:
    path = find_path(BATTERY_LIMIT_MAZE, max_steps=9)
    assert path is None

def test_path_found_when_battery_limit_is_exact() -> None:
    path = find_path(BATTERY_LIMIT_MAZE, max_steps=10)
    assert path is not None, "A 10-step path exists and should be accepted."
    assert len(path) - 1 == 10
# Debugging log — Case 1 (Maze Pathfinder)
The 7 stages match the cycle from Step 1. Fill each field as you work.
1. **Symptom** — one sentence, expected vs actual: _..._
2. **Predict** — at the moment a recursive call has just stepped onto the goal cell on an exact-budget run, what should `steps_used` and `max_steps` be? Which of the two early checks should fire? _..._
3. **Evidence** — which tool you used, what cue you were watching, what value you actually observed when paused on the goal cell: _..._
4. **Hypothesis** — one sentence; name the *check* and the *timing* (format: *"\<which check\> \<does what\> \<when\>."*): _..._
5. **Localize** — which line is the first divergence between intended and actual behavior, and one sentence on why each of the other candidates is *not* it: _..._
6. **Fix** — file, line, the minimal change: _..._
7. **Verify** — `pytest` exit code, which tests pass; any regressions in the under-budget rejection case? _..._
Step 3 — Knowledge Check
Min. score: 80%
1. Which of these would be a root-cause fix for this bug, as opposed to a workaround?
The root cause is the order of the two early checks in _dfs. Reordering them is a one-line, minimal change that addresses the cause directly. Every other option here is a workaround: it makes the symptom disappear without fixing the underlying logic.
2. A student fixes _dfs by loosening the cutoff to steps_used > max_steps instead of swapping the check order. The test_path_found_when_battery_limit_is_exact test now passes. Is this a correct fix?
The root-cause fix is the check ordering — goal first, cutoff second — not a looser comparator. Changing >= to > compensates for the misordered checks instead of fixing them: the cutoff no longer means what its comment says (“stop once the budget is spent”), and the search now recurses one level past the budget before pruning. Whether a comparator tweak like this quietly broke the under-budget rejection case is exactly the question the newly-green test alone cannot answer — which is why Verify means rerunning the whole suite.
3. True or false: Once you’ve fixed the boundary bug in _dfs, you can verify the fix is correct by rerunning only test_path_found_when_battery_limit_is_exact (the previously failing test).
Verification means rerunning the whole suite. Specifically: after the goal-first fix, test_path_rejected_when_battery_too_small (max_steps=9) must still pass. If you accidentally over-loosen the cutoff, this test will catch you — but only if you rerun it.
Case 2 — Ledger Reconciliation (Data Representation Bug)
🎯 Goal: A campus debit-card system imports 30 transactions and one account is $36.00 wrong at month end. The technique you’ve used so far (single breakpoint + step) would force you to step through every transaction. Don’t.
📋 Keep filling debugging_log.md. Fields are now name-only — refer to Case 1’s log if you need the per-stage prompts. Writing forces commitment; commitment is what makes the cycle yours.
Why this matters & what you'll learn
Data-representation bugs — hidden whitespace, mixed encodings, silent type coercions — are a different family from algorithmic bugs. The algorithm is correct; the data is carrying something invisible. The forward-stepping technique you used in Case 1 doesn’t scale to 30 transactions, and your eyes won’t catch a leading space. This case introduces two new moves (conditional breakpoints, repr()) that are nearly free once you know to reach for them.
You will learn to:
- Apply conditional breakpoints to filter a long input stream down to the suspicious case.
- Analyze a value with repr() to surface invisible characters that print() hides.
- Evaluate where a normalization fix belongs — at the load boundary, not at the consumer.
🔀 Before you start: Case 1 had a bug you could trace by reading two if checks in one function. Is that true here? Spend 30 seconds predicting: what kind of thing is wrong, and what will the evidence-collection move look like?
The contrast — read after you've tried step 3
Case 1 was *algorithmic* — the data was correct; one check was in the wrong place. This is a *data-representation* bug — the algorithm is correct; the data carries something invisible. Different family, different first move: you don't step through logic looking for a wrong branch; you inspect the data itself to find what it's hiding.
📂 What you have
- ledger.py — loads transactions from a CSV and applies them to account balances.
- transactions.csv — 30 rows of test data.
- test_ledger.py — two pytest tests, both failing.
Read both failures carefully.
1. Symptom — and a clue
Click Run. Two tests fail:
- test_month_end_balances — ACCT-202 is wrong by $36.00.
- test_transaction_types_are_valid_after_loading — the loaded transaction kinds set contains an unexpected value.
The second failure is a clue, not a separate bug. Look at the assertion message — what kind appears that shouldn’t?
2. Predict before debugging
You could step through 30 transactions to find the wrong one. Don’t. That’s exactly the kind of work the debugger is supposed to save you. Predict instead: of the 30 transactions, which one(s) belong to ACCT-202? (You can scan transactions.csv if you want — but only briefly.)
3. Stop only on the suspicious account — conditional breakpoint
Set a breakpoint at the start of apply_transaction (the before = balances.get(...) line). Right-click that breakpoint marker → Edit Breakpoint → enter a condition that pauses only for the suspicious account. What predicate on tx discriminates ACCT-202 from the other accounts?
Predicate answer
`tx.account == "ACCT-202"`
Click Debug. The debugger flies past every transaction for other accounts and pauses only on the rows for ACCT-202. Use Continue to move from one ACCT-202 row to the next.
4. Look closely
For each pause, inspect:
- tx.id
- tx.kind
- repr(tx.kind) ← the secret weapon
Add repr(tx.kind) to your Watch tab so it shows on every pause. Across the ACCT-202 pauses, what does repr show that you wouldn’t notice otherwise?
5. Compare prediction to observation
Across the ACCT-202 pauses, look at repr(tx.kind) in your Watch tab.
- What did you predict tx.kind would be for transaction T011?
- What does repr() show that print() would have hidden?
- Complete this sentence: “My model assumed the value was ___, but repr shows ___ because ___.”
What the comparison reveals
Most students predict `tx.kind == 'REVERSAL'`. The `repr()` output shows `' REVERSAL'` — the outer quotes make the leading space unmistakable. `print()` would have shown ` REVERSAL` with no delimiters, where the space blends invisibly into the line. The gap between prediction and observation is the bug's fingerprint.
6. Where is the divergence?
Once you’ve spotted the malformed transaction, ask: where in the code is the bug? Is it in apply_transaction (which decides DEPOSIT vs WITHDRAWAL etc.)? Or earlier, in how the row got loaded into a Transaction object?
7. Hypothesis
Write your one-sentence hypothesis before expanding. Name the layer (loading vs processing) and what’s wrong with the data.
Compare with a sample sentence
*"The kind field arrives from the CSV with hidden whitespace. `load_transactions` doesn't normalize it, so it falls through to the unknown-kind branch in `apply_transaction` and gets treated as a withdrawal."* A clean hypothesis names *where* the bug enters (the loader) and *why* the symptom appears far from the cause (the if/elif cascade silently misses).8. Minimal fix
One change in load_transactions on the kind=row["type"].upper() line. Resist the temptation to:
- Patch the final balance.
- Edit the CSV.
- Change the reversal arithmetic in apply_transaction.
- Delete the unknown-kind fallback.
The right fix is the smallest change in the right place.
🪞 Reflect — before you verify
Bug family: Hidden-character bugs hide in CSV imports, copy-pasted strings, JSON keys, environment variables, log lines, command-line args. Name one place where repr() would surface something print() hides.
What repr() changed: Did it change the Evidence step for you (you saw the space you wouldn’t have seen), the Localize step (it told you exactly which field), or both? Write one sentence explaining why print() would have missed it.
9. Verify
Click Run. Both tests must turn green. The arithmetic in apply_transaction is unchanged; only the loading code was wrong.
"""Ledger reconciliation — applies CSV transactions to running balances."""
import csv
import logging
from dataclasses import dataclass
from decimal import Decimal
logger = logging.getLogger(__name__)
VALID_KINDS: set[str] = {"DEPOSIT", "WITHDRAWAL", "REFUND", "REVERSAL", "FEE"}
@dataclass(frozen=True)
class Transaction:
id: str
account: str
kind: str
amount_cents: int
def parse_money(text: str) -> int:
"""Convert a dollars-and-cents string to integer cents."""
return int(Decimal(text) * 100)
def load_transactions(path: str) -> list[Transaction]:
transactions: list[Transaction] = []
with open(path, newline="", encoding="utf-8") as csv_file:
reader = csv.DictReader(csv_file)
for row in reader:
transactions.append(
Transaction(
id=row["id"],
account=row["account"],
kind=row["type"].upper(),
amount_cents=parse_money(row["amount"]),
)
)
return transactions
def apply_transaction(balances: dict[str, int], tx: Transaction) -> None:
before = balances.get(tx.account, 0)
if tx.kind == "DEPOSIT":
after = before + tx.amount_cents
elif tx.kind == "WITHDRAWAL":
after = before - tx.amount_cents
elif tx.kind == "FEE":
after = before - tx.amount_cents
elif tx.kind == "REFUND":
after = before + tx.amount_cents
elif tx.kind == "REVERSAL":
after = before + tx.amount_cents
else:
# Realistic but dangerous legacy behavior: old exports used blank
# types for card charges, so unknown types are treated as
# withdrawals.
after = before - tx.amount_cents
balances[tx.account] = after
def reconcile(transactions: list[Transaction]) -> dict[str, int]:
balances: dict[str, int] = {}
for tx in transactions:
apply_transaction(balances, tx)
return balances
id,account,type,amount
T001,ACCT-100,DEPOSIT,200.00
T002,ACCT-100,WITHDRAWAL,45.25
T003,ACCT-100,FEE,2.50
T004,ACCT-100,REFUND,10.00
T005,ACCT-101,DEPOSIT,125.00
T006,ACCT-101,WITHDRAWAL,19.99
T007,ACCT-101,WITHDRAWAL,8.50
T008,ACCT-101,REFUND,8.50
T009,ACCT-202,DEPOSIT,80.00
T010,ACCT-202,WITHDRAWAL,18.00
T011,ACCT-202, REVERSAL,18.00
T012,ACCT-303,DEPOSIT,300.00
T013,ACCT-303,FEE,7.50
T014,ACCT-303,WITHDRAWAL,22.00
T015,ACCT-303,REFUND,3.25
T016,ACCT-100,WITHDRAWAL,16.00
T017,ACCT-101,FEE,2.50
T018,ACCT-202,WITHDRAWAL,7.25
T019,ACCT-303,WITHDRAWAL,41.99
T020,ACCT-100,REFUND,1.25
T021,ACCT-101,DEPOSIT,40.00
T022,ACCT-202,FEE,1.75
T023,ACCT-303,FEE,2.50
T024,ACCT-100,FEE,2.50
T025,ACCT-101,WITHDRAWAL,12.00
T026,ACCT-202,DEPOSIT,5.00
T027,ACCT-303,REFUND,10.00
T028,ACCT-100,WITHDRAWAL,30.00
T029,ACCT-101,REFUND,4.00
T030,ACCT-202,WITHDRAWAL,3.00
from ledger import load_transactions, reconcile

def test_month_end_balances() -> None:
    transactions = load_transactions("/tutorial/transactions.csv")
    balances = reconcile(transactions)
    assert balances == {
        "ACCT-100": 11500,
        "ACCT-101": 13451,
        "ACCT-202": 7300,
        "ACCT-303": 23926,
    }

def test_transaction_types_are_valid_after_loading() -> None:
    transactions = load_transactions("/tutorial/transactions.csv")
    kinds = {tx.kind for tx in transactions}
    assert kinds <= {"DEPOSIT", "WITHDRAWAL", "REFUND", "REVERSAL", "FEE"}, \
        f"unexpected transaction kind(s) loaded: {kinds}"
# Debugging log — Case 2 (Ledger Reconciliation)
Same 7-stage form, names only. If you're stuck on what a stage demands, reread Case 1's log.
1. **Symptom**: _..._
2. **Predict**: _..._
3. **Evidence**: _..._
4. **Hypothesis**: _..._
5. **Localize**: _..._
6. **Fix**: _..._
7. **Verify**: _..._
Step 4 — Knowledge Check
Min. score: 80%
1. Which of these is the root-cause fix?
The bug is that the CSV row had a leading space, so kind became ' REVERSAL' instead of 'REVERSAL'. The fix belongs in load_transactions because that’s where data flows from external (untrusted) format into internal representation. Strip-and-validate at the boundary, then trust the data inside.
2. Why is repr(tx.kind) more useful than print(tx.kind) when investigating this bug?
repr('REVERSAL') returns "'REVERSAL'" — including the surrounding quotes — while repr(' REVERSAL') returns "' REVERSAL'". The leading space jumps out because repr() shows the string as a Python literal, with quotes around its contents. print() displays the string’s content without delimiters, so leading and trailing whitespace becomes invisible. This is the canonical Python trick for spotting whitespace bugs.
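You can see the difference in a few lines of a scratch session — a sketch using the value from this case:

```python
# repr() vs print() on the malformed value from transactions.csv row T011.
kind = " REVERSAL"
print(kind)        # prints: REVERSAL   (the leading space blends into the line)
print(repr(kind))  # prints: ' REVERSAL' (the quotes expose the space)
print(kind == "REVERSAL")                  # False — why the elif cascade misses
print(kind.strip().upper() == "REVERSAL")  # True once normalized at the boundary
```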
3. You have a 30-iteration loop where one specific iteration produces a wrong result. Which technique most efficiently locates the bad iteration?
Conditional breakpoints scale. They turn the debugger into a filter: only stop when this expression is true. The cost is the same regardless of whether the loop has 30 or 30,000 iterations. This is one of the highest-leverage debugger features and the reason “set a conditional breakpoint” is one of the first moves an experienced debugger reaches for in long-running data-processing code.
Backward Tour — Time-Travel Drill
🎯 Goal: Drill the backward moves. Stepping forward through code is the default; rewinding from a final state to find when something first changed is a different motor pattern. There’s no bug — counter.py runs correctly.
Click Debug to start.
Why this matters & what you'll learn
Stepping forward is the default; rewinding from a known-wrong final state to find when it first appeared is a separate motor pattern that takes deliberate practice. Case 3 will demand exactly this move on a real bug — but learning the move during the bug hunt mixes two hard things at once. Drilling the four scrubber moves on correct code now isolates the skill so Case 3 can focus on the bug, not the tool.
You will learn to:
- Apply the four scrubber moves: anchor, single-tick rewind, jump-to-tick, scrub-until-predicate.
- Analyze a recorded execution history by reading the Variables tab as you scrub.
- Evaluate when backward localization beats forward stepping (symptom-far-from-cause bugs).
1. “What was the final state?” → Run to completion, then anchor
Click Debug without setting any breakpoints. The program runs to completion. The debugger pauses at the last line.
In the Variables tab, expand state. Note count and the length of history. This is your anchor — every move below is relative to this final state. Anchoring on a known wrong final state is exactly what Case 3 will ask of you.
2. “Rewind one event” → Scrub backward by one tick
Drag the History scrubber backward by one tick. Watch count change in the Variables tab. The arrow gutter turns gray when you’re rewound — you’re not at “live” execution anymore.
Verify: count should now equal what it was just before the last event. Cross-check against history[-2].
3. “What was count after exactly N events?” → Scrub to a specific moment
Scrub backward until len(state["history"]) shows 3. Read state["count"]. That’s the value after exactly 3 events were applied.
Predict before scrubbing further: what was count after exactly 5 events? Now scrub to len == 5 and verify against your prediction.
4. “When did count first go negative?” → Anchor + walk backward to first divergence
Look at history — each entry is (event, count_after). Scan for the first negative second element. That moment is where count first turned negative.
Now use the scrubber to visit that moment: drag backward until state["count"] first shows a negative value. This is the localization move you’ll use in Case 3 — anchoring on a known state, rewinding to the first moment that state appeared.
5. “What was count immediately before the reset event?” → Predicate-driven scrub
The simulator includes a reset event that zeros count. Find the entry ("reset", 0) in history. Scrub to one tick before that reset fired. What was count?
6. “Forward again to live” → Scrub all the way forward
Drag the scrubber all the way to the right. The arrow gutter returns to its normal color — you’re back at “live” execution. Edits will run from this point if you make any.
🪞 Reflect
From memory, name the four scrubber moves:
- Run to end, inspect the anchor state
- Scrub backward one tick (per-event rewind)
- Scrub to a specific tick (jump by a marker like len(history) == N)
- Scrub backward until a predicate first holds — this is the move for Case 3
The shape is always: anchor on a known state, walk backward to find when it first appeared.
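The scrubber automates a scan you could run by hand if the recorded history were handed to you as data. A sketch over the exact history that counter.py (below) records:

```python
# Anchor-and-locate by hand: with the whole recorded history as data, the
# first tick where the predicate holds is the first divergence. The scrubber
# reaches the same tick by walking backward from the anchor state.
history = [
    ("inc", 2), ("double", 4), ("neg", -4), ("double", -8),
    ("inc", -7), ("reset", 0), ("inc", 1), ("inc", 2),
]

first_negative = next(
    tick for tick, (_event, count_after) in enumerate(history) if count_after < 0
)
print(first_negative, history[first_negative])  # 2 ('neg', -4)
```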
# Backward Tour — no bug. Exercise the history scrubber.
#
# A tiny event-driven counter. Each event modifies `count`.
# `history` records (event_name, count_after_event) for every step.
from typing import Any

CounterState = dict[str, Any]

def apply_event(state: CounterState, event: str) -> None:
    if event == "inc":
        state["count"] += 1
    elif event == "dec":
        state["count"] -= 1
    elif event == "double":
        state["count"] *= 2
    elif event == "neg":
        state["count"] = -state["count"]
    elif event == "reset":
        state["count"] = 0
    else:
        raise ValueError(f"unknown event {event!r}")
    state["history"].append((event, state["count"]))

def main() -> CounterState:
    state: CounterState = {"count": 1, "history": []}
    events: list[str] = ["inc", "double", "neg", "double", "inc", "reset", "inc", "inc"]
    for event in events:
        apply_event(state, event)
    return state

main()
Step 5 — Knowledge Check
Min. score: 80%
1. “I want to find the first event in a 50-event stream that produced a wrong state.” Which scrubber move fits best?
Anchor on the wrong final state, scrub backward until it matches the spec. The first tick where the state is correct again is the one immediately before the bug fired. This is the canonical backward-localization move.
2. “What was count after exactly 4 events?” — which scrubber move answers this?
Scrub to a specific tick by reading a marker (here, len(history)). Pick a state property that monotonically increases (event count, log length, step number) so each tick is identifiable from the Variables tab.
3. After scrubbing backward, the arrow gutter turns gray. What does that mean?
Gray = rewound. You’re inspecting a recorded past state. Drag the scrubber all the way to the right to return to live execution. This visual cue prevents the confusion of “why isn’t my edit running?” — the answer is always “scrub forward first, then run.”
Case 3 — Course Waitlist (Temporal Bug)
🎯 Goal: A course-registration simulator processes 9 events and ends in a wrong state. The visible symptom appears several events after the event that caused it. Find the first bad state transition, not just the final wrong state.
📋 debugging_log.md — three stages are now unlabeled. Name them yourself before filling them in. Naming the stage you’re in is the move that keeps the cycle from collapsing into tinkering.
Why this matters & what you'll learn
Some bugs separate cause from symptom in time: a wrong decision happens early, the visible failure appears events later, and stepping forward forces you to inspect correct state for ages before anything looks wrong. This is what the time-travel debugger is built for — anchor on the wrong final state and rewind to the first divergence. Case 3 demands the backward-localization move you drilled in Step 5, on a real bug where forward stepping would waste the most attention.
You will learn to:
- Apply the anchor-and-rewind technique to find the first wrong state transition in an event stream.
- Analyze a temporal bug whose symptom appears events after the cause.
- Evaluate two correct fixes (pop(0) vs deque.popleft()) on intent, cost, and disruption.
🔀 Before you start: In Cases 1 and 2, you could find the bug by reaching one specific line with a breakpoint. Will that work here? Spend 30 seconds predicting: what kind of thing might be wrong, and will a single well-placed breakpoint be enough to find it?
The contrast — read after step 3
Cases 1–2 were *spatial* — the bug lives at a specific line you can reach with a breakpoint. This one is *temporal* — the cause and the symptom are separated by time. The wrong state is visible at the end, but the wrong decision happened much earlier. The new move is the history scrubber: run to the wrong final state, then rewind to find the first moment things went wrong.
📂 What you have
waitlist.py simulates two courses (CS201, MATH220) with sample events: students join waitlists, students drop, freed seats get allocated. The stated policy is FIFO: the first student to join a full course’s waitlist should be the first admitted when a seat opens.
test_waitlist.py has two tests, one failing:
- test_cs201_waitlist_is_fifo — fails: enrolled list is wrong.
- test_math220_single_waitlisted_student_gets_open_seat — passes (only one waitlisted student, so FIFO/LIFO is indistinguishable).
1. Symptom — read the failure carefully
Click Run. The failing assertion shows expected vs actual enrollment lists. Note the difference — you’ll need it in step 3.
2. Strategy — which direction would you start?
Would you step forward from event 1, watching state change after each event? Or would you let the program finish, then work backward from the known wrong final state?
Which direction is faster here — and why?
Backward. Events 1–3 produce no observable symptom. Starting forward means inspecting correct state for several events before anything looks wrong. Anchoring on the known wrong final state and scrubbing backward walks directly to the first divergence — you stop the moment something changes from wrong to right.
Click Debug without setting any breakpoints. Let the program run to completion. The debugger will be at the end of execution.
Now, in the Variables tab, expand state then 'CS201' then enrolled and waitlist. Observe their final (wrong) values.
3. Scrub backward through history
Drag the History scrubber backward, slowly, while watching the Variables tab. You’ll see enrolled and waitlist change as you rewind through events.
Scrub one event at a time. At each event, ask one question: “Did the front of the waitlist just get admitted?” Stop at the first event where the answer is no.
4. Now narrow to a line
Once you’ve identified that event, scrub forward to it. Set a breakpoint inside allocate_next — the function responsible for moving students from the waitlist into enrolled seats.
Click Continue (or restart with Debug if needed) until execution pauses there for the right event.
5. Compare prediction to observation
Before you step over the pop() line, add these to the Watch tab:
- course.waitlist[0] — the student at the front
- course.waitlist[-1] — the student at the back
Predict: given FIFO policy, which end should pop() remove from — front or back?
Now Step Over the pop() line. Add next_student to Watch (it now has a value). Compare: which end of the waitlist did pop() actually take from?
What the comparison reveals
`pop()` with no argument removes the *last* element (index `-1`). FIFO policy requires removing the *first* element. If your prediction was "front", your model was right — and the code was wrong. If you predicted "back", you may have assumed `pop()` defaults to front. That's the key gap: Python's list is a stack by default, not a queue.
6. Hypothesis
Write your one-sentence hypothesis. Name the operation and the spec it violates.
Compare with a sample sentence
*"`list.pop()` removes the LAST element. The spec says FIFO — the FIRST element should be admitted first."* The hypothesis pins the bug to a *single library call's behavior* rather than to the surrounding orchestration. That precision is what makes the fix one character.7. Minimal fix — and a judgment call
Two correct fixes exist. Pick one and justify in one sentence (write your reasoning as a comment at the top of allocate_next):
- course.waitlist.pop(0) — one-character change, list stays a list.
- Convert waitlist to collections.deque and use popleft() — bigger diff, but the type says “queue”.
Criteria to weigh: communicates intent / asymptotic cost / disruption to surrounding code. There’s no single right answer; the justified choice is what matters.
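For reference while you weigh them, both candidate fixes in miniature — sketches, not a recommendation:

```python
from collections import deque

# Fix A — minimal diff: pop(0) removes the FRONT of a plain list.
# O(n) per admission (the remaining elements shift left), fine at classroom sizes.
waitlist_a = ["Mina Patel", "Theo Rios", "Jules Kim"]
print(waitlist_a.pop(0))  # 'Mina Patel'

# Fix B — bigger diff, clearer intent: the deque type says "queue",
# and popleft() is O(1).
waitlist_b = deque(["Mina Patel", "Theo Rios", "Jules Kim"])
print(waitlist_b.popleft())  # 'Mina Patel'
```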
🪞 Reflect — before you verify
Bug family: Symptom-far-from-cause bugs hide in caches that go stale events ago, message queues processed out of order, undo/redo stacks, optimistic UI updates. Name one place where the wrong final state would have been easier to find by stepping backward than forward.
Did you try stepping forward first? If so, at what point did you decide to switch direction? That decision point is worth naming — it’s the diagnostic cue that says “this is a temporal bug.”
8. Verify
Click Run. Both waitlist tests must pass.
"""Course waitlist simulator with a deliberately seeded ordering bug."""
from dataclasses import dataclass, field
@dataclass
class CourseState:
capacity: int
enrolled: list[str] = field(default_factory=list)
waitlist: list[str] = field(default_factory=list)
@property
def open_seats(self) -> int:
return self.capacity - len(self.enrolled)
@dataclass(frozen=True)
class Event:
step: int
kind: str
course: str
student: str | None = None
def initial_state() -> dict[str, CourseState]:
return {
"CS201": CourseState(capacity=2, enrolled=["Ava Chen", "Ben Ortiz"]),
"MATH220": CourseState(capacity=1, enrolled=["Iris Long"]),
}
def sample_events() -> list[Event]:
"""Reproducible event stream.
CS201 policy: students should be admitted from the waitlist in FIFO order.
"""
return [
Event(1, "join_waitlist", "CS201", "Mina Patel"),
Event(2, "join_waitlist", "CS201", "Theo Rios"),
Event(3, "join_waitlist", "CS201", "Jules Kim"),
Event(4, "drop", "CS201", "Ben Ortiz"),
Event(5, "join_waitlist", "MATH220", "Noor Ali"),
Event(6, "join_waitlist", "CS201", "Kai Morgan"),
Event(7, "drop", "MATH220", "Iris Long"),
Event(8, "drop", "CS201", "Ava Chen"),
Event(9, "join_waitlist", "CS201", "Sam Lee"),
]
def apply_event(state: dict[str, CourseState], event: Event) -> None:
course = state[event.course]
if event.kind == "join_waitlist":
_handle_join(course, event.student)
elif event.kind == "drop":
_handle_drop(event.course, course, event.student)
else:
raise ValueError(f"unknown event kind {event.kind!r}")
def _handle_join(course: CourseState, student: str | None) -> None:
if student in course.enrolled or student in course.waitlist:
raise ValueError(f"duplicate student in course state: {student}")
if course.open_seats > 0:
course.enrolled.append(student)
else:
course.waitlist.append(student)
def _handle_drop(course_name: str, course: CourseState, student: str | None) -> None:
if student in course.enrolled:
course.enrolled.remove(student)
allocate_next(course_name, course)
elif student in course.waitlist:
course.waitlist.remove(student)
def allocate_next(course_name: str, course: CourseState) -> None:
"""Fill open seats from the waitlist."""
while course.open_seats > 0 and course.waitlist:
next_student = course.waitlist.pop()
course.enrolled.append(next_student)
def run_events(
events: list[Event] | None = None,
state: dict[str, CourseState] | None = None,
) -> dict[str, CourseState]:
if state is None:
state = initial_state()
if events is None:
events = sample_events()
for event in events:
apply_event(state, event)
return state
from waitlist import run_events

def test_cs201_waitlist_is_fifo() -> None:
    state = run_events()
    cs201 = state["CS201"]
    assert cs201.enrolled == ["Mina Patel", "Theo Rios"]
    assert cs201.waitlist == ["Jules Kim", "Kai Morgan", "Sam Lee"]

def test_math220_single_waitlisted_student_gets_open_seat() -> None:
    state = run_events()
    math220 = state["MATH220"]
    assert math220.enrolled == ["Noor Ali"]
    assert math220.waitlist == []
# Debugging log — Case 3 (Course Waitlist)
Stages 1, 2, 6, 7 are labeled. Stages 3-5 are not — *name the stage yourself*, then fill in the content.
1. **Symptom** (one sentence — expected vs actual): _..._
2. **Predict** (which end of the waitlist should `pop()` remove from, given FIFO?): _..._
3. **[Stage name?]**: _..._
4. **[Stage name?]**: _..._
5. **[Stage name?]**: _..._
6. **Fix**: _..._
7. **Verify**: _..._
<details><summary>Field labels 3-5 (open only after you've named them yourself)</summary>
3. Evidence
4. Hypothesis
5. Localize
</details>
Step 6 — Knowledge Check
Min. score: 80%
1. For a Python list xs = ['a', 'b', 'c', 'd'], what does xs.pop() return, and what is xs afterward?
list.pop() with no argument removes and returns the last element. This is LIFO (stack) behavior. For FIFO (queue) behavior, use pop(0) (or collections.deque.popleft() for O(1) performance).
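Worth confirming in a REPL once — the asymmetry is the whole bug family:

```python
xs = ["a", "b", "c", "d"]
print(xs.pop())   # 'd' — no argument removes the LAST element (stack / LIFO)
print(xs)         # ['a', 'b', 'c']
print(xs.pop(0))  # 'a' — index 0 removes the FIRST element (queue / FIFO), O(n)
print(xs)         # ['b', 'c']
```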
2. Which of these is the correct fix to enforce FIFO admission policy?
The bug is in how a student is removed from the waitlist, not in any of the data. pop() removes from the back; pop(0) removes from the front. FIFO requires removing from the front.
3. You discover the symptom (CS201 enrolls the wrong students) at the end of the program, but the cause is in event 4 (drop Ben Ortiz, which triggers allocate_next). Which technique most directly localizes the bug?
Back-in-time / history-scrubbing is built for exactly this bug shape. When the symptom appears later than the cause, scrubbing backward from the symptom — instead of stepping forward from the start — directly walks you to the divergence point. Forward stepping spends time on events that produced no observable change.
4. (Bonus — code communication.) Which choice best communicates that a list is being used as a FIFO queue?
collections.deque.popleft() is the idiomatic, readable choice. It tells the next reader: this is a FIFO queue. list.pop(0) works but doesn’t communicate intent (and is O(n) for large lists). For a debugging tutorial, the takeaway is broader: fixes that document intent are easier to get right and easier to maintain than fixes that merely produce the right output.
Triage Drill — Pick the Right Technique
🎯 Goal: Match each scenario to the right first move. The point isn’t speed; it’s discriminating between bug families.
Try the drill from memory. Pass threshold: 85%. After the quiz, you’ll see a recap of the cue→technique mapping for spaced retrieval next time.
Why this matters & what you'll learn
Knowing six debugger moves doesn’t help if you reach for the wrong one first. Real bugs arrive without labels; the skill that separates a competent debugger from a thrashing one is reading the cue in a bug description and picking the right first move. This step interleaves the three bug families you’ve practiced so the discrimination is forced — and adds two ubiquitous moves the lecture covered (rubber duck, post-fix documentation) so they’re in the toolkit.
You will learn to:
- Analyze a bug description and discriminate which family (boundary, data, temporal) it belongs to.
- Evaluate which technique fits each cue — and articulate why neighboring techniques don’t.
- Apply rubber-duck debugging and post-fix documentation as standard moves in your workflow.
🦆 Two debugging moves the lecture covered that you haven’t drilled yet
Before the quiz, lock these in. They’re cheap, ubiquitous in real practice, and the triage drill will mention them.
🦆 Rubber Duck Debugging — your most valuable root-cause tool
The lecture called this the “most valuable root-cause analysis tool” — and the call-out wasn’t ironic.
The Curse of Knowledge. When you’ve held a mental model of your code in your head for the past hour, you read what you intended to write, not what you actually wrote. Your eyes skip the bug because your model says it’s not there. This is why staring at the same five lines for 20 minutes rarely uncovers anything new.
The technique.
- Place a rubber duck (or any silent object — a coffee mug, a textbook, a sympathetic stuffed animal) on your desk.
- Explain to the duck what your code is supposed to do, line by line. Out loud. Slowly.
- At some point — typically a third of the way through — you’ll tell the duck what your code should be doing next, and realize that’s not what it’s actually doing.
That’s the moment your mental model and the actual code diverge. The bug lives in that gap.
Why it works. Verbalization forces you to retrieve and articulate each intermediate step instead of skimming over it. The duck doesn’t help you; explaining helps you. The duck just keeps you from looking like you’re talking to yourself.
Practice tip: when you don’t have a duck, write the explanation as a comment in the code (you can delete it after). Same effect.
📝 After the fix — document and regression-test (don't skip this)
The lecture closed phase 4 (Implement & verify a fix) with three moves you should plan to do every time:
- Add nearby assertions. When you find a bug, related bugs are often hiding in the same neighborhood. assert x is not None, assert len(items) > 0, assert response.status_code == 200 — assertions catch errors before they become failures (a sketch follows this list).
- Document why the fix was necessary in a code comment, in the git commit message, and in the bug report. Future-you (and future-teammate) will need to understand why this line exists; “fix bug” is not enough.
- Keep the bug-reproduction test in the suite for regression testing. Re-running existing tests after later code changes is how you make sure today’s fix doesn’t get silently undone next month. Every bug fix should leave behind a test.
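A sketch of all three moves applied to the Case 2 fix — the helper name normalize_kind is ours, not part of the case files:

```python
# Post-fix hygiene, sketched on the ledger bug (hypothetical helper name).
VALID_KINDS = {"DEPOSIT", "WITHDRAWAL", "REFUND", "REVERSAL", "FEE"}

def normalize_kind(raw: str) -> str:
    # Why this exists: CSV exports have shipped kinds with stray whitespace
    # (row T011 arrived as " REVERSAL" and fell through to the unknown-kind
    # fallback). Normalize and validate at the load boundary.
    kind = raw.strip().upper()
    # Nearby assertion: catch the error before it becomes a failure.
    assert kind in VALID_KINDS, f"malformed transaction kind: {raw!r}"
    return kind

def test_leading_whitespace_kind_is_normalized() -> None:
    # The bug-reproduction test stays in the suite as a regression test.
    assert normalize_kind(" REVERSAL") == "REVERSAL"
```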
The triage quiz below assumes you’ll do all three after picking the right first move.
This step is a quiz only. No code to edit.
Take your time on each scenario — the goal is matching cues to techniques, not memorizing pairs.
Step 8 — Knowledge Check
Min. score: 80%
1. A function processes 50,000 log lines and produces a wrong total. You’ve confirmed the bug is consistent run-to-run. Which technique most efficiently localizes it?
Long streams want conditional breakpoints. The condition is whatever invariant you suspect is broken (running_total > 1e9, line.startswith('ERROR'), etc.). The debugger filters; you only see the iterations that matter.
2. A recursive function returns the wrong answer for one specific input. The function is small (12 lines) and you have a clear test case that reproduces it. Which technique fits best?
For small, well-localized buggy functions, ordinary breakpoint + step + watch + call stack is the simplest and fastest combination. Reach for fancier tools (conditional breakpoints, back-in-time) only when the simpler tool is genuinely insufficient.
3. Final cart total is wrong; a discount appears to have been applied to the wrong line item. The cart processed 8 events (add item, apply coupon, etc.) and the wrong-line discount happened somewhere in the middle. Which technique fits best?
Back-in-time / scrubbing is the right first move when symptom and cause are temporally distant within a single run. After scrubbing localizes the suspicious event, an ordinary breakpoint can give you line-level precision.
4. A function has two parameters that should be independent. After running, you find that modifying one of them mysteriously changes the other. Which technique fits best?
Mysterious co-mutation is the signature of aliasing. The most efficient first move is checking the Variables tab: if two names share an oid, they reference the same object, and modifying one will appear to “modify” the other. The classic Python instance is mutable default arguments — exactly what you saw in Step 2’s enroll example.
5. You’ve spent 20 minutes setting and clearing breakpoints, making small edits, and rerunning tests. Nothing has worked, and you’re starting to feel frustrated. What’s the right next move?
When the cycle stalls, the move is to externalize. Write down the failure precisely, list hypotheses you’ve ruled out (and how), and re-pick a technique deliberately. This isn’t about willpower — it’s about getting the problem out of your head and onto a surface where you can reason about it. Research on debugging found that simply forcing this articulation helped students solve bugs they otherwise would have escalated.
6. A test passes locally on your laptop but fails on the autograder. You’ve reproduced the failure on the autograder twice. What’s the most useful first move?
Reproducibility is upstream of every debugging technique. A bug you can’t reproduce is a bug you can’t debug — none of breakpoints, scrubbing, or watches help if the failure isn’t in front of you. The first move is to find what differs between environments (Python version? OS? data? seed?) and either fix the discrepancy or simulate the autograder’s environment locally.
7. A test that previously passed now fails after a change you just made. The previous test still passes. What does this tell you?
A previously-passing test that newly fails after your change is a regression — your change broke a behavior that was correct. Revert and re-apply more carefully (smaller change, more thought). This is exactly why “verify means rerun the whole suite” — to catch regressions, not just confirm the one fix.
8. A payment processor handles 10,000 transactions. Two adjacent transactions produce totals that are slightly off — but only when a specific merchant ID appears. The failure is consistent run-to-run, and the wrong calculation fires exactly when the bad merchant ID is processed. Which technique fits best?
Conditional breakpoints vs. back-in-time scrubbing depend on temporal distance. Scrubbing earns its cost when symptom and cause are separated by time (many events happen between the bug and when you notice it). Here, the symptom co-occurs with the cause — the bad calculation fires exactly when the suspicious merchant ID is processed. A conditional breakpoint that pauses only on that ID is the direct move.
9. Which of these counts as evidence in the debugging cycle? (select all that apply)
Evidence is observable, specific, and reproducible. Variable values at specific lines, exact failure messages, and repr() outputs all qualify. Hunches are valuable as the starting point for hypothesis generation, but they don’t yet count as evidence — they need to be tested against observations before they earn that status. Distinguishing the two clearly is one of the highest-leverage moves an experienced debugger makes.
Transfer Challenge — You’re On Your Own
🎯 Goal: Find and fix a bug in unfamiliar code without step-by-step prompts. You pick the technique. You type the debugging log.
Compare to Cases 1–3: there, we numbered each stage of the cycle. Here, you do.
📂 What you have
A small program: tagger.py reads articles.txt (each line is "Title|tag") and returns the most common tag.
Two pytest tests in test_tagger.py:
test_python_is_most_common— fails (returns the wrong value).test_no_whitespace_in_result— fails (the result contains whitespace).
📋 Your debugging log
Open debugging_log.md and fill each field as you work.
🚨 Resist the obvious. You may recognize the bug family — but verify with the debugger before assuming. Pattern-matching without evidence is the trap of Step 7’s tinkering item.
Why this matters & what you'll learn
Knowing the cycle on scaffolded examples is one thing; running it without prompts on unfamiliar code is the actual job. Transfer is what tells you whether the cycle has become yours or whether it lived only in the labels we put around each stage. This step removes the per-stage scaffolds — you name the stages, pick the technique, and write the log — so you can see for yourself what you’ve internalized.
You will learn to:
- Apply the full cycle on unfamiliar code without step-by-step prompts.
- Evaluate which case from this tutorial the new bug most resembles structurally — and defend the match.
- Analyze your own default debugging mode (tinkering / print / hypothesis-driven) and name when to override it.
🔗 After fixing — before the quiz
The Transfer Challenge is intentionally in the same bug family as one of the three cases. Before reading the solution or the quiz:
- Which case is it most similar to structurally?
- Write one sentence: “Both bugs share ___ even though the surface is different because ___.”
- Write one sentence: “The surface difference is ___ — which is what makes this feel new.”
Commit to those sentences. Quiz Q1 asks you to defend the match.
🌐 Far-transfer probe — while you debug
Pick one codebase you’ve worked on recently. Where does external data enter (a file read, an API call, a form submission, a database query)? At that entry point: is normalization happening at the boundary, or are downstream consumers doing it — or not doing it at all? Spend 30 seconds answering for one entry point before you start the debugger.
Hint of last resort
If you haven’t found it yet after 10 minutes, the test output already tells you what repr(...) would tell you on a paused breakpoint. Re-read the failing assertion of test_no_whitespace_in_result.
🪞 Self-check — after you fix it
Before this tutorial, which mode would you have defaulted to on this bug?
- Tinkering — try
.strip(),.replace('\n', ''), and other edits until something worked. - Print-first — add
print(tag)everywhere. (The trailing\nprints as a literal newline, easy to miss;repr()makes it impossible to miss.) - Hypothesis-driven — breakpoint, inspect
repr(tag), name the cause, fix at the load boundary. - Honestly not sure — depends on the day and how stuck you felt.
Name which one. That’s the metacognitive skill: knowing your default mode is how you know when to override it.
"""Article tag analyzer.
Reads a file where each line is `"Title|tag"`, returns the most
common tag (uppercased) across all articles.
There is a bug. Both tests in test_tagger.py fail.
"""
from collections import Counter
def top_tag(articles_path: str) -> str:
counts: Counter[str] = Counter()
with open(articles_path) as f:
for line in f:
title, tag = line.split("|", 1)
counts[tag.upper()] += 1
return counts.most_common(1)[0][0]
Why Python rocks|python
JavaScript closures|javascript
Decorators in Python|python
Async Python explained|python
Rust intro|rust
from tagger import top_tag
def test_python_is_most_common() -> None:
# Three of five articles are tagged "python", so PYTHON should win.
assert top_tag('/tutorial/articles.txt') == "PYTHON"
def test_no_whitespace_in_result() -> None:
result = top_tag('/tutorial/articles.txt')
assert result == result.strip(), \
f"Result {result!r} contains whitespace — tags should be normalized at load time."
# Debugging log
Fill each field as you work. Fields 1, 2, 6, 7 are labeled for you.
Fields 3–5 are not — name the stage yourself, then fill in the content.
1. **Symptom** (one sentence — expected vs actual): _..._
2. **Predict** (what should the state be at the suspect line?): _..._
3. **[Stage name?]** (technique chosen and why — write: "I used [tool] because [cue]"): _..._
4. **[Stage name?]** (one sentence — *what* is wrong, *where* it lives): _..._
5. **[Stage name?]** (the line where intended and actual first diverge): _..._
6. **Fix** (file, line, minimal change): _..._
7. **Verify** (which tests pass now; any regressions?): _..._
<details><summary>Field labels 3–5 (open only after completing the log)</summary>
3. Evidence
4. Hypothesis
5. Localize
</details>
Step 9 — Knowledge Check
Min. score: 80%1. Which of the three earlier cases is this bug most structurally similar to?
This bug is the same family as Case 2 in different clothes. Both: external data (CSV row in Case 2, file line here) carries a stray whitespace character; the loading code doesn’t normalize it; the fix is to strip-and-validate at the data boundary. Recognizing isomorphism across surfaces is what transfer means in the research literature.
2. (Final retrieval — spaced from Step 1.) Place these debugging-cycle stages in order: A. Verify B. Symptom C. Hypothesis D. Fix E. Evidence F. Localize G. Predict
Symptom → Predict → Evidence → Hypothesis → Localize → Fix → Verify. The order matters: each stage produces what the next stage needs. Skipping or reordering creates known anti-patterns: tinkering (Fix-first), local verification (skipping Verify of the full suite), or pattern-matching wrong fixes (Localize without Hypothesis).
🪞 Final reflection (no graded answer): Which stage is hardest for you to slow down on? If your honest answer is “Fix” — i.e., you skip ahead to editing — you’re in good company. That’s the most common failure mode. The remedy is not willpower; it’s the explicit form of the cycle plus practice. You just did three rounds of practice.
3. (Spaced retrieval — Step 1’s “no edit until stage 6” rule.) You’re 30 seconds into investigating a bug. You think you see the problem. What does the discipline say to do right now?
“No edit until stage 6” is the central rule. Even a 5-second hypothesis (“I think it’s the off-by-one in the range call”) forces you to articulate what you believe before you commit to a fix. Without articulation, you fix-and-hope, which can take 10× longer than verbalize-then-fix.
4. (Transfer — apply the cycle to a new case.) A teammate reports: “My function expand_aliases is supposed to look up names in aliases.json, but every key returns None.” Which stage of the debugging cycle did your teammate just do, and what’s the next stage?
Symptom = the externally visible fault (“returns None”). The next stage is Predict — what should happen per the spec? Then Evidence — what is happening (use the debugger or print(repr(...))). Then Hypothesis. Skipping Predict is the most common shortcut and the most expensive one — without a written prediction, you can’t tell whether observation matches expectation.
5. (Spaced — Step 2’s aliasing badge.) Your code does:
def add_to(items: list[str] = []) -> list[str]:
items.append("x")
return items
print(add_to()) # ['x']
print(add_to()) # ['x', 'x'] ← surprise
Default argument values are evaluated once, at function-definition time. The items=[] creates one list, bound to the function as its default. Every call that uses the default reuses that same list. The fix is def add_to(items=None): items = items or [] (or if items is None: items = []). This is one of Python’s top-5 gotchas — the time-travel debugger’s aliasing badge (Step 2) lights up on this exact pattern.