Step 1: Why Test? The Bug That Got Away

The Lay of the Land

Learning objective: After this step you will be able to explain why automated tests are valuable and use assert to verify function behavior.

You write Python code that works… until someone changes it and breaks something you did not expect. Testing is the safety net that catches those breaks before they reach users.

A Common Misconception

Many students think testing is about finding bugs after you write code. That is only half the story. Tests also:

  • Specify behavior — a test says “this function should do X” before you write it
  • Prevent regressions — once a bug is fixed, a test ensures it never comes back
  • Enable fearless refactoring — change code confidently because tests catch breakage

Think of tests as a ratchet: once a test passes, your progress is locked in. You can never accidentally slip backward.

Predict Before You Run

The function below is supposed to convert a numeric score (0–100) to a letter grade. Before running it, read the code carefully and predict: what will calculate_grade(85) return? What about calculate_grade(90)?

Write your predictions down, then click Run to see if you were right.

Task

  1. Run grades.py and observe the output. Was your prediction correct?
  2. Find and fix the bug in calculate_grade().
  3. Add three assert statements at the bottom of the file to verify:
    • calculate_grade(95) returns "A"
    • calculate_grade(85) returns "B"
    • calculate_grade(72) returns "C"

An assert works like this:

assert calculate_grade(95) == "A", "95 should be an A"

If the condition is False, Python raises an AssertionError with the message.

Looking ahead: Notice where the bug lives — at the value 90, right on the line between a B and an A. That is no coincidence. Bugs love boundaries (the exact values where behavior changes). You will learn to hunt them systematically in Step 3: Partitions & Boundaries.

Before You Move On (retrieval practice)

Close this page and, from memory, answer in one sentence each:

  • Name two things a test does besides “find bugs.”
  • What does Python do when assert condition evaluates to False?

Rereading feels productive, but retrieval is what actually builds durable memory.

Starter files
grades.py
def calculate_grade(score):
    """Convert a numeric score (0-100) to a letter grade."""
    if score > 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


if __name__ == "__main__":
    # Try it out — predict the output before running!
    print("grade(95) =", calculate_grade(95))
    print("grade(85) =", calculate_grade(85))
    print("grade(90) =", calculate_grade(90))
    print("grade(45) =", calculate_grade(45))

    # TODO: Add three assert statements below to verify:
    #   calculate_grade(95) == "A"
    #   calculate_grade(85) == "B"
    #   calculate_grade(72) == "C"

Step 2: Your First pytest Test

Learning objective: After this step you will be able to write pytest test functions and judge whether an assertion is strong enough to catch real bugs.

From assert to pytest

Raw assert statements work, but they give minimal feedback when things fail. pytest is Python’s most popular testing framework. It discovers and runs your tests automatically, gives clear failure output, and scales from tiny scripts to enormous projects.

pytest Conventions

pytest uses simple naming conventions — no classes, no boilerplate:

Convention      Rule                            Example
File name       Must start with test_           test_temperature.py
Function name   Must start with test_           def test_freezing_point():
Assertions      Plain assert — same as before   assert celsius_to_fahrenheit(0) == 32

That is it. No import unittest, no class TestSomething(unittest.TestCase), no self.assertEqual(). Just functions and assert.
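Putting the conventions together, a complete pytest file can be as small as this sketch (add() is a made-up example, not part of this step's starter code):

```python
# test_demo.py — a minimal complete pytest file. add() is a hypothetical
# function under test, defined inline to keep the sketch self-contained.

def add(a, b):
    return a + b


def test_add_small_numbers():    # starts with test_, so pytest collects it
    assert add(2, 3) == 5


def helper_not_collected():      # no test_ prefix: pytest ignores this function
    return "pytest will not run me"
```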

Running Tests

In this tutorial, clicking Run on a test file will execute pytest automatically. You will see output like:

test_temperature.py::test_freezing_point PASSED
test_temperature.py::test_boiling_point PASSED

A PASSED test means the assertion held. A FAILED test shows you exactly what went wrong.

Task 1: Complete the Basic Tests

The temperature.py module is provided and working. Complete test_temperature.py:

  1. Fill in test_freezing_point — assert that celsius_to_fahrenheit(0) returns 32.0
  2. Fill in test_boiling_point — assert that celsius_to_fahrenheit(100) returns 212.0
  3. Write test_negative_forty from scratch — assert that celsius_to_fahrenheit(-40) returns -40.0 (the unique crossover point where Celsius equals Fahrenheit!)

Not All Assertions Are Equal

A test only catches bugs if its oracle (the assertion that decides pass/fail) is strong enough. Compare these three assertions for the same function call:

Strength   Assertion                                 What breaks it?
Weak       assert result is not None                 Almost nothing — 42, "hello", and 32.0 all pass
Medium     assert isinstance(result, (int, float))   Type errors only — wrong numeric values slip through
Strong     assert result == 32.0                     Any incorrect value at all

A weak oracle lets most bugs through. Students under time pressure (or AI assistants trying to “make the test pass”) often default to weak oracles because they almost always succeed — making the test look productive while testing nothing. You will meet these as the Liar test anti-pattern later in the TDD tutorial.
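To make the difference concrete, here is a sketch with a deliberately broken stand-in for celsius_to_fahrenheit (buggy_convert is ours) showing the weak and medium oracles passing while the strong one would fail:

```python
# buggy_convert is a deliberately broken stand-in for celsius_to_fahrenheit.
def buggy_convert(celsius):
    return celsius * 9 / 5          # bug: the "+ 32" is missing

result = buggy_convert(0)           # returns 0.0, not the correct 32.0

assert result is not None                   # weak oracle: still passes
assert isinstance(result, (int, float))     # medium oracle: still passes
print(result == 32.0)                       # strong oracle would catch it: False
```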

Task 2: Fill in the Oracle

Open test_oracle_drill.py. Each function has Arrange and Act already written. Your job is to write the strongest possible assertion for the result — one that would fail if the function returned the wrong value.

  • test_oracle_strong_freeze — for celsius_to_fahrenheit(0)
  • test_oracle_strong_boil — for celsius_to_fahrenheit(100)

Before You Move On (retrieval practice)

Without looking: what is the rule for pytest’s file-name and function-name conventions, and why does assert isinstance(result, float) count as a weak oracle?

Starter files
temperature.py
"""Temperature conversion utilities."""


def celsius_to_fahrenheit(celsius):
    """Convert Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


def fahrenheit_to_celsius(fahrenheit):
    """Convert Fahrenheit to Celsius."""
    return (fahrenheit - 32) * 5 / 9
test_temperature.py
"""Tests for the temperature module."""
import pytest
from temperature import celsius_to_fahrenheit, fahrenheit_to_celsius


def test_freezing_point():
    # TODO: assert that celsius_to_fahrenheit(0) returns 32.0
    pass


def test_boiling_point():
    # TODO: assert that celsius_to_fahrenheit(100) returns 212.0
    pass


# TODO: Write a test_negative_forty function from scratch.
# Assert that celsius_to_fahrenheit(-40) returns -40.0


if __name__ == "__main__":
    pytest.main(["-v"])
test_oracle_drill.py
"""Fill-in-the-Oracle drill — write the STRONGEST assertion possible.

Arrange and Act are done for you. Replace the weak oracle below each
`result = ...` line with the strongest possible assertion on `result`.
"""
import pytest
from temperature import celsius_to_fahrenheit


def test_oracle_weak_example():
    # Arrange + Act
    result = celsius_to_fahrenheit(0)
    # Weak oracle (provided as a bad example):
    assert result is not None   # passes for almost any return value


def test_oracle_strong_freeze():
    # Arrange + Act
    result = celsius_to_fahrenheit(0)
    # TODO: replace this weak oracle with the strongest possible assertion.
    assert isinstance(result, float)


def test_oracle_strong_boil():
    # Arrange + Act
    result = celsius_to_fahrenheit(100)
    # TODO: replace this weak oracle with the strongest possible assertion.
    assert isinstance(result, float)


if __name__ == "__main__":
    pytest.main(["-v"])

Step 3: Choosing What to Test: Partitions & Boundaries

Learning objective: After this step you will be able to derive equivalence partitions and identify boundary values from a specification, and use them to choose a small but thorough set of test inputs.

The Problem with “Try Some Inputs”

In Step 1 you fixed a bug in calculate_grade. Here is what the buggy version looked like:

if score > 90:   # BUG: missed the boundary at exactly 90
    return "A"

The bug did not cause wrong output for 91, 95, or 100. It only showed up at one exact value: 90. This is not a coincidence — most off-by-one bugs live on the boundary between partitions of the input space. That is why “just try some numbers” is not a reliable test-design strategy.
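To see this in miniature, here is a condensed, deliberately buggy stand-in for calculate_grade (buggy_grade is ours, covering only the A/B line):

```python
# buggy_grade is a cut-down stand-in for calculate_grade, reduced to the A/B line.
def buggy_grade(score):
    if score > 90:    # bug: 90 itself should also be an A
        return "A"
    return "B"

assert buggy_grade(91) == "A"    # inside the A partition: looks correct
assert buggy_grade(95) == "A"
assert buggy_grade(85) == "B"    # inside the B partition: looks correct
print(buggy_grade(90))           # "B": wrong, and only visible at the boundary
```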

Equivalence Partitioning

An equivalence partition is a set of inputs that should all behave the same way. A function that validates usernames of length 3–12 has three partitions:

┌──────────────┬────────────────────┬──────────────┐
│  too short   │       valid        │  too long    │
│  len 0, 1, 2 │  len 3, 4, ..., 12 │  len 13+     │
└──────────────┴────────────────────┴──────────────┘

The principle: if "a" (length 1) is rejected, "ab" (length 2) will probably be rejected too — they are in the same partition. You do not need to test every value; one representative from the middle of each partition is usually enough.

Boundary Value Analysis

But the ends of each partition — the boundaries — deserve special attention because that is where off-by-one bugs hide. For the username rule 3 ≤ len ≤ 12, the boundaries are:

Input length   Expected   Why test this?
2              reject     Just below the valid minimum
3              accept     Exactly the valid minimum
12             accept     Exactly the valid maximum
13             reject     Just above the valid maximum

These four tests together catch every common off-by-one error. Compare that to “try length 1, length 5, length 100” — all in the middle of partitions, none on boundaries. You could pass all three and still have the > 12 vs >= 12 kind of bug that bit us in Step 1.
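The same effect can be demonstrated with a deliberately broken variant of validate_username (buggy_validate is ours):

```python
# buggy_validate is a deliberately broken variant of validate_username.
def buggy_validate(username):
    return 3 <= len(username) < 12    # bug: should be <= 12

# The "try some numbers" inputs all land mid-partition and behave correctly:
assert buggy_validate("a") is False        # length 1
assert buggy_validate("alice") is True     # length 5
assert buggy_validate("x" * 100) is False  # length 100

print(buggy_validate("x" * 12))  # False, though length 12 should be accepted
```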

The Heuristic

For a function with a numeric input range [min, max]:

  1. Partition the input space into regions that should behave the same.
  2. Pick one representative from each partition.
  3. Test every boundary — the last value of the partition before each change and the first value of the partition after it.
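Step 3 of the heuristic can be sketched as a tiny helper (boundary_inputs is our own name, not part of the starter code):

```python
# boundary_inputs is a hypothetical helper illustrating the heuristic.
def boundary_inputs(lo, hi):
    """The four classic boundary values for an inclusive valid range [lo, hi]:
    just below the minimum, the minimum, the maximum, just above the maximum."""
    return [lo - 1, lo, hi, hi + 1]

print(boundary_inputs(3, 12))  # [2, 3, 12, 13], matching the table above
```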

Task: Test validate_username

Open username.py. The function validate_username is provided and correct:

def validate_username(username):
    """Return True iff 3 <= len(username) <= 12."""
    return 3 <= len(username) <= 12

Your job in test_username.py:

  • Two tests are provided as worked examples:
    • test_valid_representative — a typical valid username
    • test_too_short_just_below_min — length 2, just below the valid minimum
  • You write four more tests, one per remaining boundary:
    • test_boundary_min_valid — length 3 is accepted
    • test_boundary_max_valid — length 12 is accepted
    • test_too_long_just_above_max — length 13 is rejected
    • test_empty_string — length 0 is rejected

Run test_username.py after each test you add. All six should pass when you’re done.

Before You Move On (retrieval practice)

Without looking: name the four boundary values you would test for any range [a, b], and explain in one sentence why testing only the middle of each partition is not enough.

Starter files
username.py
def validate_username(username):
    """Return True iff len(username) is between 3 and 12 inclusive."""
    return 3 <= len(username) <= 12
test_username.py
"""Partition & boundary tests for validate_username."""
import pytest
from username import validate_username


# --- Given: a representative valid input (middle of the 'valid' partition) ---
def test_valid_representative():
    assert validate_username("alice") is True   # length 5


# --- Given: just below the valid minimum (boundary at len == 2) ---
def test_too_short_just_below_min():
    assert validate_username("ab") is False     # length 2


# --- TODO: boundary at len == 3 (smallest valid length) ---
# def test_boundary_min_valid():
#     ...


# --- TODO: boundary at len == 12 (largest valid length) ---
# def test_boundary_max_valid():
#     ...


# --- TODO: boundary at len == 13 (just above the valid maximum) ---
# def test_too_long_just_above_max():
#     ...


# --- TODO: empty string (length 0 — edge of the invalid-short partition) ---
# def test_empty_string():
#     ...


if __name__ == "__main__":
    pytest.main(["-v"])

Step 4: Test Behavior, Not Implementation

Learning objective: After this step you will be able to distinguish between brittle (implementation-coupled) and robust (behavior-focused) tests, structure tests using the Arrange-Act-Assert pattern, and judge why coverage numbers are not the same as test quality.

The Key Principle

Test what a function DOES, not how it does it.

A brittle test breaks when you change the internal implementation, even if the behavior is unchanged. A robust test only breaks when the actual behavior changes.

Example: Two Ways to Test the Same Function

Consider testing a function that sorts a list of students by GPA:

# BRITTLE — tests implementation details
def test_sort_brittle():
    sorter = StudentSorter()
    sorter.sort(students)
    # Checks internal state — breaks if we rename the field
    assert sorter._sorted_list[0].name == "Alice"
    assert sorter._comparison_count == 3  # Why do we care?

# ROBUST — tests behavior
def test_sort_robust():
    sorter = StudentSorter()
    result = sorter.sort(students)
    # Checks what the function RETURNS — the observable behavior
    assert result[0].gpa >= result[1].gpa  # Sorted descending
    assert len(result) == len(students)     # No items lost

The brittle test will break if you rename _sorted_list or change the sort algorithm (even to a faster one). The robust test only cares about the result.

Structuring Robust Tests: The AAA Pattern

Behavior-focused tests almost always fall into a clean three-part structure called Arrange-Act-Assert (AAA):

def test_total_sums_prices():
    # Arrange — set up the world the test needs
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    cart.add("Bread", 2.00)

    # Act — invoke the ONE behavior under test
    result = cart.total()

    # Assert — verify the observable outcome
    assert result == 3.50

AAA is more than tidiness. It is a diagnostic structure: if you cannot cleanly separate Arrange from Act, the function under test probably has too many responsibilities. A gigantic Arrange section is the Excessive Setup smell — that pain is architectural feedback, telling you to refactor the production code rather than hide the setup in a helper.

The Refactoring Litmus Test

If you refactor the internals of a function and all tests still pass, your tests are robust. If tests break after a pure refactoring (no behavior change), they are testing implementation.
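Here is the litmus test in miniature, using two made-up versions of a totaling function (neither is part of the ShoppingCart starter code):

```python
# total_v1 and total_v2 are hypothetical: the same behavior, two implementations.
def total_v1(prices):
    s = 0.0
    for p in prices:      # original: explicit accumulator loop
        s += p
    return s


def total_v2(prices):
    return sum(prices)    # pure refactoring: internals changed, behavior identical


# A behavior-focused assertion passes before AND after the refactoring:
for total in (total_v1, total_v2):
    assert total([1.50, 2.00]) == 3.50
```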

Coverage ≠ Quality

Before the task, a cautionary contrast. Which test suite is stronger?

Suite A — 100% line coverage

def test_total_runs():
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    cart.add("Bread", 2.00)
    total = cart.total()
    assert total is not None   # weak oracle — passes for any non-None return

Suite B — 80% line coverage

def test_total_sums_prices():
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    cart.add("Bread", 2.00)
    assert cart.total() == 3.50   # strong oracle — exact expected value

Suite A executes every line of total() but asserts nothing meaningful. If you introduced a bug that made total() return 0.0, Suite A would still pass (because 0.0 is not None). Suite B, with less coverage, actually catches the bug.

Coverage measures which lines ran, not whether you checked their behavior. It is useful as a ceiling (you cannot have tested code you never ran), but it says nothing about the strength of your oracles, so it is not a measure of test adequacy.

Task: Spot and Fix the Brittle Tests

Open test_brittle_audit.py. It contains four tests for a simple ShoppingCart class. Two tests are brittle (they test implementation details via the private _items attribute) and two are robust (they test behavior). Your job:

  1. Read each test carefully.
  2. Identify which two are brittle and which two are robust.
  3. Fix the two brittle tests by rewriting them to test behavior through the public API (items(), item_count(), total()) instead of the private _items attribute.
  4. Run the tests to confirm they all pass.

Before You Move On (retrieval practice)

Without looking, answer in one sentence each:

  • Why is a test that accesses obj._private_field more fragile than one that calls obj.public_method()?
  • What does the AAA pattern give you besides tidy structure?
  • If one suite has 100% coverage and another has 70%, which is stronger?
Starter files
shopping_cart.py
class ShoppingCart:
    def __init__(self):
        self._items = []

    def add(self, name, price):
        self._items.append({"name": name, "price": price})

    def total(self):
        return sum(item["price"] for item in self._items)

    def item_count(self):
        return len(self._items)

    def items(self):
        return [item["name"] for item in self._items]
test_brittle_audit.py
"""AUDIT: Two of these tests are brittle. Find and fix them."""
import pytest
from shopping_cart import ShoppingCart


# Test 1
def test_add_item_updates_count():
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    assert cart.item_count() == 1


# Test 2 — BRITTLE? Check if this tests behavior or implementation.
def test_add_item_internal_list():
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    # Accesses private attribute directly
    assert cart._items[0]["name"] == "Apple"
    assert cart._items[0]["price"] == 1.50


# Test 3
def test_total_sums_prices():
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    cart.add("Bread", 2.00)
    assert cart.total() == 3.50


# Test 4 — BRITTLE? Check if this tests behavior or implementation.
def test_internal_list_length():
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    cart.add("Bread", 2.00)
    # Accesses private attribute directly
    assert len(cart._items) == 2


if __name__ == "__main__":
    pytest.main(["-v"])