Testing Foundations with pytest
Build the core testing skills you need BEFORE learning TDD: write meaningful assertions, choose test cases using partitions and boundaries, and test behavior instead of implementation.
Why Test? The Bug That Got Away
The Lay of the Land
Learning objective: After this step you will be able to explain why automated tests are valuable and use assert to verify function behavior.
You write Python code that works… until someone changes it and breaks something you did not expect. Testing is the safety net that catches those breaks before they reach users.
A Common Misconception
Many students think testing is about finding bugs after you write code. That is only half the story. Tests also:
- Specify behavior — a test says “this function should do X” before you write it
- Prevent regressions — once a bug is fixed, a test ensures it never comes back
- Enable fearless refactoring — change code confidently because tests catch breakage
Think of tests as a ratchet: once a test passes, your progress is locked in. You can never accidentally slip backward.
Predict Before You Run
The function below is supposed to convert a numeric score (0–100) to a letter grade. Before running it, read the code carefully and predict: what will calculate_grade(85) return? What about calculate_grade(90)?
Write your predictions down, then click Run to see if you were right.
Task
- Run grades.py and observe the output. Was your prediction correct?
- Find and fix the bug in calculate_grade().
- Add three assert statements at the bottom of the file to verify:
  - calculate_grade(95) returns "A"
  - calculate_grade(85) returns "B"
  - calculate_grade(72) returns "C"
An assert works like this:
assert calculate_grade(95) == "A", "95 should be an A"
If the condition is False, Python raises an AssertionError with the message.
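For example, here are both outcomes side by side (double is a hypothetical stand-in function, used only to demonstrate assert):

```python
def double(x):
    """A hypothetical function, used only to demonstrate assert."""
    return x * 2

# A passing assert is silent; execution simply continues.
assert double(4) == 8, "4 doubled should be 8"

# A failing assert raises AssertionError carrying the message.
try:
    assert double(4) == 9, "4 doubled should be 8"
except AssertionError as exc:
    print("AssertionError:", exc)  # prints: AssertionError: 4 doubled should be 8
```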
Looking ahead: Notice where the bug lives — at the value 90, right on the line between a B and an A. That is no coincidence. Bugs love boundaries (the exact values where behavior changes). You will learn to hunt them systematically in Step 3: Partitions & Boundaries.
Before You Move On (retrieval practice)
Close this page and, from memory, answer in one sentence each:
- Name two things a test does besides “find bugs.”
- What does Python do when assert condition evaluates to False?
Rereading feels productive, but retrieval is what actually builds durable memory.
def calculate_grade(score):
    """Convert a numeric score (0-100) to a letter grade."""
    if score > 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


if __name__ == "__main__":
    # Try it out — predict the output before running!
    print("grade(95) =", calculate_grade(95))
    print("grade(85) =", calculate_grade(85))
    print("grade(90) =", calculate_grade(90))
    print("grade(45) =", calculate_grade(45))

    # TODO: Add three assert statements below to verify:
    #   calculate_grade(95) == "A"
    #   calculate_grade(85) == "B"
    #   calculate_grade(72) == "C"
Why Test? — Knowledge Check
Min. score: 80%

1. A teammate says: “I only write tests after I finish all my code, to check for bugs.” What is the main limitation of this approach?
Post-hoc tests verify your implementation rather than intended behavior. They also cannot serve as a safety net during development because they don’t exist yet.
2. What does this code do?
assert len(result) == 3, "Expected 3 items"
assert condition, message evaluates the condition. If True, nothing happens and
execution continues. If False, Python raises an AssertionError with the message.
3. Which of the following is not a benefit of automated tests?
Tests reduce bugs and catch regressions, but they cannot guarantee zero bugs. Dijkstra famously said: “Testing can show the presence of bugs, but never their absence.”
4. The “ratchet metaphor” for testing means:
The ratchet metaphor (from Harry Percival’s “Obey the Testing Goat”) captures the key psychological benefit of testing: each passing test is a permanent checkpoint.
Your First pytest Test
Learning objective: After this step you will be able to write pytest test functions and judge whether an assertion is strong enough to catch real bugs.
From assert to pytest
Raw assert statements work, but they give minimal feedback when things fail. pytest is Python’s most popular testing framework. It discovers and runs your tests automatically, gives clear failure output, and scales from tiny scripts to enormous projects.
pytest Conventions
pytest uses simple naming conventions — no classes, no boilerplate:
| Convention | Rule | Example |
|---|---|---|
| File name | Must start with test_ | test_temperature.py |
| Function name | Must start with test_ | def test_freezing_point(): |
| Assertions | Plain assert — same as before | assert celsius_to_fahrenheit(0) == 32 |
That is it. No import unittest, no class TestSomething(unittest.TestCase), no self.assertEqual(). Just functions and assert.
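Putting the three conventions together, a complete test file can be this small. (A sketch: the conversion function is inlined here so the snippet is self-contained, whereas in this step's task it lives in temperature.py and is imported.)

```python
# test_temperature.py — pytest discovers this file because its name starts with test_

def celsius_to_fahrenheit(celsius):
    # Inlined for self-containment; the task imports this from temperature.py.
    return celsius * 9 / 5 + 32

def test_freezing_point():  # discovered because the function name starts with test_
    assert celsius_to_fahrenheit(0) == 32.0  # plain assert, no unittest boilerplate
```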
Running Tests
In this tutorial, clicking Run on a test file will execute pytest automatically. You will see output like:
test_temperature.py::test_freezing_point PASSED
test_temperature.py::test_boiling_point PASSED
A PASSED test means the assertion held. A FAILED test shows you exactly what went wrong.
Task 1: Complete the Basic Tests
The temperature.py module is provided and working. Complete test_temperature.py:
- Fill in test_freezing_point — assert that celsius_to_fahrenheit(0) returns 32.0
- Fill in test_boiling_point — assert that celsius_to_fahrenheit(100) returns 212.0
- Write test_negative_forty from scratch — assert that celsius_to_fahrenheit(-40) returns -40.0 (the unique crossover point where Celsius equals Fahrenheit!)
Not All Assertions Are Equal
A test only catches bugs if its oracle (the assertion that decides pass/fail) is strong enough. Compare these three assertions for the same function call:
| Strength | Assertion | What breaks it? |
|---|---|---|
| Weak | assert result is not None | Almost nothing — 42, "hello", and 32.0 all pass |
| Medium | assert isinstance(result, (int, float)) | Type errors only — wrong numeric values slip through |
| Strong | assert result == 32.0 | Any incorrect value at all |
A weak oracle lets most bugs through. Students under time pressure (or AI assistants trying to “make the test pass”) often default to weak oracles because they almost always succeed — making the test look productive while testing nothing. You will meet these as the Liar test anti-pattern later in the TDD tutorial.
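To see the difference concretely, here is a sketch with a hypothetical bug injected into the conversion (the 9/5 scaling factor forgotten). Run all three oracles against it: only the strong one notices.

```python
def buggy_celsius_to_fahrenheit(celsius):
    # Hypothetical bug: the 9/5 scaling factor was forgotten.
    return celsius + 32

result = buggy_celsius_to_fahrenheit(100)  # returns 132, but 100 °C is 212.0 °F

print(result is not None)                # True  — weak oracle: bug undetected
print(isinstance(result, (int, float)))  # True  — medium oracle: bug undetected
print(result == 212.0)                   # False — strong oracle: bug caught
```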
Task 2: Fill in the Oracle
Open test_oracle_drill.py. Each function has Arrange and Act already written. Your job is to write the strongest possible assertion for the result — one that would fail if the function returned the wrong value.
- test_oracle_strong_freeze — for celsius_to_fahrenheit(0)
- test_oracle_strong_boil — for celsius_to_fahrenheit(100)
Before You Move On (retrieval practice)
Without looking: what is the rule for pytest’s file-name and function-name conventions, and why does assert isinstance(result, float) count as a weak oracle?
"""Temperature conversion utilities."""
def celsius_to_fahrenheit(celsius):
"""Convert Celsius to Fahrenheit."""
return celsius * 9 / 5 + 32
def fahrenheit_to_celsius(fahrenheit):
"""Convert Fahrenheit to Celsius."""
return (fahrenheit - 32) * 5 / 9
"""Tests for the temperature module."""
import pytest
from temperature import celsius_to_fahrenheit, fahrenheit_to_celsius
def test_freezing_point():
# TODO: assert that celsius_to_fahrenheit(0) returns 32.0
pass
def test_boiling_point():
# TODO: assert that celsius_to_fahrenheit(100) returns 212.0
pass
# TODO: Write a test_negative_forty function from scratch.
# Assert that celsius_to_fahrenheit(-40) returns -40.0
if __name__ == "__main__":
pytest.main(["-v"])
"""Fill-in-the-Oracle drill — write the STRONGEST assertion possible.
Arrange and Act are done for you. Replace the weak oracle below each
`result = ...` line with the strongest possible assertion on `result`.
"""
import pytest
from temperature import celsius_to_fahrenheit
def test_oracle_weak_example():
# Arrange + Act
result = celsius_to_fahrenheit(0)
# Weak oracle (provided as a bad example):
assert result is not None # passes for almost any return value
def test_oracle_strong_freeze():
# Arrange + Act
result = celsius_to_fahrenheit(0)
# TODO: replace this weak oracle with the strongest possible assertion.
assert isinstance(result, float)
def test_oracle_strong_boil():
# Arrange + Act
result = celsius_to_fahrenheit(100)
# TODO: replace this weak oracle with the strongest possible assertion.
assert isinstance(result, float)
if __name__ == "__main__":
pytest.main(["-v"])
pytest Basics — Knowledge Check
Min. score: 80%

1. Which file name will pytest automatically discover as a test file?
pytest discovers files whose names start with test_ or end with _test.py.
Functions inside must also start with test_.
2. What does this pytest output mean?
test_math.py::test_addition PASSED
test_math.py::test_division FAILED
PASSED means the assertions in that test function all held. FAILED means at least
one assertion in test_division was False. pytest will show the exact line and
values that caused the failure.
3. Two tests check the same function. Which assertion is strongest (most likely to catch a bug)?

Test A:

def test_add_a():
    result = add(2, 3)
    assert isinstance(result, int)

Test B:

def test_add_b():
    result = add(2, 3)
    assert result == 5
A is a weak oracle: add(2, 3) returning 0, 42, or 5 all pass. B is a
strong oracle: only the correct value 5 satisfies it. AI assistants often
default to weak oracles because they always succeed — the test looks green but
verifies nothing. You will see this anti-pattern again (the “Liar test”) in the TDD tutorial.
4. (Spaced review — Step 1) A developer finishes a feature and deletes the tests because “the feature is done and working.” What principle does this violate?
The ratchet metaphor means tests lock in progress permanently. Deleting passing tests removes the safety net: if someone later modifies the code and introduces a regression, there is no test to catch it.
Choosing What to Test: Partitions & Boundaries
Learning objective: After this step you will be able to derive equivalence partitions and identify boundary values from a specification, and use them to choose a small but thorough set of test inputs.
The Problem with “Try Some Inputs”
In Step 1 you fixed a bug in calculate_grade. Here is what the buggy version looked like:
if score > 90:   # BUG: missed the boundary at exactly 90
    return "A"
The bug did not cause wrong output for 91, 95, or 100. It only showed up at one exact value: 90. This is not a coincidence — most off-by-one bugs live on the boundary between partitions of the input space. That is why “just try some numbers” is not a reliable test-design strategy.
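A stripped-down sketch makes the failure mode visible (grade_buggy and grade_fixed are hypothetical helpers isolating just the A/B boundary):

```python
def grade_buggy(score):
    return "A" if score > 90 else "B"   # off-by-one: boundary value excluded

def grade_fixed(score):
    return "A" if score >= 90 else "B"  # boundary value included

# Away from the boundary the two versions agree, so casual testing passes.
for s in (91, 95, 100):
    assert grade_buggy(s) == grade_fixed(s)

# They disagree only at exactly 90 — the one value worth testing most.
assert grade_buggy(90) == "B"
assert grade_fixed(90) == "A"
```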
Equivalence Partitioning
An equivalence partition is a set of inputs that should all behave the same way. A function that validates usernames of length 3–12 has three partitions:
┌──────────────┬────────────────────┬──────────────┐
│ too short │ valid │ too long │
│ len 0, 1, 2 │ len 3, 4, ..., 12 │ len 13+ │
└──────────────┴────────────────────┴──────────────┘
The principle: if "a" (length 1) is rejected, "ab" (length 2) will probably be rejected too — they are in the same partition. You do not need to test every value; one representative per partition is usually enough for the middle of the partition.
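In test form, one representative from the middle of each partition looks like this (a sketch using the 3–12 username rule above; boundary values come next):

```python
def validate_username(username):
    # The 3-12 length rule from the example above.
    return 3 <= len(username) <= 12

# One representative per partition, taken from the MIDDLE of each.
assert validate_username("a") is False       # too-short partition (length 1)
assert validate_username("alice") is True    # valid partition (length 5)
assert validate_username("a" * 20) is False  # too-long partition (length 20)
```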
Boundary Value Analysis
But the ends of each partition — the boundaries — deserve special attention because that is where off-by-one bugs hide. For the username rule 3 ≤ len ≤ 12, the boundaries are:
| Input length | Expected | Why test this? |
|---|---|---|
| 2 | reject | Just below the valid minimum |
| 3 | accept | Exactly the valid minimum |
| 12 | accept | Exactly the valid maximum |
| 13 | reject | Just above the valid maximum |
These four tests together catch every common off-by-one error. Compare that to “try length 1, length 5, length 100” — all in the middle of partitions, none on boundaries. You could pass all three and still have the > 12 vs >= 12 kind of bug that bit us in Step 1.
The Heuristic
For a function with a numeric input range [min, max]:
- Partition the input space into regions that should behave the same.
- Pick one representative from each partition.
- Test every boundary — the last invalid value before each partition change and the first valid value after it.
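Applied to a hypothetical order-quantity rule (valid iff 1 <= n <= 99), the heuristic yields exactly four boundary checks:

```python
def is_valid_quantity(n):
    """Hypothetical rule: an order quantity is valid iff 1 <= n <= 99."""
    return 1 <= n <= 99

# The heuristic applied to [min, max] = [1, 99]:
assert is_valid_quantity(0) is False    # last invalid value below min
assert is_valid_quantity(1) is True     # first valid value (min boundary)
assert is_valid_quantity(99) is True    # last valid value (max boundary)
assert is_valid_quantity(100) is False  # first invalid value above max
```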
Task: Test validate_username
Open username.py. The function validate_username is provided and correct:
def validate_username(username):
    """Return True iff 3 <= len(username) <= 12."""
    return 3 <= len(username) <= 12
Your job in test_username.py:
- Two tests are provided as worked examples:
  - test_valid_representative — a typical valid username
  - test_too_short_just_below_min — length 2, just below the valid minimum
- You write four more tests, one per remaining boundary:
  - test_boundary_min_valid — length 3 is accepted
  - test_boundary_max_valid — length 12 is accepted
  - test_too_long_just_above_max — length 13 is rejected
  - test_empty_string — length 0 is rejected
Run test_username.py after each test you add. All six should pass when you’re done.
Before You Move On (retrieval practice)
Without looking: name the four boundary values you would test for any range [a, b], and explain in one sentence why testing only the middle of each partition is not enough.
def validate_username(username):
    """Return True iff len(username) is between 3 and 12 inclusive."""
    return 3 <= len(username) <= 12
"""Partition & boundary tests for validate_username."""
import pytest
from username import validate_username
# --- Given: a representative valid input (middle of the 'valid' partition) ---
def test_valid_representative():
assert validate_username("alice") is True # length 5
# --- Given: just below the valid minimum (boundary at len == 2) ---
def test_too_short_just_below_min():
assert validate_username("ab") is False # length 2
# --- TODO: boundary at len == 3 (smallest valid length) ---
# def test_boundary_min_valid():
# ...
# --- TODO: boundary at len == 12 (largest valid length) ---
# def test_boundary_max_valid():
# ...
# --- TODO: boundary at len == 13 (just above the valid maximum) ---
# def test_too_long_just_above_max():
# ...
# --- TODO: empty string (length 0 — edge of the invalid-short partition) ---
# def test_empty_string():
# ...
if __name__ == "__main__":
pytest.main(["-v"])
Partitions & Boundaries — Knowledge Check
Min. score: 80%

1. A spec reads: “A discount applies for orders of strictly more than $50 and up to $500 inclusive.” Which four values are the most important boundaries to test?
Boundary Value Analysis says: test the values just on each side of every boundary.
For strictly > 50 and ≤ 500, the boundaries are 50/51 (where > 50 flips) and
500/501 (where ≤ 500 flips). Exactly this pattern catches the canonical off-by-one
bug — writing >= 50 instead of > 50, or < 500 instead of <= 500.
2. A developer’s test suite has 12 tests. Every test uses a valid username of length 5, 6, 7, or 8. All tests pass. Can you trust the suite?
The tests cover the middle of one partition only. A bug like len < 12 instead of
len <= 12 would pass all 12 tests but fail in production at length 12. One test per
boundary catches far more bugs than many tests clustered in the easy middle.
3. An equivalence partition is:
Equivalence partitioning groups inputs by expected behavior, not by value. All
valid usernames (length 3–12) form one partition; all too-short usernames form
another. The point: if "abc" is accepted, "abcd" almost certainly is too —
so you do not need to test them both. Spend your test budget on boundaries
between partitions instead.
4. (Spaced review — Step 1) Recall the bug in calculate_grade: score > 90 instead of score >= 90. Classify this bug.
Boundary bugs manifest only at the exact value where partitions meet.
score > 90 works for 91, 95, 100 — but fails at 90 itself. This is exactly
the kind of bug that Boundary Value Analysis is designed to expose, and it
is why you test values just on each side of every boundary in a spec.
Test Behavior, Not Implementation
Learning objective: After this step you will be able to distinguish between brittle (implementation-coupled) and robust (behavior-focused) tests, structure tests using the Arrange-Act-Assert pattern, and judge why coverage numbers are not the same as test quality.
The Key Principle
Test what a function DOES, not how it does it.
A brittle test breaks when you change the internal implementation, even if the behavior is unchanged. A robust test only breaks when the actual behavior changes.
Example: Two Ways to Test the Same Function
Consider testing a function that sorts a list of students by GPA:
# BRITTLE — tests implementation details
def test_sort_brittle():
    sorter = StudentSorter()
    sorter.sort(students)
    # Checks internal state — breaks if we rename the field
    assert sorter._sorted_list[0].name == "Alice"
    assert sorter._comparison_count == 3  # Why do we care?


# ROBUST — tests behavior
def test_sort_robust():
    sorter = StudentSorter()
    result = sorter.sort(students)
    # Checks what the function RETURNS — the observable behavior
    assert result[0].gpa >= result[1].gpa  # Sorted descending
    assert len(result) == len(students)    # No items lost
The brittle test will break if you rename _sorted_list or change the sort algorithm (even to a faster one). The robust test only cares about the result.
Structuring Robust Tests: The AAA Pattern
Behavior-focused tests almost always fall into a clean three-part structure called Arrange-Act-Assert (AAA):
def test_total_sums_prices():
    # Arrange — set up the world the test needs
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    cart.add("Bread", 2.00)

    # Act — invoke the ONE behavior under test
    result = cart.total()

    # Assert — verify the observable outcome
    assert result == 3.50
AAA is more than tidiness. It is a diagnostic structure: if you cannot cleanly separate Arrange from Act, the function under test probably has too many responsibilities. A gigantic Arrange section is the Excessive Setup smell — that pain is architectural feedback, telling you to refactor the production code rather than hide the setup in a helper.
The Refactoring Litmus Test
If you refactor the internals of a function and all tests still pass, your tests are robust. If tests break after a pure refactoring (no behavior change), they are testing implementation.
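A minimal sketch of the litmus test in action, using two hypothetical sort_names implementations: the internals change completely, yet one behavior-focused test passes for both.

```python
def sort_names_v1(names):
    # Original implementation: a hand-rolled insertion sort.
    result = []
    for name in names:
        i = 0
        while i < len(result) and result[i] < name:
            i += 1
        result.insert(i, name)
    return result

def sort_names_v2(names):
    # Refactored implementation: delegates to the built-in sort.
    return sorted(names)

# The same behavior-focused test passes for BOTH implementations,
# so the refactoring from v1 to v2 is proven safe.
for sort_names in (sort_names_v1, sort_names_v2):
    assert sort_names(["Charlie", "Alice", "Bob"]) == ["Alice", "Bob", "Charlie"]
```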
Coverage ≠ Quality
Before the task, a cautionary contrast. Which test suite is stronger?
Suite A — 100% line coverage
def test_total_runs():
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    cart.add("Bread", 2.00)
    total = cart.total()
    assert total is not None  # weak oracle — passes for any non-None return
Suite B — 80% line coverage
def test_total_sums_prices():
    cart = ShoppingCart()
    cart.add("Apple", 1.50)
    cart.add("Bread", 2.00)
    assert cart.total() == 3.50  # strong oracle — exact expected value
Suite A executes every line of total() but asserts nothing meaningful. If you introduced a bug that made total() return 0.0, Suite A would still pass (because 0.0 is not None). Suite B, with less coverage, actually catches the bug.
Coverage measures which lines ran, not whether you checked their behavior. It is a useful diagnostic ceiling (you cannot test what you never ran), but not a measure of adequacy.
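To make the contrast concrete, here is a sketch with a hypothetical BuggyCart whose total() always returns 0.0. Suite A's weak oracle stays green; Suite B's strong oracle fails and exposes the bug.

```python
class BuggyCart:
    """Hypothetical broken variant of ShoppingCart: total() always returns 0.0."""
    def __init__(self):
        self._items = []

    def add(self, name, price):
        self._items.append({"name": name, "price": price})

    def total(self):
        return 0.0  # the injected bug

cart = BuggyCart()
cart.add("Apple", 1.50)
cart.add("Bread", 2.00)

print(cart.total() is not None)  # True  — Suite A's weak oracle still passes
print(cart.total() == 3.50)      # False — Suite B's strong oracle catches the bug
```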
Task: Spot and Fix the Brittle Tests
Open test_brittle_audit.py. It contains four tests for a simple ShoppingCart class. Two tests are brittle (they test implementation details via the private _items attribute) and two are robust (they test behavior). Your job:
- Read each test carefully.
- Identify which two are brittle and which two are robust.
- Fix the two brittle tests by rewriting them to test behavior through the public API (items(), item_count(), total()) instead of the private _items attribute.
- Run the tests to confirm they all pass.
Before You Move On (retrieval practice)
Without looking, answer in one sentence each:
- Why is a test that accesses obj._private_field more fragile than one that calls obj.public_method()?
- What does the AAA pattern give you besides tidy structure?
- If one suite has 100% coverage and another has 70%, which is stronger?
- If one suite has 100% coverage and another has 70%, which is stronger?
class ShoppingCart:
    def __init__(self):
        self._items = []

    def add(self, name, price):
        self._items.append({"name": name, "price": price})

    def total(self):
        return sum(item["price"] for item in self._items)

    def item_count(self):
        return len(self._items)

    def items(self):
        return [item["name"] for item in self._items]
"""AUDIT: Two of these tests are brittle. Find and fix them."""
import pytest
from shopping_cart import ShoppingCart
# Test 1
def test_add_item_updates_count():
cart = ShoppingCart()
cart.add("Apple", 1.50)
assert cart.item_count() == 1
# Test 2 — BRITTLE? Check if this tests behavior or implementation.
def test_add_item_internal_list():
cart = ShoppingCart()
cart.add("Apple", 1.50)
# Accesses private attribute directly
assert cart._items[0]["name"] == "Apple"
assert cart._items[0]["price"] == 1.50
# Test 3
def test_total_sums_prices():
cart = ShoppingCart()
cart.add("Apple", 1.50)
cart.add("Bread", 2.00)
assert cart.total() == 3.50
# Test 4 — BRITTLE? Check if this tests behavior or implementation.
def test_internal_list_length():
cart = ShoppingCart()
cart.add("Apple", 1.50)
cart.add("Bread", 2.00)
# Accesses private attribute directly
assert len(cart._items) == 2
if __name__ == "__main__":
pytest.main(["-v"])
Behavior vs Implementation — Knowledge Check
Min. score: 80%

1. Which test is more robust?

Test A:

def test_sort_a():
    result = sort_names(["Charlie", "Alice", "Bob"])
    assert result == ["Alice", "Bob", "Charlie"]

Test B:

def test_sort_b():
    result = sort_names(["Charlie", "Alice", "Bob"])
    assert result[0] < result[1] < result[2]
    assert len(result) == 3
Test A pins one exact expected output; Test B expresses the properties that matter: the result is in sorted order and no items were lost. Test B would hold for any input of three distinct names, while Test A is tied to this one example.
2. You refactor a function’s internal algorithm (from bubble sort to quicksort) without changing its return value. Two of your tests break. What does this tell you?
If the function’s behavior (inputs → outputs) is unchanged but tests break, those tests are coupled to the implementation. The fix is to rewrite the tests to assert on observable behavior, not internal details.
3. A test you wrote needs 40 lines of setup code before the single assert statement. What is this test telling you?
Excessive Setup is a documented test smell. When Arrange dominates a test, the function under test is usually too tightly coupled to its dependencies. The fix is to refactor the production code, not to hide the pain inside a helper. The test is acting as architectural feedback — listen to the test.
4. Two suites test the same function. Suite A has 100% line coverage but every assertion is assert result is not None. Suite B has 80% line coverage but every assertion checks an exact expected value. Which statement is correct?
This is the coverage misconception: a suite can run every line and still verify nothing if its assertions are weak. Coverage is a necessary ceiling (you cannot test what you never ran) but it is not sufficient for quality. Strong oracles on the critical paths beat weak oracles everywhere.
5. (Spaced review — Step 3) For a function charge_fee(amount) with rule “fee is 2% if amount is 100 or more, else free”, which test pair exposes the most bugs?
The boundary is at 100. Testing 99 and 100 catches the canonical off-by-one
(> 100 vs >= 100). Picking values far from the boundary misses exactly
the bugs that Boundary Value Analysis is designed to expose.