Debugging – Finding and Fixing Faults Systematically
“Debugging is like being a detective in a crime movie where you are also the murderer.” — Filipe Fortes
Debugging is the systematic process of finding and fixing faults (commonly called “bugs”) in a program’s source code. Every working developer spends a large fraction of their time on it, and a good debugging process is one of the highest-leverage skills you can build.
Why Debugging Skills Matter
Software defects are not a niche concern: they cost the U.S. economy roughly $60 billion every year, and validation activities (including debugging) consume 50–75% of development time on a typical project. The cost isn’t the hour you spent fixing the bug — it’s the revenue lost, the customer trust eroded, and, in safety-critical settings, the lives placed at risk while the defect was in production.
Empirical studies of professional developers find that the best debuggers are roughly three times as efficient as average ones on the same defects. That gap is not innate talent; it comes from a disciplined process. The rest of this chapter is that process.
The Search-the-Error-Message Pattern
Before you launch a full debugging session, ask whether the error is yours at all. If you see a message coming from a framework, library, or external service that does not directly point to a fix, you are very likely the thousandth developer to encounter it — and a 30-second search will usually surface a solution.
| When you see… | Do this |
|---|---|
| An error from a framework, library, or service (not your own code) | Search the error message |
| An error from your own code | Skip the search and start the 4-step debugging process below |
The pattern, applied carefully:
- Strip project-specific identifiers from the input and output. “ERROR: relation "tobias_dev_orders_2026_q1" does not exist” will find very little. “ERROR: relation does not exist” will find the underlying cause. Stripping also helps with privacy: usernames, internal hostnames, and API keys do not need to be sent to third parties.
- Paste the cleaned message into a search engine or AI assistant.
- Study results before acting. This is where caution earns its keep. With the rise of AI agents that browse the web, prompt injection attacks plant malicious “fix this by running…” instructions on pages that look like normal Stack Overflow answers. Read any command before you run it; activate the shell-scripting judgment you developed in earlier chapters. A suggestion to git push --force to main or to curl … | sudo bash is almost never the right answer.
- Only after external sources are exhausted, ask a more experienced coworker. Their time is more expensive than yours, and they will not be pleased if the answer was one search away.
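The stripping step can be partly automated. The sketch below is one possible approach, not a complete scrubber: the function name strip_identifiers and the single regular expression are assumptions for illustration, and they only remove quoted names. Hostnames, tokens, and other secrets would need patterns of their own.

```python
import re

# Hypothetical pattern: a quoted name that is not part of a word (so the
# apostrophe in "can't" is left alone), plus any whitespace in front of it.
_QUOTED_NAME = re.compile(r"""\s*(?<!\w)(["'])[^"']+\1(?!\w)""")


def strip_identifiers(message: str) -> str:
    """Remove quoted, project-specific names so only the generic error text remains."""
    cleaned = _QUOTED_NAME.sub("", message)
    # Collapse any double spaces left behind by the removal.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


raw = 'ERROR: relation "tobias_dev_orders_2026_q1" does not exist'
print(strip_identifiers(raw))
```

Read the cleaned message before pasting it anywhere; a pattern this simple will miss identifiers that are not quoted.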
Fault, Error, Failure
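Casual conversation uses bug to mean any of three different things. Debugging works better when you keep them separate, because each one is observed at a different place in the system and points you toward a different next step.
- Fault: the defect in the source code (the wrong or missing statement).
- Error: the incorrect internal state the fault produces when the code runs.
- Failure: the incorrect behavior that becomes visible to the user or to another system.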
Why the distinction is load-bearing:
A try { … } catch { … } block that swallows an exception turns a failure back into a contained error — the user no longer sees a crash, even though the fault is still in the code. Real systems use this on purpose: fault-tolerant systems (think airplane flight control, payment processors) assume that faults will exist and design so that errors do not propagate to failures. The right level of error handling is its own design decision, covered in the Defensive Programming chapter — for debugging, the lesson is that where you observe the symptom is not where you fix the bug.
Worked example
import sys
import math

def cal_circumference(radius):
    diameter = 2 * radius
    circumference = diameter * math.pi
    return circumference

def __main__():
    try:
        input_radius = sys.argv[1]
        C = cal_circumference(input_radius)
        print(f"The circumference of a circle with radius {input_radius} is: {C}")
    except:
        print("An error occurred but there is no failure")

__main__()
- Fault: the assignment input_radius = sys.argv[1] in __main__. sys.argv[1] is always a string; nothing converts it to a number before it flows into cal_circumference.
- Error: inside cal_circumference, radius is '10', so diameter = 2 * radius produces '1010' (Python repeats the string twice) instead of 20.
- Failure: would be a crash or a wrong result reaching the user. Here the string '1010' makes diameter * math.pi raise a TypeError, and the bare except: block swallows it. That prevents the visible failure but masks the fault and makes the bug harder to find.
The Four-Step Debugging Process
The rest of this chapter walks through the same four steps in order. The progression matters: skipping ahead — for example, jumping into a debugger before you can reliably reproduce the bug — wastes hours.
- Investigate symptoms to reproduce the bug
- Locate the faulty code
- Determine the root cause
- Implement and verify a fix
Step 1: Reproduce the Bug
Goal: Get to a place where you can observe the bug on demand — and, eventually, where a test can do it for you.
A bug you cannot reproduce is a bug you cannot debug. The cautionary tale: between 1985 and 1987 the Therac-25 radiation-therapy machine gave at least six patients massive overdoses, several of them fatal. The triggering condition was an experienced operator typing faster than the developers expected, a sequence the test team had never reproduced because they typed slower. Until the team could reproduce the input sequence, the bug remained invisible.
To reproduce a bug, capture two things:
The problem environment — the setting in which the bug occurs:
- Hardware, operating system, runtime, package versions, browser
- User settings, configuration flags, feature gates
- The exact build of the software the user was running
The problem history — the steps that reach the bug:
- Sequence of data inputs and user interactions
- Communication with other components (HTTP request bodies, message-queue payloads)
- Timing, randomness seeds, physical influences where relevant (NASA’s deep-space missions, for example, deal with cosmic-ray bit flips that can only be reproduced with the right hardware-level instrumentation)
This is why the bug-report templates of mature projects feel tedious — “OS version? Browser? Steps to reproduce?” That tedium is the developer’s only path back to the user’s experience.
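Much of the problem environment can be collected by the program itself and attached to a bug report. The following is a minimal sketch using only the Python standard library; the function name capture_environment and the choice of fields are assumptions, and a real project would add its own configuration flags and build identifier.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=()):
    """Collect environment facts that a bug report typically asks for."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


# Example: report the version of one package alongside the platform details.
print(json.dumps(capture_environment(["pip"]), indent=2))
```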
Write an Automated Bug-Reproduction Test
Once you can reproduce the bug manually, your next step is to automate the reproduction. A failing test is more valuable than a sticky note that says “reproduce by clicking these seven things.”
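For the circumference program from the worked example, such a failing test might look like the sketch below. The helper circumference_from_argv is hypothetical: it stands in for the body of __main__ without the bare except, so the test can observe what really happens. The check function's name is also illustrative; a runner such as pytest collects functions whose names start with test_.

```python
import math


def cal_circumference(radius):
    # Copied from the worked example with the bug left in place.
    diameter = 2 * radius
    return diameter * math.pi


def circumference_from_argv(argv):
    # Hypothetical helper: what __main__ does, minus the bare except.
    return cal_circumference(argv[1])


def check_radius_given_on_command_line():
    # The user's report reduced to one call: radius "10" should give 20 * pi.
    result = circumference_from_argv(["circle.py", "10"])
    assert math.isclose(result, 20 * math.pi)
```

Until the fix lands, calling the check raises a TypeError before the assertion is even reached, which is exactly the on-demand reproduction you want.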
- Why automate it now, before you know the fix? Because you are about to try a dozen possible fixes. Doing the reproduction manually each time is slow, error-prone, and (much worse) tempting to skip.
- Simplify the test — strip out every input detail that is not load-bearing for the failure. A 200-step reproduction usually has 5 critical steps and 195 confounders.
- Keep the test forever. When the fix lands, this test becomes a regression test that prevents the same bug from sneaking back in a future change.
You are essentially turning the user’s report into a permanent, runnable specification of the bug’s absence.
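Simplifying a long reproduction can itself be mechanized. The sketch below is a greedy, one-step-at-a-time reduction under stated assumptions: still_fails is a hypothetical callback that replays a candidate list of steps and reports whether the bug still appears, and the toy version here only pretends to replay them. It is a simplified illustration, not a full delta-debugging implementation.

```python
def minimize_steps(steps, still_fails):
    """Greedily drop every step that is not needed to keep the bug reproducing."""
    current = list(steps)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if still_fails(candidate):
            current = candidate   # step i was a confounder; leave it out
        else:
            i += 1                # step i is load-bearing; keep it
    return current


def still_fails(steps):
    # Toy stand-in for a real replay: the bug needs these three steps, in order.
    remaining = iter(steps)
    return all(step in remaining for step in ["login", "apply_coupon", "checkout"])


recorded = ["login", "open_menu", "search", "apply_coupon", "scroll", "checkout", "logout"]
print(minimize_steps(recorded, still_fails))
```

Each call to still_fails is one replay of the reproduction, so this only pays off once the replay is automated.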
Step 2: Locate the Faulty Code
Goal: Reduce the search space from “the whole codebase” to “this file, probably this function.”
In a well-designed system, the responsibility for the symptom should map cleanly to a single module. In any other system — which is most of them — you need tactics.
Logging
Add logging statements that record what the program is actually doing. Python’s logging module, JavaScript’s console.debug / pino, Java’s slf4j, Rust’s tracing — every mature ecosystem has one. Use levels (debug, info, warning, error, critical) so production can run at warning while you crank it up to debug when investigating.
What to log:
- Inputs, especially unexpected ones
- State changes, such as “transitioned from unauthenticated to authenticated”
- Communication with other components: request/response payloads, message-queue events
A formatted log line such as
2026-05-24 14:14:47 | ERROR | main.py:34 | Failed to connect to database: 'my_db'
gives you a file, a line number, a level, and a human-readable message in one glance — orders of magnitude more useful than print("here"). For backend systems especially, build logging in from day one; debugging without logs is debugging with one hand tied behind your back.
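A log line in that shape can be produced with Python's standard logging module. This is a minimal sketch: the logger name debugging_demo and the helper build_logger are assumptions, and the output is written to an in-memory buffer only so the example is self-contained.

```python
import io
import logging


def build_logger(stream, level=logging.DEBUG):
    """Return a logger that writes: timestamp | LEVEL | file:line | message."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s | %(levelname)s | %(filename)s:%(lineno)d | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger = logging.getLogger("debugging_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer, level=logging.WARNING)
log.debug("cache miss for key %r", "user:42")            # suppressed at WARNING
log.error("Failed to connect to database: %r", "my_db")  # recorded
print(buffer.getvalue(), end="")
```

Passing level=logging.DEBUG instead turns the suppressed line back on, which is the "crank it up when investigating" switch described above.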
Visual Diagrams
If your codebase is a few thousand lines, reading every file to find the bug is hopeless. A component or sequence diagram that shows what talks to what — even a hand-drawn one — typically cuts the search drastically. Empirical studies of robotics engineers debugging unfamiliar systems found that engineers who had a generated component diagram found the faulty component significantly faster than those who only had the source code, because the diagram lets you ask “does this component even receive the input it needs?” before you start reading code.
This is one reason the SEBook chapters on UML class, sequence, state, and component diagrams are worth the time — they pay back when something breaks.
Focus on the Most Likely Origins
Bugs cluster. They are more likely to live in:
- Code with code smells — long methods, duplicated code, deeply nested conditionals. Refactor the worst offenders before you start debugging when you can; it often makes the bug obvious.
- Code that was written quickly — at 2 a.m., under deadline, by an AI agent without supervision, by a contributor unfamiliar with the module.
- Code at boundaries — wherever data crosses a type boundary (string ↔ number), a process boundary (request parsing, response serialization), or a security boundary.
Common low-level bugs your linter or type-checker can flag automatically: uninitialized variables, unused values, unreachable code, memory leaks, null-pointer access, type inconsistencies. Run the linter before you start hand-searching.
Assertions
assert statements catch errors as they happen, at the source, rather than letting them propagate silently into something inscrutable later.
def withdraw(account, amount):
    assert amount > 0, "withdrawal amount must be positive"
    assert account.balance >= amount, "insufficient funds"
    account.balance -= amount
An assertion failure points directly at the violated invariant, which is far easier to diagnose than the eventual AttributeError: 'NoneType' object has no attribute 'balance' three call frames deep. Most languages let you strip assertions from production builds (Python’s -O flag, C’s NDEBUG), so the diagnostic cost is paid only during development and test runs. Some teams measure code quality in assertions per 100 lines of code; it is a crude metric, but a defensive program is usually a debuggable program.
Note that assertions are not exceptions. They are not meant to be caught and recovered from; they signal a programmer mistake (a violated invariant), not a user mistake (bad input). For graceful recovery use proper error handling; for “this should never happen” use an assertion.
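The split can be sketched as follows, assuming a small Account class and a parse_amount helper that are invented for this illustration. Input that comes from a user is rejected with an exception the caller can handle; the invariants inside withdraw stay as assertions, as in the example above.

```python
from dataclasses import dataclass


@dataclass
class Account:
    balance: float


def parse_amount(text: str) -> float:
    """User input can legitimately be bad, so reject it with a catchable exception."""
    try:
        amount = float(text)
    except ValueError:
        raise ValueError(f"not a number: {text!r}") from None
    if amount <= 0:
        raise ValueError("withdrawal amount must be positive")
    return amount


def withdraw(account: Account, amount: float) -> None:
    """Callers have already validated; a violation here is a programmer mistake."""
    assert amount > 0, "withdrawal amount must be positive"
    assert account.balance >= amount, "insufficient funds"
    account.balance -= amount


account = Account(balance=100.0)
withdraw(account, parse_amount("30"))
print(account.balance)
```

Remember that running Python with -O removes the assert statements, so nothing the program needs for correct recovery should live in them.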
Step 3: Determine the Root Cause
Goal: Understand why the faulty code behaves the way it does — what you believed about the program that turns out to be wrong.
Rubber Duck Debugging
The most valuable root-cause-analysis tool costs about $3 and lives on your desk.
Why it works: when you read code you wrote yourself, you suffer from the curse of knowledge — you see what you intended to write, not what you actually wrote. The defect is on the page, but your mental model is overwriting it.
How to apply it: put a rubber duck (or any inanimate object — a coffee mug, a houseplant) on your desk and explain your code to it, line by line. At some point you will tell the duck what the next line should do, look at the line, and realize it doesn’t do that. The duck has found your bug.
Why a duck and not a teammate? Two reasons. A teammate will interrupt and may confirm your biases. And a teammate is usually busy debugging their own code. The duck is always available, and it never agrees with you when you are wrong.
For students: in this course, prefer rubber-duck debugging over asking an AI assistant to find the bug for you. The act of explaining the code is what builds the mental model you will need for the next, harder bug. Use AI for accelerating things you already understand; use the duck for things you don’t yet.
Step-Through Debugger
The second-most-valuable root-cause tool: an interactive debugger that lets you pause execution and inspect program state.
The core moves, supported by every modern IDE (VS Code, PyCharm, IntelliJ, Chrome DevTools…):
- Breakpoint: an intentional stopping point. Click the gutter to the left of a line; when execution reaches that line, it pauses before executing it.
- Step over / step into / step out: advance one line at a time; descend into a function call; pop back out to the caller.
- Watch / inspect: read variables in the current scope, evaluate expressions in the debug console (e.g., type len(items) > 0 to ask a question of the running program).
- Call stack: see who called this function, and who called them.
Walking the worked-example program above through the debugger would show you, immediately:
| Line reached | Local state observed | What you learn |
|---|---|---|
| input_radius = sys.argv[1] (after) | input_radius = '10' (string) | The CLI argument is a string |
| cal_circumference(input_radius) (entered) | radius = '10' | The string is passed through unchanged |
| diameter = 2 * radius (after) | diameter = '1010' | 2 * '10' repeats the string, it doesn’t multiply |
| circumference = diameter * math.pi | TypeError | The bare except swallows it and prints the generic message |
The bug isn’t in cal_circumference at all; it’s in the missing int() / float() conversion where __main__ reads sys.argv[1]. The debugger tells you that in 30 seconds; staring at the code might take much longer.
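The observations in the table can also be confirmed outside the debugger, for example in a REPL. The lines below are a small sketch that replays the same values by hand; the variable names simply mirror the worked example.

```python
import math

radius = "10"            # what sys.argv[1] hands over: a string, not a number
diameter = 2 * radius    # sequence repetition, not arithmetic
print(repr(diameter))

try:
    diameter * math.pi   # a str cannot be multiplied by a float
except TypeError as exc:
    print(type(exc).__name__, exc)

# Converting at the boundary restores the intended arithmetic.
print(2 * float(radius) * math.pi)
```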
Run Configurations
Most IDEs let you save a run / launch configuration so the debugger always starts the program with the right arguments and environment. In VS Code that’s a launch.json entry:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python Debugger: Current File",
            "type": "debugpy",
            "request": "launch",
            "args": ["10"],
            "program": "${file}",
            "console": "integratedTerminal"
        }
    ]
}
For backend / Node.js / multi-process systems, the configuration grows — --inspect flags, port forwarding, source maps. The search engines / AI tools from the search pattern above are well-equipped to help you write that configuration.
Conditional Breakpoints
When a bug only manifests on the 1000th iteration of a loop, stepping through 999 boring iterations is unbearable. Right-click a breakpoint and add a condition (i == 1000, or request.user.id == 'tobias' and request.amount > 50000). The breakpoint only fires when the condition is true. You can also attach a hit count so the breakpoint triggers only on the Nth pass through the line.
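When no IDE is at hand, the same idea can be written into the code temporarily. The sketch below assumes a hypothetical function sum_line_items with two illustrative parameters, pause_when and on_pause. By default on_pause is Python's built-in breakpoint(), which opens the debugger; the usage example passes a recording callback instead so that it runs without stopping.

```python
def sum_line_items(items, pause_when, on_pause=breakpoint):
    """Sum item amounts, pausing only on the iteration that pause_when selects."""
    total = 0
    for index, item in enumerate(items):
        if pause_when(index, item):
            on_pause()   # with the default, the debugger opens in this frame
        total += item["amount"]
    return total


items = [{"amount": n} for n in range(5000)]
pauses = []
total = sum_line_items(
    items,
    pause_when=lambda index, item: index == 1000,
    on_pause=lambda: pauses.append("paused"),   # stand-in for breakpoint()
)
print(total, pauses)
```

Remove the temporary hook once the investigation is over; an IDE's conditional breakpoint has the advantage of leaving the source untouched.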
Time-Travel Debuggers
Standard debuggers go forward. A time-travel debugger records the execution and lets you step backwards: re-examine a variable’s value three lines ago, hypothetically change it, and re-run forward from that point. They are not built into VS Code by default but are available as extensions or separate tools for Python, Node.js, and other runtimes (rr, a record-and-replay debugger for native Linux programs driven through gdb, is a well-known example outside the Python world). The SEBook’s Python debugging tutorial gives you a sandboxed time-travel debugger to practice with; once you have used one, you will look for them everywhere.
Step 4: Implement and Verify the Fix
Goal: Land a fix that closes the bug and keeps the rest of the system green.
The temptation is to call the bug “fixed” the moment the failing reproduction stops failing. Resist it. Two more steps separate a plausible fix from a trustworthy one.
Add Assertions to Catch Nearby Bugs
The conditions that produced this bug probably hold in other places too. After the fix, sprinkle assertions on the surrounding invariants — “radius is a number”, “discount is between 0 and 1”, “queue length is non-negative”. They serve as live documentation and they will catch the next bug in the family before it ships.
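For the circumference example, one possible shape of the fix plus a guarding assertion is sketched below. The function main, its return codes, and the usage message are assumptions made for illustration; the conversion happens at the boundary where the string enters the program, and the assertion documents the invariant that the rest of the code relies on.

```python
import math


def cal_circumference(radius):
    # Invariant made explicit: by this point the radius must already be numeric.
    assert isinstance(radius, (int, float)), (
        f"radius must be a number, got {type(radius).__name__}"
    )
    return 2 * radius * math.pi


def main(argv):
    # Convert at the boundary, because command-line arguments arrive as strings.
    try:
        radius = float(argv[1])
    except (IndexError, ValueError):
        print("usage: circle.py <radius as a number>")
        return 2
    print(f"The circumference of a circle with radius {radius} is: {cal_circumference(radius)}")
    return 0


main(["circle.py", "10"])
```

Bad user input is handled with a narrow except clause and a message, while a non-numeric radius reaching cal_circumference is treated as a programmer mistake.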
Run the Test Suite
Run the regression test you wrote in Step 1 (it should now pass) and the rest of the suite (none of the previously-passing tests should now fail). A fix that introduces a new bug is a regression — common and embarrassing, but easy to catch if you have the discipline to re-run the suite before you call it done.
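A sketch of that check using the standard unittest module is shown below. It assumes a fixed cal_circumference whose caller converts the string first, and it pairs the regression test with one test that was already passing before the fix; both test names are illustrative.

```python
import io
import math
import unittest


def cal_circumference(radius):
    # Assumed fixed version for this sketch: the caller converts the string first.
    return 2 * radius * math.pi


class CircumferenceTests(unittest.TestCase):
    def test_radius_given_as_command_line_string(self):
        # Regression test for the bug: "10" must behave like the number 10.
        self.assertTrue(math.isclose(cal_circumference(float("10")), 20 * math.pi))

    def test_zero_radius(self):
        # A previously passing test that the fix must not break.
        self.assertEqual(cal_circumference(0), 0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CircumferenceTests)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print("suite green:", result.wasSuccessful(), "| tests run:", result.testsRun)
```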
Document the Fix
In three places:
- A code comment, only when the why is non-obvious. “# Convert from string to float because sys.argv always returns strings” belongs in the code; “# Increment x” does not.
- The git commit message: reference the bug report or ticket. “fix(checkout): convert radius from str to float (closes #4271)” is searchable forever; “fix bug” is not.
- The bug report itself: close it with a short description of the root cause and the fix. This is your project’s institutional memory: the next person to hit a similar symptom will find your write-up.
This last step also makes you more effective when working alongside AI coding agents — they will sometimes “helpfully” undo a non-obvious fix a few commits later if there is no comment explaining why it was non-obvious in the first place.
Keep the Test Forever
The reproduction test you wrote in Step 1 stays in the suite as a permanent regression test. Regression testing — re-running existing tests after code changes to ensure new updates haven’t broken old behavior — is the entire reason a green CI pipeline gives you any confidence at all.
Debugging-Adjacent Git Tools
Two git commands deserve a mention here because they answer questions debuggers can’t:
- git blame <file>: for each line in the file, shows the commit that last changed it, the author, and the timestamp. It answers “When was this line written? What was the change that introduced it?” GitHub renders this beautifully.
- git bisect: when a regression test passes on an old commit and fails on the current commit, git bisect performs a binary search across the intervening commits to identify the specific commit that introduced the bug. With an automated test you can run git bisect start <bad> <good> && git bisect run ./run-tests.sh and walk away while git does the bisection. Hundreds of commits resolve in roughly $\log_2(n)$ steps.
These are covered in depth in the Git chapter; the point here is that they belong in your debugging toolbox, not just your version-control workflow.
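To see why bisection scales, the search can be simulated in a few lines. This is a sketch of the idea, not of git's implementation: the commit ids are invented, and is_bad stands in for running the regression test on a checked-out commit. It assumes the first commit is known good, the last is known bad, and every commit after the culprit is also bad.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search an ordered history for the first commit where is_bad is True.

    Returns the culprit and the number of commits that had to be tested.
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        middle = (good + bad) // 2
        tested += 1
        if is_bad(commits[middle]):
            bad = middle     # the culprit is at or before the middle
        else:
            good = middle    # the culprit is after the middle
    return commits[bad], tested


history = [f"c{n:03d}" for n in range(201)]   # hypothetical ids c000..c200
culprit, tested = first_bad_commit(history, lambda commit: commit >= "c137")
print(culprit, tested)
```

With 200 candidate commits the loop needs at most 8 test runs. When the test is driven by git bisect run, the script's exit code carries the verdict: 0 marks the commit good, 125 asks git to skip it, and any other code from 1 to 127 marks it bad.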
Practice
Want to practice the step-through debugger, breakpoints, and a time-travel debugger on real (broken) code?
- Python Debugging Tutorial — work through several bugs in a sandboxed editor with a full debugger, including time-travel features.
Debugging
Retrieval practice for the four-step debugging process — fault / error / failure vocabulary, reproduction tactics, when to use logs vs the debugger vs rubber-ducking, conditional breakpoints, and the discipline of verifying a fix. Cards span Remember through Evaluate.
Define fault, error, and failure — and explain why keeping them distinct changes how you debug.
Name the four steps of the systematic debugging process, in order.
Why does reproducing the bug come before trying to fix it? What are you trying to capture?
What is regression testing, and how does it relate to the bug-reproduction test you wrote in step 1?
When debugging your own code, when should you reach for search engines / AI tools vs a debugger? Give the rule.
You’re explaining your code to a colleague at their desk. Halfway through line 12 you stop, stare, and say ‘oh.’ You’ve just fixed the bug yourself. Name the phenomenon and the technique.
Compare an assertion (assert x > 0) and an exception (if x <= 0: raise ValueError). When is each appropriate?
Your loop iterates 50,000 times and the bug only appears around iteration 12,000. How do you avoid clicking Step Over 12,000 times?
What is a time-travel debugger, and what does it do that an ordinary debugger cannot?
You write try: do_thing(); except: pass and tell your team ‘this is fault-tolerant.’ Why is this misleading?
A regression test passed two weeks ago and fails today. There are ~200 commits between the two versions and no obvious culprit in the diff. What’s the right move, and why does it scale better than the alternatives?
You just landed a bug fix. The failing reproduction test now passes. What three more things should you do before calling the bug closed?
Your team has a 200-step manual reproduction of an intermittent bug. Before fixing the bug, what should you do to the reproduction itself, and why?
Look at this debugger trace. After input_radius = sys.argv[1], the watch panel shows input_radius = '10' (with quotes). Two steps later, diameter = 2 * radius produces diameter = '1010'. What’s the bug and where is it?
A new colleague says: “I’ve been debugging for 4 hours. I’ve read the function 50 times. I just can’t see what’s wrong.” Diagnose what’s happening and prescribe the next 30 minutes.
Debugging Quiz
Apply, Analyze, and Evaluate-level questions on the four-step debugging process — distinguish fault / error / failure on real scenarios, pick the right tactic (logs vs debugger vs git bisect vs rubber duck) for the situation, and recognize when a fix isn't actually done.
A user reports: “I clicked ‘Submit’ and the page froze with a spinning wheel that never stopped.” You open the code and find that a callback in handlePayment() never resolves its Promise when the payment gateway returns a 5xx response. How would you classify each of these in the fault / error / failure vocabulary?
A user reports that your web app sometimes shows them another user’s data. You cannot reproduce it locally. They send a screenshot but no other details. Once any immediate privacy risk has been contained, what should your first debugging action be?
Your team has just manually reproduced an intermittent payment bug after two days of investigation. Before anyone touches the production code, which of the following are worthwhile next steps? (Select all that apply.)
A teammate has a Python bug they’ve been stuck on for an hour. They walk over to your desk and say “can you look at this?” You read the function — about 30 lines — and notice nothing obviously wrong. Which suggestion is the highest-leverage pedagogical move?
You have a regression: a test that passed on Friday now fails on Monday. There are 87 commits between the two versions and no obvious culprit in the diff. Which tool is the most efficient for finding the commit that introduced the regression?
You see this error in your terminal while setting up a new project: ERROR 3680 (HY000): Failed to create schema directory 'tobias_dev_orders_2026_q1' (errno: 2 - No such file or directory). What is the best thing to copy into a search engine or AI assistant?
You’re chasing a bug that only appears around the 10,000th line item in a specific user’s account. Stepping through the loop one iteration at a time in the debugger would mean clicking Step Over thousands of times. What’s the right move?
A teammate marks a ticket “FIXED” with this commit: a one-line change that makes the previously-failing reproduction pass. They did not run the rest of the test suite. What is the most important risk they have left exposed?
Look at this code:
def transfer(account_from, account_to, amount):
    try:
        account_from.balance -= amount
        account_to.balance += amount
    except:
        pass
The team lead says “This is fault-tolerant — if anything goes wrong, the user doesn’t see a crash.” What’s wrong with this reasoning?
A junior engineer is debugging a deeply nested issue in a backend microservice. They have been at it for three hours with no progress, just rereading the same 200 lines of code. What is the single most likely explanation for why they are stuck?