SQL Tutorial — Print View

1

Exploring Data

Why this matters

In Python or JavaScript, you tell the machine how to solve a problem step by step. In SQL, you describe what you want and the engine figures out the rest. If this imperative → declarative shift feels strange, that is normal — it is the foundational mental shift this entire tutorial is built around.

🎯 You will learn to

Apply SELECT to project a chosen subset of columns from a table
Analyze why an explicit column list is safer than SELECT * in production code

The students table

A students table is set up for you (9 rows):

Column	Type	Constraints
`id`	INTEGER	PRIMARY KEY
`name`	TEXT	NOT NULL
`year`	INTEGER
`major`	TEXT
`gpa`	REAL

SELECT — reading data

SELECT is projection — it chooses which columns of a table to return. (To choose which rows, you’ll use WHERE in Step 3.)

-- Explore the whole table (good for learning, not production)
SELECT * FROM students;

-- Project only the columns the caller needs
SELECT name, gpa FROM students;

SELECT names the columns, FROM names the table, * means “all columns”. Use SELECT * while exploring; in production code, always list the columns you need so schema changes do not silently break callers.
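The schema-change risk can be observed by asking the driver which columns each query returns. This is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The three sample rows, the `column_names` helper, and the later `phone_number` column are assumptions made for illustration; only the column layout comes from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL,"
    " year INTEGER, major TEXT, gpa REAL)"
)
# Illustrative rows only; the years and Blue Bella's gpa are invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [(1, "Mango", 3, "CS", 3.9), (2, "Bananito", 2, "CS", 3.7),
     (3, "Blue Bella", 1, None, 3.1)],
)

def column_names(sql):
    """Return the column names a query produces (hypothetical helper)."""
    return [col[0] for col in conn.execute(sql).description]

star_before = column_names("SELECT * FROM students")
explicit_before = column_names("SELECT name, gpa FROM students")

# Hypothetical schema change made months later.
conn.execute("ALTER TABLE students ADD COLUMN phone_number TEXT")

star_after = column_names("SELECT * FROM students")
explicit_after = column_names("SELECT name, gpa FROM students")

print(star_before)     # ['id', 'name', 'year', 'major', 'gpa']
print(star_after)      # the same list plus 'phone_number'
print(explicit_after)  # ['name', 'gpa']
```

The explicit column list returns the same shape before and after the change, while `SELECT *` quietly grows by one column.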

SQL tables are multisets, not sets: DISTINCT removes duplicates

Set theory says {CS, CS, Math} is the same as {CS, Math}: duplicates collapse. A SQL table (or query result) is not a set. It is a multiset (sometimes called a “bag”): duplicates stick around unless you ask for them to be removed.

-- 9 rows: every student's major, including duplicates
SELECT major FROM students;

-- 4 rows: each distinct major listed once (NULL counts as one of the four)
SELECT DISTINCT major FROM students;

Why does this matter? If you want to count distinct majors, the obvious-looking SELECT COUNT(*) FROM students gives 9 (all rows). The intended query is SELECT COUNT(DISTINCT major) FROM students, which gives 3: the aggregate skips NULL, so it counts CS, Math, and Physics but not the NULL row that SELECT DISTINCT lists. Forgetting DISTINCT is a common SQL gotcha.
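The row counts in this section can be checked with a short sqlite3 sketch. It assumes a trimmed two-column version of the table; the per-major counts (4 CS, 2 Math, 2 Physics, 1 NULL) follow the tutorial, while every name other than Mango, Bananito, and Blue Bella is invented. Note that SELECT DISTINCT lists NULL as its own row, whereas COUNT(DISTINCT major) ignores NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, major TEXT)")
# Names other than Mango, Bananito and Blue Bella are invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Mango", "CS"), ("Bananito", "CS"), ("Avocado", "CS"), ("Figaro", "CS"),
     ("Peachy", "Math"), ("Limon", "Math"), ("Cherry", "Physics"),
     ("Plum", "Physics"), ("Blue Bella", None)],
)

def scalar(sql):
    """Run a query that returns one value and return that value."""
    return conn.execute(sql).fetchone()[0]

all_majors = conn.execute("SELECT major FROM students").fetchall()
distinct_majors = conn.execute("SELECT DISTINCT major FROM students").fetchall()
row_count = scalar("SELECT COUNT(*) FROM students")
distinct_count = scalar("SELECT COUNT(DISTINCT major) FROM students")

print(len(all_majors), len(distinct_majors))  # 9 4
print(row_count, distinct_count)              # 9 3
```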

Task

Open query.sql. It contains two queries; neither raises an error. Query 1 is fine as written, while Query 2 is the wrong tool for its stated job.

Query 1 — exploration. SELECT * is correct here. Predict the row and column count as a comment, then Run.
Query 2 — production. Tighten it to return only name and major.

Investigate (optional). After both pass, append SELECT major, name FROM students. Does the column order in the output change?

Translating to Python or pandas

SELECT is a projection, like Python’s [s['name'] for s in students] or JavaScript’s students.map(s => s.name). The pandas mapping:

SQL	pandas
`SELECT * FROM students`	`students`
`SELECT name, gpa FROM students`	`students[['name','gpa']]`

Two things that differ from a Python list: (1) rows come back in no guaranteed order — there is no “row 0”; if order matters, ask for it with ORDER BY (Step 3). (2) Every row has exactly the declared columns and types — values can be NULL, but a column cannot disappear.
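Both differences can be seen from Python's sqlite3 module. In this sketch (sample names other than Mango, Bananito, and Blue Bella are invented), the fetched result is a Python list, so it has positions, but those positions are only meaningful because the query asked for an order. Every row also carries exactly the selected columns, with a missing major arriving as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, major TEXT)"
)
# Names other than Mango, Bananito and Blue Bella are invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Mango", "CS"), (2, "Bananito", "CS"), (3, "Avocado", "CS"),
     (4, "Figaro", "CS"), (5, "Peachy", "Math"), (6, "Limon", "Math"),
     (7, "Cherry", "Physics"), (8, "Plum", "Physics"), (9, "Blue Bella", None)],
)

# Ask for the order explicitly; without ORDER BY the engine may choose any order.
rows = conn.execute("SELECT id, name, major FROM students ORDER BY name").fetchall()
names = [name for _id, name, _major in rows]

# Every row has the same three fields, even when major is NULL.
widths = {len(row) for row in rows}
blue_bella = next(row for row in rows if row[1] == "Blue Bella")

print(names[0])    # first in alphabetical order, because we asked for it
print(blue_bella)  # (9, 'Blue Bella', None)
```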

How this tutorial works (PRIMM)

Every step uses the same rhythm: Predict what a query will do, Run to compare, Investigate with a tweak, then Modify to fix the bug. The hard skill is diagnosing why a query is wrong — not recalling syntax.

Starter files

query.sql

-- Step 1 — Exploring data with SELECT
-- Two queries below. Predict, then run and verify.

-- Query 1: exploration. Runs correctly.
-- Predict the row count and column count on the next line.
SELECT * FROM students;

-- Query 2: works, but not production-quality.
-- It fetches more columns than the caller needs.
-- Tighten it to return only name and major, and add a one-line
-- comment explaining why specific columns are safer than SELECT *.
SELECT id, name, year, major, gpa FROM students;

Solution

query.sql

-- All columns, all rows
SELECT * FROM students;

-- Only name and major
SELECT name, major FROM students;

SELECT * returns all 9 rows and all 5 columns. SELECT name, major returns the same 9 rows but only 2 columns. SELECT chooses columns, not rows — to choose rows you will need WHERE (Step 3).

Step 1 — Knowledge Check

Min. score: 80%

1. A Python programmer says: “A SQL table is just like a Python list of dictionaries.” What is the most important difference they are missing?

SQL tables are stored on disk; Python lists live in memory
Disk vs. memory is incidental — modern Python streams from disk and SQL can run in-memory. The deeper gap is that SQL rows have no positional access at all.
SQL tables enforce a rigid schema; rows have no guaranteed order
Rows are accessed by 1-based index instead of 0-based
There’s no students[0] in SQL. Rows aren’t indexed by position — you identify them by content (WHERE). 1-based vs 0-based isn’t the issue; no positional access is.
Python lists can hold more data than SQL tables
Capacity isn’t the difference. Both can hold large data. The conceptual gap is that a SQL table is an unordered, schema-enforced collection of rows, not a list.

SQL tables have a rigid schema (fixed columns and types) and no guaranteed row order — there is no students[0]. Storage location is incidental; the indexing distractor is the misconception this step is built to dismantle.

2. Without an ORDER BY clause, in what order does SQL return rows?

In insertion order — the order rows were added
Insertion order isn’t guaranteed. Small tables may happen to come back that way, but the engine is free to return rows in any order it likes.
In primary key order — ascending by id
The primary key uniquely identifies rows; it doesn’t dictate result-set order. SQLite, Postgres, MySQL — none guarantee PK order without ORDER BY.
Alphabetically by the first column
There’s no default-alphabetical rule. Result order depends on storage layout, indexes, and the query planner — not on column types or names.
In no guaranteed order — depends on storage and query plan

A SQL table is an unordered collection of rows, not a list. Without ORDER BY, order depends on storage and query plan, so never rely on it.

3. In Python you write students[0] to get the first student. What is the SQL equivalent?

SELECT * FROM students WHERE id = 0
id = 0 references a value, not a position. Even if a row had id 0, that’s content-based lookup — not the same as Python’s students[0], which asks for position 0.
SELECT * FROM students[0]
Not valid SQL — tables aren’t subscriptable. SQL doesn’t expose row position.
There is no equivalent — SQL rows have no position
SELECT FIRST FROM students
No FIRST keyword in standard SQL. Some dialects have LIMIT 1 paired with ORDER BY, but that’s still content-based — picking ‘first’ requires defining an ordering.

SQL rows have no position — you identify them by their content (WHERE clause), not by an index.

4. Arrange the fragments to write a query that retrieves only the name and gpa columns from the students table. (arrange in order)

Correct order:

SELECT
name, gpa
FROM
students;

Distractors (not used):

WHERE
*

SELECT names the columns, FROM names the table. * is wrong (not just those two); WHERE filters rows, not columns.

5. The students table has 9 rows but only 4 distinct values in major (CS, Math, Physics, and NULL). What does SELECT major FROM students return?

4 rows — duplicates are removed by default
Set theory wants {CS, CS, Math} to collapse to {CS, Math}. SQL is a multiset — duplicates stick around unless you ask for them to be removed with DISTINCT.
9 rows — one for each student, with duplicate majors included
9 rows, but pre-deduplicated by the query optimizer when storage allows
Optimizers don’t silently change the row count. SELECT major always returns one row per source row; deduplication requires explicit DISTINCT.
An error — projecting only major is ambiguous when multiple students share one
No ambiguity — every row contributes its own major value to the projection. The grouping rule (Step 4) only applies once GROUP BY enters the picture.

SQL tables are multisets (bags), not sets. SELECT major FROM students returns one row per source row, including all duplicates. To collapse duplicates, write SELECT DISTINCT major FROM students, which returns 4 rows (NULL is listed once). To count distinct majors, write SELECT COUNT(DISTINCT major) FROM students, which returns 3 because the aggregate skips NULL; COUNT(*) would return 9 (the row count).

6. Two teammates need every column from students for a debugging session. Alice writes:

SELECT * FROM students;

Bob writes:

SELECT id, name, year, major, gpa FROM students;

Both queries return identical results today. Six months later, the team adds a new column phone_number to students and a new column ssn later still. Whose code ages better — and why?

Alice’s — SELECT * automatically adapts to schema changes
That’s the bug, not the feature. Alice’s code starts returning phone_number and later ssn to every caller. If any caller wasn’t expecting them — log files, JSON exports, downstream services — Alice has just created a privacy incident.
Bob’s — explicit columns avoid leaking new PII and column-count breakage
Tie — both forms are equivalent; SQL’s optimizer flattens any difference
Optimizers can’t read your intent. Both queries have identical execution plans today, but they have very different consequences when the schema evolves. Choose explicit columns at write time.
Alice’s — fewer characters means smaller network payloads and faster execution
Network payload differences are negligible compared to the maintenance and security cost. Many production codebases ban SELECT * from non-exploratory code precisely because of the schema-evolution risk.

SELECT * is fine for exploration and dangerous in production. When the schema changes, Alice’s downstream code starts receiving columns it never asked for — most commonly leaking PII (ssn, phone_number) into logs/exports, but also breaking code that expects a specific column count. Bob’s explicit list is self-documenting (it says exactly what the caller depends on) and change-isolating (a new column doesn’t silently propagate). The trade-off — three extra seconds of typing — is what makes seasoned engineers refuse SELECT * in code review every time.

2

Joining Tables

Why this matters

Real data is rarely in one table, so most production queries you write will stitch information together across two or more. JOIN is the workhorse, but it has a famous trap: forget the ON predicate and you silently return every row paired with every row.

🎯 You will learn to

Apply JOIN ... ON to combine rows from two tables on a shared key
Analyze a multi-table query for the Cartesian-product trap (missing ON)

Two tables, one question

The students table has a major column, but each major’s department and advisor live in a separate majors table:

`majors.code`	`majors.department`	`majors.advisor`
CS	Engineering	Dr. Hopper
Math	Sciences	Dr. Noether
Physics	Sciences	Dr. Curie
Bio	Life Sciences	Dr. McClintock

To answer “who is Mango’s advisor?”, you need both tables. JOIN combines them on a shared column — here, students.major = majors.code.

JOIN syntax

JOIN is relational join — it combines rows from two tables that share a key, producing one output row per matching pair.

SELECT s.name, m.department, m.advisor
FROM students AS s
JOIN majors AS m ON s.major = m.code;

Aliases (AS s, AS m): optional but strongly recommended as soon as two tables are in play, and required for self-joins.
ON — the predicate that says how the two tables line up. Evaluated per pair.
Match-only: only row pairs for which ON evaluates to TRUE survive; the rest are dropped. Blue Bella (NULL major) and Bio (no students) both drop out.
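The match-only behavior can be checked with sqlite3. This sketch assumes a trimmed students table (name and major only) and treats majors.code as a primary key, which the tutorial does not state. Student names other than Mango, Bananito, and Blue Bella are invented; the majors rows are copied from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (name TEXT NOT NULL, major TEXT);
    CREATE TABLE majors (code TEXT PRIMARY KEY, department TEXT, advisor TEXT);
""")
# Names other than Mango, Bananito and Blue Bella are invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Mango", "CS"), ("Bananito", "CS"), ("Avocado", "CS"), ("Figaro", "CS"),
     ("Peachy", "Math"), ("Limon", "Math"), ("Cherry", "Physics"),
     ("Plum", "Physics"), ("Blue Bella", None)],
)
conn.executemany(
    "INSERT INTO majors VALUES (?, ?, ?)",
    [("CS", "Engineering", "Dr. Hopper"), ("Math", "Sciences", "Dr. Noether"),
     ("Physics", "Sciences", "Dr. Curie"),
     ("Bio", "Life Sciences", "Dr. McClintock")],
)

joined = conn.execute(
    "SELECT s.name, m.department, m.advisor "
    "FROM students AS s JOIN majors AS m ON s.major = m.code"
).fetchall()

student_names = {row[0] for row in joined}
advisors = {row[2] for row in joined}

print(len(joined))                        # 8
print("Blue Bella" in student_names)      # False: NULL major matches nothing
print("Dr. McClintock" in advisors)       # False: Bio has no students
```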

The Cartesian-product trap

Without ON, the database pairs every row with every row. 9 students × 4 majors = 36 rows of nonsense.

-- WRONG: no ON clause — produces 9 × 4 = 36 rows
SELECT s.name, m.advisor FROM students s, majors m;

The old comma-join syntax (FROM a, b WHERE a.x = b.y) makes this mistake easy: delete the WHERE during a refactor and you silently ship a Cartesian product. Use explicit JOIN ... ON ... so the relationship and the filter are separate.
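The three forms can be compared side by side in sqlite3. As a sketch under the same assumptions as the rest of this step (trimmed students table, invented names apart from Mango, Bananito, and Blue Bella), the comma join with no WHERE yields 36 rows, while the comma join with WHERE and the explicit JOIN ... ON return the same 8 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (name TEXT NOT NULL, major TEXT);
    CREATE TABLE majors (code TEXT PRIMARY KEY, department TEXT, advisor TEXT);
""")
# Names other than Mango, Bananito and Blue Bella are invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Mango", "CS"), ("Bananito", "CS"), ("Avocado", "CS"), ("Figaro", "CS"),
     ("Peachy", "Math"), ("Limon", "Math"), ("Cherry", "Physics"),
     ("Plum", "Physics"), ("Blue Bella", None)],
)
conn.executemany(
    "INSERT INTO majors VALUES (?, ?, ?)",
    [("CS", "Engineering", "Dr. Hopper"), ("Math", "Sciences", "Dr. Noether"),
     ("Physics", "Sciences", "Dr. Curie"),
     ("Bio", "Life Sciences", "Dr. McClintock")],
)

# Comma join with the WHERE deleted: every student paired with every major.
cross = conn.execute(
    "SELECT s.name, m.advisor FROM students s, majors m"
).fetchall()

# Comma join with its WHERE intact.
comma_where = conn.execute(
    "SELECT s.name, m.advisor FROM students s, majors m WHERE s.major = m.code"
).fetchall()

# Explicit join: the relationship lives in ON and cannot be dropped by accident.
explicit = conn.execute(
    "SELECT s.name, m.advisor FROM students AS s "
    "JOIN majors AS m ON s.major = m.code"
).fetchall()

print(len(cross), len(comma_where), len(explicit))  # 36 8 8
```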

Task

joins.sql has two queries — one silently broken, one that runs but leaks too much data.

Query 1 — the Cartesian trap. Returns 36 rows instead of ~8. Convert it to explicit JOIN ... ON ....
Query 2 — over-selection. Tighten it to project just the student name and their advisor, using aliases.

Predict before you Run: how many rows come back from the corrected Query 1? Remember Blue Bella (NULL major) and Bio (no students).

Investigate (optional). After Query 1 passes, is Blue Bella in the result? Why not?

🎓 What about LEFT JOIN?

LEFT JOIN keeps unmatched rows from the left side — useful when you want to include Blue Bella even without a major. Not needed for this step; flagging the keyword so it rings a bell later.
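As a preview, this sqlite3 sketch contrasts the inner join with LEFT JOIN on the same assumed sample data (trimmed tables, invented names apart from Mango, Bananito, and Blue Bella). The left join keeps Blue Bella and fills the advisor column with NULL, which arrives in Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (name TEXT NOT NULL, major TEXT);
    CREATE TABLE majors (code TEXT PRIMARY KEY, advisor TEXT);
""")
# Names other than Mango, Bananito and Blue Bella are invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Mango", "CS"), ("Bananito", "CS"), ("Avocado", "CS"), ("Figaro", "CS"),
     ("Peachy", "Math"), ("Limon", "Math"), ("Cherry", "Physics"),
     ("Plum", "Physics"), ("Blue Bella", None)],
)
conn.executemany(
    "INSERT INTO majors VALUES (?, ?)",
    [("CS", "Dr. Hopper"), ("Math", "Dr. Noether"),
     ("Physics", "Dr. Curie"), ("Bio", "Dr. McClintock")],
)

inner_rows = conn.execute(
    "SELECT s.name, m.advisor FROM students AS s "
    "JOIN majors AS m ON s.major = m.code"
).fetchall()

left_rows = conn.execute(
    "SELECT s.name, m.advisor FROM students AS s "
    "LEFT JOIN majors AS m ON s.major = m.code"
).fetchall()

print(len(inner_rows), len(left_rows))    # 8 9
print(("Blue Bella", None) in left_rows)  # True
```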

Starter files

joins.sql

-- Step 2 — Joining students to majors
-- Predict row counts before you run.

-- Query 1: intended "student name + advisor" — returns 36 rows.
-- Identify the bug, then rewrite as an explicit JOIN ... ON.
SELECT s.name, m.advisor
FROM students s, majors m;

-- Query 2: works but over-selects.
-- Tighten to project only student name and advisor, using aliases.
SELECT *
FROM students
JOIN majors ON students.major = majors.code;

Solution

joins.sql

-- Query 1: student name + advisor, explicit JOIN
SELECT s.name, m.advisor
FROM students AS s
JOIN majors AS m ON s.major = m.code;

-- Query 2: same shape, tightened to two columns with aliases
SELECT s.name, m.advisor
FROM students AS s
JOIN majors AS m ON s.major = m.code;

Both queries return 8 rows (all students except Blue Bella, whose major is NULL and has no match). The explicit JOIN ... ON form makes the relationship visible and survives refactors — the old comma-join + WHERE form silently becomes a Cartesian product if the WHERE is removed.

Step 2 — Knowledge Check

Min. score: 80%

1. You write SELECT * FROM students s, majors m; with no WHERE or ON. What do you get?

An error — SQL requires an ON clause
SQL is happy to silently produce a Cartesian product — the comma syntax is a valid join (a CROSS JOIN), just rarely the one you wanted. The DBMS has no way to know your intent, so it gives you what you literally asked for.
Only rows where students.major = majors.code
Plain SQL never infers a join condition; you must spell out ON s.major = m.code (or WHERE s.major = m.code). ORMs that auto-join on declared relationships add this convenience on top — but the SQL standard itself does no such inference.
Every student paired with every major (36 rows)
An empty result set
Cross-product is the algebraic default, not an empty set. An empty result would require all rows to fail some predicate — but here there is no predicate to fail.

With no join predicate, SQL pairs every row from students with every row from majors: 9 × 4 = 36 rows. This is one of the most common accidental bugs in multi-table queries.

2. A JOIN ... ON s.major = m.code between students (9 rows) and majors (4 rows) returns 8 rows. Why not 9?

JOIN always drops one row
JOINs don’t have a fixed-cost arithmetic; they keep every pair where ON is TRUE. The number of rows lost depends on the data, not on a constant rule.
One student has a NULL major, which matches nothing
One major has no students — and that row drops
An inner JOIN is symmetric — a major with zero students does drop, but that doesn’t change the student count on the left side. The 9 students you started with stay unless their predicate fails.
Aliases hide one row from the output
Aliases (AS s, AS m) are syntactic shorthand only — they rename tables for the rest of the query, but never filter rows. They do not affect cardinality.

JOIN keeps only pairs where the ON predicate is TRUE for both sides. Blue Bella has major = NULL; NULL = anything is UNKNOWN, so she has no match and drops.

3. Arrange the fragments to list each student’s name and their department. (arrange in order)

Correct order:

SELECT s.name, m.department
FROM students AS s
JOIN majors AS m
ON s.major = m.code;

Distractors (not used):

FROM students, majors
WHERE s.major = m.code

The comma form + WHERE is the old trap. Use explicit JOIN ... ON so the relationship is visible and survives refactors.

4. Why are table aliases (AS s, AS m) strongly recommended even in simple two-table joins?

SQL refuses to run without them
Aliases are optional in SQL — the query runs fine without them when no column names collide. They become required when you self-join a table to itself.
They disambiguate same-named columns across tables
They improve query performance
Aliases are pure syntax; the query planner ignores them. Performance comes from indexes, joins, and predicates.
They prevent NULL values in the result
NULLs come from the data (or from outer joins), not from how you name tables. Aliases don’t filter.

Aliases disambiguate same-named columns (both students and majors could have name) and keep queries readable. They are also required for self-joins. They do not affect performance or NULL handling.

5. For an INNER JOIN, do these two queries return the same rows?

Query A:

SELECT * FROM students s JOIN majors m ON s.major = m.code AND s.year > 1;

Query B:

SELECT * FROM students s JOIN majors m ON s.major = m.code WHERE s.year > 1;

Yes — the result is identical for an INNER JOIN
No — predicate placement always changes the result
For an outer join, placement absolutely matters. For inner joins specifically, both forms produce the same row set — the join keeps a row only when both predicates are TRUE.
Yes, but only if year has no NULLs
NULL handling is identical here — NULL > 1 is UNKNOWN, and both queries drop the row. The 3VL behavior doesn’t differ between ON-clause filters and WHERE filters.
No — Query A pre-filters and is therefore always faster
Modern optimizers reorder predicates regardless of where you wrote them. A query planner sees both forms and produces the same plan.

For INNER JOINs, ON ... AND ... and WHERE ... are equivalent — both predicates must be TRUE for a row to survive. For LEFT/RIGHT OUTER JOINs, placement matters: an ON filter applies during matching (unmatched left rows still appear with NULLs), while a WHERE filter applies after the join (an unmatched left row that fails the WHERE is dropped). This subtlety is the #1 cause of “missing rows” in outer-join queries.
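The placement rule can be tested directly. In this sqlite3 sketch the year values are invented (the tutorial does not list them), as are all names except Mango, Bananito, and Blue Bella, so the exact counts belong to this sample only. What carries over is the pattern: the two inner-join forms agree, while the two left-join forms do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (name TEXT NOT NULL, year INTEGER, major TEXT);
    CREATE TABLE majors (code TEXT PRIMARY KEY, advisor TEXT);
""")
# Years are invented for illustration, as are most names.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Mango", 3, "CS"), ("Bananito", 2, "CS"), ("Avocado", 1, "CS"),
     ("Figaro", 4, "CS"), ("Peachy", 2, "Math"), ("Limon", 1, "Math"),
     ("Cherry", 3, "Physics"), ("Plum", 2, "Physics"), ("Blue Bella", 1, None)],
)
conn.executemany(
    "INSERT INTO majors VALUES (?, ?)",
    [("CS", "Dr. Hopper"), ("Math", "Dr. Noether"),
     ("Physics", "Dr. Curie"), ("Bio", "Dr. McClintock")],
)

def run(join_kind, tail):
    """Run the same projection with a given join keyword and predicate tail."""
    sql = (
        "SELECT s.name, m.advisor FROM students AS s "
        f"{join_kind} majors AS m ON s.major = m.code {tail}"
    )
    return conn.execute(sql).fetchall()

inner_on = run("JOIN", "AND s.year > 1")
inner_where = run("JOIN", "WHERE s.year > 1")
left_on = run("LEFT JOIN", "AND s.year > 1")
left_where = run("LEFT JOIN", "WHERE s.year > 1")

print(len(inner_on), len(inner_where))  # 6 6: same rows
print(len(left_on), len(left_where))    # 9 6: placement changes the result
```

With the filter in ON, every student still appears and the unmatched ones carry a NULL advisor; with the filter in WHERE, those rows are removed after the join.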

3

Filtering & Sorting

Why this matters

Almost every real query needs to narrow down — give me only the rows that match these conditions, sorted this way. WHERE is your filter and ORDER BY is your sort, but both interact with NULL in ways that quietly drop rows you expected to see. NULL is genuinely difficult — experienced developers still get tripped up by it, so plan to spend a little extra time here.

🎯 You will learn to

Apply WHERE predicates and ORDER BY to return exactly the rows you want, in the order you want
Analyze how three-valued logic (NULL = UNKNOWN) silently drops rows from != and = filters
Evaluate a failing filter by classifying the bug as syntax, semantic, or logical

WHERE predicates

WHERE is row selection — it keeps the rows where a predicate evaluates to TRUE, and discards the rest.

WHERE gpa >= 3.5                        -- numeric comparison
WHERE major = 'CS'                      -- string equality (single quotes!)
WHERE major = 'CS' AND gpa > 3.7        -- compound predicate
WHERE major IN ('CS', 'Math')           -- set membership
WHERE name LIKE 'A%'                    -- pattern match (% = any characters)

Two syntax traps for Python/JS developers:

Single quotes for strings. 'CS', not "CS". Double quotes quote identifiers in SQL.
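Single = for equality. =, not ==. Standard SQL has no == operator: PostgreSQL and MySQL reject WHERE major == 'CS', while SQLite accepts == as a nonstandard synonym for =, so do not rely on it.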
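Each predicate above can be tried against sample data. This sqlite3 sketch counts the matching rows for each one; the `count_where` helper is hypothetical, and the gpa values and most names are invented (only Mango at 3.9 and Bananito at 3.7 come from the tutorial), so the counts describe this sample rather than the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, major TEXT, gpa REAL)")
# Only Mango (3.9), Bananito (3.7) and Blue Bella's NULL major are from the
# tutorial; all other names and gpa values are invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Mango", "CS", 3.9), ("Bananito", "CS", 3.7), ("Avocado", "CS", 3.5),
     ("Figaro", "CS", 3.2), ("Peachy", "Math", 3.4), ("Limon", "Math", 3.0),
     ("Cherry", "Physics", 3.3), ("Plum", "Physics", 2.9),
     ("Blue Bella", None, 3.1)],
)

def count_where(predicate):
    """Count rows matching a fixed, trusted predicate string."""
    sql = f"SELECT COUNT(*) FROM students WHERE {predicate}"
    return conn.execute(sql).fetchone()[0]

predicates = [
    "gpa >= 3.5",
    "major = 'CS'",
    "major = 'CS' AND gpa > 3.7",
    "major IN ('CS', 'Math')",
    "name LIKE 'A%'",
]
counts = {p: count_where(p) for p in predicates}

for predicate, n in counts.items():
    print(f"{predicate:32} -> {n} row(s)")
```

Building SQL with an f-string is acceptable here only because the predicates are fixed literals; user input should go through `?` parameters instead.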

NULL and Three-Valued Logic — the biggest trap

SQL booleans are three-valued: TRUE, FALSE, UNKNOWN. NULL means unknown value — comparing anything to NULL (even another NULL) yields UNKNOWN, and WHERE discards UNKNOWN rows.

WHERE major != 'CS'                       -- WRONG: silently drops NULL-major rows
WHERE major != 'CS' OR major IS NULL      -- RIGHT: handle NULL explicitly

WHERE major IS NULL       -- value is unknown
WHERE major IS NOT NULL   -- value is known

In Python, None != "Active" is True. In SQL, NULL != 'Active' is UNKNOWN — and the row is dropped.
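Three-valued logic is visible from Python because sqlite3 hands UNKNOWN back as None. The sketch below evaluates a few bare comparisons and then runs the naive and NULL-safe filters; names other than Mango, Bananito, and Blue Bella are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, major TEXT)")
# Names other than Mango, Bananito and Blue Bella are invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Mango", "CS"), ("Bananito", "CS"), ("Avocado", "CS"), ("Figaro", "CS"),
     ("Peachy", "Math"), ("Limon", "Math"), ("Cherry", "Physics"),
     ("Plum", "Physics"), ("Blue Bella", None)],
)

def scalar(sql):
    """Evaluate a single SQL expression and return its value."""
    return conn.execute(sql).fetchone()[0]

null_ne_text = scalar("SELECT NULL != 'CS'")  # UNKNOWN, returned as None
null_eq_null = scalar("SELECT NULL = NULL")   # UNKNOWN as well
null_is_null = scalar("SELECT NULL IS NULL")  # 1, i.e. TRUE

naive = [r[0] for r in conn.execute(
    "SELECT name FROM students WHERE major != 'CS'")]
null_safe = [r[0] for r in conn.execute(
    "SELECT name FROM students WHERE major != 'CS' OR major IS NULL")]

python_answer = (None != "Active")  # plain Python: True

print(null_ne_text, null_eq_null, null_is_null)  # None None 1
print(len(naive), len(null_safe))                # 4 5
```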

ORDER BY

ORDER BY is sorting — it imposes order on the result for presentation. It’s also the only guarantee of row order; without it, order is whatever the engine finds convenient.

SELECT name, gpa FROM students
ORDER BY gpa DESC;   -- highest first (ASC is the default)
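A quick way to confirm the sort is to compare the query result with Python's own sorting. In this sqlite3 sketch, only Mango (3.9) and Bananito (3.7) are taken from the tutorial; the other names and gpa values are invented. The second sort key, name, is an addition that makes the order fully deterministic if two students ever share a gpa.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, gpa REAL)")
# Only Mango (3.9) and Bananito (3.7) are from the tutorial; the rest is invented.
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Plum", 2.9), ("Mango", 3.9), ("Limon", 3.0), ("Bananito", 3.7),
     ("Avocado", 3.5), ("Figaro", 3.2), ("Peachy", 3.4), ("Cherry", 3.3),
     ("Blue Bella", 3.1)],
)

ranked = conn.execute(
    "SELECT name, gpa FROM students ORDER BY gpa DESC, name ASC"
).fetchall()
gpas = [gpa for _name, gpa in ranked]

print(ranked[0])  # ('Mango', 3.9)
print(ranked[1])  # ('Bananito', 3.7)
```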

Task

filter.sql has three broken queries. Each fails differently. For each:

Predict what it actually returns — or what error it raises.
Classify the bug: syntax (grammar), semantic (valid SQL, wrong reference), or logical (runs fine, wrong rows). Write the classification as a comment.
Fix the query.

Query 3 is deceptively easy — the bug only appears when you reason about NULL carefully.

Investigate. After Query 3 passes, comment out your IS NULL branch and re-run. Where did Blue Bella go? This one-line toggle makes the NULL trap concrete.

Translating to JavaScript or pandas

JavaScript	pandas	SQL
`arr.filter(x => x.gpa > 3.5)`	`df[df['gpa'] > 3.5]`	`WHERE gpa > 3.5`
`arr.map(x => x.name)`	`df[['name']]`	`SELECT name`
`arr.sort((a,b) => b.gpa - a.gpa)`	`df.sort_values('gpa', ascending=False)`	`ORDER BY gpa DESC`

A SELECT is roughly .filter().map().sort() chained — but declarative.

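pandas differs from SQL here: with its default dtypes, df[df['col'] != 'CS'] keeps rows where col is missing (NaN or None), because the comparison evaluates to True for missing values. To mirror SQL, filter missing values explicitly. The idiomatic test is df['col'].isna() ≡ col IS NULL.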

Starter files

filter.sql

-- Step 3 — Filtering rows with WHERE and sorting with ORDER BY
-- Three broken queries. Classify each as syntax / semantic / logical, then fix.

-- Query 1. Intended: CS students with GPA above 3.5, highest GPA first.
-- Runs without error, but returns the wrong rows in the wrong order.
-- Two independent bugs hide here. Find both, classify them, then fix.
SELECT * FROM students
WHERE major = 'CS' OR gpa > 3.5
ORDER BY gpa ASC;

-- Query 2. Intended: students in year 1 or year 2.
-- Runs without error. Why does it return zero rows?
SELECT * FROM students
WHERE year = 1 AND year = 2;

-- Query 3. Intended: students whose major is NOT 'CS'.
-- Runs without error, but quietly drops a student. Who, and why?
-- Hint: one student has a NULL major. Reason about 3VL before fixing.
SELECT * FROM students
WHERE major != 'CS';

Solution

filter.sql

-- Query 1: CS students with GPA > 3.5, highest GPA first
SELECT * FROM students
WHERE major = 'CS' AND gpa > 3.5
ORDER BY gpa DESC;

-- Query 2: Students in year 1 or year 2
SELECT * FROM students
WHERE year = 1 OR year = 2;

-- Query 3: Students NOT in CS (NULL-safe)
SELECT * FROM students
WHERE major != 'CS' OR major IS NULL;

Query 1 returns Mango (3.9) and Bananito (3.7). AND combines both conditions; ORDER BY gpa DESC sorts highest first.

Query 2: the “natural-language AND” trap. In English, “in year 1 and year 2” means “either”. In SQL, AND requires both conditions to be true in the same row, which is impossible because a row’s year holds a single value. Whenever a filter returns zero rows against a plausible question, check this trap first. You could also write WHERE year IN (1, 2).

Query 3 is the NULL trap. Without OR major IS NULL, Blue Bella (whose major is NULL) is silently dropped. NULL != 'CS' evaluates to UNKNOWN, not TRUE, so WHERE discards her.
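The fixes for Queries 1 and 2 can be verified with sqlite3. In this sketch the year values, the gpa values other than 3.9 and 3.7, and most names are invented, so the row count for Query 2 applies to this sample only. The Query 1 result matches the tutorial, and the AND version of Query 2 returns nothing on any data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (name TEXT NOT NULL, year INTEGER, major TEXT, gpa REAL)"
)
# Years and most gpa values are invented for illustration.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [("Mango", 3, "CS", 3.9), ("Bananito", 2, "CS", 3.7),
     ("Avocado", 1, "CS", 3.5), ("Figaro", 4, "CS", 3.2),
     ("Peachy", 2, "Math", 3.4), ("Limon", 1, "Math", 3.0),
     ("Cherry", 3, "Physics", 3.3), ("Plum", 2, "Physics", 2.9),
     ("Blue Bella", 1, None, 3.1)],
)

def names(where):
    """Return the names matching a fixed WHERE clause, sorted for comparison."""
    sql = f"SELECT name FROM students WHERE {where}"
    return sorted(row[0] for row in conn.execute(sql))

# Query 1, fixed: AND narrows, DESC puts the highest gpa first.
cs_top = conn.execute(
    "SELECT name, gpa FROM students "
    "WHERE major = 'CS' AND gpa > 3.5 ORDER BY gpa DESC"
).fetchall()

# Query 2: the broken AND form versus the two working forms.
with_and = names("year = 1 AND year = 2")
with_or = names("year = 1 OR year = 2")
with_in = names("year IN (1, 2)")

print(cs_top)                       # [('Mango', 3.9), ('Bananito', 3.7)]
print(with_and)                     # []
print(with_or == with_in)           # True
```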

Step 3 — Knowledge Check

Min. score: 80%

1. A student writes:

WHERE status != 'Active'

The status column has some NULL values. What happens to those rows?

They are included — NULL is not equal to ‘Active’
Intuitively true, but SQL doesn’t agree. NULL means ‘unknown’ — comparing it to anything (even with !=) yields UNKNOWN, and WHERE keeps only TRUE rows.
They are excluded — NULL != 'Active' is UNKNOWN, not TRUE
They cause a runtime error
No error. WHERE silently drops UNKNOWN-evaluating rows; that silent drop is exactly the bug to watch for.
They are included only if NULL is listed as a valid status
There’s no ‘list of valid statuses’ anywhere. NULL handling is governed by 3VL, not by a list.

In SQL’s Three-Valued Logic, NULL != 'Active' evaluates to UNKNOWN — not TRUE. WHERE only keeps rows where the predicate is TRUE. UNKNOWN rows are silently discarded. To include NULL rows, write: WHERE status != 'Active' OR status IS NULL.

2. Which expression correctly tests whether a column value is missing (NULL)?

WHERE credits = NULL
= NULL evaluates to UNKNOWN — never TRUE. SQL has a special operator for this case precisely because = doesn’t work on NULL.
WHERE credits == NULL
== is not standard SQL; it borrows from C-family languages, and most engines reject it (SQLite tolerates it). SQL uses single = for equality, and even then you need IS NULL for NULL tests.
WHERE credits IS NULL
WHERE ISNULL(credits) = 1
Some dialects offer ISNULL() (and Oracle has NVL()), but they’re non-standard extensions. IS NULL is the portable answer.

NULL = NULL evaluates to UNKNOWN, not TRUE, so WHERE credits = NULL never matches any row. The only portable, standard-SQL way to test for NULL is IS NULL. == NULL is not standard SQL (the equality operator is =, not ==), and even where an engine accepts ==, the comparison still yields UNKNOWN. ISNULL() is a non-standard extension.

3. In SQL’s logical execution order, the SELECT clause is written first. When is it actually evaluated?

First — it defines what data to retrieve
Reading order is misleading. SELECT is written first but logically runs after the row-filtering and grouping clauses.
Second — immediately after FROM determines the source table
FROM identifies the source, but WHERE filters before SELECT projects. That ordering is why WHERE cannot rely on an alias defined in SELECT.
Fifth — after FROM, WHERE, GROUP BY, and HAVING
Last — after ORDER BY and LIMIT
ORDER BY runs after SELECT (so it can refer to SELECT’s aliases). LIMIT is the absolute last step.

SELECT is evaluated fifth: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT. This is why an alias defined in SELECT cannot be used in WHERE — the alias doesn’t exist until after WHERE has already filtered the rows.

4. What does SELECT name, gpa FROM students do?

It filters the table to only include rows that have a name and a gpa
That’s row filtering — the job of WHERE. SELECT picks columns, not rows.
It returns all rows but only the name and gpa columns
It creates new columns called name and gpa
Those columns must already exist on the table. SELECT doesn’t create columns; it projects existing ones (or computes expressions over them).
It sorts the table by name and then by gpa
Sorting is ORDER BY name, gpa. SELECT alone has no implicit ordering — that’s why you saw the first quiz warn about row order.

SELECT projects (selects) columns, not rows. SELECT name, gpa FROM students returns all rows but shows only the name and gpa columns. To filter rows, you need a WHERE clause. This is a common point of confusion — SELECT chooses columns, WHERE chooses rows.

5. Arrange the clauses to find CS students with GPA above 3.5, sorted highest first. (arrange in order)

Correct order:

SELECT name, gpa
FROM students
WHERE major = 'CS' AND gpa > 3.5
ORDER BY gpa DESC;

Distractors (not used):

HAVING gpa > 3.5
SORT BY gpa DESC;

The logical execution order is FROM -> WHERE -> SELECT -> ORDER BY. HAVING filters groups (after GROUP BY), not individual rows. SORT BY is not valid SQL — use ORDER BY. The distractor HAVING is a common confusion point; it only applies after GROUP BY.

6. A student writes two queries. Which one is more likely to produce unexpected results in production, and why?

Query A:

SELECT * FROM users WHERE role != 'admin'

Query B:

SELECT id, name FROM users WHERE role != 'admin' OR role IS NULL

Query A — silently drops NULL roles, plus uses SELECT *
Query B — the extra OR clause makes it slower
The extra OR is constant cost — and on indexed columns, both queries optimize similarly. Correctness beats microseconds anyway.
Both are equally safe in production
They differ on both projection (* vs explicit columns) and on how NULL roles are treated. They are not equally safe.
Query A — SELECT * causes a syntax error
SELECT * is valid SQL — that’s part of the trap. It compiles fine; it’s just brittle as columns get added/dropped.

Query A has two production risks: (1) SELECT * breaks if columns are added or removed, and (2) != 'admin' silently drops rows where role is NULL. Query B is safer: it names specific columns and explicitly handles NULL. This combines two lessons — explicit projection and NULL-safe predicates.

4

Aggregating Data

Why this matters

Every time Spotify shows “Your top artists this month” or Instagram shows “4 people liked this” — that is a GROUP BY + COUNT running against millions of rows in milliseconds. Aggregation is how data becomes insight; the trick is keeping straight what gets filtered before grouping (rows) versus after (groups).

🎯 You will learn to

Apply GROUP BY with aggregate functions (COUNT, AVG, SUM, MIN, MAX) to summarize rows per category
Analyze SQL’s logical execution order to predict why aggregates belong in HAVING, not WHERE
Evaluate a GROUP BY query for the grouping rule (every selected column is grouped or aggregated)

Logical execution order

SQL clauses do not run top-to-bottom as written:

FROM      — which tables to read
WHERE     — filter rows
GROUP BY  — group rows
HAVING    — filter groups
SELECT    — project columns  ← written first, evaluated fifth
ORDER BY  — sort
LIMIT     — truncate

This explains (a) why standard SQL does not let you use a SELECT alias inside WHERE: the alias does not exist yet (SQLite is lenient and resolves such aliases anyway, so do not rely on it), and (b) why WHERE cannot contain an aggregate: the groups do not exist yet. Keep this list visible while you work the task; even experienced developers reference it.
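One way to internalize the order is to replay it by hand. The sketch below runs one query in sqlite3 and then reproduces it in plain Python, one logical stage per line, in the order listed above. The query itself, the gpa values other than 3.9 and 3.7, and most names are invented for illustration; a real engine is free to execute the work differently as long as the result is the same.

```python
import sqlite3
from collections import defaultdict

# (name, major, gpa); only Mango, Bananito and Blue Bella's NULL major are
# from the tutorial, the remaining values are invented.
STUDENTS = [
    ("Mango", "CS", 3.9), ("Bananito", "CS", 3.7), ("Avocado", "CS", 3.5),
    ("Figaro", "CS", 3.2), ("Peachy", "Math", 3.4), ("Limon", "Math", 3.0),
    ("Cherry", "Physics", 3.3), ("Plum", "Physics", 2.9),
    ("Blue Bella", None, 3.1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, major TEXT, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", STUDENTS)

sql_rows = conn.execute(
    "SELECT major, COUNT(*) AS n FROM students "
    "WHERE gpa >= 3.0 GROUP BY major HAVING COUNT(*) >= 2 "
    "ORDER BY n DESC LIMIT 1"
).fetchall()

source = list(STUDENTS)                                   # FROM
kept = [row for row in source if row[2] >= 3.0]           # WHERE
groups = defaultdict(list)
for _name, major, gpa in kept:                            # GROUP BY
    groups[major].append(gpa)
large = {m: g for m, g in groups.items() if len(g) >= 2}  # HAVING
projected = [(m, len(g)) for m, g in large.items()]       # SELECT
ordered = sorted(projected, key=lambda row: -row[1])      # ORDER BY
manual_rows = ordered[:1]                                 # LIMIT

print(sql_rows, manual_rows)  # [('CS', 4)] [('CS', 4)]
```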

GROUP BY syntax

GROUP BY is aggregation — it partitions rows into buckets per category, then applies summary functions (COUNT, AVG, SUM, MIN, MAX) to each bucket. The result has one row per bucket, not per original row.

SELECT   major,
         COUNT(*) AS student_count,
         AVG(gpa)  AS avg_gpa
FROM     students
GROUP BY major;
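Running this query against sample data shows the one-row-per-bucket shape. In the sqlite3 sketch below, the per-major counts and the CS average of 3.575 match the tutorial; the individual gpa values other than 3.9 and 3.7, and most names, are invented so that they add up to that average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, major TEXT, gpa REAL)")
# gpa values other than 3.9 and 3.7 are invented; the CS values were chosen
# so that the CS average equals the tutorial's 3.575.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Mango", "CS", 3.9), ("Bananito", "CS", 3.7), ("Avocado", "CS", 3.5),
     ("Figaro", "CS", 3.2), ("Peachy", "Math", 3.4), ("Limon", "Math", 3.0),
     ("Cherry", "Physics", 3.3), ("Plum", "Physics", 2.9),
     ("Blue Bella", None, 3.1)],
)

rows = conn.execute(
    "SELECT major, COUNT(*) AS student_count, AVG(gpa) AS avg_gpa "
    "FROM students GROUP BY major"
).fetchall()

# Index the result by major; the NULL group arrives under the key None.
stats = {major: (count, avg) for major, count, avg in rows}

for major, (count, avg) in stats.items():
    print(major, count, round(avg, 3))
```

Nine input rows become four output rows, and the students with a NULL major form their own group.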

The GROUP BY rule

Every column in SELECT must either (1) appear in GROUP BY, or (2) be wrapped in an aggregate (COUNT, AVG, SUM, MIN, MAX).

-- WRONG: name is neither grouped nor aggregated
SELECT major, name, COUNT(*) FROM students GROUP BY major;

-- RIGHT
SELECT major, COUNT(*) AS n FROM students GROUP BY major;

SQLite runs the WRONG query anyway and silently picks an arbitrary name for each group; PostgreSQL, and MySQL with ONLY_FULL_GROUP_BY enabled, reject it. Treat SQLite’s quiet acceptance as a trap, not a blessing.
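SQLite's leniency is easy to observe. In this sqlite3 sketch (names other than Mango, Bananito, and Blue Bella are invented), the query that breaks the grouping rule runs without error and returns one name per major. Which name comes back is not specified, so the code only checks that the name belongs to the group.

```python
import sqlite3

# (name, major); names other than Mango, Bananito and Blue Bella are invented.
STUDENTS = [
    ("Mango", "CS"), ("Bananito", "CS"), ("Avocado", "CS"), ("Figaro", "CS"),
    ("Peachy", "Math"), ("Limon", "Math"), ("Cherry", "Physics"),
    ("Plum", "Physics"), ("Blue Bella", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, major TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)", STUDENTS)

# Breaks the grouping rule: name is neither grouped nor aggregated.
lenient = conn.execute(
    "SELECT major, name, COUNT(*) FROM students GROUP BY major"
).fetchall()

# Follows the rule: one well-defined value per group.
strict = conn.execute(
    "SELECT major, COUNT(*) AS n FROM students GROUP BY major"
).fetchall()

known_pairs = set(STUDENTS)
names_fit_groups = all((name, major) in known_pairs for major, name, _n in lenient)

print(len(lenient), len(strict))  # 4 4
print(names_fit_groups)           # True, but the chosen names are arbitrary
```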

WHERE vs HAVING

HAVING is group-level filtering — the aggregate-aware analog of WHERE. WHERE filters individual rows before grouping; HAVING filters groups after aggregation. Aggregates cannot appear in WHERE because the groups don’t exist yet.

SELECT   major, COUNT(*) AS n
FROM     students
WHERE    year > 1             -- rows first
GROUP BY major
HAVING   COUNT(*) >= 2;       -- groups after

COUNT(*) counts every row in the group; COUNT(gpa) counts only rows where gpa is not NULL.
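Both points can be checked in a few lines of sqlite3. The year values here are invented, so the surviving groups apply to this sample only. Because the tutorial's data has a NULL in major rather than in gpa, the sketch uses COUNT(major) to show the NULL-skipping behavior that COUNT(gpa) would show for a NULL gpa.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, year INTEGER, major TEXT)")
# Years are invented for illustration, as are most names.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Mango", 3, "CS"), ("Bananito", 2, "CS"), ("Avocado", 1, "CS"),
     ("Figaro", 4, "CS"), ("Peachy", 2, "Math"), ("Limon", 1, "Math"),
     ("Cherry", 3, "Physics"), ("Plum", 2, "Physics"), ("Blue Bella", 1, None)],
)

# WHERE removes first-year rows before grouping; HAVING then drops small groups.
per_major = dict(conn.execute(
    "SELECT major, COUNT(*) AS n FROM students "
    "WHERE year > 1 GROUP BY major HAVING COUNT(*) >= 2"
).fetchall())

# COUNT(*) counts rows; COUNT(column) skips rows where that column is NULL.
count_star, count_major = conn.execute(
    "SELECT COUNT(*), COUNT(major) FROM students"
).fetchone()

print(per_major)                # {'CS': 3, 'Physics': 2} for this sample
print(count_star, count_major)  # 9 8
```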

Task

aggregate.sql has two queries, each making one of the two most common aggregation mistakes (illegal grouping; WHERE-with-aggregate). For each:

Plan before typing: write a one-line comment with the clauses in evaluation order, e.g. FROM students → WHERE year>1 → GROUP BY major → HAVING count ≥ 2 → SELECT major, count. Even 30 seconds of planning catches most aggregation bugs before you write them.
Predict what happens — a crash, a silent wrong answer, or a confusing SQLite-lenient result.
Explain in a one-line comment why it is wrong, then fix it.

Investigate. After you pass, temporarily change your fixed Query 2 back to WHERE AVG(gpa) > 3.3 and read the exact error message. You want to recognize it instantly in a future code review.

Translating to pandas

SQL	pandas
`SELECT major, COUNT(*) FROM students GROUP BY major`	`students.groupby('major').size()`
`SELECT major, AVG(gpa) FROM students GROUP BY major`	`students.groupby('major')['gpa'].mean()`
`... HAVING AVG(gpa) > 3.3`	`result[result > 3.3]` (filter after `.mean()`)

JavaScript’s .reduce() collapses a collection into a single value. SQL’s GROUP BY does the same — but per group, in one query.

Starter files

aggregate.sql

-- Step 4 — Aggregation with GROUP BY and HAVING
-- Plan each query in one comment line before you fix it.

-- Query 1. Intended: count students per major, most students first.
-- Violates the grouping rule — why? What does SQLite do about it?
SELECT major, name, COUNT(*) AS student_count
FROM students
GROUP BY major
ORDER BY student_count DESC;

-- Query 2. Intended: average GPA per major, only majors with avg GPA > 3.3.
-- Uses the wrong clause for an aggregate condition. Fix it.
SELECT major, AVG(gpa) AS avg_gpa
FROM students
WHERE AVG(gpa) > 3.3
GROUP BY major;

Solution

aggregate.sql

-- Query 1: count students per major, most students first
SELECT major, COUNT(*) AS student_count
FROM students
GROUP BY major
ORDER BY student_count DESC;

-- Query 2: average GPA per major, only where avg > 3.3
SELECT major, AVG(gpa) AS avg_gpa
FROM students
GROUP BY major
HAVING AVG(gpa) > 3.3;

Query 1 produces 4 groups: CS (4), Math (2), Physics (2), and NULL (1). The ORDER BY student_count DESC sorts by the alias.

Query 2 returns only CS (avg 3.575). The other majors fall below 3.3. Note: you must use HAVING (not WHERE) because AVG(gpa) is an aggregate that doesn’t exist until after grouping.

Step 4 — Knowledge Check

Min. score: 80%

1. A student writes:

SELECT major, name, COUNT(*) AS n
FROM students
GROUP BY major;

What is wrong?

COUNT(*) is not a valid aggregate function
COUNT(*) is the canonical row-count aggregate — it appears in every SQL textbook and is the most widely-supported function in SQL. The problem isn’t that aggregate; it’s the bare column next to it.
name is in SELECT but not in GROUP BY or any aggregate
GROUP BY must appear before SELECT in the written query
Clauses appear in a fixed order in the written query (SELECT → FROM → WHERE → GROUP BY), but they are evaluated in a different order: FROM → WHERE → GROUP BY → SELECT. This question is about a semantic violation, not a syntactic one.
Non-aggregate columns and aggregates cannot mix in one SELECT
Aggregates and non-aggregates can mix — that’s the whole point of GROUP BY. The rule is: every non-aggregate column must appear in GROUP BY so the database knows it has one value per group. Here name doesn’t, so the database has multiple names per major and no rule for which to pick.

The grouping rule: every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate. name is neither — there are multiple names per major, so it’s undefined which one to return. SQLite may pick an arbitrary name; PostgreSQL and strict-mode MySQL reject this entirely.

2. Which clause filters groups based on an aggregate condition?

WHERE
WHERE filters rows before grouping — at WHERE-time, the aggregates don’t exist yet. WHERE COUNT(*) > 5 is always an error.
HAVING
FILTER
FILTER exists in some dialects (SQL:2003 introduced FILTER (WHERE ...) for inline aggregate filtering), but it’s per-aggregate. Group-level filtering is HAVING’s job.
SELECT ... WHERE
There’s no SELECT ... WHERE clause structure for groups — SELECT projects, WHERE filters rows, HAVING filters groups. Three different jobs, three different keywords.

WHERE filters individual rows before grouping. HAVING filters groups after aggregation. WHERE COUNT(*) > 5 is always an error because the count doesn’t exist yet when WHERE runs. Use HAVING COUNT(*) > 5.

3. If 3 out of 10 students have a NULL value in the gpa column, what does each of these return? COUNT(*) vs COUNT(gpa)

Both return 10 — COUNT always includes every row
COUNT(*) does count every row, but COUNT(column) is a different aggregate that filters NULLs. The two forms exist precisely to give you a choice between “how many rows in the group?” and “how many known values in this column?”
Both return 7 — COUNT skips NULL rows
This collapses the two forms into one. The distinction is intentional: when the column is the answer (“how many students have a known GPA?”) use COUNT(column); when you want the group’s size regardless of nullness use COUNT(*).
COUNT(*) returns 10; COUNT(gpa) returns 7
COUNT(*) returns 7; COUNT(gpa) returns 10
Backwards. COUNT(*) always sees the row even if every column is NULL — only the row’s existence matters. COUNT(gpa) is the one that drops the NULLs.

COUNT(*) counts every row regardless of NULLs. COUNT(column) counts only rows where that column is not NULL. The distinction is critical whenever a column is optional — use the right one depending on whether you want to count the group size or the number of known values.

4. A query uses WHERE year > 2 to filter rows before grouping. If a student has year set to NULL, are they included in the grouped results?

Yes — NULL is treated as 0, which is not greater than 2
SQL treats NULL as unknown, not zero: NULL > 2, NULL = 0, NULL = NULL all evaluate to UNKNOWN — never TRUE, never FALSE. Spreadsheets coerce blank cells to 0, which sets up the wrong intuition for SQL.
Yes — WHERE passes all rows to GROUP BY by default
WHERE applies its predicate to every row, including NULL ones. There is no “pass-through” rule; the predicate just gets a NULL operand and produces UNKNOWN, which WHERE then drops.
No — NULL > 2 is UNKNOWN, and WHERE discards UNKNOWN rows
It depends on the database engine
NULL handling in WHERE is one of the most consistent parts of SQL across engines — three-valued logic (TRUE/FALSE/UNKNOWN) is in the ANSI standard and every major database implements it the same way. Other things vary; this doesn’t.

From Step 3: NULL > 2 evaluates to UNKNOWN, and WHERE discards UNKNOWN rows before grouping even begins. Always check whether NULLs in filtered columns are affecting your aggregates.

5. Arrange the clauses to count students per major, but only show majors with more than 2 students. (arrange in order)

Correct order:

SELECT major, COUNT(*) AS n
FROM students
GROUP BY major
HAVING COUNT(*) > 2;

Distractors (not used):

WHERE COUNT(*) > 2;
ORDER BY major

WHERE cannot hold aggregates — the count does not exist yet when WHERE runs. HAVING filters after grouping.

6. Arrange the steps in the order the database actually executes them, for this query:

SELECT major, COUNT(*) AS n FROM students
WHERE year > 1 GROUP BY major HAVING COUNT(*) >= 2 ORDER BY n DESC;

(arrange in order)

Correct order:

FROM students — read the source rows
WHERE year > 1 — drop rows where year is 1 or NULL
GROUP BY major — partition surviving rows into groups
HAVING COUNT(*) >= 2 — drop groups smaller than 2
SELECT major, COUNT(*) AS n — project the final columns
ORDER BY n DESC — sort groups by count

Distractors (not used):

SELECT first, then filter rows in WHERE
ORDER BY runs before HAVING

SELECT is written first but evaluated fifth. This is why standard SQL does not let WHERE reference an alias defined in SELECT (SQLite tolerates it as an extension; do not rely on that), and why aggregates in WHERE always fail.

7. [Review from Step 3] A table has a nullable status column. Which query correctly finds all rows where status is either 'inactive' or unknown?

WHERE status = 'inactive' OR status = NULL
The first half is fine, but status = NULL always evaluates to UNKNOWN — it never matches. SQL has IS NULL precisely because = doesn’t work on NULL.
WHERE status = 'inactive' OR status IS NULL
WHERE status IN ('inactive', NULL)
IN (a, b) desugars to status = a OR status = b. The NULL inside expands to status = NULL, which is UNKNOWN, so it silently misses NULL rows.
WHERE status != 'active'
!= 'active' drops NULL rows the same way = 'inactive' does — three-valued logic strikes again. You’d need OR status IS NULL regardless.

= NULL and IN (..., NULL) both silently miss NULL rows (internally they use =). != 'active' drops NULL rows too. Only IS NULL works.

5

Designing Your Own Table

Why this matters

You have been querying an existing table for four steps; now it is time to build your own. A well-designed schema makes invalid data impossible to insert in the first place — your NOT NULL, PRIMARY KEY, and type declarations are the database’s safety net. If constraint syntax feels fiddly at first, that is normal — the payoff is that bugs surface at insert time, not three weeks later in a production report.

🎯 You will learn to

Apply CREATE TABLE with appropriate types and constraints (PRIMARY KEY, NOT NULL) to a real schema
Create rows with INSERT INTO that satisfy every declared constraint
Evaluate a failing query by classifying the error as syntax, semantic, logical, or constraint violation

CREATE TABLE

CREATE TABLE is schema declaration — it defines the shape and integrity rules of a table (columns, types, constraints), not the data. Like declaring a struct, not creating an instance.

CREATE TABLE books (
  id     INTEGER PRIMARY KEY,    -- identity: unique + non-null
  title  TEXT    NOT NULL,       -- required
  author TEXT    NOT NULL,       -- required
  pages  INTEGER                 -- optional (nullable by default)
);

PRIMARY KEY = unique + non-null + indexed.
NOT NULL = required.
Types: INTEGER, REAL, TEXT, BLOB.

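SQLite quirk: standard SQL makes PRIMARY KEY imply NOT NULL. SQLite does not enforce that on ordinary tables: a non-INTEGER primary key accepts NULL unless you add NOT NULL, and an INTEGER PRIMARY KEY given NULL is silently assigned the next id. Add NOT NULL explicitly if portability matters.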
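
The quirk is easy to confirm with Python's sqlite3 module. The two tables below (loose_codes and safe_codes) are hypothetical and exist only for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: a TEXT primary key with no explicit NOT NULL.
conn.execute("CREATE TABLE loose_codes (code TEXT PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO loose_codes (code, title) VALUES (NULL, 'no key')")
loose_rows = conn.execute("SELECT code, title FROM loose_codes").fetchall()

# Same shape, with NOT NULL spelled out on the key.
conn.execute("CREATE TABLE safe_codes (code TEXT PRIMARY KEY NOT NULL, title TEXT)")
try:
    conn.execute("INSERT INTO safe_codes (code, title) VALUES (NULL, 'no key')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(loose_rows)  # the NULL key was stored
print(rejected)    # the explicit NOT NULL blocked it
```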

INSERT INTO

INSERT INTO is row insertion — it adds new rows that must conform to the schema’s types and constraints.

INSERT INTO books (id, title, author, pages) VALUES
  (1, 'SICP', 'Abelson & Sussman', 657);

Omitting a NOT NULL column (with no default) fails. Nullable columns can be skipped. After any insert/update, run a quick SELECT to verify.
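
A short sketch of those three points, run through Python's sqlite3 module against the books schema above. The second insert (author 'Knuth' with no title) is a made-up failing row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE books (
      id     INTEGER PRIMARY KEY,
      title  TEXT    NOT NULL,
      author TEXT    NOT NULL,
      pages  INTEGER
    )
    """
)

# pages is nullable, so it can be left out of the column list.
conn.execute(
    "INSERT INTO books (id, title, author) VALUES (1, 'SICP', 'Abelson & Sussman')"
)

# title is NOT NULL with no default, so leaving it out is rejected.
try:
    conn.execute("INSERT INTO books (id, author) VALUES (2, 'Knuth')")
    error = None
except sqlite3.IntegrityError as exc:
    error = str(exc)

# Verify with a quick SELECT after the inserts.
rows = conn.execute("SELECT id, title, author, pages FROM books").fetchall()
print(rows)
print(error)
```

The skipped pages column comes back as NULL (None in Python), and only the valid row is stored.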

Four kinds of SQL errors

Syntax — grammar violated. Refuses to run.
Semantic — valid grammar, references something that does not exist (wrong column name). Errors at parse time.
Logical — runs and returns a result, but the wrong result (e.g., the NULL trap from Step 3). No error message — the hardest to catch.
Constraint violation — grammatically and semantically valid, but violates PRIMARY KEY / NOT NULL / UNIQUE / FOREIGN KEY at execution time.
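
One way to get a feel for the four categories is to trigger each on purpose. The sketch below uses Python's sqlite3 module; the outcome helper and the two-row table are hypothetical. Note that this driver reports syntax and semantic errors under the same exception class, so the message is what tells them apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, major TEXT)"
)
conn.execute("INSERT INTO students VALUES (1, 'Ana', 'CS'), (2, 'Ben', NULL)")


def outcome(sql):
    """Run one statement; return 'ExceptionClass: message', or the fetched rows."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"{type(exc).__name__}: {exc}"


syntax = outcome("SELEC name FROM students")                  # grammar violated
semantic = outcome("SELECT full_name FROM students")          # column does not exist
constraint = outcome("INSERT INTO students VALUES (1, 'Cy', 'Math')")  # duplicate id
logical = outcome("SELECT name FROM students WHERE major = NULL")      # runs, finds nothing

for label, result in [("syntax", syntax), ("semantic", semantic),
                      ("constraint", constraint), ("logical", logical)]:
    print(label, "->", result)
```

The logical case is the only one that returns normally: an empty list, with no hint that Ben's NULL major was missed.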

Predict before you run

Read the two files (schema.sql and populate.sql) before touching either. Then write down (in a comment in the file or on paper):

Which of the three INSERT statements will succeed against the strengthened schema (the one with the constraints from the table below)?
For each insert that fails, which constraint will it violate — and which of the four error categories does that map to?

Once you have written your prediction, strengthen the schema and run the inserts. Compare the actual failures to your prediction. The point is not to be right; it is to commit to a hypothesis before the database tells you the answer.

Task

A teammate handed you two unreviewed files. Bring them up to production quality.

1. schema.sql — strengthen. It compiles, but has no integrity constraints. Add the missing types + constraints:

Column	Type	Constraints
`id`	INTEGER	PRIMARY KEY
`code`	TEXT	NOT NULL
`name`	TEXT	NOT NULL
`credits`	INTEGER	(nullable — leave as is)

Above each constraint, write a one-line comment naming the bug it prevents (e.g., -- prevents duplicate course IDs).

2. populate.sql — classify and repair. Three INSERT statements. Two fail against the strengthened schema. For each failing insert, decide which error category fits, then fix it (or replace with a valid row). End with SELECT * FROM courses;. Aim for at least 3 rows.

Translating to pandas

SQL	pandas
`CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, ...)`	`books = pd.DataFrame(columns=['id', 'title', 'author', 'pages'])`
`INSERT INTO books VALUES (1, 'SICP', 'Abelson', 657)`	`books.loc[len(books)] = [1, 'SICP', 'Abelson', 657]`

pandas accepts anything silently — wrong types, missing required fields. SQL enforces the schema and catches those bugs at insert time.

Starter files

schema.sql

-- Step 5a — Designing a schema that protects its own data.
-- The CREATE below runs, but it has NO constraints.
-- Add the missing types + constraints from the table in the instructions.
-- For each constraint you add, leave a one-line comment naming the bug it prevents.
DROP TABLE IF EXISTS courses;

CREATE TABLE courses (
  id,
  code,
  name,
  credits
);

populate.sql

-- Step 5b — Populating the strengthened schema.
-- Classify each broken INSERT (syntax / semantic / logical / constraint violation), then fix.
DELETE FROM courses;

-- Insert A — the model row. Matches the schema exactly.
INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'CS 35L', 'Software Construction', 4);

-- Insert B — predict what breaks when the schema enforces NOT NULL
INSERT INTO courses (id, code, credits) VALUES
  (2, 'CS 111', 4);

-- Insert C — all values look plausible. What does the strengthened schema
-- reject here? Look carefully at how this row compares to Insert A.
INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'MATH 61', 'Discrete Structures', 4);

-- Verify your data. Leave this line as the last statement.
SELECT * FROM courses;

Solution

schema.sql

DROP TABLE IF EXISTS courses;

CREATE TABLE courses (
  id      INTEGER PRIMARY KEY,
  code    TEXT    NOT NULL,
  name    TEXT    NOT NULL,
  credits INTEGER
);

populate.sql

DELETE FROM courses;

INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'CS 35L',  'Software Construction', 4),
  (2, 'CS 111',  'Operating Systems',     4),
  (3, 'MATH 61', 'Discrete Structures',   4);

SELECT * FROM courses;

CREATE TABLE declares the shape; INSERT INTO adds rows that must conform. The final SELECT * FROM courses verifies the data. DROP TABLE IF EXISTS at the top makes schema.sql safe to re-run.

On Insert B: the column list omits name, which the strengthened schema declares NOT NULL with no default, so the database rejects the row as a constraint violation at execution time. The fix is to supply a name (here 'Operating Systems').

On Insert C: all four values looked plausible in isolation, but id = 1 is already taken by Insert A. The strengthened schema has id INTEGER PRIMARY KEY, which enforces uniqueness — so the database raises a constraint violation at execution time. The fix is to give the new row a distinct id (e.g., 3, 'MATH 61', 'Discrete Structures', 4). Constraint violations sit between semantic errors (caught at parse time) and logical errors (never caught): the database catches them, but only when the statement runs.

Step 5 — Knowledge Check

Min. score: 80%

1. What does PRIMARY KEY guarantee about a column’s values?

Values are sorted in ascending order
Storage may happen to keep PK rows ordered (B-tree indexes do this), but logically there’s no sort guarantee. Result-set order still requires ORDER BY.
The column cannot contain text
PRIMARY KEY can be on any type — INTEGER, TEXT, BLOB. The constraint is uniqueness, not the column type.
Each value is unique and non-null across all rows
The column is indexed for faster lookups only
An implicit index is created, but that’s a side effect. The defining contract is uniqueness + non-null, which is what ‘identifying every row’ requires.

PRIMARY KEY enforces two things: uniqueness (no two rows share the same value) and non-nullability. It also creates an implicit index, but its core purpose is to uniquely identify every row.

2. What happens if you INSERT INTO a row but omit a column declared NOT NULL with no default?

The column is automatically set to 0 or an empty string
There’s no implicit default of 0 or '' unless you declared one. Without DEFAULT, the column has no fallback.
The row is inserted with NULL in that column
That’d defeat the purpose of NOT NULL. The constraint says the column may never be NULL — including at insert time.
The database raises an error and the row is not inserted
The row is inserted but flagged as incomplete
There’s no ‘incomplete’ state. A row either satisfies all constraints or the insert is rejected outright.

A NOT NULL column with no default value requires an explicit value. Omitting it in the INSERT causes an error — the database enforces the constraint and refuses the row.

3. A student writes:

SELECT full_name FROM students WHERE full_name = 'Mango';

The column is actually named name, not full_name. What type of SQL error is this?

Logical error — the query runs but returns wrong rows
A logical error runs successfully but produces the wrong result. Here the query never runs — the column lookup fails before execution.
Semantic error — valid grammar, but references something that does not exist
Syntax error — the SQL grammar rules are violated
Syntax errors are about grammar (missing parentheses, misplaced clauses). The grammar is fine here; the meaning is broken.
Constraint violation — a NOT NULL or PRIMARY KEY rule was broken
Constraints are about row-level data validity (NOT NULL, PRIMARY KEY, FOREIGN KEY). This is a naming problem at parse time.

A semantic error is a query that is grammatically correct SQL but references something that doesn’t exist (wrong column name, wrong table). A syntax error violates grammar (wrong clause order, missing parentheses). A logical error is when the query runs but returns the wrong answer.

4. You need to store a library’s book collection. Each book has a title, author, year published, and an optional page count. Which schema is best?

CREATE TABLE books (title TEXT, author TEXT, year INTEGER, pages INTEGER)
Missing a PRIMARY KEY — without one there’s no guaranteed way to uniquely identify each book, making updates and deletes ambiguous.
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT NOT NULL, year INTEGER, pages INTEGER)
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT NOT NULL, year TEXT, pages TEXT)
Storing year as TEXT instead of INTEGER loses type safety — you can’t sort or range-query correctly, and the column accepts non-year strings.
CREATE TABLE books (book_data TEXT)
Storing all data in a single TEXT column gives up all schema enforcement: no types, no NOT NULL constraints, and no ability to query individual fields.

The best schema has a PRIMARY KEY for unique identification, NOT NULL on required fields (title, author), appropriate types (INTEGER for year, not TEXT), and leaves optional fields (pages) nullable.

5. Arrange the lines to create a songs table with a primary key and a required title. (arrange in order)

Correct order:

CREATE TABLE songs (
id INTEGER PRIMARY KEY,
title TEXT NOT NULL,
artist TEXT,
plays INTEGER
);

Distractors (not used):

id INTEGER UNIQUE,
title TEXT NULL,

PRIMARY KEY provides uniqueness and identification — UNIQUE alone lacks the “identity” role. NOT NULL on title ensures every song has a name. TEXT NULL is redundant (columns are nullable by default) and wrong for a required field.

6. [Review] What is the correct logical execution order of these SQL clauses? (arrange in order)

Correct order:

FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY

Distractors (not used):

SORT BY
FILTER

The logical execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This is why standard SQL does not let you use a SELECT alias in WHERE (the alias doesn’t exist yet; SQLite tolerates it as an extension) and why WHERE cannot use aggregate functions (groups don’t exist yet). SORT BY is not valid SQL (use ORDER BY); FILTER is not a standalone clause.

6

Modifying & Cleaning Up

Why this matters

Destructive SQL is where careful habits pay off. If you have ever run a shell command and watched the wrong files disappear, you already know the feeling — these patterns exist so you do not have it in production. A missing WHERE on UPDATE or DELETE is one of the most common ways to corrupt a real database, and the only defense is a workflow that previews changes before committing them.

🎯 You will learn to

Apply UPDATE, DELETE, and DROP TABLE with the right scoping (WHERE, IF EXISTS)
Evaluate a destructive statement by previewing its impact with a SELECT first
Analyze the differences between row-level deletion (DELETE), table removal (DROP), and modification (UPDATE)

Safe failure experiment

Run SELECT * FROM students; and note the row count. Now run DELETE FROM students; (no WHERE). Run the SELECT again: the table is empty. Reloading the page restores the seed data; production has no reload button.

UPDATE, DELETE, DROP

All three are destructive: they change the database state, and without a backup the change cannot be undone.

UPDATE modifies values in existing rows. Scoped by WHERE.
DELETE removes rows. Scoped by WHERE. Schema stays.
DROP TABLE removes the table itself — data and schema. No undo.

UPDATE students SET gpa = 3.95 WHERE name = 'Mango';
DELETE FROM students WHERE gpa < 3.0;
DROP TABLE IF EXISTS courses;

Without WHERE, UPDATE modifies every row and DELETE empties the table. Safe habit: write SELECT ... WHERE ... first, then convert.

Statement	Removes	Schema remains?
`DELETE FROM t WHERE ...`	Matching rows	Yes
`DELETE FROM t`	All rows	Yes
`DROP TABLE t`	All rows + schema	No
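
The preview-then-convert habit described above might look like this when scripted with Python's sqlite3 module. The three rows are hypothetical, and the predicate is a fixed literal written by the reviewer, not user input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, year INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Ana", 4, 3.1), ("Ben", 2, 3.7), ("Cy", 4, 3.4)],
)

predicate = "year = 4"  # written once, reused for both the preview and the delete

# 1. Preview: which rows would the DELETE touch?
preview = conn.execute(
    f"SELECT name FROM students WHERE {predicate} ORDER BY name"
).fetchall()

# 2. Convert: same WHERE clause, now destructive.
deleted = conn.execute(f"DELETE FROM students WHERE {predicate}").rowcount

remaining = conn.execute("SELECT name FROM students").fetchall()
print(preview, deleted, remaining)
```

If the number of deleted rows differs from the length of the preview, something changed between the two statements and the result deserves a second look.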

Task

A teammate pushed cleanup.sql to a PR. It is scheduled to run on production in an hour. Do not approve it as-is.

For each statement:

Predict what it actually does — not what the comment claims.
Preview with a SELECT. Write the SELECT ... WHERE ... that would reveal the affected rows; leave it as a comment above the fixed statement so the next reviewer sees your reasoning.
Fix the statement to match its stated intent.

That trail — claim, preview, confirm — is how destructive SQL actually gets approved.

Stated intents:

Update Kiwi’s GPA from 3.50 to 3.60.
Delete all year-4 students.
Drop the courses table from Step 5.

Translating to pandas

SQL	pandas
`UPDATE students SET gpa = 3.6 WHERE name = 'Kiwi'`	`students.loc[students['name'] == 'Kiwi', 'gpa'] = 3.6`
`DELETE FROM students WHERE year = 4`	`students = students[students['year'] != 4]`
`DROP TABLE courses`	`del courses`

SQL modifies in place; pandas often returns a new frame. In SQL, forgetting WHERE is permanent.

🎓 Where to go next

This tutorial covers the ~90% of SQL that 90% of code needs. Three topics deserve your next hour:

LEFT JOIN and the outer-join family. Step 2 taught JOIN, which drops rows with no match (Blue Bella with NULL major). LEFT JOIN keeps them and fills the right side with NULL.
Common Table Expressions (WITH ... AS (...)). Complex queries collapse from unreadable nested subqueries into top-to-bottom readable pipelines.
The N+1 query problem. The #1 antipattern at the app ↔ SQL boundary: looping over a list, firing one query per element. Batch with WHERE id IN (...) or an explicit JOIN.
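
As a taste of the first topic, here is a minimal sketch of JOIN versus LEFT JOIN using Python's sqlite3 module. The two tiny tables are hypothetical; only the column names code and department follow the majors table used in this tutorial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE majors (code TEXT PRIMARY KEY, department TEXT)")
conn.execute("CREATE TABLE students (name TEXT NOT NULL, major TEXT)")
conn.execute("INSERT INTO majors VALUES ('CS', 'Engineering')")
conn.execute("INSERT INTO students VALUES ('Ana', 'CS'), ('Ben', NULL)")

# Inner JOIN: Ben has no matching major, so his row disappears.
inner = conn.execute(
    "SELECT s.name, m.department FROM students AS s "
    "JOIN majors AS m ON s.major = m.code ORDER BY s.name"
).fetchall()

# LEFT JOIN: every student is kept; the missing side is filled with NULL.
left = conn.execute(
    "SELECT s.name, m.department FROM students AS s "
    "LEFT JOIN majors AS m ON s.major = m.code ORDER BY s.name"
).fetchall()

print(inner)
print(left)
```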

Starter files

cleanup.sql

-- Step 6 — Review a teammate's destructive PR before it runs in production.
-- For each statement: predict, preview with a commented SELECT, then fix.

-- 1. Intended: raise Kiwi's GPA to 3.6.
--    Read this UPDATE carefully. Does it actually do what the comment promises?
--    Preview by running a SELECT with the same WHERE clause first.
UPDATE students SET gpa = 3.6 WHERE name = 'Mango';

-- 2. Intended: delete all year-4 students.
--    Wrong column referenced. Would this even run? If yes, what does it delete?
DELETE FROM students WHERE gpa = 4;

-- 3. Intended: drop the courses table.
--    Works most of the time. When would this statement crash the deploy,
--    and how do you make it safe to re-run?
DROP TABLE courses;

Solution

cleanup.sql

-- 1. Update Kiwi's GPA to 3.6
UPDATE students SET gpa = 3.6 WHERE name = 'Kiwi';

-- 2. Delete all year-4 students
DELETE FROM students WHERE year = 4;

-- 3. Drop the courses table
DROP TABLE IF EXISTS courses;

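Task 1: the starter’s WHERE name = 'Mango' would have changed the wrong student’s GPA, a logical error that raises no message. UPDATE ... SET ... WHERE modifies only the rows matching the WHERE clause, so the fix targets name = 'Kiwi'. Without any WHERE, every student’s GPA would become 3.6.

Task 2: the starter’s WHERE gpa = 4 is valid SQL, so it runs without error, but it targets students whose GPA is 4 instead of year-4 students. DELETE FROM ... WHERE year = 4 removes Coconick and Orangelo. Without the WHERE, all 9 students would be deleted.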

Task 3: DROP TABLE IF EXISTS is safer than plain DROP TABLE — it avoids an error if the table was already dropped or never created. After this, SELECT * FROM courses would produce an error (the table no longer exists), while DELETE FROM courses without WHERE would leave the table empty but still queryable.
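
The difference between the three behaviours in Task 3 can be checked in a few lines with Python's sqlite3 module. The one-row courses table below is a hypothetical stand-in for the Step 5 table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (id INTEGER PRIMARY KEY, code TEXT NOT NULL)")
conn.execute("INSERT INTO courses VALUES (1, 'CS 35L')")

# DELETE without WHERE: the rows are gone, the table is still queryable.
conn.execute("DELETE FROM courses")
after_delete = conn.execute("SELECT * FROM courses").fetchall()

# DROP TABLE IF EXISTS: safe to run twice in a row.
conn.execute("DROP TABLE IF EXISTS courses")
conn.execute("DROP TABLE IF EXISTS courses")

# Plain DROP TABLE on a missing table raises an error.
try:
    conn.execute("DROP TABLE courses")
    drop_error = None
except sqlite3.OperationalError as exc:
    drop_error = str(exc)

# So does any query against the dropped table.
try:
    conn.execute("SELECT * FROM courses")
    select_error = None
except sqlite3.OperationalError as exc:
    select_error = str(exc)

print(after_delete, drop_error, select_error)
```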

Step 6 — Knowledge Check

Min. score: 80%

1. A developer runs:

DELETE FROM orders;

What is the result?

The orders table is removed from the database
DELETE removes rows, not the table itself. To remove the schema too, use DROP TABLE.
All rows are deleted but the table structure remains
An error — DELETE requires a WHERE clause
SQL doesn’t require WHERE on DELETE — that’s exactly what makes this footgun dangerous. The query runs cleanly and empties the table.
Only rows matching the implicit WHERE TRUE condition are deleted
There’s no implicit WHERE TRUE concept — DELETE without WHERE just deletes everything by design. The framing in this option implies a safety net that doesn’t exist.

DELETE FROM orders without a WHERE deletes every row but leaves the table schema intact. The table still exists — it is just empty. To remove the schema too, use DROP TABLE orders. SQL does not require a WHERE clause on DELETE, which makes this mistake easy to make in production.

2. What is the safest way to remove a table that may or may not exist, without causing an error?

REMOVE TABLE IF EXISTS courses;
There’s no REMOVE TABLE keyword in SQL. The keyword is DROP.
DELETE TABLE courses;
DELETE TABLE is invalid syntax. DELETE FROM removes rows; DROP TABLE removes the table itself.
DROP TABLE IF EXISTS courses;
TRUNCATE TABLE IF EXISTS courses;
TRUNCATE removes rows (like DELETE without WHERE) but keeps the schema. Also, SQLite doesn’t support TRUNCATE at all — and TRUNCATE TABLE IF EXISTS isn’t standard.

DROP TABLE IF EXISTS tablename is the idiomatic safe form. Plain DROP TABLE tablename raises an error if the table doesn’t exist. DELETE TABLE is not valid SQL. TRUNCATE removes all rows (like DELETE) but keeps the schema, and is not supported by all databases (including SQLite).

3. Before running an UPDATE on a production table, what should you do first?

Run the UPDATE and check the result in the UI
By the time you check the UI, the change is committed. The whole point is to preview before the destructive action.
Run a SELECT with the same WHERE clause to preview affected rows
Add ORDER BY to the UPDATE so the order is deterministic
ORDER BY on UPDATE doesn’t preview anything — and it’s not even standard. The point is to verify which rows are affected, not what order they’re processed in.
Change WHERE to WHERE TRUE so every row is visible
WHERE TRUE matches every row — the opposite of safe. You want to verify the specific rows the UPDATE will hit.

Running SELECT ... WHERE ... with the same predicate lets you see exactly which rows the UPDATE will touch — before the change is permanent. This is the claim/preview/confirm trail the task demonstrated.

4. After running DROP TABLE courses, what happens if you run SELECT * FROM courses?

It returns an empty result set (no rows, but the columns are shown)
Empty result requires the schema to still exist (so columns are known). DROP removes the schema too — there’s nothing to project.
The database raises an error — the table no longer exists
It returns NULL for every column
NULL implies columns existing but values being unknown. After DROP, the columns themselves don’t exist.
It returns the data that was in the table before the DROP
DROP doesn’t soft-delete or version data. The table is gone; old data isn’t recoverable from SQL alone.

DROP TABLE removes both the data and the schema. The table no longer exists in the database at all, so any query referencing it fails with an error. Compare this with DELETE FROM courses (no WHERE), which removes all rows but keeps the table structure intact — SELECT * FROM courses would return an empty result set, not an error.

5. Arrange the steps for safely updating Kiwi’s GPA. The professional workflow: preview first, then modify. (arrange in order)

Correct order:

-- Preview the rows you're about to change
SELECT * FROM students WHERE name = 'Kiwi';
-- Then apply the change
UPDATE students SET gpa = 3.6 WHERE name = 'Kiwi';

Distractors (not used):

UPDATE students SET gpa = 3.6;
DELETE FROM students WHERE name = 'Kiwi';

Always preview with SELECT before modifying production data. UPDATE students SET gpa = 3.6; is missing the WHERE clause — it would change every student’s GPA to 3.6. The DELETE distractor removes the row entirely instead of updating it.

6. [Review from Step 4] What is the difference between COUNT(*) and COUNT(email)?

COUNT(*) counts all rows; COUNT(email) skips NULL emails
COUNT(*) is slower; COUNT(email) is optimized
Performance is roughly identical for both — engines optimize them similarly. The semantic difference (NULL handling) is what matters.
They are equivalent — both count all rows
They differ on rows where the named column is NULL. If email is nullable and some rows are NULL, the two return different counts.
COUNT(*) counts columns; COUNT(email) counts rows
Both count rows. * is a special form meaning ‘count any row’; COUNT(column) filters rows where the column is non-NULL.

COUNT(*) counts every row regardless of NULL. COUNT(column) counts only rows where that column is not NULL.

7. [Review from Step 3] Which statement correctly removes every student whose major is NULL?

DELETE FROM students WHERE major = NULL
= NULL evaluates to UNKNOWN, never TRUE. The DELETE would run but match no rows — silently doing nothing, exactly the trap that motivates IS NULL.
DELETE FROM students WHERE major IS NULL
DELETE FROM students WHERE major != 'CS'
!= 'CS' would also drop NULL rows by 3VL — but it would also delete every non-CS student with a known major. Way too broad.
DELETE FROM students WHERE major
Bare WHERE major (without an operator) treats major as a boolean — implicit conversion that’s not portable. It’s not what you want for NULL semantics.

= NULL always evaluates to UNKNOWN and matches no rows — the delete would silently do nothing. Only IS NULL tests for unknown values.

7

Putting It Together

Why this matters

You’ve spent six steps learning the building blocks separately. This step is composition — chaining JOIN, WHERE, GROUP BY/HAVING, aggregates, and ORDER BY into one query that answers a real-world question. The skill the capstone builds isn’t typing more SQL — it’s decomposing a natural-language ask into the constructs you already know.

🎯 You will learn to

Create a single query that combines JOIN, WHERE, GROUP BY/HAVING, aggregates, and ORDER BY
Analyze a natural-language request to map each phrase to the SQL construct it requires
Evaluate your own query against a hand-computed prediction (rows, ordering, totals)

The dean’s request

The dean wants a snapshot of upperclassmen (year 2 and above) by major. List each major that has at least 2 such students, showing:

the major code
its department
the advisor
the average GPA among those students
the student count

Sort by average GPA, highest first.

Decomposing the request

Each phrase of a real-world ask maps to a SQL construct. Identifying these mappings before you type is the skill this capstone builds.

Phrase in the request	Construct
“by major”	`GROUP BY` (Step 4)
“year 2 and above” — row filter	`WHERE` (Step 3)
“department, advisor” — needs `majors` table	`JOIN` (Step 2)
“at least 2 such students” — group filter	`HAVING COUNT(*) >= 2` (Step 4)
“average GPA”, “student count”	`AVG()`, `COUNT()` (Step 4)
“highest first”	`ORDER BY ... DESC` (Step 3)
(production hygiene)	Explicit columns + aliases (Steps 1, 2)

Plan, then type

In final.sql, sketch the clauses in execution order as a comment, then write the query. Naming the constructs you’ll need is the highest-leverage habit for aggregation queries.

One query. No SELECT *. Run before you celebrate.

Predict before you run

Before you submit your query, work this out by hand:

Open the data. The students table has 9 rows. Walk through every student’s year and major. Tally per major: how many year-2-or-above students does each major have?
Apply the dean’s filter (“year ≥ 2 and COUNT(*) ≥ 2 per major”). Cross out any major that doesn’t meet both conditions.
Now commit (on paper) to all three:
- The exact number of rows your query will return.
- The major that appears in row 1 (highest avg GPA).
- The major that appears in row 2.

Then run. The fastest way to know whether your query is correct is not “does it run?” but “does it return exactly the row count, ordering, and majors I predicted?” If the row count matches but the majors are in the wrong order, your ORDER BY direction is reversed. If the count is off by one, check your HAVING threshold first. Predicting before running converts a vague “looks fine” into a sharp diagnostic.

⚠️ Open after you've committed to all three answers

Two majors qualify: CS (Mango 3.9, Bananito 3.7, Coconick 3.4 → year≥2 count is 3, avg ≈ 3.67) and Math (Watermelina 3.6, Grapenzo 3.5 → year≥2 count is 2, avg = 3.55). Engineering and the NULL-major student fail one filter each; with an inner JOIN on s.major = m.code, the NULL-major student is dropped even earlier, because NULL = m.code is never TRUE. Sorted highest-first: CS, then Math.

If your prediction matched, great — your mental model of the data + the filters + the ordering is solid. If it didn’t, that’s the lesson: pin down which step (counting / filtering / sorting / averaging) was wrong before you debug the SQL. The query is downstream of the model.
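
The hand tally can also be cross-checked mechanically. The following is a minimal sketch, not part of the tutorial's files: it uses Python's built-in sqlite3 module and a reduced stand-in for the students table. Only the five names and GPAs come from the reveal above; the year values and the two extra rows (one Engineering student, one NULL-major student) are assumptions made for illustration, so the real 9-row table will differ in its details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "year INTEGER, major TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students (name, year, major, gpa) VALUES (?, ?, ?, ?)",
    [
        # Names and GPAs follow the reveal; the year values are assumed.
        ("Mango", 3, "CS", 3.9),
        ("Bananito", 2, "CS", 3.7),
        ("Coconick", 4, "CS", 3.4),
        ("Watermelina", 2, "Math", 3.6),
        ("Grapenzo", 3, "Math", 3.5),
        # Hypothetical rows standing in for students who do not qualify.
        ("Hypothetical E", 2, "Engineering", 3.0),
        ("Hypothetical N", 3, None, 3.2),
    ],
)

# Same WHERE and GROUP BY as the capstone, but without HAVING, so every
# per-major tally is visible before any major is crossed out.
tally = conn.execute(
    "SELECT major, COUNT(*) AS n, AVG(gpa) AS avg_gpa "
    "FROM students WHERE year >= 2 "
    "GROUP BY major ORDER BY avg_gpa DESC"
).fetchall()

for major, n, avg_gpa in tally:
    verdict = "keep" if n >= 2 else "cross out"
    print(major, n, round(avg_gpa, 2), verdict)

# Majors that survive the "at least 2" rule, in highest-average-first order.
qualifying = [major for major, n, _ in tally if n >= 2]
```

Dropping HAVING on purpose is the design choice here: the output shows the counts you tallied by hand, including the groups that the dean's rule removes.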

Starter files

final.sql

-- Capstone: one query that combines the query-writing skills from Steps 1–4.
-- Plan first as a comment in execution order:
--   FROM ... → JOIN ... → WHERE ... → GROUP BY ... → HAVING ...
--   → SELECT ... → ORDER BY ...
-- Expected output: 2 rows, columns major | department | advisor | avg_gpa | n
-- Then write the query below.

Solution

final.sql

-- Plan: students JOIN majors → WHERE year >= 2 → GROUP BY major
--     → HAVING count >= 2 → SELECT code, dept, advisor, AVG, COUNT
--     → ORDER BY avg DESC
SELECT s.major     AS major,
       m.department,
       m.advisor,
       AVG(s.gpa)  AS avg_gpa,
       COUNT(*)    AS n
FROM students AS s
JOIN majors AS m ON s.major = m.code
WHERE s.year >= 2
GROUP BY s.major, m.department, m.advisor
HAVING COUNT(*) >= 2
ORDER BY avg_gpa DESC;

Every query-writing step (Steps 1–4) is in this query:

JOIN ... ON ... (Step 2) connects students to their major’s department + advisor.
WHERE s.year >= 2 (Step 3) filters rows before grouping — only upperclassmen reach the GROUP BY.
GROUP BY (Step 4) collapses surviving rows into per-major buckets.
HAVING COUNT(*) >= 2 (Step 4) filters those buckets — only majors with at least 2 upperclassmen.
AVG, COUNT (Step 4) produce the summary statistics.
ORDER BY avg_gpa DESC (Step 3) presents highest-average first.
Aliases + explicit columns (Steps 1, 2) keep the query production-clean.

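Note the grouping rule: every non-aggregate column in SELECT must appear in GROUP BY. m.department and m.advisor are functionally dependent on s.major (each major has one department and one advisor), but SQL-92 does not reason about functional dependencies, and engines that do detect them differ in how far they go, so the portable choice is to list both columns in GROUP BY. SQLite would let you omit them. PostgreSQL infers a dependency only from a primary key that appears in GROUP BY, so it would reject this query grouped by s.major alone. MySQL with ONLY_FULL_GROUP_BY enabled runs its own dependency analysis, so whether it accepts the shorter form depends on the keys declared on majors.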
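
The SQLite half of that note can be checked directly. A minimal sketch using Python's built-in sqlite3 module follows; the department and advisor values are placeholders, the year values are assumed, and only the names and GPAs come from the reveal earlier in this step. It runs the solution twice, once with the full GROUP BY list and once grouped by s.major alone, and compares the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE majors (code TEXT PRIMARY KEY, department TEXT, advisor TEXT);
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       year INTEGER, major TEXT, gpa REAL);
-- Placeholder department and advisor values (assumed).
INSERT INTO majors VALUES ('CS', 'Dept A', 'Advisor A'),
                          ('Math', 'Dept B', 'Advisor B');
-- Names and GPAs follow the reveal; the year values are assumed.
INSERT INTO students (name, year, major, gpa) VALUES
  ('Mango', 3, 'CS', 3.9), ('Bananito', 2, 'CS', 3.7),
  ('Coconick', 4, 'CS', 3.4), ('Watermelina', 2, 'Math', 3.6),
  ('Grapenzo', 3, 'Math', 3.5);
""")

QUERY = """
SELECT s.major AS major, m.department, m.advisor,
       AVG(s.gpa) AS avg_gpa, COUNT(*) AS n
FROM students AS s
JOIN majors AS m ON s.major = m.code
WHERE s.year >= 2
GROUP BY {group_cols}
HAVING COUNT(*) >= 2
ORDER BY avg_gpa DESC
"""

def run(group_cols):
    rows = conn.execute(QUERY.format(group_cols=group_cols)).fetchall()
    # Round the average so the comparison ignores float noise.
    return [(major, dept, adv, round(avg, 2), n)
            for major, dept, adv, avg, n in rows]

portable = run("s.major, m.department, m.advisor")  # portable form
sqlite_only = run("s.major")                        # bare columns, SQLite accepts

print(portable)
print(portable == sqlite_only)
```

The two results agree here only because every row in a group carries the same department and advisor. That is the functional dependency doing the work, not a guarantee SQLite makes for bare columns in general.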

Step 7 — Knowledge Check

Min. score: 80%

1. Two engineers debate. Engineer A writes:

SELECT major, AVG(gpa) FROM students GROUP BY major HAVING COUNT(*) >= 2;

Engineer B writes:

SELECT major, AVG(gpa) FROM students WHERE COUNT(*) >= 2 GROUP BY major;

Which one runs correctly?

A — WHERE cannot use aggregates; that filter belongs in HAVING
B — WHERE runs first, so it filters faster than HAVING does
WHERE does run before grouping — and that’s exactly why it can’t reference COUNT(*). The count doesn’t exist yet at WHERE time.
Both run — SQLite is permissive about this kind of query
SQLite is more permissive than other engines on some things (e.g., the GROUP BY rule), but it still rejects aggregates in WHERE. That rejection is universal across SQL engines.
Neither — both queries are missing a required JOIN clause
Neither query needs a JOIN — students alone holds enough data. The bug is the placement of COUNT(*), not the absence of a JOIN.

Engineer A is right. WHERE COUNT(*) >= 2 always errors because aggregates don’t exist when WHERE runs (the logical execution order from Step 4). Filtering on aggregates is HAVING’s job.
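
To see the difference rather than take it on faith, here is a minimal sketch that runs both engineers' queries through Python's built-in sqlite3 module. The five rows reuse names and GPAs from the capstone reveal; the year values are assumed. The exact error text varies between SQLite versions, so the sketch only captures it instead of relying on its wording.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "year INTEGER, major TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students (name, year, major, gpa) VALUES (?, ?, ?, ?)",
    [
        # Year values are assumed for illustration.
        ("Mango", 3, "CS", 3.9),
        ("Bananito", 2, "CS", 3.7),
        ("Coconick", 4, "CS", 3.4),
        ("Watermelina", 2, "Math", 3.6),
        ("Grapenzo", 3, "Math", 3.5),
    ],
)

engineer_a = ("SELECT major, AVG(gpa) FROM students "
              "GROUP BY major HAVING COUNT(*) >= 2")
engineer_b = ("SELECT major, AVG(gpa) FROM students "
              "WHERE COUNT(*) >= 2 GROUP BY major")

# Engineer A: the aggregate filter sits in HAVING, so the query runs.
rows_a = conn.execute(engineer_a).fetchall()

# Engineer B: the aggregate sits in WHERE, so SQLite refuses the query.
try:
    conn.execute(engineer_b)
    b_error = None
except sqlite3.OperationalError as exc:
    b_error = str(exc)

print("A returned", len(rows_a), "rows")
print("B failed with:", b_error)
```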

2. [Review from Step 3] In your capstone query, you wrote WHERE year >= 2. If a future student has year set to NULL, are they included in the result?

Yes — SQL treats NULL as the integer value 0 for comparisons
SQL treats NULL as unknown, not zero. Spreadsheet intuition (blank → 0) doesn’t carry over.
No — NULL >= 2 is UNKNOWN, and WHERE discards UNKNOWN rows
Yes — WHERE keeps NULL rows by default unless told otherwise
There’s no NULL-pass-through default. WHERE applies its predicate to every row, and a comparison with a NULL operand always produces UNKNOWN.
It depends on the database engine’s NULL-handling configuration
Three-valued logic (3VL) is part of the ANSI SQL standard. Every major engine treats NULL comparisons identically; it’s one of the most consistent parts of SQL.

Three-Valued Logic (Step 3): comparing NULL to anything yields UNKNOWN, and WHERE keeps only TRUE rows. The principle matters whenever a filtered column is nullable.
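
A small experiment makes the rule concrete. This is a sketch under stated assumptions: it uses Python's built-in sqlite3 module, Mango's year is assumed, and "Future Student" is a hypothetical row with year set to NULL. It shows that the row is skipped by both the filter and its negation, and that it only appears when NULL is handled explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "year INTEGER, major TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students (name, year, major, gpa) VALUES (?, ?, ?, ?)",
    [
        ("Mango", 3, "CS", 3.9),              # year value is assumed
        ("Future Student", None, "CS", 3.0),  # hypothetical NULL-year row
    ],
)

def names(where):
    """Return the names kept by the given WHERE predicate, sorted."""
    sql = "SELECT name FROM students WHERE " + where + " ORDER BY name"
    return [name for (name,) in conn.execute(sql)]

# The comparison itself evaluates to NULL (Python shows it as None).
null_cmp = conn.execute("SELECT NULL >= 2").fetchone()[0]

kept = names("year >= 2")                   # UNKNOWN rows are discarded
kept_negated = names("NOT (year >= 2)")     # NOT UNKNOWN is still UNKNOWN
kept_explicit = names("year >= 2 OR year IS NULL")

print(null_cmp, kept, kept_negated, kept_explicit)
```

The negated filter is the telling case: the NULL-year student is in neither result, which cannot happen under ordinary two-valued logic.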

3. [Review from Step 4] The grouping rule states that every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate. Why?

It’s a stylistic convention to keep queries readable
It’s a semantic requirement, not stylistic. Standard-compliant engines reject violations; SQLite silently picks an arbitrary value, which is worse.
Without it, multiple values map to one group with no defined choice — the result is undefined
GROUP BY only supports indexed columns
GROUP BY works on any column type. Indexes affect performance, not whether grouping is allowed.
The SQL standard prohibits mixing aggregates and non-aggregates in the same SELECT
The standard does allow mixing — that’s the entire point of GROUP BY. The constraint is that non-aggregate columns must be in GROUP BY too.

If a column is neither grouped nor aggregated, multiple row values collapse into one group with no rule for which to return. Standard SQL rejects this; SQLite picks arbitrarily — silent corruption, not a feature.
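
The "no rule for which to return" problem is easy to observe in SQLite. In this minimal sketch (Python's built-in sqlite3 module; names and GPAs from the capstone reveal, year values assumed), name is neither grouped nor aggregated. SQLite runs the query anyway and hands back one name per major, and the sketch checks only that each returned name is one of several candidates, because which one you get is not something to rely on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "year INTEGER, major TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO students (name, year, major, gpa) VALUES (?, ?, ?, ?)",
    [
        # Year values are assumed for illustration.
        ("Mango", 3, "CS", 3.9),
        ("Bananito", 2, "CS", 3.7),
        ("Coconick", 4, "CS", 3.4),
        ("Watermelina", 2, "Math", 3.6),
        ("Grapenzo", 3, "Math", 3.5),
    ],
)

# Every name that could legitimately represent each major.
candidates = {
    "CS": {"Mango", "Bananito", "Coconick"},
    "Math": {"Watermelina", "Grapenzo"},
}

# name is a bare column: not in GROUP BY, not inside an aggregate.
rows = conn.execute(
    "SELECT major, name FROM students GROUP BY major"
).fetchall()

for major, name in rows:
    print(major, "->", name, "(1 of", len(candidates[major]), "possible names)")
```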

4. Arrange the clauses of your capstone query in the order the database actually executes them. (arrange in order)

Correct order:

FROM students JOIN majors
WHERE year >= 2
GROUP BY major
HAVING COUNT(*) >= 2
SELECT major, AVG(gpa), COUNT(*)
ORDER BY avg_gpa DESC

Distractors (not used):

WHERE year >= 2 AND COUNT(*) >= 2
SORT BY avg_gpa DESC

FROM/JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. WHERE runs before GROUP BY, which is why aggregates belong in HAVING (after grouping), never in WHERE. SELECT is written first but evaluated fifth, which is why ORDER BY, the only clause that runs after it, can sort by the avg_gpa alias. The distractors fail two ways: mixing an aggregate into WHERE, and using SORT BY (not valid SQL).
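
One way to internalize the order is to spell it out imperatively, one clause per step. The sketch below is plain Python with no database involved; it models the logical order only and says nothing about how a real engine plans the query. The year values, the placeholder department and advisor strings, and the two "Hypothetical" rows are assumptions; the other names and GPAs come from the capstone reveal.

```python
from collections import defaultdict

students = [
    {"name": "Mango", "year": 3, "major": "CS", "gpa": 3.9},
    {"name": "Bananito", "year": 2, "major": "CS", "gpa": 3.7},
    {"name": "Coconick", "year": 4, "major": "CS", "gpa": 3.4},
    {"name": "Watermelina", "year": 2, "major": "Math", "gpa": 3.6},
    {"name": "Grapenzo", "year": 3, "major": "Math", "gpa": 3.5},
    {"name": "Hypothetical E", "year": 2, "major": "Engineering", "gpa": 3.0},
    {"name": "Hypothetical N", "year": 3, "major": None, "gpa": 3.2},
]
majors = {  # code -> (department, advisor); placeholder values
    "CS": ("Dept A", "Advisor A"),
    "Math": ("Dept B", "Advisor B"),
    "Engineering": ("Dept C", "Advisor C"),
}

# 1. FROM ... JOIN: an inner join keeps only students whose major matches.
joined = [
    dict(s, department=majors[s["major"]][0], advisor=majors[s["major"]][1])
    for s in students
    if s["major"] in majors
]

# 2. WHERE: filter individual rows (a NULL year never passes).
filtered = [r for r in joined if r["year"] is not None and r["year"] >= 2]

# 3. GROUP BY: bucket the surviving rows.
groups = defaultdict(list)
for r in filtered:
    groups[(r["major"], r["department"], r["advisor"])].append(r)

# 4. HAVING: filter whole buckets; counts exist only from this point on.
kept = {key: rows for key, rows in groups.items() if len(rows) >= 2}

# 5. SELECT: compute the output columns, including the aggregates.
selected = [
    (*key, sum(r["gpa"] for r in rows) / len(rows), len(rows))
    for key, rows in kept.items()
]

# 6. ORDER BY: sort the finished rows by avg_gpa, highest first.
result = sorted(selected, key=lambda row: row[3], reverse=True)

for row in result:
    print(row)
```

Trying to move the count check from step 4 into step 2 fails for the same reason it fails in SQL: at step 2 there are no buckets yet to count.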
