Step 1

Exploring Data

Describe What You Want, Not How

Learning objective: Retrieve the slice of a dataset a caller needs, and explain why describing what you want (declarative) scales better than describing how to compute it (imperative).

In Python or JavaScript, you tell the machine how to solve a problem step by step. In SQL, you describe what you want and the engine figures out the rest. If this imperative → declarative shift feels strange, that is normal.
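The shift is easiest to see side by side. A minimal Python sketch (the two-row dataset is invented for illustration): the imperative version spells out the loop and the accumulator; the SQL underneath states only the result.

```python
# Imperative (Python): say HOW -- loop, test, accumulate.
# The two rows below are invented stand-ins, not real course data.
students = [{"name": "Mango", "gpa": 3.9}, {"name": "Kiwi", "gpa": 3.5}]

high = []
for s in students:
    if s["gpa"] >= 3.7:
        high.append(s["name"])

print(high)  # ['Mango']

# Declarative (SQL): say WHAT, and let the engine pick the plan:
#   SELECT name FROM students WHERE gpa >= 3.7;
```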

The students table

A students table is set up for you (9 rows):

Column | Type | Constraints
id | INTEGER | PRIMARY KEY
name | TEXT | NOT NULL
year | INTEGER |
major | TEXT |
gpa | REAL |

SELECT — reading data

SELECT is projection — it chooses which columns of a table to return. (To choose which rows, you’ll use WHERE in Step 3.)

-- Explore the whole table (good for learning, not production)
SELECT * FROM students;

-- Project only the columns the caller needs
SELECT name, gpa FROM students;

SELECT names the columns, FROM names the table, * means “all columns”. Use SELECT * while exploring; in production code, always list the columns you need so schema changes do not silently break callers.
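Projection is easy to verify from Python. A minimal sketch using the standard-library sqlite3 module; the in-memory table mirrors the schema above, but the three rows are invented stand-ins for the tutorial's 9-row seed data:

```python
import sqlite3

# In-memory table shaped like the tutorial's students table.
# The three rows are invented, not the real seed data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL,
    year INTEGER, major TEXT, gpa REAL)""")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    (1, "Mango", 3, "CS", 3.9),
    (2, "Kiwi", 2, "Math", 3.5),
    (3, "Blue Bella", 1, None, 3.2),
])

# SELECT * returns every column -- fine while exploring.
rows = conn.execute("SELECT * FROM students").fetchall()
print(len(rows[0]))  # 5 columns per row

# Projecting only what the caller needs keeps callers stable
# if the schema later grows new columns.
pairs = conn.execute("SELECT name, gpa FROM students").fetchall()
print(pairs[0])      # ('Mango', 3.9)
```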

Task

Open query.sql. It contains two queries; neither raises an error, but each is the wrong tool for its stated job.

  • Query 1 — exploration. SELECT * is correct here. Predict the row and column count as a comment, then Run.
  • Query 2 — production. Tighten it to return only name and major.

Investigate (optional). After both pass, append SELECT major, name FROM students. Does the column order in the output change?

Translating to Python or pandas

SELECT is a projection — like [s['name'] for s in students] in Python or .map(s => s.name) in JavaScript. The pandas mapping:

SQL | pandas
SELECT * FROM students | students
SELECT name, gpa FROM students | students[['name','gpa']]

Two things that differ from a Python list: (1) rows come back in no guaranteed order — there is no “row 0”; if order matters, ask for it with ORDER BY (Step 3). (2) Every row has exactly the declared columns and types — values can be NULL, but a column cannot disappear.

How this tutorial works (PRIMM)

Every step uses the same rhythm: Predict what a query will do, Run to compare, Investigate with a tweak, then Modify to fix the bug. The hard skill is diagnosing why a query is wrong — not recalling syntax.

Starter files
query.sql
-- Step 1 — Exploring data with SELECT
-- Two queries below. Predict, then run and verify.

-- Query 1: exploration. Runs correctly.
-- Predict the row count and column count on the next line.
SELECT * FROM students;

-- Query 2: works, but not production-quality.
-- It fetches more columns than the caller needs.
-- Tighten it to return only name and major, and add a one-line
-- comment explaining why specific columns are safer than SELECT *.
SELECT id, name, year, major, gpa FROM students;
Step 2

Joining Tables

JOIN — Combining Tables by a Shared Key

Learning objective: Answer questions that span two tables (e.g. “who is Mango’s advisor?”) by combining rows on a shared key, and recognize when a query has accidentally paired every row with every row.

Real data is rarely in one table. The students table has a major column, but each major’s department and advisor live in a separate majors table:

majors.code | majors.department | majors.advisor
CS | Engineering | Dr. Hopper
Math | Sciences | Dr. Noether
Physics | Sciences | Dr. Curie
Bio | Life Sciences | Dr. McClintock

To answer “who is Mango’s advisor?”, you need both tables. JOIN combines them on a shared column — here, students.major = majors.code.

JOIN syntax

JOIN is relational join — it combines rows from two tables that share a key, producing one output row per matching pair.

SELECT s.name, m.department, m.advisor
FROM students AS s
JOIN majors AS m ON s.major = m.code;
  • Aliases (AS s, AS m) — technically optional, but use them as soon as two tables are in play; for self-joins they are required.
  • ON — the predicate that says how the two tables line up. Evaluated per pair.
  • Match-only — a row pair survives only when ON evaluates to TRUE; everything else is dropped. Blue Bella (NULL major) and Bio (no students) both drop out.

The Cartesian-product trap

Without ON, the database pairs every row with every row. 9 students × 4 majors = 36 rows of nonsense.

-- WRONG: no ON clause — produces 9 × 4 = 36 rows
SELECT s.name, m.advisor FROM students s, majors m;

The old comma-join syntax (FROM a, b WHERE a.x = b.y) makes this mistake easy: delete the WHERE during a refactor and you silently ship a Cartesian product. Use explicit JOIN ... ON ... so the relationship and the filter stay separate.
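The trap is easy to reproduce. A sketch with Python's sqlite3 and tiny invented tables (2 students, 3 majors, so the nonsense product is 6 rows rather than 36):

```python
import sqlite3

# Tiny invented tables -- 2 students, 3 majors -- to show the trap small.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (name TEXT, major TEXT);
CREATE TABLE majors (code TEXT, advisor TEXT);
INSERT INTO students VALUES ('Mango', 'CS'), ('Kiwi', 'Math');
INSERT INTO majors VALUES ('CS', 'Dr. Hopper'), ('Math', 'Dr. Noether'),
                          ('Bio', 'Dr. McClintock');
""")

# No ON clause: every student paired with every major.
cross = conn.execute(
    "SELECT s.name, m.advisor FROM students s, majors m").fetchall()
print(len(cross))   # 2 x 3 = 6 rows of nonsense

# Explicit JOIN ... ON: one row per matching pair.
joined = conn.execute(
    "SELECT s.name, m.advisor FROM students s "
    "JOIN majors m ON s.major = m.code").fetchall()
print(len(joined))  # 2 rows -- Bio, with no students, drops out
```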

Task

joins.sql has two queries — one silently broken, one that runs but leaks too much data.

  • Query 1 — the Cartesian trap. Returns 36 rows instead of ~8. Convert it to explicit JOIN ... ON ....
  • Query 2 — over-selection. Tighten it to project just the student name and their advisor, using aliases.

Predict before you Run: how many rows come back from the corrected Query 1? Remember Blue Bella (NULL major) and Bio (no students).

Investigate (optional). After Query 1 passes, is Blue Bella in the result? Why not?

🎓 What about LEFT JOIN?

LEFT JOIN keeps unmatched rows from the left side — useful when you want to include Blue Bella even without a major. Not needed for this step; flagging the keyword so it rings a bell later.

Starter files
joins.sql
-- Step 2 — Joining students to majors
-- Predict row counts before you run.

-- Query 1: intended "student name + advisor" — returns 36 rows.
-- Identify the bug, then rewrite as an explicit JOIN ... ON.
SELECT s.name, m.advisor
FROM students s, majors m;

-- Query 2: works but over-selects.
-- Tighten to project only student name and advisor, using aliases.
SELECT *
FROM students
JOIN majors ON students.major = majors.code;
Step 3

Filtering & Sorting

WHERE and ORDER BY — The filter() and sort() of SQL

Learning objective: Narrow a result set to exactly the rows that answer a question, order the output, and avoid the silent bugs that unknown values (NULL) introduce into filters.

NULL is genuinely difficult — experienced developers still get tripped up by it. Plan to spend a little extra time on this step.

WHERE predicates

WHERE is row selection — it keeps the rows where a predicate evaluates to TRUE, and discards the rest.

WHERE gpa >= 3.5                        -- numeric comparison
WHERE major = 'CS'                      -- string equality (single quotes!)
WHERE major = 'CS' AND gpa > 3.7        -- compound predicate
WHERE major IN ('CS', 'Math')           -- set membership
WHERE name LIKE 'A%'                    -- pattern match (% = any characters)

Two syntax traps for Python/JS developers:

  • Single quotes for strings. 'CS', not "CS". Double quotes quote identifiers in SQL.
  • Single = for equality. =, not ==. WHERE major == 'CS' is a syntax error in standard SQL (SQLite happens to tolerate ==, but most engines do not).

NULL and Three-Valued Logic — the biggest trap

SQL booleans are three-valued: TRUE, FALSE, UNKNOWN. NULL means unknown value — comparing anything to NULL (even another NULL) yields UNKNOWN, and WHERE discards UNKNOWN rows.

WHERE major != 'CS'                       -- WRONG: silently drops NULL-major rows
WHERE major != 'CS' OR major IS NULL      -- RIGHT: handle NULL explicitly

WHERE major IS NULL       -- value is unknown
WHERE major IS NOT NULL   -- value is known

In Python, None != "Active" is True. In SQL, NULL != 'Active' is UNKNOWN — and the row is dropped.
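The same contrast, runnable with Python's sqlite3 (rows invented; one student deliberately has a NULL major):

```python
import sqlite3

# Invented rows; Blue Bella's major is NULL on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, major TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [
    ("Mango", "CS"), ("Kiwi", "Math"), ("Blue Bella", None),
])

# NULL != 'CS' evaluates to UNKNOWN, and WHERE drops UNKNOWN rows.
wrong = conn.execute(
    "SELECT name FROM students WHERE major != 'CS'").fetchall()
print([r[0] for r in wrong])   # ['Kiwi'] -- Blue Bella silently gone

# Handle the unknown case explicitly.
right = conn.execute(
    "SELECT name FROM students WHERE major != 'CS' OR major IS NULL"
).fetchall()
print([r[0] for r in right])   # ['Kiwi', 'Blue Bella']
```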

ORDER BY

ORDER BY is sorting — it imposes order on the result for presentation. It’s also the only guarantee of row order; without it, order is whatever the engine finds convenient.

SELECT name, gpa FROM students
ORDER BY gpa DESC;   -- highest first (ASC is the default)

Task

filter.sql has three broken queries. Each fails differently. For each:

  1. Predict what it actually returns — or what error it raises.
  2. Classify the bug: syntax (grammar), semantic (valid SQL, wrong reference), or logical (runs fine, wrong rows). Write the classification as a comment.
  3. Fix the query.

Query 3 looks easy — the bug only surfaces when you reason carefully about NULL.

Investigate. After Query 3 passes, comment out your IS NULL branch and re-run. Where did Blue Bella go? This one-line toggle makes the NULL trap concrete.

Translating to JavaScript or pandas
JavaScript | pandas | SQL
arr.filter(x => x.gpa > 3.5) | df[df['gpa'] > 3.5] | WHERE gpa > 3.5
arr.map(x => x.name) | df[['name']] | SELECT name
arr.sort((a,b) => b.gpa - a.gpa) | df.sort_values('gpa', ascending=False) | ORDER BY gpa DESC

A SELECT is roughly .filter().map().sort() chained — but declarative.

pandas’ NaN behaves closer to SQL (df[df['col'] != 'CS'] excludes NaN). The idiomatic test is df['col'].isna() — the pandas analog of col IS NULL.

Starter files
filter.sql
-- Step 3 — Filtering rows with WHERE and sorting with ORDER BY
-- Three broken queries. Classify each as syntax / semantic / logical, then fix.

-- Query 1. Intended: CS students with GPA above 3.5, highest GPA first.
-- Runs without error, but returns the wrong rows in the wrong order.
-- Two independent bugs hide here. Find both, classify them, then fix.
SELECT * FROM students
WHERE major = 'CS' OR gpa > 3.5
ORDER BY gpa ASC;

-- Query 2. Intended: students in year 1 or year 2.
-- Runs without error. Why does it return zero rows?
SELECT * FROM students
WHERE year = 1 AND year = 2;

-- Query 3. Intended: students whose major is NOT 'CS'.
-- Runs without error, but quietly drops a student. Who, and why?
-- Hint: one student has a NULL major. Reason about 3VL before fixing.
SELECT * FROM students
WHERE major != 'CS';
Step 4

Aggregating Data

GROUP BY — The reduce() of SQL

Learning objective: Turn row-level data into per-category summaries (counts, averages, totals), and distinguish filtering individual rows from filtering whole groups.

Every time Spotify shows “Your top artists this month” or Instagram shows “4 people liked this” — that is a GROUP BY + COUNT running against millions of rows in milliseconds.

Logical execution order

SQL clauses do not run top-to-bottom as written:

1. FROM      — which tables to read
2. WHERE     — filter rows
3. GROUP BY  — group rows
4. HAVING    — filter groups
5. SELECT    — project columns  ← written first, evaluated fifth
6. ORDER BY  — sort
7. LIMIT     — truncate

This explains (a) why you cannot use a SELECT alias inside WHERE — the alias does not exist yet — and (b) why WHERE cannot contain an aggregate — the groups do not exist yet. Keep this list visible while you work the task — even experienced developers reference it.

GROUP BY syntax

GROUP BY is aggregation — it partitions rows into buckets per category, then applies summary functions (COUNT, AVG, SUM, MIN, MAX) to each bucket. The result has one row per bucket, not per original row.

SELECT   major,
         COUNT(*) AS student_count,
         AVG(gpa)  AS avg_gpa
FROM     students
GROUP BY major;

The GROUP BY rule

Every column in SELECT must either (1) appear in GROUP BY, or (2) be wrapped in an aggregate (COUNT, AVG, SUM, MIN, MAX).

-- WRONG: name is neither grouped nor aggregated
SELECT major, name, COUNT(*) FROM students GROUP BY major;

-- RIGHT
SELECT major, COUNT(*) AS n FROM students GROUP BY major;

SQLite silently returns an arbitrary name value from each group here — PostgreSQL and strict-mode MySQL reject the query. Treat SQLite’s quiet acceptance as a trap, not a blessing.

WHERE vs HAVING

HAVING is group-level filtering — the aggregate-aware analog of WHERE. WHERE filters individual rows before grouping; HAVING filters groups after aggregation. Aggregates cannot appear in WHERE because the groups don’t exist yet.

SELECT   major, COUNT(*) AS n
FROM     students
WHERE    year > 1             -- rows first
GROUP BY major
HAVING   COUNT(*) >= 2;       -- groups after

COUNT(*) counts every row in the group; COUNT(gpa) counts only rows where gpa is not NULL.
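Both rules can be exercised in one sketch with Python's sqlite3 (invented rows; one CS student deliberately has a NULL gpa):

```python
import sqlite3

# Invented rows; the second CS student has a NULL gpa on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (major TEXT, year INTEGER, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    ("CS", 3, 3.9), ("CS", 2, None), ("CS", 1, 3.1), ("Math", 2, 3.5),
])

# WHERE trims rows first; HAVING trims groups after aggregation.
rows = conn.execute("""
    SELECT major, COUNT(*), COUNT(gpa)
    FROM students
    WHERE year > 1
    GROUP BY major
    HAVING COUNT(*) >= 2
""").fetchall()
print(rows)   # [('CS', 2, 1)] -- COUNT(gpa) skipped the NULL

# An aggregate inside WHERE is rejected outright: the groups it
# would summarize do not exist yet at WHERE time.
try:
    conn.execute("SELECT major FROM students WHERE AVG(gpa) > 3.3")
    aggregate_in_where_ok = True
except sqlite3.OperationalError:
    aggregate_in_where_ok = False
print(aggregate_in_where_ok)   # False
```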

Task

aggregate.sql has two queries, each making one of the two most common aggregation mistakes (illegal grouping; WHERE-with-aggregate). For each:

  1. Plan before typing: write a one-line comment with the clauses in evaluation order, e.g. FROM students → WHERE year>1 → GROUP BY major → HAVING count ≥ 2 → SELECT major, count. Even 30 seconds of planning catches most aggregation bugs before you write them.
  2. Predict what happens — a crash, a silent wrong answer, or a confusing SQLite-lenient result.
  3. Explain in a one-line comment why it is wrong, then fix it.

Investigate. After you pass, temporarily change your fixed Query 2 back to WHERE AVG(gpa) > 3.3 and read the exact error message. You want to recognize it instantly in a future code review.

Translating to pandas
SQL | pandas
SELECT major, COUNT(*) FROM students GROUP BY major | students.groupby('major').size()
SELECT major, AVG(gpa) FROM students GROUP BY major | students.groupby('major')['gpa'].mean()
... HAVING AVG(gpa) > 3.3 | result[result > 3.3] (filter after .mean())

JavaScript’s .reduce() collapses a collection into a single value. SQL’s GROUP BY does the same — but per group, in one query.

Starter files
aggregate.sql
-- Step 4 — Aggregation with GROUP BY and HAVING
-- Plan each query in one comment line before you fix it.

-- Query 1. Intended: count students per major, most students first.
-- Violates the grouping rule — why? What does SQLite do about it?
SELECT major, name, COUNT(*) AS student_count
FROM students
GROUP BY major
ORDER BY student_count DESC;

-- Query 2. Intended: average GPA per major, only majors with avg GPA > 3.3.
-- Uses the wrong clause for an aggregate condition. Fix it.
SELECT major, AVG(gpa) AS avg_gpa
FROM students
WHERE AVG(gpa) > 3.3
GROUP BY major;
Step 5

Designing Your Own Table

CREATE TABLE and INSERT INTO

Learning objective: Design a table that protects its own integrity — so bad data cannot enter in the first place — and classify query errors so you can debug them efficiently.

You have been querying an existing table for four steps. Now it is time to build your own. If constraint syntax feels fiddly at first, that is normal — the payoff is that the database catches bad data before it becomes a bug.

CREATE TABLE

CREATE TABLE is schema declaration — it defines the shape and integrity rules of a table (columns, types, constraints), not the data. Like declaring a struct, not creating an instance.

CREATE TABLE books (
  id     INTEGER PRIMARY KEY,    -- identity: unique + non-null
  title  TEXT    NOT NULL,       -- required
  author TEXT    NOT NULL,       -- required
  pages  INTEGER                 -- optional (nullable by default)
);
  • PRIMARY KEY = unique + non-null + indexed.
  • NOT NULL = required.
  • Types: INTEGER, REAL, TEXT, BLOB.

SQLite quirk: standard SQL makes PRIMARY KEY imply NOT NULL; SQLite does not. Add NOT NULL explicitly if portability matters.

INSERT INTO

INSERT INTO is row insertion — it adds new rows that must conform to the schema’s types and constraints.

INSERT INTO books (id, title, author, pages) VALUES
  (1, 'SICP', 'Abelson & Sussman', 657);

Omitting a NOT NULL column (with no default) fails. Nullable columns can be skipped. After any insert/update, run a quick SELECT to verify.

Four kinds of SQL errors

  1. Syntax — grammar violated. Refuses to run.
  2. Semantic — valid grammar, references something that does not exist (wrong column name). Errors at parse time.
  3. Logical — runs and returns a result, but the wrong result (e.g., the NULL trap from Step 3). No error message — the hardest to catch.
  4. Constraint violation — grammatically and semantically valid, but violates PRIMARY KEY / NOT NULL / UNIQUE / FOREIGN KEY at execution time.
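Category 4 is the one a strong schema buys you. A sketch with Python's sqlite3 showing two constraint violations stopped at execution time (table and rows invented):

```python
import sqlite3

# The database as gatekeeper: both bad inserts below are valid SQL
# that the schema rejects at execution time. sqlite3 surfaces them
# as IntegrityError. Table and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE books (
    id INTEGER PRIMARY KEY NOT NULL,
    title TEXT NOT NULL,
    pages INTEGER)""")
conn.execute("INSERT INTO books VALUES (1, 'SICP', 657)")

failures = []
for stmt in [
    "INSERT INTO books (id, pages) VALUES (2, 300)",      # NOT NULL: no title
    "INSERT INTO books VALUES (1, 'Duplicate id', 100)",  # PRIMARY KEY reused
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError:
        failures.append(stmt)
print(len(failures))   # 2 -- both rejected before bad data lands
```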

Task

A teammate handed you two unreviewed files. Bring them up to production quality.

1. schema.sql — strengthen. It compiles, but has no integrity constraints. Add the missing types + constraints:

Column | Type | Constraints
id | INTEGER | PRIMARY KEY
code | TEXT | NOT NULL
name | TEXT | NOT NULL
credits | INTEGER | (nullable — leave as is)

Above each constraint, write a one-line comment naming the bug it prevents (e.g., -- prevents duplicate course IDs).

2. populate.sql — classify and repair. Three INSERT statements. Two fail against the strengthened schema. For each failing insert, decide which error category fits, then fix it (or replace with a valid row). End with SELECT * FROM courses;. Aim for at least 3 rows.

Translating to pandas
SQL | pandas
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, ...) | books = pd.DataFrame(columns=['id', 'title', 'author', 'pages'])
INSERT INTO books VALUES (1, 'SICP', 'Abelson', 657) | books.loc[len(books)] = [1, 'SICP', 'Abelson', 657]

pandas accepts anything silently — wrong types, missing required fields. SQL enforces the schema and catches those bugs at insert time.

Starter files
schema.sql
-- Step 5a — Designing a schema that protects its own data.
-- The CREATE below runs, but it has NO constraints.
-- Add the missing types + constraints from the table in the instructions.
-- For each constraint you add, leave a one-line comment naming the bug it prevents.
DROP TABLE IF EXISTS courses;

CREATE TABLE courses (
  id,
  code,
  name,
  credits
);
populate.sql
-- Step 5b — Populating the strengthened schema.
-- Classify each broken INSERT (syntax / semantic / logical / constraint violation), then fix.
DELETE FROM courses;

-- Insert A — the model row. Matches the schema exactly.
INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'CS 35L', 'Software Construction', 4);

-- Insert B — predict what breaks when the schema enforces NOT NULL
INSERT INTO courses (id, code, credits) VALUES
  (2, 'CS 111', 4);

-- Insert C — all values look plausible. What does the strengthened schema
-- reject here? Look carefully at how this row compares to Insert A.
INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'MATH 61', 'Discrete Structures', 4);

-- Verify your data. Leave this line as the last statement.
SELECT * FROM courses;
Step 6

Modifying & Cleaning Up

UPDATE, DELETE, and DROP TABLE

Learning objective: Change or remove data without damaging the rest of the database, recognize destructive statements before they run, and apply the professional habit of previewing impact before committing.

Destructive SQL is where careful habits pay off. If you have ever run a shell command and watched the wrong files disappear, you already know the feeling — these patterns exist so you do not have it in production.

Safe failure experiment

Run SELECT * FROM students; and note the row count. Now run DELETE FROM students; (no WHERE). Run the SELECT again — the table is empty. The page reload restores the seed data; production has no reload button.

UPDATE, DELETE, DROP

All three are destructive — they change database state, and without a backup the change is irreversible.

  • UPDATE modifies values in existing rows. Scoped by WHERE.
  • DELETE removes rows. Scoped by WHERE. Schema stays.
  • DROP TABLE removes the table itself — data and schema. No undo.
UPDATE students SET gpa = 3.95 WHERE name = 'Mango';
DELETE FROM students WHERE gpa < 3.0;
DROP TABLE IF EXISTS courses;

Without WHERE, UPDATE modifies every row and DELETE empties the table. Safe habit: write SELECT ... WHERE ... first, then convert.
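The preview habit, sketched with Python's sqlite3 (rows invented): run the predicate as a SELECT COUNT(*) first, then reuse it verbatim in the destructive statement and compare the counts:

```python
import sqlite3

# Preview-then-convert on an invented three-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, year INTEGER, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    ("Mango", 3, 3.9), ("Kiwi", 2, 3.5), ("Papaya", 4, 2.8),
])

# 1. Preview: how many rows would the delete touch?
(n,) = conn.execute(
    "SELECT COUNT(*) FROM students WHERE year = 4").fetchone()
print(n)   # matches expectation? then proceed

# 2. Same WHERE clause, destructive verb. rowcount confirms the
#    actual impact matched the preview.
cur = conn.execute("DELETE FROM students WHERE year = 4")
print(cur.rowcount)
```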

Statement | Removes | Schema remains?
DELETE FROM t WHERE ... | Matching rows | Yes
DELETE FROM t | All rows | Yes
DROP TABLE t | All rows + schema | No

Task

A teammate pushed cleanup.sql to a PR. It is scheduled to run on production in an hour. Do not approve it as-is.

For each statement:

  1. Predict what it actually does — not what the comment claims.
  2. Preview with a SELECT. Write the SELECT ... WHERE ... that would reveal the affected rows; leave it as a comment above the fixed statement so the next reviewer sees your reasoning.
  3. Fix the statement to match its stated intent.

That trail — claim, preview, confirm — is how destructive SQL actually gets approved.

Stated intents:

  1. Update Kiwi’s GPA from 3.50 to 3.60.
  2. Delete all year-4 students.
  3. Drop the courses table from Step 5.
Translating to pandas
SQL | pandas
UPDATE students SET gpa = 3.6 WHERE name = 'Kiwi' | students.loc[students['name'] == 'Kiwi', 'gpa'] = 3.6
DELETE FROM students WHERE year = 4 | students = students[students['year'] != 4]
DROP TABLE courses | del courses

SQL modifies in place; pandas often returns a new frame. In SQL, forgetting WHERE is permanent.

🎓 Where to go next

This tutorial covers the ~90% of SQL that 90% of code needs. Three topics deserve your next hour:

  • LEFT JOIN and the outer-join family. Step 2 taught JOIN, which drops rows with no match (Blue Bella with NULL major). LEFT JOIN keeps them and fills the right side with NULL.
  • Common Table Expressions (WITH ... AS (...)). Complex queries collapse from unreadable nested subqueries into top-to-bottom readable pipelines.
  • The N+1 query problem. The #1 antipattern at the app ↔ SQL boundary: looping over a list, firing one query per element. Batch with WHERE id IN (...) or an explicit JOIN.
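The batching fix can be sketched with Python's sqlite3 (table and names invented); note that the placeholder list is generated to match the number of ids:

```python
import sqlite3

# Invented table of 9 students for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(i, f"student-{i}") for i in range(1, 10)])

wanted = [2, 5, 7]

# N+1 antipattern: one round-trip per id.
one_by_one = [conn.execute(
    "SELECT name FROM students WHERE id = ?", (i,)).fetchone()[0]
    for i in wanted]

# Batched: a single query with WHERE id IN (?, ?, ?).
placeholders = ", ".join("?" * len(wanted))
batched = [r[0] for r in conn.execute(
    f"SELECT name FROM students WHERE id IN ({placeholders})", wanted)]

print(one_by_one == sorted(batched))   # same data, one query instead of N
```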
Starter files
cleanup.sql
-- Step 6 — Review a teammate's destructive PR before it runs in production.
-- For each statement: predict, preview with a commented SELECT, then fix.

-- 1. Intended: raise Kiwi's GPA to 3.6.
--    Targets the wrong student. A second trap hides here too:
--    drop the WHERE clause and every student ends up at 3.6.
UPDATE students SET gpa = 3.6 WHERE name = 'Mango';

-- 2. Intended: delete all year-4 students.
--    Wrong column referenced. Would this even run? If yes, what does it delete?
DELETE FROM students WHERE gpa = 4;

-- 3. Intended: drop the courses table.
--    Works most of the time. When would this statement crash the deploy,
--    and how do you make it safe to re-run?
DROP TABLE courses;
Step 7

Putting It Together

Capstone — One Query, Six Steps’ Worth of Tools

Learning objective: Compose a single query that combines JOIN, WHERE, GROUP BY/HAVING, aggregates, and ORDER BY to answer a real-world question — and recognize which construct each part of the question maps to.

You’ve spent six steps learning the building blocks separately. This step is composition — chaining those blocks into one query that answers a real-world question.

The dean’s request

The dean wants a snapshot of upperclassmen (year 2 and above) by major. List each major that has at least 2 such students, showing:

  • the major code
  • its department
  • the advisor
  • the average GPA among those students
  • the student count

Sort by average GPA, highest first.

Decomposing the request

Each phrase of a real-world ask maps to a SQL construct. Identifying these mappings before you type is the skill this capstone builds.

Phrase in the request | Construct
“by major” | GROUP BY (Step 4)
“year 2 and above” — row filter | WHERE (Step 3)
“department, advisor” — needs majors table | JOIN (Step 2)
“at least 2 such students” — group filter | HAVING COUNT(*) >= 2 (Step 4)
“average GPA”, “student count” | AVG(), COUNT() (Step 4)
“highest first” | ORDER BY ... DESC (Step 3)
(production hygiene) | Explicit columns + aliases (Steps 1, 2)

Plan, then type

In final.sql, sketch the clauses in execution order as a comment, then write the query. Naming the constructs you’ll need is the highest-leverage habit for aggregation queries.

One query. No SELECT *. Run before you celebrate.

Starter files
final.sql
-- Capstone: one query that combines everything from Steps 1–6.
-- Plan first as a comment in execution order:
--   FROM ... → JOIN ... → WHERE ... → GROUP BY ... → HAVING ...
--   → SELECT ... → ORDER BY ...
-- Then write the query below.