Step 1

Exploring Data

Describe What You Want, Not How

Learning objective: Retrieve the slice of a dataset a caller needs, and explain why describing what you want (declarative) scales better than describing how to compute it (imperative).

In Python or JavaScript, you tell the machine how to solve a problem step by step. In SQL, you describe what you want and the engine figures out the rest. If this imperative → declarative shift feels strange, that is normal.
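The shift is easiest to see side by side. A minimal Python sketch (the two-row dataset is invented for illustration): the imperative version spells out the loop and the accumulator; the SQL underneath states only the result.

```python
# Imperative (Python): say HOW -- loop, test, accumulate.
# The two rows below are invented stand-ins, not real course data.
students = [{"name": "Mango", "gpa": 3.9}, {"name": "Kiwi", "gpa": 3.5}]

high = []
for s in students:
    if s["gpa"] >= 3.7:
        high.append(s["name"])

print(high)  # ['Mango']

# Declarative (SQL): say WHAT, and let the engine pick the plan:
#   SELECT name FROM students WHERE gpa >= 3.7;
```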

The students table

A students table is set up for you (9 rows):

Column | Type | Constraints
id | INTEGER | PRIMARY KEY
name | TEXT | NOT NULL
year | INTEGER |
major | TEXT |
gpa | REAL |

SELECT — reading data

SELECT is projection — it chooses which columns of a table to return. (To choose which rows, you’ll use WHERE in Step 3.)

-- Explore the whole table (good for learning, not production)
SELECT * FROM students;

-- Project only the columns the caller needs
SELECT name, gpa FROM students;

SELECT names the columns, FROM names the table, * means “all columns”. Use SELECT * while exploring; in production code, always list the columns you need so schema changes do not silently break callers.
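Projection is easy to verify from Python. A minimal sketch using the standard-library sqlite3 module; the in-memory table mirrors the schema above, but the three rows are invented stand-ins for the tutorial's 9-row seed data:

```python
import sqlite3

# In-memory table shaped like the tutorial's students table.
# The three rows are invented, not the real seed data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL,
    year INTEGER, major TEXT, gpa REAL)""")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", [
    (1, "Mango", 3, "CS", 3.9),
    (2, "Kiwi", 2, "Math", 3.5),
    (3, "Blue Bella", 1, None, 3.2),
])

# SELECT * returns every column -- fine while exploring.
rows = conn.execute("SELECT * FROM students").fetchall()
print(len(rows[0]))  # 5 columns per row

# Projecting only what the caller needs keeps callers stable
# if the schema later grows new columns.
pairs = conn.execute("SELECT name, gpa FROM students").fetchall()
print(pairs[0])      # ('Mango', 3.9)
```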

Task

Open query.sql. It contains two queries; neither raises an error, but each is the wrong tool for its stated job.

  • Query 1 — exploration. SELECT * is correct here. Predict the row and column count as a comment, then Run.
  • Query 2 — production. Tighten it to return only name and major.

Investigate (optional). After both pass, append SELECT major, name FROM students. Does the column order in the output change?

Translating to Python or pandas

SELECT is a projection — like [s['name'] for s in students] in Python or .map(s => s.name) in JavaScript. The pandas mapping:

SQL | pandas
SELECT * FROM students | students
SELECT name, gpa FROM students | students[['name','gpa']]

Two things that differ from a Python list: (1) rows come back in no guaranteed order — there is no “row 0”; if order matters, ask for it with ORDER BY (Step 3). (2) Every row has exactly the declared columns and types — values can be NULL, but a column cannot disappear.

How this tutorial works (PRIMM)

Every step uses the same rhythm: Predict what a query will do, Run to compare, Investigate with a tweak, then Modify to fix the bug. The hard skill is diagnosing why a query is wrong — not recalling syntax.

Starter files
query.sql
-- Step 1 — Exploring data with SELECT
-- Two queries below. Predict, then run and verify.

-- Query 1: exploration. Runs correctly.
-- Predict the row count and column count on the next line.
SELECT * FROM students;

-- Query 2: works, but not production-quality.
-- It fetches more columns than the caller needs.
-- Tighten it to return only name and major, and add a one-line
-- comment explaining why specific columns are safer than SELECT *.
SELECT id, name, year, major, gpa FROM students;
Step 2

Joining Tables

JOIN — Combining Tables by a Shared Key

Learning objective: Answer questions that span two tables (e.g. “who is Mango’s advisor?”) by combining rows on a shared key, and recognize when a query has accidentally paired every row with every row.

Real data is rarely in one table. The students table has a major column, but each major’s department and advisor live in a separate majors table:

majors.code | majors.department | majors.advisor
CS | Engineering | Dr. Hopper
Math | Sciences | Dr. Noether
Physics | Sciences | Dr. Curie
Bio | Life Sciences | Dr. McClintock

To answer “who is Mango’s advisor?”, you need both tables. JOIN combines them on a shared column — here, students.major = majors.code.

JOIN syntax

JOIN is relational join — it combines rows from two tables that share a key, producing one output row per matching pair.

SELECT s.name, m.department, m.advisor
FROM students AS s
JOIN majors AS m ON s.major = m.code;
  • Aliases (AS s, AS m) — technically optional, but use them as soon as two tables are in play; for self-joins they are required.
  • ON — the predicate that says how the two tables line up. Evaluated per pair.
  • Match-only — a row pair survives only when ON evaluates to TRUE; everything else is dropped. Blue Bella (NULL major) and Bio (no students) both drop out.

The Cartesian-product trap

Without ON, the database pairs every row with every row. 9 students × 4 majors = 36 rows of nonsense.

-- WRONG: no ON clause — produces 9 × 4 = 36 rows
SELECT s.name, m.advisor FROM students s, majors m;

The old comma-join syntax (FROM a, b WHERE a.x = b.y) makes this mistake easy: delete the WHERE during a refactor and you silently ship a Cartesian product. Use explicit JOIN ... ON ... so the relationship and the filter stay separate.
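The trap is easy to reproduce. A sketch with Python's sqlite3 and tiny invented tables (2 students, 3 majors, so the nonsense product is 6 rows rather than 36):

```python
import sqlite3

# Tiny invented tables -- 2 students, 3 majors -- to show the trap small.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (name TEXT, major TEXT);
CREATE TABLE majors (code TEXT, advisor TEXT);
INSERT INTO students VALUES ('Mango', 'CS'), ('Kiwi', 'Math');
INSERT INTO majors VALUES ('CS', 'Dr. Hopper'), ('Math', 'Dr. Noether'),
                          ('Bio', 'Dr. McClintock');
""")

# No ON clause: every student paired with every major.
cross = conn.execute(
    "SELECT s.name, m.advisor FROM students s, majors m").fetchall()
print(len(cross))   # 2 x 3 = 6 rows of nonsense

# Explicit JOIN ... ON: one row per matching pair.
joined = conn.execute(
    "SELECT s.name, m.advisor FROM students s "
    "JOIN majors m ON s.major = m.code").fetchall()
print(len(joined))  # 2 rows -- Bio, with no students, drops out
```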

Task

joins.sql has two queries — one silently broken, one that runs but leaks too much data.

  • Query 1 — the Cartesian trap. Returns 36 rows instead of ~8. Convert it to explicit JOIN ... ON ....
  • Query 2 — over-selection. Tighten it to project just the student name and their advisor, using aliases.

Predict before you Run: how many rows come back from the corrected Query 1? Remember Blue Bella (NULL major) and Bio (no students).

Investigate (optional). After Query 1 passes, is Blue Bella in the result? Why not?

🎓 What about LEFT JOIN?

LEFT JOIN keeps unmatched rows from the left side — useful when you want to include Blue Bella even without a major. Not needed for this step; flagging the keyword so it rings a bell later.

Starter files
joins.sql
-- Step 2 — Joining students to majors
-- Predict row counts before you run.

-- Query 1: intended "student name + advisor" — returns 36 rows.
-- Identify the bug, then rewrite as an explicit JOIN ... ON.
SELECT s.name, m.advisor
FROM students s, majors m;

-- Query 2: works but over-selects.
-- Tighten to project only student name and advisor, using aliases.
SELECT *
FROM students
JOIN majors ON students.major = majors.code;
Step 3

Filtering & Sorting

WHERE and ORDER BY — The filter() and sort() of SQL

Learning objective: Narrow a result set to exactly the rows that answer a question, order the output, and avoid the silent bugs that unknown values (NULL) introduce into filters.

NULL is genuinely difficult — experienced developers still get tripped up by it. Plan to spend a little extra time on this step.

WHERE predicates

WHERE is row selection — it keeps the rows where a predicate evaluates to TRUE, and discards the rest.

WHERE gpa >= 3.5                        -- numeric comparison
WHERE major = 'CS'                      -- string equality (single quotes!)
WHERE major = 'CS' AND gpa > 3.7        -- compound predicate
WHERE major IN ('CS', 'Math')           -- set membership
WHERE name LIKE 'A%'                    -- pattern match (% = any characters)

Two syntax traps for Python/JS developers:

  • Single quotes for strings. 'CS', not "CS". Double quotes quote identifiers in SQL.
  • Single = for equality. =, not ==. WHERE major == 'CS' is a syntax error in standard SQL (SQLite happens to tolerate ==, but most engines do not).

NULL and Three-Valued Logic — the biggest trap

SQL booleans are three-valued: TRUE, FALSE, UNKNOWN. NULL means unknown value — comparing anything to NULL (even another NULL) yields UNKNOWN, and WHERE discards UNKNOWN rows.

WHERE major != 'CS'                       -- WRONG: silently drops NULL-major rows
WHERE major != 'CS' OR major IS NULL      -- RIGHT: handle NULL explicitly

WHERE major IS NULL       -- value is unknown
WHERE major IS NOT NULL   -- value is known

In Python, None != "Active" is True. In SQL, NULL != 'Active' is UNKNOWN — and the row is dropped.
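The same contrast, runnable with Python's sqlite3 (rows invented; one student deliberately has a NULL major):

```python
import sqlite3

# Invented rows; Blue Bella's major is NULL on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, major TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [
    ("Mango", "CS"), ("Kiwi", "Math"), ("Blue Bella", None),
])

# NULL != 'CS' evaluates to UNKNOWN, and WHERE drops UNKNOWN rows.
wrong = conn.execute(
    "SELECT name FROM students WHERE major != 'CS'").fetchall()
print([r[0] for r in wrong])   # ['Kiwi'] -- Blue Bella silently gone

# Handle the unknown case explicitly.
right = conn.execute(
    "SELECT name FROM students WHERE major != 'CS' OR major IS NULL"
).fetchall()
print([r[0] for r in right])   # ['Kiwi', 'Blue Bella']
```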

ORDER BY

ORDER BY is sorting — it imposes order on the result for presentation. It’s also the only guarantee of row order; without it, order is whatever the engine finds convenient.

SELECT name, gpa FROM students
ORDER BY gpa DESC;   -- highest first (ASC is the default)

Task

filter.sql has three broken queries. Each fails differently. For each:

  1. Predict what it actually returns — or what error it raises.
  2. Classify the bug: syntax (grammar), semantic (valid SQL, wrong reference), or logical (runs fine, wrong rows). Write the classification as a comment.
  3. Fix the query.

Query 3 looks easy — the bug only surfaces when you reason carefully about NULL.

Investigate. After Query 3 passes, comment out your IS NULL branch and re-run. Where did Blue Bella go? This one-line toggle makes the NULL trap concrete.

Translating to JavaScript or pandas
JavaScript | pandas | SQL
arr.filter(x => x.gpa > 3.5) | df[df['gpa'] > 3.5] | WHERE gpa > 3.5
arr.map(x => x.name) | df[['name']] | SELECT name
arr.sort((a,b) => b.gpa - a.gpa) | df.sort_values('gpa', ascending=False) | ORDER BY gpa DESC

A SELECT is roughly .filter().map().sort() chained — but declarative.

pandas’ NaN behaves closer to SQL (df[df['col'] != 'CS'] excludes NaN). The idiomatic test is df['col'].isna() — the pandas analog of col IS NULL.

Starter files
filter.sql
-- Step 3 — Filtering rows with WHERE and sorting with ORDER BY
-- Three broken queries. Classify each as syntax / semantic / logical, then fix.

-- Query 1. Intended: CS students with GPA above 3.5, highest GPA first.
-- Runs without error, but returns the wrong rows in the wrong order.
-- Two independent bugs hide here. Find both, classify them, then fix.
SELECT * FROM students
WHERE major = 'CS' OR gpa > 3.5
ORDER BY gpa ASC;

-- Query 2. Intended: students in year 1 or year 2.
-- Runs without error. Why does it return zero rows?
SELECT * FROM students
WHERE year = 1 AND year = 2;

-- Query 3. Intended: students whose major is NOT 'CS'.
-- Runs without error, but quietly drops a student. Who, and why?
-- Hint: one student has a NULL major. Reason about 3VL before fixing.
SELECT * FROM students
WHERE major != 'CS';
Step 4

Aggregating Data

GROUP BY — The reduce() of SQL

Learning objective: Turn row-level data into per-category summaries (counts, averages, totals), and distinguish filtering individual rows from filtering whole groups.

Every time Spotify shows “Your top artists this month” or Instagram shows “4 people liked this” — that is a GROUP BY + COUNT running against millions of rows in milliseconds.

Logical execution order

SQL clauses do not run top-to-bottom as written:

1. FROM      — which tables to read
2. WHERE     — filter rows
3. GROUP BY  — group rows
4. HAVING    — filter groups
5. SELECT    — project columns  ← written first, evaluated fifth
6. ORDER BY  — sort
7. LIMIT     — truncate

This explains (a) why you cannot use a SELECT alias inside WHERE — the alias does not exist yet — and (b) why WHERE cannot contain an aggregate — the groups do not exist yet. Keep this list visible while you work the task — even experienced developers reference it.

GROUP BY syntax

GROUP BY is aggregation — it partitions rows into buckets per category, then applies summary functions (COUNT, AVG, SUM, MIN, MAX) to each bucket. The result has one row per bucket, not per original row.

SELECT   major,
         COUNT(*) AS student_count,
         AVG(gpa)  AS avg_gpa
FROM     students
GROUP BY major;

The GROUP BY rule

Every column in SELECT must either (1) appear in GROUP BY, or (2) be wrapped in an aggregate (COUNT, AVG, SUM, MIN, MAX).

-- WRONG: name is neither grouped nor aggregated
SELECT major, name, COUNT(*) FROM students GROUP BY major;

-- RIGHT
SELECT major, COUNT(*) AS n FROM students GROUP BY major;

SQLite silently returns an arbitrary name value from each group here — PostgreSQL and strict-mode MySQL reject the query. Treat SQLite’s quiet acceptance as a trap, not a blessing.

WHERE vs HAVING

HAVING is group-level filtering — the aggregate-aware analog of WHERE. WHERE filters individual rows before grouping; HAVING filters groups after aggregation. Aggregates cannot appear in WHERE because the groups don’t exist yet.

SELECT   major, COUNT(*) AS n
FROM     students
WHERE    year > 1             -- rows first
GROUP BY major
HAVING   COUNT(*) >= 2;       -- groups after

COUNT(*) counts every row in the group; COUNT(gpa) counts only rows where gpa is not NULL.
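Both rules can be exercised in one sketch with Python's sqlite3 (invented rows; one CS student deliberately has a NULL gpa):

```python
import sqlite3

# Invented rows; the second CS student has a NULL gpa on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (major TEXT, year INTEGER, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    ("CS", 3, 3.9), ("CS", 2, None), ("CS", 1, 3.1), ("Math", 2, 3.5),
])

# WHERE trims rows first; HAVING trims groups after aggregation.
rows = conn.execute("""
    SELECT major, COUNT(*), COUNT(gpa)
    FROM students
    WHERE year > 1
    GROUP BY major
    HAVING COUNT(*) >= 2
""").fetchall()
print(rows)   # [('CS', 2, 1)] -- COUNT(gpa) skipped the NULL

# An aggregate inside WHERE is rejected outright: the groups it
# would summarize do not exist yet at WHERE time.
try:
    conn.execute("SELECT major FROM students WHERE AVG(gpa) > 3.3")
    aggregate_in_where_ok = True
except sqlite3.OperationalError:
    aggregate_in_where_ok = False
print(aggregate_in_where_ok)   # False
```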

Task

aggregate.sql has two queries, each making one of the two most common aggregation mistakes (illegal grouping; WHERE-with-aggregate). For each:

  1. Plan before typing: write a one-line comment with the clauses in evaluation order, e.g. FROM students → WHERE year>1 → GROUP BY major → HAVING count ≥ 2 → SELECT major, count. Even 30 seconds of planning catches most aggregation bugs before you write them.
  2. Predict what happens — a crash, a silent wrong answer, or a confusing SQLite-lenient result.
  3. Explain in a one-line comment why it is wrong, then fix it.

Investigate. After you pass, temporarily change your fixed Query 2 back to WHERE AVG(gpa) > 3.3 and read the exact error message. You want to recognize it instantly in a future code review.

Translating to pandas
SQL | pandas
SELECT major, COUNT(*) FROM students GROUP BY major | students.groupby('major').size()
SELECT major, AVG(gpa) FROM students GROUP BY major | students.groupby('major')['gpa'].mean()
... HAVING AVG(gpa) > 3.3 | result[result > 3.3] (filter after .mean())

JavaScript’s .reduce() collapses a collection into a single value. SQL’s GROUP BY does the same — but per group, in one query.

Starter files
aggregate.sql
-- Step 4 — Aggregation with GROUP BY and HAVING
-- Plan each query in one comment line before you fix it.

-- Query 1. Intended: count students per major, most students first.
-- Violates the grouping rule — why? What does SQLite do about it?
SELECT major, name, COUNT(*) AS student_count
FROM students
GROUP BY major
ORDER BY student_count DESC;

-- Query 2. Intended: average GPA per major, only majors with avg GPA > 3.3.
-- Uses the wrong clause for an aggregate condition. Fix it.
SELECT major, AVG(gpa) AS avg_gpa
FROM students
WHERE AVG(gpa) > 3.3
GROUP BY major;
Step 5

Designing Your Own Table

CREATE TABLE and INSERT INTO

Learning objective: Design a table that protects its own integrity — so bad data cannot enter in the first place — and classify query errors so you can debug them efficiently.

You have been querying an existing table for four steps. Now it is time to build your own. If constraint syntax feels fiddly at first, that is normal — the payoff is that the database catches bad data before it becomes a bug.

CREATE TABLE

CREATE TABLE is schema declaration — it defines the shape and integrity rules of a table (columns, types, constraints), not the data. Like declaring a struct, not creating an instance.

CREATE TABLE books (
  id     INTEGER PRIMARY KEY,    -- identity: unique + non-null
  title  TEXT    NOT NULL,       -- required
  author TEXT    NOT NULL,       -- required
  pages  INTEGER                 -- optional (nullable by default)
);
  • PRIMARY KEY = unique + non-null + indexed.
  • NOT NULL = required.
  • Types: INTEGER, REAL, TEXT, BLOB.

SQLite quirk: standard SQL makes PRIMARY KEY imply NOT NULL; SQLite does not. Add NOT NULL explicitly if portability matters.

INSERT INTO

INSERT INTO is row insertion — it adds new rows that must conform to the schema’s types and constraints.

INSERT INTO books (id, title, author, pages) VALUES
  (1, 'SICP', 'Abelson & Sussman', 657);

Omitting a NOT NULL column (with no default) fails. Nullable columns can be skipped. After any insert/update, run a quick SELECT to verify.

Four kinds of SQL errors

  1. Syntax — grammar violated. Refuses to run.
  2. Semantic — valid grammar, references something that does not exist (wrong column name). Errors at parse time.
  3. Logical — runs and returns a result, but the wrong result (e.g., the NULL trap from Step 3). No error message — the hardest to catch.
  4. Constraint violation — grammatically and semantically valid, but violates PRIMARY KEY / NOT NULL / UNIQUE / FOREIGN KEY at execution time.
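Category 4 is the one a strong schema buys you. A sketch with Python's sqlite3 showing two constraint violations stopped at execution time (table and rows invented):

```python
import sqlite3

# The database as gatekeeper: both bad inserts below are valid SQL
# that the schema rejects at execution time. sqlite3 surfaces them
# as IntegrityError. Table and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE books (
    id INTEGER PRIMARY KEY NOT NULL,
    title TEXT NOT NULL,
    pages INTEGER)""")
conn.execute("INSERT INTO books VALUES (1, 'SICP', 657)")

failures = []
for stmt in [
    "INSERT INTO books (id, pages) VALUES (2, 300)",      # NOT NULL: no title
    "INSERT INTO books VALUES (1, 'Duplicate id', 100)",  # PRIMARY KEY reused
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError:
        failures.append(stmt)
print(len(failures))   # 2 -- both rejected before bad data lands
```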

Task

A teammate handed you two unreviewed files. Bring them up to production quality.

1. schema.sql — strengthen. It compiles, but has no integrity constraints. Add the missing types + constraints:

Column | Type | Constraints
id | INTEGER | PRIMARY KEY
code | TEXT | NOT NULL
name | TEXT | NOT NULL
credits | INTEGER | (nullable — leave as is)

Above each constraint, write a one-line comment naming the bug it prevents (e.g., -- prevents duplicate course IDs).

2. populate.sql — classify and repair. Three INSERT statements. Two fail against the strengthened schema. For each failing insert, decide which error category fits, then fix it (or replace with a valid row). End with SELECT * FROM courses;. Aim for at least 3 rows.

Translating to pandas
SQL | pandas
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, ...) | books = pd.DataFrame(columns=['id', 'title', 'author', 'pages'])
INSERT INTO books VALUES (1, 'SICP', 'Abelson', 657) | books.loc[len(books)] = [1, 'SICP', 'Abelson', 657]

pandas accepts anything silently — wrong types, missing required fields. SQL enforces the schema and catches those bugs at insert time.

Starter files
schema.sql
-- Step 5a — Designing a schema that protects its own data.
-- The CREATE below runs, but it has NO constraints.
-- Add the missing types + constraints from the table in the instructions.
-- For each constraint you add, leave a one-line comment naming the bug it prevents.
DROP TABLE IF EXISTS courses;

CREATE TABLE courses (
  id,
  code,
  name,
  credits
);
populate.sql
-- Step 5b — Populating the strengthened schema.
-- Classify each broken INSERT (syntax / semantic / logical / constraint violation), then fix.
DELETE FROM courses;

-- Insert A — the model row. Matches the schema exactly.
INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'CS 35L', 'Software Construction', 4);

-- Insert B — predict what breaks when the schema enforces NOT NULL
INSERT INTO courses (id, code, credits) VALUES
  (2, 'CS 111', 4);

-- Insert C — all values look plausible. What does the strengthened schema
-- reject here? Look carefully at how this row compares to Insert A.
INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'MATH 61', 'Discrete Structures', 4);

-- Verify your data. Leave this line as the last statement.
SELECT * FROM courses;
Step 6

Modifying & Cleaning Up

UPDATE, DELETE, and DROP TABLE

Learning objective: Change or remove data without damaging the rest of the database, recognize destructive statements before they run, and apply the professional habit of previewing impact before committing.

Destructive SQL is where careful habits pay off. If you have ever run a shell command and watched the wrong files disappear, you already know the feeling — these patterns exist so you do not have it in production.

Safe failure experiment

Run SELECT * FROM students; and note the row count. Now run DELETE FROM students; (no WHERE). Run the SELECT again — the table is empty. The page reload restores the seed data; production has no reload button.

UPDATE, DELETE, DROP

All three are destructive — they change database state, and without a backup the change is irreversible.

  • UPDATE modifies values in existing rows. Scoped by WHERE.
  • DELETE removes rows. Scoped by WHERE. Schema stays.
  • DROP TABLE removes the table itself — data and schema. No undo.
UPDATE students SET gpa = 3.95 WHERE name = 'Mango';
DELETE FROM students WHERE gpa < 3.0;
DROP TABLE IF EXISTS courses;

Without WHERE, UPDATE modifies every row and DELETE empties the table. Safe habit: write SELECT ... WHERE ... first, then convert.
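The preview habit, sketched with Python's sqlite3 (rows invented): run the predicate as a SELECT COUNT(*) first, then reuse it verbatim in the destructive statement and compare the counts:

```python
import sqlite3

# Preview-then-convert on an invented three-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, year INTEGER, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    ("Mango", 3, 3.9), ("Kiwi", 2, 3.5), ("Papaya", 4, 2.8),
])

# 1. Preview: how many rows would the delete touch?
(n,) = conn.execute(
    "SELECT COUNT(*) FROM students WHERE year = 4").fetchone()
print(n)   # matches expectation? then proceed

# 2. Same WHERE clause, destructive verb. rowcount confirms the
#    actual impact matched the preview.
cur = conn.execute("DELETE FROM students WHERE year = 4")
print(cur.rowcount)
```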

Statement | Removes | Schema remains?
DELETE FROM t WHERE ... | Matching rows | Yes
DELETE FROM t | All rows | Yes
DROP TABLE t | All rows + schema | No

Task

A teammate pushed cleanup.sql to a PR. It is scheduled to run on production in an hour. Do not approve it as-is.

For each statement:

  1. Predict what it actually does — not what the comment claims.
  2. Preview with a SELECT. Write the SELECT ... WHERE ... that would reveal the affected rows; leave it as a comment above the fixed statement so the next reviewer sees your reasoning.
  3. Fix the statement to match its stated intent.

That trail — claim, preview, confirm — is how destructive SQL actually gets approved.

Stated intents:

  1. Update Kiwi’s GPA from 3.50 to 3.60.
  2. Delete all year-4 students.
  3. Drop the courses table from Step 5.
Translating to pandas
SQL | pandas
UPDATE students SET gpa = 3.6 WHERE name = 'Kiwi' | students.loc[students['name'] == 'Kiwi', 'gpa'] = 3.6
DELETE FROM students WHERE year = 4 | students = students[students['year'] != 4]
DROP TABLE courses | del courses

SQL modifies in place; pandas often returns a new frame. In SQL, forgetting WHERE is permanent.

🎓 Where to go next

This tutorial covers the ~90% of SQL that 90% of code needs. Three topics deserve your next hour:

  • LEFT JOIN and the outer-join family. Step 2 taught JOIN, which drops rows with no match (Blue Bella with NULL major). LEFT JOIN keeps them and fills the right side with NULL.
  • Common Table Expressions (WITH ... AS (...)). Complex queries collapse from unreadable nested subqueries into top-to-bottom readable pipelines.
  • The N+1 query problem. The #1 antipattern at the app ↔ SQL boundary: looping over a list, firing one query per element. Batch with WHERE id IN (...) or an explicit JOIN.
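The batching fix can be sketched with Python's sqlite3 (table and names invented); note that the placeholder list is generated to match the number of ids:

```python
import sqlite3

# Invented table of 9 students for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(i, f"student-{i}") for i in range(1, 10)])

wanted = [2, 5, 7]

# N+1 antipattern: one round-trip per id.
one_by_one = [conn.execute(
    "SELECT name FROM students WHERE id = ?", (i,)).fetchone()[0]
    for i in wanted]

# Batched: a single query with WHERE id IN (?, ?, ?).
placeholders = ", ".join("?" * len(wanted))
batched = [r[0] for r in conn.execute(
    f"SELECT name FROM students WHERE id IN ({placeholders})", wanted)]

print(one_by_one == sorted(batched))   # same data, one query instead of N
```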
Starter files
cleanup.sql
-- Step 6 — Review a teammate's destructive PR before it runs in production.
-- For each statement: predict, preview with a commented SELECT, then fix.

-- 1. Intended: raise Kiwi's GPA to 3.6.
--    Targets the wrong student. A second trap hides here too:
--    drop the WHERE clause and every student ends up at 3.6.
UPDATE students SET gpa = 3.6 WHERE name = 'Mango';

-- 2. Intended: delete all year-4 students.
--    Wrong column referenced. Would this even run? If yes, what does it delete?
DELETE FROM students WHERE gpa = 4;

-- 3. Intended: drop the courses table.
--    Works most of the time. When would this statement crash the deploy,
--    and how do you make it safe to re-run?
DROP TABLE courses;
Step 7

Putting It Together

Capstone — One Query, Six Steps’ Worth of Tools

Learning objective: Compose a single query that combines JOIN, WHERE, GROUP BY/HAVING, aggregates, and ORDER BY to answer a real-world question — and recognize which construct each part of the question maps to.

You’ve spent six steps learning the building blocks separately. This step is composition — chaining those blocks into one query that answers a real-world question.

The dean’s request

The dean wants a snapshot of upperclassmen (year 2 and above) by major. List each major that has at least 2 such students, showing:

  • the major code
  • its department
  • the advisor
  • the average GPA among those students
  • the student count

Sort by average GPA, highest first.

Decomposing the request

Each phrase of a real-world ask maps to a SQL construct. Identifying these mappings before you type is the skill this capstone builds.

Phrase in the request | Construct
“by major” | GROUP BY (Step 4)
“year 2 and above” — row filter | WHERE (Step 3)
“department, advisor” — needs majors table | JOIN (Step 2)
“at least 2 such students” — group filter | HAVING COUNT(*) >= 2 (Step 4)
“average GPA”, “student count” | AVG(), COUNT() (Step 4)
“highest first” | ORDER BY ... DESC (Step 3)
(production hygiene) | Explicit columns + aliases (Steps 1, 2)

Plan, then type

In final.sql, sketch the clauses in execution order as a comment, then write the query. Naming the constructs you’ll need is the highest-leverage habit for aggregation queries.

One query. No SELECT *. Run before you celebrate.

Starter files
final.sql
-- Capstone: one query that combines everything from Steps 1–6.
-- Plan first as a comment in execution order:
--   FROM ... → JOIN ... → WHERE ... → GROUP BY ... → HAVING ...
--   → SELECT ... → ORDER BY ...
-- Then write the query below.