1

Exploring Data

Why this matters

In Python or JavaScript, you tell the machine how to solve a problem step by step. In SQL, you describe what you want and the engine figures out the rest. If this imperative → declarative shift feels strange, that is normal — it is the foundational mental shift this entire tutorial is built around.

🎯 You will learn to

  • Apply SELECT to project a chosen subset of columns from a table
  • Analyze why an explicit column list is safer than SELECT * in production code

The students table

A students table is set up for you (9 rows):

Column Type Constraints
id INTEGER PRIMARY KEY
name TEXT NOT NULL
year INTEGER  
major TEXT  
gpa REAL  

SELECT — reading data

SELECT is projection — it chooses which columns of a table to return. (To choose which rows, you’ll use WHERE in Step 3.)

-- Explore the whole table (good for learning, not production)
SELECT * FROM students;

-- Project only the columns the caller needs
SELECT name, gpa FROM students;

SELECT names the columns, FROM names the table, * means “all columns”. Use SELECT * while exploring; in production code, always list the columns you need so schema changes do not silently break callers.

SQL is a *multiset*, not a set — DISTINCT removes duplicates

Set theory says {CS, CS, Math} is the same as {CS, Math} — duplicates collapse. SQL is not a set. It’s a multiset (sometimes called a “bag”): duplicates stick around unless you ask for them to be removed.

-- 9 rows: every student's major, including duplicates
SELECT major FROM students;

-- 4 rows: each distinct major listed once
SELECT DISTINCT major FROM students;

Why does this matter? If you want to count distinct majors, the obvious-looking SELECT COUNT(*) FROM students gives 9 (all rows). The intended query is SELECT COUNT(DISTINCT major) FROM students, which gives 4. Forgetting DISTINCT is one of the top-5 SQL gotchas.

Task

Open query.sql. Two queries; neither errors, but each is the wrong tool for its stated job.

  • Query 1 — exploration. SELECT * is correct here. Predict the row and column count as a comment, then Run.
  • Query 2 — production. Tighten it to return only name and major.

Investigate (optional). After both pass, append SELECT major, name FROM students. Does the column order in the output change?

Translating to Python or pandas

SELECT is a projection — like [s['name'] for s in students] or .map(s => s.name). pandas mapping:

SQL pandas
SELECT * FROM students students
SELECT name, gpa FROM students students[['name','gpa']]

Two things that differ from a Python list: (1) rows come back in no guaranteed order — there is no “row 0”; if order matters, ask for it with ORDER BY (Step 3). (2) Every row has exactly the declared columns and types — values can be NULL, but a column cannot disappear.

How this tutorial works (PRIMM)

Every step uses the same rhythm: Predict what a query will do, Run to compare, Investigate with a tweak, then Modify to fix the bug. The hard skill is diagnosing why a query is wrong — not recalling syntax.

Starter files
query.sql
-- Step 1 — Exploring data with SELECT
-- Two queries below. Predict, then run and verify.

-- Query 1: exploration. Runs correctly.
-- Predict the row count and column count on the next line.
SELECT * FROM students;

-- Query 2: works, but not production-quality.
-- It fetches more columns than the caller needs.
-- Tighten it to return only name and major, and add a one-line
-- comment explaining why specific columns are safer than SELECT *.
SELECT id, name, year, major, gpa FROM students;
2

Joining Tables

Why this matters

Real data is rarely in one table — every production query you write will stitch information together across two or more. JOIN is the workhorse, but it has a famous trap: forget the ON predicate and you silently return every row paired with every row.

🎯 You will learn to

  • Apply JOIN ... ON to combine rows from two tables on a shared key
  • Analyze a multi-table query for the Cartesian-product trap (missing ON)

Two tables, one question

The students table has a major column, but each major’s department and advisor live in a separate majors table:

majors.code majors.department majors.advisor
CS Engineering Dr. Hopper
Math Sciences Dr. Noether
Physics Sciences Dr. Curie
Bio Life Sciences Dr. McClintock

To answer “who is Mango’s advisor?”, you need both tables. JOIN combines them on a shared column — here, students.major = majors.code.

JOIN syntax

JOIN is relational join — it combines rows from two tables that share a key, producing one output row per matching pair.

SELECT s.name, m.department, m.advisor
FROM students AS s
JOIN majors AS m ON s.major = m.code;
  • Aliases (AS s, AS m) — required as soon as two tables are in play, and required for self-joins.
  • ON — the predicate that says how the two tables line up. Evaluated per pair.
  • Match-only — rows where ON is TRUE for both sides survive; the rest are dropped. Blue Bella (NULL major) and Bio (no students) both drop out.

The Cartesian-product trap

Without ON, the database pairs every row with every row. 9 students × 4 majors = 36 rows of nonsense.

-- WRONG: no ON clause — produces 9 × 4 = 36 rows
SELECT s.name, m.advisor FROM students s, majors m;

The old comma-join syntax (FROM a, b WHERE a.x = b.y) makes this easy: delete the WHERE during a refactor and silently ship a Cartesian product. Use explicit JOIN ... ON ... so the relationship and the filter are separate.

Task

joins.sql has two queries — one silently broken, one that runs but leaks too much data.

  • Query 1 — the Cartesian trap. Returns 36 rows instead of ~8. Convert it to explicit JOIN ... ON ....
  • Query 2 — over-selection. Tighten it to project just the student name and their advisor, using aliases.

Predict before you Run: how many rows comes back from the corrected Query 1? Remember Blue Bella (NULL major) and Bio (no students).

Investigate (optional). After Query 1 passes, is Blue Bella in the result? Why not?

🎓 What about LEFT JOIN?

LEFT JOIN keeps unmatched rows from the left side — useful when you want to include Blue Bella even without a major. Not needed for this step; flagging the keyword so it rings a bell later.

Starter files
joins.sql
-- Step 2 — Joining students to majors
-- Predict row counts before you run.

-- Query 1: intended "student name + advisor" — returns 36 rows.
-- Identify the bug, then rewrite as an explicit JOIN ... ON.
SELECT s.name, m.advisor
FROM students s, majors m;

-- Query 2: works but over-selects.
-- Tighten to project only student name and advisor, using aliases.
SELECT *
FROM students
JOIN majors ON students.major = majors.code;
3

Filtering & Sorting

Why this matters

Almost every real query needs to narrow down — give me only the rows that match these conditions, sorted this way. WHERE is your filter and ORDER BY is your sort, but both interact with NULL in ways that quietly drop rows you expected to see. NULL is genuinely difficult — experienced developers still get tripped up by it, so plan to spend a little extra time here.

🎯 You will learn to

  • Apply WHERE predicates and ORDER BY to return exactly the rows you want, in the order you want
  • Analyze how three-valued logic (NULL = UNKNOWN) silently drops rows from != and = filters
  • Evaluate a failing filter by classifying the bug as syntax, semantic, or logical

WHERE predicates

WHERE is row selection — it keeps the rows where a predicate evaluates to TRUE, and discards the rest.

WHERE gpa >= 3.5                        -- numeric comparison
WHERE major = 'CS'                      -- string equality (single quotes!)
WHERE major = 'CS' AND gpa > 3.7        -- compound predicate
WHERE major IN ('CS', 'Math')           -- set membership
WHERE name LIKE 'A%'                    -- pattern match (% = any characters)

Two syntax traps for Python/JS developers:

  • Single quotes for strings. 'CS', not "CS". Double quotes quote identifiers in SQL.
  • Single = for equality. =, not ==. WHERE major == 'CS' is a syntax error.

NULL and Three-Valued Logic — the biggest trap

SQL booleans are three-valued: TRUE, FALSE, UNKNOWN. NULL means unknown value — comparing anything to NULL (even another NULL) yields UNKNOWN, and WHERE discards UNKNOWN rows.

WHERE major != 'CS'                       -- WRONG: silently drops NULL-major rows
WHERE major != 'CS' OR major IS NULL      -- RIGHT: handle NULL explicitly

WHERE major IS NULL       -- value is unknown
WHERE major IS NOT NULL   -- value is known

In Python, None != "Active" is True. In SQL, NULL != 'Active' is UNKNOWN — and the row is dropped.

ORDER BY

ORDER BY is sorting — it imposes order on the result for presentation. It’s also the only guarantee of row order; without it, order is whatever the engine finds convenient.

SELECT name, gpa FROM students
ORDER BY gpa DESC;   -- highest first (ASC is the default)

Task

filter.sql has three broken queries. Each fails differently. For each:

  1. Predict what it actually returns — or what error it raises.
  2. Classify the bug: syntax (grammar), semantic (valid SQL, wrong reference), or logical (runs fine, wrong rows). Write the classification as a comment.
  3. Fix the query.

Query 3 is deceptively easy — the bug only appears when you reason about NULL carefully.

Investigate. After Query 3 passes, comment out your IS NULL branch and re-run. Where did Blue Bella go? This one-line toggle makes the NULL trap concrete.

Translating to JavaScript or pandas
JavaScript pandas SQL
arr.filter(x => x.gpa > 3.5) df[df['gpa'] > 3.5] WHERE gpa > 3.5
arr.map(x => x.name) df[['name']] SELECT name
arr.sort((a,b) => b.gpa - a.gpa) df.sort_values('gpa', ascending=False) ORDER BY gpa DESC

A SELECT is roughly .filter().map().sort() chained — but declarative.

pandas’ NaN behaves closer to SQL (df[df['col'] != 'CS'] excludes NaN). The idiomatic test is df['col'].isna()col IS NULL.

Starter files
filter.sql
-- Step 3 — Filtering rows with WHERE and sorting with ORDER BY
-- Three broken queries. Classify each as syntax / semantic / logical, then fix.

-- Query 1. Intended: CS students with GPA above 3.5, highest GPA first.
-- Runs without error, but returns the wrong rows in the wrong order.
-- Two independent bugs hide here. Find both, classify them, then fix.
SELECT * FROM students
WHERE major = 'CS' OR gpa > 3.5
ORDER BY gpa ASC;

-- Query 2. Intended: students in year 1 or year 2.
-- Runs without error. Why does it return zero rows?
SELECT * FROM students
WHERE year = 1 AND year = 2;

-- Query 3. Intended: students whose major is NOT 'CS'.
-- Runs without error, but quietly drops a student. Who, and why?
-- Hint: one student has a NULL major. Reason about 3VL before fixing.
SELECT * FROM students
WHERE major != 'CS';
4

Aggregating Data

Why this matters

Every time Spotify shows “Your top artists this month” or Instagram shows “4 people liked this” — that is a GROUP BY + COUNT running against millions of rows in milliseconds. Aggregation is how data becomes insight; the trick is keeping straight what gets filtered before grouping (rows) versus after (groups).

🎯 You will learn to

  • Apply GROUP BY with aggregate functions (COUNT, AVG, SUM, MIN, MAX) to summarize rows per category
  • Analyze SQL’s logical execution order to predict why aggregates belong in HAVING, not WHERE
  • Evaluate a GROUP BY query for the grouping rule (every selected column is grouped or aggregated)

Logical execution order

SQL clauses do not run top-to-bottom as written:

1. FROM      — which tables to read
2. WHERE     — filter rows
3. GROUP BY  — group rows
4. HAVING    — filter groups
5. SELECT    — project columns  ← written first, evaluated fifth
6. ORDER BY  — sort
7. LIMIT     — truncate

This explains (a) why you cannot use a SELECT alias inside WHERE — the alias does not exist yet — and (b) why WHERE cannot contain an aggregate — the groups do not exist yet. Keep this list visible while you work the task — even experienced developers reference it.

GROUP BY syntax

GROUP BY is aggregation — it partitions rows into buckets per category, then applies summary functions (COUNT, AVG, SUM, MIN, MAX) to each bucket. The result has one row per bucket, not per original row.

SELECT   major,
         COUNT(*) AS student_count,
         AVG(gpa)  AS avg_gpa
FROM     students
GROUP BY major;

The GROUP BY rule

Every column in SELECT must either (1) appear in GROUP BY, or (2) be wrapped in an aggregate (COUNT, AVG, SUM, MIN, MAX).

-- WRONG: name is neither grouped nor aggregated
SELECT major, name, COUNT(*) FROM students GROUP BY major;

-- RIGHT
SELECT major, COUNT(*) AS n FROM students GROUP BY major;

SQLite silently picks an arbitrary name here — PostgreSQL and strict-mode MySQL reject it. Treat SQLite’s quiet acceptance as a trap, not a blessing.

WHERE vs HAVING

HAVING is group-level filtering — the aggregate-aware analog of WHERE. WHERE filters individual rows before grouping; HAVING filters groups after aggregation. Aggregates cannot appear in WHERE because the groups don’t exist yet.

SELECT   major, COUNT(*) AS n
FROM     students
WHERE    year > 1             -- rows first
GROUP BY major
HAVING   COUNT(*) >= 2;       -- groups after

COUNT(*) counts every row in the group; COUNT(gpa) counts only rows where gpa is not NULL.

Task

aggregate.sql has two queries, each making one of the two most common aggregation mistakes (illegal grouping; WHERE-with-aggregate). For each:

  1. Plan before typing: write a one-line comment with the clauses in evaluation order, e.g. FROM students → WHERE year>1 → GROUP BY major → HAVING count ≥ 2 → SELECT major, count. Even 30 seconds of planning catches most aggregation bugs before you write them.
  2. Predict what happens — a crash, a silent wrong answer, or a confusing SQLite-lenient result.
  3. Explain in a one-line comment why it is wrong, then fix it.

Investigate. After you pass, temporarily change your fixed Query 2 back to WHERE AVG(gpa) > 3.3 and read the exact error message. You want to recognize it instantly in a future code review.

Translating to pandas
SQL pandas
SELECT major, COUNT(*) FROM students GROUP BY major students.groupby('major').size()
SELECT major, AVG(gpa) FROM students GROUP BY major students.groupby('major')['gpa'].mean()
... HAVING AVG(gpa) > 3.3 result[result > 3.3] (filter after .mean())

JavaScript’s .reduce() collapses a collection into a single value. SQL’s GROUP BY does the same — but per group, in one query.

Starter files
aggregate.sql
-- Step 4 — Aggregation with GROUP BY and HAVING
-- Plan each query in one comment line before you fix it.

-- Query 1. Intended: count students per major, most students first.
-- Violates the grouping rule — why? What does SQLite do about it?
SELECT major, name, COUNT(*) AS student_count
FROM students
GROUP BY major
ORDER BY student_count DESC;

-- Query 2. Intended: average GPA per major, only majors with avg GPA > 3.3.
-- Uses the wrong clause for an aggregate condition. Fix it.
SELECT major, AVG(gpa) AS avg_gpa
FROM students
WHERE AVG(gpa) > 3.3
GROUP BY major;
5

Designing Your Own Table

Why this matters

You have been querying an existing table for four steps; now it is time to build your own. A well-designed schema makes invalid data impossible to insert in the first place — your NOT NULL, PRIMARY KEY, and type declarations are the database’s safety net. If constraint syntax feels fiddly at first, that is normal — the payoff is that bugs surface at insert time, not three weeks later in a production report.

🎯 You will learn to

  • Apply CREATE TABLE with appropriate types and constraints (PRIMARY KEY, NOT NULL) to a real schema
  • Create rows with INSERT INTO that satisfy every declared constraint
  • Evaluate a failing query by classifying the error as syntax, semantic, logical, or constraint violation

CREATE TABLE

CREATE TABLE is schema declaration — it defines the shape and integrity rules of a table (columns, types, constraints), not the data. Like declaring a struct, not creating an instance.

CREATE TABLE books (
  id     INTEGER PRIMARY KEY,    -- identity: unique + non-null
  title  TEXT    NOT NULL,       -- required
  author TEXT    NOT NULL,       -- required
  pages  INTEGER                 -- optional (nullable by default)
);
  • PRIMARY KEY = unique + non-null + indexed.
  • NOT NULL = required.
  • Types: INTEGER, REAL, TEXT, BLOB.

SQLite quirk: standard SQL makes PRIMARY KEY imply NOT NULL; SQLite does not. Add NOT NULL explicitly if portability matters.

INSERT INTO

INSERT INTO is row insertion — it adds new rows that must conform to the schema’s types and constraints.

INSERT INTO books (id, title, author, pages) VALUES
  (1, 'SICP', 'Abelson & Sussman', 657);

Omitting a NOT NULL column (with no default) fails. Nullable columns can be skipped. After any insert/update, run a quick SELECT to verify.

Four kinds of SQL errors

  1. Syntax — grammar violated. Refuses to run.
  2. Semantic — valid grammar, references something that does not exist (wrong column name). Errors at parse time.
  3. Logical — runs and returns a result, but the wrong result (e.g., the NULL trap from Step 3). No error message — the hardest to catch.
  4. Constraint violation — grammatically and semantically valid, but violates PRIMARY KEY / NOT NULL / UNIQUE / FOREIGN KEY at execution time.

Predict before you run

Read the two files (schema.sql and populate.sql) before touching either. Then write down (in a comment in the file or on paper):

  1. Which of the three INSERT statements will succeed against the strengthened schema (the one with the constraints from the table below)?
  2. For each insert that fails, which constraint will it violate — and which of the four error categories does that map to?

Once you have written your prediction, then strengthen the schema and run the inserts. Compare the actual failures to your prediction. The point isn’t to be right; it’s to commit to a hypothesis before the database tells you the answer.

Task

A teammate handed you two unreviewed files. Bring them up to production quality.

1. schema.sql — strengthen. It compiles, but has no integrity constraints. Add the missing types + constraints:

Column Type Constraints
id INTEGER PRIMARY KEY
code TEXT NOT NULL
name TEXT NOT NULL
credits INTEGER (nullable — leave as is)

Above each constraint, write a one-line comment naming the bug it prevents (e.g., -- prevents duplicate course IDs).

2. populate.sql — classify and repair. Three INSERT statements. Two fail against the strengthened schema. For each failing insert, decide which error category fits, then fix it (or replace with a valid row). End with SELECT * FROM courses;. Aim for at least 3 rows.

Translating to pandas
SQL pandas
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, ...) books = pd.DataFrame(columns=['id', 'title', 'author', 'pages'])
INSERT INTO books VALUES (1, 'SICP', 'Abelson', 657) books.loc[len(books)] = [1, 'SICP', 'Abelson', 657]

pandas accepts anything silently — wrong types, missing required fields. SQL enforces the schema and catches those bugs at insert time.

Starter files
schema.sql
-- Step 5a — Designing a schema that protects its own data.
-- The CREATE below runs, but it has NO constraints.
-- Add the missing types + constraints from the table in the instructions.
-- For each constraint you add, leave a one-line comment naming the bug it prevents.
DROP TABLE IF EXISTS courses;

CREATE TABLE courses (
  id,
  code,
  name,
  credits
);
populate.sql
-- Step 5b — Populating the strengthened schema.
-- Classify each broken INSERT (syntax / semantic / logical / constraint violation), then fix.
DELETE FROM courses;

-- Insert A — the model row. Matches the schema exactly.
INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'CS 35L', 'Software Construction', 4);

-- Insert B — predict what breaks when the schema enforces NOT NULL
INSERT INTO courses (id, code, credits) VALUES
  (2, 'CS 111', 4);

-- Insert C — all values look plausible. What does the strengthened schema
-- reject here? Look carefully at how this row compares to Insert A.
INSERT INTO courses (id, code, name, credits) VALUES
  (1, 'MATH 61', 'Discrete Structures', 4);

-- Verify your data. Leave this line as the last statement.
SELECT * FROM courses;
6

Modifying & Cleaning Up

Why this matters

Destructive SQL is where careful habits pay off. If you have ever run a shell command and watched the wrong files disappear, you already know the feeling — these patterns exist so you do not have it in production. A missing WHERE on UPDATE or DELETE is one of the most common ways to corrupt a real database, and the only defense is a workflow that previews changes before committing them.

🎯 You will learn to

  • Apply UPDATE, DELETE, and DROP TABLE with the right scoping (WHERE, IF EXISTS)
  • Evaluate a destructive statement by previewing its impact with a SELECT first
  • Analyze the differences between row-level deletion (DELETE), table removal (DROP), and modification (UPDATE)

Safe failure experiment

Run SELECT * FROM students; and note the row count. Now run DELETE FROM students; (no WHERE). Run the SELECT again — the table is empty. The page reload restores the seed data; production has no reload button.

UPDATE, DELETE, DROP

All three are destructive — they change the database state, in irreversible ways without a backup.

  • UPDATE modifies values in existing rows. Scoped by WHERE.
  • DELETE removes rows. Scoped by WHERE. Schema stays.
  • DROP TABLE removes the table itself — data and schema. No undo.
UPDATE students SET gpa = 3.95 WHERE name = 'Mango';
DELETE FROM students WHERE gpa < 3.0;
DROP TABLE IF EXISTS courses;

Without WHERE, UPDATE modifies every row and DELETE empties the table. Safe habit: write SELECT ... WHERE ... first, then convert.

Statement Removes Schema remains?
DELETE FROM t WHERE ... Matching rows Yes
DELETE FROM t All rows Yes
DROP TABLE t All rows + schema No

Task

A teammate pushed cleanup.sql to a PR. It is scheduled to run on production in an hour. Do not approve it as-is.

For each statement:

  1. Predict what it actually does — not what the comment claims.
  2. Preview with a SELECT. Write the SELECT ... WHERE ... that would reveal the affected rows; leave it as a comment above the fixed statement so the next reviewer sees your reasoning.
  3. Fix the statement to match its stated intent.

That trail — claim, preview, confirm — is how destructive SQL actually gets approved.

Stated intents:

  1. Update Kiwi’s GPA from 3.50 to 3.60.
  2. Delete all year-4 students.
  3. Drop the courses table from Step 5.
Translating to pandas
SQL pandas
UPDATE students SET gpa = 3.6 WHERE name = 'Kiwi' students.loc[students['name'] == 'Kiwi', 'gpa'] = 3.6
DELETE FROM students WHERE year = 4 students = students[students['year'] != 4]
DROP TABLE courses del courses

SQL modifies in place; pandas often returns a new frame. In SQL, forgetting WHERE is permanent.

🎓 Where to go next

This tutorial covers the ~90% of SQL that 90% of code needs. Three topics deserve your next hour:

  • LEFT JOIN and the outer-join family. Step 2 taught JOIN, which drops rows with no match (Blue Bella with NULL major). LEFT JOIN keeps them and fills the right side with NULL.
  • Common Table Expressions (WITH ... AS (...)). Complex queries collapse from unreadable nested subqueries into top-to-bottom readable pipelines.
  • The N+1 query problem. The #1 antipattern at the app ↔ SQL boundary: looping over a list, firing one query per element. Batch with WHERE id IN (...) or an explicit JOIN.
Starter files
cleanup.sql
-- Step 6 — Review a teammate's destructive PR before it runs in production.
-- For each statement: predict, preview with a commented SELECT, then fix.

-- 1. Intended: raise Kiwi's GPA to 3.6.
--    Read this UPDATE carefully. Does it actually do what the comment promises?
--    Preview by running a SELECT with the same WHERE clause first.
UPDATE students SET gpa = 3.6 WHERE name = 'Mango';

-- 2. Intended: delete all year-4 students.
--    Wrong column referenced. Would this even run? If yes, what does it delete?
DELETE FROM students WHERE gpa = 4;

-- 3. Intended: drop the courses table.
--    Works most of the time. When would this statement crash the deploy,
--    and how do you make it safe to re-run?
DROP TABLE courses;
7

Putting It Together

Why this matters

You’ve spent six steps learning the building blocks separately. This step is composition — chaining JOIN, WHERE, GROUP BY/HAVING, aggregates, and ORDER BY into one query that answers a real-world question. The skill the capstone builds isn’t typing more SQL — it’s decomposing a natural-language ask into the constructs you already know.

🎯 You will learn to

  • Create a single query that combines JOIN, WHERE, GROUP BY/HAVING, aggregates, and ORDER BY
  • Analyze a natural-language request to map each phrase to the SQL construct it requires
  • Evaluate your own query against a hand-computed prediction (rows, ordering, totals)

The dean’s request

The dean wants a snapshot of upperclassmen (year 2 and above) by major. List each major that has at least 2 such students, showing:

  • the major code
  • its department
  • the advisor
  • the average GPA among those students
  • the student count

Sort by average GPA, highest first.

Decomposing the request

Each phrase of a real-world ask maps to a SQL construct. Identifying these mappings before you type is the skill this capstone builds.

Phrase in the request Construct
“by major” GROUP BY (Step 4)
“year 2 and above” — row filter WHERE (Step 3)
“department, advisor” — needs majors table JOIN (Step 2)
“at least 2 such students” — group filter HAVING COUNT(*) >= 2 (Step 4)
“average GPA”, “student count” AVG(), COUNT() (Step 4)
“highest first” ORDER BY ... DESC (Step 3)
(production hygiene) Explicit columns + aliases (Steps 1, 2)

Plan, then type

In final.sql, sketch the clauses in execution order as a comment, then write the query. Naming the constructs you’ll need is the highest-leverage habit for aggregation queries.

One query. No SELECT *. Run before you celebrate.

Predict before you run

Before you submit your query, work this out by hand:

  1. Open the data. The students table has 9 rows. Walk through every student’s year and major. Tally per major: how many year-2-or-above students does each major have?
  2. Apply the dean’s filter (“year ≥ 2 and COUNT(*) ≥ 2 per major”). Cross out any major that doesn’t meet both conditions.
  3. Now commit (on paper) to all three:
    • The exact number of rows your query will return.
    • The major that appears in row 1 (highest avg GPA).
    • The major that appears in row 2.

Then run. The fastest way to know whether your query is correct isn’t “does it run?” — it’s “does it return exactly the row count, ordering, and majors I predicted?” If your number matches but the majors are wrong, your ORDER BY direction is reversed. If the count is off by one, your HAVING threshold is wrong. Predicting before running converts a vague “looks fine” into a sharp diagnostic.

⚠️ Open after you've committed to all three answers

Two majors qualify: CS (Mango 3.9, Bananito 3.7, Coconick 3.4 → year≥2 count is 3, avg ≈ 3.67) and Math (Watermelina 3.6, Grapenzo 3.5 → year≥2 count is 2, avg = 3.55). Engineering and the NULL-major student fail one filter each. Sorted highest-first: CS, then Math.

If your prediction matched, great — your mental model of the data + the filters + the ordering is solid. If it didn’t, that’s the lesson: pin down which step (counting / filtering / sorting / averaging) was wrong before you debug the SQL. The query is downstream of the model.

Starter files
final.sql
-- Capstone: one query that combines everything from Steps 1–6.
-- Plan first as a comment in execution order:
--   FROM ... → JOIN ... → WHERE ... → GROUP BY ... → HAVING ...
--   → SELECT ... → ORDER BY ...
-- Expected output: 2 rows, columns major | department | advisor | avg_gpa | n
-- Then write the query below.