SQL Essentials
A hands-on introduction to SQL using SQLite, running entirely in your browser. Start by querying an existing database, then learn to filter, aggregate, create your own tables, and modify data — no installation required.
Exploring Data
Describe What You Want, Not How
Learning objective: Retrieve the slice of a dataset a caller needs, and explain why describing what you want (declarative) scales better than describing how to compute it (imperative).
In Python or JavaScript, you tell the machine how to solve a problem step by step. In SQL, you describe what you want and the engine figures out the rest. If this imperative → declarative shift feels strange, that is normal.
The students table
A students table is set up for you (9 rows):
| Column | Type | Constraints |
|---|---|---|
id |
INTEGER | PRIMARY KEY |
name |
TEXT | NOT NULL |
year |
INTEGER | |
major |
TEXT | |
gpa |
REAL |
SELECT — reading data
SELECT is projection — it chooses which columns of a table to return. (To choose which rows, you’ll use WHERE in Step 3.)
-- Explore the whole table (good for learning, not production)
SELECT * FROM students;
-- Project only the columns the caller needs
SELECT name, gpa FROM students;
SELECT names the columns, FROM names the table, * means “all columns”. Use SELECT * while exploring; in production code, always list the columns you need so schema changes do not silently break callers.
Task
Open query.sql. Two queries; neither errors, but each is the wrong tool for its stated job.
- Query 1 — exploration.
SELECT *is correct here. Predict the row and column count as a comment, then Run. - Query 2 — production. Tighten it to return only
nameandmajor.
Investigate (optional). After both pass, append SELECT major, name FROM students. Does the column order in the output change?
Translating to Python or pandas
SELECT is a projection — like [s['name'] for s in students] or .map(s => s.name). pandas mapping:
| SQL | pandas |
|---|---|
SELECT * FROM students |
students |
SELECT name, gpa FROM students |
students[['name','gpa']] |
Two things that differ from a Python list: (1) rows come back in no guaranteed order — there is no “row 0”; if order matters, ask for it with ORDER BY (Step 3). (2) Every row has exactly the declared columns and types — values can be NULL, but a column cannot disappear.
How this tutorial works (PRIMM)
Every step uses the same rhythm: Predict what a query will do, Run to compare, Investigate with a tweak, then Modify to fix the bug. The hard skill is diagnosing why a query is wrong — not recalling syntax.
-- Step 1 — Exploring data with SELECT
-- Two queries below. Predict, then run and verify.
-- Query 1: exploration. Runs correctly.
-- Predict the row count and column count on the next line.
SELECT * FROM students;
-- Query 2: works, but not production-quality.
-- It fetches more columns than the caller needs.
-- Tighten it to return only name and major, and add a one-line
-- comment explaining why specific columns are safer than SELECT *.
SELECT id, name, year, major, gpa FROM students;
Exploring Data — Knowledge Check
Min. score: 80%1. A Python programmer says: “A SQL table is just like a Python list of dictionaries.” What is the most important difference they are missing?
SQL tables have a rigid schema (fixed columns and types) and no guaranteed row order — there is no students[0]. Storage location is incidental; the indexing distractor is the misconception this step is built to dismantle.
2. Without an ORDER BY clause, in what order does SQL return rows?
A SQL table is a set, not a list. Without ORDER BY, order depends on storage and query plan — never rely on it.
3. In Python you write students[0] to get the first student. What is the SQL equivalent?
SQL rows have no position — you identify them by their content (WHERE clause), not by an index.
4. Arrange the fragments to write a query that retrieves only the name and gpa columns from the students table.
(arrange in order)
SELECTname, gpaFROMstudents;
WHERE*
SELECT names the columns, FROM names the table. * is wrong (not just those two); WHERE filters rows, not columns.
Joining Tables
JOIN — Combining Tables by a Shared Key
Learning objective: Answer questions that span two tables (e.g. “who is Mango’s advisor?”) by combining rows on a shared key, and recognize when a query has accidentally paired every row with every row.
Real data is rarely in one table. The students table has a major column, but each major’s department and advisor live in a separate majors table:
majors.code |
majors.department |
majors.advisor |
|---|---|---|
| CS | Engineering | Dr. Hopper |
| Math | Sciences | Dr. Noether |
| Physics | Sciences | Dr. Curie |
| Bio | Life Sciences | Dr. McClintock |
To answer “who is Mango’s advisor?”, you need both tables. JOIN combines them on a shared column — here, students.major = majors.code.
JOIN syntax
JOIN is relational join — it combines rows from two tables that share a key, producing one output row per matching pair.
SELECT s.name, m.department, m.advisor
FROM students AS s
JOIN majors AS m ON s.major = m.code;
- Aliases (
AS s,AS m) — required as soon as two tables are in play, and required for self-joins. ON— the predicate that says how the two tables line up. Evaluated per pair.- Match-only — rows where
ONis TRUE for both sides survive; the rest are dropped. Blue Bella (NULL major) and Bio (no students) both drop out.
The Cartesian-product trap
Without ON, the database pairs every row with every row. 9 students × 4 majors = 36 rows of nonsense.
-- WRONG: no ON clause — produces 9 × 4 = 36 rows
SELECT s.name, m.advisor FROM students s, majors m;
The old comma-join syntax (FROM a, b WHERE a.x = b.y) makes this easy: delete the WHERE during a refactor and silently ship a Cartesian product. Use explicit JOIN ... ON ... so the relationship and the filter are separate.
Task
joins.sql has two queries — one silently broken, one that runs but leaks too much data.
- Query 1 — the Cartesian trap. Returns 36 rows instead of ~8. Convert it to explicit
JOIN ... ON .... - Query 2 — over-selection. Tighten it to project just the student name and their advisor, using aliases.
Predict before you Run: how many rows comes back from the corrected Query 1? Remember Blue Bella (NULL major) and Bio (no students).
Investigate (optional). After Query 1 passes, is Blue Bella in the result? Why not?
🎓 What about LEFT JOIN?
LEFT JOIN keeps unmatched rows from the left side — useful when you want to include Blue Bella even without a major. Not needed for this step; flagging the keyword so it rings a bell later.
-- Step 2 — Joining students to majors
-- Predict row counts before you run.
-- Query 1: intended "student name + advisor" — returns 36 rows.
-- Identify the bug, then rewrite as an explicit JOIN ... ON.
SELECT s.name, m.advisor
FROM students s, majors m;
-- Query 2: works but over-selects.
-- Tighten to project only student name and advisor, using aliases.
SELECT *
FROM students
JOIN majors ON students.major = majors.code;
Joining Tables — Knowledge Check
Min. score: 80%
1. You write SELECT * FROM students s, majors m; with no WHERE or ON. What do you get?
With no join predicate, SQL pairs every row from students with every row from majors: 9 × 4 = 36 rows. This is the #1 accidental bug in multi-table queries.
2. A JOIN ... ON s.major = m.code between students (9 rows) and majors (4 rows) returns 8 rows. Why not 9?
JOIN keeps only pairs where the ON predicate is TRUE for both sides. Blue Bella has major = NULL; NULL = anything is UNKNOWN, so she has no match and drops.
3. Arrange the fragments to list each student’s name and their department. (arrange in order)
SELECT s.name, m.departmentFROM students AS sJOIN majors AS mON s.major = m.code;
FROM students, majorsWHERE s.major = m.code
The comma form + WHERE is the old trap. Use explicit JOIN ... ON so the relationship is visible and survives refactors.
4. Why are table aliases (AS s, AS m) strongly recommended even in simple two-table joins?
Aliases disambiguate same-named columns (both students and majors could have name) and keep queries readable. They are also required for self-joins. They do not affect performance or NULL handling.
Filtering & Sorting
WHERE and ORDER BY — The filter() and sort() of SQL
Learning objective: Narrow a result set to exactly the rows that answer a question, order the output, and avoid the silent bugs that unknown values (NULL) introduce into filters.
NULL is genuinely difficult — experienced developers still get tripped up by it. Plan to spend a little extra time on this step.
WHERE predicates
WHERE is row selection — it keeps the rows where a predicate evaluates to TRUE, and discards the rest.
WHERE gpa >= 3.5 -- numeric comparison
WHERE major = 'CS' -- string equality (single quotes!)
WHERE major = 'CS' AND gpa > 3.7 -- compound predicate
WHERE major IN ('CS', 'Math') -- set membership
WHERE name LIKE 'A%' -- pattern match (% = any characters)
Two syntax traps for Python/JS developers:
- Single quotes for strings.
'CS', not"CS". Double quotes quote identifiers in SQL. - Single
=for equality.=, not==.WHERE major == 'CS'is a syntax error.
NULL and Three-Valued Logic — the biggest trap
SQL booleans are three-valued: TRUE, FALSE, UNKNOWN. NULL means unknown value — comparing anything to NULL (even another NULL) yields UNKNOWN, and WHERE discards UNKNOWN rows.
WHERE major != 'CS' -- WRONG: silently drops NULL-major rows
WHERE major != 'CS' OR major IS NULL -- RIGHT: handle NULL explicitly
WHERE major IS NULL -- value is unknown
WHERE major IS NOT NULL -- value is known
In Python, None != "Active" is True. In SQL, NULL != 'Active' is UNKNOWN — and the row is dropped.
ORDER BY
ORDER BY is sorting — it imposes order on the result for presentation. It’s also the only guarantee of row order; without it, order is whatever the engine finds convenient.
SELECT name, gpa FROM students
ORDER BY gpa DESC; -- highest first (ASC is the default)
Task
filter.sql has three broken queries. Each fails differently. For each:
- Predict what it actually returns — or what error it raises.
- Classify the bug: syntax (grammar), semantic (valid SQL, wrong reference), or logical (runs fine, wrong rows). Write the classification as a comment.
- Fix the query.
Query 3 is deceptively easy — the bug only appears when you reason about NULL carefully.
Investigate. After Query 3 passes, comment out your IS NULL branch and re-run. Where did Blue Bella go? This one-line toggle makes the NULL trap concrete.
Translating to JavaScript or pandas
| JavaScript | pandas | SQL |
|---|---|---|
arr.filter(x => x.gpa > 3.5) |
df[df['gpa'] > 3.5] |
WHERE gpa > 3.5 |
arr.map(x => x.name) |
df[['name']] |
SELECT name |
arr.sort((a,b) => b.gpa - a.gpa) |
df.sort_values('gpa', ascending=False) |
ORDER BY gpa DESC |
A SELECT is roughly .filter().map().sort() chained — but declarative.
pandas’ NaN behaves closer to SQL (df[df['col'] != 'CS'] excludes NaN). The idiomatic test is df['col'].isna() ≡ col IS NULL.
-- Step 3 — Filtering rows with WHERE and sorting with ORDER BY
-- Three broken queries. Classify each as syntax / semantic / logical, then fix.
-- Query 1. Intended: CS students with GPA above 3.5, highest GPA first.
-- Runs without error, but returns the wrong rows in the wrong order.
-- Two independent bugs hide here. Find both, classify them, then fix.
SELECT * FROM students
WHERE major = 'CS' OR gpa > 3.5
ORDER BY gpa ASC;
-- Query 2. Intended: students in year 1 or year 2.
-- Runs without error. Why does it return zero rows?
SELECT * FROM students
WHERE year = 1 AND year = 2;
-- Query 3. Intended: students whose major is NOT 'CS'.
-- Runs without error, but quietly drops a student. Who, and why?
-- Hint: one student has a NULL major. Reason about 3VL before fixing.
SELECT * FROM students
WHERE major != 'CS';
Filtering & Sorting — Knowledge Check
Min. score: 80%1. A student writes:
WHERE status != 'Active'
status column has some NULL values. What happens to those rows?
In SQL’s Three-Valued Logic, NULL != 'Active' evaluates to UNKNOWN — not TRUE. WHERE only keeps rows where the predicate is TRUE. UNKNOWN rows are silently discarded. To include NULL rows, write: WHERE status != 'Active' OR status IS NULL.
2. Which expression correctly tests whether a column value is missing (NULL)?
NULL = NULL evaluates to UNKNOWN, not TRUE — so WHERE credits = NULL never matches any row. The only portable, standard-SQL way to test for NULL is IS NULL. == NULL is not valid SQL syntax (SQL uses =, not ==). ISNULL() is a non-standard extension.
3. In SQL’s logical execution order, the SELECT clause is written first. When is it actually evaluated?
SELECT is evaluated fifth: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT. This is why an alias defined in SELECT cannot be used in WHERE — the alias doesn’t exist until after WHERE has already filtered the rows.
4. What does SELECT name, gpa FROM students do?
SELECT projects (selects) columns, not rows. SELECT name, gpa FROM students returns all rows but shows only the name and gpa columns. To filter rows, you need a WHERE clause. This is a common point of confusion — SELECT chooses columns, WHERE chooses rows.
5. Arrange the clauses to find CS students with GPA above 3.5, sorted highest first. (arrange in order)
SELECT name, gpaFROM studentsWHERE major = 'CS' AND gpa > 3.5ORDER BY gpa DESC;
HAVING gpa > 3.5SORT BY gpa DESC;
The logical execution order is FROM -> WHERE -> SELECT -> ORDER BY. HAVING filters groups (after GROUP BY), not individual rows. SORT BY is not valid SQL — use ORDER BY. The distractor HAVING is a common confusion point; it only applies after GROUP BY.
6. A student writes two queries. Which one is more likely to produce unexpected results in production, and why?
Query A: SELECT * FROM users WHERE role != 'admin'
Query B: SELECT id, name FROM users WHERE role != 'admin' OR role IS NULL
Query A has two production risks: (1) SELECT * breaks if columns are added or removed, and (2) != 'admin' silently drops rows where role is NULL. Query B is safer: it names specific columns and explicitly handles NULL. This combines two lessons — explicit projection and NULL-safe predicates.
Aggregating Data
GROUP BY — The reduce() of SQL
Learning objective: Turn row-level data into per-category summaries (counts, averages, totals), and distinguish filtering individual rows from filtering whole groups.
Every time Spotify shows “Your top artists this month” or Instagram shows “4 people liked this” — that is a GROUP BY + COUNT running against millions of rows in milliseconds.
Logical execution order
SQL clauses do not run top-to-bottom as written:
1. FROM — which tables to read
2. WHERE — filter rows
3. GROUP BY — group rows
4. HAVING — filter groups
5. SELECT — project columns ← written first, evaluated fifth
6. ORDER BY — sort
7. LIMIT — truncate
This explains (a) why you cannot use a SELECT alias inside WHERE — the alias does not exist yet — and (b) why WHERE cannot contain an aggregate — the groups do not exist yet. Keep this list visible while you work the task — even experienced developers reference it.
GROUP BY syntax
GROUP BY is aggregation — it partitions rows into buckets per category, then applies summary functions (COUNT, AVG, SUM, MIN, MAX) to each bucket. The result has one row per bucket, not per original row.
SELECT major,
COUNT(*) AS student_count,
AVG(gpa) AS avg_gpa
FROM students
GROUP BY major;
The GROUP BY rule
Every column in SELECT must either (1) appear in GROUP BY, or (2) be wrapped in an aggregate (COUNT, AVG, SUM, MIN, MAX).
-- WRONG: name is neither grouped nor aggregated
SELECT major, name, COUNT(*) FROM students GROUP BY major;
-- RIGHT
SELECT major, COUNT(*) AS n FROM students GROUP BY major;
SQLite silently picks an arbitrary name here — PostgreSQL and strict-mode MySQL reject it. Treat SQLite’s quiet acceptance as a trap, not a blessing.
WHERE vs HAVING
HAVING is group-level filtering — the aggregate-aware analog of WHERE. WHERE filters individual rows before grouping; HAVING filters groups after aggregation. Aggregates cannot appear in WHERE because the groups don’t exist yet.
SELECT major, COUNT(*) AS n
FROM students
WHERE year > 1 -- rows first
GROUP BY major
HAVING COUNT(*) >= 2; -- groups after
COUNT(*) counts every row in the group; COUNT(gpa) counts only rows where gpa is not NULL.
Task
aggregate.sql has two queries, each making one of the two most common aggregation mistakes (illegal grouping; WHERE-with-aggregate). For each:
- Plan before typing: write a one-line comment with the clauses in evaluation order, e.g.
FROM students → WHERE year>1 → GROUP BY major → HAVING count ≥ 2 → SELECT major, count. Even 30 seconds of planning catches most aggregation bugs before you write them. - Predict what happens — a crash, a silent wrong answer, or a confusing SQLite-lenient result.
- Explain in a one-line comment why it is wrong, then fix it.
Investigate. After you pass, temporarily change your fixed Query 2 back to WHERE AVG(gpa) > 3.3 and read the exact error message. You want to recognize it instantly in a future code review.
Translating to pandas
| SQL | pandas |
|---|---|
SELECT major, COUNT(*) FROM students GROUP BY major |
students.groupby('major').size() |
SELECT major, AVG(gpa) FROM students GROUP BY major |
students.groupby('major')['gpa'].mean() |
... HAVING AVG(gpa) > 3.3 |
result[result > 3.3] (filter after .mean()) |
JavaScript’s .reduce() collapses a collection into a single value. SQL’s GROUP BY does the same — but per group, in one query.
-- Step 4 — Aggregation with GROUP BY and HAVING
-- Plan each query in one comment line before you fix it.
-- Query 1. Intended: count students per major, most students first.
-- Violates the grouping rule — why? What does SQLite do about it?
SELECT major, name, COUNT(*) AS student_count
FROM students
GROUP BY major
ORDER BY student_count DESC;
-- Query 2. Intended: average GPA per major, only majors with avg GPA > 3.3.
-- Uses the wrong clause for an aggregate condition. Fix it.
SELECT major, AVG(gpa) AS avg_gpa
FROM students
WHERE AVG(gpa) > 3.3
GROUP BY major;
Aggregation — Knowledge Check
Min. score: 80%1. A student writes:
SELECT major, name, COUNT(*) AS n
FROM students
GROUP BY major;
The grouping rule: every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate. name is neither — there are multiple names per major, so it’s undefined which one to return. SQLite may pick an arbitrary name; PostgreSQL and strict-mode MySQL reject this entirely.
2. Which clause filters groups based on an aggregate condition?
WHERE filters individual rows before grouping. HAVING filters groups after aggregation. WHERE COUNT(*) > 5 is always an error because the count doesn’t exist yet when WHERE runs. Use HAVING COUNT(*) > 5.
3. If 3 out of 10 students have a NULL value in the gpa column, what does each of these return?
COUNT(*) vs COUNT(gpa)
COUNT(*) counts every row regardless of NULLs. COUNT(column) counts only rows where that column is not NULL. The distinction is critical whenever a column is optional — use the right one depending on whether you want to count the group size or the number of known values.
4. A query uses WHERE year > 2 to filter rows before grouping. If a student has year set to NULL, are they included in the grouped results?
From Step 3: NULL > 2 evaluates to UNKNOWN, and WHERE discards UNKNOWN rows before grouping even begins. Always check whether NULLs in filtered columns are affecting your aggregates.
5. Arrange the clauses to count students per major, but only show majors with more than 2 students. (arrange in order)
SELECT major, COUNT(*) AS nFROM studentsGROUP BY majorHAVING COUNT(*) > 2;
WHERE COUNT(*) > 2;ORDER BY major
WHERE cannot hold aggregates — the count does not exist yet when WHERE runs. HAVING filters after grouping.
6. Arrange the steps in the order the database actually executes them, for this query:
SELECT major, COUNT(*) AS n FROM students
WHERE year > 1 GROUP BY major HAVING COUNT(*) >= 2 ORDER BY n DESC;
FROM students — read the source rowsWHERE year > 1 — drop rows where year is 1 or NULLGROUP BY major — partition surviving rows into groupsHAVING COUNT(*) >= 2 — drop groups smaller than 2SELECT major, COUNT(*) AS n — project the final columnsORDER BY n DESC — sort groups by count
SELECT first, then filter rows in WHEREORDER BY runs before HAVING
SELECT is written first but evaluated fifth. This is why aliases defined in SELECT cannot be used in WHERE, and why aggregates in WHERE always fail.
7. [Review from Step 3] A table has a nullable status column. Which query correctly finds all rows where status is either 'inactive' or unknown?
= NULL and IN (..., NULL) both silently miss NULL rows (internally they use =). != 'active' drops NULL rows too. Only IS NULL works.
Designing Your Own Table
CREATE TABLE and INSERT INTO
Learning objective: Design a table that protects its own integrity — so bad data cannot enter in the first place — and classify query errors so you can debug them efficiently.
You have been querying an existing table for four steps. Now it is time to build your own. If constraint syntax feels fiddly at first, that is normal — the payoff is that the database catches bad data before it becomes a bug.
CREATE TABLE
CREATE TABLE is schema declaration — it defines the shape and integrity rules of a table (columns, types, constraints), not the data. Like declaring a struct, not creating an instance.
CREATE TABLE books (
id INTEGER PRIMARY KEY, -- identity: unique + non-null
title TEXT NOT NULL, -- required
author TEXT NOT NULL, -- required
pages INTEGER -- optional (nullable by default)
);
PRIMARY KEY= unique + non-null + indexed.NOT NULL= required.- Types:
INTEGER,REAL,TEXT,BLOB.
SQLite quirk: standard SQL makes PRIMARY KEY imply NOT NULL; SQLite does not. Add NOT NULL explicitly if portability matters.
INSERT INTO
INSERT INTO is row insertion — it adds new rows that must conform to the schema’s types and constraints.
INSERT INTO books (id, title, author, pages) VALUES
(1, 'SICP', 'Abelson & Sussman', 657);
Omitting a NOT NULL column (with no default) fails. Nullable columns can be skipped. After any insert/update, run a quick SELECT to verify.
Four kinds of SQL errors
- Syntax — grammar violated. Refuses to run.
- Semantic — valid grammar, references something that does not exist (wrong column name). Errors at parse time.
- Logical — runs and returns a result, but the wrong result (e.g., the NULL trap from Step 3). No error message — the hardest to catch.
- Constraint violation — grammatically and semantically valid, but violates
PRIMARY KEY/NOT NULL/UNIQUE/FOREIGN KEYat execution time.
Task
A teammate handed you two unreviewed files. Bring them up to production quality.
1. schema.sql — strengthen. It compiles, but has no integrity constraints. Add the missing types + constraints:
| Column | Type | Constraints |
|---|---|---|
id |
INTEGER | PRIMARY KEY |
code |
TEXT | NOT NULL |
name |
TEXT | NOT NULL |
credits |
INTEGER | (nullable — leave as is) |
Above each constraint, write a one-line comment naming the bug it prevents (e.g., -- prevents duplicate course IDs).
2. populate.sql — classify and repair. Three INSERT statements. Two fail against the strengthened schema. For each failing insert, decide which error category fits, then fix it (or replace with a valid row). End with SELECT * FROM courses;. Aim for at least 3 rows.
Translating to pandas
| SQL | pandas |
|---|---|
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT NOT NULL, ...) |
books = pd.DataFrame(columns=['id', 'title', 'author', 'pages']) |
INSERT INTO books VALUES (1, 'SICP', 'Abelson', 657) |
books.loc[len(books)] = [1, 'SICP', 'Abelson', 657] |
pandas accepts anything silently — wrong types, missing required fields. SQL enforces the schema and catches those bugs at insert time.
-- Step 5a — Designing a schema that protects its own data.
-- The CREATE below runs, but it has NO constraints.
-- Add the missing types + constraints from the table in the instructions.
-- For each constraint you add, leave a one-line comment naming the bug it prevents.
DROP TABLE IF EXISTS courses;
CREATE TABLE courses (
id,
code,
name,
credits
);
-- Step 5b — Populating the strengthened schema.
-- Classify each broken INSERT (syntax / semantic / logical / constraint violation), then fix.
DELETE FROM courses;
-- Insert A — the model row. Matches the schema exactly.
INSERT INTO courses (id, code, name, credits) VALUES
(1, 'CS 35L', 'Software Construction', 4);
-- Insert B — predict what breaks when the schema enforces NOT NULL
INSERT INTO courses (id, code, credits) VALUES
(2, 'CS 111', 4);
-- Insert C — all values look plausible. What does the strengthened schema
-- reject here? Look carefully at how this row compares to Insert A.
INSERT INTO courses (id, code, name, credits) VALUES
(1, 'MATH 61', 'Discrete Structures', 4);
-- Verify your data. Leave this line as the last statement.
SELECT * FROM courses;
Creating Tables — Knowledge Check
Min. score: 80%
1. What does PRIMARY KEY guarantee about a column’s values?
PRIMARY KEY enforces two things: uniqueness (no two rows share the same value) and non-nullability. It also creates an implicit index, but its core purpose is to uniquely identify every row.
2. What happens if you INSERT INTO a row but omit a column declared NOT NULL with no default?
A NOT NULL column with no default value requires an explicit value. Omitting it in the INSERT causes an error — the database enforces the constraint and refuses the row.
3. A student writes:
SELECT full_name FROM students WHERE full_name = 'Mango';
name, not full_name. What type of SQL error is this?
A semantic error is a query that is grammatically correct SQL but references something that doesn’t exist (wrong column name, wrong table). A syntax error violates grammar (wrong clause order, missing parentheses). A logical error is when the query runs but returns the wrong answer.
4. You need to store a library’s book collection. Each book has a title, author, year published, and an optional page count. Which schema is best?
The best schema has a PRIMARY KEY for unique identification, NOT NULL on required fields (title, author), appropriate types (INTEGER for year, not TEXT), and leaves optional fields (pages) nullable.
5. Arrange the lines to create a songs table with a primary key and a required title.
(arrange in order)
CREATE TABLE songs (id INTEGER PRIMARY KEY,title TEXT NOT NULL,artist TEXT,plays INTEGER);
id INTEGER UNIQUE,title TEXT NULL,
PRIMARY KEY provides uniqueness and identification — UNIQUE alone lacks the “identity” role. NOT NULL on title ensures every song has a name. TEXT NULL is redundant (columns are nullable by default) and wrong for a required field.
6. [Review] What is the correct logical execution order of these SQL clauses? (arrange in order)
FROMWHEREGROUP BYHAVINGSELECTORDER BY
SORT BYFILTER
The logical execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. This is why you cannot use a SELECT alias in WHERE (the alias doesn’t exist yet) and why WHERE cannot use aggregate functions (groups don’t exist yet). SORT BY is not valid SQL (use ORDER BY); FILTER is not a standalone clause.
Modifying & Cleaning Up
UPDATE, DELETE, and DROP TABLE
Learning objective: Change or remove data without damaging the rest of the database, recognize destructive statements before they run, and apply the professional habit of previewing impact before committing.
Destructive SQL is where careful habits pay off. If you have ever run a shell command and watched the wrong files disappear, you already know the feeling — these patterns exist so you do not have it in production.
Safe failure experiment
Run SELECT * FROM students; and note the row count. Now run DELETE FROM students; (no WHERE). Run the SELECT again — the table is empty. The page reload restores the seed data; production has no reload button.
UPDATE, DELETE, DROP
All three are destructive — they change the database state, in irreversible ways without a backup.
UPDATEmodifies values in existing rows. Scoped byWHERE.DELETEremoves rows. Scoped byWHERE. Schema stays.DROP TABLEremoves the table itself — data and schema. No undo.
UPDATE students SET gpa = 3.95 WHERE name = 'Mango';
DELETE FROM students WHERE gpa < 3.0;
DROP TABLE IF EXISTS courses;
Without WHERE, UPDATE modifies every row and DELETE empties the table. Safe habit: write SELECT ... WHERE ... first, then convert.
| Statement | Removes | Schema remains? |
|---|---|---|
DELETE FROM t WHERE ... |
Matching rows | Yes |
DELETE FROM t |
All rows | Yes |
DROP TABLE t |
All rows + schema | No |
Task
A teammate pushed cleanup.sql to a PR. It is scheduled to run on production in an hour. Do not approve it as-is.
For each statement:
- Predict what it actually does — not what the comment claims.
- Preview with a SELECT. Write the
SELECT ... WHERE ...that would reveal the affected rows; leave it as a comment above the fixed statement so the next reviewer sees your reasoning. - Fix the statement to match its stated intent.
That trail — claim, preview, confirm — is how destructive SQL actually gets approved.
Stated intents:
- Update Kiwi’s GPA from 3.50 to 3.60.
- Delete all year-4 students.
- Drop the
coursestable from Step 5.
Translating to pandas
| SQL | pandas |
|---|---|
UPDATE students SET gpa = 3.6 WHERE name = 'Kiwi' |
students.loc[students['name'] == 'Kiwi', 'gpa'] = 3.6 |
DELETE FROM students WHERE year = 4 |
students = students[students['year'] != 4] |
DROP TABLE courses |
del courses |
SQL modifies in place; pandas often returns a new frame. In SQL, forgetting WHERE is permanent.
🎓 Where to go next
This tutorial covers the ~90% of SQL that 90% of code needs. Three topics deserve your next hour:
LEFT JOINand the outer-join family. Step 2 taughtJOIN, which drops rows with no match (Blue Bella with NULL major).LEFT JOINkeeps them and fills the right side with NULL.- Common Table Expressions (
WITH ... AS (...)). Complex queries collapse from unreadable nested subqueries into top-to-bottom readable pipelines. - The N+1 query problem. The #1 antipattern at the app ↔ SQL boundary: looping over a list, firing one query per element. Batch with
WHERE id IN (...)or an explicit JOIN.
-- Step 6 — Review a teammate's destructive PR before it runs in production.
-- For each statement: predict, preview with a commented SELECT, then fix.
-- 1. Intended: raise Kiwi's GPA to 3.6.
-- Targets the wrong student. A second trap hides here too:
-- drop the WHERE clause and every student ends up at 3.6.
UPDATE students SET gpa = 3.6 WHERE name = 'Mango';
-- 2. Intended: delete all year-4 students.
-- Wrong column referenced. Would this even run? If yes, what does it delete?
DELETE FROM students WHERE gpa = 4;
-- 3. Intended: drop the courses table.
-- Works most of the time. When would this statement crash the deploy,
-- and how do you make it safe to re-run?
DROP TABLE courses;
Modifying & Cleaning Up — Knowledge Check
Min. score: 80%1. A developer runs:
DELETE FROM orders;
DELETE FROM orders without a WHERE deletes every row but leaves the table schema intact. The table still exists — it is just empty. To remove the schema too, use DROP TABLE orders. SQL does not require a WHERE clause on DELETE, which makes this mistake easy to make in production.
2. What is the safest way to remove a table that may or may not exist, without causing an error?
DROP TABLE IF EXISTS tablename is the idiomatic safe form. Plain DROP TABLE tablename raises an error if the table doesn’t exist. DELETE TABLE is not valid SQL. TRUNCATE removes all rows (like DELETE) but keeps the schema, and is not supported by all databases (including SQLite).
3. Before running an UPDATE on a production table, what should you do first?
Running SELECT ... WHERE ... with the same predicate lets you see exactly which rows the UPDATE will touch — before the change is permanent. This is the claim/preview/confirm trail the task demonstrated.
4. After running DROP TABLE courses, what happens if you run SELECT * FROM courses?
DROP TABLE removes both the data and the schema. The table no longer exists in the database at all, so any query referencing it fails with an error. Compare this with DELETE FROM courses (no WHERE), which removes all rows but keeps the table structure intact — SELECT * FROM courses would return an empty result set, not an error.
5. Arrange the steps for safely updating Kiwi’s GPA. The professional workflow: preview first, then modify. (arrange in order)
-- Preview the rows you're about to changeSELECT * FROM students WHERE name = 'Kiwi';-- Then apply the changeUPDATE students SET gpa = 3.6 WHERE name = 'Kiwi';
UPDATE students SET gpa = 3.6;DELETE FROM students WHERE name = 'Kiwi';
Always preview with SELECT before modifying production data. UPDATE students SET gpa = 3.6; is missing the WHERE clause — it would change every student’s GPA to 3.6. The DELETE distractor removes the row entirely instead of updating it.
6. [Review from Step 4] What is the difference between COUNT(*) and COUNT(email)?
COUNT(*) counts every row regardless of NULL. COUNT(column) counts only rows where that column is not NULL.
7. [Review from Step 3] Which statement correctly removes every student whose major is NULL?
= NULL always evaluates to UNKNOWN and matches no rows — the delete would silently do nothing. Only IS NULL tests for unknown values.
Putting It Together
Capstone — One Query, Six Steps’ Worth of Tools
Learning objective: Compose a single query that combines
JOIN,WHERE,GROUP BY/HAVING, aggregates, andORDER BYto answer a real-world question — and recognize which construct each part of the question maps to.
You’ve spent six steps learning the building blocks separately. This step is composition — chaining those blocks into one query that answers a real-world question.
The dean’s request
The dean wants a snapshot of upperclassmen (year 2 and above) by major. List each major that has at least 2 such students, showing:
- the major code
- its department
- the advisor
- the average GPA among those students
- the student count
Sort by average GPA, highest first.
Decomposing the request
Each phrase of a real-world ask maps to a SQL construct. Identifying these mappings before you type is the skill this capstone builds.
| Phrase in the request | Construct |
|---|---|
| “by major” | GROUP BY (Step 4) |
| “year 2 and above” — row filter | WHERE (Step 3) |
“department, advisor” — needs majors table |
JOIN (Step 2) |
| “at least 2 such students” — group filter | HAVING COUNT(*) >= 2 (Step 4) |
| “average GPA”, “student count” | AVG(), COUNT() (Step 4) |
| “highest first” | ORDER BY ... DESC (Step 3) |
| (production hygiene) | Explicit columns + aliases (Steps 1, 2) |
Plan, then type
In final.sql, sketch the clauses in execution order as a comment, then write the query. Naming the constructs you’ll need is the highest-leverage habit for aggregation queries.
One query. No SELECT *. Run before you celebrate.
-- Capstone: one query that combines everything from Steps 1–6.
-- Plan first as a comment in execution order:
-- FROM ... → JOIN ... → WHERE ... → GROUP BY ... → HAVING ...
-- → SELECT ... → ORDER BY ...
-- Then write the query below.
Capstone — Cumulative Check
Min. score: 70%1. Two engineers debate. Engineer A writes:
SELECT major, AVG(gpa) FROM students GROUP BY major HAVING COUNT(*) >= 2;
SELECT major, AVG(gpa) FROM students WHERE COUNT(*) >= 2 GROUP BY major;
Engineer A is right. WHERE COUNT(*) >= 2 always errors because aggregates don’t exist when WHERE runs (the logical execution order from Step 4). Filtering on aggregates is HAVING’s job.
2. [Review from Step 3] In your capstone query, you wrote WHERE year >= 2. If a future student has year set to NULL, are they included in the result?
Three-Valued Logic (Step 3): comparing NULL to anything yields UNKNOWN, and WHERE keeps only TRUE rows. The principle matters whenever a filtered column is nullable.
3. [Review from Step 4] The grouping rule states that every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate. Why?
If a column is neither grouped nor aggregated, multiple row values collapse into one group with no rule for which to return. Standard SQL rejects this; SQLite picks arbitrarily — silent corruption, not a feature.
4. Arrange the clauses of your capstone query in the order the database actually executes them. (arrange in order)
FROM students JOIN majorsWHERE year >= 2GROUP BY majorHAVING COUNT(*) >= 2SELECT major, AVG(gpa), COUNT(*)ORDER BY avg_gpa DESC
WHERE year >= 2 AND COUNT(*) >= 2SORT BY avg_gpa DESC
FROM/JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. SELECT is written first but evaluated fifth — which is why aggregates belong in HAVING (after grouping), never in WHERE. The distractors fail two ways: mixing an aggregate into WHERE, and using SORT BY (not valid SQL).