SQL is not a language for EDA, but rather a language for gathering data from distributed sources. This is why we teach SQL in addition to Python.
Key points to keep in mind:
Context matters: The database schema is key to knowing which operations are valid.
Syntax conventions: SQL syntax is loose (not case-sensitive, reads like natural language), but there are still rules. Use conventions—like capitalizing keywords—to clarify structure.
Flavors differ: SQL comes in many flavors (MySQL, MSSQL, PostgreSQL). In this class, we use DuckDB.
Draw parallels to pandas: Relate SQL clauses to pandas functions to help understand similarities and differences.
Distinguishing clauses: Students often confuse WHERE and HAVING. One helpful analogy:
WHERE → Boolean indexing in pandas
HAVING → filtering after a groupby
Use these analogies during discussion to reinforce the logic and structure of SQL queries.
For this question, we will be working with the UC Berkeley Undergraduate Career Survey dataset, named survey. Each year, the UC Berkeley Career Center surveys graduating seniors for their plans after graduating. Below is a sample of the full dataset that contains many thousands of rows.
Each record of the survey table is an entry corresponding to a student. We have the job title, company information, and the student’s major.
Write an SQL query that contains all data science major graduates who got jobs in Berkeley. The result generated by your query should include all 4 columns.
_______________ FROM survey______________________________________________________
Answer
%%sql SELECT * FROM surveyWHERE m_name ='Data Science'AND c_location ='Berkeley';
Running query in 'duckdb'
j_name
c_name
c_location
m_name
Data Analyst
Startup
Berkeley
Data Science
Note how we add the semi-colon “;” at the end as a statement terminator. This allows us to write more than one SQL query in a cell and is generally good practice.
25.1.2 (b)
Write an SQL query to find the top 2 most popular companies that data science graduates will work at, from most popular to 2nd most popular.
SELECT column1, column2 FROM table1 JOIN table2 ON table1.id= table2.id;
25.2.1 (a)
In the figure above, assume has \(m\) records, while has \(n\) records. Describe which records are returned from each type of join. What is the possible number of records returned in each join? Consider the cases where on the joined field, (1) both tables have unique values, and (2) both tables have duplicated values. Finally, what is the possible number of records returned in each join?
Answer
(INNER) JOIN: Returns records that have matching values in both tables. The maximum number of rows is \(\min(m, n)\) if unique rows in each table. The minimum number of rows is 0.
LEFT (OUTER) JOIN: Return all records from the left table, and the matched records from the right table. The maximum number of rows is m if unique rows. The minimum number of rows is \(m\).
RIGHT (OUTER) JOIN: Return all records from the right table, and the matched records from the left table. The maximum number of rows is \(n\) if unique rows. The minimum number of rows is \(n\).
FULL (OUTER) JOIN: Return all records when there is a match in either the left or right table. The maximum number of rows in \(m + n\) if unique rows. The minimum number of rows is \(\max(m, n)\).
CROSS JOIN: Return all pairs of records between the left and right tables. The maximum number of rows is \(m \times n\). The minimum number of rows is \(m \times n\).
All joins have a maximum number of \(m \times n\) rows if duplicates are allowed.
25.3 More Joins
Consider the following real estate schema (underlined column names have unique values and no duplicates):
homes(home_id int, city text, bedrooms int, bathrooms int, area text)transactions(home_id int, buyer_id int, seller_id int, transaction_date date, sale_price int)buyers(buyer_id int, name text)sellers(seller_id int, name text)
25.3.1 (a)
Fill in the blanks in the SQL query to find the home_id, sale_price, and area for each home in Berkeley with an area greater than 600. If the home has not been sold yet and has an area greater than 600, it should still be included in the table with The price as Null.
Key Points About SQL Queries
This question highlights several important concepts in SQL:
Pay attention to the schema: For example, notice that area is of type text.
Use CASTing when needed: Converting data types is often necessary for comparisons or calculations.
LEFT JOIN usage:
LEFT JOIN is commonly used to include all rows from the left table, even if there is no match in the right table.
Some SQL dialects allow RIGHT JOIN, but not all. The general convention is to use LEFT JOIN whenever possible.
%%sqlSELECT H.home_id, T.sale_price, H.areaFROM homes AS HLEFT JOIN transactions AS TON H.home_id = T.home_idWHERE H.city ='Berkeley'AND CAST(H.area AS INT) >600;
Running query in 'duckdb'
home_id
sale_price
area
3
750
1000
4
None
1500
An alternate solution was to use transactions in the FROM clause and perform a RIGHT JOIN with homes. Note that you can also choose to CAST in the SELECT clause, but in order to access the CAST column later on, an alias different from the original name of the column must be used (in this case, area_int)
%%sqlSELECT H.home_id, T.sale_price, CAST(H.area AS INT) AS area_intFROM transactions AS TRIGHT JOIN homes AS HON H.home_id=T.home_idWHERE H.city ='Berkeley'AND area_int >600
Running query in 'duckdb'
home_id
sale_price
area_int
3
750
1000
4
None
1500
Note that homes AS H, homes as H, homes H are all equivalent ways of specifying the alias. By convention, the first option is preferred even though SQL is not case sensitive.
25.4 More SQL Queries
Examine this schema for these two tables:
CREATETABLE cat_owners ( CREATETABLE cats ( idinteger, idinteger, name text, owner_id integer, age integer, name text,PRIMARYKEY (id) breed text,); age integer,PRIMARYKEY (id),FOREIGNKEY (owner_id) REFERENCES cat_owners );
25.4.1 (a)
Write an SQL query to create an almost identical table as cats, except with an additional column Nickname that has the value “Kitten” for cats less than or equal to the age of 1, “Catto” for cats between 1 and 15, and “Wise One” for cats older than or equal to 15.
Answer
%%sqlSELECT id, owner_id, name, breed, age, CASE WHEN age <=1 THEN 'Kitten' WHEN age >=15 THEN 'Wise One' ELSE 'Catto' END AS NicknameFROM cats AS C;
Running query in 'duckdb'
id
owner_id
name
breed
age
Nickname
51
10
Mittens
Tabby
2
Catto
52
10
Whisker
Black
3
Catto
53
11
Pishi
Orange
1
Kitten
54
11
Lucky
Tabby
2
Catto
55
12
Fluffy
Black
16
Wise One
Alternatively, the first line can also be written as SELECT *, with the rest of the query being the same.
%%sqlSELECT *, CASE WHEN age <=1 THEN 'Kitten' WHEN age >=15 THEN 'Wise One' ELSE 'Catto' END AS NicknameFROM cats AS C;
Running query in 'duckdb'
id
owner_id
name
breed
age
Nickname
51
10
Mittens
Tabby
2
Catto
52
10
Whisker
Black
3
Catto
53
11
Pishi
Orange
1
Kitten
54
11
Lucky
Tabby
2
Catto
55
12
Fluffy
Black
16
Wise One
25.4.2 (b)
Considering only cats with ages strictly greater than 1, write an SQL query that returns the owner_ids of owners that own more than one cat.
Comparing SQL and Pandas for Filtering and Grouping
It can be helpful to understand the parallels and differences between Pandas and SQL when working with filtering and grouping:
Boolean filtering:
In Pandas: df[df.condition] filters rows but keeps the original DataFrame structure.
In SQL: WHERE filters rows before any grouping.
Grouping:
In Pandas: groupby() followed by aggregation changes how you summarize data but does not change the original DataFrame’s granularity unless you assign it.
In SQL: GROUP BY combined with aggregation changes the granularity of the output — each group becomes a single row in the result.
Filtering after grouping:
In Pandas: groupby().filter() can filter groups without changing the overall DataFrame’s structure.
In SQL: HAVING filters groups after aggregation, affecting which groups appear in the output.
Key takeaway:GROUP BY + HAVING in SQL changes the shape (granularity) of your result, while groupby + filter in Pandas does not automatically change the original DataFrame’s granularity.
Answer
%%sqlSELECT owner_idFROM catsWHERE age >1GROUP BY owner_idHAVING COUNT(*) >1;
Running query in 'duckdb'
owner_id
10
25.4.3 (c) (Extra)
Write an SQL query to figure out the owner_id of the one cat owner who owns the most cats.
Answer
%%sqlSELECT owner_idFROM catsGROUP BY owner_idORDER BY COUNT(*) DESCLIMIT 1;
Running query in 'duckdb'
owner_id
10
25.4.4 (d)
Write an SQL query that returns the total number of cats each owner_id owns sorted by the number of cats in descending order. There should be two columns (owner_id and num_cats).
Answer
%%sqlSELECT owner_id, COUNT(*) AS num_catsFROM cats GROUP BY owner_idORDER BY num_cats DESC;
Running query in 'duckdb'
owner_id
num_cats
10
2
11
2
12
1
Using Column Names in SQL Clauses
In SQL, an output column’s name can sometimes be used to refer to its value in certain clauses:
You can use the column name in ORDER BY and GROUP BY clauses.
Example: If you alias a count as num_cats, you can write ORDER BY num_cats.
You cannot use the column name in WHERE or HAVING clauses.
Instead, you must use the full expression that defines it.
Example: To filter groups by the count, use `HAVING COUNT(*
25.4.5 (e)
Write an SQL query to figure out the names of all of the cat owners who have a cat named Pishi.
Answer
%%sqlSELECT O.nameFROM cats AS CJOIN cat_owners AS OON C.owner_id = O.idWHERE C.name ='Pishi';
Running query in 'duckdb'
name
Bob
25.4.6 (f)
Is it possible to have a cat with an owner_id not in the cat_owners table? Explain your answer.
Answer
Since the table cats has a FOREIGN KEY requirement on owner_id, a corresponding entry for an owner_id must exist.
25.4.7 (g)
Write an SQL query to select all rows from the cats table that have cats of the top 2 most popular cat breeds.
Breaking Down Complex SQL Questions
When tackling a complex SQL query, it can help to break the problem into smaller parts:
Filter by a predefined list:
For example, if you already know the top 2 most popular cat breeds, the query becomes as simple as filtering for cats in that list.
Find the top items using SQL:
Students may also recognize that SQL can be used to determine the top 2 most popular cat breeds dynamically using aggregation (COUNT) and ordering (ORDER BY ... DESC LIMIT 2).
Tip: Breaking a problem into smaller, manageable steps often makes writing SQL queries easier and more intuitive.
Answer
%%sqlSELECT *FROM cats WHERE breed IN (SELECT breed FROM cats GROUP BY breed ORDER BY COUNT(*) DESC LIMIT 2);