25  Discussion 10: SQL (From Summer 2025)

Slides

25.1 SQL Syntax

All SQL queries should follow this basic framework. Note that the order of the clauses matter.

SELECT [DISTINCT] ___<columns>___
FROM ___<tables>___
[WHERE ___<predicate>___]
[GROUP BY ___<columns>___]
[HAVING ___<predicate>___]
[ORDER BY ___<columns>___]
[LIMIT ___<number of rows>___]
Tips for Working with SQL

SQL is not a language for EDA, but rather a language for gathering data from distributed sources. This is why we teach SQL in addition to Python.

Key points to keep in mind:

  1. Context matters: The database schema is key to knowing which operations are valid.
  2. Syntax conventions: SQL syntax is loose (not case-sensitive, reads like natural language), but there are still rules. Use conventions—like capitalizing keywords—to clarify structure.
  3. Flavors differ: SQL comes in many flavors (MySQL, MSSQL, PostgreSQL). In this class, we use DuckDB.
  4. Draw parallels to pandas: Relate SQL clauses to pandas functions to help understand similarities and differences.
  5. Distinguishing clauses: Students often confuse WHERE and HAVING. One helpful analogy:
    • WHERE → Boolean indexing in pandas
    • HAVING → filtering after a groupby

Use these analogies during discussion to reinforce the logic and structure of SQL queries.

For this question, we will be working with the UC Berkeley Undergraduate Career Survey dataset, named survey. Each year, the UC Berkeley Career Center surveys graduating seniors for their plans after graduating. Below is a sample of the full dataset that contains many thousands of rows.

Each record of the survey table is an entry corresponding to a student. We have the job title, company information, and the student’s major.

Code
import pandas as pd
import numpy as np
import duckdb

%load_ext sql
conn = duckdb.connect()
conn.query("INSTALL sqlite")
%sql conn --alias duckdb

data = {'j_name': ['Llama Technician','Software Engineer','Open Source Maintainer','Big Data Engineer', 'Data Analyst', 'Analyst Intern'],
        'c_name': ["Google","Salesforce", "Github","Microsoft","Startup","Google"],
        'c_location' : ['Mountain View', 'SF', 'SF', 'Redmond', 'Berkeley', 'SF'],
        'm_name': ["Applied Math","ORMS","Computer Science", "Data Science", "Data Science","Philosophy"]
        }

survey = pd.DataFrame(data, columns = list(data.keys()))

homes_data = {'home_id': [1,2,3,4,5,6],
        'city': ["Berkeley","San Jose","Berkeley","Berkeley","Berkeley", "Sunnyvale"],
        'bedrooms': [2,1,5,3,4,1],
        'bathrooms': [2,2,1,1,3,2],
        'area': [str(i) for i in [500,750,1000,1500,500,1000]] 
        }

homes = pd.DataFrame(homes_data, columns = list(homes_data.keys()))

transactions_data = {'home_id': [1,2,3,5],
        'buyer_id': [5,6,7,8],
        'seller_id': [8,7,6,5],
        'transaction_data': ['1/12/2001','4/14/2001','8/11/2001','12/21/2001'],
        'sale_price': [1000,500,750,1200]
        }

transactions = pd.DataFrame(transactions_data, columns = list(transactions_data.keys()))


buyers_data = {'buyer_id': [5,6,7,8],
        'name': ["Xiaorui","Conan","Rose","Brandon"],
        }

buyers = pd.DataFrame(buyers_data, columns = list(buyers_data.keys()))

seller_data = {'seller_id': [8,7,6,5],
        'name': ["Shreya","Emrie","Jake","Sam"],
        }

seller = pd.DataFrame(seller_data, columns = list(seller_data.keys()))

cat_owners_data = {'id': [10,11,12],
        'name': ["Alice","Bob","Candice"],
        }

cat_owners = pd.DataFrame(cat_owners_data, columns = list(cat_owners_data.keys()))

cats_data = {'id': [51,52,53,54,55],
        'owner_id': [10, 10, 11, 11, 12],
        'name': ["Mittens","Whisker","Pishi","Lucky","Fluffy"],
        'breed' : ["Tabby","Black","Orange","Tabby","Black"],
        'age': [2,3,1,2,16]
        }

cats = pd.DataFrame(cats_data, columns = list(cats_data.keys()))
DuckDB String and Column Name Conventions

In DuckDB:

  • Single quotes '...' are used for strings.
  • Double quotes "..." are used for column names.

25.1.1 (a)

Write an SQL query that contains all data science major graduates who got jobs in Berkeley. The result generated by your query should include all 4 columns.

_______________ FROM survey
___________________________
___________________________
Answer
%%sql 
SELECT * FROM survey
WHERE m_name = 'Data Science'
AND c_location = 'Berkeley';
Running query in 'duckdb'
j_name c_name c_location m_name
Data Analyst Startup Berkeley Data Science
Note how we add the semi-colon “;” at the end as a statement terminator. This allows us to write more than one SQL query in a cell and is generally good practice.

25.1.2 (b)

Write an SQL query to find the top 2 most popular companies that data science graduates will work at, from most popular to 2nd most popular.

SELECT c_name, ____________ AS count 
FROM survey
WHERE _____________ = 'Data Science'
GROUP BY ______________
ORDER BY ______________
LIMIT 2;
Answer
%%sql
SELECT c_name, COUNT(*) AS count 
FROM survey
WHERE m_name = 'Data Science'
GROUP BY c_name
ORDER BY count DESC
LIMIT 2;
Running query in 'duckdb'
c_name count
Microsoft 1
Startup 1

25.2 Joins

Note: You do not need the JOIN keyword to join SQL tables. The following are equivalent:

SELECT column1, column2
FROM table1, table2
WHERE table1.id = table2.id;
SELECT column1, column2 
FROM table1 JOIN table2 
ON table1.id = table2.id; 

25.2.1 (a)

In the figure above, assume has \(m\) records, while has \(n\) records. Describe which records are returned from each type of join. What is the possible number of records returned in each join? Consider the cases where on the joined field, (1) both tables have unique values, and (2) both tables have duplicated values. Finally, what is the possible number of records returned in each join?

Answer

(INNER) JOIN: Returns records that have matching values in both tables. The maximum number of rows is \(\min(m, n)\) if unique rows in each table. The minimum number of rows is 0.

LEFT (OUTER) JOIN: Return all records from the left table, and the matched records from the right table. The maximum number of rows is m if unique rows. The minimum number of rows is \(m\).

RIGHT (OUTER) JOIN: Return all records from the right table, and the matched records from the left table. The maximum number of rows is \(n\) if unique rows. The minimum number of rows is \(n\).

FULL (OUTER) JOIN: Return all records when there is a match in either the left or right table. The maximum number of rows in \(m + n\) if unique rows. The minimum number of rows is \(\max(m, n)\).

CROSS JOIN: Return all pairs of records between the left and right tables. The maximum number of rows is \(m \times n\). The minimum number of rows is \(m \times n\).

All joins have a maximum number of \(m \times n\) rows if duplicates are allowed.

25.3 More Joins

Consider the following real estate schema (underlined column names have unique values and no duplicates):

homes(home_id int, city text, bedrooms int, bathrooms int, area text)
transactions(home_id int, buyer_id int, seller_id int, transaction_date date, sale_price int)
buyers(buyer_id int, name text)
sellers(seller_id int, name text)

25.3.1 (a)

Fill in the blanks in the SQL query to find the home_id, sale_price, and area for each home in Berkeley with an area greater than 600. If the home has not been sold yet and has an area greater than 600, it should still be included in the table with The price as Null.

Key Points About SQL Queries

This question highlights several important concepts in SQL:

  1. Pay attention to the schema: For example, notice that area is of type text.
  2. Use CASTing when needed: Converting data types is often necessary for comparisons or calculations.
  3. LEFT JOIN usage:
    • LEFT JOIN is commonly used to include all rows from the left table, even if there is no match in the right table.
    • Some SQL dialects allow RIGHT JOIN, but not all. The general convention is to use LEFT JOIN whenever possible.
SELECT ____________________________
FROM ____________________________
________ JOIN ____________________________
ON ____________________________
WHERE ____________________________
AND ____________________________;
Answer
%%sql
SELECT H.home_id, T.sale_price, H.area
FROM homes AS H
LEFT JOIN transactions AS T
ON H.home_id = T.home_id
WHERE H.city = 'Berkeley'
AND CAST(H.area AS INT) > 600;
Running query in 'duckdb'
home_id sale_price area
3 750 1000
4 None 1500

An alternate solution was to use transactions in the FROM clause and perform a RIGHT JOIN with homes. Note that you can also choose to CAST in the SELECT clause, but in order to access the CAST column later on, an alias different from the original name of the column must be used (in this case, area_int)

%%sql
SELECT H.home_id, T.sale_price, CAST(H.area AS INT) AS area_int
FROM transactions AS T
RIGHT JOIN homes AS H
ON H.home_id=T.home_id
WHERE H.city = 'Berkeley'
AND area_int > 600
Running query in 'duckdb'
home_id sale_price area_int
3 750 1000
4 None 1500
Note that homes AS H, homes as H, homes H are all equivalent ways of specifying the alias. By convention, the first option is preferred even though SQL is not case sensitive.

25.4 More SQL Queries

Examine this schema for these two tables:

CREATE TABLE cat_owners (       CREATE TABLE cats (    
    id integer,                     id integer,
    name text,                      owner_id integer,
    age integer,                    name text,
    PRIMARY KEY (id)                breed text,
);                                  age integer,
                                    PRIMARY KEY (id),
                                    FOREIGN KEY (owner_id) 
                                        REFERENCES cat_owners
                                );

25.4.1 (a)

Write an SQL query to create an almost identical table as cats, except with an additional column Nickname that has the value “Kitten” for cats less than or equal to the age of 1, “Catto” for cats between 1 and 15, and “Wise One” for cats older than or equal to 15.

Answer
%%sql
SELECT id, owner_id, name, breed, age,
    CASE
        WHEN age <= 1 THEN 'Kitten'
        WHEN age >= 15 THEN 'Wise One'
        ELSE 'Catto'
    END AS Nickname
FROM cats AS C;
Running query in 'duckdb'
id owner_id name breed age Nickname
51 10 Mittens Tabby 2 Catto
52 10 Whisker Black 3 Catto
53 11 Pishi Orange 1 Kitten
54 11 Lucky Tabby 2 Catto
55 12 Fluffy Black 16 Wise One

Alternatively, the first line can also be written as SELECT *, with the rest of the query being the same.

%%sql
SELECT *,
    CASE
        WHEN age <= 1 THEN 'Kitten'
        WHEN age >= 15 THEN 'Wise One'
        ELSE 'Catto'
    END AS Nickname
FROM cats AS C;
Running query in 'duckdb'
id owner_id name breed age Nickname
51 10 Mittens Tabby 2 Catto
52 10 Whisker Black 3 Catto
53 11 Pishi Orange 1 Kitten
54 11 Lucky Tabby 2 Catto
55 12 Fluffy Black 16 Wise One

25.4.2 (b)

Considering only cats with ages strictly greater than 1, write an SQL query that returns the owner_ids of owners that own more than one cat.

Comparing SQL and Pandas for Filtering and Grouping

It can be helpful to understand the parallels and differences between Pandas and SQL when working with filtering and grouping:

  1. Boolean filtering:
    • In Pandas: df[df.condition] filters rows but keeps the original DataFrame structure.
    • In SQL: WHERE filters rows before any grouping.
  2. Grouping:
    • In Pandas: groupby() followed by aggregation changes how you summarize data but does not change the original DataFrame’s granularity unless you assign it.
    • In SQL: GROUP BY combined with aggregation changes the granularity of the output — each group becomes a single row in the result.
  3. Filtering after grouping:
    • In Pandas: groupby().filter() can filter groups without changing the overall DataFrame’s structure.
    • In SQL: HAVING filters groups after aggregation, affecting which groups appear in the output.

Key takeaway: GROUP BY + HAVING in SQL changes the shape (granularity) of your result, while groupby + filter in Pandas does not automatically change the original DataFrame’s granularity.

Answer
%%sql
SELECT owner_id
FROM cats
WHERE age > 1
GROUP BY owner_id
HAVING COUNT(*) > 1;
Running query in 'duckdb'
owner_id
10

25.4.3 (c) (Extra)

Write an SQL query to figure out the owner_id of the one cat owner who owns the most cats.

Answer
%%sql
SELECT owner_id
FROM cats
GROUP BY owner_id
ORDER BY COUNT(*) DESC
LIMIT 1;
Running query in 'duckdb'
owner_id
10

25.4.4 (d)

Write an SQL query that returns the total number of cats each owner_id owns sorted by the number of cats in descending order. There should be two columns (owner_id and num_cats).

Answer
%%sql
SELECT owner_id, COUNT(*) AS num_cats
FROM cats 
GROUP BY owner_id
ORDER BY num_cats DESC;
Running query in 'duckdb'
owner_id num_cats
10 2
11 2
12 1
Using Column Names in SQL Clauses

In SQL, an output column’s name can sometimes be used to refer to its value in certain clauses:

  • You can use the column name in ORDER BY and GROUP BY clauses.
    • Example: If you alias a count as num_cats, you can write ORDER BY num_cats.
  • You cannot use the column name in WHERE or HAVING clauses.
    • Instead, you must use the full expression that defines it.
    • Example: To filter groups by the count, use `HAVING COUNT(*

25.4.5 (e)

Write an SQL query to figure out the names of all of the cat owners who have a cat named Pishi.

Answer
%%sql
SELECT O.name
FROM cats AS C
JOIN cat_owners AS O
ON C.owner_id = O.id
WHERE C.name = 'Pishi';
Running query in 'duckdb'
name
Bob

25.4.6 (f)

Is it possible to have a cat with an owner_id not in the cat_owners table? Explain your answer.

Answer
Since the table cats has a FOREIGN KEY requirement on owner_id, a corresponding entry for an owner_id must exist.

25.4.7 (g)

Write an SQL query to select all rows from the cats table that have cats of the top 2 most popular cat breeds.

Breaking Down Complex SQL Questions

When tackling a complex SQL query, it can help to break the problem into smaller parts:

  1. Filter by a predefined list:
    • For example, if you already know the top 2 most popular cat breeds, the query becomes as simple as filtering for cats in that list.
  2. Find the top items using SQL:
    • Students may also recognize that SQL can be used to determine the top 2 most popular cat breeds dynamically using aggregation (COUNT) and ordering (ORDER BY ... DESC LIMIT 2).

Tip: Breaking a problem into smaller, manageable steps often makes writing SQL queries easier and more intuitive.

Answer
%%sql
SELECT *
FROM cats WHERE breed IN
    (SELECT breed
    FROM cats
    GROUP BY breed
    ORDER BY COUNT(*) DESC
    LIMIT 2);
Running query in 'duckdb'
id owner_id name breed age
51 10 Mittens Tabby 2
52 10 Whisker Black 3
54 11 Lucky Tabby 2
55 12 Fluffy Black 16