2 Discussion 02: Intro to Tables and Causation

Slides

2.0.1 Contact Information

Name	Wesley Zheng
Pronouns	He/him/his
Email	wzheng0302@berkeley.edu
Discussion	Wednesdays, 12–2 PM @ Etcheverry 3105
Office Hours	Tuesdays/Thursdays, 2–3 PM @ Warren Hall

Feel free to email me anytime. I typically respond within a day or so!

2.0.2 Announcements

Reminders

Office Hours are the best way to get help on labs, homeworks, and projects
Tutoring sections start Week 3 — a great way to get extra practice and support
Please open worksheet links using your Berkeley email address
Double-check that your work is saved and that your submission on Pensieve passes the same public tests as on Datahub
If you notice any issues with your submission, let a TA know right away

Deadlines

Lab 2 is due Friday (9/5) at 5 PM

2.1 Warm Up

A study followed 369 people with cardiovascular disease, randomly selected from all hospital patients with cardiovascular disease. A year later, those who owned a dog were four times more likely to be alive than those who did not. For all of the following questions, please provide a brief explanation.
(Spring 2017 Practice Midterm Question 3b)

2.1.1 (a)

True or False. This study is a randomized controlled experiment.

Answer

False. The researchers did not randomly assign individuals to a treatment group (having a dog) and a control group (not having a dog). The experimenters had no control over who owned a dog.

2.1.2 (b)

True or False. This study shows that dog owners live longer than cat owners on average.

Answer

False. The study compares those who owned a dog to those who didn’t (not specifically cat owners). Also, the study only involves people with cardiovascular disease, not all people.

2.1.3 (c)

True or False. This study shows that for someone with cardiovascular disease, adopting a dog causes them to live longer.

Answer

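False. This is an observational study, so it can show an association but not causation. For example, people who are healthy enough to care for a dog may already be more likely to survive the year.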

Thinking Carefully About Causation

In data science, it’s important to distinguish between association and causation.

Key Idea

Observational studies can reveal patterns but cannot prove cause-and-effect.
Randomized controlled experiments (RCTs) allow stronger causal claims, since randomization balances out confounders.
Tip for Careful Thinking:
- Make sure samples are drawn from the relevant population.
- Be cautious about extrapolating results beyond the group actually studied.

The key lesson: Correlation is not causation.
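To see why randomization helps, here is a minimal simulation sketch. Everything in it is made up for illustration (the population size, the hidden "active" trait, and the 30% rate are assumptions, not data from the study). It assigns people to groups by coin flip and checks how often the hidden trait shows up in each group.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: each person has a hidden trait (e.g. "already active")
# that could affect survival whether or not they own a dog.
n_people = 10_000
is_active = [rng.random() < 0.3 for _ in range(n_people)]

# Random assignment: a coin flip decides the group, independent of the trait.
in_treatment = [rng.random() < 0.5 for _ in range(n_people)]

def proportion_active(group_flag):
    """Proportion of active people among those whose assignment equals group_flag."""
    members = [active for active, assigned in zip(is_active, in_treatment)
               if assigned == group_flag]
    return sum(members) / len(members)

treatment_rate = proportion_active(True)
control_rate = proportion_active(False)
print(round(treatment_rate, 3), round(control_rate, 3))
```

The two proportions should come out close to each other, which is what "balancing out confounders" means: the groups look alike apart from the treatment itself. In the dog study nobody flipped a coin, so there is no such guarantee.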

2.2 Fun with Arrays

Suppose we have executed the following lines of code. Answer each question with the appropriate output associated with each line of code, or write ERROR if you think the operation is not possible.

Code

# You don’t need to understand this code! I’m just importing the necessary libraries in case you’re curious
from datascience import *
import numpy as np

odd = make_array(1, 3, 5, 7)
even = np.arange(2, 10, 2)
nums = make_array('1', '2', '3', '4')

Working with Arrays

Arrays in Python let us store and manipulate collections of values at once.

Key Idea

Before solving problems, always check:

What is stored in the array?
What type of object are you working with?
Which operations are valid for that type?

This prevents small mistakes from snowballing into bigger ones.
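One way to run through this checklist in code is sketched below. It uses np.array in place of make_array so that it only needs NumPy (an assumption made to keep the example self-contained), and the helper variable names are just for illustration.

```python
import numpy as np

odd = np.array([1, 3, 5, 7])           # integers
nums = np.array(['1', '2', '3', '4'])  # strings that merely look like numbers

for name, arr in [("odd", odd), ("nums", nums)]:
    # type() tells us the container; dtype tells us what is stored inside it.
    print(name, type(arr).__name__, arr.dtype, len(arr))

# Arithmetic such as + only makes sense when the array holds numbers.
odd_holds_numbers = np.issubdtype(odd.dtype, np.number)
nums_holds_numbers = np.issubdtype(nums.dtype, np.number)
print(odd_holds_numbers, nums_holds_numbers)
```

Checking the dtype first makes it easier to predict which of the expressions below will work and which will raise an error.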

2.2.1

odd + even

Answer

odd + even

array([ 3,  7, 11, 15])

2.2.2

odd + nums

Answer

ERROR. Arrays can be added together only if they have the same size and compatible data types (e.g. ints and floats). In this case, one array contains integers and the other contains strings, so the operation is invalid.

odd + nums

---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 odd + nums

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U1')) -> None

2.2.3

even.item(3) * odd.item(1)

Answer

even.item(3) * odd.item(1)

24

2.2.4

odd * 3

Answer

odd * 3

array([ 3,  9, 15, 21])

2.2.5

(odd + 1) == even

Answer

(odd + 1) == even

array([ True,  True,  True,  True])

Remember array operations are performed element-wise, including comparison operators.

Element-wise Operations

Array operations in Python are done element by element.

For example, two arrays may look identical, but comparing them directly won’t return a single True—instead, the comparison happens for each element.

This is a common source of confusion, so be careful!
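If you do want a single True or False for the whole comparison, one possible approach is sketched below using NumPy's all method and np.array_equal. The variable names are chosen only for this example.

```python
import numpy as np

odd = np.array([1, 3, 5, 7])
even = np.arange(2, 10, 2)

elementwise = (odd + 1) == even               # one True/False per position
all_match = elementwise.all()                 # True only if every position matches
same_arrays = np.array_equal(odd + 1, even)   # a single answer for the whole comparison

print(elementwise)
print(all_match, same_arrays)
```

The first result is an array with four entries, while the last two each collapse the comparison into one value.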

2.2.6

nums.item(3) + '0'

Answer

nums.item(3) + '0'

'40'

2.2.7

sum(odd > 4)

Answer

sum(odd > 4)

2

2.2.8

sum(odd > 4) / len(odd > 4). What does this output represent? Discuss with peers.

Answer

sum(odd > 4) / len(odd > 4)

0.5

This computes the proportion of values strictly greater than 4! It is equivalent to np.mean(odd > 4). We will revisit this during the hypothesis testing topic.

np.mean(odd > 4)

0.5

2.2.9

odd + make_array(True, True, False, False)

Answer

odd + make_array(True, True, False, False)

array([2, 4, 5, 7])

Boolean values of True and False are equivalent to integer values of 1 and 0 respectively.
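The same idea can be checked in plain Python, without any arrays. This is a small sketch with made-up variable names that mirrors the last few questions.

```python
# True and False take part in arithmetic as 1 and 0.
flags = [False, False, True, True]   # the result of asking "is it > 4?" for 1, 3, 5, 7

count_true = sum(flags)                     # 0 + 0 + 1 + 1
proportion_true = sum(flags) / len(flags)   # count of True values divided by the total

# Adding booleans to numbers, one position at a time.
shifted = [value + flag
           for value, flag in zip([1, 3, 5, 7], [True, True, False, False])]

print(count_true, proportion_true, shifted)
```

This is why summing a comparison counts the values that pass it, and why dividing by the length gives a proportion.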

2.3 Rise and Shine

Tables are a fundamental way of representing data sets. A table can be viewed in two ways:

A sequence of named columns that each describe a single attribute of all entries in a data set, or
A sequence of rows where each row contains all the attribute information about that entry in the data set

Data 8 uses a library consisting of many Table functions which will allow you to manipulate and visualize data. All of the functions we will use in this course are listed on the Data 8 course webpage under Python Reference, and a similar outline will be provided during exams.

Exploring the Power of Tables

Tables are one of the most powerful tools in data science. They help us summarize, organize, and see patterns in raw data.

Key Idea

You don’t need to know the exact Python code yet—the focus is on building intuition.

Sometimes it’s enough to walk through the logic or even just manually summarize the raw information.
For example, grouping a table might look complicated in code, but we can first practice reasoning about how the summarized version is created by hand.

The takeaway is that tables let us see connections and patterns that are hidden in unsummarized data.

In this question, let’s look at an example table called weather. The table has 8 rows, each corresponding to a day of the year. Each row has three attributes: the date, the outdoor temperature in Celsius, and the number of students at lecture that day.

Code

import warnings
warnings.filterwarnings("ignore") # This line and the one above disable some confusing warnings that Python might generate
%matplotlib inline

weather = Table().with_columns(
    "Date", ["June 21", "June 22", "June 23", "June 24", "June 27", "June 28", "June 29", "June 30"],
    "Outdoor Temp. (Celsius)", [28, 30, 34, 36, 34, 26, 26, 28],
    "Students at Lecture", [435, 417, 394, 398, 410, 385, 370, 373]
) # Creating the weather table

weather

Date	Outdoor Temp. (Celsius)	Students at Lecture
June 21	28	435
June 22	30	417
June 23	34	394
June 24	36	398
June 27	34	410
June 28	26	385
June 29	26	370
June 30	28	373

2.3.1

Using just the information provided in the weather table, can you generate the following tables and visualizations? If not, what additional information do you need?

YES/NO

Answer

YES, YES.

# The following code will generate the table on the left. You don’t need to understand this for now
weather.select("Students at Lecture", "Outdoor Temp. (Celsius)").group("Outdoor Temp. (Celsius)", np.mean).relabeled("Students at Lecture mean", "Mean # Students at Lecture")

Outdoor Temp. (Celsius)	Mean # Students at Lecture
26	377.5
28	404
30	417
34	402
36	398

# The following code will generate the graph on the right. You don’t need to understand this for now
weather.scatter("Outdoor Temp. (Celsius)", "Students at Lecture")

2.3.2

Matthew likes using Fahrenheit more than Celsius. Suppose that he has access to a function called fahrenheit that takes in a number (temperature in Celsius) and returns the temperature in Fahrenheit.

Code

# You don’t need to understand this for now! We will learn about functions very soon
def fahrenheit(celsius):
    return 9/5 * celsius + 32

2.3.2.1 (i)

Write an expression that evaluates to 30°C in Fahrenheit. Hint: which function can we use?

Answer

fahrenheit(30)

86.0

2.3.2.2 (ii)

What will fahrenheit("thirty") output?

Answer

ERROR. fahrenheit takes in a number, so we should only pass numbers into the function. Passing in something that is not a number will result in an error.

fahrenheit("thirty")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 fahrenheit("thirty")

Cell In[16], line 3, in fahrenheit(celsius)
      2 def fahrenheit(celsius):
----> 3     return 9/5 * celsius + 32

TypeError: can't multiply sequence by non-int of type 'float'
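The wording of the error message makes more sense once you know that Python does allow multiplying a string by a whole number (it repeats the string), but not by a float such as 9/5. A small sketch, with variable names chosen just for this example:

```python
repeated = "thirty" * 2          # allowed: an int repeats the string

try:
    (9 / 5) * "thirty"           # 9 / 5 is the float 1.8, which cannot repeat a string
    error_message = None
except TypeError as err:
    error_message = str(err)

print(repeated)
print(error_message)
```

So the error comes from the very first operation inside fahrenheit, before the + 32 is ever reached.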

2.3.2.3 (iii)

Matthew assigns a variable temperature to the number 25. What will fahrenheit(temperature * 2) output?

Code

temperature = 25

Answer

122.0, the Fahrenheit equivalent of 50°C (since temperature * 2 is 50).

fahrenheit(temperature * 2)

122.0

2.3.3 (Bonus!)

Matthew prefers using units of Fahrenheit instead of Celsius. Fill in the blank line of code to calculate an array of new values that we can add to the weather table.

Recall the following formula: \(F = \frac{9}{5}C + 32\)

temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = _________________________
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)

Answer

temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = (9/5)*temp_in_celsius + 32
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)

new_table

Date	Outdoor Temp. (Celsius)	Students at Lecture	Outdoor Temp. (Fahrenheit)
June 21	28	435	82.4
June 22	30	417	86
June 23	34	394	93.2
June 24	36	398	96.8
June 27	34	410	93.2
June 28	26	385	78.8
June 29	26	370	78.8
June 30	28	373	82.4

Alternative Solution:

temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = weather.apply(fahrenheit, 'Outdoor Temp. (Celsius)') # You can learn more about the apply function in the Python Reference tab on the course website
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)

new_table

Date	Outdoor Temp. (Celsius)	Students at Lecture	Outdoor Temp. (Fahrenheit)
June 21	28	435	82.4
June 22	30	417	86
June 23	34	394	93.2
June 24	36	398	96.8
June 27	34	410	93.2
June 28	26	385	78.8
June 29	26	370	78.8
June 30	28	373	82.4

2.3.4 (Bonus!)

Matthew wants to compare the data in the weather table with a previous summer’s data. He found the following scatterplot:

Is the data collected previously the result of an observational study or a randomized controlled experiment? Why?

Answer

Observational study. We are simply observing the temperature and students, and not placing students into treatment and control groups.

2.3.5 (Bonus!)

Using our answer to part 4 and the visualization, is there a relationship between outdoor temperature and the number of students at lecture – an association, a causal relationship or something else? Why?

Answer

There is only an association between Students at Lecture and Outdoor Temp. A valid description would be: As the outdoor temperature increases, the number of students at lecture decreases. Since the data was not collected from a randomized controlled experiment, we cannot conclude a causal relationship.

2.4 Made with ♡ and Coffee

Carisma collected the following information about her coworkers’ methods of getting to work and their coffee consumption. The data is stored in a table called coworkers:

Code

# Initializing the table, you don't need to understand this code
coffee_table = Table().with_columns(
    "Name", ["Dagny", "Marissa", "Isaac", "Tiffany", "Wesley"],
    "Method", ["drive", "drive", "bus", "drive", "bus"],
    "Average Cups of Coffee", [2.3, 1.5, 0.8, 1.8, 1.2]
)

more_names = [f"Person{i}" for i in range(6, 66)]
more_methods = np.random.choice(["drive", "bus", "bike", "walk"], size=60)
more_coffee = np.round(np.random.uniform(0.5, 3.0, size=60), 1)

extra_table = Table().with_columns(
    "Name", more_names,
    "Method", more_methods,
    "Average Cups of Coffee", more_coffee
)

coworkers = coffee_table.append(extra_table)

coworkers.show(5)

Name	Method	Average Cups of Coffee
Dagny	drive	2.3
Marissa	drive	1.5
Isaac	bus	0.8
Tiffany	drive	1.8
Wesley	bus	1.2

... (60 rows omitted)

The table contains three columns:

Name (string): The name of the coworker.
Method (string): The coworker’s way of commuting to work.
Average Cups of Coffee (float): The average number of cups of coffee consumed per day by the coworker.

From English to Python

When coding on paper, the challenge is to translate an English description into Python code.

Key Idea

Even without a computer, you can check your reasoning and syntax.

Strategies for Success:
- Count parentheses and brackets to make sure they match.
- Pay attention to variable names—they usually signal the purpose of the code.
Practice Tip: Try underlining the parts of the English prompt that correspond to each line of the solution.

This practice strengthens the connection between the language of the problem and the language of code.
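Counting brackets by eye is what you will do on paper, but the same check can be written as a short function. The helper below is hypothetical (the name brackets_balanced is made up here) and deliberately simple: it ignores the fact that brackets inside quoted strings should not count.

```python
def brackets_balanced(code_line):
    """Return True if every (, [ and { in code_line is closed in the right order."""
    pairs = {")": "(", "]": "[", "}": "{"}
    open_stack = []
    for char in code_line:
        if char in "([{":
            open_stack.append(char)          # remember each opener
        elif char in pairs:
            # A closer must match the most recent unmatched opener.
            if not open_stack or open_stack.pop() != pairs[char]:
                return False
    return not open_stack                    # nothing should be left open

print(brackets_balanced('coworkers.where("Method", "drive")'))
print(brackets_balanced('((number_cups_bus - number_cups_drive)'))
```

The first line passes the check, while the second has one opening parenthesis that is never closed.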

2.4.1

Help Carisma analyze her coworkers’ coffee consumption habits by creating the following tables.

2.4.1.1 (i)

Carisma wants to focus on her coworkers who commute by driving to work. Create a table titled drivers that only includes coworkers who drive to work.

drivers = __________________.__________________(_______________________, _______________________)

Answer

drivers = coworkers.where("Method", "drive")

drivers

Name	Method	Average Cups of Coffee
Dagny	drive	2.3
Marissa	drive	1.5
Tiffany	drive	1.8
Person12	drive	2.5
Person13	drive	0.7
Person20	drive	2.7
Person30	drive	2
Person36	drive	0.7
Person38	drive	2.5
Person40	drive	0.9

... (9 rows omitted)

2.4.1.2 (ii)

Carisma wants to see which of her coworkers consume the most coffee. Create a table titled consumption that organizes her coworkers from the highest average cups of coffee per day to the lowest.

consumption = _________________._________________(______________________, ______________________)

Answer

consumption = coworkers.sort("Average Cups of Coffee", descending=True)

consumption

Name	Method	Average Cups of Coffee
Person43	bike	3
Person22	bike	2.9
Person47	bus	2.9
Person41	walk	2.8
Person63	drive	2.8
Person65	walk	2.8
Person9	walk	2.7
Person11	bike	2.7
Person20	drive	2.7
Person17	walk	2.6

... (55 rows omitted)

2.4.1.3 (iii)

Carisma decides she only wants to focus on her coworkers’ average coffee consumption and method of commuting to work. Create a table titled coffee_and_commute that includes only the Average Cups of Coffee and Method columns in that order.

coffee_and_commute = _________________._________________(_______________________________________)

Answer

coffee_and_commute = coworkers.select("Average Cups of Coffee", "Method")

coffee_and_commute

Average Cups of Coffee	Method
2.3	drive
1.5	drive
0.8	bus
1.8	drive
1.2	bus
1.6	bus
0.9	walk
1.4	walk
2.7	walk
1.8	bike

... (55 rows omitted)
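To build intuition for what where, sort, and select each do, here is a rough plain-Python analogue using only the five named coworkers. It treats each row as a dictionary, which is an assumption made for illustration and not how the datascience Table works internally.

```python
# The five named rows from the coworkers table, one dictionary per row.
rows = [
    {"Name": "Dagny", "Method": "drive", "Average Cups of Coffee": 2.3},
    {"Name": "Marissa", "Method": "drive", "Average Cups of Coffee": 1.5},
    {"Name": "Isaac", "Method": "bus", "Average Cups of Coffee": 0.8},
    {"Name": "Tiffany", "Method": "drive", "Average Cups of Coffee": 1.8},
    {"Name": "Wesley", "Method": "bus", "Average Cups of Coffee": 1.2},
]

# Like where: keep only the rows whose Method is "drive".
drivers = [row for row in rows if row["Method"] == "drive"]

# Like sort with descending=True: order rows from most to least coffee.
consumption = sorted(rows, key=lambda row: row["Average Cups of Coffee"], reverse=True)

# Like select: keep only the requested columns, in the requested order.
coffee_and_commute = [(row["Average Cups of Coffee"], row["Method"]) for row in rows]

print([row["Name"] for row in drivers])
print([row["Name"] for row in consumption])
print(coffee_and_commute)
```

Notice that where and sort keep whole rows, while select keeps every row but drops columns.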

2.4.2

Carisma wants to determine whether she is spending more money on coffee compared to boba over the course of 1 semester (15 weeks long). Assume she purchases 4 coffees and 3 bobas per week. Complete the following lines of code, which should assign result to True if Carisma spends strictly more on coffee and False otherwise.

cost_coffee = 6.0
cost_boba = 7.0
total_coffee = ______________ * ______ * ______
total_boba = ________________ * ______ * ______
result = ______________________________________

Answer

cost_coffee = 6.0
cost_boba = 7.0
total_coffee = cost_coffee * 4 * 15
total_boba = cost_boba * 3 * 15
result = total_coffee > total_boba

result

True

2.4.3 (Bonus!)

Carisma is trying to compute the absolute value of the difference between the total number of cups drunk by driving coworkers per year vs the total number of cups drunk by bussing coworkers per year. She will do all of this in a single cell. Identify the errors in the following cell and correct them. Make sure that the code cell outputs a single positive number.

number_cups_bus = 12(1.1)
number_cups_drive = 15(1.9)
number_cups_day_difference = ((number_cups_bus - number_cups_drive)
number_cups_week_difference = number_cups_difference * 7
yearly cups = number_cups_week_difference * 52

Answer

number_cups_bus = 12(1.1)
number_cups_drive = 15(1.9)
1 Error - Explanation: can’t use () for multiplication

number_cups_bus = 12(1.1)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 number_cups_bus = 12(1.1)

TypeError: 'int' object is not callable

Code

number_cups_bus = 12 * 1.1
number_cups_drive = 15 * 1.9

number_cups_day_difference = ((number_cups_bus - number_cups_drive)
2 Error - Explanation: the parentheses were unbalanced. In Jupyter, you can put the cursor on a parenthesis to see whether it is matched. Syntax errors are often reported on a different line from the actual mistake.

number_cups_day_difference = ((number_cups_bus - number_cups_drive)

  Cell In[36], line 1
    number_cups_day_difference = ((number_cups_bus - number_cups_drive)
                                                                       ^
SyntaxError: incomplete input

Code

number_cups_day_difference = abs(number_cups_bus - number_cups_drive)

number_cups_week_difference = number_cups_day_difference * 7
3 Error - Explanation: the variable name was wrong. You can press Tab to autocomplete variable names!

4 Error - Explanation: the question asks for the absolute value of the difference, so we need to call abs at some point (any of the last three lines works). We could also reverse the subtraction order to make sure the answer is positive.

number_cups_week_difference = number_cups_difference * 7

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[38], line 1
----> 1 number_cups_week_difference = number_cups_difference * 7

NameError: name 'number_cups_difference' is not defined

Code

number_cups_week_difference = number_cups_day_difference * 7

yearly cups = number_cups_week_difference * 52
5 Error - Explanation: variable names cannot have spaces. It’s always good to have descriptive variable names, including ones that are multiple words, but we need to use underscores to separate them instead of spaces.

yearly cups = number_cups_week_difference * 52

  Cell In[40], line 1
    yearly cups = number_cups_week_difference * 52
           ^
SyntaxError: invalid syntax

Code

yearly_cups = number_cups_week_difference * 52

6 Error - Explanation: a cell does not display anything unless its last line is an expression, such as a variable name, or print is called somewhere in the cell. Ending the cell with yearly_cups on its own line makes it output the answer.
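Putting all six fixes together, one possible version of the corrected cell looks like this. The print call is added so the sketch also shows a result when run outside a notebook.

```python
number_cups_bus = 12 * 1.1            # fix 1: use * for multiplication
number_cups_drive = 15 * 1.9
number_cups_day_difference = abs(number_cups_bus - number_cups_drive)  # fixes 2 and 4
number_cups_week_difference = number_cups_day_difference * 7           # fix 3: correct name
yearly_cups = number_cups_week_difference * 52                         # fix 5: no spaces in names
print(yearly_cups)                    # fix 6: make the cell show its result
```

In a notebook, ending the cell with just yearly_cups would display the value as well.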