2  Discussion 02: Intro to Tables and Causation

Slides

2.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 12–2 PM @ Etcheverry 3105
Office Hours Tuesdays/Thursdays, 2–3 PM @ Warren Hall

Feel free to contact me by email. I typically respond within a day or so!


2.0.2 Announcements

Reminders

  • Office Hours are the best way to get help on labs, homeworks, and projects
  • Tutoring sections start Week 3 — a great way to get extra practice and support
  • Please open worksheet links using your Berkeley email address
  • Double-check that your work is saved and that your submission on Pensieve passes the same public tests as on Datahub
  • If you notice any issues with your submission, let a TA know right away

Deadlines

  • Lab 2 is due Friday (9/5) at 5 PM

2.1 Warm Up

A study followed 369 people with cardiovascular disease, randomly selected from all hospital patients with cardiovascular disease. A year later, those who owned a dog were four times more likely to be alive than those who did not. For all of the following questions, please provide a brief explanation.
(Spring 2017 Practice Midterm Question 3b)


2.1.1 (a)

True or False. This study is a randomized controlled experiment.

Answer False. The researchers did not randomly assign individuals to a treatment group (having a dog) and a control group (not having a dog). The experimenters had no control over who owned a dog.

2.1.2 (b)

True or False. This study shows that dog owners live longer than cat owners on average.

Answer False. The study compares those who owned a dog to those who didn’t (not specifically cat owners). Also, the study only involves people with cardiovascular disease, not all people.

2.1.3 (c)

True or False. This study shows that for someone with cardiovascular disease, adopting a dog causes them to live longer.

Answer

False. An observational study does not show causation.

Thinking Carefully About Causation

In data science, it’s important to distinguish between association and causation.

Key Idea

  • Observational studies can reveal patterns but cannot prove cause-and-effect.
  • Randomized controlled experiments (RCTs) allow stronger causal claims, because random assignment makes the treatment and control groups similar, on average, in every respect other than the treatment.
  • Tip for Careful Thinking:
    • Make sure samples are drawn from the relevant population.
    • Be cautious about extrapolating results beyond the group actually studied.

The key lesson: Correlation is not causation.
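
The difference between the two study designs can be seen in a small simulation. The sketch below is purely illustrative: the confounder (being physically active), every probability, and every variable name are made up for this example, and dog ownership is deliberately constructed to have no effect on survival.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Hypothetical confounder: whether a patient is physically active.
active = rng.random(n) < 0.5

# Survival depends ONLY on activity, never on dog ownership.
survived = rng.random(n) < np.where(active, 0.9, 0.6)

# Observational setting: active patients choose to own dogs more often.
owns_dog_observed = rng.random(n) < np.where(active, 0.7, 0.2)

# Randomized setting: a coin flip decides who gets a dog.
owns_dog_assigned = rng.random(n) < 0.5

def survival_gap(has_dog):
    """Survival rate of dog owners minus survival rate of non-owners."""
    return survived[has_dog].mean() - survived[~has_dog].mean()

observed_gap = survival_gap(owns_dog_observed)
randomized_gap = survival_gap(owns_dog_assigned)
print("Observational gap:", round(observed_gap, 3))
print("Randomized gap:   ", round(randomized_gap, 3))
```

Under these assumptions, the observational gap should come out clearly positive even though dogs do nothing in the simulation, while the randomized gap should stay close to zero. The association in the first case comes entirely from the confounder.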

2.2 Fun with Arrays

Suppose we have executed the following lines of code. Answer each question with the appropriate output associated with each line of code, or write ERROR if you think the operation is not possible.

Code
# You don’t need to understand this code! I’m just importing the necessary libraries in case you’re curious
from datascience import *
import numpy as np
odd = make_array(1, 3, 5, 7)
even = np.arange(2, 10, 2)
nums = make_array('1', '2', '3', '4')
Working with Arrays

Arrays in Python let us store and manipulate collections of values at once.

Key Idea

Before solving problems, always check:

  • What is stored in the array?
  • What type of object are you working with?
  • Which operations are valid for that type?

This prevents small mistakes from snowballing into bigger ones.
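
The checklist above can be run as code. The sketch below uses np.array as a stand-in for make_array, assuming that make_array produces ordinary NumPy arrays, and prints what each array stores and what kind of values it holds.

```python
import numpy as np

odd = np.array([1, 3, 5, 7])            # stand-in for make_array(1, 3, 5, 7)
even = np.arange(2, 10, 2)              # 2, 4, 6, 8 (the stop value 10 is excluded)
nums = np.array(['1', '2', '3', '4'])   # strings, not numbers

for name, arr in [("odd", odd), ("even", even), ("nums", nums)]:
    # dtype.kind is 'i' for integers and 'U' for text (Unicode strings)
    print(name, arr, "length:", len(arr), "kind:", arr.dtype.kind)
```

Seeing that nums has kind 'U' is a hint that arithmetic with odd or even will not behave like arithmetic on numbers.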


2.2.1

odd + even

Answer
odd + even
array([ 3,  7, 11, 15])

2.2.2

odd + nums

Answer

ERROR. Arrays can only be added together if they have the same size and are of compatible data types (e.g. ints and floats). In this case, one array contains integers and the other contains strings, so the operation is invalid.

odd + nums
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 odd + nums

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U1')) -> None
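
If the goal were to add the numbers that the strings represent, one possible fix is to convert nums to integers first. This is only a sketch of that idea, not part of the original question; np.array stands in for make_array.

```python
import numpy as np

odd = np.array([1, 3, 5, 7])
nums = np.array(['1', '2', '3', '4'])

nums_as_ints = nums.astype(int)   # array of the integers 1, 2, 3, 4
total = odd + nums_as_ints        # element-wise: 1+1, 3+2, 5+3, 7+4
print(total)
```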

2.2.3

even.item(3) * odd.item(1)

Answer
even.item(3) * odd.item(1)
24

2.2.4

odd * 3

Answer
odd * 3
array([ 3,  9, 15, 21])

2.2.5

(odd + 1) == even

Answer
(odd + 1) == even
array([ True,  True,  True,  True], dtype=bool)

Remember, array operations are performed element-wise, including comparison operators.

Element-wise Operations

Array operations in Python are done element by element.

For example, two arrays may look identical, but comparing them directly won’t return a single True—instead, the comparison happens for each element.

This is a common source of confusion, so be careful!
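
When a single True or False is what you want, the element-wise result can be collapsed. A minimal sketch, using np.array as a stand-in for make_array:

```python
import numpy as np

odd = np.array([1, 3, 5, 7])
even = np.arange(2, 10, 2)

elementwise = (odd + 1) == even              # one True/False per position
all_match = np.all(elementwise)              # True only if every position matches
same_arrays = np.array_equal(odd + 1, even)  # compares shape and contents in one call
print(elementwise, all_match, same_arrays)
```

Both np.all and np.array_equal are standard NumPy functions; which one reads better depends on whether you already have the element-wise comparison in hand.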


2.2.6

nums.item(3) + '0'

Answer
nums.item(3) + '0'
'40'

2.2.7

sum(odd > 4)

Answer
sum(odd > 4)
2

2.2.8

sum(odd > 4) / len(odd > 4). What does this output represent? Discuss with peers.

Answer
sum(odd > 4) / len(odd > 4)
0.5

This computes the proportion of values strictly greater than 4! It is equivalent to np.mean(odd > 4). We will revisit this during the hypothesis testing topic.

np.mean(odd > 4)
0.5

2.2.9

odd + make_array(True, True, False, False)

Answer
odd + make_array(True, True, False, False)
array([2, 4, 5, 7])
Boolean values of True and False are equivalent to integer values of 1 and 0 respectively.
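
This equivalence is also why summing a comparison counts the True values, as in the two previous questions. A short plain-Python sketch (the variable names are chosen just for this example):

```python
# bool is a subclass of int, so True and False take part in arithmetic as 1 and 0.
print(isinstance(True, int))   # True
print(True + True)             # 2
print(False * 10)              # 0

# Summing a list of booleans therefore counts the True values.
comparisons = [value > 4 for value in [1, 3, 5, 7]]   # [False, False, True, True]
count_true = sum(comparisons)
proportion_true = count_true / len(comparisons)
print(count_true, proportion_true)
```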

2.3 Rise and Shine

Tables are a fundamental way of representing data sets. A table can be viewed in two ways:

  • A sequence of named columns that each describe a single attribute of all entries in a data set, or
  • A sequence of rows where each row contains all the attribute information about that entry in the data set

Data 8 uses a library consisting of many Table functions which will allow you to manipulate and visualize data. All of the functions we will use in this course are listed on the Data 8 course webpage under Python Reference, and a similar outline will be provided during exams.

Exploring the Power of Tables

Tables are one of the most powerful tools in data science. They help us summarize, organize, and see patterns in raw data.

Key Idea

You don’t need to know the exact Python code yet—the focus is on building intuition.

  • Sometimes it’s enough to walk through the logic or even just manually summarize the raw information.
  • For example, grouping a table might look complicated in code, but we can first practice reasoning about how the summarized version is created by hand.

The takeaway is that tables let us see connections and patterns that are hidden in unsummarized data.

In this question, let’s look at an example table called weather. The table has 8 rows, each corresponding to a day of the year. Each row has three attributes: the date, the outdoor temperature in Celsius, and the number of students at lecture that day.

Code
import warnings
warnings.filterwarnings("ignore") # This line and the one above disable some confusing warnings that Python might generate
%matplotlib inline

weather = Table().with_columns(
    "Date", ["June 21", "June 22", "June 23", "June 24", "June 27", "June 28", "June 29", "June 30"],
    "Outdoor Temp. (Celsius)", [28, 30, 34, 36, 34, 26, 26, 28],
    "Students at Lecture", [435, 417, 394, 398, 410, 385, 370, 373]
) # Creating the weather table

weather
Date Outdoor Temp. (Celsius) Students at Lecture
June 21 28 435
June 22 30 417
June 23 34 394
June 24 36 398
June 27 34 410
June 28 26 385
June 29 26 370
June 30 28 373


2.3.1

Using just the information provided in the weather table, can you generate the following tables and visualizations? If not, what additional information do you need?

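[Left: a table of the mean number of students at lecture for each outdoor temperature. Right: a scatter plot of Students at Lecture against Outdoor Temp. (Celsius).]

YES/NO YES/NO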
Answer

YES, YES.

# The following code will generate the table on the left. You don’t need to understand this for now
weather.select("Students at Lecture", "Outdoor Temp. (Celsius)").group("Outdoor Temp. (Celsius)", np.mean).relabeled("Students at Lecture mean", "Mean # Students at Lecture")
Outdoor Temp. (Celsius) Mean # Students at Lecture
26 377.5
28 404
30 417
34 402
36 398
# The following code will generate the graph on the right. You don’t need to understand this for now
weather.scatter("Outdoor Temp. (Celsius)", "Students at Lecture")
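
To build intuition for the grouped table, the same summary can be produced by hand. The sketch below uses plain Python lists and a dictionary instead of the Table methods; it is one way to reason about grouping, not how the library is implemented.

```python
from collections import defaultdict

# Columns copied from the weather table
temps = [28, 30, 34, 36, 34, 26, 26, 28]
students = [435, 417, 394, 398, 410, 385, 370, 373]

# Step 1: collect the attendance counts that share each temperature.
counts_by_temp = defaultdict(list)
for temp, count in zip(temps, students):
    counts_by_temp[temp].append(count)

# Step 2: average each collection, visiting temperatures in increasing order.
mean_by_temp = {
    temp: sum(counts) / len(counts)
    for temp, counts in sorted(counts_by_temp.items())
}
print(mean_by_temp)
```

Each key of mean_by_temp corresponds to one row of the grouped table: 26 and 28 and 34 each average two days, while 30 and 36 appear only once.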


2.3.2

Matthew likes using Fahrenheit more than Celsius. Suppose that he has access to a function called fahrenheit that takes in a number (temperature in Celsius) and returns the temperature in Fahrenheit.

Code
# You don’t need to understand this for now! We will learn about functions very soon
def fahrenheit(celsius):
    return 9/5 * celsius + 32

2.3.2.1 (i)

Write an expression that evaluates to 30°C in Fahrenheit. Hint: which function can we use?

Answer
fahrenheit(30)
86.0

2.3.2.2 (ii)

What will fahrenheit("thirty") output?

Answer

ERROR. fahrenheit takes in a number, so we should only pass numbers into the function. Passing in something that is not a number will result in an error.

fahrenheit("thirty")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 fahrenheit("thirty")

Cell In[16], line 3, in fahrenheit(celsius)
      2 def fahrenheit(celsius):
----> 3     return 9/5 * celsius + 32

TypeError: can't multiply sequence by non-int of type 'float'
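
The wording of the message makes more sense once you know that Python does allow a string (a kind of sequence) to be multiplied by a whole number, which repeats it. The sketch below contrasts the two cases; the try/except is only there so the example runs to the end.

```python
def fahrenheit(celsius):
    return 9/5 * celsius + 32

print(3 * "ab")        # a string times an int repeats the string: 'ababab'

try:
    fahrenheit("thirty")
except TypeError as error:
    # 9/5 is the float 1.8, and a string cannot be repeated 1.8 times
    message = str(error)
    print("TypeError:", message)

# Passing a number instead of its English name works as intended
print(fahrenheit(30))
```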

2.3.2.3 (iii)

Matthew assigns a variable temperature to the number 25. What will fahrenheit(temperature * 2) output?

Code
temperature = 25
Answer

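122.0, the Fahrenheit value for 50°C. The argument temperature * 2 is evaluated first, giving 50, and that number is then passed to fahrenheit.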

fahrenheit(temperature * 2)
122.0

2.3.3 (Bonus!)

Matthew prefers using units of Fahrenheit instead of Celsius. Fill in the blank line of code to calculate an array of new values that we can add to the weather table.

Recall the following formula: \(F = \frac{9}{5}C + 32\)

temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = _________________________
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)
Answer
temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = (9/5)*temp_in_celsius + 32
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)
new_table
Date Outdoor Temp. (Celsius) Students at Lecture Outdoor Temp. (Fahrenheit)
June 21 28 435 82.4
June 22 30 417 86
June 23 34 394 93.2
June 24 36 398 96.8
June 27 34 410 93.2
June 28 26 385 78.8
June 29 26 370 78.8
June 30 28 373 82.4

Alternative Solution:

temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = weather.apply(fahrenheit, 'Outdoor Temp. (Celsius)') # You can learn more about the apply function in the Python Reference tab on the course website
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)
new_table
Date Outdoor Temp. (Celsius) Students at Lecture Outdoor Temp. (Fahrenheit)
June 21 28 435 82.4
June 22 30 417 86
June 23 34 394 93.2
June 24 36 398 96.8
June 27 34 410 93.2
June 28 26 385 78.8
June 29 26 370 78.8
June 30 28 373 82.4
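
The two solutions agree because array arithmetic and a per-element function call perform the same calculation on each value. The sketch below checks this with NumPy alone; the list comprehension is a rough stand-in for what weather.apply does, not the library's actual implementation.

```python
import numpy as np

def fahrenheit(celsius):
    return 9/5 * celsius + 32

temp_in_celsius = np.array([28, 30, 34, 36, 34, 26, 26, 28])

# Approach 1: array arithmetic converts every element in one expression.
by_arithmetic = (9/5) * temp_in_celsius + 32

# Approach 2: call the function once per element and collect the results.
by_function = np.array([fahrenheit(c) for c in temp_in_celsius])

print(np.allclose(by_arithmetic, by_function))
```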

2.3.4 (Bonus!)

Matthew wants to compare the data in the weather table with a previous summer’s data. He found the following scatterplot:

Is the data collected previously the result of an observational study or a randomized controlled experiment? Why?

Answer Observational study. Nobody assigned temperatures or placed students into treatment and control groups; the temperature and attendance were simply recorded as they occurred.

2.3.5 (Bonus!)

Using our answer to part 4 and the visualization, is there a relationship between outdoor temperature and the number of students at lecture – an association, a causal relationship or something else? Why?

Answer There is only an association between Students at Lecture and Outdoor Temp. A valid description would be: As the outdoor temperature increases, the number of students at lecture decreases. Since the data was not collected from a randomized controlled experiment, we cannot conclude a causal relationship.

2.4 Made with ♡ and Coffee

Carisma collected the following information about her coworkers’ methods of getting to work and their coffee consumption. The data is stored in a table called coworkers:

Code
# Initializing the table, you don't need to understand this code
coffee_table = Table().with_columns(
    "Name", ["Dagny", "Marissa", "Isaac", "Tiffany", "Wesley"],
    "Method", ["drive", "drive", "bus", "drive", "bus"],
    "Average Cups of Coffee", [2.3, 1.5, 0.8, 1.8, 1.2]
)

more_names = [f"Person{i}" for i in range(6, 66)]
more_methods = np.random.choice(["drive", "bus", "bike", "walk"], size=60)
more_coffee = np.round(np.random.uniform(0.5, 3.0, size=60), 1)

extra_table = Table().with_columns(
    "Name", more_names,
    "Method", more_methods,
    "Average Cups of Coffee", more_coffee
)

coworkers = coffee_table.append(extra_table)

coworkers.show(5)
Name Method Average Cups of Coffee
Dagny drive 2.3
Marissa drive 1.5
Isaac bus 0.8
Tiffany drive 1.8
Wesley bus 1.2

... (60 rows omitted)

The table contains three columns:

  • Name (string): The name of the coworker.
  • Method (string): The coworker’s way of commuting to work.
  • Average Cups of Coffee (float): The average number of cups of coffee consumed per day by the coworker.
From English to Python

When coding on paper, the challenge is to translate an English description into Python code.

Key Idea

Even without a computer, you can check your reasoning and syntax.

  • Strategies for Success:
    • Count parentheses and brackets to make sure they match.
    • Pay attention to variable names—they usually signal the purpose of the code.
  • Practice Tip: Try underlining the parts of the English prompt that correspond to each line of the solution.

This practice strengthens the connection between the language of the problem and the language of code.
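
Counting brackets by hand follows a simple rule: every closing bracket must match the most recently opened one, and nothing may be left open at the end. The helper below, brackets_balanced, is a hypothetical function written for this illustration. It is a simplified sketch that does not skip brackets appearing inside strings or comments.

```python
def brackets_balanced(code):
    """Return True if every (, [ and { in code is closed in the right order."""
    pairs = {")": "(", "]": "[", "}": "{"}
    open_stack = []
    for char in code:
        if char in "([{":
            open_stack.append(char)
        elif char in pairs:
            # A closer must match the most recently opened bracket.
            if not open_stack or open_stack.pop() != pairs[char]:
                return False
    return not open_stack   # anything still open was never closed

print(brackets_balanced('coworkers.where("Method", "drive")'))
print(brackets_balanced('abs((a - b)'))
```

The first call has one opening and one closing parenthesis, so it is balanced. The second opens two and closes only one, which is the kind of slip worth catching on paper.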


2.4.1

Help Carisma analyze her coworkers’ coffee consumption habits by creating the following tables.


2.4.1.1 (i)

Carisma wants to focus on her coworkers who commute by driving to work. Create a table titled drivers that only includes coworkers who drive to work.

drivers = __________________.__________________(_______________________, _______________________)
Answer
drivers = coworkers.where("Method", "drive")
drivers
Name Method Average Cups of Coffee
Dagny drive 2.3
Marissa drive 1.5
Tiffany drive 1.8
Person12 drive 2.5
Person13 drive 0.7
Person20 drive 2.7
Person30 drive 2
Person36 drive 0.7
Person38 drive 2.5
Person40 drive 0.9

... (9 rows omitted)

2.4.1.2 (ii)

Carisma wants to see which of her coworkers consume the most coffee. Create a table titled consumption that organizes her coworkers from the highest average cups of coffee per day to the lowest.

consumption = _________________._________________(______________________, ______________________)
Answer
consumption = coworkers.sort("Average Cups of Coffee", descending=True)
consumption
Name Method Average Cups of Coffee
Person43 bike 3
Person22 bike 2.9
Person47 bus 2.9
Person41 walk 2.8
Person63 drive 2.8
Person65 walk 2.8
Person9 walk 2.7
Person11 bike 2.7
Person20 drive 2.7
Person17 walk 2.6

... (55 rows omitted)

2.4.1.3 (iii)

Carisma decides she only wants to focus on her coworkers’ average coffee consumption and method of commuting to work. Create a table titled coffee_and_commute that includes only the Average Cups of Coffee and Method columns in that order.

coffee_and_commute = _________________._________________(_______________________________________)
Answer
coffee_and_commute = coworkers.select("Average Cups of Coffee", "Method")
coffee_and_commute
Average Cups of Coffee Method
2.3 drive
1.5 drive
0.8 bus
1.8 drive
1.2 bus
1.6 bus
0.9 walk
1.4 walk
2.7 walk
1.8 bike

... (55 rows omitted)
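
To connect the three answers back to their English descriptions, here is what each operation does, written by hand on the five named rows only. This is a plain-Python sketch for intuition; the randomly generated coworkers are left out, and the Table methods themselves are not used.

```python
# The five named rows of coworkers, as (Name, Method, Average Cups of Coffee)
rows = [
    ("Dagny", "drive", 2.3),
    ("Marissa", "drive", 1.5),
    ("Isaac", "bus", 0.8),
    ("Tiffany", "drive", 1.8),
    ("Wesley", "bus", 1.2),
]

# where("Method", "drive"): keep rows whose Method equals "drive"
drivers = [row for row in rows if row[1] == "drive"]

# sort("Average Cups of Coffee", descending=True): reorder rows, largest first
consumption = sorted(rows, key=lambda row: row[2], reverse=True)

# select("Average Cups of Coffee", "Method"): keep two columns, in that order
coffee_and_commute = [(cups, method) for _, method, cups in rows]

print([name for name, _, _ in drivers])
print([name for name, _, _ in consumption])
print(coffee_and_commute[:2])
```

Note that where and sort keep every column and change which rows appear or their order, while select keeps every row and changes which columns appear.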


2.4.2

Carisma wants to determine whether she is spending more money on coffee compared to boba over the course of 1 semester (15 weeks long). Assume she purchases 4 coffees and 3 bobas per week. Complete the following lines of code, which should assign result to True if Carisma spends strictly more on coffee and False otherwise.

cost_coffee = 6.0
cost_boba = 7.0
total_coffee = ______________ * ______ * ______
total_boba = ________________ * ______ * ______
result = ______________________________________
Answer
cost_coffee = 6.0
cost_boba = 7.0
total_coffee = cost_coffee * 4 * 15
total_boba = cost_boba * 3 * 15
result = total_coffee > total_boba
result
True
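
As a check on the arithmetic, the totals can be built up one week at a time. The intermediate names weeks, weekly_coffee, and weekly_boba are introduced only for this sketch and are not part of the fill-in-the-blank template.

```python
cost_coffee = 6.0
cost_boba = 7.0
weeks = 15

weekly_coffee = cost_coffee * 4        # 24.0 spent on coffee per week
weekly_boba = cost_boba * 3            # 21.0 spent on boba per week
total_coffee = weekly_coffee * weeks   # 360.0 over the semester
total_boba = weekly_boba * weeks       # 315.0 over the semester

# Multiplying both sides by the same positive number of weeks keeps the comparison the same.
print(total_coffee > total_boba, weekly_coffee > weekly_boba)
```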

2.4.3 (Bonus!)

Carisma is trying to compute the absolute value of the difference between the total number of cups drunk by driving coworkers per year vs the total number of cups drunk by bussing coworkers per year. She will do all of this in a single cell. Identify the errors in the following cell and correct them. Make sure that the code cell outputs a single positive number.

number_cups_bus = 12(1.1)
number_cups_drive = 15(1.9)
number_cups_day_difference = ((number_cups_bus - number_cups_drive)
number_cups_week_difference = number_cups_difference * 7
yearly cups = number_cups_week_difference * 52
Answer

number_cups_bus = 12(1.1)
number_cups_drive = 15(1.9)
Error 1 - Explanation: multiplication needs the * operator. Writing 12(1.1) makes Python try to call 12 as if it were a function.

number_cups_bus = 12(1.1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 number_cups_bus = 12(1.1)

TypeError: 'int' object is not callable
Code
number_cups_bus = 12 * 1.1
number_cups_drive = 15 * 1.9

number_cups_day_difference = ((number_cups_bus - number_cups_drive)
Error 2 - Explanation: the parentheses were unbalanced (two opening, one closing). In Jupyter, you can put the cursor on a parenthesis to see whether it is matched. Syntax errors are often reported on a different line from the one that caused them.

number_cups_day_difference = ((number_cups_bus - number_cups_drive)
  Cell In[36], line 1
    number_cups_day_difference = ((number_cups_bus - number_cups_drive)
                                                                       ^
SyntaxError: incomplete input
Code
number_cups_day_difference = abs(number_cups_bus - number_cups_drive)

number_cups_week_difference = number_cups_difference * 7
Error 3 - Explanation: the variable name was wrong. number_cups_difference was never defined; the variable is called number_cups_day_difference. You can use tab to autocomplete!

Error 4 - Explanation: we also want to use absolute value (by calling abs) at some point, because the question asked us to. It can go in any of the last three lines. We could also use a different subtraction order to make sure the answer is positive.

number_cups_week_difference = number_cups_difference * 7
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[38], line 1
----> 1 number_cups_week_difference = number_cups_difference * 7

NameError: name 'number_cups_difference' is not defined
Code
number_cups_week_difference = number_cups_day_difference * 7

yearly cups = number_cups_week_difference * 52
Error 5 - Explanation: variable names cannot have spaces. It’s always good to have descriptive variable names, including ones that are multiple words, but we need to use underscores to separate them instead of spaces.

yearly cups = number_cups_week_difference * 52
  Cell In[40], line 1
    yearly cups = number_cups_week_difference * 52
           ^
SyntaxError: invalid syntax
Code
yearly_cups = number_cups_week_difference * 52

Error 6 - Explanation: a cell displays nothing unless its last line is an expression (such as a variable name) or print is called at some point in the cell. Here the last line is an assignment, so we add yearly_cups on a line of its own at the end.
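
Putting the fixes together, one possible corrected cell is sketched below. Calling abs on the daily difference is just one of the valid placements, and the final value is rounded only to hide floating-point noise.

```python
number_cups_bus = 12 * 1.1      # * is required for multiplication
number_cups_drive = 15 * 1.9

# abs makes the difference positive regardless of subtraction order
number_cups_day_difference = abs(number_cups_bus - number_cups_drive)
number_cups_week_difference = number_cups_day_difference * 7
yearly_cups = number_cups_week_difference * 52   # underscore instead of a space

# In a notebook, ending the cell with the bare name yearly_cups displays it;
# print does the same job when the code is run as a script.
print(round(yearly_cups, 1))
```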