# You don’t need to understand this code! I’m just importing the necessary libraries in case you’re curious
from datascience import *
import numpy as np
Name | Wesley Zheng |
Pronouns | He/him/his |
Email | wzheng0302@berkeley.edu |
Discussion | Wednesdays, 12–2 PM @ Etcheverry 3105 |
Office Hours | Tuesdays/Thursdays, 2–3 PM @ Warren Hall |
Feel free to contact me by email. I typically respond within a day or so!
Reminders
Deadlines
A study followed 369 people with cardiovascular disease, randomly selected from all hospital patients with cardiovascular disease. A year later, those who owned a dog were four times more likely to be alive than those who did not. For all of the following questions, please provide a brief explanation.
(Spring 2017 Practice Midterm Question 3b)
True or False. This study is a randomized controlled experiment.
True or False. This study shows that dog owners live longer than cat owners on average.
True or False. This study shows that for someone with cardiovascular disease, adopting a dog causes them to live longer.
False. An observational study does not show causation.
In data science, it’s important to distinguish between association and causation.
Key Idea
The key lesson: Correlation is not causation.
Suppose we have executed the following lines of code. Answer each question with the output of the given line of code, or write ERROR if you think the operation is not possible.
# You don’t need to understand this code! I’m just importing the necessary libraries in case you’re curious
from datascience import *
import numpy as np
odd = make_array(1, 3, 5, 7)
even = np.arange(2, 10, 2)
nums = make_array('1', '2', '3', '4')
Arrays in Python let us store and manipulate collections of values at once.
Key Idea
Before solving problems, always check:
This prevents small mistakes from snowballing into bigger ones.
odd + even
odd + even
array([ 3, 7, 11, 15])
odd + nums
ERROR. Arrays can only be added together if they have the same size and compatible data types (e.g., ints and floats). In this case, one array contains integers and the other contains strings, so the operation is invalid.
odd + nums
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 odd + nums

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U1')) -> None
even.item(3) * odd.item(1)
even.item(3) * odd.item(1)
24
odd * 3
odd * 3
array([ 3, 9, 15, 21])
(odd + 1) == even
(odd + 1) == even
array([ True, True, True, True], dtype=bool)
Remember that array operations are performed element-wise, including comparison operators.
For example, two arrays may hold identical values, but comparing them with == won't return a single True. Instead, the comparison happens for each element, and the result is an array of Booleans.
This is a common source of confusion, so be careful!
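As a small illustration of the element-wise comparison, here is a sketch that rebuilds the same values with plain NumPy (`np.array` stands in for `make_array`, which is an assumption made so the snippet runs without the `datascience` library). The names `elementwise`, `all_match`, and `same_arrays` are hypothetical and used only here.

```python
import numpy as np

# Same values as the worksheet's `odd` and `even` arrays
odd = np.array([1, 3, 5, 7])
even = np.arange(2, 10, 2)

# == compares position by position and returns an array of Booleans
elementwise = (odd + 1) == even

# Two ways to collapse the comparison into a single True/False
all_match = bool(np.all(elementwise))
same_arrays = bool(np.array_equal(odd + 1, even))

print(elementwise)
print(all_match, same_arrays)
```

If a single yes/no answer is what you want, `np.all` or `np.array_equal` is one way to get it; the plain `==` always answers once per element.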
nums.item(3) + '0'
nums.item(3) + '0'
'40'
sum(odd > 4)
sum(odd > 4)
2
sum(odd > 4) / len(odd > 4). What does this output represent? Discuss with peers.
sum(odd > 4) / len(odd > 4)
0.5
This computes the proportion of values strictly greater than 4! It is equivalent to np.mean(odd > 4). We will revisit this during the hypothesis testing topic.
np.mean(odd > 4)
0.5
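To see why the two expressions agree, the sketch below breaks the computation into steps using plain NumPy (again with `np.array` as a stand-in for `make_array`). The intermediate names `is_big`, `count_big`, `proportion_by_hand`, and `proportion_by_mean` are hypothetical.

```python
import numpy as np

odd = np.array([1, 3, 5, 7])

# Step 1: the comparison gives one Boolean per element
is_big = odd > 4

# Step 2: summing Booleans counts the True values (True acts as 1, False as 0)
count_big = int(np.sum(is_big))

# Step 3: dividing the count by the number of elements gives a proportion
proportion_by_hand = count_big / len(is_big)

# The mean of a Boolean array does steps 2 and 3 in one call
proportion_by_mean = float(np.mean(is_big))

print(count_big, proportion_by_hand, proportion_by_mean)
```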
odd + make_array(True, True, False, False)
odd + make_array(True, True, False, False)
array([2, 4, 5, 7])
True and False are equivalent to the integer values 1 and 0, respectively.
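A minimal sketch of this Boolean-to-integer behavior, assuming plain NumPy arrays in place of `make_array`; `flags`, `as_ints`, and `shifted` are hypothetical names.

```python
import numpy as np

odd = np.array([1, 3, 5, 7])
flags = np.array([True, True, False, False])

# Converting explicitly shows the integer value behind each Boolean
as_ints = flags.astype(int)

# In arithmetic, each True adds 1 and each False adds 0
shifted = odd + flags

print(as_ints)
print(shifted)
print(True + True)
```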
Tables are a fundamental way of representing data sets. A table can be viewed in two ways:
Data 8 uses a library consisting of many Table functions which will allow you to manipulate and visualize data. All of the functions we will use in this course are listed on the Data 8 course webpage under Python Reference, and a similar outline will be provided during exams.
Tables are one of the most powerful tools in data science. They help us summarize, organize, and see patterns in raw data.
Key Idea
You don’t need to know the exact Python code yet—the focus is on building intuition.
The takeaway is that tables let us see connections and patterns that are hidden in unsummarized data.
In this question, let's look at an example table called weather. The table has 8 rows, each corresponding to a day of the year. Each row has three attributes: the date, the outdoor temperature in Celsius, and the number of students at lecture that day.
import warnings
warnings.filterwarnings("ignore") # This line and the one above disable some confusing warnings that Python might generate
%matplotlib inline
weather = Table().with_columns(
    "Date", ["June 21", "June 22", "June 23", "June 24", "June 27", "June 28", "June 29", "June 30"],
    "Outdoor Temp. (Celsius)", [28, 30, 34, 36, 34, 26, 26, 28],
    "Students at Lecture", [435, 417, 394, 398, 410, 385, 370, 373]
) # Creating the weather table
weather
Date | Outdoor Temp. (Celsius) | Students at Lecture |
---|---|---|
June 21 | 28 | 435 |
June 22 | 30 | 417 |
June 23 | 34 | 394 |
June 24 | 36 | 398 |
June 27 | 34 | 410 |
June 28 | 26 | 385 |
June 29 | 26 | 370 |
June 30 | 28 | 373 |
Using just the information provided in the weather table, can you generate the following tables and visualizations? If not, what additional information do you need?
YES/NO | YES/NO |
YES, YES.
# The following code will generate the table on the left. You don’t need to understand this for now
weather.select("Students at Lecture", "Outdoor Temp. (Celsius)").group("Outdoor Temp. (Celsius)", np.mean).relabeled("Students at Lecture mean", "Mean # Students at Lecture")
Outdoor Temp. (Celsius) | Mean # Students at Lecture |
---|---|
26 | 377.5 |
28 | 404 |
30 | 417 |
34 | 402 |
36 | 398 |
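To build intuition for what the grouped table contains, here is a sketch that computes the same per-temperature means by hand in plain Python. It is not how the `datascience` library does it internally; the names `by_temp` and `mean_students` are hypothetical.

```python
from collections import defaultdict

# The two numeric columns of the weather table, copied from above
temps = [28, 30, 34, 36, 34, 26, 26, 28]
students = [435, 417, 394, 398, 410, 385, 370, 373]

# Collect the attendance counts that share the same temperature
by_temp = defaultdict(list)
for temp, count in zip(temps, students):
    by_temp[temp].append(count)

# Average each group, listing temperatures in increasing order
mean_students = {
    temp: sum(counts) / len(counts)
    for temp, counts in sorted(by_temp.items())
}

for temp, mean in mean_students.items():
    print(temp, mean)
```

Each distinct temperature becomes one row, which is why a table with 8 days collapses to 5 rows.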
# The following code will generate the graph on the right. You don’t need to understand this for now
weather.scatter("Outdoor Temp. (Celsius)", "Students at Lecture")
Matthew likes using Fahrenheit more than Celsius. Suppose that he has access to a function called fahrenheit that takes in a number (temperature in Celsius) and returns the temperature in Fahrenheit.
# You don’t need to understand this for now! We will learn about functions very soon
def fahrenheit(celsius):
    return 9/5 * celsius + 32
Write an expression that evaluates to 30°C in Fahrenheit. Hint: which function can we use?
fahrenheit(30)
86.0
What will fahrenheit("thirty") output?
ERROR. fahrenheit takes in a number, so we should only pass numbers into the function. Passing in something that is not a number will result in an error.
fahrenheit("thirty")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 fahrenheit("thirty")

Cell In[16], line 3, in fahrenheit(celsius)
      2 def fahrenheit(celsius):
----> 3     return 9/5 * celsius + 32

TypeError: can't multiply sequence by non-int of type 'float'
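If you want to observe this error without stopping the rest of a cell, one option is to catch it. The sketch below redefines `fahrenheit` exactly as above; the variable `outcome` is a hypothetical name, and `try`/`except` is shown only as an illustration, not something this worksheet requires.

```python
def fahrenheit(celsius):
    return 9/5 * celsius + 32

# A number works as expected
print(fahrenheit(30))

# A string cannot be multiplied by the float 9/5, so Python raises a TypeError
try:
    fahrenheit("thirty")
    outcome = "no error"
except TypeError as err:
    outcome = "TypeError: " + str(err)

print(outcome)
```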
Matthew assigns a variable temperature to the number 25. What will fahrenheit(temperature * 2) output?
temperature = 25
The Fahrenheit value for 50°C, which is 122°F.
fahrenheit(temperature * 2)
122.0
Matthew prefers using units of Fahrenheit instead of Celsius. Fill in the blank line of code to calculate an array of new values that we can add to the weather table.
Recall the following formula: \(F = \frac{9}{5}C + 32\)
temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = _________________________
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)
temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = (9/5)*temp_in_celsius + 32
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)
new_table
Date | Outdoor Temp. (Celsius) | Students at Lecture | Outdoor Temp. (Fahrenheit) |
---|---|---|---|
June 21 | 28 | 435 | 82.4 |
June 22 | 30 | 417 | 86 |
June 23 | 34 | 394 | 93.2 |
June 24 | 36 | 398 | 96.8 |
June 27 | 34 | 410 | 93.2 |
June 28 | 26 | 385 | 78.8 |
June 29 | 26 | 370 | 78.8 |
June 30 | 28 | 373 | 82.4 |
Alternative Solution:
temp_in_celsius = weather.column('Outdoor Temp. (Celsius)')
temp_in_fahrenheit = weather.apply(fahrenheit, 'Outdoor Temp. (Celsius)') # You can learn more about the apply function in the Python Reference tab on the course website
new_table = weather.with_column('Outdoor Temp. (Fahrenheit)', temp_in_fahrenheit)
new_table
Date | Outdoor Temp. (Celsius) | Students at Lecture | Outdoor Temp. (Fahrenheit) |
---|---|---|---|
June 21 | 28 | 435 | 82.4 |
June 22 | 30 | 417 | 86 |
June 23 | 34 | 394 | 93.2 |
June 24 | 36 | 398 | 96.8 |
June 27 | 34 | 410 | 93.2 |
June 28 | 26 | 385 | 78.8 |
June 29 | 26 | 370 | 78.8 |
June 30 | 28 | 373 | 82.4 |
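Both solutions produce the same column because array arithmetic already works element by element. The sketch below checks this with plain NumPy, assuming the Celsius column is copied into an array by hand; `array_version` and `one_at_a_time` are hypothetical names, and the list comprehension is only a stand-in for what applying a function to each value does conceptually.

```python
import numpy as np

def fahrenheit(celsius):
    return 9/5 * celsius + 32

# The Celsius column of the weather table, copied from above
temp_in_celsius = np.array([28, 30, 34, 36, 34, 26, 26, 28])

# Approach 1: arithmetic on the whole array at once
array_version = (9/5) * temp_in_celsius + 32

# Approach 2: call the function on each value separately
one_at_a_time = np.array([fahrenheit(c) for c in temp_in_celsius])

print(array_version)
print(np.allclose(array_version, one_at_a_time))
```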
Matthew wants to compare the data in the weather
table with a previous summer’s data. He found the following scatterplot:
Is the data collected previously the result of an observational study or a randomized controlled experiment? Why?
Using our answer to part 4 and the visualization, is there a relationship between outdoor temperature and the number of students at lecture – an association, a causal relationship or something else? Why?
Carisma collected the following information about her coworkers’ methods of getting to work and their coffee consumption. The data is stored in a table called coworkers:
# Initializing the table, you don't need to understand this code
coffee_table = Table().with_columns(
    "Name", ["Dagny", "Marissa", "Isaac", "Tiffany", "Wesley"],
    "Method", ["drive", "drive", "bus", "drive", "bus"],
    "Average Cups of Coffee", [2.3, 1.5, 0.8, 1.8, 1.2]
)
more_names = [f"Person{i}" for i in range(6, 66)]
more_methods = np.random.choice(["drive", "bus", "bike", "walk"], size=60)
more_coffee = np.round(np.random.uniform(0.5, 3.0, size=60), 1)
extra_table = Table().with_columns(
    "Name", more_names,
    "Method", more_methods,
    "Average Cups of Coffee", more_coffee
)
coworkers = coffee_table.append(extra_table)
coworkers.show(5)
Name | Method | Average Cups of Coffee |
---|---|---|
Dagny | drive | 2.3 |
Marissa | drive | 1.5 |
Isaac | bus | 0.8 |
Tiffany | drive | 1.8 |
Wesley | bus | 1.2 |
... (60 rows omitted)
The table contains three columns:
Name (string): The name of the coworker.
Method (string): The coworker’s way of commuting to work.
Average Cups of Coffee (float): The average number of cups of coffee consumed per day by the coworker.
When coding on paper, the challenge is to translate an English description into Python code.
Key Idea
Even without a computer, you can check your reasoning and syntax.
This practice strengthens the connection between the language of the problem and the language of code.
Help Carisma analyze her coworkers’ coffee consumption habits by creating the following tables.
Carisma wants to focus on her coworkers who commute by driving to work. Create a table titled drivers that only includes coworkers who drive to work.
drivers = __________________.__________________(_______________________, _______________________)
drivers = coworkers.where("Method", "drive")
drivers
Name | Method | Average Cups of Coffee |
---|---|---|
Dagny | drive | 2.3 |
Marissa | drive | 1.5 |
Tiffany | drive | 1.8 |
Person12 | drive | 2.5 |
Person13 | drive | 0.7 |
Person20 | drive | 2.7 |
Person30 | drive | 2 |
Person36 | drive | 0.7 |
Person38 | drive | 2.5 |
Person40 | drive | 0.9 |
... (9 rows omitted)
Carisma wants to see which of her coworkers consume the most coffee. Create a table titled consumption that organizes her coworkers from the highest average cups of coffee per day to the lowest.
consumption = _________________._________________(______________________, ______________________)
consumption = coworkers.sort("Average Cups of Coffee", descending=True)
consumption
Name | Method | Average Cups of Coffee |
---|---|---|
Person43 | bike | 3 |
Person22 | bike | 2.9 |
Person47 | bus | 2.9 |
Person41 | walk | 2.8 |
Person63 | drive | 2.8 |
Person65 | walk | 2.8 |
Person9 | walk | 2.7 |
Person11 | bike | 2.7 |
Person20 | drive | 2.7 |
Person17 | walk | 2.6 |
... (55 rows omitted)
Carisma decides she only wants to focus on her coworkers’ average coffee consumption and method of commuting to work. Create a table titled coffee_and_commute that includes only the Average Cups of Coffee and Method columns in that order.
coffee_and_commute = _________________._________________(_______________________________________)
coffee_and_commute = coworkers.select("Average Cups of Coffee", "Method")
coffee_and_commute
Average Cups of Coffee | Method |
---|---|
2.3 | drive |
1.5 | drive |
0.8 | bus |
1.8 | drive |
1.2 | bus |
1.6 | bus |
0.9 | walk |
1.4 | walk |
2.7 | walk |
1.8 | bike |
... (55 rows omitted)
Carisma wants to determine whether she is spending more money on coffee compared to boba over the course of 1 semester (15 weeks long). Assume she purchases 4 coffees and 3 bobas per week. Complete the following lines of code, which should assign result to True if Carisma spends strictly more on coffee and False otherwise.
cost_coffee = 6.0
cost_boba = 7.0
total_coffee = ______________ * ______ * ______
total_boba = ________________ * ______ * ______
result = ______________________________________
cost_coffee = 6.0
cost_boba = 7.0
total_coffee = cost_coffee * 4 * 15
total_boba = cost_boba * 3 * 15
result = total_coffee > total_boba
result
True
Carisma is trying to compute the absolute value of the difference between the total number of cups drunk by driving coworkers per year vs the total number of cups drunk by bussing coworkers per year. She will do all of this in a single cell. Identify the errors in the following cell and correct them. Make sure that the code cell outputs a single positive number.
number_cups_bus = 12(1.1)
number_cups_drive = 15(1.9)
number_cups_day_difference = ((number_cups_bus - number_cups_drive)
number_cups_week_difference = number_cups_difference * 7
yearly cups = number_cups_week_difference * 52
number_cups_bus = 12(1.1)
number_cups_drive = 15(1.9)
1 Error - Explanation: parentheses alone cannot be used for multiplication. Python reads 12(1.1) as an attempt to call 12 like a function, so use the * operator instead.
number_cups_bus = 12(1.1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 number_cups_bus = 12(1.1)

TypeError: 'int' object is not callable
number_cups_bus = 12 * 1.1
number_cups_drive = 15 * 1.9
number_cups_day_difference = ((number_cups_bus - number_cups_drive)
2 Error - Explanation: the parentheses were unbalanced! In Jupyter, you can put the cursor on a parenthesis to see whether it is matched. Syntax errors are often reported on the wrong line.
number_cups_day_difference = ((number_cups_bus - number_cups_drive)
  Cell In[36], line 1
    number_cups_day_difference = ((number_cups_bus - number_cups_drive)
                                                                       ^
SyntaxError: incomplete input
number_cups_day_difference = abs(number_cups_bus - number_cups_drive)
number_cups_week_difference = number_cups_difference * 7
3 Error - Explanation: the variable name was wrong (number_cups_difference was never defined). You can use tab to autocomplete!
4 Error - Also, we want to use absolute value (by calling abs) at some point (it can go in any of the last three lines) because the question asked for a positive number! We could also use a different subtraction order to make sure the answer is positive.
number_cups_week_difference = number_cups_difference * 7
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[38], line 1
----> 1 number_cups_week_difference = number_cups_difference * 7

NameError: name 'number_cups_difference' is not defined
number_cups_week_difference = number_cups_day_difference * 7
yearly cups = number_cups_week_difference * 52
5 Error - Explanation: variable names cannot have spaces. It’s always good to have descriptive variable names, including ones that are multiple words, but we need to use underscores to separate them instead of spaces.
yearly cups = number_cups_week_difference * 52
  Cell In[40], line 1
    yearly cups = number_cups_week_difference * 52
           ^
SyntaxError: invalid syntax
yearly_cups = number_cups_week_difference * 52
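Putting the individual fixes together, one possible corrected cell looks like the sketch below. It places `abs` on the daily difference, which is only one of the valid choices mentioned above.

```python
# Daily cups for each group, using * for multiplication
number_cups_bus = 12 * 1.1
number_cups_drive = 15 * 1.9

# abs makes the difference positive regardless of subtraction order
number_cups_day_difference = abs(number_cups_bus - number_cups_drive)

# Scale the daily difference up to a week, then to a year
number_cups_week_difference = number_cups_day_difference * 7
yearly_cups = number_cups_week_difference * 52

print(yearly_cups)
```

The result is roughly 5569.2 cups per year; the printed value may show small floating-point rounding.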