from datascience import *
import numpy as np
An important aspect of data science is using data to make predictions about the future based on the information that we currently have. A question one might ask would be, “Given the amount of time a student studied for an exam, what would we predict their grade to be?” In order to answer this question, we will investigate a method of using one variable to predict another by looking at the correlation between two variables.
When calculating the correlation coefficient, why do we convert data to standard units?
We convert data to standard units in order to compare it with other data that may have different units and scales. For example, if we wanted to compare the weights of cars (usually thousands of pounds) to the maximum speeds of cars (usually tens of miles per hour), converting to standard units allows us to effectively compare the two variables.
Moreover, using standard units gives us the following nice properties:
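Two such properties that are easy to verify numerically: data in standard units always has an average of 0 and an SD of 1, whatever the original units were. The sketch below is a minimal check in plain Python (standard library only, so it runs without the datascience package). The helper name standard_units and the car values are made up for illustration; they are not part of the worksheet.

```python
import statistics

def standard_units(values):
    """Convert a list of numbers to standard units.

    Uses the population SD (statistics.pstdev), which matches np.std's default.
    """
    avg = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - avg) / sd for v in values]

# Hypothetical values on very different scales (illustrative only).
weights_lb = [2800, 3200, 3500, 4100, 4400]
top_speed_mph = [95, 110, 120, 130, 145]

for name, values in [("weight", weights_lb), ("top speed", top_speed_mph)]:
    su = standard_units(values)
    # Both averages are 0 and both SDs are 1, up to floating-point rounding.
    print(name, "average:", statistics.fmean(su), "SD:", statistics.pstdev(su))
```

Because both variables end up on the same unitless scale, a value such as 1.2 means "1.2 SDs above average" for either one.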
Write a function called convert_su which takes in an array of elements called data and returns an array of the values represented in standard units.
def convert_su(data):
    sd = ______________________________
    avg = _____________________________
    return ____________________________
def convert_su(data):
    sd = np.std(data)
    avg = np.mean(data)
    return (data - avg) / sd
convert_su(make_array(1, 2, 3, 4, 5, 6, 7))
array([-1.5, -1. , -0.5, 0. , 0.5, 1. , 1.5])
Write a function called calculate_correlation which takes in a table of data containing the columns x and y and returns the correlation coefficient.
def calculate_correlation(tbl, x, y):
    x_su = ______________________________
    y_su = ______________________________
    return ______________________________
def calculate_correlation(tbl, x, y):
    x_su = convert_su(tbl.column(x))
    y_su = convert_su(tbl.column(y))
    return np.mean(x_su * y_su)
calculate_correlation(Table().with_columns("x", make_array(1, 2, 3, 4, 5), "y", make_array(1, 3, 5, 7, 9)), "x", "y")
0.99999999999999978
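The result is 1 up to floating-point rounding because the points lie exactly on the line y = 2x - 1. As a further check, the sketch below recomputes r in plain Python (standard library only) and illustrates three standard properties of the correlation coefficient: it does not change when x and y are swapped, it does not change when a variable's units are rescaled by a positive factor, and it is -1 for a perfectly linear decreasing relationship. The function name correlation is a stand-in for calculate_correlation that works on lists rather than tables.

```python
import statistics

def correlation(xs, ys):
    """Average of the products of x and y in standard units."""
    def su(values):
        avg = statistics.fmean(values)
        sd = statistics.pstdev(values)  # population SD, like np.std
        return [(v - avg) / sd for v in values]
    return statistics.fmean(a * b for a, b in zip(su(xs), su(ys)))

xs = [1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9]  # exactly y = 2x - 1

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                      # swap the roles of x and y
r_rescaled = correlation([100 * x for x in xs], ys)  # change the units of x
r_negative = correlation(xs, ys[::-1])               # perfectly linear, decreasing

print(r, r_swapped, r_rescaled, r_negative)
```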
Correlation Coefficient Visualizer!
Look at the following four datasets. Rank them in order from weakest correlation to strongest correlation.
Ranking: D, A, B, C.
We have introduced correlation as a way of quantifying the strength and direction of a linear relationship between two variables. However, the correlation coefficient can do more than just tell us about how clustered the points in a scatter plot are about a straight line. It can also help us define the straight line about which the points (in original units) are clustered, also known as the regression line.
The formulas for the slope and intercept of the regression line in original units are shown below. Remarkably, among all possible straight lines, the line defined by this slope and intercept is the one that minimizes the root mean squared error of the estimates, and in that sense it is the best straight line we could construct.
\[ \begin{aligned} \textbf{Slope of the regression line} &= r * \frac{\text{SD of } y}{\text{SD of } x} \\[0.25cm] \textbf{Intercept of the regression line} &= \text{average of } y - \text{slope} * \text{average of } x \end{aligned} \]
These formulas allow us to construct the regression line.
\[ \begin{aligned} \textbf{estimate of } y &= \text{slope} * x + \text{intercept} \\[0.25cm] \end{aligned} \]
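As a rough sketch of how these formulas turn into code, the example below computes the slope and intercept from r, the SDs, and the averages, then makes one prediction. It uses plain Python (standard library only); the function names and the study-time data are hypothetical and chosen only to echo the exam question from the introduction.

```python
import statistics

def correlation(xs, ys):
    """Average of the products of x and y in standard units."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    return statistics.fmean(
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)
    )

def regression_line(xs, ys):
    """Return (slope, intercept) of the regression line in original units."""
    r = correlation(xs, ys)
    slope = r * statistics.pstdev(ys) / statistics.pstdev(xs)
    intercept = statistics.fmean(ys) - slope * statistics.fmean(xs)
    return slope, intercept

# Made-up data: hours studied and exam score (illustrative only).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 61, 70, 78, 81]

slope, intercept = regression_line(hours, scores)

# estimate of y = slope * x + intercept, evaluated at the average of x
estimate = slope * 3.5 + intercept
print(slope, intercept, estimate)
```

Here 3.5 is the average number of hours, and the estimate equals the average score (67). This follows directly from the intercept formula: the regression line always passes through the point of averages.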
To further understand these formulas, consider the diagram below:
For every 1 SD increase in \(x\), the predicted value of \(y\) increases by \(r\) SDs of \(y\).
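This statement can be checked numerically. The sketch below, in plain Python with made-up data, builds the regression line from the slope and intercept formulas, moves x up by exactly one SD of x, and measures the change in the prediction in SDs of y. The variable names and the starting point x0 are arbitrary choices for illustration.

```python
import statistics

# Made-up data for illustration only.
xs = [2, 4, 5, 7, 9, 12]
ys = [10, 14, 13, 20, 22, 25]

mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)

# r is the average product of x and y in standard units.
r = statistics.fmean(
    ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)
)

slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

def predict(x):
    """Estimate of y from the regression line."""
    return slope * x + intercept

x0 = 5  # any starting value works, since the line has constant slope
change_in_prediction = predict(x0 + sd_x) - predict(x0)

# The change, measured in SDs of y, matches r up to rounding.
print(change_in_prediction / sd_y, r)
```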
Bonus Question: See if you can derive the slope and intercept above using the equation for the regression line in standard units, where \(y_{SU}\) and \(x_{SU}\) represent \(y\) and \(x\) in standard units:
\[ y_{SU} = r * x_{SU} \]
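One possible derivation, sketched here under the assumption that \(y\) on the left-hand side stands for the estimate of \(y\): replace each standard-unit quantity with its definition (value minus average, divided by SD) and solve for \(y\).

\[ \begin{aligned} y_{SU} &= r * x_{SU} \\[0.25cm] \frac{y - \text{average of } y}{\text{SD of } y} &= r * \frac{x - \text{average of } x}{\text{SD of } x} \\[0.25cm] y &= r * \frac{\text{SD of } y}{\text{SD of } x} * x + \left( \text{average of } y - r * \frac{\text{SD of } y}{\text{SD of } x} * \text{average of } x \right) \end{aligned} \]

The coefficient of \(x\) in the last line is the slope, \(r * \frac{\text{SD of } y}{\text{SD of } x}\), and the term in parentheses is the intercept, \(\text{average of } y - \text{slope} * \text{average of } x\).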
You just submitted a ticket at Office Hours and would like to know how long it will take to receive help. However, you don’t believe the estimated wait time displayed on the queue to be very accurate, so you decide to make your own predictions based on the total number of students present at OH when you submitted your ticket. You obtain data for 100 wait times and plot them below, also fitting a regression line to the data.
Suppose that you submit a ticket at Office Hours when there were a total of 20 students present. Based on the regression line, what would you predict the waiting time to be?
You go to Office Hours right before a homework assignment is due, and despite safety concerns, you observe 70 students at Office Hours. Would it be appropriate to use your regression line to predict the waiting time? Explain.
It would not be appropriate to use the regression line to make a prediction. Most values that were used to construct the regression line were between \(x=0\) and \(x=25\). Since \(x=70\) is far out of that range, we cannot expect the regression line to make a very accurate prediction. Furthermore, since we don’t have data for \(x > 30\), we are not sure that the linear trend continues for larger values of \(x\).
When constructing your regression line, you find the correlation coefficient \(r\) to be roughly 0.73. Does this value of \(r\) suggest that an increase in the number of students at Office Hours causes an increase in the waiting time? Explain.
Correlation does not imply causation! From the data alone, it is unclear whether confounding factors have been accounted for or how they might contribute to the overall waiting time. For example, the difficulty of the assignments behind each ticket may vary, and that could also affect the waiting time.
Suppose you never generated the scatter plot at the beginning of this section. Knowing only that the value of \(r\) is roughly 0.73, can you assume that the two variables have a linear association? Circle the correct statement and explain.
(This question uses the same data as Question 3 on the Summer 2024 Final Exam)
Isaac has an unhealthy addiction to Rocket League, a game where players play soccer but with cars instead of people. Players can pick up boost pads that are scattered across the field, which players can use to make their cars go faster! Isaac plays 50 games and records how much boost he used, as well as how many times he touched the ball in a given game.
Select the correct option:
Select the correct option:
Isaac runs some calculations and obtains the following statistics:
For the following questions, feel free to leave your answers as mathematical expressions.
Isaac touched the ball 40 times in one of his games. What is this in standard units?
Isaac wishes to fit a regression line to the data. What would be the slope and intercept of the regression line in original units?
What would the slope and intercept be if the data were in standard units?
Slope \(= r * \frac{\text{SD of Touches (SU)}}{\text{SD of Boost Usage (SU)}} = 0.705 * \frac{1}{1} = 0.705\)
Intercept \(= \text{Average of Touches (SU)} - \text{slope} * \text{Average of Boost Usage (SU)} = 0 - 0.705 * 0 = 0\)
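The same reasoning can be checked numerically: once both variables are in standard units, each has SD 1 and average 0, so the slope formula returns r and the intercept formula returns 0. The sketch below uses plain Python and made-up numbers. They are not Isaac's data, so the r it prints will not be 0.705; the point is only that the slope equals r and the intercept equals 0.

```python
import statistics

def standard_units(values):
    """Convert a list of numbers to standard units (population SD)."""
    avg = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - avg) / sd for v in values]

# Hypothetical boost usage and ball touches (NOT the data from the question).
boost = [120, 150, 180, 200, 240, 260]
touches = [25, 31, 30, 38, 41, 47]

boost_su = standard_units(boost)
touches_su = standard_units(touches)

# r is the average product of the two variables in standard units.
r = statistics.fmean(b * t for b, t in zip(boost_su, touches_su))

# Apply the usual slope and intercept formulas to the standard-unit data.
slope_su = r * statistics.pstdev(touches_su) / statistics.pstdev(boost_su)
intercept_su = statistics.fmean(touches_su) - slope_su * statistics.fmean(boost_su)

print(slope_su, r, intercept_su)
```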