12  Discussion 12: Correlation and Regression (From Summer 2025)

Slides

An important aspect of data science is using data to make predictions about the future based on the information that we currently have. A question one might ask would be, “Given the amount of time a student studied for an exam, what would we predict their grade to be?” In order to answer this question, we will investigate a method of using one variable to predict another by looking at the correlation between two variables.

12.1 Standard Units and Correlation

Note

The Correlation Coefficient

The correlation coefficient r is:

  • A number between -1 and 1 that measures the strength and direction of the linear relationship between two variables.
  • The average of the product of x and y when both are in standard units.

12.1.1 (a)

When calculating the correlation coefficient, why do we convert data to standard units?

Standard Units and Comparing Distributions
  • Changing the units of your data does not affect the correlation coefficient.
  • Standardizing allows us to compare distributions on different scales.
    • Standard units tell us the relative position of each value: how many standard deviations above or below the mean it is.
  • Example: If convert_su() returns 3 for a value, that value is 3 standard deviations above the mean.
Answer

We convert data to standard units in order to compare it with other data that may be measured in different units and on different scales. For example, if we wanted to compare the weights of cars (usually thousands of pounds) to the maximum speeds of cars (usually tens of miles per hour), converting to standard units allows us to effectively compare the two variables.

Moreover, using standard units gives us the following nice properties:

  • r is a pure number with no units (because of standardization).
  • r is unaffected by changing the units on either axis (because of standardization).
Code
from datascience import *
import numpy as np

12.1.2 (b)

Write a function called convert_su which takes in an array of elements called data and returns an array of the values represented in standard units.

def convert_su(data):
    sd = ______________________________
    avg = _____________________________
    return ____________________________
Answer
def convert_su(data):
    sd = np.std(data)
    avg = np.mean(data)
    return (data - avg) / sd
convert_su(make_array(1, 2, 3, 4, 5, 6, 7))
array([-1.5, -1. , -0.5,  0. ,  0.5,  1. ,  1.5])
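As a quick sanity check on `convert_su`, data in standard units always has mean 0 and standard deviation 1. A minimal sketch using plain NumPy (`np.array` stands in for `make_array` here, so the snippet runs without the `datascience` package):

```python
import numpy as np

def convert_su(data):
    # Convert each value to standard units: (value - mean) / SD.
    return (data - np.mean(data)) / np.std(data)

su = convert_su(np.array([1, 2, 3, 4, 5, 6, 7]))
print(np.mean(su))  # 0.0 (up to floating point)
print(np.std(su))   # 1.0 (up to floating point)
```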

12.1.3 (c)

Write a function called calculate_correlation which takes in a table of data containing the columns x and y and returns the correlation coefficient.

def calculate_correlation(tbl, x, y):
    x_su = ______________________________
    y_su = ______________________________
    return ______________________________
Answer
def calculate_correlation(tbl, x, y):
    x_su = convert_su(tbl.column(x))
    y_su = convert_su(tbl.column(y))
    return np.mean(x_su * y_su)
calculate_correlation(Table().with_columns("x", make_array(1, 2, 3, 4, 5), "y", make_array(1, 3, 5, 7, 9)), "x", "y")
0.99999999999999978
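To see the unit-invariance property in action, here is a small sketch with hypothetical car data (the weights and speeds below are made up for illustration): converting pounds to kilograms and miles per hour to kilometers per hour leaves r unchanged, because standardizing removes the units before the products are averaged.

```python
import numpy as np

def convert_su(data):
    return (data - np.mean(data)) / np.std(data)

def correlation(x, y):
    # r = average of the product of x and y in standard units.
    return np.mean(convert_su(x) * convert_su(y))

# Hypothetical car data: weights in pounds, top speeds in mph.
weights_lb = np.array([2800, 3200, 3600, 4000, 4400])
speeds_mph = np.array([150, 141, 137, 126, 118])

r_original = correlation(weights_lb, speeds_mph)
# Switching to kilograms and km/h rescales each variable, so r is unchanged.
r_converted = correlation(weights_lb * 0.4536, speeds_mph * 1.609)
print(np.isclose(r_original, r_converted))  # True
```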

12.2 Comparing Correlation

Correlation Coefficient Visualizer!

12.2.1 (a)

Look at the following four datasets. Rank them in order from weakest correlation to strongest correlation.

Interpreting Correlation
  • The magnitude (absolute value) of r represents the strength of a linear relationship.
  • Two distributions may look similar at first, but subtle trends matter:
    • Example: Distribution A has a negative trend; D has no obvious trend.
  • Textbook chapter 15.2.5 is helpful for understanding regression equations, slopes, and intercepts in both standard and original units.
  • Bonus questions can be explored at home to connect regression lines in standard units to original units.
Answer

Ranking: D, A, B, C.

  • D has almost no visible negative or positive trend as it is basically a blob, so its correlation is near 0.
  • A has a negative correlation, but the points are not very tightly clustered around a straight line, so the strength of its correlation is greater than that of D but still fairly weak.
  • B has a positive correlation, and the points are more tightly clustered around a positively sloping line, so the strength of its correlation is greater than that of A.
  • C has a negative correlation, and the points almost perfectly form a straight line. This indicates that the strength of the correlation is very close to 1.

We have introduced correlation as a way of quantifying the strength and direction of a linear relationship between two variables. However, the correlation coefficient can do more than just tell us how tightly the points in a scatter plot cluster around a straight line. It can also help us define the straight line around which the points (in original units) are clustered, also known as the regression line.

The formulas for the slope and intercept of the regression line in original units are shown below. By a remarkable mathematical fact, the line defined by this slope and intercept is always the best straight line we could construct, in the sense that it minimizes the root mean squared error of the predictions.

\[ \begin{aligned} \textbf{Slope of the regression line} &= r * \frac{\text{SD of } y}{\text{SD of } x} \\[0.25cm] \textbf{Intercept of the regression line} &= \text{average of } y - \text{slope} * \text{average of } x \end{aligned} \]

These formulas allow us to construct the regression line.

\[ \begin{aligned} \textbf{estimate of } y &= \text{slope} * x + \text{intercept} \end{aligned} \]

To further understand these formulas, consider the diagram below:

For every 1 SD increase in \(x\), the predicted value of \(y\) increases by \(r\) SDs of \(y\).
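The slope and intercept formulas translate directly into code. A minimal sketch, reusing the worksheet's `convert_su` idea and checking the formulas on made-up data that lies exactly on the line y = 2x - 1 (the helper names `regression_slope` and `regression_intercept` are illustrative, not from the worksheet):

```python
import numpy as np

def convert_su(data):
    return (data - np.mean(data)) / np.std(data)

def regression_slope(x, y):
    # slope = r * (SD of y / SD of x)
    r = np.mean(convert_su(x) * convert_su(y))
    return r * np.std(y) / np.std(x)

def regression_intercept(x, y):
    # intercept = average of y - slope * average of x
    return np.mean(y) - regression_slope(x, y) * np.mean(x)

# Data that lies exactly on the line y = 2x - 1:
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 3, 5, 7, 9])
print(regression_slope(x, y), regression_intercept(x, y))  # ≈ 2.0, ≈ -1.0
```

Because the points fall perfectly on a line, r is 1 and the formulas recover the slope and intercept exactly (up to floating point).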


12.2.2 (b)

Bonus Question: See if you can derive the slope and intercept above using the equation for the regression line in standard units, where \(y_{SU}\) represents \(y\) in standard units:

\[ y_{SU} = r * x_{SU} \]

Answer \[ \begin{aligned} y_{SU} &= r * x_{SU} \\ \frac{y - \text{average of } y}{\text{SD of } y} &= r * \frac{x - \text{average of } x}{\text{SD of } x} \\ y - \text{average of } y &= \left(r * \frac{\text{SD of } y}{\text{SD of } x}\right) * (x - \text{average of } x) \\ y &= \text{slope} * (x - \text{average of } x) + \text{average of } y \\ y &= \text{slope} * x + (\text{average of } y - \text{slope} * \text{average of } x) \\ y &= \text{slope} * x + \text{intercept} \end{aligned} \]

12.3 Linear Regressi(OH)n

Will Furtado’s Visualizer 🐐!

You just submitted a ticket at Office Hours and would like to know how long it will take to receive help. However, you don’t believe the estimated wait time displayed on the queue to be very accurate, so you decide to make your own predictions based on the total number of students present at OH when you submitted your ticket. You obtain data for 100 wait times and plot them below, also fitting a regression line to the data.


12.3.1 (a)

Suppose that you submit a ticket at Office Hours when there were a total of 20 students present. Based on the regression line, what would you predict the waiting time to be?

Linear Regression: Making Predictions
  • Regression line formula: y = mx + b
    • Once m and b are calculated, plug in x to predict y.
  • Visual example: find the height of the regression line for x = 20.
Answer We observe 20 students at Office Hours, so we would want to look at the height of the regression line for \(x = 20\). Looking at the scatter plot, we see that the regression line roughly passes through the point (20, 14) so we would predict the waiting time to be around 14 minutes.

12.3.2 (b)

You go to Office Hours right before a homework assignment is due, and despite safety concerns, you observe 70 students at Office Hours. Would it be appropriate to use your regression line to predict the waiting time? Explain.

Answer

It would not be appropriate to use the regression line to make a prediction. Most values that were used to construct the regression line were between \(x=0\) and \(x=25\). Since \(x=70\) is far out of that range, we cannot expect the regression line to make a very accurate prediction. Furthermore, since we don’t have data for \(x > 30\), we are not sure that the linear trend continues for larger values of \(x\).

Extrapolation Caution
  • Data only covers certain ranges (e.g., x ≤ 30).
  • We cannot assume trends continue beyond observed data — relationships may not remain linear.
  • Other variables (like assignment difficulty) could affect outcomes — correlation ≠ causation.

12.3.3 (c)

When constructing your regression line, you find the correlation coefficient \(r\) to be roughly 0.73. Does this value of \(r\) suggest that an increase in the number of students at Office Hours causes an increase in the waiting time? Explain.

Answer

Correlation does not imply causation! Just looking at the data, it is unclear whether we have accounted for confounding factors and how they might contribute to the overall waiting time. For example, it is entirely possible that varying difficulty of the assignments across tickets affects the overall waiting time.

Correlation vs. Causation
  • A large r does not imply a causal relationship.
  • Visualize your data to check for linearity — e.g., Anscombe’s Quartet.
  • Scatter plots show associations but cannot confirm causation.

12.3.4 (d)

Suppose you never generated the scatter plot at the beginning of this section. Knowing only that the value of \(r\) is roughly 0.73, can you assume that the two variables have a linear association? Circle the correct statement and explain.

  1. Yes, \(r\) tells us the strength of a linear association and a high value of \(r\) always proves that the two variables have a linear association.
  2. Yes, because if we can compute the value of \(r\), the two variables must have a linear association.
  3. No. A high value of \(r\) does not necessarily imply that the relationship between the variables is linear.
  4. No, the value of \(r\) = 0.73 is not high enough to imply a linear association.
Answer 3. No. A high value of \(r\) does not necessarily imply that the relationship between the variables is linear. For example, a quadratic or exponential relationship between two variables can still have a high value of \(r\). To determine whether the relationship between two variables is linear, it is a good idea to plot the data for a visual check as well.
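To illustrate, a high r can arise from a clearly nonlinear relationship. A sketch with made-up, perfectly quadratic data:

```python
import numpy as np

def correlation(x, y):
    # Average of the product of x and y in standard units.
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

x = np.arange(1, 21)
y = x ** 2  # a perfectly quadratic (not linear) relationship
print(correlation(x, y))  # high (above 0.95), even though the data is not linear
```

This is exactly why plotting the data matters: the number alone cannot distinguish a tight line from a smooth curve.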

12.4 This is Regression! (Optional)

(This question uses the same data as Question 3 on the Summer 2024 Final Exam)

Isaac has an unhealthy addiction to Rocket League, a game where players play soccer but with cars instead of people. Players can pick up boost pads that are scattered across the field, which players can use to make their cars go faster! Isaac plays 50 games and records how much boost he used, as well as how many times he touched the ball in a given game.


12.4.1 (a)

Select the correct option:

Answer

12.4.2 (b)

Select the correct option:

Answer
Regression in Scatter Plots
  • From a scatter plot, you can infer association, but not causation.
  • There may still be causation, but without controlled experiments or accounting for confounders, no conclusions can be drawn.
  • Review worksheet section 2 for regression line formulas.

Isaac runs some calculations and obtains the following statistics:

  • The correlation coefficient between Touches and Boost Usage was approximately 0.705.
  • The average number of Touches was 28.54 with a standard deviation of 9.51.
  • The average of Boost Usage was 1773.4 with a standard deviation of 471.7.

For the following questions, feel free to leave your answers as mathematical expressions.


12.4.3 (c)

Isaac touched the ball 40 times in one of his games. What is this in standard units?

Answer \(\frac{40-28.54}{9.51} \approx 1.2\)

12.4.4 (d)

Isaac wishes to fit a regression line to the data. What would be the slope and intercept of the regression line in original units?

Answer Here the regression line predicts Touches (\(y\)) from Boost Usage (\(x\)).

Slope \(= r * \frac{\text{SD of Touches}}{\text{SD of Boost Usage}} = 0.705 * \frac{9.51}{471.7} \approx 0.0142\)

Intercept \(= \text{Average of Touches} - \text{Slope} * \text{Average of Boost Usage} = 28.54 - 0.0142 * 1773.4 \approx 3.36\)
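As a check on the arithmetic, using only the summary statistics listed above (with Boost Usage as x and Touches as y, matching the worked answer):

```python
# Summary statistics from the worksheet.
r = 0.705
avg_touches, sd_touches = 28.54, 9.51
avg_boost, sd_boost = 1773.4, 471.7

# Predicting Touches (y) from Boost Usage (x):
slope = r * sd_touches / sd_boost            # ≈ 0.0142
intercept = avg_touches - slope * avg_boost  # ≈ 3.33 (≈ 3.36 if you round the slope first)
print(round(slope, 4), round(intercept, 2))
```

The small gap between 3.33 and 3.36 comes entirely from whether the slope is rounded to 0.0142 before computing the intercept.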

12.4.5 (e)

What would the slope and intercept be if the data were in standard units?

Answer

Slope
\(= r*\frac{\text{SD of Touches (SU)}}{\text{SD of Boost Usage (SU)}} = 0.705*\frac{1}{1} = 0.705\)

Intercept \(= \text{Average of Touches (SU)} - \text{slope} * \text{Average of Boost Usage (SU)} = 0 - 0.705*0 = 0\)

When in standard units, the slope of the regression line is just \(r\), and the intercept is just zero!