21 Discussion 06: Modeling & OLS (From Summer 2025)
Slides
21.1 Driving with a Constant Model
Lillian is trying to use modeling to drive her car autonomously. To do this, she collects a lot of data from driving around her neighborhood and stores it in drive. She wants your help to design a model that can drive on her behalf in the future using the outputs of the models you design. First, she wants to tackle two aspects of this autonomous car modeling framework: going forward and turning.
Some statistics from the collected dataset are shown below using drive.describe(), which returns the mean, standard deviation, quartiles, minimum, and maximum for the two columns in the dataset: target_speed and degree_turn.

21.1.1 (a)
Suppose the first part of the model predicts the target speed of the car. Using constant models trained on the speeds of the collected data shown above with \(L_1\) and \(L_2\) loss functions, which of the following is true?
Answer
When we train the model with \(L_1\) loss, the optimal value \(\hat{\theta}\) is the median of all the target speeds. From the summary statistics, the median is given by the 50th-percentile value, namely \(\approx 25.82\).
When training with \(L_2\) loss, \(\hat{\theta}\) is the mean of all the target speeds, which happens to be \(\approx 32.92\). Since the mean is larger than the median, the constant model trained with \(L_2\) loss will always result in a higher speed than when trained with \(L_1\) loss.
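As a quick numerical sanity check (not from the original worksheet), here is a small NumPy sketch with made-up speeds showing that the constant prediction minimizing average \(L_1\) loss is the median, while the one minimizing average \(L_2\) loss is the mean:

```python
import numpy as np

# Hypothetical target speeds; the real drive dataset is not reproduced here.
speeds = np.array([5.0, 12.0, 20.0, 25.0, 30.0, 55.0, 83.0])

# Candidate constant predictions to search over.
thetas = np.linspace(speeds.min(), speeds.max(), 10_001)

# Average L1 and L2 loss for each candidate theta.
l1_risk = np.abs(speeds[:, None] - thetas[None, :]).mean(axis=0)
l2_risk = ((speeds[:, None] - thetas[None, :]) ** 2).mean(axis=0)

print("L1 minimizer ~", thetas[np.argmin(l1_risk)], "median =", np.median(speeds))
print("L2 minimizer ~", thetas[np.argmin(l2_risk)], "mean   =", np.mean(speeds))
```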
21.1.2 (b)
Finding that the model trained with \(L_2\) loss drives too slowly, Lillian changes the loss function for the constant model so that errors are penalized more heavily when the true speed is higher. That way, in order to minimize loss, the model has to output predictions closer to the true values, particularly at faster speeds, with the end result being a higher constant speed. Lillian writes this as \(L(y, \hat{y}) = y(y - \hat{y})^2\).
Find the optimal \(\hat{\theta_0}\) for the constant model using the new empirical risk function \(R(\theta_0)\) below:
\[ R(\theta_0) = \frac{1}{n} \sum_i y_i (y_i - \theta_0)^2 \]
Answer
Take the derivative:
\[ \frac{dR}{d\theta_0} = \frac{1}{n} \sum_i \frac{d}{d\theta_0} y_i (y_i - \theta_0)^2 = \frac{1}{n} \sum_i -2y_i (y_i - \theta_0) = -\frac{2}{n} \sum_i \left( y_i^2 - y_i \theta_0 \right) \]
Set the derivative to 0:
\[ -\frac{2}{n} \sum_i \left( y_i^2 - y_i\theta_0 \right) = 0 \]
\[ \theta_0 \sum_i y_i = \sum_i y_i^2 \]
\[ \theta_0 = \frac{\sum_i y_i^2}{\sum_i y_i} \]
Note that the empirical risk function is convex, which you can show by computing the second derivative of \(R(\theta_0)\) and noting that it is positive for all values of \(\theta_0\).
\[\frac{d^2R}{d\theta_0^2} = \frac{2}{n} \sum_i y_i > 0\] since each \(y_i\) is a speed and therefore positive, which the summary statistics confirm (the minimum target speed is positive).
Therefore, any critical point must be the global minimum, and so the optimal value we found minimizes the empirical risk.
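To sanity-check this closed form (a sketch using made-up speeds, not part of the original solution), we can compare \(\hat{\theta}_0 = \sum_i y_i^2 / \sum_i y_i\) against a brute-force grid search over the weighted risk:

```python
import numpy as np

# Hypothetical positive target speeds; the real dataset is not reproduced here.
y = np.array([5.0, 12.0, 20.0, 25.0, 30.0, 55.0, 83.0])

# Closed-form minimizer derived above for R(theta0) = mean(y * (y - theta0)^2).
theta_closed = (y ** 2).sum() / y.sum()

# Brute-force check: evaluate the weighted risk on a fine grid.
grid = np.linspace(y.min(), y.max(), 100_001)
risk = (y[:, None] * (y[:, None] - grid[None, :]) ** 2).mean(axis=0)
theta_grid = grid[np.argmin(risk)]

print("closed form:", theta_closed)   # noticeably above the plain mean of y
print("grid search:", theta_grid)
print("plain mean :", y.mean())
```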
21.1.3 (c) (Extra)
To upgrade her model to be able to use a feature \(x\), Lillian uses a simple linear regression model without an intercept term; that is, the model is given by \(y = \theta x\). Using the standard \(L_2\) loss, what is the optimal \(\theta\) that minimizes the following empirical risk?
\[ R(\theta) = \frac{1}{n} \sum_i (y_i - \theta x_{i})^2 \]
Answer
\[ \begin{align*} \frac{d}{d\theta} R(\theta) &= \frac{1}{n}\sum_{i=1}^{n} \frac{d}{d\theta}(y_{i} - \theta x_{i})^{2} \\ &= -\frac{2}{n}\sum_{i=1}^{n} x_{i}(y_{i} - \theta x_{i}) \end{align*} \]
Setting to \(0\) and solving for \(\theta\):
\[ \begin{align*} -\frac{2}{n}\sum_{i=1}^{n} x_{i}(y_{i} - \hat{\theta} x_{i}) &= 0 \\ \sum_{i=1}^{n} x_{i}(y_{i} - \hat{\theta} x_{i}) &= 0 \\ \sum_{i=1}^{n} x_{i}y_{i} - \hat{\theta} \sum_{i=1}^{n} x_{i}^{2} &= 0 \\ \hat{\theta} &= \frac{\sum_{i=1}^{n}x_{i}y_{i}}{\sum_{i=1}^{n}x_{i}^{2}} \end{align*} \]
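As an illustrative check (with simulated data, not the handout's), the closed-form slope \(\hat{\theta} = \sum_i x_i y_i / \sum_i x_i^2\) matches what a generic least squares solver returns for a one-column design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)            # hypothetical feature
y = 2.5 * x + rng.normal(0, 1, size=50)    # hypothetical responses, roughly through the origin

# Closed-form solution for regression through the origin.
theta_closed = (x * y).sum() / (x ** 2).sum()

# Same problem solved as least squares with a single-column design matrix.
theta_lstsq, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

print(theta_closed, theta_lstsq[0])        # the two estimates agree
```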
21.1.4 (d)
Lillian’s friend, Yash, also begins working on a model that predicts the degree of turning at a particular time, between 0 and 359 degrees, using the data in the degree_turn column. Explain why a constant model is likely inappropriate in this use case.
Extra: If you’ve studied some physics, you may recognize the behavior of our constant model!
Answer
Any constant model predicts the same turn angle at every time step, so the car would always be turning at that fixed angle, unable to adjust its direction or go straight (i.e., it would essentially drive in a circle forever).
21.1.5 (e)
Suppose we finally expand our modeling framework to use simple linear regression (i.e. \(f_\theta(x) = \theta_{w,0} + \theta_{w,1}x\)). For our first simple linear regression model, we predict the turn angle (\(y\)) using target speed (\(x\)). Our optimal parameters are: \(\hat{\theta}_{w,1} = 0.019\) and \(\hat{\theta}_{w,0} = 143.1\).
However, we realize that we actually want a model that predicts target speed (our new \(y\)) using turn angle, our new \(x\) (instead of the other way around)! What are our new optimal parameters for this new model?
Answer
To predict target speed (new \(y\)) from turn angle (new \(x\)) (what we finally want), we need to compute \(\hat{\theta}_1 = \frac{r\sigma_{\text{speed}}}{\sigma_{\text{turn}}}\).
When we predicted the turn angle from target speed (our first SLR model), we computed \(\hat{\theta}_{w,1} = \frac{r \sigma_{\text{turn}}}{\sigma_{\text{speed}}}\). To go from \(\hat{\theta}_{w,1}\) to \(\hat{\theta}_1\), we multiply by \(\frac{\sigma^2_{\text{speed}}}{\sigma^2_{\text{turn}}}\).
That is, \(\hat{\theta}_1 = \frac{r\sigma_{\text{speed}}}{\sigma_{\text{turn}}} = (\frac{r \sigma_{\text{turn}}}{\sigma_{\text{speed}}}) \frac{\sigma^2_{\text{speed}}}{\sigma^2_{\text{turn}}} = \hat{\theta}_{w,1} \frac{\sigma^2_{\text{speed}}}{\sigma^2_{\text{turn}}} = 0.019 \cdot \frac{46.678744^2}{153.641504^2} = 0.00175\).
Then, \(\hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x} = 32.92 - 0.00175 \cdot 143.72 = 32.67\).
\(\bar{y}\) and \(\bar{x}\) refer to the means of target speed (our new \(y\)) and turn angle (our new \(x\)) respectively.
Note: we can’t use the inverse function \(f_\theta^{-1}(x)\) since minimizing the sum of squared vertical residuals is not the inverse problem of minimizing the sum of squared horizontal residuals.
Have you ever wondered why we can’t just flip the regression line and use the inverse function \(f_\theta^{-1}(x)\)? The reason is that regression is directional.
In simple linear regression (SLR), we minimize the sum of squared vertical distances from the data points to the line. This is different from minimizing the horizontal distances. So switching \(x\) and \(y\) does not give you the inverse regression line.
For example:
- When predicting \(y\) from \(x\), the slope is
\[ m_1 = \frac{r \sigma_y}{\sigma_x} \]
- When predicting \(x\) from \(y\), the slope is
\[ m_2 = \frac{r \sigma_x}{\sigma_y} \]
These slopes are not inverses of each other; their product is \(r^2\), which equals 1 only when \(|r| = 1\).
Figures: the first shows how the line changes when swapping \(x\) and \(y\); the second shows vertical vs. horizontal residuals.
Key takeaway: Minimizing squared residuals in different directions gives different slopes. Always choose the regression direction based on what you want to predict.
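To make the directionality concrete (this sketch uses simulated data, not Lillian's dataset), the two fitted slopes satisfy \(m_1 = r\sigma_y/\sigma_x\) and \(m_2 = r\sigma_x/\sigma_y\), and their product is \(r^2\) rather than 1:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 2, size=200)                 # hypothetical "speed"
y = 0.5 * x + rng.normal(0, 1, size=200)       # hypothetical "turn angle"

r = np.corrcoef(x, y)[0, 1]
m1 = r * np.std(y) / np.std(x)   # slope when predicting y from x
m2 = r * np.std(x) / np.std(y)   # slope when predicting x from y

print("m1 =", m1, " m2 =", m2)
print("m1 * m2 =", m1 * m2, " r^2 =", r ** 2)   # equal, and generally not 1
```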
21.2 Geometry of Least Squares
Suppose we have a dataset represented with the design matrix \(\mathbb{X}\) and response vector \(\mathbb{Y}\). We use linear regression to obtain the optimal weights \(\hat{\theta}\). Label the following terms on the geometric interpretation of ordinary least squares:
- \(\mathbb{X}\) (i.e., \(\text{span}(\mathbb{X})\))
- The response vector \(\mathbb{Y}\)
- The residual vector \(\mathbb{Y} - \mathbb{X}\hat{\theta}\)
- The prediction vector \(\mathbb{X}\hat{\theta}\) (using optimal parameters)
- A prediction vector \(\mathbb{X}{\alpha}\) (using an arbitrary vector \(\alpha\))
Answer
In the standard picture, \(\text{span}(\mathbb{X})\) is drawn as a plane. The response vector \(\mathbb{Y}\) points out of that plane; the prediction vector \(\mathbb{X}\hat{\theta}\) is the orthogonal projection of \(\mathbb{Y}\) onto the plane; the residual vector \(\mathbb{Y} - \mathbb{X}\hat{\theta}\) connects \(\mathbb{X}\hat{\theta}\) to \(\mathbb{Y}\) and is perpendicular to the plane; and a prediction vector \(\mathbb{X}\alpha\) for an arbitrary \(\alpha\) is any other vector lying in the plane.

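As an optional numerical illustration (with a small, made-up design matrix), we can verify this geometric picture directly: \(\mathbb{X}\hat{\theta}\) is the projection of \(\mathbb{Y}\) onto \(\text{span}(\mathbb{X})\), the residual is orthogonal to every column of \(\mathbb{X}\), and any other \(\mathbb{X}\alpha\) is farther from \(\mathbb{Y}\):

```python
import numpy as np

# Hypothetical design matrix (with an intercept column) and response vector.
X = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [1.0, 4.0],
              [1.0, 3.0]])
Y = np.array([3.0, 2.0, 7.0, 5.0])

# Least squares estimate from the normal equation.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = X @ theta_hat          # prediction vector: lies in span(X)
resid = Y - Y_hat              # residual vector

# The residual is orthogonal to every column of X (up to floating point error).
print(X.T @ resid)             # ~ [0, 0]

# Any other prediction X @ alpha in span(X) is farther from Y than X @ theta_hat.
alpha = theta_hat + np.array([0.5, -0.2])
print(np.linalg.norm(Y - X @ alpha) > np.linalg.norm(resid))   # True
```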
21.3 More Geometry of Least Squares
Using the geometry of least squares, let’s answer a few questions about Ordinary Least Squares (OLS)!
21.3.1 (a)
Which of the following are true about the optimal solution \(\hat{\theta}\) to OLS? Recall that the least squares estimate \(\hat{\theta}\) solves the normal equation \((\Bbb{X}^T\Bbb{X})\theta = \Bbb{X}^T\Bbb{Y}\).
\[\hat{\theta} = (\Bbb{X}^T\Bbb{X})^{-1}\Bbb{X}^T\Bbb{Y}\]
Answer
We can derive solutions for both simple linear regression and the constant model with \(L_2\) loss since both can be written in the form \(y = x^T\theta\). Specifically, for SLR one of the two entries of \(x\) is 1 (and the other is the explanatory variable), while for the constant model \(x\) has a single entry equal to 1.
We cannot derive solutions for anything with the \(L_1\) loss since the normal equation optimizes for MSE.
Since option E is linear with respect to \(\theta\), we can use the normal equation. A good rule of thumb for determining whether a model is linear in \(\theta\) is to check whether the prediction can be written as a matrix product of a row of features and the parameter vector:
For option E, for example, we can write the prediction as:
\[ \hat{y} = \begin{bmatrix} x & \sin(x^2) \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \]
This is linear in the parameters \(\theta_1\) and \(\theta_2\), so we can use the normal equations to solve for the optimal parameters.
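Here is a brief sketch (with simulated data, and assuming option E has the form \(\theta_1 x + \theta_2 \sin(x^2)\) shown above) of why linearity in \(\theta\) is what matters: even with a nonlinear feature, the normal equation recovers the parameters once the design matrix is built:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 3, size=100)                       # hypothetical inputs
y = 1.5 * x - 2.0 * np.sin(x ** 2) + rng.normal(0, 0.1, size=100)

# Design matrix with columns x and sin(x^2); the model is linear in theta.
X = np.column_stack([x, np.sin(x ** 2)])

# Normal equation: (X^T X) theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)    # approximately [1.5, -2.0]
```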
Conversely, option F is not linear in \(\theta\), which means the normal equations cannot be applied in that case.
21.3.2 (b)
Which of the following conditions are required for the least squares estimate in the previous subpart?
Answer
\(\Bbb{X}\) must be full column rank in order for the normal equation to have a unique solution. If \(\Bbb{X}\) is not full column rank, then \(\Bbb{X}^T\Bbb{X}\) is not invertible, and the least squares estimate will not be unique. Note that neither \(\Bbb{X}\) nor \(\Bbb{X}^T\) needs to be invertible. Also, \(\Bbb{Y}\) is a vector, so the idea of it being “full column rank” doesn’t apply—it always has rank at most 1. Its specific values (e.g., being the zero vector) do not affect whether the least squares estimate exists or is unique.
You don’t need to know the full math proof showing why \(X\) being full column rank implies that \(X^TX\) is invertible — this was just mentioned in lecture.
If you’re curious, some bonus material covering this is included at the end of Lecture 12.
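A minimal sketch (with a deliberately redundant, made-up column) of why full column rank matters: when one column is a linear combination of the others, \(\mathbb{X}^T\mathbb{X}\) is singular and the normal equation no longer has a unique solution:

```python
import numpy as np

# Hypothetical design matrix whose third column is twice the second column.
X = np.array([[1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 5.0, 10.0],
              [1.0, 7.0, 14.0]])

print(np.linalg.matrix_rank(X))          # 2, not 3: X is not full column rank
print(np.linalg.det(X.T @ X))            # ~ 0: X^T X is singular

# np.linalg.solve would fail (or be numerically meaningless) here;
# np.linalg.lstsq still returns *a* minimizer, but it is not unique.
theta, *_ = np.linalg.lstsq(X, np.array([1.0, 2.0, 3.0, 4.0]), rcond=None)
print(theta)
```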
21.3.3 (c)
What is always true about the residuals in the least squares regression? Select all that apply.
Answer
C: C is wrong because the mean squared error is the mean of the sum of the squares of the residuals.
D: A counter-example is: \(\mathbb{X} = \begin{bmatrix} 2 & 3 \\ 1 & 5 \\ 2 & 4 \end{bmatrix}, \mathbb{Y} = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}\). After solving the least squares problem, the sum of the residuals is \(-\frac{2}{89} \approx -0.022\), which is not equal to zero. However, note that this statement is, in general, true if the design matrix contains a constant intercept column (a column of all 1s).
E: E is wrong since A and B are correct.
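As a quick check of the counterexample above (and of the intercept caveat), using the same \(\mathbb{X}\) and \(\mathbb{Y}\): without an intercept column the residuals need not sum to zero, while adding a column of 1s forces their sum to zero:

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [2.0, 4.0]])
Y = np.array([1.0, 3.0, 2.0])

# No intercept column: residuals need not sum to zero.
theta = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ theta
print(resid.sum())        # -2/89, approximately -0.022

# With an intercept column of 1s, the residuals sum to zero.
X1 = np.column_stack([np.ones(len(Y)), X])
theta1 = np.linalg.solve(X1.T @ X1, X1.T @ Y)
print((Y - X1 @ theta1).sum())   # ~ 0
```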
21.3.4 (d)
Which of the following are true about the predictions made by OLS? Select all that apply.
Answer
A is correct because the OLS predictions are the orthogonal projection of \(\mathbb{Y}\) onto the column space of \(\mathbb{X}\). This fact also makes C correct, E incorrect, and D incorrect.
B is correct based on the definition of OLS.
21.3.5 (e) (Extra)
We fit a simple linear regression to our data \((x_i, y_i)\) for \(i \in \{1, 2, \dots, n\}\), where \(n\) is the number of samples, \(x_i\) is the independent variable, and \(y_i\) is the dependent variable. Our regression line is of the form \(\hat{y} = \hat{\theta_0} + \hat{\theta_1}x\). Suppose we plot the relationship between the residuals of the model and the \(\hat{y}_i\)’s and find that there is a curve. What does this tell us about our model?
Answer
A curved pattern in the plot of residuals against the fitted values \(\hat{y}_i\) tells us that the relationship between \(x\) and \(y\) is not well captured by a straight line: the model systematically over-predicts in some ranges of \(\hat{y}\) and under-predicts in others. We should consider transforming the variables or adding features (e.g., a quadratic term) so the model can capture the nonlinearity.
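Here is a small illustration (simulated, hypothetical data) of this diagnostic: fitting a line to data that is actually quadratic and plotting residuals against fitted values produces a clear curve:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 0.5 * x + 0.8 * x ** 2 + rng.normal(0, 0.3, size=200)  # truly quadratic

# Fit a simple linear regression y ~ theta0 + theta1 * x.
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ theta
resid = y - y_hat

# The residuals bend in a U shape instead of scattering randomly around 0.
plt.scatter(y_hat, resid, s=10)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```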
21.3.6 (f)
Which of the following is true of the mystery quantity \(\vec{v} = (I - \Bbb{X}(\Bbb{X}^T\Bbb{X})^{-1}\Bbb{X}^T) \Bbb{Y}\)?
Answer
A is incorrect because not every linear model produces the residual vector \(\vec{v}\); only the optimal (least squares) linear model does.
D is incorrect because the vector \(v\) is of size \(n\) since there are \(n\) data points.
The rest are correct by properties of orthogonality as given by the geometry of least squares.
We won’t show the proof that the average of residuals is 0 — that’s something you’ll work on in the homework!
21.3.7 (g)
Derive the least squares estimate \(\hat{\theta}\) by leveraging the geometry of least squares.
Note: While this isn’t a “proof” or “derivation” class (and you certainly will not be asked to derive anything of this sort on an exam), we believe that understanding the geometry of least squares enough to derive the least squares estimate shows great understanding of all the linear regression concepts we want you to know! Additionally, it provides great practice with tricky linear algebra concepts such as rank, span, orthogonality, etc.
Answer
We know that the best estimate of \(\mathbb{Y}\) is such that everything we cannot represent or reach (i.e., the residual) is orthogonal to everything we can represent or reach (i.e., the span of \(\mathbb{X}\)). Mathematically, we then know that every column vector of \(\mathbb{X}\) is orthogonal to \(\mathbb{Y} - \hat{\mathbb{Y}}\).
We know that when two vectors \(u, v\) are orthogonal, \(u^Tv = 0\). Using this fact, for any column vector \(x_i\) of \(\mathbb{X}\), we have \(x_i^T(\mathbb{Y} - \hat{\mathbb{Y}}) = 0\).
When we write this out for all of the column vectors, the resulting stack is just the zero vector! (Note that \(x_1\) could be a column of all 1s, in which case the corresponding entry is \(\mathbb{1}^T (\mathbb{Y} - \hat{\mathbb{Y}})\).)
\[\begin{bmatrix} x_1^T(\mathbb{Y} - \hat{\mathbb{Y}}) \\ x_2^T(\mathbb{Y} - \hat{\mathbb{Y}}) \\ \vdots \\ x_p^T(\mathbb{Y} - \hat{\mathbb{Y}}) \end{bmatrix} = \mathbb{X}^T(\mathbb{Y} - \hat{\mathbb{Y}}) = 0\]
From here, all that is left is algebra! Recall from our linear model that \(\hat{\mathbb{Y}} = \mathbb{X} \hat{\theta}\).
\[ \mathbb{X}^T\mathbb{Y} = \mathbb{X}^T\hat{\mathbb{Y}} = \mathbb{X}^T\mathbb{X} \hat{\theta} \]
We know for a fact that \(\mathbb{X}^T\mathbb{X}\) has to be square, but it may or may not be invertible depending on whether \(\mathbb{X}\) is full column rank (see part (b) of the previous question). In this case, assuming it is invertible:
\[ \hat{\theta} = ( \mathbb{X}^T\mathbb{X})^{-1} \mathbb{X}^T\mathbb{Y} \]
21.4 Modeling using Multiple Regression (Extra)
Ishani wants to model exam grades for DS100 students. She collects various information about student habits, such as how many hours they studied, how many hours they slept before the exam, and how many lectures they attended and observes how well they did on the exam. Suppose she collected such information on \(n\) students, and wishes to use a multiple-regression model to predict exam grades.
21.4.1 (a)
Using the data from the \(n\) individuals, she constructs a design matrix \(\mathbb{X}\) and uses the OLS formula to obtain the estimated parameter vector:
\[ \hat{\theta} = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} \]
The design matrix \(\mathbb{X}\) was constructed so that:
- The first column represents how many hours each student studied.
- The second column represents how many hours each student slept before the exam.
- The third column represents how many lectures each student attended.
With this in mind, interpret each entry of \(\hat{\theta}\) in context. For example:
- \(\hat{\theta}_1 = 3\) means that, holding sleep and lectures constant, each additional hour of study is associated with an expected increase of 3 points on the exam.
- \(\hat{\theta}_2 = 2\) means that, holding study hours and lectures constant, each additional hour of sleep is associated with an expected increase of 2 points.
- \(\hat{\theta}_3 = 1\) means that, holding study hours and sleep constant, attending one more lecture is associated with an expected increase of 1 point.
Answer
Each regression coefficient (component of the vector) represents the amount we expect an individual’s exam score to go up when increasing the corresponding variable by one unit and holding all other variables fixed. For instance, for the first component, when holding ‘hours of sleep’ and ‘lectures attended’ constant, an individual’s score is expected to go up by 3 per extra hour spent studying. The other components can be interpreted similarly.
It can help to write out the fitted models and explicitly see what happens when you increase one covariate by 1 while holding all other covariates fixed. This makes it easier to interpret each parameter in context.
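For instance, writing \(s\), \(h\), and \(\ell\) for hours studied, hours slept, and lectures attended (our notation, not the worksheet's), increasing study time by one hour while holding the other covariates fixed changes the prediction by exactly 3:
\[ \hat{y}(s+1, h, \ell) - \hat{y}(s, h, \ell) = \big(3(s+1) + 2h + 1\ell\big) - \big(3s + 2h + 1\ell\big) = 3 \]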
21.4.2 (b)
After fitting this model, we would like to predict the exam grades for two individuals using these variables. Suppose for Individual 1, they slept 10 hours, studied 15 hours, and attended 4 lectures. Suppose also for Individual 2, they slept 5 hours, studied 20 hours, and attended 10 lectures. Construct a matrix \(\mathbb{X}'\) such that, if you computed \(\mathbb{X}'\hat{\theta}\), you would obtain a vector of each individual’s predicted exam scores.
Answer
\[ \mathbb{X}' = \begin{bmatrix} 15 & 10 & 4 \\ 20 & 5 & 10 \end{bmatrix} \]
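As a quick check of this answer (a short sketch using only the numbers given in the problem), multiplying \(\mathbb{X}'\) by \(\hat{\theta}\) yields each individual's predicted exam score:

```python
import numpy as np

theta_hat = np.array([3.0, 2.0, 1.0])   # [study, sleep, lectures] coefficients

# Rows follow the column order of the design matrix: studied, slept, lectures attended.
X_prime = np.array([[15.0, 10.0, 4.0],
                    [20.0,  5.0, 10.0]])

print(X_prime @ theta_hat)   # [69., 80.]: predicted exam scores
```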
21.4.3 (c)
Denote \(y'\) as a \(2 \times 1\) vector that represents the actual exam scores of the individuals Ishani is predicting on. Write out an expression that evaluates to give the Mean Squared Error (MSE) of our predictions using matrix notation.