22 Discussion 07: Gradient Descent (From Summer 2025)
Slides
22.1 Dive into Gradient Descent
We want to minimize the loss function \(L(\theta) = (\theta_1-1)^2 + |\theta_2-3|\). While you may notice that this function is not differentiable everywhere, we can still use gradient descent wherever the function is differentiable!
Recall that for a function \(f(x) = k|x|\), \(\frac{df}{dx} = k\) for all \(x > 0\) and \(\frac{df}{dx} = -k\) for all \(x < 0\).
22.1.1 (a)
What are the optimal values \(\hat{\theta}_1\) and \(\hat{\theta}_2\) to minimize \(L(\theta)\)? What is the gradient at those values \(\nabla L = \begin{bmatrix} \frac{\partial L}{\partial \theta_1} & \frac{\partial L}{\partial \theta_2} \end{bmatrix}^T \Bigr|_{\substack{\theta_1 = \hat{\theta}_1, \theta_2 = \hat{\theta}_2}}\)?
Answer
To find \(\frac{\partial L}{\partial \theta_1}\):
\[ \frac{\partial L}{\partial \theta_1} = 2(\theta_1 - 1) \]
To find the value of \(\hat{\theta}_1\) that minimizes \(L(\theta)\), we set the derivative to 0:
\[ \hat{\theta}_1 = 1 \]
To find \(\frac{\partial L}{\partial \theta_2}\):
\[ \frac{\partial L}{\partial \theta_2} = \frac{\partial |\theta_2 - 3|}{\partial \theta_2} = \begin{cases} 1 & \text{if } \theta_2 > 3 \\ -1 & \text{if } \theta_2 < 3 \\ \text{undefined} & \text{if } \theta_2 = 3 \end{cases} \]
Even though \(\frac{\partial L}{\partial \theta_2}\) is undefined at \(\theta_2 = 3\), we can see by inspection that the absolute loss \(|\theta_2 - 3|\) cannot be smaller than 0. Therefore, the minimizing value is:
\[ \hat{\theta}_2 = 3 \]
The gradient at the optimal values is:
\[ \nabla L = \begin{bmatrix} \frac{\partial L}{\partial \theta_1} \\[2mm] \frac{\partial L}{\partial \theta_2} \end{bmatrix}_{\theta_1 = \hat{\theta}_1, \theta_2 = \hat{\theta}_2} = \begin{bmatrix} 0 \\ \text{undefined} \end{bmatrix} \]
Note: Setting the derivative to 0 only guarantees a minimum if the function is convex. Both \((\theta_1 - 1)^2\) and \(|\theta_2 - 3|\) are convex. From Homework 5, we know that the sum of convex functions is also convex, so \(L(\theta)\) is convex and we can use its derivative to find the minimum.
Another way to check convexity is by performing the second derivative test, i.e., showing that \(L''(\theta) \geq 0\).
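As a quick numerical sanity check (not part of the original solution), a coarse grid search in Python confirms that the loss is minimized at \((\hat{\theta}_1, \hat{\theta}_2) = (1, 3)\); the grid range and resolution below are arbitrary choices.

```python
import numpy as np

# Evaluate L(theta) = (theta_1 - 1)^2 + |theta_2 - 3| on a coarse grid and
# locate the grid point with the smallest loss. The grid bounds and step size
# are arbitrary; they just need to contain the true minimizer (1, 3).
def loss(theta1, theta2):
    return (theta1 - 1) ** 2 + np.abs(theta2 - 3)

theta1_grid, theta2_grid = np.meshgrid(np.linspace(-5, 5, 201), np.linspace(-5, 5, 201))
values = loss(theta1_grid, theta2_grid)
best = np.unravel_index(np.argmin(values), values.shape)
print(theta1_grid[best], theta2_grid[best])  # approximately 1.0 3.0
```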
22.1.2 (b)
Suppose we initialize our gradient descent algorithm randomly at \(\theta_1 = 2\) and \(\theta_2 = 5\). Calculate the gradient \(\nabla L = \begin{bmatrix} \frac{\partial L}{\partial \theta_1} & \frac{\partial L}{\partial \theta_2} \end{bmatrix}^T \Bigr|_{\substack{\theta_1 = 2, \theta_2 = 5}}\) at the specified \(\theta_1\) and \(\theta_2\) values.
Answer
Plugging \(\theta_1 = 2\) and \(\theta_2 = 5\) into the equations derived above (the second partial derivative is 1 since \(\theta_2 = 5 > 3\)), we get: \[\begin{bmatrix} \frac{\partial L}{\partial \theta_1} & \frac{\partial L}{\partial \theta_2} \end{bmatrix}^T = \begin{bmatrix} 2(\theta_1-1)& 1 \end{bmatrix}^T = \begin{bmatrix} 2& 1 \end{bmatrix}^T\]
22.1.3 (c)
Apply the first gradient update with a learning rate \(\alpha = 0.5\). In other words, calculate \(\theta_1^{(1)}\) and \(\theta_2^{(1)}\) using the initializations \(\theta_1^{(0)} = 2\) and \(\theta_2^{(0)} = 5\).
Answer
Applying the gradient step:
\[\theta^{(1)} = \theta^{(0)} - \alpha \nabla L = \begin{bmatrix} 2 - 0.5(2) \\ 5 - 0.5(1) \end{bmatrix} = \begin{bmatrix} 1 \\ 4.5 \end{bmatrix}\]
22.1.4 (d)
How many gradient descent steps does it take for \(\theta_1\) and \(\theta_2\) to converge to their optimal values obtained in part (a), assuming we keep a constant learning rate of \(\alpha = 0.5\)? In other words, what is the value of \(t\) when \(\theta^{(t)}=\begin{bmatrix} \hat{\theta}_1& \hat{\theta}_2 \end{bmatrix}^T\)?
Hint: After part (c), what is the derivative \(\frac{\partial L}{\partial \theta_1}\) evaluated at \(\theta_1^{(1)}\)?
Answer
Note that the derivative with respect to \(\theta_1\) is 0 at \(\theta_1^{(1)} = 1\), since \(\theta_1\) has already reached its optimal value. From this point on, we effectively only update \(\theta_2\), whose partial derivative is always 1 until we reach the optimal value (where the derivative is undefined). Since the partial derivative with respect to \(\theta_2\) is 1 at every step, the update is simply:
\[\theta_2^{(i+1)} = \theta_2^{(i)} - 0.5\]
Hence, to update this from 5 to 3, we must take 4 gradient steps (i.e. from 5 to 4.5, 4.5 to 4, 4 to 3.5, 3.5 to 3).
Writing this all out:
\[\theta^{(2)} = \theta^{(1)} - \alpha \nabla L = \begin{bmatrix} 1 - 0.5(0) \\ 4.5 - 0.5(1) \end{bmatrix} = \begin{bmatrix} 1 \\ 4 \end{bmatrix}\]
\[\theta^{(3)} = \theta^{(2)} - \alpha \nabla L = \begin{bmatrix} 1 - 0.5(0) \\ 4 - 0.5(1) \end{bmatrix} = \begin{bmatrix} 1 \\ 3.5 \end{bmatrix}\]
\[\theta^{(4)} = \theta^{(3)} - \alpha \nabla L = \begin{bmatrix} 1 - 0.5(0) \\ 3.5 - 0.5(1) \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \end{bmatrix}\]
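For reference, here is a minimal Python sketch (not part of the original solution) that reproduces these four updates; the stopping condition below is a simplification that works for this particular problem.

```python
import numpy as np

# Gradient of L(theta) = (theta_1 - 1)^2 + |theta_2 - 3|, using the piecewise
# derivative of the absolute value (undefined exactly at theta_2 = 3, which the
# loop below never evaluates because it stops once theta reaches (1, 3)).
def gradient(theta):
    d_theta1 = 2 * (theta[0] - 1)
    d_theta2 = 1.0 if theta[1] > 3 else -1.0
    return np.array([d_theta1, d_theta2])

alpha = 0.5                   # constant learning rate
theta = np.array([2.0, 5.0])  # initialization theta^(0)
steps = 0
while not np.allclose(theta, [1.0, 3.0]):
    theta = theta - alpha * gradient(theta)
    steps += 1
    print(f"theta^({steps}) = {theta}")
print("converged in", steps, "steps")  # converged in 4 steps
```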
Notice that every time, we reduce \(\theta_2\) by 0.5 as expected, so the number of gradient descent steps is 4.
22.2 One-hot Encoding
In order to include a qualitative variable in a model, we convert it into a collection of Boolean vectors that only contain the values 0 and 1. For example, suppose we have a qualitative variable with 3 possible values, call them \(A\), \(B\), and \(C\). For concreteness, we use a specific example with 10 observations: \[ [A, A, A, A, B, B, B, C, C, C] \] We can represent this qualitative variable with 3 Boolean vectors that take on the value \(1\) or \(0\) depending on the value of the qualitative variable. Specifically, these 3 Boolean vectors, \(\vec{x}_A\), \(\vec{x}_B\), and \(\vec{x}_C\), are arranged from left to right in the design matrix below, with entries given by the following indicator variable:
\[x_{i,k} =\begin{cases} 1 &\text{if $i$-th observation has value $k$}\\ 0 &\text{otherwise}. \end{cases}\]
Furthermore, let \(\vec{y}\) represent any vector of outcome variables, and \(y_i\) is the value of said outcome for the \(i\)-th subject. This representation is also called one-hot encoding. It should be noted here that \(\vec{x_A}\), \(\vec{x_B}\), \(\vec{x_C}\), and \(\vec{y}\) are all vectors. \[ \Bbb{X} = \begin{bmatrix} \vert & \vert & \vert \\ \vec{x_A} & \vec{x_B} & \vec{x_C} \\ \vert & \vert & \vert \\ \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0\\ 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 1 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 0 & 0 & 1\\ 0 & 0 & 1\\ \end{bmatrix} \] We will show that the fitted coefficients for \(\vec{x_A}\), \(\vec{x_B}\), and \(\vec{x_C}\) are \(\bar{y}_A\), \(\bar{y}_B\), and \(\bar{y}_C\), the average of the \(y_i\) values for each of the groups, respectively.
For this question, it can be helpful to think of a concrete example of what \(\vec{y}\) could represent.
For instance, outcome variables might be height, weight, or other measurable quantities.
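As a small illustration of this encoding (assuming the same 10 observations as above; the use of pandas.get_dummies is just one convenient choice), the design matrix can be built as follows:

```python
import pandas as pd

# One-hot encode the 10 observations [A, A, A, A, B, B, B, C, C, C].
# Each resulting column is one of the Boolean vectors x_A, x_B, x_C.
categories = pd.Series(["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"])
X = pd.get_dummies(categories).astype(int)
print(X.to_numpy())  # matches the design matrix shown above: one 1 per row
```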
22.2.1 (a)
Show that if you augment your \(\mathbb{X}\) matrix with an additional \(\vec{1}\) bias vector as shown below, \(\mathbb{X}^T\mathbb{X}\) is not full rank. Conclude that the new \(\mathbb{X}^T\mathbb{X}\) is not invertible, and we cannot use the least squares estimate in this situation.
\[ \Bbb{X} = \begin{bmatrix} \vert & \vert & \vert & \vert \\ \vec{1} & \vec{x_A} & \vec{x_B} & \vec{x_C} \\ \vert & \vert & \vert & \vert \\ \end{bmatrix} \]
Answer
Solution 1:
By the definition of one-hot encoding, the one-hot-encoded columns of \(\mathbb{X}\) sum to the all-ones vector \(\vec{1}\):
\[ \vec{x_A} + \vec{x_B} + \vec{x_C} = \vec{1}\]
Since the leftmost vector of \(\mathbb{X}\) is a linear combination of the other vectors, \(\mathbb{X}\) is not full column rank. Since \(\mathbb{X}\) has the same rank as \(\mathbb{X}^T\mathbb{X}\) (proof is out of scope), \(\mathbb{X}^T\mathbb{X}\) is not invertible.
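A quick numerical check of Solution 1 (a sketch for this specific 10-observation example, not a general proof): after prepending the all-ones bias column, the augmented matrix has rank 3 rather than 4, so the \(4 \times 4\) matrix \(\mathbb{X}^T\mathbb{X}\) is singular.

```python
import numpy as np

# Build the 10x3 one-hot design matrix for [A, A, A, A, B, B, B, C, C, C],
# then prepend the all-ones bias column.
X = np.repeat(np.eye(3, dtype=int), [4, 3, 3], axis=0)
X_aug = np.column_stack([np.ones(10, dtype=int), X])

print(np.linalg.matrix_rank(X_aug))            # 3 (not full column rank of 4)
print(np.linalg.matrix_rank(X_aug.T @ X_aug))  # 3, so the 4x4 X^T X is singular
```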
Solution 2:
We can show that \(\mathbb{X}^T\mathbb{X}\) is equal to the following.
\[ \mathbb{X}^T\mathbb{X} = \begin{bmatrix} n & n_A & n_B & n_C \\ n_A & n_A & 0 & 0 \\ n_B & 0 & n_B & 0 \\ n_C & 0 & 0 & n_C \end{bmatrix} \]
It can be observed that since \(n_A + n_B + n_C = n\), subtracting the sum of the final 3 columns from the first column yields the zero vector \(\vec{0}\). By the definition of linear dependence, we can conclude that this matrix is not full rank and hence not invertible. As a result, we cannot compute our least squares estimate, since it requires inverting \(\mathbb{X}^T\mathbb{X}\).
22.2.2 (b)
Show that the columns of \(\Bbb{X}\) (without the additional \(\vec{1}\) bias vector) are orthogonal (i.e., the dot product between any pair of distinct column vectors is 0).
Answer
The argument is the same for any pair of \(\mathbb{X}\)’s columns, so we show the orthogonality for one pair, \(\vec{x}_A \cdot \vec{x}_B\):
\[ \begin{aligned} \vec{x}_A \cdot \vec{x}_B &= \sum_{i=1}^{10} x_{i,A} \, x_{i,B} \\ &= \sum_{i=1}^4 (1 \times 0) + \sum_{i=5}^7 (0 \times 1) + \sum_{i=8}^{10} (0 \times 0) \\ &= 0 \end{aligned} \]
22.2.3 (c)
Show that \[ \Bbb{X}^T\Bbb{X} = \begin{bmatrix} n_A & 0 & 0 \\ 0 & n_B & 0 \\ 0 & 0& n_C \\ \end{bmatrix} \] Here, \(n_A\), \(n_B\), and \(n_C\) are the number of observations in each of the three groups defined by the levels of the qualitative variable.
Answer
Here, we note that \[ \Bbb{X}^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1\\ \end{bmatrix} \] We also note that \[ \Bbb{X}^T\Bbb{X} = \begin{bmatrix} \vec{x_A}^T\vec{x_A} & \vec{x_A}^T\vec{x_B} & \vec{x_A}^T\vec{x_C} \\ \vec{x_B}^T\vec{x_A} & \vec{x_B}^T\vec{x_B} & \vec{x_B}^T\vec{x_C} \\ \vec{x_C}^T\vec{x_A} & \vec{x_C}^T\vec{x_B} & \vec{x_C}^T\vec{x_C} \\ \end{bmatrix} \] Since we earlier established the orthogonality of the vectors in \(\Bbb{X}\), we find \(\Bbb{X}^T\Bbb{X}\) to be the diagonal matrix: \[ \Bbb{X}^T\Bbb{X} = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 3 \\ \end{bmatrix} = \begin{bmatrix} n_A & 0 & 0 \\ 0 & n_B & 0 \\ 0 & 0 & n_C \\ \end{bmatrix} \]
22.2.4 (d)
Show that \[ \Bbb{X}^T\mathbb{Y} = \begin{bmatrix} \sum_{i \in A} y_i\\ \sum_{i \in B} y_i\\ \sum_{i \in C} y_i\\ \end{bmatrix} \] where \(i\) is an element in group \(A, B,\) or \(C\).
Answer
Note that in the previous solution we found \(\Bbb{X}^T\). The solution follows from recognizing that for a row of \(\Bbb{X}^T\), e.g., the first row, we have \[ \sum_{i=1}^{10} x_{i,A} \, y_i = \sum_{i=1}^4 y_i = \sum_{i \in \text{group } A} y_i \]
Continuing from the example above, we can plug in the outcome vector:
\[ \mathbb{Y} = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \end{bmatrix}^T \]
into the equation to compute the sum per group:
\[ \mathbb{X}^T\mathbb{Y} = \begin{bmatrix} 10 & 18 & 27 \end{bmatrix}^T \]
This shows how the values in \(\mathbb{Y}\) are aggregated for each group.
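A short numerical check of this example (illustrative only; the design matrix construction assumes the same grouping as above):

```python
import numpy as np

# One-hot design matrix for [A, A, A, A, B, B, B, C, C, C] and the example
# outcome vector Y = [1, 2, ..., 10].
X = np.repeat(np.eye(3, dtype=int), [4, 3, 3], axis=0)
Y = np.arange(1, 11)

print(X.T @ Y)  # [10 18 27]: the sums of Y within groups A, B, and C
```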
22.2.5 (e)
Use the results from the previous questions to solve the normal equations for \(\hat{\theta}\):
\[ \hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1} \mathbb{X}^T \mathbb{Y} = \begin{bmatrix} \bar{y}_A \\ \bar{y}_B \\ \bar{y}_C \end{bmatrix} \]
Answer
By inspection, we can find \[ [\Bbb{X}^T\Bbb{X}]^{-1} = \begin{bmatrix} \frac{1}{n_A} & 0 & 0 \\ 0 & \frac{1}{n_B} & 0 \\ 0 & 0 & \frac{1}{n_C}\\ \end{bmatrix} \] When we pre-multiply \(\Bbb{X}^T \mathbb{Y}\) by this matrix, we get \[ \begin{bmatrix} \frac{1}{n_A} & 0 & 0 \\ 0 & \frac{1}{n_B} & 0 \\ 0 & 0 & \frac{1}{n_C}\\ \end{bmatrix} \begin{bmatrix} \sum_{i \in A} y_i\\ \sum_{i \in B} y_i\\ \sum_{i \in C} y_i\\ \end{bmatrix} = \begin{bmatrix} \bar{y}_A\\ \bar{y}_B\\ \bar{y}_C\\ \end{bmatrix} \]
Continuing from the example above, we can plug in the outcome vector:
\[ \mathbb{Y} = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \end{bmatrix}^T \]
to compute the average per group:
\[ \hat{\theta} = \begin{bmatrix} 2.5 & 6 & 9 \end{bmatrix}^T \]
This shows how the values in \(\mathbb{Y}\) are averaged for each group.
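Continuing the same example, here is a brief numerical check (illustrative only) that solving the normal equations returns exactly the group means:

```python
import numpy as np

# One-hot design matrix and example outcome vector from above.
X = np.repeat(np.eye(3, dtype=int), [4, 3, 3], axis=0)
Y = np.arange(1, 11)

# Solve the normal equations (X^T X) theta_hat = X^T Y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_hat)                                    # [2.5 6.  9. ]
print([Y[:4].mean(), Y[4:7].mean(), Y[7:].mean()])  # the same group means
```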
22.3 Human Contexts and Ethics: Case Study
Which of the following scenarios strike you as unfair and why? You can choose more than one. There is no single right answer, but you must explain your reasoning.
- A homeowner whose home is assessed at a higher price than it would sell for.
- A homeowner whose home is assessed at a lower price than it would sell for.
- An assessment process that systematically overvalues inexpensive properties and undervalues expensive properties.
- An assessment process that systematically undervalues inexpensive properties and overvalues expensive properties.
Answer
Your findings from this discussion can be used for Project A2, so we won’t release solutions to this question.
Here are some discussion points:
- What is the definition of “fairness” in this context?
- Which individual or group of people are considered when we evaluate fairness?
- Are we talking about whether each situation above is fair to an individual homeowner, a subgroup of society, or the whole society?
- At the level of the individual homeowner, any error can feel unfair: why am I being over- or under-taxed?
- At the level of society, is it fair to have a general over- or underestimation that varies with the value of the home (a regressive or progressive tax scheme)?
Other important things to note:
- Some may feel that both regressive and progressive schemes are unfair, i.e., that any systematic errors should not depend at all on the value of the property.
- It is possible to have a tax scheme that is systematically fair (i.e., neither regressive nor progressive) but is still quite inaccurate and still generates errors for individual homeowners. However, a tax scheme that generates no errors for homeowners will also have no systematic unfairness.
- Unfairness of the second type (scenarios C and D) is a form of statistical bias that varies as a function of \(y\) (the response variable).
Work in pairs, with each pair assigned to one of the four scenarios. After exploring your scenario, we’ll come back together for a class discussion.
This activity is designed to help you reflect on your own ideas of fairness. Ask yourself: why do I think something is fair or unfair? What criteria am I using to make that judgment?
It’s also a chance to check your understanding of regressive versus progressive taxation.
22.4 More Human Context and Ethics: Case Study
Suppose you created a model to predict property values (as you are doing presently!) and wanted to understand whether your model’s predictions align more with Scenario C or D from the last question. You decide to break down your data into different intervals depending on the true Log Sale Price and compute the Root Mean Squared Error (RMSE) and percentage of property values overestimated for each interval. Using this information, you create the plots below:
Which plot would be more useful in determining whether the model’s predicted property values align more with Scenario C or D? Provide a brief justification for your answer.
Answer
Your findings from this discussion can be used for Project A2, so we won’t release solutions to this question. A good starting point to consider when answering the question: How does the sign of the residual relate to whether a property is overvalued or undervalued?
More broadly, when answering the written questions at the end of A2, you may also consider the following questions: Does a low RMSE necessarily mean a model’s predictions are more accurate? Does a low RMSE necessarily mean a model’s predictions are more “fair”? What is the difference between your answers to both questions?
The most helpful plot for comparing Scenario C and D is the second plot, which shows how the percentage of houses overestimated varies across different intervals.
- RMSE, by design, does not account for whether predictions are over or under the true value. So it cannot answer questions about specific overvaluation or undervaluation.
- For example, even if the RMSE plot suggests the model is more accurate for expensive houses (Log Sale Price = 12 ≈ $160,000), the overestimation plot tells a different story. Overestimating expensive houses leads to higher property taxes, which aligns more with Scenario D.
Important: These example graphs do not represent Cook County—they are only for practice. In Project A2, your graphs will be similar in structure (axes, labels, intervals), but distributions will likely differ. Be prepared to interpret them accordingly.
Follow-up discussion question: What is the relationship between RMSE, accuracy, and fairness?
Key takeaway: Accuracy and fairness are not the same. While RMSE measures accuracy, fairness is a more complex concept. A fair model may not just minimize loss—it may also consider bias, historical context, and the broader impact of predictions.
For further exploration of these ideas, consider looking into Data 104.