16  Discussion 01: Prerequisites (From Summer 2025)

Slides

Review of Prerequisite Material

This discussion covers material taught in Data 100’s prerequisite and co-requisite courses. For further review, refer to the following supplemental resources; additional resources can also be found on the course website under “Resources”.

16.1 Linear Algebra Fundamentals

Linear algebra powers linear regression, logistic regression, and PCA, all concepts we will study in this course. This question aims to build an understanding of how matrix-vector operations work.

Consider yourself starstruck: it’s 2016, and you just spotted the first family of music, Beyoncé, husband Jay-Z, and their four-year-old daughter Blue, shopping for fruit bowls at Berkeley Bowl. Each bowl contains some fruit, and the price of a fruit bowl is simply the total price of all of its individual fruit.

Berkeley Bowl has apples for $2, bananas for $1, and cantaloupes for $4 (expensive!). The price of each of these can be written in a vector:

\[ \vec{v} = \begin{bmatrix} 2 \\ 1 \\ 4 \\ \end{bmatrix} \]

Berkeley Bowl sells the following fruit bowls:

  1. 2 of each fruit
  2. 5 apples and 8 bananas
  3. 2 bananas and 3 cantaloupes
  4. 10 cantaloupes

16.1.1 (a)

Define a matrix \(B\) such that \(B\vec{v}\) evaluates to a length 4 column vector containing the price of each fruit bowl. The first entry of the result should be the cost of fruit bowl #1, the second entry the cost of fruit bowl #2, etc.

Answer \[ B = \begin{bmatrix} 2 & 2 & 2 \\ 5 & 8 & 0 \\ 0 & 2 & 3 \\ 0 & 0 & 10 \end{bmatrix} \]
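As a quick sanity check, we can carry out this matrix-vector product with NumPy (a sketch; the variable names are ours):

```python
import numpy as np

# Prices of an apple, banana, and cantaloupe
v = np.array([2, 1, 4])

# Each row of B lists the fruit counts in one bowl
B = np.array([
    [2, 2, 2],    # bowl 1: 2 of each fruit
    [5, 8, 0],    # bowl 2: 5 apples, 8 bananas
    [0, 2, 3],    # bowl 3: 2 bananas, 3 cantaloupes
    [0, 0, 10],   # bowl 4: 10 cantaloupes
])

# B @ v gives the price of each bowl
print(B @ v)  # [14 18 14 40]
```

Each entry of \(B\vec{v}\) is the dot product of one row of \(B\) (the fruit counts in that bowl) with the price vector.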

16.1.2 (b)

Beyoncé, Jay-Z, and Blue make the following purchases:

  • Beyoncé buys 2 fruit bowl #1s and 1 fruit bowl #2.
  • Jay-Z buys 1 of each fruit bowl.
  • Blue buys 10 fruit bowl #4s.

Define a matrix \(A\) such that the matrix expression \(AB\vec{v}\) evaluates to a length 3 column vector containing how much each of them spent. The first entry of the result should be the total amount spent by Beyoncé, the second entry the amount spent by Jay-Z, etc.

Answer \[ A = \begin{bmatrix} 2 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 10 \end{bmatrix} \]
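Continuing the sketch from part (a), we can chain the two matrix products to get each person’s total spending:

```python
import numpy as np

v = np.array([2, 1, 4])
B = np.array([[2, 2, 2], [5, 8, 0], [0, 2, 3], [0, 0, 10]])

# Each row of A lists how many of each bowl a person bought
A = np.array([
    [2, 1, 0, 0],   # Beyoncé: 2 of bowl 1, 1 of bowl 2
    [1, 1, 1, 1],   # Jay-Z: 1 of each bowl
    [0, 0, 0, 10],  # Blue: 10 of bowl 4
])

# A @ (B @ v): bowl prices first, then per-person totals
print(A @ B @ v)  # [ 46  86 400]
```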

16.1.3 (c)

Let’s suppose Berkeley Bowl changes their fruit prices, but you don’t know what they changed their prices to. Beyoncé, Jay-Z, and Blue buy the same quantity of fruit bowls and the number of fruit in each bowl is the same, but now they each spent these amounts:

\[ \vec{x} = \begin{bmatrix} 80 \\ 80 \\ 100 \\ \end{bmatrix} \]

Let \(\vec{v_2}\) be a vector containing the new prices of each fruit. Express \(\vec{v_2}\) in terms of \(A\), \(B\), and \(\vec{x}\).

Answer

We know that \(\vec{x} = AB\vec{v_2}\) from the previous part. To solve for \(\vec{v_2}\) we need to left-multiply both sides of the above equation by \((AB)^{-1}\). Doing so yields \(\vec{v_2} = (AB)^{-1} \vec{x}\).

This assumes that the product \(AB\) is invertible. We will explore in the next part why it is invertible.
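To recover the new prices numerically, we can use `np.linalg.solve` rather than forming \((AB)^{-1}\) explicitly; solving the linear system directly is faster and more numerically stable than computing an inverse. A sketch:

```python
import numpy as np

A = np.array([[2, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 10]])
B = np.array([[2, 2, 2], [5, 8, 0], [0, 2, 3], [0, 0, 10]])
x = np.array([80, 80, 100])

# Solve (AB) v2 = x for the new price vector v2
v2 = np.linalg.solve(A @ B, x)
print(v2.round(4))  # roughly $5.50 (apples), $2.21 (bananas), $1.00 (cantaloupes)

# Sanity check: the recovered prices reproduce the spending totals
assert np.allclose(A @ B @ v2, x)
```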

16.1.4 (d)

In the previous part, we assumed that \(AB\) (the matrix multiplication of \(A\) and \(B\)) was invertible. Why is \(AB\) (as calculated below) invertible? State two conditions for an arbitrary matrix to be invertible.

\[ AB = \begin{bmatrix} 9 & 12 & 4 \\ 7 & 12 & 15 \\ 0 & 0 & 100\end{bmatrix} \]

Answer

There are many valid answers as to what makes an arbitrary matrix invertible. Besides needing to be a square matrix (same number of rows and columns), answers include, but are not limited to:

  1. The column vectors are linearly independent (meaning the matrix is full column rank).
  2. The determinant of the matrix is nonzero.

For (1), we can intuitively argue that the column vectors are linearly independent. Vectors are linearly independent if the only linear combination of them that produces the zero vector is the one with all scalar coefficients equal to 0. Another way of saying this is that no vector can be written as a linear combination of the others. Observe that only the third column vector has a nonzero entry in the third coordinate, so it can neither be formed by the other two vectors nor help form them in a linear combination. The first two column vectors point in different directions (neither is a scalar multiple of the other), so neither can form the other through a linear combination.

For (2), we can compute the determinant of a \(3 \times 3\) matrix by the following computation:

\[ \det(AB) = 9 \cdot \det\left(\begin{bmatrix} 12 & 15 \\ 0 & 100 \end{bmatrix} \right) - 12 \cdot \det\left( \begin{bmatrix} 7 & 15 \\ 0 & 100 \end{bmatrix} \right) + 4 \cdot \det\left( \begin{bmatrix} 7 & 12 \\ 0 & 0 \end{bmatrix} \right) \]

\[ \dots = 9 \cdot (12 \cdot 100 - 15 \cdot 0) - 12 \cdot (7 \cdot 100 - 15 \cdot 0) + 4 \cdot (7 \cdot 0 - 12 \cdot 0) \]

\[ \dots = 2400 \neq 0 \]
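Both conditions can be confirmed numerically (a quick NumPy check, subject to floating-point roundoff in the determinant):

```python
import numpy as np

AB = np.array([[9, 12, 4], [7, 12, 15], [0, 0, 100]])

print(round(np.linalg.det(AB)))   # 2400 (nonzero, so AB is invertible)
print(np.linalg.matrix_rank(AB))  # 3 (full rank: columns are independent)
```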

Thinking Through the Problem with Matrices

While the problems in this section have straightforward answers, the main goal is to build intuition and understand the why behind our methods. The final number is less important than the process you use to get there.

Breaking Down the Problem

Before jumping straight into matrix multiplication, a great first step is to calculate the cost of each individual fruit bowl. Once you have those values, think about how you can represent that calculation as a matrix-vector operation. For those who haven’t seen matrix-vector multiplication in a while, it’s the process of multiplying each row of the matrix entrywise by the vector and summing the results; in other words, taking the dot product of each row with the vector.

For Deeper Understanding: Why Not Use an Inverse?

Here’s a challenge question to consider: why can’t we solve for a vector \(\vec{v_2}\) using the expression \(B^{-1} A^{-1} \vec{x}\)?

The answer is that matrix inverses are only defined for square matrices. Since matrices \(A\) and \(B\) in this problem are not square, we can’t find \(A^{-1}\) or \(B^{-1}\).

Bonus Topic: What About Pseudoinverses?

You might wonder if there’s an alternative for non-square matrices. There is! It’s called the pseudoinverse (often denoted with a + superscript). However, it wouldn’t be the right tool here either.

A pseudoinverse gives an approximation, not an exact solution. For a “tall” matrix like \(B\) (with more rows than columns), its pseudoinverse \((B^T B)^{-1}B^T\) maps vectors from a higher dimension (\(\mathbb{R}^4\)) to a lower one (\(\mathbb{R}^3\)). This is related to the method of least squares, which finds the best-fit approximation when an exact solution doesn’t exist.

If we calculate the pseudoinverse of \(B\) (let’s call it \(B^+\)) and multiply \(B\) by it on the right, we don’t get the identity matrix (a matrix of ones on the diagonal and zeros elsewhere), which is what a true inverse would yield. Instead, we get something like this (with values rounded):

\[BB^+ = B(B^TB)^{-1}B^T = \begin{bmatrix} 0.4 & 0.2 & -0.4 & 0.2\\ 0.2 & 0.9 & 0.1 & -0.1\\ -0.4 & 0.1 & 0.8 & 0.1\\ 0.2 & -0.1 & 0.1 & 0.9\\ \end{bmatrix} \]

As you can see, \(BB^+ \neq I\). This confirms that the pseudoinverse doesn’t act like a true inverse in this case.
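We can verify this with NumPy’s `np.linalg.pinv` (a sketch; note that the order of multiplication matters, because for a tall full-column-rank matrix the pseudoinverse is a left inverse but not a right inverse):

```python
import numpy as np

B = np.array([[2, 2, 2], [5, 8, 0], [0, 2, 3], [0, 0, 10]], dtype=float)
B_pinv = np.linalg.pinv(B)  # equals (B^T B)^{-1} B^T since B has full column rank

# Left inverse: B+ B is the 3x3 identity...
print(np.allclose(B_pinv @ B, np.eye(3)))  # True
# ...but B B+ is a 4x4 projection onto the column space of B, not the identity
print(np.allclose(B @ B_pinv, np.eye(4)))  # False
```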

For those interested in future topics, this concept also has a strong connection to the Singular Value Decomposition (SVD): writing the compact SVD of a full-column-rank tall matrix as \(U\Sigma V^T\), its pseudoinverse equals \(V\Sigma^{-1} U^T\).


16.2 More Linear Algebra

In this question, we will apply various essential concepts in linear algebra.


16.2.1 (a)

Linear dependence among a set of vectors \(\{\vec{v_1}, \vec{v_2}, \vec{v_3}, ..., \vec{v_n}\}\) is defined as follows:

Linear Dependence vs. Independence

If some non-trivial linear combination of the vectors produces the zero vector, then the set of vectors is linearly dependent.

In other words, if we can scale each vector \(\vec{v_j}\) by a scalar \(\alpha_j\) and sum the results to obtain the zero vector, with at least one \(\alpha_j \ne 0\), then the set is linearly dependent.

\[ \sum_{i = 1}^n \alpha_i \vec{v_i} = \vec{0} \text{ such that some } \alpha_j \ne 0 \implies \text{linear dependence} \]

Any set of vectors such that we cannot obtain the zero vector as described above is linearly independent.

The column rank of a matrix \(M\) is the maximal number of linearly independent column vectors in \(M\). A full column rank matrix has a column rank equal to the number of column vectors.

Now consider the matrix \(M = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}\) containing two column vectors \(\vec{v_1} = \begin{bmatrix} 2 \\ 0 \end{bmatrix}\) and \(\vec{v_2} = \begin{bmatrix} 0 \\ 3 \end{bmatrix}\). Is it possible to construct the zero vector using a linear combination of the column vectors? What can be concluded about the column rank of the matrix \(M\)?

Answer

No, it is impossible to construct the zero vector when at least one \(\alpha_i \ne 0\): the first vector has no component in the second dimension and vice versa, so neither vector can “undo” the other. As a more formal proof:

\[ \alpha_1 \vec{v_1} + \alpha_2 \vec{v_2} = \begin{bmatrix} 2\alpha_1 \\ 3\alpha_2 \end{bmatrix} \]

If at least one of the \(\alpha\) values is not 0, then the vector cannot be \(\vec{0}\). Hence, this matrix is full column rank (i.e., the set of column vectors is linearly independent).
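NumPy can confirm this conclusion directly (a quick check; `matrix_rank` computes the rank numerically via the SVD):

```python
import numpy as np

M = np.array([[2, 0], [0, 3]])

# Rank 2 = number of columns, so M is full column rank
print(np.linalg.matrix_rank(M))  # 2
```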

16.2.2 (b)

The inverse of a square invertible matrix \(M\) is denoted \(M^{-1}\). It is defined as a matrix such that \(MM^{-1} = I\) and \(M^{-1}M = I\). The matrix \(I\) is a special matrix known as the identity matrix, where the diagonal elements are 1 and the non-diagonal elements are 0.

Consider the inverse matrix \(M^{-1} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\) of \(M\). Carry out the matrix multiplication \(MM^{-1}\), and determine what \(M^{-1}\) must be. Recall that \(M = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}\).

Answer

We carry out the matrix multiplication:

\[ \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} 2a & 2b \\ 3c & 3d \end{bmatrix} \] Since \(MM^{-1} = I\), we need \(2a = 1\), \(3d = 1\), and \(2b = 3c = 0\), which forces \(b = c = 0\). Thus, the inverse matrix is:

\[ M^{-1} = \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{3} \end{bmatrix} \]

An interesting fact about diagonal matrices (i.e. matrices that are nonzero on the diagonal entries but zero everywhere else) is that their inverse is simply the elementwise multiplicative inverse (reciprocal) of the diagonal entries!
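This diagonal-matrix shortcut is easy to check numerically (a sketch using NumPy):

```python
import numpy as np

M = np.array([[2.0, 0.0], [0.0, 3.0]])
M_inv = np.linalg.inv(M)  # inverse is diag(1/2, 1/3)

# The shortcut: a diagonal matrix's inverse is the reciprocal of its diagonal
assert np.allclose(M_inv, np.diag(1 / np.diag(M)))
# And it really is an inverse: M times M^{-1} gives the identity
assert np.allclose(M @ M_inv, np.eye(2))
```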

16.2.3 (c)

Consider a different matrix \(Q = \begin{bmatrix} 1 & 0 & 5 \\ 0 & 1 & 5 \end{bmatrix} = \begin{bmatrix} \vec{v_1} & \vec{v_2} & \vec{v_3} \end{bmatrix}\). What is the column rank of the matrix? Is the matrix invertible?

Answer

We can construct the zero vector with all of \(\vec{v_1}, \vec{v_2}\), and \(\vec{v_3}\) with the combination \(5\vec{v_1} + 5\vec{v_2} - \vec{v_3} = 0\), so this initial set is linearly dependent. We can try sets of two by removing any one of the vectors, and we find that all sets of two are linearly independent. A similar argument as from the first subpart can be applied to explain why there is no linear dependence in any subset of 2 vectors (for example, \(\{\vec{v_1}, \vec{v_2}\}\)). The column rank is 2.

The matrix is not invertible since it is not square.

16.2.4 (d)

The transpose of a matrix is an operation on matrices of any size in which the rows and columns are “flipped.” In more precise terms, if \(A\) is an \(m \times n\) matrix, its transpose \(A^{T}\) is the \(n \times m\) matrix whose element in the \(i\)th row and \(j\)th column is the element in the \(j\)th row and \(i\)th column of \(A\).

Consider a matrix \(R\), which is equal to the transpose of \(Q\): \(R = Q^T\). What is the column rank of \(R\)?

Answer

We take the transpose: \[ R = Q^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 5 & 5 \end{bmatrix} \] The column rank is 2 because neither column is a scalar multiple of the other. Thus, this matrix is full column rank.
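Both ranks can be verified with NumPy (a quick check):

```python
import numpy as np

Q = np.array([[1, 0, 5], [0, 1, 5]])
R = Q.T  # transpose: a 3x2 matrix

print(np.linalg.matrix_rank(Q))  # 2: the three columns of Q are dependent
print(np.linalg.matrix_rank(R))  # 2: both columns of R are independent
```

In fact, the row rank of any matrix equals its column rank, so transposing never changes the rank; the difference here is that \(R\) is full *column* rank while \(Q\) is not.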

A Quick Note on Notation: Transposed Vectors

You will often see vectors written as a row of numbers with a transpose symbol (\(^T\)). For example: \(\begin{bmatrix} a & b & c \end{bmatrix}^T\).

By convention in linear algebra, vectors are treated as column vectors. This transposed notation is simply a common formatting trick to save vertical space on the page.

So, whenever you see a transposed row vector, remember it means the exact same thing as its column form:

\(\begin{bmatrix} a & b & c \end{bmatrix}^T\) is equivalent to \(\begin{bmatrix} a \\ b \\ c \end{bmatrix}\).

16.3 Calculus

In this class, we will have to determine which inputs to functions minimize the output (for instance, when we choose a model and need to fit it to our data). This process involves taking derivatives.

In cases where we have multiple inputs, the derivative of our function with respect to one of our inputs is called a partial derivative. For example, given a function \(f(x, y)\), the partial derivative with respect to \(x\) (denoted by \(\frac{\partial f}{\partial x}\)) is the derivative of \(f\) with respect to \(x\), taken while treating all other variables as if they’re constants.

Suppose we have the following scalar-valued function on \(x\) and \(y\): \[ f(x, y) = x^2 + 4xy + 2y^3 + e^{-3y} + \ln(2y) \]


16.3.1 (a)

Compute the partial derivative of \(f(x,y)\) with respect to \(x\).

Answer \[ \frac{\partial}{\partial x} f(x,y) = 2x + 4y \]

16.3.2 (b)

Compute the partial derivative of \(f(x,y)\) with respect to \(y\).

Answer \[ \frac{\partial}{\partial y} f(x,y) = 4x + 6y^2 -3 e^{-3y} + \frac{1}{y} \]

16.3.3 (c)

The gradient of a function \(f(x, y)\) is a vector of its partial derivatives. That is, \[\nabla f(x, y) = \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix}^T\]

As a vector, \(\nabla f(x, y)\) has both a magnitude and direction. The direction of \(\nabla f(x, y)\) corresponds to the direction in which the graph of \(f(x,y)\) is increasing most steeply from the point \((x, y)\). The magnitude gives a sense of how steep the ascent up the graph is in this particular direction. This is a generalization of the single-variable case, where \(f'(x)\) is the rate of change of \(f\), at the point \(x\). In the two-variable case, we specify a direction to evaluate the rate of change, since the function is technically changing in all directions from \((x, y)\).

Using your answers to the above two parts, compute \(\nabla f(x, y)\) and evaluate the gradient at the point (\(x = 2, y = -1\)).

Answer \[ \nabla f(x, y) = \begin{bmatrix} 2x + 4y \\ 4x + 6y^2 -3 e^{-3y} + \frac{1}{y}\end{bmatrix} \] \[ \nabla f(2, -1) = \begin{bmatrix} 2(2) + 4(-1) \\ 4(2) + 6(-1)^2 -3 e^{-3(-1)} + \frac{1}{-1}\end{bmatrix} = \begin{bmatrix} 0 \\ 8 + 6 - 3 e^{3} - 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 13 - 3 e^{3} \end{bmatrix} \]
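We can double-check the analytic gradient with a central-difference approximation (a sketch; note that because of the \(\ln(2y)\) term, \(f\) itself is only defined for \(y > 0\), so the numerical check below uses a point with positive \(y\) rather than \((2, -1)\)):

```python
import math

def f(x, y):
    return x**2 + 4*x*y + 2*y**3 + math.exp(-3*y) + math.log(2*y)

def grad_f(x, y):
    # Analytic gradient from parts (a) and (b)
    return (2*x + 4*y, 4*x + 6*y**2 - 3*math.exp(-3*y) + 1/y)

# Central differences at a point where f is defined (requires y > 0)
x0, y0, h = 2.0, 0.5, 1e-6
num_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2*h)
num_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2*h)

gx, gy = grad_f(x0, y0)
assert abs(num_dx - gx) < 1e-4 and abs(num_dy - gy) < 1e-4
```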

16.4 Probability & Sampling

Ishani wants to measure interest for a party on her street. She assigns numbers and letters to each house on her street as illustrated above. She picks a letter “a”, “b”, or “c” at random and then surveys every household on the street ending in that letter.


A Quick Refresher on Probability Rules

As you work through these problems, it’s helpful to keep some fundamental rules of probability in mind. Emphasizing these concepts can make the solutions feel more intuitive. Here’s a quick summary of the key ideas you might need.

Conditional Probability

This is the probability of an event happening, given that another event has already occurred. It’s a way of updating our knowledge.

  • Formula: \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
  • In words: The probability of “A given B” is the probability of both A and B happening, divided by the probability of B.

The Multiplication Rule (Joint Probability)

This rule helps you find the probability of two events occurring together. It’s just a rearrangement of the conditional probability formula.

  • Formula: \(P(A \cap B) = P(A|B)P(B)\)
  • For Independent Events: If events A and B are independent (the outcome of one doesn’t affect the other), the rule simplifies to: \(P(A \cap B) = P(A)P(B)\).

The Addition Rule

Use this rule to find the probability of at least one of two events occurring (A or B).

  • Formula: \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
  • In words: We add the probabilities of A and B, then subtract the probability of their intersection to avoid double-counting the scenario where they both occur.

16.4.1 (a)

What is the chance that two houses next door to each other are both in the sample?

Answer None of the adjacent houses end in the same letter, so the chance is zero. This is an example of a cluster sample, where each cluster is the group of houses ending in the same letter.

16.4.2 (b)

Ishani decides to collect a simple random sample (SRS) of four houses. What is the probability that house 1a is not in Ishani’s simple random sample of four houses?

Answer This time, we are taking a sample of 4 houses. The probability that house 1a is not in Ishani’s sample is the probability of not picking it on the first draw, second, third, and fourth draws (sampling without replacement). The probability is \(\frac{11}{12} \cdot \frac{10}{11} \cdot \frac{9}{10} \cdot \frac{8}{9} = \frac{8}{12} = \frac{2}{3}\).
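The street illustration isn’t reproduced here, but part (d) implies there are 12 houses in total; assuming houses numbered 1 through 4 with letters a, b, and c, we can verify the answer exactly by enumerating every possible simple random sample of 4 houses:

```python
from itertools import combinations

# Assumed layout: 12 houses labeled 1a..4c
houses = [f"{n}{letter}" for n in range(1, 5) for letter in "abc"]

# All C(12, 4) = 495 equally likely samples of 4 houses
samples = list(combinations(houses, 4))
p_no_1a = sum("1a" not in s for s in samples) / len(samples)

print(p_no_1a)  # 0.666... (= 2/3)
```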

16.4.3 (c)

Instead of surveying every member of each house from the SRS of four houses, Ishani decides to survey two members chosen without replacement from each selected house. Four people live in house 1a, one of whom is Bob. What is the probability that Bob is not chosen in Ishani’s new sample?

Answer This is a two-stage process. First, house 1a must be chosen. Then, Bob must be chosen from the house. The probability that house 1a is included in the SRS of four houses is \(1 - P(\text{1a is not chosen}) = 1 - \frac{2}{3} = \frac{1}{3}\). Given that house 1a is selected, the probability that Bob is one of the two people surveyed (out of four) is \(\frac{2}{4} = \frac{1}{2}\). Therefore, the probability that Bob is surveyed is the probability of both events happening: \(P(\text{1a is selected}) \times P(\text{Bob is selected | 1a is selected}) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}\). Thus the probability that Bob is not selected is \(1 - \frac{1}{6} = \frac{5}{6}\).
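The two-stage computation can be checked with exact arithmetic using Python’s `fractions` module (a small sketch of the multiplication rule, \(P(A \cap B) = P(A \mid B)P(B)\)):

```python
from fractions import Fraction

p_house = Fraction(1, 3)            # P(house 1a is in the SRS), from part (b)
p_bob_given_house = Fraction(2, 4)  # 2 of the 4 residents are surveyed

# Multiplication rule: P(Bob surveyed) = P(1a selected) * P(Bob | 1a selected)
p_bob_surveyed = p_house * p_bob_given_house
print(1 - p_bob_surveyed)  # 5/6
```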

16.4.4 (d)

To increase interest in her party, Ishani randomly selects 4 houses to gift a small present. Assuming she samples with replacement (meaning that a house could be chosen multiple times), what’s the probability that all “c” houses are given a gift?

Answer For all “c” houses to be selected in 4 draws, the 4 draws must be the 4 distinct “c” houses in some order. There are \(4!\) ways to order the four “c” houses (e.g., 1c, 2c, 3c, 4c or 2c, 1c, 3c, 4c, etc.). For any specific order, the probability is \((\frac{1}{12})^4\) since each draw is independent and has a \(\frac{1}{12}\) chance of being a specific house. Therefore, the total probability is the number of valid orderings times the probability of one specific ordering: \[ P(\text{all "c" houses}) = 4! \cdot \left(\frac{1}{12}\right)^4 = \frac{24}{20736} = \frac{1}{864} \]
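Because sampling is with replacement, every sequence of 4 draws is equally likely, so we can verify the count by brute force (assuming 12 houses, four of which are the “c” houses; which four indices we pick is arbitrary):

```python
from itertools import product

houses = range(12)          # 12 houses on the street (assumed)
c_houses = {8, 9, 10, 11}   # the four "c" houses (any fixed four work)

# Enumerate all 12^4 equally likely draw sequences with replacement
draws = list(product(houses, repeat=4))
# A sequence gifts all "c" houses iff it hits each of the four exactly once
hits = sum(set(d) == c_houses for d in draws)

print(hits, "/", len(draws))  # 24 / 20736
```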