15  Discussion 15: Wrapping Up (From Summer 2025)

15.1 Semester Recap

Write or describe the appropriate concept from this class (e.g., hypothesis test, linear regression, classification, etc.) that would be applied to answer each of these questions about data. If no method is applicable, explain why. You might remember some examples from the very first worksheet!

15.1.1 (a)

Estimating the average number of students in each class at UC Berkeley based on a random sample.

Answer Perform bootstrap resamples from the original sample and construct a confidence interval.

15.1.2 (b)

Predicting whether or not a customer will make an online purchase based on their browsing activity.

Answer Classification — they will either make a purchase or they won’t. We could use features like the number of items in their shopping cart, the amount of time spent browsing, the number of searches/visits, etc.

15.1.3 (c)

Updating the probability someone will make a purchase based on new information about their household income bracket.

Answer We can apply the concepts of Bayes’ Rule to update the probability of making a purchase with new information.

15.1.4 (d)

Determining if there is any difference in household utility bills between Massachusetts and California.

Answer We can run a hypothesis test. Since we are comparing the distributions of household utility bills in two groups (Massachusetts and California), we can perform an A/B test with a test statistic such as the difference in means.

15.1.5 (e)

Predicting how many students will graduate in a particular major based on the number of Reddit posts about the subject.

Answer Linear regression. We can further use regression inference to determine whether or not an association exists between these two variables.

15.1.6 (f)

Determining whether a random Data 8 student in Fall 2024 sleeps on their stomach assuming you have data on how every single Data 8 student sleeps.

Answer

Since we have the entire population data set, there is no reason to make a prediction. Instead, we can just look up the student in the data set and see what their preference is!

When Full Population Data Means No Need for Predictive Models

If you already have data on the entire population, there’s no need to use fancy prediction methods like regression or classification. Those methods are useful when you have just a sample and want to make guesses about the whole group. But if you have everyone’s data, you can directly calculate what you need without approximations.

15.2 Thomas’ Bays

Thomas is an avid consumer of fish and is interested in finding ethically sourced fish to eat from 3 local bays near him. He researches online and finds the following information:

  • Of all the fish, 20% come from Labor Bay, 45% come from Obi-Wan Keno Bay, and 35% come from Spelling Bay.
  • In these three bays, there are only 2 types of fish: salmon and tuna.
  • In Labor Bay, 37% of the fish are salmon.
  • In Obi-Wan Keno Bay, 40% of the fish are tuna.
  • In Spelling Bay, 42% of the fish are salmon.

15.2.1 (a)

Draw a tree diagram to represent the result of Thomas’ research.

Understanding Probability Trees: Visualizing Compound Events

Probability trees are great tools to map out all possible outcomes step-by-step. At every split in the tree, the probabilities of all branches add up to 1. For example, imagine a bag containing two dice: one is weighted to roll a 6 half the time, the other is a normal die. You randomly pick one (50-50 chance), then roll it. The tree helps you see the total probability of rolling a six by multiplying the chance of picking a die by the chance of rolling six on that die.
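
As a quick numeric check of this dice example, here is a minimal sketch in Python (the 50-50 pick and the half-weighted die are exactly the assumptions stated above):

# Two branches: pick the weighted die or the fair die (50-50 chance),
# then roll a six on whichever die was picked.
p_weighted_branch = 0.5 * 0.5      # pick weighted die, roll a six
p_fair_branch = 0.5 * (1 / 6)      # pick fair die, roll a six

# Total probability of a six: add the products along both branches.
p_six = p_weighted_branch + p_fair_branch
print(p_six)  # 0.3333...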

Answer

The tree branches first by bay, then by fish type within each bay:

  • Labor Bay (0.20): Salmon 0.37, Tuna 0.63
  • Obi-Wan Keno Bay (0.45): Salmon 0.60, Tuna 0.40
  • Spelling Bay (0.35): Salmon 0.42, Tuna 0.58

At each split the probabilities add up to 1: the three bay probabilities sum to 1, as do the two fish-type probabilities within each bay.

15.2.2 (b)

Thomas is shopping for a fish at random. What is the chance he picks a tuna from Obi-Wan Keno Bay?

Answer

\(0.45 * 0.4 = 0.18\)

Using the Multiplication Rule to Calculate Combined Probabilities

To find the probability of multiple events happening together, multiply the probabilities along the branches of the probability tree that lead to the event. You can trace the branches to understand how combined probabilities work. This helps you see complex event chances clearly.
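
Applied to part (b), tracing the Obi-Wan Keno Bay branch and then its tuna sub-branch gives:

\[P(\text{Obi-Wan Keno Bay and Tuna}) = P(\text{Obi-Wan Keno Bay}) \times P(\text{Tuna} \mid \text{Obi-Wan Keno Bay}) = 0.45 \times 0.4 = 0.18\]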


15.2.3 (c)

Thomas chooses a fish at random. What is the probability that the fish is a salmon? What is the probability that it is a tuna?

Answer

P(Salmon) = \(0.2*0.37 + 0.45*0.6 + 0.35*0.42 = 0.491\)
P(Tuna) = \(0.2*0.63 + 0.45*0.4 + 0.35*0.58 = 0.509\)
OR
P(Tuna) = 1 - P(Salmon) = \(1 - (0.2*0.37 + 0.45*0.6 + 0.35*0.42) = 1 - 0.491 = 0.509\)

Marginal Probability: Finding the Overall Chance of an Event

Marginal probability means the total chance of an event happening, regardless of other conditions. For example, if you want the chance of getting a salmon, add up the probabilities of all the different ways you could get salmon. You can also find the chance of getting tuna as the complement (everything else), like:

\(P(\text{Tuna}) = 1 - P(\text{Salmon})\)

You can also calculate tuna probability directly by adding the relevant branches.
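
Here is a minimal Python check of these sums, using the branch probabilities from Thomas’ research:

# (P(bay), P(salmon | bay)) for each of the three bays
branches = [(0.20, 0.37), (0.45, 0.60), (0.35, 0.42)]

# Marginal probability of salmon: add up every salmon branch.
p_salmon = sum(p_bay * p_salmon_given_bay for p_bay, p_salmon_given_bay in branches)
p_tuna = 1 - p_salmon  # complement rule

print(p_salmon, p_tuna)  # 0.491 0.509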


15.2.4 (d)

Thomas ends up buying a salmon. What is the probability the salmon came from Spelling Bay?

Answer

\[\frac{P(\text{Spelling Bay and Salmon})}{P(\text{Salmon})} = {\frac{0.35 * 0.42}{(0.2*0.37) + (0.45*0.6) + (0.35*0.42)}} = \frac{0.147}{0.491} \approx 0.299\]

Applying Bayes’ Rule with Probability Trees

Bayes’ Rule helps us update probabilities when given new information. The denominator in Bayes’ formula is the total probability of the condition (like Thomas buying a salmon), which you get by adding all the relevant salmon branches on the tree. The numerator is the joint probability of both events happening (like Thomas buying a salmon from Spelling Bay). Divide numerator by denominator to get the conditional probability:

\(P(\text{Spelling Bay} | \text{Salmon}) = \frac{P(\text{Spelling Bay and Salmon})}{P(\text{Salmon})}\)

You can circle the relevant parts on the tree to visualize the numerator and denominator, which makes it easier to understand.


15.2.5 (e)

Thomas buys 10 fish at random. What is the probability that at least one of them is a salmon from Spelling Bay?

Answer

\[ \begin{aligned} P(\text{Salmon from Spelling Bay}) &= 0.35 \times 0.42 \\ P(\text{Not Salmon from Spelling Bay}) &= 1 - (0.35 \times 0.42) \\ P(\text{At least one salmon from Spelling Bay}) &= 1 - P(\text{Not Salmon from Spelling Bay})^{10} \\ P(\text{At least one salmon from Spelling Bay}) &= 1 - \left( 1 - (0.35 \times 0.42) \right)^{10} \end{aligned} \]

Solving “At Least One” Probability Problems with Complements

These problems often require careful use of the complement rule. For example, the probability that at least one salmon comes from Spelling Bay is NOT the same as the probability that a salmon is from Spelling Bay given it’s salmon. Use the complement rule twice: find the chance that none satisfy the condition, then subtract from 1 to get the “at least one” probability.
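
A short Python sketch of the two-step complement computation:

p = 0.35 * 0.42                # P(a single fish is a Spelling Bay salmon) = 0.147
p_none = (1 - p) ** 10         # P(none of the 10 fish is a Spelling Bay salmon)
p_at_least_one = 1 - p_none    # complement again
print(p_at_least_one)          # approximately 0.796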

15.3 Multiple Linear Regression + k-NN Regression

In multiple linear regression, a numerical output is predicted from multiple attributes; the output is obtained by multiplying each attribute value by a different slope and then summing the results. Much like simple linear regression, we will find our optimal slopes by minimizing the root mean squared error (RMSE) between our actual values and our predicted values.
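
In symbols, with two attributes \(x_1\) and \(x_2\):

\[\text{prediction} = \text{slope}_1 \cdot x_1 + \text{slope}_2 \cdot x_2 + \text{intercept}, \qquad \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\text{actual}_i - \text{prediction}_i\right)^2}\]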

Below is a table rides that contains data on 1000 theme park rides. There are three columns:

  • Number of Visitors (int): the total number of visitors currently in the entire theme park.
  • Popularity Score (int): a rating from 0 to 100 measuring how popular the ride is.
  • Wait Time (int): the average wait time for the ride in minutes.

The first 5 rows are shown below:

Code
from datascience import *
import numpy as np

# Five fixed rows so the preview shown below is deterministic.
fixed_visitors = [400, 700, 1100, 1500, 1800]
fixed_popularity = [30, 50, 75, 85, 90]
fixed_wait = [20, 35, 55, 70, 80]

# Remaining 995 rows: random features, with wait time generated from a
# linear rule plus Gaussian noise.
n_remaining = 995
visitors_rest = np.random.randint(100, 5000, size=n_remaining)
popularity_rest = np.random.randint(0, 101, size=n_remaining)
noise_rest = np.random.normal(0, 5, size=n_remaining)
wait_rest = 0.015 * visitors_rest + 0.5 * popularity_rest + noise_rest
wait_rest = np.round(wait_rest).astype(int)

visitors_all = fixed_visitors + visitors_rest.tolist()
popularity_all = fixed_popularity + popularity_rest.tolist()
wait_all = fixed_wait + wait_rest.tolist()

rides = Table().with_columns(
    "Number of Visitors", visitors_all,
    "Popularity Score", popularity_all,
    "Wait Time", wait_all
)

rides.show(5)

train = rides.take(np.arange(750))       # first 750 rows for training
test = rides.take(np.arange(750, 1000))  # remaining 250 rows for testing
Number of Visitors Popularity Score Wait Time
400 30 20
700 50 35
1100 75 55
1500 85 70
1800 90 80

... (995 rows omitted)

Aayan is interested in predicting the average wait time of a ride, measured in minutes, given the Number of Visitors and Popularity Score attributes.


15.3.1 (a)

Assume Aayan has determined that MLR is a good model choice, and has correctly split the data into a test and train table. Help Aayan define a function predict(slope1, slope2, intercept, tbl) that takes in two slopes, an intercept, and a table with the same structure as rides and predicts the wait times for all rows in the table. Assume that the first column in the table corresponds to slope1, and the second column corresponds to slope2.

def predict(slope1, slope2, intercept, tbl):
    return ____________________________
Answer
def predict(slope1, slope2, intercept, tbl):
    return slope1 * tbl.column(0) + slope2 * tbl.column(1) + intercept
predict(0.02, 0.4, 5, train)
array([  25.  ,   39.  ,   57.  ,   69.  ,   77.  ,   74.5 ,  104.34,
        ... (742 values omitted) ...,   64.98])

15.3.2 (b)

Using the predict function, help complete the code for the function train_rmse(slope1, slope2, intercept) that takes in two slopes and an intercept and computes the RMSE of the predictions on the train table:

def train_rmse(slope1, slope2, intercept):
    predictions = ______________________
    actual = ______________________
    residuals = ______________________
    return ______________________
Answer
def train_rmse(slope1, slope2, intercept):
    predictions = predict(slope1, slope2, intercept, train)
    actual = train.column("Wait Time")
    residuals = actual - predictions
    return np.sqrt(np.mean(residuals ** 2))
train_rmse(0.02, 0.4, 5)
15.742561451894247

15.3.3 (c)

Assume that after following the demos from lecture, Aayan has properly defined the train_rmse(slope1, slope2, intercept) function and assigns best_slopes to the result of calling minimize(train_rmse, start=make_array(5, 5, 5), smooth=True, array=True). Assume the array best_slopes evaluates to array([0.02, 0.4, 5]). Help Aayan answer the following questions about his results.


15.3.3.1 (i)

Write out the equation of the regression line as a mathematical expression, using the values in best_slopes.

Answer \(\text{Predicted Wait Time} = 0.02 * \text{Number of Visitors} + 0.4 * \text{Popularity Score} + 5\)

15.3.3.2 (ii)

Using the equation above, what would the predicted wait time be for a ride when there are 1000 visitors currently in the park and the ride has a popularity score of 70?

Answer

\[ \begin{aligned} \text{Predicted Wait Time} &= 0.02 \times \text{Number of Visitors} + 0.4 \times \text{Popularity Score} + 5 \\ &= 0.02 \times 1000 + 0.4 \times 70 + 5 \\ &= 20 + 28 + 5 \\ &= 53 \end{aligned} \]

So, the predicted wait time is 53 minutes.


15.3.4 (d)

How would we interpret the slope for the “Number of Visitors” attribute? Write your answer in 1-2 sentences, making sure to use precise language.

Answer

The slope for the Number of Visitors attribute is 0.02. This means that for every additional visitor in the park, the predicted wait time increases by 0.02 minutes, holding all other variables constant.

Interpreting the Meaning of a Regression Slope

The slope in regression tells you how much the predicted outcome changes when your input variable increases by one unit, assuming all other variables stay the same. Be precise with this interpretation because it helps understand model results clearly.
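
For example, using the fitted slopes from part (c): 100 additional visitors raise the predicted wait time by \(100 \times 0.02 = 2\) minutes, holding Popularity Score constant.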


15.3.5 (e)

After completing his model, Aayan realizes he could have also used k-NN regression to predict wait time instead. For the following scenarios, determine which of the techniques are applicable (based on what has been covered in Data 8): (1) Simple Linear Regression, (2) Multiple Linear Regression, (3) Classification (k-NN), (4) k-NN Regression.

  • Splitting the data into a testing and training set
  • Predicting a categorical variable from numerical features
  • Evaluating the performance of the model using RMSE
  • Evaluating the performance of the model using accuracy
  • Examining the residual plot of our predictions
Differences Between Regression, Classification, and k-NN Predictions
  • k-NN regression predicts numerical values (like prices or scores).
  • Regression in general predicts numbers, not categories.
  • Classification predicts categories or classes (like “cat” or “dog”).

When measuring classification accuracy, use the formula:

\(\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total predictions}}\)
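
As a minimal sketch of this formula in code (the predicted and actual label arrays below are hypothetical, not from the rides data):

import numpy as np

# Hypothetical predicted and true class labels for five test rows.
predicted = np.array(['Voter', 'Non-voter', 'Voter', 'Voter', 'Non-voter'])
actual = np.array(['Voter', 'Non-voter', 'Non-voter', 'Voter', 'Non-voter'])

accuracy = np.count_nonzero(predicted == actual) / len(actual)
print(accuracy)  # 0.8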

Answer
  • Splitting the data into a testing and training set — Techniques: (1), (2), (3), (4)
  • Predicting a categorical variable from numerical features — Techniques: (3)
  • Evaluating the performance of the model using RMSE — Techniques: (1), (2), (4)
  • Evaluating the performance of the model using accuracy — Techniques: (3)
  • Examining the residual plot of our predictions — Techniques: (1), (2), (4)

15.4 Confidence Intervals - Fa20 Final Q6 Modified (Optional)

Every day, Samiksha gets a boba drink from Asha in Berkeley. She believes the machine that’s used to add the boba pearls to the drink is calibrated to put an exact amount of boba in each drink, with some variability due to chance.

To get a sense of how the amount of boba in a drink varies, Samiksha plans to randomly sample customers throughout November and record the weight (in grams) of the boba pearls in each customer’s drink.


15.4.1 (a)

Suppose that Samiksha wants to use a sample of size 100 to create a confidence interval for the true population mean of boba weight per drink.

Which of the following could be used to help her create this confidence interval? Select all that are correct:

  1. Central Limit Theorem
  2. Bootstrapping
  3. Nearest Neighbors
  4. Linear Regression
  5. Classification
Answer Options 1 (Central Limit Theorem) and 2 (bootstrapping) are correct.

15.4.2 (b)

Suppose that Samiksha wants to use a sample of size 100 to create a confidence interval for the true population median of boba weight per drink.

Which of the following could be used to help her create this confidence interval? Select all that are correct:

  1. Central Limit Theorem
  2. Bootstrapping
  3. Nearest Neighbors
  4. Linear Regression
  5. Classification
Answer

Only option 2 (bootstrapping) is correct. The CLT applies only to the sum or mean of a large random sample, not the median.

Central Limit Theorem vs Bootstrapping: When to Use Each

The Central Limit Theorem (CLT) applies when you have large random samples and you’re looking at sample sums or means; it tells you their distribution will be approximately normal. Bootstrapping is more flexible: it can approximate the distribution of other statistics, such as the median, provided the original sample is large and random.
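
A minimal bootstrap sketch in the Data 8 style; the table boba_sample and its column label 'Boba Weight' are hypothetical names, not given in the problem:

import numpy as np
from datascience import *

def bootstrap_median_ci(sample_tbl, label, repetitions=1000):
    # Percentile-based 95% confidence interval for the population median.
    medians = make_array()
    for _ in np.arange(repetitions):
        resample = sample_tbl.sample()  # same size, with replacement
        medians = np.append(medians, np.median(resample.column(label)))
    return percentile(2.5, medians), percentile(97.5, medians)

# Hypothetical usage: bootstrap_median_ci(boba_sample, 'Boba Weight')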


15.4.3 (c)

Suppose that Samiksha observes an average of 30 grams of boba per drink from a random sample of 100 customers and she knows that the population SD is 2 grams.

Which of the following is guaranteed to be true? Select all that are correct:

  1. At least 68% of the customers in the population will have a boba weight that is between 2 grams below and 2 grams above the population mean.
  2. At least 75% of the customers in the population will have a boba weight that is between 4 grams below and 4 grams above the population mean.
  3. At least 75% of the customers in the population will have a boba weight that is between 26 grams and 34 grams.
  4. At least 68% of the customers in Samiksha’s sample have a boba weight that is between 28 grams and 32 grams.
Answer

Only option 2 is correct. We do not know whether the population is normally distributed, so we can only use Chebyshev’s bounds, which guarantee that at least 75% of the data lies within 2 SDs of the mean. Options 1 and 4 rely on the 68% figure from the normal curve, which is not guaranteed here, and option 3 fails because 30 grams is the sample average, not necessarily the population average.

Using Chebyshev’s Inequality When Population Distribution Is Unknown

If you don’t know the shape of the population distribution and can’t assume normality, use Chebyshev’s inequality. It guarantees a minimum proportion of data within any specified distance from the mean, no matter how weird the distribution looks.
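
In symbols: for any distribution, the proportion of values within \(z\) SDs of the mean is at least \(1 - \frac{1}{z^2}\). With \(z = 2\), this gives \(1 - \frac{1}{4} = 75\%\), the bound used in the answer above.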


15.4.4 (d)

Suppose that Samiksha samples 500 customers and her 95% confidence interval for the true population mean is (29, 31).

Which of the following can be concluded from this confidence interval? Select all that are correct:

  1. If Samiksha repeats this process 1000 times, she can expect that roughly 95% of the intervals she creates will contain the true population mean.
  2. If Samiksha gets boba once a day throughout November, she can expect to get between 29 and 31 grams of boba on roughly 95% of the days.
  3. If Samiksha gets boba once a day throughout the year, she can expect to get between 29 and 31 grams of boba on roughly 95% of the days.
  4. If you sample 100 Asha boba customers in November, you can expect roughly 95% of them to get between 29 and 31 grams of boba.
Answer

Only option 1 is correct. Options 2, 3, and 4 are all incorrect: the confidence interval estimates the true average grams of boba per drink, and makes NO claim about what percentage of individual drinks contain between 29 and 31 grams of boba. It wouldn’t make sense to think that 95% of drinks have between 29 and 31 grams of boba - those bobaristas would have to be super precise!

Correct Interpretation of Confidence Intervals

A 95% confidence interval means that about 95% of the intervals constructed from repeated samples will contain the true mean. It does NOT mean 95% of individual data points fall within the interval — that’s a common mistake to avoid.

15.5 Variability of Sample Mean - Fa19 Final Q9 Modified (Optional)

Ethan, an International Relations major, is writing his Master’s thesis on aid given to foreign governments by the World Bank. He finds a sample of donations given to various countries over the last decade and collects these findings into a table called aid. Here are the first few rows.

The table contains four columns:

  • Date: a string, the date upon which the donation was made
  • Recipient: a string, the country receiving the money
  • Amount: an int, the amount of the donation in USD
  • Purpose: a string, the reason listed for the aid

For parts (a) and (b), assume that Ethan is interested in studying the average Amount of aid given per donation.


15.5.1 (a)

To get a sense of the data, Ethan first plots a histogram of the aid ‘Amount’ in his sample. He finds that the empirical distribution of ‘Amount’ has an average of $3,532,423 and an SD of $1,121,240. The distribution of ‘Amount’ in his sample is:

  1. Approximately normal
  2. Not approximately normal
  3. There isn’t enough information to answer this question
Answer

Option 3: there isn’t enough information to answer this question. We cannot tell whether Amount is normally distributed based on the sample statistics given. A normal shape is certainly possible, since 3 SDs in each direction of the mean still gives a valid (positive) dollar amount, but only the histogram itself can reveal the shape.

Why Mean and SD Alone Don’t Show If Data Is Normal

Just knowing the mean and standard deviation isn’t enough to tell if your data follows a normal distribution. You need to look at the shape (like histograms or QQ-plots) because the same mean and SD can come from very different shapes.


15.5.2 (b)

Suppose Ethan wants to use his sample data to create a 95% confidence interval of the true average amount of aid of all donations. If the distribution of all World Bank donations has an SD of $1,000,000 and the aid table contains 10,000 rows, can Ethan create a 95% confidence interval that has a width less than $25,000?

Note: an interval of [-5, 5] has a width of 10.

  1. He can because the sample size is large enough
  2. He can’t because the sample size is too small
  3. There isn’t enough information to answer this question
Answer

Option 2: he can’t because the sample size is too small. We’re given the sample size of 10,000 and the population SD of $1,000,000.

We can find the SD of the sample means to be \(\cfrac{1,000,000}{\sqrt{10,000}} = 10,000\).

Since we’re making a CI for the true average amount of aid, the CLT applies to the distribution of sample means. A CLT-based 95% CI extends 2 SDs of the sample mean on either side of its center, so its width is 4 SDs: \(4 \cdot 10{,}000 = 40{,}000 > 25{,}000\). The sample size is too small for such a narrow interval.
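
We can also solve for the smallest sample size that would give a width under $25,000:

\[4 \cdot \frac{1{,}000{,}000}{\sqrt{n}} < 25{,}000 \implies \sqrt{n} > 160 \implies n > 25{,}600\]

So Ethan would need a sample of more than 25,600 donations.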

For the rest of the question, assume that Ethan has turned his attention to learning more about aid ‘Purpose’.

15.5.3 (c)

As Ethan is combing through the data set, he notices that some countries in South Asia appear to have received a disproportionate amount of aid with the purpose of ‘rail’ and ‘manufacturing’ compared to others in the region. He creates a table that displays, for each country in the region, the proportion of its aid devoted to each purpose. For example, the last column tells us that of the aid that Pakistan received from the World Bank, 20% was for agriculture, 20% was for rail, and 60% was for manufacturing. Note that each country’s column adds up to 1.

According to the above distributions, what is the empirical total variation distance of aid ‘Purpose’ between India and Bangladesh? You may leave your answer as a mathematical expression (not Python).

Calculating Total Variation Distance Between Distributions

Total Variation Distance (TVD) measures how different two probability distributions are by summing half the absolute differences between their category proportions:

\(\text{TVD} = \frac{1}{2} \sum |\text{difference in proportions}|\)

Make sure not to forget dividing by 2 — it’s a common mistake!

\[ \underline{\hspace{12cm}} \\ \]

Answer We find the sum of the absolute differences between the corresponding purposes, and divide by 2. \(\cfrac{0.9 + 0.3 + 0.6}{2} = 0.9\).

15.5.4 (d)

The World Bank claims the total variation distance of aid ‘Purpose’ between India and Bangladesh is 0.3. Ethan is not sure if his empirical TVD (from part (c)) is different from 0.3 just due to chance, but he thinks he could bootstrap his sample to get a better idea.

Complete the code below to write a function purpose_tvd that takes in a table tbl with the same column labels as aid, two country names, country_a and country_b, and computes the total variation distance between the two countries’ ‘Purpose’ distributions.

For example, purpose_tvd(aid, 'Bangladesh', 'India') should return your answer from part (c).

def purpose_tvd(tbl, country_a, country_b):
    dist_a = tbl.where(__________).__________
    counts_a = dist_a.sort('Purpose').__________
    dist_b = tbl.where(__________).__________
    counts_b = dist_b.sort('Purpose').__________
    props_a = counts_a / np.sum(counts_a)
    props_b = counts_b / np.sum(counts_b)
    return __________ * np.sum(abs(__________))
Answer
def purpose_tvd(tbl, country_a, country_b):
    # Distribution of purposes for each country; group adds a count column.
    dist_a = tbl.where('Recipient', country_a).group('Purpose')
    counts_a = dist_a.sort('Purpose').column(1)
    dist_b = tbl.where('Recipient', country_b).group('Purpose')
    counts_b = dist_b.sort('Purpose').column(1)
    # Convert counts to proportions; sorting by 'Purpose' aligns the two
    # arrays (assuming both countries received aid for every purpose).
    props_a = counts_a / np.sum(counts_a)
    props_b = counts_b / np.sum(counts_b)
    # TVD: half the sum of the absolute differences.
    return 0.5 * np.sum(abs(props_a - props_b))
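
To follow through on Ethan’s bootstrap idea from the prompt, one possible sketch (the 1,000 repetitions and the percentile-based interval are illustrative choices, not part of the original question):

# Bootstrap the empirical TVD by resampling rows of aid with replacement.
bootstrap_tvds = make_array()
for _ in np.arange(1000):
    resample = aid.sample()  # same number of rows, with replacement
    bootstrap_tvds = np.append(
        bootstrap_tvds, purpose_tvd(resample, 'Bangladesh', 'India'))

# An approximate 95% confidence interval for the true TVD:
left = percentile(2.5, bootstrap_tvds)
right = percentile(97.5, bootstrap_tvds)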

15.6 Classification - Fa18 Final Q1 Modified (Optional)

Candidate A decides to train a classifier to predict whether people will vote in the 2020 U.S. election or not. They gather data on voting records from the 2018 U.S. election and decide to use two features: the number of political and non-political posts on social media that the person made in the month leading up to the election. A scatter plot of their initial sample is shown below:


15.6.1 (a)

The candidate is trying to classify the point at (8, 17) shown as a triangle on the graph above. If they use a 3-nearest neighbor classifier, what will their classification be?

  1. Voter
  2. Non-voter
Answer Non-voter. The 3 nearest points to the triangle on the graph are all red circles corresponding to the Non-voter class.

15.6.2 (b)

Suppose the candidate randomly divides the data into test and training sets (both much larger than the set shown above), and finds a test set accuracy of 94%. The candidate decides to apply their trained classifier to a test set from another country with lower rates of internet access. Should they expect the accuracy to be the same, higher or lower? Why?

  1. Same
  2. Higher
  3. Lower
Answer

The correct answer is Lower. In regions with lower rates of internet access, people post less, so their data points fall in the lower-left corner of the feature space. The training data is sparse there, so the classifier’s accuracy in that region is lower.

Impact of Sparse Data on Model Accuracy

If the data is sparse in certain areas (for example, because internet access is low), your model won’t have enough examples nearby to make good predictions. This usually lowers accuracy in those regions. This is called extrapolation!


15.6.3 (c)

Instead of a \(k\)-nearest neighbor classifier, the candidate decides to use a \(d\)-distance classifier. In this classifier, instead of choosing the \(k\) closest neighbors, we’ll instead choose all neighbors within a specified distance \(d\) (including points that are exactly \(d\) units away). If there are an equal number of points with both labels within that distance, choose whichever class you wish.

If \(d = 5\), how would you classify the point at (8, 17) shown as a triangle on the graph above?

  1. Voter
  2. Non-voter
Answer The correct answer is Voter. Within a circular region of radius 5, there are many more points corresponding to the Voter class than Non-voter.

15.6.4 (d)

In the above scatter plot, there are 21 points with a label of Voter, and 25 points with a label of Non-voter. On the plot below, draw the approximate decision boundary for a 43-neighbor classifier, or write “Impossible” below if you cannot draw a decision boundary for this classifier. Explain.

Answer

Impossible. Any 43 neighbors must include at least 22 Non-voter points, since there are only 21 Voter points in total, so a 43-neighbor classifier would always return Non-voter. No boundary exists.

Effect of Large k Values on k-NN Classification

When you set \(k\) in k-NN large enough (at least twice the count of the minority class plus one), the classifier always picks the majority class. This means there’s no real decision boundary; the model becomes oversimplified.


15.6.5 (e)

When building a \(k\)-nearest-neighbor classifier, increasing \(k\) will result in which of the below? Select all that apply.

  1. Higher training accuracy
  2. Higher test accuracy
  3. Lower training accuracy
  4. Lower test accuracy
Answer

None of these outcomes is guaranteed. Increasing \(k\) can lead to higher test accuracy in some cases, for example, but these trends do not hold for all values of \(k\) or all data sets. This is why finding the best value of \(k\) is often part of building our model; it is a hyperparameter that we tune.

How Changing k Affects k-NN Accuracy and Overfitting

Increasing \(k\) makes the model less sensitive to noise (reducing overfitting), which typically lowers training accuracy. Test accuracy often improves as \(k\) grows from a very small value, but this is not guaranteed; it depends on the data.
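
A minimal k-NN classification sketch with NumPy; the feature and label arrays are hypothetical, not the data from the scatter plot above:

import numpy as np

def knn_classify(point, features, labels, k):
    # Euclidean distance from `point` to every training row.
    distances = np.sqrt(np.sum((features - point) ** 2, axis=1))
    # Majority vote among the k nearest labels.
    nearest = labels[np.argsort(distances)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical training data: (political posts, non-political posts).
features = np.array([[7, 16], [9, 18], [8, 15], [1, 1], [20, 3]])
labels = np.array(['Non-voter', 'Non-voter', 'Non-voter', 'Voter', 'Voter'])
print(knn_classify(np.array([8, 17]), features, labels, k=3))  # Non-voter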

15.7 Linear Regression - Sp17 Final Q4 Modified (Optional)

This scatter plot of a sample of 1,000 trips for New York taxis in January 2016 compares distance and cost. The regression line is shown. Two trips of the same length can vary in cost because of waiting times, special fees, taxes, tolls, tips, discounts, etc.

np.average(t.column("Distance")) = 3
np.std(t.column("Distance")) = 2
np.average(t.column("Cost")) = 13
np.std(t.column("Cost")) = 6
correlation(t, "Distance", "Cost") = 0.9

15.7.1 (a)

Convert a trip total cost of $9 to standard units.

Answer \((9-13)/6 = -2/3\)

15.7.2 (b)

What is the slope of the regression line for this sample in dollars per mile?

Answer \(0.9 * 6 / 2 = 2.7\)

15.7.3 (c)

What is the intercept of the regression line for this sample in dollars?

Answer \(13 - 2.7 * 3 = 4.9\)

15.7.4 (d)

If instead we fit a regression line to estimate distance in miles from total cost in dollars, what would be the slope of that line in miles per dollar? Write not enough info if it’s impossible to say.

Answer \(0.9 * 2 / 6 = 0.3\)

15.7.5 (e)

Circle one of (A) True, (B) False, or (C) Not Enough Info to describe the following statement:

The total cost values in this sample are normally distributed.

Answer False: the cost distribution is not symmetric; most values are small (roughly $0 to $20), with a long tail to the right.

15.7.6 (f)

Circle one of (A) True, (B) False, or (C) Not Enough Info to describe the following statement:

All of the total cost values in this sample are within 3 standard deviations of the mean.

Answer False: one value is about 40, which is larger than \(13 + 3 * 6 = 31\), i.e., more than 3 SDs above the mean.

15.7.7 (g)

Circle one of (A) True, (B) False, or (C) Not Enough Info to describe the following statement:

At least 88% of the total cost values in this sample are within 3 standard deviations of the mean.

Answer True: by Chebyshev’s inequality, at least \(1-\frac{1}{z^2} = 1 - \frac{1}{3^2} = \frac{8}{9} \approx 88.9\%\) of the values lie within 3 SDs of the mean for any distribution.

15.7.8 (h)

Circle one of (A) True, (B) False, or (C) Not Enough Info to describe the following statement:

The residual costs have a similar average magnitude for short trips (1 mile) and long trips (5+ miles).

Answer

False: there is less variability in cost for shorter trips, so the residuals for short trips have smaller average magnitude than those for long trips.

Recognizing Heteroscedasticity in Residuals

Sometimes the spread (variance) of residuals changes with the input variable \(x\). This phenomenon is called heteroscedasticity and is important to recognize because it affects model assumptions.


15.7.9 (i)

You compute a 95% confidence interval from this sample to estimate the height (fitted value) of the population regression line at 6 miles. Which one of the following could plausibly be the result?

  1. 5 to 7
  2. 7 to 19
  3. 12 to 14
  4. 15 to 35
  5. 24 to 26
Answer

The center of the confidence interval is the sample regression estimate at 6 miles. Since the sample is large (1,000 trips), most resampled estimates will be very close to this value, so the interval must be narrow: 24 to 26 is the plausible result. 15 to 35 is not plausible because a resampled regression line would almost never vary so much that it would pass through such extreme values.

Confidence Intervals Centered on Predictions Get Tighter with More Data

The center of a confidence interval is the predicted value itself. The more data (larger \(n\)) you have, the narrower (more precise) the interval becomes, meaning more confidence in your prediction.
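
For the mean of a large random sample, for instance, the CLT-based 95% interval has width about \(4 \cdot \frac{\sigma}{\sqrt{n}}\) (the same calculation as in section 15.5(b)), so the width shrinks like \(1/\sqrt{n}\) as the sample grows.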

15.8 More Bayes’ Rule - Fa20 Final Q3 Modified (Optional)

Mr. White is teaching a Chemistry class in Latimer Hall. His class has 100 undergraduate staff (uGSIs, tutors, etc.).

Mr. White learns that one of his staff is stealing blackboard erasers from his lecture hall when the building is closed overnight. The only way the thief could access the building overnight is with a key card that belongs to them.

Suppose Mr. White discovers that 10 of his staff have a key card.

15.8.1 (a)

If Mr. White randomly selects one of his staff, what is the probability that they are the eraser thief?

  1. (0.01 * 1) / (0.01 * 1 + 0.99 * 9/99)
  2. (0.1 * 1) / (0.1 * 1 + 0.9 * 1/99)
  3. 0.01
  4. 0.1
  5. 0.99 * 9/99
Answer Option 3: 0.01. We are told that Mr. White has 100 staff members, and exactly 1 of them is the thief: \(\frac{1}{100} = 0.01\)

15.8.2 (b)

If Mr. White randomly selects one of his staff, what is the probability that they are not the eraser thief and have a key card?

  1. (0.01 * 1) / (0.01 * 1 + 0.99 * 9/99)
  2. (0.1 * 1) / (0.1 * 1 + 0.9 * 1/99)
  3. 0.01
  4. 0.1
  5. 0.99 * 9/99
Answer Option 5: 0.99 * 9/99. 99 out of 100 staff are not the thief, giving the 0.99. Of those 99, only 9 have a key card, since one of the 10 key cards belongs to the thief, giving the 9/99.

15.8.3 (c)

At the next course staff meeting, Mr. White notices that one of his GSIs, Gus, has a key card sticking out of his wallet.

Given this information, what is the probability that Gus is the eraser thief?

  1. (0.01 * 1) / (0.01 * 1 + 0.99 * 9/99)
  2. (0.1 * 1) / (0.1 * 1 + 0.9 * 1/99)
  3. 0.01
  4. 0.1
  5. 0.99 * 9/99
Answer Option 1: (0.01 * 1) / (0.01 * 1 + 0.99 * 9/99). We want to find P(Gus is the thief \(|\) Gus has a key card). Using Bayes’ Rule, this equals: \(\cfrac{\text{P(Gus has a key card $|$ Gus is the thief)} * \text{P(Gus is the thief)}}{\text{P(Gus has a key card)}} = \cfrac{1 * 0.01}{\text{P(Gus has a key card)}}\). The probability that Gus has a key card if he is the thief is 1, since we are told the thief has a key card. The prior probability that Gus is the thief is \(\frac{1}{100} = 0.01\). Finally, the probability that Gus has a key card is the sum of two probabilities: Gus is the thief and has a key card, plus Gus is NOT the thief and has a key card. This is \(0.01 * 1 + 0.99 * 9/99\). Therefore, our answer is \(\cfrac{1 * 0.01}{0.01 * 1 + 0.99 * 9/99}\).

15.8.4 (d)

Mr. White is skeptical of his head GSI, Jesse. Prior to learning any information about Jesse’s key card access, Mr. White believes there is a 25% probability that Jesse is the eraser thief.

Suppose Mr. White later discovers that Jesse has a key card.

Given this new information, what is the probability that Jesse is the eraser thief?

  1. (0.01 * 1) / (0.01 * 1 + 0.99 * 9/99)
  2. (0.1 * 1) / (0.1 * 1 + 0.9 * 1/99)
  3. 0.01
  4. 0.1
  5. 0.99 * 9/99
How Changing Priors Affects Bayesian Posterior Probabilities

In Bayesian inference, changing the prior probability (like setting it to 25%) changes the posterior probabilities you calculate. The prior expresses your initial belief before seeing data, so adjusting it naturally changes your updated beliefs.

Answer \(\cfrac{0.25 * 1}{0.25 * 1 + 0.75 * 9/99}\), which has the same form as option 1 with the prior 0.01 replaced by 0.25. We want to find P(Jesse is the thief \(|\) Jesse has a key card). Using Bayes’ Rule: \(\cfrac{\text{P(Jesse has a key card $|$ Jesse is the thief)} * \text{P(Jesse is the thief)}}{\text{P(Jesse has a key card)}} = \cfrac{1 * 0.25}{0.25 * 1 + 0.75 * 9/99}\). The probabilities are found using logic similar to part (c), except that the prior probability of Jesse being the thief is now 25%.
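
A small Python helper makes the role of the prior explicit; the function name and structure are illustrative only:

def posterior_thief(prior):
    # Bayes' Rule: P(thief | has key card) for a given prior P(thief).
    joint_thief = prior * 1           # thief AND key card (the thief always has one)
    joint_not = (1 - prior) * 9 / 99  # not thief AND key card
    return joint_thief / (joint_thief + joint_not)

print(posterior_thief(0.01))  # part (c): 0.1
print(posterior_thief(0.25))  # part (d): approximately 0.786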