9  Discussion 09: The Bootstrap (From Summer 2025)

Slides

9.1 Mid-Semester Check In

What has been your favorite topic, assignment, lecture, or anything so far with the first half of the class done?

Answer Congrats on finishing the midterm!!!

If you have any concerns about your performance in the class so far, feel free to bring them up with your lab TA.

Keep Going After the Midterm
  • Don’t be discouraged if the midterm didn’t go as well as you hoped.
  • It’s only worth 25% of your grade, and labs are also worth 25%.
  • You still have the final exam and other assignments that can make a big difference.
  • A lot of your grade is still to be determined—keep pushing!

9.2 Facts About the Bootstrap

Note

Setup
We want to be able to produce an estimate of a particular population parameter of interest, say the median. However, we know that if we had gotten a different sample, then our estimate of the population median could have also been different.

Main Objective
If we were satisfied with a single number, we could simply take the sample median and call it our estimate of the population median. That is a valid point estimate, but the bootstrap lets us go further and generate a range of values that we believe contains the population parameter.

Method
Ideally, we would be able to take more samples from the population and find estimates for the population parameter in all of these samples. However, we are usually not able to resample from the original population due to resource constraints, necessitating the process of the bootstrap.

  1. Given a large, simple random sample from a population, resample from the original sample with replacement. The resample must have the same sample size as the original sample.
  2. Calculate the statistic for the resample and store it in a collection array, as we did in hypothesis testing.
  3. Repeat steps 1–2 many times to obtain an empirical distribution of your estimate.
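
The three steps above can be sketched in a few lines. This is a minimal illustration, not the course's reference code: it uses Python's `random` module instead of a `datascience` Table, and the data and the function name `bootstrap_medians` are made up for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical original sample: 200 made-up measurements.
sample = [random.gauss(50, 10) for _ in range(200)]

def bootstrap_medians(data, reps):
    """Return a list of medians, one per bootstrap resample of `data`."""
    medians = []
    for _ in range(reps):
        # Step 1: resample with replacement, same size as the original sample.
        resample = random.choices(data, k=len(data))
        # Step 2: compute the statistic and store it.
        medians.append(statistics.median(resample))
    # Step 3: after all repetitions, the stored values form the
    # empirical distribution of the estimate.
    return medians

medians = bootstrap_medians(sample, 1000)
print("sample median:", round(statistics.median(sample), 2))
print("bootstrap medians range from", round(min(medians), 2),
      "to", round(max(medians), 2))
```

`random.choices` draws with replacement, which plays the same role here as `tbl.sample(with_replacement=True)` does for a table.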

Making Sense of the Bootstrap
  • Many students find it tricky to understand why certain statistical methods work, not just how.
  • Mechanics of the bootstrap:
    • Sample with replacement from your data.
    • Keep the sample size the same as the original.
  • Why keep the same size?
    • The variability of a statistic depends on sample size.
    • Example: With 10 flips of a fair coin, you might see 30% or 70% heads. With 100 flips, results will usually be much closer to 50%.
  • Why sample with replacement?
    • Without replacement, a resample of the same size would just be the original sample in a different order, so the statistic would never change.
  • Key assumption: The sample represents the population. If the sample is biased, bootstrapping won’t magically fix it.
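
To see why sample size matters, one could simulate the coin example. The sketch below assumes a fair coin and repeats each experiment 2,000 times; the helper name `heads_proportion` is hypothetical.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def heads_proportion(num_flips):
    """Proportion of heads in `num_flips` tosses of a fair coin."""
    heads = sum(random.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# Repeat each experiment 2,000 times and look at how far the results spread.
props_10 = [heads_proportion(10) for _ in range(2000)]
props_100 = [heads_proportion(100) for _ in range(2000)]

range_10 = max(props_10) - min(props_10)
range_100 = max(props_100) - min(props_100)
print("10 flips:  proportions from", min(props_10), "to", max(props_10))
print("100 flips: proportions from", min(props_100), "to", max(props_100))
```

The proportions from 10 flips spread out far more than those from 100 flips. A resample of a different size would therefore show the wrong amount of variability for our original estimate.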

9.2.1 (a)

When we construct a bootstrap resample, what size should our resample from our original sample be?

Answer The resample should have the same sample size as our original sample. This is because our original estimate of some parameter is based on a certain sample size. If we changed the sample size, the distribution and variability of the estimate would change.

9.2.2 (b)

Why do we need to resample from our sample with replacement?

Answer If we sample without replacement while keeping the same sample size, every resample contains exactly the same values as the original sample, so the statistic never varies!
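
A small check makes this concrete. The five values below are made up, and the sketch uses Python's `random` module rather than a table.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

sample = [2.1, 0.9, 3.4, 2.8, 1.5]  # hypothetical original sample

# Drawing all 5 values WITHOUT replacement only shuffles the sample.
without = random.sample(sample, k=len(sample))

# Drawing 5 values WITH replacement can repeat some values and omit others.
with_repl = random.choices(sample, k=len(sample))

print(sorted(without) == sorted(sample))  # True: same values, same statistic
print(sorted(with_repl))
```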

9.2.3 (c)

When we conduct a bootstrap resample, what is the underlying assumption/reasoning for resampling from our sample? Why does it work?

Answer The underlying assumption is that our sample looks similar to our population — that is, the sample is representative of what the population looks like. The validity of the bootstrap is based on this assumption because if the sample is unrepresentative of the population, we do not actually end up with a good picture of what range of values our estimate could take on.

9.3 Thirsty

9.3.1 Warm Up

What is the difference between a parameter and a statistic? Which of the two is random?

Answer A parameter is a property of the population, so it is fixed and does not change. A statistic, on the other hand, is calculated from a sample. Because the sample is drawn at random, the statistic is random: a different sample could give a different value. Typically, we use statistics to estimate population parameters.
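
The distinction can be illustrated with a simulated population. Everything below is an assumption made for the example: in practice we almost never have access to the whole population, which is exactly why we estimate.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values.
population = [random.gauss(2.5, 1) for _ in range(10000)]

# The parameter is computed from the whole population: one fixed number.
parameter = statistics.median(population)

# Each random sample gives its own statistic, so the statistic varies.
statistic_1 = statistics.median(random.sample(population, 150))
statistic_2 = statistics.median(random.sample(population, 150))

print("parameter:  ", round(parameter, 3))
print("statistic 1:", round(statistic_1, 3))
print("statistic 2:", round(statistic_2, 3))
```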

You are interested in investigating the liters of water consumed every day by UC Berkeley students. In particular, you want to study the proportion of students drinking less than 3 liters of water per day. You contact 150 random students from the directory and obtain the amounts of water each one of them drinks, storing them in the table water. The table has 1 column, amount, which stores the number of liters of water drunk by each student.

Code
import numpy as np
from datascience import *
%matplotlib inline

amounts = np.random.normal(loc=2.5, scale=1, size=150)
amounts = np.clip(amounts, 0.5, 7)

water = Table().with_columns(
    "amount", amounts
)

9.3.2 (a)

What is the parameter and what is the statistic in this scenario?

Bootstrap Practice & Language
  • Parameter vs. Statistic:
    • A parameter describes the population.
    • A statistic comes from the sample.
  • Practice steps:
    1. Take a bootstrap resample.
    2. Compute the same statistic (e.g., mean, proportion).
    3. Repeat many times to build a distribution.
  • Visualize your results:
    • Think carefully about which graph type fits best.
    • How many variables do you have? Are they categorical or numerical?
Answer Population parameter: The proportion of UC Berkeley students who drink less than 3 liters of water per day.
Statistic: The proportion of students in the sample who drink less than 3 liters of water per day.

9.3.3 (b)

Write a line of code to calculate the proportion of students in your sample who drank less than 3 liters of water per day.

Answer
np.mean(water.column("amount") < 3)
0.70666666666666667
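
Why does taking the mean of a comparison give a proportion? Each comparison is `True` or `False`, which count as 1 and 0 when averaged. A plain-Python sketch with five made-up amounts:

```python
amounts = [1.2, 3.5, 2.8, 4.1, 0.9]  # hypothetical liters per day for 5 students

# Each comparison is True (counts as 1) or False (counts as 0).
less_than_3 = [amount < 3 for amount in amounts]

# Averaging the 1s and 0s gives the fraction of True values.
proportion = sum(less_than_3) / len(less_than_3)

print(less_than_3)  # [True, False, True, False, True]
print(proportion)   # 0.6
```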

9.3.4 (c)

Write a line of code to perform a single bootstrap resample of the data stored in the water table.

Answer
water.sample(water.num_rows, with_replacement = True)
amount
0.907068
0.678507
2.42839
1.40353
1.5887
2.30076
1.40353
2.75515
2.1995
2.0294

... (140 rows omitted)


9.3.5 (d)

Fill in the following blanks to conduct 10,000 bootstrap resamples of your data, calculating the proportion of students in each resample that drink less than 3 liters of water per day, then plotting the distribution of those proportions using an appropriate visualization.

proportions = __________________________
for i in __________________________:
    resampled_table = __________________________
    resampled_statistic = __________________________
    proportions = __________________________
proportions_table = Table().with_column("Resampled proportions", proportions)
proportions_table.__________________________
Answer
proportions = make_array()
for i in np.arange(10000):
    resampled_table = water.sample(water.num_rows, with_replacement=True)
    resampled_statistic = np.mean(resampled_table.column("amount") < 3)
    proportions = np.append(proportions, resampled_statistic)
proportions_table = Table().with_column("Resampled proportions", proportions)
proportions_table.hist("Resampled proportions")

9.4 Tennis Time

Samiksha is interested in exploring the heights of women’s tennis players. She has collected a sample of 100 heights of professional women’s tennis players and wants to use this sample to estimate the true interquartile range (IQR) of all heights of professional women’s tennis players.

We define the interquartile range (IQR) to be: 75th percentile - 25th percentile.
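
As a small worked example of this definition, the sketch below uses a hypothetical helper `pctl`. It follows the textbook-style definition of a percentile (the smallest value that is at least as large as p% of the data). The course's `percentile` function is assumed to behave the same way on this input. The ten heights are made up.

```python
import math

def pctl(p, values):
    """Hypothetical helper: smallest value at least as large as p% of `values`."""
    ordered = sorted(values)
    if p == 0:
        return ordered[0]
    index = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[index]

# Ten made-up heights in centimeters.
heights_cm = [160, 165, 168, 170, 172, 175, 178, 180, 183, 190]

q25 = pctl(25, heights_cm)  # 168
q75 = pctl(75, heights_cm)  # 180
print(q75 - q25)            # 12
```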

Code
heights = np.random.normal(loc=175, scale=7, size=100)

tennis = Table().with_columns(
    "Height (cm)", heights
)

9.4.1 (a)

In order to construct a 99% confidence interval for the IQR, what should our upper and lower endpoints be in terms of percentiles?

Answer

Our lower endpoint should be the 0.5th percentile and the upper endpoint should be the 99.5th percentile.

Confidence Intervals
  • An n% confidence interval captures the middle n% of the bootstrap distribution.
    • That leaves (100 − n)% outside.
    • Half is on each side → (100 − n)/2% in each tail.
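  • Be aware: the word percentile has two uses in this problem:
    • To find the IQR (the 25th and 75th percentiles of the heights in a resample).
    • To compute confidence intervals (the 0.5th and 99.5th percentiles of the bootstrapped IQRs).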
  • Each CI comes from a sample. To build many CIs, we’d need many samples from the population.
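
The tail arithmetic above can be written as a short function. The name `ci_percentiles` is hypothetical and is not part of the `datascience` library.

```python
def ci_percentiles(confidence_level):
    """Return the (lower, upper) percentiles for the middle `confidence_level`%."""
    tail = (100 - confidence_level) / 2  # percent left out on each side
    return tail, 100 - tail

print(ci_percentiles(95))  # (2.5, 97.5)
print(ci_percentiles(99))  # (0.5, 99.5)
```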

9.4.2 (b)

Define a function sa_iqr that constructs a 99% confidence interval for the IQR and returns an array containing the left endpoint and right endpoint of the 99% confidence interval in that order. The function takes in the following arguments:

  • tbl: A one-column table consisting of a random sample from the population; you can assume this sample is large.
  • reps: The number of bootstrap repetitions.

To find the 25th and 75th percentile of an array, you can use the percentile function.

def sa_iqr(tbl, reps):
    stats = __________________________
    for _____________________________:
        resample_col = ________________________________
        new_iqr = _____________________________________________
        stats = ________________________________
    left_end = _____________________________
    right_end = ____________________________
    return ____________________________
Answer
def sa_iqr(tbl, reps):
    stats = make_array()
    for i in np.arange(reps):
        resample_col = tbl.sample().column(0)
        new_iqr = percentile(75, resample_col) - percentile(25, resample_col)
        stats = np.append(stats, new_iqr)
    left_end = percentile(0.5, stats)
    right_end = percentile(99.5, stats)
    return make_array(left_end, right_end)
sa_iqr(tennis, 100)
array([  6.09581774,  13.19718855])

9.4.3 (c)

Say Samiksha recruited 500 of her friends to perform the same bootstrapping process she did. In other words, each of her friends drew a large, random sample of 100 heights from the population of professional women’s tennis players and constructed their own 99% confidence intervals. Approximately how many of these CIs do we expect to contain the actual IQR for the heights of professional women’s tennis athletes?

Note how in this example, we obtain different random samples from the population for each confidence interval, rather than each person re-using the same original sample. Why is this distinction important?

Answer

We interpret a 99% confidence interval to mean that we are 99% confident in the process used to construct that given interval. In other words, 99% of the time we use this process, we expect to construct an interval that contains the true population parameter. We have 500 CIs, each at a 99% confidence level, and 500 * 0.99 = 495, so we expect about 495 of these CIs to contain the actual IQR of heights.

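Note that the explanation above applies only if each confidence interval is built from a different random sample, NOT if the same random sample is reused for all 500 confidence intervals. Intervals bootstrapped from a single sample are all nearly identical, so they would tend to either all contain the true IQR or all miss it.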
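
One way to check this reasoning is a simulation in which the population is known. The sketch below assumes heights follow a normal distribution with mean 175 cm and SD 7 cm, so the true IQR is about 2 × 0.6745 × 7 ≈ 9.44 cm. The helper names are hypothetical, and only 200 bootstrap repetitions per interval are used to keep the run short.

```python
import math
import random

random.seed(4)  # fixed seed so the sketch is reproducible

def pctl(p, values):
    """Hypothetical helper: smallest value at least as large as p% of `values`."""
    ordered = sorted(values)
    if p == 0:
        return ordered[0]
    return ordered[math.ceil(p / 100 * len(ordered)) - 1]

def iqr(values):
    """75th percentile minus 25th percentile."""
    return pctl(75, values) - pctl(25, values)

def bootstrap_ci(sample, reps):
    """99% bootstrap percentile interval for the IQR of `sample`."""
    stats = [iqr(random.choices(sample, k=len(sample))) for _ in range(reps)]
    return pctl(0.5, stats), pctl(99.5, stats)

# Assumed population: Normal(175, 7), so the true IQR is known.
true_iqr = 2 * 0.6745 * 7  # about 9.44 cm

covered = 0
for _ in range(500):  # one fresh random sample per friend
    sample = [random.gauss(175, 7) for _ in range(100)]
    left, right = bootstrap_ci(sample, 200)
    if left <= true_iqr <= right:
        covered += 1

print(covered, "of 500 intervals contain the true IQR")
```

The count will not be exactly 495. Bootstrap intervals reach their stated confidence level only approximately, and 200 repetitions give a rough estimate of the 0.5th and 99.5th percentiles. Still, most of the 500 intervals should contain the true IQR, and each one comes from its own sample.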