Now that the first half of the class is done, what has been your favorite topic, assignment, lecture, or anything else so far?
If you have any concerns about your performance in the class so far, feel free to bring them up with your lab TA.
When we construct a bootstrap resample, what size should our resample from our original sample be?
Why do we need to resample from our sample with replacement?
When we conduct a bootstrap resample, what is the underlying assumption/reasoning for resampling from our sample? Why does it work?
What is the difference between a parameter and a statistic? Which of the two is random?
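The resampling questions above can be explored with a small experiment. The following is a minimal sketch in plain Python (standard library only, not the datascience package); the eight measurements and the helper name `proportion_below` are made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample of 8 measurements (made-up values).
sample = [1.2, 2.5, 3.1, 0.9, 2.2, 4.0, 2.8, 1.7]

def proportion_below(values, cutoff):
    """Proportion of values strictly below cutoff (the statistic)."""
    return sum(v < cutoff for v in values) / len(values)

# Without replacement at full size: this is only a shuffle of the
# original sample, so the statistic cannot change.
without = random.sample(sample, k=len(sample))
same_values = sorted(without) == sorted(sample)

# With replacement at full size: some values repeat and others drop out,
# so the statistic can differ from one resample to the next.
resamples = [random.choices(sample, k=len(sample)) for _ in range(1000)]
distinct_stats = {proportion_below(r, 3) for r in resamples}

print(same_values)           # the shuffled copy holds the same values
print(len(distinct_stats))   # number of different statistic values seen
```

Resampling at the original sample size keeps the variability of the statistic comparable to that of a fresh sample of the same size, which is why the sketch uses `k=len(sample)` in both cases.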
You are interested in investigating the liters of water consumed every day by UC Berkeley students. In particular, you want to study the proportion of students drinking less than 3 liters of water per day. You contact 150 random students from the directory and obtain the amounts of water each one of them drinks, storing them in the table water. The table has 1 column, amount, which stores the number of liters of water drunk by each student.
import numpy as np
from datascience import *
%matplotlib inline
amounts = np.random.normal(loc=2.5, scale=1, size=150)
amounts = np.clip(amounts, 0.5, 7)
water = Table().with_columns(
    "amount", amounts
)
What is the parameter and what is the statistic in this scenario?
Write a line of code to calculate the proportion of students in your sample who drank less than 3 liters of water per day.
np.mean(water.column("amount") < 3)
0.70666666666666667
Write a line of code to perform a single bootstrap resample of the data stored in the water table.
water.sample(water.num_rows, with_replacement=True)
amount |
---|
0.907068 |
0.678507 |
2.42839 |
1.40353 |
1.5887 |
2.30076 |
1.40353 |
2.75515 |
2.1995 |
2.0294 |
... (140 rows omitted)
Fill in the following blanks to conduct 10,000 bootstrap resamples of your data, calculating the proportion of students in each resample that drink less than 3 liters of water per day, then plotting the distribution of those proportions using an appropriate visualization.
proportions = __________________________
for i in __________________________:
    resampled_table = __________________________
    resampled_statistic = __________________________
    proportions = __________________________
proportions_table = Table().with_column("Resampled proportions", proportions)
proportions_table.__________________________
proportions = make_array()
for i in np.arange(10000):
    resampled_table = water.sample(water.num_rows, with_replacement=True)
    resampled_statistic = np.mean(resampled_table.column("amount") < 3)
    proportions = np.append(proportions, resampled_statistic)
proportions_table = Table().with_column("Resampled proportions", proportions)
proportions_table.hist("Resampled proportions")
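To see what the histogram of resampled proportions should be centered on, the same bootstrap loop can be sketched without the datascience package. This is an illustration under stated assumptions: the `amounts` list below is a freshly simulated stand-in for the water table's column (so its values differ from the ones shown earlier), `prop_under_3` is a helper name chosen here, and only 2,000 resamples are drawn to keep it fast.

```python
import random

random.seed(8)  # fixed seed so the sketch is repeatable

# Stand-in for the "amount" column: 150 draws from a normal(2.5, 1)
# distribution, clipped to the range [0.5, 7] like the setup code.
amounts = [min(max(random.gauss(2.5, 1), 0.5), 7) for _ in range(150)]

def prop_under_3(values):
    """Proportion of values strictly below 3 liters."""
    return sum(v < 3 for v in values) / len(values)

observed = prop_under_3(amounts)  # the statistic from the original sample

proportions = []
for _ in range(2000):
    # One bootstrap resample: same size as the sample, with replacement.
    resample = random.choices(amounts, k=len(amounts))
    proportions.append(prop_under_3(resample))

center = sum(proportions) / len(proportions)
print(observed, center)
```

The average of the resampled proportions should land close to the observed sample proportion, not necessarily close to the unknown population proportion; the bootstrap distribution describes variability around the sample's own statistic.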
Samiksha is interested in exploring the heights of women’s tennis players. She has collected a sample of 100 heights of professional women’s tennis players and wants to use this sample to estimate the true interquartile range (IQR) of all heights of professional women’s tennis players.
We define the interquartile range (IQR) to be: 75th percentile - 25th percentile.
heights = np.random.normal(loc=175, scale=7, size=100)
tennis = Table().with_columns(
    "Height (cm)", heights
)
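As a quick check on the IQR definition above, here is a minimal sketch in plain Python. The `percentile` helper is written for this example and is only meant to mirror the course's function; it assumes the convention that the pth percentile is the smallest value that is at least as large as p% of the data. The ten heights are made up.

```python
import math

def percentile(p, values):
    """Smallest value at least as large as p% of the values.

    Assumed convention for this sketch; other libraries interpolate
    between neighboring values and can give slightly different answers.
    """
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based position in sorted order
    return ordered[max(rank, 1) - 1]

# Made-up heights in cm, for illustration only.
example_heights = [160, 165, 168, 170, 172, 175, 178, 180, 183, 190]

q1 = percentile(25, example_heights)  # 3rd smallest value: 168
q3 = percentile(75, example_heights)  # 8th smallest value: 180
iqr = q3 - q1
print(iqr)  # 12
```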
In order to construct a 99% confidence interval for the IQR, what should our upper and lower endpoints be in terms of percentiles?
Our lower endpoint should be the 0.5th percentile and the upper endpoint should be the 99.5th percentile.
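The endpoints come from splitting the leftover 1% evenly between the two tails. A small sketch of that arithmetic follows; the function name `ci_percentiles` is hypothetical and not part of any library.

```python
def ci_percentiles(confidence_level):
    """Return the (lower, upper) percentiles that bound the middle
    confidence_level percent of a bootstrap distribution."""
    tail = (100 - confidence_level) / 2  # percent left out on each side
    return tail, 100 - tail

print(ci_percentiles(99))  # (0.5, 99.5)
print(ci_percentiles(95))  # (2.5, 97.5)
```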
Define a function sa_iqr that constructs a 99% confidence interval for the IQR and returns an array containing the left endpoint and right endpoint of the 99% confidence interval in that order. The function takes in the following arguments:
tbl: A one-column table consisting of a random sample from the population; you can assume this sample is large.
reps: The number of bootstrap repetitions.
To find the 25th and 75th percentile of an array, you can use the percentile function.
def sa_iqr(tbl, reps):
    stats = __________________________
    for _____________________________:
        resample_col = ________________________________
        new_iqr = _____________________________________________
        stats = ________________________________
    left_end = _____________________________
    right_end = ____________________________
    return ____________________________
def sa_iqr(tbl, reps):
    stats = make_array()
    for i in np.arange(reps):
        resample_col = tbl.sample().column(0)
        new_iqr = percentile(75, resample_col) - percentile(25, resample_col)
        stats = np.append(stats, new_iqr)
    left_end = percentile(0.5, stats)
    right_end = percentile(99.5, stats)
    return make_array(left_end, right_end)
sa_iqr(tennis, 100)
array([ 6.09581774, 13.19718855])
Say Samiksha recruited 500 of her friends to perform the same bootstrapping process she did. In other words, each of her friends drew a large, random sample of 100 heights from the population of professional women’s tennis players and constructed their own 99% confidence interval. Approximately how many of these CIs do we expect to contain the actual IQR for the heights of professional women’s tennis athletes?
Note how in this example, we obtain different random samples from the population for each confidence interval, rather than each person re-using the same original sample. Why is this distinction important?
We interpret a 99% confidence interval to mean that we are 99% confident in the process used to construct that interval. In other words, 99% of the time we use this process, we expect to construct an interval that contains the true population parameter. Since we have 500 CIs, each at a 99% confidence level, and 500 * 0.99 = 495, we expect about 495 of these CIs to contain the actual IQR of heights.
Note that the explanation above applies only when each confidence interval is built from a different random sample, NOT when the same random sample is used for all 500 confidence intervals.
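The figure of 495 is an expectation, not a guarantee. The sketch below simulates it under an idealized assumption that is not stated in the original: each of the 500 intervals, built from its own independent sample, contains the true IQR with probability exactly 0.99. The function name `count_covering` is made up for this example.

```python
import random

random.seed(500)  # fixed seed so the sketch is repeatable

def count_covering(n_intervals=500, coverage=0.99):
    """Simulate how many of n_intervals independent confidence intervals
    contain the true parameter, assuming each one does so with
    probability `coverage` (an idealization)."""
    return sum(random.random() < coverage for _ in range(n_intervals))

# Repeat the 500-friends experiment many times to see the typical count.
counts = [count_covering() for _ in range(2000)]
average_count = sum(counts) / len(counts)

print(average_count)             # close to 500 * 0.99 = 495
print(min(counts), max(counts))  # individual experiments vary around 495
```

The independence assumption is what the note above is about: if all 500 friends reused the same original sample, their intervals would differ only through bootstrap randomness, so they would tend to succeed or fail together and the count would not behave like this simulation.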