from datascience import *
import numpy as np
%matplotlib inline
plays = np.random.normal(loc=950000, scale=150000, size=50)
plays = np.clip(plays, 500000, 1500000)
tlc_songs = Table().with_columns(
    "Plays", plays
)
You are a super fan of the girl group TLC and are interested in estimating the average number of plays their songs have online. Based on a random sample of 50 songs, you use the Central Limit Theorem to generate an 80% confidence interval of [700,000, 1,200,000] for this parameter. Is each of the following statements true or false? (Fun Fact: Generally, \(n \geq 30\) is considered the minimum sample size for the Central Limit Theorem to take effect!)
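The interval in the problem comes from the Central Limit Theorem rather than the bootstrap. As a rough sketch of how such an interval could be computed, a CLT-style 80% interval is the sample mean plus or minus about 1.28 standard deviations of the sample mean (the seed, the simulated sample, and the 1.28 multiplier are illustrative assumptions, not values from the problem):

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for reproducibility
plays = rng.normal(loc=950000, scale=150000, size=50)  # hypothetical sample

n = len(plays)
center = plays.mean()
# For an 80% interval, the middle 80% of the normal curve lies within
# about 1.28 standard deviations of its center.
halfwidth = 1.28 * plays.std() / np.sqrt(n)

left, right = center - halfwidth, center + halfwidth
print(left, right)
```

Because the standard deviation is divided by \(\sqrt{n}\), the interval is much narrower than the spread of individual play counts.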
The value of our population parameter changes depending on our sampling process.

False. A population parameter is a fixed (if unknown) number describing the population itself. Our sampling process changes the statistic we compute, not the parameter.
The empirical distribution of any statistic we choose will be roughly normal based on the Central Limit Theorem, but it requires our population to have a normal distribution to begin with.

False. The Central Limit Theorem applies only to sums and averages of large random samples, not to any statistic we choose, and it holds regardless of the shape of the population distribution.
If we generate a 95% confidence interval using the same sample, the interval will be narrower than the original confidence interval because we are more certain of our results.
False. Using a 95% confidence level would result in a wider interval than an 80% confidence level. In fact, the 95% confidence interval will envelop the original 80% confidence interval. The “increased confidence” comes from having a wider interval.
arr = make_array()
for i in np.arange(1000):
    arr = np.append(arr, np.mean(tlc_songs.sample().column(0)))
print(f'95% Confidence Interval for average plays: [{percentile(2.5, arr)}, {percentile(97.5, arr)}]')
print(f'80% Confidence Interval for average plays: [{percentile(10, arr)}, {percentile(90, arr)}]')
95% Confidence Interval for average plays: [889543.8503444515, 980861.2931898981]
80% Confidence Interval for average plays: [904682.6209657554, 967047.156062074]
Using the same process to generate another 80% confidence interval, there is an 80% chance that the next confidence interval you generate will contain the true average number of plays for TLC songs.

True. Before an interval is computed, the process has an 80% chance of producing one that captures the parameter; the chance statement applies to the process, not to any particular interval we have already computed.
There is an 80% chance that a confidence interval we generated contains the true average number of plays for TLC songs.
False. Once we have calculated a confidence interval, it is fixed: it either contains the parameter or it does not. There is no chance involved.
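The distinction between these two statements can be seen in a simulation: repeat the whole sampling-and-interval procedure many times, and about 80% of the resulting intervals capture the parameter, even though each individual interval either contains it or does not. A minimal NumPy sketch, where the population parameters and seed are assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)  # assumed seed
true_mean, sd, n = 950000, 150000, 50

hits = 0
trials = 1000
for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=sd, size=n)
    # 80% CLT interval: mean +/- about 1.28 standard deviations of the mean
    halfwidth = 1.28 * sample.std() / np.sqrt(n)
    if abs(sample.mean() - true_mean) <= halfwidth:
        hits += 1

coverage = hits / trials
print(coverage)  # close to 0.80
```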
80% of TLC’s songs have between 700,000 and 1,200,000 plays.
False. Our confidence interval is estimating the average number of plays but makes absolutely NO claim about what percentage of songs have between 700,000 and 1,200,000 plays. A \(n\)% confidence interval contains \(n\)% of the statistics you simulated, but it does not suggest that \(n\)% of the population is in that interval!
tlc_songs.where(0, are.between(700000, 1200000)).num_rows / tlc_songs.num_rows
0.96
The original sample mean you obtained was approximately 950,000 plays.

True. A confidence interval built with the Central Limit Theorem is centered at the sample mean, and the center of [700,000, 1,200,000] is (700,000 + 1,200,000) / 2 = 950,000.
Researchers studying annual incomes in a city take a random sample of 400 households and create 10,000 bootstrap samples from the original sample. They then use the bootstrap percentile method to construct an approximate 95% confidence interval for the mean household income in the city. This is the method we have always used in Data 8, and you can assume that it works fine in this situation. The 95% confidence interval goes from $60,000 to $62,000.
sample_size = 400
incomes = np.random.normal(loc=61000, scale=2000, size=sample_size)
incomes = np.clip(incomes, 50000, 75000)
households = Table().with_columns(
    "Income", incomes
)
Select all statements that must be true based on the information above.
Statements C and D are correct. The 95% bootstrap confidence interval estimates that the population mean household income is between $60,000 and $62,000, though there’s a chance this range misses the true mean. A 90% confidence interval would be narrower than a 95% interval, so its endpoints would fall inside the $60,000 to $62,000 range. Statements A and B incorrectly describe the distribution of individual household incomes rather than the mean.
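The nesting claim behind statement D is a general property of percentiles: the middle 90% of any collection of bootstrapped means lies inside its middle 95%. A quick check with NumPy, using a stand-in array of means (the seed and the simulated means are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2024)  # assumed seed
boot_means = rng.normal(61000, 500, size=10000)  # stand-in bootstrapped means

lo95, hi95 = np.percentile(boot_means, [2.5, 97.5])
lo90, hi90 = np.percentile(boot_means, [5, 95])

# The 90% interval is narrower and sits inside the 95% interval.
print(lo95 <= lo90 and hi90 <= hi95)  # True
```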
households.where(0, are.between(60000, 62000)).num_rows / households.num_rows
0.3925
resampled_means = make_array()
for i in np.arange(10000):
    resampled_means = np.append(resampled_means, np.mean(households.sample().column(0)))

print(f'90% Confidence Interval for average household income: [{percentile(5, resampled_means)}, {percentile(95, resampled_means)}]')
The array resampled_means contains the 10,000 bootstrapped means. Complete the code below so that the last line evaluates to the left and right endpoints of an approximate 90% confidence interval for the mean household income in the city.
left_end = ________________________________
right_end = _______________________________
make_array(left_end, right_end)
left_end = percentile(5, resampled_means)
right_end = percentile(95, resampled_means)
make_array(left_end, right_end)
A histogram of the salaries of all employees of the City of San Francisco in 2014 appears below.
The mean of the entire population is about $75,500 and the standard deviation is about $51,700.
Suppose we took a large random sample, of size 5000, from this population.
pop_mean = 75500
pop_sd = 51700
sample_size = 5000

salaries = np.random.normal(loc=pop_mean, scale=pop_sd, size=sample_size)
salaries = np.clip(salaries, 0, None)
sf_salaries = Table().with_columns(
    "Salary", salaries
)
We plot a histogram of the sample mentioned above. Fill in the oval next to the histogram below that is the most likely to have been generated in this way.
Suppose we take the sample mentioned above, and we resample from it with replacement. We do this 10,000 times and obtain 10,000 resamples. We compute the mean of each resample and then plot a histogram of these 10,000 means. Fill in the oval next to the histogram below that is the most likely to have been generated in this way.
The top-left histogram shows a bell-shaped, symmetric distribution centered near $75,500, the population mean, which is exactly what the Central Limit Theorem predicts for 10,000 resampled means.
resamples = make_array()
for i in np.arange(10000):
    resamples = np.append(resamples, np.mean(sf_salaries.sample().column(0)))
Table().with_columns("Resamples", resamples).hist("Resamples")
Wayland, Anirra, and Monica are arguing about whether we can use the Central Limit Theorem to help think about what the histogram should look like in part (b).
Who is right?
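Whatever each person's specific argument, the Central Limit Theorem's prediction for part (b) can be checked directly: because each resampled mean is an average of 5,000 draws, the 10,000 means should be roughly normal with an SD close to (sample SD)/√5000, even though individual salaries are skewed. A sketch in plain NumPy, using a hypothetical right-skewed salary population (the gamma parameters and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
n = 5000

# Hypothetical right-skewed salaries with mean around $75,500 (assumption).
sample = rng.gamma(shape=2.0, scale=37750, size=n)

# Bootstrap: resample with replacement and record each resample's mean.
means = np.array([rng.choice(sample, size=n, replace=True).mean()
                  for _ in range(2000)])

# CLT prediction for the SD of the resampled means.
predicted_sd = sample.std() / np.sqrt(n)
print(means.std(), predicted_sd)  # the two should be close
```

The agreement between the empirical SD of the means and the CLT prediction is the reason the histogram in part (b) looks normal while the population histogram does not.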