11 Discussion 11: More CLT (From Summer 2025)


11.1 CLT with TLC

You are a super fan of the girl group TLC and are interested in estimating the average number of plays their songs have online. Using the Central Limit Theorem and a random sample of 50 songs, you generate an 80% confidence interval for this parameter: [700000, 1200000]. Is each of the following statements true or false? (Fun Fact: Generally, $n \geq 30$ is considered the minimum sample size for the Central Limit Theorem to take effect!)

Code

from datascience import *
import numpy as np
%matplotlib inline

plays = np.random.normal(loc=950000, scale=150000, size=50)
plays = np.clip(plays, 500000, 1500000)

tlc_songs = Table().with_columns(
    "Plays", plays
)

11.1.1 (a)

The value of our population parameter changes depending on our sampling process.

Answer

False. The population parameter is fixed: it is determined by the entire population, not by which sample we happen to draw. The sample statistic, on the other hand, depends on the sample and varies from one sample to the next. This is why we can make probability statements about a confidence interval before we generate it, but not after.
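To see the distinction concretely, here is a minimal sketch using a made-up population of play counts. The numbers and the names `population`, `population_mean`, and `sample_means` are assumptions for illustration, not real TLC data. The population mean is computed once and never changes, while the sample mean differs from draw to draw.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Hypothetical population of play counts (made-up numbers, not real TLC data).
population = rng.gamma(shape=4.0, scale=250_000, size=10_000)

# The parameter: computed from the whole population, so it is one fixed number.
population_mean = np.mean(population)

# The statistic: recomputed for each random sample of 50 songs, so it varies.
sample_means = np.array([
    np.mean(rng.choice(population, size=50, replace=False))
    for _ in range(5)
])

print("Population mean (fixed):", round(population_mean))
print("Sample means (vary):", np.round(sample_means))
```

Rerunning the sampling step with a different seed changes `sample_means` but leaves `population_mean` untouched as long as the population itself is held fixed.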

11.1.2 (b)

The empirical distribution of any statistic we choose will be roughly normal based on the Central Limit Theorem, but it requires our population to have a normal distribution to begin with.

Answer

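False, for two reasons. First, the Central Limit Theorem applies to the sum or average of a large random sample, not to any statistic we choose (it says nothing about the median or the maximum, for example). Second, it does not require a normal population: the probability distribution of the sum or average of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.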
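The claim that the population does not need to be normal can be checked with a small simulation. The sketch below assumes a hypothetical right-skewed (exponential) population and a helper `skewness` written for this example. It compares the skew of the population with the skew of the means of many samples of size 50.

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(values):
    """Average cubed z-score: near 0 for symmetric data, positive for a right tail."""
    z = (values - np.mean(values)) / np.std(values)
    return np.mean(z ** 3)

# Hypothetical right-skewed population (exponential), clearly not normal.
population = rng.exponential(scale=1_000_000, size=100_000)

# Means of 5,000 random samples of size 50, drawn with replacement.
samples = rng.choice(population, size=(5_000, 50), replace=True)
sample_means = samples.mean(axis=1)

print("Skewness of population:  ", round(skewness(population), 2))
print("Skewness of sample means:", round(skewness(sample_means), 2))
```

The population is strongly skewed, while the sample means are much closer to symmetric. A larger sample size would pull the second number closer to 0.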

11.1.3 (c)

If we generate a 95% confidence interval using the same sample, the interval will be narrower than the original confidence interval because we are more certain of our results.

Answer

False. Using a 95% confidence level would result in a wider interval than an 80% confidence level. In fact, the 95% confidence interval will envelop the original 80% confidence interval. The “increased confidence” comes from having a wider interval.

arr = make_array()
for i in np.arange(1000):
  arr = np.append(arr, np.mean(tlc_songs.sample().column(0)))

print(f'95% Confidence Interval for average plays: [{percentile(2.5, arr)}, {percentile(97.5, arr)}]')
print(f'80% Confidence Interval for average plays: [{percentile(10, arr)}, {percentile(90, arr)}]')

95% Confidence Interval for average plays: [889543.8503444515, 980861.2931898981]
80% Confidence Interval for average plays: [904682.6209657554, 967047.156062074]

11.1.4 (d)

Using the same process to generate another 80% confidence interval, there is an 80% chance that the next confidence interval you generate will contain the true average number of plays for TLC songs.

Answer

True. Before we create the confidence interval, there is an 80% chance that the confidence interval we create next will contain the true population parameter. Confidence refers to the probability associated with our process of generating confidence intervals, not the probability associated with a single well-defined interval.
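One way to see what "80% chance" means for the process is to repeat it many times on a population whose mean is known. The sketch below is an illustration under stated assumptions: the population is made up, and each interval is built as sample mean ± 1.2816 standard errors, which is one common CLT-based construction and not necessarily the one used to produce the interval in the problem.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population with a known mean, so coverage can be checked.
population = rng.gamma(shape=4.0, scale=250_000, size=100_000)
true_mean = np.mean(population)

n = 50            # songs per sample
z_80 = 1.2816     # about 80% of a normal curve lies within 1.2816 SDs of its center
repetitions = 5_000

hits = 0
for _ in range(repetitions):
    sample = rng.choice(population, size=n, replace=True)
    se = np.std(sample) / np.sqrt(n)      # estimated SD of the sample mean
    left = np.mean(sample) - z_80 * se
    right = np.mean(sample) + z_80 * se
    # Each finished interval either contains the true mean or it does not.
    hits += int(left <= true_mean <= right)

coverage = hits / repetitions
print("Fraction of intervals containing the true mean:", coverage)
```

The fraction should land near 0.8. Every individual interval is simply a hit or a miss, and the 80% describes how often the process produces hits.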

11.1.5 (e)

There is an 80% chance that a confidence interval we generated contains the true average number of plays for TLC songs.

Answer

False. Once we have calculated a confidence interval, it is fixed - it either contains the parameter or it does not. There is no chance involved.

Confidence Intervals vs. Confidence Level

When working with confidence intervals (CIs), remember:
- Confidence Level refers to the process of creating the interval, not the interval itself after it’s calculated.
- Example: flipping a coin
  - There’s randomness while the coin is in the air.
  - Once it lands, it’s either Heads or Tails — no randomness remains.
The CI Guide is a helpful resource for understanding these nuances.

11.1.6 (f)

80% of TLC’s songs have between 700,000 and 1,200,000 plays.

Answer

False. Our confidence interval estimates the average number of plays but makes absolutely NO claim about what percentage of songs have between 700,000 and 1,200,000 plays. An $n$% bootstrap confidence interval contains the middle $n$% of the statistics you simulated, but it does not suggest that $n$% of the population is in that interval!

# Proportion of sampled songs with between 700,000 and 1,200,000 plays
tlc_songs.where(0, are.between(700000, 1200000)).num_rows / tlc_songs.num_rows

0.96

11.1.7 (g)

The original sample mean you obtained was approximately 950,000 plays.

Answer

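True. A confidence interval built with the Central Limit Theorem is symmetric about the sample mean, so the sample mean is the midpoint of the interval: (700,000 + 1,200,000) / 2 = 950,000. Likewise, if we resample repeatedly from the original sample to construct the interval, the Central Limit Theorem tells us that the distribution of resampled means will be roughly normal and centered around the sample mean.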

11.2 Income Intervals (Modified from Sp18 Final Q1)

Researchers studying annual incomes in a city take a random sample of 400 households and create 10,000 bootstrap samples from the original sample. They then use the bootstrap percentile method to construct an approximate 95% confidence interval for the mean household income in the city. This is the method we have always used in Data 8, and you can assume that it works fine in this situation. The 95% confidence interval goes from $60,000 to $62,000.

Confidence Interval Practice for Finals

Questions about confidence intervals are likely to appear on the final exam.
Key point: A higher confidence level → a wider confidence interval.
Be careful to distinguish confidence level (process) from confidence interval (result).
Common misconceptions:
- Thinking the CI itself is “more likely” to contain the true value after calculation.
- Confusing confidence level with the probability of the sample mean.

Code

sample_size = 400
incomes = np.random.normal(loc=61000, scale=2000, size=sample_size)
incomes = np.clip(incomes, 50000, 75000)

households = Table().with_columns(
    "Income", incomes
)

11.2.1 (a)

Select all statements that must be true based on the information above.

A. About 50% of the households in the city have incomes between $60,000 and $62,000.
B. About 95% of the households in the city have incomes between $60,000 and $62,000.
C. The researchers are estimating that the mean household income in the city is between $60,000 and $62,000, but they could be wrong.
D. If the researchers had constructed an approximate 90% confidence interval based on the same bootstrap samples they used for the 95% interval, then both ends of their 90% confidence interval would have been inside the range $60,000 to $62,000.
E. None of the above is true.

Answer

Statements C and D are correct. The 95% bootstrap confidence interval estimates that the population mean household income is between $60,000 and $62,000, though there’s a chance this range misses the true mean. A 90% confidence interval would be narrower than a 95% interval, so its endpoints would fall inside the $60,000 to $62,000 range. Statements A and B incorrectly describe the distribution of individual household incomes rather than the mean.
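The nesting claim in statement D can be sketched with NumPy alone. The sample below is simulated, and its SD of 10,000 is an assumption chosen only so that the 95% interval comes out roughly $2,000 wide. Note that `np.percentile` interpolates between values, so its results can differ slightly from the `percentile` function in the `datascience` library.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical original sample of 400 household incomes (simulated, not real data).
sample = rng.normal(loc=61_000, scale=10_000, size=400)

# 10,000 bootstrap resamples, each the same size as the original sample.
resamples = rng.choice(sample, size=(10_000, sample.size), replace=True)
boot_means = resamples.mean(axis=1)

# Percentile-method intervals built from the same bootstrap means.
ci_95 = np.percentile(boot_means, [2.5, 97.5])
ci_90 = np.percentile(boot_means, [5, 95])

print("95% interval:", ci_95)
print("90% interval:", ci_90)
print("90% inside 95%:", ci_95[0] <= ci_90[0] and ci_90[1] <= ci_95[1])
```

Because both intervals are cut from the same array of means, the 5th percentile can never fall below the 2.5th and the 95th can never exceed the 97.5th, so the 90% interval always sits inside the 95% one.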

households.where(0, are.between(60000, 62000)).num_rows / households.num_rows

0.3925

# Bootstrap: resample the households with replacement 10,000 times,
# recording the mean income of each resample.
resampled_means = make_array()
for i in np.arange(10000):
  resampled_means = np.append(resampled_means, np.mean(households.sample().column(0)))

print(f'90% Confidence Interval for mean household income: [{percentile(5, resampled_means)}, {percentile(95, resampled_means)}]')

11.2.2 (b)

The array resampled_means contains the 10,000 bootstrapped means. Complete the code below so that the last line evaluates to the left and right endpoints of an approximate 90% confidence interval for the mean household income in the city.

left_end = ________________________________
right_end = _______________________________
make_array(left_end, right_end)

Answer

# The middle 90% of the bootstrapped means lies between the 5th and 95th percentiles.
left_end = percentile(5, resampled_means)
right_end = percentile(95, resampled_means)
make_array(left_end, right_end)

11.3 Variability of a Sample (Modified from Fa17 Final Q4)

A histogram of the salaries of all employees of the City of San Francisco in 2014 appears below.

The mean of the entire population is about $75,500 and the standard deviation is about $51,700.

Suppose we took a large random sample, of size 5000, from this population.

Code

pop_mean = 75500
pop_sd = 51700
sample_size = 5000

salaries = np.random.normal(loc=pop_mean, scale=pop_sd, size=sample_size)
salaries = np.clip(salaries, 0, None)

sf_salaries = Table().with_columns(
    "Salary", salaries
)

11.3.1 (a)

We plot a histogram of the sample mentioned above. Fill in the oval next to the histogram below that is the most likely to have been generated in this way.

Answer

The top-right histogram is most likely the sample because it has the same right-skewed shape and wide range of salaries as the population. The other histograms are either too narrow, symmetric, or include impossible values like negative salaries.
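The idea that a large random sample looks like its population can be checked numerically. The sketch below uses a hypothetical right-skewed population (log-normal, with parameters chosen only for illustration and not fitted to the San Francisco data) and compares summary values for the population and for one random sample of size 5,000.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical right-skewed "salary" population (log-normal; not the real SF data).
population = rng.lognormal(mean=11.0, sigma=0.7, size=200_000)

# One large random sample of size 5,000.
sample = rng.choice(population, size=5_000, replace=False)

# A large random sample should have roughly the same center, spread, and skew.
for name, values in [("population", population), ("sample", sample)]:
    print(name,
          "| mean:", round(np.mean(values)),
          "| SD:", round(np.std(values)),
          "| median:", round(np.median(values)))
```

In both rows the mean sits above the median, which is the signature of a right tail, and no value is negative.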

11.3.2 (b)

Suppose we take the sample mentioned above, and we resample from it with replacement. We do this 10,000 times and obtain 10,000 resamples. We compute the mean of each resample and then plot a histogram of these 10,000 means. Fill in the oval next to the histogram below that is the most likely to have been generated in this way.

Sampling Distributions and CLT

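As the sample size increases, the empirical distribution of the sample more closely resembles the population distribution, and the sampling distribution of the sample mean becomes narrower and closer to normal.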
Be clear about the difference:
- Empirical distribution: distribution of the observed data.
- Sampling distribution: distribution of a statistic computed from many repeated samples.
The definition of each is available on the reference sheet for quick review.
Classic example: applying the Central Limit Theorem to a sample mean — larger samples → distribution of the mean approximates normal.

Answer

The top-left histogram shows a bell-shaped, symmetric distribution centered around $75,000 — exactly what we expect from 10,000 resampled means due to the Central Limit Theorem.

resamples = make_array()
for i in np.arange(10000):
  resamples = np.append(resamples, np.mean(sf_salaries.sample().column(0)))
Table().with_columns("Resamples", resamples).hist("Resamples")
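The width of this histogram can also be estimated by hand. The sketch below assumes the standard result that the SD of the sample mean is the population SD divided by the square root of the sample size. It uses the population values from the problem as stand-ins for the sample's own mean and SD, which should be close for a sample of 5,000.

```python
import numpy as np

pop_mean = 75_500
pop_sd = 51_700
sample_size = 5_000

# SD of the sample mean = population SD / sqrt(sample size).
sd_of_sample_mean = pop_sd / np.sqrt(sample_size)

# Nearly all of a normal curve lies within 3 SDs of its center.
low = pop_mean - 3 * sd_of_sample_mean
high = pop_mean + 3 * sd_of_sample_mean

print("SD of sample means:", round(sd_of_sample_mean))
print("Most resampled means fall between", round(low), "and", round(high))
```

So the correct histogram should be not only bell-shaped but also narrow: roughly a few thousand dollars wide, far tighter than the spread of individual salaries.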

11.3.3 (c)

Wayland, Anirra, and Monica are arguing about whether we can use the Central Limit Theorem to help think about what the histogram should look like in part (b).

Wayland believes that we cannot use the Central Limit Theorem in part (b), as there is a large spike of people whose salary is very low in the city of San Francisco (as evidenced by the histogram of the population), so the distribution is not normal.
Monica believes we cannot use the Central Limit Theorem in part (b), since we are looking at the empirical histogram of sample means and we have no idea what that probability distribution looks like.
Anirra believes that both of these concerns are invalid, and the Central Limit Theorem is helpful for part (b).

Who is right?

Answer

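Anirra is right. Wayland's concern is invalid because the Central Limit Theorem does not require a normal population: with a large random sample (here, 5,000), the distribution of the sample mean is roughly normal even when the population is skewed or has a spike of low salaries. Monica's concern is invalid because the Central Limit Theorem tells us that the probability distribution of the sample mean is roughly normal, and the empirical histogram of the 10,000 resampled means approximates that shape.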
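Wayland's worry about the spike of low salaries can be tested directly. The sketch below builds a hypothetical sample of 5,000 salaries with a deliberate spike near zero and a long right tail (simulated values, not the real data), bootstraps its mean, and checks the resampled means against the familiar normal-curve proportions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical sample of 5,000 salaries with a large spike of very low values
# plus a long right tail (simulated; not the real San Francisco data).
low_paid = rng.uniform(0, 5_000, size=1_500)
others = rng.lognormal(mean=11.3, sigma=0.5, size=3_500)
sample = np.concatenate([low_paid, others])

# Bootstrap: resample with replacement and record the mean each time.
boot_means = np.array([
    np.mean(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(2_000)
])

# For a normal curve, about 68% of values fall within 1 SD of the center
# and about 95% fall within 2 SDs.
z = (boot_means - np.mean(boot_means)) / np.std(boot_means)
within_1 = np.mean(np.abs(z) <= 1)
within_2 = np.mean(np.abs(z) <= 2)
print("Within 1 SD:", within_1, "| within 2 SDs:", within_2)
```

Even though the sample itself is far from bell-shaped, the proportions for the resampled means should come out close to 0.68 and 0.95.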