10  Discussion 10: Sample Means and CLT (From Summer 2025)

Slides

Note

So far in the course, you have used the bootstrap to estimate population parameters such as the median and the mean. You are now capable of building empirical distributions for these sample statistics. An empirical distribution for a sample statistic is usually obtained by repeatedly resampling and calculating the statistic for each resample (i.e., via bootstrapping!).

Now we will introduce the Central Limit Theorem (CLT), which tells us more about the distribution of the sample mean: if you draw a large random sample with replacement from a population, then, regardless of the distribution of the population, the probability distribution for that sample’s mean is roughly normal, centered at the population mean.

Furthermore, the standard deviation (spread) of the distribution of sample means is governed by a simple equation, shown below:

SD of all possible sample means = \(\frac{\text{Population SD}}{\sqrt{\text{sample size}}}\)

Saying “the SD of the distribution of all possible sample means” is the same as saying “the SD of the sample mean”.
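A quick simulation can confirm this formula. This is a sketch using plain NumPy with an illustrative skewed population (the values and sample size here are chosen for the sketch, not taken from a specific problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative skewed population for this sketch
population = np.repeat(np.arange(1, 8), [40, 25, 15, 10, 5, 2, 1])

sample_size = 100
means = np.array([rng.choice(population, sample_size).mean()
                  for _ in range(10_000)])

# The simulated SD of the sample means should closely match the formula
print("Simulated SD of sample means:", means.std())
print("Population SD / sqrt(n):    ", population.std() / np.sqrt(sample_size))
```

The two printed values should agree to about two decimal places, since 10,000 repetitions pins down the spread of the sample means quite precisely.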

CLT Visualizer!

10.1 Sample Means

Note that in this question, the empirical distribution of the sample mean is made up from the means of samples drawn from the population (not from resamples from a single sample).

Assume that you have a certain population of interest whose histogram is below.

Code
from datascience import *
import numpy as np
%matplotlib inline

pops = Table().with_column(
  'Population', [1] * 40 + [2] * 25 + [3] * 15 + [4] * 10 + [5] * 5 + [6] * 2 + [7] * 1
)

pops.hist("Population", bins=np.arange(1.5, 8.5, 1))

10.1.1 (a)

Cyrus takes many large random samples with replacement from the population with the goal of generating an empirical distribution of the sample mean. What shape do you expect this distribution to have? Which value will it be centered around?

Answer

The distribution will look like a bell curve (i.e., roughly normal), centered at the population mean, by the Central Limit Theorem. The conditions of the CLT are satisfied: we are drawing large random samples with replacement, and the statistic is a sample mean.

sample_means = make_array()
for i in np.arange(1000):
  # Draw a large random sample (with replacement) and record its mean
  sample_means = np.append(sample_means, np.mean(pops.sample().column(0)))

Table().with_columns("Sample Means", sample_means).hist("Sample Means")

Central Limit Theorem: Conditions
  • There are two key conditions for the Central Limit Theorem (CLT) to apply:
    1. Samples must be large, random, and drawn with replacement (or relatively small compared to the population if sampling without replacement).
    2. The statistic must be a sample sum or sample mean — CLT does not apply to just any statistic.
  • When solving problems, look for parts of the question that suggest these conditions are met.
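To see why the second condition matters, compare the empirical distribution of the sample mean with that of a statistic the CLT does not cover, such as the sample maximum. This is a sketch using plain NumPy and a hypothetical uniform population:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(0, 1, 100_000)  # hypothetical population for this sketch

means = np.array([rng.choice(population, 100).mean() for _ in range(5000)])
maxes = np.array([rng.choice(population, 100).max() for _ in range(5000)])

# Sample means cluster symmetrically around 0.5 (roughly normal);
# sample maxima pile up just below 1, a heavily skewed shape.
print("Average sample mean:", means.mean())
print("Average sample max: ", maxes.mean())
```

Plotting histograms of `means` and `maxes` makes the contrast vivid: only the first looks like a bell curve.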

10.1.2 (b)

Why are we able to use the CLT to reason about the empirical distribution of the sample mean’s shape if the population data is skewed?

Answer

To use the CLT all we need is a large random sample with replacement from the population and a statistic that is a sum or mean. The CLT applies regardless of the distribution of the population.

CLT Intuition: Sample Size and Variability
  • CLT applies regardless of the population’s shape.
  • Standard deviation of sample means decreases with larger sample sizes:
    • SD of sample mean = Population SD ÷ √Sample Size
  • Example: flipping a coin
    • 10 flips → you might see 30% or 70% heads
    • 100 flips → most results will be between 40% and 60% heads
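The coin example above can be simulated directly. This is a sketch in NumPy (the repetition count of 5,000 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def head_proportions(num_flips, repetitions=5000):
    # Flip a fair coin num_flips times, repetitions times over;
    # return the proportion of heads in each repetition
    flips = rng.integers(0, 2, size=(repetitions, num_flips))
    return flips.mean(axis=1)

# More flips per repetition -> less spread in the proportions
print("SD with 10 flips: ", head_proportions(10).std())
print("SD with 100 flips:", head_proportions(100).std())
```

The second SD should be roughly 0.05, close to the square-root law's prediction of \(0.5 / \sqrt{100}\).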

10.1.3 (c)

Suppose that Cyrus creates two empirical distributions of sample means, with different sample sizes. Which distribution corresponds to a larger sample size? Why?

Answer

The distribution on the right corresponds to a larger sample size than the one on the left. We can tell from the spread of the two distributions: the larger the sample size, the less variable the distribution of the sample mean. Increasing the sample size increases the denominator in the formula for the SD of sample means, which decreases the standard deviation.

sample_means = make_array()
for i in np.arange(1000):
  # Each sample contains only 10 rows
  sample_means = np.append(sample_means, np.mean(pops.sample(10).column(0)))

Table().with_columns("Sample Means", sample_means).hist("Sample Means")

sample_means = make_array()
for i in np.arange(1000):
  # Each sample contains 1000 rows
  sample_means = np.append(sample_means, np.mean(pops.sample(1000).column(0)))

Table().with_columns("Sample Means", sample_means).hist("Sample Means")


10.1.4 (d)

Based solely on the information in the histogram, what is an estimate for the standard deviation of the sample mean on the left? How did you determine this?

Understanding Points of Inflection
  • A point of inflection is where a curve transitions from curving upward to curving downward.
  • No calculus required!
  • Visual analogy:
    • Draw a smooth version of your histogram.
    • Alternatively, imagine placing a downward-facing bowl on top of an upward-facing bowl.
    • The single point where they touch is the inflection point.
Answer

Approximately 0.3. This is because the point of inflection on a normal distribution is 1 standard deviation away from the mean. Looking at the histogram, we can see that the inflection points occur at around 1.7 and 2.3, with the mean of the histogram being 2.0.

10.1.5 (e)

Suppose you were told that the distribution on the right was generated based on a sample size of 100 and has a standard deviation of 0.2. How big of a sample size would you need if you wanted the standard deviation of the distribution of sample means to be 10 times smaller?

Using CLT Formulas
  • Substitute the known values into the SD formula from CLT to find the population SD or compute a new sample size.
  • This step is mostly arithmetic once you understand the formula.
Answer

First, find the Population SD:

\(0.2 = \frac{\text{Population SD}}{\sqrt{100}}\)
\(\text{Population SD} = 0.2 \times \sqrt{100} = 2\)

Then, calculate the new sample size needed:

\(\frac{0.2}{10} = \frac{2}{\sqrt{\text{new sample size}}}\)
\(0.2 \times \sqrt{\text{new sample size}} = 2 \times 10\)
\(\sqrt{\text{new sample size}} = 100\)
\(\text{new sample size} = 10000\)

To divide the SD of the sample means by a factor of 10, we need to multiply the sample size by \(10^2\), which is 100. To skip the calculations, reference the square root law!
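The same arithmetic can be checked in code (the numbers are those given in the problem):

```python
import numpy as np

current_sd = 0.2   # SD of sample means with samples of size 100
current_n = 100

# Invert the formula: Population SD = SD of sample means * sqrt(n)
pop_sd = current_sd * np.sqrt(current_n)

# Solve for the sample size that gives an SD 10 times smaller
target_sd = current_sd / 10
new_n = (pop_sd / target_sd) ** 2

print(round(pop_sd), round(new_n))  # 2 10000
```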

10.2 Want to Go to Main Stacks?

Confidence Interval Visualizer!

UC Berkeley students love studying in Main Stacks! You are working with Aayan on constructing a confidence interval for the mean number of hours students spend in Main Stacks each year. To do this, you take a random sample of 400 UC Berkeley students and record how many hours each student spent studying in Main Stacks over the past year. Then, you compute the mean number of hours for your sample; it is 170 hours. You also calculate the standard deviation of your sample to be 10 hours.


10.2.1 (a)

Aayan claims that the distribution of all possible sample means is normal with an SD of 0.5 hours. Use this information to construct an approximate 68% confidence interval for the mean hours spent studying in Main Stacks for all UC Berkeley students.

Answer

We know that the distribution is approximately normal and hence we know that roughly 68% of the area under the normal curve (or 68% of the data) is contained within 1 SD of the mean. Therefore, our confidence interval range will be [170 - 1*0.5, 170 + 1*0.5] = [169.5 hours, 170.5 hours].

Constructing Confidence Intervals
  • If the distribution is roughly normal, about 68% of values lie within 1 SD of the mean.
  • Example: mean = 170 hours, SD = 0.5 hours
    • CI ≈ [mean ± 1 SD] → [169.5, 170.5]
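In code, with the numbers from the problem:

```python
sample_mean = 170   # hours, mean of the sample of 400 students
sd_of_means = 0.5   # hours, SD of the distribution of sample means

# ~68% of a normal distribution lies within 1 SD of its center
ci_68 = (sample_mean - sd_of_means, sample_mean + sd_of_means)
print(ci_68)  # (169.5, 170.5)
```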

10.2.2 (b)

Suppose Richard and Bing took a different random sample of 400 UC Berkeley students and found a sample mean of 172 hours, with a sample standard deviation of 10 hours. Would it be surprising to observe this sample mean if the true average number of hours all UC Berkeley students spend in Main Stacks is 170?

Answer

If the true population mean is 170 hours and the standard deviation of the sample means is 0.5 (as we saw before), then we can compute how many standard deviations away 172 is from the mean using a z-score: \(z = \frac{172 - 170}{0.5} = 4\). A z-score of 4 is very far in the tails of the normal distribution — much further than what we would expect by random chance. Therefore, yes, it would be surprising to observe a sample mean of 172 hours if the true mean were 170. This might suggest the true population mean is not 170 hours.

true_mean = 170
true_sd = 10
n = 400
repetitions = 5000

sample_means = []
for _ in range(repetitions):
    sample = np.random.normal(true_mean, true_sd, n)
    sample_means.append(np.mean(sample))

sim_table = Table().with_column("Sample Means", sample_means)

prop_extreme = np.count_nonzero((sim_table.column("Sample Means") >= 172) |
                                (sim_table.column("Sample Means") <= 168)) / repetitions

print("Simulated probability of observing 172 or more extreme:", prop_extreme)

sim_table.hist("Sample Means")
Simulated probability of observing 172 or more extreme: 0.0002


10.2.3 (c)

If Aayan had not told you what the SD of the distribution of sample means was, could you estimate it from the data in the sample? If yes, how?

Answer

We know that the sample size is 400 and that the standard deviation of our sample was 10 hours. Since the SD of sample means is equal to \(\text{Population SD} / \sqrt{\text{sample size}}\), we can substitute the sample SD (i.e., 10 hours) in place of the population SD in this formula (assuming the sample is representative of the population) and calculate the value!

SD of all possible sample means = 10 hours / \(\sqrt{400}\) = 10 hours / 20 = 0.5 hours.

Sample vs Population SD
  • For large samples, the sample SD is a good approximation of the population SD.
  • Even if sampling without replacement, the difference is negligible if the sample is very small relative to the population.
  • This logic is similar to bootstrapping: if the sample represents the population, we can treat the sample SD as the population SD.
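The calculation above, written out in code (values from the problem):

```python
import numpy as np

sample_sd = 10      # hours; stands in for the population SD
sample_size = 400

sd_of_sample_means = sample_sd / np.sqrt(sample_size)
print(sd_of_sample_means)  # 0.5
```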