7  Discussion 07: Assessing Models (From Summer 2025)


When we observe something different from what we expect in real life (e.g., four 3’s in six rolls of a fair die), a natural question to ask is “Was this unexpected behavior due to random chance, or something else?”

Hypothesis testing allows us to answer the above question in a scientific and consistent manner, using the power of computation and statistics to conduct simulations and draw conclusions from our data.

7.1 Test Statistics

Wayland is playing with a coin and he wants to test whether his coin is fair. His experiment is to toss the coin 100 times. He chooses the following null hypothesis.

Null Hypothesis: The coin is fair and any deviation observed is due to chance.

For each of the alternative hypotheses listed below, determine whether or not the test statistic is valid.

Choosing a Test Statistic
  • If we only care about whether the coin is fair or unfair, we use a test statistic with absolute value.
  • If the alternative hypothesis is directional, we do not use absolute value.
  • For more details, check out the Hypothesis Testing Guide.
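The two bullet points above can be sketched with plain numpy (a minimal sketch; the variable names are our own, and the setup is the 100-toss experiment from this problem):

```python
import numpy as np

# Simulate 100 tosses of a fair coin once.
rng = np.random.default_rng(0)
tosses = rng.choice(["Heads", "Tails"], 100)
num_heads = np.count_nonzero(tosses == "Heads")

# Directional alternative ("biased towards heads"): no absolute value.
directional_stat = num_heads

# Two-sided alternative ("not fair"): absolute distance from the expected 50.
two_sided_stat = abs(num_heads - 50)
```

Large values of `directional_stat` only flag too many heads, while large values of `two_sided_stat` flag a surplus of heads or of tails.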

7.1.1 (a)

Alternative Hypothesis: The coin is biased towards heads.
Test Statistic: # of heads.

Answer Correct.

7.1.2 (b)

Alternative Hypothesis: The coin is not fair.
Test Statistic: # of heads.

Answer Incorrect. We want more extreme values of our test statistic to favor the alternative hypothesis. Since the alternative is that the coin is unfair, we must consider both cases: the coin could be biased towards heads or biased towards tails. Simply counting the number of heads does not account for the second case.

7.1.3 (c)

Alternative Hypothesis: The coin is not fair.
Test Statistic: \(|\)# of heads - expected # of heads \(|\).

Answer

Correct.

Why Use Absolute Distance?
  • We use absolute distance because bias can appear in either direction (too many heads or too many tails).
  • Larger values in either direction give evidence against fairness.
  • Think of it like “folding” a histogram of the number of heads at its expected value (50 heads in 100 tosses).

7.1.4 (d)

Alternative Hypothesis: The coin is biased towards heads.
Test Statistic: \(|\)# of heads - expected # of heads \(|\).

Answer

Incorrect. This is the mirror image of part (b): because of the absolute value, this statistic also treats a bias towards tails as extreme, even though the alternative hypothesis is only about a bias towards heads.

Unfair ≠ Just More Heads
  • The problem states: “test if the coin is unfair,” not “biased towards heads.”
  • That means either outcome (too many heads or too many tails) counts as evidence.
  • Even though 9 heads may look like evidence only for heads, note that 1 head out of 10 flips is just as suspicious.

7.1.5 (e)

Alternative Hypothesis: The coin is not fair.
Test Statistic: 1/2 - proportion of heads.

Answer Incorrect. Without the absolute value, a coin biased towards heads produces small (negative) values of this statistic, so large values of the statistic no longer favor the alternative hypothesis in both directions.

7.2 Flip Flop

Wayne is flipping a coin. He thinks it is unfair, but is not sure. He flips it 10 times and gets heads 9 times. He wants to determine whether the coin was actually unfair, or whether the coin was fair and his result of 9 heads in 10 flips was due to random chance.

Code
from datascience import *
import numpy as np
%matplotlib inline

7.2.1 (a)

What is a possible model that he can simulate under?

Answer

A possible model that you could simulate under is that on each flip, there is a 50% chance that the coin lands heads and a 50% chance that the coin lands tails. Any difference is due to chance.

If you are more familiar with probability: the flips are like independent and identically distributed draws at random from a distribution in which 50% are Heads and 50% are Tails.

7.2.2 (b)

What is an alternative model for Wayne’s coin? You do not necessarily have to be able to simulate under this model.

Answer An alternative model that Wayne might suggest is that the coin is unfair, and that the difference in the observed data is due to something other than just chance. We would not be able to simulate under this model because the statement “the coin is unfair” is not very specific (we can ask questions like “How unfair?” or “Biased towards heads or tails?”).

7.2.3 (c)

What is a good test statistic that you could compute from the outcome of his flips? Calculate that statistic for your observed data. Hint: If the coin was unfair, it could either be biased towards heads or biased towards tails.

Answer

A good test statistic is the absolute difference between the number of heads we observe and the expected number of heads (5). Our observed test statistic is \(|9 - 5| = 4\). Notice that this statistic is large both for a large number of heads and for a small number of heads.

We could also use proportions as our test statistic, i.e., \(\vert\) proportion of heads - 0.5 \(\vert\).
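Either form of the statistic can be checked by hand; a minimal sketch using the values from this problem:

```python
# Observed data: 9 heads in 10 flips; a fair coin expects 5 heads.
observed_heads = 9
expected_heads = 5

abs_count_stat = abs(observed_heads - expected_heads)   # absolute difference in counts
abs_prop_stat = abs(observed_heads / 10 - 0.5)          # absolute difference in proportions
```

Both versions carry the same information: the count form gives 4, the proportion form gives 0.4.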


7.2.4 (d)

Complete the function flip_ten, which takes no arguments and does the following:
  • Simulates flipping a fair coin 10 times
  • Computes the simulated statistic, based on the one chosen in the previous question

def flip_ten():
    faces = make_array("Heads", "Tails")
    flips = ____________________
    num_heads = ____________________
    return ____________________
Answer
def flip_ten():
    faces = make_array("Heads", "Tails")
    flips = np.random.choice(faces, 10)
    num_heads = np.count_nonzero(flips == "Heads")
    return abs(num_heads - 5)
flip_ten()
0

7.2.5 (e)

Complete the code below to simulate the experiment 10000 times and record the statistic computed in each of those trials in an array called simulated_stats.

trials = ____________________
simulated_stats = ____________________
for ____________________:
    one_stat = ____________________
    ____________________ = ____________________
How Simulation Is Structured
  • Steps for simulating:
    1. Define a function that simulates once and computes one test statistic.
    2. Run a for-loop that:
      • Calls this function.
      • Stores results in an array.
  • Key reminder: sample size ≠ number of repetitions.
  • This structure shows up often and is worth practicing.
Answer
trials = 10000
simulated_stats = make_array()
for i in np.arange(trials):
    one_stat = flip_ten()
    simulated_stats = np.append(simulated_stats, one_stat)
simulated_stats
array([ 0.,  2.,  1., ...,  2.,  2.,  2.])

7.2.6 (f)

Suppose we performed the simulation and plotted a histogram of simulated_stats. The histogram is shown below.

Code
Table().with_columns('Absolute Differences', simulated_stats).hist("Absolute Differences", bins = np.arange(11))

Is our observed statistic from part (c) consistent with the model we simulated under?

Answer No, the observed statistic is not consistent with the model we simulated under. If we look for the observed statistic (4), we will see that it rarely ever happened in our simulation; most of the statistics generated in our simulations were in the range [0, 2]. Therefore, we would say that it is inconsistent with the model we simulated under.
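One way to make “rarely ever happened” precise is to compute the proportion of simulated statistics at least as extreme as the observed 4. This sketch reproduces the simulation with plain numpy (a binomial draw stands in for flip_ten; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
trials = 10000

# |# of heads - 5| for 10 flips of a fair coin, repeated `trials` times.
simulated_stats = np.abs(rng.binomial(n=10, p=0.5, size=trials) - 5)

# Proportion of simulations at least as extreme as the observed statistic (4).
prop_as_extreme = np.count_nonzero(simulated_stats >= 4) / trials
```

Under the fair-coin model this proportion is about 22/1024 ≈ 0.021, which is why an observed statistic of 4 looks inconsistent with the model.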

7.3 Carnival Games

You are playing a wheel-spinning game at a carnival, where you can earn prizes based on where the wheel stops. The booth attendant claims the distribution of prizes is as below, but you think the game is rigged and doesn’t follow the listed probabilities.

Prize Chance
Nothing 80%
Teddy bear 2%
Pinwheel 6%
Sticker 12%

You would like to test your claim so you can report the carnival for fraud. Before you design your test, consider: do you have numerical data or categorical data?

Setting Up Hypotheses
  • Start by asking yourself:
    • What are we trying to prove?
    • How can we simulate this?
  • The null hypothesis is usually the “baseline” with a fully defined model that we can actually simulate under.

7.3.1 (a)

What is your hypothesis?

Answer The distribution of prizes does not follow the distribution listed by the carnival. Any observed difference is not just due to chance.

7.3.2 (b)

What is the booth attendant’s hypothesis?

Answer The distribution of prizes follows the distribution listed by the carnival. Any observed difference is simply due to chance.

7.3.3 (c)

Which hypothesis (of the two we defined) can you simulate under?

Answer You could simulate under the booth attendant’s hypothesis. This is because it is a fully defined model, meaning we are able to describe the parameters of an experiment surrounding it. Your hypothesis is simply that the distribution is not the same as the carnival’s; there is no fully defined model that we can simulate under.

7.3.4 (d)

What is a good statistic to use?

Answer The total variation distance (TVD) from the expected distribution. When we are observing categorical distributions of data and want to compare them, we should use TVD. Note, this is a good example because we have four different components in the distribution that we would like to test.
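As a concrete illustration, the TVD between the carnival’s claimed distribution and a set of observed proportions (the observed numbers below are made up for illustration) can be computed with numpy:

```python
import numpy as np

claimed  = np.array([0.80, 0.02, 0.06, 0.12])  # Nothing, Teddy bear, Pinwheel, Sticker
observed = np.array([0.85, 0.01, 0.05, 0.09])  # hypothetical sample proportions

# TVD: half the sum of absolute differences between the two distributions.
tvd = np.sum(np.abs(claimed - observed)) / 2
```

Halving the sum prevents double-counting: every excess in one category is matched by a deficit elsewhere, since both distributions sum to 1.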

7.3.5 (e)

Write code that simulates playing the carnival game 1,000 times, returns an array of proportions representing how often each prize was won, and finally extracts the number of teddy bears won in the simulation.

prize_chances = ____________________
my_simulation = ____________________
num_teddy_bears = ____________________
Answer
prize_chances = make_array(0.8, 0.02, 0.06, 0.12)
my_simulation = sample_proportions(1000, prize_chances)
num_teddy_bears = my_simulation.item(1) * 1000
Understanding sample_proportions
  • sample_proportions can be tricky—here’s a toy example:
    • Bag: 1 red marble + 2 blue marbles → make_array(1/3, 2/3).
    • Run: sample_proportions(5, make_array(1/3, 2/3)).
    • Imagine drawing 5 times with replacement and writing down each color.
    • At the end, record the proportion of red vs. blue.
  • One possible output: array([2/5, 3/5]).
  • For more details, see the Sampling Methods Guide.
num_teddy_bears
17.0
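If the datascience library is unavailable, sample_proportions can be emulated with numpy’s multinomial sampler (a sketch under that assumption; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
prize_chances = np.array([0.8, 0.02, 0.06, 0.12])

# Draw 1000 spins at once; `counts` holds how many of each prize was won.
counts = rng.multinomial(1000, prize_chances)
proportions = counts / 1000          # plays the role of sample_proportions' output
num_teddy_bears = counts[1]          # index 1 is the Teddy bear slot
```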

Suppose the wheel-spinning game received a lot of complaints at the carnival, and the owners of the game are pressured to release their true distribution of prizes as below:

Prize Chance
Nothing 90%
Teddy bear 1%
Pinwheel 3%
Sticker 6%

Use the distribution above to answer the following probability questions.


7.3.6 (f)

What is the probability of winning a Teddy bear and a Sticker in two spins?

Answer

P(Teddy bear and Sticker) = 2 * P(Teddy bear) * P(Sticker) = 2 * 0.01 * 0.06 = 0.12%
We multiply by 2 because we could have won the Teddy bear and then the Sticker OR the Sticker first and then the Teddy bear.

Trick for Counting Outcomes
  • Sometimes you need to multiply by 2 (or more) because different orders produce the same overall outcome.
  • Example: winning a Teddy then a Sticker, or a Sticker then a Teddy—both count!
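The multiply-by-2 trick is just summing over orderings, which we can verify by enumerating both orders explicitly (probabilities from the released distribution):

```python
from itertools import permutations

# Released probabilities for the two prizes involved.
p = {"Teddy bear": 0.01, "Sticker": 0.06}

# Sum over both orders: (Teddy bear, Sticker) and (Sticker, Teddy bear).
prob = sum(p[first] * p[second] for first, second in permutations(p, 2))
```

This matches 2 × 0.01 × 0.06 = 0.0012, i.e., 0.12%.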

7.3.7 (g)

What is the probability of winning at least one prize in 10 spins?

Answer Complement Rule: \(P(\text{at least one prize}) = 1 - P(\text{no prizes in 10 spins}) = 1 - P(\text{Nothing})^{10} = 1 - (0.9)^{10}\)
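Evaluating the complement-rule expression numerically (values from the released distribution):

```python
# P(at least one prize in 10 spins) via the complement rule.
p_nothing_each_spin = 0.9
p_at_least_one = 1 - p_nothing_each_spin ** 10
```

This comes out to about 0.651, so winning at least one prize in 10 spins is more likely than not.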

7.4 Spring 2018 Midterm Question 4 (Bonus!)

Researchers are studying the effectiveness of a particular flu vaccine. A large random sample was taken from the population of people who took the vaccine in 2016. Among the sampled people, 48% did not get the flu. Another large random sample was taken in 2017, from among the people who took the vaccine that year. Among these sampled people, 40% did not get the flu.


7.4.1 (a)

A researcher thinks the vaccine was less effective in 2017 than in 2016. To test this, a null hypothesis is needed. Exactly one of the following choices is the correct null hypothesis.

A. The vaccine was less effective in the 2017 population than in the 2016 population, due to chance.
B. The vaccine was equally effective in the two samples but its effectiveness was different in the two populations due to chance.
C. The vaccine was equally effective in the two populations but its effectiveness was different in the two samples due to chance.

Answer

Option A - Incorrect as it describes a model that is difficult to simulate under. How can we quantify “less effective”?
Option B - Incorrect as the question tells us that the vaccine was not equally effective in the two samples (48% vs 40%).
Option C - Correct. The null hypothesis would state that the vaccine was equally effective in the two populations, and that the differences we observe in the two samples are simply due to chance.

Sample vs. Population
  • When we say “any observed difference is due to chance,” we’re talking about differences in the sample, not the population itself.

7.4.2 (b)

The researcher says, “The observed value of my test statistic is 40% – 48% = − 8%.” To perform the test, the statistic is simulated under the null hypothesis. One of the figures below is the empirical histogram of the simulated values. Which is it?

Simulating Under the Null
  • If the null says two populations are equally effective, the expected difference = 0.
  • The histogram of simulated differences will then be centered around 0.
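A hedged sketch of what one simulated value of this statistic might look like. The common effectiveness rate (44%) and sample size (1,000 per year) below are assumptions made purely for illustration; neither number is given in the problem:

```python
import numpy as np

rng = np.random.default_rng(1)
common_rate = 0.44   # assumed common effectiveness under the null (illustrative)
n = 1000             # hypothetical sample size for each year

# One simulated value of the test statistic: difference in sample percentages.
pct_2017 = rng.binomial(n, common_rate) / n * 100
pct_2016 = rng.binomial(n, common_rate) / n * 100
simulated_diff = pct_2017 - pct_2016   # centered around 0 under the null
```

Repeating this many times and plotting a histogram of `simulated_diff` produces a distribution centered at 0 that can be positive or negative.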

Answer Histogram (iii) is correct.
The test statistic we are using is the difference between the two sample percentages. Under the null hypothesis, this could be positive or negative depending on the sample. This rules out (ii). Under the null hypothesis, the two sample percentages are expected to be equal and hence the difference is expected to be 0. This rules out (i). Only (iii) has all the right properties.