8  Discussion 08: Midterm Review (From Summer 2025)


8.1 Tables

You are given the following table called pokemon. For the following questions, fill in the blanks.

Working with Tables
  • Table manipulation skills are super valuable for the course!
  • Review methods like where, group, pivot, join, and others.
  • Pro tip: Pay attention to variable names — they often tell you exactly what the problem wants.
  • Always check the expected output type: do you need a number, an array, or a table?
  • Drawing out or imagining intermediate tables can help clarify your steps.
  • Working backwards from the desired result is often a great strategy (and can help earn partial credit).
Code
import random
from datascience import *
import numpy as np
%matplotlib inline

names = ["Bulbasaur", "Charmander", "Squirtle", "Palkia", "Dialga", "Giratina"]
types = ["Grass", "Fire", "Water", "Water/Dragon", "Steel/Dragon", "Ghost/Dragon"]
hp = [45, 39, 44, 90, 100, 150]
speed = [45, 65, 43, 100, 90, 90]
generation = [1, 1, 1, 4, 4, 4]
legendary = [False, False, False, True, True, True]

type_pool = [
    "Grass", "Fire", "Water", "Electric", "Psychic",
    "Ice", "Dragon", "Steel", "Ghost", "Dark",
    "Fairy", "Ground", "Rock", "Fighting", "Bug",
    "Flying", "Poison", "Normal"
]

for i in range(734):
    names.append(f"Pokemon_{i+7}")
    if random.random() < 0.2:
        t1, t2 = random.sample(type_pool, 2)
        types.append(f"{t1}/{t2}")
    else:
        types.append(random.choice(type_pool))
    hp.append(random.randint(30, 170))
    speed.append(random.randint(20, 150))
    generation.append(random.randint(1, 8))
    legendary.append(random.random() < 0.05)

names.append("Wailord")
types.append("Water")
hp.append(170)
speed.append(60)
generation.append(3)
legendary.append(False)

pokemon = Table().with_columns(
    "Name", names,
    "Type", types,
    "HP", hp,
    "Speed", speed,
    "Generation", generation,
    "Legendary", legendary
)

pokemon.show(6)
Name Type HP Speed Generation Legendary
Bulbasaur Grass 45 45 1 False
Charmander Fire 39 65 1 False
Squirtle Water 44 43 1 False
Palkia Water/Dragon 90 100 4 True
Dialga Steel/Dragon 100 90 4 True
Giratina Ghost/Dragon 150 90 4 True

... (735 rows omitted)


8.1.1 (a)

Find the name of the Pokemon whose type is only Water (not a dual type) that has the highest HP.

water_pokemon = pokemon.__________(__________, __________)
water_pokemon.__________(__________, __________).column("Name").item(0)
Answer
water_pokemon = pokemon.where("Type", are.equal_to("Water"))
water_pokemon.sort("HP", descending = True).column("Name").item(0)
'Wailord'

8.1.2 (b)

Find the proportion of Fire-type Pokemon with a Speed less than 100.

fire_pokemon = pokemon._______(_______, _______)
fire_pokemon._______(_______, _______)._______ / _______._______
Answer
fire_pokemon = pokemon.where("Type", "Fire")
fire_pokemon.where("Speed", are.below(100)).num_rows / fire_pokemon.num_rows
0.5666666666666667

8.1.3 (c)

Create a table containing Type and Generation that is sorted in decreasing order by the average HP for each pair of Type and Generation that appears in the table.

avg_hp = pokemon._______(_______(_______, _______), _______)
avg_hp.sort("HP mean", _______ = _______)._______(_______, _______)
Answer
avg_hp = pokemon.group(make_array("Type", "Generation"), np.mean)
avg_hp.sort("HP mean", descending = True).select("Type", "Generation")
Type Generation
Dark/Normal 4
Dragon/Electric 1
Fighting/Rock 8
Psychic/Fighting 7
Steel/Dragon 7
Grass/Fighting 1
Steel/Fairy 3
Dark/Ghost 3
Steel/Bug 6
Dark/Water 5

... (285 rows omitted)


8.1.4 (d)

Return an array that contains the ratio of legendary to non-legendary Pokemon for each generation. You may assume that the Legendary column is a column of booleans.

leg_per_gen = pokemon._______(_______, _______)
ratios = leg_per_gen._______(_______) / leg_per_gen._______(_______)
Answer
leg_per_gen = pokemon.pivot("Legendary", "Generation")
ratios = leg_per_gen.column("True") / leg_per_gen.column("False")
ratios
array([ 0.07317073,  0.05617978,  0.02380952,  0.07446809,  0.06666667,
        0.05940594,  0.03225806,  0.07228916])

8.1.5 (e)

Consider another table called trainers, which contains information about Pokemon trainers and the Pokemon they own. The trainers table has two columns: Trainer, the name of the trainer, and Pokemon, the name of the Pokemon. Use table operations to create a new table called pokemon_with_trainers that includes each Pokemon’s Name, Type, Generation, and their Trainer.

Code
trainers = Table().with_columns(
    "Trainer", ["Ash", "Misty", "Brock", "Wallace"],
    "Pokemon", ["Bulbasaur", "Squirtle", "Charmander", "Wailord"]
)
trainers_added = pokemon._______(_______, _______, _______)
pokemon_with_trainers = trainers_added._______(_______, _______, _______)
Answer
trainers_added = pokemon.join("Name", trainers, "Pokemon")
pokemon_with_trainers = trainers_added.drop("HP", "Speed", "Legendary")
pokemon_with_trainers
Name Type Generation Trainer
Bulbasaur Grass 1 Ash
Charmander Fire 1 Brock
Squirtle Water 1 Misty
Wailord Water 3 Wallace

8.2 Histograms

The World Happiness Report is a landmark study on the state of global happiness. This study calculated and ranked the happiness level for 155 countries using data from the Gallup World Poll. The histogram below shows the distribution of happiness scores computed from this study in 2019. Suppose the data is stored in the table called happiness. The following code was used to generate the histogram you see below:

Code
happiness = Table.read_table("happiness.csv")
happiness.hist("Score", bins = np.arange(2.8, 8, 0.5))

Note that the histogram may look a bit different from usual for the purpose of making bar heights easier to interpret.

Thinking About Histograms
  • Remember the area principle: the area of each bar corresponds to its proportion of the data.
  • We don’t know how values are distributed within a single bin — only the count (or percent) of values in that bin.
  • Without the actual counts, a histogram only gives us percentages, not raw numbers.

8.2.1 (a)

What is the “unit” in “Percent per unit” on the y-axis in the histogram?

Answer Happiness score. Note that it is NOT country. Although each data point used to generate the histogram represents a country, the values used to make the histogram are the happiness scores.

For parts b-e, use the above histogram to calculate the following quantities. If it’s not possible, write “Cannot calculate” and explain your reasoning.

8.2.2 (b)

The proportion of countries with happiness scores between 4.3 and 5.8.

Answer

Each bin has width 0.5. We’re interested in the [4.3, 4.8), [4.8, 5.3), and [5.3, 5.8) bins, which have heights of approximately 38, 28, and 27 percent per unit, respectively. We can use the Area Principle to calculate the sum of the areas of these bins, which in turn represents the proportion we’re looking for: \[ \begin{align*} \text{Total Area} &= \text{height}_1 \cdot \text{width} + \text{height}_2 \cdot \text{width} + \text{height}_3 \cdot \text{width} \\ &= 0.5 \cdot (\text{height}_1 + \text{height}_2 + \text{height}_3) \\ &= 0.5 \cdot (38 + 28 + 27) \\ &= 0.5 \cdot 93 \\ &= 46.5\% \quad \text{or as a proportion, } 0.465 \end{align*} \]
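
The arithmetic above can be checked in a couple of lines (the heights 38, 28, and 27 are approximate values read off the histogram):

```python
# Approximate bar heights (percent per unit) read off the histogram for
# the bins [4.3, 4.8), [4.8, 5.3), and [5.3, 5.8)
heights = [38, 28, 27]
width = 0.5

# Area Principle: area = width * height, summed over the three bins
percent = width * sum(heights)
proportion = percent / 100
print(percent, proportion)  # 46.5 0.465
```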

happiness.where("Score", are.between(4.3, 5.8)).num_rows / happiness.num_rows
0.46794871794871795

8.2.3 (c)

The number of countries with happiness scores between 4.3 and 5.8 (round to the nearest country).

Answer

We can simply multiply the proportion of countries with happiness score between 4.3 and 5.8 (calculated in the previous question) by the number of countries represented in the histogram:
Number of countries = \(46.5\% * 155\text{ countries}\approx 72 \text{ countries}\)

happiness.where("Score", are.between(4.3, 5.8)).num_rows
73

8.2.4 (d)

The number of countries with happiness scores between 6 and 7.

Answer

Not possible; the bin edges do not line up with 6 and 7, and we do not know how the countries are distributed within a bin.

# Just in case you are curious about this!
happiness.where("Score", are.between(6, 7)).num_rows
36

8.2.5 (e)

The height of the new bin after combining the three leftmost bins.

Answer

Combining the three leftmost bins combines their areas. The total area of the three leftmost bins is:

\[ 0.5 \cdot (5 + 10 + 14) = 14.5\% \]

The width of the new bin is the sum of the widths of the original bins (each 0.5), so the new width is:

\[ 1.5 \]

Now that we have the area and width of the new bin, we can calculate the height:

\[ \begin{align*} \text{Area} &= \text{width} \cdot \text{height} \\ 14.5 &= 1.5 \cdot \text{height} \\ \text{height} &= \frac{14.5}{1.5} \\ \text{height} &\approx 9.67\% \text{ per unit} \end{align*} \]

happiness.hist("Score", bins = make_array(2.8, 4.3, 4.8, 5.3, 5.8, 6.3, 6.8, 7.3, 7.8, 8.3))

8.3 Probability

Building Probability Intuition
  • Draw out what’s happening to better visualize probabilities.
  • Start simple: consider just the first trial, then extend to more trials.
  • Use the multiplication rule when combining independent events across trials.
  • If you’re unsure whether to add or multiply:
    • Use add when the problem says “or” (multiple possible ways).
    • Use multiply when the problem says “and” (several conditions must happen together).
  • Example: there is only one way to get 3 heads in 3 tosses, but three ways to get exactly 1 head in 3 tosses.

8.3.1 (a)

A fair coin is tossed five times. Two possible sequences of results are HTHTH and HTHHH. Which sequence of results is more likely? Explain your answer and calculate the probability of each sequence appearing.

Answer They are equally likely since the coin is fair. By the multiplication rule, the probability of each of the two sequences is \((\frac{1}{2})^5\).

8.3.2 (b)

For parts b - e, assume we have a biased coin such that the probability of getting heads is \(\frac{1}{5}\) and the probability of getting tails is \(\frac{4}{5}\). The coin is tossed 3 times. What is the probability that you get exactly 2 heads?

Answer Here we need to consider all the possible outcomes that fall into this event and calculate their probabilities.
There are 3 possible outcomes: HHT, HTH, THH. The probability for each of them is \((\frac{1}{5})^2 * (\frac{4}{5})\). Therefore, the probability of getting exactly 2 heads is \(3*(\frac{1}{5})^2*(\frac{4}{5})\).

8.3.3 (c)

Once again, we toss the same biased coin 3 times. What is the probability you get no heads?

Answer We must have gotten all tails, and the probability for that is: \((\frac{4}{5})^3\).

8.3.4 (d)

Again, we toss the same biased coin 3 times. What is the probability you get 1 or more heads?

Answer We can use the complement rule here. The complement of getting at least 1 head is getting no heads, which we’ve just calculated in the previous question. Therefore, the probability is: \(1 - (\frac{4}{5})^3\).

8.3.5 (e)

What is the probability of getting exactly 1 head or exactly 2 heads?

Answer We can use the addition rule to add the probability of getting exactly 1 head, and the probability of getting exactly 2 heads. There are three possible outcomes for each case. The probability is: \(3*(\frac{1}{5})*(\frac{4}{5})^2 + 3*(\frac{1}{5})^2*(\frac{4}{5})\).
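
As a sanity check on parts (b)-(e), we can enumerate all \(2^3 = 8\) outcomes of three tosses of the biased coin and sum their probabilities (a quick verification sketch, not part of the original worksheet):

```python
from itertools import product

# Biased coin: P(H) = 1/5, P(T) = 4/5
p = {"H": 1 / 5, "T": 4 / 5}

# prob_heads[k] = probability of getting exactly k heads in 3 tosses
prob_heads = [0.0] * 4
for outcome in product("HT", repeat=3):
    prob = 1.0
    for toss in outcome:
        prob *= p[toss]
    prob_heads[outcome.count("H")] += prob

print(round(prob_heads[2], 3))                  # (b) exactly 2 heads: 0.096
print(round(prob_heads[0], 3))                  # (c) no heads: 0.512
print(round(1 - prob_heads[0], 3))              # (d) 1 or more heads: 0.488
print(round(prob_heads[1] + prob_heads[2], 3))  # (e) exactly 1 or 2 heads: 0.48
```

Note that (d) and (e) differ by exactly \((\frac{1}{5})^3 = 0.008\), the probability of all three heads.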

8.4 Simulation and Hypothesis Testing

A tortoise and a hare want to have a race on a number line! They both start at 0 and the race lasts for 100 time steps. However, they move differently. At each time step the tortoise moves 1 step forward with a \(\frac{1}{2}\) chance (and stays in place with a \(\frac{1}{2}\) chance), and the hare moves 3 steps forward with a \(\frac{1}{6}\) chance.

They race, and the tortoise loses badly; the hare finishes 50 steps ahead of the tortoise. Suspicious, the tortoise decides to conduct a hypothesis test to determine whether or not the hare is actually faster.

Simulation and Hypothesis Testing
  • Example: testing whether the hare is actually faster than the tortoise.
  • Use a one-sided alternative hypothesis — don’t take absolute values in the test statistic.
  • To simulate using hypothetical probabilities, use sample_proportions.
  • Be careful: the one_race function doesn’t directly compute the test statistic.
    • It gives an intermediate result, from which you can build an overlaid histogram or compute the difference to get the test statistic you need.
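
If you want to reason about what sample_proportions is doing, a numpy-only sketch is below. This assumes (reasonably, but it is an assumption) that sample_proportions(n, probs) behaves like a multinomial draw of n samples divided by n:

```python
import numpy as np

rng = np.random.default_rng(2025)  # seed chosen here for reproducibility

def sample_proportions_sketch(n, probs):
    """Roughly what datascience's sample_proportions does: draw n samples
    from the categorical distribution probs, return sample proportions."""
    return rng.multinomial(n, probs) / n

# One simulated tortoise outcome: proportion of "move forward" time steps
# out of 100, times 100 steps of size 1
tortoise_sim = sample_proportions_sketch(100, [1 / 2, 1 / 2])
tortoise_distance = tortoise_sim[0] * 100
print(0 <= tortoise_distance <= 100)  # True
```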

8.4.1 (a)

Fill in the blanks below for the null and alternative hypotheses of this test, as well as a valid test statistic.

Null Hypothesis:

Alternative Hypothesis:

Test Statistic:

Answer

Null Hypothesis: The hare is not faster than the tortoise. The hare finishing 50 steps ahead of the tortoise was simply due to random chance.

Alternative Hypothesis: The hare is faster than the tortoise. The hare finishing 50 steps ahead of the tortoise was not just due to random chance.

Test Statistic: The difference in distances between the hare and the tortoise at the end of each race.

8.4.2 (b)

Write a function called one_race() that simulates a single race of 100 time steps. It should return a two element array of the final distances of both the tortoise and the hare (in that order) from the origin after 100 time steps.

def one_race():
    tortoise_array = ________
    hare_array = ________
    tortoise_sim = ________
    hare_sim = ________
    tortoise_distance = ________
    hare_distance = ________
    return make_array(tortoise_distance, hare_distance)
Answer
def one_race():
    tortoise_array = make_array(1/2, 1/2)
    hare_array = make_array(1/6, 5/6)
    tortoise_sim = sample_proportions(100, tortoise_array)
    hare_sim = sample_proportions(100, hare_array)
    tortoise_distance = tortoise_sim.item(0) * 100
    hare_distance = hare_sim.item(0) * 300
    return make_array(tortoise_distance, hare_distance)
one_race()
array([ 43.,  57.])

8.4.3 (c)

We would now like to simulate what would happen if the tortoise and the hare raced 10,000 times. Complete the code below, recording how far the tortoise and the hare end from the origin in the arrays tortoise_distances and hare_distances respectively.

tortoise_distances = make_array()
hare_distances = make_array()
for ________:
    race = ________
    one_tortoise_dist = ________
    one_hare_dist = ________
    tortoise_distances = ________
    hare_distances = ________
Answer
tortoise_distances = make_array()
hare_distances = make_array()
for i in np.arange(10000):
    race = one_race()
    one_tortoise_dist = race.item(0)
    one_hare_dist = race.item(1)
    tortoise_distances = np.append(tortoise_distances, one_tortoise_dist)
    hare_distances = np.append(hare_distances, one_hare_dist)
tortoise_distances, hare_distances
(array([ 47.,  55.,  48., ...,  59.,  46.,  39.]),
 array([ 39.,  30.,  42., ...,  45.,  63.,  57.]))

8.4.4 (d)

After 10,000 simulations of the race, we have recorded the distances in the results table which contains the following two columns:

  • Competitor (string): the name of the competitor, either “tortoise” or “hare”
  • Distance (int): the final distance of the competitor at the end of the race

The table has 20,000 rows in total — 10,000 for the tortoise and 10,000 for the hare. Using the results table, create an overlaid histogram that shows the distribution of final distances for both the tortoise and the hare.

Code
results = Table().with_columns(
    "Competitor", np.append(np.full(10000, 'tortoise'), np.full(10000, 'hare')),
    "Distance", np.append(tortoise_distances, hare_distances)
)

________.________(________, group = ________)

Answer
results.hist("Distance", group = "Competitor")


8.4.5 (e)

Create an array called differences where each value in the array represents how many steps the hare finished ahead of the tortoise in a given race. Then, write a line of code to calculate the observed p-value of the hypothesis test. Finally, assuming a 5% p-value cutoff, describe the different conclusions you would come to based on the possible values of the observed p-value.

differences = ________
p_value = ________
Answer
differences = hare_distances - tortoise_distances
p_value = np.count_nonzero(differences >= 50) / 10000

If the observed p-value is less than the 5% cutoff, we would consider this evidence against the null hypothesis and reject it in favor of the alternative hypothesis. If the observed p-value is greater than the 5% cutoff, our observation is consistent with our null hypothesis and we would fail to reject the null.

p_value
0.0001

8.5 True/False

Respond with true or false to the following questions. If your answer is false, explain why.


8.5.1 (a)

In the U.S. in 2000, there were 2.4 million deaths from all causes, compared to 1.9 million in 1970, which represents a 25% increase. The data shows that the public’s health got worse over the period 1970-2000.

Answer False, because the population also got bigger between 1970 and 2000. It would be more appropriate to look at the total number of deaths compared to the total population at each year. In fact, the U.S. population in 1970 was 203 million, while in 2000 it was 281 million.

8.5.2 (b)

A company is interested in knowing whether women are paid less than men in their organization. They share all their salary data with you. An A/B test is the best way to examine the hypothesis that all employees in the company are paid equally.

Answer False, there is no room for statistical inference here. We have access to the whole population, so the answer can simply be retrieved by directly looking at the data. There is no need for an A/B test here.

8.5.3 (c)

Consider a randomized controlled trial where participants are randomly split into treatment and control groups. We are 100% certain there will be no systematic differences between the treatment and control groups if the process is followed correctly.

Answer

False. Randomization can still give rise to significantly different treatment and control groups merely by chance, meaning there is still the possibility for systematic differences between the treatment and control groups.

As an example, if you were holding an RCT for a new energy drink, the randomization could, just by chance, lead to one group having a large majority of coffee drinkers and the other a large majority of non-caffeine users. In this case, the systematic difference refers to how the randomization did not effectively assign these two groups to control for their differences outside of the treatment.

RCTs can help minimize events like this occurring, but we cannot say it is impossible for something like this to occur.

8.5.4 (d)

A researcher considers the following scheme for splitting people into control and treatment groups. People are arranged in a line and for each person, a fair, six-sided die is rolled. If the die comes up to be a 1 or a 2, the person is allocated to the treatment group. If the die comes up to be a 3, 4, 5, or 6 then the person is allocated to the control group. This is a randomized controlled experiment.

Answer True, because the participants were randomly assigned to each group through the roll of a die. This makes it a randomized controlled experiment!

8.5.5 (e)

You are conducting a hypothesis test to check whether a coin is fair. After you calculate your observed test statistic, you see that its p-value is below the 5% cutoff. At this point, you can claim with certainty that the null hypothesis can not be true.

Answer False. Remember the definition of a p-value: it expresses the probability, under the null hypothesis, of observing a value of the test statistic at least as extreme as the observed test statistic in the direction of the alternative. As long as this probability is nonzero, we cannot claim that the null can never be true; it could be the case that we simply got an unusual sample under the null.

8.5.6 (f)

You roll a fair die a large number of times. While you are doing that, you observe the frequencies with which each face appears and you make the following statement: As I increase the number of times I roll the die, the probability histogram of the observed frequencies converges to the empirical histogram.

Answer False, the statement should be: As I increase the number of times I roll the die, the empirical histogram of the observed frequencies converges to the probability histogram of a fair die.
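
A quick simulation illustrates the corrected statement: as the number of rolls grows, the empirical proportion of each face settles near 1/6, while the probability histogram never changes. (This sketch uses an arbitrary fixed seed for reproducibility.)

```python
import numpy as np

rng = np.random.default_rng(8)

# Roll a fair die 60,000 times and compare empirical proportions to 1/6
rolls = rng.integers(1, 7, size=60_000)
for face in range(1, 7):
    proportion = np.count_nonzero(rolls == face) / len(rolls)
    print(face, round(proportion, 3))  # each proportion lands near 1/6 ≈ 0.167
```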

8.6 Functions

Cyrus loves completing the NYT Monday crossword puzzle, and is interested in seeing how fast he completes it in comparison with his friends. Over the past two months, Cyrus and his friend Monica have been recording their crossword completion times (in seconds) in the arrays cyrus_times and monica_times respectively. Cyrus decides to put his skills to the test by randomly selecting one of his times and comparing it to a randomly chosen time of Monica’s.

Code
cyrus_times = [210, 230, 250, 240, 225, 260, 245, 255, 235, 220]
monica_times = [260, 270, 300, 280, 265, 275, 290, 310, 295, 285]

crossword_times = Table().with_columns(
    "Cyrus", cyrus_times,
    "Monica", monica_times
)
Fun with Functions
  • This type of problem might look like hypothesis testing, but the focus is on functions and string manipulation.
  • Example: one_comparison returns True/False, so it can be used directly in a conditional.
  • If a function only prints and doesn’t return, you can’t save its result in a variable.
  • Don’t forget to convert numbers to strings (like wins and trials) before concatenating them.
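
The last tip matters because Python will not concatenate a string with a number. A two-line illustration (the values for wins and trials here are made up):

```python
wins = 6
trials = 10
# "..." + wins would raise a TypeError; convert with str() first
message = "Cyrus beat Monica " + str(wins) + " times out of " + str(trials) + " trials"
print(message)  # Cyrus beat Monica 6 times out of 10 trials
```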

8.6.1 (a)

Write a function called one_comparison that randomly chooses one time from cyrus_times and one time from monica_times, and returns True if Cyrus’s time was better than Monica’s.

def one_comparison():
    return ______________________________
Answer
def one_comparison():
    return np.random.choice(cyrus_times) < np.random.choice(monica_times)
one_comparison()
True

8.6.2 (b)

Now, write a function called crossword_comparison that takes in trials, which is the number of times we randomly compare one of Cyrus’s completion times with one of Monica’s. The function should print a statement explaining the total number of times Cyrus won. For example, if Cyrus won 6 times out of 10 trials, the statement should read “Cyrus beat Monica 6 times out of 10 trials”.

def crossword_comparison(trials):
    wins = ________
    for i in ________:
        if ________:
            wins = ________
    print("Cyrus beat Monica " + ________ + " times out of " + ________ + " trials")
Answer
def crossword_comparison(trials):
    wins = 0
    for i in np.arange(trials):
        if one_comparison():
            wins = wins + 1
    print("Cyrus beat Monica " + str(wins) + " times out of " + str(trials) + " trials")
crossword_comparison(100)
Cyrus beat Monica 99 times out of 100 trials

8.6.3 (c)

Cyrus is interested in using his new function to show Monica that he is superior in crossword solving. He runs crossword_comparison over 100 trials, and assigns the output to a variable called my_wins for easy access. What is one issue with this process?

Answer

This function prints a sentence rather than returning a value. The variable my_wins will therefore be assigned None, which could result in an error if it were used in any calculations.

print(crossword_comparison(100))
Cyrus beat Monica 98 times out of 100 trials
None

8.6.4 (d)

Finally, Cyrus wants to create a team of Data 8 course staff for competitive crossword-puzzle solving. However, he is particular and will only accept them if they satisfy the following two conditions:

  • He wants to create a very strong team, so he only wants to recruit people who have an average crossword completion time below 5 minutes.
  • His favorite number is 10, so of the people above he will only recruit those who have a last name that is exactly 10 letters long.

Write a function that takes in a table with three columns:

  • First (str): Player’s first name
  • Last (str): Player’s last name
  • Time (int): Player’s completion time for that puzzle in seconds

and returns an array of player names (First and Last) that Cyrus will recruit for his team. If Bing Concepcion is supposed to be in the array, you may leave his name as “BingConcepcion”.

def create_team(players):
    player_means = ______________________________
    with_lengths = ______________________________
    chosen_players = ______________________________
    return ______________________________
Answer
def create_team(players):
    player_means = players.group(make_array("First", "Last"), np.mean)
    with_lengths = player_means.with_columns("Length", player_means.apply(len, "Last"))
    chosen_players = with_lengths.where("Time mean", are.below(300)).where("Length", 10)
    return np.char.add(chosen_players.column("First"), chosen_players.column("Last"))
players = Table().with_columns(
    "First", ["Cyrus", "Monica", "Bing", "Wesley", "Wayland"],
    "Last", ["McSwain", "Tsai", "Concepcion", "Zheng", "La"],
    "Time", [250, 280, 260, 310, 240]
)
create_team(players)
array(['BingConcepcion'],
      dtype='<U17')

8.7 More Hypothesis Testing

Samiksha is a big fan of Trader Joe’s Frozen Mac ’n Cheese, but she noticed that the cheese used in it varies from box to box. A Trader Joe’s employee provides her with some data about the 4 different cheeses used and the probability of each being used in a box (given below as employee_proportions).

Samiksha is suspicious about this distribution. After all, Velveeta is much cheaper to use than Gruyère, and she has also never bought a box that uses Gruyère. Samiksha decides to buy many boxes throughout the next month and tracks the type of cheese used in each box. She uses this to conduct a hypothesis test.


8.7.1 (a)

Write a valid null and alternative hypothesis for this experiment.

  • Null Hypothesis:
  • Alternative Hypothesis:
Answer
  • Null Hypothesis:
    The types of cheese in the Frozen Mac ’n Cheese boxes are distributed according to the probability distribution provided by the employee. Any observed difference is simply due to chance.
  • Alternative Hypothesis:
    The types of cheese in the Frozen Mac ’n Cheese boxes are not distributed according to the probability distribution provided by the employee. Any observed difference is not just due to chance.

observed_proportions = make_array(0.2, 0.3, 0.45, 0.05)
employee_proportions = make_array(0.05, 0.55, 0.25, 0.15)

The array observed_proportions contains the proportions of cheese that Samiksha observed in 20 boxes of Mac ’n Cheese.


8.7.2 (b)

Samiksha wants to use the mean as a test statistic, but Anirra suggests that Samiksha use the TVD (total variation distance) instead. Which test statistic should Samiksha use in this case? Briefly justify your answer. Then, write a line of code to assign the observed value of the test statistic to observed_stat.

Answer Anirra is correct: Samiksha should use the total variation distance because she is comparing two categorical distributions (the observed distribution and the one provided by the Trader Joe’s employee).
observed_stat = __________________________
Answer
observed_stat = sum(np.abs(observed_proportions - employee_proportions)) / 2
observed_stat
0.35000000000000003

8.7.3 (c)

Define the function one_simulated_test_stat to simulate a random sample according to the null hypothesis and return the test statistic for that sample.

def one_simulated_test_stat():
    sample_prop = __________________________
    return __________________________
Answer
def one_simulated_test_stat():
    sample_prop = sample_proportions(20, employee_proportions)
    return sum(abs(employee_proportions - sample_prop)) / 2
one_simulated_test_stat()
0.15000000000000002

8.7.4 (d)

Samiksha simulates the test statistic 10,000 times and stores the results in an array called simulated_stats. The observed value of the test statistic is stored in observed_stat. Complete the code below so that it evaluates to the p-value of the test:

________________(simulated_stats ______ observed_stat) / __________________

Code
simulated_stats = make_array()
for i in np.arange(10000):
    simulated_stats = np.append(simulated_stats, one_simulated_test_stat())
Answer
np.count_nonzero(simulated_stats >= observed_stat) / 10000
0.0042

8.7.5 (e)

Given that the computed p-value is 0.0825, which of the following are true? Select all that apply.

  1. Using an 8% p-value cutoff, the null hypothesis should be rejected.
  2. Using a 10% p-value cutoff, the null hypothesis should be rejected.
  3. There is an 8.25% chance that the null hypothesis is true.
  4. There is an 8.25% chance that the alternative hypothesis is true.
Answer Only option 2 is correct. We can only reject the null hypothesis when the observed p-value is less than the cutoff. Furthermore, there is no chance associated with whether or not the null or alternative hypothesis is true.

8.8 More A/B Testing

Understanding A/B Testing
  • Use an A/B test to check if two samples come from the same distribution.
  • In a permutation test, sample without replacement so that the proportions of categories stay the same when shuffling.
  • A permutation is just a reordering of the original data.
    • You can shuffle either the labels or the data column.
  • Remember: the p-value cutoff (like 0.05) represents the probability of falsely rejecting the null hypothesis — it is not the same as your observed p-value.
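
The shuffling idea above can be sketched with numpy alone (the data values and group labels here are invented for illustration; the worksheet's own version below uses Table.sample instead):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: six measurements with two group labels
data = np.array([14.2, 13.1, 15.0, 11.8, 12.3, 12.9])
labels = np.array(["A", "A", "A", "B", "B", "B"])

def one_shuffled_stat():
    # A permutation is just a reordering: shuffle the labels
    # (sampling without replacement), then recompute the statistic
    shuffled = rng.permutation(labels)
    return data[shuffled == "A"].mean() - data[shuffled == "B"].mean()

simulated = np.array([one_shuffled_stat() for _ in range(1000)])
observed = data[labels == "A"].mean() - data[labels == "B"].mean()
p_value = np.count_nonzero(simulated >= observed) / len(simulated)
```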

8.8.1 (a)

Choose True/False for each of the statements below, and explain your answer.


8.8.1.1 (i)

A/B testing is used to determine whether or not we believe two samples come from the same underlying distribution.

Answer True, this is the definition of A/B testing.

8.8.1.2 (ii)

To conduct a permutation test, you should sample your data with replacement with a sample size equal to the number of rows in the original table.

Answer False, you should sample your data without replacement; otherwise, you would not get a permutation of your data.

8.8.1.3 (iii)

A/B testing is the same as using total variation distance as a test statistic for a hypothesis test.

Answer False, total variation distance is just a test statistic that computes the distance between two distributions. It does not involve taking a random permutation of your data.

8.8.2 (b)

Richard, a museum curator, has recently been given specimens of caddisflies collected from various parts of Northern California. The scientists who collected the caddisflies think that caddisflies collected at higher altitudes tend to be bigger. They tell him that the average length of the 560 caddisflies collected at high elevation is 14mm, while the average length of the 450 caddisflies collected from a slightly lower elevation is 12mm. He is not sure that this difference really matters and thinks that this could just be the result of chance in sampling.


8.8.2.1 (i)

What is an appropriate null hypothesis that Richard can simulate under?

Answer Null Hypothesis: The distribution of specimen lengths is the same for caddisflies sampled from high elevation as those sampled from low elevation. Any observed difference between the two samples is simply due to random chance.

8.8.2.2 (ii)

How could you test the null hypothesis in the A/B test from above? What assumption would you make to test the hypothesis, and how would you simulate under that assumption?

Answer If the null hypothesis is true (the caddisflies did not come from different distributions), then it should not matter how the samples were labeled (high elevation or low elevation). Under this assumption, you could shuffle the labels of the caddisflies and calculate your test statistic from this “relabeled” data.

8.8.2.3 (iii)

What would be a useful test statistic for the A/B test? Remember that the direction of your test statistic should come from the initial setting.

Answer Difference in mean lengths between the two groups. Note that this is not an absolute difference: we could choose either order for subtraction, but that choice determines the direction of our alternative hypothesis, so we need to be careful!

8.8.2.4 (iv)

Assume flies refers to the following table:

Code
high_lengths = np.random.normal(loc=14, scale=2, size=557)
low_lengths  = np.random.normal(loc=12, scale=2, size=450)

flies = Table().with_columns(
    "Elevation", ["High elevation", "Low elevation", "High elevation"] + ["High elevation"]*557 + ["Low elevation"]*450,
    "Specimen length", np.append(make_array(12.3,13.1, 12.0), np.append(high_lengths, low_lengths))
)

flies.show(3)
Elevation Specimen length
High elevation 12.3
Low elevation 13.1
High elevation 12

... (1007 rows omitted)

Fill in the blanks in this code to generate one value of the test statistic simulated under the null hypothesis.

def one_simulation():
    shuffled_labels = flies.__________________.column(__________________)
    shuffled_flies = flies.with_columns(____________________, ____________________)
    grouped = shuffled_flies.________(__________________, __________________)
    means = grouped.column('Specimen length mean')
    statistic = _________________________________
    return statistic
Answer
def one_simulation():
    shuffled_labels = flies.sample(with_replacement = False).column('Elevation')
    shuffled_flies = flies.with_columns('Elevation', shuffled_labels)
    grouped = shuffled_flies.group('Elevation', np.mean)
    means = grouped.column('Specimen length mean')
    statistic = means.item(0) - means.item(1)
    return statistic
one_simulation()
-0.16776699329442124

8.8.2.5 (v)

Fill in the code below to simulate 10000 trials of our permutation test.

test_stats = ______________________________
repetitions = ______________________________
for i in np.arange(__________):
    one_stat = ________________________
    test_stats = np.append(test_stats, one_stat)
test_stats
Answer
test_stats = make_array()
repetitions = 10000
for i in np.arange(repetitions):
    one_stat = one_simulation()
    test_stats = np.append(test_stats, one_stat)
test_stats
array([ 0.00734303, -0.10501258,  0.11388187, ..., -0.17104778,
       -0.19607721, -0.17655782])

8.8.2.6 (vi)

The histogram of test_stats is plotted below with a vertical red line indicating the observed value of our test statistic. If the p-value cutoff we use is 5%, what is the conclusion of our test?

Answer We can inspect the histogram to see that the area to the right of the observed value (which is our p-value) is greater than 5%. Since our p-value is greater than our p-value cutoff, we fail to reject the null hypothesis and conclude that the data are consistent with the null hypothesis.

8.8.2.7 (vii)

Suppose that the null hypothesis is true. If we ran this same hypothesis test 1000 times, each time from our flies table and with a p-value cutoff of 5%, how many times would we expect to incorrectly reject the null hypothesis?

Answer We would expect to reject the null hypothesis \(1000 * 0.05 = 50\) times. A p-value cutoff of 5% represents the probability of incorrectly rejecting the null hypothesis.

8.8.2.8 (viii)

What effect does decreasing our p-value cutoff have on the number of times we expect to incorrectly reject the null hypothesis?

Answer If we decrease our p-value cutoff, we are reducing the expected number of times we will incorrectly reject the null.