18  Discussion 03: Pandas II & EDA (From Summer 2025)


18.1 Dealing with Missing Data

While exploring a Berkeley dataset (separate from babynames) with a million records, you realize that a portion of measurements in different fields are NaN values! You decide to impute these missing values before continuing your EDA. Given the empirical distribution of each of the below variables, determine how to solve the missing data problem. (Note that the data in these graphs are fictional).

Data Cleaning and Imputation

When working with data cleaning and imputation, remember that the goal is to keep the data as closely connected to the real-world situation you’re studying as possible.

  • Making decisions that weaken this connection can reduce the usefulness and accuracy of your analysis.
  • There usually isn’t one “right” answer—some methods are better than others based on judgment and context.
  • Your choices should be clear and justifiable so others can understand and trust your work.

A good way to improve your understanding is to discuss your answers with classmates, compare different approaches, and think about when each choice makes sense or doesn’t.


18.1.1 (a)

Suppose that you plot “cups of coffee sold at V&A Cafe per day” versus “inches of rain per day” across a period of 2 months, shown below. V&A Cafe is not missing any data, but 30% of the data in “inches of rain” are NaN values that have been represented with “-2”, an impossible amount of rain. Which of the following techniques would be most effective in solving the issue of missing data? (Select all that apply)

A. Impute the missing values with the mean
B. Impute the missing values with the mode
C. Impute the missing values with the median
D. Drop the rows with missing values
E. Impute the missing values using interpolation

Answer

A, C: Since the distribution of “inches of rain” is roughly symmetric around its center, replacing the NaN values with the central value of about 10 is reasonable. Thus either the mean or the median would work.

B: “Inches of rain” is a quantitative continuous variable, so nearly every observed value is distinct and a mode is not meaningful for imputation.

D: Remember that 30% of the data is missing, and it’s possible that the data could be missing due to an inherent bias. Hence, dropping the NaN values could significantly affect the data.

E: Since we have data on the number of cups of coffee sold, we can use it to perform interpolation and replace the missing values. For example, if the data point at 20 cups of coffee was missing its “inches of rain” value, we could take the average inches of rain over 19-21 cups of coffee (which will likely be close to the mean/median) and use that to fill in the missing value.

Handling Missing Values with Sentinel Values

Sometimes, missing values (NaNs) are replaced with unlikely or impossible numbers to signal missing data.

For example, in Homework 2, missing zip codes were replaced with -9999 — a value that doesn’t normally occur in real data.

This technique helps keep the dataset complete while clearly marking missing entries for later processing or analysis.
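As a minimal sketch (using made-up values, not the actual cafe data), the sentinel can be converted back to NaN before any imputation:

import numpy as np
import pandas as pd

# Hypothetical data: -2 marks missing rainfall, an impossible value
rain = pd.DataFrame({'cups_sold': [120, 95, 143, 88],
                     'inches_of_rain': [0.3, -2.0, 1.1, -2.0]})

# Replace the sentinel with NaN so pandas treats those entries as missing
rain['inches_of_rain'] = rain['inches_of_rain'].replace(-2.0, np.nan)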


18.1.2 (b)

Suppose we examine the amount of money lost/gained in a game of poker and see that this variable is missing 1% of its values. Its distribution, shown below, is constructed from all valid (non-NaN) values. Which of the following techniques would be reasonably effective in solving this issue of missing data? (Select all that apply)

A. Impute the missing values with the mean
B. Impute the missing values with the mode
C. Impute the missing values with the median
D. Drop the rows with missing values
E. Impute the missing values using interpolation

Answer

A, C: This is a bimodal distribution, so the mean or median would yield values around 50, where the least amount of data is concentrated. That would not be a good representation of this distribution.

B: The modes of 0 and 100 both represent where the majority of the data is concentrated, so using them to impute the missing values gives the best estimate.

D: Since only 1% of the data is missing, dropping those rows would not have a large effect on the data. Additionally, there could be an underlying reason behind the two different modes, so it might be safer to drop values than to (incorrectly) impute the wrong mode.

E: Because we’re only given one variable, we cannot perform interpolation.

Choosing Between Multiple Modes

If you encounter multiple modes (e.g., 0 and 100) and are unsure which to choose, remember that this is often a judgment call—there may not be a clearly “right” answer.

A good practice is to try both choices and check if your overall analysis changes or stays consistent. This helps you understand the impact of your decision and strengthens your conclusions.
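As a minimal sketch of this practice (with made-up values, not the actual poker data), one can impute with each mode in turn and compare the results:

import pandas as pd

# Hypothetical bimodal sample with one missing value
poker = pd.Series([0, 0, 100, 100, 0, 100, None])

# .mode() returns every value tied for most frequent; here, both 0 and 100
for m in poker.mode():
    imputed = poker.fillna(m)
    print(f"imputing with {m}: mean = {imputed.mean():.1f}")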


18.1.3 (c)

Suppose that the relationship between students’ time asleep (in hours) and the amount of extra credit they received in Data 100 is shown below. There is no missing data for “hours asleep”, but 0.5% of “extra credit score” is missing. As in part (a), the missing NaN values were replaced with an impossible score of -0.002, making the graph look funky. Which of the following techniques would be most effective in solving this issue of missing data? (Select all that apply)

A. Impute the missing values with the mean
B. Impute the missing values with the mode
C. Impute the missing values with the median
D. Drop the rows with missing values
E. Impute the missing values using interpolation

Answer

A, B, C: The distribution of this data does not follow a clean or symmetrical pattern. Instead, there are 3 different peaks of different magnitudes. Thus, none of the mean, mode, or median is an accurate representation of the overall distribution, and using any of them to impute the missing values would likely give a poor estimate.

D: As in part (b), one possibility is to drop the missing values; since only 0.5% of the data is missing, dropping them won’t have a large effect on the overall distribution.

E: This example is a lot like the one shown in the Data Cleaning and EDA lecture. We can interpolate by averaging the extra credit scores in a window, say +/- 2 hours, around each “hours asleep” value. After doing so, the graph looks smooth. A sketch of this idea follows:
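A minimal sketch of this windowed imputation (the column names, window size, and values are assumptions for illustration):

import numpy as np
import pandas as pd

# Hypothetical data: extra credit scores with one missing entry
df = pd.DataFrame({
    'hours_asleep': [4, 5, 6, 7, 8, 9],
    'extra_credit': [1.0, np.nan, 2.0, 2.5, 2.5, 3.0],
})

def window_impute(row, window=2):
    # Fill a missing score with the mean score of rows whose
    # 'hours_asleep' lies within +/- `window` hours of this row's value
    if pd.isna(row['extra_credit']):
        nearby = (df['hours_asleep'] - row['hours_asleep']).abs() <= window
        return df.loc[nearby, 'extra_credit'].mean()  # mean() skips NaN
    return row['extra_credit']

df['extra_credit'] = df.apply(window_impute, axis=1)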

The Art of Data Cleaning

Data cleaning goes beyond simply imputing missing values with mean, median, or mode — it is an art that requires thoughtful justification.

Often, domain knowledge helps us make reasonable inferences about missing data. When such knowledge is not available, statistical techniques like imputations are useful tools. However, it is important to always document any data cleaning choices you make to maintain transparency and rigor in your analysis.

18.2 Pandas + EDA exam prep (modeled after Fa22 Midterm Q1)

It’s the annual Monopoly World Championship! The finalists, Shawn, Amanda, Neil, and Annie, are playing Monopoly, a board game where players pay a price to buy properties, which can then generate income for them. Each property can be owned by only one player at a time. At the end of the game, the player with the most money wins.

Shawn wants to figure out which properties are most worth buying. He creates a DataFrame income with data on the current game state, and he finds a DataFrame properties with data on Monopoly properties. Both tables have 28 rows and are constructed in the code below.

Code
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

properties_data = {
    'Property': [
        'Mediterranean Avenue', 'Baltic Avenue', 'Oriental Avenue', 'Vermont Avenue',
        'Connecticut Avenue', 'St. Charles Place', 'States Avenue', 'Virginia Avenue',
        'St. James Place', 'Tennessee Avenue', 'New York Avenue', 'Kentucky Avenue',
        'Indiana Avenue', 'Illinois Avenue', 'Atlantic Avenue', 'Ventnor Avenue',
        'Marvin Gardens', 'Pacific Avenue', 'North Carolina Avenue', 'Pennsylvania Avenue',
        'Park Place', 'Boardwalk', 'Reading Railroad', 'Pennsylvania Railroad',
        'B. & O. Railroad', 'Short Line', 'Electric Company', 'Water Works'
    ],
    'Property Color': [
        'Brown', 'Brown', 'Light Blue', 'Light Blue', 'Light Blue', 'Pink', 'Pink', 'Pink',
        'Orange', 'Orange', 'Orange', 'Red', 'Red', 'Red', 'Yellow', 'Yellow', 'Yellow',
        'Green', 'Green', 'Green', 'Dark Blue', 'Dark Blue', 'Railroad', 'Railroad',
        'Railroad', 'Railroad', 'Utility', 'Utility'
    ],
    'Purchase Price': [
        60, 60, 100, 100, 120, 140, 140, 160, 180, 180, 200, 220, 220, 240, 260, 260,
        280, 300, 300, 320, 350, 400, 200, 200, 200, 200, 150, 150
    ]
}

properties = pd.DataFrame(properties_data)

income_data = {
    'Player': [
        'Neil', 'Amanda', 'Annie', 'Annie', 'Annie', 'Shawn', np.nan, 'Shawn', 'Neil', 'Neil',
        'Neil', np.nan, np.nan, 'Amanda', 'Amanda', 'Amanda', 'Neil', 'Shawn', 'Shawn',
        'Shawn', 'Amanda', 'Shawn', 'Annie', np.nan, 'Annie', 'Annie', 'Neil', 'Neil'
    ],
    'Property': [
        'Mediterranean Avenue', 'Baltic Avenue', 'Oriental Avenue', 'Vermont Avenue',
        'Connecticut Avenue', 'St. Charles Place', 'States Avenue', 'Virginia Avenue',
        'St. James Place', 'Tennessee Avenue', 'New York Avenue', 'Kentucky Avenue',
        'Indiana Avenue', 'Illinois Avenue', 'Atlantic Avenue', 'Ventnor Avenue',
        'Marvin Gardens', 'Pacific Avenue', 'North Carolina Avenue', 'Pennsylvania Avenue',
        'Park Place', 'Boardwalk', 'Reading Railroad', 'Pennsylvania Railroad',
        'B. & O. Railroad', 'Short Line', 'Electric Company', 'Water Works'
    ],
    'Income Generated': [
        '$40', '$60', '$50', '$50', '$60', '$100', np.nan, '$120', '$140', '$140', '$160', np.nan, np.nan, '$180', '$200',
        '$200', '$200', '$250', '$250', '$150', '$375', '$425', '$100', np.nan, '$100', '$100', '$75', '$75'
    ]
}

income = pd.DataFrame(income_data)

18.2.1 (a)

What is the granularity of the income table?

Answer

Property: each row of the income table corresponds to exactly one property.

18.2.2 (b)

Consider the Player and Purchase Price variables. What type of variable is each one? (quantitative, qualitative nominal, qualitative ordinal)

Answer

Player is a qualitative nominal variable.

Purchase Price is a quantitative variable.

18.2.3 (c)

Which of the following line(s) of code successfully returns a Series with the number of properties each player owns? Select all that apply.

Using the Reference Sheet

If you’re unsure about what a function returns, you can use the reference sheet found under the Resources tab on the course website. It’s a useful tool to quickly look up function details when you need a refresher.

Answer
Option A:

income.groupby('Player').agg(pd.value_counts)

                      Property Income Generated
Player
Amanda         [1, 1, 1, 1, 1]     [2, 1, 1, 1]
Annie       [1, 1, 1, 1, 1, 1]        [3, 2, 1]
Neil     [1, 1, 1, 1, 1, 1, 1]  [2, 2, 1, 1, 1]
Shawn       [1, 1, 1, 1, 1, 1]  [2, 1, 1, 1, 1]

Option A is wrong because it returns a DataFrame with multiple columns.

Option B:

income['Player'].value_counts()
Player
Neil      7
Annie     6
Shawn     6
Amanda    5
Name: count, dtype: int64
Option D:

income.groupby('Player').size()
Player
Amanda    5
Annie     6
Neil      7
Shawn     6
dtype: int64

Options B and D are correct. Note that both exclude the rows with NaN players by default: value_counts and groupby each drop NaN labels.

Option C:

income['Player', 'Property'].groupby('Player').size()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[5], line 1
----> 1 income['Player', 'Property'].groupby('Player').size()
...
KeyError: ('Player', 'Property')

Option C errors because it attempts to select two column names with a single indexing operator; pandas interprets ('Player', 'Property') as one (tuple) column label, which does not exist.
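For reference, selecting both columns with a list of labels (double brackets) runs without error and, after the groupby, produces the same counts as Option D:

income[['Player', 'Property']].groupby('Player').size()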

18.2.4 (d)

He now decides to calculate the amount of profit from each property. He wants to store this in a column called Profit in the income DataFrame.

To do this, he first has to transform the Income Generated column to be of a float datatype. Write one line of code to replace the old column with a new column, also called Income Generated, with the datatype modification described above. You may assume that each entry in Income Generated consists of a dollar sign ($) followed by a number, except for the NaN values.

Answer
income['Income Generated'] = income['Income Generated'].str[1:].astype(float)
income['Income Generated']
0      40.0
1      60.0
2      50.0
3      50.0
4      60.0
5     100.0
6       NaN
7     120.0
8     140.0
9     140.0
10    160.0
11      NaN
12      NaN
13    180.0
14    200.0
15    200.0
16    200.0
17    250.0
18    250.0
19    150.0
20    375.0
21    425.0
22    100.0
23      NaN
24    100.0
25    100.0
26     75.0
27     75.0
Name: Income Generated, dtype: float64
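Note that the .str[1:] slice drops the leading dollar sign and propagates NaN entries, so astype(float) leaves the missing values as NaN. Starting from the original string column, an equivalent alternative (a sketch, not the only accepted answer) strips the “$” explicitly and converts with pd.to_numeric:

income['Income Generated'] = pd.to_numeric(income['Income Generated'].str.replace('$', '', regex=False))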

18.2.5 (e)

Assuming that the answer to (d) is correct, let’s add a Profit column to the income DataFrame. Fill in the following blanks to do this, and add arguments to the function calls as you see appropriate.

Note: Profit is calculated by subtracting the purchase price from generated income.

combined_df = income.___A___(__________B__________)
income["Profit"] = _____________C_____________

18.2.5.1 (i)

Fill in blank A:

Answer

merge

18.2.5.2 (ii)

Fill in blank B:

Answer

properties, on = "Property"

Note that you can have multiple arguments in each blank—this will also be the case on exams! Don’t assume you can only fill in one item per blank!

18.2.5.3 (iii)

Fill in blank C:

Answer

combined_df["Income Generated"] - combined_df["Purchase Price"]

combined_df = income.merge(properties, on = "Property")
income["Profit"] = combined_df["Income Generated"] - combined_df["Purchase Price"]
income.head()
   Player              Property  Income Generated  Profit
0    Neil  Mediterranean Avenue              40.0   -20.0
1  Amanda         Baltic Avenue              60.0     0.0
2   Annie       Oriental Avenue              50.0   -50.0
3   Annie        Vermont Avenue              50.0   -50.0
4   Annie    Connecticut Avenue              60.0   -60.0

18.2.6 (f)

Shawn realizes he’s lost more money than he’s made. To solve this problem, he begins by writing some Pandas code to merge the Property Color column into the income DataFrame and drop all rows with NaN values. He calls this DataFrame merged_df; it is constructed in the code below.

Shawn decides he will now only buy properties from a color group that he deems “profitable.” He deems a color group “profitable” if at least 50% of the properties in the group that are currently owned by players have made a positive (non-zero) profit for those players.

Fill in the following lines of code to help him display a DataFrame with a subset of the rows in merged_df: the rows with properties that belong to profitable color groups. Your solution may use fewer lines of code than we provide.

def func(group):
    ____________________________________________________
    ____________________________________________________
    ____________________________________________________
    ____________________________________________________
    
merged_df.______________________________________________
Code
merged_df = income.merge(properties.drop("Purchase Price", axis=1), on="Property").dropna()
Answer
def func(group):
    # Keep the group if at least half of its owned properties made a positive profit
    if sum(group['Profit'] > 0) / len(group['Profit']) >= 0.5:
        return True
    return False

merged_df.groupby("Property Color").filter(func)
    Player    Property  Income Generated  Profit Property Color
20  Amanda  Park Place             375.0    25.0      Dark Blue
21   Shawn   Boardwalk             425.0    25.0      Dark Blue
How filter Works with groupby

When working through this question, break down how filter uses the function it is given. After groupby creates subgroups, filter applies the function on each group (which is why the argument for the function is named group).

All subgroups for which the function returns False are removed from the resulting DataFrame, while those that return True are kept.

Another important point to explain is how to translate the filtering condition from the question into code — specifically, the if statement inside the function.

Keep in mind, this task can also be completed using a lambda function, as sketched below.
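For instance, the same filtering condition written as an anonymous function:

merged_df.groupby("Property Color").filter(
    lambda group: sum(group["Profit"] > 0) / len(group["Profit"]) >= 0.5
)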

18.3 Data Cleaning and EDA (Extra)

Consider the following sample of the babynames DataFrame obtained by using babynames.sample(5).

Code
file_path = 'namesbystate_ca.txt.gz'
column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

babynames = pd.read_csv(file_path, names=column_labels) 
babynames.sample(5)
       State Sex  Year    Name  Count
286380    CA   M  1972   Garry     24
101760    CA   F  1986    Chau      7
248750    CA   M  1930  Edmond     20
86711     CA   F  1981  Margie     23
181959    CA   F  2008   Ivana     38

18.3.1 (a)

Consider the Name, Year, and Count variables. What type of variable is each one? (quantitative discrete, quantitative continuous, qualitative nominal, qualitative ordinal)

Answer

Year is a quantitative discrete variable.

Count is a quantitative discrete variable.

Name is a qualitative nominal variable.

Quantitative Discrete vs. Continuous

Note that the difference between quantitative discrete and quantitative continuous variables is not emphasized in this semester’s material.


18.3.2 (b)

What is a (minimal) primary key for the babynames table above (given the sample of the data and reasonable assumptions about what might be in the rest of the data)?

A primary key is the column or minimal set of columns that uniquely determines the values in all the remaining columns. This is a statement about the table structure and should hold for all data that could be put in the table.

Answer

The columns State, Sex, Year, and Name form the primary key, which uniquely determines the value of Count for this data and all data that might go in this table. We would not expect two different values of Count for the same State, Sex, Year, and Name.

Understanding Tables as Functions and Primary Keys

Think of the table as a function that maps a primary key (a combination of columns) to the value of Count.

If you remove one of the columns from the primary key, the Count can no longer be uniquely determined by the remaining columns.

An important point to highlight is that the order of columns in the primary key does not matter.

Also, discuss the index column introduced by pandas. This index acts like a record number and could serve as a primary key. However, in this case, the table already has a primary key independent of the record order.
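A quick way to verify a candidate key in pandas is to test whether any combination of the key columns repeats; this sketch assumes the babynames DataFrame loaded above:

# True would mean some (State, Sex, Year, Name) combination appears more
# than once, in which case these columns would not form a primary key
babynames.duplicated(subset=['State', 'Sex', 'Year', 'Name']).any()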