While exploring a Berkeley dataset (separate from babynames) with a million records, you realize that a portion of measurements in different fields are NaN values! You decide to impute these missing values before continuing your EDA. Given the empirical distribution of each of the below variables, determine how to solve the missing data problem. (Note that the data in these graphs are fictional).
Data Cleaning and Imputation
When working with data cleaning and imputation, remember that the goal is to keep the data as closely connected to the real-world situation you’re studying as possible.
Making decisions that weaken this connection can reduce the usefulness and accuracy of your analysis.
There usually isn’t one “right” answer—some methods are better than others based on judgment and context.
Your choices should be clear and justifiable so others can understand and trust your work.
A good way to improve your understanding is to discuss your answers with classmates, compare different approaches, and think about when each choice makes sense or doesn’t.
18.1.1 (a)
Suppose that you plot “cups of coffee sold at V&A Cafe per day” versus “inches of rain per day” across a period of 2 months, shown below. V&A Cafe is not missing any data, but 30% of the data in “inches of rain” are NaN values that have been represented with “-2”, an impossible amount of rain. Which of the following techniques would be most effective in solving the issue of missing data? (Select all that apply)
Answer
A, C: Since the distribution along the y-axis is roughly symmetric around its center, replacing the NaN values with 10 is reasonable. Thus, either the mean or the median would work.
B: The variable is quantitative continuous, so it does not have a meaningful mode; nearly every value is distinct.
D: Remember that 30% of the data is missing, and it’s possible that the data could be missing due to an inherent bias. Hence, dropping the NaN values could significantly affect the data.
E: Since we have complete data on the number of cups of coffee sold, we can use it to interpolate the missing values. For example, if a data point at 20 cups of coffee was missing its inches of rain, we could take the average inches of rain for days with 19 to 21 cups of coffee (which will likely be close to the mean/median) and use that to fill in the missing value.
Handling Missing Values with Sentinel Values
Sometimes, missing values (NaNs) are replaced with unlikely or impossible numbers to signal missing data.
For example, in Homework 2, missing zip codes were replaced with -9999 — a value that doesn’t normally occur in real data.
This technique helps keep the dataset complete while clearly marking missing entries for later processing or analysis.
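As a concrete sketch of this idea in pandas (the DataFrame and column names here are made up for illustration, reusing the -2 sentinel from part (a)):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'rain_in' uses -2 as a sentinel for missing values.
df = pd.DataFrame({
    "cups_sold": [120, 95, 150, 80, 110],
    "rain_in": [0.4, -2, 1.1, -2, 0.6],
})

# Convert the sentinel back to NaN so pandas treats it as missing...
df["rain_in"] = df["rain_in"].replace(-2, np.nan)

# ...then impute with the mean of the observed values (the median works
# similarly for a roughly symmetric distribution).
df["rain_in"] = df["rain_in"].fillna(df["rain_in"].mean())
```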
18.1.2 (b)
Suppose we examine the amount of money lost/gained in a game of poker and see that this variable is missing 1% of its values. Its distribution, shown below, is constructed from all valid (non-NaN) values. Which of the following techniques would be reasonably effective in solving this issue of missing data? (Select all that apply)
Answer
A, C: This is a bimodal distribution, so the mean or median would yield values around 50, where the least amount of data is concentrated. That would not be a good representation of this distribution.
B: The modes of 0 and 100 both represent where the majority of the data is concentrated, so imputing the missing values with either of them gives a reasonable estimate.
D: Since only 1% of the data is missing, dropping the values would not have an immense effect on the data. Additionally, there could be an underlying reason behind the two different modes, so it might be safer to drop values than to (incorrectly) impute the wrong mode.
E: Because we’re only given one variable, we cannot perform interpolation.
Choosing Between Multiple Modes
If you encounter multiple modes (e.g., 0 and 100) and are unsure which to choose, remember that this is often a judgment call—there may not be a clearly “right” answer.
A good practice is to try both choices and check if your overall analysis changes or stays consistent. This helps you understand the impact of your decision and strengthens your conclusions.
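For instance, a minimal sketch of this sensitivity check, using a made-up winnings Series:

```python
import numpy as np
import pandas as pd

# Hypothetical bimodal winnings data with two missing values.
winnings = pd.Series([0, 0, 100, 100, np.nan, 0, 100, np.nan])

# .mode() returns every mode; here it returns both 0 and 100.
modes = winnings.mode()

# Impute once with each mode and compare the downstream results to see
# how sensitive the analysis is to this judgment call.
filled_low = winnings.fillna(modes.iloc[0])
filled_high = winnings.fillna(modes.iloc[1])
print(filled_low.mean(), filled_high.mean())
```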
18.1.3 (c)
Suppose that the relationship between students’ time asleep (in hours) and the amount of extra credit they received in Data 100 is shown below. There is no missing data for “hours asleep”, but 0.5% of “extra credit score” is missing. Like in part a, the missing NaN values were replaced with an impossible score of -0.002, making the graph look funky. Which of the following techniques would be most effective in solving this issue of missing data? (Select all that apply)
Answer
A, B, C: The distribution of this data does not follow a clean or symmetric pattern; instead, there are three peaks of different magnitudes. None of the mean, median, or mode is an accurate representation of the overall distribution, so using any of them to impute the missing values would likely be a poor estimate.
D: Like in question 1b, one possibility is to drop the missing values; since only 0.5% of the data is missing, dropping these rows won't have a large effect on the overall distribution.
E: This example is a lot like the one shown in the Data Cleaning and EDA lecture. We can interpolate by averaging the extra credit scores in a window, say +/- 2 hours, around each missing point's "hours asleep". In doing so, the resulting graph looks smooth.
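A minimal sketch of this windowed interpolation, using made-up hours_asleep and extra_credit columns and the -0.002 sentinel from the prompt:

```python
import numpy as np
import pandas as pd

# Hypothetical data: -0.002 is the impossible sentinel from the prompt.
df = pd.DataFrame({
    "hours_asleep": [4, 5, 6, 7, 8, 9],
    "extra_credit": [0.01, 0.03, -0.002, 0.05, 0.04, -0.002],
})

# Convert the impossible sentinel score back to NaN.
df["extra_credit"] = df["extra_credit"].replace(-0.002, np.nan)

# For each missing score, average the observed scores of students who
# slept within +/- 2 hours of that student.
def window_mean(hours):
    nearby = (df["hours_asleep"] - hours).abs() <= 2
    return df.loc[nearby, "extra_credit"].mean()  # mean() skips NaNs

missing = df["extra_credit"].isna()
df.loc[missing, "extra_credit"] = df.loc[missing, "hours_asleep"].apply(window_mean)
```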
The Art of Data Cleaning
Data cleaning goes beyond simply imputing missing values with mean, median, or mode — it is an art that requires thoughtful justification.
Often, domain knowledge helps us make reasonable inferences about missing data. When such knowledge is not available, statistical techniques like imputations are useful tools. However, it is important to always document any data cleaning choices you make to maintain transparency and rigor in your analysis.
It’s the annual Monopoly World Championship! The finalists: Shawn, Amanda, Neil, and Annie are playing Monopoly, a board game where players pay a price to buy properties, which can then generate income for them. Each property can be owned by only one player at a time. At the end of the game, the player with the most money wins.
Shawn wants to figure out which properties are most worth buying. He creates a DataFrame income with data on the current game state, shown on the left. He also finds a DataFrame properties with data on Monopoly properties, shown on the right.
Both tables have 28 rows. For brevity, only the first few rows of each DataFrame are shown.
Consider the Player and Purchase Price variables. What type of variable is each one? (quantitative, qualitative nominal, qualitative ordinal)
Answer
Player is a qualitative nominal variable.
Purchase Price is a quantitative variable.
18.2.3 (c)
Which of the following line(s) of code successfully returns a Series with the number of properties each player owns? Select all that apply.
Using the Reference Sheet
If you’re unsure about what a function returns, you can use the reference sheet found under the Resources tab on the course website. It’s a useful tool to quickly look up function details when you need a refresher.
Answer
income.groupby('Player').agg(pd.value_counts)
                     Property Income Generated
Player
Amanda        [1, 1, 1, 1, 1]     [2, 1, 1, 1]
Annie      [1, 1, 1, 1, 1, 1]        [3, 2, 1]
Neil    [1, 1, 1, 1, 1, 1, 1]  [2, 2, 1, 1, 1]
Shawn      [1, 1, 1, 1, 1, 1]  [2, 1, 1, 1, 1]
Option A is wrong because it returns a DataFrame with multiple columns.
income['Player'].value_counts()
Player
Neil 7
Annie 6
Shawn 6
Amanda 5
Name: count, dtype: int64
income.groupby('Player').size()
Player
Amanda 5
Annie 6
Neil 7
Shawn 6
dtype: int64
income['Player','Property'].groupby('Player').size()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[5], line 1
----> 1 income['Player','Property'].groupby('Player').size()
...
KeyError: ('Player', 'Property')
Option C will error because it attempts to select two column names with a single indexing operator: income['Player','Property'] treats the tuple ('Player', 'Property') as one column label, which raises a KeyError. Selecting multiple columns requires a list of labels instead.
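For contrast, a small sketch (with a made-up income table) of the list-based selection that does work:

```python
import pandas as pd

income = pd.DataFrame({
    "Player": ["Amanda", "Amanda", "Neil"],
    "Property": ["Boardwalk", "Park Place", "Baltic Avenue"],
})

# income['Player', 'Property'] raises a KeyError: the tuple is treated
# as a single column label. A list of labels selects multiple columns:
counts = income[["Player", "Property"]].groupby("Player").size()
print(counts)  # Amanda 2, Neil 1
```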
18.2.4 (d)
He now decides to calculate the amount of profit from each property. He wants to store this in a column called Profit in the income DataFrame.
To do this, he first has to transform the Income Generated column to be of a float datatype. Write one line of code to replace the old column with a new column, also called Income Generated, with the datatype modification described above. You may assume that each entry in Income Generated consists of a dollar sign ($) followed by a number, except for the NaN values.
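One possible answer, shown here against a toy stand-in for the income table (assuming, as stated, that each non-NaN entry is a dollar sign followed by a number):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the income table described above.
income = pd.DataFrame({"Income Generated": ["$3", "$16", np.nan]})

# Strip the leading '$' and cast to float; NaN entries pass through the
# .str accessor as NaN, so they survive the conversion.
income['Income Generated'] = income['Income Generated'].str.strip('$').astype(float)
```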
Assuming that the answer to (c) is correct, let's add a Profit column to the income DataFrame. Fill in the following blanks to do this, adding arguments to the function calls as you see appropriate.
Note: Profit is calculated by subtracting the purchase price from generated income.
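Since the blanks themselves aren't reproduced here, the following is one self-contained way to compute the column, using toy stand-ins for income and properties:

```python
import pandas as pd

# Toy stand-ins for the two tables (Income Generated already cleaned).
income = pd.DataFrame({
    "Property": ["Boardwalk", "Baltic Avenue"],
    "Player": ["Amanda", "Neil"],
    "Income Generated": [16.0, 3.0],
})
properties = pd.DataFrame({
    "Property": ["Boardwalk", "Baltic Avenue"],
    "Purchase Price": [400.0, 60.0],
})

# Look up each property's purchase price, then subtract it from the
# income that property generated.
prices = income["Property"].map(properties.set_index("Property")["Purchase Price"])
income["Profit"] = income["Income Generated"] - prices
```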
Shawn realizes he's lost more money than he's made. To solve this problem, he begins by writing some Pandas code to merge the Property Color column into the income DataFrame and drop all rows with NaN values. He calls this DataFrame merged_df. Shown below are the first few rows.
Shawn decides he will now only buy properties from a color group that he deems “profitable.” He deems a color group “profitable” if at least 50% of the properties in the group that are currently owned by players have made a positive (non-zero) profit for those players.
Fill in the following lines of code to help him display a DataFrame with a subset of the rows in merged_df: the rows with properties that belong to profitable color groups. Your solution may use fewer lines of code than we provide.
When working through this question, break down how filter uses the function it is given. After groupby creates subgroups, filter applies the function on each group (which is why the argument for the function is named group).
All subgroups for which the function returns False are removed from the resulting DataFrame, while those that return True are kept.
Another important point to explain is how to translate the filtering condition from the question into code — specifically, the if statement inside the function.
Keep in mind, this task can also be completed using a long lambda function.
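A minimal sketch of the filled-in code under these assumptions (it presumes merged_df from above, with the Property Color and Profit columns already in place):

```python
def is_profitable(group):
    # Keep a color group if at least 50% of its currently owned
    # properties made a positive (non-zero) profit for their owners.
    return (group["Profit"] > 0).mean() >= 0.5

# filter() drops every color group for which is_profitable returns False.
profitable_df = merged_df.groupby("Property Color").filter(is_profitable)
```

As the last note mentions, is_profitable can equivalently be written inline as a lambda.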
18.3 Data Cleaning and EDA (Extra)
Consider the following sample of the babynames DataFrame obtained by using babynames.sample(5).
Consider the Name, Year, and Count variables. What type of variable is each one? (quantitative discrete, quantitative continuous, qualitative nominal, qualitative ordinal)
Answer
Year is a quantitative discrete variable.
Count is a quantitative discrete variable.
Name is a qualitative nominal variable.
Quantitative Discrete vs. Continuous
Note that the difference between quantitative discrete and quantitative continuous variables is not emphasized in this semester’s material.
18.3.2 (b)
Which of the following is a (minimal) primary key for the babynames table from question 1 (given the sample of the data and reasonable assumptions about what might be in the rest of the data)?
A primary key is the column or minimal set of columns that uniquely determines the values in all the remaining columns. This is a statement about the table structure and should hold for all data that could be put in the table.
Answer
The columns State, Sex, Year, and Name form the primary key, which uniquely determines the value of Count for this data and all data that might go in this table. We would not expect two different values of Count for the same State, Sex, Year, and Name.
Understanding Tables as Functions and Primary Keys
Think of the table as a function that maps a primary key (a combination of columns) to the value of Count.
If you remove one of the columns from the primary key, the Count can no longer be uniquely determined by the remaining columns.
An important point to highlight is that the order of columns in the primary key does not matter.
Also, discuss the index column introduced by pandas. This index acts like a record number and could serve as a primary key. However, in this case, the table already has a primary key independent of the record order.
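As a sanity check, here is a sketch of verifying such a key in pandas, using a toy stand-in for babynames:

```python
import pandas as pd

# Toy stand-in for the babynames table.
babynames = pd.DataFrame({
    "State": ["CA", "CA", "NY"],
    "Sex":   ["F", "F", "F"],
    "Year":  [2020, 2021, 2020],
    "Name":  ["Mia", "Mia", "Mia"],
    "Count": [210, 195, 180],
})

# A candidate key is valid only if no two rows share the same key values.
key = ["State", "Sex", "Year", "Name"]
assert not babynames.duplicated(subset=key).any()

# Dropping any key column breaks uniqueness: without Year, the key
# ('CA', 'F', 'Mia') would map to two different Counts.
print(babynames.duplicated(subset=["State", "Sex", "Name"]).any())  # True
```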