While exploring a Berkeley dataset (separate from babynames) with a million records, you realize that a portion of measurements in different fields are NaN values! You decide to impute these missing values before continuing your EDA. Given the empirical distribution of each of the below variables, determine how to solve the missing data problem. (Note that the data in these graphs are fictional).
Data Cleaning and Imputation
When working with data cleaning and imputation, remember that the goal is to keep the data as closely connected to the real-world situation you’re studying as possible.
Making decisions that weaken this connection can reduce the usefulness and accuracy of your analysis.
There usually isn’t one “right” answer—some methods are better than others based on judgment and context.
Your choices should be clear and justifiable so others can understand and trust your work.
A good way to improve your understanding is to discuss your answers with classmates, compare different approaches, and think about when each choice makes sense or doesn’t.
18.1.1 (a)
Suppose that you plot “cups of coffee sold at V&A Cafe per day” versus “inches of rain per day” across a period of 2 months, shown below. V&A Cafe is not missing any data, but 30% of the data in “inches of rain” are NaN values that have been represented with “-2”, an impossible amount of rain. Which of the following techniques would be most effective in solving the issue of missing data? (Select all that apply)
Answer
A, C: Since the distribution along the y-axis is roughly symmetric around its center, replacing the NaN values with 10 is reasonable. Thus, either the mean or the median would work.
B: The variable is quantitative continuous, so it does not have a meaningful mode; nearly every value is distinct.
D: Remember that 30% of the data is missing, and it’s possible that the data could be missing due to an inherent bias. Hence, dropping the NaN values could significantly affect the data.
E: Since we have complete data on the number of cups of coffee sold, we can use it to interpolate the missing values. For example, if a data point at 20 cups of coffee was missing its inches of rain, we could take the average inches of rain for days with 19 to 21 cups of coffee (which will likely be close to the mean/median) and use that to fill in the missing value.
Handling Missing Values with Sentinel Values
Sometimes, missing values (NaNs) are replaced with unlikely or impossible numbers to signal missing data.
For example, in Homework 2, missing zip codes were replaced with -9999 — a value that doesn’t normally occur in real data.
This technique helps keep the dataset complete while clearly marking missing entries for later processing or analysis.
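As a concrete sketch of this idea in pandas (the DataFrame and column names here are made up for illustration, reusing the -2 sentinel from part (a)):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'rain_in' uses -2 as a sentinel for missing values.
df = pd.DataFrame({
    "cups_sold": [120, 95, 150, 80, 110],
    "rain_in": [0.4, -2, 1.1, -2, 0.6],
})

# Convert the sentinel back to NaN so pandas treats it as missing...
df["rain_in"] = df["rain_in"].replace(-2, np.nan)

# ...then impute with the mean of the observed values (the median works
# similarly for a roughly symmetric distribution).
df["rain_in"] = df["rain_in"].fillna(df["rain_in"].mean())
```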
18.1.2 (b)
Suppose we examine the amount of money lost/gained in a game of poker and see that this variable is missing 1% of its values. Its distribution, shown below, is constructed from all valid (non-NaN) values. Which of the following techniques would be reasonably effective in solving this issue of missing data? (Select all that apply)
Answer
A, C: This is a bimodal distribution, so the mean or median would yield values around 50, where the least amount of data is concentrated. That would not be a good representation of this distribution.
B: The modes of 0 and 100 both represent where the majority of the data is concentrated, so imputing the missing values with either of them gives a reasonable estimate.
D: Since only 1% of the data is missing, dropping the values would not have an immense effect on the data. Additionally, there could be an underlying reason behind the two different modes, so it might be safer to drop values than to (incorrectly) impute the wrong mode.
E: Because we’re only given one variable, we cannot perform interpolation.
Choosing Between Multiple Modes
If you encounter multiple modes (e.g., 0 and 100) and are unsure which to choose, remember that this is often a judgment call—there may not be a clearly “right” answer.
A good practice is to try both choices and check if your overall analysis changes or stays consistent. This helps you understand the impact of your decision and strengthens your conclusions.
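For instance, a minimal sketch of this sensitivity check, using a made-up winnings Series:

```python
import numpy as np
import pandas as pd

# Hypothetical bimodal winnings data with two missing values.
winnings = pd.Series([0, 0, 100, 100, np.nan, 0, 100, np.nan])

# .mode() returns every mode; here it returns both 0 and 100.
modes = winnings.mode()

# Impute once with each mode and compare the downstream results to see
# how sensitive the analysis is to this judgment call.
filled_low = winnings.fillna(modes.iloc[0])
filled_high = winnings.fillna(modes.iloc[1])
print(filled_low.mean(), filled_high.mean())
```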
18.1.3 (c)
Suppose that the relationship between students’ time asleep (in hours) and the amount of extra credit they received in Data 100 is shown below. There is no missing data for “hours asleep”, but 0.5% of “extra credit score” is missing. Like in part a, the missing NaN values were replaced with an impossible score of -0.002, making the graph look funky. Which of the following techniques would be most effective in solving this issue of missing data? (Select all that apply)
Answer
A, B, C: The distribution of this data does not follow a clean or symmetric pattern; instead, there are three peaks of different magnitudes. None of the mean, median, or mode is an accurate representation of the overall distribution, so using any of them to impute the missing values would likely be a poor estimate.
D: Like in question 1b, one possibility is to drop the missing values; since only 0.5% of the data is missing, dropping these rows won't have a large effect on the overall distribution.
E: This example is a lot like the one shown in the Data Cleaning and EDA lecture. We can interpolate by averaging the extra credit scores in a window, say +/- 2 hours, around each missing point's "hours asleep". In doing so, the resulting graph looks smooth.
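A minimal sketch of this windowed interpolation, using made-up hours_asleep and extra_credit columns and the -0.002 sentinel from the prompt:

```python
import numpy as np
import pandas as pd

# Hypothetical data: -0.002 is the impossible sentinel from the prompt.
df = pd.DataFrame({
    "hours_asleep": [4, 5, 6, 7, 8, 9],
    "extra_credit": [0.01, 0.03, -0.002, 0.05, 0.04, -0.002],
})

# Convert the impossible sentinel score back to NaN.
df["extra_credit"] = df["extra_credit"].replace(-0.002, np.nan)

# For each missing score, average the observed scores of students who
# slept within +/- 2 hours of that student.
def window_mean(hours):
    nearby = (df["hours_asleep"] - hours).abs() <= 2
    return df.loc[nearby, "extra_credit"].mean()  # mean() skips NaNs

missing = df["extra_credit"].isna()
df.loc[missing, "extra_credit"] = df.loc[missing, "hours_asleep"].apply(window_mean)
```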
The Art of Data Cleaning
Data cleaning goes beyond simply imputing missing values with mean, median, or mode — it is an art that requires thoughtful justification.
Often, domain knowledge helps us make reasonable inferences about missing data. When such knowledge is not available, statistical techniques like imputations are useful tools. However, it is important to always document any data cleaning choices you make to maintain transparency and rigor in your analysis.
It’s the annual Monopoly World Championship! The finalists: Shawn, Amanda, Neil, and Annie are playing Monopoly, a board game where players pay a price to buy properties, which can then generate income for them. Each property can be owned by only one player at a time. At the end of the game, the player with the most money wins.
Shawn wants to figure out which properties are most worth buying. He creates a DataFrame income with data on the current game state, shown on the left. He also finds a DataFrame properties with data on Monopoly properties, shown on the right.
Both tables have 28 rows. For brevity, only the first few rows of each DataFrame are shown.
Consider the Player and Purchase Price variables. What type of variable is each one? (quantitative, qualitative nominal, qualitative ordinal)
Answer
Player is a qualitative nominal variable.
Purchase Price is a quantitative variable.
18.2.3 (c)
Which of the following line(s) of code successfully returns a Series with the number of properties each player owns? Select all that apply.
Using the Reference Sheet
If you’re unsure about what a function returns, you can use the reference sheet found under the Resources tab on the course website. It’s a useful tool to quickly look up function details when you need a refresher.
Answer
income.groupby('Player').agg(pd.value_counts)
                     Property Income Generated
Player
Amanda        [1, 1, 1, 1, 1]     [2, 1, 1, 1]
Annie      [1, 1, 1, 1, 1, 1]        [3, 2, 1]
Neil    [1, 1, 1, 1, 1, 1, 1]  [2, 2, 1, 1, 1]
Shawn      [1, 1, 1, 1, 1, 1]  [2, 1, 1, 1, 1]
Option A is wrong because it returns a DataFrame with multiple columns.
income['Player'].value_counts()
Player
Neil 7
Annie 6
Shawn 6
Amanda 5
Name: count, dtype: int64
income.groupby('Player').size()
Player
Amanda 5
Annie 6
Neil 7
Shawn 6
dtype: int64
income['Player','Property'].groupby('Player').size()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[5], line 1
----> 1 income['Player','Property'].groupby('Player').size()
...
KeyError: ('Player', 'Property')
Option C will error because it attempts to select two column names with a single indexing operator: income['Player','Property'] treats the tuple ('Player', 'Property') as one column label, which raises a KeyError. Selecting multiple columns requires a list of labels instead.
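For contrast, a small sketch (with a made-up income table) of the list-based selection that does work:

```python
import pandas as pd

income = pd.DataFrame({
    "Player": ["Amanda", "Amanda", "Neil"],
    "Property": ["Boardwalk", "Park Place", "Baltic Avenue"],
})

# income['Player', 'Property'] raises a KeyError: the tuple is treated
# as a single column label. A list of labels selects multiple columns:
counts = income[["Player", "Property"]].groupby("Player").size()
print(counts)  # Amanda 2, Neil 1
```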
18.2.4 (d)
He now decides to calculate the amount of profit from each property. He wants to store this in a column called Profit in the income DataFrame.
To do this, he first has to transform the Income Generated column to be of a float datatype. Write one line of code to replace the old column with a new column, also called Income Generated, with the datatype modification described above. You may assume that each entry in Income Generated consists of a dollar sign ($) followed by a number, except for the NaN values.
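One possible answer, shown here against a toy stand-in for the income table (assuming, as stated, that each non-NaN entry is a dollar sign followed by a number):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the income table described above.
income = pd.DataFrame({"Income Generated": ["$3", "$16", np.nan]})

# Strip the leading '$' and cast to float; NaN entries pass through the
# .str accessor as NaN, so they survive the conversion.
income['Income Generated'] = income['Income Generated'].str.strip('$').astype(float)
```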
Assuming that the answer to (c) is correct, let's add a Profit column to the income DataFrame. Fill in the following blanks to do this, adding arguments to the function calls as you see appropriate.
Note: Profit is calculated by subtracting the purchase price from generated income.
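Since the blanks themselves aren't reproduced here, the following is one self-contained way to compute the column, using toy stand-ins for income and properties:

```python
import pandas as pd

# Toy stand-ins for the two tables (Income Generated already cleaned).
income = pd.DataFrame({
    "Property": ["Boardwalk", "Baltic Avenue"],
    "Player": ["Amanda", "Neil"],
    "Income Generated": [16.0, 3.0],
})
properties = pd.DataFrame({
    "Property": ["Boardwalk", "Baltic Avenue"],
    "Purchase Price": [400.0, 60.0],
})

# Look up each property's purchase price, then subtract it from the
# income that property generated.
prices = income["Property"].map(properties.set_index("Property")["Purchase Price"])
income["Profit"] = income["Income Generated"] - prices
```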
Shawn realizes he's lost more money than he's made. To solve this problem, he begins by writing some Pandas code to merge the Property Color column into the income DataFrame and drop all rows with NaN values. He calls this DataFrame merged_df. Shown below are the first few rows.
Shawn decides he will now only buy properties from a color group that he deems “profitable.” He deems a color group “profitable” if at least 50% of the properties in the group that are currently owned by players have made a positive (non-zero) profit for those players.
Fill in the following lines of code to help him display a DataFrame with a subset of the rows in merged_df: the rows with properties that belong to profitable color groups. Your solution may use fewer lines of code than we provide.
When working through this question, break down how filter uses the function it is given. After groupby creates subgroups, filter applies the function on each group (which is why the argument for the function is named group).
All subgroups for which the function returns False are removed from the resulting DataFrame, while those that return True are kept.
Another important point to explain is how to translate the filtering condition from the question into code — specifically, the if statement inside the function.
Keep in mind, this task can also be completed using a long lambda function.
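A minimal sketch of the filled-in code under these assumptions (it presumes merged_df from above, with the Property Color and Profit columns already in place):

```python
def is_profitable(group):
    # Keep a color group if at least 50% of its currently owned
    # properties made a positive (non-zero) profit for their owners.
    return (group["Profit"] > 0).mean() >= 0.5

# filter() drops every color group for which is_profitable returns False.
profitable_df = merged_df.groupby("Property Color").filter(is_profitable)
```

As the last note mentions, is_profitable can equivalently be written inline as a lambda.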
18.3 Data Cleaning and EDA (Extra)
Consider the following sample of the babynames DataFrame obtained by using babynames.sample(5).
Consider the Name, Year, and Count variables. What type of variable is each one? (quantitative discrete, quantitative continuous, qualitative nominal, qualitative ordinal)
Answer
Year is a quantitative discrete variable.
Count is a quantitative discrete variable.
Name is a qualitative nominal variable.
Quantitative Discrete vs. Continuous
Note that the difference between quantitative discrete and quantitative continuous variables is not emphasized in this semester’s material.
18.3.2 (b)
Which of the following is a (minimal) primary key for the babynames table from question 1 (given the sample of the data and reasonable assumptions about what might be in the rest of the data)?
A primary key is the column or minimal set of columns that uniquely determines the values in all the remaining columns. This is a statement about the table structure and should hold for all data that could be put in the table.
Answer
The columns State, Sex, Year, and Name form the primary key, which uniquely determines the value of Count for this data and all data that might go in this table. We would not expect two different values of Count for the same State, Sex, Year, and Name.
Understanding Tables as Functions and Primary Keys
Think of the table as a function that maps a primary key (a combination of columns) to the value of Count.
If you remove one of the columns from the primary key, the Count can no longer be uniquely determined by the remaining columns.
An important point to highlight is that the order of columns in the primary key does not matter.
Also, discuss the index column introduced by pandas. This index acts like a record number and could serve as a primary key. However, in this case, the table already has a primary key independent of the record order.
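As a sanity check, here is a sketch of verifying such a key in pandas, using a toy stand-in for babynames:

```python
import pandas as pd

# Toy stand-in for the babynames table.
babynames = pd.DataFrame({
    "State": ["CA", "CA", "NY"],
    "Sex":   ["F", "F", "F"],
    "Year":  [2020, 2021, 2020],
    "Name":  ["Mia", "Mia", "Mia"],
    "Count": [210, 195, 180],
})

# A candidate key is valid only if no two rows share the same key values.
key = ["State", "Sex", "Year", "Name"]
assert not babynames.duplicated(subset=key).any()

# Dropping any key column breaks uniqueness: without Year, the key
# ('CA', 'F', 'Mia') would map to two different Counts.
print(babynames.duplicated(subset=["State", "Sex", "Name"]).any())  # True
```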