17  Discussion 02: Pandas I (From Summer 2025)

Slides

Pandas Functions Are Not Exhaustive

The Pandas functions we cover in class and discussion are just a starting point — not the full set of tools Pandas offers. Unlike the datascience library in Data 8, Pandas has far more capabilities than what’s on the reference sheet. If you need to do something specific, don’t hesitate to look it up in the documentation or online — you’ll often find built-in functions that make your work much easier.

17.1 Pandas Practice

The following questions are based on the elections.csv data referenced in the lecture. Assume that you have loaded the data into the elections variable as follows:

Code
import pandas as pd
import numpy as np
elections = pd.read_csv("elections.csv")
Code
elections
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
3 1828 John Quincy Adams National Republican 500897 loss 43.796073
4 1832 Andrew Jackson Democratic 702735 win 54.574789
... ... ... ... ... ... ...
182 2024 Donald Trump Republican 77303568 win 49.808629
183 2024 Kamala Harris Democratic 75019230 loss 48.336772
184 2024 Jill Stein Green 861155 loss 0.554864
185 2024 Robert Kennedy Independent 756383 loss 0.487357
186 2024 Chase Oliver Libertarian Party 650130 loss 0.418895

187 rows × 6 columns


17.1.1 (a)

Write a line of code that returns the elections table sorted in descending order by Popular vote". Store your result in a variable named sorted.
Would calling sorted.iloc[[0], :] give the same result as sorted.loc[[0], :]?

Answer
sorted = elections.sort_values("Popular vote", ascending=False)
sorted
Year Candidate Party Popular vote Result %
178 2020 Joseph Biden Democratic 81268924 win 51.311515
182 2024 Donald Trump Republican 77303568 win 49.808629
183 2024 Kamala Harris Democratic 75019230 loss 48.336772
179 2020 Donald Trump Republican 74216154 loss 46.858542
162 2008 Barack Obama Democratic 69498516 win 53.023510
... ... ... ... ... ... ...
110 1956 T. Coleman Andrews States' Rights 107929 loss 0.174883
141 1992 Bo Gritz Populist 106152 loss 0.101918
99 1948 Claude A. Watson Prohibition 103708 loss 0.212747
89 1932 William Z. Foster Communist 103307 loss 0.261069
6 1832 William Wirt Anti-Masonic 100715 loss 7.821583

187 rows × 6 columns

sorted.iloc[[0], :] gives a different result as sorted.loc[[0], :]. iloc uses integer position-based indexing, so sorted.iloc[[0], :] will return a table with the first row in sorted.

sorted.iloc[[0], :]
Year Candidate Party Popular vote Result %
178 2020 Joseph Biden Democratic 81268924 win 51.311515

On the other hand, loc uses label-based indexing, so sorted.loc[[0], :] will return a table with the row with an index label of 0.

sorted.loc[[0], :]
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122

Note that this doesn’t mean all calls to iloc and loc will return different results. For example, if Andrew Jackson did have the highest "Popular vote", the results would be equal when calling sorted.iloc[[0], :] and sorted.loc[[0], :]!

Selecting Rows and Columns in Pandas

When selecting rows or columns in Pandas, extra brackets change the type of the returned object:

  • sorted.loc[[0], :] → returns a DataFrame (because the row index is passed as a list).
  • sorted.loc[0, :] → returns a row object (a Series).

This is similar to column selection: - tbl["column_name"] → returns a Series.
- tbl[["column_name"]] → returns a DataFrame.

Tip: If you want to preserve the DataFrame structure (even for a single row or column), wrap the index/column name in an extra set of brackets.


17.1.2 (b)

Using Boolean slicing, write one line of pandas code that returns a DataFrame that only contains election results from the 1900s.

Answer
elections[(elections["Year"] >= 1900) & (elections["Year"] < 2000)]
Year Candidate Party Popular vote Result %
54 1900 John G. Woolley Prohibition 210864 loss 1.526821
55 1900 William Jennings Bryan Democratic 6370932 loss 46.130540
56 1900 William McKinley Republican 7228864 win 52.342640
57 1904 Alton B. Parker Democratic 5083880 loss 37.685116
58 1904 Eugene V. Debs Socialist 402810 loss 2.985897
... ... ... ... ... ... ...
146 1996 Harry Browne Libertarian 485759 loss 0.505198
147 1996 Howard Phillips Taxpayers 184656 loss 0.192045
148 1996 John Hagelin Natural Law 113670 loss 0.118219
149 1996 Ralph Nader Green 685297 loss 0.712721
150 1996 Ross Perot Reform 8085294 loss 8.408844

97 rows × 6 columns

There are other correct solutions, but we present this one to showcase how to utilize multiple Boolean conditions.

17.1.3 (c)

Write one line of pandas code that returns a Series, where the index is the "Party", and the values are how many times that party won an election. Only include parties that have won an election.

Answer
elections[elections["Result"] == "win"]["Party"].value_counts()
Party
Republican               24
Democratic               23
Whig                      2
Democratic-Republican     1
National Union            1
Name: count, dtype: int64

You can also do this with groupby, but it’s more complex:

elections[elections['Result'] == 'win'].groupby('Party').size()
Party
Democratic               23
Democratic-Republican     1
National Union            1
Republican               24
Whig                      2
dtype: int64
Notice the slight difference in the output of the two lines of code.

17.1.4 (d)

Write a line of pandas code that returns a Series whose index is the years and whose values are the number of candidates that participated in those years’ elections.

Answer
elections.groupby("Year").size()
Year
1824    2
1828    2
1832    3
1836    3
1840    2
1844    2
1848    3
1852    3
1856    3
1860    4
1864    2
1868    2
1872    2
1876    2
1880    3
1884    4
1888    4
1892    4
1896    4
1900    3
1904    5
1908    4
1912    5
1916    4
1920    5
1924    3
1928    3
1932    4
1936    4
1940    3
1944    2
1948    6
1952    3
1956    3
1960    2
1964    2
1968    3
1972    3
1976    6
1980    5
1984    3
1988    4
1992    5
1996    7
2000    5
2004    6
2008    6
2012    4
2016    6
2020    4
2024    5
dtype: int64
Difference Between .size() and .count() in Pandas

A common question is when to use .size() versus .count() for aggregations:

  • .size() counts all rows in each group, including those with NaN values, and returns a Series because the size is the same across all columns.
  • .count() counts non-missing values per column in each group, so results can vary by column and returns a DataFrame.

Use .size() when you want the total number of rows regardless of missing data, and .count() when you want counts of valid (non-NaN) entries per column.


17.1.5 (e)

Write a line of pandas code that creates a filtered DataFrame named filtered_parties from the elections dataset and keeps only the parties that have at least one election % more than 50%.

Answer
filtered_parties = elections.groupby("Party").filter(lambda sf: sf["%"].max() > 50)
filtered_parties
Year Candidate Party Popular vote Result %
0 1824 Andrew Jackson Democratic-Republican 151271 loss 57.210122
1 1824 John Quincy Adams Democratic-Republican 113142 win 42.789878
2 1828 Andrew Jackson Democratic 642806 win 56.203927
4 1832 Andrew Jackson Democratic 702735 win 54.574789
7 1836 Hugh Lawson White Whig 146109 loss 10.005985
... ... ... ... ... ... ...
176 2016 Hillary Clinton Democratic 65853514 loss 48.521539
178 2020 Joseph Biden Democratic 81268924 win 51.311515
179 2020 Donald Trump Republican 74216154 loss 46.858542
182 2024 Donald Trump Republican 77303568 win 49.808629
183 2024 Kamala Harris Democratic 75019230 loss 48.336772

99 rows × 6 columns

Understanding the Difference Between filter and agg in Pandas

When working with grouped data, it’s important to distinguish between filter and agg:

  • The filter operation does NOT change the granularity of the original DataFrame. It returns rows from the original dataset where groups satisfy a condition, keeping the original indexing and structure.
  • In contrast, groupby + agg reduces granularity, returning one row per group with aggregated results and using the groups as the new index.

Think of filter as applying a Boolean test to each group to decide whether to keep or discard row.


17.1.6 (f)

Write a line of pandas code that uses the filtered_parties DataFrame to return a new DataFrame with row indices that correspond to the year and columns that correspond to each party. Each entry should be the total percentage of votes for all the candidates that ran during that particular year for the specified party. Missing values from the dataset (the cases where a party did not have a candidate in a particular year) should be entered as 0. Below is an example.

Answer
elections_pivot = filtered_parties.pivot_table(index="Year",  columns="Party", values=["%"], aggfunc="sum", fill_value = 0)
elections_pivot.head()
%
Party Democratic Democratic-Republican National Union Republican Whig
Year
1824 0.000000 100.0 0.0 0.0 0.000000
1828 56.203927 0.0 0.0 0.0 0.000000
1832 54.574789 0.0 0.0 0.0 0.000000
1836 52.272472 0.0 0.0 0.0 47.727528
1840 46.948787 0.0 0.0 0.0 53.051213
Why Use pivot_table in Pandas?

At first glance, it might not be obvious that the code is using a pivot_table call. It’s important to emphasize how pivot tables provide an alternative way to view and summarize data.

  • The combinations of year and party typically do not reflect the original row-level granularity of the dataset.
  • Both groupby and pivot_table are powerful tools that change the granularity of data, summarizing it in meaningful ways.
  • You can also connect this to groupby by asking if students can replicate the same result with a 2-key groupby on “Year” and “Party”.

Understanding pivot tables helps you reshape data into a format that’s easier to analyze and interpret.