17 Discussion 02: Pandas I (From Summer 2025)

Slides

Pandas Functions Are Not Exhaustive

The Pandas functions we cover in class and discussion are just a starting point — not the full set of tools Pandas offers. Unlike the datascience library in Data 8, Pandas has far more capabilities than what’s on the reference sheet. If you need to do something specific, don’t hesitate to look it up in the documentation or online — you’ll often find built-in functions that make your work much easier.

17.1 Pandas Practice

The following questions are based on the elections.csv data referenced in the lecture. Assume that you have loaded the data into the elections variable as follows:

Code

import pandas as pd
import numpy as np

elections = pd.read_csv("elections.csv")

Code

elections

	Year	Candidate	Party	Popular vote	Result	%
0	1824	Andrew Jackson	Democratic-Republican	151271	loss	57.210122
1	1824	John Quincy Adams	Democratic-Republican	113142	win	42.789878
2	1828	Andrew Jackson	Democratic	642806	win	56.203927
3	1828	John Quincy Adams	National Republican	500897	loss	43.796073
4	1832	Andrew Jackson	Democratic	702735	win	54.574789
...	...	...	...	...	...	...
182	2024	Donald Trump	Republican	77303568	win	49.808629
183	2024	Kamala Harris	Democratic	75019230	loss	48.336772
184	2024	Jill Stein	Green	861155	loss	0.554864
185	2024	Robert Kennedy	Independent	756383	loss	0.487357
186	2024	Chase Oliver	Libertarian Party	650130	loss	0.418895

187 rows × 6 columns

17.1.1 (a)

Write a line of code that returns the elections table sorted in descending order by Popular vote". Store your result in a variable named sorted.
Would calling sorted.iloc[[0], :] give the same result as sorted.loc[[0], :]?

Answer

sorted = elections.sort_values("Popular vote", ascending=False)
sorted

	Year	Candidate	Party	Popular vote	Result	%
178	2020	Joseph Biden	Democratic	81268924	win	51.311515
182	2024	Donald Trump	Republican	77303568	win	49.808629
183	2024	Kamala Harris	Democratic	75019230	loss	48.336772
179	2020	Donald Trump	Republican	74216154	loss	46.858542
162	2008	Barack Obama	Democratic	69498516	win	53.023510
...	...	...	...	...	...	...
110	1956	T. Coleman Andrews	States' Rights	107929	loss	0.174883
141	1992	Bo Gritz	Populist	106152	loss	0.101918
99	1948	Claude A. Watson	Prohibition	103708	loss	0.212747
89	1932	William Z. Foster	Communist	103307	loss	0.261069
6	1832	William Wirt	Anti-Masonic	100715	loss	7.821583

187 rows × 6 columns

sorted.iloc[[0], :] gives a different result as sorted.loc[[0], :]. iloc uses integer position-based indexing, so sorted.iloc[[0], :] will return a table with the first row in sorted.

sorted.iloc[[0], :]

	Year	Candidate	Party	Popular vote	Result	%
178	2020	Joseph Biden	Democratic	81268924	win	51.311515

On the other hand, loc uses label-based indexing, so sorted.loc[[0], :] will return a table with the row with an index label of 0.

sorted.loc[[0], :]

	Year	Candidate	Party	Popular vote	Result	%
0	1824	Andrew Jackson	Democratic-Republican	151271	loss	57.210122

Note that this doesn’t mean all calls to iloc and loc will return different results. For example, if Andrew Jackson did have the highest "Popular vote", the results would be equal when calling sorted.iloc[[0], :] and sorted.loc[[0], :]!

Selecting Rows and Columns in Pandas

When selecting rows or columns in Pandas, extra brackets change the type of the returned object:

sorted.loc[[0], :] → returns a DataFrame (because the row index is passed as a list).
sorted.loc[0, :] → returns a row object (a Series).

This is similar to column selection: - tbl["column_name"] → returns a Series.
- tbl[["column_name"]] → returns a DataFrame.

Tip: If you want to preserve the DataFrame structure (even for a single row or column), wrap the index/column name in an extra set of brackets.

17.1.2 (b)

Using Boolean slicing, write one line of pandas code that returns a DataFrame that only contains election results from the 1900s.

Answer

elections[(elections["Year"] >= 1900) & (elections["Year"] < 2000)]

	Year	Candidate	Party	Popular vote	Result	%
54	1900	John G. Woolley	Prohibition	210864	loss	1.526821
55	1900	William Jennings Bryan	Democratic	6370932	loss	46.130540
56	1900	William McKinley	Republican	7228864	win	52.342640
57	1904	Alton B. Parker	Democratic	5083880	loss	37.685116
58	1904	Eugene V. Debs	Socialist	402810	loss	2.985897
...	...	...	...	...	...	...
146	1996	Harry Browne	Libertarian	485759	loss	0.505198
147	1996	Howard Phillips	Taxpayers	184656	loss	0.192045
148	1996	John Hagelin	Natural Law	113670	loss	0.118219
149	1996	Ralph Nader	Green	685297	loss	0.712721
150	1996	Ross Perot	Reform	8085294	loss	8.408844

97 rows × 6 columns

There are other correct solutions, but we present this one to showcase how to utilize multiple Boolean conditions.

17.1.3 (c)

Write one line of pandas code that returns a Series, where the index is the "Party", and the values are how many times that party won an election. Only include parties that have won an election.

Answer

elections[elections["Result"] == "win"]["Party"].value_counts()

Party
Republican               24
Democratic               23
Whig                      2
Democratic-Republican     1
National Union            1
Name: count, dtype: int64

You can also do this with groupby, but it’s more complex:

elections[elections['Result'] == 'win'].groupby('Party').size()

Party
Democratic               23
Democratic-Republican     1
National Union            1
Republican               24
Whig                      2
dtype: int64

Notice the slight difference in the output of the two lines of code.

17.1.4 (d)

Write a line of pandas code that returns a Series whose index is the years and whose values are the number of candidates that participated in those years’ elections.

Answer

elections.groupby("Year").size()

Year
1824    2
1828    2
1832    3
1836    3
1840    2
1844    2
1848    3
1852    3
1856    3
1860    4
1864    2
1868    2
1872    2
1876    2
1880    3
1884    4
1888    4
1892    4
1896    4
1900    3
1904    5
1908    4
1912    5
1916    4
1920    5
1924    3
1928    3
1932    4
1936    4
1940    3
1944    2
1948    6
1952    3
1956    3
1960    2
1964    2
1968    3
1972    3
1976    6
1980    5
1984    3
1988    4
1992    5
1996    7
2000    5
2004    6
2008    6
2012    4
2016    6
2020    4
2024    5
dtype: int64

Difference Between .size() and .count() in Pandas

A common question is when to use .size() versus .count() for aggregations:

.size() counts all rows in each group, including those with NaN values, and returns a Series because the size is the same across all columns.
.count() counts non-missing values per column in each group, so results can vary by column and returns a DataFrame.

Use .size() when you want the total number of rows regardless of missing data, and .count() when you want counts of valid (non-NaN) entries per column.

17.1.5 (e)

Write a line of pandas code that creates a filtered DataFrame named filtered_parties from the elections dataset and keeps only the parties that have at least one election % more than 50%.

Answer

filtered_parties = elections.groupby("Party").filter(lambda sf: sf["%"].max() > 50)
filtered_parties

	Year	Candidate	Party	Popular vote	Result	%
0	1824	Andrew Jackson	Democratic-Republican	151271	loss	57.210122
1	1824	John Quincy Adams	Democratic-Republican	113142	win	42.789878
2	1828	Andrew Jackson	Democratic	642806	win	56.203927
4	1832	Andrew Jackson	Democratic	702735	win	54.574789
7	1836	Hugh Lawson White	Whig	146109	loss	10.005985
...	...	...	...	...	...	...
176	2016	Hillary Clinton	Democratic	65853514	loss	48.521539
178	2020	Joseph Biden	Democratic	81268924	win	51.311515
179	2020	Donald Trump	Republican	74216154	loss	46.858542
182	2024	Donald Trump	Republican	77303568	win	49.808629
183	2024	Kamala Harris	Democratic	75019230	loss	48.336772

99 rows × 6 columns

Understanding the Difference Between filter and agg in Pandas

When working with grouped data, it’s important to distinguish between filter and agg:

The filter operation does NOT change the granularity of the original DataFrame. It returns rows from the original dataset where groups satisfy a condition, keeping the original indexing and structure.
In contrast, groupby + agg reduces granularity, returning one row per group with aggregated results and using the groups as the new index.

Think of filter as applying a Boolean test to each group to decide whether to keep or discard row.

17.1.6 (f)

Write a line of pandas code that uses the filtered_parties DataFrame to return a new DataFrame with row indices that correspond to the year and columns that correspond to each party. Each entry should be the total percentage of votes for all the candidates that ran during that particular year for the specified party. Missing values from the dataset (the cases where a party did not have a candidate in a particular year) should be entered as 0. Below is an example.

Answer

elections_pivot = filtered_parties.pivot_table(index="Year",  columns="Party", values=["%"], aggfunc="sum", fill_value = 0)
elections_pivot.head()

	%
Party	Democratic	Democratic-Republican	National Union	Republican	Whig
Year
1824	0.000000	100.0	0.0	0.0	0.000000
1828	56.203927	0.0	0.0	0.0	0.000000
1832	54.574789	0.0	0.0	0.0	0.000000
1836	52.272472	0.0	0.0	0.0	47.727528
1840	46.948787	0.0	0.0	0.0	53.051213

Why Use pivot_table in Pandas?

At first glance, it might not be obvious that the code is using a pivot_table call. It’s important to emphasize how pivot tables provide an alternative way to view and summarize data.

The combinations of year and party typically do not reflect the original row-level granularity of the dataset.
Both groupby and pivot_table are powerful tools that change the granularity of data, summarizing it in meaningful ways.
You can also connect this to groupby by asking if students can replicate the same result with a 2-key groupby on “Year” and “Party”.

Understanding pivot tables helps you reshape data into a format that’s easier to analyze and interpret.