Code
import pandas as pd
import numpy as np
The Pandas functions we cover in class and discussion are just a starting point — not the full set of tools Pandas offers. Unlike the datascience
library in Data 8, Pandas has far more capabilities than what’s on the reference sheet. If you need to do something specific, don’t hesitate to look it up in the documentation or online — you’ll often find built-in functions that make your work much easier.
The following questions are based on the elections.csv
data referenced in the lecture. Assume that you have loaded the data into the elections
variable as follows:
import pandas as pd
import numpy as np
= pd.read_csv("elections.csv") elections
elections
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
... | ... | ... | ... | ... | ... | ... |
182 | 2024 | Donald Trump | Republican | 77303568 | win | 49.808629 |
183 | 2024 | Kamala Harris | Democratic | 75019230 | loss | 48.336772 |
184 | 2024 | Jill Stein | Green | 861155 | loss | 0.554864 |
185 | 2024 | Robert Kennedy | Independent | 756383 | loss | 0.487357 |
186 | 2024 | Chase Oliver | Libertarian Party | 650130 | loss | 0.418895 |
187 rows × 6 columns
Write a line of code that returns the elections
table sorted in descending order by Popular vote"
. Store your result in a variable named sorted
.
Would calling sorted.iloc[[0], :]
give the same result as sorted.loc[[0], :]
?
sorted = elections.sort_values("Popular vote", ascending=False)
sorted
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
182 | 2024 | Donald Trump | Republican | 77303568 | win | 49.808629 |
183 | 2024 | Kamala Harris | Democratic | 75019230 | loss | 48.336772 |
179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 |
162 | 2008 | Barack Obama | Democratic | 69498516 | win | 53.023510 |
... | ... | ... | ... | ... | ... | ... |
110 | 1956 | T. Coleman Andrews | States' Rights | 107929 | loss | 0.174883 |
141 | 1992 | Bo Gritz | Populist | 106152 | loss | 0.101918 |
99 | 1948 | Claude A. Watson | Prohibition | 103708 | loss | 0.212747 |
89 | 1932 | William Z. Foster | Communist | 103307 | loss | 0.261069 |
6 | 1832 | William Wirt | Anti-Masonic | 100715 | loss | 7.821583 |
187 rows × 6 columns
sorted.iloc[[0], :]
gives a different result as sorted.loc[[0], :]
. iloc
uses integer position-based indexing, so sorted.iloc[[0], :]
will return a table with the first row in sorted
.
sorted.iloc[[0], :]
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
On the other hand, loc
uses label-based indexing, so sorted.loc[[0], :]
will return a table with the row with an index label of 0.
sorted.loc[[0], :]
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
Note that this doesn’t mean all calls to iloc
and loc
will return different results. For example, if Andrew Jackson did have the highest "Popular vote"
, the results would be equal when calling sorted.iloc[[0], :]
and sorted.loc[[0], :]
!
When selecting rows or columns in Pandas, extra brackets change the type of the returned object:
sorted.loc[[0], :]
→ returns a DataFrame (because the row index is passed as a list).sorted.loc[0, :]
→ returns a row object (a Series).This is similar to column selection: - tbl["column_name"]
→ returns a Series.
- tbl[["column_name"]]
→ returns a DataFrame.
Tip: If you want to preserve the DataFrame structure (even for a single row or column), wrap the index/column name in an extra set of brackets.
Using Boolean slicing, write one line of pandas
code that returns a DataFrame
that only contains election results from the 1900s.
"Year"] >= 1900) & (elections["Year"] < 2000)] elections[(elections[
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
54 | 1900 | John G. Woolley | Prohibition | 210864 | loss | 1.526821 |
55 | 1900 | William Jennings Bryan | Democratic | 6370932 | loss | 46.130540 |
56 | 1900 | William McKinley | Republican | 7228864 | win | 52.342640 |
57 | 1904 | Alton B. Parker | Democratic | 5083880 | loss | 37.685116 |
58 | 1904 | Eugene V. Debs | Socialist | 402810 | loss | 2.985897 |
... | ... | ... | ... | ... | ... | ... |
146 | 1996 | Harry Browne | Libertarian | 485759 | loss | 0.505198 |
147 | 1996 | Howard Phillips | Taxpayers | 184656 | loss | 0.192045 |
148 | 1996 | John Hagelin | Natural Law | 113670 | loss | 0.118219 |
149 | 1996 | Ralph Nader | Green | 685297 | loss | 0.712721 |
150 | 1996 | Ross Perot | Reform | 8085294 | loss | 8.408844 |
97 rows × 6 columns
Write one line of pandas
code that returns a Series
, where the index is the "Party"
, and the values are how many times that party won an election. Only include parties that have won an election.
"Result"] == "win"]["Party"].value_counts() elections[elections[
Party
Republican 24
Democratic 23
Whig 2
Democratic-Republican 1
National Union 1
Name: count, dtype: int64
You can also do this with groupby
, but it’s more complex:
'Result'] == 'win'].groupby('Party').size() elections[elections[
Party
Democratic 23
Democratic-Republican 1
National Union 1
Republican 24
Whig 2
dtype: int64
Write a line of pandas
code that returns a Series
whose index is the years and whose values are the number of candidates that participated in those years’ elections.
"Year").size() elections.groupby(
Year
1824 2
1828 2
1832 3
1836 3
1840 2
1844 2
1848 3
1852 3
1856 3
1860 4
1864 2
1868 2
1872 2
1876 2
1880 3
1884 4
1888 4
1892 4
1896 4
1900 3
1904 5
1908 4
1912 5
1916 4
1920 5
1924 3
1928 3
1932 4
1936 4
1940 3
1944 2
1948 6
1952 3
1956 3
1960 2
1964 2
1968 3
1972 3
1976 6
1980 5
1984 3
1988 4
1992 5
1996 7
2000 5
2004 6
2008 6
2012 4
2016 6
2020 4
2024 5
dtype: int64
.size()
and .count()
in Pandas
A common question is when to use .size()
versus .count()
for aggregations:
.size()
counts all rows in each group, including those with NaN
values, and returns a Series because the size is the same across all columns..count()
counts non-missing values per column in each group, so results can vary by column and returns a DataFrame.Use .size()
when you want the total number of rows regardless of missing data, and .count()
when you want counts of valid (non-NaN) entries per column.
Write a line of pandas
code that creates a filtered DataFrame
named filtered_parties
from the elections dataset and keeps only the parties that have at least one election % more than 50%.
= elections.groupby("Party").filter(lambda sf: sf["%"].max() > 50)
filtered_parties filtered_parties
Year | Candidate | Party | Popular vote | Result | % | |
---|---|---|---|---|---|---|
0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
7 | 1836 | Hugh Lawson White | Whig | 146109 | loss | 10.005985 |
... | ... | ... | ... | ... | ... | ... |
176 | 2016 | Hillary Clinton | Democratic | 65853514 | loss | 48.521539 |
178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 |
182 | 2024 | Donald Trump | Republican | 77303568 | win | 49.808629 |
183 | 2024 | Kamala Harris | Democratic | 75019230 | loss | 48.336772 |
99 rows × 6 columns
filter
and agg
in Pandas
When working with grouped data, it’s important to distinguish between filter
and agg
:
filter
operation does NOT change the granularity of the original DataFrame
. It returns rows from the original dataset where groups satisfy a condition, keeping the original indexing and structure.groupby
+ agg
reduces granularity, returning one row per group with aggregated results and using the groups as the new index.Think of filter
as applying a Boolean test to each group to decide whether to keep or discard row.
Write a line of pandas
code that uses the filtered_parties
DataFrame
to return a new DataFrame
with row indices that correspond to the year and columns that correspond to each party. Each entry should be the total percentage of votes for all the candidates that ran during that particular year for the specified party. Missing values from the dataset (the cases where a party did not have a candidate in a particular year) should be entered as 0. Below is an example.
= filtered_parties.pivot_table(index="Year", columns="Party", values=["%"], aggfunc="sum", fill_value = 0)
elections_pivot elections_pivot.head()
% | |||||
---|---|---|---|---|---|
Party | Democratic | Democratic-Republican | National Union | Republican | Whig |
Year | |||||
1824 | 0.000000 | 100.0 | 0.0 | 0.0 | 0.000000 |
1828 | 56.203927 | 0.0 | 0.0 | 0.0 | 0.000000 |
1832 | 54.574789 | 0.0 | 0.0 | 0.0 | 0.000000 |
1836 | 52.272472 | 0.0 | 0.0 | 0.0 | 47.727528 |
1840 | 46.948787 | 0.0 | 0.0 | 0.0 | 53.051213 |
pivot_table
in Pandas?
At first glance, it might not be obvious that the code is using a pivot_table
call. It’s important to emphasize how pivot tables provide an alternative way to view and summarize data.
groupby
and pivot_table
are powerful tools that change the granularity of data, summarizing it in meaningful ways.groupby
by asking if students can replicate the same result with a 2-key groupby
on “Year” and “Party”.Understanding pivot tables helps you reshape data into a format that’s easier to analyze and interpret.