5  Discussion 05: Functions and Table Methods (From Summer 2025)

Slides

Making the Most of the Reference Sheet

Your reference sheet is a powerful tool—spend a couple minutes learning how to use it!

  • The columns tell you what the inputs and outputs of each method are.
  • tbl is just a placeholder name. When you write your own code, replace it with the name of your table.

Knowing how to quickly read the sheet will save you time on assignments and exams.

5.1 Fun(ctions)

Code
import warnings
warnings.filterwarnings("ignore")

from datascience import *
import numpy as np
import random

5.1.1 (a)

After learning about them in Data 8, Wayne wants to write a function that can calculate the hypotenuse of any right triangle. He wants to use his function to assign C to the hypotenuse of a right triangle with legs (sides adjacent to the hypotenuse) A and B. However, he’s made a few mistakes. Which ones can you identify?

Hint: There are 5 unique issues. Assume that numpy has been imported as np.

Code
A = 3
B = 4
def hypotenuse(a, b)
    """Returns the length of the hypotenuse of a right triangle, the square root of a squared + b squared."""
    squares = make_array(side1, side2) * 2
    sum = sum(squares)
    squareroot = np.sqrt(sum)
    print(squareroot)
C = hypotenuse(A, B)
Answer

Error 1: the function is missing a colon “:” after the arguments list.

def hypotenuse(a, b)
    """Returns the length of the hypotenuse of a right triangle, the square root of a squared + b squared."""
    squares = make_array(side1, side2) * 2
    sum = sum(squares)
    squareroot = np.sqrt(sum)
    print(squareroot)
C = hypotenuse(A, B)
  Cell In[3], line 1
    def hypotenuse(a, b)
                        ^
SyntaxError: expected ':'

Error 2: squares should be squared with ** not *.

Error 3: We need to be consistent with our argument names so they get accurately assigned throughout the function. We can either replace a and b with side1 and side2, or vice versa.

def hypotenuse(a, b):
    """Returns the length of the hypotenuse of a right triangle, the square root of a squared + b squared."""
    squares = make_array(side1, side2) ** 2
    sum = sum(squares)
    squareroot = np.sqrt(sum)
    print(squareroot)
C = hypotenuse(A, B)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[4], line 7
      5     squareroot = np.sqrt(sum)
      6     print(squareroot)
----> 7 C = hypotenuse(A, B)

Cell In[4], line 3, in hypotenuse(a, b)
      1 def hypotenuse(a, b):
      2     """Returns the length of the hypotenuse of a right triangle, the square root of a squared + b squared."""
----> 3     squares = make_array(side1, side2) ** 2
      4     sum = sum(squares)
      5     squareroot = np.sqrt(sum)

NameError: name 'side1' is not defined

Error 4: When we assign sum to a number we have lost the original behavior of the built-in sum function. We should not re-assign variable names.

Naming Variables Safely

Avoid using protected names like sum or max.

  • These are built-in Python functions. If you reuse them as variable names, you may cause errors in your code and with the autograder.
  • You can still use short or non-descriptive names if needed—just avoid overwriting protected ones.
def hypotenuse(a, b):
    """Returns the length of the hypotenuse of a right triangle, the square root of a squared + b squared."""
    squares = make_array(a, b) ** 2
    sum = sum(squares)
    squareroot = np.sqrt(sum)
    print(squareroot)
C = hypotenuse(A, B)
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[5], line 7
      5     squareroot = np.sqrt(sum)
      6     print(squareroot)
----> 7 C = hypotenuse(A, B)

Cell In[5], line 4, in hypotenuse(a, b)
      2 """Returns the length of the hypotenuse of a right triangle, the square root of a squared + b squared."""
      3 squares = make_array(a, b) ** 2
----> 4 sum = sum(squares)
      5 squareroot = np.sqrt(sum)
      6 print(squareroot)

UnboundLocalError: cannot access local variable 'sum' where it is not associated with a value

Error 5: The function will print the value of squareroot but will not return it, which means we will not have access to the value of squareroot anymore. That is, we will not be able to assign it to any values or use it as the argument to any functions! In this case, C will not be equal to anything (it will actually be None)!

Return vs. Print

It’s important to understand the difference between return and print.

  • print just displays something on the screen—it doesn’t give you back a usable value. In fact, it returns None.
  • return gives back a value you can save in a variable and use later.

If you want to keep and work with the result of a function, you should use return.

Function Arguments and Scope

Inside a function:
* Argument names are placeholders—they can be named anything, but they must be used consistently within the function.
* Variables defined inside the function exist only inside the function. They disappear once the function finishes running.

This is called scope.

def hypotenuse(a, b):
    """Returns the length of the hypotenuse of a right triangle, the square root of a squared + b squared."""
    squares = make_array(a, b) ** 2
    sum_squares = sum(squares)
    squareroot = np.sqrt(sum_squares)
    print(squareroot)
C = hypotenuse(A, B)
print(C)
5.0
None

5.1.2 (b)

Write a function that takes in the following arguments:

  • tbl: a table.
  • col: a string, name of a column in tbl.
  • n: an int.

The function should return a table that contains the rows that have the n largest values for the specified column.

def top_n(tbl, col, n):
    sorted_tbl = _________________________
    top_n_rows = _________________________
    return _________________________
Answer
def top_n(tbl, col, n):
    sorted_tbl = tbl.sort(col, descending = True)
    top_n_rows = sorted_tbl.take(np.arange(n))
    return top_n_rows
table = Table().with_columns(
  "Some Column", [10, 1, 100, 10000, 1000]
)

table
Some Column
10
1
100
10000
1000
top_n(table, "Some Column", 3)
Some Column
10000
1000
100

5.2 Table Matchmaking

Bing’s Amazing Slides!

Shown below are the chocolates and nutrition tables respectively.

chocolates

Color Shape Amount Price ($)
Dark Round 4 1.30
Milk Rectangular 6 1.20
White Rectangular 12 2.00
Dark Round 7 1.75
Milk Rectangular 9 1.40
Milk Round 2 1.00

nutrition

Type Calories
Dark 120
Milk 130
White 115
Ruby 120
Code
chocolates = Table().with_columns(
    'Color', ['Dark', 'Milk', 'White', 'Dark', 'Milk', 'Milk'],
    'Shape', ['Round', 'Rectangular', 'Rectangular', 'Round', 'Rectangular', 'Round'],
    'Amount', [4, 6, 12, 7, 9, 2],
    'Price ($)', [1.30, 1.20, 2.00, 1.75, 1.40, 1.00]
)

nutrition = Table().with_columns(
    'Type', ['Dark', 'Milk', 'White', 'Ruby'],
    'Calories', [120, 130, 115, 120]
)

Match the following table method calls to the resulting descriptions of tables.

Hint: Pay attention to the column names of the resulting tables! For example, what happens when you only specify a column name(s) in .group()? What happens to the column names when you specify an aggregating function in .group()?

Letter Function Call
A chocolates.group("Shape")
B chocolates.group("Shape", max)
C chocolates.group(make_array("Shape", "Color"), max)
D chocolates.pivot("Color", "Shape", "Price ($)", max)
E chocolates.join("Color", nutrition, "Type")
F chocolates.group(make_array("Shape", "Color"))
Number Columns # of Rows
1 Shape, Color max, Amount max, Price ($) max 2
2 Shape, Dark, Milk, White 2
3 Shape, Color, Amount max, Price ($) max 4
4 Color, Shape, Amount, Price ($), Calories 6
5 Shape, count 2
6 Shape, Color, count 4


A: ____________  C: ____________  E: ____________
B: ____________  D: ____________  F: ____________

Grouping Tables

.group lets us summarize a table by grouping rows with the same value in a column.

  • With no aggregation function, you get:
    • the grouped column
    • a new count column
  • With an aggregation function (e.g. max), you get:
    • the grouped column
    • all other columns summarized with the function name added to their labels

Think of .group whenever you want to count or summarize rows by a specific column.

Joining Tables

.join combines information from two tables when they share a column of values.

  • Example: chocolates.join(nutrition) allows us to bring in the Calories column, which .group or .pivot cannot do.
  • Watch out if the column names differ—you may need to specify a third argument to align them.
  • The join column itself (like Type) may not appear in the final joined table.
Pivoting Tables

.pivot is like grouping on two columns, but the output is shaped into a grid.

  • Each row corresponds to one unique value from the second argument.
  • Each column corresponds to one unique value from the first argument.

Example:

chocolates.pivot("Color", "Shape")

creates a table where Shape values are rows, and Color values (Dark, Milk, White) become column names.

Answer

A - 5: .group() with no aggregating function yields a table with just the column that was grouped on, and a count column.

chocolates.group("Shape")
Shape count
Rectangular 3
Round 3

B - 1: .group() with an aggregating function yields the grouped column, and the other columns of the original table with the function name added at the end of each one.

chocolates.group("Shape", max)
Shape Color max Amount max Price ($) max
Rectangular White 12 2
Round Milk 7 1.75

C - 3: .group() with multiple columns and an aggregating function yields the columns that were grouped on, and the remaining columns with the function name added at the ends.

chocolates.group(make_array("Shape", "Color"), max)
Shape Color Amount max Price ($) max
Rectangular Milk 9 1.4
Rectangular White 12 2
Round Dark 7 1.75
Round Milk 2 1

D - 2: .pivot() yields the second argument to .pivot() (“Shape”), as well as all unique values in the column of the first argument (“Dark”, “Milk”, “White” from the “Color” column) as columns in the resulting table.

chocolates.pivot("Color", "Shape", "Price ($)", max)
Shape Dark Milk White
Rectangular 0 1.4 2
Round 1.75 1 0

E - 4: .join() gives you all the columns from the two tables, except for the extra column that is being used to join the two tables (“Type” is dropped in the resulting table).

chocolates.join("Color", nutrition, "Type")
Color Shape Amount Price ($) Calories
Dark Round 4 1.3 120
Dark Round 7 1.75 120
Milk Rectangular 6 1.2 130
Milk Rectangular 9 1.4 130
Milk Round 2 1 130
White Rectangular 12 2 115

F - 6: .group() with multiple columns and no aggregating function yields the columns that were grouped on, and a count column.

chocolates.group(make_array("Shape", "Color"))
Shape Color count
Rectangular Milk 2
Rectangular White 1
Round Dark 2
Round Milk 1

5.3 Squirrel!

The table squirrel below contains some information on reported squirrel sightings across the UC Berkeley campus. Each row in the squirrel table represents one unique squirrel sighting:

  • Squirrel ID (int): unique identification number for each squirrel.
  • Location (string): common name of the nearest campus landmark where the squirrel was spotted.
  • Month (int): numerical representation of the month when the squirrel was spotted.
  • Day (int): day of the month when the squirrel was spotted.
  • Year (int): year when the squirrel was spotted.
Code
squirrel_ids = [2937, 8421, 472, 239, 2937]
locations = ["Wheeler Hall", "East Asian Library", "Etcheverry Hall", "Campbell Hall", "Moffitt Library"]
days = [17, 28, 7, 4, 7]
months = [3, 9, 1, 10, 6]
years = [2024, 2022, 2024, 2023, 2021]

location_pool = [
    "Wheeler Hall", "East Asian Library", "Etcheverry Hall",
    "Campbell Hall", "Moffitt Library", "Doe Library",
    "Cory Hall", "Soda Hall", "Evans Hall", "Haas Pavilion",
    "Stanley Hall", "Physics North", "Physics South"
]

for _ in range(995):
    squirrel_ids.append(random.randint(100, 9999))
    locations.append(random.choice(location_pool))
    days.append(random.randint(1, 28))
    months.append(random.randint(1, 12))
    years.append(random.choice([2021, 2022, 2023, 2024]))

squirrel = Table().with_columns(
    "Squirrel ID", squirrel_ids,
    "Location", locations,
    "Day", days,
    "Month", months,
    "Year", years
)

squirrel.show(5)
Squirrel ID Location Day Month Year
2937 Wheeler Hall 17 3 2024
8421 East Asian Library 28 9 2022
472 Etcheverry Hall 7 1 2024
239 Campbell Hall 4 10 2023
2937 Moffitt Library 7 6 2021

... (995 rows omitted)

5.3.1 (a)

Write a line of code that evaluates to the proportion of Squirrel IDs in the table that are even.

Working with Proportions

Proportions often come from counting how many times a condition is True.

  • In Python, True is stored as 1 and False as 0.

  • This means we can calculate proportions in many ways:

    • np.mean(condition_array)
    • np.average(condition_array)
    • sum(condition_array) / len(condition_array)
  • With tables:

    tbl.where(...).num_rows / tbl.num_rows

    This flexibility means you can often solve the same problem in multiple ways.

Answer

np.mean(squirrel.column("Squirrel ID") % 2 == 0) (or any equivalent code)

np.mean(squirrel.column("Squirrel ID") % 2 == 0)
0.48899999999999999

5.3.2 (b)

Jessica wants to find the best location where she is most likely to find a squirrel. Write a line of code that evaluates to the location with the most squirrel sightings.

Common Patterns in Data 8

There are a few coding patterns you’ll see again and again.

  • Finding the largest or smallest value:

    tbl.sort("column").column("column").item(0)
  • Sort the table, pull out the column, then grab a single value.

  • Watching data types:

  • Many problems move through this chain:

    • Table → Array → Single Value

Knowing what type of object you have at each step helps you decide what methods you can use next.

Answer
squirrel.group("Location").sort("count", descending = True).column("Location").item(0)
'East Asian Library'

5.3.3 (c)

Jessica is interested in how many squirrels were sighted at every location during every month. Create a table called sightings where each cell contains the number of squirrel sightings that occurred in 2023 at each location during each month. Note: Each row should be in a different location.

squirrels_2023 = _________________________
sightings = _________________________
Making Sense of Pivot Tables

Pivot tables reorganize data so you can make comparisons more easily.

  • The first argument becomes the set of new column labels.
  • The second argument becomes the set of row labels.
  • Each cell shows the value(s) associated with that row–column pair.

Example:

chocolates.pivot("Color", "Shape")

This creates a table where each row corresponds to a shape, and each column corresponds to a chocolate color (Dark, Milk, White).

Answer
squirrels_2023 = squirrel.where("Year", are.equal_to(2023))
sightings = squirrels_2023.pivot("Month", "Location")
sightings
Location 1 2 3 4 5 6 7 8 9 10 11 12
Campbell Hall 1 1 2 1 1 1 1 2 3 6 2 3
Cory Hall 3 1 2 2 0 1 0 0 2 2 0 1
Doe Library 3 5 0 0 1 1 2 1 0 4 2 4
East Asian Library 5 1 1 0 3 6 0 3 2 0 1 1
Etcheverry Hall 1 1 1 2 1 0 1 0 1 0 3 2
Evans Hall 2 0 3 1 2 2 1 1 3 1 2 1
Haas Pavilion 1 0 2 4 3 2 0 1 2 0 2 1
Moffitt Library 2 0 1 0 0 3 1 0 1 2 0 3
Physics North 3 2 1 2 1 2 1 0 2 1 3 1
Physics South 1 1 1 1 1 0 1 3 4 2 2 2

... (3 rows omitted)


5.3.4 (d)

Jessica now has access to another table, species, that contains information about the species of each squirrel. Some of the rows from this table are shown below.

Breaking Down Multi-Step Problems

Some problems ask you to combine several table methods in sequence, like .where(), .group(), .sort(), or .join().

How to Tackle These

  • Take it step by step. Don’t try to write the whole solution in one go.
  • Check what each step produces. Is it still a Table? Or did it turn into an Array or a single Value?
  • Build gradually. Run one line at a time and confirm the output before moving on.

This approach makes complicated problems much easier to manage and helps you avoid getting lost.

Code
species_pool = [
    "eastern gray squirrel",
    "western gray squirrel",
    "fox squirrel",
    "Douglas squirrel",
    "red squirrel"
]

squirrel_ids_species = [2937, 8421, 472, 239]
species_array = [
    "eastern gray squirrel",
    "fox squirrel",
    "western gray squirrel",
    "western gray squirrel"
]

for sid in squirrel.column("Squirrel ID")[4:]:
    squirrel_ids_species.append(sid)
    species_array.append(random.choice(species_pool))

species = Table().with_columns(
    "Squirrel ID", squirrel_ids_species,
    "Species", species_array
)

species.show(4)
Squirrel ID Species
2937 eastern gray squirrel
8421 fox squirrel
472 western gray squirrel
239 western gray squirrel

... (996 rows omitted)

Write lines of code to find the least observed species in 2024. If multiple species are tied for the least sightings, find the species that comes first alphabetically.

squirrels_2024 = _________________________
species_counts = _________________________
least_observed = _________________________
Answer
squirrels_2024 = squirrel.where("Year", 2024)
species_counts = squirrels_2024.join("Squirrel ID", species).group("Species")
least_observed = species_counts.sort("Species").sort("count").column("Species").item(0)
least_observed
'red squirrel'