3  Discussion 03: Visualizations, Data Types, Extending Tables (From Summer 2025)

Slides

In lecture, you have been introduced to various data types in Python, such as integers, strings, and arrays. These data types are central to manipulating data and extracting useful information from it, an important skill in data science. In this section, we’ll analyze how Python behaves when working with particular data types.

3.1 Fun with Arrays

Suppose we have executed the following lines of code. Answer each question with the appropriate output associated with each line of code, or write ERROR if you think the operation is not possible.

from datascience import *
import numpy as np
odd_array = make_array(1, 3, 5, 7)
even_array = np.arange(2, 10, 2)
an_array = make_array('1', '2', '3', '4')
Working with Arrays

Arrays in Python let us store and manipulate collections of values at once.

Key Idea

Before solving problems, always check:

  • What is stored in the array?
  • What type of object are you working with?
  • Which operations are valid for that type?

This prevents small mistakes from snowballing into bigger ones.
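
The checklist above can be tried out directly. The following is a minimal sketch using only NumPy; `describe` is a hypothetical helper name chosen for illustration and is not part of the datascience library. It reports how many values an array holds and what kind of values they are before any arithmetic is attempted.

```python
import numpy as np

def describe(arr):
    """Return (length, dtype kind) for an array; hypothetical helper for illustration."""
    # dtype.kind is 'i' for signed integers, 'f' for floats, 'U' for Unicode strings
    return len(arr), arr.dtype.kind

odd_array = np.array([1, 3, 5, 7])
an_array = np.array(['1', '2', '3', '4'])

print(describe(odd_array))   # four integers
print(describe(an_array))    # four strings, so arithmetic with numbers will fail
```

Checking the kind first makes it easier to predict which of the expressions below will succeed.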


3.1.1 (a)

odd_array + even_array

Answer
odd_array + even_array
array([ 3,  7, 11, 15])

3.1.2 (b)

odd_array + an_array

Answer
odd_array + an_array
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 odd_array + an_array

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U1')) -> None
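
If the goal were to add the numbers that the strings represent, one possible fix (a sketch, not part of the original answer) is to convert the string array to integers first with NumPy's `astype`.

```python
import numpy as np

odd_array = np.array([1, 3, 5, 7])
an_array = np.array(['1', '2', '3', '4'])

# Convert each numeric string to an integer, then add element by element
as_numbers = an_array.astype(int)
total = odd_array + as_numbers
print(total)
```

This only works because every string in `an_array` looks like a whole number; a value such as 'abc' would raise an error during the conversion.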

3.1.3 (c)

odd_array * 3

Answer
odd_array * 3
array([ 3,  9, 15, 21])

3.1.4 (d)

(odd_array + 1) == even_array

Answer
(odd_array + 1) == even_array
array([ True,  True,  True,  True], dtype=bool)
Element-wise Operations

Arithmetic and comparison operations on arrays are performed element by element.

For example, two arrays may look identical, but comparing them with == does not return a single True. Instead, the comparison happens for each element and returns an array of Booleans.

This is a common source of confusion, so be careful!
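
A short sketch of the difference, using plain NumPy calls: `==` gives one Boolean per position, while `np.array_equal` collapses the comparison into a single Boolean. The variable names are chosen for illustration.

```python
import numpy as np

odd_array = np.array([1, 3, 5, 7])
even_array = np.arange(2, 10, 2)

# One Boolean per position
elementwise = (odd_array + 1) == even_array

# A single Boolean: True only if the shapes and all elements match
all_match = np.array_equal(odd_array + 1, even_array)

# How many positions agree
num_matches = np.count_nonzero(elementwise)

print(elementwise, all_match, num_matches)
```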


3.1.5 (e)

an_array.item(3) + 'abcd'

Answer
an_array.item(3) + 'abcd'
'4abcd'

In this next section, we will practice working with tables. In particular, we’ll focus on table methods and the data types they return, which helps in understanding how to manipulate tables effectively. Remember to make use of the Python Reference guide when working through these questions; a similar guide will be provided on exams.

3.2 Aces

In tennis, a player hits an ace when their serve is untouched by the opponent. Your friend Wesley is interested in analyzing the 8 male tennis professionals who have hit the most aces throughout tennis history. The table below is called tennis and includes statistics obtained from the ATP Tour (Association of Tennis Professionals). The table is sorted in decreasing order of Num. Aces, the total number of aces (as an integer) hit by that player in their career.

tennis = Table().with_columns(
    "Name", ["John Isner", "Ivo Karlovic", "Roger Federer", "Feliciano Lopez",
             "Goran Ivanisevic", "Andy Roddick", "Sam Querrey", "Pete Sampras"],
    "Nationality", ["United States", "Croatia", "Switzerland", "Spain",
                    "Croatia", "United States", "United States", "United States"],
    "Num. Aces", [14470, 13728, 11478, 10261, 10237, 9074, 8879, 8858],
    "Matches Played", [772, 694, 1462, 976, 731, 776, 694, 792],
    "Height (cm)", [208, 211, 185, 188, 193, 188, 198, 185],
    "Weight (kg)", [108, 104, 85, 88, 82, 88, 95, 77]
)

tennis
Name Nationality Num. Aces Matches Played Height (cm) Weight (kg)
John Isner United States 14470 772 208 108
Ivo Karlovic Croatia 13728 694 211 104
Roger Federer Switzerland 11478 1462 185 85
Feliciano Lopez Spain 10261 976 188 88
Goran Ivanisevic Croatia 10237 731 193 82
Andy Roddick United States 9074 776 188 88
Sam Querrey United States 8879 694 198 95
Pete Sampras United States 8858 792 185 77

Unfortunately, the code Wesley wrote to analyze the data has some issues. Below are some error messages that appeared, along with what Wesley was trying to calculate. Describe the issues and explain how you would change the code to fix them.

Debugging in Python

Error messages in Jupyter notebooks may look intimidating, but they are one of your best tools for learning.

Key Idea

  • Some errors are straightforward (like a typo in a name).
  • Others may be less obvious, but each error message contains clues about what went wrong and where.
  • Learning to read error messages carefully will save you time and frustration.

Even unusual errors can usually be traced back to a small mistake.
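
To practice reading an error, it can help to catch one deliberately and look at its two parts: the error type and the message. The sketch below reuses the arrays from Section 3.1 and assumes only NumPy; the variable names `error_name` and `error_message` are illustrative.

```python
import numpy as np

odd_array = np.array([1, 3, 5, 7])
an_array = np.array(['1', '2', '3', '4'])

try:
    odd_array + an_array
except TypeError as err:
    # The last line of a traceback shows the same two pieces:
    # the kind of error, then a message describing what went wrong
    error_name = type(err).__name__
    error_message = str(err)
    print(error_name)
    print(error_message)
```

The message names the operation ('add') and the two data types involved, which points straight at the string array as the culprit.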


3.2.1 (a)

The proportion of players in the table that are from the United States.

Answer

The code results in an error because it tries to divide a table by an integer, which is not a valid operation.
Correct code: tennis.where("Nationality", "United States").num_rows / tennis.num_rows

tennis.where("Nationality", "United States").num_rows / tennis.num_rows
0.5

3.2.2 (b)

An array of the average number of aces hit per match for each player.

Answer

The code results in an error because it attempts to divide an array by a table. An array can be divided by a single number or by another numeric array of the same length, but not by a table.
Correct code: tennis.column("Num. Aces") / tennis.column("Matches Played")

tennis.column("Num. Aces") / tennis.column("Matches Played")
array([ 18.74352332,  19.78097983,   7.85088919,  10.51331967,
        14.00410397,  11.69329897,  12.79394813,  11.18434343])
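
The division rules can be checked on a small scale. This sketch uses plain NumPy arrays holding the first three players' values from the tennis table; the variable names are illustrative.

```python
import numpy as np

# First three players' values copied from the tennis table
aces = np.array([14470, 13728, 11478])
matches = np.array([772, 694, 1462])

per_match = aces / matches      # array / array: element by element
in_thousands = aces / 1000      # array / number: every element is divided

try:
    aces / np.array([772, 694])  # lengths 3 and 2 do not line up
    mismatch_failed = False
except ValueError:
    mismatch_failed = True

print(per_match, in_thousands, mismatch_failed)
```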

3.3 K-Pop Enthusiasts

Ethan and his friends attend various concerts. The table concerts contains information about their adventures in 2022. There are four columns:

  • Name: string, name of the concertgoer
  • Artist: string, name of the performing artist
  • Month: int, 1-12 corresponding to the month
  • Price: float, cost of the concert ticket

Some rows are shown below:

concerts = Table().with_columns(
    "Name", [
        "Jeffrey", "Kristen", "Ethan", "Jeffrey", "Oscar",
        "Maya", "Liam", "Sophia", "Noah", "Emma",
        "Ethan", "Ava", "Olivia", "Ethan", "Lucas"
    ],
    "Artist", [
        "Blackpink", "Seventeen", "BTS", "Twice", "BTS",
        "Blackpink", "Seventeen", "Twice", "BTS", "Blackpink",
        "Twice", "Seventeen", "BTS", "Twice", "Twice"
    ],
    "Month", [
        11, 8, 4, 5, 4,
        12, 7, 6, 9, 11,
        5, 8, 4, 5, 6
    ],
    "Price", [
        132.62, 42.68, 70.02, 392.11, 70.02,
        145.00, 50.00, 400.00, 75.00, 150.00,
        392.11, 48.50, 72.00, 400.00, 390.00
    ]
)

concerts.show(5)
Name Artist Month Price
Jeffrey Blackpink 11 132.62
Kristen Seventeen 8 42.68
Ethan BTS 4 70.02
Jeffrey Twice 5 392.11
Oscar BTS 4 70.02

... (10 rows omitted)


3.3.1 (a)

For each of the columns in concerts, identify if the data contained in that column is numerical or categorical.

Categorical vs. Numerical Data

Not all numbers are numerical!

  • Categorical example: An ID column is made of numbers, but taking the sum or average of IDs doesn’t mean anything.
  • Numerical example: A column of heights or ages is truly numeric because averages and sums make sense.

A good rule of thumb:

If adding or averaging the values doesn’t make sense, the data is categorical.

Example: ZIP Codes

ZIP codes are a classic case: they look like numbers, but they are categorical.
You can group or compare them, but taking an average ZIP code isn’t meaningful.
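
The rule of thumb can be illustrated with the Month values from the concerts table. This is a sketch in plain Python; `Counter` comes from the standard library, and the variable names are illustrative.

```python
from collections import Counter

# Month values copied from the concerts table
months = [11, 8, 4, 5, 4, 12, 7, 6, 9, 11, 5, 8, 4, 5, 6]

# An average can be computed, but it does not say which months had concerts
average_month = sum(months) / len(months)

# Counting how often each category appears is the meaningful summary
concerts_per_month = Counter(months)

print(average_month)
print(concerts_per_month)
```

Python will happily average the months, so the decision about whether a column is categorical has to come from what the values mean, not from their data type.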

Answer
Name - Categorical
Artist - Categorical
Month - Categorical
Price - Numerical

3.3.2 (b)

Assume Ethan attended two Twice concerts in 2022. Assign months_passed to the number of months in between those two concerts as an integer.

ethan_twice_months = concerts.where(___________, ___________).where(___________, ___________).column('Month')
months_passed = abs(___________________________)
Filtering with Multiple Conditions

Chaining .where() calls is the standard way to filter a table by multiple conditions.

  • .where() returns a new table, so it can be followed by another .where().
  • A single .where() call takes one column and one condition, so each additional condition needs its own call.

Being mindful of the data type that each method returns is crucial.
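
To see what the two chained filters accomplish, here is the same two-condition selection written on plain NumPy arrays. This is a sketch for intuition, not the datascience API; the arrays are copied from the concerts table and the mask names are illustrative.

```python
import numpy as np

# Columns copied from the concerts table
names = np.array(["Jeffrey", "Kristen", "Ethan", "Jeffrey", "Oscar",
                  "Maya", "Liam", "Sophia", "Noah", "Emma",
                  "Ethan", "Ava", "Olivia", "Ethan", "Lucas"])
artists = np.array(["Blackpink", "Seventeen", "BTS", "Twice", "BTS",
                    "Blackpink", "Seventeen", "Twice", "BTS", "Blackpink",
                    "Twice", "Seventeen", "BTS", "Twice", "Twice"])
months = np.array([11, 8, 4, 5, 4, 12, 7, 6, 9, 11, 5, 8, 4, 5, 6])

is_ethan = names == "Ethan"      # first condition, one Boolean per row
is_twice = artists == "Twice"    # second condition
both = is_ethan & is_twice       # True only where both conditions hold

ethan_twice_months = months[both]
print(ethan_twice_months)
```

Each mask plays the role of one .where() call: rows are kept only if they pass every condition.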

Answer
ethan_twice_months = concerts.where('Name', 'Ethan').where('Artist', 'Twice').column('Month')
months_passed = abs(ethan_twice_months.item(0) - ethan_twice_months.item(1))
months_passed
0

3.4 Fa17 Midterm Q2 Modified

A table named seat contains a row for each time a student submitted the attendance form in lecture on September 18th, 20th, or 22nd. The table contains four columns.

  • Email: a string, the email address of the student
  • Row: a string, the letter of the row in which they claim to be seated
  • Seat: an int, the number of the seat in which they claim to be seated
  • Date: an int, the date of the submission, either 18, 20, or 22.
seat = Table.read_table('seat.csv')
seat.show(3)
Email Row Seat Date
sulu@berkeley.edu C 102 20
mccoy@berkeley.edu A 3 18
kirk@berkeley.edu R 110 20

... (1997 rows omitted)

Fill in the blanks of the Python expressions to compute the described values. You must use all and only the lines provided. The last (or only) line of each answer should evaluate to the value described.

Practice with Exam-Style Questions

Sometimes it’s helpful to practice problems that feel like actual exam questions.

Key Lessons

  • There is often more than one way to solve a problem. For example, you might use a built-in Python function (such as max) or a table method to reach the same result.
  • Some questions involve special details, like comparing rows in lexicographic (alphabetical) order. Hints are usually given, but you’ll need to think carefully.

Working through these types of problems now gives you a taste of what to expect on exams and builds confidence.


3.4.1 (a)

The largest seat number in the seat table.

Method 1:

max(___________________________)

Answer
max(seat.column('Seat'))
150

Method 2:

___________.sort(___________, ___________).___________(___________)

Answer
seat.sort('Seat', descending=True).column('Seat').item(0)
150
Finding the Largest Value

A common pattern in Data 8 is:

table.sort("column_name", descending=True).column("column_name").item(0)

This gives the largest value in a column. Without descending=True, the sort is ascending and the same expression gives the smallest value.

You’ll see this pattern repeatedly, so it’s worth getting comfortable with it now.
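
The sort-then-take-the-first-item idea can be checked against max on a small array. This sketch uses plain NumPy and hypothetical seat numbers, since the contents of seat.csv are not reproduced here.

```python
import numpy as np

# Hypothetical seat numbers for illustration; not taken from seat.csv
seats = np.array([102, 3, 110, 47, 150, 12])

# np.sort is ascending, so reverse it to put the largest value first
descending = np.sort(seats)[::-1]
largest_by_sort = descending.item(0)

# The built-in max gives the same value directly
largest_by_max = max(seats)

# With an ascending sort, the first item is the smallest value instead
smallest_by_sort = np.sort(seats).item(0)

print(largest_by_sort, largest_by_max, smallest_by_sort)
```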


3.4.2 (b)

The total number of attendance submissions for September 20th in rows A, B, C, D, or E.

Hint: Use Table.where predicates to compare letters lexicographically (e.g., A is below B).

sept_20 = seat.___________(___________, ___________)
sept_20.___________(___________, ___________).___________
Answer
sept_20 = seat.where('Date', 20)
sept_20.where('Row', are.below('F')).num_rows
125
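
The lexicographic comparison behind are.below('F') can be seen with ordinary Python strings. This is a sketch with hypothetical row letters; it does not use the real seat table.

```python
# Hypothetical row letters for illustration; not taken from seat.csv
rows = ["C", "A", "R", "E", "F", "B"]

# Python compares strings lexicographically, so "A" through "E" are all below "F"
front_rows = [row for row in rows if row < "F"]
count_front = len(front_rows)

print(front_rows, count_front)
```

Note that "F" itself is excluded, which is why the answer compares against 'F' rather than 'E' to capture rows A through E.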