from datascience import *
import numpy as np
In lecture, you have been introduced to various data types in Python, such as integers, strings, and arrays. These data types are particularly important for manipulating data and extracting useful information from it, a key skill in data science. In this section, we’ll analyze how Python behaves when dealing with particular data types.
Suppose we have executed the following lines of code. Answer each question with the appropriate output associated with each line of code, or write ERROR if you think the operation is not possible.
from datascience import *
import numpy as np
odd_array = make_array(1, 3, 5, 7)
even_array = np.arange(2, 10, 2)
an_array = make_array('1', '2', '3', '4')
odd_array + even_array
odd_array + even_array
array([ 3, 7, 11, 15])
odd_array + an_array
odd_array + an_array
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 odd_array + an_array

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U1')) -> None
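The error above arises because NumPy has no rule for adding integers to strings. The following is a minimal sketch of that behavior, using `np.array` as a stand-in for `make_array` (both produce NumPy arrays). Converting the strings with `astype(int)` is only one possible fix and is an assumption about what the code was meant to do, not part of the original question.

```python
import numpy as np

# np.array stands in for make_array here; both build NumPy arrays.
odd_array = np.array([1, 3, 5, 7])
an_array = np.array(['1', '2', '3', '4'])

try:
    odd_array + an_array
    outcome = "no error"
except TypeError as err:
    # NumPy's UFuncTypeError is a subclass of TypeError.
    outcome = "error: " + type(err).__name__

# Assumed fix: convert the strings to integers before adding.
as_numbers = an_array.astype(int)
total = odd_array + as_numbers

print(outcome)
print(total)
```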
odd_array * 3
odd_array * 3
array([ 3, 9, 15, 21])
(odd_array + 1) == even_array
(odd_array + 1) == even_array
array([ True, True, True, True], dtype=bool)
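Comparing two arrays of the same length produces one boolean per position. The sketch below rebuilds the two arrays with plain NumPy (an assumption made so the example runs on its own) and shows two common follow-ups: counting the `True` values and checking whether every position matches.

```python
import numpy as np

odd_array = np.array([1, 3, 5, 7])
even_array = np.arange(2, 10, 2)          # 2, 4, 6, 8 (the stop value 10 is excluded)

matches = (odd_array + 1) == even_array   # one True/False per position
num_matches = np.count_nonzero(matches)   # each True counts as 1
all_match = bool(np.all(matches))         # True only if every position matches

print(matches, num_matches, all_match)
```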
an_array.item(3) + 'abcd'
an_array.item(3) + 'abcd'
'4abcd'
In this next section, we will practice working with tables. In particular, we’ll focus on table methods and the data types they return, which will help you manipulate tables effectively. Remember to make use of the Python Reference guide when working through these questions; a similar guide will be provided on exams.
In tennis, a player hits an ace when their serve is untouched by the opponent. Your friend Wesley is interested in analyzing the 8 male tennis professionals who have hit the most aces throughout tennis history. The table below is called tennis and includes statistics obtained from the ATP Tour (Association of Tennis Professionals). The table is sorted in decreasing order of Num. Aces, the total number of aces (as an integer) hit by that player in their career.
tennis = Table().with_columns(
    "Name", ["John Isner", "Ivo Karlovic", "Roger Federer", "Feliciano Lopez",
             "Goran Ivanisevic", "Andy Roddick", "Sam Querrey", "Pete Sampras"],
    "Nationality", ["United States", "Croatia", "Switzerland", "Spain",
                    "Croatia", "United States", "United States", "United States"],
    "Num. Aces", [14470, 13728, 11478, 10261, 10237, 9074, 8879, 8858],
    "Matches Played", [772, 694, 1462, 976, 731, 776, 694, 792],
    "Height (cm)", [208, 211, 185, 188, 193, 188, 198, 185],
    "Weight (kg)", [108, 104, 85, 88, 82, 88, 95, 77]
)
tennis
Name | Nationality | Num. Aces | Matches Played | Height (cm) | Weight (kg) |
---|---|---|---|---|---|
John Isner | United States | 14470 | 772 | 208 | 108 |
Ivo Karlovic | Croatia | 13728 | 694 | 211 | 104 |
Roger Federer | Switzerland | 11478 | 1462 | 185 | 85 |
Feliciano Lopez | Spain | 10261 | 976 | 188 | 88 |
Goran Ivanisevic | Croatia | 10237 | 731 | 193 | 82 |
Andy Roddick | United States | 9074 | 776 | 188 | 88 |
Sam Querrey | United States | 8879 | 694 | 198 | 95 |
Pete Sampras | United States | 8858 | 792 | 185 | 77 |
Unfortunately, the code Wesley wrote to analyze the data has some issues. Below are some error messages that appeared, along with what Wesley was trying to calculate. Describe the issues and explain how you would change the code to fix them.
The proportion of players in the table that are from the United States.
The code results in an error because it tries to divide a table by an integer, which is not a valid operation.
Correct code: tennis.where("Nationality", "United States").num_rows / tennis.num_rows
tennis.where("Nationality", "United States").num_rows / tennis.num_rows
0.5
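As a cross-check on the answer above, the same proportion can be computed directly from the Nationality values with an array comparison. This is a sketch of an alternative approach using plain NumPy rather than the table methods; the names `nationality`, `is_us`, and `proportion_us` are introduced here for illustration.

```python
import numpy as np

# The Nationality column from the tennis table, copied as a NumPy array.
nationality = np.array(["United States", "Croatia", "Switzerland", "Spain",
                        "Croatia", "United States", "United States", "United States"])

is_us = nationality == "United States"                        # one boolean per player
proportion_us = np.count_nonzero(is_us) / len(nationality)    # a number divided by a number

print(proportion_us)  # 0.5
```

Both sides of the division are plain numbers here, which is exactly what the corrected table expression achieves with `num_rows`.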
An array of the average number of aces hit per match for each player.
The code results in an error because it attempts to divide an array by a table. An array can only be divided by a number or by another numerical array of the same length.
Correct code: tennis.column("Num. Aces") / tennis.column("Matches Played")
tennis.column("Num. Aces") / tennis.column("Matches Played")
array([ 18.74352332, 19.78097983, 7.85088919, 10.51331967,
14.00410397, 11.69329897, 12.79394813, 11.18434343])
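The division above works element by element because both columns are arrays of the same length. The sketch below reproduces it with plain NumPy arrays copied from the table and also shows what happens when the lengths differ. The variable names and the deliberately shortened array are illustrative assumptions.

```python
import numpy as np

num_aces = np.array([14470, 13728, 11478, 10261, 10237, 9074, 8879, 8858])
matches_played = np.array([772, 694, 1462, 976, 731, 776, 694, 792])

# Elementwise division: player i's aces divided by player i's matches.
aces_per_match = num_aces / matches_played

# Dividing by an array of a different length (8 values vs. 3) is not allowed.
try:
    num_aces / matches_played[:3]
    mismatch = "no error"
except ValueError:
    mismatch = "lengths do not match"

print(aces_per_match)
print(mismatch)
```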
Ethan and his friends attend various concerts. The table concerts contains information about their adventures in 2022. There are four columns: Name, Artist, Month, and Price.
Some rows are shown below:
concerts = Table().with_columns(
    "Name", [
        "Jeffrey", "Kristen", "Ethan", "Jeffrey", "Oscar",
        "Maya", "Liam", "Sophia", "Noah", "Emma",
        "Ethan", "Ava", "Olivia", "Ethan", "Lucas"
    ],
    "Artist", [
        "Blackpink", "Seventeen", "BTS", "Twice", "BTS",
        "Blackpink", "Seventeen", "Twice", "BTS", "Blackpink",
        "Twice", "Seventeen", "BTS", "Twice", "Twice"
    ],
    "Month", [
        11, 8, 4, 5, 4,
        12, 7, 6, 9, 11,
        5, 8, 4, 5, 6
    ],
    "Price", [
        132.62, 42.68, 70.02, 392.11, 70.02,
        145.00, 50.00, 400.00, 75.00, 150.00,
        392.11, 48.50, 72.00, 400.00, 390.00
    ]
)
concerts.show(5)
Name | Artist | Month | Price |
---|---|---|---|
Jeffrey | Blackpink | 11 | 132.62 |
Kristen | Seventeen | 8 | 42.68 |
Ethan | BTS | 4 | 70.02 |
Jeffrey | Twice | 5 | 392.11 |
Oscar | BTS | 4 | 70.02 |
... (10 rows omitted)
For each of the columns in concerts, identify if the data contained in that column is numerical or categorical.
Assume Ethan attended two Twice concerts in 2022. Assign months_passed to the number of months in between those two concerts as an integer.
ethan_twice_months = concerts.where(___________, ___________).where(___________, ___________).column('Month')
months_passed = abs(___________________________)
ethan_twice_months = concerts.where('Name', 'Ethan').where('Artist', 'Twice').column('Month')
months_passed = abs(ethan_twice_months.item(0) - ethan_twice_months.item(1))
months_passed
0
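The result of 0 can be verified without the table methods by filtering the columns with boolean masks. This sketch copies the Name, Artist, and Month columns into NumPy arrays. The mask-based filtering is an alternative technique chosen for illustration, not the approach the question asks for.

```python
import numpy as np

name = np.array(["Jeffrey", "Kristen", "Ethan", "Jeffrey", "Oscar",
                 "Maya", "Liam", "Sophia", "Noah", "Emma",
                 "Ethan", "Ava", "Olivia", "Ethan", "Lucas"])
artist = np.array(["Blackpink", "Seventeen", "BTS", "Twice", "BTS",
                   "Blackpink", "Seventeen", "Twice", "BTS", "Blackpink",
                   "Twice", "Seventeen", "BTS", "Twice", "Twice"])
month = np.array([11, 8, 4, 5, 4, 12, 7, 6, 9, 11, 5, 8, 4, 5, 6])

# Keep only the rows where both conditions hold, like chaining two where calls.
ethan_twice = (name == "Ethan") & (artist == "Twice")
ethan_twice_months = month[ethan_twice]

# .item(i) returns a plain Python int, so the result is an integer.
months_passed = abs(ethan_twice_months.item(0) - ethan_twice_months.item(1))

print(ethan_twice_months, months_passed)
```

Both of Ethan's Twice concerts took place in month 5, which is why the difference is 0.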
A table named seat contains a row for each time a student submitted the attendance form in lecture on September 18th, 20th, or 22nd. The table contains four columns.
seat = Table.read_table('seat.csv')
seat.show(3)
 | Row | Seat | Date |
---|---|---|---|
sulu@berkeley.edu | C | 102 | 20 |
mccoy@berkeley.edu | A | 3 | 18 |
kirk@berkeley.edu | R | 110 | 20 |
... (1997 rows omitted)
Fill in the blanks of the Python expressions to compute the described values. You must use all and only the lines provided. The last (or only) line of each answer should evaluate to the value described.
The largest seat number in the seat table.
Method 1:
max(___________________________)
max(seat.column('Seat'))
150
Method 2:
___________.sort(___________, ___________).___________(___________)
seat.sort('Seat', descending=True).column('Seat').item(0)
150
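The two methods agree because sorting in descending order puts the maximum first. The sketch below illustrates this on a small NumPy array. The array holds only the three seat numbers visible in the preview (not the full `seat.csv`, which is unavailable here), so the result differs from the 150 obtained on the full table.

```python
import numpy as np

# Only the three seat numbers shown in the preview; the full table has 2000 rows.
seats = np.array([102, 3, 110])

# Method 1: take the maximum directly.
largest_by_max = max(seats)

# Method 2: sort ascending, reverse to get descending order, take the first item.
largest_by_sort = np.sort(seats)[::-1].item(0)

print(largest_by_max, largest_by_sort)
```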
The total number of attendance submissions for September 20th in rows A, B, C, D, or E.
Hint: Use Table.where predicates to compare letters lexicographically (e.g. A is below B)
sept_20 = seat.___________(___________, ___________)
sept_20.___________(___________, ___________).___________
sept_20 = seat.where('Date', 20)
sept_20.where('Row', are.below('F')).num_rows
125
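The solution relies on the fact that Python compares strings lexicographically, so the rows A through E are exactly those that come before 'F'. The following sketch shows this in plain Python. The lists contain only the three rows visible in the preview, an assumption made so the example is self-contained, so the count differs from the 125 obtained on the full table.

```python
# Row letters and dates from the three preview rows only (not the full table).
rows = ['C', 'A', 'R']
dates = [20, 18, 20]

# Single uppercase letters compare in alphabetical order.
e_is_below_f = 'E' < 'F'    # True: E is included
f_is_below_f = 'F' < 'F'    # False: F itself is excluded

# Count submissions on the 20th in rows A through E.
count = sum(1 for row, date in zip(rows, dates) if date == 20 and row < 'F')

print(e_is_below_f, f_is_below_f, count)
```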