4  Discussion 04: Functions and Visualizations (From Summer 2025)

Slides

Making Sense of Histograms

Histograms help us understand the distribution of one numerical variable. They show how spread out the data is and where it tends to cluster.

Histograms vs. Bar Charts

  • Histogram: Used for numerical data. You can adjust the bin widths to change how the distribution looks.
  • Bar Chart: Used for categorical data. Categories are fixed, so there are no adjustable bins.

The X-Axis

  • The x-axis units match the numerical variable being plotted.
  • Bins indicate how the range of values is divided up.
  • Choosing bins that are too narrow or too wide can hide useful information about the distribution.

The Y-Axis

  • The y-axis is the density scale, measured in percent per unit of the x-axis variable. It shows how “crowded” the data are within each bin.
  • The area of each bar equals the percent of the data in that bin.
    • The total area of the histogram always equals 100% (or 1.0).
    • If all the data were in one bin, that bin’s area would represent 100%.
  • A helpful analogy: a packed Wheeler 150 and a packed Dwinelle 155 both feel crowded, even though the two rooms hold different numbers of people. Density measures that crowding, not the head count.
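
A minimal sketch of the density scale in plain Python, using a small made-up data set and hypothetical bin edges (neither comes from this worksheet). It converts counts to percents, divides by bin width to get heights, and checks that the bar areas add up to 100%.

```python
# Hypothetical data and bin edges, chosen only for illustration.
values = [1, 2, 2, 3, 5, 6, 8, 9, 12, 18]
edges = [0, 4, 10, 20]  # bins [0, 4), [4, 10), [10, 20)

def density_heights(values, edges):
    """Return (percent, height) for each bin; bins include the lower bound only."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if lo <= v < hi)
        percent = 100 * count / len(values)
        rows.append((percent, percent / (hi - lo)))
    return rows

rows = density_heights(values, edges)

# Area of each bar = height * width; the areas should total 100%.
total_area = sum(height * (hi - lo)
                 for (_, height), lo, hi in zip(rows, edges[:-1], edges[1:]))

for (percent, height), lo, hi in zip(rows, edges[:-1], edges[1:]):
    print(f"[{lo}, {hi}): {percent:.0f}% of data, height {height:.2f}% per unit")
print(f"total area: {total_area:.0f}%")
```

In this made-up example the first two bins hold the same share of the data, but the narrower bin is taller because its values are more crowded.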

4.1 Histograms

The table below shows the distribution of rents paid by students in Boston. The first column consists of ranges of monthly rent, in dollars. Ranges include the lower bound but not the upper bound. The second column shows the percentage of students who pay rent in each of the ranges.

Dollars   | Students (%)
500-800   | 15
800-1000  | 25
1000-1200 | 40
1200-1600 | 20

import numpy as np
from datascience import *
%matplotlib inline

# 100 students in total, so each count is also a percentage:
# 15 pay $600, 25 pay $900, 40 pay $1100, and 20 pay $1400.
rent = Table().with_columns(
    "Dollars", np.repeat([600.0, 900.0, 1100.0, 1400.0], [15, 25, 40, 20])
)

4.1.1 (a)

Calculate the heights of the bars for the bins listed in the table, with correct units. Recall the Area Principle:
Area = % of values in a bin = Width * Height

The Area Principle for Histograms

The most important idea when working with histograms is the area principle:

\[ \text{Area} = \text{Width} \times \text{Height} \]

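The area of a bar (bin) tells us the proportion of the data that falls within that range. Rearranging the formula gives \(\text{Height} = \text{Area} / \text{Width}\), which is how bar heights are calculated.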

Answer

500-800: 0.050% per dollar
800-1000: 0.125% per dollar
1000-1200: 0.200% per dollar
1200-1600: 0.050% per dollar

Calculation (demonstrated on the 500-800 bin): height = \(\frac{\text{area}}{\text{width}} = \frac{15\%}{\$800 - \$500} = 0.050\%\) per dollar
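
The same division can be repeated for every bin. The sketch below uses plain Python rather than the datascience library; the percentages are the ones in the rent data (15, 25, 40, and 20 out of 100 students), and the variable names are illustrative.

```python
# Bin edges and the percent of students in each bin, taken from the rent data.
edges = [500, 800, 1000, 1200, 1600]
percents = [15, 25, 40, 20]

# Area principle rearranged: height = area / width.
heights = [
    pct / (hi - lo)
    for pct, lo, hi in zip(percents, edges[:-1], edges[1:])
]

for lo, hi, h in zip(edges[:-1], edges[1:], heights):
    print(f"{lo}-{hi}: {h:.3f}% per dollar")
```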

4.1.2 (b)

Draw a histogram of the data. Make sure you label your axes!

Height vs. Area

A larger area does not always mean a taller bar.

  • The area depends on both the width and the height of the bin.
  • A wide bin might have a large area but still a relatively short height.

This is why you should always connect the shape of the histogram back to the area principle.
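
To make this concrete, here is a small sketch with two hypothetical bins (the numbers are invented for illustration): the wider bin holds more of the data yet ends up with the shorter bar.

```python
# Two hypothetical bins: width in x-axis units, and percent of data in the bin.
wide_bin = {"width": 10, "percent": 30}
narrow_bin = {"width": 2, "percent": 20}

def height(bin_info):
    """Bar height on the density scale: area (percent) divided by width."""
    return bin_info["percent"] / bin_info["width"]

print(height(wide_bin))    # 3.0 percent per unit
print(height(narrow_bin))  # 10.0 percent per unit
```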

Answer
rent.hist("Dollars", bins = [500, 800, 1000, 1200, 1600])


4.1.3 (c)

True or False: If we combine the [500, 800) and [800, 1000) bins together, the height of the new bin would be greater than the heights of both of the old bins. Please explain your answer.

Answer

False. When we combine bins, the height of the new bin is the average of the old bin heights, weighted by bin width. The new height will therefore be greater than the height of the [500, 800) bin (0.050% per dollar) but less than the height of the [800, 1000) bin (0.125% per dollar). Calculating the new height gives:

height = \(\frac{\text{area}}{\text{width}} = \frac{40\%}{(\$800 - \$500) + (\$1000 - \$800)} = 0.080\%\) per dollar

Combining Bins

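When two bins are combined, the height of the new bin is the average of the original heights, weighted by bin width.

  • The height of the combined bin will never exceed the tallest of the original bins, and it will never fall below the shortest.
  • This is true of averages in general: an average can never be larger than the maximum value or smaller than the minimum.

Redrawing the histogram with [500, 800) and [800, 1000) merged into a single [500, 1000) bin: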
rent.hist("Dollars", bins = [500, 1000, 1200, 1600])
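
The weighted-average claim can also be checked numerically. A minimal sketch in plain Python, using the heights and widths of the two original bins; the function name is illustrative.

```python
def combined_height(heights, widths):
    """Height of a merged bin: the width-weighted average of the old heights."""
    total_area = sum(h * w for h, w in zip(heights, widths))
    return total_area / sum(widths)

# [500, 800) has height 0.050 and width 300; [800, 1000) has height 0.125 and width 200.
new_height = combined_height([0.050, 0.125], [300, 200])
print(round(new_height, 3))  # 0.08 (% per dollar)

# The merged height lies between the two original heights.
assert 0.050 <= new_height <= 0.125
```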

4.2 Sheng Kee Fridays

Samiksha’s favorite activity to celebrate Fridays is buying pastries at Sheng Kee before class. She stores her purchase data in a table, pastries, to keep track of her spending. Each row represents an individual purchase. The first few rows look like this:

pastries = Table().with_columns(
    'item', ['Hot Dog Bun', 'Yudane Milk Bun', 'Summer Romance', 'Pineapple Bun', 'Ham and Cheese Croissant'],
    'category', ['Savory', 'Sweet', 'Sweet', 'Sweet', 'Savory'],
    'price', [2.75, 2.99, 2.79, 2.45, 3.15],
    'satisfaction', [8.5, 9.0, 10.0, 7.75, 7.25]
)

pastries
item                     | category | price | satisfaction
Hot Dog Bun              | Savory   | 2.75  | 8.5
Yudane Milk Bun          | Sweet    | 2.99  | 9
Summer Romance           | Sweet    | 2.79  | 10
Pineapple Bun            | Sweet    | 2.45  | 7.75
Ham and Cheese Croissant | Savory   | 3.15  | 7.25

The table has 4 columns:

  • item (string): name of the pastry.
  • category (string): whether the pastry is sweet or savory.
  • price (float): price of the pastry.
  • satisfaction (float): how satisfied (out of 10) Samiksha was after eating the pastry.

Practicing with Tables

Working with tables involves a wide set of operations:

  • .column(), .with_columns(), .where(), .sort(), .group(), .apply()
  • Selecting rows with tbl.take()
  • Using NumPy functions like np.mean() and np.arange()

These tools let us create new columns, operate on them, and select multiple rows at once. Practicing these now will make exam-style questions much easier.


4.2.1 (a)

Write a line of code to calculate the total amount Samiksha spent on pastries. Assume all of her pastry purchases are recorded in the table.

Answer
sum(pastries.column('price'))
14.130000000000001
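
The trailing digits in the output are floating-point rounding error, not extra cents: most decimal prices cannot be stored exactly in binary. A small sketch in plain Python (no table required) that rounds the total to cents for display:

```python
# Prices copied from the pastries table.
prices = [2.75, 2.99, 2.79, 2.45, 3.15]

total = sum(prices)     # may carry a tiny binary rounding error
print(round(total, 2))  # 14.13
```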

4.2.2 (b)

Write a line of code to calculate the average satisfaction Samiksha felt after eating sweet pastries.

__________(pastries.__________(__________).column(__________))

Answer
np.mean(pastries.where('category', are.equal_to('Sweet')).column('satisfaction'))
8.9166666666666661
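
As a cross-check, the same filter-then-average logic can be written in plain Python over the five rows shown above. This is only a sketch of what where and np.mean compute, not how the datascience library implements them.

```python
# (category, satisfaction) pairs copied from the pastries table.
rows = [
    ("Savory", 8.5),
    ("Sweet", 9.0),
    ("Sweet", 10.0),
    ("Sweet", 7.75),
    ("Savory", 7.25),
]

# Keep only sweet pastries, then average their satisfaction scores.
sweet_scores = [score for category, score in rows if category == "Sweet"]
average = sum(sweet_scores) / len(sweet_scores)
print(round(average, 4))  # 8.9167
```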

4.2.3 (c)

Samiksha’s budget is getting tight, and she wants to buy pastries that will give her the most satisfaction per dollar. Write lines of code that will help us achieve this.

4.2.3.1 (i)

First, create an array that contains each purchase’s satisfaction per dollar. Then, add a new column called “satisfaction per $” to the pastries table. (Hint: You can calculate a purchase’s satisfaction per dollar by dividing its satisfaction score by its price.)

score_array = pastries.__________(__________) / pastries.__________(__________)
pastries = __________.with_column(__________, __________)

Answer
score_array = pastries.column('satisfaction') / pastries.column('price')
pastries = pastries.with_column('satisfaction per $', score_array)
pastries
item                     | category | price | satisfaction | satisfaction per $
Hot Dog Bun              | Savory   | 2.75  | 8.5          | 3.09091
Yudane Milk Bun          | Sweet    | 2.99  | 9            | 3.01003
Summer Romance           | Sweet    | 2.79  | 10           | 3.58423
Pineapple Bun            | Sweet    | 2.45  | 7.75         | 3.16327
Ham and Cheese Croissant | Savory   | 3.15  | 7.25         | 2.30159

4.2.3.2 (ii)

Samiksha is interested in finding the pastries in the table with the top 3 satisfaction values per dollar. Write code that will output the names of these items as an array.

pastries_sorted = pastries.__________(__________, __________)
pastries_sorted.__________(__________).column(__________)

Answer
pastries_sorted = pastries.sort('satisfaction per $', descending = True)
pastries_sorted.take(np.arange(3)).column('item')
array(['Summer Romance', 'Pineapple Bun', 'Hot Dog Bun'],
      dtype='<U24')
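
The sort-then-take pattern can also be sketched in plain Python with sorted and slicing. The data are the five rows shown earlier; the variable names are illustrative.

```python
# (item, price, satisfaction) copied from the pastries table.
rows = [
    ("Hot Dog Bun", 2.75, 8.5),
    ("Yudane Milk Bun", 2.99, 9.0),
    ("Summer Romance", 2.79, 10.0),
    ("Pineapple Bun", 2.45, 7.75),
    ("Ham and Cheese Croissant", 3.15, 7.25),
]

# Sort by satisfaction per dollar, largest first, and keep the first three names.
ranked = sorted(rows, key=lambda r: r[2] / r[1], reverse=True)
top_three = [item for item, _, _ in ranked[:3]]
print(top_three)  # ['Summer Romance', 'Pineapple Bun', 'Hot Dog Bun']
```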

4.3 Fall 2018 Midterm Question 2 (Modified)

The table insurance contains one row for each beneficiary who is covered by a particular insurance company:

insurance = Table.read_table("insurance.csv")
insurance.show(3)
age | bmi  | smoker | region    | cost
25  | 20.8 | no     | southwest | 3208.79
25  | 30.2 | yes    | southwest | 33900.7
62  | 32.1 | no     | northeast | 1355.5

... (20198 rows omitted)

The table contains five columns:

  • age (int): the age of the beneficiary.
  • bmi (float): the Body Mass Index (BMI) of the beneficiary.
  • smoker (string): indicates whether the beneficiary smokes.
  • region (string): the region of the United States where the beneficiary lives.
  • cost (float): the total amount in medical costs that the insurance company paid for this beneficiary last year.

In each part below, fill in the blanks to achieve the desired outputs.


4.3.1 (a)

A scatter plot of the amount paid last year vs. BMI (titles are usually written as Y vs. X, so cost goes on the y-axis and BMI on the x-axis) for only the beneficiaries whose costs exceeded $25,000. Each dot on the scatter plot should represent one beneficiary.

high_cost = __________.__________(__________, __________)
__________.__________(__________, __________)

Answer
high_cost = insurance.where("cost", are.above(25000))
high_cost.scatter("bmi", "cost")
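
In plain Python, the row filter behind are.above(25000) amounts to a strict comparison on the cost value. The sketch below applies it to the three rows displayed above; the tuple layout is illustrative and no plot is drawn.

```python
# (age, bmi, smoker, region, cost) for the three rows displayed by insurance.show(3).
rows = [
    (25, 20.8, "no", "southwest", 3208.79),
    (25, 30.2, "yes", "southwest", 33900.7),
    (62, 32.1, "no", "northeast", 1355.5),
]

# Keep beneficiaries whose cost is strictly greater than 25,000.
high_cost = [row for row in rows if row[4] > 25000]

# Each remaining row contributes one (x, y) = (bmi, cost) point to the scatter plot.
points = [(row[1], row[4]) for row in high_cost]
print(points)  # [(30.2, 33900.7)]
```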


4.3.2 (b)

Write a function that takes an age as an argument, and returns the average BMI among all beneficiaries of that age.

Functions in Tables

Functions are a critical part of working with tables.

  • You will use them heavily in Project 1.
  • A function lets you define a reusable operation that can then be applied to entire columns.

def average_bmi(age):
    right_age = insurance.where(__________, __________)
    bmis = right_age.__________(__________)
    avg = sum(bmis) / len(bmis)
    __________

Answer
def average_bmi(age):
    right_age = insurance.where("age", age)
    bmis = right_age.column("bmi")
    avg = sum(bmis) / len(bmis)
    return avg

average_bmi(30)
28.487799043062214
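
One edge case worth noting: if no beneficiary has the requested age, len(bmis) is 0 and the division raises a ZeroDivisionError. The sketch below shows one possible guard, written in plain Python over a short list of (age, bmi) pairs rather than the insurance table. The extra records parameter and the choice to return None for a missing age are assumptions, not part of the original question.

```python
# (age, bmi) pairs copied from the three rows displayed by insurance.show(3).
records = [(25, 20.8), (25, 30.2), (62, 32.1)]

def average_bmi(age, records):
    """Average BMI among records with the given age, or None if there are none."""
    bmis = [bmi for record_age, bmi in records if record_age == age]
    if len(bmis) == 0:
        return None  # assumed behavior for an age with no beneficiaries
    return sum(bmis) / len(bmis)

print(average_bmi(25, records))  # 25.5
print(average_bmi(40, records))  # None
```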