1  Data 8 Discussion 01: By the Numbers

Slides

1.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 12–2 PM @ Etcheverry 3105
Office Hours Tuesdays/Thursdays, 2–3 PM @ Warren Hall

Feel free to email me anytime. I typically respond within a day or so!


1.0.2 Resources and Announcements

Lab Policies

Regular Lab Credit

  • 80% credit for attendance
  • 20% credit for passing all public test cases (all or nothing)

Self-Service Lab Credit

  • Credit depends on % of test cases passed (e.g., 80% of test cases passed = 80% credit)

Lab Format Switching

  • Allowed until September 2nd and again post-midterm
  • Switching into regular lab is subject to capacity limitations
  • If you choose self-service at any point, grading will be based only on test cases, regardless of whether you were/are in a regular lab

No Technology Use

  • No technology allowed during the first hour of lab section
  • Reference sheets will be provided

Lab Drops

  • You have 2 lab drops
  • These are for extenuating circumstances only and are not the norm
  • No additional drops will be given

External Materials Policy

  • On coding assignments and exams, you may not use material not taught in Data C8 (conceptual or coding).
  • Any use of external material on homeworks, projects, or labs will result in an automatic 0 on the assignment.
  • Exam answers that use external material will not be graded.

Examples (not allowed):

  • matplotlib on homeworks
  • list comprehensions on exams

Exceptions (allowed):

  • Syntax shown in the textbook (e.g., [], range())
  • Only when used in the context shown in the given chapter

Extensions + DSP Policies

DSP

  • All DSP deadlines will be automatically extended for ALL assignments (both bonus point and regular deadlines).
  • DSP students will NOT need to fill out an extension form.

“Additional Accommodations” Form

  • Intended only for students with injuries or other non-DSP situations.
  • Unless extenuating, requests must be submitted >24 hours in advance of the deadline.
  • Does NOT apply to bonus point deadlines.
  • Students must specify the number of extension days they are requesting. An extension cannot go past the date when solutions are released.

Other

HW Drops

  • You have 2 homework drops, but these are for extenuating circumstances only.
  • Submissions after the deadline will be accepted for 24 hours with a 20% penalty.
  • Submissions more than 24 hours late will not be accepted.

Office Hours (OH)

  • Students wishing to ask questions virtually during in-person OH times will be directed to Ed.

Important Websites

Optional but Recommended Websites

For help with submitting labs on Pensieve, watch this short video:

Submission Walkthrough

Announcements

Reminders

  • Make sure you have access to Pensieve and Ed
  • Check Ed (and email) frequently, as this is where we will post important course updates
  • Read the course policies on the website

Deadlines

  • Lab 1 is due Friday (8/29) at 5 PM
  • HW 1 is due Wednesday (9/3) at 10 AM
  • HW 1 submissions after the deadline will be accepted for 24 hours with a 20% penalty
  • HW 1 submissions more than 24 hours late will not be accepted
  • Submit HW 1 by Tuesday (9/2) at 10 AM to earn +5 extra credit bonus points

Welcome to your first Data 8 lab! Data Science is about making sense of real numbers in the world, and this worksheet will help introduce you to the type of thinking you will learn in this class.

Answer the following questions to the best of your ability. Complete the worksheet with your table group. Be ready to discuss your reasoning behind each of your answers!

1.1 Class Sizes

Which of the following numbers is the closest to the average size of all undergraduate Berkeley classes in the 2024-2025 academic year?

30 40 50 60 70 80 90

Answer

50. The actual average was 53.6. For comparison, it was 53.1 in the 2023-2024 academic year, 53.0 in 2022-2023, and 52.1 in 2021-2022.

Note: The classes used in this data set do not include independent studies classes, ungraded discussion/lab secondary sections, or classes taught in the Summer term.

Source: UC Berkeley Class Size statistics
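The "closest choice" reasoning can be checked with a few lines of Python. This is only a minimal sketch: the names `choices`, `actual_average`, and `closest` are made up for illustration, and the numbers are the ones given in the question and answer above.

```python
# Answer choices from the question and the reported 2024-2025 average.
choices = [30, 40, 50, 60, 70, 80, 90]
actual_average = 53.6

# The closest choice is the one with the smallest absolute distance
# from the actual average.
closest = min(choices, key=lambda choice: abs(choice - actual_average))

print(closest)  # 50
```

Here 50 is 3.6 away from the actual average, while 60 is 6.4 away, so 50 is the closest choice.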

1.3 Healthy Living

“Life expectancy at birth” (LEB) measures roughly how many years a newborn baby is expected to live, and is a commonly used metric to quantify the health of a population. What two states would you expect to have the highest LEBs among all states in the US?

Answer

Hawaii (81.3 years) and California (80.9 years), but there is some uncertainty in these estimates. Connecticut and Minnesota are close behind at 80.8 years.

Source: United States Life Expectancy statistics table
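To see how close the top states are, the four values quoted in the answer can be ranked in Python. This is a rough sketch, not the method used by the source table: the dictionary `leb_by_state` and the other names are hypothetical, and only the four states mentioned above are included.

```python
# LEB values (in years) quoted in the answer; only these four states are listed.
leb_by_state = {
    "Hawaii": 81.3,
    "California": 80.9,
    "Connecticut": 80.8,
    "Minnesota": 80.8,
}

# Sort state names from highest to lowest LEB and keep the top two.
ranked = sorted(leb_by_state, key=leb_by_state.get, reverse=True)
top_two = ranked[:2]
print(top_two)  # ['Hawaii', 'California']

# Gap between second and third place, rounded to avoid floating-point noise.
gap = round(leb_by_state[ranked[1]] - leb_by_state[ranked[2]], 1)
print(gap)  # 0.1
```

A gap of only 0.1 years between second and third place is one reason to treat the ranking with some caution.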

1.4 Worldwide

Picking from the list of numbers below, fill in the numbers that you expect to be the closest to the 2022 LEB of the United States as well as the world.

69 71 73 75 77 79

United States: ________  World: ________

Answer

United States: 77 (the actual value is 76.4). World: 71.

Source: World Health Organization Life Expectancy statistics

Making Sense of Data: Certainty, Uncertainty, and Estimates

As we saw in the discussion questions, not all data problems are the same. A key distinction in data science is whether we are working with complete or incomplete information.

Complete Information: Exact Answers

Some questions, like the ones about UC Berkeley, have precise, factual answers. This is because they are based on complete information—a full dataset for a specific time period. We can look up the data and know the answer without error.

  • Clarification for the Class Size Question: For this problem, we are looking at the average size of all undergraduate Berkeley classes. The data does not include independent studies, ungraded discussion/labs, or Summer term classes.

Even when the final answer isn’t a number (like a list of popular degrees), it’s still derived from sorting and analyzing this complete set of numerical data.

Incomplete Information: Estimated Answers

Other questions, like those about life expectancy (LEB), require estimates. We are working with incomplete information because we can’t know exactly how long every person will live. The answers are estimates, which means they contain some uncertainty.

Let’s take a closer look at this idea of uncertainty using the LEB example. A news headline might confidently state that Hawaii’s LEB (81.3 years) is higher than California’s (80.9 years). These single numbers are called “point estimates.”

However, the researchers behind the headline know there is uncertainty, so their study reports a range of plausible values (an interval) for each state:

  • California’s LEB: 79.9 to 81.9 years
  • Hawaii’s LEB: 80.6 to 81.9 years

Notice how much those ranges overlap. It’s possible that the true value for California is 81.5 and for Hawaii is 80.9. Because the ranges of plausible values overlap, we can’t be 100% certain which state is higher.
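The overlap between the two ranges can be checked directly. The sketch below is illustrative only: the helper `overlap` and the variable names are hypothetical, and the intervals are the ones listed above.

```python
# Ranges of plausible LEB values (low, high) in years, from the text above.
california = (79.9, 81.9)
hawaii = (80.6, 81.9)


def overlap(a, b):
    """Return the shared (low, high) range of two intervals, or None if they do not overlap."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low <= high else None


shared = overlap(california, hawaii)
print(shared)  # (80.6, 81.9)
```

The shared range runs from 80.6 to 81.9 years, which covers all of Hawaii's interval. Any value in that range is plausible for both states, so the point estimates alone cannot settle which state is higher.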

The key takeaway is this: The world is often fuzzy, and real data has uncertainty. A huge part of Data Science is learning how to quantify that uncertainty and interpret data carefully!