1 Data 8 Discussion 01: By the Numbers
Slides
1.0.1 Contact Information
Name | Wesley Zheng |
Pronouns | He/him/his |
Email | wzheng0302@berkeley.edu |
Discussion | Wednesdays, 12–2 PM @ Etcheverry 3105 |
Office Hours | Tuesdays/Thursdays, 2–3 PM @ Warren Hall |
Feel free to contact me by email. I typically respond within a day or so!
1.0.2 Resources and Announcements
Regular Lab Credit
- 80% credit for attendance
- 20% credit for passing all public test cases (all or nothing)
Self-Service Lab Credit
- Credit depends on % of test cases passed (e.g., 80% of test cases passed = 80% credit)
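The two grading schemes above can be compared with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the attendance and test-case portions of regular lab credit simply add together.

```python
def regular_lab_credit(attended, passed_all_public_tests):
    """Percent credit for a regular lab (assumes the two parts simply add)."""
    attendance_part = 80 if attended else 0
    # The test-case part is all or nothing.
    tests_part = 20 if passed_all_public_tests else 0
    return attendance_part + tests_part


def self_service_lab_credit(tests_passed, total_tests):
    """Percent credit for a self-service lab: proportional to tests passed."""
    return 100 * tests_passed / total_tests


print(regular_lab_credit(True, False))  # attended, but missed a public test
print(self_service_lab_credit(8, 10))   # passed 8 of 10 test cases
```

Under these assumptions, a student who attends but misses one public test case keeps the attendance portion in a regular lab, while self-service credit depends only on the fraction of test cases passed.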
Lab Format Switching
- Allowed until September 2nd and again post-midterm
- Switching into regular lab is subject to capacity limitations
- If you choose self-service at any point, grading will be based only on test cases, regardless of whether you were/are in a regular lab
No Technology Use
- No technology allowed during the first hour of lab section
- Reference sheets will be provided
Lab Drops
- You have 2 lab drops
- These are for extenuating circumstances only and are not the norm
- No additional drops will be given
- On coding assignments and exams, you may not use material not taught in Data C8 (conceptual or coding).
- Any use of external material on homeworks, projects, or labs will result in an automatic 0 on the assignment.
- Any exam answer that uses external material will not be graded.
Examples (not allowed):
- matplotlib on homeworks
- list comprehensions on exams
Exceptions (allowed):
- Syntax shown in the textbook (e.g., [], range())
- Only when used in the context shown in the given chapter
DSP
- All DSP deadlines will be automatically extended for ALL assignments (both bonus point and regular deadlines).
- DSP students will NOT need to fill out an extension form.
“Additional Accommodations” Form
- Intended only for students with injuries or other non-DSP situations.
- Unless the circumstances are extenuating, requests must be submitted more than 24 hours before the deadline.
- Does NOT apply to bonus point deadlines.
- Students must specify the number of extension days they are requesting; the extended deadline cannot fall after the date when solutions are released.
HW Drops
- You have 2 homework drops, but these are for extenuating circumstances only.
- Submissions after the deadline will be accepted for 24 hours with a 20% penalty.
- Submissions more than 24 hours late will not be accepted.
Office Hours (OH)
- Students wishing to ask questions virtually during in-person OH times will be directed to Ed.
Important Websites
- data8.datahub.berkeley.edu — HWs, Labs, Projects
- pensieve.co — Submitting Assignments
- edstem.org — Ask Questions
- data8.org/fa25 — Course Website
Optional but Recommended Websites
- inferentialthinking.com — Course Textbook
- wkaiz.github.io — Slide Decks
- wkaiz.github.io/discussions — Discussion Notes
For help with submitting labs on Pensieve, watch this short video:
Reminders
- Make sure you have access to Pensieve and Ed
- Check Ed (and email) frequently, as this is where we will post important course updates
- Read the course policies on the website
Deadlines
- Lab 1 is due Friday (8/29) at 5 PM
- HW 1 is due Wednesday (9/3) at 10 AM
- Submissions after this time will be accepted for 24 hours with a 20% penalty
- Submissions more than 24 hours late will not be accepted
- Submit by Tuesday (9/2) at 10 AM to earn 5 bonus points
Welcome to your first Data 8 lab! Data Science is about making sense of real numbers in the world, and this worksheet will help introduce you to the type of thinking you will learn in this class.
Answer the following questions to the best of your ability. Complete the worksheet with your table group. Be ready to discuss your reasoning behind each of your answers!
1.1 Class Sizes
Which of the following numbers is the closest to the average size of all undergraduate Berkeley classes in the 2024-2025 academic year?
30 40 50 60 70 80 90
Answer
50. The actual average was 53.6. It was 53.1 in the 2023-2024 academic year, 53.0 in the 2022-2023 academic year, and 52.1 in the 2021-2022 academic year.
Note: The classes used in this data set do not include independent study classes, ungraded discussion/lab secondary sections, or classes taught in the Summer term. Source: UC Berkeley Class Size statistics
1.2 Popular Degrees
In the 2023-2024 academic year, close to half of all undergraduate degrees at Berkeley were awarded to students in the 10 largest major programs. The top five (in decreasing order of number of degrees awarded) are CDSS Computer Science, Data Science, Economics, MCB, and EECS. What are the next five?
Answer
Business Administration (425), Political Science (403), Psychology (359), Sociology (273), and Media Studies (258). The numbers of degrees awarded in the five largest programs are as follows: CDSS Computer Science (891), Data Science (846), Economics (678), MCB (637), and EECS (510). Source: UC Berkeley Demographic statistics
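As a rough illustration of how a ranked list like this comes from counting and sorting, the sketch below stores the counts quoted in the answer in a plain Python dictionary and sorts them from largest to smallest. The variable names are chosen only for this example, and the entries are deliberately listed out of order so that the sort has work to do.

```python
# Counts copied from the answer above, deliberately out of order.
degrees = {
    "Psychology": 359,
    "Economics": 678,
    "Media Studies": 258,
    "CDSS Computer Science": 891,
    "Sociology": 273,
    "EECS": 510,
    "Business Administration": 425,
    "MCB": 637,
    "Political Science": 403,
    "Data Science": 846,
}

# Sort (program, count) pairs by count, largest first.
ranked = sorted(degrees.items(), key=lambda pair: pair[1], reverse=True)

top_five = []
next_five = []
top_ten_total = 0
for position, (program, count) in enumerate(ranked[:10]):
    top_ten_total += count
    if position < 5:
        top_five.append(program)
    else:
        next_five.append(program)

print(top_five)
print(next_five)
print(top_ten_total)  # total across the ten largest programs
```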
1.3 Healthy Living
“Life expectancy at birth” (LEB) measures roughly how many years a newborn baby is expected to live, and is a commonly used metric to quantify the health of a population. What two states would you expect to have the highest LEBs among all states in the US?
Answer
Hawaii (81.3) and California (80.9), but there is some uncertainty in these estimates. Connecticut and Minnesota are close behind at 80.8. Source: United States Life Expectancy statistics table
1.4 Worldwide
Picking from the list of numbers below, fill in the numbers that you expect to be the closest to the 2022 LEB of the United States as well as the world.
69 71 73 75 77 79
United States: ________ World: ________
Answer
United States: 76.4, World: 71. Source: World Health Organization Life Expectancy statistics
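One way to check a guess against the reported values is to compute which option in the list sits closest to each one. The helper below, `closest_choice`, is a hypothetical name used only for this sketch.

```python
choices = [69, 71, 73, 75, 77, 79]


def closest_choice(value, options):
    """Return the option with the smallest absolute distance to value."""
    return min(options, key=lambda option: abs(option - value))


us_choice = closest_choice(76.4, choices)   # reported United States LEB
world_choice = closest_choice(71, choices)  # reported world LEB

print("United States:", us_choice)
print("World:", world_choice)
```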
As we saw in the discussion questions, not all data problems are the same. A key distinction in data science is whether we are working with complete or incomplete information.
Complete Information: Exact Answers
Some questions, like the ones about UC Berkeley, have precise, factual answers. This is because they are based on complete information—a full dataset for a specific time period. We can look up the data and know the answer without error.
- Clarification for the Class Size Question: For this problem, we are looking at the average size of all undergraduate Berkeley classes. The data does not include independent study classes, ungraded discussion/lab secondary sections, or Summer term classes.
Even when the final answer isn’t a number (like a list of popular degrees), it’s still derived from sorting and analyzing this complete set of numerical data.
Incomplete Information: Estimated Answers
Other questions, like those about life expectancy (LEB), require estimates. We are working with incomplete information because we can’t know exactly how long every person will live. The answers are estimates, which means they contain some uncertainty.
Let’s take a closer look at this idea of uncertainty using the LEB example. A news headline might confidently state that Hawaii’s LEB (81.3 years) is higher than California’s (80.9 years). These single numbers are called “point estimates.”
However, the scientific study behind the headline acknowledges this uncertainty and reports a range of plausible values (an interval) for each state:
- California’s LEB: 79.9 to 81.9 years
- Hawaii’s LEB: 80.6 to 81.9 years
Notice how much those ranges overlap. It’s possible that the true value for California is 81.5 and for Hawaii is 80.9, in which case California’s LEB would actually be the higher of the two. Because the ranges of plausible values overlap, we can’t be 100% certain which state is higher.
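The overlap can also be checked directly. The sketch below treats each interval as a (low, high) pair and uses two hypothetical helpers, `intervals_overlap` and `shared_range`. Two intervals overlap when the larger of the two lower ends does not exceed the smaller of the two upper ends.

```python
# Intervals of plausible LEB values quoted above, as (low, high) pairs.
california = (79.9, 81.9)
hawaii = (80.6, 81.9)


def intervals_overlap(a, b):
    """True when intervals a and b share at least one value."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def shared_range(a, b):
    """The (low, high) range common to both intervals; assumes they overlap."""
    return (max(a[0], b[0]), min(a[1], b[1]))


if intervals_overlap(california, hawaii):
    low, high = shared_range(california, hawaii)
    print("Both states could plausibly have an LEB between", low, "and", high)
else:
    print("The intervals do not overlap")
```

With the intervals quoted above, the shared range runs from Hawaii’s lower end to the common upper end, which is why the point estimates alone cannot settle which state ranks higher.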
The key takeaway is this: The world is often fuzzy, and real data has uncertainty. A huge part of Data Science is learning how to quantify that uncertainty and interpret data carefully!