19  Discussion 04: RegEx & Visualizations (From Summer 2025)

Slides

19.1 Regular Expressions

Regular Expressions (RegEx for short) are an immensely powerful tool for parsing strings. However, its many rules make RegEx very confusing, even for veteran users, so please don’t hesitate to ask questions! Here’s a snippet of the RegEx portion of the Fall 2023 Midterm Reference Sheet:

Getting Comfortable with RegEx

If you’re not familiar with RegEx (or if it’s been a while), the first question might feel tricky. Take some time to review the operators listed above, and check out the modified lecture example to see how they work in practice.

Remember — the reference sheet is your best friend during midterms and finals. You won’t have access to search engines, so make sure you’re comfortable using it now.

Code
import re

19.1.1 (a)

Which string contains a match for the following regular expression, "1+1$"? The character ␣ represents a single space.

Answer
print(re.findall("1+1$", "What is 1+1}"))
print(re.findall("1+1$", "Make a wish at 11:11"))
print(re.findall("1+1$", "111 Ways to Succeed"))
[]
['11']
[]
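Recall that 1+ matches one or more occurrences of the character 1 (the + is a quantifier, not a literal plus sign), and $ marks the end of the string. The pattern therefore needs at least two 1s at the very end of the string, so only the ending "11" of "Make a wish at 11:11" is matched.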

19.1.2 (b)

Write a regular expression that matches a string which contains only one word containing only lowercase letters and numbers (including the empty string).

Answer

^[a-z0-9]*$

print(re.findall("^[a-z0-9]*$", "1word"))
print(re.findall("^[a-z0-9]*$", "word"))
print(re.findall("^[a-z0-9]*$", "1Word"))
print(re.findall("^[a-z0-9]*$", ""))
print(re.findall("^[a-z0-9]*$", "1 word"))
['1word']
['word']
[]
['']
[]
Using ^ and $ in RegEx

The ^ and $ operators are important because they make sure your pattern matches the entire string. Without them, strings with extra words or spaces in-between might still match your pattern — even if they shouldn’t.

For example:
- Pattern cat would match "cat" and "black cat"
- Pattern ^cat$ would match only "cat"
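The effect of the anchors can be checked directly in Python. The snippet below is a minimal sketch: `has_match` is a hypothetical helper name introduced here for illustration, and `re.fullmatch` is shown as one alternative way to require that the whole string matches.

```python
import re

def has_match(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# Without anchors, the pattern may match in the middle of a longer string.
print(has_match(r"cat", "black cat"))    # True
# With ^ and $, the match must run from the start to the end of the string.
print(has_match(r"^cat$", "black cat"))  # False
print(has_match(r"^cat$", "cat"))        # True

# re.fullmatch only succeeds if the entire string matches, so no anchors are needed.
print(re.fullmatch(r"[a-z0-9]*", "1word") is not None)   # True
print(re.fullmatch(r"[a-z0-9]*", "1 word") is not None)  # False
```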


19.1.3 (c)

Given sometext = "I've got 10 eggs, 20 gooses, and 30 giants.", use re.findall to extract all the items and quantities from the string. The result should look like ['10 eggs', '20 gooses', '30 giants']. You may assume that a space separates quantity and type, and that each item ends in s.

Answer

re.findall(r"\d+\s\w+", sometext)

sometext = "I've got 10 eggs, 20 gooses, and 30 giants."
re.findall(r"\d+\s\w+", sometext)
['10 eggs', '20 gooses', '30 giants']
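As an optional extension (not part of the original question), the same pattern can be written with capture groups. When a pattern contains more than one group, `re.findall` returns a tuple per match, which makes it easy to separate the quantity from the item. The dictionary name `counts` below is only illustrative.

```python
import re

sometext = "I've got 10 eggs, 20 gooses, and 30 giants."

# Two capture groups: re.findall returns one (quantity, item) tuple per match.
pairs = re.findall(r"(\d+)\s(\w+)", sometext)
print(pairs)  # [('10', 'eggs'), ('20', 'gooses'), ('30', 'giants')]

# Convert each quantity from a string to an integer.
counts = {item: int(qty) for qty, item in pairs}
print(counts)  # {'eggs': 10, 'gooses': 20, 'giants': 30}
```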
Using \s vs. a Space in RegEx

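A plain space and \s are not quite the same: a literal space matches only the space character, while \s matches any whitespace character, including tabs and newlines. Both work for this string, but a plain space is easy to overlook when reading a pattern.
RegEx is already tricky to read, so using \s makes your pattern clearer and easier to understand.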
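To see where the two choices differ, here is a small sketch on a made-up string (not from the worksheet) in which the quantity and item are separated by a space, a tab, and a newline in turn.

```python
import re

# Hypothetical string: separators are a space, a tab, and a newline.
text = "10 eggs, 20\tgooses, 30\ngiants"

# A literal space only matches the space character.
with_space = re.findall(r"\d+ \w+", text)
print(with_space)  # ['10 eggs']

# \s matches any whitespace character, so all three pairs are found.
with_s = re.findall(r"\d+\s\w+", text)
print(with_s)  # ['10 eggs', '20\tgooses', '30\ngiants']
```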


19.1.4 (d)

For each pattern specify the starting and ending position of the first match in the string. The index starts at zero and we are using closed intervals (both endpoints are included).

Pattern      abcdefg   abcs!    ab␣abc   abc,␣123
abc*         [0, 2]
[^\s]+
ab.*c
[a-z1,9]+
Working Through RegEx — Greedy Operators

Work through each regex slowly. Write the pattern and the target string side-by-side and step through the pattern token by token so you can see which part of the regex captures which substring.

Pay attention to greedy operators (for example *, +, ?, .*) — they try to match as much as possible and can change how the string is split among pattern parts. If a match looks surprising, ask whether a greedy operator grabbed more than you intended.

Classroom strategy you can use or follow:
- Put the string and the pattern on the board.
- Mark which characters each token of the pattern matches.
- Try the non-greedy version (e.g. use *? or +?, like .*?) to see how the result changes.

Doing this step-by-step makes it much easier to predict and control regex behavior.
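As a quick illustration of the greedy versus non-greedy behavior described above, the sketch below uses a made-up string ("abcabc") that is not one of the worksheet strings.

```python
import re

text = "abcabc"  # hypothetical example string

# Greedy: .* grabs as much as possible, then backs off to the LAST "c".
greedy = re.search(r"a.*c", text).group()
# Non-greedy: .*? grabs as little as possible, stopping at the FIRST "c".
lazy = re.search(r"a.*?c", text).group()

print(greedy)  # abcabc
print(lazy)    # abc
```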

Answer
Pattern      abcdefg   abcs!    ab␣abc   abc,␣123
abc*         [0, 2]    [0, 2]   [0, 1]   [0, 2]
[^\s]+       [0, 6]    [0, 4]   [0, 1]   [0, 3]
ab.*c        [0, 2]    [0, 2]   [0, 5]   [0, 2]
[a-z1,9]+    [0, 6]    [0, 3]   [0, 1]   [0, 3]
# Raw strings (r"...") keep backslashes such as \s intact and avoid SyntaxWarnings.
for i in [r"abc*", r"[^\s]+", r"ab.*c", r"[a-z1,9]+"]:
    for j in ["abcdefg", "abcs!", "ab abc", "abc, 123"]:
        print(f"re.findall({i}, {j})[0] =", re.findall(i, j)[0])
re.findall(abc*, abcdefg)[0] = abc
re.findall(abc*, abcs!)[0] = abc
re.findall(abc*, ab abc)[0] = ab
re.findall(abc*, abc, 123)[0] = abc
re.findall([^\s]+, abcdefg)[0] = abcdefg
re.findall([^\s]+, abcs!)[0] = abcs!
re.findall([^\s]+, ab abc)[0] = ab
re.findall([^\s]+, abc, 123)[0] = abc,
re.findall(ab.*c, abcdefg)[0] = abc
re.findall(ab.*c, abcs!)[0] = abc
re.findall(ab.*c, ab abc)[0] = ab abc
re.findall(ab.*c, abc, 123)[0] = abc
re.findall([a-z1,9]+, abcdefg)[0] = abcdefg
re.findall([a-z1,9]+, abcs!)[0] = abcs
re.findall([a-z1,9]+, ab abc)[0] = ab
re.findall([a-z1,9]+, abc, 123)[0] = abc,
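The answer table asks for positions rather than matched text. One way to check those positions, sketched below, is to use `re.search` and the match object's `start()` and `end()` methods. The helper name `first_match_interval` is hypothetical; since `end()` is exclusive, it subtracts 1 to get the closed interval used in this question.

```python
import re

def first_match_interval(pattern, text):
    """Return [start, end] (closed interval) of the first match, or None."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # m.end() points one past the last matched character, so subtract 1.
    return [m.start(), m.end() - 1]

patterns = [r"abc*", r"[^\s]+", r"ab.*c", r"[a-z1,9]+"]
strings = ["abcdefg", "abcs!", "ab abc", "abc, 123"]

for p in patterns:
    print(p, [first_match_interval(p, s) for s in strings])
```

Each printed row should line up with the corresponding row of the answer table.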

19.2 Visualizations

Here’s a snippet of the Visualization portion of the Fall 2023 Midterm Reference Sheet:

Bigfoot is a mysterious ape-like creature that is said to live in North American forests. Most doubt its existence, but a passionate few swear that Bigfoot is real. In this discussion, you will be working with a dataset on Bigfoot sightings, visualizing variable distributions and combinations to understand better how/when/where Bigfoot is reportedly spotted and possibly either confirm or cast doubt on its existence. The Bigfoot data contains many variables about each reported Bigfoot spotting, including location information, weather, and moon phase.

This dataset is extremely messy, with observations missing many values across multiple columns. This is normally the case with data based on citizen reports (many do not fill out all required fields). For the purposes of this discussion, we will drop all observations with any missing values and some unneeded columns. However, note this is not a good practice, and you should almost never do this in real life!

Here are the first few entries of the bigfoot DataFrame:

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv'
bigfoot = pd.read_csv(url)
bigfoot.head()
observed location_details county state season title latitude longitude date number ... moon_phase precip_intensity precip_probability precip_type pressure summary uv_index visibility wind_bearing wind_speed
0 I was canoeing on the Sipsey river in Alabama.... NaN Winston County Alabama Summer NaN NaN NaN NaN 30680.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Ed L. was salmon fishing with a companion in P... East side of Prince William Sound Valdez-Chitina-Whittier County Alaska Fall NaN NaN NaN NaN 1261.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 While attending U.R.I in the Fall of 1974,I wo... Great swamp area, Narragansett Indians Washington County Rhode Island Fall Report 6496: Bicycling student has night encou... 41.45 -71.5 1974-09-20 6496.0 ... 0.16 0.0 0.0 NaN 1020.61 Foggy until afternoon. 4.0 2.75 198.0 6.92
3 Hello, My name is Doug and though I am very re... I would rather not have exact location (listin... York County Pennsylvania Summer NaN NaN NaN NaN 8000.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 It was May 1984. Two friends and I were up in ... Logging roads north west of Yamhill, OR, about... Yamhill County Oregon Spring NaN NaN NaN NaN 703.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 28 columns

Let’s first look at distributions of individual quantitative variables. Let’s say we’re interested in wind_speed.


19.2.1 (a)

Which of the following are appropriate visualizations for plotting the distribution of a quantitative variable? (Select all that apply)

Answer
wind_speed is a single quantitative variable. This rules out pie charts, as they visualize qualitative variables. It also rules out scatter plots and hex plots, as those require at least 2 quantitative variables. The remaining choices are all valid.

19.2.2 (b)

Write a line of code that produces the visualization that depicts the variable’s distribution (example shown below).

Answer
sns.histplot(data=bigfoot, x="wind_speed", kde=True);

The solution above sets the y-axis to "Count". Setting the y-axis to "Density" is also valid:

sns.histplot(data=bigfoot, x="wind_speed", kde=True, stat="density");

Both count and density help depict a variable’s distribution. For count, the height of the bin gives the number of data points that fall within a bin. For density, the area of the bin gives the proportion of data points that fall within it.
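The difference between count and density can be worked out by hand on a tiny example. The sketch below uses made-up values and bin edges (they are not taken from the Bigfoot data) and plain Python rather than seaborn, so the arithmetic is visible: a density bar's height is the bin's proportion divided by the bin's width, which makes its area equal to the proportion.

```python
# Hypothetical data and bin edges, chosen only for illustration.
values = [1.0, 1.5, 2.0, 2.5, 2.7, 3.5, 6.0, 7.5]
edges = [0, 2, 4, 8]
bins = list(zip(edges[:-1], edges[1:]))  # half-open bins [0, 2), [2, 4), [4, 8)

n = len(values)
# Count: number of points that fall in each bin.
counts = [sum(left <= v < right for v in values) for left, right in bins]
widths = [right - left for left, right in bins]

# Density: proportion of points in the bin divided by the bin width.
densities = [c / (n * w) for c, w in zip(counts, widths)]
# Area of each density bar = height * width = proportion of points in the bin.
areas = [d * w for d, w in zip(densities, widths)]

print(counts)      # [2, 4, 2]
print(densities)   # [0.125, 0.25, 0.0625]
print(areas)       # [0.25, 0.5, 0.25]
print(sum(areas))  # 1.0
```

Note that the last bin has the same count as the first but half the density height, because it is twice as wide.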

Why Add a Semicolon After Plotting Code?

You can (optionally) put a semicolon at the end of your plotting code to prevent extra output from being displayed.

For example:

plt.plot(x, y);

This is just a display preference — it doesn’t affect the plot itself.

You can read more about why this works here.


19.2.3 (c)

Now, let’s look at some qualitative variables. Write a line of code that produces a visualization that shows the distribution of Bigfoot sightings across the variable season (example shown below).

Answer
sns.countplot(data=bigfoot, x="season");

season_counts = bigfoot["season"].value_counts()
plt.bar(season_counts.index, season_counts.values);

Adding Color to Bars in a Plot

By default, plt.bar draws all bars in a single color, so the categories are not distinguished by color.

To give each bar its own color, pass a list of colors (one per bar) to the color parameter.

Example:

plt.bar(categories, values, color=["red", "blue", "green"])

19.2.4 (d)

Finally, produce a single visualization that showcases how the prevalence of Bigfoot sightings at particular combinations of moon_phase and wind_speed varies across seasons.

Hint: Think about color as the third information channel in the plot.

Answer
sns.scatterplot(data=bigfoot, x="moon_phase", y="wind_speed", hue="season", alpha=0.2);

19.3 Kernel Density Estimation (KDE)

Consider the following sample of the babynames DataFrame obtained by using babynames.sample(5).

Kernel Density Estimation is used to estimate a probability density function (or density curve) from a set of data. A kernel with a bandwidth parameter \(\alpha\) is placed on data observations \(x_i\) with \(i \in \{1, \ldots, n\}\), and the density estimation is calculated by averaging all kernels. Below, Gaussian and Boxcar kernel equations are listed:

  • Gaussian Kernel:
    \[ K_\alpha(x, x_i) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - x_i)^2}{2 \alpha^2}\right) \]

  • Boxcar Kernel:
    \[ B_\alpha(x, x_i) = \begin{cases} \frac{1}{\alpha} & \text{if } -\frac{\alpha}{2} \leq x - x_i \leq \frac{\alpha}{2} \\ 0 & \text{else} \end{cases} \]

The KDE is calculated as follows:
\[ f_\alpha(x) = \frac{1}{n}\sum_{i=1}^{n} K_\alpha(x, x_i) \]


19.3.1 (a)

Draw a KDE plot (by hand is fine) for data points [1, 4, 8, 9] using Gaussian Kernel and \(\alpha = 1\). On the plot show \(x\), \(x_i\), \(\alpha\), and the KDE.

Answer

With \(\alpha = 1\), we get a Gaussian Kernel of \(K_{1}(x, x_i) = \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{(x - x_i)^2}{2} \right)\).

This kernel is greatest when \(x = x_i\), giving us a maximum value of \[K_{1}(x_i, x_i) = \frac{1}{\sqrt{2 \pi}} \approx 0.3989 \approx 0.4\]

Each individual kernel is a Gaussian centered, respectively, at \(x_i = 1, 4, 8, 9\). Since we have 4 kernels, each with an area of 1, we normalize by dividing each kernel by 4. This gives each scaled kernel a maximum height of about \(0.1\). We then sum those kernels together to obtain the final KDE plot:

(Optional) For students who want to find the maximum point rigorously, we take the derivative of \(K_{1}(x, x_i)\) with respect to \(x_i\) and set it to 0: \[ \frac{\partial}{\partial x_i} K_{1}(x, x_i) = \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{(x - x_i)^2}{2} \right) \frac{2(x-x_i)}{2} = 0 \] Because the exponential factor is always positive, the derivative is zero only when \(x = x_i\).
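The hand-drawn KDE can be sanity-checked numerically. The sketch below implements the Gaussian kernel and the KDE formula from this section in plain Python; the function names `gaussian_kernel` and `kde` and the evaluation grid from -5 to 15 are choices made here for illustration.

```python
import math

def gaussian_kernel(x, x_i, alpha):
    """Gaussian kernel K_alpha(x, x_i) as defined in this section."""
    return math.exp(-((x - x_i) ** 2) / (2 * alpha ** 2)) / math.sqrt(2 * math.pi * alpha ** 2)

def kde(x, data, alpha):
    """Average of the kernels centered at each data point, evaluated at x."""
    return sum(gaussian_kernel(x, x_i, alpha) for x_i in data) / len(data)

data = [1, 4, 8, 9]

print(round(gaussian_kernel(1, 1, 1), 4))  # 0.3989, height of one unscaled kernel
print(round(kde(1, data, 1), 3))           # about 0.101, one scaled kernel plus small tails
print(round(kde(8.5, data, 1), 3))         # about 0.176, kernels at 8 and 9 overlap

# Riemann-sum check that the KDE integrates to (approximately) 1.
step = 0.01
area = sum(kde(-5 + k * step, data, 1) * step for k in range(2001))
print(round(area, 3))  # close to 1
```

The value near 8.5 is higher than the single-kernel height of about 0.1 because the kernels centered at 8 and 9 overlap there.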

19.3.2 (b)

We wish to compare the results of KDE using a Gaussian kernel and a boxcar kernel. For \(\alpha>0\), which of the following statements are true? Choose all that apply.

Answer
  1. True.
  2. False; if the \(\alpha\) values are not carefully selected for the Gaussian kernel, the boxcar kernel can provide a better kernel density estimate.
  3. False; if we set \(\alpha\) too high, we potentially risk including too many points in our estimate, resulting in a flatter curve.
  4. True.
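The flattening effect mentioned in point 3 can be illustrated with a short sketch. It reuses the data points [1, 4, 8, 9] from part (a); the bandwidth values 1, 4, and 10, the evaluation grid, and the helper name `peak_height` are assumptions chosen only for this illustration.

```python
import math

def gaussian_kernel(x, x_i, alpha):
    return math.exp(-((x - x_i) ** 2) / (2 * alpha ** 2)) / math.sqrt(2 * math.pi * alpha ** 2)

def boxcar_kernel(x, x_i, alpha):
    # Height 1/alpha inside a window of width alpha centered at x_i, else 0.
    return 1 / alpha if abs(x - x_i) <= alpha / 2 else 0.0

def kde(x, data, alpha, kernel):
    return sum(kernel(x, x_i, alpha) for x_i in data) / len(data)

data = [1, 4, 8, 9]
grid = [k / 100 for k in range(-500, 1501)]  # x from -5 to 15 in steps of 0.01

def peak_height(alpha, kernel):
    """Largest KDE value on the grid; a lower peak means a flatter curve."""
    return max(kde(x, data, alpha, kernel) for x in grid)

for alpha in (1, 4, 10):
    print(alpha,
          round(peak_height(alpha, gaussian_kernel), 3),
          round(peak_height(alpha, boxcar_kernel), 3))
```

For both kernels the peak height shrinks as \(\alpha\) grows, which is the flatter curve described above.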

19.4 Plotting Basics (Extra)

Name some appropriate 2D visualizations if your goal is to explore:


19.4.1 (a)

The distribution of population for various cities.

Answer A dotplot or bar plot.

19.4.2 (b)

The distribution of income.

Answer If the sample size is manageable, a rug plot (or stripplot) that displays all of the data. If the sample size is large, a density plot or a boxplot that visualizes numeric summaries of the data. A log transformation is most likely appropriate for these data.

19.4.3 (c)

The relationship between income and life expectancy.

Answer A scatterplot with the income axis on a log scale.

19.4.4 (d)

The relationship between income and city.

Answer Side-by-side boxplots, overlaid densities, or side-by-side violin plots. Overlaid histograms can be a good idea if there are 2-3 groups, but may look cluttered if there are more.

19.4.5 (e)

The relationship between income, life expectancy, smoking status (non-smoker, social smoker, smoker), and city.

Answer If you’re investigating income vs. life expectancy and controlling for smoking status and city, you can grid the smoking status and city combinations and make sub-scatterplots of income vs. life expectancy. Alternatively, you can use different colors for city and grid on smoking status, or use different colors for city and different plotting symbols (e.g., + or *) for smoking status.

19.5 Data Visualizations

The first part of the discussion will be centered on the above visualization.


19.5.1 (a)

Five variables are being represented visually in this graphic. What are they and what are their feature types (e.g., qualitative, quantitative, nominal, ordinal)?

Answer
  • Country - qualitative/nominal (categorical)
  • Healthcare spending per person in USD - quantitative/continuous
  • Average number of doctor visits per year - This number is quantitative. In general, this number is also continuous. However, in the graph shown, this number is grouped into four ordered bins, so it is ordinal (categorical) in this particular case.
  • Universal health coverage status - qualitative/nominal
  • Average life expectancy at birth - quantitative/continuous
Organizing Your Thoughts

Try to organize your thoughts clearly, for example by writing them on the board or on paper. This can help you see connections and solve problems more effectively.


19.5.2 (b)

How are the variables represented in the graphic, e.g., the variable XXX is mapped to the \(x\)-axis, the variable WWW is mapped to the \(y\)-axis, the variable ZZZ is conveyed through color, etc.?

Answer

This plot is called a parallel coordinate plot. The observations appear as connected line segments between variables. The variables are represented on parallel vertical axes.

  • “Country” appears as labels for the line segments (on the left y-axis)
  • “Healthcare spending per person in USD” is mapped to the y-axis on the left. It increases vertically.
  • “Average number of doctor visits per year” is presented as the thickness of the line connecting “healthcare spending per person in USD” to “average life expectancy at birth”. The variable has been discretized.
  • “Universal health coverage status” is presented through the color of the line segment.
  • “Average life expectancy at birth” is presented on the second y-axis (on the right). It increases vertically.


19.5.3 (c)

How can we figure out how to interpret the visual qualities of the plot, e.g., how do we know what a color represents?

Answer
  • The left and right axes provide scales for “Healthcare spending per person in USD” and “Average life expectancy at birth”, respectively.
  • Legends in the top left and top right regions of the graph provide information on “Average number of doctor visits per year” and “Universal health coverage status”, respectively.

19.5.4 (d)

What purpose does the comment at the top right of the plot serve?

Answer It provides information on the source and temporality of the data. This data was collected between 2007 and 2009. How much would you expect this visualization to change if we considered similar data from 2017 to 2019?

19.5.5 (e)

Make 3 observations about the figure. Describe the feature that you are basing your observation on.

For example, South Korea’s expenditure on health care is comparable to Eastern European countries (and among the lowest of all countries plotted), but the life expectancy is much higher than the Eastern European countries. In the plot we see that the left endpoint of South Korea’s line segment is near the Eastern European countries, but the slope of the line segment is much steeper.

Answer There are many observations that can be made. Here are a few examples:
  • “Healthcare expenditure per person” in the US is higher than in any other nation, but the average life expectancy at birth is below average. This is seen by the left endpoint being far above the others, with a huge gap between the US and the next highest country (Switzerland), and by the steep decline of the line segment to a below-average life expectancy.
  • The line segment representing the US is very thin, which represents very few (0 to 4) doctor’s visits a year.
  • Japan has a very thick line segment, which indicates a high average number of doctor’s visits. It is interesting to note that Japan has the highest life expectancy, yet its healthcare expenditures are in the middle of the pack AND it has a high average number of doctor’s visits.
  • Countries with below average “healthcare expenditure per person” but with above average “average number of doctor visits per year” tend to have higher “average life expectancy at birth” than countries with similar levels of “healthcare expenditure per person” but whose citizens visit their doctors less often.
  • All countries listed, with the exception of the US, have universal health coverage. NB: Mexico achieved universal health coverage in 2012. This information is presented by the “universal health coverage status” variable.

19.5.6 (f)

Consider the steep negative slope and narrowness of the line segment that represents the data for the United States. What systemic, social, or societal issues might explain this?

Answer Access to health care. For example, if those who are uninsured were plotted separately, how might the two line segments differ? What if we created separate lines by race? How might they look? Why might it be misleading to draw conclusions about health based on race (for example, are there any confounding factors that could be problematic)? If you’re curious, you can read more about the uninsured population in the US here: Kaiser Family Foundation Report.