import re
Regular Expressions (RegEx for short) are an immensely powerful tool for parsing strings. However, its many rules make RegEx very confusing, even for veteran users, so please don’t hesitate to ask questions! Here’s a snippet of the RegEx portion of the Fall 2023 Midterm Reference Sheet:
If you’re not familiar with RegEx (or if it’s been a while), the first question might feel tricky. Take some time to review the operators listed above, and check out the modified lecture example to see how they work in practice.
Remember — the reference sheet is your best friend during midterms and finals. You won’t have access to search engines, so make sure you’re comfortable using it now.
Which string contains a match for the following regular expression, "1+1$"? The character ␣ represents a single space.
print(re.findall("1+1$", "What is 1+1}"))
print(re.findall("1+1$", "Make a wish at 11:11"))
print(re.findall("1+1$", "111 Ways to Succeed"))
[]
['11']
[]
1+ matches one or more occurrences of the character 1, the 1 that follows it matches one more 1, and $ marks the end of the string. So the ending "11" is matched.
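As a quick check of how `+` behaves, the sketch below contrasts the quantifier with an escaped, literal plus sign. The string "What is 1+1" is made up for illustration and is not one of the answer choices.

```python
import re

# Unescaped: "1+" is a quantifier (one or more 1s), then a literal 1, then end of string.
quantifier_hits = re.findall(r"1+1$", "Make a wish at 11:11")

# Escaped: "\+" is a literal plus sign, so this looks for the text "1+1" at the end.
literal_hits = re.findall(r"1\+1$", "What is 1+1")

# The unescaped pattern finds nothing here, because the "+" in the string is not a 1.
no_hits = re.findall(r"1+1$", "What is 1+1")

print(quantifier_hits)  # ['11']
print(literal_hits)     # ['1+1']
print(no_hits)          # []
```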
Write a regular expression that matches a string which contains only one word containing only lowercase letters and numbers (including the empty string).
^[a-z0-9]*$
print(re.findall("^[a-z0-9]*$", "1word"))
print(re.findall("^[a-z0-9]*$", "word"))
print(re.findall("^[a-z0-9]*$", "1Word"))
print(re.findall("^[a-z0-9]*$", ""))
print(re.findall("^[a-z0-9]*$", "1 word"))
['1word']
['word']
[]
['']
[]
^ and $ in RegEx
The ^ and $ operators are important because they make sure your pattern matches the entire string. Without them, strings with extra words or spaces in between might still match your pattern, even if they shouldn’t.
For example:
- Pattern cat would match "cat" and "black cat"
- Pattern ^cat$ would match only "cat"
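The difference can be checked directly. This is a small sketch; the string "cat food" is an extra example that does not appear above.

```python
import re

strings = ["cat", "black cat", "cat food"]

# Without anchors, the pattern may match anywhere inside the string.
unanchored = [s for s in strings if re.search(r"cat", s)]

# With ^ and $, the whole string has to be exactly "cat".
anchored = [s for s in strings if re.search(r"^cat$", s)]

# re.fullmatch imposes a similar whole-string requirement without writing the anchors.
full = [s for s in strings if re.fullmatch(r"cat", s)]

print(unanchored)  # ['cat', 'black cat', 'cat food']
print(anchored)    # ['cat']
print(full)        # ['cat']
```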
Given sometext = "I've got 10 eggs, 20 gooses, and 30 giants.", use re.findall to extract all the items and quantities from the string. The result should look like ['10 eggs', '20 gooses', '30 giants']. You may assume that a space separates quantity and type, and that each item ends in s.
re.findall(r"\d+\s\w+", sometext)
sometext = "I've got 10 eggs, 20 gooses, and 30 giants."
re.findall(r"\d+\s\w+", sometext)
['10 eggs', '20 gooses', '30 giants']
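If the quantity and the item are needed separately, one possible extension (not part of the question) is to add capture groups. With groups in the pattern, `re.findall` returns one tuple per match; the `inventory` dictionary below is a hypothetical name used only for illustration.

```python
import re

sometext = "I've got 10 eggs, 20 gooses, and 30 giants."

# With capture groups, re.findall returns one tuple per match instead of one string.
pairs = re.findall(r"(\d+)\s(\w+)", sometext)

# Map each item to its quantity, converting the digits to an int.
inventory = {item: int(qty) for qty, item in pairs}

print(pairs)      # [('10', 'eggs'), ('20', 'gooses'), ('30', 'giants')]
print(inventory)  # {'eggs': 10, 'gooses': 20, 'giants': 30}
```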
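\s vs. a Space in RegEx
Even though using \s and just typing a space may both work here, they are not identical: \s matches any whitespace character (including tabs and newlines), while a plain space matches only a space and is easy to overlook when reading a pattern.
RegEx is already tricky to read, so using \s makes your pattern clearer and easier to understand.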
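A short sketch of where the two diverge, using a made-up tab-separated string:

```python
import re

spaced = "10 eggs"    # quantity and item separated by a single space
tabbed = "10\teggs"   # same content, separated by a tab (hypothetical input)

# A literal space in the pattern only matches a space character.
space_hits = [re.findall(r"\d+ \w+", t) for t in (spaced, tabbed)]

# \s matches any whitespace character, so the tab-separated string matches too.
class_hits = [re.findall(r"\d+\s\w+", t) for t in (spaced, tabbed)]

print(space_hits)  # [['10 eggs'], []]
print(class_hits)  # [['10 eggs'], ['10\teggs']]
```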
For each pattern specify the starting and ending position of the first match in the string. The index starts at zero and we are using closed intervals (both endpoints are included).
Pattern | abcdefg | abcs! | ab␣abc | abc,␣123 |
---|---|---|---|---|
abc* | [0, 2] | | | |
[^\s]+ | | | | |
ab.*c | | | | |
[a-z1,9]+ | | | | |
Work through each regex slowly. Write the pattern and the target string side-by-side and step through the pattern token by token so you can see which part of the regex captures which substring.
Pay attention to greedy operators (for example *, +, ?, .*): they try to match as much as possible and can change how the string is split among pattern parts. If a match looks surprising, ask whether a greedy operator grabbed more than you intended.
Classroom strategy you can use or follow:
- Put the string and the pattern on the board.
- Mark which characters each token of the pattern matches.
- Try the non-greedy version (e.g. use *? or +?, like .*?) to see how the result changes.
Doing this step-by-step makes it much easier to predict and control regex behavior.
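To see the greedy and non-greedy versions side by side, here is a minimal sketch. The string "ab abc abc" is made up for illustration and is not one of the strings in the exercise.

```python
import re

text = "ab abc abc"

# Greedy: .* grabs as much as it can, then backs up to the LAST "c".
greedy = re.search(r"ab.*c", text)

# Non-greedy: .*? grabs as little as it can, stopping at the FIRST "c".
lazy = re.search(r"ab.*?c", text)

print(greedy.group(), greedy.span())  # ab abc abc (0, 10)
print(lazy.group(), lazy.span())      # ab abc (0, 6)
```

Note that `span()` reports a half-open interval: the end index is one past the last matched character.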
Pattern | abcdefg | abcs! | ab␣abc | abc,␣123 |
---|---|---|---|---|
abc* | [0, 2] | [0, 2] | [0, 1] | [0, 2] |
[^\s]+ | [0, 6] | [0, 4] | [0, 1] | [0, 3] |
ab.*c | [0, 2] | [0, 2] | [0, 5] | [0, 2] |
[a-z1,9]+ | [0, 6] | [0, 3] | [0, 1] | [0, 3] |
# Raw strings keep \s from being read as a Python string escape.
for i in [r"abc*", r"[^\s]+", r"ab.*c", r"[a-z1,9]+"]:
    for j in ["abcdefg", "abcs!", "ab abc", "abc, 123"]:
        print(f"re.findall({i}, {j})[0] =", re.findall(i, j)[0])
re.findall(abc*, abcdefg)[0] = abc
re.findall(abc*, abcs!)[0] = abc
re.findall(abc*, ab abc)[0] = ab
re.findall(abc*, abc, 123)[0] = abc
re.findall([^\s]+, abcdefg)[0] = abcdefg
re.findall([^\s]+, abcs!)[0] = abcs!
re.findall([^\s]+, ab abc)[0] = ab
re.findall([^\s]+, abc, 123)[0] = abc,
re.findall(ab.*c, abcdefg)[0] = abc
re.findall(ab.*c, abcs!)[0] = abc
re.findall(ab.*c, ab abc)[0] = ab abc
re.findall(ab.*c, abc, 123)[0] = abc
re.findall([a-z1,9]+, abcdefg)[0] = abcdefg
re.findall([a-z1,9]+, abcs!)[0] = abcs
re.findall([a-z1,9]+, ab abc)[0] = ab
re.findall([a-z1,9]+, abc, 123)[0] = abc,
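The positions themselves can also be computed rather than read off the matched text. The helper below is a sketch; `closed_interval` is a hypothetical name, and it relies on `re.search` returning the first match with a half-open `(start, end)` span.

```python
import re

def closed_interval(pattern, string):
    """Return [start, end] of the first match with both endpoints included, or None."""
    m = re.search(pattern, string)
    if m is None:
        return None
    # m.end() is exclusive, so subtract 1 to get a closed interval.
    return [m.start(), m.end() - 1]

patterns = [r"abc*", r"[^\s]+", r"ab.*c", r"[a-z1,9]+"]
strings = ["abcdefg", "abcs!", "ab abc", "abc, 123"]

# One row of intervals per pattern, in the same order as the table columns.
table = {p: [closed_interval(p, s) for s in strings] for p in patterns}

for p, row in table.items():
    print(p, row)
```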
Here’s a snippet of the Visualization portion of the Fall 2023 Midterm Reference Sheet:
Bigfoot is a mysterious ape-like creature that is said to live in North American forests. Most doubt its existence, but a passionate few swear that Bigfoot is real. In this discussion, you will be working with a dataset on Bigfoot sightings, visualizing variable distributions and combinations to understand better how/when/where Bigfoot is reportedly spotted and possibly either confirm or cast doubt on its existence. The Bigfoot data contains many variables about each reported Bigfoot spotting, including location information, weather, and moon phase.
This dataset is extremely messy, with observations missing many values across multiple columns. This is normally the case with data based on citizen reports (many do not fill out all required fields). For the purposes of this discussion, we will drop all observations with any missing values and some unneeded columns. However, note this is not a good practice, and you should almost never do this in real life!
Here are the first few entries of the bigfoot DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv'
bigfoot = pd.read_csv(url)
bigfoot.head()
observed | location_details | county | state | season | title | latitude | longitude | date | number | ... | moon_phase | precip_intensity | precip_probability | precip_type | pressure | summary | uv_index | visibility | wind_bearing | wind_speed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | I was canoeing on the Sipsey river in Alabama.... | NaN | Winston County | Alabama | Summer | NaN | NaN | NaN | NaN | 30680.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Ed L. was salmon fishing with a companion in P... | East side of Prince William Sound | Valdez-Chitina-Whittier County | Alaska | Fall | NaN | NaN | NaN | NaN | 1261.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | While attending U.R.I in the Fall of 1974,I wo... | Great swamp area, Narragansett Indians | Washington County | Rhode Island | Fall | Report 6496: Bicycling student has night encou... | 41.45 | -71.5 | 1974-09-20 | 6496.0 | ... | 0.16 | 0.0 | 0.0 | NaN | 1020.61 | Foggy until afternoon. | 4.0 | 2.75 | 198.0 | 6.92 |
3 | Hello, My name is Doug and though I am very re... | I would rather not have exact location (listin... | York County | Pennsylvania | Summer | NaN | NaN | NaN | NaN | 8000.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | It was May 1984. Two friends and I were up in ... | Logging roads north west of Yamhill, OR, about... | Yamhill County | Oregon | Spring | NaN | NaN | NaN | NaN | 703.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
Let’s first look at distributions of individual quantitative variables. Let’s say we’re interested in wind_speed.
Which of the following are appropriate visualizations for plotting the distribution of a quantitative variable? (Select all that apply)
wind_speed is a single quantitative variable. This rules out pie charts, as they visualize qualitative variables. It also rules out scatter plots and hex plots, as those require at least 2 quantitative variables. The remaining choices are all valid.
Write a line of code that produces the visualization that depicts the variable’s distribution (example shown below).
sns.histplot(data=bigfoot, x="wind_speed", kde=True);
The above solution sets the y-axis as "Count". Setting the y-axis to "Density" is also valid:
sns.histplot(data=bigfoot, x="wind_speed", kde=True, stat="density");
Both count and density help depict a variable’s distribution. For count, the height of the bin gives the number of data points that fall within a bin. For density, the area of the bin gives the proportion of data points that fall within it.
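The relationship between the two scales can be worked out by hand. The sketch below uses a handful of made-up wind speeds (not taken from the Bigfoot data) and equal-width bins, and computes both versions without any plotting library.

```python
# Hypothetical wind speeds, made up for illustration (not from the Bigfoot data).
speeds = [1.0, 2.5, 3.0, 3.5, 4.0, 6.5, 7.0, 9.5]
edges = [0, 2, 4, 6, 8, 10]   # five bins of width 2: [0, 2), [2, 4), ..., [8, 10)
width = 2
n = len(speeds)

# Count scale: bar height = number of points in the bin.
counts = [sum(lo <= s < hi for s in speeds) for lo, hi in zip(edges[:-1], edges[1:])]

# Density scale: bar height = count / (n * bin width), so bar AREA = proportion in the bin.
densities = [c / (n * width) for c in counts]
areas = [d * width for d in densities]

print(counts)      # [1, 3, 1, 2, 1]
print(densities)   # [0.0625, 0.1875, 0.0625, 0.125, 0.0625]
print(sum(areas))  # 1.0
```

The shape of the histogram is the same on both scales; only the labeling of the y-axis changes.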
You can (optionally) put a semicolon at the end of your plotting code to prevent extra output from being displayed.
For example:
plt.plot(x, y);
This is just a display preference — it doesn’t affect the plot itself.
You can read more about why this works here.
Now, let’s look at some qualitative variables. Write a line of code that produces a visualization that shows the distribution of Bigfoot sightings across the variable season (example shown below).
sns.countplot(data=bigfoot, x="season");
season_counts = bigfoot["season"].value_counts()
plt.bar(season_counts.index, season_counts.values);
By default, the output of these lines of code will not show colored bars.
To add color, you need to specify the color for each bar by passing a list of colors to the color parameter.
Example:
plt.bar(categories, values, color=["red", "blue", "green"])
Finally, produce a single visualization that showcases how the prevalence of Bigfoot sightings at particular combinations of moon_phase and wind_speed varies across each season.
Hint: Think about color as the third information channel in the plot.
sns.scatterplot(data=bigfoot, x="moon_phase", y="wind_speed", hue="season", alpha=0.2);
Consider the following sample of the babynames DataFrame obtained by using babynames.sample(5).
Kernel Density Estimation is used to estimate a probability density function (or density curve) from a set of data. A kernel with a bandwidth parameter \(\alpha\) is placed on data observations \(x_i\) with \(i \in \{1, \ldots, n\}\), and the density estimation is calculated by averaging all kernels. Below, Gaussian and Boxcar kernel equations are listed:
Gaussian Kernel:
\[
K_\alpha(x, x_i) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - x_i)^2}{2 \alpha^2}\right)
\]
Boxcar Kernel:
\[
B_\alpha(x, x_i) =
\begin{cases}
\frac{1}{\alpha} & \text{if } -\frac{\alpha}{2} \leq x - x_i \leq \frac{\alpha}{2} \\
0 & \text{else}
\end{cases}
\]
The KDE is calculated as follows:
\[
f_\alpha(x) = \frac{1}{n}\sum_{i=1}^{n} K_\alpha(x, x_i)
\]
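The three formulas above translate almost line for line into code. The following is a minimal sketch using only the standard library; the function names and the two-point data set `[0, 2]` are hypothetical and chosen only for illustration.

```python
import math

def gaussian_kernel(x, xi, alpha):
    """Gaussian kernel K_alpha(x, x_i) as defined above."""
    return math.exp(-((x - xi) ** 2) / (2 * alpha ** 2)) / math.sqrt(2 * math.pi * alpha ** 2)

def boxcar_kernel(x, xi, alpha):
    """Boxcar kernel B_alpha(x, x_i): height 1/alpha within alpha/2 of x_i, else 0."""
    return 1 / alpha if -alpha / 2 <= x - xi <= alpha / 2 else 0.0

def kde(x, data, alpha, kernel):
    """Average of the kernels placed on each observation, evaluated at x."""
    return sum(kernel(x, xi, alpha) for xi in data) / len(data)

data = [0, 2]  # hypothetical observations
gauss_mid = kde(1, data, 1, gaussian_kernel)   # both Gaussians contribute at x = 1
box_mid = kde(1, data, 1, boxcar_kernel)       # neither box reaches x = 1
box_near = kde(0.25, data, 1, boxcar_kernel)   # only the box around 0 covers x = 0.25

# Riemann-sum check that the Gaussian KDE integrates to roughly 1.
step = 0.01
area = sum(kde(-10 + step * k, data, 1, gaussian_kernel) for k in range(2201)) * step

print(round(gauss_mid, 4), box_mid, box_near, round(area, 3))
```

The boxcar estimate is flat inside each box and drops to zero outside it, while the Gaussian estimate is nonzero everywhere.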
Draw a KDE plot (by hand is fine) for data points [1, 4, 8, 9] using Gaussian Kernel and \(\alpha = 1\). On the plot show \(x\), \(x_i\), \(\alpha\), and the KDE.
With \(\alpha = 1\), we get a Gaussian Kernel of \(K_{1}(x, x_i) = \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{(x - x_i)^2}{2} \right)\).
This kernel is greatest when \(x = x_i\), giving us a maximum value of \[K_{1}(x_i, x_i) = \frac{1}{\sqrt{2 \pi}} \approx 0.3989 \approx 0.4\]
Each individual kernel is a Gaussian centered, respectively, at \(x_i = 1, 4, 8, 9\). Since we have 4 kernels, each with an area of 1, we normalize by dividing each kernel by 4. This gives each scaled kernel a maximum height of about \(0.1\). We then sum those kernels together to obtain the final KDE plot:
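As a numerical check on a hand-drawn sketch, the KDE can be evaluated at a couple of points using the formula for \(f_\alpha(x)\) with \(\alpha = 1\). The values below are rounded to four decimal places, and terms smaller than \(10^{-4}\) are written as \(0\). At the isolated point \(x = 1\), only the kernel centered there contributes noticeably:

\[
f_1(1) = \frac{1}{4}\left[K_1(1, 1) + K_1(1, 4) + K_1(1, 8) + K_1(1, 9)\right] \approx \frac{1}{4}\left[0.3989 + 0.0044 + 0 + 0\right] \approx 0.101
\]

At \(x = 8\), the neighboring kernel centered at 9 also contributes, so the curve sits higher:

\[
f_1(8) = \frac{1}{4}\left[K_1(8, 1) + K_1(8, 4) + K_1(8, 8) + K_1(8, 9)\right] \approx \frac{1}{4}\left[0 + 0.0001 + 0.3989 + 0.2420\right] \approx 0.160
\]

So the KDE should show bumps of height about \(0.1\) around 1 and 4, and a taller, wider bump around 8 and 9 where the two kernels overlap.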
We wish to compare the results of KDE using a Gaussian kernel and a boxcar kernel. For \(\alpha>0\), which of the following statements is true? Choose all that apply.
Name some appropriate 2D visualizations if your goal is to explore:
The distribution of population for various cities.
The distribution of income.
The relationship between income and life expectancy.
The relationship between income and city.
The relationship between income, life expectancy, smoking status (non-smoker, social smoker, smoker), and city.
The first part of the discussion will be centered on the above visualization.
Five variables are being represented visually in this graphic. What are they and what are their feature types (i.e., qualitative, quantitative, nominal, ordinal)?
Try to organize your thoughts clearly, for example by writing them on the board or on paper. This can help you see connections and solve problems more effectively.
How are the variables represented in the graphic, e.g., the variable XXX is mapped to the \(x\)-axis, the variable WWW is mapped to the \(y\)-axis, the variable ZZZ is conveyed through color, etc.?
This plot is called a parallel coordinate plot. The observations appear as connected line segments between variables. The variables are represented on parallel vertical axes.
How can we figure out how to interpret the visual qualities of the plot, e.g., how do we know what a color represents?
What purpose does the comment at the top right of the plot serve?
Make 3 observations about the figure. Describe the feature that you are basing your observation on.
For example, South Korea’s expenditure on health care is comparable to Eastern European countries (and among the lowest of all countries plotted), but the life expectancy is much higher than the Eastern European countries. In the plot we see that the left endpoint of South Korea’s line segment is near the Eastern European countries, but the slope of the line segment is much steeper.
Consider the steep negative slope and narrowness of the line segment that represents the data for the United States. What systemic, social, or societal issues might explain this?