27  Discussion 12: ROC Curves & Performance Metrics (From Summer 2025)


27.1 ROC Curves

Consider the following ROC (receiver operating characteristic) curves that were each created from different models.


27.1.1 (a)

Compare the Area Under the Curve (AUC) of Line 1 and Line 4. What kind of model would predict Line 1? What kind of model would predict Line 4?

Decision Threshold and the ROC Curve

Think about how changing the decision threshold affects the position on the ROC curve:

  • Increasing the threshold from 0 to 1 makes the model stricter about predicting the positive class.
  • At a threshold near 0, almost everything is predicted as positive → high true positive rate (TPR) and high false positive rate (FPR) → top-right of the ROC curve.
  • At a threshold near 1, almost nothing is predicted as positive → low TPR and low FPR → bottom-left of the ROC curve.

Intuition: Moving the threshold from 0 to 1 traces the ROC curve from its top-right corner toward its bottom-left corner.
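To make this intuition concrete, here is a minimal sketch (in Python, with made-up labels and predicted probabilities rather than data from the figure) that sweeps a threshold and computes the resulting TPR/FPR pair at each setting:

```python
import numpy as np

# Hypothetical true labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.10, 0.30, 0.35, 0.60, 0.80, 0.40, 0.55, 0.70, 0.90, 0.95])

for threshold in [0.0, 0.25, 0.5, 0.75, 1.0]:
    y_pred = (scores >= threshold).astype(int)   # stricter as the threshold grows
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    print(f"threshold={threshold:.2f}  TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")
```

At threshold 0 everything is predicted positive, so TPR = FPR = 1 (top-right corner); at threshold 1 nothing is, so TPR = FPR = 0 (bottom-left corner), exactly as described above.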

Answer

Line 1 (solid black line) is known as a “perfect predictor”; it always predicts the correct class for \(y\), so its true positive rate (TPR) is 1, and its false positive rate (FPR) is 0. Because we want our classifier to be as close as possible to the perfect predictor, we aim to maximize the AUC.

On the other hand, Line 4 (solid grey line) is a random predictor that predicts \(y=1\) with probability 0.5 and \(y=0\) with probability 0.5. Its AUC is 0.5, so it does no better than a random coin flip.
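These two extremes can also be checked numerically. The sketch below (using scikit-learn, with synthetic labels and scores invented for illustration) compares a scorer that ranks every positive above every negative against one that assigns scores at random:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = np.array([0] * 500 + [1] * 500)

# "Perfect" scorer: every positive example gets a higher score than every negative one.
perfect_scores = np.concatenate([rng.uniform(0.0, 0.5, 500), rng.uniform(0.5, 1.0, 500)])

# Random scorer: scores carry no information about the true labels.
random_scores = rng.uniform(0.0, 1.0, 1000)

print(roc_auc_score(y_true, perfect_scores))  # 1.0
print(roc_auc_score(y_true, random_scores))   # roughly 0.5 (varies with the seed)
```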

27.1.2 (b)

Suppose we fix the decision threshold for all 4 models such that we get an FPR of 0.1 when we evaluate our models. Rank the models from most to least preferred given their ROC curves.

Answer Since the FPR is now fixed at 0.1, the only thing that we can adjust is the TPR. To determine the most to least preferred model, we look at each model’s TPR at an FPR of 0.1. Higher TPRs are better because they indicate that a greater proportion of true positive cases are correctly identified by the model. Hence, we get the following ranking: \[\text{Line 1} > \text{Line 3} >\text{Line 2} >\text{Line 4} \]
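If the four ROC curves were available as arrays, this comparison could be automated by interpolating each model's TPR at FPR = 0.1 and sorting. A sketch, assuming a dictionary `roc_curves` that maps model names to `(fpr, tpr)` arrays such as those returned by `sklearn.metrics.roc_curve` (the curve data itself is not given here):

```python
import numpy as np

def tpr_at_fpr(fpr, tpr, target_fpr=0.1):
    """Interpolate a model's TPR at a fixed FPR from its ROC curve arrays."""
    return np.interp(target_fpr, fpr, tpr)   # fpr from roc_curve is already sorted

def rank_models(roc_curves, target_fpr=0.1):
    """Return model names ordered from most to least preferred at the given FPR."""
    scores = {name: tpr_at_fpr(fpr, tpr, target_fpr)
              for name, (fpr, tpr) in roc_curves.items()}
    return sorted(scores, key=scores.get, reverse=True)
```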

27.1.3 (c) (Extra)

(Bonus) Calculate the AUC of Line 3!

Answer The area under this curve can be found by summing the areas of rectangles: \(0.8 \cdot 0.6 + 1 \cdot 0.4 = 0.88\).
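The same rectangle sum can be written out explicitly. The widths and heights below simply encode the arithmetic from the answer (TPR 0.8 over the FPR interval \([0, 0.6]\), then TPR 1 over \([0.6, 1]\)):

```python
# Rectangles under Line 3's step-shaped ROC curve.
widths  = [0.6, 0.4]   # FPR intervals [0, 0.6] and [0.6, 1]
heights = [0.8, 1.0]   # TPR over each interval

auc = sum(w * h for w, h in zip(widths, heights))
print(auc)  # 0.88
```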

27.2 Performance Metrics

For the metric definitions used below, see the “Classification Performance” section of the Spring 2025 Final Reference Sheet.

27.2.1 (a)

Suppose we train a binary classifier on the following dataset where \(y\) is the set of true labels, and \(\hat{y}\) is the set of predicted labels:

\(y\): 0, 0, 0, 0, 0, 1, 1, 1, 1, 1
\(\hat{y}\): 0, 1, 1, 1, 1, 1, 1, 0, 0, 0

Fill out the confusion matrix for the given data.

Hint: The first row contains the true negatives and false positives, and the second row contains false negatives and true positives (in that order).

Answer
\[\begin{bmatrix} \text{TN} = 1 & \text{FP} = 4 \\ \text{FN} = 3 & \text{TP} = 2 \end{bmatrix}\]
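The same matrix can be reproduced with scikit-learn as a quick check (this is a verification, not part of the original solution):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# Rows are true classes (0, then 1); columns are predicted classes (0, then 1),
# so the layout is [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
# [[1 4]
#  [3 2]]
```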

27.2.2 (b)

Calculate the precision of our classifier. Write your answer as a simplified fraction.

Answer \(\frac{2}{2+4}=\frac{1}{3}\)

27.2.3 (c)

Calculate the recall of our classifier. Write your answer as a simplified fraction.

Answer \(\frac{2}{2+3}=\frac{2}{5}\)
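Both fractions can be checked directly from the confusion-matrix counts, or with scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# Precision = TP / (TP + FP), Recall = TP / (TP + FN)
print(precision_score(y_true, y_pred))  # 2/6 ≈ 0.333
print(recall_score(y_true, y_pred))     # 2/5 = 0.4
```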

27.2.4 (d)

(Discussion) It is revealed that this dataset describes the results of an algorithm used to predict whether someone is at risk of developing a severe disease with expensive treatment. You are tasked with improving the classifier. Which metrics should you aim to optimize for (Accuracy/Precision/Recall)? Explain your reasoning. (Things to consider: cost of treatment, the severity of disease)

Tips for Leading Discussions

When you’re leading a discussion, your goal is to help everyone share ideas and learn from each other. Here are some tips:

  1. Think for yourself first – try not to rely too much on the leader’s hints; give your own thoughts a chance.
  2. Share your perspective – all ideas are welcome, even if they aren’t “perfect.” Thank others when they share.
  3. Be patient – building a participatory discussion takes time. Don’t worry if everyone isn’t speaking up right away.

Key idea: Good discussions happen when everyone feels comfortable contributing.

Answer

There is no single correct answer, but here are some example justifications.

Accuracy: Since this dataset is fairly balanced (5 positive and 5 negative examples), accuracy is not an unreasonable metric; here it is \(\frac{1+2}{10} = \frac{3}{10}\). The main flaw of using accuracy as the metric is that it ignores the true class of each data point when evaluating performance, so it says nothing about which kinds of errors the classifier makes.

Precision: Optimizing for precision means that we care more about making sure the positives we output are truly positive. In this setting, we want to ensure that the people we predict to have the disease truly have the disease. If the treatment is very expensive, we don’t want to place a financial burden on people who may not actually have the disease.

Recall: Optimizing for recall means we care more about detecting all the true positives from the dataset. In this setting, we want to make sure that almost everyone who has the disease knows they have it. If the disease is particularly deadly, then we would be aiming to save the most lives.
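To see the precision/recall trade-off play out, the sketch below sweeps a decision threshold over hypothetical risk scores (the scores are invented for illustration and are not part of the dataset above). Lower thresholds flag more people, buying recall at the cost of precision, which is exactly the tension this discussion question is getting at:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical risk scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.05, 0.20, 0.45, 0.50, 0.70, 0.30, 0.55, 0.60, 0.80, 0.90])

for threshold in [0.25, 0.5, 0.75]:
    y_pred = (scores >= threshold).astype(int)
    print(f"threshold={threshold:.2f}  "
          f"precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")
```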