Interpretation of AUC

One of the most widely used metric to evaluate binary classifiers is the AUC (“Area Under the Curve”) where “the Curve” refers to the Receiver Operating Characteristic (ROC) curve. It is well-known to be a measure between 0 and 1, the higher value meaning better performance. The perfect classifier correctly labeling all cases has a value 1 while a completely random classifier entails a score of 0.5.

But what is the meaning of a score of 0.8 or 0.95? How much better is the latter than the former? A nice interpretation of the score is the following: if we take a random positive and a random negative case, AUC shows the probability that the classifier assigns a higher score to the positive case than to the negative.

This interpretation is simple and intuitive – but why is it true? Instead of presenting theoretical arguments I provide here simulation evidence: I implement the interpretation and compare it to a built-in calculation of AUC using R.

library(tidyverse)
library(ROCR)
library(titanic)
set.seed(42)

I use the “train” part of the titanic data set and separate it to a train and a test sample.

# data preparation
data <- titanic_train %>% as_tibble()
data <- data %>% select(Survived, Name, Sex, Fare)

# check that name is a unique identifier
n_unique_names <- data %>% select(Name) %>% n_distinct()
stopifnot(nrow(data) == n_unique_names)

data_train <- data %>% sample_frac(0.5)
data_test <- anti_join(data, data_train, by = "Name")

I use a simple logistic regression to model survival of Titanic passengers and calculate AUC using the ROCR package.

# train a binary classifier
model <- glm(Survived ~ Sex + Fare, 
             data = data_train, 
             family = binomial(link = "logit"))

# predict on test set
predicted <- predict.glm(model, newdata = data_test, type = "response")
data_test <- data_test %>% mutate(predicted = predicted)

# calculate AUC
prediction_obj <- prediction(predicted, data_test %>% select(Survived))
auc <- performance(prediction_obj, measure = "auc")@y.values[[1]]

Then let’s calculate a measure based on the interpretation: I randomly draw positive and negative cases many times and calculate the share of draws where the positive has a higher score than the negative predicted by the model.

# calculate the interpretation

data_test_negative <- data_test %>% filter(Survived == 0)
data_test_positive <- data_test %>% filter(Survived == 1)

n_repetitions <- 100000

sample_negatives <- sample(1:nrow(data_test_negative), n_repetitions, replace = TRUE)
sample_positives <- sample(1:nrow(data_test_positive), n_repetitions, replace = TRUE)

scores_negatives <- data_test_negative[sample_negatives, "predicted"]
scores_positives <- data_test_positive[sample_positives, "predicted"]

simulated_auc <- sum(scores_positives > scores_negatives) / n_repetitions

Finally, let’s see the two AUCs:

print(paste0("AUC: ", round(auc, 4)))

## [1] "AUC: 0.8298"

print(paste0("AUC from interpretation: ", round(simulated_auc, 4)))

## [1] "AUC from interpretation: 0.8285"

Indeed, the two are very close. We can conclude that the simple logistic model predicting survival based on the gender of the passenger and how much (s)he paid for the ticket performs decently: with 83% probability it assigns a higher score to a randomly chosen survivor than to a randomly chosen non-survivor.