# Evaluating the equivalence of different formulations of the scaled Brier score

## Background

The Brier Score is a composite measure of discrimination and calibration for a prediction model. The Brier Score is defined as

\[ BS = \frac{1}{N} \sum (y_i - \hat y_i)^2, \]

where \(N\) is the number of observations, \(y_i\) is the observed outcome, either 0 or 1, and \(\hat y_i\) is the predicted probability for the \(i\)th observation. Let’s create an R function to calculate the Brier Score:

```
brier_score <- function(obs, pred) {
mean((obs - pred)^2)
}
```

The scaled Brier Score accounts for the event rate and provides an immediate comparison to an uninformative model that is equivalent to “just guess the event rate.” An intuitive way to define the scaled Brier Score (also called the “Brier skill score”) is

\[ BS_{scaled} = 1 - \frac{BS}{BS_{max}}, \]

where \(BS_{max} = \frac{1}{N} \sum (y_i - \bar y_i)^2\) and \(\bar y_i\) is the event rate among the observed outcome.

## My confusion

This formulation of the scaled Brier Score makes intuitive sense to me and is how I go about calculating it in practice. However, two other distinct formulations have been proposed for calculating \(BS_{max}\) that — at least to the limits of my algebraic skills – differ. Thus, here I proposed a numeric investigation of these different definitions to see if they are indeed equivalent.

## Definition 1

This is the intuitive definition to which I am accustomed, and is made explicit here: https://www.ncbi.nlm.nih.gov/pubmed/29713202

\[ BS_{scaled} = 1 - \frac{\frac{1}{N} \sum (y_i - \hat y_i)^2}{\frac{1}{N} \sum (y_i - \bar y_i)^2}. \]

Let’s create an R function to calculate this value.

```
scaled_brier_score_1 <- function(obs, pred) {
1 - (brier_score(obs, pred) / brier_score(obs, mean(obs)))
}
```

## Definition 2

A second formulation of the scaled Brier Score is defined with a slightly different definition of \(BS_{max}\), which is in this case described in https://www.ncbi.nlm.nih.gov/pubmed/20010215

\[ BS_{max} = \bar p \times (1 - \bar p). \]

Let’s create an R function to calculate this measure.

```
scaled_brier_score_2 <- function(obs, pred) {
1 - (brier_score(obs, pred) / (mean(obs) * (1 - mean(obs))))
}
```

## Definition 3

A third formulation of the scaled Brier Score is defined with a slightly different definition of \(BS_{max}\), which is in this case described in https://www.ncbi.nlm.nih.gov/pubmed/22961910

\[ BS_{max} = \bar p \times (1 - \bar p)^2 + (1 - \bar p) \times \bar p^2. \]

Let’s create an R function to calculate this measure.

```
scaled_brier_score_3 <- function(obs, pred) {
1 - (brier_score(obs, pred) / (mean(obs) * (1 - mean(obs))^2 + (1 - mean(obs)) * mean(obs)^2))
}
```

## Build a model

Let’s build a sample model based on the UCI abalone data (https://archive.ics.uci.edu/ml/datasets/Abalone).

```
# get data
df <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data')
names(df) <- c('sex', 'length', 'diameter', 'height', 'weight_whole',
'weight_shucked', 'weight_viscera', 'weight_shell', 'rings')
# inspect data
knitr::kable(head(df))
```

sex | length | diameter | height | weight_whole | weight_shucked | weight_viscera | weight_shell | rings |
---|---|---|---|---|---|---|---|---|

M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |

F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |

M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 |

I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |

I | 0.425 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.120 | 8 |

F | 0.530 | 0.415 | 0.150 | 0.7775 | 0.2370 | 0.1415 | 0.330 | 20 |

```
# Let's predict whether or not an abalone will have > 10 rings
m1 <- glm(I(rings > 10) ~ ., data = df, family = binomial)
preds_m1 <- predict(m1, type = 'response')
# And another model with severe class imablance
m2 <- glm(I(rings > 3) ~ ., data = df, family = binomial)
```

`## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred`

`preds_m2 <- predict(m2, type = 'response')`

## Score the model

```
# ---------- Model 1
# Calculate the Brier Score
brier_score(df$rings > 10, preds_m1)
```

`## [1] 0.1479862`

```
# Calculate each type of scaled Brier Score
scaled_brier_score_1(df$rings > 10, preds_m1)
```

`## [1] 0.3462507`

`scaled_brier_score_2(df$rings > 10, preds_m1)`

`## [1] 0.3462507`

`scaled_brier_score_3(df$rings > 10, preds_m1)`

`## [1] 0.3462507`

```
# ---------- Model 2
# Calculate the Brier Score
brier_score(df$rings > 3, preds_m2)
```

`## [1] 0.002690905`

```
# Calculate each type of scaled Brier Score
scaled_brier_score_1(df$rings > 3, preds_m2)
```

`## [1] 0.3362851`

`scaled_brier_score_2(df$rings > 3, preds_m2)`

`## [1] 0.3362851`

`scaled_brier_score_3(df$rings > 3, preds_m2)`

`## [1] 0.3362851`