Prediction scores

Finn Lindgren

Corrections and code updates 2025-04-09

library(magrittr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(scoringRules)
library(inlabru)

Proper posterior prediction scores

A prediction score \(S(F,y)\) evaluates some measure of closeness between a prediction distribution identified by \(F\), and an observed value \(y\).

A basic score, and motivating remarks

A common score is the Squared Error, \[ S_\text{SE}(F,y) = [y - \mathbb{E}_F(Y)]^2, \] and we would like predictions to have low values for \(S_\text{SE}(F,y)\), indicating a “good” prediction, in the specific sense of penalising the squared deviation of the prediction mean from the true observed value. We can imagine constructing other such scoring functions that penalise other aspects of the prediction.

A score where “lower is better” is called negatively oriented, and a score where “higher is better” is called positively oriented. One can always turn one type of score into the other by changing the sign, so to simplify the presentation, we’ll make all scores be negatively oriented, like the squared error.

We often care about the prediction uncertainty and not just the mean (or at least we should care!). Just adding a prediction variance penalty to the squared error wouldn’t be useful, as we could then construct a new, “better”, prediction by reducing the stated prediction variance to zero. This would understate the real prediction uncertainty, so wouldn’t be a fair scoring approach for comparing different prediction models. In the next section, we make this fairness idea more precise.

Proper and strictly proper scores

The expected value of the score under a data distribution identified by \(G\) is denoted \(S(F,G):=\mathbb{E}_{Y\sim G}[S(F,Y)]\). For a negatively oriented score, we seek scoring functions that are fair, in the sense that one cannot, on average, make a better prediction than that which generated the data. This requires \(S(F,G)\geq S(G,G)\) for all predictive distributions \(F\) and any distribution \(G\). Such scores are called proper. If, in addition, equality of the score expectations holds only when \(F=G\), the score is strictly proper.
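
As a small, hypothetical illustration (not part of the original analysis), the properness inequality for the negated log-score can be checked by Monte Carlo, with data generated from \(G\) and a deliberately mis-specified prediction \(F\):

# Hypothetical properness check for the negated log-score:
# data-generating distribution G = N(0, 1), mis-specified prediction F = N(0.5, 2^2)
set.seed(1)
y_sim <- rnorm(1e5, mean = 0, sd = 1)
S_FG <- mean(-dnorm(y_sim, mean = 0.5, sd = 2, log = TRUE)) # estimate of S(F, G)
S_GG <- mean(-dnorm(y_sim, mean = 0, sd = 1, log = TRUE)) # estimate of S(G, G)
S_FG >= S_GG # should be TRUE, up to Monte Carlo error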

Non-strict proper scores ignore some aspect of the prediction, typically by only being sensitive to some summary information, such as the mean, median, and/or variance.

It’s notable that proper scores retain their properness under affine transformations, with just potential changes in whether they are positively or negatively oriented. If \(S(F,y)\) is a proper score, then \[ S'(F,y) = a + b S(F,y),\quad a,b\in\mathbb{R}, \] is also a proper score, with the same orientation if \(b>0\) and the opposite orientation if \(b<0\). The degenerate case \(b=0\) gives the score \(a\) to all predictions, which is technically a proper score (you cannot do better than an ideal prediction), but a useless one (an ideal prediction is no better than any other prediction).

Examples

Other scores include the Interval score, which is minimised for short prediction intervals with the intended coverage probability, and the Quantile score, which generalises the Absolute Median Error to quantiles other than the median.
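
The following small helper functions are a sketch of these two scores, following their standard definitions; they are not part of the original code, and the names interval_score() and quantile_score() are only illustrative.

# Hypothetical helpers (not from the original vignette); both are negatively oriented.
interval_score <- function(lower, upper, y, alpha) {
  # Score for a central (1 - alpha) * 100% prediction interval [lower, upper]
  (upper - lower) +
    (2 / alpha) * (lower - y) * (y < lower) +
    (2 / alpha) * (y - upper) * (y > upper)
}
quantile_score <- function(q, y, tau) {
  # Pinball loss, scaled so that tau = 0.5 gives the absolute error |y - q|
  2 * (y - q) * (tau - (y < q))
}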

Improper scores

We’ve seen that some scores are strictly proper, and others are only proper scores, sensitive to specific aspects of the predictive distribution, such as mean, median, and/or variance.

In contrast, improper scores do not fulfil the fairness idea. Such scores include the aforementioned penalised squared error, \([y-\mathbb{E}_F(Y)]^2+\mathbb{V}_F(Y)\), but also the probability/density function itself, \(p_F(y)\). The latter might come as a surprise, as the log-score is proper.
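
As a small, hypothetical Monte Carlo illustration of the improperness of the penalised squared error: with data generated from \(G=N(0,1)\) and two predictions that both state the correct mean, the “overconfident” prediction with zero stated variance receives a lower (apparently better) average score than the ideal prediction \(F=G\):

set.seed(1)
y_sim <- rnorm(1e5) # data from G = N(0, 1)
# Ideal prediction F = G: correct mean 0 and correct variance 1
mean((y_sim - 0)^2 + 1)
# Overconfident prediction with the correct mean but stated variance 0
# receives a lower average penalised SE, so the score rewards understated uncertainty:
mean((y_sim - 0)^2 + 0)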

Mean error/score

Up to this point, we only considered individual scores. When summarising predictions \(\{F_i\}\) for a collection of observations \(\{y_i\}\), we usually compute the mean score, \[ S(\{F_i\},\{y_i\}) = \frac{1}{N}\sum_{i=1}^N S(F_i,y_i) . \]

When comparing two different prediction models \(F\) and \(F'\), the scores for the two models are evaluated on the same observations \(y_i\) and are therefore dependent. This means that in order to more easily handle the score variability in the comparison, we should treat it as a paired sample problem. The pairwise score differences are given by \[ S(F_i,F'_i,y_i) = S(F_i,y_i) - S(F'_i,y_i) . \] It’s also much more reasonable to make conditional independence assumptions about these differences than about the plain score values \(S(F_i,y_i)\); \[ \mathbb{V}_{\{y_i\}\sim G}\left[\frac{1}{N}\sum_{i=1}^N S(F_i,y_i)\right] = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\mathbb{C}_{\{y_i\}\sim G}\left[S(F_i,y_i),S(F_j,y_j)\right] \] but \[ \mathbb{V}_{\{y_i\}\sim G}\left[\frac{1}{N}\sum_{i=1}^N \left\{S(F_i,y_i) - S(F'_i,y_i)\right\}\right] \approx \frac{1}{N^2}\sum_{i=1}^N\mathbb{V}_{y_i\sim G_i}\left[ S(F_i,y_i) - S(F'_i,y_i)\right]. \]
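
A minimal sketch of this paired comparison, using hypothetical simulated per-observation scores for two competing models (in practice these would come from score computations like the ones later in this document):

set.seed(1)
scores_F <- rexp(100) # hypothetical per-observation scores for model F
scores_Fprime <- scores_F + rnorm(100, mean = 0.1, sd = 0.2) # and for model F'
score_diff <- scores_F - scores_Fprime
mean_diff <- mean(score_diff)
# Standard error based on the approximately independent paired differences:
se_diff <- sd(score_diff) / sqrt(length(score_diff))
# Approximate 95% interval for the expected score difference:
mean_diff + c(-2, 2) * se_diff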

Note that taking the average of prediction scores, or averages of prediction score differences, is quite different from assessing summary statistics of the collection of predictions, since the scores are individual for each observation; we’re not assessing the collective value distribution, as that might be misleading. For example, consider a spatial model where the estimated process has an empirical distribution of the predictive means that matches that of the observed data. Scores based on the marginal empirical distribution would not be able to detect if the spatial placement of the predicted values is maximally different from that of the actual observations, whereas averages of individual scores would be sensitive to this.

Poisson model example

Consider a model with Poisson outcomes \(y\), conditionally on a log-linear predictor \(\lambda=\exp(\eta)\), where \(\eta\) is some linear expression in latent variables.

The posterior predictive distributions are Poisson mixture distributions across the posterior distribution of \(\lambda\), \(p(\lambda|\text{data})\).

For illustration purposes, consider a simple Poisson model with a single covariate \(x\) and a linear predictor \(\eta = \beta_0 + \beta_1 x\).

df_pois <- tibble::tibble(
  x = rnorm(50),
  y = rpois(length(x), exp(2 + 1 * x))
)
fit_pois <- bru(
  y ~ Intercept(1) + x,
  family = "poisson",
  data = df_pois
)
newdata_pois <- tibble::tibble(
  x = rnorm(100),
  y = rpois(100, exp(2 + 1 * x))
)

Moment scores

For the Squared Error and Dawid-Sebastiani scores, we’ll need the posterior expectation and variance: \[ \mathbb{E}(Y|\text{data}) = \mathbb{E}[\mathbb{E}(Y|\lambda,\text{data}) | \text{data}] = \mathbb{E}(\lambda | \text{data}) \] and \[ \mathbb{V}(Y|\text{data}) = \mathbb{E}[\mathbb{V}(Y|\lambda,\text{data}) | \text{data}] + \mathbb{V}[\mathbb{E}(Y | \lambda, \text{data}) | \text{data}] = \mathbb{E}[\lambda | \text{data}] + \mathbb{V}[\lambda | \text{data}] \] i.e. the sum of the posterior mean and variance for \(\lambda\).

The SE and DS scores are therefore relatively easy to compute after estimating a model with inlabru. You just need to estimate the posterior mean and variance with predict() for each test data point. If Intercept+x is the expression for the linear predictor, and newdata holds the covariate information for the prediction points, run

pred_pois <- predict(
  fit_pois,
  newdata_pois,
  formula = ~ exp(Intercept + x),
  n.samples = 2000
)
post_E <- pred_pois$mean
post_Var <- pred_pois$mean + pred_pois$sd^2
scores <- data.frame(
  SE = (newdata_pois$y - post_E)^2,
  DS = (newdata_pois$y - post_E)^2 / post_Var + log(post_Var)
)
summary(scores)
#>        SE                  DS          
#>  Min.   :   0.0012   Min.   :-0.02481  
#>  1st Qu.:   0.6179   1st Qu.: 1.72725  
#>  Median :   3.3618   Median : 2.66350  
#>  Mean   :  58.3902   Mean   : 3.06775  
#>  3rd Qu.:  12.9955   3rd Qu.: 4.18486  
#>  Max.   :4507.3164   Max.   :11.23107

Log-Probability and log-density scores

The full log-score can actually also be estimated/computed in a similar way. We seek, for a fixed observation \(y\), \(\log[\mathbb{P}(Y = y | \text{data})]\). The probability is \[ \mathbb{P}(Y = y | \text{data}) = \mathbb{E}[\mathbb{P}(Y = y | \lambda, \text{data}) | \text{data}], \] so we can estimate it using predict():

pred_pois <- predict(fit_pois,
  newdata_pois,
  formula = ~ dpois(y, lambda = exp(Intercept + x)),
  n.samples = 2000
)
log_score <- -log(pred_pois$mean)
summary(log_score)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.6627  1.7901  2.2675  2.4337  2.9694  6.7507

to estimate the log-score (here negated, so that lower is better; increase n.samples if needed for sufficiently small Monte Carlo error). Approximate Monte Carlo confidence intervals for the scores can be obtained via the Monte Carlo standard errors:

head(
  data.frame(
    log_S = log_score,
    lower = -log(pred_pois$mean + 2 * pred_pois$mean.mc_std_err),
    upper = -log(pred_pois$mean - 2 * pred_pois$mean.mc_std_err)
  )
)
#>      log_S    lower    upper
#> 1 2.336190 2.328507 2.343931
#> 2 2.237725 2.233808 2.241658
#> 3 3.219634 3.211652 3.227680
#> 4 2.894551 2.886698 2.902465
#> 5 3.188912 3.179749 3.198159
#> 6 1.463654 1.460032 1.467289

CRPS

Yet another option would be to use the CRPS, which for each prediction value \(y_i\) would be \[ S_\text{CRPS}(F_i,y_i) = \sum_{k=0}^\infty [\mathbb{P}(Y_i \leq k |\text{data}) - I(y_i \leq k)]^2 . \] For this, one would first need to get \(\mathbb{P}(Y \leq k | \text{data})\) from a predict call with ppois(k, lambda = exp(eta)), for a vector \(k=0,1,\dots,K\), for each \(y_i\), for some sufficiently large \(K>\max_i{y_i}\) for the remainder to be negligible. However, to avoid repeated predict() calls for each \(y_i\), the storage requirement is of order \((K+1) \times N\times N_\text{samples}\). To avoid that, one option would be to reformulate the estimator into a recursive estimator, so that batches of simulations could be used to iteratively compute the estimator; a sketch of such a batched variant is given after the basic estimator below.

A basic estimator can proceed as follows:

Define \(K\geq K_0=\max_i(y_i)\) sufficiently large for the posterior predictive probability above \(K\) to be negligible. Perhaps a value like \(K=K_0+4\sqrt{K_0}\) might be sufficient. You can check afterwards, and change if needed.

  1. Simulate samples from \(\lambda^{(j)}\sim p(\lambda|\text{data})\) using generate() (size \(N\times N_\text{samples}\)).
  2. For each \(i=1,\dots,N\), use the samples to estimate the residuals \(r_{ik}=\mathbb{P}(Y\leq k|\text{data})-I(y_i\leq k)\), for \(k\in 0,1,2,\dots,K\), with

\[ \widehat{r}_{ik} = \frac{1}{N_\text{samples}} \sum_{j=1}^{N_\text{samples}} \{ \mathbb{P}(Y\leq k|\lambda^{(j)}_i) - I(y_i\leq k) \} . \]

  3. Compute

\[ \widehat{S}_\text{CRPS}(F_i,y_i) = \sum_{k=0}^{K} \widehat{r}_{ik}^2 \]

# some large value, so that 1-F(K) is small
max_K <- ceiling(max(newdata_pois$y) + 4 * sqrt(max(newdata_pois$y)))
k <- seq(0, max_K)
kk <- rep(k, times = length(newdata_pois$y))
i <- seq_along(newdata_pois$y)
pred_pois <- generate(fit_pois, newdata_pois,
  formula = ~ {
    lambda <- exp(Intercept + x)
    ppois(kk, lambda = rep(lambda, each = length(k)))
  },
  n.samples = 2000
)
results <- data.frame(
  i = rep(i, each = length(k)),
  k = kk,
  Fpred = rowMeans(pred_pois),
  residuals =
    rowMeans(pred_pois) - (rep(newdata_pois$y, each = length(k)) <= kk)
)
# Check that the cutoff point K has nearly probability mass 1 below it,
# for all i:
min(results %>% dplyr::filter(k == max_K) %>% pull(Fpred))
#> [1] 0.4263052

crps_scores <-
  (results %>%
    group_by(i) %>%
    summarise(crps = sum(residuals^2), .groups = "drop") %>%
    pull(crps))
summary(crps_scores)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.2134  0.6590  1.2502  2.1658  2.4041 43.4962
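
As mentioned above, the \((K+1)\times N\times N_\text{samples}\) storage can be avoided by accumulating the estimator over batches of samples. A minimal sketch of that idea, reusing kk, k, max_K, and i from the code above (the batch settings n_batches and batch_size are arbitrary illustration values):

n_batches <- 20
batch_size <- 100
F_sum <- 0
for (b in seq_len(n_batches)) {
  samples_b <- generate(fit_pois, newdata_pois,
    formula = ~ {
      lambda <- exp(Intercept + x)
      ppois(kk, lambda = rep(lambda, each = length(k)))
    },
    n.samples = batch_size
  )
  # Accumulate running sums instead of keeping all samples in memory:
  F_sum <- F_sum + rowSums(samples_b)
}
Fpred_batched <- F_sum / (n_batches * batch_size)
residuals_batched <-
  Fpred_batched - (rep(newdata_pois$y, each = length(k)) <= kk)
crps_batched <- tapply(residuals_batched^2, rep(i, each = length(k)), sum)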

Normal model example

Consider a model with Normal outcomes \(y\sim N(\eta,\sigma^2_y)\), conditionally on a linear predictor \(\eta\), where \(\eta\) is some linear expression in latent variables.

The posterior predictive distributions are Normal mixture distributions across the joint posterior distribution of \(\eta\) and \(\sigma^2_y\), \(p(\eta,\sigma^2_y|\text{data})\).

For illustration purposes, consider a simple Normal model with a single covariate \(x\) and a linear predictor \(\eta = \beta_0 + \beta_1 x\).

true_sigma_y <- 0.1
df_normal <- tibble::tibble(
  x = rnorm(50),
  y = rnorm(length(x), 2 + 1 * x, sd = true_sigma_y)
)
fit_normal <- bru(
  y ~ Intercept(1) + x,
  family = "normal",
  data = df_normal
)
newdata_normal <- tibble::tibble(
  x = rnorm(100),
  y = rnorm(100, 2 + 1 * x, sd = true_sigma_y)
)

Moment scores and log-score

For the Squared Error and Dawid-Sebastiani scores, we’ll need the posterior expectation and variance: \[ \mathbb{E}(Y|\text{data}) = \mathbb{E}[\mathbb{E}(Y|\eta, \sigma_y^2,\text{data}) | \text{data}] = \mathbb{E}(\eta | \text{data}) \] and \[ \mathbb{V}(Y|\text{data}) = \mathbb{E}[\mathbb{V}(Y|\eta,\sigma_y^2,\text{data}) | \text{data}] + \mathbb{V}[\mathbb{E}(Y | \eta,\sigma_y^2,\text{data}) | \text{data}] = \mathbb{E}[\sigma_y^2 | \text{data}] + \mathbb{V}[\eta | \text{data}] \] i.e. the sum of the posterior mean of \(\sigma_y^2\) and the variance of \(\eta\).

The SE and DS scores are therefore relatively easy to compute after estimating a model with inlabru, as well as the log-score. You just need to estimate the posterior mean and variance with predict() for each test data point. If Intercept+x is the expression for the linear predictor, and newdata holds the covariate information for the prediction points, run

pred_normal <- predict(fit_normal, newdata_normal,
  formula = ~ {
    E <- Intercept + x
    V <- 1 / Precision_for_the_Gaussian_observations
    list(
      eta = E,
      sigma_y_2 = V,
      dens = dnorm(y, mean = E, sd = sqrt(V))
    )
  },
  n.samples = 2000
)
post_E <- pred_normal$eta$mean
post_Var <- pred_normal$sigma_y_2$mean + pred_normal$eta$sd^2
scores_normal <- data.frame(
  SE = (newdata_normal$y - post_E)^2,
  DS = (newdata_normal$y - post_E)^2 / post_Var + log(post_Var),
  LS = -log(pred_normal$dens$mean)
)
summary(scores_normal)
#>        SE                  DS               LS         
#>  Min.   :2.100e-08   Min.   :-4.978   Min.   :-1.5852  
#>  1st Qu.:1.008e-03   1st Qu.:-4.826   1st Qu.:-1.5050  
#>  Median :5.919e-03   Median :-4.116   Median :-1.1313  
#>  Mean   :1.089e-02   Mean   :-3.408   Mean   :-0.7898  
#>  3rd Qu.:1.417e-02   3rd Qu.:-2.970   3rd Qu.:-0.5415  
#>  Max.   :1.103e-01   Max.   :10.828   Max.   : 5.7342

Here, we’ve again used the negated log-score, so that small values are “better” for all three scores. Since the DS score is based on the log-score for Normal predictions, the conditional DS and LS scores differ only by a constant offset and scaling. However, since the full posterior predictions are Normal mixtures, the DS and LS scores for the full posterior predictive distribution are not equivalent.
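.
A quick, hypothetical numerical check of that relationship for a plain Normal predictive distribution (the values of m, s, and y below are arbitrary): the negated log-score equals \(\tfrac{1}{2}\,\text{DS} + \tfrac{1}{2}\log(2\pi)\).

m <- 1.2
s <- 0.7
y <- 0.9
DS <- (y - m)^2 / s^2 + log(s^2)
LS <- -dnorm(y, mean = m, sd = s, log = TRUE)
# Should be TRUE: LS is an affine function of DS for a Normal prediction
all.equal(LS, 0.5 * DS + 0.5 * log(2 * pi))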

As for the Poisson case, we can estimate the log-score uncertainty:

head(
  data.frame(
    log_S = scores_normal$LS,
    lower = -log(pred_normal$dens$mean + 2 * pred_normal$dens$mean.mc_std_err),
    upper = -log(pred_normal$dens$mean - 2 * pred_normal$dens$mean.mc_std_err)
  )
)
#>        log_S      lower      upper
#> 1 -0.9511831 -0.9617924 -0.9404601
#> 2 -0.9068057 -0.9173169 -0.8961828
#> 3 -0.5739629 -0.5964087 -0.5510017
#> 4  2.1856340  2.1477539  2.2250057
#> 5 -1.0352574 -1.0423552 -1.0281088
#> 6  0.4701370  0.4433998  0.4976088

Posterior expectation of conditional scores

In some cases, one might be tempted to consider posterior distribution properties of conditional predictive scores, e.g. the posterior expectation \(\mathbb{E}_{\lambda|\text{data}}[S(F_\lambda, y)]\) for \(S(F_\lambda, y)\) under the posterior distribution for \(\lambda\) in the Poisson model.

For Squared Error, \[ \begin{aligned} \mathbb{E}_{\lambda|\text{data}}[(y - \lambda)^2] &= \mathbb{E}_{\lambda|\text{data}}[\{y - \mathbb{E}(\lambda|\text{data}) + \mathbb{E}(\lambda|\text{data}) - \lambda\}^2] \\ &= [y - \mathbb{E}(\lambda|\text{data})]^2 + \mathbb{E}_{\lambda|\text{data}}[\{\mathbb{E}(\lambda|\text{data}) - \lambda\}^2] \\ &= [y - \mathbb{E}(\lambda|\text{data})]^2 + \mathbb{V}(\lambda|\text{data}) , \end{aligned} \] where the cross term vanishes since \(\mathbb{E}_{\lambda|\text{data}}[\mathbb{E}(\lambda|\text{data}) - \lambda]=0\). It’s noteworthy that this is similar to the improper score \([y - \mathbb{E}(y|\text{data})]^2 + \mathbb{V}(y|\text{data})\), and also in this new case, a model with artificially small posterior variance can achieve a smaller expected score, making this type of construction problematic to interpret.

However, in some cases it does provide alternative approaches for how to compute the proper scores for the full posterior predictive distributions. If \(\lambda_{ij}\), \(j=1,\dots,J\) are samples from the posterior distribution, one score estimator is \[ \widehat{S}(F_i,y_i) = \left(y_i - \frac{1}{J}\sum_{j=1}^J \lambda_{ij}\right)^2 , \] with the averaging over the samples inside the quadratic expression, and we can use

pred_pois <- predict(fit_pois, newdata_pois, formula = ~ exp(Intercept + x))
scores <- (newdata_pois$y - pred_pois$mean)^2

If we instead take advantage of the new expression above, we have \[ [y - \mathbb{E}(\lambda|\text{data})]^2 = \mathbb{E}_{\lambda|\text{data}}[(y - \lambda)^2] - \mathbb{V}(\lambda|\text{data}) \] so the score can be estimated by

pred_pois <- predict(fit_pois, newdata_pois,
  formula = ~ {
    lambda <- exp(Intercept + x)
    list(
      cond_scores = (y - lambda)^2,
      lambda = lambda
    )
  },
  n.samples = 2000
)
scores <- pred_pois$cond_scores$mean - pred_pois$lambda$sd^2
summary(scores)
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#>    0.0014    0.6160    3.3390   58.9887   12.9231 4561.4004

For this particular case, this approach is unlikely to be an improvement, or more accurate, than the basic estimator. Remarkably, this estimator of the Squared Error isn’t even guaranteed to be non-negative! However, for other scores there may potentially be practical benefits.

An alternative estimator for CRPS

For the CRPS score, there are closed form expressions available for some distributions, conditionally on their parameters, but not for the full predictive mixture distribution. We take a similar approach as for SE, and let \(F\) and \(F_\lambda\) denote the unconditional and conditional cumulative distribution functions for the posterior predictive distribution. Then \(F(x)=\mathbb{E}_{\lambda|\text{data}}[F_\lambda(x)]\) for all \(x\), and \[ \begin{aligned} S_\text{CRPS}(F,y) &= \int_{-\infty}^\infty [F(x) - I(y\leq x)]^2 \,\mathrm{d}x \\ &= \mathbb{E}_{\lambda|\text{data}}\left[ \int_{-\infty}^\infty [F_\lambda(x) - I(y\leq x)]^2 \,\mathrm{d}x \right] - \int_{-\infty}^\infty \left\{\mathbb{E}_{\lambda|\text{data}}\left[F_\lambda(x)^2\right] - \mathbb{E}_{\lambda|\text{data}}[F_\lambda(x)]^2\right\} \,\mathrm{d}x \\ &= \mathbb{E}_{\lambda|\text{data}}\left[ S_\text{CRPS}(F_\lambda,y) \right] - \int_{-\infty}^\infty \mathbb{V}_{\lambda|\text{data}}[F_\lambda(x)] \,\mathrm{d}x . \end{aligned} \] Note that we didn’t need to use any particular model properties here, so this holds for any predictive model with mixture structure, when \(\lambda\) is the collection of model parameters. We also note the resemblance to the alternative expression for the Squared Error; this is because CRPS can be seen as the integral over all Brier scores for predicting event indicators of the form \(z=I(y\leq x)\), with probability \(F(x)\).

In the Poisson case, we can now estimate the CRPS scores as follows, which makes the code a bit simpler than the previous version that needed generate(). We can use the Poisson CRPS implementation crps_pois() from the scoringRules package.

However, it can be shown that the two approaches have nearly identical Monte Carlo variance, so the previous version is likely preferable as it doesn’t require knowing a closed form CRPS expression.

max_K <- 100 # some large value, so that 1-F(K) is small
pred_pois <- predict(fit_pois, newdata_pois,
  formula = ~ {
    lambda <- exp(Intercept + x)
    list(
      crps = scoringRules::crps_pois(y, lambda),
      F = do.call(
        c,
        lapply(
          seq_along(y),
          function(i) {
            ppois(seq(0, max_K), lambda = lambda[i])
          }
        )
      )
    )
  },
  n.samples = 2000
)
crps_score <-
  pred_pois$crps$mean -
  (pred_pois$F %>%
    mutate(i = rep(seq_along(newdata_pois$y), each = max_K + 1)) %>%
    group_by(i) %>%
    summarise(F_var = sum(sd^2), .groups = "drop") %>%
    pull(F_var))
summary(crps_score)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.2133  0.6575  1.2494  2.3172  2.4308 57.2278

Formulas and functions for Poisson CRPS, as well as for other distributions, can be found via the “Evaluating Probabilistic Forecasts with scoringRules” vignette of the scoringRules package, and the references in ?scoringRules::crps.numeric.