When is a forecaster performing well? An increasingly common way to measure this is to use a scoring rule known as the Brier score.

The essential idea behind the Brier score is simple enough: it is the average gap (mean squared difference) between forecast probabilities and actual outcomes. This post tries to explain and motivate the Brier score by “composing” it from some other simple ideas about forecasting quality, unlike many presentations which start with the Brier score and then show how it can be decomposed. There is nothing surprising here for anyone well-versed in these topics, but others who (like me) are just beginning to explore these ideas might find the post helpful.

I’ll use a very small real-world dataset, a set of predictions about what the Reserve Bank of Australia will decide about interest rates at its monthly meetings. The RBA generally leaves interest rates unchanged, but sometimes raises them and sometimes lowers them, depending on economic conditions. The dataset consists of predictions implicit in the assessments of the ANU RBA Shadow Board, as found on the website RBA.Tips. To keep things simple, the dataset reduces the predictions to binary outcomes – Change or No Change – and provides a numerical estimate of the probability of No Change.

The “Coded Outcome” column just translates the RBA’s decision into numbers – 1 for No Change, and 0 for Change. This makes it possible to do the kind of mathematics described below.

**Uncertainty**

One obvious thing about this dataset is that more often than not, there is No Change. In this small sample, the RBA made no change 5/7 times or 71.4% of the time, which as it happens is quite close to the long term (1990-2015) average or overall *base rate* of 75%. In other words, there isn’t a lot of uncertainty in the outcomes being predicted. Conversely, uncertainty would also be low if the RBA almost always changed the interest rate. A simple way to put a single number on the uncertainty of either of these flavors is to take the base rate and multiply it by 1 minus itself, i.e.

Uncertainty = base rate * (1 – base rate).

For this dataset, Uncertainty is 0.714 * (1 – 0.714) = 0.204
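As a quick check, the same calculation can be sketched in a few lines of Python. The `outcomes` list is the coded outcome for each of the seven meetings (1 for No Change, 0 for Change), in the order used in the worked examples later in the post; the variable names are just illustrative.

```python
# Coded outcomes for the seven meetings: 1 = No Change, 0 = Change.
outcomes = [1, 1, 1, 0, 1, 1, 0]

# The base rate is simply the share of meetings that ended in No Change.
base_rate = sum(outcomes) / len(outcomes)

# Uncertainty = base rate * (1 - base rate).
uncertainty = base_rate * (1 - base_rate)

print(f"base rate   = {base_rate:.3f}")    # 0.714
print(f"uncertainty = {uncertainty:.3f}")  # 0.204
```

Uncertainty calculated this way is largest (0.25) when the base rate is exactly 0.5, and shrinks towards 0 as the base rate approaches either 0 or 1.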

This relative lack of uncertainty means that an attractive forecasting strategy would be to simply go with the base rate, i.e. always predicting that the RBA will do whatever it does most often. How well would such a forecaster do? A simple way to measure this is in terms of hits and misses. For the period above, the base-rate strategy would yield five hits and two misses out of a total of seven predictions, i.e. a hit rate of 5/7 = 71%. Over a long period, this rate should converge on the base rate – as long, that is, as variations in economic conditions, and RBA decision making, tend in future to be similar to what they were in the past when the base rate was being determined.

The base-rate strategy has some advantages (assuming you can access the base rate information). First, it is better than the most naive approach, which would be to pick randomly, or assign equal probabilities to the possible outcomes. Second, it is easy; you don’t have to know much or think hard about economic conditions and the interplay between those and RBA decision making. The downside is that over the long term you can’t do better than the base rate, and you can’t do better than anyone else who is also using the base-rate strategy. If you’re ambitious or competitive or just take pride in good work, you’ll need to make predictions which are more sensitive to the underlying probabilities of the outcomes – i.e. more likely to predict No Change when no change is more likely, and vice versa for change.

This can be seen in our simple dataset. Crude inspection suggests the predictions fall into two groups or “bins”. From Oct-14 to Dec-14 the probabilities assigned to No Change were all 70% or above, and over this period, interest rates in fact never changed. From Feb-15 to May-15, the probabilities were lower, in the 60-70% range, and twice there was in fact a change. It seems that whoever made these predictions believed that the economic conditions made a change more likely in 2015 than it was in late 2014, and they correctly adjusted their predictions accordingly. Note that they had two misses in 2015, suggesting that their probabilities had not been reduced sufficiently. But intuitively the “miss” predictions were not quite as off-the-mark as they would have been if the probabilities had been at the higher 2014 level – an idea captured by the Brier score.

**Resolution**

So in general a good forecaster will not make the same forecast regardless of circumstances but rather will have “horses for courses,” i.e. different forecasts when the actual probabilities of various outcomes are different. Can we measure the extent to which a forecaster is doing this? One way to do it is:

- Put the forecasts into groups or bins with the same forecast probability
- For each bin, measure how different the rate of outcomes in the bin (the “bin base rate”) is from the overall base rate.
- Add up these differences

Let’s see how this goes with our dataset. Suppose we have two bins, the 70s (2014) bin and the 60s (2015) bin. For forecasts in the 70s bin, the outcomes were *all* No Change, so the bin base rate is 1. For the 60s bin, the bin base rate is 2/4 = 0.5. So we get:

70s bin: 1 (the bin base rate) – 0.714 (the overall base rate) = 0.286

60s bin: 0.5 – 0.714 = -0.214

Before we just add up these differences, we need to square them to make sure they’re both positive, and then “weight” them by the number of forecasts in each bin:

0.286^2 * 3 = 0.245

(-0.214)^2 * 4 = 0.183

Then we add them and divide by the total number of forecasts (7), to get 0.061.

This number is known as the Resolution of the forecast set. The higher the Resolution the better; a forecaster with higher Resolution is making forecasts which are more different to the overall base rate than a forecaster with a lower score, and in that sense more interesting or bold.
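Here is a minimal Python sketch of the Resolution calculation just described. It assumes, as the post does, that every forecast is represented by the average probability of its bin (0.7067 for the 70s bin, 0.66 for the 60s bin), and it groups forecasts into bins by exact probability value. The function name `resolution` is just a label chosen for this example.

```python
from collections import defaultdict

# Each forecast is replaced by its bin's average probability of No Change.
forecasts = [0.7067, 0.7067, 0.7067, 0.66, 0.66, 0.66, 0.66]
# Coded outcomes: 1 = No Change, 0 = Change.
outcomes = [1, 1, 1, 0, 1, 1, 0]


def resolution(forecasts, outcomes):
    """Weighted mean squared gap between bin base rates and the overall base rate."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n

    # Group outcomes by forecast probability (one bin per distinct probability).
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)

    total = 0.0
    for bin_outcomes in bins.values():
        bin_rate = sum(bin_outcomes) / len(bin_outcomes)
        # Weight each squared difference by the number of forecasts in the bin.
        total += len(bin_outcomes) * (bin_rate - base_rate) ** 2
    return total / n


print(f"resolution = {resolution(forecasts, outcomes):.3f}")  # 0.061
```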

**Calibration**

In order to define Resolution we had to sort forecasts into probability-of-outcome bins. A natural question to ask is how well these bins correspond to the rate at which outcomes actually occur. Consider for example the 70s bin. Forecasts in that bin predict No Change with a probability, on average, of 70.67%. Did the RBA choose No Change 70.67% of the time in those months? No; it decided No Change 100% of the time. So there’s a mismatch between forecast probabilities and outcome rates. Since the latter is higher, we call the forecasts *underconfident*; the probabilities should have been higher.

Similarly, forecasts in the 60s bin predicted No Change with probability (on average) 66%, but the RBA in fact made no change only half the time. Since 0.66 is larger than 0.5, we call this *overconfidence*.
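The bin-by-bin comparison can be sketched in Python as follows. The bin labels and the helper name `calibration_label` are made up for illustration, and the labelling follows the convention used above: forecasts count as underconfident when the observed No Change rate is higher than the forecast probability, and overconfident when it is lower.

```python
# For each bin: (average forecast probability of No Change, coded outcomes).
bins = {
    "70s": (0.7067, [1, 1, 1]),
    "60s": (0.66, [0, 1, 1, 0]),
}


def calibration_label(forecast_prob, outcomes):
    """Compare a bin's forecast probability with its observed outcome rate."""
    observed = sum(outcomes) / len(outcomes)
    if observed > forecast_prob:
        return observed, "underconfident"
    if observed < forecast_prob:
        return observed, "overconfident"
    return observed, "calibrated"


for name, (prob, outs) in bins.items():
    observed, label = calibration_label(prob, outs)
    print(f"{name} bin: forecast {prob:.2%}, observed {observed:.0%} -> {label}")
```

With only three or four forecasts per bin, these labels are of course very sensitive to a single outcome going the other way.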

Calibration is the term used to describe the alignment between forecast probabilities and outcome rates. Calibration is usually illustrated with a chart like this:

The orange line represents a hypothetical forecaster with perfect calibration, i.e. where the observed rate for every bin is exactly the same as the forecast probability defining that bin; the orange dots represent hypothetical bins with probabilities 0, 0.1, 0.2, etc. The two bins from our dataset are shown as blue dots. The 70s bin is out to the left of the line, indicating underconfidence; vice versa for the 60s bin.

**Reliability**

So it seems our forecaster is not particularly well calibrated (though be aware that we are dealing with a tiny dataset where luck of the draw can have undue effects). Can we quantify the level of calibration shown by the forecaster in a particular set of forecasts? Yes, using an approach very similar to the Resolution calculation above. There we took the mean (average) squared difference between bin base rates and the overall base rate. To quantify calibration, we take the mean squared difference between bin probability and bin base rate. If that sounds cryptic, let’s walk through the numbers.

For the 70s bin, the average forecast probability was 70.67%, and the bin base rate was 1, so the squared difference is

(.7067 – 1)^2 = 0.086

Similarly for the 60s bin:

(0.66 – 0.5)^2 = 0.0256

Multiply each of these by the number of forecasts in the bin:

0.086 * 3 = 0.258

0.0256 * 4 = 0.102

Add these together and divide by the total number of forecasts, to get 0.052. This, as you guessed, is called the Reliability of the forecast set. Note however that Reliability is good when the mean squared difference is minimized, so the lower the Reliability score, the better, unlike Resolution where higher is better.
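The Reliability calculation can be sketched the same way. As before, this assumes each forecast carries its bin's average probability (0.7067 or 0.66, as written in the arithmetic above), and `reliability` is just a name chosen for this example. Printed to four decimal places the result is about 0.0515, in line with the 0.052 above; small differences in the last digit come down to how the intermediate terms are rounded.

```python
from collections import defaultdict

# Each forecast is replaced by its bin's average probability of No Change.
forecasts = [0.7067, 0.7067, 0.7067, 0.66, 0.66, 0.66, 0.66]
# Coded outcomes: 1 = No Change, 0 = Change.
outcomes = [1, 1, 1, 0, 1, 1, 0]


def reliability(forecasts, outcomes):
    """Weighted mean squared gap between bin probabilities and bin base rates."""
    n = len(outcomes)

    # Group outcomes by forecast probability (one bin per distinct probability).
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)

    total = 0.0
    for prob, bin_outcomes in bins.items():
        bin_rate = sum(bin_outcomes) / len(bin_outcomes)
        # Weight each squared difference by the number of forecasts in the bin.
        total += len(bin_outcomes) * (prob - bin_rate) ** 2
    return total / n


print(f"reliability = {reliability(forecasts, outcomes):.4f}")
```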

**Recap**

Let’s briefly take stock. Our guiding question has been: how good are the forecasts in our little dataset? So far, to get a handle on this we’ve loosely defined four quantities:

- **Uncertainty** in the outcomes. Uncertainty reflects how predictable the outcomes are: the lower the Uncertainty, the more predictable the outcomes.
- **Resolution** of the forecast set. This is the degree to which the forecasts fall into subsets with outcome rates different from the overall outcome base rate, calculated as mean squared difference.
- **Calibration** – the correspondence, on a bin-by-bin basis, between the forecast probabilities and the outcome rates.
- **Reliability** – an overall measure of calibration, calculated as the mean squared difference between forecast bin probabilities and outcome rates – or in other words, mean squared calibration.

**Brier Score**

But wouldn’t it be good if we could somehow capture all this in a single, goodness-of-forecasts number? That’s what the Brier score does. The Brier score is yet another mean squared difference measure, but this time it compares forecast probabilities with outcomes on a forecast-by-forecast basis. In other words, for each forecast, subtract the outcome (coded as 1 or 0) from the forecast probability and square the result; add up all the results and divide by the total number of forecasts. For our little dataset we get

(0.7067 – 1)^2 = 0.086

(0.7067 – 1)^2 = 0.086

(0.7067 – 1)^2 = 0.086

(0.66 – 0)^2 = 0.436

(0.66 – 1)^2 = 0.116

(0.66 – 1)^2 = 0.116

(0.66 – 0)^2 = 0.436

Add these all up and divide by 7 to get 0.195 – the Brier Score for this set of forecasts. (Note that because, in calculating Uncertainty, Resolution and Reliability we collapsed forecasts into bins with a single forecast probability, in calculating the Brier score we treat each forecast as having its “bin” probability.)

Like Reliability, lower is better for Brier scores; a perfect score is 0.
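In Python, the Brier score is close to a one-liner. The sketch below uses the same bin-probability simplification as the worked numbers above; `brier_score` is a name chosen for this example. Without rounding the individual squared differences first, the result comes out at about 0.194 rather than 0.195.

```python
# Each forecast is replaced by its bin's average probability of No Change.
forecasts = [0.7067, 0.7067, 0.7067, 0.66, 0.66, 0.66, 0.66]
# Coded outcomes: 1 = No Change, 0 = Change.
outcomes = [1, 1, 1, 0, 1, 1, 0]


def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and coded outcomes."""
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


print(f"Brier score = {brier_score(forecasts, outcomes):.4f}")

# A forecaster who assigns probability 1 to what happens (and 0 to what
# doesn't) gets the perfect score of 0.
print(brier_score([1.0, 0.0, 1.0], [1, 0, 1]))
```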

**Brier Score Composition**

It turns out that all these measures are unified by the simple equation

Brier Score = Reliability – Resolution + Uncertainty

or in our numbers

Brier Score = 0.195

Reliability – Resolution + Uncertainty = 0.052 – 0.061 + 0.204 = 0.195

In other words, the Brier score is composed out of Reliability (a measure of Calibration), Resolution, and Uncertainty.
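The equation can be checked numerically with a short Python sketch. The helper name `murphy_components` is made up for this example, and the check relies on the same simplification as before: all forecasts in a bin share one probability, which is what makes the identity hold exactly. Both sides come to roughly 0.194; the 0.195 in the text reflects rounding of the intermediate terms.

```python
import math
from collections import defaultdict

# Each forecast is replaced by its bin's average probability of No Change.
forecasts = [0.7067, 0.7067, 0.7067, 0.66, 0.66, 0.66, 0.66]
# Coded outcomes: 1 = No Change, 0 = Change.
outcomes = [1, 1, 1, 0, 1, 1, 0]


def murphy_components(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty) for binary outcomes."""
    n = len(outcomes)
    base_rate = sum(outcomes) / n

    # One bin per distinct forecast probability.
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)

    reliability = 0.0
    resolution = 0.0
    for prob, bin_outcomes in bins.items():
        size = len(bin_outcomes)
        bin_rate = sum(bin_outcomes) / size
        reliability += size * (prob - bin_rate) ** 2
        resolution += size * (bin_rate - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and coded outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


rel, res, unc = murphy_components(forecasts, outcomes)
bs = brier_score(forecasts, outcomes)

print(f"Reliability - Resolution + Uncertainty = {rel - res + unc:.4f}")
print(f"Brier score                            = {bs:.4f}")

# The two sides agree up to floating-point error.
assert math.isclose(bs, rel - res + unc)
```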

The equation above – which can be found in full formulaic glory on the Wikipedia page and in many other places – is attributed to Allan Murphy in a paper published in 1973 in the Journal of Applied Meteorology. It is usually called the Brier score decomposition, but here I’ve called it the Brier Score Composition because I’ve approached it in a bottom-up way.

**Interpreting the Brier Score**

As mentioned at the outset, the Brier score is increasingly common as a measure of forecasting performance. According to Barbara Mellers, a principal researcher in the Good Judgment Project, “This measure of accuracy is central to the question of whether forecasters can perform well over extended periods and what factors predict their success.”

Having followed how the Brier score is built up out of other measures of forecasting quality, we should keep in mind two important points.

- One of the Brier score components is Uncertainty, which is a function solely of the outcomes, not of the forecasts. Greater Uncertainty will push up Brier scores. This means that a forecaster trying to forecast in a highly uncertain domain will have a higher Brier score than a forecaster of the same skill level tackling a less uncertain domain. In other words, you can’t directly compare Brier scores unless they are scoring forecasts on the same set of events (or two sets of events with the same Uncertainty). As a rough rule of thumb, only compare Brier scores if the forecasters were forecasting the same events.
- The Brier score is convenient as a single number, but it collapses three other measures. You can get more insight into a forecaster’s performance if you look not just at the “headline number” – the Brier score – but at all four measures.
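To make the first point concrete, here is a small Python sketch of the same low-effort strategy, always forecasting the base rate, applied to two sets of outcomes. The first is the RBA dataset from this post; the second, `coin_flip_outcomes`, is entirely hypothetical and stands in for a more uncertain domain. For simplicity the base rate is computed from the outcomes themselves. A constant base-rate forecast has zero Resolution and zero Reliability, so its Brier score is exactly the Uncertainty, and it differs between the two domains even though the "skill" is identical.

```python
import math


def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and coded outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)


def base_rate_forecaster(outcomes):
    """Brier score and Uncertainty when every forecast equals the base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    forecasts = [base_rate] * len(outcomes)
    return brier_score(forecasts, outcomes), base_rate * (1 - base_rate)


# RBA dataset from the post: 1 = No Change, 0 = Change.
rba_outcomes = [1, 1, 1, 0, 1, 1, 0]
# Hypothetical, maximally uncertain domain: outcomes split 50/50.
coin_flip_outcomes = [1, 0, 1, 0, 1, 0]

for name, outs in [("RBA", rba_outcomes), ("coin flip", coin_flip_outcomes)]:
    score, uncertainty = base_rate_forecaster(outs)
    print(f"{name}: Brier = {score:.4f}, Uncertainty = {uncertainty:.4f}")
```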