
Archive for the ‘Proper scoring’ Category

Are Treasury forecasts credible?

In a recent opinion piece economics columnist Ross Gittins defended Treasury on the grounds that:

  • It assesses its own performance and publishes the results
  • It bases its forecasts on reasonable assumptions and sophisticated modeling
  • Its critics don’t do either of these things

But still, are they credible?  What does “credible” mean?

As evidence of Treasury’s credibility Gittins points to a new section of the recent Budget Papers, Statement 7: Forecasting Performance and Scenario Analysis, in which Treasury describes its own performance.

Here is the first data displayed in that Statement:

chart1

It shows Treasury GDP growth forecasts (dots) and actual GDP growth (columns). Eyeballing the chart suggests that Treasury are usually out by half a percent or more, and sometimes much more.  But then, economic forecasting is notoriously difficult.  Is this performance good or bad or OK?  How do we tell?

One way is to compare against simple benchmarks.  For example, what if I were to “compete” with Treasury by forecasting that growth one year will be the same as growth in the previous year?  Clearly some years I’d do well, and some years I’d be way out.  Would I be on average more or less accurate than Treasury? You can’t tell just by looking, but it can be calculated easily enough.

I took the data from the chart and calculated the mean squared error for Treasury forecasts vs actual growth, and for two alternative “naive” strategies vs actual growth – the one described in the previous paragraph, and another where growth next year is forecast to be equal to the average of growth in all the previous years.  Here are the results:

Treasury: 1.12
Naive 1: 1.22
Naive 2: 0.93

Apparently, Treasury is doing about the same as one dumb strategy (same as last year) and worse than another (average of prior years).  (Note that for mean squared error measures, low is good.)
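For anyone who wants to repeat this kind of comparison, here is a minimal sketch of the calculation. The growth figures and Treasury forecasts in it are made up for illustration (they are not the values I read off the Budget Papers chart), and the function name `mean_squared_error` is just my own label.

```python
def mean_squared_error(forecasts, actuals):
    """Average of squared forecast errors; lower is better."""
    return sum((f - a) ** 2 for f, a in zip(forecasts, actuals)) / len(actuals)

# HYPOTHETICAL annual GDP growth figures (percent), for illustration only.
actual = [3.0, 2.0, 4.0, 3.0, 1.0, 3.0]
# HYPOTHETICAL official forecasts for the same years.
treasury = [3.5, 3.0, 3.0, 3.5, 2.5, 2.5]

# Naive 1: growth this year equals growth last year.
naive1 = actual[:-1]
# Naive 2: growth this year equals the average of all previous years.
naive2 = [sum(actual[:i]) / i for i in range(1, len(actual))]

# The naive strategies have no forecast for the first year,
# so score all three strategies on the remaining years only.
scored_actual = actual[1:]
print("Treasury:", round(mean_squared_error(treasury[1:], scored_actual), 2))
print("Naive 1: ", round(mean_squared_error(naive1, scored_actual), 2))
print("Naive 2: ", round(mean_squared_error(naive2, scored_actual), 2))
```

Swapping in the real forecast and outcome series is all that is needed to reproduce the comparison properly.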

In other words, on the face of it, despite all its effort, intelligence, data and modeling, Treasury forecasts GDP growth worse than a simple extrapolation well within the ability of most high school students.

If that’s right, I’d say Treasury forecasts are not credible.

To be sure, this is very rough and ready.  The analysis could be made more sophisticated in all sorts of ways.  The key question however is whether Treasury is managing to outperform simple benchmarks.  If they aren’t demonstrably doing so, why shouldn’t we ignore what they say and just go with the benchmarks?

Interestingly, Statement 7 of the Budget Papers makes no comparison with simple benchmarks.  They tell us how well they did, and provide a long list of reasons why they think doing as well as they did is pretty hard. They don’t tell us how well they would have done if they had used some other less expensive strategy.

The other thing they don’t tell us is how their forecasts compare with how good it is possible to be.  The implicit claim is that their forecasts are as good as anyone could do, but this is far from obvious.

I’m raising these points not to condemn Treasury forecasts but to throw out some challenges.

First, Treasury should compare how it is performing against simple benchmarks.  Indeed, I’d be surprised if they weren’t doing this already in some cubicle somewhere. They should make the results easily accessible to the public.

Second, Treasury and its critics should enter into independently-run public forecasting competitions.  These competitions should be open to anyone who’d like to try their hand at it, using whatever methods or data they like. Such competitions would be the best way to establish the level of credibility Treasury forecasts really have.

The new site www.rba.tips – a “tipping competition” for RBA interest rate decisions – is an example of the kind of approach that could be used.

Such steps might help make Treasury, in Gittins’ phrase, “the only honest players in this game.”

Read Full Post »

When is a forecaster performing well?  An increasingly common way to measure this is to use a scoring rule known as the Brier score.

The essential idea behind the Brier score is simple enough: it is the average gap (mean squared difference) between forecast probabilities and actual outcomes. This post tries to explain and motivate the Brier score by “composing” it from some other simple ideas about forecasting quality, unlike many presentations which start with the Brier score and then show how it can be decomposed. There is nothing surprising here for anyone well-versed in these topics, but others who (like me) are just beginning to explore these ideas might find the post helpful.

I’ll use a very small real-world dataset, a set of predictions about what the Reserve Bank of Australia will decide about interest rates at its monthly meetings.  The RBA generally leaves interest rates unchanged, but sometimes raises them and sometimes lowers them, depending on economic conditions.  The dataset consists of predictions implicit in the assessments of the ANU RBA Shadow Board, as found on the website RBA.Tips.  To keep things simple, the dataset reduces the predictions to binary outcomes – Change or No Change – and provides a numerical estimate of the probability of No Change.

data

The “Coded Outcome” column just translates the RBA’s decision into numbers – 1 for No Change, and 0 for Change.  This makes it possible to do the calculations described below.

Uncertainty

One obvious thing about this dataset is that more often than not, there is No Change.  In this small sample, the RBA made no change 5/7 times or 71.4% of the time, which as it happens is quite close to the long term (1990-2015) average or overall base rate of 75%.  In other words, there isn’t a lot of uncertainty in the outcomes being predicted. Conversely, uncertainty would also be low if the RBA almost always changed the interest rate.  A simple way to put a single number on the uncertainty of either of these flavors is to take the base rate and multiply it by 1 minus itself, i.e.

Uncertainty = base rate * (1 – base rate).

For this dataset, Uncertainty is 0.714 * (1 – 0.714) = 0.204
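As a quick check, the same calculation can be sketched in a few lines of Python. The variable names are my own; the only data used are the five No Change and two Change outcomes described above.

```python
# Coded outcomes: 1 = No Change, 0 = Change.
# Order does not matter for the base rate, only the counts (5 and 2).
outcomes = [1] * 5 + [0] * 2

base_rate = sum(outcomes) / len(outcomes)    # 5/7
uncertainty = base_rate * (1 - base_rate)    # largest (0.25) when the base rate is 0.5

print(round(base_rate, 3), round(uncertainty, 3))  # 0.714 0.204
```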

This relative lack of uncertainty means that an attractive forecasting strategy would be to simply go with the base rate, i.e. always predicting that the RBA will do whatever it does most often.  How well would such a forecaster do? A simple way to measure this is in terms of hits and misses.  For the period above, the base-rate strategy would yield 5 hits and 2 misses out of a total of 7 predictions, i.e. a hit rate of 5/7 = 71%.  Over a long period, this rate should converge on the base rate – as long, that is, as variations in economic conditions, and RBA decision making, tend in future to be similar to what they were in the past when the base rate was being determined.
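Counting hits for the base-rate strategy is simple enough to sketch directly. This is only an illustration with my own variable names; the outcome sequence is the coded No Change (1) / Change (0) series for the seven meetings.

```python
# Coded outcomes for the seven meetings: 1 = No Change, 0 = Change.
outcomes = [1, 1, 1, 0, 1, 1, 0]

# The base-rate strategy always predicts the most common outcome.
prediction = 1 if 2 * sum(outcomes) >= len(outcomes) else 0

hits = sum(1 for o in outcomes if o == prediction)
misses = len(outcomes) - hits
hit_rate = hits / len(outcomes)

print(hits, misses, round(hit_rate, 2))  # 5 2 0.71
```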

The base rate strategy has some advantages (assuming you can access the base rate information).  First, it is better than the most naive approach, which would be to pick randomly, or assign equal probabilities to the possible outcomes.  Second, it is easy; you don’t have to know much or think hard about economic conditions and the interplay between those and RBA decision making. The downside is that over the long term you can’t do better than the base rate, and you can’t do better than anyone else who is also using the base rate strategy.  If you’re ambitious or competitive or just take pride in good work, you’ll need to make predictions which are more sensitive to the underlying probabilities of the outcomes – i.e. more likely to predict No Change when no change is more likely, and vice versa for change.

This can be seen in our simple dataset.  Crude inspection suggests the predictions fall into two groups or “bins”.  From Oct-14 to Dec-14 the probabilities assigned to No Change were all 70% or above, and over this period, interest rates in fact never changed.  From Feb-15 to May-15, the probabilities were lower, in the 60-70% range, and twice there was in fact a change.  It seems that whoever made these predictions believed that the economic conditions made a change more likely in 2015 than it was in late 2014, and they correctly adjusted their predictions accordingly.  Note that they had two misses in 2015, suggesting that their probabilities had not been reduced sufficiently.  But intuitively the “miss” predictions were not quite as off-the-mark as they would have been if the probabilities had been at the higher 2014 level – an idea captured by the Brier score.

Resolution

So in general a good forecaster will not make the same forecast regardless of circumstances but rather will have “horses for courses,” i.e. different forecasts when the actual probabilities of various outcomes are different.  Can we measure the extent to which a forecaster is doing this?  One way to do it is:

  • Put the forecasts into groups or bins with the same forecast probability
  • For each bin, measure how different the outcomes of predictions in the bin – the “bin base rate” – are from the overall base rate.
  • Add up these differences

Let’s see how this goes with our dataset.  Suppose we have two bins, the 70s (2014) bin and the 60s (2015) bin.  For forecasts in the 70s bin, the outcomes were all No Change, so the bin base rate is 1.  For the 60s bin, the bin base rate is 2/4 = 0.5.  So we get:

70s bin: 1 (the bin base rate) – 0.714 (the overall base rate) = 0.286
60s bin: 0.5 – 0.714 = -0.214

Before we just add up these differences, we need to square them to make sure they’re both positive, and then “weight” them by the number of forecasts in each bin:

0.286^2 * 3 = 0.245
(-0.214)^2 * 4 = 0.183

Then we add them and divide by the total number of forecasts (7), to get 0.061.

This number is known as the Resolution of the forecast set.  The higher the Resolution the better; a forecaster with higher Resolution is making forecasts which are more different to the overall base rate than a forecaster with a lower score, and in that sense more interesting or bold.
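The Resolution calculation can be sketched as follows. The bin sizes and bin base rates come from the two bins described above; the dictionary layout and variable names are just my own way of setting it out.

```python
# Each bin: (number of forecasts, bin base rate).
bins = {"70s": (3, 1.0), "60s": (4, 0.5)}

total = sum(n for n, _ in bins.values())                         # 7 forecasts
base_rate = sum(n * rate for n, rate in bins.values()) / total   # 5/7

# Weighted mean squared difference between bin base rates and the overall base rate.
resolution = sum(n * (rate - base_rate) ** 2 for n, rate in bins.values()) / total

print(round(resolution, 3))  # 0.061
```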

Calibration

In order to define resolution we had to sort forecasts into probability-of-outcome bins.  A natural question to ask is how well these bins correspond to the rate at which outcomes actually occur.  Consider for example the 70s bin.  Forecasts in that bin predict No Change with a probability, on average, of 70.67%.  Does the RBA choose No Change 70.67% of the time in those months? No; it decided No Change 100% of the time.  So there’s a mismatch between forecast probabilities and outcome rates.  Since the latter is higher, we call the forecasts underconfident; the probabilities should have been higher.

Similarly, forecasts in the 60s bin predicted No Change with probability (on average) 66%, but the RBA in fact made no change only half the time.  Since 0.66 is larger than 0.5, we call this overconfidence.

Calibration is the term used to describe the alignment between forecast probabilities and outcome rates.  Calibration is usually illustrated with a chart like this:

calibration

The orange line represents a hypothetical forecaster with perfect calibration, i.e. where the observed rate for every bin is exactly the same as the forecast probability defining that bin; the orange dots represent hypothetical bins with probabilities 0, 0.1, 0.2, etc.  The two bins from our dataset are shown as blue dots.  The 70s bin is out to the left of the line, indicating underconfidence; vice versa for the 60s bin.
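The bin-by-bin comparison can be sketched in code. The function name `calibration_label` is my own invention, and the labels simply follow the usage in this post: underconfident when the outcome occurred more often than forecast, overconfident when it occurred less often.

```python
def calibration_label(forecast_prob, observed_rate):
    """Compare a bin's average forecast probability with its observed outcome rate."""
    if observed_rate > forecast_prob:
        return "underconfident"   # outcome happened more often than forecast
    if observed_rate < forecast_prob:
        return "overconfident"    # outcome happened less often than forecast
    return "well calibrated"

# Each bin: (average forecast probability of No Change, observed No Change rate).
bins = {"70s": (0.7067, 1.0), "60s": (0.66, 0.5)}

for name, (forecast_prob, observed_rate) in bins.items():
    print(name, calibration_label(forecast_prob, observed_rate))
```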

Reliability

So it seems our forecaster is not particularly well calibrated (though be aware that we are dealing with a tiny dataset where luck of the draw can have undue effects). Can we quantify the level of calibration shown by the forecaster in a particular set of forecasts? Yes, using an approach very similar to the calculation in the previous section.  There we took the mean (average) squared difference between bin base rates and overall base rates.  To quantify calibration, we take the mean squared difference between bin probability and bin base rate.  If that sounds cryptic, let’s walk through the numbers.

For the 70s bin, the average forecast probability was 70.67%, and the bin base rate was 1, so the squared difference is

(.7067 – 1)^2 = 0.086

Similarly for the 60s bin:

(0.66 – 0.5)^2 = .026

Multiply each of these by the number of forecasts in the bin:

0.086 * 3 = 0.258
.026 * 4 = 0.102

Add these together and divide by the total number of forecasts, to get 0.052.  This, as you guessed, is called the Reliability of the forecast set.  Note however that Reliability is good when the mean squared difference is minimized, so the lower the Reliability score, the better, unlike Resolution where higher is better.
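Here is a sketch of the Reliability calculation using the bin figures above; the tuple layout and names are my own. One small caveat: with the bin average rounded to 0.7067, the unrounded result comes out at roughly 0.0515, which sits right on the boundary between 0.051 and 0.052, so do not be alarmed by a difference in the third decimal place.

```python
# Each bin: (number of forecasts, average forecast probability, bin base rate).
bins = [
    (3, 0.7067, 1.0),  # 70s bin
    (4, 0.66, 0.5),    # 60s bin
]

total = sum(n for n, _, _ in bins)  # 7 forecasts

# Weighted mean squared difference between forecast probability and observed rate.
reliability = sum(n * (prob - rate) ** 2 for n, prob, rate in bins) / total

print(round(reliability, 4))
```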

Recap

Let’s briefly take stock. Our guiding question has been: how good are the forecasts in our little dataset? So far, to get a handle on this we’ve loosely defined four quantities:

  1. Uncertainty in the outcomes.  Uncertainty indicates the degree to which outcomes are predictable.
  2. Resolution of the forecast set.  This is the degree to which the forecasts fall into subsets with outcome rates different from the overall outcome base rate, calculated as mean squared difference.
  3. Calibration – the correspondence, on a bin-by-bin basis, between the forecast probabilities and the outcome rates;
  4. Reliability – an overall measure of calibration, calculated as the mean squared difference between forecast bin probabilities and outcome rates – or in other words, mean squared calibration.

Brier Score

But wouldn’t it be good if we could somehow capture all this in a single, goodness-of-forecasts number?  That’s what the Brier score does.  The Brier score is yet another mean squared difference measure, but this time it compares forecast probabilities with outcomes on a forecast-by-forecast basis.  In other words, for each forecast, subtract the outcome (coded as 1 or 0) from the forecast probability and square the result; add up all the results and divide by the total number of forecasts.  For our little dataset we get

(0.7067 – 1)^2 = 0.086
(0.7067 – 1)^2 = 0.086
(0.7067 – 1)^2 = 0.086
(0.66 – 0)^2 = 0.436
(0.66 – 1)^2 = 0.116
(0.66 – 1)^2 = 0.116
(0.66 – 0)^2 = 0.436

Add these all up and divide by 7 to get 0.195 – the Brier Score for this set of forecasts.  (Note that because, in calculating Uncertainty, Resolution and Reliability we collapsed forecasts into bins with a single forecast probability, in calculating the Brier score we treat each forecast as having its “bin” probability.)

Like Reliability, lower is better for Brier scores; a perfect score is 0.
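The Brier score calculation is short enough to sketch in full. As in the text, each forecast is given its bin probability rather than its original value; the function name `brier_score` is my own. Computed without rounding the individual terms, the result is about 0.194; the 0.195 quoted above comes from adding up the rounded terms.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and coded outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

# Bin probabilities of No Change: three forecasts in the 70s bin, four in the 60s bin.
forecasts = [0.7067] * 3 + [0.66] * 4
# Coded outcomes: 1 = No Change, 0 = Change.
outcomes = [1, 1, 1, 0, 1, 1, 0]

brier = brier_score(forecasts, outcomes)
print(round(brier, 3))
```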

Brier Score Composition

It turns out that all these measures are unified by the simple equation

Brier Score = Reliability – Resolution + Uncertainty

or in our numbers

Brier Score = 0.195
Reliability – Resolution + Uncertainty = 0.052 – 0.061 + 0.204 = 0.195

In other words, the Brier score is composed out of Reliability (a measure of Calibration), Resolution, and Uncertainty.
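To convince yourself that the equation is not just a coincidence of rounding, here is a sketch that computes all four quantities from the same forecasts and checks that they match. The identity is exact here because every forecast in a bin has the same probability. Variable names are my own.

```python
from collections import defaultdict

# Bin probabilities of No Change and coded outcomes (1 = No Change, 0 = Change).
forecasts = [0.7067] * 3 + [0.66] * 4
outcomes = [1, 1, 1, 0, 1, 1, 0]

n = len(outcomes)
base_rate = sum(outcomes) / n

# Group outcomes by forecast probability; each distinct probability is one bin.
by_bin = defaultdict(list)
for f, o in zip(forecasts, outcomes):
    by_bin[f].append(o)

reliability = sum(len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in by_bin.items()) / n
resolution = sum(len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in by_bin.values()) / n
uncertainty = base_rate * (1 - base_rate)
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

# The two sides agree up to floating-point error.
print(abs(brier - (reliability - resolution + uncertainty)) < 1e-12)  # True
```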

The equation above – which can be found in full formulaic glory on the Wikipedia page and in many other places – is attributed to Allan Murphy in a paper published in 1973 in the Journal of Applied Meteorology.  It is usually called the Brier score decomposition, but here I’ve called it the Brier Score Composition because I’ve approached it in a bottom-up way.

Interpreting the Brier Score

As mentioned at the outset, the Brier score is increasingly common as a measure of forecasting performance.  According to Barbara Mellers, a principal researcher in the Good Judgment Project, “This measure of accuracy is central to the question of whether forecasters can perform well over extended periods and what factors predict their success.”

Having followed how the Brier score is built up out of other measures of forecasting quality, we should keep in mind two important points.

  1. One of the Brier score components is Uncertainty, which is a function solely of the outcomes, not of the forecasts.  Greater Uncertainty will push up Brier scores.  This means that a forecaster trying to forecast in a highly uncertain domain will have a higher Brier score than a forecaster of the same skill level tackling a less uncertain domain.  In other words, you can’t directly compare Brier scores unless they are scoring forecasts on the same set of events (or two sets of events with the same Uncertainty).  As a rough rule of thumb, only compare Brier scores if the forecasters were forecasting the same events.
  2. The Brier score is convenient as a single number, but it collapses three other measures.  You can get more insight into a forecaster’s performance if you look not just at the “headline number” – the Brier score – but at all four measures.
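The first point can be illustrated with a small sketch. Take two forecasters using exactly the same lazy strategy (always forecast the base rate), one in a domain where outcomes are split 50/50 and one where they are split 90/10. Both outcome sets are invented purely for illustration. Since such forecasts have perfect Reliability and zero Resolution, the Brier score in each case is just the Uncertainty of the domain.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and coded outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

def base_rate_forecaster_score(outcomes):
    """Brier score of a forecaster who always forecasts the overall base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    return brier_score([base_rate] * len(outcomes), outcomes)

# HYPOTHETICAL outcome sets, for illustration only.
balanced = [1, 0] * 5        # base rate 0.5: high Uncertainty
lopsided = [1] * 9 + [0]     # base rate 0.9: low Uncertainty

print(round(base_rate_forecaster_score(balanced), 3))  # 0.25
print(round(base_rate_forecaster_score(lopsided), 3))  # 0.09
```

Same strategy, same (lack of) skill, very different Brier scores, which is why comparing scores across different sets of events is misleading.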
