Assessing Accuracy and Ensemble Spread with MurCSS
Henning W. Rust, Madlen Fischer
Christopher Kadow, Sebastian Illing, Oliver Kunst
Institut für Meteorologie, Freie Universität Berlin
Version from April 10, 2014
Please note that this documentation is currently under construction; in particular, Sections 3 and 4 have not been finished yet. Comments or any kind of feedback are highly appreciated. Please send an e-mail to the authors.
1 Introduction
This document briefly reviews and extends the ideas for verifying decadal prediction presented by Goddard et al. [2013] and describes how the plugin MurCSS of the MiKlip Central Evaluation System (CES) can be used to assess hindcast skill accordingly.
Following Goddard et al. [2013], two key questions can be identified which should guide the quality assessment of a decadal ensemble prediction system:
- Is the initialised hindcast more accurate than the uninitialised projection?
- Is the model's ensemble spread representative of the forecast uncertainty?
A way to tackle the first issue is to estimate the accuracy of the ensemble mean with the mean squared error skill score (MSESS) and its decomposition, covered in Sec. 2. The second question can be addressed using a particular variant of the continuous ranked probability skill score (CRPSS), see Sec. 3. The basic ideas underlying these two approaches stem from traditional forecast verification; for an overview see, e.g., Wilks [2011] and Jolliffe and Stephenson [2012]. The decomposition of the mean squared error skill score goes back to Murphy [1988]. Here, however, we use these methods from forecast verification to assess the quality of an ensemble prediction system by verifying a set of hindcasts against a reanalysis or another data set derived from observations.
In section 2, the assessment of accuracy of the ensemble mean with the mean square error skill score is described. Section 3 discusses the representation of forecast uncertainty with the ensemble spread and Sec. 4 explains input and output of the component MurCSS and gives an example analysis.
2 Accuracy with the mean square error skill score
2.1 Mean square error and skill scores
The accuracy of the ensemble mean can be assessed with the mean squared error (MSE). To evaluate the added value of initialising the ensemble prediction system with an estimated state of the ocean (and the atmosphere), performance is compared to an uninitialised set of predictions; in the context of climate research, uninitialised predictions are called climate projections or climate simulations and are typically driven by scenarios for the radiative forcing (or emissions) [Kirtman et al., 2013]. This setting calls for using skill scores, which evaluate the performance of one prediction system against a reference prediction, typically a more or less trivial benchmark [e.g., Wilks, 2011]. In weather forecasting, this trivial benchmark is the climatological forecast, the persistence forecast or a random forecast. The appropriate benchmark for decadal prediction is the uninitialised prediction (climate projection). For hindcasts, the “climate projection” is called a historical run. A skill score evaluates the gain in performance against the trivial benchmark. The general expression for a skill score is
$\mathrm{SS} = \dfrac{A - A_{\mathrm{ref}}}{A_{\mathrm{perf}} - A_{\mathrm{ref}}}$   (1)
and is typically given in percent gain with respect to a reference forecast. Here, $A$ is the value of the accuracy measure (here the MSE, see below) for the prediction system in question, $A_{\mathrm{perf}}$ is its value for a perfect prediction and $A_{\mathrm{ref}}$ is its value for the reference forecast system. If the prediction system in question attains the same accuracy as the perfect prediction, it is assigned 100% skill; a skill of 0% or less is interpreted as being equal to or worse than the (typically trivial) reference system.
In the following, the mean square error (MSE) is considered as a more or less intuitive (or at least familiar) measure to be the basis of the skill score. The MSE between a hindcast $H$ and observations (reanalysis) $O$ is defined as the mean of the squared deviations of the two at instances (times) $j = 1, \dots, n$ and reads, following the notation of Goddard et al. [2013],
$\mathrm{MSE}_H = \dfrac{1}{n}\sum_{j=1}^{n} \left( H_j - O_j \right)^2$   (2)
The minimum value of the MSE is 0, indicating a (hypothetical) perfect prediction system and leading to $A_{\mathrm{perf}} = 0$ in Eq. (1). The upper bound of the MSE depends on whether $H$ (and $O$) is bounded, which is not the case for many continuous variables; hence, in general, the MSE has no upper bound.
Using the MSE (Eq. (2)) in the skill score (Eq. (1)) with the climatological forecast $\bar O$ as the reference leads to
$\mathrm{MSESS}(H, \bar O, O) = 1 - \dfrac{\mathrm{MSE}_H}{\mathrm{MSE}_{\mathrm{clim}}}$   (3)
with the MSE of the climatological forecast
$\mathrm{MSE}_{\mathrm{clim}} = \dfrac{1}{n}\sum_{j=1}^{n} \left( \bar O - O_j \right)^2$   (4)
The MSESS represents the improvement in accuracy of the hindcasts over the climatological forecast derived from the observations. It is unbounded for negative values (as hindcasts can be arbitrarily bad) but bounded by 1 (if not expressed as a percentage) for positive values. It is a function of the hindcast $H$ (the ensemble mean in our case), the reference forecast, here the climatological forecast $\bar O$, and the observations $O$ themselves, as they enter both $\mathrm{MSE}_H$ and $\mathrm{MSE}_{\mathrm{clim}}$. A positive value of the MSESS indicates a gain in hindcast performance compared to the reference forecast; zero or negative values indicate no gain.
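As a minimal numerical sketch of Eqs. (2)-(4) (hypothetical toy data and plain NumPy, not the MurCSS implementation):

    import numpy as np

    # hypothetical ensemble-mean hindcasts H_j and observations O_j
    H = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.7])
    O = np.array([0.3, 0.4, 0.0, 1.0, 0.5, 0.6])

    mse_h    = np.mean((H - O) ** 2)          # Eq. (2): MSE of the hindcast
    mse_clim = np.mean((O.mean() - O) ** 2)   # Eq. (4): MSE of the climatological forecast
    msess    = 1.0 - mse_h / mse_clim         # Eq. (3): skill against climatology
    print(mse_h, mse_clim, msess)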
Analogously, the MSESS against a general reference forecast, e.g., the uninitialised historical runs $R$, reads
$\mathrm{MSESS}(H, R, O) = 1 - \dfrac{\mathrm{MSE}_H}{\mathrm{MSE}_R}$   (5)
In the following, decompositions of the MSE (Eq. (2)) and its skill scores (Eqs. (3) and (5)) are discussed.
2.2 Decomposing the mean square error
Rewriting the MSE (Eq. (2)) or the MSESS (Eqs. (3) and (5)) leads to elegant decompositions of the MSE and its skill scores into a few contributing terms with intuitive interpretations.
Consider first the decomposition of the MSE: inserting a zero by adding and subtracting the mean hindcast over all issued predictions (all times), $\bar H$, and the mean observation, $\bar O$, on the right-hand side (rhs) of Eq. (2) yields
$\mathrm{MSE}_H = \dfrac{1}{n}\sum_{j=1}^{n} \left[ (H_j - \bar H) - (O_j - \bar O) + (\bar H - \bar O) \right]^2$   (6)
Expanding Eq. (6) into single terms leads to

$\mathrm{MSE}_H = \dfrac{1}{n}\sum_{j=1}^{n}(H_j - \bar H)^2 - \dfrac{2}{n}\sum_{j=1}^{n}(H_j - \bar H)(O_j - \bar O) + \dfrac{2(\bar H - \bar O)}{n}\sum_{j=1}^{n}(H_j - \bar H) - \dfrac{2(\bar H - \bar O)}{n}\sum_{j=1}^{n}(O_j - \bar O) + \dfrac{1}{n}\sum_{j=1}^{n}(O_j - \bar O)^2 + (\bar H - \bar O)^2$   (7)

with the third and fourth term being zero because the sums of the anomalies are zero. The first term represents the sample variance of the hindcasts $s_H^2$, the second term contains the sample covariance of hindcasts and observations $s_{HO}$, the next-to-last term is the sample variance of the observations $s_O^2$ and the last term is the square of the bias $(\bar H - \bar O)^2$. A shorthand notation of the above equation is
$\mathrm{MSE}_H = s_H^2 - 2\, s_{HO} + s_O^2 + (\bar H - \bar O)^2$   (8)
Using $s_{HO} = r_{HO}\, s_H\, s_O$, with $r_{HO}$ the sample correlation between the hindcasts and the observations, Eq. (8) yields
$\mathrm{MSE}_H = s_H^2 - 2\, s_H\, s_O\, r_{HO} + s_O^2 + (\bar H - \bar O)^2$   (9)
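A quick numerical check of the decomposition (again a sketch with toy data; note that all variances and covariances have to be computed with the 1/n normalisation for the identity to hold exactly):

    import numpy as np

    H = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.7])   # hypothetical hindcasts
    O = np.array([0.3, 0.4, 0.0, 1.0, 0.5, 0.6])   # hypothetical observations

    s_H, s_O = H.std(ddof=0), O.std(ddof=0)                          # 1/n standard deviations
    r_HO = np.mean((H - H.mean()) * (O - O.mean())) / (s_H * s_O)    # sample correlation

    mse_direct     = np.mean((H - O) ** 2)                                         # Eq. (2)
    mse_decomposed = s_H**2 - 2*s_H*s_O*r_HO + s_O**2 + (H.mean() - O.mean())**2   # Eq. (9)
    print(np.isclose(mse_direct, mse_decomposed))   # True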
In the following, the MSE decomposition Eq. (9) is plugged into the skill score Eq. (1) to yield the decomposition of the MSESS.
2.3 Decomposing the mean square error skill score
The decomposition Eq. (9) is inserted into the MSESS with the climatological forecast as reference (Eq. (3)). Using $\mathrm{MSE}_{\mathrm{clim}} = s_O^2$, i.e. the MSE of the climatological forecast being an estimate of the climatological variance, we obtain
$\mathrm{MSESS}(H, \bar O, O) = 2\,\dfrac{s_H}{s_O}\, r_{HO} - \left( \dfrac{s_H}{s_O} \right)^2 - \left( \dfrac{\bar H - \bar O}{s_O} \right)^2$   (10)
Adding and subtracting $r_{HO}^2$ and carrying out some transformations, the MSESS reads
$\mathrm{MSESS}(H, \bar O, O) = r_{HO}^2 - \left( r_{HO} - \dfrac{s_H}{s_O} \right)^2 - \left( \dfrac{\bar H - \bar O}{s_O} \right)^2$   (11)
The first term of Eq. (11) is the square of the sample correlation coefficient, a dimensionless measure of the linear relationship between the hindcasts and the observations. The second term is the squared conditional bias, a measure of reliability. The last term represents the bias or unconditional bias; the latter can be removed by analysing anomalies. The MSESS with the bias removed thus reads
$\mathrm{MSESS}(H, \bar O, O) = r_{HO}^2 - \left( r_{HO} - \dfrac{s_H}{s_O} \right)^2$   (12)
hence being basically decomposed into correlation and conditional bias. For the hindcast $H$ against the climatological forecast $\bar O$, we denote the left-hand side (lhs) of Eq. (12) as $\mathrm{MSESS}_H$ and, analogously, for another forecast $R$ against the climatological forecast, as $\mathrm{MSESS}_R$.
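The individual terms of this decomposition, which are also the quantities MurCSS reports (see Sec. 4.2), can be checked numerically; a sketch with the toy data and the 1/n normalisation from above:

    import numpy as np

    H = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.7])   # hypothetical hindcasts
    O = np.array([0.3, 0.4, 0.0, 1.0, 0.5, 0.6])   # hypothetical observations

    s_H, s_O = H.std(ddof=0), O.std(ddof=0)
    r_HO = np.mean((H - H.mean()) * (O - O.mean())) / (s_H * s_O)

    corr_sq   = r_HO ** 2                        # squared correlation
    cond_bias = r_HO - s_H / s_O                 # conditional bias
    unc_bias  = (H.mean() - O.mean()) / s_O      # scaled unconditional bias

    msess_decomposed = corr_sq - cond_bias**2 - unc_bias**2   # Eq. (11)
    msess_direct     = 1.0 - np.mean((H - O)**2) / s_O**2     # Eqs. (3) and (4)
    print(np.isclose(msess_decomposed, msess_direct))   # True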
A similar decomposition can be obtained for a general reference $R$ from Eq. (5) by adding and subtracting $s_O^2$ and dividing by $s_O^2$. After a few transformations, it reads
$\mathrm{MSESS}(H, R, O) = \dfrac{\left[ r_{HO}^2 - \left( r_{HO} - \frac{s_H}{s_O} \right)^2 \right] - \left[ r_{RO}^2 - \left( r_{RO} - \frac{s_R}{s_O} \right)^2 \right]}{1 - \left[ r_{RO}^2 - \left( r_{RO} - \frac{s_R}{s_O} \right)^2 \right]}$   (13)
Thus, the MSESS of the hindcast against a reference forecast can be expressed through the MSESS of the hindcast against the climatological forecast, $\mathrm{MSESS}_H$, and the MSESS of the reference prediction against the climatological forecast, $\mathrm{MSESS}_R$:
$\mathrm{MSESS}(H, R, O) = \dfrac{\mathrm{MSESS}_H - \mathrm{MSESS}_R}{1 - \mathrm{MSESS}_R}$   (14)
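The relation of Eq. (14) can be verified directly from the definitions (a sketch with hypothetical hindcasts H, a reference R, e.g. a historical run, and observations O; the skill scores are computed here from the full MSEs of Eqs. (2)-(5)):

    import numpy as np

    H = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.7])   # hypothetical hindcasts
    R = np.array([0.6, 0.3, 0.2, 0.7, 0.8, 0.4])   # hypothetical reference forecast
    O = np.array([0.3, 0.4, 0.0, 1.0, 0.5, 0.6])   # hypothetical observations

    mse      = lambda F: np.mean((F - O) ** 2)
    mse_clim = np.mean((O.mean() - O) ** 2)

    msess_H  = 1.0 - mse(H) / mse_clim   # skill of H against climatology, Eq. (3)
    msess_R  = 1.0 - mse(R) / mse_clim   # skill of R against climatology
    msess_HR = 1.0 - mse(H) / mse(R)     # skill of H against R, Eq. (5)

    # Eq. (14): the same quantity expressed through the two climatological skill scores
    print(np.isclose(msess_HR, (msess_H - msess_R) / (1.0 - msess_R)))   # True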
In the following, the interpretation of the various terms of this decomposition is discussed.
2.4 Interpretation of the MSESS and its decomposition
2.4.1 MSESS
The MSESS is an integral measure of the improvement in accuracy of the hindcasts over the climatological forecast or another reference forecast with respect to the observations (reanalysis) $O$. The MSESS can take arbitrarily large negative values (as hindcast performance can be arbitrarily bad) but is bounded from above by 1 for positive values. Positive values indicate a gain in accuracy of the hindcast compared to the reference; zero or negative values indicate no skill.
2.4.2 Correlation
The correlation coefficient represents a dimensionless measure of the linear relationship between hindcast (or reference) and the observations, ranging from -1 to 1. For the MSESS comparing the hindcast against the climatological forecast, the component MurCSS of the central evaluation system (CES) outputs the correlation coefficient $r_{HO}$ between $H$ and $O$. For an analysis of the hindcast against another reference prediction, the gain in correlation is shown.
2.4.3 Conditional Bias
The conditional bias $r_{HO} - s_H/s_O$ is a measure of the reliability of the hindcast/reference prediction, i.e. it gives information about the distribution of the observations given the hindcast [see, e.g., Wilks, 2011]. Its meaning can be illustrated with a linear regression of the observations on the hindcast, see Fig. 1. The slope of this regression line can be expressed by the correlation coefficient and the standard deviations, $m = r_{HO}\, s_O / s_H$. Thus, the conditional bias gives an indication of the ratio of the variances of hindcast and observations; or, in other words: given the hindcasts, it gives information about the distribution of the observations. Figure 1 gives an example for a positive correlation coefficient.
For a positive correlation coefficient a positive/negative conditional bias indicates a smaller/greater variance of the hindcast compared to the observations. For a negative correlation this relationship is reversed.
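The link between the regression slope and the conditional bias can be made concrete in a few lines (toy data again; the ordinary least-squares slope of the observations regressed on the hindcast equals $r_{HO}\, s_O / s_H$):

    import numpy as np

    H = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.7])   # hypothetical hindcasts
    O = np.array([0.3, 0.4, 0.0, 1.0, 0.5, 0.6])   # hypothetical observations

    s_H, s_O = H.std(ddof=0), O.std(ddof=0)
    r_HO = np.mean((H - H.mean()) * (O - O.mean())) / (s_H * s_O)

    slope_ols = np.polyfit(H, O, 1)[0]   # regression slope of O on H
    slope_rel = r_HO * s_O / s_H         # the same slope via correlation and std deviations
    cond_bias = r_HO - s_H / s_O         # vanishes exactly when the slope is 1

    print(np.isclose(slope_ols, slope_rel), cond_bias)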
For the analysis of the hindcast against a reference prediction, the component MurCSS of the central evaluation system (CES) outputs the gain in the conditional bias
$\left| r_{HO} - \dfrac{s_H}{s_O} \right| - \left| r_{RO} - \dfrac{s_R}{s_O} \right|$   (15)
3 Evaluating ensemble spread
3.1 Continuous ranked probability score
In addition to the question whether the ensemble mean is an accurate prediction, it remains to clarify whether the ensemble spread is an appropriate representation of the forecast uncertainty. To this end, Goddard et al. [2013] suggested employing the continuous ranked probability skill score (CRPSS). The CRPSS is a skill score based on the continuous ranked probability score (CRPS), a quadratic measure in probability space [e.g., Wilks, 2011], defined as
$\mathrm{CRPS} = \dfrac{1}{n}\sum_{j=1}^{n} \int_{-\infty}^{\infty} \left[ F_j(x) - \mathcal{H}_{O_j}(x) \right]^2 \mathrm{d}x$   (16)
with $F_j$ being the cumulative distribution function (CDF) derived from the hindcast and, as the observations are not assigned a probability, $\mathcal{H}_{O_j}$ the Heaviside (or switch) function. The Heaviside function is defined as
$\mathcal{H}_{O_j}(x) = \begin{cases} 0 & \text{for } x < O_j \\ 1 & \text{for } x \geq O_j \end{cases}$   (17)
and basically jumps from 0 to 1 at the observation $O_j$. CDFs of hypothetical hindcasts and the Heaviside function are schematically depicted in Fig. 2 (left). The other two panels of Fig. 2 illustrate how the CRPS works: the middle panel gives the difference of the two forecast CDFs and the Heaviside function of an observation $O_j$, and the right panel shows the squared difference.
For this example, we deliberately chose two forecast distributions with the same mean but different variances to illustrate that the CRPS is sensitive to the variance of the forecast probability distribution. For an ensemble prediction system, the forecast distribution is derived from a discrete (and often small) ensemble. This can be done, e.g., by deriving an empirical cumulative distribution function (ECDF) based on the ranks of the individual members, by assuming an underlying Gaussian distribution and estimating its mean and variance, or by various other methods, such as ensemble dressing [Broecker and Smith, 2008]. Goddard et al. [2013] suggested adopting the Gaussian assumption and using the parametric form of the CRPS derived by Gneiting and Raftery [2007]. For a hindcast ensemble with mean $\mu_j$ and variance $\sigma_j^2$ at hindcast dates $j = 1, \dots, n$, it reads
$\mathrm{CRPS} = \dfrac{1}{n}\sum_{j=1}^{n} \sigma_j \left[ z_j \left( 2\Phi(z_j) - 1 \right) + 2\varphi(z_j) - \dfrac{1}{\sqrt{\pi}} \right], \qquad z_j = \dfrac{O_j - \mu_j}{\sigma_j}$   (18)
with $\varphi$ and $\Phi$ denoting the probability density function and the cumulative distribution function of the standard Gaussian, respectively. This setting allows assessing the ensemble spread for the hypothetical case of an unbiased forecast: the ensemble mean is set to the climatological mean derived from the observations, $\mu_j = \bar O$.
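A minimal sketch of the Gaussian closed form of Eq. (18) (not the MurCSS code; it only relies on the standard normal pdf and cdf from SciPy):

    import numpy as np
    from scipy.stats import norm

    def crps_gaussian(mu, sigma, obs):
        # CRPS of a Gaussian forecast N(mu, sigma^2) for observation obs,
        # closed form of Gneiting and Raftery [2007]
        z = (obs - mu) / sigma
        return sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                        + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

    # hypothetical ensemble means, spreads and observations at dates j = 1, ..., n
    mu    = np.array([0.1, 0.4, 0.2])
    sigma = np.array([0.3, 0.3, 0.4])
    obs   = np.array([0.0, 0.5, 0.6])

    print(np.mean(crps_gaussian(mu, sigma, obs)))   # average over hindcast dates, Eq. (18)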
3.2 Continuous ranked probability skill score
As the goal is to assess whether the ensemble spread is a useful measure of the forecast uncertainty, we compare the ensemble forecast to a reference forecast with the desired spread. Goddard et al. [2013] suggested an unbiased reference forecast with spread equal to the MSE between the ensemble mean and the observations; the latter is a valid representation of the forecast uncertainty. Hence, the CRPSS for this particular case reads
$\mathrm{CRPSS} = 1 - \sqrt{\dfrac{\bar\sigma^2}{\mathrm{MSE}_H}}$   (19)
and it basically compares the ensemble spread $\bar\sigma^2$ to the $\mathrm{MSE}_H$. Please note that for the desired case $\bar\sigma^2 = \mathrm{MSE}_H$, the ratio on the rhs is 1 and the CRPSS is 0. Thus, in the case that the ensemble spread compares well to the MSE (our measure of forecast uncertainty), i.e. in the optimal case, the skill score used here attains 0, not 1, as one would typically expect. Figure 3 illustrates this behaviour based on a simulation study with simple stochastic processes emulating the observations and hindcasts.
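Under the simplified reading of Eq. (19) given above (the exact form is an assumption reconstructed from the text; toy ensemble, not the MurCSS implementation), the behaviour at the optimum can be illustrated as follows:

    import numpy as np

    # hypothetical hindcast ensemble: members x hindcast dates
    ens = np.array([[0.1, 0.5, 0.3],
                    [0.3, 0.4, 0.5],
                    [0.2, 0.6, 0.1]])
    obs = np.array([0.2, 0.5, 0.3])

    spread2 = ens.var(axis=0, ddof=0).mean()           # mean ensemble variance
    mse     = np.mean((ens.mean(axis=0) - obs) ** 2)   # MSE of the ensemble mean

    crpss = 1.0 - np.sqrt(spread2 / mse)   # 0 if the spread matches the MSE exactly
    print(crpss)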
[ENSEMBLE SPREAD SCORE HAS BEEN RECENTLY IMPLEMENTED BUT IS NOT DESCRIBED HERE YET.]
4 Verification with MurCSS
4.1 Input
The plugin MurCSS for the CES calculates the MSESS and its decomposition. This section describes the various options of the plugin. It is useful to search the data for the analysis beforehand with solr_search to narrow down the possible option values. Table 1 lists and describes all options.
Table 1: Options of the MurCSS plugin.

Output | Output directory, default: /pf/b/user/evaluation_system/output/murcss
Output plots | Output directory of the produced plots, default: /pf/b/user/evaluation_system/plotst/murcss
Decadals (mandatory) | The experiments to use, e.g. 1960,1965,1970,...,1995, or equivalently 1960:5:1995. To exclude a year (for example 1970) write 1960:5:1995-1970.
Variable (mandatory) | The name of the variable to analyze (CMOR). The website provides a select list with a search function.
Project1 (mandatory) | Can be BASELINE1, BASELINE0 or CMIP5. It is also possible to specify a custom project if the username is included in the evaluation system.
Product1 (mandatory) | Product of Project1, part of the data path.
Institute1 (mandatory) | Institute of Project1, part of the data path.
Model1 (mandatory) | Model of Project1, part of the data path.
Experiment1 (mandatory) | Experiment of Project1, e.g. "decs4e" or "decadal", part of the data path.
Project2 (mandatory) | Project which is compared to Project1. Can be BASELINE1, BASELINE0 or CMIP5. It is also possible to specify a custom project if the username is included in the evaluation system.
Product2 (mandatory) | Product of Project2, part of the data path.
Institute2 (mandatory) | Institute of Project2, part of the data path.
Model2 (mandatory) | Model of Project2, part of the data path.
Experiment2 (mandatory) | Experiment of Project2, e.g. "decs4e" or "decadal", part of the data path.
Leadtimes (mandatory) | Lead times to analyze, as a comma-separated list; time ranges can be hyphenated. Default: 1,2-9
Observation (mandatory) | Observations to compare against, e.g. merra or hadcrut. It is also possible to specify the path of an observation file.
Ensemblemembers (optional) | Subset of the ensemble members to use, as a comma-separated list, for example r1i1p1,r2i1p1,r3i1p1.
Significance (optional) | Whether to calculate significance levels (95%) using a bootstrap method. WARNING: this can take up to one day!
Bootstrap number (optional) | Number of bootstrap runs, used if Significance is true. Default: 500
Level (optional) | Select a certain height for the analysis, for example 700 000 (Pa). At the moment only one level can be specified. Tip: find out common levels in observation and project beforehand.
Lonlatbox (optional) | Specify a region. To select the region 100W-40E and 20N-80N, write -100,40,20,80.
Maskmissingvalues (mandatory) | Whether to mask grid points with missing values in the observation file or not. Default: TRUE
Cache (mandatory) | Path of the working directory. Default: $USER_CACHE_DIR/$SYSTEM_TIMESTAMP
Result grid (optional) | A grid file or a grid description like r20x25 can be specified.
4.2 Output
Here, the output of the plugin is described, for comparing hindcasts against the climatological forecast (Sec. 4.2.1) and hindcasts versus another reference forecast (Sec. 4.2.2).
4.2.1 Hindcast vs. climatological forecast
Table 2 lists the output obtained for the analysis of the hindcasts versus the climatological forecast.
Table 2: Output for the analysis hindcast vs. climatological forecast.

msess | Mean squared error skill score according to Eq. (12)
rmss | Root of the mean squared error skill score according to Eq. (12)
correlation | Correlation coefficient $r_{HO}$ as a measure of the linear relationship between the hindcast and the observations
conditional bias | Measure of the reliability, $r_{HO} - s_H/s_O$
std_ratio | Ratio of the standard deviations of hindcast and observations, $s_H/s_O$ (part of the conditional bias)
biasslope | Slope $m = r_{HO}\, s_O/s_H$ of the linear regression between hindcast and observations; the conditional bias is 0 if $m = 1$
biasslope-1 | Slope of the linear regression between hindcast and observations minus 1, $m - 1$; the conditional bias is 0 if $m - 1 = 0$
4.2.2 Hindcast vs. reference prediction
Table 3 lists the output obtained for the analysis of the hindcasts versus another reference forecast, e.g., the uninitialised historical runs.
Table 3: Output for the analysis hindcast vs. reference prediction.

msess | Mean squared error skill score calculated with Eq. (13)
rmss | Root of the mean squared error skill score calculated with Eq. (13)
correlation | Gain in correlation of the hindcast with respect to the reference prediction
conditional bias | Gain in conditional bias of the hindcast with respect to the reference prediction
biasSkillScore | Improvement of the conditional bias expressed by the slope of the linear regression. Positive values indicate a higher conditional bias with respect to the reference prediction and negative values the opposite.
biasslope-1 | Gain in conditional bias represented by the slope of the linear regression minus 1 of the hindcast with respect to the reference prediction
4.3 Example analysis
This section gives an example analysis with the plugin MurCSS. Suppose we want to estimate the skill of baseline1 against baseline0 for the near-surface air temperature tas with the HadCrut data set as observations. First, solr_search identifies the respective data series; for baseline1 the command looks like the sketch below.
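(A sketch only: the facet=value syntax is assumed from the options listed in Table 1 and may differ on the local CES installation.)

    solr_search project=baseline1 product=output institute=mpi-m model=mpi-esm-lr variable=tas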
The output is a list of the paths of all available data files, for example:
/miklip/integration/data4miklip/model/global/miklip/baseline1/output/MPI-M/MPI-ESM-LR/decs4e1960/mon/atmos/tas/r1i1p1/tas_Amon_MPI-ESM-LR_decs4e1960_r1i1p1_196101-197012.nc
With the knowledge of the directory structure we have made the following adjustments for the analysis Baseline1 vs. Baseline0:
Decadals | 1960:5:1995
Variable | tas
Project1 | baseline1
Product1 | output
Institute1 | mpi-m
Model1 | mpi-esm-lr
Experiment1 | decs4e
Project2 | baseline0
Model2 | mpi-esm-lr
Experiment2 | decadal
Leadtimes | 2-5
Observation | HadCrut
Significance | TRUE
After the analysis is completed, the netCDF files and the produced plots (Sec. 4.2) are available. Some of these plots are depicted in Fig. 4 for the analyses Baseline0/Baseline1 vs. climatology.
[Figure 4: Maps of the MSESS, the correlation, the conditional bias, the biasslope-1 and the standard deviation ratio for Baseline0 vs. climatology (left column) and Baseline1 vs. climatology (right column).]
In the analysis with the climatological forecast as reference (i.e. the variance of the observations in the denominator), the MSESS describes how well the observations can be reproduced. A positive value indicates a good reproduction of the observations, with a perfect hindcast at 1, and negative values indicate the opposite. The near-surface air temperature is better represented by Baseline1 than by Baseline0, especially over the Indian Ocean and Southeast Asia. Neither model, however, reproduces the observations over the Pacific Ocean sufficiently well.
The MSESS can be decomposed into the correlation and the conditional bias (see Sec. 2.3). High positive values of the correlation indicate a good accordance of the model and the observations. High negative values suggest a strong linear relationship as well, but the opposite sign indicates a poor accordance of model and observations. The improved MSESS of Baseline1 with respect to Baseline0 is mainly produced by a higher correlation coefficient at most grid points.
The observations are best represented with a conditional bias of 0. Figure 4 shows that Baseline1 has the lowest conditional bias, but there are still some areas, for example the Pacific Ocean and the northern Atlantic, with a pronounced conditional bias.
Another approach to understanding the conditional bias and its relationship to the standard deviations of the hindcast and the observations is the slope of the linear regression. In Fig. 4 the biasslope-1 is depicted as well. It is noticeable that the patterns are quite similar to those of the conditional bias. For a positive correlation coefficient, a positive slope and a positive conditional bias suggest a higher standard deviation of the observations with respect to that of the hindcast. If the correlation coefficient is negative, a positive slope and conditional bias indicate a higher standard deviation of the hindcast. For the near-surface air temperature, the correlation and the conditional bias are positive in Central Europe for Baseline1 and Baseline0, which suggests a higher standard deviation of the observations. The negative correlation and conditional bias over the Indian Ocean in Baseline0 indicate a higher standard deviation of the observations as well, whereas in Baseline1 the variance of the hindcast is higher, owing to the positive correlation and the negative conditional bias.
The ratio of the standard deviations is also shown in Fig. 4.
In Fig. 5 the output of MurCSS is depicted for the analysis Baseline1 vs. Baseline0. A positive value of the MSESS suggests an improved accuracy of the hindcast compared to the reference and a negative value indicates the opposite.
[Figure 5: Maps for the analysis Baseline1 vs. Baseline0: a) MSESS, b) gain in correlation, c) gain in conditional bias, d) gain in biasslope-1, e) bias skill score.]
In Fig. 5 a) it can be seen that Baseline1 is more accurate in most regions. Only in a few small areas, for example in South America or the Labrador Sea, is the accordance of Baseline0 with the observations higher. Fig. 5 b) shows the gain in the correlation coefficient of Baseline1 compared to Baseline0; positive values indicate an improved correlation of Baseline1. The analysis of the near-surface air temperature shows an improved accuracy of Baseline1 with respect to Baseline0 for almost every grid point as well. In Fig. 5 c) the gain in the conditional bias is illustrated. Here it is noticeable that negative values represent a reduction of the bias and thus an improved accuracy. In the Labrador Sea, for example, the conditional bias is higher than in Baseline0, whereas the correlation is slightly increased; overall, the contribution of the bias outweighs the effect of the correlation, and thus the MSESS shows a decreased accuracy in the Labrador Sea.
An additional approach to illustrating the conditional bias is the gain in the biasslope-1, which is shown in Fig. 5 d). In this figure, negative values suggest a reduction of the conditional bias as well; thus the pattern is quite similar to Fig. 5 c). Due to the calculation of the gain with absolute values ($|m_H - 1| - |m_R - 1|$), a statement about the gain in the standard deviations is no longer possible.
A third method to illustrate the conditional bias is the bias skill score, Fig. 5 e). This time, positive values suggest a higher conditional bias of Baseline1 compared to Baseline0 and negative values the opposite. Nevertheless, Figs. 5 c)-e), all illustrating the conditional bias, show similar patterns.
References
J. Broecker and L. A. Smith. From ensemble forecasts to predictive distribution functions. Tellus A, 60(4):663 ff, 2008.
T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc., 102(477):359–378, 2007.
L. Goddard, A. Kumar, A. Solomon, D. Smith, G. Boer, P. Gonzalez, V. Kharin, W. Merryfield, C. Deser, S.J. Mason, B.P. Kirtman, R. Msadek, R. Sutton, E. Hawkins, T. Fricker, G. Hegerl, C.A.T. Ferro, D.B. Stephenson, G.A. Meehl, T. Stockdale, R. Burgman, A.M. Greene, Y. Kushnir, M. Newman, J. Carton, I. Fukumori, and T. Delworth. A verification framework for interannual-to-decadal predictions experiments. Climate Dynamics, 40:245–272, 2013. ISSN 0930-7575. doi: 10.1007/s00382-012-1481-2. URL http://dx.doi.org/10.1007/s00382-012-1481-2.
I. T. Jolliffe and D. B. Stephenson, editors. Forecast Verification. Wiley-Blackwell, second edition, 2012.
B. Kirtman, S. B. Power, J. A. Adedoyin, G. J. Boer, R. Bojariu, I. Camilloni, F. J. Doblas-Reyes, A. M. Fiore, M. Kimoto, G. A. Meehl, M. Prather, A. Sarr, C. Schär, R. Sutton, G. J. van Oldenborgh, G. Vecchi, and H. J. Wang. Near-term climate change: Projections and predictability. In T. F. Stocker, D. Qin, G.-K. Plattner, M. Tignor, S. K. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex, and P. M. Midgley, editors, Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, 2013.
A. H. Murphy. Skill scores based on the mean square error and their relationships to the correlation coefficient. Mon. Wea. Rev., 116(12):2417–2424, December 1988. ISSN 0027-0644. URL http://dx.doi.org/10.1175/1520-0493(1988)116<2417:SSBOTM>2.0.CO;2.
D. S. Wilks. Statistical methods in the atmospheric sciences. Academic Press, San Diego, CA, 3rd edition, 2011.