null hypothesis can finally be rejected by including enough poor models. The RC also has unnecessarily low power, which can be driven to zero by the inclusion of ``silly'' models. Hansen (2001) concludes that the RC can mislead the researcher into believing that a class of competing models provides no real forecasting improvement, even though one of the models is in fact a superior forecasting model. Hansen (2001) therefore applies the similarity condition within the framework of White (2000) to derive a test for superior predictive ability (SPA), which reduces the influence of poorly performing strategies on the critical values. This test is unbiased and more powerful than the RC. The null hypothesis is that none of the alternative models is superior to the benchmark model; the alternative hypothesis is that at least one of the alternative models is superior to the benchmark model. The SPA-test p-value is determined by comparing the test statistic (3.1) to the quantiles of

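\[
\bar{V}_b^{*} = \max_{k=1,\ldots,K} \left\{ \sqrt{M}\,\bigl( \bar{f}_{k,b}^{*} - g(\bar{f}_k) \bigr) \right\}, \tag{3.3}
\]
where
\[
g(\bar{f}_k) =
\begin{cases}
0, & \text{if } \bar{f}_k \leq -A_k = -\frac{1}{4} M^{-1/4} \sqrt{\widehat{\mathrm{var}}(M^{1/2} \bar{f}_k)}, \\
\bar{f}_k, & \text{otherwise}.
\end{cases} \tag{3.4}
\]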
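The correction factor $A_k$ depends on an estimate of $\mathrm{var}(M^{1/2} \bar{f}_k)$. A simple estimate can be calculated from the bootstrap resamples as
\[
\widehat{\mathrm{var}}(M^{1/2} \bar{f}_k) = \frac{1}{B} \sum_{b=1}^{B} \left( M^{1/2} \bar{f}_{k,b}^{*} - M^{1/2} \bar{f}_k \right)^2 .
\]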
Equations (3.3) and (3.4) ensure that poor and irrelevant strategies cannot have a large impact on the SPA-test p-value, because (3.4) filters these kinds of strategies out of the strategy set.
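The filtering step can be illustrated with a minimal Python sketch, assuming NumPy. It assumes that the sample performance measures of the K models relative to the benchmark and their B bootstrap counterparts are already available as arrays. The function names, argument names and array layout are hypothetical and chosen only for this illustration. The test statistic (3.1) is taken as a given input, and the p-value is computed as the fraction of bootstrap maxima exceeding it, which is the usual bootstrap convention and is an assumption here.

```python
import numpy as np


def spa_bootstrap_quantities(f_bar, f_star, M):
    """Compute A_k, g(f_k) and the bootstrap maxima V*_b of (3.3)-(3.4).

    f_bar  : shape (K,), performance of each model relative to the benchmark
    f_star : shape (B, K), bootstrap counterparts of f_bar
    M      : number of observations (hypothetical argument name)
    """
    f_bar = np.asarray(f_bar, dtype=float)
    f_star = np.asarray(f_star, dtype=float)
    root_m = np.sqrt(M)

    # Bootstrap estimate of var(M^(1/2) f_k): average squared deviation
    # of the scaled bootstrap values from the scaled sample value.
    var_hat = np.mean((root_m * f_star - root_m * f_bar) ** 2, axis=0)

    # Correction factor A_k = (1/4) M^(-1/4) sqrt(var_hat).
    A = 0.25 * M ** (-0.25) * np.sqrt(var_hat)

    # Equation (3.4): models with f_k <= -A_k are recentred at 0
    # instead of f_k, so they rarely determine the maximum in (3.3).
    g = np.where(f_bar <= -A, 0.0, f_bar)

    # Equation (3.3): maximum over the K models for each bootstrap draw.
    V_star = np.max(root_m * (f_star - g), axis=1)
    return A, g, V_star


def bootstrap_p_value(v_stat, V_star):
    """Fraction of bootstrap maxima exceeding the statistic (assumed convention)."""
    return float(np.mean(np.asarray(V_star) > v_stat))
```

A poor model has a strongly negative performance, so after recentring at zero its bootstrap term in (3.3) stays far below zero and it seldom sets the maximum. Recentring at its own performance, as the RC does, would instead let its bootstrap noise compete with that of the good models.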

Hansen (2001) uses the RC and the SPA-test to evaluate forecasting models applied to US annual inflation in the period 1952 through 2000. The forecasting models are linear regression models with fundamental variables, such as employment, inventory, interest, fuel and food prices, as regressors. The benchmark model is a random walk, and the performance measure is the mean absolute deviation. Hansen (2001) shows that the null hypothesis is rejected by neither the SPA-test p-value nor the RC p-value, but that the two p-values differ considerably in magnitude, which is likely caused by the inclusion of poor models in the space of forecasting models.
