macroforecast.tests#


macroforecast.tests owns forecast-comparison tests and residual diagnostics. It does not compute general scoring tables, fit models, or choose windows.

Use the namespace form:

import macroforecast as mf

mf.tests.dm_test(loss_a, loss_b, horizon=1)

Top-level shortcuts such as mf.dm_test(...) are intentionally not exported.

TestResult#

Most pairwise forecast-comparison tests return TestResult.

macroforecast.tests.TestResult(
    statistic,
    p_value,
    decision,
    alternative,
    correction_policy=None,
    n_obs=None,
    metadata={},
)

| Field | Meaning |
| --- | --- |
| statistic | Test statistic, or None when the sample is too small or degenerate. |
| p_value | P-value, or None when unavailable. |
| decision | True when the null is rejected at the supplied alpha. |
| alternative | two_sided or one_sided. |
| correction_policy | HAC or small-sample correction label. |
| n_obs | Number of aligned observations used. |
| metadata | Test-specific details. |

Methods:

| Method | Output |
| --- | --- |
| to_dict() | JSON-ready dictionary with metadata_schema.kind="forecast_test_result". |
| to_json(path=None) | JSON text and optional file write. |
| summary() | Compact string summary. |

Custom Tests#

custom_test#

macroforecast.tests.custom_test(
    name,
    func,
    *args,
    alternative="two_sided",
    alpha=0.05,
    correction_policy=None,
    metadata=None,
    **params,
) -> TestResult

Runs a user-supplied forecast test and coerces the result to TestResult.

The callable receives *args and **params. It may return:

| Return type | Meaning |
| --- | --- |
| TestResult | Used directly, with custom metadata merged. |
| mapping | Must contain statistic or stat, and may contain p_value/pvalue, decision, alternative, correction_policy, n_obs, and metadata. |
| (statistic, p_value) | Decision is p_value < alpha. |
| (statistic, p_value, n_obs) | Same as above plus sample size. |

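import pandas as pd
import macroforecast as mf

# loss_a and loss_b are aligned per-period loss series defined elsewhere.

def sign_test_stat(loss_a, loss_b):
    # Share of periods in which model A has the lower loss.
    diff = pd.Series(loss_a).sub(pd.Series(loss_b)).dropna()
    return {
        "statistic": float((diff < 0).mean()),
        "p_value": 0.04,  # placeholder value; compute a real p-value in practice
        "n_obs": len(diff),
    }

result = mf.tests.custom_test(
    "sign_loss_test",
    sign_test_stat,
    loss_a,
    loss_b,
)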

custom_test() records the callable name, parameters, alpha, and custom=True in result.metadata.
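The coercion rules in the table above can be sketched in plain Python. This is an illustration only: SketchResult and coerce are hypothetical names, not part of macroforecast, and deriving decision from p_value for mapping returns is an assumption (the source states that rule only for tuple returns).

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for TestResult; not the package class."""
    statistic: Optional[float]
    p_value: Optional[float]
    decision: Optional[bool]
    n_obs: Optional[int] = None


def coerce(raw: Any, alpha: float = 0.05) -> SketchResult:
    """Map a mapping or tuple return value onto a result record."""
    if isinstance(raw, Mapping):
        # Accept either spelling of the statistic and p-value keys.
        stat = raw["statistic"] if "statistic" in raw else raw["stat"]
        p_value = raw.get("p_value", raw.get("pvalue"))
        decision = raw.get("decision")
        n_obs = raw.get("n_obs")
    else:
        # (statistic, p_value) or (statistic, p_value, n_obs)
        stat, p_value, *rest = raw
        decision = None
        n_obs = rest[0] if rest else None
    if decision is None and p_value is not None:
        decision = p_value < alpha
    return SketchResult(stat, p_value, decision, n_obs)
```

With alpha=0.05, coerce((1.2, 0.04)) yields decision=True, while a mapping with pvalue=0.2 yields decision=False.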

Equal Predictive Accuracy#

dm_test#

macroforecast.tests.dm_test(
    loss_a,
    loss_b,
    *,
    horizon=1,
    correction="hln",
    kernel="acf",
    input_type="loss",
    power=2.0,
    alternative="two_sided",
    alpha=0.05,
)

Input: two aligned loss series by default. Set input_type="error" to match forecast::dm.test(e1, e2, h, power, varestimator) from the R forecast package: the function then computes abs(e1)^power - abs(e2)^power internally. Output: TestResult for the Diebold-Mariano equal predictive accuracy test. correction="hln" applies the Harvey-Leybourne-Newbold small-sample correction. P-values use a Student-t reference distribution with df=n-1, matching forecast/R/DM2.R::dm.test.

kernel="acf" matches the R varestimator="acf" autocovariance estimator. kernel="bartlett" or "newey_west" uses the Bartlett-weighted estimator, matching the R varestimator="bartlett" option.

R/source alignment:

| Setting | Alignment |
| --- | --- |
| input_type="error", correction="hln", kernel="acf" | Same statistic and Student-t p-value as forecast/R/DM2.R::dm.test(varestimator="acf"). |
| input_type="error", correction="hln", kernel="bartlett" or "newey_west" | Same statistic and Student-t p-value as forecast/R/DM2.R::dm.test(varestimator="bartlett"). |
| input_type="loss" | Uses the same DM statistic after accepting precomputed losses. This is convenient for custom losses, but it is not a direct call-equivalent to R forecast::dm.test(e1, e2). |
| correction="none" | Omits the Harvey-Leybourne-Newbold small-sample factor used by forecast::dm.test. |
| kernel="parzen" or "andrews" | Macroforecast extension. These HAC estimators are not options in R forecast::dm.test. |

Returned metadata includes statistic_type="t", null_hypothesis="equal predictive accuracy", p_value_status, p_value_reference, source_reference, r_reference, r_alignment, and r_argument_mapping.
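As a rough illustration of the input_type="error", correction="hln", kernel="acf" setting, the sketch below computes the HLN-corrected DM statistic in the form used by R forecast::dm.test. The function name dm_hln_statistic is hypothetical, the code is not the package implementation, and the Student-t p-value (df = n - 1) is left out because the standard library has no t distribution.

```python
import math


def dm_hln_statistic(e1, e2, horizon=1, power=2.0):
    """HLN-corrected Diebold-Mariano statistic from two forecast-error series."""
    # Loss differential: |e1|^power - |e2|^power
    d = [abs(a) ** power - abs(b) ** power for a, b in zip(e1, e2)]
    n = len(d)
    mean_d = sum(d) / n

    def autocov(k):
        # Sample autocovariance at lag k with divisor n (as in R acf).
        return sum((d[t] - mean_d) * (d[t - k] - mean_d) for t in range(k, n)) / n

    # Variance of the mean from autocovariances at lags 0..horizon-1.
    var_mean = (autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))) / n
    dm = mean_d / math.sqrt(var_mean)
    # Harvey-Leybourne-Newbold small-sample factor.
    hln = math.sqrt((n + 1 - 2 * horizon + horizon * (horizon - 1) / n) / n)
    return dm * hln
```

A positive value means the first error series has the larger average loss. For horizon > 1 the unweighted autocovariance sum can be non-positive in small samples, in which case the square root is undefined.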

gw_test#

macroforecast.tests.gw_test(
    loss_a,
    loss_b,
    *,
    horizon=1,
    correction="hln",
    kernel="acf",
    input_type="loss",
    power=2.0,
    alternative="two_sided",
)

Input: two aligned loss series. Output: TestResult using the package’s Giacomini-White-compatible loss-differential surface. This callable reuses the DM/HLN loss-differential statistic on aligned inputs, which preserves the legacy callable surface.

Source boundary: gw_test() does not claim exact R-package alignment. Conditional predictive ability with time-varying fluctuation paths is exposed separately through conditional_predictive_ability_test(...).

dmp_test#

macroforecast.tests.dmp_test(
    loss_differences,
    *,
    kernel="newey_west",
    alpha=0.05,
)

Input: one loss-difference series or a sequence of loss-difference series. Output: TestResult for a stacked Diebold-Mariano-Pesaran-style joint test.

The test stacks finite loss-difference values, computes a HAC standard error for the stacked mean, and reports a two-sided standard-normal p-value. No exact R-package comparator is claimed in the checked R sources. Metadata records statistic_type="z", null_hypothesis, p_value_status, p_value_reference, source_reference, and r_alignment.

equal_predictive_tests#

macroforecast.tests.equal_predictive_tests(
    loss_a,
    loss_b,
    *,
    tests=("dm", "gw", "dmp"),
    error_a=None,
    error_b=None,
    horizon=1,
    correction="hln",
    kernel="acf",
    alpha=0.05,
) -> pandas.DataFrame

Runs multiple equal-predictive-ability tests and stacks one row per test. Supported names are dm, gw, dmp, and hn. hn requires error_a and error_b because Harvey-Newbold is an encompassing test on forecast errors.

Output: a pandas.DataFrame with one row per requested test. The table keeps the full component metadata in the metadata column and also promotes the paper-facing fields below to top-level columns.

| Column | Meaning |
| --- | --- |
| test, name | Requested key and display name. |
| statistic_type, statistic | Reference family (t or z) and test statistic. |
| p_value, p_value_status, p_value_reference | P-value, availability flag, and reference distribution. |
| decision, alternative, null_hypothesis | Rejection flag, alternative direction, and null statement. |
| correction_policy, n_obs | Small-sample/HAC policy and aligned observation count. |
| source_reference, external_reference, r_reference, r_alignment | Provenance and source-comparison fields. |
| metadata | Full TestResult.metadata dictionary for the component test. |

Current source alignment by row:

| Test | R/source status |
| --- | --- |
| dm | Exact forecast::dm.test alignment only under the settings listed in dm_test. |
| gw | Legacy GW-compatible DM-style surface; no exact R comparator claimed. |
| dmp | Macroforecast stacked HAC screen; no exact R comparator claimed. |
| hn | Legacy encompassing covariance approximation; not forecast::dm.test. |

For paper output, pass this table to macroforecast.reporting.test_report_table(...). For an appendix/audit table that spells out source and R alignment, use macroforecast.reporting.test_provenance_table(...).

harvey_newbold_test#

macroforecast.tests.harvey_newbold_test(
    error_a,
    error_b,
    *,
    horizon=1,
    kernel="newey_west",
    small_sample=True,
    alpha=0.05,
)

Input: two forecast-error series. Output: one-sided TestResult for the legacy forecast-error covariance approximation.

Source note: this is not forecast::dm.test. The R forecast package function implements Harvey-Leybourne-Newbold Diebold-Mariano equal-accuracy testing. harvey_newbold_test() remains a callable encompassing-style covariance approximation and records that distinction in result.metadata.

The callable forms d_t = e_a,t * (e_a,t - e_b,t), computes a HAC standard error, optionally applies an HLN-style small-sample factor, and reports a one-sided Student-t upper-tail p-value. Metadata records statistic_type="t", p_value_status, p_value_reference, source_reference, r_reference=None, and r_alignment.

Alias: hn_test.

Nested And Encompassing Tests#

clark_west_test#

macroforecast.tests.clark_west_test(
    loss_small,
    loss_large,
    forecast_small,
    forecast_large,
    *,
    horizon=1,
    cw_adjustment=True,
    kernel="newey_west",
    alpha=0.05,
)

Input: small-model loss, large-model loss, and both forecast series. Output: one-sided TestResult for the Clark-West nested forecast comparison.

Statistic:

q_t = e_r,t^2 - e_u,t^2 + (f_r,t - f_u,t)^2
z = mean(q_t) / sqrt(LRV(q_t) / n)

Here r is the restricted/small model and u is the unrestricted/large model. The implementation follows the standard adjusted MSPE differential used by Clark-West references such as GAUSS cwTest and HypothesisTests.jl ClarkWestTest. Archived R examples can differ by sign convention, so this page treats the formula above as the package contract.
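The formula above can be traced with a short standard-library sketch. The helper names bartlett_lrv and clark_west_z are hypothetical, and the Bartlett-weighted long-run variance with divisor n is an assumption about what LRV(q_t) denotes; it is not taken from the package source.

```python
import math


def bartlett_lrv(x, lags=0):
    """Bartlett-weighted long-run variance (assumed convention, divisor n)."""
    n = len(x)
    mean_x = sum(x) / n

    def autocov(k):
        return sum((x[t] - mean_x) * (x[t - k] - mean_x) for t in range(k, n)) / n

    return autocov(0) + 2.0 * sum(
        (1.0 - k / (lags + 1)) * autocov(k) for k in range(1, lags + 1)
    )


def clark_west_z(e_r, e_u, f_r, f_u, lags=0):
    """z = mean(q_t) / sqrt(LRV(q_t) / n) with the adjusted MSPE differential q_t."""
    # q_t = e_r,t^2 - e_u,t^2 + (f_r,t - f_u,t)^2
    q = [er ** 2 - eu ** 2 + (fr - fu) ** 2
         for er, eu, fr, fu in zip(e_r, e_u, f_r, f_u)]
    n = len(q)
    return (sum(q) / n) / math.sqrt(bartlett_lrv(q, lags) / n)
```

The adjustment term (f_r,t - f_u,t)^2 is always non-negative, so q_t is never smaller than the raw squared-error differential e_r,t^2 - e_u,t^2.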

Alias: cw_test.

enc_new_test#

macroforecast.tests.enc_new_test(
    error_small,
    error_large,
    *,
    critical_value=None,
    alpha=0.05,
)

Input: restricted/small-model forecast errors and unrestricted/large-model forecast errors. Output: one-sided TestResult.

Statistic:

c_t = e_r,t * (e_r,t - e_u,t)
ENC-NEW = n * mean(c_t) / mean(e_u,t^2)

Default p_value is None because Clark-McCracken nested forecast encompassing tests have nonstandard distributions. Pass a design-appropriate critical_value to get a boolean decision.

enc_t_test#

macroforecast.tests.enc_t_test(
    error_small,
    error_large,
    *,
    horizon=1,
    kernel="newey_west",
    critical_value=None,
    normal_approximation=False,
    alpha=0.05,
)

Input: restricted/small-model forecast errors and unrestricted/large-model forecast errors. Output: one-sided TestResult.

Statistic:

c_t = e_r,t * (e_r,t - e_u,t)
ENC-T = mean(c_t) / sqrt(LRV(c_t) / n)

Default p_value is None. Set normal_approximation=True only for diagnostic screening, or pass critical_value for a design-specific decision.
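ENC-NEW and ENC-T share the same series c_t and differ only in scaling. The sketch below computes both from the formulas on this page; encompassing_stats is a hypothetical name, and the lag-0 long-run variance with divisor n is an assumption that corresponds to horizon=1.

```python
import math


def encompassing_stats(e_r, e_u):
    """Return (ENC-NEW, ENC-T) for restricted errors e_r and unrestricted errors e_u."""
    n = len(e_r)
    # c_t = e_r,t * (e_r,t - e_u,t)
    c = [er * (er - eu) for er, eu in zip(e_r, e_u)]
    c_bar = sum(c) / n
    mse_u = sum(eu ** 2 for eu in e_u) / n
    enc_new = n * c_bar / mse_u
    # Lag-0 long-run variance of c_t (assumed convention for horizon=1).
    lrv0 = sum((ct - c_bar) ** 2 for ct in c) / n
    enc_t = c_bar / math.sqrt(lrv0 / n)
    return enc_new, enc_t
```

Neither value comes with a p-value here, in line with the nonstandard limiting distributions noted above.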

nested_tests#

macroforecast.tests.nested_tests(
    loss_small,
    loss_large,
    *,
    forecast_small=None,
    forecast_large=None,
    error_small=None,
    error_large=None,
    tests=("clark_west", "enc_new", "enc_t"),
    horizon=1,
    kernel="newey_west",
    enc_critical_value=None,
    enc_normal_approximation=False,
    alpha=0.05,
) -> pandas.DataFrame

Runs multiple nested-model tests and stacks one row per test. Clark-West requires forecast_small and forecast_large; enc_new and enc_t require error_small and error_large. This separation is intentional because Clark-West is an adjusted MSPE differential while ENC-NEW and ENC-T are forecast-error encompassing covariance statistics.

Directional Accuracy Tests#

directional_accuracy_test#

macroforecast.tests.directional_accuracy_test(
    y_true,
    y_pred,
    *,
    threshold=0.0,
    method="pesaran_timmermann",
    alpha=0.05,
)

Input: realized values and forecasts. Output: TestResult. Supported methods are pesaran_timmermann, anatolyev_gerko, and henriksson_merton.

The pesaran_timmermann and anatolyev_gerko branches are aligned with R tstests/R/dac.R::dac_test and rugarch/R/rugarch-tests.R::DACTest. The p-value is a one-sided upper-tail normal p-value, 1 - Phi(statistic). Forecasts that are constant after subtracting threshold are rejected as invalid input because the directional tests are undefined for a constant-sign forecast.

Options:

| Option | Default | Choices | Meaning |
| --- | --- | --- | --- |
| threshold | 0.0 | numeric | Values above this threshold are positive-direction observations. |
| method | "pesaran_timmermann" | "pesaran_timmermann", "anatolyev_gerko", "henriksson_merton" | Directional statistic to compute. |
| alpha | 0.05 | probability in (0, 1) | Rejection level. |

Method notes:

| Method | Null | Statistic input |
| --- | --- | --- |
| pesaran_timmermann | No sign predictability. | Exact R alignment with .pt_test / DACTest(test="PT"): sign hit rate versus independence-implied sign hit rate. |
| anatolyev_gerko | No excess profitability. | Exact R alignment with .ag_test / DACTest(test="AG"): sign(forecast) * actual excess profitability, using raw actual and forecast values after threshold subtraction. |
| henriksson_merton | No market-timing skill. | Macroforecast extension. No exact comparator in tstests::dac_test or rugarch::DACTest; statistic is based on up/down conditional hit rates. |

R/source alignment:

| Branch | R comparator | Notes |
| --- | --- | --- |
| pesaran_timmermann | tstests/R/dac.R::.pt_test; rugarch/R/rugarch-tests.R::DACTest(test="PT") | Uses x_t=1{actual>0}, y_t=1{forecast>0}, z_t=1{forecast*actual>0}, and p.value=1-pnorm(statistic). |
| anatolyev_gerko | tstests/R/dac.R::.ag_test; rugarch/R/rugarch-tests.R::DACTest(test="AG") | Uses r_t=sign(forecast)*actual, excess-profitability variance V_EP, and p.value=1-pnorm(statistic). |
| henriksson_merton | None | Kept as a callable screening diagnostic, not claimed as an R-package-aligned DAC branch. |

Zero rule: R uses strict positivity, actual > 0 and forecast > 0. macroforecast applies the same strict rule after subtracting threshold, so values equal to threshold are treated as non-positive.
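The Pesaran-Timmermann branch, including the strict zero rule, can be sketched with the standard library. The function below is a hypothetical illustration that follows the indicator definitions in the table above and the textbook PT variance terms; it is not the package code.

```python
import math
from statistics import NormalDist


def pesaran_timmermann(y_true, y_pred, threshold=0.0):
    """Return (statistic, one-sided upper-tail p-value)."""
    actual = [v - threshold for v in y_true]
    forecast = [v - threshold for v in y_pred]
    n = len(actual)
    # Strict positivity: values equal to the threshold count as non-positive.
    p_actual = sum(a > 0 for a in actual) / n
    p_forecast = sum(f > 0 for f in forecast) / n
    p_hit = sum(a * f > 0 for a, f in zip(actual, forecast)) / n
    # Hit rate implied by independence of actual and forecast signs.
    p_star = p_actual * p_forecast + (1 - p_actual) * (1 - p_forecast)
    var_hit = p_star * (1 - p_star) / n
    var_star = (
        (2 * p_actual - 1) ** 2 * p_forecast * (1 - p_forecast) / n
        + (2 * p_forecast - 1) ** 2 * p_actual * (1 - p_actual) / n
        + 4 * p_actual * p_forecast * (1 - p_actual) * (1 - p_forecast) / n ** 2
    )
    stat = (p_hit - p_star) / math.sqrt(var_hit - var_star)
    return stat, 1.0 - NormalDist().cdf(stat)
```

A forecast with constant sign makes the variance difference zero, which is why such input has no defined statistic.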

Aliases:

| Alias | Equivalent call |
| --- | --- |
| pesaran_timmermann_test(y_true, y_pred) | directional_accuracy_test(..., method="pesaran_timmermann") |
| anatolyev_gerko_test(y_true, y_pred) | directional_accuracy_test(..., method="anatolyev_gerko") |
| henriksson_merton_test(y_true, y_pred) | directional_accuracy_test(..., method="henriksson_merton") |

Density And Interval Diagnostics#

density_interval_tests#

macroforecast.tests.density_interval_tests(
    pit,
    *,
    alpha=0.05,
    n_bins=10,
    pit_lag=1,
)

Input: probability integral transform values in [0, 1]. Output: JSON-ready dictionary with metadata_schema.kind="density_interval_tests" plus Berkowitz, KS, Kupiec POF, Christoffersen independence, Engle-Manganelli DQ, Du-Escanciano shortfall, PIT histogram, and PIT autocorrelation diagnostics.

Options:

| Option | Default | Meaning |
| --- | --- | --- |
| alpha | 0.05 | Tail probability for VaR/shortfall-style hit tests. |
| n_bins | 10 | Number of PIT histogram bins. |
| pit_lag | 1 | Lag used for PIT autocorrelation, Berkowitz AR lag, and Du-Escanciano conditional shortfall lag. |

Output keys:

| Key | Meaning |
| --- | --- |
| berkowitz | Berkowitz density LR test plus Jarque-Bera normality check after normal score transform. |
| ks | Kolmogorov-Smirnov test against uniform PIT. |
| kupiec_pof | Unconditional hit-rate test at alpha. |
| christoffersen_independence | Markov independence test for hits. |
| engle_manganelli_dq | PIT hit-only DQ proxy. Use dynamic_quantile_test(...) for the full Engle-Manganelli VaR DQ test. |
| du_escanciano_shortfall | Du-Escanciano unconditional and conditional shortfall tests. |
| pit_histogram | One record per histogram bin. |
| pit_autocorrelation | TestResult dictionary for serial PIT dependence. |
| r_reference, r_alignment | Composite provenance metadata. Component-level diagnostics also carry their own R/source metadata. |

R/source alignment:

| Diagnostic | Reference |
| --- | --- |
| Berkowitz | tstests/R/berkowitz.R::berkowitz_test: PIT to normal scores, ARIMA(pit_lag,0,0) unrestricted likelihood versus Normal(0,1); LR df is 2 + pit_lag. |
| Du-Escanciano shortfall | tstests/R/shortfall_de.R::shortfall_de_test: cumulative tail shortfall mean test and portmanteau test on centered tail shortfall autocorrelations. |
| Kupiec/Christoffersen | tstests/R/var_cp.R::var_cp_test and rugarch/R/rugarch-tests.R: Bernoulli/transition likelihood-ratio construction. |
| PIT hit-only DQ proxy | No direct R comparator. It is a PIT-hit lag diagnostic inside this composite wrapper, not the full Engle-Manganelli VaR DQ test. |

Boundary handling: values outside [0, 1] raise. Boundary PIT values 0 and 1 are accepted as PIT values but clipped internally for the normal-score Berkowitz transform to avoid infinite ARIMA inputs.

shortfall_de_test#

macroforecast.tests.shortfall_de_test(
    pit,
    *,
    alpha=0.05,
    lags=1,
    boot=False,
    n_boot=2000,
    random_state=0,
) -> dict

Input: PIT values in [0, 1]. Output: JSON-ready dictionary with metadata_schema.kind="shortfall_de_test".

The unconditional statistic is the sample mean of cumulative tail shortfall, mean((alpha - pit) * 1{pit <= alpha} / alpha). The conditional statistic is a portmanteau statistic on autocorrelations of that series centered by alpha / 2. With boot=False, the unconditional p-value uses the Du-Escanciano normal approximation and the conditional p-value uses Chi-squared(lags). With boot=True, both p-values use simulated uniform PIT draws with the same sample size.
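A minimal sketch of the unconditional branch with boot=False follows. Under uniform PIT values the series (alpha - pit) * 1{pit <= alpha} / alpha has mean alpha / 2 and variance alpha * (1/3 - alpha/4), which gives the standardization used below. The function name is hypothetical, and the two-sided p-value is an assumption about the reference convention rather than something stated on this page.

```python
import math
from statistics import NormalDist


def shortfall_unconditional(pit, alpha=0.05):
    """Return (mean tail shortfall, z-score, two-sided normal p-value)."""
    # Cumulative tail shortfall: positive only when the PIT falls in the alpha tail.
    h = [(alpha - u) / alpha if u <= alpha else 0.0 for u in pit]
    n = len(h)
    mean_h = sum(h) / n
    # Center by alpha / 2 and scale by the variance implied by uniform PITs.
    z = math.sqrt(n) * (mean_h - alpha / 2.0) / math.sqrt(alpha * (1.0 / 3.0 - alpha / 4.0))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return mean_h, z, p_value
```

A sample with no PIT values in the tail gives mean_h = 0 and a negative z-score, signalling less tail shortfall than a calibrated density would produce.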

dynamic_quantile_test#

macroforecast.tests.dynamic_quantile_test(
    y_true,
    var,
    *,
    alpha=0.05,
    lag=1,
    lag_hit=1,
    lag_var=1,
) -> TestResult

Input: realized values and one-step-ahead lower-tail VaR forecasts. Output: TestResult for the Engle-Manganelli dynamic quantile test.

This is the full VaR DQ callable. It is separate from density_interval_tests(...) because the exact DQ statistic needs realized values and VaR forecasts, not PIT values alone.

R/source alignment: segMGarch/R/DQtest.R::DQtest. The hit series is 1 - alpha when y_true < var and -alpha otherwise. The regressor matrix contains a constant, lag-aligned VaR forecasts, lag_hit lagged hit columns, and lagged squared realized values. The statistic is Hit' X (X'X)^(-1) X' Hit / (alpha * (1 - alpha)), with a chi-squared reference distribution using the number of columns of X.

R argument mapping: segMGarch::DQtest names the VaR probability VaR_level and converts it internally to the lower-tail probability 1 - VaR_level. macroforecast accepts the lower-tail probability directly as alpha; therefore a 5% lower-tail VaR is alpha=0.05, corresponding to VaR_level=0.95 in the R function.
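The hit-series definition and the argument mapping can be illustrated in a few lines. Both helper names are hypothetical, and the sketch stops before the regression step of the DQ statistic.

```python
def dq_hits(y_true, var, alpha=0.05):
    """Demeaned hit series: 1 - alpha on a VaR violation, -alpha otherwise."""
    return [(1.0 - alpha) if y < v else -alpha for y, v in zip(y_true, var)]


def alpha_from_var_level(var_level):
    """Convert an R-style VaR_level (e.g. 0.95) to the lower-tail alpha used here."""
    return 1.0 - var_level
```

With correctly calibrated VaR forecasts, violations occur with probability alpha, so the hit series has expectation alpha * (1 - alpha) + (1 - alpha) * (-alpha) = 0.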

Source: https://rdrr.io/cran/segMGarch/src/R/DQtest.R

Options:

| Option | Default | Meaning |
| --- | --- | --- |
| alpha | 0.05 | Lower-tail probability. A 5% VaR uses alpha=0.05. |
| lag | 1 | Lag used for squared realized values. |
| lag_hit | 1 | Number of lagged hit columns. |
| lag_var | 1 | Lag alignment for VaR forecasts. |

pit_histogram#

macroforecast.tests.pit_histogram(pit, *, n_bins=10) -> pandas.DataFrame

Returns one row per PIT histogram bin with observed count, expected count under uniformity, and deviation.

pit_autocorrelation_test#

macroforecast.tests.pit_autocorrelation_test(
    pit,
    *,
    lag=1,
    alpha=0.05,
) -> TestResult

Runs a normal-approximation test for serial dependence in PIT values.

interval_coverage_test#

macroforecast.tests.interval_coverage_test(
    y_true,
    lower,
    upper,
    *,
    alpha=0.05,
) -> dict

Runs Kupiec POF, Christoffersen independence, combined conditional coverage, and Christoffersen-Pelletier duration diagnostics for forecast intervals. alpha is the expected non-coverage rate, so a 90% interval uses alpha=0.10.

Boundary cases follow the likelihood-ratio convention used by R tstests::var_cp_test and rugarch::VaRTest: zero violations do not automatically imply a passing Kupiec statistic; the restricted Bernoulli likelihood is compared with the boundary unrestricted likelihood.
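The boundary convention can be made concrete with a small Kupiec POF sketch. The function name is hypothetical and the code is an illustration of the likelihood-ratio construction, not the package implementation; the chi-squared(1) tail probability is computed with math.erfc.

```python
import math


def kupiec_pof(y_true, lower, upper, alpha=0.05):
    """Return (LR statistic, chi-squared(1) p-value) for interval misses."""
    misses = [int(y < lo or y > hi) for y, lo, hi in zip(y_true, lower, upper)]
    n = len(misses)
    x = sum(misses)

    def loglik(p):
        # Bernoulli log-likelihood with the convention 0 * log(0) = 0.
        value = 0.0
        if x > 0:
            value += x * math.log(p)
        if n - x > 0:
            value += (n - x) * math.log(1.0 - p)
        return value

    # Restricted likelihood at alpha versus unrestricted likelihood at x / n.
    lr = -2.0 * (loglik(alpha) - loglik(x / n))
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value
```

With zero misses in 100 observations and alpha=0.05, the statistic is -200 * log(0.95), roughly 10.26, so an interval that never misses is rejected as too wide rather than passed automatically.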

The christoffersen_pelletier_duration output follows the duration-test logic in tstests/R/var_cp.R::.duration_test: durations between interval misses are modeled with a Weibull likelihood, and the no-memory exponential restriction is tested by setting the Weibull shape parameter to 1. The duration statistic is unavailable when there are fewer than two misses.

Coverage output:

| Key | Meaning |
| --- | --- |
| kupiec_pof | Unconditional coverage LR. Carries tstests and rugarch references. |
| christoffersen_independence | First-order Markov independence LR plus transition counts n00, n01, n10, n11. |
| christoffersen_conditional_coverage | Sum of Kupiec and independence LR statistics with chi-squared df 2. |
| christoffersen_pelletier_duration | Weibull duration LR for the exponential no-memory restriction. |
| r_reference, rugarch_reference, r_alignment | Package-level provenance. |

Duration likelihood note: the duration construction is the same in tstests and rugarch. The implemented density/survival likelihood follows rugarch/R/rugarch-tests.R::VaRDurTest, which is the internally consistent Christoffersen-Pelletier Weibull likelihood form.

Conditional Predictive Ability#

conditional_predictive_ability_test#

macroforecast.tests.conditional_predictive_ability_test(
    loss_a,
    loss_b,
    *,
    method="giacomini_rossi",
    window_ratio=0.5,
    dmv_fullsample=True,
    lag_truncate=0,
    alpha=0.05,
)

Input: two aligned loss series. Output: JSON-ready dictionary with metadata_schema.kind="conditional_predictive_ability", a fluctuation statistic, critical value, decision, time path, window size, loss-difference orientation, and source-alignment metadata.

Supported methods: giacomini_rossi, recursive_fluctuation.

The giacomini_rossi branch is aligned with murphydiagram/R/procs.R::fluctuation_test, which implements Proposition 1 of Giacomini and Rossi (2010). It computes rolling-window Diebold-Mariano-type statistics for the loss difference loss_a - loss_b, uses Bartlett HAC variance, and compares the supremum absolute statistic with the tabulated critical values from Giacomini-Rossi Table 1. Positive path values mean loss_a is larger than loss_b over that window. The final statistic is two-sided because it takes the supremum of the absolute path.

R alignment:

| R package / function | macroforecast branch | Alignment |
| --- | --- | --- |
| murphydiagram/R/procs.R::fluctuation_test | method="giacomini_rossi" | Same ld <- loss1 - loss2, same mu grid, same Table 1 critical values, same lag_truncate in 0:5, same Bartlett HAC convention, same dmv_fullsample and rolling-denominator branches. |
| None | method="recursive_fluctuation" | Package extension over expanding-prefix loss windows. It reuses the same Bartlett HAC helper but does not claim to implement a named R-package test. |
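The rolling path behind method="giacomini_rossi" can be sketched for the simplest case, lag_truncate=0 with full-sample variance. The function name is hypothetical, the variance convention (centered, divisor n) is an assumption, and the tabulated critical values are deliberately not reproduced here.

```python
import math


def fluctuation_path(loss_a, loss_b, window_ratio=0.5):
    """Return (rolling fluctuation path, supremum absolute statistic)."""
    ld = [a - b for a, b in zip(loss_a, loss_b)]  # orientation: loss_a - loss_b
    n = len(ld)
    m = round(window_ratio * n)  # rolling window length
    mean_all = sum(ld) / n
    # Full-sample variance with no HAC lags (assumed convention).
    sigma2 = sum((v - mean_all) ** 2 for v in ld) / n
    # Each path value is sqrt(m) * window mean / sigma.
    path = [sum(ld[end - m:end]) / math.sqrt(m * sigma2) for end in range(m, n + 1)]
    return path, max(abs(v) for v in path)
```

The returned supremum would then be compared with the critical value for the chosen window_ratio and alpha.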

Options:

| Option | Default | Choices | Meaning |
| --- | --- | --- | --- |
| method | "giacomini_rossi" | "giacomini_rossi", "recursive_fluctuation" | giacomini_rossi is R-aligned; recursive_fluctuation is a package extension over expanding loss windows. |
| window_ratio | 0.5 | 0.1, 0.2, …, 0.9 for giacomini_rossi | Rolling window size as a fraction of the evaluation sample. |
| dmv_fullsample | True | boolean | If True, estimate HAC variance on the full loss-difference sample, matching the R default. If False, use each rolling window’s HAC variance. |
| lag_truncate | 0 | 0, 1, …, 5 | Bartlett HAC truncation lag, matching the R package’s allowed range. |
| alpha | 0.05 | 0.05, 0.10 for giacomini_rossi | Test size used to select the tabulated critical value. |

Output fields:

| Field | Meaning |
| --- | --- |
| statistic | Supremum absolute value of the fluctuation path. |
| time_path | Rolling or recursive fluctuation path before the supremum is taken. |
| critical_value, critical_band, decision | Tabulated Giacomini-Rossi comparison when available; None for the recursive extension. |
| variance_scope | "full_sample" when dmv_fullsample=True; "rolling_window" otherwise. |
| loss_difference_orientation | Always loss_a - loss_b; positive path values mean loss_a has larger loss. |
| source_reference, external_reference, r_reference, r_alignment | Source and R-package comparison metadata. |
| requested_method, method, alias_warning | The user-supplied method, normalized method, and any alias caveat. |

method="rossi_sekhposyan" remains accepted as a legacy alias for recursive_fluctuation, but Rossi-Sekhposyan forecast rationality is a different test family and is not represented by this loss-comparison callable.

Multiple-Model Tests#

blocked_oob_reality_check#

macroforecast.tests.blocked_oob_reality_check(
    loss_panel,
    *,
    benchmark,
    loss="squared_error",
    alpha=0.05,
    n_boot=1000,
    block_length=4,
    bootstrap_method="fixed_block_bootstrap",
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> pandas.DataFrame

Block-bootstrap one-sided benchmark-superiority screen against a named benchmark model. This is the direct callable replacement for the legacy blocked_oob_reality_check operation. It is intentionally documented as a legacy screen, not as the exact White Reality Check.

Inputs:

| Form | Required columns |
| --- | --- |
| Long panel | origin, model_id, squared_error; optional target and horizon. Column names are configurable. |
| Wide matrix | One column per model, including the benchmark column. The index is treated as origin order. |

Long-panel input must have one row per target/horizon/origin/model key. If the loss table contains duplicate rows for that key, aggregate them explicitly before calling; the test helpers do not average duplicates silently.

Output: one row per candidate model and target/horizon group.

| Column | Meaning |
| --- | --- |
| target, horizon | Group labels. Wide input uses "all" for both. |
| model | Candidate model tested against the benchmark. |
| benchmark | Benchmark model name. |
| mean_diff | benchmark_loss - candidate_loss; positive means candidate has lower loss. |
| statistic | Mean loss differential scaled by bootstrap standard error. |
| p_value | Pairwise one-sided block-bootstrap p-value for no improvement over benchmark. |
| decision | True when p_value < alpha. |
| familywise_p_value | Max-bootstrap p-value adjusted across all candidate models in the same target/horizon group. |
| familywise_decision | True when familywise_p_value < alpha. |
| familywise_n_obs | Complete-case origins used for the family-wise adjustment. |
| n_obs | Number of aligned origins. |
| block_length, n_boot, bootstrap_method | Bootstrap settings used. |
| source_reference, r_reference, r_alignment | Provenance metadata. r_reference is None because this legacy screen has no exact R-package comparator. |

The returned table carries attrs["macroforecast_metadata_schema"]["kind"] = "blocked_oob_reality_check".
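The pairwise p-value idea can be sketched with a centered block bootstrap of the mean loss difference. Everything here is an assumption made for illustration: the function name, the use of randomly placed blocks of fixed length, and the centering rule. The package's own resampling scheme and the family-wise max adjustment are not reproduced.

```python
import random


def block_bootstrap_pvalue(benchmark_loss, candidate_loss,
                           block_length=4, n_boot=1000, seed=0):
    """Return (mean_diff, one-sided bootstrap p-value for no improvement)."""
    # Orientation: benchmark_loss - candidate_loss, positive favors the candidate.
    d = [b - c for b, c in zip(benchmark_loss, candidate_loss)]
    n = len(d)
    mean_diff = sum(d) / n
    rng = random.Random(seed)
    n_blocks = -(-n // block_length)  # ceiling division
    exceed = 0
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(0, n - block_length + 1)
            sample.extend(d[start:start + block_length])
        boot_mean = sum(sample[:n]) / n  # truncate to the original length
        # Centered bootstrap mean compared with the observed mean.
        if boot_mean - mean_diff >= mean_diff:
            exceed += 1
    return mean_diff, exceed / n_boot
```

A small p-value indicates that the observed improvement is large relative to the resampling variability of the mean difference.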

R/source comparison:

| Function | Status |
| --- | --- |
| blocked_oob_reality_check(...) | No exact R-package comparator. It computes pairwise and family-wise max-centered block bootstrap p-values from precomputed benchmark/candidate loss differences. |
| ttrTests/R/dataSnoop.R::dataSnoop(test="RC" or "SPA") | Strategy-specific data-snooping code. It rebuilds technical-trading parameter-grid performance on each bootstrapped price sample, so it is not the same API contract. |
| reality_check_test(...), superior_predictive_ability_test(...), stepm_test(...) | Exact multiple-comparison callable family for White RC, Hansen SPA, and Romano-Wolf StepM using the optional arch.bootstrap backend. |

superior_predictive_ability_test#

macroforecast.tests.superior_predictive_ability_test(
    loss_panel,
    *,
    benchmark,
    loss="squared_error",
    alpha=0.05,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="stationary_bootstrap",
    p_value_type="consistent",
    studentize=True,
    nested=False,
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Input: long or wide loss panel with a named benchmark model. Output: JSON-ready dictionary with one record per target/horizon group. The record contains p_values for lower, consistent, and upper SPA p-value variants, critical_values, selected p_value, superior_models, and backend metadata.

Backend alignment: delegates to arch.bootstrap.SPA. The backend takes benchmark losses and candidate losses, forms loss differentials internally as benchmark_loss - candidate_loss, and reports lower, consistent, and upper p-values from Hansen’s recentering choices. Positive mean_loss_difference in the output means the candidate has lower average loss than the benchmark.

R/source comparison: archived R ttrTests/R/dataSnoop.R::dataSnoop(test="SPA") implements Hansen SPA for technical-trading rule parameter grids. It recomputes strategy performance on each bootstrapped price sample, so it is not a direct general loss-matrix API. macroforecast keeps the general forecast-evaluation contract and records this as conceptual R alignment in each output record.

Options:

| Option | Default | Choices | Meaning |
| --- | --- | --- | --- |
| bootstrap_method | "stationary_bootstrap" | "stationary_bootstrap", "fixed_block_bootstrap" | Bootstrap family. Fixed-block inputs are mapped to arch’s moving-block backend. |
| p_value_type | "consistent" | "lower", "consistent", "upper" | Which SPA p-value variant to use for p_value and decision. |
| studentize | True | boolean | Passed to arch.bootstrap.SPA. |
| nested | False | boolean | Passed to arch.bootstrap.SPA for nested model sets. |

reality_check_test#

macroforecast.tests.reality_check_test(
    loss_panel,
    *,
    benchmark,
    loss="squared_error",
    alpha=0.05,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="stationary_bootstrap",
    p_value_type="consistent",
    studentize=True,
    nested=False,
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Input and output follow superior_predictive_ability_test(...). Backend: arch.bootstrap.RealityCheck. In the current arch backend this class is a Reality Check alias over the same SPA machinery, with the same p-value fields. Use this when the research design calls for the White Reality Check against a benchmark model.

R/source comparison: archived R ttrTests/R/dataSnoop.R::dataSnoop(test="RC") implements White’s Reality Check for technical-trading rule grids. As with SPA, the R function is strategy-generator specific; macroforecast uses precomputed benchmark and candidate forecast-loss series.

stepm_test#

macroforecast.tests.stepm_test(
    loss_panel,
    *,
    benchmark,
    loss="squared_error",
    alpha=0.05,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="stationary_bootstrap",
    studentize=True,
    nested=False,
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Input: long or wide loss panel with a named benchmark model. Output: JSON-ready dictionary with superior_models for each target/horizon group. Backend: arch.bootstrap.StepM.

R/source comparison: oosanalysis-R-library/R/stepm.R::stepm implements a generic Romano-Wolf stepdown loop from supplied test statistics and bootstrap test-statistic draws. macroforecast delegates to arch.bootstrap.StepM, which constructs the benchmark-vs-candidate loss-difference statistics using the SPA backend and then applies the stepdown procedure. The objective is aligned, but the inputs are higher level in macroforecast: forecast-loss panel in, superior model names out.

model_confidence_set#

macroforecast.tests.model_confidence_set(
    loss_panel,
    *,
    loss="squared_error",
    alpha=0.10,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="mcs_fixed_block",
    statistic="max",
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Exact Hansen-Lunde-Nason model confidence set callable aligned with the R MCS package’s MCSprocedure. It constructs pairwise loss-difference statistics, bootstraps those loss-difference means, removes one model per step, tracks cumulative MCS p-values, and records included and rejected model sets by target/horizon group.

Inputs:

| Form | Required columns |
| --- | --- |
| Long panel | origin, model_id, and the selected loss column. target and horizon are optional grouping columns. |
| Wide matrix | Numeric model-loss columns. The target/horizon labels are set to "all". |

Long-panel input must have one row per target/horizon/origin/model key. Duplicate loss rows are rejected instead of being averaged inside the pivot step.

Options:

| Option | Default | Choices | Meaning |
| --- | --- | --- | --- |
| statistic | "max" | "max", "range" | "max" maps to R statistic="Tmax", computed over the model-average differences d_i. (the trailing dot is part of the notation). "range" maps to R statistic="TR", computed over the pairwise differences d_ij. |
| bootstrap_method | "mcs_fixed_block" | "mcs_fixed_block", "stationary_bootstrap", "fixed_block_bootstrap" | mcs_fixed_block follows R MCS/R/internalFunctions.R::GetIndices; the other choices are package extensions. |
| block_length | "auto" | positive int or "auto" | Block length. "auto" follows the R rule conceptually: the maximum selected AR order across loss columns, with a minimum of 3. |
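The two statistic choices differ in which loss differences they are built from. The sketch below computes only those mean differences, under the assumption that d_i. averages d_ij over all models as in Hansen, Lunde, and Nason; the studentization by bootstrap variances that turns them into Tmax or TR is omitted. The helper name loss_differences is hypothetical.

```python
def loss_differences(losses):
    """Mean loss differences for a small wide loss matrix.

    losses: dict mapping model name -> list of per-origin losses
    (all lists the same length).
    Returns (d_ij, d_i): pairwise mean differences and
    model-average mean differences.
    """
    models = sorted(losses)
    mean_loss = {m: sum(losses[m]) / len(losses[m]) for m in models}
    # d_ij: mean of L_i - L_j over origins, for every ordered pair.
    d_ij = {(i, j): mean_loss[i] - mean_loss[j]
            for i in models for j in models if i != j}
    # d_i.: mean loss of model i relative to the average model.
    grand_mean = sum(mean_loss.values()) / len(models)
    d_i = {i: mean_loss[i] - grand_mean for i in models}
    return d_ij, d_i


losses = {"ar": [1.0, 2.0, 3.0], "rw": [2.0, 3.0, 4.0], "var": [3.0, 4.0, 5.0]}
d_ij, d_i = loss_differences(losses)
```

A positive d_i. marks a model that is worse than the average of the active set, which is why the "max" statistic looks at the largest such value.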

Output: JSON-ready dictionary with metadata_schema.kind="model_confidence_set".

| Key | Meaning |
| --- | --- |
| mcs_inclusion | Included model records by target, horizon, and alpha after the iterative procedure stops. |
| mcs_rejections | Eliminated model records by target, horizon, and alpha. |
| p_values | Final stopping-test p-value by target and horizon. |
| iteration_path | One record per removal step, including active models, statistic, p-value, cumulative MCS p-value, removed model, rejected model if any, and mean losses. |
| block_lengths_used | Block length used by target and horizon. |
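The cumulative MCS p-value in iteration_path is, in the Hansen-Lunde-Nason construction, the running maximum of the step p-values. The sketch below illustrates that relationship with made-up numbers; the tuple layout and the inclusion rule p >= alpha are assumptions for the example and do not describe the exact record fields returned by macroforecast.

```python
def cumulative_mcs_p_values(step_p_values):
    """Running maximum of the per-step elimination p-values."""
    running = 0.0
    out = []
    for p in step_p_values:
        running = max(running, p)
        out.append(running)
    return out


# (model removed at this step, p-value of the test at this step).
# The last surviving model conventionally gets p-value 1.0.
steps = [("rw", 0.01), ("ar", 0.20), ("var", 0.08), ("factor", 1.0)]
cumulative = cumulative_mcs_p_values([p for _, p in steps])

alpha = 0.10
included = [model for (model, _), p in zip(steps, cumulative) if p >= alpha]
```

Here "var" stays in the set even though its own step p-value is below alpha, because an earlier step already produced a larger p-value.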

R/source alignment:

| R source | Python contract |
| --- | --- |
| MCS/R/MCSprocedure.R::MCSprocedure | Sequential elimination until one model remains; included/excluded sets are determined by p-Value for H_{0,M_k} relative to alpha. |
| MCS/R/internalFunctions.R::GetD | Pairwise loss differences d_ij and model-average differences d_i. (the trailing dot marks averaging over the second index). |
| MCS/R/internalFunctions.R::GetIndices | Default bootstrap_method="mcs_fixed_block" samples consecutive fixed blocks and truncates to sample length. |

block_length="auto" follows the same rule conceptually as R k=NULL: choose the maximum selected AR order across loss columns and enforce a minimum of 3. For bit-level reproducibility across software stacks, pass an explicit integer block_length.
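The fixed-block idea behind mcs_fixed_block (consecutive blocks, truncated to the sample length) can be sketched as follows. This is an illustration only and not a port of GetIndices: the helper name is hypothetical, and drawing block starts so that blocks never run past the end of the sample is an assumption.

```python
import random


def fixed_block_indices(n_obs, block_length, rng):
    """Draw one bootstrap index vector from consecutive fixed-length blocks."""
    if not 1 <= block_length <= n_obs:
        raise ValueError("block_length must be between 1 and n_obs")
    indices = []
    last_start = n_obs - block_length  # assumed: no wrap-around
    while len(indices) < n_obs:
        start = rng.randint(0, last_start)  # inclusive on both ends
        indices.extend(range(start, start + block_length))
    # The final block may overshoot; truncate to the sample length.
    return indices[:n_obs]


rng = random.Random(0)
draw = fixed_block_indices(n_obs=10, block_length=3, rng=rng)
```

Passing an explicit integer block length, as recommended above, removes the AR-order selection step as a source of differences between software stacks.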

iterative_model_confidence_set#

macroforecast.tests.iterative_model_confidence_set(
    loss_panel,
    *,
    loss="squared_error",
    alpha=0.10,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="mcs_fixed_block",
    statistic="max",
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Descriptive alias for model_confidence_set(...). It calls the same exact MCS engine and returns the same fields, with metadata_schema.kind="iterative_model_confidence_set" so older code can trace which callable produced the result.

Residual Diagnostics#

residual_diagnostics#

macroforecast.tests.residual_diagnostics(
    residuals,
    *,
    tests=(
        "ljung_box_q",
        "arch_lm",
        "jarque_bera_normality",
        "durbin_watson",
    ),
    lag=10,
    alpha=0.05,
    model_df=0,
    exog=None,
    demean_arch=False,
) -> pandas.DataFrame

Input: residual series. Output: one-row-per-test pandas DataFrame with test, statistic, p_value, decision, lag_used, df, n_obs, source_reference, r_reference, r_alignment, and status. The result carries attrs["macroforecast_metadata_schema"] = {"kind": "residual_diagnostics", "version": 1, ...}.

Supported tests:

| Name | Meaning |
| --- | --- |
| ljung_box_q | Ljung-Box serial-correlation diagnostic, aligned with stats::Box.test(type="Ljung-Box"); model_df maps to R fitdf. |
| breusch_godfrey_serial_correlation | Breusch-Godfrey Chisq LM diagnostic under the residual-series contract; default is equivalent to testing residuals ~ 1, and exog supplies additional original-regression design columns. |
| arch_lm | Engle ARCH LM diagnostic, aligned with FinTS::ArchTest; demean_arch=True matches its demean=TRUE option. |
| jarque_bera_normality | Jarque-Bera normality diagnostic using the same population-moment formula as tseries::jarque.bera.test. |
| durbin_watson | Durbin-Watson statistic aligned with the statistic in lmtest::dwtest; p-value is not supplied because lmtest’s exact p-value uses a model-design distribution not available from residuals alone. |
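The population-moment convention noted for jarque_bera_normality means the central moments divide by n rather than n - 1. A minimal standalone sketch of that formula follows; it is not the macroforecast implementation, and the function name is chosen for the example. The p-value uses the closed form of the chi-squared survival function with 2 degrees of freedom, exp(-x / 2).

```python
import math


def jarque_bera(x):
    """Jarque-Bera statistic and p-value from population central moments."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    stat = n * skewness ** 2 / 6 + n * (kurtosis - 3) ** 2 / 24
    # Chi-squared survival function with df=2 has the closed form exp(-x/2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


stat, p_value = jarque_bera([-1.0, 1.0, -1.0, 1.0])
```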

Options:

| Option | Default | Meaning |
| --- | --- | --- |
| lag | 10 | Maximum lag for Ljung-Box, ARCH-LM, and Breusch-Godfrey. |
| alpha | 0.05 | Rejection level used for decision. |
| model_df | 0 | Degrees of freedom consumed by the fitted model. Used in Ljung-Box p-values and ARCH-LM degrees-of-freedom adjustment. |
| exog | None | Optional design matrix for the Breusch-Godfrey auxiliary regression. If omitted, an intercept-only design is used. |
| demean_arch | False | Demean residuals before ARCH-LM, matching FinTS::ArchTest(demean=TRUE) when enabled. |
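For Ljung-Box, lag sets how many autocorrelations enter the statistic and model_df reduces the chi-squared degrees of freedom. The sketch below computes the standard Q statistic and the adjusted degrees of freedom with the series demeaned first, as R's acf does by default; it is a standalone illustration with a hypothetical function name, not the library code.

```python
def ljung_box_q(x, lag, model_df=0):
    """Return (Q, df) for the Ljung-Box test.

    Q = n (n + 2) * sum_{k=1..lag} rho_k^2 / (n - k), df = lag - model_df.
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, lag + 1):
        # Lag-k sample autocorrelation of the demeaned series.
        ck = sum(dev[t] * dev[t - k] for t in range(k, n))
        rho_k = ck / c0
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total, lag - model_df


q, df = ljung_box_q([1.0, -1.0, 1.0, -1.0], lag=2, model_df=1)
```

The p-value is the chi-squared upper tail at Q with df degrees of freedom, for example scipy.stats.chi2.sf(q, df) if SciPy is available.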

Source-alignment notes:

| Diagnostic | Source logic |
| --- | --- |
| Ljung-Box | stats::Box.test(type="Ljung-Box"): Q = n(n+2) sum rho_k^2/(n-k), chi-squared df lag - model_df; model_df is R fitdf. |
| ARCH-LM | FinTS/R/ArchTest.R::ArchTest: optionally demean residuals, embed x^2, regress current squared residuals on lagged squared residuals, statistic is effective sample size times auxiliary R^2. model_df is a statsmodels degrees-of-freedom adjustment beyond the R API. |
| Jarque-Bera | tseries/R/test.R::jarque.bera.test: population central moments, n * skewness^2 / 6 + n * (kurtosis - 3)^2 / 24, chi-squared df 2. |
| Breusch-Godfrey | lmtest/R/bgtest.R::bgtest: R takes a fitted model or formula. macroforecast takes residuals and optional exog, then applies the same Chisq LM auxiliary formula with fill-zero lagged residual columns under that residual-series contract. |
| Durbin-Watson | lmtest/R/dwtest.R::dwtest: statistic sum(diff(residuals)^2) / sum(residuals^2). P-values are omitted because R’s exact/asymptotic p-value depends on the original regression design matrix. |
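The Durbin-Watson statistic needs only the residuals, which is why it can be reported without a p-value. A minimal standalone sketch of the formula quoted above, with a function name chosen for the example:

```python
def durbin_watson(residuals):
    """sum(diff(residuals)^2) / sum(residuals^2)."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Perfectly alternating residuals: strong negative first-order autocorrelation.
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
```

Values near 2 are consistent with no first-order autocorrelation; values toward 0 point to positive and values toward 4 to negative autocorrelation.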

  • jarque_bera_test – Jarque-Bera normality test (single series, chi2 df=2; tseries::jarque.bera.test convention).

  • granger_causality – Granger causality test in a VAR (vars::causality; F or Wald).

  • instantaneous_causality – instantaneous (contemporaneous) causality test in a VAR.

  • giacomini_white_test – Giacomini-White (2006) CONDITIONAL predictive ability Wald test (chi2, HAC), instrument [1, dL_{t-h}].

  • var_serial_test – multivariate residual serial-correlation (Portmanteau/LM) test for a VAR (vars::serial.test).

  • var_normality_test – multivariate normality (Doornik-Hansen/Lütkepohl JB) test for VAR residuals (vars::normality.test).

  • var_arch_test – multivariate ARCH-LM test for VAR residuals (vars::arch.test, Lütkepohl).