macroforecast.tests#

macroforecast.tests owns forecast-comparison tests and residual diagnostics. It does not compute general scoring tables, fit models, or choose windows.

Use the namespace form:

import macroforecast as mf

mf.tests.dm_test(loss_a, loss_b, horizon=1)

Top-level shortcuts such as mf.dm_test(...) are intentionally not exported.

TestResult#

Most pairwise forecast-comparison tests return TestResult.

macroforecast.tests.TestResult(
    statistic,
    p_value,
    decision,
    alternative,
    correction_policy=None,
    n_obs=None,
    metadata={},
)

Field	Meaning
`statistic`	Test statistic, or `None` when the sample is too small or degenerate.
`p_value`	P-value, or `None` when unavailable.
`decision`	`True` when the null is rejected at the supplied `alpha`.
`alternative`	`two_sided` or `one_sided`.
`correction_policy`	HAC or small-sample correction label.
`n_obs`	Number of aligned observations used.
`metadata`	Test-specific details.

Methods:

Method	Output
`to_dict()`	JSON-ready dictionary with `metadata_schema.kind="forecast_test_result"`.
`to_json(path=None)`	JSON text and optional file write.
`summary()`	Compact string summary.

Custom Tests#

custom_test#

macroforecast.tests.custom_test(
    name,
    func,
    *args,
    alternative="two_sided",
    alpha=0.05,
    correction_policy=None,
    metadata=None,
    **params,
) -> TestResult

Runs a user-supplied forecast test and coerces the result to TestResult.

The callable receives *args and **params. It may return:

Return type	Meaning
`TestResult`	Used directly, with custom metadata merged.
mapping	Must contain `statistic` or `stat`, and may contain `p_value`/`pvalue`, `decision`, `alternative`, `correction_policy`, `n_obs`, and `metadata`.
`(statistic, p_value)`	Decision is `p_value < alpha`.
`(statistic, p_value, n_obs)`	Same as above plus sample size.

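import pandas as pd
import macroforecast as mf

def sign_test_stat(loss_a, loss_b):
    # Share of aligned periods in which model A has the lower loss.
    diff = pd.Series(loss_a).sub(pd.Series(loss_b)).dropna()
    return {
        "statistic": float((diff < 0).mean()),
        "p_value": 0.04,  # placeholder value; compute a real p-value in practice
        "n_obs": len(diff),
    }

# loss_a and loss_b are aligned loss series supplied by the caller.
result = mf.tests.custom_test(
    "sign_loss_test",
    sign_test_stat,
    loss_a,
    loss_b,
)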

custom_test() records the callable name, parameters, alpha, and custom=True in result.metadata.

Equal Predictive Accuracy#

dm_test#

macroforecast.tests.dm_test(
    loss_a,
    loss_b,
    *,
    horizon=1,
    correction="hln",
    kernel="acf",
    input_type="loss",
    power=2.0,
    alternative="two_sided",
    alpha=0.05,
)

Input: two aligned loss series by default. Set input_type="error" to match forecast::dm.test(e1, e2, h, power, varestimator) from the R forecast package: the function then computes abs(e1)^power - abs(e2)^power internally. Output: TestResult for the Diebold-Mariano equal predictive accuracy test. correction="hln" applies the Harvey-Leybourne-Newbold small-sample correction. P-values use a Student-t reference distribution with df=n-1, matching forecast/R/DM2.R::dm.test.

kernel="acf" matches the R varestimator="acf" autocovariance estimator. kernel="bartlett" or "newey_west" uses the Bartlett-weighted estimator, matching the R varestimator="bartlett" option.

R/source alignment:

Setting	Alignment
`input_type="error"`, `correction="hln"`, `kernel="acf"`	Same statistic and Student-t p-value as `forecast/R/DM2.R::dm.test(varestimator="acf")`.
`input_type="error"`, `correction="hln"`, `kernel="bartlett"` or `"newey_west"`	Same statistic and Student-t p-value as `forecast/R/DM2.R::dm.test(varestimator="bartlett")`.
`input_type="loss"`	Uses the same DM statistic after accepting precomputed losses. This is convenient for custom losses, but it is not a direct call-equivalent to R `forecast::dm.test(e1, e2)`.
`correction="none"`	Omits the Harvey-Leybourne-Newbold small-sample factor used by `forecast::dm.test`.
`kernel="parzen"` or `"andrews"`	Macroforecast extension. These HAC estimators are not options in R `forecast::dm.test`.

Returned metadata includes statistic_type="t", null_hypothesis="equal predictive accuracy", p_value_status, p_value_reference, source_reference, r_reference, r_alignment, and r_argument_mapping.
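
As a rough illustration of the input_type="error", correction="hln", kernel="acf" path, the sketch below computes the statistic by hand with the standard library. The helper name dm_statistic_sketch is hypothetical and this is not the package's implementation. It follows the autocovariance form of the Diebold-Mariano statistic with the Harvey-Leybourne-Newbold factor, and leaves the Student-t p-value (df = n - 1) to a statistics library.

```python
import math


def dm_statistic_sketch(e1, e2, horizon=1, power=2.0):
    """Hypothetical helper: HLN-corrected DM statistic from forecast errors."""
    # Loss differential d_t = |e1_t|^power - |e2_t|^power.
    d = [abs(a) ** power - abs(b) ** power for a, b in zip(e1, e2)]
    n = len(d)
    d_bar = sum(d) / n

    # Autocovariance of d_t at a given lag, with divisor n.
    def autocov(lag):
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)) / n

    # Variance of the mean: (gamma_0 + 2 * sum_{k=1}^{h-1} gamma_k) / n.
    var_mean = (autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))) / n
    if var_mean <= 0:
        return None  # degenerate sample: no statistic available

    # Harvey-Leybourne-Newbold small-sample factor.
    hln = math.sqrt((n + 1 - 2 * horizon + horizon * (horizon - 1) / n) / n)
    return hln * d_bar / math.sqrt(var_mean)


# Illustrative forecast errors (made-up numbers).
e_a = [0.5, -1.2, 0.8, 1.5, -0.3, 0.9]
e_b = [0.4, -0.6, 0.2, 1.1, -0.7, 0.5]
stat = dm_statistic_sketch(e_a, e_b, horizon=1)
print(stat)
```

With horizon=1 the corrected statistic coincides with the ordinary one-sample t statistic of the loss differential, which is a convenient sanity check.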

gw_test#

macroforecast.tests.gw_test(
    loss_a,
    loss_b,
    *,
    horizon=1,
    correction="hln",
    kernel="acf",
    input_type="loss",
    power=2.0,
    alternative="two_sided",
)

Input: two aligned loss series. Output: TestResult from the package’s Giacomini-White-compatible loss-differential surface. This callable reuses the DM/HLN loss-differential statistic on the aligned inputs and is kept to preserve the legacy callable surface.

Source boundary: gw_test() does not claim exact R-package alignment. Conditional predictive ability with time-varying fluctuation paths is exposed separately through conditional_predictive_ability_test(...).

dmp_test#

macroforecast.tests.dmp_test(
    loss_differences,
    *,
    kernel="newey_west",
    alpha=0.05,
)

Input: one loss-difference series or a sequence of loss-difference series. Output: TestResult for a stacked Diebold-Mariano-Pesaran-style joint test.

The test stacks finite loss-difference values, computes a HAC standard error for the stacked mean, and reports a two-sided standard-normal p-value. None of the checked R sources is claimed as an exact comparator. Metadata records statistic_type="z", null_hypothesis, p_value_status, p_value_reference, source_reference, and r_alignment.
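
The stacking steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's code: the helper name stacked_z_sketch and its max_lag parameter are hypothetical, and the Bartlett weights 1 - k / (max_lag + 1) are one common HAC choice rather than a documented default.

```python
import math


def stacked_z_sketch(series_list, max_lag=0):
    """Hypothetical helper: z statistic for the mean of stacked loss differences."""
    # Stack all finite loss-difference values into one sample.
    stacked = [x for s in series_list for x in s
               if x is not None and math.isfinite(x)]
    n = len(stacked)
    if n == 0:
        return None, None, 0
    mean = sum(stacked) / n
    centered = [x - mean for x in stacked]

    # Bartlett-weighted long-run variance (assumed kernel and lag rule).
    lrv = sum(c * c for c in centered) / n
    for k in range(1, max_lag + 1):
        gamma_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / n
        lrv += 2.0 * (1.0 - k / (max_lag + 1)) * gamma_k
    if lrv <= 0:
        return None, None, n

    z = mean / math.sqrt(lrv / n)
    # Two-sided standard-normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value, n


# Two made-up loss-difference series; the NaN is dropped before stacking.
z, p_value, n_obs = stacked_z_sketch([[1.0, -1.0, float("nan")], [2.0, 0.0]])
print(z, p_value, n_obs)
```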

equal_predictive_tests#

macroforecast.tests.equal_predictive_tests(
    loss_a,
    loss_b,
    *,
    tests=("dm", "gw", "dmp"),
    error_a=None,
    error_b=None,
    horizon=1,
    correction="hln",
    kernel="acf",
    alpha=0.05,
) -> pandas.DataFrame

Runs multiple equal-predictive-ability tests and stacks one row per test. Supported names are dm, gw, dmp, and hn. hn requires error_a and error_b because Harvey-Newbold is an encompassing test on forecast errors.

Output: a pandas.DataFrame with one row per requested test. The table keeps the full component metadata in the metadata column and also promotes the paper-facing fields below to top-level columns.

Column	Meaning
`test`, `name`	Requested key and display name.
`statistic_type`, `statistic`	Reference family (`t` or `z`) and test statistic.
`p_value`, `p_value_status`, `p_value_reference`	P-value, availability flag, and reference distribution.
`decision`, `alternative`, `null_hypothesis`	Rejection flag, alternative direction, and null statement.
`correction_policy`, `n_obs`	Small-sample/HAC policy and aligned observation count.
`source_reference`, `external_reference`, `r_reference`, `r_alignment`	Provenance and source-comparison fields.
`metadata`	Full `TestResult.metadata` dictionary for the component test.

Current source alignment by row:

Test	R/source status
`dm`	Exact `forecast::dm.test` alignment only under the settings listed in `dm_test`.
`gw`	Legacy GW-compatible DM-style surface; no exact R comparator claimed.
`dmp`	Macroforecast stacked HAC screen; no exact R comparator claimed.
`hn`	Legacy encompassing covariance approximation; not `forecast::dm.test`.

For paper output, pass this table to macroforecast.reporting.test_report_table(...). For an appendix/audit table that spells out source and R alignment, use macroforecast.reporting.test_provenance_table(...).

harvey_newbold_test#

macroforecast.tests.harvey_newbold_test(
    error_a,
    error_b,
    *,
    horizon=1,
    kernel="newey_west",
    small_sample=True,
    alpha=0.05,
)

Input: two forecast-error series. Output: one-sided TestResult for the legacy forecast-error covariance approximation.

Source note: this is not forecast::dm.test. The R forecast package function implements Harvey-Leybourne-Newbold Diebold-Mariano equal-accuracy testing. harvey_newbold_test() remains a callable encompassing-style covariance approximation and records that distinction in result.metadata.

The callable forms d_t = e_a,t * (e_a,t - e_b,t), computes a HAC standard error, optionally applies an HLN-style small-sample factor, and reports a one-sided Student-t upper-tail p-value. Metadata records statistic_type="t", p_value_status, p_value_reference, source_reference, r_reference=None, and r_alignment.

Alias: hn_test.

Nested And Encompassing Tests#

clark_west_test#

macroforecast.tests.clark_west_test(
    loss_small,
    loss_large,
    forecast_small,
    forecast_large,
    *,
    horizon=1,
    cw_adjustment=True,
    kernel="newey_west",
    alpha=0.05,
)

Input: small-model loss, large-model loss, and both forecast series. Output: one-sided TestResult for the Clark-West nested forecast comparison.

Statistic:

q_t = e_r,t^2 - e_u,t^2 + (f_r,t - f_u,t)^2
z = mean(q_t) / sqrt(LRV(q_t) / n)

Here r is the restricted/small model and u is the unrestricted/large model. The implementation follows the standard adjusted MSPE differential used by Clark-West references such as GAUSS cwTest and HypothesisTests.jl ClarkWestTest. Archived R examples can differ by sign convention, so this page treats the formula above as the package contract.
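
The formula above can be checked by hand with a short sketch. The helper name clark_west_sketch is hypothetical. It assumes squared-error losses, a one-step horizon with no autocovariance terms in the long-run variance, a divisor of n in the variance, and a standard-normal upper-tail p-value. None of these choices is confirmed as the package's internal behavior.

```python
import math


def clark_west_sketch(e_r, e_u, f_r, f_u):
    """Hypothetical helper: one-step Clark-West statistic and upper-tail p-value."""
    # Adjusted MSPE differential q_t = e_r^2 - e_u^2 + (f_r - f_u)^2.
    q = [er ** 2 - eu ** 2 + (fr - fu) ** 2
         for er, eu, fr, fu in zip(e_r, e_u, f_r, f_u)]
    n = len(q)
    q_bar = sum(q) / n
    # Assumed long-run variance: plain variance of q_t (no lagged terms).
    lrv = sum((x - q_bar) ** 2 for x in q) / n
    z = q_bar / math.sqrt(lrv / n)
    # One-sided p-value 1 - Phi(z), written with the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Made-up data: the realized series is [1, 2, 1, 2] for both models.
f_small = [0.0, 0.0, 0.0, 0.0]
f_large = [0.0, 1.0, 0.0, 1.0]
e_small = [1.0, 2.0, 1.0, 2.0]
e_large = [1.0, 1.0, 1.0, 1.0]
z, p_value = clark_west_sketch(e_small, e_large, f_small, f_large)
print(z, p_value)
```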

Alias: cw_test.

enc_new_test#

macroforecast.tests.enc_new_test(
    error_small,
    error_large,
    *,
    critical_value=None,
    alpha=0.05,
)

Input: restricted/small-model forecast errors and unrestricted/large-model forecast errors. Output: one-sided TestResult.

Statistic:

c_t = e_r,t * (e_r,t - e_u,t)
ENC-NEW = n * mean(c_t) / mean(e_u,t^2)

Default p_value is None because Clark-McCracken nested forecast encompassing tests have nonstandard distributions. Pass a design-appropriate critical_value to get a boolean decision.
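
A minimal sketch of the statistic and of the critical-value decision rule follows. The helper name enc_new_sketch is hypothetical, and the critical value in the usage lines is an arbitrary number chosen for illustration, not a tabulated Clark-McCracken value.

```python
def enc_new_sketch(e_r, e_u, critical_value=None):
    """Hypothetical helper: ENC-NEW statistic with an optional decision."""
    n = len(e_r)
    # c_t = e_r,t * (e_r,t - e_u,t)
    c_bar = sum(er * (er - eu) for er, eu in zip(e_r, e_u)) / n
    mse_u = sum(eu ** 2 for eu in e_u) / n
    stat = n * c_bar / mse_u
    # No p-value is formed because the limiting distribution is nonstandard.
    # A decision exists only when the caller supplies a critical value.
    decision = None if critical_value is None else stat > critical_value
    return stat, decision


stat, decision = enc_new_sketch([1.0, 2.0], [1.0, 1.0])
stat_cv, decision_cv = enc_new_sketch([1.0, 2.0], [1.0, 1.0], critical_value=1.5)
print(stat, decision, decision_cv)
```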

enc_t_test#

macroforecast.tests.enc_t_test(
    error_small,
    error_large,
    *,
    horizon=1,
    kernel="newey_west",
    critical_value=None,
    normal_approximation=False,
    alpha=0.05,
)

Input: restricted/small-model forecast errors and unrestricted/large-model forecast errors. Output: one-sided TestResult.

Statistic:

c_t = e_r,t * (e_r,t - e_u,t)
ENC-T = mean(c_t) / sqrt(LRV(c_t) / n)

Default p_value is None. Set normal_approximation=True only for diagnostic screening, or pass critical_value for a design-specific decision.

nested_tests#

macroforecast.tests.nested_tests(
    loss_small,
    loss_large,
    *,
    forecast_small=None,
    forecast_large=None,
    error_small=None,
    error_large=None,
    tests=("clark_west", "enc_new", "enc_t"),
    horizon=1,
    kernel="newey_west",
    enc_critical_value=None,
    enc_normal_approximation=False,
    alpha=0.05,
) -> pandas.DataFrame

Runs multiple nested-model tests and stacks one row per test. Clark-West requires forecast_small and forecast_large; enc_new and enc_t require error_small and error_large. This separation is intentional because Clark-West is an adjusted MSPE differential while ENC-NEW and ENC-T are forecast-error encompassing covariance statistics.

Directional Accuracy Tests#

directional_accuracy_test#

macroforecast.tests.directional_accuracy_test(
    y_true,
    y_pred,
    *,
    threshold=0.0,
    method="pesaran_timmermann",
    alpha=0.05,
)

Input: realized values and forecasts. Output: TestResult. Supported methods are pesaran_timmermann, anatolyev_gerko, and henriksson_merton.

The pesaran_timmermann and anatolyev_gerko branches are aligned with R tstests/R/dac.R::dac_test and rugarch/R/rugarch-tests.R::DACTest. The p-value is a one-sided upper-tail normal p-value, 1 - Phi(statistic). Forecasts that are constant after subtracting threshold are rejected as invalid input because the directional tests are undefined for a constant sign forecast.

Options:

Option	Default	Choices	Meaning
`threshold`	`0.0`	numeric	Values above this threshold are positive-direction observations.
`method`	`"pesaran_timmermann"`	`"pesaran_timmermann"`, `"anatolyev_gerko"`, `"henriksson_merton"`	Directional statistic to compute.
`alpha`	`0.05`	probability in `(0, 1)`	Rejection level.

Method notes:

Method	Null	Statistic input
`pesaran_timmermann`	No sign predictability.	Exact R alignment with `.pt_test` / `DACTest(test="PT")`: sign hit rate versus independence-implied sign hit rate.
`anatolyev_gerko`	No excess profitability.	Exact R alignment with `.ag_test` / `DACTest(test="AG")`: `sign(forecast) * actual` excess profitability, using raw actual and forecast values after threshold subtraction.
`henriksson_merton`	No market-timing skill.	Macroforecast extension. No exact comparator in `tstests::dac_test` or `rugarch::DACTest`; statistic is based on up/down conditional hit rates.

R/source alignment:

Branch	R comparator	Notes
`pesaran_timmermann`	`tstests/R/dac.R::.pt_test`; `rugarch/R/rugarch-tests.R::DACTest(test="PT")`	Uses `x_t=1{actual>0}`, `y_t=1{forecast>0}`, `z_t=1{forecast*actual>0}`, and `p.value=1-pnorm(statistic)`.
`anatolyev_gerko`	`tstests/R/dac.R::.ag_test`; `rugarch/R/rugarch-tests.R::DACTest(test="AG")`	Uses `r_t=sign(forecast)*actual`, excess-profitability variance `V_EP`, and `p.value=1-pnorm(statistic)`.
`henriksson_merton`	None	Kept as a callable screening diagnostic, not claimed as an R-package-aligned DAC branch.

Zero rule: R uses strict positivity, actual > 0 and forecast > 0. macroforecast applies the same strict rule after subtracting threshold, so values equal to threshold are treated as non-positive.
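
The Pesaran-Timmermann branch and the strict zero rule can be sketched as follows. The helper name pesaran_timmermann_sketch is hypothetical. The code uses the indicator definitions listed in the alignment table and the usual Pesaran-Timmermann variance terms, and it raises ValueError for a constant sign forecast as one possible way of rejecting that input.

```python
import math


def pesaran_timmermann_sketch(y_true, y_pred, threshold=0.0):
    """Hypothetical helper: PT statistic with a one-sided upper-tail p-value."""
    actual = [v - threshold for v in y_true]
    forecast = [v - threshold for v in y_pred]
    n = len(actual)
    # Strict positivity: values equal to the threshold count as non-positive.
    p_x = sum(1 for v in actual if v > 0) / n
    p_y = sum(1 for v in forecast if v > 0) / n
    if p_y in (0.0, 1.0):
        raise ValueError("constant sign forecast: statistic is undefined")
    # Hit when forecast and actual share a strict sign.
    p_hat = sum(1 for a, f in zip(actual, forecast) if a * f > 0) / n
    # Hit rate implied by independence of the two sign series.
    p_star = p_y * p_x + (1 - p_y) * (1 - p_x)
    var_hat = p_star * (1 - p_star) / n
    var_star = ((2 * p_y - 1) ** 2 * p_x * (1 - p_x) / n
                + (2 * p_x - 1) ** 2 * p_y * (1 - p_y) / n
                + 4 * p_x * p_y * (1 - p_x) * (1 - p_y) / n ** 2)
    stat = (p_hat - p_star) / math.sqrt(var_hat - var_star)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2.0))  # 1 - Phi(stat)
    return stat, p_value


# Made-up example with perfect sign agreement.
stat, p_value = pesaran_timmermann_sketch([1.0, -1.0, 1.0, -1.0],
                                          [0.5, -0.2, 0.3, -0.4])
print(stat, p_value)
```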

Aliases:

Alias	Equivalent call
`pesaran_timmermann_test(y_true, y_pred)`	`directional_accuracy_test(..., method="pesaran_timmermann")`
`anatolyev_gerko_test(y_true, y_pred)`	`directional_accuracy_test(..., method="anatolyev_gerko")`
`henriksson_merton_test(y_true, y_pred)`	`directional_accuracy_test(..., method="henriksson_merton")`

Density And Interval Diagnostics#

density_interval_tests#

macroforecast.tests.density_interval_tests(
    pit,
    *,
    alpha=0.05,
    n_bins=10,
    pit_lag=1,
)

Input: probability integral transform values in [0, 1]. Output: JSON-ready dictionary with metadata_schema.kind="density_interval_tests" plus Berkowitz, KS, Kupiec POF, Christoffersen independence, Engle-Manganelli DQ, Du-Escanciano shortfall, PIT histogram, and PIT autocorrelation diagnostics.

Options:

Option	Default	Meaning
`alpha`	`0.05`	Tail probability for VaR/shortfall-style hit tests.
`n_bins`	`10`	Number of PIT histogram bins.
`pit_lag`	`1`	Lag used for PIT autocorrelation, Berkowitz AR lag, and Du-Escanciano conditional shortfall lag.

Output keys:

Key	Meaning
`berkowitz`	Berkowitz density LR test plus Jarque-Bera normality check after normal score transform.
`ks`	Kolmogorov-Smirnov test against uniform PIT.
`kupiec_pof`	Unconditional hit-rate test at `alpha`.
`christoffersen_independence`	Markov independence test for hits.
`engle_manganelli_dq`	PIT hit-only DQ proxy. Use `dynamic_quantile_test(...)` for the full Engle-Manganelli VaR DQ test.
`du_escanciano_shortfall`	Du-Escanciano unconditional and conditional shortfall tests.
`pit_histogram`	One record per histogram bin.
`pit_autocorrelation`	`TestResult` dictionary for serial PIT dependence.
`r_reference`, `r_alignment`	Composite provenance metadata. Component-level diagnostics also carry their own R/source metadata.

R/source alignment:

Diagnostic	Reference
Berkowitz	`tstests/R/berkowitz.R::berkowitz_test`: PIT to normal scores, ARIMA(`pit_lag`,0,0) unrestricted likelihood versus Normal(0,1); LR df is `2 + pit_lag`.
Du-Escanciano shortfall	`tstests/R/shortfall_de.R::shortfall_de_test`: cumulative tail shortfall mean test and portmanteau test on centered tail shortfall autocorrelations.
Kupiec/Christoffersen	`tstests/R/var_cp.R::var_cp_test` and `rugarch/R/rugarch-tests.R`: Bernoulli/transition likelihood-ratio construction.
PIT hit-only DQ proxy	No direct R comparator. It is a PIT-hit lag diagnostic inside this composite wrapper, not the full Engle-Manganelli VaR DQ test.

Boundary handling: values outside [0, 1] raise an error. Boundary PIT values 0 and 1 are accepted but are clipped internally for the normal-score Berkowitz transform to avoid infinite ARIMA inputs.

shortfall_de_test#

macroforecast.tests.shortfall_de_test(
    pit,
    *,
    alpha=0.05,
    lags=1,
    boot=False,
    n_boot=2000,
    random_state=0,
) -> dict

Input: PIT values in [0, 1]. Output: JSON-ready dictionary with metadata_schema.kind="shortfall_de_test".

The unconditional statistic is the sample mean of cumulative tail shortfall, mean((alpha - pit) * 1{pit <= alpha} / alpha). The conditional statistic is a portmanteau statistic on autocorrelations of that series centered by alpha / 2. With boot=False, the unconditional p-value uses the Du-Escanciano normal approximation and the conditional p-value uses Chi-squared(lags). With boot=True, both p-values use simulated uniform PIT draws with the same sample size.
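
The unconditional part can be written out directly. In this sketch the helper name shortfall_unconditional_sketch is hypothetical. The centering alpha / 2 and the variance alpha * (1/3 - alpha/4) follow from integrating the shortfall function over uniform PIT values. The resulting z statistic is the normal-approximation form, shown as an illustration rather than as the package's exact output.

```python
import math


def shortfall_unconditional_sketch(pit, alpha=0.05):
    """Hypothetical helper: mean tail shortfall and its normal-approximation z."""
    # Cumulative tail shortfall H_t = (alpha - u_t) * 1{u_t <= alpha} / alpha.
    h = [(alpha - u) / alpha if u <= alpha else 0.0 for u in pit]
    n = len(h)
    h_bar = sum(h) / n
    # Under uniform PITs: E[H_t] = alpha / 2, Var[H_t] = alpha * (1/3 - alpha/4).
    sd = math.sqrt(alpha * (1.0 / 3.0 - alpha / 4.0))
    z = math.sqrt(n) * (h_bar - alpha / 2.0) / sd
    return h_bar, z


# Made-up PIT values with two observations in the 5% tail.
h_bar, z = shortfall_unconditional_sketch([0.01, 0.5, 0.9, 0.03], alpha=0.05)
print(h_bar, z)
```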

dynamic_quantile_test#

macroforecast.tests.dynamic_quantile_test(
    y_true,
    var,
    *,
    alpha=0.05,
    lag=1,
    lag_hit=1,
    lag_var=1,
) -> TestResult

Input: realized values and one-step-ahead lower-tail VaR forecasts. Output: TestResult for the Engle-Manganelli dynamic quantile test.

This is the full VaR DQ callable. It is separate from density_interval_tests(...) because the exact DQ statistic needs realized values and VaR forecasts, not PIT values alone.

R/source alignment: segMGarch/R/DQtest.R::DQtest. The hit series is 1 - alpha when y_true < var and -alpha otherwise. The regressor matrix contains a constant, lag-aligned VaR forecasts, lag_hit lagged hit columns, and lagged squared realized values. The statistic is Hit' X (X'X)^(-1) X' Hit / (alpha * (1 - alpha)), with a chi-squared reference distribution using the number of columns of X.

R argument mapping: segMGarch::DQtest names the VaR probability VaR_level and converts it internally to the lower-tail probability 1 - VaR_level. macroforecast accepts the lower-tail probability directly as alpha; therefore a 5% lower-tail VaR is alpha=0.05, corresponding to VaR_level=0.95 in the R function.

Source: https://rdrr.io/cran/segMGarch/src/R/DQtest.R
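
The hit definition and the probability mapping are easy to get backwards, so a small sketch may help. Both helper names are hypothetical. The code builds only the demeaned hit series and converts an R-style VaR_level to the lower-tail alpha, and it does not build the regressor matrix or the chi-squared statistic.

```python
def dq_hit_series_sketch(y_true, var, alpha=0.05):
    """Hypothetical helper: demeaned hit series for a lower-tail VaR."""
    # 1 - alpha on a violation (y below the VaR forecast), -alpha otherwise.
    return [(1.0 - alpha) if y < v else -alpha for y, v in zip(y_true, var)]


def alpha_from_var_level(var_level):
    """Hypothetical helper: map an R-style VaR_level to lower-tail alpha."""
    return 1.0 - var_level


# Made-up returns against a constant VaR forecast of -2.0.
hits = dq_hit_series_sketch([-3.0, 0.5, -1.0], [-2.0, -2.0, -2.0], alpha=0.05)
alpha = alpha_from_var_level(0.95)
print(hits, alpha)
```

Under correct unconditional coverage the hit series has mean zero, which is why the statistic regresses it on lagged information.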

Options:

Option	Default	Meaning
`alpha`	`0.05`	Lower-tail probability. A 5% VaR uses `alpha=0.05`.
`lag`	`1`	Lag used for squared realized values.
`lag_hit`	`1`	Number of lagged hit columns.
`lag_var`	`1`	Lag alignment for VaR forecasts.

pit_histogram#

macroforecast.tests.pit_histogram(pit, *, n_bins=10) -> pandas.DataFrame

Returns one row per PIT histogram bin with observed count, expected count under uniformity, and deviation.

pit_autocorrelation_test#

macroforecast.tests.pit_autocorrelation_test(
    pit,
    *,
    lag=1,
    alpha=0.05,
) -> TestResult

Runs a normal-approximation test for serial dependence in PIT values.

interval_coverage_test#

macroforecast.tests.interval_coverage_test(
    y_true,
    lower,
    upper,
    *,
    alpha=0.05,
) -> dict

Runs Kupiec POF, Christoffersen independence, combined conditional coverage, and Christoffersen-Pelletier duration diagnostics for forecast intervals. alpha is the expected non-coverage rate, so a 90% interval uses alpha=0.10.

Boundary cases follow the likelihood-ratio convention used by R tstests::var_cp_test and rugarch::VaRTest: zero violations do not automatically imply a passing Kupiec statistic; the restricted Bernoulli likelihood is compared with the boundary unrestricted likelihood.
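
The boundary behavior can be seen in a small likelihood-ratio sketch. The helper name kupiec_pof_sketch is hypothetical. It evaluates the Bernoulli log-likelihood with the convention 0 * log(0) = 0, so that zero misses still yield a finite, and possibly large, statistic. The chi-squared(1) p-value uses the identity P(X > x) = erfc(sqrt(x / 2)).

```python
import math


def kupiec_pof_sketch(n_obs, n_miss, alpha):
    """Hypothetical helper: Kupiec POF likelihood ratio with boundary handling."""
    def log_lik(p):
        # Bernoulli log-likelihood; terms with a zero count are skipped.
        out = 0.0
        if n_miss > 0:
            out += n_miss * math.log(p)
        if n_obs - n_miss > 0:
            out += (n_obs - n_miss) * math.log(1.0 - p)
        return out

    pi_hat = n_miss / n_obs  # unrestricted miss rate, may sit on the boundary
    lr = max(-2.0 * (log_lik(alpha) - log_lik(pi_hat)), 0.0)
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-squared(1) upper tail
    return lr, p_value


# Zero misses in 100 intervals at a 10% expected non-coverage rate.
lr_zero, p_zero = kupiec_pof_sketch(100, 0, 0.10)
# Exactly the expected number of misses.
lr_match, p_match = kupiec_pof_sketch(100, 10, 0.10)
print(lr_zero, p_zero, lr_match, p_match)
```

In the first call the intervals never miss, yet the statistic is far in the rejection region because the observed miss rate is well below the nominal rate.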

The christoffersen_pelletier_duration output follows the duration-test logic in tstests/R/var_cp.R::.duration_test: durations between interval misses are modeled with a Weibull likelihood, and the no-memory exponential restriction is tested by setting the Weibull shape parameter to 1. The duration statistic is unavailable when there are fewer than two misses.

Coverage output:

Key	Meaning
`kupiec_pof`	Unconditional coverage LR. Carries `tstests` and `rugarch` references.
`christoffersen_independence`	First-order Markov independence LR plus transition counts `n00`, `n01`, `n10`, `n11`.
`christoffersen_conditional_coverage`	Sum of Kupiec and independence LR statistics with chi-squared df 2.
`christoffersen_pelletier_duration`	Weibull duration LR for the exponential no-memory restriction.
`r_reference`, `rugarch_reference`, `r_alignment`	Package-level provenance.

Duration likelihood note: the duration construction is the same in tstests and rugarch. The implemented density/survival likelihood follows rugarch/R/rugarch-tests.R::VaRDurTest, which is the internally consistent Christoffersen-Pelletier Weibull likelihood form.

Conditional Predictive Ability#

conditional_predictive_ability_test#

macroforecast.tests.conditional_predictive_ability_test(
    loss_a,
    loss_b,
    *,
    method="giacomini_rossi",
    window_ratio=0.5,
    dmv_fullsample=True,
    lag_truncate=0,
    alpha=0.05,
)

Input: two aligned loss series. Output: JSON-ready dictionary with metadata_schema.kind="conditional_predictive_ability", a fluctuation statistic, critical value, decision, time path, window size, loss-difference orientation, and source-alignment metadata.

Supported methods: giacomini_rossi, recursive_fluctuation.

The giacomini_rossi branch is aligned with murphydiagram/R/procs.R::fluctuation_test, which implements Proposition 1 of Giacomini and Rossi (2010). It computes rolling-window Diebold-Mariano-type statistics for the loss difference loss_a - loss_b, uses Bartlett HAC variance, and compares the supremum absolute statistic with the tabulated critical values from Giacomini-Rossi Table 1. Positive path values mean loss_a is larger than loss_b over that window. The final statistic is two-sided because it takes the supremum of the absolute path.
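
A simplified sketch of the rolling path is shown here for the case lag_truncate=0 and dmv_fullsample=True. The helper name fluctuation_path_sketch is hypothetical, the divisor-n variance is an assumption, and the critical value is left to the caller because the tabulated values are not reproduced on this page.

```python
import math


def fluctuation_path_sketch(loss_a, loss_b, window_ratio=0.5):
    """Hypothetical helper: rolling DM-type path with a full-sample variance."""
    ld = [a - b for a, b in zip(loss_a, loss_b)]  # orientation: loss_a - loss_b
    n = len(ld)
    m = round(window_ratio * n)  # rolling window length
    mean_all = sum(ld) / n
    # With no truncation lags the HAC variance is the plain variance (assumed divisor n).
    var_all = sum((x - mean_all) ** 2 for x in ld) / n
    scale = math.sqrt(var_all / m)
    # One statistic per window of m consecutive loss differences.
    path = [sum(ld[end - m:end]) / m / scale for end in range(m, n + 1)]
    return path, max(abs(v) for v in path)


# Made-up losses: model A is worse in every other period.
path, sup_stat = fluctuation_path_sketch([2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
                                         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
print(path, sup_stat)
```

The decision then compares sup_stat with the tabulated critical value for the chosen window_ratio and alpha.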

R alignment:

R package / function	macroforecast branch	Alignment
`murphydiagram/R/procs.R::fluctuation_test`	`method="giacomini_rossi"`	Same `ld <- loss1 - loss2`, same `mu` grid, same Table 1 critical values, same `lag_truncate in 0:5`, same Bartlett HAC convention, same `dmv_fullsample` and rolling-denominator branches.
None	`method="recursive_fluctuation"`	Package extension over expanding-prefix loss windows. It reuses the same Bartlett HAC helper but does not claim to implement a named R-package test.

Options:

Option	Default	Choices	Meaning
`method`	`"giacomini_rossi"`	`"giacomini_rossi"`, `"recursive_fluctuation"`	`giacomini_rossi` is R-aligned; `recursive_fluctuation` is a package extension over expanding loss windows.
`window_ratio`	`0.5`	`0.1`, `0.2`, …, `0.9` for `giacomini_rossi`	Rolling window size as a fraction of the evaluation sample.
`dmv_fullsample`	`True`	boolean	If `True`, estimate HAC variance on the full loss-difference sample, matching the R default. If `False`, use each rolling window’s HAC variance.
`lag_truncate`	`0`	`0`, `1`, …, `5`	Bartlett HAC truncation lag, matching the R package’s allowed range.
`alpha`	`0.05`	`0.05`, `0.10` for `giacomini_rossi`	Test size used to select the tabulated critical value.

Output fields:

Field	Meaning
`statistic`	Supremum absolute value of the fluctuation path.
`time_path`	Rolling or recursive fluctuation path before the supremum is taken.
`critical_value`, `critical_band`, `decision`	Tabulated Giacomini-Rossi comparison when available; `None` for the recursive extension.
`variance_scope`	`"full_sample"` when `dmv_fullsample=True`; `"rolling_window"` otherwise.
`loss_difference_orientation`	Always `loss_a - loss_b`; positive path values mean `loss_a` has larger loss.
`source_reference`, `external_reference`, `r_reference`, `r_alignment`	Source and R-package comparison metadata.
`requested_method`, `method`, `alias_warning`	The user-supplied method, normalized method, and any alias caveat.

method="rossi_sekhposyan" remains accepted as a legacy alias for recursive_fluctuation, but Rossi-Sekhposyan forecast rationality is a different test family and is not represented by this loss-comparison callable.

Multiple-Model Tests#

blocked_oob_reality_check#

macroforecast.tests.blocked_oob_reality_check(
    loss_panel,
    *,
    benchmark,
    loss="squared_error",
    alpha=0.05,
    n_boot=1000,
    block_length=4,
    bootstrap_method="fixed_block_bootstrap",
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> pandas.DataFrame

Block-bootstrap one-sided benchmark-superiority screen against a named benchmark model. This is the direct callable replacement for the legacy blocked_oob_reality_check operation. It is intentionally documented as a legacy screen, not as the exact White Reality Check.

Inputs:

Form	Required columns
Long panel	`origin`, `model_id`, `squared_error`; optional `target` and `horizon`. Column names are configurable.
Wide matrix	One column per model, including the `benchmark` column. The index is treated as origin order.

Long-panel input must have one row per target/horizon/origin/model key. If the loss table contains duplicate rows for that key, aggregate them explicitly before calling; the test helpers do not average duplicates silently.

Output: one row per candidate model and target/horizon group.

Column	Meaning
`target`, `horizon`	Group labels. Wide input uses `"all"` for both.
`model`	Candidate model tested against the benchmark.
`benchmark`	Benchmark model name.
`mean_diff`	`benchmark_loss - candidate_loss`; positive means candidate has lower loss.
`statistic`	Mean loss differential scaled by bootstrap standard error.
`p_value`	Pairwise one-sided block-bootstrap p-value for no improvement over benchmark.
`decision`	`True` when `p_value < alpha`.
`familywise_p_value`	Max-bootstrap p-value adjusted across all candidate models in the same target/horizon group.
`familywise_decision`	`True` when `familywise_p_value < alpha`.
`familywise_n_obs`	Complete-case origins used for the family-wise adjustment.
`n_obs`	Number of aligned origins.
`block_length`, `n_boot`, `bootstrap_method`	Bootstrap settings used.
`source_reference`, `r_reference`, `r_alignment`	Provenance metadata. `r_reference` is `None` because this legacy screen has no exact R-package comparator.

The returned table carries attrs["macroforecast_metadata_schema"]["kind"] = "blocked_oob_reality_check".

R/source comparison:

Function	Status
`blocked_oob_reality_check(...)`	No exact R-package comparator. It computes pairwise and family-wise max-centered block bootstrap p-values from precomputed benchmark/candidate loss differences.
`ttrTests/R/dataSnoop.R::dataSnoop(test="RC" or "SPA")`	Strategy-specific data-snooping code. It rebuilds technical-trading parameter-grid performance on each bootstrapped price sample, so it is not the same API contract.
`reality_check_test(...)`, `superior_predictive_ability_test(...)`, `stepm_test(...)`	Exact multiple-comparison callable family for White RC, Hansen SPA, and Romano-Wolf StepM using the optional `arch.bootstrap` backend.

superior_predictive_ability_test#

macroforecast.tests.superior_predictive_ability_test(
    loss_panel,
    *,
    benchmark,
    loss="squared_error",
    alpha=0.05,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="stationary_bootstrap",
    p_value_type="consistent",
    studentize=True,
    nested=False,
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Input: long or wide loss panel with a named benchmark model. Output: JSON-ready dictionary with one record per target/horizon group. The record contains p_values for lower, consistent, and upper SPA p-value variants, critical_values, selected p_value, superior_models, and backend metadata.

Backend alignment: delegates to arch.bootstrap.SPA. The backend takes benchmark losses and candidate losses, forms loss differentials internally as benchmark_loss - candidate_loss, and reports lower, consistent, and upper p-values from Hansen’s recentering choices. Positive mean_loss_difference in the output means the candidate has lower average loss than the benchmark.

R/source comparison: archived R ttrTests/R/dataSnoop.R::dataSnoop(test="SPA") implements Hansen SPA for technical-trading rule parameter grids. It recomputes strategy performance on each bootstrapped price sample, so it is not a direct general loss-matrix API. macroforecast keeps the general forecast-evaluation contract and records this as conceptual R alignment in each output record.

Options:

Option	Default	Choices	Meaning
`bootstrap_method`	`"stationary_bootstrap"`	`"stationary_bootstrap"`, `"fixed_block_bootstrap"`	Bootstrap family. Fixed-block inputs are mapped to `arch`’s moving-block backend.
`p_value_type`	`"consistent"`	`"lower"`, `"consistent"`, `"upper"`	Which SPA p-value variant to use for `p_value` and `decision`.
`studentize`	`True`	boolean	Passed to `arch.bootstrap.SPA`.
`nested`	`False`	boolean	Passed to `arch.bootstrap.SPA` for nested model sets.

reality_check_test#

macroforecast.tests.reality_check_test(
    loss_panel,
    *,
    benchmark,
    loss="squared_error",
    alpha=0.05,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="stationary_bootstrap",
    p_value_type="consistent",
    studentize=True,
    nested=False,
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Input and output follow superior_predictive_ability_test(...). Backend: arch.bootstrap.RealityCheck. In the current arch backend this class is a Reality Check alias over the same SPA machinery, with the same p-value fields. Use this when the research design calls for the White Reality Check against a benchmark model.

R/source comparison: archived R ttrTests/R/dataSnoop.R::dataSnoop(test="RC") implements White’s Reality Check for technical-trading rule grids. As with SPA, the R function is strategy-generator specific; macroforecast uses precomputed benchmark and candidate forecast-loss series.

stepm_test#

macroforecast.tests.stepm_test(
    loss_panel,
    *,
    benchmark,
    loss="squared_error",
    alpha=0.05,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="stationary_bootstrap",
    studentize=True,
    nested=False,
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Input: long or wide loss panel with a named benchmark model. Output: JSON-ready dictionary with superior_models for each target/horizon group. Backend: arch.bootstrap.StepM.

R/source comparison: oosanalysis-R-library/R/stepm.R::stepm implements a generic Romano-Wolf stepdown loop from supplied test statistics and bootstrap test-statistic draws. macroforecast delegates to arch.bootstrap.StepM, which constructs the benchmark-vs-candidate loss-difference statistics using the SPA backend and then applies the stepdown procedure. The objective is aligned, but the inputs are higher level in macroforecast: forecast-loss panel in, superior model names out.

model_confidence_set#

macroforecast.tests.model_confidence_set(
    loss_panel,
    *,
    loss="squared_error",
    alpha=0.10,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="mcs_fixed_block",
    statistic="max",
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
) -> dict

Exact Hansen-Lunde-Nason model confidence set callable aligned with the R MCS package’s MCSprocedure. It constructs pairwise loss-difference statistics, bootstraps those loss-difference means, removes one model per step, tracks cumulative MCS p-values, and records included and rejected model sets by target/horizon group.

Inputs:

Form	Required columns
Long panel	`origin`, `model_id`, and the selected loss column. `target` and `horizon` are optional grouping columns.
Wide matrix	Numeric model-loss columns. The target/horizon labels are set to `"all"`.

Long-panel input must have one row per target/horizon/origin/model key. Duplicate loss rows are rejected instead of being averaged inside the pivot step.

Options:

Option	Default	Choices	Meaning
`statistic`	`"max"`	`"max"`, `"range"`	`"max"` maps to R `statistic="Tmax"` over `d_i.`; `"range"` maps to R `statistic="TR"` over pairwise `d_ij`.
`bootstrap_method`	`"mcs_fixed_block"`	`"mcs_fixed_block"`, `"stationary_bootstrap"`, `"fixed_block_bootstrap"`	`mcs_fixed_block` follows R `MCS/R/internalFunctions.R::GetIndices`; the other choices are package extensions.
`block_length`	`"auto"`	positive int or `"auto"`	Block length. `"auto"` follows the R rule conceptually: the maximum selected AR order across loss columns, with a minimum of 3.
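The two quantities named in the `statistic` row, `d_ij` and `d_i.`, can be sketched as sample means. The example below uses the Hansen-Lunde-Nason definition, where `d_i.` averages `d_ij` over all models in the active set (so it equals the model's mean loss minus the grand mean). The normalization used inside the R package or in macroforecast may differ, and the studentization by bootstrap variance that turns these means into `Tmax` or `TR` is omitted. `mean_loss_differences` and the loss values are hypothetical.

```python
def mean_loss_differences(losses):
    """Return (d_ij, d_i) sample means for a dict of model -> list of losses.

    All loss lists are assumed to be aligned and of equal length.
    """
    models = sorted(losses)
    n_obs = len(losses[models[0]])
    mean_loss = {m: sum(losses[m]) / n_obs for m in models}

    # Pairwise mean loss differences d_ij = mean(L_i - L_j).
    d_ij = {(i, j): mean_loss[i] - mean_loss[j] for i in models for j in models}

    # Model-average differences d_i. = (1/m) * sum_j d_ij (HLN definition).
    d_i = {i: sum(d_ij[(i, j)] for j in models) / len(models) for i in models}
    return d_ij, d_i


losses = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 2.0, 2.0],
    "c": [4.0, 5.0, 6.0],
}
d_ij, d_i = mean_loss_differences(losses)
```

A positive `d_i.` marks a model that loses more than the set average, which is why the worst model under `"max"` is the one with the largest studentized `d_i.`.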

Output: JSON-ready dictionary with metadata_schema.kind="model_confidence_set".

Key	Meaning
`mcs_inclusion`	Included model records by target, horizon, and alpha after the iterative procedure stops.
`mcs_rejections`	Eliminated model records by target, horizon, and alpha.
`p_values`	Final stopping-test p-value by target and horizon.
`iteration_path`	One record per removal step, including active models, statistic, p-value, cumulative MCS p-value, removed model, rejected model if any, and mean losses.
`block_lengths_used`	Block length used by target and horizon.

R/source alignment:

R source	Python contract
`MCS/R/MCSprocedure.R::MCSprocedure`	Sequential elimination until one model remains; included/excluded sets are determined by `p-Value for H_{0,M_k}` relative to `alpha`.
`MCS/R/internalFunctions.R::GetD`	Pairwise loss differences `d_ij` and model-average differences `d_i.`.
`MCS/R/internalFunctions.R::GetIndices`	Default `bootstrap_method="mcs_fixed_block"` samples consecutive fixed blocks and truncates to sample length.

block_length="auto" follows the same rule conceptually as R k=NULL: choose the maximum selected AR order across loss columns and enforce a minimum of 3. For bit-level reproducibility across software stacks, pass an explicit integer block_length.
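The fixed-block scheme described in the alignment table (consecutive blocks, truncated to the sample length) can be sketched as follows. This is an illustration only: `fixed_block_indices` is a hypothetical helper, the uniform choice of block starts and the wrap-around at the end of the sample are assumptions of this sketch, and the random stream will not match R or macroforecast draws.

```python
import math
import random


def fixed_block_indices(n_obs, block_length, rng):
    """Draw one bootstrap index vector built from fixed-length blocks."""
    n_blocks = math.ceil(n_obs / block_length)
    indices = []
    for _ in range(n_blocks):
        # Assumption: block starts are uniform over the sample and
        # blocks wrap around the end of the sample.
        start = rng.randrange(n_obs)
        indices.extend((start + offset) % n_obs for offset in range(block_length))
    # Truncate the concatenated blocks to the original sample length.
    return indices[:n_obs]


rng = random.Random(0)
draw = fixed_block_indices(n_obs=10, block_length=3, rng=rng)
```

With `n_obs=10` and `block_length=3`, four blocks are drawn (12 indices) and the last two indices are dropped, so the final block is shorter than the others.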

iterative_model_confidence_set#

macroforecast.tests.iterative_model_confidence_set(
    loss_panel,
    *,
    loss="squared_error",
    alpha=0.10,
    n_boot=1000,
    block_length="auto",
    bootstrap_method="mcs_fixed_block",
    statistic="max",
    random_state=0,
    target="target",
    horizon="horizon",
    origin="origin",
    model="model_id",
)

Descriptive alias for model_confidence_set(...). It calls the same exact MCS engine and returns the same fields, with metadata_schema.kind="iterative_model_confidence_set" so older code can trace which callable produced the result.

Residual Diagnostics#

residual_diagnostics#

macroforecast.tests.residual_diagnostics(
    residuals,
    *,
    tests=(
        "ljung_box_q",
        "arch_lm",
        "jarque_bera_normality",
        "durbin_watson",
    ),
    lag=10,
    alpha=0.05,
    model_df=0,
    exog=None,
    demean_arch=False,
)

Input: residual series. Output: one-row-per-test pandas DataFrame with test, statistic, p_value, decision, lag_used, df, n_obs, source_reference, r_reference, r_alignment, and status. The result carries attrs["macroforecast_metadata_schema"] = {"kind": "residual_diagnostics", "version": 1, ...}.

Supported tests:

Name	Meaning
`ljung_box_q`	Ljung-Box serial-correlation diagnostic, aligned with `stats::Box.test(type="Ljung-Box")`; `model_df` maps to R `fitdf`.
`breusch_godfrey_serial_correlation`	Breusch-Godfrey Chisq LM diagnostic under the residual-series contract; default is equivalent to testing `residuals ~ 1`, and `exog` supplies additional original-regression design columns.
`arch_lm`	Engle ARCH LM diagnostic, aligned with `FinTS::ArchTest`; `demean_arch=True` matches its `demean=TRUE` option.
`jarque_bera_normality`	Jarque-Bera normality diagnostic using the same population-moment formula as `tseries::jarque.bera.test`.
`durbin_watson`	Durbin-Watson statistic aligned with the statistic in `lmtest::dwtest`; no p-value is supplied because `lmtest`’s p-value depends on the regression design matrix, which cannot be recovered from residuals alone.

Options:

Option	Default	Meaning
`lag`	`10`	Maximum lag for Ljung-Box, ARCH-LM, and Breusch-Godfrey.
`alpha`	`0.05`	Rejection level used for `decision`.
`model_df`	`0`	Degrees of freedom consumed by the fitted model. Used in Ljung-Box p-values and ARCH-LM degrees-of-freedom adjustment.
`exog`	`None`	Optional design matrix for the Breusch-Godfrey auxiliary regression. If omitted, an intercept-only design is used.
`demean_arch`	`False`	Demean residuals before ARCH-LM, matching `FinTS::ArchTest(demean=TRUE)` when enabled.

Source-alignment notes:

Diagnostic	Source logic
Ljung-Box	`stats::Box.test(type="Ljung-Box")`: `Q = n(n+2) sum rho_k^2/(n-k)`, chi-squared df `lag - model_df`; `model_df` is R `fitdf`.
ARCH-LM	`FinTS/R/ArchTest.R::ArchTest`: optionally demean residuals, embed `x^2`, regress current squared residuals on lagged squared residuals, statistic is effective sample size times auxiliary `R^2`. `model_df` is a statsmodels degrees-of-freedom adjustment beyond the R API.
Jarque-Bera	`tseries/R/test.R::jarque.bera.test`: population central moments, `n * skewness^2 / 6 + n * (kurtosis - 3)^2 / 24`, chi-squared df `2`.
Breusch-Godfrey	`lmtest/R/bgtest.R::bgtest`: R takes a fitted model or formula. `macroforecast` takes residuals and optional `exog`, then applies the same Chisq LM auxiliary regression under that residual-series contract, filling the pre-sample values of the lagged residual columns with zeros.
Durbin-Watson	`lmtest/R/dwtest.R::dwtest`: statistic `sum(diff(residuals)^2) / sum(residuals^2)`. P-values are omitted because R’s exact/asymptotic p-value depends on the original regression design matrix.
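The Ljung-Box, Jarque-Bera, and Durbin-Watson formulas in the table above can be checked by hand with a standard-library sketch. The function names and the toy residual series are illustrative, not part of macroforecast. The Jarque-Bera p-value uses the closed-form chi-squared survival function for 2 degrees of freedom, `exp(-x / 2)`; a Ljung-Box p-value would need a general chi-squared distribution and is left out.

```python
import math


def ljung_box_q(x, lag):
    """Q = n(n+2) * sum_{k=1..lag} rho_k^2 / (n - k), on the demeaned series."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, lag + 1):
        rho_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


def jarque_bera(x):
    """Statistic from population central moments, with chi-squared(2) p-value."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    stat = n * skewness ** 2 / 6 + n * (kurtosis - 3) ** 2 / 24
    # The survival function of chi-squared with df=2 is exp(-x / 2).
    return stat, math.exp(-stat / 2)


def durbin_watson(x):
    """sum(diff(x)^2) / sum(x^2), computed on the residuals as given."""
    numerator = sum((x[t] - x[t - 1]) ** 2 for t in range(1, len(x)))
    return numerator / sum(v * v for v in x)


residuals = [1.0, -1.0, 1.0, -1.0]
q_stat = ljung_box_q(residuals, lag=1)
jb_stat, jb_p = jarque_bera(residuals)
dw_stat = durbin_watson(residuals)
```

For this alternating series the lag-1 autocorrelation is -0.75, so the Durbin-Watson statistic sits well above 2, the usual sign of negative serial correlation.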

jarque_bera_test – Jarque-Bera normality test (single series, chi2 df=2; tseries::jarque.bera.test convention).
granger_causality – Granger causality test in a VAR (vars::causality; F or Wald).
instantaneous_causality – instantaneous (contemporaneous) causality test in a VAR.
giacomini_white_test – Giacomini-White (2006) CONDITIONAL predictive ability Wald test (chi2, HAC), instrument [1, dL_{t-h}].
var_serial_test – multivariate residual serial-correlation (Portmanteau/LM) test for a VAR (vars::serial.test).
var_normality_test – multivariate normality (Doornik-Hansen/Lutkepohl JB) test for VAR residuals (vars::normality.test).
var_arch_test – multivariate ARCH-LM test for VAR residuals (vars::arch.test, Lutkepohl).