# macroforecast.tests [Back to reference](index.md) `macroforecast.tests` owns forecast-comparison tests and residual diagnostics. It does not compute general scoring tables, fit models, or choose windows. Use the namespace form: ```python import macroforecast as mf mf.tests.dm_test(loss_a, loss_b, horizon=1) ``` Top-level shortcuts such as `mf.dm_test(...)` are intentionally not exported. ## TestResult Most pairwise forecast-comparison tests return `TestResult`. ```python macroforecast.tests.TestResult( statistic, p_value, decision, alternative, correction_policy=None, n_obs=None, metadata={}, ) ``` | Field | Meaning | | --- | --- | | `statistic` | Test statistic, or `None` when the sample is too small or degenerate. | | `p_value` | P-value, or `None` when unavailable. | | `decision` | `True` when the null is rejected at the supplied `alpha`. | | `alternative` | `two_sided` or `one_sided`. | | `correction_policy` | HAC or small-sample correction label. | | `n_obs` | Number of aligned observations used. | | `metadata` | Test-specific details. | Methods: | Method | Output | | --- | --- | | `to_dict()` | JSON-ready dictionary with `metadata_schema.kind="forecast_test_result"`. | | `to_json(path=None)` | JSON text and optional file write. | | `summary()` | Compact string summary. | ## Custom Tests ### custom_test ```python macroforecast.tests.custom_test( name, func, *args, alternative="two_sided", alpha=0.05, correction_policy=None, metadata=None, **params, ) -> TestResult ``` Runs a user-supplied forecast test and coerces the result to `TestResult`. The callable receives `*args` and `**params`. It may return: | Return type | Meaning | | --- | --- | | `TestResult` | Used directly, with custom metadata merged. | | mapping | Must contain `statistic` or `stat`, and may contain `p_value`/`pvalue`, `decision`, `alternative`, `correction_policy`, `n_obs`, and `metadata`. | | `(statistic, p_value)` | Decision is `p_value < alpha`. | | `(statistic, p_value, n_obs)` | Same as above plus sample size. | ```python def sign_test_stat(loss_a, loss_b): diff = pd.Series(loss_a).sub(pd.Series(loss_b)).dropna() return { "statistic": float((diff < 0).mean()), "p_value": 0.04, "n_obs": len(diff), } result = mf.tests.custom_test( "sign_loss_test", sign_test_stat, loss_a, loss_b, ) ``` `custom_test()` records the callable name, parameters, `alpha`, and `custom=True` in `result.metadata`. ## Equal Predictive Accuracy ### dm_test ```python macroforecast.tests.dm_test( loss_a, loss_b, *, horizon=1, correction="hln", kernel="acf", input_type="loss", power=2.0, alternative="two_sided", alpha=0.05, ) ``` Input: two aligned loss series by default. Set `input_type="error"` to match `forecast::dm.test(e1, e2, h, power, varestimator)` from the R `forecast` package: the function then computes `abs(e1)^power - abs(e2)^power` internally. Output: `TestResult` for the Diebold-Mariano equal predictive accuracy test. `correction="hln"` applies the Harvey-Leybourne-Newbold small-sample correction. P-values use a Student-t reference distribution with `df=n-1`, matching `forecast/R/DM2.R::dm.test`. `kernel="acf"` matches the R `varestimator="acf"` autocovariance estimator. `kernel="bartlett"` or `"newey_west"` uses the Bartlett-weighted estimator, matching the R `varestimator="bartlett"` option. R/source alignment: | Setting | Alignment | | --- | --- | | `input_type="error"`, `correction="hln"`, `kernel="acf"` | Same statistic and Student-t p-value as `forecast/R/DM2.R::dm.test(varestimator="acf")`. | | `input_type="error"`, `correction="hln"`, `kernel="bartlett"` or `"newey_west"` | Same statistic and Student-t p-value as `forecast/R/DM2.R::dm.test(varestimator="bartlett")`. | | `input_type="loss"` | Uses the same DM statistic after accepting precomputed losses. This is convenient for custom losses, but it is not a direct call-equivalent to R `forecast::dm.test(e1, e2)`. | | `correction="none"` | Omits the Harvey-Leybourne-Newbold small-sample factor used by `forecast::dm.test`. | | `kernel="parzen"` or `"andrews"` | Macroforecast extension. These HAC estimators are not options in R `forecast::dm.test`. | Returned metadata includes `statistic_type="t"`, `null_hypothesis="equal predictive accuracy"`, `p_value_status`, `p_value_reference`, `source_reference`, `r_reference`, `r_alignment`, and `r_argument_mapping`. ### gw_test ```python macroforecast.tests.gw_test( loss_a, loss_b, *, horizon=1, correction="hln", kernel="acf", input_type="loss", power=2.0, alternative="two_sided", ) ``` Input: two aligned loss series. Output: `TestResult` using the package's Giacomini-White-compatible loss differential surface. This callable uses the same aligned DM-style loss-differential statistic; conditional predictive ability with time-varying fluctuation paths is exposed separately through `conditional_predictive_ability_test(...)`. Source boundary: `gw_test()` does not claim exact R-package alignment. It preserves the legacy callable surface by reusing the DM/HLN loss-differential statistic on aligned inputs. For the package's time-varying conditional predictive-ability path, use `conditional_predictive_ability_test(...)`. ### dmp_test ```python macroforecast.tests.dmp_test( loss_differences, *, kernel="newey_west", alpha=0.05, ) ``` Input: one loss-difference series or a sequence of loss-difference series. Output: `TestResult` for a stacked Diebold-Mariano-Pesaran-style joint test. The test stacks finite loss-difference values, computes a HAC standard error for the stacked mean, and reports a two-sided standard-normal p-value. No exact R-package comparator is claimed in the checked R sources. Metadata records `statistic_type="z"`, `null_hypothesis`, `p_value_status`, `p_value_reference`, `source_reference`, and `r_alignment`. ### equal_predictive_tests ```python macroforecast.tests.equal_predictive_tests( loss_a, loss_b, *, tests=("dm", "gw", "dmp"), error_a=None, error_b=None, horizon=1, correction="hln", kernel="acf", alpha=0.05, ) -> pandas.DataFrame ``` Runs multiple equal-predictive-ability tests and stacks one row per test. Supported names are `dm`, `gw`, `dmp`, and `hn`. `hn` requires `error_a` and `error_b` because Harvey-Newbold is an encompassing test on forecast errors. Output: a `pandas.DataFrame` with one row per requested test. The table keeps the full component metadata in the `metadata` column and also promotes the paper-facing fields below to top-level columns. | Column | Meaning | | --- | --- | | `test`, `name` | Requested key and display name. | | `statistic_type`, `statistic` | Reference family (`t` or `z`) and test statistic. | | `p_value`, `p_value_status`, `p_value_reference` | P-value, availability flag, and reference distribution. | | `decision`, `alternative`, `null_hypothesis` | Rejection flag, alternative direction, and null statement. | | `correction_policy`, `n_obs` | Small-sample/HAC policy and aligned observation count. | | `source_reference`, `external_reference`, `r_reference`, `r_alignment` | Provenance and source-comparison fields. | | `metadata` | Full `TestResult.metadata` dictionary for the component test. | Current source alignment by row: | Test | R/source status | | --- | --- | | `dm` | Exact `forecast::dm.test` alignment only under the settings listed in `dm_test`. | | `gw` | Legacy GW-compatible DM-style surface; no exact R comparator claimed. | | `dmp` | Macroforecast stacked HAC screen; no exact R comparator claimed. | | `hn` | Legacy encompassing covariance approximation; not `forecast::dm.test`. | For paper output, pass this table to `macroforecast.reporting.test_report_table(...)`. For an appendix/audit table that spells out source and R alignment, use `macroforecast.reporting.test_provenance_table(...)`. ### harvey_newbold_test ```python macroforecast.tests.harvey_newbold_test( error_a, error_b, *, horizon=1, kernel="newey_west", small_sample=True, alpha=0.05, ) ``` Input: two forecast-error series. Output: one-sided `TestResult` for the legacy forecast-error covariance approximation. Source note: this is not `forecast::dm.test`. The R `forecast` package function implements Harvey-Leybourne-Newbold Diebold-Mariano equal-accuracy testing. `harvey_newbold_test()` remains a callable encompassing-style covariance approximation and records that distinction in `result.metadata`. The callable forms `d_t = e_a,t * (e_a,t - e_b,t)`, computes a HAC standard error, optionally applies an HLN-style small-sample factor, and reports a one-sided Student-t upper-tail p-value. Metadata records `statistic_type="t"`, `p_value_status`, `p_value_reference`, `source_reference`, `r_reference=None`, and `r_alignment`. Alias: `hn_test`. ## Nested And Encompassing Tests ### clark_west_test ```python macroforecast.tests.clark_west_test( loss_small, loss_large, forecast_small, forecast_large, *, horizon=1, cw_adjustment=True, kernel="newey_west", alpha=0.05, ) ``` Input: small-model loss, large-model loss, and both forecast series. Output: one-sided `TestResult` for the Clark-West nested forecast comparison. Statistic: ```text q_t = e_r,t^2 - e_u,t^2 + (f_r,t - f_u,t)^2 z = mean(q_t) / sqrt(LRV(q_t) / n) ``` Here `r` is the restricted/small model and `u` is the unrestricted/large model. The implementation follows the standard adjusted MSPE differential used by Clark-West references such as GAUSS `cwTest` and HypothesisTests.jl `ClarkWestTest`. Archived R examples can differ by sign convention, so this page treats the formula above as the package contract. Alias: `cw_test`. ### enc_new_test ```python macroforecast.tests.enc_new_test( error_small, error_large, *, critical_value=None, alpha=0.05, ) ``` Input: restricted/small-model forecast errors and unrestricted/large-model forecast errors. Output: one-sided `TestResult`. Statistic: ```text c_t = e_r,t * (e_r,t - e_u,t) ENC-NEW = n * mean(c_t) / mean(e_u,t^2) ``` Default `p_value` is `None` because Clark-McCracken nested forecast encompassing tests have nonstandard distributions. Pass a design-appropriate `critical_value` to get a boolean decision. ### enc_t_test ```python macroforecast.tests.enc_t_test( error_small, error_large, *, horizon=1, kernel="newey_west", critical_value=None, normal_approximation=False, alpha=0.05, ) ``` Input: restricted/small-model forecast errors and unrestricted/large-model forecast errors. Output: one-sided `TestResult`. Statistic: ```text c_t = e_r,t * (e_r,t - e_u,t) ENC-T = mean(c_t) / sqrt(LRV(c_t) / n) ``` Default `p_value` is `None`. Set `normal_approximation=True` only for diagnostic screening, or pass `critical_value` for a design-specific decision. ### nested_tests ```python macroforecast.tests.nested_tests( loss_small, loss_large, *, forecast_small=None, forecast_large=None, error_small=None, error_large=None, tests=("clark_west", "enc_new", "enc_t"), horizon=1, kernel="newey_west", enc_critical_value=None, enc_normal_approximation=False, alpha=0.05, ) -> pandas.DataFrame ``` Runs multiple nested-model tests and stacks one row per test. Clark-West requires `forecast_small` and `forecast_large`; `enc_new` and `enc_t` require `error_small` and `error_large`. This separation is intentional because Clark-West is an adjusted MSPE differential while ENC-NEW and ENC-T are forecast-error encompassing covariance statistics. ## Directional Accuracy Tests ### directional_accuracy_test ```python macroforecast.tests.directional_accuracy_test( y_true, y_pred, *, threshold=0.0, method="pesaran_timmermann", alpha=0.05, ) ``` Input: realized values and forecasts. Output: `TestResult`. Supported methods are `pesaran_timmermann`, `anatolyev_gerko`, and `henriksson_merton`. The `pesaran_timmermann` and `anatolyev_gerko` branches are aligned with R `tstests/R/dac.R::dac_test` and `rugarch/R/rugarch-tests.R::DACTest`. The p-value is a one-sided upper-tail normal p-value, `1 - Phi(statistic)`. Forecasts that are constant after subtracting `threshold` are rejected because the directional tests are undefined for a constant sign forecast. Options: | Option | Default | Choices | Meaning | | --- | --- | --- | --- | | `threshold` | `0.0` | numeric | Values above this threshold are positive-direction observations. | | `method` | `"pesaran_timmermann"` | `"pesaran_timmermann"`, `"anatolyev_gerko"`, `"henriksson_merton"` | Directional statistic to compute. | | `alpha` | `0.05` | probability in `(0, 1)` | Rejection level. | Method notes: | Method | Null | Statistic input | | --- | --- | --- | | `pesaran_timmermann` | No sign predictability. | Exact R alignment with `.pt_test` / `DACTest(test="PT")`: sign hit rate versus independence-implied sign hit rate. | | `anatolyev_gerko` | No excess profitability. | Exact R alignment with `.ag_test` / `DACTest(test="AG")`: `sign(forecast) * actual` excess profitability, using raw actual and forecast values after threshold subtraction. | | `henriksson_merton` | No market-timing skill. | Macroforecast extension. No exact comparator in `tstests::dac_test` or `rugarch::DACTest`; statistic is based on up/down conditional hit rates. | R/source alignment: | Branch | R comparator | Notes | | --- | --- | --- | | `pesaran_timmermann` | `tstests/R/dac.R::.pt_test`; `rugarch/R/rugarch-tests.R::DACTest(test="PT")` | Uses `x_t=1{actual>0}`, `y_t=1{forecast>0}`, `z_t=1{forecast*actual>0}`, and `p.value=1-pnorm(statistic)`. | | `anatolyev_gerko` | `tstests/R/dac.R::.ag_test`; `rugarch/R/rugarch-tests.R::DACTest(test="AG")` | Uses `r_t=sign(forecast)*actual`, excess-profitability variance `V_EP`, and `p.value=1-pnorm(statistic)`. | | `henriksson_merton` | None | Kept as a callable screening diagnostic, not claimed as an R-package-aligned DAC branch. | Zero rule: R uses strict positivity, `actual > 0` and `forecast > 0`. `macroforecast` applies the same strict rule after subtracting `threshold`, so values equal to `threshold` are treated as non-positive. Aliases: | Alias | Equivalent call | | --- | --- | | `pesaran_timmermann_test(y_true, y_pred)` | `directional_accuracy_test(..., method="pesaran_timmermann")` | | `anatolyev_gerko_test(y_true, y_pred)` | `directional_accuracy_test(..., method="anatolyev_gerko")` | | `henriksson_merton_test(y_true, y_pred)` | `directional_accuracy_test(..., method="henriksson_merton")` | ## Density And Interval Diagnostics ### density_interval_tests ```python macroforecast.tests.density_interval_tests( pit, *, alpha=0.05, n_bins=10, pit_lag=1, ) ``` Input: probability integral transform values in `[0, 1]`. Output: JSON-ready dictionary with `metadata_schema.kind="density_interval_tests"` plus Berkowitz, KS, Kupiec POF, Christoffersen independence, Engle-Manganelli DQ, Du-Escanciano shortfall, PIT histogram, and PIT autocorrelation diagnostics. Options: | Option | Default | Meaning | | --- | --- | --- | | `alpha` | `0.05` | Tail probability for VaR/shortfall-style hit tests. | | `n_bins` | `10` | Number of PIT histogram bins. | | `pit_lag` | `1` | Lag used for PIT autocorrelation, Berkowitz AR lag, and Du-Escanciano conditional shortfall lag. | Output keys: | Key | Meaning | | --- | --- | | `berkowitz` | Berkowitz density LR test plus Jarque-Bera normality check after normal score transform. | | `ks` | Kolmogorov-Smirnov test against uniform PIT. | | `kupiec_pof` | Unconditional hit-rate test at `alpha`. | | `christoffersen_independence` | Markov independence test for hits. | | `engle_manganelli_dq` | PIT hit-only DQ proxy. Use `dynamic_quantile_test(...)` for the full Engle-Manganelli VaR DQ test. | | `du_escanciano_shortfall` | Du-Escanciano unconditional and conditional shortfall tests. | | `pit_histogram` | One record per histogram bin. | | `pit_autocorrelation` | `TestResult` dictionary for serial PIT dependence. | | `r_reference`, `r_alignment` | Composite provenance metadata. Component-level diagnostics also carry their own R/source metadata. | R/source alignment: | Diagnostic | Reference | | --- | --- | | Berkowitz | `tstests/R/berkowitz.R::berkowitz_test`: PIT to normal scores, ARIMA(`pit_lag`,0,0) unrestricted likelihood versus Normal(0,1); LR df is `2 + pit_lag`. | | Du-Escanciano shortfall | `tstests/R/shortfall_de.R::shortfall_de_test`: cumulative tail shortfall mean test and portmanteau test on centered tail shortfall autocorrelations. | | Kupiec/Christoffersen | `tstests/R/var_cp.R::var_cp_test` and `rugarch/R/rugarch-tests.R`: Bernoulli/transition likelihood-ratio construction. | | PIT hit-only DQ proxy | No direct R comparator. It is a PIT-hit lag diagnostic inside this composite wrapper, not the full Engle-Manganelli VaR DQ test. | Boundary handling: values outside `[0, 1]` raise. Boundary PIT values `0` and `1` are accepted as PIT values but clipped internally for the normal-score Berkowitz transform to avoid infinite ARIMA inputs. ### shortfall_de_test ```python macroforecast.tests.shortfall_de_test( pit, *, alpha=0.05, lags=1, boot=False, n_boot=2000, random_state=0, ) -> dict ``` Input: PIT values in `[0, 1]`. Output: JSON-ready dictionary with `metadata_schema.kind="shortfall_de_test"`. The unconditional statistic is the sample mean of cumulative tail shortfall, `mean((alpha - pit) * 1{pit <= alpha} / alpha)`. The conditional statistic is a portmanteau statistic on autocorrelations of that series centered by `alpha / 2`. With `boot=False`, the unconditional p-value uses the Du-Escanciano normal approximation and the conditional p-value uses `Chi-squared(lags)`. With `boot=True`, both p-values use simulated uniform PIT draws with the same sample size. ### dynamic_quantile_test ```python macroforecast.tests.dynamic_quantile_test( y_true, var, *, alpha=0.05, lag=1, lag_hit=1, lag_var=1, ) -> TestResult ``` Input: realized values and one-step-ahead lower-tail VaR forecasts. Output: `TestResult` for the Engle-Manganelli dynamic quantile test. This is the full VaR DQ callable. It is separate from `density_interval_tests(...)` because the exact DQ statistic needs realized values and VaR forecasts, not PIT values alone. R/source alignment: `segMGarch/R/DQtest.R::DQtest`. The hit series is `1 - alpha` when `y_true < var` and `-alpha` otherwise. The regressor matrix contains a constant, lag-aligned VaR forecasts, `lag_hit` lagged hit columns, and lagged squared realized values. The statistic is `Hit' X (X'X)^(-1) X' Hit / (alpha * (1 - alpha))`, with a chi-squared reference distribution using the number of columns of `X`. R argument mapping: `segMGarch::DQtest` names the VaR probability `VaR_level` and converts it internally to the lower-tail probability `1 - VaR_level`. `macroforecast` accepts the lower-tail probability directly as `alpha`; therefore a 5% lower-tail VaR is `alpha=0.05`, corresponding to `VaR_level=0.95` in the R function. Source: https://rdrr.io/cran/segMGarch/src/R/DQtest.R Options: | Option | Default | Meaning | | --- | --- | --- | | `alpha` | `0.05` | Lower-tail probability. A 5% VaR uses `alpha=0.05`. | | `lag` | `1` | Lag used for squared realized values. | | `lag_hit` | `1` | Number of lagged hit columns. | | `lag_var` | `1` | Lag alignment for VaR forecasts. | ### pit_histogram ```python macroforecast.tests.pit_histogram(pit, *, n_bins=10) -> pandas.DataFrame ``` Returns one row per PIT histogram bin with observed count, expected count under uniformity, and deviation. ### pit_autocorrelation_test ```python macroforecast.tests.pit_autocorrelation_test( pit, *, lag=1, alpha=0.05, ) -> TestResult ``` Runs a normal-approximation test for serial dependence in PIT values. ### interval_coverage_test ```python macroforecast.tests.interval_coverage_test( y_true, lower, upper, *, alpha=0.05, ) -> dict ``` Runs Kupiec POF, Christoffersen independence, combined conditional coverage, and Christoffersen-Pelletier duration diagnostics for forecast intervals. `alpha` is the expected non-coverage rate, so a 90% interval uses `alpha=0.10`. Boundary cases follow the likelihood-ratio convention used by R `tstests::var_cp_test` and `rugarch::VaRTest`: zero violations do not automatically imply a passing Kupiec statistic; the restricted Bernoulli likelihood is compared with the boundary unrestricted likelihood. The `christoffersen_pelletier_duration` output follows the duration-test logic in `tstests/R/var_cp.R::.duration_test`: durations between interval misses are modeled with a Weibull likelihood, and the no-memory exponential restriction is tested by setting the Weibull shape parameter to `1`. The duration statistic is unavailable when there is one or fewer misses. Coverage output: | Key | Meaning | | --- | --- | | `kupiec_pof` | Unconditional coverage LR. Carries `tstests` and `rugarch` references. | | `christoffersen_independence` | First-order Markov independence LR plus transition counts `n00`, `n01`, `n10`, `n11`. | | `christoffersen_conditional_coverage` | Sum of Kupiec and independence LR statistics with chi-squared df 2. | | `christoffersen_pelletier_duration` | Weibull duration LR for the exponential no-memory restriction. | | `r_reference`, `rugarch_reference`, `r_alignment` | Package-level provenance. | Duration likelihood note: the duration construction is the same in `tstests` and `rugarch`. The implemented density/survival likelihood follows `rugarch/R/rugarch-tests.R::VaRDurTest`, which is the internally consistent Christoffersen-Pelletier Weibull likelihood form. ## Conditional Predictive Ability ### conditional_predictive_ability_test ```python macroforecast.tests.conditional_predictive_ability_test( loss_a, loss_b, *, method="giacomini_rossi", window_ratio=0.5, dmv_fullsample=True, lag_truncate=0, alpha=0.05, ) ``` Input: two aligned loss series. Output: JSON-ready dictionary with `metadata_schema.kind="conditional_predictive_ability"`, a fluctuation statistic, critical value, decision, time path, window size, loss-difference orientation, and source-alignment metadata. Supported methods: `giacomini_rossi`, `recursive_fluctuation`. The `giacomini_rossi` branch is aligned with `murphydiagram/R/procs.R::fluctuation_test`, which implements Proposition 1 of Giacomini and Rossi (2010). It computes rolling-window Diebold-Mariano-type statistics for the loss difference `loss_a - loss_b`, uses Bartlett HAC variance, and compares the supremum absolute statistic with the tabulated critical values from Giacomini-Rossi Table 1. Positive path values mean `loss_a` is larger than `loss_b` over that window, so the final statistic is two-sided because it uses the supremum absolute path. R alignment: | R package / function | macroforecast branch | Alignment | | --- | --- | --- | | `murphydiagram/R/procs.R::fluctuation_test` | `method="giacomini_rossi"` | Same `ld <- loss1 - loss2`, same `mu` grid, same Table 1 critical values, same `lag_truncate in 0:5`, same Bartlett HAC convention, same `dmv_fullsample` and rolling-denominator branches. | | None | `method="recursive_fluctuation"` | Package extension over expanding-prefix loss windows. It reuses the same Bartlett HAC helper but does not claim to implement a named R-package test. | Options: | Option | Default | Choices | Meaning | | --- | --- | --- | --- | | `method` | `"giacomini_rossi"` | `"giacomini_rossi"`, `"recursive_fluctuation"` | `giacomini_rossi` is R-aligned; `recursive_fluctuation` is a package extension over expanding loss windows. | | `window_ratio` | `0.5` | `0.1`, `0.2`, ..., `0.9` for `giacomini_rossi` | Rolling window size as a fraction of the evaluation sample. | | `dmv_fullsample` | `True` | boolean | If `True`, estimate HAC variance on the full loss-difference sample, matching the R default. If `False`, use each rolling window's HAC variance. | | `lag_truncate` | `0` | `0`, `1`, ..., `5` | Bartlett HAC truncation lag, matching the R package's allowed range. | | `alpha` | `0.05` | `0.05`, `0.10` for `giacomini_rossi` | Test size used to select the tabulated critical value. | Output fields: | Field | Meaning | | --- | --- | | `statistic` | Supremum absolute value of the fluctuation path. | | `time_path` | Rolling or recursive fluctuation path before the supremum is taken. | | `critical_value`, `critical_band`, `decision` | Tabulated Giacomini-Rossi comparison when available; `None` for the recursive extension. | | `variance_scope` | `"full_sample"` when `dmv_fullsample=True`; `"rolling_window"` otherwise. | | `loss_difference_orientation` | Always `loss_a - loss_b`; positive path values mean `loss_a` has larger loss. | | `source_reference`, `external_reference`, `r_reference`, `r_alignment` | Source and R-package comparison metadata. | | `requested_method`, `method`, `alias_warning` | The user-supplied method, normalized method, and any alias caveat. | `method="rossi_sekhposyan"` remains accepted as a legacy alias for `recursive_fluctuation`, but Rossi-Sekhposyan forecast rationality is a different test family and is not represented by this loss-comparison callable. ## Multiple-Model Tests ### blocked_oob_reality_check ```python macroforecast.tests.blocked_oob_reality_check( loss_panel, *, benchmark, loss="squared_error", alpha=0.05, n_boot=1000, block_length=4, bootstrap_method="fixed_block_bootstrap", random_state=0, target="target", horizon="horizon", origin="origin", model="model_id", ) -> pandas.DataFrame ``` Block-bootstrap one-sided benchmark-superiority screen against a named benchmark model. This is the direct callable replacement for the legacy `blocked_oob_reality_check` operation. It is intentionally documented as a legacy screen, not as the exact White Reality Check. Inputs: | Form | Required columns | | --- | --- | | Long panel | `origin`, `model_id`, `squared_error`; optional `target` and `horizon`. Column names are configurable. | | Wide matrix | One column per model, including the `benchmark` column. The index is treated as origin order. | Long-panel input must have one row per target/horizon/origin/model key. If the loss table contains duplicate rows for that key, aggregate them explicitly before calling; the test helpers do not average duplicates silently. Output: one row per candidate model and target/horizon group. | Column | Meaning | | --- | --- | | `target`, `horizon` | Group labels. Wide input uses `"all"` for both. | | `model` | Candidate model tested against the benchmark. | | `benchmark` | Benchmark model name. | | `mean_diff` | `benchmark_loss - candidate_loss`; positive means candidate has lower loss. | | `statistic` | Mean loss differential scaled by bootstrap standard error. | | `p_value` | Pairwise one-sided block-bootstrap p-value for no improvement over benchmark. | | `decision` | `True` when `p_value < alpha`. | | `familywise_p_value` | Max-bootstrap p-value adjusted across all candidate models in the same target/horizon group. | | `familywise_decision` | `True` when `familywise_p_value < alpha`. | | `familywise_n_obs` | Complete-case origins used for the family-wise adjustment. | | `n_obs` | Number of aligned origins. | | `block_length`, `n_boot`, `bootstrap_method` | Bootstrap settings used. | | `source_reference`, `r_reference`, `r_alignment` | Provenance metadata. `r_reference` is `None` because this legacy screen has no exact R-package comparator. | The returned table carries `attrs["macroforecast_metadata_schema"]["kind"] = "blocked_oob_reality_check"`. R/source comparison: | Function | Status | | --- | --- | | `blocked_oob_reality_check(...)` | No exact R-package comparator. It computes pairwise and family-wise max-centered block bootstrap p-values from precomputed benchmark/candidate loss differences. | | `ttrTests/R/dataSnoop.R::dataSnoop(test="RC" or "SPA")` | Strategy-specific data-snooping code. It rebuilds technical-trading parameter-grid performance on each bootstrapped price sample, so it is not the same API contract. | | `reality_check_test(...)`, `superior_predictive_ability_test(...)`, `stepm_test(...)` | Exact multiple-comparison callable family for White RC, Hansen SPA, and Romano-Wolf StepM using the optional `arch.bootstrap` backend. | ### superior_predictive_ability_test ```python macroforecast.tests.superior_predictive_ability_test( loss_panel, *, benchmark, loss="squared_error", alpha=0.05, n_boot=1000, block_length="auto", bootstrap_method="stationary_bootstrap", p_value_type="consistent", studentize=True, nested=False, random_state=0, target="target", horizon="horizon", origin="origin", model="model_id", ) -> dict ``` Input: long or wide loss panel with a named benchmark model. Output: JSON-ready dictionary with one record per target/horizon group. The record contains `p_values` for `lower`, `consistent`, and `upper` SPA p-value variants, `critical_values`, selected `p_value`, `superior_models`, and backend metadata. Backend alignment: delegates to `arch.bootstrap.SPA`. The backend takes benchmark losses and candidate losses, forms loss differentials internally as `benchmark_loss - candidate_loss`, and reports `lower`, `consistent`, and `upper` p-values from Hansen's recentering choices. Positive `mean_loss_difference` in the output means the candidate has lower average loss than the benchmark. R/source comparison: archived R `ttrTests/R/dataSnoop.R::dataSnoop(test="SPA")` implements Hansen SPA for technical-trading rule parameter grids. It recomputes strategy performance on each bootstrapped price sample, so it is not a direct general loss-matrix API. `macroforecast` keeps the general forecast-evaluation contract and records this as conceptual R alignment in each output record. Options: | Option | Default | Choices | Meaning | | --- | --- | --- | --- | | `bootstrap_method` | `"stationary_bootstrap"` | `"stationary_bootstrap"`, `"fixed_block_bootstrap"` | Bootstrap family. Fixed-block inputs are mapped to `arch`'s moving-block backend. | | `p_value_type` | `"consistent"` | `"lower"`, `"consistent"`, `"upper"` | Which SPA p-value variant to use for `p_value` and `decision`. | | `studentize` | `True` | boolean | Passed to `arch.bootstrap.SPA`. | | `nested` | `False` | boolean | Passed to `arch.bootstrap.SPA` for nested model sets. | ### reality_check_test ```python macroforecast.tests.reality_check_test( loss_panel, *, benchmark, loss="squared_error", alpha=0.05, n_boot=1000, block_length="auto", bootstrap_method="stationary_bootstrap", p_value_type="consistent", studentize=True, nested=False, random_state=0, target="target", horizon="horizon", origin="origin", model="model_id", ) -> dict ``` Input and output follow `superior_predictive_ability_test(...)`. Backend: `arch.bootstrap.RealityCheck`. In the current `arch` backend this class is a Reality Check alias over the same SPA machinery, with the same p-value fields. Use this when the research design calls for the White Reality Check against a benchmark model. R/source comparison: archived R `ttrTests/R/dataSnoop.R::dataSnoop(test="RC")` implements White's Reality Check for technical-trading rule grids. As with SPA, the R function is strategy-generator specific; `macroforecast` uses precomputed benchmark and candidate forecast-loss series. ### stepm_test ```python macroforecast.tests.stepm_test( loss_panel, *, benchmark, loss="squared_error", alpha=0.05, n_boot=1000, block_length="auto", bootstrap_method="stationary_bootstrap", studentize=True, nested=False, random_state=0, target="target", horizon="horizon", origin="origin", model="model_id", ) -> dict ``` Input: long or wide loss panel with a named benchmark model. Output: JSON-ready dictionary with `superior_models` for each target/horizon group. Backend: `arch.bootstrap.StepM`. R/source comparison: `oosanalysis-R-library/R/stepm.R::stepm` implements a generic Romano-Wolf stepdown loop from supplied test statistics and bootstrap test-statistic draws. `macroforecast` delegates to `arch.bootstrap.StepM`, which constructs the benchmark-vs-candidate loss-difference statistics using the SPA backend and then applies the stepdown procedure. The objective is aligned, but the inputs are higher level in `macroforecast`: forecast-loss panel in, superior model names out. ### model_confidence_set ```python macroforecast.tests.model_confidence_set( loss_panel, *, loss="squared_error", alpha=0.10, n_boot=1000, block_length="auto", bootstrap_method="mcs_fixed_block", statistic="max", random_state=0, target="target", horizon="horizon", origin="origin", model="model_id", ) -> dict ``` Exact Hansen-Lunde-Nason model confidence set callable aligned with the R `MCS` package's `MCSprocedure`. It constructs pairwise loss-difference statistics, bootstraps those loss-difference means, removes one model per step, tracks cumulative MCS p-values, and records included and rejected model sets by target/horizon group. Inputs: | Form | Required columns | | --- | --- | | Long panel | `origin`, `model_id`, and the selected loss column. `target` and `horizon` are optional grouping columns. | | Wide matrix | Numeric model-loss columns. The target/horizon labels are set to `"all"`. | Long-panel input must have one row per target/horizon/origin/model key. Duplicate loss rows are rejected instead of being averaged inside the pivot step. Options: | Option | Default | Choices | Meaning | | --- | --- | --- | --- | | `statistic` | `"max"` | `"max"`, `"range"` | `"max"` maps to R `statistic="Tmax"` over `d_i.`; `"range"` maps to R `statistic="TR"` over pairwise `d_ij`. | | `bootstrap_method` | `"mcs_fixed_block"` | `"mcs_fixed_block"`, `"stationary_bootstrap"`, `"fixed_block_bootstrap"` | `mcs_fixed_block` follows R `MCS/R/internalFunctions.R::GetIndices`; the other choices are package extensions. | | `block_length` | `"auto"` | positive int or `"auto"` | Block length. `"auto"` follows the R rule conceptually: selected AR order across loss columns, with a minimum of 3. | Output: JSON-ready dictionary with `metadata_schema.kind="model_confidence_set"`. | Key | Meaning | | --- | --- | | `mcs_inclusion` | Included model records by target, horizon, and alpha after the iterative procedure stops. | | `mcs_rejections` | Eliminated model records by target, horizon, and alpha. | | `p_values` | Final stopping-test p-value by target and horizon. | | `iteration_path` | One record per removal step, including active models, statistic, p-value, cumulative MCS p-value, removed model, rejected model if any, and mean losses. | | `block_lengths_used` | Block length used by target and horizon. | R/source alignment: | R source | Python contract | | --- | --- | | `MCS/R/MCSprocedure.R::MCSprocedure` | Sequential elimination until one model remains; included/excluded sets are determined by `p-Value for H_{0,M_k}` relative to `alpha`. | | `MCS/R/internalFunctions.R::GetD` | Pairwise loss differences `d_ij` and model-average differences `d_i.`. | | `MCS/R/internalFunctions.R::GetIndices` | Default `bootstrap_method="mcs_fixed_block"` samples consecutive fixed blocks and truncates to sample length. | `block_length="auto"` follows the same rule conceptually as R `k=NULL`: choose the maximum selected AR order across loss columns and enforce a minimum of 3. For bit-level reproducibility across software stacks, pass an explicit integer `block_length`. ### iterative_model_confidence_set ```python macroforecast.tests.iterative_model_confidence_set( loss_panel, *, loss="squared_error", alpha=0.10, n_boot=1000, block_length="auto", bootstrap_method="mcs_fixed_block", statistic="max", random_state=0, target="target", horizon="horizon", origin="origin", model="model_id", ) ``` Descriptive alias for `model_confidence_set(...)`. It calls the same exact MCS engine and returns the same fields, with `metadata_schema.kind="iterative_model_confidence_set"` so older code can trace which callable produced the result. ## Residual Diagnostics ### residual_diagnostics ```python macroforecast.tests.residual_diagnostics( residuals, *, tests=( "ljung_box_q", "arch_lm", "jarque_bera_normality", "durbin_watson", ), lag=10, alpha=0.05, model_df=0, exog=None, demean_arch=False, ) ``` Input: residual series. Output: one-row-per-test pandas `DataFrame` with `test`, `statistic`, `p_value`, `decision`, `lag_used`, `df`, `n_obs`, `source_reference`, `r_reference`, `r_alignment`, and `status`. The result carries `attrs["macroforecast_metadata_schema"] = {"kind": "residual_diagnostics", "version": 1, ...}`. Supported tests: | Name | Meaning | | --- | --- | | `ljung_box_q` | Ljung-Box serial-correlation diagnostic, aligned with `stats::Box.test(type="Ljung-Box")`; `model_df` maps to R `fitdf`. | | `breusch_godfrey_serial_correlation` | Breusch-Godfrey Chisq LM diagnostic under the residual-series contract; default is equivalent to testing `residuals ~ 1`, and `exog` supplies additional original-regression design columns. | | `arch_lm` | Engle ARCH LM diagnostic, aligned with `FinTS::ArchTest`; `demean_arch=True` matches its `demean=TRUE` option. | | `jarque_bera_normality` | Jarque-Bera normality diagnostic using the same population-moment formula as `tseries::jarque.bera.test`. | | `durbin_watson` | Durbin-Watson statistic aligned with the statistic in `lmtest::dwtest`; p-value is not supplied because `lmtest`'s exact p-value uses a model-design distribution not available from residuals alone. | Options: | Option | Default | Meaning | | --- | --- | --- | | `lag` | `10` | Maximum lag for Ljung-Box, ARCH-LM, and Breusch-Godfrey. | | `alpha` | `0.05` | Rejection level used for `decision`. | | `model_df` | `0` | Degrees of freedom consumed by the fitted model. Used in Ljung-Box p-values and ARCH-LM degrees-of-freedom adjustment. | | `exog` | `None` | Optional design matrix for the Breusch-Godfrey auxiliary regression. If omitted, an intercept-only design is used. | | `demean_arch` | `False` | Demean residuals before ARCH-LM, matching `FinTS::ArchTest(demean=TRUE)` when enabled. | Source-alignment notes: | Diagnostic | Source logic | | --- | --- | | Ljung-Box | `stats::Box.test(type="Ljung-Box")`: `Q = n(n+2) sum rho_k^2/(n-k)`, chi-squared df `lag - model_df`; `model_df` is R `fitdf`. | | ARCH-LM | `FinTS/R/ArchTest.R::ArchTest`: optionally demean residuals, embed `x^2`, regress current squared residuals on lagged squared residuals, statistic is effective sample size times auxiliary `R^2`. `model_df` is a statsmodels degrees-of-freedom adjustment beyond the R API. | | Jarque-Bera | `tseries/R/test.R::jarque.bera.test`: population central moments, `n * skewness^2 / 6 + n * (kurtosis - 3)^2 / 24`, chi-squared df `2`. | | Breusch-Godfrey | `lmtest/R/bgtest.R::bgtest`: R takes a fitted model or formula. `macroforecast` takes residuals and optional `exog`, then applies the same Chisq LM auxiliary formula with fill-zero lagged residual columns under that residual-series contract. | | Durbin-Watson | `lmtest/R/dwtest.R::dwtest`: statistic `sum(diff(residuals)^2) / sum(residuals^2)`. P-values are omitted because R's exact/asymptotic p-value depends on the original regression design matrix. | - `jarque_bera_test` -- Jarque-Bera normality test (single series, chi2 df=2; tseries::jarque.bera.test convention). - `granger_causality` -- Granger causality test in a VAR (vars::causality; F or Wald). - `instantaneous_causality` -- instantaneous (contemporaneous) causality test in a VAR. - `giacomini_white_test` -- Giacomini-White (2006) CONDITIONAL predictive ability Wald test (chi2, HAC), instrument [1, dL_{t-h}]. - `var_serial_test` -- multivariate residual serial-correlation (Portmanteau/LM) test for a VAR (vars::serial.test). - `var_normality_test` -- multivariate normality (Doornik-Hansen/Lutkepohl JB) test for VAR residuals (vars::normality.test). - `var_arch_test` -- multivariate ARCH-LM test for VAR residuals (vars::arch.test, Lutkepohl).