macroforecast.evaluation#

macroforecast.evaluation owns evaluation reports. Raw scoring functions still live in macroforecast.metrics, and forecast-comparison statistical tests still live in macroforecast.tests.

import macroforecast as mf

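# The evaluation namespace re-exports the scoring and test modules,
# so both identity checks evaluate to True.
mf.evaluation.metrics is mf.metrics
mf.evaluation.tests is mf.tests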

The public API contract is:

Namespace	Owns	Does not own
`macroforecast.metrics`	Forecast scoring, ranking, metric resolution.	Statistical comparison tests.
`macroforecast.tests`	Forecast-comparison tests, density diagnostics, residual diagnostics.	General scoring tables.
`macroforecast.evaluation`	Multi-slice evaluation reports, OOS-period filtering, benchmark comparisons, regime scoring, and error decomposition.	Raw metric functions or statistical test functions.

Public defaults:

Symbol	Meaning
`DEFAULT_METRICS`	Default metric tuple used by `evaluate_report(...)`.
`DEFAULT_SCORE_BY`	Default grouping columns for score aggregation.
`BENCHMARK_METRICS`	Default benchmark-comparison metrics.

Public Flow#

report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", "rmse", "mae", "relative_mse", "r2_oos"),
    benchmark_model="historical_mean",
    time_frequency="Q",
)

scores = report.scores
ranking = report.ranking
by_regime = report.regime

evaluate_report#

macroforecast.evaluation.evaluate_report(
    forecasts,
    *,
    metrics=("mse", "rmse", "mae"),
    score_by=("model", "horizon"),
    aggregations=None,
    rank_metric=None,
    rank_by=None,
    benchmark_model=None,
    benchmark_metrics=("mse", "mae", "relative_mse", "relative_mae", "mse_reduction", "r2_oos"),
    oos_start=None,
    oos_end=None,
    regimes=None,
    regime_column="regime",
    target_column="target",
    state_column="state",
    time_frequency=None,
    time_column="date",
    time_bucket_column="time_bucket",
    include_decomposition=False,
    decomposition_by=None,
    include_combined=True,
) -> EvaluationReport

Input#

Name	Type	Default	Description
`forecasts`	`ForecastResult` or `DataFrame`	required	Forecast runner output or forecast-like table.
`metrics`	sequence	`("mse", "rmse", "mae")`	Metric names accepted by `mf.metrics.get_metric(...)` or callables.
`score_by`	sequence	`("model", "horizon")`	Main score grouping. Columns must exist.
`aggregations`	mapping, sequence, or `None`	`None` (auto)	Extra groupings to evaluate. `None` creates model, horizon, model-horizon, and available target/state/regime/time slices.
`rank_metric`	str or `None`	`None` (auto)	Metric used for `ranking`. When `None`, the preference order is `rmse`, `mse`, `mae`, `r2_oos`, `relative_mse`.
`rank_by`	sequence or `None`	`None` (`score_by` without `model`)	Ranking groups.
`benchmark_model`	str or `None`	`None`	Model name used for relative metrics and benchmark table.
`benchmark_metrics`	sequence	default benchmark metrics	Metrics for `benchmark_comparison`.
`oos_start`, `oos_end`	date-like or `None`	`None`	Restrict forecast rows before scoring. Dates are inclusive.
`regimes`	mapping, Series, str, or `None`	`None`	Date-to-regime labels, existing regime column name, or no extra regime attachment.
`regime_column`	str	`"regime"`	Column used for regime scoring.
`target_column`, `state_column`	str	`"target"`, `"state"`	Optional slice columns when present.
`time_frequency`	str or `None`	`None`	Pandas period frequency such as `"M"`, `"Q"`, `"A"` for time-bucket aggregation.
`include_decomposition`	bool	`False`	Add MSE decomposition into squared bias and residual variance.
`decomposition_by`	sequence or `None`	`None` (`score_by`)	Grouping used by `error_decomposition`.
`include_combined`	bool	`True`	Include forecast-combination rows.
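
The time_frequency option groups forecast dates into period buckets before scoring. The sketch below mimics quarterly bucketing in plain Python so the idea is visible without pandas. quarter_bucket is a hypothetical helper, not part of macroforecast, and the labels the library writes into time_bucket_column may be formatted differently.

```python
from datetime import date


def quarter_bucket(d: date) -> str:
    """Label a date with its calendar quarter, e.g. 2020-02-15 -> '2020Q1'."""
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"


# Toy forecast rows; the column names mirror the defaults in the signature.
rows = [
    {"model": "ar1", "date": date(2020, 1, 31), "actual": 1.0, "prediction": 0.5},
    {"model": "ar1", "date": date(2020, 4, 30), "actual": 2.0, "prediction": 2.5},
    {"model": "ar1", "date": date(2020, 6, 30), "actual": 3.0, "prediction": 2.0},
    {"model": "ar1", "date": date(2020, 12, 31), "actual": 4.0, "prediction": 4.0},
]

# Attach the bucket label as an extra column that can be used for grouping.
for row in rows:
    row["time_bucket"] = quarter_bucket(row["date"])

buckets = sorted({row["time_bucket"] for row in rows})
```

Once each row carries a bucket label, a time slice is just another grouping column alongside model or horizon.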

Custom scoring belongs in the metrics argument:

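import pandas as pd

def mean_bias(y_true, y_pred):
    # Custom metric: mean of (prediction - actual); positive means over-prediction.
    return float(pd.Series(y_pred).sub(pd.Series(y_true)).mean())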

report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", "rmse", mean_bias),
    aggregations={
        "model_target": ("model", "target"),
        "model_regime": ("model", "regime"),
    },
)

Custom aggregation slices belong in aggregations. Each value is a tuple of existing forecast-table columns to group by; evaluation still calls mf.metrics.evaluate_forecasts() to compute each metric table. When relative metrics are requested, every scoring or aggregation grouping must include model, because benchmark-relative scores are specific to each candidate model. In that case the automatic aggregation set omits model-free slices such as horizon alone.
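
To make the role of a grouping tuple concrete, here is a minimal sketch in plain Python that computes MSE once per named grouping. mse_by_groups is a hypothetical stand-in for illustration only; it is not how macroforecast computes its tables, and the toy rows are made up.

```python
from collections import defaultdict


def mse_by_groups(rows, group_columns):
    """Return {group key tuple: mean squared error} for one grouping."""
    squared_errors = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_columns)
        squared_errors[key].append((row["actual"] - row["prediction"]) ** 2)
    return {key: sum(vals) / len(vals) for key, vals in squared_errors.items()}


rows = [
    {"model": "ar1", "target": "gdp", "actual": 2.0, "prediction": 1.0},
    {"model": "ar1", "target": "gdp", "actual": 3.0, "prediction": 5.0},
    {"model": "ar1", "target": "cpi", "actual": 1.0, "prediction": 1.0},
    {"model": "ridge", "target": "gdp", "actual": 2.0, "prediction": 2.0},
]

# Same shape as the aggregations argument: name -> tuple of grouping columns.
aggregations = {
    "model": ("model",),
    "model_target": ("model", "target"),
}

# One metric table per named grouping.
tables = {name: mse_by_groups(rows, cols) for name, cols in aggregations.items()}
```

Each grouping name becomes one table, and every grouping here includes model, which is the shape relative metrics require.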

Output#

Returns EvaluationReport.

Field	Type	Meaning
`scores`	`DataFrame`	Main metric table over `score_by`.
`ranking`	`DataFrame`	Ranked `scores` table.
`aggregations`	`dict[str, DataFrame]`	Extra metric tables by requested or auto-discovered slices.
`benchmark`	`DataFrame` or `None`	Candidate rows relative to `benchmark_model`.
`regime`	`DataFrame` or `None`	Regime-specific metric table when regime labels are available.
`decomposition`	`DataFrame` or `None`	Error decomposition table when requested.
`metadata`	`dict`	Input metadata plus compact `evaluation_report` stage.

EvaluationReport.to_dict() serializes all tables into JSON-ready records. The serialized payload includes metadata_schema={"kind": "evaluation_report", "version": 1}.

The metadata stage records options, table row counts, and forecast-table input shape:

report.metadata["evaluation_report"]

For paper/report output, keep evaluation and presentation separate:

main_table = mf.reporting.metric_report_table(
    report,
    columns=("model", "horizon", "rmse", "r2_oos"),
    percent_columns=("r2_oos",),
)

paper_tables = mf.reporting.evaluation_report_tables(
    report,
    include=("scores", "ranking", "benchmark", "decomposition"),
)

metric_report_table(...) creates one presentation-ready table. evaluation_report_tables(...) creates a named ReportBundle for the report’s main components.

aggregate_scores#

macroforecast.evaluation.aggregate_scores(
    forecasts,
    *,
    groupings,
    metrics=("mse", "rmse", "mae"),
    benchmark_model=None,
) -> dict[str, pandas.DataFrame]

Evaluates one forecast table over multiple explicit groupings.

tables = mf.evaluation.aggregate_scores(
    result,
    groupings={
        "model": ("model",),
        "model_horizon_target": ("model", "horizon", "target"),
    },
)

All requested columns must exist. This function fails loudly instead of silently dropping unavailable dimensions.
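
The fail-loudly behaviour can be pictured as an up-front column check. The sketch below is an assumption about the general pattern, not the library's code: check_grouping_columns is a hypothetical helper, and the exception type and message macroforecast actually raises may differ.

```python
def check_grouping_columns(available_columns, groupings):
    """Raise KeyError naming every grouping that references a missing column."""
    available = set(available_columns)
    problems = {
        name: [col for col in cols if col not in available]
        for name, cols in groupings.items()
    }
    # Keep only the groupings that actually have missing columns.
    problems = {name: cols for name, cols in problems.items() if cols}
    if problems:
        raise KeyError(f"grouping columns not found in forecast table: {problems}")


columns = ["model", "horizon", "date", "actual", "prediction"]

# All columns exist, so this call returns without raising.
check_grouping_columns(columns, {"model": ("model",)})

# "target" is not a column, so the check raises instead of dropping the slice.
try:
    check_grouping_columns(columns, {"model_target": ("model", "target")})
    message = None
except KeyError as exc:
    message = str(exc)
```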

filter_oos_period#

macroforecast.evaluation.filter_oos_period(
    forecasts,
    *,
    start=None,
    end=None,
    date_column="date",
) -> pandas.DataFrame

Returns forecast rows inside an inclusive out-of-sample date interval. This is the callable replacement for an oos_period setting. Use it directly when you want to score only a subsample, or pass oos_start/oos_end to evaluate_report(...).
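
The inclusive interval means rows dated exactly on start or end are kept. A minimal sketch of that rule in plain Python follows; filter_inclusive is a hypothetical helper over a list of dicts, used here only to show the boundary behaviour, and is not the library function.

```python
from datetime import date


def filter_inclusive(rows, start=None, end=None, date_column="date"):
    """Keep rows with start <= row[date_column] <= end; None leaves a side open."""
    return [
        row
        for row in rows
        if (start is None or row[date_column] >= start)
        and (end is None or row[date_column] <= end)
    ]


rows = [
    {"date": date(2019, 12, 31), "prediction": 1.0},
    {"date": date(2020, 1, 31), "prediction": 1.1},
    {"date": date(2020, 6, 30), "prediction": 1.2},
    {"date": date(2020, 7, 31), "prediction": 1.3},
]

# Both endpoints fall on forecast dates and both are retained.
kept = filter_inclusive(rows, start=date(2020, 1, 31), end=date(2020, 6, 30))
kept_dates = [row["date"] for row in kept]

# Leaving end open keeps everything from start onward.
open_ended = filter_inclusive(rows, start=date(2020, 6, 30))
```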

error_decomposition#

macroforecast.evaluation.error_decomposition(
    forecasts,
    *,
    by=("model", "horizon"),
    actual="actual",
    prediction="prediction",
) -> pandas.DataFrame

Decomposes MSE within each group as:

mse = bias_squared + residual_variance

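where bias is the mean of the residuals actual - prediction within the group, bias_squared is its square, and residual_variance is the variance of the residuals around that mean. Output columns include n, mse, bias, bias_squared, residual_variance, bias_share, and variance_share.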
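
A small worked example shows the identity numerically. It assumes residual_variance is the population variance (divisor n), which is what makes the identity hold exactly, and it assumes the two share columns are each component divided by mse; the library's exact conventions are not stated here.

```python
from statistics import fmean, pvariance

actual = [2.0, 4.0, 6.0, 8.0]
prediction = [1.0, 5.0, 4.0, 6.0]

# Residuals follow the sign convention actual - prediction.
residuals = [a - p for a, p in zip(actual, prediction)]

n = len(residuals)
mse = fmean([r ** 2 for r in residuals])
bias = fmean(residuals)
bias_squared = bias ** 2
# Population variance (divisor n), assumed so that the identity is exact.
residual_variance = pvariance(residuals)

# Assumed definition of the share columns: each component over total MSE.
bias_share = bias_squared / mse
variance_share = residual_variance / mse

identity_holds = abs(mse - (bias_squared + residual_variance)) < 1e-12
```

With these numbers the residuals are 1, -1, 2, 2, so mse is 2.5, bias_squared is 1.0, and residual_variance is 1.5. With a sample variance (divisor n - 1) the two components would no longer sum to mse.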

benchmark_comparison#

macroforecast.evaluation.benchmark_comparison(
    forecasts,
    *,
    benchmark_model,
    by=("model", "horizon"),
    metrics=("mse", "mae", "relative_mse", "relative_mae", "mse_reduction", "r2_oos"),
) -> pandas.DataFrame

Returns candidate model rows with benchmark-relative scores. The benchmark row itself is removed from the output. benchmark_model must be present in the forecast table. The benchmark forecasts must already exist in the input table; this function does not generate them. Use the forecasting runner to generate a same-window benchmark, or append an external benchmark forecast table only after verifying identical date/origin/horizon/target support.
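
The support check recommended above can be sketched as a set comparison before any relative score is computed. Everything in this block is illustrative: support and mse are hypothetical helpers, the rows are made up, origin is left out of the key for brevity, and the formulas for relative_mse (candidate MSE over benchmark MSE) and r2_oos (one minus that ratio) are the conventional definitions, assumed rather than taken from the library.

```python
SUPPORT_KEYS = ("date", "horizon", "target")


def support(rows, model):
    """Set of (date, horizon, target) keys covered by one model's forecasts."""
    return {
        tuple(row[key] for key in SUPPORT_KEYS)
        for row in rows
        if row["model"] == model
    }


def mse(rows, model):
    """Mean squared error over all rows belonging to one model."""
    errors = [(r["actual"] - r["prediction"]) ** 2 for r in rows if r["model"] == model]
    return sum(errors) / len(errors)


rows = [
    {"model": "ridge", "date": "2020-03-31", "horizon": 1, "target": "gdp", "actual": 2.0, "prediction": 1.5},
    {"model": "ridge", "date": "2020-06-30", "horizon": 1, "target": "gdp", "actual": 4.0, "prediction": 4.5},
    {"model": "historical_mean", "date": "2020-03-31", "horizon": 1, "target": "gdp", "actual": 2.0, "prediction": 3.0},
    {"model": "historical_mean", "date": "2020-06-30", "horizon": 1, "target": "gdp", "actual": 4.0, "prediction": 3.0},
]

# Refuse to compare models that were scored on different forecast rows.
same_support = support(rows, "ridge") == support(rows, "historical_mean")
if not same_support:
    raise ValueError("candidate and benchmark cover different forecast rows")

relative_mse = mse(rows, "ridge") / mse(rows, "historical_mean")
r2_oos = 1.0 - relative_mse
```

A relative_mse below 1 (equivalently, a positive r2_oos) means the candidate beat the benchmark on the shared rows.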

regime_scores#

macroforecast.evaluation.regime_scores(
    forecasts,
    *,
    regimes=None,
    regime_column="regime",
    by=("model", "horizon", "regime"),
    metrics=("mse", "rmse", "mae"),
    benchmark_model=None,
) -> pandas.DataFrame

regimes can be:

Form	Meaning
`None`	Use an existing `regime_column`.
`str`	Use that existing column as the source and copy to `regime_column` when names differ.
mapping or `Series`	Map forecast `date` values to regime labels.
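
The mapping form can be pictured as a date lookup followed by an ordinary grouped score. The sketch below uses plain Python and hypothetical toy data; how macroforecast treats dates that have no label is not covered here, so every date in the example is labelled.

```python
from collections import defaultdict
from datetime import date

# Date-to-regime labels, the same shape as a mapping passed to regimes.
regimes = {
    date(2020, 3, 31): "recession",
    date(2020, 6, 30): "recession",
    date(2021, 3, 31): "expansion",
}

rows = [
    {"model": "ar1", "date": date(2020, 3, 31), "actual": 1.0, "prediction": 3.0},
    {"model": "ar1", "date": date(2020, 6, 30), "actual": 2.0, "prediction": 2.0},
    {"model": "ar1", "date": date(2021, 3, 31), "actual": 3.0, "prediction": 2.0},
]

# Attach the label as a regime column by looking up each forecast date.
for row in rows:
    row["regime"] = regimes[row["date"]]

# Score by (model, regime), mirroring the default by tuple without horizon.
squared_errors = defaultdict(list)
for row in rows:
    key = (row["model"], row["regime"])
    squared_errors[key].append((row["actual"] - row["prediction"]) ** 2)

mse_by_regime = {key: sum(vals) / len(vals) for key, vals in squared_errors.items()}
```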

Boundary#

Question	Use
One metric value or one metric table	`mf.metrics`
Multi-slice report with ranking, OOS filtering, benchmark, regime, target/state/time aggregation, decomposition	`mf.evaluation`
Diebold-Mariano, Clark-West, MCS, residual tests	`mf.tests`