macroforecast.evaluation


macroforecast.evaluation owns evaluation reports. Raw scoring functions still live in macroforecast.metrics, and forecast-comparison statistical tests still live in macroforecast.tests.

import macroforecast as mf

# The evaluation namespace exposes the metrics and tests namespaces as aliases.
assert mf.evaluation.metrics is mf.metrics
assert mf.evaluation.tests is mf.tests

The public API contract is:

| Namespace | Owns | Does not own |
| --- | --- | --- |
| macroforecast.metrics | Forecast scoring, ranking, metric resolution. | Statistical comparison tests. |
| macroforecast.tests | Forecast-comparison tests, density diagnostics, residual diagnostics. | General scoring tables. |
| macroforecast.evaluation | Multi-slice evaluation reports, OOS-period filtering, benchmark comparisons, regime scoring, and error decomposition. | Raw metric functions or statistical test functions. |

Public defaults:

| Symbol | Meaning |
| --- | --- |
| DEFAULT_METRICS | Default metric tuple used by evaluate_report(...). |
| DEFAULT_SCORE_BY | Default grouping columns for score aggregation. |
| BENCHMARK_METRICS | Default benchmark-comparison metrics. |

Public Flow

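# forecast_result is a ForecastResult or a forecast-like DataFrame.
report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", "rmse", "mae", "relative_mse", "r2_oos"),
    benchmark_model="historical_mean",  # needed for the relative metrics above
    time_frequency="Q",                 # adds quarterly time-bucket slices
)

scores = report.scores      # main metric table over score_by
ranking = report.ranking    # ranked scores table
by_regime = report.regime   # None unless regime labels are available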

evaluate_report

macroforecast.evaluation.evaluate_report(
    forecasts,
    *,
    metrics=("mse", "rmse", "mae"),
    score_by=("model", "horizon"),
    aggregations=None,
    rank_metric=None,
    rank_by=None,
    benchmark_model=None,
    benchmark_metrics=("mse", "mae", "relative_mse", "relative_mae", "mse_reduction", "r2_oos"),
    oos_start=None,
    oos_end=None,
    regimes=None,
    regime_column="regime",
    target_column="target",
    state_column="state",
    time_frequency=None,
    time_column="date",
    time_bucket_column="time_bucket",
    include_decomposition=False,
    decomposition_by=None,
    include_combined=True,
) -> EvaluationReport

Input

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| forecasts | ForecastResult or DataFrame | required | Forecast runner output or forecast-like table. |
| metrics | sequence | ("mse", "rmse", "mae") | Metric names accepted by mf.metrics.get_metric(...) or callables. |
| score_by | sequence | ("model", "horizon") | Main score grouping. Columns must exist. |
| aggregations | mapping, sequence, or None | None (auto) | Extra groupings to evaluate. None creates model, horizon, model-horizon, and available target/state/regime/time slices. |
| rank_metric | str or None | None (auto) | Metric used for ranking. Auto preference is rmse, mse, mae, r2_oos, relative_mse. |
| rank_by | sequence or None | None (score_by without model) | Ranking groups. |
| benchmark_model | str or None | None | Model name used for relative metrics and benchmark table. |
| benchmark_metrics | sequence | default benchmark metrics (BENCHMARK_METRICS) | Metrics for benchmark_comparison. |
| oos_start, oos_end | date-like or None | None | Restrict forecast rows before scoring. Dates are inclusive. |
| regimes | mapping, Series, str, or None | None | Date-to-regime labels, existing regime column name, or no extra regime attachment. |
| regime_column | str | "regime" | Column used for regime scoring. |
| target_column, state_column | str | "target", "state" | Optional slice columns, used when present. |
| time_frequency | str or None | None | Pandas period frequency such as "M", "Q", "A" for time-bucket aggregation. |
| include_decomposition | bool | False | Add MSE decomposition into squared bias and residual variance. |
| decomposition_by | sequence or None | None (score_by) | Grouping used by error_decomposition. |
| include_combined | bool | True | Include forecast-combination rows. |
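The automatic choices for rank_metric and rank_by can be pictured with a small sketch. It is plain Python, not macroforecast's implementation: the helper names resolve_rank_metric and resolve_rank_by are hypothetical, and raising when no preferred metric is present is an assumption. Only the preference order and the "score_by without model" rule come from the table above.

```python
# Preference order as documented for rank_metric's auto mode.
RANK_PREFERENCE = ("rmse", "mse", "mae", "r2_oos", "relative_mse")


def resolve_rank_metric(metrics, rank_metric=None):
    """Return the explicit rank_metric, else the first preferred name requested."""
    if rank_metric is not None:
        return rank_metric
    # Callables carry no name in the preference list, so only strings are considered.
    names = [m for m in metrics if isinstance(m, str)]
    for candidate in RANK_PREFERENCE:
        if candidate in names:
            return candidate
    # Assumption: the real library may handle this case differently.
    raise ValueError(f"no rankable metric among {names!r}")


def resolve_rank_by(score_by, rank_by=None):
    """Default ranking groups: the score grouping with 'model' removed."""
    if rank_by is not None:
        return tuple(rank_by)
    return tuple(col for col in score_by if col != "model")


chosen = resolve_rank_metric(("mae", "mse", "r2_oos"))
groups = resolve_rank_by(("model", "horizon"))
```

Here chosen resolves to "mse" because it precedes "mae" and "r2_oos" in the preference order, and models are then ranked within each horizon.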

Custom scoring belongs in the metrics argument:

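import pandas as pd

def mean_bias(y_true, y_pred):
    # Mean of prediction minus actual. Note the sign: error_decomposition
    # reports bias as the mean of actual minus prediction.
    return float(pd.Series(y_pred).sub(pd.Series(y_true)).mean())

report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", "rmse", mean_bias),  # names and callables can be mixed
    aggregations={
        # Each grouping column must exist in the forecast table.
        "model_target": ("model", "target"),
        "model_regime": ("model", "regime"),
    },
)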

Custom aggregation slices belong in aggregations. Each value is a tuple of existing forecast-table columns to group by; evaluation still calls mf.metrics.evaluate_forecasts() to compute each metric table. When relative metrics are requested, every scoring or aggregation grouping must include model, because benchmark-relative scores are specific to each candidate model. For the same reason, the automatic aggregation set then omits model-free slices such as horizon alone.
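As a rough picture of what a metric callable receives, the sketch below applies a (y_true, y_pred) function within each group of a grouping tuple. It uses plain Python rows instead of a DataFrame and is not how macroforecast computes scores; score_groups is a hypothetical helper, and the actual and prediction column names are taken from the error_decomposition defaults.

```python
from collections import defaultdict


def mean_bias(y_true, y_pred):
    """Custom metric: average of prediction minus actual."""
    return sum(p - a for a, p in zip(y_true, y_pred)) / len(y_true)


def score_groups(rows, by, metric):
    """Apply metric(y_true, y_pred) within each group defined by the `by` columns."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        key = tuple(row[col] for col in by)  # KeyError if a grouping column is missing
        groups[key][0].append(row["actual"])
        groups[key][1].append(row["prediction"])
    return {key: metric(y_true, y_pred) for key, (y_true, y_pred) in groups.items()}


rows = [
    {"model": "ar", "target": "gdp", "actual": 1.0, "prediction": 1.5},
    {"model": "ar", "target": "gdp", "actual": 2.0, "prediction": 2.5},
    {"model": "ar", "target": "cpi", "actual": 3.0, "prediction": 2.0},
]
scores = score_groups(rows, by=("model", "target"), metric=mean_bias)
```

Each key of scores is one (model, target) pair, which mirrors one row of a "model_target" aggregation table.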

Output

Returns EvaluationReport.

| Field | Type | Meaning |
| --- | --- | --- |
| scores | DataFrame | Main metric table over score_by. |
| ranking | DataFrame | Ranked scores table. |
| aggregations | dict[str, DataFrame] | Extra metric tables by requested or auto-discovered slices. |
| benchmark | DataFrame or None | Candidate rows relative to benchmark_model. |
| regime | DataFrame or None | Regime-specific metric table when regime labels are available. |
| decomposition | DataFrame or None | Error decomposition table when requested. |
| metadata | dict | Input metadata plus compact evaluation_report stage. |

EvaluationReport.to_dict() serializes all tables into JSON-ready records. The serialized payload includes metadata_schema={"kind": "evaluation_report", "version": 1}.

The metadata stage records options, table row counts, and forecast-table input shape:

report.metadata["evaluation_report"]

For paper/report output, keep evaluation and presentation separate:

main_table = mf.reporting.metric_report_table(
    report,
    columns=("model", "horizon", "rmse", "r2_oos"),
    percent_columns=("r2_oos",),
)

paper_tables = mf.reporting.evaluation_report_tables(
    report,
    include=("scores", "ranking", "benchmark", "decomposition"),
)

metric_report_table(...) creates one presentation-ready table. evaluation_report_tables(...) creates a named ReportBundle for the report’s main components.

aggregate_scores

macroforecast.evaluation.aggregate_scores(
    forecasts,
    *,
    groupings,
    metrics=("mse", "rmse", "mae"),
    benchmark_model=None,
) -> dict[str, pandas.DataFrame]

Evaluates one forecast table over multiple explicit groupings.

tables = mf.evaluation.aggregate_scores(
    forecast_result,
    groupings={
        # name of the output table -> columns to group by
        "model": ("model",),
        "model_horizon_target": ("model", "horizon", "target"),
    },
)

All requested columns must exist. This function fails loudly instead of silently dropping unavailable dimensions.
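The fail-loudly behavior can be illustrated with a short validation sketch. This is an assumption-laden illustration in plain Python: check_groupings is a hypothetical helper, and the exception type and message that macroforecast actually raises are not stated in this page.

```python
def check_groupings(columns, groupings):
    """Raise if any grouping names a column that the forecast table lacks."""
    available = set(columns)
    problems = {}
    for name, by in groupings.items():
        missing = [col for col in by if col not in available]
        if missing:
            problems[name] = missing
    if problems:
        # Assumption: KeyError is used here only for illustration.
        raise KeyError(f"grouping columns not found: {problems}")


columns = ["model", "horizon", "date", "actual", "prediction"]
check_groupings(columns, {"model": ("model",)})  # every column exists, no error

try:
    check_groupings(columns, {"model_horizon_target": ("model", "horizon", "target")})
    failed = False
except KeyError as exc:
    failed = True
    message = str(exc)
```

The second call fails because the table has no target column; silently dropping that dimension would return a table that looks like a model-horizon-target slice but is not one.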

filter_oos_period

macroforecast.evaluation.filter_oos_period(
    forecasts,
    *,
    start=None,
    end=None,
    date_column="date",
) -> pandas.DataFrame

Returns forecast rows inside an inclusive out-of-sample date interval. This is the callable replacement for an oos_period setting. Use it directly when you want to score only a subsample, or pass oos_start/oos_end to evaluate_report(...).
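The inclusive interval semantics can be sketched in plain Python. This is an illustration only, not the library's code: filter_inclusive is a hypothetical stand-in that works on a list of dict rows instead of a DataFrame.

```python
from datetime import date


def filter_inclusive(rows, start=None, end=None, date_column="date"):
    """Keep rows with start <= date <= end; a bound of None is left open."""
    kept = []
    for row in rows:
        value = row[date_column]
        if start is not None and value < start:
            continue
        if end is not None and value > end:
            continue
        kept.append(row)
    return kept


rows = [
    {"date": date(2019, 12, 31), "prediction": 1.0},
    {"date": date(2020, 1, 1), "prediction": 1.1},
    {"date": date(2020, 6, 30), "prediction": 1.2},
    {"date": date(2020, 12, 31), "prediction": 1.3},
    {"date": date(2021, 1, 1), "prediction": 1.4},
]
subsample = filter_inclusive(rows, start=date(2020, 1, 1), end=date(2020, 12, 31))
open_ended = filter_inclusive(rows, start=date(2020, 12, 31))
```

Both boundary dates are kept in subsample, and leaving end as None keeps everything from start onward.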

error_decomposition

macroforecast.evaluation.error_decomposition(
    forecasts,
    *,
    by=("model", "horizon"),
    actual="actual",
    prediction="prediction",
) -> pandas.DataFrame

Decomposes MSE within each group as:

mse = bias_squared + residual_variance

where bias is the mean of the residuals actual - prediction within the group. Output columns include n, mse, bias, bias_squared, residual_variance, bias_share, and variance_share.
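The identity holds exactly when the residual variance is the population variance (dividing by n, not n - 1). The sketch below checks this for one group in plain Python. The function decompose_mse is hypothetical, and defining bias_share and variance_share as each component divided by mse is an assumption about the output columns, not a documented formula.

```python
def decompose_mse(actual, prediction):
    """Split MSE into squared bias and population variance of the residuals."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, prediction)]  # actual - prediction
    mse = sum(e * e for e in residuals) / n
    bias = sum(residuals) / n
    bias_squared = bias ** 2
    # Population variance (divide by n) so that mse = bias_squared + residual_variance.
    residual_variance = sum((e - bias) ** 2 for e in residuals) / n
    # Assumed share definitions; undefined when mse is zero.
    bias_share = bias_squared / mse if mse else float("nan")
    variance_share = residual_variance / mse if mse else float("nan")
    return {
        "n": n,
        "mse": mse,
        "bias": bias,
        "bias_squared": bias_squared,
        "residual_variance": residual_variance,
        "bias_share": bias_share,
        "variance_share": variance_share,
    }


parts = decompose_mse(actual=[1.0, 2.0, 3.0, 4.0], prediction=[0.5, 2.5, 2.0, 3.0])
```

For these values the residuals are 0.5, -0.5, 1.0, and 1.0, so mse is 0.625, bias_squared is 0.25, and residual_variance is 0.375.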

benchmark_comparison

macroforecast.evaluation.benchmark_comparison(
    forecasts,
    *,
    benchmark_model,
    by=("model", "horizon"),
    metrics=("mse", "mae", "relative_mse", "relative_mae", "mse_reduction", "r2_oos"),
) -> pandas.DataFrame

Returns candidate model rows with benchmark-relative scores. The benchmark row itself is removed from the output. benchmark_model must be present in the forecast table. The benchmark forecasts must already exist in the input table; this function does not generate them. Use the forecasting runner to generate a same-window benchmark, or append an external benchmark forecast table only after verifying identical date/origin/horizon/target support.
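The sketch below shows why identical support matters and how benchmark-relative scores are commonly computed. It assumes the conventional definitions (relative_mse as a ratio of MSEs, mse_reduction as one minus that ratio, r2_oos as one minus the ratio of summed squared errors); macroforecast's exact formulas are not stated on this page. The helper relative_scores and the (date, horizon) support key are hypothetical.

```python
def mse(errors):
    return sum(e * e for e in errors) / len(errors)


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def relative_scores(model_errors, benchmark_errors):
    """Both arguments map a support key, e.g. (date, horizon), to a forecast error."""
    if set(model_errors) != set(benchmark_errors):
        raise ValueError("candidate and benchmark must cover identical support")
    keys = sorted(model_errors)
    m = [model_errors[k] for k in keys]
    b = [benchmark_errors[k] for k in keys]
    relative_mse = mse(m) / mse(b)
    return {
        "relative_mse": relative_mse,
        "relative_mae": mae(m) / mae(b),
        "mse_reduction": 1.0 - relative_mse,
        # Conventional out-of-sample R^2 against the benchmark (assumed definition).
        "r2_oos": 1.0 - sum(e * e for e in m) / sum(e * e for e in b),
    }


candidate = {("2020-03-31", 1): 1.0, ("2020-06-30", 1): -1.0}
benchmark = {("2020-03-31", 1): 2.0, ("2020-06-30", 1): -2.0}
scores = relative_scores(candidate, benchmark)

try:
    relative_scores(candidate, {("2020-03-31", 1): 2.0})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Under these assumed definitions and with matching support, mse_reduction and r2_oos coincide; a benchmark scored over a different window would make the ratios compare unlike periods.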

regime_scores

macroforecast.evaluation.regime_scores(
    forecasts,
    *,
    regimes=None,
    regime_column="regime",
    by=("model", "horizon", "regime"),
    metrics=("mse", "rmse", "mae"),
    benchmark_model=None,
) -> pandas.DataFrame

regimes can be:

| Form | Meaning |
| --- | --- |
| None | Use an existing regime_column. |
| str | Use that existing column as the source and copy to regime_column when names differ. |
| mapping or Series | Map forecast date values to regime labels. |
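For the mapping form, the date-to-label lookup can be sketched as follows. This is a plain-Python illustration with a hypothetical attach_regime helper; leaving unmapped dates as None is an assumption, and the library may treat them differently.

```python
from datetime import date


def attach_regime(rows, regimes, regime_column="regime", date_column="date"):
    """Return copies of rows with a regime label looked up from each row's date."""
    labeled = []
    for row in rows:
        label = regimes.get(row[date_column])  # None when the date has no label (assumed)
        labeled.append({**row, regime_column: label})
    return labeled


regimes = {
    date(2020, 3, 31): "recession",
    date(2020, 6, 30): "recession",
    date(2020, 9, 30): "expansion",
}
rows = [
    {"model": "ar", "date": date(2020, 3, 31), "actual": -1.2, "prediction": 0.4},
    {"model": "ar", "date": date(2020, 9, 30), "actual": 7.5, "prediction": 2.0},
    {"model": "ar", "date": date(2020, 12, 31), "actual": 1.0, "prediction": 1.1},
]
labeled = attach_regime(rows, regimes)
```

Once each row carries a label, regime scoring is an ordinary grouping over ("model", "horizon", "regime") or whichever by columns are requested.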

Boundary

| Question | Use |
| --- | --- |
| One metric value or one metric table | mf.metrics |
| Multi-slice report with ranking, OOS filtering, benchmark, regime, target/state/time aggregation, decomposition | mf.evaluation |
| Diebold-Mariano, Clark-West, MCS, residual tests | mf.tests |