macroforecast.evaluation#
macroforecast.evaluation owns evaluation reports. Raw scoring functions still
live in macroforecast.metrics, and forecast-comparison statistical tests still
live in macroforecast.tests.
import macroforecast as mf
mf.evaluation.metrics is mf.metrics
mf.evaluation.tests is mf.tests
The public API contract is:
| Namespace | Owns | Does not own |
|---|---|---|
| macroforecast.metrics | Forecast scoring, ranking, metric resolution. | Statistical comparison tests. |
| macroforecast.tests | Forecast-comparison tests, density diagnostics, residual diagnostics. | General scoring tables. |
| macroforecast.evaluation | Multi-slice evaluation reports, OOS-period filtering, benchmark comparisons, regime scoring, and error decomposition. | Raw metric functions or statistical test functions. |
Public defaults:
| Symbol | Meaning |
|---|---|
| | Default metric tuple used by |
| | Default grouping columns for score aggregation. |
| | Default benchmark-comparison metrics. |
Public Flow#
report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", "rmse", "mae", "relative_mse", "r2_oos"),
    benchmark_model="historical_mean",
    time_frequency="Q",
)
scores = report.scores
ranking = report.ranking
by_regime = report.regime
evaluate_report#
macroforecast.evaluation.evaluate_report(
    forecasts,
    *,
    metrics=("mse", "rmse", "mae"),
    score_by=("model", "horizon"),
    aggregations=None,
    rank_metric=None,
    rank_by=None,
    benchmark_model=None,
    benchmark_metrics=("mse", "mae", "relative_mse", "relative_mae", "mse_reduction", "r2_oos"),
    oos_start=None,
    oos_end=None,
    regimes=None,
    regime_column="regime",
    target_column="target",
    state_column="state",
    time_frequency=None,
    time_column="date",
    time_bucket_column="time_bucket",
    include_decomposition=False,
    decomposition_by=None,
    include_combined=True,
) -> EvaluationReport
Input#
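| Name | Type | Default | Choices |
|---|---|---|---|
| forecasts | | required | Forecast runner output or forecast-like table. |
| metrics | sequence | ("mse", "rmse", "mae") | Metric names accepted by macroforecast.metrics, or custom callables. |
| score_by | sequence | ("model", "horizon") | Main score grouping. Columns must exist. |
| aggregations | mapping, sequence, or None | auto | Extra groupings to evaluate. |
| rank_metric | str or None | auto | Metric used for ranking. |
| rank_by | sequence or None | None | Ranking groups. |
| benchmark_model | str or None | None | Model name used for relative metrics and benchmark table. |
| benchmark_metrics | sequence | default benchmark metrics | Metrics for the benchmark table. |
| oos_start, oos_end | date-like or None | None | Restrict forecast rows before scoring. Dates are inclusive. |
| regimes | mapping, Series, str, or None | None | Date-to-regime labels, existing regime column name, or no extra regime attachment. |
| regime_column | str | "regime" | Column used for regime scoring. |
| target_column, state_column | str | "target", "state" | Optional slice columns when present. |
| time_frequency | str or None | None | Pandas period frequency such as "Q". |
| include_decomposition | bool | False | Add MSE decomposition into squared bias and residual variance. |
| decomposition_by | sequence or None | None | Grouping used for the error decomposition. |
| include_combined | bool | True | Include forecast-combination rows. |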
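As a rough illustration of what a time_frequency such as "Q" implies, the sketch below buckets forecast dates into pandas periods. It assumes the buckets are plain pandas periods written to the time_bucket_column; the library's actual bucketing logic is not shown in this page, so treat this as a stand-in only.

```python
import pandas as pd

# Illustrative forecast rows; the "date" and "time_bucket" names mirror the
# time_column and time_bucket_column defaults in the signature.
forecasts = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-04-30"]),
        "model": ["ar", "ar", "ar"],
    }
)

# Assumption: each row is assigned the pandas period that contains its date.
forecasts["time_bucket"] = forecasts["date"].dt.to_period("Q").astype(str)
buckets = forecasts["time_bucket"].tolist()
```

Rows that share a bucket can then be scored together, for example with a ("model", "time_bucket") grouping.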
Custom scoring belongs in the metrics argument:
import pandas as pd

def mean_bias(y_true, y_pred):
    # Average signed error: positive values mean over-prediction.
    return float(pd.Series(y_pred).sub(pd.Series(y_true)).mean())

report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", "rmse", mean_bias),
    aggregations={
        "model_target": ("model", "target"),
        "model_regime": ("model", "regime"),
    },
)
Custom aggregation slices belong in aggregations. The value is a grouping
tuple over existing forecast-table columns; evaluation still uses
mf.metrics.evaluate_forecasts() to compute the metric table.
When relative metrics are requested, every scoring or aggregation grouping must
include model; the automatic aggregation set omits model-free slices such as
horizon alone because benchmark-relative scores are candidate-model specific.
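The grouping rule for relative metrics can be pictured with a small stand-alone check. This is a hypothetical helper, not macroforecast's implementation: the names RELATIVE_METRICS and check_groupings are invented for illustration, and the set of metrics treated as benchmark-relative is an assumption based on the default benchmark metrics listed above.

```python
# Hypothetical sketch; not the library's internal validation code.
# Assumption: these metric names are the benchmark-relative ones.
RELATIVE_METRICS = {"relative_mse", "relative_mae", "mse_reduction", "r2_oos"}


def check_groupings(groupings, metrics):
    """Raise if relative metrics are requested for a grouping without "model"."""
    needs_model = any(
        isinstance(metric, str) and metric in RELATIVE_METRICS for metric in metrics
    )
    if not needs_model:
        return
    for name, columns in groupings.items():
        if "model" not in columns:
            raise ValueError(
                f"grouping {name!r} must include 'model' when relative metrics are requested"
            )


groupings = {"model_target": ("model", "target"), "horizon_only": ("horizon",)}

# No relative metrics requested, so a model-free slice is acceptable.
check_groupings(groupings, metrics=("mse", "rmse"))

# A relative metric makes the model-free slice invalid.
try:
    check_groupings(groupings, metrics=("mse", "relative_mse"))
    error = None
except ValueError as exc:
    error = str(exc)
```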
Output#
Returns EvaluationReport.
Field |
Type |
Meaning |
|---|---|---|
|
|
Main metric table over |
|
|
Ranked |
|
|
Extra metric tables by requested or auto-discovered slices. |
|
|
Candidate rows relative to |
|
|
Regime-specific metric table when regime labels are available. |
|
|
Error decomposition table when requested. |
|
|
Input metadata plus compact |
EvaluationReport.to_dict() serializes all tables into JSON-ready records.
The serialized payload includes
metadata_schema={"kind": "evaluation_report", "version": 1}.
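To make "JSON-ready records" concrete, the sketch below serializes one illustrative score table with pandas and round-trips it through json. It only mirrors the metadata_schema entry quoted above; the remaining payload layout (here a single "scores" key) is an assumption, not the documented output of EvaluationReport.to_dict().

```python
import json

import pandas as pd

# Illustrative score table; values are made up.
scores = pd.DataFrame(
    {"model": ["ar", "ridge"], "horizon": [1, 1], "mse": [0.25, 0.16]}
)

# Assumed payload shape: one list of row records per table.
payload = {
    "metadata_schema": {"kind": "evaluation_report", "version": 1},
    "scores": scores.to_dict(orient="records"),
}

# Records contain only plain Python scalars, so json can encode them directly.
text = json.dumps(payload)
restored = pd.DataFrame(json.loads(text)["scores"])
```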
The metadata stage records options, table row counts, and forecast-table input shape:
report.metadata["evaluation_report"]
For paper/report output, keep evaluation and presentation separate:
main_table = mf.reporting.metric_report_table(
    report,
    columns=("model", "horizon", "rmse", "r2_oos"),
    percent_columns=("r2_oos",),
)
paper_tables = mf.reporting.evaluation_report_tables(
    report,
    include=("scores", "ranking", "benchmark", "decomposition"),
)
metric_report_table(...) creates one presentation-ready table.
evaluation_report_tables(...) creates a named ReportBundle for the report’s
main components.
aggregate_scores#
macroforecast.evaluation.aggregate_scores(
    forecasts,
    *,
    groupings,
    metrics=("mse", "rmse", "mae"),
    benchmark_model=None,
) -> dict[str, pandas.DataFrame]
Evaluates one forecast table over multiple explicit groupings.
tables = mf.evaluation.aggregate_scores(
    result,
    groupings={
        "model": ("model",),
        "model_horizon_target": ("model", "horizon", "target"),
    },
)
All requested columns must exist. This function fails loudly instead of silently dropping unavailable dimensions.
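The fail-loudly behaviour can be sketched with plain pandas. The helper below is a hypothetical stand-in (aggregate_mse is not a macroforecast function) that computes only MSE; it assumes "actual" and "prediction" columns, matching the defaults used elsewhere on this page, and raises as soon as a grouping names a column that is absent.

```python
import pandas as pd


def aggregate_mse(forecasts, groupings):
    """Hypothetical stand-in: MSE per grouping, raising on missing columns."""
    frame = forecasts.assign(
        squared_error=(forecasts["actual"] - forecasts["prediction"]) ** 2
    )
    tables = {}
    for name, columns in groupings.items():
        missing = [column for column in columns if column not in frame.columns]
        if missing:
            # Refuse to continue rather than silently dropping the dimension.
            raise KeyError(f"grouping {name!r} references missing columns: {missing}")
        tables[name] = (
            frame.groupby(list(columns), as_index=False)["squared_error"]
            .mean()
            .rename(columns={"squared_error": "mse"})
        )
    return tables


forecasts = pd.DataFrame(
    {
        "model": ["ar", "ar", "ridge", "ridge"],
        "horizon": [1, 1, 1, 1],
        "actual": [1.0, 2.0, 1.0, 2.0],
        "prediction": [1.5, 2.5, 1.0, 3.0],
    }
)

tables = aggregate_mse(forecasts, {"model": ("model",)})
mse_by_model = tables["model"].set_index("model")["mse"].to_dict()

try:
    aggregate_mse(forecasts, {"model_target": ("model", "target")})
    raised = False
except KeyError:
    raised = True
```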
filter_oos_period#
macroforecast.evaluation.filter_oos_period(
    forecasts,
    *,
    start=None,
    end=None,
    date_column="date",
) -> pandas.DataFrame
Returns forecast rows inside an inclusive out-of-sample date interval. This is
the callable replacement for an oos_period setting. Use it directly when you
want to score only a subsample, or pass oos_start/oos_end to
evaluate_report(...).
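An inclusive date filter of this kind can be sketched in a few lines of pandas. The function below, filter_dates, is a hypothetical stand-in rather than the library's implementation; it only illustrates that both endpoints are kept and that omitting a bound leaves that side open.

```python
import pandas as pd


def filter_dates(forecasts, start=None, end=None, date_column="date"):
    """Hypothetical stand-in: keep rows with start <= date <= end."""
    dates = pd.to_datetime(forecasts[date_column])
    mask = pd.Series(True, index=forecasts.index)
    if start is not None:
        mask &= dates >= pd.Timestamp(start)
    if end is not None:
        mask &= dates <= pd.Timestamp(end)
    return forecasts.loc[mask]


forecasts = pd.DataFrame(
    {
        "date": ["2019-12-31", "2020-03-31", "2020-06-30", "2020-09-30"],
        "prediction": [0.1, 0.2, 0.3, 0.4],
    }
)

# Both boundary dates are retained because the interval is inclusive.
subsample = filter_dates(forecasts, start="2020-03-31", end="2020-06-30")
kept_dates = subsample["date"].tolist()
```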
error_decomposition#
macroforecast.evaluation.error_decomposition(
    forecasts,
    *,
    by=("model", "horizon"),
    actual="actual",
    prediction="prediction",
) -> pandas.DataFrame
Decomposes MSE within each group as:
mse = bias_squared + residual_variance
where bias is the mean of the residuals (actual - prediction). Output columns
include n, mse, bias, bias_squared, residual_variance,
bias_share, and variance_share.
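The identity can be checked numerically. The sketch below is a hypothetical re-implementation (decompose is not the library function) that assumes the residual variance is the population variance (ddof=0), which is what makes mse equal bias_squared plus residual_variance exactly.

```python
import pandas as pd


def decompose(forecasts, by=("model", "horizon"), actual="actual", prediction="prediction"):
    """Hypothetical stand-in: split MSE into squared bias and residual variance."""
    frame = forecasts.assign(residual=forecasts[actual] - forecasts[prediction])
    frame["squared_error"] = frame["residual"] ** 2
    grouped = frame.groupby(list(by))
    out = pd.DataFrame(
        {
            "n": grouped.size(),
            "mse": grouped["squared_error"].mean(),
            "bias": grouped["residual"].mean(),
            # Assumption: population variance, so the identity holds exactly.
            "residual_variance": grouped["residual"].var(ddof=0),
        }
    )
    out["bias_squared"] = out["bias"] ** 2
    out["bias_share"] = out["bias_squared"] / out["mse"]
    out["variance_share"] = out["residual_variance"] / out["mse"]
    return out.reset_index()


forecasts = pd.DataFrame(
    {
        "model": ["ar"] * 4,
        "horizon": [1] * 4,
        "actual": [1.0, 2.0, 3.0, 4.0],
        "prediction": [0.5, 2.5, 2.0, 3.0],
    }
)

table = decompose(forecasts)
row = table.iloc[0]
```

With these residuals (0.5, -0.5, 1.0, 1.0) the bias is 0.5, so the squared bias is 0.25, the residual variance is 0.375, and the two sum to the MSE of 0.625.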
benchmark_comparison#
macroforecast.evaluation.benchmark_comparison(
    forecasts,
    *,
    benchmark_model,
    by=("model", "horizon"),
    metrics=("mse", "mae", "relative_mse", "relative_mae", "mse_reduction", "r2_oos"),
) -> pandas.DataFrame
Returns candidate model rows with benchmark-relative scores. The benchmark row
itself is removed from the output. benchmark_model must be present in the
forecast table. The benchmark forecasts must already exist in the input table;
this function does not generate them. Use the forecasting runner to generate a
same-window benchmark, or append an external benchmark forecast table only after
verifying identical date/origin/horizon/target support.
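A minimal sketch of benchmark-relative scoring follows. It is not the library's code: relative_to_benchmark is a hypothetical helper, and the formulas are common conventions assumed here (relative_mse as the candidate MSE divided by the benchmark MSE, and r2_oos as one minus that ratio), since this page does not define them.

```python
import pandas as pd


def relative_to_benchmark(mse_table, benchmark_model, by=("horizon",)):
    """Hypothetical stand-in: candidate rows scored against a benchmark row."""
    is_benchmark = mse_table["model"] == benchmark_model
    if not is_benchmark.any():
        raise ValueError(f"benchmark model {benchmark_model!r} not found in table")
    benchmark = mse_table.loc[is_benchmark, [*by, "mse"]].rename(
        columns={"mse": "benchmark_mse"}
    )
    # Drop the benchmark row and attach its MSE to every matching candidate.
    candidates = mse_table.loc[~is_benchmark].merge(
        benchmark, on=list(by), how="left", validate="many_to_one"
    )
    # Assumed definitions; the library's formulas are not shown in the source.
    candidates["relative_mse"] = candidates["mse"] / candidates["benchmark_mse"]
    candidates["r2_oos"] = 1.0 - candidates["relative_mse"]
    return candidates.drop(columns="benchmark_mse")


mse_table = pd.DataFrame(
    {
        "model": ["historical_mean", "ar", "ridge"],
        "horizon": [1, 1, 1],
        "mse": [0.5, 0.25, 0.4],
    }
)

comparison = relative_to_benchmark(mse_table, "historical_mean")
relative = comparison.set_index("model")["relative_mse"].to_dict()
```

Merging on the shared grouping columns is one way to enforce the same-support requirement: a candidate with no matching benchmark row ends up with missing relative scores instead of borrowing a score from another horizon.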
regime_scores#
macroforecast.evaluation.regime_scores(
    forecasts,
    *,
    regimes=None,
    regime_column="regime",
    by=("model", "horizon", "regime"),
    metrics=("mse", "rmse", "mae"),
    benchmark_model=None,
) -> pandas.DataFrame
regimes can be:
| Form | Meaning |
|---|---|
| None | Use an existing regime_column column. |
| str | Use that existing column as the source and copy to regime_column. |
| mapping or Series | Map forecast dates to regime labels. |
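When regimes is given as date-to-regime labels, attaching them amounts to a lookup on the forecast date. The sketch below shows that idea with pandas Series.map; it assumes a "date" column and a "regime" output column, and it is an illustration rather than the library's attachment logic.

```python
import pandas as pd

# Illustrative forecast rows; dates and labels are made up.
forecasts = pd.DataFrame(
    {
        "date": pd.to_datetime(["2008-09-30", "2010-03-31", "2012-06-30"]),
        "model": ["ar", "ar", "ar"],
    }
)

# Date-to-regime labels, indexed by date.
regimes = pd.Series(
    {
        pd.Timestamp("2008-09-30"): "recession",
        pd.Timestamp("2010-03-31"): "expansion",
    }
)

# Dates without a label become missing values instead of raising.
forecasts["regime"] = forecasts["date"].map(regimes)
labels = forecasts["regime"].tolist()
```

Unlabelled dates are worth checking before scoring, because pandas groupby drops rows with a missing regime by default.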
Boundary#
| Question | Use |
|---|---|
| One metric value or one metric table | macroforecast.metrics |
| Multi-slice report with ranking, OOS filtering, benchmark, regime, target/state/time aggregation, decomposition | macroforecast.evaluation |
| Diebold-Mariano, Clark-West, MCS, residual tests | macroforecast.tests |