# macroforecast.evaluation

[Back to reference](index.md)

`macroforecast.evaluation` owns evaluation reports. Raw scoring functions remain
in `macroforecast.metrics`, and forecast-comparison statistical tests remain in
`macroforecast.tests`; the evaluation namespace exposes both as aliases:

```python
import macroforecast as mf

# The evaluation namespace exposes the same module objects, not copies.
assert mf.evaluation.metrics is mf.metrics
assert mf.evaluation.tests is mf.tests
```

The public API contract is:

| Namespace | Owns | Does not own |
| --- | --- | --- |
| `macroforecast.metrics` | Forecast scoring, ranking, metric resolution. | Statistical comparison tests. |
| `macroforecast.tests` | Forecast-comparison tests, density diagnostics, residual diagnostics. | General scoring tables. |
| `macroforecast.evaluation` | Multi-slice evaluation reports, OOS-period filtering, benchmark comparisons, regime scoring, and error decomposition. | Raw metric functions or statistical test functions. |

Public defaults:

| Symbol | Meaning |
| --- | --- |
| `DEFAULT_METRICS` | Default metric tuple used by `evaluate_report(...)`. |
| `DEFAULT_SCORE_BY` | Default grouping columns for score aggregation. |
| `BENCHMARK_METRICS` | Default benchmark-comparison metrics. |

## Public Flow

```python
report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", "rmse", "mae", "relative_mse", "r2_oos"),
    benchmark_model="historical_mean",
    time_frequency="Q",
)

scores = report.scores      # main metric table over score_by
ranking = report.ranking    # ranked scores table
by_regime = report.regime   # None unless regime labels are available
```

## evaluate_report

```python
macroforecast.evaluation.evaluate_report(
    forecasts,
    *,
    metrics=("mse", "rmse", "mae"),
    score_by=("model", "horizon"),
    aggregations=None,
    rank_metric=None,
    rank_by=None,
    benchmark_model=None,
    benchmark_metrics=("mse", "mae", "relative_mse", "relative_mae", "mse_reduction", "r2_oos"),
    oos_start=None,
    oos_end=None,
    regimes=None,
    regime_column="regime",
    target_column="target",
    state_column="state",
    time_frequency=None,
    time_column="date",
    time_bucket_column="time_bucket",
    include_decomposition=False,
    decomposition_by=None,
    include_combined=True,
) -> EvaluationReport
```

### Input

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `forecasts` | `ForecastResult` or `DataFrame` | required | Forecast runner output or forecast-like table. |
| `metrics` | sequence | `("mse", "rmse", "mae")` | Metric names accepted by `mf.metrics.get_metric(...)` or callables. |
| `score_by` | sequence | `("model", "horizon")` | Main score grouping. Columns must exist. |
| `aggregations` | mapping, sequence, or `None` | auto | Extra groupings to evaluate. `None` creates model, horizon, model-horizon, and available target/state/regime/time slices. |
| `rank_metric` | str or `None` | auto | Metric used for `ranking`. Auto preference is `rmse`, `mse`, `mae`, `r2_oos`, `relative_mse`. |
| `rank_by` | sequence or `None` | `score_by` without `model` | Ranking groups. |
| `benchmark_model` | str or `None` | `None` | Model name used for relative metrics and benchmark table. |
| `benchmark_metrics` | sequence | default benchmark metrics | Metrics for `benchmark_comparison`. |
| `oos_start`, `oos_end` | date-like or `None` | `None` | Restrict forecast rows before scoring. Dates are inclusive. |
| `regimes` | mapping, Series, str, or `None` | `None` | Date-to-regime labels, existing regime column name, or no extra regime attachment. |
| `regime_column` | str | `"regime"` | Column used for regime scoring. |
| `target_column`, `state_column` | str | `"target"`, `"state"` | Optional slice columns when present. |
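| `time_frequency` | str or `None` | `None` | Pandas period frequency such as `"M"`, `"Q"`, `"A"` for time-bucket aggregation. |
| `time_column`, `time_bucket_column` | str | `"date"`, `"time_bucket"` | Date column read for time-bucket aggregation and the name of the bucket column, used when `time_frequency` is set. |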
| `include_decomposition` | bool | `False` | Add MSE decomposition into squared bias and residual variance. |
| `decomposition_by` | sequence or `None` | `score_by` | Grouping used by `error_decomposition`. |
| `include_combined` | bool | `True` | Include forecast-combination rows. |
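
The `time_frequency` option is described above as a pandas period frequency. The sketch below shows what such bucketing could look like in plain pandas. It assumes the bucket is derived with `Series.dt.to_period`, which may not match the library's internals, and the small frame is invented for illustration.

```python
import pandas as pd

# Hypothetical forecast-like table; column names follow the defaults above.
forecasts = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-04-30"]),
        "model": ["ar", "ar", "ar"],
        "actual": [1.0, 2.0, 3.0],
        "prediction": [1.5, 1.5, 2.0],
    }
)

# Assumption: each row is assigned to a pandas Period built from its date.
forecasts["time_bucket"] = forecasts["date"].dt.to_period("Q")

# Squared error averaged within each (model, quarter) bucket.
forecasts["sq_err"] = (forecasts["actual"] - forecasts["prediction"]) ** 2
mse_by_bucket = forecasts.groupby(["model", "time_bucket"])["sq_err"].mean()
print(mse_by_bucket)
```

The first two rows fall into the same quarter and are averaged together, while the April row forms its own bucket.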

Custom scoring belongs in the `metrics` argument:

```python
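import pandas as pd


def mean_bias(y_true, y_pred):
    # Mean of (prediction - actual). Note the sign is the opposite of the
    # `bias` column in error_decomposition, which uses actual - prediction.
    return float(pd.Series(y_pred).sub(pd.Series(y_true)).mean())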

report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", "rmse", mean_bias),
    aggregations={
        "model_target": ("model", "target"),
        "model_regime": ("model", "regime"),
    },
)
```

Custom aggregation slices belong in `aggregations`. Each key names an output
table and each value is a grouping tuple over existing forecast-table columns;
evaluation still calls `mf.metrics.evaluate_forecasts()` to compute each metric
table. When relative metrics are requested, every scoring or aggregation
grouping must include `model`, because benchmark-relative scores are specific
to a candidate model. For the same reason, the automatic aggregation set then
omits model-free slices such as `horizon` alone.

### Output

Returns `EvaluationReport`.

| Field | Type | Meaning |
| --- | --- | --- |
| `scores` | `DataFrame` | Main metric table over `score_by`. |
| `ranking` | `DataFrame` | Ranked `scores` table. |
| `aggregations` | `dict[str, DataFrame]` | Extra metric tables by requested or auto-discovered slices. |
| `benchmark` | `DataFrame` or `None` | Candidate rows relative to `benchmark_model`. |
| `regime` | `DataFrame` or `None` | Regime-specific metric table when regime labels are available. |
| `decomposition` | `DataFrame` or `None` | Error decomposition table when requested. |
| `metadata` | `dict` | Input metadata plus compact `evaluation_report` stage. |

`EvaluationReport.to_dict()` serializes all tables into JSON-ready records.
The serialized payload includes
`metadata_schema={"kind": "evaluation_report", "version": 1}`.

The metadata stage records options, table row counts, and forecast-table input
shape:

```python
report.metadata["evaluation_report"]
```

For paper/report output, keep evaluation and presentation separate:

```python
main_table = mf.reporting.metric_report_table(
    report,
    columns=("model", "horizon", "rmse", "r2_oos"),
    percent_columns=("r2_oos",),
)

paper_tables = mf.reporting.evaluation_report_tables(
    report,
    include=("scores", "ranking", "benchmark", "decomposition"),
)
```

`metric_report_table(...)` creates one presentation-ready table.
`evaluation_report_tables(...)` creates a named `ReportBundle` for the report's
main components.

## aggregate_scores

```python
macroforecast.evaluation.aggregate_scores(
    forecasts,
    *,
    groupings,
    metrics=("mse", "rmse", "mae"),
    benchmark_model=None,
) -> dict[str, pandas.DataFrame]
```

Evaluates one forecast table over multiple explicit groupings.

```python
tables = mf.evaluation.aggregate_scores(
    forecast_result,
    groupings={
        "model": ("model",),
        "model_horizon_target": ("model", "horizon", "target"),
    },
)
```

All requested columns must exist. This function fails loudly instead of silently
dropping unavailable dimensions.
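
To make the fail-loudly contract concrete, here is a minimal pandas sketch of scoring one table over several named groupings. The helper `mse_by_groupings` is hypothetical and only computes MSE; it is not the library's implementation, and the `actual`/`prediction` column names are assumed from the `error_decomposition` defaults.

```python
import pandas as pd


def mse_by_groupings(table, groupings):
    """Hypothetical helper: one MSE table per named grouping."""
    scored = table.assign(sq_err=(table["actual"] - table["prediction"]) ** 2)
    out = {}
    for name, columns in groupings.items():
        missing = [c for c in columns if c not in scored.columns]
        if missing:
            # Raise instead of silently dropping the unavailable dimension.
            raise KeyError(f"grouping {name!r} needs missing columns: {missing}")
        out[name] = (
            scored.groupby(list(columns))["sq_err"].mean().rename("mse").reset_index()
        )
    return out


table = pd.DataFrame(
    {
        "model": ["ar", "ar", "rw", "rw"],
        "horizon": [1, 2, 1, 2],
        "actual": [1.0, 2.0, 1.0, 2.0],
        "prediction": [1.0, 1.0, 2.0, 4.0],
    }
)
tables = mse_by_groupings(
    table,
    {"model": ("model",), "model_horizon": ("model", "horizon")},
)
print(tables["model"])
```

Requesting a grouping such as `("model", "target")` on this table raises `KeyError`, because no `target` column exists.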

## filter_oos_period

```python
macroforecast.evaluation.filter_oos_period(
    forecasts,
    *,
    start=None,
    end=None,
    date_column="date",
) -> pandas.DataFrame
```

Returns forecast rows inside an inclusive out-of-sample date interval. This is
the callable replacement for an `oos_period` setting. Use it directly when you
want to score only a subsample, or pass `oos_start`/`oos_end` to
`evaluate_report(...)`.
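
As an illustration of the inclusive interval, the following sketch filters a forecast-like table in plain pandas. `inclusive_date_filter` is a hypothetical stand-in written for this example, not the library function; it assumes that dates can be parsed by `pd.to_datetime`.

```python
import pandas as pd


def inclusive_date_filter(table, start=None, end=None, date_column="date"):
    """Hypothetical stand-in: keep rows with start <= date <= end."""
    dates = pd.to_datetime(table[date_column])
    keep = pd.Series(True, index=table.index)
    if start is not None:
        keep &= dates >= pd.Timestamp(start)
    if end is not None:
        keep &= dates <= pd.Timestamp(end)
    return table.loc[keep]


table = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2019-12-31", "2020-01-31", "2020-12-31", "2021-01-31"]
        ),
        "prediction": [0.1, 0.2, 0.3, 0.4],
    }
)

# Both boundary dates are kept because the interval is inclusive.
filtered = inclusive_date_filter(table, start="2020-01-31", end="2020-12-31")
print(filtered)
```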

## error_decomposition

```python
macroforecast.evaluation.error_decomposition(
    forecasts,
    *,
    by=("model", "horizon"),
    actual="actual",
    prediction="prediction",
) -> pandas.DataFrame
```

Decomposes MSE within each group as:

```text
mse = bias_squared + residual_variance
```

where `bias` is the mean residual (`actual - prediction`) and
`residual_variance` is the variance of the residuals around that mean. Output
columns include `n`, `mse`, `bias`, `bias_squared`, `residual_variance`,
`bias_share`, and `variance_share`.
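
The identity can be checked by hand for a single group. The sketch below uses only the standard library and invented numbers. Two points are assumptions rather than documented behavior: the variance is the population variance (divide by `n`), which is what makes the identity hold exactly, and the two share columns are taken to be fractions of `mse`.

```python
from statistics import fmean, pvariance

# Invented values for one (model, horizon) group.
actual = [1.0, 2.0, 3.0, 4.0]
prediction = [0.5, 2.5, 2.0, 3.0]

residuals = [a - p for a, p in zip(actual, prediction)]  # actual - prediction

mse = fmean([r * r for r in residuals])
bias = fmean(residuals)
bias_squared = bias ** 2
residual_variance = pvariance(residuals)  # population variance, divides by n

# Assumed definitions of the share columns: fractions of mse.
bias_share = bias_squared / mse
variance_share = residual_variance / mse

print(mse, bias_squared + residual_variance)
```

With a sample variance (dividing by `n - 1`) the two sides would differ by a small-sample factor, so the choice of divisor matters when reproducing the table.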

## benchmark_comparison

```python
macroforecast.evaluation.benchmark_comparison(
    forecasts,
    *,
    benchmark_model,
    by=("model", "horizon"),
    metrics=("mse", "mae", "relative_mse", "relative_mae", "mse_reduction", "r2_oos"),
) -> pandas.DataFrame
```

Returns candidate-model rows with benchmark-relative scores; the benchmark's own
row is removed from the output. `benchmark_model` must name a model whose
forecasts already exist in the forecast table, because this function does not
generate them. Use the forecasting runner to generate a same-window benchmark,
or append an external benchmark forecast table only after verifying that it has
identical date/origin/horizon/target support.
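
One way to verify identical support before appending an external benchmark is an outer merge on the key columns. This is a sketch under assumptions: the key column names (`date`, `origin`, `horizon`, `target`) are taken from the sentence above, and `support_mismatch` is a hypothetical helper, not part of the library.

```python
import pandas as pd

KEYS = ["date", "origin", "horizon", "target"]  # assumed key columns


def support_mismatch(candidates, benchmark, keys=KEYS):
    """Hypothetical check: key combinations present in only one table."""
    left = candidates[keys].drop_duplicates()
    right = benchmark[keys].drop_duplicates()
    merged = left.merge(right, on=keys, how="outer", indicator=True)
    return merged[merged["_merge"] != "both"]


candidates = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-03-31", "2020-06-30"]),
        "origin": pd.to_datetime(["2019-12-31", "2020-03-31"]),
        "horizon": [1, 1],
        "target": ["gdp", "gdp"],
        "model": ["ar", "ar"],
    }
)
# External benchmark that covers only the first forecast date.
benchmark = candidates.assign(model="historical_mean").iloc[:1]

mismatch = support_mismatch(candidates, benchmark)
print(mismatch)
```

An empty result means both tables cover the same key combinations; any `left_only` or `right_only` row should be resolved before the tables are concatenated.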

## regime_scores

```python
macroforecast.evaluation.regime_scores(
    forecasts,
    *,
    regimes=None,
    regime_column="regime",
    by=("model", "horizon", "regime"),
    metrics=("mse", "rmse", "mae"),
    benchmark_model=None,
) -> pandas.DataFrame
```

`regimes` can be:

| Form | Meaning |
| --- | --- |
| `None` | Use an existing `regime_column`. |
| `str` | Use that existing column as the source and copy it to `regime_column` when the names differ. |
| mapping or `Series` | Map forecast `date` values to regime labels. |
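
The mapping form can be pictured as a date lookup followed by an ordinary grouped score. The sketch below uses plain pandas and invented data; it assumes exact date matches via `Series.map` and does not describe how the library treats dates that have no label.

```python
import pandas as pd

forecasts = pd.DataFrame(
    {
        "date": pd.to_datetime(["2008-09-30", "2008-12-31", "2010-03-31"]),
        "model": ["ar", "ar", "ar"],
        "actual": [1.0, 0.0, 2.0],
        "prediction": [2.0, 2.0, 2.0],
    }
)

# Date-indexed regime labels (invented for this example).
regimes = pd.Series(
    ["recession", "recession", "expansion"],
    index=pd.to_datetime(["2008-09-30", "2008-12-31", "2010-03-31"]),
)

# Attach a regime label to every forecast row by looking up its date.
forecasts["regime"] = forecasts["date"].map(regimes)

# MSE within each (model, regime) group.
forecasts["sq_err"] = (forecasts["actual"] - forecasts["prediction"]) ** 2
mse_by_regime = forecasts.groupby(["model", "regime"])["sq_err"].mean()
print(mse_by_regime)
```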

## Boundary

| Question | Use |
| --- | --- |
| One metric value or one metric table | `mf.metrics` |
| Multi-slice report with ranking, OOS filtering, benchmark, regime, target/state/time aggregation, decomposition | `mf.evaluation` |
| Diebold-Mariano, Clark-West, MCS, residual tests | `mf.tests` |