macroforecast.metrics#
macroforecast.metrics owns forecast scoring only. It does not choose windows,
fit models, run statistical comparison tests, or write artifacts.
Use the namespace form:
import macroforecast as mf
mf.metrics.rmse(y_true, y_pred)
Top-level shortcuts such as mf.rmse(...) are intentionally not exported.
MetricLike is the public input type used by metric resolvers: a metric can
be a registered metric name or a callable with the expected scoring signature.
Risk-Return Forecast Evaluation#
Paper Citation And Scope#
This section implements the evaluation framework from:
Goulet Coulombe, Philippe. 2026. “Quantifying the Risk-Return Tradeoff in Forecasting.” arXiv:2605.09712v1, submitted May 10, 2026. arXiv page: https://arxiv.org/abs/2605.09712.
This is not a portfolio-construction module. macroforecast does not treat
macroeconomic forecasts as traded assets here. The word “return” means a
date-level loss differential:
forecast_return_t(model | benchmark)
= loss_t(benchmark) - loss_t(model)
Positive values mean the candidate model reduced forecast loss relative to the benchmark on that date. Negative values mean it underperformed. The financial language is useful because it gives precise names for stability of gains: volatility, downside risk, upside/downside balance, and drawdown. The object being evaluated remains a macro forecast panel.
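The definition above can be sketched without the library. This is a minimal illustration under squared-error loss; the helper name `forecast_returns_sketch` and the toy numbers are assumptions made for the example and are not part of macroforecast.

```python
def forecast_returns_sketch(actual, model_pred, benchmark_pred):
    """Date-level loss differentials: benchmark loss minus candidate loss."""
    returns = []
    for y, f_model, f_bench in zip(actual, model_pred, benchmark_pred):
        loss_model = (y - f_model) ** 2   # squared-error loss of the candidate
        loss_bench = (y - f_bench) ** 2   # squared-error loss of the benchmark
        returns.append(loss_bench - loss_model)
    return returns


# Toy data: three dates, one candidate, one benchmark.
actual = [2.0, 1.0, 3.0]
model_pred = [2.0, 2.0, 2.5]
benchmark_pred = [1.0, 1.0, 2.0]

example_returns = forecast_returns_sketch(actual, model_pred, benchmark_pred)
print(example_returns)  # [1.0, -1.0, 0.75]
```

The candidate wins on the first and third dates (positive return) and loses on the second (negative return), even though its average loss is lower.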
These functions live in macroforecast.metrics because they score forecasts.
They do not explain fitted models, so they do not belong in
macroforecast.interpretation. They also do not run hypothesis tests, so they
do not belong in macroforecast.tests. Higher-level report integration can
later call these functions from macroforecast.evaluation.
Paper Motivation#
Standard forecast evaluation usually asks whether a model has lower average loss than a benchmark. The risk-return view asks whether those gains are stable enough to trust. A model can have lower RMSE on average while generating large negative episodes in recessions, inflation spikes, post-COVID periods, or other macroeconomic regimes where forecast failures are costly.
The paper’s primitive object is therefore not an aggregated RMSE table. It is a date-level sequence of benchmark-relative loss improvements. This is why the functions below operate on forecast panels and return paths rather than only on already-aggregated metric tables.
compute_point_loss#
macroforecast.metrics.compute_point_loss(
    y_true,
    y_pred,
    *,
    loss="squared_error",
    variance=None,
    quantile=None,
    eps=1e-12,
) -> pandas.Series
Input: aligned realized values and forecasts.
Output: one observation-level loss per aligned row, where lower is better.
Supported losses:
| | Required inputs | Formula or meaning |
|---|---|---|
| | | |
| | | |
| | | Quantile loss for one requested quantile. |
| | | Gaussian negative log score. |
| | realized variance in | QLIKE volatility loss. |
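For orientation, two of the observation-level losses described above can be written out directly. This is a sketch using the textbook definitions of the quantile (pinball) loss and the Gaussian negative log score; the function names are hypothetical, and the library's exact conventions (for example how `eps` is applied) are not shown in this page and are not reproduced here.

```python
import math


def pinball_loss(y, q_pred, quantile):
    """Quantile (pinball) loss for one observation and one quantile level."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must lie strictly inside (0, 1)")
    diff = y - q_pred
    # Under-prediction is weighted by quantile, over-prediction by 1 - quantile.
    return quantile * diff if diff >= 0 else (quantile - 1.0) * diff


def gaussian_nll(y, mean, variance):
    """Negative log density of y under a Normal(mean, variance) forecast."""
    if not (math.isfinite(variance) and variance > 0.0):
        raise ValueError("variance must be finite and strictly positive")
    return 0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)


print(pinball_loss(3.0, 1.0, 0.5))   # 1.0
print(pinball_loss(1.0, 3.0, 0.25))  # 1.5
```

Both functions return a loss where lower is better, which is the convention the return definition relies on.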
forecast_returns#
macroforecast.metrics.forecast_returns(
    forecasts,
    *,
    benchmark,
    group_cols=("target", "horizon"),
    loss="squared_error",
    model_col="model",
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    support_cols=None,
    include_benchmark=False,
    quantile=None,
) -> pandas.DataFrame
Input: a ForecastResult, forecast table, or pandas-like table with candidate
and benchmark rows. The benchmark must already exist in the same forecast
panel. The function does not create a benchmark forecast.
Required columns:
| Column | Meaning |
|---|---|
| | Candidate and benchmark model identifiers. Default: |
| | Realized value. Default: |
| | Point forecast. Default: |
| | Support identity. At least one is required unless supplied through |
| | Alignment groups such as |
Output columns:
| Column | Meaning |
|---|---|
| | Candidate date-level loss. |
| | Benchmark date-level loss on the same support row. |
| | |
| | |
| | Cumulative sum of |
| | Cumulative return minus running peak. |
| | Canonical loss label. |
| | Stable model labels for downstream grouping. |
Validation is intentionally strict. Candidate and benchmark support must match exactly within every group, and realized values must match after alignment. This prevents a benchmark with a different window, horizon, target, or missing date pattern from being treated as a fair comparator.
returns = mf.metrics.forecast_returns(
    forecast_result,
    benchmark="ar",
    group_cols=("target", "horizon"),
    loss="squared_error",
)
sortino_ratio#
macroforecast.metrics.sortino_ratio(
    returns,
    *,
    target_return=0.0,
) -> float
Computes mean excess forecast return divided by downside semideviation:
downside_t = min(return_t - target_return, 0)
If no return falls below the target and at least one exceeds it, the
denominator is zero and the ratio is inf. If numerator and denominator are
both zero, the ratio is nan.
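The edge cases are easier to see in code. The sketch below is library-independent and assumes the downside semideviation is the root mean square of `downside_t` over all observations; the normalization macroforecast actually uses is not stated here, so treat that choice and the name `sortino_ratio_sketch` as assumptions.

```python
import math


def sortino_ratio_sketch(returns, target_return=0.0):
    """Mean excess return over downside semideviation (population form)."""
    if not returns:
        raise ValueError("returns must not be empty")
    excess = [r - target_return for r in returns]
    mean_excess = sum(excess) / len(excess)
    # downside_t = min(return_t - target_return, 0)
    downside_sq = [min(e, 0.0) ** 2 for e in excess]
    semideviation = math.sqrt(sum(downside_sq) / len(excess))
    if semideviation == 0.0:
        # No return below target: inf if there is any gain, nan otherwise.
        return math.nan if mean_excess == 0.0 else math.inf
    return mean_excess / semideviation


print(sortino_ratio_sketch([2.0, -1.0, 2.0, 1.0]))  # 2.0
print(sortino_ratio_sketch([1.0, 2.0]))             # inf
```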
omega_ratio#
macroforecast.metrics.omega_ratio(
    returns,
    *,
    threshold=0.0,
) -> float
Computes total upside divided by total downside around a threshold:
omega = sum(max(return_t - threshold, 0))
/ sum(max(threshold - return_t, 0))
inf means there is upside and no downside; nan means there is neither
upside nor downside.
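A direct transcription of the formula and its two special cases might look like the following. The function name is hypothetical and the sketch works on a plain list rather than on the library's input types.

```python
import math


def omega_ratio_sketch(returns, threshold=0.0):
    """Total upside divided by total downside around a threshold."""
    upside = sum(max(r - threshold, 0.0) for r in returns)
    downside = sum(max(threshold - r, 0.0) for r in returns)
    if downside == 0.0:
        # Upside without downside gives inf; neither gives nan.
        return math.inf if upside > 0.0 else math.nan
    return upside / downside


print(omega_ratio_sketch([1.0, 1.0, -3.0, 1.0]))  # 1.0
print(omega_ratio_sketch([2.0, -1.0]))            # 2.0
```

An omega ratio of 1 means cumulative gains and cumulative losses around the threshold exactly offset each other.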
drawdown_series and max_drawdown#
macroforecast.metrics.drawdown_series(returns) -> pandas.Series
macroforecast.metrics.max_drawdown(returns) -> float
Drawdown is computed from cumulative forecast returns:
cumulative_t = sum_{s <= t} return_s
drawdown_t = cumulative_t - max_{s <= t}(cumulative_s)
For example, returns [1, 1, -3, 1] have cumulative returns
[1, 2, -1, 0], drawdowns [0, 0, -3, -2], and maximum drawdown -3.
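The worked example can be reproduced with the standard library. This sketch mirrors the two formulas above using `itertools.accumulate`; the function names are illustrative and are not the library's implementation.

```python
from itertools import accumulate


def drawdown_series_sketch(returns):
    """Cumulative return minus its running peak, date by date."""
    cumulative = list(accumulate(returns))            # running sum
    running_peak = list(accumulate(cumulative, max))  # running maximum
    return [c - p for c, p in zip(cumulative, running_peak)]


def max_drawdown_sketch(returns):
    """Most negative drawdown along the path (0 if the path never falls)."""
    return min(drawdown_series_sketch(returns))


example = [1, 1, -3, 1]
print(drawdown_series_sketch(example))  # [0, 0, -3, -2]
print(max_drawdown_sketch(example))     # -3
```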
risk_adjusted_forecast_metrics#
macroforecast.metrics.risk_adjusted_forecast_metrics(
    returns,
    *,
    group_cols=None,
    return_col="forecast_return",
    hac_lags="auto",
    target_return=0.0,
    omega_threshold=0.0,
) -> pandas.DataFrame
Input: the date-level output of forecast_returns(...), or any DataFrame with
a return column.
Output: one row per group with:
| Column | Meaning |
|---|---|
| | Number of finite return observations. |
| | Average benchmark-relative loss reduction. |
| | Sample standard deviation of returns. |
| | HAC long-run standard deviation when requested. |
| | Mean return divided by ordinary or HAC volatility. |
| | Downside-risk-adjusted return. |
| | Upside/downside ratio. |
| | Worst cumulative-return drawdown. |
| | Sum of returns over the evaluated path. |
| | Share of dates with positive forecast return. |
Default grouping uses available columns such as model_id, benchmark_id,
target, horizon, sample, regime, and loss_name.
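Several of the per-group summaries listed above can be illustrated for a single return path. The sketch below covers only the ordinary (non-HAC) quantities; the dictionary keys are made-up labels for this example, not the library's output column names.

```python
import math
import statistics


def risk_summary_sketch(returns):
    """Summarize one path of date-level forecast returns."""
    finite = [r for r in returns if math.isfinite(r)]  # drop nan/inf rows
    if len(finite) < 2:
        raise ValueError("need at least two finite returns")
    mean_return = statistics.fmean(finite)
    volatility = statistics.stdev(finite)  # sample std, n - 1 denominator
    ratio = mean_return / volatility if volatility > 0.0 else math.nan
    return {
        "n_obs": len(finite),
        "mean_return": mean_return,
        "volatility": volatility,
        "mean_over_volatility": ratio,
        "total_return": sum(finite),
        "hit_rate": sum(r > 0.0 for r in finite) / len(finite),
    }


summary = risk_summary_sketch([2.0, -1.0, 2.0, 1.0, float("nan")])
print(summary["n_obs"], summary["hit_rate"])  # 4 0.75
```

A grouped version would apply the same summary to each group's rows separately.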
edge_ratio#
macroforecast.metrics.edge_ratio(
    forecasts,
    *,
    group_cols=("target", "horizon"),
    loss="squared_error",
    model_col="model",
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    support_cols=None,
    quantile=None,
) -> pandas.DataFrame
Edge Ratio asks whether a model delivers unique gains relative to the model pool, not only relative to one benchmark. For each date and model:
edge_t(model) = min_loss_t(all other models) - loss_t(model)
Therefore:
| Edge sign | Meaning |
|---|---|
| edge_t > 0 | The model is strictly better than every alternative on that date. |
| edge_t = 0 | The model ties the best alternative. |
| edge_t < 0 | At least one alternative is better. |
Aggregated Edge Ratio is:
edge_ratio
= (sum(max(edge_t, 0)) / sum(max(-edge_t, 0)))
* (number_of_models - 1)
If a model has positive edge wins and no edge regrets, the ratio is inf. If a
model never has edge wins, the ratio is 0. The result also carries the
date-level edge path in attrs["macroforecast_edge_path"] for inspection.
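The aggregation can be traced on a small loss panel. This is a library-independent sketch of the formulas above for a single group; the input layout (a dict of per-model loss lists on common dates) and the function name are assumptions made for the example.

```python
import math


def edge_ratio_sketch(losses_by_model):
    """Edge Ratio per model from date-level losses on a common set of dates."""
    models = list(losses_by_model)
    n_dates = len(losses_by_model[models[0]])
    ratios = {}
    for model in models:
        others = [m for m in models if m != model]
        wins = regrets = 0.0
        for t in range(n_dates):
            best_other = min(losses_by_model[m][t] for m in others)
            edge = best_other - losses_by_model[model][t]
            wins += max(edge, 0.0)      # gains over the best alternative
            regrets += max(-edge, 0.0)  # shortfall to the best alternative
        if wins == 0.0:
            ratios[model] = 0.0          # never strictly best
        elif regrets == 0.0:
            ratios[model] = math.inf     # wins and no regrets
        else:
            ratios[model] = wins / regrets * (len(models) - 1)
    return ratios


losses = {"a": [1.0, 4.0, 1.0], "b": [2.0, 1.0, 3.0], "c": [3.0, 2.0, 2.0]}
ratios = edge_ratio_sketch(losses)
print(ratios["c"])  # 0.0
```

In this panel model "a" has edges of 1, -3, and 1, so its ratio is (2 / 3) * 2; model "c" is never the unique best and scores 0.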
Forecast Table Helpers#
evaluate_forecasts#
macroforecast.metrics.evaluate_forecasts(
    forecasts,
    *,
    by=("model", "horizon"),
    metrics=("mse", "rmse", "mae"),
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    volatility_actual=None,
    quantile_predictions="quantile_predictions",
    previous_actual="previous_actual",
    benchmark_model=None,
    model_column="model",
)
Input: a ForecastResult, forecast table, or pandas-like table with realized
values and forecast columns.
Output: a pandas DataFrame, one row per by group. The result carries
attrs["macroforecast_metadata_schema"] = {"kind": "forecast_metrics", "version": 1, ...}.
The metadata schema also records by, requested_metrics, benchmark_model,
relative_support_columns, input columns, and automatically added metric
groups.
Validation: every requested by column must exist in the forecast table.
evaluate_forecasts() fails loudly instead of dropping unavailable grouping
dimensions. Relative metrics such as relative_mse, relative_mae,
mse_reduction, and r2_oos require benchmark_model, and the benchmark must
have matching rows for every scored non-benchmark group. The grouping must
include model_column because relative metrics compare each candidate model
against a named benchmark model.
benchmark_model does not create benchmark forecasts. It selects existing rows
from the forecast table. For a fair comparison, generate the benchmark in the
same forecasting run with the same window/origin/horizon/target contract, or
append an external benchmark CSV only after validating that it has the same
forecast-table schema and the same evaluation support. Relative metrics fail
when candidate and benchmark supports differ. Forecast-table relative metrics
require at least one support identity column: date, origin, or
origin_pos. For matching support rows, candidate and benchmark actual
values must also match; otherwise the forecast table is treated as inconsistent.
Forecast-table behavior:
| Available input | Added scores |
|---|---|
| | Requested point metrics such as |
| | Relative metrics such as |
| | |
| | |
| | |
| | Pinball loss by quantile and interval coverage/width/score for matched lower-upper pairs. |
Malformed probabilistic inputs fail validation. Quantile forecasts must be
per-row dictionaries mapping levels strictly inside (0, 1) to finite numeric
predictions. Invalid variance, volatility, interval, or quantile values are not
silently clipped or skipped.
Requested specialized metrics fail loudly when their required support columns are absent:
| Requested metric group | Required forecast-table column |
|---|---|
| | |
| | |
| | |
| | |
scores = mf.metrics.evaluate_forecasts(
    result,
    metrics=("mse", "rmse", "relative_mse", "r2_oos"),
    benchmark_model="ols",
)
rank_forecasts#
macroforecast.metrics.rank_forecasts(
    evaluation,
    *,
    metric="mse",
    by=("horizon",),
    ascending=None,
    rank_column="rank",
)
Input: an evaluation table from evaluate_forecasts(...) or an equivalent
pandas table.
Output: the same rows with a rank column. If ascending=None, lower is better
for recognized loss metrics and higher is better for recognized gain metrics
such as r2_oos, mse_reduction, success_ratio, and
pesaran_timmermann_metric. Every requested by column must exist in the
evaluation table. Signed bias, coverage metrics, and custom metrics require
an explicit ascending=True or ascending=False. Coverage is intentionally
not treated as automatically higher-is-better because interval coverage should
usually be assessed against a nominal level, not maximized.
get_metric#
macroforecast.metrics.get_metric(metric)
Input: a metric name or callable.
Output: the resolved callable. Name aliases include msfe -> mse,
validation_mse -> mse, validation_rmse -> rmse,
mean_error -> bias, and negative_log_score -> gaussian_nll.
Custom metrics do not need registration. Pass a callable directly anywhere a metric is accepted:
import pandas as pd

def mean_bias(y_true, y_pred):
    # Mean of prediction minus realization, returned as a plain float.
    return float(pd.Series(y_pred).sub(pd.Series(y_true)).mean())

scores = mf.metrics.evaluate_forecasts(
    forecasts,
    metrics=("mse", mean_bias),
)
The metric callable should accept (y_true, y_pred) and return one scalar
float. In evaluation tables, the output column name is the callable’s
__name__, or "callable_metric" when no name is available. Metrics requiring
benchmark forecasts, variances, intervals, or previous actuals need one of the
specialized built-in metric names because evaluate_forecasts() must know
which forecast-table columns to pass.
Point Metrics#
All point metrics align inputs as pandas Series, drop missing paired
observations, and return a single float.
| Function | Signature | Output |
|---|---|---|
| | | Mean squared error. |
| | | Root mean squared error. |
| | | Mean absolute error. |
| | | Mean residual |
| | | Median absolute error. |
| | | Mean absolute percentage error on the 0-100 scale. |
| | | Symmetric MAPE on the 0-100 scale. |
| | | Theil U1 inequality coefficient. |
| | | Theil U2 relative to a no-change forecast. |
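The "drop missing paired observations, then return a float" behavior can be illustrated as follows. This sketch pairs inputs by position instead of by pandas index, which is a simplification; the function name is hypothetical.

```python
import math


def paired_point_metrics(y_true, y_pred):
    """MSE, RMSE, and MAE over rows where both values are present."""
    pairs = [
        (y, f)
        for y, f in zip(y_true, y_pred)
        if not (math.isnan(y) or math.isnan(f))  # drop incomplete pairs
    ]
    if not pairs:
        raise ValueError("no complete (y_true, y_pred) pairs to score")
    errors = [f - y for y, f in pairs]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


nan = float("nan")
point_scores = paired_point_metrics([1.0, 2.0, nan, 4.0], [1.5, nan, 3.0, 3.0])
print(point_scores["mse"], point_scores["mae"])  # 0.625 0.75
```

Only the first and last rows are scored, because the second and third each miss one side of the pair.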
Benchmark-Relative Metrics#
These functions require realized values, candidate forecasts, and benchmark forecasts aligned on the same index.
The direct functions and evaluate_forecasts(...) require candidate and
benchmark support to match exactly. They do not silently score only the
intersection of two forecast histories. Forecast-table evaluation also checks
that candidate and benchmark rows carry the same realized value for each support
point.
| Function | Signature | Interpretation |
|---|---|---|
| | | Candidate MSE divided by benchmark MSE. Below 1 favors candidate. |
| | | Candidate MAE divided by benchmark MAE. Below 1 favors candidate. |
| | | Benchmark MSE minus candidate MSE. Positive favors candidate. |
| | | Out-of-sample |
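The relationships between these quantities are simple to verify by hand. The sketch below assumes the common definition of out-of-sample R² as one minus the ratio of candidate to benchmark MSE; the exact formula used by `r2_oos` is not recoverable from this page, so that part is an assumption, as is the function name.

```python
def relative_metrics_sketch(y_true, y_pred, y_benchmark):
    """Benchmark-relative scores on an exactly matching support."""
    if not (len(y_true) == len(y_pred) == len(y_benchmark)):
        # Mirror the strict-support rule: never score a silent intersection.
        raise ValueError("candidate and benchmark support must match exactly")
    n = len(y_true)
    mse_model = sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / n
    mse_bench = sum((y - b) ** 2 for y, b in zip(y_true, y_benchmark)) / n
    return {
        "relative_mse": mse_model / mse_bench,      # below 1 favors candidate
        "mse_reduction": mse_bench - mse_model,     # positive favors candidate
        "r2_oos": 1.0 - mse_model / mse_bench,      # assumed definition
    }


relative = relative_metrics_sketch(
    [2.0, 1.0, 3.0], [2.0, 2.0, 2.5], [1.0, 1.0, 2.0]
)
print(round(relative["relative_mse"], 3))  # 0.625
```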
Density, Interval, And Volatility Metrics#
| Function | Signature | Output |
|---|---|---|
| | | Mean quantile pinball loss. |
| | | Gaussian negative log likelihood. |
| | | Gaussian negative log score. |
| | | Backward-compatible alias for |
| | | Gaussian continuous ranked probability score. |
| | | QLIKE volatility loss using realized variance or squared realization. |
| | | Share of observations inside the interval. |
| | | Mean interval width. |
| | | Winkler interval score. |
evaluate_forecasts(...) uses variance_prediction for gaussian_nll,
negative_log_score, log_score, and crps, and quantile_predictions dictionaries
for pinball and interval metrics. qlike should be evaluated against realized
variance or squared realization; pass volatility_actual when that column
differs from actual.
Variance inputs must be finite and strictly positive. QLIKE realized variance
must be finite and nonnegative, while the forecast variance must be strictly
positive. Interval metrics require upper >= lower for every evaluated row.
Quantile levels must be strictly inside (0, 1), and quantile predictions must
be finite.
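The three interval scores and the `upper >= lower` check can be sketched together. This uses the standard Winkler score for a central (1 - alpha) interval; the function name and the `alpha` parameter are assumptions for illustration, since the library signatures are not shown in this page.

```python
def interval_scores_sketch(y_true, lower, upper, alpha):
    """Coverage, mean width, and mean Winkler score for (1 - alpha) intervals."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly inside (0, 1)")
    covered = 0
    widths = []
    winkler = []
    for y, lo, hi in zip(y_true, lower, upper):
        if hi < lo:
            raise ValueError("interval requires upper >= lower on every row")
        width = hi - lo
        penalty = 0.0
        if y < lo:
            penalty = (2.0 / alpha) * (lo - y)   # miss below the interval
        elif y > hi:
            penalty = (2.0 / alpha) * (y - hi)   # miss above the interval
        else:
            covered += 1
        widths.append(width)
        winkler.append(width + penalty)
    n = len(widths)
    return {
        "coverage": covered / n,
        "width": sum(widths) / n,
        "winkler": sum(winkler) / n,
    }


interval = interval_scores_sketch(
    [1.0, 5.0, 3.0], [0.0, 1.0, 4.0], [2.0, 4.0, 6.0], alpha=0.2
)
print(round(interval["winkler"], 6))  # 9.0
```

Only the first interval contains its realization, so coverage is one third while the two misses dominate the Winkler score.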
Direction Metrics#
| Function | Signature | Output |
|---|---|---|
| | | Directional hit rate relative to the previous realized value. |
| | | Pesaran-Timmermann directional accuracy statistic. |
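A directional hit rate relative to the previous realized value can be sketched as below. Treating a zero change as its own direction (so it only matches another zero change) is an assumption of this example, not documented library behavior, and the function name is hypothetical.

```python
def _sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def success_ratio_sketch(actual, prediction, previous_actual):
    """Share of rows where predicted and realized changes share a sign."""
    hits = [
        _sign(y - prev) == _sign(f - prev)
        for y, f, prev in zip(actual, prediction, previous_actual)
    ]
    if not hits:
        raise ValueError("no observations to score")
    return sum(hits) / len(hits)


hit_rate = success_ratio_sketch(
    actual=[2.0, 1.0, 4.0, 4.5],
    prediction=[1.5, 2.5, 3.2, 3.0],
    previous_actual=[1.0, 2.0, 3.0, 4.0],
)
print(hit_rate)  # 0.5
```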
- mase – Mean Absolute Scaled Error (Hyndman-Koehler), out-of-sample MAE scaled by the in-sample (seasonal-)naive MAE.
- seasonal_naive_mae – in-sample (seasonal-)naive MAE mean(|y[t]-y[t-m]|), the MASE scaling denominator.
- acf1 – lag-1 autocorrelation (e.g. of forecast residuals), the ACF1 reported by forecast::accuracy.
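These three quantities follow directly from their definitions. The sketch below is library-independent; the function names carry a `_sketch` suffix to mark them as illustrative, and the argument names (`y_train`, `m`) are assumptions because the library signatures are not shown here.

```python
def seasonal_naive_mae_sketch(y_train, m=1):
    """In-sample MAE of the (seasonal-)naive forecast: mean(|y[t] - y[t-m]|)."""
    diffs = [abs(y_train[t] - y_train[t - m]) for t in range(m, len(y_train))]
    if not diffs:
        raise ValueError("training sample must be longer than the period m")
    return sum(diffs) / len(diffs)


def mase_sketch(y_train, y_test, y_pred, m=1):
    """Out-of-sample MAE scaled by the in-sample (seasonal-)naive MAE."""
    mae = sum(abs(y - f) for y, f in zip(y_test, y_pred)) / len(y_test)
    return mae / seasonal_naive_mae_sketch(y_train, m)


def acf1_sketch(x):
    """Lag-1 autocorrelation using the full-sample mean and variance."""
    mean = sum(x) / len(x)
    dev = [v - mean for v in x]
    numerator = sum(dev[t] * dev[t - 1] for t in range(1, len(x)))
    return numerator / sum(d * d for d in dev)


print(mase_sketch([1.0, 2.0, 4.0, 7.0], [8.0, 10.0], [9.0, 9.0]))  # 0.5
print(acf1_sketch([1.0, 2.0, 3.0, 4.0]))                           # 0.25
```

A MASE below 1 means the forecast beats the in-sample naive benchmark on average absolute error.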