macroforecast.metrics#

macroforecast.metrics owns forecast scoring only. It does not choose windows, fit models, run statistical comparison tests, or write artifacts.

Use the namespace form:

import macroforecast as mf

mf.metrics.rmse(y_true, y_pred)

Top-level shortcuts such as mf.rmse(...) are intentionally not exported.

MetricLike is the public input type used by metric resolvers: a metric can be a registered metric name or a callable with the expected scoring signature.

Risk-Return Forecast Evaluation#

Paper Citation And Scope#

This section implements the evaluation framework from:

Goulet Coulombe, Philippe. 2026. “Quantifying the Risk-Return Tradeoff in Forecasting.” arXiv:2605.09712v1, submitted May 10, 2026. arXiv page: https://arxiv.org/abs/2605.09712.

This is not a portfolio-construction module. macroforecast does not treat macroeconomic forecasts as traded assets here. The word “return” means a date-level loss differential:

forecast_return_t(model | benchmark)
    = loss_t(benchmark) - loss_t(model)

Positive values mean the candidate model reduced forecast loss relative to the benchmark on that date. Negative values mean it underperformed. The financial language is useful because it gives precise names for stability of gains: volatility, downside risk, upside/downside balance, and drawdown. The object being evaluated remains a macro forecast panel.
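
The definition above can be sketched in plain Python for the squared-error case. This is an illustration only, not the library implementation; the function name forecast_returns_sketch and the toy numbers are hypothetical.

```python
def forecast_returns_sketch(actual, model_pred, benchmark_pred):
    """Date-level loss differentials: benchmark loss minus model loss."""
    returns = []
    for y, f_model, f_bench in zip(actual, model_pred, benchmark_pred):
        model_loss = (y - f_model) ** 2       # squared-error loss of the candidate
        benchmark_loss = (y - f_bench) ** 2   # squared-error loss of the benchmark
        returns.append(benchmark_loss - model_loss)
    return returns


# Toy panel: three dates, one candidate, one benchmark.
actual = [2.0, 1.0, 3.0]
model_pred = [1.5, 1.0, 1.0]
benchmark_pred = [1.0, 2.0, 2.5]
print(forecast_returns_sketch(actual, model_pred, benchmark_pred))
# [0.75, 1.0, -3.75]
```

The candidate beats the benchmark on the first two dates and loses heavily on the third, which is the kind of episode the risk-return metrics below are meant to expose.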

These functions live in macroforecast.metrics because they score forecasts. They do not explain fitted models, so they do not belong in macroforecast.interpretation. They also do not run hypothesis tests, so they do not belong in macroforecast.tests. Higher-level report integration can later call these functions from macroforecast.evaluation.

Paper Motivation#

Standard forecast evaluation usually asks whether a model has lower average loss than a benchmark. The risk-return view asks whether those gains are stable enough to trust. A model can have lower RMSE on average while generating large negative episodes in recessions, inflation spikes, post-COVID periods, or other macroeconomic regimes where forecast failures are costly.

The paper’s primitive object is therefore not an aggregated RMSE table. It is a date-level sequence of benchmark-relative loss improvements. This is why the functions below operate on forecast panels and return paths rather than only on already-aggregated metric tables.

compute_point_loss#

macroforecast.metrics.compute_point_loss(
    y_true,
    y_pred,
    *,
    loss="squared_error",
    variance=None,
    quantile=None,
    eps=1e-12,
) -> pandas.Series

Input: aligned realized values and forecasts.

Output: one observation-level loss per aligned row, where lower is better.

Supported losses:

| loss | Required inputs | Formula or meaning |
| --- | --- | --- |
| "squared_error", "mse", "msfe" | y_true, y_pred | (y_true - y_pred)^2 at each date. |
| "absolute_error", "mae" | y_true, y_pred | abs(y_true - y_pred) at each date. |
| "pinball_loss" | y_true, y_pred, quantile | Quantile loss for one requested quantile. |
| "negative_log_score", "gaussian_nll", "log_score" | y_true, y_pred, variance | Gaussian negative log score. |
| "qlike" | realized variance in y_true, forecast variance in y_pred | QLIKE volatility loss. |
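
As a rough guide to what these observation-level losses look like, the sketch below writes out the textbook forms for four of them. The function names are hypothetical and the formulas are the standard definitions, assumed rather than confirmed to match the library term by term. QLIKE is left out because several conventions exist and the text does not say which one is used.

```python
import math


def squared_error(y, f):
    return (y - f) ** 2


def absolute_error(y, f):
    return abs(y - f)


def pinball(y, q_pred, quantile):
    # Standard quantile (check) loss for one level strictly inside (0, 1).
    diff = y - q_pred
    return max(quantile * diff, (quantile - 1.0) * diff)


def gaussian_nll(y, mean, variance):
    # Negative log density of N(mean, variance) evaluated at y.
    return 0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)


print(squared_error(3.0, 1.0), absolute_error(3.0, 1.0))
print(pinball(3.0, 1.0, 0.9), pinball(1.0, 3.0, 0.9))
print(gaussian_nll(0.0, 0.0, 1.0))
```

In every case lower is better, which is what lets forecast_returns subtract the candidate loss from the benchmark loss regardless of the loss family.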

forecast_returns#

macroforecast.metrics.forecast_returns(
    forecasts,
    *,
    benchmark,
    group_cols=("target", "horizon"),
    loss="squared_error",
    model_col="model",
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    support_cols=None,
    include_benchmark=False,
    quantile=None,
) -> pandas.DataFrame

Input: a ForecastResult, forecast table, or pandas-like table with candidate and benchmark rows. The benchmark must already exist in the same forecast panel. The function does not create a benchmark forecast.

Required columns:

| Column | Meaning |
| --- | --- |
| model_col | Candidate and benchmark model identifiers. Default: model. |
| actual | Realized value. Default: actual. |
| prediction | Point forecast. Default: prediction. |
| date, origin, or origin_pos | Support identity. At least one is required unless supplied through support_cols. |
| group_cols | Alignment groups such as target and horizon; all requested columns must exist. |

Output columns:

| Column | Meaning |
| --- | --- |
| model_loss | Candidate date-level loss. |
| benchmark_loss | Benchmark date-level loss on the same support row. |
| forecast_return | benchmark_loss - model_loss; positive favors candidate. |
| return_sign | "positive", "negative", or "zero". |
| cumulative_return | Cumulative sum of forecast_return within each model/benchmark/group/loss path. |
| drawdown | Cumulative return minus running peak. |
| loss_name | Canonical loss label. |
| model_id, benchmark_id | Stable model labels for downstream grouping. |

Validation is intentionally strict. Candidate and benchmark support must match exactly within every group, and realized values must match after alignment. This prevents a benchmark with a different window, horizon, target, or missing date pattern from being treated as a fair comparator.

returns = mf.metrics.forecast_returns(
    forecast_result,
    benchmark="ar",
    group_cols=("target", "horizon"),
    loss="squared_error",
)
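
The strict support rule can be pictured with a small standalone check. This is a hypothetical sketch, not the validation code the library runs: check_matching_support and the dictionary layout are assumptions, and exact equality on realized values stands in for whatever tolerance the library applies.

```python
def check_matching_support(candidate, benchmark):
    """Both arguments map a support key, e.g. (target, horizon, date), to the realized value."""
    if set(candidate) != set(benchmark):
        # Keys present on one side only: different window, horizon, target, or missing dates.
        unmatched = set(candidate) ^ set(benchmark)
        raise ValueError(f"support mismatch on keys: {sorted(unmatched)}")
    for key, actual in candidate.items():
        if actual != benchmark[key]:
            raise ValueError(f"realized values differ at {key}")
    return True


candidate = {("gdp", 1, "2020-01"): 1.0, ("gdp", 1, "2020-02"): 1.2}
benchmark = {("gdp", 1, "2020-01"): 1.0, ("gdp", 1, "2020-02"): 1.2}
print(check_matching_support(candidate, benchmark))
```

Failing instead of scoring the intersection is the point: a silently shortened benchmark history would change which dates enter the return path.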

sharpe_ratio#

macroforecast.metrics.sharpe_ratio(returns, *, hac_lags=None) -> float

Computes mean forecast return divided by return volatility. With hac_lags=None, the denominator is the ordinary sample standard deviation of the return sequence. With hac_lags="auto" or a nonnegative integer, the denominator is a Newey-West/Bartlett long-run standard deviation. This is a path-stability score, not a trading Sharpe ratio.
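
The two denominators can be compared in a short sketch. It assumes the common Newey-West form with Bartlett weights 1 - k/(L+1) and autocovariances divided by n; the library's exact normalization and its "auto" lag rule are not stated here, so treat long_run_sd and sharpe_sketch as hypothetical illustrations.

```python
import math


def long_run_sd(returns, lags):
    """Newey-West long-run standard deviation with Bartlett weights."""
    n = len(returns)
    mean = sum(returns) / n
    centered = [r - mean for r in returns]
    # Lag-0 autocovariance, divided by n (an assumed HAC convention).
    variance = sum(c * c for c in centered) / n
    for k in range(1, lags + 1):
        gamma_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / n
        weight = 1.0 - k / (lags + 1.0)
        variance += 2.0 * weight * gamma_k
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(variance, 0.0))


def sharpe_sketch(returns, hac_lags=None):
    n = len(returns)
    mean = sum(returns) / n
    if hac_lags is None:
        # Ordinary sample standard deviation (n - 1 in the denominator).
        sd = math.sqrt(sum((r - mean) ** 2 for r in returns) / (n - 1))
    else:
        sd = long_run_sd(returns, hac_lags)
    return mean / sd


returns = [1.0, 2.0, 3.0]
print(sharpe_sketch(returns))              # ordinary volatility
print(sharpe_sketch(returns, hac_lags=1))  # Bartlett long-run volatility
```

The sketch does not handle a zero denominator; a constant return path would need separate treatment.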

sortino_ratio#

macroforecast.metrics.sortino_ratio(
    returns,
    *,
    target_return=0.0,
) -> float

Computes mean excess forecast return divided by downside semideviation:

downside_t = min(return_t - target_return, 0)

If no return falls below the target, the downside semideviation is zero: the ratio is inf when the mean excess return is positive, and nan when the numerator is also zero.
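
A minimal sketch of the ratio and its edge cases follows. It assumes the semideviation is the root mean square of downside_t over all n observations; the library may normalize differently, and sortino_sketch is a hypothetical name.

```python
import math


def sortino_sketch(returns, target_return=0.0):
    n = len(returns)
    mean_excess = sum(r - target_return for r in returns) / n
    downside = [min(r - target_return, 0.0) for r in returns]
    # Root mean square of the downside deviations (assumed normalization by n).
    semideviation = math.sqrt(sum(d * d for d in downside) / n)
    if semideviation == 0.0:
        # No return below the target, so mean_excess is >= 0 here.
        return math.nan if mean_excess == 0.0 else math.inf
    return mean_excess / semideviation


print(sortino_sketch([2.0, -1.0, 3.0, 0.0]))  # 2.0
print(sortino_sketch([1.0, 2.0]))             # inf
print(sortino_sketch([0.0, 0.0]))             # nan
```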

omega_ratio#

macroforecast.metrics.omega_ratio(
    returns,
    *,
    threshold=0.0,
) -> float

Computes total upside divided by total downside around a threshold:

omega = sum(max(return_t - threshold, 0))
        / sum(max(threshold - return_t, 0))

inf means there is upside and no downside; nan means there is neither upside nor downside.
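
The formula and its two special cases translate directly into a few lines. omega_sketch is a hypothetical name used for illustration only.

```python
import math


def omega_sketch(returns, threshold=0.0):
    upside = sum(max(r - threshold, 0.0) for r in returns)
    downside = sum(max(threshold - r, 0.0) for r in returns)
    if downside == 0.0:
        # Upside with no downside gives inf; neither gives nan.
        return math.inf if upside > 0.0 else math.nan
    return upside / downside


print(omega_sketch([2.0, -1.0, 3.0, 0.0]))  # 5.0
print(omega_sketch([1.0, 2.0]))             # inf
print(omega_sketch([0.0, 0.0]))             # nan
```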

drawdown_series and max_drawdown#

macroforecast.metrics.drawdown_series(returns) -> pandas.Series
macroforecast.metrics.max_drawdown(returns) -> float

Drawdown is computed from cumulative forecast returns:

cumulative_t = sum_{s <= t} return_s
drawdown_t = cumulative_t - max_{s <= t}(cumulative_s)

For example, returns [1, 1, -3, 1] have cumulative returns [1, 2, -1, 0], drawdowns [0, 0, -3, -2], and maximum drawdown -3.
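
The worked example can be reproduced with itertools.accumulate. The function names are hypothetical, and the running peak starts from the first cumulative value, as in the formula above.

```python
from itertools import accumulate


def drawdown_series_sketch(returns):
    cumulative = list(accumulate(returns))            # running sum of returns
    running_peak = list(accumulate(cumulative, max))  # running maximum of that sum
    return [c - p for c, p in zip(cumulative, running_peak)]


def max_drawdown_sketch(returns):
    # Most negative drawdown along the path.
    return min(drawdown_series_sketch(returns))


print(drawdown_series_sketch([1, 1, -3, 1]))  # [0, 0, -3, -2]
print(max_drawdown_sketch([1, 1, -3, 1]))     # -3
```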

risk_adjusted_forecast_metrics#

macroforecast.metrics.risk_adjusted_forecast_metrics(
    returns,
    *,
    group_cols=None,
    return_col="forecast_return",
    hac_lags="auto",
    target_return=0.0,
    omega_threshold=0.0,
) -> pandas.DataFrame

Input: the date-level output of forecast_returns(...), or any DataFrame with a return column.

Output: one row per group with:

| Column | Meaning |
| --- | --- |
| n_obs | Number of finite return observations. |
| mean_return | Average benchmark-relative loss reduction. |
| return_sd | Sample standard deviation of returns. |
| hac_return_sd | HAC long-run standard deviation when requested. |
| sharpe, hac_sharpe | Mean return divided by ordinary or HAC volatility. |
| sortino | Downside-risk-adjusted return. |
| omega | Upside/downside ratio. |
| max_drawdown | Worst cumulative-return drawdown. |
| final_cumulative_return | Sum of returns over the evaluated path. |
| win_rate | Share of dates with positive forecast return. |

Default grouping uses available columns such as model_id, benchmark_id, target, horizon, sample, regime, and loss_name.

edge_ratio#

macroforecast.metrics.edge_ratio(
    forecasts,
    *,
    group_cols=("target", "horizon"),
    loss="squared_error",
    model_col="model",
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    support_cols=None,
    quantile=None,
) -> pandas.DataFrame

Edge Ratio asks whether a model delivers unique gains relative to the model pool, not only relative to one benchmark. For each date and model:

edge_t(model) = min_loss_t(all other models) - loss_t(model)

Therefore:

| Edge sign | Meaning |
| --- | --- |
| edge > 0 | The model is strictly better than every alternative on that date. |
| edge = 0 | The model ties the best alternative. |
| edge < 0 | At least one alternative is better. |

Aggregated Edge Ratio is:

edge_ratio
    = (sum(max(edge_t, 0)) / sum(max(-edge_t, 0)))
      * (number_of_models - 1)

If a model has positive edge wins and no edge regrets, the ratio is inf. If a model never has edge wins, the ratio is 0. The result also carries the date-level edge path in attrs["macroforecast_edge_path"] for inspection.
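
The edge path and the aggregated ratio can be sketched on a toy pool of three models. The names edge_path and edge_ratio_sketch and the loss values are hypothetical; returning 0 when there are neither wins nor regrets follows a literal reading of the text and is an assumption.

```python
import math


def edge_path(losses_by_model, model):
    """losses_by_model maps model name -> list of date-level losses on a shared support."""
    others = [losses for name, losses in losses_by_model.items() if name != model]
    path = []
    for t, own_loss in enumerate(losses_by_model[model]):
        best_other = min(losses[t] for losses in others)  # best competing loss on date t
        path.append(best_other - own_loss)
    return path


def edge_ratio_sketch(losses_by_model, model):
    edges = edge_path(losses_by_model, model)
    wins = sum(max(e, 0.0) for e in edges)
    regrets = sum(max(-e, 0.0) for e in edges)
    if wins == 0.0:
        return 0.0       # never strictly best (all-ties case assumed to be 0 as well)
    if regrets == 0.0:
        return math.inf  # wins with no regrets
    return (wins / regrets) * (len(losses_by_model) - 1)


losses = {"a": [1.0, 4.0, 2.0], "b": [2.0, 1.0, 3.0], "c": [3.0, 2.0, 2.5]}
print(edge_path(losses, "a"))          # [1.0, -3.0, 0.5]
print(edge_ratio_sketch(losses, "a"))  # 1.0
print(edge_ratio_sketch(losses, "c"))  # 0.0
```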

Forecast Table Helpers#

evaluate_forecasts#

macroforecast.metrics.evaluate_forecasts(
    forecasts,
    *,
    by=("model", "horizon"),
    metrics=("mse", "rmse", "mae"),
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    volatility_actual=None,
    quantile_predictions="quantile_predictions",
    previous_actual="previous_actual",
    benchmark_model=None,
    model_column="model",
)

Input: a ForecastResult, forecast table, or pandas-like table with realized values and forecast columns.

Output: a pandas DataFrame, one row per by group. The result carries attrs["macroforecast_metadata_schema"] = {"kind": "forecast_metrics", "version": 1, ...}. The metadata schema also records by, requested_metrics, benchmark_model, relative_support_columns, input columns, and automatically added metric groups.

Validation: every requested by column must exist in the forecast table. evaluate_forecasts() fails loudly instead of dropping unavailable grouping dimensions. Relative metrics such as relative_mse, relative_mae, mse_reduction, and r2_oos require benchmark_model, and the benchmark must have matching rows for every scored non-benchmark group. The grouping must include model_column because relative metrics compare each candidate model against a named benchmark model.

benchmark_model does not create benchmark forecasts. It selects existing rows from the forecast table. For a fair comparison, generate the benchmark in the same forecasting run with the same window/origin/horizon/target contract, or append an external benchmark CSV only after validating that it has the same forecast-table schema and the same evaluation support. Relative metrics fail when candidate and benchmark supports differ. Forecast-table relative metrics require at least one support identity column: date, origin, or origin_pos. For matching support rows, candidate and benchmark actual values must also match; otherwise the forecast table is treated as inconsistent.

Forecast-table behavior:

| Available input | Added scores |
| --- | --- |
| actual, prediction | Requested point metrics such as mse, rmse, mae, bias. |
| benchmark_model plus benchmark rows | Relative metrics such as relative_mse, relative_mae, mse_reduction, r2_oos. |
| previous_actual | theil_u2 and success_ratio. |
| variance_prediction | gaussian_nll, crps, and requested qlike. |
| volatility_actual plus variance_prediction | qlike against an explicit realized-variance column. If omitted, actual is used. |
| quantile_predictions dictionaries | Pinball loss by quantile and interval coverage/width/score for matched lower-upper pairs. |

Malformed probabilistic inputs fail validation. Quantile forecasts must be per-row dictionaries mapping levels strictly inside (0, 1) to finite numeric predictions. Invalid variance, volatility, interval, or quantile values are not silently clipped or skipped.

Requested specialized metrics fail loudly when their required support columns are absent:

| Requested metric group | Required forecast-table column |
| --- | --- |
| gaussian_nll, negative_log_score, log_score, crps | variance_prediction |
| qlike | variance_prediction; use volatility_actual when realized variance is not in actual |
| theil_u2, success_ratio | previous_actual |
| pinball_loss, coverage_rate, interval_width, interval_score | quantile_predictions |

scores = mf.metrics.evaluate_forecasts(
    result,
    metrics=("mse", "rmse", "relative_mse", "r2_oos"),
    benchmark_model="ols",
)

rank_forecasts#

macroforecast.metrics.rank_forecasts(
    evaluation,
    *,
    metric="mse",
    by=("horizon",),
    ascending=None,
    rank_column="rank",
)

Input: an evaluation table from evaluate_forecasts(...) or an equivalent pandas table.

Output: the same rows with a rank column. If ascending=None, lower is better for recognized loss metrics and higher is better for recognized gain metrics such as r2_oos, mse_reduction, success_ratio, and pesaran_timmermann_metric. Every requested by column must exist in the evaluation table. Signed bias, coverage metrics, and custom metrics require an explicit ascending=True or ascending=False. Coverage is intentionally not treated as automatically higher-is-better because interval coverage should usually be assessed against a nominal level, not maximized.

get_metric#

macroforecast.metrics.get_metric(metric)

Input: a metric name or callable.

Output: the resolved callable. Name aliases include msfe -> mse, validation_mse -> mse, validation_rmse -> rmse, mean_error -> bias, and negative_log_score -> gaussian_nll.

Custom metrics do not need registration. Pass a callable directly anywhere a metric is accepted:

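import pandas as pd

def mean_bias(y_true, y_pred):
    # Custom metric: mean of prediction minus actual.
    # Note the sign is the opposite of the built-in bias (actual - prediction).
    return float(pd.Series(y_pred).sub(pd.Series(y_true)).mean())

# Built-in names and callables can be mixed in the same request.
scores = mf.metrics.evaluate_forecasts(
    forecasts,
    metrics=("mse", mean_bias),
)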

The metric callable should accept (y_true, y_pred) and return one scalar float. In evaluation tables, the output column name is the callable’s __name__, or "callable_metric" when no name is available. Metrics requiring benchmark forecasts, variances, intervals, or previous actuals need one of the specialized built-in metric names because evaluate_forecasts() must know which forecast-table columns to pass.

Point Metrics#

All point metrics align inputs as pandas Series, drop missing paired observations, and return a single float.

| Function | Signature | Output |
| --- | --- | --- |
| mse | mse(y_true, y_pred) | Mean squared error. |
| rmse | rmse(y_true, y_pred) | Root mean squared error. |
| mae | mae(y_true, y_pred) | Mean absolute error. |
| bias | bias(y_true, y_pred) | Mean residual actual - prediction. |
| medae | medae(y_true, y_pred) | Median absolute error. |
| mape | mape(y_true, y_pred, *, eps=1e-10) | Mean absolute percentage error on the 0-100 scale. |
| smape | smape(y_true, y_pred, *, eps=1e-10) | Symmetric MAPE on the 0-100 scale. |
| theil_u1 | theil_u1(y_true, y_pred) | Theil U1 inequality coefficient. |
| theil_u2 | theil_u2(y_true, y_pred, y_prev) | Theil U2 relative to a no-change forecast. |
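
The shared pattern of dropping incomplete pairs and then reducing to a single float can be sketched as follows. The sketch pairs values by position, whereas the library aligns pandas Series by index; all function names are hypothetical.

```python
import math
from statistics import median


def _paired(y_true, y_pred):
    """Keep only pairs where both values are present and not NaN."""
    return [
        (y, f)
        for y, f in zip(y_true, y_pred)
        if y is not None and f is not None
        and not math.isnan(y) and not math.isnan(f)
    ]


def mse_sketch(y_true, y_pred):
    pairs = _paired(y_true, y_pred)
    return sum((y - f) ** 2 for y, f in pairs) / len(pairs)


def rmse_sketch(y_true, y_pred):
    return math.sqrt(mse_sketch(y_true, y_pred))


def mae_sketch(y_true, y_pred):
    pairs = _paired(y_true, y_pred)
    return sum(abs(y - f) for y, f in pairs) / len(pairs)


def bias_sketch(y_true, y_pred):
    # Mean residual, actual minus prediction.
    pairs = _paired(y_true, y_pred)
    return sum(y - f for y, f in pairs) / len(pairs)


def medae_sketch(y_true, y_pred):
    return median(abs(y - f) for y, f in _paired(y_true, y_pred))


y_true = [1.0, 2.0, None, 4.0]
y_pred = [1.5, float("nan"), 3.0, 2.0]
print(mse_sketch(y_true, y_pred), mae_sketch(y_true, y_pred), bias_sketch(y_true, y_pred))
# 2.125 1.25 0.75
```

Only the first and last rows survive the pairing step, so every metric in the example is computed on two observations.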

Benchmark-Relative Metrics#

These functions require realized values, candidate forecasts, and benchmark forecasts aligned on the same index.

The direct functions and evaluate_forecasts(...) require candidate and benchmark support to match exactly. They do not silently score only the intersection of two forecast histories. Forecast-table evaluation also checks that candidate and benchmark rows carry the same realized value for each support point.

| Function | Signature | Interpretation |
| --- | --- | --- |
| relative_mse | relative_mse(y_true, y_model, y_benchmark) | Candidate MSE divided by benchmark MSE. Below 1 favors candidate. |
| relative_mae | relative_mae(y_true, y_model, y_benchmark) | Candidate MAE divided by benchmark MAE. Below 1 favors candidate. |
| mse_reduction | mse_reduction(y_true, y_model, y_benchmark) | Benchmark MSE minus candidate MSE. Positive favors candidate. |
| r2_oos | r2_oos(y_true, y_model, y_benchmark) | Out-of-sample R^2 = 1 - relative_mse. |
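
The relationships among these metrics are easy to verify on a toy example. The sketch below assumes inputs are already aligned and complete; the function names are hypothetical stand-ins, not the library functions.

```python
def _mse(y_true, y_pred):
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true)


def relative_mse_sketch(y_true, y_model, y_benchmark):
    return _mse(y_true, y_model) / _mse(y_true, y_benchmark)


def mse_reduction_sketch(y_true, y_model, y_benchmark):
    return _mse(y_true, y_benchmark) - _mse(y_true, y_model)


def r2_oos_sketch(y_true, y_model, y_benchmark):
    return 1.0 - relative_mse_sketch(y_true, y_model, y_benchmark)


y_true = [1.0, 2.0, 3.0, 4.0]
y_model = [2.0, 3.0, 4.0, 5.0]       # always off by 1, so MSE = 1
y_benchmark = [-1.0, 0.0, 1.0, 2.0]  # always off by 2, so MSE = 4
print(relative_mse_sketch(y_true, y_model, y_benchmark))   # 0.25
print(mse_reduction_sketch(y_true, y_model, y_benchmark))  # 3.0
print(r2_oos_sketch(y_true, y_model, y_benchmark))         # 0.75
```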

Density, Interval, And Volatility Metrics#

| Function | Signature | Output |
| --- | --- | --- |
| pinball_loss | pinball_loss(y_true, y_quantile, *, quantile) | Mean quantile pinball loss. |
| gaussian_nll | gaussian_nll(y_true, y_pred, variance) | Gaussian negative log likelihood. |
| negative_log_score | negative_log_score(y_true, y_pred, variance) | Gaussian negative log score. |
| log_score | log_score(y_true, y_pred, variance) | Backward-compatible alias for negative_log_score; lower is better. |
| crps | crps(y_true, y_pred, variance) | Gaussian continuous ranked probability score. |
| qlike | qlike(y_true, variance, *, eps=1e-12) | QLIKE volatility loss using realized variance or squared realization. |
| coverage_rate | coverage_rate(y_true, lower, upper) | Share of observations inside the interval. |
| interval_width | interval_width(lower, upper) | Mean interval width. |
| interval_score | interval_score(y_true, lower, upper, *, alpha=0.05) | Winkler interval score. |

evaluate_forecasts(...) uses variance_prediction for gaussian_nll, negative_log_score, log_score, and crps. qlike should be evaluated against realized variance or squared realization; pass volatility_actual when that column differs from actual. For pinball and interval metrics, evaluate_forecasts(...) reads the quantile_predictions dictionaries.

Variance inputs must be finite and strictly positive. QLIKE realized variance must be finite and nonnegative, while the forecast variance must be strictly positive. Interval metrics require upper >= lower for every evaluated row. Quantile levels must be strictly inside (0, 1), and quantile predictions must be finite.
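
The interval rules can be illustrated with a sketch of coverage and the Winkler score. It uses the standard Winkler form (width plus a 2/alpha penalty for misses) and counts boundary values as covered; both choices are assumptions about the library, and the function names are hypothetical.

```python
def coverage_rate_sketch(y_true, lower, upper):
    inside = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(inside) / len(inside)


def interval_score_sketch(y_true, lower, upper, alpha=0.05):
    scores = []
    for y, lo, hi in zip(y_true, lower, upper):
        if hi < lo:
            # Mirror the strict rule: crossed bounds are an error, not clipped.
            raise ValueError("upper must be >= lower for every row")
        score = hi - lo                        # interval width
        if y < lo:
            score += (2.0 / alpha) * (lo - y)  # penalty for missing below
        elif y > hi:
            score += (2.0 / alpha) * (y - hi)  # penalty for missing above
        scores.append(score)
    return sum(scores) / len(scores)


y_true = [1.0, 3.0]
lower = [0.0, 0.0]
upper = [2.0, 2.0]
print(coverage_rate_sketch(y_true, lower, upper))               # 0.5
print(interval_score_sketch(y_true, lower, upper, alpha=0.5))   # 4.0
```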

Direction Metrics#

| Function | Signature | Output |
| --- | --- | --- |
| success_ratio | success_ratio(y_true, y_pred, y_prev) | Directional hit rate relative to the previous realized value. |
| pesaran_timmermann_metric | pesaran_timmermann_metric(y_true, y_pred, *, threshold=0.0) | Pesaran-Timmermann directional accuracy statistic. |
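
A directional hit rate relative to the previous realized value can be sketched as below. Treating a zero change as its own direction, which must be matched exactly to count as a hit, is an assumption; the library's handling of ties is not stated here, and success_ratio_sketch is a hypothetical name.

```python
def _sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)


def success_ratio_sketch(y_true, y_pred, y_prev):
    hits = [
        _sign(y - prev) == _sign(f - prev)  # realized and predicted change agree in direction
        for y, f, prev in zip(y_true, y_pred, y_prev)
    ]
    return sum(hits) / len(hits)


y_prev = [1.0, 2.0, 2.0, 3.0]
y_true = [2.0, 1.0, 3.0, 3.0]
y_pred = [1.5, 2.0, 4.0, 2.0]
print(success_ratio_sketch(y_true, y_pred, y_prev))  # 0.5
```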

  • mase – Mean Absolute Scaled Error (Hyndman-Koehler), out-of-sample MAE scaled by the in-sample (seasonal-)naive MAE.

  • seasonal_naive_mae – in-sample (seasonal-)naive MAE mean(|y[t]-y[t-m]|), the MASE scaling denominator.

  • acf1 – lag-1 autocorrelation (e.g. of forecast residuals), the ACF1 reported by forecast::accuracy.
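
The three helpers listed above follow standard definitions, sketched here in plain Python. The argument names y_train and m and the function names are hypothetical, since the signatures are not given in this section; the ACF1 form is the usual sample autocorrelation with a common mean.

```python
def seasonal_naive_mae_sketch(y_train, m=1):
    # In-sample MAE of the (seasonal-)naive forecast y[t-m]; m=1 is the plain naive forecast.
    diffs = [abs(y_train[t] - y_train[t - m]) for t in range(m, len(y_train))]
    return sum(diffs) / len(diffs)


def mase_sketch(y_true, y_pred, y_train, m=1):
    # Out-of-sample MAE scaled by the in-sample naive MAE.
    mae = sum(abs(y - f) for y, f in zip(y_true, y_pred)) / len(y_true)
    return mae / seasonal_naive_mae_sketch(y_train, m)


def acf1_sketch(x):
    # Lag-1 sample autocorrelation.
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    numerator = sum(centered[t] * centered[t - 1] for t in range(1, n))
    denominator = sum(c * c for c in centered)
    return numerator / denominator


y_train = [1.0, 3.0, 2.0, 5.0]
print(seasonal_naive_mae_sketch(y_train))                 # 2.0
print(mase_sketch([4.0, 6.0], [5.0, 4.0], y_train))       # 0.75
print(acf1_sketch([1.0, 2.0, 3.0, 4.0]))                  # 0.25
```

A MASE below 1 means the forecast beats the in-sample naive benchmark on average absolute error.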