# macroforecast.metrics

[Back to reference](index.md)

`macroforecast.metrics` owns forecast scoring only. It does not choose windows, fit models, run statistical comparison tests, or write artifacts.

Use the namespace form:

```python
import macroforecast as mf

mf.metrics.rmse(y_true, y_pred)
```

Top-level shortcuts such as `mf.rmse(...)` are intentionally not exported.

`MetricLike` is the public input type used by metric resolvers: a metric can be a registered metric name or a callable with the expected scoring signature.

## Risk-Return Forecast Evaluation

### Paper Citation And Scope

This section implements the evaluation framework from:

> Goulet Coulombe, Philippe. 2026. "Quantifying the Risk-Return Tradeoff in
> Forecasting." arXiv:2605.09712v1, submitted May 10, 2026.
> arXiv page: .

This is not a portfolio-construction module. `macroforecast` does not treat macroeconomic forecasts as traded assets here. The word "return" means a date-level **loss differential**:

```text
forecast_return_t(model | benchmark) = loss_t(benchmark) - loss_t(model)
```

Positive values mean the candidate model reduced forecast loss relative to the benchmark on that date. Negative values mean it underperformed.

The financial language is useful because it gives precise names for stability of gains: volatility, downside risk, upside/downside balance, and drawdown. The object being evaluated remains a macro forecast panel.

These functions live in `macroforecast.metrics` because they score forecasts. They do not explain fitted models, so they do not belong in `macroforecast.interpretation`. They also do not run hypothesis tests, so they do not belong in `macroforecast.tests`. Higher-level report integration can later call these functions from `macroforecast.evaluation`.

### Paper Motivation

Standard forecast evaluation usually asks whether a model has lower average loss than a benchmark. The risk-return view asks whether those gains are stable enough to trust. A model can have lower RMSE on average while generating large negative episodes in recessions, inflation spikes, post-COVID periods, or other macroeconomic regimes where forecast failures are costly.

The paper's primitive object is therefore not an aggregated RMSE table. It is a date-level sequence of benchmark-relative loss improvements. This is why the functions below operate on forecast panels and return paths rather than only on already-aggregated metric tables.

### compute_point_loss

```python
macroforecast.metrics.compute_point_loss(
    y_true,
    y_pred,
    *,
    loss="squared_error",
    variance=None,
    quantile=None,
    eps=1e-12,
) -> pandas.Series
```

Input: aligned realized values and forecasts.

Output: one observation-level loss per aligned row, where lower is better.

Supported losses:

| `loss` | Required inputs | Formula or meaning |
| --- | --- | --- |
| `"squared_error"`, `"mse"`, `"msfe"` | `y_true`, `y_pred` | `(y_true - y_pred)^2` at each date. |
| `"absolute_error"`, `"mae"` | `y_true`, `y_pred` | `abs(y_true - y_pred)` at each date. |
| `"pinball_loss"` | `y_true`, `y_pred`, `quantile` | Quantile loss for one requested quantile. |
| `"negative_log_score"`, `"gaussian_nll"`, `"log_score"` | `y_true`, `y_pred`, `variance` | Gaussian negative log score. |
| `"qlike"` | realized variance in `y_true`, forecast variance in `y_pred` | QLIKE volatility loss. |

### forecast_returns

```python
macroforecast.metrics.forecast_returns(
    forecasts,
    *,
    benchmark,
    group_cols=("target", "horizon"),
    loss="squared_error",
    model_col="model",
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    support_cols=None,
    include_benchmark=False,
    quantile=None,
) -> pandas.DataFrame
```

Input: a `ForecastResult`, forecast table, or pandas-like table with candidate and benchmark rows.

The benchmark must already exist in the same forecast panel. The function does not create a benchmark forecast.

Required columns:

| Column | Meaning |
| --- | --- |
| `model_col` | Candidate and benchmark model identifiers. Default: `model`. |
| `actual` | Realized value. Default: `actual`. |
| `prediction` | Point forecast. Default: `prediction`. |
| `date`, `origin`, or `origin_pos` | Support identity. At least one is required unless supplied through `support_cols`. |
| `group_cols` | Alignment groups such as `target` and `horizon`; all requested columns must exist. |

Output columns:

| Column | Meaning |
| --- | --- |
| `model_loss` | Candidate date-level loss. |
| `benchmark_loss` | Benchmark date-level loss on the same support row. |
| `forecast_return` | `benchmark_loss - model_loss`; positive favors candidate. |
| `return_sign` | `"positive"`, `"negative"`, or `"zero"`. |
| `cumulative_return` | Cumulative sum of `forecast_return` within model/benchmark/group/loss path. |
| `drawdown` | Cumulative return minus running peak. |
| `loss_name` | Canonical loss label. |
| `model_id`, `benchmark_id` | Stable model labels for downstream grouping. |

Validation is intentionally strict. Candidate and benchmark support must match exactly within every group, and realized values must match after alignment. This prevents a benchmark with a different window, horizon, target, or missing date pattern from being treated as a fair comparator.

```python
returns = mf.metrics.forecast_returns(
    forecast_result,
    benchmark="ar",
    group_cols=("target", "horizon"),
    loss="squared_error",
)
```

### sharpe_ratio

```python
macroforecast.metrics.sharpe_ratio(returns, *, hac_lags=None) -> float
```

Computes mean forecast return divided by return volatility. With `hac_lags=None`, the denominator is the ordinary sample standard deviation of the return sequence. With `hac_lags="auto"` or a nonnegative integer, the denominator is a Newey-West/Bartlett long-run standard deviation.

This is a path-stability score, not a trading Sharpe ratio.
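
To make the return path concrete, here is a minimal pandas sketch of the quantities described above: date-level loss differential, cumulative return, drawdown, and the plain (non-HAC) ratio. It is an illustration, not the library's implementation. The toy data are hypothetical, and using `ddof=1` for the "ordinary sample standard deviation" is an assumption.

```python
import pandas as pd

# Hypothetical toy panel: realized values plus candidate and benchmark forecasts.
actual = pd.Series([1.0, 2.0, 3.0, 4.0])
model = pd.Series([1.0, 2.5, 2.0, 4.0])
benchmark = pd.Series([2.0, 2.0, 3.0, 5.0])

# Date-level squared-error losses (lower is better).
model_loss = (actual - model) ** 2
benchmark_loss = (actual - benchmark) ** 2

# Forecast return: positive when the candidate beats the benchmark on that date.
forecast_return = benchmark_loss - model_loss

# Cumulative return and drawdown (cumulative return minus its running peak).
cumulative_return = forecast_return.cumsum()
drawdown = cumulative_return - cumulative_return.cummax()

# Plain Sharpe-style ratio: mean return over sample standard deviation.
# Assumption: ddof=1; the HAC variant is not sketched here.
sharpe = forecast_return.mean() / forecast_return.std(ddof=1)

print(forecast_return.tolist())
print(cumulative_return.tolist())
print(drawdown.tolist())
```

In this toy path the candidate wins on the first and last dates and loses on the two middle dates, so the cumulative return dips below its peak before recovering part of the gap.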
### sortino_ratio

```python
macroforecast.metrics.sortino_ratio(
    returns,
    *,
    target_return=0.0,
) -> float
```

Computes mean excess forecast return divided by downside semideviation:

```text
downside_t = min(return_t - target_return, 0)
```

If all nonzero returns are above the target, the denominator is zero and the ratio is `inf`. If numerator and denominator are both zero, the ratio is `nan`.

### omega_ratio

```python
macroforecast.metrics.omega_ratio(
    returns,
    *,
    threshold=0.0,
) -> float
```

Computes total upside divided by total downside around a threshold:

```text
omega = sum(max(return_t - threshold, 0)) / sum(max(threshold - return_t, 0))
```

`inf` means there is upside and no downside; `nan` means there is neither upside nor downside.

### drawdown_series and max_drawdown

```python
macroforecast.metrics.drawdown_series(returns) -> pandas.Series
macroforecast.metrics.max_drawdown(returns) -> float
```

Drawdown is computed from cumulative forecast returns:

```text
cumulative_t = sum_{s <= t} return_s
drawdown_t = cumulative_t - max_{s <= t}(cumulative_s)
```

For example, returns `[1, 1, -3, 1]` have cumulative returns `[1, 2, -1, 0]`, drawdowns `[0, 0, -3, -2]`, and maximum drawdown `-3`.

### risk_adjusted_forecast_metrics

```python
macroforecast.metrics.risk_adjusted_forecast_metrics(
    returns,
    *,
    group_cols=None,
    return_col="forecast_return",
    hac_lags="auto",
    target_return=0.0,
    omega_threshold=0.0,
) -> pandas.DataFrame
```

Input: the date-level output of `forecast_returns(...)`, or any DataFrame with a return column.

Output: one row per group with:

| Column | Meaning |
| --- | --- |
| `n_obs` | Number of finite return observations. |
| `mean_return` | Average benchmark-relative loss reduction. |
| `return_sd` | Sample standard deviation of returns. |
| `hac_return_sd` | HAC long-run standard deviation when requested. |
| `sharpe`, `hac_sharpe` | Mean return divided by ordinary or HAC volatility. |
| `sortino` | Downside-risk-adjusted return. |
| `omega` | Upside/downside ratio. |
| `max_drawdown` | Worst cumulative-return drawdown. |
| `final_cumulative_return` | Sum of returns over the evaluated path. |
| `win_rate` | Share of dates with positive forecast return. |

Default grouping uses available columns such as `model_id`, `benchmark_id`, `target`, `horizon`, `sample`, `regime`, and `loss_name`.

### edge_ratio

```python
macroforecast.metrics.edge_ratio(
    forecasts,
    *,
    group_cols=("target", "horizon"),
    loss="squared_error",
    model_col="model",
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    support_cols=None,
    quantile=None,
) -> pandas.DataFrame
```

Edge Ratio asks whether a model delivers unique gains relative to the model pool, not only relative to one benchmark. For each date and model:

```text
edge_t(model) = min_loss_t(all other models) - loss_t(model)
```

Therefore:

| Edge sign | Meaning |
| --- | --- |
| `edge > 0` | The model is strictly better than every alternative on that date. |
| `edge = 0` | The model ties the best alternative. |
| `edge < 0` | At least one alternative is better. |

Aggregated Edge Ratio is:

```text
edge_ratio = (sum(max(edge_t, 0)) / sum(max(-edge_t, 0))) * (number_of_models - 1)
```

If a model has positive edge wins and no edge regrets, the ratio is `inf`. If a model never has edge wins, the ratio is `0`. The result also carries the date-level edge path in `attrs["macroforecast_edge_path"]` for inspection.

## Forecast Table Helpers

### evaluate_forecasts

```python
macroforecast.metrics.evaluate_forecasts(
    forecasts,
    *,
    by=("model", "horizon"),
    metrics=("mse", "rmse", "mae"),
    actual="actual",
    prediction="prediction",
    variance_prediction="variance_prediction",
    volatility_actual=None,
    quantile_predictions="quantile_predictions",
    previous_actual="previous_actual",
    benchmark_model=None,
    model_column="model",
)
```

Input: a `ForecastResult`, forecast table, or pandas-like table with realized values and forecast columns.

Output: a pandas `DataFrame`, one row per `by` group. The result carries `attrs["macroforecast_metadata_schema"] = {"kind": "forecast_metrics", "version": 1, ...}`. The metadata schema also records `by`, `requested_metrics`, `benchmark_model`, `relative_support_columns`, input columns, and automatically added metric groups.

Validation: every requested `by` column must exist in the forecast table. `evaluate_forecasts()` fails loudly instead of dropping unavailable grouping dimensions. Relative metrics such as `relative_mse`, `relative_mae`, `mse_reduction`, and `r2_oos` require `benchmark_model`, and the benchmark must have matching rows for every scored non-benchmark group. The grouping must include `model_column` because relative metrics compare each candidate model against a named benchmark model.

`benchmark_model` does not create benchmark forecasts. It selects existing rows from the forecast table. For a fair comparison, generate the benchmark in the same forecasting run with the same window/origin/horizon/target contract, or append an external benchmark CSV only after validating that it has the same forecast-table schema and the same evaluation support. Relative metrics fail when candidate and benchmark supports differ.

Forecast-table relative metrics require at least one support identity column: `date`, `origin`, or `origin_pos`. For matching support rows, candidate and benchmark `actual` values must also match; otherwise the forecast table is treated as inconsistent.

Forecast-table behavior:

| Available input | Added scores |
| --- | --- |
| `actual`, `prediction` | Requested point metrics such as `mse`, `rmse`, `mae`, `bias`. |
| `benchmark_model` plus benchmark rows | Relative metrics such as `relative_mse`, `relative_mae`, `mse_reduction`, `r2_oos`. |
| `previous_actual` | `theil_u2` and `success_ratio`. |
| `variance_prediction` | `gaussian_nll`, `crps`, and requested `qlike`. |
| `volatility_actual` plus `variance_prediction` | `qlike` against an explicit realized-variance column. If omitted, `actual` is used. |
| `quantile_predictions` dictionaries | Pinball loss by quantile and interval coverage/width/score for matched lower-upper pairs. |

Malformed probabilistic inputs fail validation. Quantile forecasts must be per-row dictionaries mapping levels strictly inside `(0, 1)` to finite numeric predictions. Invalid variance, volatility, interval, or quantile values are not silently clipped or skipped.

Requested specialized metrics fail loudly when their required support columns are absent:

| Requested metric group | Required forecast-table column |
| --- | --- |
| `gaussian_nll`, `negative_log_score`, `log_score`, `crps` | `variance_prediction` |
| `qlike` | `variance_prediction`; use `volatility_actual` when realized variance is not in `actual` |
| `theil_u2`, `success_ratio` | `previous_actual` |
| `pinball_loss`, `coverage_rate`, `interval_width`, `interval_score` | `quantile_predictions` |

```python
scores = mf.metrics.evaluate_forecasts(
    result,
    metrics=("mse", "rmse", "relative_mse", "r2_oos"),
    benchmark_model="ols",
)
```

### rank_forecasts

```python
macroforecast.metrics.rank_forecasts(
    evaluation,
    *,
    metric="mse",
    by=("horizon",),
    ascending=None,
    rank_column="rank",
)
```

Input: an evaluation table from `evaluate_forecasts(...)` or an equivalent pandas table.

Output: the same rows with a rank column. If `ascending=None`, lower is better for recognized loss metrics and higher is better for recognized gain metrics such as `r2_oos`, `mse_reduction`, `success_ratio`, and `pesaran_timmermann_metric`. Every requested `by` column must exist in the evaluation table.

Signed `bias`, coverage metrics, and custom metrics require an explicit `ascending=True` or `ascending=False`. Coverage is intentionally not treated as automatically higher-is-better because interval coverage should usually be assessed against a nominal level, not maximized.
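
The ranking direction rule described above can be pictured with plain pandas. The sketch below is not the library's implementation: the toy evaluation table, the output column names `rank_mse` and `rank_r2_oos`, and the tie-handling choice `method="min"` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical evaluation table: one row per (model, horizon) group.
evaluation = pd.DataFrame(
    {
        "model": ["ar", "ridge", "ar", "ridge"],
        "horizon": [1, 1, 4, 4],
        "mse": [0.50, 0.40, 0.90, 1.10],
    }
)

# r2_oos = 1 - candidate MSE / benchmark MSE, with "ar" as the benchmark.
benchmark_mse = evaluation.loc[evaluation["model"] == "ar"].set_index("horizon")["mse"]
evaluation["r2_oos"] = 1.0 - evaluation["mse"] / evaluation["horizon"].map(benchmark_mse)

# Loss metric: lower is better, so rank ascending within each horizon.
evaluation["rank_mse"] = (
    evaluation.groupby("horizon")["mse"].rank(ascending=True, method="min").astype(int)
)

# Gain metric: higher is better, so rank descending within each horizon.
evaluation["rank_r2_oos"] = (
    evaluation.groupby("horizon")["r2_oos"].rank(ascending=False, method="min").astype(int)
)

print(evaluation)
```

Both rankings agree here because `r2_oos` is a monotone transform of MSE against a fixed benchmark; the two directions only need to be stated separately so that a gain metric is not ranked as if it were a loss.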
### get_metric

```python
macroforecast.metrics.get_metric(metric)
```

Input: a metric name or callable.

Output: the resolved callable. Name aliases include `msfe -> mse`, `validation_mse -> mse`, `validation_rmse -> rmse`, `mean_error -> bias`, and `negative_log_score -> gaussian_nll`.

Custom metrics do not need registration. Pass a callable directly anywhere a metric is accepted:

```python
import pandas as pd
import macroforecast as mf


def mean_bias(y_true, y_pred):
    # Mean of prediction minus actual (opposite sign to the built-in `bias`).
    return float(pd.Series(y_pred).sub(pd.Series(y_true)).mean())


scores = mf.metrics.evaluate_forecasts(
    forecasts,
    metrics=("mse", mean_bias),
)
```

The metric callable should accept `(y_true, y_pred)` and return one scalar `float`. In evaluation tables, the output column name is the callable's `__name__`, or `"callable_metric"` when no name is available. Metrics requiring benchmark forecasts, variances, intervals, or previous actuals need one of the specialized built-in metric names because `evaluate_forecasts()` must know which forecast-table columns to pass.

## Point Metrics

All point metrics align inputs as pandas Series, drop missing paired observations, and return a single `float`.

| Function | Signature | Output |
| --- | --- | --- |
| `mse` | `mse(y_true, y_pred)` | Mean squared error. |
| `rmse` | `rmse(y_true, y_pred)` | Root mean squared error. |
| `mae` | `mae(y_true, y_pred)` | Mean absolute error. |
| `bias` | `bias(y_true, y_pred)` | Mean residual `actual - prediction`. |
| `medae` | `medae(y_true, y_pred)` | Median absolute error. |
| `mape` | `mape(y_true, y_pred, *, eps=1e-10)` | Mean absolute percentage error on the 0-100 scale. |
| `smape` | `smape(y_true, y_pred, *, eps=1e-10)` | Symmetric MAPE on the 0-100 scale. |
| `theil_u1` | `theil_u1(y_true, y_pred)` | Theil U1 inequality coefficient. |
| `theil_u2` | `theil_u2(y_true, y_pred, y_prev)` | Theil U2 relative to a no-change forecast. |

## Benchmark-Relative Metrics

These functions require realized values, candidate forecasts, and benchmark forecasts aligned on the same index.
The direct functions and `evaluate_forecasts(...)` require candidate and benchmark support to match exactly. They do not silently score only the intersection of two forecast histories. Forecast-table evaluation also checks that candidate and benchmark rows carry the same realized value for each support point.

| Function | Signature | Interpretation |
| --- | --- | --- |
| `relative_mse` | `relative_mse(y_true, y_model, y_benchmark)` | Candidate MSE divided by benchmark MSE. Below 1 favors candidate. |
| `relative_mae` | `relative_mae(y_true, y_model, y_benchmark)` | Candidate MAE divided by benchmark MAE. Below 1 favors candidate. |
| `mse_reduction` | `mse_reduction(y_true, y_model, y_benchmark)` | Benchmark MSE minus candidate MSE. Positive favors candidate. |
| `r2_oos` | `r2_oos(y_true, y_model, y_benchmark)` | Out-of-sample `R^2 = 1 - relative_mse`. |

## Density, Interval, And Volatility Metrics

| Function | Signature | Output |
| --- | --- | --- |
| `pinball_loss` | `pinball_loss(y_true, y_quantile, *, quantile)` | Mean quantile pinball loss. |
| `gaussian_nll` | `gaussian_nll(y_true, y_pred, variance)` | Gaussian negative log likelihood. |
| `negative_log_score` | `negative_log_score(y_true, y_pred, variance)` | Gaussian negative log score. |
| `log_score` | `log_score(y_true, y_pred, variance)` | Backward-compatible alias for `negative_log_score`; lower is better. |
| `crps` | `crps(y_true, y_pred, variance)` | Gaussian continuous ranked probability score. |
| `qlike` | `qlike(y_true, variance, *, eps=1e-12)` | QLIKE volatility loss using realized variance or squared realization. |
| `coverage_rate` | `coverage_rate(y_true, lower, upper)` | Share of observations inside the interval. |
| `interval_width` | `interval_width(lower, upper)` | Mean interval width. |
| `interval_score` | `interval_score(y_true, lower, upper, *, alpha=0.05)` | Winkler interval score. |

`evaluate_forecasts(...)` uses `variance_prediction` for `gaussian_nll`, `negative_log_score`, `log_score`, and `crps`. `qlike` should be evaluated against realized variance or squared realization; pass `volatility_actual` when that column differs from `actual`. `evaluate_forecasts(...)` uses `quantile_predictions` dictionaries for pinball and interval metrics.

Variance inputs must be finite and strictly positive. QLIKE realized variance must be finite and nonnegative, while the forecast variance must be strictly positive. Interval metrics require `upper >= lower` for every evaluated row. Quantile levels must be strictly inside `(0, 1)`, and quantile predictions must be finite.

## Direction Metrics

| Function | Signature | Output |
| --- | --- | --- |
| `success_ratio` | `success_ratio(y_true, y_pred, y_prev)` | Directional hit rate relative to the previous realized value. |
| `pesaran_timmermann_metric` | `pesaran_timmermann_metric(y_true, y_pred, *, threshold=0.0)` | Pesaran-Timmermann directional accuracy statistic. |

- `mase` -- Mean Absolute Scaled Error (Hyndman-Koehler), out-of-sample MAE scaled by the in-sample (seasonal-)naive MAE.
- `seasonal_naive_mae` -- in-sample (seasonal-)naive MAE `mean(|y[t]-y[t-m]|)`, the MASE scaling denominator.
- `acf1` -- lag-1 autocorrelation (e.g. of forecast residuals), the ACF1 reported by `forecast::accuracy`.
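
The three helpers in the list above are described only by formula, so a small NumPy sketch may help fix the definitions. The `sketch_*` function names, their argument names, and the seasonal period default `m=1` are hypothetical and are not the library's signatures; the lag-1 autocorrelation uses the usual sample estimator (lagged cross-products over the total sum of squares), which is assumed to match the ACF1 convention mentioned above.

```python
import numpy as np


def sketch_seasonal_naive_mae(y_train, m=1):
    """In-sample (seasonal-)naive MAE: mean(|y[t] - y[t-m]|)."""
    y = np.asarray(y_train, dtype=float)
    return float(np.mean(np.abs(y[m:] - y[:-m])))


def sketch_mase(y_true, y_pred, y_train, m=1):
    """Out-of-sample MAE scaled by the in-sample (seasonal-)naive MAE."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(errors)) / sketch_seasonal_naive_mae(y_train, m))


def sketch_acf1(x):
    """Lag-1 sample autocorrelation of a sequence such as forecast residuals."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d**2))


# Toy inputs (hypothetical): training history, then two out-of-sample points.
y_train = [1.0, 2.0, 4.0, 7.0]
scale = sketch_seasonal_naive_mae(y_train)            # naive steps are 1, 2, 3
score = sketch_mase([8.0, 9.0], [7.0, 11.0], y_train)  # MAE 1.5 over scale 2.0
rho = sketch_acf1([1.0, 2.0, 3.0, 4.0])

print(scale, score, rho)
```

A MASE below 1 means the out-of-sample forecast errors are smaller on average than the in-sample naive steps, which is why the naive MAE is exposed separately as the scaling denominator.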