Macroeconomic Data Transformations Matter#

This page records the current replication setting for Goulet Coulombe, Leroux, Stevanovic, and Surprenant (2021), “Macroeconomic data transformations matter,” International Journal of Forecasting, 37(4), 1338-1354.

Sources checked for this page:

Replication Status#

Item

Status

Consequence

Main-text design

available

The sample, target list, horizon grid, transformation grid, learner list, and evaluation objects can be fixed.

Appendix model algorithms

available

Elastic Net, Adaptive Lasso, Random Forest, Boosted Trees, Linear Boosting, GA search, and Bayesian search can be mapped to package settings.

Full official replication code

not found in the checked public/local materials

Table-identical replication is not claimed.

Exact FRED-MD vintage

not stated in the checked public/local materials

Use a historical FRED-MD vintage just after the 2017M12 sample end; verify 2018-01, then compare 2018-02/2018-03 if Dec. 2017 coverage differs.

MATLAB default optimizer behavior

partially stated

We can match the high-level optimizer design, but MATLAB default internal randomness/surrogate settings are not exactly portable.

MARX reference code

available as a small R snippet

macroforecast.feature_engineering.feature_matrix(..., specification="MARX") implements the same moving-average lag rotation logic.

macroforecast==0.9.5a1 diagnostic run

invalid as replication evidence

The runner and diagnostic harness used row < origin for h-step targets, which leaks labels whose realization date is after the forecast origin. The diagnostic also used average_change on an already transformed FRED-MD target, creating a double-difference target. A later audit also found that the first Table 2 script omitted the paper’s MARX_y and MAF_y target-derived blocks. Keep those runs as investigation logs only.

Replication level:

reconstructed_design, not exact_table_replication.

Package feasibility:

all Table 2 design axes can be expressed with public macroforecast callables: official FRED-MD vintage loading, McCracken-Ng t-code preprocessing, direct-average and path-average targets, all 16 feature matrices, AR/FM/AL/EN/LB/RF/BT learner families, RMSE/relative-RMSE scoring, and DM/MCS forecast-comparison tests.

Exact-replication blocker:

the paper and appendix do not provide a full machine-readable replication package or exact FRED-MD vintage in the checked materials.

Vintage policy:

do not use current.csv for this paper. The official FRED-MD page provides historical vintage zip files. Because the pseudo-out-of-sample period ends in 2017M12, the first replication candidate is 2018-01.csv extracted from the historical FRED-MD vintage zip; use 2018-02.csv or 2018-03.csv only if the January vintage does not contain the needed December 2017 realized values.

Paper Design#

Axis

Paper setting

macroforecast setting

Data

monthly FRED-MD after McCracken-Ng transformations

mf.data.load_fred_md(...) followed by mf.preprocessing.reprocess(transform="official")

Initial estimation sample

starts at 1960M01

mf.window.from_cutoffs(estimation_start="1960-01", ...)

Pseudo-out-of-sample period

1980M01 to 2017M12

test start 1980-01, test end 2017-12

Window type

expanding

expanding train/validation/test policy

Targets

INDPRO, EMP, UNRATE, INCOME, CONS, RETAIL, HOUST, M2, CPI, PPI

pass one target at a time to mf.forecasting.run(...); keep the list fixed for batch replication

Horizons

1, 3, 6, 9, 12, 24 months

horizons=[1, 3, 6, 9, 12, 24]

Target autoregressive lags

include lags of the target transformation; appendix setting P_y=12

target_lags=range(0, 13) in row-date convention, i.e. current transformed target plus 12 past lags

Target construction

direct average growth/difference and path-average forecasts from one-period transformed targets

after official t-code preprocessing, use target_transform="average_value" for direct average and target_transform="value" for path average

Loss

RMSE

mf.metrics.rmse(...) or RMSE from forecast output

Reference benchmark

factor model, FM

package mapping: fixed or BIC-selected mf.models.far / OLS factor model

Accuracy tests

DM test and MCS

mf.tests.dm_test(...), mf.tests.model_confidence_set(...)

Target Variables#

The paper uses ten monthly targets:

| Paper label | Meaning in the article | FRED-MD column in 2018-01 vintage | Package alias | | — | — | — | | INDPRO | Industrial production index | INDPRO | INDPRO | | EMP | Total nonfarm employment | PAYEMS | EMP | | UNRATE | Unemployment rate | UNRATE | UNRATE | | INCOME | Real personal income excluding current transfers | W875RX1 | INCOME | | CONS | Real personal consumption expenditures | DPCERA3M086SBEA | CONS | | RETAIL | Retail and food services sales | RETAILx | RETAIL | | HOUST | Housing starts | HOUST | HOUST | | M2 | M2 money stock | M2SL | M2 | | CPI | Consumer price index | CPIAUCSL | CPI | | PPI | Producer price index | PPICMM | PPI |

If the loaded FRED-MD vintage uses different column labels, create a target alias map before running the grid. Do not silently substitute a nearby series.

Feature Matrices#

The main text fixes sixteen feature-matrix cases. Lags of the month-to-month change or log-change of the target are always included. Cases with MARX and MAF also include target-derived blocks, written in the paper as MARX_y and MAF_y; these are not the same as the plain autoregressive target lags. Cases with Level include raw level variables and the target’s raw level Y_t.

Paper case

Content

Package expression

F

factors and factor lags

feature_matrix(X, specification="F", n_factors=8, lags=range(1, 13))

F-X

factors plus lagged transformed observables

specification="F-X"

F-MARX

factors plus predictor MARX and MARX_y

pca_step(...), factor lags, marx_step(name="MARX_X"), and marx_step(name="MARX_y", input="target_panel")

F-MAF

factors plus predictor MAF and MAF_y

pca_step(...), factor lags, maf_step(name="MAF_X"), and maf_step(name="MAF_y", input="target_panel")

F-Level

factors plus raw levels including Y_t

specification="F-Level" for inspection, or LEVEL__* columns with lag_step(lags=range(0, 1)) in runner mode

F-X-MARX

factors, lagged observables, MARX

specification="F-X-MARX"

F-X-MAF

factors, lagged observables, MAF

specification="F-X-MAF"

F-X-Level

factors, lagged observables, levels

specification="F-X-Level", with level_data=levels

F-X-MARX-Level

factors, lagged observables, MARX, levels

specification="F-X-MARX-Level", with level_data=levels

X

lagged transformed observables

specification="X"

MARX

predictor MARX plus MARX_y

marx_step(name="MARX_X") plus marx_step(name="MARX_y", input="target_panel")

MAF

predictor MAF plus MAF_y

maf_step(name="MAF_X") plus maf_step(name="MAF_y", input="target_panel")

X-MARX

lagged observables plus MARX

specification="X-MARX"

X-MAF

lagged observables plus MAF

specification="X-MAF"

X-Level

lagged observables plus levels

specification="X-Level", with level_data=levels

X-MARX-Level

lagged observables, MARX, levels

specification="X-MARX-Level", with level_data=levels

Default reconstruction settings:

feature_settings = {
    "lags": range(1, 13),
    "max_lag": 12,
    "n_factors": 8,
    "n_maf_components": 2,
    "fit_policy": "expanding",
    "include_current_factor": True,
    "scale_factors": True,
    "scale_marx": False,
    "scale_maf": False,
}

The MAF setting follows the article’s stated P_MAF = 12 and two MAFs per series. The MARX setting follows the author R snippet: for each variable and lag order l, replace the lag-l column by the mean of lags 1, ..., l.

Learner Settings#

Learner

Paper setting

Closest package setting

AR

autoregressive benchmark; lag order selected by BIC

mf.models.ar; add BIC lag search in the runner grid when matching the benchmark

FM

Stock-Watson-style factor model; BIC hyperparameter selection mentioned in main text

mf.models.far or factor features with mf.models.ols; for ML comparisons keep (P_y, P_f, k)=(12, 12, 8) where relevant

Adaptive Lasso

gamma 1; first-step ridge; ridge lambda by GA; 100 log-spaced lasso lambdas; random 5-fold squared-loss CV

mf.models.adaptive_lasso(gamma=1, initial="ridge"); use mf.model_selection.custom_search() for the paper lambda grid and ridge-first-step search. The supplied script uses a deterministic ridge-alpha grid approximation because the original MATLAB GA state is unavailable.

Elastic Net

100 lambda values; 100 alpha values in [0.01, 1]; random 5-fold squared-loss CV

mf.models.elastic_net(standardize=True) plus mf.model_selection.custom_search() over a data-dependent lambda grid and l1_ratio grid. standardize=True is model-local scaling inside each fit window, not a full-sample preprocessing step.

Linear Boosting

component-wise L2 boosting; m=min(200,#Z/3) sampled features; N in 1..500, eta in [0,1] via GA

mf.models.glmboost(candidate_sampling="random", candidate_fraction=1/3, candidate_cap=200, candidate_rounding="floor"); search n_iter and learning_rate with mf.model_selection.genetic_search().

Random Forest

200 trees; bootstrap sample; terminal node size greater than 5; #Z/3 split candidates; no CV

mf.models.random_forest(n_estimators=200, min_samples_leaf=5, max_features=max(1, p//3), bootstrap=True)

Boosted Trees

initial mean; depth 5 splits; #Z/3 split candidates; N in 1..500, eta in (0,1) by Bayesian optimization, 5-fold CV

mf.models.gradient_boosting; search n_estimators and learning_rate; use max_depth=5 or a leaf-count equivalent depending on the backend

Important approximation:

MATLAB tree defaults and Python/sklearn tree defaults are not identical. The package setting above matches the published algorithmic intent, not every backend implementation detail.

Paper Result Anchors#

These are not package test assertions yet. They are qualitative/structural anchors that a successful reconstructed run should make visible.

Anchor

Expected pattern from the article

MARX

material gains for real activity variables at short and medium horizons, especially with tree learners

MAF

weaker on average than MARX, but useful in selected longer-horizon cases

Factors

included in most best specifications

Level

useful for money and selected real/activity cases

Direct average vs path average

path average often helps cyclical real variables; direct average is stronger for selected nominal variables and horizons

RF vs BT

nonlinear tree methods dominate many best cases; RF appears more often than BT among best specifications

For exact comparison against Table 2 or Appendix B relative RMSE tables, freeze a FRED-MD vintage and record the backend versions before treating deviations as package bugs.

Notebook-Style Skeleton#

The current page fixes the setting. The full replication notebook should execute the same cells and then attach paper-table captures plus package outputs.

For the current package-only executed notebook report, see Macroeconomic Data Transformations Matter: Package-Only Notebook.

Cell 1: Load a Frozen FRED-MD Panel#

import macroforecast as mf

frozen_vintage = "2018-01"  # first candidate after the paper's 2017M12 sample end
bundle = mf.data.load_fred_md(vintage=frozen_vintage)

raw_panel = bundle.panel
raw_metadata = bundle.metadata

Expected output:

raw_panel: pandas.DataFrame indexed by monthly date
raw_metadata: metadata dictionary attached to the data bundle
raw_metadata["artifact"]["source_url"]: official historical archive plus member CSV

Cell 2: Build the Stationary Panel and Preserve Levels#

levels = raw_panel.copy()

X = mf.preprocessing.reprocess(
    raw_panel,
    transform="official",
    outliers="iqr",
    outlier_action="flag_as_nan",
    iqr_threshold=10.0,
    impute="em_factor",
    em_n_factors=8,
    frame="keep",
)

Expected output:

X: FRED-MD-style stationary pandas.DataFrame
levels: original level panel for Level feature specifications

Cell 3: Define the Paper Grid#

target_map = {
    "INDPRO": "INDPRO",
    "EMP": "PAYEMS",
    "UNRATE": "UNRATE",
    "INCOME": "W875RX1",
    "CONS": "DPCERA3M086SBEA",
    "RETAIL": "RETAILx",
    "HOUST": "HOUST",
    "M2": "M2SL",
    "CPI": "CPIAUCSL",
    "PPI": "PPICMM",
}
targets = list(target_map.values())
horizons = [1, 3, 6, 9, 12, 24]
feature_cases = [
    "F", "F-X", "F-MARX", "F-MAF", "F-Level",
    "F-X-MARX", "F-X-MAF", "F-X-Level", "F-X-MARX-Level",
    "X", "MARX", "MAF", "X-MARX", "X-MAF", "X-Level", "X-MARX-Level",
]
models = ["ar", "far", "adaptive_lasso", "elastic_net", "glmboost", "random_forest", "gradient_boosting"]
target_policies = ["direct_average", "path_average"]

Expected output:

10 targets x 6 horizons x 16 feature cases x 5 ML learners x 2 target policies,
plus AR and FM benchmarks.

Cell 3B: Encode Table 2 Best-Specification Targets#

The published Table 2 can be encoded as a target comparison dictionary. Each entry is (target_policy, learner, feature_case).

PAPER_TABLE2 = {
    ("INDPRO", 1): ("direct_average", "random_forest", "F-X-MARX-Level"),
    ("INDPRO", 3): ("direct_average", "random_forest", "MARX"),
    ("INDPRO", 6): ("path_average", "random_forest", "MARX"),
    ("INDPRO", 9): ("path_average", "random_forest", "MARX"),
    ("INDPRO", 12): ("path_average", "random_forest", "MARX"),
    ("INDPRO", 24): ("direct_average", "random_forest", "F-Level"),

    ("EMP", 1): ("direct_average", "random_forest", "F-X-MARX-Level"),
    ("EMP", 3): ("path_average", "random_forest", "F-MARX"),
    ("EMP", 6): ("path_average", "gradient_boosting", "F-MARX"),
    ("EMP", 9): ("path_average", "gradient_boosting", "F-MARX"),
    ("EMP", 12): ("path_average", "gradient_boosting", "F-MARX"),
    ("EMP", 24): ("path_average", "gradient_boosting", "MAF"),

    ("UNRATE", 1): ("direct_average", "gradient_boosting", "F-MARX"),
    ("UNRATE", 3): ("direct_average", "random_forest", "F-X-MARX-Level"),
    ("UNRATE", 6): ("path_average", "random_forest", "F-MARX"),
    ("UNRATE", 9): ("path_average", "glmboost", "F-X-MARX-Level"),
    ("UNRATE", 12): ("path_average", "glmboost", "F-X-MARX-Level"),
    ("UNRATE", 24): ("direct_average", "gradient_boosting", "F-MAF"),

    ("INCOME", 1): ("direct_average", "random_forest", "MARX"),
    ("INCOME", 3): ("direct_average", "random_forest", "F-MARX"),
    ("INCOME", 6): ("path_average", "random_forest", "F-X-MARX"),
    ("INCOME", 9): ("path_average", "random_forest", "F-MARX"),
    ("INCOME", 12): ("path_average", "random_forest", "F-MARX"),
    ("INCOME", 24): ("path_average", "random_forest", "F-X-MARX"),

    ("CONS", 1): ("direct_average", "far", "F"),
    ("CONS", 3): ("direct_average", "random_forest", "F-Level"),
    ("CONS", 6): ("path_average", "random_forest", "F-Level"),
    ("CONS", 9): ("direct_average", "random_forest", "MAF"),
    ("CONS", 12): ("path_average", "random_forest", "F-MAF"),
    ("CONS", 24): ("path_average", "random_forest", "F-MAF"),

    ("RETAIL", 1): ("direct_average", "far", "F"),
    ("RETAIL", 3): ("path_average", "gradient_boosting", "F-X-MARX"),
    ("RETAIL", 6): ("path_average", "adaptive_lasso", "F-MARX"),
    ("RETAIL", 9): ("direct_average", "gradient_boosting", "F-X-MARX-Level"),
    ("RETAIL", 12): ("direct_average", "gradient_boosting", "F-X-Level"),
    ("RETAIL", 24): ("direct_average", "gradient_boosting", "F-X-MAF"),

    ("HOUST", 1): ("direct_average", "elastic_net", "F-Level"),
    ("HOUST", 3): ("path_average", "elastic_net", "F-Level"),
    ("HOUST", 6): ("path_average", "random_forest", "F-X-MARX"),
    ("HOUST", 9): ("direct_average", "random_forest", "F-MAF"),
    ("HOUST", 12): ("direct_average", "random_forest", "F"),
    ("HOUST", 24): ("direct_average", "random_forest", "F"),

    ("M2", 1): ("direct_average", "random_forest", "X-Level"),
    ("M2", 3): ("path_average", "adaptive_lasso", "X-Level"),
    ("M2", 6): ("path_average", "random_forest", "F-Level"),
    ("M2", 9): ("direct_average", "random_forest", "F-Level"),
    ("M2", 12): ("direct_average", "gradient_boosting", "F-Level"),
    ("M2", 24): ("path_average", "random_forest", "F-Level"),

    ("CPI", 1): ("direct_average", "adaptive_lasso", "MARX"),
    ("CPI", 3): ("direct_average", "random_forest", "F"),
    ("CPI", 6): ("direct_average", "random_forest", "F"),
    ("CPI", 9): ("direct_average", "random_forest", "F"),
    ("CPI", 12): ("direct_average", "random_forest", "F"),
    ("CPI", 24): ("path_average", "random_forest", "X"),

    ("PPI", 1): ("direct_average", "elastic_net", "F-MARX"),
    ("PPI", 3): ("direct_average", "elastic_net", "MARX"),
    ("PPI", 6): ("direct_average", "random_forest", "F"),
    ("PPI", 9): ("direct_average", "random_forest", "F"),
    ("PPI", 12): ("direct_average", "random_forest", "F"),
    ("PPI", 24): ("direct_average", "gradient_boosting", "F-Level"),
}

assert set(PAPER_TABLE2) == {(target, horizon) for target in target_map for horizon in horizons}

The learner labels are package callables:

MODEL_CALLABLES = {
    "ar": mf.models.ar,
    "far": mf.models.far,
    "adaptive_lasso": mf.models.adaptive_lasso,
    "elastic_net": mf.models.elastic_net,
    "glmboost": mf.models.glmboost,
    "random_forest": mf.models.random_forest,
    "gradient_boosting": mf.models.gradient_boosting,
}

Cell 4A: Build the Table-1 Feature Matrix Directly#

Z = mf.feature_engineering.feature_matrix(
    X,
    specification="F-X-MARX",
    level_data=levels,
    lags=range(1, 13),
    max_lag=12,
    n_factors=8,
    n_maf_components=2,
    fit_policy="expanding",
)

Expected output:

Z: pandas.DataFrame matching the paper's F-X-MARX feature family.
Z.attrs["macroforecast_feature_metadata"]: feature lineage table.

This direct call is the clearest way to inspect Table 1. For strict walk-forward forecasting, let the runner fit feature state inside each expanding window with a FeatureSpec step pipeline.

Cell 4B: Runner-Safe Equivalent for One Feature Case#

features = mf.feature_engineering.feature_spec(
    target="INDPRO",
    horizon=3,
    predictors=[column for column in X.columns if column != "INDPRO"],
    lags=None,
    target_lags=range(0, 13),
    target_transform="average_value",
    target_mode="direct",
    steps=[
        mf.feature_engineering.pca_step(
            name="F_raw",
            n_components=8,
            scale=True,
            include=False,
        ),
        mf.feature_engineering.lag_step(
            name="F",
            input="F_raw",
            lags=range(0, 13),
            include=True,
        ),
        mf.feature_engineering.lag_step(
            name="X",
            input="panel",
            lags=range(1, 13),
            include=True,
        ),
        mf.feature_engineering.marx_step(
            name="MARX_X",
            input="panel",
            max_lag=12,
            include=True,
        ),
        mf.feature_engineering.marx_step(
            name="MARX_y",
            input="target_panel",
            columns=["INDPRO"],
            max_lag=12,
            include=True,
        ),
    ],
)

Expected output:

features: FeatureSpec fitted by forecasting.run() according to feature_policy.

FeatureSpec does not call feature_matrix() internally. It uses smaller step-level callables so PCA, MAF, scaling, and custom fitted transforms can be estimated only on the rows available to each forecast origin. The paper’s MARX_y/MAF_y target-derived blocks use input="target_panel"; this keeps target columns out of predictors while still allowing the target’s observed history to enter the feature matrix.

Cell 4C: Run One Reconstructed Design Point#

window = mf.window.from_cutoffs(
    estimation_start="1960-01",
    test_start="1980-01",
    test_end="2017-12",
    mode="expanding",
    val_method="random_kfold",
    val_n_splits=5,
    val_random_state=123,
    horizon=3,
    step=1,
)

result = mf.forecasting.run(
    X,
    model="random_forest",
    target="INDPRO",
    horizon=3,
    forecast_policy="direct_average",
    features=features,
    window=window,
    params={
        "n_estimators": 200,
        "min_samples_leaf": 5,
        "max_features": 0.3333333333333333,
        "bootstrap": True,
    },
)

Expected output:

result.forecasts: one row per forecast origin/date with prediction, actual, model metadata, and window metadata
result.metadata: run-level record of data, feature, model, selection, and window choices

The max_features value uses a fraction because sklearn accepts fractional feature subsampling. If an exact floor rule is required, resolve max(1, p // 3) after the feature matrix has been materialized.

For learner tuning, the paper reports randomly assigned 5-fold CV, not chronological blocked CV. mf.window.val_random_kfold(...) is therefore used only for this replication surface. The ordinary macro-forecasting default should remain time-aware validation such as blocked_kfold, poos, or expanding.

Cell 5: Run Table 2 Cells With Package Callables#

The following helper code runs the package-only Table 2 comparison path. It is written as plain Python so the same block can be copied into a notebook or run as a batch script.

import math
import numpy as np
import pandas as pd

def package_model_fit_predict(model_name, X_train, y_train, X_test):
    """Fit one Table 2 learner with macroforecast public callables only."""
    if model_name == "far":
        # FM benchmark as factor autoregression. For Table 2 cells whose feature
        # matrix is already F-only, mf.models.ols on the factor matrix is also a
        # valid fixed-k FM reconstruction.
        fit = mf.models.ols(X_train, y_train)
    elif model_name == "adaptive_lasso":
        fit = mf.models.adaptive_lasso(
            X_train,
            y_train,
            gamma=1.0,
            initial="ridge",
            initial_alpha=1.0,
            alpha=0.01,
            max_iter=20_000,
            random_state=123,
        )
    elif model_name == "elastic_net":
        fit = mf.models.elastic_net(
            X_train,
            y_train,
            alpha=0.01,
            l1_ratio=0.5,
            max_iter=20_000,
            standardize=True,
        )
    elif model_name == "glmboost":
        fit = mf.models.glmboost(
            X_train,
            y_train,
            n_iter=100,
            learning_rate=0.1,
            candidate_sampling="random",
            candidate_fraction=1 / 3,
            candidate_cap=200,
            candidate_rounding="floor",
            random_state=123,
        )
    elif model_name == "random_forest":
        p = X_train.shape[1]
        fit = mf.models.random_forest(
            X_train,
            y_train,
            n_estimators=200,
            min_samples_leaf=5,
            max_features=max(1, p // 3),
            bootstrap=True,
            random_state=123,
            n_jobs=1,
        )
    elif model_name == "gradient_boosting":
        p = X_train.shape[1]
        fit = mf.models.gradient_boosting(
            X_train,
            y_train,
            n_estimators=200,
            learning_rate=0.05,
            max_depth=5,
            max_features=max(1, p // 3),
            random_state=123,
        )
    else:
        raise ValueError(f"unsupported Table 2 learner: {model_name}")

    prediction = fit.predict(X_test)
    return float(prediction.iloc[0] if hasattr(prediction, "iloc") else np.asarray(prediction).reshape(-1)[0])


def direct_target_column(target_column, horizon, columns):
    name = f"{target_column}_average_value_h{horizon}"
    if name in columns:
        return name
    matches = [
        column for column in columns
        if column.startswith(target_column) and "average_value" in column and f"h{horizon}" in column
    ]
    if not matches:
        raise KeyError((target_column, horizon, "direct_average"))
    return matches[0]


def path_target_columns(target_column, horizon, columns):
    out = []
    for step in range(1, horizon + 1):
        name = f"{target_column}_value_step{step}"
        if name in columns:
            out.append(name)
        else:
            matches = [
                column for column in columns
                if column.startswith(target_column) and "value" in column and f"step{step}" in column
            ]
            if not matches:
                raise KeyError((target_column, horizon, step, "path_average"))
            out.append(matches[0])
    return out


def annual_december_origins(index, horizon):
    """Small Table 2 smoke grid. Use every month for the full paper run."""
    index = set(pd.DatetimeIndex(index))
    origins = []
    for year in range(1980, 2018):
        realized = pd.Timestamp(year=year, month=12, day=1)
        origin = realized - pd.DateOffset(months=horizon)
        if origin in index:
            origins.append(origin)
    return origins


def available_target_training_frame(frame, base_index, origin, target_step):
    """Keep only rows whose h-step or step-specific label is observable at origin."""
    base_index = pd.DatetimeIndex(base_index)
    origin_pos = int(base_index.get_indexer([origin])[0])
    if origin_pos < 0:
        raise KeyError(origin)
    row_pos = pd.Series(base_index.get_indexer(frame.index), index=frame.index)
    mask = (row_pos >= 0) & ((row_pos + int(target_step)) <= origin_pos)
    return frame.loc[mask]


def run_one_table2_cell(processed, levels, target_label, horizon, policy, model_name, feature_case):
    target_column = target_map[target_label]
    feature_set = mf.feature_engineering.build_features(
        processed,
        targets=[target_column],
        horizons=[horizon],
        predictors="all",
        target_lags=range(0, 13),
        feature_specification=feature_case,
        max_lag=12,
        n_factors=8,
        n_maf_components=2,
        feature_fit_policy="expanding",
        feature_min_train_size=240,
        level_data=levels,
        target_mode="path" if policy == "path_average" else "direct",
        target_transform="value" if policy == "path_average" else "average_value",
        drop_missing=False,
    )

    realized_direct = mf.feature_engineering.average_target(
        processed,
        targets=[target_column],
        horizons=[horizon],
        transform="value",
    )
    actual_col = direct_target_column(target_column, horizon, realized_direct.columns)

    rows = []
    for origin in annual_december_origins(feature_set.X.index, horizon):
        if origin not in realized_direct.index:
            continue
        actual = float(realized_direct.loc[origin, actual_col])
        if math.isnan(actual):
            continue

        if policy == "direct_average":
            y_col = direct_target_column(target_column, horizon, feature_set.y.columns)
            frame = pd.concat([feature_set.X, feature_set.y[[y_col]]], axis=1).dropna()
            if origin not in frame.index:
                continue
            train = available_target_training_frame(frame, feature_set.X.index, origin, horizon)
            if len(train) < 240:
                continue
            pred = package_model_fit_predict(
                model_name,
                train[feature_set.X.columns],
                train[y_col],
                frame.loc[[origin], feature_set.X.columns],
            )
        else:
            step_preds = []
            for step, y_col in enumerate(path_target_columns(target_column, horizon, feature_set.y.columns), start=1):
                frame = pd.concat([feature_set.X, feature_set.y[[y_col]]], axis=1).dropna()
                if origin not in frame.index:
                    continue
                train = available_target_training_frame(frame, feature_set.X.index, origin, step)
                if len(train) < 240:
                    continue
                step_preds.append(
                    package_model_fit_predict(
                        model_name,
                        train[feature_set.X.columns],
                        train[y_col],
                        frame.loc[[origin], feature_set.X.columns],
                    )
                )
            if len(step_preds) != horizon:
                continue
            pred = float(np.mean(step_preds))

        rows.append(
            {
                "target_label": target_label,
                "target_column": target_column,
                "horizon": horizon,
                "policy": policy,
                "model": model_name,
                "feature_case": feature_case,
                "origin": origin,
                "realized_date": origin + pd.DateOffset(months=horizon),
                "prediction": pred,
                "actual": actual,
                "error": actual - pred,
            }
        )
    return pd.DataFrame(rows)


table2_forecasts = []
for (target_label, horizon), (policy, model_name, feature_case) in PAPER_TABLE2.items():
    table2_forecasts.append(
        run_one_table2_cell(
            processed=X,
            levels=levels,
            target_label=target_label,
            horizon=horizon,
            policy=policy,
            model_name=model_name,
            feature_case=feature_case,
        )
    )

table2_forecasts = pd.concat(table2_forecasts, ignore_index=True)
table2_rmse = (
    table2_forecasts.assign(squared_error=lambda d: d["error"] ** 2)
    .groupby(["target_label", "horizon", "policy", "model", "feature_case"], as_index=False)
    .agg(n=("error", "size"), rmse=("squared_error", lambda x: float(np.sqrt(x.mean()))))
)

The annual-December origin loop is intentionally light. It verifies that every Table 2 cell is callable with the package and creates a first comparison table. For the full paper run, replace annual_december_origins() with every monthly origin from 1980-01 through 2017-12, and keep the same package calls.

Long-horizon boundary:

the package window uses origin-date cutoffs. With monthly step=1, h=24 forecasts overlap month by month, but a scored forecast at origin t still needs the realized target at t + 24. For a FRED-MD vintage ending in 2017-12, the last h=24 origin that can be scored is 2015-12. Tail blocks such as calendar year 2016 or 2017 should be skipped in a scored replication run unless a later vintage supplies the required realized target dates.

Scripted package run:

uv run python scripts/replication/gcls_2021_table2_batch.py \
  --out-root /path/to/table2_run \
  --vintage 2018-01 \
  --start-year 1980 \
  --end-year 2017 \
  --workers 8 \
  --tuning-mode paper-small

--tuning-mode off uses fixed paper-style defaults and is the fastest pipeline check. --tuning-mode paper-small uses the same callable path as the paper settings but with small lambda grids and short stochastic searches; use it for server smoke runs. --tuning-mode paper expands Elastic Net to 100 lambda values by 100 l1_ratio values, uses random 5-fold CV, and uses the configured Bayesian/GA search sizes for boosted trees and linear boosting. Because the original MATLAB GA/Bayesian optimizer state is not available, exact bit-for-bit optimizer replication is not claimed.

Cell 6: Evaluate Against the FM Benchmark#

rf_forecasts = result.forecasts
fm_forecasts = ...  # run the matching FM benchmark with the same window and target policy

relative = mf.metrics.relative_mse(
    y_true=rf_forecasts["actual"],
    y_model=rf_forecasts["prediction"],
    y_benchmark=fm_forecasts["prediction"],
)

dm = mf.tests.dm_test(
    (rf_forecasts["actual"] - rf_forecasts["prediction"]) ** 2,
    (fm_forecasts["actual"] - fm_forecasts["prediction"]) ** 2,
    horizon=3,
    input_type="loss",
)

Expected output:

relative_mse < 1 means the reconstructed model improves on FM under squared loss.
dm returns a statistic and p-value for equal predictive accuracy against the benchmark.

Observed Package Diagnostics#

This section records package runs already executed from an isolated PyPI environment. These runs are not exact Table-identical replications. They are diagnostic checks that the public package can execute the Table 2 cells and that the resulting patterns can be compared with the paper’s main anchors.

Corrected source smoke check:

after fixing target-label availability and pre-transformed target semantics, the local source checkout was run on mf.data.load_fred_md(vintage="2018-01") with official preprocessing, target INDPRO, horizon 3, and origin 2005-12-01. forecast_policy="direct_average", target_transform="value" normalized to average_value; forecast_policy="path_average", target_transform="value" kept value step targets. Both policies produced the same realized target, 0.0011626872172302665. The direct row recorded target_availability_end_pos=558 and target_availability_lag=3; the path row recorded step cutoffs {1: 560, 2: 559, 3: 558}. This is a smoke check of the corrected callable contract, not a Table 2 result.

Execution environment:

  • machine: server1

  • package: macroforecast==0.9.5a1 installed from PyPI in a fresh virtual environment

  • data: mf.data.load_fred_md(vintage="2018-01")

  • raw panel: 708 monthly rows x 127 columns, 1959-01 through 2017-12

  • preprocessing: official McCracken-Ng transformation codes with the package default FRED-MD preprocessing path

  • benchmark: matching FM cell with the same target, horizon, forecast policy, diagnostic origins, and realized target construction

Important caveat:

the main diagnostic below uses every monthly realized date from 1980-01 through 2017-12, but still uses capped lightweight model settings and full-sample feature fitting for factors, MARX, and MAF. It is closer to the paper’s OOS calendar than the sparse diagnostics. However, it was run with macroforecast==0.9.5a1, whose direct/path training cutoff did not enforce target realization availability for h > 1, and the diagnostic script built targets with average_change on an already McCracken-Ng transformed panel. It also used capped lightweight RF/BT settings and fixed hyperparameters instead of the appendix optimizer settings. Treat the numbers below as an investigation log, not replication evidence.

Run

Origins

Table 2 cells

Forecast rows

Share beats FM

Mean relative RMSE vs FM

Runtime

Role

Ultrafast sparse OOS-lite

1980, 1990, 2000, 2010, 2017

60

600

0.467

1.188

179.8s

smoke diagnostic

Capped sparse OOS-lite

1980, 1983, 1986, 1989, 1992, 1995, 1998, 2001, 2004, 2007, 2010, 2013, 2016, 2017

60

1680

0.367

1.148

333.3s

broader diagnostic

Monthly capped OOS-lite

every month from 1980-01 to 2017-12

60

54720

0.200

1.149

7461.1s

main diagnostic

Verdict from the diagnostic runs:

invalid as replication evidence. The runs are useful only because they exposed two package/harness errors to fix before any serious Table 2 comparison: label-availability leakage for h-step targets and a double-transformed target construction. Re-run the grid after the corrected runner and average_value/value target semantics are released.

Target-level summary for the invalid monthly capped diagnostic:

Target

Cells beating FM

Mean relative RMSE vs FM

CONS

0/6

1.324

CPI

0/6

1.317

EMP

2/6

1.029

HOUST

4/6

0.928

INCOME

0/6

1.284

INDPRO

0/6

1.331

M2

2/6

1.205

PPI

0/6

1.022

RETAIL

0/6

1.034

UNRATE

4/6

1.012

Horizon-level summary for the invalid monthly capped diagnostic:

Horizon

Cells beating FM

Mean relative RMSE vs FM

1

3/10

1.046

3

2/10

1.051

6

2/10

1.133

9

2/10

1.098

12

2/10

1.186

24

1/10

1.378

Full monthly capped diagnostic table:

Target

h

Target policy

Learner

Feature case

n

Relative RMSE vs FM

Beats FM?

CONS

1

direct

FM

F

456

1.000

no

CONS

3

direct

RF

F-Level

456

1.052

no

CONS

6

path

RF

F-Level

456

1.295

no

CONS

9

direct

RF

MAF

456

1.060

no

CONS

12

path

RF

F-MAF

456

1.493

no

CONS

24

path

RF

F-MAF

456

2.041

no

CPI

1

direct

AL

MARX

456

1.309

no

CPI

3

direct

RF

F

456

1.121

no

CPI

6

direct

RF

F

456

1.120

no

CPI

9

direct

RF

F

456

1.123

no

CPI

12

direct

RF

F

456

1.141

no

CPI

24

path

RF

X

456

2.086

no

EMP

1

direct

RF

F-X-MARX-Level

456

0.934

yes

EMP

3

path

RF

F-MARX

456

1.009

no

EMP

6

path

BT

F-MARX

456

0.933

yes

EMP

9

path

BT

F-MARX

456

1.027

no

EMP

12

path

BT

F-MARX

456

1.103

no

EMP

24

path

BT

MAF

456

1.171

no

HOUST

1

direct

EN

F-Level

456

1.064

no

HOUST

3

path

EN

F-Level

456

1.032

no

HOUST

6

path

RF

F-X-MARX

456

0.929

yes

HOUST

9

direct

RF

F-MAF

456

0.749

yes

HOUST

12

direct

RF

F

456

0.914

yes

HOUST

24

direct

RF

F

456

0.882

yes

INCOME

1

direct

RF

MARX

456

1.051

no

INCOME

3

direct

RF

F-MARX

456

1.047

no

INCOME

6

path

RF

F-X-MARX

456

1.163

no

INCOME

9

path

RF

F-MARX

456

1.295

no

INCOME

12

path

RF

F-MARX

456

1.445

no

INCOME

24

path

RF

F-X-MARX

456

1.703

no

INDPRO

1

direct

RF

F-X-MARX-Level

456

1.190

no

INDPRO

3

direct

RF

MARX

456

1.219

no

INDPRO

6

path

RF

MARX

456

1.337

no

INDPRO

9

path

RF

MARX

456

1.545

no

INDPRO

12

path

RF

MARX

456

1.658

no

INDPRO

24

direct

RF

F-Level

456

1.034

no

M2

1

direct

RF

X-Level

456

0.961

yes

M2

3

path

AL

X-Level

456

0.993

yes

M2

6

path

RF

F-Level

456

1.320

no

M2

9

direct

RF

F-Level

456

1.169

no

M2

12

direct

BT

F-Level

456

1.103

no

M2

24

path

RF

F-Level

456

1.686

no

PPI

1

direct

EN

F-MARX

456

1.008

no

PPI

3

direct

EN

MARX

456

1.000

no

PPI

6

direct

RF

F

456

1.002

no

PPI

9

direct

RF

F

456

1.012

no

PPI

12

direct

RF

F

456

1.021

no

PPI

24

direct

BT

F-Level

456

1.088

no

RETAIL

1

direct

FM

F

456

1.000

no

RETAIL

3

path

BT

F-X-MARX

456

1.044

no

RETAIL

6

path

AL

F-MARX

456

1.077

no

RETAIL

9

direct

BT

F-X-MARX-Level

456

1.020

no

RETAIL

12

direct

BT

F-X-Level

456

1.016

no

RETAIL

24

direct

BT

F-X-MAF

456

1.046

no

UNRATE

1

direct

BT

F-MARX

456

0.947

yes

UNRATE

3

direct

RF

F-X-MARX-Level

456

0.991

yes

UNRATE

6

path

RF

F-MARX

456

1.150

no

UNRATE

9

path

LB

F-X-MARX-Level

456

0.977

yes

UNRATE

12

path

LB

F-X-MARX-Level

456

0.965

yes

UNRATE

24

direct

BT

F-MAF

456

1.042

no

Why the diagnostics can differ from the paper:

  • Model caps: RF and BT use ten lightweight trees in the diagnostics, not the paper’s full tree counts and optimizer searches.

  • Feature fitting: diagnostic feature matrices use full-sample fitting to keep all early origins; the final replication should use fit-aware expanding feature state.

  • Hyperparameter search: the diagnostic does not run the appendix GA and Bayesian optimization loops.

  • Backend: the paper is MATLAB-based, while this run uses Python package backends. Tree defaults and optimizer randomness are not table-identical.

Gap Ledger#

Gap

Current handling

Exact FRED-MD vintage

start from 2018-01.csv extracted from the official historical vintage zip; compare nearby 2018 vintages if Dec. 2017 coverage differs

MATLAB GA/Bayesian optimizer defaults

use the article’s ranges, random 5-fold CV, and iteration counts where possible; record backend difference and unavailable MATLAB optimizer seed/state

BIC lag selection for AR/FM

add a small benchmark helper or run fixed (12, 12, 8) for the first reconstructed pass

Boosted-tree depth interpretation

document whether Python backend uses max_depth=5 or an exact 5-split/6-leaf equivalent

Linear Boosting candidate sampling

implemented with explicit glmboost candidate settings: candidate_sampling="random", candidate_fraction=1/3, candidate_cap=200, candidate_rounding="floor", which samples m=min(200, floor(#Z/3)) candidate predictors per boosting step; MATLAB GA optimizer state remains unavailable

Full monthly OOS calendar

monthly capped diagnostic now runs all realized months from 1980-01 through 2017-12; exact replication still needs uncapped learner settings, optimizer search, and fit-aware feature state

Fit-aware feature replication

current diagnostics use full-sample feature fitting; next replication pass should use expanding feature state inside each origin

Paper captures

add static captures of Table 1, Table 2, and selected Appendix B tables only after the settings page is accepted