# Macroeconomic Data Transformations Matter

This page records the current replication setting for Goulet Coulombe, Leroux,
Stevanovic, and Surprenant (2021), "Macroeconomic data transformations matter,"
*International Journal of Forecasting*, 37(4), 1338-1354.

Sources checked for this page:

- IJF article DOI: <https://doi.org/10.1016/j.ijforecast.2021.05.005>
- arXiv working-paper page: <https://arxiv.org/abs/2008.01714>
- local main PDF:
  `/Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/papers/10.1016j.ijforecast.2021.05.005.pdf`
- local online appendix PDF:
  `/Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/papers/10.1016j.ijforecast.2021.05.005_appendix.pdf`
- author Code & Data page:
  <https://philippegouletcoulombe.com/code>
- author MARX note:
  <https://philippegouletcoulombe.com/blog/ml-based-time-series-modelling-with-marx-2>
- local MARX snippet:
  `/Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/paper_code/coulombe_site_github_20260530/marx/MARX_cheap_code.R`

## Replication Status

| Item | Status | Consequence |
| --- | --- | --- |
| Main-text design | available | The sample, target list, horizon grid, transformation grid, learner list, and evaluation objects can be fixed. |
| Appendix model algorithms | available | Elastic Net, Adaptive Lasso, Random Forest, Boosted Trees, Linear Boosting, GA search, and Bayesian search can be mapped to package settings. |
| Full official replication code | not found in the checked public/local materials | Table-identical replication is not claimed. |
| Exact FRED-MD vintage | not stated in the checked public/local materials | Use a historical FRED-MD vintage just after the `2017M12` sample end; verify `2018-01`, then compare `2018-02`/`2018-03` if Dec. 2017 coverage differs. |
| MATLAB default optimizer behavior | partially stated | We can match the high-level optimizer design, but MATLAB default internal randomness/surrogate settings are not exactly portable. |
| MARX reference code | available as a small R snippet | `macroforecast.feature_engineering.feature_matrix(..., specification="MARX")` implements the same moving-average lag rotation logic. |
| `macroforecast==0.9.5a1` diagnostic run | invalid as replication evidence | The runner and diagnostic harness used `row < origin` for h-step targets, which leaks labels whose realization date is after the forecast origin. The diagnostic also used `average_change` on an already transformed FRED-MD target, creating a double-difference target. A later audit also found that the first Table 2 script omitted the paper's `MARX_y` and `MAF_y` target-derived blocks. Keep those runs as investigation logs only. |

Replication level:
: `reconstructed_design`, not `exact_table_replication`.

Package feasibility:
: all Table 2 design axes can be expressed with public `macroforecast`
  callables: official FRED-MD vintage loading, McCracken-Ng t-code
  preprocessing, direct-average and path-average targets, all 16 feature
  matrices, AR/FM/AL/EN/LB/RF/BT learner families, RMSE/relative-RMSE scoring,
  and DM/MCS forecast-comparison tests.

Exact-replication blocker:
: the paper and appendix do not provide a full machine-readable replication
  package or exact FRED-MD vintage in the checked materials.

Vintage policy:
: do not use `current.csv` for this paper. The official FRED-MD page provides
  historical vintage zip files. Because the pseudo-out-of-sample period ends in
  `2017M12`, the first replication candidate is `2018-01.csv` extracted from
  the historical FRED-MD vintage zip; use `2018-02.csv` or `2018-03.csv` only if
  the January vintage does not contain the needed December 2017 realized values.

## Paper Design

| Axis | Paper setting | `macroforecast` setting |
| --- | --- | --- |
| Data | monthly FRED-MD after McCracken-Ng transformations | `mf.data.load_fred_md(...)` followed by `mf.preprocessing.reprocess(transform="official")` |
| Initial estimation sample | starts at `1960M01` | `mf.window.from_cutoffs(estimation_start="1960-01", ...)` |
| Pseudo-out-of-sample period | `1980M01` to `2017M12` | test start `1980-01`, test end `2017-12` |
| Window type | expanding | expanding train/validation/test policy |
| Targets | `INDPRO`, `EMP`, `UNRATE`, `INCOME`, `CONS`, `RETAIL`, `HOUST`, `M2`, `CPI`, `PPI` | pass one target at a time to `mf.forecasting.run(...)`; keep the list fixed for batch replication |
| Horizons | 1, 3, 6, 9, 12, 24 months | `horizons=[1, 3, 6, 9, 12, 24]` |
| Target autoregressive lags | include lags of the target transformation; appendix setting `P_y=12` | `target_lags=range(0, 13)` in row-date convention, i.e. current transformed target plus 12 past lags |
| Target construction | direct average growth/difference and path-average forecasts from one-period transformed targets | after official t-code preprocessing, use `target_transform="average_value"` for direct average and `target_transform="value"` for path average |
| Loss | RMSE | `mf.metrics.rmse(...)` or RMSE from forecast output |
| Reference benchmark | factor model, FM | package mapping: fixed or BIC-selected `mf.models.far` / OLS factor model |
| Accuracy tests | DM test and MCS | `mf.tests.dm_test(...)`, `mf.tests.model_confidence_set(...)` |

## Target Variables

The paper uses ten monthly targets:

| Paper label | Meaning in the article | FRED-MD column in `2018-01` vintage | Package alias |
| --- | --- | --- |
| `INDPRO` | Industrial production index | `INDPRO` | `INDPRO` |
| `EMP` | Total nonfarm employment | `PAYEMS` | `EMP` |
| `UNRATE` | Unemployment rate | `UNRATE` | `UNRATE` |
| `INCOME` | Real personal income excluding current transfers | `W875RX1` | `INCOME` |
| `CONS` | Real personal consumption expenditures | `DPCERA3M086SBEA` | `CONS` |
| `RETAIL` | Retail and food services sales | `RETAILx` | `RETAIL` |
| `HOUST` | Housing starts | `HOUST` | `HOUST` |
| `M2` | M2 money stock | `M2SL` | `M2` |
| `CPI` | Consumer price index | `CPIAUCSL` | `CPI` |
| `PPI` | Producer price index | `PPICMM` | `PPI` |

If the loaded FRED-MD vintage uses different column labels, create a target
alias map before running the grid. Do not silently substitute a nearby series.

## Feature Matrices

The main text fixes sixteen feature-matrix cases. Lags of the
month-to-month change or log-change of the target are always included. Cases
with `MARX` and `MAF` also include target-derived blocks, written in the paper
as `MARX_y` and `MAF_y`; these are not the same as the plain autoregressive
target lags. Cases with `Level` include raw level variables and the target's raw
level `Y_t`.

| Paper case | Content | Package expression |
| --- | --- | --- |
| `F` | factors and factor lags | `feature_matrix(X, specification="F", n_factors=8, lags=range(1, 13))` |
| `F-X` | factors plus lagged transformed observables | `specification="F-X"` |
| `F-MARX` | factors plus predictor MARX and `MARX_y` | `pca_step(...)`, factor lags, `marx_step(name="MARX_X")`, and `marx_step(name="MARX_y", input="target_panel")` |
| `F-MAF` | factors plus predictor MAF and `MAF_y` | `pca_step(...)`, factor lags, `maf_step(name="MAF_X")`, and `maf_step(name="MAF_y", input="target_panel")` |
| `F-Level` | factors plus raw levels including `Y_t` | `specification="F-Level"` for inspection, or `LEVEL__*` columns with `lag_step(lags=range(0, 1))` in runner mode |
| `F-X-MARX` | factors, lagged observables, MARX | `specification="F-X-MARX"` |
| `F-X-MAF` | factors, lagged observables, MAF | `specification="F-X-MAF"` |
| `F-X-Level` | factors, lagged observables, levels | `specification="F-X-Level"`, with `level_data=levels` |
| `F-X-MARX-Level` | factors, lagged observables, MARX, levels | `specification="F-X-MARX-Level"`, with `level_data=levels` |
| `X` | lagged transformed observables | `specification="X"` |
| `MARX` | predictor MARX plus `MARX_y` | `marx_step(name="MARX_X")` plus `marx_step(name="MARX_y", input="target_panel")` |
| `MAF` | predictor MAF plus `MAF_y` | `maf_step(name="MAF_X")` plus `maf_step(name="MAF_y", input="target_panel")` |
| `X-MARX` | lagged observables plus MARX | `specification="X-MARX"` |
| `X-MAF` | lagged observables plus MAF | `specification="X-MAF"` |
| `X-Level` | lagged observables plus levels | `specification="X-Level"`, with `level_data=levels` |
| `X-MARX-Level` | lagged observables, MARX, levels | `specification="X-MARX-Level"`, with `level_data=levels` |

Default reconstruction settings:

```python
feature_settings = {
    "lags": range(1, 13),
    "max_lag": 12,
    "n_factors": 8,
    "n_maf_components": 2,
    "fit_policy": "expanding",
    "include_current_factor": True,
    "scale_factors": True,
    "scale_marx": False,
    "scale_maf": False,
}
```

The MAF setting follows the article's stated `P_MAF = 12` and two MAFs per
series. The MARX setting follows the author R snippet: for each variable and lag
order `l`, replace the lag-`l` column by the mean of lags `1, ..., l`.

## Learner Settings

| Learner | Paper setting | Closest package setting |
| --- | --- | --- |
| AR | autoregressive benchmark; lag order selected by BIC | `mf.models.ar`; add BIC lag search in the runner grid when matching the benchmark |
| FM | Stock-Watson-style factor model; BIC hyperparameter selection mentioned in main text | `mf.models.far` or factor features with `mf.models.ols`; for ML comparisons keep `(P_y, P_f, k)=(12, 12, 8)` where relevant |
| Adaptive Lasso | gamma `1`; first-step ridge; ridge lambda by GA; 100 log-spaced lasso lambdas; random 5-fold squared-loss CV | `mf.models.adaptive_lasso(gamma=1, initial="ridge")`; use `mf.model_selection.custom_search()` for the paper lambda grid and ridge-first-step search. The supplied script uses a deterministic ridge-alpha grid approximation because the original MATLAB GA state is unavailable. |
| Elastic Net | 100 lambda values; 100 alpha values in `[0.01, 1]`; random 5-fold squared-loss CV | `mf.models.elastic_net(standardize=True)` plus `mf.model_selection.custom_search()` over a data-dependent lambda grid and `l1_ratio` grid. `standardize=True` is model-local scaling inside each fit window, not a full-sample preprocessing step. |
| Linear Boosting | component-wise L2 boosting; `m=min(200,#Z/3)` sampled features; `N in 1..500`, `eta in [0,1]` via GA | `mf.models.glmboost(candidate_sampling="random", candidate_fraction=1/3, candidate_cap=200, candidate_rounding="floor")`; search `n_iter` and `learning_rate` with `mf.model_selection.genetic_search()`. |
| Random Forest | 200 trees; bootstrap sample; terminal node size greater than 5; `#Z/3` split candidates; no CV | `mf.models.random_forest(n_estimators=200, min_samples_leaf=5, max_features=max(1, p//3), bootstrap=True)` |
| Boosted Trees | initial mean; depth 5 splits; `#Z/3` split candidates; `N in 1..500`, `eta in (0,1)` by Bayesian optimization, 5-fold CV | `mf.models.gradient_boosting`; search `n_estimators` and `learning_rate`; use `max_depth=5` or a leaf-count equivalent depending on the backend |

Important approximation:
: MATLAB tree defaults and Python/sklearn tree defaults are not identical. The
  package setting above matches the published algorithmic intent, not every
  backend implementation detail.

## Paper Result Anchors

These are not package test assertions yet. They are qualitative/structural
anchors that a successful reconstructed run should make visible.

| Anchor | Expected pattern from the article |
| --- | --- |
| MARX | material gains for real activity variables at short and medium horizons, especially with tree learners |
| MAF | weaker on average than MARX, but useful in selected longer-horizon cases |
| Factors | included in most best specifications |
| Level | useful for money and selected real/activity cases |
| Direct average vs path average | path average often helps cyclical real variables; direct average is stronger for selected nominal variables and horizons |
| RF vs BT | nonlinear tree methods dominate many best cases; RF appears more often than BT among best specifications |

For exact comparison against Table 2 or Appendix B relative RMSE tables, freeze a
FRED-MD vintage and record the backend versions before treating deviations as
package bugs.

## Notebook-Style Skeleton

The current page fixes the setting. The full replication notebook should execute
the same cells and then attach paper-table captures plus package outputs.

For the current package-only executed notebook report, see
[Macroeconomic Data Transformations Matter: Package-Only Notebook](macroeconomic_data_transformations_matter_notebook.md).

For the corrected executed INDPRO reconstructed result, including best-spec
RMSEs and same-support FM relative RMSEs, see
[GCLS 2021 INDPRO Reconstructed Replication Report](gcls_2021_indpro_reconstructed_report.md).

### Cell 1: Load a Frozen FRED-MD Panel

```python
import macroforecast as mf

frozen_vintage = "2018-01"  # first candidate after the paper's 2017M12 sample end
bundle = mf.data.load_fred_md(vintage=frozen_vintage)

raw_panel = bundle.panel
raw_metadata = bundle.metadata
```

Expected output:

```text
raw_panel: pandas.DataFrame indexed by monthly date
raw_metadata: metadata dictionary attached to the data bundle
raw_metadata["artifact"]["source_url"]: official historical archive plus member CSV
```

### Cell 2: Build the Stationary Panel and Preserve Levels

```python
levels = raw_panel.copy()

X = mf.preprocessing.reprocess(
    raw_panel,
    transform="official",
    outliers="iqr",
    outlier_action="flag_as_nan",
    iqr_threshold=10.0,
    impute="em_factor",
    em_n_factors=8,
    frame="keep",
)
```

Expected output:

```text
X: FRED-MD-style stationary pandas.DataFrame
levels: original level panel for Level feature specifications
```

### Cell 3: Define the Paper Grid

```python
target_map = {
    "INDPRO": "INDPRO",
    "EMP": "PAYEMS",
    "UNRATE": "UNRATE",
    "INCOME": "W875RX1",
    "CONS": "DPCERA3M086SBEA",
    "RETAIL": "RETAILx",
    "HOUST": "HOUST",
    "M2": "M2SL",
    "CPI": "CPIAUCSL",
    "PPI": "PPICMM",
}
targets = list(target_map.values())
horizons = [1, 3, 6, 9, 12, 24]
feature_cases = [
    "F", "F-X", "F-MARX", "F-MAF", "F-Level",
    "F-X-MARX", "F-X-MAF", "F-X-Level", "F-X-MARX-Level",
    "X", "MARX", "MAF", "X-MARX", "X-MAF", "X-Level", "X-MARX-Level",
]
models = ["ar", "far", "adaptive_lasso", "elastic_net", "glmboost", "random_forest", "gradient_boosting"]
target_policies = ["direct_average", "path_average"]
```

Expected output:

```text
10 targets x 6 horizons x 16 feature cases x 5 ML learners x 2 target policies,
plus AR and FM benchmarks.
```

### Cell 3B: Encode Table 2 Best-Specification Targets

The published Table 2 can be encoded as a target comparison dictionary. Each
entry is `(target_policy, learner, feature_case)`.

```python
PAPER_TABLE2 = {
    ("INDPRO", 1): ("direct_average", "random_forest", "F-X-MARX-Level"),
    ("INDPRO", 3): ("direct_average", "random_forest", "MARX"),
    ("INDPRO", 6): ("path_average", "random_forest", "MARX"),
    ("INDPRO", 9): ("path_average", "random_forest", "MARX"),
    ("INDPRO", 12): ("path_average", "random_forest", "MARX"),
    ("INDPRO", 24): ("direct_average", "random_forest", "F-Level"),

    ("EMP", 1): ("direct_average", "random_forest", "F-X-MARX-Level"),
    ("EMP", 3): ("path_average", "random_forest", "F-MARX"),
    ("EMP", 6): ("path_average", "gradient_boosting", "F-MARX"),
    ("EMP", 9): ("path_average", "gradient_boosting", "F-MARX"),
    ("EMP", 12): ("path_average", "gradient_boosting", "F-MARX"),
    ("EMP", 24): ("path_average", "gradient_boosting", "MAF"),

    ("UNRATE", 1): ("direct_average", "gradient_boosting", "F-MARX"),
    ("UNRATE", 3): ("direct_average", "random_forest", "F-X-MARX-Level"),
    ("UNRATE", 6): ("path_average", "random_forest", "F-MARX"),
    ("UNRATE", 9): ("path_average", "glmboost", "F-X-MARX-Level"),
    ("UNRATE", 12): ("path_average", "glmboost", "F-X-MARX-Level"),
    ("UNRATE", 24): ("direct_average", "gradient_boosting", "F-MAF"),

    ("INCOME", 1): ("direct_average", "random_forest", "MARX"),
    ("INCOME", 3): ("direct_average", "random_forest", "F-MARX"),
    ("INCOME", 6): ("path_average", "random_forest", "F-X-MARX"),
    ("INCOME", 9): ("path_average", "random_forest", "F-MARX"),
    ("INCOME", 12): ("path_average", "random_forest", "F-MARX"),
    ("INCOME", 24): ("path_average", "random_forest", "F-X-MARX"),

    ("CONS", 1): ("direct_average", "far", "F"),
    ("CONS", 3): ("direct_average", "random_forest", "F-Level"),
    ("CONS", 6): ("path_average", "random_forest", "F-Level"),
    ("CONS", 9): ("direct_average", "random_forest", "MAF"),
    ("CONS", 12): ("path_average", "random_forest", "F-MAF"),
    ("CONS", 24): ("path_average", "random_forest", "F-MAF"),

    ("RETAIL", 1): ("direct_average", "far", "F"),
    ("RETAIL", 3): ("path_average", "gradient_boosting", "F-X-MARX"),
    ("RETAIL", 6): ("path_average", "adaptive_lasso", "F-MARX"),
    ("RETAIL", 9): ("direct_average", "gradient_boosting", "F-X-MARX-Level"),
    ("RETAIL", 12): ("direct_average", "gradient_boosting", "F-X-Level"),
    ("RETAIL", 24): ("direct_average", "gradient_boosting", "F-X-MAF"),

    ("HOUST", 1): ("direct_average", "elastic_net", "F-Level"),
    ("HOUST", 3): ("path_average", "elastic_net", "F-Level"),
    ("HOUST", 6): ("path_average", "random_forest", "F-X-MARX"),
    ("HOUST", 9): ("direct_average", "random_forest", "F-MAF"),
    ("HOUST", 12): ("direct_average", "random_forest", "F"),
    ("HOUST", 24): ("direct_average", "random_forest", "F"),

    ("M2", 1): ("direct_average", "random_forest", "X-Level"),
    ("M2", 3): ("path_average", "adaptive_lasso", "X-Level"),
    ("M2", 6): ("path_average", "random_forest", "F-Level"),
    ("M2", 9): ("direct_average", "random_forest", "F-Level"),
    ("M2", 12): ("direct_average", "gradient_boosting", "F-Level"),
    ("M2", 24): ("path_average", "random_forest", "F-Level"),

    ("CPI", 1): ("direct_average", "adaptive_lasso", "MARX"),
    ("CPI", 3): ("direct_average", "random_forest", "F"),
    ("CPI", 6): ("direct_average", "random_forest", "F"),
    ("CPI", 9): ("direct_average", "random_forest", "F"),
    ("CPI", 12): ("direct_average", "random_forest", "F"),
    ("CPI", 24): ("path_average", "random_forest", "X"),

    ("PPI", 1): ("direct_average", "elastic_net", "F-MARX"),
    ("PPI", 3): ("direct_average", "elastic_net", "MARX"),
    ("PPI", 6): ("direct_average", "random_forest", "F"),
    ("PPI", 9): ("direct_average", "random_forest", "F"),
    ("PPI", 12): ("direct_average", "random_forest", "F"),
    ("PPI", 24): ("direct_average", "gradient_boosting", "F-Level"),
}

assert set(PAPER_TABLE2) == {(target, horizon) for target in target_map for horizon in horizons}
```

The learner labels are package callables:

```python
MODEL_CALLABLES = {
    "ar": mf.models.ar,
    "far": mf.models.far,
    "adaptive_lasso": mf.models.adaptive_lasso,
    "elastic_net": mf.models.elastic_net,
    "glmboost": mf.models.glmboost,
    "random_forest": mf.models.random_forest,
    "gradient_boosting": mf.models.gradient_boosting,
}
```

### Cell 4A: Build the Table-1 Feature Matrix Directly

```python
Z = mf.feature_engineering.feature_matrix(
    X,
    specification="F-X-MARX",
    level_data=levels,
    lags=range(1, 13),
    max_lag=12,
    n_factors=8,
    n_maf_components=2,
    fit_policy="expanding",
)
```

Expected output:

```text
Z: pandas.DataFrame matching the paper's F-X-MARX feature family.
Z.attrs["macroforecast_feature_metadata"]: feature lineage table.
```

This direct call is the clearest way to inspect Table 1. For strict
walk-forward forecasting, let the runner fit feature state inside each
expanding window with a `FeatureSpec` step pipeline.

### Cell 4B: Runner-Safe Equivalent for One Feature Case

```python
features = mf.feature_engineering.feature_spec(
    target="INDPRO",
    horizon=3,
    predictors=[column for column in X.columns if column != "INDPRO"],
    lags=None,
    target_lags=range(0, 13),
    target_transform="average_value",
    target_mode="direct",
    steps=[
        mf.feature_engineering.pca_step(
            name="F_raw",
            n_components=8,
            scale=True,
            include=False,
        ),
        mf.feature_engineering.lag_step(
            name="F",
            input="F_raw",
            lags=range(0, 13),
            include=True,
        ),
        mf.feature_engineering.lag_step(
            name="X",
            input="panel",
            lags=range(1, 13),
            include=True,
        ),
        mf.feature_engineering.marx_step(
            name="MARX_X",
            input="panel",
            max_lag=12,
            include=True,
        ),
        mf.feature_engineering.marx_step(
            name="MARX_y",
            input="target_panel",
            columns=["INDPRO"],
            max_lag=12,
            include=True,
        ),
    ],
)
```

Expected output:

```text
features: FeatureSpec fitted by forecasting.run() according to feature_policy.
```

`FeatureSpec` does not call `feature_matrix()` internally. It uses smaller
step-level callables so PCA, MAF, scaling, and custom fitted transforms can be
estimated only on the rows available to each forecast origin.
The paper's `MARX_y`/`MAF_y` target-derived blocks use
`input="target_panel"`; this keeps target columns out of `predictors` while
still allowing the target's observed history to enter the feature matrix.

### Cell 4C: Run One Reconstructed Design Point

```python
window = mf.window.from_cutoffs(
    estimation_start="1960-01",
    test_start="1980-01",
    test_end="2017-12",
    mode="expanding",
    val_method="random_kfold",
    val_n_splits=5,
    val_random_state=123,
    horizon=3,
    step=1,
)

result = mf.forecasting.run(
    X,
    model="random_forest",
    target="INDPRO",
    horizon=3,
    forecast_policy="direct_average",
    features=features,
    window=window,
    params={
        "n_estimators": 200,
        "min_samples_leaf": 5,
        "max_features": 0.3333333333333333,
        "bootstrap": True,
    },
)
```

Expected output:

```text
result.forecasts: one row per forecast origin/date with prediction, actual, model metadata, and window metadata
result.metadata: run-level record of data, feature, model, selection, and window choices
```

The `max_features` value uses a fraction because sklearn accepts fractional
feature subsampling. If an exact floor rule is required, resolve
`max(1, p // 3)` after the feature matrix has been materialized.

For learner tuning, the paper reports randomly assigned 5-fold CV, not
chronological blocked CV. `mf.window.val_random_kfold(...)` is therefore used
only for this replication surface. The ordinary macro-forecasting default should
remain time-aware validation such as `blocked_kfold`, `poos`, or `expanding`.

### Cell 5: Run Table 2 Cells With Package Callables

The following helper code runs the package-only Table 2 comparison path. It is
written as plain Python so the same block can be copied into a notebook or run
as a batch script.

```python
import math
import numpy as np
import pandas as pd

def package_model_fit_predict(model_name, X_train, y_train, X_test):
    """Fit one Table 2 learner with macroforecast public callables only."""
    if model_name == "far":
        # FM benchmark as factor autoregression. For Table 2 cells whose feature
        # matrix is already F-only, mf.models.ols on the factor matrix is also a
        # valid fixed-k FM reconstruction.
        fit = mf.models.ols(X_train, y_train)
    elif model_name == "adaptive_lasso":
        fit = mf.models.adaptive_lasso(
            X_train,
            y_train,
            gamma=1.0,
            initial="ridge",
            initial_alpha=1.0,
            alpha=0.01,
            max_iter=20_000,
            random_state=123,
        )
    elif model_name == "elastic_net":
        fit = mf.models.elastic_net(
            X_train,
            y_train,
            alpha=0.01,
            l1_ratio=0.5,
            max_iter=20_000,
            standardize=True,
        )
    elif model_name == "glmboost":
        fit = mf.models.glmboost(
            X_train,
            y_train,
            n_iter=100,
            learning_rate=0.1,
            candidate_sampling="random",
            candidate_fraction=1 / 3,
            candidate_cap=200,
            candidate_rounding="floor",
            random_state=123,
        )
    elif model_name == "random_forest":
        p = X_train.shape[1]
        fit = mf.models.random_forest(
            X_train,
            y_train,
            n_estimators=200,
            min_samples_leaf=5,
            max_features=max(1, p // 3),
            bootstrap=True,
            random_state=123,
            n_jobs=1,
        )
    elif model_name == "gradient_boosting":
        p = X_train.shape[1]
        fit = mf.models.gradient_boosting(
            X_train,
            y_train,
            n_estimators=200,
            learning_rate=0.05,
            max_depth=5,
            max_features=max(1, p // 3),
            random_state=123,
        )
    else:
        raise ValueError(f"unsupported Table 2 learner: {model_name}")

    prediction = fit.predict(X_test)
    return float(prediction.iloc[0] if hasattr(prediction, "iloc") else np.asarray(prediction).reshape(-1)[0])


def direct_target_column(target_column, horizon, columns):
    name = f"{target_column}_average_value_h{horizon}"
    if name in columns:
        return name
    matches = [
        column for column in columns
        if column.startswith(target_column) and "average_value" in column and f"h{horizon}" in column
    ]
    if not matches:
        raise KeyError((target_column, horizon, "direct_average"))
    return matches[0]


def path_target_columns(target_column, horizon, columns):
    out = []
    for step in range(1, horizon + 1):
        name = f"{target_column}_value_step{step}"
        if name in columns:
            out.append(name)
        else:
            matches = [
                column for column in columns
                if column.startswith(target_column) and "value" in column and f"step{step}" in column
            ]
            if not matches:
                raise KeyError((target_column, horizon, step, "path_average"))
            out.append(matches[0])
    return out


def annual_december_origins(index, horizon):
    """Small Table 2 smoke grid. Use every month for the full paper run."""
    index = set(pd.DatetimeIndex(index))
    origins = []
    for year in range(1980, 2018):
        realized = pd.Timestamp(year=year, month=12, day=1)
        origin = realized - pd.DateOffset(months=horizon)
        if origin in index:
            origins.append(origin)
    return origins


def available_target_training_frame(frame, base_index, origin, target_step):
    """Keep only rows whose h-step or step-specific label is observable at origin."""
    base_index = pd.DatetimeIndex(base_index)
    origin_pos = int(base_index.get_indexer([origin])[0])
    if origin_pos < 0:
        raise KeyError(origin)
    row_pos = pd.Series(base_index.get_indexer(frame.index), index=frame.index)
    mask = (row_pos >= 0) & ((row_pos + int(target_step)) <= origin_pos)
    return frame.loc[mask]


def run_one_table2_cell(processed, levels, target_label, horizon, policy, model_name, feature_case):
    target_column = target_map[target_label]
    feature_set = mf.feature_engineering.build_features(
        processed,
        targets=[target_column],
        horizons=[horizon],
        predictors="all",
        target_lags=range(0, 13),
        feature_specification=feature_case,
        max_lag=12,
        n_factors=8,
        n_maf_components=2,
        feature_fit_policy="expanding",
        feature_min_train_size=240,
        level_data=levels,
        target_mode="path" if policy == "path_average" else "direct",
        target_transform="value" if policy == "path_average" else "average_value",
        drop_missing=False,
    )

    realized_direct = mf.feature_engineering.average_target(
        processed,
        targets=[target_column],
        horizons=[horizon],
        transform="value",
    )
    actual_col = direct_target_column(target_column, horizon, realized_direct.columns)

    rows = []
    for origin in annual_december_origins(feature_set.X.index, horizon):
        if origin not in realized_direct.index:
            continue
        actual = float(realized_direct.loc[origin, actual_col])
        if math.isnan(actual):
            continue

        if policy == "direct_average":
            y_col = direct_target_column(target_column, horizon, feature_set.y.columns)
            frame = pd.concat([feature_set.X, feature_set.y[[y_col]]], axis=1).dropna()
            if origin not in frame.index:
                continue
            train = available_target_training_frame(frame, feature_set.X.index, origin, horizon)
            if len(train) < 240:
                continue
            pred = package_model_fit_predict(
                model_name,
                train[feature_set.X.columns],
                train[y_col],
                frame.loc[[origin], feature_set.X.columns],
            )
        else:
            step_preds = []
            for step, y_col in enumerate(path_target_columns(target_column, horizon, feature_set.y.columns), start=1):
                frame = pd.concat([feature_set.X, feature_set.y[[y_col]]], axis=1).dropna()
                if origin not in frame.index:
                    continue
                train = available_target_training_frame(frame, feature_set.X.index, origin, step)
                if len(train) < 240:
                    continue
                step_preds.append(
                    package_model_fit_predict(
                        model_name,
                        train[feature_set.X.columns],
                        train[y_col],
                        frame.loc[[origin], feature_set.X.columns],
                    )
                )
            if len(step_preds) != horizon:
                continue
            pred = float(np.mean(step_preds))

        rows.append(
            {
                "target_label": target_label,
                "target_column": target_column,
                "horizon": horizon,
                "policy": policy,
                "model": model_name,
                "feature_case": feature_case,
                "origin": origin,
                "realized_date": origin + pd.DateOffset(months=horizon),
                "prediction": pred,
                "actual": actual,
                "error": actual - pred,
            }
        )
    return pd.DataFrame(rows)


table2_forecasts = []
for (target_label, horizon), (policy, model_name, feature_case) in PAPER_TABLE2.items():
    table2_forecasts.append(
        run_one_table2_cell(
            processed=X,
            levels=levels,
            target_label=target_label,
            horizon=horizon,
            policy=policy,
            model_name=model_name,
            feature_case=feature_case,
        )
    )

table2_forecasts = pd.concat(table2_forecasts, ignore_index=True)
table2_rmse = (
    table2_forecasts.assign(squared_error=lambda d: d["error"] ** 2)
    .groupby(["target_label", "horizon", "policy", "model", "feature_case"], as_index=False)
    .agg(n=("error", "size"), rmse=("squared_error", lambda x: float(np.sqrt(x.mean()))))
)
```

The annual-December origin loop is intentionally light. It verifies that every
Table 2 cell is callable with the package and creates a first comparison table.
For the full paper run, replace `annual_december_origins()` with every monthly
origin from `1980-01` through `2017-12`, and keep the same package calls.

Long-horizon boundary:
: the package window uses origin-date cutoffs. With monthly `step=1`, h=24
  forecasts overlap month by month, but a scored forecast at origin `t` still
  needs the realized target at `t + 24`. For a FRED-MD vintage ending in
  `2017-12`, the last h=24 origin that can be scored is `2015-12`. Tail blocks
  such as calendar year `2016` or `2017` should be skipped in a scored
  replication run unless a later vintage supplies the required realized target
  dates.

Scripted package run:

```bash
uv run python scripts/replication/gcls_2021_table2_batch.py \
  --out-root /path/to/table2_run \
  --vintage 2018-01 \
  --start-year 1980 \
  --end-year 2017 \
  --workers 8 \
  --tuning-mode paper-small
```

`--tuning-mode off` uses fixed paper-style defaults and is the fastest pipeline
check. `--tuning-mode paper-small` uses the same callable path as the paper
settings but with small lambda grids and short stochastic searches; use it for
server smoke runs. `--tuning-mode paper` expands Elastic Net to 100 lambda
values by 100 `l1_ratio` values, uses random 5-fold CV, and uses the configured
Bayesian/GA search sizes for boosted trees and linear boosting. Because the
original MATLAB GA/Bayesian optimizer state is not available, exact
bit-for-bit optimizer replication is not claimed.

### Cell 6: Evaluate Against the FM Benchmark

```python
rf_forecasts = result.forecasts
fm_forecasts = ...  # run the matching FM benchmark with the same window and target policy

relative = mf.metrics.relative_mse(
    y_true=rf_forecasts["actual"],
    y_model=rf_forecasts["prediction"],
    y_benchmark=fm_forecasts["prediction"],
)

dm = mf.tests.dm_test(
    (rf_forecasts["actual"] - rf_forecasts["prediction"]) ** 2,
    (fm_forecasts["actual"] - fm_forecasts["prediction"]) ** 2,
    horizon=3,
    input_type="loss",
)
```

Expected output:

```text
relative_mse < 1 means the reconstructed model improves on FM under squared loss.
dm returns a statistic and p-value for equal predictive accuracy against the benchmark.
```

## Observed Package Diagnostics

This section records package runs already executed from an isolated PyPI
environment. These runs are not exact Table-identical replications. They are
diagnostic checks that the public package can execute the Table 2 cells and that
the resulting patterns can be compared with the paper's main anchors.

Corrected source smoke check:
: after fixing target-label availability and pre-transformed target semantics,
  the local source checkout was run on `mf.data.load_fred_md(vintage="2018-01")`
  with official preprocessing, target `INDPRO`, horizon `3`, and origin
  `2005-12-01`. `forecast_policy="direct_average", target_transform="value"`
  normalized to `average_value`; `forecast_policy="path_average",
  target_transform="value"` kept value step targets. Both policies produced the
  same realized target, `0.0011626872172302665`. The direct row recorded
  `target_availability_end_pos=558` and `target_availability_lag=3`; the path row
  recorded step cutoffs `{1: 560, 2: 559, 3: 558}`. This is a smoke check of the
  corrected callable contract, not a Table 2 result.

Execution environment:

- machine: server1
- package: `macroforecast==0.9.5a1` installed from PyPI in a fresh virtual
  environment
- data: `mf.data.load_fred_md(vintage="2018-01")`
- raw panel: 708 monthly rows x 127 columns, `1959-01` through `2017-12`
- preprocessing: official McCracken-Ng transformation codes with the package
  default FRED-MD preprocessing path
- benchmark: matching FM cell with the same target, horizon, forecast policy,
  diagnostic origins, and realized target construction

Important caveat:
: the main diagnostic below uses every monthly realized date from `1980-01`
  through `2017-12`, but still uses capped lightweight model settings and
  full-sample feature fitting for factors, MARX, and MAF. It is closer to the
  paper's OOS calendar than the sparse diagnostics. However, it was run with
  `macroforecast==0.9.5a1`, whose direct/path training cutoff did not enforce
  target realization availability for `h > 1`, and the diagnostic script built
  targets with `average_change` on an already McCracken-Ng transformed panel.
  It also used capped lightweight RF/BT settings and fixed hyperparameters
  instead of the appendix optimizer settings. Treat the numbers below as an
  investigation log, not replication evidence.

| Run | Origins | Table 2 cells | Forecast rows | Share beats FM | Mean relative RMSE vs FM | Runtime | Role |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- |
| Ultrafast sparse OOS-lite | 1980, 1990, 2000, 2010, 2017 | 60 | 600 | 0.467 | 1.188 | 179.8s | smoke diagnostic |
| Capped sparse OOS-lite | 1980, 1983, 1986, 1989, 1992, 1995, 1998, 2001, 2004, 2007, 2010, 2013, 2016, 2017 | 60 | 1680 | 0.367 | 1.148 | 333.3s | broader diagnostic |
| Monthly capped OOS-lite | every month from 1980-01 to 2017-12 | 60 | 54720 | 0.200 | 1.149 | 7461.1s | main diagnostic |

Verdict from the diagnostic runs:
: invalid as replication evidence. The runs are useful only because they exposed
  two package/harness errors to fix before any serious Table 2 comparison:
  label-availability leakage for h-step targets and a double-transformed target
  construction. Re-run the grid after the corrected runner and
  `average_value`/`value` target semantics are released.

Target-level summary for the invalid monthly capped diagnostic:

| Target | Cells beating FM | Mean relative RMSE vs FM |
| --- | ---: | ---: |
| CONS | 0/6 | 1.324 |
| CPI | 0/6 | 1.317 |
| EMP | 2/6 | 1.029 |
| HOUST | 4/6 | 0.928 |
| INCOME | 0/6 | 1.284 |
| INDPRO | 0/6 | 1.331 |
| M2 | 2/6 | 1.205 |
| PPI | 0/6 | 1.022 |
| RETAIL | 0/6 | 1.034 |
| UNRATE | 4/6 | 1.012 |

Horizon-level summary for the invalid monthly capped diagnostic:

| Horizon | Cells beating FM | Mean relative RMSE vs FM |
| ---: | ---: | ---: |
| 1 | 3/10 | 1.046 |
| 3 | 2/10 | 1.051 |
| 6 | 2/10 | 1.133 |
| 9 | 2/10 | 1.098 |
| 12 | 2/10 | 1.186 |
| 24 | 1/10 | 1.378 |

Full monthly capped diagnostic table:

| Target | h | Target policy | Learner | Feature case | n | Relative RMSE vs FM | Beats FM? |
| --- | ---: | --- | --- | --- | ---: | ---: | --- |
| CONS | 1 | direct | FM | F | 456 | 1.000 | no |
| CONS | 3 | direct | RF | F-Level | 456 | 1.052 | no |
| CONS | 6 | path | RF | F-Level | 456 | 1.295 | no |
| CONS | 9 | direct | RF | MAF | 456 | 1.060 | no |
| CONS | 12 | path | RF | F-MAF | 456 | 1.493 | no |
| CONS | 24 | path | RF | F-MAF | 456 | 2.041 | no |
| CPI | 1 | direct | AL | MARX | 456 | 1.309 | no |
| CPI | 3 | direct | RF | F | 456 | 1.121 | no |
| CPI | 6 | direct | RF | F | 456 | 1.120 | no |
| CPI | 9 | direct | RF | F | 456 | 1.123 | no |
| CPI | 12 | direct | RF | F | 456 | 1.141 | no |
| CPI | 24 | path | RF | X | 456 | 2.086 | no |
| EMP | 1 | direct | RF | F-X-MARX-Level | 456 | 0.934 | yes |
| EMP | 3 | path | RF | F-MARX | 456 | 1.009 | no |
| EMP | 6 | path | BT | F-MARX | 456 | 0.933 | yes |
| EMP | 9 | path | BT | F-MARX | 456 | 1.027 | no |
| EMP | 12 | path | BT | F-MARX | 456 | 1.103 | no |
| EMP | 24 | path | BT | MAF | 456 | 1.171 | no |
| HOUST | 1 | direct | EN | F-Level | 456 | 1.064 | no |
| HOUST | 3 | path | EN | F-Level | 456 | 1.032 | no |
| HOUST | 6 | path | RF | F-X-MARX | 456 | 0.929 | yes |
| HOUST | 9 | direct | RF | F-MAF | 456 | 0.749 | yes |
| HOUST | 12 | direct | RF | F | 456 | 0.914 | yes |
| HOUST | 24 | direct | RF | F | 456 | 0.882 | yes |
| INCOME | 1 | direct | RF | MARX | 456 | 1.051 | no |
| INCOME | 3 | direct | RF | F-MARX | 456 | 1.047 | no |
| INCOME | 6 | path | RF | F-X-MARX | 456 | 1.163 | no |
| INCOME | 9 | path | RF | F-MARX | 456 | 1.295 | no |
| INCOME | 12 | path | RF | F-MARX | 456 | 1.445 | no |
| INCOME | 24 | path | RF | F-X-MARX | 456 | 1.703 | no |
| INDPRO | 1 | direct | RF | F-X-MARX-Level | 456 | 1.190 | no |
| INDPRO | 3 | direct | RF | MARX | 456 | 1.219 | no |
| INDPRO | 6 | path | RF | MARX | 456 | 1.337 | no |
| INDPRO | 9 | path | RF | MARX | 456 | 1.545 | no |
| INDPRO | 12 | path | RF | MARX | 456 | 1.658 | no |
| INDPRO | 24 | direct | RF | F-Level | 456 | 1.034 | no |
| M2 | 1 | direct | RF | X-Level | 456 | 0.961 | yes |
| M2 | 3 | path | AL | X-Level | 456 | 0.993 | yes |
| M2 | 6 | path | RF | F-Level | 456 | 1.320 | no |
| M2 | 9 | direct | RF | F-Level | 456 | 1.169 | no |
| M2 | 12 | direct | BT | F-Level | 456 | 1.103 | no |
| M2 | 24 | path | RF | F-Level | 456 | 1.686 | no |
| PPI | 1 | direct | EN | F-MARX | 456 | 1.008 | no |
| PPI | 3 | direct | EN | MARX | 456 | 1.000 | no |
| PPI | 6 | direct | RF | F | 456 | 1.002 | no |
| PPI | 9 | direct | RF | F | 456 | 1.012 | no |
| PPI | 12 | direct | RF | F | 456 | 1.021 | no |
| PPI | 24 | direct | BT | F-Level | 456 | 1.088 | no |
| RETAIL | 1 | direct | FM | F | 456 | 1.000 | no |
| RETAIL | 3 | path | BT | F-X-MARX | 456 | 1.044 | no |
| RETAIL | 6 | path | AL | F-MARX | 456 | 1.077 | no |
| RETAIL | 9 | direct | BT | F-X-MARX-Level | 456 | 1.020 | no |
| RETAIL | 12 | direct | BT | F-X-Level | 456 | 1.016 | no |
| RETAIL | 24 | direct | BT | F-X-MAF | 456 | 1.046 | no |
| UNRATE | 1 | direct | BT | F-MARX | 456 | 0.947 | yes |
| UNRATE | 3 | direct | RF | F-X-MARX-Level | 456 | 0.991 | yes |
| UNRATE | 6 | path | RF | F-MARX | 456 | 1.150 | no |
| UNRATE | 9 | path | LB | F-X-MARX-Level | 456 | 0.977 | yes |
| UNRATE | 12 | path | LB | F-X-MARX-Level | 456 | 0.965 | yes |
| UNRATE | 24 | direct | BT | F-MAF | 456 | 1.042 | no |

Why the diagnostics can differ from the paper:

- Model caps: RF and BT use ten lightweight trees in the diagnostics, not the
  paper's full tree counts and optimizer searches.
- Feature fitting: diagnostic feature matrices use full-sample fitting to keep
  all early origins; the final replication should use fit-aware expanding
  feature state.
- Hyperparameter search: the diagnostic does not run the appendix GA and
  Bayesian optimization loops.
- Backend: the paper is MATLAB-based, while this run uses Python package
  backends. Tree defaults and optimizer randomness are not table-identical.

## Gap Ledger

| Gap | Current handling |
| --- | --- |
| Exact FRED-MD vintage | start from `2018-01.csv` extracted from the official historical vintage zip; compare nearby 2018 vintages if Dec. 2017 coverage differs |
| MATLAB GA/Bayesian optimizer defaults | use the article's ranges, random 5-fold CV, and iteration counts where possible; record backend difference and unavailable MATLAB optimizer seed/state |
| BIC lag selection for AR/FM | add a small benchmark helper or run fixed `(12, 12, 8)` for the first reconstructed pass |
| Boosted-tree depth interpretation | document whether Python backend uses `max_depth=5` or an exact 5-split/6-leaf equivalent |
| Linear Boosting candidate sampling | implemented with explicit `glmboost` candidate settings: `candidate_sampling="random"`, `candidate_fraction=1/3`, `candidate_cap=200`, `candidate_rounding="floor"`, which samples `m=min(200, floor(#Z/3))` candidate predictors per boosting step; MATLAB GA optimizer state remains unavailable |
| Full monthly OOS calendar | monthly capped diagnostic now runs all realized months from `1980-01` through `2017-12`; exact replication still needs uncapped learner settings, optimizer search, and fit-aware feature state |
| Fit-aware feature replication | current diagnostics use full-sample feature fitting; next replication pass should use expanding feature state inside each origin |
| Paper captures | add static captures of Table 1, Table 2, and selected Appendix B tables only after the settings page is accepted |