GCLS 2021 INDPRO Reconstructed Replication Report

This report records the corrected macroforecast run for the INDPRO cells in Table 2 of Goulet Coulombe, Leroux, Stevanovic, and Surprenant (2021), “Macroeconomic data transformations matter,” International Journal of Forecasting, 37(4), 1338-1354.

Report date: 2026-06-06
Replication level: reconstructed_design, not exact_table_replication.
Execution scope: INDPRO only, six horizons, paper Table 2 best-specification cells, compared against the matching factor-model (FM) benchmark on the same realized-target support.
Main result: the corrected INDPRO best-specification cells beat the matching FM benchmark at every horizon. Mean relative RMSE is 0.889818; median relative RMSE is 0.935358. Relative RMSE below 1 means the Table 2 best-specification cell has lower RMSE than the FM benchmark on the common support.

Status

Item	Status
Best-spec INDPRO run	complete
FM benchmark run	complete
Common-support comparison	complete
Failed tasks	0
Active server jobs	none detected
Table-identical replication claim	not made

The run should be read as corrected package evidence. It is not a claim that the numbers are identical to the paper’s Table 2, because the checked paper, appendix, local files, and public author materials do not expose a full machine-readable replication package, exact FRED-MD vintage, or exact MATLAB backend state.

Source Material

Sources used to define the replication setting:

IJF article DOI: https://doi.org/10.1016/j.ijforecast.2021.05.005
arXiv working-paper page: https://arxiv.org/abs/2008.01714
local main PDF: /Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/papers/10.1016j.ijforecast.2021.05.005.pdf
local appendix PDF: /Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/papers/10.1016j.ijforecast.2021.05.005_appendix.pdf
local review note: /Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/papers/reviews/10.1016j.ijforecast.2021.05.005-ea1152c5.md
author MARX snippet: /Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/paper_code/coulombe_site_github_20260530/marx/MARX_cheap_code.R

Execution Artifacts

The run was executed on server1.

Object	Path
Source checkout used for run	`/home/nanyeon99/project/macroforecast_gcls_replication_main_2f526bdf`
Source checkout commit	`2f526bdf`
Best-spec output root	`/home/nanyeon99/project/macroforecast_gcls_runs/table2_indpro_full_20260605`
FM benchmark output root	`/home/nanyeon99/project/macroforecast_gcls_runs/indpro_fm_benchmark_20260606`
Relative comparison output root	`/home/nanyeon99/project/macroforecast_gcls_runs/indpro_relative_vs_fm_20260606`
Relative comparison CSV	`/home/nanyeon99/project/macroforecast_gcls_runs/indpro_relative_vs_fm_20260606/indpro_relative_vs_fm.csv`

The best-spec run produced about 19M of output, the FM benchmark about 16M, and the relative comparison about 336K.

Data And Sample

Axis	Setting
Dataset	FRED-MD
Vintage	`2018-01`
Loader	`mf.data.load_fred_md(vintage="2018-01")`
Raw panel	708 monthly rows x 127 columns
Raw period	`1959-01` through `2017-12`
Preprocessing	official McCracken-Ng FRED-MD t-code pipeline
Processed panel	706 monthly rows x 127 columns
Processed period	`1959-03` through `2017-12`
Initial estimation start	`1960-01`
Test calendar	monthly origins from `1980-01` through `2017-12` where realized targets are available
Horizons	`1, 3, 6, 9, 12, 24` months
Target	`INDPRO`
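
The two-row difference between the raw and processed panels is consistent with the second-difference transformation codes in the McCracken-Ng scheme, which leave the first two observations of a series undefined. The following is a minimal sketch of the seven standard t-codes on a plain list of floats. `apply_tcode` is a hypothetical helper written for illustration and is not the macroforecast implementation.

```python
import math


def apply_tcode(x, tcode):
    """Apply a FRED-MD transformation code to a list of floats.

    Leading observations that the transformation cannot define are
    returned as None, so the output has the same length as the input.
    """
    n = len(x)
    if n < 3:
        raise ValueError("this sketch assumes at least three observations")
    if tcode == 1:  # level
        return list(x)
    if tcode == 2:  # first difference
        return [None] + [x[i] - x[i - 1] for i in range(1, n)]
    if tcode == 3:  # second difference
        return [None, None] + [x[i] - 2 * x[i - 1] + x[i - 2] for i in range(2, n)]
    if tcode == 4:  # log level
        return [math.log(v) for v in x]
    if tcode == 5:  # first difference of logs
        return apply_tcode([math.log(v) for v in x], 2)
    if tcode == 6:  # second difference of logs
        return apply_tcode([math.log(v) for v in x], 3)
    if tcode == 7:  # first difference of the one-period percent change
        pct = [x[i] / x[i - 1] - 1.0 for i in range(1, n)]
        return [None, None] + [pct[i] - pct[i - 1] for i in range(1, len(pct))]
    raise ValueError(f"unknown t-code: {tcode}")


toy = [1.0, 2.0, 4.0, 7.0]
first_diff = apply_tcode(toy, 2)
second_diff = apply_tcode(toy, 3)
```

Codes 3, 6, and 7 each produce two undefined leading values, which is why a panel that contains any such series starts two months later once incomplete rows are dropped.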

The h-step forecast at origin t is scored only when the realized target dated t + h is available. With the 2018-01 FRED-MD vintage ending in 2017-12, the h=24 cell therefore has no scored forecasts for the tail origins whose realizations would fall after the vintage endpoint. This is expected and is not a missing monthly-step bug: origins still advance one month at a time, but scoring requires the future realized target.
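
The scoring rule can be checked with simple month arithmetic. The sketch below assumes the origin calendar runs from 1980-01 through 2017-12 and that the last realized target is dated 2017-12, as stated in the table above. The helper names `month_index` and `scored_origins` are hypothetical and not part of the package.

```python
def month_index(ym):
    """Convert a 'YYYY-MM' string to a running month count."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month) - 1


def scored_origins(first_origin, last_origin, sample_end, horizon):
    """Count monthly origins t whose target t + horizon lies inside the sample."""
    first = month_index(first_origin)
    # The last usable origin is the earlier of the calendar end and
    # the last month whose h-step-ahead target is still observed.
    last = min(month_index(last_origin), month_index(sample_end) - horizon)
    return max(0, last - first + 1)


HORIZONS = (1, 3, 6, 9, 12, 24)
counts = {
    h: scored_origins("1980-01", "2017-12", "2017-12", h) for h in HORIZONS
}
print(counts)
# {1: 455, 3: 453, 6: 450, 9: 447, 12: 444, 24: 432}
```

Under these assumptions each horizon loses exactly h origins from the 456-month calendar, which matches the Rows column reported in the Absolute Results table.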

Best-Specification Cells

The batch script fixes the Table 2 best-specification cell for each INDPRO horizon:

Horizon	Target policy	Model	Feature case
1	`direct_average`	`random_forest`	`F-X-MARX-Level`
3	`direct_average`	`random_forest`	`MARX`
6	`path_average`	`random_forest`	`MARX`
9	`path_average`	`random_forest`	`MARX`
12	`path_average`	`random_forest`	`MARX`
24	`direct_average`	`random_forest`	`F-Level`

Random forest was run with n_estimators=200, min_samples_leaf=5, max_features=1/3, bootstrap=True, random_state=123, and n_jobs=1. Hyperparameter tuning was off for this pass, so this is a fixed paper-style configuration rather than a full appendix optimizer replication.
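
For readers who want to mirror this configuration outside the package, the settings map directly onto the keyword arguments of scikit-learn's `RandomForestRegressor`. Whether macroforecast's `random_forest` model wraps that class is an assumption here, not something this report verifies. The import is guarded so the snippet still runs where scikit-learn is not installed.

```python
# Fixed random-forest settings used for this pass (tuning off).
RF_PARAMS = {
    "n_estimators": 200,
    "min_samples_leaf": 5,
    "max_features": 1 / 3,  # fraction of features considered at each split
    "bootstrap": True,
    "random_state": 123,
    "n_jobs": 1,
}

try:
    from sklearn.ensemble import RandomForestRegressor
except ImportError:  # scikit-learn is optional for this illustration
    RandomForestRegressor = None

# Assumption: a plain scikit-learn forest stands in for the package model.
model = RandomForestRegressor(**RF_PARAMS) if RandomForestRegressor else None
```

Fixing `random_state` and `n_jobs=1` keeps repeated runs comparable, but it does not make the results portable to the MATLAB backend used in the paper.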

Command Log

Best-spec INDPRO batch:

uv run python scripts/replication/gcls_2021_table2_batch.py \
  --out-root /home/nanyeon99/project/macroforecast_gcls_runs/table2_indpro_full_20260605 \
  --targets INDPRO \
  --workers 3 \
  --vintage 2018-01 \
  --cache-root /home/nanyeon99/project/macroforecast_replication_cache \
  --start-year 1980 \
  --end-year 2017 \
  --n-estimators 200 \
  --random-state 123 \
  --tuning-mode off \
  --skip-existing

Observed batch summary:

status: done
workers: 3
task_count: 6
finished_count: 6
failed_count: 0
elapsed: about 15.4 hours

The matching FM benchmark used the same single-cell runner with --feature-case F --model far, horizon-specific target policies matching the best-spec cell, and the same 2018-01 vintage, 1980 to 2017 calendar, and target construction.

Conceptually:

uv run python scripts/replication/gcls_2021_table2_single.py \
  --target-alias INDPRO \
  --horizon <horizon> \
  --feature-case F \
  --target-policy <matching_policy> \
  --model far \
  --vintage 2018-01 \
  --cache-root /home/nanyeon99/project/macroforecast_replication_cache \
  --out-dir /home/nanyeon99/project/macroforecast_gcls_runs/indpro_fm_benchmark_20260606/<task_slug> \
  --start-year 1980 \
  --end-year 2017 \
  --random-state 123 \
  --tuning-mode off \
  --skip-existing

The relative comparison aligns best-spec and FM forecast files by realized target date, checks that the realized targets are identical, and computes RMSE and relative MSE/RMSE on the common support.
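
The alignment and scoring logic described above can be sketched as follows. This is an illustration on toy numbers, not the comparison script itself: the function name, the dictionary layout, and the data are all hypothetical.

```python
import math


def compare_on_common_support(best, bench):
    """Compare two forecast sets on their shared target dates.

    Each argument maps a target date to an (actual, prediction) pair.
    """
    common = sorted(set(best) & set(bench))
    if not common:
        raise ValueError("no overlapping target dates")
    # Both runs must be scored against identical realized values.
    actual_max_abs_diff = max(abs(best[d][0] - bench[d][0]) for d in common)
    mse_best = sum((best[d][0] - best[d][1]) ** 2 for d in common) / len(common)
    mse_bench = sum((bench[d][0] - bench[d][1]) ** 2 for d in common) / len(common)
    return {
        "common_rows": len(common),
        "actual_max_abs_diff": actual_max_abs_diff,
        "rmse_best": math.sqrt(mse_best),
        "rmse_bench": math.sqrt(mse_bench),
        "relative_mse": mse_best / mse_bench,
        "relative_rmse": math.sqrt(mse_best / mse_bench),
    }


# Toy inputs: the two runs overlap on 2000-02 and 2000-03 only.
best = {"2000-01": (1.0, 1.1), "2000-02": (2.0, 1.8), "2000-03": (3.0, 3.0)}
bench = {"2000-02": (2.0, 1.6), "2000-03": (3.0, 3.3), "2000-04": (4.0, 4.0)}
result = compare_on_common_support(best, bench)
```

Restricting both error sums to the intersection of target dates is what makes the ratio meaningful: a model is never rewarded or penalized for dates the other model did not forecast.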

Absolute Results

Horizon	Best-spec task	Rows	RMSE	MAE
1	`INDPRO_h1_direct_average_random_forest_F-X-MARX-Level`	455	0.005964	0.004248
3	`INDPRO_h3_direct_average_random_forest_MARX`	453	0.004482	0.003086
6	`INDPRO_h6_path_average_random_forest_MARX`	450	0.003937	0.002727
9	`INDPRO_h9_path_average_random_forest_MARX`	447	0.003559	0.002487
12	`INDPRO_h12_path_average_random_forest_MARX`	444	0.003328	0.002316
24	`INDPRO_h24_direct_average_random_forest_F-Level`	432	0.002407	0.001698

The row counts fall with the horizon because later origins need later realized targets. The h=24 row count is 432, corresponding to the available common support after excluding tail origins whose 24-month-ahead target is unavailable in the vintage.

Relative Results Against FM

Horizon	Best RMSE	FM RMSE	Relative MSE	Relative RMSE	Common rows	Beats FM
1	0.005964	0.006283	0.900921	0.949169	455	yes
3	0.004482	0.004573	0.960364	0.979982	453	yes
6	0.003937	0.004159	0.896344	0.946754	450	yes
9	0.003559	0.003852	0.853704	0.923961	447	yes
12	0.003328	0.003752	0.786731	0.886979	444	yes
24	0.002407	0.003692	0.425190	0.652066	432	yes

Common-support checks:

actual_max_abs_diff: 0.0 for every horizon
invalid_rows: 0
nan_prediction_rows: 0
nan_actual_rows: 0

Interpretation:

actual_max_abs_diff=0.0 means the best-spec and FM rows use the same realized target values after alignment.
relative_mse < 1 means the best-spec cell has lower squared-error loss than FM.
relative_rmse < 1 means the same result expressed in RMSE units.
The largest improvement appears at h=24, where the RF F-Level direct-average cell has relative RMSE 0.652066.
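
The summary statistics and the MSE/RMSE relationship can be rechecked directly from the rounded table values. The sketch below hard-codes the six rows from the table above and uses only the Python standard library; because the inputs are rounded to six decimals, agreement is expected only up to about 1e-6.

```python
import math
import statistics

# (horizon, relative MSE, relative RMSE) copied from the table above.
ROWS = [
    (1, 0.900921, 0.949169),
    (3, 0.960364, 0.979982),
    (6, 0.896344, 0.946754),
    (9, 0.853704, 0.923961),
    (12, 0.786731, 0.886979),
    (24, 0.425190, 0.652066),
]

rel_rmse = [r for _, _, r in ROWS]
mean_rel = statistics.mean(rel_rmse)
median_rel = statistics.median(rel_rmse)

# Relative RMSE should equal the square root of relative MSE.
max_gap = max(abs(math.sqrt(m) - r) for _, m, r in ROWS)
all_beat_fm = all(r < 1.0 for r in rel_rmse)

print(f"mean={mean_rel:.6f} median={median_rel:.6f} max_gap={max_gap:.1e}")
```

On the rounded inputs, the mean and median agree with the reported 0.889818 and 0.935358 to within rounding at the sixth decimal, and every relative RMSE matches the square root of its relative MSE to the same tolerance.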

What Changed Relative To The Invalid Diagnostic Runs

Earlier package diagnostics were useful for finding defects, but they are not valid replication evidence. The corrected run differs in the following material ways:

Issue found in earlier diagnostics	Corrected behavior
h-step labels were allowed even when the realized target date fell after the end of the forecast-origin support	forecasts are scored only when `t + h` is available
`average_change` was applied to an already McCracken-Ng transformed target	direct-average cells use `average_value`; path-average cells use one-step `value` targets
target-derived paper blocks were missing	`MARX_y` and `MAF_y` can be built from `input="target_panel"`
feature materialization was too slow for repeated windows	runner now supports cached/corrected feature construction paths
invalid runs compared diagnostic shortcuts	corrected run compares best-spec cells against matching FM cells on identical actual support
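
The target-policy correction can be illustrated on a toy series. The sketch assumes that `average_value` means the mean of the already transformed target over t+1 through t+h, and that a path-average forecast is the mean of h separate step-ahead forecasts. Both readings are assumptions about the package semantics, and the function names are hypothetical.

```python
def direct_average_target(y, t, h):
    """Mean of the transformed series over t+1..t+h (assumed `average_value`)."""
    if t + h >= len(y):
        # Mirrors the scoring rule: no label when t + h is unobserved.
        raise IndexError("target at t + h is not available")
    window = y[t + 1 : t + h + 1]
    return sum(window) / h


def path_average_forecast(step_forecasts):
    """Average h separate step-ahead forecasts of the single-period value."""
    return sum(step_forecasts) / len(step_forecasts)


# Toy transformed series; index 1 is the forecast origin.
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
target = direct_average_target(y, t=1, h=3)  # mean of y[2], y[3], y[4]

# If every step-ahead forecast were perfect, the path average
# would coincide with the direct-average target.
perfect_path = path_average_forecast([2.0, 3.0, 4.0])
```

Because the series is already in transformed units, averaging values is sufficient under this reading; applying a further change transformation on top would difference the target twice, which is the defect the corrected run removes.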

Remaining Replication Gaps

These gaps are not package runtime failures; they are evidence boundaries for claiming exact paper-table equality.

Gap	Consequence
Exact FRED-MD vintage is not stated in the checked materials	`2018-01` is the first defensible post-`2017M12` candidate, but may not be the paper’s exact vintage
Full machine-readable replication package was not found	exact table reproduction cannot be audited line-by-line against author code
MATLAB tree and optimizer defaults are not exactly portable	Python/scikit-style RF/BT values can differ even under the same high-level algorithm
This pass uses `tuning-mode=off`	appendix GA/Bayesian/random-CV tuning is not yet replicated for every learner
Benchmark is fixed FM mapping	BIC-selected FM variants should be added if the paper’s benchmark implementation is recovered
This report covers INDPRO only	ten-target Table 2 completion still requires EMP, UNRATE, INCOME, CONS, RETAIL, HOUST, M2, CPI, and PPI

Next Actions

Run the same corrected pipeline for the remaining nine Table 2 targets.
Add a benchmark helper that can switch between fixed FM and BIC-selected FM once the paper’s exact benchmark selection rule is pinned down.
Run at least one paper-small tuning pass for Elastic Net, Adaptive Lasso, Linear Boosting, and Boosted Trees to verify the tuned-learner branch.
Add paper-table capture comparison in the notebook page after the full ten-target table is available.
Keep the invalid diagnostic section in the setting page as a debugging log, but do not use those values as evidence.