# GCLS 2021 INDPRO Reconstructed Replication Report

This report records the corrected `macroforecast` run for the INDPRO cells in Table 2 of Goulet Coulombe, Leroux, Stevanovic, and Surprenant (2021), "Macroeconomic data transformations matter," *International Journal of Forecasting*, 37(4), 1338-1354.

Report date:
: 2026-06-06

Replication level:
: `reconstructed_design`, not `exact_table_replication`.

Execution scope:
: INDPRO only, six horizons, paper Table 2 best-specification cells, compared against the matching factor-model benchmark on the same realized-target support.

Main result:
: the corrected INDPRO best-specification cells beat the matching FM benchmark at every horizon. Mean relative RMSE is `0.889818`; median relative RMSE is `0.935358`. Relative RMSE below `1` means the Table 2 best-specification cell has lower RMSE than the FM benchmark on the common support.

## Status

| Item | Status |
| --- | --- |
| Best-spec INDPRO run | complete |
| FM benchmark run | complete |
| Common-support comparison | complete |
| Failed tasks | 0 |
| Active server jobs | none detected |
| Table-identical replication claim | not made |

The run should be read as corrected package evidence. It is not a claim that the numbers are identical to the paper's Table 2, because the checked paper, appendix, local files, and public author materials do not expose a full machine-readable replication package, exact FRED-MD vintage, or exact MATLAB backend state.
## Source Material

Sources used to define the replication setting:

- IJF article DOI:
- arXiv working-paper page:
- local main PDF: `/Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/papers/10.1016j.ijforecast.2021.05.005.pdf`
- local appendix PDF: `/Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/papers/10.1016j.ijforecast.2021.05.005_appendix.pdf`
- local review note: `/Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/papers/reviews/10.1016j.ijforecast.2021.05.005-ea1152c5.md`
- author MARX snippet: `/Users/nanyeon/Library/CloudStorage/SynologyDrive-second_brain/wiki/raw/paper_code/coulombe_site_github_20260530/marx/MARX_cheap_code.R`

## Execution Artifacts

The run was executed on server1.

| Object | Path |
| --- | --- |
| Source checkout used for run | `/home/nanyeon99/project/macroforecast_gcls_replication_main_2f526bdf` |
| Source checkout commit | `2f526bdf` |
| Best-spec output root | `/home/nanyeon99/project/macroforecast_gcls_runs/table2_indpro_full_20260605` |
| FM benchmark output root | `/home/nanyeon99/project/macroforecast_gcls_runs/indpro_fm_benchmark_20260606` |
| Relative comparison output root | `/home/nanyeon99/project/macroforecast_gcls_runs/indpro_relative_vs_fm_20260606` |
| Relative comparison CSV | `/home/nanyeon99/project/macroforecast_gcls_runs/indpro_relative_vs_fm_20260606/indpro_relative_vs_fm.csv` |

The best-spec run produced about `19M` of output, the FM benchmark about `16M`, and the relative comparison about `336K`.
## Data And Sample

| Axis | Setting |
| --- | --- |
| Dataset | FRED-MD |
| Vintage | `2018-01` |
| Loader | `mf.data.load_fred_md(vintage="2018-01")` |
| Raw panel | 708 monthly rows x 127 columns |
| Raw period | `1959-01` through `2017-12` |
| Preprocessing | official McCracken-Ng FRED-MD t-code pipeline |
| Processed panel | 706 monthly rows x 127 columns |
| Processed period | `1959-03` through `2017-12` |
| Initial estimation start | `1960-01` |
| Test calendar | monthly origins from `1980-01` through `2017-12` where realized targets are available |
| Horizons | `1, 3, 6, 9, 12, 24` months |
| Target | `INDPRO` |

The h-step forecast at origin `t` is scored only when the realized target dated `t + h` is available. For h=24 with a `2018-01` FRED-MD vintage ending in `2017-12`, the final scored origins stop before the tail origins whose realizations would fall after the vintage endpoint. This is expected and is not a missing monthly-step bug: monthly origins still move by one month, but scoring requires the future realized target.

## Best-Specification Cells

The batch script fixes the Table 2 best-specification cell for each INDPRO horizon:

| Horizon | Target policy | Model | Feature case |
| ---: | --- | --- | --- |
| 1 | `direct_average` | `random_forest` | `F-X-MARX-Level` |
| 3 | `direct_average` | `random_forest` | `MARX` |
| 6 | `path_average` | `random_forest` | `MARX` |
| 9 | `path_average` | `random_forest` | `MARX` |
| 12 | `path_average` | `random_forest` | `MARX` |
| 24 | `direct_average` | `random_forest` | `F-Level` |

Random forest was run with `n_estimators=200`, `min_samples_leaf=5`, `max_features=1/3`, `bootstrap=True`, `random_state=123`, and `n_jobs=1`. Hyperparameter tuning was off for this pass, so this is a fixed paper-style configuration rather than a full appendix optimizer replication.
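As a cross-check on the scoring rule in Data And Sample, the number of scored origins per horizon can be counted directly. The sketch below is a minimal illustration, not code from the `macroforecast` package: `month_index` and `scored_origins` are hypothetical helper names, and it assumes the realized target for origin `t` is dated exactly `t + h` and that the last observed month is `2017-12`.

```python
def month_index(year, month):
    """Map a (year, month) pair to a running month counter."""
    return year * 12 + (month - 1)


def scored_origins(first_origin, last_origin, last_observed, horizon):
    """Count monthly origins t whose target date t + horizon is observed.

    Assumption: the realized target for origin t is dated exactly t + horizon.
    """
    first = month_index(*first_origin)
    last = month_index(*last_origin)
    end = month_index(*last_observed)
    return sum(1 for t in range(first, last + 1) if t + horizon <= end)


HORIZONS = (1, 3, 6, 9, 12, 24)

# Test calendar 1980-01..2017-12, data ending 2017-12 (2018-01 vintage).
counts = {
    h: scored_origins((1980, 1), (2017, 12), (2017, 12), h) for h in HORIZONS
}

for h, n in counts.items():
    print(f"h={h:>2}: {n} scored origins")
```

Under these assumptions each horizon loses exactly `h` tail origins out of the 456 monthly origins in the test calendar, which is consistent with the row counts reported in the Absolute Results table.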
## Command Log

Best-spec INDPRO batch:

```bash
uv run python scripts/replication/gcls_2021_table2_batch.py \
  --out-root /home/nanyeon99/project/macroforecast_gcls_runs/table2_indpro_full_20260605 \
  --targets INDPRO \
  --workers 3 \
  --vintage 2018-01 \
  --cache-root /home/nanyeon99/project/macroforecast_replication_cache \
  --start-year 1980 \
  --end-year 2017 \
  --n-estimators 200 \
  --random-state 123 \
  --tuning-mode off \
  --skip-existing
```

Observed batch summary:

```text
status: done
workers: 3
task_count: 6
finished_count: 6
failed_count: 0
elapsed: about 15.4 hours
```

The matching FM benchmark used the same single-cell runner with `--feature-case F --model far`, horizon-specific target policies matching the best-spec cell, and the same `2018-01` vintage, `1980` to `2017` calendar, and target construction. Conceptually:

```bash
uv run python scripts/replication/gcls_2021_table2_single.py \
  --target-alias INDPRO \
  --horizon \
  --feature-case F \
  --target-policy \
  --model far \
  --vintage 2018-01 \
  --cache-root /home/nanyeon99/project/macroforecast_replication_cache \
  --out-dir /home/nanyeon99/project/macroforecast_gcls_runs/indpro_fm_benchmark_20260606/ \
  --start-year 1980 \
  --end-year 2017 \
  --random-state 123 \
  --tuning-mode off \
  --skip-existing
```

The relative comparison aligns best-spec and FM forecast files by realized target date, checks that the realized targets are identical, and computes RMSE and relative MSE/RMSE on the common support.
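The comparison step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the package's comparison script: the function name `compare_on_common_support`, the in-memory `{target_date: (actual, prediction)}` layout, and the toy numbers are all assumptions made for the example.

```python
import math


def mse(errors):
    """Mean squared error of a non-empty list of forecast errors."""
    return sum(e * e for e in errors) / len(errors)


def compare_on_common_support(best, benchmark):
    """Align two forecast sets by target date and compute relative losses.

    Both arguments map target_date -> (actual, prediction). This layout is
    a hypothetical stand-in for the real forecast files.
    """
    common = sorted(set(best) & set(benchmark))
    if not common:
        raise ValueError("no common target dates")
    # The realized targets should agree exactly on the common support.
    actual_max_abs_diff = max(abs(best[d][0] - benchmark[d][0]) for d in common)
    best_mse = mse([best[d][1] - best[d][0] for d in common])
    bench_mse = mse([benchmark[d][1] - benchmark[d][0] for d in common])
    return {
        "common_rows": len(common),
        "actual_max_abs_diff": actual_max_abs_diff,
        "best_rmse": math.sqrt(best_mse),
        "benchmark_rmse": math.sqrt(bench_mse),
        "relative_mse": best_mse / bench_mse,
        "relative_rmse": math.sqrt(best_mse / bench_mse),
    }


# Toy data (made up): the two sets overlap on two target dates only.
best_toy = {"2000-01": (1.0, 1.1), "2000-02": (2.0, 1.9), "2000-03": (3.0, 3.0)}
fm_toy = {"2000-02": (2.0, 2.2), "2000-03": (3.0, 2.8), "2000-04": (4.0, 4.0)}

result = compare_on_common_support(best_toy, fm_toy)
print(result)
```

Restricting both sets to the intersection of target dates before computing any loss is what keeps the ratio meaningful: both RMSE values are then averaged over the same realized targets.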
## Absolute Results

| Horizon | Best-spec task | Rows | RMSE | MAE |
| ---: | --- | ---: | ---: | ---: |
| 1 | `INDPRO_h1_direct_average_random_forest_F-X-MARX-Level` | 455 | 0.005964 | 0.004248 |
| 3 | `INDPRO_h3_direct_average_random_forest_MARX` | 453 | 0.004482 | 0.003086 |
| 6 | `INDPRO_h6_path_average_random_forest_MARX` | 450 | 0.003937 | 0.002727 |
| 9 | `INDPRO_h9_path_average_random_forest_MARX` | 447 | 0.003559 | 0.002487 |
| 12 | `INDPRO_h12_path_average_random_forest_MARX` | 444 | 0.003328 | 0.002316 |
| 24 | `INDPRO_h24_direct_average_random_forest_F-Level` | 432 | 0.002407 | 0.001698 |

The row counts fall with the horizon because later origins need later realized targets. The h=24 row count is `432`, corresponding to the available common support after excluding tail origins whose 24-month-ahead target is unavailable in the vintage.

## Relative Results Against FM

| Horizon | Best RMSE | FM RMSE | Relative MSE | Relative RMSE | Common rows | Beats FM |
| ---: | ---: | ---: | ---: | ---: | ---: | --- |
| 1 | 0.005964 | 0.006283 | 0.900921 | 0.949169 | 455 | yes |
| 3 | 0.004482 | 0.004573 | 0.960364 | 0.979982 | 453 | yes |
| 6 | 0.003937 | 0.004159 | 0.896344 | 0.946754 | 450 | yes |
| 9 | 0.003559 | 0.003852 | 0.853704 | 0.923961 | 447 | yes |
| 12 | 0.003328 | 0.003752 | 0.786731 | 0.886979 | 444 | yes |
| 24 | 0.002407 | 0.003692 | 0.425190 | 0.652066 | 432 | yes |

Common-support checks:

```text
actual_max_abs_diff: 0.0 for every horizon
invalid_rows: 0
nan_prediction_rows: 0
nan_actual_rows: 0
```

Interpretation:

- `actual_max_abs_diff=0.0` means the best-spec and FM rows use the same realized target values after alignment.
- `relative_mse < 1` means the best-spec cell has lower squared-error loss than FM.
- `relative_rmse < 1` means the same result expressed in RMSE units.
- The largest improvement appears at h=24, where the RF `F-Level` direct-average cell has relative RMSE `0.652066`.
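The summary figures quoted for this run can be rechecked from the per-horizon table. The sketch below re-enters the rounded values from the Relative Results Against FM table (it does not read the comparison CSV), so agreement with the reported mean and median is expected only up to rounding in the sixth decimal.

```python
import math
import statistics

# Rounded values copied from the Relative Results Against FM table.
RELATIVE_MSE = {
    1: 0.900921, 3: 0.960364, 6: 0.896344,
    9: 0.853704, 12: 0.786731, 24: 0.425190,
}
RELATIVE_RMSE = {
    1: 0.949169, 3: 0.979982, 6: 0.946754,
    9: 0.923961, 12: 0.886979, 24: 0.652066,
}

# Relative RMSE should equal sqrt(relative MSE), up to rounding.
max_sqrt_gap = max(
    abs(math.sqrt(RELATIVE_MSE[h]) - RELATIVE_RMSE[h]) for h in RELATIVE_MSE
)

mean_rel_rmse = statistics.mean(RELATIVE_RMSE.values())
median_rel_rmse = statistics.median(RELATIVE_RMSE.values())
beats_fm_everywhere = all(v < 1.0 for v in RELATIVE_RMSE.values())

print(f"max |sqrt(rel MSE) - rel RMSE| = {max_sqrt_gap:.1e}")
print(f"mean relative RMSE   = {mean_rel_rmse:.7f}")
print(f"median relative RMSE = {median_rel_rmse:.7f}")
print(f"beats FM at every horizon: {beats_fm_everywhere}")
```

With six horizons the median is the average of the third and fourth smallest ratios (h=9 and h=6), which is why it sits well above the mean: the mean is pulled down by the single large gain at h=24.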
## What Changed Relative To The Invalid Diagnostic Runs

Earlier package diagnostics were useful for finding defects, but they are not valid replication evidence. The corrected run differs in the following material ways:

| Issue found in earlier diagnostics | Corrected behavior |
| --- | --- |
| h-step labels were allowed when the realized target date was after the forecast origin support | forecasts are scored only when `t + h` is available |
| `average_change` was applied to an already McCracken-Ng transformed target | direct-average cells use `average_value`; path-average cells use one-step `value` targets |
| target-derived paper blocks were missing | `MARX_y` and `MAF_y` can be built from `input="target_panel"` |
| feature materialization was too slow for repeated windows | runner now supports cached/corrected feature construction paths |
| invalid runs compared diagnostic shortcuts | corrected run compares best-spec cells against matching FM cells on identical actual support |

## Remaining Replication Gaps

These gaps are not package runtime failures; they are evidence boundaries for claiming exact paper-table equality.


| Gap | Consequence |
| --- | --- |
| Exact FRED-MD vintage is not stated in the checked materials | `2018-01` is the first defensible post-`2017M12` candidate, but may not be the paper's exact vintage |
| Full machine-readable replication package was not found | exact table reproduction cannot be audited line-by-line against author code |
| MATLAB tree and optimizer defaults are not exactly portable | Python/scikit-style RF/BT values can differ even under the same high-level algorithm |
| This pass uses `tuning-mode=off` | appendix GA/Bayesian/random-CV tuning is not yet replicated for every learner |
| Benchmark is fixed FM mapping | BIC-selected FM variants should be added if the paper's benchmark implementation is recovered |
| This report covers INDPRO only | ten-target Table 2 completion still requires EMP, UNRATE, INCOME, CONS, RETAIL, HOUST, M2, CPI, and PPI |

## Next Actions

1. Run the same corrected pipeline for the remaining nine Table 2 targets.
2. Add a benchmark helper that can switch between fixed FM and BIC-selected FM once the paper's exact benchmark selection rule is pinned down.
3. Run at least one `paper-small` tuning pass for Elastic Net, Adaptive Lasso, Linear Boosting, and Boosted Trees to verify the tuned-learner branch.
4. Add paper-table capture comparison in the notebook page after the full ten-target table is available.
5. Keep the invalid diagnostic section in the setting page as a debugging log, but do not use those values as evidence.