# macroforecast.preprocessing

[Back to reference](index.md)

## Purpose

`macroforecast.preprocessing` turns a canonical pandas panel from
[`macroforecast.data`](data.md) into a processed panel plus metadata. It accepts
a `DataSpec`, `DataBundle`, `(panel, metadata)` tuple, or `pandas.DataFrame`,
then returns a `PreprocessedData` object. The preferred input is a
`DataBundle` or `DataSpec` produced by `macroforecast.data`; if preprocessing
receives a plain panel without data-generated metadata, it emits a warning.

The default `reprocess()` path follows the public McCracken-Ng FRED-MD Matlab
workflow for FRED-MD/FRED-QD style panels. FRED-SD has no official t-code map,
so the user must explicitly choose `transform="none"` or pass custom codes.

Preprocessing fails closed on transformation metadata. If
`transform="official"` is selected but no t-code map is available from
`transform_codes`, `metadata["transform_codes"]`, or
`panel.attrs["macroforecast_transform_codes"]`, `reprocess()` raises
`ValueError`. If explicit transform-code keys do not match panel columns, it
also raises. This prevents accidental no-op preprocessing.

## Public Functions

| Function | Purpose | Output |
| --- | --- | --- |
| `reprocess` | Run the full-sample preprocessing sequence. | `PreprocessedData` |
| `preprocess_spec` | Store preprocessing choices for runner-fitted execution. | `PreprocessSpec` |
| `custom_preprocess` | Apply one user callable directly to data. | `PreprocessedData` |
| `custom_preprocess_step` | Build a custom step for `preprocess_spec(custom_steps=[...])`. | `dict` |
| `plan` | Validate and summarize configured preprocessing choices without changing data. | `dict` |
| `report` | Summarize a completed preprocessing result. | `dict` |
| `apply_transform_codes` | Apply McCracken-Ng t-code formulas to matching panel columns. | `pandas.DataFrame` |
| `fred_sd_transform_codes` | Expand FRED-SD variable/state t-code choices and suggestions. | `dict` or `(dict, DataFrame)` |
| `handle_tcode_lag` | Keep or remove transform-induced leading missing rows. | `pandas.DataFrame` |
| `handle_outliers` | Apply one outlier rule. | `pandas.DataFrame` |
| `impute_missing` | Fill missing panel values with one imputation rule. | `pandas.DataFrame` |
| `standardize_panel` | Fit and apply one full-panel scaling rule. | `pandas.DataFrame` |
| `handle_frame_edges` | Keep, truncate, drop, or fill remaining unbalanced edges. | `pandas.DataFrame` |

Low-level clean helpers are also public for exact single-operation use. They
are listed in [Low-Level Clean Helpers](#low-level-clean-helpers).

## Public Flow

```python
import macroforecast as mf

bundle = mf.data.load_fred_md()
data_spec = mf.data.spec(bundle, target="INDPRO", horizons=[1, 3, 6, 12])

processed = mf.preprocessing.reprocess(data_spec)

panel = processed.panel
metadata = processed.metadata
```

## Public Classes And Values

| Symbol | Meaning |
| --- | --- |
| `PreprocessedData` | Output object returned by `reprocess(...)` and `custom_preprocess(...)`. |
| `PreprocessInput` | Accepted direct preprocessing input type: `DataSpec`, `DataBundle`, `(panel, metadata)`, or `DataFrame`. |
| `PreprocessSpec` | Runner-compatible fit/transform preprocessing contract. |
| `FittedPreprocessor` | Fitted preprocessing state used by the runner for fit-window or fixed-reference policies. |
| `FRED_SD_NATIONAL_ANALOG_TRANSFORM_CODES` | High-confidence package t-code suggestions for FRED-SD variables with national analogs. |
| `FRED_SD_MEDIUM_CONFIDENCE_TRANSFORM_CODES` | Broader provisional FRED-SD t-code suggestions. |

## PreprocessedData

```python
macroforecast.preprocessing.PreprocessedData(
    panel: pandas.DataFrame,
    metadata: dict,
    target: str | None = None,
    targets: tuple[str, ...] = (),
    horizons: tuple[int, ...] = (),
    start: str | None = None,
    end: str | None = None,
    predictors = "all",
    steps: tuple[dict, ...] = (),
)
```

### Output Schema

| Field | Type | Meaning |
| --- | --- | --- |
| `panel` | `pandas.DataFrame` | Processed canonical date-indexed panel. |
| `metadata` | `dict` | Input metadata plus preprocessing stages and transform/standardization state. |
| `target`, `targets`, `horizons`, `start`, `end`, `predictors` | copied from `DataSpec` when supplied | Run-level data choices preserved for downstream stages. |
| `steps` | `tuple[dict, ...]` | Ordered preprocessing step log. |

### Methods

| Method | Input | Output | Meaning |
| --- | --- | --- | --- |
| `attach(stage, values)` | `stage: str`, `values: Mapping` | `PreprocessedData` | Return a new object with one metadata stage added. |

`PreprocessedData` also supports tuple unpacking:

```python
panel, metadata = processed
```

## Default Order

| Step | Default | Meaning |
| --- | --- | --- |
| 1. Frequency | `frequency="keep"` | Keep the input frequency unless the user asks for monthly/quarterly alignment. |
| 2. Transform | `transform="official"` | Apply official t-code transforms from FRED-MD/FRED-QD metadata. |
| 3. T-code lag | `tcode_lag="drop"` | Remove leading rows implied by the largest t-code lag. This is two rows for full FRED-MD. |
| 4. Outliers | `outliers="iqr"`, `outlier_action="flag_as_nan"`, `iqr_threshold=10.0` | Flag observations with `abs(x - median) > 10 * IQR` and set them to missing. |
| 5. Imputation | `impute="em_factor"` | Run FRED-MD style PCA-EM with Bai-Ng `PC_p2`, `kmax=8`, `DEMEAN=2`, `max_iter=50`, `tol=1e-6`. |
| 6. Standardize | `standardize="none"` | Optional column-wise scaling after imputation. Choices are `"zscore"`, `"robust"`, and `"minmax"`. |
| 7. Frame | `frame="keep"` | Keep the post-EM frame. No final balanced-panel truncation is applied by default. |

Set `transform_order="before_frequency"` when a mixed-frequency panel should be
transformed in each native frequency before monthly or quarterly alignment. The
default is `transform_order="after_frequency"`, which first aligns frequency and
then applies t-codes.

## T-Code Formulas

The official FRED-MD/FRED-QD t-code map uses these formulas for a raw series
`x_t`.

| T-code | Formula | Leading missing values | Log-domain rule |
| --- | --- | --- | --- |
| `1` | `x_t` | `0` | none |
| `2` | `x_t - x_{t-1}` | `1` | none |
| `3` | `(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})` | `2` | none |
| `4` | `log(x_t)` | `0` | if `min(x) < 1e-6`, the transformed series is all missing |
| `5` | `log(x_t) - log(x_{t-1})` | `1` | requires `min(x) > 1e-6`; otherwise all missing |
| `6` | `(log(x_t) - log(x_{t-1})) - (log(x_{t-1}) - log(x_{t-2}))` | `2` | requires `min(x) > 1e-6`; otherwise all missing |
| `7` | `(x_t / x_{t-1} - 1) - (x_{t-1} / x_{t-2} - 1)` | `2` | none |

There is no `preprocess(...)` compatibility alias in the clean public API. Use
`reprocess(...)` for full-sample preprocessing and `preprocess_spec(...)` for a
runner-fitted preprocessing contract.

Most empirical macro papers preprocess the full panel once before fitting
models. That is supported by `reprocess(...)`. For a real-time forecast design,
where each origin should only use information available at that origin, use
`preprocess_spec(...)` inside `macroforecast.forecasting.run(...)`.
`preprocess_spec(...)` only stores what preprocessing should do; the runner
receives `preprocessing_policy=mf.window.stage_policy(...)` and decides where
the spec may fit.

Common runner policies:

| Policy scope | Meaning |
| --- | --- |
| `"full_panel"` | Fit preprocessing once on the full panel. This is useful for retrospective replication designs. |
| `"origin_available"` | Re-run preprocessing on observations available at each origin plus requested test rows. This supports EM imputation on variables observed by that origin. |
| `"fit_window"` | Fit outlier, imputation, and standardization state on the model fit window, then apply that state to validation/test rows. It currently supports `impute="none"`, `"mean"`, and `"forward_fill"`; use `"origin_available"` for EM or linear imputation. |
| `"fixed_reference"` | Fit supported preprocessing state on a fixed reference period, then apply that state to later windows. |

```python
pre = macroforecast.preprocessing.preprocess_spec(
    transform="official",
    outliers="iqr",
    impute="em_factor",
    frame="keep",
)

result = macroforecast.forecasting.run(
    panel,
    "ridge",
    preprocessing=pre,
    preprocessing_policy=macroforecast.window.stage_policy("origin_available"),
    features=features,
    window=window,
)
```

## reprocess

```python
macroforecast.preprocessing.reprocess(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    frequency: str = "keep",
    quarterly_to_monthly: str = "step_backward",
    weekly_to_monthly: str = "mean",
    monthly_to_quarterly: str = "quarterly_average",
    weekly_to_quarterly: str = "mean",
    transform_order: str = "after_frequency",
    transform: str = "official",
    transform_codes: Mapping[str, int] | None = None,
    transform_code_overrides: Mapping[str, int] | None = None,
    tcode_lag: str = "drop",
    outliers: str = "iqr",
    outlier_action: str = "flag_as_nan",
    iqr_threshold: float = 10.0,
    zscore_threshold: float = 3.0,
    winsorize_quantiles: tuple[float, float] = (0.01, 0.99),
    impute: str = "em_factor",
    em_n_factors: int = 8,
    em_factor_selection: str = "baing_p2",
    em_demean: int = 2,
    em_max_iter: int = 50,
    em_tolerance: float = 1e-6,
    standardize: str = "none",
    standardize_columns: str | Sequence[str] = "all",
    standardize_ddof: int = 0,
    frame: str = "keep",
    warn_metadata: bool = True,
) -> PreprocessedData
```

### Input

| Name | Type | Default | Choices |
| --- | --- | --- | --- |
| `data` | `DataSpec`, `DataBundle`, `(panel, metadata)`, or `DataFrame` | required | Canonical data input. |
| `metadata` | mapping or `None` | `None` | Extra metadata to merge before preprocessing. |
| `frequency` | `str` | `"keep"` | `"keep"`, `"monthly"`, `"quarterly"`, `"drop_non_monthly"`, `"drop_non_quarterly"`. |
| `quarterly_to_monthly` | `str` | `"step_backward"` | `"step_backward"`, `"repeat_within_quarter"`, `"step_forward"`, `"quarter_end_ffill"`, `"linear_interpolation"`. |
| `weekly_to_monthly` | `str` | `"mean"` | `"mean"`, `"last"`, `"sum"`. |
| `monthly_to_quarterly` | `str` | `"quarterly_average"` | `"quarterly_average"`, `"quarterly_endpoint"`, `"quarterly_sum"`. |
| `weekly_to_quarterly` | `str` | `"mean"` | `"mean"`, `"last"`, `"sum"`. |
| `transform_order` | `str` | `"after_frequency"` | `"after_frequency"`/`"frequency_then_transform"` or `"before_frequency"`/`"transform_then_frequency"`. |
| `transform` | `str` | `"official"` | `"official"`, `"custom"`, `"none"`; accepts aliases `apply_official_tcode`, `custom_tcode`, `no_transform`. |
| `transform_codes` | mapping or `None` | from metadata | Full t-code map. Required for `transform="custom"` and required for `transform="official"` when metadata does not provide codes. Explicit keys must match panel columns. |
| `transform_code_overrides` | mapping or `None` | `None` | Per-series override applied on top of official or custom codes. Override keys must match panel columns. |
| `tcode_lag` | `str` | `"drop"` | `"drop"`, `"keep"`, `"drop_all_missing_rows"`, `"drop_any_missing_rows"`. |
| `outliers` | `str` | `"iqr"` | `"iqr"`, `"zscore"`, `"winsorize"`, `"none"`. |
| `outlier_action` | `str` | `"flag_as_nan"` | `"flag_as_nan"`, `"replace_with_median"`, `"replace_with_cap_value"` for IQR/z-score methods. |
| `impute` | `str` | `"em_factor"` | `"em_factor"`, `"em_multivariate"`, `"mean"`, `"forward_fill"`, `"linear"`, `"none"`. |
| `em_factor_selection` | `str` | `"baing_p2"` | `"baing_p1"`, `"baing_p2"`, `"baing_p3"`, `"fixed"`. |
| `em_demean` | `int` | `2` | `0`, `1`, `2`, `3`, matching `factors_em.m`. |
| `standardize` | `str` | `"none"` | `"none"`, `"zscore"`, `"robust"`, `"minmax"`. Aliases include `"standard"` and `"standardize"` for z-score. |
| `standardize_columns` | `str` or sequence | `"all"` | `"all"`, `"predictors"`, `"targets"`, or explicit column names. `"predictors"` and `"targets"` use `DataSpec` choices when available. |
| `standardize_ddof` | `int` | `0` | Degrees of freedom used by z-score scaling. |
| `frame` | `str` | `"keep"` | `"keep"`, `"truncate"`, `"drop_unbalanced_series"`, `"zero_fill"`. |
| `warn_metadata` | `bool` | `True` | Warn when plain panels lack metadata from `macroforecast.data`. `preprocess_spec(...)` defaults this to `False` unless explicitly overridden. |

### Output

Returns `PreprocessedData`.

| Field | Type | Meaning |
| --- | --- | --- |
| `panel` | `pandas.DataFrame` | Processed canonical date-indexed panel. |
| `metadata` | `dict` | Original data metadata plus a `preprocessing` stage. |
| `target`, `targets`, `horizons`, `start`, `end`, `predictors` | copied from `DataSpec` when supplied | Run-level data choices preserved for downstream stages. |
| `steps` | `tuple[dict, ...]` | Ordered preprocessing log. |

`metadata["preprocessing"]["transform_state"]` stores inverse-transform support
metadata for every transformed series: t-code, log-domain requirement, lag
count, and the last observed raw values/dates available before transformation.
`metadata["preprocessing"]["standardization_state"]` stores the fitted center
and scale values when `standardize != "none"`.

When transforms are applied, the final post-override t-code map is also stored
in `metadata["transform_codes_applied"]` and
`processed.panel.attrs["macroforecast_transform_codes"]`. This is the map that
actually ran, not just the raw loader metadata.

### Error Conditions

| Condition | Result |
| --- | --- |
| Plain `DataFrame` without data metadata | `UserWarning`; preprocessing still runs if the panel is canonical. |
| `transform="official"` with no t-code map | `ValueError`. |
| `transform="custom"` with no t-code map | `ValueError`. |
| Explicit transform-code or override key not in the panel | `ValueError`. |
| FRED-SD with default `transform="official"` | `ValueError`; choose `transform="none"` or custom FRED-SD codes. |
| Frequency inference finds sparse unknown columns during alignment | `UserWarning`; supply data metadata when the source frequency is known. |
| EM imputation sees an all-missing row or column | `ValueError`. |
| Standardization sees a zero-variance numeric column | `ValueError`. |

`PreprocessedData` supports tuple unpacking:

```python
panel, metadata = processed
```

## preprocess_spec

`preprocess_spec(...)` stores the same preprocessing options accepted by
`reprocess(...)`, excluding input-only arguments such as `data` and `metadata`.
It rejects unknown options immediately, so stage timing options must be passed
to `forecasting.run(..., preprocessing_policy=...)`, not hidden inside the
preprocessing spec.

```python
macroforecast.preprocessing.preprocess_spec(
    **options,
) -> PreprocessSpec
```

### Input

`**options` may include any `reprocess(...)` option except `data` and
`metadata`. It also accepts:

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `custom_steps` | sequence or omitted | omitted | Custom preprocessing steps created by `custom_preprocess_step(...)`. |
| `warn_metadata` | `bool` | `False` inside runner specs unless supplied | Whether to warn when input lacks `macroforecast.data` metadata. |

Do not pass window timing, stage scope, or split choices here. Those belong to
`forecasting.run(..., preprocessing_policy=...)`.

### Output

Returns `PreprocessSpec`.

| Method | Input | Output | Meaning |
| --- | --- | --- | --- |
| `fit(data, metadata=None, policy="origin_available")` | preprocessing input | `FittedPreprocessor` | Fit preprocessing choices on a training/history panel. |
| `fit_transform(data, metadata=None, policy="origin_available")` | preprocessing input | `PreprocessedData` | Fit and return the processed training panel. |
| `to_dict()` | none | `dict` | JSON-ready preprocessing options. |
| `to_metadata()` | none | `dict` | Compact runner metadata. |

`FittedPreprocessor.transform(data, metadata=None, history=None, policy=None)`
returns `PreprocessedData` for new rows. `policy="origin_available"` replays
preprocessing on `history + data`; `policy="fit_window"` applies state fitted
on the training window where supported.

```python
pre = mf.preprocessing.preprocess_spec(
    transform="official",
    outliers="iqr",
    impute="em_factor",
    standardize="zscore",
    frame="keep",
)
```

For direct advanced use:

```python
fitted = pre.fit(train_panel, policy="origin_available")
processed_test = fitted.transform(test_panel, history=train_panel)
```

The fitted and transformed metadata records `fit_period`, `history_period`,
`transform_period`, and `output_period`. `policy="fit_window"` applies
fit-window outlier, imputation, and standardization state; it currently supports
`impute="none"`, `"mean"`, and `"forward_fill"`.

`preprocess_spec(...)` also accepts `custom_steps=[...]`. These steps run after
the built-in preprocessing options. Inside `forecasting.run(...)`, the custom
steps are fitted or applied inside the same stage policy as the rest of the
preprocessing spec.

```python
def add_spread(panel, *, metadata=None, scale=1.0):
    out = panel.copy()
    out["spread"] = (out["long_rate"] - out["short_rate"]) * scale
    return out

pre = mf.preprocessing.preprocess_spec(
    transform="none",
    impute="mean",
    custom_steps=[
        mf.preprocessing.custom_preprocess_step("spread", add_spread, scale=100.0),
    ],
)
```

## custom_preprocess

Apply one user-supplied preprocessing callable directly to a panel or bundle.

```python
macroforecast.preprocessing.custom_preprocess(
    data,
    func,
    *,
    metadata: Mapping[str, object] | None = None,
    name: str | None = None,
    **params,
) -> PreprocessedData
```

### Callable Contract

The callable receives:

```python
func(panel: pandas.DataFrame, *, metadata: dict, **params)
```

It must return one of:

| Return type | Meaning |
| --- | --- |
| `pandas.DataFrame` | New canonical or normalizable panel. Existing `attrs["macroforecast_metadata"]` is merged with input metadata. |
| `DataBundle` | Panel plus metadata to continue with. |
| `PreprocessedData` | Full preprocessing object to continue with. |
| `(DataFrame, metadata)` | Explicit panel and metadata pair. |

### Output

Returns `PreprocessedData`. Metadata gains `metadata["custom_preprocess"]`,
including callable name, parameters, input panel summary, and output panel
summary. The output panel also carries
`panel.attrs["macroforecast_metadata"]`.

## custom_preprocess_step

Create a runner-compatible preprocessing step for
`preprocess_spec(custom_steps=[...])`.

```python
macroforecast.preprocessing.custom_preprocess_step(
    name: str,
    func,
    **params,
) -> dict
```

| Input | Meaning |
| --- | --- |
| `name` | Stable step name stored in metadata. |
| `func` | Callable following the `custom_preprocess()` callable contract. |
| `**params` | JSON-ready parameters passed to `func`. |

The returned dictionary keeps the callable for Python execution, but
`PreprocessSpec.to_dict()` records only the callable name so runner metadata is
JSON-ready.

## Step Helpers

These helpers return `pandas.DataFrame` unless noted.

| Function | Input | Output | Meaning |
| --- | --- | --- | --- |
| `plan(data, ...)` | DataFrame/bundle/spec | `dict` | Dry-run summary of configured choices, transform codes, metadata warning, and detected native frequencies. |
| `report(processed)` | `PreprocessedData` | `dict` | Compact report from a completed preprocessing result. |
| `custom_preprocess(data, func, ...)` | DataFrame/bundle/spec and callable | `PreprocessedData` | Apply one custom preprocessing function directly. |
| `custom_preprocess_step(name, func, **params)` | name and callable | `dict` | Build a custom step for `preprocess_spec(custom_steps=[...])`. |
| `apply_transform_codes(panel, codes)` | DataFrame, t-code map | DataFrame | Apply McCracken-Ng t-code formulas. |
| `fred_sd_transform_codes(data, ...)` | FRED-SD panel/bundle/spec | `dict[str, int]`, or `(dict, DataFrame)` with `return_table=True` | Build FRED-SD state-series t-codes from user choices and optional national-analog suggestions. |
| `handle_tcode_lag(panel, method=..., codes=...)` | DataFrame | DataFrame | Handle missing rows introduced by t-code transforms. |
| `handle_outliers(panel, method=...)` | DataFrame | DataFrame | Apply one outlier policy. |
| `impute_missing(panel, method=...)` | DataFrame | DataFrame | Fill missing values. |
| `standardize_panel(panel, method=...)` | DataFrame | DataFrame | Apply one full-panel standardization policy. |
| `handle_frame_edges(panel, method=...)` | DataFrame | DataFrame | Keep/drop/truncate/fill remaining unbalanced edges. |

Low-level callable variants are public for users who want one exact operation
without the full `reprocess(...)` sequence.

## Low-Level Clean Helpers

These helpers accept a `pandas.DataFrame` and return a new `pandas.DataFrame`
unless the output column says otherwise.

| Function | Key options | Output | Meaning |
| --- | --- | --- | --- |
| `iqr_outlier_clean(panel, threshold=10.0, action="flag_as_nan")` | `threshold`, `action` | DataFrame | IQR outlier rule used by `handle_outliers(method="iqr")`. |
| `zscore_outlier_clean(panel, threshold=3.0, action="flag_as_nan")` | `threshold`, `action` | DataFrame | Z-score outlier rule used by `handle_outliers(method="zscore")`. |
| `winsorize_clean(panel, lower_quantile=0.01, upper_quantile=0.99)` | quantile bounds | DataFrame | Winsorization rule used by `handle_outliers(method="winsorize")`. |
| `em_factor_impute_clean(panel, n_factors=8, max_iter=50, tol=1e-6, factor_selection="baing_p2", demean=2)` | EM factor controls | DataFrame | PCA-EM imputation used by `impute_missing(method="em_factor")`. |
| `em_multivariate_impute_clean(panel, max_iter=20, tol=1e-4)` | EM controls | DataFrame | Multivariate EM imputation used by `impute_missing(method="em_multivariate")`. |
| `mean_impute_clean(panel)` | none | DataFrame | Column-mean imputation. |
| `forward_fill_clean(panel)` | none | DataFrame | Forward-fill imputation. |
| `linear_interpolate_clean(panel)` | none | DataFrame | Time interpolation imputation. |
| `truncate_to_balanced_clean(panel)` | none | DataFrame | Keep the largest balanced sample. |
| `drop_unbalanced_series_clean(panel)` | none | DataFrame | Drop series that keep unbalanced sample edges. |
| `zero_fill_leading_clean(panel)` | none | DataFrame | Fill leading missing values with zero. |
| `fit_standardization_state(panel, method="zscore", ddof=0)` | scaling method | `dict` | Fit reusable scaling state. |
| `apply_standardization_state(panel, state)` | fitted state | DataFrame | Apply previously fitted scaling state. |
| `standardize_clean(panel, method="zscore", ddof=0)` | scaling method | DataFrame | One-shot panel standardization. |
| `apply_tcode_transform(panel, tcode_map)` | t-code map | DataFrame | Apply McCracken-Ng t-code formulas to matching panel columns. |
| `freq_align_quarterly_to_monthly_clean(panel, quarterly_columns, rule="step_backward")` | column list, rule | DataFrame | Low-level quarterly-to-monthly alignment helper. |
| `freq_align_monthly_to_quarterly_clean(panel, monthly_columns, rule="quarterly_average")` | column list, rule | DataFrame | Low-level monthly-to-quarterly alignment helper. |

## plan

```python
macroforecast.preprocessing.plan(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    frequency: str = "keep",
    transform_order: str = "after_frequency",
    transform: str = "official",
    transform_codes: Mapping[str, int] | None = None,
    transform_code_overrides: Mapping[str, int] | None = None,
    tcode_lag: str = "drop",
    outliers: str = "iqr",
    impute: str = "em_factor",
    standardize: str = "none",
    standardize_columns: str | Sequence[str] = "all",
    standardize_ddof: int = 0,
    frame: str = "keep",
) -> dict
```

### Input

Same data input contract as `reprocess()`. `plan()` validates the panel and
normalizes choices, but it does not transform, impute, or mutate the panel.

### Output

| Key | Meaning |
| --- | --- |
| `input_panel` | Shape, date range, columns, missing count, and inferred index frequency. |
| `metadata_warning` | Warning text that would matter for a panel without data-generated metadata, or `None`. |
| `steps` | Ordered step names implied by `transform_order`. |
| `frequency` | Requested frequency policy plus native-frequency map and metadata source. |
| `frequency["issues"]` | Native-frequency inference concerns such as sparse `unknown`, `irregular`, or `annual` columns. |
| `transform` | Transform method, applied t-code map, ignored metadata-only codes, and any no-code/no-match error note. |
| `tcode_lag`, `outliers`, `impute`, `standardize`, `frame` | Normalized choice values. |

## report

```python
macroforecast.preprocessing.report(processed: PreprocessedData) -> dict
```

### Input

`processed` must be the object returned by `reprocess()`.

### Output

| Key | Meaning |
| --- | --- |
| `input_panel` | Panel summary before preprocessing. |
| `output_panel` | Panel summary after preprocessing. |
| `steps` | Ordered execution log with input/output shapes where relevant. |
| `choices` | Final normalized preprocessing choices. |
| `transform_state` | Inverse-transform support metadata saved during the transform step. |
| `standardization_state` | Fitted scaling metadata saved during the standardization step. |

## apply_transform_codes

```python
macroforecast.preprocessing.apply_transform_codes(
    panel: pandas.DataFrame,
    codes: Mapping[str, int],
) -> pandas.DataFrame
```

### Input

| Name | Type | Required | Choices |
| --- | --- | --- | --- |
| `panel` | `pandas.DataFrame` | yes | Canonical date-indexed numeric panel. |
| `codes` | mapping from column name to integer | yes | T-codes `1` through `7`. Columns absent from the panel are ignored. |

### Output

Returns a new `pandas.DataFrame` with matching columns transformed by the
McCracken-Ng formulas above. Columns without a matching t-code are copied
unchanged. Leading missing values are not removed here; call
`handle_tcode_lag()` or use `reprocess(tcode_lag=...)`.

Note the distinction between this low-level helper and `reprocess()`.
`apply_transform_codes()` ignores absent code keys for convenience when used
interactively. `reprocess()` is stricter: explicit transform-code keys must
match panel columns so a production run cannot silently miss a requested
series.

## handle_tcode_lag

```python
macroforecast.preprocessing.handle_tcode_lag(
    panel: pandas.DataFrame,
    *,
    method: str = "drop",
    codes: Mapping[str, int] | None = None,
) -> pandas.DataFrame
```

### Input

| `method` | Meaning |
| --- | --- |
| `"drop"` | Drop the first `max(t-code lag)` rows. This is the FRED-MD default path after applying official t-codes. |
| `"keep"` | Keep all rows, including transform-induced leading missing values. |
| `"drop_all_missing_rows"` | Drop only rows where every column is missing. |
| `"drop_any_missing_rows"` | Drop every row with at least one missing value. This is strict and often removes too much data. |

### Output

Returns a new `pandas.DataFrame`. The function does not impute; it only handles
missing rows introduced by transformations.

## handle_outliers

```python
macroforecast.preprocessing.handle_outliers(
    panel: pandas.DataFrame,
    *,
    method: str = "iqr",
    action: str = "flag_as_nan",
    iqr_threshold: float = 10.0,
    zscore_threshold: float = 3.0,
    winsorize_quantiles: tuple[float, float] = (0.01, 0.99),
) -> pandas.DataFrame
```

### Input

| Name | Default | Choices |
| --- | --- | --- |
| `method` | `"iqr"` | `"iqr"`, `"zscore"`, `"winsorize"`, `"none"` |
| `action` | `"flag_as_nan"` | `"flag_as_nan"`, `"replace_with_median"`, `"replace_with_cap_value"` for IQR/z-score methods |
| `iqr_threshold` | `10.0` | Positive float. McCracken-Ng default is `10.0`. |
| `zscore_threshold` | `3.0` | Positive float. |
| `winsorize_quantiles` | `(0.01, 0.99)` | Lower and upper quantiles for winsorization. |

### Output

Returns a new `pandas.DataFrame`. The default marks IQR outliers as `NaN`, so
the next imputation step can fill them.

## impute_missing

```python
macroforecast.preprocessing.impute_missing(
    panel: pandas.DataFrame,
    *,
    method: str = "em_factor",
    em_n_factors: int = 8,
    em_factor_selection: str = "baing_p2",
    em_demean: int = 2,
    em_max_iter: int = 50,
    em_tolerance: float = 1e-6,
) -> pandas.DataFrame
```

### Input

| Name | Default | Choices |
| --- | --- | --- |
| `method` | `"em_factor"` | `"em_factor"`, `"em_multivariate"`, `"mean"`, `"forward_fill"`, `"linear"`, `"none"` |
| `em_n_factors` | `8` | Maximum factor count for `em_factor`; fixed rank when `em_factor_selection="fixed"`. |
| `em_factor_selection` | `"baing_p2"` | `"baing_p1"`, `"baing_p2"`, `"baing_p3"`, `"fixed"` |
| `em_demean` | `2` | `0`, `1`, `2`, `3`, matching `factors_em.m` standardization modes. |
| `em_max_iter` | `50` | Positive integer. |
| `em_tolerance` | `1e-6` | Positive float. |

### Output

Returns a new `pandas.DataFrame`. The default `em_factor` path uses the
FRED-MD-style PCA-EM algorithm. It raises if the panel contains an all-missing
row or all-missing column; use `handle_tcode_lag()` before this step for the
usual FRED-MD transform-induced leading missing rows.

`method="linear"` fills only interior missing values bracketed by observed
data. It does not extrapolate leading or trailing missing values, because those
edges usually encode unavailable source observations.

`method="em_multivariate"` uses the same all-missing row/column guard as
`em_factor`.

## standardize_panel

```python
macroforecast.preprocessing.standardize_panel(
    panel: pandas.DataFrame,
    *,
    method: str = "zscore",
    ddof: int = 0,
) -> pandas.DataFrame
```

### Input

| Name | Default | Choices |
| --- | --- | --- |
| `method` | `"zscore"` | `"zscore"`, `"robust"`, `"minmax"` |
| `ddof` | `0` | Non-negative integer used only for z-score standardization. |

### Output

Returns a new `pandas.DataFrame` with numeric columns scaled. `zscore` uses
column means and standard deviations, `robust` uses median and IQR, and
`minmax` uses minimum and range. The helper fits scaling parameters on the full
panel supplied to it.

For forecasting experiments that require origin-by-origin information sets,
prefer `preprocess_spec(standardize=...)` through the forecasting runner. In
that path, scaling parameters are fitted on the train window and reused for the
test rows.

Inside `reprocess(...)`, use `standardize_columns="predictors"` when a
`DataSpec` should scale predictor columns while leaving the target in its
post-transform units.

## handle_frame_edges

```python
macroforecast.preprocessing.handle_frame_edges(
    panel: pandas.DataFrame,
    *,
    method: str = "keep",
) -> pandas.DataFrame
```

### Input

| `method` | Meaning |
| --- | --- |
| `"keep"` | Keep the panel as-is. This is the default after EM imputation. |
| `"truncate"` | Truncate to the largest balanced sample. |
| `"drop_unbalanced_series"` | Drop columns that keep unbalanced edges. |
| `"zero_fill"` | Fill leading missing values with zero. |

### Output

Returns a new `pandas.DataFrame`.

## FRED-SD

FRED-SD does not provide official t-codes. `reprocess(fred_sd_bundle)` with
the default `transform="official"` raises an error. The user must choose one
of these paths.

Package suggestion tables are exposed as constants for inspection:

| Symbol | Meaning |
| --- | --- |
| `FRED_SD_NATIONAL_ANALOG_TRANSFORM_CODES` | High-confidence t-code suggestions based on national FRED-MD/FRED-QD analogs. |
| `FRED_SD_MEDIUM_CONFIDENCE_TRANSFORM_CODES` | Broader provisional t-code suggestions; opt in with `include_medium_confidence=True`. |

## fred_sd_transform_codes

```python
macroforecast.preprocessing.fred_sd_transform_codes(
    data,
    *,
    variable_codes: Mapping[str, int] | None = None,
    state_series_codes: Mapping[str, int] | None = None,
    use_national_analog_suggestions: bool = True,
    include_medium_confidence: bool = False,
    return_table: bool = False,
) -> dict[str, int] | tuple[dict[str, int], pandas.DataFrame]
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `data` | `DataBundle`, `DataSpec`, `(panel, metadata)`, or `DataFrame` | required | FRED-SD wide state-series panel. |
| `variable_codes` | mapping or `None` | `None` | User t-code choices by FRED-SD variable, such as `{"UR": 2}`. Expanded to every matching state series. |
| `state_series_codes` | mapping or `None` | `None` | User t-code choices by exact column, such as `{"UR_CA": 2}`. Overrides variable-level choices. |
| `use_national_analog_suggestions` | `bool` | `True` | Include high-confidence package suggestions based on national FRED-MD/FRED-QD analogs. |
| `include_medium_confidence` | `bool` | `False` | Include broader provisional suggestions. |
| `return_table` | `bool` | `False` | Return a provenance table with the expanded code map. |

### Output

By default, returns `dict[str, int]` mapping FRED-SD state-series columns to
t-codes. With `return_table=True`, returns `(codes, table)`. The table columns
are `column`, `sd_variable`, `state`, `tcode`, `source`, and
`suggestion_confidence`.

`suggestion_confidence` is not a statistical confidence interval. It records
whether the t-code came from a user state-series override, user variable-level
choice, high-confidence package suggestion, medium-confidence package
suggestion, or no assignment.

No transform:

```python
processed = mf.preprocessing.reprocess(fred_sd_bundle, transform="none")
```

Variable-level t-codes expanded to all state series:

```python
codes = mf.preprocessing.fred_sd_transform_codes(
    fred_sd_bundle,
    variable_codes={"UR": 2, "ICLAIMS": 5},
)

processed = mf.preprocessing.reprocess(
    fred_sd_bundle,
    frequency="monthly",
    transform="custom",
    transform_codes=codes,
)
```

Built-in national-analog suggestions are offered for high-confidence FRED-SD
variables such as `UR`, `PARTRATE`, `ICLAIMS`, `LF`, `NA`, and major employment
sector variables. These are suggestions, not official FRED-SD metadata. Pass
`include_medium_confidence=True` to also include broader output, housing, trade,
and income analogs.

To inspect provenance:

```python
codes, table = mf.preprocessing.fred_sd_transform_codes(
    fred_sd_bundle,
    variable_codes={"UR": 2},
    return_table=True,
)
```

`table` has columns `column`, `sd_variable`, `state`, `tcode`, `source`, and
`suggestion_confidence`. Sources distinguish user state-series overrides, user
variable-level choices, high- or medium-confidence national-analog suggestions,
and unassigned columns. `suggestion_confidence` is not a statistical confidence
interval; it is a provenance label for non-official package suggestions.

For FRED-SD frequency alignment, preprocessing reads the data-generated
`fred_sd_series_metadata` report first. Observed-date inference is only a
fallback. FRED-SD is mixed monthly/quarterly data; combined dataset frequency
alignment belongs in `macroforecast.data`, not in preprocessing.

## FRED-QD and Dataset Combination

`mf.data.load_fred_qd()` returns a quarterly panel with
`metadata["frequency"] == "quarterly"` and official FRED-QD t-codes. FRED-QD is
not mixed-frequency in the same sense as FRED-SD.

Combinations such as FRED-MD + FRED-SD or FRED-QD + FRED-SD should be built in
`macroforecast.data`, not in preprocessing. Dataset composition decides which
sources to load, how to align indices before a run, how to merge metadata, and
how to record frequency-conversion provenance. Preprocessing then operates on
the combined canonical panel it receives.

Use:

```python
monthly_bundle = mf.data.load_fred_md_sd(states=["CA"], variables=["UR"])
quarterly_bundle = mf.data.load_fred_qd_sd(states=["CA"], variables=["UR"])
```

## Source

The FRED-MD/FRED-QD defaults are based on the public FRED-Databases Matlab code
linked from the St. Louis Fed FRED-MD/FRED-QD page, specifically
`fredfactors.m`, `prepare_missing.m`, `remove_outliers.m`, and `factors_em.m`.

- `box_cox_lambda` -- select a Box-Cox lambda for one series ('loglik' MLE or 'guerrero'; forecast::BoxCox.lambda).
- `box_cox_clean` -- apply a Box-Cox variance-stabilising transform per numeric column (lambda selected or supplied).
- `inverse_box_cox` -- invert a Box-Cox transform given lambda.