# macroforecast.feature_analysis

[Back to reference](index.md)

`macroforecast.feature_analysis` inspects feature matrices after
`macroforecast.feature_engineering`. It does not create new predictors and does
not fit a forecasting model. Its job is to make the constructed `X` auditable:
missingness, high correlations, PCA/factor columns, lag/MARX structure, feature
stage changes, and feature-selection stability.

`macroforecast.feature_diagnostic` remains available as a compatibility alias.

Accepted feature inputs are:

| Input | Meaning |
| --- | --- |
| `FeatureSet` | Uses `FeatureSet.X`, `FeatureSet.feature_metadata`, and `FeatureSet.metadata`. |
| `pandas.DataFrame` | Uses the frame as `X`; reads `attrs["macroforecast_feature_metadata"]` when present. |
| `DataBundle`, `DataSpec`, `(DataFrame, metadata)` | Uses the panel as the inspected matrix and carries metadata forward. |

The DataFrame input must satisfy the canonical macroforecast panel contract:
`DatetimeIndex` named `"date"`, sorted, no duplicate dates, numeric columns,
finite values or `NaN`, and non-empty shape.

Learned filters and feature builders can also produce weight matrices. For
example, `filters.albama()` returns `AlbaMAResult.weights`, where rows
are source observations and columns are target feature dates. `feature_analysis`
owns the diagnostic summaries of those weights:

| Callable | Input | Output | Purpose |
| --- | --- | --- | --- |
| `effective_window(weights, threshold=1e-12)` | square source-by-target weight matrix | `Series` | Count nonzero source observations used at each target date. |
| `recent_weight_share(weights, mode="one_sided")` | square source-by-target weight matrix | `DataFrame` | Summarize weight mass in recent lag/lead buckets. |

## Public Flow

```python
import macroforecast as mf

processed = mf.preprocessing.reprocess(data_spec)
features = mf.feature_engineering.feature_spec(
    target="INDPRO",
    horizons=(1, 3, 6),
    predictors="all",
    lags=(0, 1, 2, 3),
    pca_components=8,
).fit_transform(processed)

diagnostic = mf.feature_analysis.diagnose_features(
    features,
    include_correlation=True,
    include_correlation_matrix=True,
    include_lag_autocorrelation=True,
    selections={"origin_1": ["pc1", "PAYEMS_lag0"]},
    selection_similarity_metric="jaccard",
)

albama = mf.filters.albama(inflation, mode="one_sided")
window = mf.feature_analysis.effective_window(albama.weights)
shares = mf.feature_analysis.recent_weight_share(albama.weights, mode="one_sided")
```

## effective_window

```python
macroforecast.feature_analysis.effective_window(
    weights,
    *,
    threshold=1e-12,
) -> pandas.Series
```

Input: a square weight matrix whose rows are source observations and whose
columns are target feature dates. This is the shape returned by
`AlbaMAResult.weights`.

Output: one value per target date. The value is the number of source
observations with absolute weight above `threshold`.

## recent_weight_share

```python
macroforecast.feature_analysis.recent_weight_share(
    weights,
    *,
    mode="one_sided",
) -> pandas.DataFrame
```

Input: the same source-by-target weight matrix.

Output for `mode="one_sided"`:

| Column | Meaning |
| --- | --- |
| `y_t` | Weight on the target-date observation. |
| `y_t_minus_1_2` | Weight on lags 1 and 2. |
| `y_t_minus_3_5` | Weight on lags 3 through 5. |
| `y_t_minus_6_plus` | Weight on lags 6 and older. |
| `future_weight` | Weight assigned to future observations; should be zero for one-sided features. |

Output for `mode="two_sided"` replaces `future_weight` with forward buckets
`y_t_plus_1_2` and `y_t_plus_3_plus`.

## diagnose_features

```python
macroforecast.feature_analysis.diagnose_features(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    stages: Mapping[str, object] | None = None,
    include_correlation: bool = False,
    include_correlation_matrix: bool = False,
    correlation_method: str = "pearson",
    correlation_threshold: float | None = 0.9,
    correlation_min_periods: int = 3,
    correlation_order: str = "original",
    correlation_scope: str = "all",
    target=None,
    include_target_correlation: bool = False,
    high_missing_threshold: float = 0.5,
    include_factors: bool = True,
    include_factor_variance: bool = True,
    include_factor_loadings: bool = False,
    include_factor_timeseries: bool = False,
    factor_source_data=None,
    include_lags: bool = True,
    include_lag_autocorrelation: bool = False,
    include_lag_correlation_decay: bool = False,
    include_marx: bool = True,
    include_marx_weight_decay: bool = True,
    include_stage_distribution_shift: bool = True,
    selections: Mapping | Sequence | pandas.DataFrame | None = None,
    selection_similarity_metric: str | None = None,
) -> FeatureDiagnosticReport
```

### Input

| Name | Type | Default | Choices |
| --- | --- | --- | --- |
| `data` | feature input | required | `FeatureSet`, `DataFrame`, `DataBundle`, `DataSpec`, or `(DataFrame, metadata)`. |
| `feature_metadata` | `DataFrame` or `None` | auto | Overrides metadata stored on the input. |
| `stages` | mapping or `None` | `None` | Named feature-like panels to compare in construction order. |
| `include_correlation` | `bool` | `False` | Whether to compute high-correlation feature pairs. |
| `include_correlation_matrix` | `bool` | `False` | Include a full correlation matrix. |
| `correlation_method` | `str` | `"pearson"` | `"pearson"`, `"spearman"`, or `"kendall"`. |
| `correlation_threshold` | float or `None` | `0.9` | Pair filter. Uses absolute correlation when `feature_correlation(..., absolute=True)`. `None` returns all non-missing pairs. |
| `correlation_min_periods` | positive int | `3` | Minimum overlapping observations for correlation. |
| `correlation_order` | `str` | `"original"` | `"original"` or `"clustered"` for the full correlation matrix. |
| `correlation_scope` | `str` | `"all"` | `"all"`, `"within_block"`, or `"cross_block"`. Block comes from feature metadata `block`, then operation/source fallback. |
| `target` | Series, DataFrame, array-like, string, or `None` | `None` | Target used by `include_target_correlation`. A string refers to a column in `data`. |
| `include_target_correlation` | `bool` | `False` | Include feature-to-target correlation rows. |
| `high_missing_threshold` | float | `0.5` | Features with missing-rate above this value are flagged in `overview`. |
| `include_factors` | `bool` | `True` | Include PCA/factor/component diagnostics. |
| `include_factor_variance` | `bool` | `True` | Include scree/cumulative-variance table for detected factor columns. |
| `include_factor_loadings` | `bool` | `False` | Include source-factor correlation loadings. Use `factor_source_data` for original source variables. |
| `include_factor_timeseries` | `bool` | `False` | Include long-form factor-score time series. |
| `include_lags` | `bool` | `True` | Include lag/window diagnostics. |
| `include_lag_autocorrelation` | `bool` | `False` | Include ACF table for detected lag/window columns. |
| `include_lag_correlation_decay` | `bool` | `False` | Include lag-correlation decay against target or lag-0/current source columns. |
| `include_marx` | `bool` | `True` | Include MARX-style moving-average lag diagnostics. |
| `include_marx_weight_decay` | `bool` | `True` | Include equal lag weights implied by MARX moving-average windows. |
| `include_stage_distribution_shift` | `bool` | `True` | When `stages` is supplied, include adjacent-stage distribution-shift diagnostics. |
| `selections` | mapping, sequence, DataFrame, or `None` | `None` | Feature selections by origin/fold/window for stability counts. |
| `selection_similarity_metric` | `str` or `None` | `None` | `"jaccard"` or `"kuncheva"` for pairwise selection similarity. |

### Output

Returns `FeatureDiagnosticReport`.

| Field | Type | Meaning |
| --- | --- | --- |
| `overview` | `dict` | Shape, date range, missingness, zero-variance features, operation/source counts, and feature-metadata coverage. |
| `correlation` | `DataFrame` or `None` | Long-form feature pairs above the requested correlation threshold. |
| `correlation_matrix` | `DataFrame` or `None` | Full correlation matrix, optionally cluster-ordered. |
| `target_correlation` | `DataFrame` or `None` | Feature-to-target correlation rows. |
| `factors` | `DataFrame` or `None` | PCA/factor/component feature diagnostics. |
| `factor_variance` | `DataFrame` or `None` | Scree-style variance and cumulative variance share. |
| `factor_loadings` | `DataFrame` or `None` | Source-factor correlations for loading heatmaps. |
| `factor_timeseries` | `DataFrame` or `None` | Long-form factor/component values by date. |
| `lags` | `DataFrame` or `None` | Lag/window feature diagnostics. |
| `lag_autocorrelation` | `DataFrame` or `None` | ACF/PACF style lag-feature autocorrelation table. |
| `lag_correlation_decay` | `DataFrame` or `None` | Correlation decay by lag/window. |
| `marx` | `DataFrame` or `None` | MARX-style moving-average lag diagnostics. |
| `marx_weight_decay` | `DataFrame` or `None` | Equal lag weights implied by MARX windows. |
| `selection_stability` | `DataFrame` or `None` | Per-feature selection frequency across origins/folds/windows. |
| `selection_similarity` | `DataFrame` or `None` | Pairwise Jaccard or Kuncheva stability across origins/folds/windows. |
| `stage_comparison` | `DataFrame` or `None` | Shape/missingness/column-delta comparison across named feature stages. |
| `stage_distribution_shift` | `DataFrame` or `None` | Adjacent-stage mean, standard-deviation, missingness, and KS-statistic shifts. |
| `metadata` | `dict` | Input metadata plus a compact `feature_analysis` stage. |

`FeatureDiagnosticReport.to_dict()` converts tables to JSON-ready nested
dictionaries/lists.

### Metadata

`diagnose_features(...)` attaches one compact stage:

```python
diagnostic.metadata["feature_analysis"]
```

The stage records:

| Key | Meaning |
| --- | --- |
| `overview` | Compact counts: observations, features, missing cells, high-missing feature count, zero-variance feature count. |
| `options` | Correlation, factor, lag, MARX, selection, and stage-comparison choices. |
| `tables` | Number of rows generated by each diagnostic table. |

Returned diagnostic DataFrames also carry
`attrs["macroforecast_metadata"] == diagnostic.metadata`.

## Helper Functions

### feature_overview

```python
macroforecast.feature_analysis.feature_overview(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    high_missing_threshold: float = 0.5,
) -> dict
```

Returns one compact dictionary. It is the quickest check for whether the feature
matrix is sparse, constant, or missing feature metadata.

### compare_feature_stages

```python
macroforecast.feature_analysis.compare_feature_stages(
    stages: Mapping[str, object] | None = None,
    **named_stages,
) -> pandas.DataFrame
```

Compares named feature-like panels in order. The table reports observations,
feature counts, missingness, zero-variance counts, and column additions/removals
relative to the previous stage.

Example:

```python
comparison = mf.feature_analysis.compare_feature_stages(
    {
        "base": processed.panel[["PAYEMS", "INDPRO"]],
        "lagged": mf.feature_engineering.lag(processed, columns=["PAYEMS"], lags=(0, 1, 2)),
    }
)
```

### stage_distribution_shift

```python
macroforecast.feature_analysis.stage_distribution_shift(
    stages: Mapping[str, object] | None = None,
    *,
    columns=None,
    min_obs: int = 3,
    **named_stages,
) -> pandas.DataFrame
```

Compares adjacent named stages column by column. Output columns include
`stage_a`, `stage_b`, `feature`, observation counts, means, standard
deviations, `mean_shift`, `sd_ratio`, missing-rate shift, and a two-sample
KS-statistic. Use it to check whether scaling, lag construction, factor
construction, or selection changed feature distributions unexpectedly.

### feature_correlation

```python
macroforecast.feature_analysis.feature_correlation(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    method: str = "pearson",
    min_periods: int = 3,
    threshold: float | None = 0.9,
    absolute: bool = True,
    max_pairs: int | None = None,
    scope: str = "all",
    block_column: str = "block",
) -> pandas.DataFrame
```

Returns long-form pairs:

| Column | Meaning |
| --- | --- |
| `feature_a`, `feature_b` | Pair names. |
| `correlation`, `abs_correlation` | Signed and absolute correlation. |
| `block_a`, `block_b` | Block labels from feature metadata when available. |
| `operation_a`, `operation_b` | Feature operations from metadata when available. |
| `source_a`, `source_b` | Source columns from metadata when available. |

Use `threshold=None` for a full long-form correlation table.
Use `scope="within_block"` or `scope="cross_block"` to restrict pairs using
metadata blocks.

### feature_target_correlation

```python
macroforecast.feature_analysis.feature_target_correlation(
    data,
    target,
    *,
    feature_metadata=None,
    method: str = "pearson",
    min_periods: int = 3,
    absolute: bool = True,
    max_features: int | None = None,
) -> pandas.DataFrame
```

Returns one row per feature with correlation against the supplied target.
Output columns include `feature`, `target`, `correlation`,
`abs_correlation`, `operation`, `source`, `block`, and `n_obs`.

### feature_correlation_matrix

```python
macroforecast.feature_analysis.feature_correlation_matrix(
    data,
    *,
    method: str = "pearson",
    min_periods: int = 3,
    order: str = "original",
    absolute_distance: bool = True,
) -> pandas.DataFrame
```

Returns a square correlation matrix. `order="clustered"` reorders rows and
columns so highly correlated features are adjacent; this is the callable table
behind a clustered heatmap.

### factor_diagnostics

```python
macroforecast.feature_analysis.factor_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    operations: Sequence[str] = (...),
    prefixes: Sequence[str] = ("pc", "factor", "maf"),
) -> pandas.DataFrame
```

Detects factor/component features using either feature metadata
(`operation in {"pca", "group_pca", "maf", ...}` or a non-null `component`) or
name prefixes such as `pc1`, `factor1`, and `maf1`.

Returned columns include `feature`, `group`, `operation`, `block`, `source`,
`component`, `n_obs`, `missing_rate`, `mean`, `sd`, `variance`, and
`variance_share`. `variance_share` is a diagnostic share of variance within the
detected factor group. It is not the PCA model's explained-variance ratio unless
the upstream transform recorded that exact quantity.

### factor_variance

```python
macroforecast.feature_analysis.factor_variance(data, *, feature_metadata=None)
```

Returns scree-style rows with `variance_share` and
`cumulative_variance_share`. This is the callable table behind scree and
cumulative-variance views.

### factor_loadings

```python
macroforecast.feature_analysis.factor_loadings(
    data,
    *,
    source_data=None,
    feature_metadata=None,
    method="pearson",
    max_sources=None,
)
```

Approximates factor loadings as correlations between source variables and
factor columns. Supply `source_data` when `data` contains only factor scores.
Returned rows are long-form: `factor`, `source`, `loading`, `abs_loading`.

### factor_timeseries

```python
macroforecast.feature_analysis.factor_timeseries(
    data,
    *,
    feature_metadata=None,
    operations=(...),
    prefixes=("pc", "factor", "maf"),
    max_factors=None,
) -> pandas.DataFrame
```

Returns detected factor/component columns in long time-series form. Output
columns are `date`, `factor`, `value`, `group`, `operation`, `component`, and
`source`. Use this for factor-score line plots or factor stability checks
without reconstructing the feature metadata manually.

### lag_diagnostics

```python
macroforecast.feature_analysis.lag_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    operations: Sequence[str] = (...),
) -> pandas.DataFrame
```

Detects lag/window features using metadata fields `lag`, `window`,
`operation`, or feature names such as `x_lag3`, `x_roll6_mean`, and
`x_ma4_lag1`.

Returned columns include `feature`, `operation`, `source`, `lag`, `window`,
`n_obs`, `missing_rate`, `first_valid`, and `last_valid`.

### lag_autocorrelation

```python
macroforecast.feature_analysis.lag_autocorrelation(
    data,
    *,
    max_lag: int = 12,
    kind: str = "acf",
) -> pandas.DataFrame
```

Returns ACF or PACF values for detected lag/window feature columns. This is the
callable table behind autocorrelation-per-lag and partial-autocorrelation views.

### lag_correlation_decay

```python
macroforecast.feature_analysis.lag_correlation_decay(
    data,
    *,
    target=None,
    method="pearson",
) -> pandas.DataFrame
```

Returns correlation decay by lag/window. If `target` is supplied, each lag
feature is correlated with that target. Otherwise, each lag feature is compared
with its same-source lag-0/current column when available.

### marx_diagnostics

```python
macroforecast.feature_analysis.marx_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
) -> pandas.DataFrame
```

Detects MARX-style columns named like `x_ma4_lag1`. These are moving-average
lag features, not PCA. The returned table adds `marx_formula` using the
recorded starting lag and window. For example:

```text
mean(x[t-1]...x[t-4])
```

For `x_ma4_lag2`, the formula is `mean(x[t-2]...x[t-5])`.

### marx_weight_decay

```python
macroforecast.feature_analysis.marx_weight_decay(
    data,
    *,
    feature_metadata=None,
) -> pandas.DataFrame
```

Returns the equal lag weights implied by each MARX moving-average feature.
For `x_ma4_lag1`, the table has four rows with weight `0.25` for lags 1
through 4 and cumulative weights from `0.25` to `1.0`. For `x_ma4_lag2`,
the lag rows are 2 through 5, with the same equal weights.

### selection_stability

```python
macroforecast.feature_analysis.selection_stability(
    selections,
    *,
    all_features: Iterable[str] | None = None,
) -> pandas.DataFrame
```

Accepts any of these inputs:

| Input form | Example |
| --- | --- |
| Mapping of origin to selected names | `{"2020-01": ["x1", "x2"], "2020-02": ["x2"]}` |
| Sequence of selected-name iterables | `[["x1"], ["x1", "x3"]]` |
| Indicator DataFrame | rows are origins, columns are features, truthy values mean selected |
| Long DataFrame | columns `feature`, `selected`, and optionally `origin`, `window`, `fold`, or `split` |

The result is indexed by `feature` and includes `selected_count`,
`selection_rate`, `n_origins`, `first_selected_origin`, and
`last_selected_origin`.

### selection_similarity

```python
macroforecast.feature_analysis.selection_similarity(
    selections,
    *,
    metric: str = "jaccard",
    all_features=None,
    n_features=None,
) -> pandas.DataFrame
```

Returns pairwise stability across origins/folds/windows. `metric="jaccard"`
uses overlap divided by union. `metric="kuncheva"` adjusts overlap for expected
random overlap using the declared or inferred feature universe size. Kuncheva
stability is a fixed-selection-size measure; when two windows select different
numbers of features, `score` is missing and the output still reports
`selected_a`, `selected_b`, and `overlap`.

### custom_feature_diagnostic

```python
macroforecast.feature_analysis.custom_feature_diagnostic(
    data,
    func,
    *,
    name=None,
    feature_metadata=None,
    metadata=None,
    **params,
) -> pandas.DataFrame
```

Runs one user diagnostic on a feature matrix or `FeatureSet`. This is for
inspection only; it does not create new predictors.

Callable signature:

```python
func(X, *, feature_metadata=None, metadata=None, **params)
```

Accepted callable outputs are `DataFrame`, `Series`, mapping, or a sequence
convertible to a `DataFrame`. The returned table carries:

| Attr | Meaning |
| --- | --- |
| `macroforecast_metadata_schema.kind` | Always `custom_feature_diagnostic`. |
| `macroforecast_metadata_schema.method` | `name` or callable name. |
| `macroforecast_metadata` | Input metadata plus a `custom_feature_diagnostic` stage. |

Example:

```python
def block_missingness(X, *, feature_metadata=None, metadata=None, block="all"):
    return pd.DataFrame(
        [{"block": block, "missing_rate": float(X.isna().mean().mean())}]
    )

diag = mf.feature_analysis.custom_feature_diagnostic(
    features,
    block_missingness,
    name="block_missingness",
    block="rates",
)
```

## Boundary

| Question | Use |
| --- | --- |
| Create predictors and target matrices | `mf.feature_engineering` |
| Inspect feature matrix quality and metadata | `mf.feature_analysis` |
| Compare raw and preprocessed panels | `mf.data_analysis` |
| Inspect fitted model residuals or tuning trace | `mf.forecast_analysis` |