macroforecast.feature_analysis#

macroforecast.feature_analysis inspects the feature matrices produced by macroforecast.feature_engineering. It does not create new predictors and does not fit a forecasting model. Its job is to make the constructed X auditable: missingness, high correlations, PCA/factor columns, lag/MARX structure, feature stage changes, and feature-selection stability.

macroforecast.feature_diagnostic remains available as a compatibility alias.

Accepted feature inputs are:

Input	Meaning
`FeatureSet`	Uses `FeatureSet.X`, `FeatureSet.feature_metadata`, and `FeatureSet.metadata`.
`pandas.DataFrame`	Uses the frame as `X`; reads `attrs["macroforecast_feature_metadata"]` when present.
`DataBundle`, `DataSpec`, `(DataFrame, metadata)`	Uses the panel as the inspected matrix and carries metadata forward.

The DataFrame input must satisfy the canonical macroforecast panel contract: DatetimeIndex named "date", sorted, no duplicate dates, numeric columns, finite values or NaN, and non-empty shape.
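
The contract above can be checked before a frame is handed to the module. The following is a minimal sketch, not the library's own validator: `check_panel_contract` and its message strings are hypothetical names introduced here, and only standard pandas/NumPy calls are used.

```python
import numpy as np
import pandas as pd


def check_panel_contract(X: pd.DataFrame) -> list:
    """Return a list of contract violations; an empty list means the frame passes."""
    problems = []
    if X.shape[0] == 0 or X.shape[1] == 0:
        problems.append("empty shape")
    if not isinstance(X.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    if X.index.name != "date":
        problems.append('index is not named "date"')
    if not X.index.is_monotonic_increasing:
        problems.append("index is not sorted")
    if X.index.has_duplicates:
        problems.append("duplicate dates")
    non_numeric = [
        str(col) for col, dtype in X.dtypes.items()
        if not pd.api.types.is_numeric_dtype(dtype)
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    elif np.isinf(X.to_numpy(dtype=float)).any():
        # NaN is allowed by the contract; +/-inf is not.
        problems.append("infinite values")
    return problems


# Illustrative frame that satisfies every rule (NaN is permitted).
good = pd.DataFrame(
    {"x": [1.0, np.nan, 3.0]},
    index=pd.DatetimeIndex(["2020-01-01", "2020-02-01", "2020-03-01"], name="date"),
)
# Same frame with the index name dropped, which breaks one rule.
unnamed = good.rename_axis(None)
```

Here `check_panel_contract(good)` returns an empty list, while `check_panel_contract(unnamed)` reports the missing index name.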

Learned filters and feature builders can also produce weight matrices. For example, filters.albama() returns AlbaMAResult.weights, where rows are source observations and columns are target feature dates. feature_analysis owns the diagnostic summaries of those weights:

Callable	Input	Output	Purpose
`effective_window(weights, threshold=1e-12)`	square source-by-target weight matrix	`Series`	Count source observations whose absolute weight exceeds `threshold` at each target date.
`recent_weight_share(weights, mode="one_sided")`	square source-by-target weight matrix	`DataFrame`	Summarize weight mass in recent lag/lead buckets.

Public Flow#

import macroforecast as mf

processed = mf.preprocessing.reprocess(data_spec)
features = mf.feature_engineering.feature_spec(
    target="INDPRO",
    horizons=(1, 3, 6),
    predictors="all",
    lags=(0, 1, 2, 3),
    pca_components=8,
).fit_transform(processed)

diagnostic = mf.feature_analysis.diagnose_features(
    features,
    include_correlation=True,
    include_correlation_matrix=True,
    include_lag_autocorrelation=True,
    selections={"origin_1": ["pc1", "PAYEMS_lag0"]},
    selection_similarity_metric="jaccard",
)

albama = mf.filters.albama(inflation, mode="one_sided")
window = mf.feature_analysis.effective_window(albama.weights)
shares = mf.feature_analysis.recent_weight_share(albama.weights, mode="one_sided")

effective_window#

macroforecast.feature_analysis.effective_window(
    weights,
    *,
    threshold=1e-12,
) -> pandas.Series

Input: a square weight matrix whose rows are source observations and whose columns are target feature dates. This is the shape returned by AlbaMAResult.weights.

Output: one value per target date. The value is the number of source observations with absolute weight above threshold.
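
The count described above can be reproduced with plain pandas. This is a sketch under the stated layout (rows are sources, columns are targets); `effective_window_sketch` is a hypothetical stand-in, not the library function.

```python
import pandas as pd


def effective_window_sketch(weights: pd.DataFrame, threshold: float = 1e-12) -> pd.Series:
    """Count, per target column, the source rows with |weight| above threshold."""
    return (weights.abs() > threshold).sum(axis=0)


dates = pd.date_range("2020-01-01", periods=3, freq="MS")
# One-sided toy weights: target t only uses sources dated t or earlier,
# so the matrix is upper triangular in (source row, target column) layout.
toy_weights = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.0, 0.5, 0.3],
     [0.0, 0.0, 0.5]],
    index=dates,
    columns=dates,
)
toy_window = effective_window_sketch(toy_weights)
```

For this toy matrix the effective window grows from 1 to 3 observations across the three target dates.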

recent_weight_share#

macroforecast.feature_analysis.recent_weight_share(
    weights,
    *,
    mode="one_sided",
) -> pandas.DataFrame

Input: the same source-by-target weight matrix.

Output for mode="one_sided":

Column	Meaning
`y_t`	Weight on the target-date observation.
`y_t_minus_1_2`	Weight on lags 1 and 2.
`y_t_minus_3_5`	Weight on lags 3 through 5.
`y_t_minus_6_plus`	Weight on lags 6 and older.
`future_weight`	Weight assigned to future observations; should be zero for one-sided features.

Output for mode="two_sided" replaces future_weight with forward buckets y_t_plus_1_2 and y_t_plus_3_plus.
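
The one-sided buckets can be sketched by measuring the lag between each source row and target column. The code below assumes rows and columns share the same date ordering and that each bucket is a raw sum of weights; whether the library normalizes the shares is not stated here, so treat that as an assumption. `recent_weight_share_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def recent_weight_share_sketch(weights: pd.DataFrame) -> pd.DataFrame:
    """Sum weights per target date into one-sided lag buckets."""
    W = weights.to_numpy(dtype=float)
    n = W.shape[0]
    source_pos = np.arange(n)[:, None]
    target_pos = np.arange(n)[None, :]
    lag = target_pos - source_pos  # positive: source is older than the target
    buckets = {
        "y_t": lag == 0,
        "y_t_minus_1_2": (lag >= 1) & (lag <= 2),
        "y_t_minus_3_5": (lag >= 3) & (lag <= 5),
        "y_t_minus_6_plus": lag >= 6,
        "future_weight": lag < 0,
    }
    out = {name: np.where(mask, W, 0.0).sum(axis=0) for name, mask in buckets.items()}
    return pd.DataFrame(out, index=weights.columns)


# Toy one-sided filter: a trailing mean over at most three observations.
n_obs = 8
idx = pd.date_range("2020-01-01", periods=n_obs, freq="MS")
W_toy = np.zeros((n_obs, n_obs))
for j in range(n_obs):
    lo = max(0, j - 2)
    W_toy[lo:j + 1, j] = 1.0 / (j - lo + 1)
toy_shares = recent_weight_share_sketch(pd.DataFrame(W_toy, index=idx, columns=idx))
```

In the toy case the last target date puts one third of its weight on `y_t`, two thirds on lags 1 and 2, and nothing on future observations.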

diagnose_features#

macroforecast.feature_analysis.diagnose_features(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    stages: Mapping[str, object] | None = None,
    include_correlation: bool = False,
    include_correlation_matrix: bool = False,
    correlation_method: str = "pearson",
    correlation_threshold: float | None = 0.9,
    correlation_min_periods: int = 3,
    correlation_order: str = "original",
    correlation_scope: str = "all",
    target=None,
    include_target_correlation: bool = False,
    high_missing_threshold: float = 0.5,
    include_factors: bool = True,
    include_factor_variance: bool = True,
    include_factor_loadings: bool = False,
    include_factor_timeseries: bool = False,
    factor_source_data=None,
    include_lags: bool = True,
    include_lag_autocorrelation: bool = False,
    include_lag_correlation_decay: bool = False,
    include_marx: bool = True,
    include_marx_weight_decay: bool = True,
    include_stage_distribution_shift: bool = True,
    selections: Mapping | Sequence | pandas.DataFrame | None = None,
    selection_similarity_metric: str | None = None,
) -> FeatureDiagnosticReport

Input#

Name	Type	Default	Choices
`data`	feature input	required	`FeatureSet`, `DataFrame`, `DataBundle`, `DataSpec`, or `(DataFrame, metadata)`.
`feature_metadata`	`DataFrame` or `None`	auto	Overrides metadata stored on the input.
`stages`	mapping or `None`	`None`	Named feature-like panels to compare in construction order.
`include_correlation`	`bool`	`False`	Whether to compute high-correlation feature pairs.
`include_correlation_matrix`	`bool`	`False`	Include a full correlation matrix.
`correlation_method`	`str`	`"pearson"`	`"pearson"`, `"spearman"`, or `"kendall"`.
`correlation_threshold`	float or `None`	`0.9`	Pair filter. Applied to absolute correlation when `absolute=True` in `feature_correlation(...)` (the default). `None` returns all non-missing pairs.
`correlation_min_periods`	positive int	`3`	Minimum overlapping observations for correlation.
`correlation_order`	`str`	`"original"`	`"original"` or `"clustered"` for the full correlation matrix.
`correlation_scope`	`str`	`"all"`	`"all"`, `"within_block"`, or `"cross_block"`. Block comes from feature metadata `block`, then operation/source fallback.
`target`	Series, DataFrame, array-like, string, or `None`	`None`	Target used by `include_target_correlation`. A string refers to a column in `data`.
`include_target_correlation`	`bool`	`False`	Include feature-to-target correlation rows.
`high_missing_threshold`	float	`0.5`	Features with missing-rate above this value are flagged in `overview`.
`include_factors`	`bool`	`True`	Include PCA/factor/component diagnostics.
`include_factor_variance`	`bool`	`True`	Include scree/cumulative-variance table for detected factor columns.
`include_factor_loadings`	`bool`	`False`	Include source-factor correlation loadings. Use `factor_source_data` for original source variables.
`include_factor_timeseries`	`bool`	`False`	Include long-form factor-score time series.
`factor_source_data`	optional	`None`	Original source variables for `include_factor_loadings`; needed when `data` contains only factor scores.
`include_lags`	`bool`	`True`	Include lag/window diagnostics.
`include_lag_autocorrelation`	`bool`	`False`	Include ACF table for detected lag/window columns.
`include_lag_correlation_decay`	`bool`	`False`	Include lag-correlation decay against target or lag-0/current source columns.
`include_marx`	`bool`	`True`	Include MARX-style moving-average lag diagnostics.
`include_marx_weight_decay`	`bool`	`True`	Include equal lag weights implied by MARX moving-average windows.
`include_stage_distribution_shift`	`bool`	`True`	When `stages` is supplied, include adjacent-stage distribution-shift diagnostics.
`selections`	mapping, sequence, DataFrame, or `None`	`None`	Feature selections by origin/fold/window for stability counts.
`selection_similarity_metric`	`str` or `None`	`None`	`"jaccard"` or `"kuncheva"` for pairwise selection similarity.

Output#

Returns FeatureDiagnosticReport.

Field	Type	Meaning
`overview`	`dict`	Shape, date range, missingness, zero-variance features, operation/source counts, and feature-metadata coverage.
`correlation`	`DataFrame` or `None`	Long-form feature pairs above the requested correlation threshold.
`correlation_matrix`	`DataFrame` or `None`	Full correlation matrix, optionally cluster-ordered.
`target_correlation`	`DataFrame` or `None`	Feature-to-target correlation rows.
`factors`	`DataFrame` or `None`	PCA/factor/component feature diagnostics.
`factor_variance`	`DataFrame` or `None`	Scree-style variance and cumulative variance share.
`factor_loadings`	`DataFrame` or `None`	Source-factor correlations for loading heatmaps.
`factor_timeseries`	`DataFrame` or `None`	Long-form factor/component values by date.
`lags`	`DataFrame` or `None`	Lag/window feature diagnostics.
`lag_autocorrelation`	`DataFrame` or `None`	ACF/PACF-style autocorrelation table for lag features.
`lag_correlation_decay`	`DataFrame` or `None`	Correlation decay by lag/window.
`marx`	`DataFrame` or `None`	MARX-style moving-average lag diagnostics.
`marx_weight_decay`	`DataFrame` or `None`	Equal lag weights implied by MARX windows.
`selection_stability`	`DataFrame` or `None`	Per-feature selection frequency across origins/folds/windows.
`selection_similarity`	`DataFrame` or `None`	Pairwise Jaccard or Kuncheva stability across origins/folds/windows.
`stage_comparison`	`DataFrame` or `None`	Shape/missingness/column-delta comparison across named feature stages.
`stage_distribution_shift`	`DataFrame` or `None`	Adjacent-stage mean, standard-deviation, missingness, and KS-statistic shifts.
`metadata`	`dict`	Input metadata plus a compact `feature_analysis` stage.

FeatureDiagnosticReport.to_dict() converts tables to JSON-ready nested dictionaries/lists.

Metadata#

diagnose_features(...) attaches one compact stage:

diagnostic.metadata["feature_analysis"]

The stage records:

Key	Meaning
`overview`	Compact counts: observations, features, missing cells, high-missing feature count, zero-variance feature count.
`options`	Correlation, factor, lag, MARX, selection, and stage-comparison choices.
`tables`	Number of rows generated by each diagnostic table.

Returned diagnostic DataFrames also carry attrs["macroforecast_metadata"] == diagnostic.metadata.

Helper Functions#

feature_overview#

macroforecast.feature_analysis.feature_overview(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    high_missing_threshold: float = 0.5,
) -> dict

Returns one compact dictionary. It is the quickest check for whether the feature matrix is sparse, constant, or missing feature metadata.
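
The kind of checks this overview performs can be sketched with pandas. The dictionary keys below are illustrative and are not the library's actual key names; `overview_sketch` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def overview_sketch(X: pd.DataFrame, high_missing_threshold: float = 0.5) -> dict:
    """Compact sparsity and constancy summary of a feature matrix."""
    missing_rate = X.isna().mean()
    variance = X.var(ddof=0)  # NaN values are skipped by default
    return {
        "n_obs": int(X.shape[0]),
        "n_features": int(X.shape[1]),
        "missing_cells": int(X.isna().sum().sum()),
        "high_missing_features": sorted(
            missing_rate[missing_rate > high_missing_threshold].index
        ),
        "zero_variance_features": sorted(variance[variance == 0].index),
    }


toy_X = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing
        "c": [5.0, 5.0, 5.0, 5.0, 5.0],           # constant
    },
    index=pd.date_range("2020-01-01", periods=5, freq="MS", name="date"),
)
toy_overview = overview_sketch(toy_X)
```

With the default threshold of 0.5, column `b` is flagged as high-missing and column `c` as zero-variance.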

compare_feature_stages#

macroforecast.feature_analysis.compare_feature_stages(
    stages: Mapping[str, object] | None = None,
    **named_stages,
) -> pandas.DataFrame

Compares named feature-like panels in order. The table reports observations, feature counts, missingness, zero-variance counts, and column additions/removals relative to the previous stage.

Example:

comparison = mf.feature_analysis.compare_feature_stages(
    {
        "base": processed.panel[["PAYEMS", "INDPRO"]],
        "lagged": mf.feature_engineering.lag(processed, columns=["PAYEMS"], lags=(0, 1, 2)),
    }
)
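
To make the per-stage bookkeeping concrete, here is a rough sketch of the same idea on plain DataFrames. The output column names (`columns_added`, `columns_removed`, and so on) are assumptions for illustration, not the library's schema.

```python
import pandas as pd


def compare_stages_sketch(stages: dict) -> pd.DataFrame:
    """Report shape, missingness, and column deltas for ordered stages."""
    rows, prev_cols = [], None
    for name, X in stages.items():
        cols = set(X.columns)
        rows.append({
            "stage": name,
            "n_obs": X.shape[0],
            "n_features": X.shape[1],
            "missing_rate": float(X.isna().to_numpy().mean()),
            # The first stage has no predecessor, so its deltas are empty.
            "columns_added": sorted(cols - prev_cols) if prev_cols is not None else [],
            "columns_removed": sorted(prev_cols - cols) if prev_cols is not None else [],
        })
        prev_cols = cols
    return pd.DataFrame(rows)


idx = pd.date_range("2020-01-01", periods=4, freq="MS", name="date")
base = pd.DataFrame(
    {"PAYEMS": [1.0, 2.0, 3.0, 4.0], "INDPRO": [4.0, 3.0, 2.0, 1.0]}, index=idx
)
lagged = pd.DataFrame(
    {"PAYEMS_lag0": base["PAYEMS"], "PAYEMS_lag1": base["PAYEMS"].shift(1)}
)
toy_comparison = compare_stages_sketch({"base": base, "lagged": lagged})
```

The lagged stage shows one missing cell out of eight, introduced by the first lag, and lists the two lag columns as additions.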

stage_distribution_shift#

macroforecast.feature_analysis.stage_distribution_shift(
    stages: Mapping[str, object] | None = None,
    *,
    columns=None,
    min_obs: int = 3,
    **named_stages,
) -> pandas.DataFrame

Compares adjacent named stages column by column. Output columns include stage_a, stage_b, feature, observation counts, means, standard deviations, mean_shift, sd_ratio, missing-rate shift, and a two-sample KS-statistic. Use it to check whether scaling, lag construction, factor construction, or selection changed feature distributions unexpectedly.
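
The per-column shift statistics can be sketched with NumPy. The two-sample KS statistic below is the standard maximum gap between empirical CDFs; the sign convention for `mean_shift` (after minus before) and the use of sample standard deviations are assumptions, and both function names are hypothetical.

```python
import numpy as np


def ks_statistic(a, b) -> float:
    """Two-sample KS statistic: largest absolute gap between empirical CDFs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    if a.size == 0 or b.size == 0:
        return float("nan")
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def shift_summary(before, after) -> dict:
    """Mean shift, standard-deviation ratio, and KS statistic for one column."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    return {
        "mean_shift": float(np.nanmean(after) - np.nanmean(before)),
        "sd_ratio": float(np.nanstd(after, ddof=1) / np.nanstd(before, ddof=1)),
        "ks_statistic": ks_statistic(before, after),
    }


# A pure location shift: same spread, mean moved by 2.
toy_shift = shift_summary([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
```

A pure location shift like this one leaves `sd_ratio` at 1 while `mean_shift` and the KS statistic both move, which is the pattern to expect from recentering but not from rescaling.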

feature_correlation#

macroforecast.feature_analysis.feature_correlation(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    method: str = "pearson",
    min_periods: int = 3,
    threshold: float | None = 0.9,
    absolute: bool = True,
    max_pairs: int | None = None,
    scope: str = "all",
    block_column: str = "block",
) -> pandas.DataFrame

Returns long-form pairs:

Column	Meaning
`feature_a`, `feature_b`	Pair names.
`correlation`, `abs_correlation`	Signed and absolute correlation.
`block_a`, `block_b`	Block labels from feature metadata when available.
`operation_a`, `operation_b`	Feature operations from metadata when available.
`source_a`, `source_b`	Source columns from metadata when available.

Use threshold=None for a full long-form correlation table. Use scope="within_block" or scope="cross_block" to restrict pairs using metadata blocks.
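
As an illustration of the long-form pair table, the sketch below builds it from `DataFrame.corr`. It omits the metadata columns, and whether the library compares with `>=` or `>` at the threshold is not stated here, so the `>=` is an assumption. `high_corr_pairs` is a hypothetical name.

```python
import pandas as pd


def high_corr_pairs(X, threshold=0.9, method="pearson", min_periods=3) -> pd.DataFrame:
    """List each unordered feature pair once, filtered by absolute correlation."""
    corr = X.corr(method=method, min_periods=min_periods)
    cols = list(corr.columns)
    rows = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.isna(r):
                continue  # too few overlapping observations, or a constant column
            if threshold is None or abs(r) >= threshold:
                rows.append({
                    "feature_a": a,
                    "feature_b": b,
                    "correlation": float(r),
                    "abs_correlation": float(abs(r)),
                })
    return pd.DataFrame(
        rows, columns=["feature_a", "feature_b", "correlation", "abs_correlation"]
    )


toy_X = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],   # exact multiple of x
    "z": [1.0, -1.0, 1.0, -1.0, 1.0],  # uncorrelated with x and y
})
toy_pairs = high_corr_pairs(toy_X)
toy_all_pairs = high_corr_pairs(toy_X, threshold=None)
```

With the default threshold only the `x`/`y` pair survives; with `threshold=None` all three pairs are returned.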

feature_target_correlation#

macroforecast.feature_analysis.feature_target_correlation(
    data,
    target,
    *,
    feature_metadata=None,
    method: str = "pearson",
    min_periods: int = 3,
    absolute: bool = True,
    max_features: int | None = None,
) -> pandas.DataFrame

Returns one row per feature with correlation against the supplied target. Output columns include feature, target, correlation, abs_correlation, operation, source, block, and n_obs.

feature_correlation_matrix#

macroforecast.feature_analysis.feature_correlation_matrix(
    data,
    *,
    method: str = "pearson",
    min_periods: int = 3,
    order: str = "original",
    absolute_distance: bool = True,
) -> pandas.DataFrame

Returns a square correlation matrix. order="clustered" reorders rows and columns so highly correlated features are adjacent; this is the callable table behind a clustered heatmap.

factor_diagnostics#

macroforecast.feature_analysis.factor_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    operations: Sequence[str] = (...),
    prefixes: Sequence[str] = ("pc", "factor", "maf"),
) -> pandas.DataFrame

Detects factor/component features using either feature metadata (operation in {"pca", "group_pca", "maf", ...} or a non-null component) or name prefixes such as pc1, factor1, and maf1.

Returned columns include feature, group, operation, block, source, component, n_obs, missing_rate, mean, sd, variance, and variance_share. variance_share is a diagnostic share of variance within the detected factor group. It is not the PCA model’s explained-variance ratio unless the upstream transform recorded that exact quantity.

factor_variance#

macroforecast.feature_analysis.factor_variance(data, *, feature_metadata=None)

Returns scree-style rows with variance_share and cumulative_variance_share. This is the callable table behind scree and cumulative-variance views.
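
A sketch of how such a scree-style table can be derived from factor columns alone follows. It assumes sample variances and keeps the original column order; as noted for `factor_diagnostics`, this share is a within-group diagnostic and not necessarily the PCA model's explained-variance ratio. `factor_variance_sketch` is a hypothetical name.

```python
import pandas as pd


def factor_variance_sketch(factors: pd.DataFrame) -> pd.DataFrame:
    """Variance share and cumulative share across the given factor columns."""
    variance = factors.var(ddof=1)
    share = variance / variance.sum()
    return pd.DataFrame({
        "factor": list(variance.index),
        "variance": variance.to_numpy(),
        "variance_share": share.to_numpy(),
        "cumulative_variance_share": share.cumsum().to_numpy(),
    })


toy_factors = pd.DataFrame({
    "pc1": [-2.0, -1.0, 0.0, 1.0, 2.0],  # sample variance 2.5
    "pc2": [1.0, -1.0, 0.0, 1.0, -1.0],  # sample variance 1.0
    "pc3": [-1.0, 0.0, 0.0, 0.0, 1.0],   # sample variance 0.5
})
toy_scree = factor_variance_sketch(toy_factors)
```

The toy factors give shares of 0.625, 0.25, and 0.125, so the cumulative share reaches 1.0 at the third factor.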

factor_loadings#

macroforecast.feature_analysis.factor_loadings(
    data,
    *,
    source_data=None,
    feature_metadata=None,
    method="pearson",
    max_sources=None,
)

Approximates factor loadings as correlations between source variables and factor columns. Supply source_data when data contains only factor scores. Returned rows are long-form: factor, source, loading, abs_loading.

factor_timeseries#

macroforecast.feature_analysis.factor_timeseries(
    data,
    *,
    feature_metadata=None,
    operations=(...),
    prefixes=("pc", "factor", "maf"),
    max_factors=None,
) -> pandas.DataFrame

Returns detected factor/component columns in long time-series form. Output columns are date, factor, value, group, operation, component, and source. Use this for factor-score line plots or factor stability checks without reconstructing the feature metadata manually.

lag_diagnostics#

macroforecast.feature_analysis.lag_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    operations: Sequence[str] = (...),
) -> pandas.DataFrame

Detects lag/window features using metadata fields lag, window, operation, or feature names such as x_lag3, x_roll6_mean, and x_ma4_lag1.

Returned columns include feature, operation, source, lag, window, n_obs, missing_rate, first_valid, and last_valid.
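
The name-based fallback can be illustrated with regular expressions. The patterns and the `operation` labels below are guesses modeled on the three example names above, not the library's actual detection rules; `parse_lag_name` is hypothetical.

```python
import re

# Order matters: the MARX-style pattern must be tried before the plain lag
# pattern, otherwise "x_ma4_lag1" would parse as source "x_ma4" with lag 1.
_PATTERNS = [
    ("ma_lag", re.compile(r"^(?P<source>.+)_ma(?P<window>\d+)_lag(?P<lag>\d+)$")),
    ("lag", re.compile(r"^(?P<source>.+)_lag(?P<lag>\d+)$")),
    ("roll", re.compile(r"^(?P<source>.+)_roll(?P<window>\d+)_(?P<stat>[a-z]+)$")),
]


def parse_lag_name(name: str):
    """Return parsed lag/window fields, or None if the name is not lag-like."""
    for operation, pattern in _PATTERNS:
        match = pattern.match(name)
        if match is None:
            continue
        groups = match.groupdict()
        lag = groups.get("lag")
        window = groups.get("window")
        return {
            "feature": name,
            "operation": operation,
            "source": groups["source"],
            "lag": int(lag) if lag is not None else None,
            "window": int(window) if window is not None else None,
        }
    return None


parsed = [parse_lag_name(n) for n in ("x_lag3", "x_roll6_mean", "x_ma4_lag1", "pc1")]
```

Names that match none of the patterns, such as `pc1`, come back as `None` and would be left to metadata-based detection.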

lag_autocorrelation#

macroforecast.feature_analysis.lag_autocorrelation(
    data,
    *,
    max_lag: int = 12,
    kind: str = "acf",
) -> pandas.DataFrame

Returns ACF or PACF values for detected lag/window feature columns. This is the callable table behind autocorrelation-per-lag and partial-autocorrelation views.
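
For reference, a sample autocorrelation function can be computed as below. This uses the common biased estimator (full-sample mean and variance in every term); which estimator the library uses is not stated here, so this is only a sketch with a hypothetical name, `acf_sketch`.

```python
def acf_sketch(values, max_lag=12):
    """Sample ACF for lags 0..max_lag using full-sample mean and variance."""
    x = [float(v) for v in values]
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    if denom == 0.0:
        # A constant series has no defined autocorrelation.
        return [float("nan")] * (min(max_lag, n - 1) + 1)
    out = []
    for k in range(0, min(max_lag, n - 1) + 1):
        num = sum(dev[t] * dev[t - k] for t in range(k, n))
        out.append(num / denom)
    return out


toy_acf = acf_sketch([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
```

For the short linear ramp in the example, the values at lags 0, 1, and 2 are 1.0, 0.4, and -0.1.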

lag_correlation_decay#

macroforecast.feature_analysis.lag_correlation_decay(
    data,
    *,
    target=None,
    method="pearson",
) -> pandas.DataFrame

Returns correlation decay by lag/window. If target is supplied, each lag feature is correlated with that target. Otherwise, each lag feature is compared with its same-source lag-0/current column when available.

marx_diagnostics#

macroforecast.feature_analysis.marx_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
) -> pandas.DataFrame

Detects MARX-style columns named like x_ma4_lag1. These are moving-average lag features, not PCA. The returned table adds marx_formula using the recorded starting lag and window. For example, for x_ma4_lag1 the formula is:

mean(x[t-1]...x[t-4])

For x_ma4_lag2, the formula is mean(x[t-2]...x[t-5]).

marx_weight_decay#

macroforecast.feature_analysis.marx_weight_decay(
    data,
    *,
    feature_metadata=None,
) -> pandas.DataFrame

Returns the equal lag weights implied by each MARX moving-average feature. For x_ma4_lag1, the table has four rows with weight 0.25 for lags 1 through 4 and cumulative weights from 0.25 to 1.0. For x_ma4_lag2, the lag rows are 2 through 5, with the same equal weights.
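
The formula string and the weight rows follow directly from the window and starting lag. A small sketch, with hypothetical helper names and row keys chosen for illustration:

```python
def marx_formula(source: str, window: int, start_lag: int) -> str:
    """Formula string for a moving average of `window` lags from `start_lag`."""
    last_lag = start_lag + window - 1
    return f"mean({source}[t-{start_lag}]...{source}[t-{last_lag}])"


def marx_weights(window: int, start_lag: int) -> list:
    """One row per lag with its equal weight and the running cumulative weight."""
    weight = 1.0 / window
    return [
        {"lag": start_lag + i, "weight": weight, "cumulative_weight": (i + 1) * weight}
        for i in range(window)
    ]


formula_lag1 = marx_formula("x", window=4, start_lag=1)
formula_lag2 = marx_formula("x", window=4, start_lag=2)
rows_lag2 = marx_weights(window=4, start_lag=2)
```

For `x_ma4_lag2` this yields lags 2 through 5, each with weight 0.25, matching the description above.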

selection_stability#

macroforecast.feature_analysis.selection_stability(
    selections,
    *,
    all_features: Iterable[str] | None = None,
) -> pandas.DataFrame

Accepts any of these inputs:

Input form	Example
Mapping of origin to selected names	`{"2020-01": ["x1", "x2"], "2020-02": ["x2"]}`
Sequence of selected-name iterables	`[["x1"], ["x1", "x3"]]`
Indicator DataFrame	rows are origins, columns are features, truthy values mean selected
Long DataFrame	columns `feature`, `selected`, and optionally `origin`, `window`, `fold`, or `split`

The result is indexed by feature and includes selected_count, selection_rate, n_origins, first_selected_origin, and last_selected_origin.
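
For the mapping input form, the per-feature counts can be sketched in plain Python. The function returns a nested dictionary rather than a DataFrame to stay dependency-free, and it assumes the mapping's insertion order is the origin order; `selection_stability_sketch` is a hypothetical name.

```python
def selection_stability_sketch(selections: dict) -> dict:
    """Per-feature selection counts and rates across ordered origins."""
    origins = list(selections)
    n_origins = len(origins)
    table = {}
    for origin in origins:
        # dict.fromkeys drops duplicate names within one origin, keeping order.
        for feature in dict.fromkeys(selections[origin]):
            row = table.setdefault(
                feature, {"selected_count": 0, "first_selected_origin": origin}
            )
            row["selected_count"] += 1
            row["last_selected_origin"] = origin
    for row in table.values():
        row["n_origins"] = n_origins
        row["selection_rate"] = row["selected_count"] / n_origins
    return table


toy_stability = selection_stability_sketch(
    {"2020-01": ["x1", "x2"], "2020-02": ["x2"]}
)
```

In this example `x2` is selected at both origins (rate 1.0), while `x1` appears only at the first origin (rate 0.5).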

selection_similarity#

macroforecast.feature_analysis.selection_similarity(
    selections,
    *,
    metric: str = "jaccard",
    all_features=None,
    n_features=None,
) -> pandas.DataFrame

Returns pairwise stability across origins/folds/windows. metric="jaccard" uses overlap divided by union. metric="kuncheva" adjusts overlap for expected random overlap using the declared or inferred feature universe size. Kuncheva stability is defined only for a fixed selection size: when two windows select different numbers of features, the score is missing, but the output still reports selected_a, selected_b, and overlap.
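
Both metrics are short enough to write out. The sketch below uses the standard definitions: Jaccard is overlap over union, and Kuncheva's index for two sets of equal size k drawn from n features is (r·n − k²) / (k·(n − k)), where r is the overlap. Returning NaN for undefined cases is an assumption about presentation, and the function names are hypothetical.

```python
def jaccard(a, b) -> float:
    """Overlap divided by union; NaN when both selections are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return float("nan")
    return len(a & b) / len(union)


def kuncheva(a, b, n_features: int) -> float:
    """Chance-corrected overlap for two equal-size selections."""
    a, b = set(a), set(b)
    k = len(a)
    if k != len(b) or k == 0 or k >= n_features:
        # Undefined for unequal sizes, empty sets, or a full-universe selection.
        return float("nan")
    overlap = len(a & b)
    return (overlap * n_features - k * k) / (k * (n_features - k))


sel_a = ["x1", "x2", "x3"]
sel_b = ["x2", "x3", "x4"]
toy_jaccard = jaccard(sel_a, sel_b)                   # 2 shared of 4 distinct
toy_kuncheva = kuncheva(sel_a, sel_b, n_features=10)  # (2*10 - 9) / (3*7)
toy_unequal = kuncheva(["x1"], ["x1", "x2"], n_features=10)
```

Because Kuncheva subtracts the overlap expected by chance, its value depends on the universe size, whereas Jaccard does not.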

custom_feature_diagnostic#

macroforecast.feature_analysis.custom_feature_diagnostic(
    data,
    func,
    *,
    name=None,
    feature_metadata=None,
    metadata=None,
    **params,
) -> pandas.DataFrame

Runs one user diagnostic on a feature matrix or FeatureSet. This is for inspection only; it does not create new predictors.

Callable signature:

func(X, *, feature_metadata=None, metadata=None, **params)

Accepted callable outputs are DataFrame, Series, mapping, or a sequence convertible to a DataFrame. The returned table carries:

Attr	Meaning
`macroforecast_metadata_schema.kind`	Always `custom_feature_diagnostic`.
`macroforecast_metadata_schema.method`	`name` or callable name.
`macroforecast_metadata`	Input metadata plus a `custom_feature_diagnostic` stage.

Example:

def block_missingness(X, *, feature_metadata=None, metadata=None, block="all"):
    return pd.DataFrame(
        [{"block": block, "missing_rate": float(X.isna().mean().mean())}]
    )

diag = mf.feature_analysis.custom_feature_diagnostic(
    features,
    block_missingness,
    name="block_missingness",
    block="rates",
)

Boundary#

Question	Use
Create predictors and target matrices	`mf.feature_engineering`
Inspect feature matrix quality and metadata	`mf.feature_analysis`
Compare raw and preprocessed panels	`mf.data_analysis`
Inspect fitted model residuals or tuning trace	`mf.forecast_analysis`