macroforecast.feature_analysis#

Back to reference

macroforecast.feature_analysis inspects feature matrices after macroforecast.feature_engineering. It does not create new predictors and does not fit a forecasting model. Its job is to make the constructed X auditable: missingness, high correlations, PCA/factor columns, lag/MARX structure, feature stage changes, and feature-selection stability.

macroforecast.feature_diagnostic remains available as a compatibility alias.

Accepted feature inputs are:

Input

Meaning

FeatureSet

Uses FeatureSet.X, FeatureSet.feature_metadata, and FeatureSet.metadata.

pandas.DataFrame

Uses the frame as X; reads attrs["macroforecast_feature_metadata"] when present.

DataBundle, DataSpec, (DataFrame, metadata)

Uses the panel as the inspected matrix and carries metadata forward.

The DataFrame input must satisfy the canonical macroforecast panel contract: DatetimeIndex named "date", sorted, no duplicate dates, numeric columns, finite values or NaN, and non-empty shape.

Learned filters and feature builders can also produce weight matrices. For example, filters.albama() returns AlbaMAResult.weights, where rows are source observations and columns are target feature dates. feature_analysis owns the diagnostic summaries of those weights:

Callable

Input

Output

Purpose

effective_window(weights, threshold=1e-12)

square source-by-target weight matrix

Series

Count nonzero source observations used at each target date.

recent_weight_share(weights, mode="one_sided")

square source-by-target weight matrix

DataFrame

Summarize weight mass in recent lag/lead buckets.

Public Flow#

import macroforecast as mf

processed = mf.preprocessing.reprocess(data_spec)
features = mf.feature_engineering.feature_spec(
    target="INDPRO",
    horizons=(1, 3, 6),
    predictors="all",
    lags=(0, 1, 2, 3),
    pca_components=8,
).fit_transform(processed)

diagnostic = mf.feature_analysis.diagnose_features(
    features,
    include_correlation=True,
    include_correlation_matrix=True,
    include_lag_autocorrelation=True,
    selections={"origin_1": ["pc1", "PAYEMS_lag0"]},
    selection_similarity_metric="jaccard",
)

albama = mf.filters.albama(inflation, mode="one_sided")
window = mf.feature_analysis.effective_window(albama.weights)
shares = mf.feature_analysis.recent_weight_share(albama.weights, mode="one_sided")

effective_window#

macroforecast.feature_analysis.effective_window(
    weights,
    *,
    threshold=1e-12,
) -> pandas.Series

Input: a square weight matrix whose rows are source observations and whose columns are target feature dates. This is the shape returned by AlbaMAResult.weights.

Output: one value per target date. The value is the number of source observations with absolute weight above threshold.

recent_weight_share#

macroforecast.feature_analysis.recent_weight_share(
    weights,
    *,
    mode="one_sided",
) -> pandas.DataFrame

Input: the same source-by-target weight matrix.

Output for mode="one_sided":

Column

Meaning

y_t

Weight on the target-date observation.

y_t_minus_1_2

Weight on lags 1 and 2.

y_t_minus_3_5

Weight on lags 3 through 5.

y_t_minus_6_plus

Weight on lags 6 and older.

future_weight

Weight assigned to future observations; should be zero for one-sided features.

Output for mode="two_sided" replaces future_weight with forward buckets y_t_plus_1_2 and y_t_plus_3_plus.

diagnose_features#

macroforecast.feature_analysis.diagnose_features(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    stages: Mapping[str, object] | None = None,
    include_correlation: bool = False,
    include_correlation_matrix: bool = False,
    correlation_method: str = "pearson",
    correlation_threshold: float | None = 0.9,
    correlation_min_periods: int = 3,
    correlation_order: str = "original",
    correlation_scope: str = "all",
    target=None,
    include_target_correlation: bool = False,
    high_missing_threshold: float = 0.5,
    include_factors: bool = True,
    include_factor_variance: bool = True,
    include_factor_loadings: bool = False,
    include_factor_timeseries: bool = False,
    factor_source_data=None,
    include_lags: bool = True,
    include_lag_autocorrelation: bool = False,
    include_lag_correlation_decay: bool = False,
    include_marx: bool = True,
    include_marx_weight_decay: bool = True,
    include_stage_distribution_shift: bool = True,
    selections: Mapping | Sequence | pandas.DataFrame | None = None,
    selection_similarity_metric: str | None = None,
) -> FeatureDiagnosticReport

Input#

Name

Type

Default

Choices

data

feature input

required

FeatureSet, DataFrame, DataBundle, DataSpec, or (DataFrame, metadata).

feature_metadata

DataFrame or None

auto

Overrides metadata stored on the input.

stages

mapping or None

None

Named feature-like panels to compare in construction order.

include_correlation

bool

False

Whether to compute high-correlation feature pairs.

include_correlation_matrix

bool

False

Include a full correlation matrix.

correlation_method

str

"pearson"

"pearson", "spearman", or "kendall".

correlation_threshold

float or None

0.9

Pair filter. Uses absolute correlation when feature_correlation(..., absolute=True). None returns all non-missing pairs.

correlation_min_periods

positive int

3

Minimum overlapping observations for correlation.

correlation_order

str

"original"

"original" or "clustered" for the full correlation matrix.

correlation_scope

str

"all"

"all", "within_block", or "cross_block". Block comes from feature metadata block, then operation/source fallback.

target

Series, DataFrame, array-like, string, or None

None

Target used by include_target_correlation. A string refers to a column in data.

include_target_correlation

bool

False

Include feature-to-target correlation rows.

high_missing_threshold

float

0.5

Features with missing-rate above this value are flagged in overview.

include_factors

bool

True

Include PCA/factor/component diagnostics.

include_factor_variance

bool

True

Include scree/cumulative-variance table for detected factor columns.

include_factor_loadings

bool

False

Include source-factor correlation loadings. Use factor_source_data for original source variables.

include_factor_timeseries

bool

False

Include long-form factor-score time series.

include_lags

bool

True

Include lag/window diagnostics.

include_lag_autocorrelation

bool

False

Include ACF table for detected lag/window columns.

include_lag_correlation_decay

bool

False

Include lag-correlation decay against target or lag-0/current source columns.

include_marx

bool

True

Include MARX-style moving-average lag diagnostics.

include_marx_weight_decay

bool

True

Include equal lag weights implied by MARX moving-average windows.

include_stage_distribution_shift

bool

True

When stages is supplied, include adjacent-stage distribution-shift diagnostics.

selections

mapping, sequence, DataFrame, or None

None

Feature selections by origin/fold/window for stability counts.

selection_similarity_metric

str or None

None

"jaccard" or "kuncheva" for pairwise selection similarity.

Output#

Returns FeatureDiagnosticReport.

Field

Type

Meaning

overview

dict

Shape, date range, missingness, zero-variance features, operation/source counts, and feature-metadata coverage.

correlation

DataFrame or None

Long-form feature pairs above the requested correlation threshold.

correlation_matrix

DataFrame or None

Full correlation matrix, optionally cluster-ordered.

target_correlation

DataFrame or None

Feature-to-target correlation rows.

factors

DataFrame or None

PCA/factor/component feature diagnostics.

factor_variance

DataFrame or None

Scree-style variance and cumulative variance share.

factor_loadings

DataFrame or None

Source-factor correlations for loading heatmaps.

factor_timeseries

DataFrame or None

Long-form factor/component values by date.

lags

DataFrame or None

Lag/window feature diagnostics.

lag_autocorrelation

DataFrame or None

ACF/PACF style lag-feature autocorrelation table.

lag_correlation_decay

DataFrame or None

Correlation decay by lag/window.

marx

DataFrame or None

MARX-style moving-average lag diagnostics.

marx_weight_decay

DataFrame or None

Equal lag weights implied by MARX windows.

selection_stability

DataFrame or None

Per-feature selection frequency across origins/folds/windows.

selection_similarity

DataFrame or None

Pairwise Jaccard or Kuncheva stability across origins/folds/windows.

stage_comparison

DataFrame or None

Shape/missingness/column-delta comparison across named feature stages.

stage_distribution_shift

DataFrame or None

Adjacent-stage mean, standard-deviation, missingness, and KS-statistic shifts.

metadata

dict

Input metadata plus a compact feature_analysis stage.

FeatureDiagnosticReport.to_dict() converts tables to JSON-ready nested dictionaries/lists.

Metadata#

diagnose_features(...) attaches one compact stage:

diagnostic.metadata["feature_analysis"]

The stage records:

Key

Meaning

overview

Compact counts: observations, features, missing cells, high-missing feature count, zero-variance feature count.

options

Correlation, factor, lag, MARX, selection, and stage-comparison choices.

tables

Number of rows generated by each diagnostic table.

Returned diagnostic DataFrames also carry attrs["macroforecast_metadata"] == diagnostic.metadata.

Helper Functions#

feature_overview#

macroforecast.feature_analysis.feature_overview(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    high_missing_threshold: float = 0.5,
) -> dict

Returns one compact dictionary. It is the quickest check for whether the feature matrix is sparse, constant, or missing feature metadata.

compare_feature_stages#

macroforecast.feature_analysis.compare_feature_stages(
    stages: Mapping[str, object] | None = None,
    **named_stages,
) -> pandas.DataFrame

Compares named feature-like panels in order. The table reports observations, feature counts, missingness, zero-variance counts, and column additions/removals relative to the previous stage.

Example:

comparison = mf.feature_analysis.compare_feature_stages(
    {
        "base": processed.panel[["PAYEMS", "INDPRO"]],
        "lagged": mf.feature_engineering.lag(processed, columns=["PAYEMS"], lags=(0, 1, 2)),
    }
)

stage_distribution_shift#

macroforecast.feature_analysis.stage_distribution_shift(
    stages: Mapping[str, object] | None = None,
    *,
    columns=None,
    min_obs: int = 3,
    **named_stages,
) -> pandas.DataFrame

Compares adjacent named stages column by column. Output columns include stage_a, stage_b, feature, observation counts, means, standard deviations, mean_shift, sd_ratio, missing-rate shift, and a two-sample KS-statistic. Use it to check whether scaling, lag construction, factor construction, or selection changed feature distributions unexpectedly.

feature_correlation#

macroforecast.feature_analysis.feature_correlation(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    method: str = "pearson",
    min_periods: int = 3,
    threshold: float | None = 0.9,
    absolute: bool = True,
    max_pairs: int | None = None,
    scope: str = "all",
    block_column: str = "block",
) -> pandas.DataFrame

Returns long-form pairs:

Column

Meaning

feature_a, feature_b

Pair names.

correlation, abs_correlation

Signed and absolute correlation.

block_a, block_b

Block labels from feature metadata when available.

operation_a, operation_b

Feature operations from metadata when available.

source_a, source_b

Source columns from metadata when available.

Use threshold=None for a full long-form correlation table. Use scope="within_block" or scope="cross_block" to restrict pairs using metadata blocks.

feature_target_correlation#

macroforecast.feature_analysis.feature_target_correlation(
    data,
    target,
    *,
    feature_metadata=None,
    method: str = "pearson",
    min_periods: int = 3,
    absolute: bool = True,
    max_features: int | None = None,
) -> pandas.DataFrame

Returns one row per feature with correlation against the supplied target. Output columns include feature, target, correlation, abs_correlation, operation, source, block, and n_obs.

feature_correlation_matrix#

macroforecast.feature_analysis.feature_correlation_matrix(
    data,
    *,
    method: str = "pearson",
    min_periods: int = 3,
    order: str = "original",
    absolute_distance: bool = True,
) -> pandas.DataFrame

Returns a square correlation matrix. order="clustered" reorders rows and columns so highly correlated features are adjacent; this is the callable table behind a clustered heatmap.

factor_diagnostics#

macroforecast.feature_analysis.factor_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    operations: Sequence[str] = (...),
    prefixes: Sequence[str] = ("pc", "factor", "maf"),
) -> pandas.DataFrame

Detects factor/component features using either feature metadata (operation in {"pca", "group_pca", "maf", ...} or a non-null component) or name prefixes such as pc1, factor1, and maf1.

Returned columns include feature, group, operation, block, source, component, n_obs, missing_rate, mean, sd, variance, and variance_share. variance_share is a diagnostic share of variance within the detected factor group. It is not the PCA model’s explained-variance ratio unless the upstream transform recorded that exact quantity.

factor_variance#

macroforecast.feature_analysis.factor_variance(data, *, feature_metadata=None)

Returns scree-style rows with variance_share and cumulative_variance_share. This is the callable table behind scree and cumulative-variance views.

factor_loadings#

macroforecast.feature_analysis.factor_loadings(
    data,
    *,
    source_data=None,
    feature_metadata=None,
    method="pearson",
    max_sources=None,
)

Approximates factor loadings as correlations between source variables and factor columns. Supply source_data when data contains only factor scores. Returned rows are long-form: factor, source, loading, abs_loading.

factor_timeseries#

macroforecast.feature_analysis.factor_timeseries(
    data,
    *,
    feature_metadata=None,
    operations=(...),
    prefixes=("pc", "factor", "maf"),
    max_factors=None,
) -> pandas.DataFrame

Returns detected factor/component columns in long time-series form. Output columns are date, factor, value, group, operation, component, and source. Use this for factor-score line plots or factor stability checks without reconstructing the feature metadata manually.

lag_diagnostics#

macroforecast.feature_analysis.lag_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
    operations: Sequence[str] = (...),
) -> pandas.DataFrame

Detects lag/window features using metadata fields lag, window, operation, or feature names such as x_lag3, x_roll6_mean, and x_ma4_lag1.

Returned columns include feature, operation, source, lag, window, n_obs, missing_rate, first_valid, and last_valid.

lag_autocorrelation#

macroforecast.feature_analysis.lag_autocorrelation(
    data,
    *,
    max_lag: int = 12,
    kind: str = "acf",
) -> pandas.DataFrame

Returns ACF or PACF values for detected lag/window feature columns. This is the callable table behind autocorrelation-per-lag and partial-autocorrelation views.

lag_correlation_decay#

macroforecast.feature_analysis.lag_correlation_decay(
    data,
    *,
    target=None,
    method="pearson",
) -> pandas.DataFrame

Returns correlation decay by lag/window. If target is supplied, each lag feature is correlated with that target. Otherwise, each lag feature is compared with its same-source lag-0/current column when available.

marx_diagnostics#

macroforecast.feature_analysis.marx_diagnostics(
    data,
    *,
    feature_metadata: pandas.DataFrame | None = None,
) -> pandas.DataFrame

Detects MARX-style columns named like x_ma4_lag1. These are moving-average lag features, not PCA. The returned table adds marx_formula using the recorded starting lag and window. For example:

mean(x[t-1]...x[t-4])

For x_ma4_lag2, the formula is mean(x[t-2]...x[t-5]).

marx_weight_decay#

macroforecast.feature_analysis.marx_weight_decay(
    data,
    *,
    feature_metadata=None,
) -> pandas.DataFrame

Returns the equal lag weights implied by each MARX moving-average feature. For x_ma4_lag1, the table has four rows with weight 0.25 for lags 1 through 4 and cumulative weights from 0.25 to 1.0. For x_ma4_lag2, the lag rows are 2 through 5, with the same equal weights.

selection_stability#

macroforecast.feature_analysis.selection_stability(
    selections,
    *,
    all_features: Iterable[str] | None = None,
) -> pandas.DataFrame

Accepts any of these inputs:

Input form

Example

Mapping of origin to selected names

{"2020-01": ["x1", "x2"], "2020-02": ["x2"]}

Sequence of selected-name iterables

[["x1"], ["x1", "x3"]]

Indicator DataFrame

rows are origins, columns are features, truthy values mean selected

Long DataFrame

columns feature, selected, and optionally origin, window, fold, or split

The result is indexed by feature and includes selected_count, selection_rate, n_origins, first_selected_origin, and last_selected_origin.

selection_similarity#

macroforecast.feature_analysis.selection_similarity(
    selections,
    *,
    metric: str = "jaccard",
    all_features=None,
    n_features=None,
) -> pandas.DataFrame

Returns pairwise stability across origins/folds/windows. metric="jaccard" uses overlap divided by union. metric="kuncheva" adjusts overlap for expected random overlap using the declared or inferred feature universe size. Kuncheva stability is a fixed-selection-size measure; when two windows select different numbers of features, score is missing and the output still reports selected_a, selected_b, and overlap.

custom_feature_diagnostic#

macroforecast.feature_analysis.custom_feature_diagnostic(
    data,
    func,
    *,
    name=None,
    feature_metadata=None,
    metadata=None,
    **params,
) -> pandas.DataFrame

Runs one user diagnostic on a feature matrix or FeatureSet. This is for inspection only; it does not create new predictors.

Callable signature:

func(X, *, feature_metadata=None, metadata=None, **params)

Accepted callable outputs are DataFrame, Series, mapping, or a sequence convertible to a DataFrame. The returned table carries:

Attr

Meaning

macroforecast_metadata_schema.kind

Always custom_feature_diagnostic.

macroforecast_metadata_schema.method

name or callable name.

macroforecast_metadata

Input metadata plus a custom_feature_diagnostic stage.

Example:

def block_missingness(X, *, feature_metadata=None, metadata=None, block="all"):
    return pd.DataFrame(
        [{"block": block, "missing_rate": float(X.isna().mean().mean())}]
    )

diag = mf.feature_analysis.custom_feature_diagnostic(
    features,
    block_missingness,
    name="block_missingness",
    block="rates",
)

Boundary#

Question

Use

Create predictors and target matrices

mf.feature_engineering

Inspect feature matrix quality and metadata

mf.feature_analysis

Compare raw and preprocessed panels

mf.data_analysis

Inspect fitted model residuals or tuning trace

mf.forecast_analysis