macroforecast.feature_analysis#
macroforecast.feature_analysis inspects feature matrices after
macroforecast.feature_engineering. It does not create new predictors and does
not fit a forecasting model. Its job is to make the constructed X auditable:
missingness, high correlations, PCA/factor columns, lag/MARX structure, feature
stage changes, and feature-selection stability.
macroforecast.feature_diagnostic remains available as a compatibility alias.
Accepted feature inputs are:
| Input | Meaning |
|---|---|
| | Uses |
| | Uses the frame as |
| | Uses the panel as the inspected matrix and carries metadata forward. |
The DataFrame input must satisfy the canonical macroforecast panel contract:
DatetimeIndex named "date", sorted, no duplicate dates, numeric columns,
finite values or NaN, and non-empty shape.
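The contract above can be checked before calling any diagnostic. The following is a minimal sketch using only pandas and NumPy; `check_panel_contract` is a hypothetical helper written for illustration and is not part of macroforecast.

```python
import numpy as np
import pandas as pd


def check_panel_contract(frame: pd.DataFrame) -> list[str]:
    """Return a list of contract violations (an empty list means the frame passes)."""
    problems = []
    if frame.shape[0] == 0 or frame.shape[1] == 0:
        problems.append("empty shape")
    if not isinstance(frame.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    if frame.index.name != "date":
        problems.append('index is not named "date"')
    if not frame.index.is_monotonic_increasing:
        problems.append("index is not sorted")
    if frame.index.has_duplicates:
        problems.append("duplicate dates")
    non_numeric = [c for c in frame.columns if not pd.api.types.is_numeric_dtype(frame[c])]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    elif np.isinf(frame.to_numpy(dtype=float)).any():
        # NaN is allowed by the contract; only +/-inf is rejected here.
        problems.append("infinite values")
    return problems


good = pd.DataFrame(
    {"INDPRO": [1.0, np.nan, 3.0]},
    index=pd.date_range("2020-01-01", periods=3, freq="MS", name="date"),
)
bad = good.rename_axis("time")  # same data, wrong index name

good_problems = check_panel_contract(good)
bad_problems = check_panel_contract(bad)
```

Returning a list of messages rather than raising on the first failure makes it easier to report every violation at once.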
Learned filters and feature builders can also produce weight matrices. For
example, filters.albama() returns AlbaMAResult.weights, where rows
are source observations and columns are target feature dates. feature_analysis
owns the diagnostic summaries of those weights:
| Callable | Input | Output | Purpose |
|---|---|---|---|
| effective_window | square source-by-target weight matrix | pandas.Series | Count nonzero source observations used at each target date. |
| recent_weight_share | square source-by-target weight matrix | | Summarize weight mass in recent lag/lead buckets. |
Public Flow#
import macroforecast as mf

processed = mf.preprocessing.reprocess(data_spec)
features = mf.feature_engineering.feature_spec(
    target="INDPRO",
    horizons=(1, 3, 6),
    predictors="all",
    lags=(0, 1, 2, 3),
    pca_components=8,
).fit_transform(processed)

diagnostic = mf.feature_analysis.diagnose_features(
    features,
    include_correlation=True,
    include_correlation_matrix=True,
    include_lag_autocorrelation=True,
    selections={"origin_1": ["pc1", "PAYEMS_lag0"]},
    selection_similarity_metric="jaccard",
)

albama = mf.filters.albama(inflation, mode="one_sided")
window = mf.feature_analysis.effective_window(albama.weights)
shares = mf.feature_analysis.recent_weight_share(albama.weights, mode="one_sided")
effective_window#
macroforecast.feature_analysis.effective_window(
weights,
*,
threshold=1e-12,
) -> pandas.Series
Input: a square weight matrix whose rows are source observations and whose
columns are target feature dates. This is the shape returned by
AlbaMAResult.weights.
Output: one value per target date. The value is the number of source
observations with absolute weight above threshold.
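As an illustration of the counting rule described above, here is a minimal pandas sketch. It is not the library's implementation; `effective_window_sketch` and the toy weight matrix (a one-sided, equal-weight two-observation average) are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def effective_window_sketch(weights: pd.DataFrame, threshold: float = 1e-12) -> pd.Series:
    """Count, per target date (column), the source rows whose |weight| exceeds threshold."""
    return (weights.abs() > threshold).sum(axis=0)


dates = pd.date_range("2020-01-01", periods=4, freq="MS")

# Toy one-sided weights: each target date averages itself and the previous
# source observation (only itself at the first date).
w = np.zeros((4, 4))
for j in range(4):
    lo = max(0, j - 1)
    w[lo : j + 1, j] = 1.0 / (j - lo + 1)

weights = pd.DataFrame(w, index=dates, columns=dates)
window = effective_window_sketch(weights)
```

Here `window` holds 1 for the first target date and 2 for every later date, because the first column has a single nonzero weight.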
diagnose_features#
macroforecast.feature_analysis.diagnose_features(
data,
*,
feature_metadata: pandas.DataFrame | None = None,
stages: Mapping[str, object] | None = None,
include_correlation: bool = False,
include_correlation_matrix: bool = False,
correlation_method: str = "pearson",
correlation_threshold: float | None = 0.9,
correlation_min_periods: int = 3,
correlation_order: str = "original",
correlation_scope: str = "all",
target=None,
include_target_correlation: bool = False,
high_missing_threshold: float = 0.5,
include_factors: bool = True,
include_factor_variance: bool = True,
include_factor_loadings: bool = False,
include_factor_timeseries: bool = False,
factor_source_data=None,
include_lags: bool = True,
include_lag_autocorrelation: bool = False,
include_lag_correlation_decay: bool = False,
include_marx: bool = True,
include_marx_weight_decay: bool = True,
include_stage_distribution_shift: bool = True,
selections: Mapping | Sequence | pandas.DataFrame | None = None,
selection_similarity_metric: str | None = None,
) -> FeatureDiagnosticReport
Input#
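| Name | Type | Default | Choices |
|---|---|---|---|
| data | feature input | required | |
| feature_metadata | pandas.DataFrame or None | auto | Overrides metadata stored on the input. |
| stages | mapping or None | None | Named feature-like panels to compare in construction order. |
| include_correlation | bool | False | Whether to compute high-correlation feature pairs. |
| include_correlation_matrix | bool | False | Include a full correlation matrix. |
| correlation_method | str | "pearson" | |
| correlation_threshold | float or None | 0.9 | Pair filter. Uses absolute correlation when |
| correlation_min_periods | positive int | 3 | Minimum overlapping observations for correlation. |
| correlation_order | str | "original" | |
| correlation_scope | str | "all" | |
| target | Series, DataFrame, array-like, string, or None | None | Target used by |
| include_target_correlation | bool | False | Include feature-to-target correlation rows. |
| high_missing_threshold | float | 0.5 | Features with missing-rate above this value are flagged in |
| include_factors | bool | True | Include PCA/factor/component diagnostics. |
| include_factor_variance | bool | True | Include scree/cumulative-variance table for detected factor columns. |
| include_factor_loadings | bool | False | Include source-factor correlation loadings. Use |
| include_factor_timeseries | bool | False | Include long-form factor-score time series. |
| include_lags | bool | True | Include lag/window diagnostics. |
| include_lag_autocorrelation | bool | False | Include ACF table for detected lag/window columns. |
| include_lag_correlation_decay | bool | False | Include lag-correlation decay against target or lag-0/current source columns. |
| include_marx | bool | True | Include MARX-style moving-average lag diagnostics. |
| include_marx_weight_decay | bool | True | Include equal lag weights implied by MARX moving-average windows. |
| include_stage_distribution_shift | bool | True | When |
| selections | mapping, sequence, DataFrame, or None | None | Feature selections by origin/fold/window for stability counts. |
| selection_similarity_metric | str or None | None | |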
Output#
Returns FeatureDiagnosticReport.
| Field | Type | Meaning |
|---|---|---|
| | | Shape, date range, missingness, zero-variance features, operation/source counts, and feature-metadata coverage. |
| | | Long-form feature pairs above the requested correlation threshold. |
| | | Full correlation matrix, optionally cluster-ordered. |
| | | Feature-to-target correlation rows. |
| | | PCA/factor/component feature diagnostics. |
| | | Scree-style variance and cumulative variance share. |
| | | Source-factor correlations for loading heatmaps. |
| | | Long-form factor/component values by date. |
| | | Lag/window feature diagnostics. |
| | | ACF/PACF style lag-feature autocorrelation table. |
| | | Correlation decay by lag/window. |
| | | MARX-style moving-average lag diagnostics. |
| | | Equal lag weights implied by MARX windows. |
| | | Per-feature selection frequency across origins/folds/windows. |
| | | Pairwise Jaccard or Kuncheva stability across origins/folds/windows. |
| | | Shape/missingness/column-delta comparison across named feature stages. |
| | | Adjacent-stage mean, standard-deviation, missingness, and KS-statistic shifts. |
| | | Input metadata plus a compact |
FeatureDiagnosticReport.to_dict() converts tables to JSON-ready nested
dictionaries/lists.
Metadata#
diagnose_features(...) attaches one compact stage:
diagnostic.metadata["feature_analysis"]
The stage records:
| Key | Meaning |
|---|---|
| | Compact counts: observations, features, missing cells, high-missing feature count, zero-variance feature count. |
| | Correlation, factor, lag, MARX, selection, and stage-comparison choices. |
| | Number of rows generated by each diagnostic table. |
Returned diagnostic DataFrames also carry
attrs["macroforecast_metadata"] == diagnostic.metadata.
Helper Functions#
feature_overview#
macroforecast.feature_analysis.feature_overview(
data,
*,
feature_metadata: pandas.DataFrame | None = None,
high_missing_threshold: float = 0.5,
) -> dict
Returns one compact dictionary. It is the quickest check for whether the feature matrix is sparse, constant, or missing feature metadata.
compare_feature_stages#
macroforecast.feature_analysis.compare_feature_stages(
stages: Mapping[str, object] | None = None,
**named_stages,
) -> pandas.DataFrame
Compares named feature-like panels in order. The table reports observations, feature counts, missingness, zero-variance counts, and column additions/removals relative to the previous stage.
Example:
comparison = mf.feature_analysis.compare_feature_stages(
    {
        "base": processed.panel[["PAYEMS", "INDPRO"]],
        "lagged": mf.feature_engineering.lag(processed, columns=["PAYEMS"], lags=(0, 1, 2)),
    }
)
stage_distribution_shift#
macroforecast.feature_analysis.stage_distribution_shift(
stages: Mapping[str, object] | None = None,
*,
columns=None,
min_obs: int = 3,
**named_stages,
) -> pandas.DataFrame
Compares adjacent named stages column by column. Output columns include
stage_a, stage_b, feature, observation counts, means, standard
deviations, mean_shift, sd_ratio, missing-rate shift, and a two-sample
KS-statistic. Use it to check whether scaling, lag construction, factor
construction, or selection changed feature distributions unexpectedly.
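To make the per-column quantities concrete, the sketch below computes them for one feature across two stages with NumPy and pandas. It is an illustration, not the library's code: `ks_statistic` and `shift_row` are hypothetical helpers, and the conventions that `mean_shift` is stage_b minus stage_a and `sd_ratio` is stage_b over stage_a are assumptions.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    grid = np.concatenate([a, b])  # the gap can only change at observed points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def shift_row(stage_a: pd.Series, stage_b: pd.Series) -> dict:
    """Summarize how one feature changed between two adjacent stages."""
    x, y = stage_a.dropna(), stage_b.dropna()
    return {
        "mean_shift": y.mean() - x.mean(),  # assumed sign convention
        "sd_ratio": y.std() / x.std(),  # assumed orientation
        "missing_rate_shift": stage_b.isna().mean() - stage_a.isna().mean(),
        "ks_statistic": ks_statistic(x, y),
    }


raw = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
scaled = (raw - raw.mean()) / raw.std()  # a standardization stage
row = shift_row(raw, scaled)
```

A standardization step is expected to produce a large `mean_shift` and an `sd_ratio` away from 1, so values like these are only suspicious when the stage was not supposed to rescale the feature.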
feature_correlation#
macroforecast.feature_analysis.feature_correlation(
data,
*,
feature_metadata: pandas.DataFrame | None = None,
method: str = "pearson",
min_periods: int = 3,
threshold: float | None = 0.9,
absolute: bool = True,
max_pairs: int | None = None,
scope: str = "all",
block_column: str = "block",
) -> pandas.DataFrame
Returns long-form pairs:
| Column | Meaning |
|---|---|
| | Pair names. |
| | Signed and absolute correlation. |
| | Block labels from feature metadata when available. |
| | Feature operations from metadata when available. |
| | Source columns from metadata when available. |
Use threshold=None for a full long-form correlation table.
Use scope="within_block" or scope="cross_block" to restrict pairs using
metadata blocks.
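The long-form pair table can be pictured as the upper triangle of a correlation matrix filtered by absolute value. The sketch below uses plain pandas; `high_correlation_pairs` and the column names `feature_a` and `feature_b` are hypothetical, and treating the threshold as inclusive is an assumption.

```python
import pandas as pd


def high_correlation_pairs(X, threshold=0.9, method="pearson", min_periods=3):
    """Return one row per unordered feature pair whose |correlation| passes threshold."""
    corr = X.corr(method=method, min_periods=min_periods)
    cols = list(corr.columns)
    rows = []
    for i, a in enumerate(cols):
        for b in cols[i + 1 :]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if pd.isna(r):
                continue  # too few overlapping observations
            if threshold is None or abs(r) >= threshold:
                rows.append(
                    {
                        "feature_a": a,
                        "feature_b": b,
                        "correlation": float(r),
                        "abs_correlation": float(abs(r)),
                    }
                )
    return pd.DataFrame(
        rows, columns=["feature_a", "feature_b", "correlation", "abs_correlation"]
    )


X = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "x_scaled": [3.0, 5.0, 7.0, 9.0, 11.0, 13.0],  # exact linear function of x
        "z": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # weakly related to x
    }
)
pairs = high_correlation_pairs(X, threshold=0.9)
all_pairs = high_correlation_pairs(X, threshold=None)
```

With the default threshold only the `x` / `x_scaled` pair survives; with `threshold=None` all three pairs are returned.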
feature_target_correlation#
macroforecast.feature_analysis.feature_target_correlation(
data,
target,
*,
feature_metadata=None,
method: str = "pearson",
min_periods: int = 3,
absolute: bool = True,
max_features: int | None = None,
) -> pandas.DataFrame
Returns one row per feature with correlation against the supplied target.
Output columns include feature, target, correlation,
abs_correlation, operation, source, block, and n_obs.
feature_correlation_matrix#
macroforecast.feature_analysis.feature_correlation_matrix(
data,
*,
method: str = "pearson",
min_periods: int = 3,
order: str = "original",
absolute_distance: bool = True,
) -> pandas.DataFrame
Returns a square correlation matrix. order="clustered" reorders rows and
columns so highly correlated features are adjacent; this is the callable table
behind a clustered heatmap.
factor_diagnostics#
macroforecast.feature_analysis.factor_diagnostics(
data,
*,
feature_metadata: pandas.DataFrame | None = None,
operations: Sequence[str] = (...),
prefixes: Sequence[str] = ("pc", "factor", "maf"),
) -> pandas.DataFrame
Detects factor/component features using either feature metadata
(operation in {"pca", "group_pca", "maf", ...} or a non-null component) or
name prefixes such as pc1, factor1, and maf1.
Returned columns include feature, group, operation, block, source,
component, n_obs, missing_rate, mean, sd, variance, and
variance_share. variance_share is a diagnostic share of variance within the
detected factor group. It is not the PCA model’s explained-variance ratio unless
the upstream transform recorded that exact quantity.
factor_variance#
macroforecast.feature_analysis.factor_variance(data, *, feature_metadata=None)
Returns scree-style rows with variance_share and
cumulative_variance_share. This is the callable table behind scree and
cumulative-variance views.
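A scree-style table of this kind can be sketched from factor score columns alone. The example below is an assumption-laden illustration: it uses the sample variance of each score column and normalizes within the group, which matches the diagnostic share described for factor_diagnostics rather than a fitted PCA model's explained-variance ratio.

```python
import pandas as pd

# Hypothetical factor scores for one detected factor group.
scores = pd.DataFrame(
    {
        "pc1": [2.0, -2.0, 2.0, -2.0],
        "pc2": [1.0, -1.0, 1.0, -1.0],
        "pc3": [0.5, -0.5, 0.5, -0.5],
    }
)

variance = scores.var()  # sample variance of each factor column
table = pd.DataFrame(
    {
        "factor": variance.index,
        "variance": variance.to_numpy(),
        "variance_share": (variance / variance.sum()).to_numpy(),
    }
)
# Running total of the shares, in component order.
table["cumulative_variance_share"] = table["variance_share"].cumsum()
```

The cumulative column ends at 1.0 by construction, so it describes how variance is split among the detected factors, not how much of the original panel they explain.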
factor_loadings#
macroforecast.feature_analysis.factor_loadings(
data,
*,
source_data=None,
feature_metadata=None,
method="pearson",
max_sources=None,
)
Approximates factor loadings as correlations between source variables and
factor columns. Supply source_data when data contains only factor scores.
Returned rows are long-form: factor, source, loading, abs_loading.
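The correlation-based approximation can be sketched in a few lines of pandas. `correlation_loadings` is a hypothetical helper, and the toy factor (the difference of two sources) is chosen only so the signs are easy to read.

```python
import pandas as pd


def correlation_loadings(factors: pd.DataFrame, sources: pd.DataFrame, method="pearson"):
    """Long-form table of source-to-factor correlations."""
    rows = []
    for f in factors.columns:
        for s in sources.columns:
            r = factors[f].corr(sources[s], method=method)
            rows.append({"factor": f, "source": s, "loading": r, "abs_loading": abs(r)})
    return pd.DataFrame(rows)


sources = pd.DataFrame(
    {"PAYEMS": [1.0, 2.0, 3.0, 4.0, 5.0], "UNRATE": [5.0, 4.0, 3.0, 2.0, 1.0]}
)
# Toy factor that rises with PAYEMS and falls with UNRATE.
factors = pd.DataFrame({"pc1": sources["PAYEMS"] - sources["UNRATE"]})

loadings = correlation_loadings(factors, sources)
```

These values are correlations, so they are bounded by 1 in absolute value and are not the eigenvector weights of a fitted PCA.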
factor_timeseries#
macroforecast.feature_analysis.factor_timeseries(
data,
*,
feature_metadata=None,
operations=(...),
prefixes=("pc", "factor", "maf"),
max_factors=None,
) -> pandas.DataFrame
Returns detected factor/component columns in long time-series form. Output
columns are date, factor, value, group, operation, component, and
source. Use this for factor-score line plots or factor stability checks
without reconstructing the feature metadata manually.
lag_diagnostics#
macroforecast.feature_analysis.lag_diagnostics(
data,
*,
feature_metadata: pandas.DataFrame | None = None,
operations: Sequence[str] = (...),
) -> pandas.DataFrame
Detects lag/window features using metadata fields lag, window,
operation, or feature names such as x_lag3, x_roll6_mean, and
x_ma4_lag1.
Returned columns include feature, operation, source, lag, window,
n_obs, missing_rate, first_valid, and last_valid.
lag_autocorrelation#
macroforecast.feature_analysis.lag_autocorrelation(
data,
*,
max_lag: int = 12,
kind: str = "acf",
) -> pandas.DataFrame
Returns ACF or PACF values for detected lag/window feature columns. This is the callable table behind autocorrelation-per-lag and partial-autocorrelation views.
lag_correlation_decay#
macroforecast.feature_analysis.lag_correlation_decay(
data,
*,
target=None,
method="pearson",
) -> pandas.DataFrame
Returns correlation decay by lag/window. If target is supplied, each lag
feature is correlated with that target. Otherwise, each lag feature is compared
with its same-source lag-0/current column when available.
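The no-target case can be illustrated by building lag columns with `shift` and correlating each against the lag-0 column. This is a sketch on simulated data (a random walk is an assumption standing in for a persistent macro series), not the library's routine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical persistent series standing in for one source column.
x = pd.Series(rng.normal(size=120).cumsum(), name="x")

# Lag features named in the x_lag<k> style.
lags = pd.DataFrame({f"x_lag{k}": x.shift(k) for k in range(4)})

# No target supplied: compare every lag with the same-source lag-0 column.
reference = lags["x_lag0"]
decay = pd.DataFrame(
    {
        "feature": lags.columns,
        "lag": range(4),
        "correlation": [lags[c].corr(reference) for c in lags.columns],
    }
)
```

The lag-0 row is 1 by construction; how quickly the later rows fall away from it is the decay the table is meant to expose.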
marx_diagnostics#
macroforecast.feature_analysis.marx_diagnostics(
data,
*,
feature_metadata: pandas.DataFrame | None = None,
) -> pandas.DataFrame
Detects MARX-style columns named like x_ma4_lag1. These are moving-average
lag features, not PCA. The returned table adds marx_formula using the
recorded starting lag and window. For example:
mean(x[t-1]...x[t-4])
For x_ma4_lag2, the formula is mean(x[t-2]...x[t-5]).
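Both formulas can be reproduced with a shift followed by a rolling mean. The sketch below uses plain pandas on a toy series; it shows the arithmetic only and makes no claim about how the upstream feature builder constructs these columns.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# x_ma4_lag1: mean(x[t-1] ... x[t-4]) -> shift by the starting lag, then average 4 values.
x_ma4_lag1 = x.shift(1).rolling(4).mean()

# x_ma4_lag2: mean(x[t-2] ... x[t-5]) -> same window, starting one period further back.
x_ma4_lag2 = x.shift(2).rolling(4).mean()
```

At position 4 the first feature averages the values 1 through 4, giving 2.5; the second feature reaches the same value one period later, at position 5.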
marx_weight_decay#
macroforecast.feature_analysis.marx_weight_decay(
data,
*,
feature_metadata=None,
) -> pandas.DataFrame
Returns the equal lag weights implied by each MARX moving-average feature.
For x_ma4_lag1, the table has four rows with weight 0.25 for lags 1
through 4 and cumulative weights from 0.25 to 1.0. For x_ma4_lag2,
the lag rows are 2 through 5, with the same equal weights.
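The weight rows described above follow directly from the feature name. The sketch below parses the window and starting lag with a regular expression; `marx_weights_sketch` and its output column names are hypothetical and not taken from the library.

```python
import re

import pandas as pd

# Matches names such as x_ma4_lag1: source "x", window 4, starting lag 1.
MARX_NAME = re.compile(r"^(?P<source>.+)_ma(?P<window>\d+)_lag(?P<lag>\d+)$")


def marx_weights_sketch(feature: str) -> pd.DataFrame:
    """Equal weights implied by a MARX-style moving-average feature name."""
    match = MARX_NAME.match(feature)
    if match is None:
        raise ValueError(f"not a MARX-style name: {feature!r}")
    window = int(match["window"])
    start = int(match["lag"])
    weight = 1.0 / window
    rows = [
        {
            "feature": feature,
            "source": match["source"],
            "lag": start + i,
            "weight": weight,
            "cumulative_weight": weight * (i + 1),
        }
        for i in range(window)
    ]
    return pd.DataFrame(rows)


lag1 = marx_weights_sketch("x_ma4_lag1")
lag2 = marx_weights_sketch("x_ma4_lag2")
```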
selection_stability#
macroforecast.feature_analysis.selection_stability(
selections,
*,
all_features: Iterable[str] | None = None,
) -> pandas.DataFrame
Accepts any of these inputs:
| Input form | Example |
|---|---|
| Mapping of origin to selected names | |
| Sequence of selected-name iterables | |
| Indicator DataFrame | rows are origins, columns are features, truthy values mean selected |
| Long DataFrame | columns |
The result is indexed by feature and includes selected_count,
selection_rate, n_origins, first_selected_origin, and
last_selected_origin.
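For the mapping input form, the listed output columns can be sketched as simple counts. `selection_counts` is a hypothetical helper, and the origin labels and feature names below are illustrative.

```python
import pandas as pd


def selection_counts(selections: dict[str, list[str]]) -> pd.DataFrame:
    """Per-feature selection frequency across origins, in mapping order."""
    origins = list(selections)
    features = sorted({name for names in selections.values() for name in names})
    rows = []
    for feature in features:
        hits = [o for o in origins if feature in selections[o]]
        rows.append(
            {
                "feature": feature,
                "selected_count": len(hits),
                "selection_rate": len(hits) / len(origins),
                "n_origins": len(origins),
                "first_selected_origin": hits[0],
                "last_selected_origin": hits[-1],
            }
        )
    return pd.DataFrame(rows).set_index("feature")


stability = selection_counts(
    {
        "origin_1": ["pc1", "PAYEMS_lag0"],
        "origin_2": ["pc1"],
        "origin_3": ["pc1", "UNRATE_lag1"],
    }
)
```

In this toy input `pc1` is selected at every origin, while the two lag features each appear once.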
selection_similarity#
macroforecast.feature_analysis.selection_similarity(
selections,
*,
metric: str = "jaccard",
all_features=None,
n_features=None,
) -> pandas.DataFrame
Returns pairwise stability across origins/folds/windows. metric="jaccard"
uses overlap divided by union. metric="kuncheva" adjusts overlap for expected
random overlap using the declared or inferred feature universe size. Kuncheva
stability is a fixed-selection-size measure; when two windows select different
numbers of features, score is missing and the output still reports
selected_a, selected_b, and overlap.
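The two metrics can be written out for a single pair of selections. This sketch uses the standard Kuncheva consistency index, (r·n − k²) / (k·(n − k)) for overlap r, common selection size k, and universe size n; the function names are hypothetical, and returning NaN for unequal sizes mirrors the missing score described above.

```python
import math


def jaccard(a, b) -> float:
    """Overlap divided by union of two selected-feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else math.nan


def kuncheva(a, b, n_features: int) -> float:
    """Kuncheva index: overlap corrected for the overlap expected by chance."""
    a, b = set(a), set(b)
    k = len(a)
    if k != len(b) or k == 0 or k >= n_features:
        return math.nan  # only defined for equal, non-trivial selection sizes
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))


window_a = ["pc1", "PAYEMS_lag0"]
window_b = ["pc1", "UNRATE_lag1"]

j = jaccard(window_a, window_b)
k_score = kuncheva(window_a, window_b, n_features=10)
k_missing = kuncheva(window_a, ["pc1"], n_features=10)
```

With one shared feature out of three distinct names, the Jaccard score is 1/3. The Kuncheva score for the same pair in a ten-feature universe is 0.375, and the unequal-size pair yields NaN.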
custom_feature_diagnostic#
macroforecast.feature_analysis.custom_feature_diagnostic(
data,
func,
*,
name=None,
feature_metadata=None,
metadata=None,
**params,
) -> pandas.DataFrame
Runs one user diagnostic on a feature matrix or FeatureSet. This is for
inspection only; it does not create new predictors.
Callable signature:
func(X, *, feature_metadata=None, metadata=None, **params)
Accepted callable outputs are DataFrame, Series, mapping, or a sequence
convertible to a DataFrame. The returned table carries:
| Attr | Meaning |
|---|---|
| | Always |
| | |
| | Input metadata plus a |
Example:
import pandas as pd
import macroforecast as mf

def block_missingness(X, *, feature_metadata=None, metadata=None, block="all"):
    return pd.DataFrame(
        [{"block": block, "missing_rate": float(X.isna().mean().mean())}]
    )

diag = mf.feature_analysis.custom_feature_diagnostic(
    features,
    block_missingness,
    name="block_missingness",
    block="rates",
)
Boundary#
| Question | Use |
|---|---|
| Create predictors and target matrices | macroforecast.feature_engineering |
| Inspect feature matrix quality and metadata | macroforecast.feature_analysis |
| Compare raw and preprocessed panels | |
| Inspect fitted model residuals or tuning trace | |