macroforecast.feature_engineering#
Purpose#
macroforecast.feature_engineering is the direct pandas surface for building
forecast targets and model-ready feature matrices. It accepts the same direct Python inputs used by
previous stages: PreprocessedData, DataSpec, DataBundle,
(panel, metadata), or a canonical pandas.DataFrame.
For strict windowed forecasting, use feature_spec(...). The spec is fitted by
macroforecast.forecasting.run(...) inside each train window and then
transformed for the matching test rows. Individual functions such as lag(),
rolling_mean(), pca_features(), and feature_matrix() remain callable
one-shot helpers; runner composition belongs in macroforecast.forecasting.
The preferred flow is:
import macroforecast as mf
bundle = mf.data.load_fred_md()
data_spec = mf.data.spec(bundle, target="INDPRO", horizons=[1, 3, 6], predictors="all")
processed = mf.preprocessing.reprocess(data_spec)
features = mf.feature_engineering.build_features(
    processed,
    lags=(0, 1, 2, 3),
    rolling_windows=(3, 6),
    add_time=True,
)
X = features.X
y = features.y
metadata = features.metadata
build_features() emits a warning when it receives a canonical panel that does
not carry metadata["preprocessing"]. This is allowed, but the default package
workflow is data -> preprocessing -> feature engineering.
Common callable examples:
# Direct horizon targets: y[t+h].
y_direct = mf.feature_engineering.direct_target(processed, target="INDPRO", horizons=[1, 3, 6])
# Direct average target: one y column per requested horizon.
y_avg = mf.feature_engineering.average_target(processed, target="INDPRO", horizon=12, transform="growth")
# Path target: one y column per future step. Model fits/forecasts each step;
# evaluation averages the step forecasts.
y_path = mf.feature_engineering.path_targets(processed, target="INDPRO", horizon=12, transform="growth")
# Simple lagged predictors.
X_lag = mf.feature_engineering.lag(processed, columns=["PAYEMS", "INDPRO"], lags=range(0, 13))
Public Functions#
| Group | Functions | Purpose |
|---|---|---|
| Target construction | | Build direct, average, or step-path forecast targets. |
| Basic predictor transforms | | Add lags, rolling blocks, mixed-frequency lag blocks, and deterministic date features. |
| ML-side value transforms | | Add post-preprocessing transformations and scaling used as model features. |
| Nonlinear expansions | | Add nonlinear, smooth, rank, or multi-resolution feature blocks. |
| Supervised aggregation helpers | | Reusable component-to-aggregate helpers derived from the Albacore/assemblage R package. They are generic and can be used outside inflation. |
| Trend/cycle filter wrappers | | Turn |
| Factor features | | Build fitted factor, supervised-factor, sparse-factor, rotation, random projection, and kernel approximation features. |
| Paper-style combinations | | Materialize named macro-ML blocks such as |
| Composition | | Run sequential feature steps or user-supplied transforms. |
| Runner-safe specs | | Store fit-aware feature construction for |
| Feature selection | | Select columns with variance, target association, sparse-model, wrapper, or search rules. |
| Selection utilities | | Normalize selection aliases and report whether a target is required. |
| End-to-end builder | | Return aligned |
Runner-safe step builders are direct functions too: lag_step,
rolling_step, moving_average_step, marx_step, transform_step,
seasonal_lag_step, season_dummy_step, fourier_step, time_step,
polynomial_step, interaction_step, scale_step, pca_step,
sparse_pca_chen_rohe_step, varimax_step, group_pca_step, maf_step,
hamilton_step, random_projection_step, nystroem_step,
partial_least_squares_step, and sliced_inverse_regression_step.
Code Structure#
The public namespace stays macroforecast.feature_engineering, while the
implementation is split by responsibility:
File |
Responsibility |
|---|---|
|
|
|
Direct pandas feature transforms: lags, rolling means, scaling, PCA, PLS, DFM-style factors, Chen-Rohe sparse component analysis, varimax rotation, grouped PCA, MAF, filter-to-feature wrappers, custom feature callables, and time features. |
|
Shared fitted feature-selection algorithms used by direct selection callables and runner-safe |
|
Reusable step builders and sequential feature composition. |
|
Paper-style |
|
End-to-end |
|
Internal normalization, metadata, fitting, and validation helpers. |
|
Compatibility re-export only. |
Public Classes And Types#
| Symbol | Meaning |
|---|---|
| | Accepted feature-engineering input type. |
| | Output object returned by |
| | Runner-compatible feature-building contract. |
| | Fitted feature-builder state used by the runner. |
| | Metadata-rich result for feature-selection helpers. |
| | Generic feature-selection dispatcher. |
| | Return whether a feature-selection method requires a target. |
| | Normalize feature-selection method aliases. |
| | Convenience composition: PCA factors first, then lags. |
| | Convenience composition: lag panel first, then PCA. |
| | Convenience composition for moving-average blocks, PCA, and lags. |
FeatureSet#
macroforecast.feature_engineering.FeatureSet(
    X: pandas.DataFrame,
    y: pandas.DataFrame,
    metadata: dict,
    feature_metadata: pandas.DataFrame,
    target_metadata: pandas.DataFrame,
    target: str | None = None,
    targets: tuple[str, ...] = (),
    horizons: tuple[int, ...] = (),
    predictors: tuple[str, ...] = (),
)
Output Schema#
| Field | Type | Meaning |
|---|---|---|
| | | Predictor matrix aligned on forecast-origin dates. |
| | | Direct horizon targets or path step targets aligned to |
| | | Input metadata plus feature-engineering stage metadata. |
| | | One row per generated feature with provenance columns. |
| | | One row per target column with horizon, transform, and formula provenance. |
| | scalar or tuple fields | Resolved study choices. |
Methods#
| Method | Input | Output | Meaning |
|---|---|---|---|
| | | | Return a new object with one metadata stage added. |
FeatureSet also supports tuple unpacking:
X, y, metadata = features
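The unpacking behaviour can be pictured with a small stand-in class. This is only a sketch, not the package's implementation: `FeatureSetSketch` is a hypothetical name, and the sketch assumes that iteration yields `X`, `y`, and `metadata` in that order, as the unpacking line above suggests.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass(frozen=True)
class FeatureSetSketch:
    """Hypothetical stand-in for FeatureSet; not the package's class."""

    X: pd.DataFrame
    y: pd.DataFrame
    metadata: dict = field(default_factory=dict)

    def __iter__(self):
        # Yield the three fields in the order used by `X, y, metadata = features`.
        yield self.X
        yield self.y
        yield self.metadata


index = pd.date_range("2020-01-01", periods=3, freq="MS")
features = FeatureSetSketch(
    X=pd.DataFrame({"PAYEMS_lag1": [1.0, 2.0, 3.0]}, index=index),
    y=pd.DataFrame({"INDPRO_level_h1": [0.1, 0.2, 0.3]}, index=index),
    metadata={"stage": "feature_engineering"},
)
X, y, metadata = features
```

Attribute access (`features.X`) and unpacking return the same objects, so either style can be used downstream.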
FeatureSelectionResult#
select_features(...) returns FeatureSelectionResult; direct selection
wrappers return a selected-column DataFrame and store the same selection
metadata on the returned frame.
| Field | Meaning |
|---|---|
| | Final selected columns in source order. |
| | Per-column score dictionary. |
| | Canonical selection method. |
| | Requested and resolved selected-feature counts. |
| | Rows used by the selection fit. |
| | Fit contract, such as target-aligned rows or column-only rows. |
| | Whether the method requires a target. |
| | Method-specific score and fit details. |
Feature Boundary#
This stage is direct and pandas-first. It constructs target columns and ML-oriented feature transforms. Multiple transformations can be composed in sequence through plain Python callables, and higher-level orchestration can call the same functions later.
| Function | Owns | Does not own yet |
|---|---|---|
| | Direct-forecast target columns, including direct average targets. | Train/test split, recursive forecasting, inverse transforms. |
| | Explicit wrapper for direct average change/growth targets. | Model fitting. |
| | Albacore/assemblage-named wrapper for future average aggregate targets. | Inflation-only semantics; works for any aggregate target. |
| | Step-level targets for path-average forecasting. | Model-stage step fit/forecast and evaluation-stage forecast averaging. |
| | Current and lagged predictor columns. | Model-specific lag search. |
| | Exact-date lag blocks for native mixed-frequency panels. | Frequency conversion or model estimation. |
| | Rolling-window means. | Fit-based filters or learned smoothers. |
| | Multi-scale trailing moving-average block used before optional factor/PCA steps. | PCA/factor extraction itself. |
| | Moving Average Factors from variable-specific lag panels. | Model fitting or choosing final feature combinations. |
| | Panel wrapper around | Model fitting, test windows, or choosing filter horizons. |
| | Named | Loading or preprocessing the raw/level panel. |
| | Fit-policy-aware z-score, min-max, or robust scaling. | Model fitting. |
| | Fit-policy-aware PCA factors. | Forecast model fitting. |
| | Chen-Rohe sparse component analysis factors using an L1 loading budget. | Model fitting; runner-safe fitting should use |
| | Orthogonal varimax rotation of already-created factor-score columns. | Factor extraction itself; usually call after |
| | Target-aware Sliced Inverse Regression factors. | Model fitting; runner-safe fitting should use |
| | Target-aware PLS factor scores. | Model fitting; runner-safe fitting should use |
| | Static DFM approximation by standardized PCA. | State-space DFM estimation; use model callables for that. |
| | Direct column selection by one explicit algorithm. | Model fitting; runner-safe fitting uses the same method names inside |
| | Per-period rank-space columns for asymmetric trimming weights. | Estimating the nonnegative rank weights. |
| | Named generic rank-space/order-statistic primitive from the Albacore R | Model fitting or learned rank weights. |
| | Convert one-period component changes to a trailing moving-average unit. | Choosing forecast windows or fitting weights. |
| | Align official/reference weights to a component column order. | Estimating those weights. |
| | Apply fixed component weights to produce one aggregate. | Learning the weights; use |
| | Panel wrapper around | True DWT family-specific filtering. |
| | Feature-wrapper form of | Forecast model fitting. |
| | PCA factors within named column groups. | FAVAR-specific slow/fast construction, model estimation, or structural identification. |
| | One direct user-supplied pandas feature transform. | Window-safe fitted state. Use |
| | Sequential combinations such as | Model fitting or evaluation. |
| | Trend, month, quarter, and year columns. | Public-holiday or trading-day calendars; the package targets monthly and quarterly macro panels. |
| | Aligned | Model evaluation. |
Fit-based transformations require a declared fit_policy. The default is
fit_policy="expanding", which estimates transform parameters using only data
available through each date. fit_policy="full_sample" is available for
exploration or already-split training data. Public fitted transforms warn by
default when full_sample is used; pass warn_full_sample=False only when the
input panel is already training-only or the call is intentionally diagnostic.
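The difference between the two policies can be illustrated with a z-score computed in plain pandas. This is a sketch of the idea only, not the package's code path; using `min_periods` as the counterpart of `min_train_size` is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", periods=6, freq="MS")
x = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0], index=index, name="x")


def expanding_zscore(series: pd.Series, min_periods: int = 3) -> pd.Series:
    # Mean and standard deviation at date t use rows up to and including t only.
    window = series.expanding(min_periods=min_periods)
    return (series - window.mean()) / window.std()


def full_sample_zscore(series: pd.Series) -> pd.Series:
    # Every row is scaled with statistics that also include later dates.
    return (series - series.mean()) / series.std()


expanding_z = expanding_zscore(x)
full_sample_z = full_sample_zscore(x)

# Revise only the last observation and recompute both versions.
x_revised = x.copy()
x_revised.iloc[-1] = 100.0
revised_expanding_z = expanding_zscore(x_revised)
revised_full_sample_z = full_sample_zscore(x_revised)
```

Under the expanding policy the earlier rows are unchanged by the revision, while the full-sample values move for every date. That is the leakage the default policy is meant to avoid.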
direct_target#
macroforecast.feature_engineering.direct_target(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    target: str | None = None,
    targets: Iterable[str] | None = None,
    horizon: int | None = None,
    horizons: Iterable[int] | int | None = None,
    transform: str = "level",
) -> pandas.DataFrame
Input#
| Name | Type | Default | Choices |
|---|---|---|---|
| | | required | Canonical macroforecast input. |
| | mapping or | | Extra metadata to merge into the input metadata. |
| | | from | One target column. |
| | iterable or | from | Multiple target columns. Mutually exclusive with |
| | positive | from | One forecast horizon. |
| | positive int/iterable or | from | Multiple forecast horizons. Mutually exclusive with |
| | | | |
Output#
Returns a pandas.DataFrame indexed by date. Column names are
{target}_{transform}_h{horizon}.
| Transform | Formula aligned on row |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | Average of future values |
| | Average of one-period changes from |
| | Average of one-period simple growth rates from |
| | Average of one-period log growth rates from |
The final h rows are missing by construction because the future target is not
observed.
The returned frame also carries attrs["macroforecast_target_metadata"].
Core columns are target_column, source, horizon, step, mode,
transform, operation, formula, aggregation, and used_for_horizons.
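The alignment of direct targets on the forecast-origin row can be sketched with `Series.shift`. The snippet is a plain-pandas illustration rather than the package's code, and the exact formulas it uses for the level and average-change targets are assumptions consistent with the descriptions above.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", periods=8, freq="MS")
y = pd.Series(
    [100.0, 101.0, 103.0, 106.0, 110.0, 115.0, 121.0, 128.0],
    index=index,
    name="INDPRO",
)

h = 3
targets = pd.DataFrame(index=index)

# Level target: the value observed h periods after the forecast origin t.
targets[f"INDPRO_level_h{h}"] = y.shift(-h)

# Average of the one-period changes over t+1, ..., t+h.
one_period_change = y.diff()
targets[f"INDPRO_average_change_h{h}"] = (
    sum(one_period_change.shift(-step) for step in range(1, h + 1)) / h
)
```

Because the one-period changes telescope, the average-change target at row `t` equals `(y[t+h] - y[t]) / h`, and the final `h` rows are missing in both columns.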
average_target#
macroforecast.feature_engineering.average_target(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    target: str | None = None,
    targets: Iterable[str] | None = None,
    horizon: int | None = None,
    horizons: Iterable[int] | int | None = None,
    transform: str = "change",
) -> pandas.DataFrame
average_target() is a readability wrapper for direct average targets.
It returns the same output as:
mf.feature_engineering.direct_target(..., transform="average_change")
mf.feature_engineering.direct_target(..., transform="average_growth")
mf.feature_engineering.direct_target(..., transform="average_log_growth")
mf.feature_engineering.direct_target(..., transform="average_value")
This is the direct average approach: one final target column is created per requested horizon, and a later model can fit that column directly.
| | Meaning |
|---|---|
| | Average future values of an already transformed one-period target series. |
| | Average one-period differences over the future path. |
| | Average one-period simple growth rates over the future path. |
| | Average one-period log growth rates over the future path. |
forward_average_target#
macroforecast.feature_engineering.forward_average_target(
    data,
    *,
    target=None,
    targets=None,
    horizon=None,
    horizons=None,
    transform="change",
)
forward_average_target() is the named target helper for
Albacore/assemblage-style supervised aggregation. It calls the same target
logic as average_target(), but records source metadata pointing to Goulet
Coulombe, Klieber, Barrette, and Goebel, Maximally Forward-Looking Core
Inflation, and the R package assemblage. The helper is generic: the target
can be any future aggregate, not only headline inflation.
Output: a DataFrame with columns such as
headline_average_change_h12. The output stores
attrs["macroforecast_target_metadata"] and marks
source_method="assemblage_forward_target".
Assemblage Helper Primitives#
These helpers come from the Albacore/assemblage workflow but are intentionally split into generic callables. They can be attached to inflation components, state panels, sector panels, industry components, or any setting where current components are aggregated to forecast a future aggregate target.
| Function | Input | Output | Albacore source cue |
|---|---|---|---|
| | Component panel. | | R |
| | One-period component changes. | | R |
| | Mapping, | | R |
| | Component panel plus fixed weights. | One aggregate column. | Learned core measure after assemblage weights are estimated. |
rank_space_features() and weighted_aggregate() do not estimate weights.
Use macroforecast.models.rank_aggregation(),
macroforecast.models.component_aggregation(), or the Albacore wrappers for
supervised weight estimation.
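As a rough picture of what the two weight-free primitives do, the following NumPy/pandas sketch builds per-period rank-space columns and a fixed-weight aggregate. It does not call the package; the `rank{k}` column names, the component names, and the weights are made up for the illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2020-01-01", periods=3, freq="MS")
components = pd.DataFrame(
    {"food": [0.4, 0.1, 0.3], "energy": [1.2, -0.5, 0.2], "services": [0.2, 0.3, 0.1]},
    index=index,
)

# Rank space: sort each period's component values, so column k holds the
# k-th smallest value of that period regardless of which component produced it.
rank_space = pd.DataFrame(
    np.sort(components.to_numpy(), axis=1),
    index=index,
    columns=[f"rank{k}" for k in range(1, components.shape[1] + 1)],  # hypothetical names
)

# Fixed-weight aggregate: align the weights to the component column order,
# then take the weighted sum per period. The weights are illustrative only.
weights = pd.Series({"services": 0.6, "food": 0.3, "energy": 0.1}).reindex(components.columns)
aggregate = components.mul(weights, axis=1).sum(axis=1).rename("aggregate")
```

Neither step estimates anything: the rank-space block is a reordering of each row, and the aggregate applies weights that were supplied from outside.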
path_targets#
macroforecast.feature_engineering.path_targets(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    target: str | None = None,
    targets: Iterable[str] | None = None,
    horizon: int | None = None,
    horizons: Iterable[int] | int | None = None,
    transform: str = "change",
) -> pandas.DataFrame
path_targets() creates step-level future targets for path-average
forecasting. For horizon=3, it returns step columns for t+1, t+2, and
t+3. The model stage should fit and forecast each step target separately.
The evaluation stage can then average the step forecasts for the final horizon.
Use transform="value" when the supplied target column is already a
one-period transformed object, such as the monthly growth/difference target in a
FRED-MD replication.
path_y = mf.feature_engineering.path_targets(
    processed,
    target="INDPRO",
    horizon=3,
    transform="value",
)
Output columns are named {target}_{transform}_step{step}. Metadata includes
metadata["path_target"]["columns_by_horizon"], which records which step
columns should be averaged for each requested horizon.
macroforecast_target_metadata marks these rows with mode="path",
operation="path_step_target", a non-null step, and
aggregation="average_step_forecasts_in_evaluation". This records the intended
later use without moving model fitting or forecast averaging into this stage.
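The relationship between step targets and the direct average target can be sketched in plain pandas. The snippet below is an illustration under assumptions: it uses the realized step targets as stand-ins for step forecasts, and it follows the `{target}_{transform}_step{step}` naming described above without calling the package.

```python
import pandas as pd

index = pd.date_range("2020-01-01", periods=6, freq="MS")
growth = pd.Series([0.2, 0.4, 0.6, 0.8, 1.0, 1.2], index=index, name="INDPRO")

horizon = 3
# One column per future step: step k holds the one-period value at t+k.
path = pd.DataFrame(
    {f"INDPRO_value_step{step}": growth.shift(-step) for step in range(1, horizon + 1)},
    index=index,
)

# Stand-in for the evaluation stage: average the step columns per origin row.
# A real workflow would fit one model per step and average the forecasts.
path_average = path.mean(axis=1, skipna=False)

# The direct-average counterpart builds the same quantity as a single column.
direct_average = sum(growth.shift(-step) for step in range(1, horizon + 1)) / horizon
```

The two routes describe the same realized quantity. They differ in what the model is asked to fit: one column per step in the path approach, one final column in the direct approach.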
lag#
macroforecast.feature_engineering.lag(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    columns: Iterable[str] | None = None,
    lags: Iterable[int] | int = (1,),
    drop_missing: bool = False,
) -> pandas.DataFrame
Input#
| Name | Type | Default | Choices |
|---|---|---|---|
| | feature input | required | Canonical macroforecast input. |
| | iterable or | all columns | Source columns to lag. |
| | int or iterable of ints | | Non-negative lags. |
| | | | Drop rows with any lag-induced missing values. |
Output#
Returns a pandas.DataFrame with columns named {column}_lag{lag}.
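In plain pandas the same naming and alignment can be pictured with `Series.shift`. This is a sketch of the idea, not the package's implementation; the column ordering (all lags of one source column before the next) is an assumption.

```python
import pandas as pd

index = pd.date_range("2020-01-01", periods=5, freq="MS")
panel = pd.DataFrame(
    {"PAYEMS": [1.0, 2.0, 3.0, 4.0, 5.0], "INDPRO": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=index,
)

lags = (0, 1, 2)
# Lag 0 is the contemporaneous value; lag k moves the series k rows down.
lagged = pd.DataFrame(
    {f"{column}_lag{lag}": panel[column].shift(lag) for column in panel.columns for lag in lags},
    index=index,
)

# Counterpart of drop_missing=True: remove rows with lag-induced gaps.
lagged_complete = lagged.dropna()
```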
mixed_frequency_lags#
macroforecast.feature_engineering.mixed_frequency_lags(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    target: str | None = None,
    anchor_dates: Iterable[object] | None = None,
    columns: Iterable[str] | None = None,
    lags: Iterable[int] | int = (0, 1, 2),
    frequency_by_column: Mapping[str, str] | None = None,
    target_frequency: str | None = None,
    anchor_position: str = "date",
    drop_missing: bool = False,
) -> pandas.DataFrame
Builds a lag matrix for MIDAS-style and other mixed-frequency regressions.
Unlike lag(), lags are measured in each source column’s native frequency,
using metadata["native_frequency_by_column"] from mf.data.set_frequencies()
or mf.data.combine(..., frequency="native").
Lookup is period based, not timestamp-string based. A monthly source dated
2020-03-01 and the same source dated 2020-03-31 both map to the March 2020
source period. This prevents month-start/month-end conventions from silently
breaking MIDAS lag construction.
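Period-based lookup can be illustrated with `pandas.Period`. The sketch below only shows the calendar arithmetic; how the package itself positions anchors and resolves lags is not shown here, so the quarter-end step is an assumption for illustration.

```python
import pandas as pd

month_start = pd.Timestamp("2020-03-01")
month_end = pd.Timestamp("2020-03-31")

# Both timestamps fall in the same monthly period, whatever the dating convention.
same_month = month_start.to_period("M") == month_end.to_period("M")

# A quarterly anchor dated at the first month of its quarter ...
anchor_quarter = pd.Timestamp("2020-01-01").to_period("Q")
# ... moved to the quarter-end month, after which lag k counts back in months.
quarter_end_month = anchor_quarter.asfreq("M", how="end")
monthly_lag_periods = [quarter_end_month - k for k in range(3)]
```

Comparing periods instead of timestamp strings is what makes a month-start panel and a month-end panel resolve to the same source observation.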
Input#
| Name | Type | Default | Meaning |
|---|---|---|---|
| | feature input | required | Panel or bundle with a mixed-frequency contract. |
| | | input target if available | Column whose non-missing dates define anchors when |
| | iterable or | target non-missing dates | Explicit rows to build features for. |
| | iterable or | all non-target columns | Source columns to lag. |
| | int or iterable | | Native-frequency lags. Pass an iterable for exact lags. |
| | mapping or | metadata map | Override native frequency by source column. |
| | | target metadata/inference | Frequency used when positioning anchor dates. |
| | | | |
| | | | Drop rows with missing lag values. |
For FRED-QD-style quarterly targets dated at the first month of the quarter,
use target_frequency="quarterly", anchor_position="period_end" to construct
monthly lag blocks at the quarter-end month:
X_midas = mf.feature_engineering.mixed_frequency_lags(
    bundle,
    target="GDPC1",
    columns=["PAYEMS", "INDPRO"],
    lags=range(0, 12),
    target_frequency="quarterly",
    anchor_position="period_end",
)
The output columns are named {column}_lag{lag}, which is the grouping format
expected by mf.models.midas_almon, mf.models.midas_beta, and
mf.models.midas_step.
The returned DataFrame records metadata in two places:
| Location | Meaning |
|---|---|
| | Target, anchor dates, selected columns, exact lags, frequency map, anchor positioning, lookup calendar, and row counts before/after |
| | One row per generated lag feature, including source column, lag, native source frequency, anchor position, and lookup start/end dates. |
rolling_mean#
macroforecast.feature_engineering.rolling_mean(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    columns: Iterable[str] | None = None,
    windows: Iterable[int] | int = (3,),
    min_periods: int | None = None,
    shift: int = 0,
    drop_missing: bool = False,
) -> pandas.DataFrame
Input#
| Name | Type | Default | Choices |
|---|---|---|---|
| | iterable or | all columns | Source columns. |
| | positive int or iterable | | Rolling-window lengths. |
| | positive int or | window length | Minimum observations required for a value. |
| | non-negative int | | Shift source series before rolling. Use |
| | | | Drop rows with window-induced missing values. |
Output#
Returns a pandas.DataFrame with columns named {column}_roll{window}_mean.
When shift > 0, names end in _lag{shift}.
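The effect of shifting before rolling can be seen in a few lines of pandas. This sketch mirrors the naming described above but is not the package's implementation.

```python
import pandas as pd

index = pd.date_range("2020-01-01", periods=6, freq="MS")
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index, name="PAYEMS")

window, shift = 3, 1
features = pd.DataFrame(index=index)

# Trailing mean that includes the contemporaneous value at row t.
features[f"PAYEMS_roll{window}_mean"] = x.rolling(window).mean()

# Shift first, then roll: the window ends at t-1 and excludes the value at t.
features[f"PAYEMS_roll{window}_mean_lag{shift}"] = x.shift(shift).rolling(window).mean()
```

At the fourth row the unshifted mean averages 2, 3, 4, while the shifted version averages 1, 2, 3.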
moving_average_ladder#
macroforecast.feature_engineering.moving_average_ladder(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    columns: Iterable[str] | None = None,
    windows: Iterable[int] | None = None,
    max_window: int = 12,
    min_periods: int | None = None,
    shift: int = 0,
    drop_missing: bool = False,
) -> pandas.DataFrame
Meaning#
moving_average_ladder() builds a stacked block of trailing moving averages at
multiple window lengths. With the default max_window=12, the implicit windows are
the powers of two up to max_window: 1, 2, 4, 8. Pass windows=(1, 2, 4, 8, 12) or
any other explicit sequence when the endpoint should be included.
MARX in macroforecast#
Some papers describe this step as marx_features(P) or Moving Average Rotation
of X (MARX). In macroforecast, the direct pandas form is the following
explicit moving-average-ladder call:
MARX = mf.feature_engineering.moving_average_ladder(
    X,
    windows=range(1, P + 1),
    shift=1,
)
This means that, for each source series, the feature block contains increasing
moving averages of lagged X: one-period lag, two-period average ending at
t-1, three-period average ending at t-1, and so on through P. The
shift=1 part is important because the MARX block uses lagged predictors, not
the contemporaneous realization at the forecast date.
The runner-safe shorthand is marx_step(max_lag=P), used inside
feature_spec(..., steps=[...]). It emits the same columns as the direct call,
but lets forecasting.run() decide which rows are available for any fitted
state through feature_policy.
The original author R snippet builds a VAR lag matrix ordered as lag 1 for all
variables, lag 2 for all variables, and so on. Then each lag-l slot for a
variable is replaced by the row average of that variable’s lag 1 through lag
l columns. The direct call and marx_step(scale_lags=False) match that
unscaled calculation. Through feature_matrix(..., specification="MARX", scale_marx=True) or marx_step(scale_lags=True), macroforecast also supports
the optional R-code scaling step: z-score the lag matrix first using sample
standard deviations, then apply the same increasing-lag averages. In
feature_spec() mode, scale_lags=True fits those lag-matrix center/scale
values only on the feature-fit panel and reuses them for validation/test rows.
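The equivalence between the shifted moving-average ladder and the lag-matrix row averages (the unscaled case) can be checked with plain pandas. This is a sketch on simulated data, not the package's code or the original R snippet; the column names follow the `{column}_ma{window}` plus `_lag{shift}` convention documented for this function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=24, freq="MS")
x = pd.Series(rng.normal(size=24), index=index, name="x")
P = 4

# Route 1: shift by one period, then take trailing averages of length 1..P.
ladder = pd.DataFrame(
    {f"x_ma{w}_lag1": x.shift(1).rolling(w).mean() for w in range(1, P + 1)}
)

# Route 2: build the lag 1..P columns, then replace slot l by the row
# average of lags 1 through l.
lag_matrix = pd.DataFrame({lag: x.shift(lag) for lag in range(1, P + 1)})
row_average = pd.DataFrame(
    {
        f"x_ma{l}_lag1": lag_matrix[list(range(1, l + 1))].mean(axis=1, skipna=False)
        for l in range(1, P + 1)
    }
)

routes_agree = np.allclose(ladder.to_numpy(), row_average.to_numpy(), equal_nan=True)
```

Both routes leave the same leading rows missing, because a window of length `l` ending at `t-1` needs `l` lagged observations.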
This function is not PCA. It is the moving-average block used before optional factor extraction. Moving-average PCA should be represented as:
ma_block = mf.feature_engineering.moving_average_ladder(panel, windows=(1, 2, 4, 8, 12))
factors = mf.feature_engineering.pca_features(ma_block, fit_policy="expanding")
Keeping the moving-average block and PCA step separate matters because PCA is a fit-based transformation. Running PCA on the full sample before train/test or walk-forward boundaries would leak future information.
Input#
| Name | Type | Default | Choices |
|---|---|---|---|
| | iterable or | all columns | Source columns. |
| | iterable of positive ints or | powers of two up to | Exact moving-average windows. |
| | positive int | | Used only when |
| | positive int or | window length | Minimum observations required for a value. |
| | non-negative int | | Shift source series before rolling. Use |
| | bool | | Drop rows with window-induced missing values. |
Output#
Returns a pandas.DataFrame with columns named {column}_ma{window}. When
shift > 0, names end in _lag{shift}.
scale_features#
macroforecast.feature_engineering.scale_features(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    columns: Iterable[str] | None = None,
    method: str = "zscore",
    fit_policy: str = "expanding",
    min_train_size: int | None = None,
    drop_missing: bool = False,
    warn_full_sample: bool = True,
) -> pandas.DataFrame
| Name | Type | Default | Choices |
|---|---|---|---|
| | str | | |
| | str | | |
| | positive int or | | Minimum complete rows before emitting scaled values. |
| | bool | | Drop rows where scaling is unavailable. |
| | bool | | Warn when |
pca_features#
macroforecast.feature_engineering.pca_features(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    columns: Iterable[str] | None = None,
    n_components: int = 1,
    fit_policy: str = "expanding",
    min_train_size: int | None = None,
    scale: bool = True,
    prefix: str = "pc",
    drop_missing: bool = False,
    random_state: int | None = None,
    warn_full_sample: bool = True,
) -> pandas.DataFrame
pca_features() returns columns named {prefix}1, {prefix}2, and so on.
The default fit_policy="expanding" avoids full-sample leakage. Use
fit_policy="full_sample" only after the input sample has already been split
or for exploratory diagnostics. warn_full_sample=True emits a warning for
that choice.
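An expanding-fit PCA can be sketched with NumPy's SVD: at each date the loadings are estimated on rows up to that date and only that date's score is kept. The function name `expanding_first_pc`, the sign convention, and the refit-at-every-row scheme are assumptions for illustration and may differ from what `pca_features()` does internally.

```python
import numpy as np
import pandas as pd


def expanding_first_pc(panel: pd.DataFrame, min_train_size: int) -> pd.Series:
    """First principal-component score per date, fitted on rows up to that date."""
    values = panel.to_numpy(dtype=float)
    scores = np.full(len(panel), np.nan)
    for t in range(min_train_size - 1, len(panel)):
        train = values[: t + 1]
        mean, std = train.mean(axis=0), train.std(axis=0, ddof=1)
        z = (train - mean) / std
        # Right singular vectors of the standardized block are the loadings.
        _, _, vt = np.linalg.svd(z, full_matrices=False)
        loading = vt[0]
        # Fix the sign so the largest-magnitude loading is positive.
        loading = loading * np.sign(loading[np.argmax(np.abs(loading))])
        scores[t] = z[-1] @ loading
    return pd.Series(scores, index=panel.index, name="pc1")


rng = np.random.default_rng(0)
index = pd.date_range("2010-01-01", periods=40, freq="MS")
panel = pd.DataFrame(rng.normal(size=(40, 3)), index=index, columns=["a", "b", "c"])
pc1 = expanding_first_pc(panel, min_train_size=12)

# Revise only the final row: earlier scores must not change.
revised = panel.copy()
revised.iloc[-1] = [5.0, -5.0, 5.0]
pc1_revised = expanding_first_pc(revised, min_train_size=12)
```

A full-sample fit would instead re-estimate the loadings with the revised row included, so every earlier score could move.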
sparse_pca_chen_rohe_features#
macroforecast.feature_engineering.sparse_pca_chen_rohe_features(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    columns: Iterable[str] | None = None,
    n_components: int = 4,
    zeta: float = 0.0,
    max_iter: int = 200,
    var_innovations: bool = False,
    prefix: str | None = None,
    min_train_size: int | None = None,
    drop_missing: bool = False,
    random_state: int | None = 0,
    warn_full_sample: bool = True,
) -> pandas.DataFrame
sparse_pca_chen_rohe_features() implements the legacy package’s
Chen-Rohe-style Sparse Component Analysis (SCA) routine directly with NumPy. It
is not sklearn.decomposition.SparsePCA. The transform centers the selected
predictor panel, alternates over the score and loading matrices, and constrains
the loading matrix with an L1 budget zeta.
The direct callable fits on all complete rows of the supplied input. It warns by
default because that is a full-input fitted transform. For strict walk-forward
forecasting, use sparse_pca_chen_rohe_step() inside feature_spec(...); the
runner will fit the sparse loading matrix on the feature-fit panel and reuse the
fixed loading matrix on validation/test rows.
Input#
| Name | Type | Default | Meaning |
|---|---|---|---|
| | iterable or | all columns | Predictor columns used to fit sparse components. |
| | positive int | | Requested number of sparse components. The resolved number is |
| | non-negative float | | L1 loading-budget parameter. |
| | positive int | | Maximum alternating updates. |
| | bool | | If |
| | string or | | Output prefix. Default is |
| | positive int or | | Minimum complete rows before emitting factors. |
| | bool | | Drop rows where sparse factors are unavailable. |
| | int or | | Initialization seed for the alternating algorithm. |
| | bool | | Warn because the direct callable fits on all complete input rows. |
Output#
Returns a pandas.DataFrame indexed by date, with columns such as sca1,
sca2, or scaf1. Metadata is stored under
attrs["macroforecast_metadata"]["feature_engineering_sparse_pca_chen_rohe"].
The stage records selected columns, requested/resolved components, zeta,
resolved zeta, iteration count, final objective, VAR-innovation use, fit rows,
and fit_policy="full_input_complete_rows".
macroforecast_feature_metadata records one row per factor with
operation="sparse_pca_chen_rohe", the source columns, component index, and fit
policy.
varimax_features#
macroforecast.feature_engineering.varimax_features(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    columns: Iterable[str] | None = None,
    max_iter: int = 50,
    tol: float = 1e-7,
    prefix: str = "varimax",
    min_train_size: int | None = None,
    drop_missing: bool = False,
    warn_full_sample: bool = True,
) -> pandas.DataFrame
varimax_features() rotates a factor-score panel with an orthogonal varimax
rotation. It should be applied to factor columns, not raw macro variables. A
typical direct use is:
factors = mf.feature_engineering.pca_features(
    processed,
    columns=["INDPRO", "PAYEMS", "UNRATE"],
    n_components=3,
    fit_policy="full_sample",
    warn_full_sample=False,
)
rotated = mf.feature_engineering.varimax_features(factors, warn_full_sample=False)
The direct callable fits the rotation on all complete rows and warns by default. For strict walk-forward forecasting, use:
spec = mf.feature_engineering.feature_spec(
    target="INDPRO",
    horizon=1,
    predictors=["PAYEMS", "UNRATE", "HOUST"],
    steps=[
        mf.feature_engineering.pca_step(name="pc", n_components=3, include=False),
        mf.feature_engineering.varimax_step(name="rot", input="pc"),
    ],
)
Input#
| Name | Type | Default | Meaning |
|---|---|---|---|
| | iterable or | all columns | Factor-score columns to rotate. |
| | positive int | | Maximum varimax iterations. |
| | non-negative float | | Convergence tolerance for the rotation objective. |
| | string | | Output prefix. |
| | positive int or | | Minimum complete rows before emitting rotated factors. |
| | bool | | Drop rows where rotated factors are unavailable. |
| | bool | | Warn because the direct callable fits on all complete input rows. |
Output#
Returns a pandas.DataFrame with columns such as varimax1, varimax2, and so
on. Metadata is stored under metadata["feature_engineering_varimax"], and
macroforecast_feature_metadata records operation="varimax", component index,
source factor columns, and fit policy.
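For readers unfamiliar with the rotation, the classic SVD-based varimax iteration can be written in a few lines of NumPy. This is the textbook algorithm offered as a sketch; the package's own stopping rule, normalization, and handling of `tol` may differ, and `varimax_rotation` is a hypothetical name.

```python
import numpy as np


def varimax_rotation(scores: np.ndarray, max_iter: int = 50, tol: float = 1e-7):
    """Return (rotated, rotation) from the classic SVD-based varimax iteration."""
    n_rows, n_cols = scores.shape
    rotation = np.eye(n_cols)
    objective = 0.0
    for _ in range(max_iter):
        rotated = scores @ rotation
        # Gradient-like matrix of the varimax criterion for the current rotation.
        column_sumsq = (rotated**2).sum(axis=0)
        gradient = scores.T @ (rotated**3 - rotated @ np.diag(column_sumsq) / n_rows)
        u, s, vt = np.linalg.svd(gradient)
        # The product of the singular-vector matrices is always orthogonal.
        rotation = u @ vt
        new_objective = s.sum()
        if objective > 0 and new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return scores @ rotation, rotation


rng = np.random.default_rng(0)
factor_scores = rng.normal(size=(50, 3))
rotated_scores, rotation_matrix = varimax_rotation(factor_scores)
```

Because the rotation matrix is orthogonal, the rotated columns span the same space as the inputs; only the orientation of the factors changes.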
sliced_inverse_regression_features#
macroforecast.feature_engineering.sliced_inverse_regression_features(
    data,
    target: str | pandas.Series | None = None,
    *,
    metadata: Mapping[str, object] | None = None,
    columns: Iterable[str] | None = None,
    n_components: int = 3,
    n_slices: int = 10,
    scaling_policy: str = "scaled_pca",
    prefix: str = "sir",
    drop_missing: bool = False,
    warn_full_sample: bool = True,
) -> pandas.DataFrame
sliced_inverse_regression_features() implements a target-aware SIR factor
transform. It aligns the predictor panel with a target series, standardizes
predictors, optionally applies predictive column scaling, slices observations by
the target distribution, and projects the full panel onto the leading
between-slice directions.
The direct callable fits on all target-aligned complete rows in the supplied
input. For strict walk-forward forecasting, use
sliced_inverse_regression_step() inside feature_spec(...); the runner then
fits SIR directions only on the feature-fit panel and applies the fixed
directions to validation/test rows.
Input#
| Name | Type | Default | Choices |
|---|---|---|---|
| | string, | input target metadata | Target signal used for SIR slicing and optional predictive scaling. |
| | iterable or | all non-target columns | Predictor columns. |
| | positive int | | Number of SIR factors to return. If the effective rank is smaller, remaining columns are zero-padded for stable shape. |
| | int | | Target-distribution slices. Must be at least |
| | string | | |
| | string | | Output prefix. Use |
| | bool | | Drop rows with missing predictor values after projection. |
| | bool | | Warn because the direct callable fits on all target-aligned complete rows. |
Output#
Returns columns such as sir1, sir2, and so on. Metadata is stored under
metadata["feature_engineering_sliced_inverse_regression"] and records the
target, predictor columns, requested/resolved components, slices, scaling policy,
fit row count, and fit_policy="full_input_target_aligned_rows".
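The core of sliced inverse regression (standardize, slice by the target, and take the leading directions of the between-slice means) can be sketched in NumPy. This is a simplified illustration on simulated data: it z-scores columns instead of fully whitening them, omits the optional predictive scaling, and uses the hypothetical name `sir_directions`, so it should not be read as the package's routine.

```python
import numpy as np


def sir_directions(X: np.ndarray, y: np.ndarray, n_components: int = 1, n_slices: int = 10):
    """Return (scores, directions) from a basic sliced inverse regression fit."""
    z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Slice observations into roughly equal-sized groups by sorted target value.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    # Weighted covariance of the slice means of the standardized predictors.
    between = np.zeros((X.shape[1], X.shape[1]))
    for rows in slices:
        slice_mean = z[rows].mean(axis=0)
        between += (len(rows) / len(y)) * np.outer(slice_mean, slice_mean)
    _, eigenvectors = np.linalg.eigh(between)
    # eigh returns ascending eigenvalues; keep the leading directions.
    directions = eigenvectors[:, ::-1][:, :n_components]
    return z @ directions, directions


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
index_signal = X @ beta
# Nonlinear but monotone link: the target depends on X only through X @ beta.
y = index_signal**3 + 0.1 * rng.normal(size=2000)

scores, directions = sir_directions(X, y, n_components=1)
recovery = abs(np.corrcoef(scores[:, 0], index_signal)[0, 1])
```

In this simulated single-index setting the first SIR score tracks the true index closely even though the link is nonlinear, which is the motivation for using the target during the fit.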
Target-Aware Feature Steps#
feature_spec(..., steps=[...]) also supports target-aware fitted transforms.
These steps use the resolved feature_spec() target during .fit(...), store a
fixed fit state, and do not look at target values during .transform(...).
features = mf.feature_engineering.feature_spec(
    target="INDPRO",
    horizon=1,
    predictors=["PAYEMS", "UNRATE", "HOUST"],
    steps=[
        mf.feature_engineering.scale_step(name="scaled", include=False),
        mf.feature_engineering.partial_least_squares_step(
            name="pls",
            input="scaled",
            n_components=2,
            min_train_size=60,
        ),
    ],
)
Target-aware steps require exactly one resolved target column. In practice that
means one target and one horizon for the step pipeline. If multiple targets
or horizons are requested, fit raises before any model is run.
| Step builder or method | Direct callable | Main options | Output |
|---|---|---|---|
| | | | |
| | | | |
| | | | Selected input columns. |
| | | | Selected input columns. |
| | | | Selected input columns. |
| | | | Selected input columns. |
| | | | Selected input columns. |
| | | | Selected input columns. |
| | | | Selected input columns. |
| | | | Selected input columns. |
Fit-state metadata records the resolved target column, selected source columns,
requested/resolved component or feature count, fit row count, and
fit_policy="fixed_fit_panel_target_aligned_rows" for target-dependent methods.
For method="variance_selection", no target is used and the fit policy is
fixed_fit_panel_columns.
Feature selection deliberately has no generic wrapper step.
Use each algorithm name directly inside feature_spec():
features = mf.feature_engineering.feature_spec(
target="INDPRO",
horizon=1,
predictors=["PAYEMS", "UNRATE", "HOUST"],
steps=[
{"name": "boruta", "method": "boruta_selection", "n_features": 2},
],
)
Custom Feature Functions#
custom_features() applies one user feature transform directly. It is useful
when the input has already been split or when the transform has no fitted state.
macroforecast.feature_engineering.custom_features(
data,
func,
*,
metadata: Mapping[str, object] | None = None,
columns: Iterable[str] | None = None,
name: str | None = None,
**params,
) -> pandas.DataFrame
The callable receives:
func(source: pandas.DataFrame, *, metadata: dict, **params)
source is the selected predictor block. The callable may return a
DataFrame, Series, or 1-D/2-D array-like object. The output must have the
same row count as source or keep a DatetimeIndex. All output columns are
coerced to numeric and validated as a macroforecast panel.
import pandas as pd

def spread_square(source, *, metadata=None, suffix="sq"):
    # Square the first selected column and name the result "<column>_<suffix>".
    column = source.columns[0]
    return pd.DataFrame({f"{column}_{suffix}": source[column] ** 2}, index=source.index)
X_custom = mf.feature_engineering.custom_features(
processed,
spread_square,
columns=["term_spread"],
name="spread_square",
)
For strict runner use, prefer custom_step() inside feature_spec(...).
The runner fits the step on the rows allowed by feature_policy and applies it
to validation/test rows without leaking future information.
macroforecast.feature_engineering.custom_step(
name,
func=None,
*,
input="panel",
include=True,
columns=None,
fit_func=None,
transform_func=None,
requires_target=False,
min_train_size=None,
prefix=None,
drop_missing=False,
**params,
) -> dict
Custom Step Modes#
Mode |
Required callable |
Fit-time call |
Transform-time call |
|---|---|---|---|
Stateless |
|
none |
|
Fitted transformer object |
|
|
|
Separate fit/transform functions |
|
|
|
State-aware callable |
|
|
|
Set requires_target=True when the fitting callable needs the resolved
feature_spec() target. This requires exactly one target and one horizon. The
fitted state metadata stores callable names, selected columns, whether the
target was used, fit row count, and output columns.
features = mf.feature_engineering.feature_spec(
target="INDPRO",
horizon=1,
predictors=["PAYEMS", "UNRATE", "HOUST"],
steps=[
mf.feature_engineering.custom_step(
"my_factor",
fit_func=my_factor_fit,
transform_func=my_factor_transform,
columns=["PAYEMS", "UNRATE", "HOUST"],
requires_target=True,
prefix="myf",
n_components=2,
),
],
)
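The example above references my_factor_fit and my_factor_transform without defining them. A minimal sketch of such a pair follows. The call signatures that custom_step() actually uses are not shown in this section, so everything here is an assumption: that the fit callable receives the selected predictor block plus the target and returns a state object, that the transform callable receives a block plus that state, and that n_components and prefix arrive as keyword arguments.

```python
import numpy as np
import pandas as pd


def my_factor_fit(source, target, *, n_components=2):
    """Hypothetical fit callable: target-weighted PCA loadings."""
    mean = source.mean()
    scale = source.std(ddof=0).replace(0.0, 1.0)
    Z = (source - mean) / scale
    # Weight each standardized column by its absolute correlation with the target.
    weights = Z.corrwith(target).abs().fillna(0.0)
    _, _, vt = np.linalg.svd((Z * weights).to_numpy(), full_matrices=False)
    return {"mean": mean, "scale": scale, "weights": weights,
            "loadings": vt[:n_components].T}


def my_factor_transform(source, state, *, prefix="myf"):
    """Hypothetical transform callable: apply the stored state, no target."""
    Z = (source - state["mean"]) / state["scale"] * state["weights"]
    scores = Z.to_numpy() @ state["loadings"]
    columns = [f"{prefix}{i + 1}" for i in range(scores.shape[1])]
    return pd.DataFrame(scores, index=source.index, columns=columns)


# Simulated panel (assumption) with the column names used in the example.
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=80, freq="MS")
panel = pd.DataFrame(rng.normal(size=(80, 3)), index=index,
                     columns=["PAYEMS", "UNRATE", "HOUST"])
y = pd.Series(panel["PAYEMS"] - 0.5 * panel["UNRATE"], index=index, name="INDPRO")

state = my_factor_fit(panel.iloc[:60], y.iloc[:60], n_components=2)
out = my_factor_transform(panel.iloc[60:], state)
```

Keeping every learned quantity inside the returned state is what lets a runner reuse the fit for validation and test rows.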
group_pca#
macroforecast.feature_engineering.group_pca(
data,
*,
groups: Mapping[str, Iterable[str]],
metadata: Mapping[str, object] | None = None,
n_components: int | Mapping[str, int] = 1,
fit_policy: str = "expanding",
min_train_size: int | None = None,
scale: bool = True,
prefix: str | None = None,
drop_missing: bool = False,
random_state: int | None = None,
warn_full_sample: bool = True,
) -> pandas.DataFrame
group_pca() extracts PCA factors separately within named groups. It is a
generic grouped factor transform. FAVAR-specific slow/fast grouping,
observed-policy variables, VAR dynamics, identification, and IRFs belong to
later model and evaluation stages.
factors = mf.feature_engineering.group_pca(
processed,
groups={
"real_activity": ["INDPRO", "PAYEMS", "UNRATE"],
"prices": ["CPIAUCSL", "PPIACO"],
},
n_components={"real_activity": 3, "prices": 2},
fit_policy="expanding",
)
Output columns use the group name as the prefix by default:
real_activity1, real_activity2, real_activity3, prices1, prices2
group_pca_step() provides the same operation inside compose_features().
Supervised And Sparse Component Boundary#
Unsupervised group PCA belongs in feature_engineering because it uses only
the predictor panel. PLS and SIR are target-aware and are available as
runner-safe feature steps when feature_spec() resolves to a single target
column. Supervised PCA variants that fit a full predictive model still belong in
macroforecast.models. Chen-Rohe sparse component analysis is unsupervised and
is available as sparse_pca_chen_rohe_features() /
sparse_pca_chen_rohe_step().
maf_features#
macroforecast.feature_engineering.maf_features(
data,
*,
metadata: Mapping[str, object] | None = None,
columns: Iterable[str] | None = None,
max_lag: int = 12,
lags: Iterable[int] | None = None,
n_components: int = 2,
fit_policy: str = "expanding",
min_train_size: int | None = None,
scale: bool = False,
prefix: str = "maf",
drop_missing: bool = False,
random_state: int | None = None,
warn_full_sample: bool = True,
) -> pandas.DataFrame
maf_features() implements Moving Average Factors. For each selected variable
x_k, it builds a variable-specific lag panel:
[x_k(t), x_k(t-1), ..., x_k(t-P)]
Then it extracts PCA components from that lag panel only. This is different
from pca_features(), which runs PCA across all selected variables, and
different from moving_average_pca_lags(), which runs PCA on a moving-average
block.
The MAF implementation is intentionally limited to the construction described in the paper: variable-specific lag panels followed by PCA. The package does not assume undocumented author-code details beyond that description.
Validation status: MARX is tested against the author-supplied R-loop pattern. MAF is tested for the documented variable-specific lag-panel PCA contract, but there is no author-code benchmark in the package yet. If author MAF code becomes available, it should be added as a separate equivalence test before tightening the claim.
MAF = mf.feature_engineering.maf_features(
X,
max_lag=12,
n_components=2,
fit_policy="expanding",
)
With two input series, this returns columns like:
INDPRO_maf1, INDPRO_maf2, PAYEMS_maf1, PAYEMS_maf2
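The construction described above (a variable-specific lag panel followed by PCA) can be sketched in a few lines of pandas and NumPy. This is an illustration only, not the package implementation: it fits once on all complete rows rather than with the expanding policy, skips scaling, and uses the hypothetical name maf_sketch and simulated data.

```python
import numpy as np
import pandas as pd


def maf_sketch(panel, max_lag=12, n_components=2, prefix="maf"):
    """Variable-specific lag-panel PCA, fitted once on all complete rows."""
    blocks = []
    for name in panel.columns:
        # Lag panel [x(t), x(t-1), ..., x(t-P)] built from this one series only.
        lagged = pd.concat(
            {f"lag{p}": panel[name].shift(p) for p in range(max_lag + 1)}, axis=1
        ).dropna()
        centered = lagged - lagged.mean()
        # PCA through the SVD of the centered lag panel.
        _, _, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
        scores = centered.to_numpy() @ vt[:n_components].T
        columns = [f"{name}_{prefix}{i + 1}" for i in range(n_components)]
        blocks.append(pd.DataFrame(scores, index=lagged.index, columns=columns))
    # Rows without a full lag history stay missing.
    return pd.concat(blocks, axis=1).reindex(panel.index)


rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=60, freq="MS")
X_demo = pd.DataFrame(rng.normal(size=(60, 2)).cumsum(axis=0),
                      index=index, columns=["INDPRO", "PAYEMS"])
maf_demo = maf_sketch(X_demo, max_lag=12, n_components=2)
```

Each series gets its own loadings, so the components summarize the recent path of one variable rather than comovement across variables.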
Input#
| Name | Type | Default | Choices |
|---|---|---|---|
| columns | iterable or None | all columns | Source series for variable-specific lag panels. |
| max_lag | non-negative int | 12 | Used when lags is None. |
| lags | iterable of non-negative ints or None | None | Exact lag set. Overrides max_lag. |
| n_components | positive int | 2 | Number of MAF components per source series. |
| fit_policy | str | "expanding" | |
| min_train_size | positive int or None | None | Minimum complete rows before emitting PCA values. |
| scale | bool | False | Whether to z-score the lag columns before PCA. Default is False. |
| prefix | str | "maf" | Component label used in output names. |
| drop_missing | bool | False | Drop rows where MAF values are unavailable. |
| warn_full_sample | bool | True | Warn when fit_policy="full_sample". |
Output#
Returns a pandas.DataFrame with one block per source series. Metadata is
stored in metadata["feature_engineering_maf"], and
macroforecast_feature_metadata records the source series for each component.
feature_matrix#
macroforecast.feature_engineering.feature_matrix(
data,
*,
metadata: Mapping[str, object] | None = None,
specification: str | Iterable[str] = "X",
columns: Iterable[str] | None = None,
level_data: feature input | None = None,
level_columns: Iterable[str] | None = None,
lags: Iterable[int] | int = (0,),
max_lag: int = 12,
n_factors: int = 8,
n_maf_components: int = 2,
fit_policy: str = "expanding",
min_train_size: int | None = None,
include_current_factor: bool = True,
scale_factors: bool = True,
scale_marx: bool = False,
scale_maf: bool = False,
drop_missing: bool = False,
warn_full_sample: bool = True,
) -> pandas.DataFrame
feature_matrix() builds named combinations used in macro-ML forecasting
papers without requiring the user to hand-write compose_features() steps.
Block |
Package implementation |
|---|---|
|
|
|
PCA factors from the supplied panel, then lags of those factors. |
|
|
|
|
|
|
specification can be a string such as "F-X-MARX" or an iterable such as
("F", "X", "MAF"). Output columns are prefixed by block, for example
F__F1_lag0, X__INDPRO_lag1, MARX__INDPRO_ma3_lag1, or
MAF__INDPRO_maf1.
include_current_factor=True ensures the F block includes current factors
even when lags contains only positive values such as range(1, 13). Set it
to False when the factor block should exactly follow the supplied lag set.
Paper-Style Specifications#
The paper-style feature families are handled directly by feature_matrix().
The parser accepts -, +, or _ separators.
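A minimal sketch of separator handling like the one described above is shown here. The function name parse_specification is hypothetical and the package's own parser may normalize or validate block names differently; this version only splits on the three separators and passes iterables through.

```python
import re


def parse_specification(spec):
    """Split 'F-X-MARX', 'F+X+MARX', or 'F_X_MARX' into a tuple of block names."""
    if isinstance(spec, str):
        parts = re.split(r"[-+_]", spec)
    else:
        parts = list(spec)  # already an iterable such as ("F", "X", "MAF")
    blocks = tuple(part.strip() for part in parts if part.strip())
    if not blocks:
        raise ValueError("specification must name at least one block")
    return blocks


blocks_dash = parse_specification("F-X-MARX")
blocks_plus = parse_specification("F+X+MAF")
blocks_iter = parse_specification(("F", "X", "MAF"))
```

Returning a tuple keeps the block order, which matters if output columns are emitted block by block.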
Specification |
Meaning |
|---|---|
|
Lagged predictor panel. |
|
PCA factors from the predictor panel, then factor lags. |
|
Factor lags plus lagged predictors. |
|
Lagged level variables from |
|
Lagged predictors plus lagged level variables from |
|
|
|
|
|
|
|
|
|
|
Specification |
Requires |
Fitted transform |
Main output blocks |
|---|---|---|---|
|
No |
No |
|
|
No |
PCA |
|
|
No |
PCA |
|
|
Yes |
No |
|
|
Yes |
No |
|
|
Yes |
PCA |
|
|
No |
PCA; optional MARX scaling |
|
|
No |
PCA and MAF PCA |
|
|
Yes |
PCA; optional MARX scaling |
|
|
Yes |
PCA and MAF PCA |
|
Examples:
FX = mf.feature_engineering.feature_matrix(
processed,
specification="F-X",
lags=range(0, 13),
n_factors=8,
fit_policy="expanding",
)
FXH = mf.feature_engineering.feature_matrix(
processed,
specification="F-X-H",
level_data=raw_bundle,
lags=range(0, 13),
n_factors=8,
)
FXMARX = mf.feature_engineering.feature_matrix(
processed,
specification="F-X-MARX",
lags=range(0, 13),
max_lag=12,
n_factors=8,
scale_marx=False,
)
FXMAF = mf.feature_engineering.feature_matrix(
processed,
specification="F-X-MAF",
lags=range(0, 13),
max_lag=12,
n_factors=8,
n_maf_components=2,
)
Input#
Name |
Type |
Default |
Choices |
|---|---|---|---|
|
string or iterable |
|
Blocks |
|
iterable or |
all columns |
Source columns from the preprocessed panel. |
|
feature input or |
|
Required when |
|
iterable or |
|
Level-data columns to include. |
|
int or iterable |
|
Lag set for |
|
positive int |
|
Maximum lag for |
|
positive int |
|
Number of PCA factors for |
|
positive int |
|
MAF components per source variable. |
|
str |
|
|
|
positive int or |
transform-specific |
Minimum complete rows for fitted transforms. |
|
bool |
|
Force lag 0 in the |
|
bool |
|
Z-score variables before PCA in the |
|
bool |
|
Match the optional author R-code scaling step for |
|
bool |
|
Z-score variable-specific MAF lag panels before PCA. |
|
bool |
|
Drop rows with missing feature values. |
|
bool |
|
Warn when any fitted block uses |
Z = mf.feature_engineering.feature_matrix(
processed,
specification="F-X-MARX",
lags=range(0, 13),
max_lag=12,
n_factors=8,
fit_policy="expanding",
drop_missing=True,
)
Use level_data= when the combination includes level variables:
Z = mf.feature_engineering.feature_matrix(
processed,
specification="F-LEVEL",
level_data=raw_bundle,
lags=range(0, 13),
)
compose_features#
macroforecast.feature_engineering.compose_features(
data,
steps,
*,
metadata: Mapping[str, object] | None = None,
columns: Iterable[str] | None = None,
include_original: bool = False,
drop_missing: bool = False,
) -> pandas.DataFrame
steps is an ordered list of mappings. Each step has:
Key |
Meaning |
|---|---|
|
Step name. Later steps can reference this name. |
|
One of |
|
|
|
Whether this step’s output is included in final |
other keys |
Method-specific parameters such as |
Examples:
# PCA, then lags of the PCA factors.
X = mf.feature_engineering.compose_features(
processed,
[
{"name": "pc", "method": "pca", "columns": ["PAYEMS", "INDPRO"], "n_components": 2, "include": False},
{"name": "pc_lags", "method": "lag", "input": "pc", "lags": [1, 2, 3]},
],
)
# Lags first, then PCA on the lag block.
X = mf.feature_engineering.compose_features(
processed,
[
{"name": "lag_block", "method": "lag", "lags": [0, 1, 2, 3], "include": False},
{"name": "lag_pc", "method": "pca", "input": "lag_block", "n_components": 4},
],
)
# Moving-average ladder, PCA, then lags of the factor.
X = mf.feature_engineering.compose_features(
processed,
[
{"name": "ma", "method": "moving_average_ladder", "windows": [1, 2, 4, 8, 12], "include": False},
{"name": "ma_pc", "method": "pca", "input": "ma", "n_components": 4, "include": False},
{"name": "ma_pc_lags", "method": "lag", "input": "ma_pc", "lags": [1, 2]},
],
)
# MAF as a direct block inside a composed feature matrix.
X = mf.feature_engineering.compose_features(
processed,
[
{"name": "maf", "method": "maf", "max_lag": 12, "n_components": 2},
],
)
# MARX shorthand: increasing averages of lagged predictors.
X = mf.feature_engineering.compose_features(
processed,
[
mf.feature_engineering.marx_step(max_lag=12, scale_lags=False),
],
)
# Extra deterministic transforms can be composed the same way.
X = mf.feature_engineering.compose_features(
processed,
[
mf.feature_engineering.transform_step(name="log_ip", transform="log", columns=["INDPRO"], include=False),
mf.feature_engineering.lag_step(name="log_ip_lag", input="log_ip", lags=[1, 2, 3]),
mf.feature_engineering.interaction_step(name="cross", columns=["PAYEMS", "HOUST"]),
],
)
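The MARX shorthand in the examples above builds increasing averages of lagged predictors. The sketch below shows one reading of that construction in plain pandas: for each series and each p from 1 to max_lag, average lags 1 through p. Starting the average at lag 1 is an assumption inferred from column names such as INDPRO_ma3_lag1; the function name marx_sketch is hypothetical and the optional lag scaling is omitted.

```python
import pandas as pd


def marx_sketch(panel, max_lag=12):
    """For each series, average lags 1..p for p = 1..max_lag."""
    out = {}
    for name in panel.columns:
        lag_sum = 0.0
        for p in range(1, max_lag + 1):
            # Running sum of x[t-1], ..., x[t-p]; dividing by p gives the average.
            lag_sum = lag_sum + panel[name].shift(p)
            out[f"{name}_ma{p}_lag1"] = lag_sum / p
    return pd.DataFrame(out, index=panel.index)


demo_panel = pd.DataFrame({"INDPRO": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
marx_demo = marx_sketch(demo_panel, max_lag=3)
```

In this reading the window-3 column at row 3 averages the values at rows 0, 1, and 2, so every feature uses only information dated before the row it is attached to.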
time_features#
macroforecast.feature_engineering.time_features(
data,
*,
metadata: Mapping[str, object] | None = None,
trend: bool = True,
month: bool = False,
quarter: bool = False,
year: bool = False,
) -> pandas.DataFrame
Input And Output#
Option |
Output columns |
|---|---|
|
|
|
|
|
|
|
|
Additional Transform Helpers#
These helpers are feature-engineering transforms, not preprocessing t-codes. Use them when the model feature set needs extra ML-oriented columns after the canonical panel has already been cleaned.
Function |
Main options |
Output |
|---|---|---|
|
|
|
|
Thin named wrappers around |
Same as above. |
|
Seasonal step length and seasonal lag count. |
|
|
|
Month or quarter dummies. |
|
Seasonal period and harmonic order. |
Sine/cosine seasonal terms. |
|
|
Named polynomial expansion columns. |
|
Exact-order pure interaction expansion without lower-order terms or powers. |
|
|
HP |
HP cycle/trend columns. |
|
Hamilton horizon |
|
|
Centered filter window, polynomial order, derivative; |
Smoothed columns. |
|
Causal rolling approximation/detail levels; |
|
|
Feature wrapper around |
|
|
Sorts each row’s selected columns in ascending order. |
|
|
Target-aware PLSRegression scores; warns by default. |
|
|
Static DFM approximation by standardized PCA; warns by default. |
|
|
Select by sample variance; no target required. |
Subset of original columns. |
|
Select by absolute target correlation. |
Subset of original columns. |
|
Select by absolute lasso coefficient. |
Subset of original columns. |
|
Select by lasso-path inclusion frequency. |
Subset of original columns. |
|
Select by recursive feature elimination. |
Subset of original columns. |
|
Select by Boruta-style shadow-feature tests. |
Subset of original columns. |
|
Select by repeated sparse-model subsampling frequency. |
Subset of original columns. |
|
Select by genetic subset search. |
Subset of original columns. |
|
Gaussian random projection; |
|
|
Kernel approximation settings; |
|
Filter-Backed Features#
macroforecast.filters owns one-series filter and smoother callables.
macroforecast.feature_engineering owns panel wrappers that turn those outputs
into feature columns:
Feature wrapper |
Direct filter |
|---|---|
|
|
|
|
|
|
|
|
|
|
For AlbaMA method details, R-code alignment, and weight extraction, see
Filters. adaptive_ma_rf_features() stores full AlbaMAResult
objects in attrs["macroforecast_feature_weight_results"] so
feature_analysis.effective_window() and
feature_analysis.recent_weight_share() can inspect learned weights.
features = mf.feature_engineering.adaptive_ma_rf_features(
processed.panel,
columns=["CPIAUCSL", "INDPRO"],
sided="one",
)
albama_results = features.attrs["macroforecast_feature_weight_results"]
random_projection_features() and nystroem_features() fit on complete rows
of the provided input and warn by default because the direct helpers are
full-input fitted helpers. For strict origin-by-origin forecasting, use
random_projection_step() and nystroem_step() inside feature_spec(); the
runner fits the projection/kernel state on the feature-fit panel and reuses the
fixed state for validation/test rows.
hamilton_filter_features() follows Hamilton’s regression form:
y[t+h] = a + b_0 y[t] + ... + b_{p-1} y[t-p+1] + e[t+h].
The fitted value is stored as the trend and the residual as the cycle, both
labeled at t+h. Defaults h=8, p=4 match the common quarterly setting; for
monthly panels, h=24, p=12 is the usual analogue. With the default
fit_policy="expanding", each row is estimated using only the
Hamilton-regression rows completed before it. fit_policy="full_sample"
reproduces the ordinary in-sample filter and warns by default because it can
use information from after the forecasting origin.
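The regression form above can be sketched directly with NumPy least squares. This version corresponds to the full-sample style (one regression over all available rows), so it is not a substitute for the expanding policy; the function name hamilton_filter and the simulated series are hypothetical.

```python
import numpy as np


def hamilton_filter(y, h=8, p=4):
    """Full-sample Hamilton regression: returns (trend, cycle) labeled at t+h."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    trend = np.full(n, np.nan)
    cycle = np.full(n, np.nan)
    # Origin t needs regressors y[t], ..., y[t-p+1] and the response y[t+h].
    origins = np.arange(p - 1, n - h)
    X = np.column_stack([np.ones(len(origins))] + [y[origins - j] for j in range(p)])
    response = y[origins + h]
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    fitted = X @ beta
    trend[origins + h] = fitted             # fitted value stored as the trend
    cycle[origins + h] = response - fitted  # residual stored as the cycle
    return trend, cycle


rng = np.random.default_rng(0)
series = rng.normal(size=120).cumsum()      # simulated random walk (assumption)
trend, cycle = hamilton_filter(series, h=8, p=4)
```

The first h + p - 1 rows have no trend or cycle value because no complete regression row ends there. An expanding variant would refit beta for each row using only rows whose response date is already observed.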
feature_spec#
macroforecast.feature_engineering.feature_spec(
*,
target: str | None = None,
targets: Iterable[str] | None = None,
horizon: int | None = None,
horizons: Iterable[int] | int | None = None,
predictors: Literal["all"] | Iterable[str] | None = None,
lags: Iterable[int] | int | None = (0, 1),
target_lags: Iterable[int] | int | None = None,
rolling_windows: Iterable[int] | int | None = None,
rolling_min_periods: int | None = None,
add_time: bool = False,
time_trend: bool = True,
time_month: bool = False,
time_quarter: bool = False,
time_year: bool = False,
pca_components: int | None = None,
pca_columns: Iterable[str] | None = None,
pca_scale: bool = True,
pca_prefix: str = "pc",
steps: Iterable[Mapping[str, object]] | None = None,
feature_steps: Iterable[Mapping[str, object]] | None = None,
include_original: bool = False,
target_transform: str = "level",
target_mode: str = "direct",
drop_missing: bool = True,
metadata: Mapping[str, object] | None = None,
) -> FeatureSpec
feature_spec() is the runner-safe feature contract. It is fitted by
forecasting.run() according to feature_policy, so stateful choices such as
scaling, PCA, grouped PCA, and MAF are estimated on the allowed
training/reference panel and reused when transforming validation/test rows.
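The idea of estimating stateful choices on the allowed training panel and reusing them for test rows can be shown with a small NumPy sketch. It is not the runner's code: the window loop, the helper names fit_scaled_pca and apply_scaled_pca, and the simulated panel are all assumptions made for illustration.

```python
import numpy as np


def fit_scaled_pca(train, n_components):
    """Estimate center, scale, and PCA loadings from training rows only."""
    mean = train.mean(axis=0)
    scale = train.std(axis=0, ddof=0)
    scale = np.where(scale == 0, 1.0, scale)
    _, _, vt = np.linalg.svd((train - mean) / scale, full_matrices=False)
    return {"mean": mean, "scale": scale, "loadings": vt[:n_components].T}


def apply_scaled_pca(rows, state):
    """Transform any rows with the fixed state; nothing is re-estimated."""
    return ((rows - state["mean"]) / state["scale"]) @ state["loadings"]


rng = np.random.default_rng(1)
panel = rng.normal(size=(100, 5))

factors_by_origin = {}
for origin in (60, 80):
    # Refit inside each window using only rows before the forecast origin.
    state = fit_scaled_pca(panel[:origin], n_components=2)
    factors_by_origin[origin] = apply_scaled_pca(panel[origin:origin + 1], state)
```

Since the state is a function of panel[:origin] alone, changing any later row leaves the fitted center, scale, and loadings untouched.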
Input#
Name |
Type |
Default |
Meaning |
|---|---|---|---|
|
string/iterable or |
from input metadata |
Target column or target columns. |
|
positive int/iterable or |
from input, then |
Forecast horizon choices. |
|
|
from input, then all non-target columns |
Predictor columns. |
|
int, iterable, or |
|
Predictor lags. |
|
int, iterable, or |
|
Explicit autoregressive target lags added to |
|
positive int/iterable or |
|
Optional rolling means. |
|
positive int or |
window length |
Minimum observations for rolling means. |
|
bool |
|
Add deterministic date features. |
|
bool |
see signature |
Which deterministic date features to add. |
|
positive int or |
|
Fit PCA on the allowed feature-fit panel and append fixed-loadings components. |
|
iterable or |
predictors |
Columns used for PCA. |
|
bool |
|
Standardize PCA inputs using the feature-fit panel. |
|
string |
|
PCA output prefix. |
|
iterable of step mappings or |
|
Fit-aware feature-step pipeline. Use public step builders for deterministic/fitted transforms: |
|
bool |
|
Include the original predictor panel as part of |
|
string |
|
Same target choices as |
|
string |
|
|
|
bool |
|
Drop rows with missing selected |
|
mapping or |
|
User metadata stored inside the feature spec record. |
When steps are supplied, they replace the shortcut predictor options
rolling_windows, add_time, and pca_components; use the corresponding step
builders instead. The default lags=(0, 1) shortcut is also not used in step
mode unless you explicitly add a lag_step(). target_lags is not a predictor
shortcut; it is appended as a separate autoregressive target block after the
step pipeline, so paper-style designs can combine steps=[...] with
target_lags=range(0, 13).
Output#
Returns FeatureSpec. Important methods:
Method |
Output |
Meaning |
|---|---|---|
|
|
Fits reusable feature state for PCA/scaling/grouped PCA/MAF steps on the supplied panel. |
|
|
Fits and transforms the same panel. |
|
|
JSON-ready feature choices for result metadata. |
|
|
Compact runner metadata. |
Fit-Aware Step Pipeline#
Step pipelines let the runner refit feature transformations inside each forecasting window:
features = mf.feature_engineering.feature_spec(
target="INDPRO",
horizon=1,
predictors=["PAYEMS", "HOUST", "S&P 500"],
steps=[
mf.feature_engineering.scale_step(name="scaled", include=False),
mf.feature_engineering.pca_step(
name="pc",
input="scaled",
n_components=3,
min_train_size=60,
include=False,
),
mf.feature_engineering.lag_step(name="pc_lag", input="pc", lags=range(0, 13)),
],
)
Each step has a name, method, input, and include flag. input="panel"
reads the original predictor panel; input="<step name>" reads a prior step;
input="target_panel" reads the resolved target columns. The last input is
explicitly opt-in: predictors still reject target overlap, but paper designs
that require target-derived feature blocks such as MARX_y or MAF_y can
construct them without treating the target as an ordinary predictor. If
include=False, the step is an intermediate fitted transformation and its
output is not included in the final X, but its metadata is still recorded.
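How name, input, and include interact can be sketched with a tiny stateless executor. This is a simplified illustration, not the FeatureSpec implementation: it ignores fitted state and metadata, and the names run_steps, lag, demean, and RESERVED_KEYS are hypothetical.

```python
import pandas as pd

RESERVED_KEYS = {"name", "method", "input", "include"}


def run_steps(panel, target_panel, steps, transforms):
    """Resolve each step's input by name and collect the included outputs."""
    outputs = {"panel": panel, "target_panel": target_panel}
    included = []
    for step in steps:
        source = outputs[step.get("input", "panel")]  # KeyError for unknown names
        params = {k: v for k, v in step.items() if k not in RESERVED_KEYS}
        result = transforms[step["method"]](source, **params)
        outputs[step["name"]] = result                # later steps may read this
        if step.get("include", True):
            included.append(result)
    return pd.concat(included, axis=1)


def lag(frame, lags):
    return pd.concat([frame.shift(k).add_suffix(f"_lag{k}") for k in lags], axis=1)


def demean(frame):
    return frame - frame.mean()


predictors = pd.DataFrame({"PAYEMS": [1.0, 2.0, 4.0, 7.0]})
targets = pd.DataFrame({"INDPRO": [10.0, 11.0, 13.0, 16.0]})
X_steps = run_steps(
    predictors,
    targets,
    [
        {"name": "centered", "method": "demean", "include": False},
        {"name": "centered_lags", "method": "lag", "input": "centered", "lags": [1, 2]},
        {"name": "y_lags", "method": "lag", "input": "target_panel", "lags": [1]},
    ],
    {"lag": lag, "demean": demean},
)
```

The intermediate centered step feeds the lag step but contributes no columns of its own, and the target-derived block appears only because a step asked for input="target_panel" explicitly.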
Example target-derived MARX block:
features = mf.feature_engineering.feature_spec(
target="INDPRO",
horizon=3,
predictors=["PAYEMS", "UNRATE", "HOUST"],
steps=[
mf.feature_engineering.marx_step(name="MARX_X", max_lag=12),
mf.feature_engineering.marx_step(
name="MARX_y",
input="target_panel",
columns=["INDPRO"],
max_lag=12,
),
],
target_lags=range(0, 13),
)
Stateful step builders are interpreted as fixed-fit transformations inside
FeatureSpec: the runner’s feature_policy determines which rows are used to
fit the step. Any fit_policy value inherited from reusable step builders is
ignored in feature_spec() mode because the runner owns the temporal fit
policy.
Step builder |
Runner-safe behavior |
|---|---|
|
Deterministic lag transform. |
|
Deterministic rolling mean transform. |
|
Deterministic moving-average ladder. |
|
MARX increasing lag averages; with |
|
Deterministic column transform: |
|
Deterministic seasonal lag such as 12-month or 4-quarter lag blocks. |
|
Deterministic month or quarter date dummies from the index. |
|
Deterministic Fourier seasonal terms from the index. |
|
Deterministic trend, month, quarter, and year columns from the index. |
|
Deterministic polynomial expansion. |
|
Deterministic pure interaction terms. |
|
Fits center/scale on the feature-fit panel, then applies fixed parameters. |
|
Fits PCA loadings on the feature-fit panel, then applies fixed loadings. |
|
Fits Chen-Rohe sparse loadings on the feature-fit panel, then applies fixed loadings; optional |
|
Fits an orthogonal rotation on factor-score columns from the feature-fit panel, then applies the fixed rotation. |
|
Fits separate PCA states inside named groups. |
|
Fits variable-specific lag-panel PCA states for Moving Average Factors. |
|
Fits Hamilton-regression beta on the feature-fit panel, then applies fixed beta to train/validation/test rows. |
|
Fits a Gaussian random-projection transformer on the feature-fit panel and applies fixed components. |
|
Fits Nystroem kernel-approximation landmarks on the feature-fit panel and applies fixed components. |
|
Fits PLS components against the single resolved target on the feature-fit panel and applies fixed weights. |
|
Fits target-sliced directions on the feature-fit panel and applies fixed directions. |
|
Select columns on the feature-fit panel and reuse the selected columns. Use these as |
In feature_spec() mode, hamilton_step() ignores the reusable step’s
fit_policy argument because the runner’s feature_policy owns the allowed
fit rows. The fitted state records fit_policy="fixed_fit_panel". Runner-safe
Hamilton currently requires missing="drop"; impute missing values in
preprocessing before using it. The direct helper and compose_features() still
support missing="interpolate" for one-shot exploratory construction.
Direct pandas functions and runner-safe step builders are intentionally paired:
Direct function |
Runner-safe step |
Fit state? |
Typical use |
|---|---|---|---|
|
|
No |
Add current/lagged predictors. |
|
|
No |
Add trailing rolling means. |
|
|
No |
Add multi-scale moving-average blocks. |
|
|
Only when |
Add MARX increasing lag averages. |
|
|
No |
Add ML-side transforms after preprocessing. |
|
|
No |
Add seasonal lag blocks. |
|
|
No |
Add calendar dummies. |
|
|
No |
Add deterministic seasonal Fourier terms. |
|
|
No |
Add deterministic trend/month/quarter/year terms. |
|
|
No |
Add nonlinear expansions. |
|
|
No |
Add cross-products. |
|
|
Yes |
Fit center/scale on allowed rows. |
|
|
Yes |
Fit PCA loadings on allowed rows. |
|
|
Yes |
Fit Chen-Rohe sparse component loadings on allowed rows. |
|
|
Yes |
Fit orthogonal factor rotation on allowed rows. |
|
|
Yes |
Fit separate PCA states by group. |
|
|
Yes |
Fit variable-specific lag-panel PCA states. |
|
|
Yes |
Fit Hamilton-regression beta on allowed rows, then apply fixed beta. |
|
|
Yes |
Fit Gaussian random-projection state on allowed rows. |
|
|
Yes |
Fit Nystroem kernel landmarks on allowed rows. |
|
|
Yes |
Fit PLS scores against the resolved target on allowed rows. |
|
|
Yes |
Fit SIR directions against the resolved target on allowed rows. |
|
|
Yes |
Select columns by sample variance on allowed rows; no target required. |
|
|
Yes |
Select columns by target correlation on allowed rows. |
|
|
Yes |
Select columns by lasso coefficient magnitude on allowed rows. |
|
|
Yes |
Select columns by lasso-path inclusion frequency on allowed rows. |
|
|
Yes |
Select columns by recursive feature elimination on allowed rows. |
|
|
Yes |
Select columns by Boruta-style shadow-feature tests on allowed rows. |
|
|
Yes |
Select columns by sparse-model subsampling frequency on allowed rows. |
|
|
Yes |
Select columns by genetic subset search on allowed rows. |
The following helpers stay callable but are intentionally not yet accepted as
FeatureSpec step methods:
Helper |
Why not a runner-safe step yet |
|---|---|
|
It changes the date anchor and native-frequency lookup calendar. This belongs with mixed-frequency data/model design, not ordinary same-index feature steps. |
|
HP filtering is two-sided on the supplied sample. It remains direct-only and warns by default; use |
|
The smoother uses a centered local window over the supplied sample. It remains direct-only and warns by default; use trailing |
build_features() remains broader for one-shot construction, including
feature_specification="F-X-MARX" and feature_specification="F-X-MAF".
Use it when you want to materialize a complete FeatureSet first. Use
feature_spec(..., steps=...) when the feature transformations themselves must
be refit inside forecasting.run() according to the window design.
build_features#
macroforecast.feature_engineering.build_features(
data,
*,
metadata: Mapping[str, object] | None = None,
target: str | None = None,
targets: Iterable[str] | None = None,
horizon: int | None = None,
horizons: Iterable[int] | int | None = None,
predictors: Literal["all"] | Iterable[str] | None = None,
lags: Iterable[int] | int = (0, 1),
target_lags: Iterable[int] | int | None = None,
rolling_windows: Iterable[int] | int | None = None,
rolling_min_periods: int | None = None,
add_time: bool = False,
time_trend: bool = True,
time_month: bool = False,
time_quarter: bool = False,
time_year: bool = False,
feature_steps: Iterable[Mapping[str, object]] | None = None,
feature_specification: str | Iterable[str] | None = None,
include_original: bool = False,
level_data: feature input | None = None,
max_lag: int = 12,
n_factors: int = 8,
n_maf_components: int = 2,
feature_fit_policy: str = "expanding",
feature_min_train_size: int | None = None,
feature_warn_full_sample: bool = True,
include_current_factor: bool = True,
scale_factors: bool = True,
scale_marx: bool = False,
scale_maf: bool = False,
target_transform: str = "level",
target_mode: str = "direct",
drop_missing: bool = True,
) -> FeatureSet
Input#
Name |
Type |
Default |
Choices |
|---|---|---|---|
|
string/iterable or |
from |
Target column choices. One of them is required if the input does not already define targets. |
|
positive int/iterable or |
from input, then |
Forecast horizons. |
|
|
from input, then all non-target columns |
Predictor columns. Target columns are rejected as predictors. |
|
int or iterable |
|
Current value plus lag one by default. |
|
int, iterable, or |
|
Add autoregressive target-lag columns to |
|
positive int/iterable or |
|
Add rolling-mean features for each window. |
|
positive int or |
window length |
Passed to |
|
bool |
|
Add deterministic date features. |
|
bool |
|
Which date features to include when |
|
iterable of mappings or |
|
If supplied, use |
|
string/iterable or |
|
If supplied, use |
|
bool |
|
Include original predictors when |
|
feature input or |
|
Passed to |
|
positive int |
|
Passed to |
|
positive int |
|
Number of |
|
positive int |
|
MAF components per source variable when |
|
str |
|
Fit policy passed to |
|
positive int or |
|
Minimum complete rows passed to |
|
bool |
|
Warn when block-based fitted transforms use |
|
bool |
|
Force lag 0 for the |
|
bool |
|
Scale variables before |
|
bool |
|
Apply optional author R-code lag-matrix scaling for |
|
bool |
|
Scale MAF lag panels before PCA. |
|
str |
|
Same choices as |
|
str |
|
|
|
bool |
|
Drop rows where any selected |
target_mode="path" is a target-construction shortcut only. It does not fit
or forecast one model per step; that belongs in the model stage. It also does
not average forecasts; horizon-level forecast averaging belongs in evaluation.
The returned FeatureSet.y contains step columns, and metadata records which
step columns belong to each requested horizon.
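A minimal pandas sketch of path-style target construction, under stated assumptions: level targets only, hypothetical column names of the form INDPRO_step3, a hypothetical function name path_targets, and a step-to-horizon mapping in which a step is consumed by every requested horizon at least as long as the step.

```python
import pandas as pd


def path_targets(series, horizons):
    """Build one target column per future step up to the longest horizon."""
    max_h = max(horizons)
    steps = range(1, max_h + 1)
    # Step s aligns the value s periods ahead with the forecast origin row.
    y = pd.DataFrame({f"{series.name}_step{s}": series.shift(-s) for s in steps})
    # Which requested horizons would later average over each step forecast.
    used_for_horizons = {
        f"{series.name}_step{s}": [h for h in horizons if h >= s] for s in steps
    }
    return y, used_for_horizons


indpro = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="INDPRO")
y_path, used_for_horizons = path_targets(indpro, horizons=[1, 3])
```

Nothing here fits a model or averages forecasts; the function only lays out the step columns and records which horizons depend on them.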
Output#
Returns FeatureSet.
Field |
Type |
Meaning |
|---|---|---|
|
|
Predictor matrix aligned on forecast origin dates. |
|
|
Direct horizon targets or path step targets aligned to |
|
|
Input metadata plus a |
|
|
Generated-feature provenance. Core columns are |
|
|
Target-column provenance. Core columns are |
|
scalar/tuple fields |
Resolved study choices. |
FeatureSet supports tuple unpacking:
X, y, metadata = features
Metadata#
metadata["feature_engineering"] records:
Key |
Meaning |
|---|---|
|
Shape, date range, columns, missing count, and inferred index frequency. |
|
Resolved study choices. |
|
Target formula choice. |
|
|
|
Step columns to average later when |
|
Predictor and autoregressive target-lag construction choices. |
|
|
|
|
|
Ordered composition steps when |
|
Deterministic date-feature choices. |
|
Whether rows with missing |
|
Final row count, feature count, target count, and sample dates. |
Feature Metadata#
Each feature-producing function attaches macroforecast_feature_metadata to
the returned DataFrame. build_features() exposes the same table as
FeatureSet.feature_metadata.
The table is normalized through a single schema helper. The first columns are always:
Column |
Meaning |
|---|---|
|
Generated feature column name. |
|
Producing step name when created through |
|
Paper-style block such as |
|
Operation family, for example |
|
Main source column, source group, or |
|
Compact parameter string such as |
|
Parsed numeric fields when the feature name/operation carries them. |
|
Fitting policy for stateful transforms. In |
|
Comma-separated source columns used by the feature. |
|
|
Extra columns are preserved after the standard columns. For example,
mixed_frequency_lags() adds source-frequency and lookup-calendar fields.
The metadata frame also carries
attrs["macroforecast_metadata_schema"] = {"kind": "feature_metadata", "version": 1, ...}.
features = mf.feature_engineering.build_features(
processed,
feature_specification="F-X-MARX",
lags=range(0, 13),
max_lag=12,
n_factors=8,
)
features.feature_metadata.loc[
features.feature_metadata["feature"] == "MARX__INDPRO_ma3_lag1",
["block", "operation", "source", "window", "lag"],
]
This records that MARX__INDPRO_ma3_lag1 came from the MARX block, source
series INDPRO, window 3, and lag 1. Intermediate compose_features() steps
are also recorded; the included column marks whether a step output is part of
the final X matrix.
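When only column names are at hand, the same fields can be recovered approximately from the naming pattern. The regular expression below is a hypothetical convenience based on the example names in this section (block prefix, then source, then optional _maN and _lagN suffixes); the package's metadata table remains the authoritative record.

```python
import re

FEATURE_NAME = re.compile(
    r"^(?:(?P<block>[A-Za-z]+)__)?"   # optional block prefix such as MARX__
    r"(?P<source>.+?)"                # source column, matched lazily
    r"(?:_ma(?P<window>\d+))?"        # optional moving-average window
    r"(?:_lag(?P<lag>\d+))?$"         # optional lag
)


def parse_feature_name(name):
    """Return block, source, window, and lag parsed from a feature name."""
    match = FEATURE_NAME.match(name)
    if match is None:
        raise ValueError(f"cannot parse feature name: {name!r}")
    fields = match.groupdict()
    for key in ("window", "lag"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


parsed = parse_feature_name("MARX__INDPRO_ma3_lag1")
```

Names that do not follow this pattern fall through with the whole remainder treated as the source, so the result should be checked against feature_metadata where it matters.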
Target Metadata#
Target-producing functions attach macroforecast_target_metadata to the
returned target frame. build_features() exposes the same table as
FeatureSet.target_metadata.
features = mf.feature_engineering.build_features(
processed,
target="INDPRO",
horizons=[1, 3, 6],
target_transform="growth",
)
features.target_metadata.loc[
features.target_metadata["target_column"] == "INDPRO_growth_h3",
["source", "horizon", "mode", "transform", "formula"],
]
For direct targets, horizon is the forecast horizon and step is empty. For
path targets, step identifies the future step and used_for_horizons records
which requested horizons later consume that step forecast. This keeps the
target construction, model-stage step fitting, and evaluation-stage averaging
separate while preserving the contract in metadata.
Error Conditions#
Condition |
Result |
|---|---|
Input is not a canonical panel-like object |
|
Target is missing and input has no target metadata |
|
Target/predictor names are not in the panel |
|
Predictors include target columns |
|
Horizons, windows, or min periods are non-positive |
|
Feature construction leaves no aligned rows |
|