macroforecast.data_analysis#

Back to reference

Purpose#

macroforecast.data_analysis is the read-only inspection module for canonical pandas panels. It covers two related tasks:

Task

Main function

Input count

Use case

Single-panel summary

summarize_data(data)

1

Inspect one raw, processed, or custom panel.

Raw-vs-processed comparison

analyze_data(raw, clean)

2

Inspect what changed after preprocessing.

This module validates inputs, computes tables, and returns report objects. It does not load data, transform values, impute missing observations, create features, fit models, evaluate forecasts, or write files.

Inputs must satisfy the canonical panel contract used by macroforecast.data: pandas.DataFrame, DatetimeIndex named "date", ascending dates, no duplicate dates, no duplicate columns, numeric values or NaN, no infinite values, and non-empty shape. summarize_data(...) also accepts DataBundle, DataSpec, (panel, metadata), and PreprocessedData-style objects with .panel and .metadata.

Public Functions#

Function

Kind

Output

Purpose

summarize_data(data, ...)

single-panel report

DataSummaryReport

Standard summary suite for one panel.

panel_overview(data)

single-panel helper

dict

Shape, dates, frequency, missingness, metadata keys.

panel_snapshot(data)

single-panel helper

dict

Compact rows/columns/dates/missingness/frequency snapshot.

sample_coverage(data)

single-panel helper

DataFrame

Per-series first/last valid dates, observation counts, missing rates.

observation_counts(data)

single-panel helper

Series

Per-series non-missing observation counts.

missing_rates(data)

single-panel helper

Series

Per-series missing rates.

univariate_summary(data, ...)

single-panel helper

DataFrame

Per-series numeric descriptive statistics.

missing_summary(data)

single-panel helper

DataFrame

Missing count, missing rate, longest missing run.

correlation_matrix(data, ...)

single-panel helper

DataFrame

Pairwise numeric correlation matrix.

outlier_summary(data, ...)

single-panel helper

DataFrame

IQR and/or z-score outlier counts and rates.

stationarity_tests(data, ...)

single-panel helper

dict

ADF, Phillips-Perron, KPSS, or all three.

phillips_perron_test(values, ...)

statistic helper

dict

Native PP fallback used when arch is unavailable.

mackinnon_pp_pvalue(z_tau, ...)

statistic helper

float

Approximate p-value helper for native PP.

analyze_data(raw, clean, ...)

before/after report

DataAnalysisReport

Standard comparison suite for raw and processed panels.

compare_panels(raw, clean, ...)

before/after helper

dict

Shape/date/column/index comparison plus changed-cell count.

panel_snapshots(raw, clean)

before/after helper

dict

Compact before/after snapshots.

changed_cells(raw, clean, ...)

before/after helper

DataFrame

Boolean changed-cell mask on common dates and columns.

changed_cell_count(raw, clean, ...)

before/after helper

int

Count changed common cells.

changed_cell_summary(raw, clean, ...)

before/after helper

dict

Changed-cell denominator, count, rate, and tolerance.

missing_shift(raw, clean)

before/after helper

DataFrame

Missing-count and missing-rate changes.

distribution_shift(raw, clean, ...)

before/after helper

DataFrame

Mean, scale, tail-shape, and KS-style shifts.

correlation_shift(raw, clean, ...)

before/after helper

DataFrame

Cleaned-minus-raw correlation differences.

cleaning_effect_summary(...)

metadata helper

dict

Normalize preprocessing metadata and counters.

Public Flow#

import macroforecast as mf

bundle = mf.data.load_fred_md()
summary = mf.data_analysis.summarize_data(
    bundle,
    include_outliers=True,
    include_stationarity=True,
)

spec = mf.data.spec(bundle, target="INDPRO", horizons=[1, 3, 6, 12])
processed = mf.preprocessing.reprocess(spec)

analysis = mf.data_analysis.analyze_data(
    spec.panel,
    processed.panel,
    include_correlation=True,
)

Example single-panel output:

summary.overview
{
    "n_rows": 4,
    "n_columns": 2,
    "start": "2020-01-01",
    "end": "2020-04-01",
    "missing_values": 1,
    "frequency": "monthly",
    "metadata_keys": ["dataset", "frequency"],
}

Example raw-vs-processed output:

analysis.comparison
{
    "raw_shape": (4, 3),
    "clean_shape": (4, 3),
    "raw_missing_total": 1,
    "clean_missing_total": 0,
    "common_columns": ["y", "x1", "x2"],
    "common_index_count": 4,
    "changed_cell_count": 2,
}

summarize_data#

Run the standard one-panel summary suite.

Signature#

macroforecast.data_analysis.summarize_data(
    data,
    *,
    metrics: Sequence[str] | None = None,
    include_correlation: bool = False,
    correlation_method: str = "pearson",
    include_outliers: bool = False,
    outlier_method: str = "iqr",
    include_stationarity: bool = False,
    stationarity_test: str = "multi",
    stationarity_scope: str = "all",
) -> DataSummaryReport

Input#

Name

Type

Default

Allowed values

Meaning

data

DataBundle, DataSpec, PreprocessedData, (panel, metadata), or DataFrame

required

canonical panel input

Panel to summarize.

metrics

sequence or None

default summary metrics

mean, sd, min, max, skew, kurtosis, n_obs, n_missing

Univariate statistics to compute.

include_correlation

bool

False

True, False

Include correlation_matrix(...).

correlation_method

str

"pearson"

"pearson", "spearman", "kendall"

Correlation method when correlation is included.

include_outliers

bool

False

True, False

Include outlier_summary(...).

outlier_method

str

"iqr"

"iqr", "zscore", "multi", "both"

Outlier rule when outliers are included.

include_stationarity

bool

False

True, False

Include stationarity_tests(...).

stationarity_test

str

"multi"

"adf", "pp", "kpss", "multi", "none"

Unit-root/stationarity test choice.

stationarity_scope

str

"all"

"all", "target_and_predictors", "target_only", "predictors_only"

Columns to test.

Defaults#

Default

Value

Summary metrics

DEFAULT_SUMMARY_METRICS = ("mean", "sd", "min", "max", "n_obs", "n_missing")

Correlation included

False

Outlier summary included

False

Stationarity tests included

False

Metadata stage

metadata["data_analysis"] with analysis_type="single_panel"

Output#

Returns DataSummaryReport.

Field

Type

Meaning

overview

dict

Panel row/column count, date range, missing total, inferred frequency, metadata keys.

coverage

DataFrame

Per-series first/last observed date, n_obs, n_missing, missing rate.

univariate

DataFrame

Per-series descriptive statistics selected by metrics.

missing

DataFrame

Per-series missing count, missing rate, longest missing run.

correlation

DataFrame or None

Numeric correlation matrix when requested.

outliers

DataFrame or None

IQR and/or z-score outlier counts and rates when requested.

stationarity

dict or None

ADF/PP/KPSS results when requested.

metadata

dict

Input metadata plus compact data_analysis run metadata.

DataSummaryReport.to_dict() converts DataFrame fields into nested dictionaries for serialization.

The returned coverage, univariate, missing, correlation, and outliers tables carry attrs["macroforecast_metadata"] == summary.metadata when the table is present.

Metadata#

summarize_data(...) stores run-level facts, not duplicate result tables:

summary.metadata["data_analysis"]

Key

Meaning

analysis_type

"single_panel".

metrics

Univariate metrics requested.

include_correlation, correlation_method

Correlation option state.

include_outliers, outlier_method

Outlier option state.

include_stationarity, stationarity_test, stationarity_scope

Stationarity option state.

panel

Compact panel snapshot.

input

Source metadata snapshot and metadata-key list.

outputs

Boolean flags for report fields included.

Single-Panel Helpers#

panel_overview#

macroforecast.data_analysis.panel_overview(data) -> dict

Input is the same canonical one-panel input accepted by summarize_data(...). Output includes the full panel_info(...) dictionary plus metadata_keys.

panel_snapshot#

macroforecast.data_analysis.panel_snapshot(data) -> dict

Returns a compact dictionary with n_rows, n_columns, start, end, missing_values, and frequency.

sample_coverage#

macroforecast.data_analysis.sample_coverage(data) -> pandas.DataFrame

Output columns:

Column

Meaning

first_valid

First non-missing date for the series.

last_valid

Last non-missing date for the series.

n_obs

Non-missing observation count.

n_missing

Missing observation count.

missing_rate

n_missing / n_panel_rows.

observation_counts(data) returns sample_coverage(data)["n_obs"]. missing_rates(data) returns sample_coverage(data)["missing_rate"].

univariate_summary#

macroforecast.data_analysis.univariate_summary(
    data,
    *,
    metrics: Sequence[str] | None = None,
) -> pandas.DataFrame

Input

Default

Allowed values

metrics

default summary metrics

mean, sd, min, max, skew, kurtosis, n_obs, n_missing

Returns one row per numeric column. Unknown metrics raise ValueError.

missing_summary#

macroforecast.data_analysis.missing_summary(data) -> pandas.DataFrame

Returns n_missing, missing_rate, and longest_missing_run for each series.

correlation_matrix#

macroforecast.data_analysis.correlation_matrix(
    data,
    *,
    method: str = "pearson",
    min_periods: int = 1,
) -> pandas.DataFrame

Input

Default

Allowed values

method

"pearson"

"pearson", "spearman", "kendall"

min_periods

1

positive integer

Invalid methods or min_periods < 1 raise ValueError.

outlier_summary#

macroforecast.data_analysis.outlier_summary(
    data,
    *,
    method: str = "iqr",
    iqr_threshold: float = 10.0,
    zscore_threshold: float = 3.0,
) -> pandas.DataFrame

Input

Default

Allowed values

method

"iqr"

"iqr", "zscore", "multi", "both"

iqr_threshold

10.0

positive float

zscore_threshold

3.0

positive float

The IQR default matches the McCracken-Ng/FRED-MD outlier multiplier used by preprocessing defaults. The z-score path uses population standard deviation (ddof=0) to match macroforecast.preprocessing.zscore_outlier_clean(...). Non-positive thresholds raise ValueError.

stationarity_tests#

macroforecast.data_analysis.stationarity_tests(
    data,
    *,
    test: str = "multi",
    scope: str = "all",
    target: str | None = None,
    targets: Sequence[str] | None = None,
    alpha: float = 0.05,
) -> dict

Input

Default

Allowed values

test

"multi"

"adf", "pp", "kpss", "multi", "none"

scope

"all"

"all", "target_and_predictors", "target_only", "predictors_only"

target, targets

None

target names in the panel

alpha

0.05

float strictly between 0 and 1

For scope="target_only" and scope="predictors_only", target names must be known from arguments or from a DataSpec. Missing target columns raise ValueError.

Output dictionary:

Key

Meaning

test, scope, alpha

Requested test settings.

n_series

Number of tested series.

by_series

Per-series test results.

Per-test outputs:

Test

Key outputs

adf

statistic, p_value, reject_unit_root

pp

statistic, p_value, reject_unit_root, implementation, bandwidth_lags when native

kpss

statistic, p_value, reject_stationarity

pp uses arch.unitroot.PhillipsPerron when available. Otherwise it falls back to macroforecast’s native Newey-West/MacKinnon implementation.

Phillips-Perron Helpers#

macroforecast.data_analysis.phillips_perron_test(values, *, alpha=0.05) -> dict
macroforecast.data_analysis.mackinnon_pp_pvalue(z_tau, *, n, regression="c") -> float

phillips_perron_test(...) drops non-finite values, requires at least eight finite observations, and returns status="insufficient_data" or status="singular_design" instead of raising for those data conditions.

mackinnon_pp_pvalue(...) approximates the MacKinnon p-value for the constant case (regression="c") using the internal critical-value table. For other regression labels it falls back to a normal CDF approximation. Non-finite statistics and non-positive sample sizes raise ValueError.

analyze_data#

Run the standard before/after data analysis suite.

Signature#

macroforecast.data_analysis.analyze_data(
    raw,
    clean,
    *,
    distribution_metrics: Sequence[str] | None = None,
    include_correlation: bool = False,
    correlation_method: str = "pearson",
    sample: str = "common_index",
    cleaning_metadata: Mapping[str, object] | None = None,
    cleaning_log: Mapping[str, object] | None = None,
    transform_map_applied: Mapping[str, int] | None = None,
    n_imputed_cells: int | None = None,
    n_outliers_flagged: int | None = None,
    n_truncated_obs: int | None = None,
    column_metadata: Mapping[str, object] | None = None,
    tolerance: float = 0.0,
) -> DataAnalysisReport

Input#

Name

Type

Default

Allowed values

Meaning

raw

DataFrame

required

canonical panel

Before/preprocessing panel.

clean

DataFrame

required

canonical panel

After/preprocessing panel.

distribution_metrics

sequence or None

all defaults

mean_change, sd_change, sd_ratio, skew_change, kurtosis_change, ks_statistic

Distribution-shift columns to compute.

include_correlation

bool

False

True, False

Include cleaned-minus-raw correlations.

correlation_method

str

"pearson"

"pearson", "spearman", "kendall"

Correlation method.

sample

str

"common_index"

"common_index", "full"

Date sample used by distribution and correlation shifts.

cleaning_metadata

mapping or None

auto from clean panel metadata

preprocessing metadata mapping

Source for effect counters and logs.

cleaning_log

mapping or None

from metadata when available

mapping

Optional explicit cleaning log.

transform_map_applied

mapping or None

from metadata when available

mapping from column to t-code

Optional explicit transform-code map.

n_imputed_cells

int or None

from metadata when available

non-negative count

Optional imputation counter.

n_outliers_flagged

int or None

from metadata when available

non-negative count

Optional outlier counter.

n_truncated_obs

int or None

from metadata when available

non-negative count

Optional truncation counter.

column_metadata

mapping or None

from metadata when available

mapping

Optional per-column preprocessing metadata.

tolerance

float

0.0

non-negative float

Absolute tolerance for changed-cell counting.

Defaults#

Default

Value

Distribution metrics

all six DEFAULT_DISTRIBUTION_METRICS values

Correlation included

False

Comparison sample

"common_index"

Changed-cell tolerance

0.0

Metadata stage

metadata["data_analysis"] with analysis_type="raw_vs_processed"

Output#

Returns DataAnalysisReport.

Field

Type

Meaning

comparison

dict

Shape, date range, common columns/index, missing totals, changed-cell count.

missing_shift

DataFrame

Per-column raw/clean missing counts and rate changes.

distribution_shift

DataFrame

Per-column distribution changes for common numeric columns.

correlation_shift

DataFrame or None

Cleaned-minus-raw correlation matrix when requested.

cleaning_effect_summary

dict

Normalized preprocessing counters, transform map, cleaning log, column metadata.

metadata

dict

Input metadata plus compact data_analysis run metadata.

DataAnalysisReport.to_dict() converts DataFrame fields into nested dictionaries for serialization.

The returned missing_shift, distribution_shift, and correlation_shift tables carry attrs["macroforecast_metadata"] == analysis.metadata when the table is present.

Metadata#

analysis.metadata["data_analysis"]

Key

Meaning

analysis_type

"raw_vs_processed".

before

Raw panel snapshot: rows, columns, start, end, missing count.

after

Processed panel snapshot with the same fields.

common

Common row/column counts and changed-cell count.

options

Distribution metrics, correlation option, sample, and tolerance.

effects

Compact preprocessing counters and metadata presence flags.

metadata_keys

Metadata keys detected on raw and processed panels.

Sample Choice#

distribution_shift(...) and correlation_shift(...) default to sample="common_index". This avoids mixing distribution changes with dates that only exist before or after preprocessing. Use sample="full" only when the full available sample of each panel is the intended comparison.

ks_statistic is the two-sample KS statistic only; it does not compute a p-value.

Before/After Helpers#

compare_panels#

macroforecast.data_analysis.compare_panels(
    raw,
    clean,
    *,
    tolerance: float = 0.0,
) -> dict

Output keys include raw_shape, clean_shape, raw/clean index types, date ranges, missing totals, common_columns, raw-only and clean-only columns, common/raw-only/clean-only index counts, and changed_cell_count.

panel_snapshots#

macroforecast.data_analysis.panel_snapshots(raw, clean) -> dict

Returns {"before": ..., "after": ...} using compact snapshots.

changed_cells, changed_cell_count, changed_cell_summary#

macroforecast.data_analysis.changed_cells(raw, clean, *, tolerance=0.0) -> pandas.DataFrame
macroforecast.data_analysis.changed_cell_count(raw, clean, *, tolerance=0.0) -> int
macroforecast.data_analysis.changed_cell_summary(raw, clean, *, tolerance=0.0) -> dict

All three use common dates and common columns. Numeric cells whose absolute difference is less than or equal to tolerance are treated as unchanged. Negative tolerance raises ValueError.

missing_shift#

macroforecast.data_analysis.missing_shift(raw, clean) -> pandas.DataFrame

Returns one row per unioned column with column_status, raw and clean sample sizes, raw and clean missing counts, missing-count change, missing rates, and missing-rate change.

distribution_shift#

macroforecast.data_analysis.distribution_shift(
    raw,
    clean,
    *,
    metrics: Sequence[str] | None = None,
    sample: str = "common_index",
) -> pandas.DataFrame

Allowed metrics are mean_change, sd_change, sd_ratio, skew_change, kurtosis_change, and ks_statistic. Unknown metrics raise ValueError.

correlation_shift#

macroforecast.data_analysis.correlation_shift(
    raw,
    clean,
    *,
    method: str = "pearson",
    fill_value: float | None = None,
    sample: str = "common_index",
) -> pandas.DataFrame

Returns the cleaned-minus-raw correlation matrix for common numeric columns. If fewer than two common numeric columns exist, returns an empty square DataFrame indexed by the available common numeric columns.

cleaning_effect_summary#

macroforecast.data_analysis.cleaning_effect_summary(
    *,
    cleaning_metadata: Mapping[str, object] | None = None,
    cleaning_log: Mapping[str, object] | None = None,
    transform_map_applied: Mapping[str, int] | None = None,
    n_imputed_cells: int | None = None,
    n_outliers_flagged: int | None = None,
    n_truncated_obs: int | None = None,
    column_metadata: Mapping[str, object] | None = None,
) -> dict

This helper normalizes preprocessing metadata into one compact dictionary. If explicit counters are not supplied, it tries to derive them from preprocessing step metadata.

Boundaries#

Question

Use

Why

What does this one panel look like?

mf.data_analysis.summarize_data(panel)

One input, level summary.

What changed from raw to processed?

mf.data_analysis.analyze_data(raw, processed)

Two inputs, before/after deltas.

Which preprocessing choices ran?

mf.preprocessing.report(processed)

Execution log rather than statistical summary.

Should this table be written to disk?

mf.output or mf.reporting

Output/rendering is separate from analysis.

  • adf_test – Augmented Dickey-Fuller unit-root test for a single series (flat result).

  • kpss_test – KPSS stationarity test for a single series (flat result).

  • dfgls_test – Elliott-Rothenberg-Stock DF-GLS GLS-detrended unit-root test (urca::ur.ers).

  • zivot_andrews_test – Zivot-Andrews unit-root test with one endogenous structural break (urca::ur.za).

  • ndiffs – number of first differences for stationarity (KPSS/ADF/PP; forecast::ndiffs).

  • nsdiffs – number of seasonal differences via STL seasonal strength (forecast::nsdiffs).

  • acf – sample autocorrelation function with confidence bands (stats::acf / forecast::Acf).

  • pacf – sample partial autocorrelation function with confidence bands (stats::pacf / forecast::Pacf).

  • johansen_cointegration – Johansen cointegration test (trace + max-eigenvalue, rank selection, cointegrating vectors; urca::ca.jo).

  • engle_granger – Engle-Granger two-step residual-based cointegration test with cointegrating coefficients (statsmodels coint).

  • phillips_ouliaris – Phillips-Ouliaris residual-based cointegration test, non-parametric LRV-corrected (urca::ca.po / tseries::po.test).

  • variance_ratio – Lo-MacKinlay variance-ratio test of the random-walk null (arch VarianceRatio).

  • structural_stability – OLS-CUSUM parameter-stability test with break-point estimate (strucchange::efp / vars::stability).

  • newey_west – Newey-West HAC covariance for OLS coefficients with Bartlett kernel and coefficient table (sandwich::NeweyWest + lmtest::coeftest).

  • vcov_hc – heteroskedasticity-consistent (White HC0-HC3) covariance for OLS coefficients with coefficient table (sandwich::vcovHC + lmtest::coeftest).

  • breusch_pagan_test – Breusch-Pagan test for heteroskedasticity, Koenker studentized or classic variant (lmtest::bptest).