macroforecast.data_analysis#
Purpose#
macroforecast.data_analysis is the read-only inspection module for canonical
pandas panels. It covers two related tasks:
| Task | Main function | Input count | Use case |
|---|---|---|---|
| Single-panel summary | summarize_data | 1 | Inspect one raw, processed, or custom panel. |
| Raw-vs-processed comparison | analyze_data | 2 | Inspect what changed after preprocessing. |
This module validates inputs, computes tables, and returns report objects. It does not load data, transform values, impute missing observations, create features, fit models, evaluate forecasts, or write files.
Inputs must satisfy the canonical panel contract used by
macroforecast.data: pandas.DataFrame, DatetimeIndex named
"date", ascending dates, no duplicate dates, no duplicate columns, numeric
values or NaN, no infinite values, and non-empty shape. summarize_data(...)
also accepts DataBundle, DataSpec, (panel, metadata), and
PreprocessedData-style objects with .panel and .metadata.
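The contract above can be sketched as a small checker. This is an illustration only: `check_canonical_panel` is a hypothetical name, not a macroforecast function, and the order of the checks and the error messages are assumptions.

```python
import numpy as np
import pandas as pd


def check_canonical_panel(panel: pd.DataFrame) -> None:
    """Hypothetical validator: raise ValueError on the first contract violation."""
    if not isinstance(panel, pd.DataFrame):
        raise ValueError("panel must be a pandas.DataFrame")
    if panel.shape[0] == 0 or panel.shape[1] == 0:
        raise ValueError("panel must be non-empty")
    if not isinstance(panel.index, pd.DatetimeIndex) or panel.index.name != "date":
        raise ValueError('index must be a DatetimeIndex named "date"')
    if panel.index.has_duplicates:
        raise ValueError("duplicate dates")
    if not panel.index.is_monotonic_increasing:
        raise ValueError("dates must be ascending")
    if panel.columns.has_duplicates:
        raise ValueError("duplicate columns")
    non_numeric = [
        name for name, dtype in panel.dtypes.items()
        if not pd.api.types.is_numeric_dtype(dtype)
    ]
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")
    # NaN is allowed by the contract; +/-inf is not.
    if np.isinf(panel.to_numpy(dtype=float)).any():
        raise ValueError("infinite values are not allowed")


dates = pd.date_range("2020-01-01", periods=4, freq="MS", name="date")
panel = pd.DataFrame(
    {"y": [1.0, 2.0, np.nan, 4.0], "x1": [0.5, 0.1, 0.2, 0.3]},
    index=dates,
)
check_canonical_panel(panel)  # a valid panel passes silently
```

A panel with a reversed index, an unnamed index, or an infinite value would raise ValueError in this sketch.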
Public Functions#
| Function | Kind | Output | Purpose |
|---|---|---|---|
| summarize_data | single-panel report | DataSummaryReport | Standard summary suite for one panel. |
| panel_overview | single-panel helper | dict | Shape, dates, frequency, missingness, metadata keys. |
| panel_snapshot | single-panel helper | dict | Compact rows/columns/dates/missingness/frequency snapshot. |
| sample_coverage | single-panel helper | pandas.DataFrame | Per-series first/last valid dates, observation counts, missing rates. |
| observation_counts | single-panel helper | pandas.Series | Per-series non-missing observation counts. |
| missing_rates | single-panel helper | pandas.Series | Per-series missing rates. |
| univariate_summary | single-panel helper | pandas.DataFrame | Per-series numeric descriptive statistics. |
| missing_summary | single-panel helper | pandas.DataFrame | Missing count, missing rate, longest missing run. |
| correlation_matrix | single-panel helper | pandas.DataFrame | Pairwise numeric correlation matrix. |
| outlier_summary | single-panel helper | pandas.DataFrame | IQR and/or z-score outlier counts and rates. |
| stationarity_tests | single-panel helper | dict | ADF, Phillips-Perron, KPSS, or all three. |
| phillips_perron_test | statistic helper | dict | Native PP fallback used when arch.unitroot.PhillipsPerron is unavailable. |
| mackinnon_pp_pvalue | statistic helper | float | Approximate p-value helper for native PP. |
| analyze_data | before/after report | DataAnalysisReport | Standard comparison suite for raw and processed panels. |
| compare_panels | before/after helper | dict | Shape/date/column/index comparison plus changed-cell count. |
| panel_snapshots | before/after helper | dict | Compact before/after snapshots. |
| changed_cells | before/after helper | pandas.DataFrame | Boolean changed-cell mask on common dates and columns. |
| changed_cell_count | before/after helper | int | Count changed common cells. |
| changed_cell_summary | before/after helper | dict | Changed-cell denominator, count, rate, and tolerance. |
| missing_shift | before/after helper | pandas.DataFrame | Missing-count and missing-rate changes. |
| distribution_shift | before/after helper | pandas.DataFrame | Mean, scale, tail-shape, and KS-style shifts. |
| correlation_shift | before/after helper | pandas.DataFrame | Cleaned-minus-raw correlation differences. |
| cleaning_effect_summary | metadata helper | dict | Normalize preprocessing metadata and counters. |
Public Flow#
import macroforecast as mf

# Load a bundle and summarize the raw panel.
bundle = mf.data.load_fred_md()
summary = mf.data_analysis.summarize_data(
    bundle,
    include_outliers=True,
    include_stationarity=True,
)

# Preprocess, then compare the raw and processed panels.
spec = mf.data.spec(bundle, target="INDPRO", horizons=[1, 3, 6, 12])
processed = mf.preprocessing.reprocess(spec)
analysis = mf.data_analysis.analyze_data(
    spec.panel,
    processed.panel,
    include_correlation=True,
)
Example single-panel output:
summary.overview
{
    "n_rows": 4,
    "n_columns": 2,
    "start": "2020-01-01",
    "end": "2020-04-01",
    "missing_values": 1,
    "frequency": "monthly",
    "metadata_keys": ["dataset", "frequency"],
}
Example raw-vs-processed output:
analysis.comparison
{
    "raw_shape": (4, 3),
    "clean_shape": (4, 3),
    "raw_missing_total": 1,
    "clean_missing_total": 0,
    "common_columns": ["y", "x1", "x2"],
    "common_index_count": 4,
    "changed_cell_count": 2,
}
summarize_data#
Run the standard one-panel summary suite.
Signature#
macroforecast.data_analysis.summarize_data(
    data,
    *,
    metrics: Sequence[str] | None = None,
    include_correlation: bool = False,
    correlation_method: str = "pearson",
    include_outliers: bool = False,
    outlier_method: str = "iqr",
    include_stationarity: bool = False,
    stationarity_test: str = "multi",
    stationarity_scope: str = "all",
) -> DataSummaryReport
Input#
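| Name | Type | Default | Allowed values | Meaning |
|---|---|---|---|---|
| data | | required | canonical panel input | Panel to summarize. |
| metrics | sequence or None | default summary metrics | | Univariate statistics to compute. |
| include_correlation | bool | False | | Include the correlation matrix. |
| correlation_method | str | "pearson" | | Correlation method when correlation is included. |
| include_outliers | bool | False | | Include the outlier summary. |
| outlier_method | str | "iqr" | | Outlier rule when outliers are included. |
| include_stationarity | bool | False | | Include the stationarity tests. |
| stationarity_test | str | "multi" | | Unit-root/stationarity test choice. |
| stationarity_scope | str | "all" | | Columns to test. |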
Defaults#
| Default | Value |
|---|---|
| Summary metrics | |
| Correlation included | False |
| Outlier summary included | False |
| Stationarity tests included | False |
| Metadata stage | |
Output#
Returns DataSummaryReport.
| Field | Type | Meaning |
|---|---|---|
| overview | | Panel row/column count, date range, missing total, inferred frequency, metadata keys. |
| coverage | | Per-series first/last observed date, observation count, missing count, and missing rate. |
| univariate | | Per-series descriptive statistics selected by metrics. |
| missing | | Per-series missing count, missing rate, longest missing run. |
| correlation | | Numeric correlation matrix when requested. |
| outliers | | IQR and/or z-score outlier counts and rates when requested. |
| | | ADF/PP/KPSS results when requested. |
| metadata | | Input metadata plus a compact data_analysis entry. |
DataSummaryReport.to_dict() converts DataFrame fields into nested
dictionaries for serialization.
The returned coverage, univariate, missing, correlation, and outliers
tables carry attrs["macroforecast_metadata"] == summary.metadata when the
table is present.
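Both mechanisms rest on plain pandas features, which the sketch below illustrates. It does not call macroforecast: the metadata contents and metric names are illustrative, and the orientation that `to_dict()` uses inside the report objects is an assumption.

```python
import pandas as pd

# Run-level facts stored once and shared by every table (contents are illustrative).
metadata = {"data_analysis": {"metrics": ["mean", "sd"]}}

# A stand-in for one of the report tables.
missing = pd.DataFrame({"n_missing": [1, 0]}, index=["y", "x1"])

# DataFrame.attrs is a free-form dict that travels with the table object.
missing.attrs["macroforecast_metadata"] = metadata

# DataFrame.to_dict() yields a nested dict of the form {column: {row_label: value}}.
as_dict = missing.to_dict()
```

Here `as_dict` is `{"n_missing": {"y": 1, "x1": 0}}`, and the metadata can be read back from `missing.attrs` without copying it into the table itself.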
Metadata#
summarize_data(...) stores run-level facts, not duplicate result tables:
summary.metadata["data_analysis"]
| Key | Meaning |
|---|---|
| | |
| | Univariate metrics requested. |
| | Correlation option state. |
| | Outlier option state. |
| | Stationarity option state. |
| | Compact panel snapshot. |
| | Source metadata snapshot and metadata-key list. |
| | Boolean flags for report fields included. |
Single-Panel Helpers#
panel_overview#
macroforecast.data_analysis.panel_overview(data) -> dict
Input is the same canonical one-panel input accepted by summarize_data(...).
Output includes the full panel_info(...) dictionary plus metadata_keys.
panel_snapshot#
macroforecast.data_analysis.panel_snapshot(data) -> dict
Returns a compact dictionary with n_rows, n_columns, start, end,
missing_values, and frequency.
sample_coverage#
macroforecast.data_analysis.sample_coverage(data) -> pandas.DataFrame
Output columns:
| Column | Meaning |
|---|---|
| | First non-missing date for the series. |
| | Last non-missing date for the series. |
| n_obs | Non-missing observation count. |
| | Missing observation count. |
| missing_rate | |
observation_counts(data) returns sample_coverage(data)["n_obs"].
missing_rates(data) returns sample_coverage(data)["missing_rate"].
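A coverage table of this shape can be sketched with plain pandas. The column names `n_obs` and `missing_rate` come from the text above; `first_valid`, `last_valid`, and `n_missing` are assumed labels, and `coverage_sketch` is a hypothetical helper rather than the library function.

```python
import numpy as np
import pandas as pd


def coverage_sketch(panel: pd.DataFrame) -> pd.DataFrame:
    """Per-series coverage: first/last valid date, counts, and missing rate."""
    n_missing = panel.isna().sum()
    return pd.DataFrame(
        {
            # first_valid_index() returns the first date with a non-missing value.
            "first_valid": pd.Series({c: panel[c].first_valid_index() for c in panel.columns}),
            "last_valid": pd.Series({c: panel[c].last_valid_index() for c in panel.columns}),
            "n_obs": panel.notna().sum(),
            "n_missing": n_missing,
            "missing_rate": n_missing / len(panel),
        }
    )


dates = pd.date_range("2020-01-01", periods=4, freq="MS", name="date")
panel = pd.DataFrame(
    {"y": [np.nan, 2.0, 3.0, 4.0], "x1": [0.5, 0.1, 0.2, 0.3]},
    index=dates,
)
coverage = coverage_sketch(panel)
observation_counts = coverage["n_obs"]      # mirrors observation_counts(data)
missing_rates = coverage["missing_rate"]    # mirrors missing_rates(data)
```

In this example `y` starts one month late, so its first valid date is 2020-02-01, its observation count is 3, and its missing rate is 0.25.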
univariate_summary#
macroforecast.data_analysis.univariate_summary(
    data,
    *,
    metrics: Sequence[str] | None = None,
) -> pandas.DataFrame
| Input | Default | Allowed values |
|---|---|---|
| metrics | default summary metrics | |
Returns one row per numeric column. Unknown metrics raise ValueError.
missing_summary#
macroforecast.data_analysis.missing_summary(data) -> pandas.DataFrame
Returns n_missing, missing_rate, and longest_missing_run for each
series.
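The longest missing run is the least obvious of the three columns. One way to compute it is to label runs of consecutive missing flags and take the largest run size. The sketch below assumes that definition; `longest_missing_run` and `missing_summary_sketch` are hypothetical helpers, not the library implementation.

```python
import numpy as np
import pandas as pd


def longest_missing_run(series: pd.Series) -> int:
    """Length of the longest block of consecutive NaN values (0 if none)."""
    is_missing = series.isna()
    # A new run starts wherever the missing flag differs from the previous row.
    run_id = (is_missing != is_missing.shift()).cumsum()
    # Summing the flags per run gives the run length for missing runs, 0 otherwise.
    return int(is_missing.groupby(run_id).sum().max())


def missing_summary_sketch(panel: pd.DataFrame) -> pd.DataFrame:
    n_missing = panel.isna().sum()
    return pd.DataFrame(
        {
            "n_missing": n_missing,
            "missing_rate": n_missing / len(panel),
            "longest_missing_run": pd.Series(
                {c: longest_missing_run(panel[c]) for c in panel.columns}
            ),
        }
    )


dates = pd.date_range("2020-01-01", periods=6, freq="MS", name="date")
panel = pd.DataFrame(
    {
        "y": [1.0, np.nan, np.nan, 4.0, np.nan, 6.0],
        "x1": [0.5, 0.1, 0.2, 0.3, 0.4, 0.6],
    },
    index=dates,
)
summary_table = missing_summary_sketch(panel)
```

For `y` this gives three missing values, a missing rate of 0.5, and a longest run of 2; `x1` has no missing values, so its longest run is 0.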
correlation_matrix#
macroforecast.data_analysis.correlation_matrix(
    data,
    *,
    method: str = "pearson",
    min_periods: int = 1,
) -> pandas.DataFrame
| Input | Default | Allowed values |
|---|---|---|
| method | "pearson" | |
| min_periods | 1 | positive integer |
Invalid methods or min_periods < 1 raise ValueError.
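The effect of `min_periods` can be seen with pandas' own `DataFrame.corr`, which accepts the same two arguments. The sketch below assumes the helper behaves like a validated wrapper around that call; the set of accepted method names is an assumption, since the allowed values are not listed here, and `correlation_matrix_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Assumed method names: the three that pandas.DataFrame.corr accepts as strings.
ASSUMED_METHODS = ("pearson", "spearman", "kendall")


def correlation_matrix_sketch(
    panel: pd.DataFrame, *, method: str = "pearson", min_periods: int = 1
) -> pd.DataFrame:
    if method not in ASSUMED_METHODS:
        raise ValueError(f"unknown correlation method: {method!r}")
    if min_periods < 1:
        raise ValueError("min_periods must be a positive integer")
    return panel.corr(method=method, min_periods=min_periods)


dates = pd.date_range("2020-01-01", periods=4, freq="MS", name="date")
panel = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, np.nan, 8.0]},
    index=dates,
)

loose = correlation_matrix_sketch(panel)                  # 3 overlapping rows suffice
strict = correlation_matrix_sketch(panel, min_periods=4)  # 3 < 4, so a-b becomes NaN
```

With the default `min_periods=1` the pair is computed from the three rows where both series are observed. Raising `min_periods` to 4 leaves that entry as NaN instead of reporting a correlation from too few observations.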
outlier_summary#
macroforecast.data_analysis.outlier_summary(
    data,
    *,
    method: str = "iqr",
    iqr_threshold: float = 10.0,
    zscore_threshold: float = 3.0,
) -> pandas.DataFrame
| Input | Default | Allowed values |
|---|---|---|
| method | "iqr" | |
| iqr_threshold | 10.0 | positive float |
| zscore_threshold | 3.0 | positive float |
The IQR default matches the McCracken-Ng/FRED-MD outlier multiplier used by
preprocessing defaults. The z-score path uses population standard deviation
(ddof=0) to match macroforecast.preprocessing.zscore_outlier_clean(...).
Non-positive thresholds raise ValueError.
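The two rules can be sketched as boolean masks. The IQR rule below follows the FRED-MD convention of flagging values whose distance from the series median exceeds the threshold times the interquartile range; that this is the exact rule used by outlier_summary is an assumption. The function names are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, threshold: float = 10.0) -> pd.Series:
    """Flag values with |x - median| > threshold * IQR (assumed FRED-MD style rule)."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    iqr = series.quantile(0.75) - series.quantile(0.25)
    return (series - series.median()).abs() > threshold * iqr


def zscore_outlier_mask(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values with |z| > threshold, using the population standard deviation."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold


series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])
n_iqr = int(iqr_outlier_mask(series).sum())
n_z3 = int(zscore_outlier_mask(series, threshold=3.0).sum())
n_z2 = int(zscore_outlier_mask(series, threshold=2.0).sum())
```

On this short series the IQR rule flags the value 1000, while the z-score rule at threshold 3.0 flags nothing: with six observations and `ddof=0` the largest possible absolute z-score is about 2.24. Lowering the threshold to 2.0 flags the same single value.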
stationarity_tests#
macroforecast.data_analysis.stationarity_tests(
    data,
    *,
    test: str = "multi",
    scope: str = "all",
    target: str | None = None,
    targets: Sequence[str] | None = None,
    alpha: float = 0.05,
) -> dict
| Input | Default | Allowed values |
|---|---|---|
| test | "multi" | |
| scope | "all" | |
| target, targets | None | target names in the panel |
| alpha | 0.05 | float strictly between 0 and 1 |
For scope="target_only" and scope="predictors_only", target names must be
known from arguments or from a DataSpec. Missing target columns raise
ValueError.
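The scope rule can be sketched as a column selector. This is a hypothetical helper, not the library code: it covers only the three scope values named in this section and takes target names as an explicit list rather than reading them from a DataSpec.

```python
from typing import Optional, Sequence


def select_test_columns(
    columns: Sequence[str],
    scope: str = "all",
    targets: Optional[Sequence[str]] = None,
) -> list:
    """Return the columns to test for a given scope (hypothetical helper)."""
    if scope == "all":
        return list(columns)
    if scope not in ("target_only", "predictors_only"):
        raise ValueError(f"unknown scope: {scope!r}")
    target_names = list(targets or [])
    if not target_names:
        raise ValueError(f"scope={scope!r} requires known target names")
    missing = [name for name in target_names if name not in columns]
    if missing:
        raise ValueError(f"target columns not in panel: {missing}")
    if scope == "target_only":
        return [c for c in columns if c in target_names]
    return [c for c in columns if c not in target_names]


columns = ["y", "x1", "x2"]
only_target = select_test_columns(columns, "target_only", targets=["y"])
only_predictors = select_test_columns(columns, "predictors_only", targets=["y"])
```

Here `only_target` is `["y"]` and `only_predictors` is `["x1", "x2"]`. Calling the helper with a target-dependent scope and no targets, or with a target that is not a panel column, raises ValueError.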
Output dictionary:
| Key | Meaning |
|---|---|
| | Requested test settings. |
| | Number of tested series. |
| | Per-series test results. |
Per-test outputs:
Test |
Key outputs |
|---|---|
|
|
|
|
|
|
pp uses arch.unitroot.PhillipsPerron when available. Otherwise it falls
back to macroforecast’s native Newey-West/MacKinnon implementation.
Phillips-Perron Helpers#
macroforecast.data_analysis.phillips_perron_test(values, *, alpha=0.05) -> dict
macroforecast.data_analysis.mackinnon_pp_pvalue(z_tau, *, n, regression="c") -> float
phillips_perron_test(...) drops non-finite values, requires at least eight
finite observations, and returns status="insufficient_data" or
status="singular_design" instead of raising for those data conditions.
mackinnon_pp_pvalue(...) approximates the MacKinnon p-value for the constant
case (regression="c") using the internal critical-value table. For other
regression labels it falls back to a normal CDF approximation. Non-finite
statistics and non-positive sample sizes raise ValueError.
analyze_data#
Run the standard before/after data analysis suite.
Signature#
macroforecast.data_analysis.analyze_data(
    raw,
    clean,
    *,
    distribution_metrics: Sequence[str] | None = None,
    include_correlation: bool = False,
    correlation_method: str = "pearson",
    sample: str = "common_index",
    cleaning_metadata: Mapping[str, object] | None = None,
    cleaning_log: Mapping[str, object] | None = None,
    transform_map_applied: Mapping[str, int] | None = None,
    n_imputed_cells: int | None = None,
    n_outliers_flagged: int | None = None,
    n_truncated_obs: int | None = None,
    column_metadata: Mapping[str, object] | None = None,
    tolerance: float = 0.0,
) -> DataAnalysisReport
Input#
| Name | Type | Default | Allowed values | Meaning |
|---|---|---|---|---|
| raw | | required | canonical panel | Panel before preprocessing. |
| clean | | required | canonical panel | Panel after preprocessing. |
| distribution_metrics | sequence or None | all defaults | | Distribution-shift columns to compute. |
| include_correlation | bool | False | | Include cleaned-minus-raw correlations. |
| correlation_method | str | "pearson" | | Correlation method. |
| sample | str | "common_index" | | Date sample used by distribution and correlation shifts. |
| cleaning_metadata | mapping or None | auto from clean panel metadata | preprocessing metadata mapping | Source for effect counters and logs. |
| cleaning_log | mapping or None | from metadata when available | mapping | Optional explicit cleaning log. |
| transform_map_applied | mapping or None | from metadata when available | mapping from column to t-code | Optional explicit transform-code map. |
| n_imputed_cells | int or None | from metadata when available | non-negative count | Optional imputation counter. |
| n_outliers_flagged | int or None | from metadata when available | non-negative count | Optional outlier counter. |
| n_truncated_obs | int or None | from metadata when available | non-negative count | Optional truncation counter. |
| column_metadata | mapping or None | from metadata when available | mapping | Optional per-column preprocessing metadata. |
| tolerance | float | 0.0 | non-negative float | Absolute tolerance for changed-cell counting. |
Defaults#
| Default | Value |
|---|---|
| Distribution metrics | all six |
| Correlation included | False |
| Comparison sample | "common_index" |
| Changed-cell tolerance | 0.0 |
| Metadata stage | |
Output#
Returns DataAnalysisReport.
| Field | Type | Meaning |
|---|---|---|
| comparison | | Shape, date range, common columns/index, missing totals, changed-cell count. |
| missing_shift | | Per-column raw/clean missing counts and rate changes. |
| distribution_shift | | Per-column distribution changes for common numeric columns. |
| correlation_shift | | Cleaned-minus-raw correlation matrix when requested. |
| | | Normalized preprocessing counters, transform map, cleaning log, column metadata. |
| metadata | | Input metadata plus a compact data_analysis entry. |
DataAnalysisReport.to_dict() converts DataFrame fields into nested
dictionaries for serialization.
The returned missing_shift, distribution_shift, and correlation_shift
tables carry attrs["macroforecast_metadata"] == analysis.metadata when the
table is present.
Metadata#
analysis.metadata["data_analysis"]
| Key | Meaning |
|---|---|
| | |
| | Raw panel snapshot: rows, columns, start, end, missing count. |
| | Processed panel snapshot with the same fields. |
| | Common row/column counts and changed-cell count. |
| | Distribution metrics, correlation option, sample, and tolerance. |
| | Compact preprocessing counters and metadata presence flags. |
| | Metadata keys detected on raw and processed panels. |
Sample Choice#
distribution_shift(...) and correlation_shift(...) default to
sample="common_index". This avoids mixing distribution changes with dates
that only exist before or after preprocessing. Use sample="full" only when
the full available sample of each panel is the intended comparison.
ks_statistic is the two-sample KS statistic only; it does not compute a
p-value.
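The difference between the two samples shows up clearly when preprocessing drops an early observation. The sketch below uses the textbook two-sample KS statistic (the largest gap between two empirical CDFs) in plain NumPy; `ks_statistic` here is a stand-in function, and its details, such as dropping NaN before comparison, are assumptions rather than the library's implementation.

```python
import numpy as np
import pandas as pd


def ks_statistic(x, y) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    x, y = x[~np.isnan(x)], y[~np.isnan(y)]  # NaN sorts last, so order is kept
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))


dates = pd.date_range("2020-01-01", periods=5, freq="MS", name="date")
raw = pd.DataFrame({"y": [100.0, 1.0, 2.0, 3.0, 4.0]}, index=dates)
clean = raw.iloc[1:]  # preprocessing dropped the first date, values unchanged

# sample="common_index": compare only the dates present in both panels.
common = raw.index.intersection(clean.index)
ks_common = ks_statistic(raw.loc[common, "y"], clean.loc[common, "y"])

# sample="full": compare each panel over its own full date range.
ks_full = ks_statistic(raw["y"], clean["y"])
```

On the common dates the values are identical, so `ks_common` is 0.0. Over the full samples the dropped first observation alone produces a statistic of about 0.2, even though no retained value changed.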
Before/After Helpers#
compare_panels#
macroforecast.data_analysis.compare_panels(
    raw,
    clean,
    *,
    tolerance: float = 0.0,
) -> dict
Output keys include raw_shape, clean_shape, raw/clean index types, date
ranges, missing totals, common_columns, raw-only and clean-only columns,
common/raw-only/clean-only index counts, and changed_cell_count.
panel_snapshots#
macroforecast.data_analysis.panel_snapshots(raw, clean) -> dict
Returns {"before": ..., "after": ...} using compact snapshots.
changed_cells, changed_cell_count, changed_cell_summary#
macroforecast.data_analysis.changed_cells(raw, clean, *, tolerance=0.0) -> pandas.DataFrame
macroforecast.data_analysis.changed_cell_count(raw, clean, *, tolerance=0.0) -> int
macroforecast.data_analysis.changed_cell_summary(raw, clean, *, tolerance=0.0) -> dict
All three use common dates and common columns. Numeric cells whose absolute
difference is less than or equal to tolerance are treated as unchanged.
Negative tolerance raises ValueError.
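A changed-cell mask with a tolerance can be sketched as follows. The treatment of missing values is an assumption: a cell that is NaN in exactly one panel counts as changed, and a cell that is NaN in both counts as unchanged. `changed_cells_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def changed_cells_sketch(
    raw: pd.DataFrame, clean: pd.DataFrame, *, tolerance: float = 0.0
) -> pd.DataFrame:
    """Boolean mask of changed cells on common dates and common columns."""
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    dates = raw.index.intersection(clean.index)
    columns = [c for c in raw.columns if c in clean.columns]
    before = raw.loc[dates, columns]
    after = clean.loc[dates, columns]
    # Comparisons involving NaN evaluate to False, so missing cells are handled below.
    moved = (before - after).abs() > tolerance
    one_side_missing = before.isna() ^ after.isna()
    return moved | one_side_missing


dates = pd.date_range("2020-01-01", periods=4, freq="MS", name="date")
raw = pd.DataFrame(
    {"y": [1.0, np.nan, 3.0, 4.0], "x1": [10.0, 20.0, 30.0, 40.0]}, index=dates
)
clean = pd.DataFrame(
    {"y": [1.0, 2.5, 3.0, 4.0000001], "x1": [10.0, 20.0, 30.0, 40.0]}, index=dates
)

n_exact = int(changed_cells_sketch(raw, clean).to_numpy().sum())
n_tolerant = int(changed_cells_sketch(raw, clean, tolerance=1e-3).to_numpy().sum())
```

With the default tolerance of 0.0 both the imputed cell and the tiny numeric drift count as changes, so `n_exact` is 2. With `tolerance=1e-3` the drift is ignored and only the imputed cell remains, so `n_tolerant` is 1.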
missing_shift#
macroforecast.data_analysis.missing_shift(raw, clean) -> pandas.DataFrame
Returns one row per column in the union of raw and clean columns, with
column_status, raw and clean sample sizes, raw and clean missing counts,
missing-count change, missing rates, and missing-rate change.
distribution_shift#
macroforecast.data_analysis.distribution_shift(
    raw,
    clean,
    *,
    metrics: Sequence[str] | None = None,
    sample: str = "common_index",
) -> pandas.DataFrame
Allowed metrics are mean_change, sd_change, sd_ratio, skew_change,
kurtosis_change, and ks_statistic. Unknown metrics raise ValueError.
correlation_shift#
macroforecast.data_analysis.correlation_shift(
    raw,
    clean,
    *,
    method: str = "pearson",
    fill_value: float | None = None,
    sample: str = "common_index",
) -> pandas.DataFrame
Returns the cleaned-minus-raw correlation matrix for common numeric columns. If fewer than two common numeric columns exist, returns an empty square DataFrame indexed by the available common numeric columns.
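The cleaned-minus-raw difference can be sketched with two `DataFrame.corr` calls on the aligned panels. This sketch covers only the `sample="common_index"` case and leaves out `fill_value`, whose semantics are not described here; `correlation_shift_sketch` is a hypothetical name.

```python
import pandas as pd


def correlation_shift_sketch(
    raw: pd.DataFrame, clean: pd.DataFrame, *, method: str = "pearson"
) -> pd.DataFrame:
    """Clean-panel correlations minus raw-panel correlations on common cells."""
    dates = raw.index.intersection(clean.index)
    columns = [c for c in raw.columns if c in clean.columns]
    raw_corr = raw.loc[dates, columns].corr(method=method)
    clean_corr = clean.loc[dates, columns].corr(method=method)
    return clean_corr - raw_corr


dates = pd.date_range("2020-01-01", periods=4, freq="MS", name="date")
raw = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, 2.0, 3.0, -10.0]}, index=dates)
clean = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, 2.0, 3.0, 4.0]}, index=dates)

shift = correlation_shift_sketch(raw, clean)
```

In this example one extreme value in the raw `b` series makes the raw correlation negative. After cleaning the two series are perfectly correlated, so the off-diagonal shift is larger than 1 while the diagonal stays at zero.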
cleaning_effect_summary#
macroforecast.data_analysis.cleaning_effect_summary(
    *,
    cleaning_metadata: Mapping[str, object] | None = None,
    cleaning_log: Mapping[str, object] | None = None,
    transform_map_applied: Mapping[str, int] | None = None,
    n_imputed_cells: int | None = None,
    n_outliers_flagged: int | None = None,
    n_truncated_obs: int | None = None,
    column_metadata: Mapping[str, object] | None = None,
) -> dict
This helper normalizes preprocessing metadata into one compact dictionary. If explicit counters are not supplied, it tries to derive them from preprocessing step metadata.
Boundaries#
| Question | Use | Why |
|---|---|---|
| What does this one panel look like? | summarize_data | One input, level summary. |
| What changed from raw to processed? | analyze_data | Two inputs, before/after deltas. |
| Which preprocessing choices ran? | | Execution log rather than statistical summary. |
| Should this table be written to disk? | | Output/rendering is separate from analysis. |
adf_test – Augmented Dickey-Fuller unit-root test for a single series (flat result).
kpss_test – KPSS stationarity test for a single series (flat result).
dfgls_test – Elliott-Rothenberg-Stock DF-GLS GLS-detrended unit-root test (urca::ur.ers).
zivot_andrews_test – Zivot-Andrews unit-root test with one endogenous structural break (urca::ur.za).
ndiffs – number of first differences for stationarity (KPSS/ADF/PP; forecast::ndiffs).
nsdiffs – number of seasonal differences via STL seasonal strength (forecast::nsdiffs).
acf – sample autocorrelation function with confidence bands (stats::acf / forecast::Acf).
pacf – sample partial autocorrelation function with confidence bands (stats::pacf / forecast::Pacf).
johansen_cointegration – Johansen cointegration test (trace + max-eigenvalue, rank selection, cointegrating vectors; urca::ca.jo).
engle_granger – Engle-Granger two-step residual-based cointegration test with cointegrating coefficients (statsmodels coint).
phillips_ouliaris – Phillips-Ouliaris residual-based cointegration test, non-parametric LRV-corrected (urca::ca.po / tseries::po.test).
variance_ratio – Lo-MacKinlay variance-ratio test of the random-walk null (arch VarianceRatio).
structural_stability – OLS-CUSUM parameter-stability test with break-point estimate (strucchange::efp / vars::stability).
newey_west – Newey-West HAC covariance for OLS coefficients with Bartlett kernel and coefficient table (sandwich::NeweyWest + lmtest::coeftest).
vcov_hc – heteroskedasticity-consistent (White HC0-HC3) covariance for OLS coefficients with coefficient table (sandwich::vcovHC + lmtest::coeftest).
breusch_pagan_test – Breusch-Pagan test for heteroskedasticity, Koenker studentized or classic variant (lmtest::bptest).