# macroforecast.data

## Purpose

`macroforecast.data` is the data entry point for the package. It loads official or user-supplied data, normalizes it to one pandas panel contract, and attaches source metadata. It also creates run-level data specifications and combines national FRED-MD/FRED-QD data with state-level FRED-SD panels.

This module does not apply stationarity transforms, outlier rules, imputation, feature engineering, model fitting, or evaluation. Those steps happen later. The main output is always a `DataBundle` or `DataSpec`.

The usual flow is:

```python
import macroforecast as mf

bundle = mf.data.load_fred_md()
data_spec = mf.data.spec(
    bundle,
    target="INDPRO",
    horizons=[1, 3, 6, 12],
    start="1960-01",
    end="2024-12",
    predictors="all",
)
```

`mf.data.spec(...)` is not a wrapper that runs data loading, preprocessing, feature engineering, or modeling. It is a small contract builder for the already-loaded panel. It validates the requested target, horizons, sample window, and predictor set; subsets the panel to those columns and dates; expands `predictors="all"` to concrete non-target columns; and records the choices in metadata. Later callable stages can consume the same `DataSpec` without guessing which columns or horizons the run intended to use.

## Public Functions

| Function | Purpose | Output |
| --- | --- | --- |
| `load_fred_md` | Load official FRED-MD current or vintage data. | `DataBundle` |
| `load_fred_qd` | Load official FRED-QD current or vintage data. | `DataBundle` |
| `load_fred_sd` | Load official FRED-SD state-level panel data. | `DataBundle` |
| `load_fred_md_sd` | Load and combine FRED-MD with FRED-SD. | `DataBundle` |
| `load_fred_qd_sd` | Load and combine FRED-QD with FRED-SD. | `DataBundle` |
| `load_custom_csv` | Load a user CSV into the canonical panel contract. | `DataBundle` |
| `load_custom_parquet` | Load a user Parquet file into the canonical panel contract. | `DataBundle` |
| `custom_dataset` | Build a custom dataset from an in-memory `DataFrame`. | `DataBundle` |
| `combine` | Concatenate loaded bundles and optionally align frequency. | `DataBundle` |
| `list_vintages` | Generate supported monthly vintage labels for a dataset. | `list[str]` |
| `as_panel` | Normalize a `DataFrame` to the canonical panel contract. | `pandas.DataFrame` |
| `validate_panel` | Validate the canonical panel contract. | `None` |
| `panel_info` | Summarize panel shape, dates, missingness, and frequency. | `dict` |
| `metadata` | Extract explicit package metadata from data-like input. | `dict` |
| `attach_metadata` | Merge one metadata stage into an existing metadata dictionary. | `dict` |
| `set_frequencies` | Attach column-level native/output frequency metadata. | `DataBundle` |
| `spec` | Attach target, horizon, sample, and predictor choices. | `DataSpec` |
| `align_frequency` | Keep, filter, or align panel columns to a common frequency. | `DataBundle` |
| `chow_lin_disaggregate` | Disaggregate low-frequency series with a high-frequency indicator. | `pandas.Series` |
| `infer_frequencies` | Read or infer native frequency by column. | `(dict[str, str], str)` |
| `frequency_hardening_issues` | Report columns with weak frequency classification. | `list[dict]` |
| `availability_lag` | Delay selected columns to encode release availability. | `DataBundle` |
| `same_period_predictors` | Allow, lag, drop, or reject same-period predictors in a `DataSpec`. | `DataSpec` |
| `define_regime` | Attach a binary regime definition to metadata, optionally as a column. | `DataBundle` |

## Public Classes And Types

| Symbol | Meaning |
| --- | --- |
| `DataBundle` | Canonical panel plus metadata returned by loaders and data-policy helpers. |
| `DataSpec` | Canonical panel plus target, horizon, sample, and predictor choices for a run. |
| `RegimeDirection` | Stored threshold direction type: `"above"`, `"below"`, `"equal"`, or `"not_equal"`. |
| `SamePeriodPolicy` | Stored same-period predictor policy type: `"allow"`, `"lag"`, `"drop"`, or `"forbid"`. |

## Canonical Panel

Every public loader returns a `DataBundle`.

```python
panel = bundle.panel
metadata = bundle.metadata
```

`DataBundle` also supports tuple unpacking:

```python
panel, metadata = mf.data.load_fred_md()
```

### Panel Contract

| Property | Required Value |
| --- | --- |
| Type | `pandas.DataFrame` |
| Index | `pandas.DatetimeIndex` |
| Index name | `"date"` |
| Sort order | ascending date order |
| Duplicate dates | not allowed |
| Columns | variable IDs |
| Values | numeric values or `NaN` |
| Empty panel | not allowed |
| Infinite values | not allowed |

Metadata is explicit on `DataBundle.metadata`. The panel also carries `panel.attrs["macroforecast_metadata"]` for pandas-native handoff. FRED-MD and FRED-QD transform codes are attached to `panel.attrs["macroforecast_transform_codes"]`; preprocessing is responsible for using them.

Panel normalization is strict by default. Invalid date values, non-numeric cells that would be coerced to `NaN`, duplicate dates, empty panels, and infinite values raise errors. When a caller deliberately sets `strict=False`, lossy normalization is allowed but recorded in `panel.attrs["macroforecast_panel_report"]` and `metadata["panel"]` when the panel is returned inside a `DataBundle`.

`macroforecast_panel_report` contains:

| Key | Meaning |
| --- | --- |
| `contract` | Panel contract version, currently `macroforecast_panel_v1`. |
| `strict` | Whether lossy date/numeric coercion was rejected. |
| `input_rows`, `output_rows` | Row count before and after panel normalization. |
| `input_columns`, `output_columns` | Column names before and after selection/renaming. |
| `date_source` | Date source used: a column name or `"index"`. |
| `invalid_date_rows_dropped` | Number of invalid date rows dropped when `strict=False`. |
| `numeric_coercion` | Count and examples of non-numeric cells coerced to `NaN` when `strict=False`. |

### Metadata Contract

Every loader writes a metadata dictionary with these common keys.

| Key | Type | Meaning |
| --- | --- | --- |
| `dataset` | `str` | Dataset identifier such as `fred_md`, `fred_qd`, `fred_sd`, `fred_md+fred_sd`, or `fred_qd+fred_sd`. |
| `frequency` | `str` | Loader-level frequency label: `monthly`, `quarterly`, `weekly`, `annual`, `mixed`, `unknown`, or the chosen combined frequency. |
| `version_mode` | `str` | `current`, `vintage`, or `mixed` for combined inputs with different modes. |
| `vintage` | `str` or `None` | Requested vintage label in `YYYY-MM` form, or `None` for current data. |
| `data_through` | `str` or `None` | Last date present in the loaded panel, formatted as `YYYY-MM`. |
| `support_tier` | `str` | `stable` for official loaders, `provisional` for user-supplied files. |
| `parse_notes` | `tuple[str, ...]` | Loader notes, including discouraged frequency alignments for combined datasets. |
| `artifact` | `dict` or `None` | Raw-file provenance for single-source loads; combined bundles use `None`. |
| `transform_codes` | `dict[str, int]` | Official FRED-MD/FRED-QD t-codes when available. FRED-SD has no official t-code map. |

Combined bundles add:

| Key | Type | Meaning |
| --- | --- | --- |
| `source_family` | `str` | Combined-source label currently set to `"combined"`. |
| `combined_sources` | `list[dict]` | Full metadata dictionaries from the source bundles. |
| `source_by_column` | `dict[str, str]` | Source dataset for each output column. |
| `native_frequency_by_column` | `dict[str, str]` | Original frequency for each output column before alignment. |
| `native_frequency_counts` | `dict[str, int]` | Count of columns by original frequency. |
| `date_anchor_by_column` | `dict[str, str]` | FRED-SD date-anchor map for state columns when available. |
| `date_anchor_counts` | `dict[str, int]` | Count of FRED-SD date-anchor patterns when available. |
| `output_frequency_by_column` | `dict[str, str]` | Frequency represented in the returned panel for each output column. |
| `output_frequency_counts` | `dict[str, int]` | Count of columns by returned-panel frequency. |
| `frequency_conversion_warnings` | `list[dict]` | Records of monthly-to-quarterly or quarterly-to-monthly conversions. |
| `alignment` | `dict` | Chosen target frequency, alignment rules, and source-level alignment summaries. |

Public metadata helpers and policy types:

| Symbol | Meaning |
| --- | --- |
| `attach_metadata` | Return metadata with one stage key merged in a pandas-safe way. Used by loaders, preprocessing, analysis, and runner outputs. |
| `RegimeDirection` | Stored threshold direction type for `define_regime(...)`: `"above"`, `"below"`, `"equal"`, or `"not_equal"`. |
| `SamePeriodPolicy` | Stored same-period predictor policy type for `same_period_predictors(...)`: `"allow"`, `"lag"`, `"drop"`, or `"forbid"`. |

## DataBundle

```python
macroforecast.data.DataBundle(
    panel: pandas.DataFrame,
    metadata: dict,
)
```

### Output

| Field | Type | Meaning |
| --- | --- | --- |
| `panel` | `pandas.DataFrame` | Canonical date-indexed data panel. |
| `metadata` | `dict` | Source, vintage, artifact, frequency, and transform-code metadata. |

### Methods

| Method | Input | Output | Meaning |
| --- | --- | --- | --- |
| `attach(stage, values)` | `stage: str`, `values: Mapping` | `DataBundle` | Return a new bundle with one metadata stage added. |

Preprocessing outputs can use the same metadata-attachment pattern.

## DataSpec

```python
macroforecast.data.DataSpec(
    panel: pandas.DataFrame,
    metadata: dict,
    target: str | None,
    targets: tuple[str, ...],
    horizons: tuple[int, ...],
    start: str | None = None,
    end: str | None = None,
    predictors: "all" | tuple[str, ...] = "all",
)
```

`DataSpec` is the output of `spec(...)`.
It keeps the canonical panel and metadata together with the target, horizons, sample window, and predictor selection for a run.

### Output

| Field | Type | Meaning |
| --- | --- | --- |
| `panel` | `pandas.DataFrame` | Canonical date-indexed data panel after sample and column selection. |
| `metadata` | `dict` | Source metadata plus a `data_spec` stage. |
| `target` | `str` or `None` | Single target column when `target=` was used. |
| `targets` | `tuple[str, ...]` | Active target columns. |
| `horizons` | `tuple[int, ...]` | Positive forecast horizons. |
| `start`, `end` | `str` or `None` | Normalized sample bounds. |
| `predictors` | `tuple[str, ...]` | Concrete non-target predictor columns. |

### Methods

| Method | Input | Output | Meaning |
| --- | --- | --- | --- |
| `attach(stage, values)` | `stage: str`, `values: Mapping` | `DataSpec` | Return a new spec with one metadata stage added. |

`DataSpec` also supports tuple unpacking:

```python
panel, metadata = data_spec
```

## load_fred_md

Load FRED-MD and return `DataBundle`.

```python
macroforecast.data.load_fred_md(
    vintage: str | None = None,
    *,
    force: bool = False,
    cache_root: str | pathlib.Path | None = None,
    local_source: str | pathlib.Path | None = None,
    local_zip_source: str | pathlib.Path | None = None,
) -> DataBundle
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `vintage` | `str` or `None` | `None` | Vintage in `YYYY-MM` form. `None` loads current. |
| `force` | `bool` | `False` | Re-download or re-copy even if cache exists. |
| `cache_root` | path-like or `None` | `None` | Raw cache root. |
| `local_source` | path-like or `None` | `None` | Local CSV source instead of download. |
| `local_zip_source` | path-like or `None` | `None` | Optional local historical zip override. Without it, vintage requests automatically download the official FRED-MD historical archive and extract the requested CSV. |

### Output

Returns `DataBundle` with a monthly FRED-MD panel and metadata.
The official CSV transform row is parsed into `metadata["transform_codes"]` and `panel.attrs["macroforecast_transform_codes"]`.

See [FRED-MD](../datasets/fred_md.md) for dataset-specific details. See [FRED-MD + FRED-SD](../datasets/fred_md_sd.md) for the combined monthly national/state loader.

## load_fred_qd

Load FRED-QD and return `DataBundle`.

```python
macroforecast.data.load_fred_qd(
    vintage: str | None = None,
    *,
    force: bool = False,
    cache_root: str | pathlib.Path | None = None,
    local_source: str | pathlib.Path | None = None,
    local_zip_source: str | pathlib.Path | None = None,
) -> DataBundle
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `vintage` | `str` or `None` | `None` | Vintage in `YYYY-MM` form. `None` loads current. |
| `force` | `bool` | `False` | Re-download or re-copy even if cache exists. |
| `cache_root` | path-like or `None` | `None` | Raw cache root. |
| `local_source` | path-like or `None` | `None` | Local CSV source instead of download. |
| `local_zip_source` | path-like or `None` | `None` | Optional local historical zip override. Without it, vintage requests automatically download the official FRED-QD historical archive and extract the requested CSV. |

### Output

Returns a quarterly canonical panel. The official CSV transform row is parsed into `metadata["transform_codes"]` and `panel.attrs["macroforecast_transform_codes"]`.

See [FRED-QD](../datasets/fred_qd.md) for dataset-specific details. See [FRED-QD + FRED-SD](../datasets/fred_qd_sd.md) for the combined quarterly national/state loader.

## load_fred_sd

Load FRED-SD and return `DataBundle`.
```python
macroforecast.data.load_fred_sd(
    vintage: str | None = None,
    *,
    force: bool = False,
    cache_root: str | pathlib.Path | None = None,
    local_source: str | pathlib.Path | None = None,
    states: list[str] | None = None,
    variables: list[str] | None = None,
) -> DataBundle
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `states` | `list[str]` or `None` | `None` | Optional state subset. |
| `variables` | `list[str]` or `None` | `None` | Optional FRED-SD variable subset. |

FRED-SD columns are wide variable-state IDs such as `UR_CA`. The loader also adds `panel.attrs["macrocast_reports"]["fred_sd_series_metadata"]`, which records each column's state, FRED-SD variable, observed date range, non-missing count, native frequency, and date-anchor pattern inferred from the official series workbook. The same frequency and date-anchor maps are exposed in `metadata["native_frequency_by_column"]`, `metadata["native_frequency_counts"]`, `metadata["date_anchor_by_column"]`, `metadata["date_anchor_counts"]`, and `metadata["state_summary"]`.

For `vintage="YYYY-MM"`, FRED-SD uses the official by-series workbook path. It tries `series-YYYY-MM.xlsx` first and then falls back to the official by-series zip archive containing that workbook. There is no `local_zip_source` parameter for FRED-SD because local overrides are supplied as `local_source=` with either an official workbook or a canonical wide CSV.

See [FRED-SD](../datasets/fred_sd.md) for mixed-frequency state-series details and t-code limitations. See [FRED-MD + FRED-SD](../datasets/fred_md_sd.md) and [FRED-QD + FRED-SD](../datasets/fred_qd_sd.md) for combined-loader behavior.

## load_fred_md_sd

Load FRED-MD and FRED-SD, align them to one panel, and return `DataBundle`.
```python
macroforecast.data.load_fred_md_sd(
    vintage: str | None = None,
    *,
    force: bool = False,
    cache_root: str | pathlib.Path | None = None,
    local_fred_md_source: str | pathlib.Path | None = None,
    local_fred_sd_source: str | pathlib.Path | None = None,
    states: list[str] | None = None,
    variables: list[str] | None = None,
    frequency: str = "monthly",
    quarterly_to_monthly: str = "repeat_within_quarter",
    monthly_to_quarterly: str = "quarterly_average",
) -> DataBundle
```

### Purpose

Use this when the outcome or main state panel is monthly and national macroeconomic controls should come from FRED-MD. This is the recommended combined dataset for monthly state analysis.

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `vintage` | `str` or `None` | `None` | Vintage label shared across FRED-MD and FRED-SD. |
| `force` | `bool` | `False` | Re-download or re-copy raw sources. |
| `cache_root` | path-like or `None` | `None` | Raw cache root used by both loaders. |
| `local_fred_md_source` | path-like or `None` | `None` | Local FRED-MD CSV source. |
| `local_fred_sd_source` | path-like or `None` | `None` | Local FRED-SD workbook or CSV source. |
| `states` | `list[str]` or `None` | `None` | FRED-SD state subset. |
| `variables` | `list[str]` or `None` | `None` | FRED-SD variable subset. |
| `frequency` | `str` | `"monthly"` | `"monthly"`, `"quarterly"`, or `"native"`. Quarterly is supported but not recommended for this loader. |
| `quarterly_to_monthly` | `str` | `"repeat_within_quarter"` | Rule used if an included FRED-SD series is quarterly and the target panel is monthly. |
| `monthly_to_quarterly` | `str` | `"quarterly_average"` | Rule used only when `frequency="quarterly"`. |

### Output

Returns a combined `DataBundle` with:

- `metadata["dataset"] == "fred_md+fred_sd"`
- `metadata["source_family"] == "combined"`
- `metadata["frequency"] == frequency`
- FRED-MD official t-codes in `metadata["transform_codes"]`
- FRED-SD series metadata preserved in `panel.attrs["macrocast_reports"]`
- FRED-SD source-frequency and date-anchor maps in `metadata["native_frequency_by_column"]` and `metadata["date_anchor_by_column"]`
- any frequency conversions recorded in `metadata["frequency_conversion_warnings"]`

If a quarterly FRED-SD series is included in a monthly panel, the function emits a `UserWarning` and records the conversion. The default `quarterly_to_monthly="repeat_within_quarter"` assigns the quarterly value to each month inside the quarter.

## load_fred_qd_sd

Load FRED-QD and FRED-SD, align them to one panel, and return `DataBundle`.

```python
macroforecast.data.load_fred_qd_sd(
    vintage: str | None = None,
    *,
    force: bool = False,
    cache_root: str | pathlib.Path | None = None,
    local_fred_qd_source: str | pathlib.Path | None = None,
    local_fred_sd_source: str | pathlib.Path | None = None,
    states: list[str] | None = None,
    variables: list[str] | None = None,
    frequency: str = "quarterly",
    quarterly_to_monthly: str = "repeat_within_quarter",
    monthly_to_quarterly: str = "quarterly_average",
) -> DataBundle
```

### Purpose

Use this when the target or outcome is quarterly and national controls should come from FRED-QD. This is the recommended combined dataset for quarterly state-level analysis.

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `vintage` | `str` or `None` | `None` | Vintage label shared across FRED-QD and FRED-SD. |
| `force` | `bool` | `False` | Re-download or re-copy raw sources. |
| `cache_root` | path-like or `None` | `None` | Raw cache root used by both loaders. |
| `local_fred_qd_source` | path-like or `None` | `None` | Local FRED-QD CSV source. |
| `local_fred_sd_source` | path-like or `None` | `None` | Local FRED-SD workbook or CSV source. |
| `states` | `list[str]` or `None` | `None` | FRED-SD state subset. |
| `variables` | `list[str]` or `None` | `None` | FRED-SD variable subset. |
| `frequency` | `str` | `"quarterly"` | `"quarterly"`, `"monthly"`, or `"native"`. Monthly is supported but not recommended for this loader. |
| `quarterly_to_monthly` | `str` | `"repeat_within_quarter"` | Rule used only when `frequency="monthly"`. |
| `monthly_to_quarterly` | `str` | `"quarterly_average"` | Rule used if an included FRED-SD series is monthly and the target panel is quarterly. |

### Output

Returns a combined `DataBundle` with:

- `metadata["dataset"] == "fred_qd+fred_sd"`
- `metadata["source_family"] == "combined"`
- `metadata["frequency"] == frequency`
- FRED-QD official t-codes in `metadata["transform_codes"]`
- FRED-SD series metadata preserved in `panel.attrs["macrocast_reports"]`
- FRED-SD source-frequency and date-anchor maps in `metadata["native_frequency_by_column"]` and `metadata["date_anchor_by_column"]`
- any frequency conversions recorded in `metadata["frequency_conversion_warnings"]`

If a monthly FRED-SD series is included in a quarterly panel, the function emits a `UserWarning` and records the conversion. The default `monthly_to_quarterly="quarterly_average"` averages monthly observations inside each quarter.

## combine

Combine already-loaded `DataBundle` objects into one canonical panel.

```python
macroforecast.data.combine(
    *bundles,
    dataset: str | None = None,
    frequency: str = "native",
    quarterly_to_monthly: str = "repeat_within_quarter",
    monthly_to_quarterly: str = "quarterly_average",
) -> DataBundle
```

### Input

| Name | Type | Default | Choices |
| --- | --- | --- | --- |
| `*bundles` | `DataBundle` | required | Two or more bundles to concatenate by date index. |
| `dataset` | `str` or `None` | joined source names | Output dataset label. |
| `frequency` | `str` | `"native"` | `"native"`, `"monthly"`, or `"quarterly"`. |
| `quarterly_to_monthly` | `str` | `"repeat_within_quarter"` | `"repeat_within_quarter"`, `"quarter_end_ffill"`, `"linear_interpolation"`. |
| `monthly_to_quarterly` | `str` | `"quarterly_average"` | `"quarterly_average"`, `"quarterly_endpoint"`, `"quarterly_sum"`. |

With `frequency="native"` or `frequency="mixed"`, no monthly/quarterly conversion is applied. The returned panel keeps each source column on its native observation dates and records `metadata["frequency"] == "mixed"`. Quarterly columns therefore appear as sparse columns on the union date index when they are combined with monthly columns. Downstream mixed-frequency models should read `metadata["native_frequency_by_column"]` rather than infer frequency from the overall index.

### Frequency Conversion Rules

| Direction | Rule | Meaning |
| --- | --- | --- |
| quarterly to monthly | `repeat_within_quarter` | Assign the quarterly value to each month in that quarter. |
| quarterly to monthly | `quarter_end_ffill` | Place the quarterly value at quarter end and forward-fill after it is observed. |
| quarterly to monthly | `linear_interpolation` | Interpolate between observed quarter-end values on the monthly grid. |
| monthly to quarterly | `quarterly_average` | Average monthly observations in the quarter. |
| monthly to quarterly | `quarterly_endpoint` | Use the last monthly observation in the quarter. |
| monthly to quarterly | `quarterly_sum` | Sum monthly observations in the quarter. |

Combined monthly/quarterly output supports only source columns identified as monthly or quarterly. If a source contains weekly, annual, irregular, or unknown-frequency columns, `combine()` raises `ValueError`. Use `frequency="native"` to inspect the mixed panel first, then call `mf.data.align_frequency()` explicitly if those columns should enter a common monthly or quarterly design.

### Output

Returns `DataBundle`.
The panel is a column-wise concatenation after frequency alignment. Duplicate output column names raise `ValueError`.

For mixed outputs, the key metadata fields are:

| Key | Meaning |
| --- | --- |
| `metadata["frequency"]` | `"mixed"`. |
| `metadata["native_frequency_by_column"]` | Native source frequency for each column. |
| `metadata["native_frequency_counts"]` | Counts of native source frequencies. |
| `metadata["date_anchor_by_column"]` | FRED-SD date-anchor map when available. |
| `metadata["date_anchor_counts"]` | Counts of FRED-SD date-anchor patterns when available. |
| `metadata["output_frequency_by_column"]` | Returned-panel frequency for each column; equal to native frequency in native mode. |
| `metadata["alignment"]["frequency"]` | `"native"` when no conversion was applied. |

### Frequency Conversion Warnings

When `combine()` changes a source column's native frequency, it emits `UserWarning` and records the same information in `metadata["frequency_conversion_warnings"]`.

Each record has:

| Key | Type | Meaning |
| --- | --- | --- |
| `dataset` | `str` | Source dataset whose columns were converted. |
| `from_frequency` | `str` | Native frequency before alignment. |
| `to_frequency` | `str` | Combined panel frequency. |
| `rule` | `str` | Alignment rule used. |
| `variables` | `list[str]` | Variable-level names, e.g. `["NQGSP"]` for `NQGSP_CA`. |
| `columns` | `list[str]` | Exact converted panel columns. |
| `n_columns` | `int` | Number of converted columns. |

Example warning:

```text
fred_sd monthly variables were aligned to quarterly using quarterly_average: UR, ICLAIMS (102 columns).
```

## load_custom_csv

Load a user CSV and normalize it to the canonical panel contract.
```python
macroforecast.data.load_custom_csv(
    path,
    *,
    date: str | None = None,
    date_col: str | int | None = None,
    columns: Iterable[str] | None = None,
    series_columns: Iterable[str] | None = None,
    rename: Mapping[str, str] | None = None,
    dataset: str = "custom",
    frequency: str = "unknown",
    frequency_by_column: Mapping[str, str] | None = None,
    default_frequency: str | None = None,
    metadata: Mapping[str, object] | None = None,
    transform_codes: Mapping[str, int] | None = None,
    cache_root: str | pathlib.Path | None = None,
    strict: bool = True,
) -> DataBundle
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `path` | path-like | required | CSV file path. |
| `date` | `str` or `None` | `None` | Date column. If omitted, uses a `DatetimeIndex` or parses the first column. |
| `date_col` | `str`, `int`, or `None` | `None` | Alias for `date`; integer values select the date column by zero-based position. |
| `columns` | iterable or `None` | `None` | Columns to keep before renaming. |
| `series_columns` | iterable or `None` | `None` | Alias for `columns`; use this name when thinking in panel series IDs. |
| `rename` | mapping or `None` | `None` | Column rename map. |
| `dataset` | `str` | `"custom"` | Metadata dataset label. |
| `frequency` | `str` | `"unknown"` | Metadata frequency label. |
| `frequency_by_column` | mapping or `None` | `None` | Optional final-column frequency map, e.g. `{"PAYEMS": "monthly", "GDPC1": "quarterly"}`. |
| `default_frequency` | `str` or `None` | `None` | Fill frequency for columns omitted from `frequency_by_column`. |
| `metadata` | mapping or `None` | `None` | User metadata to attach. |
| `transform_codes` | mapping or `None` | `None` | Optional McCracken-Ng t-code map. Keys must match final loaded series columns after selection and renaming. |
| `cache_root` | path-like or `None` | `None` | If supplied, append a raw-manifest entry under this cache root. Custom loaders do not write the default manifest unless this is supplied. |
| `strict` | `bool` | `True` | Reject invalid date rows and non-numeric cells instead of silently coercing them. Set `False` only when you want a permissive load with a panel report. |

### Output

Returns a `DataBundle`. The normalized panel is available as `bundle.panel` and metadata as `bundle.metadata`. If `transform_codes` is provided, it is stored in both `bundle.metadata["transform_codes"]` and `bundle.panel.attrs["macroforecast_transform_codes"]`, so `mf.preprocessing.reprocess(bundle)` can use the codes automatically.

Custom loaders also store the strict-normalization report at `bundle.metadata["panel"]`. With `strict=True`, malformed dates or non-numeric cells raise `RawParseError` wrapping the underlying validation error. With `strict=False`, those lossy operations are allowed and counted.

If `frequency_by_column` is provided, custom loaders call `set_frequencies(...)` internally and write the same mixed-frequency metadata contract used by official combined bundles. The keys must match final loaded column names after selection and renaming.

Example:

```python
bundle = mf.data.load_custom_csv(
    "panel.csv",
    date_col="DATE",
    series_columns=["INDPRO", "spread"],
    frequency="monthly",
    transform_codes={"INDPRO": 5, "spread": 2},
)
processed = mf.preprocessing.reprocess(bundle)
```

## load_custom_parquet

Load a user Parquet file with the same normalization contract as `load_custom_csv`.
```python
macroforecast.data.load_custom_parquet(
    path,
    *,
    date: str | None = None,
    date_col: str | int | None = None,
    columns: Iterable[str] | None = None,
    series_columns: Iterable[str] | None = None,
    rename: Mapping[str, str] | None = None,
    dataset: str = "custom",
    frequency: str = "unknown",
    frequency_by_column: Mapping[str, str] | None = None,
    default_frequency: str | None = None,
    metadata: Mapping[str, object] | None = None,
    transform_codes: Mapping[str, int] | None = None,
    cache_root: str | pathlib.Path | None = None,
    strict: bool = True,
) -> DataBundle
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `path` | path-like | required | Parquet file path. |
| `date` | `str` or `None` | `None` | Date column. If omitted, uses a `DatetimeIndex` or parses the first column. |
| `date_col` | `str`, `int`, or `None` | `None` | Alias for `date`; integer values select the date column by zero-based position. |
| `columns` | iterable or `None` | `None` | Columns to keep before renaming. |
| `series_columns` | iterable or `None` | `None` | Alias for `columns`. |
| `rename` | mapping or `None` | `None` | Column rename map. |
| `dataset` | `str` | `"custom"` | Metadata dataset label. |
| `frequency` | `str` | `"unknown"` | Metadata frequency label. |
| `frequency_by_column` | mapping or `None` | `None` | Optional final-column frequency map. |
| `default_frequency` | `str` or `None` | `None` | Fill frequency for columns omitted from `frequency_by_column`. |
| `metadata` | mapping or `None` | `None` | User metadata to attach. |
| `transform_codes` | mapping or `None` | `None` | Optional McCracken-Ng t-code map. Keys must match final loaded series columns after selection and renaming. |
| `cache_root` | path-like or `None` | `None` | If supplied, append a raw-manifest entry under this cache root. |
| `strict` | `bool` | `True` | Reject invalid date rows and non-numeric cells instead of silently coercing them. |

### Output

Returns a `DataBundle` with the same canonical panel, metadata, transform-code, strict-normalization, and optional mixed-frequency contract as `load_custom_csv`.

## custom_dataset

Build a custom `DataBundle` from an in-memory pandas `DataFrame`.

Use `custom_dataset()` when the data are already in Python memory and should enter the same contract as `load_fred_md()`, `load_fred_qd()`, `load_fred_sd()`, `load_custom_csv()`, and `load_custom_parquet()`.

```python
macroforecast.data.custom_dataset(
    frame,
    *,
    date: str | None = None,
    columns: Iterable[str] | None = None,
    rename: Mapping[str, str] | None = None,
    dataset: str = "custom",
    source_family: str = "custom",
    frequency: str = "unknown",
    frequency_by_column: Mapping[str, str] | None = None,
    transform_codes: Mapping[str, int] | None = None,
    metadata: Mapping[str, object] | None = None,
    strict: bool = True,
) -> DataBundle
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `frame` | `pandas.DataFrame` | required | Raw or already canonical panel. |
| `date` | `str` or `None` | `None` | Date column. If omitted, the input must have a `DatetimeIndex` or a parseable first column. |
| `columns` | iterable or `None` | `None` | Columns to keep before renaming. |
| `rename` | mapping or `None` | `None` | Rename retained columns after selection. |
| `dataset` | `str` | `"custom"` | Dataset label stored in metadata. |
| `source_family` | `str` | `"custom"` | Source-family label stored in metadata. |
| `frequency` | `str` | `"unknown"` | Loader-level frequency label. |
| `frequency_by_column` | mapping or `None` | `None` | Optional column-level frequency map for mixed-frequency panels. |
| `transform_codes` | mapping or `None` | `None` | Optional t-code map. Keys must match final panel columns. |
| `metadata` | mapping or `None` | `None` | User metadata merged before package metadata is attached. |
| `strict` | `bool` | `True` | Reject lossy date or numeric coercion. |

### Output

Returns `DataBundle`. The panel is canonical and the metadata includes `dataset`, `source_family`, `frequency`, optional `transform_codes`, optional column-level frequency metadata, and a `custom_dataset` stage.

```python
bundle = mf.data.custom_dataset(
    frame,
    date="date",
    dataset="bank_panel",
    frequency="monthly",
    transform_codes={"loan_growth": 1, "spread": 2},
)
processed = mf.preprocessing.reprocess(
    bundle,
    transform="custom",
    impute="mean",
)
```

## as_panel

Normalize an existing pandas `DataFrame`.

```python
macroforecast.data.as_panel(
    frame,
    *,
    date: str | None = None,
    columns: Iterable[str] | None = None,
    rename: Mapping[str, str] | None = None,
    metadata: Mapping[str, object] | None = None,
    strict: bool = True,
) -> pandas.DataFrame
```

`as_panel` returns a canonical panel. It raises if the date column is missing, dates are duplicated, the output is empty, infinite values are present, or any retained column cannot be represented as numeric values or `NaN`.

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `frame` | `pandas.DataFrame` | required | Raw or already canonical panel. |
| `date` | `str` or `None` | `None` | Date column. If omitted and the index is not a `DatetimeIndex`, the first column is parsed as dates. |
| `columns` | iterable or `None` | `None` | Columns to keep before renaming. |
| `rename` | mapping or `None` | `None` | Rename retained columns after selection. |
| `metadata` | mapping or `None` | `None` | Metadata attached under `panel.attrs["macroforecast_metadata"]`. |
| `strict` | `bool` | `True` | Reject lossy date/numeric coercion. `False` permits it and records a panel report. |

### Output

Returns a `pandas.DataFrame` with `DatetimeIndex` named `"date"`, ascending dates, numeric columns, and attrs containing `macroforecast_panel_report`.

## validate_panel

Validate the canonical panel contract.
```python
macroforecast.data.validate_panel(panel) -> None
```

Raises `TypeError` or `ValueError` when the panel is not canonical.

## panel_info

Return a compact panel summary.

```python
macroforecast.data.panel_info(bundle_or_panel) -> dict
```

Output keys include `n_rows`, `n_columns`, `start`, `end`, `columns`, `missing_values`, `frequency`, and `index_frequency`. If the input carries metadata, `frequency` uses the metadata label such as `"mixed"` while `index_frequency` reports the pandas-inferred date-index frequency. Combined data also include compact native/output frequency counts.

## set_frequencies

Attach a column-level frequency contract to an existing panel or bundle.

```python
macroforecast.data.set_frequencies(
    data,
    frequency_by_column,
    *,
    default_frequency: str | None = None,
    output_frequency_by_column: Mapping[str, str] | None = None,
    frequency: str | None = None,
    metadata: Mapping[str, object] | None = None,
) -> DataBundle
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `data` | `DataBundle`, `DataSpec`, `(panel, metadata)`, or `DataFrame` | required | Canonical panel input. |
| `frequency_by_column` | mapping | required | Native frequency for each final panel column. |
| `default_frequency` | `str` or `None` | `None` | Fill omitted columns with one frequency. |
| `output_frequency_by_column` | mapping or `None` | `None` | Returned-panel frequency for each column; defaults to native frequency. |
| `frequency` | `str` or `None` | `None` | Overall metadata label. Defaults to the unique native frequency or `"mixed"`. |
| `metadata` | mapping or `None` | `None` | Extra metadata to merge before writing frequency fields. |

Allowed column frequencies are `monthly`, `quarterly`, `weekly`, `annual`, `irregular`, and `unknown`, with short aliases such as `m`, `q`, and `w`. For mixed-frequency DFM models, monthly and quarterly columns are the relevant contract.

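As a rough illustration of how a column-level map relates to the overall label and the frequency counts, the sketch below resolves aliases, fills omitted columns with a default, and derives `"mixed"` when more than one frequency is present. It is plain Python and does not call `macroforecast`; the helper name `summarize_frequencies` and the alias table are hypothetical, and the package's own alias handling may differ.

```python
from collections import Counter

# Hypothetical alias table; the package's real alias set may differ.
ALIASES = {"m": "monthly", "q": "quarterly", "w": "weekly"}
ALLOWED = {"monthly", "quarterly", "weekly", "annual", "irregular", "unknown"}


def summarize_frequencies(columns, frequency_by_column, default_frequency=None):
    """Illustrative only: resolve one frequency per column and an overall label."""
    resolved = {}
    for column in columns:
        raw = frequency_by_column.get(column, default_frequency)
        if raw is None:
            raise ValueError(f"no frequency given for column {column!r}")
        label = ALIASES.get(raw.lower(), raw.lower())
        if label not in ALLOWED:
            raise ValueError(f"unsupported frequency {raw!r} for column {column!r}")
        resolved[column] = label
    counts = dict(Counter(resolved.values()))
    # One distinct frequency gives that label; several give "mixed".
    overall = next(iter(counts)) if len(counts) == 1 else "mixed"
    return resolved, counts, overall


resolved, counts, overall = summarize_frequencies(
    ["INDPRO", "UNRATE", "GDPC1"],
    {"INDPRO": "m", "GDPC1": "quarterly"},
    default_frequency="monthly",
)
```

Here `UNRATE` is omitted from the map and therefore takes the default, so the panel has two monthly columns and one quarterly column and the overall label is `"mixed"`.
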
### Output

Returns a `DataBundle` with:

| Metadata key | Meaning |
| --- | --- |
| `frequency` | Overall label, usually `"mixed"` when multiple native frequencies are present. |
| `native_frequency_by_column` | Native frequency for each column. |
| `native_frequency_counts` | Counts by native frequency. |
| `output_frequency_by_column` | Frequency represented in the returned panel for each column. |
| `output_frequency_counts` | Counts by output frequency. |

## metadata

Return explicit metadata from a `DataBundle`, `DataSpec`, `(panel, metadata)` tuple, or `DataFrame`.

```python
macroforecast.data.metadata(obj) -> dict
```

### Input

| Name | Type | Meaning |
| --- | --- | --- |
| `obj` | `DataBundle`, `DataSpec`, `(panel, metadata)`, or `DataFrame` | Object carrying package metadata. |

### Output

Returns a shallow copy of the metadata dictionary. Mutating the returned object does not mutate the original bundle or panel attrs.

## attach_metadata

Merge one metadata stage into an existing metadata dictionary.

```python
macroforecast.data.attach_metadata(
    metadata,
    stage: str,
    values,
) -> dict
```

### Input

| Name | Type | Meaning |
| --- | --- | --- |
| `metadata` | mapping | Existing metadata dictionary. |
| `stage` | `str` | Non-empty stage key to write, such as `"data_spec"` or `"data_frequency_alignment"`. |
| `values` | mapping | Stage payload to copy under `stage`. |

### Output

Returns a new dictionary. Existing metadata is copied, then `values` is copied under the requested stage. `attach_metadata()` does not mutate its input.

## spec

Attach run-level data choices to a bundle or panel. This function creates a `DataSpec`; it does not execute downstream pipeline steps.

```python
macroforecast.data.spec(
    data,
    *,
    metadata: Mapping[str, object] | None = None,
    target: str | None = None,
    targets: Iterable[str] | None = None,
    horizons: Iterable[int] | int | None = None,
    start: str | None = None,
    end: str | None = None,
    predictors: "all" | Iterable[str] = "all",
) -> DataSpec
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `data` | `DataBundle`, `DataSpec`, `(panel, metadata)`, or `DataFrame` | required | Canonical data input. |
| `metadata` | mapping or `None` | `None` | Extra metadata to merge. |
| `target` | `str` or `None` | `None` | Single target column. |
| `targets` | iterable or `None` | `None` | Multiple target columns. |
| `horizons` | iterable, int, or `None` | derived | Forecast horizons. |
| `start` | `str` or `None` | `None` | Start date. Accepts `YYYY`, `YYYY-MM`, or `YYYY-MM-DD`. |
| `end` | `str` or `None` | `None` | End date. Accepts `YYYY`, `YYYY-MM`, or `YYYY-MM-DD`. |
| `predictors` | `"all"` or iterable | `"all"` | Predictor columns to keep. `"all"` expands to all non-target columns. Explicit predictor lists may be empty for target-only or autoregressive designs, and may not include target columns. |

### Default Horizons

| Metadata frequency | Default horizons |
| --- | --- |
| `monthly` | `(1, 3, 6, 12)` |
| `quarterly` | `(1, 2, 4, 8)` |
| other or unknown | `(1,)` |

### Output

Returns `DataSpec`. Its metadata contains a `data_spec` entry with the chosen target, targets, horizons, sample dates, expanded predictor list, and panel summary. The predictor expansion is deliberate: downstream model stages should consume a concrete non-target predictor list rather than infer one from the full panel and risk target leakage.

### What It Does And Does Not Do

| Action | Done by `mf.data.spec(...)`? |
| --- | --- |
| Validate the canonical panel contract | Yes |
| Validate target and predictor columns | Yes |
| Expand `predictors="all"` to all non-target columns | Yes |
| Apply `start` and `end` sample bounds | Yes |
| Attach `metadata["data_spec"]` | Yes |
| Load raw data | No |
| Transform, clean, impute, or standardize values | No |
| Create forecast targets or lagged predictors | No |
| Fit models or run evaluation | No |

## Data Policy Helpers

These functions are direct Python replacements for the old data-policy axes. They do not parse YAML and do not fit models.

### align_frequency

```python
macroforecast.data.align_frequency(
    data,
    *,
    method: str = "keep",
    quarterly_to_monthly: str = "repeat_within_quarter",
    weekly_to_monthly: str = "mean",
    monthly_to_quarterly: str = "quarterly_average",
    weekly_to_quarterly: str = "mean",
    chow_lin_indicator: str | Mapping[str, str] | None = None,
    chow_lin_aggregation: str = "mean",
    chow_lin_rho: float | None = None,
    chow_lin_rho_method: str = "fixed",
) -> DataBundle
```

Keeps, filters, or aligns a panel to a common data frequency. This belongs in `macroforecast.data` because it changes the calendar and column-level frequency contract before preprocessing or feature engineering.

| Input | Default | Choices |
| --- | --- | --- |
| `method` | `"keep"` | `"keep"`, `"monthly"`, `"quarterly"`, `"drop_non_monthly"`, `"drop_non_quarterly"` |
| `quarterly_to_monthly` | `"repeat_within_quarter"` | `"repeat_within_quarter"`, `"step_backward"`, `"step_forward"`, `"quarter_end_ffill"`, `"linear_interpolation"`, `"chow_lin"` |
| `weekly_to_monthly` | `"mean"` | `"mean"`, `"last"`, `"sum"` |
| `monthly_to_quarterly` | `"quarterly_average"` | `"quarterly_average"`, `"quarterly_endpoint"`, `"quarterly_sum"` |
| `weekly_to_quarterly` | `"mean"` | `"mean"`, `"last"`, `"sum"` |
| `chow_lin_indicator` | `None` | Indicator column name, or mapping from quarterly column to indicator column, used only when `quarterly_to_monthly="chow_lin"`. |
| `chow_lin_aggregation` | `"mean"` | `"mean"` or `"sum"`; the low-frequency aggregation to conserve. |
| `chow_lin_rho` | `None` | Fixed AR(1) residual correlation. If supplied, must be inside `(-1, 1)`. |
| `chow_lin_rho_method` | `"fixed"` | `"fixed"`, `"min_chi_squared"`, or `"max_likelihood"`. |

Output is a `DataBundle`. Metadata records `data_frequency_alignment`, `native_frequency_by_column`, `output_frequency_by_column`, and frequency counts. Frequency detection uses `native_frequency_by_column` first, then FRED-SD series reports, then observed-date spacing.

```python
monthly = mf.data.align_frequency(
    mixed_bundle,
    method="monthly",
    quarterly_to_monthly="repeat_within_quarter",
)
```

For quarterly-to-monthly alignment, `step_backward` is accepted as an alias for `repeat_within_quarter`; the latter is the clearer spelling. Use `quarter_end_ffill` when values should only become available from the quarter-end month forward.

Use `quarterly_to_monthly="chow_lin"` when a quarterly series should be regression-disaggregated with a monthly indicator:

```python
monthly = mf.data.align_frequency(
    mixed_bundle,
    method="monthly",
    quarterly_to_monthly="chow_lin",
    chow_lin_indicator={"GDPC1": "INDPRO"},
    chow_lin_aggregation="mean",
)
```

The monthly output reproduces the supplied quarterly observations when it is re-aggregated with the declared `chow_lin_aggregation`. The function records the indicator and rho choices in `metadata["data_frequency_alignment"]`.

### chow_lin_disaggregate

```python
macroforecast.data.chow_lin_disaggregate(
    low_frequency,
    indicator,
    *,
    aggregation: str = "mean",
    rho: float | None = None,
    rho_method: str = "fixed",
) -> pandas.Series
```

Runs Chow-Lin disaggregation directly, for example from quarterly to monthly frequency. `low_frequency` is a low-frequency `Series`, and `indicator` is a higher-frequency `Series` or a single/first-column `DataFrame`.
The returned series is indexed like the indicator and conserves `low_frequency` under `aggregation="mean"` or `aggregation="sum"`.

`rho_method="fixed"` uses `rho` when supplied and `0.0` otherwise. `"min_chi_squared"` and `"max_likelihood"` estimate `rho` over a bounded grid.

### infer_frequencies

```python
macroforecast.data.infer_frequencies(data) -> tuple[dict[str, str], str]
```

`infer_frequencies()` returns `(frequency_by_column, source)`. The source is `"native_frequency_by_column"`, `"fred_sd_series_metadata"`, or `"observed_dates"`.

### frequency_hardening_issues

```python
macroforecast.data.frequency_hardening_issues(
    frequencies,
) -> list[dict]
```

Reports columns classified as `unknown`, `irregular`, or `annual` before a caller aligns frequencies. This is useful before forcing a mixed panel to monthly or quarterly frequency.

| Output key | Meaning |
| --- | --- |
| `frequency` | Weak frequency class. |
| `columns` | Columns assigned to that class. |
| `n_columns` | Number of affected columns. |

### availability_lag

```python
macroforecast.data.availability_lag(
    data,
    *,
    lags: int | Mapping[str, int] = 1,
    columns: Iterable[str] | None = None,
    drop_missing: bool = False,
) -> DataBundle
```

Positive lags delay predictor availability. `lags=1` means the value dated `t-1` is the latest available value on row `t`. Pass a mapping for column-specific release lags.

### same_period_predictors

```python
macroforecast.data.same_period_predictors(
    data_spec,
    *,
    policy: "allow" | "lag" | "drop" | "forbid" = "allow",
    lag: int = 1,
    columns: Iterable[str] | None = None,
    drop_missing: bool = False,
) -> DataSpec
```

`allow` records the choice, `lag` shifts selected predictors, `drop` removes them from the active predictor set, and `forbid` raises if such predictors are present. Targets are never shifted by this helper.

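The lag semantics of `availability_lag` and of the `lag` policy above can be pictured with plain pandas. This sketch is not the package implementation: it assumes that a lag of `k` corresponds to shifting a column down by `k` rows, and the panel and its column names are made up for illustration.

```python
import pandas as pd

# Made-up monthly panel: one target and two predictors.
index = pd.date_range("2024-01-01", periods=4, freq="MS", name="date")
panel = pd.DataFrame(
    {
        "target": [1.0, 2.0, 3.0, 4.0],
        "fast": [10.0, 20.0, 30.0, 40.0],
        "slow": [100.0, 200.0, 300.0, 400.0],
    },
    index=index,
)

# Column-specific release lags, in periods. The target is left untouched.
lags = {"fast": 1, "slow": 2}

lagged = panel.copy()
for column, lag in lags.items():
    # Row t now holds the value originally dated t - lag.
    lagged[column] = panel[column].shift(lag)

# Optional: drop rows where a lagged predictor is not yet available,
# analogous in spirit to drop_missing=True.
complete = lagged.dropna()
```

Shifting leaves leading `NaN` values in the lagged columns, which is why a `drop_missing`-style option is useful when later stages cannot handle incomplete rows.
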
### define_regime

```python
macroforecast.data.define_regime(
    data,
    *,
    name: str = "regime",
    column: str | None = None,
    threshold: float | None = None,
    direction: "above" | "below" | "equal" | "not_equal" = "above",
    dates: Iterable[str | pandas.Timestamp] | None = None,
    values: Sequence[bool | int | float] | pandas.Series | None = None,
    append: bool = False,
    output_column: str | None = None,
) -> DataBundle
```

Exactly one regime source is required: a threshold rule, explicit dates, or an aligned vector/Series. The regime is stored in `metadata["regimes"]`; set `append=True` to also add a numeric indicator column to the panel.

## Vintage Helpers

### list_vintages

Generate monthly vintage labels for a supported dataset.

```python
macroforecast.data.list_vintages(
    dataset: str,
    start: str | None = None,
    end: str | None = None,
) -> list[str]
```

### Input

| Name | Type | Default | Meaning |
| --- | --- | --- | --- |
| `dataset` | `str` | required | One of `fred_md`, `fred_qd`, `fred_sd`, `fred_md+fred_sd`, or `fred_qd+fred_sd`. |
| `start` | `str` or `None` | first supported vintage | Start vintage in `YYYY-MM` form. |
| `end` | `str` or `None` | required | End vintage in `YYYY-MM` form. |

### Output

Returns candidate monthly vintage labels. The selected vintage is passed to `load_fred_md`, `load_fred_qd`, or `load_fred_sd` through `vintage=`. Although the signature defaults `end` to `None`, a value is required because the function does not inspect remote availability.

## Official Source Pages

- FRED-MD and FRED-QD source page:
- FRED-SD source page: