Custom Evaluation and Tests#
Use custom metrics when the output is a scalar forecast score. Use
custom_test() when the output is a statistical forecast-comparison result.
Use custom aggregation mappings when the evaluation report needs project-local
slices.
Custom Metrics#
Custom point metrics are plain callables:
def mean_bias(y_true, y_pred):
return float(pandas.Series(y_pred).sub(pandas.Series(y_true)).mean())
scores = mf.metrics.evaluate_forecasts(
forecast_table,
metrics=("mse", mean_bias),
)
Metric Callable Contract#
metric(y_true, y_pred) -> float
The callable must return one scalar. The output column uses the callable name unless the surrounding evaluation function renames it.
custom_test#
mf.tests.custom_test(
name,
func,
*args,
alternative="two-sided",
**params,
) -> mf.tests.TestResult
Test Callable Contract#
The callable receives *args plus **params and should return either a
mapping or a TestResult-like object containing:
Field |
Meaning |
|---|---|
|
Test statistic. |
|
P-value, or |
|
Optional reject flag. |
|
Number of aligned observations. |
|
Optional source, null hypothesis, reference distribution, or warning metadata. |
Example#
def my_loss_test(loss_a, loss_b):
diff = pandas.Series(loss_a).sub(pandas.Series(loss_b)).dropna()
return {
"statistic": float(diff.mean()),
"p_value": 0.04,
"n_obs": len(diff),
}
test = mf.tests.custom_test("my_loss_test", my_loss_test, loss_a, loss_b)
Custom Evaluation Slices#
report = mf.evaluation.evaluate_report(
forecast_result,
metrics=("mse", mean_bias),
aggregations={
"model_target": ("model", "target"),
"model_regime": ("model", "regime"),
},
)
Custom aggregations create additional EvaluationReport.aggregations tables.
They do not change raw metric definitions.
Output Flow#
main_table = mf.reporting.test_report_table({"custom": test})
manifest = mf.output.write_artifacts(
{"scores": scores, "custom_test": test.to_dict()},
"results/custom_eval",
)