# Custom Evaluation and Tests

[Back to custom extensions](index.md)

Use custom metrics when the output is a scalar forecast score. Use
`custom_test()` when the output is a statistical forecast-comparison result.
Use custom aggregation mappings when the evaluation report needs project-local
slices.

## Custom Metrics

Custom point metrics are plain callables:

```python
def mean_bias(y_true, y_pred):
    return float(pandas.Series(y_pred).sub(pandas.Series(y_true)).mean())

scores = mf.metrics.evaluate_forecasts(
    forecast_table,
    metrics=("mse", mean_bias),
)
```

### Metric Callable Contract

```python
metric(y_true, y_pred) -> float
```

The callable must return one scalar. The output column uses the callable name
unless the surrounding evaluation function renames it.

## custom_test

```python
mf.tests.custom_test(
    name,
    func,
    *args,
    alternative="two-sided",
    **params,
) -> mf.tests.TestResult
```

### Test Callable Contract

The callable receives `*args` plus `**params` and should return either a
mapping or a `TestResult`-like object containing:

| Field | Meaning |
| --- | --- |
| `statistic` | Test statistic. |
| `p_value` | P-value, or `None` if unavailable. |
| `decision` | Optional reject flag. |
| `n_obs` | Number of aligned observations. |
| `metadata` | Optional source, null hypothesis, reference distribution, or warning metadata. |

### Example

```python
def my_loss_test(loss_a, loss_b):
    diff = pandas.Series(loss_a).sub(pandas.Series(loss_b)).dropna()
    return {
        "statistic": float(diff.mean()),
        "p_value": 0.04,
        "n_obs": len(diff),
    }

test = mf.tests.custom_test("my_loss_test", my_loss_test, loss_a, loss_b)
```

## Custom Evaluation Slices

```python
report = mf.evaluation.evaluate_report(
    forecast_result,
    metrics=("mse", mean_bias),
    aggregations={
        "model_target": ("model", "target"),
        "model_regime": ("model", "regime"),
    },
)
```

Custom aggregations create additional `EvaluationReport.aggregations` tables.
They do not change raw metric definitions.

## Output Flow

```python
main_table = mf.reporting.test_report_table({"custom": test})
manifest = mf.output.write_artifacts(
    {"scores": scores, "custom_test": test.to_dict()},
    "results/custom_eval",
)
```