ML Eval Scorecard — Churn Prediction v2

Primary Metric

AUC-ROC ≥ 0.82

Task Type

Binary Classification

Model version: churn-xgb-v2 · High-LTV segment recall is gated separately at ≥ 0.75, because a missed high-value customer costs 12× as much as a false-positive intervention
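Note on the 12× cost ratio: a rough sketch of what it implies for the decision threshold. It assumes the model's scores are calibrated probabilities and that correct decisions cost nothing; neither assumption is stated in this scorecard, and the function name is illustrative only.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimises expected misclassification cost.

    Intervening is cheaper than not intervening when
    p * cost_fn > (1 - p) * cost_fp, i.e. when p > cost_fp / (cost_fp + cost_fn).
    Assumes calibrated scores and zero cost for correct decisions.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


# Cost ratio from the scorecard: a miss costs 12x a false-positive intervention.
threshold = cost_optimal_threshold(cost_fp=1.0, cost_fn=12.0)
print(round(threshold, 3))  # 0.077
```

Under these assumptions the cost-minimising threshold for the high-LTV segment is 1/13, far below 0.5, which is one reason a recall gate on that segment is evaluated separately from the overall metric.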

Testing Layer Coverage

✓

L1 — Unit tests

Feature transforms, encoding logic, null handling. 47 tests passing.

✓

L2 — Data quality

Schema contract, null rate < 2%, distribution shift checks on training split.

✓

L3 — Model behavior

AUC, precision-recall curve, calibration plot. Threshold sweep to locate the F1-optimal operating point.

⚠

L4 — Slice testing

High-LTV recall gap identified (see Slice Metrics). New-customer segment not yet evaluated.

✗

L5 — Production shadow

Shadow mode not yet configured. Required before canary rollout. Blocks promotion.
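Sketch of the L2 data-quality checks listed above, under stated assumptions. The 2% null-rate limit is from this scorecard. The scorecard does not say which distribution-shift statistic is used, so the population stability index (PSI) here is a stand-in, and the function names are hypothetical.

```python
import math
from typing import Optional, Sequence

MAX_NULL_RATE = 0.02  # scorecard gate: null rate < 2%


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are None or NaN."""
    if not values:
        raise ValueError("empty column")
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population stability index between a reference and a new sample.

    Uses equal-width bins over the reference sample's range; values outside
    that range are clipped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column

    def shares(sample: Sequence[float]) -> list:
        counts = [0] * bins
        for x in sample:
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Illustrative column: 1 missing value out of 100 rows.
column = [0.4, None, 0.9, 1.2] + [0.5] * 96
print(null_rate(column) < MAX_NULL_RATE)
```

PSI is zero for identical distributions and grows as they diverge. Any alerting cutoff would need to be agreed separately, since none is given here.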

Slice Metrics

Metric	Value	Threshold	Status
Overall AUC	0.847	≥ 0.82	Pass
High-LTV recall	0.71	≥ 0.75	Fail
New customers AUC	—	≥ 0.78	Not run
Precision @ 0.5	0.79	≥ 0.70	Pass
Calibration error	0.032	≤ 0.05	Pass
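The Status column above can be derived mechanically from Value and Threshold. A minimal sketch, assuming a metric that has not been computed is recorded as `None` and treated as blocking; the `Gate` class is illustrative and not part of any existing pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    name: str
    value: Optional[float]  # None means the metric has not been computed
    threshold: float
    higher_is_better: bool = True

    def status(self) -> str:
        if self.value is None:
            return "Not run"
        if self.higher_is_better:
            ok = self.value >= self.threshold
        else:
            ok = self.value <= self.threshold
        return "Pass" if ok else "Fail"


# Values and thresholds copied from the table above.
GATES = [
    Gate("Overall AUC", 0.847, 0.82),
    Gate("High-LTV recall", 0.71, 0.75),
    Gate("New customers AUC", None, 0.78),
    Gate("Precision @ 0.5", 0.79, 0.70),
    Gate("Calibration error", 0.032, 0.05, higher_is_better=False),
]

for gate in GATES:
    print(f"{gate.name}: {gate.status()}")

# Assumption: anything other than "Pass" blocks, including metrics never run.
blocking = [g.name for g in GATES if g.status() != "Pass"]
print(blocking)
```

Treating "Not run" as distinct from "Fail" keeps an unevaluated slice visible rather than letting it pass by omission.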

Promotion Gate Policy

Stage	Condition	Status
Dev → Staging	AUC ≥ 0.82 on holdout + all L1–L3 tests green	Pass
Staging → Shadow	High-LTV recall ≥ 0.75 + new customer slice evaluated	Blocked
Shadow → Canary	2 weeks shadow, score distribution within 5% of incumbent	Pending shadow
Canary → Full	10% canary for 1 week, no regression on business KPIs	Pending canary
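The policy above is sequential: a later transition cannot open until every earlier one has passed. A small sketch of that ordering, with the current state hard-coded as booleans for illustration; `first_blocked` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# (transition, condition currently met?) in promotion order, mirroring the table.
STAGES: List[Tuple[str, bool]] = [
    ("Dev → Staging", True),
    ("Staging → Shadow", False),  # high-LTV recall 0.71 < 0.75; new-customer slice not run
    ("Shadow → Canary", False),   # cannot start until shadow mode runs
    ("Canary → Full", False),     # cannot start until canary runs
]


def first_blocked(stages: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the first transition whose condition is unmet, or None if all pass.

    Later transitions are not inspected, because each depends on the one before.
    """
    for name, met in stages:
        if not met:
            return name
    return None


print(first_blocked(STAGES))  # Staging → Shadow
```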

Anti-Patterns Detected

Aggregate-only evaluation hides slice failure

Overall AUC passes (0.847), but high-LTV recall fails (0.71 against the 0.75 gate). Gating on overall AUC alone would silently ship a model that underperforms on the highest-value customer segment.
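A toy illustration of how an aggregate can mask a slice failure. The rows below are made up for the example and are not this model's evaluation data; `recall_by_slice` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recall_by_slice(rows: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Compute recall per slice plus an "overall" entry.

    Each row is (slice_name, y_true, y_pred) with binary 0/1 labels.
    Slices with no positive examples are omitted from the result.
    """
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for slice_name, y_true, y_pred in rows:
        if y_true == 1:
            for key in (slice_name, "overall"):
                positives[key] += 1
                true_pos[key] += int(y_pred == 1)
    return {key: true_pos[key] / positives[key] for key in positives}


# Made-up data: a small high-value slice with poor recall, a large slice with good recall.
rows = (
    [("high_ltv", 1, 1)] * 2 + [("high_ltv", 1, 0)] * 2
    + [("standard", 1, 1)] * 15 + [("standard", 1, 0)] * 1
    + [("standard", 0, 0)] * 10
)

result = recall_by_slice(rows)
print(result["overall"], result["high_ltv"], result["standard"])  # 0.85 0.5 0.9375
```

Here overall recall is 0.85 while the small slice sits at 0.5, because the larger slice dominates the aggregate.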

No temporal holdout in evaluation split

The current train/test split is random. Churn prediction requires a temporal split: the test set must postdate the training set to avoid lookahead bias, which would inflate the reported metrics.
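A minimal sketch of a temporal split, assuming each row carries a snapshot date. The `snapshot_date` field, the cutoff, and the sample rows are hypothetical.

```python
from datetime import date
from typing import Callable, List, Sequence, Tuple, TypeVar

Row = TypeVar("Row")


def temporal_split(
    rows: Sequence[Row],
    get_date: Callable[[Row], date],
    cutoff: date,
) -> Tuple[List[Row], List[Row]]:
    """Train on rows dated strictly before the cutoff, test on the rest."""
    train = [r for r in rows if get_date(r) < cutoff]
    test = [r for r in rows if get_date(r) >= cutoff]
    if not train or not test:
        raise ValueError("cutoff leaves the train or test set empty")
    return train, test


# Hypothetical rows: one monthly snapshot per customer record.
rows = [
    {"customer_id": i, "snapshot_date": date(2024, month, 1), "churned": i % 2}
    for i, month in enumerate([1, 2, 3, 4, 5, 6])
]

train, test = temporal_split(rows, lambda r: r["snapshot_date"], cutoff=date(2024, 5, 1))
print(len(train), len(test))  # 4 2
```

The split only helps if features are also computed from data available before each row's snapshot date; a feature that looks past that date reintroduces the leak.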