production-mlops · ML Eval Scorecard

Churn Prediction v2 — Eval Scorecard

Model: churn-xgb-v2 · Task: Binary Classification · Skill: ml-test-and-eval-design · Command: /production-mlops:mlops-evals
Primary Metric: AUC-ROC ≥ 0.82
Task Type: Binary Classification
Model version: churn-xgb-v2 · High-LTV segment recall is gated separately at ≥ 0.75, because a missed high-value customer costs 12× as much as a false-positive intervention
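
The 12× cost asymmetry can also be read as a decision threshold. The sketch below is illustrative only: it assumes the model's scores are calibrated probabilities, that correct predictions carry no cost, and that the 12:1 ratio is the only consideration. The function name `cost_optimal_threshold` is hypothetical and not part of the scorecard tooling.

```python
def cost_optimal_threshold(cost_fn: float, cost_fp: float) -> float:
    """Score above which intervening has lower expected cost than not intervening.

    With churn probability p, intervening on a non-churner costs (1 - p) * cost_fp
    in expectation, and skipping a churner costs p * cost_fn. Intervening is cheaper
    when p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    if cost_fn <= 0 or cost_fp <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


# Assumption: costs expressed in units of one false-positive intervention.
high_ltv_threshold = cost_optimal_threshold(cost_fn=12.0, cost_fp=1.0)
print(round(high_ltv_threshold, 4))  # 0.0769
```

Under these assumptions the break-even score for the high-LTV segment is 1/13, far below 0.5, which is consistent with gating recall for that segment separately.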

Testing Layer Coverage

L1 — Unit tests
Feature transforms, encoding logic, null handling. 47 tests passing.
L2 — Data quality
Schema contract, null rate < 2%, distribution shift checks on training split.
L3 — Model behavior
AUC, precision-recall curve, calibration plot. Threshold sweep to locate the F1-optimal operating point.
L4 — Slice testing
High-LTV recall gap identified (see slice metrics). New customers segment not yet evaluated.
L5 — Production shadow
Shadow mode not yet configured. Required before canary rollout. Blocks promotion.
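
The L2 data-quality checks above (null rate < 2%, distribution shift) might be sketched as follows. This is a minimal standard-library illustration, not the scorecard's actual implementation. The scorecard does not say which shift statistic is used; the Population Stability Index (PSI) here is a swapped-in choice, and the function names and the 0.2 alert level mentioned afterwards are assumptions.

```python
import math
from typing import List, Optional, Sequence


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of missing (None) entries in a column."""
    if not values:
        raise ValueError("empty column")
    return sum(v is None for v in values) / len(values)


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples.

    Uses equal-width bins over the range of `expected`; values of `actual`
    outside that range are counted in the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant column

    def proportions(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Toy data for illustration only.
column = [None] + [0.5] * 99
baseline = [i / 100 for i in range(100)]
shifted = [x + 0.3 for x in baseline]

null_ok = null_rate(column) < 0.02      # gate from the scorecard
shift_score = psi(baseline, shifted)    # larger means more drift
```

A PSI above roughly 0.2 is a common rule of thumb for a meaningful shift; the appropriate alert level for this pipeline would need to be set by its owners.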

Slice Metrics

Slice | Value | Threshold | Status
Overall AUC | 0.847 | ≥ 0.82 | Pass
High-LTV recall | 0.71 | ≥ 0.75 | Fail
New customers AUC | n/a | ≥ 0.78 | Not run
Precision @ 0.5 | 0.79 | ≥ 0.70 | Pass
Calibration error | 0.032 | ≤ 0.05 | Pass
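
The slice table above can be expressed as data plus a small gate check, so that a slice that was never evaluated surfaces as "Not run" instead of being skipped silently. The sketch below is a hedged illustration: the `SliceGate` class and `recall_on_slice` helper are hypothetical names, while the values and thresholds are copied from the table.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class SliceGate:
    name: str
    value: Optional[float]          # None means the slice was not evaluated
    threshold: float
    higher_is_better: bool = True

    def status(self) -> str:
        if self.value is None:
            return "Not run"
        if self.higher_is_better:
            ok = self.value >= self.threshold
        else:
            ok = self.value <= self.threshold
        return "Pass" if ok else "Fail"


def recall_on_slice(y_true: Sequence[int], y_pred: Sequence[int],
                    in_slice: Sequence[bool]) -> Optional[float]:
    """Recall restricted to rows where in_slice is True; None if no positives."""
    tp = sum(1 for y, p, m in zip(y_true, y_pred, in_slice) if m and y == 1 and p == 1)
    fn = sum(1 for y, p, m in zip(y_true, y_pred, in_slice) if m and y == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else None


GATES: List[SliceGate] = [
    SliceGate("Overall AUC", 0.847, 0.82),
    SliceGate("High-LTV recall", 0.71, 0.75),
    SliceGate("New customers AUC", None, 0.78),
    SliceGate("Precision @ 0.5", 0.79, 0.70),
    SliceGate("Calibration error", 0.032, 0.05, higher_is_better=False),
]

for gate in GATES:
    print(f"{gate.name}: {gate.status()}")
```

Treating "Not run" as a distinct status, not as a pass, keeps an unevaluated slice from slipping through a gate.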

Promotion Gate Policy

Stage | Condition | Status
Dev → Staging | AUC ≥ 0.82 on holdout + all L1–L3 tests green | Pass
Staging → Shadow | High-LTV recall ≥ 0.75 + new customer slice evaluated | Blocked
Shadow → Canary | 2 weeks shadow, score distribution within 5% of incumbent | Pending shadow
Canary → Full | 10% canary for 1 week, no regression on business KPIs | Pending canary
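
The gate policy above is sequential: a model advances only as far as the first gate it fails. A minimal sketch of that logic follows. The evidence keys, the `first_blocked_stage` function, and the reading of "within 5% of incumbent" as a single shift number are all assumptions made for illustration; the holdout AUC is taken to be the overall AUC reported in the slice metrics.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Evidence = Dict[str, Any]

# Each stage pairs a name with a predicate over the collected evidence.
STAGES: List[Tuple[str, Callable[[Evidence], bool]]] = [
    ("Dev -> Staging",
     lambda e: e["holdout_auc"] >= 0.82 and e["l1_l3_green"]),
    ("Staging -> Shadow",
     lambda e: e["high_ltv_recall"] >= 0.75 and e["new_customer_slice_evaluated"]),
    ("Shadow -> Canary",
     lambda e: e["shadow_days"] >= 14 and e["shadow_score_shift"] <= 0.05),
    ("Canary -> Full",
     lambda e: e["canary_fraction"] >= 0.10 and e["canary_days"] >= 7
     and not e["kpi_regression"]),
]


def first_blocked_stage(evidence: Evidence) -> Optional[str]:
    """Return the first stage whose gate fails, or None if every gate passes."""
    for name, gate in STAGES:
        if not gate(evidence):
            return name
    return None


# Evidence mirroring the current scorecard; later-stage fields are placeholders
# that are never read because evaluation stops at the first failing gate.
current: Evidence = {
    "holdout_auc": 0.847, "l1_l3_green": True,
    "high_ltv_recall": 0.71, "new_customer_slice_evaluated": False,
    "shadow_days": 0, "shadow_score_shift": None,
    "canary_fraction": 0.0, "canary_days": 0, "kpi_regression": False,
}

print(first_blocked_stage(current))  # Staging -> Shadow
```

Stopping at the first failing gate matches the table, where the later stages are reported as pending and not as failed.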

Anti-Patterns Detected

P0
Aggregate-only evaluation hides slice failure
Overall AUC passes (0.847) but high-LTV recall fails (0.71 vs 0.75 gate). Reporting only overall AUC would silently ship a model that underperforms on the highest-value customer segment.
P2
No temporal holdout in evaluation split
The current train/test split is random. Churn prediction requires a temporal split: the test set must postdate the training set to avoid lookahead bias.
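
A temporal holdout of the kind described above could look like the following sketch. It assumes each row carries an observation date under a hypothetical `snapshot_date` key; the function name and the toy rows are illustrative and not taken from the churn-xgb-v2 pipeline.

```python
from datetime import date
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def temporal_split(rows: List[Row], cutoff: date,
                   date_key: str = "snapshot_date") -> Tuple[List[Row], List[Row]]:
    """Split rows so that every test row is dated on or after the cutoff
    and every training row is dated strictly before it."""
    train = [r for r in rows if r[date_key] < cutoff]
    test = [r for r in rows if r[date_key] >= cutoff]
    if not train or not test:
        raise ValueError("cutoff leaves an empty train or test set")
    return train, test


# Toy rows: one snapshot per month, January through June 2024.
rows = [{"customer_id": i, "snapshot_date": date(2024, month, 1)}
        for i, month in enumerate([1, 2, 3, 4, 5, 6])]

train, test = temporal_split(rows, cutoff=date(2024, 5, 1))
print(len(train), len(test))  # 4 2
```

The split alone does not rule out leakage: features for each row also need to be computed only from data available before that row's snapshot date.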