production-mlops · Architecture Review

Customer Churn Prediction System

Reviewed 2026-06-22 · Skill: production-ml-system-design · Command: /production-mlops:mlops-arch-review
Current Maturity
Level 1 → Level 2
Manual retraining triggered ad hoc. Target: automated training pipeline with eval gate and model registry promotion. Estimated 6–8 weeks to reach Level 2 with dedicated MLOps sprint.

Deployment Lane Topology

Pipeline Lane (Code)
Feature engineering, preprocessing, and training code versioned in Git. CI pipeline on every PR: unit tests, schema validation, data contract checks. Deployment triggered on merge to main. No separate model artifact lifecycle — this is the current gap.
Model Lane (Artifact)
Recommended: model artifacts promoted through dev → staging → prod via MLflow Model Registry. Promotion gate requires eval scorecard pass (AUC ≥ 0.82, high-LTV recall ≥ 0.75). Canary at 10% traffic before full rollout. Rollback: re-promote previous version from registry.
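One way to realise the 10% canary in a batch setting is to split customers deterministically, so the same customer stays in the same lane from run to run. The sketch below is illustrative only: hashing on a customer identifier, the function name `assign_lane`, and the lane labels are assumptions, not part of the reviewed system.

```python
import hashlib

CANARY_FRACTION = 0.10  # 10% canary share, taken from the recommendation above


def assign_lane(customer_id: str, canary_fraction: float = CANARY_FRACTION) -> str:
    """Return "canary" or "stable" for a customer (hypothetical helper).

    The assignment depends only on the customer id, so repeated batch runs
    score each customer with the same model version.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"
```

Raising `canary_fraction` toward 1.0 moves customers from the stable lane to the canary lane without reshuffling those already on the canary, which keeps before/after comparisons clean.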
Lane Conflation Risk (P1)
Current setup conflates code deployment with model deployment: a bad model ships whenever code ships. Separate the lanes before moving to Level 2. Implementing the model registry is the first prerequisite.

Key Architecture Decisions

Decision Area | Recommendation | Rationale
Serving mode | Batch inference | Churn scores consumed nightly by CRM, so no real-time SLO. Batch is cheaper, easier to debug, and matches the business consumption pattern.
Feature store | Shared offline store | No real-time serving required. Offline store (Feast / Hive) sufficient. Defer online store until use case requires <100 ms features.
Retraining cadence | Weekly scheduled | Churn signal drifts monthly. Weekly retraining provides a safety margin. Trigger also on drift alert (PSI > 0.2 on key features).
Model registry | MLflow | Already in tech stack. Enables staged promotion, artifact lineage, and A/B experiment tracking without new tooling.
Shadow mode | Required first | Run new model in shadow for 2 weeks before canary. Compare score distributions, not just offline metrics.
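The PSI > 0.2 drift trigger in the retraining cadence row can be sketched as follows. This is a minimal standard-library version under stated assumptions: equal-width bins over the baseline range, ten bins, and a small floor on empty bins are illustrative choices, and the names `psi` and `drift_alert` are hypothetical. The same function could also be applied to score distributions during the shadow period.

```python
import math
from typing import List, Sequence

PSI_ALERT_THRESHOLD = 0.2  # from the retraining cadence recommendation


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` against the `expected` baseline.

    Bins are equal-width over the range of `expected` (an assumption;
    quantile bins are also common).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into the edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def drift_alert(expected: Sequence[float], actual: Sequence[float]) -> bool:
    """True when PSI exceeds the alert threshold and retraining should be triggered."""
    return psi(expected, actual) > PSI_ALERT_THRESHOLD
```

In use, `expected` would be the feature values seen at training time and `actual` the values from the most recent scoring batch, computed per key feature.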

Anti-Patterns Detected

Label proxy leakage — P0
Feature "days_since_last_login" is computed after the churn event in the current pipeline, so it encodes the outcome and acts as a proxy for the target (label leakage). It must be recomputed from a snapshot taken 30 days before the churn label cutoff.
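A point-in-time version of the feature might look like the sketch below. It assumes login history is available as a list of dates per customer and that the label cutoff is known per training row; the function signature and the `None` convention for customers with no visible logins are illustrative assumptions, not the pipeline's actual interface.

```python
from datetime import date, timedelta
from typing import Iterable, Optional

SNAPSHOT_OFFSET_DAYS = 30  # snapshot is taken 30 days before the churn label cutoff


def days_since_last_login(login_dates: Iterable[date], label_cutoff: date,
                          offset_days: int = SNAPSHOT_OFFSET_DAYS) -> Optional[int]:
    """Days between the snapshot date and the last login visible at that snapshot."""
    snapshot = label_cutoff - timedelta(days=offset_days)
    # Only logins on or before the snapshot may inform the feature;
    # anything later could reflect the churn outcome itself.
    visible = [d for d in login_dates if d <= snapshot]
    if not visible:
        return None  # no login history at snapshot time
    return (snapshot - max(visible)).days
```

For a label cutoff of 2026-06-30 the snapshot falls on 2026-05-31, so a login on 2026-06-25 is ignored and the feature is measured from the last login on or before 2026-05-31.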
Missing eval gate between training and serving — P1
No automated check that the newly trained model meets minimum AUC threshold before promotion. A degraded model can be promoted silently.
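A minimal sketch of such a gate is shown below, using the thresholds from the Model Lane recommendation (AUC ≥ 0.82, high-LTV recall ≥ 0.75). The `Scorecard` type and function names are hypothetical; how the metrics are computed and how the result is wired into registry promotion are left out.

```python
from dataclasses import dataclass
from typing import List

MIN_AUC = 0.82              # threshold from the Model Lane recommendation
MIN_HIGH_LTV_RECALL = 0.75  # threshold from the Model Lane recommendation


@dataclass(frozen=True)
class Scorecard:
    """Evaluation metrics for one candidate model (hypothetical container)."""
    auc: float
    high_ltv_recall: float


def gate_failures(card: Scorecard) -> List[str]:
    """Return one human-readable message per failed threshold."""
    failures = []
    if card.auc < MIN_AUC:
        failures.append(f"AUC {card.auc:.3f} is below {MIN_AUC}")
    if card.high_ltv_recall < MIN_HIGH_LTV_RECALL:
        failures.append(
            f"high-LTV recall {card.high_ltv_recall:.3f} is below {MIN_HIGH_LTV_RECALL}"
        )
    return failures


def passes_gate(card: Scorecard) -> bool:
    """True only when every threshold is met; promotion should be blocked otherwise."""
    return not gate_failures(card)
```

Returning the list of failures rather than a bare boolean lets the training pipeline log why a candidate was rejected, which makes a silent degradation visible.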
No drift monitoring in production — P2
Feature distribution and prediction distribution are not monitored post-deployment. Churn model can silently degrade between retraining cycles.

Next Actions