ML Architecture Review — Customer Churn Prediction

Current Maturity

Level 1 → Level 2

Manual retraining triggered ad hoc. Target: automated training pipeline with eval gate and model registry promotion. Estimated 6–8 weeks to reach Level 2 with dedicated MLOps sprint.

Deployment Lane Topology

Pipeline Lane (Code)

Feature engineering, preprocessing, and training code versioned in Git. CI pipeline on every PR: unit tests, schema validation, data contract checks. Deployment triggered on merge to main. No separate model artifact lifecycle — this is the current gap.
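A data contract check of the kind run in CI could look like the sketch below. It is illustrative only: EXPECTED_SCHEMA, the contract_violations helper, and every column name other than days_since_last_login are assumptions, not the pipeline's actual contract.

```python
from typing import Any, Dict, List, Mapping

# Hypothetical contract: column name -> required Python type.
EXPECTED_SCHEMA: Mapping[str, type] = {
    "customer_id": str,
    "days_since_last_login": int,
    "monthly_spend": float,  # hypothetical column, for illustration only
}


def contract_violations(
    rows: List[Dict[str, Any]],
    schema: Mapping[str, type] = EXPECTED_SCHEMA,
) -> List[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    violations: List[str] = []
    for i, row in enumerate(rows):
        # Report columns the contract requires but the row lacks.
        for col in sorted(schema.keys() - row.keys()):
            violations.append(f"row {i}: missing column {col!r}")
        # Report present columns whose value has the wrong type.
        for col, expected_type in schema.items():
            if col in row and not isinstance(row[col], expected_type):
                violations.append(
                    f"row {i}: {col!r} expected {expected_type.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
    return violations
```

A CI step would fail the PR when the returned list is non-empty.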

Model Lane (Artifact)

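Recommended: promote model artifacts through dev → staging → prod via MLflow Model Registry (MLflow's built-in stages are None, Staging, Production, and Archived, so "dev" would likely map to a newly registered version with no stage assigned). The promotion gate requires an eval scorecard pass (AUC ≥ 0.82, high-LTV recall ≥ 0.75). Canary at 10% traffic before full rollout. Rollback: re-promote the previous version from the registry.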
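One way to realise the 10% canary is a deterministic hash split on customer ID, sketched below. Because serving is batch, "10% traffic" is interpreted here as 10% of scored customers, which is an assumption. The assign_lane function and the "canary"/"stable" labels are hypothetical names.

```python
import hashlib

CANARY_FRACTION = 0.10  # 10% canary share from the recommendation above


def assign_lane(customer_id: str, canary_fraction: float = CANARY_FRACTION) -> str:
    """Return "canary" or "stable"; the same customer always gets the same lane."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits (32 bits) to a bucket in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "canary" if bucket < canary_fraction else "stable"
```

Hashing rather than random sampling keeps each customer on the same model across nightly runs, so score changes during the canary window are attributable to the model and not to reassignment.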

Lane Conflation Risk (P1)

The current setup conflates code deployment with model deployment: a model ships whenever code ships, so a bad model can reach production with no gate of its own. Separate the lanes before moving to Level 2. Standing up the model registry is the prerequisite for that separation.

Key Architecture Decisions

Decision Area	Recommendation	Rationale
Serving mode	Batch inference	Churn scores consumed nightly by CRM — no real-time SLO. Batch is cheaper, easier to debug, and matches the business consumption pattern.
Feature store	Shared offline store	No real-time serving required. Offline store (Feast / Hive) sufficient. Defer online store until a use case requires feature retrieval in under 100 ms.
Retraining cadence	Weekly scheduled	Churn signal drifts monthly; weekly retraining provides a safety margin. Also trigger retraining on a drift alert (PSI > 0.2 on key features).
Model registry	MLflow	Already in tech stack. Enables staged promotion, artifact lineage, and A/B experiment tracking without new tooling.
Shadow mode	Required first	Run new model in shadow for 2 weeks before canary. Compare score distributions — not just offline metrics.
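The shadow-mode comparison of score distributions could be sketched as follows. The two-sample Kolmogorov–Smirnov statistic and the decision flip rate are swapped-in illustrative measures, not ones the review prescribes, and the 0.5 decision threshold is a hypothetical value.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    # The maximum gap is attained at one of the observed sample points.
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def flip_rate(
    prod_scores: Sequence[float],
    shadow_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> float:
    """Share of customers whose churn flag differs between the two models.

    Scores must be aligned: position i refers to the same customer in both.
    """
    if not prod_scores or len(prod_scores) != len(shadow_scores):
        raise ValueError("score lists must be non-empty and the same length")
    flips = sum(
        (p >= threshold) != (s >= threshold)
        for p, s in zip(prod_scores, shadow_scores)
    )
    return flips / len(prod_scores)
```

The KS statistic summarises how far the overall score distribution has moved; the flip rate shows how many customers the CRM would treat differently, which offline metrics alone do not reveal.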

Anti-Patterns Detected

Label proxy leakage — P0

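The feature "days_since_last_login" is computed after the churn event in the current pipeline, so it encodes the outcome: churned customers have already stopped logging in by the time the value is calculated. This makes it a proxy for the target. It must be recomputed from a snapshot taken 30 days before the churn label cutoff.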
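A point-in-time version of the feature could be computed as in the sketch below. The function signature and the representation of logins as a list of dates are assumptions; only the 30-day snapshot offset comes from the finding above.

```python
from datetime import date, timedelta
from typing import Iterable, Optional

SNAPSHOT_OFFSET_DAYS = 30  # snapshot is taken 30 days before the label cutoff


def days_since_last_login(
    login_dates: Iterable[date],
    label_cutoff: date,
    offset_days: int = SNAPSHOT_OFFSET_DAYS,
) -> Optional[int]:
    """Days between the snapshot date and the last login visible at that date.

    Logins after the snapshot are ignored, so nothing from the label window
    can leak into the feature. Returns None if no login precedes the snapshot.
    """
    snapshot = label_cutoff - timedelta(days=offset_days)
    visible = [d for d in login_dates if d <= snapshot]
    if not visible:
        return None
    return (snapshot - max(visible)).days
```

For example, with a label cutoff of 2023-07-31 the snapshot falls on 2023-07-01, so a login on 2023-07-15 is excluded and a last visible login on 2023-06-21 yields a value of 10.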

Missing eval gate between training and serving — P1

No automated check confirms that a newly trained model meets the minimum thresholds (AUC ≥ 0.82, high-LTV recall ≥ 0.75) before promotion. A degraded model can be promoted silently.
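A minimal eval gate could look like the sketch below. The thresholds are the ones stated in this review; the metric key names, the evaluate_gate function, and the assumption that metrics arrive as a plain dictionary from the training run are hypothetical.

```python
from typing import List, Mapping, Tuple

# Thresholds from the review; the key names are assumptions.
GATE_THRESHOLDS: Mapping[str, float] = {
    "auc": 0.82,
    "high_ltv_recall": 0.75,
}


def evaluate_gate(
    metrics: Mapping[str, float],
    thresholds: Mapping[str, float] = GATE_THRESHOLDS,
) -> Tuple[bool, List[str]]:
    """Return (passed, failures). A missing metric counts as a failure."""
    failures: List[str] = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from scorecard")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below {minimum:.2f}")
    return (not failures, failures)
```

Treating a missing metric as a failure means a broken evaluation step blocks promotion instead of passing silently. The pipeline would call the registry promotion step only when the first element of the result is True.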

No drift monitoring in production — P2

Neither feature distributions nor prediction distributions are monitored after deployment, so the churn model can degrade silently between retraining cycles.
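A PSI check for a single numeric feature could be sketched as follows. The use of ten bins cut at baseline quantiles and the small floor for empty bins are common conventions assumed here, not choices stated in this review; the function names are hypothetical. Only the 0.2 alert threshold comes from the review.

```python
import math
from bisect import bisect_right
from typing import List, Sequence

PSI_ALERT_THRESHOLD = 0.2  # alert threshold from the review


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index of `actual` against the `expected` baseline."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    exp_sorted = sorted(expected)
    # Interior bin edges at baseline quantiles (n_bins - 1 edges).
    edges = [exp_sorted[int(len(exp_sorted) * i / n_bins)] for i in range(1, n_bins)]

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def drift_alert(expected: Sequence[float], actual: Sequence[float]) -> bool:
    """True when PSI exceeds the alert threshold."""
    return psi(expected, actual) > PSI_ALERT_THRESHOLD
```

The same function can be applied to the prediction score distribution, using the scores from the training-time evaluation set as the baseline.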

Next Actions

1. Fix label proxy leakage in feature pipeline (P0 — blocks all eval work)
2. Set up MLflow Model Registry with dev/staging/prod stages
3. Implement eval gate: AUC ≥ 0.82 and high-LTV recall ≥ 0.75 required for promotion
4. Run new model in shadow mode for 2 weeks before canary rollout
5. Add PSI drift monitor on top-5 features; alert threshold PSI > 0.2