TL;DR

  • The hard part of churn prediction is not the model. It is the data. If your event tracking does not capture the behavioral signals that precede churn, no algorithm will save you.
  • Production SaaS churn models typically achieve 0.75-0.85 AUC with 10-15 derived features. You do not need 300 variables. You need the right ones, engineered from behavioral patterns.
  • XGBoost is the practical choice for most B2B SaaS churn models because it handles non-linear relationships, produces feature importance rankings, and is interpretable enough that a CS leader can understand why an account got flagged.
  • SHAP values make your model explainable. They tell you which features drove this specific prediction, which is the difference between "the model says they will churn" and "they will churn because their feature usage dropped 60% in 14 days."
  • Minimum data requirement: 6 months of behavioral event data with labeled churn outcomes. The output is not a dashboard but a weekly at-risk list that your CS team can act on every Monday.

Follow the steps below to build the full pipeline from feature engineering to deployment.

The Data Problem (Before the Code)


Every churn prediction tutorial on the internet starts with import pandas as pd and a clean CSV file. Real SaaS data does not look like that. Before you write a single line of Python, you need 3 things in place, and most teams have none of them.

The cost of skipping this step is high. A model trained on incomplete or mislabeled data will flag the wrong accounts and miss the customers who actually cancel. That erodes trust in the entire system. By the time you realize the model is wrong, you have lost months of training data and the CS team has stopped listening.

The problem is not that teams cannot build churn models. The problem is that they build them on top of event data that does not capture the behaviors that precede cancellation.

Here is what you need before any code:

  1. A behavioral event stream with login events, feature usage events, session events, support ticket events, and billing events. Each event needs a user ID, account ID, timestamp, and event type for reliable analysis.
  2. A churn label that identifies which accounts churned and when. You need a specific definition such as contract non-renewal, explicit cancellation, or downgrade below a revenue threshold.
  3. At least 6 months of data, because shorter windows cannot separate real churn signals from seasonal noise. The behavioral patterns that predict churn at month 1 look different from month 6.

If you do not have these 3 things, you do not need a churn model — you need a tracking plan. Instrument your product, come back in 6 months, then build.
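For reference, the code in this guide assumes an event stream and account table shaped roughly like the sketch below. The column names (account_id, user_id, timestamp, event_type, feature_name, start_date) are placeholders to map onto your own tracking plan, and feature_name is only populated for feature-usage events.

import pandas as pd

# Hypothetical event stream: one row per behavioral event.
# Column names are assumptions; map them to your own schema.
events = pd.DataFrame({
    'account_id': ['ACC-042', 'ACC-042', 'ACC-018'],
    'user_id': ['u_101', 'u_102', 'u_310'],
    'timestamp': pd.to_datetime(['2025-11-03 09:12', '2025-11-04 14:55', '2025-11-05 16:02']),
    'event_type': ['login', 'feature_used', 'support_ticket'],
    'feature_name': [None, 'report_export', None],
})

# Account table with the subscription start date used for tenure features.
accounts = pd.DataFrame({
    'account_id': ['ACC-042', 'ACC-018'],
    'start_date': pd.to_datetime(['2025-06-15', '2024-02-01']),
})

# Churn labels under your chosen definition: 1 = churned, 0 = retained.
churn_labels = {'ACC-042': 0, 'ACC-018': 1}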

"If your product only tracks logins, the churn model will be weak. You need feature usage, engagement velocity, and support interaction events before any algorithm becomes useful."

- Jake McMahon, ProductQuant

Once you have these 3 data sources in place, you can move to feature engineering.

Feature Engineering Through Deployment

The model will only be as good as the features you feed it. The biggest mistake teams make is using raw counts instead of derived metrics.

Raw counts tell you what happened. Derived features tell you what is changing. Change is what predicts churn.

Here are the feature categories that actually predict churn in B2B SaaS, with Python code you can adapt to your own event schema.

Category 1: Frequency Features (Engagement Velocity)

These features capture whether engagement is dropping relative to the user's own baseline. A customer who drops from 15 logins per fortnight to 3 logins per fortnight has a -80% engagement velocity. That is a much stronger churn signal than "they logged in 3 times."

from datetime import datetime, timedelta

def engagement_velocity(events, account_id, window_days=14):
    """Calculate the window-over-window change rate in login events."""
    cutoff = datetime.now() - timedelta(days=window_days * 2)
    recent = events[(events['account_id'] == account_id) &
                    (events['timestamp'] >= cutoff)]

    # Split the lookback into the previous and current windows.
    mid = cutoff + timedelta(days=window_days)
    prev_window = recent[recent['timestamp'] < mid]
    curr_window = recent[recent['timestamp'] >= mid]

    prev_count = prev_window['event_type'].eq('login').sum()
    curr_count = curr_window['event_type'].eq('login').sum()

    if prev_count == 0:
        return 0

    velocity = (curr_count - prev_count) / prev_count
    return velocity

The key insight here is relative change, not absolute count. The velocity metric captures trajectory, which is what matters for churn prediction.

The insight: Track relative change in engagement, not absolute usage counts. Trajectory predicts churn better than volume.

Category 2: Depth Features (Feature Regression)

When customers start churning, they do not just use the product less — they use it differently. A power user who stops using exports, integrations, and automations is quietly testing what life without your product looks like.

def feature_regression(events, account_id, window_days=30):
    """Calculate the share of advanced usage among tracked feature events."""
    cutoff = datetime.now() - timedelta(days=window_days)
    recent = events[(events['account_id'] == account_id) &
                    (events['timestamp'] >= cutoff)]

    advanced_features = {'report_export', 'api_call', 'automation'}
    basic_features = {'dashboard_view', 'settings', 'help_center'}

    # Restrict to tracked feature events so the ratio compares
    # advanced usage against basic, surface-level usage.
    tracked = recent[recent['feature_name'].isin(advanced_features | basic_features)]
    advanced_count = tracked['feature_name'].isin(advanced_features).sum()
    total_count = tracked.shape[0]

    if total_count == 0:
        return 0

    return advanced_count / total_count

This ratio captures a subtle but reliable signal. Customers retreat to surface-level interactions before they cancel, and the feature regression metric quantifies that retreat.

The insight: Monitor the ratio of advanced-to-basic feature usage. A declining ratio signals pre-churn behavior.

Customers do not churn overnight. They disengage feature by feature, session by session, until one day they simply do not come back. The features they stop using first tell you everything.

Category 3: Breadth Features (Feature Contraction)

Is the account contracting, using fewer features with fewer users? A breadth contraction ratio of 0.4 means the account is using 60% fewer features than its historical average. Contraction precedes cancellation.

def breadth_contraction(events, account_id, window_days=14):
    """Compare feature breadth in the current window vs. the historical weekly baseline."""
    cutoff = datetime.now() - timedelta(days=window_days)
    recent = events[(events['account_id'] == account_id) &
                    (events['timestamp'] >= cutoff)]

    current_breadth = recent['feature_name'].nunique()

    historical = events[(events['account_id'] == account_id) &
                        (events['timestamp'] < cutoff)].copy()
    if historical.empty:
        return 0

    # Group by calendar week (year-aware) to get the typical weekly breadth.
    historical['week'] = historical['timestamp'].dt.to_period('W')
    weekly_breadth = historical.groupby('week')['feature_name'].nunique()
    avg_historical_breadth = weekly_breadth.mean()

    if avg_historical_breadth == 0:
        return 0

    return current_breadth / avg_historical_breadth

The contraction ratio is a leading indicator. When it drops below 0.5, the account is using less than half its normal feature set, which typically precedes a cancellation by 2-4 weeks.

The insight: Feature contraction ratio below 0.5 is a leading indicator of cancellation within 2-4 weeks.

An account using half its normal feature set is not having a bad week. It is quietly testing what life without your product looks like.

Category 4: Tenure-Adjusted Risk

Tenure is consistently the strongest predictor of churn. The highest risk window is the first 3-6 months, and the probability of churn declines as tenure increases. This feature captures that curve directly.

def tenure_risk(account_start_date, current_date):
    """Calculate tenure-based risk multiplier."""
    tenure_months = (current_date - account_start_date).days / 30.0

    if tenure_months <= 1:
        return 1.0
    elif tenure_months <= 3:
        return 0.8
    elif tenure_months <= 6:
        return 0.5
    elif tenure_months <= 12:
        return 0.3
    else:
        return 0.15

New accounts in their first month carry a risk multiplier of 1.0. Established customers past 12 months drop to 0.15. This single feature captures the entire lifecycle risk curve.

The insight: Tenure-adjusted risk captures the lifecycle curve in one feature. New accounts carry the highest churn risk.

Assembling the Feature Matrix

Once you have defined your features, you assemble them into a training dataset. The target is 10-15 high-signal derived features. Not 50. Not 300. Each feature captures a specific behavioral pattern that correlates with churn.

def build_feature_matrix(events, accounts, churn_labels):
    rows = []

    for account_id in accounts['account_id'].unique():
        acct_events = events[events['account_id'] == account_id]
        row = {
            'account_id': account_id,
            'engagement_velocity': engagement_velocity(acct_events, account_id),
            'feature_regression': feature_regression(acct_events, account_id),
            'breadth_contraction': breadth_contraction(acct_events, account_id),
            'tenure_risk': tenure_risk(
                accounts.loc[accounts['account_id'] == account_id, 'start_date'].iloc[0],
                datetime.now()),
            'label': churn_labels.get(account_id, 0),
        }
        rows.append(row)
    return pd.DataFrame(rows)

This matrix is your training data. Each row is one account at one point in time, with derived behavioral features and a binary churn label. If you are unsure whether to build this pipeline in-house or buy an off-the-shelf solution, our build vs buy analysis for churn prediction breaks down the tradeoffs.

The insight: Train on 10-15 high-signal derived features, not raw counts. Each feature should capture a specific behavioral pattern.

Figure: Behavioral feature engineering vectors, converting raw events into predictive vectors of recency, depth, and intensity.
Related Offer

Churn Prediction Engagement

We build the full pipeline from event tracking QA through weekly at-risk list deployment. Fixed scope, fixed price, production-ready code.

Training the Model with XGBoost

XGBoost is the workhorse for SaaS churn prediction. It handles non-linear relationships and noisy data, produces feature importance rankings, and is interpretable enough that a CS leader can understand why an account got flagged. Logistic regression works as a baseline. Random forests handle messy data. But for production churn prediction with 10-15 features, gradient boosting hits the sweet spot of accuracy, speed, and explainability.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Features and labels from the matrix built above; drop the identifier column.
feature_matrix = build_feature_matrix(events, accounts, churn_labels)
X = feature_matrix.drop(columns=['account_id', 'label'])
y = feature_matrix['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    # Weight the minority (churn) class to offset class imbalance.
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
    random_state=42)

model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f"AUC: {auc:.3f}")

The expected AUC for production SaaS churn models is 0.75-0.85. In one engagement, we achieved 0.82 AUC with 25+ behavioral features.

The insight: XGBoost hits the sweet spot of accuracy, speed, and explainability for production churn prediction with 10-15 features.

Explaining Predictions with SHAP

A churn prediction without an explanation is a number that nobody trusts. SHAP values tell you which features drove each specific prediction. This is the difference between "the model says they will churn" and "they will churn because their engagement velocity dropped 80%." The CS team can act on the second one.

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global view: which features matter most across all predictions.
shap.summary_plot(shap_values, X_test, feature_names=X.columns)

# Local view: why one specific account got flagged.
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0],
                matplotlib=True)

# Example individual prediction:
# "ACC-042 churn probability: 0.73"
#   +0.18 ← engagement_velocity (-80%)
#   +0.12 ← breadth_contraction (0.35)
#   +0.08 ← feature_regression (0.05)

The summary plot shows which features matter most across all predictions. The individual force plot explains why a specific account got flagged. This is what separates a model your CS team trusts from one they ignore.

The insight: SHAP values turn opaque predictions into actionable explanations. CS teams act on explanations, not raw scores.

Deploying the Weekly At-Risk List

The model runs weekly. Every Monday morning, it produces a prioritized list rather than a health score dashboard. The CS team gets this in their inbox, Slack, or CRM. Each account comes with the specific signal that drove the flag.

def generate_weekly_risk_list(model, current_features, threshold=0.5):
    """Score current accounts and return a prioritized weekly at-risk list."""
    # Assumes current_features carries 'account_id' and a precomputed
    # 'primary_signal' label alongside the model's feature columns.
    scored = current_features.copy()
    feature_cols = [c for c in scored.columns
                    if c not in ('account_id', 'primary_signal')]
    scored['churn_probability'] = model.predict_proba(scored[feature_cols])[:, 1]
    at_risk = scored[scored['churn_probability'] >= threshold]
    at_risk = at_risk.sort_values('churn_probability', ascending=False)
    return at_risk[['account_id', 'churn_probability', 'primary_signal']]

Account    Churn Probability    Primary Signal
ACC-042    0.73                 Engagement velocity dropped 80%
ACC-018    0.61                 Feature breadth contracted to 35%
ACC-095    0.58                 3 unresolved support tickets (14+ days)

Each row gives your CS team a specific account to review and a clear reason for the flag. If you want to understand the behavioral events that feed into these signals, our guide on event prerequisites for churn prediction covers the instrumentation details.
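The primary_signal column is not produced by the model itself. One way to populate it, sketched below under the assumption that you keep a human-readable label per feature, is to take each account's largest positive SHAP contributor; the SIGNAL_LABELS mapping and the primary_signals helper are illustrative names, not part of any library.

import numpy as np

# Hypothetical mapping from model feature to a CS-readable label.
SIGNAL_LABELS = {
    'engagement_velocity': 'Engagement velocity dropped',
    'breadth_contraction': 'Feature breadth contracted',
    'feature_regression': 'Advanced feature usage declined',
    'tenure_risk': 'Early-tenure account',
}

def primary_signals(explainer, feature_df):
    """Label each account with its largest positive SHAP contributor."""
    shap_vals = explainer.shap_values(feature_df)
    top_idx = np.argmax(shap_vals, axis=1)
    return [SIGNAL_LABELS.get(f, f) for f in feature_df.columns[top_idx]]

Attach the result as the primary_signal column on the scored accounts before generating the weekly list.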

The insight: Deploy a prioritized weekly at-risk list with specific signals, not a health score dashboard. CS teams act on lists, not dashboards.

Closing the Loop: Monthly Retraining

The model is not done when it ships. It improves every month as intervention outcomes become training data.

Month 1, the model flags accounts. Month 2, CS interventions produce outcomes. Month 3, the model retrains on those outcomes and gets better. By month 6, the model is significantly more accurate than the v1 model because it has learned from 20+ weeks of real intervention data.

This compounding loop is what separates production churn models from proof-of-concept notebooks. The model gets smarter every month because it learns from the actual results of CS interventions, not just from historical labels.
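The data side of that loop can be as simple as appending last month's scored accounts, now labeled with what actually happened, to the training set. A minimal sketch, assuming you keep each month's feature snapshot and later join the observed outcomes back on by account:

import pandas as pd

def extend_training_data(training_df, monthly_snapshot, observed_outcomes):
    """Append last month's scored accounts, labeled with their real outcomes."""
    # observed_outcomes is assumed to hold 'account_id' and a 0/1 'label'.
    labeled = monthly_snapshot.merge(observed_outcomes, on='account_id', how='inner')
    return pd.concat([training_df, labeled], ignore_index=True)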

The insight: Monthly retraining with intervention outcomes compounds model accuracy over time. The model learns from real results, not just historical labels.

What Most Tutorials Do Not Tell You

The Python code above is the easy part. The hard parts are the decisions that no tutorial covers: how you define churn, how you handle class imbalance, and how you ensure data quality before training.

The Labeling Problem

How do you define "churn"? It sounds simple until you look at your data.

Contract non-renewal means one thing, but what about customers who downgraded from Enterprise to Starter? Explicit cancellation is clean, but what about customers who stopped paying and never clicked cancel? Usage-based churn with no login in 30 days catches seasonal businesses that may still be healthy.

Your churn definition determines your model's behavior. Pick the definition that matches your revenue reality. If downgrades cost you as much as cancellations, treat them as churn.
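Here is how that choice might look in code. A hedged sketch: the field names (status, previous_mrr, current_mrr) and the 50% revenue-drop threshold are assumptions to adapt to your own billing data.

def label_churn(subscription, downgrade_threshold=0.5):
    """Label an account as churned under a revenue-based definition."""
    cancelled = subscription['status'] in ('cancelled', 'non_renewed')
    # Treat a downgrade below the threshold share of prior MRR as churn.
    revenue_drop = (subscription['previous_mrr'] > 0 and
                    subscription['current_mrr'] / subscription['previous_mrr']
                    < downgrade_threshold)
    return int(cancelled or revenue_drop)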

The insight: Match your churn definition to your revenue reality. Downgrades that cost as much as cancellations should be treated as churn.

The Class Imbalance Problem

In B2B SaaS, monthly churn is typically 3%-5%. That means 95-97% of your training data is "did not churn." A naive model that predicts "no churn" for everyone achieves 95% accuracy and catches zero churners.

The scale_pos_weight parameter in XGBoost handles this by weighting the minority class. You also need to optimize for recall over precision — catching more potential churners, even if some are false alarms, is cheaper than missing churners entirely.
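One way to put that into practice, using y_test and y_pred_proba from the training step above, is to pick the scoring threshold from the precision-recall curve rather than defaulting to 0.5. The 80% minimum-recall target is an assumption to agree on with your CS team.

from sklearn.metrics import precision_recall_curve

# Choose the highest threshold that still catches the target share of churners.
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)
min_recall = 0.80  # assumed target
viable = [t for _, r, t in zip(precision, recall, thresholds) if r >= min_recall]
threshold = max(viable) if viable else 0.5
print(f"Chosen threshold: {threshold:.2f}")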

The insight: Optimize for recall over precision. Missing churners costs more than false alarms.

A model that predicts nobody will churn and achieves 95% accuracy is not a model. It is a mirror reflecting the class imbalance problem back at you.

The Data Quality Problem

If your event tracking fires inconsistently, your model will learn noise. Before you train, run a tracking QA check to find duplicates and missing timestamps. If either issue is present, clean your data before training because these issues will bias your model toward incorrect feature weights.

# Quick tracking QA: check for duplicate events and missing timestamps
duplicates = events.duplicated(
    subset=['account_id', 'event_type', 'timestamp'], keep=False)
print(f"Duplicate events: {duplicates.sum()}")

missing_ts = events['timestamp'].isna().sum()
print(f"Missing timestamps: {missing_ts}")

if duplicates.sum() > 0 or missing_ts > 0:
    print("WARNING: Clean your data before training.")

This quick check takes less than a minute and can save you hours of debugging a model that learned from corrupted data. The behavioral patterns you should be looking for are feature usage decline, engagement velocity drop, and support ticket escalation.

The insight: Run tracking QA before training. Duplicate events and missing timestamps bias the model toward incorrect feature weights.

Garbage in, garbage out is not just a saying. It is the most common reason churn models fail in production. Clean data first, train second.
0.75-0.85 AUC: the expected accuracy range for production B2B SaaS churn models with 10-15 well-engineered features. If your model is below 0.70, the issue is feature engineering, not algorithm choice.


When Not to Build a Churn Model

Not every SaaS company needs a custom churn prediction model. You should wait if you have less than 6 months of behavioral event data, your monthly churn is below 2%, you have fewer than 100 customers, or you do not have a CS team to act on predictions.

At that scale, the cost of building and maintaining the model likely exceeds the revenue at risk.

Consider the opportunity cost. Every week your engineering team spends on model development is a week they are not building product features that might reduce churn directly. If your product is still finding product-market fit, a churn model will only tell you that users are leaving because the product does not solve their problem yet. That is not something a model can fix.

Focus on product-market fit and event instrumentation first. The model becomes valuable once you have enough data to learn from and a team to act on its output.

The insight: Wait to build until you have 6+ months of data, 100+ customers, and a CS team to act on predictions.

What to Do Instead

If you are not ready to build a full churn prediction model, there are intermediate steps that still move the needle on retention.

  • Instrument behavioral event tracking first. Before any model, you need clean event data that captures feature usage, engagement velocity, and support interactions for reliable analysis.
  • Build a simple rules-based at-risk list. Accounts with no login in 14 days and 3+ unresolved support tickets will catch a large portion of churners without any machine learning involved (see the sketch after this list).
  • Focus on the churn intervention playbook before the prediction model. If your CS team does not have a process for acting on at-risk accounts, the model output is wasted effort.
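
A minimal sketch of such a rules-based list, assuming the event schema from earlier plus a status field on support ticket events (an assumption; adapt to however you track open tickets):

from datetime import datetime, timedelta

def rules_based_at_risk(events, window_days=14, min_open_tickets=3):
    """Flag accounts with no recent logins and several open support tickets."""
    cutoff = datetime.now() - timedelta(days=window_days)

    logged_in = set(events.loc[(events['event_type'] == 'login') &
                               (events['timestamp'] >= cutoff), 'account_id'])

    # 'status' on support_ticket events is an assumed field in your schema.
    open_tickets = events[(events['event_type'] == 'support_ticket') &
                          (events['status'] == 'open')]
    ticket_counts = open_tickets.groupby('account_id').size()

    return [acct for acct in events['account_id'].unique()
            if acct not in logged_in and ticket_counts.get(acct, 0) >= min_open_tickets]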

The sequence matters. Instrument the product, define churn, build rules-based alerts, train the model, then close the intervention loop. Skipping steps costs more time than doing them in order.

FAQ

What AUC should I expect from a SaaS churn model?

0.75-0.85 is the typical range for production B2B SaaS models. In one engagement, we achieved 0.82 AUC with 25+ behavioral features.

If your model is below 0.70, the issue is usually feature engineering, not algorithm choice. Academic studies have achieved above 90% on controlled datasets, but those use cleaner data than real SaaS products produce. Your model should be evaluated on out-of-sample data from a different time period than your training set to avoid overfitting to historical patterns.
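One way to enforce that out-of-time evaluation is to split by snapshot date instead of at random. This sketch assumes each row of the feature matrix carries a snapshot_date column (not part of the build_feature_matrix example above) and an arbitrary boundary date:

import pandas as pd

# Train on older snapshots, evaluate on a later period.
split_date = pd.Timestamp('2026-01-01')  # assumed boundary; pick one for your data
feature_cols = ['engagement_velocity', 'feature_regression',
                'breadth_contraction', 'tenure_risk']
train_mask = feature_matrix['snapshot_date'] < split_date

X_train = feature_matrix.loc[train_mask, feature_cols]
y_train = feature_matrix.loc[train_mask, 'label']
X_test = feature_matrix.loc[~train_mask, feature_cols]
y_test = feature_matrix.loc[~train_mask, 'label']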

The insight: Expect 0.75-0.85 AUC in production. Below 0.70 means the issue is feature engineering, not algorithm choice.

Figure: SHAP impact (local model interpretability), explaining why an individual user is at risk through each feature's contribution to churn probability.

How often should I retrain the model?

Monthly retraining is the sweet spot. Retrain too frequently and the model overfits to recent noise. Retrain too infrequently and the model drifts as user behavior patterns shift.

Monthly gives you enough new intervention outcomes to learn from without overreacting to weekly fluctuations. Set up an automated pipeline that retrains, evaluates against the holdout set, and deploys only if the new model improves on the current AUC.
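A hedged sketch of that deployment gate; the function name and arguments are illustrative, and it assumes you persist the current model alongside its holdout AUC:

import xgboost as xgb
from sklearn.metrics import roc_auc_score

def retrain_and_maybe_deploy(current_model, current_auc, X_new, y_new,
                             X_holdout, y_holdout):
    """Retrain on the latest data and deploy only if holdout AUC improves."""
    candidate = xgb.XGBClassifier(
        n_estimators=100, max_depth=4, learning_rate=0.1, random_state=42)
    candidate.fit(X_new, y_new)

    candidate_auc = roc_auc_score(y_holdout,
                                  candidate.predict_proba(X_holdout)[:, 1])
    if candidate_auc > current_auc:
        return candidate, candidate_auc   # promote the new model
    return current_model, current_auc     # keep the current model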

The insight: Monthly retraining balances learning from new outcomes without overreacting to weekly noise.

Should I use deep learning for churn prediction?

Almost certainly not. For 10-15 features on B2B SaaS data, XGBoost will match or beat a neural network while being faster to train, easier to interpret, and more forgiving of the messy data that real SaaS products produce.

Deep learning requires large datasets, significant compute resources, and produces models that are nearly impossible to explain to stakeholders. Save it for when you are processing millions of unstructured records with complex non-linear patterns that gradient boosting cannot capture.

The insight: XGBoost matches or beats neural networks on B2B SaaS churn while being faster, easier to interpret, and more forgiving of messy data.

How do I handle customers who churn and come back?

Treat reactivated customers as new accounts with a tenure reset. Their pre-reactivation behavior is not predictive of post-reactivation churn. What matters is their behavior after reactivation and whether they re-engage at a level that predicts sustained retention.

The insight: Reset tenure for reactivated customers. Pre-reactivation behavior does not predict post-reactivation churn.

What data do I need before building a churn model?

You need a behavioral event stream, a churn label, and at least 6 months of historical data. The key is event instrumentation. If your product only tracks logins, the model will be weak. You need feature usage, engagement velocity, and support interaction events.

The insight: Event instrumentation matters more than algorithm choice. Without feature usage, velocity, and support events, the model will be weak.


About the Author

Jake McMahon builds growth infrastructure for B2B SaaS companies through analytics, experimentation, and predictive modeling that turns product data into revenue decisions. He has built churn prediction models at 0.82 AUC and designed intervention playbooks that save accounts before they cancel. His approach combines behavioral event engineering with gradient boosting to produce models that CS teams actually trust and act on every week.

Next Step

Build a Churn Model Your CS Team Will Actually Use

We design, build, and deploy production churn prediction systems from event tracking QA through weekly at-risk list deployment. Fixed scope, fixed price.