Big Data · Survival Analysis · PySpark

Survival Analysis for Customer Churn with PySpark

Yihang Hu  ·  April 2026  ·  SUSTech
7,043 Customers (3,351 analyzed)
46.4% Churn Rate (analyzed subset)
56 mo Median Survival
0.64 C-index

Overview

Survival analysis is a branch of statistics focused on modeling the time until an event occurs. Originally developed for clinical settings to study time-to-death, it has since been applied broadly to customer retention, equipment failure, and subscription churn.

In this project, I apply four survival analysis methods to the IBM Telco Customer Churn dataset using PySpark and the lifelines library, ultimately building a Customer Lifetime Value (CLV) model.

Dataset: IBM Telco Customer Churn (7,043 records, 21 features), filtered to month-to-month internet subscribers (3,351 records, 46.4% churn rate) for this analysis. Source: databricks-industry-solutions/survival-analysis.

Data Preprocessing

The raw dataset was processed into a Silver table by filtering to month-to-month contracts and internet subscribers, converting the churn column to a binary integer, and cleaning inconsistent categorical values like "No internet service".

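from pyspark.sql.functions import col, when

silver_df = (
  bronze_df
  # Encode churn as a binary event indicator: 1 = churned, 0 = still active
  .withColumn('churn', when(col('churnString') == 'Yes', 1).otherwise(0))
  # Keep month-to-month contracts that include an internet subscription
  .filter(col('contract') == 'Month-to-month')
  .filter(col('internetService') != 'No')
)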

Metric                    Value
Total records             3,351
Churned customers         1,556 (46.4%)
Median tenure             13 months
Mean monthly charges      $73.59
Fiber optic churn rate    54.6%
DSL churn rate            32.2%

Method 1 — Kaplan-Meier

The Kaplan-Meier estimator is a non-parametric method that computes the survival function without assuming any underlying distribution. It handles right-censored data naturally — customers still active at observation end are counted as censored.

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(silver_pd['tenure'], event_observed=silver_pd['churn'])
kmf.plot_survival_function()
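
For intuition about what the fitter computes, the product-limit estimate can be written in a few lines of plain Python. This is a minimal sketch on made-up toy data, not the lifelines implementation; the function names `kaplan_meier` and `median_survival` and the toy values are illustrative assumptions.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where a churn event occurred."""
    churned = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving = Counter(durations)  # everyone who leaves the risk set at t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = churned.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the share of at-risk customers who stayed
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Churned and censored customers both drop out of the risk set after t
        at_risk -= leaving[t]
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5


# Toy data (hypothetical): tenure in months, churn = 1 if the customer left
tenure = [1, 2, 2, 3, 5, 8, 8, 12]
churn = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(tenure, churn)
```

Censored customers never reduce the survival estimate directly; they only shrink the risk set for later time points, which is how the estimator uses partial information from customers who are still active.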

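The overall median survival time is 56 months, well above the 13-month median tenure because the estimator treats the tenure of a still-active customer as a lower bound on their lifetime rather than a completed one. Log-rank tests showed that techSupport, onlineSecurity, and dependents have highly significant effects on survival (p < 0.01), while gender shows no significant difference.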
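
The log-rank comparison behind those p-values can be sketched without any library. The version below handles two groups only, uses the standard hypergeometric variance, and converts the chi-square statistic (1 degree of freedom) to a p-value with `math.erfc`. The function name `logrank_test` and the toy tenures are assumptions for illustration; the project itself presumably relied on a library routine.

```python
import math


def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(durations_a, events_a) if e == 1}
        | {t for t, e in zip(durations_b, events_b) if e == 1}
    )
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in durations_a if d >= t)  # still at risk in group A
        n_b = sum(1 for d in durations_b if d >= t)  # still at risk in group B
        d_a = sum(1 for d, e in zip(durations_a, events_a) if d == t and e == 1)
        d_b = sum(1 for d, e in zip(durations_b, events_b) if d == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected churn in A if both groups share one hazard
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy groups (hypothetical): customers without support churn early, with support late
no_support = [1, 2, 3, 4, 5]
with_support = [6, 7, 8, 9, 10]
chi2, p_value = logrank_test(no_support, [1] * 5, with_support, [1] * 5)
```

A small p-value says the two survival curves differ by more than chance would explain; it does not by itself say how large the difference is, which is what the hazard ratios in the next section quantify.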

Method 2 — Cox Proportional Hazards

The Cox PH model is a semi-parametric approach that quantifies the effect of each covariate via a hazard ratio (HR), holding the other covariates fixed. An HR > 1 indicates increased churn risk; an HR < 1 indicates reduced churn risk.

Covariate                Hazard Ratio    Interpretation
onlineBackup_Yes         0.456           54% lower churn risk
techSupport_Yes          0.529           47% lower churn risk
dependents_Yes           0.726           27% lower churn risk
internetService_Fiber    1.217           22% higher churn risk
paperlessBilling_Yes     1.157           16% higher churn risk
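
The interpretation column follows from the hazard ratios by simple arithmetic: the percentage change in hazard is (HR - 1) x 100. The sketch below reproduces that conversion and also multiplies ratios for a customer profile, which is valid only under the assumption that the model has no interaction terms. The helper names are hypothetical.

```python
# Hazard ratios as reported in the table above
hazard_ratios = {
    "onlineBackup_Yes": 0.456,
    "techSupport_Yes": 0.529,
    "dependents_Yes": 0.726,
    "internetService_Fiber": 1.217,
    "paperlessBilling_Yes": 1.157,
}


def percent_change_in_hazard(hr):
    """Negative values mean lower churn risk, positive values mean higher."""
    return (hr - 1.0) * 100.0


def combined_hazard_ratio(active_covariates):
    """Multiply the ratios of all covariates that apply to one profile.

    Assumes a Cox model without interaction terms, so effects are
    multiplicative on the hazard scale.
    """
    combined = 1.0
    for name in active_covariates:
        combined *= hazard_ratios[name]
    return combined


for name, hr in hazard_ratios.items():
    print(f"{name}: {percent_change_in_hazard(hr):+.0f}% hazard")

fiber_paperless = combined_hazard_ratio(
    ["internetService_Fiber", "paperlessBilling_Yes"]
)
```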

The model achieved a C-index of 0.6425, indicating modest discriminative ability: it ranks high-risk and low-risk customers better than chance (0.5), but well short of a perfect ordering (1.0).
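
To show what the C-index measures, here is a simplified pure-Python version of Harrell's concordance index. It counts a pair as comparable only when the customer with the shorter tenure actually churned, and it ignores pairs with tied tenures, so it may differ slightly from a library implementation. The function name and toy data are assumptions.

```python
def concordance_index(durations, events, risk_scores):
    """Share of comparable pairs in which the earlier churner has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # Comparable: i was observed to churn strictly before j's last known time
            if events[i] == 1 and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable


# Toy data (hypothetical): the third customer is censored at 6 months
durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]

perfect = concordance_index(durations, events, [4.0, 3.0, 2.0, 1.0])
reversed_order = concordance_index(durations, events, [1.0, 2.0, 3.0, 4.0])
uninformative = concordance_index(durations, events, [1.0, 1.0, 1.0, 1.0])
```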

Method 3 — AFT Weibull (Spark MLlib)

The Accelerated Failure Time model parameterizes how covariates stretch or compress the time axis. I implemented this using Spark MLlib's AFTSurvivalRegression with a full feature pipeline for distributed training.

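from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml import Pipeline

aft = AFTSurvivalRegression(
    featuresCol='features',
    labelCol='tenure',         # duration in months; AFT models log-time, so it must be positive
    censorCol='churn_censor',  # Spark convention: 1 = event observed (churned), 0 = censored
    maxIter=200
)
# indexers, encoder, and assembler are feature stages defined earlier;
# if indexers is a list of stages, unpack it with *indexers
pipeline = Pipeline(stages=[indexers, encoder, assembler, aft])
model = pipeline.fit(train_df)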
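
As background on what the fitted model represents: in a Weibull AFT model the linear predictor sets the Weibull scale, so a covariate multiplies every survival quantile by the exponential of its coefficient. The sketch below uses the parameterization I understand Spark to use (scale = exp(intercept + coefficients · features), shape = 1 / sigma); the coefficient values are invented for illustration and are not results from this project.

```python
import math


def weibull_aft_survival(t, linear_predictor, sigma):
    """S(t | x) for a Weibull AFT model with scale exp(linear_predictor)."""
    lam = math.exp(linear_predictor)  # Weibull scale parameter
    k = 1.0 / sigma                   # Weibull shape parameter
    return math.exp(-((t / lam) ** k))


def weibull_aft_quantile(p, linear_predictor, sigma):
    """Time by which a fraction p of customers is expected to have churned."""
    lam = math.exp(linear_predictor)
    return lam * (-math.log(1.0 - p)) ** sigma


# Hypothetical fitted values, chosen only to illustrate the mechanics
intercept, beta_support, sigma = 3.0, 0.7, 0.9

median_without = weibull_aft_quantile(0.5, intercept, sigma)
median_with = weibull_aft_quantile(0.5, intercept + beta_support, sigma)
acceleration_factor = median_with / median_without  # equals exp(beta_support)
```

Because the covariate rescales time itself, the same factor applies to the median, the 90th percentile, and every other quantile, which is what "stretching the time axis" means in practice.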

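Test-set RMSE is 81.46 months, and R² is negative, meaning the predictions fit worse than simply predicting the mean tenure for every customer. This suggests the Weibull distribution is a poor match for this dataset's survival pattern. The value of this method here lies mainly in the engineering: the whole pipeline trains in a distributed fashion on Spark and carries over to datasets too large for a single machine.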
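
A negative R² can look like a bug, so it helps to see how it arises. The sketch below computes RMSE and R² from their definitions on invented numbers; whenever the squared prediction error exceeds the variance of the target around its mean, R² drops below zero. The values are illustrative and unrelated to the project's test set.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; zero for a mean-only predictor, negative if worse."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed tenures and badly overestimated predictions (months)
observed = [5.0, 10.0, 20.0, 40.0]
predicted = [60.0, 80.0, 100.0, 120.0]

mean_tenure = sum(observed) / len(observed)
baseline_r2 = r_squared(observed, [mean_tenure] * len(observed))
model_r2 = r_squared(observed, predicted)
model_rmse = rmse(observed, predicted)
```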

Method 4 — Log-Logistic AFT

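The Log-Logistic AFT model was validated using a log-odds plot. Parallel, roughly linear curves supported the proportional odds assumption. Key acceleration factors: onlineSecurity (AF = 2.55), techSupport (AF = 2.10), partner (AF = 2.04). An AF above 1 stretches the expected time to churn; for example, AF = 2.55 means customers with online security are estimated to stay about 2.55 times as long as those without, other covariates held equal.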
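
The reason the log-odds plot works as a diagnostic can be checked numerically. With the log-logistic survival function S(t) = 1 / (1 + (t / alpha)^beta), the log-odds of survival are linear in log t, and an acceleration factor shifts the line vertically by beta x log(AF) without changing its slope. The alpha and beta values below are hypothetical; only the AF of 2.55 comes from the text.

```python
import math


def loglogistic_survival(t, alpha, beta):
    """S(t) for a log-logistic distribution with scale alpha and shape beta."""
    return 1.0 / (1.0 + (t / alpha) ** beta)


def log_odds_of_survival(t, alpha, beta):
    s = loglogistic_survival(t, alpha, beta)
    return math.log(s / (1.0 - s))


alpha, beta = 20.0, 1.5   # hypothetical baseline parameters
af = 2.55                 # acceleration factor reported for onlineSecurity

times = [1, 3, 6, 12, 24, 48]
baseline = [log_odds_of_survival(t, alpha, beta) for t in times]
# Under AFT, the covariate rescales time, which is the same as scaling alpha by AF
with_security = [log_odds_of_survival(t, alpha * af, beta) for t in times]

# The vertical gap between the two curves is the same at every time point
gaps = [w - b for w, b in zip(with_security, baseline)]
```

Constant gaps are what "parallel curves" means on the plot. The median also scales directly, since S(alpha) = 0.5 and the covariate multiplies alpha by AF.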

Customer Lifetime Value (CLV)

Using Cox-predicted monthly survival probabilities and a 10% annual discount rate, I computed 36-month NPV for two customer archetypes:

Customer Type    Profile                                   CLV Trend
High Risk        Fiber + Paperless, $75/mo                 Rapid early decline
Low Risk         Dependents + Backup + Support, $45/mo     Sustained long-term

Despite higher monthly revenue, high-risk customers yield lower cumulative CLV due to early churn. The 12-month CLV serves as a rational cap for customer acquisition cost.
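
The CLV calculation can be sketched as a discounted sum of expected monthly revenue. Several pieces here are assumptions rather than the project's actual inputs: the baseline survival curve is an invented constant-hazard curve, the annual rate is converted to a monthly rate by compounding, and each profile's hazard ratio is the product of the individual ratios reported earlier (which presumes no interaction terms). The function names are hypothetical.

```python
import math


def clv(monthly_revenue, survival_probs, annual_discount_rate=0.10):
    """Net present value of expected revenue; survival_probs[0] is month 1."""
    monthly_rate = (1.0 + annual_discount_rate) ** (1.0 / 12.0) - 1.0
    return sum(
        monthly_revenue * s / (1.0 + monthly_rate) ** month
        for month, s in enumerate(survival_probs, start=1)
    )


def cox_survival_curve(baseline_survival, hazard_ratio):
    """Cox PH relation: S(t | x) = S0(t) ** HR."""
    return [s0 ** hazard_ratio for s0 in baseline_survival]


# Hypothetical baseline: constant 5% monthly hazard over 36 months
baseline = [math.exp(-0.05 * m) for m in range(1, 37)]

hr_high = 1.217 * 1.157         # Fiber + Paperless
hr_low = 0.726 * 0.456 * 0.529  # Dependents + Backup + Support

clv_high = clv(75.0, cox_survival_curve(baseline, hr_high))
clv_low = clv(45.0, cox_survival_curve(baseline, hr_low))
clv_high_12 = clv(75.0, cox_survival_curve(baseline[:12], hr_high))
```

The comparison between the two profiles depends on the baseline curve, so these numbers are illustrative only; with the real Cox baseline the same two functions would yield the project's figures. Truncating the survival list to 12 entries gives the 12-month value used as an acquisition-cost cap.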

Model Comparison

Method                 Type                Key Metric         Best Use
Kaplan-Meier           Non-parametric      Median = 56 mo     Exploratory analysis
Cox PH                 Semi-parametric     C-index = 0.64     Interpretation & CLV
AFT Weibull (Spark)    Fully parametric    RMSE = 81.46 mo    Distributed prediction
Log-Logistic AFT       Fully parametric    AIC = 13,812       Non-PH scenarios

Key Takeaways

Online backup and tech support are the strongest retention levers in the Cox model, associated with 54% and 47% lower churn risk respectively. Fiber optic customers churn at 54.6% versus 32.2% for DSL, and their higher monthly charges do not make up for the earlier churn, so their cumulative CLV ends up lower.

The most actionable finding: proactively offering tech support and online backup packages to new fiber customers in their first 3 months could shift the survival curve upward, improving both retention and lifetime value.

Code: The full PySpark implementation is available at github.com/YihangHu2004