Survival analysis is a branch of statistics focused on modeling the time until an event occurs. Originally developed for clinical settings to study time-to-death, it has since been applied broadly to customer retention, equipment failure, and subscription churn.
In this project, I apply four survival analysis methods to the IBM Telco Customer Churn dataset using PySpark and the lifelines library, ultimately building a Customer Lifetime Value (CLV) model.
The raw dataset was processed into a Silver table by filtering to month-to-month contracts
and internet subscribers, converting the churn column to a binary integer, and cleaning
inconsistent categorical values like "No internet service".
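```python
from pyspark.sql.functions import col, when

silver_df = (
    bronze_df
    # Encode churn as a binary integer: 1 = churned, 0 = still active
    .withColumn('churn', when(col('churnString') == 'Yes', 1).otherwise(0))
    # Keep month-to-month contracts with an internet subscription
    .filter(col('contract') == 'Month-to-month')
    .filter(col('internetService') != 'No')
)
```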
| Metric | Value |
|---|---|
| Total records | 3,351 |
| Churned customers | 1,556 (46.4%) |
| Median tenure | 13 months |
| Mean monthly charges | $73.59 |
| Fiber optic churn rate | 54.6% |
| DSL churn rate | 32.2% |
The Kaplan-Meier estimator is a non-parametric method that computes the survival function without assuming any underlying distribution. It handles right-censored data naturally: customers still active at the end of the observation window are treated as censored rather than dropped.
```python
from lifelines import KaplanMeierFitter

# silver_pd is the Silver table as a pandas DataFrame, e.g. silver_df.toPandas()
kmf = KaplanMeierFitter()
kmf.fit(silver_pd['tenure'], event_observed=silver_pd['churn'])
kmf.plot_survival_function()
```
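To show what the estimator computes, here is a minimal pure-Python sketch of the product-limit calculation on a toy dataset. The function name `kaplan_meier` and the eight tenure values are made up for illustration; they are not part of lifelines or the Telco data.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator)."""
    churned = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # churned + censored customers leaving at time t
    at_risk = len(durations)
    survival, curve = 1.0, []
    for t in sorted(exits):
        d = churned.get(t, 0)
        if d:
            # Multiply by the share of at-risk customers who did not churn at t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: tenure in months, 1 = churned, 0 = still active (censored)
tenure = [2, 3, 3, 5, 8, 8, 12, 12]
churn = [1, 1, 0, 1, 0, 1, 0, 0]

km_curve = kaplan_meier(tenure, churn)
for t, s in km_curve:
    print(f"month {t}: S(t) = {s:.3f}")
```

Censored customers leave the risk set without pulling the curve down, which is why the estimate stays unbiased when many customers are still active.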
The overall median survival time is 56 months. Log-rank tests revealed that techSupport, onlineSecurity, and dependents have highly significant effects on survival (p < 0.01), while gender shows no significant difference.
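The log-rank comparison can be illustrated with a hand-rolled two-group test. This is a sketch of the standard statistic, not the lifelines implementation; the function name `logrank_two_groups` and both toy groups are hypothetical.

```python
import math

def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 df."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})
    obs_minus_exp, variance = 0.0, 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in pooled if u >= t)               # at risk overall
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)  # at risk in group A
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)    # events at t
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    stat = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square tail probability, 1 df
    return stat, p_value

# Hypothetical toy groups: tenure in months, 1 = churned, 0 = censored
support_t, support_e = [10, 14, 18, 22, 26, 30], [1, 0, 1, 0, 1, 0]
no_support_t, no_support_e = [1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1]

stat, p_value = logrank_two_groups(support_t, support_e, no_support_t, no_support_e)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

The test compares observed churn events in one group against the number expected if both groups shared a single survival curve, so a small p-value signals that the curves differ.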
The Cox proportional hazards (PH) model is a semi-parametric approach that quantifies the effect of each covariate via a hazard ratio (HR). An HR > 1 indicates increased churn risk; an HR < 1 indicates reduced churn risk.
| Covariate | Hazard Ratio | Interpretation |
|---|---|---|
| onlineBackup_Yes | 0.456 | 54% lower churn risk |
| techSupport_Yes | 0.529 | 47% lower churn risk |
| dependents_Yes | 0.726 | 27% lower churn risk |
| internetService_Fiber | 1.217 | 22% higher churn risk |
| paperlessBilling_Yes | 1.157 | 16% higher churn risk |
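The interpretation column follows from (HR - 1) × 100, and each HR is the exponential of a Cox coefficient. A short sketch that reproduces the percentages from the tabulated hazard ratios (the helper name `describe_hazard_ratio` is made up for illustration):

```python
import math

# Hazard ratios copied from the table above
hazard_ratios = {
    "onlineBackup_Yes": 0.456,
    "techSupport_Yes": 0.529,
    "dependents_Yes": 0.726,
    "internetService_Fiber": 1.217,
    "paperlessBilling_Yes": 1.157,
}

def describe_hazard_ratio(hr):
    """Translate a hazard ratio into a percent change in churn risk."""
    pct_change = (hr - 1) * 100
    direction = "higher" if pct_change > 0 else "lower"
    return f"{abs(pct_change):.0f}% {direction} churn risk"

for covariate, hr in hazard_ratios.items():
    coef = math.log(hr)  # Cox coefficient, since HR = exp(coef)
    print(f"{covariate}: coef = {coef:+.3f}, HR = {hr:.3f}, {describe_hazard_ratio(hr)}")
```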
The model achieved a C-index of 0.6425. Since 0.5 corresponds to random ranking and 1.0 to perfect ranking, this indicates a modest but real ability to separate high-risk from low-risk customers.
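For intuition about what the C-index measures, here is a simplified sketch of Harrell's concordance calculation. It ignores tied durations and is not the lifelines implementation; the function name and toy values are assumptions for illustration.

```python
def concordance_index(durations, events, risk_scores):
    """Share of comparable pairs where the higher-risk customer churned first."""
    concordant, tied, comparable = 0, 0, 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored customer cannot be the earlier event in a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable

# Hypothetical toy data: tenure in months, 1 = churned, model risk score
durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
well_ranked = [0.9, 0.7, 0.4, 0.2]   # highest risk churns first
badly_ranked = [0.2, 0.4, 0.7, 0.9]  # ranking fully reversed

print(concordance_index(durations, events, well_ranked))
print(concordance_index(durations, events, badly_ranked))
```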
The Accelerated Failure Time (AFT) model parameterizes how covariates stretch or compress
the time axis. I implemented it with Spark MLlib's AFTSurvivalRegression, which fits a
Weibull AFT model, using a full feature pipeline for distributed training.
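```python
from pyspark.ml import Pipeline
from pyspark.ml.regression import AFTSurvivalRegression

aft = AFTSurvivalRegression(
    featuresCol='features',
    labelCol='tenure',
    # Spark convention: 1 = event observed (churned), 0 = censored
    censorCol='churn_censor',
    maxIter=200,
)

# indexers, encoder, and assembler are the feature-preparation stages;
# if indexers is a list of StringIndexer stages, unpack it with *indexers
pipeline = Pipeline(stages=[indexers, encoder, assembler, aft])
model = pipeline.fit(train_df)
```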
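On the test set, RMSE was 81.46 months and R² was negative, meaning the predictions fit worse than always predicting the mean tenure. This suggests the Weibull distribution is a poor match for this dataset's survival pattern. The value of this step is mainly on the engineering side: the same pipeline can train on distributed data at scale.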
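To make "stretching the time axis" concrete, the sketch below writes out the Weibull AFT survival function and its quantiles in plain Python. Here `mu` stands for the intercept plus the weighted feature sum and `sigma` for the fitted scale; all numeric values are hypothetical and are not the fitted parameters from this project.

```python
import math

def weibull_aft_survival(t, mu, sigma):
    """S(t) for log T = mu + sigma * eps, eps ~ standard minimum extreme value."""
    return math.exp(-((t / math.exp(mu)) ** (1.0 / sigma)))

def weibull_aft_quantile(p, mu, sigma):
    """Time by which a fraction p of customers is expected to have churned."""
    return math.exp(mu) * (-math.log(1.0 - p)) ** sigma

# Hypothetical values: baseline linear predictor, scale, and one coefficient
mu_base, sigma, beta = 3.0, 1.2, 0.7

median_base = weibull_aft_quantile(0.5, mu_base, sigma)
median_treated = weibull_aft_quantile(0.5, mu_base + beta, sigma)
time_ratio = median_treated / median_base  # equals exp(beta) at every quantile

print(f"baseline median: {median_base:.1f} months")
print(f"median with covariate: {median_treated:.1f} months")
print(f"time ratio: {time_ratio:.3f}")
```

A positive coefficient multiplies every quantile of the time-to-churn distribution by the same factor, which is what distinguishes AFT models from hazard-based models such as Cox PH.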
The Log-Logistic AFT model was validated using a log-odds plot. Parallel, roughly linear curves supported the proportional odds assumption. Key acceleration factors (AF): onlineSecurity (AF = 2.55), techSupport (AF = 2.10), partner (AF = 2.04). An AF of 2.55 means customers with online security are expected to take about 2.55 times as long to churn as those without it.
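The reason parallel lines on a log-odds plot support the model can be checked numerically. Under a log-logistic AFT model, the log-odds of survival are linear in log time, and an acceleration factor shifts that line without changing its slope. In this sketch the baseline parameters `alpha_base` and `beta_shape` are hypothetical; only AF = 2.55 comes from the results above.

```python
import math

def loglogistic_survival(t, alpha, beta_shape):
    """S(t) = 1 / (1 + (t / alpha) ** beta_shape); alpha is the median."""
    return 1.0 / (1.0 + (t / alpha) ** beta_shape)

def log_odds_survival(t, alpha, beta_shape):
    s = loglogistic_survival(t, alpha, beta_shape)
    return math.log(s / (1.0 - s))

alpha_base, beta_shape = 20.0, 1.5  # hypothetical baseline median and shape
af = 2.55                           # onlineSecurity acceleration factor

# Vertical gap between the two log-odds curves at several tenures
gaps = [
    log_odds_survival(t, alpha_base * af, beta_shape)
    - log_odds_survival(t, alpha_base, beta_shape)
    for t in (1, 6, 12, 36)
]
print([round(g, 4) for g in gaps])
print(round(beta_shape * math.log(af), 4))
```

The gap is the same at every tenure and equals the shape parameter times log(AF), so curves that drift apart or cross on the empirical plot would indicate a poor fit.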
Using Cox-predicted monthly survival probabilities and a 10% annual discount rate, I computed 36-month NPV for two customer archetypes:
| Customer Type | Profile | CLV Trend |
|---|---|---|
| High Risk | Fiber + Paperless, $75/mo | Rapid early decline |
| Low Risk | Dependents + Backup + Support, $45/mo | Sustained long-term |
Despite higher monthly revenue, high-risk customers yield lower cumulative CLV due to early churn. The 12-month CLV serves as a rational cap for customer acquisition cost.
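The CLV calculation can be sketched as a discounted sum of expected monthly revenue. The survival curves below are placeholders built from constant monthly retention rates (0.93 and 0.985), not the Cox-predicted probabilities used in the project, and converting the 10% annual rate to a compounding monthly rate is an assumption about the discounting convention.

```python
def cumulative_clv(monthly_revenue, survival_probs, annual_discount_rate=0.10):
    """Cumulative NPV of expected revenue, one entry per month."""
    # Assumed convention: monthly rate that compounds to the annual rate
    monthly_rate = (1 + annual_discount_rate) ** (1 / 12) - 1
    total, cumulative = 0.0, []
    for month, survival in enumerate(survival_probs, start=1):
        # Expected revenue = price x probability the customer is still active
        total += monthly_revenue * survival / (1 + monthly_rate) ** month
        cumulative.append(total)
    return cumulative

horizon = 36
# Hypothetical survival curves, standing in for Cox-predicted probabilities
high_risk_survival = [0.93 ** m for m in range(1, horizon + 1)]
low_risk_survival = [0.985 ** m for m in range(1, horizon + 1)]

high_risk_clv = cumulative_clv(75.0, high_risk_survival)
low_risk_clv = cumulative_clv(45.0, low_risk_survival)

for month in (12, 24, 36):
    print(f"month {month}: high risk ${high_risk_clv[month - 1]:.0f}, "
          f"low risk ${low_risk_clv[month - 1]:.0f}")
```

With these made-up retention rates, the higher-paying archetype leads early but is overtaken within the 36-month horizon, which is the pattern described in the table.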
| Method | Type | Key Metric | Best Use |
|---|---|---|---|
| Kaplan-Meier | Non-parametric | Median = 56 mo | Exploratory analysis |
| Cox PH | Semi-parametric | C-index = 0.64 | Interpretation & CLV |
| AFT Weibull (Spark) | Fully parametric | RMSE = 81.46 | Distributed prediction |
| Log-Logistic AFT | Fully parametric | AIC = 13,812 | Non-PH scenarios |
Online backup and tech support are the strongest retention levers, associated with 54% and 47% lower churn risk respectively. Fiber optic customers churn at 54.6% versus 32.2% for DSL, and their higher monthly charges do not make up for the revenue lost to earlier churn.
The most actionable finding: proactively offering tech support and online backup packages to new fiber customers in their first three months could shift the survival curve upward, improving both retention and lifetime value.