Project: Analyzing Customer Churn in Sparkify Using PySpark

Hitoshi Kumagai
9 min read · Jul 10, 2021
Spark Project: Sparkify (from Udacity)

1. Project Overview

With the spread of the internet, there are many online music distribution services, and they are constantly working to improve customer satisfaction. In this project, I aimed to create a model that analyzes and predicts customer churn by applying data science techniques to real data. I believe that if data scientists can analyze the churn rate, a business-critical indicator, in real time and model it to predict the future, companies will be able to quickly rethink their strategies and maintain their competitive edge.

2. Problem Statement and Motivation

The churn rate refers to the percentage of contract customers or subscribers who leave a supplier during a given period. It is a possible indicator of customer dissatisfaction, cheaper and/or better offers from competitors, more successful sales and/or marketing by competitors, or reasons related to the customer life cycle.

In this project, I use a small subset of the data to build an analysis and prediction workflow designed for big data. Through this, I hope to gain the skills to apply the same analysis and models to much larger data sets.

3. Metrics

This is a classification problem. Candidate metrics for classification include accuracy, recall, F1 score, and F-beta score. Because the churn classes in this subset are imbalanced, a classifier can achieve very high accuracy simply by predicting the majority class. For churn, therefore, the F1 score is a more appropriate metric than accuracy.

4. Files Description

Specifically, let’s look at the data used by the prediction model.

Total rows: 286,500; total columns: 18

mini_sparkify_event_data.json (included in the zip archive)
root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)
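For reference, here is a minimal sketch of how the event log can be loaded and the schema printed. The SparkSession setup is standard boilerplate; the file name comes from the project files above.

from pyspark.sql import SparkSession

# Create a Spark session and load the event log
spark = SparkSession.builder.appName("Sparkify").getOrCreate()
df = spark.read.json("mini_sparkify_event_data.json")

# Schema and size shown above
df.printSchema()
print(df.count(), len(df.columns))  # 286500 rows, 18 columns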

The counts below show that the userId column contains 8,346 invalid values. My first task is to remove the rows with invalid userId values.

+-------------+-------+
|column       |missing|
+-------------+-------+
|artist       |  58392|
|firstName    |   8346|
|gender       |   8346|
|lastName     |   8346|
|length       |  58392|
|location     |   8346|
|registration |   8346|
|song         |  58392|
|userAgent    |   8346|
|userId       |   8346|
+-------------+-------+
(all other columns: 0)

5. Data Pre-Processing

To identify unique users, I filter out rows with an invalid userId; the remaining missing-value counts are shown below. The number of unique users is 225.

+-------------+-------+
|column       |missing|
+-------------+-------+
|artist       |  50046|
|length       |  50046|
|song         |  50046|
+-------------+-------+
(all other columns: 0)
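A minimal sketch of this filtering step, assuming the invalid userId values are empty strings or nulls:

# Keep only rows with a valid, non-empty userId
df_cleaned = df.filter((df.userId.isNotNull()) & (df.userId != ""))

# Number of unique users remaining
print(df_cleaned.select("userId").distinct().count())  # 225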

Two classification steps for user groups are performed. The first is to classify users into ongoing users and churned (terminated) users. The second is to classify users by the number of times they have downgraded. Let's take a look at the main logic and the results.

from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

# Flag the cancellation event on each row
churn = f.udf(lambda x: 1 if x == "Cancellation Confirmation" else 0, IntegerType())
df_cleaned = df_cleaned.withColumn("churn", churn(df_cleaned.page))
# Define window bounds covering each user's whole history
windowval = Window.partitionBy("userId").rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
# Propagate the flag: a user is churned if any of their events is a cancellation
df_cleaned = df_cleaned.withColumn("churn", f.sum("churn").over(windowval))
+-----+-----+
|churn|count|
+-----+-----+
| 0| 173|
| 1| 52|
+-----+-----+

The data shows that we have lost 23% of our customers in three months. The churn classes appear to be imbalanced.
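The second grouping, users by the number of downgrades, can be derived with the same window pattern, reusing f, IntegerType, and windowval from the churn block above. A sketch, assuming each downgrade emits a "Submit Downgrade" page event:

# Flag downgrade events, then sum them per user over the whole history
downgrade = f.udf(lambda x: 1 if x == "Submit Downgrade" else 0, IntegerType())
df_cleaned = df_cleaned.withColumn("downgrade_flag", downgrade(df_cleaned.page))
df_cleaned = df_cleaned.withColumn("downgrade_count", f.sum("downgrade_flag").over(windowval))

# Distribution of users by number of downgrades
df_cleaned.select("userId", "downgrade_count").distinct().groupBy("downgrade_count").count().show()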

6. Exploratory Data Analysis

Analysis of two user groups: churned users and ongoing users.

By creating a helper function that uses both PySpark and pandas, you can produce summaries like the one below very quickly. When I find it hard to analyze and report on data directly, I tend to write such functions, though that may just be because I have not studied the PySpark API deeply enough.

Proportion difference by active/churned user group:
Status Active: female 81%, male 74%, difference 7.2 percentage points
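As an illustration, here is a minimal sketch of the kind of PySpark-plus-pandas helper I mean; the function name and example columns are just for illustration. The idea is to aggregate in Spark and hand the small result to pandas for reporting:

def ratio_by_group(df, group_col, target_col):
    # Aggregate per user in Spark, then convert the small result to pandas
    counts = (df.select("userId", group_col, target_col)
                .distinct()
                .groupBy(group_col, target_col)
                .count()
                .toPandas())
    # Pivot so each group is a row and each target value a column, as percentages
    pivot = counts.pivot(index=group_col, columns=target_col, values="count").fillna(0)
    return pivot.div(pivot.sum(axis=1), axis=0) * 100

# e.g. gender proportions within churned vs. ongoing users
# ratio_by_group(df_cleaned, "churn", "gender")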

Next, let’s take a look at a graph of the percentage of music played per month for each usage status.

Notice: The ratio of the time ongoing users spent playing music to the time churned users spent playing music was computed for each month. In October, ongoing users played about 3 times more; in November, about 16 times more.
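A sketch of how that monthly ratio can be computed, assuming play time is the length column of "NextSong" events and the month is derived from the ts epoch-millisecond timestamp:

# Total play time per month for each group (churn = 0 ongoing, churn = 1 churned)
plays = (df_cleaned
         .filter(f.col("page") == "NextSong")
         .withColumn("month", f.month(f.from_unixtime((f.col("ts") / 1000).cast("long"))))
         .groupBy("month", "churn")
         .agg(f.sum("length").alias("play_time")))

# Ratio of ongoing to churned play time per month
ongoing = plays.filter("churn = 0").selectExpr("month", "play_time as ongoing_time")
churned = plays.filter("churn = 1").selectExpr("month", "play_time as churned_time")
ongoing.join(churned, "month").withColumn("ratio", f.col("ongoing_time") / f.col("churned_time")).show()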

Here is page usage broken down by the users' contract status, which is very interesting for analyzing user behavior. These analyses will also be useful later when building predictive models.

Notice: pages where ongoing (stayed) users show a positive difference of more than 1%:

  • Add Friend: +1.25%
  • Thumbs Up: +3.77%

Notice: pages where ongoing (stayed) users show a negative difference of more than 1%:

  • Roll Advert: -4.28%

Analysis by number of downgrades performed

Let’s see if the same kind of relationship with usage shows up in the number of downgrades. Male users were more likely to cancel their subscriptions, but did the number of downgrades follow a similar trend?

The data shows that women downgrade more often than men. This is the same trend as churn, and the details are below.

The percentage of women who have never downgraded is 73%, which is 9.6 percentage points lower than for men. 19% of women have downgraded once, 5.2 points more than men. 5.8% of women have downgraded twice, 2.5 points more than men. 1.9% of women downgraded three times, while no men downgraded three times.

7. Results

Feature Engineering

  1. Encode qualitative information: gender, billing status, and OS type for each user.
  2. Calculate quantitative information: total number of page accesses per user, and music playing time per user by month and day.

The dataset for prediction has 37 columns; 35 of them are used as prediction features.
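A sketch of what those two steps can look like; the page names used for the pivot are just examples, and the real feature set in the project is wider (37 columns in total):

# 1. Qualitative features: e.g. gender encoded as 0/1 per user
gender_feat = (df_cleaned.select("userId", "gender").distinct()
               .withColumn("is_female", (f.col("gender") == "F").cast("int"))
               .drop("gender"))

# 2. Quantitative features: e.g. page-visit counts per user, pivoted into columns
page_feat = (df_cleaned.groupBy("userId")
             .pivot("page", ["Add Friend", "Thumbs Up", "Roll Advert"])
             .count()
             .na.fill(0))

# One row per user: userId, label, then features
label = df_cleaned.select("userId", "churn").distinct()
predict_df = (label.join(gender_feat, "userId", "left")
                   .join(page_feat, "userId", "left")
                   .na.fill(0))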

Modeling

  1. Normalized the features and split the data into training and test sets.
# Create the feature vector
from pyspark.ml.feature import VectorAssembler, StandardScaler

features = predict_df.columns[2:]
assembler = VectorAssembler(inputCols=features, outputCol="raw_features")
predict_df = assembler.transform(predict_df)
# Standardization of the data
scaler = StandardScaler(inputCol="raw_features", outputCol="feature", withStd=True)

# Select label and feature columns
dataset = predict_df.select(predict_df["churn"].alias("label"), predict_df["raw_features"])

# Split data into training and validation sets
train, validation = dataset.randomSplit([0.7, 0.3], seed=42)

2. Evaluated three models: Logistic Regression, Random Forest, and GBT (Gradient-Boosted Trees).

Logistic Regression

# Build the model
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

logit_r = LogisticRegression(labelCol="label", featuresCol="feature", maxIter=10)

# Define the parameter grid
paramGrid = (ParamGridBuilder()
             .addGrid(logit_r.regParam, [0.01, 0.1, 0.2, 0.5])
             .addGrid(logit_r.elasticNetParam, [0.0, 0.2, 0.5])
             .build())

Random Forest

# Create the model
from pyspark.ml.classification import RandomForestClassifier

rfc = RandomForestClassifier(labelCol="label", featuresCol="feature", seed=99)

# Define the parameter grid
paramGrid = (ParamGridBuilder()
             .addGrid(rfc.maxDepth, [2, 5, 10])
             .addGrid(rfc.maxBins, [10, 20, 40])
             .build())

GBT

# Create the model
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier

gbtc = GBTClassifier(labelCol="label", featuresCol="feature", lossType='logistic', seed=99)
pipeline = Pipeline(stages=[scaler, gbtc])

# Define the parameter grid
paramGrid = (ParamGridBuilder()
             .addGrid(gbtc.maxDepth, [2, 5])
             .addGrid(gbtc.stepSize, [0.01, 0.1, 0.2])
             .build())

Model evaluation

GBT, which had the best F1 score on the test data and showed no signs of over-fitting, was selected as the final model.
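One way to wire each grid up is with cross-validation and an F1 evaluator. Here is a sketch for the GBT pipeline and grid defined above; the use of CrossValidator and the fold count are assumptions:

from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3)
cv_model = cv.fit(train)

# Compare F1 on the training and held-out sets to check for over-fitting
print("train F1:", evaluator.evaluate(cv_model.transform(train)))
print("test  F1:", evaluator.evaluate(cv_model.transform(validation)))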

Logistic Regression - 2nd candidate, no sign of over-fitting

The F1 score on the training set is 0.82
The F1 score on the test set is 0.79

Random Forest - 3rd candidate, with signs of over-fitting

The F1 score on the training set is 0.96
The F1 score on the test set is 0.78

GBT - best candidate, no sign of over-fitting

The F1 score on the training set is 0.93
The F1 score on the test set is 0.85

Feature Importance - GBT

The graph above shows that feature importance is concentrated on the number of visits to certain pages, such as Add Friend, Thumbs Up, and Roll Advert, and on the hours of music played per month and per day of the week. Among the OS-type features, only Windows appears; no other OS type does.
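A sketch of how those importances can be pulled out of the fitted GBT model and matched back to feature names; cv_model here is the hypothetical cross-validated model from the sketch above, and features is the VectorAssembler input list:

import pandas as pd

# The GBT stage is the last stage of the best pipeline model
gbt_model = cv_model.bestModel.stages[-1]

importances = pd.DataFrame({
    "feature": features,  # input columns of the VectorAssembler
    "importance": gbt_model.featureImportances.toArray()
}).sort_values("importance", ascending=False)

print(importances.head(10))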

8. Conclusion

The prediction model based on GBT achieves an F1 score of 0.85 on the test set.

Findings of Interest and Difficulty

I found it very interesting that churn could be predicted from the time spent in the app and the pages each user accessed, and I felt there is real potential to apply the prediction model in business. The most difficult part of the project for me was selecting the factors for prediction. I was able to build a prediction model because this project provided guidance, but I have thought about these two points in more detail and discuss them in the improvement plan below.

Improvement

Precondition

To consider improvements for real operation, imagine the requests a business owner might make.

The F1 score obtained is reasonably satisfying, but depending on the timing of the prediction, it may not be sufficient. For example, suppose a business owner requests a forecast for the remaining three months based on information through October 2018. The business owner considers this reasonable because they want to forecast the quarter's revenue early. Since the current forecasting model was not built using all of the data, responding to the above request is something worth considering.

Considerations from a data engineering perspective:

Additional information that could be used includes artist and session information. A simple example would be to use data analysis to identify artists and songs with low churn rates and incorporate them as factors in the prediction model. It is easy to imagine that this would require substantial data analysis, because the number of distinct artists and songs is expected to be very large.

Consideration from the feature importance of the prediction model:

Add better features.

The feature importance of the model is dominated by two kinds of factors: pages and time. Since the relationship between users and pages was used as a factor this time, it is conceivable that the relationship between users and time could be analyzed and turned into features in more detail.
For example, suppose I want to analyze the period during which users are most likely to consider cancelling after signing up. If I know when cancellations are concentrated, I can use the number of days that have passed since the user signed up as a factor.
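For instance, a tenure feature of that kind could be derived from the existing registration and ts epoch-millisecond columns; a minimal sketch:

# Days elapsed between registration and each user's latest event
tenure = (df_cleaned.groupBy("userId")
          .agg(((f.max("ts") - f.max("registration")) / (1000 * 60 * 60 * 24))
               .alias("days_since_registration")))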

At the very least, the two points above can be improved, but since this data covers only one quarter, I think the following point also needs attention.

Seasonality

I believe that the churn rate is a metric that depends strongly on time. From this point of view, we can assume that the seasons bring big changes in user behavior. For example, in the U.S., summer vacation means more free time for prospective users. At that time of year, prospective users will want to listen to music while driving and will be willing to pay because they do not want ads. In other words, I assume there are times when seasonality makes users more willing to pay. I believe seasonality is a risk for business owners when they use trained models in production: a forecasting model built on one season's data may no longer be valid once the season changes. So-called data drift monitoring becomes a necessity in business, and I assume strategies such as re-analysis of the data and re-training will be needed from the perspective of seasonality.

Next Project

Given the situation above, I see the need to build pipelined predictive models, from raw data to prediction, on cloud platforms such as AWS and Azure. Cloud-based applications are next on my list.

9. Other Information for you

Other results and the code itself are shared at the following link:

https://github.com/hitoshikumagai/Udacity/blob/main/data-science/project/Capstone%20Project/README.md
