Predicting Customer Churn in E-commerce Using Supervised Learning | by Tahera Firdose | Jan, 2024

Writer: Uthaman Bakthikrishnan

In an era where customer retention is as crucial as acquisition, understanding and predicting customer churn becomes a centerpiece of business strategy, especially in e-commerce. This blog presents an in-depth guide on how to use Python and machine learning techniques for effective churn prediction.

Churn prediction involves identifying customers likely to leave a service or stop buying products. In e-commerce, where competition is fierce, retaining an existing customer is often cheaper than acquiring a new one. Accurate churn predictions can help businesses tailor their retention strategies more effectively.

In the first part of our series on churn analysis in e-commerce, we focused on analyzing customer churn. In this continuation, we’ll explore the practical application of supervised learning techniques in Python for predicting churn.

Our journey begins with importing the libraries and loading the E-Commerce dataset:

With the dataset loaded into a DataFrame, we first take a peek at the top rows using df.head(). This gives us an initial sense of the data we’re working with, including the types of columns and the nature of the data in each column.

Next, df.info() provides a more detailed overview of the dataset: the column names, their data types, and the number of non-null entries in each. This method is extremely useful for understanding the structure of the DataFrame.
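The loading and first-look steps might look like the sketch below. The file name of the dataset is not given in this post, so a few made-up rows are read from an in-memory CSV instead; the column names are taken from the rest of the article, and the values are purely illustrative.

```python
import io

import pandas as pd

# Stand-in for the real file: a few made-up rows in CSV form.
# With the actual dataset, its path would be passed to the pandas reader instead.
csv_text = """Churn,Tenure,CityTier,WarehouseToHome,HourSpendOnApp
1,4.0,3,6.0,3.0
1,,1,8.0,3.0
0,0.0,1,30.0,2.0
0,13.0,3,,2.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# First rows: a quick look at column names and typical values.
print(df.head())

# Structure: dtypes, non-null counts, and memory usage per column.
df.info()
```

Empty CSV fields are read as NaN, which is why the non-null counts reported by df.info() already hint at where values are missing.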

Checking for duplicate values:

There are no duplicate values in the dataset.
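A duplicate check of this kind can be sketched as follows. The toy frame below deliberately contains one repeated row so that the check has something to find; the values are made up for illustration.

```python
import pandas as pd

# Illustrative frame with one exact duplicate row (made-up values).
df = pd.DataFrame({
    "Tenure": [4.0, 9.0, 4.0],
    "CityTier": [3, 1, 3],
    "Churn": [1, 0, 1],
})

# duplicated() flags every row that repeats an earlier row in full.
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# If any were found, they could be dropped while keeping the first occurrence.
df_unique = df.drop_duplicates().reset_index(drop=True)
```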

Analyzing Missing Data:

df.isnull().sum() provides us with a clear picture of the missing values across different columns.

Overview of Missing Values

  • Tenure: 264 missing values.
  • WarehouseToHome: 251 missing values.
  • HourSpendOnApp: 255 missing values.
  • OrderAmountHikeFromlastYear: 265 missing values.
  • CouponUsed: 256 missing values.
  • OrderCount: 258 missing values.
  • DaySinceLastOrder: 307 missing values.
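A per-column count like the one above can be produced with a few lines of pandas. The sketch below uses a small made-up frame; the percentage column is an addition for illustration and is not part of the original overview.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names come from the article.
df = pd.DataFrame({
    "Tenure": [4.0, np.nan, 0.0, 13.0],
    "DaySinceLastOrder": [5.0, 0.0, np.nan, np.nan],
    "CityTier": [3, 1, 1, 3],
})

missing = df.isnull().sum()
# Keep only the columns that actually have gaps, largest first.
missing = missing[missing > 0].sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(1)

print(pd.DataFrame({"missing": missing, "percent": missing_pct}))
```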

Data Type Conversion

It’s essential to ensure that categorical variables like ‘Churn’ and ‘CityTier’ are correctly identified for the algorithms to process. Both are stored as integers, but the integers are labels rather than quantities.
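One way to do this conversion is sketched below. Whether the original analysis used the object dtype or pandas' category dtype is not shown in this post; object is assumed here because it lets the columns be picked up later as non-numeric.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    "Churn": [1, 0, 0, 1],
    "CityTier": [3, 1, 1, 2],
    "Tenure": [4.0, 9.0, 0.0, 13.0],
})

# Treat the integer codes as labels, not as numeric quantities.
for col in ["Churn", "CityTier"]:
    df[col] = df[col].astype("object")

print(df.dtypes)
```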

Checking the values in each categorical column:

  • The majority of users, 2,765, prefer ‘Mobile Phone’ for login, followed by ‘Computer’ (1,634) and ‘Phone’ (1,231).
  • We can consolidate ‘Phone’ into ‘Mobile Phone’ for consistency, as they likely represent the same category.

  • ‘Debit Card’ is the most common payment mode (2,314), followed by ‘Credit Card’ (1,501) and ‘E wallet’ (614).
  • We can also merge ‘CC’ with ‘Credit Card’ and ‘COD’ with ‘Cash on Delivery’ to remove redundancy.

  • ‘Laptop & Accessory’ and ‘Mobile Phone’ are the top order categories, with 2,050 and 1,271 instances, respectively.
  • We can also merge ‘Mobile’ with ‘Mobile Phone’ for a clearer categorization.
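The merges described above can be expressed as a per-column mapping. In the sketch below the column names PreferredLoginDevice, PreferredPaymentMode, and PreferedOrderCat are assumptions, since this post does not name the columns; the labels being merged are the ones listed above.

```python
import pandas as pd

# Made-up rows; column names are assumed, labels come from the article.
df = pd.DataFrame({
    "PreferredLoginDevice": ["Mobile Phone", "Phone", "Computer"],
    "PreferredPaymentMode": ["CC", "COD", "Debit Card"],
    "PreferedOrderCat": ["Mobile", "Mobile Phone", "Laptop & Accessory"],
})

# Map each redundant label to its canonical form, column by column.
label_fixes = {
    "PreferredLoginDevice": {"Phone": "Mobile Phone"},
    "PreferredPaymentMode": {"CC": "Credit Card", "COD": "Cash on Delivery"},
    "PreferedOrderCat": {"Mobile": "Mobile Phone"},
}

for col, mapping in label_fixes.items():
    # replace() with a dict matches whole values, so 'Mobile Phone' is left alone.
    df[col] = df[col].replace(mapping)
    print(df[col].value_counts(), end="\n\n")
```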

We can also observe our target variable, with 0 (no churn) having 4,682 instances and 1 (churn) having 948 instances.

The data shows a class imbalance: about 83% of the 5,630 customers did not churn, against about 17% who did.

Handling Outliers

Outliers can skew the results. We visualize them using box plots and cap them using the Interquartile Range (IQR) method:

Capping outliers using IQR
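A minimal sketch of IQR capping follows. The multiplier of 1.5 is the usual box-plot convention and is an assumption here, since the post does not state which value was used; cap_outliers_iqr is a hypothetical helper name and the sample values are made up.

```python
import numpy as np
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# One extreme value (127) and one missing value, for illustration.
df = pd.DataFrame({"WarehouseToHome": [6.0, 8.0, 9.0, 10.0, 12.0, 127.0, np.nan]})
df["WarehouseToHome"] = cap_outliers_iqr(df["WarehouseToHome"])

print(df["WarehouseToHome"].tolist())
```

Capping keeps every row in the dataset and only pulls extreme values back to the fence, which is why it is often preferred over dropping rows when the dataset is small. Missing values pass through unchanged and are left for the imputation step.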

As we advance in our churn prediction analysis, we come to a crucial phase of encoding and preprocessing our dataset. Here’s how we prepare our e-commerce data for the predictive modeling stage:

Label Encoding the Target Variable

First, we apply label encoding to the Churn column:

Label encoding transforms the categorical target variable into a numerical format that machine learning models can understand. It assigns a unique integer to each class of churn (e.g., 1 for churned customers and 0 for non-churned customers).
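With scikit-learn, this step could look like the sketch below. The Churn values are made up, and the column is given the object dtype on the assumption that it was converted to a categorical type earlier.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Churn stored as labels (object dtype) rather than as numbers.
df = pd.DataFrame({"Churn": pd.Series([1, 0, 0, 1], dtype="object")})

encoder = LabelEncoder()
df["Churn"] = encoder.fit_transform(df["Churn"])

# classes_ lists the original labels in the order of their integer codes.
print(encoder.classes_)
print(df["Churn"].tolist())
```

Because the labels are already 0 and 1, the encoder maps each to itself; its main effect here is to return the column to an integer dtype.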

Separating Features and Target Variable

We then separate our dataset into features (x) and the target variable (y):

This distinction is essential because we need to treat the features and the target variable differently in the modeling process: the features are imputed, scaled, and encoded, while the target is only used for fitting and scoring.
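The separation itself is a one-liner per side. In this sketch the features are named X, following the scikit-learn convention, and the rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Tenure": [4.0, 9.0, 0.0],
    "CityTier": [3, 1, 1],
    "Churn": [1, 0, 0],
})

X = df.drop(columns=["Churn"])  # everything except the target
y = df["Churn"]                 # the label we want to predict

print(X.shape, y.shape)
```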

Identifying Column Types

Understanding which columns are categorical and which are numerical is crucial for preprocessing:

This step helps us apply the appropriate preprocessing techniques to each type of data.
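One way to split the column names by type is select_dtypes. The sketch below treats every non-numeric column as categorical, which is an assumption about how the original analysis drew the line; the frame is made up.

```python
import pandas as pd

X = pd.DataFrame({
    "Tenure": [4.0, 9.0, 0.0],
    "OrderCount": [1.0, 2.0, 6.0],
    "PreferredPaymentMode": ["Debit Card", "Credit Card", "E wallet"],
    "CityTier": pd.Series([3, 1, 1], dtype="object"),  # labels, not quantities
})

# Numeric columns go to the numerical pipeline, everything else to the categorical one.
numerical_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```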

Constructing Preprocessing Pipelines

We create two pipelines, one for numerical and one for categorical columns:

  • The numerical pipeline imputes missing values with the mean and then scales the data to have a mean of zero and a standard deviation of one.
  • The categorical pipeline imputes missing values with the most frequent category and then applies one-hot encoding.
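The two pipelines described above can be sketched with scikit-learn as follows. The step names and the handle_unknown="ignore" setting are assumptions added for illustration; the imputation strategies and the scaling and encoding steps follow the description.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # fill gaps with the column mean
    ("scaler", StandardScaler()),                 # zero mean, unit standard deviation
])

categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fill gaps with the mode
    ("onehot", OneHotEncoder(handle_unknown="ignore")),    # one column per category
])

# Quick check on a toy numeric column with one gap.
toy = np.array([[1.0], [np.nan], [3.0]])
scaled = numerical_pipeline.fit_transform(toy)
print(scaled.ravel())
```

In the toy check, the missing entry is filled with the mean of 1 and 3, so it lands exactly at zero after scaling.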

Combining Transformers

We combine these transformers using a ColumnTransformer:

This preprocessor will apply each pipeline to its respective column type during the transformation process.
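Putting the pieces together might look like the sketch below. The transformer names "num" and "cat", the column lists, and the sample rows are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_cols = ["Tenure", "OrderCount"]
categorical_cols = ["PreferredPaymentMode"]  # assumed column name

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Each pipeline only sees the columns listed for it.
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_pipeline, numerical_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

X = pd.DataFrame({
    "Tenure": [4.0, np.nan, 0.0, 13.0],
    "OrderCount": [1.0, 2.0, np.nan, 6.0],
    "PreferredPaymentMode": pd.Series(
        ["Debit Card", "Credit Card", np.nan, "Debit Card"], dtype="object"
    ),
})

X_t = preprocessor.fit_transform(X)
print(X_t.shape)  # two scaled numeric columns plus one column per payment mode
```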

Splitting and Applying the Transformation to the Dataset

We split our dataset into training and testing sets to evaluate our model’s performance on unseen data:

By fitting the preprocessor to the training data only and then transforming both the training and testing sets, we prepare our data for the machine learning algorithms without letting statistics from the test set leak into training.
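The split-then-fit order can be sketched as below. The 80/20 ratio is inferred from the training-set counts reported later in this post (4,504 of 5,630 rows); random_state is an assumption, the data is random, and a plain StandardScaler stands in for the full preprocessor to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data: 50 rows, two numeric features, a binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Tenure": rng.integers(0, 30, size=50).astype(float),
    "OrderCount": rng.integers(1, 10, size=50).astype(float),
})
y = pd.Series(rng.integers(0, 2, size=50), name="Churn")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocessor = StandardScaler()                  # stand-in for the ColumnTransformer
X_train_t = preprocessor.fit_transform(X_train)  # learn statistics on the training set only
X_test_t = preprocessor.transform(X_test)        # reuse those statistics on the test set

print(X_train_t.shape, X_test_t.shape)
```

Passing stratify=y to train_test_split would keep the churn ratio the same in both splits; whether the original analysis did so is not stated.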

In the pursuit of an effective churn prediction strategy, we have laid down a comprehensive framework for training and evaluating multiple machine learning models. This section details our approach and findings in applying these models to our e-commerce dataset.

Selection of Models

We’ve chosen a diverse set of models to ensure robustness and compare different learning algorithms:

  • Logistic Regression: A baseline model for binary classification problems.
  • Decision Tree Classifier: A model that partitions the feature space into a hierarchy of decisions, offering interpretability.
  • Random Forest Classifier: An ensemble of decision trees that improves prediction robustness through averaging.
  • XGBoost Classifier: A powerful, boosted tree algorithm known for its efficiency and performance.
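The four models above can be collected in a dictionary so they can be trained and scored in a loop. The hyperparameters used in the original analysis are not given, so the sketch below relies on library defaults; max_iter and random_state are assumptions added for convergence and reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
}

# XGBoost ships as a separate package; add it only if it is installed.
try:
    from xgboost import XGBClassifier

    models["XGBClassifier"] = XGBClassifier(random_state=42)
except ImportError:
    print("xgboost is not installed; skipping XGBClassifier")

print(list(models))
```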

Overview of Model Performance

  1. Logistic Regression showed an accuracy of 91% with a ROC AUC of 0.90. While precision is high for the non-churned class, the recall for the churned class is notably lower, indicating that the model is missing a significant number of churned customers.
  2. Decision Tree Classifier improved slightly with an accuracy of 94% and a ROC AUC also at 0.90. It showed a better balance between precision and recall for the churned class compared to Logistic Regression.
  3. Random Forest Classifier performed better, with a 96% accuracy and a higher ROC AUC of 0.99. Despite high precision, the recall for churned customers is still less than ideal, suggesting room for improvement in correctly identifying all actual churn cases.
  4. XGBClassifier topped the accuracy charts at 97% with a ROC AUC of 0.98. It showed remarkable precision and a decent recall for churned customers, suggesting a strong ability to identify churn.
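Figures like the ones above come from fitting each model on the training set and scoring it on the held-out test set. A minimal sketch of such an evaluation is shown below; evaluate is a hypothetical helper, and the data is a synthetic, imbalanced stand-in generated with make_classification rather than the e-commerce dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a classifier and return accuracy, ROC AUC, and a per-class report."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1 (churn)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "report": classification_report(y_test, y_pred, zero_division=0),
    }


# Synthetic stand-in with roughly the same class ratio as the article's data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.83, 0.17], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = evaluate(LogisticRegression(max_iter=1000), X_train, y_train, X_test, y_test)
print(f"accuracy={results['accuracy']:.2f}, roc_auc={results['roc_auc']:.2f}")
print(results["report"])
```

ROC AUC is computed from predicted probabilities rather than hard labels, which is why the helper calls predict_proba in addition to predict.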

The Issue of Imbalanced Datasets

Despite these seemingly high accuracy figures, there’s a caveat. Our dataset is imbalanced, as indicated by the original Churn value counts: there are significantly more non-churned (0) customers than churned (1) ones.

Here’s why imbalance can be misleading:

  • Accuracy Paradox: High accuracy might reflect the underlying class distribution rather than the model’s predictive power. A model that always predicts ‘no churn’ could still be highly accurate simply because most customers don’t churn.
  • Precision and Recall Trade-Off: Focusing solely on accuracy overlooks the balance between precision and recall. In an imbalanced dataset, it’s common for the minority class to have a lower recall, as seen with the churned customers in our case.
  • ROC AUC Consideration: While ROC AUC is generally a good performance metric, it can also present an overly optimistic view of a model’s performance on imbalanced datasets.
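The accuracy paradox is easy to reproduce with the class counts reported for this dataset (4,682 non-churned, 948 churned). The sketch below scores a "model" that predicts no churn for everyone; it is an illustration, not one of the models evaluated in this post.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts quoted earlier in the article.
y_true = np.array([0] * 4682 + [1] * 948)
y_pred = np.zeros_like(y_true)  # always predict "no churn"

accuracy = accuracy_score(y_true, y_pred)
churn_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"accuracy={accuracy:.3f}, churn recall={churn_recall:.3f}")
# prints: accuracy=0.832, churn recall=0.000
```

An accuracy above 83% with zero churners identified is the reason per-class precision and recall are reported alongside accuracy in the rest of this post.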

To address this issue, we have implemented the Synthetic Minority Over-sampling Technique (SMOTE) in our data preprocessing stage.

SMOTE is an advanced over-sampling technique that generates synthetic examples of the minority class by interpolating new instances between existing ones. This process helps create a balanced dataset, enabling our models to learn from an equal representation of both classes.
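To make the interpolation idea concrete, here is a simplified SMOTE-style sketch written with NumPy and scikit-learn's NearestNeighbors. It is not the implementation used in this post: smote_like_oversample is a hypothetical helper, the default of five neighbours is an assumption, and the two-feature data is random.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbour_idx = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)  # random minority rows to start from
    slot = rng.integers(1, k + 1, size=n_new)       # random neighbour slot, skipping self at 0
    neighbour = neighbour_idx[base, slot]
    gap = rng.random((n_new, 1))                    # position along the connecting segment
    return X_min[base] + gap * (X_min[neighbour] - X_min[base])


rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(40, 2))
X_minority = rng.normal(3.0, 1.0, size=(8, 2))

# Generate just enough synthetic rows to match the majority class.
X_synth = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, X_synth])

print(len(X_majority), len(X_minority_balanced))
```

Each synthetic row lies on the line segment between a real minority point and one of its nearest minority neighbours, so no new point falls outside the region the minority class already occupies. In practice, the SMOTE class from the imbalanced-learn package, with its fit_resample method, is the usual choice, applied to the training split only.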

Class Distribution Before SMOTE

Initially, the training data exhibited a significant imbalance between the two classes:

  • Non-churned customers (0): 3,741 instances
  • Churned customers (1): 763 instances

This disparity presents a risk that a model trained on this data might be biased toward predicting non-churned customers, simply because there are more examples of this class in the training data.

Class Distribution After SMOTE

After applying SMOTE, the class distribution is perfectly balanced:

  • Non-churned customers (0): 3,741 instances
  • Churned customers (1): 3,741 instances

SMOTE has created 2,978 synthetic samples of churned customers, increasing their representation to equal that of non-churned customers. This balanced distribution is essential for training models that can identify patterns indicative of churn without the bias introduced by class imbalance.

After balancing our dataset with SMOTE, we have evaluated four different machine learning models on the transformed dataset. Here’s a summary of how each model performed:

Logistic Regression

  • Accuracy: 83%
  • Precision: 0.96 (for non-churned), 0.48 (for churned)
  • Recall: 0.83 (for non-churned), 0.81 (for churned)
  • F1-Score: 0.89 (for non-churned), 0.60 (for churned)
  • ROC AUC: 0.90

While Logistic Regression achieved a decent ROC AUC score, the precision for predicting churned customers was quite low, though it had a high recall, indicating a tendency to over-predict churn.

Decision Tree Classifier

  • Accuracy: 95%
  • Precision: 0.98 (for non-churned), 0.82 (for churned)
  • Recall: 0.96 (for non-churned), 0.88 (for churned)
  • F1-Score: 0.97 (for non-churned), 0.85 (for churned)
  • ROC AUC: 0.92

The Decision Tree Classifier showed substantial improvement, particularly in the precision of predicting churned customers, and a balanced recall.

Random Forest Classifier

  • Accuracy: 96%
  • Precision: 0.96 (for non-churned), 0.96 (for churned)
  • Recall: 0.99 (for non-churned), 0.82 (for churned)
  • F1-Score: 0.98 (for non-churned), 0.88 (for churned)
  • ROC AUC: 0.98

Random Forest Classifier offered a high level of accuracy and an excellent ROC AUC score. It maintained high precision for churned customers but showed a slight drop in recall, indicating fewer false positives but more false negatives for churn.


XGBClassifier

  • Accuracy: 97%
  • Precision: 0.97 (for non-churned), 0.95 (for churned)
  • Recall: 0.99 (for non-churned), 0.87 (for churned)
  • F1-Score: 0.98 (for non-churned), 0.91 (for churned)
  • ROC AUC: 0.97

XGBClassifier emerged as the top performer with the highest accuracy and a strong balance between precision and recall for churned customers. The ROC AUC score also suggests an excellent discriminative ability between churned and non-churned customers.
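The F1 figures reported above for the churned class follow directly from the precision and recall values, since F1 is their harmonic mean. The short check below recomputes them; the numbers are copied from the four lists above, and f1 is a helper defined here for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the churned class, as listed above.
churn_metrics = {
    "Logistic Regression": (0.48, 0.81, 0.60),
    "Decision Tree Classifier": (0.82, 0.88, 0.85),
    "Random Forest Classifier": (0.96, 0.82, 0.88),
    "XGBClassifier": (0.95, 0.87, 0.91),
}

for name, (p, r, reported) in churn_metrics.items():
    print(f"{name}: computed F1 = {f1(p, r):.2f}, reported = {reported:.2f}")
```

The harmonic mean is pulled toward the lower of the two values, which is why Logistic Regression's F1 of 0.60 sits much closer to its precision of 0.48 than to its recall of 0.81.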

Implications of the Results

The results after applying SMOTE indicate that the models are better tuned to recognize patterns of churn, which is critical for actionable business insights. The change shows up in the churned-class precision and recall rather than in the headline numbers: for the tree-based models, accuracy and ROC AUC stay within one or two points of their pre-SMOTE values, while Logistic Regression’s accuracy falls from 91% to 83% as it flags more customers as churners.

Cross-validation is a critical step to evaluate the performance of machine learning models. It helps us ensure that the models not only fit the training data well but also have the ability to generalize to new, unseen data. Here, we have performed cross-validation on the balanced training set obtained after applying SMOTE and compiled the results.
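A cross-validation run that reports the mean and standard deviation of ROC AUC might look like the sketch below. The number of folds is not stated in this post, so five stratified folds are assumed; the data is a synthetic stand-in, and only two scikit-learn models are shown to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, roughly balanced stand-in for the post-SMOTE training set.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_summary = {}
for name, model in models.items():
    # One ROC AUC score per fold; the spread shows how stable the model is.
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean ROC AUC = {scores.mean():.4f}, std = {scores.std():.4f}")
```

When the folds are drawn from data that has already been oversampled, synthetic points interpolated from training rows can end up in the validation folds, which tends to make the scores optimistic. Resampling inside each fold, on the training portion only, is a stricter alternative.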

Model Performance

Logistic Regression:

  • Average ROC AUC: 0.9001
  • Standard Deviation: 0.0102

Logistic Regression shows decent performance with a good average ROC AUC score, indicating its ability to differentiate between the classes. The low standard deviation suggests consistent performance across different folds.

Decision Tree Classifier:

  • Average ROC AUC: 0.9409
  • Standard Deviation: 0.0195

Decision Tree Classifier shows improved performance over Logistic Regression, with a higher average ROC AUC score, though with a slightly higher standard deviation, indicating a bit more variation in its performance across folds.

Random Forest Classifier:

  • Average ROC AUC: 0.9986
  • Standard Deviation: 0.0021

Random Forest Classifier stands out with an excellent average ROC AUC score and a very low standard deviation, which highlights its robust and consistent performance across different cross-validation folds.


XGBClassifier:

  • Average ROC AUC: 0.9949
  • Standard Deviation: 0.0090

XGBClassifier also shows outstanding performance, with a high average ROC AUC score and low standard deviation, indicating both its effectiveness and reliability.


The cross-validation results underscore the robustness of the ensemble methods (Random Forest and XGBClassifier) when dealing with balanced datasets. The high average ROC AUC scores are indicative of the models’ capabilities to distinguish between churned and non-churned customers effectively. The low standard deviations confirm that these models perform consistently across different subsets of the data.


In summary, our journey through predicting customer churn in e-commerce using supervised learning has led us to several key insights. By meticulously preparing our dataset, addressing class imbalance with SMOTE, and selecting a diverse set of machine learning models, we have laid a strong foundation for accurate churn prediction.

Our cross-validation results reveal that ensemble models, especially Random Forest and XGBClassifier, stand out in their ability to generalize across different data subsets, showing great promise for deployment in real-world scenarios. Their high average ROC AUC scores and low standard deviations demonstrate not only their predictive power but also their consistency and reliability.
