ChurnOrbit: Bank Churn Prediction Model


As aspiring entrepreneurs, we know that customer retention is a key challenge for companies across all industries.

Understanding why customers leave and what keeps them engaged is essential for long-term success.

In the banking industry, churn is a persistent issue, and this turnover can lead to significant revenue loss for banks.

How can we better retain customers?

To answer this question, we analyzed a bank churn dataset and leveraged machine learning to uncover the most telling predictors of customer retention and to identify at-risk customers.

Our findings help banks implement retention strategies grounded in these insights.

Process

For this project, we used a dataset from Kaggle. The data consists of 10,000 observations with 14 features, covering customer demographics and account information such as age, geography, credit score, balance, tenure, number of products, credit card status, and active membership, along with whether the customer churned.

For more information on each variable, check out the codebook in our GitHub repository.

EDA

After importing the data, we checked for missing and duplicate values. We found only one missing and 3 duplicated rows, so we decided to drop them. We also dropped the row number, id, and surname columns since they only contained identifying information, which is not useful for prediction.
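
A minimal sketch of these cleaning steps, assuming the Kaggle file is saved locally as Churn_Modelling.csv (the filename and exact column names follow the public dataset and are assumptions here):

```python
import pandas as pd

# Load the raw data (assumed filename)
df = pd.read_csv("Churn_Modelling.csv")

# Check for missing values and duplicated rows before dropping them
print(df.isna().sum())          # per-column missing counts
print(df.duplicated().sum())    # number of fully duplicated rows
df = df.dropna().drop_duplicates()

# Drop identifier columns that carry no predictive signal
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])
```
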
We then used a correlation plot to visualize relationships between variables.
[Figure: correlation plot of the dataset's numeric variables]
As we can see, few variables were highly correlated. This is good, since multicollinearity can introduce issues such as overfitting, reduced predictive performance, and lower computational efficiency.
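
One way to produce a plot like this, assuming the cleaned frame df from the step above; the use of a seaborn heatmap here is an assumption, not necessarily the project's exact plotting code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns only
corr = df.select_dtypes("number").corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between numeric features")
plt.tight_layout()
plt.show()
```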

Next, we looked at each variable and plotted them against churn. The biggest differences in distribution occurred in has credit card, age, balance, and is active member. Based on this insight, we might expect that these variables will play a crucial role in our models.
When examining the distribution of churn, we noticed a large imbalance between the number of positive and negative cases. This imbalance can cause problems when fitting models: with too few examples of the minority class, a model cannot learn its patterns well, which hurts accuracy. To fix this, we used oversampling, duplicating minority cases until the class distribution was balanced.
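
A sketch of this oversampling step, assuming the target column is named Exited as in the public dataset; the project may have used a different utility for the same idea:

```python
import pandas as pd
from sklearn.utils import resample

# Split the frame by class: 0 = stayed (majority), 1 = churned (minority)
majority = df[df["Exited"] == 0]
minority = df[df["Exited"] == 1]

# Duplicate minority rows (sampling with replacement) until the classes match
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)

# Recombine and shuffle
df_balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(df_balanced["Exited"].value_counts())
```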

Models

To better understand which customers are likely to churn, we created several classification models. We began by splitting our dataset into a training set (to build the models) and a testing set (to check how well the models predict unseen data). Since we’re predicting a yes-or-no outcome—whether a customer will churn—we used several common classification algorithms, including K-nearest neighbors (KNN) and random forest.
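
A condensed sketch of this setup, continuing from the balanced frame df_balanced above. The one-hot encoding, split ratio, and default model settings are assumptions, and only the two algorithms named later in this write-up are shown:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# One-hot encode categorical features (e.g. geography, gender) and split off the target
X = pd.get_dummies(df_balanced.drop(columns=["Exited"]), drop_first=True)
y = df_balanced["Exited"]

# Hold out a test set for evaluating on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Two of the classifiers compared in this project (others omitted for brevity)
models = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
```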

After training each model, we measured their accuracy—the percentage of correct predictions out of all customers—and also compared their ROC AUC scores (a measure of how well the model distinguishes churners from non-churners). The formula for accuracy is:

\[ \text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}} \]
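
Both metrics are available in scikit-learn; continuing the sketch above:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

for name, model in models.items():
    preds = model.predict(X_test)                # hard yes/no predictions
    probs = model.predict_proba(X_test)[:, 1]    # predicted probability of churn
    print(f"{name}: accuracy = {accuracy_score(y_test, preds):.3f}, "
          f"ROC AUC = {roc_auc_score(y_test, probs):.3f}")
```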

Among these models, KNN and Random Forest achieved the highest AUC, so we tested these two further on the unseen data (test set). Random Forest ultimately produced the highest accuracy, making it our top choice. Because we want to focus on the factors that truly matter for churn, we refitted the Random Forest model after removing variables that appeared less important. This helps simplify our model and concentrate on the key drivers of churn.
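
A sketch of that refit, using the random forest's built-in feature importances; the 0.02 cutoff is an illustrative threshold, not the project's exact value:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = models["Random Forest"]

# Rank features by how much each one contributes to the forest's splits
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

# Keep only the more important features and refit (0.02 is an illustrative cutoff)
keep = importances[importances > 0.02].index
rf_reduced = RandomForestClassifier(random_state=42)
rf_reduced.fit(X_train[keep], y_train)
```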

Deployment

After finalizing our model, we used Streamlit to create an interactive dashboard that lets users predict bank churn. We chose Streamlit because it is fast, simple, and makes it easy to build interactive features.
We first tested the dashboard locally to fine-tune the functionality and interface. Once we were satisfied with the app's performance, we deployed it to the cloud using Streamlit Cloud, making the dashboard easily accessible. This deployment ensured that the app was available for real-time predictions, offering a seamless user experience.
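
A minimal Streamlit sketch in the spirit of that dashboard; the saved-model filename and the handful of input fields shown here are illustrative, not the app's actual interface:

```python
import joblib
import pandas as pd
import streamlit as st

st.title("ChurnOrbit: Bank Churn Prediction")

# Load the trained model (assumed filename)
model = joblib.load("churn_rf.joblib")

# A few illustrative inputs; a real app would collect every feature the model was trained on
age = st.number_input("Age", min_value=18, max_value=100, value=40)
balance = st.number_input("Balance", min_value=0.0, value=50000.0)
is_active = st.selectbox("Active member?", ["Yes", "No"])

if st.button("Predict churn"):
    row = pd.DataFrame(
        [{"Age": age, "Balance": balance, "IsActiveMember": int(is_active == "Yes")}]
    )
    prob = model.predict_proba(row)[0, 1]
    st.metric("Estimated churn probability", f"{prob:.0%}")
```

Running streamlit run app.py serves the dashboard locally, and Streamlit Cloud can deploy the same script directly from the GitHub repository.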

Results

Our Team

Minu
Data Scientist

Katie
UI/UX

Nicole
Data Scientist

Shani
Data Scientist