This is a simple copy of the excellent tutorial by Blent.ai, “initiation au machine learning” (“introduction to machine learning”; all credit to them). You can download the Jupyter notebook here. You can run it on Google Colab, or in any Jupyter notebook environment.
Intro
We are given a dataset of bank clients. Our mission is to investigate the churn rate. For the full details of this mission, download the ipynb or check out the PDF below (it is basically the notebook already executed).
In what follows, I will sum up the most crucial parts.
Part 1: Data preparation
Load your CSV
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('dataset.csv')
data.head()
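Before going further, a quick structural check never hurts (my addition, not part of the original tutorial):
# Column types and non-null counts
data.info()
# Summary statistics of the numerical columns
data.describe()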
Normalize column names to lowercase
data.columns = [column.lower() for column in data.columns]
Remove unnecessary columns
data = data.drop(["rownumber", "customerid", "surname"], axis=1)
print(data.shape)
data.head()
Make sure no data is missing
# Number of missing values per column
data.isna().sum()
If data is missing, please refer to our previous “Datascience” tutorial from November 2019.
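For reference, here is a minimal sketch of imputing missing values, assuming only numerical columns are affected (this is my addition, not part of the tutorial):
# Fill missing numerical values with the column median, one simple strategy among many
data = data.fillna(data.median(numeric_only=True))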
Cleaning
cleaned_data = data.copy()
# Drop the outlier group: churned clients holding 4 products
cleaned_data = cleaned_data[~((cleaned_data['exited'] == 1) & (cleaned_data['numofproducts'] == 4))]
cleaned_data.shape
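To see why these rows count as outliers, a quick check of the churn rate per number of products (my addition) makes the anomaly visible:
# Average churn rate for each value of numofproducts
data.groupby('numofproducts')['exited'].mean()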
From qualitative to quantitative variables
# Features: every column except the target
X = cleaned_data.iloc[:, :-1].copy()
# Target: whether the client churned
y = cleaned_data['exited']
X.head()
- Binary (easy)
# Encode gender as 1 for Female, 0 otherwise
X['gender'] = data['gender'].apply(lambda x: 1 if x == "Female" else 0)
X.head()
- Multiple possibilities
# One-hot encode the geography column, then drop the original
X = X.join(pd.get_dummies(data['geography']))
del X['geography']
X.head()
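One caveat: full one-hot encoding leaves the dummy columns linearly dependent (they always sum to 1). That is harmless for tree models, but if you later try a linear model you may prefer to drop one category, e.g. replacing the join above with this sketch:
# Same encoding, dropping the first category to avoid redundant columns
X = X.join(pd.get_dummies(data['geography'], drop_first=True))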
Part 2: Data Visualization
For one variable
# distplot in the original tutorial; deprecated since seaborn 0.11
sns.histplot(data['balance'], kde=True)
For two variables
# Age distribution of churners vs. non-churners
sns.histplot(data.loc[data['exited'] == 1, 'age'], kde=True, label="Churn")
sns.histplot(data.loc[data['exited'] == 0, 'age'], kde=True, label="Non churn")
plt.legend()
Box plots
sns.boxplot(x='numofproducts', y='age', data=data)
Pie chart
# Class proportions, with the second (minority) slice pulled out
data['exited'].value_counts().plot.pie(autopct=lambda x: '{:2.1f}%'.format(x), explode=[0, 0.1])
Part 3: Machine Learning
For this example, we choose the DecisionTreeClassifier model.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Hold out 30% of the data for testing; the fixed seed is my addition, for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
tree = DecisionTreeClassifier(max_depth=6)
tree.fit(X_train, y_train)
# fit performs the actual training
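Note that max_depth=6 is a hand-picked value. A minimal sketch of checking a few depths with cross-validation (my addition, not in the tutorial):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for a few candidate depths
for depth in [3, 6, 9, 12]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X_train, y_train, cv=5)
    print(depth, round(scores.mean(), 3))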
Model Evaluation and Visualization
from sklearn.metrics import accuracy_score
print("Train :", accuracy_score(y_train, tree.predict(X_train)))
print("Test :", accuracy_score(y_test, tree.predict(X_test)))
# Feature importances learned by the tree
features_imp = pd.DataFrame({
    'Variable': X.columns,
    'Importance': tree.feature_importances_,
})
features_imp
features_imp.set_index("Variable").sort_values(by="Importance").plot.barh(figsize=(14, 9))
# Enlarge the title, axis labels and tick labels
for item in ([plt.gca().title, plt.gca().xaxis.label, plt.gca().yaxis.label] +
             plt.gca().get_xticklabels() + plt.gca().get_yticklabels()):
    item.set_fontsize(13)
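You can also draw the fitted tree itself (assuming scikit-learn >= 0.21; this sketch is my addition):
from sklearn.tree import plot_tree

# Draw only the top levels so the plot stays readable
plt.figure(figsize=(14, 9))
plot_tree(tree, feature_names=list(X.columns), max_depth=2, filled=True)
plt.show()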
Results
Resources:
Download the ipynb