This is a simple copy of the excellent tutorial by Blent.ai, “initiation au machine learning” (“introduction to machine learning”; all credit to them). You can download the Jupyter notebook here. You can run it on Google Colab, or in any Jupyter notebook environment.
Intro
We are given a dataset of bank clients. Our mission is to investigate the churn rate. For the full details of this mission, download the ipynb or check out the PDF below (it is basically the notebook already executed).
In what follows, I will sum up the most crucial parts.
Part 1: Data preparation
Load your CSV
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('dataset.csv')
data.head()
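Before going further, a quick structural check never hurts (my addition, not part of the original tutorial):
# Column types and non-null counts
data.info()
# Summary statistics of the numerical columns
data.describe()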
Normalize column names to lowercase
data.columns = [column.lower() for column in data.columns]
Remove unnecessary columns
data = data.drop(["rownumber", "customerid", "surname"], axis=1)
print(data.shape)
data.head()
Make sure no data is missing
# Number of missing values per column
data.isna().sum()
If data is missing, please refer to our previous “Datascience” tutorial from November 2019.
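For reference, here is a minimal sketch of imputing missing values, assuming only numerical columns are affected (this is my addition, not part of the tutorial):
# Fill missing numerical values with the column median, one simple strategy among many
data = data.fillna(data.median(numeric_only=True))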
Cleaning
cleaned_data = data.copy()
# Drop the outlier group: churned clients holding 4 products
cleaned_data = cleaned_data[~((cleaned_data['exited'] == 1) & (cleaned_data['numofproducts'] == 4))]
cleaned_data.shape
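To see why these rows count as outliers, a quick check of the churn rate per number of products (my addition) makes the anomaly visible:
# Average churn rate for each value of numofproducts
data.groupby('numofproducts')['exited'].mean()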
From qualitative to quantitative variables
# Features: every column except the target
X = cleaned_data.iloc[:, :-1].copy()
# Target: whether the client churned
y = cleaned_data['exited']
X.head()
- Binary (easy)
# Encode gender as 1 for Female, 0 otherwise
X['gender'] = data['gender'].apply(lambda x: 1 if x == "Female" else 0)
X.head()
- Multiple possibilities
# One-hot encode the geography column, then drop the original
X = X.join(pd.get_dummies(data['geography']))
del X['geography']
X.head()
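One caveat: full one-hot encoding leaves the dummy columns linearly dependent (they always sum to 1). That is harmless for tree models, but if you later try a linear model you may prefer to drop one category, e.g. replacing the join above with this sketch:
# Same encoding, dropping the first category to avoid redundant columns
X = X.join(pd.get_dummies(data['geography'], drop_first=True))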
Part 2: Data Visualization
For one variable
# distplot in the original tutorial; deprecated since seaborn 0.11
sns.histplot(data['balance'], kde=True)
For two variables
# Age distribution of churners vs. non-churners
sns.histplot(data.loc[data['exited'] == 1, 'age'], kde=True, label="Churn")
sns.histplot(data.loc[data['exited'] == 0, 'age'], kde=True, label="Non churn")
plt.legend()
Box plots
sns.boxplot(x='numofproducts', y='age', data=data)
Pie chart
# Class proportions, with the second (minority) slice pulled out
data['exited'].value_counts().plot.pie(autopct=lambda x: '{:2.1f}%'.format(x), explode=[0, 0.1])
Part 3: Machine Learning
For this example, we choose the DecisionTreeClassifier model.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Hold out 30% of the data for testing; the fixed seed is my addition, for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
tree = DecisionTreeClassifier(max_depth=6)
tree.fit(X_train, y_train)
# fit performs the actual training
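Note that max_depth=6 is a hand-picked value. A minimal sketch of checking a few depths with cross-validation (my addition, not in the tutorial):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for a few candidate depths
for depth in [3, 6, 9, 12]:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X_train, y_train, cv=5)
    print(depth, round(scores.mean(), 3))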
Model Evaluation and Visualization
from sklearn.metrics import accuracy_score
print("Train :", accuracy_score(y_train, tree.predict(X_train)))
print("Test :", accuracy_score(y_test, tree.predict(X_test)))
# Feature importances learned by the tree
features_imp = pd.DataFrame({
    'Variable': X.columns,
    'Importance': tree.feature_importances_,
})
features_imp
features_imp.set_index("Variable").sort_values(by="Importance").plot.barh(figsize=(14, 9))
# Enlarge the title, axis labels and tick labels
for item in ([plt.gca().title, plt.gca().xaxis.label, plt.gca().yaxis.label] +
             plt.gca().get_xticklabels() + plt.gca().get_yticklabels()):
    item.set_fontsize(13)
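You can also draw the fitted tree itself (assuming scikit-learn >= 0.21; this sketch is my addition):
from sklearn.tree import plot_tree

# Draw only the top levels so the plot stays readable
plt.figure(figsize=(14, 9))
plot_tree(tree, feature_names=list(X.columns), max_depth=2, filled=True)
plt.show()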
Results
Resources:
Download the ipynb