Naive Bayes Classification in Python

In this use case, we build in Python the following Naive Bayes classifier (whose model predictions are shown in the 3D graph below) in order to classify a business as a retail shop or a hotel/restaurant/café according to the amount of fresh, grocery and frozen food bought during the year.

Test set observations (dots) and predictions (3D shape) of retail (=1) vs hotel/restaurant/café (=0) according to the amount of fresh, grocery and frozen food bought during the year.

We first import the needed libraries and load the dataset:

#importing libraries
import numpy as np
import pandas as pd
from matplotlib import cm

#loading the dataset
dataset = pd.read_csv('dataset.csv')

Here are the first 40 of the 9753 rows in the dataset:

First rows of the dataset: businesses classified as a retail shop or a hotel/restaurant/café according to the amount of fresh, grocery and frozen food bought during the year.
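
A preview like the one above can also be reproduced directly in the console; the following optional lines only use pandas features already imported:

#previewing the loaded data in the console
print(dataset.shape)      #number of rows and columns
print(dataset.head(40))   #first 40 rows
print(dataset.describe()) #basic statistics for each column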

Then we split the dataset into the predictors set X (fresh, grocery and frozen food amounts in euros) and the dependent variable y to predict (hotel/restaurant/café (=0) or retail (=1)):

#predictors: fresh, grocery and frozen food amounts (first three columns)
X = dataset.iloc[:, 0:3].values
#target: last column, hotel/restaurant/cafe (=0) or retail (=1)
y = dataset.iloc[:, -1].values

We then split the data into a training set and a test set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
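
As a quick optional check (assuming the labels are the integers 0 and 1, as in this dataset), we can verify that both splits contain the two classes in similar proportions:

#optional : class balance of the training and test sets (assumes 0/1 integer labels)
print("Training set class counts : "+str(np.bincount(y_train)))
print("Test set class counts : "+str(np.bincount(y_test)))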

Then we are ready to fit the classifier to the training set, which is done with the following code:

#fitting the classifier to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
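
Under the hood, GaussianNB estimates, for each class, a class prior and one Gaussian (mean and variance) per feature, and predicts the class with the highest posterior. The following optional sketch reproduces that decision by hand for one training sample; it relies on the fitted attributes classes_, class_prior_ and theta_, and on the per-class variances, which recent scikit-learn versions expose as var_ (older ones as sigma_):

#optional : reproducing the GaussianNB decision by hand for one sample
sample = X_train[0]
variances = getattr(classifier, 'var_', None)
if variances is None:
    variances = classifier.sigma_ #attribute name in older scikit-learn versions
log_prior = np.log(classifier.class_prior_)
log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * variances) + (sample - classifier.theta_) ** 2 / variances, axis=1)
print(classifier.classes_[np.argmax(log_prior + log_likelihood)]) #manual prediction
print(classifier.predict(sample.reshape(1, -1))[0]) #should print the same class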

Once trained, we can run the classifier on the training set and the test set in order to get the predictions.

#predicting the results on the training set
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
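
Besides the hard 0/1 predictions, GaussianNB also provides class membership probabilities through its predict_proba method, which is an optional way to inspect how confident the model is (columns are ordered as in classifier.classes_):

#optional : predicted probabilities for the first five test samples
print(classifier.classes_)
print(classifier.predict_proba(X_test[:5]))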

In order to evaluate the quality of the classifier, the following code computes the confusion matrices of the predictions on the training and test sets and, from them, the success ratio of the predictions:

from sklearn.metrics import confusion_matrix
cm_train = confusion_matrix(y_train, y_train_pred)
cm_test = confusion_matrix(y_test, y_test_pred)

print("Training set confusion matrix : \n"+str(cm_train))
print("Success ratio on training set : "+str(success_ratio(cm=cm_train))+"%")
print("Test set confusion matrix : \n"+str(cm_test))
print("Success ratio on test set : "+str(success_ratio(cm=cm_test))+"%")

The console shows the following confusion matrices and success ratios for the training and test sets:

Console output: confusion matrices and success ratios of the Naive Bayes classifier for businesses classified as a retail shop or a hotel/restaurant/café according to the amount of fresh, grocery and frozen food bought during the year.

Next, we display in a 3D graph the test set observations (dots) and predictions (3D shape) of retail (=1) vs hotel/restaurant/café (=0) according to the fresh, grocery and frozen food bought during the year, with the following code:

#3D display of predictions and test set observations
show3D(title="Test set observations (dots) and predictions (3D shape) of retail (=1) vs hostellery/restaurant/cafe (=0) according to fresh, grocery and frozen food bought during the year.",
x_colname = 'FRESH', y_colname = 'GROCERY', z_colname = 'FROZEN', c_colname = 'HOSTEL_OR_RETAIL',
x_train = X_train[:,0], y_train=X_train[:,1], z_train=X_train[:,2], c_train=y_train,
x_test = X_test[:,0], y_test = X_test[:,1], z_test = X_test[:,2], c_test=y_test,
mesh_nb_pts = 10**3, class_num=1,
classifier = classifier
)


Here is the 3D graph displayed for the test set:

Test set observations (dots) and predictions (3D shape) of retail (=1) vs hotel/restaurant/café (=0) according to the amount of fresh, grocery and frozen food bought during the year.

We can also display a 3D graph containing the training set observations with the following function call:

#3D display of predictions and training set observations
show3D(title="Training set observations (dots) and predictions (3D shape) of retail (=1) vs hostellery/restaurant/cafe (=0) according to fresh, grocery and frozen food bought during the year.",
x_colname = 'FRESH', y_colname = 'GROCERY', z_colname = 'FROZEN', c_colname = 'HOSTEL_OR_RETAIL',
x_train = X_train[:,0], y_train=X_train[:,1], z_train=X_train[:,2], c_train=y_train,
x_test = X_train[:,0], y_test = X_train[:,1], z_test = X_train[:,2], c_test=y_train,
mesh_nb_pts = 10**3, class_num=1,
classifier = classifier
)

which leads to the following 3D graph:

Training set observations (dots) and predictions (3D shape) of retail (=1) vs hotel/restaurant/café (=0) according to the amount of fresh, grocery and frozen food bought during the year.

In both graphs, we notice that the model predictions (the transparent green 3D shape) for classifying a business as retail or hotel/restaurant/café fit the observations classified as retail (black dots) remarkably well, whereas most of the red dots (hotel/restaurant/café) stay outside the green shape.

This script uses two functions, success_ratio and show3D, which are defined below:

# FUNCTIONS
#ratio (in %) of correct predictions in a 2x2 confusion matrix
def success_ratio(cm):
    total = cm[0][0] + cm[1][0] + cm[0][1] + cm[1][1]
    return 100*(cm[0][0] + cm[1][1]) / total
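
Since scikit-learn returns the confusion matrix as a NumPy array, the same ratio can also be written with the trace; this variant (success_ratio_generic is an illustrative name, not used by the script above) additionally works for more than two classes:

#equivalent formulation based on the trace of the confusion matrix
def success_ratio_generic(cm):
    return 100 * cm.trace() / cm.sum()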

import plotly
import plotly.graph_objs as go

#displays in a 3D space the model as a 3D mesh and the test set observations as a 3D scatter plot
def show3D(title, x_colname, y_colname, z_colname, c_colname, x_train, y_train, z_train, c_train, x_test, y_test, z_test, c_test, mesh_nb_pts, class_num, classifier):
    n = int(mesh_nb_pts ** (1. / 3)) #cubic root of mesh_nb_pts
    min_x = min(x_train)
    min_y = min(y_train)
    min_z = min(z_train)
    x_size = max(x_train) - min_x
    y_size = max(y_train) - min_y
    z_size = max(z_train) - min_z
    x_step = x_size / n
    y_step = y_size / n
    z_step = z_size / n

    #POSITIVE (1) PREDICTIONS AS A 3D MESH
    #regular 3D grid of n*n*n points spanning the range of the training data
    i = 0
    x = np.empty([n*n*n])
    y = np.empty([n*n*n])
    z = np.empty([n*n*n])
    for xi in range(0, n):
        for yi in range(0, n):
            for zi in range(0, n):
                x[i] = min_x + xi * x_step
                y[i] = min_y + yi * y_step
                z[i] = min_z + zi * z_step
                i = i + 1
    print("Grid of size "+str(n)+"x"+str(n)+"x"+str(n)+" generated (nb vertices = "+str(n*n*n)+").")
    #computing the predictions on the grid
    datagrid = pd.DataFrame({x_colname : x, y_colname : y, z_colname : z}) #the DataFrame sorts the columns lexicographically
    datagrid = datagrid[[x_colname, y_colname, z_colname]] #restore the correct column order
    p = classifier.predict(datagrid)
    print("Predictions on the grid computed.")

    #extracting the class_num-classified records from the predictions
    ss = sum(p) #number of 1-predictions on the grid (works because the labels are 0/1)
    xx = np.empty([ss])
    yy = np.empty([ss])
    zz = np.empty([ss])
    pp = np.empty([ss])
    j = 0
    for i in range(0, len(p)):
        if p[i] == class_num:
            xx[j] = x[i]
            yy[j] = y[i]
            zz[j] = z[i]
            pp[j] = p[i]
            j = j + 1

    print(str(ss)+" mesh coordinates extracted with "+str(class_num)+"-predictions.")

    #building the mesh for the predictions
    trace_preds = go.Mesh3d(
        x=xx, y=yy, z=zz,
        alphahull=5, opacity=0.2, color='rgb(0, 255, 0)',
        name='Predictions'
    )
    print("Mesh generated.")

    #OBSERVATIONS 3D POINTS
    #number of class 0 and class 1 observations
    s_pts = len(c_test)
    s_pts_0 = 0
    s_pts_1 = 0
    for i in range(0, s_pts):
        if c_test[i] == 0:
            s_pts_0 = s_pts_0 + 1
        elif c_test[i] == 1:
            s_pts_1 = s_pts_1 + 1
        else:
            print("ERROR : c_test["+str(i)+"] = "+str(c_test[i]))
            break

    #building two sets of points (x, y, z coordinates): one for the class 0 observations and another one for the class 1 observations
    x_pts_0 = np.empty([s_pts_0])
    y_pts_0 = np.empty([s_pts_0])
    z_pts_0 = np.empty([s_pts_0])
    x_pts_1 = np.empty([s_pts_1])
    y_pts_1 = np.empty([s_pts_1])
    z_pts_1 = np.empty([s_pts_1])
    j = 0
    k = 0
    for i in range(0, s_pts):
        if c_test[i] == 0:
            x_pts_0[j] = x_test[i]
            y_pts_0[j] = y_test[i]
            z_pts_0[j] = z_test[i]
            j = j + 1
        elif c_test[i] == 1:
            x_pts_1[k] = x_test[i]
            y_pts_1[k] = y_test[i]
            z_pts_1[k] = z_test[i]
            k = k + 1
        else:
            print("ERROR : c_test["+str(i)+"] = "+str(c_test[i]))
            break

    trace_obs_0 = go.Scatter3d(
        x=x_pts_0, y=y_pts_0, z=z_pts_0,
        mode='markers',
        marker=dict(size=2, line=dict(color='rgb(128, 0, 0)', width=0.5), color='rgb(128, 0, 0)', opacity=1),
        name='[red] Observations of class 0'
    )

    trace_obs_1 = go.Scatter3d(
        x=x_pts_1, y=y_pts_1, z=z_pts_1,
        mode='markers',
        marker=dict(size=2, line=dict(color='rgb(0, 0, 0)', width=0.5), color='rgb(0, 0, 0)', opacity=1),
        name='[black] Observations of class 1'
    )

    print("x from ", min_x, " to ", min_x+x_size/4)
    print("y from ", min_y, " to ", min_y+y_size/4)
    print("z from ", min_z, " to ", min_z+z_size/4)

    layout = go.Layout(
        title=title,
        scene=dict(
            xaxis=dict(title=x_colname, range=[min_x, min_x+x_size]),
            yaxis=dict(title=y_colname, range=[min_y, min_y+y_size]),
            zaxis=dict(title=z_colname, range=[min_z, min_z+z_size])
        )
    )

    fig = go.Figure(data=[trace_obs_0, trace_obs_1, trace_preds], layout=layout)
    plotly.offline.plot(fig)