Sem Spirit

Decision Tree Classification in Python

In this use case, we build a Decision Tree classifier in Python (its predictions are shown in the 3D graph below) to classify an individual's salary as big (>50K$) or not, according to age, level of education, and average number of weekly working hours.

Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

We first import the required libraries and load the dataset:

#importing libraries
import numpy as np
import pandas as pd
from matplotlib import cm

#loading the dataset
dataset = pd.read_csv('dataset.csv')

Here are the first 30 rows of the dataset, out of a total of 23414:

Dataset in Python of large salaries (if =1 then >50K else <=50K) according to age, education time and working hours.

Then we split the dataset into the predictor set X (age, education time and working hours) and the dependent variable y to predict (1 = big salary, 0 = no big salary):

X = dataset.iloc[:, 0:3].values   #first three columns: the predictors
y = dataset.iloc[:, -1].values    #last column: the label to predict
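
As a minimal sketch of what this slicing does, here is the same operation on a toy DataFrame. The column names are borrowed from the axis titles used later in this article, and the values are made up for illustration; neither is confirmed to match the real dataset.csv.

```python
import pandas as pd

# Toy stand-in for dataset.csv (illustrative values only, not real data)
toy = pd.DataFrame({
    'AGE': [25, 40, 58],
    'EDUCATION': [9, 13, 16],
    'WORKING_HOURS': [40, 50, 35],
    'BIG_SALARY': [0, 1, 1],
})

X_toy = toy.iloc[:, 0:3].values   # first three columns: the predictors
y_toy = toy.iloc[:, -1].values    # last column: the label to predict

print(X_toy.shape)  # (3, 3)
print(y_toy)        # [0 1 1]
```

Both results are plain NumPy arrays, which is the form passed to scikit-learn in the next steps.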

We then split the data into a training set (75%) and a test set (25%):

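#train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split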
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
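
To get a feel for the sizes this split produces, the sketch below applies the same call to placeholder arrays with the 23414 rows quoted above. The arrays are filled with zeros and only stand in for the real data; the import assumes a recent scikit-learn where train_test_split is found in sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 23414                         # dataset size quoted in the text
X_demo = np.zeros((n_rows, 3))         # placeholder predictors
y_demo = np.zeros(n_rows, dtype=int)   # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# The test size is rounded up, the training set gets the remainder
print(X_tr.shape, X_te.shape)  # (17560, 3) (5854, 3)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in each set on every run.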

We are now ready to fit the classifier to the training set, which is done with the following code:

#fitting the classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state=0)
classifier.fit(X_train, y_train)
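
With criterion = 'entropy', the tree chooses its splits by information gain, that is, by how much a split reduces the Shannon entropy of the labels. The sketch below only illustrates that formula on a tiny made-up label array; the helper names entropy and information_gain are hypothetical, and this is not scikit-learn's actual implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels: a perfectly balanced node, split into two pure halves
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]

print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A balanced two-class node has the maximum entropy of 1 bit, and a split that separates the classes perfectly recovers all of it.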

Once the classifier is fitted, we can run it on the training set and the test set to get the predictions:

#predicting the results on the training set and on the test set
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
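
The same predict method also works for a single new individual. Here is a minimal sketch on four made-up training rows (they are illustrative only and not taken from the dataset), with toy_clf and new_person as hypothetical names:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up training rows: [age, education time, weekly working hours]
X_toy = np.array([[22, 9, 30], [25, 10, 40], [45, 16, 50], [50, 14, 60]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
toy_clf.fit(X_toy, y_toy)

# predict expects a 2D array: one row per individual
new_person = np.array([[48, 15, 55]])
print(toy_clf.predict(new_person))        # [1]
print(toy_clf.predict_proba(new_person))  # [[0. 1.]]
```

predict_proba returns one column per class, here the proportions of classes 0 and 1 in the leaf that the individual falls into.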

To evaluate the quality of the classifier, we compute the confusion matrices of the predictions made on the training and test sets and, from them, the success ratio of the predictions (using the success_ratio function defined at the end of this article):

from sklearn.metrics import confusion_matrix
cm_train = confusion_matrix(y_train, y_train_pred)
cm_test = confusion_matrix(y_test, y_test_pred)

print("Training set confusion matrix : \n"+str(cm_train))
print("Success ratio on training set : "+str(success_ratio(cm=cm_train))+"%")
print("Test set confusion matrix : \n"+str(cm_test))
print("Success ratio on test set : "+str(success_ratio(cm=cm_test))+"%")

The console shows the following two confusion matrices and success ratios for the training and test sets:

Confusion matrix of the training and test sets predictions of large salaries (>50K) according to age, education time and working hours.
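
The success ratio printed here is the usual accuracy expressed as a percentage. As a hedged sketch on made-up labels (not the article's results), the four cells of a 2x2 confusion matrix can be unpacked to check this against scikit-learn's accuracy_score and to derive precision and recall for the big-salary class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm_demo = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm_demo.ravel()

ratio = 100 * (tn + tp) / cm_demo.sum()   # same formula as success_ratio
precision = tp / (tp + fp)                # share of predicted 1s that are right
recall = tp / (tp + fn)                   # share of actual 1s that are found

print(ratio, 100 * accuracy_score(y_true, y_pred))  # 80.0 80.0
print(precision, recall)                            # 0.75 0.75
```

Precision and recall are worth a look when the two classes are unbalanced, since accuracy alone can then look good even if the positive class is poorly predicted.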

Finally, we display in a 3D graph the test set observations (dots) and the predictions (3D shape) of large salaries (1 if >50K, 0 if <=50K) according to age, education time and weekly working hours, with the following code:

#3D display of predictions and test set observations
show3D(title="Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.",
    x_colname = 'AGE', y_colname = 'EDUCATION', z_colname = 'WORKING_HOURS', c_colname = 'BIG_SALARY',
    x_train = X_train[:,0], y_train=X_train[:,1], z_train=X_train[:,2], c_train=y_train,
    x_test = X_test[:,0], y_test = X_test[:,1], z_test = X_test[:,2], c_test=y_test,
    mesh_nb_pts = 10**3, class_num=1,
    classifier = classifier
)


Here is the 3D graph displayed with the test set:

Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

We can also display a 3D graph containing the training set observations with the following function call:

#3D display of predictions and training set observations
show3D(title="Training set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.",
    x_colname = 'AGE', y_colname = 'EDUCATION', z_colname = 'WORKING_HOURS', c_colname = 'BIG_SALARY',
    x_train = X_train[:,0], y_train=X_train[:,1], z_train=X_train[:,2], c_train=y_train,
    x_test = X_train[:,0], y_test = X_train[:,1], z_test = X_train[:,2], c_test=y_train,
    mesh_nb_pts = 10**3, class_num=1,
    classifier = classifier
)

This leads to the following 3D graph:

Training set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

In both graphs, we notice that the model predictions for classifying a salary as big (>50K), shown as the transparent green 3D shape, fit the actual observations of big salaries (black dots) remarkably well, whereas most of the red dots (normal and small salaries) stay outside the green shape.

This script uses the two functions success_ratio and show3D given below. In the actual script, they must be defined before the code above calls them:

# FUNCTIONS
#percentage of correct predictions: diagonal of the 2x2 confusion matrix over the total
def success_ratio(cm):
    total = cm[0][0] + cm[1][0] + cm[0][1] + cm[1][1]
    return 100*(cm[0][0] + cm[1][1]) / total

import plotly.offline
import plotly.graph_objs as go
#displays in a 3D space the model as a 3D mesh and the observations as a 3D scatter plot
def show3D(title, x_colname, y_colname, z_colname, c_colname, x_train, y_train, z_train, c_train, x_test, y_test, z_test, c_test, mesh_nb_pts, class_num, classifier):
    #cubic root of mesh_nb_pts, rounded because 1000 ** (1. / 3) evaluates to 9.999... in floating point
    n = int(round(mesh_nb_pts ** (1. / 3)))
    min_x = min(x_train)
    min_y = min(y_train)
    min_z = min(z_train)
    x_size = max(x_train) - min_x
    y_size = max(y_train) - min_y
    z_size = max(z_train) - min_z
    x_step = x_size / n
    y_step = y_size / n
    z_step = z_size / n

    #POSITIVE (1) PREDICTIONS AS A 3D MESH
    i = 0
    x = np.empty([n*n*n])
    y = np.empty([n*n*n])
    z = np.empty([n*n*n])
    for xi in range(0, n):
        for yi in range(0, n):
            for zi in range(0, n):
                x[i] = min_x + xi * x_step
                y[i] = min_y + yi * y_step
                z[i] = min_z + zi * z_step
                i = i + 1
    print("Grid of size "+str(n)+"x"+str(n)+"x"+str(n)+" generated (nb vertices = "+str(n*n*n)+").")
    #computing the predictions on the grid
    datagrid = pd.DataFrame({x_colname: x, y_colname: y, z_colname: z}) #older pandas versions sort the columns lexicographically
    datagrid = datagrid[[x_colname, y_colname, z_colname]] #fix the correct order of columns
    p = classifier.predict(datagrid.values) #the classifier was fitted on a plain array, so we pass one here too
    print("Predictions on the grid computed.")

    #extracting the class_num-classified records from the predictions
    if class_num not in (0, 1):
        raise ValueError("class_num must be 0 or 1, got " + str(class_num))
    ss = int(np.sum(p == class_num)) #number of grid points predicted as class_num

    xx = np.empty([ss])
    yy = np.empty([ss])
    zz = np.empty([ss])
    j = 0
    for i in range(0, len(p)): #every grid point, including the last one
        if p[i] == class_num:
            xx[j] = x[i]
            yy[j] = y[i]
            zz[j] = z[i]
            j = j + 1

    print(str(ss)+" mesh coordinates extracted with ", class_num, "-predictions.")

    #building the mesh for the predictions
    trace_preds = go.Mesh3d(
        x=xx, y=yy, z=zz,
        alphahull=5, opacity=0.2, color='rgb(0, 255, 0)',
        name='Predictions'
    )
    print("Mesh generated.")

    #OBSERVATIONS 3D POINTS
    #number of positive (1) and negative (0) observations
    s_pts = len(c_test)
    s_pts_0 = 0
    s_pts_1 = 0
    for i in range(0, s_pts):
        if c_test[i] == 0:
            s_pts_0 = s_pts_0 + 1
        elif c_test[i] == 1:
            s_pts_1 = s_pts_1 + 1
        else:
            print("ERROR : c_test["+str(i)+"] = "+str(c_test[i]))
            break

    #building two sets of points (x,y,z coordinates): one for the positive (1) observations and another one for the negative (0) observations
    x_pts_0 = np.empty([s_pts_0])
    y_pts_0 = np.empty([s_pts_0])
    z_pts_0 = np.empty([s_pts_0])
    x_pts_1 = np.empty([s_pts_1])
    y_pts_1 = np.empty([s_pts_1])
    z_pts_1 = np.empty([s_pts_1])
    j = 0
    k = 0
    for i in range(0, s_pts):
        if c_test[i] == 0:
            x_pts_0[j] = x_test[i]
            y_pts_0[j] = y_test[i]
            z_pts_0[j] = z_test[i]
            j = j + 1
        elif c_test[i] == 1:
            x_pts_1[k] = x_test[i]
            y_pts_1[k] = y_test[i]
            z_pts_1[k] = z_test[i]
            k = k + 1
        else:
            print("ERROR : c_test["+str(i)+"] = "+str(c_test[i]))
            break

    trace_obs_0 = go.Scatter3d(
        x=x_pts_0, y=y_pts_0, z=z_pts_0,
        mode='markers',
        marker=dict(size=2, line=dict(color='rgb(128, 0, 0)', width=0.5), color='rgb(128, 0, 0)', opacity=1),
        name='[red] Observations of class 0'
    )

    trace_obs_1 = go.Scatter3d(
        x=x_pts_1, y=y_pts_1, z=z_pts_1,
        mode='markers',
        marker=dict(size=2, line=dict(color='rgb(0, 0, 0)', width=0.5), color='rgb(0, 0, 0)', opacity=1),
        name='[black] Observations of class 1'
    )

    #axis ranges actually used in the layout below
    print("x from ", min_x, " to ", min_x+x_size)
    print("y from ", min_y, " to ", min_y+y_size)
    print("z from ", min_z, " to ", min_z+z_size)

    layout = go.Layout(
        title=title,
        scene=dict(
            xaxis=dict(title=x_colname, range=[min_x, min_x+x_size]),
            yaxis=dict(title=y_colname, range=[min_y, min_y+y_size]),
            zaxis=dict(title=z_colname, range=[min_z, min_z+z_size])
        )
    )

    fig = go.Figure(data=[trace_obs_0, trace_obs_1, trace_preds], layout=layout)
    plotly.offline.plot(fig)
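
The triple loop that fills the grid in show3D can also be written without explicit loops. The sketch below is one possible alternative using np.meshgrid, not part of the original script; make_grid is a hypothetical helper name, and the numeric arguments are arbitrary.

```python
import numpy as np

def make_grid(min_x, min_y, min_z, x_step, y_step, z_step, n):
    """Return an (n**3, 3) array of grid points, x varying slowest and z fastest."""
    xs = min_x + np.arange(n) * x_step
    ys = min_y + np.arange(n) * y_step
    zs = min_z + np.arange(n) * z_step
    # indexing='ij' keeps the same point order as nested x, y, z loops
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing='ij')
    return np.column_stack([gx.ravel(), gy.ravel(), gz.ravel()])

grid = make_grid(0.0, 10.0, 20.0, 1.0, 2.0, 3.0, n=3)
print(grid.shape)  # (27, 3)
print(grid[:4])    # first rows: z moves first, then y
```

The resulting array can be passed directly to classifier.predict, and a boolean mask such as grid[p == class_num] then selects the points predicted as the wanted class.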