Decision Tree Classification in R

In this use case, we build in R the following Decision Tree Classifier (whose model predictions are shown in the 3D graph below) in order to classify an individual's salary as big (>$50K) or not, according to age, level of education, and average number of weekly working hours.

Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

We first set the working directory and load the dataset:

setwd("[WORKING DIRECTORY]")
dataset = read.csv('dataset.csv')

Here are the first 40 rows of the 23,414 in the dataset:

Dataset of large salaries (1 if >50K, 0 if <=50K) according to age, education time and working hours.

Then we split the data into the training set and the test set:

library(caTools)
set.seed(123)
split = sample.split(dataset$BIG_SALARY, SplitRatio=0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
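sample.split from caTools draws a stratified split, keeping the proportion of BIG_SALARY = 1 roughly equal in the training and test subsets. For comparison, a plain (non-stratified) 75/25 split can be sketched in base R; the toy label vector below is illustrative only, not the salary dataset:

```r
set.seed(123)
#toy labels: 300 zeros and 100 ones, standing in for BIG_SALARY
y = rep(c(0, 1), times = c(300, 100))
#draw 75% of the indices at random for the training set
train_idx = sample(seq_along(y), size = round(0.75 * length(y)))
y_train = y[train_idx]
y_test  = y[-train_idx]
length(y_train)  #300
length(y_test)   #100
```

Unlike sample.split, this plain sample() gives no guarantee that the class ratio is preserved in both subsets, which is why stratified splitting is preferable for imbalanced labels.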

We are then ready to fit the classifier to the training set, which is done with the following code:

#install.packages('rpart')
library(rpart)
classifier = rpart(formula = BIG_SALARY ~ .,
                   data = training_set,
                   method = 'class',
                   control = rpart.control(minsplit = 2, cp = 0.000005))
#summary(classifier)
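With minsplit = 2 and a near-zero cp, the tree is allowed to grow very deep, which risks overfitting; rpart records a cross-validated error for every complexity value, so the tree can be pruned back afterwards. A sketch on R's built-in iris data (illustrative only, not the salary dataset):

```r
library(rpart)
set.seed(123)
#deliberately overgrown tree, as in the tutorial's settings
fit = rpart(Species ~ ., data = iris, method = 'class',
            control = rpart.control(minsplit = 2, cp = 0.000005))
printcp(fit)                    #cross-validated error for each cp value
pruned = prune(fit, cp = 0.01)  #cut the tree back at a larger complexity threshold
```

Picking the cp with the lowest cross-validated error in the printcp table is the usual way to choose where to prune.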

Once ready, we can run the classifier on the training set and the test set in order to get the predictions.

#predicting the training and test set results (type = 'class' returns the predicted labels)
y_train_pred = predict(classifier, newdata = training_set[-4], type = 'class')
y_test_pred = predict(classifier, newdata = test_set[-4], type = 'class')
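The type argument of predict.rpart controls the output format: 'class' returns the predicted labels (needed to build the confusion matrices below), while 'prob' returns one probability column per class. For illustration on the built-in iris data:

```r
library(rpart)
fit = rpart(Species ~ ., data = iris, method = 'class')
labels = predict(fit, newdata = iris, type = 'class')  #factor of predicted class labels
probs  = predict(fit, newdata = iris, type = 'prob')   #matrix, one column per class
head(labels)
head(probs)
```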

In order to evaluate the quality of the classifier, the following code computes the two confusion matrices of the predictions made on the training and test sets and, from them, the success ratio of the predictions:

#Building the confusion matrix
cm_train = table(training_set[, 4], y_train_pred)
cm_test = table(test_set[, 4], y_test_pred)

cm_train_str = capture.output(show(cm_train))
writeLines(c(
"Training set confusion matrix : ",
cm_train_str,
paste("Success ratio on training set : ", toString(success_ratio(cm=cm_train)), "%")
))

cm_test_str = capture.output(show(cm_test))
writeLines(c(
"Test set confusion matrix : ",
cm_test_str,
paste("Success ratio on test set : ", toString(success_ratio(cm=cm_test)), "%")
))
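The success ratio reported here is the standard accuracy: the diagonal of the confusion matrix (correct predictions) divided by the total number of predictions. On a toy 2x2 matrix (values illustrative only):

```r
#toy confusion matrix: rows = actual class, columns = predicted class
cm = matrix(c(50, 8,
              5, 37),
            nrow = 2, byrow = TRUE,
            dimnames = list(actual = c(0, 1), predicted = c(0, 1)))
accuracy = 100 * sum(diag(cm)) / sum(cm)
accuracy  #87: (50 + 37) correct out of 100 predictions
```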

The console shows the following confusion matrices and success ratios for the training and test sets:

Confusion matrices of the training and test set predictions of large salaries (>50K) according to age, education time and working hours.

We can display the structure of the decision tree with the following code:

plot(classifier)
#text(classifier)
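plot(classifier) draws only the skeleton of the tree, and text(classifier) quickly becomes unreadable on a deep tree. The rpart.plot package (an extra dependency, assumed installed here) usually produces a much more legible rendering:

```r
#install.packages('rpart.plot')
library(rpart.plot)
#type and extra control the node labels; extra = 104 shows, for a
#classification tree, the class probabilities and observation percentages
rpart.plot(classifier, type = 2, extra = 104)
```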

Structure of the decision tree trained on the dataset of large salaries (>50K) according to age, education time and working hours.

Finally, we display in a 3D graph the test set observations (dots) and predictions (3D shape) of large salaries (1 if >50K, 0 if <=50K) according to age, education time and weekly working hours, with the following code:

show3D(title="Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.",
x_colname = 'AGE', y_colname = 'EDUCATION', z_colname = 'WORKING_HOURS', c_colname = 'BIG_SALARY',
x_train = training_set[,1], y_train=training_set[,2], z_train=training_set[,3], c_train=training_set[,4],
x_test = test_set[,1], y_test = test_set[,2], z_test = test_set[,3], c_test=test_set[,4],
mesh_nb_pts = 10**3,
classifier = classifier
)

Here is the 3D graph displayed for the test set:

Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

We can also display a 3D graph containing the training set observations with the following function call:

show3D(title="Training set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.",
x_colname = 'AGE', y_colname = 'EDUCATION', z_colname = 'WORKING_HOURS', c_colname = 'BIG_SALARY',
x_train = training_set[,1], y_train=training_set[,2], z_train=training_set[,3], c_train=training_set[,4],
x_test = training_set[,1], y_test = training_set[,2], z_test = training_set[,3], c_test=training_set[,4],
mesh_nb_pts = 10**3,
classifier = classifier
)

which leads to the following 3D graph:

Training set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

In both graphs, we notice that the model predictions (the green transparent 3D shape) for classifying a salary as big (>50K) fit the actual observations of big salaries (black dots) remarkably well, whereas most of the red dots (normal and small salaries) stay outside the green shape.

This script uses the two functions success_ratio and show3D, which are defined below:

#FUNCTIONS

success_ratio <- function(cm) {
  #accuracy in %: correct predictions (the diagonal) over all predictions
  return(100 * sum(diag(cm)) / sum(cm))
}

#install.packages('plotly')
library(plotly)
show3D <- function(title, x_colname, y_colname, z_colname, c_colname,
                   x_train, y_train, z_train, c_train,
                   x_test, y_test, z_test, c_test,
                   mesh_nb_pts, classifier) {
  #cubic root of mesh_nb_pts (rounded to avoid floating-point truncation, e.g. 1000^(1/3))
  n = as.integer(round(mesh_nb_pts ^ (1/3)))
  min_x = min(x_train)
  min_y = min(y_train)
  min_z = min(z_train)
  x_size = max(x_train) - min_x
  y_size = max(y_train) - min_y
  z_size = max(z_train) - min_z
  x_step = x_size / (n - 1)
  y_step = y_size / (n - 1)
  z_step = z_size / (n - 1)

  #generating a regular n x n x n grid covering the training set ranges
  i = 1 #R vectors are 1-indexed
  x = array(dim = n*n*n)
  y = array(dim = n*n*n)
  z = array(dim = n*n*n)
  for (xi in 0:(n-1)) {
    for (yi in 0:(n-1)) {
      for (zi in 0:(n-1)) {
        x[i] = min_x + xi * x_step
        y[i] = min_y + yi * y_step
        z[i] = min_z + zi * z_step
        i = i + 1
      }
    }
  }
  print(paste("Grid of size ", n, "x", n, "x", n, " generated (nb vertices = ", (n*n*n), ")."))

  #computing the predictions on the grid
  datagrid = data.frame(x, y, z)
  colnames(datagrid) = c(x_colname, y_colname, z_colname)
  grid_pred = predict(classifier, newdata = datagrid, type = 'class')

  #POSITIVE (1) PREDICTIONS AS A 3D MESH
  xx = x[grid_pred == 1]
  yy = y[grid_pred == 1]
  zz = z[grid_pred == 1]

  #test set observations split by actual class
  x_pts_0 = x_test[c_test == 0]
  y_pts_0 = y_test[c_test == 0]
  z_pts_0 = z_test[c_test == 0]
  x_pts_1 = x_test[c_test == 1]
  y_pts_1 = y_test[c_test == 1]
  z_pts_1 = z_test[c_test == 1]

  plot_ly() %>%
    add_trace(x=as.vector(x_pts_0), y=as.vector(y_pts_0), z=z_pts_0, type = "scatter3d", mode="markers", name = "[red] Observations of class 0", marker = list(size = 3, color = 'red')) %>%
    add_trace(x=as.vector(x_pts_1), y=as.vector(y_pts_1), z=z_pts_1, type = "scatter3d", mode="markers", name = "[black] Observations of class 1", marker = list(size = 3, color = 'black')) %>%
    add_trace(x=as.vector(xx), y=as.vector(yy), z=zz, type = "mesh3d", name = "Predictions", alphahull=5, opacity=0.2, colors=c('#00FF00')) %>%
    layout(
      title = title,
      scene = list(
        xaxis = list(title = x_colname),
        yaxis = list(title = y_colname),
        zaxis = list(title = z_colname)
      )
    )
}
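The triple loop in show3D can also be written with base R's expand.grid, which builds the same n x n x n lattice in one vectorized call. A sketch with illustrative axis bounds (in show3D these come from the training set):

```r
n = 10
#illustrative ranges standing in for the training set min/max of each column
xs = seq(20, 90, length.out = n)   #e.g. AGE
ys = seq(1, 16, length.out = n)    #e.g. EDUCATION
zs = seq(1, 99, length.out = n)    #e.g. WORKING_HOURS
#one row per vertex of the 10 x 10 x 10 grid
datagrid = expand.grid(AGE = xs, EDUCATION = ys, WORKING_HOURS = zs)
nrow(datagrid)  #1000 vertices
```

Besides being shorter, expand.grid names the columns directly, so no separate colnames() assignment is needed before calling predict on the grid.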