Naive Bayes Classification in R

In this use case, we build in R the following Naive Bayes classifier (whose model predictions are shown in the 3D graph below) in order to predict whether a business is a retail shop (=1) or a hotel/restaurant/café (=0) according to the amounts of fresh, grocery and frozen food it bought during the year.
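Before diving into the code, it can help to recall what a Gaussian Naive Bayes classifier actually computes: for each class, it multiplies the class prior by the normal density of every feature (features being assumed conditionally independent), then picks the class with the highest score. Here is a minimal sketch on a single feature, with made-up numbers (the priors, means and standard deviations below are purely hypothetical):

```r
#hypothetical class priors and class-conditional mean/sd of one feature
prior = c("0" = 0.6, "1" = 0.4)
mu    = c("0" = 2000, "1" = 9000)
sigma = c("0" = 800,  "1" = 2500)

x_new = 8000 #feature value of the observation to classify

#unnormalized posterior score of each class: prior * normal density
score = prior * dnorm(x_new, mean = mu, sd = sigma)

#normalized posterior probabilities (class "1" clearly wins here)
score / sum(score)
```

With several features, the density term becomes the product of one dnorm per feature, which is exactly what naiveBayes estimates from the per-class means and standard deviations of the training set.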

Test set observations (dots) and predictions (3D shape) of retail (=1) vs hotel/restaurant/café (=0) according to the fresh, grocery and frozen food bought during the year.

We first set the working directory and load the dataset:

setwd("[WORKING DIRECTORY]")
dataset = read.csv('dataset.csv')

Here follow the first 40 rows out of a total of 9753 in the dataset:

Dataset in R of businesses classified as a retail shop or a hotel/restaurant/café according to the amount of fresh, grocery and frozen food bought during the year

Then we split the data into a training set and a test set:

library(caTools)
set.seed(123)
split = sample.split(dataset$HOSTEL_OR_RETAIL, SplitRatio=0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
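Note that sample.split stratifies on the label passed as its first argument, so both subsets keep roughly the class balance of the full dataset. A quick sanity check on the split above (assuming the HOSTEL_OR_RETAIL column name used throughout this article):

```r
#class proportions in the full dataset and in both subsets;
#the three rows should be almost identical thanks to stratification
round(rbind(
  full     = prop.table(table(dataset$HOSTEL_OR_RETAIL)),
  training = prop.table(table(training_set$HOSTEL_OR_RETAIL)),
  test     = prop.table(table(test_set$HOSTEL_OR_RETAIL))
), 3)
```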

Then we are ready to fit the classifier to the training set, which is done with the following code:

#install.packages('e1071')
library(e1071)
classifier = naiveBayes(x = training_set[-4], y = as.factor(training_set$HOSTEL_OR_RETAIL))
#summary(classifier)
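The fitted object stores everything the model needs: the class counts behind the priors and, for each numeric feature, its per-class mean and standard deviation. Assuming the column names used in this article (FRESH, GROCERY, FROZEN), they can be inspected like this:

```r
#class counts used to estimate the priors
classifier$apriori

#per-class mean (1st column) and standard deviation (2nd column)
#of the FRESH feature
classifier$tables$FRESH
```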

Once ready, we can run the classifier on the training set and the test set in order to get the predictions.

#predicting the training and test set results
y_train_pred = predict(classifier, newdata=training_set[-4])
y_test_pred = predict(classifier, newdata=test_set[-4])
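predict returns hard class labels by default; passing type = "raw" (supported by e1071's predict method for naiveBayes objects) yields the posterior probability of each class instead, which can be useful to see how confident the classifier is:

```r
#posterior probabilities instead of class labels:
#one column per class, each row summing to 1
y_test_proba = predict(classifier, newdata = test_set[-4], type = "raw")
head(y_test_proba)
```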

In order to evaluate the quality of the classifier, the following code computes the two confusion matrices of the predictions made on the training and test sets and, from them, the success ratio of the predictions:

#Building the confusion matrices
cm_train = table(training_set[, 4], y_train_pred)
cm_test = table(test_set[, 4], y_test_pred)

cm_train_str = capture.output(show(cm_train))
writeLines(c(
"Training set confusion matrix : ",
cm_train_str,
paste("Success ratio on training set : ", toString(success_ratio(cm=cm_train)), "%")
))

cm_test_str = capture.output(show(cm_test))
writeLines(c(
"Test set confusion matrix : ",
cm_test_str,
paste("Success ratio on test set : ", toString(success_ratio(cm=cm_test)), "%")
))

The console shows the following confusion matrices and success ratios for the training and test sets:

Confusion matrix in R of the Naive Bayes classifier for businesses classified as a retail shop or a hotel/restaurant/café according to the amount of fresh, grocery and frozen food bought during the year

Finally, we display in a 3D graph the test set observations (dots) and predictions (3D shape) of retail (=1) vs hotel/restaurant/café (=0) according to the fresh, grocery and frozen food bought during the year, with the following code:

show3D(title="Test set observations (dots) and predictions (3D shape) of retail (=1) vs hostellery/restaurant/cafe (=0) according to fresh, grocery and frozen food bought during the year.",
x_colname = 'FRESH', y_colname = 'GROCERY', z_colname = 'FROZEN', c_colname = 'HOSTEL_OR_RETAIL',
x_train = training_set[,1], y_train=training_set[,2], z_train=training_set[,3], c_train=training_set[,4],
x_test = test_set[,1], y_test = test_set[,2], z_test = test_set[,3], c_test=test_set[,4],
mesh_nb_pts = 10**3,
classifier = classifier
)

Here follows the 3D graph displayed with the test set :

Test set observations (dots) and predictions (3D shape) of retail (=1) vs hotel/restaurant/café (=0) according to the fresh, grocery and frozen food bought during the year.

We can also display a 3D graph containing the training set observations with the following function call:

show3D(title="Training set observations (dots) and predictions (3D shape) of retail (=1) vs hostellery/restaurant/cafe (=0) according to fresh, grocery and frozen food bought during the year.",
x_colname = 'FRESH', y_colname = 'GROCERY', z_colname = 'FROZEN', c_colname = 'HOSTEL_OR_RETAIL',
x_train = training_set[,1], y_train=training_set[,2], z_train=training_set[,3], c_train=training_set[,4],
x_test = training_set[,1], y_test = training_set[,2], z_test = training_set[,3], c_test=training_set[,4],
mesh_nb_pts = 10**3,
classifier = classifier
)

that leads to the following 3D graph:

Training set observations (dots) and predictions (3D shape) of retail (=1) vs hotel/restaurant/café (=0) according to the fresh, grocery and frozen food bought during the year.

In both graphs, we notice that the model predictions of the retail class (the green transparent 3D shape) fit the retail observations (black dots) remarkably well, whereas most of the red dots (hotel/restaurant/café observations) stay outside the green shape.

This script uses the two functions success_ratio and show3D, which are defined below:

#FUNCTIONS

success_ratio <- function(cm) {
  #percentage of correct predictions: diagonal of the confusion matrix over its total
  ratio = 100 * (cm[1, 1] + cm[2, 2]) / sum(cm)
  return(ratio)
}
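As a quick check of success_ratio, here is what it returns on a hypothetical 2x2 confusion matrix with 85 correct predictions (50 + 35 on the diagonal) out of 100:

```r
#hypothetical confusion matrix: rows = actual class, columns = predicted class
cm_toy = matrix(c(50, 5, 10, 35), nrow = 2)
success_ratio(cm = cm_toy) #85
```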

#install.packages('plotly')
library(plotly)
show3D <- function(title, x_colname, y_colname, z_colname, c_colname, x_train, y_train, z_train, c_train, x_test, y_test, z_test, c_test, mesh_nb_pts, classifier) {
  n = round(mesh_nb_pts ^ (1/3)) #cubic root of mesh_nb_pts
  min_x = min(x_train)
  min_y = min(y_train)
  min_z = min(z_train)
  x_size = max(x_train) - min_x
  y_size = max(y_train) - min_y
  z_size = max(z_train) - min_z
  x_step = x_size / n
  y_step = y_size / n
  z_step = z_size / n

  #POSITIVE (1) PREDICTIONS AS A 3D MESH
  #building a regular grid of n*n*n vertices over the training set ranges
  i = 1
  x = array(dim = n*n*n)
  y = array(dim = n*n*n)
  z = array(dim = n*n*n)
  for (xi in 0:(n-1)) {
    for (yi in 0:(n-1)) {
      for (zi in 0:(n-1)) {
        x[i] = min_x + xi * x_step
        y[i] = min_y + yi * y_step
        z[i] = min_z + zi * z_step
        i = i + 1
      }
    }
  }
  print(paste("Grid of size ", n, "x", n, "x", n, " generated (nb vertices = ", (n*n*n), ")."))

  #computing the predictions on the grid
  datagrid = data.frame(x, y, z)
  colnames(datagrid) = c(x_colname, y_colname, z_colname)
  grid_pred = predict(classifier, newdata = datagrid)

  #keeping only the grid vertices predicted as class 1 for the 3D mesh
  xx = x[grid_pred == 1]
  yy = y[grid_pred == 1]
  zz = z[grid_pred == 1]

  #splitting the observations according to their class
  x_pts_0 = x_test[c_test == 0]
  y_pts_0 = y_test[c_test == 0]
  z_pts_0 = z_test[c_test == 0]
  x_pts_1 = x_test[c_test == 1]
  y_pts_1 = y_test[c_test == 1]
  z_pts_1 = z_test[c_test == 1]

  plot_ly() %>%
    add_trace(x=as.vector(x_pts_0), y=as.vector(y_pts_0), z=z_pts_0, type = "scatter3d", mode="markers", name = "[red] Observations of class 0", marker = list(size = 3, color = 'red')) %>%
    add_trace(x=as.vector(x_pts_1), y=as.vector(y_pts_1), z=z_pts_1, type = "scatter3d", mode="markers", name = "[black] Observations of class 1", marker = list(size = 3, color = 'black')) %>%
    add_trace(x=as.vector(xx), y=as.vector(yy), z=zz, type = "mesh3d", name = "Predictions", alphahull=5, opacity=0.2, colors=c('#00FF00')) %>%
    layout(
      title = title,
      scene = list(
        xaxis = list(title = x_colname),
        yaxis = list(title = y_colname),
        zaxis = list(title = z_colname)
      )
    )
}