Sem Spirit

K-Means Clustering in R

The purpose here is to write a script in R that uses the k-Means method in order to partition in k meaningful clusters the dataset (shown in the 3D graph below) containing levels of three kinds of steroid hormones found in female or male foxes some living in protected regions and others in intensive hunting regions.

3D graph in R of the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

3D graph in R of the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

First, we set the working directory and we load our dataset :

setwd("[WORKING DIR]")
dataset = read.csv('fox-dataset.csv')
X = dataset[2:4]
y = dataset[,1:1]

The 40 first rows (over 484) of the dataset in R are as follows :

Dataset in R of the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

Dataset in R of the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

This dataset contains the level of three kinds of steroid hormones found in female or male foxes living in regions in which either they are intensively hunted, either they are protected.

In order to use the k-Means method we have to find the optimal number k of clusters for the given dataset. In some cases (as in the following), the so-called « elbow method » can be used to determine a nearly-optimal number k of clusters. When the elbow method is inefficient, the « silhouette » method may give a better result.

The elbow method consists in plotting in a graph the WCSS(x) value (within-cluster sums of squares) on y-axis according to the number x of clusters considered on the x-axis, the WCSS(x) value being the sum for all data points of the squared distance between one data point x_i of a cluster j and the centroid of this cluster j (as written in the formula below), after having partionned the dataset in x clusters with the k-means method.

WCSS(k) definition

WCSS(k) definition

The optimal number x_opt of clusters is given by the point (x_opt, WCSS(x_opt)) located in the part of the WCSS(x) plotted curve that looks like an « elbow ». We can find it visually or we can use a formal trick that consists in finding the (x_opt, WCSS(x_opt)) point of the WCSS(x) curve that maximize the distance with the « extremes » line starting from (1,WCSS(1)) and ending in (max_clusters,WCSS(max_clusters)) ; the distance from the « extremes » line to a point p of the curve being defined as the size of the segment starting from p and crossing the « extremes » line orthogonally.

In the following, we arbitrarly fix the maximum number of clusters to the squared root of the dataset size. This means that if the data is uniformly distributed and a partition is built with up to this maximum number of clusters, then the number of data points inside each cluster will be similar to the number of clusters, which is considered large enough here.

Given the dataset, the following function generates the WCSS curve data as a first step before finding the elbow point :

set.seed(124)
wcss_values = getWCSSData(X)

Then as a second step, the following function finds the « elbow » point in the given WCSS curve :

nb_clusters = seq(1, length(wcss_values), 1)
elbowPoint_info = getElbowPoint(x_values = nb_clusters, y_values = wcss_values)

We can plot the WCSS curve and the « elbow » point with the following function

showElbowGraph(nb_clusters, wcss_values)

which displays the following graph in which we can see in blue the WCSS curve, in green the « extremes » line between the start and end points of the WCSS curve, and in red the line orthogonal to the « extremes » line that crosses the WCSS curve in the « elbow » point maximizing the distance between the « extremes » line and the WCSS curve.

Finding in R the optimal number of cluster with the Elbow method : in blue the WCSS curve, in green the "extremes" line, and in red the "elbow" line that crosses the WCSS curve in the "elbow" point.

Finding in R the optimal number of cluster with the Elbow method : in blue the WCSS curve, in green the « extremes » line, and in red the « elbow » line that crosses the WCSS curve in the « elbow » point.

The optimal number of clusters is given by elbowPoint_info[1] = 5 which is close to the initial number 4 of classes in the observations. Once we found the optimal number of clusters we can launch the k-means algorithm with this number and get the resulting kmeans classification of the given X data points, as follows :

opt_nb_clusters = elbowPoint_info[1]
# Applying k-means to the mall dataset
set.seed(124)
kmeans <- kmeans(X, opt_nb_clusters, iter.max = 300, nstart = 10)

We can visualize the distribution of the data inside the clusters in 3D with the following script :

#visualizing
clus = kmeans$cluster
p <- init3DGraph()
p <- setTrace(p, x=X[clus == 1,]$CORTISOL, y=X[clus == 1,]$PROGESTERONE, z=X[clus == 1,]$TESTOSTERONE, n='1', c='red') #match 2, green
p <- setTrace(p, x=X[clus == 2,]$CORTISOL, y=X[clus == 2,]$PROGESTERONE, z=X[clus == 2,]$TESTOSTERONE, n='2', c='blue') #match 1, blue
p <- setTrace(p, x=X[clus == 3,]$CORTISOL, y=X[clus == 3,]$PROGESTERONE, z=X[clus == 3,]$TESTOSTERONE, n='3', c='green') #match 3, black
p <- setTrace(p, x=X[clus == 4,]$CORTISOL, y=X[clus == 4,]$PROGESTERONE, z=X[clus == 4,]$TESTOSTERONE, n='4', c='black') #match 3, black
p <- setTrace(p, x=X[clus == 5,]$CORTISOL, y=X[clus == 5,]$PROGESTERONE, z=X[clus == 5,]$TESTOSTERONE, n='5', c='cyan') #match 0, red
show3DGraph(p, x_name='Cortisol', y_name='Progesterone', z_name='Testosterone')

that displays the following 3D graph in which each steroid hormone level corresponds to an axe and data points belonging to one of the 5 clusters are drawn with the corresponding color. The clusters are coloured in yellow.

3D graph in R showing the 5 clusters coloured with different colors and containing the data points from the dataset providing the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

3D graph in R showing the 5 clusters coloured with different colors and containing the data points from the dataset providing the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

Notice that if we merge the green cluster (#3) with the black one (#4), we get a classification similar to the initial one in the dataset. Even better : if we reassign the cluster labels so that they correspond to the initial classes given in the dataset (TYPE column) without changing the content of those clusters, we can evaluate the success ratio of the clustering by comparing it to the original classification.

clus2 = c(clus)

clus2[clus == 1] = 2
clus2[clus == 2] = 1
clus2[clus == 3] = 3
clus2[clus == 3] = 3
clus2[clus == 4] = 3
clus2[clus == 5] = 0

cm = table(y, clus2)
success_ratio(cm)

the printed success ratio shows that 86.77686% of the data points were assigned to the correct classes, which is not a bad result indeed.

If we visualize with the following code the final adjusted clustering result, we can see that the final clusters are very close to the original data classes.

p <- init3DGraph()
p <- setTrace(p, x=X[clus2 == 0,]$CORTISOL, y=X[clus2 == 0,]$PROGESTERONE, z=X[clus2 == 0,]$TESTOSTERONE, n='1', c='red')
p <- setTrace(p, x=X[clus2 == 1,]$CORTISOL, y=X[clus2 == 1,]$PROGESTERONE, z=X[clus2 == 1,]$TESTOSTERONE, n='2', c='blue')
p <- setTrace(p, x=X[clus2 == 2,]$CORTISOL, y=X[clus2 == 2,]$PROGESTERONE, z=X[clus2 == 2,]$TESTOSTERONE, n='3', c='green')
p <- setTrace(p, x=X[clus2 == 3,]$CORTISOL, y=X[clus2 == 3,]$PROGESTERONE, z=X[clus2 == 3,]$TESTOSTERONE, n='4', c='black')
show3DGraph(p, x_name='Cortisol', y_name='Progesterone', z_name='Testosterone')

Which displays the following 3D graph :

3D graph in R showing the 4 clusters after having merged the cyan and black clusters

3D graph in R showing the 4 clusters after having merged the cyan and black clusters

The script written in this page uses the functions defined below :

#FUNCTIONS

success_ratio <- function(cm) {
total_success = 0
total = 0
for(irow in 1:length(cm[,1])) {
for(icol in 1:length(cm[irow,])) {
if (irow == icol) {
total_success = total_success + cm[irow, icol]
}
total = total + cm[irow, icol]
}
}
return( (100*total_success)/total )
}

getElbowPoint <- function(x_values, y_values) {

# Max values to create line
max_x_x <- max(x_values)
max_x_y <- y_values[which.max(x_values)]
max_y_y <- max(y_values)
max_y_x <- x_values[which.max(y_values)]
max_df <- data.frame(x = c(max_y_x, max_x_x), y = c(max_y_y, max_x_y))

# Creating straight line between the max values
fit <- lm(max_df$y ~ max_df$x)

# Distance from point to line
distances <- c()
for(i in 1:length(x_values)) {
distances <- c(distances, abs(coef(fit)[2]*x_values[i] - y_values[i] + coef(fit)[1]) / sqrt(coef(fit)[2]^2 + 1^2))
}

# Max distance point
x_max_dist <- x_values[which.max(distances)]
y_max_dist <- y_values[which.max(distances)]

return(c(x_max_dist, y_max_dist, max(distances)))
}

getWCSSData <- function(X) {
wcss_values <- vector()
max_wcss_steps = sqrt(length(X[,1]))
for(i in 1:max_wcss_steps) {
wcss_values[i] <- sum(kmeans(X, i, iter.max = 1000)$withinss)
}
return(wcss_values)
}

showElbowGraph <- function(x_clusters, y_wcss) {
nb_wcss_values = length(y_wcss)
extremes_line_coef = (x_clusters[nb_wcss_values] - x_clusters[1]) / (y_wcss[nb_wcss_values] - wcss_values[1])
extremes_orth_line_coef = -1 / extremes_line_coef
elbowPoint_orth_proj = c(elbowPoint_info[1] + elbowPoint_info[3]/2, elbowPoint_info[2] + extremes_orth_line_coef * (elbowPoint_info[3]/2))

plot(x_clusters, y_wcss, type="b", main = 'WCSS value according to the number of clusters', xlab = 'Number of clusters', ylab = 'WCSS value')
lines(x=c(x_clusters[1], x_clusters[nb_wcss_values]), y=c(y_wcss[1], y_wcss[nb_wcss_values]), type="b", col='green')
lines(x=c(elbowPoint_info[1], elbowPoint_orth_proj[1]), y=c(elbowPoint_info[2], elbowPoint_orth_proj[2]), type="b", col='red')
}

library(plotly)
init3DGraph <- function() {
p<-plot_ly(evaluate = FALSE)
return(p)
}

setTrace <- function(p, x, y, z, n, c) {
p<-add_trace(p, x=as.vector(x),y=as.vector(y),z=as.vector(z), type="scatter3d", mode="markers", name = n, marker = list(size = 4, color = c), evaluate = FALSE)
return(p)
}

show3DGraph <- function(p, x_name, y_name, z_name) {
layout(p, scene = list(xaxis = list(title = x_name), yaxis = list(title = y_name), zaxis = list(title = z_name)))
}