
The purpose here is to write a script in R that uses the aggregative clustering method in order to partition in k meaningful clusters the dataset (shown in the 3D graph below) containing mesures (area, perimeter and asymmetry coefficient) of three different varieties of wheat kernels : Kama (red), Rosa (green) and Canadian (blue).

3D graph showing in R the dataset of mesures of AREA, PERIMETER and ASYMMETRY of three different varieties of wheat kernels : Kama (red), Rosa (green) and Canadian (blue)
First, we set the working directory and we load our dataset :
#loading the dataset
setwd("[WORKING DIRECTORY]")
dataset = read.csv('dataset-2-FINAL.csv')
X = dataset[1:3]
y = dataset[,4:4]
The 40 first rows (over 210) of the dataset in R are as follows :

Dataset in R containing mesures (area, perimeter and asymmetry coefficient) of three different varieties of wheat kernels : Kama (red), Rosa (green) and Canadian (blue)
It contains the mesures (AREA, PERIMETER and ASYMMETRY COEFFICIENT) of three different varieties of wheat kernels : Kama (red), Rosa (green) and Canadian (blue) and a fourth column (SPECIES.
An easy way to visualize the result of a clustering in R consists in displaying a dendrogram in order to partition the dataset into an optimal number of clusters.
The code below creates and displays the following dendrogram :
#computing dendrogram
dendrogram = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
plot(dendrogram,
main = paste('Dendrogram'),
xlab = 'Wheat Kernels',
ylab = 'Euclidean Distances')

Dendrogram in R for the hierarchical clustering of the dataset of three kernel wheat varieties.
A good cut of the dendrogram is the one that split the level whose minimum length of fork legs (distances between clusters) is greatest to the minimum lengths of all other levels, as shown below :

A cut of a Dendrogram
Therefore we conclude that the optimal number of clusters is 3. Now, we can partition the dataset into this optimal number of clusters with the following code :
#Launching the hierarchical clustering on the wheat kernel dataset with the optimal number of clusters
set.seed(123)
hc = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
y_hc = cutree(hc, 3)
Unfortunately, the clusters labels returned by the clustering method do not match the actual SPECIES column classifiction in the dataset (but the clustering assignments are still isomorphic to the classifiction). Therefore, in order to visualize with the same colors and to compare the results to the initial SPECIES classification we have to reassign the samples to the correct labels with this code :
y_hc2 = c(y_hc)
y_hc2[y_hc == 1] = 1
y_hc2[y_hc == 2] = 3
y_hc2[y_hc == 3] = 2
Then, we can visualize with – the same colors as above – the distribution of the data inside the clusters in 3D with the following script :
#visualizing
p <- init3DGraph()
p <- setTrace(p, x=X[y_hc2 == 1,]$AREA, y=X[y_hc2 == 1,]$PERIMETER, z=X[y_hc2 == 1,]$ASYMMETRY, n='1', c='red')
p <- setTrace(p, x=X[y_hc2 == 2,]$AREA, y=X[y_hc2 == 2,]$PERIMETER, z=X[y_hc2 == 2,]$ASYMMETRY, n='2', c='green')
p <- setTrace(p, x=X[y_hc2 == 3,]$AREA, y=X[y_hc2 == 3,]$PERIMETER, z=X[y_hc2 == 3,]$ASYMMETRY, n='3', c='blue')
show3DGraph(p, x_name='Area', y_name='Perimeter', z_name='Asymmetry')
that displays the following 3D graph in which each measures of area, perimeter and asymmetry coefficient corresponds to an axis and data points are drawn in a color corresponding to the cluster they belong to.

Results in R of the hierarchical clustering of the dataset of mesures of AREA, PERIMETER and ASYMMETRY of three different varieties of wheat kernels : Kama (red), Rosa (green) and Canadian (blue)
Then, we can evaluate the success ratio of the clustering with the following code :
cm = table(y, y_hc2)
success_ratio(cm)
which prints a not too bad score of 82.38095 % of success.
The scripts written in this page use the functions defined below :
#FUNCTIONS
success_ratio <- function(cm) {
total_success = 0
total = 0
for(irow in 1:length(cm[,1])) {
for(icol in 1:length(cm[irow,])) {
if (irow == icol) {
total_success = total_success + cm[irow, icol]
}
total = total + cm[irow, icol]
}
}
return( (100*total_success)/total )
}
library(plotly)
init3DGraph <- function() {
p<-plot_ly(evaluate = FALSE)
return(p)
}
setTrace <- function(p, x, y, z, n, c) {
p<-add_trace(p, x=as.vector(x),y=as.vector(y),z=as.vector(z), type="scatter3d", mode="markers", name = n, marker = list(size = 4, color = c), evaluate = FALSE)
return(p)
}
show3DGraph <- function(p, x_name, y_name, z_name) {
layout(p, scene = list(xaxis = list(title = x_name), yaxis = list(title = y_name), zaxis = list(title = z_name)))
}