The purpose here is to write an R script that uses the agglomerative (hierarchical) clustering method to partition into k meaningful clusters the dataset (shown in the 3D graph below) containing measurements (area, perimeter and asymmetry coefficient) of three varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue).

First, we set the working directory and load our dataset:

#loading the dataset
setwd("[WORKING DIRECTORY]")
dataset = read.csv('dataset-2-FINAL.csv')
X = dataset[1:3]   # features: AREA, PERIMETER, ASYMMETRY
y = dataset[, 4]   # true class labels: SPECIES

The first 40 rows (out of 210) of the dataset in R are as follows:

It contains the measurements (AREA, PERIMETER and ASYMMETRY COEFFICIENT) of three varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue), and a fourth column (SPECIES) holding the true class labels.
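The CSV itself is not reproduced here. Readers without the file can follow along with a small, throw-away synthetic stand-in that mimics the structure of the dataset (the column names come from the scripts below; the means are rough guesses, not the real data):

```r
set.seed(42)
# Synthetic stand-in for the wheat kernel dataset: 3 species, 10 kernels each
dataset <- data.frame(
  AREA      = c(rnorm(10, 14.5, 1.0), rnorm(10, 18.5, 1.0), rnorm(10, 11.9, 1.0)),
  PERIMETER = c(rnorm(10, 14.3, 0.5), rnorm(10, 16.1, 0.5), rnorm(10, 13.2, 0.5)),
  ASYMMETRY = c(rnorm(10, 2.7, 1.0),  rnorm(10, 3.6, 1.0),  rnorm(10, 4.8, 1.0)),
  SPECIES   = rep(1:3, each = 10)
)
X <- dataset[1:3]   # features
y <- dataset[, 4]   # true class labels
head(dataset)
```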

An easy way to choose the number of clusters in R consists in displaying a dendrogram and reading off a natural cut that partitions the dataset into an optimal number of clusters.

The code below creates and displays the following dendrogram:

#computing dendrogram
dendrogram = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
plot(dendrogram,
     main = 'Dendrogram',
     xlab = 'Wheat Kernels',
     ylab = 'Euclidean Distances')

A good cut of the dendrogram is the one made at the level where the minimum length of the fork legs (the vertical distances between successive merges) is greatest compared with the minimum lengths at all other levels, as shown below:

Therefore we conclude that the optimal number of clusters is 3. Now, we can partition the dataset into this optimal number of clusters with the following code:

#Launching the hierarchical clustering on the wheat kernel dataset with the optimal number of clusters
set.seed(123)
hc = hclust(dist(X, method = 'euclidean'), method = 'ward.D')
y_hc = cutree(hc, 3)
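The "longest fork legs" rule can also be checked numerically: hclust stores the merge heights, and the largest jump between consecutive heights marks the natural cut. A minimal self-contained sketch on synthetic data (not the wheat data; the rule is a heuristic, not a guarantee):

```r
set.seed(123)
# Two well-separated groups in 1D
x  <- c(rnorm(25, 0), rnorm(25, 50))
hc <- hclust(dist(x, method = 'euclidean'), method = 'ward.D')

# Merge heights grow with each merge; the largest jump between
# consecutive heights suggests where to cut the dendrogram
h <- hc$height
k <- length(h) + 1 - which.max(diff(h))   # here: 2 clusters

lab <- cutree(hc, k)   # one integer label per observation
table(lab)             # 25 points in each cluster
```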

Unfortunately, the cluster labels returned by the clustering method do not match the numbering of the actual SPECIES column in the dataset (the clustering assignment is still the same partition, up to a permutation of the labels). Therefore, in order to visualize the clusters with the same colors and to compare the results to the initial SPECIES classification, we have to remap the samples to the corresponding labels with this code:

#remapping cluster labels 2 and 3 to match the SPECIES numbering
y_hc2 = c(y_hc)
y_hc2[y_hc == 1] = 1
y_hc2[y_hc == 2] = 3
y_hc2[y_hc == 3] = 2
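Instead of hard-coding the permutation, the relabelling can be derived from the confusion matrix: for each cluster, take the most frequent true class. A sketch on toy labels (it assumes each cluster overlaps one class well, as is the case here):

```r
# Toy example: cluster labels 2 and 3 come out swapped relative to the truth
y    <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)   # ground-truth classes
y_hc <- c(1, 1, 1, 3, 3, 3, 2, 2, 2)   # raw cluster labels

cm    <- table(y_hc, y)          # rows: clusters, columns: true classes
map   <- apply(cm, 1, which.max) # majority true class for each cluster
y_hc2 <- unname(map[y_hc])       # relabelled assignments
all(y_hc2 == y)                  # TRUE here
```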

Then, we can visualize – with the same colors as above – the distribution of the data inside the clusters in 3D with the following script:

#visualizing
p <- init3DGraph()
p <- setTrace(p, x=X[y_hc2 == 1,]$AREA, y=X[y_hc2 == 1,]$PERIMETER, z=X[y_hc2 == 1,]$ASYMMETRY, n='1', c='red')
p <- setTrace(p, x=X[y_hc2 == 2,]$AREA, y=X[y_hc2 == 2,]$PERIMETER, z=X[y_hc2 == 2,]$ASYMMETRY, n='2', c='green')
p <- setTrace(p, x=X[y_hc2 == 3,]$AREA, y=X[y_hc2 == 3,]$PERIMETER, z=X[y_hc2 == 3,]$ASYMMETRY, n='3', c='blue')
show3DGraph(p, x_name='Area', y_name='Perimeter', z_name='Asymmetry')

which displays the following 3D graph, in which each measurement (area, perimeter and asymmetry coefficient) corresponds to an axis and the data points are drawn in the color of the cluster they belong to.

Then, we can evaluate the success ratio of the clustering with the following code:

cm = table(y, y_hc2)
success_ratio(cm)

which prints a fairly good success rate of 82.38095 %.
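The same score can be computed without loops: the correct assignments sit on the diagonal of the confusion matrix, so sum(diag(cm)) / sum(cm) gives the success ratio directly. A sketch on a toy matrix (made-up counts, not the wheat results):

```r
# Toy 3x3 confusion matrix: 50 correct assignments out of 60
cm <- matrix(c(18,  1,  1,
                2, 16,  2,
                1,  3, 16), nrow = 3, byrow = TRUE)

ratio <- 100 * sum(diag(cm)) / sum(cm)
ratio   # 83.33333
```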

The scripts written on this page use the functions defined below:

#FUNCTIONS

success_ratio <- function(cm) {
  total_success = 0
  total = 0
  for (irow in 1:nrow(cm)) {
    for (icol in 1:ncol(cm)) {
      if (irow == icol) {
        total_success = total_success + cm[irow, icol]
      }
      total = total + cm[irow, icol]
    }
  }
  return((100 * total_success) / total)
}

library(plotly)

init3DGraph <- function() {
  p <- plot_ly(evaluate = FALSE)
  return(p)
}

setTrace <- function(p, x, y, z, n, c) {
  p <- add_trace(p, x=as.vector(x), y=as.vector(y), z=as.vector(z), type="scatter3d", mode="markers", name=n, marker=list(size=4, color=c), evaluate=FALSE)
  return(p)
}

show3DGraph <- function(p, x_name, y_name, z_name) {
  # plotly:: prefix avoids the clash with graphics::layout
  plotly::layout(p, scene = list(xaxis = list(title = x_name), yaxis = list(title = y_name), zaxis = list(title = z_name)))
}