The purpose here is to write a Python script that uses the agglomerative clustering method to partition the dataset (shown in the 3D graph below) into k meaningful clusters. The dataset contains measurements (area, perimeter and asymmetry coefficient) of three different varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue).

3D graph showing the dataset of measurements of AREA, PERIMETER and ASYMMETRY of three different varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue)
First, we import the basic libraries and load our dataset:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
import sys
# loading the dataset
dataset = pd.read_csv('dataset-2-FINAL.csv')
X = dataset.iloc[:, [0, 1, 2]].values  # AREA, PERIMETER, ASYMMETRY
y_set = dataset.iloc[:, 3].values      # SPECIES labels, as a 1-D array
The first 30 rows (out of 210) of the dataset in Python are as follows:

Dataset in Python containing measurements (area, perimeter and asymmetry coefficient) of three different varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue)
It contains the measurements (AREA, PERIMETER and ASYMMETRY COEFFICIENT) of three different varieties of wheat kernels, Kama (red), Rosa (green) and Canadian (blue), and a fourth column (SPECIES) giving the actual variety of each kernel.
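For reference, a preview like the one above can be printed directly from the DataFrame, for instance:
# show the first 30 rows of the dataset to inspect the columns
print(dataset.head(30))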
One easy way to do clustering in Python consists in using a dendrogram to partition the dataset into an optimal number of clusters.
The following code creates the dendrogram and traverses its tree-like structure to retrieve the membership assignments between the data points and the clusters.
import scipy.cluster.hierarchy as sch
# build the linkage matrix of the hierarchical clustering (median linkage)
Z = sch.linkage(X, method='median')
den = sch.dendrogram(Z)
plt.title('Dendrogram for the clustering of the dataset on three different varieties of wheat kernels (Kama, Rosa and Canadian)')
plt.xlabel('Wheat kernels')
plt.ylabel('Euclidean distance in the space with dimensions AREA, PERIMETER and ASYMMETRY')
plt.show()
y_pred = getClusterAssignments(X, den)
This displays the following dendrogram:

Dendrogram for the hierarchical clustering of the dataset of three wheat kernel varieties.
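As an aside, instead of traversing the dendrogram structure, one could also cut the linkage tree directly with scipy's fcluster function; a minimal sketch, assuming we keep three clusters:
# alternative sketch: cut the linkage tree into three flat clusters
y_flat = sch.fcluster(Z, t=3, criterion='maxclust')  # labels in 1..3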
We can visualize the distribution of the data inside the clusters in 3D with the following script:
# visualising the clusters, one trace per cluster label
t1 = getTrace(X[y_pred == 1, 0], X[y_pred == 1, 1], X[y_pred == 1, 2], s=4, c='red', label='1')
t2 = getTrace(X[y_pred == 2, 0], X[y_pred == 2, 1], X[y_pred == 2, 2], s=4, c='green', label='2')
t3 = getTrace(X[y_pred == 3, 0], X[y_pred == 3, 1], X[y_pred == 3, 2], s=4, c='blue', label='3')
x=X[:,0]
y=X[:,1]
z=X[:,2]
showGraph("Clustering results of the wheat kernels dataset represented in the 3D space with dimensions for AREA, PERIMETER and ASYMMETRY", "AREA", [min(x),max(x)], "PERIMETER", [min(y),max(y)], "ASYMMETRY", [min(z)-1,max(z)], [t1,t2,t3])
This displays the following 3D graph, in which each of the measurements (area, perimeter and asymmetry coefficient) corresponds to an axis, and each data point is drawn in the color of the cluster it belongs to.

Results of the dendrogram clustering of the dataset of measurements of AREA, PERIMETER and ASYMMETRY of three different varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue)
Fortunately, the cluster assignments found in the dendrogram match the actual classification in the dataset (in column SPECIES). Therefore, we can evaluate the success ratio of the clustering with the following code:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_set, y_pred)
print("success ratio : ",success_ratio(cm=cm), "%")
This prints a success ratio of 85.2380952381 %.
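The success ratio is simply the proportion of data points on the diagonal of the confusion matrix, so the same value can also be obtained directly with numpy:
# equivalent computation: sum of the diagonal over the total of the confusion matrix
print("success ratio : ", 100 * np.trace(cm) / np.sum(cm), "%")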
A second way to partition a dataset into clusters in Python consists in using the AgglomerativeClustering class from the sklearn.cluster package, with the following code:
from sklearn.cluster import AgglomerativeClustering
# ward linkage merges, at each step, the two clusters whose fusion increases the within-cluster variance the least
hc = AgglomerativeClustering(n_clusters = 3, affinity = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(X)
In this case, we need to provide the number of clusters as input (here n_clusters = 3).
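If the number of clusters were not known in advance, it could be estimated beforehand, for instance with the elbow method implemented by the helper functions defined at the end of this page; a possible usage sketch:
# sketch: estimate the number of clusters with the elbow method helpers defined below
wcss_values = buildWCSSValues(X)
elbowIndex = getElbowPointIndex(wcss_values)
showWCSSElbowGraph(wcss_values, elbowIndex)
n_clusters = elbowIndex + 1  # wcss_values[i] corresponds to i + 1 clusters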
We can see the results of the clustering in a 3D graph with the following code:
# visualising the clusters, one trace per cluster label
t1 = getTrace(X[y_hc == 0, 0], X[y_hc == 0, 1], X[y_hc == 0, 2], s=4, c='red', label='1')
t2 = getTrace(X[y_hc == 1, 0], X[y_hc == 1, 1], X[y_hc == 1, 2], s=4, c='green', label='2')
t3 = getTrace(X[y_hc == 2, 0], X[y_hc == 2, 1], X[y_hc == 2, 2], s=4, c='blue', label='3')
x=X[:,0]
y=X[:,1]
z=X[:,2]
showGraph("Clustering results of the wheat kernels dataset represented in the 3D space with dimensions for AREA, PERIMETER and ASYMMETRY", "AREA", [min(x),max(x)], "PERIMETER", [min(y),max(y)], "ASYMMETRY", [min(z)-1,max(z)], [t1,t2,t3])
This displays the following 3D graph:

Results of the clustering with the AgglomerativeClustering class of the dataset of measurements of AREA, PERIMETER and ASYMMETRY of three different varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue)
If we reassign the cluster labels so that they correspond to the initial classes given in the dataset (column SPECIES), without changing the contents of those clusters, we can evaluate the success ratio of the clustering by comparing it to the original classification. This is done with the following code:
# relabel the clusters from 0, 1, 2 to 1, 2, 3 so that they match the SPECIES column
y_pred = np.array(y_hc)
y_pred[y_hc == 0] = 1
y_pred[y_hc == 1] = 2
y_pred[y_hc == 2] = 3
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_set, y_pred)
print("success ratio : ", success_ratio(cm=cm), "%")
The printed success ratio shows that 83.333333 % of the data points were assigned to the correct class. Like the previous one, this is not a bad result.
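The relabeling above was done by hand, after visually matching the clusters to the classes. A more systematic option, sketched below under the assumption that scipy is available (the helper name best_label_mapping is hypothetical), is to build a contingency table and solve the assignment problem that maximizes the number of matched points:
# hypothetical helper: map each cluster label to the class it best matches
from scipy.optimize import linear_sum_assignment

def best_label_mapping(y_true, y_clusters):
    labels_true = np.unique(y_true)
    labels_pred = np.unique(y_clusters)
    # contingency table: rows are true classes, columns are cluster labels
    cont = np.zeros((len(labels_true), len(labels_pred)), dtype=int)
    for i, t in enumerate(labels_true):
        for j, p in enumerate(labels_pred):
            cont[i, j] = np.sum((y_true == t) & (y_clusters == p))
    # assignment maximizing the matched counts (minimize the negated table)
    rows, cols = linear_sum_assignment(-cont)
    mapping = {labels_pred[c]: labels_true[r] for r, c in zip(rows, cols)}
    return np.array([mapping[c] for c in y_clusters])
With such a helper, y_pred = best_label_mapping(y_set, y_hc) would replace the manual relabeling above.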
The scripts on this page use the functions defined below:
def getClusterAssignments(X, den):
    # map each dendrogram link color to a cluster number
    cluster_idxs = dict()
    cluster_rows = dict()
    colors_nums = dict()
    col_num = 0
    for c in np.unique(den['color_list']):
        cluster_idxs[c] = []
        colors_nums[c] = col_num
        cluster_rows[col_num] = []
        col_num = col_num + 1
    # collect, for each color, the leaf positions its links touch;
    # leaf k is drawn at x = 5 + 10 * k, so legs touching a leaf yield an integer
    for c, pi in zip(den['color_list'], den['icoord']):
        for leg in pi[1:3]:
            i = (leg - 5.0) / 10.0
            if abs(i - int(i)) < 1e-5:
                cluster_idxs[c].append(int(i))
    # translate leaf positions into row labels and build the row -> cluster map
    rows_clusters = dict()
    for c, l in cluster_idxs.items():
        i_l = [den['ivl'][i] for i in l]
        col_num = colors_nums[c]
        cluster_rows[col_num] = i_l
        for i in i_l:
            rows_clusters[i] = col_num
    # build the predicted labels array, one entry per row of X
    y_pred = []
    for i in range(0, len(X)):
        y_pred = np.append(y_pred, int(rows_clusters[str(i)]))
    return y_pred
def success_ratio(cm):
    # percentage of the diagonal (correct assignments) over all entries of the confusion matrix
    total_success = 0
    total = 0
    for i in range(0, len(cm)):
        for j in range(0, len(cm[i])):
            if i == j: total_success = total_success + cm[i, j]
            total = total + cm[i, j]
    return (100 * total_success) / total
def getOptimalNumberOfClusters(X):
    wcss_values = buildWCSSValues(X)
    elbowIndex = getElbowPointIndex(wcss_values)
    # wcss_values[i] corresponds to i + 1 clusters
    return elbowIndex + 1
def buildWCSSValues(X):
    from sklearn.cluster import KMeans
    print("Building WCSS Data...")
    wcss_values = []
    # test cluster counts up to the square root of the number of data points
    tmax_clusters = int(math.sqrt(len(X)))
    stepstr = ''
    sys.stdout.write("Progression : ")
    for i in range(1, tmax_clusters):
        sys.stdout.write('\b' * len(stepstr))
        stepstr = str(i) + "/" + str(tmax_clusters - 1)
        sys.stdout.write(stepstr)
        kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
        kmeans.fit(X)
        wcss_values.append(kmeans.inertia_)
    return wcss_values
def getElbowPointIndex(wcss):
    # the elbow is the point of the WCSS curve farthest from the straight
    # line joining its first and last points
    curve = wcss
    nPoints = len(curve)
    allCoord = np.vstack((range(nPoints), curve)).T
    firstPoint = allCoord[0]
    lineVec = allCoord[-1] - allCoord[0]
    lineVecNorm = lineVec / np.sqrt(np.sum(lineVec ** 2))
    vecFromFirst = allCoord - firstPoint
    scalarProduct = np.sum(vecFromFirst * np.tile(lineVecNorm, (nPoints, 1)), axis=1)
    vecFromFirstParallel = np.outer(scalarProduct, lineVecNorm)
    vecToLine = vecFromFirst - vecFromFirstParallel
    distToLine = np.sqrt(np.sum(vecToLine ** 2, axis=1))
    idxOfBestPoint = np.argmax(distToLine)
    return idxOfBestPoint
def showWCSSElbowGraph(wcss_values, elbowIndex):
    max_wcss = max(wcss_values)
    max_clusters = len(wcss_values)
    nb_clusters = np.arange(1, max_clusters + 1, 1)
    # normalize both axes to [0, 1] so the reference lines are comparable
    wcss_r = np.array(wcss_values) / max_wcss
    nb_clusters_r = np.array(nb_clusters) / float(max_clusters)
    plt.plot(nb_clusters_r, wcss_r)
    lx1 = nb_clusters_r[0]
    ly1 = wcss_r[0]
    lx2 = nb_clusters_r[max_clusters - 1]
    ly2 = wcss_r[max_clusters - 1]
    # green line joining the first and last points of the curve
    plt.plot([lx1, lx2], [ly1, ly2], c='green')
    coef = (ly2 - ly1) / (lx2 - lx1)
    # red reference line highlighting the elbow point
    plt.plot([nb_clusters_r[elbowIndex], 1], [wcss_r[elbowIndex], wcss_r[elbowIndex] - coef], c='red')
    plt.title('WCSS value according to the number of clusters')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS value')
    xticks = nb_clusters_r[0::1]
    xticks_lab = nb_clusters[0::1]
    plt.xticks(xticks, xticks_lab)
    ticks = np.arange(0, 1, 0.05)
    yticks = np.round(ticks * max_wcss) / max_wcss
    plt.yticks(yticks, (yticks * max_wcss).astype(int))
    plt.show()
import plotly
import plotly.graph_objs as go
def getTrace(x, y, z, c, label, s=2):
    trace_points = go.Scatter3d(
        x=x, y=y, z=z,
        mode='markers',
        marker=dict(size=s, line=dict(color='rgb(0, 0, 0)', width=0.5), color=c, opacity=1),
        name=label
    )
    return trace_points
def showGraph(title, x_colname, x_range, y_colname, y_range, z_colname, z_range, traces):
    layout = go.Layout(
        title=title,
        scene=dict(
            xaxis=dict(title=x_colname, range=x_range),
            yaxis=dict(title=y_colname, range=y_range),
            zaxis=dict(title=z_colname, range=z_range)
        )
    )
    fig = go.Figure(data=traces, layout=layout)
    plotly.offline.plot(fig)