# Hierarchical Clustering in Python

The goal here is to write a Python script that uses agglomerative (hierarchical) clustering to partition into k meaningful clusters a dataset containing measures (area, perimeter and asymmetry coefficient) of three different varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue). The dataset is shown in the 3D graph below.

Figure: 3D graph of the AREA, PERIMETER and ASYMMETRY measures of the three wheat kernel varieties: Kama (red), Rosa (green) and Canadian (blue).

First, we import the basic libraries and load the dataset:
```
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import cmath as math  # complex math; the helpers below keep only the real part of sqrt
import sys

# loading the dataset
dataset = pd.read_csv('dataset-2-FINAL.csv')
X = dataset.iloc[:, [0, 1, 2]].values  # AREA, PERIMETER, ASYMMETRY
y_set = dataset.iloc[:, 3].values      # SPECIES (fourth column)
```

The first 30 rows (out of 210) of the dataset are as follows:

Figure: the dataset loaded in Python, with one row per wheat kernel.

It contains the measures (AREA, PERIMETER and ASYMMETRY COEFFICIENT) of the three wheat kernel varieties, Kama (red), Rosa (green) and Canadian (blue), plus a fourth column (SPECIES) that gives the variety of each kernel.

One easy way to do clustering in Python is to use a dendrogram to partition the dataset into an optimal number of clusters.

The following code creates the dendrogram and walks its tree-like structure to retrieve the assignment of each data point to a cluster:
```
import scipy.cluster.hierarchy as sch

# build the merge tree with median linkage and draw the dendrogram
Z = sch.linkage(X, method='median')
den = sch.dendrogram(Z)
plt.title('Dendrogram for the clustering of the dataset on three different varieties of wheat kernels (Kama, Rosa and Canadian)')
plt.xlabel('Wheat kernels')
plt.ylabel('Euclidean distance in the space with dimensions AREA, PERIMETER and ASYMMETRY')
plt.show()

# read the cluster of each data point from the dendrogram colors
y_pred = getClusterAssignments(X, den)
```
This displays the following dendrogram:

Figure: dendrogram for the hierarchical clustering of the dataset of three wheat kernel varieties.
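Reading the clusters back from the colors of the plot is one option. As a hedged alternative, SciPy can also cut the linkage matrix directly with `fcluster`. The sketch below is only an illustration: it runs on synthetic blobs (`X_demo`, a made-up stand-in for `X`) and uses Ward linkage instead of the median linkage used above, so that the three groups are recovered unambiguously.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Synthetic stand-in for X (assumption): three well-separated blobs of 20 points in 3D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [20.0, 0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 3)) for c in centers])

# Build the merge tree, then cut it so that at most 3 flat clusters remain.
Z_demo = sch.linkage(X_demo, method='ward')
labels = sch.fcluster(Z_demo, t=3, criterion='maxclust')

# fcluster numbers its clusters starting at 1.
print(np.unique(labels, return_counts=True))
```

With `criterion='maxclust'`, the number of clusters is chosen by the caller rather than by the color threshold of the dendrogram.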

We can visualize the distribution of the data inside the clusters in 3D with the following script:
```
# visualising the clusters: one trace per cluster label
t1 = getTrace(X[y_pred == 1, 0], X[y_pred == 1, 1], X[y_pred == 1, 2], s=4, c='red', label='1')
t2 = getTrace(X[y_pred == 2, 0], X[y_pred == 2, 1], X[y_pred == 2, 2], s=4, c='green', label='2')
t3 = getTrace(X[y_pred == 3, 0], X[y_pred == 3, 1], X[y_pred == 3, 2], s=4, c='blue', label='3')

x = X[:, 0]
y = X[:, 1]
z = X[:, 2]
showGraph("Clustering results of the wheat kernels dataset represented in the 3D space with dimensions for AREA, PERIMETER and ASYMMETRY",
          "AREA", [min(x), max(x)],
          "PERIMETER", [min(y), max(y)],
          "ASYMMETRY", [min(z) - 1, max(z)],
          [t1, t2, t3])
```
This displays the following 3D graph, in which each measure (area, perimeter and asymmetry coefficient) corresponds to an axis and each data point is drawn in the color of the cluster it belongs to.

Figure: results of the dendrogram clustering of the AREA, PERIMETER and ASYMMETRY measures of the three wheat kernel varieties: Kama (red), Rosa (green) and Canadian (blue).

Fortunately, the cluster labels read from the dendrogram match the actual classification in the dataset (column SPECIES). Therefore, we can evaluate the success ratio of the clustering with the following code:
```
from sklearn.metrics import confusion_matrix

# rows: actual species, columns: predicted clusters
cm = confusion_matrix(y_set, y_pred)
print("success ratio : ", success_ratio(cm=cm), "%")
```
This prints a success ratio of 85.2380952381 %.
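The success ratio is the share of the confusion matrix that sits on its diagonal: correctly assigned points divided by all points. With 210 kernels, 85.2380952381 % corresponds to 179 correctly labelled points (179 / 210). The short sketch below shows the same computation on a small made-up matrix (`cm_demo` is hypothetical, not the matrix obtained above).

```python
import numpy as np

# Hypothetical 3x3 confusion matrix: rows are true classes, columns are predicted clusters.
cm_demo = np.array([[50, 3, 2],
                    [4, 45, 1],
                    [0, 5, 40]])

# Correct assignments lie on the diagonal; the total is the sum of all cells.
ratio = 100 * np.trace(cm_demo) / cm_demo.sum()
print(ratio)
```

Here 135 of the 150 points are on the diagonal, which is the same quantity that `success_ratio` accumulates with its two loops.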

A second way to partition a dataset into clusters in Python is to use the AgglomerativeClustering class from the sklearn.cluster package, with the following code:
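```
from sklearn.cluster import AgglomerativeClustering

# Ward linkage only works with Euclidean distances, which is the default,
# so the distance argument is left out (`affinity` was removed in scikit-learn 1.4).
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
y_hc = hc.fit_predict(X)
```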
In this case, we need to provide the number of clusters as input (here n_clusters = 3).

We can see the results of the clustering in a 3D graph with the following code:
```
# visualising the clusters: fit_predict returns labels 0, 1 and 2
t1 = getTrace(X[y_hc == 0, 0], X[y_hc == 0, 1], X[y_hc == 0, 2], s=4, c='red', label='1')
t2 = getTrace(X[y_hc == 1, 0], X[y_hc == 1, 1], X[y_hc == 1, 2], s=4, c='green', label='2')
t3 = getTrace(X[y_hc == 2, 0], X[y_hc == 2, 1], X[y_hc == 2, 2], s=4, c='blue', label='3')

x = X[:, 0]
y = X[:, 1]
z = X[:, 2]
showGraph("Clustering results of the wheat kernels dataset represented in the 3D space with dimensions for AREA, PERIMETER and ASYMMETRY",
          "AREA", [min(x), max(x)],
          "PERIMETER", [min(y), max(y)],
          "ASYMMETRY", [min(z) - 1, max(z)],
          [t1, t2, t3])
```
This displays the following 3D graph:

Figure: results of the clustering with the AgglomerativeClustering class of the AREA, PERIMETER and ASYMMETRY measures of the three wheat kernel varieties: Kama (red), Rosa (green) and Canadian (blue).

If we relabel the clusters so that they correspond to the initial classes given in the dataset (column SPECIES), without changing the content of those clusters, we can evaluate the success ratio of the clustering by comparing it to the original classification. This is done with the following code:
```
# map the 0-based cluster labels to the labels used in column SPECIES
y_pred = np.array(y_hc)
y_pred[y_hc == 0] = 1
y_pred[y_hc == 1] = 2
y_pred[y_hc == 2] = 3

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_set, y_pred)
print("success ratio : ", success_ratio(cm=cm), "%")
```
The printed success ratio shows that 83.333333 % of the data points were assigned to the correct classes. Like the previous one, this is not a bad result.
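The relabelling above is written by hand, which works because cluster numbers are arbitrary and only the grouping matters. As a hedged sketch of how the mapping could be found automatically, the helper below (`match_labels`, a hypothetical name that is not part of the original scripts) counts how many points of each class fall in each cluster and lets `scipy.optimize.linear_sum_assignment` pick the one-to-one mapping with the most agreements. It assumes there are as many clusters as classes, and the toy arrays are made up for the illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(y_true, y_cluster):
    """Relabel clusters so that they agree as much as possible with y_true."""
    classes = np.unique(y_true)
    clusters = np.unique(y_cluster)
    # Contingency table: counts[i, j] = points of class i that fell in cluster j.
    counts = np.array([[np.sum((y_true == c) & (y_cluster == k)) for k in clusters]
                       for c in classes])
    # The solver minimises a cost, so negate the counts to maximise agreements.
    rows, cols = linear_sum_assignment(-counts)
    mapping = {clusters[j]: classes[i] for i, j in zip(rows, cols)}
    return np.array([mapping[k] for k in y_cluster])

# Toy data (assumption): classes are 1..3, cluster labels are 0..2 in another order.
y_true_demo = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
y_hc_demo = np.array([2, 2, 0, 0, 0, 0, 1, 1, 1])
y_matched = match_labels(y_true_demo, y_hc_demo)
print(y_matched)
```

The result can then be passed to `confusion_matrix` in place of the hand-mapped `y_pred`.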

The scripts on this page use the functions defined below:
```
def getClusterAssignments(X, den):
    """Return, for each row of X, the cluster number read from the dendrogram colors."""
    cluster_idxs = dict()
    cluster_rows = dict()
    colors_nums = dict()
    col_num = 0
    # give each dendrogram color a cluster number
    for c in np.unique(den['color_list']):
        cluster_idxs[c] = []
        colors_nums[c] = col_num
        cluster_rows[col_num] = []
        col_num = col_num + 1

    # leaves are drawn at x = 5, 15, 25, ...: keep the link ends that land on a leaf
    for c, pi in zip(den['color_list'], den['icoord']):
        for leg in pi[1:3]:
            i = (leg - 5.0) / 10.0
            if abs(i - int(i)) < 1e-5:
                cluster_idxs[c].append(int(i))

    # den['ivl'] holds the row index (as a string) of each leaf
    rows_clusters = dict()
    for c, l in cluster_idxs.items():
        i_l = [den['ivl'][i] for i in l]
        col_num = colors_nums[c]
        cluster_rows[col_num] = i_l
        for i in i_l:
            rows_clusters[i] = col_num

    y_pred = []
    for i in range(0, len(X)):
        y_pred = np.append(y_pred, int(rows_clusters[str(i)]))
    return y_pred


def success_ratio(cm):
    """Percentage of points that lie on the diagonal of the confusion matrix."""
    total_success = 0
    total = 0
    for i in range(0, len(cm)):
        for j in range(0, len(cm[i])):
            if i == j:
                total_success = total_success + cm[i, j]
            total = total + cm[i, j]
    return (100 * total_success) / total


def getOptimalNumberOfClusters(X):
    # X is passed through to buildWCSSValues, which needs the data
    wcss_values = buildWCSSValues(X)
    # 0-based index into wcss_values (the matching number of clusters is index + 1)
    elbowIndex = getElbowPointIndex(wcss_values)
    return elbowIndex


def buildWCSSValues(X):
    from sklearn.cluster import KMeans
    print("Building WCSS Data...")
    wcss_values = []
    tmax_clusters = int(math.sqrt(len(X)).real)
    stepstr = ''
    sys.stdout.write("Progression : ")
    for i in range(1, tmax_clusters):
        # erase the previous progress text before writing the new one
        sys.stdout.write('\b' * len(stepstr))
        stepstr = str(i) + "/" + str(tmax_clusters - 1)
        sys.stdout.write(stepstr)
        kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
        kmeans.fit(X)
        wcss_values.append(kmeans.inertia_)
    return wcss_values


def getElbowPointIndex(wcss):
    """Index of the point farthest from the line joining the first and last points."""
    curve = wcss
    nPoints = len(curve)
    allCoord = np.vstack((range(nPoints), curve)).T
    firstPoint = allCoord[0]
    lineVec = allCoord[-1] - allCoord[0]
    lineVecNorm = lineVec / np.sqrt(np.sum(lineVec**2))
    vecFromFirst = allCoord - firstPoint
    # broadcasting replaces np.matlib.repmat here
    scalarProduct = np.sum(vecFromFirst * lineVecNorm, axis=1)
    vecFromFirstParallel = np.outer(scalarProduct, lineVecNorm)
    vecToLine = vecFromFirst - vecFromFirstParallel
    distToLine = np.sqrt(np.sum(vecToLine ** 2, axis=1))
    idxOfBestPoint = np.argmax(distToLine)
    return idxOfBestPoint


def showWCSSElbowGraph(wcss_values, elbowIndex):
    max_wcss = max(wcss_values)
    max_clusters = len(wcss_values)
    nb_clusters = np.arange(1, max_clusters + 1, 1)
    # normalise both axes by their maximum
    wcss_r = np.array(wcss_values) / max_wcss
    nb_clusters_r = (1 * np.array(nb_clusters)) / max_clusters
    plt.plot(nb_clusters_r, wcss_r)
    # line from the first to the last point of the curve
    lx1 = nb_clusters_r[0]
    ly1 = wcss_r[0]
    lx2 = nb_clusters_r[max_clusters - 1]
    ly2 = wcss_r[max_clusters - 1]
    plt.plot([lx1, lx2], [ly1, ly2], c='green')
    coef = (ly2 - ly1) / (lx2 - lx1)
    plt.plot([nb_clusters_r[elbowIndex], 1], [wcss_r[elbowIndex], wcss_r[elbowIndex] - coef], c='red')
    plt.title('WCSS value according to the number of clusters')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS value')
    xticks = nb_clusters_r[0::1]
    xticks_lab = nb_clusters[0::1]
    plt.xticks(xticks, xticks_lab)
    ticks = np.arange(0, 1, 0.05)
    yticks = np.round(ticks * max_wcss) / max_wcss
    plt.yticks(yticks, (yticks * max_wcss).astype(int))
    plt.show()


import plotly
import plotly.graph_objs as go


def getTrace(x, y, z, c, label, s=2):
    """Build one 3D scatter trace (one cluster) for plotly."""
    trace_points = go.Scatter3d(
        x=x, y=y, z=z,
        mode='markers',
        marker=dict(size=s,
                    line=dict(color='rgb(0, 0, 0)', width=0.5),
                    color=c,
                    opacity=1),
        name=label
    )
    return trace_points


def showGraph(title, x_colname, x_range, y_colname, y_range, z_colname, z_range, traces):
    layout = go.Layout(
        title=title,
        scene=dict(
            xaxis=dict(title=x_colname, range=x_range),
            yaxis=dict(title=y_colname, range=y_range),
            zaxis=dict(title=z_colname, range=z_range)
        )
    )
    fig = go.Figure(data=traces, layout=layout)
    plotly.offline.plot(fig)
```
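The elbow helpers above rest on a simple geometric idea: join the first and last points of the WCSS curve with a straight line, then pick the point of the curve that lies farthest from that line. The sketch below restates that idea in a compact, NumPy-only form on made-up WCSS values (`elbow_index` and `wcss_demo` are hypothetical names used for the illustration, not part of the original scripts).

```python
import numpy as np

def elbow_index(values):
    """Index of the point farthest from the chord joining the first and last points."""
    pts = np.column_stack((np.arange(len(values)), values)).astype(float)
    chord = pts[-1] - pts[0]
    chord_unit = chord / np.linalg.norm(chord)
    rel = pts - pts[0]
    # Remove the component along the chord; what is left is perpendicular to it.
    along = rel @ chord_unit
    perp = rel - np.outer(along, chord_unit)
    return int(np.argmax(np.linalg.norm(perp, axis=1)))

# Made-up WCSS values for k = 1..6 (assumption): a sharp drop, then a plateau.
wcss_demo = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
k_best = elbow_index(wcss_demo) + 1  # the index is 0-based, cluster counts start at 1
print(k_best)
```

Because the index is 0-based while the WCSS list starts at one cluster, the suggested number of clusters is the index plus one.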