Sem Spirit

K-Means Clustering in Python

The purpose here is to write a script in Python that uses the k-Means method in order to partition in k meaningful clusters the dataset (shown in the 3D graph below) containing levels of three kinds of steroid hormones found in female or male foxes some living in protected regions and others in intensive hunting regions.

3D graph in Python of the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

3D graph in Python of the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

First, we import the basic libraries and we load our dataset :

#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import cmath as math
import sys

#loading the dataset
dataset = pd.read_csv('fox-3c-FINAL.csv')
X = dataset.iloc[:,[1,2,3]].values #The three columns of steroid levels : Cortisol, Progesterone and Testosterone
y_set = dataset.iloc[:,[0]].values #TYPE column with the four class initial assignment

The 30 first rows (over 484) of the dataset in Python is as follows :

Dataset in Python of the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

Dataset in Python of the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

It contains the level of three kinds of steroid hormones found in female or male foxes living in regions in which either they are intensively hunted, either they are protected.

In order to use the k-Means method we have to find the optimal number k of clusters for the given dataset. In some cases (as in the following), the so-called « elbow method » can be used to determine a nearly-optimal number k of clusters. When the elbow method is inefficient, the « silhouette » method may give a better result.

The elbow method consists in plotting in a graph the WCSS(x) value on y-axis according to the number x of clusters considered on the x-axis, the WCSS(x) value being the sum for all data points of the squared distance between one data point x_i of a cluster j and the centroid of this cluster j (as written in the formula below), after having partionned the dataset in x clusters with the k-means method.

WCSS(k) definition

WCSS(k) definition

The optimal number x_opt of clusters is given by the point (x_opt, WCSS(x_opt)) located in the part of the WCSS(x) plotted curve that looks like an « elbow ». We can find it visually or we can use a formal trick that consists in finding the (x_opt, WCSS(x_opt)) point of the WCSS(x) curve that maximize the distance with the « extremes » line starting from (1,WCSS(1)) and ending in (max_clusters,WCSS(max_clusters)) ; the distance from the « extremes » line to a point p of the curve being defined as the size of the segment starting from p and crossing the « extremes » line orthogonally.

In the following, we arbitrarly fix the maximum number of clusters to the squared root of the dataset size. This means that if the data is uniformly distributed and a partition is built with up to this maximum number of clusters, then the number of data points inside each cluster will be similar to the number of clusters, which is considered large enough here.

Given the dataset, the following function generates the WCSS curve data as a first step before finding the elbow point :

wcss_values = buildWCSSValues(X)

Then as a second step, the following function finds the « elbow » point in the given WCSS curve :

elbowIndex = getElbowPointIndex(wcss_values)

We can plot the WCSS curve and the « elbow » point with the following function

showWCSSElbowGraph(wcss_values=wcss_values, elbowIndex=elbowIndex)

which displays the following graph in which we can see in blue the WCSS curve, in green the « extremes » line between the start and end points of the WCSS curve, and in red the line orthogonal to the « extremes » line that crosses the WCSS curve in the « elbow » point maximizing the distance between the « extremes » line and the WCSS curve.

Finding in Python the optimal number of cluster with the Elbow method : in blue the WCSS curve, in green the "extremes" line, and in red the "elbow" line that crosses the WCSS curve in the "elbow" point.

Finding in Python the optimal number of cluster with the Elbow method : in blue the WCSS curve, in green the « extremes » line, and in red the « elbow » line that crosses the WCSS curve in the « elbow » point.

Here the elbowIndex = 4, which leads to a optimal number of clusters of n_opt_clusters=(elbowIndex+1) = 5 which is close to the initial number 4 of classes in the observations.
Once we found the optimal number of clusters we can launch k-means with this number and get the resulting y_kmeans classification of the given X data points, as follows :

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = n_opt_clusters, init = 'k-means++', max_iter=1000, n_init = 100, random_state=0)
y_kmeans = kmeans.fit_predict(X)

We can visualize the distribution of the data inside the clusters in 3D with the following script :

#visualising the clusters
centroids = getTrace(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s= 8, c = 'yellow', label='Centroids')
t1 = getTrace(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], X[y_kmeans == 0, 2], s= 4, c='red', label = '1') #match with red=1 initial class
t2 = getTrace(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], X[y_kmeans == 1, 2], s= 4, c='black', label = '2') #match with black=3 initial class
t3 = getTrace(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], X[y_kmeans == 2, 2], s= 4, c='blue', label = '3') #match with blue=2 initial class
t4 = getTrace(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], X[y_kmeans == 3, 2], s= 4, c='green', label = '4') #match with green=0 initial class
t5 = getTrace(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], X[y_kmeans == 4, 2], s= 4, c='cyan', label = '5') #match with black=3 initial class

x=X[:,0]
y=X[:,1]
z=X[:,2]
showGraph("Steroïds levels in male/female foxes depending on whether they live or not in an intensive hunting region", "Cortisol", [min(x),max(x)], "Progesterone", [min(y),max(y)], "Testosterone", [min(z)-1,max(z)], [t1,t2,t3,t4,t5,centroids])

that displays the following 3D graph in which each steroid hormone level corresponds to an axe and data points belonging to one of the 5 clusters are drawn with the corresponding color. The clusters are coloured in yellow.

3D graph in Python showing the 5 clusters coloured with different colors and containing the data points from the dataset providing the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

3D graph in Python showing the 5 clusters coloured with different colors and containing the data points from the dataset providing the level of steroid hormones (cortisol, progesterone, testosterone) in foxes whether or not they are under selective pressure

Notice that if we merge the cyan cluster (#4) with the black one (#1), we get a classification similar to the initial one in the dataset. Even better : if we reassign the cluster labels so that they correspond to the initial classes given in the dataset (TYPE column) without changing the content of those clusters, we can evaluate the success ratio of the clustering by comparing it to the original classification.

y_pred = np.array(y_kmeans)
y_pred[y_kmeans == 0] = 1
y_pred[y_kmeans == 1] = 3
y_pred[y_kmeans == 2] = 2
y_pred[y_kmeans == 3] = 0
y_pred[y_kmeans == 4] = 3

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_set, y_pred)

print("success ratio : ",success_ratio(cm=cm), "%")

the printed success ratio shows that 86.78% of the data points were assigned to the correct classes, which is not a bad result.

If we visualize with the following code the final adjusted clustering result, we can see that the final clusters are very close to the original data classes.

centroids = getTrace(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s= 8, c = 'yellow', label='Centroids')
t1 = getTrace(X[y_pred == 0, 0], X[y_pred == 0, 1], X[y_pred == 0, 2], s= 4, c='red', label = '1')
t2 = getTrace(X[y_pred == 1, 0], X[y_pred == 1, 1], X[y_pred == 1, 2], s= 4, c='blue', label = '2')
t3 = getTrace(X[y_pred == 2, 0], X[y_pred == 2, 1], X[y_pred == 2, 2], s= 4, c='green', label = '3')
t4 = getTrace(X[y_pred == 3, 0], X[y_pred == 3, 1], X[y_pred == 3, 2], s= 4, c='black', label = '4')

x=X[:,0]
y=X[:,1]
z=X[:,2]
showGraph("Steroïds levels in male/female foxes depending on whether they live or not in an intensive hunting region", "Cortisol", [min(x),max(x)], "Progesterone", [min(y),max(y)], "Testosterone", [min(z)-1,max(z)], [t1,t2,t3,t4,centroids])

Which displays the following 3D graph :

3D graph in Python showing the 4 clusters after having merged the cyan and black clusters

3D graph in Python showing the 4 clusters after having merged the cyan and black clusters

The script written in this page uses the functions defined below :

def success_ratio(cm):
total_success = 0;
total = 0
for i in range(0, len(cm)):
for j in range(0, len(cm[i])):
if i == j: total_success = total_success + cm[i, j]
total = total + cm[i, j]
return (100*total_success)/total

def getOptimalNumberOfClusters():
wcss_values = buildWCSSValues()
elbowIndex = getElbowPointIndex(wcss_values)
return elbowIndex

def buildWCSSValues(X):
from sklearn.cluster import KMeans
print("Building WCSS Data...")
wcss_values = []
tmax_clusters = int(math.sqrt(len(X)).real)
stepstr = ''
sys.stdout.write("Progression : ")
for i in range(1, tmax_clusters) :
sys.stdout.write('\b'*len(stepstr))
stepstr = str(i) + "/" + str(tmax_clusters - 1)
sys.stdout.write(stepstr)
kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter=300, n_init = 10, random_state=0)
kmeans.fit(X)
wcss_values.append(kmeans.inertia_)
return wcss_values

def getElbowPointIndex(wcss):
curve = wcss
nPoints = len(curve)
allCoord = np.vstack((range(nPoints), curve)).T
np.array([range(nPoints), curve])
firstPoint = allCoord[0]
lineVec = allCoord[-1] - allCoord[0]
lineVecNorm = lineVec / np.sqrt(np.sum(lineVec**2))
vecFromFirst = allCoord - firstPoint
scalarProduct = np.sum(vecFromFirst * np.matlib.repmat(lineVecNorm, nPoints, 1), axis=1)
vecFromFirstParallel = np.outer(scalarProduct, lineVecNorm)
vecToLine = vecFromFirst - vecFromFirstParallel
distToLine = np.sqrt(np.sum(vecToLine ** 2, axis=1))
idxOfBestPoint = np.argmax(distToLine)
return idxOfBestPoint

def showWCSSElbowGraph(wcss_values, elbowIndex):
max_wcss = max(wcss_values)
max_clusters = len(wcss_values)
nb_clusters = np.arange(1, max_clusters+1, 1)
wcss_r = np.array(wcss_values)/max_wcss
nb_clusters_r = (1 * np.array(nb_clusters))/max_clusters
plt.plot(nb_clusters_r, wcss_r)
lx1=nb_clusters_r[0]
ly1=wcss_r[0]
lx2=nb_clusters_r[max_clusters - 1]
ly2=wcss_r[max_clusters - 1]
plt.plot([lx1, lx2], [ly1, ly2], c='green')
coef = (ly2 - ly1)/(lx2 - lx1)
plt.plot([nb_clusters_r[elbowIndex], 1], [wcss_r[elbowIndex], wcss_r[elbowIndex] - coef], c='red')
plt.title('WCSS value according to the number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS value')
xticks = nb_clusters_r[0::1]
xticks_lab = nb_clusters[0::1]
plt.xticks(xticks, xticks_lab)
ticks = np.arange(0, 1, 0.05)
yticks = np.round(ticks * max_wcss) / max_wcss
plt.yticks(yticks, (yticks*max_wcss).astype(int))
plt.show()

import plotly
import plotly.graph_objs as go
def getTrace(x, y, z, c, label, s=2):
trace_points = go.Scatter3d(
x=x, y=y, z=z,
mode='markers',
marker=dict(size=s, line=dict(color='rgb(0, 0, 0)', width=0.5), color=c, opacity=1),
name=label
)
return trace_points;

def showGraph(title, x_colname, x_range, y_colname, y_range, z_colname, z_range, traces):
layout = go.Layout(
title=title,
scene = dict(
xaxis=dict(title=x_colname, range = x_range),
yaxis=dict(title=y_colname, range = y_range),
zaxis=dict(title=z_colname, range = z_range)
)
)

fig = go.Figure(data=traces, layout=layout)
plotly.offline.plot(fig)