Hierarchical Clustering

In data mining and machine learning, hierarchical clustering is a method that builds a hierarchy of clusters in order to analyse a dataset. Strategies for hierarchical clustering generally fall into two types:

– Agglomerative: This is an ascending (bottom-up) hierarchical classification. It starts from a situation where each individual is alone in its own class; the classes are then merged step by step into larger and larger classes. The qualifier "hierarchical" comes from the fact that it produces a hierarchy H containing the set of classes created at all stages of the algorithm.
– Divisive: This is a descending (top-down) approach: all observations begin in a single group, which is split recursively as one moves down the hierarchy.
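The agglomerative strategy can be illustrated with a minimal sketch. It assumes single linkage (the distance between two classes is the smallest distance between their members) and Euclidean distance. Neither choice comes from the text above, and the function name `agglomerate` and the toy points are made up for illustration. The returned merge history plays the role of the hierarchy H.

```python
from itertools import combinations
from math import dist


def single_linkage(points, a, b):
    """Smallest Euclidean distance between a member of a and a member of b."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points):
    """Naive agglomerative clustering (single linkage, assumed for this sketch).

    Returns the merge history as (class_a, class_b, tau) tuples, where the
    classes are frozensets of point indices and tau is the aggregation index
    at which the two classes were merged.
    """
    clusters = [frozenset([i]) for i in range(len(points))]  # everyone alone
    history = []
    while len(clusters) > 1:
        # Pick the two closest classes and merge them.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(points, pair[0], pair[1]),
        )
        history.append((a, b, single_linkage(points, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history


# Toy 2-D points, made up for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
history = agglomerate(points)
for a, b, tau in history:
    print(sorted(a), "+", sorted(b), "at tau =", round(tau, 3))
```

This version recomputes all pairwise distances at every step, so it is only suitable for small examples; library implementations use much more efficient update rules.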

In general, the results of a hierarchical clustering are represented in a dendrogram.

A dendrogram is the graphical representation of an ascending hierarchical classification. It is often presented as a binary tree whose leaves are the individuals, aligned on the x-axis. When two classes or two individuals merge at aggregation index τ, a vertical line is drawn from the abscissa of each of the two classes up to the ordinate τ, and the two lines are then connected by a horizontal segment. Drawing a horizontal line at a given ordinate τ cuts the dendrogram and reveals a classification: each branch crossed by the line corresponds to one class. More complex versions of the classification tree can possibly help build a decision tree.
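Reading a classification off a dendrogram at a given τ can be sketched in a few lines. The example below assumes the hierarchy is available as a list of merges with their aggregation index. The tuple format, the function name `cut_dendrogram`, the inclusive rule (merges at exactly τ are kept) and the toy heights are all assumptions made for this illustration.

```python
def cut_dendrogram(n_leaves, merges, tau):
    """Return the classes obtained by cutting the dendrogram at height tau.

    merges is a list of (leaves_a, leaves_b, height) tuples, one per merge
    (a hypothetical format chosen for this sketch). Merges with
    height <= tau are kept; this inclusive convention is an assumption.
    """
    clusters = [{i} for i in range(n_leaves)]  # every leaf starts alone
    for leaves_a, leaves_b, height in sorted(merges, key=lambda m: m[2]):
        if height > tau:
            break  # all remaining merges lie above the cut line
        merged = set(leaves_a) | set(leaves_b)
        # Replace the classes absorbed by this merge with their union.
        clusters = [c for c in clusters if not c <= merged] + [merged]
    return sorted(sorted(c) for c in clusters)


# Toy merge history for five individuals (made-up heights).
merges = [
    ({0}, {1}, 1.0),
    ({2}, {3}, 1.5),
    ({0, 1}, {2, 3}, 5.0),
    ({0, 1, 2, 3}, {4}, 9.9),
]

print(cut_dendrogram(5, merges, tau=2.0))  # [[0, 1], [2, 3], [4]]
print(cut_dendrogram(5, merges, tau=6.0))  # [[0, 1, 2, 3], [4]]
```

Raising τ moves the cut line up the tree and yields fewer, larger classes; lowering it yields more, smaller ones.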

A Dendrogram

USE CASE: THREE SPECIES OF WHEAT SEED

In this use case, 210 kernels belonging to three different varieties of wheat (Kama, Rosa and Canadian) have been randomly selected and examined. For each kernel, the dataset contains a row providing its area (column AREA), its perimeter (column PERIMETER), its asymmetry coefficient (column ASYMMETRY) and its variety (column SPECIES, containing the values 1 for Kama, 2 for Rosa and 3 for Canadian). The following 3D graph shows how the kernels are distributed in a space whose three dimensions correspond to the three columns AREA, PERIMETER and ASYMMETRY.

Measurements of AREA, PERIMETER and ASYMMETRY for three different varieties of wheat kernels: Kama (red), Rosa (green) and Canadian (blue)

The purpose here is to ignore the SPECIES column of the dataset and, using an agglomerative hierarchical clustering method, to find a k-partition of the dataset (a partition into k classes) that is close to the one provided by the SPECIES column.
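The text does not say how closeness between the computed partition and the SPECIES column is measured. One common option, used here purely as an assumption, is the Rand index: the fraction of pairs of individuals on which the two partitions agree (both put the pair in the same class, or both separate it). The label lists below are small made-up examples, not values from the wheat dataset.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of pairs of individuals on which two partitions agree.

    The actual label values do not matter, only which individuals share a
    label, so cluster ids need not match the reference class ids.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same individuals")
    if len(labels_a) < 2:
        raise ValueError("at least two individuals are required")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Made-up reference classes (in the spirit of a SPECIES column) and made-up
# cluster ids for nine individuals; one individual is placed differently.
reference = [1, 1, 1, 2, 2, 2, 3, 3, 3]
clusters = [2, 2, 2, 3, 3, 1, 1, 1, 1]

print(rand_index(reference, reference))  # identical partitions give 1.0
print(round(rand_index(reference, clusters), 3))
```

A value of 1.0 means the two partitions are identical up to a renaming of the classes. Chance-corrected variants such as the adjusted Rand index are often preferred when the numbers of classes differ.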

The first 40 rows (out of 210) of the dataset are shown below:

Dataset containing the area, perimeter and asymmetry coefficient of three varieties of wheat kernels.