K-means clustering is a data partitioning method and a combinatorial optimization problem. It is especially used in unsupervised learning. Given a set of points (as observations) and an integer k, the problem is to divide the given set of points into k partitions, often called clusters, in order to minimize a certain function. We consider the distance of a point to the average of the points of its cluster; the function to be minimized is the sum of the squares of these distances. There is a classic heuristic for this problem that is used in a wide range of applications. The problem is also studied as a classical optimization problem, with, for example, approximation algorithms.
USE CASE : FOXES UNDER SELECTIVE PRESSURE
In this imaginary use case, the dataset contains the level of three kinds of steroid hormones found in female or male foxes living in regions in which either they are intensively hunted, either they are protected. The following graph shows the dataset in a 3D space where each axis correspond to one steroid hormone level among Cortisol, Progesterone and Testosterone. The data points are organized in four classes (TYPE column) : Male and protected (value : 0, color : green), Male and intensively hunted (value : 1, color : red), Female and protected (value : 2, color : blue), Female and intensively hunted (value : 3, color : black). This graph shows that the higher the selective pressure on the animals, the higher their steroids hormones level, and the higher their survival abilities (high cortisol=high robustness and hunting abilities, high testosterone/progesterone=high fertility).
The purpose here is to ignore the TYPE column from the dataset and, using the K-means clustering method, to find a k-partition of the dataset close to the one provided by the TYPE column.
Here follows the 40 first rows (over 484) of the dataset :