
k-Nearest-Neighbors Classification in Python

The following shows how to write a Python script that uses the k-Nearest Neighbors (k-NN) method to classify whether a patient survived or died within 5 years of a breast cancer diagnosis, based on the patient's age and number of axillary nodes.

We start by importing the needed libraries and loading the dataset:

#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#loading dataset and separating predictors from predicted variable
dataset = pd.read_csv('dataset.csv')
X = dataset.iloc[:, :-1].values #patient ages and numbers of axillary nodes
y = dataset.iloc[:, -1].values #dead (1) or alive (0)

The ‘dataset’ is a table of 306 records, whose first 30 rows are shown below:

Dataset of patients that survived or died within 5 years after a breast cancer diagnosis, with each patient's age and number of axillary nodes.
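
The same first rows can be inspected locally with pandas (this print is not part of the original script; output omitted here):

print(dataset.head(30)) #first 30 rows of the 306-record dataset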

Then we divide the dataset into the training and test sets:

from sklearn.model_selection import train_test_split #sklearn.cross_validation is deprecated
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
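
As a quick sanity check (not part of the original script), the 25% split leaves 77 records in the test set and 229 in the training set:

print(X_train.shape, X_test.shape) #expected: (229, 2) (77, 2)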

Then we can fit our k-Nearest Neighbors classifier on the training data (with p=2, the Minkowski metric is simply the Euclidean distance):

from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_classifier.fit(X_train, y_train)
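
The value n_neighbors=5 is the scikit-learn default. As a minimal sketch (not part of the original script), a value of k could instead be selected by cross-validation on the training set; the 1 to 20 search range here is an arbitrary assumption:

#hypothetical selection of k by 5-fold cross-validation
from sklearn.model_selection import cross_val_score
for k in range(1, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(k, scores.mean())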

And finally, we compute the predictions for the training and the test sets:

#predicting the results on the training set
y_train_pred = knn_classifier.predict(X_train)

#predicting the results on the test set
y_test_pred = knn_classifier.predict(X_test)

In order to measure the quality of the classification, we use a tool called the "confusion matrix". Each column of the matrix gives the number of occurrences of a predicted class, while each row gives the number of occurrences of an actual (or reference) class. One advantage of the confusion matrix is that it quickly shows whether the classifier works correctly.
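
With the class coding used here (alive = 0, dead = 1) and scikit-learn's convention (rows = actual classes, columns = predicted classes), each 2x2 matrix reads as:

[[alive predicted as alive,  alive predicted as dead],
 [dead predicted as alive,   dead predicted as dead]]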

The following script builds the confusion matrix for the test set and for the training set:

#Making the confusion matrix on the test set
from sklearn.metrics import confusion_matrix
cm_test = confusion_matrix(y_test, y_test_pred)

#Making the confusion matrix on the training set
cm_train = confusion_matrix(y_train, y_train_pred)

print(cm_test)
print(cm_train)

The result is the following two arrays:

[[53  5]
 [11  8]]

[[157  10]
 [ 34  28]]

Confusion matrix for the test set predictions (first array) and for the training set predictions (second array).

The sum of all the values of each matrix gives the total number of records in the test set (77 records) and in the training set (229 records), respectively.
The two rows of each confusion matrix are interpreted as follows:
– among the 58 actual test-set alive patients (resp. 167 training-set alive patients), 53 (resp. 157) are classified as such and 5 (resp. 10) are wrongly classified as dead
– among the 19 actual test-set dead patients (resp. 62 training-set dead patients), 8 (resp. 28) are classified as such and 11 (resp. 34) are wrongly classified as alive

The two columns (of the test-set matrix, for instance) are interpreted as follows:
– among the 64 records classified as alive, 53 are correct (recorded as alive in the data) and 11 are actually dead
– among the 13 records classified as dead, 8 are correct (recorded as dead in the data) and 5 are actually alive

In conclusion, the confusion matrices lead to:
– 79% of the records successfully classified on the test set ((53 + 8) / 77),
– and 81% classification success on the training set ((157 + 28) / 229).
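
As a check, these rates can be recomputed directly from the matrices by dividing each diagonal sum by the matrix total:

#classification success rates derived from the confusion matrices
print(np.trace(cm_test) / cm_test.sum())   #(53 + 8) / 77 ≈ 0.79
print(np.trace(cm_train) / cm_train.sum()) #(157 + 28) / 229 ≈ 0.81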

Finally, with the following script, we plot the actual training set observations (dots) together with the predicted decision regions (coloured areas): green for patients who survived and red for patients who died within 5 years of diagnosis, as a function of patient age ('x' axis) and number of axillary nodes ('y' axis):

#visualising the predictions on the training set
from matplotlib.colors import ListedColormap
myplot('Training set observations (dots) and predictions (coloured areas) for \'x\' years old patients that survived (green color) or died (red color) 5 years after having been diagnosed with a number \'y\' of axillary nodes.',
'Patient Age', X_train, 'Number of Axillary Nodes', y_train, knn_classifier)

This script displays the following graph:

Training set observations (dots) and predictions (coloured areas) for x years old patients that survived (green color) or died (red color) 5 years after having been diagnosed with a number y of axillary nodes.

A similar graph based on the test set observations can be displayed with the following script:

#visualising the predictions on the test set
myplot('Test set observations (dots) and predictions (coloured areas) for \'x\' years old patients that survived (green color) or died (red color) 5 years after having been diagnosed with a number \'y\' of axillary nodes.',
'Patient Age', X_test, 'Number of Axillary Nodes', y_test, knn_classifier)

Which gives the following graph:

Test set observations (dots) and predictions (coloured areas) for x years old patients that survived (green color) or died (red color) 5 years after having been diagnosed with a number y of axillary nodes.

We can see that the classifier works relatively well, since most of the alive patients are correctly classified as alive (green) and a non-negligible share of the dead patients have been correctly classified as dead.

The myplot function is a custom helper built on two other functions. They are defined as follows:

def myplot(title, x_name, X_set, y_name, y_set, classifier):
    #build a fine mesh covering the observed feature ranges
    X1, X2 = make_meshgrid(x=X_set[:, 0], y=X_set[:, 1], h=0.01)
    #colour the decision regions predicted by the classifier
    plot_contours(plt, classifier, X1, X2, alpha=0.75, cmap=ListedColormap(('green', 'red')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    #overlay the actual observations, coloured by their true class
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c=ListedColormap(('green', 'red'))(i), label=j)
    plt.title(title)
    plt.xlabel(x_name)
    plt.ylabel(y_name)
    plt.legend()
    plt.show()

def make_meshgrid(x, y, h=1):
    #grid spanning the data ranges plus a one-unit margin, with step h
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    #classify every mesh point and draw the resulting filled contours
    Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
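
Note that make_meshgrid and plot_contours mirror the plotting helpers found in the scikit-learn documentation examples: the mesh covers the observed feature ranges plus a one-unit margin, every mesh point is classified to colour the decision regions, and the actual observations are then drawn on top. Because myplot only relies on the classifier's predict method, it can be reused unchanged with any other two-feature scikit-learn classifier.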