The confusion matrix is a tool for measuring the quality of a classification system.
Each column of the matrix counts the occurrences of a predicted class, while each row counts the occurrences of an actual class in the given dataset. Therefore, for a dataset with k classes, the confusion matrix contains k rows and k columns. If the classifier is binary (k = 2), the confusion matrix is made of two rows and two columns, as shown below. In this specific case, one can read the 4 cells of the confusion matrix with the terminology used in epidemiology: reading row by row, each cell denotes respectively the number of true negatives (TN), false positives (FP), false negatives (FN) and true positives (TP).
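This 2×2 layout can be sketched as follows; the counts used here are purely illustrative placeholders, not values from the article:

```python
import numpy as np

# Generic 2x2 confusion matrix layout:
#                 predicted 0   predicted 1
#   actual 0          TN            FP
#   actual 1          FN            TP
tn, fp, fn, tp = 3, 1, 2, 4  # illustrative counts only
cm = np.array([[tn, fp],
               [fn, tp]])

print(cm.shape)  # a binary classifier yields a 2x2 matrix
```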
One advantage of the confusion matrix is that it shows at a glance whether the system manages to classify correctly.
In the following example, we consider a dataset whose elements are split into two sets ‘0’ and ‘1’: an element belonging to the set x in the dataset is called an "x-element". The classification system learns to classify elements into the two classes ‘0’ and ‘1’: an element classified into the set x is called an "x-classified element". We want to know how many 0-elements are falsely classified as 1 (false positives) and how many 1-elements are not classified as such (false negatives). We assume that the classifier is tested with 20 0-elements and 20 1-elements. Moreover, since the classifier is binary, the confusion matrix is a 2×2 matrix as shown below:
The matrix reads as follows:
- Top row: out of the 20 0-elements, 16 are classified as such (16 true negatives) and 4 are wrongly classified as 1 (4 false positives)
- Bottom row: out of the 20 1-elements, 1 is wrongly classified as 0 (1 false negative) and 19 are correctly classified as 1 (19 true positives)
- Left column: out of the 17 0-classified elements, 16 are actually 0-elements (16 true negatives) and only 1 is actually a 1-element (1 false negative)
- Right column: out of the 23 1-classified elements, 4 are actually 0-elements (4 false positives) and 19 are actually 1-elements (19 true positives)
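The four readings above can be checked numerically. A minimal sketch with NumPy, using the counts from the example (16, 4, 1, 19):

```python
import numpy as np

# Rows = actual class ('0' then '1'), columns = predicted class ('0' then '1')
cm = np.array([[16, 4],     # 20 actual 0-elements: 16 TN, 4 FP
               [1, 19]])    # 20 actual 1-elements: 1 FN, 19 TP

# Row sums give the number of actual elements per class
print(cm.sum(axis=1))  # 20 0-elements and 20 1-elements
# Column sums give the number of classified elements per class
print(cm.sum(axis=0))  # 17 0-classified and 23 1-classified elements
```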
Assuming that y_test contains the actual values of the variable to predict on the test set and y_pred contains the predictions obtained by applying the classifier to the test set, the following Python script creates the confusion matrix:
from sklearn.metrics import confusion_matrix
# Rows correspond to actual classes, columns to predicted classes
cm = confusion_matrix(y_test, y_pred)
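For a binary problem, the four counts can also be unpacked directly from the sklearn matrix with ravel(); a minimal sketch on toy data (y_true and y_hat are illustrative names, not from the article):

```python
from sklearn.metrics import confusion_matrix

# Illustrative actual values and predictions for a binary classifier
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # → 1 1 1 2
```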
And here is the script for the same purpose in R:
# table() builds a contingency table: rows = actual values, columns = predictions
cm <- table(y_test, y_pred)