Sem Spirit

Classifier evaluation with CAP curve in Python

The cumulative accuracy profile (CAP) is used in data science to visualize the discriminative power of a model. The CAP of a model plots the cumulative number of elements meeting a given property on the y-axis against the corresponding cumulative number of elements on the x-axis. The CAP is equivalent to the Lorenz curve, the Power curve and the Lift curve. It differs from the receiver operating characteristic (ROC), which plots the true-positive rate against the false-positive rate.

Consider a model that predicts, for each element of a dataset, whether a property holds (a positive), based on multiple factors (such as gender, age or income). If the elements of a data subset are picked at random, the cumulative number of elements for which the property holds rises linearly toward a maximum equal to the total number of such elements in the subset. This distribution is called the « random » CAP. A perfect prediction, on the other hand, determines exactly which elements of the subset meet the property, so that the maximum is reached with the minimum number of selections. This produces a steep line on the CAP curve that stays flat once the maximum is reached (selecting the remaining elements cannot increase the cumulative count any further); this is the « perfect » CAP.
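As a rough illustration of these two reference curves, the sketch below expresses both as functions of the fraction of elements selected. The function names and the 20% positive rate are assumptions made for the example; they do not come from the original article.

```python
import numpy as np

def random_cap(x):
    """Random CAP: the fraction of positives found equals the fraction of elements selected."""
    return np.asarray(x, dtype=float)

def perfect_cap(x, positive_rate):
    """Perfect CAP: rises linearly to 1 at x = positive_rate, then stays flat."""
    x = np.asarray(x, dtype=float)
    return np.minimum(x / positive_rate, 1.0)

# Hypothetical subset in which 20% of the elements meet the property.
x = np.linspace(0.0, 1.0, 6)  # 0%, 20%, ..., 100% of the elements
print(random_cap(x))
print(perfect_cap(x, positive_rate=0.2))
```

With a 20% positive rate, the perfect curve already reaches 1 after 20% of the elements have been selected, while the random curve only reaches 1 at the very end.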

A successful model predicts, for each element, the likelihood that the property holds and ranks the elements by this probability, so that the elements most likely to meet the property are selected first. The resulting cumulative number of elements meeting the property increases rapidly and eventually flattens out at the maximum as more elements of the subset are selected. This results in a distribution that lies between the random and the perfect CAP curves.

To obtain the CAP curve, all elements are first ranked in descending order of their score, i.e. their predicted probability of meeting the property. For a given fraction x of the total number of elements, the CAP curve is constructed by calculating the percentage d(x) of the elements meeting the property whose score is greater than or equal to the lowest score within the top fraction x. This is done for all x varying from 0% to 100%.
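The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the name `cap_points` and the toy labels and scores are made up for the example, labels are assumed to be 0/1 with at least one positive, and tied scores are simply kept in their input order.

```python
import numpy as np

def cap_points(y_true, y_score):
    """Return (x, d): fraction of elements selected and fraction of positives captured."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    # Rank the elements by decreasing score (stable sort keeps ties in input order).
    order = np.argsort(-y_score, kind="stable")
    hits = np.cumsum(y_true[order])
    n = len(y_true)
    x = np.arange(n + 1) / n                      # 0, 1/n, ..., 1
    d = np.concatenate(([0.0], hits / hits[-1]))  # start the curve at (0, 0)
    return x, d

# Toy data: 2 positives among 5 elements.
y_true = [1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.2, 0.1]
x, d = cap_points(y_true, y_score)
print(x)  # [0.  0.2 0.4 0.6 0.8 1. ]
print(d)  # [0.  0.5 0.5 1.  1.  1. ]
```

Here the top 60% of the ranked elements already contain every positive, so d(x) equals 1 from x = 0.6 onward.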

The CAP can be used to evaluate a model by comparing its curve to the perfect CAP, in which the maximum rate of elements meeting the property is reached directly, and to the random CAP, in which the elements meeting the property are distributed evenly. A good model has a CAP between the perfect CAP and the random CAP, and the better the model, the closer its curve is to the perfect CAP.

The accuracy ratio (AR) is defined as the ratio of the area between the model CAP and the random CAP to the area between the perfect CAP and the random CAP. For a successful model the AR lies between zero and one, with a higher value for a stronger model.
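To make the ratio concrete, here is a small sketch that computes the AR from the points of a CAP curve. It integrates with the trapezoid rule, which is exact for a piecewise-linear curve, and uses the closed-form areas 1/2 for the random CAP and 1 - p/2 for the perfect CAP, where p is the fraction of positives. The helper names and the toy curve are assumptions made for this example.

```python
import numpy as np

def area_under(x, y):
    """Trapezoid-rule area under the piecewise-linear curve through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def accuracy_ratio(x, d, positive_rate):
    """AR = (area model - area random) / (area perfect - area random)."""
    area_model = area_under(x, d)
    area_random = 0.5                          # area under the diagonal
    area_perfect = 1.0 - positive_rate / 2.0   # triangle plus rectangle
    return (area_model - area_random) / (area_perfect - area_random)

# Toy CAP curve for 5 elements, 2 of which are positives (positive rate 0.4).
x = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
d = np.array([0.0, 0.5, 0.5, 1.0, 1.0, 1.0])
ar = accuracy_ratio(x, d, positive_rate=0.4)
print(round(ar, 4))  # 0.6667
```

On the same grid, the random curve (d equal to x) gives an AR of 0 and the perfect curve gives an AR of 1, which matches the range stated above.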

Another indication of the model strength is given by the cumulative number of elements meeting the property once 50% of the elements on the x-axis have been selected. For a successful model this value should lie between 50% and 100% of the maximum, with a higher percentage for stronger models.
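Reading the curve at 50% amounts to a linear interpolation between the two nearest curve points. The sketch below uses `np.interp` for this; the function name `cap_value_at` and the toy curve are illustrative assumptions.

```python
import numpy as np

def cap_value_at(x, d, fraction=0.5):
    """Linearly interpolate the CAP curve (x, d) at the given fraction of elements."""
    return float(np.interp(fraction, x, d))

# Toy CAP curve: fraction of elements selected vs. fraction of positives captured.
x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
d = [0.0, 0.5, 0.5, 1.0, 1.0, 1.0]
value = cap_value_at(x, d, fraction=0.5)
print(value)  # 0.75
```

For this toy curve, 75% of the positives are captured within the first half of the ranked elements, which falls in the 50% to 100% range expected of a useful model.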

USE CASE: EVALUATING A CLASSIFIER IN PYTHON WITH THE CAP CURVE

In the following, we use the CAP curve to evaluate the Random Forest classifier created here, trained on a dataset about the distribution of big salaries.

First we create the classifier with the following code:

# importing libraries
import numpy as np
import pandas as pd

# loading the dataset: the first three columns are the features, the last column is the label
dataset = pd.read_csv('dataset-4.csv')
X = dataset.iloc[:, 0:3].values
y = dataset.iloc[:, -1].values

# train/test split (train_test_split now lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed from scikit-learn)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# fitting the classifier to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

Then we create the CAP curve with the following code:

y_pred_proba = classifier.predict_proba(X=X_test)
capcurve(y_values=y_test, y_preds_proba=y_pred_proba[:,1])

The ‘capcurve’ function that builds and shows the CAP curve is defined as follows:

import matplotlib.pyplot as plt
from scipy import integrate

def capcurve(y_values, y_preds_proba):
    # number of positive observations, total count and positive rate (labels must be 0/1)
    num_pos_obs = int(np.sum(y_values))
    num_count = len(y_values)
    rate_pos_obs = float(num_pos_obs) / float(num_count)
    # perfect model: all positives are found after selecting rate_pos_obs of the data
    ideal = pd.DataFrame({'x': [0, rate_pos_obs, 1], 'y': [0, 1, 1]})
    # x-axis: fraction of the data selected, from 0 to 1 (num_count + 1 points)
    xx = np.arange(num_count + 1) / float(num_count)

    # sort the observations by decreasing predicted probability
    y_cap = np.c_[y_values, y_preds_proba]
    y_cap_df_s = pd.DataFrame(data=y_cap)
    y_cap_df_s = y_cap_df_s.sort_values([1], ascending=False).reset_index(drop=True)

    print(y_cap_df_s.head(20))

    # cumulative fraction of positives, with the first curve point (0,0) prepended
    yy = np.cumsum(y_cap_df_s[0].values) / float(num_pos_obs)
    yy = np.append([0], yy)

    # y-value of the model curve at 50% of the data (linear interpolation)
    percent = 0.5
    val = np.interp(percent, xx, yy)

    # areas under the perfect, model and random curves
    # (integrate.simps is deprecated in recent SciPy releases; simpson replaces it)
    sigma_ideal = 1.0 - rate_pos_obs / 2.0
    sigma_model = integrate.simpson(yy, x=xx)
    sigma_random = integrate.simpson(xx, x=xx)

    # accuracy ratio
    ar_value = (sigma_model - sigma_random) / (sigma_ideal - sigma_random)

    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.plot(ideal['x'], ideal['y'], color='grey', label='Perfect Model')
    ax.plot(xx, yy, color='red', label='User Model')
    ax.plot(xx, xx, color='blue', label='Random Model')
    ax.plot([percent, percent], [0.0, val], color='green', linestyle='--', linewidth=1)
    ax.plot([0, percent], [val, val], color='green', linestyle='--', linewidth=1,
            label='%.1f%% of positive obs at %.0f%%' % (val * 100, percent * 100))

    plt.xlim(0, 1.02)
    plt.ylim(0, 1.25)
    plt.title("CAP Curve - a_r value = %.3f" % ar_value)
    plt.xlabel('% of the data')
    plt.ylabel('% of positive obs')
    plt.legend()
    plt.show()

This code produces the following graph:

The CAP curve for the random forest classifier predictions obtained with the "big salaries" test set

The a_r value and the y-value at 50% show that the model performs well.