# Random Forest Classification in Python

In this use case, we build a Random Forest classifier in Python to classify an individual's salary as big (>50K\$) or not, according to age, level of education, and average number of weekly working hours. The model's predictions are shown in the 3D graph below.

Figure: Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

We first import the required libraries and load the dataset:
```
# importing libraries
import numpy as np
import pandas as pd
from matplotlib import cm

# loading the dataset
dataset = pd.read_csv('dataset.csv')
```
Here are the first 30 rows of the dataset, out of 23414 in total:

Figure: Dataset in Python of large salaries (1 if >50K, 0 if <=50K) according to age, education time and working hours.

Then we split the dataset into the predictor set X (age, education time and working hours) and the dependent variable y to predict (1 = big salary, 0 = no big salary):
```
# the first three columns are the predictors
X = dataset.iloc[:, 0:3].values
# the last column is the label to predict
y = dataset.iloc[:, -1].values
```
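To make the positional slicing concrete, here is a minimal sketch on a made-up four-row frame. The column names are borrowed from the axis labels used later in this article; the values are invented for illustration, and the sketch assumes the label sits in the last column.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from dataset.csv).
toy = pd.DataFrame({
    "AGE": [25, 47, 33, 52],
    "EDUCATION": [9, 14, 10, 16],
    "WORKING_HOURS": [40, 50, 35, 60],
    "BIG_SALARY": [0, 1, 0, 1],
})

X_toy = toy.iloc[:, 0:3].values   # first three columns -> predictors, shape (4, 3)
y_toy = toy.iloc[:, -1].values    # last column -> label, shape (4,)

print(X_toy.shape, y_toy.shape)
print(y_toy)
```

Selecting the label with the index `-1` keeps the code valid if more predictor columns are added before it later on.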

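We split the data into a training set and a test set:
```
# train_test_split lives in sklearn.model_selection
# (the older sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```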

We are then ready to fit the classifier to the training data, which is done with the following code:
```
# fitting the classifier to the training set
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
```

Once the classifier is fitted, we can run it on the training set and the test set to get the predictions.
```
# predicting the results on the training and test sets
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)
```
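Since `dataset.csv` is not included here, the split, fit and predict steps can be tried on synthetic data. The sketch below is only a stand-in: the value ranges, the labelling rule and every `*_demo` / `*d_*` name are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset.csv: ranges and labelling rule are invented.
rng = np.random.default_rng(0)
n_rows = 400
age = rng.integers(18, 71, size=n_rows)
education = rng.integers(1, 17, size=n_rows)
hours = rng.integers(10, 81, size=n_rows)
X_demo = np.column_stack((age, education, hours))

# Hypothetical rule: label 1 when a normalised score exceeds a threshold.
y_demo = ((age / 70 + education / 16 + hours / 80) > 1.7).astype(int)

# Same split proportions and classifier settings as in the article.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
demo_clf.fit(Xd_train, yd_train)

yd_test_pred = demo_clf.predict(Xd_test)
demo_accuracy = (yd_test_pred == yd_test).mean()
print("Test accuracy on synthetic data:", demo_accuracy)
```

Fixing `random_state` in both the split and the classifier makes the run reproducible, which helps when comparing settings such as `n_estimators`.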

To evaluate the quality of the classifier, we compute the confusion matrices of the predictions made on the training and test sets and, from them, the success ratio (the percentage of correct predictions):
```
from sklearn.metrics import confusion_matrix

cm_train = confusion_matrix(y_train, y_train_pred)
cm_test = confusion_matrix(y_test, y_test_pred)

# success_ratio is defined at the end of this article
print("Training set confusion matrix : \n" + str(cm_train))
print("Success ratio on training set : " + str(success_ratio(cm=cm_train)) + "%")
print("Test set confusion matrix : \n" + str(cm_test))
print("Success ratio on test set : " + str(success_ratio(cm=cm_test)) + "%")
```
The console shows the following confusion matrices and success ratios for the training and test sets:

Figure: Confusion matrices of the training and test set predictions of large salaries (>50K) according to age, education time and working hours.
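When reading these matrices, it helps to know that scikit-learn puts the true classes on the rows and the predicted classes on the columns, so the diagonal holds the correct predictions. A small sketch with made-up labels (not taken from the dataset) shows the layout:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm_toy = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the cells row by row:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm_toy.ravel()

print(cm_toy)
print("correct:", tn + tp, "wrong:", fp + fn)
```

Here 6 of the 8 predictions fall on the diagonal, so the success ratio would be 75%.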

Finally, we display in a 3D graph the test set observations (dots) and the predictions (3D shape) of large salaries (1 if >50K, 0 if <=50K) according to age, education time and weekly working hours, with the following code:
```
# 3D display of predictions and test set observations
show3D(title="Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.",
       x_colname='AGE', y_colname='EDUCATION', z_colname='WORKING_HOURS', c_colname='BIG_SALARY',
       x_train=X_train[:, 0], y_train=X_train[:, 1], z_train=X_train[:, 2], c_train=y_train,
       x_test=X_test[:, 0], y_test=X_test[:, 1], z_test=X_test[:, 2], c_test=y_test,
       mesh_nb_pts=10**3, class_num=1, classifier=classifier)
```

Here is the 3D graph displayed with the test set:

Figure: Test set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

We can also display a 3D graph containing the training set observations, by passing the training data in place of the test data in the following function call:
```
# 3D display of predictions and training set observations
show3D(title="Training set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.",
       x_colname='AGE', y_colname='EDUCATION', z_colname='WORKING_HOURS', c_colname='BIG_SALARY',
       x_train=X_train[:, 0], y_train=X_train[:, 1], z_train=X_train[:, 2], c_train=y_train,
       x_test=X_train[:, 0], y_test=X_train[:, 1], z_test=X_train[:, 2], c_test=y_train,
       mesh_nb_pts=10**3, class_num=1, classifier=classifier)
```
This leads to the following 3D graph:

Figure: Training set observations (dots) and predictions (3D shape) of large salaries (>50K) according to age, education time and working hours.

In both graphs, we notice that the model's predictions of big salaries (>50K), shown as the transparent green 3D shape, fit the actual observations of big salaries (black dots) remarkably well, whereas most of the red dots (normal and small salaries) stay outside the green shape.

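This script uses the two functions success_ratio and show3D, which are defined below. They must be defined before the calls shown above are executed:
```
# FUNCTIONS
def success_ratio(cm):
    # cm is a 2x2 confusion matrix; correct predictions are on its diagonal
    total = cm[0, 0] + cm[0, 1] + cm[1, 0] + cm[1, 1]
    return 100 * (cm[0, 0] + cm[1, 1]) / total
```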
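The success ratio is the share of correct predictions, that is, the diagonal of the confusion matrix divided by the sum of all its cells. Under that reading, an equivalent formulation that also extends to more than two classes could look like the sketch below. The name `success_ratio_trace` and the example matrix are hypothetical, not part of the original script.

```python
import numpy as np

def success_ratio_trace(cm):
    """Percentage of correct predictions: diagonal sum over total count."""
    cm = np.asarray(cm)
    return 100 * np.trace(cm) / cm.sum()

# Made-up confusion matrix: 85 of 100 observations lie on the diagonal.
cm_example = np.array([[50, 10],
                       [5, 35]])
print(success_ratio_trace(cm_example))  # prints 85.0
```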

```
import numpy as np
import plotly.offline
import plotly.graph_objs as go


# Displays in a 3D space the model as a 3D mesh and the observations as a 3D scatter plot.
def show3D(title, x_colname, y_colname, z_colname, c_colname,
           x_train, y_train, z_train, c_train,
           x_test, y_test, z_test, c_test,
           mesh_nb_pts, class_num, classifier):
    if class_num not in (0, 1):
        print("ERROR. class_num=", class_num)
        return

    # Number of grid points per axis (cube root of mesh_nb_pts).
    # round() avoids truncating 1000 ** (1. / 3) = 9.999... down to 9.
    n = int(round(mesh_nb_pts ** (1. / 3)))
    min_x, min_y, min_z = min(x_train), min(y_train), min(z_train)
    x_size = max(x_train) - min_x
    y_size = max(y_train) - min_y
    z_size = max(z_train) - min_z
    x_step = x_size / n
    y_step = y_size / n
    z_step = z_size / n

    # POSITIVE (1) PREDICTIONS AS A 3D MESH
    # Regular n x n x n grid over the training data range (x varies slowest, z fastest).
    gx, gy, gz = np.meshgrid(min_x + np.arange(n) * x_step,
                             min_y + np.arange(n) * y_step,
                             min_z + np.arange(n) * z_step,
                             indexing='ij')
    x, y, z = gx.ravel(), gy.ravel(), gz.ravel()
    print("Grid of size " + str(n) + "x" + str(n) + "x" + str(n)
          + " generated (nb vertices = " + str(n * n * n) + ").")

    # Computing the predictions on the grid (columns in the same order as X).
    p = classifier.predict(np.column_stack((x, y, z)))
    print("Predictions on the grid computed.")

    # Extracting the grid points predicted as class_num.
    mask = (p == class_num)
    xx, yy, zz = x[mask], y[mask], z[mask]
    print(str(len(xx)) + " mesh coordinates extracted with ", class_num, "-predictions.")

    # Building the mesh for the predictions.
    trace_preds = go.Mesh3d(x=xx, y=yy, z=zz, alphahull=5, opacity=0.2,
                            color='rgb(0, 255, 0)', name='Predictions')
    print("Mesh generated.")

    # OBSERVATIONS AS 3D POINTS
    # Two sets of points: one for the class 0 observations, one for class 1.
    x_test, y_test, z_test = np.asarray(x_test), np.asarray(y_test), np.asarray(z_test)
    c_test = np.asarray(c_test)
    is_0 = (c_test == 0)
    is_1 = (c_test == 1)
    if not np.all(is_0 | is_1):
        print("ERROR : c_test contains values other than 0 and 1")

    trace_obs_0 = go.Scatter3d(
        x=x_test[is_0], y=y_test[is_0], z=z_test[is_0], mode='markers',
        marker=dict(size=2, line=dict(color='rgb(128, 0, 0)', width=0.5),
                    color='rgb(128, 0, 0)', opacity=1),
        name='[red] Observations of class 0')
    trace_obs_1 = go.Scatter3d(
        x=x_test[is_1], y=y_test[is_1], z=z_test[is_1], mode='markers',
        marker=dict(size=2, line=dict(color='rgb(0, 0, 0)', width=0.5),
                    color='rgb(0, 0, 0)', opacity=1),
        name='[black] Observations of class 1')

    # Axis ranges used by the layout.
    print("x from ", min_x, " to ", min_x + x_size)
    print("y from ", min_y, " to ", min_y + y_size)
    print("z from ", min_z, " to ", min_z + z_size)

    layout = go.Layout(
        title=title,
        scene=dict(
            xaxis=dict(title=x_colname, range=[min_x, min_x + x_size]),
            yaxis=dict(title=y_colname, range=[min_y, min_y + y_size]),
            zaxis=dict(title=z_colname, range=[min_z, min_z + z_size])))
    fig = go.Figure(data=[trace_obs_0, trace_obs_1, trace_preds], layout=layout)
    plotly.offline.plot(fig)
```
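One detail of show3D deserves care: the number of grid points per axis is a floating-point cube root of `mesh_nb_pts`. With `mesh_nb_pts = 10**3`, truncating that root with `int()` gives 9 points per axis rather than 10, because the computed root falls just below 10. The sketch below shows the effect and a rounding-based alternative; the helper name `grid_points_per_axis` is hypothetical.

```python
def grid_points_per_axis(mesh_nb_pts):
    """Points per axis for a cubic grid of roughly mesh_nb_pts vertices."""
    return int(round(mesh_nb_pts ** (1.0 / 3)))

truncated = int(1000 ** (1.0 / 3))    # the float cube root is 9.999..., truncated to 9
rounded = grid_points_per_axis(1000)  # rounding recovers the intended 10

print(truncated, rounded)  # prints: 9 10
```

With truncation the mesh would be built from 729 vertices instead of the requested 1000, which makes the predicted shape coarser than intended.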