Sem Spirit

Correcting irrelevant orders

After having encoded the labels as numerical values, we must ask ourselves: is the natural order between these numbers relevant for the column?
In our dataset, ‘Female’ is encoded as 0 and ‘Male’ as 1: but we do not want a woman to be considered less important than a man. To prevent the model from being influenced by any irrelevant natural order between the encoded labels of a column, we need to turn each of these columns into binary-valued columns. In our dataset, ‘Male’ (resp. ‘Female’) will label a column in which each cell holds a binary value, set to ‘1’ when the individual described by the cell’s row is a man (and ‘0’ when that individual is a woman).


#taking care of wrong order relationships
#note: ColumnTransformer replaces OneHotEncoder's removed `categorical_features` argument
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
onehotencoder_01 = ColumnTransformer(
    [('onehot', OneHotEncoder(sparse_output=False), [0, 1])], remainder='passthrough')
dataset = onehotencoder_01.fit_transform(dataset)

After running this code, the dataset looks like this:

Part in green: for each row, the x-th of the first three columns contains ‘1’ and the other two contain ‘0’ iff the row had the value x in column N°0 (corresponding to ‘OCCUPATION’) of the previous dataset. For example, in the previous dataset the 1st value (index 0) of column N°0 (OCCUPATION) was 2 (SPECIALTY); in the new dataset, column N°2 is set to ‘1’ and columns N°0 and N°1 are set to ‘0’.
Part in red: for each row, the x-th column among the 4th and 5th ones contains ‘1’ and the other contains ‘0’ iff the row had the value (x-4) in column N°1 (corresponding to ‘GENDER’) of the previous dataset. For example, in the previous dataset the 6th value (index 5) of column N°1 (GENDER) was 0 (FEMALE); in the new dataset, column N°3 is set to ‘1’ and column N°4 is set to ‘0’.

The whole Python script becomes:

#importing libraries
import numpy as n
import matplotlib.pyplot as m
import pandas as p

#loading the dataset
dataset = p.read_csv('dataset.csv', sep=',').values

#filling blank cells
#note: SimpleImputer replaces sklearn's removed `Imputer` class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=n.nan, strategy='mean')
imputer = imputer.fit(dataset[:, 2:6])
dataset[:, 2:6] = imputer.transform(dataset[:, 2:6])

#turning textual data to numerical
from sklearn.preprocessing import LabelEncoder
labelencoder_0 = LabelEncoder() #independent variable encoder
dataset[:,0] = labelencoder_0.fit_transform(dataset[:,0])
labelencoder_1 = LabelEncoder() #independent variable encoder
dataset[:,1] = labelencoder_1.fit_transform(dataset[:,1])
labelencoder_6 = LabelEncoder() #dependent (target) variable encoder
dataset[:,6] = labelencoder_6.fit_transform(dataset[:,6])

#taking care of wrong order relationships
#note: ColumnTransformer replaces OneHotEncoder's removed `categorical_features` argument
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
onehotencoder_01 = ColumnTransformer(
    [('onehot', OneHotEncoder(sparse_output=False), [0, 1])], remainder='passthrough')
dataset = onehotencoder_01.fit_transform(dataset)

In the next steps, after separating the source variables from the target variable, we will take care of feature scaling.