Feature scaling

When some independent variables take large values while other variables take very small values, the learning algorithm may give too much importance to the former and make the latter negligible. Indeed, each line made of k non-dependent cells is considered as a point in a k-dimensional space (the values in the cells of the line are its coordinates), and the Euclidean distance is often used by machine learning algorithms to evaluate the similarity between lines.
Considering two points (i.e. lines) p1 and p2 and two coordinates x and y (i.e. columns x and y), if the squared difference (p2[x] – p1[x])² is much larger than the squared difference (p2[y] – p1[y])², then the values at coordinate y (column y) become negligible in the evaluation of the Euclidean distance and useless for the construction of the model.
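
To make this concrete, here is a minimal sketch (with made-up coordinates, not taken from our dataset) showing how a large-scale column swamps a small-scale one in the Euclidean distance :

import numpy as np

#two hypothetical points : first coordinate on a small scale, second on a large scale
p1 = np.array([3.0, 100000.0])
p2 = np.array([8.0, 180000.0])

squared_diffs = (p2 - p1) ** 2          #[2.5e+01, 6.4e+09]
distance = np.sqrt(squared_diffs.sum()) #~80000.00016

#the first coordinate contributes almost nothing to the distance
print(squared_diffs, distance)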

Initial dataset without missing data

For example, restricting ourselves to the columns « Age » and « Company Size » of the first two lines A and B of our example dataset above, the squared difference for « Age » is (42.25 – 50)² = 60.0625 and the squared difference for « Company Size » is (125224 – 125)² = 15649759801. The Euclidean distance between the two points will therefore be influenced almost exclusively by « Company Size », since 15649759801 completely dominates 60.0625, which becomes negligible. We don't want that to happen, because « Age » is considered as important as « Company Size ». The solution is to put the variables in the same range and on the same scale, so that no variable can be dominated by another. For this, we apply either standard or normal feature scaling (depending on the use case) to the impacted column i :
stand(i) = [ i – mean(i) ] / std_deviation(i)
norm(i) = [ i – min(i) ] / [ max(i) – min(i) ]
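
Both formulas are straightforward to apply with numpy. A minimal sketch, on a hypothetical column of values (e.g. a « Company Size » column) :

import numpy as np

#hypothetical column of values
i = np.array([125224.0, 125.0, 30000.0, 87000.0, 4500.0])

#standard scaling : centre on the mean, divide by the standard deviation
stand_i = (i - i.mean()) / i.std()

#normal scaling : rescale into the [0, 1] interval
norm_i = (i - i.min()) / (i.max() - i.min())

print(stand_i)  #approximately [ 1.55, -1.01, -0.40,  0.77, -0.92] : mean 0, standard deviation 1
print(norm_i)   #approximately [ 1.00,  0.00,  0.24,  0.69,  0.03] : all values in [0, 1]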

In some contexts, we could avoid scaling the generated binary variables (the columns that come from the values Specialty/Management/Manual and Male/Female). But usually it is better to scale them : it improves the quality of the model. The main drawback is that, once done, the data becomes difficult to read. In the following, we don't apply feature scaling to « Salary », because it is a « categorical » variable AND it is the « target » variable (i.e. the dependent variable). Therefore, we choose to apply the standard scaling method on the source variables, including the generated binary variables.
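
If, in a particular context, we preferred to keep the binary columns readable, we could restrict the scaler to the numerical columns only. A minimal sketch, assuming that the five generated binary columns are the first five columns of sourcevars (which is how OneHotEncoder arranges them here) :

from sklearn.preprocessing import StandardScaler

#alternative (not used in the following) : scale only the numerical columns
stScaler_num = StandardScaler()
sourcevars[:, 5:] = stScaler_num.fit_transform(sourcevars[:, 5:])
#the binary columns keep their 0/1 values and stay easy to read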

from sklearn.preprocessing import StandardScaler
stScaler_ds = StandardScaler()
sourcevars = stScaler_ds.fit_transform(sourcevars)

Source Variables

After running this code on the sourcevars array above, we obtain the following array :

'sourcevars' array after feature scaling
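
Since the scaled array above is hard to read, note that the original values can be recovered at any time from the fitted scaler with its inverse_transform method :

#recovering the original (readable) values from the scaled array
readable_sourcevars = stScaler_ds.inverse_transform(sourcevars)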

The whole Python script becomes :

#importing libraries
import numpy as n
import matplotlib.pyplot as m
import pandas as p

#loading the dataset
dataset = p.read_csv('dataset.csv', sep=',').values

#filling blank cells
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis = 0)
imputer = imputer.fit(dataset[:, 2:6])
dataset[:, 2:6] = imputer.transform(dataset[:, 2:6])

#turning textual data into numerical values
from sklearn.preprocessing import LabelEncoder
labelencoder_0 = LabelEncoder() #independent variable encoder
dataset[:,0] = labelencoder_0.fit_transform(dataset[:,0])
labelencoder_1 = LabelEncoder() #independent variable encoder
dataset[:,1] = labelencoder_1.fit_transform(dataset[:,1])
labelencoder_6 = LabelEncoder() #dependent (target) variable encoder
dataset[:,6] = labelencoder_6.fit_transform(dataset[:,6])

#taking care of wrong order relationships
from sklearn.preprocessing import OneHotEncoder
onehotencoder_01 = OneHotEncoder(categorical_features = [0, 1])
dataset = onehotencoder_01.fit_transform(dataset).toarray()

#splitting the dataset into the source variables (independent variables) and the target variable (dependent variable)
sourcevars = dataset[:,:-1] #all columns except the last one
targetvar = dataset[:,len(dataset[0])-1] #only the last column

#feature scaling
from sklearn.preprocessing import StandardScaler
stScaler_ds = StandardScaler()
sourcevars = stScaler_ds.fit_transform(sourcevars)
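
Note that Imputer and the categorical_features parameter of OneHotEncoder have been removed from recent versions of scikit-learn. For readers using a recent version, here is a possible equivalent of the same preprocessing, given only as a sketch (it assumes scikit-learn 1.2 or later for the sparse_output parameter of OneHotEncoder) :

#possible equivalent with the current scikit-learn API (sketch, not the script above)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler

dataset = pd.read_csv('dataset.csv', sep=',').values

#filling blank numerical cells (columns 2 to 5) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
dataset[:, 2:6] = imputer.fit_transform(dataset[:, 2:6])

#one-hot encoding the two textual source columns, keeping the numerical ones as they are
ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(sparse_output=False), [0, 1])],
    remainder='passthrough')
sourcevars = ct.fit_transform(dataset[:, :-1])

#encoding the textual target variable
targetvar = LabelEncoder().fit_transform(dataset[:, -1])

#feature scaling on the source variables
sourcevars = StandardScaler().fit_transform(sourcevars)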

Next step : Splitting the dataset into training and test sets