Feature scaling

When some independent variables take large values while others take small values, the learning algorithm may give too much weight to the former and make the latter negligible. Indeed, each row made of k independent variables is treated as a point in a k-dimensional space (the values in the cells of the row are its coordinates), and many machine learning algorithms use the Euclidean distance to measure the similarity between rows.
Consider two points (i.e. rows) p1 and p2 and two coordinates x and y (i.e. columns x and y). If the squared difference (p2[x] – p1[x])² is much larger than the squared difference (p2[y] – p1[y])², then the values at coordinate y (column y) become negligible in the Euclidean distance and contribute almost nothing to the construction of the model.
For example, restricting ourselves to the columns « Age » and « Company Size » of the first two rows A and B, the squared difference for « Age » is (42.25 – 50)² = 60.0625 and the squared difference for « Company Size » is (125224 – 125)² = 15649759801. The Euclidean distance between the two points is therefore influenced far more by « Company Size » than by « Age », since 15649759801 dwarfs 60.0625. We do not want this to happen, since « Age » is considered as important as « Company Size ». The solution is to bring the variables to the same range and the same scale, so that no variable can dominate another. For this, we use either standardization or min-max normalization (depending on the use case) on the affected column i:
stand(i) = [ i – mean(i) ] / std_deviation(i)
norm(i) = [ i – min(i) ] / [ max(i) – min(i) ]
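
The arithmetic above can be reproduced in a few lines of R. This is only a minimal sketch: the two points reuse the « Age » and « Company Size » values quoted in the example, the vector `x` holds made-up values, and `standardize` and `normalize` are hypothetical helper names (not base R functions) that simply transcribe the two formulas.

```r
# Two rows restricted to two columns (values taken from the example above).
p1 <- c(age = 50,    company_size = 125)
p2 <- c(age = 42.25, company_size = 125224)

# Squared difference per coordinate: company_size dwarfs age.
sq_diff <- (p2 - p1)^2
print(sq_diff)

# Share of each column in the squared Euclidean distance.
print(sq_diff / sum(sq_diff))

# Hypothetical helpers transcribing the two formulas.
standardize <- function(x) (x - mean(x)) / sd(x)
normalize   <- function(x) (x - min(x)) / (max(x) - min(x))

# Illustrative column (made-up values) to try the helpers on.
x <- c(125, 4000, 52000, 125224)
print(standardize(x))  # centered on 0 with unit standard deviation
print(normalize(x))    # rescaled to the interval [0, 1]
```

Note that `sd()` computes the sample standard deviation (denominator n - 1), which is also what `scale()` uses by default.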

In some contexts, we could skip scaling the generated binary variables (the columns that come from the values Specialty/Management/Manual and Male/Female). Usually, however, it is better to scale them, as this tends to improve the quality of the model. The main drawback is that, once scaled, the data becomes harder to read. Feature scaling is not really meant for « Salary », because it is a « categorical » variable and also the « target » variable (i.e. the dependent variable). Here we choose to apply standardization to the whole dataset, including the generated binary variables; note that, as written, the code below also rescales the « Salary » column. This is done with the scale() function. However, before applying scale() to the whole dataset, we must convert all cells to the numeric type (the cells that contained textual values do not have a numeric type, despite the numerical encoding of their values). We do this with a loop that applies the type conversion to each column.
```
# feature scaling
for (i in 1:7) {
  dataset[, i] = as.numeric(as.character(dataset[, i]))
}
dataset = scale(dataset)
dataset = as.data.frame(dataset)
```
After scaling, we need to convert the result back to a data frame, since scale() returns a numeric matrix rather than a data frame.
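
Both conversions can be checked on a toy data frame. The sketch below uses made-up values and hypothetical column names (`AGE`, `SIZE`, `GENDER`); it illustrates why the loop goes through as.character() and what scale() returns. The last lines show one possible variation that scales only a chosen subset of columns, which is not what the script in this article does.

```r
# Toy data frame with made-up values; column names are hypothetical.
toy <- data.frame(
  AGE    = c(42.25, 50, 31, 27.5),
  SIZE   = c(125224, 125, 3000, 48),
  GENDER = factor(c("Male", "Female", "Female", "Male"),
                  levels = c("Female", "Male"), labels = c(0, 1))
)

# as.numeric() on a factor returns the internal level codes (1 and 2),
# not the labels; going through as.character() recovers 0 and 1.
print(as.numeric(toy$GENDER))
print(as.numeric(as.character(toy$GENDER)))

toy$GENDER <- as.numeric(as.character(toy$GENDER))

# scale() returns a numeric matrix, hence the conversion back.
scaled <- scale(toy)
print(is.matrix(scaled))
scaled_df <- as.data.frame(scaled)
print(colMeans(scaled_df))  # approximately 0 for every column

# Variation (an assumption, not the article's script): scale only some
# columns and leave the others untouched.
to_scale <- c("AGE", "SIZE")
toy_alt <- toy
toy_alt[to_scale] <- scale(toy[to_scale])
print(toy_alt)
```

The same subsetting idea could be used to keep a target column such as « Salary » out of the scaling step.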

After running this code, the dataset looks like this:

Dataset after feature scaling

The whole R script becomes:
```
# setting the working folder
setwd("")
```

```
# loading the dataset
dataset = read.csv('dataset.csv')

# numerical relabeling
dataset$OCCUPATION = factor(dataset$OCCUPATION,
                            levels = c('Management', 'Manual', 'Specialty'),
                            labels = c(0, 1, 2))
dataset$GENDER = factor(dataset$GENDER,
                        levels = c('Female', 'Male'),
                        labels = c(0, 1))
dataset$SALARY = factor(dataset$SALARY,
                        levels = c('HIGH', 'LOW'),
                        labels = c(0, 1))
```

```
# feature scaling
for (i in 1:7) {
  dataset[, i] = as.numeric(as.character(dataset[, i]))
}
dataset = scale(dataset)
dataset = as.data.frame(dataset)
```