Sem Spirit

Filling the blanks

When some data is missing in the dataset, removing the lines with blank cells or filling those cells with random values is not a good option, since either choice can distort the construction of the model. The values chosen to fill the blank cells must be as neutral as possible with respect to the learning strategy, so that they influence the learning algorithms as little as possible. There are basically three strategies to take care of missing data: replacing a blank cell by either (1) the mean, (2) the median or (3) the most frequent value of the column. In our case we are going to use the mean of the column, which is the usual strategy to handle missing data in numerical columns.
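To make the three strategies concrete, here is a minimal sketch on a toy vector. The vector `ages` and the helper `most_frequent` are illustrative names chosen for this example; they are not part of the dataset or of base R.

```r
# Toy column with two blank cells (made-up values, not taken from dataset.csv)
ages <- c(25, 30, 30, NA, 45, NA)

# (1) mean and (2) median, computed on the non-missing values only
mean_value   <- mean(ages, na.rm = TRUE)
median_value <- median(ages, na.rm = TRUE)

# (3) most frequent value: base R has no built-in statistical mode,
# so this hypothetical helper counts the occurrences of each distinct value
most_frequent <- function(x) {
  x <- x[!is.na(x)]
  unique_values <- unique(x)
  unique_values[which.max(tabulate(match(x, unique_values)))]
}
mode_value <- most_frequent(ages)

# Fill the blank cells with the chosen strategy (here: the mean)
ages_filled <- ifelse(is.na(ages), mean_value, ages)
```

The three candidate values can differ noticeably when the column contains outliers, which is one reason the median is sometimes preferred over the mean.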

The code below applies the filling strategy to the numerical columns with missing data: AGE and WEEKLY_WORKING_HOURS.

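#filling blank cells
#ave() without a grouping variable returns the overall column mean for every row,
#so each NA is replaced by the mean of the non-missing values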
dataset$AGE = ifelse(is.na(dataset$AGE), ave(dataset$AGE, FUN=function(x) mean(x, na.rm=TRUE)), dataset$AGE)
dataset$WEEKLY_WORKING_HOURS = ifelse(is.na(dataset$WEEKLY_WORKING_HOURS), ave(dataset$WEEKLY_WORKING_HOURS, FUN=function(x) mean(x, na.rm=TRUE)), dataset$WEEKLY_WORKING_HOURS)
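The same filling step can also be sketched as a small reusable helper, followed by a check that no blank cell remains. The function name `fill_with_mean` and the data frame `toy` are assumptions made for this illustration; `toy` only stands in for the real dataset.

```r
# Hypothetical helper: replace NA in a numeric vector
# by the mean of its non-missing values
fill_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  AGE = c(20, NA, 40),
  WEEKLY_WORKING_HOURS = c(35, 45, NA)
)

# Apply the helper to each numerical column that may contain blanks
numeric_columns <- c("AGE", "WEEKLY_WORKING_HOURS")
toy[numeric_columns] <- lapply(toy[numeric_columns], fill_with_mean)

# Count the remaining blank cells per column; both counts should be zero
remaining_blanks <- colSums(is.na(toy[numeric_columns]))
```

Note that if a column contains only blank cells, its mean is undefined (`NaN`), so such a column would need to be handled separately.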

After running those lines, the dataset looks like this:

Dataset after having filled the blank cells

The whole R script becomes:

#setting the working folder
setwd("")

#loading the dataset
dataset = read.csv('dataset.csv')

#Numerical relabeling
dataset$OCCUPATION = factor(dataset$OCCUPATION, levels = c('Management', 'Manual', 'Specialty'), labels = c(0, 1, 2))
dataset$GENDER = factor(dataset$GENDER, levels = c('Female', 'Male'), labels = c(0, 1))
dataset$SALARY = factor(dataset$SALARY, levels = c('HIGH', 'LOW'), labels = c(0, 1))

#filling blank cells
dataset$AGE = ifelse(is.na(dataset$AGE), ave(dataset$AGE, FUN=function(x) mean(x, na.rm=TRUE)), dataset$AGE)
dataset$WEEKLY_WORKING_HOURS = ifelse(is.na(dataset$WEEKLY_WORKING_HOURS), ave(dataset$WEEKLY_WORKING_HOURS, FUN=function(x) mean(x, na.rm=TRUE)), dataset$WEEKLY_WORKING_HOURS)

Next step: feature scaling.