Sem Spirit

Splitting the dataset into training and test sets

The usual machine learning methodology is to apply the learning algorithm to one part of the dataset, called the "training set", in order to build the model, and then to evaluate the quality of the model on the rest of the dataset, called the "test set". In this section we divide the dataset into training and test sets.
First, we install the "caTools" package, which provides the tools we need. After setting a seed so that the random split is reproducible, we apply the sample.split function to the SALARY column. It flags each row as TRUE or FALSE while preserving the relative proportions of the SALARY classes; with SplitRatio=0.8, about 80% of the rows are flagged TRUE. We then extract from the dataset two disjoint and complementary data frames, training_set and test_set, according to these flags.

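#splitting into training and test sets
install.packages("caTools")   #only needed once per R installation
library(caTools)
set.seed(123)   #fixes the random number generator so the split is reproducible
split = sample.split(dataset$SALARY, SplitRatio=0.8)   #TRUE for about 80% of the rows
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)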

After running this code, we obtain the following two data frames:

Dataset split into training and test sets
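
To make the idea of flagging rows more concrete, here is a minimal base R sketch of a class-preserving split on a small made-up data frame. The names toy and stratified_flags are hypothetical and introduced only for illustration. This is not how caTools implements sample.split; it only mimics the general idea of drawing the same fraction of rows from each SALARY class, and then checks that the two resulting sets are disjoint and complementary.

```r
# Hypothetical toy data (not the tutorial's dataset): 6 "LOW" and 4 "HIGH" rows
toy <- data.frame(
  AGE = c(25, 32, 47, 51, 38, 29, 44, 36, 58, 41),
  SALARY = c("LOW", "LOW", "HIGH", "HIGH", "LOW", "LOW", "HIGH", "LOW", "HIGH", "LOW")
)

# Hypothetical helper: flag roughly `ratio` of the rows of each class as TRUE
stratified_flags <- function(y, ratio) {
  flags <- logical(length(y))                # all FALSE to start with
  for (cls in unique(y)) {
    idx <- which(y == cls)                   # row positions of this class
    n_train <- round(ratio * length(idx))    # how many of them go to training
    chosen <- idx[sample.int(length(idx), n_train)]
    flags[chosen] <- TRUE
  }
  flags
}

set.seed(123)
flags <- stratified_flags(toy$SALARY, ratio = 0.8)
toy_train <- subset(toy, flags == TRUE)
toy_test <- subset(toy, flags == FALSE)

# Sanity checks: sizes, no shared rows, and class counts in the training set
print(nrow(toy_train))
print(nrow(toy_test))
print(length(intersect(rownames(toy_train), rownames(toy_test))))
print(table(toy_train$SALARY))
```

The same checks (nrow, intersect on the row names, table on SALARY) can be run on training_set and test_set to confirm that the real split behaves as expected.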

The whole R script becomes:

#setting the working folder
setwd("")

#loading the dataset
dataset = read.csv('dataset.csv')

#Numerical relabeling
dataset$OCCUPATION = factor(dataset$OCCUPATION, levels = c('Management', 'Manual', 'Specialty'), labels = c(0, 1, 2))
dataset$GENDER = factor(dataset$GENDER, levels = c('Female', 'Male'), labels = c(0, 1))
dataset$SALARY = factor(dataset$SALARY, levels = c('HIGH', 'LOW'), labels = c(0, 1))

#feature scaling
for (i in 1:7){
  dataset[,i] = as.numeric(as.character(dataset[,i]))
}
dataset = scale(dataset)
dataset = as.data.frame(dataset)

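#splitting into training and test sets
install.packages("caTools")   #only needed once per R installation
library(caTools)
set.seed(123)   #fixes the random number generator so the split is reproducible
split = sample.split(dataset$SALARY, SplitRatio=0.8)   #TRUE for about 80% of the rows
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)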
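
The role of set.seed(123) in the script can be illustrated with a short base R sketch. The seed does not make the split more random; it fixes the state of the random number generator, so that running the script twice yields the same training and test sets. The variable names first_draw and second_draw are hypothetical, and sample is used here only as a stand-in for any function that draws random numbers.

```r
# Same seed, same random draw: the two flag vectors come out identical
set.seed(123)
first_draw <- sample(c(TRUE, FALSE), size = 10, replace = TRUE)

set.seed(123)
second_draw <- sample(c(TRUE, FALSE), size = 10, replace = TRUE)

print(identical(first_draw, second_draw))
```

Changing the value passed to set.seed generally gives a different, but equally reproducible, split.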