Sem Spirit

Separating source and target variables

To simplify the next steps of data preprocessing, we separate the source variables (independent variables) from the target variable to be predicted (dependent variable) by adding these lines:

#splitting the dataset into the source variables (independent variables) and the target variable (dependent variable)
sourcevars = dataset[:,:-1] #all columns except the last one
targetvar = dataset[:,-1] #only the last column

The aforementioned ‘dataset’ array:

[Figure: Initial Dataset]


… is split into the following ‘sourcevars’ and ‘targetvar’ arrays:
[Figure: Source Variables]


[Figure: Target Variable]
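
The slicing can be checked on a small array before running it on the real data. The sketch below uses a hypothetical 3-row array named `toy` (not the tutorial's dataset) with two source columns followed by one target column; it only illustrates which columns each slice keeps.

```python
import numpy as np

#hypothetical toy array: two source columns followed by one target column
toy = np.array([[1.0, 10.0, 0.0],
                [2.0, 20.0, 1.0],
                [3.0, 30.0, 0.0]])

toy_sourcevars = toy[:, :-1] #all columns except the last one
toy_targetvar = toy[:, -1] #only the last column

print(toy_sourcevars.shape) #(3, 2): a 2-D array of source variables
print(toy_targetvar.shape) #(3,): a 1-D array of target values
```

Note that the target comes out as a one-dimensional array, while the source variables keep their two-dimensional shape.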

The whole Python script becomes:

#importing libraries
import numpy as n
import matplotlib.pyplot as m
import pandas as p

#loading the dataset
dataset = p.read_csv('dataset.csv', sep=',').values
#dataset = p.DataFrame(dataset)
#dataset = dataset.values

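#filling blank cells
#note: Imputer was removed in scikit-learn 0.22 (replaced by sklearn.impute.SimpleImputer), so this step requires an older version
from sklearn.preprocessing import Imputer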
imputer = Imputer(missing_values='NaN', strategy='mean', axis = 0)
imputer = imputer.fit(dataset[:, 2:6])
dataset[:, 2:6] = imputer.transform(dataset[:, 2:6])
#dataset = p.DataFrame(dataset)
#dataset = dataset.values

#turning textual data to numerical
from sklearn.preprocessing import LabelEncoder
labelencoder_0 = LabelEncoder() #independent variable encoder
dataset[:,0] = labelencoder_0.fit_transform(dataset[:,0])
labelencoder_1 = LabelEncoder() #independent variable encoder
dataset[:,1] = labelencoder_1.fit_transform(dataset[:,1])
labelencoder_6 = LabelEncoder() #dependent (target) variable encoder
dataset[:,6] = labelencoder_6.fit_transform(dataset[:,6])
#dataset = p.DataFrame(dataset)
#dataset = dataset.values

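#taking care of wrong order relationships
#note: the categorical_features parameter was removed in scikit-learn 0.22 (replaced by sklearn.compose.ColumnTransformer)
from sklearn.preprocessing import OneHotEncoder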
onehotencoder_01 = OneHotEncoder(categorical_features = [0, 1])
dataset = onehotencoder_01.fit_transform(dataset).toarray()
#dataset = p.DataFrame(dataset)
#dataset = dataset.values

#splitting the dataset into the source variables (independent variables) and the target variable (dependent variable)
sourcevars = dataset[:,:-1] #all columns except the last one
targetvar = dataset[:,-1] #only the last column
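
For readers on a recent scikit-learn release, where `Imputer` and the `categorical_features` parameter are no longer available, the same steps might be sketched as follows. This is a variant, not the script above: it splits source and target first and then preprocesses them separately. The small in-memory table and its column names are hypothetical stand-ins for `dataset.csv`, assumed to have the same layout (two textual columns, four numeric columns, one textual target).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#hypothetical stand-in for dataset.csv: 2 textual columns, 4 numeric columns, 1 textual target
frame = pd.DataFrame({
    'country': ['France', 'Spain', 'France', 'Italy'],
    'gender': ['F', 'M', 'M', 'F'],
    'age': [34.0, np.nan, 51.0, 28.0],
    'salary': [52000.0, 48000.0, np.nan, 61000.0],
    'height': [1.70, 1.82, 1.76, np.nan],
    'weight': [65.0, 80.0, 77.0, 58.0],
    'purchased': ['yes', 'no', 'yes', 'no'],
})

#splitting first, so that source and target variables are preprocessed separately
sourcevars_raw = frame.iloc[:, :-1] #all columns except the last one
targetvar = LabelEncoder().fit_transform(frame.iloc[:, -1]) #only the last column, as integers

#one-hot encoding the textual columns and filling blank numeric cells with the column mean
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(), [0, 1]),
     ('impute', SimpleImputer(strategy='mean'), [2, 3, 4, 5])],
    sparse_threshold=0, #always return a dense array
)
sourcevars = preprocess.fit_transform(sourcevars_raw)

print(sourcevars.shape) #(4, 9): 3 + 2 one-hot columns, then the 4 numeric columns
print(targetvar.shape) #(4,)
```

As in the original script, the one-hot columns come first in `sourcevars`, followed by the numeric columns.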

Next step: scaling the values in the source variables array, so that no variable dominates another.