Sem Spirit

Numerical relabeling of textual data

Since learning algorithms usually take numerical values as input, each textual label should be encoded as a number.
To do so, we use a label encoder and apply it to the columns of the dataset that contain textual values. The encoder assigns each distinct label in a column an integer between 0 and the number of distinct labels minus 1.


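#turning textual data to numerical
from sklearn.preprocessing import LabelEncoder
#LabelEncoder expects a 1-D array, so each textual column gets its own encoder
labelencoder_0 = LabelEncoder() #independent variable encoder
dataset[:,0] = labelencoder_0.fit_transform(dataset[:,0])
labelencoder_1 = LabelEncoder() #independent variable encoder
dataset[:,1] = labelencoder_1.fit_transform(dataset[:,1])
labelencoder_6 = LabelEncoder() #dependent (target) variable encoder
dataset[:,6] = labelencoder_6.fit_transform(dataset[:,6])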

After running this script, the textual columns of the dataset (columns 0, 1 and 6) contain integer codes instead of strings.
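As a small illustration of what the encoder does to a single column, here is a minimal sketch. The `countries` values are made up for the example and are not taken from the dataset used in this tutorial.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical textual column (illustrative values only)
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'France'], dtype=object)

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)  # learn the labels, then encode them

print(encoder.classes_)  # distinct labels, sorted alphabetically
print(codes)             # one integer per row: [0 2 1 2 0]
print(encoder.inverse_transform(codes))  # back to the original strings
```

The integer given to a label is simply its position in the sorted `classes_` array, so the numbers carry no meaning beyond identifying the label.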

The whole Python script becomes:

#importing libraries
import numpy as n
import matplotlib.pyplot as m
import pandas as p

#importing the dataset
dataset = p.read_csv('dataset.csv', sep=',').values

#filling blank cells
from sklearn.impute import SimpleImputer #replaces Imputer, which was removed in scikit-learn 0.22
imputer = SimpleImputer(missing_values=n.nan, strategy='mean') #mean of each column
imputer = imputer.fit(dataset[:, 2:6])
dataset[:, 2:6] = imputer.transform(dataset[:, 2:6])

#turning textual data to numerical
from sklearn.preprocessing import LabelEncoder
labelencoder_0 = LabelEncoder() #independent variable encoder
dataset[:,0] = labelencoder_0.fit_transform(dataset[:,0])
labelencoder_1 = LabelEncoder() #independent variable encoder
dataset[:,1] = labelencoder_1.fit_transform(dataset[:,1])
labelencoder_6 = LabelEncoder() #dependent (target) variable encoder
dataset[:,6] = labelencoder_6.fit_transform(dataset[:,6])
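To try the script without the CSV file, the sketch below runs the same two steps on a tiny in-memory array. The rows are hypothetical and only mimic the assumed layout (textual columns 0 and 1, numerical columns 2 to 5, textual target in column 6). It also assumes a recent scikit-learn where `SimpleImputer` is available.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

nan = np.nan
# Hypothetical rows: [text, text, num, num, num, num, text target]
dataset = np.array([
    ['Paris', 'F', 25.0, 30.0, 3.0, 0.5, 'no'],
    ['Lyon',  'M', nan,  42.0, 5.0, 0.7, 'yes'],
    ['Paris', 'M', 40.0, nan,  1.0, 0.9, 'yes'],
    ['Nice',  'F', 31.0, 36.0, nan, 0.7, 'no'],
], dtype=object)

# Fill blank numerical cells with the mean of their column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
dataset[:, 2:6] = imputer.fit_transform(dataset[:, 2:6])

# One LabelEncoder per textual column, kept so the codes can be decoded later
encoders = {}
for col in (0, 1, 6):
    encoders[col] = LabelEncoder()
    dataset[:, col] = encoders[col].fit_transform(dataset[:, col])

print(dataset)
```

Keeping the fitted encoders in a dictionary is a design choice of this sketch: `encoders[6].inverse_transform(...)` can then turn predicted codes back into the original target labels.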

Next step: correcting the irrelevant ordering between encoded values.