When some values are missing from the dataset, neither removing the rows that contain blank cells nor filling those cells with random values is a good option, since either choice can distort the model built from the data. The values used to fill the blank cells should be as neutral as possible with respect to the learning strategy, so that they influence the learning algorithm as little as possible. There are three basic strategies for handling missing data: replacing a blank cell with (1) the mean, (2) the median, or (3) the most frequent value of its column. In our case we are going to use the mean of the column, which is a common default for numerical data.
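To make the three strategies concrete, here is a minimal sketch that uses only the Python standard library. The column values and the helper name `fill` are made up for illustration; blank cells are represented as NaN.

```python
import math
import statistics

# Hypothetical column with two blank cells, represented as NaN.
column = [1.0, 2.0, 2.0, math.nan, 7.0, math.nan]

# Only the observed (non-blank) values take part in computing the fill value.
observed = [v for v in column if not math.isnan(v)]

# One candidate fill value per strategy.
fill_values = {
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "most_frequent": statistics.mode(observed),
}

def fill(values, replacement):
    """Return a copy of values where every NaN is replaced by replacement."""
    return [replacement if math.isnan(v) else v for v in values]

filled_with_mean = fill(column, fill_values["mean"])

print(fill_values)
print(filled_with_mean)
```

On this toy column the mean (3.0) differs from the median and the most frequent value (both 2.0), because the single large value 7.0 pulls the mean upwards. This is the kind of difference to keep in mind when choosing a strategy.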
The code below applies this strategy to the numerical columns that contain missing data: the columns of the array with indices 2 to 5 (the slice 2:6 excludes index 6).
#filling blank cells
#SimpleImputer is available in scikit-learn 0.20 and later
from sklearn.impute import SimpleImputer
#missing_values defaults to NaN; the mean is computed column by column
imputer = SimpleImputer(strategy='mean')
imputer = imputer.fit(dataset[:, 2:6])
dataset[:, 2:6] = imputer.transform(dataset[:, 2:6])
After running those lines, the dataset looks like this:
The red cells were previously blank and now hold the mean value of their column.
The whole Python script becomes:
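To check what mean imputation does to the selected columns, the same effect can be reproduced by hand with NumPy. This is only a sketch on a small made-up array (the name `data` and its values are assumptions, not the tutorial's dataset), and it is not how scikit-learn implements the imputer internally.

```python
import numpy as np

# Hypothetical 3x7 array standing in for the dataset; NaN marks blank cells.
data = np.array([
    [0.0, 1.0, 10.0, 1.0, np.nan, 4.0, 9.0],
    [0.0, 1.0, np.nan, 3.0, 6.0, 4.0, 9.0],
    [0.0, 1.0, 20.0, np.nan, 8.0, np.nan, np.nan],
])

# The slice 2:6 selects the columns with indices 2, 3, 4 and 5.
block = data[:, 2:6]

# Mean of each selected column, computed on the non-missing values only.
col_means = np.nanmean(block, axis=0)

# Replace each NaN in the block with the mean of its own column.
data[:, 2:6] = np.where(np.isnan(block), col_means, block)

print(col_means)
print(data)
```

Column 6 lies outside the slice, so its blank cell is left untouched. Widening the slice is the way to include more columns.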
#importing libraries
import numpy as n
import matplotlib.pyplot as m
import pandas as p
#importing the dataset
dataset = p.read_csv('dataset.csv', sep=',').values
#filling blank cells
#SimpleImputer is available in scikit-learn 0.20 and later
from sklearn.impute import SimpleImputer
#missing_values defaults to NaN; the mean is computed column by column
imputer = SimpleImputer(strategy='mean')
imputer = imputer.fit(dataset[:, 2:6])
dataset[:, 2:6] = imputer.transform(dataset[:, 2:6])
Next step: numerical relabeling of textual data.