This page shows how to apply the backward elimination method to the Sacramento real estate dataset (whose first 36 rows are shown in the figure below) in order to obtain a nearly optimal multiple linear regression model.

The first step consists of loading and preparing the data. The following code builds an array X with 41 columns and 814 rows from the Sacramento dataset, which initially contains 6 columns and 814 rows (the extra columns come from one-hot encoding the categorical variables).

```python
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# loading the dataset
dataset = pd.read_csv("dataset-4-E3-FINAL.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# encoding the categorical columns "City" (index 0) and "Type" (index 1)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])

# one-hot encode columns 0 and 1, keeping the remaining columns unchanged
# (ColumnTransformer replaces the categorical_features argument that was
# removed from OneHotEncoder in recent versions of scikit-learn)
onehotencoder = ColumnTransformer([("encoder", OneHotEncoder(), [0, 1])],
                                  remainder="passthrough")
X = onehotencoder.fit_transform(X)
X = X.toarray() if hasattr(X, "toarray") else X

# avoiding the dummy variable trap : removing the first "Type" dummy variable,
# index 36 (among 36, 37, 38)
X = np.append(arr=X[:, 0:36], values=X[:, 37:], axis=1)

# avoiding the dummy variable trap : removing the first "City" dummy variable (index 0)
X = X[:, 1:]

# adding one column filled with "1" corresponding to the constant (intercept)
X = np.append(arr=np.ones((814, 1)).astype(int), values=X, axis=1)
```
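Why drop one dummy variable per encoded category? With the intercept column present, the full set of dummies is linearly dependent (the "dummy variable trap"), so OLS cannot estimate unique coefficients. A minimal sketch with a hypothetical 3-level category:

```python
import numpy as np

# one-hot encoding of a hypothetical 3-level category over 4 rows
dummies = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0]])
intercept = np.ones((4, 1))

full = np.hstack([intercept, dummies])            # intercept + ALL dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # first dummy dropped

# the dummies sum to the intercept column, so `full` is rank-deficient
print(np.linalg.matrix_rank(full))     # 3, although it has 4 columns
print(np.linalg.matrix_rank(reduced))  # 3 == its number of columns
```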

The X variable now contains an array whose first 36 rows are as follows:

The package containing the multiple linear regression algorithm (ordinary least squares) is « statsmodels.api ».

```python
# importing the package for linear regression
import statsmodels.api as sm
```

The first step of the backward elimination method consists of fitting the model with all the variables (the full array stored in X).

```python
# Building the optimal model using backward elimination
# STEP 1 : fit the model with all 41 variables
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
              21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
```

The summary function shows that the columns corresponding to « Type », « Baths » and « Beds » (indices 36 to 39) have a p-value above the significance level SL.

Therefore, we remove those columns and obtain a new array « X_opt » with 37 columns. The first 36 rows of « X_opt » are as follows:

Then we fit the model again on this new array « X_opt ».

```python
# STEP 2 : remove the variables with P > SL, i.e. the variables at indices 36 to 39
# (taking the constant column into account) corresponding to "Type", "Baths" and "Beds"
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
              21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
```

The information given by the summary function (below) shows that almost all the remaining variables have P < SL. Therefore, the model containing the predictors « City » and « SquareFeet » is nearly optimal.

Throughout the backward elimination process, it is important to track the Adjusted R-squared value of each model built. In our case this value remains constant at 0.997. The final (standard) R-squared value is also 0.997, which suggests that the regression model fits the data well (the closer to 1, the better).

We can get the predictions from the dataset with the following script :

```python
# getting the training set predictions
y_pred = regressor_OLS.predict(X_opt)
```

In order to visualize the multiple linear model in a 2D graph, we can restrict the dataset to the 424 SACRAMENTO rows (more than half of the whole dataset), whose first 22 rows are shown below:
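If a separate « dataset-SACRAMENTO.csv » file is not at hand, the same restriction can be obtained directly with pandas boolean indexing. A sketch on a hypothetical mini-frame (the column names and values below are stand-ins, not the real file):

```python
import pandas as pd

# hypothetical mini-frame standing in for the full dataset
df = pd.DataFrame({"City": ["SACRAMENTO", "ELK GROVE", "SACRAMENTO"],
                   "SquareFeet": [836, 1167, 796],
                   "Price": [59222, 91002, 68212]})

# keep only the SACRAMENTO rows
sac = df[df["City"] == "SACRAMENTO"].reset_index(drop=True)
```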

This is done with the following script, which applies all the previous steps to prepare the data and evaluate the optimized model:

```python
# preparing the dataset restricted to SACRAMENTO rows
# loading the dataset (rows restricted to SACRAMENTO city)
dataset_SAC = pd.read_csv("dataset-SACRAMENTO.csv")
x_SAC = dataset_SAC.iloc[:, :-1].values
y_SAC = dataset_SAC.iloc[:, -1].values

# reusing the encoders fitted on the full dataset
x_SAC[:, 0] = labelencoder_X_0.transform(x_SAC[:, 0])
x_SAC[:, 1] = labelencoder_X_1.transform(x_SAC[:, 1])
x_SAC = onehotencoder.transform(x_SAC)
x_SAC = x_SAC.toarray() if hasattr(x_SAC, "toarray") else x_SAC

# same fixes as before : dropping the first "Type" and "City" dummy variables,
# then adding the constant column
x_SAC = np.append(arr=x_SAC[:, 0:36], values=x_SAC[:, 37:], axis=1)
x_SAC = x_SAC[:, 1:]
x_SAC = np.append(arr=np.ones((424, 1)).astype(int), values=x_SAC, axis=1)

# keeping only the columns retained by the backward elimination
x_SAC_opt = x_SAC[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
                      19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40]]
```


```python
# visualising the SACRAMENTO set and the multiple linear model
plt.scatter(x_SAC_opt[:, 36:], y_SAC, color='black')
plt.plot(x_SAC_opt[:, 36:], regressor_OLS.predict(x_SAC_opt), color='red')
plt.title('Price (in dollars $) VS Surface (in feet²) when restricting to the SACRAMENTO area')
plt.xlabel('Surface (in feet²)')
plt.ylabel('Price (in $)')
plt.show()
```

The following graph shows the actual prices (in black) and the predicted prices (in red) according to the surfaces for the city of SACRAMENTO: