Sem Spirit

Multilinear Regression Model in R

This page shows how to apply the backward elimination method in R on the Sacramento real estate dataset in order to obtain a nearly optimal multilinear model.

Initial dataset on real estate transactions around the Sacramento area

The first step consists of loading and preparing the data. The following code loads the Sacramento dataset (814 rows, 6 columns) into a data frame named "dataset" and encodes the two categorical variables as factors. Note that the data frame itself keeps 6 columns; it is the design matrix built by lm(), once the factors are expanded into dummy variables, that contains 41 columns.

#loading the dataset
setwd("D:/WORK/DEEPLEARNING/SITE/Machine-Learning/2_Regression/1_Linear_Regression/2_Multiple_Linear_Regression/1_in_R/DEV")
dataset = read.csv('dataset-4-E3-FINAL.csv')

#Encoding categorical data
dataset$City = factor(dataset$City, levels = c('ANTELOPE', 'AUBURN', 'CAMERON PARK', 'CARMICHAEL', 'CITRUS HEIGHTS', 'COOL', 'DIAMOND SPRINGS', 'EL DORADO HILLS', 'EL DORADO', 'ELK GROVE', 'ELVERTA', 'FAIR OAKS', 'FOLSOM', 'GALT', 'GOLD RIVER', 'GRANITE BAY', 'GREENWOOD', 'LINCOLN', 'LOOMIS', 'MATHER', 'MEADOW VISTA', 'NORTH HIGHLANDS', 'ORANGEVALE', 'PENRYN', 'PLACERVILLE', 'POLLOCK PINES', 'RANCHO CORDOVA', 'RANCHO MURIETA', 'RIO LINDA', 'ROCKLIN', 'ROSEVILLE', 'SACRAMENTO', 'SLOUGHHOUSE', 'WALNUT GROVE', 'WEST SACRAMENTO', 'WILTON'), labels = 1:36)
dataset$Type = factor(dataset$Type, levels = c('Residential', 'Condo', 'Multi-Family'), labels = c(1, 2, 3))

After this step, the variable "dataset" contains a data frame whose first 36 rows look like:
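To illustrate what factor() does here, the following minimal sketch applies the same encoding to a toy column of city names (synthetic values, not the real file):

```r
# Sketch of the encoding step on a toy column: factor() maps each city name
# to one of the declared levels, relabelled here as the numbers 1..3.
city <- c("AUBURN", "SACRAMENTO", "AUBURN", "FOLSOM")
city_encoded <- factor(city,
                       levels = c("AUBURN", "FOLSOM", "SACRAMENTO"),
                       labels = c(1, 2, 3))
print(city_encoded)
# as.integer() recovers the underlying level index of each observation.
print(as.integer(city_encoded))
```

Rows whose value is not among the declared levels would become NA, so the levels vector must list every city present in the file.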

Dataset ready for multilinear regression

The first step of the backward elimination method consists of fitting the model with all the variables (the array above, stored in the variable "dataset").

#Backward Elimination STEP 1: fitting the linear regression model to the dataset with all the variables
regressor = lm(formula = Price ~ City + Type + Beds + Baths + SquareFeet, data = dataset)
summary(regressor)

The summary function shows that the variables "Type", "Baths" and "Beds" have a p-value greater than the chosen significance level (commonly 0.05).
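The p-values reported by summary() can also be extracted programmatically rather than read off the printed table. A minimal sketch, using synthetic data in place of the real dataset (the columns SquareFeet and Noise are illustrative):

```r
# Sketch: extracting coefficient p-values from a fitted lm, on synthetic data.
set.seed(42)
toy <- data.frame(SquareFeet = runif(100, 800, 3000))
toy$Price <- 150 * toy$SquareFeet + rnorm(100, sd = 10)  # strong predictor
toy$Noise <- rnorm(100)                                  # irrelevant predictor
fit <- lm(Price ~ SquareFeet + Noise, data = toy)

# The coefficients matrix of the summary holds the p-values in column "Pr(>|t|)".
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
sl <- 0.05
# Variables whose p-value exceeds the significance level are removal candidates.
print(names(p_values)[p_values > sl])
```

This is exactly the quantity backward elimination inspects at each step.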

Backward Elimination STEP 1: summary of the statistical info of the model

Therefore, we remove these variables from the model formula and refit the model on the same data frame "dataset".

#Backward Elimination STEP 2: fitting the linear regression model to the dataset with the remaining variables
regressor = lm(formula = Price ~ City + SquareFeet, data = dataset)
summary(regressor)

The summary shows that almost all the variables now have a p-value below the significance level.

Backward Elimination STEP 2: summary of statistical info on the model

Throughout the backward elimination process, it is important to track the adjusted R-squared of each model built. Here, it increases slightly, from 0.9965 to 0.9966. The final (ordinary) R-squared equals 0.9967, which suggests the regression model fits the data well (the closer to 1, the better). The model containing the predictors "City" and "SquareFeet" is therefore nearly optimal.
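The adjusted R-squared can be read directly from the summary object, which makes this comparison easy to automate. A minimal sketch on synthetic data (the toy predictors x1 and x2 are illustrative, not part of the Sacramento dataset):

```r
# Sketch: comparing adjusted R-squared before/after dropping a predictor,
# on synthetic data where x2 is deliberately irrelevant.
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 3 * toy$x1 + rnorm(200)

full    <- lm(y ~ x1 + x2, data = toy)
reduced <- lm(y ~ x1,      data = toy)

# summary(model)$adj.r.squared holds the adjusted R-squared.
print(summary(full)$adj.r.squared)
print(summary(reduced)$adj.r.squared)
# Dropping the irrelevant x2 should not decrease adjusted R-squared noticeably.
```

Unlike the ordinary R-squared, the adjusted version penalizes extra predictors, which is why it is the right quantity to watch during elimination.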

We can get the predictions on the dataset with the following script:

#predictions on the dataset
y_pred = predict(regressor, dataset)
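A quick numerical check of these predictions, not covered in the article, is the root mean squared error between predicted and observed prices. A hedged sketch on synthetic data:

```r
# Sketch: measuring in-sample prediction quality with RMSE, on synthetic data.
set.seed(7)
toy <- data.frame(SquareFeet = runif(50, 800, 3000))
toy$Price <- 150 * toy$SquareFeet + rnorm(50, sd = 1000)

fit <- lm(Price ~ SquareFeet, data = toy)
y_pred <- predict(fit, toy)

# RMSE: average size of the prediction error, in the same units as Price.
rmse <- sqrt(mean((toy$Price - y_pred)^2))
print(rmse)
```

On the real dataset the same two lines, applied to regressor and y_pred, would give the error in dollars.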

In order to visualize the multilinear model in a 2D graph, we restrict the dataset to the 424 SACRAMENTO rows (more than half of the whole dataset).
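As a side note, a separate CSV is not strictly required for this restriction: the already-encoded data frame can be filtered in place. A minimal sketch on a toy frame (SACRAMENTO is the 32nd level in the encoding above, hence the label 32):

```r
# Sketch: restricting a data frame to one city by filtering, on toy data.
toy <- data.frame(City  = factor(c(1, 32, 32, 5)),
                  Price = c(100, 200, 300, 150))

# Comparing a factor with == matches against the level labels.
toy_SAC <- subset(toy, City == 32)
print(nrow(toy_SAC))
```

Applied to the real data frame, subset(dataset, City == 32) would yield the same 424 rows as the dedicated file.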

Dataset restricted to the Sacramento rows

This is done in the following script:

#preparing the data to visualize the predictions by restricting the rows on SACRAMENTO
dataset_SAC = read.csv('dataset-4-E3-SACRAMENTO.csv')
dataset_SAC$City = factor(dataset_SAC$City, levels = c('ANTELOPE', 'AUBURN', 'CAMERON PARK', 'CARMICHAEL', 'CITRUS HEIGHTS', 'COOL', 'DIAMOND SPRINGS', 'EL DORADO HILLS', 'EL DORADO', 'ELK GROVE', 'ELVERTA', 'FAIR OAKS', 'FOLSOM', 'GALT', 'GOLD RIVER', 'GRANITE BAY', 'GREENWOOD', 'LINCOLN', 'LOOMIS', 'MATHER', 'MEADOW VISTA', 'NORTH HIGHLANDS', 'ORANGEVALE', 'PENRYN', 'PLACERVILLE', 'POLLOCK PINES', 'RANCHO CORDOVA', 'RANCHO MURIETA', 'RIO LINDA', 'ROCKLIN', 'ROSEVILLE', 'SACRAMENTO', 'SLOUGHHOUSE', 'WALNUT GROVE', 'WEST SACRAMENTO', 'WILTON'), labels = 1:36)
dataset_SAC$Type = factor(dataset_SAC$Type, levels = c('Residential', 'Condo', 'Multi-Family'), labels = c(1, 2, 3))

#showing the actual observations and the predictions in a graph
#install.packages('ggplot2')
library(ggplot2)
ggplot() +
  geom_point(aes(x = dataset_SAC$SquareFeet, y = dataset_SAC$Price), colour = 'black') +
  geom_line(aes(x = dataset_SAC$SquareFeet, y = predict(regressor, newdata = dataset_SAC)), colour = 'red') +
  ggtitle('Price (in $) vs Surface (in feet²) when restricting to the SACRAMENTO area') +
  xlab('Surface (in feet²)') +
  ylab('Price (in $)')

The following graph shows the actual prices (in black) and the predicted prices (in red) as a function of surface area for the city of SACRAMENTO:

Prices (in dollars) vs Surface (in feet²) for real estate transactions in the city of Sacramento (actual observations in black, predictions in red)