
Simple Linear Regression in R

The following code is used to load the dataset and split it into a training set and a test set in R.


#loading the dataset
setwd("[FOLDER]")
dataset = read.csv('dataset.csv')

#splitting the dataset into training and test sets
#install.packages("caTools")
library(caTools)
set.seed(123)
split = sample.split(dataset$StopDistance, SplitRatio=0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
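As a quick sanity check (the exact row counts depend on your dataset), the 80/20 split can be verified with:

#verifying the sizes of the two subsets
nrow(training_set)
nrow(test_set)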

We obtain two data frames. Here is the training set:

Training set (without centering)

And here is the test set:

Test set (without centering)

The following code is used to fit the linear model on the training set.

#Fitting the Linear Regression model to the training set
#as.vector() keeps the predictor a plain vector, which avoids a predict() error
#when the column has been turned into a matrix by scale() (see the note at the end)
regressor = lm(formula = StopDistance ~ as.vector(Speed), data = training_set)
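The fitted intercept and slope can be inspected directly before plotting:

#inspecting the fitted coefficients: intercept (b) and slope (a)
coef(regressor)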

To visualize the data and the predictions we use the ggplot2 package, which can be installed and loaded with the following lines:

install.packages('ggplot2')
library(ggplot2)

Then, with the following lines, we can plot the training set points and draw the predicted model as a line:

#visualizing the training set results
ggplot() +
geom_point(aes(x = training_set$Speed, y = training_set$StopDistance), colour='black') +
geom_line(aes(x = training_set$Speed, y = predict(regressor, newdata=training_set)), colour='red') +
xlab('Speed') +
ylab('StopDistance') +
ggtitle('StopDistance vs Speed: predictions and training set')

We obtain this graph:

StopDistance vs Speed: predictions and training set (without centering)
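To keep a copy of the figure on disk, ggplot2's ggsave() saves the most recently displayed plot (the filename below is just an example):

#saving the last displayed plot to a file
ggsave('training_set_plot.png', width=7, height=5)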

Finally, with the following lines we can plot the test set points on the graph and compare them to the values predicted by the model:

#visualizing the test set results
ggplot() +
geom_point(aes(x = test_set$Speed, y = test_set$StopDistance), colour='black') +
geom_line(aes(x = training_set$Speed, y = predict(regressor, newdata=training_set)), colour='red') +
xlab('Speed') +
ylab('StopDistance') +
ggtitle('StopDistance vs Speed: predictions and test set')

Which leads to this graph:

StopDistance vs Speed: predictions and test set (without centering)
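Beyond the visual comparison, a simple numerical check of the model against the test set is the root mean squared error; a minimal sketch:

#computing the root mean squared error (RMSE) on the test set
test_predictions = predict(regressor, newdata=test_set)
sqrt(mean((test_set$StopDistance - test_predictions)^2))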

To get the predicted stopping distance for a car travelling at 62.138 mph (100 km/h), we run the code below:

#predicting the stopping distance for Speed = 62.138 mph
y = predict(regressor, newdata=data.frame(Speed=62.138))

which returns a value of 232.0136 feet (70.72 meters).
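To get a sense of the uncertainty around this value, predict() can also return a 95% confidence interval for the same input:

#predicting with a 95% confidence interval around the fitted line
predict(regressor, newdata=data.frame(Speed=62.138), interval='confidence', level=0.95)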

Now we can ask ourselves whether this prediction is trustworthy and accurate. In other words, what is the quality of the linear model?
We can obtain some statistics on our model with the following code:


summary(regressor)

Which prints the following statistics:

Linear Model Statistics (without centering)

These statistics are organized as follows:
1) The "Call" is the formula expression that produced this linear model.
2) The "Residuals" are the differences between the actual observed target values (stopping distance) and the target values predicted by the linear model.
3) The "Coefficients" are the constants that define the line y = a.x + b + e. The Estimate of the "(Intercept)" row is the constant b. The second row deals with the slope of the line: its Estimate is the coefficient a. The Std. Error measures the average amount by which the coefficient estimate varies from its true value. The t value measures how many standard errors the coefficient estimate is away from 0. Pr(>|t|) is the probability of observing a value at least as extreme as t if the true coefficient were 0.
4) The "Residual Standard Error" R measures the quality of the linear regression fit. In terms of the equation y = a.x + b + e, it characterizes the error term e, which is assumed to follow N(0, R²).
5) The "R-squared" statistic provides a measure of how well the model fits the actual data.
6) The F-statistic is a good indicator of whether there is a relationship between our predictor and the response variable.
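These quantities can also be extracted programmatically from the summary object; a quick sketch using the standard components of a summary.lm object:

#extracting a few statistics from the model summary
s = summary(regressor)
coef(s) #coefficients table: estimates, standard errors, t values and p values
s$sigma #residual standard error
s$r.squared #R-squared
s$fstatistic #F-statistic and its degrees of freedom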

The most useful part of this output for a quick assessment is the significance-code "star" rating of the intercept and slope (the first and second rows of the "Coefficients" section). Three stars (***) on a row mean that the coefficient is highly significant. The slope row (second row) is usually the most important, but in some cases a good rating on the "(Intercept)" row also matters. In our example, the slope gets a three-star rating but the intercept only one star. Even though a good intercept rating is not mandatory for our purpose here, it is often a good habit to center the predictor variable, which makes the intercept meaningful and usually improves its significance. Centering simply means subtracting a constant C from every value of a variable, which redefines the zero point of that predictor to be the value you subtracted. It shifts the scale but retains the units.

The code below shows how to center our predictor, the Speed variable:

#centering the Speed predictor around its mean
dataset = read.csv('dataset.csv')
dataset$Speed = scale(dataset$Speed, center=TRUE, scale=FALSE)
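Note that scale() returns a one-column matrix rather than a plain numeric vector; this is why the regression formula above wraps the predictor in as.vector(). As an alternative to the scale() call, the centering can be done manually while keeping Speed a plain vector:

#manual centering: subtract the mean, keeping Speed a plain numeric vector
dataset$Speed = dataset$Speed - mean(dataset$Speed)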

The training set now looks like this:

Training set (with centering)

And the test set looks like this:

Test set (with centering)

The centering constant here is C = 15.4, the mean of the original Speed values. In the graphs it must be added back to the predictor variable so that they are drawn identically to the previous ones.
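Since scale() stores the subtracted constant as an attribute of its result, C does not need to be hard-coded (assuming Speed was centered with scale() as above):

#retrieving the centering constant stored by scale()
C = attr(dataset$Speed, 'scaled:center')
C #15.4 in our case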

This is done by updating the previous code that plots the training set and the predictions:


#visualizing the training set results
ggplot() +
geom_point(aes(x = training_set$Speed+15.4, y = training_set$StopDistance), colour='black') +
geom_line(aes(x = training_set$Speed+15.4, y = predict(regressor, newdata=training_set)), colour='red') +
xlab('Speed') +
ylab('StopDistance') +
ggtitle('StopDistance vs Speed: predictions and training set (after centering Speed)')

Which displays the graph of predictions and the training set:

StopDistance vs Speed: predictions and training set (after centering Speed)

Here is the previous test set plotting code, updated with the centering constant:

#visualizing the test set results
ggplot() +
geom_point(aes(x = test_set$Speed+15.4, y = test_set$StopDistance), colour='black') +
geom_line(aes(x = training_set$Speed+15.4, y = predict(regressor, newdata=training_set)), colour='red') +
xlab('Speed') +
ylab('StopDistance') +
ggtitle('StopDistance vs Speed: predictions and test set (after centering Speed)')

Which displays the graph of predictions and the test set:

StopDistance vs Speed: predictions and test set (after centering Speed)

When printing the summary, we now get a three-star evaluation of the intercept, as shown below:

Linear Model Statistics (with centering)

For more information:
– https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
– http://www.theanalysisfactor.com/center-on-the-mean/
– https://stats.stackexchange.com/questions/63600/how-to-translate-the-results-from-lm-to-an-equation
– If a "variable was fitted with type nmatrix.1 but type numeric was supplied" error is encountered when calling predict(), it comes from the matrix class returned by scale(); a workaround is described here: https://stackoverflow.com/questions/22337495/how-to-solve-predict-lm-error-variable-affinity-was-fitted-with-type-nmatr