Simple Linear Regression

In statistics, econometrics and machine learning, a linear regression model is a regression model that seeks to establish a linear relationship between one variable, called the explained (or dependent) variable, and one or more other variables, called the explanatory variables.

It is also simply called a linear model.

Among linear regression models, the simplest is the affine fit. It consists in finding the straight line that best explains the behavior of a statistical variable y as an affine function of another statistical variable x.

In general, a linear regression model designates a model in which the conditional expectation of y given x is an affine function of the parameters. However, we can also consider models in which it is the conditional median of y given x, or any quantile of the distribution of y given x, that is an affine function of the parameters.
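In symbols (my notation, not the article's), the three variants mentioned above can be written as:

```latex
% Conditional mean affine in the parameters (ordinary linear regression):
\mathbb{E}[y \mid x] = \beta_0 + \beta_1 x

% Conditional median variant (least absolute deviations regression):
\operatorname{med}[y \mid x] = \beta_0 + \beta_1 x

% More generally, the \tau-th conditional quantile (quantile regression):
Q_\tau(y \mid x) = \beta_0(\tau) + \beta_1(\tau)\, x
```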

The linear regression model is most often estimated by the least squares method, but there are also many other ways to estimate it, such as maximum likelihood or Bayesian inference.

Although they are often presented together, the linear model and least squares do not refer to the same thing. The linear model designates a class of models that can be estimated by many different methods, while least squares is an estimation method, one that can also be used to estimate other types of models.

Basically, a linear regression consists in finding the line that best fits a set of points. When the points lie in a two-dimensional space, this is called Simple Linear Regression. Since the formula that defines a line is y = a·x + b, the linear regression process finds the values of the constants a and b that define the line representing the best compromise between all the given points. Conceptually, the method considers every candidate line f and computes, for each, the sum of (y_i − f(x_i))² over all pairs (x_i, y_i) in the dataset, where y_i is the observed value of the dependent variable for x_i and f(x_i) is the value predicted by the line. It then chooses the line for which this sum is minimal; in practice, this minimizing line does not have to be searched for by trial and error, since it can be computed directly with a closed-form formula.
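The closed-form solution can be sketched in a few lines of pure Python (this is a minimal illustration, not the article's implementation; the function name `fit_line` and the sample points are mine):

```python
# Closed-form simple linear regression: find a and b minimizing
# sum((y_i - (a*x_i + b))**2) over all points (x_i, y_i).
# The well-known solution is a = cov(x, y) / var(x), b = mean(y) - a*mean(x).

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n  # mean of x
    my = sum(ys) / n  # mean of y
    # Slope: sum of co-deviations over sum of squared x-deviations
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx   # intercept
    return a, b

# Points that lie exactly on y = 2x + 1; the fit recovers a = 2, b = 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

On noisy real-world data the fitted line will of course not pass through every point; it is the line whose sum of squared vertical distances to the points is smallest.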

The dataset in our example below shows the Stop Distance according to the Car Speed. This data was recorded in the 1920s and is no longer up to date, but it is still an interesting dataset for building a linear model.

Car Speed VS Stop Distance - Dataset


In this dataset, the Car Speed is expressed in mph and the distance in feet. For example, one row shows that in the 1920s a car traveling at 24 mph (38.62 km/h) needed a distance of 92 feet (28 meters) to stop.

What we are going to do is:
1) learn a predictive model by applying the Linear Regression method to 80% of the dataset
2) get a quick idea of the quality of our model by applying it to the test set (the remaining 20% of the dataset)
3) show what the stopping distance of a car traveling at 62.138 mph (100 km/h) would have been, had such a car existed in the 1920s
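The three steps above can be sketched as follows in pure Python. The (speed, dist) pairs below are illustrative placeholders, not the actual 1920s dataset, and the 80/20 split uses a fixed random seed so the example is reproducible:

```python
import random

# Placeholder (speed in mph, stopping distance in feet) pairs -- NOT the
# real 1920s data, just plausible-looking values for illustration.
data = [(4, 2), (7, 22), (10, 26), (12, 24), (14, 36),
        (16, 40), (18, 56), (20, 52), (23, 54), (25, 85)]

random.seed(0)
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]   # 80% train, 20% test

# 1) Fit y = a*x + b on the training set (closed-form least squares).
xs = [x for x, _ in train]
mx = sum(xs) / len(xs)
my = sum(y for _, y in train) / len(train)
a = (sum((x - mx) * (y - my) for x, y in train)
     / sum((x - mx) ** 2 for x in xs))
b = my - a * mx

# 2) Quality check: mean squared error on the held-out 20%.
mse = sum((y - (a * x + b)) ** 2 for x, y in test) / len(test)

# 3) Predict the stopping distance at 100 km/h (62.138 mph). This point
# lies far outside the observed speed range, so it is an extrapolation
# and should be taken with caution.
pred_100kmh = a * 62.138 + b
print(f"a={a:.3f}, b={b:.3f}, test MSE={mse:.1f}, "
      f"dist@100km/h={pred_100kmh:.1f} ft")
```

Step 2 deserves a caveat: with such a small test set, the MSE is only a rough indication of model quality; on a real project one would use more data or cross-validation.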