
Polynomial Regression in R

The aim of this script is to build in R the following bivariate polynomial regression model (the observations are represented with blue dots and the predictions with the orange 3D surface):

3D visualization of the observations and of the predictions of the polynomial model in R

We start by setting the working folder and loading the dataset

setwd("[WORKING FOLDER]")
#loading the dataset
dataset = read.csv('oil_dataset-2.csv')

whose first 40 rows (out of 1000) are as follows:

Oil Search Dataset: Mendacium and Depth vs Barrel Price
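
As a quick sanity check, the loaded dataset can also be inspected directly in R, for instance:

#quick inspection of the loaded dataset
str(dataset)      #column names and types
head(dataset, 40) #first 40 rows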

Then we manually build another dataset, data_poly, whose columns correspond to the monomials (the terms) of the targeted polynomial formula. We build this array explicitly so that we keep a clear understanding of the mapping between the variables (columns) and the monomials of the polynomial formula:

#building the data array 'data_poly' in order to proceed to a polynomial regression of degree 4
#the model formula is of the form: P ~ b00*M^0*D^0 + b10*M^1*D^0 + b20*M^2*D^0 + b30*M^3*D^0 + b40*M^4*D^0 + b01*M^0*D^1 + b11*M^1*D^1 + b21*M^2*D^1 + b31*M^3*D^1 + b02*M^0*D^2 + b12*M^1*D^2 + b22*M^2*D^2 + b03*M^0*D^3 + b13*M^1*D^3 + b04*M^0*D^4
#in the following, MiDj stands for M^i * D^j

data_poly = data.frame(matrix(NA, nrow = 1000, ncol = 15))
colnames(data_poly) <- c("M1D0","M2D0","M3D0","M4D0","M0D1","M1D1","M2D1","M3D1","M0D2","M1D2","M2D2","M0D3","M1D3","M0D4", "Price")
#the M0D0 term (the intercept) is added automatically by the linear regression algorithm (lm function)
data_poly$M1D0 = dataset$Mendacium^1 * 1 #equal to the Mendacium variable
data_poly$M2D0 = dataset$Mendacium^2 * 1
data_poly$M3D0 = dataset$Mendacium^3 * 1
data_poly$M4D0 = dataset$Mendacium^4 * 1
data_poly$M0D1 = 1 * dataset$Depth^1 #equal to the Depth variable
data_poly$M1D1 = dataset$Mendacium^1 * dataset$Depth^1
data_poly$M2D1 = dataset$Mendacium^2 * dataset$Depth^1
data_poly$M3D1 = dataset$Mendacium^3 * dataset$Depth^1
data_poly$M0D2 = 1 * dataset$Depth^2
data_poly$M1D2 = dataset$Mendacium^1 * dataset$Depth^2
data_poly$M2D2 = dataset$Mendacium^2 * dataset$Depth^2
data_poly$M0D3 = 1 * dataset$Depth^3
data_poly$M1D3 = dataset$Mendacium^1 * dataset$Depth^3
data_poly$M0D4 = 1 * dataset$Depth^4
data_poly$Price = dataset$Price

Dataset ready to use for polynomial regression
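
As a side note, the same data_poly array can be generated more compactly with a double loop over the exponents. This is only a minimal sketch, assuming the same dataset columns as above; the manual version keeps each monomial explicit, which is the point of this tutorial:

#alternative sketch: generating every monomial M^i * D^j with 1 <= i+j <= 4 in a loop
data_poly = data.frame(Price = dataset$Price)
for (i in 0:4) {
  for (j in 0:4) {
    if (i + j >= 1 && i + j <= 4) {
      data_poly[[paste0("M", i, "D", j)]] = dataset$Mendacium^i * dataset$Depth^j
    }
  }
}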

We start the backward elimination method by including all 14 MiDj variables and fitting the regressor:

#Price ~ all polynomial terms possible with degree <= 4
poly_reg = lm(formula = Price ~ M1D0 + M2D0 + M3D0 + M4D0 + M0D1 + M1D1 + M2D1 + M3D1 + M0D2 + M1D2 + M2D2 + M0D3 + M1D3 + M0D4, data = data_poly)
summary(poly_reg)

This first step produces a regressor with the following summary information. We remove the variable with the greatest p-value P such that P > t, where t is the chosen significance threshold (typically 0.05); here that variable is M2D0.

Backward Elimination STEP 1
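
Rather than reading the candidate for removal off the printed summary, the p-values can also be extracted programmatically from the summary object; a minimal sketch:

#extracting the p-values of the fitted terms (the intercept row is dropped)
pvals = summary(poly_reg)$coefficients[-1, "Pr(>|t|)"]
#the term with the largest p-value is the candidate for removal
sort(pvals, decreasing = TRUE)[1]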

Then we refit the regressor on the new dataset with the 13 remaining variables with the following script:

# - M2D0
poly_reg = lm(formula = Price ~ M1D0 + M3D0 + M4D0 + M0D1 + M1D1 + M2D1 + M3D1 + M0D2 + M1D2 + M2D2 + M0D3 + M1D3 + M0D4, data = data_poly)
summary(poly_reg)

The above script produces the following regressor summary. Again we remove the variable with the greatest p-value such that P > t, which is M1D3.

Backward Elimination STEP 2

We keep applying the backward elimination steps in the same way until we reach an irreducible set of variables:

# - M1D3
poly_reg = lm(formula = Price ~ M1D0 + M3D0 + M4D0 + M0D1 + M1D1 + M2D1 + M3D1 + M0D2 + M1D2 + M2D2 + M0D3 + M0D4, data = data_poly)
summary(poly_reg)

# - M0D1
poly_reg = lm(formula = Price ~ M1D0 + M3D0 + M4D0 + M1D1 + M2D1 + M3D1 + M0D2 + M1D2 + M2D2 + M0D3 + M0D4, data = data_poly)
summary(poly_reg)

# - M3D1
poly_reg = lm(formula = Price ~ M1D0 + M3D0 + M4D0 + M1D1 + M2D1 + M0D2 + M1D2 + M2D2 + M0D3 + M0D4, data = data_poly)
summary(poly_reg)

# - M0D4
poly_reg = lm(formula = Price ~ M1D0 + M3D0 + M4D0 + M1D1 + M2D1 + M0D2 + M1D2 + M2D2 + M0D3, data = data_poly)
summary(poly_reg)

# - M0D3
poly_reg = lm(formula = Price ~ M1D0 + M3D0 + M4D0 + M1D1 + M2D1 + M0D2 + M1D2 + M2D2, data = data_poly)
summary(poly_reg)

# - M2D2
poly_reg = lm(formula = Price ~ M1D0 + M3D0 + M4D0 + M1D1 + M2D1 + M0D2 + M1D2, data = data_poly)
summary(poly_reg)

# - M1D2
poly_reg = lm(formula = Price ~ M1D0 + M3D0 + M4D0 + M1D1 + M2D1 + M0D2, data = data_poly)
summary(poly_reg)

# - M3D0
poly_reg = lm(formula = Price ~ M1D0 + M4D0 + M1D1 + M2D1 + M0D2, data = data_poly)
summary(poly_reg)

# - M1D0
poly_reg = lm(formula = Price ~ M4D0 + M1D1 + M2D1 + M0D2, data = data_poly)
summary(poly_reg)

# - M1D1
poly_reg = lm(formula = Price ~ M4D0 + M2D1 + M0D2, data = data_poly)
summary(poly_reg)

Finally, we remove the last non-significant variable with the following script:

# - M2D1
poly_reg = lm(formula = Price ~ M4D0 + M0D2, data = data_poly)
summary(poly_reg)

which leads to a regressor whose summary, shown below, meets our expectations.

Backward Elimination - Final Step

Throughout the backward elimination process, it is important to check the evolution of the Adjusted R-squared of each model built. In our case, this value remains around 0.968, and the final (standard) R-squared is also close to 0.968. This suggests that the regression model fits the data well (the closer to 1, the better).
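
Both values can be read directly from the summary object at each step, for instance:

#monitoring the goodness of fit after each elimination step
summary(poly_reg)$r.squared     #standard R-squared
summary(poly_reg)$adj.r.squared #adjusted R-squared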

This is the end of the backward elimination: the optimal model formula is P ~ M^4 + D^2.
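
Since every step above follows the same recipe (fit, find the largest p-value, drop the corresponding term), the whole procedure can also be automated. The following is only a minimal sketch (backward_elimination is a hypothetical helper, not a built-in function), assuming the data_poly array built earlier and a significance threshold t of 0.05:

#sketch of an automated p-value-based backward elimination
backward_elimination = function(data, target = "Price", t = 0.05) {
  predictors = setdiff(colnames(data), target)
  repeat {
    f = as.formula(paste(target, "~", paste(predictors, collapse = " + ")))
    fit = lm(formula = f, data = data)
    coefs = summary(fit)$coefficients
    #find the term with the largest p-value (the intercept row is skipped)
    worst = rownames(coefs)[-1][which.max(coefs[-1, "Pr(>|t|)"])]
    #stop when every remaining term is significant (or only one term is left)
    if (coefs[worst, "Pr(>|t|)"] <= t || length(predictors) == 1) return(fit)
    predictors = setdiff(predictors, worst)
  }
}
auto_reg = backward_elimination(data_poly)
summary(auto_reg)

Note that, once the retained terms are known, the same final model can also be fitted directly from the raw dataset with lm(Price ~ I(Mendacium^4) + I(Depth^2), data = dataset), where I() protects the powers inside the formula. R also provides a built-in step() function for stepwise selection, but it is based on the AIC criterion rather than on p-values.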

We finish this script by displaying the observed and predicted Price in a 3D space, with Price along the z axis and Mendacium and Depth along the x and y axes.

#visualizing the observations and predictions in a 3D space
#install.packages('plotly')
library(plotly)
Mendacium = data_poly$M1D0
Depth = data_poly$M0D1
Price = data_poly$Price
plot_ly(x = Mendacium, y = Depth, z = Price, type = "scatter3d", mode = "markers", name = "Obs", marker = list(size = 3)) %>%
  add_trace(x = Mendacium, y = Depth, z = predict(poly_reg, newdata = data_poly), type = "mesh3d", name = "Preds")

The above code produces the following 3D graph, where the predictions take the shape of an orange 3D surface and the observations are represented with blue dots. The model fits the observations rather well, since the blue dots wrap the 3D surface very closely.

3D visualization of the observations and of the predictions of the polynomial model in R

We can also manually check, for any observation (in blue), the corresponding prediction (in orange), as shown below: for a Mendacium concentration of 12.135 g/m^3 at a depth of 2740.358 meters, the actual barrel price is 180.4771 dollars (in blue) whereas the price predicted by the model is 176.8785 dollars (in orange). This is not bad for a prediction, since the gap with the observed price is 3.5986 dollars (i.e. 1.99% of the actual price).

Visually comparing observations and predictions inside the 3D graph
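
The same check can also be scripted. Below is a small sketch, reusing the Mendacium and Depth values from the example above with the final two-term model (new_obs is a hypothetical name):

#predicting the barrel price for a single observation
#(Mendacium = 12.135 g/m^3, Depth = 2740.358 m, values taken from the example above)
new_obs = data.frame(M4D0 = 12.135^4, M0D2 = 2740.358^2)
predicted = predict(poly_reg, newdata = new_obs)
predicted
#gap between the observed price (180.4771 dollars) and the prediction
180.4771 - predicted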