In this use case, we want to build the regression model shown below as a 3D graph. It predicts how profitable a movie will be, measured as its revenue relative to its budget, for a given director and starring actor.

Furthermore, since we are passionate about cinema and like novel, original movies, we want to estimate how successful (in terms of profit) a movie would be for each of the following 22 unprecedented combinations of directors and starring actors. In the table, the RIC column is the one we want to predict:

The dataset used to train the model is built on a numerical feature called "Revenues Over Investments Quotient" (ROIQ). It shows how far the budgets of all the movies in which an individual has been involved during their career have been turned into revenue; in other words, the individual's overall ability to take part in successful movies. More precisely, the ROIQ of an individual is the sum of the revenues divided by the sum of the budgets of all the movies in which they have been involved during their career.
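As a minimal sketch of this definition, the snippet below computes the ROIQ on a tiny made-up table with one row per (person, movie) credit. The data frame `credits`, its column names and all its numbers are assumptions for illustration only; they are not taken from the real dataset.

```r
# Hypothetical toy data: one row per (person, movie) credit.
# Names and amounts are invented for illustration.
credits <- data.frame(
  person  = c("A", "A", "B", "B", "B"),
  budget  = c(10, 30, 20, 20, 10),
  revenue = c(15, 45, 10, 30, 20)
)

# Sum the budgets and the revenues per person...
totals <- aggregate(cbind(budget, revenue) ~ person, data = credits, FUN = sum)

# ...then divide total revenue by total budget to get the ROIQ.
totals$ROIQ <- totals$revenue / totals$budget

print(totals)
```

Note that a ratio of sums is not the same as the average of the per-movie ratios: in this toy example, person B has a ROIQ of 60 / 50 = 1.2, while the mean of B's three per-movie ratios (0.5, 1.5 and 2.0) is about 1.33. The ratio of sums gives more weight to big-budget movies.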

The first step of the dataset preparation is to build two tables associating each individual with their ROIQ value: one table for the directors and one for the actors. The first 40 rows of these two tables are shown below:

Then, using the two ROIQ tables, the second step is to build a table of movies. For each movie, it provides the director with their ROIQ value, the starring actor with their ROIQ value, the budget of the movie, the revenue of the movie, and another feature called "Return on Investment Coefficient" (RIC), which is the movie's revenue divided by its budget. For instance, a RIC value of 1.1508179 means that each $1 invested in the movie returned $1.1508179 of revenue. We obtain a table of 145 rows, the first 40 of which are shown below:
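This second step can be sketched in base R as two joins followed by a division. The column names `DIR_ROIQ`, `STAR_ROIQ` and `RIC` come from the dataset described here; every other name (`TITLE`, `DIRECTOR`, `STAR`, `BUDGET`, `REVENUE`) and all values are hypothetical and chosen only to make the example runnable.

```r
# Hypothetical ROIQ lookup tables (made-up names and values).
dir_roiq  <- data.frame(DIRECTOR = c("D1", "D2"), DIR_ROIQ  = c(1.30, 0.90))
star_roiq <- data.frame(STAR     = c("S1", "S2"), STAR_ROIQ = c(1.10, 1.40))

# Hypothetical movie list; budget and revenue share the same currency unit.
movies <- data.frame(
  TITLE    = c("M1", "M2", "M3"),
  DIRECTOR = c("D1", "D2", "D1"),
  STAR     = c("S2", "S1", "S1"),
  BUDGET   = c(100, 50, 80),
  REVENUE  = c(150, 40, 100)
)

# Attach each person's ROIQ to the movies they worked on.
movies <- merge(movies, dir_roiq,  by = "DIRECTOR")
movies <- merge(movies, star_roiq, by = "STAR")

# RIC = revenue divided by budget, computed per movie.
movies$RIC <- movies$REVENUE / movies$BUDGET

# merge() reorders rows, so sort by title for a stable display.
movies <- movies[order(movies$TITLE), ]
print(movies[, c("TITLE", "DIR_ROIQ", "STAR_ROIQ", "BUDGET", "REVENUE", "RIC")])
```

Only the `DIR_ROIQ`, `STAR_ROIQ` and `RIC` columns are used by the regression later on; the others are kept for readability.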

We load this dataset in R, after setting the working directory, with the following script:

setwd([WORKING DIRECTORY])

#loading the dataset

dataset_all = read.csv('cinema_dataset.csv')

dataset = dataset_all[4:6] #we are training the regressor only on the DIR_ROIQ, STAR_ROIQ and RIC variables

We restrict the dataset to the RIC column and the director and actor ROIQ columns. In the regression model, the two ROIQ columns are the predictors and the RIC column is the variable to predict.

We create the SVM regressor and fit it to the dataset as follows:

#Fitting the SVR to the dataset

#install.packages('e1071')

library(e1071)

regressor1 = svm(formula = RIC ~ ., data = dataset, type = "eps-regression")

We can get a very rough idea of the quality of the regressor by computing the mean and standard deviation of the observed RIC data (1.1497 and 0.4843) with the following script:

mean(dataset$RIC) #mean of observed RIC data

sd(dataset$RIC) #std dev of observed RIC data; the larger it is relative to the mean, the less the mean alone tells us

and comparing them to the mean and standard deviation of the RIC predictions (1.1425 and 0.4197), computed as follows:

RIC_pred=predict(regressor1, newdata=dataset[1:2])

mean(RIC_pred) #mean of predicted RIC

sd(RIC_pred) #std dev of predicted RIC; the larger it is relative to the mean, the less the mean alone tells us

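The values are quite similar, but the standard deviation is too large relative to the mean for the mean alone to be meaningful. More importantly, matching summary statistics do not show whether each individual prediction is close to its own observation. Another evaluation metric, computed observation by observation, is therefore necessary to get a better idea of the quality of the model: the Root Mean Squared Error (RMSE).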
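To see why similar means and standard deviations are not enough, here is a small sketch on made-up numbers (not the movie dataset). The "predictions" are simply the observed values in a random order, so both vectors have the same mean and standard deviation up to floating-point rounding, yet the per-observation error is large.

```r
set.seed(42)

# Made-up "observed" values, roughly in the 0 to 2 range.
observed <- runif(100, min = 0, max = 2)

# Fake "predictions": the same values, randomly reordered.
shuffled <- sample(observed)

# Summary statistics agree by construction.
c(mean_obs = mean(observed), mean_pred = mean(shuffled))
c(sd_obs = sd(observed), sd_pred = sd(shuffled))

# The per-observation error does not.
rmse <- function(error) sqrt(mean(error^2))
shuffled_rmse <- rmse(observed - shuffled)
shuffled_rmse
```

A metric such as the RMSE compares each prediction with its own observation, which is what we actually care about here.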

The following script computes the RMSE between the observed and predicted RIC values and scales it by the range of the observed RIC values.

#checking the rmse value, considering it in percentages of the range of RIC values (from 0 to 2 approx)

rmse <- function(error) { sqrt(mean(error^2)) }

error <- dataset$RIC - RIC_pred

RICPredictionsRMSE <- rmse(error)

RICPredictionsRMSE_scaled = (RICPredictionsRMSE / (max(dataset$RIC) - min(dataset$RIC))) * 100 #approx 11.22% of error (not so bad)

RICPredictionsRMSE_scaled

The scaled RMSE is equal to 11.22% of the range of RIC observations, which is not that bad for a small dataset of 145 rows. Note, however, that this error is measured on the same rows that were used to fit the model, so it is probably optimistic. And since the dataset is small, the predictions are not that reliable, so the model predictions presented below should not be taken too seriously. But let's continue for fun and to satisfy our curiosity.
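Since the RMSE above is measured on the rows used for training, one common way to get a less optimistic estimate is to hold out part of the data. The sketch below is not part of the original analysis: it uses synthetic data (the data frame `toy`, its linear relationship and its noise level are all assumptions) so that it runs without the CSV file. With the real data, `toy` would be replaced by `dataset`.

```r
library(e1071)

set.seed(1)

# Synthetic stand-in for the movie table (assumed shape: 145 rows, 3 columns).
n <- 145
toy <- data.frame(DIR_ROIQ = runif(n, 0, 2), STAR_ROIQ = runif(n, 0, 2))
toy$RIC <- 0.5 * toy$DIR_ROIQ + 0.5 * toy$STAR_ROIQ + rnorm(n, sd = 0.2)

# Keep 20% of the rows aside; the model never sees them during fitting.
test_idx <- sample(n, size = round(0.2 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Same kind of model as in the article, fitted on the training rows only.
fit <- svm(RIC ~ ., data = train, type = "eps-regression")

rmse <- function(error) sqrt(mean(error^2))
rmse_train <- rmse(train$RIC - predict(fit, newdata = train))
rmse_test  <- rmse(test$RIC  - predict(fit, newdata = test))

c(train = rmse_train, test = rmse_test)
```

The held-out RMSE is the one to look at when judging how the model might behave on unseen director and star combinations. With only 145 rows, a single split is noisy, so repeating it over several random splits would give a steadier picture.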

The following script fills the RIC column, which contains "?" in the "unprecedented directors and stars combination" table, with the RIC predictions computed from the associated director and star ROIQ values.

unprec_dir_star_all = read.csv('unprecedented-directors-stars-combination.csv')

unprec_dir_star = unprec_dir_star_all[3:5] #we apply the regressor only on the DIR_ROIQ, STAR_ROIQ variables

unprec_dir_star_pred=predict(regressor1, newdata=unprec_dir_star)

unprec_dir_star_all$RIC = unprec_dir_star_pred

The result is the following table:

This table shows that, according to the model, a movie directed by Steven Spielberg and starring Denzel Washington would have a RIC of 1.4067967. This means that $1 invested in this movie would produce $1.4067967 of revenue, so this pairing would be very profitable. The funny thing is that such a movie with Denzel Washington and Steven Spielberg came close to being made. Steven Spielberg offered Denzel Washington the role of "Cinqué", the main character of the movie "Amistad". But Denzel Washington turned the offer down and Spielberg chose Djimon Hounsou instead, which was not a bad idea since the actual Amistad movie has a profitable RIC of 1.227094278 (even though it is lower than the RIC predicted with Denzel Washington).

To finish, we display the model predictions (3D surface in orange) and the observations (blue dots) in a 3D graph with the following script:

```r
#install.packages('plotly')
library(plotly)

grid_x = seq(min(dataset$DIR_ROIQ) - 0.1, max(dataset$DIR_ROIQ) + 0.1, by = 0.05)
grid_y = seq(min(dataset$STAR_ROIQ) - 0.1, max(dataset$STAR_ROIQ) + 0.1, by = 0.05)
grid_array = expand.grid(grid_x, grid_y)
colnames(grid_array) = c('DIR_ROIQ', 'STAR_ROIQ')
grid_pred = predict(regressor1, newdata=grid_array)

plot_ly(x=as.vector(dataset$DIR_ROIQ), y=as.vector(dataset$STAR_ROIQ), z=dataset$RIC, type="scatter3d", mode="markers", name = "Obs", marker = list(size = 3)) %>%
  add_trace(x=grid_array$DIR_ROIQ, y=grid_array$STAR_ROIQ, z=grid_pred, type = "mesh3d", name = "Preds")
```

This produces the following 3D graph:

We notice that the predictions are not that bad, since the 3D surface fits the observations relatively well. We observe that the closer both the director and star ROIQ values are to zero, the closer the movie RIC value is to zero, and the greater both ROIQ values are, the greater the movie RIC value is.