Random Forest Regression in R

The following shows how to build in R a regression model using random forests with the Los-Angeles 2016 Crime Dataset.

Our goal is to answer the following specific questions :

  1. Considering night sex crimes targeting 14 years old female, compare their number depending on whereas they have occurred at home or in the street.
  2. Considering street robberies targeting 24 years old male, compare their number depending on whereas they have occurred in the afternoon or in the night.
  3. Considering night street violence acts on 29 years old individuals, compare their number depending on whereas they target a female or a male.

And more generally, to display the following three graphs :

  1. Number of night sex crimes in 2016 occurring at home (red curve) or in the street (green curve) according to the female victim age.
  2. Number of street robberies in 2016 occurring in the afternoon (red curve) or in the night (green curve) according to the male victim age.
  3. Number of night street violence acts in 2016 targeting a female (red curve) or a male (green curve) according to the victim age.

We start by setting the working directory and loading the dataset :

setwd("[WORKING DIRECTORY]")
dataset = read.csv('LOS-ANGELES-2016-CRIMES-DATASET.CSV')

The dataset variable contains the following array whose 40 first rows are as follow :

Los-Angeles 2016 Crime Dataset in R

Los-Angeles 2016 Crime Dataset in R

We fit the Random Forest regression model to the dataset with the following script :

#install.packages('randomForest')
library(randomForest)
#set.seed(1234)
regressor = randomForest(x = dataset[1:5], #returns a dataframe
y = dataset$OCCURRENCES, #returns a vector
ntree = 500, mtry = 5, nPerm = 4, nodesize = 2)

It may take a while to build the regressor (20 min on a 8-Core i7 CPU with 8MB RAM). Once the regressor is built, it is advised to save the regressor object on the disk with the saveRDS() function :

saveRDS(regressor, "rfreg_ntree500_mtry5_nPerm4_nodesize2.dat")

We can load back again the regressor (possibly in a different variable) with the readRDS() function :

regressor2 = readRDS("rfreg_ntree500_mtry5_nPerm4_nodesize2.dat")

THE 3 SPECIFIC QUESTIONS

We can now answer to the three specific questions asked above by estimating the number of crimes for each row. This number is a prediction since the rows given below as input to the predict() function do not exist in the dataset.

Before computing the predictions we define the following myrow() function helpful to create the row to give as input to the regressor in order to get the prediction value :

myrow <- function(moment, location, crime, sex, age, flevels){
row=data.frame(MOMENT="", LOCATION="", CRIME="", VICTIM_SEX="", VICTIM_AGE=0)
levels(row$MOMENT) <- flevels$MOMENT
levels(row$LOCATION) <- flevels$LOCATION
levels(row$CRIME) <- flevels$CRIME
levels(row$VICTIM_SEX) <- flevels$VICTIM_SEX
row$MOMENT[1]=moment
row$LOCATION[1]=location
row$CRIME[1]=crime
row$VICTIM_SEX[1]=sex
row$VICTIM_AGE[1]=age
return(row)
}

  • QUESTION 1 : The following script computes the estimated number of NIGHT SEX_CRIME on 14 years old female occurring at HOME compared to the number occurring in the STREET. x1a and x1b are the label encoded rows resp. for [NIGHT, HOME, SEX_CRIME, F, 14] and [NIGHT, STREET, SEX_CRIME, F, 14]

    x1a=myrow(moment="NIGHT", location="HOME", crime="SEX_CRIME", sex="F", age=14, flevels=flevels)
    x1b=myrow(moment="NIGHT", location="STREET", crime="SEX_CRIME", sex="F", age=14, flevels=flevels)
    y1a_pred = predict(regressor, x1a)
    y1b_pred = predict(regressor, x1b)

    After execution : y1a_pred=40 is the estimated number of NIGHT SEX_CRIME on 14 years old female occurring at HOME and y1b_pred=16.9 is the one occuring in the street.

  • QUESTION 2 : This second script computes the estimated number of STREET ROBBERIES on 24 years old male occurring in the AFTERNOON compared to the number occurring in the NIGHT. x2a and x2b are the label encoded rows resp. for [AFTERNOON, STREET, ROBBERY, M, 24] and [NIGHT, STREET, ROBBERY, M, 24].

    x2a=myrow(moment="AFTERNOON", location="STREET", crime="ROBBERY", sex="M", age=24, flevels=flevels)
    x2b=myrow(moment="NIGHT", location="STREET", crime="ROBBERY", sex="M", age=24, flevels=flevels)
    y2a_pred = predict(regressor, x2a)
    y2b_pred = predict(regressor, x2b)

    After execution : y2a_pred=131 is the estimated number of STREET ROBBERIES on 24 years old male occurring in the AFTERNOON and y2b_pred=248 is the number occuring in the night.

  • QUESTION 3 : This third script computes the estimated number of NIGHT STREET VIOLENCE ACT on 29 years old individuals depending on whereas the individual is a female or a male.

    x3a=myrow(moment="NIGHT", location="STREET", crime="VIOLENCE", sex="F", age=29, flevels=flevels)
    x3b=myrow(moment="NIGHT", location="STREET", crime="VIOLENCE", sex="M", age=29, flevels=flevels)
    y3a_pred = predict(regressor, x3a)
    y3b_pred = predict(regressor, x3b)

    After execution : y3a_pred=95.8 the estimated number of NIGHT STREET VIOLENCE ACT on 29 years old females and y3b_pred=129 is the number for males.

THE 3 GRAPHS

Before drawing the three graphs showing the number of crimes fitting some description according to the age of the victim, we define the following myrows() function helpful to create the data frame to give as input to the regressor in order to get the y vector of predictions.


myrows <- function(moment, location, crime, sex, ages, flevels){
n=length(ages)
str_col=rep("", n)
int_col=rep(0, n)
rows=data.frame(MOMENT=str_col, LOCATION=str_col, CRIME=str_col, VICTIM_SEX=str_col, VICTIM_AGE=int_col)
levels(rows$MOMENT) <- flevels$MOMENT
levels(rows$LOCATION) <- flevels$LOCATION
levels(rows$CRIME) <- flevels$CRIME
levels(rows$VICTIM_SEX) <- flevels$VICTIM_SEX
rows$MOMENT[TRUE] = rep(moment, n)
rows$LOCATION[TRUE]=rep(location, n)
rows$CRIME[TRUE]=rep(crime, n)
rows$VICTIM_SEX[TRUE]=rep(sex, n)
rows$VICTIM_AGE=ages
return(rows)
}

We also define the range of ages along which we want to draw the graphs and we load the ggplot library :

#install.packages('ggplot2')
library(ggplot2)

#GRAPHS ACCORDING TO AGE WITHIN THE FOLLOWING RANGE
age_seq = seq(min(dataset$VICTIM_AGE), max(dataset$VICTIM_AGE), 1)

  • GRAPH 1 : The following script displays the graph showing real (dots) and estimated (curve) number of NIGHT SEX CRIMES in 2016 at HOME (red) and in the STREET (green) according to the FEMALE victim age.

    dataset1a = dataset[(dataset$MOMENT == "NIGHT") & (dataset$LOCATION == "HOME") & (dataset$CRIME == "SEX_CRIME") & (dataset$VICTIM_SEX == "F") , ]
    a1a = myrows("NIGHT", "HOME", "SEX_CRIME", "F", age_seq, flevels = flevels)
    y_a1a = predict(regressor, newdata=a1a)

    dataset1b = dataset[(dataset$MOMENT == "NIGHT") & (dataset$LOCATION == "STREET") & (dataset$CRIME == "SEX_CRIME") & (dataset$VICTIM_SEX == "F") , ]
    a1b = myrows("NIGHT", "STREET", "SEX_CRIME", "F", age_seq, flevels = flevels)
    y_a1b = predict(regressor, newdata=a1b)

    ggplot() +
    geom_point(aes(x = dataset1a$VICTIM_AGE, y = dataset1a$OCCURRENCES), colour='red') +
    geom_line(aes(x = a1a$VICTIM_AGE, y = y_a1a), colour='red') +
    geom_point(aes(x = dataset1b$VICTIM_AGE, y = dataset1b$OCCURRENCES), colour='green') +
    geom_line(aes(x = a1b$VICTIM_AGE, y = y_a1b), colour='green') +
    ggtitle('Real (dots) and Estimated (curve) Number of NIGHT SEX CRIME in 2016 at HOME (red) and in the STREET (green) according to the FEMALE victim age.') +
    xlab('Female Victim Age') +
    ylab('Number of Night Sex Crimes')

    The graph looks as below :

    Real (dots) and Estimated (curve) Number of NIGHT SEX CRIMES in 2016 at HOME (red) and in the STREET (green) according to the FEMALE victim age.

    Real (dots) and Estimated (curve) Number of NIGHT SEX CRIMES in 2016 at HOME (red) and in the STREET (green) according to the FEMALE victim age.

    The graph shows that night sex crimes are essentially commited at home rather than in the street.

  • GRAPH 2 : The following script displays the graph showing real (dots) and estimated (curve) number of STREET ROBBERIES in 2016 occurring in the AFTERNOON (red) and in the NIGHT (green) according to the MALE victim age.

    dataset2a = dataset[(dataset$MOMENT == "AFTERNOON") & (dataset$LOCATION == "STREET") & (dataset$CRIME == "ROBBERY") & (dataset$VICTIM_SEX == "M") , ]
    a2a = myrows("AFTERNOON", "STREET", "ROBBERY", "M", age_seq, flevels = flevels)
    y_a2a = predict(regressor, newdata=a2a)

    dataset2b = dataset[(dataset$MOMENT == "NIGHT") & (dataset$LOCATION == "STREET") & (dataset$CRIME == "ROBBERY") & (dataset$VICTIM_SEX == "M") , ]
    a2b = myrows("NIGHT", "STREET", "ROBBERY", "M", age_seq, flevels = flevels)
    y_a2b = predict(regressor, newdata=a2b)

    ggplot() +
    geom_point(aes(x = dataset2a$VICTIM_AGE, y = dataset2a$OCCURRENCES), colour='red') +
    geom_line(aes(x = a2a$VICTIM_AGE, y = y_a2a), colour='red') +
    geom_point(aes(x = dataset2b$VICTIM_AGE, y = dataset2b$OCCURRENCES), colour='green') +
    geom_line(aes(x = a2b$VICTIM_AGE, y = y_a2b), colour='green') +
    ggtitle('Real (dots) and Estimated (curve) Number of STREET ROBBERIES in 2016 occurring in the AFTERNOON (red) and in the NIGHT (green) according to the MALE victim age.') +
    xlab('Male Victim Age') +
    ylab('Number of Street Robberies')

    The graph looks as below :

    Real (dots) and Estimated (curve) Number of STREET ROBBERIES in 2016 occurring in the AFTERNOON (red) and in the NIGHT (green) according to the MALE victim age.

    Real (dots) and Estimated (curve) Number of STREET ROBBERIES in 2016 occurring in the AFTERNOON (red) and in the NIGHT (green) according to the MALE victim age.

    There is a huge peak of robberies around 15 years old that squeezes the rest of the graph. We can get a more detailed graph by zooming the part squeezed by the peak presence. For zooming, we add the coord_cartesian(…) parameter to the ggplot() function call with ylim range :

    ggplot() +
    ...
    ylab('Number of Street Robberies') +
    coord_cartesian(ylim=c(0,280))

    We obtain the following graph :

    Graph 2 after zooming.

    Graph 2 after zooming.

    The graph shows that male targeting street robberies are essentially commited the night rather than in the afternoon.

  • GRAPH 3 : The following script displays the graph showing real (dots) and estimated (curve) number of NIGHT STREET VIOLENCE ACTS in 2016 targetting a female (red) or a male (green) according to the victim age.

    dataset3a = dataset[(dataset$MOMENT == "NIGHT") & (dataset$LOCATION == "STREET") & (dataset$CRIME == "VIOLENCE") & (dataset$VICTIM_SEX == "F") , ]
    a3a = myrows("NIGHT", "STREET", "VIOLENCE", "F", age_seq, flevels = flevels)
    y_a3a = predict(regressor, newdata=a3a)

    dataset3b = dataset[(dataset$MOMENT == "NIGHT") & (dataset$LOCATION == "STREET") & (dataset$CRIME == "VIOLENCE") & (dataset$VICTIM_SEX == "M") , ]
    a3b = myrows("NIGHT", "STREET", "VIOLENCE", "M", age_seq, flevels = flevels)
    y_a3b = predict(regressor, newdata=a3b)

    ggplot() +
    geom_point(aes(x = dataset3a$VICTIM_AGE, y = dataset3a$OCCURRENCES), colour='red') +
    geom_line(aes(x = a3a$VICTIM_AGE, y = y_a3a), colour='red') +
    geom_point(aes(x = dataset3b$VICTIM_AGE, y = dataset3b$OCCURRENCES), colour='green') +
    geom_line(aes(x = a3b$VICTIM_AGE, y = y_a3b), colour='green') +
    ggtitle('Real (dots) and Estimated (curve) Number of NIGHT STREET VIOLENCE ACTS in 2016 targetting a female (red) or a male (green) according to the victim age.') +
    xlab('Victim Age') +
    ylab('Number of Night Street Violence Acts')

    The graph looks as below :

    Real (dots) and Estimated (curve) Number of NIGHT STREET VIOLENCE ACTS in 2016 targetting a female (red) or a male (green) according to the victim age.

    Real (dots) and Estimated (curve) Number of NIGHT STREET VIOLENCE ACTS in 2016 targetting a female (red) or a male (green) according to the victim age.

    The graph shows that night street violence targets mostly male individuals rather than females.