The following shows how to write an R script that uses logistic regression to classify whether an individual description corresponds to a female or a male, based on the age and the number of sexual assaults suffered given in the description.

We start by setting the working directory and loading the dataset:

#importing the dataset

setwd("[WORKING DIRECTORY]")

dataset = read.csv('LOS-ANGELES-2016-SEX_CRIMES-DATASET-6-FINAL.csv')

The ‘dataset’ variable is a data frame whose first 40 rows are:

Since the gender variable to predict is a categorical variable with text values, we encode it with the numeric labels 0 (female) and 1 (male) using the factor() function:

dataset$VICTIM_GENDER = factor(dataset$VICTIM_GENDER, levels = c('F', 'M'), labels = c(0, 1))
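
As a small illustration of what this encoding does, the sketch below applies the same factor() call to a made-up vector. The values in `gender` are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy values, not taken from the dataset
gender <- c('F', 'M', 'M', 'F')

# Same call as above: 'F' is relabelled 0 and 'M' is relabelled 1
encoded <- factor(gender, levels = c('F', 'M'), labels = c(0, 1))

# The new labels are stored as the level names "0" and "1"
print(levels(encoded))
print(as.character(encoded))
```

Note that the result is still a factor, not a numeric vector: as.integer(encoded) returns the internal level codes (1 and 2), so as.numeric(as.character(encoded)) is needed if the values 0 and 1 are required as numbers.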

Then we divide the dataset into training and test sets:

library(caTools)

set.seed(123)

split = sample.split(dataset$VICTIM_GENDER, SplitRatio=0.75)

training_set = subset(dataset, split == TRUE)

test_set = subset(dataset, split == FALSE)
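
sample.split preserves the relative proportions of the labels it is given, so both subsets keep roughly the same female/male ratio. To make that idea concrete, here is a minimal base-R sketch of a stratified 75/25 split on made-up labels. It only illustrates the principle and is not the caTools implementation; `labels` and `in_train` are hypothetical names.

```r
set.seed(123)

# Hypothetical toy labels: 12 'F' and 8 'M'
labels <- factor(c(rep('F', 12), rep('M', 8)))

# For each class separately, mark 75% of its rows as training rows
in_train <- rep(FALSE, length(labels))
for (cls in levels(labels)) {
  idx <- which(labels == cls)
  n_train <- round(0.75 * length(idx))
  in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
}

# Cross-tabulate class against subset membership
print(table(labels, in_train))
```

Sampling within each class, rather than over all rows at once, is what keeps the class ratio the same in the training and test subsets.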

Then we can fit our logistic regression classifier on the training data:

classifier = glm(formula = VICTIM_GENDER ~ ., family = binomial, data = training_set)

Finally, we compute the predicted probabilities for the test set and turn them into class predictions with a 0.5 threshold:

probability_pred = predict(classifier, type='response', newdata=test_set[-3])

y_pred = ifelse(probability_pred > 0.5, 1, 0)
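
The fit-then-threshold pattern can be tried without the dataset. The sketch below reproduces it on synthetic data; the data frame `toy`, its columns and the way the outcome is generated are assumptions made purely for illustration.

```r
set.seed(123)

# Hypothetical synthetic data: two numeric predictors and a 0/1 outcome
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n, sd = 1) > 0, 1, 0))

train <- toy[1:150, ]
test  <- toy[151:200, ]

# Fit on the training rows, then predict P(y = 1) for the test rows
fit  <- glm(y ~ ., family = binomial, data = train)
prob <- predict(fit, type = 'response', newdata = test[-3])

# Probabilities above 0.5 are assigned to class 1, the others to class 0
pred <- ifelse(prob > 0.5, 1, 0)

print(head(data.frame(probability = round(prob, 3), predicted = pred)))
```

With a factor response, glm treats the first level as "failure", so type = 'response' returns the probability of the second level (here 1). In the script above that is the probability of the victim being male.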

To measure the quality of the classification, we use a convenient tool called the "confusion matrix". Each row of the matrix corresponds to an actual (or reference) class and each column to an estimated class, so each cell counts the records of a given actual class that were assigned to a given estimated class. The matrix shows at a glance whether the classifier works: correct classifications lie on the diagonal and errors lie off it.

The following script builds the confusion matrix for the test set:

cm = table(test_set[, 3], y_pred)

The result is the following table:

The sum of all the values of the matrix gives the total number of records in the test set (40 records).

The two rows of the confusion matrix are interpreted as follows:

– among the 21 actual records with female gender, 20 are classified as such and 1 is wrongly classified as male

– among the 19 actual records with male gender, 3 are wrongly classified as female and 16 are correctly classified as male

The two columns are interpreted as follows:

– among the 23 records classified as female, 20 are actually females and 3 are males

– among the 17 records classified as male, 1 is actually a female and 16 are males

The classification performance is therefore good: 36 (20+16) of the 40 test records, i.e. 90%, are correctly classified.
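
The same figures can be checked with a few lines of R. The sketch below rebuilds the matrix from the counts stated above and derives the accuracy, together with the recall and precision for the female class. The 'F'/'M' dimension names are used here for readability only; in the actual `cm` object the classes are labelled 0 and 1.

```r
# Counts from the text: rows are actual classes, columns are estimated classes
cm <- matrix(c(20,  1,
                3, 16),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c('F', 'M'), predicted = c('F', 'M')))

# Share of correctly classified records: (20 + 16) / 40
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual females classified as female: 20 / 21
recall_f <- cm['F', 'F'] / sum(cm['F', ])

# Share of records classified as female that are actually female: 20 / 23
precision_f <- cm['F', 'F'] / sum(cm[, 'F'])

print(round(c(accuracy = accuracy, recall_F = recall_f, precision_F = precision_f), 3))
```

Recall is read along a row of the matrix and precision along a column, which matches the row and column interpretations given above.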

Finally, the following script plots the actual training set observations (dots) and the predictions (coloured area and slope) for the classification of sex crime victims as female (red) or male (green), according to the age and the number of sex crimes in Los Angeles during 2016:

mygraph(set=training_set, xlabel='VICTIM_AGE', xstep=0.1, ylabel='SEX_CRIMES', ystep=1, title='Training set observations (dots) and predictions (coloured area and slope) for the classification as female (red) or male (green) of sex crime victims according to the age and the amount of sex crimes', classifier=classifier)

This script displays the following graph:

A similar graph based on the test set observations can be displayed with the following script:

mygraph(set=test_set, xlabel='VICTIM_AGE', xstep=0.1, ylabel='SEX_CRIMES', ystep=1, title='Test set observations (dots) and predictions (coloured area and slope) for the classification as female (red) or male (green) of sex crime victims according to the age and the amount of sex crimes', classifier=classifier)

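This gives the following graph:

The mygraph(…) function is not a built-in R function; it is a user-defined helper, defined as follows. It must be defined before the two calls above are run, and it assumes that the first two columns of the set are the predictors and the third is the class: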

mygraph <- function(set, xlabel, xstep, ylabel, ystep, title, classifier){

X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = xstep)

X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = ystep)

grid_set = expand.grid(X1, X2)

colnames(grid_set) = c(xlabel, ylabel)

probability_set = predict(classifier, type = 'response', newdata = grid_set)

y_grid = ifelse(probability_set > 0.5, 1, 0)

plot(set[, -3],

main = title,

xlab = xlabel, ylab = ylabel,

xlim = range(X1), ylim = range(X2))

contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)

points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'seagreen4', 'palevioletred'))

points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'darkgreen', 'darkred'))

}