The following shows how to write a Python script that uses Logistic Regression to classify whether an individual record corresponds to a female or a male, based on the victim's age and the number of sexual assaults suffered.
We start by importing the required libraries and loading the dataset:
#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#loading dataset and separating predictors from predicted variable
dataset = pd.read_csv('LOS-ANGELES-2016-SEX_CRIMES-DATASET-6-FINAL.csv')
X = dataset.iloc[:, :-1].values  # victim ages and sexual assaults suffered
y = dataset.iloc[:, -1].values  # gender (last column)
The dataset variable is a pandas DataFrame whose first 40 rows are:

Los Angeles 2016 Sex Crimes Dataset in Python
Since the gender variable to predict is categorical and holds text values, we encode it as integers using the LabelEncoder class:
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
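To see which integer each gender receives, the fitted encoder can be inspected. The sketch below is only illustrative: the label strings "Female" and "Male" are assumptions, since the actual values stored in the CSV are not shown here. LabelEncoder sorts the distinct labels, so the first label in sorted order gets code 0.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical gender labels; the real values in the CSV may differ.
genders = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(genders)

# classes_ holds the distinct labels in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(encoder.classes_.tolist())}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original text labels from the codes.
decoded = encoder.inverse_transform(encoded).tolist()
print(decoded)
```

Under this assumption, female maps to 0 and male to 1, which would match the order of the rows in the confusion matrix and of the colours in the plots.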
Then we divide the dataset into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
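As a rough illustration of what test_size=0.25 does, the sketch below splits a stand-in array. The 160-row size and the balanced labels are hypothetical, chosen only so that the test set ends up with 40 records. The stratify argument is an optional extra that the script above does not use; it keeps the class proportions identical in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 160 rows, 2 features, balanced binary labels.
X_demo = np.arange(320).reshape(160, 2)
y_demo = np.array([0, 1] * 80)

# stratify is an assumption added for illustration, not part of the original script.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)   # sizes of the training and test sets
print(np.bincount(y_te))        # class counts in the test set
```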
Then we can fit our Logistic Regression classifier on the training data:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
And finally, we compute the predictions for the test set:
y_pred = classifier.predict(X_test)
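It can help to see what predict does under the hood for a two-class problem. The sketch below uses synthetic data (an assumption; it is not the Los Angeles dataset) and shows that the predicted label is class 1 exactly when the estimated probability of class 1 exceeds 0.5. That probability is the sigmoid of a linear score built from the fitted coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-feature data: class 1 is shifted away from class 0.
X_demo = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                    rng.normal(2.0, 1.0, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)   # one column per class, in clf.classes_ order
labels = clf.predict(X_demo)        # hard labels

# Linear score w.x + b passed through the sigmoid gives P(class 1).
scores = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
manual_p1 = 1.0 / (1.0 + np.exp(-scores))

# Thresholding the probability at 0.5 reproduces predict().
thresholded = (proba[:, 1] > 0.5).astype(int)
```

Working with predict_proba instead of predict makes it possible to choose a threshold other than 0.5 when one type of error is costlier than the other.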
To measure the quality of the classification, we use a convenient tool called the confusion matrix. Each row of the matrix counts the occurrences of an actual (reference) class, while each column counts the occurrences of a predicted class. The confusion matrix shows at a glance whether the classifier manages to separate the classes correctly.
The following script builds the confusion matrix from the test set:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
The result is the following array:

Confusion Matrix based on the test set in Python
The sum of all the values in the matrix gives the total number of records in the test set (40 records).
The two rows of the confusion matrix are interpreted as follows:
– among the 20 records whose actual gender is female, 16 are classified as female and 4 are wrongly classified as male
– among the 20 records whose actual gender is male, 19 are classified as male and 1 is wrongly classified as female
The two columns are interpreted as follows:
– among the 17 records classified as female, 16 are actually female and 1 is male
– among the 23 records classified as male, 4 are actually female and 19 are male
The classification performance is therefore reasonably good: 35 records (16 + 19) out of 40, or 87.5%, are correctly classified.
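The row and column readings above correspond to the usual recall and precision metrics. As a small worked example, the sketch below rebuilds the matrix from the counts quoted in the text (rows are actual classes, columns are predicted classes, in the order female, male) and derives the metrics with NumPy. The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix from the text: rows = actual, columns = predicted,
# in the order (female, male).
cm = np.array([[16, 4],
               [1, 19]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all records
recall = np.diag(cm) / cm.sum(axis=1)      # per actual class (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class (column-wise)

print(f"accuracy  = {accuracy:.3f}")       # 35 / 40
print(f"recall    = {recall.round(3)}")    # 16/20 and 19/20
print(f"precision = {precision.round(3)}") # 16/17 and 19/23
```

Recall answers the row question (how many actual females were found), while precision answers the column question (how many records classified as female really are female).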
Finally, the following script plots the training set observations (dots) together with the classifier's predictions (coloured regions separated by the decision boundary), classifying sex crime victims in Los Angeles in 2016 as female (red) or male (green) according to their age and the number of sex crimes suffered:
from matplotlib.colors import ListedColormap
myplot('Training set observations (dots) and predictions (coloured area and slope) for the classification as female (red) or male (green) of sex crime victims according to the age and the amount of sex crimes',
'Victim Age', X_train, 'Sex Crimes', y_train)
This script displays the following graph:

Training set observations (dots) and predictions (coloured area and slope) for the classification as female (red) or male (green) of sex crime victims according to the age and the amount of sex crimes
A similar graph based on the test set observations can be displayed with the following script:
from matplotlib.colors import ListedColormap
myplot('Test set observations (dots) and predictions (coloured area and slope) for the classification as female (red) or male (green) of sex crime victims according to the age and the amount of sex crimes',
'Victim Age', X_test, 'Sex Crimes', y_test)
Which gives the following graph:

Test set observations (dots) and predictions (coloured area and slope) for the classification as female (red) or male (green) of sex crime victims according to the age and the amount of sex crimes
The myplot function is not built in: it is a user-defined helper that must be defined before it is called, and it relies on two other helper functions. All three are defined as follows:
def myplot(title, x_name, X_set, y_name, y_set):
    # Colour the feature plane by predicted class, then overlay the observations
    X1, X2 = make_meshgrid(x=X_set[:, 0], y=X_set[:, 1], h=1)
    plot_contours(plt, classifier, X1, X2, alpha=0.75, cmap=ListedColormap(('red', 'green')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        # One scatter call per class so that each class gets its own legend entry
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    color=ListedColormap(('red', 'green'))(i), label=j)
    plt.title(title)
    plt.xlabel(x_name)
    plt.ylabel(y_name)
    plt.legend()
    plt.show()

def plot_contours(ax, clf, xx, yy, **params):
    # Predict the class of every grid point and draw the filled regions
    Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

def make_meshgrid(x, y, h=1):
    # Grid bounds: data range padded by 1 on each side (the margin is an assumption)
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
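The coloured regions rely on one idea: build a regular grid over the two features, flatten it into a list of (x, y) points, predict a class for each point, and reshape the result back to the grid. The sketch below illustrates the grid and flattening steps with NumPy only. The feature values are hypothetical, and no padding is applied, to keep the output small.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # hypothetical first-feature values
y = np.array([10.0, 11.0])      # hypothetical second-feature values

# One grid node per (x, y) combination, with a step of 1 on both axes.
xx, yy = np.meshgrid(np.arange(x.min(), x.max() + 1, 1),
                     np.arange(y.min(), y.max() + 1, 1))

# Flatten both grids and pair them up: one row per grid point,
# which is the (n_samples, n_features) shape that predict() expects.
grid_points = np.array([xx.ravel(), yy.ravel()]).T

print(xx.shape)        # grid shape: (rows along y, columns along x)
print(grid_points)
```

After classifier.predict(grid_points), reshaping the predictions to xx.shape yields the array that contourf colours, as in plot_contours.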