Sem Spirit

Apriori in Python

The following script uses « apyori », a Python implementation of the Apriori algorithm available on PyPI, to extract association rules from the Microsoft Support Website Visits dataset.

We start by importing the needed libraries :

#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

Since the dataset contains a variable number of columns per row, the usual loading techniques from pandas or numpy don’t work very well. Therefore, with the following script we read the dataset file line by line, split each line on commas and append the resulting list of fields to a list of rows :

datalines = []
filepath = 'data.csv'
counter = 0
with open(filepath, 'r') as f:
    for line in f:
        # report progress every 1000 lines
        if counter % 1000 == 0:
            print("Progress : ", counter)
        # split the raw line into its comma-separated fields
        datalines.append(line.strip().split(','))
        #dataset = pd.concat( [dataset, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True )
        counter += 1
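Note that splitting on every comma also cuts a quoted field in two if the field itself contains a comma. As a hedged alternative, here is a minimal sketch based on the standard csv module, which honours quotes and accepts rows of different lengths. The helper name read_rows and the sample lines are assumptions made up for the illustration; they are not taken from the dataset.

```python
import csv
import io

def read_rows(lines):
    """Parse an iterable of text lines (e.g. an open file) into lists of fields.

    csv.reader honours double quotes, so a quoted field containing a comma
    stays in one piece, and rows may have different numbers of columns.
    """
    return [row for row in csv.reader(lines) if row]

# Illustrative sample only: these lines are made up for the demonstration.
sample = io.StringIO(
    'A,1000,1,"Sales, Europe","/sales"\n'
    'C,"10001",10001\n'
    'V,1000,1\n'
)
rows = read_rows(sample)
print(rows[0])                  # the quoted name is kept as a single field
print([len(r) for r in rows])   # rows of different lengths are accepted
```

With a real file, the same helper could be called as read_rows(f) inside a with open(filepath, newline='') as f: block. Since csv.reader already removes the surrounding quotes, a later replace('"','') on the fields would then have nothing left to do.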

The ‘datalines’ variable is a list of rows, the first 30 of which are :

[Figure : Input data rows for apriori algorithm in Python]

This dataset is split into two parts : the first 301 rows provide information on the website pages (their ids and topics), and the rest of the dataset lists, for each visitor, the ids of the pages visited.
The following script processes the datalines to build two data structures : a dictionary ‘categories_by_ids’ mapping each page id to the name of the page, and a list ‘topics_by_visits’ containing, for each visitor, the list of page topics visited.

# rows starting with 'A' describe a page : id in field 1, name in field 3
categories_by_ids = {}
for dataline in datalines:
    if dataline[0] == 'A':
        category_id = dataline[1]
        category_name = dataline[3]
        categories_by_ids[category_id] = category_name.replace('"','')

# a 'C' row opens a new visitor, each following 'V' row is one page visited
topics_by_visits = []
for dataline in datalines:
    if dataline[0] == 'C':
        topics_by_visits.append([])
    elif dataline[0] == 'V':
        last_list = topics_by_visits[-1]
        category_id = dataline[1]
        category_name = categories_by_ids[category_id]
        last_list.append(category_name)
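To make the two structures concrete, here is a small sketch that applies the same logic to a handful of rows. The row layout ('A' for a page, 'C' for a new visitor, 'V' for a visited page) is the one the script above relies on; the sample rows and the function name build_structures are made up for the illustration. The sketch does the work in a single pass, which assumes that all 'A' rows come before the first 'V' row.

```python
# Made-up sample rows, already split on commas (not real dataset rows).
sample_lines = [
    ['A', '1000', '1', '"Page One"', '"/one"'],
    ['A', '1001', '1', '"Page Two"', '"/two"'],
    ['C', '"10001"', '10001'],
    ['V', '1000', '1'],
    ['V', '1001', '1'],
    ['C', '"10002"', '10002'],
    ['V', '1001', '1'],
]

def build_structures(lines):
    """Return (names_by_id, visits) built from pre-split rows in one pass."""
    names_by_id = {}
    visits = []
    for fields in lines:
        kind = fields[0]
        if kind == 'A':
            # page description : id -> name without the surrounding quotes
            names_by_id[fields[1]] = fields[3].replace('"', '')
        elif kind == 'C':
            # a new visitor starts with an empty list of visited topics
            visits.append([])
        elif kind == 'V':
            # a visited page belongs to the most recent visitor
            visits[-1].append(names_by_id[fields[1]])
    return names_by_id, visits

names_by_id, visits = build_structures(sample_lines)
print(names_by_id)
print(visits)
```

Each inner list of the second structure is one transaction, which is the input format expected by the Apriori algorithm.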

At this step we can launch the Apriori algorithm on the list of topics visited by each visitor with the following function call :

rules = apriori(topics_by_visits, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length=2)
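The three thresholds of this call filter on the standard measures of a rule « if A then B » : the support is the share of visitors who visited both A and B, the confidence is that support divided by the support of A, and the lift is the confidence divided by the support of B. The sketch below computes them by hand on a few made-up transactions; the function names and the data are assumptions for the illustration and do not reproduce the internals of apyori.

```python
# Made-up transactions : one list of visited topics per visitor.
transactions = [
    ['Games', 'sports'],
    ['Games', 'sports', 'Products'],
    ['Games'],
    ['Products'],
    ['Products', 'Support Desktop'],
]

def support(itemset, transactions):
    """Share of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def rule_metrics(base, add, transactions):
    """Support, confidence and lift of the rule 'if base then add'.

    Assumes that base and add each occur at least once, otherwise the
    divisions below would fail.
    """
    supp_rule = support(set(base) | set(add), transactions)
    confidence = supp_rule / support(base, transactions)
    lift = confidence / support(add, transactions)
    return supp_rule, confidence, lift

supp, conf, lift = rule_metrics(['sports'], ['Games'], transactions)
print("support =", supp, "confidence =", conf, "lift =", round(lift, 2))
```

A lift above 1 means that visiting A makes a visit to B more likely than it is on average, so min_lift = 3 keeps only the rules where B is at least three times more frequent among the visitors of A.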

To process or visualize the results, we can convert the ‘rules’ object returned by the apriori function (a generator) into a list :

results = list(rules)

And then with the following script we display the results :

print("number of rules", len(results))
# show at most the first 20 rules, even if fewer were found
for result in results[:20]:
    # only the first ordered statistic of each record is displayed
    stat = result.ordered_statistics[0]
    supp = int(result.support*10000)/100
    conf = int(stat.confidence*100)
    hypo = ', '.join(stat.items_base)
    conc = ', '.join(stat.items_add)
    print("if "+hypo+" is visited --> "+str(conf)+" % that "+conc+" is visited [support = "+str(supp)+"%]")

This code shows the following list of results :

if Access Development is visited --> 46 % that MS Access is visited [support = 0.3%]
if ActiveX Technology Development is visited --> 47 % that Developer Workshop is visited [support = 0.87%]
if Internet Development is visited --> 21 % that ActiveX Technology Development is visited [support = 0.33%]
if ActiveX Technology Development is visited --> 51 % that Internet Site Construction for Developers is visited [support = 0.94%]
if ActiveX Technology Development is visited --> 20 % that MS Site Builder Workshop is visited [support = 0.37%]
if ActiveX Technology Development is visited --> 22 % that SiteBuilder Network Membership is visited [support = 0.41%]
if ActiveX Technology Development is visited --> 31 % that Web Site Builder's Gallery is visited [support = 0.58%]
if Corporate Desktop Evaluation is visited --> 47 % that MS Office Info is visited [support = 1.11%]
if Visual Basic is visited --> 26 % that Developer Network is visited [support = 0.57%]
if Visual C is visited --> 46 % that Developer Network is visited [support = 0.53%]
if Visual Studio is visited --> 36 % that Developer Network is visited [support = 0.31%]
if Developer Workshop is visited --> 21 % that Internet Development is visited [support = 0.98%]
if Developer Workshop is visited --> 61 % that Internet Site Construction for Developers is visited [support = 2.81%]
if MS Site Builder Workshop is visited --> 30 % that Developer Workshop is visited [support = 0.3%]
if Developer Workshop is visited --> 26 % that SiteBuilder Network Membership is visited [support = 1.23%]
if Developer Workshop is visited --> 22 % that Web Site Builder's Gallery is visited [support = 1.01%]
if FrontPage is visited --> 47 % that Products is visited [support = 0.79%]
if sports is visited --> 73 % that Games is visited [support = 0.62%]
if IE Support is visited --> 45 % that Support Desktop is visited [support = 0.87%]
if IT Technical Information is visited --> 46 % that Support Desktop is visited [support = 0.55%]

Among these results we notice some meaningful association rules, such as : « if the page dealing with « sports » is visited, then there is a 73% probability that the page dealing with « Games » is also visited » (the support of 0.62% is the share of visitors who visited both « sports » and « Games »).
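The display script reads only the first entry of ordered_statistics for each result, while a result may carry several rules. As a possible extension, the sketch below flattens every rule into one row and sorts the rows by lift. To stay runnable without the library, it uses two stand-in namedtuples that mirror the attributes used above (support, ordered_statistics, items_base, items_add, confidence) plus a lift attribute, which apyori's statistics expose as far as we know; the function name flatten_rules and the demo records are made up.

```python
from collections import namedtuple

# Stand-ins mirroring the attributes read by the display script, plus lift.
OrderedStatistic = namedtuple(
    'OrderedStatistic', 'items_base items_add confidence lift')
RelationRecord = namedtuple(
    'RelationRecord', 'items support ordered_statistics')

def flatten_rules(records):
    """Return one dict per rule, sorted by decreasing lift."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                'if': ', '.join(sorted(stat.items_base)),
                'then': ', '.join(sorted(stat.items_add)),
                'support': record.support,
                'confidence': stat.confidence,
                'lift': stat.lift,
            })
    return sorted(rows, key=lambda row: row['lift'], reverse=True)

# Made-up records for the demonstration (not results from the dataset).
demo = [
    RelationRecord(
        items=frozenset({'A', 'B'}),
        support=0.01,
        ordered_statistics=[
            OrderedStatistic(frozenset({'A'}), frozenset({'B'}), 0.5, 12.5),
            OrderedStatistic(frozenset({'B'}), frozenset({'A'}), 0.25, 12.5),
        ],
    ),
    RelationRecord(
        items=frozenset({'C', 'D'}),
        support=0.02,
        ordered_statistics=[
            OrderedStatistic(frozenset({'C'}), frozenset({'D'}), 0.4, 20.0),
        ],
    ),
]

flat = flatten_rules(demo)
for row in flat:
    print(row)
```

With the real output, flatten_rules(results) should work in the same way, and the resulting list of dicts can be passed to pd.DataFrame for further filtering.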