Sem Spirit


In data mining, the Apriori algorithm is a classic algorithm for association rule mining. It was proposed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant in the field of association rule learning. It is used to identify items that frequently occur together in a dataset and to derive association rules from them. The algorithm generates frequent itemsets level by level, starting from itemsets with a single element. It relies on the following property: if a set of items (an itemset) is frequent, then all of its subsets are also frequent; conversely, if an itemset is not frequent, then none of the sets that contain it are frequent either (the anti-monotonicity principle).
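The anti-monotonicity property can be checked on a small example. The sketch below is only an illustration: the transactions are invented toy data, and the helper names `support` and `subsets_at_least_as_frequent` are hypothetical, not part of any library. Here the support of an itemset is taken to be the fraction of transactions that contain all of its items.

```python
from itertools import combinations

# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"tomato sauce", "beef", "pasta"},
    {"tomato sauce", "pasta"},
    {"beef", "pasta", "bread"},
    {"tomato sauce", "beef", "pasta", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def subsets_at_least_as_frequent(itemset, transactions):
    """Check that every non-empty proper subset has support >= the itemset's."""
    s = support(itemset, transactions)
    return all(
        support(sub, transactions) >= s
        for k in range(1, len(itemset))
        for sub in combinations(itemset, k)
    )


full = {"tomato sauce", "beef", "pasta"}
print(support(full, transactions))
print(subsets_at_least_as_frequent(full, transactions))
```

Adding an item to an itemset can only keep or shrink the set of transactions that contain it, which is why an infrequent itemset can never have a frequent superset.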

One area where this algorithm is widely applicable is market basket analysis. The goal is to explore supermarket sales data to discover association rules, such as the rule that a customer who buys tomato sauce and beef together is likely to buy pasta as well. To derive associations, a bottom-up approach is used: frequent itemsets are extended one element at a time (candidate generation), the candidate groups are then checked against the data, and the algorithm ends when no further extensions are possible. The Apriori algorithm, while historically significant, suffers from some inefficiencies. In particular, candidate generation creates a large number of subsets, each of which must be checked against the data. Other algorithms pursue similar goals, but they are more common in fields where the data carries less temporal information, such as bioinformatics.
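The bottom-up procedure described above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not a reference implementation: the function names `apriori` and `confidence`, the `min_support` threshold, and the baskets are all assumptions made up for this example. Support is the fraction of baskets containing an itemset, and the confidence of a rule is the support of the whole rule divided by the support of its left-hand side.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # Count each candidate against the data and keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # Level 1: itemsets with a single element.
    items = {item for t in transactions for item in t}
    current = keep_frequent({frozenset([i]) for i in items})
    result = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        result.update(current)
        k += 1
    return result


def confidence(antecedent, consequent, supports):
    """Confidence of the rule antecedent -> consequent from precomputed supports."""
    a = frozenset(antecedent)
    return supports[a | frozenset(consequent)] / supports[a]


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"tomato sauce", "beef", "pasta"},
    {"tomato sauce", "beef", "pasta"},
    {"tomato sauce", "pasta"},
    {"beef", "bread"},
    {"bread", "milk"},
]

frequent_itemsets = apriori(baskets, min_support=0.4)
rule_confidence = confidence({"tomato sauce", "beef"}, {"pasta"}, frequent_itemsets)
print(len(frequent_itemsets), rule_confidence)
```

The pruning step is where the anti-monotonicity principle pays off: a candidate is counted against the data only if all of its smaller subsets were already found to be frequent. The pairwise join used here is deliberately naive, which also makes the cost of candidate generation easy to see.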

Use Case: Microsoft’s Support Website Visits Analysis

The goal here is to analyse how Microsoft’s Support website is visited: are there association patterns between the web pages that visitors view?
The dataset can be found in the UCI Machine Learning Repository. This dataset is more than 20 years old, so it says little about how the site is visited today. Still, it provides an interesting use case for testing the Apriori algorithm.

Here are the first 40 rows of the Microsoft Support Website Visits dataset:

Microsoft Support Website Visits Log
