Association Rules: Apriori Algorithm

Introduction to Data Science Machine Learning Association Rules: Apriori Algorithm

Introduction

Apriori algorithm is an association rule-based data mining algorithm that was proposed for frequent itemset mining. This algorithm uses prior knowledge of frequent itemset properties and hence its name is Apriori. An iterative level-wise search is applied to find the frequent itemsets. This approach has two main iterative steps: the join and the prune steps. The iteration continues until the most frequent itemsets are obtained. The key concept of the Apriori algorithm is its anti-monotonicity of support measure, which means that if an itemset set has a value less than minimum support, then all of its supersets will also fall below minimum support, and thus can be ignored.

Working of Apriori

The Apriori algorithm has three main working components in order to find frequent patterns in the data given. These are:

Support
Confidence
Lift

Support

Support refers to the ratio between the number of transactions having a particular item in it to the total number of transactions given in the dataset. Support can indicate the popularity of each product in any data.

Confidence

Confidence refers to the possibility of two items being in the same transaction together with respect to one of them. Confidence determines the likeliness of two items and also determines the more important item out of the two.

Lift

Lift refers to the probability of an increase in the ratio of a particular item when there is a transaction for another item.

Algorithm

In a given dataset, the following steps can be taken to find out the associations:

Take a counter k=1.
Generate frequent itemsets of length k.
Repeat the following steps until no new frequent itemsets are identified:
- Generate length (k+1) candidate itemsets from length k frequent itemsets.
- Prune candidate itemsets containing subsets of length k that are infrequent.
- Count the support of each candidate by scanning the dataset.
- Eliminate candidates that are infrequent, leaving only those that are frequent.

Implementation

Let us try to build associations on the weather dataset using the methods present in the weka package. We can create a dataset in the .arff format for weka to understand the data contents as given below:

We can import the necessary packages using the files mentioned below. The Apriori class provides methods to build associations and display the findings interactively. Hyperparameter tuning can be done as per the needs of the user.

				
					@file:Repository("https://mvnrepository.com/artifact/nz.ac.waikato.cms.weka/weka-stable")
@file:DependsOn("nz.ac.waikato.cms.weka:weka-stable:3.8.0")

				
					import weka.core.Instances
import weka.core.converters.ConverterUtils.DataSource
import weka.associations.Apriori

				
					val dataset = "weather_nominal.arff"
val source = DataSource(dataset)
val datas = source.getDataSet()
val model = Apriori()
model.buildAssociations(datas)
println(model)

Output:

				
					Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric : 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Size of set of large itemsets L(2): 47

Size of set of large itemsets L(3): 39

Size of set of large itemsets L(4): 6

Best rules found:

 1. outlook=overcast 4 ==&gt; play=yes 4     lift:(1.56) lev:(0.1) [1] conv:(1.43)
 2. temperature=cool 4 ==&gt; humidity=normal 4     lift:(2) lev:(0.14) [2] conv:(2)
 3. humidity=normal windy=FALSE 4 ==&gt; play=yes 4     lift:(1.56) lev:(0.1) [1] conv:(1.43)
 4. outlook=sunny play=no 3 ==&gt; humidity=high 3     lift:(2) lev:(0.11) [1] conv:(1.5)
 5. outlook=sunny humidity=high 3 ==&gt; play=no 3     lift:(2.8) lev:(0.14) [1] conv:(1.93)
 6. outlook=rainy play=yes 3 ==&gt; windy=FALSE 3     lift:(1.75) lev:(0.09) [1] conv:(1.29)
 7. outlook=rainy windy=FALSE 3 ==&gt; play=yes 3     lift:(1.56) lev:(0.08) [1] conv:(1.07)
 8. temperature=cool play=yes 3 ==&gt; humidity=normal 3     lift:(2) lev:(0.11) [1] conv:(1.5)
 9. outlook=sunny temperature=hot 2 ==&gt; humidity=high 2     lift:(2) lev:(0.07) [1] conv:(1)
10. temperature=hot play=no 2 ==&gt; outlook=sunny 2     lift:(2.8) lev:(0.09) [1] conv:(1.29)

Advantages and Disadvantages

Advantages

The algorithm is very easy to understand and implement for a given dataset.
The join and prune steps are easy to implement on large itemsets in large dataset.

Disadvantages

The computational requirements for the implementation of this algorithm are very high when the itemsets are very large and the minimum support is set to a very low value.
The algorithm needs to can the entire dataset, making the algorithm costly.

Improvements in Apriori

The following improvements can be made to the Apriori Algorithm so that it works more efficiently:

1) Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold will not be frequent.

2) Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.

3) Partitioning: Any itemset that is potentially frequent in the dataset must be frequent in at least one of the partitions of the dataset.

4) Dynamic itemset counting: Add new candidate itemsets only when all of their subsets are estimated to be frequent.

Previous Topic

Back to Lesson

Next Topic