This demonstration shows how to apply the Apriori algorithm to a categorical dataset. The data comes from the hardware store sales of Arford Inc. and lists the items each customer bought. With this information, we can estimate how likely a customer who buys certain items is to buy others. Recommending those items to customers could help the hardware store increase sales.
Loading and Summarizing Data
First we will load the data and take a look at our transactions. We will be using the arules package because it provides the functions we need for the Apriori algorithm. Then we can summarize our data.
## distribution of transactions with duplicates:
## 1
## 144
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 117 columns (items) and a density of 0.03329357
##
## most frequent items:
##          nails         cement            saw measuring tape    paint brush
##           1780           1325           1302           1269           1211
##        (Other)
##          22332
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   17
## 1358 1431 1125  861 1088  574  342  271  161  111   78   49   22   20    6    1
##   18   20
##    2    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   3.895   5.000  20.000
##
## includes extended item information - examples:
## labels
## 1 allan wrench
## 2 appliances
## 3 asparagus
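The loading step is not shown in this write-up, so here is a minimal sketch of how it could look with arules. The toy baskets and the temporary file are assumptions for illustration only; the real Arford Inc. file and its path are not given here.

```r
library(arules)

# Hypothetical toy baskets: one transaction per line, items separated by commas.
# These stand in for the real sales file, which is not part of this write-up.
baskets <- c(
  "nails,cement,saw",
  "nails,gloves",
  "saw,wood plank,gloves",
  "nails,nails,cement"   # repeated item on purpose
)
path <- tempfile(fileext = ".csv")
writeLines(baskets, path)

# format = "basket" treats each line as one transaction.
# rm.duplicates = TRUE drops items repeated within the same transaction.
trans <- read.transactions(path, format = "basket", sep = ",",
                           rm.duplicates = TRUE)

summary(trans)       # same kind of report as the summary shown above
inspect(trans[1:2])  # print the first two transactions
```

On the real data, `path` would point to the sales file instead of a temporary file.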
Visualizing Common Items
The summary shows the most frequent items, the transaction sizes, and their distribution. It is important to note which items sell the most: rules that involve popular items apply to more transactions, so focusing on them is likely to pay off more than focusing on uncommon goods.
## [1] 0.0028
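To put the item counts in perspective, they can be converted to relative frequencies. The sketch below uses only the counts printed in the summary; the variable names are made up for illustration.

```r
# Counts copied from the summary output above.
n_transactions <- 7501
top_counts <- c(nails = 1780, cement = 1325, saw = 1302,
                "measuring tape" = 1269, "paint brush" = 1211)

# Relative frequency (support) of each single item.
item_support <- top_counts / n_transactions
print(round(item_support, 3))

# How much more often nails appear than the runner-up, cement.
print(round(top_counts[["nails"]] / top_counts[["cement"]], 2))
```

On a transactions object, arules offers `itemFrequency()` and `itemFrequencyPlot()` for the same purpose.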
Creating Model With First Rule
Nails are by far the most popular item, appearing in 1,780 of the 7,501 transactions.
Now that we have taken a look at our data, we can apply the Apriori algorithm to build our first model. Our parameters are support and confidence. I started with a confidence of .8, which returned no rules. Lowering the confidence to .4 returns rules, several of which have a confidence of .5 or more. For those rules, a customer who buys the items on the left-hand side is more likely than not to buy the item on the right-hand side, so sales may benefit from recommending these items.
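The effect of the confidence threshold can be sketched on a small example. The transactions, the support value, `minlen = 2`, and the helper name `mine_rules` are all assumptions chosen for illustration; they are not the settings or data behind the models in this write-up.

```r
library(arules)

# Hypothetical toy transactions (not the Arford Inc. data).
toy <- list(
  c("nails", "cement", "gloves"),
  c("nails", "cement"),
  c("saw", "wood plank", "gloves"),
  c("saw", "wood plank"),
  c("nails", "saw"),
  c("cement", "gloves", "nails")
)
trans <- as(toy, "transactions")

# Hypothetical helper: mine rules at a given minimum confidence.
# minlen = 2 skips rules with an empty left-hand side.
mine_rules <- function(min_conf) {
  apriori(trans,
          parameter = list(support = 0.3, confidence = min_conf, minlen = 2),
          control = list(verbose = FALSE))
}

strict <- mine_rules(0.8)
loose  <- mine_rules(0.4)

# Lowering the confidence threshold can only keep or add rules.
print(length(strict))
print(length(loose))

# Show the looser rule set, strongest lift first.
inspect(sort(loose, by = "lift"))
```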
Second Model
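With the confidence lowered to .4 and a minimum support of .003, the model produces 61 rules. The ten listed here are in descending order of lift, which lets us analyze the strength of our associations.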
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[117 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [113 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [61 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
##      lhs                     rhs          support     confidence coverage    lift     count
## [1]  {paint brush, ruler} => {wood plank} 0.003599520 0.4029851  0.008932142 4.112641 27
## [2]  {drill bits, saw}    => {paint}      0.003466205 0.4193548  0.008265565 3.256295 26
## [3]  {gloves, wood plank} => {saw}        0.004399413 0.5322581  0.008265565 3.066411 33
## [4]  {cement, compost}    => {saw}        0.003732836 0.5185185  0.007199040 2.987256 28
## [5]  {mulch, wood plank}  => {saw}        0.003866151 0.4833333  0.007998933 2.784549 29
## [6]  {vice, wood plank}   => {saw}        0.004132782 0.4769231  0.008665511 2.747619 31
## [7]  {primer, wood plank} => {saw}        0.006932409 0.4333333  0.015997867 2.496493 52
## [8]  {dry wall, vice}     => {nails}      0.003066258 0.5897436  0.005199307 2.485206 23
## [9]  {cement, wood plank} => {saw}        0.007598987 0.4285714  0.017730969 2.469059 57
## [10] {cement, gloves}     => {nails}      0.005999200 0.5769231  0.010398614 2.431180 45
Each rule is reported with several measures. Let's go through them, writing a rule as X => Y.
- lhs - the left-hand side of the rule (the antecedent, X): the items already in the customer's basket.
- rhs - the right-hand side of the rule (the consequent, Y): the item the rule predicts.
- support - the fraction of all transactions that contain every item in the rule (X and Y together).
- confidence - the fraction of transactions containing X that also contain Y, that is, how often Y is purchased when X is purchased.
- lift - the confidence divided by the support of Y. This says how likely Y is to be purchased when X is purchased, while controlling for how popular Y is. A lift greater than 1 means Y is bought more often alongside X than it is across all transactions, a lift less than 1 means it is bought less often, and a lift of 1 means X and Y appear independent.
- coverage - the fraction of all transactions that contain X, regardless of Y.
- count - the number of transactions that contain every item in the rule.
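As a worked check of these definitions, the measures for the rule {gloves, wood plank} => {saw} can be recomputed from counts. The rule count, reported coverage, saw count, and number of transactions come from the output above; the number of transactions containing the left-hand side is not printed, so it is derived here from the reported coverage, which is an assumption of this sketch.

```r
# Numbers taken from the output above.
n                 <- 7501         # total transactions
count_rule        <- 33           # transactions with gloves, wood plank and saw
count_saw         <- 1302         # transactions containing saw (summary output)
coverage_reported <- 0.008265565  # coverage column for this rule

# Derived, not printed in the output: transactions containing the lhs.
count_lhs <- round(coverage_reported * n)

support    <- count_rule / n                # share of all transactions with X and Y
coverage   <- count_lhs / n                 # share of all transactions with X
confidence <- count_rule / count_lhs        # share of X transactions that also have Y
lift       <- confidence / (count_saw / n)  # confidence relative to Y's own support

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 6))
```

The results agree with the support, confidence, coverage, and lift columns reported for that rule.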
Third Model
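Lowering the confidence to .2 increased the number of rules from 61 to 696 and gave us some new associations. The ten rules with the highest lift are listed here.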
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[117 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [113 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [696 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
##      lhs                     rhs          support     confidence coverage    lift     count
## [1]  {plants}             => {soil}       0.006132516 0.3898305  0.015731236 4.947747 46
## [2]  {plants}             => {mulch}      0.003199573 0.2033898  0.015731236 4.527083 24
## [3]  {plants}             => {shovel}     0.005065991 0.3220339  0.015731236 4.506672 38
## [4]  {paint brush, ruler} => {wood plank} 0.003599520 0.4029851  0.008932142 4.112641 27
## [5]  {glue}               => {wood plank} 0.005332622 0.3773585  0.014131449 3.851110 40
## [6]  {shovel, thin brush} => {primer}     0.003999467 0.3614458  0.011065191 3.797206 30
## [7]  {dry wall, nails}    => {vice}       0.003066258 0.2233010  0.013731502 3.730469 23
## [8]  {drill bits}         => {hammer}     0.007065725 0.2398190  0.029462738 3.678696 53
## [9]  {car parts}          => {vice}       0.003332889 0.2136752  0.015597920 3.569661 25
## [10] {primer, shovel}     => {thin brush} 0.003999467 0.2400000  0.016664445 3.509240 30
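One way to see which rules are new at the lower threshold is to mine once at .2 and split the result by confidence. This is a sketch on made-up transactions with an assumed support of 0.15 and `minlen = 2`; it is not the code that produced the output above.

```r
library(arules)

# Hypothetical toy transactions (not the Arford Inc. data).
toy <- list(
  c("nails", "saw"),
  c("nails", "saw", "gloves"),
  c("nails", "cement"),
  c("nails", "cement"),
  c("nails", "cement", "gloves"),
  c("nails"),
  c("nails", "gloves"),
  c("nails", "hammer"),
  c("saw", "wood plank"),
  c("cement", "hammer")
)
trans <- as(toy, "transactions")

rules_02 <- apriori(trans,
                    parameter = list(support = 0.15, confidence = 0.2, minlen = 2),
                    control = list(verbose = FALSE))

# Rules that would also have passed the stricter .4 threshold.
rules_04 <- subset(rules_02, confidence >= 0.4)

# Rules that only appear because the threshold was lowered.
only_low <- subset(rules_02, confidence < 0.4)

print(c(all = length(rules_02), kept = length(rules_04), new = length(only_low)))

# Ten strongest rules by lift, as in the listings above.
inspect(head(sort(rules_02, by = "lift"), 10))
```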
Conclusion
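In conclusion, a confidence of .4 gave the most useful results of the thresholds we tried: .8 produced no rules, while at .2 nine of the ten highest-lift rules have a confidence below .4. The saw is our 3rd most popular item, and it is the right-hand side of six of the ten highest-lift rules in the .4 model. Four of those ten rules have a confidence above .5. Recommending these items would likely have a positive impact on our sales.
Nails also have a strong association with cement and gloves: about 58% of the customers who buy both also buy nails (confidence 0.577, lift 2.43). This is another instance where a recommendation would likely have a positive impact.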
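These observations can be checked directly against the ten rules listed for the .4 model. The sketch below retypes the rhs, confidence, and lift columns from that output into a data frame; the name `top10` is made up for illustration.

```r
# Values copied from the ten highest-lift rules of the .4 model.
top10 <- data.frame(
  lhs = c("paint brush, ruler", "drill bits, saw", "gloves, wood plank",
          "cement, compost", "mulch, wood plank", "vice, wood plank",
          "primer, wood plank", "dry wall, vice", "cement, wood plank",
          "cement, gloves"),
  rhs = c("wood plank", "paint", "saw", "saw", "saw", "saw", "saw",
          "nails", "saw", "nails"),
  confidence = c(0.4029851, 0.4193548, 0.5322581, 0.5185185, 0.4833333,
                 0.4769231, 0.4333333, 0.5897436, 0.4285714, 0.5769231),
  lift = c(4.112641, 3.256295, 3.066411, 2.987256, 2.784549,
           2.747619, 2.496493, 2.485206, 2.469059, 2.431180)
)

# How often each item is the recommended (right-hand side) item.
print(table(top10$rhs))

# Rules where the customer is more likely than not to buy the rhs.
print(top10[top10$confidence > 0.5, ])
```

Rules that pass the confidence filter are the most natural candidates for recommendations, since the suggested item is bought in more than half of the matching transactions.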