Apriori Applied to Hardware Store Customer Data

This is a demonstration of how to apply the Apriori algorithm to a categorical dataset. The data come from the hardware store sales of Arford Inc. and list the items each customer bought. With this information, we can estimate how likely a customer is to buy an item given what is already in the basket. Better-targeted recommendations can in turn help the hardware store increase sales.

Loading and Summarizing Data

First we load the data and take a look at our transactions. We use the arules package because it supports many Apriori functions. Then we summarize the data.

## distribution of transactions with duplicates:
##   1 
## 144

## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  117 columns (items) and a density of 0.03329357 
## 
## most frequent items:
##          nails         cement            saw measuring tape    paint brush 
##           1780           1325           1302           1269           1211 
##        (Other) 
##          22332 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   17 
## 1358 1431 1125  861 1088  574  342  271  161  111   78   49   22   20    6    1 
##   18   20 
##    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.895   5.000  20.000 
## 
## includes extended item information - examples:
##         labels
## 1 allan wrench
## 2   appliances
## 3    asparagus
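The loading code itself is not shown in this write-up. As a minimal sketch, a basket-format file could be read with arules roughly as follows. The four-line file is a made-up stand-in, and `path`, `sep = ","`, and `rm.duplicates = TRUE` are assumptions; the last one is only suggested by the "distribution of transactions with duplicates" line in the output above.

```r
library(arules)

# Hypothetical stand-in for the real sales file: one basket per line,
# items separated by commas. The second basket repeats "nails" on purpose.
path <- tempfile(fileext = ".csv")
writeLines(c("nails,cement,saw",
             "nails,nails,gloves",
             "paint brush,primer",
             "saw,wood plank,gloves"), path)

# rm.duplicates = TRUE drops repeated items within a basket and reports
# how many baskets were affected.
dataset <- read.transactions(path, format = "basket", sep = ",",
                             rm.duplicates = TRUE)

# Density, most frequent items, and the basket-size distribution.
summary(dataset)
```

On the real file, the same `summary()` call would be expected to report the 7501 transactions and 117 items listed above.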

Visualizing Common Items

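The summary shows the most frequent items, the number of item columns, and the distribution of transaction sizes. It is important to note which items sell the most. Rules built around frequently bought items apply to many more transactions, so focusing on those items is likely to affect more sales than focusing on uncommon goods.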
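To put numbers on "most frequent", the counts from the summary can be turned into shares of all transactions and plotted. This sketch uses only base R and the five counts printed above. With a transactions object loaded, arules' `itemFrequencyPlot()` would be the more usual route, but this version does not assume the data are in memory.

```r
# Counts copied from the summary output; 7501 is the number of transactions.
n_transactions <- 7501
top_items <- c(nails = 1780, cement = 1325, saw = 1302,
               "measuring tape" = 1269, "paint brush" = 1211)

# Support of a single item = share of transactions that contain it.
item_support <- round(top_items / n_transactions, 3)
print(item_support)

# Simple bar chart of the five most frequent items.
barplot(top_items, las = 2, ylab = "transactions containing item",
        main = "Five most frequent items")
```

Nails appear in roughly 24% of transactions, against about 18% for cement, the next most frequent item.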

## [1] 0.0028

Creating Model With First Rule

Nails are by far the most popular item.

Now that we have taken a look at our data, we can apply the Apriori algorithm to build our first model. Its parameters are minimum support and minimum confidence. I started with a confidence of .8 and got no rules. Lowering the confidence to .4 returns rules, several of which have a confidence of .5 or more. For those rules, a customer who buys the left-hand-side items is more likely than not to buy the right-hand-side item, so sales may benefit from recommending these items.
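The effect of the confidence threshold can be tried on a small example. The sketch below uses six invented baskets (not the Arford data) and a support of 0.3 chosen only to suit such a tiny dataset. It runs `apriori()` at the two confidence levels discussed above and compares how many rules each returns.

```r
library(arules)

# Toy baskets (hypothetical, not the Arford data) so the sketch runs on its own.
baskets <- list(
  c("cement", "gloves", "nails"),
  c("cement", "gloves", "nails"),
  c("cement", "gloves"),
  c("saw", "wood plank"),
  c("saw", "wood plank", "nails"),
  c("wood plank", "glue")
)
toy <- as(baskets, "transactions")

# Same minimum support, two minimum-confidence thresholds.
strict <- apriori(toy, parameter = list(support = 0.3, confidence = 0.8),
                  control = list(verbose = FALSE))
loose  <- apriori(toy, parameter = list(support = 0.3, confidence = 0.4),
                  control = list(verbose = FALSE))

length(strict)   # rules that survive the strict threshold
length(loose)    # never smaller than the strict count

# Strongest associations first.
inspect(sort(loose, by = "lift"))
```

Because the support threshold is the same in both calls, every rule found at the stricter confidence is also found at the looser one. Lowering the confidence can only add rules.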

Second Model

Lowering the confidence to .4, we can analyze the strength of our associations.

rules2 = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.4))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[117 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [113 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [61 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(sort(rules2, by = 'lift')[1:10])

##      lhs                     rhs          support     confidence coverage   
## [1]  {paint brush, ruler} => {wood plank} 0.003599520 0.4029851  0.008932142
## [2]  {drill bits, saw}    => {paint}      0.003466205 0.4193548  0.008265565
## [3]  {gloves, wood plank} => {saw}        0.004399413 0.5322581  0.008265565
## [4]  {cement, compost}    => {saw}        0.003732836 0.5185185  0.007199040
## [5]  {mulch, wood plank}  => {saw}        0.003866151 0.4833333  0.007998933
## [6]  {vice, wood plank}   => {saw}        0.004132782 0.4769231  0.008665511
## [7]  {primer, wood plank} => {saw}        0.006932409 0.4333333  0.015997867
## [8]  {dry wall, vice}     => {nails}      0.003066258 0.5897436  0.005199307
## [9]  {cement, wood plank} => {saw}        0.007598987 0.4285714  0.017730969
## [10] {cement, gloves}     => {nails}      0.005999200 0.5769231  0.010398614
##      lift     count
## [1]  4.112641 27   
## [2]  3.256295 26   
## [3]  3.066411 33   
## [4]  2.987256 28   
## [5]  2.784549 29   
## [6]  2.747619 31   
## [7]  2.496493 52   
## [8]  2.485206 23   
## [9]  2.469059 57   
## [10] 2.431180 45

The inspected rules come with several columns. Let’s go through them.

lhs - the left-hand side of the rule (the antecedent): the items already in the basket.
rhs - the right-hand side of the rule (the consequent): the item the rule predicts.
support - the fraction of all transactions that contain every item in the rule (lhs and rhs together).
confidence - how often the rhs item Y is purchased when the lhs items X are purchased, i.e. support(X and Y) / support(X).
lift - how much more likely item Y is to be purchased when X is purchased, relative to how popular Y is overall: confidence / support(Y). A lift greater than 1 means Y is bought with X more often than its overall popularity would suggest, while a value less than 1 means Y is bought with X less often than expected.
coverage - the fraction of all transactions that contain the lhs (the antecedent) on its own.
count - the number of transactions that contain every item in the rule; support is count divided by the number of transactions.
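These definitions can be checked by hand against rule [10], {cement, gloves} => {nails}. The sketch below uses only numbers printed in this document. The one derived value is the antecedent count of 78, which is the rule's coverage (0.010398614) multiplied by the 7501 transactions.

```r
n        <- 7501   # transactions in the dataset
count_xy <- 45     # baskets with cement, gloves AND nails (count column, rule 10)
count_x  <- 78     # baskets with cement and gloves; derived as coverage * n
count_y  <- 1780   # baskets with nails (from the summary output)

support    <- count_xy / n                 # share of baskets with X and Y
coverage   <- count_x / n                  # share of baskets with X
confidence <- count_xy / count_x           # share of X-baskets that also hold Y
lift       <- confidence / (count_y / n)   # confidence relative to Y's popularity

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 6)
```

The results should agree with the support, coverage, confidence, and lift columns printed for rule [10] above, up to rounding.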

Third Model

Lowering the confidence to .2 raises the number of rules from 61 to 696 and gives us some new associations.

rules3 = apriori(data = dataset, parameter = list(support = 0.003, confidence = 0.2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 22 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[117 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [113 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [696 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(sort(rules3, by = 'lift')[1:10])

##      lhs                     rhs          support     confidence coverage   
## [1]  {plants}             => {soil}       0.006132516 0.3898305  0.015731236
## [2]  {plants}             => {mulch}      0.003199573 0.2033898  0.015731236
## [3]  {plants}             => {shovel}     0.005065991 0.3220339  0.015731236
## [4]  {paint brush, ruler} => {wood plank} 0.003599520 0.4029851  0.008932142
## [5]  {glue}               => {wood plank} 0.005332622 0.3773585  0.014131449
## [6]  {shovel, thin brush} => {primer}     0.003999467 0.3614458  0.011065191
## [7]  {dry wall, nails}    => {vice}       0.003066258 0.2233010  0.013731502
## [8]  {drill bits}         => {hammer}     0.007065725 0.2398190  0.029462738
## [9]  {car parts}          => {vice}       0.003332889 0.2136752  0.015597920
## [10] {primer, shovel}     => {thin brush} 0.003999467 0.2400000  0.016664445
##      lift     count
## [1]  4.947747 46   
## [2]  4.527083 24   
## [3]  4.506672 38   
## [4]  4.112641 27   
## [5]  3.851110 40   
## [6]  3.797206 30   
## [7]  3.730469 23   
## [8]  3.678696 53   
## [9]  3.569661 25   
## [10] 3.509240 30
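Since both models use the same support, the only thing separating them is the confidence cut-off. As a small illustration, the sketch below copies the ten rules printed above into a data frame (the name `top_rules3` is made up here) and keeps those that would also pass the second model's threshold of .4.

```r
# Top ten rules of the third model, copied from the output above.
top_rules3 <- data.frame(
  rule = c("{plants} => {soil}", "{plants} => {mulch}", "{plants} => {shovel}",
           "{paint brush, ruler} => {wood plank}", "{glue} => {wood plank}",
           "{shovel, thin brush} => {primer}", "{dry wall, nails} => {vice}",
           "{drill bits} => {hammer}", "{car parts} => {vice}",
           "{primer, shovel} => {thin brush}"),
  confidence = c(0.3898305, 0.2033898, 0.3220339, 0.4029851, 0.3773585,
                 0.3614458, 0.2233010, 0.2398190, 0.2136752, 0.2400000),
  lift = c(4.947747, 4.527083, 4.506672, 4.112641, 3.851110,
           3.797206, 3.730469, 3.678696, 3.569661, 3.509240),
  stringsAsFactors = FALSE
)

# Rules that would also clear the second model's confidence threshold.
carried_over <- top_rules3[top_rules3$confidence >= 0.4, ]
print(carried_over)
```

Only {paint brush, ruler} => {wood plank} clears .4, which is why it is the one rule that appears in both top-ten lists. The other nine have high lift but are correct less than 40% of the time.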

Conclusion

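In conclusion, a confidence of .4 gives the most useful results. The saw is our third most popular item, and it is the right-hand side of six of the ten highest-lift rules in the second model. Four of those ten rules have a confidence above .5, meaning more than half of the customers who bought the left-hand-side items also bought the right-hand-side item. Recommending these items would likely have a positive impact on our sales.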

Nails also have a strong association with cement and gloves: customers who buy both also buy nails about 58% of the time (lift 2.43). This is another case where a recommendation to the customer would likely have a positive impact.