This analysis will use the spam7 dataset, which contains metrics from 4,601 emails. The fields can be described as follows:
crl.tot: total length of words in capitals
dollar: number of occurrences of the $ symbol
bang: number of occurrences of the ! symbol
money: number of occurrences of the word “money”
n000: number of occurrences of the string “000”
make: number of occurrences of the word “make”
yesno: a factor with levels n (not spam) and y (spam)
Our goal is to predict whether a message is spam using a decision tree, evaluating the model with a confusion matrix and an ROC curve. We will then use a real estate dataset to show how a decision tree can be used for regression. First, let’s take a look at the structure of our data.
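The setup chunk is not shown in the original, so here is a minimal sketch of the packages used throughout this analysis and the call that produces the structure output below, assuming spam7 comes from the DAAG package:
library(DAAG)        # assumed source of the spam7 dataset
library(rpart)       # decision trees
library(rpart.plot)  # tree visualization
library(caret)       # confusion matrix
library(pROC)        # ROC curves
str(spam7)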
## 'data.frame': 4601 obs. of 7 variables:
## $ crl.tot: num 278 1028 2259 191 191 ...
## $ dollar : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ bang : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ n000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ yesno : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...
We are dealing with mostly numeric variables. It is important that our response column, yesno, is a factor, since that is what tells rpart to build a classification tree. We will set the seed so our results are reproducible, split the data into training and test sets, and use rpart.plot to visualize our initial tree.
mydata <- spam7
set.seed(1234)
# split the data roughly 50/50 into training and test sets
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.5, 0.5))
train <- mydata[ind == 1, ]
test <- mydata[ind == 2, ]
# fit a classification tree; cp is the complexity parameter
tree <- rpart(yesno ~ ., data = train, cp = 0.07444)
rpart.plot(tree)
Looking at the tree, if the dollar value is at least 0.056, the email is predicted to be spam; below that threshold, the remaining branches use the count of ! characters (bang) and the total length of words in capitals (crl.tot) to decide. To test the accuracy of our model we can use a confusion matrix.
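The chunk that produced the output below is not shown; a sketch of one way to obtain it with caret’s confusionMatrix(), assuming class predictions on the test set and y as the positive class (the object name pred is just illustrative):
pred <- predict(tree, test, type = 'class')        # predicted class labels on the test set
confusionMatrix(pred, test$yesno, positive = 'y')  # compare predictions against the truth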
## Confusion Matrix and Statistics
##
## Reference
## Prediction n y
## n 1278 212
## y 127 688
##
## Accuracy : 0.8529
## 95% CI : (0.8378, 0.8671)
## No Information Rate : 0.6095
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6857
##
## Mcnemar's Test P-Value : 5.061e-06
##
## Sensitivity : 0.7644
## Specificity : 0.9096
## Pos Pred Value : 0.8442
## Neg Pred Value : 0.8577
## Prevalence : 0.3905
## Detection Rate : 0.2985
## Detection Prevalence : 0.3536
## Balanced Accuracy : 0.8370
##
## 'Positive' Class : y
##
Our model achieves an accuracy of about 85%. Our main tuning parameter is cp, the complexity parameter, which controls how much the tree is allowed to grow. We can see how changing it affects our error.
rpart(formula = yesno ~ ., data = train)
## n= 2305
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2305 900 n (0.6095445 0.3904555)
## 2) dollar< 0.0555 1740 404 n (0.7678161 0.2321839)
## 4) bang< 0.092 1227 128 n (0.8956805 0.1043195) *
## 5) bang>=0.092 513 237 y (0.4619883 0.5380117)
## 10) crl.tot< 86.5 263 84 n (0.6806084 0.3193916) *
## 11) crl.tot>=86.5 250 58 y (0.2320000 0.7680000) *
## 3) dollar>=0.0555 565 69 y (0.1221239 0.8778761) *
plotcp(tree)
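plotcp() plots the cross-validated error against cp. A common follow-up, sketched here as an assumption rather than code from the original, is to prune the tree at the cp value with the lowest cross-validated error (xerror) in the cp table; best_cp and pruned are illustrative names:
# pick the cp with the smallest cross-validated error and prune to it
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best_cp)
rpart.plot(pruned)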
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: the true positive rate and the false positive rate.
p1 <- predict(tree, test, type = 'prob')  # class probabilities on the test set
p1 <- p1[,2]                              # keep the probability of the positive class (y)
r <- multiclass.roc(test$yesno, p1, percent = TRUE)
## Setting direction: controls < cases
roc <- r[['rocs']]
r1 <- roc[[1]]
plot.roc(r1,
print.auc=TRUE,
auc.polygon=TRUE,
grid=c(0.1, 0.2),
grid.col=c("green", "red"),
max.auc.polygon=TRUE,
auc.polygon.col="lightblue",
print.thres=TRUE,
main= 'ROC Curve')
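If we just want the numbers behind the plot, pROC can report them directly; a small sketch (not in the original) using the r1 object from above:
auc(r1)                                                                 # area under the curve
coords(r1, "best", ret = c("threshold", "sensitivity", "specificity"))  # best cutoff highlighted by print.thres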
Another use for decision trees is regression analysis. We will use a real estate dataset for this example; the response is the median home value, medv, and the predictors that end up in the tree below are lstat, rm, crim, and age. First we will create our initial tree just like before. The main difference in this model is that our response variable is a continuous value instead of a factor.
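The chunk that loads this dataset is not shown. The variables that appear in the tree below (medv, lstat, rm, crim, age) match the Boston housing data, so here is a minimal sketch assuming that source (the MASS package):
library(MASS)     # assumed source of the housing data
mydata <- Boston  # response: medv, the median home value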
set.seed(1234)
# re-split the housing data into training and test sets
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.5, 0.5))
train <- mydata[ind == 1, ]
test <- mydata[ind == 2, ]
# fit a regression tree for the continuous response medv
tree <- rpart(medv ~ ., data = train)
rpart.plot(tree)
printcp(tree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = train)
##
## Variables actually used in tree construction:
## [1] age crim lstat rm
##
## Root node error: 22620/262 = 86.334
##
## n= 262
##
## CP nsplit rel error xerror xstd
## 1 0.469231 0 1.00000 1.01139 0.115186
## 2 0.128700 1 0.53077 0.62346 0.080154
## 3 0.098630 2 0.40207 0.51042 0.076055
## 4 0.033799 3 0.30344 0.42674 0.069827
## 5 0.028885 4 0.26964 0.39232 0.066342
## 6 0.028018 5 0.24075 0.37848 0.066389
## 7 0.015141 6 0.21274 0.34877 0.065824
## 8 0.010000 7 0.19760 0.33707 0.065641
rpart(formula = medv ~ ., data = train)
## n= 262
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 262 22619.5900 22.64809
## 2) lstat>=7.15 190 6407.5790 18.73000
## 4) lstat>=14.8 91 1204.9870 14.64725
## 8) crim>=5.7819 47 483.9983 12.77021 *
## 9) crim< 5.7819 44 378.5098 16.65227 *
## 5) lstat< 14.8 99 2291.4410 22.48283
## 10) rm< 6.6365 87 1491.4180 21.52874 *
## 11) rm>=6.6365 12 146.6600 29.40000 *
## 3) lstat< 7.15 72 5598.1990 32.98750
## 6) rm< 7.4525 59 2516.6520 30.37458
## 12) age< 88.6 52 1024.2690 29.05385
## 24) rm< 6.776 29 220.2917 25.94483 *
## 25) rm>=6.776 23 170.2243 32.97391 *
## 13) age>=88.6 7 727.8686 40.18571 *
## 7) rm>=7.4525 13 850.5723 44.84615 *
plotcp(tree)
p <- predict(tree, train)
sqrt(mean((train$medv-p)^2))
## [1] 4.130294
(cor(train$medv,p))^2
## [1] 0.8024039
The training RMSE is about 4.13, and the squared correlation between the predicted and observed values, an R²-style goodness of fit, is about 0.80.
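Since the RMSE and R² above are computed on the training data, they are likely optimistic; a sketch (not from the original) of the same calculations on the held-out test set, with p_test as an illustrative name:
p_test <- predict(tree, test)
sqrt(mean((test$medv - p_test)^2))  # test-set RMSE
cor(test$medv, p_test)^2            # test-set R-squared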