Spam Detection with Decision Trees

This analysis will use the spam7 dataset. The data contains metrics from thousands of emails. The fields can be described by the following:

crl.tot- total length of words in capitals
dollar- number of occurrences of the $ symbol
bang- number of occurrences of the ! symbol
money- number of occurrences of the word “money”
n000- number of occurrences of the string “000”
make- number of occurrences of the word “make”
yesno- a factor with levels n (not spam) and y (spam)

Viewing the Data Structure

Our goal will be to predict whether a message is spam using a decision tree, confusion matrix, and the ROC curve. We will then use a real estate dataset to show how a decision tree can be used for regression. First, let’s take a look at the stucture of our data.

## 'data.frame':    4601 obs. of  7 variables:
##  $ crl.tot: num  278 1028 2259 191 191 ...
##  $ dollar : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ bang   : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ money  : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ n000   : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ make   : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ yesno  : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...

Splitting the Data and Initial Model

We are dealing with mostly numerics. It is very important that our last column is a factor since that is required for this model. We will set the seed so our results become reproducible. We can use rpartplot to visualize our initial tree.

mydata <- spam7

set.seed(1234)
ind <- sample(2, nrow(mydata), replace = T, prob = c(0.5, 0.5))
train <- mydata[ind == 1,]
test <- mydata[ind == 2,]

tree <- rpart(yesno ~., data = train, cp=0.07444)

rpart.plot(tree)

Analyzing Model Performance

We can see that if the dollar field is less than .056, the email is predicted as spam. Our other branches use the count of the bang column and the total letters following capitals as criteria. To test the accuracy of our model we can use a confusion matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    n    y
##          n 1278  212
##          y  127  688
##                                           
##                Accuracy : 0.8529          
##                  95% CI : (0.8378, 0.8671)
##     No Information Rate : 0.6095          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6857          
##                                           
##  Mcnemar's Test P-Value : 5.061e-06       
##                                           
##             Sensitivity : 0.7644          
##             Specificity : 0.9096          
##          Pos Pred Value : 0.8442          
##          Neg Pred Value : 0.8577          
##              Prevalence : 0.3905          
##          Detection Rate : 0.2985          
##    Detection Prevalence : 0.3536          
##       Balanced Accuracy : 0.8370          
##                                           
##        'Positive' Class : y               
##

Our model is performing with an accuracy of 85%. Our main parameter is cp. We can see how the changing this affects our error.

rpart(formula = yesno ~ ., data = train)

## n= 2305 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 2305 900 n (0.6095445 0.3904555)  
##    2) dollar< 0.0555 1740 404 n (0.7678161 0.2321839)  
##      4) bang< 0.092 1227 128 n (0.8956805 0.1043195) *
##      5) bang>=0.092 513 237 y (0.4619883 0.5380117)  
##       10) crl.tot< 86.5 263  84 n (0.6806084 0.3193916) *
##       11) crl.tot>=86.5 250  58 y (0.2320000 0.7680000) *
##    3) dollar>=0.0555 565  69 y (0.1221239 0.8778761) *

plotcp(tree)

ROC Curve

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True Positive Rate. False Positive Rate

p1 <- predict(tree, test, type = 'prob')
p1 <- p1[,2]
r <- multiclass.roc(test$yesno, p1, percent = TRUE)

## Setting direction: controls < cases

roc <- r[['rocs']]
r1 <- roc[[1]]
plot.roc(r1,
         print.auc=TRUE,
         auc.polygon=TRUE,
         grid=c(0.1, 0.2),
         grid.col=c("green", "red"),
         max.auc.polygon=TRUE,
         auc.polygon.col="lightblue",
         print.thres=TRUE,
         main= 'ROC Curve')

Regression with Decision Trees

Another use for decision trees is regression analysis. We will use a real estate dataset for this example. The following explains the variables:

CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000’s

First we will create our initial tree just like before. The main difference in this model is that our response variable will be a continuous value instead of a factor.

set.seed(1234)
ind <- sample(2, nrow(mydata), replace = T, prob = c(0.5, 0.5))
train <- mydata[ind == 1,]
test <- mydata[ind == 2,]

tree <- rpart(medv ~., data = train)
rpart.plot(tree)

Model

printcp(tree)

## 
## Regression tree:
## rpart(formula = medv ~ ., data = train)
## 
## Variables actually used in tree construction:
## [1] age   crim  lstat rm   
## 
## Root node error: 22620/262 = 86.334
## 
## n= 262 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.469231      0   1.00000 1.01139 0.115186
## 2 0.128700      1   0.53077 0.62346 0.080154
## 3 0.098630      2   0.40207 0.51042 0.076055
## 4 0.033799      3   0.30344 0.42674 0.069827
## 5 0.028885      4   0.26964 0.39232 0.066342
## 6 0.028018      5   0.24075 0.37848 0.066389
## 7 0.015141      6   0.21274 0.34877 0.065824
## 8 0.010000      7   0.19760 0.33707 0.065641

rpart(formula = medv ~ ., data = train)

## n= 262 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 262 22619.5900 22.64809  
##    2) lstat>=7.15 190  6407.5790 18.73000  
##      4) lstat>=14.8 91  1204.9870 14.64725  
##        8) crim>=5.7819 47   483.9983 12.77021 *
##        9) crim< 5.7819 44   378.5098 16.65227 *
##      5) lstat< 14.8 99  2291.4410 22.48283  
##       10) rm< 6.6365 87  1491.4180 21.52874 *
##       11) rm>=6.6365 12   146.6600 29.40000 *
##    3) lstat< 7.15 72  5598.1990 32.98750  
##      6) rm< 7.4525 59  2516.6520 30.37458  
##       12) age< 88.6 52  1024.2690 29.05385  
##         24) rm< 6.776 29   220.2917 25.94483 *
##         25) rm>=6.776 23   170.2243 32.97391 *
##       13) age>=88.6 7   727.8686 40.18571 *
##      7) rm>=7.4525 13   850.5723 44.84615 *

Visualizing

plotcp(tree)

Predict

p <- predict(tree, train)
sqrt(mean((train$medv-p)^2))

## [1] 4.130294

(cor(train$medv,p))^2

## [1] 0.8024039

The model is showing a goodness of fit of around .8.

Sources

https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset