This analysis will use the spam7 dataset, which contains metrics from 4,601 emails. The fields can be described as follows:
crl.tot: total length of words in capitals
dollar: number of occurrences of the $ symbol
bang: number of occurrences of the ! symbol
money: number of occurrences of the word “money”
n000: number of occurrences of the string “000”
make: number of occurrences of the word “make”
yesno: a factor with levels n (not spam) and y (spam)
Our goal is to predict whether a message is spam using a decision tree, evaluating the model with a confusion matrix and an ROC curve. We will then use a real estate dataset to show how a decision tree can be used for regression. First, let’s take a look at the structure of our data.
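The setup chunk is not shown in the original, so here is a minimal sketch of the packages used throughout this analysis and the call that produces the structure output below, assuming spam7 comes from the DAAG package:
library(DAAG)        # assumed source of the spam7 dataset
library(rpart)       # decision trees
library(rpart.plot)  # tree visualization
library(caret)       # confusion matrix
library(pROC)        # ROC curves
str(spam7)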
## 'data.frame': 4601 obs. of 7 variables:
## $ crl.tot: num 278 1028 2259 191 191 ...
## $ dollar : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ bang : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ n000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ yesno : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...
We are dealing with mostly numeric variables. It is important that our response column, yesno, is a factor, since that is what tells rpart to build a classification tree. We will set the seed so our results are reproducible, split the data into training and test sets, and use rpart.plot to visualize our initial tree.
mydata <- spam7
set.seed(1234)
# split the data roughly 50/50 into training and test sets
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.5, 0.5))
train <- mydata[ind == 1, ]
test <- mydata[ind == 2, ]
# fit a classification tree; cp is the complexity parameter
tree <- rpart(yesno ~ ., data = train, cp = 0.07444)
rpart.plot(tree)
Looking at the tree, if the dollar value is at least 0.056, the email is predicted to be spam; below that threshold, the remaining branches use the count of ! characters (bang) and the total length of words in capitals (crl.tot) to decide. To test the accuracy of our model we can use a confusion matrix.
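The chunk that produced the output below is not shown; a sketch of one way to obtain it with caret’s confusionMatrix(), assuming class predictions on the test set and y as the positive class (the object name pred is just illustrative):
pred <- predict(tree, test, type = 'class')        # predicted class labels on the test set
confusionMatrix(pred, test$yesno, positive = 'y')  # compare predictions against the truth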
## Confusion Matrix and Statistics
##
## Reference
## Prediction n y
## n 1278 212
## y 127 688
##
## Accuracy : 0.8529
## 95% CI : (0.8378, 0.8671)
## No Information Rate : 0.6095
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6857
##
## Mcnemar's Test P-Value : 5.061e-06
##
## Sensitivity : 0.7644
## Specificity : 0.9096
## Pos Pred Value : 0.8442
## Neg Pred Value : 0.8577
## Prevalence : 0.3905
## Detection Rate : 0.2985
## Detection Prevalence : 0.3536
## Balanced Accuracy : 0.8370
##
## 'Positive' Class : y
##
Our model achieves an accuracy of about 85%. Our main tuning parameter is cp, the complexity parameter, which controls how much the tree is allowed to grow. We can see how changing it affects our error.
rpart(formula = yesno ~ ., data = train)
## n= 2305
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2305 900 n (0.6095445 0.3904555)
## 2) dollar< 0.0555 1740 404 n (0.7678161 0.2321839)
## 4) bang< 0.092 1227 128 n (0.8956805 0.1043195) *
## 5) bang>=0.092 513 237 y (0.4619883 0.5380117)
## 10) crl.tot< 86.5 263 84 n (0.6806084 0.3193916) *
## 11) crl.tot>=86.5 250 58 y (0.2320000 0.7680000) *
## 3) dollar>=0.0555 565 69 y (0.1221239 0.8778761) *
plotcp(tree)
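plotcp() plots the cross-validated error against cp. A common follow-up, sketched here as an assumption rather than code from the original, is to prune the tree at the cp value with the lowest cross-validated error (xerror) in the cp table; best_cp and pruned are illustrative names:
# pick the cp with the smallest cross-validated error and prune to it
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best_cp)
rpart.plot(pruned)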
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: the true positive rate and the false positive rate.
p1 <- predict(tree, test, type = 'prob')  # class probabilities on the test set
p1 <- p1[,2]                              # keep the probability of the positive class (y)
r <- multiclass.roc(test$yesno, p1, percent = TRUE)
## Setting direction: controls < cases
roc <- r[['rocs']]
r1 <- roc[[1]]
plot.roc(r1,
print.auc=TRUE,
auc.polygon=TRUE,
grid=c(0.1, 0.2),
grid.col=c("green", "red"),
max.auc.polygon=TRUE,
auc.polygon.col="lightblue",
print.thres=TRUE,
main= 'ROC Curve')
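If we just want the numbers behind the plot, pROC can report them directly; a small sketch (not in the original) using the r1 object from above:
auc(r1)                                                                 # area under the curve
coords(r1, "best", ret = c("threshold", "sensitivity", "specificity"))  # best cutoff highlighted by print.thres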
Another use for decision trees is regression analysis. We will use a real estate dataset for this example; the response is the median home value, medv, and the predictors that end up in the tree below are lstat, rm, crim, and age. First we will create our initial tree just like before. The main difference in this model is that our response variable is a continuous value instead of a factor.
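The chunk that loads this dataset is not shown. The variables that appear in the tree below (medv, lstat, rm, crim, age) match the Boston housing data, so here is a minimal sketch assuming that source (the MASS package):
library(MASS)     # assumed source of the housing data
mydata <- Boston  # response: medv, the median home value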
set.seed(1234)
# re-split the housing data into training and test sets
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.5, 0.5))
train <- mydata[ind == 1, ]
test <- mydata[ind == 2, ]
# fit a regression tree for the continuous response medv
tree <- rpart(medv ~ ., data = train)
rpart.plot(tree)
printcp(tree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = train)
##
## Variables actually used in tree construction:
## [1] age crim lstat rm
##
## Root node error: 22620/262 = 86.334
##
## n= 262
##
## CP nsplit rel error xerror xstd
## 1 0.469231 0 1.00000 1.01139 0.115186
## 2 0.128700 1 0.53077 0.62346 0.080154
## 3 0.098630 2 0.40207 0.51042 0.076055
## 4 0.033799 3 0.30344 0.42674 0.069827
## 5 0.028885 4 0.26964 0.39232 0.066342
## 6 0.028018 5 0.24075 0.37848 0.066389
## 7 0.015141 6 0.21274 0.34877 0.065824
## 8 0.010000 7 0.19760 0.33707 0.065641
rpart(formula = medv ~ ., data = train)
## n= 262
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 262 22619.5900 22.64809
## 2) lstat>=7.15 190 6407.5790 18.73000
## 4) lstat>=14.8 91 1204.9870 14.64725
## 8) crim>=5.7819 47 483.9983 12.77021 *
## 9) crim< 5.7819 44 378.5098 16.65227 *
## 5) lstat< 14.8 99 2291.4410 22.48283
## 10) rm< 6.6365 87 1491.4180 21.52874 *
## 11) rm>=6.6365 12 146.6600 29.40000 *
## 3) lstat< 7.15 72 5598.1990 32.98750
## 6) rm< 7.4525 59 2516.6520 30.37458
## 12) age< 88.6 52 1024.2690 29.05385
## 24) rm< 6.776 29 220.2917 25.94483 *
## 25) rm>=6.776 23 170.2243 32.97391 *
## 13) age>=88.6 7 727.8686 40.18571 *
## 7) rm>=7.4525 13 850.5723 44.84615 *
plotcp(tree)
p <- predict(tree, train)
sqrt(mean((train$medv-p)^2))
## [1] 4.130294
(cor(train$medv,p))^2
## [1] 0.8024039
The training RMSE is about 4.13, and the squared correlation between the predicted and observed values, an R²-style goodness of fit, is about 0.80.
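Since the RMSE and R² above are computed on the training data, they are likely optimistic; a sketch (not from the original) of the same calculations on the held-out test set, with p_test as an illustrative name:
p_test <- predict(tree, test)
sqrt(mean((test$medv - p_test)^2))  # test-set RMSE
cor(test$medv, p_test)^2            # test-set R-squared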