This analysis will use the spam7 dataset, which contains measurements from 4,601 emails. The fields are described below:
- crl.tot - total length of words in capitals
- dollar - number of occurrences of the $ symbol
- bang - number of occurrences of the ! symbol
- money - number of occurrences of the word “money”
- n000 - number of occurrences of the string “000”
- make - number of occurrences of the word “make”
- yesno - a factor with levels n (not spam) and y (spam)
Viewing the Data Structure
Our goal will be to predict whether a message is spam using a decision tree, evaluating the model with a confusion matrix and an ROC curve. We will then use a real estate dataset to show how a decision tree can be used for regression. First, let’s take a look at the structure of our data.
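A minimal sketch of loading and inspecting the data, assuming the spam7 dataset comes from the DAAG package:

library(DAAG)  # assumed source of the spam7 dataset

mydata <- spam7
str(mydata)    # produces the structure listing below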
## 'data.frame': 4601 obs. of 7 variables:
## $ crl.tot: num 278 1028 2259 191 191 ...
## $ dollar : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ bang : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ n000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ yesno : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...
Splitting the Data and Initial Model
We are dealing with mostly numeric columns. It is very important that our last column, yesno, is a factor, since rpart requires a factor response for classification. We will set the seed so our results become reproducible, split the data into training and test sets, and fit an initial tree, which we can visualize with rpart.plot, as in the sketch below.
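A minimal sketch of the split and initial fit, assuming the same 50/50 split and seed used in the regression example later in this post:

library(rpart)
library(rpart.plot)

set.seed(1234)  # assumed seed; any fixed seed makes the split reproducible
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.5, 0.5))
train <- mydata[ind == 1, ]
test  <- mydata[ind == 2, ]

tree <- rpart(yesno ~ ., data = train, method = 'class')  # classification tree
rpart.plot(tree)  # visualize the fitted tree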
Analyzing Model Performance
We can see that if the dollar field is at least 0.056, the email is predicted to be spam. Our other branches split on the bang column and on crl.tot, the total length of capitalized words. To test the accuracy of our model we can use a confusion matrix.
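A sketch of how the confusion matrix below might be produced, assuming the caret package supplies confusionMatrix():

library(caret)  # assumed; provides confusionMatrix()

p <- predict(tree, test, type = 'class')        # predicted classes on the test set
confusionMatrix(p, test$yesno, positive = 'y')  # treat 'y' (spam) as the positive class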
## Confusion Matrix and Statistics
##
## Reference
## Prediction n y
## n 1278 212
## y 127 688
##
## Accuracy : 0.8529
## 95% CI : (0.8378, 0.8671)
## No Information Rate : 0.6095
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6857
##
## Mcnemar's Test P-Value : 5.061e-06
##
## Sensitivity : 0.7644
## Specificity : 0.9096
## Pos Pred Value : 0.8442
## Neg Pred Value : 0.8577
## Prevalence : 0.3905
## Detection Rate : 0.2985
## Detection Prevalence : 0.3536
## Balanced Accuracy : 0.8370
##
## 'Positive' Class : y
##
Our model is performing with an accuracy of about 85%. Our main tuning parameter is cp, the complexity parameter, and we can see how changing it affects our error.
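The node listing that follows can presumably be reproduced by printing the fitted tree:

print(tree)  # splits, node sizes, misclassification counts (loss), and class probabilities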
## n= 2305
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2305 900 n (0.6095445 0.3904555)
## 2) dollar< 0.0555 1740 404 n (0.7678161 0.2321839)
## 4) bang< 0.092 1227 128 n (0.8956805 0.1043195) *
## 5) bang>=0.092 513 237 y (0.4619883 0.5380117)
## 10) crl.tot< 86.5 263 84 n (0.6806084 0.3193916) *
## 11) crl.tot>=86.5 250 58 y (0.2320000 0.7680000) *
## 3) dollar>=0.0555 565 69 y (0.1221239 0.8778761) *
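To see directly how cp trades off against cross-validated error, a sketch using rpart’s built-in cp table (the cp value passed to prune() is hypothetical and would be chosen from the plot):

printcp(tree)                      # cp table with cross-validated error (xerror)
plotcp(tree)                       # plot error against cp to choose a pruning point
pruned <- prune(tree, cp = 0.01)   # hypothetical cp value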
ROC Curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: the true positive rate and the false positive rate.
library(pROC)

p1 <- predict(tree, test, type = 'prob')  # class probabilities for the test set
p1 <- p1[, 2]                             # probability of the positive class, 'y'
r <- multiclass.roc(test$yesno, p1, percent = TRUE)
## Setting direction: controls < cases
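Since yesno has only two levels, the curve and its AUC can also be obtained with pROC’s binary roc(); a sketch:

roc_obj <- roc(test$yesno, p1, percent = TRUE)
plot(roc_obj, print.auc = TRUE)  # draw the ROC curve with the AUC annotated
auc(roc_obj)                     # area under the curve, in percent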
Regression with Decision Trees
Another use for decision trees is regression analysis. We will use a real estate dataset for this example. The following explains the variables:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000’s
First we will create our initial tree just like before. The main difference in this model is that our response variable will be a continuous value instead of a factor.
set.seed(1234)
# assign each row to train (1) or test (2) with equal probability
ind <- sample(2, nrow(mydata), replace = TRUE, prob = c(0.5, 0.5))
train <- mydata[ind == 1, ]
test  <- mydata[ind == 2, ]

tree <- rpart(medv ~ ., data = train)  # numeric response, so rpart fits a regression tree
rpart.plot(tree)
Model
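The summary below appears to combine the cp table and the full node listing, which can be reproduced with:

printcp(tree)  # cp table: relative and cross-validated error at each split
print(tree)    # full tree: splits, node sizes, deviance, and fitted values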
##
## Regression tree:
## rpart(formula = medv ~ ., data = train)
##
## Variables actually used in tree construction:
## [1] age crim lstat rm
##
## Root node error: 22620/262 = 86.334
##
## n= 262
##
## CP nsplit rel error xerror xstd
## 1 0.469231 0 1.00000 1.01139 0.115186
## 2 0.128700 1 0.53077 0.62346 0.080154
## 3 0.098630 2 0.40207 0.51042 0.076055
## 4 0.033799 3 0.30344 0.42674 0.069827
## 5 0.028885 4 0.26964 0.39232 0.066342
## 6 0.028018 5 0.24075 0.37848 0.066389
## 7 0.015141 6 0.21274 0.34877 0.065824
## 8 0.010000 7 0.19760 0.33707 0.065641
## n= 262
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 262 22619.5900 22.64809
## 2) lstat>=7.15 190 6407.5790 18.73000
## 4) lstat>=14.8 91 1204.9870 14.64725
## 8) crim>=5.7819 47 483.9983 12.77021 *
## 9) crim< 5.7819 44 378.5098 16.65227 *
## 5) lstat< 14.8 99 2291.4410 22.48283
## 10) rm< 6.6365 87 1491.4180 21.52874 *
## 11) rm>=6.6365 12 146.6600 29.40000 *
## 3) lstat< 7.15 72 5598.1990 32.98750
## 6) rm< 7.4525 59 2516.6520 30.37458
## 12) age< 88.6 52 1024.2690 29.05385
## 24) rm< 6.776 29 220.2917 25.94483 *
## 25) rm>=6.776 23 170.2243 32.97391 *
## 13) age>=88.6 7 727.8686 40.18571 *
## 7) rm>=7.4525 13 850.5723 44.84615 *
Predict
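A sketch of how the two values below might be computed, assuming the first is the RMSE of the test-set predictions and the second is the squared correlation (R-squared) between predictions and actual values:

p <- predict(tree, test)       # predicted medv on the test set
sqrt(mean((test$medv - p)^2))  # RMSE
cor(test$medv, p)^2            # R-squared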
## [1] 4.130294
## [1] 0.8024039
Assuming the first value is the RMSE, typical predictions are off by about $4,130 (medv is in $1000’s), and the model is showing a goodness of fit of around 0.8.