I am currently collecting data for a penguin population study in Antarctica. Arford Inc. has robots that can measure the penguins' physical dimensions, but each bird's sex still has to be determined manually because the robots do not yet have that capability. My goal is to build a classification model that can identify a penguin's sex automatically. I will try two approaches, a k-means clustering model and a random forest model, and see which yields better results.

Loading Data and Packages

I will use the following packages for this analysis:

# Installing packages (only needed once; commented out here) 
# install.packages(c("ClusterR", "cluster", "randomForest", "dplyr", "caret"))

# Loading packages 
library(ClusterR) 
library(cluster) 
library(randomForest)
library(datasets)
library(dplyr)
library(caret)

Cleaning and Clustering

First, I cleaned the data by removing rows with NA values, dropping unneeded variables, and setting the data types correctly.

Our variables include the following:

* culmen_length_mm
* culmen_depth_mm
* flipper_length_mm
* body_mass_g
* sex

Next, I will set the seed so our results are reproducible. Then we can use the kmeans function to build the model. This function is the simplest way I've found to create a k-means model, but it expects purely numeric input, which is why we dropped the sex column during cleaning.

The nstart parameter decides how many random starting configurations the model tries; kmeans keeps the best of those runs.
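
To illustrate the effect of nstart, here is a short sketch I am adding (it uses the penguins1 data frame built in the cleaning chunk below, so run it after that chunk): with a single random start, kmeans can settle in a poor local optimum, while with 20 starts it keeps the best of 20 runs.

# Sketch: compare a single random start against 20 starts
set.seed(240)
fit_1 <- kmeans(penguins1, centers = 2, nstart = 1)
set.seed(240)
fit_20 <- kmeans(penguins1, centers = 2, nstart = 20)
# The 20-start fit's total within-cluster SS is never worse
c(one_start = fit_1$tot.withinss, twenty_starts = fit_20$tot.withinss)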

Below you can see the three stages of creating the cluster plot, as well as our model statistics.

# Remove rows with missing values, then drop the sex label 
# (column 5) so only the numeric measurements remain 

penguin <- na.omit(penguin)
penguins1 <- penguin[, -5] 
# Coerce any character-typed measurement columns to numeric
penguins1 <- penguins1 %>% mutate_if(is.character, as.numeric)
  
# Fitting the k-means clustering model 
# to the cleaned dataset 
set.seed(240) # Setting seed for reproducibility 
kmeans.re <- kmeans(penguins1, centers = 2, nstart = 20) 
kmeans.re 
## K-means clustering with 2 clusters of sizes 126, 207
## 
## Cluster means:
##   culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
## 1         47.40397        15.56905          215.6111    254.6329
## 2         41.96425        18.11014          192.1449    183.5145
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
##  [38] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
##  [75] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1
## [186] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 2
## [260] 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 79145.43 94499.84
##  (between_SS / total_SS =  71.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Cluster identification for  
# each observation 
kmeans.re$cluster 
##   [1] 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
##  [38] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
##  [75] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1
## [186] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 2
## [260] 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Model evaluation and visualization 
# Stage 1: plain scatter of flipper length vs body mass
plot(penguins1[c("flipper_length_mm", "body_mass_g")]) 

# Stage 2: color each point by its assigned cluster
plot(penguins1[c("flipper_length_mm", "body_mass_g")],  
     col = kmeans.re$cluster) 

# Stage 3: add a title
plot(penguins1[c("flipper_length_mm", "body_mass_g")],  
     col = kmeans.re$cluster,  
     main = "K-means with 2 clusters") 
  
## Plotting cluster centers 
kmeans.re$centers 
##   culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
## 1         47.40397        15.56905          215.6111    254.6329
## 2         41.96425        18.11014          192.1449    183.5145
kmeans.re$centers[, c("flipper_length_mm", "body_mass_g")] 
##   flipper_length_mm body_mass_g
## 1          215.6111    254.6329
## 2          192.1449    183.5145
# cex controls the symbol size, pch selects the plotting symbol 
points(kmeans.re$centers[, c("flipper_length_mm", "body_mass_g")],  
       col = 1:2, pch = 8, cex = 3)  

## Visualizing clusters 
y_kmeans <- kmeans.re$cluster 
clusplot(penguins1[, c("flipper_length_mm", "body_mass_g")], 
         y_kmeans, 
         lines = 0, 
         shade = TRUE, 
         color = TRUE, 
         labels = 2, 
         plotchar = FALSE, 
         span = TRUE, 
         main = "Cluster Penguins", 
         xlab = "flipper length", 
         ylab = "mass/20") 

We have now visualized the cluster plot. The clusters look well defined, but to be sure we should compare the cluster assignments against the true sex labels with a confusion matrix.

Confusion Matrix
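
The original post does not show the code that produced this table; most likely it cross-tabulates the true sex labels against the k-means cluster assignments, along these lines:

# Assumed reconstruction: rows are the recorded sex, columns the cluster
table(penguin$sex, kmeans.re$cluster)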

##         
##            1   2
##   .        1   0
##   FEMALE  50 115
##   MALE    75  92

The matrix shows that the model is not performing well. Even under the more favorable cluster-to-sex matching (cluster 1 = MALE, cluster 2 = FEMALE), only 190 of the 333 penguins, about 57%, are labeled correctly, which is barely better than guessing. Normally we could tweak the preprocessing to improve this (one common tweak is sketched below), but rather than chase marginal gains, let's try a new method: a random forest model.
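
For instance (an assumption on my part, not something run in this analysis), k-means is distance-based, so standardizing the measurements so that no single variable dominates the distance calculation often changes the clustering substantially:

# Sketch (assumption): standardize each measurement before clustering
set.seed(240)
kmeans_scaled <- kmeans(scale(penguins1), centers = 2, nstart = 20)
table(penguin$sex, kmeans_scaled$cluster)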

Random Forest Model

We will use an approximately 70/30 train/test split; the randomForest package will do the rest of the work for us.

# Convert sex to a factor and drop rows with missing values 
# (note the stray "." level below: one record has an invalid sex label)
penguin$sex <- as.factor(penguin$sex)
penguin2 <- na.omit(penguin)
table(penguin$sex)
## 
##      . FEMALE   MALE 
##      1    165    167
set.seed(222)
# Randomly assign each row to the training (~70%) or test (~30%) set
ind <- sample(2, nrow(penguin2), replace = TRUE, prob = c(0.7, 0.3))
train <- penguin2[ind == 1, ]
test  <- penguin2[ind == 2, ]


# Fit a random forest predicting sex from all other variables
rf <- randomForest(sex ~ ., data = train, proximity = TRUE) 
print(rf)
## 
## Call:
##  randomForest(formula = sex ~ ., data = train, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 12.28%
## Confusion matrix:
##        . FEMALE MALE class.error
## .      0      1    0   1.0000000
## FEMALE 0    102   15   0.1282051
## MALE   0     12   98   0.1090909

Our base model is performing much better than the k-means model, with an out-of-bag accuracy of roughly 88% (OOB error rate of 12.28%).

A look at some of the more advanced statistics: the first confusion matrix below evaluates the forest on the training set (where it fits essentially perfectly, as expected), and the second evaluates it on the held-out test set.
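
The code for these tables is not shown in the original post; they were most likely produced with caret's confusionMatrix applied to the model's predictions, roughly as follows:

# Assumed reconstruction: evaluate on the training set, then the test set
p_train <- predict(rf, train)
confusionMatrix(p_train, train$sex)
p_test <- predict(rf, test)
confusionMatrix(p_test, test$sex)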

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   . FEMALE MALE
##     .        1      0    0
##     FEMALE   0    117    0
##     MALE     0      0  110
## 
## Overall Statistics
##                                     
##                Accuracy : 1         
##                  95% CI : (0.984, 1)
##     No Information Rate : 0.5132    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
## 
## Statistics by Class:
## 
##                      Class: . Class: FEMALE Class: MALE
## Sensitivity          1.000000        1.0000      1.0000
## Specificity          1.000000        1.0000      1.0000
## Pos Pred Value       1.000000        1.0000      1.0000
## Neg Pred Value       1.000000        1.0000      1.0000
## Prevalence           0.004386        0.5132      0.4825
## Detection Rate       0.004386        0.5132      0.4825
## Detection Prevalence 0.004386        0.5132      0.4825
## Balanced Accuracy    1.000000        1.0000      1.0000
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  . FEMALE MALE
##     .       0      0    0
##     FEMALE  0     43    2
##     MALE    0      5   55
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.8675, 0.9728)
##     No Information Rate : 0.5429          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.865           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: . Class: FEMALE Class: MALE
## Sensitivity                NA        0.8958      0.9649
## Specificity                 1        0.9649      0.8958
## Pos Pred Value             NA        0.9556      0.9167
## Neg Pred Value             NA        0.9167      0.9556
## Prevalence                  0        0.4571      0.5429
## Detection Rate              0        0.4095      0.5238
## Detection Prevalence        0        0.4286      0.5714
## Balanced Accuracy          NA        0.9304      0.9304

Conclusion

In conclusion, our random forest model reaches over 93% accuracy on the held-out test set, far better than our k-means model. With more data or some parameter tuning we could push the accuracy higher still (a tuning sketch follows below), but even as-is, adding the model to our pipeline will automate the task of determining each penguin's sex.
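
As one example of the tuning mentioned above (a sketch of a standard approach, not something run in this analysis), the randomForest package provides tuneRF for searching over mtry, the number of variables tried at each split:

# Sketch (assumption): tune mtry using the out-of-bag error estimate
set.seed(222)
tuned <- tuneRF(x = train[, names(train) != "sex"],
                y = train$sex,
                ntreeTry = 500,
                stepFactor = 1.5,
                improve = 0.01)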