I am currently collecting data for a penguin population study in Antarctica. Arford Inc. has robots that can measure the penguins' body dimensions, but we still have to determine each penguin's sex manually because the robots do not yet have that capability. My goal is to build a model that classifies the sex of the penguins automatically. I will try two methods, K-means clustering and a random forest classifier, and see which yields better results.
Loading in Data and packages
I will use the following packages for this analysis:
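The package list itself did not survive in this export. Based on the functions used later (`kmeans`, `clusplot`, `randomForest`, `confusionMatrix`), the setup was presumably something like:

```r
# Assumed package setup; the exact list was not preserved in the original.
library(dplyr)         # data wrangling (mutate_if, pipes)
library(cluster)       # clusplot() for cluster visualization
library(randomForest)  # randomForest() classifier
library(caret)         # confusionMatrix() statistics
```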
Cleaning and Clustering
First, I cleaned the data by removing NA values, removing unneeded variables, and setting our data types correctly.
Our variables include the following: * culmen_length_mm * culmen_depth_mm * flipper_length_mm * body_mass_g * sex
Next, I will set the seed so our results are reproducible. Then we can use the kmeans() function to build our model. This function is the simplest way I’ve found to create a k-means model: as long as the data contains only numeric columns, the function will automatically use the fields we need. This is why we cut out the unneeded columns (and the NA rows) at the beginning.
The nstart parameter sets the number of random starting configurations; kmeans() runs the algorithm from each one and keeps the best result (the fit with the lowest total within-cluster sum of squares).
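To see why multiple restarts help, here is a minimal, self-contained sketch on a toy dataset (not the penguin data): the best of several random starts typically matches or improves on a single start.

```r
# Sketch: effect of nstart on a toy two-cluster dataset.
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))

one_start  <- kmeans(toy, centers = 2, nstart = 1)   # single random start
many_start <- kmeans(toy, centers = 2, nstart = 20)  # best of 20 starts

# Keeping the best of 20 starts typically gives an equal-or-lower
# total within-cluster sum of squares than a single start.
c(one_start$tot.withinss, many_start$tot.withinss)
```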
Below you can see the three stages of creating the cluster plot, as well as our model statistics.
# Removing the sex label
# from the original dataset
penguin <- na.omit(penguin)
penguins1 <- penguin[, -5]
penguins1 <- penguins1 %>% mutate_if(is.character, as.numeric)
# Fitting K-Means clustering Model
# to training dataset
set.seed(240) # Setting seed
kmeans.re <- kmeans(penguins1, centers = 2, nstart = 20)
kmeans.re
## K-means clustering with 2 clusters of sizes 126, 207
##
## Cluster means:
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
## 1 47.40397 15.56905 215.6111 254.6329
## 2 41.96425 18.11014 192.1449 183.5145
##
## Clustering vector:
## [1] 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2
## [38] 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
## [75] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1
## [186] 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 2
## [260] 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 79145.43 94499.84
## (between_SS / total_SS = 71.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
plot(penguins1[c("flipper_length_mm", "body_mass_g")],
col = kmeans.re$cluster,
main = "K-means with 2 clusters")
## Plotting cluster centers
kmeans.re$centers
## culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
## 1 47.40397 15.56905 215.6111 254.6329
## 2 41.96425 18.11014 192.1449 183.5145
## flipper_length_mm body_mass_g
## 1 215.6111 254.6329
## 2 192.1449 183.5145
# cex scales the symbol size, pch selects the symbol
points(kmeans.re$centers[, c("flipper_length_mm", "body_mass_g")],
col = 1:2, pch = 8, cex = 3)
## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(penguins1[, c("flipper_length_mm", "body_mass_g")],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste("Cluster Penguins"),
xlab = 'flipper length',
ylab = 'mass/20')
We have now visualized the cluster plot. Our clusters seem well defined, but to make sure we will look at a confusion matrix.
Confusion Matrix
##
## 1 2
## . 1 0
## FEMALE 50 115
## MALE 75 92
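The matrix above was presumably built by cross-tabulating the original sex labels against the cluster assignments, along these lines (the `.` row is an empty leftover factor level from cleaning):

```r
# Sketch: cross-tabulate true sex labels against k-means cluster assignments.
# Assumes the cleaned `penguin` data frame still holds the sex column
# and its rows align with those of penguins1.
table(penguin$sex, kmeans.re$cluster)
```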
The matrix shows that the model is not performing well: even with the better of the two possible cluster-to-sex assignments, the accuracy is only about 57% (190 of 333), barely better than chance. Normally we could tweak some parameters to get better accuracy, but instead we are going to try a new method. Let’s try a random forest model.
Random Forest Model
We will use a train/test split of 70/30. The randomForest package will do the rest of the work for us.
##
## . FEMALE MALE
## 1 165 167
set.seed(222)
ind <- sample(2, nrow(penguin2), replace = TRUE, prob = c(0.7, 0.3))
train <- penguin2[ind==1,]
test <- penguin2[ind==2,]
rf <- randomForest(sex~., data=train, proximity=TRUE)
print(rf)
##
## Call:
## randomForest(formula = sex ~ ., data = train, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 12.28%
## Confusion matrix:
## . FEMALE MALE class.error
## . 0 1 0 1.0000000
## FEMALE 0 102 15 0.1282051
## MALE 0 12 98 0.1090909
Our base model is already performing much better than the K-means model, with an out-of-bag accuracy of roughly 88% (an OOB error rate of 12.28%).
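The detailed statistics that follow appear to come from caret's confusionMatrix(), applied first to the training set and then to the held-out test set. A sketch, assuming caret is loaded and the `rf`, `train`, and `test` objects from the split above:

```r
# Sketch: evaluate the forest on both halves of the split.
p_train <- predict(rf, train)
confusionMatrix(p_train, train$sex)  # in-sample fit (near-perfect, as expected)

p_test <- predict(rf, test)
confusionMatrix(p_test, test$sex)    # honest estimate of out-of-sample performance
```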
A look at some of the more advanced statistics, first on the training set and then on the held-out test set…
## Confusion Matrix and Statistics
##
## Reference
## Prediction . FEMALE MALE
## . 1 0 0
## FEMALE 0 117 0
## MALE 0 0 110
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.984, 1)
## No Information Rate : 0.5132
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: . Class: FEMALE Class: MALE
## Sensitivity 1.000000 1.0000 1.0000
## Specificity 1.000000 1.0000 1.0000
## Pos Pred Value 1.000000 1.0000 1.0000
## Neg Pred Value 1.000000 1.0000 1.0000
## Prevalence 0.004386 0.5132 0.4825
## Detection Rate 0.004386 0.5132 0.4825
## Detection Prevalence 0.004386 0.5132 0.4825
## Balanced Accuracy 1.000000 1.0000 1.0000
## Confusion Matrix and Statistics
##
## Reference
## Prediction . FEMALE MALE
## . 0 0 0
## FEMALE 0 43 2
## MALE 0 5 55
##
## Overall Statistics
##
## Accuracy : 0.9333
## 95% CI : (0.8675, 0.9728)
## No Information Rate : 0.5429
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.865
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: . Class: FEMALE Class: MALE
## Sensitivity NA 0.8958 0.9649
## Specificity 1 0.9649 0.8958
## Pos Pred Value NA 0.9556 0.9167
## Neg Pred Value NA 0.9167 0.9556
## Prevalence 0 0.4571 0.5429
## Detection Rate 0 0.4095 0.5238
## Detection Prevalence 0 0.4286 0.5714
## Balanced Accuracy NA 0.9304 0.9304
Conclusion
In conclusion, our random forest model achieves over 93% accuracy on the test set, far better than our k-means model. We could gather more data or tune hyperparameters to push the accuracy higher, but even as it stands, adding the model to our collection will automate the task of determining each penguin's sex.