This analysis uses k-nearest neighbors (kNN) classification to predict what type of vehicle passes through a toll booth, replacing the slow process of reviewing camera images manually. Arford Inc. has started using new software in its toll booth cameras. The software records three approximate length measurements along with the weight of the vehicle. The toll is 7 dollars for a car, 9 for a truck, and 15 for a large truck. The predictions need to be highly accurate so that we neither overcharge drivers nor lose toll income.
We first load the dataset and look at its structure. It has four numeric columns and one character column, vehicle.
library(readxl)
tolldata <- read_excel("data/tolldata.xlsx")
# Structure
str(tolldata)
## tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
## $ fwindshield: num [1:150] 137.7 63.7 61.1 59.8 65 ...
## $ tirewidth : num [1:150] 94.5 39 41.6 40.3 46.8 50.7 44.2 44.2 37.7 40.3 ...
## $ weight : num [1:150] 37.8 18.2 16.9 19.5 18.2 22.1 18.2 19.5 18.2 19.5 ...
## $ clearance : num [1:150] 5.4 5.4 5.4 5.4 5.4 10.8 8.1 5.4 5.4 2.7 ...
## $ vehicle : chr [1:150] "car" "car" "car" "car" ...
The following packages will be used for this model.
# Loading package
library(e1071)
library(caTools)
library(class)
library(tidyverse)
library(ggplot2)
A 70/30 split between training and testing data is common for k-nearest neighbor classification.
tolldata <- as.data.frame(tolldata)
tolldata <- tolldata %>% mutate_at(c('vehicle'), as.factor)
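# Splitting data into train and test data.
# Note: sample.split() expects a vector of labels. Passing the whole data frame
# makes it split over the 5 columns, and the 5-element mask is then recycled
# over the rows, which is why the test set below has 60 of the 150 rows (60/40).
# The intended 70/30 call is sample.split(tolldata$vehicle, SplitRatio = 0.7).
split <- sample.split(tolldata, SplitRatio = 0.7)
train_cl <- subset(tolldata, split == "TRUE")
test_cl <- subset(tolldata, split == "FALSE")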
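The difference between splitting on the data frame and splitting on the label column can be seen in a small sketch. The built-in iris data (150 rows, 4 numeric columns, 1 class column) is used here only as a stand-in for tolldata so that the block runs on its own. The seed and object names are illustrative assumptions.

```r
library(caTools)

set.seed(123)  # arbitrary seed, only so the sketch is repeatable

# iris stands in for tolldata (same shape: 150 rows, 4 features, 1 label).
# Passing the data frame: the mask has one element per COLUMN.
mask_cols <- sample.split(iris, SplitRatio = 0.7)
length(mask_cols)

# Passing the label vector: the mask has one element per ROW and
# keeps the class proportions the same in both parts.
mask_rows <- sample.split(iris$Species, SplitRatio = 0.7)
length(mask_rows)

train <- iris[mask_rows, ]
test  <- iris[!mask_rows, ]

# Rows per class in each part
table(train$Species)
table(test$Species)
```

With the label vector, each class contributes 70% of its rows to the training set, so no vehicle type would be under-represented in training.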
# Feature Scaling
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])
head(train_scale)
##    fwindshield   tirewidth     weight clearance
## 1    4.8883077  6.66897430 -0.4999445 -1.353762
## 2   -1.0613860 -0.17033228 -1.3661681 -1.353762
## 4   -1.3749510 -0.01013231 -1.3087145 -1.353762
## 6   -0.5387778  1.27146748 -1.1938073 -1.089470
## 7   -1.3749510  0.47046762 -1.3661681 -1.221616
## 9   -1.5839943 -0.33053225 -1.3661681 -1.353762
head(test_scale)
##    fwindshield  tirewidth    weight clearance
## 3  -1.33191360  0.3806609 -1.370325 -1.234759
## 5  -0.97726205  1.3225023 -1.314507 -1.234759
## 8  -0.97726205  0.8515816 -1.258690 -1.234759
## 10 -1.09547923  0.1452005 -1.258690 -1.363380
## 13 -1.21369641 -0.0902598 -1.314507 -1.363380
## 15 -0.03152458  2.2643437 -1.426143 -1.234759
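The scaling above standardizes the test set with its own means and standard deviations. A common alternative, sketched here as an option rather than as the method used in this report, is to reuse the training set's centering and scaling values so that both sets sit on the same scale. The iris data stands in for tolldata, and the seed and index names are assumptions.

```r
set.seed(42)  # arbitrary seed for a repeatable sketch

# iris stands in for tolldata; columns 1:4 are the numeric features
features  <- iris[, 1:4]
train_idx <- sample(nrow(features), size = round(0.7 * nrow(features)))

# Standardize the training features; scale() stores the means and
# standard deviations it used as attributes on the result
train_scale <- scale(features[train_idx, ])
train_center <- attr(train_scale, "scaled:center")
train_sd     <- attr(train_scale, "scaled:scale")

# Apply the SAME means and standard deviations to the test features
test_scale <- scale(features[-train_idx, ],
                    center = train_center,
                    scale  = train_sd)

# Training columns have mean ~0; test columns are close to 0 but not exact
round(colMeans(train_scale), 10)
round(colMeans(test_scale), 3)
```

Reusing the training parameters mirrors how a new vehicle would be scored in production, where only the training statistics are available.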
Next, we will create a base cluster plot so we know what to expect.
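The plot itself is not reproduced here. A minimal sketch of such a plot with ggplot2 could look like the following. The choice of fwindshield and weight as axes is an assumption, and the toy rows are made-up values used only so the block runs without the Excel file.

```r
library(ggplot2)

# Made-up rows standing in for tolldata (values are illustrative only)
toy <- data.frame(
  fwindshield = c(60, 65, 62, 120, 125, 118, 160, 170, 165),
  weight      = c(18, 19, 17, 45, 48, 50, 70, 75, 72),
  vehicle     = factor(rep(c("car", "truck", "largetruck"), each = 3))
)

# Scatter plot of two features, coloured by the known vehicle type
p <- ggplot(toy, aes(x = fwindshield, y = weight, colour = vehicle)) +
  geom_point(size = 2) +
  labs(title = "Vehicle types by windshield measurement and weight",
       x = "fwindshield", y = "weight")

print(p)
```

With the real data, replacing toy with tolldata would show how well the three vehicle types separate before any model is fitted.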
We can now create our model using the knn function.
# Fitting KNN Model to training dataset
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$vehicle,
                      k = 1)
classifier_knn
## [1] car car car car car car
## [7] car car car car car car
## [13] car car car car car car
## [19] car car truck truck truck truck
## [25] truck truck truck truck truck truck
## [31] truck truck truck truck truck truck
## [37] truck truck truck truck largetruck largetruck
## [43] largetruck largetruck largetruck largetruck largetruck truck
## [49] largetruck largetruck largetruck largetruck largetruck truck
## [55] largetruck largetruck largetruck largetruck largetruck largetruck
## Levels: car largetruck truck
Let’s take a look at a confusion matrix to see how our model performs.
cm <- table(test_cl$vehicle, classifier_knn)
cm
##             classifier_knn
##              car largetruck truck
##   car         20          0     0
##   largetruck   0         18     2
##   truck        0          0    20
All cars and trucks in the test set are classified correctly, while 90% of large trucks are: two of the 20 large trucks were predicted as trucks. This means we are not overcharging anyone, but 10% of large trucks would be charged the 9 dollar truck toll instead of 15 dollars. Overall this is a well-performing model, but let’s see if we can improve the accuracy.
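The revenue effect of these errors can be worked out directly from the confusion matrix and the toll prices given at the start. The sketch below re-enters the matrix by hand, so the object names are illustrative and not part of the original analysis.

```r
# Confusion matrix from the report: rows = actual, columns = predicted
classes <- c("car", "largetruck", "truck")
cm <- matrix(c(20,  0,  0,
                0, 18,  2,
                0,  0, 20),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Toll prices in dollars, in the same class order
toll <- c(car = 7, largetruck = 15, truck = 9)

# Charged minus owed for every (actual, predicted) pair
diff <- outer(toll[classes], toll[classes],
              function(owed, charged) charged - owed)

overcharged  <- sum(cm * pmax(diff, 0))    # dollars charged in excess
undercharged <- sum(cm * pmax(-diff, 0))   # dollars of toll income lost
accuracy     <- sum(diag(cm)) / sum(cm)    # overall share correct
per_class    <- diag(cm) / rowSums(cm)     # share correct per actual class

overcharged    # 0
undercharged   # 12, i.e. 2 large trucks x (15 - 9) dollars
accuracy       # 58 / 60
per_class
```

On these 60 test vehicles the model loses 12 dollars of income and overcharges nobody.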
The parameter in this model is k, the number of nearest neighbors used to classify each vehicle. We can experiment with different values of k.
## [1] "Accuracy = 0.966666666666667"
## [1] "Accuracy = 0.966666666666667"
## [1] "Accuracy = 0.983333333333333"
## [1] "Accuracy = 0.983333333333333"
## [1] "Accuracy = 0.983333333333333"
## [1] "Accuracy = 0.933333333333333"
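The code behind these six results is not shown. A loop along the following lines would print accuracies in the same format. The k values here are hypothetical, since the report does not list which ones were tried, and iris again stands in for tolldata so the block is self-contained.

```r
library(class)

set.seed(1)  # arbitrary seed for a repeatable sketch

# iris stands in for tolldata; simple random 70/30 row split
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_x <- scale(iris[train_idx, 1:4])
# Test features scaled with the training means and standard deviations
test_x  <- scale(iris[-train_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Hypothetical candidate values of k (not taken from the report)
k_values <- c(1, 3, 5, 7, 15, 19)

# Fit kNN once per k and record the share of correct test predictions
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

for (i in seq_along(k_values)) {
  print(paste("k =", k_values[i], "Accuracy =", accuracy[i]))
}
```

Odd values of k are a common choice because they reduce the chance of tied votes among neighbors.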
Let’s visualize our performance.
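The performance plot is not reproduced here. As a rough sketch, the six accuracies reported above can be plotted in the order they were printed. Because the report does not state which k produced each result, the x-axis below is only the run order, and it is an assumption that each run used a larger k than the one before.

```r
library(ggplot2)

# The six accuracies printed in the report, in order
# (58/60, 58/60, 59/60, 59/60, 59/60, 56/60)
results <- data.frame(
  run      = 1:6,  # position in the printed output, NOT the k value
  accuracy = c(58, 58, 59, 59, 59, 56) / 60
)

# Line plot of test accuracy across the runs
p <- ggplot(results, aes(x = run, y = accuracy)) +
  geom_line() +
  geom_point(size = 2) +
  scale_x_continuous(breaks = results$run) +
  labs(title = "Test accuracy by run",
       x = "Run (replace with the k value used)", y = "Accuracy")

print(p)
```

Swapping the run column for the actual k values would turn this into the usual accuracy-versus-k curve.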
In conclusion, the model reaches just over 98% accuracy on the test set with a k value between 5 and 15. This automates the manual task of classifying vehicles from images and should save the company a good amount of money.