K_Nearest_Neighbor Analysis

This analysis uses k-nearest neighbors (KNN) classification to predict what type of vehicle passes through a toll booth, replacing the slow process of reviewing the images manually. Arford Inc. has started using new software in the cameras at its toll booths. The software records three approximate length measurements as well as the weight of each vehicle. The toll is 7 dollars for a car, 9 for a truck, and 15 for a large truck. It is important that we predict the vehicle type with high accuracy so we don’t overcharge people or lose out on income.

We will first load our dataset and take a look at its structure. It has four numeric columns and one character column, vehicle, which holds the class label.

library(readxl)
tolldata <- read_excel("data/tolldata.xlsx")
# Structure 
str(tolldata)

## tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
##  $ fwindshield: num [1:150] 137.7 63.7 61.1 59.8 65 ...
##  $ tirewidth  : num [1:150] 94.5 39 41.6 40.3 46.8 50.7 44.2 44.2 37.7 40.3 ...
##  $ weight     : num [1:150] 37.8 18.2 16.9 19.5 18.2 22.1 18.2 19.5 18.2 19.5 ...
##  $ clearance  : num [1:150] 5.4 5.4 5.4 5.4 5.4 10.8 8.1 5.4 5.4 2.7 ...
##  $ vehicle    : chr [1:150] "car" "car" "car" "car" ...

The following packages will be used for this model.

# Loading package 
library(e1071) 
library(caTools) 
library(class) 
library(tidyverse)
library(ggplot2)

A 70/30 split between training and testing data is common in k-nearest neighbor classification.

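tolldata <- as.data.frame(tolldata)
# Convert the class label to a factor, as knn() expects
tolldata <- tolldata %>% mutate(vehicle = as.factor(vehicle))

# Splitting data into train and test data 
# sample.split() takes the label vector and keeps class proportions in both sets
split <- sample.split(tolldata$vehicle, SplitRatio = 0.7) 
train_cl <- subset(tolldata, split == TRUE) 
test_cl <- subset(tolldata, split == FALSE) 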
 
# Feature Scaling 
train_scale <- scale(train_cl[, 1:4]) 
test_scale <- scale(test_cl[, 1:4]) 
 
 
head(train_scale)

##   fwindshield   tirewidth     weight clearance
## 1   4.8883077  6.66897430 -0.4999445 -1.353762
## 2  -1.0613860 -0.17033228 -1.3661681 -1.353762
## 4  -1.3749510 -0.01013231 -1.3087145 -1.353762
## 6  -0.5387778  1.27146748 -1.1938073 -1.089470
## 7  -1.3749510  0.47046762 -1.3661681 -1.221616
## 9  -1.5839943 -0.33053225 -1.3661681 -1.353762

head(test_scale)

##    fwindshield  tirewidth    weight clearance
## 3  -1.33191360  0.3806609 -1.370325 -1.234759
## 5  -0.97726205  1.3225023 -1.314507 -1.234759
## 8  -0.97726205  0.8515816 -1.258690 -1.234759
## 10 -1.09547923  0.1452005 -1.258690 -1.363380
## 13 -1.21369641 -0.0902598 -1.314507 -1.363380
## 15 -0.03152458  2.2643437 -1.426143 -1.234759
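
One caveat on the scaling step: scaling each set with its own mean and standard deviation puts the training and test rows on slightly different scales. A common alternative is to reuse the training set's centering and scaling values for the test set. The sketch below shows the idea with made-up numbers; `train_x` and `test_x` are hypothetical stand-ins for `train_cl[, 1:4]` and `test_cl[, 1:4]`.

```r
# Made-up feature values, not taken from tolldata
train_x <- data.frame(weight = c(18, 20, 22, 40), clearance = c(5, 6, 7, 10))
test_x  <- data.frame(weight = c(19, 38), clearance = c(5.5, 9))

# Scale the training features; scale() stores the means and standard
# deviations it used as attributes on the result
train_scale <- scale(train_x)

# Apply the training means and standard deviations to the test features
test_scale <- scale(test_x,
                    center = attr(train_scale, "scaled:center"),
                    scale  = attr(train_scale, "scaled:scale"))

print(test_scale)
```

With this approach a test vehicle is measured against the same reference as the training vehicles, which is what the distance calculation in KNN relies on.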

Next, we will create a baseline scatter plot of the vehicle classes so we know what to expect.
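
The plotting code is not shown in this report. A minimal sketch of such a plot with ggplot2 might look like the following; the `toy` data frame holds made-up rows that only borrow column names from `tolldata`, and the choice of `fwindshield` against `weight` is an assumption.

```r
library(ggplot2)

# Made-up rows that reuse column names from str(tolldata); not real measurements
toy <- data.frame(
  fwindshield = c(62, 65, 90, 95, 130, 138),
  weight      = c(18, 19, 28, 30, 38, 40),
  vehicle     = factor(c("car", "car", "truck", "truck", "largetruck", "largetruck"))
)

# One point per vehicle, coloured by class, to see how well the classes separate
p <- ggplot(toy, aes(x = fwindshield, y = weight, colour = vehicle)) +
  geom_point(size = 3) +
  labs(title = "Vehicle classes by windshield size and weight",
       x = "fwindshield", y = "weight")
print(p)
```

If the classes form visibly separate groups in a plot like this, a distance-based method such as KNN is likely to do well.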

We can now create our model using the knn function.

# Fitting KNN Model to training dataset 
classifier_knn <- knn(train = train_scale, 
                      test = test_scale, 
                      cl = train_cl$vehicle, 
                      k = 1) 
classifier_knn

##  [1] car        car        car        car        car        car       
##  [7] car        car        car        car        car        car       
## [13] car        car        car        car        car        car       
## [19] car        car        truck      truck      truck      truck     
## [25] truck      truck      truck      truck      truck      truck     
## [31] truck      truck      truck      truck      truck      truck     
## [37] truck      truck      truck      truck      largetruck largetruck
## [43] largetruck largetruck largetruck largetruck largetruck truck     
## [49] largetruck largetruck largetruck largetruck largetruck truck     
## [55] largetruck largetruck largetruck largetruck largetruck largetruck
## Levels: car largetruck truck
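
For intuition about what `knn` does when `k = 1`: each test row receives the label of the closest training row by Euclidean distance. The sketch below reproduces that by hand on a tiny made-up training set (the numbers and the names `train_x`, `train_y`, `new_x` are illustrative only) and compares the result with `class::knn`.

```r
library(class)

# Made-up, already-scaled training points: two per vehicle class
train_x <- matrix(c(0.0, 0.0,
                    0.2, 0.1,
                    2.0, 2.1,
                    2.2, 1.9,
                    4.0, 4.2,
                    4.1, 3.9),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("car", "car", "truck", "truck", "largetruck", "largetruck"))

# A new vehicle to classify
new_x <- c(2.1, 2.3)

# Euclidean distance from the new point to every training row
dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))

# With k = 1 the prediction is the label of the single nearest row
manual_pred <- as.character(train_y[which.min(dists)])

# The same prediction from class::knn
knn_pred <- as.character(knn(train = train_x,
                             test  = matrix(new_x, nrow = 1),
                             cl    = train_y,
                             k     = 1))

print(c(manual = manual_pred, knn = knn_pred))
```

With larger values of k, the prediction becomes a majority vote among the k nearest training rows instead of a single neighbor's label.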

Let’s take a look at a confusion matrix to see how our model performs.

cm <- table(test_cl$vehicle, classifier_knn) 
cm

##             classifier_knn
##              car largetruck truck
##   car         20          0     0
##   largetruck   0         18     2
##   truck        0          0    20

The model classifies every car and truck in the test set correctly, along with 90% of large trucks (18 of 20), for an overall accuracy of 96.7% (58 of 60). The two misclassified large trucks were predicted as trucks, so no one is overcharged, but those vehicles would be billed 9 dollars instead of 15. Overall this is a well-performing model, but let’s see if we can improve the accuracy.
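
These figures can be read directly off the confusion matrix. The sketch below rebuilds the matrix printed above and computes the per-class accuracy, the overall accuracy, and the toll shortfall on this test set, using the 9 and 15 dollar tolls from the introduction. The variable names are illustrative.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
classes <- c("car", "largetruck", "truck")
cm <- matrix(c(20,  0,  0,
                0, 18,  2,
                0,  0, 20),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

per_class <- diag(cm) / rowSums(cm)   # share of each actual class predicted correctly
overall   <- sum(diag(cm)) / sum(cm)  # share of all test vehicles predicted correctly

# Large trucks predicted as trucks are billed 9 dollars instead of 15
shortfall <- cm["largetruck", "truck"] * (15 - 9)

print(round(per_class, 3))
print(round(overall, 3))
print(shortfall)
```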

The tuning parameter in this model is k, the number of nearest neighbors that vote on each prediction. We can experiment with different values of k.

## [1] "Accuracy = 0.966666666666667"

## [1] "Accuracy = 0.966666666666667"

## [1] "Accuracy = 0.983333333333333"

## [1] "Accuracy = 0.983333333333333"

## [1] "Accuracy = 0.983333333333333"

## [1] "Accuracy = 0.933333333333333"
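
The code behind these accuracy figures is not shown in this report. A loop of the following shape could produce such a comparison. This is a sketch under stated assumptions: the built-in `iris` data stands in for `tolldata` (same shape: 150 rows, four numeric features, three classes), and the candidate values in `k_values` and the seed are hypothetical, not the ones used above.

```r
library(class)

set.seed(42)  # arbitrary seed so the sketch is repeatable

# iris is a stand-in for tolldata, which is not available here
features <- iris[, 1:4]
labels   <- iris$Species

# Random 70/30 split by row index
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))

# Scale the training features, then apply the same scaling to the test features
train_x <- scale(features[train_idx, ])
test_x  <- scale(features[-train_idx, ],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

k_values <- c(1, 3, 5, 7, 15, 19)  # hypothetical candidates

# Fit and score one KNN model per candidate k
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = labels[train_idx], k = k)
  mean(pred == labels[-train_idx])
})

results <- data.frame(k = k_values, accuracy = accuracy)
print(results)
```

Keeping k alongside each accuracy value in one data frame makes it easy to see which setting performed best and to plot accuracy against k.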

Let’s visualize our performance.

In conclusion, the model reaches about 98% accuracy on the test set with a k value between 5 and 15. This automates the manual task of classifying vehicles from images, which should reduce the company’s labor costs.

2024-01-24