K Nearest Neighbor Analysis

This analysis uses k-nearest neighbor (KNN) classification to predict what type of vehicle goes through a toll booth, replacing the slow process of reviewing the images manually. Arford Inc. has started using new software in the cameras at its toll booths. This software takes three approximate length measurements as well as the weight of the vehicle. The toll is 7 dollars for a car, 9 dollars for a truck, and 15 dollars for a large truck. It is important that we predict the vehicle type with high accuracy so we don’t overcharge people or lose out on income.

Load in Data and Import Packages

We will first load our dataset and look at its structure. It has four numeric columns and one character column, vehicle.

library(readxl)
tolldata <- read_excel("data/tolldata.xlsx")
# Structure 
str(tolldata)

## tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
##  $ fwindshield: num [1:150] 137.7 63.7 61.1 59.8 65 ...
##  $ tirewidth  : num [1:150] 94.5 39 41.6 40.3 46.8 50.7 44.2 44.2 37.7 40.3 ...
##  $ weight     : num [1:150] 37.8 18.2 16.9 19.5 18.2 22.1 18.2 19.5 18.2 19.5 ...
##  $ clearance  : num [1:150] 5.4 5.4 5.4 5.4 5.4 10.8 8.1 5.4 5.4 2.7 ...
##  $ vehicle    : chr [1:150] "car" "car" "car" "car" ...

The following packages will be used for this model.

# Loading package 
library(e1071) 
library(caTools) 
library(class) 
library(tidyverse)
library(ggplot2)

Splitting the data

A 70/30 split between training and testing data is a common choice for k-nearest neighbor classification.

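tolldata <- as.data.frame(tolldata)
tolldata <- tolldata %>% mutate(vehicle = as.factor(vehicle))

# Splitting data into train and test data
# Note: sample.split() documents its first argument as a vector of labels
# (e.g. tolldata$vehicle). Given the whole data frame, it returns one flag
# per column, which subset() recycles down the rows.
split <- sample.split(tolldata, SplitRatio = 0.7)
train_cl <- subset(tolldata, split)
test_cl <- subset(tolldata, !split)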
 
# Feature Scaling 
train_scale <- scale(train_cl[, 1:4]) 
test_scale <- scale(test_cl[, 1:4]) 
 
 
head(train_scale)

##   fwindshield   tirewidth     weight clearance
## 1   4.6267266  6.68678539 -0.5012242 -1.290892
## 3  -1.2414830  0.19034926 -1.3887350 -1.290892
## 4  -1.3410740  0.03070149 -1.2783270 -1.290892
## 6  -0.5443458  1.30788364 -1.1679189 -1.032426
## 8  -0.9427099  0.50964480 -1.2783270 -1.290892
## 9  -1.5402560 -0.28859404 -1.3335310 -1.290892

head(test_scale)

##    fwindshield   tirewidth    weight clearance
## 2   -1.1724769 -0.18115482 -1.368268 -1.325089
## 5   -1.0392409  1.23657852 -1.368268 -1.325089
## 7   -1.5721850  0.76400074 -1.368268 -1.192359
## 10  -1.1724769  0.05513407 -1.308735 -1.457819
## 12  -1.3057129  0.76400074 -1.249202 -1.325089
## 15   0.0266472  2.18173409 -1.487334 -1.325089
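Two details of the split and scaling step are worth a second look. `sample.split` is documented as taking a vector of class labels (for example `tolldata$vehicle`), which lets it keep class proportions equal in both sets, and a test set is usually standardized with the training set's means and standard deviations so that both sets share one scale. The sketch below illustrates both ideas in base R rather than caTools, on made-up data; the data frame `toy`, its values, and the helper `stratified_mask` are assumptions for illustration and not part of the original analysis.

```r
set.seed(42)  # make the random split reproducible

# Made-up stand-in for tolldata: 3 classes of 50 rows each (illustrative only)
toy <- data.frame(
  weight    = c(rnorm(50, 20, 2), rnorm(50, 60, 5), rnorm(50, 90, 8)),
  clearance = c(rnorm(50, 5, 1),  rnorm(50, 35, 4), rnorm(50, 50, 6)),
  vehicle   = factor(rep(c("car", "truck", "largetruck"), each = 50))
)

# Hypothetical helper: TRUE marks training rows, sampled within each class
stratified_mask <- function(labels, ratio) {
  mask <- logical(length(labels))
  for (lvl in unique(as.character(labels))) {
    idx <- which(labels == lvl)
    n_train <- round(length(idx) * ratio)
    mask[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  mask
}

in_train <- stratified_mask(toy$vehicle, 0.7)
train <- toy[in_train, ]
test  <- toy[!in_train, ]

# Standardize the training features, then reuse the same center and scale
# for the test features instead of recomputing them on the test set
train_x <- scale(train[, c("weight", "clearance")])
test_x  <- scale(test[, c("weight", "clearance")],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

table(train$vehicle)  # 35 rows per class
table(test$vehicle)   # 15 rows per class
```

With this approach the test columns will not have exactly zero mean or unit variance, which is expected: they are measured on the training set's scale.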

Categorized Visualization

Next, we will plot the data, colored by vehicle type, so we can get an idea of what to expect.
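The plot itself is not reproduced in this text. As a rough sketch of what such a plot could look like, the code below draws a scatter plot colored by class with ggplot2. The data frame `toy` and its values are made up for illustration, and plotting `weight` against `clearance` is an assumption; any pair of the four measurements could be used.

```r
library(ggplot2)

set.seed(1)
# Made-up stand-in that borrows two of tolldata's column names (illustrative only)
toy <- data.frame(
  weight    = c(rnorm(50, 20, 2), rnorm(50, 60, 5), rnorm(50, 90, 8)),
  clearance = c(rnorm(50, 5, 1),  rnorm(50, 35, 4), rnorm(50, 50, 6)),
  vehicle   = factor(rep(c("car", "truck", "largetruck"), each = 50))
)

# One point per vehicle, colored by its class
p <- ggplot(toy, aes(x = weight, y = clearance, color = vehicle)) +
  geom_point() +
  labs(title = "Vehicles by class", x = "weight", y = "clearance")
print(p)
```

Classes that form well-separated groups in a plot like this are a good sign that a distance-based classifier will work.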

Model Creation

We can now create our model using the knn function.

# Fitting KNN Model to training dataset 
classifier_knn <- knn(train = train_scale, 
                      test = test_scale, 
                      cl = train_cl$vehicle, 
                      k = 1) 
classifier_knn

##  [1] car        car        car        car        car        car       
##  [7] car        car        car        car        car        car       
## [13] car        car        car        car        car        car       
## [19] car        car        truck      truck      truck      truck     
## [25] truck      truck      truck      truck      truck      truck     
## [31] truck      truck      truck      truck      truck      truck     
## [37] truck      truck      truck      truck      largetruck largetruck
## [43] truck      largetruck largetruck largetruck largetruck truck     
## [49] largetruck largetruck largetruck largetruck largetruck truck     
## [55] largetruck largetruck largetruck largetruck largetruck largetruck
## Levels: car largetruck truck
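To make clear what `knn` does for each test row, here is a minimal from-scratch sketch: measure the Euclidean distance to every training row, keep the k closest, and take a majority vote. The helper `knn_one` and the tiny data set are hypothetical, and unlike `class::knn`, which breaks ties at random, this version simply returns the first tied level.

```r
# Hypothetical helper: classify one point by majority vote among its k nearest rows
knn_one <- function(train_x, train_y, point, k = 1) {
  # Euclidean distance from the point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, point)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]  # ties go to the first level here
}

# Tiny made-up training set: two cars near the origin, two trucks near (5, 5)
train_x <- matrix(c(0, 0,
                    0, 1,
                    5, 5,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("car", "car", "truck", "truck"))

knn_one(train_x, train_y, c(0.2, 0.4), k = 1)  # "car"
knn_one(train_x, train_y, c(5.5, 4.8), k = 3)  # "truck"
```

With k = 1, as in the model above, the prediction is just the label of the single closest training row.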

Confusion Matrix

Let’s take a look at a confusion matrix to see how our model performs.

cm <- table(test_cl$vehicle, classifier_knn) 
cm

##             classifier_knn
##              car largetruck truck
##   car         20          0     0
##   largetruck   0         17     3
##   truck        0          0    20

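The model classifies every car and truck in the test set correctly, and 17 of the 20 large trucks (85%); the other 3 are predicted as trucks. Overall accuracy is 57 of 60, or 95%. This means we are not overcharging anyone, but each misclassified large truck pays 9 dollars instead of 15, so we lose 6 dollars per error. Overall this is a well-performing model, but let’s see if we can improve the accuracy.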
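The per-class rates and the revenue effect can be worked out directly from the confusion matrix. The sketch below re-enters the matrix printed above by hand and uses the tolls from the introduction; the variable names are illustrative.

```r
# Confusion matrix copied from the output above (rows = actual, cols = predicted)
cm <- matrix(c(20,  0,  0,
                0, 17,  3,
                0,  0, 20),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("car", "largetruck", "truck"),
                             predicted = c("car", "largetruck", "truck")))

# Tolls in dollars, from the introduction
toll <- c(car = 7, largetruck = 15, truck = 9)

recall   <- diag(cm) / rowSums(cm)   # share of each actual class predicted correctly
accuracy <- sum(diag(cm)) / sum(cm)  # overall share of correct predictions

# Charged minus owed for every cell: negative = undercharge, positive = overcharge
toll_gap <- outer(toll, toll, function(owed, charged) charged - owed)
net <- sum(cm * toll_gap)

recall    # car 1.00, largetruck 0.85, truck 1.00
accuracy  # 0.95
net       # -18
```

On this test set the three misclassified large trucks cost 18 dollars in total, against 300 dollars owed by large trucks.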

Changing the K

The tuning parameter in this model is k, the number of nearest neighbors that vote on each prediction. We can experiment with different values of k.

## [1] "Accuracy = 0.95"

## [1] "Accuracy = 0.933333333333333"

## [1] "Accuracy = 0.933333333333333"

## [1] "Accuracy = 0.95"

## [1] "Accuracy = 0.916666666666667"

## [1] "Accuracy = 0.9"
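The code that produced these accuracies is not shown. A loop along the following lines would print results in the same format. This is only a sketch under stated assumptions: R's built-in `iris` data stands in for `tolldata`, and the grid `k_values` is hypothetical because the k values behind the printed runs are not recorded here.

```r
library(class)

set.seed(123)
# Stand-in data: built-in iris (150 rows, 4 numeric features, 3 classes)
train_idx <- sample(nrow(iris), 105)  # 70% of the rows for training
train_x <- scale(iris[train_idx, 1:4])
test_x  <- scale(iris[-train_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Hypothetical grid of k values to try
k_values <- c(1, 3, 5, 7, 15, 19)

# Fit one model per k and record the share of correct test predictions
accuracies <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

for (i in seq_along(k_values)) {
  print(paste("k =", k_values[i], "Accuracy =", accuracies[i]))
}
```

Odd values of k are a common choice because they reduce the chance of tied votes.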

Visualizing Performance at Different K Values

Let’s visualize our performance.
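The plot is not reproduced in this text. The k value behind each printed accuracy is not recorded either, so the sketch below plots the six accuracies in the order they were printed; with the actual k values on hand, they would replace `run` on the x-axis. The data frame `results` and the axis labels are illustrative.

```r
library(ggplot2)

# Accuracies copied from the printed output above, in printed order
results <- data.frame(
  run      = 1:6,
  accuracy = c(0.95, 0.933333333333333, 0.933333333333333,
               0.95, 0.916666666666667, 0.9)
)

# Line plot with a marker for each run
p <- ggplot(results, aes(x = run, y = accuracy)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = results$run) +
  labs(x = "Run (in printed order)", y = "Test accuracy")
print(p)
```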

In conclusion, the accuracies printed above range from 90% to 95% depending on k. A model at this level can take over most of the manual vehicle classification and should save the company a good amount of money.