Regression Analysis in the NFL

SNOWFALL

Sports analytics has grown rapidly in the NFL over the last decade. Some sports figures have mixed feelings about the introduction of advanced analytics, but there is no doubt that certain teams have had great success using advanced stats. I will be analyzing the NFL's passer rating, and then using physical attributes to predict this rating.

To begin, we will analyze a dataset that includes the stats of 2023 starting quarterbacks. One field in this dataset is passer rating (`rate`), a composite statistic computed from a formula (the dataset also includes ESPN's separate QBR metric). My goal is to analyze the correlation between a QB's stats and this rating. I will use a correlation plot along with linear regression to analyze these relationships.

In the second part of this project, I will use a different dataset. I will use multiple linear regression and lasso regression to predict a QB's rating based on physical attributes. This could be helpful for NFL scouts when drafting a QB out of college.

Below are the packages and functions I will be using. Most of the data cleaning will be done with pandas and NumPy. To create, test, and validate models I will use scikit-learn, and I will use NumPy to calculate the final statistics.

Importing Packages and Initial View

In [1]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import absolute, mean, std
from sklearn import linear_model, preprocessing
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split

To make a correlation plot I will need to load and clean the data. We need only numeric columns to create this plot.

In [2]:
warnings.filterwarnings('ignore')
nfl_df = pd.read_csv("nfl2023.csv") 
nfl_df.head()
Out[2]:
rank player team age pod g gs cmp att cmppct ... aya ypc ypg rate qbr sk yds.1 spct nya anya
0 1.0 Tua Tagovailoa MIA 25.0 QB 17.0 17.0 388.0 560.0 69.3 ... 8.2 11.9 272.0 101.1 60.8 29.0 171.0 4.9 7.56 7.48
1 2.0 Jared Goff DET 29.0 QB 17.0 17.0 407.0 605.0 67.3 ... 7.7 11.2 269.1 97.9 60.3 30.0 197.0 4.7 6.89 6.99
2 3.0 Dak Prescott DAL 30.0 QB 17.0 17.0 410.0 590.0 69.5 ... 8.2 11.0 265.6 105.9 72.7 39.0 255.0 6.2 6.77 7.28
3 4.0 Josh Allen BUF 27.0 QB 17.0 17.0 385.0 579.0 66.5 ... 7.0 11.2 253.3 92.2 69.6 24.0 152.0 4.0 6.89 6.51
4 5.0 Brock Purdy SFO 24.0 QB 16.0 16.0 308.0 444.0 69.4 ... 9.9 13.9 267.5 113.0 72.8 28.0 153.0 5.9 8.74 9.01

5 rows × 29 columns

Clean and Scope Data

In [3]:
n = 41

# Drop the last n rows (backups and low-attempt passers), keeping the top 29 starters.
# Note: with inplace=True, drop() returns None, so the result must not be assigned.
nfl_df.drop(nfl_df.tail(n).index, inplace=True)
In [4]:
corr_df = nfl_df.drop(columns=['rank', 'player', 'team', 'pod'])
corr_df.head()
Out[4]:
age g gs cmp att cmppct yds tf tdpct int ... aya ypc ypg rate qbr sk yds.1 spct nya anya
0 25.0 17.0 17.0 388.0 560.0 69.3 4624.0 29.0 5.2 14.0 ... 8.2 11.9 272.0 101.1 60.8 29.0 171.0 4.9 7.56 7.48
1 29.0 17.0 17.0 407.0 605.0 67.3 4575.0 30.0 5.0 12.0 ... 7.7 11.2 269.1 97.9 60.3 30.0 197.0 4.7 6.89 6.99
2 30.0 17.0 17.0 410.0 590.0 69.5 4516.0 36.0 6.1 9.0 ... 8.2 11.0 265.6 105.9 72.7 39.0 255.0 6.2 6.77 7.28
3 27.0 17.0 17.0 385.0 579.0 66.5 4306.0 29.0 5.0 18.0 ... 7.0 11.2 253.3 92.2 69.6 24.0 152.0 4.0 6.89 6.51
4 24.0 16.0 16.0 308.0 444.0 69.4 4280.0 31.0 7.0 11.0 ... 9.9 13.9 267.5 113.0 72.8 28.0 153.0 5.9 8.74 9.01

5 rows × 25 columns

Correlation Plot

I began with a correlation plot. This plot uses only numeric variables so I cleaned out the non-numeric columns.

In [5]:
corr = corr_df.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[5]:
  age g gs cmp att cmppct yds tf tdpct int intpct 1d succpct lng ya aya ypc ypg rate qbr sk yds.1 spct nya anya
age 1.000000 -0.096904 -0.058288 0.092303 0.027248 0.378987 0.068625 0.283090 0.398776 -0.201751 -0.267725 0.090304 0.250194 0.189479 0.114578 0.241700 -0.033522 0.311876 0.373436 0.377266 -0.302338 -0.366297 -0.335697 0.193378 0.293391
g -0.096904 1.000000 0.947394 0.857407 0.880886 0.162190 0.846586 0.654199 0.274326 0.564133 0.153288 0.814989 0.366283 0.371712 0.354643 0.289473 0.347475 0.228232 0.243155 0.399136 0.398730 0.277875 -0.138835 0.348502 0.299672
gs -0.058288 0.947394 1.000000 0.898056 0.903782 0.266359 0.886366 0.744135 0.381754 0.564149 0.129231 0.861190 0.449808 0.334011 0.401916 0.356744 0.358319 0.354868 0.337671 0.437934 0.388259 0.295769 -0.170364 0.390225 0.360396
cmp 0.092303 0.857407 0.898056 1.000000 0.981798 0.429316 0.933687 0.819149 0.442877 0.600554 0.106078 0.939518 0.569898 0.348069 0.363722 0.347863 0.242721 0.574668 0.395058 0.577638 0.233116 0.171267 -0.392721 0.415506 0.397321
att 0.027248 0.880886 0.903782 0.981798 1.000000 0.254365 0.896764 0.737374 0.319134 0.625179 0.120960 0.897764 0.437261 0.290577 0.255577 0.235044 0.190616 0.479179 0.258355 0.477424 0.343072 0.296493 -0.293298 0.294886 0.273843
cmppct 0.378987 0.162190 0.266359 0.429316 0.254365 1.000000 0.496097 0.678879 0.780841 0.103741 -0.024563 0.514556 0.838139 0.351031 0.650268 0.671652 0.339429 0.707967 0.810791 0.708133 -0.421986 -0.516058 -0.614579 0.730529 0.741264
yds 0.068625 0.846586 0.886366 0.933687 0.896764 0.496097 1.000000 0.891506 0.602020 0.511032 0.057922 0.987144 0.720030 0.522083 0.652745 0.615508 0.571344 0.701055 0.594801 0.704422 0.121029 0.050318 -0.440615 0.673940 0.645405
tf 0.283090 0.654199 0.744135 0.819149 0.737374 0.678879 0.891506 1.000000 0.861829 0.337682 -0.045441 0.902285 0.799468 0.526854 0.689922 0.725873 0.533740 0.763021 0.791048 0.815154 -0.022791 -0.114336 -0.480900 0.719189 0.756094
tdpct 0.398776 0.274326 0.381754 0.442877 0.319134 0.780841 0.602020 0.861829 1.000000 0.043867 -0.137667 0.606851 0.792002 0.438851 0.789619 0.854679 0.614041 0.761222 0.934254 0.818219 -0.224387 -0.327145 -0.428565 0.794594 0.863398
int -0.201751 0.564133 0.564149 0.600554 0.625179 0.103741 0.511032 0.337682 0.043867 1.000000 0.828189 0.502828 0.247263 -0.023687 0.051363 -0.155769 0.025601 0.176181 -0.156891 0.120725 0.270975 0.146509 -0.143110 0.110455 -0.089974
intpct -0.267725 0.153288 0.129231 0.106078 0.120960 -0.024563 0.057922 -0.045441 -0.137667 0.828189 1.000000 0.054950 0.040843 -0.201250 -0.085309 -0.341080 -0.077540 -0.122191 -0.359339 -0.174296 0.089859 -0.045769 0.001506 -0.034404 -0.281243
1d 0.090304 0.814989 0.861190 0.939518 0.897764 0.514556 0.987144 0.902285 0.606851 0.502828 0.054950 1.000000 0.736239 0.522984 0.617692 0.590601 0.517996 0.714296 0.588809 0.711426 0.069109 0.020066 -0.489252 0.653081 0.630445
succpct 0.250194 0.366283 0.449808 0.569898 0.437261 0.838139 0.720030 0.799468 0.792002 0.247263 0.040843 0.736239 1.000000 0.563832 0.814080 0.782768 0.619845 0.809252 0.817766 0.837726 -0.460164 -0.503405 -0.760726 0.899464 0.863322
lng 0.189479 0.371712 0.334011 0.348069 0.290577 0.351031 0.522083 0.526854 0.438851 -0.023687 -0.201250 0.522984 0.563832 1.000000 0.593199 0.603200 0.577296 0.426060 0.537869 0.625337 -0.297472 -0.340659 -0.435324 0.624214 0.628479
ya 0.114578 0.354643 0.401916 0.363722 0.255577 0.650268 0.652745 0.689922 0.789619 0.051363 -0.085309 0.617692 0.814080 0.593199 1.000000 0.955494 0.932350 0.720316 0.871648 0.729782 -0.269983 -0.347630 -0.422742 0.966948 0.946365
aya 0.241700 0.289473 0.356744 0.347863 0.235044 0.671652 0.615508 0.725873 0.854679 -0.155769 -0.341080 0.590601 0.782768 0.603200 0.955494 1.000000 0.865259 0.749503 0.957212 0.781493 -0.276368 -0.324431 -0.415558 0.919160 0.979305
ypc -0.033522 0.347475 0.358319 0.242721 0.190616 0.339429 0.571344 0.533740 0.614041 0.025601 -0.077540 0.517996 0.619845 0.577296 0.932350 0.865259 1.000000 0.573362 0.697040 0.580806 -0.153090 -0.207109 -0.248473 0.859729 0.824698
ypg 0.311876 0.228232 0.354868 0.574668 0.479179 0.707967 0.701055 0.763021 0.761222 0.176181 -0.122191 0.714296 0.809252 0.426060 0.720316 0.749503 0.573362 1.000000 0.785172 0.788274 -0.270803 -0.249964 -0.603812 0.765344 0.789656
rate 0.373436 0.243155 0.337671 0.395058 0.258355 0.810791 0.594801 0.791048 0.934254 -0.156891 -0.359339 0.588809 0.817766 0.537869 0.871648 0.957212 0.697040 0.785172 1.000000 0.826181 -0.315711 -0.374640 -0.481912 0.868464 0.958765
qbr 0.377266 0.399136 0.437934 0.577638 0.477424 0.708133 0.704422 0.815154 0.818219 0.120725 -0.174296 0.711426 0.837726 0.625337 0.729782 0.781493 0.580806 0.788274 0.826181 1.000000 -0.361001 -0.412892 -0.672507 0.805898 0.842078
sk -0.302338 0.398730 0.388259 0.233116 0.343072 -0.421986 0.121029 -0.022791 -0.224387 0.270975 0.089859 0.069109 -0.460164 -0.297472 -0.269983 -0.276368 -0.153090 -0.270803 -0.315711 -0.361001 1.000000 0.936531 0.780383 -0.440851 -0.410559
yds.1 -0.366297 0.277875 0.295769 0.171267 0.296493 -0.516058 0.050318 -0.114336 -0.327145 0.146509 -0.045769 0.020066 -0.503405 -0.340659 -0.347630 -0.324431 -0.207109 -0.249964 -0.374640 -0.412892 0.936531 1.000000 0.742603 -0.518535 -0.461820
spct -0.335697 -0.138835 -0.170364 -0.392721 -0.293298 -0.614579 -0.440615 -0.480900 -0.428565 -0.143110 0.001506 -0.489252 -0.760726 -0.435324 -0.422742 -0.415558 -0.248473 -0.603812 -0.481912 -0.672507 0.780383 0.742603 1.000000 -0.630212 -0.583843
nya 0.193378 0.348502 0.390225 0.415506 0.294886 0.730529 0.673940 0.719189 0.794594 0.110455 -0.034404 0.653081 0.899464 0.624214 0.966948 0.919160 0.859729 0.765344 0.868464 0.805898 -0.440851 -0.518535 -0.630212 1.000000 0.961000
anya 0.293391 0.299672 0.360396 0.397321 0.273843 0.741264 0.645405 0.756094 0.863398 -0.089974 -0.281243 0.630445 0.863322 0.628479 0.946365 0.979305 0.824698 0.789656 0.958765 0.842078 -0.410559 -0.461820 -0.583843 0.961000 1.000000

Right away we can see three variables with very strong correlations to `rate`: touchdown percentage (.93), adjusted yards per attempt (.96), and adjusted net yards per attempt (.96). These three are calculated stats just like passer rating, but their formulas are public:

  • tdpct: 100(passing TD)/(passing attempts)
  • aya: (pass yards + 20(pass TD) - 45(interceptions thrown))/(passing attempts)
  • anya: (pass yards + 20(pass TD) - 45(interceptions thrown) - sack yards)/(passing attempts + sacks)
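
As a sanity check, these formulas can be recomputed from the raw columns and compared against the dataset's own values. A sketch using the first two rows shown in the table above (Tagovailoa and Goff); the column names mirror this dataset, with `yds.1` renamed to `skyds` for clarity:

```python
import pandas as pd

# Raw passing stats for the first two rows shown above (Tagovailoa, Goff)
df = pd.DataFrame({
    "yds":   [4624.0, 4575.0],  # passing yards
    "tf":    [29.0, 30.0],      # passing touchdowns
    "int":   [14.0, 12.0],      # interceptions thrown
    "att":   [560.0, 605.0],    # pass attempts
    "sk":    [29.0, 30.0],      # sacks taken
    "skyds": [171.0, 197.0],    # sack yardage lost ("yds.1" in the dataset)
})

# Apply the public formulas listed above
df["tdpct"] = 100 * df["tf"] / df["att"]
df["aya"] = (df["yds"] + 20 * df["tf"] - 45 * df["int"]) / df["att"]
df["anya"] = (df["yds"] + 20 * df["tf"] - 45 * df["int"] - df["skyds"]) / (df["att"] + df["sk"])

print(df[["tdpct", "aya", "anya"]].round(2))
```

Rounded, these reproduce the `tdpct`, `aya`, and `anya` values shown in the table earlier.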

Linear Regression

Next, I will fit a simple linear regression of rating on each of these variables so I can visualize the relationships. Because a one-variable regression's $r^{2}$ is the square of the Pearson correlation, this also double-checks the values from the correlation plot.

In [6]:
x = corr_df.tdpct.values.reshape(-1, 1)
y = corr_df.rate.values.reshape(-1, 1)
regr = LinearRegression().fit(x, y)

x2 = corr_df.aya.values.reshape(-1, 1)
y2 = corr_df.rate.values.reshape(-1, 1)
regr2 = LinearRegression().fit(x2, y2)

x3 = corr_df.anya.values.reshape(-1, 1)
y3 = corr_df.rate.values.reshape(-1, 1)
regr3 = LinearRegression().fit(x3, y3)


print(regr.score(x,y))
print(regr2.score(x2,y2))
print(regr3.score(x3,y3))
0.872830434973256
0.9162543052759315
0.9192305613686622

Each $r^{2}$ is simply the square of the corresponding correlation coefficient (e.g. $0.934^{2} \approx 0.873$), so these values agree with the correlation plot and remain very strong. I will proceed with these variables for my formula. Below I visualize each model line against the data points.
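
The agreement is no accident: for simple (one-variable) linear regression, the model's $r^{2}$ equals the squared Pearson correlation exactly. A quick check on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)  # noisy linear relationship

r = np.corrcoef(x, y)[0, 1]      # Pearson correlation
X = x.reshape(-1, 1)
r2 = LinearRegression().fit(X, y).score(X, y)

print(abs(r ** 2 - r2) < 1e-9)   # the two quantities coincide
```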

In [7]:
# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y,  color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.title("Quarterback Statistics")
plt.xlabel("Touchdown %")
plt.ylabel("Rating")
plt.show()

plt.scatter(x2, y2,  color='black')
plt.plot(x2, regr2.predict(x2), color='blue', linewidth=3)
plt.xlabel("Adjusted Yards per Attempt")
plt.ylabel("Rating")
plt.show()


plt.scatter(x3, y3,  color='black')
plt.plot(x3, regr3.predict(x3), color='blue', linewidth=3)
plt.xlabel("Adjusted Net Yards per Attempt")
plt.ylabel("Rating")
plt.show()

All three relationships are strongly linear with few outliers.

Recreating the Rating

I'm going to see how close I can get to the rating formula using these variables. I am only using 3 inputs, but since each is itself a combination of raw stats, several underlying variables enter the formula once they are split apart. I use pandas to mutate these columns into a new formula column, multiplying each variable by a numeric coefficient so my scale is close to the official one. The first time I tried this, the difference between my rating and the published rating was about 5 points; by tuning the coefficients I got it down to 3.31.

In [8]:
nfl_df['myrating'] = ((nfl_df['anya']*3) + (nfl_df['aya']*4)  + (nfl_df['tdpct']*4.8)) + 23
In [9]:
nfl_df.head()
Out[9]:
rank player team age pod g gs cmp att cmppct ... ypc ypg rate qbr sk yds.1 spct nya anya myrating
0 1.0 Tua Tagovailoa MIA 25.0 QB 17.0 17.0 388.0 560.0 69.3 ... 11.9 272.0 101.1 60.8 29.0 171.0 4.9 7.56 7.48 103.20
1 2.0 Jared Goff DET 29.0 QB 17.0 17.0 407.0 605.0 67.3 ... 11.2 269.1 97.9 60.3 30.0 197.0 4.7 6.89 6.99 98.77
2 3.0 Dak Prescott DAL 30.0 QB 17.0 17.0 410.0 590.0 69.5 ... 11.0 265.6 105.9 72.7 39.0 255.0 6.2 6.77 7.28 106.92
3 4.0 Josh Allen BUF 27.0 QB 17.0 17.0 385.0 579.0 66.5 ... 11.2 253.3 92.2 69.6 24.0 152.0 4.0 6.89 6.51 94.53
4 5.0 Brock Purdy SFO 24.0 QB 16.0 16.0 308.0 444.0 69.4 ... 13.9 267.5 113.0 72.8 28.0 153.0 5.9 8.74 9.01 123.23

5 rows × 30 columns

I mutated the existing variables into a new column so I can compare my rating against the original. Rather than scanning the columns, it is easier to visualize the two side by side.

In [10]:
plt.rcParams.update({'figure.figsize':(10,8), 'figure.dpi':100})
plt.scatter(nfl_df.player, nfl_df.rate)
plt.scatter(nfl_df.player, nfl_df.myrating)
plt.xticks(rotation=90)
plt.show()
In [11]:
nfl_df['ratingdiff'] = (nfl_df['rate']) - (nfl_df['myrating'])

nfl_df['ratingdiff'] = nfl_df['ratingdiff'].abs()

avg = nfl_df['ratingdiff'].mean()
In [12]:
print(avg)
3.3103448275862077

I took the mean absolute difference between the two ratings. By tuning the coefficients, I reduced the average error from about 5 points to 3.31.
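
The hand-tuning above can also be automated: sweep a small grid of candidate coefficients and keep the combination that minimizes the mean absolute difference. A sketch on synthetic stand-in data (the real version would use the `anya`, `aya`, and `tdpct` columns and the actual `rate`):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 29

# Synthetic stand-ins for the three calculated stats
anya = rng.uniform(4.0, 9.0, n)
aya = anya + rng.uniform(0.0, 1.0, n)
tdpct = rng.uniform(3.0, 8.0, n)
# Pretend "official" rating built from them, plus noise
rate = 3.0 * anya + 4.0 * aya + 4.8 * tdpct + 23.0 + rng.normal(0.0, 2.0, n)

def mae(a, b, c):
    """Mean absolute difference between the candidate formula and the target."""
    return np.mean(np.abs(rate - (a * anya + b * aya + c * tdpct + 23.0)))

# Exhaustively score every coefficient combination on a small grid
grid = itertools.product([2.0, 3.0, 4.0], [3.0, 4.0, 5.0], [4.3, 4.8, 5.3])
best_err, best_coefs = min((mae(a, b, c), (a, b, c)) for a, b, c in grid)
print(best_err, best_coefs)
```

A finer grid (or a proper least-squares fit) would tune the coefficients further, at the cost of more combinations to evaluate.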


Predicting Quarterback Rating With Multiple Linear Regression

The next dataset was not available publicly, so I scraped each column from the web individually. The data lists 4 physical attributes for all starting quarterbacks. I'm going to apply multiple linear regression as well as lasso regression. The goal is to predict quarterback rating and success based on physical attributes. Much more goes into a QB than physical attributes, but this could be a useful tool for scouts analyzing the riskiness of a certain physical profile. For example, many news outlets claimed Kenny Pickett would play poorly due to his hand size. Let's see if this view holds up.

Looking at another correlation plot (below), we can see that height has almost no correlation with rating and hand size's is very low. The others are low compared to our last dataset, but significant enough to use for this experiment. I will create a multiple linear regression model with weight and age, then use the model to predict QB success.

In [13]:
measurements = pd.read_csv("nfl2023measure.csv") 
measurements.head()
Out[13]:
name height weight age handsize
0 Tua Tagovailoa 71 227 25 10.000
1 Jared Goff 76 222 29 9.000
2 Dak Prescott 74 229 30 10.000
3 Josh Allen 77 237 28 10.125
4 Brock Purdy 73 220 24 9.250
In [14]:
# Target variable: the actual passer rating from the first dataset
# (stored as 'myrating' for consistency with the later code)
measurements['myrating'] = nfl_df['rate']
measurements1 = measurements.drop('name', axis=1)
measurements2 = measurements1.drop('height', axis=1)
measurements3 = measurements2.drop('handsize', axis=1)
measurements1.head()


measure = measurements1.corr()
measure.style.background_gradient(cmap='coolwarm')
Out[14]:
  height weight age handsize myrating
height 1.000000 0.208924 -0.013441 -0.034121 0.033015
weight 0.208924 1.000000 0.099235 0.186289 0.340325
age -0.013441 0.099235 1.000000 0.065689 0.355507
handsize -0.034121 0.186289 0.065689 1.000000 0.153581
myrating 0.033015 0.340325 0.355507 0.153581 1.000000
In [15]:
x4 = measurements3.drop('myrating',axis= 1) 
y4 = measurements3['myrating'] 

I will split the data into 70% training and 30% testing sets.

In [16]:
X_train, X_test, y_train, y_test = train_test_split( 
    x4, y4, test_size=0.3, random_state=101) 
In [17]:
model = LinearRegression() 
In [18]:
model.fit(X_train,y_train)
Out[18]:
LinearRegression()
In [19]:
predictions = model.predict(X_test) 

I use the predict method to generate predictions on the test set. The mean absolute error is approximately 11 rating points, which is too large to be useful on its own. As a rough correction, I will add 11 to each prediction as a constant offset.

In [20]:
print(predictions)
[89.86870924 93.25767581 89.51779279 91.83190589 95.14567609 91.95723319
 94.54706636 91.83190589 98.76023181]
In [21]:
print( 
  'mean_squared_error : ', mean_squared_error(y_test, predictions)) 
print( 
  'mean_absolute_error : ', mean_absolute_error(y_test, predictions)) 
mean_squared_error :  141.5633064933608
mean_absolute_error :  11.032362146922777
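
Rather than hardcoding an 11-point offset, the same correction can be estimated from the training residuals. A synthetic sketch (not the project's data; `LinearRegression` normally fits an intercept, so the uncorrected bias here is simulated by disabling it):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, 2.0]) + 11.0 + rng.normal(scale=0.5, size=40)

X_tr, X_te, y_tr, y_te = X[:30], X[30:], y[:30], y[30:]
# fit_intercept=False forces the model to miss the constant term
model = LinearRegression(fit_intercept=False).fit(X_tr, y_tr)

# Estimate the constant correction from the TRAINING residuals only
offset = np.mean(y_tr - model.predict(X_tr))

mae_raw = np.mean(np.abs(y_te - model.predict(X_te)))
mae_fix = np.mean(np.abs(y_te - (model.predict(X_te) + offset)))
print(mae_raw, mae_fix)  # the corrected predictions have smaller error
```

Deriving the offset from training data avoids tuning a constant against the test set, which would leak information.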
In [22]:
regr = linear_model.LinearRegression()
regr.fit(x4, y4)
Out[22]:
LinearRegression()

I'm going to check the accuracy of my model by plugging in real athletes' measurements (the weight and age of Tua Tagovailoa, Jared Goff, and Dak Prescott from the table above).

In [23]:
predicted1 = regr.predict([[227, 25]])  # Tua Tagovailoa
predicted2 = regr.predict([[222, 29]])  # Jared Goff
predicted3 = regr.predict([[229, 30]])  # Dak Prescott
In [24]:
predicted1 + 11
Out[24]:
array([103.23099967])
In [25]:
predicted2 + 11
Out[25]:
array([104.95161571])
In [26]:
predicted3 + 11
Out[26]:
array([108.00396546])

We can see that with the added offset, these predictions land within a few points of the players' actual ratings.

Unlike the last dataset, the correlations here are not very high. They are highest for age and weight; height has almost no correlation, which is why it was excluded from the model.


Lasso Regression Model

I'm going to use the Scikit package for another model. This time I will use Lasso Regression.

Lasso regression is like linear regression, but it applies a technique called "shrinkage", in which the regression coefficients are shrunk toward zero.

Linear regression gives you regression coefficients as observed in the dataset. The lasso regression allows you to shrink or regularize these coefficients to avoid overfitting and make them work better on different datasets.

This type of regression is used when the dataset shows high multicollinearity or when you want to automate variable elimination and feature selection.
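
The shrinkage effect is easy to demonstrate on synthetic data: with one strong predictor and several irrelevant ones, ordinary least squares assigns every feature a nonzero coefficient, while lasso drives the irrelevant ones exactly to zero. A sketch (not the project's data):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # only feature 0 matters

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print(np.round(ols.coef_, 3))    # small but nonzero weights on every feature
print(np.round(lasso.coef_, 3))  # irrelevant features shrunk exactly to 0
```

This zeroing-out is what makes lasso useful for automated feature selection; the `alpha` parameter controls how aggressively coefficients are shrunk.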

In [28]:
# define model
model = Lasso(alpha=1.0)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, x4, y4, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
Mean MAE: 7.698 (3.260)
In [29]:
model = Lasso(alpha=1.0)
# fit model
model.fit(x4, y4)
# define new data
row1 = [227, 25]
row2 = [222, 29]
row3 = [229, 39]
# make a prediction
yhat1 = model.predict([row1])
yhat2 = model.predict([row2])
yhat3 = model.predict([row3])

# scale predictions up by 10% as a rough correction, analogous to the earlier offset
yhat1 = yhat1 * 1.1
yhat2 = yhat2 * 1.1
yhat3 = yhat3 * 1.1
# summarize prediction
print('Predicted: %.3f' % yhat1)
print('Predicted: %.3f' % yhat2)
print('Predicted: %.3f' % yhat3)
Predicted: 101.531
Predicted: 103.168
Predicted: 113.887

Conclusion

This model's performance is decent, but even after tuning parameters, the multiple linear regression model is more accurate. In conclusion, I was able to construct a formula that matched the official passer rating closely, with an average error of 3.31 points.

I also built multiple linear regression and lasso regression models to predict an athlete's rating based on physical attributes. The multiple regression model obtained a mean absolute error of roughly 11 before the constant correction, and I can now predict an athlete's rating within that error.

Sources

https://www.espn.com/nfl/stats/player

https://www.espn.com/nfl/story/_/id/34408660/nfl-quarterback-council-2022-ranking-top-10-qbs-arm-strength-accuracy-decision-making-rushing-ability-more
