Linear Regression Analysis

The following dataset comes from the WHO (World Health Organization). The data includes life expectancy in every country from 2000 to 2015. Our goal is to find which variables affect life expectancy the most and to analyze geographical differences. We will perform four types of analysis on this dataset: linear regression, multiple regression, geographical analysis, and time series analysis. To find which variables we need for our regression, we will start with a correlation plot.

Correlation Plot

We must remove all non-numeric variables to create our correlation plot. We also drop Year, which is numeric but is not a candidate predictor here. I am using the corPlot function from the "psych" package.

library(psych)
# Drop the identifier columns (Country, Status) and Year before correlating
cordata <- subset(lifexp, select = -c(Country, Year, Status))
corPlot(cordata, cex = .4)
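The column filtering above can also be done by type rather than by name. The following is a minimal sketch on a toy data frame: the values are made up for illustration, only the column names mirror the dataset, and `toy`, `numeric_cols`, and `cordata_toy` are hypothetical names.

```r
# Toy data frame with made-up values; only the column names mirror the dataset.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Year = c(2000, 2001, 2002, 2003),
  Status = c("Developing", "Developed", "Developing", "Developed"),
  Lifeexpectancy = c(60.1, 78.4, 65.3, 80.2),
  Schooling = c(8.5, 15.2, 10.1, 16.0)
)

# Keep every numeric column, then drop Year explicitly because it is numeric too.
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
cordata_toy <- toy[, setdiff(numeric_cols, "Year"), drop = FALSE]

# Pairwise-complete correlations tolerate missing values in individual columns.
cor_matrix <- cor(cordata_toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

Selecting by type avoids listing every text column by hand, but any numeric column that should not be correlated (such as Year) still has to be excluded explicitly.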

Three correlations stand out: one negative and two positive. Infant mortality rate has a strong negative correlation with life expectancy, while thinness and schooling are both strongly correlated with it. Schooling is the strongest correlation, so we will use it as our initial predictor.

Linear Model

fit_1 <- lm(Lifeexpectancy ~ Schooling, data = lifexp)

summary(fit_1)

## 
## Call:
## lm(formula = Lifeexpectancy ~ Schooling, data = lifexp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8986  -2.8210   0.6186   3.8186  30.4911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44.10889    0.43676  100.99   <2e-16 ***
## Schooling    2.10345    0.03506   59.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.172 on 2766 degrees of freedom
##   (170 observations deleted due to missingness)
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5653 
## F-statistic:  3599 on 1 and 2766 DF,  p-value: < 2.2e-16
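As a quick sanity check on how to read this table, the t values are just each estimate divided by its standard error, and with a single predictor the F-statistic is the square of the slope's t value. The sketch below recomputes both from the rounded numbers printed above, so small differences in the last digit are expected.

```r
# Estimates and standard errors copied from the summary(fit_1) output above.
est <- c(Intercept = 44.10889, Schooling = 2.10345)
se  <- c(Intercept = 0.43676,  Schooling = 0.03506)

# t value = estimate / standard error
t_values <- est / se
print(round(t_values, 2))

# With one predictor, the F-statistic equals the squared t value of the slope.
f_from_t <- unname(t_values["Schooling"])^2
print(round(f_from_t))
```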

Geographical Plot

Predict

Our data is fairly linear, with an R^2 value of .57, meaning schooling alone explains about 57% of the variation in life expectancy. We can use the predict() function to get an estimate of life expectancy for a given schooling level. A schooling level of 10 yields a predicted life expectancy of about 65 years.

predict(fit_1, data.frame(Schooling = 10))

##        1 
## 65.14342
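The same prediction can be reproduced by hand from the fitted line, which shows what predict() is doing under the hood. This sketch uses the rounded coefficients from the summary output, so it matches the result above only to about four decimal places; `b0`, `b1`, and `manual_pred` are illustrative names.

```r
# Coefficients copied from the summary(fit_1) output (rounded).
b0 <- 44.10889   # intercept
b1 <- 2.10345    # slope for Schooling

# Predicted life expectancy = intercept + slope * schooling
schooling <- 10
manual_pred <- b0 + b1 * schooling
print(manual_pred)
```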

Multiple Regression

It would be helpful if we could get our R^2 value higher. We can use multiple regression, which combines two or more input variables in one model. We will use thinness as our second variable and build the model.

fit_2 <- lm(Lifeexpectancy ~ Schooling + thin, data = lifexp)
summary(fit_2)

## 
## Call:
## lm(formula = Lifeexpectancy ~ Schooling + thin, data = lifexp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.5599  -2.8836   0.6504   3.8214  28.5227 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47.04912    0.58103  80.975   <2e-16 ***
## Schooling    1.97141    0.04050  48.682   <2e-16 ***
## thin        -0.28584    0.02866  -9.974   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.06 on 2733 degrees of freedom
##   (202 observations deleted due to missingness)
## Multiple R-squared:  0.5774, Adjusted R-squared:  0.5771 
## F-statistic:  1867 on 2 and 2733 DF,  p-value: < 2.2e-16

The new R^2 is .58, up from .57 (0.5774 versus 0.5655). This is a marginal difference, but it is still an improvement. We can visualize the effect of our inputs using a 3D scatterplot.
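When comparing models with different numbers of predictors, adjusted R^2 is the fairer yardstick because it penalizes extra terms. The sketch below recomputes the adjusted values reported in the two summaries from R^2, the sample size, and the predictor count, using the standard formula; `adj_r2` is a helper defined here for illustration. It also shows that the two models were fit on slightly different samples, since rows with missing thinness values were dropped.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Sample size = residual degrees of freedom + number of estimated coefficients.
n1 <- 2766 + 2   # fit_1: intercept + Schooling
n2 <- 2733 + 3   # fit_2: intercept + Schooling + thin

adj1 <- adj_r2(0.5655, n1, 1)
adj2 <- adj_r2(0.5774, n2, 2)
print(round(c(fit_1 = adj1, fit_2 = adj2), 4))

# Extra rows dropped by fit_2 because of missing values.
print(n1 - n2)
```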

Visualization

Second Prediction

We can now use two inputs in the predict function to make a prediction. A schooling level of 12 and a thinness level of 22 yield a predicted life expectancy of about 64 years.

predict(fit_2, data.frame(Schooling = 12, thin = 22))

##        1 
## 64.41764
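Breaking this prediction into its parts shows how much each input moves the estimate. The sketch below uses the rounded coefficients from the summary(fit_2) output, so the total agrees with predict() only to about four decimal places; the variable names are illustrative.

```r
# Coefficients copied from the summary(fit_2) output (rounded).
coefs <- c(Intercept = 47.04912, Schooling = 1.97141, thin = -0.28584)

# Input vector: 1 for the intercept, then Schooling = 12 and thin = 22.
new_obs <- c(Intercept = 1, Schooling = 12, thin = 22)

# Contribution of each term, then the total prediction.
contributions <- coefs * new_obs
print(round(contributions, 2))
manual_pred2 <- sum(contributions)
print(manual_pred2)
```

Schooling adds years to the estimate while thinness subtracts them, which reflects the negative coefficient on thin in the model.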

Geographical Dataset

We can use ggplot and the world data to visualize life expectancy geographically.

North America and Europe have the highest life expectancy, while Africa has the lowest. Countries that top the list include France, Spain, Australia, and Norway.

Lastly, we will use our regression method on the “Year” variable to conduct time series analysis.

We have data through 2015. The life expectancy in India in 2020 was 70.15. We want to see if we can get an accurate prediction for 2020 using our model.

## 
## Call:
## lm(formula = Lifeexpectancy ~ Year, data = india)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.114706 -0.032904  0.005515  0.029963  0.117647 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.213e+02  5.973e+00  -120.8   <2e-16 ***
## Year         3.919e-01  2.975e-03   131.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05486 on 14 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9991 
## F-statistic: 1.735e+04 on 1 and 14 DF,  p-value: < 2.2e-16

The fitted line tracks the data very closely (R^2 = 0.9992), which is good news for forecasting.

## `geom_smooth()` using formula = 'y ~ x'

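Using the predict function, we can estimate life expectancy in 2020 and 2035. The 2020 prediction of 70.32 is only about .17 above the actual value of 70.15, which shows our 5 year forecast has good accuracy. By 2035, the model predicts a life expectancy of about 76 years. The six outputs below step by the fitted slope of roughly 0.39 years per calendar year, which matches predictions for 2016 through 2020 followed by 2035.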

##     1 
## 68.75

##        1 
## 69.14191

##        1 
## 69.53382

##        1 
## 69.92574

##        1 
## 70.31765

##        1 
## 76.19632
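The forecast arithmetic can be checked directly from these printed values. The sketch below assumes the first output (68.75) is the 2016 prediction, which is inferred from the spacing of the values rather than stated in the code; it recovers the yearly slope, extrapolates to 2035, and measures the 2020 error against the actual value quoted earlier.

```r
# Predictions copied from the output above (2016 label is an assumption).
pred_2016 <- 68.75
pred_2020 <- 70.31765
pred_2035 <- 76.19632
actual_2020 <- 70.15   # actual 2020 value for India quoted in the text

# Yearly gain implied by the predictions; should match the Year coefficient.
slope <- (pred_2020 - pred_2016) / (2020 - 2016)
print(slope)

# A linear forecast just adds the slope once per year.
extrapolated_2035 <- pred_2020 + slope * (2035 - 2020)
print(extrapolated_2035)

# Forecast error for 2020.
error_2020 <- pred_2020 - actual_2020
print(error_2020)
```

Because the model is a straight line, the 2035 figure assumes the 2000 to 2015 trend continues unchanged for another 20 years, so it should be read with more caution than the 2020 estimate.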