Ch. 12 Linear Regression

Sections covered: 12.1, 12.2, 12.5

12.1 The Simple Linear Regression Model

12.2 Estimating Model Parameters

Formulas to know from p. 498:

\(b_1 = \dfrac{\sum(x_i -\overline{x})(y_i - \overline{y})}{\sum(x_i - \overline{x})^2} = \frac{S_{xy}}{S_{xx}}\) and \(b_0 = \overline{y} - b_1 \overline{x}\)

Formula to know from p. 502:

\(SSE = \sum(y_i - \hat{y}_i)^2\)

Formulas to know from p. 504:

\(SST = \sum(y_i - \overline{y})^2\) and \(r^2 = 1 - \frac{SSE}{SST}\)

Formulas to know from p. 505:

\(SSR = \sum(\hat{y}_i - \overline{y})^2\) and \(SSE + SSR = SST\)

Resources

Interactive Visualization: Linear Regression. Try fitting the least squares line to a set of random data and check your answer (and another one).

Video: Regression I: What is regression? | SSE, SSR, SST | R-squared | Errors (ε vs. e) [contributed by Lance J.]

R

Calculating slope and intercept for a sample of (x, y) pairs (p. 498 formulas)

# Example 12.8, p. 503
x <- c(12, 30, 36, 40, 45, 57, 62, 67, 71, 78, 93, 94, 100, 105)
y <- c(3.3, 3.2, 3.4, 3, 2.8, 2.9, 2.7, 2.6, 2.5, 2.6, 2.2, 2, 2.3, 2.1)
lm(y ~ x)  #lm = linear model
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     3.62091     -0.01471
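The p. 498 formulas can also be applied directly, which makes it clearer what `lm()` is computing. The following is a minimal sketch using the same Example 12.8 data; the variable names `Sxy`, `Sxx`, `b1`, and `b0` are chosen here to mirror the formulas and are not part of any R function. The results should reproduce the intercept and slope reported by `lm()` above.

```r
# Example 12.8 data (same as above)
x <- c(12, 30, 36, 40, 45, 57, 62, 67, 71, 78, 93, 94, 100, 105)
y <- c(3.3, 3.2, 3.4, 3, 2.8, 2.9, 2.7, 2.6, 2.5, 2.6, 2.2, 2, 2.3, 2.1)

# Sums of cross-products and squares about the means
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)

# Least squares slope and intercept (p. 498 formulas)
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)

round(c(b0 = b0, b1 = b1), 5)

# TRUE if the hand calculation agrees with lm()
all.equal(unname(coef(lm(y ~ x))), c(b0, b1))
```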

Predicted values:

mod <- lm(y~x)
round(mod$fitted.values, 2)
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 3.44 3.18 3.09 3.03 2.96 2.78 2.71 2.64 2.58 2.47 2.25 2.24 2.15 2.08

Residuals:

round(mod$residuals, 2)
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## -0.14  0.02  0.31 -0.03 -0.16  0.12 -0.01 -0.04 -0.08  0.13 -0.05 -0.24  0.15 
##    14 
##  0.02

SSE, SSR

anova(mod)  # anova = analysis of variance
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## x          1 2.29469 2.29469  104.92 2.762e-07 ***
## Residuals 12 0.26246 0.02187                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first row under “Sum Sq” is the SSR, and the second row under “Sum Sq” is the SSE:

SSE = 0.2624565

SSR = 2.2946864

SST = SSE + SSR = 0.2624565 + 2.2946864 = 2.5571
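Picking rows of the ANOVA table by position makes it easy to swap SSE and SSR by accident. As a hedged alternative, the sketch below selects the rows by name (the regression row is named after the predictor, here `x`, and the error row is named `Residuals`) and checks the identity \(SSE + SSR = SST\) against the p. 504 definition of SST. The object name `tab` is only illustrative.

```r
# Example 12.8 data (same as above)
x <- c(12, 30, 36, 40, 45, 57, 62, 67, 71, 78, 93, 94, 100, 105)
y <- c(3.3, 3.2, 3.4, 3, 2.8, 2.9, 2.7, 2.6, 2.5, 2.6, 2.2, 2, 2.3, 2.1)
mod <- lm(y ~ x)

tab <- anova(mod)
SSR <- tab["x", "Sum Sq"]          # regression sum of squares
SSE <- tab["Residuals", "Sum Sq"]  # error (residual) sum of squares
SST <- sum((y - mean(y))^2)        # total sum of squares, from its definition

c(SSE = SSE, SSR = SSR, SST = SST)

# Both checks should return TRUE
all.equal(SSE + SSR, SST)
all.equal(SSE, deviance(mod))      # deviance() of an lm fit is its residual sum of squares
```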

Coefficient of determination \(r^2\)

# Example 12.4, 12.9
x <- c(132, 129, 120, 113.2, 105, 92, 84, 83.2, 88.4, 59, 80, 81.5, 71, 69.2)
y <- c(46, 48, 51, 52.1, 54, 52, 59, 58.7, 61.6, 64, 61.4, 54.6, 58.8, 58)

mod <- lm(y ~ x)

SSR <- anova(mod)$`Sum Sq`[1]
SST <- anova(mod)$`Sum Sq`[1] + anova(mod)$`Sum Sq`[2]
SSR/SST
## [1] 0.7907602

Or (simply):

cor(x,y)^2
## [1] 0.7907602

(See section 12.5)
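The p. 504 formula defines \(r^2\) as \(1 - SSE/SST\), while the code above uses \(SSR/SST\); the two agree because \(SSE + SSR = SST\). A short sketch that checks this on the same data, and also compares against the `r.squared` component of `summary(mod)`, might look like the following. The names `r2_formula` and `r2_summary` are illustrative only.

```r
# Example 12.4, 12.9 data (same as above)
x <- c(132, 129, 120, 113.2, 105, 92, 84, 83.2, 88.4, 59, 80, 81.5, 71, 69.2)
y <- c(46, 48, 51, 52.1, 54, 52, 59, 58.7, 61.6, 64, 61.4, 54.6, 58.8, 58)
mod <- lm(y ~ x)

SSE <- sum(residuals(mod)^2)   # p. 502 definition
SST <- sum((y - mean(y))^2)    # p. 504 definition

r2_formula <- 1 - SSE / SST          # p. 504 formula
r2_summary <- summary(mod)$r.squared # value reported by summary()

# All three entries should be equal
c(r2_formula, r2_summary, cor(x, y)^2)
```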

12.5 Correlation

Skip: “Inferences About the Population Correlation Coefficient” (p. 530) to end of section.

Resources

Interactive visualization: Correlation Coefficient (add and remove points)

Interactive visualization: Interpreting Correlations [contributed by Dario G.]

R

Sample correlation coefficient \(r\)

# Example 12.15, p. 528
x <- c(2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1)
y <- c(1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63)

cor(x,y)
## [1] 0.3472602
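For comparison, \(r\) can be computed by hand from the standard definition \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\), where \(S_{yy} = \sum(y_i - \overline{y})^2\). This formula is not listed in the notes above, so treat the sketch below as an assumption-labeled illustration; it should agree with `cor(x, y)`.

```r
# Example 12.15 data (same as above)
x <- c(2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1)
y <- c(1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63)

Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)

# Sample correlation coefficient from its definition
r <- Sxy / sqrt(Sxx * Syy)
r

# TRUE if the hand calculation agrees with cor()
all.equal(r, cor(x, y))
```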

Practice Exercises

  1. (Least squares line) Researchers employed a least squares analysis in studying how \(Y=\) porosity (%) is related to \(X=\) unit weight (pcf) in concrete specimens. Consider the following representative data:
x <- c(99.0, 101.1, 102.7, 103.0, 105.4, 107.0, 108.7, 110.8, 112.1, 112.4, 113.6, 113.8, 115.1, 115.4, 120.0)
y <- c(28.8, 27.9, 27.0, 25.2, 22.8, 21.5, 20.9, 19.6, 17.1, 18.9, 16.0, 16.7, 13.0, 13.6, 10.8)

(Textbook 12.17)

  1. Obtain the equation of the estimated regression line.
lm(y~x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##    118.9099      -0.9047

\(\hat{y} = 118.91 - 0.9047x\)

  1. Calculate the residuals corresponding to the first two observations.
mod <- lm(y~x)
round(mod$residuals, 2)
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## -0.54  0.46  1.01 -0.52 -0.75 -0.60  0.33  0.93 -0.39  1.68 -0.13  0.75 -1.78 
##    14    15 
## -0.90  0.46

Or alternatively, use R as a calculator:

pred <- 118.9099 - 0.9047*x
res <- y - pred
res[1]
## [1] -0.5446
res[2]
## [1] 0.45527
  1. Calculate a point estimate of \(\sigma\).
sig2 <- sum((res)^2)/(length(x)-2)
sqrt(sig2)
## [1] 0.938042
  1. What proportion of observed variation in porosity can be attributed to the approximate linear relationship between unit weight and porosity?
cor(x, y)^2
## [1] 0.9738874
  1. Calculate the SSE and SST.
anova(mod) # analysis of variance
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x          1 426.62  426.62  484.84 1.125e-11 ***
## Residuals 13  11.44    0.88                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SSE <- anova(mod)$`Sum Sq`[2]  # second row ("Residuals") is the SSE
SST <- anova(mod)$`Sum Sq`[1] + anova(mod)$`Sum Sq`[2]
c(SSE, SST)
## [1]  11.43883 438.05733

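Or alternatively, use R as a calculator and compute the sums of squares from their definitions. The results match the "Sum Sq" column of the ANOVA table (SSE in the "Residuals" row; SST is the sum of both rows).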

SSE1 <- sum((mod$residuals)^2)
SST1 <- sum((y-mean(y))^2)
SSR1 <- sum((mod$fitted.values - mean(y))^2)
c(SSE1, SST1, SSE1+SSR1)
## [1]  11.43883 438.05733 438.05733
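The point estimate of \(\sigma\) in the exercise above was computed from residuals based on rounded coefficients. As a hedged cross-check, the sketch below uses the unrounded residuals of the fitted model and compares the result with the `sigma` component of `summary(mod)`, which is the residual standard error. The names `s_manual` and `s_summary` are illustrative; both values should agree with each other and with the earlier estimate to about three decimal places.

```r
# Exercise data (Textbook 12.17, same as above)
x <- c(99.0, 101.1, 102.7, 103.0, 105.4, 107.0, 108.7, 110.8, 112.1, 112.4,
       113.6, 113.8, 115.1, 115.4, 120.0)
y <- c(28.8, 27.9, 27.0, 25.2, 22.8, 21.5, 20.9, 19.6, 17.1, 18.9,
       16.0, 16.7, 13.0, 13.6, 10.8)
mod <- lm(y ~ x)

n <- length(x)

# s = sqrt(SSE / (n - 2)), using the model's own residuals
s_manual <- sqrt(sum(residuals(mod)^2) / (n - 2))

# Residual standard error as reported by summary()
s_summary <- summary(mod)$sigma

c(s_manual, s_summary)
```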