Ch. 12 Linear Regression
Sections covered: 12.1, 12.2, 12.5
12.2 Estimating Model Parameters
Formulas to know from p. 498:
\(b_1 = \dfrac{\sum(x_i -\overline{x})(y_i - \overline{y})}{\sum(x_i - \overline{x})^2} = \frac{S_{xy}}{S_{xx}}\) and \(b_0 = \overline{y} - b_1 \overline{x}\)
Formula to know from p. 502:
\(SSE = \sum(y_i - \hat{y}_i)^2\)
Formulas to know from p. 504:
\(SST = \sum(y_i - \overline{y})^2\) and \(r^2 = 1 - \frac{SSE}{SST}\)
Formulas to know from p. 505:
\(SSR = \sum(\hat{y}_i - \overline{y})^2\) and \(SSE + SSR = SST\)
Resources
Interactive Visualization: Linear Regression — try fitting the least squares line to a set of random data and check your answer (a second, similar visualization is also linked).
Video: Regression I: What is regression? | SSE, SSR, SST | R-squared | Errors (ε vs. e) [contributed by Lance J.]
R
Calculating slope and intercept for a sample of (x, y) pairs (p. 498 formulas)
# Example 12.8, p. 503
x <- c(12, 30, 36, 40, 45, 57, 62, 67, 71, 78, 93, 94, 100, 105)
y <- c(3.3, 3.2, 3.4, 3, 2.8, 2.9, 2.7, 2.6, 2.5, 2.6, 2.2, 2, 2.3, 2.1)
mod <- lm(y ~ x) # lm = linear model; save the fit for later use
mod
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x
##     3.62091     -0.01471
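These coefficients can be checked against the p. 498 formulas by hand; a quick sketch, reusing the `x` and `y` vectors above:

```r
# Hand computation of b1 = Sxy/Sxx and b0 = ybar - b1*xbar (p. 498)
x <- c(12, 30, 36, 40, 45, 57, 62, 67, 71, 78, 93, 94, 100, 105)
y <- c(3.3, 3.2, 3.4, 3, 2.8, 2.9, 2.7, 2.6, 2.5, 2.6, 2.2, 2, 2.3, 2.1)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
round(c(b0 = b0, b1 = b1), 5)  # matches the Coefficients above
```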
Predicted values:
round(mod$fitted.values, 2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 3.44 3.18 3.09 3.03 2.96 2.78 2.71 2.64 2.58 2.47 2.25 2.24 2.15 2.08
Residuals:
round(mod$residuals, 2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## -0.14 0.02 0.31 -0.03 -0.16 0.12 -0.01 -0.04 -0.08 0.13 -0.05 -0.24 0.15
## 14
## 0.02
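A quick sanity check on the fit (assuming `mod` holds the fitted model from above): whenever the model includes an intercept, the least squares residuals sum to zero up to floating-point error.

```r
# Residuals from a least squares fit with an intercept sum to (numerically) zero
sum(mod$residuals)  # essentially 0, on the order of 1e-16
```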
SSE, SSR
anova(mod) # anova = analysis of variance
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 2.29469 2.29469 104.92 2.762e-07 ***
## Residuals 12 0.26246 0.02187
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first row under “Sum Sq” is the SSR, and the second row under “Sum Sq” is the SSE:
SSE = 0.2624565
SSR = 2.2946864
SST = SSE + SSR = 0.2624565 + 2.2946864 = 2.5571429
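With these sums of squares in hand, \(r^2\) can be computed either way from the p. 504–505 formulas; both forms agree:

```r
# r^2 = 1 - SSE/SST = SSR/SST, using the values above
SSE <- 0.2624565
SSR <- 2.2946864
SST <- SSE + SSR
c(1 - SSE/SST, SSR/SST)  # both are about 0.897
```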
Coefficient of determination \(r^2\)
# Example 12.4, 12.9
x <- c(132, 129, 120, 113.2, 105, 92, 84, 83.2, 88.4, 59, 80, 81.5, 71, 69.2)
y <- c(46, 48, 51, 52.1, 54, 52, 59, 58.7, 61.6, 64, 61.4, 54.6, 58.8, 58)
mod <- lm(y ~ x)
sq <- anova(mod)$`Sum Sq`
SSR <- sq[1]
SST <- sq[1] + sq[2]
SSR/SST
## [1] 0.7907602
Or, more simply:
cor(x,y)^2
## [1] 0.7907602
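`summary()` also reports \(r^2\) directly (assuming `mod` is the fit above); for simple linear regression it equals `cor(x, y)^2`:

```r
# r^2 straight from the model summary
summary(mod)$r.squared  # same value as cor(x, y)^2
```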
(See section 12.5)
12.5 Correlation
Skip: “Inferences About the Population Correlation Coefficient” (p. 530) to end of section.
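The sample correlation coefficient defined in this section, \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\), is what R's built-in `cor()` computes; a hand-computation sketch:

```r
# r = Sxy / sqrt(Sxx * Syy); equivalent to cor(x, y)
r_hand <- function(x, y) {
  Sxy <- sum((x - mean(x)) * (y - mean(y)))
  Sxy / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}
```

For any paired sample, `r_hand(x, y)` and `cor(x, y)` return the same value.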
Resources
Interactive visualization: Correlation Coefficient (add and remove points)
Interactive visualization: Interpreting Correlations [contributed by Dario G.]
Practice Exercises
- (Least squares line) Researchers employed a least squares analysis in studying how \(Y=\) porosity (%) is related to \(X=\) unit weight (pcf) in concrete specimens. Consider the following representative data:
x <- c(99.0, 101.1, 102.7, 103.0, 105.4, 107.0, 108.7, 110.8, 112.1, 112.4, 113.6, 113.8, 115.1, 115.4, 120.0)
y <- c(28.8, 27.9, 27.0, 25.2, 22.8, 21.5, 20.9, 19.6, 17.1, 18.9, 16.0, 16.7, 13.0, 13.6, 10.8)
(Textbook 12.17)
- Obtain the equation of the estimated regression line.
mod <- lm(y ~ x)
mod
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x
##    118.9099      -0.9047
\(\hat{y} = 118.9099 - 0.9047x\)
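As an illustration (the value \(x = 110\) pcf is hypothetical, not part of the exercise), the fitted line gives a point prediction of porosity:

```r
# Point prediction at a hypothetical unit weight of 110 pcf
predict(mod, newdata = data.frame(x = 110))  # about 19.4 (%)
```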
- Calculate the residuals corresponding to the first two observations.
round(mod$residuals, 2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## -0.54 0.46 1.01 -0.52 -0.75 -0.60 0.33 0.93 -0.39 1.68 -0.13 0.75 -1.78
## 14 15
## -0.90 0.46
Or alternatively, use R as a calculator:
pred <- 118.9099 - 0.9047*x
res <- y - pred
res[1]
## [1] -0.5446
res[2]
## [1] 0.45527
- Calculate a point estimate of \(\sigma\).
summary(mod)$sigma
## [1] 0.938042
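This point estimate is the residual standard error \(s = \sqrt{SSE/(n-2)}\), which R reports as `summary(mod)$sigma`; computed from the definition, assuming `mod` and the data from above:

```r
# Point estimate of sigma: s = sqrt(SSE / (n - 2))
n <- length(x)
SSE <- sum(mod$residuals^2)
sqrt(SSE / (n - 2))  # about 0.938
```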
- What proportion of observed variation in porosity can be attributed to the approximate linear relationship between unit weight and porosity?
cor(x, y)^2
## [1] 0.9738874
- Calculate the SSE and SST.
anova(mod) # analysis of variance
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 426.62 426.62 484.84 1.125e-11 ***
## Residuals 13 11.44 0.88
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
SSE is the Residuals row under “Sum Sq” (11.44), and SST = SSR + SSE:
SSR <- anova(mod)$`Sum Sq`[1]
SST <- sum(anova(mod)$`Sum Sq`)
c(SSR, SST)
## [1] 426.6185 438.0573
Or alternatively, use R as a calculator. Notice that the same results are produced.
SSE1 <- sum((mod$residuals)^2)
SST1 <- sum((y-mean(y))^2)
SSR1 <- sum((mod$fitted.values - mean(y))^2)
c(SSE1, SST1, SSE1+SSR1)
## [1] 11.43883 438.05733 438.05733
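These quantities tie back to the earlier part of the exercise: \(SSR/SST\) reproduces the \(r^2\) computed with `cor(x, y)^2` above.

```r
# r^2 = SSR/SST, using the sums of squares just computed
SSR1 / SST1  # about 0.9739, matching cor(x, y)^2
```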