Chapter 3 Multiple regression
Learning Objectives
- Understand how to specify regression models using matrix notation.
- Become familiar with creating dummy variables to code for categorical predictors
- Interpret the results of regression analyses that include both categorical and quantitative variables
- Understand approaches for visualizing the results of multiple regression models
3.1 Introduction to multiple regression
So far, we have considered regression models containing a single predictor, often referred to as simple linear regression models. In this section, we will consider models that contain more than one predictor. We can again write our model of the data generating process (DGP) in two ways:
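$$Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \ldots + \beta_p X_{i,p} + \epsilon_i \quad (i = 1, \ldots, n)$$
$$\epsilon_i \sim N(0, \sigma^2)$$
Or:
$$Y_i \sim N(\mu_i, \sigma^2)$$
$$\mu_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \ldots + \beta_p X_{i,p} \quad (i = 1, \ldots, n)$$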
The above expressions define linear regression models in terms of individual observations. It will also be advantageous, at times, to be able to define the model for all observations simultaneously using matrices. Doing so will provide insights into how models are represented in statistical software, including models that allow for non-linear relationships between predictor and response variables (Section 4). In addition, matrix notation will provide us with a precise language for describing uncertainty associated with the predictions of linear models. Use of matrix notation will be important for:
- understanding methods for calculating uncertainty in $\hat{\mu}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i,1} + \hat{\beta}_2 X_{i,2} + \ldots + \hat{\beta}_p X_{i,p}$ (we glossed over this in Section 1.8, where we showed code for calculating confidence and prediction intervals in regression but did not provide equations or derive expressions for these interval estimators).
- specifying models for observations that are not independent (e.g., Section 5 and Section 18).
3.2 Matrix notation for regression
Let’s start by writing our linear regression model as:
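$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \quad (i = 1, \ldots, n)$$
$$\epsilon_i \sim N(0, \sigma^2)$$
This implies:
$$\begin{aligned}
Y_1 &= \beta_0 + \beta_1 X_1 + \epsilon_1 \\
Y_2 &= \beta_0 + \beta_1 X_2 + \epsilon_2 \\
&\;\;\vdots \\
Y_n &= \beta_0 + \beta_1 X_n + \epsilon_n
\end{aligned}$$
Alternatively, we can write this set of equations very compactly using matrices:
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
or
$$Y_{[n \times 1]} = X_{[n \times 2]} \beta_{[2 \times 1]} + \epsilon_{[n \times 1]}$$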
We can multiply two matrices, A and B, together if the number of columns of the first matrix is equal to the number of rows in the second matrix. Let the dimensions of A be n×m and the dimension of B be m×p. Matrix multiplication results in a new matrix with the number of rows equal to the number of rows in A and number of columns equal to the number of columns in B, i.e., the matrix will be of dimension n×p. The (i,j) entry of this matrix is formed by taking the dot product of row i and column j, where the dot product is the sum of element-wise products (Figure 3.1).
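The multiplication rule can be checked directly in R with the `%*%` operator. The sketch below uses two small made-up matrices (`A` and `B` are illustrative, not from the data sets in this chapter) and confirms that a single entry of the product is a dot product of a row and a column.

```r
# Two small matrices: A is 2 x 3 and B is 3 x 2, so A %*% B is defined
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)   # filled column-wise
B <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

AB <- A %*% B     # matrix product; its dimension is 2 x 2
dim(AB)

# The (1, 2) entry is the dot product of row 1 of A and column 2 of B
entry_12 <- sum(A[1, ] * B[, 2])
all.equal(AB[1, 2], entry_12)
```

Trying `B %*% B` instead would fail with a "non-conformable arguments" error, because the number of columns of the first matrix (2) does not match the number of rows of the second (3).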
Using matrix notation, we can generalize our model to include any number of predictors (shown below for p−1 predictors):
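$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1,p-1} \\ 1 & X_{21} & X_{22} & \cdots & X_{2,p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1} \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
$$Y_{[n \times 1]} = X_{[n \times p]} \beta_{[p \times 1]} + \epsilon_{[n \times 1]}$$
The matrix, X, is referred to as the design matrix and encodes all of the information present in our predictors. In this section, we will learn how categorical variables and interactions are represented in the design matrix. In Section 4, we will learn how various methods for modeling non-linearities in the relationship between our predictors and our dependent variable can be represented in a design matrix.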
It is also useful to know how we can write our alternative formulation of the linear regression model using matrix notation. Specifically, we can write:
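$$Y \sim N(X\beta, \Sigma)$$
Here, Σ is an n×n variance covariance matrix. Its diagonal elements capture the variances of the observations and the off-diagonal elements capture covariances. Under the assumptions of linear regression, our observations are independent (implying the covariances are 0) and have constant variance, $\sigma^2$. Thus:
$$\Sigma_{[n \times n]} = \begin{bmatrix} \sigma^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 I_{[n \times n]}, \quad \text{where } I_{[n \times n]} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}.$$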
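To make the matrix formulation concrete, the following sketch simulates a handful of observations from Y ~ N(Xβ, σ²I). All of the numbers (`x`, `beta`, `sigma`, and the seed) are made up for illustration and do not come from any data set in this chapter.

```r
set.seed(1)                             # arbitrary seed, for reproducibility
n     <- 5
x     <- c(0.5, 1.2, 2.0, 2.7, 3.1)     # made-up predictor values
X     <- cbind(1, x)                    # n x 2 design matrix: column of 1s, then x
beta  <- c(2, 0.5)                      # hypothetical beta0 and beta1
sigma <- 1                              # hypothetical residual standard deviation

Sigma <- sigma^2 * diag(n)              # sigma^2 * I: constant variance, zero covariances
mu    <- X %*% beta                     # n x 1 vector of means, X beta
Y     <- mu + rnorm(n, mean = 0, sd = sigma)   # add independent Normal errors
```

Because Σ is diagonal here, drawing the errors independently with `rnorm` is equivalent to drawing a single vector from the multivariate Normal distribution with covariance `Sigma`.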
3.3 Parameter estimation, sums-of-squares, and R2
We can use lm to find estimates of the intercept (β0) and slope parameters (β1, β2, …) that minimize the sum of squared differences between our observations and the model predictions:
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left(Y_i - [\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \ldots]\right)^2$$
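The least-squares criterion above has a well-known closed-form solution in matrix notation, the normal equations (X'X)β̂ = X'Y. This result is standard but is not derived in the text, so treat the sketch below as an illustration only. It uses simulated data (all values are made up) and checks that solving the normal equations gives the same estimates as `lm`.

```r
set.seed(42)                          # arbitrary seed
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)   # simulated response

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)              # design matrix used by lm

# Solve (X'X) b = X'y for b
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with the estimates returned by lm
cbind(lm = coef(fit), normal_eq = drop(beta_hat))

# Sum of squared residuals at the least-squares solution
sse <- sum((y - X %*% beta_hat)^2)
```

In practice `lm` uses a QR decomposition rather than forming X'X, which is numerically more stable; the two approaches agree here because the simulated predictors are well conditioned.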
Further, we can again decompose the total variation, quantified by the total sums of squares (SST), into variation explained by the model (SSR) and residual sums of squares (SSE):
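- SST (df = n − 1): $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
- SSE (df = n − p): $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
- SSR (df = p − 1): $SSR = SST - SSE = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$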
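The coefficient of determination ($R^2$) is calculated the same way as with simple linear regression ($R^2 = SSR/SST$). However, adding predictors never decreases $R^2$ (and almost always increases it), even when the new predictors are unrelated to the response. Thus, we will eventually also want to consider an adjusted $R^2$, quantified as:
$$R^2_{adj} = 1 - \frac{SSE/(n-p)}{SST/(n-1)} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SST}$$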
The adjusted-R2 penalizes for additional predictors and so will not always increase as you add predictors. Thus, it should provide a more honest measure of variance explained, particularly when the model contains many predictors. We will consider this measure in more detail when we compare the fits of multiple competing models (Section 8).
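As a check on these definitions, the sketch below computes the sums of squares, R², and the adjusted R² by hand and compares them with the values reported by `summary`. It uses the usual definition of the adjusted R², 1 − [SSE/(n − p)]/[SST/(n − 1)], with p counting all regression parameters including the intercept. The data are simulated and the variable names are illustrative.

```r
set.seed(7)                           # arbitrary seed
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)                        # unrelated to y by construction
y  <- 3 + 1.5 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
p   <- length(coef(fit))              # number of regression parameters (incl. intercept)

SST <- sum((y - mean(y))^2)           # total sum of squares
SSE <- sum(resid(fit)^2)              # residual sum of squares
SSR <- SST - SSE                      # sum of squares explained by the model

R2     <- SSR / SST
R2_adj <- 1 - (SSE / (n - p)) / (SST / (n - 1))

c(R2 = R2, R2_adj = R2_adj)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # should match
```

The same arithmetic reproduces the adjusted values reported for the RIKZ models in Section 3.4: with n = 45 and p = 2, an R² of 0.325 gives 1 − (44/43)(0.675) ≈ 0.309.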
We have the same assumptions (linearity, constant variance, Normality) as we do with simple linear regression and can use the same diagnostic plots to evaluate whether these assumptions are reasonably met.
However, it’s also important to diagnose the degree to which explanatory variables are correlated with each other (a topic we will address in more detail when we cover multicollinearity in Section 6).
3.4 Parameter interpretation: Multiple regression with RIKZ data
Recall the RIKZ data from Section 2. We will continue to explore this data set, assuming (naively) that the observations are independent. Earlier, we fit a model relating species Richness to the height of the sample site relative to sea level, NAP. From our regression results, we can write our estimate of the best-fit line as:
$$\widehat{\text{Richness}}_i = 6.686 - 2.867\,\text{NAP}_i$$
What if we hypothesized that humus (amount of organic material) also influences Richness (in addition to NAP)?
The multiple linear regression model formula would look like:
$$\text{Richness}_i = \beta_0 + \beta_1 \text{NAP}_i + \beta_2 \text{humus}_i + \epsilon_i$$
Let’s fit this model in R and compare it to the model containing only NAP.
library(modelsummary)  # provides modelsummary()

# Fit the one- and two-predictor models
lmfit1 <- lm(Richness ~ NAP, data = RIKZdat)
lmfit2 <- lm(Richness ~ NAP + humus, data = RIKZdat)

# Report estimates with SEs in parentheses; keep only the R2 fit statistics
modelsummary(list(lmfit1, lmfit2),
             gof_omit = "^(?!R2)",
             estimate = "{estimate} ({std.error})",
             statistic = NULL,
             title = "Estimates of regression parameters (SE) in one- and two-variable models fit to the RIKZ data.")
 | Model 1 | Model 2 |
---|---|---|
(Intercept) | 6.686 (0.658) | 5.459 (0.830) |
NAP | -2.867 (0.631) | -2.512 (0.623) |
humus | | 21.942 (9.710) |
R2 | 0.325 | 0.398 |
R2 Adj. | 0.309 | 0.369 |
We see that the slope for NAP changed slightly (from -2.9 to -2.5) and the adjusted R2 went from 0.31 to 0.37. Instead of a best-fit line through data in two dimensions, we now have a best-fit plane through data in three dimensions (Figure 3.2).
Our interpretation of regression parameters is similar to that in simple linear regression, except now we have to consider a change in one variable while holding other variables in the model constant:
- β1 describes the change in Richness for every 1 unit increase in NAP while holding humus constant.
- β2 describes the change in Richness for every 1 unit increase in humus while holding NAP constant.
- β0 describes the level of Richness when humus and NAP are both equal to 0.
Although it is easy to fit multiple regression models with more than two predictors, we will no longer be able to visualize the fitted model in higher dimensions. Before we consider more complex models, however, we will first explore how to incorporate categorical predictors into our models.
3.5 Categorical predictors
To understand how categorical predictors are coded in regression models, we will begin by making a connection between the standard t-test and a linear regression model with a categorical predictor with two levels or categories.
3.5.1 T-test as a regression
Here, we will consider mandible lengths (in mm) of 10 male and 10 female golden jackal (Canis aureus) specimens from the British Museum (Manly, 1991).
males<-c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females<-c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)
We might ask: Do males and females have, on average, different mandible lengths? Let’s consider a formal hypothesis test and confidence interval for the difference in population means:
$$H_0: \mu_m = \mu_f \quad \text{versus} \quad H_a: \mu_m \neq \mu_f$$
where μm and μf represent population means for male and female jackals, respectively.
If we assume that mandible lengths are Normally distributed in the population[11], and that the variances of male and female jaw lengths are about the same[12], then we can use the following code to conduct a t-test for a difference in means:
Two Sample t-test
data: males and females
t = 3.4843, df = 18, p-value = 0.002647
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.905773 7.694227
sample estimates:
mean of x mean of y
113.4 108.6
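The call that produced the output above is not shown. A minimal sketch that should reproduce it, assuming the pooled-variance form of the test (implied by the "Two Sample t-test" title and df = 18), is given below; the object name `tt` is illustrative, and the data vectors are repeated so the block runs on its own.

```r
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)

# Two-sample t-test assuming equal variances in the two groups
tt <- t.test(males, females, var.equal = TRUE)
tt

# Individual pieces of the result can be extracted by name
tt$statistic   # t statistic
tt$conf.int    # 95% confidence interval for the difference in means
```

Leaving out `var.equal = TRUE` would give Welch's test, which does not pool the variances and reports non-integer degrees of freedom.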
We can also conduct this same test using a regression model with sex as the only predictor. First, we will have to create a data.frame with mandible lengths (quantitative) and sex (categorical).
jaws sex
1 120 M
2 107 M
3 110 M
4 116 M
5 114 M
6 111 M
We can then fit a linear regression model to these data and inspect the output:
Call:
lm(formula = jaws ~ sex, data = jawdat)
Residuals:
Min 1Q Median 3Q Max
-6.4 -1.8 0.1 2.4 6.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 108.6000 0.9741 111.486 < 2e-16 ***
sexM 4.8000 1.3776 3.484 0.00265 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.08 on 18 degrees of freedom
Multiple R-squared: 0.4028, Adjusted R-squared: 0.3696
F-statistic: 12.14 on 1 and 18 DF, p-value: 0.002647
We see that the t value and Pr(>|t|) values for sexM are identical to the t-statistic and p-value from our two-sample t-test. Also, the estimated intercept is identical to the sample mean for females. And, if we look at a confidence interval for the regression parameters, we see that the interval for the sexM coefficient is the same as the confidence interval for the difference in means that is output by the t.test function (and, in fact, the coefficient for sexM is equal to the difference in sample means).
2.5 % 97.5 %
(Intercept) 106.553472 110.646528
sexM 1.905773 7.694227
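The code behind the data frame, model fit, and confidence intervals above is not shown. The sketch below is one way to reproduce them; the column names `jaws` and `sex` and the data frame name `jawdat` follow the printed output, while `lm.jaws` is a name chosen here for illustration.

```r
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)

# Stack the two vectors and record which sex each measurement came from
jawdat <- data.frame(jaws = c(males, females),
                     sex  = rep(c("M", "F"), each = 10))
head(jawdat)

# Regression with sex as the only predictor; "F" sorts first alphabetically,
# so females become the reference level and the coefficient is labelled sexM
lm.jaws <- lm(jaws ~ sex, data = jawdat)
summary(lm.jaws)
confint(lm.jaws)
```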
To understand these results, we need to know how R accounts for sex in the model we just fit.
3.5.2 Dummy variables: Reference (or effects) coding
We can use the model.matrix function to see the design matrix that R uses to fit the regression model. Here, we print the 2nd, 3rd, 16th, and 17th rows of this matrix (so that we see observations from both sexes):
## (Intercept) sexM
## 2 1 1
## 3 1 1
## 16 1 0
## 17 1 0
Let’s also look at the data from these cases:
## jaws sex
## 2 107 M
## 3 110 M
## 16 105 F
## 17 107 F
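We see that R created a new variable, sexM, to indicate which cases are males (sexM = 1) and which are females (sexM = 0). Knowing this allows us to write our model in matrix notation (shown here for these 4 observations):
$$\begin{bmatrix} 107 \\ 110 \\ 105 \\ 107 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 0 \\ 1 & 0 \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_2 \\ \epsilon_3 \\ \epsilon_{16} \\ \epsilon_{17} \end{bmatrix}$$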
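The rows of the design matrix and the data shown above can be extracted as in the following sketch, which rebuilds the jackal data frame so that it runs on its own (the object name `X` is illustrative).

```r
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)
jawdat  <- data.frame(jaws = c(males, females),
                      sex  = rep(c("M", "F"), each = 10))

# Design matrix for a model with sex as the only predictor
X <- model.matrix(~ sex, data = jawdat)

rows <- c(2, 3, 16, 17)   # two males followed by two females
X[rows, ]                 # intercept column and the sexM dummy variable
jawdat[rows, ]            # the corresponding observations
```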
Think-Pair-Share: How can we estimate μm, the mean jaw length for males, from the fitted regression model?
To answer this question, let’s write our model as:
$$Y_i = \beta_0 + \beta_1 I(\text{sex} = \text{male})_i + \epsilon_i$$
$$I(\text{sex} = \text{male})_i = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}$$
Here, I have used the notation I(condition) to indicate that we want to create a variable that is equal to 1 when condition is TRUE and 0 otherwise. In general, we will refer to this type of variable as an indicator variable or, more commonly, a dummy variable.
Using this model description, we can estimate the mean jaw length of males by plugging in a 1 for $I(\text{sex} = \text{male})_i$ and noting that the errors have mean 0:
$$E[Y_i \mid \text{sex} = \text{male}] = \beta_0 + \beta_1(1)$$
I.e., we can estimate the mean jaw length of males by summing the two estimated regression coefficients:
$$\hat{Y}_i = 108.6 + 4.8 = 113.4,$$
which is the mean for males that is reported by the t.test function.
In summary, the default method used to account for categorical variables in R is to use reference or effects coding such that:
- the intercept represents the mean for a reference category (when all other predictors in the model are set to 0)
- dummy or indicator variables represent differences in means between other categories and the reference category.
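The two points above can be verified numerically. The sketch below refits the jackal model and recovers both group means, first by adding coefficients and then with `predict`; the object names `fit`, `mean_f`, and `mean_m` are illustrative.

```r
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)
jawdat  <- data.frame(jaws = c(males, females),
                      sex  = rep(c("M", "F"), each = 10))

fit <- lm(jaws ~ sex, data = jawdat)
b   <- coef(fit)

mean_f <- unname(b["(Intercept)"])               # reference group (females)
mean_m <- unname(b["(Intercept)"] + b["sexM"])   # intercept + difference

# predict() does the same bookkeeping for us
pred <- predict(fit, newdata = data.frame(sex = c("F", "M")))
pred
```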
3.5.3 Dummy variables: Cell means coding
It turns out that there are other ways to code the same information contained in the variable sex, and these different parameterizations lead to the exact same model, but expressed with a different set of coefficients. In this section, we will consider what is often called cell-means or means coding:
$$Y_i = X_{i,m}\beta_m + X_{i,f}\beta_f + \epsilon_i$$
$$X_{i,m} = \begin{cases} 1 & \text{if the } i\text{th observation is from a male} \\ 0 & \text{if from a female} \end{cases}$$
$$X_{i,f} = \begin{cases} 1 & \text{if the } i\text{th observation is from a female} \\ 0 & \text{if from a male} \end{cases}$$
Because “male” and “female” are mutually exclusive categories so far as these data are concerned[13], each individual i will be either male or female, so exactly one of $X_{i,f}$ and $X_{i,m}$ will be 1 and the other will be 0. As a result, βm represents the mean Y for males and βf the mean Y for females.
In R, we can fit this model with means coding using:
Call:
lm(formula = jaws ~ sex - 1, data = jawdat)
Residuals:
Min 1Q Median 3Q Max
-6.4 -1.8 0.1 2.4 6.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
sexF 108.6000 0.9741 111.5 <2e-16 ***
sexM 113.4000 0.9741 116.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.08 on 18 degrees of freedom
Multiple R-squared: 0.9993, Adjusted R-squared: 0.9992
F-statistic: 1.299e+04 on 2 and 18 DF, p-value: < 2.2e-16
The -1 in the formula (jaws ~ sex - 1) tells R to remove the column of 1s from our design matrix (which usually represents the overall intercept), which permits estimation of the second group mean in its place. With means coding, our model is parameterized in terms of the two group means rather than the mean for females and the difference in means between males and females. Note: we cannot have a model with all of these parameters (μf, μm, and μm−μf) since any one of them is completely determined by the other two.
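The claim that the two codings describe the same model can be checked by fitting both and comparing fitted values. The sketch below does this for the jackal data; `fit_ref` and `fit_means` are names chosen here for illustration.

```r
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)
jawdat  <- data.frame(jaws = c(males, females),
                      sex  = rep(c("M", "F"), each = 10))

fit_ref   <- lm(jaws ~ sex, data = jawdat)       # reference coding
fit_means <- lm(jaws ~ sex - 1, data = jawdat)   # means coding

coef(fit_means)                                  # the two group means
all.equal(fitted(fit_ref), fitted(fit_means))    # identical fitted values

# The design matrix now has one dummy column per group and no intercept
head(model.matrix(fit_means), 3)
```

The fitted values and residual standard error are the same in both fits. The much larger R-squared in the means-coded summary is not a sign of a better model: for formulas without an intercept, R computes the total sum of squares about zero rather than about the sample mean.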
3.5.4 Comparing assumptions: Linear model and t-test
What are the assumptions of our model for the jaw lengths? Well, they are the same ones that we have for fitting linear regression models with continuous predictors:
- constant variance of the errors (i.e., the two groups are assumed to have equal variance, σ2)
- the residuals, and by extension, the data within each group, are Normally distributed
These are the same assumptions of the two-sample t-test. Let’s see if they are reasonable:
We have a small data set, so it is difficult to say anything definitively, but it appears that the variance may be larger for males. Later, we will see how we can relax these assumptions using the gls function in the nlme package (see Section 5) and using JAGS (Section 11).
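A quick numerical companion to the visual check is to compare the sample variances in the two groups. The sketch below also runs Welch's unequal-variance t-test as a simple sensitivity check; this is a swapped-in technique for illustration, not the gls or JAGS approach referred to above, and the object names are illustrative.

```r
males   <- c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112)
females <- c(110, 111, 107, 108, 110, 105, 107, 106, 111, 111)
jawdat  <- data.frame(jaws = c(males, females),
                      sex  = rep(c("M", "F"), each = 10))

# Sample variance of jaw length within each sex
group_var <- tapply(jawdat$jaws, jawdat$sex, var)
group_var

# Welch's t-test does not assume equal variances
welch <- t.test(jaws ~ sex, data = jawdat, var.equal = FALSE)
welch
```

With equal group sizes, the Welch and pooled-variance tests share the same t statistic (up to sign, since the formula interface compares F to M) and differ only in their degrees of freedom.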
3.6 Categorical variables >2 levels
To account for differences among k groups of a categorical variable, we can again use either reference/effects coding or means coding.
3.6.1 Effects coding
With effects coding, we include an overall intercept and k−1 dummy variables. Each dummy variable is used to identify group membership for one of the k−1 groups other than the reference; membership in the kth group is indicated by having 0’s for all of the other k−1 dummy variables. The intercept will again represent a reference group, and the coefficients for the dummy variables will represent differences between each of the other k−1 categories and the reference group. R will create these dummy variables for us, but understanding how the effects of categorical variables are encoded in regression models will facilitate parameter interpretation and allow us to fit customized models when desired (e.g., see Section 3.7.3). In addition, we will need to create our own dummy variables when fitting models in a Bayesian framework using JAGS (Section 12.4).
Let’s return to the RIKZ data and our model of species Richness. What if we suspected that some species were only present in some weeks such that Richness varied by week in addition to NAP? Let’s see what happens if we add week to the model:
Call:
lm(formula = Richness ~ NAP + week, data = RIKZdat)
Residuals:
Min 1Q Median 3Q Max
-5.2493 -2.4558 -0.7746 1.4261 15.7933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.0635 1.6291 5.563 1.68e-06 ***
NAP -2.6644 0.6327 -4.211 0.000131 ***
week -1.0492 0.6599 -1.590 0.119312
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.088 on 42 degrees of freedom
Multiple R-squared: 0.3629, Adjusted R-squared: 0.3326
F-statistic: 11.96 on 2 and 42 DF, p-value: 7.734e-05
Since week is coded as an integer (equal to 1, 2, 3 or 4), we see that R assumes it is a continuous predictor and estimates a single coefficient representing the overall linear trend (slope) in Richness over time. Specifically, the model suggests we will lose roughly 1 species each week ($\hat{\beta} = -1.05$). Although this model can account for a linear increase or decrease in species Richness over the duration of the sampling effort, it will likely be better to model week as a categorical variable to allow for greater flexibility in how species richness changes over time. Doing so will require 3 dummy variables because there are k=4 levels in our week categorical variable.
In R, we can use as.factor to convert week to a categorical variable and then refit the model:
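library(dplyr)  # for %>% and mutate()

# Convert week to a factor so that R creates dummy variables for it
RIKZdat <- RIKZdat %>% mutate(week.cat = as.factor(week))
lm.ancova <- lm(Richness ~ NAP + week.cat, data = RIKZdat)
summary(lm.ancova)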
##
## Call:
## lm(formula = Richness ~ NAP + week.cat, data = RIKZdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0788 -1.4014 -0.3633 0.6500 12.0845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3677 0.9459 12.017 7.48e-15 ***
## NAP -2.2708 0.4678 -4.854 1.88e-05 ***
## week.cat2 -7.6251 1.2491 -6.105 3.37e-07 ***
## week.cat3 -6.1780 1.2453 -4.961 1.34e-05 ***
## week.cat4 -2.5943 1.6694 -1.554 0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.987 on 40 degrees of freedom
## Multiple R-squared: 0.6759, Adjusted R-squared: 0.6435
## F-statistic: 20.86 on 4 and 40 DF, p-value: 2.369e-09
Aside: historically, statistical inference from a model with a single categorical variable and a continuous variable (and no interaction between the two) was referred to as an analysis of covariance or ANCOVA. We could also have considered a model with only week.cat (and not NAP), which would have led to an analysis of variance or ANOVA. This distinction (between ANOVA, ANCOVA, and other regression models) is more historical than practical, however, as each of these approaches shares the same underlying statistical machinery.[14]
The model with NAP and week.cat can be written as:
$$\text{Richness}_i = \beta_0 + \beta_1 \text{NAP}_i + \beta_2 I(\text{week}=2)_i + \beta_3 I(\text{week}=3)_i + \beta_4 I(\text{week}=4)_i + \epsilon_i,$$
where $I(\text{week}=2)_i$, $I(\text{week}=3)_i$, and $I(\text{week}=4)_i$ are indicator variables for weeks 2, 3, and 4, respectively. For example,
$$I(\text{week}=2)_i = \begin{cases} 1 & \text{if the } i\text{th observation was from week 2} \\ 0 & \text{otherwise} \end{cases}$$
Let’s again inspect the design matrix that R creates when fitting the model. Here, we will look at one observation from each week by selecting the 10th, 20th, 30th and 25th observations:
## Richness week NAP
## 10 17 1 -1.334
## 20 4 2 -0.811
## 30 4 3 0.766
## 25 6 4 0.054
## (Intercept) NAP week.cat2 week.cat3 week.cat4
## 10 1 -1.334 0 0 0
## 20 1 -0.811 1 0 0
## 30 1 0.766 0 1 0
## 25 1 0.054 0 0 1
We see that R created 3 dummy variables representing weeks 2, 3, and 4 and that we can identify observations from week 1 as having 0’s for all 3 dummy variables. In matrix form, we can write our model for these 4 observations as:
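$$\begin{bmatrix} 17 \\ 4 \\ 4 \\ 6 \end{bmatrix} = \begin{bmatrix} 1 & -1.334 & 0 & 0 & 0 \\ 1 & -0.811 & 1 & 0 & 0 \\ 1 & 0.766 & 0 & 1 & 0 \\ 1 & 0.054 & 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix} + \begin{bmatrix} \epsilon_{10} \\ \epsilon_{20} \\ \epsilon_{30} \\ \epsilon_{25} \end{bmatrix}$$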
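The difference between treating week as a number and as a factor shows up directly in the design matrix. Because the RIKZ data are not loaded here, the sketch below uses a small made-up data frame (`toy`, with simulated NAP values) that mimics the structure of the real data.

```r
set.seed(3)                                   # arbitrary seed
toy <- data.frame(week = rep(1:4, each = 3),
                  NAP  = round(rnorm(12), 2)) # simulated NAP values
toy$week.cat <- as.factor(toy$week)

# week as a number: a single column, hence a single slope
X_num <- model.matrix(~ NAP + week, data = toy)

# week as a factor: k - 1 = 3 dummy variables, week 1 is the reference level
X_cat <- model.matrix(~ NAP + week.cat, data = toy)

colnames(X_num)
colnames(X_cat)
X_cat[c(1, 4, 7, 10), ]   # first row from each week
```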
Because the effects of NAP and week.cat are additive (we did not include an interaction between these two variables), the effect of NAP on species Richness is assumed to be independent of what week it is (i.e., there is a common slope for all 4 weeks). The intercept, however, is different for each week. We can see this by writing down a separate equation for the data collected from each week, formed by plugging in appropriate values for each of our indicator variables and then collecting like terms:
- Week 1: $\text{Richness}_i = \beta_0 + \beta_1 \text{NAP}_i + \epsilon_i$
- Week 2: $\text{Richness}_i = [\beta_0 + \beta_2(1)] + \beta_1 \text{NAP}_i + \epsilon_i$
- Week 3: $\text{Richness}_i = [\beta_0 + \beta_3(1)] + \beta_1 \text{NAP}_i + \epsilon_i$
- Week 4: $\text{Richness}_i = [\beta_0 + \beta_4(1)] + \beta_1 \text{NAP}_i + \epsilon_i$
By comparing weeks 2 and 1, we can see that β2 represents the difference in expected Richness between week 2 and week 1 (if we hold NAP constant)[15]. Similarly, β3 and β4 represent differences in expected Richness between week 3 and week 1 and between week 4 and week 1, respectively (if, again, we hold NAP constant).
Lastly, it helps to visualize the model. Below, we plot the expected Richness as a function of NAP for each week:
library(ggplot2)
library(ggthemes) # for scale_colour_colorblind()
theme_set(theme_bw())
# add the fitted values to our RIKZ data
RIKZdat <- RIKZdat %>% mutate(p.ancova = predict(lm.ancova))
# plot the observations and the fitted line for each week
ggplot(data = RIKZdat,
       aes(x = NAP, y = Richness, color = week.cat)) +
  geom_point() + geom_line(aes(y = p.ancova)) +
  scale_colour_colorblind()
This plot makes it clear that the effect of NAP is assumed to be constant for all of the weeks and that the expected Richness when NAP = 0 varies by week (i.e., we have a model with constant slope but varying intercepts).
3.6.2 Means coding
We can fit the same model using means coding by removing the intercept:
##
## Call:
## lm(formula = Richness ~ NAP + week.cat - 1, data = RIKZdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0788 -1.4014 -0.3633 0.6500 12.0845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## NAP -2.2708 0.4678 -4.854 1.88e-05 ***
## week.cat1 11.3677 0.9459 12.017 7.48e-15 ***
## week.cat2 3.7426 0.8026 4.663 3.44e-05 ***
## week.cat3 5.1897 0.7979 6.505 9.24e-08 ***
## week.cat4 8.7734 1.3657 6.424 1.20e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.987 on 40 degrees of freedom
## Multiple R-squared: 0.8604, Adjusted R-squared: 0.843
## F-statistic: 49.32 on 5 and 40 DF, p-value: 4.676e-16
We see that the coefficient for week.cat1 is identical to the intercept in the effects model, since week 1 is the reference level in that model. In addition, the other coefficients for the week.cat variables represent the intercepts in the other weeks. If we look at the design matrix for our same 4 observations, we see R creates a separate dummy variable for each week and that the intercept column has been removed.
## NAP week.cat1 week.cat2 week.cat3 week.cat4
## 10 -1.334 1 0 0 0
## 20 -0.811 0 1 0 0
## 30 0.766 0 0 1 0
## 25 0.054 0 0 0 1
Thus, the means model is parameterized with a separate intercept for each week and can be written as:
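$$\begin{bmatrix} 17 \\ 4 \\ 4 \\ 6 \end{bmatrix} = \begin{bmatrix} -1.334 & 1 & 0 & 0 & 0 \\ -0.811 & 0 & 1 & 0 & 0 \\ 0.766 & 0 & 0 & 1 & 0 \\ 0.054 & 0 & 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \end{bmatrix} + \begin{bmatrix} \epsilon_{10} \\ \epsilon_{20} \\ \epsilon_{30} \\ \epsilon_{25} \end{bmatrix}$$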
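The link between the two parameterizations can be checked with simple arithmetic on the estimates printed in the two model summaries above: each weekly intercept under means coding should equal the effects-coding intercept plus that week's coefficient. The vector names below are illustrative; the numbers are copied from the reported output.

```r
# Estimates from the effects-coding summary (Richness ~ NAP + week.cat)
b_effects <- c(intercept = 11.3677, week2 = -7.6251,
               week3 = -6.1780, week4 = -2.5943)

# Estimates from the means-coding summary (Richness ~ NAP + week.cat - 1)
b_means <- c(week1 = 11.3677, week2 = 3.7426,
             week3 = 5.1897, week4 = 8.7734)

# Intercept for each week implied by effects coding: beta0 + (0, beta2, beta3, beta4)
implied <- unname(b_effects["intercept"]) +
  c(0, unname(b_effects[c("week2", "week3", "week4")]))

rbind(implied = implied, means_coding = unname(b_means))
```

The slope for NAP (-2.2708) is the same in both summaries, as expected for two parameterizations of one model.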
3.7 Models with interactions
Let’s inspect the residuals from our model fit with effects coding for week.cat plus NAP (the residuals will be identical using either model formulation, though). In this case, we will see how we can create our own customized residual plot using the fortify function along with ggplot (the fortify function in the ggplot2 package augments the data set used to fit the model with various outputs from the fitted model, including fitted values and residuals).
Richness NAP week.cat .hat .sigma .cooksd .fitted .resid
1 11 0.045 1 0.1005319 3.025215 0.0001963069 11.265517 -0.2655173
2 10 -1.036 1 0.1213725 2.958047 0.0487598818 13.720207 -3.7202072
3 13 -1.336 1 0.1373130 3.015884 0.0081202327 14.401435 -1.4014348
4 11 0.616 1 0.1126489 3.020466 0.0034083423 9.968914 1.0310858
5 10 -0.684 1 0.1082954 2.984729 0.0260386947 12.920900 -2.9209002
6 8 1.190 1 0.1409419 3.023361 0.0018954268 8.665499 -0.6654988
.stdresid
1 -0.09371168
2 -1.32849079
3 -0.50505662
4 0.36638770
5 -1.03538063
6 -0.24034207
ggplot(fortify(lm.ancova)) +
  geom_point(aes(x = .fitted, y = .resid, col = week.cat)) +
  geom_hline(yintercept = 0) + theme_bw() + scale_colour_colorblind()
This plot seems to suggest that the residuals increase in variance with higher fitted values (this is common when modeling count data[16]). There are also a few residuals with very large absolute values.
Think-pair-share: How can we address these issues?
It is not uncommon to find that model assumptions are not met perfectly when analyzing real data. When this happens, it is tempting to search for model-based solutions, which can send you down a spiraling path towards models with more and more complexity, landing you well outside your comfort zone. We will discuss these challenges more down the road when we cover modeling strategies (see Section 8). For now, let’s assume a colleague of yours tries to improve the fit of your model by adding an interaction between NAP and week.cat.[17]
3.7.1 Effects coding
We can include an interaction between NAP and week.cat in R using either Richness ~ NAP * week.cat or Richness ~ NAP + week.cat + NAP:week.cat:
Call:
lm(formula = Richness ~ NAP * week.cat, data = RIKZdat)
Residuals:
Min 1Q Median 3Q Max
-6.3022 -0.9442 -0.2946 0.3383 7.7103
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.40561 0.77730 14.673 < 2e-16 ***
NAP -1.90016 0.87000 -2.184 0.035369 *
week.cat2 -8.04029 1.05519 -7.620 4.30e-09 ***
week.cat3 -6.37154 1.03168 -6.176 3.63e-07 ***
week.cat4 1.37721 1.60036 0.861 0.395020
NAP:week.cat2 0.42558 1.12008 0.380 0.706152
NAP:week.cat3 -0.01344 1.04246 -0.013 0.989782
NAP:week.cat4 -7.00002 1.68721 -4.149 0.000188 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.442 on 37 degrees of freedom
Multiple R-squared: 0.7997, Adjusted R-squared: 0.7618
F-statistic: 21.11 on 7 and 37 DF, p-value: 3.935e-11
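The two formulas mentioned above are interchangeable: `*` expands to the main effects plus their interaction. The sketch below confirms this on a small simulated data set (`toy`, with made-up NAP values and counts), since the RIKZ data are not loaded here.

```r
set.seed(11)                                    # arbitrary seed
toy <- data.frame(NAP      = rnorm(20),
                  week.cat = factor(rep(1:4, times = 5)))
toy$Richness <- rpois(20, lambda = 5)           # made-up counts

fit_star  <- lm(Richness ~ NAP * week.cat, data = toy)
fit_colon <- lm(Richness ~ NAP + week.cat + NAP:week.cat, data = toy)

# Same design matrix columns and same coefficient estimates
colnames(model.matrix(fit_star))
all.equal(coef(fit_star), coef(fit_colon))
```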
Let’s look again at the model matrix to help us write out and understand the model:
## Richness week NAP
## 10 17 1 -1.334
## 20 4 2 -0.811
## 30 4 3 0.766
## 25 6 4 0.054
## (Intercept) NAP week.cat2 week.cat3 week.cat4 NAP:week.cat2 NAP:week.cat3
## 10 1 -1.334 0 0 0 0.000 0.000
## 20 1 -0.811 1 0 0 -0.811 0.000
## 30 1 0.766 0 1 0 0.000 0.766
## 25 1 0.054 0 0 1 0.000 0.000
## NAP:week.cat4
## 10 0.000
## 20 0.000
## 30 0.000
## 25 0.054
Here, we see that we added 3 new predictors to our design matrix last seen in Section 3.6. These columns are formed by multiplying our original 3 dummy variables (indicating weeks 2, 3, and 4) by NAP. Thus, our model can be written as:
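$$\begin{aligned}
\text{Richness}_i = {} & \beta_0 + \beta_1 \text{NAP}_i + \beta_2 I(\text{week}=2)_i + \beta_3 I(\text{week}=3)_i + \beta_4 I(\text{week}=4)_i \\
& + \beta_5 \text{NAP}_i I(\text{week}=2)_i + \beta_6 \text{NAP}_i I(\text{week}=3)_i + \beta_7 \text{NAP}_i I(\text{week}=4)_i + \epsilon_i
\end{aligned}$$
Or, in matrix notation (for our 4 observations above):
$$\begin{bmatrix} 17 \\ 4 \\ 4 \\ 6 \end{bmatrix} = \begin{bmatrix} 1 & -1.334 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & -0.811 & 1 & 0 & 0 & -0.811 & 0 & 0 \\ 1 & 0.766 & 0 & 1 & 0 & 0 & 0.766 & 0 \\ 1 & 0.054 & 0 & 0 & 1 & 0 & 0 & 0.054 \end{bmatrix} \times \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \beta_6 \\ \beta_7 \end{bmatrix} + \begin{bmatrix} \epsilon_{10} \\ \epsilon_{20} \\ \epsilon_{30} \\ \epsilon_{25} \end{bmatrix}$$
In this model, we have a separate slope and intercept for each week, which becomes more evident when we write out equations for each week in which we collect like terms:
- Week 1: $\text{Richness}_i = \beta_0 + \beta_1 \text{NAP}_i + \epsilon_i$
- Week 2: $\text{Richness}_i = [\beta_0 + \beta_2(1)] + [\beta_1 + \beta_5(1)] \text{NAP}_i + \epsilon_i$
- Week 3: $\text{Richness}_i = [\beta_0 + \beta_3(1)] + [\beta_1 + \beta_6(1)] \text{NAP}_i + \epsilon_i$
- Week 4: $\text{Richness}_i = [\beta_0 + \beta_4(1)] + [\beta_1 + \beta_7(1)] \text{NAP}_i + \epsilon_i$
Thus, we see that β0 and β1 represent the intercept and slope for our reference category (week 1). The parameters β2, β3, and β4 represent differences in intercepts for weeks 2, 3, and 4 relative to week 1. And, the parameters β5, β6, and β7 represent differences in slopes (associated with NAP) for weeks 2, 3, and 4 relative to week 1.
We visualize this model below, which highlights that the slope for NAP during week 4 differs from those of the other weeks: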
ggplot(fortify(lmfit.inter), aes(NAP, Richness, col = week.cat))+
geom_line(aes(NAP, .fitted, col = week.cat)) + geom_point() +
scale_colour_colorblind()
3.7.2 Means model
To fit the means parameterization of the model, we need to drop the columns of the design matrix associated with the intercept and slope for week 1 using the following syntax:
##
## Call:
## lm(formula = Richness ~ NAP * week.cat - 1 - NAP, data = RIKZdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3022 -0.9442 -0.2946 0.3383 7.7103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## week.cat1 11.4056 0.7773 14.673 < 2e-16 ***
## week.cat2 3.3653 0.7136 4.716 3.38e-05 ***
## week.cat3 5.0341 0.6784 7.421 7.85e-09 ***
## week.cat4 12.7828 1.3989 9.138 5.05e-11 ***
## NAP:week.cat1 -1.9002 0.8700 -2.184 0.03537 *
## NAP:week.cat2 -1.4746 0.7055 -2.090 0.04353 *
## NAP:week.cat3 -1.9136 0.5743 -3.332 0.00197 **
## NAP:week.cat4 -8.9002 1.4456 -6.157 3.85e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.442 on 37 degrees of freedom
## Multiple R-squared: 0.9138, Adjusted R-squared: 0.8951
## F-statistic: 49 on 8 and 37 DF, p-value: < 2.2e-16
We can inspect the design matrix and write the model for our 4 observations in matrix notation:
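$$\begin{bmatrix} 17 \\ 4 \\ 4 \\ 6 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & -1.334 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & -0.811 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0.766 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0.054 \end{bmatrix} \times \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \beta_6 \\ \beta_7 \\ \beta_8 \end{bmatrix} + \begin{bmatrix} \epsilon_{10} \\ \epsilon_{20} \\ \epsilon_{30} \\ \epsilon_{25} \end{bmatrix}$$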
## week.cat1 week.cat2 week.cat3 week.cat4 NAP:week.cat1 NAP:week.cat2
## 10 1 0 0 0 -1.334 0.000
## 20 0 1 0 0 0.000 -0.811
## 30 0 0 1 0 0.000 0.000
## 25 0 0 0 1 0.000 0.000
## NAP:week.cat3 NAP:week.cat4
## 10 0.000 0.000
## 20 0.000 0.000
## 30 0.766 0.000
## 25 0.000 0.054
In this formulation of the model, we directly estimate separate intercepts and slopes for each week (rather than parameters that describe deviations from a reference group):
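$$\begin{aligned}
\text{Richness}_i = {} & \beta_1 I(\text{week}=1)_i + \beta_2 I(\text{week}=2)_i + \beta_3 I(\text{week}=3)_i + \beta_4 I(\text{week}=4)_i \\
& + \beta_5 \text{NAP}_i I(\text{week}=1)_i + \beta_6 \text{NAP}_i I(\text{week}=2)_i + \beta_7 \text{NAP}_i I(\text{week}=3)_i + \beta_8 \text{NAP}_i I(\text{week}=4)_i + \epsilon_i
\end{aligned}$$
This gives us the following equations for the observations from each week:
- Week 1: $\text{Richness}_i = \beta_1 + \beta_5 \text{NAP}_i + \epsilon_i$
- Week 2: $\text{Richness}_i = \beta_2 + \beta_6 \text{NAP}_i + \epsilon_i$
- Week 3: $\text{Richness}_i = \beta_3 + \beta_7 \text{NAP}_i + \epsilon_i$
- Week 4: $\text{Richness}_i = \beta_4 + \beta_8 \text{NAP}_i + \epsilon_i$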
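The week-specific intercepts and slopes listed above can also be recovered from the effects-coding fit in Section 3.7.1 by adding the reference-level coefficients to the corresponding differences. The sketch below does this arithmetic with the estimates copied from that summary; the names `b`, `intercepts`, and `slopes` are illustrative.

```r
# Estimates from the effects-coding summary (Richness ~ NAP * week.cat)
b <- c(b0 = 11.40561, b1 = -1.90016,
       b2 = -8.04029, b3 = -6.37154, b4 = 1.37721,
       b5 = 0.42558,  b6 = -0.01344, b7 = -7.00002)

# Week-specific intercepts and slopes: reference value plus each difference
intercepts <- unname(b["b0"]) + c(0, unname(b[c("b2", "b3", "b4")]))
slopes     <- unname(b["b1"]) + c(0, unname(b[c("b5", "b6", "b7")]))

data.frame(week = 1:4, intercept = intercepts, slope = slopes)
```

Up to rounding, these agree with the week.cat and NAP:week.cat estimates in the means-coding summary (for example, an intercept of about 12.78 and a slope of about -8.90 for week 4).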
3.7.3 Creating flexible models with dummy variables
After looking at Figure 3.7, we might decide that we want to fit a model that allows each week to have its own intercept, but in which the effect of NAP is the same in weeks 1-3 and differs only in week 4. If we understand how categorical variables are encoded in regression models, we can fit this model quite easily. We need to include week.cat to allow each week to have its own intercept. We also create a single dummy variable (equal to 1 if week is equal to 4 and 0 otherwise) and include the interaction of this dummy variable with NAP:
lm.datadriven <- lm(Richness ~ NAP + week.cat + NAP:I(week==4), data = RIKZdat)
summary(lm.datadriven)
##
## Call:
## lm(formula = Richness ~ NAP + week.cat + NAP:I(week == 4), data = RIKZdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3022 -0.9762 -0.0838 0.6269 7.6894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.4187 0.7558 15.108 < 2e-16 ***
## NAP -1.7722 0.3875 -4.573 4.77e-05 ***
## week.cat2 -7.9124 0.9996 -7.915 1.23e-09 ***
## week.cat3 -6.4463 0.9965 -6.469 1.16e-07 ***
## week.cat4 1.3641 1.5623 0.873 0.388
## NAP:I(week == 4)TRUE -7.1280 1.4652 -4.865 1.92e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.387 on 39 degrees of freedom
## Multiple R-squared: 0.7983, Adjusted R-squared: 0.7725
## F-statistic: 30.88 on 5 and 39 DF, p-value: 1.425e-12
We can write down this model as:
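$$\text{Richness}_i = \beta_0 + \beta_1 \text{NAP}_i + \beta_2 I(\text{week}=2)_i + \beta_3 I(\text{week}=3)_i + \beta_4 I(\text{week}=4)_i + \beta_5 \text{NAP}_i I(\text{week}=4)_i + \epsilon_i$$
Which implies:
- Week 1: $\text{Richness}_i = \beta_0 + \beta_1 \text{NAP}_i + \epsilon_i$
- Week 2: $\text{Richness}_i = [\beta_0 + \beta_2] + \beta_1 \text{NAP}_i + \epsilon_i$
- Week 3: $\text{Richness}_i = [\beta_0 + \beta_3] + \beta_1 \text{NAP}_i + \epsilon_i$
- Week 4: $\text{Richness}_i = [\beta_0 + \beta_4] + [\beta_1 + \beta_5] \text{NAP}_i + \epsilon_i$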
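To see how the customized term enters the design matrix, the sketch below builds the matrix for the same formula on a small made-up data frame (`toy`, with simulated NAP values; the RIKZ data are not loaded here). The single extra column should equal NAP for week-4 observations and 0 otherwise.

```r
set.seed(5)                                    # arbitrary seed
toy <- data.frame(week = rep(1:4, each = 3),
                  NAP  = round(rnorm(12), 2))  # simulated NAP values
toy$week.cat <- as.factor(toy$week)

# Same right-hand side as the customized model
X <- model.matrix(~ NAP + week.cat + NAP:I(week == 4), data = toy)
colnames(X)

# The last column is NAP multiplied by the week-4 indicator
extra <- unname(X[, ncol(X)])
all.equal(extra, toy$NAP * (toy$week == 4))
```

Using the estimates in the summary above, the implied slope for week 4 is -1.7722 + (-7.1280) = -8.9002, while weeks 1-3 share the slope -1.7722.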
Let’s plot the fitted model of the expected Richness as a function of NAP for each week:
ggplot(fortify(lm.datadriven), aes(NAP, Richness, col = week.cat))+
geom_line(aes(NAP, .fitted, col = week.cat)) + geom_point() +
scale_colour_colorblind()
Although these results may look convincing, we arrived at this result in a very data-driven way. As we will discuss in Section 8, it is easy to develop models that fit your data well but that perform poorly when applied to new data. In general, you should be skeptical of relationships discovered based on intensive data exploration that were not expected a priori. Also, remember that this is an unbalanced design, with only 5 observations during week 4. Hence, this interaction model should be interpreted with great caution. It is quite possible that we are just fitting a model that explains noise in the data. Note, for example, the outlier present in the data from week 4. This point could be the result of measurement error or some other factor that we have not accounted for in our model, and it could be the sole reason for the “need” for the interaction. In addition, if we plot residuals versus fitted values, we see that several large residuals remain. So issues remain with our model that warrant further consideration.
ggplot(fortify(lm.datadriven), aes(.fitted, .resid, col = week.cat))+
geom_point() + geom_hline(yintercept = 0) +
scale_colour_colorblind()
3.7.4 Improving parameter interpretation through centering
When we fit a model with an interaction between a continuous and categorical variable, we are explicitly assuming that the difference between any 2 groups depends on the value of the continuous variable. For example, the difference in species richness between weeks 1 and 2 depends on the value of NAP (Figure 3.7). Further, the coefficients associated with the categorical variable represent differences in intercepts, or in other words, differences in mean responses when all other variables are set to 0. As we saw in Section 1.2, intercepts may be difficult to interpret or misleading when they require extrapolating outside of the range of the observed data.
Centering the continuous variable can make it easier to interpret the parameters associated with the categorical variable in models with interactions (Schielzeth, 2010). For example, if we refit our model after centering NAP by its mean, then the coefficients associated with the categorical variable will represent contrasts between each group and the reference group when NAP is set to its mean (rather than 0) if we use effects coding, or the mean response for each group when NAP is set to its mean value if we use means coding.
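The effect of centering can be illustrated with a small simulation. The sketch below uses made-up data (the names `dat`, `grp`, `x`, and `x_c` are hypothetical and unrelated to the RIKZ data) and R's default factor coding with a reference group. The observed values of `x` lie far from 0, so the group difference at `x = 0` is an extrapolation.

```r
set.seed(1)
n_per <- 25
dat <- data.frame(
  grp = factor(rep(c("A", "B"), each = n_per)),
  x   = runif(2 * n_per, min = 10, max = 20)   # x = 0 lies well outside the data
)
# Simulated response with group-specific intercepts and slopes
dat$y <- ifelse(dat$grp == "A", 2 + 1.0 * dat$x, 12 + 0.5 * dat$x) +
  rnorm(2 * n_per, sd = 1)
dat$x_c <- dat$x - mean(dat$x)                 # centered predictor

fit_raw <- lm(y ~ x * grp, data = dat)
fit_cen <- lm(y ~ x_c * grp, data = dat)

# grpB = difference between groups at x = 0 (raw) or at mean(x) (centered)
coef(fit_raw)[["grpB"]]
coef(fit_cen)[["grpB"]]

# Slopes and slope differences are unchanged by centering
c(raw = coef(fit_raw)[["x:grpB"]], centered = coef(fit_cen)[["x_c:grpB"]])
```

The two fits are reparameterizations of the same model (identical fitted values). Only the point at which the group difference is evaluated changes.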
3.8 Pairwise comparisons
Consider the model below, fit to the RIKZ data set. The model includes two quantitative variables (NAP and exposure) and a categorical variable with more than 2 categories (week).
Call:
lm(formula = Richness ~ NAP + exposure + week.cat, data = RIKZdat)
Residuals:
Min 1Q Median 3Q Max
-4.912 -1.621 -0.313 1.004 11.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.9262 7.3960 3.235 0.00248 **
NAP -2.4344 0.4668 -5.215 6.33e-06 ***
exposure -1.3972 0.8164 -1.711 0.09495 .
week.cat2 -4.7364 2.0827 -2.274 0.02854 *
week.cat3 -4.2269 1.6671 -2.535 0.01535 *
week.cat4 -1.0814 1.8548 -0.583 0.56323
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.918 on 39 degrees of freedom
Multiple R-squared: 0.6986, Adjusted R-squared: 0.6599
F-statistic: 18.08 on 5 and 39 DF, p-value: 2.991e-09
For NAP and exposure, we can use the t-tests in the output above to test whether there is sufficient evidence to conclude the coefficients are not equal to 0. We can also see in the output whether we have evidence to suggest there are differences between week 2 and week 1, week 3 and week 1, and week 4 and week 1. What if we want to test for differences between, say, weeks 2 and 4, or between all pairs of weeks?
This question brings up the thorny issue of multiple comparisons. There are “4 choose 2” = 6 possible pairwise comparisons we could consider when comparing the 4 weeks, and in general $\binom{k}{2} = \frac{k!}{2!(k-2)!}$ possible comparisons if there are $k$ groups. If the probability of making a type I error (i.e., rejecting the null hypothesis when it is true) is $\alpha$, and each test is independent of all the other tests, then we would have a $1-(1-\alpha)^{n_{tests}}$ chance (roughly 26% if $\alpha = 0.05$ and we conduct 6 tests) of rejecting at least one null hypothesis even if all of them were true; this error rate is often referred to as the family-wise error rate. The problem increases with the number of categories associated with the categorical variable and applies also to confidence intervals (i.e., the chance of at least one confidence interval failing to capture the true mean pairwise difference increases with the number of confidence intervals considered).
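The arithmetic behind these numbers is easy to reproduce. The short sketch below computes the number of pairwise comparisons and the family-wise error rate under the independence assumption stated above (the variable names are illustrative).

```r
k <- 4                          # number of groups (weeks)
alpha <- 0.05                   # per-test type I error rate
n_tests <- choose(k, 2)         # number of pairwise comparisons
fwer <- 1 - (1 - alpha)^n_tests # chance of at least one false rejection
c(n_tests = n_tests, fwer = round(fwer, 3))

# The same calculation for 3 to 8 groups shows how quickly the problem grows
sapply(3:8, function(g) 1 - (1 - alpha)^choose(g, 2))
```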
The usual way of dealing with multiple comparisons is to apply a correction factor that adjusts the p-values associated with individual hypothesis tests (or, alternatively, adjusts the critical values with which the p-values are compared when deciding if there is evidence to reject the null hypothesis). Similarly, one can make adjustments to confidence intervals to make them wider in hopes of controlling the family-wise error rate. There are multiple packages in R for conducting pairwise comparisons with adjustments. We briefly demonstrate one option using the emmeans package (Lenth, 2021). We begin by estimating the mean Richness for each week when NAP and exposure are set to their mean values using the emmeans function:
## week.cat emmean SE df lower.CL upper.CL
## 1 8.80 1.406 39 5.95 11.64
## 2 4.06 0.995 39 2.05 6.07
## 3 4.57 0.761 39 3.03 6.11
## 4 7.72 1.320 39 5.05 10.38
##
## Confidence level used: 0.95
We can then use the pairs function to calculate all 6 pairwise differences between weekly means and test whether these differences are statistically significant (i.e., whether we have evidence that the true difference is likely non-zero). We can also request confidence intervals for the pairwise differences in means by supplying the argument infer = c(TRUE, TRUE).
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## 1 - 2 4.736 2.08 39 -0.852 10.325 2.274 0.1217
## 1 - 3 4.227 1.67 39 -0.247 8.700 2.535 0.0699
## 1 - 4 1.081 1.85 39 -3.896 6.058 0.583 0.9366
## 2 - 3 -0.509 1.20 39 -3.725 2.706 -0.425 0.9738
## 2 - 4 -3.655 1.71 39 -8.241 0.931 -2.139 0.1589
## 3 - 4 -3.146 1.53 39 -7.252 0.961 -2.055 0.1858
##
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 4 estimates
## P value adjustment: tukey method for comparing a family of 4 estimates
Each row represents a different pairwise comparison identified by the label in the first column. By default, emmeans
uses Tukey’s Honest Significant Difference (HSD) to adjust the p-values and confidence intervals associated with each comparison (Abdi & Williams, 2010), thus controlling the family-wise error rate (i.e., the probability that we incorrectly reject at least 1 null hypothesis when all of the null hypotheses are true). Using a family-wise error rate of α=0.05, we would conclude that we do not have enough evidence to reject the null hypothesis for any of the pairwise comparisons, as they all have p-values > 0.05. We can see our conclusions are more conservative than if we had not done any adjustments:
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## 1 - 2 4.736 2.08 39 0.524 8.949 2.274 0.0285
## 1 - 3 4.227 1.67 39 0.855 7.599 2.535 0.0153
## 1 - 4 1.081 1.85 39 -2.670 4.833 0.583 0.5632
## 2 - 3 -0.509 1.20 39 -2.933 1.914 -0.425 0.6730
## 2 - 4 -3.655 1.71 39 -7.112 -0.198 -2.139 0.0388
## 3 - 4 -3.146 1.53 39 -6.241 -0.050 -2.055 0.0466
##
## Confidence level used: 0.95
Without any adjustment, we would have concluded that weeks 1 and 2, weeks 1 and 3, weeks 2 and 4, and weeks 3 and 4 all differ from one another. Clearly, then, there is a tradeoff involved when adjusting for multiple comparisons. We can reduce the family-wise type I error rate at the expense of increasing the type II error rate (failing to reject a null hypothesis when it is indeed false). Thus, correcting for multiple comparisons is not without its critics (e.g., Perneger, 1998; Moran, 2003; Nakagawa, 2004). Rather than attempt to control the family-wise error rate (i.e., probability of incorrectly rejecting one or more null hypotheses), many statisticians now advocate for controlling the false discovery rate, defined as the proportion of significant results that reflect type I errors (see e.g., Benjamini & Hochberg, 1995; García, 2004; Verhoeven, Simonsen, & McIntyre, 2005; Pike, 2011). In other words, rather than attempt to avoid rejecting any null hypotheses that are true, the goal is to ensure that most of the hypotheses that are rejected are indeed false. Controlling the false discovery rate results in more powerful tests, meaning we are more likely to reject hypotheses when they are false, and less conservative adjustments than controlling the family-wise error rate.
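To get a feel for how family-wise and false discovery rate adjustments differ, we can apply a few generic corrections to the unadjusted p-values shown above using the base R function `p.adjust`. This is only a sketch: these generic corrections are not the Tukey adjustment used by emmeans, and the object names are illustrative.

```r
# Unadjusted p-values copied from the pairs() output above
p_raw <- c("1 - 2" = 0.0285, "1 - 3" = 0.0153, "1 - 4" = 0.5632,
           "2 - 3" = 0.6730, "2 - 4" = 0.0388, "3 - 4" = 0.0466)

adj <- cbind(
  unadjusted = p_raw,
  bonferroni = p.adjust(p_raw, method = "bonferroni"),  # controls family-wise error rate
  holm       = p.adjust(p_raw, method = "holm"),        # controls family-wise error rate
  BH         = p.adjust(p_raw, method = "BH")           # controls false discovery rate
)
round(adj, 3)
```

The Benjamini-Hochberg (BH) adjusted p-values are never larger than the Holm or Bonferroni ones, which is the sense in which false discovery rate control is less conservative. For these six comparisons, however, none of the adjusted p-values falls below 0.05 under any of the three methods.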
Another option that is sometimes used to control the family-wise error rate is what is called Fisher’s least significant difference (LSD) procedure in which a global, multiple degree-of-freedom test is conducted first. If this test is significant, then one proceeds with further pairwise comparisons. If the global test is not significant, then no pairwise comparisons are conducted. This approach is capable of controlling the family-wise error rate when there are only 3 groups under consideration (Meier, 2006). We discuss the multiple degree-of-freedom test in Section 3.9. Lastly, it is often beneficial to limit the number of tests conducted to just those comparisons that are of primary interest.
3.9 Multiple degree-of-freedom hypothesis tests
The Fisher’s LSD procedure would require us to first test the global null hypothesis that the coefficients for week.cat2, week.cat3 and week.cat4 are all 0 (i.e., all weeks have the same expected species Richness after adjusting for NAP and exposure) versus an alternative hypothesis that at least 1 of the coefficients is non-zero. Tests of joint hypotheses (i.e., hypotheses in which multiple parameters are simultaneously set to specified values, usually 0) can be conducted using either a $\chi^2$ or $F$ distribution. Tests using the $\chi^2$ distribution are usually based on large sample approximations that lead to a Normal distribution (the square of a standard Normal random variable is distributed as $\chi^2_1$). The $F$ distribution, like the t-distribution, is more appropriate when all of the assumptions of the linear regression model are met. The two tests will be equivalent as sample sizes approach infinity. The $\chi^2$ and $F$ distributions only assign probabilities to positive values and therefore, p-values are calculated as areas to the right of our test statistic, calculated from the observed data.
F-statistics have an associated numerator and denominator degrees of freedom and can be most easily understood in terms of comparing two models: a full model ($M_F$) and a reduced model ($M_R$) in which some subset of parameters have been set equal to 0. Let $p_F$ and $p_R$ be the number of parameters in the full and reduced models, respectively. The numerator degrees of freedom, $k_1$, is equal to the difference in the number of parameters ($k_1 = p_F - p_R = 3$ when testing the null hypothesis that the coefficients for week.cat2, week.cat3 and week.cat4 are all 0). The denominator degrees of freedom, $k_2$, is determined from the full model that includes these additional parameters and is equal to $n - p_F$ where $n$ is the sample size. The F-statistic can be calculated as:
$$F = \frac{(SSE_{M_R} - SSE_{M_F})/(p_F - p_R)}{SSE_{M_F}/(n - p_F)} = \frac{(SSR_{M_F} - SSR_{M_R})/(p_F - p_R)}{SSE_{M_F}/(n - p_F)},$$
where $SSE_{M_R}$ and $SSR_{M_R}$ are the residual and regression sums-of-squares for the restricted/constrained model (in this case, with parameters for week.cat2, week.cat3, and week.cat4 set to 0), and $SSE_{M_F}$ and $SSR_{M_F}$ are the residual and regression sums-of-squares for the full model (where these parameters are also estimated). Thus, the numerator captures the additional variability that is explained by including the additional parameters and should be close to 0 if the null hypothesis is true.
When using software to calculate sums of squares, you may be surprised to learn that sums-of-squares may be calculated in different ways, depending on whether we consider constructing a model sequentially, adding one variable at a time (so called type I sums of squares), or we consider removing a variable from the model containing all predictors (type III sums of squares)18. I.e., whether we build the model in a “forward” or “backward” direction influences what other variables are included or adjusted for when calculating regression sums of squares. When testing hypotheses, I recommend a backwards model-selection approach, which is implemented using the Anova
function in the car
package. By contrast, the anova
function in base R will implement a sequential approach, in which case the results of the test will depend on the ordering of the variables when you specify the model.
Let’s start with the Anova
function:
Anova Table (Type II tests)
Response: Richness
Sum Sq Df F value Pr(>F)
NAP 231.59 1 27.1999 6.335e-06 ***
exposure 24.94 1 2.9289 0.09495 .
week.cat 73.19 3 2.8654 0.04888 *
Residuals 332.07 39
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Anova function returns tests appropriate for backwards selection (see Section 8.3.1), meaning that these tests determine if we have enough evidence to suggest that the variable of interest is associated with the response variable, after adjusting for the other variables in the model. In this case, both the full and reduced models include all other variables. The F-tests here for NAP and exposure are equivalent to the t-tests in the summary of the lm fit (in fact, the F-statistics are equal to the square of the t-statistics we saw previously). The advantage of using Anova is that it also returns a multiple degree-of-freedom test for week.cat. The associated p-value (0.049) suggests we have enough evidence in the data to conclude that at least one of the weeks differs from the others (in terms of species richness after adjusting for exposure and NAP).
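As a check on the formula for the F-statistic, the test for week.cat can be reproduced by hand from the sums of squares and degrees of freedom printed in the Anova table above. The sketch below hard-codes those rounded values, so small rounding differences are expected; the object names are illustrative.

```r
# Values copied from the Anova table above
ss_week  <- 73.19    # SSE(reduced) - SSE(full): variability explained by week.cat
sse_full <- 332.07   # residual sum of squares for the full model
k1 <- 3              # numerator df: number of week.cat coefficients
k2 <- 39             # denominator df: residual df of the full model

F_week <- (ss_week / k1) / (sse_full / k2)
p_week <- pf(F_week, df1 = k1, df2 = k2, lower.tail = FALSE)  # area to the right
c(F = F_week, p = p_week)

# For single-df terms, F equals the squared t-statistic from summary()
t_NAP <- -5.215
t_NAP^2   # compare with the F value reported for NAP (27.1999)
```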
If we had used the anova
function, we would end up with a different set of tests resulting from sequentially adding variables one at a time. In this case, the order in which the predictor variables appear matters. We will compare two different calls to lm
below to demonstrate this:
Analysis of Variance Table
Response: Richness
Df Sum Sq Mean Sq F value Pr(>F)
week.cat 3 534.31 178.104 20.9177 3.060e-08 ***
exposure 1 3.67 3.675 0.4316 0.5151
NAP 1 231.59 231.593 27.1999 6.335e-06 ***
Residuals 39 332.07 8.514
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values here test whether:
- at least one coefficient associated with week.cat is non-zero in a model that only includes week.cat (since it was specified first)
- the coefficient for exposure is non-zero in a model with week.cat and exposure
- the coefficient for NAP is non-zero in a model that contains week.cat, exposure, and NAP.
If we reverse the order in which variables are entered into the model, we get a different set of p-values, with the test for week.cat now matching the test from the Anova function.
## Analysis of Variance Table
##
## Response: Richness
## Df Sum Sq Mean Sq F value Pr(>F)
## NAP 1 357.53 357.53 41.9907 1.117e-07 ***
## exposure 1 338.86 338.86 39.7977 1.931e-07 ***
## week.cat 3 73.19 24.40 2.8654 0.04888 *
## Residuals 39 332.07 8.51
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
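The dependence of sequential tests on variable order can also be seen with simulated data and base R alone. In the sketch below (all names are hypothetical), two correlated predictors are fit in both orders with `anova`, and `drop1` is used to obtain tests in which each term is adjusted for all the others.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)   # x2 is correlated with x1
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

fit_12 <- lm(y ~ x1 + x2, data = sim)
fit_21 <- lm(y ~ x2 + x1, data = sim)

seq_12 <- anova(fit_12)               # x1 tested first, then x2 given x1
seq_21 <- anova(fit_21)               # x2 tested first, then x1 given x2
marg   <- drop1(fit_12, test = "F")   # each term tested given all the others

seq_12
seq_21
marg
```

Only the term entered last in a sequential table is adjusted for every other predictor, so its F-statistic matches the corresponding row of the `drop1` table.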
In addition to using an F−test for categorical variables with more than 2-levels, we can also use an F−test to evaluate whether any of our predictors explain a significant proportion of variability in the response as we will see in the next section.
3.10 Regression F-statistic
When using the summary
function with a fitted regression model, you may notice an F-statistic and p-value at the bottom of the output:
##
## Call:
## lm(formula = Richness ~ NAP + exposure + week.cat, data = RIKZdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.912 -1.621 -0.313 1.004 11.903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9262 7.3960 3.235 0.00248 **
## NAP -2.4344 0.4668 -5.215 6.33e-06 ***
## exposure -1.3972 0.8164 -1.711 0.09495 .
## week.cat2 -4.7364 2.0827 -2.274 0.02854 *
## week.cat3 -4.2269 1.6671 -2.535 0.01535 *
## week.cat4 -1.0814 1.8548 -0.583 0.56323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.918 on 39 degrees of freedom
## Multiple R-squared: 0.6986, Adjusted R-squared: 0.6599
## F-statistic: 18.08 on 5 and 39 DF, p-value: 2.991e-09
This F-statistic is testing whether all regression coefficients (other than the intercept) are simultaneously 0 versus an alternative hypothesis that at least one of the coefficients is non-zero. Thus, the numerator degrees of freedom is equal to p−1 where p is the number of parameters in the model (p−1=5 in the above case). The denominator degrees of freedom is again equal to n−p. Similar to the idea behind Fisher’s least significant difference (LSD) procedure, we might consider only testing hypotheses involving individual coefficients when this global test is rejected as one way to reduce the family-wise type I error rate.
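Because the reduced model here contains only an intercept, the regression F-statistic can also be written in terms of the coefficient of determination, $F = \frac{R^2/(p-1)}{(1-R^2)/(n-p)}$. The sketch below applies this standard identity to the rounded values printed in the summary output (object names are illustrative).

```r
r2 <- 0.6986   # Multiple R-squared from the summary output
k1 <- 5        # numerator df: p - 1
k2 <- 39       # denominator df: n - p

F_overall <- (r2 / k1) / ((1 - r2) / k2)
p_overall <- pf(F_overall, df1 = k1, df2 = k2, lower.tail = FALSE)
c(F = F_overall, p = p_overall)   # compare with 18.08 and 2.991e-09
```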
3.11 Contrasts: Estimation of linear combinations of parameters
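Often, we are interested in estimating some linear combination (i.e., a weighted sum) of our regression parameters. Consider again the model with only NAP and week.cat fit using effects coding:
$$\text{Richness}_i = \beta_0 + \beta_1 \text{NAP}_i + \beta_2 I(\text{week}=2)_i + \beta_3 I(\text{week}=3)_i + \beta_4 I(\text{week}=4)_i + \epsilon_i.$$
We saw how we can use the functions in the emmeans package to estimate the difference in Richness between weeks 2 and 3 (as well as between other weeks), while controlling for NAP. We can estimate this contrast between mean Richness in weeks 2 and 3 as $\hat{\beta}_2 - \hat{\beta}_3$. To correctly estimate its uncertainty requires considering how much $\hat{\beta}_2$ and $\hat{\beta}_3$ vary as well as how much they co-vary across data sets if we could replicate the sampling design many times. I.e., we must consider the variance/covariance matrix of our regression parameter estimators, $\hat{\Sigma}_{\hat{\beta}}$:
$$\hat{\Sigma}_{\hat{\beta}} = \begin{bmatrix}
var(\hat{\beta}_0) & cov(\hat{\beta}_0, \hat{\beta}_1) & cov(\hat{\beta}_0, \hat{\beta}_2) & cov(\hat{\beta}_0, \hat{\beta}_3) & cov(\hat{\beta}_0, \hat{\beta}_4) \\
cov(\hat{\beta}_0, \hat{\beta}_1) & var(\hat{\beta}_1) & cov(\hat{\beta}_1, \hat{\beta}_2) & cov(\hat{\beta}_1, \hat{\beta}_3) & cov(\hat{\beta}_1, \hat{\beta}_4) \\
cov(\hat{\beta}_0, \hat{\beta}_2) & cov(\hat{\beta}_1, \hat{\beta}_2) & var(\hat{\beta}_2) & cov(\hat{\beta}_2, \hat{\beta}_3) & cov(\hat{\beta}_2, \hat{\beta}_4) \\
cov(\hat{\beta}_0, \hat{\beta}_3) & cov(\hat{\beta}_1, \hat{\beta}_3) & cov(\hat{\beta}_2, \hat{\beta}_3) & var(\hat{\beta}_3) & cov(\hat{\beta}_3, \hat{\beta}_4) \\
cov(\hat{\beta}_0, \hat{\beta}_4) & cov(\hat{\beta}_1, \hat{\beta}_4) & cov(\hat{\beta}_2, \hat{\beta}_4) & cov(\hat{\beta}_3, \hat{\beta}_4) & var(\hat{\beta}_4)
\end{bmatrix}.$$
For constants $a$ and $b$, $var(ax + by) = a^2 var(x) + b^2 var(y) + 2ab\,cov(x, y)$. Thus, setting $a = 1$ and $b = -1$, $var(\hat{\beta}_2 - \hat{\beta}_3) = var(\hat{\beta}_2) + var(\hat{\beta}_3) - 2cov(\hat{\beta}_2, \hat{\beta}_3)$. We can calculate this variance using matrix multiplication. Define the transpose of a column vector, $c$, as $c' = (0, 0, 1, -1, 0)$, which we will use to estimate our contrast of interest via matrix multiplication:
$$c'\hat{\beta} = \begin{bmatrix} 0 & 0 & 1 & -1 & 0 \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \\ \hat{\beta}_4 \end{bmatrix} = \hat{\beta}_2 - \hat{\beta}_3.$$
The variance of this contrast is given by:
$$c'\hat{\Sigma}_{\hat{\beta}}c = \begin{bmatrix} 0 & 0 & 1 & -1 & 0 \end{bmatrix} \hat{\Sigma}_{\hat{\beta}} \begin{bmatrix} 0 \\ 0 \\ 1 \\ -1 \\ 0 \end{bmatrix} = var(\hat{\beta}_2) + var(\hat{\beta}_3) - 2cov(\hat{\beta}_2, \hat{\beta}_3).$$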
To verify, let’s calculate the standard error of this contrast (equivalent to the square-root of the variance) using matrix multiplication in R, noting that $\hat{\Sigma}_{\hat{\beta}}$ (stored below as Sigma_b) can be obtained using the vcov function applied to our linear model object:
Sigma_b <- vcov(lm.ancova)
cmat <- c(0, 0, 1, -1, 0)
cmat %*% coef(lm.ancova) # estimate of week 2 - week 3
## [,1]
## [1,] -1.447196
sqrt(t(cmat) %*% Sigma_b %*% cmat) # SE of the contrast, sqrt(c' Sigma c)
## [,1]
## [1,] 1.091021
## contrast estimate SE df t.ratio p.value
## 1 - 2 7.63 1.25 40 6.105 <.0001
## 1 - 3 6.18 1.25 40 4.961 <.0001
## 1 - 4 2.59 1.67 40 1.554 0.1280
## 2 - 3 -1.45 1.09 40 -1.326 0.1922
## 2 - 4 -5.03 1.54 40 -3.258 0.0023
## 3 - 4 -3.58 1.54 40 -2.320 0.0255
We see that we get the same SE as the one returned by the pairs function for the difference between weeks 2 and 3. At this point, you might be wondering why you need to know how to calculate contrasts and their uncertainty using matrix algebra if emmeans will do all the hard work for you. Good question! There are times when you may be interested in something other than a simple pairwise difference. For example, we could test whether the last two weeks had higher species richness, on average, than the first two. Because week 1 is the reference level in this model, the mean of the first two weeks (at a given value of NAP) is $\beta_0 + \beta_1 \text{NAP} + \frac{\beta_2}{2}$ and the mean of the last two weeks is $\beta_0 + \beta_1 \text{NAP} + \frac{\beta_3 + \beta_4}{2}$, so the contrast of interest uses $c' = (0, 0, -1/2, 1/2, 1/2)$, giving $\left(\frac{\hat{\beta}_3 + \hat{\beta}_4}{2}\right) - \frac{\hat{\beta}_2}{2}$. A similar approach was used by Iannarilli, Erb, Arnold, & Fieberg (2021) to test for differences in average encounter rates in a camera trap study where sites were randomized to one of two types of lures and one of two types of camera placement strategies. For a more thorough discussion of the importance of contrasts, see Schad, Vasishth, Hohenstein, & Kliegl (2020).
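A small helper makes it easy to evaluate any contrast of this kind. The sketch below uses simulated data rather than the RIKZ data, and the function name `contrast_est` is hypothetical. It assumes a model with coefficients ordered as intercept, covariate, and then dummy variables for weeks 2-4 (week 1 is the reference level), so the second contrast compares the average of weeks 3 and 4 with the average of weeks 1 and 2.

```r
set.seed(10)
n_per <- 15
sim <- data.frame(
  week = factor(rep(1:4, each = n_per)),
  x    = rnorm(4 * n_per)
)
true_means <- c(10, 4, 5, 9)                  # hypothetical weekly means at x = 0
sim$y <- true_means[as.integer(sim$week)] - 2 * sim$x + rnorm(nrow(sim), sd = 2)

fit <- lm(y ~ x + week, data = sim)           # coefficients: b0, x, week2, week3, week4

# Hypothetical helper: estimate, SE, and t-ratio for the contrast c'beta
contrast_est <- function(model, cvec) {
  est <- drop(cvec %*% coef(model))
  se  <- sqrt(drop(t(cvec) %*% vcov(model) %*% cvec))
  c(estimate = est, SE = se, t = est / se)
}

contrast_est(fit, c(0, 0, 1, -1, 0))          # week 2 - week 3
contrast_est(fit, c(0, 0, -1/2, 1/2, 1/2))    # mean(weeks 3, 4) - mean(weeks 1, 2)
```

The t-ratio can be compared to a t-distribution with `df.residual(fit)` degrees of freedom, in the same way as the tests for individual coefficients.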
3.12 Aside: Revisiting F-tests and comparing to Wald χ2 tests
In Section 3.9, we considered the F-statistic written in terms of sums of squares. We can also formulate F tests using matrix algebra. Similar to the previous section, we will use C (here as a matrix instead of a vector) to identify one or more linear combinations of our regression parameters. Let’s consider again the model from Section 3.9:
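$$\text{Richness}_i = \beta_0 + \beta_1 \text{NAP}_i + \beta_2 \text{exposure}_i + \beta_3 I(\text{week}=2)_i + \beta_4 I(\text{week}=3)_i + \beta_5 I(\text{week}=4)_i + \epsilon_i.$$
Define our contrast matrix, $C$, as:
$$C = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$
Matrix multiplication, $C\beta$, identifies the parameters involved in our multiple degree-of-freedom hypothesis test:
$$C\beta = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \end{bmatrix} = \begin{bmatrix} \beta_3 \\ \beta_4 \\ \beta_5 \end{bmatrix}$$
The F-statistic can be written as:
$$F = \frac{1}{k_1}(C\hat{\beta})'(C\hat{\Sigma}_{\hat{\beta}}C')^{-1}(C\hat{\beta})$$
where $\hat{\Sigma}_{\hat{\beta}}$ is once again our estimated variance-covariance matrix associated with $\hat{\beta}$, and $k_1$ is the number of rows in $C$. We can use matrix algebra to verify the F-statistic from the test calculated using the Anova function: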
cmat <- matrix(c(0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1), byrow=TRUE, ncol=6)
t(cmat%*%coef(lm.RIKZ))%*%solve(cmat%*%vcov(lm.RIKZ)%*%t(cmat))%*%(cmat%*%coef(lm.RIKZ))/3
## [,1]
## [1,] 2.865437
## Anova Table (Type II tests)
##
## Response: Richness
## Sum Sq Df F value Pr(>F)
## NAP 231.59 1 27.1999 6.335e-06 ***
## exposure 24.94 1 2.9289 0.09495 .
## week.cat 73.19 3 2.8654 0.04888 *
## Residuals 332.07 39
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Alternatively, we could consider a $\chi^2$ test, with the statistic calculated as follows:
$$\chi^2 = (C\hat{\beta})'(C\hat{\Sigma}_{\hat{\beta}}C')^{-1}(C\hat{\beta})$$
with degrees of freedom equal to the number of rows in $C$.
chisq<-t(cmat%*%coef(lm.RIKZ))%*%solve(cmat%*%vcov(lm.RIKZ)%*%t(cmat))%*%(cmat%*%coef(lm.RIKZ))
pchisq(chisq, df=3, lower.tail=FALSE)
## [,1]
## [1,] 0.03516874
We get a slightly smaller p-value in this case relative to the F−test, similar to what you would expect if you used a Normal distribution rather than a t-distribution to conduct a hypothesis test with small sample sizes.
The two tests are asymptotically equivalent (i.e., for large sample sizes). The $\chi^2$ test can be motivated by noting that asymptotically:
$$C\hat{\beta} \sim N(C\beta, C\hat{\Sigma}_{\hat{\beta}}C')$$
Thus, in the 1-dimensional case, the $\chi^2$ statistic is equivalent to the square of a z-statistic:
$$\left(\frac{C\hat{\beta} - 0}{SE(C\hat{\beta})}\right)^2$$
Lastly, we can also test hypotheses in which the regression parameters are set to specific values other than 0 by replacing $C\hat{\beta}$ with $C\hat{\beta} - C\tilde{\beta}$ in the expressions for the F and $\chi^2$ statistics (equations 3.2 and 3.3), where $\tilde{\beta}$ represents the values of $\beta$ under the null hypothesis.
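The matrix formulas can be wrapped in a small function that also handles null values other than 0. The sketch below uses simulated data; the function `wald_test` and its argument `beta0` (the full vector of coefficient values under the null hypothesis) are hypothetical names introduced for illustration.

```r
# Hypothetical helper: Wald F and chi-square tests of H0: C beta = C beta0
wald_test <- function(model, C, beta0 = rep(0, length(coef(model)))) {
  d    <- C %*% (coef(model) - beta0)         # C beta-hat minus its null value
  V    <- C %*% vcov(model) %*% t(C)          # estimated variance of C beta-hat
  chi2 <- drop(t(d) %*% solve(V) %*% d)
  k1   <- nrow(C)                             # numerator degrees of freedom
  Fval <- chi2 / k1
  c(F = Fval,
    p_F = pf(Fval, k1, df.residual(model), lower.tail = FALSE),
    chisq = chi2,
    p_chisq = pchisq(chi2, df = k1, lower.tail = FALSE))
}

set.seed(3)
sim <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
sim$y <- 1 + 0.5 * sim$x1 + 0.3 * sim$x2 + rnorm(60)
full    <- lm(y ~ x1 + x2 + x3, data = sim)
reduced <- lm(y ~ x1, data = sim)

C <- rbind(c(0, 0, 1, 0),
           c(0, 0, 0, 1))                     # picks out the x2 and x3 coefficients
wald_test(full, C)                            # H0: both coefficients are 0
anova(reduced, full)                          # the same F-test via model comparison
wald_test(full, C, beta0 = c(0, 0, 0.3, 0))   # H0: beta_x2 = 0.3 and beta_x3 = 0
```

When the null values are all 0, the F-statistic from the matrix formula agrees with the one from comparing the full and reduced models with `anova`.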
3.13 Visualizing multiple regression models
Consider a regression model with two explanatory variables, X1 and X2.
Yi=β0+Xi,1β1+Xi,2β2+ϵi
We have already noted that β1 reflects the “effect” of X1 on Y after accounting for X2. Specifically, it describes the expected change in Y for every 1 unit increase in X1 while holding X2 constant. If we want to visualize this effect, a simple strategy that is often used is to:
- Create a data set with X1 taking on a range of values and with X2 set to its mean value (for quantitative predictors) or its modal value (for categorical predictors).
- Generate predictions, ˆY, for each value in this data set and plot ˆY versus X1.
This strategy is easy to implement using the predict function in R and generalizes to models with more than two predictors. In addition, there are various packages that will construct this type of effect plot for you. In particular, we will look at the effects package (Fox, 2003; Fox & Weisberg, 2018, 2019b) for creating these types of plots in Section 16.
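The two steps above can be sketched with base R as follows. The data are simulated and all object names are hypothetical. The focal predictor `x1` is varied over its observed range while `x2` is held at its mean.

```r
set.seed(7)
n <- 100
sim <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n, mean = 5))
sim$y <- 2 + 1.5 * sim$x1 - 0.8 * sim$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = sim)

# Step 1: vary x1 over its observed range, hold x2 at its mean
newdat <- data.frame(
  x1 = seq(min(sim$x1), max(sim$x1), length.out = 50),
  x2 = mean(sim$x2)
)

# Step 2: predictions with pointwise 95% confidence intervals
pred <- predict(fit, newdata = newdat, interval = "confidence")
newdat <- cbind(newdat, pred)   # adds columns fit, lwr, upr

plot(newdat$x1, newdat$fit, type = "l", xlab = "x1",
     ylab = "Predicted y (x2 at its mean)",
     ylim = range(newdat$lwr, newdat$upr))
lines(newdat$x1, newdat$lwr, lty = 2)
lines(newdat$x1, newdat$upr, lty = 2)
```

Because `x2` is held constant, the slope of the plotted line equals the estimated coefficient for `x1`.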
When fitting a linear regression model with only 1 predictor, it is common to create a scatterplot of Y versus X along with the fitted regression line to visualize the effect of X on Y. This type of plot allows us to quickly visualize the amount of variability in Y explained by X (and also the amount of unexplained variability remaining). It would be nice to have a similar tool available for multiple regression models where model predictions are shown together with data. In the next sections, we will explore two options:
- Added variable plots, also known as partial regression plots
- Component + residual plots, also known as partial residual plots
These types of plots are not well known among ecologists and are arguably underutilized (Moya-Laraño & Corcobado, 2008). There are several functions in R that can be used to create added variable and component + residual plots:
- Added variable plots: avPlots in the car package
- Component + residual plots: crPlots in the car package and termplot in base R
We will explore the avPlots and crPlots functions in the sections that follow and termplot in Section 4.6.
3.13.1 Added variable plots
Added variable plots allow us to visualize the effect of $X_k$ after accounting for all other predictors. These plots can be constructed using the following steps:
1. Regress $Y$ against $X_{-k}$ (i.e., all predictors except $X_k$), and obtain the residuals.
2. Regress $X_k$ against all other predictors ($X_{-k}$) and obtain the residuals.
3. Plot the residuals from [1] (i.e., the part of $Y$ not explained by other predictors) against the residuals from [2] (the part of the focal predictor not explained by the other predictors). If we add a least-squares regression line relating these two sets of residuals, the slope will be equivalent to the slope in our full model containing all predictors.
Although there are functions in R to construct added variable plots (Section 3.13), we will demonstrate these steps using a simulated data set in the Data4Ecologists package. Specifically, the partialr data set was simulated so that y has a positive association with x1, a negative association with x2 (which is also negatively correlated with x1), a quadratic relationship with x3, and a spurious relationship with x4 (due to its correlation with x1). Let’s look at a pairwise scatterplot of the data set:
First, note that whenever predictor variables are correlated, as they will be in any observational data set, regression coefficients will change when we add or drop predictors from a model as these correlated variables “compete” to explain variance in the response variable (see e.g., Table 3.2). The magnitude and direction of these changes will depend on the sign and strength of the correlations among the different predictor variables (see Sections 6 and 7; Fieberg & Johnson, 2015). Thus, choosing an appropriate model can be challenging and should ideally be informed by one’s research question and an assumed Directed Acyclic Graph (DAG) capturing assumptions about how the world works (i.e., causal relationships between the predictor variables and the response variable; see Section 7).
lmx1.y <- lm(y ~ x1, data=partialr)
lmx2.y <- lm(y ~ x2, data=partialr)
lmx3.y <- lm(y ~ x3, data=partialr)
lmx4.y <- lm(y ~ x4, data=partialr)
lmxall.y <- lm(y ~ x1 + x2 + x3 + x4, data=partialr)
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| (Intercept) | -7.359 | -7.479 | -7.384 | -7.300 | -7.865 |
| x1 | 0.692 | | | | 1.562 |
| x2 | | -0.495 | | | -1.505 |
| x3 | | | 0.925 | | 0.842 |
| x4 | | | | -0.025 | -0.210 |
For now, let’s assume we have chosen to focus on the model containing all four predictor variables and we want to display the effect of x1 after accounting for the other predictors in the model using an added-variable plot. Let’s walk through the steps of this process:
1. Regress $Y$ against $X_{-k}$ (i.e., all predictors except x1) and obtain the residuals.
2. Regress $X_k$ (here, x1) against all other predictors ($X_{-k}$) and obtain the residuals.
3. Plot the residuals from [1] against the residuals from [2] along with a regression line relating these two sets of residuals. We also add a regression line through the origin with the slope coefficient for x1 from the original regression.
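lm.nox1.y <- lm(y ~ x2 + x3 + x4, data=partialr) # step 1: y on all predictors except x1
lm.x1.allotherx <- lm(x1 ~ x2 + x3 + x4, data=partialr) # step 2: x1 on all other predictors
# step 3: plot the residuals from step 1 against the residuals from step 2
plot(resid(lm.x1.allotherx), resid(lm.nox1.y),
     xlab="E(X1 | X2, X3, X4)", ylab= "E(Y | X2, X3, X4)")
# line through the origin with the slope for x1 from the full model
abline(c(0,coef(lmxall.y)[2]), col="red", lty=2, lwd=3)
(lmpartial<-lm(resid(lm.nox1.y)~resid(lm.x1.allotherx)-1))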
##
## Call:
## lm(formula = resid(lm.nox1.y) ~ resid(lm.x1.allotherx) - 1)
##
## Coefficients:
## resid(lm.x1.allotherx)
## 1.562
We see that the slope from the original regression is equivalent to the slope of the regression line relating the residuals from [1] to the residuals from [2].
Rather than construct similar plots for the other variables, we will use the avPlots function in the car package to produce the full suite of added variable plots:
Figure 3.12: Added variable plots for the partialr data set in the Data4Ecologists package (J. Fieberg, 2021) calculated using the avPlots function in the car package (Fox & Weisberg, 2019a).
We see that the slope of the line is near 0 in the plot for x4 (lower-right panel of Figure 3.12), suggesting that x4 provides little to no additional information useful for predicting y that is not already contained in the other predictor variables. Further, in the plot for x3 (lower-left panel of Figure 3.12), we see that it has a clear non-linear relationship with y even after accounting for the effects of x1, x2, and x4. Thus, we may want to add a quadratic term or use splines to relax the linearity assumption for this variable (see Section 4).
In summary, added-variable plots depict the slope and the scatter of points around the partial regression line in an analogous way to bi-variate plots in simple linear regression. These plots can be helpful for:
- visualizing the effect of predictor variables (given everything else already in the model)
- diagnosing whether some variables have a non-linear relationship with the response variable
- identifying potential influential points and outliers (
avPlots
highlights these with the row number in the data set)
One downside to added-variable plots is that the scales on the x- and y-axis do not match the scales of the original variables in the regression model.
3.13.2 Component + residual plots or partial residual plots
Component + residual plots, which are sometimes referred to as partial residual plots, offer a slightly different visualization by plotting:
$$X_i\hat{\beta}_i + \hat{\epsilon} \text{ versus } X_i,$$
where $X_i$ is the $i^{th}$ predictor variable and is the variable of interest. As shown below, $X_i\hat{\beta}_i + \hat{\epsilon}$ represents the part of $Y$ explained by $X_i$ that is not already explained by all the other predictors:
$$Y - \sum_{j \ne i} X_j\hat{\beta}_j = \hat{Y} + \hat{\epsilon} - \sum_{j \ne i} X_j\hat{\beta}_j = X_i\hat{\beta}_i + \hat{\epsilon}$$
There are a number of options in R for creating component + residual plots (see Section 3.13). This approach can be easily generalized to more complicated models that allow for non-linear relationships (e.g., quadratic terms), by replacing $X_i\hat{\beta}_i$ with multiple terms corresponding to the columns in the design matrix associated with the $i^{th}$ explanatory variable; however, component + residual plots are not appropriate if you include interactions in the model. Moya-Laraño & Corcobado (2008) suggest that component + residual plots are sometimes better than added variable plots at diagnosing non-linearities, but they are not as good as added-variable plots at depicting the amount of variability explained by each predictor (given everything else in the model).
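To make the definition concrete, partial residuals can be computed by hand. The sketch below uses simulated data (all names are hypothetical) in which the response depends on `x2` quadratically, but the fitted model includes only a linear term, so the plot for `x2` should reveal curvature.

```r
set.seed(11)
n <- 150
sim <- data.frame(x1 = rnorm(n), x2 = runif(n, -2, 2))
sim$y <- 1 + 2 * sim$x1 + sim$x2^2 + rnorm(n, sd = 0.5)   # y is quadratic in x2
fit <- lm(y ~ x1 + x2, data = sim)                         # misspecified: linear in x2

# Partial residuals for predictor i: X_i * beta_i-hat + residual
pres_x1 <- coef(fit)[["x1"]] * sim$x1 + resid(fit)
pres_x2 <- coef(fit)[["x2"]] * sim$x2 + resid(fit)

op <- par(mfrow = c(1, 2))
plot(sim$x1, pres_x1, xlab = "x1", ylab = "Component + residual")
abline(a = 0, b = coef(fit)[["x1"]], col = "red")          # the component line
plot(sim$x2, pres_x2, xlab = "x2", ylab = "Component + residual")
abline(a = 0, b = coef(fit)[["x2"]], col = "red")
lines(lowess(sim$x2, pres_x2), lty = 2)                    # smoother reveals curvature
par(op)
```

Regressing the partial residuals on the focal predictor recovers its coefficient from the full model. Note that `residuals(fit, type = "partial")` in base R returns the same quantities after centering each component, so its values differ from those above by a constant.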
Below, we demonstrate how to construct component + residual plots using the crPlots
function in the car
package.
3.13.3 Effect plots
Another way to visualize fitted regression models is to form effect plots using what Lüdecke (2018) refers to as either adjusted or marginal means. Plots of adjusted means are formed using predictions where a focal variable is varied over its range of observed values, while all non-focal variables are set to constant values (e.g., at their means or modal values). Marginal means are formed in much the same way, except that predictions are averaged across different levels of each categorical variable. These two types of means are equivalent if there are no categorical predictors in the model.
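The distinction between adjusted and marginal means can be sketched with `predict` and simulated data. All names below are hypothetical, and averaging over the factor levels with equal weights is an assumption of this sketch; packages may weight the levels differently (e.g., by their frequency in the data).

```r
set.seed(5)
sim <- data.frame(
  x   = runif(90, -1, 2),
  grp = factor(sample(c("a", "b", "c"), 90, replace = TRUE, prob = c(0.6, 0.3, 0.1)))
)
grp_shift <- c(a = 0, b = -3, c = 4)           # hypothetical group effects
sim$y <- 10 - 2 * sim$x + unname(grp_shift[as.character(sim$grp)]) + rnorm(90)
fit <- lm(y ~ x + grp, data = sim)

x_grid <- seq(-1, 2, by = 0.5)                 # values of the focal variable

# Adjusted means: non-focal factor fixed at one level (here its most common level)
modal_level <- names(which.max(table(sim$grp)))
adjusted <- predict(fit, newdata = data.frame(
  x = x_grid, grp = factor(modal_level, levels = levels(sim$grp))))

# Marginal means: average predictions over the levels of grp (equal weights assumed)
grid <- expand.grid(x = x_grid, grp = levels(sim$grp))
grid$pred <- predict(fit, newdata = grid)
marginal <- tapply(grid$pred, grid$x, mean)

cbind(x = x_grid, adjusted = adjusted, marginal = as.numeric(marginal))
```

In this additive model, the two sets of predictions are parallel lines that differ only by a constant vertical shift.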
Marginal means can be calculated using the effects function in the effects package and then plotted. Alternatively, we can use the ggeffect function in the ggeffects package (Lüdecke, 2018) to format the output and create plots using ggplot2 (Wickham, 2016). Adjusted means can be created using the ggpredict function in the ggeffects package or the visreg function in the visreg package (Breheny & Burchett, 2013). The visreg package also provides an option for producing contrast plots, which compare adjusted means to predictions obtained by setting all predictors (including the focal predictor) to specific reference values.
Below, we briefly illustrate the ggeffect and ggpredict functions using the RIKZdat data set and our linear model containing week and NAP (but not their interaction). If we use ggpredict or ggeffect with the argument terms = c("NAP", "week.cat"), we get predictions for a range of NAP values associated with each week. The output of these functions is a list with an associated print function that provides nicely formatted output.
## # Predicted values of Richness
##
## # week.cat = 1
##
## NAP | Predicted | 95% CI
## ----------------------------------
## -1.40 | 14.55 | [12.28, 16.82]
## -0.60 | 12.73 | [10.76, 14.70]
## 0.00 | 11.37 | [ 9.46, 13.28]
## 0.80 | 9.55 | [ 7.46, 11.64]
## 2.20 | 6.37 | [ 3.48, 9.27]
##
## # week.cat = 2
##
## NAP | Predicted | 95% CI
## ---------------------------------
## -1.40 | 6.92 | [ 4.56, 9.28]
## -0.60 | 5.11 | [ 3.24, 6.97]
## 0.00 | 3.74 | [ 2.12, 5.36]
## 0.80 | 1.93 | [ 0.34, 3.52]
## 2.20 | -1.25 | [-3.51, 1.00]
##
## # week.cat = 3
##
## NAP | Predicted | 95% CI
## ----------------------------------
## -1.40 | 8.37 | [ 6.04, 10.70]
## -0.60 | 6.55 | [ 4.71, 8.39]
## 0.00 | 5.19 | [ 3.58, 6.80]
## 0.80 | 3.37 | [ 1.78, 4.97]
## 2.20 | 0.19 | [-2.09, 2.48]
##
## # week.cat = 4
##
## NAP | Predicted | 95% CI
## ---------------------------------
## -1.40 | 11.95 | [8.65, 15.25]
## -0.60 | 10.14 | [7.21, 13.07]
## 0.00 | 8.77 | [6.01, 11.53]
## 0.80 | 6.96 | [4.25, 9.66]
## 2.20 | 3.78 | [0.68, 6.87]
We can then use a built-in plot function to visualize these predictions with partial residuals overlaid by adding residuals = TRUE (Figure 3.14).
Alternatively, if we want to create a plot just for NAP, we can use either ggpredict (for adjusted means) or ggeffect (for marginal means).
## # Predicted values of Richness
##
## NAP | Predicted | 95% CI
## ----------------------------------
## -1.40 | 14.55 | [12.35, 16.75]
## -1.00 | 13.64 | [11.61, 15.66]
## -0.40 | 12.28 | [10.40, 14.15]
## 0.00 | 11.37 | [ 9.51, 13.22]
## 0.40 | 10.46 | [ 8.55, 12.37]
## 0.80 | 9.55 | [ 7.52, 11.58]
## 1.20 | 8.64 | [ 6.44, 10.85]
## 2.20 | 6.37 | [ 3.56, 9.18]
##
## Adjusted for:
## * week.cat = 1
## # Predicted values of Richness
##
## NAP | Predicted | 95% CI
## ----------------------------------
## -1.40 | 9.66 | [ 7.78, 11.54]
## -1.00 | 8.75 | [ 7.19, 10.31]
## -0.40 | 7.39 | [ 6.24, 8.53]
## 0.00 | 6.48 | [ 5.52, 7.44]
## 0.40 | 5.57 | [ 4.67, 6.47]
## 0.80 | 4.66 | [ 3.67, 5.66]
## 1.20 | 3.75 | [ 2.55, 4.96]
## 2.20 | 1.48 | [-0.49, 3.45]
ggpredict forms predictions where week.cat is set to 1 (its reference value), whereas ggeffect generates predictions for each week, then averages these predictions, weighted by the proportion of observations in each week (for more on these calculations, see Section 16.5.4). As a result, the predicted values will differ between the two approaches even though the effect of NAP will look similar when we visualize the output (i.e., the slope of the depicted line is the same in both panels of Figure 3.15).
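library(patchwork)
# padj and pm are assumed to hold the ggpredict (adjusted means) and
# ggeffect (marginal means) results printed above
p1 <- plot(padj, residuals = TRUE, facet = TRUE)
p2 <- plot(pm, residuals = TRUE, facet = TRUE)
p1 + p2  # patchwork places the two panels side by side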
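The difference between adjusted and marginal means described above can also be reproduced by hand with predict. The sketch below is only an illustration under stated assumptions: it uses the built-in mtcars data with cyl treated as a categorical predictor (not the RIKZ data), it does not call the ggeffects functions, and the helper predict_at and the other object names are hypothetical.

```r
# Sketch only: built-in mtcars data stand in for the chapter's data set.
dat <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + cyl, data = dat)

# Values of the focal variable at which to form predictions
wt_grid <- seq(min(dat$wt), max(dat$wt), length.out = 5)
cyl_levels <- levels(dat$cyl)

# Hypothetical helper: predictions over wt_grid with cyl fixed at one level
predict_at <- function(level) {
  newdata <- data.frame(wt = wt_grid,
                        cyl = factor(level, levels = cyl_levels))
  unname(predict(fit, newdata = newdata))
}

# Adjusted means: non-focal factor held at its reference level
adjusted <- predict_at(cyl_levels[1])

# Marginal means: predict for every level (one column per level), then
# average using the proportion of observations in each level as weights
by_level <- sapply(cyl_levels, predict_at)
w <- as.numeric(prop.table(table(dat$cyl)))
marginal <- drop(by_level %*% w)

print(data.frame(wt = wt_grid, adjusted = adjusted, marginal = marginal))
```

In a model without interactions, the two sets of predictions should differ only by a constant, so lines drawn through them are parallel with a slope equal to the coefficient for the focal variable.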
Note: the normality assumption is required for small data sets, but the Central Limit Theorem (CLT) guarantees that the sampling distribution for a difference in sample means will be approximately Normally distributed for large samples; a common rule of thumb is that we need roughly 30 observations in each group for the CLT to apply.
Note: there are other variations on the t-test that could be applied if the variances of the two groups are not assumed to be equal.
Importantly, I want to formally recognize that a binary framework for sexual identification does not encompass everyone’s experience or identity and that more inclusive categories should be considered in human-subject research.
In fact, many more connections can be made between linear regression models and common statistical methods; see e.g., https://lindeloev.github.io/tests-as-linear/.
We refer to expected Richness here to signify that we need to average over $\epsilon_i$.
Note that our species richness measure is just the count of species on the sampled beach.
It is common with experimental data to test for significant interactions prior to testing main effects of individual predictors. For observational data, however, it is prudent to be more cautious. A sensible approach is often to include interactions only when they can be justified a priori based on biological grounds. Here, for illustrative purposes only, we will explore a model that includes an interaction between NAP and week.cat, but we suspect it would be difficult to motivate the need for this interaction, and the researchers did not design their study to test for it.
There are also type II sums of squares, which are most relevant to models that also include one or more interaction terms. We will not consider them here.