set.seed(5)
n_levels = 3
n_obs_per_level = 25
# Construct X
x = rep(c(1:n_levels),
each = n_obs_per_level)
# Assign group-wise means
mus = NULL
mus[1] = 5.0
mus[2] = 7.5
mus[3] = 5.8
sigma = 2.5 # residual sigma
# Simulate data obs
y = NULL
for(i in 1:length(x)){
this_level = x[i]
y[i] =
mus[this_level] +
# residual deviation from group-wise mean:
rnorm(1, mean = 0, sd = sigma)
}
# Store in data frame
one_way_df = data.frame(
y = y,
x = factor(x) # store as factor/categorical
)
# Plot the means:
## Defaults to boxplot() when x is discrete
plot(y~x, data = one_way_df)
8 ANOVA
8.1 One-way ANOVA
Remember that Analysis of Variance (ANOVA) is using linear regression to analyze how discrete input variables affect an outcome. Because the input variables are comprised of discrete “levels” (e.g., treatment groups), we are really comparing how the mean of an outcome varies among discrete groups. This is therefore a generalization of two-sample \(t\)-tests to the case of more than two samples/groups.
One-way ANOVA is the special case in which we only have one input variable, which has more than two “levels”. The ANOVA linear model can be written in several ways, but for me the easiest way to think about is as follows: \[y_{i,l} = \mu_l + \epsilon_{i,l}\] \[\epsilon \sim N(0, \sigma^2 I)\]
In this case \(l = 1,\dots,L\), where \(L\) is the number of levels in the input variable \(x\), and \(\mu_l\) is the mean outcome of group level, \(l\). Thus, each data observation \(y_{i,l}\) varies about a group level mean, with residuals \(\epsilon_{i,l}\), which are normally and independently distributed.
Perhaps this is easier to visualize and see the code. Let’s simulate a case in which we have a single input variable \(x\) that has 3 levels.
We can see that - at least visually - level 2 has a higher mean outcome than the other levels (e.g., \(\bar{y}_2\) seems largest). Let’s run the ANOVA model and see the summary.
Df Sum Sq Mean Sq F value Pr(>F)
x 2 125.5 62.74 10.32 0.000115 ***
Residuals 72 437.9 6.08
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
8.1.1 Manually calculate \(F\)
As we can see, the aov()
function is using an \(F\)-test to determine if any of the group-wise means differ from the global mean (see lecture materials on this topic). Indeed, based on the \(p\)-value, at least one group-wise mean is different. To verify our understand, as usual, let’s manually calculate the \(F\) test statistic and \(p\)-value.
# First, we need to run the "null" model (intercept only)
m_one_way_null = lm(y ~ 1,
data = one_way_df)
# Extract the residuals (errors)
resid_full = m_one_way$residuals
resid_null = m_one_way_null$residuals
# Sum of Square Errors (SSE)
sse_full = crossprod(resid_full)
sse_null = crossprod(resid_null)
# degrees of freedom
n_obs = length(y)
df_full = n_obs - n_levels
df_null = n_obs - 1
# Calculate F_stat
f_test = ((sse_null - sse_full)/(df_null - df_full)) / (sse_full/df_full)
# Degrees of freedom for the F distribution:
df_numerator = df_null - df_full
df_denominator = df_full
p_one_way = 1 - pf(f_test,
df1 = df_numerator,
df2 = df_denominator)
# Compare to anova()
f_test; p_one_way
[,1]
[1,] 10.31654
[,1]
[1,] 0.0001149183
summary_aov = summary(m_one_way)
summary_aov[[1]]$`F value`; summary_aov[[1]]$`Pr(>F)`
[1] 10.31654 NA
[1] 0.0001149183 NA
We see that our manual calculation of \(F\) and the corresponding \(p\)-value match with the output of the aov()
function, so we can verify we are understanding what aov()
is doing.
8.1.2 Tukey HSD
But, which specific means differ from each other? To know this, we need to compute all pairwise differences between the group-wise means. Remember that for \(L\) number of groups, we have \(L(L-1)/2\) number of pairwise comparisons. We want to correct for multiple comparisons, so we don’t inflate our risk of Type I errors. Therefore, we’ll conduct a Tukey Honest Significant Difference (HSD) test.
TukeyHSD(m_one_way)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = y ~ 1 + x, data = one_way_df)
$x
diff lwr upr p adj
2-1 2.9796804 1.310442 4.6489184 0.0001710
3-1 0.5570861 -1.112152 2.2263241 0.7050400
3-2 -2.4225942 -4.091832 -0.7533562 0.0024902
This test is looking at pairwise differences. For instance 2-1
is the difference between \(x\) levels 2 and 1, and so on. The output reports the raw difference in means as diff
, and then reports the lower and upper confidence limits of this difference as lwr
and upr
. These default to 95% confidence intervals, though the user can specify an option in the function to change this percentage confidence. Then the output reports the p adj
, which is the adjusted \(p\)-value, adjusting for multiple comparisons. Here, we can conclude that while levels 1 and 3 are not different from one another, level 2 is different from both levels 1 and 3. This makes sense when we look at the boxplot (again).
plot(y~x, data = one_way_df)
8.1.3 Manually calculate \(q\)
How does the Tukey HSD adjust for multiple comparisons? It uses a special hypothesis test, which has an associated probability distribution, tukey
. Remember that we calculate a test statistic, \(q\) for the difference between levels \(i\) and \(j\): \[q_{i,j} = \frac{|\bar{y}_i - \bar{y}_j|}{\sqrt{\hat{\sigma}^2_p/n}}\] Here, \(\hat{\sigma}^2_p\) is the pooled variance of the whole outcome data set, \(y\), and \(n\) is the number of observations per level. This is why a balanced design is important; the test assumes that each level has the same number of observations.
Let’s manually calculate the \(p\)-value for the difference between group levels 2 and 1.
# Need to extract the data observations associated with all x levels, separately
these_x_1 = which(x == 1)
these_x_2 = which(x == 2)
these_x_3 = which(x == 3)
y_1 = y[these_x_1]
y_2 = y[these_x_2]
y_3 = y[these_x_3]
# Calculate pooled variance for whole data set
# Notice how this is not the same as var(y)
pooled_var = (var(y_1)+var(y_2)+var(y_3))/3
# Calculate q test statistic
q_test =
abs(mean(y_1) - mean(y_2)) /
sqrt(pooled_var/n_obs_per_level)
q_test
[1] 6.041317
# Degrees of freedom for q test stat
df_q = n_obs - n_levels
# Calculate p-value
p_2v1 = ptukey(q_test,
nmeans = 3,
df = df_q,
lower.tail = FALSE)
p_2v1
[1] 0.0001710377
We can see this \(p\)-value matches the first p adj
from the TukeyHSD()
output.
8.2 Two-way ANOVA
Two-way ANOVA is the special case in which we have exactly two input variables, each of which has two or more “levels”. The two-way ANOVA linear model can be written in several ways. We’ll start with the easiest case in which each of the two input variables only has two levels. This refers to a “2x2” experimental design. To be concrete, we’ll use an example. We will simulate data for plant growth in which we manipulate Temperature (Low
or High
) and soil Moisture (Dry
or Wet
). We apply the 2x2 combination of these treatments which leads to four total treatments (e.g., Low-Dry, Low-Wet, etc.). We will apply each treatment combination to 25 plants and measure the outcome of Growth.
In the following model structure, we will code Temperature as Low == 0
and High == 1
, and we will code soil Moisture as Dry == 0
and Wet == 1
.
\[y_{i} = \mu + \beta_{\text{Temp}}\text{Temp} + \beta_{\text{Moist}}\text{Moist} + \beta_{\text{Intx}}\text{Temp}\text{Moist} + \epsilon_{i,l}\] \[\epsilon \sim N(0, \sigma^2 I)\]
In the model structure, \(\mu\) is a global mean for \(y\) (i.e., across all treatments). Then, this global mean can be altered (i.e. affected by) the treatment combinations. \(\beta_{\text{Temp}}\) and \(\beta_{\text{Moist}}\) are the “main” effects and \(\beta_{\text{Intx}}\) is the interactive effect. Temp
and Moist
are binary indicator variables (0/1), so, for instance, if Temperature is Low, Temp == 0
, and so on. This means that \(\beta_{\text{Temp}}\) only gets added to the global mean when Temp == 1
(i.e., Temperature is High), and so on. The \(\beta_{\text{Intx}}\) would get added to the global mean if both Temp and Moist are 1, so High
Temperature and Wet
Moisture.
8.2.1 Only main effects
First, let’s simulate a case in which we only have main effects, no interaction. Specifically, we will assume that the plant Growth declines under High
Temperature, but that there is no effect of soil Moisture. Also, for data visualization, we will use the ggplot2
package, because it is easier to customize.
library(ggplot2)
# 2x2 design
# Replicated 25 times
# Low, High (0, 1)
n_reps = 25
Temp = rep(0:1, each = n_reps*2)
# Low, High (0, 1)
Moisture = c(
rep(0:1, each = n_reps),
rep(0:1, each = n_reps)
)
# Simulate different effects:
set.seed(8)
# Just main effect
global = 5
beta_t = -1.25
beta_m = 0
beta_intx = 0
sigma = 1.0
y = NULL
for(i in 1:length(Temp)){
y[i] = global +
beta_t * Temp[i] +
beta_m * Moisture[i] +
beta_intx * Temp[i] * Moisture[i] +
rnorm(1, mean = 0, sd = sigma)
}
# Store as data frame
two_way_df1 = data.frame(
Growth = y,
Temp = factor(Temp, levels = c(0,1), labels = c("Low", "High")),
Moisture = factor(Moisture, levels = c(0,1), labels = c("Dry", "Wet"))
)
ggplot(two_way_df1) +
geom_boxplot(aes(x = Temp, y = Growth, color = Moisture))
Here, it is visually clear that higher temperatures lead to lower plant growth, but there is no clear effect of soil moisture, just as we simulated.
Let’s run the ANOVA model and see if the output makes sense.
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 43.31 43.31 36.134 3.31e-08 ***
Moisture 1 0.00 0.00 0.002 0.963
Temp:Moisture 1 0.00 0.00 0.000 0.999
Residuals 96 115.06 1.20
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Indeed, we only see a main effect of Temp
, and no main effect of Moisture
, and no interactive effect (Temp:Moisture
).
8.2.2 Interactive effect
Now will simulate a case in which we have a main effect of temperature, similar to the above (Growth declines at High
Temp). We will also add a positive interactive effect. This means that the effect of specific effects of temperature on growth will depend on the soil moisture content. Specifically, with a positive interactive effect, if Temperature is High
and Moisture is Wet
, then we will get an increase in Growth, rather than a decline. Let’s see what this looks like visually.
# INTERACTION
global = 5
beta_t = -1.25
beta_m = 0
beta_intx = 2.5
sigma = 1.0
set.seed(5)
y = NULL
for(i in 1:length(Temp)){
y[i] = global +
beta_t * Temp[i] +
beta_m * Moisture[i] +
beta_intx * Temp[i] * Moisture[i] +
rnorm(1, mean = 0, sd = sigma)
}
# Store as data frame
two_way_df2 = data.frame(
Growth = y,
Temp = factor(Temp, levels = c(0,1), labels = c("Low", "High")),
Moisture = factor(Moisture, levels = c(0,1), labels = c("Dry", "Wet"))
)
ggplot(two_way_df2) +
geom_boxplot(aes(x = Temp, y = Growth, color = Moisture))
What wee see is that the effect of Temperature depends on the value of soil Moisture. In this case, as Temperature moves from Low
to High
, plant Growth declines if the soil is Dry
, but Growth increases if the soil is Wet
.
But those trends are just visual at this point. How do we quantify whether specific comparisons are statistically significant? We will again use the Tukey HSD! First, run the ANOVA and verify that there is a significant interaction.
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 0.11 0.11 0.122 0.727
Moisture 1 54.20 54.20 59.746 1.06e-11 ***
Temp:Moisture 1 41.00 41.00 45.190 1.28e-09 ***
Residuals 96 87.09 0.91
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Indeed, the interaction is significant, so now we need to figure out which specific differences among covariate levels exist.
TukeyHSD(aov_intx)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Growth ~ Temp + Moisture + Temp:Moisture, data = two_way_df2)
$Temp
diff lwr upr p adj
High-Low -0.06659871 -0.4447271 0.3115297 0.7273973
$Moisture
diff lwr upr p adj
Wet-Dry 1.472439 1.094311 1.850567 0
$`Temp:Moisture`
diff lwr upr p adj
High:Dry-Low:Dry -1.3471656 -2.0515400 -0.6427911 0.0000152
Low:Wet-Low:Dry 0.1918721 -0.5125023 0.8962466 0.8920281
High:Wet-Low:Dry 1.4058403 0.7014658 2.1102147 0.0000062
Low:Wet-High:Dry 1.5390377 0.8346633 2.2434121 0.0000007
High:Wet-High:Dry 2.7530058 2.0486314 3.4573803 0.0000000
High:Wet-Low:Wet 1.2139681 0.5095937 1.9183426 0.0001084
The important part of the out put is $'Temp:Moisture'
, which shows the pairwise tests of the interactive effects. See if you can understand which differences are statistically significant, after accounting for multiple comparisons.
8.3 ANCOVA
ANCOVA is a variant of ANOVA, which stands for “Analysis of Covariance”. For one-way ANCOVA, we are analyzing the effect of one discrete (i.e., categorical) input variable, plus one continuous input variable. The “covariance” part refers to the fact that we’re trying to understand if the slope of the continuous input variable is the same or different between the levels of the discrete input variable.
For one-way ANCOVA, we have a single discrete input variable, and let’s assume the levels are 0 or 1. The model for one-way ANCOVA can then be written as: \[y_i = \mu_0 + \alpha_1 X_i + \beta_1 Z_i + \beta_2 X_i Z_i + \epsilon_i\] Here, \(\beta_1\) is the slope of \(Z\) when \(X=0\), but \(\beta_1 + \beta_2\) is the slope of \(Z\) when \(X=1\). If there is no interaction (i.e., \(\beta_2=0\)), then the slope of \(Z\) is the same for the two levels of \(X\).
Let’s continue with our example from two-way ANOVA, where we have Temperature and soil Moisture. However, for ANCOVA, we will assume that Temperature is a continuous input variable, rather than a discrete one.
8.3.1 Create data set and model function
We are going to explore all four possible outcomes of the simple one-way ANCOVA. We will therefore create a static data set, as well as a function to simulate the outcome variable \(Y\), depending on the model parameters.
library(ggplot2)
set.seed(7)
n_rep = 35
# Dry, Wet (0, 1)
Moisture = rep(0:1, each = n_rep)
# Continuous Temperature
Temp = c(
runif(n_rep, min = 5, max = 25),
runif(n_rep, min = 5, max = 25)
)
# Function to simulate observed data
calc_y_func = function(
mu_0 = 0,
alpha_1 = 0,
beta_1 = 0,
beta_2 = 0,
sigma = 1.0){
y = NULL
for(i in 1:length(Temp)){
y[i] = mu_0 +
alpha_1 * Moisture[i] +
beta_1 * Temp[i] +
beta_2 * Temp[i] * Moisture[i] +
rnorm(1, mean = 0, sd = sigma)
}
return(y)
}
8.3.2 Main effect of Temperature
First, we’ll see what the data look like when we only have an effect of the continuous input variable, in this case Temperature.
set.seed(2)
# 1. Only slope effect
mu_0 = 5
alpha_1 = 0
beta_1 = 0.75
beta_2 = 0
sigma = 2.25
y1 = calc_y_func(
mu_0,alpha_1,beta_1,
beta_2,sigma
)
# Store as data frame
ancova_df1 = data.frame(
Growth = y1,
Temp = Temp,
Moisture = factor(Moisture, levels = c(0,1), labels = c("Dry", "Wet"))
)
ggplot(ancova_df1) +
geom_point(aes(x = Temp, y = Growth,
shape = Moisture, color = Moisture))
We can see that while there is clearly a linear effect of Temperature on plant Growth, there is no obvious difference between Dry and Wet soil Moisture.
Now, let’s use model simplification to statistically validate our visual interpretation of the data.
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 1422.8 1422.8 214.231 <2e-16 ***
Moisture 1 17.0 17.0 2.559 0.114
Temp:Moisture 1 14.6 14.6 2.204 0.142
Residuals 66 438.3 6.6
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Model 1: Growth ~ Temp + Moisture + Temp * Moisture
Model 2: Growth ~ Temp + Moisture
Res.Df RSS Df Sum of Sq F Pr(>F)
1 66 438.35
2 67 452.99 -1 -14.641 2.2044 0.1424
# Notice how this F and p-value is the same from
# the summary() table of m1.
m3 = update(m2, .~. -Moisture)
summary(m3)
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 1423 1422.8 205.9 <2e-16 ***
Residuals 68 470 6.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check the slope and intercept
m3$coefficients
(Intercept) Temp
4.487613 0.787559
The model simplification validates that there is only an effect of Temperature, and the estimated slope matches our simulated value pretty closely.
8.3.3 Main effect of Moisture
Now, let’s assume there is no effect of Temperature, but there is a difference in the levels of Moisture.
set.seed(5)
# 2. Only factor effect
mu_0 = 5
alpha_1 = 4
beta_1 = 0
beta_2 = 0
sigma = 2.0
y2 = calc_y_func(
mu_0,alpha_1,beta_1,
beta_2,sigma
)
# Store as data frame
ancova_df2 = ancova_df1
ancova_df2$Growth = y2
ggplot(ancova_df2) +
geom_point(aes(x = Temp, y = Growth,
shape = Moisture, color = Moisture))
We can see the difference in the means of Moisture levels, but no clear, linear Temperature effect. We will validate this with model simplification.
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 7.73 7.73 1.875 0.176
Moisture 1 173.80 173.80 42.164 1.28e-08 ***
Temp:Moisture 1 0.04 0.04 0.009 0.923
Residuals 66 272.06 4.12
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 7.73 7.73 1.904 0.172
Moisture 1 173.80 173.80 42.797 9.92e-09 ***
Residuals 67 272.10 4.06
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Df Sum Sq Mean Sq F value Pr(>F)
Moisture 1 181.3 181.3 45.27 4.37e-09 ***
Residuals 68 272.3 4.0
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Indeed, the model with only Moisture is best. We can then use Tukey HSD to validate the specific differences in the Moisture levels.
TukeyHSD(m3)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Growth ~ Moisture, data = ancova_df2)
$Moisture
diff lwr upr p adj
Wet-Dry 3.218663 2.264057 4.173269 0
The test shows that Wet soil leads to an approximately 3.2 value increase in plant Growth over dry soil, and this is a statistically significant difference (p-value is very low, near zero).
We can use ggplot2
to help us visualize this differnce more clearly.
ggplot(ancova_df2, aes(x = Temp, y = Growth,
color = Moisture)) +
geom_point() +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
8.3.4 Main effects of Temperature and Moisture
Now, let’s assume there is a linear effect of Temperature, and there is a difference in the levels of Moisture. But, there is still no interaction.
set.seed(1)
# 3. Factor and Slope effect
mu_0 = 3
alpha_1 = 5
beta_1 = 0.5
beta_2 = 0
sigma = 2.0
y3 = calc_y_func(
mu_0,alpha_1,beta_1,
beta_2,sigma
)
# Store as data frame
ancova_df3 = ancova_df1
ancova_df3$Growth = y3
ggplot(ancova_df3) +
geom_point(aes(x = Temp, y = Growth,
shape = Moisture, color = Moisture))
We can see the difference in the means of Moisture levels, and an obvious linear Temperature effect. Again, we will validate this with model simplification.
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 697.2 697.2 197.926 <2e-16 ***
Moisture 1 491.1 491.1 139.408 <2e-16 ***
Temp:Moisture 1 0.3 0.3 0.082 0.776
Residuals 66 232.5 3.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 697.2 697.2 200.7 <2e-16 ***
Moisture 1 491.1 491.1 141.3 <2e-16 ***
Residuals 67 232.8 3.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# m2 is best
m2$coefficients
(Intercept) Temp MoistureWet
3.5666389 0.4710967 5.3763016
The test shows that Wet soil leads to an approximately 5.3 value increase in plant Growth over dry soil, and that Temperature has a similar positive, linear effect on plant Growth for both wet and dry soil.
We can use ggplot2
to help us visualize this difference more clearly.
ggplot(ancova_df3, aes(x = Temp, y = Growth,
color = Moisture)) +
geom_point() +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
8.3.5 Interaction between Temperature and Moisture
Finally, let’s assume there is an interaction between Moisture and Temperature.
########################
set.seed(1)
# 4. Positive interaction
mu_0 = 15
alpha_1 = 0
beta_1 = -0.45
beta_2 = 0.75
sigma = 2.0
y4 = calc_y_func(
mu_0,alpha_1,beta_1,
beta_2,sigma
)
# Store as data frame
ancova_df4 = ancova_df1
ancova_df4$Growth = y4
ggplot(ancova_df4) +
geom_point(aes(x = Temp, y = Growth,
shape = Moisture, color = Moisture))
We can visually notice how the slope of Temperature depends on whether soil Moisture is Wet or Dry. But we need to conduct model simplification to validate this.
Df Sum Sq Mean Sq F value Pr(>F)
Temp 1 3.0 3.0 0.857 0.358
Moisture 1 2466.5 2466.5 700.191 < 2e-16 ***
Temp:Moisture 1 303.6 303.6 86.190 1.37e-13 ***
Residuals 66 232.5 3.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# cannot drop interaction
# full model is best
m1$coefficients
(Intercept) Temp MoistureWet Temp:MoistureWet
15.686070436 -0.487258985 0.005302987 0.773837781
We see the model coefficients match well to our simulated values. Specifically, there is a positive interaction: the slope of Temperature when soil is Dry is negative (\(-0.49\)), but when the soil is Wet, the slope of temperature increases and becomes positive (\(-0.49 + 0.77 = 0.28\)).
Let’s visualize this more clearly with ggplot2
.
ggplot(ancova_df4, aes(x = Temp, y = Growth,
color = Moisture)) +
geom_point() +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'