8  ANOVA

8.1 One-way ANOVA

Remember that Analysis of Variance (ANOVA) is using linear regression to analyze how discrete input variables affect an outcome. Because the input variables are comprised of discrete “levels” (e.g., treatment groups), we are really comparing how the mean of an outcome varies among discrete groups. This is therefore a generalization of two-sample \(t\)-tests to the case of more than two samples/groups.

One-way ANOVA is the special case in which we only have one input variable, which has more than two “levels”. The ANOVA linear model can be written in several ways, but for me the easiest way to think about is as follows: \[y_{i,l} = \mu_l + \epsilon_{i,l}\] \[\epsilon \sim N(0, \sigma^2 I)\]

In this case \(l = 1,\dots,L\), where \(L\) is the number of levels in the input variable \(x\), and \(\mu_l\) is the mean outcome of group level, \(l\). Thus, each data observation \(y_{i,l}\) varies about a group level mean, with residuals \(\epsilon_{i,l}\), which are normally and independently distributed.

Perhaps this is easier to visualize and see the code. Let’s simulate a case in which we have a single input variable \(x\) that has 3 levels.

set.seed(5)

n_levels = 3
n_obs_per_level = 25

# Construct X
x = rep(c(1:n_levels), 
        each = n_obs_per_level)

# Assign group-wise means
mus = NULL
mus[1] = 5.0
mus[2] = 7.5
mus[3] = 5.8
sigma = 2.5 # residual sigma

# Simulate data obs
y = NULL
for(i in 1:length(x)){
    this_level = x[i]
    y[i] =  
        mus[this_level] +
        # residual deviation from group-wise mean:
        rnorm(1, mean = 0, sd = sigma)
}

# Store in data frame
one_way_df = data.frame(
    y = y,
    x = factor(x) # store as factor/categorical
)

# Plot the means:
## Defaults to boxplot() when x is discrete
plot(y~x, data = one_way_df)

We can see that - at least visually - level 2 has a higher mean outcome than the other levels (e.g., \(\bar{y}_2\) seems largest). Let’s run the ANOVA model and see the summary.

# Use the aov() function
m_one_way = aov(y ~ 1 + x, data = one_way_df)

summary(m_one_way)
            Df Sum Sq Mean Sq F value   Pr(>F)    
x            2  125.5   62.74   10.32 0.000115 ***
Residuals   72  437.9    6.08                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.1.1 Manually calculate \(F\)

As we can see, the aov() function is using an \(F\)-test to determine if any of the group-wise means differ from the global mean (see lecture materials on this topic). Indeed, based on the \(p\)-value, at least one group-wise mean is different. To verify our understand, as usual, let’s manually calculate the \(F\) test statistic and \(p\)-value.

# First, we need to run the "null" model (intercept only)
m_one_way_null = lm(y ~ 1, 
                    data = one_way_df)
    
# Extract the residuals (errors)
resid_full = m_one_way$residuals
resid_null = m_one_way_null$residuals

# Sum of Square Errors (SSE)
sse_full = crossprod(resid_full)
sse_null = crossprod(resid_null)

# degrees of freedom
n_obs = length(y)
df_full = n_obs - n_levels
df_null = n_obs - 1

# Calculate F_stat
f_test = ((sse_null - sse_full)/(df_null - df_full)) / (sse_full/df_full)

# Degrees of freedom for the F distribution:
df_numerator = df_null - df_full
df_denominator = df_full

p_one_way = 1 - pf(f_test,
                   df1 = df_numerator,
                   df2 = df_denominator)

# Compare to anova()
f_test; p_one_way
         [,1]
[1,] 10.31654
             [,1]
[1,] 0.0001149183
summary_aov = summary(m_one_way)
summary_aov[[1]]$`F value`; summary_aov[[1]]$`Pr(>F)`
[1] 10.31654       NA
[1] 0.0001149183           NA

We see that our manual calculation of \(F\) and the corresponding \(p\)-value match with the output of the aov() function, so we can verify we are understanding what aov() is doing.

8.1.2 Tukey HSD

But, which specific means differ from each other? To know this, we need to compute all pairwise differences between the group-wise means. Remember that for \(L\) number of groups, we have \(L(L-1)/2\) number of pairwise comparisons. We want to correct for multiple comparisons, so we don’t inflate our risk of Type I errors. Therefore, we’ll conduct a Tukey Honest Significant Difference (HSD) test.

TukeyHSD(m_one_way)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = y ~ 1 + x, data = one_way_df)

$x
          diff       lwr        upr     p adj
2-1  2.9796804  1.310442  4.6489184 0.0001710
3-1  0.5570861 -1.112152  2.2263241 0.7050400
3-2 -2.4225942 -4.091832 -0.7533562 0.0024902

This test is looking at pairwise differences. For instance 2-1 is the difference between \(x\) levels 2 and 1, and so on. The output reports the raw difference in means as diff, and then reports the lower and upper confidence limits of this difference as lwr and upr. These default to 95% confidence intervals, though the user can specify an option in the function to change this percentage confidence. Then the output reports the p adj, which is the adjusted \(p\)-value, adjusting for multiple comparisons. Here, we can conclude that while levels 1 and 3 are not different from one another, level 2 is different from both levels 1 and 3. This makes sense when we look at the boxplot (again).

plot(y~x, data = one_way_df)

8.1.3 Manually calculate \(q\)

How does the Tukey HSD adjust for multiple comparisons? It uses a special hypothesis test, which has an associated probability distribution, tukey. Remember that we calculate a test statistic, \(q\) for the difference between levels \(i\) and \(j\): \[q_{i,j} = \frac{|\bar{y}_i - \bar{y}_j|}{\sqrt{\hat{\sigma}^2_p/n}}\] Here, \(\hat{\sigma}^2_p\) is the pooled variance of the whole outcome data set, \(y\), and \(n\) is the number of observations per level. This is why a balanced design is important; the test assumes that each level has the same number of observations.

Let’s manually calculate the \(p\)-value for the difference between group levels 2 and 1.

# Need to extract the data observations associated with all x levels, separately
these_x_1 = which(x == 1)
these_x_2 = which(x == 2)
these_x_3 = which(x == 3)
y_1 = y[these_x_1]
y_2 = y[these_x_2]
y_3 = y[these_x_3]

# Calculate pooled variance for whole data set
# Notice how this is not the same as var(y)
pooled_var = (var(y_1)+var(y_2)+var(y_3))/3

# Calculate q test statistic
q_test = 
    abs(mean(y_1) - mean(y_2)) /
    sqrt(pooled_var/n_obs_per_level)

q_test
[1] 6.041317
# Degrees of freedom for q test stat
df_q = n_obs - n_levels

# Calculate p-value
p_2v1 = ptukey(q_test, 
               nmeans = 3,
               df = df_q, 
               lower.tail = FALSE)
p_2v1
[1] 0.0001710377

We can see this \(p\)-value matches the first p adj from the TukeyHSD() output.

8.2 Two-way ANOVA

Two-way ANOVA is the special case in which we have exactly two input variables, each of which has two or more “levels”. The two-way ANOVA linear model can be written in several ways. We’ll start with the easiest case in which each of the two input variables only has two levels. This refers to a “2x2” experimental design. To be concrete, we’ll use an example. We will simulate data for plant growth in which we manipulate Temperature (Low or High) and soil Moisture (Dry or Wet). We apply the 2x2 combination of these treatments which leads to four total treatments (e.g., Low-Dry, Low-Wet, etc.). We will apply each treatment combination to 25 plants and measure the outcome of Growth.

In the following model structure, we will code Temperature as Low == 0 and High == 1, and we will code soil Moisture as Dry == 0 and Wet == 1.

\[y_{i} = \mu + \beta_{\text{Temp}}\text{Temp} + \beta_{\text{Moist}}\text{Moist} + \beta_{\text{Intx}}\text{Temp}\text{Moist} + \epsilon_{i,l}\] \[\epsilon \sim N(0, \sigma^2 I)\]

In the model structure, \(\mu\) is a global mean for \(y\) (i.e., across all treatments). Then, this global mean can be altered (i.e. affected by) the treatment combinations. \(\beta_{\text{Temp}}\) and \(\beta_{\text{Moist}}\) are the “main” effects and \(\beta_{\text{Intx}}\) is the interactive effect. Temp and Moist are binary indicator variables (0/1), so, for instance, if Temperature is Low, Temp == 0, and so on. This means that \(\beta_{\text{Temp}}\) only gets added to the global mean when Temp == 1 (i.e., Temperature is High), and so on. The \(\beta_{\text{Intx}}\) would get added to the global mean if both Temp and Moist are 1, so High Temperature and Wet Moisture.

8.2.1 Only main effects

First, let’s simulate a case in which we only have main effects, no interaction. Specifically, we will assume that the plant Growth declines under High Temperature, but that there is no effect of soil Moisture. Also, for data visualization, we will use the ggplot2 package, because it is easier to customize.

library(ggplot2)

# 2x2 design
# Replicated 25 times
# Low, High (0, 1)
n_reps = 25
Temp = rep(0:1, each = n_reps*2)

# Low, High (0, 1)
Moisture = c(
    rep(0:1, each = n_reps),
    rep(0:1, each = n_reps)
)

# Simulate different effects:
set.seed(8)
# Just main effect
global = 5
beta_t = -1.25
beta_m = 0
beta_intx = 0
sigma = 1.0

y = NULL
for(i in 1:length(Temp)){
    y[i] = global + 
        beta_t * Temp[i] +
        beta_m * Moisture[i] +
        beta_intx * Temp[i] * Moisture[i] +
        rnorm(1, mean = 0, sd = sigma)
}

# Store as data frame
two_way_df1 = data.frame(
    Growth = y,
    Temp = factor(Temp, levels = c(0,1), labels = c("Low", "High")),
    Moisture = factor(Moisture, levels = c(0,1), labels = c("Dry", "Wet"))
)


ggplot(two_way_df1) +
    geom_boxplot(aes(x = Temp, y = Growth, color = Moisture))

Here, it is visually clear that higher temperatures lead to lower plant growth, but there is no clear effect of soil moisture, just as we simulated.

Let’s run the ANOVA model and see if the output makes sense.

summary(aov(Growth ~ Temp + Moisture + Temp:Moisture,
            data = two_way_df1))
              Df Sum Sq Mean Sq F value   Pr(>F)    
Temp           1  43.31   43.31  36.134 3.31e-08 ***
Moisture       1   0.00    0.00   0.002    0.963    
Temp:Moisture  1   0.00    0.00   0.000    0.999    
Residuals     96 115.06    1.20                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Indeed, we only see a main effect of Temp, and no main effect of Moisture, and no interactive effect (Temp:Moisture).

8.2.2 Interactive effect

Now will simulate a case in which we have a main effect of temperature, similar to the above (Growth declines at High Temp). We will also add a positive interactive effect. This means that the effect of specific effects of temperature on growth will depend on the soil moisture content. Specifically, with a positive interactive effect, if Temperature is High and Moisture is Wet, then we will get an increase in Growth, rather than a decline. Let’s see what this looks like visually.

# INTERACTION
global = 5
beta_t = -1.25
beta_m = 0
beta_intx = 2.5
sigma = 1.0

set.seed(5)
y = NULL
for(i in 1:length(Temp)){
    y[i] = global + 
        beta_t * Temp[i] +
        beta_m * Moisture[i] +
        beta_intx * Temp[i] * Moisture[i] +
        rnorm(1, mean = 0, sd = sigma)
}

# Store as data frame
two_way_df2 = data.frame(
    Growth = y,
    Temp = factor(Temp, levels = c(0,1), labels = c("Low", "High")),
    Moisture = factor(Moisture, levels = c(0,1), labels = c("Dry", "Wet"))
)

ggplot(two_way_df2) +
    geom_boxplot(aes(x = Temp, y = Growth, color = Moisture))

What wee see is that the effect of Temperature depends on the value of soil Moisture. In this case, as Temperature moves from Low to High, plant Growth declines if the soil is Dry, but Growth increases if the soil is Wet.

But those trends are just visual at this point. How do we quantify whether specific comparisons are statistically significant? We will again use the Tukey HSD! First, run the ANOVA and verify that there is a significant interaction.

aov_intx = aov(Growth ~ Temp + Moisture + Temp:Moisture,
               data = two_way_df2)
summary(aov_intx)
              Df Sum Sq Mean Sq F value   Pr(>F)    
Temp           1   0.11    0.11   0.122    0.727    
Moisture       1  54.20   54.20  59.746 1.06e-11 ***
Temp:Moisture  1  41.00   41.00  45.190 1.28e-09 ***
Residuals     96  87.09    0.91                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Indeed, the interaction is significant, so now we need to figure out which specific differences among covariate levels exist.

TukeyHSD(aov_intx)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Growth ~ Temp + Moisture + Temp:Moisture, data = two_way_df2)

$Temp
                diff        lwr       upr     p adj
High-Low -0.06659871 -0.4447271 0.3115297 0.7273973

$Moisture
            diff      lwr      upr p adj
Wet-Dry 1.472439 1.094311 1.850567     0

$`Temp:Moisture`
                        diff        lwr        upr     p adj
High:Dry-Low:Dry  -1.3471656 -2.0515400 -0.6427911 0.0000152
Low:Wet-Low:Dry    0.1918721 -0.5125023  0.8962466 0.8920281
High:Wet-Low:Dry   1.4058403  0.7014658  2.1102147 0.0000062
Low:Wet-High:Dry   1.5390377  0.8346633  2.2434121 0.0000007
High:Wet-High:Dry  2.7530058  2.0486314  3.4573803 0.0000000
High:Wet-Low:Wet   1.2139681  0.5095937  1.9183426 0.0001084

The important part of the out put is $'Temp:Moisture', which shows the pairwise tests of the interactive effects. See if you can understand which differences are statistically significant, after accounting for multiple comparisons.

8.3 ANCOVA

ANCOVA is a variant of ANOVA, which stands for “Analysis of Covariance”. For one-way ANCOVA, we are analyzing the effect of one discrete (i.e., categorical) input variable, plus one continuous input variable. The “covariance” part refers to the fact that we’re trying to understand if the slope of the continuous input variable is the same or different between the levels of the discrete input variable.

For one-way ANCOVA, we have a single discrete input variable, and let’s assume the levels are 0 or 1. The model for one-way ANCOVA can then be written as: \[y_i = \mu_0 + \alpha_1 X_i + \beta_1 Z_i + \beta_2 X_i Z_i + \epsilon_i\] Here, \(\beta_1\) is the slope of \(Z\) when \(X=0\), but \(\beta_1 + \beta_2\) is the slope of \(Z\) when \(X=1\). If there is no interaction (i.e., \(\beta_2=0\)), then the slope of \(Z\) is the same for the two levels of \(X\).

Let’s continue with our example from two-way ANOVA, where we have Temperature and soil Moisture. However, for ANCOVA, we will assume that Temperature is a continuous input variable, rather than a discrete one.

8.3.1 Create data set and model function

We are going to explore all four possible outcomes of the simple one-way ANCOVA. We will therefore create a static data set, as well as a function to simulate the outcome variable \(Y\), depending on the model parameters.

library(ggplot2)
set.seed(7)
n_rep = 35
# Dry, Wet (0, 1)
Moisture = rep(0:1, each = n_rep)
# Continuous Temperature
Temp = c(
    runif(n_rep, min = 5, max = 25),
    runif(n_rep, min = 5, max = 25)
)

# Function to simulate observed data

calc_y_func = function(
        mu_0 = 0,
        alpha_1 = 0,
        beta_1 = 0,
        beta_2 = 0,
        sigma = 1.0){
    y = NULL
    for(i in 1:length(Temp)){
        y[i] = mu_0 + 
            alpha_1 * Moisture[i] +
            beta_1 * Temp[i] +
            beta_2 * Temp[i] * Moisture[i] +
            rnorm(1, mean = 0, sd = sigma)
    }
    
    return(y)
}

8.3.2 Main effect of Temperature

First, we’ll see what the data look like when we only have an effect of the continuous input variable, in this case Temperature.

set.seed(2)
# 1. Only slope effect
mu_0 = 5
alpha_1 = 0
beta_1 = 0.75
beta_2 = 0
sigma = 2.25

y1 = calc_y_func(
    mu_0,alpha_1,beta_1,
    beta_2,sigma
)

# Store as data frame
ancova_df1 = data.frame(
    Growth = y1,
    Temp = Temp,
    Moisture = factor(Moisture, levels = c(0,1), labels = c("Dry", "Wet"))
)

ggplot(ancova_df1) +
    geom_point(aes(x = Temp, y = Growth, 
                   shape = Moisture, color = Moisture))

We can see that while there is clearly a linear effect of Temperature on plant Growth, there is no obvious difference between Dry and Wet soil Moisture.

Now, let’s use model simplification to statistically validate our visual interpretation of the data.

m1 = aov(Growth ~ Temp + Moisture + Temp*Moisture, data = ancova_df1)
summary(m1)
              Df Sum Sq Mean Sq F value Pr(>F)    
Temp           1 1422.8  1422.8 214.231 <2e-16 ***
Moisture       1   17.0    17.0   2.559  0.114    
Temp:Moisture  1   14.6    14.6   2.204  0.142    
Residuals     66  438.3     6.6                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We can try dropping the interaction
m2 = update(m1, .~. -Temp:Moisture)
anova(m1, m2)
Analysis of Variance Table

Model 1: Growth ~ Temp + Moisture + Temp * Moisture
Model 2: Growth ~ Temp + Moisture
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     66 438.35                           
2     67 452.99 -1   -14.641 2.2044 0.1424
# Notice how this F and p-value is the same from
# the summary() table of m1. 
m3 = update(m2, .~. -Moisture)
summary(m3)
            Df Sum Sq Mean Sq F value Pr(>F)    
Temp         1   1423  1422.8   205.9 <2e-16 ***
Residuals   68    470     6.9                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check the slope and intercept
m3$coefficients
(Intercept)        Temp 
   4.487613    0.787559 

The model simplification validates that there is only an effect of Temperature, and the estimated slope matches our simulated value pretty closely.

8.3.3 Main effect of Moisture

Now, let’s assume there is no effect of Temperature, but there is a difference in the levels of Moisture.

set.seed(5)
# 2. Only factor effect
mu_0 = 5
alpha_1 = 4
beta_1 = 0
beta_2 = 0
sigma = 2.0

y2 = calc_y_func(
    mu_0,alpha_1,beta_1,
    beta_2,sigma
)

# Store as data frame
ancova_df2 = ancova_df1
ancova_df2$Growth = y2

ggplot(ancova_df2) +
    geom_point(aes(x = Temp, y = Growth, 
                   shape = Moisture, color = Moisture))

We can see the difference in the means of Moisture levels, but no clear, linear Temperature effect. We will validate this with model simplification.

m1 = aov(Growth ~ Temp + Moisture + Temp*Moisture, data = ancova_df2)
summary(m1)
              Df Sum Sq Mean Sq F value   Pr(>F)    
Temp           1   7.73    7.73   1.875    0.176    
Moisture       1 173.80  173.80  42.164 1.28e-08 ***
Temp:Moisture  1   0.04    0.04   0.009    0.923    
Residuals     66 272.06    4.12                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m2 = update(m1, .~. - Temp:Moisture)
summary(m2)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Temp         1   7.73    7.73   1.904    0.172    
Moisture     1 173.80  173.80  42.797 9.92e-09 ***
Residuals   67 272.10    4.06                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m3 = update(m2, .~. - Temp)
summary(m3)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Moisture     1  181.3   181.3   45.27 4.37e-09 ***
Residuals   68  272.3     4.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Indeed, the model with only Moisture is best. We can then use Tukey HSD to validate the specific differences in the Moisture levels.

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Growth ~ Moisture, data = ancova_df2)

$Moisture
            diff      lwr      upr p adj
Wet-Dry 3.218663 2.264057 4.173269     0

The test shows that Wet soil leads to an approximately 3.2 value increase in plant Growth over dry soil, and this is a statistically significant difference (p-value is very low, near zero).

We can use ggplot2 to help us visualize this differnce more clearly.

ggplot(ancova_df2, aes(x = Temp, y = Growth, 
                       color = Moisture)) +
    geom_point() +
    geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

8.3.4 Main effects of Temperature and Moisture

Now, let’s assume there is a linear effect of Temperature, and there is a difference in the levels of Moisture. But, there is still no interaction.

set.seed(1)
# 3. Factor and Slope effect
mu_0 = 3
alpha_1 = 5 
beta_1 = 0.5
beta_2 = 0
sigma = 2.0

y3 = calc_y_func(
    mu_0,alpha_1,beta_1,
    beta_2,sigma
)

# Store as data frame
ancova_df3 = ancova_df1
ancova_df3$Growth = y3

ggplot(ancova_df3) +
    geom_point(aes(x = Temp, y = Growth, 
                   shape = Moisture, color = Moisture))

We can see the difference in the means of Moisture levels, and an obvious linear Temperature effect. Again, we will validate this with model simplification.

m1 = aov(Growth ~ Temp + Moisture + Temp*Moisture, 
         data = ancova_df3)
summary(m1)
              Df Sum Sq Mean Sq F value Pr(>F)    
Temp           1  697.2   697.2 197.926 <2e-16 ***
Moisture       1  491.1   491.1 139.408 <2e-16 ***
Temp:Moisture  1    0.3     0.3   0.082  0.776    
Residuals     66  232.5     3.5                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m2 = update(m1, .~. - Temp:Moisture)
summary(m2)
            Df Sum Sq Mean Sq F value Pr(>F)    
Temp         1  697.2   697.2   200.7 <2e-16 ***
Moisture     1  491.1   491.1   141.3 <2e-16 ***
Residuals   67  232.8     3.5                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# m2 is best
m2$coefficients
(Intercept)        Temp MoistureWet 
  3.5666389   0.4710967   5.3763016 

The test shows that Wet soil leads to an approximately 5.3 value increase in plant Growth over dry soil, and that Temperature has a similar positive, linear effect on plant Growth for both wet and dry soil.

We can use ggplot2 to help us visualize this difference more clearly.

ggplot(ancova_df3, aes(x = Temp, y = Growth, 
                       color = Moisture)) +
    geom_point() +
    geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

8.3.5 Interaction between Temperature and Moisture

Finally, let’s assume there is an interaction between Moisture and Temperature.

########################
set.seed(1)
# 4. Positive interaction
mu_0 = 15
alpha_1 = 0
beta_1 = -0.45
beta_2 = 0.75
sigma = 2.0

y4 = calc_y_func(
    mu_0,alpha_1,beta_1,
    beta_2,sigma
)

# Store as data frame
ancova_df4 = ancova_df1
ancova_df4$Growth = y4

ggplot(ancova_df4) +
    geom_point(aes(x = Temp, y = Growth, 
                   shape = Moisture, color = Moisture))

We can visually notice how the slope of Temperature depends on whether soil Moisture is Wet or Dry. But we need to conduct model simplification to validate this.

m1 = aov(Growth ~ Temp + Moisture + Temp*Moisture, 
         data = ancova_df4)
summary(m1)
              Df Sum Sq Mean Sq F value   Pr(>F)    
Temp           1    3.0     3.0   0.857    0.358    
Moisture       1 2466.5  2466.5 700.191  < 2e-16 ***
Temp:Moisture  1  303.6   303.6  86.190 1.37e-13 ***
Residuals     66  232.5     3.5                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# cannot drop interaction
# full model is best

m1$coefficients
     (Intercept)             Temp      MoistureWet Temp:MoistureWet 
    15.686070436     -0.487258985      0.005302987      0.773837781 

We see the model coefficients match well to our simulated values. Specifically, there is a positive interaction: the slope of Temperature when soil is Dry is negative (\(-0.49\)), but when the soil is Wet, the slope of temperature increases and becomes positive (\(-0.49 + 0.77 = 0.28\)).

Let’s visualize this more clearly with ggplot2.

ggplot(ancova_df4, aes(x = Temp, y = Growth, 
                       color = Moisture)) +
    geom_point() +
    geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'