7  Model Selection

7.1 Generate the data

Here we will demonstrate two approaches to model comparison. But first let’s generate data in the same way we did for multiple linear regression in Chapter 4. Note that in this case, we will specify that two of the input variables have zero slope (i.e., no linear association with the outcome variable).

n = 40
n_covariate = 4
p = n_covariate + 1

betas = vector("numeric", length = p)
xmat = matrix(0, nrow = n, ncol = p)
sigma = 2.25

# Column for intercept
xmat[,1] = 1

# Generate the covariate data randomly:
set.seed(5)
xmat[,2] = rnorm(n, mean = 5, sd = 8)
xmat[,3] = runif(n, min = 0, max = 20)
xmat[,4] = rchisq(n, df = 50)
xmat[,5] = rpois(n, lambda = 10)

# Set the betas:
betas[1] = 1.0
betas[2] = 0.0
betas[3] = -0.2
betas[4] = 0.0
betas[5] = 1.8

# Calculate the observed 'y', adding residual error
y = xmat %*% betas + rnorm(n, mean = 0, sd = sigma)

par(mfrow=c(2,2))
for(i in 2:p){
    plot(y ~ xmat[,i],
         xlab = paste("covariate ", i-1))
}

# Create a data.frame
my_df = data.frame(y, xmat[,2:5])
head(my_df)
          y         X1        X2       X3 X4
1 28.492690 -1.7268438 18.788730 38.39431 18
2 14.221411 16.0748747 16.474910 60.76146 10
3  9.064956 -5.0439349  4.223082 33.40577  5
4  9.366421  5.5611421  1.832589 41.76465  5
5 18.874673 18.6915270  9.405498 40.26706 11
6 17.706978  0.1767361  1.001228 46.05881 10
# Run the model, report the summary
m1 = lm(y ~ 1 + X1 + X2 + X3 + X4, data = my_df)
m1_summary = summary(m1)
m1_summary

Call:
lm(formula = y ~ 1 + X1 + X2 + X3 + X4, data = my_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3257 -1.4053 -0.4331  1.3299  4.3178 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.41757    1.82632   1.871  0.06968 .  
X1           0.03245    0.03810   0.852  0.40016    
X2          -0.26989    0.07297  -3.698  0.00074 ***
X3          -0.01823    0.03267  -0.558  0.58050    
X4           1.68543    0.11083  15.207  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.94 on 35 degrees of freedom
Multiple R-squared:  0.8901,    Adjusted R-squared:  0.8775 
F-statistic: 70.86 on 4 and 35 DF,  p-value: 2.741e-16

7.2 Parsimony via model simplification

We will successively simplify the model until we find a “minimally acceptable” model: the simplest model that still explains the variability in the outcome variable.

There are several built-in functions in R that can help us make quantitatively justified decisions about which input variables can be dropped from the full model to determine our minimally acceptable model. First, we can use the \(F\)-test as described in lecture. This can be implemented by the anova() function.

Based on the summary() output, we see that input variable 3 (\(x_3\)) has the least significant effect on \(y\), so we will drop that first and proceed from there.

# The full model lives in object m1
formula(m1)
y ~ 1 + X1 + X2 + X3 + X4
# Create a new model with the update() function
# This function has a strange notation, but so it goes...
m2 = update(m1, .~. -X3)
formula(m2)
y ~ X1 + X2 + X4
summary(m2)

Call:
lm(formula = y ~ X1 + X2 + X4, data = my_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4492 -1.3713 -0.3323  1.2450  4.3718 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.54371    0.92986   2.736 0.009605 ** 
X1           0.02863    0.03712   0.771 0.445496    
X2          -0.26530    0.07181  -3.694 0.000728 ***
X4           1.68217    0.10962  15.346  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.921 on 36 degrees of freedom
Multiple R-squared:  0.8891,    Adjusted R-squared:  0.8799 
F-statistic: 96.22 on 3 and 36 DF,  p-value: < 2.2e-16
# Use anova() to test if the drop of X3 is justified
anova(m2, m1)
Analysis of Variance Table

Model 1: y ~ X1 + X2 + X4
Model 2: y ~ 1 + X1 + X2 + X3 + X4
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     36 132.91                           
2     35 131.73  1    1.1712 0.3112 0.5805

Remember that the hypotheses being tested are: \[H_0:\text{simple model}\] \[H_A:\text{complex model}\] So if \(p \ge 0.05\), as usual, we cannot reject the null hypothesis. In this case, that means the simpler model explains the data about as well as the more complex model, so we are justified in dropping \(x_3\). From the data, we could not detect a statistically meaningful linear relationship between \(x_3\) and the outcome \(y\).

Let’s manually calculate that \(F\) test statistic and associated \(p\)-value to verify that we understand how the test works.

# Extract the residuals (errors)
resid_null = m2$residuals
resid_full = m1$residuals

# Sum of Square Errors (SSE)
sse_null = crossprod(resid_null)
sse_full = crossprod(resid_full)

# degrees of freedom
df_null = n-(p-1) # we dropped one input variable
df_full = n-p

# Calculate F_stat
f_test = ((sse_null - sse_full)/(df_null - df_full)) / (sse_full/df_full)

# Degrees of freedom for the F distribution:
df_numerator = df_null - df_full
df_denominator = df_full

p_m1vm2 = 1 - pf(f_test,
                df1 = df_numerator,
                df2 = df_denominator)

# Compare to anova()
f_test; p_m1vm2
          [,1]
[1,] 0.3111883
          [,1]
[1,] 0.5805028
anova_m1vm2 = anova(m2,m1)
anova_m1vm2$`F`; anova_m1vm2$`Pr(>F)`
[1]        NA 0.3111883
[1]        NA 0.5805028
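
The same upper-tail probability can also be obtained directly from pf() with lower.tail = FALSE, which avoids the 1 - pf(...) subtraction and is numerically safer when p-values are extremely small:

# Equivalent upper-tail calculation:
pf(f_test, df1 = df_numerator, df2 = df_denominator, lower.tail = FALSE)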

Let’s continue with the simplification process, using the anova() function.

# model 2 is the current best.
summary(m2)

Call:
lm(formula = y ~ X1 + X2 + X4, data = my_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4492 -1.3713 -0.3323  1.2450  4.3718 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.54371    0.92986   2.736 0.009605 ** 
X1           0.02863    0.03712   0.771 0.445496    
X2          -0.26530    0.07181  -3.694 0.000728 ***
X4           1.68217    0.10962  15.346  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.921 on 36 degrees of freedom
Multiple R-squared:  0.8891,    Adjusted R-squared:  0.8799 
F-statistic: 96.22 on 3 and 36 DF,  p-value: < 2.2e-16
# Now, drop x1 and check
m3 = update(m2, .~. -X1)

# Check:
anova(m3, m2)
Analysis of Variance Table

Model 1: y ~ X2 + X4
Model 2: y ~ X1 + X2 + X4
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1     37 135.1                           
2     36 132.9  1    2.1969 0.5951 0.4455
# The p-value is not significant, so we
# fail to reject the null (keep the simpler model)

summary(m3)

Call:
lm(formula = y ~ X2 + X4, data = my_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4900 -1.4021 -0.1473  1.3871  4.3194 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.46667    0.91941   2.683 0.010847 *  
X2          -0.25788    0.07077  -3.644 0.000819 ***
X4           1.69901    0.10683  15.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.911 on 37 degrees of freedom
Multiple R-squared:  0.8873,    Adjusted R-squared:  0.8812 
F-statistic: 145.6 on 2 and 37 DF,  p-value: < 2.2e-16
# Remove X2 and check
m4 = update(m3, .~. -X2)
anova(m4, m3)
Analysis of Variance Table

Model 1: y ~ X4
Model 2: y ~ X2 + X4
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     38 183.58                                  
2     37 135.10  1    48.478 13.277 0.0008195 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Ok now the p-value is significant
# We need to reject the null (simpler model)
# We *cannot* reliably remove X2

# Try X4 just in case:

m5 = update(m3, .~. -X4)
anova(m3, m5)
Analysis of Variance Table

Model 1: y ~ X2 + X4
Model 2: y ~ X2
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     37  135.1                                  
2     38 1058.6 -1   -923.53 252.92 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# p-value is significant again
# need to reject the null
# we cannot drop X4

# Therefore, m3 is most parsimonious
summary(m3)

Call:
lm(formula = y ~ X2 + X4, data = my_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4900 -1.4021 -0.1473  1.3871  4.3194 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.46667    0.91941   2.683 0.010847 *  
X2          -0.25788    0.07077  -3.644 0.000819 ***
X4           1.69901    0.10683  15.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.911 on 37 degrees of freedom
Multiple R-squared:  0.8873,    Adjusted R-squared:  0.8812 
F-statistic: 145.6 on 2 and 37 DF,  p-value: < 2.2e-16

Therefore, model 3 is the minimally acceptable model: \[y_i = \beta_0 + \beta_2 x_{2,i} + \beta_4 x_{4,i} + \epsilon_i\]

We could actually come to the same result using a different, more automated function, step(). However, this function uses a different metric to compare the candidate models: the Akaike Information Criterion (AIC), which is calculated as: \[\text{AIC} = - 2\ln(\text{Model Likelihood}) + 2k\] where \(k\) is the number of estimated parameters in the model. We can then compare the AIC values to decide which models are “best”. We will learn more about AIC later.
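
To connect this formula to R, we can do a quick manual check against the built-in AIC() function (a small sketch; for an lm fit, the parameter count \(k\) includes every slope, the intercept, and the residual standard deviation):

# Manual AIC for the full model m1:
k_m1 = length(coef(m1)) + 1   # slopes + intercept, plus residual sigma
aic_m1 = -2 * as.numeric(logLik(m1)) + 2 * k_m1
aic_m1
# Compare with the built-in calculation:
AIC(m1)

Note that step() reports AIC values computed with extractAIC(), which drops an additive constant that is the same for every model fit to the same data; those numbers differ from AIC() but produce the same model ranking.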

Let’s use the step() function and verify it gives us the same final outcome.

m1_step = step(m1)
Start:  AIC=57.68
y ~ 1 + X1 + X2 + X3 + X4

       Df Sum of Sq     RSS     AIC
- X3    1      1.17  132.90  56.030
- X1    1      2.73  134.46  56.497
<none>               131.73  57.676
- X2    1     51.48  183.22  68.871
- X4    1    870.38 1002.11 136.839

Step:  AIC=56.03
y ~ X1 + X2 + X4

       Df Sum of Sq     RSS     AIC
- X1    1      2.20  135.10  54.686
<none>               132.90  56.030
- X2    1     50.39  183.29  66.888
- X4    1    869.43 1002.34 134.848

Step:  AIC=54.69
y ~ X2 + X4

       Df Sum of Sq     RSS     AIC
<none>               135.10  54.686
- X2    1     48.48  183.58  64.951
- X4    1    923.53 1058.63 135.034
summary(m1_step)

Call:
lm(formula = y ~ X2 + X4, data = my_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4900 -1.4021 -0.1473  1.3871  4.3194 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.46667    0.91941   2.683 0.010847 *  
X2          -0.25788    0.07077  -3.644 0.000819 ***
X4           1.69901    0.10683  15.904  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.911 on 37 degrees of freedom
Multiple R-squared:  0.8873,    Adjusted R-squared:  0.8812 
F-statistic: 145.6 on 2 and 37 DF,  p-value: < 2.2e-16

We can see the selected model only includes \(x_2\) and \(x_4\), just like our decision based on the \(F\)-test.
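
As a quick sanity check, we can confirm that the model selected by step() is the same model we arrived at by manual simplification:

# Both should report y ~ X2 + X4
formula(m1_step)
formula(m3)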

7.3 Model averaging

Recall from lecture that model averaging represents another philosophical approach to model selection and model comparison. In this case, the idea is that we cannot know with certainty which model of a nested sub-set of models is “true”. Therefore, instead of reporting the slopes and intercepts from the single “best” model chosen based on parsimony, we should report “averaged” values of slopes and intercepts. These averages take into account all of the possible nested sub-models in which those slopes and intercepts could have been estimated. This averaging procedure can produce slope and intercept estimates (as well as estimates of their uncertainty) that are less biased, and can perhaps yield better predictions of future data.

We will use the MuMIn package (Multimodel Inference) to do model averaging later, but first we will do it manually.

7.3.1 Required calculations

Recall that to come up with averaged estimates of model parameters (e.g., model-averaged slopes) we need to calculate weighted averages. These averages are weighted by how well sub-models explain the data. Following lecture, we will use the corrected \(AIC\), noted as \(AIC_c\), to calculate how well a model explains the data.

\[\text{AIC}_c = - 2\ln(\text{Model Likelihood}) + 2k + \frac{2k(k+1)}{n-k-1}\] where \(n\) is the number of data observations and \(k\) is again the number of estimated parameters. Then, to calculate the weights we need to see how much each sub-model deviates from the best model. For this deviation we calculate, for sub-model \(i\):

\[\Delta \text{AIC}_{c,i} = \text{AIC}_{c,i} - \text{AIC}_{c,\text{best}}\] The weight of sub-model \(i\) is:

\[w_i = \frac{\text{exp}(-\Delta \text{AIC}_{c,i} / 2)}{\sum_{r=1}^{R} \text{exp}(-\Delta \text{AIC}_{c,r} / 2)} \] where \(R\) is the number of sub-models being examined.

We are now ready to calculate the weighted average of any parameter of interest in the full model, \(\theta\):

\[\hat{\bar{\theta}} = \sum_{r=1}^{R} w_r \hat{\theta}_{r}\] Here, \(\hat{\bar{\theta}}\) is the model-averaged estimate of parameter \(\theta\), \(w_r\) is the weight of sub-model \(r\), and \(\hat{\theta}_r\) is the parameter estimate derived from sub-model \(r\).

We can also calculate the new averaged uncertainty in the parameter estimate:

\[\hat{\text{var}}(\hat{\bar{\theta}}) = \sum_{r=1}^{R} w_r \left( \hat{\text{var}}(\hat{\theta}_r) + (\hat{\theta}_r - \hat{\bar{\theta}})^2 \right) \] Here, \(\hat{\text{var}}(\hat{\bar{\theta}})\) is the variance of the model-averaged parameter estimate, whereas \(\hat{\text{var}}(\hat{\theta}_r)\) is the variance of the estimate of \(\theta\) from sub-model \(r\); taking square roots gives the corresponding standard errors.

7.3.2 Manual calculation

Let’s see if we can manually calculate all of this from a less complex example model. Imagine our full model is a model that only has two input variables. We’ll use our simulated data set from above. (Of course, we know this is a poor model, but we’re just doing a case-study here.)

# Full model:
full_mod = lm(y~1+X1+X2, data = my_df)

If this full model has two inputs, then the number of sub-models is \(2^2 = 4\), obtained by iteratively dropping one or both input variables (and counting the full model itself). Now, run each sub-model. I know, this is tedious.

sub_m2 = update(full_mod, .~. -X2)
sub_m3 = update(full_mod, .~. -X1)
sub_m4 = update(full_mod, .~. -X2-X1) # Intercept only

# Store all models in a list for easy looping later:
model_list = list(
    full_mod, sub_m2, sub_m3, sub_m4
)

# how many models?
n_mod = 4

To get the model-averaged slopes of inputs \(x_1\) and \(x_2\), we’ll need to calculate \(AIC_c\) values and model weights. We’ll store calculations in arrays as much as possible, so we can loop through.

Let’s start with \(AIC_c\). Fortunately, there’s a built-in function for this in the MuMIn package, but we’ll do one manually first.

# Extract neg-log-likelihood from full model:
nll_full = -1*logLik(full_mod)
# This is in a weird format, so we'll convert:
nll_full = as.numeric(nll_full)
k_full = 4 # two slopes + 1 intercept + residual sigma

# Calculate AIC_c
aic_c_full = 2*nll_full + 2*k_full + (2*k_full*(k_full + 1))/(n - k_full - 1)
aic_c_full
[1] 251.5063
# Check with built-in
library(MuMIn)
AICc(full_mod)
[1] 251.5063
# Now calculate all:
AICc_vec = NULL
for(i in 1:n_mod){
    AICc_vec[i] = AICc(model_list[[i]])
}
AICc_vec
[1] 251.5063 252.3124 251.2157 253.8381

We can see that the ‘best’ model, according to \(AIC_c\), is sub_m3, which includes the intercept and only input \(x_2\). This makes sense, because we know that the slope of \(x_1\) was simulated as zero.

Let’s now calculate the \(\Delta \text{AIC}_c\).

# Best AICc
AICc_best = min(AICc_vec)

#\Delta AIC_c
Delta_AICc_vec = AICc_vec - AICc_best
Delta_AICc_vec
[1] 0.290611 1.096742 0.000000 2.622409

Now we can calculate the model weights (i.e., the value representing how “good” each model is, relative to the best model).

# Calculate the denominator of the weight calculation
weight_denom = 0
for(i in 1:n_mod){
    weight_denom = 
        weight_denom +
        exp( -Delta_AICc_vec[i] / 2 )
}
weight_denom
[1] 2.712144
# Now the individual weights:
weight = NULL
for(i in 1:n_mod){
    weight[i] = 
        exp( -Delta_AICc_vec[i] / 2) / weight_denom
}
weight
[1] 0.31884668 0.21307516 0.36871201 0.09936614
# Sum to 1?
sum(weight)
[1] 1

We can see the “better” models, based on \(AIC_c\) have higher weights, and the weight vector should add to 1.
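
As an aside, because exp() and division are vectorized in R, the two loops above can be collapsed into a single line; this is just a compact equivalent of the same weight calculation (weight_check is a new name introduced here for the comparison):

# Vectorized version of the weight calculation:
weight_check = exp(-Delta_AICc_vec / 2) / sum(exp(-Delta_AICc_vec / 2))
weight_check
all.equal(weight, weight_check)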

Let’s calculate the model-averaged slope estimate for input \(x_2\). To do this, we’ll first need to extract the estimate from each sub-model. This is a little tedious, because we need to know which coefficient refers to \(x_2\) in each sub-model object (or if the coefficient is absent and therefore equal to zero).

coef_x2 = NULL
coef_x2[1] = coef(model_list[[1]])[3]
coef_x2[2] = 0 # Absent from this sub-model
coef_x2[3] = coef(model_list[[3]])[2]
coef_x2[4] = 0 # Absent from this sub-model
coef_x2
[1] 0.2975017 0.0000000 0.3649074 0.0000000
# Averaged, based on model weight:
avg_coef_x2 = 0
for(i in 1:n_mod){
    avg_coef_x2 = 
        avg_coef_x2 +
        weight[i] * coef_x2[i]
}

avg_coef_x2
[1] 0.2294032
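
A slightly less error-prone way to do the same extraction is to look up the coefficient by name in each sub-model and treat it as zero when it is absent. A sketch (coef_x2_byname is just a new name introduced here):

# Extract the X2 coefficient by name from each sub-model (0 if absent):
coef_x2_byname = sapply(model_list, function(m){
    cf = coef(m)["X2"]
    if(is.na(cf)) 0 else unname(cf)
})
coef_x2_byname
all.equal(coef_x2, coef_x2_byname)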

We can see the model-averaged slope estimate for input \(x_2\) is slightly less than the estimate from the full model.

summary(full_mod)

Call:
lm(formula = y ~ 1 + X1 + X2, data = my_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.9147 -3.6749  0.6359  3.4050 10.5608 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.58750    1.78926   7.035 2.55e-08 ***
X1           0.14204    0.09854   1.441   0.1579    
X2           0.29750    0.16725   1.779   0.0835 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.205 on 37 degrees of freedom
Multiple R-squared:  0.1637,    Adjusted R-squared:  0.1185 
F-statistic: 3.621 on 2 and 37 DF,  p-value: 0.03662
coef(full_mod)[3]
       X2 
0.2975017 

I’ll leave calculating the model-averaged standard error of the slopes as an exercise for you.

As I mentioned above, fortunately someone created a package that does this model averaging for us and removes a lot of the tedium.

First, run all sub-models using the MuMIn::dredge() function.

# Required for MuMIn::dredge functionality
options(na.action = "na.fail")

# Fit all sub-models:
dredge_test = dredge(full_mod)
Fixed term is "(Intercept)"
dredge_test
Global model call: lm(formula = y ~ 1 + X1 + X2, data = my_df)
---
Model selection table 
  (Intrc)    X1     X2 df   logLik  AICc delta weight
3   12.71       0.3649  3 -122.275 251.2  0.00  0.369
4   12.59 0.142 0.2975  4 -121.182 251.5  0.29  0.319
2   15.26 0.191         3 -122.823 252.3  1.10  0.213
1   16.31               2 -124.757 253.8  2.62  0.099
Models ranked by AICc(x) 

Notice how this output shows that dredge() has fit all of the sub-models and calculated the likelihoods, the \(AIC_c\), the \(\Delta AIC_c\), and the model weights.

Now, we can average all of the models.

# Average the models:
test_average = model.avg(dredge_test, fit = TRUE)
summary(test_average)

Call:
model.avg(object = get.models(object = dredge_test, subset = NA))

Component model call: 
lm(formula = y ~ <4 unique rhs>, data = my_df)

Component models: 
       df  logLik   AICc delta weight
2       3 -122.27 251.22  0.00   0.37
12      4 -121.18 251.51  0.29   0.32
1       3 -122.82 252.31  1.10   0.21
(Null)  2 -124.76 253.84  2.62   0.10

Term codes: 
X1 X2 
 1  2 

Model-averaged coefficients:  
(full average) 
            Estimate Std. Error Adjusted SE z value Pr(>|z|)    
(Intercept)  13.5710     2.1113      2.1512   6.308   <2e-16 ***
X2            0.2294     0.2083      0.2113   1.086    0.278    
X1            0.0860     0.1092      0.1108   0.776    0.438    
 
(conditional average) 
            Estimate Std. Error Adjusted SE z value Pr(>|z|)    
(Intercept)  13.5710     2.1113      2.1512   6.308   <2e-16 ***
X2            0.3336     0.1683      0.1737   1.921   0.0547 .  
X1            0.1617     0.1009      0.1041   1.553   0.1205    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This summary() statement shows the model-averaged values of slopes of \(x_1\) and \(x_2\), and the intercept. We care about the “full average”. You can see the averaged estimate of the slope for \(x_2\) matches our manual calculation. For emphasis:

test_average$coefficients[1,2];
[1] 0.2294032
avg_coef_x2
[1] 0.2294032

7.3.3 Back to the more complex model

Ok, but our full model had four input variables, which means the number of sub-models is \(2^4 = 16\). Let’s not do that manually, but instead use the MuMIn::model.avg() function.

# Reminder, m1 was our full model:
summary(m1)

Call:
lm(formula = y ~ 1 + X1 + X2 + X3 + X4, data = my_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3257 -1.4053 -0.4331  1.3299  4.3178 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.41757    1.82632   1.871  0.06968 .  
X1           0.03245    0.03810   0.852  0.40016    
X2          -0.26989    0.07297  -3.698  0.00074 ***
X3          -0.01823    0.03267  -0.558  0.58050    
X4           1.68543    0.11083  15.207  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.94 on 35 degrees of freedom
Multiple R-squared:  0.8901,    Adjusted R-squared:  0.8775 
F-statistic: 70.86 on 4 and 35 DF,  p-value: 2.741e-16
# Fit all sub-models:
dredge_m1 = dredge(m1)
Fixed term is "(Intercept)"
# Average the models:
m1_average = model.avg(dredge_m1, fit = TRUE)
summary(m1_average)

Call:
model.avg(object = get.models(object = dredge_m1, subset = NA))

Component model call: 
lm(formula = y ~ <16 unique rhs>, data = my_df)

Component models: 
       df  logLik   AICc delta weight
24      4  -81.10 171.34  0.00   0.56
124     5  -80.77 173.31  1.97   0.21
234     5  -81.01 173.78  2.43   0.17
1234    6  -80.60 175.74  4.39   0.06
4       3  -87.23 181.13  9.79   0.00
14      4  -87.20 183.55 12.20   0.00
34      4  -87.23 183.60 12.26   0.00
134     5  -87.19 186.15 14.81   0.00
2       3 -122.27 251.22 79.87   0.00
12      4 -121.18 251.51 80.16   0.00
1       3 -122.82 252.31 80.97   0.00
23      4 -122.21 253.55 82.21   0.00
(Null)  2 -124.76 253.84 82.49   0.00
123     5 -121.18 254.12 82.78   0.00
13      4 -122.82 254.78 83.44   0.00
3       3 -124.73 256.12 84.77   0.00

Term codes: 
X1 X2 X3 X4 
 1  2  3  4 

Model-averaged coefficients:  
(full average) 
             Estimate Std. Error Adjusted SE z value Pr(>|z|)    
(Intercept)  2.642223   1.219137    1.258700   2.099 0.035802 *  
X2          -0.258803   0.074396    0.076727   3.373 0.000743 ***
X4           1.693799   0.109648    0.113293  14.951  < 2e-16 ***
X1           0.007999   0.023509    0.024078   0.332 0.739716    
X3          -0.003320   0.016617    0.017119   0.194 0.846219    
 
(conditional average) 
            Estimate Std. Error Adjusted SE z value Pr(>|z|)    
(Intercept)  2.64222    1.21914     1.25870   2.099 0.035802 *  
X2          -0.26062    0.07141     0.07385   3.529 0.000417 ***
X4           1.69380    0.10965     0.11329  14.951  < 2e-16 ***
X1           0.02940    0.03744     0.03875   0.759 0.448057    
X3          -0.01452    0.03232     0.03345   0.434 0.664272    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the model with inputs \(x_2\) and \(x_4\) is the best, based on \(AIC_c\) (this is the model labeled 24, meaning it includes inputs 2 and 4).
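
If we wanted to pull out that top-ranked model as an ordinary lm object (for example, to inspect it with summary()), MuMIn’s get.models() can do so. A sketch, assuming the subset argument accepts a numeric index into the ranked table (row 1 is the best model):

# Extract the top-ranked model from the dredge table (assumed numeric subset):
best_m1 = get.models(dredge_m1, subset = 1)[[1]]
formula(best_m1)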

We can also plot the model parameter estimates with their confidence intervals:

# Plot the coefficient estimates (from the averaged model)
plot(m1_average)

Finally, we can use the predict() function as we have before to visualize the effect of each input variable on the outcome. Here, we will show the independent, model-averaged effect of \(x_2\), when all other input variables are held at their average values. Then, we’ll do the same for \(x_4\).

# Predict from the average model:
# How does y change as a function of x2, while 
# other inputs held at their average?
new_df = data.frame(
    X1 = rep(mean(my_df$X1), 100),
    X2 = seq(0, 19, length.out = 100),
    X3 = rep(mean(my_df$X3), 100),
    X4 = rep(mean(my_df$X4), 100)
)

pred_m1_avg_x2 = 
    predict(m1_average,
            newdata = new_df,
            se.fit = TRUE)

plot(my_df$y ~ my_df$X2,
     xlab = "input x2", ylab = "y", pch = 19)
lines(pred_m1_avg_x2$fit ~ new_df$X2)
lines(pred_m1_avg_x2$fit-2*pred_m1_avg_x2$se.fit ~ new_df$X2, lty = 2)
lines(pred_m1_avg_x2$fit+2*pred_m1_avg_x2$se.fit ~ new_df$X2, lty = 2)

# Predict from the average model:
# How does y change as a function of x4, while 
# other inputs held at their average?
new_df = data.frame(
    X1 = rep(mean(my_df$X1), 100),
    X2 = rep(mean(my_df$X2), 100),
    X3 = rep(mean(my_df$X3), 100),
    X4 = seq(0, 19, length.out = 100)
)

pred_m1_avg_x4 = 
    predict(m1_average,
            newdata = new_df,
            se.fit = TRUE)

plot(my_df$y ~ my_df$X4,
     xlab = "input x4", ylab = "y", pch = 19)
lines(pred_m1_avg_x4$fit ~ new_df$X4)
lines(pred_m1_avg_x4$fit-2*pred_m1_avg_x4$se.fit ~ new_df$X4, lty = 2)
lines(pred_m1_avg_x4$fit+2*pred_m1_avg_x4$se.fit ~ new_df$X4, lty = 2)

Another, perhaps simpler, way to visualize how well a model matches the data is to plot the model predictions versus the observed data. We can even compare this to the non-averaged model.

raw_predict_avg = predict(m1_average)
raw_predict_nonavg = predict(m1)
raw_predict_bad = predict(sub_m2)

plot(my_df$y ~ raw_predict_avg,
     xlab = "Model Prediction", ylab = "Data, y", pch = 19, col = "red")
points(my_df$y ~ raw_predict_nonavg, pch = 19, col = "black")
points(my_df$y ~ raw_predict_bad, pch = 19, col = "orange")
# 1-to-1 line
abline(a = 0, b = 1)

It is hard to see, and likely not significant in this case, but the red points (model-averaged) tend to be closer to the 1:1 line than the black points, meaning the averaged model makes slightly better predictions of the observed data. What is clearer is that the “bad” model (which only included covariate \(x_1\)) does not match the 1:1 line at all; it is more of a shotgun scatter of points, which clearly indicates that model is not predictive of the \(y\) data. This kind of plot is a good visualization of a model’s within-sample predictive ability (i.e., how closely the model’s predictions of the observed data match the observed data).
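
To put a number on that visual impression, we can compute the within-sample root-mean-square error (RMSE) of each set of predictions. A minimal sketch, where rmse() is a small helper defined here rather than a built-in function:

# Root-mean-square error of the within-sample predictions:
rmse = function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(my_df$y, raw_predict_avg)     # model-averaged predictions
rmse(my_df$y, raw_predict_nonavg)  # full (non-averaged) model
rmse(my_df$y, raw_predict_bad)     # the "bad" X1-only model

A smaller RMSE means the predictions sit closer to the 1:1 line; based on the plot, we would expect the X1-only model to have a much larger RMSE than the other two.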