Remember that our goal is to estimate the linear relationship between data observations of a response variable, \(y\), and its measured covariate, \(x\), following: \(Y = XB + \epsilon\), where \(\epsilon \sim N(0, \sigma^2 I)\). Our coefficients to estimate are therefore \(\hat{B}\), which is a column vector of the intercept and slope. We also estimate the standard deviation of the residuals (i.e., residual error), \(\hat{\sigma}\). To estimate the coefficients, we are attempting to minimize the residual sum of squares, \(|| \epsilon || ^ 2\). See Footnotes 4.12.1 for more information regarding this notation.
4.2 Generate the data
We’ll start with a very small data set to emphasize the basics, and then the in-class activity will go into more depth. Here, we’ll implement the OLS estimation with a single covariate that we demonstrated in lecture.
n = 4  # number observations
p = 2  # number of parameters

# Covariate:
x0 = c(1, 1, 1, 1)  # placeholder for intercept
x1 = c(2, 3, 5, 1)  # value of x
xmat = matrix(data = c(x0, x1), nrow = n, ncol = p)
xmat

# Coefficients:
## betas[1]: intercept
## betas[2]: slope
betas = c(4, 2)
xmat %*% betas

# residuals
epsilon = c(0, -1, 1, 3)

# Data observations:
y = xmat %*% betas + epsilon
4.3 Plot the relationship
# Plot in layers
## Create a blank plotting canvas, specifying axis limits
plot(x = NA, y = NA, xlab = "x", ylab = "y",
     ylim = c(0, max(y)), xlim = c(0, max(x1)))

## Add data points
points(y ~ x1, pch = 19, cex = 2)

## Add known linear relationship
abline(coef = betas, col = "black", lwd = 2)

# Show the residuals:
segments(x0 = x1, x1 = x1, y0 = y, y1 = y - epsilon)

# Show the model predictions, \hat{y}:
y_hat = xmat %*% betas
points(y_hat ~ x1, cex = 1.25)
4.4 Estimate the coefficients using R’s lm() function
# Run the model:
lm_out = lm(y ~ 1 + x1)

# Show the summary output
summary(lm_out)
Call:
lm(formula = y ~ 1 + x1)
Residuals:
1 2 3 4
-1.029 -1.657 1.086 1.600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7714 2.0500 2.815 0.106
x1 1.6286 0.6565 2.481 0.131
Residual standard error: 1.942 on 2 degrees of freedom
Multiple R-squared: 0.7547, Adjusted R-squared: 0.6321
F-statistic: 6.153 on 1 and 2 DF, p-value: 0.1313
# Extract the estimated coefficients
lm_coef = coef(lm_out)
lm_coef
(Intercept) x1
5.771429 1.628571
4.5 Estimate the coefficients manually
Now we will use the matrix algebra and derivation of normal equations to estimate the intercept and slope from the observations, \(Y\). Remember that we estimate the coefficient vector, \(\hat{B}\) from: \[X^TX \hat{B} = X^T Y\]\[\hat{B} = (X^TX)^{-1} X^T Y\] These equations include the multiplicative inverse matrix, \((X^TX)^{-1}\). See the Footnotes 4.12.2 for more information about inverse matrices and the solve() function.
# Let's break up the normal equations into intermediates:
xtx = t(xmat) %*% xmat

## Use solve() to find inverse of xtx
## why solve()? See the Footnotes, linked above.
inv_xtx = solve(xtx)

xty = t(xmat) %*% y

bhat = inv_xtx %*% xty

# More efficient:
# Remember, xtx * bhat = xty
# So we can use solve() again
bhat_solve = solve(xtx, xty)

# Are they the same?
# How does this manual solution compare to lm()'s solution?
4.6 Plot the estimated relationships
# Plot in layers
## Create a blank plotting canvas, specifying axis limits
plot(NA, NA, xlab = "x", ylab = "y",
     ylim = c(0, max(y)), xlim = c(0, max(x1)))

## Add data points
points(y ~ x1, pch = 19, cex = 2)

## Add known linear relationship
abline(coef = betas, col = "black", lwd = 2)

# Show the residuals:
segments(x0 = x1, x1 = x1, y0 = y, y1 = y - epsilon)

# Show the model predictions, \hat{y}:
y_hat = xmat %*% betas
points(y_hat ~ x1, cex = 1.25)

# Add the lm() estimate:
abline(coef = lm_coef, col = "orange", lty = 2, lwd = 2)

# Add the manual OLS estimate:
abline(coef = bhat_solve, col = "purple", lty = 3, lwd = 2)
4.7 Why are the \(\hat{B}\) different from true \(B\)?
Remember, we are estimating the coefficients by minimizing the sum of squared errors (SSE), \(|| \epsilon ||^2\).
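Let's compute the true SSE (from the simulated residuals) and compare it to the SSE minimized by both estimation approaches:

# True sum of squares:
sum(epsilon^2)

# Estimated (i.e., minimized) sum of squares:
## From lm()
sum(lm_out$residuals^2)

## From manual OLS
sum((y - xmat %*% bhat_solve)^2)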
You can see that the OLS strategy minimized the SSE: the estimated SSE is actually lower than the true SSE of the simulated residuals.
4.8 Understanding Uncertainty in \(\hat{B}\)
While the OLS analysis estimates the regression coefficients, \(\hat{B}\), from the observed data \(Y\), our estimates of these coefficients have error (i.e., uncertainty), such that the estimates are only as good as the data. Specifically, if we have fewer data points (i.e., \(n\) is low), we have less certainty in \(\hat{B}\). In lecture, we showed that: \[\hat{B} \sim N \left( B, (X^TX)^{-1} \hat{\sigma}^2 \right), \] and we know that \(\hat{\sigma}^2\) depends on sample size \(n\), following: \[\hat{\sigma}^2 \quad = \quad \frac{1}{n-p} (Y_{obs} - Y_{pred})^T (Y_{obs} - Y_{pred}) \quad = \quad \frac{1}{n-p} \hat{\epsilon}^T \hat{\epsilon}\]
Using these equations, we showed then that \(SE(\beta_i) = \sqrt{diag\left( (X^TX)^{-1} \right)_i \hat{\sigma}^2}\). Let’s calculate this manually and compare to the output of the lm() function.
# Extract the model summary, which has useful components
lm_out_summary = summary(lm_out)

# Extract the estimated residual standard deviation, sigma
est_sigma = lm_out_summary$sigma
est_sigma
[1] 1.942017
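As a quick verification of the formula for \(\hat{\sigma}^2\) above (using the n and p defined earlier), we can compute the residual standard deviation directly from the residuals and the degrees of freedom:

# Verify: sigma_hat = sqrt( (1/(n-p)) * t(resid) %*% resid )
sqrt(sum(lm_out$residuals^2) / (n - p))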
# We already calculated (X^T X)^{-1} as inv_xtx
beta_cov_mat = inv_xtx * est_sigma^2
beta_cov_mat

# Standard errors are the square roots of the diagonal
se_beta = sqrt(diag(beta_cov_mat))
se_beta
Compare these values to the output of the summary() of Section 4.4 in the column labelled Std. Error.
4.9 Confidence Intervals for \(\hat{B}\)
To calculate confidence intervals for \(\hat{B}\), we first must understand the \(t\) (a.k.a. Student’s \(t\)) probability distribution. This distribution represents the case when we are estimating the mean of a normally distributed variable and either the sample size is small or the variable’s standard deviation is unknown. Essentially, the \(t\) distribution increases the uncertainty (i.e., variance) in cases of low sample size (i.e., small \(n\)). With low sample size (and/or high number of parameters), the degrees of freedom of the \(t\)-distribution, \(\nu\), is low, whereas with high sample size, \(\nu\) is large. As \(\nu\) approaches infinity, the \(t\)-distribution approximates the standard normal distribution (i.e., \(N(\mu, \sigma)|\mu=0,\sigma=1\)).
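To see this convergence, we can plot the \(t\) density for \(\nu = 2\) and \(\nu = 100\) degrees of freedom against the standard normal density:

# Generate x sequence
n_seq = 1000
x_seq = seq(-10, 10, length.out = n_seq)
t_pdf_nu2 = dt(x_seq, df = 2)
t_pdf_nu100 = dt(x_seq, df = 100)
norm_pdf = dnorm(x_seq, mean = 0, sd = 1)

# Compare the t densities to the standard normal
plot(NA, NA, xlab = "x", ylab = "P(x)",
     xlim = range(x_seq), ylim = c(0, 0.5))
lines(norm_pdf ~ x_seq, col = "red", lwd = 3)
lines(t_pdf_nu2 ~ x_seq, col = "blue", lwd = 3)
lines(t_pdf_nu100 ~ x_seq, col = "blue", lwd = 2.5, lty = 2)
legend(x = 2.5, y = 0.4,
       legend = c("Standard Normal", expression(italic(t)[2]), expression(italic(t)[100])),
       lty = c(1, 1, 2), lwd = c(2, 2, 2), col = c("red", "blue", "blue"), bty = "n")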
It is the case for \(\hat{B} \sim N \left( B, (X^TX)^{-1} \hat{\sigma}^2 \right)\) that we do not know the mean (\(B\)), and we are estimating the variance, \(\hat{\sigma}^2\). Specifically, we are estimating the true mean vector, \(B\), as \(\hat{B}\), and we are estimating the variance of the residuals as \(\hat{\sigma}^2\). We can therefore re-write the uncertainty in \(\hat{B}\) as a multivariate \(t\) distribution: \[(\hat{B} - B) \sim t_{\nu} \left( 0, \Sigma \right),\] where the means are zero, \(\nu\) is the degrees of freedom (i.e., \(n-p\)), and \(\Sigma = (X^TX)^{-1} \hat{\sigma}^2\). \((\hat{B} - B)\) represents the deviation of the estimated coefficients from the true coefficients, which is why the distribution is centered around zero. It is perhaps easier to separate the individual estimated coefficients, \(\hat{\beta}_i\), into their own \(t\)-distributions: \[\frac{(\hat{\beta}_i - \beta_i)}{SE(\hat{\beta}_i)} \sim t_{\nu}\]\[(\hat{\beta}_i - \beta_i) \sim t_{\nu} SE(\hat{\beta}_i),\] which shows that the \(t\)-distribution describing the deviation of the estimated regression coefficients from their true values is scaled by the uncertainty in the estimated coefficients, \(SE(\hat{\beta}_i)\). As shown in Dr. Barber’s materials, we can use this information to derive the confidence interval for \(\hat{\beta}_i\) at significance level \(\alpha\) (i.e., the \(100(1-\alpha)\%\) confidence level) as: \[ \hat{\beta}_i \pm t \left(1 - \frac{\alpha}{2}, \nu \right) SE(\hat{\beta}_i),\] where the \(t()\) notation represents the critical value of the \(t\)-distribution, \(t_{crit}\), with \(\nu\) degrees of freedom, for which \(P(z \le t_{crit}) = 1 - \frac{\alpha}{2}\), and \(z\) is a random variable following that \(t\)-distribution. This critical value can be calculated in R using the qt() function, which we show below.
Covariance of \(\hat{\beta}_i\)
Although it is convenient and easier to digest the confidence interval of individual \(\hat{\beta}_i\), we must realize that the estimates of the \(\beta_i\) can covary (i.e., have non-zero covariance), which is quantified in the variance-covariance matrix of \(\hat{B}\), \((X^TX)^{-1} \hat{\sigma}^2\). We will show why this is important below.
Let’s manually calculate the 95% confidence intervals in \(\hat{B}\) and compare to R’s internal function confint().
# Extract the degrees of freedom from the model (\nu)
# which can also be calculated as n - p
t_df = lm_out$df.residual

# Calculate t critical for alpha = 0.05
# This will give us the 95% conf interval (CI)
t_crit = qt(1 - (0.05/2), df = t_df)

# Calculate the upper and lower CI for both betas
ci_int = lm_coef[1] + c(-1, 1) * t_crit * se_beta[1]
ci_slope = lm_coef[2] + c(-1, 1) * t_crit * se_beta[2]

# Construct a table of values
ci_mat = rbind(c(lm_coef[1], ci_int),
               c(lm_coef[2], ci_slope))
colnames(ci_mat) = c("coef", "lowCI", "highCI")
rownames(ci_mat) = c("intercept", "slope")
ci_mat
# Compare these manual calculations to built-in
# function confint(), which by default extracts the
# 95% CI for a lm() model's coefficients
confint(lm_out)
4.10 Propagate uncertainty in \(\hat{B}\) for predictions of \(Y\)
There are several ways to calculate and visualize our uncertainty in model predictions of observed data \(Y\) and unobserved data of the dependent variable (i.e., interpolation). The colored lines drawn on the figure in Section 4.6 represent the expected values of \(Y\) based on the OLS analysis’ estimate of \(\hat{B}\), but these lines do not include uncertainty in the coefficient values.
4.10.1 Multivariate \(t\)-distribution method
First, we will calculate uncertainty by sampling from the multivariate \(t\) distribution that represents error in regression coefficients, \(\hat{B}\).
# We will "bootstrap" 500 samples of intercept and slope
set.seed(3)
n_samp = 500

# Draw from the multivariate t
# which represents (\hat{B} - B)
test_mat_deviates = mnormt::rmt(n_samp, mean = c(0, 0),
                                S = beta_cov_mat, df = t_df)

# Now calculate the realized intercept and slope
# using the t-distributed deviates
test_mat_t = cbind(lm_coef[1] + c(test_mat_deviates[,1]),
                   lm_coef[2] + c(test_mat_deviates[,2]))

# Calculate the 95% quantiles and compare to the
# calculated 95% confidence intervals from above
apply(test_mat_t,
      MARGIN = 2,  # applies function (FUN) to columns (dim 2)
      FUN = quantile, probs = c(0.025, 0.5, 0.975))

# Compare
ci_mat
# Plot the relationship between intercept and slope
# Notice the covariance
plot(test_mat_t, xlab = "Intercept", ylab = "Slope")
Next, for each pair of intercept and slope randomly drawn above, we will calculate the expected values of \(Y\) across the range of covariate \(x\). We will then summarize the 95% quantile of expected \(Y\) at each value of \(x\) in this interpolation. To do this, we need a function to calculate the expected value of \(Y\). This function will have the intercept and slope as inputs and will output the expected values of \(Y\) across a range of \(x\). Then, we will apply() this function using all of the values of intercept and slope, in a vectorized and therefore very efficient manner, rather than using any for loops.
# Create a matrix that holds the values of x
# over which we want to interpolate the expected
# values of Y
x_fake_mat = cbind(rep(1, times = 100),
                   seq(0, max(x1), length.out = 100))

# Create a function that will calculate the expected values
y_hat_fun = function(x, x_mat){
  x_mat %*% x
}

# Apply this function to all intercepts and slopes that
# we drew from the multivariate t
y_pred_mt = apply(test_mat_t, 1, y_hat_fun, x_mat = x_fake_mat)

# Summarize the 95% quantile of the expected value of Y
# at each value of x
y_pred_mt_summary = apply(y_pred_mt, 1, quantile, probs = c(0.025, 0.975))
str(y_pred_mt_summary)
4.10.2 predict() function method
R has a built-in function predict() (see specific variant predict.lm()) which calculates expected values of the dependent variable from a linear regression model estimated using the function lm().
# Note that 'newdata' must be a data frame that includes the ranges
# of each covariate in the regression model for which you want
# to generate interpolated or predicted values of the dependent variable
# Here we calculate the expected values as well as the
# 95% confidence intervals for those expected values
y_predict = predict(lm_out,
                    newdata = data.frame(x1 = c(x_fake_mat[,2])),
                    interval = "confidence", level = 0.95)
str(y_predict)
4.10.3 Compare the two methods
Let’s visualize the output of the two methods to compare.
# plot
plot(x = NA, y = NA, xlab = "x", ylab = "y",
     xlim = c(0, max(x1)), ylim = c(-5, 25), pch = 19)

# Plot the expected values of Y for each pair of int/slope
for(i in 1:n_samp){
  lines(y_pred_mt[,i] ~ x_fake_mat[,2],
        # Reduce the opacity of each line
        col = scales::alpha("black", alpha = 0.1), lwd = 2)
}

# Add the data points
points(y ~ x1, col = 'orange', pch = 19, cex = 2)

# Add the expected values of Y from \hat{B}
abline(coef = lm_coef, col = "orange", lwd = 3)

# Add the conf int of expected Y using multivariate t
lines(y_pred_mt_summary[1,] ~ x_fake_mat[,2], lty = 2, lwd = 3, col = "orange")
lines(y_pred_mt_summary[2,] ~ x_fake_mat[,2], lty = 2, lwd = 3, col = "orange")

# Add the conf int of expected Y using predict() function
lines(y_predict[,"lwr"] ~ x_fake_mat[,2], lty = 3, lwd = 3, col = "purple")
lines(y_predict[,"upr"] ~ x_fake_mat[,2], lty = 3, lwd = 3, col = "purple")
There is yet a third option to calculate the uncertainty in predicted (i.e., interpolated or extrapolated) values of \(Y\), which is to derive an exact calculation of the confidence interval using the \(t\) distribution, similar to that shown in Section 4.9. See Ch4.1 of Dr. Barber’s book for this derivation.
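As a minimal sketch of that exact calculation (assuming the standard result that the variance of an expected value \(x_0^T \hat{B}\) is \(x_0^T (X^TX)^{-1} x_0 \, \hat{\sigma}^2\); the object names y_exp, se_fit, and ci_exact are introduced here for illustration), using objects defined above:

# Exact t-based confidence interval for the expected value of Y
# at each interpolation row of x_fake_mat
y_exp  = as.numeric(x_fake_mat %*% bhat_solve)
se_fit = sqrt(diag(x_fake_mat %*% inv_xtx %*% t(x_fake_mat)) * est_sigma^2)
ci_exact = cbind(fit = y_exp,
                 lwr = y_exp - t_crit * se_fit,
                 upr = y_exp + t_crit * se_fit)

# Compare to the predict() output from above
head(ci_exact)
head(y_predict)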
4.11 Multiple Linear Regression
So far, we have only discussed a single input variable in our model, which is a simple linear regression. When we have multiple input variables, we are dealing with multiple linear regression analysis, so the model looks like: \[y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_{p-1} x_{p-1,i} + \epsilon_i\] where \(p\) is the number of model coefficients and \(p-1\) is the number of input variables. Still, in matrix notation the model is \(Y = XB + \epsilon\), so the least squares regression analysis approach still works. However, our interpretation of the model coefficients becomes a bit more challenging.
Let’s look at a data set within the faraway package.
library(faraway)
data(gala)

# Plot the raw data relationships
par(mfrow = c(1, 3))
plot(gala$Species ~ gala$Elevation, xlab = "Elevation", ylab = "Species", pch = 19)
plot(gala$Species ~ gala$Area, xlab = "Area", ylab = "Species", pch = 19)
plot(gala$Species ~ gala$Adjacent, xlab = "Adjacent", ylab = "Species", pch = 19)
par(mfrow = c(1, 1))

# Conduct multiple and single linear regressions, focusing on Elevation
m1 = lm(Species ~ Elevation + Area + Adjacent, data = gala)
m2 = lm(Species ~ Elevation, data = gala)
coef(m1); coef(m2)
(Intercept) Elevation Area Adjacent
-5.71892681 0.31498110 -0.02031217 -0.07527974
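To see the two Elevation slopes side by side, we can overlay the fitted lines from both models on the Species-Elevation scatterplot:

# Compare the Elevation effect from the multiple regression (solid line)
# to the effect from the single regression (dashed line)
plot(gala$Species ~ gala$Elevation, xlab = "Elevation", ylab = "Species", pch = 19)
abline(coef = coef(m1)[1:2])
abline(coef = coef(m2)[1:2], lty = 2)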
What we see above is how the addition of Area and Adjacent input variables into the model “adjusts” the effect of Elevation, leading to two unique estimates of the slope (i.e., effect) of Elevation. Let’s probe multiple linear regression more closely by using simulated data.
First, let’s simulate a model with 80 data points that correspond to observations of 4 input variables and one outcome variable. Note that in Footnotes 4.12.3, we show a case with a categorical/binary input variable.
n = 80
n_covariate = 4
p = n_covariate + 1
betas = vector("numeric", length = p)
xmat = matrix(0, nrow = n, ncol = p)
sigma = 2.25

# Column for intercept
xmat[,1] = 1

# Generate the covariate data randomly:
set.seed(5)
xmat[,2] = rnorm(n, mean = 5, sd = 8)
xmat[,3] = runif(n, min = 0, max = 20)
xmat[,4] = rchisq(n, df = 50)
xmat[,5] = rpois(n, lambda = 10)

par(mfrow = c(2, 2))
for(i in 2:p){
  hist(xmat[,i], main = paste("covariate ", i-1))
}
# Set the betas:
betas[1] = 1.0
betas[2] = 0.75
betas[3] = -1.2
betas[4] = 0.0
betas[5] = 1.8

# Calculate the observed 'y', adding residual error
y = xmat %*% betas + rnorm(n, mean = 0, sd = sigma)
par(mfrow = c(1, 1))
hist(y)
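Before estimating anything, we can also look at the raw relationship between y and each covariate:

# Plot y against each covariate
par(mfrow = c(2, 2))
for(i in 2:p){
  plot(y ~ xmat[,i],
       xlab = paste("covariate ", i-1))
}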
How do we figure out the expected value of \(y\) for a particular situation? Here’s an example. What is the expected value of \(y\) when \(x_2 = 0.5\), but the rest of the input variables are at their average values?
# Written out long-ways:
pred_y = betas[1]*1 +
  betas[2]*mean(xmat[,2]) +
  betas[3]*0.5 +
  betas[4]*mean(xmat[,4]) +
  betas[5]*mean(xmat[,5])
pred_y
[1] 22.6006
Now let’s use ordinary least squares regression to estimate our model coefficients from the data, and then compare these to our “known” values of the model parameters.
# Run the model:
m1 = lm(y ~ 0 + xmat)

# Note that the following two models give the same results
#m2 = lm(y ~ 0 + X1 + X2 + X3 + X4 + X5, data = data.frame(xmat))
#m3 = lm(y ~ 1 + X2 + X3 + X4 + X5, data = data.frame(xmat))

summary(m1)
Call:
lm(formula = y ~ 0 + xmat)
Residuals:
Min 1Q Median 3Q Max
-6.4388 -1.4712 0.2816 1.5305 5.0032
Coefficients:
Estimate Std. Error t value Pr(>|t|)
xmat1 1.589838 1.759860 0.903 0.369
xmat2 0.737086 0.035629 20.688 <2e-16 ***
xmat3 -1.295274 0.044252 -29.270 <2e-16 ***
xmat4 -0.003676 0.028481 -0.129 0.898
xmat5 1.826125 0.088971 20.525 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.388 on 75 degrees of freedom
Multiple R-squared: 0.9763, Adjusted R-squared: 0.9747
F-statistic: 616.7 on 5 and 75 DF, p-value: < 2.2e-16
#summary(m2)
#summary(m3)

# Compare known `betas` to estimated coefficients
cbind(betas, coef(m1))
# plot the regression lines with abline
coef_m1 = coef(m1)
par(mfrow = c(2, 2))
for(i in 2:p){
  plot(y ~ xmat[,i], pch = 19,
       xlab = paste("covariate ", i-1),
       ylab = "y",
       ylim = range(y))
  abline(coef = coef_m1[c(1, i)])
}
Well, those regression lines do not look correct. That is because we are interpreting the slopes and intercepts a little incorrectly and not plotting them in the correct manner.
How to plot the output of lm() for multiple linear regression
When we isolate and visualize the relationship between the outcome and a single input variable, what we are really observing is the adjusted relationship, after accounting for the other input variables in the model. To understand the expected value of \(y\) for any particular value of the single input variable, we really need to set the other input variables to their mean value. Let’s demonstrate this below with the predict() function.
Let’s determine the expected values of \(y\) for input variable 2 (\(x_2\)) and plot it.
# Prediction for covariate 2 when all other input vars at mean
my_df = data.frame(xmat[,2:5])
head(my_df)
# Re-run the model but with just the input variables,
# and the intercept is implicit
m2 = lm(y ~ 1 + X1 + X2 + X3 + X4, data = my_df)

# Now let's try to predict y across a range of
# input variable 2,
# while holding the other input variables at
# their average values
n_pred = 100
new_df = data.frame(
  X1 = rep(mean(my_df$X1), n_pred),
  X2 = seq(0, 20, length.out = n_pred),
  X3 = rep(mean(my_df$X3), n_pred),
  X4 = rep(mean(my_df$X4), n_pred)
)
y_pred2 = predict(m2, newdata = new_df)

# Now plot:
par(mfrow = c(1, 1))
plot(y ~ my_df$X2, pch = 19,
     xlab = "covariate 2", ylab = "y")
lines(y_pred2 ~ new_df$X2)
Now we see that the predict() function shows a more intuitive relationship between input variable \(x_2\) and outcome \(y\), while accounting for the effects of the three other input variables.
4.12 Footnotes
4.12.1 Euclidean norm & cross product
We often see the syntax, \(|| a ||\), which is the Euclidean norm of the \(n\)-sized vector \(a\): \[|| a || = \left( \sum_{i=1}^{n} a_i^2 \right) ^ {1/2} ,\] so that when we see \(|| a ||^2\), this results in the sum of squares of vector \(a\), \(\sum_{i=1}^{n} a_i^2\).
In the context of least squares regression, we are trying to minimize the residual sum of squares, where the residuals, \(\epsilon_i\), are in vector, \(\epsilon\). The sum of squares of vector \(\epsilon\) is therefore \(|| \epsilon ||^2\). Algebraically, we can find this value as the cross product of \(\epsilon\), which is \(\epsilon^{T}\epsilon\). Let’s do a coded example with vector \(x\).
# Vector of real numbers
x = c(1, 2, 3, 4)

# sum of squares
sum(x^2)
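The same value is obtained as the cross product described above:

# Evaluated as cross-product
t(x) %*% x

## Or with crossprod()
crossprod(x, x)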
# Euclidean norm also known as the 2-norm
# so sum of squares is 2-norm, squared
norm(x, type = "2")^2
[1] 30
4.12.2 solve() and Inverse of matrix
Suppose we have matrices \(A\), \(X\), and \(B\), and the following expression is true: \[AX=B.\]
Then, suppose \(X\) is unknown, such that we want to find the solution for \(X\), when we rearrange: \[X = A^{-1} B,\] where \(A^{-1}\) is the multiplicative inverse of matrix \(A\). To figure this out computationally, we can use the solve() function in R, as long as \(A\) is a square matrix and has an inverse.
# Create A and known X
A = matrix(c(1, 1, 5, 2), ncol = 2)
X = matrix(c(2, 3), ncol = 1)

# Dot product to calculate B
B = A %*% X

# Suppose you have A and B, but want to find X
X_solve = solve(A, B)

# Did it work?
X; X_solve
[,1]
[1,] 2
[2,] 3
[,1]
[1,] 2
[2,] 3
We can see, then, that solve() is internally evaluating \(A^{-1}\). Remember that \(A^{-1}\) is not trivial to calculate, as it is the matrix that must satisfy: \(AA^{-1} = I\), where \(I\) is an identity matrix. In fact, solve(A) returns the inverse of \(A\), if it exists.
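We can confirm this by computing \(A^{-1}\) explicitly and recovering \(X\):

# Compute the inverse of A directly
inv_A = solve(A)

# Did it work? Compare to X
inv_A %*% B
X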
4.12.3 Multiple linear regression with a categorical input
Let’s simulate a case in which we have one categorical input variable that takes on values “low”, “medium”, and “high”, and one continuous input variable.
set.seed(7)
n = 90
sigma = 0.8

# X matrix
## Intercept
x0 = rep(1, times = n)

## Categorical input variable
### Note that we need to code this as "0" "1" "2" to
### simulate our outcome variable "y"
x1 = rep(c(0, 1, 2), each = n/3)
x1L = factor(x1, labels = c("low", "med", "high"))

## Continuous input variable
x2 = rnorm(n, 0, 2.5)

xmat = cbind(x0, x1, x2)
head(xmat)
# Intercept and 2 slopes
betas = c(1.5, 1.2, -1.5)

# Simulate outcome variable, as usual
y2 = xmat %*% betas + rnorm(n, 0, sigma)

# Plot the relationships
par(mfrow = c(1, 2))
plot(y2 ~ x1)
plot(y2 ~ x2)
# Run the model
## Note that we use the "factor" input variable
## "x1L", which has "levels"
m_cat = lm(y2 ~ 1 + x1L + x2)
summary(m_cat)
Call:
lm(formula = y2 ~ 1 + x1L + x2)
Residuals:
Min 1Q Median 3Q Max
-1.9182 -0.5032 0.1465 0.5061 1.2139
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.66504 0.13741 12.117 < 2e-16 ***
x1Lmed 1.15971 0.19113 6.068 3.38e-08 ***
x1Lhigh 2.29169 0.19170 11.954 < 2e-16 ***
x2 -1.51229 0.03275 -46.171 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7314 on 86 degrees of freedom
Multiple R-squared: 0.9674, Adjusted R-squared: 0.9663
F-statistic: 851.1 on 3 and 86 DF, p-value: < 2.2e-16
How do we interpret the slopes, given that there is a separate slope for x1Lmed and x1Lhigh? We can understand this better by seeing how the linear model adds up. For instance, what is the expected value of the outcome variable when \(x_1\) is high, and \(x_2 = 2.0\)?
## Using m_cat:
y2_pred = 1*coef(m_cat)[1] +    # Global average (intercept)
  0*coef(m_cat)[2] +            # Not "med"
  1*coef(m_cat)[3] +            # Yes "high"
  2.0*coef(m_cat)[4]            # x2=2.0 * slope
as.numeric(y2_pred)
[1] 0.9321446
When we assigned the slope of the categorical input variable as \(1.2\), remember this is the expected change in \(y\) as the input variable changes by a value of \(1.0\). In the model, we code the \(x_1\) variable as taking numerical values \(0\), \(1\), and \(2\) to represent the categories “low”, “medium”, and “high”. So, the slope for x1Lmed is the expected change in \(y\) as the input variable changes from “low” to “medium”, an effective change of \(1.0\). Then, the slope for x1Lhigh is the expected change in \(y\) as the input variable changes from “low” to “high”, an effective change of \(2.0\); hence, this slope is estimated as \(2.29\), with standard error \(0.19\). Notice how this slope is approximately twice our “known” slope for the input variable, which was \(1.2\).
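As a quick numerical check (a small illustration, not part of the original notes), we can compare the estimated x1Lhigh coefficient to twice the simulated slope:

# Estimated "high" slope vs. twice the simulated slope of 1.2
coef(m_cat)["x1Lhigh"]
2 * 1.2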