6  Maximum Likelihood

6.1 Generate some data

First, let’s generate some data for the case of a simple linear regression.

### PARAMS
beta0 = 1.5
beta1 = 0.5
sigma = 0.4
n = 30

### GENERATE DATA
set.seed(5)
x = runif(n, -1.5, 1.5)
exp_y = beta0 + beta1*x
y = exp_y + rnorm(n, mean=0, sd=sigma)

# Create a data frame 
my_df = data.frame(y = y, x = x)

plot(my_df$y ~ my_df$x, pch = 19,
     xlab = "x", ylab = "y")

6.2 Calculate a likelihood

Remember that, for simple linear regression, the likelihood of a single data point is as follows: \[y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)\] \[P(y_i | X_i B, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \text{e}^{-\frac{1}{2}\frac{(y_i - X_i B)^2}{\sigma^2}}\]

Then, the full likelihood of the data set \(Y\) is computed as:

\[ P(Y | X B, \sigma^2) = \prod^n_{i=1} P(y_i | X_i B, \sigma^2)\] Or, on the natural logarithmic scale: \[ ln\left(P(Y | X B, \sigma^2)\right) = \sum^n_{i=1} ln\left(P(y_i | X_i B, \sigma^2)\right)\]

# How to calculate the likelihood of a single data point:
dnorm(y[1], 
      mean = beta0 + beta1*x[1],
      sd = sigma,
      log = FALSE)
[1] 0.987769
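
As a quick check on the density formula above (a small aside, using the parameters we defined when generating the data), we can plug the first data point into the normal density by hand and confirm that it reproduces the dnorm() value of 0.987769:

# Evaluate the normal density formula directly for the first data point
mu_1 = beta0 + beta1*x[1]   # expected value of y[1]; mu_1 is just a temporary name
1/sqrt(2*pi*sigma^2) * exp(-0.5*(y[1] - mu_1)^2/sigma^2)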
# Calculate the full likelihood of the data, using the product
## Vectorized:
LH_notlog =
prod(
    dnorm(y, 
          mean = beta0 + beta1*x,
          sd = sigma,
          log = FALSE)
)

# Log-likelihood, vectorized
LH_log = 
sum(
    dnorm(y, 
          mean = beta0 + beta1*x,
          sd = sigma,
          log = TRUE)
)

# Make sure the output makes sense
LH_notlog
[1] 8.496937e-10
# Indeed the log-likelihood is the log of the 
# likelihood on the raw probability scale.
LH_log; log(LH_notlog)
[1] -20.88615
[1] -20.88615

6.3 Using optim() to minimize the negative log-likelihood

As we discussed in lecture, it is more computationally convenient to minimize functions, rather than to maximize. Therefore, to conduct linear regression analysis with maximum likelihood methods, we will find the values of \(\hat{B}\) that minimize the negative log-likelihood of the data: \(-ln\left(P(Y | X B, \sigma^2)\right)\).

6.3.1 Function to calculate the negative log-likelihood

First, we need to construct a function that calculates the negative log-likelihood, written so that the model parameters to be estimated by optim() are passed in as a single vector.

### NEG LOG-LIK MINIMIZATION
neg_log_lik = function(p, data_df){
    beta0=p[1]
    beta1=p[2]
    sigma=p[3]
    
    mu = beta0 + beta1*data_df$x
    
    nll = -sum(dnorm(data_df$y, mean=mu, sd=sigma, log = TRUE))
    return(nll)
}

Here you can see the inputs to the function: p is a vector of parameters to be estimated (i.e., optimized), and data_df is a data.frame that holds the values of the outcome variable \(y\) and the associated input variables (in this case, just one, \(x\)).
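
Before handing this function to optim(), it can help to call it directly at a couple of candidate parameter vectors (the second guess below, c(0, 0, 1), is arbitrary and chosen just for illustration); a smaller return value indicates a better fit to the data:

# Evaluate the negative log-likelihood at the true parameters
# and at an arbitrary, worse guess
neg_log_lik(p = c(beta0, beta1, sigma), data_df = my_df)
neg_log_lik(p = c(0, 0, 1), data_df = my_df)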

Then, we can use the optim() function, which applies a gradient-based optimization algorithm to estimate the values of the parameters that minimize the provided function, neg_log_lik(). We will learn more about gradient descent later, because this is a very important method used widely across machine learning and neural networks.

m_nll = 
    optim(
        par = c(0.1,0,0.1),
        fn = neg_log_lik,
        data_df = my_df,
        method = "L-BFGS-B",
        lower=c(-5,-5,0.001),
        upper=c(5,5,5),
        hessian = TRUE
    )
par_tab_nll = rbind(m_nll$par)
colnames(par_tab_nll) = c("int", "slope", "sigma")
par_tab_nll
          int   slope     sigma
[1,] 1.598468 0.56205 0.4571571

Note the optim() options. par specifies the initial guesses of the three parameters, whereas lower and upper specify the bounds within which to search for the best values of the parameters. With hessian = TRUE we ask the function to output an estimate of the Hessian matrix of the function, which we can use to estimate the standard errors of the parameter estimates (see Footnote 6.6.1). Finally, method specifies the algorithm used to minimize the function; here it is L-BFGS-B, a modified quasi-Newton method in the family of gradient-based algorithms we will discuss later.

We can see that the function outputs three point-estimates, which are \(\hat{B}\) (i.e., slope and intercept), as well as the residual standard deviation, \(\hat{\sigma}\).
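
Beyond the point estimates in par, the list returned by optim() also reports the minimized negative log-likelihood (value), a convergence code (convergence, where 0 indicates success), and, because we set hessian = TRUE, the estimated Hessian matrix (hessian). A quick look at these components:

# Other useful components of the optim() output list
m_nll$value        # minimized negative log-likelihood
m_nll$convergence  # 0 indicates successful convergence
m_nll$hessian      # used for standard errors (see Footnote 6.6.1)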

6.3.2 Compare optim() results to the OLS output

### COMPARE TO LM()
m_ols = lm(y ~ 1 + x, data = my_df)
m_ols_summary = summary(m_ols)
m_ols_summary # Notice p-value

Call:
lm(formula = y ~ 1 + x, data = my_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04966 -0.37035  0.06069  0.37520  0.72646 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.59847    0.08643  18.495  < 2e-16 ***
x            0.56205    0.09864   5.698 4.14e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4732 on 28 degrees of freedom
Multiple R-squared:  0.537, Adjusted R-squared:  0.5204 
F-statistic: 32.47 on 1 and 28 DF,  p-value: 4.136e-06
par_tab_ols = c(coef(m_ols), m_ols_summary$sigma)
names(par_tab_ols) = c("int", "slope", "sigma")
par_tab_ols; par_tab_nll
      int     slope     sigma 
1.5984685 0.5620501 0.4732005 
          int   slope     sigma
[1,] 1.598468 0.56205 0.4571571
plot(my_df$y ~ my_df$x, pch = 19,
     xlab = "x", ylab = "y")
# Line from OLS
abline(coef = coef(m_ols), col = "black", lwd = 2)
# Line from MaxLikelihood
abline(coef = m_nll$par[1:2], col = "red", lty = 2, lwd = 2)

As we learned in lecture, the estimates of \(\hat{B}\) from least squares and maximum likelihood are equivalent, and indeed lm() and optim() produce the same estimates of the intercept and slope. If you look at the Residual standard error in the lm() output, you will also see an estimate of \(\hat{\sigma}\) that is very close to, though not identical to, the optim() estimate.
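
The small discrepancy in \(\hat{\sigma}\) arises because the maximum likelihood estimator divides the sum of squared residuals by \(n\), whereas lm() divides by the residual degrees of freedom, \(n - 2\). A quick check of that relationship, using the objects already created above:

# sigma_ML = sigma_OLS * sqrt((n - 2)/n), since the MLE divides the
# sum of squared residuals by n rather than n - 2
m_ols_summary$sigma * sqrt((n - 2)/n)  # should closely match optim()'s sigma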

6.4 Hypothesis-testing for maximum likelihood

As explained in lecture, we will use the likelihood ratio test to test:

\[H_0: \beta_i = 0\] \[H_A: \beta_i \ne 0\] In the least squares framework, we used a \(t\)-test. But, for maximum likelihood, we are going to base our test on the likelihood of a model that does or does not include the slope, similar to the \(F\)-test we learned before.

For the case of simple linear regression, we’re testing whether there is a significant difference between these models: \[H_0: y_i = \beta_0 + \epsilon_i\] \[H_A: y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] Notice that in \(H_0\), the slope \(\beta_1\) is assumed to be zero.

6.4.1 Likelihood ratio and the \(\chi^2\) test

Our goal is to understand if the likelihood of the null model, \(P(Y | \beta_0, \sigma^2)\), is sufficiently low compared to the likelihood of the full model, \(P(Y | \beta_0, \beta_1, x, \sigma^2)\), that we can reliably reject the null hypothesis.

We therefore construct a ratio of the likelihoods of the full and null model, very similar to the \(F\)-test framework. The log-likelihood ratio (\(LHR\)) becomes our test statistic: \[LHR_{\text{test}} = -2 ln \left(\frac{LH_{\text{null}}}{LH_{\text{full}}} \right)\]

Then, folks smarter than I have done the math to prove that, under the null hypothesis, this test statistic approximately follows a \(\chi^2\) distribution, such that:

\[LHR_{\text{test}} \sim \chi^2_k\]

where \(\chi^2_k\) is a \(\chi^2\) probability distribution with \(k\) degrees of freedom. \(k\) is equal to \(p_{\text{full}} - p_{\text{null}}\), where \(p\) is the number of model coefficients. In the case of simple linear regression, where we are removing just one model coefficient from the full model (i.e., set slope equal to zero), then \(k = 2-1 = 1\).

Finally, we can determine \(P(\chi^2 > LHR_{\text{test}})\), which gives us our \(p\)-value. This statistical test is known as the “likelihood ratio test,” and it is equivalently referred to as the “\(\chi^2\)” test, which we’ll see in the code below.
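
As a point of reference (a small aside, not part of the derivation), the 5% critical value of the \(\chi^2_1\) distribution is about 3.84, so with \(k = 1\) any \(LHR_{\text{test}}\) larger than that leads us to reject \(H_0\) at \(\alpha = 0.05\):

# Critical value of the chi-squared distribution with k = 1 df
# (reject H0 at alpha = 0.05 whenever LHR_test exceeds this value)
qchisq(0.95, df = 1)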

6.4.2 Manual calculation of the likelihood ratio test

To begin, we need to use maximum likelihood to estimate the likelihood of the “null” model. We need to adjust our function that will be used by optim() to only include two parameters: the intercept, and the residual standard deviation.

# Need a null model:
nll_null = function(p, data_df){
    beta0=p[1]
    sigma=p[2]
    
    mu = beta0
    
    nll = -sum(dnorm(data_df$y, mean=mu, sd=sigma, log = TRUE))
    return(nll)
}

m_nll_null = 
    optim(
        par = c(0.1,0.1),
        fn = nll_null,
        data_df = my_df,
        method = "L-BFGS-B",
        lower=c(-5,0.001),
        upper=c(5,5)
    )
par_tab_nll_null = rbind(m_nll_null$par)
colnames(par_tab_nll_null) = c("int", "sigma")
par_tab_nll_null
          int     sigma
[1,] 1.612056 0.6718164
plot(my_df$y ~ my_df$x, pch = 19,
     xlab = "x", ylab = "y")
# Fitted line from the full model (maximum likelihood)
abline(coef = m_nll$par[1:2], col = "red", lty = 1)
# Flat line at the intercept of the null (intercept-only) model
abline(h = m_nll_null$par[1], col = "red", lty = 2)

The flat dashed line represents the null (intercept-only) model. Now, we calculate the likelihood ratio test statistic, and compare to the \(\chi^2\) probability distribution to determine our \(p\)-value of the test. Note that within the optim() function’s output list, there is a numeric object called value. This value is the negative log-likelihood of the model with the estimated coefficients. We can use this to calculate our test statistic.

# optim() reports the *negative* log-likelihood in $value,
# so negate it to recover the log-likelihood of each model
log_lh_full = -m_nll$value
log_lh_null = -m_nll_null$value

# use exp() to convert the log-likelihoods to the raw probability scale
lh_full = exp(log_lh_full)
lh_null = exp(log_lh_null)

# Now calculate LHR
lhr = -2 * log(lh_null / lh_full)
lhr
[1] 23.09763
# Of course, using rules of natural logs, this is equivalent:
-2 * (log_lh_null - log_lh_full)
[1] 23.09763

Now that we have our value of \(LHR_{\text{test}}\), we use the \(\chi^2\)-distribution to find \(P(\chi^2 > LHR_{\text{test}})\), which is the \(p\)-value of the test.

# How many parameters being "removed" (i.e., set to zero) in test:
df_chi = 2 - 1

# P-value: P(chi^2 > LHR_test) under the null hypothesis
p_val = 1 - pchisq(lhr, df = df_chi)
p_val
[1] 1.539803e-06

Based on this low \(p\)-value, we would say there is sufficient evidence to reject the null hypothesis and that the slope \(\beta_1\) is significantly different than zero.

We can compare this outcome to a built-in R function called drop1().

drop1(m_ols, test = "Chisq")
Single term deletions

Model:
y ~ 1 + x
       Df Sum of Sq     RSS     AIC Pr(>Chi)    
<none>               6.2697 -42.964             
x       1    7.2703 13.5401 -21.866 1.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the function, we specified test = "Chisq", which implements the \(\chi^2\) test using the likelihood ratio. The Pr(>Chi) column in this summary output is equivalent to our manually computed value of \(P(\chi^2 > LHR_{\text{test}})\). The drop1() output does not provide a whole lot of detail, but the help page notes that specifying test = "Chisq" conducts a likelihood-ratio test. It does not report the likelihood ratio itself, but we can see that the \(p\)-value matches our manual calculation above.
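
As one more cross-check (a sketch using base R's logLik(), which returns the Gaussian log-likelihood of an lm() fit), we can fit the intercept-only model with lm() and reproduce the likelihood ratio statistic and \(p\)-value directly:

# Fit the null (intercept-only) model and recompute the LHR from logLik()
m_ols_null = lm(y ~ 1, data = my_df)
lhr_lm = -2 * (as.numeric(logLik(m_ols_null)) - as.numeric(logLik(m_ols)))
lhr_lm                      # should closely match lhr above
1 - pchisq(lhr_lm, df = 1)  # and reproduce the p-value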

6.5 Gradient descent algorithm

How does the optim() function work? There are various optimization algorithms that can be implemented by optim() that are specified by the user in the method option. Several of these are gradient-based algorithms, which is a family of algorithms that are very common in engineering problems (e.g., optimal control of drones) and machine learning (e.g., neural networks, reinforcement learning). These algorithms generally minimize an “objective function”: \(\min_{{x\in {\mathbb R}^{n}}}\;f(x)\).

A gradient descent algorithm is the most common type of algorithm for minimizing a function. There are many varieties of these algorithms that employ various “tricks” to make them more efficient (e.g., find the solution in fewer iterations) or more reliable. In lecture we outlined the foundational gradient descent algorithm, upon which many more advanced algorithms are based. For a function \(f(x)\), we want to find the value of \(x\) that solves \(\min_{x\in {\mathbb R}^{n}}\;f(x)\), using the following steps (a toy sketch of these steps follows the list):

  1. Choose a starting value of \(x\)
  2. Evaluate \(\nabla f(x)\). With \(h=10^{-4}\): \[\text{grad} = \frac{f(x+h) - f(x)}{h}\]
  3. Set the search direction as \(\text{direct} = -\nabla f(x)\)
  4. Set the step size as a fraction of \(\nabla f(x)\): \(\text{alpha} = 0.005\)
  5. Update \(x\) for next iteration: \[x = x + \text{alpha}* \text{direct}\]
  6. Repeat Steps 2-5 until \(\nabla f(x) \approx 0\) (i.e., \(||\text{grad}|| \le 10^{-4}\))
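
Before applying these steps to the regression problem, here is a minimal sketch of the algorithm on a toy function, \(f(x) = (x - 3)^2\) (the function, the names f_toy and x_toy, and the starting value are chosen only for illustration); the same loop structure appears in the regression example below.

# Toy illustration: minimize f(x) = (x - 3)^2 with basic gradient descent
f_toy = function(x) (x - 3)^2

x_toy = 0       # Step 1: starting value
h = 1e-4        # finite-difference step for approximating the gradient
alpha = 0.005   # step size
grad = 100      # arbitrary large value so the while() loop starts

while (abs(grad) > 10^-4) {
    grad = (f_toy(x_toy + h) - f_toy(x_toy)) / h  # Step 2: approximate the gradient
    direct = -grad                                # Step 3: search direction
    x_toy = x_toy + alpha * direct                # Steps 4-5: update x
}
x_toy  # ends up very close to 3, the minimizer of f_toy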

Let’s implement this algorithm for the maximum likelihood solution of simple linear regression, using the data set above. To make this easier, we’re going to assume that we know the intercept and the residual standard deviation perfectly, so we are only trying to estimate the slope. See Footnote 6.6.3 for the case in which we estimate all three model parameters simultaneously using gradient descent.

# Set up some storage arrays:
slope_guess = NULL
nll_guess = NULL

# initial guess
slope = 0.1

# set h for approx of gradient
h = 1e-4

# set step size
alpha = 0.005

# Set gradient to arbitrary high number
## This makes the while() loop work
grad = 100

# initialize counter
i = 1

# While gradient is not yet \approx zero
while (norm(grad, "2") > 10^-4) {
    
    # Store the current value of slope:
    slope_guess[i] = slope
    
    #############
    # Approximate the gradient of the nll
    #############
    ## Adjust the slope by a small number, h
    slope_adj = slope + h
    ## Calculate f(slope)
    f_slope = neg_log_lik(p = c(beta0, slope, sigma), 
                          data_df = my_df)
    ## Store the current nll
    nll_guess[i] = f_slope
    ## Calculate f(slope + h)
    f_slope_adj = neg_log_lik(p = c(beta0, slope_adj, sigma), 
                              data_df = my_df)
    ## Calculate gradient
    grad = (f_slope_adj - f_slope) / h
    
    # search direction
    direct = -grad
    
    # update slope
    slope = slope + alpha * direct
    
    # counter for x
    i = i + 1
}

# Output the optimal slope and associated nll
n_iter = i-1

grad_descent_tab = cbind(slope_guess[n_iter], nll_guess[n_iter])
colnames(grad_descent_tab) = c("slope", "neg_log_lik")

# Compare to optim() output
optim_nll_tab = cbind(m_nll$par[2], m_nll$value)
colnames(optim_nll_tab) = c("slope", "neg_log_lik")

print("Gradient Descent:");grad_descent_tab
[1] "Gradient Descent:"
         slope neg_log_lik
[1,] 0.5651002    20.58064
print("optim() output:");optim_nll_tab
[1] "optim() output:"
       slope neg_log_lik
[1,] 0.56205    19.08618

Note that the estimates are not exactly the same, mainly because here we held the intercept and the residual standard deviation fixed at their true values, whereas optim() estimated all three parameters (intercept, slope, and residual standard deviation) simultaneously.

Now, let’s visualize the gradient descent.

plot(nll_guess ~ slope_guess, 
     pch = 19, type = "b", 
     ylab = "Neg. Log. Likelihood",
     xlab = expression(hat(beta)[1]),
     xlim = c(0.1, 0.6), ylim = c(15, 40))

6.6 Footnotes

6.6.1 Hessian matrix

The optim() function provides point estimates for the maximum likelihood-derived model coefficients. Just like in least squares regression, however, we want to quantify the uncertainty in these estimates. We therefore want the standard error in the model coefficient estimates.

In the case of least squares, we showed how we can calculate a variance-covariance matrix for the model coefficients, and the square root of the diagonal of this matrix gives the standard errors. For maximum likelihood we can estimate this same variance-covariance matrix, but it comes from a different matrix called the Hessian. We do not need to go into detail, but the Hessian is the matrix of second partial derivatives of the negative log-likelihood with respect to the parameters (I will not ask you to recall this information). The variance-covariance matrix of the estimated model coefficients is then calculated as the inverse of this Hessian. If the Hessian matrix of the negative log-likelihood is \(H\), then \[SE(\hat{\beta_i}) = \sqrt{\text{diag}\left( H^{-1}\right)_i}\] I understand that’s complicated, but it’s easy enough to extract these values computationally from the optim() output, assuming you use the option hessian = TRUE.

# Extract the Hessian from the optim() output
hessian = m_nll$hessian
# Calculate the var-cov matrix from the inverse Hessian
# Remember solve(X) gives X^-1
params_varcov = solve(hessian)
# Then extract the diagonal and take the square root
# This gives a vector of SE(\param_i)
se_params = sqrt(diag(params_varcov))
params_tab = cbind(m_nll$par, se_params)
colnames(params_tab) = c("Estimate", "Std. Error")
rownames(params_tab) = c("Intercept", "slope", "sigma")
params_tab
           Estimate Std. Error
Intercept 1.5984685 0.08349686
slope     0.5620500 0.09529343
sigma     0.4571571 0.05901782
# Same as OLS? 
m_ols_summary$coefficients[c(1:2), 1:2] # Pretty close!
             Estimate Std. Error
(Intercept) 1.5984685 0.08642710
x           0.5620501 0.09863766

We can see that the standard errors for the maximum likelihood estimates are very close to, though not exactly the same as, the OLS standard errors (again reflecting the slightly different estimate of \(\hat{\sigma}\)).

6.6.2 optim() using least squares

Remember that optim() is not specific to maximum likelihood; rather, it implements several general-purpose minimization algorithms, so we can use it to minimize any quantity. To emphasize this point, remember that in least squares regression we find the values of the model coefficients \(\hat{B}\) that minimize the sum of squared errors, \(\sum_{i=1}^n \epsilon_i^2 = \epsilon^T\epsilon\). Let’s minimize this quantity using optim().

### LEAST SQUARES MINIMIZATION

# We need a function to calculate the sum of squared errors:
least_sq = function(p, data_df){
    beta0=p[1]
    beta1=p[2]
    y = data_df$y
    n = length(y)
    
    expected_y = beta0 + beta1*data_df$x
    sse = 0
    for(i in 1:n){
        epsilon_i = y[i] - expected_y[i]
        sse = sse + (epsilon_i)^2
    }
    
    return(sse)
}

### OPTIMIZE LEAST SQUARES
fit_least_sq = 
    optim(
        par = c(0,0),
        fn = least_sq,
        data_df = my_df,
        method = "L-BFGS-B",
        lower=c(-5,-5),
        upper=c(5,5),
        hessian = TRUE
    )
# Create a table of estimates:
par_tab_least_sq = rbind(fit_least_sq$par)
colnames(par_tab_least_sq) = c("int", "slope")
par_tab_least_sq
          int     slope
[1,] 1.598469 0.5620501
# Compare to original OLS estimates:
coef(m_ols)
(Intercept)           x 
  1.5984685   0.5620501 

You could also use the Hessian output to calculate the standard errors of the model coefficients, but I will leave that up to you.

6.6.3 Gradient descent, multiple parameters

Now let’s suppose we want to estimate all of our model parameters (intercept, slope, residual standard deviation) simultaneously using gradient descent. Recall from lecture that we need to estimate three components of the gradient (the partial derivatives of the function with respect to each model parameter).

# How many params to estimate?
n_param = 3

# Set up some storage arrays:
## Guess that the number of iterations will be 100 or less...
param_guess = array(0, dim=c(100,n_param))
nll_guess = vector("numeric", length = 100)

# initial guesses
param = vector("numeric", length = n_param)
param[1] = 0.0 # intercept
param[2] = 1.0
param[3] = 2.5

# set h for approx of gradient
h = 1e-4

# set step size
alpha = 0.005

# Set gradient components to arbitrary high numbers
## This makes the while() loop work
grad = rep(100, times = n_param)

# initialize counter
i = 1

# While gradient is not yet \approx zero
while (norm(grad, "2") > 10^-4) {
    
    # Store the current values of all parameters:
    param_guess[i, ] = param
    
    #############
    # Approximate the gradient of the nll
    #############
    
    ## Calculate nll with all current params
    f_x = neg_log_lik(p = param, 
                          data_df = my_df)
    ## Store the current nll
    nll_guess[i] = f_x
    
    ## One param at a time, approximate gradient component
    ## (i.e., partial derivative)
    for(j in 1:n_param){
        ## Keep all but one params the same
        param_adj = NULL
        param_adj = param
        param_adj[j] = param[j] + h
        
        f_x_adj = neg_log_lik(p = param_adj, 
                              data_df = my_df)
        ## Calculate gradient component
        grad[j] = (f_x_adj - f_x) / h
    }
    
    # search direction
    ## Note that 'direct' is an array of size n_param
    direct = -grad
    
    # update params
    for(j in 1:n_param){
        param[j] = param[j] + alpha * direct[j]
    }
    
    # counter for x
    i = i + 1
}

# Output the optimal parameters and associated nll
n_iter = i-1

grad_descent_tab = cbind(rbind(param_guess[n_iter,]), 
                         nll_guess[n_iter])
colnames(grad_descent_tab) = 
    c("int", "slope", "sigma", "neg_log_lik")

# Compare to optim() output
optim_nll_tab = cbind(rbind(m_nll$par), 
                      m_nll$value)
colnames(optim_nll_tab) = colnames(grad_descent_tab)

print("Gradient Descent:");grad_descent_tab
[1] "Gradient Descent:"
         int     slope     sigma neg_log_lik
[1,] 1.59842 0.5620025 0.4571052    19.08618
print("optim() output:");optim_nll_tab
[1] "optim() output:"
          int   slope     sigma neg_log_lik
[1,] 1.598468 0.56205 0.4571571    19.08618

Now, plot the gradient descent:

par(mfrow=c(1,3))
plot(nll_guess[1:n_iter]~param_guess[1:n_iter, 1], 
     pch=19, type = "b", 
     col = "blue",
     ylab = "Neg. Log Likelihood", 
     xlab = expression("Intercept, "~beta[0]))
plot(nll_guess[1:n_iter]~param_guess[1:n_iter, 2],
     pch=19, type = "b", 
     col = "orange",
     ylab = "Neg. Log Likelihood", 
     xlab = expression("Slope, "~beta[1]))
plot(nll_guess[1:n_iter]~param_guess[1:n_iter, 3], 
     pch=19, type = "b", 
     col = "red",
     ylab = "Neg. Log Likelihood", 
     xlab = expression("Residual Std.Dev., "~sigma))

We can see for the intercept, our first guess was too low, so we descended the gradient by addition (i.e., search direction was positive), whereas for the slope and the residual standard deviation, our first guess was too large, so we descended the gradients by subtraction (i.e., search direction was negative).