APTS: Statistical Computing Assessment
The assessment
In my previous post, I shared my reflections on the first week of APTS at the
University of Warwick. Now, I will go through my solution for the
Statistical Computing assessment.
The assignment, which can be found here, involves implementing an R function that efficiently and stably fits a generalized linear model (GLM) for one-parameter exponential family distributions.
The density (or mass function) of a one-parameter exponential family model with canonical parameter $\theta$ is of the form
$$f(y \mid \theta) = \exp\{\theta y - b(\theta) + c(y)\}.$$
The linear predictor $\eta = X\beta$ in a GLM is related to the model's canonical parameter $\theta$ via a suitable link function. Here, $X$ is an $n \times p$ design matrix with known values, $\beta$ is a $p$-dimensional vector of regression parameters, and $y = (y_1, \ldots, y_n)$ is an $n$-dimensional vector of observations.
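For instance, the Poisson distribution with mean $\lambda$ is of this form with canonical parameter $\theta = \log\lambda$, $b(\theta) = e^\theta$, and $c(y) = -\log y!$:
$$f(y \mid \theta) = \exp\{\theta y - e^\theta - \log y!\}.$$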
Below, I will outline the main ideas of the solution and compare the estimated coefficients with those from R's glm function.
You can access a file with the detailed solution
here.
The solution
In the first two items, we are asked to write the log-likelihood in vector form and to derive its first and second derivatives with respect to $\beta$. I will skip the technical details, although they can be found in the detailed solution file. Up to an additive constant $\sum_i c(y_i)$, which does not depend on $\beta$, the log-likelihood, score vector, and Hessian matrix are respectively given by
$$\ell(\beta) = y^\top X\beta - \mathbf{1}^\top b(X\beta),$$
$$\nabla \ell(\beta) = X^\top\left[y - b'(X\beta)\right],$$
$$\nabla^2 \ell(\beta) = -X^\top \operatorname{diag}(w)\, X,$$
where $b(\cdot)$, $b'(\cdot)$, and $b''(\cdot)$ are applied elementwise and $w = b''(X\beta)$.
The score vector and Hessian matrix are required in order to apply the Newton-Raphson (NR) algorithm to fit the GLM. Given an initial guess $\beta^{(0)} = \left(\beta_0^{(0)}, \beta_1^{(0)}, \ldots, \beta_{p-1}^{(0)}\right)$, the NR update scheme for the one-parameter exponential family model can be written as follows:
$$\beta^{(k+1)} = \beta^{(k)} + \left[X^\top \operatorname{diag}(w_k)\, X\right]^{-1} X^\top \left[y - b'(X\beta^{(k)})\right] = \beta^{(k)} + \left[X^\top \operatorname{diag}(w_k)\, X\right]^{-1} X^\top z_k,$$
where $w_k = b''(X\beta^{(k)})$ and $z_k = y - b'(X\beta^{(k)})$.
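To make the update concrete, here is a minimal sketch of one naive NR step, assuming bp and bpp are functions that evaluate $b'$ and $b''$ elementwise (nr_step is an illustrative helper name, not part of the assignment):

# One naive Newton-Raphson step via solve(); the QR-based
# implementation below avoids forming X^T diag(w_k) X explicitly.
nr_step <- function(beta, y, X, bp, bpp) {
  eta <- drop(X %*% beta)
  w_k <- bpp(eta)                       # weights w_k = b''(X beta)
  z_k <- y - bp(eta)                    # residuals z_k = y - b'(X beta)
  beta + solve(crossprod(X, w_k * X),   # [X^T diag(w_k) X]^{-1} ...
               crossprod(X, z_k))       # ... X^T z_k
}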
By defining $X_k = \operatorname{diag}(w_k)^{1/2} X$, we can rewrite the NR update in a more computationally efficient form using the thin QR decomposition $X_k = Q_k R_k$, where $R_k$ is a $p \times p$ upper triangular matrix and $Q_k$ is $n \times p$ with orthonormal columns, so that $Q_k^\top Q_k = I_p$.
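As a quick aside, this is exactly the decomposition that R's qr function computes; a small sketch with arbitrary data verifies the stated properties:

A <- matrix(rnorm(12), nrow = 4, ncol = 3)  # a toy 4 x 3 matrix
qr_out <- qr(A)
Q <- qr.Q(qr_out)                  # 4 x 3, orthonormal columns
R <- qr.R(qr_out)                  # 3 x 3, upper triangular
max(abs(crossprod(Q) - diag(3)))   # ~ 0, since Q^T Q = I_p
max(abs(Q %*% R - A))              # ~ 0, since A = Q R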
Also, note that $\operatorname{diag}(w_k) = \operatorname{diag}(w_k)^{1/2}\operatorname{diag}(w_k)^{1/2}$, so the NR update can be set up as:
$$\begin{aligned}
\beta^{(k+1)} &= \beta^{(k)} + \left[X^\top \operatorname{diag}(w_k)^{1/2}\operatorname{diag}(w_k)^{1/2} X\right]^{-1} X^\top z_k \\
&= \beta^{(k)} + \left[X_k^\top X_k\right]^{-1} X_k^\top \operatorname{diag}(w_k)^{-1/2} z_k \\
&= \beta^{(k)} + \left[R_k^\top Q_k^\top Q_k R_k\right]^{-1} R_k^\top Q_k^\top \operatorname{diag}(w_k)^{-1/2} z_k \\
&= \beta^{(k)} + R_k^{-1} Q_k^\top \operatorname{diag}(w_k)^{-1/2} z_k.
\end{aligned}$$
The multiplication of $\operatorname{diag}(w_k)^{-1/2}$ by $z_k$ can be expressed as a Hadamard (elementwise) product, which is more computationally efficient than forming the diagonal matrix. It turns out that
$$\begin{aligned}
\beta^{(k+1)} &= \beta^{(k)} + R_k^{-1} Q_k^\top \left[w_k^{-1/2} \circ z_k\right] \\
\beta^{(k+1)} - \beta^{(k)} &= R_k^{-1} Q_k^\top \left[w_k^{-1/2} \circ z_k\right] \\
R_k\left(\beta^{(k+1)} - \beta^{(k)}\right) &= Q_k^\top \left[w_k^{-1/2} \circ z_k\right].
\end{aligned}$$
Computational implementation
We now implement a function, glm1, which fits a GLM for one-parameter family distributions while avoiding explicit matrix inversion, the formation of $n \times n$ diagonal weight matrices, and the cross-product $X^\top X$.
glm1 <- function(y, X, bp, bpp) {
  p <- NCOL(X)
  beta_k <- rep(0, p)        # initial guess: beta^(0) = 0
  list_beta <- list()        # stores the estimates at each iteration
  stop_error <- 1e-6         # convergence tolerance (epsilon)
  j <- 1L
  current_error <- 1
  while (current_error > stop_error) {
    eta_k <- X %*% beta_k        # linear predictor eta = X beta
    z_k <- y - bp(eta_k)         # working residuals z_k = y - b'(eta)
    w_k <- bpp(eta_k)            # weights w_k = b''(eta)
    X_k <- drop(w_k^(1/2)) * X   # X_k = diag(w_k)^{1/2} X, via elementwise product
    wz_k <- w_k^(-1/2) * z_k     # Hadamard product w_k^{-1/2} o z_k
    qr_out <- qr(X_k)            # thin QR decomposition X_k = Q_k R_k
    Q_k <- qr.Q(qr_out)
    R_k <- qr.R(qr_out)
    # Solve the triangular system R_k a_k = Q_k^T (w_k^{-1/2} o z_k)
    a_k <- backsolve(R_k, crossprod(Q_k, wz_k))
    # Update the parameter
    list_beta[[j]] <- a_k + beta_k
    current_error <- max(abs(beta_k - list_beta[[j]]))
    beta_k <- list_beta[[j]]
    j <- j + 1L
  }
  do.call(cbind, list_beta)      # one column of estimates per iteration
}
Because the coefficient matrix on the left-hand side of the final NR update is the upper triangular matrix $R_k$, the backsolve function can be used to solve the triangular system of linear equations efficiently.
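For instance (a toy system, unrelated to the assignment's data), backsolve exploits the triangular structure instead of computing a full inverse:

R <- matrix(c(2, 0, 1, 3), nrow = 2)  # upper triangular: rows (2, 1) and (0, 3)
b <- c(4, 6)
backsolve(R, b)                       # back-substitution gives x = (1, 2)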
The termination criterion is defined as
$$\max_i \left|\beta_i^{(k+1)} - \beta_i^{(k)}\right| < \epsilon,$$
where $\epsilon = 10^{-6}$, and the function returns a matrix whose columns hold the parameter estimates at each iteration until this criterion is achieved.
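Although the cross-checks below focus on the Bernoulli case, glm1 works for any one-parameter family once $b'$ and $b''$ are supplied. For example, a Poisson GLM with the canonical log link has $b(\theta) = e^\theta$, so both derivatives are simply exp (the data and coefficients below are simulated purely for illustration):

set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))
y <- rpois(n, lambda = exp(drop(X %*% c(0.5, 0.3))))
fit <- glm1(y, X, bp = exp, bpp = exp)
fit[, ncol(fit)]  # final estimates; compare with coef(glm(y ~ X[, 2], family = poisson))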
Cross-checks
I will cross-check my implementation against R's built-in glm function for the Bernoulli GLM, whose probability mass function is given by
$$f(y \mid \theta) = \exp\{y\theta - \log(1 + e^\theta)\}.$$
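Hence $b(\theta) = \log(1 + e^\theta)$, and the two derivatives needed by glm1 are
$$b'(\theta) = \frac{e^\theta}{1 + e^\theta} = \frac{1}{1 + e^{-\theta}}, \qquad b''(\theta) = \frac{e^\theta}{(1 + e^\theta)^2} = \frac{e^{-\theta}}{(1 + e^{-\theta})^2},$$
which are exactly the bp and bpp functions defined next.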
A user-friendly function, logReg, which calls glm1, is defined to fit a logistic regression model in much the same way as the glm function.
logReg <- function(formula, data) {
  mf <- model.frame(formula, data = data)
  y <- model.response(mf)
  if (is.factor(y))
    y <- as.numeric(y) - 1   # convert a two-level factor to 0/1
  X <- model.matrix(formula, mf)
  bp <- function(theta) 1 / (1 + exp(-theta))   # b'(theta)
  bpp <- function(theta) {                      # b''(theta)
    e <- exp(-theta)
    e / (1 + e)^2
  }
  glm1(y, X, bp, bpp)
}
The first check uses data simulated from a logistic regression model with 500 observations and two covariates, where the true coefficient values are $\beta = (1.0, -0.5, 0.5)$ and the covariates were simulated from standard normal and standard uniform random variables.
set.seed(69)
n <- 500
X <- cbind(1, rnorm(n), runif(n))   # intercept, normal and uniform covariates
betas <- c(1, -0.5, 0.5)            # true coefficients
eta <- drop(X %*% betas)
p <- 1 / (1 + exp(-eta))            # success probabilities
y <- rbinom(n, size = 1, prob = p)
sim_data <- data.frame(y = y, x1 = X[, 2], x2 = X[, 3])
As shown below, the coefficients estimated by logReg and glm agree to seven decimal places.
out_logReg <- logReg(formula = y ~ x1 + x2, data = sim_data)
out_glm <- glm(formula = y ~ x1 + x2, data = sim_data, family = "binomial")
cbind(true = betas,
      logReg = out_logReg[, ncol(out_logReg)],
      glm = coef(out_glm))
## true logReg glm
## (Intercept) 1.0 1.0698941 1.0698941
## x1 -0.5 -0.5386558 -0.5386558
## x2 0.5 0.5473424 0.5473424
A second cross-check, using the Pima.tr data from the MASS package, gives estimates that are identical to nine decimal places, confirming that the glm1 function has been implemented correctly.
out_logReg <- logReg(type ~ ., data = MASS::Pima.tr)
out_glm <- glm(type ~ ., data = MASS::Pima.tr, family = "binomial")
cbind(logReg = out_logReg[, ncol(out_logReg)],
      glm = coef(out_glm))
## logReg glm
## (Intercept) -9.773061533 -9.773061533
## npreg 0.103183427 0.103183427
## glu 0.032116823 0.032116823
## bp -0.004767542 -0.004767542
## skin -0.001916632 -0.001916632
## bmi 0.083623912 0.083623912
## ped 1.820410367 1.820410367
## age 0.041183529 0.041183529