11  Regularization and Real Data Analysis

Author: Amiya Ranjan Bhowmick
Affiliation: Institute of Chemical Technology, Mumbai

We start our discussion of today’s lecture with the usual multiple linear regression in the matrix setup, which is given by \[Y_{n\times 1} = X_{n\times (p+1)}\beta_{(p+1)\times 1} + \epsilon_{n\times 1},\] where the symbols have their standard interpretations.

In addition, we assume that \(\epsilon_{n\times 1} \sim \mathcal{N}_n\left(0_{n\times 1}, \sigma^2I_{n\times n}\right)\). We have also checked in the class that \(\mbox{E}(\widehat{\beta}) =\beta\), that is, the estimator \(\widehat{\beta}\) is an unbiased estimator of \(\beta\). Recall that \[\widehat{\beta} = \mbox{argmin}_{\beta}\,(Y-X\beta)'(Y-X\beta),\] whose closed-form solution is \(\widehat{\beta} = (X'X)^{-1}X'Y\).
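The following is a minimal sketch on simulated data (all variable names here are chosen only for illustration) that computes the closed-form estimate directly and compares it with the output of lm():

# illustrative simulation: verify that (X'X)^{-1} X'Y matches coef(lm(...))
set.seed(123)
n = 100
x1 = rnorm(n); x2 = rnorm(n)
y = 2 + 1.5*x1 - 0.5*x2 + rnorm(n)          # true beta = (2, 1.5, -0.5)
X_sim = cbind(1, x1, x2)                    # design matrix with intercept column
beta_hat = solve(t(X_sim)%*%X_sim)%*%t(X_sim)%*%y
print(t(beta_hat))                          # closed-form least squares estimate
print(coef(lm(y ~ x1 + x2)))                # should agree with beta_hat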

The covariance matrix of \(\widehat{\beta}\) is given by \[\mbox{Cov}(\widehat{\beta}) = \sigma^2(X'X)^{-1}.\] The diagonal entries of the above matrix give us the variances associated with the estimators of the individual components \(\beta_j\) of \(\beta\) for \(j\in \{0,1,2,\ldots,p\}\). It is important to note that if the matrix \(X'X\) is nearly singular, or ill-conditioned, then the determinant \(|X'X|\approx 0\), the variances \(Var(\widehat{\beta_j})\) for \(j\in\{0,1,\ldots,p\}\) will be large, and the estimates will be unreliable. By unreliable, I mean the standard errors of the estimators will be extremely large, possibly even infinite.
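As a small sketch (again on simulated data, with illustrative names), the standard errors reported by summary(lm()) are simply the square roots of the diagonal entries of \(\widehat{\sigma}^2(X'X)^{-1}\):

# illustrative simulation: reproduce lm()'s standard errors from sigma^2 (X'X)^{-1}
set.seed(1)
n = 100
x1 = rnorm(n); x2 = rnorm(n)
y = 1 + 2*x1 + 3*x2 + rnorm(n)
X_sim = cbind(1, x1, x2)
fit_sim = lm(y ~ x1 + x2)
sigma_hat = summary(fit_sim)$sigma                        # estimate of sigma
se_manual = sqrt(diag(sigma_hat^2 * solve(t(X_sim)%*%X_sim)))
print(se_manual)
print(summary(fit_sim)$coefficients[, "Std. Error"])      # should match se_manual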

A natural question arises: in what scenarios does such a possibility arise? From the theory of linear algebra, we know that if one column can be written as a constant multiple of another column, the determinant of the matrix is zero. This is not restricted to pairs of columns: if a column can be written as a linear combination of two or more other columns (with at least one non-zero coefficient), the determinant is also zero. In other words, the columns are linearly dependent, and the column space of \(X\) has dimension smaller than the number of columns of \(X\).
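A toy sketch of this phenomenon (simulated data, illustrative names, base R only): adding a column that is almost a linear combination of the others makes \(X'X\) ill-conditioned, and the diagonal entries of \((X'X)^{-1}\), and hence the coefficient variances, blow up.

set.seed(42)
n = 100
x1 = rnorm(n); x2 = rnorm(n)
x3 = 2*x1 - x2 + rnorm(n, sd = 0.01)   # x3 is almost exactly 2*x1 - x2
X_good = cbind(1, x1, x2)              # well-conditioned design
X_bad  = cbind(1, x1, x2, x3)          # nearly linearly dependent columns
kappa(t(X_good)%*%X_good)              # moderate condition number
kappa(t(X_bad)%*%X_bad)                # huge condition number
diag(solve(t(X_good)%*%X_good))        # small entries: stable variances
diag(solve(t(X_bad)%*%X_bad))          # entries for x1, x2, x3 are very large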

Given the above discussion, if we want to avoid the situation where \(|X'X|\approx 0\), we need to identify which columns of \(X\) are responsible for it. While discussing the concept of correlation, we emphasized that correlation is a measure of the linear relationship between random variables. Therefore, the degree of linear dependence between two columns of \(X\) (also called features by machine learning experts) can be measured by the correlation between those two columns. However, when one column can be written as a linear combination of several other columns, pairwise correlation may fail to capture it. Such dependence can be checked by fitting a linear regression of the form \[X_j = \beta_0^{(-j)} + \beta_1^{(-j)}X_1 + \cdots+\beta_{j-1}^{(-j)}X_{j-1}+\beta_{j+1}^{(-j)}X_{j+1} + \cdots + \beta_p^{(-j)}X_p + \epsilon\] for \(j\in \{1,2,\ldots,p\}\). A large \(R^2\) value of this regression, denoted \(R_{-j}^2\), indicates that \(X_j\) can be approximately represented as a linear combination of the other columns.

Data scientists compute a quantity called the variance inflation factor (VIF), defined as \[\mbox{VIF}_j = \frac{1}{1-R_{-j}^2}, \quad j=1,2,\ldots,p.\] If the variance inflation factor exceeds 5 or 10 (the threshold depends on the problem and on domain knowledge), an analyst may choose to drop the corresponding variables.

Let us consider the \(\texttt{Auto}\) dataset from the ISLR2 package in R.

library(ISLR2)
pairs(Auto, col = 'red')

Figure: Pairs plot for the Auto data set with five selected columns.
data = Auto[,c("mpg", "displacement", 
               "horsepower", "weight", 
               "acceleration")]
head(data)
  mpg displacement horsepower weight acceleration
1  18          307        130   3504         12.0
2  15          350        165   3693         11.5
3  18          318        150   3436         11.0
4  16          304        150   3433         12.0
5  17          302        140   3449         10.5
6  15          429        198   4341         10.0
pairs(data, col = "red")

cor(data)
                    mpg displacement horsepower     weight acceleration
mpg           1.0000000   -0.8051269 -0.7784268 -0.8322442    0.4233285
displacement -0.8051269    1.0000000  0.8972570  0.9329944   -0.5438005
horsepower   -0.7784268    0.8972570  1.0000000  0.8645377   -0.6891955
weight       -0.8322442    0.9329944  0.8645377  1.0000000   -0.4168392
acceleration  0.4233285   -0.5438005 -0.6891955 -0.4168392    1.0000000
library(corrplot)
corrplot(corr = cor(data), method = "number")

fit = lm(mpg ~ ., data = data)
summary(fit)

Call:
lm(formula = mpg ~ ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.378  -2.793  -0.333   2.193  16.256 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.2511397  2.4560447  18.424  < 2e-16 ***
displacement -0.0060009  0.0067093  -0.894  0.37166    
horsepower   -0.0436077  0.0165735  -2.631  0.00885 ** 
weight       -0.0052805  0.0008109  -6.512  2.3e-10 ***
acceleration -0.0231480  0.1256012  -0.184  0.85388    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.247 on 387 degrees of freedom
Multiple R-squared:  0.707, Adjusted R-squared:  0.704 
F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16
R2 = summary(fit)$r.squared
fit_1 = lm(displacement ~ horsepower + weight + acceleration,
           data = data)
R2_displacement = summary(fit_1)$r.squared

fit_2 = lm(horsepower ~ displacement + weight + acceleration,
           data = data)
R2_horsepower = summary(fit_2)$r.squared

fit_3 = lm(weight ~ displacement + horsepower + acceleration,
           data = data)
R2_weight = summary(fit_3)$r.squared

fit_4 = lm(acceleration ~ displacement + horsepower + weight,
           data = data)
R2_acceleration = summary(fit_4)$r.squared
print(R2)
[1] 0.7069812
print(R2_displacement)
[1] 0.9064277
print(R2_horsepower)
[1] 0.8866601
print(R2_weight)
[1] 0.9027659
print(R2_acceleration)
[1] 0.6158656
vif_displacement = 1/(1-R2_displacement)
vif_horsepower = 1/(1-R2_horsepower)
vif_weight = 1/(1- R2_weight)
vif_acceleration = 1/(1-R2_acceleration)
print(c(vif_displacement, 
        vif_horsepower, 
        vif_weight, 
        vif_acceleration))
[1] 10.686922  8.823022 10.284456  2.603255

Verification with existing packages in R

library(car)
Loading required package: carData
vif(fit)
displacement   horsepower       weight acceleration 
   10.686922     8.823022    10.284456     2.603255 

Optimal choice of lambda
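The analysis below is based on the ridge-regularized estimator, which shrinks the coefficients instead of dropping variables: \[\widehat{\beta}_{\lambda} = \mbox{argmin}_{\beta}\,\left\{(Y-X\beta)'(Y-X\beta) + \lambda\beta'\beta\right\} = (X'X+\lambda I)^{-1}X'Y,\] which is exactly the quantity computed by the solve() line inside the loop below. The tuning parameter \(\lambda\) is chosen by leave-one-out cross-validation: for each row \(i\), the estimator is computed from the remaining \(n-1\) rows, the response of row \(i\) is predicted, and \(\lambda\) is selected to minimize the average squared prediction error.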

lambda = 0.01
X = cbind(rep(1,nrow(data)), as.matrix(data[,2:5]))
head(X)
    displacement horsepower weight acceleration
1 1          307        130   3504         12.0
2 1          350        165   3693         11.5
3 1          318        150   3436         11.0
4 1          304        150   3433         12.0
5 1          302        140   3449         10.5
6 1          429        198   4341         10.0
Y = data$mpg

Y_hat = numeric(length = nrow(data))
# leave-one-out cross-validation: for each row i, compute the ridge estimator
# from the remaining rows and predict the response of row i
for(i in 1:nrow(data)){
  new_X = X[-i, ]
  beta_lambda = solve(t(new_X)%*%new_X + lambda*diag(ncol(X)))%*%t(new_X)%*%Y[-i]
  Y_hat[i] = X[i, ]%*%beta_lambda
}

error = Y - Y_hat
pred_error = mean(error^2)

Repeat the above for different lambda values

lambda_vals = seq(0.0001, 0.1, by = 0.001)
pred_error = numeric(length = length(lambda_vals))
for(j in 1:length(lambda_vals)){
  lambda = lambda_vals[j]
  Y_hat = numeric(length = nrow(data))
  for(i in 1:nrow(data)){
    new_X = X[-i, ]
    beta_lambda = solve(t(new_X)%*%new_X + lambda*diag(ncol(X)))%*%t(new_X)%*%Y[-i]
    Y_hat[i] = X[i, ]%*%beta_lambda
  }
  
  error = Y - Y_hat
  pred_error[j] = mean(error^2)
  
}
plot(lambda_vals, pred_error, col = "red", pch = 19,
     cex = 1.2, xlab = expression(lambda),
     ylab = expression(E(Y-widehat(Y))^2), type = "l",
     lwd = 2)
which.min(pred_error)
[1] 20
lambda_star = lambda_vals[which.min(pred_error)]
points(lambda_star,pred_error[which.min(pred_error)],
       pch = 19, col = "blue",
       cex = 1.3)
text(0.02, 18.348, bquote(lambda == .(lambda_star)),
       cex = 1.4, col = "blue")

In the above analysis, it is clear that the computation associated with approximating the prediction error is heavy. The number of rows in the data set is 392, so the model has been trained 392 times. The advantage of the above method is that every row of the data set has been used for both model training and model testing. The number of model fitting exercises can be reduced by dividing the data into \(k\) segments (folds) of approximately equal size; this is the idea behind \(k\)-fold cross-validation, implemented below.

nrow(data)
[1] 392
Y = data$mpg
X = cbind(rep(1,nrow(data)), as.matrix(data[,-1]))
Y
  [1] 18.0 15.0 18.0 16.0 17.0 15.0 14.0 14.0 14.0 15.0 15.0 14.0 15.0 14.0 24.0
 [16] 22.0 18.0 21.0 27.0 26.0 25.0 24.0 25.0 26.0 21.0 10.0 10.0 11.0  9.0 27.0
 [31] 28.0 25.0 19.0 16.0 17.0 19.0 18.0 14.0 14.0 14.0 14.0 12.0 13.0 13.0 18.0
 [46] 22.0 19.0 18.0 23.0 28.0 30.0 30.0 31.0 35.0 27.0 26.0 24.0 25.0 23.0 20.0
 [61] 21.0 13.0 14.0 15.0 14.0 17.0 11.0 13.0 12.0 13.0 19.0 15.0 13.0 13.0 14.0
 [76] 18.0 22.0 21.0 26.0 22.0 28.0 23.0 28.0 27.0 13.0 14.0 13.0 14.0 15.0 12.0
 [91] 13.0 13.0 14.0 13.0 12.0 13.0 18.0 16.0 18.0 18.0 23.0 26.0 11.0 12.0 13.0
[106] 12.0 18.0 20.0 21.0 22.0 18.0 19.0 21.0 26.0 15.0 16.0 29.0 24.0 20.0 19.0
[121] 15.0 24.0 20.0 11.0 20.0 19.0 15.0 31.0 26.0 32.0 25.0 16.0 16.0 18.0 16.0
[136] 13.0 14.0 14.0 14.0 29.0 26.0 26.0 31.0 32.0 28.0 24.0 26.0 24.0 26.0 31.0
[151] 19.0 18.0 15.0 15.0 16.0 15.0 16.0 14.0 17.0 16.0 15.0 18.0 21.0 20.0 13.0
[166] 29.0 23.0 20.0 23.0 24.0 25.0 24.0 18.0 29.0 19.0 23.0 23.0 22.0 25.0 33.0
[181] 28.0 25.0 25.0 26.0 27.0 17.5 16.0 15.5 14.5 22.0 22.0 24.0 22.5 29.0 24.5
[196] 29.0 33.0 20.0 18.0 18.5 17.5 29.5 32.0 28.0 26.5 20.0 13.0 19.0 19.0 16.5
[211] 16.5 13.0 13.0 13.0 31.5 30.0 36.0 25.5 33.5 17.5 17.0 15.5 15.0 17.5 20.5
[226] 19.0 18.5 16.0 15.5 15.5 16.0 29.0 24.5 26.0 25.5 30.5 33.5 30.0 30.5 22.0
[241] 21.5 21.5 43.1 36.1 32.8 39.4 36.1 19.9 19.4 20.2 19.2 20.5 20.2 25.1 20.5
[256] 19.4 20.6 20.8 18.6 18.1 19.2 17.7 18.1 17.5 30.0 27.5 27.2 30.9 21.1 23.2
[271] 23.8 23.9 20.3 17.0 21.6 16.2 31.5 29.5 21.5 19.8 22.3 20.2 20.6 17.0 17.6
[286] 16.5 18.2 16.9 15.5 19.2 18.5 31.9 34.1 35.7 27.4 25.4 23.0 27.2 23.9 34.2
[301] 34.5 31.8 37.3 28.4 28.8 26.8 33.5 41.5 38.1 32.1 37.2 28.0 26.4 24.3 19.1
[316] 34.3 29.8 31.3 37.0 32.2 46.6 27.9 40.8 44.3 43.4 36.4 30.0 44.6 33.8 29.8
[331] 32.7 23.7 35.0 32.4 27.2 26.6 25.8 23.5 30.0 39.1 39.0 35.1 32.3 37.0 37.7
[346] 34.1 34.7 34.4 29.9 33.0 33.7 32.4 32.9 31.6 28.1 30.7 25.4 24.2 22.4 26.6
[361] 20.2 17.6 28.0 27.0 34.0 31.0 29.0 27.0 24.0 36.0 37.0 31.0 38.0 36.0 36.0
[376] 36.0 34.0 38.0 32.0 38.0 25.0 38.0 26.0 22.0 32.0 36.0 27.0 27.0 44.0 32.0
[391] 28.0 31.0
head(X)
    displacement horsepower weight acceleration
1 1          307        130   3504         12.0
2 1          350        165   3693         11.5
3 1          318        150   3436         11.0
4 1          304        150   3433         12.0
5 1          302        140   3449         10.5
6 1          429        198   4341         10.0
k = 10          # number of folds
floor(nrow(data)/k)
[1] 39
# assign each row to one of the k folds at random; note that
# sample(rep(1:k, length.out = nrow(data))) would give nearly equal fold sizes
fold = sample(rep(1:k, nrow(data)), 
              size = nrow(data), replace = FALSE)
fold
  [1] 10  4  4  2  1 10 10  5  4  5  9  9  8  8  6  4  4  6  1  3  2  7  5  9  8
 [26]  2 10  3  8 10  8  2  3  9  9  9  2  1  5  5  8  1 10  8 10  7  8 10  9  4
 [51]  7  5  1 10  9  7  9  9  3  2  7  1  9  5  9  9  7 10  1  8  3  8 10  3 10
 [76]  7  5  3  3 10  5  8  2  9  9  3  6  3  3  8  8  3  5  1  6  2 10  9  1  3
[101]  1  9 10  2  2  6  5  7  1  3  9 10  3  9  2  7  4  3  1  4  3 10  4  9  7
[126]  9  3  5 10  6  1  3  5  2  4  7  4  2  7  2  7  5  7  8  2  2  3  6  2  2
[151] 10  3  3  8  4  5  2  1  3  2  8  7  2  9  7  8  6  8  1  7  6  9  5  9  7
[176]  8  2  3  7  7 10  9  9  7  6  8  7  1  7  7  9  9  5  2  1  5 10  6  4 10
[201]  6  5  1  4  7  7  6  3  7  3  7  4  8  9  7  6  5  3  9  7  6  1  5  2  3
[226]  9  7  5  1  4  4  8  1  5  3  8  7  1 10  6  7  1  7  5  2  9  3  9  6  2
[251]  6  7  8 10  2  7  6 10 10  9  8  3  7  6  2  8  7  7  5  6  9  6  7  1  5
[276]  9 10  4  1  9  2 10  8  5  8  4  2  3 10  9  7  8  3  4  9 10  8  7  3  1
[301]  3  5  7  1  8  5  8  9  9  2  8 10  2  7  5  8  7  1  3  5  2  2  6  9  8
[326]  9  4  4  8  3  1  9  6  5  4  1  5  8  1  3  5  3  3  7 10  2  4  9  8  9
[351]  9  8  1  6  7  5  3  2  3 10 10  3  2  7  4  5  5 10  5  8  3  5  1 10  2
[376] 10  8  6  8  8  2  2  5  6  8  7  3  2  7  6  3  4
table(fold)
fold
 1  2  3  4  5  6  7  8  9 10 
33 41 46 26 40 28 50 44 47 37 
# implementation of k-fold cross validation
lambda = 0.01
error = numeric(length = k)
for(i in 1:k){
  new_Y = Y[(fold != i)]
  new_X = X[(fold != i),]
  beta_lambda = solve(t(new_X)%*%new_X + lambda*diag(ncol(new_X)))%*%t(new_X)%*%new_Y
  Y_hat = X[fold==i,]%*%beta_lambda
  error[i] = mean((Y[fold == i] - Y_hat)^2)
}
k_fold_cv = mean(error)

# Do it for multiple lambda values
lambda_vals = seq(0.001,0.2, by = 0.001)
k_fold_cv = numeric(length = length(lambda_vals))
for(j in 1:length(lambda_vals)){
  lambda = lambda_vals[j]
  error = numeric(length = k)
  for(i in 1:k){
    new_Y = Y[(fold != i)]
    new_X = X[(fold != i),]
    beta_lambda = solve(t(new_X)%*%new_X + lambda*diag(ncol(new_X)))%*%t(new_X)%*%new_Y
    Y_hat = X[fold==i,]%*%beta_lambda
    error[i] = mean((Y[fold == i] - Y_hat)^2)
  }
  k_fold_cv[j] = mean(error)
  
}

plot(lambda_vals, k_fold_cv, col = "red",
     type = "l", lwd = 2,
     main = paste("k = ", k),
     xlab = expression(lambda),
     ylab = expression(E(Y-widehat(Y))^2))
points(lambda_vals[which.min(k_fold_cv)],
       k_fold_cv[which.min(k_fold_cv)],
       pch = 19, col = "blue", cex = 1.3)

# how the model coefficients shrink as a function of $\lambda$

data = Auto[,c("mpg", "displacement", 
               "horsepower", "weight", 
               "acceleration")]
lambda_vals = seq(0, 2, by = 0.5)
Y = data$mpg
X = cbind(rep(1,nrow(data)), as.matrix(data[,-1]))
beta_lambda = matrix(data = NA, ncol = ncol(X),
                     nrow = length(lambda_vals))
for(i in 1:length(lambda_vals)){
  lambda = lambda_vals[i]
  beta_lambda[i,] = solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
}

matplot(lambda_vals, abs(beta_lambda[,-1]),
        type = "l", lwd = 2)
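As an optional cross-check (assuming the glmnet package is installed; it is not used elsewhere in this lecture), the same shrinkage pattern can be examined with an existing ridge implementation. Note that glmnet standardizes the predictors and parameterizes the penalty differently, so its numbers will not match the manual \((X'X+\lambda I)^{-1}X'Y\) path exactly, but the coefficients shrink towards zero as \(\lambda\) grows in the same way.

# hedged cross-check with glmnet (assumed installed): alpha = 0 requests the
# ridge penalty; glmnet adds its own intercept and chooses its own lambda grid
library(glmnet)
ridge_fit = glmnet(x = as.matrix(data[,-1]), y = data$mpg, alpha = 0)
plot(ridge_fit, xvar = "lambda")   # coefficient paths versus log(lambda)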