Improving Linear Models
UNDER HEAVY CONSTRUCTION: NOT EVERYTHING MAY BE CORRECT.
But What’s Wrong with Least-Squares Regression?
One of the issues with least squares arises when p is large relative to n. You end up with a lot of predictor variables that don’t actually contribute much to the response, which inflates the variance of the fit and makes the model output and coefficients difficult to interpret. We want to make our models more accurate and more interpretable.
So rather than making minimizing the RSS our only goal, we can change or add to that goal. The three approaches we take are: subset selection, shrinkage/regularization, and dimension reduction.
Subset Selection
Idea: Fit a model for every possible combination of predictors, i.e., fit all models with 1 predictor, all models with 2 predictors, …, all models with p predictors. Then pick the best one based on CV error, AIC, BIC, etc.
This would most certainly give us the best model on our training set, but it is extremely computationally heavy, since we are fitting \(2^p\) models. We also run the risk of finding models that look like they fit our data well but have no predictive capability (since we are evaluating training error), leading to overfitting and high variance of our coefficients.
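Here is a minimal sketch of the brute-force enumeration, assuming scikit-learn and a small toy dataset (the data, feature count, and CV scoring are all illustrative choices, not part of the notes above). It scores each candidate subset by cross-validated \(R^2\) rather than training RSS, for the reasons just mentioned.

```python
# Best subset selection by brute-force enumeration (illustrative sketch).
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                      # toy data: n = 100, p = 6
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)   # only predictors 0 and 2 matter

best_score, best_subset = -np.inf, None
p = X.shape[1]
for k in range(1, p + 1):
    for subset in combinations(range(p), k):       # all C(p, k) subsets of size k
        # Score by cross-validated R^2 instead of training RSS,
        # so we do not reward overfit models.
        score = cross_val_score(LinearRegression(), X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "CV R^2:", round(best_score, 3))
```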
We instead look to forward and backward stepwise selection. In FSS, we start with a model containing no predictors. We then consider each of the p predictors as the first predictor and pick the one that gives the lowest RSS. We then pick the next predictor from the remaining p − 1 predictors, and repeat until we have p models: a model with 1 predictor, a model with 2 predictors, …, a model with p predictors. We then pick the best of these based on CV error, AIC, BIC, etc. This fits only \(1 + p(p+1)/2\) models instead of \(2^p\).
In BSS, we do the same thing, but start with the model containing all p predictors. At each step we remove the predictor whose removal results in the lowest RSS. There is no guarantee that the solution found by BSS will be the same as the one found by FSS; both are sketched below.
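A minimal sketch of both stepwise directions, assuming scikit-learn’s SequentialFeatureSelector and the toy X, y from the previous sketch (the target of 2 selected features is an arbitrary illustrative choice).

```python
# Forward and backward stepwise selection (illustrative sketch).
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Forward: greedily add the predictor that most improves CV performance.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward: start from the full model and greedily drop the least useful predictor.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

print("forward keeps: ", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))
# Both searches are greedy, so they need not select the same predictors.
```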
Shrinkage
Idea: Occam’s Razor: the simplest model is the best one. Minimize the RSS, but also penalize the model for making the coefficients too large. We consider ridge regression and lasso regression. Shrinkage methods generally lower the variance of the coefficient estimates.
Ridge regression takes the form:
\[\min_{\beta}\ \Big( RSS + \lambda \sum_{j = 1}^{p} \beta_j^2 \Big)\]Here \(\lambda \ge 0\) is a penalty (tuning) parameter that we choose using cross-validation: we train a number of models with varying \(\lambda\)s and pick the one with the lowest CV MSE. Note that the coefficients of a model trained by ridge regression will never be exactly 0, but they can become very small.
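A minimal sketch of ridge regression with \(\lambda\) chosen by cross-validation, assuming scikit-learn (which calls the penalty alpha) and the toy X, y from the earlier sketches; the candidate grid of \(\lambda\) values is an illustrative assumption.

```python
# Ridge regression with lambda chosen by cross-validation (illustrative sketch).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the predictors first: the ridge penalty is not scale-invariant.
alphas = np.logspace(-3, 3, 50)   # candidate lambda values
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5)).fit(X, y)

print("chosen lambda:", ridge.named_steps["ridgecv"].alpha_)
print("coefficients: ", ridge.named_steps["ridgecv"].coef_)  # shrunk, but none exactly 0
```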
Lasso regression is similar. Rather than penalizing the squared 2-norm of \(\beta\), we penalize its 1-norm.
\[\min_{\beta}\ \Big( RSS + \lambda \sum_{j = 1}^{p} |\beta_j| \Big)\]As opposed to ridge regression, lasso regression allows coefficients of the linear model to shrink exactly to 0, so it also performs variable selection.
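A matching sketch of the lasso with \(\lambda\) chosen by cross-validation, again assuming scikit-learn and the same toy X, y.

```python
# Lasso regression with lambda chosen by cross-validation (illustrative sketch).
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)

print("chosen lambda:", lasso.named_steps["lassocv"].alpha_)
print("coefficients: ", lasso.named_steps["lassocv"].coef_)
# Unlike ridge, some coefficients are driven exactly to 0,
# so the lasso selects variables as it shrinks.
```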