Resampling Methods
UNDER HEAVY CONSTRUCTION: NOT EVERYTHING MAY BE CORRECT.
Let’s say we created a model that fits our data very well. Cool. But who’s to say that our model can predict accurately on different, but similar, sets of data? How do we estimate the variability of our model?
The ideal would be to collect a first set of data, train a model on it, and then collect more data to evaluate the model. But collecting more data is expensive ($). Resampling methods used to be expensive too (computationally), but thanks to modern computers, they no longer are.
The idea of resampling is essentially this: we can draw different random samples from our training data. For example, if our data has indices \(1, 2, \ldots, 100\), we could pick \(1, 2, \ldots, 80\) as our training set, or we could pick \(21, 22, \ldots, 100\). Fitting a linear regression model to each sample, we can then look at how the two fits differ. Repeating this approach over every possible subset of a given size would give us information about how variable a linear regression fit is on this data.
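A minimal sketch of this idea in Python, assuming NumPy and scikit-learn are available; the synthetic data, the 1,000 repetitions, and the subset size of 80 are all illustrative choices, not part of any particular method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: 100 observations of one predictor, true slope = 2.
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=100)

# Repeatedly fit on random 80-observation subsets and record the slope.
slopes = []
for _ in range(1000):
    idx = rng.choice(100, size=80, replace=False)
    fit = LinearRegression().fit(X[idx], y[idx])
    slopes.append(fit.coef_[0])

# The spread of the fitted slopes across subsets gives a sense of
# how variable the linear regression fit is on this data.
print(f"slope: mean={np.mean(slopes):.3f}, std={np.std(slopes):.3f}")
```

In practice we don’t enumerate every possible subset (there are far too many); drawing a large number of random subsets, as above, is the usual shortcut.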