Some statistical notes — Resampling
Well !! Well !! Well!! Well…
We have been doing quite a bit of statistical modeling.
Now it’s time to estimate the precision of sample statistics, or to validate models using random subsets of the data. In short, it’s time to look into resampling methods.
Two of the popular methods are Cross Validation and Bootstrap. Depending on the scenario, we might want to estimate the test error associated with a given statistical learning method, either to evaluate its performance or to select the appropriate level of flexibility: Cross Validation is the go-to tool. Or imagine you want to measure the accuracy of a parameter estimate or of a given statistical learning method: Bootstrap would be the chosen one.
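Leave-one-out cross validation (LOOCV) holds out a single observation as the validation set, fits the model on the remaining n-1 observations, and repeats this for every observation. It turns out to be very expensive, since the model has to be fit n times; when n is large, or when each individual fit is slow, this takes a long time. However, LOOCV is a very general method and can be used with any kind of predictive model, be it logistic regression, LDA or the like.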
Some R code snippets show LOOCV.
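As a rough stand-in for the procedure just described, here is a minimal sketch in plain Python (standard library only, not R). It assumes a simple least-squares line as the model; the helper names `fit_line` and `loocv_mse` and the toy data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def loocv_mse(xs, ys):
    """Leave each point out once, refit on the rest, average the squared errors."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)  # one fit per observation: n fits in total
        errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return sum(errors) / len(errors)


# Toy data invented for illustration: roughly y = 2x + 1 with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1]
print(loocv_mse(xs, ys))
```

The loop makes the cost visible: the model is refit once for every observation, which is why LOOCV gets slow when n is large.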
A further step is k-fold Cross Validation. The whole set of observations is randomly divided into k groups (folds) of approximately equal size. The first fold is treated as the validation set and the method is fit on the remaining k-1 folds. The MSE is then computed on the observations in the held-out fold. This is repeated k times, each time with a different fold treated as the validation set. As an outcome we have k estimates of the test error; we then average these k values to get the k-fold CV estimate.
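The k-fold steps can be sketched as follows, again in plain Python with only the standard library. This is a sketch under assumptions: the model is a simple least-squares line, and `k_fold_indices`, `fit_line`, `k_fold_mse` and the simulated data are hypothetical names and values chosen for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def k_fold_mse(xs, ys, k=5, seed=0):
    """Average of the k per-fold MSEs, each computed on the held-out fold."""
    fold_mses = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq = [(ys[i] - (a + b * xs[i])) ** 2 for i in fold]
        fold_mses.append(sum(sq) / len(sq))
    return sum(fold_mses) / k


# Made-up data: y = 1 + 2x plus Gaussian noise.
rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
print(k_fold_mse(xs, ys, k=5))
```

Calling `k_fold_mse(xs, ys, k=len(xs))` puts exactly one observation in each fold, which is LOOCV.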
It’s apparent that LOOCV is a special case of k-fold Cross Validation, with k = n. As discussed, when n gets large LOOCV carries a great computational overhead. Hence we normally choose k = 5 or 10, also keeping the bias-variance trade-off in mind.
Now, in a real-world scenario, would you be aware of the true test MSE? No!!! Therefore it’s quite difficult to determine the accuracy of the CV estimate. It is much easier with simulated data: since we know the process that generated the data, we can work out the true test MSE and thereby evaluate the accuracy of our cross-validation results.
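To make the simulated-data idea concrete, here is a small sketch in plain Python. Everything in it is an assumption made for illustration: the "truth" is y = 1 + 2x with noise of standard deviation `SIGMA`, the model is a least-squares line, and the true test MSE is approximated on a large fresh sample rather than computed exactly.

```python
import random

rng = random.Random(42)
SIGMA = 1.0  # noise standard deviation, so the irreducible error is SIGMA**2


def simulate(n):
    """Draw n points from the made-up truth y = 1 + 2x + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1 + 2 * x + rng.gauss(0, SIGMA) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(a, b, xs, ys):
    """Mean squared error of the line a + b*x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cv_mse(xs, ys, k=10):
    """k-fold CV; the data are already in random order, so fold f takes every k-th point."""
    total = 0.0
    for f in range(k):
        test = range(f, len(xs), k)
        train = [i for i in range(len(xs)) if i % k != f]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        total += mse(a, b, [xs[i] for i in test], [ys[i] for i in test])
    return total / k


train_x, train_y = simulate(100)
cv_estimate = cv_mse(train_x, train_y)

# "True" test MSE of the model fitted on all the training data,
# approximated on a large fresh sample from the same process.
a, b = fit_line(train_x, train_y)
test_x, test_y = simulate(20000)
true_mse = mse(a, b, test_x, test_y)
print("CV estimate:", cv_estimate)
print("true test MSE (approx.):", true_mse)
```

With real data the second number is simply not available, which is the point made above.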
Did you realize this? When we perform CV, sometimes our focus is on how well our model is going to predict in the real world, that is, on the actual estimate of the test MSE. At other times our interest is only in identifying the location of the minimum in the estimated test MSE curve. That’s because the aim is to find the method which results in the lowest test error among all the models we have experimented with. For that purpose the location of the minimum is more significant than the actual value of the estimated test MSE.
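A minimal sketch of "finding the location of the minimum", in plain Python. The setup is assumed purely for illustration: data simulated around a sine curve, and a nearest-neighbours average as the model, where the number of neighbours controls flexibility (fewer neighbours means a more flexible fit). The names `knn_predict`, `cv_mse` and `candidates` are hypothetical.

```python
import math
import random

# Made-up data: a sine curve plus Gaussian noise.
rng = random.Random(7)
xs = [rng.uniform(0, 6) for _ in range(120)]
ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]


def knn_predict(train_x, train_y, x0, neighbours):
    """Average the responses of the `neighbours` training points nearest to x0."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))
    nearest = order[:neighbours]
    return sum(train_y[i] for i in nearest) / neighbours


def cv_mse(neighbours, folds=10):
    """10-fold CV estimate of the test MSE for a given number of neighbours."""
    total = 0.0
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        test = range(f, len(xs), folds)
        sq = [(ys[i] - knn_predict(tx, ty, xs[i], neighbours)) ** 2 for i in test]
        total += sum(sq) / len(sq)
    return total / folds


# One CV estimate per candidate level of flexibility: the "CV curve".
candidates = [1, 3, 5, 10, 20, 40, 80]
curve = {m: cv_mse(m) for m in candidates}
best = min(curve, key=curve.get)
print(curve)
print("lowest CV error at neighbours =", best)
```

Only `best` is used to pick the model; the individual values in `curve` matter less than where the smallest one sits.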
We keep going back to LOOCV!! We discussed a short while back that k-fold CV has a computational advantage over LOOCV when k < n. However, a less obvious but more significant reason for selecting k-fold CV is that it often gives a more accurate estimate of the test error, and this comes down to the bias-variance trade-off.
Recall how much data each approach trains on: the validation set approach uses about half of the dataset, LOOCV uses n-1 observations, and k-fold CV uses about (k-1)n/k observations. From the perspective of bias reduction, then, LOOCV is the preferred approach, since it trains on almost the full dataset. However, we cannot ignore the fact that LOOCV has a higher variance than k-fold CV with k < n.
When we perform LOOCV we average the outputs of n fitted models, each of which is trained on an almost identical set of observations. As such, the outputs are highly positively correlated with each other. In contrast, when we perform k-fold CV with k < n, we average the outputs of k fitted models that are somewhat less correlated with each other, since the overlap between the training sets is smaller. We know that the mean of highly correlated quantities has a higher variance than the mean of less correlated quantities, so the test error estimate resulting from LOOCV tends to have a higher variance than the test error estimate from k-fold CV. That gives k-fold CV an edge. Empirically, it has been observed that k-fold CV with k = 5 or 10 yields test error estimates that suffer neither from excessively high bias nor from very high variance.
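The claim about the mean of correlated quantities can be made precise under a simplifying assumption that is not in the notes above: suppose the n quantities X_1, ..., X_n all have the same variance σ² and every pair has the same correlation ρ. The errors from CV folds will not satisfy this exactly, so treat the formula as intuition rather than as a result about CV itself.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\left(n\sigma^2 + n(n-1)\rho\sigma^2\right)
  = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2
```

With ρ = 0 this is the familiar σ²/n, which shrinks as n grows. As ρ approaches 1 the variance approaches σ², so averaging many highly correlated outputs, as LOOCV does, reduces the variance very little.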
Cross Validation works well with both quantitative and qualitative responses. When the response is quantitative we use the MSE to quantify the test error; when it is qualitative (a classification problem) we use the number of misclassified observations instead.
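For the qualitative case, a small sketch in plain Python of k-fold CV that counts misclassified observations. The classifier here (assign each point to the class whose mean is nearest) is a deliberately simple stand-in chosen for illustration, not a method from these notes, and the two-class data are simulated.

```python
import random

rng = random.Random(3)
# Made-up two-class data: class 0 centred at 0, class 1 centred at 2.
data = [(rng.gauss(0, 1), 0) for _ in range(50)]
data += [(rng.gauss(2, 1), 1) for _ in range(50)]
rng.shuffle(data)


def fit_centroids(rows):
    """Return the mean of x for each class label."""
    means = {}
    for label in {lab for _, lab in rows}:
        vals = [x for x, lab in rows if lab == label]
        means[label] = sum(vals) / len(vals)
    return means


def predict(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


def cv_error_rate(rows, k=5):
    """k-fold CV: total misclassified held-out observations divided by n."""
    wrong = 0
    for f in range(k):
        train = [r for i, r in enumerate(rows) if i % k != f]
        test = [r for i, r in enumerate(rows) if i % k == f]
        means = fit_centroids(train)
        wrong += sum(predict(means, x) != label for x, label in test)
    return wrong / len(rows)


print(cv_error_rate(data))
```

The structure is the same as for MSE; only the per-observation loss changes from a squared error to a 0/1 "was it misclassified" indicator.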
In a real data scenario we do not know the Bayes decision boundary or the test error rates; in such a scenario, how do we choose the degrees of freedom (the flexibility) of a logistic regression? That is a valid question, and Cross Validation is the answer!! We know that the training error normally tends to decrease as the flexibility of the model increases, so the training error cannot guide this choice, whereas the CV error can.
Now we all know how powerful R is when it comes to statistical calculations. Fitting a linear regression, for instance, is no big issue when we use R. However, there are places where measuring variability gets a bit tricky even with standard software like R. These are the places where the Bootstrap comes into play and proves extremely useful. The Bootstrap is a widely applicable and extremely popular tool used to quantify the uncertainty associated with a given estimator or statistical learning method.
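A minimal sketch of the idea in plain Python (standard library only): resample the observed data with replacement many times, recompute the statistic each time, and use the spread of those replicates as its standard error. The function name `bootstrap_se`, the number of resamples and the simulated sample are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Standard error of `statistic`, estimated by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # Each replicate is the statistic computed on a resample of the same size n.
    replicates = [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return statistics.stdev(replicates)


# Made-up sample for illustration.
rng = random.Random(1)
sample = [rng.gauss(10, 2) for _ in range(100)]

print("bootstrap SE of the mean:  ", bootstrap_se(sample, statistics.mean))
print("textbook SE of the mean:   ", statistics.stdev(sample) / len(sample) ** 0.5)
print("bootstrap SE of the median:", bootstrap_se(sample, statistics.median))
```

For the mean there is a textbook formula to compare against, so the first two numbers should be close. The median has no such simple formula, which is the kind of place where the Bootstrap earns its keep.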