Cross-validation lets us evaluate the performance of machine learning models and gives us better insight into the data. Now, let’s break down the name for a better understanding.
- “K” represents any number in the series 1, 2, 3, 4, 5, …, k.
- “Fold”, as in we are folding something over itself.
- “Cross”, like a crisscross/overlap pattern, going back and forth over and over again.
- “Validation”, as in checking the accuracy of our model.
To put it all together: the original sample is randomly partitioned into k equal-sized subsets. A single subset is kept as validation data for testing the model, while the remaining k-1 subsets are used as training data. The cross-validation procedure is then performed k times, with each of the k subsets serving as validation data exactly once, and the k results are averaged to produce a single estimate. For example, if we have 100 data points and choose a k-value of five (the number of subsets we split our data into), we get five subsets of 20 points each. We then train on four (k-1) of those subsets, 80 points in total, and hold out the remaining set of 20 for validation.
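The partitioning described above can be sketched with scikit-learn’s `KFold`; the 100-point dataset and k=5 mirror the running example in the text.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)  # 100 data points, as in the example

# k = 5 equal-sized subsets; shuffle randomizes the assignment of points
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Each iteration: 80 training points (k-1 subsets) and 20 held-out points
    print(f"Fold {fold}: train={len(train_idx)}, validation={len(val_idx)}")
```

Each of the five folds trains on 80 points and validates on the remaining 20, so every data point is held out exactly once.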
The benefit of this strategy is that it doesn’t matter how the data happens to be separated: every data point appears exactly once in a test set and k-1 times in a training set. As k increases, the variance of the resulting estimate decreases.
Methodology of Evaluating a Model
Let’s take a step back and consider train/test data splits in general. If you’re new to statistical models, the idea of splitting your data into two (or more) pieces may be foreign to you: each data point is randomly assigned to either a training set or a test set. We use one part, the training data, to train our model. The test data is then used to put our trained model to the test and see how well it does at, say, forecasting the value of a house, or whatever it is we’re aiming for. We assess the model with some kind of metric; for a multiple linear regression model, you might use R² or root mean squared error to evaluate your model’s performance, first on the training set and then on the test set.

Drawbacks of the above technique:
- If you’re starting out with a smaller dataset, this method can pose a problem: you must divide that already limited set into two groups, leaving you with even less data to train or test your model on.
- Another issue with a basic train/test strategy is that, even if your data is randomly divided, outliers or other anomalies may end up in one group and not be equally represented in the other. This can make the model appear to perform particularly poorly on the test set, when it would not have suffered as much had that one data point landed in the training data.
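The basic train/test approach described above can be sketched as follows; the synthetic regression data here is an illustrative assumption standing in for something like house prices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 3 features (an illustrative stand-in)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# A single random 80/20 split; the weakness is that outliers may
# land only on one side of this one split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("Train R²:", r2_score(y_train, model.predict(X_train)))
print("Test  R²:", r2_score(y_test, model.predict(X_test)))
```

Because everything hinges on one random split, the test score can swing depending on where unusual points happen to fall, which is exactly the problem cross-validation addresses.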
The solution to this problem of extreme values landing in one split and not the other is to split your data into train and test not just once but multiple times, that is, ‘k’ times. Running our model several times, with the outliers shuffled around randomly, gives us a better idea of how our model would perform out in the ‘real world’.
Again, using the example above of a hundred data points and a k-value of 5, each subset holds 20 data points. First, we randomly assign 20 data points to each subset. We hold out the testing subset (the blue gap in the diagram) and train a model on the data represented by the yellow boxes. Then we test it on the held-out subset, and we repeat this ‘k’ times. We can also shuffle the data each time, so that each subset receives 20 freshly randomized data points. In this way, we are completely mixing up and ‘crossing’ our data.
Then we ‘validate’ by collecting the score from each of the k models and taking the average of those scores.
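This whole procedure, k splits, k scores, and one average, is what scikit-learn’s `cross_val_score` wraps up in a single call; the synthetic data below is again an illustrative assumption.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 100 samples (illustrative stand-in for real data)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# cv=5 fits the model five times, each fold serving as the test set once
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print("Fold scores:", scores)      # one R² per fold
print("Mean R²:", scores.mean())   # the single averaged estimate
```

The mean of the five fold scores is the single estimate described above, and it is far less sensitive to where any one outlier happens to land than a single train/test split.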