| • Science | • People | • Locations | • Timeline |
Intuitively, the capacity of a classification model is related to how complicated it can be. Think of thresholding a high- degree polynomial, where if the polynomial evaluates above zero, we classify that point into a positive class, negative otherwise. If we use a very high-degree polynomial, it can be very wiggly, and can fit a training set exactly. But, we should expect that outside of the training points, the classifier would not generalize well, because it is too wiggly. Such a polynomial has a high capacity. Alternatively, we can think about thresholding a linear polynomial. This polynomial may not fit the entire training set, because it has a low capacity. This notion of capacity can be made more rigorous.
First, we need the concept of shatter ing. Consider a classification model with some parameter vector . The model can shatter a set of data points () if, for all assignments of labels to those data points, there exists a such that the model makes no errors when evaluating that set of data points.
Now, we are ready to define a mathematical notion of capacity, called the VC dimension. The VC dimension of a model is the maximum such that some data point set of cardinality can be shattered by .
The VC dimension has utility in statistical learning theory, because it can predict a probabilistic upper bound on the test error of a classification model.
The bound on the test error of a classification model (on data that is drawn i.i.d. from the same distribution as the training set) is given by
with probability , where is the VC dimension of the classification model, and is the size of the training set.