# Note(7)--Uniform Laws, empirical process theory and VC dimension

Posted by GwanSiu on April 8, 2018

## 1. Uniform Convergence and Empirical Risk Minimization

Empirical risk minimization is core topic in machine learning. Let’s take a binary classification as example.

Given a training set ${(X_{1},y_{1}),…,(X_{n},y_{n})}$ that we assume are drwawn i.i.d from the distribution $P$. Each $X_{i}\in\mathcal{R}^{d}, y_{i}\in {-1, +1}$. Our goal is to find a classifier $f:\mathcal{R}^{d}\rightarrow{-1, +1}$ to minmize the error on future unseen sample: $\min\,\mathcal{P}(f(X)\neq y)$.

For a given classfier $f$ we can estimate its mis-classification error(training error in machine learning) as:

If $f$ is some fixed classifier we know by Hoeffding’s bound that:

Intuitively, we should a select a better classfier from soem set of classifier $\mathcal{F}$, then a natural way is to find the best classifier on trainging set:

This procedure is known as empirical risk minimization. This terminology is to answer a question: How do we argue that in some cases this procedure will indeed select a good classifier? This question is highly related to uniform convergence.

Let $f^{\ast}$ be the classifier in $F$. We would like to bound the excess risk of the classifier we choose, i.e.:

The formulation means that we want the error casued by chosen $\tilde{f}$ is close to the the error caused by the true $f$.

The typical way to deal with this formulation is to consider the decomposition:

The first term dependent on the data because of $\tilde{f}$ and cannot be suitable for Hoeffding’s inequality, so its empirical risk is not the sum of independent variables, the second term is $\leq 0$ and the third term is small just by the Hoeffding inequality.

Instead we have to rely on the uniform convergence bound, suppose we can show that with probability at least $1-\delta$,

then we can conclude that excess risk with probability at least $1-\delta$ satisfies:

Therefore, we have to show the uniform convergence of the empirical risk to the true error over the collection of classifiers we are interested in.

Let’s generalize the formulation, given a collection of sets $A$, we would like to understand the uniform convergence of empirical frequencies to probabilities, i.e. we want to bound:

where $x_{1},…,X_{n}$ are an i.i.d sample from some distribution $P$.

## 2. Finite collections and Hopythesis Set Complexity

How do we estimate $\Delta(\mathcal{A})$. We turn our view back to the collections of set $\mathcal{A}$ has finite cardinality $\vert \mathcal{A}\vert$. In this case, for a fixed $A$ we have by the Hoeffding’s inequality:

we want a strong gaurantee that this convergence happens uniformly for all sets in $\mathcal{A}$, so we can use the union bound, i.e.:

So if we want that with probability $1-\delta$ the deviation be smaller than $t$ we need to choose:

In orther words we have that with probability at least $1-\delta$,

To obtain unifrom convergence over $\mathcal{A}$ we pay a price which is logarithmic in the size of the collection.

## 3. VC dimension

VC dimension measure the complexity of $\vert \mathcal{A}\vert$, which is highly related to uniform convergence.

### 3.1 Shattering

Let’s ${z_{1},…,z_{n}}$ be a finite set of $n$ points. We let $N_{A}(z_{1},…,z_{n})$ be the number of distinct sets in the collection of sets

$N_{A}(z_{1},…,z_{n})$ is counting the number of subsets of ${z_{1},…,z_{n}}$ that the collection of sets $\mathcal{A}$ picks out. Note that, $N_{A}(z_{1},…,z_{n})\leq 2^{2}$(each one have chance to be selected.)

We now define the $n-$th shatter coefficient of $\mathcal{A}$ as:

The shatter coefficient is the maximal number of different subsets of $n$ points that can be picked out by the collection $\mathcal{A}$

VC Theorem: For any distribution $\mathcal{P}, and class of sets$\mathcal{A}$we have that: Notes: There are two noteworthy aspects of this Theorem: 1. Distribution free: it can apply to any distritution on the samples. 2. VC theorem essentially reduce the question of uniform convergence to a combinatorial question about the collection of sets, i.e. we now need only to understand the shatter coefficients which are completely independent from the probability/statistics. Glivenko-Cantelli: We related Glivenko-Cantelli theorem to VC theorem, the empirical CDF converges in probability to the true CDF. We have: Now verifying convergence in probability is straightforwward, for any$t> 0, \lim_{n\rightarrow \infty}8(n+1)\text{exp}(-nt^{2}/32)=0$. VC dimension: VC dimension define the complexity of a set system$\mathcal{A}$, the VC dimension$d$is the largest integer$d$for which$s(\mathcal{A}, d)=2^{d}$. Once we have for any$n>d$, we have that$s(\mathcal{A}, n)<2^{n}$. There is a phase transition of shattering coefficients: once it is no longer exponential the shattering coefficients become polynomial in$n$. That n is the VC dimension of a set system$\mathcal{A}$. ### 3.3 Sauer’s Lemma If$\mathcal{A}$has finite VC dimension$d$, then for$n>d$we have that, We can use Sauer’s lemma to conclude that for a system$\mathcal{A}$of VC dimension$d$. Doing the usual thing we see that with probability$1-\delta$, There are some important notes: 1. If$d<\infty$then$\Delta(\mathcal{A})\xrightarrow{p}0$, and so we have a uniform LLN for the collection of sets$\mathcal{A}$. 2. There are covnerses to the VC theorem that say roughly that if the VC dimension is infinite then there exists a distribution over the samples for which we do not have a uniform LLN. 3. Roughly, one should thinnk of the VC result as saying for a class with VC dimension d, ### 3.4 VC Theory, Empirical Risk Minimization and Uniform Convergence If$\mathcal{A}$has VC dimension$d$then we can use the VC theorem in a straightforwward way to conclude that with probability$1-\delta\$:

This way leads to a uniform convergence gurantee for empirical risk minimization over linear classifiers since they induce relatively simple sets whose VC dimension is well-understood. 