## 1. Uniform Convergence and Empirical Risk Minimization

Empirical risk minimization is a core topic in machine learning. Let's take binary classification as an example.

We are given a training set $\{(X_{1},y_{1}),\dots,(X_{n},y_{n})\}$ that we assume is drawn i.i.d. from a distribution $P$, where each $X_{i}\in\mathbb{R}^{d}$ and $y_{i}\in\{-1,+1\}$. Our goal is to find a classifier $f:\mathbb{R}^{d}\rightarrow\{-1,+1\}$ that minimizes the error on future unseen samples: $\min_{f}\, P(f(X)\neq y)$.

For a given classifier $f$ we can estimate its misclassification error (the training error, in machine-learning terms) as:

$$\hat{R}_{n}(f) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\big(f(X_{i})\neq y_{i}\big),$$

which estimates the true error $R(f) = P(f(X)\neq y)$.

If $f$ is some fixed classifier, we know by Hoeffding's bound that:

$$\mathbb{P}\big(\vert \hat{R}_{n}(f) - R(f)\vert \geq t\big) \leq 2\exp(-2nt^{2}).$$
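As a quick sanity check, here is a minimal simulation sketch; the data distribution, the label-flip probability, and the fixed classifier are illustrative assumptions, not something from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Hypothetical distribution P: X ~ N(0, 1), y = sign(X) flipped w.p. 0.2.
    X = rng.normal(size=n)
    y = np.sign(X)
    flip = rng.random(n) < 0.2
    y[flip] *= -1
    return X, y

f = lambda X: np.sign(X)        # a fixed classifier, chosen before seeing data
R_true = 0.2                    # its true risk under this construction

n, t, trials = 500, 0.05, 2000
deviations = []
for _ in range(trials):
    X, y = sample(n)
    R_hat = np.mean(f(X) != y)  # empirical risk \hat{R}_n(f)
    deviations.append(abs(R_hat - R_true))

freq = np.mean(np.array(deviations) >= t)
bound = 2 * np.exp(-2 * n * t**2)
print(f"P(|R_hat - R| >= {t}) ~ {freq:.4f} <= Hoeffding bound {bound:.4f}")
```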

Intuitively, we would like to select a good classifier from some set of classifiers $\mathcal{F}$, and a natural way to do so is to find the best classifier on the training set:

$$\tilde{f} = \arg\min_{f\in\mathcal{F}} \hat{R}_{n}(f).$$

This procedure is known as *empirical risk minimization* (ERM). A natural question is: **how do we argue that, in some cases, this procedure will indeed select a good classifier?** This question is closely related to uniform convergence.
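To make the procedure concrete, here is a minimal ERM sketch over a small, hypothetical finite class $\mathcal{F}$ of threshold classifiers $f_{\theta}(x) = \text{sign}(x - \theta)$; the data-generating process is likewise an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the best threshold is at 0, labels flipped w.p. 0.1.
n = 200
X = rng.uniform(-1, 1, size=n)
y = np.where(X > 0, 1, -1)
y[rng.random(n) < 0.1] *= -1

# A finite class F of threshold classifiers f_theta(x) = sign(x - theta).
thetas = np.linspace(-1, 1, 21)

def empirical_risk(theta):
    preds = np.where(X > theta, 1, -1)
    return np.mean(preds != y)

# Empirical risk minimization: pick the classifier with smallest training error.
risks = [empirical_risk(th) for th in thetas]
theta_tilde = thetas[int(np.argmin(risks))]
print(f"ERM selects theta = {theta_tilde:.2f} with training error {min(risks):.3f}")
```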

Let $f^{\ast} = \arg\min_{f\in\mathcal{F}} R(f)$ be the best classifier in $\mathcal{F}$. We would like to bound the excess risk of the classifier we choose, i.e.:

$$R(\tilde{f}) - R(f^{\ast}).$$

This quantity measures how close the error of the chosen classifier $\tilde{f}$ is to the error of the best classifier $f^{\ast}$ in the class.

The typical way to deal with this quantity is to consider the decomposition:

$$R(\tilde{f}) - R(f^{\ast}) = \underbrace{\big(R(\tilde{f}) - \hat{R}_{n}(\tilde{f})\big)}_{\text{(i)}} + \underbrace{\big(\hat{R}_{n}(\tilde{f}) - \hat{R}_{n}(f^{\ast})\big)}_{\text{(ii)}} + \underbrace{\big(\hat{R}_{n}(f^{\ast}) - R(f^{\ast})\big)}_{\text{(iii)}}.$$

The first term depends on the data through $\tilde{f}$, so its empirical risk is not a sum of independent variables and Hoeffding's inequality does not apply to it. The second term is $\leq 0$ by the definition of $\tilde{f}$, and the third term is small directly by Hoeffding's inequality, since $f^{\ast}$ is fixed.

Instead we have to rely on a uniform convergence bound. Suppose we can show that, with probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}} \big\vert \hat{R}_{n}(f) - R(f) \big\vert \leq \epsilon(n,\delta);$$

then we can conclude that, with probability at least $1-\delta$, the excess risk satisfies:

$$R(\tilde{f}) - R(f^{\ast}) \leq 2\,\epsilon(n,\delta).$$
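To see why, note that the uniform bound controls terms (i) and (iii) of the decomposition above simultaneously:

$$R(\tilde{f}) \leq \hat{R}_{n}(\tilde{f}) + \epsilon(n,\delta) \leq \hat{R}_{n}(f^{\ast}) + \epsilon(n,\delta) \leq R(f^{\ast}) + 2\,\epsilon(n,\delta),$$

where the middle inequality uses that $\tilde{f}$ minimizes the empirical risk over $\mathcal{F}$.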

Therefore, we have to show the uniform convergence of the empirical risk to the true error over the collection of classifiers we are interested in.

Let's generalize the formulation. Given a collection of sets $\mathcal{A}$, we would like to understand the uniform convergence of empirical frequencies to probabilities, i.e. we want to bound:

$$\Delta(\mathcal{A}) = \sup_{A\in\mathcal{A}} \big\vert P_{n}(A) - P(A) \big\vert, \qquad P_{n}(A) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(X_{i}\in A),$$

where $X_{1},\dots,X_{n}$ is an i.i.d. sample from some distribution $P$.

## 2. Finite Collections and Hypothesis Set Complexity

How do we estimate $\Delta(\mathcal{A})$? Let's first consider the case where the collection of sets $\mathcal{A}$ has finite cardinality $\vert \mathcal{A}\vert$. In this case, for a fixed $A\in\mathcal{A}$, we have by Hoeffding's inequality:

$$\mathbb{P}\big(\vert P_{n}(A) - P(A)\vert \geq t\big) \leq 2\exp(-2nt^{2}).$$

We want a stronger guarantee that this convergence happens uniformly over all sets in $\mathcal{A}$, so we use the union bound, i.e.:

$$\mathbb{P}\big(\Delta(\mathcal{A}) \geq t\big) \leq \sum_{A\in\mathcal{A}} \mathbb{P}\big(\vert P_{n}(A) - P(A)\vert \geq t\big) \leq 2\vert\mathcal{A}\vert\exp(-2nt^{2}).$$

So if we want the deviation to be smaller than $t$ with probability at least $1-\delta$, we need to choose:

$$t = \sqrt{\frac{\log(2\vert\mathcal{A}\vert/\delta)}{2n}}.$$

In other words, we have that with probability at least $1-\delta$,

$$\Delta(\mathcal{A}) \leq \sqrt{\frac{\log(2\vert\mathcal{A}\vert/\delta)}{2n}}.$$

To obtain uniform convergence over $\mathcal{A}$, we pay a price that is only logarithmic in the size of the collection.
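To get a feel for how mild this logarithmic price is, here is a small numeric sketch (the values of $n$, $\delta$, and $\vert\mathcal{A}\vert$ are illustrative):

```python
import numpy as np

n, delta = 1000, 0.05

# Deviation bound sqrt(log(2|A|/delta) / (2n)) for growing collection sizes.
for size in [10, 10**3, 10**6]:
    t = np.sqrt(np.log(2 * size / delta) / (2 * n))
    print(f"|A| = {size:>7}: uniform deviation bound t = {t:.4f}")
```

Growing the collection from $10$ to $10^{6}$ sets does not even double the deviation bound.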

## 3. VC dimension

The VC dimension measures the complexity of a collection of sets $\mathcal{A}$ (even an infinite one), and is closely related to uniform convergence.

### 3.1 Shattering

Let $\{z_{1},\dots,z_{n}\}$ be a finite set of $n$ points. We let $N_{\mathcal{A}}(z_{1},\dots,z_{n})$ be the number of distinct sets in the collection of sets

$$\big\{ \{z_{1},\dots,z_{n}\} \cap A : A\in\mathcal{A} \big\}.$$

$N_{\mathcal{A}}(z_{1},\dots,z_{n})$ counts the *number of subsets of* $\{z_{1},\dots,z_{n}\}$ that the collection of sets $\mathcal{A}$ picks out. Note that $N_{\mathcal{A}}(z_{1},\dots,z_{n}) \leq 2^{n}$ (each of the $n$ points can either be included or excluded).

We now define the $n$-th shatter coefficient of $\mathcal{A}$ as:

$$s(\mathcal{A}, n) = \max_{z_{1},\dots,z_{n}} N_{\mathcal{A}}(z_{1},\dots,z_{n}).$$

The shatter coefficient is the maximal number of different subsets of $n$ points that can be picked out by the collection $\mathcal{A}$.
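As a concrete illustration, consider (as a hypothetical example, not one from the text) the class of closed intervals $\mathcal{A} = \{[a, b] : a \leq b\}$ on the real line. The following brute-force sketch counts $N_{\mathcal{A}}(z_{1},\dots,z_{n})$ for a small point set:

```python
def picked_out_subsets(points):
    """All distinct subsets of `points` picked out by intervals [a, b]."""
    pts = sorted(points)
    subsets = {frozenset()}  # empty set: choose an interval containing no point
    # A subset picked out by an interval is a contiguous run of the sorted points.
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))
    return subsets

points = [0.5, 1.7, 3.2, 4.4]
N = len(picked_out_subsets(points))
print(f"N_A = {N} out of 2^{len(points)} = {2 ** len(points)} possible subsets")
```

For $n$ distinct points, intervals pick out only the $\binom{n}{2} + n$ contiguous runs plus the empty set, so here $N_{\mathcal{A}} = 11 < 2^{4} = 16$.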

### 3.2 VC Theory and the Glivenko-Cantelli Theorem

**VC Theorem:** For any distribution $P$ and class of sets $\mathcal{A}$, we have that:

$$\mathbb{P}\big(\Delta(\mathcal{A}) \geq t\big) \leq 8\, s(\mathcal{A}, n)\, \exp(-nt^{2}/32).$$

There are two noteworthy aspects of this theorem:

- *Distribution free*: it applies to any distribution of the samples.
- The VC theorem essentially reduces the question of uniform convergence to a combinatorial question about the collection of sets, i.e. we now only need to understand the shatter coefficients, which are completely independent of the probability/statistics.

**Glivenko-Cantelli:** We can relate the Glivenko-Cantelli theorem, which says that the empirical CDF converges in probability to the true CDF, to the VC theorem. Taking $\mathcal{A} = \{(-\infty, a] : a\in\mathbb{R}\}$, for which $s(\mathcal{A}, n) = n + 1$, we have:

$$\mathbb{P}\Big(\sup_{a\in\mathbb{R}} \big\vert \hat{F}_{n}(a) - F(a) \big\vert \geq t\Big) \leq 8(n+1)\exp(-nt^{2}/32).$$

Now verifying convergence in probability is straightforward: for any $t > 0$, $\lim_{n\rightarrow \infty} 8(n+1)\exp(-nt^{2}/32) = 0$.
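Here is a quick simulation sketch of this convergence; the choice $P = \mathrm{Uniform}(0,1)$ is illustrative (its CDF is $F(a) = a$, which keeps the code short):

```python
import numpy as np

rng = np.random.default_rng(2)

for n in [100, 1000, 10000]:
    X = np.sort(rng.uniform(size=n))   # illustrative choice: P = Uniform(0, 1)
    F = X                              # true CDF of Uniform(0, 1): F(a) = a
    F_hat = np.arange(1, n + 1) / n    # empirical CDF at the (sorted) samples
    # The sup over a is attained at a sample point, just before or at a jump.
    sup_dev = np.max(np.maximum(np.abs(F_hat - F), np.abs(F_hat - 1/n - F)))
    print(f"n = {n:>5}: sup_a |F_n(a) - F(a)| = {sup_dev:.4f}")
```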

**VC dimension:** The VC dimension quantifies the complexity of a set system $\mathcal{A}$: the VC dimension $d$ is the largest integer for which $s(\mathcal{A}, d) = 2^{d}$. Once we have $s(\mathcal{A}, n) < 2^{n}$ for some $n$, then $s(\mathcal{A}, m) < 2^{m}$ for every $m > n$. There is a phase transition in the shatter coefficients: once they are no longer exponential in $n$, they become polynomial in $n$ (this is the content of Sauer's lemma below), and the point at which this transition happens is the VC dimension of the set system $\mathcal{A}$.

### 3.3 Sauer’s Lemma

If $\mathcal{A}$ has finite VC dimension $d$, then for $n > d$ we have that

$$s(\mathcal{A}, n) \leq \sum_{i=0}^{d}\binom{n}{i} \leq (n+1)^{d}.$$
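As a quick numeric check of the lemma, take the interval class from Section 3.1: intervals have VC dimension $d = 2$ (any two points can be shattered, but no three can, since an interval cannot pick out the two outer points without the middle one), and their exact shatter coefficient $s(\mathcal{A}, n) = \binom{n}{2} + n + 1$ attains the $\sum_{i=0}^{d}\binom{n}{i}$ bound with equality:

```python
from math import comb

# Intervals [a, b] have VC dimension d = 2; their exact shatter coefficient
# is C(n, 2) + n + 1, which matches Sauer's sum and sits far below 2^n.
d = 2
for n in [3, 5, 10, 20]:
    s = comb(n, 2) + n + 1
    sauer = sum(comb(n, i) for i in range(d + 1))
    print(f"n = {n:>2}: s(A, n) = {s:>4}, Sauer sum = {sauer:>4}, "
          f"(n+1)^d = {(n + 1) ** d:>4}, 2^n = {2 ** n}")
```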

We can use Sauer's lemma to conclude that, for a set system $\mathcal{A}$ of VC dimension $d$,

$$\mathbb{P}\big(\Delta(\mathcal{A}) \geq t\big) \leq 8(n+1)^{d}\exp(-nt^{2}/32).$$

Doing the usual thing (setting the right-hand side to $\delta$ and solving for $t$), we see that with probability at least $1-\delta$,

$$\Delta(\mathcal{A}) \leq \sqrt{\frac{32\big(\log(8/\delta) + d\log(n+1)\big)}{n}}.$$

There are some important notes:

- If $d<\infty$ then $\Delta(\mathcal{A})\xrightarrow{p}0$, and so we have a uniform LLN for the collection of sets $\mathcal{A}$.
- There are converses to the VC theorem that say, roughly, that if the VC dimension is infinite then there exists a distribution over the samples for which we do not have a uniform LLN.
- Roughly, one should think of the VC result as saying that, for a class with VC dimension $d$, with high probability, $\Delta(\mathcal{A}) \lesssim \sqrt{\frac{d\log n}{n}}$.

### 3.4 VC Theory, Empirical Risk Minimization and Uniform Convergence

Returning to classification, each classifier $f$ induces an error set $A_{f} = \{(x, y) : f(x) \neq y\}$, and $\hat{R}_{n}(f) - R(f) = P_{n}(A_{f}) - P(A_{f})$. If the collection $\mathcal{A} = \{A_{f} : f\in\mathcal{F}\}$ has VC dimension $d$, then we can use the VC theorem in a straightforward way to conclude that, with probability at least $1-\delta$:

$$R(\tilde{f}) - R(f^{\ast}) \leq 2\sqrt{\frac{32\big(\log(8/\delta) + d\log(n+1)\big)}{n}}.$$

This leads to a uniform convergence guarantee for empirical risk minimization over linear classifiers, since they induce relatively simple sets whose VC dimension is well understood.
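For instance, halfspace classifiers in $\mathbb{R}^{k}$, $f(x) = \text{sign}(\langle w, x\rangle + b)$, have VC dimension $k + 1$, so the bound above can be evaluated directly. A small numeric sketch (the dimension and sample sizes are illustrative, and the constants $32$ and $8$ in the VC theorem are known to be loose):

```python
import numpy as np

def excess_risk_bound(n, d_vc, delta=0.05):
    # Excess-risk bound 2 * sqrt(32 * (log(8/delta) + d * log(n+1)) / n)
    # obtained by combining the decomposition with the VC/Sauer bound above.
    return 2 * np.sqrt(32 * (np.log(8 / delta) + d_vc * np.log(n + 1)) / n)

d_vc = 11  # VC dimension of halfspaces in R^10 (k + 1 with k = 10)
for n in [10**3, 10**5, 10**7]:
    print(f"n = {n:>8}: excess risk <= {excess_risk_bound(n, d_vc):.3f}")
```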