# Notes on PRML (2)

## Information Theory and Machine Learning

Posted by GwanSiu on September 3, 2017

## 1. Information and Entropy

### 1.2 Entropy (from a coding perspective)

$H(x)=-\sum_{x}p(x)\log_{2}p(x) \tag{1}$
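As a minimal sketch, Eq. (1) can be computed directly; the distributions below are hypothetical examples:

```python
import math

def entropy_bits(p):
    """Shannon entropy H(x) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# A fair coin carries exactly 1 bit of information per toss.
print(entropy_bits([0.5, 0.5]))   # 1.0
# A biased coin is more predictable, hence has lower entropy.
print(entropy_bits([0.9, 0.1]))
```

A uniform distribution over $2^{k}$ outcomes gives exactly $k$ bits, the maximum for that support size.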

### 1.3 Entropy (as the disorder of a system)

Suppose $N$ identical objects are distributed over bins, with $n_{i}$ objects in bin $i$. The number of distinct ways to arrange them (the multiplicity) is

$W=\frac{N!}{\prod_{i} n_{i}!} \tag{2}$

The entropy is the logarithm of the multiplicity scaled by $N$:

$H=\frac{1}{N}\ln W=\frac{1}{N}\ln N!-\frac{1}{N}\sum_{i}\ln n_{i}! \tag{3}$

Applying Stirling's approximation $\ln N!\simeq N\ln N-N$ and letting $N\to\infty$ with the fractions $n_{i}/N$ held fixed at $p_{i}$:

$H=-\lim\limits_{N\to\infty} \sum_{i}\left(\frac{n_{i}}{N}\right)\ln\left(\frac{n_{i}}{N}\right)=-\sum_{i}p_{i}\ln p_{i} \tag{4}$

Equivalently, measured in bits: $H[p]=-\sum_{i}p(x_{i})\log_{2}p(x_{i})$
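The convergence in Eq. (4) can be checked numerically: for a hypothetical example distribution, $\frac{1}{N}\ln W$ computed from the exact multiplicity approaches $-\sum_{i}p_{i}\ln p_{i}$ as $N$ grows.

```python
import math

def multiplicity_entropy(counts):
    """H = (1/N) ln W with W = N! / prod(n_i!), computed via lgamma for stability."""
    N = sum(counts)
    ln_W = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return ln_W / N

def entropy_nats(p):
    """-sum p ln p, the limiting value in Eq. (4)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]
for N in (10, 1000, 100000):
    counts = [int(round(pi * N)) for pi in p]
    print(N, multiplicity_entropy(counts))
print("limit:", entropy_nats(p))
```

The gap shrinks roughly like $(\ln N)/N$, which is the size of the terms Stirling's approximation discards.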

### 1.4 Conditional and Joint Entropy

$H[y\mid x]=-\int\int p(y,x)\ln p(y\mid x)dydx \tag{5}$

$H[x,y]=H[y\mid x]+H[x] \tag{6}$

Here $H[x,y]$ is the entropy of $p(x,y)$ and $H[x]$ is the entropy of $p(x)$, so $H[y\mid x]$ is the additional information needed to specify $y$ once $x$ is already known. Equation (6) is the chain rule for entropy: the information in the pair $(x,y)$ is the information in $x$ plus whatever extra is needed for $y$ given $x$.
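The chain rule of Eq. (6) can be verified on a small hypothetical joint distribution:

```python
import math

def H(probs):
    """Entropy in nats of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A hypothetical 2x2 joint distribution p(x, y), used only as an example.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x) obtained by summing out y.
p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}

H_joint = H(list(p_xy.values()))
H_x = H(list(p_x.values()))
# Conditional entropy H[y|x] = -sum p(x,y) ln p(y|x), with p(y|x) = p(x,y)/p(x).
H_y_given_x = -sum(p * math.log(p / p_x[x]) for (x, _), p in p_xy.items())

# Chain rule (Eq. 6): H[x,y] = H[y|x] + H[x]
print(H_joint, H_y_given_x + H_x)
```

The two printed values agree to floating-point precision, since the identity $\ln p(x,y)=\ln p(y\mid x)+\ln p(x)$ holds term by term.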

## 2. Relative Entropy (KL Divergence)

### 2.1 KL Divergence

The KL divergence (Kullback–Leibler divergence) measures the discrepancy between two distributions $p(x)$ and $q(x)$. Suppose the true distribution is $p(x)$, but we encode the data using a different distribution $q(x)$; the extra information required is $d=H(p,q)-H(p)$, where $H(p,q)$ is the cross entropy. The KL divergence is exactly this measure of extra information:

\begin{aligned} \text{KL}(p\Vert q)&=H(p,q)-H(p) \\ &= -\int p(x)\ln q(x)dx-\left(-\int p(x)\ln p(x)dx\right) \\ &= -\int p(x)\ln\left(\frac{q(x)}{p(x)}\right)dx \end{aligned}

Jensen's inequality states that for a convex function $f$: $f(E[x])\leq E[f(x)] \tag{8}$ Applying it with the convex function $-\ln$ shows that the KL divergence is non-negative:

$\text{KL}(p\Vert q)=-\int p(x)\ln\left(\frac{q(x)}{p(x)}\right)dx \geq -\ln\int p(x)\frac{q(x)}{p(x)}dx=-\ln\int q(x)dx=0 \tag{9}$
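A minimal discrete sketch of Eqs. (7)–(9), using two hypothetical distributions, illustrates the non-negativity and the asymmetry of the KL divergence:

```python
import math

def kl(p, q):
    """KL(p||q) = -sum p ln(q/p), in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

print(kl(p, q))              # positive, since p differs from q
print(kl(p, p))              # vanishes when the two distributions coincide
print(kl(p, q), kl(q, p))    # the two directions differ: KL is not symmetric
```

Because KL is not symmetric and violates the triangle inequality, it is a divergence rather than a true distance metric.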

### 2.2 KL Divergence and Machine Learning

Suppose the data come from an unknown true distribution $p(x)$ that we approximate with a parametric model $q(x\mid\theta)$. Given $N$ samples $x_{n}$ drawn from $p(x)$, the expectation in the KL divergence can be replaced by a sample average:

$\text{KL}(p\Vert q) \simeq \frac{1}{N} \sum_{n=1}^{N} (-\ln q(x_{n}\mid \theta)+\ln p(x_{n})) \tag{10}$

Only the first term depends on $\theta$, so minimizing the KL divergence over $\theta$ is equivalent to maximizing the likelihood of the data under $q$.
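The Monte Carlo approximation in Eq. (10) can be sketched with hypothetical $p$ and $q$; with enough samples the average approaches the exact KL divergence:

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical example: p is the "true" distribution, q plays the role of q(.|theta).
p = {0: 0.6, 1: 0.4}
q = {0: 0.5, 1: 0.5}

# Draw samples x_n ~ p and average -ln q(x_n) + ln p(x_n), as in Eq. (10).
xs = random.choices(list(p), weights=list(p.values()), k=200000)
estimate = sum(-math.log(q[x]) + math.log(p[x]) for x in xs) / len(xs)

# Exact KL(p||q) for comparison.
exact = sum(px * math.log(px / q[x]) for x, px in p.items())
print(estimate, exact)  # the sample average is close to the exact value
```

In practice $p(x_{n})$ is unknown, but since the $\ln p(x_{n})$ term does not involve $\theta$, fitting $\theta$ only requires minimizing the average of $-\ln q(x_{n}\mid\theta)$, i.e. the negative log-likelihood.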

### 2.3 KL Divergence and Mutual Information

For a joint distribution $p(x,y)$, the mutual information measures how far $x$ and $y$ are from being independent; it is the KL divergence between the joint distribution and the product of the marginals:

\begin{aligned} I(x,y)&=\text{KL}(p(x,y)\Vert p(x)p(y)) \\ &=-\int \int p(x,y)\ln\left(\frac{p(x)p(y)}{p(x,y)}\right)dxdy \end{aligned}

It can equivalently be written in terms of conditional entropies:

$I(x,y)=H[x]-H[x\mid y]=H[y]-H[y\mid x] \tag{11}$

so the mutual information is the reduction in uncertainty about $x$ once $y$ is observed (and vice versa).
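Both forms of the mutual information can be checked on a hypothetical joint distribution over two binary variables:

```python
import math

def H(probs):
    """Entropy in nats of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y); marginals computed to match it.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}   # p(x) = sum_y p(x, y)
p_y = {0: 0.4, 1: 0.6}   # p(y) = sum_x p(x, y)

# I(x,y) as KL(p(x,y) || p(x)p(y)).
I_kl = sum(p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# I(x,y) = H[x] - H[x|y], per Eq. (11).
H_x = H(list(p_x.values()))
H_x_given_y = -sum(p * math.log(p / p_y[y]) for (x, y), p in p_xy.items())
print(I_kl, H_x - H_x_given_y)  # the two expressions agree
```

Since this joint distribution does not factorize into $p(x)p(y)$, the mutual information comes out strictly positive.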

## 3. JS Divergence

The JS divergence (Jensen–Shannon divergence) symmetrizes the KL divergence by comparing each distribution with their mixture:

$JS(p(x)\Vert q(x))=\frac{1}{2}\text{KL}(p(x)\Vert \frac{p(x)+q(x)}{2})+ \frac{1}{2}\text{KL}(q(x)\Vert \frac{p(x)+q(x)}{2}) \tag{12}$
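A minimal discrete sketch of Eq. (12); unlike KL, the JS divergence is symmetric and stays finite even when the supports of $p$ and $q$ do not overlap:

```python
import math

def kl(p, q):
    """KL(p||q) = sum p ln(p/q) in nats; terms with p = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """JS divergence (Eq. 12): average KL of p and q against the mixture (p+q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]
q = [0.0, 1.0]
# For completely disjoint supports, KL(p||q) would be infinite, but the JS
# divergence attains its finite maximum of ln 2 (in nats).
print(js(p, q), math.log(2))
```

The mixture $m=(p+q)/2$ is what keeps the divergence finite: wherever $p$ puts mass, $m$ is at least half of it, so no logarithm blows up.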