# Variational Inference

Posted by Gwan Siu on March 29, 2018

## 1. What’s variational inference?

Representation, learning, and inference are 3 cores problem of machine learning. For statistic inference, it involves finding the approxiate model and parameters to represent the distribution given observed variables. In other work, given complete data $x$ and $y$ and unknown parameter $\theta$, this is classical parameter estimation problem in ML area. Usually, we adopt maximum likelihood estimation(MLE):

in many real situations, we are given imcomplete data, i.e., only data $x$, in this case, latent variables $z$ are introduced. For example, in gassian mixture model, we introduce $z_{i}$ to indicate the underlying gaussian distribution. Thus, the overall formulation is changed

in fact, this integration usually is high-dimensional integration, which is intractable. It means that extract inference is impossible in this case. Therefore, we need to introduce approximate inference techniques in this case. Samling-based algorithms and variation-based algorithms are two kinds of approximate inference algorithms in modern bayesian statistics.

In this article, we mainly focus on variational inference(VI). The core idea of VI is to posit a family of distribution and then to find the member of that family which is close to the target, where closeness is measured using the Kullback-Leibler divergence.

In my previous article-EM, we can see that data likelihood can be decomposed into evidence lower bound and KL divergence:

where $\mathcal{F}(q,\theta)$ is evidence lower bound for marginal likelihood due to $\text{KL}(q(z),p(z\vert x;\theta))$ is non-negative.

Instead of maximize marginal likelihood directly, EM algorithm and variational inference maximize the lower bound.

1. The first term is the expectation of the data likelihood and thus $\mathcal{F}(q,\theta)$ encourage distributions put their mass on configurations of latent variables that explain observed data.
2. The second term is the negative KL divergence between the variational distribution and the prior, so the $\mathcal{F}(q,\theta)$ force $q(z)$ to close to the prior $p(z)$.

Hence, maximize $\mathcal{F}(q,\theta)$ ** means to balance the likelihood and prior.**

## 2. Expectation-Maximization

In EM framework, we assume $q(z)=p(z\vert x;\theta^{old})$. The ELBO becomes:

where $H(q)$ is the entropy of $z$ given $x$. It is constant w.r.t $\theta$ and thus we will not take it into account when we maximize ELBO. The EM algorithm is sufficient to maximize $Q(\theta, \theta^{old})$

E-step: maximize $\mathcal{F}(q,\theta)$ w.r.t distribution over hidden variables given the parameters:

M-step: maximize $\mathcal{F}(q,\theta)$ w.r.t the parameters given the hidden distribution

## 3. Mean Field Theory

In EM framework, $q(z)=p(z\vert x;\theta^{old})$ is computed by iterative method. It means that we can find a analytical solution of $p(x\vert x;\theta^{old})$, this is possible for simple modles but can not be generalized to complex models. Instead, we approximate the posterior distribuiton by a family of simple dsitributions.

we assume the latent variables are mutually independent and each governed by a distinct factor in the variational distribution， i.e. $z_{i}\perp z_{j}$, for $i\neq j$. This is called mean-field theory.

## 4. Coodinate Ascent Variational Inference(CAVI)

In this part, I will compbined with mean-field theory and talk about how ELBO is maximize. One latent variabe posterior $q(z_{i})$ is updated by the rest latent variables $i\neq j$. Here, I will talk about CAVI algorithm. Let $q(z)=\prod_{i}q(z_{i})$. Then, the EBLO becomes:

Since KL divergence is non-negative, thus, ELBO is maximized when $\text{KL}(q(z_{j}),\hat{p}(z_{i\neq j}))=0$, i.e.

Similarly, in variational EM:

E-step: $q^{\ast}=\frac{1}{Z}\text{exp}(\mathbb{E}[\ln p(x,z;\theta)]_{i\neq j})$

M-step: maximize the $\mathcal{F}(q,\theta)$.

The figure below is the process of CAVI algorithm: ## 4. Variational inference and GMM

In this section, CAVI algorithm is used for Mixture of Gaussians model(GMM). It will be helpful to understand how CAVI works.

### 4.1 Joint distribution computation

Given observed data $X=(x_{1},…,x_{n})$ from $K$ independent gaussian distribution with mean $\mu_{k}$. One-hot vector $c_{i}\in \mathbb{R}^{k}$ indicate the distribution to which each data belong. The hyperparameter $\sigma^{2}$ is fixed. latent variables are $\mu, c$. The prior is:

According to bay theroem, we can compute the joint distribution:

once we have joint distribution, we can compute marginal distribution. However, the formulation has no analytical solution, and the computational complexity is $\mathcal{O}(K^{n})$.

### 4.2 GMM and CAVI

Now, we should compute variational ditribution $q(z)$, where $m=(m_{1},…,m_{k}), s^{2}=(s_{1}^{2},…,s_{K}^{2}), \phi=(\phi_{1},…,\phi_{n})$ are variational parameters, hence the formulation of variational distribution is:

1. we can obtain the formulation $\mathrm{ELBO}$, which is a function of $m,s^{2},\phi$.
1. from section 3, we obtain how CAVI algorithm update latent variables. Now, we applyied it into GMM to compute cluster indicator $c$ and update $c$, noted $\mu$ is fixed:

the second term $\log p(c_{i})$ is log prior and it is a constant. Hence, we pay our attention to the first term: the distribution of $c_{i}$ gaussian distribution. In detail, we simplify it due to $c_{i}=(c_{i1},…,c_{ik})$ is one-hot vector, and we have:

from the formulation above, $\mathbb{E}[\mu_{k}]$ and $\mathbb{E}[\mu_{k}^{2}]$ can be computed. For each data point $i$, parameter $\phi_{ik}$ in the $k$th component of the latent variable $c$. The updated formulation is:

smiliarly, we can compute latent variable $\mu$ of GMM. Firstly, we should calculate the optimal variational distribution $q(\mu_{k})$, and the update the parameter $m_{k}, s_{k}^{2}$ of $\mu_{k}$:

The algorithm of GMM and CAVI is below: ## 5. Comparision of MCMC and VI

MCMC VI
More computationally intensive Less intensive
Gaurantess producing asymptotically exact samples from the target distribution No such gaurantees
Slower Faster, expecially for large data sets and complex distributions
Best for precise inference Useful to explore many scenarios quickly or large data sets

Reference 