1. What’s variational inference?
Representation, learning, and inference are 3 cores problem of machine learning. For statistic inference, it involves finding the approxiate model and parameters to represent the distribution given observed variables. In other work, given complete data $x$ and $y$ and unknown parameter $\theta$, this is classical parameter estimation problem in ML area. Usually, we adopt maximum likelihood estimation(MLE):
in many real situations, we are given imcomplete data, i.e., only data $x$, in this case, latent variables $z$ are introduced. For example, in gassian mixture model, we introduce $z_{i}$ to indicate the underlying gaussian distribution. Thus, the overall formulation is changed
in fact, this integration usually is highdimensional integration, which is intractable. It means that extract inference is impossible in this case. Therefore, we need to introduce approximate inference techniques in this case. Samlingbased algorithms and variationbased algorithms are two kinds of approximate inference algorithms in modern bayesian statistics.
In this article, we mainly focus on variational inference(VI). The core idea of VI is to posit a family of distribution and then to find the member of that family which is close to the target, where closeness is measured using the KullbackLeibler divergence.
In my previous articleEM, we can see that data likelihood can be decomposed into evidence lower bound and KL divergence:
where $\mathcal{F}(q,\theta)$ is evidence lower bound for marginal likelihood due to $\text{KL}(q(z),p(z\vert x;\theta))$ is nonnegative.
Instead of maximize marginal likelihood directly, EM algorithm and variational inference maximize the lower bound.
 The first term is the expectation of the data likelihood and thus $\mathcal{F}(q,\theta)$ encourage distributions put their mass on configurations of latent variables that explain observed data.
 The second term is the negative KL divergence between the variational distribution and the prior, so the $\mathcal{F}(q,\theta)$ force $q(z)$ to close to the prior $p(z)$.
Hence, maximize $\mathcal{F}(q,\theta)$ ** means to balance the likelihood and prior.**
2. ExpectationMaximization
In EM framework, we assume $q(z)=p(z\vert x;\theta^{old})$. The ELBO becomes:
where $H(q)$ is the entropy of $z$ given $x$. It is constant w.r.t $\theta$ and thus we will not take it into account when we maximize ELBO. The EM algorithm is sufficient to maximize $Q(\theta, \theta^{old})$
Estep: maximize $\mathcal{F}(q,\theta)$ w.r.t distribution over hidden variables given the parameters:
Mstep: maximize $\mathcal{F}(q,\theta)$ w.r.t the parameters given the hidden distribution
3. Mean Field Theory
In EM framework, $q(z)=p(z\vert x;\theta^{old})$ is computed by iterative method. It means that we can find a analytical solution of $p(x\vert x;\theta^{old})$, this is possible for simple modles but can not be generalized to complex models. Instead, we approximate the posterior distribuiton by a family of simple dsitributions.
we assume the latent variables are mutually independent and each governed by a distinct factor in the variational distribution， i.e. $z_{i}\perp z_{j}$, for $i\neq j$. This is called meanfield theory
.
4. Coodinate Ascent Variational Inference(CAVI)
In this part, I will compbined with meanfield theory and talk about how ELBO is maximize. One latent variabe posterior $q(z_{i})$ is updated by the rest latent variables $i\neq j$. Here, I will talk about CAVI algorithm. Let $q(z)=\prod_{i}q(z_{i})$. Then, the EBLO becomes:
Since KL divergence is nonnegative, thus, ELBO is maximized when $\text{KL}(q(z_{j}),\hat{p}(z_{i\neq j}))=0$, i.e.
Similarly, in variational EM:
Estep: $q^{\ast}=\frac{1}{Z}\text{exp}(\mathbb{E}[\ln p(x,z;\theta)]_{i\neq j})$
Mstep: maximize the $\mathcal{F}(q,\theta)$.
The figure below is the process of CAVI algorithm:
4. Variational inference and GMM
In this section, CAVI algorithm is used for Mixture of Gaussians model(GMM). It will be helpful to understand how CAVI works.
4.1 Joint distribution computation
Given observed data $X=(x_{1},…,x_{n})$ from $K$ independent gaussian distribution with mean $\mu_{k}$. Onehot vector $c_{i}\in \mathbb{R}^{k}$ indicate the distribution to which each data belong. The hyperparameter $\sigma^{2}$ is fixed. latent variables are $\mu, c$. The prior is:
According to bay theroem, we can compute the joint distribution:
once we have joint distribution, we can compute marginal distribution. However, the formulation has no analytical solution, and the computational complexity is $\mathcal{O}(K^{n})$.
4.2 GMM and CAVI
Now, we should compute variational ditribution $q(z)$, where $m=(m_{1},…,m_{k}), s^{2}=(s_{1}^{2},…,s_{K}^{2}), \phi=(\phi_{1},…,\phi_{n})$ are variational parameters, hence the formulation of variational distribution is:

 we can obtain the formulation $\mathrm{ELBO}$, which is a function of $m,s^{2},\phi$.

 from section 3, we obtain how CAVI algorithm update latent variables. Now, we applyied it into GMM to compute cluster indicator $c$ and update $c$, noted $\mu$ is fixed:
the second term $\log p(c_{i})$ is log prior and it is a constant. Hence, we pay our attention to the first term: the distribution of $c_{i}$ gaussian distribution. In detail, we simplify it due to $c_{i}=(c_{i1},…,c_{ik})$ is onehot vector, and we have:
from the formulation above, $\mathbb{E}[\mu_{k}]$ and $\mathbb{E}[\mu_{k}^{2}]$ can be computed. For each data point $i$, parameter $\phi_{ik}$ in the $k$th component of the latent variable $c$. The updated formulation is:
smiliarly, we can compute latent variable $\mu$ of GMM. Firstly, we should calculate the optimal variational distribution $q(\mu_{k})$, and the update the parameter $m_{k}, s_{k}^{2}$ of $\mu_{k}$:
The algorithm of GMM and CAVI is below:
5. Comparision of MCMC and VI
MCMC  VI 

More computationally intensive  Less intensive 
Gaurantess producing asymptotically exact samples from the target distribution  No such gaurantees 
Slower  Faster, expecially for large data sets and complex distributions 
Best for precise inference  Useful to explore many scenarios quickly or large data sets 
Reference

Previous
SamplingInverse sampling, Rejectaccept sampling and MCMC 
Next
Note(8) Sufficiency Statistics