Gaussian Process

Posted by GwanSiu on May 15, 2021

1. Introduction

A Gaussian process is a nonparametric learning method that defines a distribution over functions of the form $f:\mathcal{X}\rightarrow \mathbb{R}$, where $\mathcal{X}$ is any data domain.

The key assumption of a Gaussian process is that the function values at any $M$ inputs, $\mathbf{f}=[f(\mathbf{x}_{1}),\cdots, f(\mathbf{x}_{M})]$, are jointly Gaussian with mean $\mathbf{\mu}=[m(\mathbf{x}_{1}),\cdots, m(\mathbf{x}_{M})]$ and covariance $\Sigma_{ij}=\mathcal{K}(\mathbf{x}_{i},\mathbf{x}_{j})$, where $m$ denotes a mean function and $\mathcal{K}$ is a positive definite kernel (Mercer kernel).
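To make the assumption concrete, here is a minimal NumPy sketch that builds $\Sigma_{ij}=\mathcal{K}(\mathbf{x}_{i},\mathbf{x}_{j})$ for a set of inputs and draws function values from the resulting joint Gaussian. The squared-exponential (RBF) kernel, the zero mean function, the `length_scale` and `variance` parameters, and the small `jitter` term are all illustrative assumptions; the post itself does not fix a particular kernel or mean.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between rows of A (M x D) and B (P x D).

    The kernel choice and its hyperparameters are assumptions for illustration.
    """
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return variance * np.exp(-0.5 * sq_dists / length_scale**2)

def sample_gp_prior(X, mean_fn, kernel_fn, n_samples=3, jitter=1e-6, seed=0):
    """Draw f = [f(x_1), ..., f(x_M)] from N(mu, Sigma) for the inputs in X."""
    rng = np.random.default_rng(seed)
    mu = mean_fn(X)                              # shape (M,)
    Sigma = kernel_fn(X, X)                      # Sigma_ij = K(x_i, x_j)
    Sigma = Sigma + jitter * np.eye(len(X))      # small diagonal term for stability
    L = np.linalg.cholesky(Sigma)                # Sigma = L L^T
    z = rng.standard_normal((len(X), n_samples))
    return mu[:, None] + L @ z                   # shape (M, n_samples)

X = np.linspace(-3.0, 3.0, 50)[:, None]          # M = 50 one-dimensional inputs
zero_mean = lambda X: np.zeros(len(X))           # assumed mean function m(x) = 0
samples = sample_gp_prior(X, zero_mean, rbf_kernel)
```

Each column of `samples` is one draw of the vector $\mathbf{f}$; plotting the columns against `X` shows smooth random functions whose smoothness is governed by `length_scale`.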

This assumption holds for any $M>0$, including the case where we have $N$ training points $\mathbf{x}_{n}$ and $1$ test point $\mathbf{x}_{\ast}$. We can therefore infer $f(\mathbf{x}_{\ast})$ from the training values $f(\mathbf{x}_{1}),\dots,f(\mathbf{x}_{N})$ by manipulating the joint Gaussian distribution $p(f(\mathbf{x}_{1}),\cdots, f(\mathbf{x}_{N}), f(\mathbf{x}_{\ast}))$. The following sections explain the learning and inference procedure.
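The manipulation in question is Gaussian conditioning. As a reminder, stated in generic notation ($\mathbf{a}$, $\mathbf{b}$ and the $\mathbf{\Sigma}$ blocks are placeholder symbols rather than the notation used in this post), if

$$ \begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mathbf{\mu}_{a} \\ \mathbf{\mu}_{b} \end{pmatrix}, \begin{pmatrix} \mathbf{\Sigma}_{aa} & \mathbf{\Sigma}_{ab} \\ \mathbf{\Sigma}_{ab}^{T} & \mathbf{\Sigma}_{bb} \end{pmatrix} \right), $$

then the conditional distribution of $\mathbf{b}$ given $\mathbf{a}$ is again Gaussian:

$$ p(\mathbf{b}\vert \mathbf{a}) = \mathcal{N}\left(\mathbf{b}\,\vert\, \mathbf{\mu}_{b} + \mathbf{\Sigma}_{ab}^{T}\mathbf{\Sigma}_{aa}^{-1}(\mathbf{a}-\mathbf{\mu}_{a}),\; \mathbf{\Sigma}_{bb} - \mathbf{\Sigma}_{ab}^{T}\mathbf{\Sigma}_{aa}^{-1}\mathbf{\Sigma}_{ab}\right). $$

Taking $\mathbf{a}$ to be the training function values and $\mathbf{b}$ to be $f(\mathbf{x}_{\ast})$ gives the predictive distribution at the test point.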

2. Gaussian Process Model

2.1 Noise-free observations

Suppose we observe a training set $\mathcal{D}=\{(\mathbf{x}_{i}, y_{i})\}_{i=1}^{N}$, where $y_{i}=f(\mathbf{x}_{i})$ is the noise-free observation of the function evaluated at $\mathbf{x}_{i}$.

A Gaussian process (GP) acts as an interpolator of the training data: when asked to predict $f(\mathbf{x})$ at an input $\mathbf{x}$ that already appears in the training data, it returns the observed value $f(\mathbf{x})$ with no uncertainty.

In machine learning problems, we are mainly concerned with predicting the outputs for new inputs that may not be in $\mathcal{D}$. Concretely, suppose we are given a test set $\mathbf{X}_{\ast}$ of size $N_{\ast}\times D$, where $N_{\ast}$ is the number of test samples and $D$ is the dimension of each sample vector, and we want to predict the function outputs $\mathbf{f}_{\ast}=[f(\mathbf{x}^{\ast}_{1}),\cdots, f(\mathbf{x}^{\ast}_{N_{\ast}})]$. By the GP assumption, the joint distribution $p(\mathbf{f}_{\mathbf{X}}, \mathbf{f}_{\ast}\vert \mathbf{X}, \mathbf{X}_{\ast})$ has the following form

$$ \begin{equation} \begin{pmatrix} \mathbf{f}_{\mathbf{X}} \\ \mathbf{f}_{\ast} \end{pmatrix}\sim \mathcal{N}\left( \begin{pmatrix} \mathbf{\mu}_{\mathbf{X}} \\ \mathbf{\mu}_{\ast} \end{pmatrix}, \begin{pmatrix} \mathbf{K}_{\mathbf{X},\mathbf{X}} & \mathbf{K}_{\mathbf{X},\ast} \\ \mathbf{K}^{T}_{\mathbf{X},\ast} & \mathbf{K}_{\ast, \ast} \end{pmatrix} \right), \end{equation} $$

where $\mathbf{\mu}_{\mathbf{X}}=[m(\mathbf{x}_{1}), \cdots, m(\mathbf{x}_{N})]$, $\mathbf{\mu}_{\ast}=[m(\mathbf{x}^{\ast}_{1}),\cdots,m(\mathbf{x}^{\ast}_{N_{\ast}})]$, $\mathbf{K}_{\mathbf{X},\mathbf{X}}=\mathcal{K}(\mathbf{X}, \mathbf{X})\in\mathbb{R}^{N\times N}$, $\mathbf{K}_{\mathbf{X}, \ast}=\mathcal{K}(\mathbf{X}, \mathbf{X}_{\ast})\in\mathbb{R}^{N\times N_{\ast}}$, and $\mathbf{K}_{\ast, \ast}=\mathcal{K}(\mathbf{X}_{\ast},\mathbf{X}_{\ast})\in\mathbb{R}^{N_{\ast}\times N_{\ast}}$. By the standard rule for conditioning a joint Gaussian distribution, the posterior has the following form

\[\begin{align} p(\mathbf{f}_{\ast}\vert \mathbf{X}_{\ast}, \mathcal{D}) &= \mathcal{N}(\mathbf{f}_{\ast}\vert \mathbf{\mu}_{\ast}, \mathbf{\Sigma}_{\ast}) \\ \mathbf{\mu}_{\ast} &= m(\mathbf{X}_{\ast}) + \mathbf{K}^{T}_{\mathbf{X},\ast}\mathbf{K}^{-1}_{\mathbf{X},\mathbf{X}}(\mathbf{f}_{\mathbf{X}}-m(\mathbf{X})) \\ \mathbf{\Sigma}_{\ast} &= \mathbf{K}_{\ast, \ast} - \mathbf{K}^{T}_{\mathbf{X},\ast}\mathbf{K}^{-1}_{\mathbf{X},\mathbf{X}}\mathbf{K}_{\mathbf{X}, \ast} \end{align}\]
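The posterior mean and covariance above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference implementation: the RBF kernel, the zero mean function, the toy $\sin$ data, and the tiny `jitter` added to the diagonal for numerical stability are all choices made for this example. The inverse is never formed explicitly; the code solves linear systems against a Cholesky factor instead.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel (an assumed choice) between rows of A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def gp_posterior_noise_free(X, f_X, X_star, kernel_fn, mean_fn, jitter=1e-10):
    """Posterior mean and covariance of f_* given noise-free values f_X at X."""
    K_xx = kernel_fn(X, X) + jitter * np.eye(len(X))   # K_{X,X} plus jitter
    K_xs = kernel_fn(X, X_star)                        # K_{X,*}
    K_ss = kernel_fn(X_star, X_star)                   # K_{*,*}
    L = np.linalg.cholesky(K_xx)                       # K_{X,X} = L L^T
    # alpha = K_{X,X}^{-1} (f_X - m(X)), via two solves against the Cholesky factor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_X - mean_fn(X)))
    V = np.linalg.solve(L, K_xs)                       # V = L^{-1} K_{X,*}
    mu_star = mean_fn(X_star) + K_xs.T @ alpha         # posterior mean
    Sigma_star = K_ss - V.T @ V                        # posterior covariance
    return mu_star, Sigma_star

zero_mean = lambda X: np.zeros(len(X))                 # assumed mean m(x) = 0

# Toy training set: five noise-free observations of sin(x)
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])[:, None]
f_X = np.sin(X[:, 0])

# Predictions on a grid of test inputs
X_star = np.linspace(-2.5, 2.5, 11)[:, None]
mu_star, Sigma_star = gp_posterior_noise_free(X, f_X, X_star, rbf_kernel, zero_mean)

# Predicting at the training inputs themselves illustrates interpolation
mu_train, Sigma_train = gp_posterior_noise_free(X, f_X, X, rbf_kernel, zero_mean)
```

Up to the jitter and floating-point round-off, `mu_train` matches `f_X` and the diagonal of `Sigma_train` is zero, which is the interpolation behavior of a noise-free GP.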

Figure 1 illustrates the graphical model of the Gaussian process.