Naive Bayes

Posted by GwanSiu on June 17, 2018

1. Introduction

Naive Bayes is a classical supervised learning method, which is simple but effective. Given a dataset $X=\{x_{1},\dots,x_{N}\}$ with labels $Y=\{y_{1},\dots,y_{N}\}$, where $N$ denotes the number of samples, each $x_{i}$ is a feature vector, and each label takes one of $K$ classes, we build a parametric model $p(y\vert x,\theta)$ to describe how the data are generated. Our goal is to estimate the parameters $\theta$ and then classify with Bayes' theorem: $\displaystyle{p(y\vert x)=\frac{p(x\vert y)p(y)}{p(x)}}$.

In practice, maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation are commonly adopted for parameter estimation. Variational methods are a more advanced alternative, but they are beyond the scope of this post.

Usually, the values of $\theta^{\text{MLE}}$ and $\theta^{\text{MAP}}$ are not the same.
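As a quick illustration of why the two differ (a standard textbook example, not from the original post): suppose we estimate a Bernoulli parameter $\theta$ from $h$ successes in $n$ trials under a $\text{Beta}(a,b)$ prior. Then

$$\theta^{\text{MLE}}=\frac{h}{n},\qquad \theta^{\text{MAP}}=\frac{h+a-1}{n+a+b-2},$$

and the two estimates coincide only under the uniform prior $a=b=1$.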

Our story begins with Bayes' theorem:

$$p(y\vert x)=\frac{p(x\vert y)p(y)}{p(x)}\propto p(x\vert y)p(y).$$

In naive Bayes, we assume the attributes of $x$ are conditionally independent given the class. Thus, the formulation above can be written as:

$$p(y\vert x)\propto p(y)\prod_{k}p(x_{k}\vert y),$$

where $x_{k}$ denotes the $k$-th attribute of $x$.

In the next section, variant models of naive Bayes are introduced.

2. Model of Naive Bayes

2.1 Bernoulli Naive Bayes

Suppose: binary vectors of length $K$: $x\in \{0,1\}^{K}$.

Generative story:

  1. $Y\sim \text{Bernoulli}(\Phi)$ (binary classification).
  2. $X_{k}\sim \text{Bernoulli}(\theta_{k,Y}),\ \forall k\in\{1,\dots,K\}$.

Model:

$$p_{\Phi,\theta}(x,y)=p_{\Phi}(y)\prod_{k=1}^{K}p_{\theta}(x_{k}\vert y)=\Phi^{y}(1-\Phi)^{1-y}\prod_{k=1}^{K}\theta_{k,y}^{x_{k}}(1-\theta_{k,y})^{1-x_{k}}$$

Classification: Find the class that maximizes the posterior:

$$\hat{y}=\arg\max_{y\in\{0,1\}}p(y\vert x)=\arg\max_{y\in\{0,1\}}p(y)\prod_{k=1}^{K}p(x_{k}\vert y)$$

Training: Find the class-conditional MLE parameters:

$$\Phi^{\text{MLE}}=\frac{\sum_{i=1}^{N}\mathbb{I}(y^{(i)}=1)}{N},\qquad \theta_{k,y}^{\text{MLE}}=\frac{\sum_{i=1}^{N}\mathbb{I}(y^{(i)}=y\wedge x_{k}^{(i)}=1)}{\sum_{i=1}^{N}\mathbb{I}(y^{(i)}=y)}$$

The estimate of $P(Y)$ does not depend on the class split and uses all of the data. For each $P(X_{k}\vert Y)$, we condition only on the data with the corresponding class.
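Below is a minimal NumPy sketch of this training and classification procedure (my own illustration, not code from the original post; the array shapes, 0/1 label convention, and function names are assumptions):

```python
import numpy as np

def train_bernoulli_nb(X, y):
    """MLE for Bernoulli naive Bayes.
    X: (N, K) binary feature matrix, y: (N,) labels in {0, 1}."""
    phi = y.mean()                                            # P(Y = 1)
    # theta[c, k] = P(X_k = 1 | Y = c), estimated only from rows of class c
    theta = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    return phi, theta

def predict_bernoulli_nb(x, phi, theta):
    """Return the class with the larger unnormalized log-posterior."""
    log_post = []
    for c, prior in ((0, 1.0 - phi), (1, phi)):
        # log p(y=c) + sum_k log p(x_k | y=c); without smoothing this can
        # hit log(0) for unseen features -- see Section 3.
        ll = np.sum(x * np.log(theta[c]) + (1 - x) * np.log(1 - theta[c]))
        log_post.append(np.log(prior) + ll)
    return int(np.argmax(log_post))
```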

2.2 Model 2: Multinomial Naive Bayes

Suppose: an integer vector of word ids, $x=(x_{1},\dots,x_{M})$, where each $x_{i}\in\{1,\dots,K\}$ is a word id, for $i=1,\dots,M$.

Generative story:

For $i\in \{1,\dots,N\}$: $y^{(i)}\sim \text{Bernoulli}(\Phi)$.

For $j\in \{1,\dots,M_{i}\}$: $x_{j}^{(i)}\sim \text{Multinomial}(\theta_{y^{(i)}},1)$.

Model:

$$p_{\Phi,\theta}(x^{(i)},y^{(i)})=p_{\Phi}(y^{(i)})\prod_{j=1}^{M_{i}}p_{\theta}(x_{j}^{(i)}\vert y^{(i)})=\Phi^{y^{(i)}}(1-\Phi)^{1-y^{(i)}}\prod_{j=1}^{M_{i}}\theta_{x_{j}^{(i)},\,y^{(i)}}$$
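A rough NumPy sketch of the multinomial variant (again my own illustration; I assume 0-based word ids, binary labels, and a list of token-id lists as the document representation):

```python
import numpy as np

def train_multinomial_nb(docs, y, K):
    """MLE for multinomial naive Bayes.
    docs: list of N documents, each a list of word ids in {0, ..., K-1};
    y: (N,) labels in {0, 1}; K: vocabulary size."""
    phi = y.mean()                                      # P(Y = 1)
    counts = np.zeros((2, K))                           # word counts per class
    for doc, label in zip(docs, y):
        for word in doc:
            counts[label, word] += 1
    theta = counts / counts.sum(axis=1, keepdims=True)  # theta[c, w] = P(w | y=c)
    return phi, theta

def predict_multinomial_nb(doc, phi, theta):
    """Score a document by its class prior plus per-token log-probabilities."""
    scores = [np.log(1.0 - phi) + np.log(theta[0, doc]).sum(),
              np.log(phi) + np.log(theta[1, doc]).sum()]
    return int(np.argmax(scores))
```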

2.3 Model 3: Gaussian Naive Bayes

Suppose: $X\in \mathbb{R}^{K}$, i.e. $X$ is a continuous variable, whereas all the models above handle discrete variables.

Model: the product of the prior and the event model:

$$p(x,y)=p(y)\,p(x\vert y)=p(y)\prod_{k=1}^{K}p(x_{k}\vert y)$$

Gaussian Naive Bayes assumes that $p(x_{k}\vert y)$ is given by a Normal distribution.

  • When $X$ is a multivariate Gaussian vector, the joint probability of a data point and its label is:

$$p(x,y=1)=p(y=1)\,p(x\vert y=1)=p(y=1)\,\frac{1}{(2\pi)^{K/2}\vert\Sigma\vert^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$

  • The naive Bayes simplification drops the cross-dimension correlations and keeps one 1-D Gaussian per attribute:

$$p(x,y=1)=p(y=1)\prod_{k=1}^{K}p(x_{k}\vert y=1)=p(y=1)\prod_{k=1}^{K}\frac{1}{\sqrt{2\pi}\,\sigma_{k}}\exp\left(-\frac{1}{2}\left(\frac{x_{k}-\mu_{k}}{\sigma_{k}}\right)^{2}\right)$$

  • More generally:

$$p(x,y)=p(y)\prod_{k=1}^{K}p(x_{k}\vert y),$$

  where $p(\cdot\vert\cdot)$ is an arbitrary conditional (discrete or continuous) 1-D density. A small code sketch of the Gaussian case follows below.
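A minimal sketch of the Gaussian case (my own illustration; per-class means and variances are fit independently per attribute, and a variance floor may be needed in practice):

```python
import numpy as np

def train_gaussian_nb(X, y):
    """Per-class mean/variance estimates for Gaussian naive Bayes.
    X: (N, K) real-valued features, y: (N,) labels in {0, 1}."""
    phi = y.mean()
    mu = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])   # (2, K)
    var = np.stack([X[y == c].var(axis=0) for c in (0, 1)])   # (2, K)
    return phi, mu, var

def log_normal_pdf(x, mu, var):
    """Elementwise log N(x; mu, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict_gaussian_nb(x, phi, mu, var):
    scores = [np.log(1.0 - phi) + log_normal_pdf(x, mu[0], var[0]).sum(),
              np.log(phi) + log_normal_pdf(x, mu[1], var[1]).sum()]
    return int(np.argmax(scores))
```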

2.4 Model 4: Multiclass Naive Bayes

Model: The only change is that we permit $y$ to range over $C$ classes.

Now, $y\sim \text{Multinomial}(\Phi,1)$ and we have a separate conditional distribution $p(x_{k}\vert y)$ for each of the $C$ classes.
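For concreteness, here is a sketch of the multiclass version of the Bernoulli model from Section 2.1 (my own illustration; labels are assumed to be integers in $\{0,\dots,C-1\}$):

```python
import numpy as np

def train_multiclass_bernoulli_nb(X, y, C):
    """X: (N, K) binary features, y: (N,) integer labels in {0, ..., C-1}."""
    prior = np.bincount(y, minlength=C) / len(y)                  # P(y = c)
    theta = np.stack([X[y == c].mean(axis=0) for c in range(C)])  # (C, K)
    return prior, theta

def predict_multiclass_bernoulli_nb(x, prior, theta):
    # One log-score per class: log prior + sum of Bernoulli log-likelihoods.
    log_post = np.log(prior) + (x * np.log(theta)
                                + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return int(np.argmax(log_post))
```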

3. Regularization

In this section, we take the Bernoulli naive Bayes model as an example for spam vs. non-spam document classification. In real scenarios, we often observe words at test time that were never seen during training. In this case, what happens to the MLE estimate of $p(x_{k}\vert y)$?

The result is that $\theta_{k,0}^{\text{MLE}}=0$: the unseen word was never observed with class $y=0$, so its estimated probability is zero.

At test time, the posterior probability of class $y=0$ for any document containing that word collapses to zero:

$$p(y=0\vert x)\propto p(y=0)\prod_{k=1}^{K}p(x_{k}\vert y=0)=0.$$
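A tiny numerical illustration of the collapse (the numbers are purely illustrative, not from the original post):

```python
import numpy as np

theta_0 = np.array([0.6, 0.3, 0.0])   # P(x_k = 1 | y = 0); last word never seen with y = 0
x = np.array([1, 0, 1])               # test document contains the unseen word

likelihood = np.prod(theta_0 ** x * (1 - theta_0) ** (1 - x))
print(likelihood)                     # 0.0 -- the entire class-0 score is wiped out
```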

3.1 Add-1 Smoothing

In order to avoid this case, the simplest method is to add a single pseudo-observation to the data. This converts the true observations $D$ into a new dataset $D'$ from which we derive the model:

$$D'=D\cup\{(\mathbf{0},y)\ \vert\ y\in\mathcal{Y}\}\cup\{(\mathbf{1},y)\ \vert\ y\in\mathcal{Y}\},$$

where $\mathbf{0}$ is the vector of all zeros and $\mathbf{1}$ is the vector of all ones. This has the effect of pretending that we observed each feature $x_{k}$ with each class $y$.

The parameter estimate then becomes:

$$\theta_{k,y}=\frac{1+\sum_{i=1}^{N}\mathbb{I}(y^{(i)}=y\wedge x_{k}^{(i)}=1)}{2+\sum_{i=1}^{N}\mathbb{I}(y^{(i)}=y)}.$$
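In code, the change from the earlier MLE sketch is just two pseudo-counts (my own illustration, continuing the assumed (N, K) binary layout):

```python
import numpy as np

def train_bernoulli_nb_add1(X, y):
    """Bernoulli naive Bayes with add-1 smoothing.
    X: (N, K) binary features, y: (N,) labels in {0, 1}."""
    phi = y.mean()
    theta = np.stack([
        (1 + X[y == c].sum(axis=0)) / (2 + (y == c).sum())  # add-1 pseudo-counts
        for c in (0, 1)
    ])
    return phi, theta   # every theta[c, k] now lies strictly inside (0, 1)
```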

3.2 Add-$\lambda$ Smoothing

Suppose we have a dataset obtained by repeatedly rolling a $K$-sided (weighted) die. Given data $D=\{x_{i}\}_{i=1}^{N}$ where $x_{i}\in\{1,\dots,K\}$, we have the following MLE:

$$\theta_{k}^{\text{MLE}}=\frac{\sum_{i=1}^{N}\mathbb{I}(x_{i}=k)}{N}.$$

With add-$\lambda$ smoothing, we add pseudo-observations as before ($\lambda$ per face) to obtain a smoothed estimate:

$$\theta_{k}=\frac{\lambda+\sum_{i=1}^{N}\mathbb{I}(x_{i}=k)}{K\lambda+N}.$$
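A short sketch of the die example with add-$\lambda$ smoothing (my own illustration; I use 0-based face ids, and $\lambda=0$ recovers the MLE):

```python
import numpy as np

def add_lambda_estimate(rolls, K, lam=1.0):
    """Smoothed estimate of a K-sided die from rolls with faces in {0, ..., K-1}."""
    counts = np.bincount(rolls, minlength=K)
    return (counts + lam) / (len(rolls) + K * lam)

rolls = np.array([0, 1, 1, 2, 3, 3, 3, 4, 0, 2])    # face 5 never appears
print(add_lambda_estimate(rolls, K=6, lam=0.0))     # MLE: face 5 gets probability 0
print(add_lambda_estimate(rolls, K=6, lam=1.0))     # smoothed: strictly positive everywhere
```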