## 1. Introduction

Naive Bayes is a classical supervised learning method that is simple but effective. Given a dataset $X=\{x_{1},\dots,x_{N}\}$ with labels $Y=\{y_{1},\dots,y_{N}\}$, where $N$ denotes the number of examples, $K$ is the number of classes, each $x_{i}$ is a feature vector, and $y_{i}\in\{1,\dots,K\}$, we build a parametric model $p(y\vert x,\theta)$ to describe how the data are generated, and our goal is to estimate the parameters $\theta$ via Bayes' theorem: $\displaystyle{p(y\vert x)=\frac{p(x\vert y)p(y)}{p(x)}}$.

In practice, maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation are commonly adopted for parameter estimation. Variational methods are a more advanced alternative, which is beyond our scope.

Usually, the values of $\mathbf{\theta}^{\text{MLE}}$ and $\mathbf{\theta}^{\text{MAP}}$ are not the same.

Our story begins with Bayes' theorem:

$$p(y\vert x)=\frac{p(x\vert y)p(y)}{p(x)}$$

In naive Bayes, we assume the attributes of $x$ are **conditionally independent** given the class. Thus, the formulation above can be written as:

$$p(y\vert x)=\frac{p(y)\prod_{k}p(x_{k}\vert y)}{p(x)}$$

In the next section, variants of the naive Bayes model are introduced.

## 2. Model of Naive Bayes

#### 2.1 Model 1: Bernoulli Naive Bayes

**Support:** binary vectors of length $K$: $x\in \{0,1\}^{K}$.

**Generative story:**

- $Y\sim \text{Bernoulli}(\Phi)$, binary classification.

- $X_{k}\sim \text{Bernoulli}(\theta_{k,Y}),\ \forall k\in\{1,\dots,K\}$.

**Model:**

$$p(x,y)=p(y)\prod_{k=1}^{K}p(x_{k}\vert y)=\Phi^{y}(1-\Phi)^{1-y}\prod_{k=1}^{K}\theta_{k,y}^{x_{k}}(1-\theta_{k,y})^{1-x_{k}}$$

**Classification:** Find the class that maximizes the posterior:

$$\hat{y}=\operatorname*{argmax}_{y}\,p(y\vert x)=\operatorname*{argmax}_{y}\,p(y)\prod_{k=1}^{K}p(x_{k}\vert y)$$

**Training:** Find the **class-conditional** MLE parameters:

$$\Phi=\frac{\sum_{i=1}^{N}\mathbb{1}[y^{(i)}=1]}{N},\qquad \theta_{k,y}=\frac{\sum_{i=1}^{N}\mathbb{1}[y^{(i)}=y]\,x_{k}^{(i)}}{\sum_{i=1}^{N}\mathbb{1}[y^{(i)}=y]}$$

$P(Y)$ does not depend on the features and is estimated from all the data. For each $P(X_{k}\vert Y)$, we condition on the data with the corresponding class.
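The training and classification steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation; the function names and the toy data below are my own, and no smoothing is applied yet (so it assumes every feature appears at least once in each class):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Class-conditional MLE for Bernoulli naive Bayes with binary labels.

    X: (N, K) binary feature matrix; y: (N,) array of 0/1 labels.
    Returns the class prior phi = P(Y=1) and theta[k, c] = P(X_k = 1 | Y = c).
    """
    N, K = X.shape
    phi = y.mean()                      # P(Y) uses all the data
    theta = np.zeros((K, 2))
    for c in (0, 1):
        Xc = X[y == c]                  # condition on the data of class c
        theta[:, c] = Xc.mean(axis=0)   # P(X_k = 1 | Y = c)
    return phi, theta

def predict_bernoulli_nb(x, phi, theta):
    """Return the class maximizing the (unnormalized) log-posterior."""
    log_post = np.zeros(2)
    for c in (0, 1):
        prior = phi if c == 1 else 1 - phi
        lik = theta[:, c] ** x * (1 - theta[:, c]) ** (1 - x)
        log_post[c] = np.log(prior) + np.sum(np.log(lik))
    return int(np.argmax(log_post))
```

Working in log space avoids underflow when the number of features $K$ is large, since the product of many probabilities quickly becomes tiny.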

#### 2.2 Model 2: Multinomial Naive Bayes

**Support:** integer vectors of word IDs: $x=\{x_{1},\dots,x_{M}\}$, where $x_{i}\in\{1,\dots,K\}$ is a word id, for $i=1,\dots,M$.

**Generative story:**

For $i\in \{1,\dots,N\}$: $y^{(i)}\sim \text{Bernoulli}(\Phi)$.

For $j\in \{1,\dots,M_{i}\}$: $x_{j}^{(i)}\sim \text{Multinomial}(\theta_{y^{(i)}},1)$.

**Model:**

$$p(x,y)=p(y)\prod_{j=1}^{M}p(x_{j}\vert y)=\Phi^{y}(1-\Phi)^{1-y}\prod_{j=1}^{M}\theta_{x_{j},y}$$
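A short sketch of evaluating this joint probability for one document (the function name and the toy word distribution are my own; word ids are zero-indexed here for array indexing):

```python
import numpy as np

def multinomial_nb_log_joint(word_ids, y, phi, theta):
    """Joint log-probability log p(x, y) under multinomial naive Bayes.

    word_ids: sequence of word ids in {0, ..., K-1}; y: 0 or 1.
    theta: (K, 2) array with theta[w, c] = p(word w | class c);
    each column must sum to 1.
    """
    log_prior = np.log(phi if y == 1 else 1 - phi)
    log_lik = sum(np.log(theta[w, y]) for w in word_ids)
    return log_prior + log_lik
```

Note that the same parameter vector $\theta_{y}$ is reused for every position $j$ in the document; only the class, not the position, changes the word distribution.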

#### 2.3 Model 3: Gaussian Naive Bayes

**Support:** $X\in \mathbb{R}^{K}$. Here $X$ is a continuous variable, whereas all the models above use discrete variables.

**Model:** Product of the **prior** and the event model:

$$p(x,y)=p(y)\prod_{k=1}^{K}p(x_{k}\vert y)$$

Gaussian naive Bayes assumes that each $p(x_{k}\vert y)$ is given by a normal distribution:

$$p(x_{k}\vert y=c)=\mathcal{N}(x_{k};\mu_{k,c},\sigma_{k,c}^{2})$$

- When $X$ is a multivariate Gaussian vector, the joint probability of a data point and its label is:

$$p(x,y)=p(y)\,\mathcal{N}(x;\mu_{y},\Sigma_{y})$$

- The **naive Bayes** simplification restricts $\Sigma_{y}$ to be diagonal, so the joint factorizes over features:

$$p(x,y)=p(y)\prod_{k=1}^{K}\mathcal{N}(x_{k};\mu_{k,y},\sigma_{k,y}^{2})$$

- More generally:

$$p(x,y)=p(y)\prod_{k=1}^{K}p(x_{k}\vert y)$$

  where $p(\bullet \vert \bullet)$ is an arbitrary conditional (discrete or continuous) 1-D density.
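The diagonal-covariance simplification can be sketched by evaluating the log-joint feature by feature. This is a toy illustration under my own naming, assuming per-feature class means `mu[k, c]` and variances `sigma2[k, c]` have already been estimated:

```python
import numpy as np

def gaussian_nb_log_joint(x, y, prior, mu, sigma2):
    """log p(x, y) under Gaussian naive Bayes with diagonal covariance.

    x: (K,) real feature vector; y: class index.
    prior: (C,) class prior probabilities.
    mu, sigma2: (K, C) per-feature class means and variances.
    """
    # log N(x_k; mu_{k,y}, sigma^2_{k,y}) for each feature k
    log_norm = -0.5 * np.log(2 * np.pi * sigma2[:, y])
    log_lik = log_norm - (x - mu[:, y]) ** 2 / (2 * sigma2[:, y])
    return np.log(prior[y]) + log_lik.sum()
```

Because the covariance is diagonal, the $K$-dimensional density is just a sum of $K$ one-dimensional log-densities, which is exactly the naive Bayes factorization above.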

#### 2.4 Model 4: Multiclass Naive Bayes

**Model:** The only change is that we permit $y$ to range over $C$ classes.

Now $y\sim \text{Multinomial}(\Phi,1)$, and we have a separate conditional distribution $p(x_{k}\vert y)$ for each of the $C$ classes.

## 3. Regularization

In this section, we take the Bernoulli naive Bayes model as an example for spam vs. non-spam document classification. In real scenarios, we often observe words at test time that were never seen during training. In this case, what happens to the MLE estimate of $p(x_{k}\vert y)$?

If word $k$ never appears in a non-spam ($y=0$) training document, the result is that $\theta_{k,0}^{\text{MLE}}=0$.

At test time, for any document containing word $k$, the posterior probability would be:

$$p(y=0\vert x)\propto p(y=0)\prod_{k'}p(x_{k'}\vert y=0)=0,$$

no matter how strongly the other words favor the non-spam class.

### 3.1 Add-1 Smoothing

In order to avoid this case, the simplest method is to add pseudo-observations to the data. This converts the true observations $D$ into a new dataset $D'$ from which we derive the model:

$$D'=D\cup\{(\mathbf{0},0),(\mathbf{0},1),(\mathbf{1},0),(\mathbf{1},1)\}$$

where $\mathbf{0}$ is the vector of all zeros and $\mathbf{1}$ is the vector of all ones. This has the effect of pretending that we observed each feature $x_{k}$ with each class $y$.

The parameter estimates become:

$$\theta_{k,y}=\frac{1+\sum_{i=1}^{N}\mathbb{1}[y^{(i)}=y]\,x_{k}^{(i)}}{2+\sum_{i=1}^{N}\mathbb{1}[y^{(i)}=y]}$$
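A minimal sketch of this smoothed estimate (the function name and toy data are my own): each class gets one pretend all-ones document and one pretend all-zeros document, so the estimate can never be exactly 0 or 1.

```python
import numpy as np

def theta_add1(X, y, c):
    """Add-1 smoothed estimate of p(x_k = 1 | y = c) for Bernoulli NB.

    X: (N, K) binary feature matrix; y: (N,) labels; c: class of interest.
    The +1 in the numerator and +2 in the denominator correspond to the
    pseudo-documents of all ones and all zeros added for class c.
    """
    Xc = X[y == c]
    return (1 + Xc.sum(axis=0)) / (2 + len(Xc))
```

Even for a feature that never fires in class `c`, the estimate is $1/(2+N_c) > 0$, so unseen words no longer zero out the posterior.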

### 3.2 Add-$\lambda$ Smoothing

Suppose we have a dataset obtained by repeatedly rolling a $K$-sided (weighted) die. Given data $D=\{x_{i}\}_{i=1}^{N}$ where $x_{i}\in\{1,\dots,K\}$, we have the following MLE:

$$\theta_{k}=\frac{\sum_{i=1}^{N}\mathbb{1}[x_{i}=k]}{N}$$

With add-$\lambda$ smoothing, we add pseudo-observations as before to obtain a smoothed estimate:

$$\theta_{k}=\frac{\lambda+\sum_{i=1}^{N}\mathbb{1}[x_{i}=k]}{K\lambda+N}$$
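The die example can be checked directly in code. A small sketch under my own naming (sides are zero-indexed 0 to K-1 rather than 1 to K, for array indexing):

```python
import numpy as np

def die_mle(rolls, K):
    """Unsmoothed MLE for a K-sided die: theta_k = count(k) / N."""
    counts = np.bincount(rolls, minlength=K)
    return counts / len(rolls)

def die_add_lambda(rolls, K, lam):
    """Add-lambda estimate: theta_k = (count(k) + lam) / (N + K*lam)."""
    counts = np.bincount(rolls, minlength=K)
    return (counts + lam) / (len(rolls) + K * lam)
```

Setting $\lambda=1$ recovers add-1 smoothing, while $\lambda\to 0$ recovers the MLE; unobserved sides get probability $\lambda/(K\lambda+N)$ instead of 0.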