## 1. Logistic Regression

**Data:** Inputs are continuous vectors of length $K$. Outputs are discrete.

**Model:** Logistic function applied to dot product of parameters with input vector.

**Learning:** Finds the parameters that minimized some objective function.

Usually, we minimize the negative log conditional likelihood:

\[\begin{equation} J(\theta) = -\log\prod_{i=1}^{N}p_{\theta}(y^{(i)}\vert \mathbf{x}^{(i)}) \end{equation}\]*Why can’t maximize likelihood(as in naive bayes)*

**Prediction:** Output is the most probable class.

## 2. Learning Methods

**learning:** Three approaches to solve $\theta^{\ast}=\arg\min_{\theta}J(\theta)$

**Approach 1:**Gradient Descent(Take larger-more certain-steps apposite the gradient)**Approach 2:**Stochastic Gradient Descent(SGD)(Take many small steps apposite the gradient)**Approach 3:**Newton’s Method(use second derivatives to better follow curvature)

## 3. Newton’s Method and Logistic Regression

The motivation behind **Newton’s method** is to use a **quadreatic approximation** of our function to make a good guess where we should step next.

The **Taylor series expansion** for an infinitely differentiable function $f(x),x\in\mathbb{R},$ about a pint $\nu\in\mathbb{R}$ is:

The **2nd-order Taylor series approximation** cuts off the expansion after the quadratic term:

The vector version of Taylor series expansion for an infinitely differentiable function $f(x),\mathbf{x}\in \mathbb{R}^{K}$, about a point $\mathbf{v}\in\mathbb{R}^{K}$ is:

\[\begin{equation} f(x) = f(\nu)+\frac{(x-\nu)^{T}\nabla f(x)}{1!}+\frac{(x-v)^{T}\nabla^{2} f(x)(x-v)}{2!}+... \end{equation}\]The **2nd-order Taylor series approximation** cuts off the expansion after the quadratic term:

Taking the derivative of $\tilde{f}(v)$ and setting to 0 gives us the closed form minimizer of this(convex) quadratic function:

\[\arg\min_{x}\tilde{f}(x)=x-(\nabla^{2}f(x))^{-1}\nabla f(x)\]The added term $\nabla x_{nt}=-(\nabla^{2}f(x))^{-1}\nabla f(x)$ is called Newton’s step.

**The algorithm of Newton’s Method(Newton-Raphson method)**

Goal: $x^{\ast}=\arg\min_{x}f(x)$

- Approximate the function with the 2nd order Taylor series:

- Compute its minimizer

- Step to that minimizer:

- Repeat

## 3. Iterative Weighted Least Square(Newton’s Method for Logistic Regression)

As we know, linear regression has closed-form solution, but there are no longer closed-form solution for logistic regression due to nonlinearity of the logistic sigmoid function. In fact, the lost function of logistic regression is convex, and it is guaranteed to find a global optimal solution. Furthermore, the error funciton can be solved by an efficient iterative optimization algorithm based on Newton-Rashon iterative optimization scheme.

For **Logistic Regression:**

take $-H^{-1}g$ back to $\theta\leftarrow \theta - H^{-1}g$, we obtain:

\[\begin{align} \theta&\leftarrow\theta-H^{-1}g \\ &=\theta - (X^{T}SX)^{-1}(X^{T}(\mu-y)) \\ &=(X^{T}SX)^{-1}((X^{T}SX)\theta-(X^{T}(\mu-y))) \\ &= (X^{T}SX)^{-1}X^{T}(SX\theta-(\mu-y)) \\ &= (X^{T}SX)^{-1}X^{T}S(X\theta-S^{-1}(\mu-y) \\ &= (X^{T}SX)^{-1}X^{T}Sz \end{align}\]where $z=(X\theta-S^{-1}(\mu-y)$

Compare with the closed-form in linear regression: $\theta^{\ast}=(X^{T}X)^{-1}X^{T}y$, the step $\theta\leftarrow \theta-H^{-1}g$ can be reformlized as $\theta=(X^{T}SX)^{-1}X^{T}Sz$, where $z=(X\theta-S^{-1}(\mu-y)$. It’s the same form as the closed-form of linear regression.

Therefore, we can see the above update yields the minimizer for the **weighted least squares problem:**

where $S_{ii}$ is the weight of the $i\text{th}$ “training example” consisting of the pair $(x^{(i)},z_{i})$.

The explaination in wiki:

The method of iteratively reweighted least squares (IRLS) is used to solve certain optimization problems with objective functions of the form:

\[\begin{equation} \arg\min_{\beta}sum_{i=1}^{n}(y_{i}-f_{i}(\beta))^{p} \end{equation}\]by an iterative method in which each step involves solving a weighted least squares problem of the form:

\[\begin{equation} \beta^{t+1} = \arg\min_{\beta}\omega_{i}(\beta^{(t)})(y_{i}-f_{i}(\beta))^{2} \end{equation}\]IRLS is used to find the maximum likelihood estimates of a generalized linear model, and in robust regression to find an M-estimator, as a way of mitigating the influence of outliers in an otherwise normally-distributed data set. For example, by minimizing the least absolute error rather than the least square error.