Linear regression is an old topic in the machine learning community, and this topic has been studied by researchers for the past decades. In this post, I will highlight some kerypoints on regression models. Specifically, I will begin with the univariate regression model, and consider it as the basic block to build the multiple regression model.
1. The univariate regression
Suppose that we have a dataset $\mathcal{D}=(x_i, y_i)_{i=1}^n$ with $n$ samples, where observation $\mathbf{y}$ is a $n$ dimensional vector, i.e. $\mathbf{y}=(y_1, y_2, \cdots, y_n)\in \mathbb{R}^n$, and the measurement $\mathbf{x}$ is also a ndimension vector, i.e. $\mathbf{x}=(x_1, x_2, \cdots, x_n)\in \mathbb{R}^n$. We additionally assume that obseravation and measurement can be modeled as
where $\beta^{\ast}\in \mathbb{R}$ is the groundtruth coefficient, which is unknown, and $\mathbf{\epsilon}=(\epsilon_{1}, \epsilon_{2},\cdots, \epsilon_{n})\in\mathbb{R}^{n}$ is the noise term, and $\mathbb{E}[\epsilon_{i}]=0, \text{Var}(\epsilon_{i})=\sigma^{2}, \forall i$, $\text{Cov}(x_{i}, x_{j})=0,\text{for } i\neq j$.
Our goal is to estimate the coefficient in the underlying model, and we commonly use the least mean square estimator(LMSE). We formulate it as follows:
To obtain the optimal $\hat{\beta}$, we take the derivative of Eq.(2), and set the firstorder derivative to zero. The univariate linear regression coefficient of $\mathbf{y}$ on $\mathbf{x}$ is
Next, we consider the incepter in the underlying linear model. Eq.(1) is reformulated as follows
Correspondingly, we alternative optimize the following problem, i.e.
by solving the above problem, and we can obtain
note that
Discussion: there is a question about when we consider the incepter term in the linear model. Without the incepter term, we assum that the underlying linear model pass through the origin. In most applications, we do not know any prior knowlage on the underlying linear model. It is applicable to consider the incepter term when we build the linear model.
2. The Multivariate Regression
In the univariate regression, the dimension $p$ of data we cconsider is just 1. In this session, we consider high dimensional data, i.e. $p>1$, and use the univariate regression model as the basic block to further develop the multivariate regression model.
With the dataset $\mathcal{D}=(x_i, y_i)_{i=1}^n$ with $n$ samples, each sample are $p$dimensional. $\mathbf{X}=[\mathbf{1}, \mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n] \in \mathbb{R}^{n \times (p+1)}$ (For an intercept term, we can just append a column $\mathbf{1}\in\mathbb{R}^n$ of all 1s to the matrix X) and $\mathbf{Y}=[y_1,y_2,\cdots,y_n]\in\mathbb{R}^{n\times 1}$.
Additionally, the columns of $\mathbf{X}$ are linear independent, and rank($\mathbf{X}$)=p (assume $n» p$).
The underlying linear model is
where $\mathbf{\beta}^{\ast}=(\beta_{0},\beta_{1},\cdots, \beta_{p})\in\mathbb{R}^{(p+1)\times 1}$ is the groundtruth coefficient. The error term $\mathbf{\epsilon}=(\epsilon_{1},\cdots, \epsilon_{n})\in\mathbb{R}^{n}$ are as before (i.e. satisfying $\mathbb{E}[\mathbf{\epsilon}]=0$ and $Cov(\mathbf{\epsilon})=\sigma^{2}I$).
As before, we aims to estimate $\mathbf{\beta}^{\ast}$ by applying LMSE, and solve the the following optimization problem,
there are two ways to estimate $\mathbf{\beta}$, i.e. iterative undated schedule based on gradeint information, and the analitical form.
2.2 Learning or estimation method
Learning as optimization let $L(\mathbf{\beta}) = \frac{1}{n}\Arrowvert \mathbf{Y}\mathbf{X}\mathbf{\beta}\Arrowvert^{2}$
 update rule:
 The gradient is
** Analitical form (normal equaiton)**
Take the derivative and set the firstorder derivative to 0, and we obtain
the predicted value are

 In the analitical form, to solve the LMSE problem requires the inverse of $\mathbf{X}^{T}\mathbf{X}$. To be noted that we assume that the columns of $\mathbf{X}$ are linear independent, and thus the matrix $\mathbf{X}^{T}\mathbf{X}$ is a fullrank matrix. The inverse exist. Moreover, the secondorder derivative $\frac{\partial^{2}L(\mathbf{\beta})}{\partial \mathbf{\beta}\partial \mathbf{\beta}^{T}}=\mathbf{X}^{T}\mathbf{X}$ is posivedefinite, and the solution $\hat{\mathbf{\beta}}$ is unique.

 Both Iterative updated schedule and analitcal can obtain the solution. How to choose the learning schedule in real applications? Well, in the analitical form, it requires the inverse of matrix. The complexity of obtaining the inverse matrix is $O(p^{3})$ in general. When $p$ is very large, e.g. highdimensional data, analytical form solution is much more expensive. In fact, it can be improved by some techniques, e.g. QR decompositon, etc., but this content beyond the scope of this post, and is not discussed.
3. More perspectives on linear regression
Geometry interpretion: let $\mathbf{H}=\mathbf{X}(\mathbf{X}\mathbf{X}^T)^{1}\mathbf{X}$, the linear regression fit $\hat{y} \in \mathbb{R}^n$ is exactly projection of $\mathbf{y}\in\mathbb{R}^{n}$ onto the linear subspace $span(\mathbf{x}_1,\cdots,\mathbf{x}_n)=col(\mathbf{X})\subset \mathbb{R}^n$
Let $L\subset \mathbb{R}^{n}$ be a linear subspace, i.e. $L=span(\nu_1,\cdots, \nu_k)$ for some $\nu_1,\cdots,\nu_k \in \mathbb{R}^n$. If $V\in\mathbb{R}^{n\times k}$ contains $\nu_{1},\cdots,\nu_{k}$ on its columns, then
The function $F:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ that projects points onto $L$ is called the projection map onto $L$. This map is a linear operator, i.e. $F(x)=P_{L}x$, where $P_{L}\in\mathbb{R}^{n\times n}$ is the projection matrix onto $L$.
The matrix $P_{L}$ is symmetric: $P_{L}^{T}=P_{L}$, and idempotent: $P_{L}^{2}=P_{L}$. Furthermore, we have
 $P_{L}x = x$ for all $x\in L$, and
 $P_{L}x = 0$ for all $x\perp L$.
For any subspace $L\subseteq \mathbb{R}^{n}$, its orthogonal complement is $L^{\perp}={\mathbf{x}\in\mathbb{R}^{n}:\mathbf{x}\perp L}={\mathbf{x}\in\mathbb{R}^{n}:\mathbf{x}\perp\mathbf{\nu} for any \mathbf{\nu}\in L}$.
Fact: $P_{L}+P_{L^{\perp}}=I$, so that $P_{L^{T}}=IP_{L}$.
for linear regression of $\mathbf{y}$ on $\mathbf{X}$, the residual vector is
so the residual $\mathbf{r}$ is orthongonal to any $\mathbf{\nu}\in col(\mathbf{X})$. In particular, the residual $\mathbf{r}$ is orthogonal to each of $\mathbf{x_{1}},\cdots,\mathbf{x_{n}}$.
Moreover, the projection map $P_{L}$ onto any linear subspace $L\subseteq\mathbb{R}^{n}$ is always nonexpensive, that is, for any points $\mathbf{x,z}\in\mathbb{R}^{n}$, we have $\Arrowvert P_{L}xP_{L}z\Arrowvert_{2}\leq \Arrowvert xz\Arrowvert_{2}$.
Therefore, if $\mathbf{y}_1,\mathbf{y}_2 \in \mathbb{R}^n$, and $\hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2 \in \mathbb{R}^n$ are regression fit, then
Q: If the columns of $\mathbf{X}$ are not linear independent, $\mathbf{X}$ is not full rank. This implies that $\mathbf{X}^{T}\mathbf{X}$ is singular and the least squares coefficients $\hat{\mathbf{\beta}}$ are not uniquely defined.
Q: Is $\hat{\mathbf{y}}$ still the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$? Yes, but just more than one way to express that projection in terms of the column vectors of $\mathbf{X}$.
Mark: The nonfullrank case accurs in most applications, because the input vector might be encoded by a redudant fashion. In signal and image analysis, rank deficiencies usually accurs in the case that the dimension of data $p$ excess the number of data $n$.
How to solve this issue?
 the features are typically reduced by filtering.
 filtering is controlled by regularization.
4. Connection between the univariate regression and multivariate regression
4.1 Univariate regression review
We hold the orthogonality assumption in the above assumption. In the rest part, we go further to explain why we need this.
In the univariate linear regression, let’s consider the simplest the coefficients of $\mathbf{y} \in \mathbb{R}^n$ on a single predictor $\mathbf{x} \in \mathbb{R}^n$ as
Given $p$ predictor variable $\mathbf{x}_1, \cdots, \mathbf{x}_p \in \mathbb{R}^n$, the univariate linear regression coefficients of $\mathbf{y}$ on $\mathbf{x}_i$ is
Note: if $\mathbf{x}_1,\cdots,\mathbf{x}_p$ are orthogonal, we can estimate $\beta_j$ in the multivariate regression in terms of $\mathbf{y}$ on $\mathbf{x}_j$ only.
Next, we consider the intercept term, the coefficient of $\mathbb{x}$ can obtained by implementing two steps:

 Regress $\mathbf{x}$ on $\mathbf{1}$, yielding the coefficient
and the residual $\mathbf{z}=\mathbf{x}\mathbf{\hat{x}}\mathbf{1}\in\mathbb{R}^{n}$

 regress $\mathbf{y}$ on $\mathbb{z}$, yielding the coefficient
4.2 Multivariate regression by orthogonalization
We can extend this idea to multivariate linear regression of $\mathbf{y}\in\mathbb{R}^n$ on $\mathbf{X}=[\mathbf{x}_1,\cdots,\mathbf{x}_p]\in\mathbb{R}^{N\times p}$. Consider the pstep procedure:
 Let $\mathbf{z}_1=\mathbf{x}_1$.
 (Normalization): For $j=2,\cdots,p$, regress $\mathbf{x}j \text{ onto } \mathbf{z}_1, \cdots, \mathbf{z}{j1}$,
to get coefficients $\hat{\gamma}_{jk} = \frac{\langle\mathbf{z}_k, \mathbf{x}_j \rangle}{\Arrowvert\mathbf{x}_j \Arrowvert_2^2}$ for $k=1,\cdots,j1$, and residual vector
 Regress $\mathbf{y}$ on $\mathbf{z}_p$ to get the coefficient $\hat{\beta}_p$
Claim: the output $\hat{\beta}$ of this algorithm is exactly the coefficient in the multivariate linear regression of $\mathbf{y}$ on $\mathbf{x}_{1},\cdots,\mathbf{x}_p$.
4.3. Correlated data and variance inflation
Suppose $\mathbf{x}_1,\mathbf{x}_2,\cdots, \mathbf{x}_n$ are correlated, this make the predicted coefficient $\hat{\mathbf{\beta}}_j=\frac{<\mathbf{z}_j, \mathbf{y}>}{\Arrowvert \mathbf{z}_j \Arrowvert_2^2}$ unstable, as the denominator is very small.
We can explicit compute the variance of the $j$th multiple regression:
we can see that the correlated variables inflates the variance of multiple regression coefficients. We can explain it based on the Zscore for the $j$th multiple regression:
so if $\mathbf{x}_{j}$ is highly correlated, its regression coefficients will likely be not significant,(because Zscore is small).
4.4 Shortcomings of regression

 Predicitve ability: the linear regression model do not predict well, especially when $p$ is large.

 Interpretative ability: in some case, we need to select a smaller subset that have strongest effects on the output.
5. More perspectives on linear regression
5.1. The GaussMarkov theorem
The Gaussmarkov theorem states that ordinary least square estimation has the smallest mean squared error of all linear estimations with no bias. (To be noted: smallest MSE within the class of linear unbiased setimator.)
From the persepctive of biasedvariance decomposition, consider the mean squared error of an estimator $\hat{\mathbf{\beta}}$ in estimating $\beta$:
Note: However, there may exist a biased estamation which can reduce the variance by a great margin with a little payoff of increasing MSE, because such estimators are biased. Specifically, these estimators shrink or set some coefficients in some dimensions to zero.
To investigate the sampling propoerties of $\hat{\beta}$, we assume that the observations $y_{i}$ are uncorrelated and have constant variance $\sigma^{2}$, and the $\mathbf{x}_{i}$ are fixed.
The variancecovariance matrix of the least squrares parameter estimation is given by
where the variance $\sigma^{2}$ is estimated by
This is the unbiased estimation for variance $\sigma^{2}$, i.e. $\mathbb{E}[\sigma^{2}]=\sigma^{2}$.
Proof:
Now, we have $Var(\mathbf{X}^{T}\mathbf{X})^{1}\mathbf{X}^{T}\mathbf{\epsilon}) = \mathbb{E}^{2}[(\mathbf{X}^{T}\mathbf{X})^{1}\mathbf{X}^{T}\mathbf{\epsilon}](\mathbb{E}[(\mathbf{X}^{T}\mathbf{X})^{1}\mathbf{X}^{T}\mathbf{\epsilon}])^{2}$
Since $\mathbb{E}[(\mathbf{X}^{T}\mathbf{X})^{1}\mathbf{X}^{T}\mathbf{\epsilon}]=0$, then
5.2 Prediction error and mean squre error
Suppose that we use $\hat{f}$ to predict $f$, we would like to predict $y_{0}$ via $\hat{f}(x_{0})$.
Question: how does the prediction error(PE) relate to the mean square error (MSE)?
since $\mathbb{E}[(y_{0}f(x_{0}))]=0$, and $MSE(\hat{f}(x_{0}))=[Bias(\hat{f}(x_{0}))]^{2}+Var(\hat{f}(x_{0}))$, we look at the PE and MSE across all the input $\mathbf{x}_1, \ldots,\mathbf{x}_n$., and we have
PE on all input data is
if $\hat{ƒ}(x_{i})=x_{i}^{T}\beta$, then
how to derive the last term?
For linear regression, its prediction error is just $\sigma^{2}+\frac{p\sigma^{2}}{n}$. The second term is the additive variance, i.e. $\frac{1}{n}\sum_{i=1}^{n}Var(x_{i}^{T}\hat{\beta})$. This means that each additional predictor variable will add the same amount of variance $\sigma^{2}/n$, regardless of whether its true coefficient is large or small.
Therefore, we can shrink some coefficients to zero by introducing bias a little so as to reduce a larger variance.