The reference materials are based on CMU 10-705 (2016 and 2017).
In Note (1), we review the basic probability that later notes rely on.
1. Axioms of Probability
 $\mathbb{P}(A)\geq 0$
 $\mathbb{P}(\Omega)=1$
 If $A$ and $B$ are disjoint, then $\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)$
2. Random Variable.
2.1 The Definition of Random Variable
Let $\Omega$ be a sample space (a set of possible outcomes) with a probability distribution (also called a probability measure) $\mathbb{P}$. A random variable is a mapping $X:\Omega \rightarrow \mathbb{R}$. We write
$$\mathbb{P}(X\in A)=\mathbb{P}(\{\omega\in\Omega: X(\omega)\in A\}),$$
and we write $X\sim P$ to mean that $X$ has distribution $P$.
2.5 Cumulative Distribution Function
The cumulative distribution function (cdf) of $X$ is
$$F_{X}(x)=F(x)=\mathbb{P}(X\leq x).$$
Properties of the cdf:

 $F$ is right-continuous. At each point $x$, we have $F(x)=\lim_{n\rightarrow\infty}F(y_{n})$ for any sequence $y_{n}\rightarrow x$ with $y_{n}>x$.

 $F$ is nondecreasing. If $x<y$ then $F(x)\leq F(y)$.

 $F$ is normalized. $\lim_{x\rightarrow -\infty}F(x)=0$ and $\lim_{x\rightarrow \infty}F(x)=1$.
Conversely, any $F$ satisfying these three properties is a cdf for some random variable.
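The three cdf properties can be checked on a small discrete example. The sketch below assumes a fair six-sided die (a hypothetical choice made here for illustration, not taken from the reference notes) and uses exact fractions so that the comparisons are not affected by rounding.

```python
from fractions import Fraction

# Hypothetical example: X is the outcome of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum the pmf over all support points <= x."""
    return sum((p for v, p in pmf.items() if v <= x), Fraction(0))

grid = [-1, 0.5, 1, 1.5, 3, 5.999, 6, 10]
values = [cdf(x) for x in grid]

# Nondecreasing: F never goes down as x increases along the grid.
nondecreasing = all(a <= b for a, b in zip(values, values[1:]))

# Normalized: F is 0 far to the left and 1 far to the right.
left_limit, right_limit = cdf(-10**9), cdf(10**9)

# Right-continuous at the jump point x = 3:
# approaching from the right gives F(3), approaching from the left does not.
right_cont_at_3 = cdf(3 + 1e-9) == cdf(3)
jump_from_left = cdf(3 - 1e-9) < cdf(3)
```

A discrete cdf is a step function, so it jumps at each support point; the jump is approached from the left but attained at the point itself, which is what right-continuity says.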
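If $X$ is discrete, its probability mass function (pmf) is
$$p_{X}(x)=p(x)=\mathbb{P}(X=x).$$
If $X$ is continuous, then its probability density function (pdf) satisfies
$$\mathbb{P}(X\in A)=\int_{A}p_{X}(x)\,dx,$$
and $p_{X}(x)=p(x)=F^{\prime}(x)$. The following are all equivalent:
$$X\sim P,\quad X\sim F,\quad X\sim p.$$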
Definition: Suppose that $X\sim P$ and $Y\sim Q$. We say that $X$ and $Y$ have the same distribution if $\mathbb{P}(X\in A)=\mathbb{P}(Y\in A)$ for all $A$. In that case we say that $X$ and $Y$ are equal in distribution and we write $X\overset{d}{=}Y$.
Lemma 1.1: $X\overset{d}{=}Y$ if and only if $F_{X}(t)=F_{Y}(t)$ for all $t$.
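3. Expectation, Variance and Generating Functions
3.1 Expectation and Its Properties
The mean, or expected value, of $g(X)$ is
$$\mathbb{E}(g(X))=\int g(x)\,dF(x)=\begin{cases}\int_{-\infty}^{\infty}g(x)p(x)\,dx & \text{if } X \text{ is continuous,}\\ \sum_{j}g(x_{j})p(x_{j}) & \text{if } X \text{ is discrete.}\end{cases}$$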
Properties of expected value:

 Linearity of Expectation: $\mathbb{E}(\sum_{j=1}^{k}c_{j}g_{j}(X))=\sum_{j=1}^{k}c_{j}\mathbb{E}(g_{j}(X))$.

 If $X_{1},\ldots,X_{n}$ are independent then $\mathbb{E}\left(\prod_{i=1}^{n}X_{i}\right)=\prod_{i=1}^{n}\mathbb{E}(X_{i})$.

 $\mu$ is often used to denote $\mathbb{E}(X)$.
Roughly, expectation is a linear operator. Intuitively, it is a probability-weighted average of the values that $g(X)$ can take.
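Linearity can be verified exactly on a small discrete distribution. The sketch below again assumes a fair die, and the functions $g_{1}(x)=x$, $g_{2}(x)=x^{2}$ and the coefficients $2$ and $3$ are arbitrary illustrative choices.

```python
from fractions import Fraction

# Hypothetical example: X is the outcome of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g):
    """E[g(X)] as a probability-weighted average over the support."""
    return sum(g(x) * p for x, p in pmf.items())

# Left side: expectation of the linear combination 2*g1(X) + 3*g2(X).
lhs = expect(lambda x: 2 * x + 3 * x**2)

# Right side: the same linear combination of the individual expectations.
rhs = 2 * expect(lambda x: x) + 3 * expect(lambda x: x**2)
```

Both sides evaluate to the same exact fraction, since the sum defining the expectation distributes over the linear combination.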
3.2 Variance and Its Properties
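The variance is defined as $\text{Var}(X)=\mathbb{E}[(X-\mu)^{2}]$, or equivalently $\text{Var}(X)=\mathbb{E}(X^{2})-(\mathbb{E}[X])^{2}$. It is the expected squared distance between $X$ and its mean; intuitively, it describes how spread out the values are around the mean.
If $X_{1},\ldots,X_{n}$ are independent then
$$\text{Var}\left(\sum_{i=1}^{n}a_{i}X_{i}\right)=\sum_{i=1}^{n}a_{i}^{2}\,\text{Var}(X_{i}).$$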
The covariance is
$$\text{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]=\mathbb{E}(XY)-\mu_{X}\mu_{Y},$$
and the correlation is $\rho_{X,Y}=\text{Cov}(X,Y)/(\sigma_{X}\sigma_{Y})$. Recall that $-1\leq\rho_{X,Y}\leq 1$.
Proof: $\text{Cov}(X,Y)=\mathbb{E}(XY)-\mu_{X}\mu_{Y}$.
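One way to fill in this proof, assuming the usual definition $\text{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]$, is to expand the product and apply linearity of expectation:

$$\begin{aligned}
\text{Cov}(X,Y)&=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]\\
&=\mathbb{E}[XY-\mu_{Y}X-\mu_{X}Y+\mu_{X}\mu_{Y}]\\
&=\mathbb{E}(XY)-\mu_{Y}\mu_{X}-\mu_{X}\mu_{Y}+\mu_{X}\mu_{Y}\\
&=\mathbb{E}(XY)-\mu_{X}\mu_{Y}.
\end{aligned}$$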
Proof: $-1\leq \rho_{X,Y} \leq 1$. (Cauchy-Schwarz inequality)
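A sketch of the argument, writing $U=X-\mu_{X}$ and $V=Y-\mu_{Y}$ (notation introduced here for convenience) and assuming both variances are finite and nonzero: the Cauchy-Schwarz inequality gives $(\mathbb{E}[UV])^{2}\leq\mathbb{E}[U^{2}]\,\mathbb{E}[V^{2}]$, so

$$\text{Cov}(X,Y)^{2}=(\mathbb{E}[UV])^{2}\leq\mathbb{E}[U^{2}]\,\mathbb{E}[V^{2}]=\sigma_{X}^{2}\sigma_{Y}^{2}
\quad\Longrightarrow\quad
|\rho_{X,Y}|=\frac{|\text{Cov}(X,Y)|}{\sigma_{X}\sigma_{Y}}\leq 1.$$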
3.3 Conditional Expectation and Variance
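The conditional expectation of $Y$ given $X$ is the random variable $\mathbb{E}(Y\mid X)$ whose value, when $X=x$, is
$$\mathbb{E}(Y\mid X=x)=\int y\,p(y\mid x)\,dy,$$
where $p(y\mid x)=p(x,y)/p(x)$.
The Law of Total Expectation, or Law of Iterated Expectation, is
$$\mathbb{E}(Y)=\mathbb{E}[\mathbb{E}(Y\mid X)].$$
The Law of Total Variance is
$$\text{Var}(Y)=\text{Var}[\mathbb{E}(Y\mid X)]+\mathbb{E}[\text{Var}(Y\mid X)].$$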
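Both laws can be checked numerically on a small joint pmf. The joint distribution below is a hypothetical example chosen only so that the arithmetic is easy to follow; the identities being checked are $\mathbb{E}(Y)=\mathbb{E}[\mathbb{E}(Y\mid X)]$ and $\text{Var}(Y)=\text{Var}[\mathbb{E}(Y\mid X)]+\mathbb{E}[\text{Var}(Y\mid X)]$.

```python
from fractions import Fraction

# Hypothetical joint pmf p(x, y) on a few points (probabilities sum to 1).
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 2): Fraction(1, 4),
}

def marginal_x(x):
    """p(x) = sum over y of p(x, y)."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def cond_moment(x, k):
    """E(Y^k | X = x) = sum_y y^k p(y | x), with p(y | x) = p(x, y) / p(x)."""
    return sum(y**k * p for (xx, y), p in joint.items() if xx == x) / marginal_x(x)

xs = sorted({x for x, _ in joint})

# Unconditional mean and variance of Y, computed directly from the joint pmf.
e_y = sum(y * p for (_, y), p in joint.items())
var_y = sum(y**2 * p for (_, y), p in joint.items()) - e_y**2

# E[E(Y | X)]: average the conditional means over the distribution of X.
tower = sum(cond_moment(x, 1) * marginal_x(x) for x in xs)

# E[Var(Y | X)]: average the conditional variances over the distribution of X.
mean_cond_var = sum(
    (cond_moment(x, 2) - cond_moment(x, 1) ** 2) * marginal_x(x) for x in xs
)

# Var[E(Y | X)]: variance of the conditional mean, viewed as a function of X.
var_cond_mean = sum(cond_moment(x, 1) ** 2 * marginal_x(x) for x in xs) - tower**2
```

Here `tower` matches `e_y`, and `var_cond_mean + mean_cond_var` matches `var_y`, exactly.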
3.4 Moment Generating Function
The moment generating function (mgf) is
$$M_{X}(t)=\mathbb{E}(e^{tX}).$$
If $M_{X}(t)=M_{Y}(t)$ for all $t$ in an interval around 0 then $X\overset{d}{=}Y$.
The moment generating function can be used to "generate" all the moments of a distribution: taking derivatives of the mgf with respect to $t$ and evaluating at $t=0$ gives
$$M_{X}^{(n)}(t)\Big|_{t=0}=\mathbb{E}(X^{n}).$$
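To illustrate how derivatives of the mgf at $t=0$ recover moments, the sketch below approximates $M^{\prime}(0)$ and $M^{\prime\prime}(0)$ by central finite differences and compares them with $\mathbb{E}(X)$ and $\mathbb{E}(X^{2})$ computed directly. The fair-die distribution and the step sizes are assumptions made for illustration; the agreement is only approximate because the derivatives are numerical.

```python
import math

# Hypothetical example: X is the outcome of a fair six-sided die.
support = range(1, 7)

def mgf(t):
    """M_X(t) = E[exp(t X)] for the fair die."""
    return sum(math.exp(t * x) for x in support) / 6

def first_derivative_at_zero(h=1e-5):
    """Central difference approximation of M'(0)."""
    return (mgf(h) - mgf(-h)) / (2 * h)

def second_derivative_at_zero(h=1e-3):
    """Central difference approximation of M''(0)."""
    return (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2

# Moments computed directly from the definition of expectation.
mean = sum(support) / 6
second_moment = sum(x**2 for x in support) / 6
```

Comparing `first_derivative_at_zero()` with `mean` and `second_derivative_at_zero()` with `second_moment` shows agreement up to the finite-difference error.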
4. Independence
(Definition): $X$ and $Y$ are independent if and only if
$$\mathbb{P}(X\in A,\,Y\in B)=\mathbb{P}(X\in A)\,\mathbb{P}(Y\in B)$$
for all $A$ and $B$.
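Theorem 1.2: Let $(X,Y)$ be a bivariate random vector with joint pdf (or pmf) $p_{X,Y}(x,y)$. $X$ and $Y$ are independent iff $p_{X,Y}(x,y)=p_{X}(x)p_{Y}(y)$.
$X_{1},\ldots,X_{n}$ are independent if and only if
$$\mathbb{P}(X_{1}\in A_{1},\ldots,X_{n}\in A_{n})=\prod_{i=1}^{n}\mathbb{P}(X_{i}\in A_{i}).$$
Thus, $p_{X_{1},\ldots,X_{n}}(x_{1},\ldots,x_{n})=\prod_{i=1}^{n}p_{X_{i}}(x_{i})$.
If $X_{1},\ldots,X_{n}$ are independent and identically distributed we say they are iid and we write
$$X_{1},\ldots,X_{n}\sim P.$$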
Independence and conditioning: If $A$ and $B$ are independent events then $P(A\mid B)=P(A)$. Also, for any pair of events $A$ and $B$,
$$P(A\cap B)=P(A\mid B)P(B)=P(B\mid A)P(A).$$
Independence means that knowing $B$ does not change the probability of $A$.
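The statement that conditioning on an independent event leaves the probability unchanged can be checked by enumeration. The sketch below assumes two fair dice and three hypothetical events `A`, `B`, `C` chosen for illustration: `A` and `B` turn out to be independent, while `A` and `C` are not.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform distribution on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond(e1, e2):
    """P(e1 | e2) = P(e1 and e2) / P(e2)."""
    return prob(lambda w: e1(w) and e2(w)) / prob(e2)

A = lambda w: w[0] % 2 == 0       # first die is even
B = lambda w: w[0] + w[1] == 7    # the two dice sum to 7
C = lambda w: w[0] + w[1] >= 10   # the two dice sum to at least 10

p_a = prob(A)
p_a_given_b = cond(A, B)   # equals P(A): knowing B does not change P(A)
p_a_given_c = cond(A, C)   # differs from P(A): A and C are dependent
```

The product rule $P(A\cap B)=P(A\mid B)P(B)$ holds for any pair of events, independent or not; independence is the extra condition that $P(A\mid B)$ reduces to $P(A)$.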
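5. Transformations
Let $Y=g(X)$ where $g:\mathbb{R}\rightarrow \mathbb{R}$. Then
$$F_{Y}(y)=\mathbb{P}(Y\leq y)=\mathbb{P}(g(X)\leq y)=\int_{A_{y}}p_{X}(x)\,dx,$$
where $A_{y}=\{x:g(x)\leq y\}$.
The density is $p_{Y}(y)=F^{\prime}_{Y}(y)$. If $g$ is monotonic, then
$$p_{Y}(y)=p_{X}(h(y))\left|\frac{dh(y)}{dy}\right|,$$
where $h=g^{-1}$.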
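As a worked illustration of the monotone case, take $X\sim\text{Uniform}(0,1)$ and $g(x)=-\log x$ (both chosen here for illustration, not taken from the reference notes). Then $g$ is strictly decreasing on $(0,1)$ with inverse $h(y)=e^{-y}$ for $y>0$, and the standard change-of-variables formula $p_{Y}(y)=p_{X}(h(y))\,|h^{\prime}(y)|$ gives

$$p_{Y}(y)=p_{X}(e^{-y})\left|\frac{d}{dy}e^{-y}\right|=1\cdot e^{-y}=e^{-y},\qquad y>0,$$

so $Y=-\log X$ has the exponential distribution with rate 1. The same answer follows from the cdf route: $F_{Y}(y)=\mathbb{P}(-\log X\leq y)=\mathbb{P}(X\geq e^{-y})=1-e^{-y}$, whose derivative is $e^{-y}$.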
Let $Z=g(X,Y)$. For example, $Z=X+Y$ or $Z=X/Y$. Then we find the pdf of $Z$ as follows:

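 1. For each $z$, find the set $A_{z}=\{(x,y):g(x,y)\leq z\}$.
 2. Find the cdf: $F_{Z}(z)=\mathbb{P}(Z\leq z)=\mathbb{P}(g(X,Y)\leq z)=\iint_{A_{z}}p_{X,Y}(x,y)\,dx\,dy$.
 3. The pdf is $p_{Z}(z)=F^{\prime}_{Z}(z)$.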
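A short worked example of these steps, assuming $X$ and $Y$ are independent $\text{Uniform}(0,1)$ variables and $Z=X+Y$ (an illustrative choice): the joint density is 1 on the unit square, so $F_{Z}(z)$ is the area of $A_{z}=\{(x,y):x+y\leq z\}$ inside the square,

$$F_{Z}(z)=\begin{cases}\dfrac{z^{2}}{2} & 0\leq z\leq 1,\\[2mm] 1-\dfrac{(2-z)^{2}}{2} & 1<z\leq 2,\end{cases}
\qquad
p_{Z}(z)=F_{Z}^{\prime}(z)=\begin{cases}z & 0\leq z\leq 1,\\ 2-z & 1<z\leq 2,\end{cases}$$

with $F_{Z}(z)=0$ for $z<0$ and $F_{Z}(z)=1$ for $z>2$. The resulting density is the triangular density on $[0,2]$.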
6. Important Distributions
6.1 Bernoulli Distribution
$X\sim \text{Bernoulli}(\theta)$ if $\mathbb{P}(X=1)=\theta$ and $\mathbb{P}(X=0)=1-\theta$, and hence
$$p(x)=\theta^{x}(1-\theta)^{1-x},\qquad x\in\{0,1\}.$$
Mean:
$\mu=\mathbb{E}[X]=\theta$.
Variance:
$\text{Var}(X)=\mathbb{E}[(X-\mu)^{2}]=\theta(1-\theta)$.
6.2 Binomial Distribution
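$X\sim \text{Binomial}(n,\theta)$ if
$$p(x)=\mathbb{P}(X=x)=\binom{n}{x}\theta^{x}(1-\theta)^{n-x},\qquad x\in\{0,1,\ldots,n\}.$$
Mean:
$\mathbb{E}[X]=n\theta$.
Variance:
$\text{Var}(X)=n\theta(1-\theta)$. (Both follow by writing $X$ as a sum of $n$ independent Bernoulli($\theta$) indicator variables.)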
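The binomial mean $n\theta$ and variance $n\theta(1-\theta)$ can be confirmed by summing over the standard binomial pmf $\binom{n}{x}\theta^{x}(1-\theta)^{n-x}$. The values $n=10$ and $\theta=0.3$ below are arbitrary illustrative choices.

```python
from math import comb

def binomial_pmf(x, n, theta):
    """P(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

# Hypothetical parameters for illustration.
n, theta = 10, 0.3
probs = [binomial_pmf(x, n, theta) for x in range(n + 1)]

total = sum(probs)                                   # should be 1
mean = sum(x * p for x, p in enumerate(probs))       # should be n * theta
variance = sum((x - mean) ** 2 * p for x, p in enumerate(probs))  # n*theta*(1-theta)
```

Up to floating-point rounding, `mean` agrees with `n * theta` and `variance` with `n * theta * (1 - theta)`.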
6.3 Multinomial Distribution
The multivariate version of a Binomial distribution is called a Multinomial distribution. Consider drawing a ball from an urn that has balls of $k$ different colors labeled "color 1, color 2, ..., color $k$." Let $p=(p_{1},p_{2},\ldots,p_{k})$ where $\sum_{j=1}^{k}p_{j}=1$ and $p_{j}$ is the probability of drawing color $j$. Draw $n$ balls from the urn (independently and with replacement) and let $X=(X_{1},\ldots,X_{k})$ be the count of the number of balls of each color drawn. We say that $X$ has a Multinomial$(n,p)$ distribution. Then,
$$p(x)=\binom{n}{x_{1},\ldots,x_{k}}p_{1}^{x_{1}}\cdots p_{k}^{x_{k}},\qquad\text{where}\quad\binom{n}{x_{1},\ldots,x_{k}}=\frac{n!}{x_{1}!\cdots x_{k}!}.$$
Mean:
$\mathbb{E}[X_{i}]=np_{i}$
Variance:
$\text{Var}(X_{i})=np_{i}(1-p_{i})$
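The multinomial mean and variance of a single count can be checked by enumerating every possible count vector. The sketch below assumes the standard multinomial pmf $\frac{n!}{x_{1}!\cdots x_{k}!}p_{1}^{x_{1}}\cdots p_{k}^{x_{k}}$ and uses hypothetical values $n=4$ and $p=(1/2,1/3,1/6)$ with exact fractions.

```python
from fractions import Fraction
from itertools import product
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X = counts) for X ~ Multinomial(n, probs), where n = sum(counts)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)   # stays an integer at every step
    return coef * prod(p**c for p, c in zip(probs, counts))

# Hypothetical parameters for illustration.
n = 4
p = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))

# All count vectors (x_1, x_2, x_3) with x_1 + x_2 + x_3 = n.
outcomes = [c for c in product(range(n + 1), repeat=len(p)) if sum(c) == n]

total = sum(multinomial_pmf(c, p) for c in outcomes)
mean_1 = sum(c[0] * multinomial_pmf(c, p) for c in outcomes)
var_1 = sum(c[0] ** 2 * multinomial_pmf(c, p) for c in outcomes) - mean_1**2
```

Here `mean_1` equals $np_{1}$ and `var_1` equals $np_{1}(1-p_{1})$, consistent with each count $X_{i}$ being marginally Binomial$(n,p_{i})$.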
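6.4 Chi-squared Distribution
$X\sim \chi^{2}_{n}$ if $X=\sum_{j=1}^{n}Z_{j}^{2}$ where $Z_{1},\ldots,Z_{n}\sim N(0,1)$ are independent. The pdf of $X$ is
$$p(x)=\frac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{n/2-1}e^{-x/2},\qquad x>0.$$
The mean is $\mu=n$ and the variance is $\text{Var}(X)=2n$, where $n$ is the number of degrees of freedom.
The cdf of $X$ is
$$F(x)=\frac{\gamma(n/2,\,x/2)}{\Gamma(n/2)},$$
where $\gamma(s,t)=\int_{0}^{t}u^{s-1}e^{-u}\,du$ is the lower incomplete gamma function.
Non-central chi-squared (more on this below): $X\sim \chi_{1}^{2}(\mu^{2})$ if $X=Z^{2}$ where $Z\sim N(\mu,1)$.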
6.5 Gamma Distribution
$X\sim \Gamma(\alpha, \beta)$ if
$$p_{X}(x)=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}$$
for $x>0$, where $\Gamma(\alpha)=\int_{0}^{\infty}\frac{1}{\beta^{\alpha}}x^{\alpha-1}e^{-x/\beta}\,dx$.
6.6 Gaussian Distribution(Normal Distribution)
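$X\sim N(\mu,\sigma^{2})$ if
$$p(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right).$$
If $X\in \mathbb{R}^{d}$ then $X\sim N(\mu,\Sigma)$ if
$$p(x)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right),$$
where $\mathbb{E}[X]=\mu$ and $\text{cov}[X]=\Sigma$. The moment generating function is
$$M(t)=\exp\left(\mu^{T}t+\frac{t^{T}\Sigma t}{2}\right).$$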
Theorem (a). If $Y\sim N(\mu,\Sigma)$, then $\mathbb{E}[Y]=\mu,\text{cov}(Y)=\Sigma$.
(b). If $Y\sim N(\mu,\Sigma)$ and $c$ is a scalar, then $cY\sim N(c\mu,c^{2}\Sigma)$.
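Theorem Suppose that $Y\sim N(\mu,\Sigma)$. Let
$$Y=\begin{pmatrix}Y_{1}\\ Y_{2}\end{pmatrix},\qquad \mu=\begin{pmatrix}\mu_{1}\\ \mu_{2}\end{pmatrix},\qquad \Sigma=\begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix},$$
where $Y_{1}$ and $\mu_{1}$ are $p\times 1$, and $\Sigma_{11}$ is $p\times p$.
(a). $Y_{1}\sim N_{p}(\mu_{1},\Sigma_{11})$, $Y_{2}\sim N_{n-p}(\mu_{2},\Sigma_{22})$.
(b). $Y_{1}$ and $Y_{2}$ are independent if and only if $\Sigma_{12}=0$.
(c). If $\Sigma_{22}> 0$, then the conditional distribution of $Y_{1}$ given $Y_{2}$ is
$$Y_{1}\mid Y_{2}\sim N_{p}\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}(Y_{2}-\mu_{2}),\;\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$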
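Lemma: Let $Y\sim N(\mu,\sigma^{2}I)$, where $Y^{T}=(Y_{1},\ldots,Y_{n})$, $\mu^{T}=(\mu_{1},\ldots,\mu_{n})$ and $\sigma^{2}>0$ is a scalar. Then the $Y_{i}$ are independent, $Y_{i}\sim N_{1}(\mu_{i},\sigma^{2})$ and
$$\frac{\|Y\|^{2}}{\sigma^{2}}=\frac{Y^{T}Y}{\sigma^{2}}\sim \chi_{n}^{2}\left(\frac{\mu^{T}\mu}{\sigma^{2}}\right).$$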
Theorem Let $Y\sim N(\mu,\Sigma)$. Then:
(a). $Y^{T}\Sigma^{-1}Y\sim \chi_{n}^{2}(\mu^{T}\Sigma^{-1}\mu)$.
(b). $(Y-\mu)^{T}\Sigma^{-1}(Y-\mu)\sim \chi_{n}^{2}(0)$.
7. Sample Mean and Variance
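Let $X_{1},\ldots,X_{n}\sim P$. The sample mean is
$$\hat{\mu}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i},$$
and the sample variance is
$$\hat{\sigma}_{n}^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\hat{\mu}_{n})^{2}.$$
The sampling distribution of $\hat{\mu}_{n}$ is
$$G_{n}(t)=\mathbb{P}(\hat{\mu}_{n}\leq t).$$
Practice Problem. Let $X_{1},\ldots,X_{n}$ be iid with $\mu=\mathbb{E}(X_{i})$ and $\sigma^{2}=\text{Var}(X_{i})$. Then
$$\mathbb{E}(\hat{\mu}_{n})=\mu,\qquad \text{Var}(\hat{\mu}_{n})=\frac{\sigma^{2}}{n},\qquad \mathbb{E}(\hat{\sigma}_{n}^{2})=\sigma^{2}.$$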
Theorem If $X_{1},…,X_{n}\sim N(\mu,\sigma^{2})$ then
(a). $\hat{\mu}_{n}\sim N(\mu,\frac{\sigma^{2}}{n})$.
(b). $\frac{(n-1)\hat{\sigma}^{2}_{n}}{\sigma^{2}}\sim \chi^{2}_{n-1}$.
(c). $\hat{\mu}_{n}$ and $\hat{\sigma}_{n}^{2}$ are independent.
Proof: $\mathbb{E}[\hat{\mu}_{n}]=\mu$.
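A sketch of this proof, assuming the sample mean is defined as $\hat{\mu}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$ and using only linearity of expectation:

$$\mathbb{E}[\hat{\mu}_{n}]=\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}X_{i}\right]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_{i}]=\frac{1}{n}\cdot n\mu=\mu.$$

Independence is not needed for this step; it is needed for $\text{Var}(\hat{\mu}_{n})=\frac{1}{n^{2}}\sum_{i=1}^{n}\text{Var}(X_{i})=\frac{\sigma^{2}}{n}$.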
Proof: $\mathbb{E}[\hat{\sigma}_{n}^{2}]=\sigma^{2}$.
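A sketch of this proof, assuming the sample variance uses the $\frac{1}{n-1}$ normalization, $\hat{\sigma}_{n}^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\hat{\mu}_{n})^{2}$, and that the $X_{i}$ are iid with mean $\mu$ and variance $\sigma^{2}$. First expand the sum of squares, then use $\mathbb{E}[X_{i}^{2}]=\sigma^{2}+\mu^{2}$ and $\mathbb{E}[\hat{\mu}_{n}^{2}]=\text{Var}(\hat{\mu}_{n})+\mu^{2}=\frac{\sigma^{2}}{n}+\mu^{2}$:

$$\begin{aligned}
\sum_{i=1}^{n}(X_{i}-\hat{\mu}_{n})^{2}&=\sum_{i=1}^{n}X_{i}^{2}-n\hat{\mu}_{n}^{2},\\
\mathbb{E}\left[\sum_{i=1}^{n}(X_{i}-\hat{\mu}_{n})^{2}\right]&=n(\sigma^{2}+\mu^{2})-n\left(\frac{\sigma^{2}}{n}+\mu^{2}\right)=(n-1)\sigma^{2}.
\end{aligned}$$

Dividing by $n-1$ gives $\mathbb{E}[\hat{\sigma}_{n}^{2}]=\sigma^{2}$, which is why the normalization is $n-1$ rather than $n$.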
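8. Bayes' Theorem
Let $A_{1},\ldots,A_{k}$ be a partition of $\Omega$ such that $P(A_{i})>0$ for all $i$. If $P(B)>0$, then for each $i=1,\ldots,k$:
$$P(A_{i}\mid B)=\frac{P(B\mid A_{i})P(A_{i})}{\sum_{j=1}^{k}P(B\mid A_{j})P(A_{j})}.$$
Total Probability: Let $A_{1},\ldots,A_{k}$ be a partition of $\Omega$. Then, for any event $B$,
$$P(B)=\sum_{j=1}^{k}P(B\mid A_{j})P(A_{j}).$$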
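Bayes' theorem and the law of total probability can be combined in a few lines. The sketch below uses a two-set partition with made-up numbers (a 1% prior for `A1` and likelihoods of 95% and 5%); these values are hypothetical and serve only to illustrate the computation.

```python
from fractions import Fraction

# Hypothetical partition {A1, A2} of the sample space with prior probabilities P(A_i).
prior = {"A1": Fraction(1, 100), "A2": Fraction(99, 100)}

# Hypothetical likelihoods P(B | A_i) of observing event B under each A_i.
likelihood = {"A1": Fraction(95, 100), "A2": Fraction(5, 100)}

# Law of total probability: P(B) = sum_j P(B | A_j) P(A_j).
p_b = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' theorem: P(A_i | B) = P(B | A_i) P(A_i) / P(B).
posterior = {a: likelihood[a] * prior[a] / p_b for a in prior}
```

With these numbers the posterior probability of `A1` is still small even though $P(B\mid A_{1})$ is large, because the prior $P(A_{1})$ is small and the denominator is dominated by the `A2` term.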