Note(1)--Basic Review of Basic Probability

Posted by Gwan Siu on January 30, 2018

The reference materials are based on CMU 10-705 (2016 and 2017).

In Note(1), we review the basic probability that the later notes rely on.

1. Axioms of Probability

  • $\mathbb{P}(A)\geq 0$
  • $\mathbb{P}(\Omega)=1$
  • If $A$ and $B$ are disjoint, then $\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(B)$

2. Random Variable.

2.1 The Definition of Random Variable

Let $\Omega$ be a sample space (a set of possible outcomes) with a probability distribution (also called a probability measure) $\mathbb{P}$. A random variable is a mapping $X:\Omega \rightarrow \mathbb{R}$. We write:

$$\mathbb{P}(X\in A)=\mathbb{P}(\{\omega\in\Omega: X(\omega)\in A\})$$

and we write $X\sim P$ to mean that $X$ has distribution $P$.

2.2 Cumulative Distribution Function

The cumulative distribution function (cdf) of $X$ is

$$F_{X}(x)=F(x)=\mathbb{P}(X\leq x).$$

Properties of the cdf:

    1. $F$ is right-continuous. At each point $x$, we have $\lim_{n\rightarrow\infty}F(y_{n})=F(x)$ for any sequence $y_{n}\rightarrow x$ with $y_{n}>x$.
    1. $F$ is non-decreasing. If $x<y$ then $F(x)\leq F(y)$.
    1. $F$ is normalized. $\lim_{x\rightarrow -\infty}F(x)=0$ and $\lim_{x\rightarrow \infty}F(x)=1$.

Conversely, any $F$ satisfying these three properties is a cdf for some random variable.
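One way to see the converse in action is inverse transform sampling: feed uniform draws through the inverse of $F$ and the result has cdf $F$. The sketch below is only an illustration and assumes an exponential cdf with rate 1; the function names and the seed are my own choices, not part of the course notes.

```python
import math
import random

def exponential_cdf(x, rate=1.0):
    """F(x) = 1 - exp(-rate * x) for x >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def exponential_quantile(u, rate=1.0):
    """Inverse of the cdf above: the x with F(x) = u, for 0 <= u < 1."""
    return -math.log(1.0 - u) / rate

def empirical_cdf(data, x):
    """Fraction of the sample that is <= x."""
    return sum(1 for d in data if d <= x) / len(data)

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [exponential_quantile(rng.random()) for _ in range(100_000)]

# The empirical cdf of the transformed uniforms should track F closely.
for x in (0.5, 1.0, 2.0):
    print(x, round(empirical_cdf(samples, x), 3), round(exponential_cdf(x), 3))
```

The two printed columns should agree to roughly two decimal places; the match improves as the sample size grows.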

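If $X$ is discrete, its probability mass function (pmf) is:

$$p_{X}(x)=p(x)=\mathbb{P}(X=x).$$

If $X$ is continuous, then its probability density function (pdf) satisfies:

$$\mathbb{P}(X\in A)=\int_{A}p_{X}(x)\,dx=\int_{A}p(x)\,dx$$

and $p_{X}(x)=p(x)=F'(x)$. The following are all equivalent:

$$X\sim P,\quad X\sim F,\quad X\sim p.$$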

Definition: Suppose that $X\sim P$ and $Y\sim Q$. We say that $X$ and $Y$ have the same distribution if $\mathbb{P}(X\in A)=\mathbb{P}(Y\in A)$ for all $A$. In that case we say that $X$ and $Y$ are equal in distribution and we write $X\overset{d}{=}Y$.

Lemma 1.1: $X\overset{d}{=}Y$ if and only if $F_{X}(t)=F_{Y}(t)$ for all $t$.

3. Expectation, Variance and Generating Function

3.1 Expectation and Its Properties.

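The mean or expected value of $g(X)$ is

$$\mathbb{E}(g(X))=\int g(x)\,dF(x)=\int g(x)\,dP(x)=\begin{cases}\int_{-\infty}^{\infty}g(x)p(x)\,dx & \text{if } X \text{ is continuous}\\ \sum_{j}g(x_{j})p(x_{j}) & \text{if } X \text{ is discrete.}\end{cases}$$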

Properties of expected value:

    1. Linearity of Expectation: $\mathbb{E}(\sum_{j=1}^{k}c_{j}g_{j}(X))=\sum_{j=1}^{k}c_{j}\mathbb{E}(g_{j}(X))$.
    1. If $X_{1},…,X_{n}$ are independent then $\mathbb{E}\left(\prod_{i=1}^{n}X_{i}\right)=\prod_{i=1}^{n}\mathbb{E}(X_{i})$.
    1. $\mu$ is often used to denote $\mathbb{E}(X)$.

Roughly speaking, expectation is a linear operator. Intuitively, it is a weighted average of the values of $g(X)$, with the probabilities as weights.
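The weighted-average view and the linearity property can be checked exactly on a small discrete example. The sketch below assumes a fair six-sided die as the distribution (an illustrative choice) and uses exact fractions so no rounding is involved; `expectation` is a hypothetical helper name.

```python
from fractions import Fraction

# pmf of a fair six-sided die (illustrative assumption)
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g, pmf):
    """E[g(X)] = sum_x g(x) p(x) for a discrete X: a probability-weighted average."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x, pmf)
second_moment = expectation(lambda x: x * x, pmf)

# Linearity: E[2X + 3X^2] should equal 2 E[X] + 3 E[X^2].
lhs = expectation(lambda x: 2 * x + 3 * x * x, pmf)
rhs = 2 * mean + 3 * second_moment

print(mean, second_moment, lhs == rhs)
```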

3.2 Variance and Its Properties

The definition of variance is $\text{Var}(X)=\mathbb{E}[(X-\mu)^{2}]$, or equivalently $\text{Var}(X)=\mathbb{E}(X^{2})-(\mathbb{E}[X])^{2}$. It is the expected squared distance between $X$ and its mean. Intuitively, it describes how widely the values are spread around the mean.

If $X_{1},…,X_{n}$ are independent then

$$\text{Var}\left(\sum_{i=1}^{n}a_{i}X_{i}\right)=\sum_{i=1}^{n}a_{i}^{2}\text{Var}(X_{i}).$$

The covariance is

$$\text{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]=\mathbb{E}(XY)-\mu_{X}\mu_{Y}$$

and the correlation is $\rho_{X,Y}=\text{Cov}(X,Y)/(\sigma_{X}\sigma_{Y})$. Recall that $-1\leq\rho_{X,Y}\leq 1$.

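Proof of $\text{Cov}(X,Y)=\mathbb{E}(XY)-\mu_{X}\mu_{Y}$: expanding the product and using linearity of expectation,

$$\text{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]=\mathbb{E}(XY)-\mu_{X}\mathbb{E}(Y)-\mu_{Y}\mathbb{E}(X)+\mu_{X}\mu_{Y}=\mathbb{E}(XY)-\mu_{X}\mu_{Y}.$$

Proof of $-1\leq \rho_{X,Y} \leq 1$: by the Cauchy-Schwarz inequality,

$$\text{Cov}(X,Y)^{2}=\left(\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]\right)^{2}\leq \mathbb{E}[(X-\mu_{X})^{2}]\,\mathbb{E}[(Y-\mu_{Y})^{2}]=\sigma_{X}^{2}\sigma_{Y}^{2},$$

so $|\rho_{X,Y}|\leq 1$.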
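As a sanity check on the two covariance formulas and the bound on the correlation, here is a minimal sketch on a made-up joint pmf for two binary variables. The probabilities are assumptions chosen only so the arithmetic is easy to follow.

```python
import math
from fractions import Fraction as F

# Hypothetical joint pmf p(x, y) on {0, 1} x {0, 1}; entries sum to 1.
joint = {(0, 0): F(2, 5), (0, 1): F(1, 10), (1, 0): F(1, 10), (1, 1): F(2, 5)}

def expect(g, joint):
    """E[g(X, Y)] = sum over (x, y) of g(x, y) p(x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x, joint)
mu_y = expect(lambda x, y: y, joint)

# Covariance by definition and by the shortcut E(XY) - mu_X mu_Y.
cov_def = expect(lambda x, y: (x - mu_x) * (y - mu_y), joint)
cov_short = expect(lambda x, y: x * y, joint) - mu_x * mu_y

var_x = expect(lambda x, y: (x - mu_x) ** 2, joint)
var_y = expect(lambda x, y: (y - mu_y) ** 2, joint)
rho = float(cov_def) / math.sqrt(var_x * var_y)

print(cov_def, cov_short, rho)
```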

3.3 Conditional Expectation and Variance

The conditional expectation of $Y$ given $X$ is the random variable $\mathbb{E}(Y\mid X)$ whose value, when $X=x$, is

$$\mathbb{E}(Y\mid X=x)=\int y\,p(y\mid x)\,dy$$

where $p(y\mid x)=p(x,y)/p(x)$.

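The Law of Total Expectation or Law of Iterated Expectation:

$$\mathbb{E}(Y)=\mathbb{E}[\mathbb{E}(Y\mid X)]=\int \mathbb{E}(Y\mid X=x)\,p_{X}(x)\,dx.$$

The Law of Total Variance is

$$\text{Var}(Y)=\text{Var}[\mathbb{E}(Y\mid X)]+\mathbb{E}[\text{Var}(Y\mid X)].$$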
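Both laws can be verified exactly on a small discrete example by computing each side from a joint pmf. The sketch below assumes a made-up joint pmf; the helper names are hypothetical and the sums play the role of the integrals.

```python
from fractions import Fraction as F

# Hypothetical joint pmf p(x, y); entries sum to 1.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 1): F(1, 6), (1, 2): F(1, 3)}

def marginal_x(joint):
    """p(x) = sum_y p(x, y)."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0) + p
    return px

def cond_moment(joint, px, x, k):
    """E[Y^k | X = x] = sum_y y^k p(x, y) / p(x)."""
    return sum(y**k * p for (xx, y), p in joint.items() if xx == x) / px[x]

px = marginal_x(joint)
cond_mean = {x: cond_moment(joint, px, x, 1) for x in px}
cond_var = {x: cond_moment(joint, px, x, 2) - cond_mean[x] ** 2 for x in px}

# Left-hand sides, computed directly from the joint pmf.
e_y = sum(y * p for (_, y), p in joint.items())
var_y = sum(y * y * p for (_, y), p in joint.items()) - e_y**2

# Right-hand sides, computed through the conditional quantities.
tower = sum(px[x] * cond_mean[x] for x in px)
var_of_cond_mean = sum(px[x] * cond_mean[x] ** 2 for x in px) - tower**2
mean_of_cond_var = sum(px[x] * cond_var[x] for x in px)

print(e_y == tower, var_y == var_of_cond_mean + mean_of_cond_var)
```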

3.4 Moment Generating Function

The moment generating function (mgf) is

$$M_{X}(t)=\mathbb{E}(e^{tX}).$$

If $M_{X}(t)=M_{Y}(t)$ for all $t$ in an interval around 0 then $X\overset{d}{=}Y$.

The moment generating function can be used to “generate” all the moments of a distribution: taking derivatives of the mgf with respect to $t$ and evaluating at $t=0$ gives

$$M_{X}^{(n)}(t)\Big|_{t=0}=\mathbb{E}(X^{n}).$$
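To illustrate how derivatives of the mgf at $t=0$ recover moments, the sketch below differentiates numerically with central finite differences. It assumes the well-known mgf of $N(\mu,\sigma^{2})$, namely $\exp(\mu t+\sigma^{2}t^{2}/2)$, with $\mu=1$ and $\sigma=2$ picked arbitrarily; the step size `h` is also an assumption and only gives approximate values.

```python
import math

def normal_mgf(t, mu=1.0, sigma=2.0):
    """mgf of N(mu, sigma^2): exp(mu * t + sigma^2 * t^2 / 2)."""
    return math.exp(mu * t + 0.5 * sigma**2 * t**2)

def first_two_moments(mgf, h=1e-3):
    """Approximate M'(0) and M''(0) with central finite differences."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)
    m2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / (h * h)
    return m1, m2

m1, m2 = first_two_moments(normal_mgf)

# For N(1, 4): E[X] = mu = 1 and E[X^2] = mu^2 + sigma^2 = 5 (up to finite-difference error).
print(m1, m2)
```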

4. Independence

(Definition): $X$ and $Y$ are independent if and only if

$$\mathbb{P}(X\in A,Y\in B)=\mathbb{P}(X\in A)\,\mathbb{P}(Y\in B)$$

for all $A$ and $B$.

Theorem 1.2 Let $(X,Y)$ be a bivariate random vector with joint pmf or pdf $p_{X,Y}(x,y)$. $X$ and $Y$ are independent iff $p_{X,Y}(x,y)=p_{X}(x)p_{Y}(y)$.

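$X_{1},…,X_{n}$ are independent if and only if

$$\mathbb{P}(X_{1}\in A_{1},…,X_{n}\in A_{n})=\prod_{i=1}^{n}\mathbb{P}(X_{i}\in A_{i}).$$

Thus, $p_{X_{1},…,X_{n}}(x_{1},…,x_{n})=\prod_{i=1}^{n}p_{X_{i}}(x_{i})$.

If $X_{1},…,X_{n}$ are independent and identically distributed we say they are iid and we write

$$X_{1},…,X_{n}\sim P.$$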

Independence and conditioning: if $A$ and $B$ are independent events and $P(B)>0$, then $P(A\mid B)=P(A)$. Also, for any pair of events $A$ and $B$,

$$P(AB)=P(A\mid B)P(B)=P(B\mid A)P(A).$$

Independence means that knowing $B$ does not change the probability of $A$.
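For discrete variables, the factorization criterion can be checked mechanically: compute the marginals and compare the joint pmf with their product at every pair of values. The sketch below uses two made-up joint pmfs, one built as a product and one deliberately dependent; all names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

def marginals(joint):
    """Return the marginal pmfs p_X and p_Y of a joint pmf {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return px, py

def is_independent(joint):
    """True iff p(x, y) == p_X(x) * p_Y(y) for every pair, including pairs with p(x, y) = 0."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x, y in product(px, py))

# Independent by construction: joint = product of two Bernoulli pmfs.
bern_x = {0: F(2, 3), 1: F(1, 3)}
bern_y = {0: F(3, 4), 1: F(1, 4)}
indep = {(x, y): bern_x[x] * bern_y[y] for x in bern_x for y in bern_y}

# Dependent: Y always equals X.
dep = {(0, 0): F(1, 2), (1, 1): F(1, 2)}

print(is_independent(indep), is_independent(dep))
```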


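5. Transformations

Let $Y=g(X)$ where $g:\mathbb{R}\rightarrow \mathbb{R}$. Then

$$F_{Y}(y)=\mathbb{P}(Y\leq y)=\mathbb{P}(g(X)\leq y)=\int_{A_{y}}p_{X}(x)\,dx$$

where $A_{y}=\{x:g(x)\leq y\}$.

The density is $p_{Y}(y)=F'_{Y}(y)$. If $g$ is monotonic, then

$$p_{Y}(y)=p_{X}(h(y))\left|\frac{dh(y)}{dy}\right|$$

where $h=g^{-1}$.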

Let $Z=g(X,Y)$. For example, $Z=X+Y$ or $Z=X/Y$. Then we find the pdf of $Z$ as follows:

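    1. For each $z$, find the set $A_{z}=\{(x,y):g(x,y)\leq z\}$.
    1. Find the cdf: $F_{Z}(z)=\mathbb{P}(Z\leq z)=\mathbb{P}(g(X,Y)\leq z)=\mathbb{P}((X,Y)\in A_{z})=\iint_{A_{z}}p_{X,Y}(x,y)\,dx\,dy$.
    1. The pdf is $p_{Z}(z)=F'_{Z}(z)$.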
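As a worked instance of this recipe, take $Z=X+Y$ with $X,Y$ independent $\text{Uniform}(0,1)$ (an assumption made for illustration). The set $A_{z}$ is the part of the unit square below the line $x+y=z$, so the cdf is just an area. The sketch compares that closed form with a simulation; the seed and sample size are arbitrary.

```python
import random

def cdf_sum_of_uniforms(z):
    """Exact cdf of Z = X + Y for independent X, Y ~ Uniform(0, 1)."""
    if z <= 0:
        return 0.0
    if z <= 1:
        return z * z / 2              # area of the triangle {x + y <= z}
    if z <= 2:
        return 1 - (2 - z) ** 2 / 2   # unit square minus the opposite triangle
    return 1.0

def empirical_cdf(data, z):
    """Fraction of the sample that is <= z."""
    return sum(1 for d in data if d <= z) / len(data)

rng = random.Random(1)  # fixed seed so the run is reproducible
draws = [rng.random() + rng.random() for _ in range(100_000)]

for z in (0.5, 1.0, 1.5):
    print(z, round(empirical_cdf(draws, z), 3), cdf_sum_of_uniforms(z))
```

Differentiating the closed form gives the triangular density $p_{Z}(z)=z$ on $[0,1]$ and $2-z$ on $[1,2]$.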

6. Important Distributions

6.1 Bernoulli Distribution

$X\sim \text{Bernoulli}(\theta)$ if $\mathbb{P}(X=1)=\theta$ and $\mathbb{P}(X=0)=1-\theta$, and hence

$$p(x)=\theta^{x}(1-\theta)^{1-x},\quad x\in\{0,1\}.$$

Mean: $\mu=\mathbb{E}[X]=\theta$.
Variance: $\text{Var}(X)=\mathbb{E}[(X-\mu)^{2}]=\theta(1-\theta)$.

6.2 Binomial Distribution

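$X\sim \text{Binomial}(n,\theta)$ if

$$p(x)=\mathbb{P}(X=x)=\binom{n}{x}\theta^{x}(1-\theta)^{n-x},\quad x\in\{0,1,\ldots,n\}.$$

Mean: $\mu=\mathbb{E}[X]=n\theta$.
Variance: $\text{Var}(X)=n\theta(1-\theta)$. (Writing $X$ as a sum of $n$ independent Bernoulli indicator variables gives both the mean and the variance.)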
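The Binomial mean and variance can be confirmed exactly by summing over the pmf. The sketch below assumes $n=10$ and $\theta=3/10$ (arbitrary values) and uses exact fractions, so the equalities hold without rounding.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(x, n, theta):
    """P(X = x) = C(n, x) * theta^x * (1 - theta)^(n - x)."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

n, theta = 10, Fraction(3, 10)  # illustrative parameter values
pmf = {x: binomial_pmf(x, n, theta) for x in range(n + 1)}

total = sum(pmf.values())                                     # should be 1
mean = sum(x * p for x, p in pmf.items())                     # should be n * theta
var = sum(x * x * p for x, p in pmf.items()) - mean**2        # should be n * theta * (1 - theta)

print(total, mean, var)
```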

6.3 Multinomial Distribution

The multivariate version of a Binomial distribution is called a Multinomial distribution. Consider drawing a ball from an urn that has balls of $k$ different colors labeled “color 1, color 2,…, color k.” Let $p=(p_{1},p_{2},…,p_{k})$ where $\sum_{j=1}^{k}p_{j}=1$ and $p_{j}$ is the probability of drawing color $j$. Draw $n$ balls from the urn (independently and with replacement) and let $X=(X_{1},…,X_{k})$ be the count of the number of balls of each color drawn. We say that $X$ has a Multinomial$(n,p)$ distribution. Then,

$$p(x)=\binom{n}{x_{1},\ldots,x_{k}}p_{1}^{x_{1}}\cdots p_{k}^{x_{k}},\quad \sum_{j=1}^{k}x_{j}=n.$$

Mean: $\mathbb{E}[X_{i}]=np_{i}$
Variance: $\text{Var}(X_{i})=np_{i}(1-p_{i})$

6.4 Chi-squared Distribution

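$X\sim \chi^{2}_{n}$ if $X=\sum_{j=1}^{n}Z_{j}^{2}$ where $Z_{1},…,Z_{n}\sim N(0,1)$ are independent. The pdf of $X$ is:

$$p(x)=\frac{1}{\Gamma(n/2)\,2^{n/2}}x^{n/2-1}e^{-x/2},\quad x>0.$$

The mean is $\mu=n$ and the variance is $\text{Var}(X)=2n$, where $n$ is the degrees of freedom.

The cdf of $X$:

$$F(x)=\frac{\gamma(n/2,\,x/2)}{\Gamma(n/2)},$$

where $\gamma(s,t)=\int_{0}^{t}u^{s-1}e^{-u}\,du$ is the lower incomplete gamma function.

Non-central chi-squared (more on this below): $X\sim \chi_{1}^{2}(\mu^{2})$ if $X=Z^{2}$ where $Z\sim N(\mu,1)$.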
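The definition of the central chi-squared distribution as a sum of squared standard normals is easy to simulate. The sketch below assumes $n=5$ degrees of freedom and a fixed seed (both arbitrary) and checks that the sample mean and variance land near $n$ and $2n$; the agreement is only approximate because it is a Monte Carlo estimate.

```python
import random
import statistics

def chi_squared_draw(n, rng):
    """One draw of sum_{j=1}^n Z_j^2 with independent Z_j ~ N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

n = 5                      # degrees of freedom (illustrative)
rng = random.Random(2)     # fixed seed so the run is reproducible
draws = [chi_squared_draw(n, rng) for _ in range(50_000)]

sample_mean = statistics.fmean(draws)      # should be close to n
sample_var = statistics.variance(draws)    # should be close to 2 * n

print(round(sample_mean, 2), round(sample_var, 2))
```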

6.5 Gamma Distribution

$X\sim \Gamma(\alpha, \beta)$ if

$$p_{X}(x)=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}x^{\alpha-1}e^{-x/\beta}$$

for $x>0$, where $\Gamma(\alpha)=\int_{0}^{\infty}\frac{1}{\beta^{\alpha}}x^{\alpha-1}e^{-x/\beta}\,dx$.

6.6 Gaussian Distribution(Normal Distribution)

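$X\sim N(\mu,\sigma^{2})$ if

$$p(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right).$$

If $X\in \mathbb{R}^{d}$ then $X\sim N(\mu,\Sigma)$ if

$$p(x)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$

where $\mathbb{E}[X]=\mu$ and $\text{cov}[X]=\Sigma$. The moment generating function is

$$M(t)=\exp\left(\mu^{T}t+\frac{t^{T}\Sigma t}{2}\right).$$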

Theorem (a). If $Y\sim N(\mu,\Sigma)$, then $\mathbb{E}[Y]=\mu,\text{cov}(Y)=\Sigma$.
(b). If $Y\sim N(\mu,\Sigma)$ and $c$ is a scalar, then $cY\sim N(c\mu,c^{2}\Sigma)$.

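Theorem Suppose that $Y\sim N(\mu,\Sigma)$. Let

$$Y=\begin{pmatrix}Y_{1}\\ Y_{2}\end{pmatrix},\quad \mu=\begin{pmatrix}\mu_{1}\\ \mu_{2}\end{pmatrix},\quad \Sigma=\begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}$$

where $Y_{1}$ and $\mu_{1}$ are $p\times 1$, and $\Sigma_{11}$ is $p\times p$.
(a). $Y_{1}\sim N_{p}(\mu_{1},\Sigma_{11})$, $Y_{2}\sim N_{n-p}(\mu_{2},\Sigma_{22})$.
(b). $Y_{1}$ and $Y_{2}$ are independent if and only if $\Sigma_{12}=0$.
(c). If $\Sigma_{22}> 0$, then the conditional distribution of $Y_{1}$ given $Y_{2}$ is

$$Y_{1}\mid Y_{2}\sim N_{p}\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}(Y_{2}-\mu_{2}),\ \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$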
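In the bivariate case ($n=2$, $p=1$) every block is a scalar, and the standard conditioning formula reduces to a conditional mean $\mu_{1}+\frac{\Sigma_{12}}{\Sigma_{22}}(y_{2}-\mu_{2})$ and conditional variance $\Sigma_{11}-\frac{\Sigma_{12}^{2}}{\Sigma_{22}}$. The sketch below implements just that scalar special case; the function name and the example numbers are assumptions for illustration.

```python
def conditional_normal(mu1, mu2, s11, s12, s22, y2):
    """Mean and variance of Y1 | Y2 = y2 for a bivariate normal (scalar blocks only).

    s11, s12, s22 are the entries of the 2x2 covariance matrix.
    """
    if s22 <= 0:
        raise ValueError("s22 must be positive")
    cond_mean = mu1 + s12 / s22 * (y2 - mu2)
    cond_var = s11 - s12 * s12 / s22
    return cond_mean, cond_var

# Standard bivariate normal with covariance 0.5, observing Y2 = 2.
mean_given, var_given = conditional_normal(0.0, 0.0, 1.0, 0.5, 1.0, 2.0)

# With zero covariance, conditioning changes nothing, in line with part (b).
mean_indep, var_indep = conditional_normal(1.0, -1.0, 4.0, 0.0, 9.0, 5.0)

print(mean_given, var_given, mean_indep, var_indep)
```

Note that the conditional variance does not depend on the observed value $y_{2}$, only on the covariance structure.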

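Lemma: Let $Y\sim N(\mu,\sigma^{2}I)$, where $Y^{T}=(Y_{1},…,Y_{n})$, $\mu^{T}=(\mu_{1},…,\mu_{n})$ and $\sigma^{2}>0$ is a scalar. Then the $Y_{i}$ are independent, $Y_{i}\sim N_{1}(\mu_{i},\sigma^{2})$ and

$$\frac{\|Y\|^{2}}{\sigma^{2}}=\frac{Y^{T}Y}{\sigma^{2}}\sim \chi_{n}^{2}\left(\frac{\mu^{T}\mu}{\sigma^{2}}\right).$$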

Theorem Let $Y\sim N(\mu,\Sigma)$. Then:
(a). $Y^{T}\Sigma^{-1}Y\sim \chi_{n}^{2}(\mu^{T}\Sigma^{-1}\mu)$.
(b). $(Y-\mu)^{T}\Sigma^{-1}(Y-\mu)\sim \chi_{n}^{2}(0)$.

7. Sample Mean and Variance

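Let $X_{1},…,X_{n}\sim P$. The sample mean is

$$\hat{\mu}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$$

and the sample variance is

$$\hat{\sigma}_{n}^{2}=\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\hat{\mu}_{n})^{2}.$$

The sampling distribution of $\hat{\mu}_{n}$ is

$$G_{n}(t)=\mathbb{P}(\hat{\mu}_{n}\leq t).$$

Practice Problem. Let $X_{1},…,X_{n}$ be iid with $\mu=\mathbb{E}(X_{i})$ and $\sigma^{2}=\text{Var}(X_{i})$. Then

$$\mathbb{E}(\hat{\mu}_{n})=\mu,\quad \text{Var}(\hat{\mu}_{n})=\frac{\sigma^{2}}{n},\quad \mathbb{E}(\hat{\sigma}_{n}^{2})=\sigma^{2}.$$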

Theorem If $X_{1},…,X_{n}\sim N(\mu,\sigma^{2})$ then
(a). $\hat{\mu}_{n}\sim N(\mu,\frac{\sigma^{2}}{n})$.
(b). $\frac{(n-1)\hat{\sigma}_{n}^{2}}{\sigma^{2}}\sim \chi^{2}_{n-1}$.
(c). $\hat{\mu}_{n}$ and $\hat{\sigma}_{n}^{2}$ are independent.

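Proof of $\mathbb{E}[\hat{\mu}_{n}]=\mu$: by linearity of expectation, $\mathbb{E}[\hat{\mu}_{n}]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_{i}]=\mu$.

Proof of $\mathbb{E}[\hat{\sigma}_{n}^{2}]=\sigma^{2}$: since $\sum_{i=1}^{n}(X_{i}-\hat{\mu}_{n})^{2}=\sum_{i=1}^{n}X_{i}^{2}-n\hat{\mu}_{n}^{2}$, $\mathbb{E}[X_{i}^{2}]=\sigma^{2}+\mu^{2}$ and $\mathbb{E}[\hat{\mu}_{n}^{2}]=\frac{\sigma^{2}}{n}+\mu^{2}$, we get

$$\mathbb{E}\left[\sum_{i=1}^{n}(X_{i}-\hat{\mu}_{n})^{2}\right]=n(\sigma^{2}+\mu^{2})-n\left(\frac{\sigma^{2}}{n}+\mu^{2}\right)=(n-1)\sigma^{2},$$

and dividing by $n-1$ gives the result.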
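A quick simulation makes the role of the $n-1$ divisor visible. The sketch below assumes normal data with $\mu=2$, $\sigma=3$ and a small sample size $n=5$ (all arbitrary), repeats the experiment many times, and averages the estimates. Python's `statistics.variance` divides by $n-1$ and `statistics.pvariance` divides by $n$.

```python
import random
import statistics

mu, sigma, n = 2.0, 3.0, 5   # illustrative parameter values
rng = random.Random(3)       # fixed seed so the run is reproducible

means, unbiased, biased = [], [], []
for _ in range(20_000):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(xs))
    unbiased.append(statistics.variance(xs))   # divides by n - 1
    biased.append(statistics.pvariance(xs))    # divides by n

avg_mean = statistics.fmean(means)         # close to mu
avg_unbiased = statistics.fmean(unbiased)  # close to sigma^2 = 9
avg_biased = statistics.fmean(biased)      # close to (n - 1) / n * sigma^2 = 7.2

print(round(avg_mean, 2), round(avg_unbiased, 2), round(avg_biased, 2))
```

The $1/n$ version systematically underestimates $\sigma^{2}$ by the factor $(n-1)/n$, which is what the proof above predicts.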

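8. Bayes' Theorem

Let $A_{1},…,A_{k}$ be a partition of $\Omega$ such that $P(A_{i})>0$ for all $i$. If $P(B)>0$, then for each $i=1,…,k$:

$$P(A_{i}\mid B)=\frac{P(B\mid A_{i})P(A_{i})}{\sum_{j=1}^{k}P(B\mid A_{j})P(A_{j})}.$$

Total Probability: Let $A_{1},…,A_{k}$ be a partition of $\Omega$. Then, for any event $B$,

$$P(B)=\sum_{i=1}^{k}P(B\mid A_{i})P(A_{i}).$$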
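The two results combine naturally: the law of total probability supplies the denominator of Bayes' theorem. The sketch below works through a made-up diagnostic-test example with a two-set partition (disease, no disease); the prevalence and error rates are assumptions chosen for illustration, and `posterior` is a hypothetical helper.

```python
from fractions import Fraction as F

def posterior(priors, likelihoods):
    """Return ([P(A_i | B) for each i], P(B)) given P(A_i) and P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # P(B), by the law of total probability
    if total == 0:
        raise ValueError("P(B) must be positive")
    return [j / total for j in joint], total

# Partition: A_1 = has the disease, A_2 = does not. B = test is positive.
priors = [F(1, 100), F(99, 100)]        # assumed 1% prevalence
likelihoods = [F(95, 100), F(5, 100)]   # assumed 95% sensitivity, 5% false-positive rate

post, p_b = posterior(priors, likelihoods)

print(p_b, post[0], float(post[0]))
```

Even with a fairly accurate test, the posterior probability of disease after a positive result stays well below one half here, because the prior is so small.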