Summary
Nonparametric Bayes is an approach in Bayesian statistics that builds probability models from the data itself: instead of assuming a fixed form for the true distribution that generates the data, it estimates the probability distribution from the data. This allows the use of flexible models whose probability distribution adjusts automatically to fit the data. Here we describe this nonparametric technique based on the Machine Learning Professional series “Nonparametric Bayes – The Mathematics of Point Processes and Statistical Machine Learning”.
In the previous article, I gave an overview of the mathematics of nonparametric Bayesian point processes and statistical machine learning. In this article, we will organize basic knowledge about probability distributions. The relationship between probability distributions is shown below.
Bernoulli and binomial distributions
This section describes the Bernoulli distribution and the binomial distribution.
The Bernoulli distribution is a discrete distribution over a binary random variable x∈{0,1}, where the probability that x=1 is π (0≤π≤1) and the probability that x=0 is 1-π. The Bernoulli distribution is defined as follows with π as a parameter.
\[Bernoulli(x|\pi)=\pi^x(1-\pi)^{1-x}\quad(x\in\{0,1\})\quad(4)\]
Consider n independent trials that follow a Bernoulli distribution, and let xi ∈ {0,1} denote the value in the i-th trial. Also, let n0 (n1) denote the number of times 0 (1) was obtained.
The probability of x={x1,x2,…,xn} given π can be calculated as follows.
\[p(\mathbf{x}|\pi)=\displaystyle\prod_{i=1}^np(x_i|\pi)=\pi^{n_1}(1-\pi)^{n_0}\quad(5)\]
If we are interested in n1, the number of occurrences of 1 in n trials, rather than the value in each trial, then n1 follows a binomial distribution with parameters π and n, expressed as follows.
\[Bi(n_1|\pi,n)=\frac{n!}{n_1!(n-n_1)!}\pi^{n_1}(1-\pi)^{n-n_1}\quad(6)\]
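The difference between the sequence probability (5) and the count probability (6) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the sample data are illustrative assumptions, not part of the original text.

```python
import math

def bernoulli_likelihood(xs, pi):
    """Probability of the whole 0/1 sequence xs under i.i.d. Bernoulli(pi), eq. (5)."""
    n1 = sum(xs)            # number of ones
    n0 = len(xs) - n1       # number of zeros
    return pi ** n1 * (1 - pi) ** n0

def binomial_pmf(n1, pi, n):
    """Probability of observing n1 ones in n trials, eq. (6)."""
    return math.comb(n, n1) * pi ** n1 * (1 - pi) ** (n - n1)

xs = [1, 0, 1, 1, 0]        # hypothetical observations
pi = 0.6
seq_prob = bernoulli_likelihood(xs, pi)
count_prob = binomial_pmf(sum(xs), pi, len(xs))
# Summing the binomial probabilities over all possible counts should give 1.
total = sum(binomial_pmf(k, pi, len(xs)) for k in range(len(xs) + 1))

print(seq_prob, count_prob, total)
```

The two values differ by the binomial coefficient, which counts the orderings of the sequence that share the same n1.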
Poisson distribution
This section describes the Poisson distribution. The Poisson distribution is a probability distribution that is often used for count data, that is, for discrete events taking non-negative integer values such as frequencies.
The Poisson distribution is defined as follows with λ>0 as a parameter.
\[Po(x|\lambda)=\frac{\lambda^x}{x!}e^{-\lambda}\quad(x\in\mathbb{N}\cup\{0\})\quad(7)\]
The expected value and variance of the Poisson distribution are as follows.
\[\mathbb{E}[x]=\lambda,\ \mathbb{V}[x]=\lambda\quad(8)\]
The Poisson distribution arises as the limit of the binomial distribution when n→∞ with nπ=λ held fixed.
\[\lim_{n\rightarrow\infty} Bi(x|\pi,n)=Po(x|\lambda)\quad(9)\]
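The limit in (9) can be illustrated numerically by fixing λ and letting n grow while setting π=λ/n. This is a small sketch under that assumption; the chosen values of λ, x, and n are arbitrary.

```python
import math

def binomial_pmf(x, pi, n):
    """Binomial probability of x successes in n trials, eq. (6)."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson probability of the count x, eq. (7)."""
    return lam ** x / math.factorial(x) * math.exp(-lam)

lam = 3.0   # arbitrary rate for illustration
x = 2       # arbitrary count at which the two pmfs are compared
gaps = []
for n in (10, 100, 1000, 10000):
    # Keep n * pi = lam fixed while n grows.
    gap = abs(binomial_pmf(x, lam / n, n) - poisson_pmf(x, lam))
    gaps.append(gap)
    print(n, gap)
```

The printed gap shrinks as n increases, which is the behavior that (9) describes.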
Multinomial distribution
We describe the multinomial distribution, which extends the binomial distribution to more than two outcomes.
Let x be a random variable that takes one of K different values {1,2,…,K}, and let π=(π1,π2,…,πK) \((\sum_{k=1}^K\pi_k=1)\) be the probabilities of taking each value. Consider n independent trials, and let xi=k indicate that the value in the i-th trial is k. Also, let nk denote the number of times the value k is obtained. Given π, the probability that xi=k is p(xi=k|π)=πk.
In this case, the probability of x={x1,x2,…,xn} given π can be calculated as follows.
\[p(\mathbf{x}|\pi)=\displaystyle\prod_{i=1}^np(x_i|\pi)=\prod_{k=1}^K\pi_k^{n_k}\quad(10)\]
If we are interested in the number nk of occurrences of each value in n trials, rather than the value in each trial, the counts \(\{n_k\}_{k=1}^K\) follow a multinomial distribution \(Multi(\{n_k\}_{k=1}^K|\pi,n)\), defined by the following formula with π and n as parameters.
\[Multi(\{n_k\}_{k=1}^K|\pi,n)=\frac{n!}{\prod_{k=1}^Kn_k!}\displaystyle\prod_{k=1}^K\pi_k^{n_k}\quad(11)\]
The value xi in a single trial can be regarded as following a multinomial distribution with n=1, that is, p(xi=k|π)=Multi(nk=1|π,1)=πk (with nk'=0 for all k'≠k). This is denoted Multi(xi|π) and is also called the categorical distribution.
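As a quick check of (11), the following sketch evaluates the multinomial probability directly and confirms the two special cases mentioned above: K=2 gives the binomial distribution and n=1 gives the categorical distribution. The function name and the numbers are illustrative assumptions.

```python
import math

def multinomial_pmf(counts, pi):
    """Probability of the count vector {n_k} under Multi(. | pi, n), eq. (11)."""
    n = sum(counts)
    coef = math.factorial(n)
    for nk in counts:
        coef //= math.factorial(nk)   # n! / prod_k n_k!, exact in integers
    prob = float(coef)
    for nk, pk in zip(counts, pi):
        prob *= pk ** nk
    return prob

pi = [0.5, 0.3, 0.2]
p = multinomial_pmf([2, 1, 1], pi)              # n = 4 trials, K = 3 values
p_binom = multinomial_pmf([3, 2], [0.6, 0.4])   # K = 2: same as Bi(3 | 0.6, 5)
p_cat = multinomial_pmf([0, 1, 0], pi)          # n = 1: categorical, equals pi_2
# All count vectors with n = 4 should have total probability 1.
total = sum(multinomial_pmf([i, j, 4 - i - j], pi)
            for i in range(5) for j in range(5 - i))

print(p, p_binom, p_cat, total)
```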
Beta distribution
This section describes the beta distribution. The beta distribution is a probability distribution often used as the distribution of the parameter π (0≤π≤1) of the Bernoulli distribution or binomial distribution.
When the random variable π has the following probability density function, π is said to follow a beta distribution with parameters a>0 and b>0.
\[Beta(\pi|a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\pi^{a-1}(1-\pi)^{b-1}\quad(12)\]
Here, the following equation is a generalized factorial function called the gamma function.
\[\Gamma(x)=\displaystyle\int_0^{\infty}t^{x-1}e^{-t}dt\quad(13)\]
This gamma function has the following properties when n≥2 is an integer and α is a non-negative real number.
\[\Gamma(1)=1,\ \Gamma(n)=(n-1)\Gamma(n-1)=(n-1)!\\\Gamma(n+\alpha)=(n-1+\alpha)\Gamma(n-1+\alpha)\quad(14)\]
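The gamma function properties and the normalization of the beta density can be verified numerically. The sketch below relies on `math.gamma` from the Python standard library and on a simple midpoint rule for the integral; the parameter values are arbitrary assumptions chosen for illustration.

```python
import math

def beta_pdf(pi, a, b):
    """Beta density with the gamma-function normalizer, eq. (12)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * pi ** (a - 1) * (1 - pi) ** (b - 1)

# Gamma(n) = (n - 1)! for an integer n, here n = 5.
gamma_5 = math.gamma(5)
# Recursion Gamma(n + alpha) = (n - 1 + alpha) * Gamma(n - 1 + alpha), with n = 3, alpha = 0.5.
lhs = math.gamma(3.5)
rhs = 2.5 * math.gamma(2.5)

# Midpoint-rule approximation of the integral of the density over [0, 1].
a, b = 2.0, 3.0
steps = 10000
area = sum(beta_pdf((i + 0.5) / steps, a, b) for i in range(steps)) / steps

print(gamma_5, lhs, rhs, area)
```

The area should be close to 1, which shows the role of the ratio of gamma functions as the normalizing constant.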
The expected value and variance of the beta distribution are as follows
\[\mathbb{E}[\pi]=\frac{a}{a+b},\ \mathbb{V}[\pi]=\frac{ab}{(a+b)^2(1+a+b)}\quad(15)\]
Dirichlet distribution
We describe the Dirichlet distribution, which extends the beta distribution to the multinomial case. The set of K-dimensional probability vectors is defined as follows.
\[\Delta^K=\{\pi=(\pi_1,\pi_2,\dots,\pi_K)|\displaystyle\sum_{k=1}^K\pi_k=1,\ \pi_k\geq 0\ \forall k\}\quad(16)\]
The Dirichlet distribution is often used as a probability distribution over Δ^K.
When the random variable π has the following probability density function, π is said to follow a Dirichlet distribution with parameter α=(α1,α2,…,αK) (αk>0).
\[Dir(\pi|\alpha)=\frac{\Gamma\left(\sum_{k=1}^K\alpha_k\right)}{\prod_{k=1}^K\Gamma(\alpha_k)}\displaystyle\prod_{k=1}^K\pi_k^{\alpha_k-1}\quad(17)\]
The expected value and variance of the Dirichlet distribution are as follows.
\[\mathbb{E}[\pi_k]=\frac{\alpha_k}{\alpha_0},\ \mathbb{V}[\pi_k]=\frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(1+\alpha_0)},\ \text{where}\ \alpha_0=\displaystyle\sum_{k=1}^K\alpha_k\quad(18)\]
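The mean and variance formulas above can be compared against samples. The sketch below draws Dirichlet samples by normalizing independent gamma variables. That is a standard construction assumed here for illustration and not taken from the original text. The parameter values, seed, and sample size are arbitrary.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector by normalizing independent Gamma(alpha_k, 1) draws.

    This construction is an assumption of this sketch, not part of the source text.
    """
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

rng = random.Random(0)
alpha = [2.0, 3.0, 5.0]     # hypothetical parameter vector
alpha0 = sum(alpha)
K = len(alpha)
samples = [sample_dirichlet(alpha, rng) for _ in range(20000)]

emp_mean = [sum(s[k] for s in samples) / len(samples) for k in range(K)]
emp_var = [sum((s[k] - emp_mean[k]) ** 2 for s in samples) / len(samples)
           for k in range(K)]
theo_mean = [a / alpha0 for a in alpha]
theo_var = [a * (alpha0 - a) / (alpha0 ** 2 * (1 + alpha0)) for a in alpha]

print(emp_mean, theo_mean)
print(emp_var, theo_var)
```

Each sample lies on the simplex Δ^K, so its components are non-negative and sum to 1.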
Gamma distribution and inverse gamma distribution
We discuss the gamma distribution and inverse-gamma distribution as representative probability distributions that non-negative random variables follow.
When a random variable τ has the following probability density function, τ is said to follow a gamma distribution with parameters a>0 and b>0.
\[Ga(\tau|a,b)=\frac{b^a}{\Gamma(a)}\tau^{a-1}exp(-b\tau)\quad(19)\]
The expected value and variance of the gamma distribution are as follows
\[\mathbb{E}[\tau]=\frac{a}{b},\ \mathbb{V}[\tau]=\frac{a}{b^2}\quad(20)\]
When τ follows a gamma distribution, ν=1/τ follows an inverse gamma distribution. The probability density function of the inverse gamma distribution is defined as follows, with a>0 and b>0 as parameters.
\[IG(\nu|a,b)=\frac{b^a}{\Gamma(a)}\nu^{-a-1}exp(-\frac{b}{\nu})\quad(21)\]
The expected value and variance of the inverse gamma distribution are as follows.
\[\mathbb{E}[\nu]=\frac{b}{a-1}(a>1),\ \mathbb{V}[\nu]=\frac{b^2}{(a-1)^2(a-2)}(a>2)\quad(22)\]
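The relationship between the two distributions can be checked by sampling τ from a gamma distribution and taking reciprocals. This is a minimal sketch with arbitrary parameter values. Note that `random.gammavariate` takes a shape and a scale, so the rate b used in this article is passed as the scale 1/b.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(1)
a, b = 6.0, 3.0   # shape a and rate b, as in Ga(tau | a, b); values are arbitrary
# gammavariate(shape, scale): the rate b corresponds to scale 1 / b.
taus = [rng.gammavariate(a, 1.0 / b) for _ in range(50000)]
nus = [1.0 / t for t in taus]   # nu = 1 / tau should follow IG(nu | a, b)

print("gamma mean    :", mean(taus), "theory", a / b)
print("gamma variance:", variance(taus), "theory", a / b ** 2)
print("inverse gamma mean    :", mean(nus), "theory", b / (a - 1))
print("inverse gamma variance:", variance(nus), "theory", b ** 2 / ((a - 1) ** 2 * (a - 2)))
```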
Gaussian distribution
We describe the Gaussian distribution as a typical probability distribution that a D-dimensional real-valued vector x ∈ ℝ^D follows. When a random variable x has the following probability density function, x is said to follow a Gaussian distribution with μ ∈ ℝ^D and a D×D positive definite symmetric matrix Σ as parameters.
\[N(\mathbf{x}|\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^D|\Sigma|}}exp\left(-\frac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right)\quad(23)\]
The expectation and covariance matrix of the Gaussian distribution are as follows.
\[\mathbb{E}[\mathbf{x}]=\mu,\ \mathbb{C}[\mathbf{x}]=\Sigma\quad(24)\]
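To make the density (23) concrete, the sketch below evaluates it for the special case D=2, where the determinant and inverse of Σ can be written out by hand. The function name and the test values are assumptions for illustration. A general implementation would use a linear algebra library instead.

```python
import math

def gaussian_pdf_2d(x, mu, sigma):
    """Gaussian density of eq. (23) for D = 2; sigma is a 2x2 nested list."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    # Closed-form inverse of a 2x2 matrix.
    inv = [[s22 / det, -s12 / det],
           [-s21 / det, s11 / det]]
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)

# At the mean with identity covariance the density is 1 / (2 * pi).
peak = gaussian_pdf_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# A correlated example with arbitrary values.
p = gaussian_pdf_2d([1.0, -1.0], [0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]])
print(peak, p)
```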
Wishart distribution
We describe the Wishart distribution as the probability distribution followed by a D×D positive semidefinite symmetric matrix A.
When a random variable A has the following probability density function, A is said to follow a Wishart distribution with parameters ν≥D and a D×D positive definite symmetric matrix Σ.
\[W(A|\nu,\Sigma)=\frac{|A|^{\frac{1}{2}(\nu-D-1)}}{2^{\frac{\nu D}{2}}\pi^{\frac{D(D-1)}{4}}|\Sigma|^{\frac{\nu}{2}}\displaystyle\prod_{d=1}^D\Gamma\left(\frac{\nu-d+1}{2}\right)}exp\left(-\frac{1}{2}tr(\Sigma^{-1}A)\right)\quad(25)\]
The expectation and covariance matrix of the Wishart distribution are as follows.
\[\mathbb{E}[A]=\nu\Sigma,\ \mathbb{C}[A]=2\nu\Sigma\otimes\Sigma\quad(26)\]
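The expectation νΣ can be checked by simulation. For an integer ν, a Wishart sample can be built as a sum of ν outer products of zero-mean Gaussian vectors with covariance Σ. This is a standard construction assumed here, not something stated in the original text. The sketch is restricted to D=2, and the parameter values are arbitrary. It checks only the mean, not the covariance expression.

```python
import math
import random

def sample_wishart_2d(nu, sigma, rng):
    """Draw A = sum_i z_i z_i^T with z_i ~ N(0, sigma), for integer nu and D = 2."""
    # Cholesky factor L of sigma (sigma = L L^T), written out for the 2x2 case.
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 ** 2)
    a = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(nu):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z = (l11 * e1, l21 * e1 + l22 * e2)   # z = L e has covariance sigma
        for i in range(2):
            for j in range(2):
                a[i][j] += z[i] * z[j]
    return a

rng = random.Random(2)
nu = 5
sigma = [[2.0, 0.5], [0.5, 1.0]]
n_samples = 20000
mean_a = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(n_samples):
    a = sample_wishart_2d(nu, sigma, rng)
    for i in range(2):
        for j in range(2):
            mean_a[i][j] += a[i][j] / n_samples

expected = [[nu * sigma[i][j] for j in range(2)] for i in range(2)]
print(mean_a, expected)
```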
Student t distribution
The Student-t distribution is a distribution that a D-dimensional real-valued vector x ∈ ℝ^D follows.
When a random variable x has the following probability density function, x is said to follow a Student-t distribution with μ ∈ ℝ^D, ν ∈ ℝ+, and a D×D positive definite symmetric matrix Σ as parameters.
\[St(\mathbf{x}|\mu,\nu,\Sigma)=\frac{1}{\sqrt{\pi^ D\nu^D|\Sigma|}}\frac{\Gamma(\nu/2+D/2)}{\Gamma(\nu/2)}\left[1+\frac{1}{\nu}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right]^{-\frac{\nu+D}{2}}\quad(27)\]
The expected value and covariance matrix of the Student’s t distribution are as follows.
\[\mathbb{E}[\mathbf{x}]=\mu,\ \mathbb{C}[\mathbf{x}]=\frac{\nu}{\nu-2}\Sigma\quad(28)\]
Although the above definition is common, there is another one: with μ ∈ ℝ^D, ν ∈ ℝ+, and a D×D positive definite symmetric matrix Φ as parameters, the distribution may be defined as follows.
\begin{eqnarray}& &p(\mathbf{x}|\mu,\nu,\Phi)\\& &=St(\mathbf{x}|\mu,\nu,\Phi)\\& &=\frac{1}{\sqrt{\pi^ D|\Phi|}}\frac{\Gamma(\nu/2+D/2)}{\Gamma(\nu/2)}\left[1+(\mathbf{x}-\mu)^T\Phi^{-1}(\mathbf{x}-\mu)\right]^{-\frac{\nu+D}{2}}\quad(29)\end{eqnarray}
Here, we will mainly use this latter definition of the Student t distribution.
The expected value of the Student t distribution and the covariance matrix are then expressed as follows.
\[\mathbb{E}[\mathbf{x}]=\mu,\ \mathbb{C}[\mathbf{x}]=\frac{1}{\nu-2}\Phi\quad(30)\]
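Comparing (27) with (29) term by term suggests that the two definitions coincide when Φ=νΣ, and the covariance expressions (28) and (30) then agree as well. The sketch below checks this numerically for the one-dimensional case D=1, where Σ and Φ are positive scalars. The function names and parameter values are illustrative assumptions.

```python
import math

def student_t_sigma(x, mu, nu, sigma):
    """Density of eq. (27) for D = 1, with scalar scale parameter sigma."""
    norm = math.gamma(nu / 2 + 0.5) / (math.gamma(nu / 2)
                                       * math.sqrt(math.pi * nu * sigma))
    return norm * (1 + (x - mu) ** 2 / (nu * sigma)) ** (-(nu + 1) / 2)

def student_t_phi(x, mu, nu, phi):
    """Density of eq. (29) for D = 1, with scalar scale parameter phi."""
    norm = math.gamma(nu / 2 + 0.5) / (math.gamma(nu / 2)
                                       * math.sqrt(math.pi * phi))
    return norm * (1 + (x - mu) ** 2 / phi) ** (-(nu + 1) / 2)

mu, nu, sigma = 1.0, 5.0, 2.0   # arbitrary values with nu > 2
phi = nu * sigma                # assumed correspondence between the two forms
pairs = [(student_t_sigma(x, mu, nu, sigma), student_t_phi(x, mu, nu, phi))
         for x in (-2.0, 0.0, 1.0, 3.5)]
cov_sigma = nu / (nu - 2) * sigma   # eq. (28)
cov_phi = phi / (nu - 2)            # eq. (30)

print(pairs)
print(cov_sigma, cov_phi)
```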
In the next article, we will give an overview of stochastic generative models and learning.