Summary
A Gaussian process (GP) is a nonparametric method for regression and classification based on probability theory, and is a type of stochastic process used for modeling continuous data. In the Gaussian process approach, the data-generating process is modeled by a probability distribution, and kernels (kernel functions, kernel matrices) are used to express the relationships among data points and to estimate those relationships together with their uncertainty. The kernel of a Gaussian process can take various forms and can be customized to fit the characteristics of the data, so a flexible model can be constructed. Here we describe Gaussian processes based on the Machine Learning Professional Series book “Gaussian Processes and Machine Learning“.
In the previous article, we discussed generalizations of Gaussian process regression. In this article, we discuss the basics of probabilistic models as the foundation of probabilistic generative models and Gaussian processes.
The hypothesis that an observation Y in the real world is obtained by sampling Y∼p(Y) from some probability distribution p(Y) is called a probabilistic generative model of the observation Y. This section describes the concept, formulation, and calculation methods of probabilistic generative modeling.
Introduction
In Gaussian process regression, y=f(x)+ε is assumed as the most basic model, the input x is fixed as a constant, and the function f(x), observed y, and observed noise ε are random variables. It is intuitively easy to accept that the observation noise ε is a random variable. However, treating the observed value y as a random variable is a strange idea when one thinks about it.
Is it necessary to treat a value that has already been observed and is right in front of us as a random variable? Likewise, the function f(⋅) that generates this observed value is treated as a stochastic process, and its values f=(f(x1),…,f(xN)) at the inputs x1,…,xN are treated as random variables. Treating f(⋅) and f, the very objects we want to know, as random variables is also a strange idea when you think about it.
Regarding these points:
- The introduction of the likelihood function is a change of mindset that “the observed value y in front of us is a random variable,” and is the basis of the maximum likelihood estimation method.
- The introduction of prior probability and posterior probability is a change of mindset that “what we want to know (hidden variable f or unknown parameter θ) is a random variable,” and is the basis of Bayesian estimation.
These two points are the basis of stochastic generative models and represent a major shift in thinking that runs counter to everyday intuition.
The concept of probabilistic generative modeling is not only necessary as a basis for applying Gaussian processes to data analysis; it is the concept underlying machine learning in general. The methodologies of maximum likelihood estimation and Bayesian estimation are built on the concept of probabilistic generative modeling.
Random Variables and Writing Generative Models – Random variable X and probability distribution p(X)
First, let us review the basic concepts of random variables, probability distributions, probability density functions, stochastic processes, etc.
A random variable is a variable whose value is determined probabilistically on a trial-by-trial basis.
<Example 1>
Suppose the random variable X is the result of an ideal roll of the dice. The set of possible values of X is {1,2,3,4,5,6}, and the probability of occurrence of each value is p(X=1)=1/6,…,p(X=6)=1/6.
The sum of the probabilities in all cases must equal 1.
\[\displaystyle\sum_{x=1}^{6}p(X=x)=1\]
In general, the properties of a random variable X are specified by the set 𝒳 of possible values of X and the probability distribution p(X) that defines the probability of X taking these values. The probability distribution p(X) defines the probability of occurrence p(X=x) for every realization x∈𝒳.
<Example 2> Let the random variable X be the horizontal coordinate, measured from the center of the target, at which a dart thrown at the center of the target pierces it. The possible realizations are real numbers, x ∈ ℝ, for example x=2.7 (cm) or x=-4.5 (cm). The probability of occurrence of x is represented by the probability density function p(x), and the probability that x lies within a certain range is given by the following integral.
\[p(a<x<b)=\displaystyle\int_a^bp(x)dx\]
The integral of the probability density over all possible values the random variable can take must be 1, as shown in the figure below.
When it is obvious that the integral is taken over all possible values, the integration range may be written in abbreviated form as follows.
\[\displaystyle\int_{y\in \mathbb{R}}p(y)dy=\int p(y)dy=1\]
The probability density function of a one-dimensional Gaussian distribution with mean μ and variance σ² is written as follows.
\[N(x|\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)\]
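As a concrete illustration, the following minimal Python sketch evaluates this density directly from the formula and approximates p(a<x<b) by a simple Riemann sum. The values μ=0 and σ=3 (cm) for the dart example are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a one-dimensional Gaussian N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

# p(a < x < b) for the dart example, with assumed mu = 0 and sigma = 3 (cm)
a, b = -4.5, 2.7
grid = np.linspace(a, b, 10_001)
dx = grid[1] - grid[0]
print("p(a < x < b) ≈", gaussian_pdf(grid, 0.0, 3.0).sum() * dx)

# Integrating over (effectively) the whole real line gives 1
grid_all = np.linspace(-100.0, 100.0, 1_000_001)
print("total probability ≈", gaussian_pdf(grid_all, 0.0, 3.0).sum() * (grid_all[1] - grid_all[0]))
```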
The notation for probability, probability distribution, and probability density function varies among textbooks and the literature. When we write “p(X)” here, p(X) denotes a probability distribution if X is a random variable taking discrete values, and a probability density function if X is a random variable taking continuous values. There are also various notations to express that the random variable X follows a one-dimensional Gaussian distribution with mean μ and variance σ².
We now define a stochastic process. Until now, a stochastic process has been thought of as a generator that stochastically produces a function f(x); here we define it again as follows.
<Definition 1: Stochastic process>
Let N be any natural number, and let 𝒳 and 𝒴 be the sets of possible input and output values, respectively. If, for any N input values x1,…,xN∈𝒳, the simultaneous probability p(fN)=p(f1,…,fN) of the N output values fN=(f1,…,fN)=(f(x1),…,f(xN)), f(xn)∈𝒴, can be given, then f(⋅) is called a stochastic process.
Although we consider here only the real-valued case, such as 𝒳=ℝ^D, 𝒴=ℝ, stochastic processes can be defined in general, including over complex numbers and discrete spaces. Gaussian processes are a special case of stochastic processes.
<Definition 2 Gaussian process>
A stochastic process f(⋅) is called a Gaussian process if the simultaneous probability p(f1,…,fN) follows an N-dimensional Gaussian distribution.
Here we touch on the notion of “infinity” when dealing with Gaussian processes. By defining a Gaussian process as “for any natural number N, the simultaneous probability p(f1,…,fN) is determined,” we can handle a stochastic process with essentially the same meaning while avoiding the word “infinite.” For this reason, the term “infinite-dimensional” is not used; it is paraphrased as “for any natural number N”.
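To make Definition 2 concrete, the following minimal sketch draws the finite-dimensional vector f=(f(x1),…,f(xN)) from an N-dimensional Gaussian. The zero mean and the RBF kernel used to build the covariance matrix K are assumptions for illustration; the definition itself does not prescribe a particular kernel.

```python
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2)) -- an assumed kernel."""
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)                     # any N input points x_1, ..., x_N
K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))   # covariance matrix (jitter for numerical stability)
f = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K)  # one draw f ~ N(0, K)
print(f.shape)  # (50,): a function evaluated at 50 inputs, drawn in one shot
```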
Simultaneous probability p(X,Y) and marginalization
We can consider the joint distribution of multiple random variables. For example, the joint distribution of random variables X and Y is written as p(X,Y). This means the probability distribution when the combination of two random variables X and Y (X,Y) is considered as one new random variable.
Since the Gaussian process method deals with models of simultaneous probabilities of multiple random variables, we will review the general method for dealing with simultaneous probabilities.
When the set of possible values of X is written 𝒳 and the set of possible values of Y is written 𝒴, the set of possible values of the pair (X,Y) is written 𝒳×𝒴, which is called the Cartesian product of 𝒳 and 𝒴. Using the Cartesian product, we can write (X,Y) ∈ 𝒳×𝒴, and simultaneous distributions can be considered in the same way for combinations of two or more random variables.
<Definition 3: Marginalization and marginal distribution>
The operation of obtaining the probability density function of X from the probability density function p(X,Y) of the simultaneous distribution of multiple random variables X and Y by the following integral calculation is called marginalization, and the probability distribution thus obtained is called the marginal distribution.
\[p(X)=\displaystyle\int p(X,Y)dY\]
This operation is sometimes described as “marginalizing over Y” or “eliminating Y by marginalization.” If the set 𝒴 of values Y can take is discrete, the sum over all possible values of Y is taken instead of the integral, and this is also called marginalization.
\[p(X)=\displaystyle\sum_{Y\in\mathcal{Y}} p(X,Y)\]
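For discrete variables this is just a sum over a joint table. The following minimal sketch marginalizes an illustrative 2×3 joint distribution p(X,Y); the numbers in the table are made up.

```python
import numpy as np

# Illustrative joint table p(X, Y): rows are X = 0, 1; columns are Y = 0, 1, 2
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)   # a valid joint distribution sums to 1

p_x = p_xy.sum(axis=1)   # p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)   # p(Y) = sum_X p(X, Y)
print("p(X) =", p_x)     # [0.4 0.6]
print("p(Y) =", p_y)     # [0.35 0.35 0.3]
```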
Next, consider the conditional distribution.
<Definition 4: Conditional distribution>
The conditional distribution p(Y|X) is the probability distribution of Y when the value of X is known or given. The condition X does not necessarily have to be a random variable. Since p(Y|X) is a probability density function of Y, it integrates to 1 with respect to Y.
\[\displaystyle\int p(Y|X)dY=1\quad(1)\]
On the other hand, it should be noted that there is no particular restriction on the value of the integral with respect to condition X.
\[\displaystyle\int p(Y|X)dX=?\quad(2)\]
The following relationship holds between conditional, simultaneous, and marginal distributions.
\[p(Y|X)p(X)=p(X,Y)=p(X|Y)p(Y)\]
Rearranging this relation into the following form yields Bayes’ theorem.
<Theorem 5 Bayes’ Theorem>
\[p(Y|X)=\frac{p(X|Y)p(Y)}{p(X)}\]
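A quick numerical check of Theorem 5 on a discrete joint table (the same kind of made-up 2×3 table as above): p(Y|X) computed directly from the joint agrees with p(X|Y)p(Y)/p(X).

```python
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])          # p(X, Y): rows X, columns Y
p_x = p_xy.sum(axis=1, keepdims=True)          # p(X), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)          # p(Y), shape (1, 3)

p_y_given_x = p_xy / p_x                       # p(Y|X) = p(X, Y) / p(X)
p_x_given_y = p_xy / p_y                       # p(X|Y) = p(X, Y) / p(Y)
bayes = p_x_given_y * p_y / p_x                # p(X|Y) p(Y) / p(X)

print(np.allclose(p_y_given_x, bayes))         # True: Bayes' theorem holds
print(p_y_given_x.sum(axis=1))                 # each row sums to 1, cf. Eq. (1)
```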
For the sake of practice, consider the situation where the realizations x,d of two random variables X,D are generated in a chain.
<Example 3: Chained dice and darts model>
The dice are rolled, and the roll d determines the position coordinate μ of the aiming point through a known functional relation μ=μ(d); a dart is then thrown at that point, and the horizontal coordinate x∈ℝ of the position where it sticks is observed. (Figure below)
How can we write the probability distribution of x in the form of a density function p(x)? The generative process of x is a chained process determined by the probability distribution p(d) of the dice roll and the conditional distribution p(x|d) of the dart’s landing position.
First, assuming ordinary unbiased dice, the probability distribution of the rolls is the following uniform distribution.
\[p(d=1)=1/6,\ldots,p(d=6)=1/6\quad(3)\]
Next, assume that the position at which a dart thrown aiming at position μ sticks follows a normal distribution centered at μ with an appropriate variance σ². Then the probability distribution of the result of the dart throw is as follows.
\[p(x|\mu)=N(x|\mu,\sigma^2)\]
Here, the position coordinates μ to be aimed at with darts are determined by the roll of the dice, and are as follows.
\[p(x|d)=p(x|\mu(d))=N(x|\mu(d),\sigma^2)\]
Now we have modeled each of the two processes. Next, we combine them: the simultaneous distribution p(x,d) of X and D can be expressed as the product p(x,d)=p(x|d)p(d) of the two probability distributions defined above. This completes the combination.
Finally, the density function p(x) can be obtained by marginalizing the simultaneous distribution p(x,d) with respect to d.
\[p(x)=\displaystyle\sum_{d=1}^6 p(d)p(x|d)=\sum_{d=1}^6\frac{1}{6}N(x|\mu(d),\sigma^2)\]
This is the answer we seek; the probability density function has six peaks, as shown in the figure below. A probability distribution that can be written as a weighted average of multiple Gaussian distributions is called a Gaussian mixture distribution.
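The following minimal sketch evaluates this six-peaked mixture density numerically. The mapping μ(d)=3d and the value σ=1 are assumptions for illustration; the book’s figure uses its own values.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

mu_of_d = {d: 3.0 * d for d in range(1, 7)}   # assumed mapping mu = mu(d)
sigma = 1.0                                   # assumed spread of the dart throws

x = np.linspace(-3.0, 24.0, 2701)
p_x = sum(gaussian_pdf(x, mu_of_d[d], sigma) for d in range(1, 7)) / 6  # mixture density
print("integral of p(x) ≈", p_x.sum() * (x[1] - x[0]))  # ≈ 1: a valid density with six peaks
```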
The above can be abstracted and summarized as follows.
- Represent unknown values as random variables (e.g., x and d)
- Represent the stochastic process by which individual values are generated by their respective probability distributions (e.g., p(d) and p(x|d))
- Represent the simultaneous distribution of all random variables (e.g. p(x,d)=p(x|d)p(d))
- Find the required marginal distribution (e.g. p(x))
This process of representing the generative process of observed values in terms of probability distributions is called probabilistic generative modeling.
Using the symbol ∼ for “is distributed as,” this chained probabilistic generative model can be expressed simply as follows.
\begin{eqnarray}\begin{cases}d&\sim&p(d)\\x|d&\sim&N(\mu(d),\sigma^2)\end{cases}\quad(4)\end{eqnarray}
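Sampling from (4) proceeds in exactly this order, first d and then x given d. A minimal sketch, reusing the assumed μ(d)=3d and σ=1 from above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

def sample_x(n_samples=10_000):
    d = rng.integers(1, 7, size=n_samples)   # d ~ p(d): uniform over {1, ..., 6}
    mu = 3.0 * d                             # mu = mu(d), assumed mapping
    return rng.normal(loc=mu, scale=sigma)   # x | d ~ N(mu(d), sigma^2)

samples = sample_x()
print(samples[:5])
# A histogram of `samples` approximates the six-peaked mixture density p(x) above.
```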
<Example 4: Gaussian process generative model>
In the Gaussian process model y=f(x)+ε, the input points X=(x1,…,xN)T are given constants. If the function f(x) is a Gaussian process, what distribution do the observed values y=(y1,…,yN)T at these input points follow?
\begin{eqnarray}\begin{cases}\mathbf{f}&\sim&N(\mu,\mathbf{K})\\\mathbf{y}|\mathbf{f}&\sim& N(\mathbf{f},\sigma^2\mathbf{I}_N)\end{cases}\quad(5)\end{eqnarray}
The function outputs f=(f(x1),…,f(xN))T follow a Gaussian distribution N(μ,K), where μ=(μ(x1),…,μ(xN))T is the mean vector and K is the N×N covariance matrix whose (n,n’) component is the value k(xn,xn’) of the kernel function.
The predictive distribution p(y) of y can then be obtained by computing the marginal distribution p(y)=∫p(y|f)p(f)df. The computation of the predictive distribution in Gaussian process regression has the same structure as the dice-and-darts model in Example 3, in that it assumes a two-stage chain of generative steps.
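Under the Gaussian likelihood above, this marginal integral has the closed form p(y)=N(μ, K+σ²I_N). The following minimal sketch, assuming a zero mean and an RBF kernel, samples y through the two-stage model (5) and checks that the empirical covariance of y is close to K+σ²I_N.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(1)
x = np.array([0.0, 0.5, 1.0, 2.0])             # fixed input points X
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))
sigma = 0.3

n_draws = 200_000
f = rng.multivariate_normal(np.zeros(len(x)), K, size=n_draws)  # f ~ N(0, K)
y = f + rng.normal(scale=sigma, size=f.shape)                   # y | f ~ N(f, sigma^2 I)

print(np.round(np.cov(y, rowvar=False), 2))                     # empirical covariance of y
print(np.round(K + sigma ** 2 * np.eye(len(x)), 2))             # K + sigma^2 I: should match
```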
Independence: p(X,Y)=p(X)p(Y) and conditional independence: p(X,Y|Z)=p(X|Z)p(Y|Z)
When examining the relationship between multiple random variables, the most basic and important properties are independence and conditional independence.
<Definition 6: Independence of random variables>
In general, two random variables X and Y are said to be independent when the following relationship holds between their simultaneous and marginal distributions. \[p(X,Y)=p(X)p(Y)\]This relationship is equivalent to p(X)=p(X|Y), and also to p(Y)=p(Y|X).
<Definition 7: Conditional independence of random variables>
Random variables X and Y are said to be conditionally independent under condition Z when the two random variables X and Y satisfy the following relationship \[p(X,Y|Z)=p(X|Z)p(Y|Z)\]
Even if X and Y are conditionally independent given Z, i.e., p(X,Y|Z)=p(X|Z)p(Y|Z), unconditional independence p(X,Y)=p(X)p(Y) does not necessarily hold. Here p(X,Y) is the marginal distribution p(X,Y)=∫p(X,Y,Z)dZ=∫p(X,Y|Z)p(Z)dZ under the appropriate distribution p(Z).
<Example 5>
Independence or conditional independence among three or more random variables can be determined in a similar manner. For example, if
\[p(A,B,C|D,E)=p(A|D,E)p(B,C|D,E)\]
holds, then renaming the combination (B,C) as F and the combination of conditions (D,E) as G gives the following equation.
\[p(A,F|G)=p(A|G)p(F|G)\]
From this relation, we can say that A and (B,C) are conditionally independent under the condition (D,E) at this time.
<Example 6 Independence in multivariate Gaussian distribution>
When the simultaneous probability p(y1,y2,y3) of y1,y2,y3 is a three-dimensional Gaussian distribution with mean (μ1,μ2,μ3) and covariance matrix σ²I3, the fact that y1,y2,y3 are mutually independent can be shown by the following transformation.
\begin{eqnarray}p(y_1,y_2,y_3)&=&\frac{1}{\sqrt{(2\pi)^3|\sigma^2\mathbf{I}_3|}}exp\left(-\frac{1}{2\sigma^2}\left[(y_1-\mu_1)^2+(y_2-\mu_2)^2+(y_3-\mu_3)^2\right]\right)\\&=&\frac{1}{\sqrt{2\pi\sigma^2}}exp\left(-\frac{1}{2\sigma^2}(y_1-\mu_1)^2\right)\\& &\times\frac{1}{\sqrt{2\pi\sigma^2}}exp\left(-\frac{1}{2\sigma^2}(y_2-\mu_2)^2\right)\\& &\times\frac{1}{\sqrt{2\pi\sigma^2}}exp\left(-\frac{1}{2\sigma^2}(y_3-\mu_3)^2\right)\\&=&N(y_1|\mu_1,\sigma^2)\times N(y_2|\mu_2,\sigma^2)\times N(y_3|\mu_3,\sigma^2)\\&=&p(y_1)p(y_2)p(y_3)\end{eqnarray}
More generally, in a three-dimensional Gaussian distribution, y1, y2, and y3 are independent of each other when the covariance matrix is a diagonal matrix, and vice versa.
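A minimal numerical check of this factorization, with illustrative values of μ, σ, and an arbitrary evaluation point y: the three-dimensional density with covariance σ²I₃ equals the product of three one-dimensional densities.

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

mu = np.array([1.0, -2.0, 0.5])   # illustrative means (mu_1, mu_2, mu_3)
sigma = 0.8                       # common standard deviation
y = np.array([1.3, -1.7, 0.0])    # an arbitrary evaluation point

# Joint density of N(mu, sigma^2 I_3), written out directly from the formula above
joint = np.exp(-0.5 * np.sum((y - mu) ** 2) / sigma ** 2) / np.sqrt((2 * np.pi * sigma ** 2) ** 3)
# Product of the three one-dimensional marginal densities
product = np.prod(gaussian_pdf(y, mu, sigma))
print(joint, product, np.isclose(joint, product))   # the two values agree
```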
<Example 7 Conditional independence in Gaussian process regression>
In Example 4, the second stage of the chained generative process of Gaussian process regression was p(y|f)=N(f,σ²IN). This means that the N random variables y1,…,yN are conditionally independent of each other given f. That is, p(y|f)=p(y1|f)×…×p(yN|f).
Note that independence between components does not generally hold for p(y), which is obtained by eliminating f by the following marginalized integral.
\[p(\mathbf{y})=\displaystyle\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f})d\mathbf{f}\]
That is, in general, p(y)≠p(y1)x…xp(yN).
<Example 8: Independent and identically distributed sampling and conditional independence>
Obtaining multiple realizations by repeatedly sampling from a random variable with the same probability distribution is called “independent and identically distributed (i.i.d.) sampling.” For example, the process of rolling the same dice three times and observing the rolls is written as follows.
\[d_1,d_2,d_3\overset{iid}{\sim} p(d)\]
\(\overset{iid}{\sim}\) indicates that the left-hand side is sampled independently from the probability distribution on the right-hand side. When multiple variables are written on the left-hand side, it is clear that they are sampled independently from the same distribution, so the i.i.d. may be omitted, as in d1,d2,d3~p(d).
Note that in the real world, sampling that is strictly independent and identically distributed is not possible. When dice are rolled repeatedly, the probability of each face changes due to deformations such as wear from handling or shaving of the corners. The condition of the person throwing the darts is slightly different with each throw. It is necessary to recognize that the i.i.d. assumption is part of a model that simplifies the real world so that we can understand it.
Graphical model of Gaussian process regression model
A graphical model is a notation for visualizing the independence and conditional independence among random variables in graph form. The relationships among the random variables are shown in the figure below.
When considering a model that deals with a large number of random variables, such visualizations are important because it is essential to grasp the independence and conditional independence among the random variables.
For example, the relationship in Example 5 can be depicted in the graphical model below.
When there is no arrow between B and C, it means that neither independence nor conditional independence between B and C is specifically asserted.
In data analysis such as regression and classification, the number of random variables is generally arbitrary. Just as mathematical expressions in a document use abstract index notation such as “1,…,M,” the same abstraction is used in graphical models (Figure (a) below).
The same model may also be represented using plate notation (Figures (b), (c)). Plate notation is useful for writing down complex models concisely and is used in much of the literature, but care must be taken because it is hard to read without a precise understanding of the model.
<Example 9>
A stochastic generative model of linear regression is depicted using a graphical model. The chained generative process can be written collectively as follows.
\begin{eqnarray}\begin{cases}w_m\quad\overset{i.i.d.}{\sim} N(0,\lambda^2)\\f_n|\mathbf{x}_n=f(\mathbf{x}_n;\mathbf{w})=\displaystyle\sum_{m=1}^M w_m\phi_m(\mathbf{x}_n)\\y_n|f_n\sim N(f_n,\sigma^2)\end{cases}\quad(6)\end{eqnarray}
where n=1,…,N are the indices of the observations and m=1,…,M are the indices of the basis functions. If we write the parameters in vector form as w=(w1,…,wM), the above generative process can be depicted as shown in Figure (a) below, or in plate notation as shown in Figure (b) below.
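A minimal sketch of sampling from the generative process (6). The Gaussian basis functions φ_m and the values of λ, σ, M, and N below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 5, 50
lam, sigma = 1.0, 0.1

x = np.linspace(-3, 3, N)                      # fixed input points x_1, ..., x_N
centers = np.linspace(-3, 3, M)                # centers of the assumed Gaussian bases
Phi = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)   # Phi[n, m] = phi_m(x_n)

w = rng.normal(0.0, lam, size=M)               # w_m ~ i.i.d. N(0, lambda^2)
f = Phi @ w                                    # f_n = sum_m w_m phi_m(x_n)
y = rng.normal(f, sigma)                       # y_n | f_n ~ N(f_n, sigma^2)
print(y[:5])
```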
Note that when data analysis is based on stochastic generative modeling, the random variables corresponding to the observed data (yn in this case) are sometimes drawn as squares to distinguish them from others, and this is also the case here.
The same model can also be summarized in vector form. Using the input points X, the observed values Y, the unobserved function values (hidden variables) fN=(f1,…,fN), and the parameters w, the probabilistic generative model is written as follows.
\[p(Y,f_N,w|X)=p(Y|f_N)p(f_N|X,w)p(w)\]
Figure (c) corresponds to this expression.
It is important to note that the input points X are treated as constants, not random variables. Since X is a constant, it is not enclosed in a circle (⚪︎) in the graphical model, and it appears only to the right of the conditioning bar in the conditional probability above.
<Example 10>
A probabilistic generative model of Gaussian process regression with noise, including a test point, can be written collectively as follows.
\begin{eqnarray}\begin{cases}\mathbf{f}_N,f_*|\mathbf{X},x_*&\sim& N(\mu,\mathbf{K})\quad&(7)&\\y_n|f_n&\sim& N(f_n,\sigma^2)\quad&(8)&\\y_*|f_*&\sim& N(f_*,\sigma^2)\quad&(9)&\end{cases}\end{eqnarray}
where μ and K are the mean vector and covariance matrix corresponding to the N+1 input points X,x*. Compare this with the figure below.
Of particular note is that, corresponding to the fact that the simultaneous probability of f1,…,fN,f* conditional on X is given by Equation (7), every pair chosen from f1,…,fN,f* in the graphical model above is connected by an undirected link with no arrow. Comparing this with the graphical model of the linear model in Example 9, we should also confirm that there are no direct links among f1,…,fN,f* in the linear model.
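A minimal sketch of sampling from the generative process (7)-(9), assuming a zero mean and an RBF kernel over the N+1 points (X, x*): fN and f* are drawn jointly from one (N+1)-dimensional Gaussian, and noisy observations are then generated for each.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    diff = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(3)
X = np.linspace(-2, 2, 8)            # N training inputs
x_star = np.array([3.0])             # one test input x_*
x_all = np.concatenate([X, x_star])

K = rbf_kernel(x_all, x_all) + 1e-8 * np.eye(len(x_all))
f_all = rng.multivariate_normal(np.zeros(len(x_all)), K)  # (f_N, f_*) ~ N(0, K), Eq. (7)

sigma = 0.2
f_N, f_star = f_all[:-1], f_all[-1]
y_N = rng.normal(f_N, sigma)         # y_n | f_n ~ N(f_n, sigma^2), Eq. (8)
y_star = rng.normal(f_star, sigma)   # y_* | f_* ~ N(f_*, sigma^2), Eq. (9)
print(y_N.shape, y_star)
```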
In the next article, I will discuss maximum likelihood estimation and Bayesian estimation as the basis for stochastic generative models and Gaussian processes.