Summary
A Gaussian Process (GP) is a nonparametric method for regression and classification based on probability theory, and is a type of stochastic process used for modeling continuous data. In the Gaussian process approach, the data-generating process is modeled by a probability distribution, and kernels (kernel functions, kernel matrices) are used to express the relationships among data points and to estimate those relationships together with their uncertainty. The kernel of a Gaussian process can take various forms and can be customized to the characteristics of the data, so a flexible model can be constructed. Here we describe Gaussian processes based on the Machine Learning Professional Series book “Gaussian Processes and Machine Learning”.
This article is a reading note on that book.
Machine Learning Professional Series – Gaussian Processes and Machine Learning
“The first Japanese introduction to Gaussian processes, an overwhelmingly flexible Bayesian regression model. Starting with basic linear regression, the principles of Gaussian processes are carefully explained from the ground up. It also introduces recent topics such as unsupervised learning and practical applications. Let’s get started!”
Chapter 0: You will understand the Gaussian process method in only 5 minutes.
Overview
Gaussian process method is a method of pattern recognition using the theory of Gaussian processes
0.1 Step 1: What is machine learning?
Machine learning is an artificially created functional relationship
0.2 Step 2: Regression and Least Squares Method
Parameter space and data space
0.3 Step 3: Probability Modeling and Bayesian Estimation
Posterior distribution of parameter space in Bayesian estimation
When nothing is known, the prior distribution is a broadly spread cloud and the predictions are scattered.
The cloud contracts as the posterior distribution comes to reflect the influence of the observed data.
The parameters converge.
0.4 Step 4: Gaussian distribution and covariance
One-dimensional Gaussian distribution and two-dimensional Gaussian distribution
When two or more variables follow a Gaussian distribution, the distribution of the vector summarizing them is a multivariate Gaussian distribution
Multivariate Gaussian distribution is defined by a central location and a covariance matrix
If two random variables are independent, their covariance is zero; a non-zero covariance means they are not independent.
0.5 Step 5: Gaussian Processes and Gaussian Process Regression
Machine learning is the process of learning a function f(x)
Finding f(x) to fit the data is the goal of regression problems
In regression, it is important to properly determine the range of f(x), i.e., the functional model
What we want to find in Gaussian process regression is the cloud of posterior distribution of the function f(x)
It is found by Bayesian estimation.
For any N input points x1,…,xN of the function f(x), the outputs f(x1),…,f(xN) follow an N-dimensional multivariate Gaussian distribution.
Conceptual diagram
Example of Gaussian Process Regression
Column: Function Clouds and Gaussian Processes
When the cloud of functions is obtained
The degree of uncertainty can be assessed
Understand the difference between confident and unconfident regions
Allows for model selection and feature selection
Chapter 1 Linear Regression Models
Overview
A Gaussian process is an infinite-dimensional extension of the linear regression model.
Given an input x, predicting the value of the corresponding output y is called regression.
Predicting continuous values is regression; predicting discrete values is identification (classification).
1.1 Simple regression
Regression from real number x to real number y
N pairs of (x,y) observations
D={(x1,y1), (x2,y2), … , (xN,yN)}
Simplest regression model (single regression)
ŷ (the value of y predicted from x) = a + bx (a, b ∈ ℝ)
Equation 1
Equation 2
The minimum is reached when the derivative with respect to a and b is zero
Formula Normal equation for single regression
(e.g.) when D={(3,2),(2,4),(-1,1)}
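As a concrete illustration of solving these normal equations, here is a minimal NumPy sketch for the example data D = {(3,2), (2,4), (−1,1)}; the variable names and the use of a design matrix with a constant column are my own choices, not taken from the book.

```python
import numpy as np

# Example data D = {(3, 2), (2, 4), (-1, 1)}
x = np.array([3.0, 2.0, -1.0])
y = np.array([2.0, 4.0, 1.0])

# Design matrix with a column of ones for the intercept a
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations (X^T X) w = X^T y for w = (a, b)
a, b = np.linalg.solve(X.T @ X, X.T @ y)
print(f"y = {a:.3f} + {b:.3f} x")
```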
1.2 Multiple regression and vector representation
The input x is not one-dimensional but a D-dimensional vector xᵀ = (x1, x2, …, xD)
T denotes transposition
Represented as a column vector
ŷ (the value of y predicted from x) = ω0 + ω1x1 + ω2x2 + … + ωDxD
multiple regression
Two-dimensional example
y = 1 + 1.2x1 + 0.6x2
(yn − ŷn)² = (yn − (ω0 + ω1xn1 + ω2xn2 + … + ωDxnD))²
ŷ = ω0 + ω1x1 + ω2x2 + … + ωDxD
For N data, we have
Equation
X is called the design matrix
Formula for the derivative of the linear form of a vector
Derivative of a vector in quadratic form
(Example)
1.3 Linear regression model
Prepare H functions φ1(x), φ2(x), …, φH(x)
ŷ (the value of y predicted from x) = ω0 + ω1φ1(x) + ω2φ2(x) + … + ωHφH(x)
e.g., H = 2 with φ1(x) = x and φ2(x) = x²
ŷ (the value of y predicted from x) = wᵀφ(x)
Example
Basis vectors and feature vectors
φ(x) = (1, x, x², sin x)ᵀ
ŷ (the value of y predicted from x) = wᵀφ(x)
The model is linear in the parameter w: a linear regression model
ŷ = ω0φ0(x) + ω1φ1(x) + … + ωHφH(x)
Basis functions
φ0(x), φ1(x), …, φH(x)
Feature vector
φ(x) = (φ0(x), φ1(x), …, φH(x))ᵀ
Solution to linear regression model
Example
φ(x) = (1, x, x², sin x, cos x)ᵀ
If x is the string “tokyo”, we can take as basis functions things like “the length of the string x”, “whether x starts with a capital letter (1 if it does, 0 if it does not)”, and “whether x is a verb”; then φ(x) = (5, 1, 0)ᵀ.
1.4* Ridge regression
There are cases where the solution cannot be computed well
The inverse matrix (ΦᵀΦ)⁻¹ does not exist
Computation becomes unstable
This happens when some i-th element of φ(x) equals, or is a constant multiple of, another j-th element at all data points
The columns of Φ are then no longer linearly independent
Example
y = wᵀx = ω0 + ω1x1 + ω2x2
x1ᵀ = (2, 4), x2ᵀ = (3, 6), x3ᵀ = (4, 8)
Matrix representation
X
XᵀX
No inverse matrix exists.
When it is not a perfect constant multiple but almost the same
Inverse matrix
The inverse becomes unstable, and the absolute values of the ω become large
Extremely large absolute values of the coefficients wi are undesirable
Regularization
When optimizing a vector of coefficients, provide constraints to ensure that the coefficients do not take extreme values
|w|² = wᵀw
E = |y − Xw|² + αwᵀw
Solution to Ridge Regression
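As a sketch of what the ridge solution amounts to, the closed form w = (ΦᵀΦ + αI)⁻¹Φᵀy can be coded directly; this is a minimal illustration with an invented collinear data set (the third feature column is exactly twice the second), not the book's example.

```python
import numpy as np

def ridge_regression(Phi, y, alpha=1.0):
    """Closed-form ridge solution w = (Phi^T Phi + alpha I)^-1 Phi^T y."""
    H = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(H), Phi.T @ y)

# Collinear design matrix: the third column is twice the second, so Phi^T Phi is singular
Phi = np.array([[1.0, 2.0, 4.0],
                [1.0, 3.0, 6.0],
                [1.0, 4.0, 8.0]])
y = np.array([1.0, 2.0, 3.0])

print(ridge_regression(Phi, y, alpha=0.1))  # the alpha*I term makes the system solvable
```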
Column: Correlation Coefficients and Regression Models
HSIC (Hilbert-Schmidt Independence Criterion)
Capture nonlinear correlations with high accuracy
The kernel method allows us to define kernels not only for real values and vectors, but also for common data structures such as strings, graphs, and trees, so we can capture correlations between them.
Kernels can also be used for Gaussian processes.
Chapter 2 Gaussian Distribution
2.1 What is Gaussian Distribution?
Overview
The “goodness of fit” of a regression model has so far been defined in an ad hoc, top-down way via the squared error
Consider linear regression models as stochastic models
To think of it as a stochastic model is to give the probability that data will be generated
Regression model predicts y given x
With what probability does y take each value?
Actual data does not all ride on a straight line or plane, but deviates slightly from it
Assume the deviation ε follows a Gaussian distribution p(ε) = N(0, σ²) with mean 0 and variance σ², the most standard model of error
Probability density of Gaussian distribution
Mean μ, variance σ2
Shape
The leading constant is a normalization constant to make the integral over x into a probability density of 1
The body of the probability density function is exp(⋅)
Using the symbol ∝ for proportionality, except for the constant
Normal distribution with mean 0 and variance 1, in particular, standard normal distribution
When the shape of the error distribution is unknown, the normal distribution is widely used as a model for error distribution
Even for mutually independent errors that do not necessarily follow a normal distribution,
the central limit theorem implies that their sum approaches a normal distribution.
Even when the error distribution is known not to be normal, the normal distribution is often still a reasonable approximation.
When only the mean and variance are specified, the normal distribution is the least informative (= maximum entropy) choice
Stable and good approximations are obtained even when sample size is small
Unless there is a special reason, “normal distribution” is a model that should be treated as standard
There is an ingenious way to create a robust regression model against outliers by assuming a Cauchy distribution for the error distribution.
2.1.1 Sampling from a Gaussian distribution
A sample x from the standard normal distribution N(0,1) can be generated by a function rand() that generates uniform random numbers from (0,1) (Box-Muller method)
Formula
Sample
To obtain a sample from the Gaussian distribution N(μ, σ²) with mean μ and variance σ², scale x by σ (the standard deviation) and shift by μ
x' = μ + σx
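A minimal sketch of the Box-Muller recipe described above, assuming only a uniform random number generator; the seed and sample size are arbitrary.

```python
import numpy as np

def box_muller(n, seed=0):
    """Turn uniform (0,1) random numbers into standard normal samples (Box-Muller)."""
    rng = np.random.default_rng(seed)
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

# Scale by sigma and shift by mu to get samples from N(mu, sigma^2)
mu, sigma = 1.0, 2.0
samples = mu + sigma * box_muller(10000)
print(samples.mean(), samples.std())   # close to mu = 1 and sigma = 2
```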
2.1.2 Stochastic model for linear regression
The mean of y given x is a linear regression
wᵀx
The actual value of y is the sum of this and the noise ε
y = wᵀx + ε
p(ε) = N(0, σ²)
When applied to a linear regression model
p(y) = N(wᵀx, σ²)
Given a pair of N data points D
For X and y
If the N pairs (xn, yn) in D are independent of each other, then the probability of predicting y from X is
Since log x is a monotonically increasing function, maximizing p(y|X) is equivalent to maximizing log p(y|X)
If we take the logarithm of the above equation
Same as linear regression with squared error
2.2 Prior distribution of weights and ridge regression
Ridge regression, which avoids extreme values in the components of the regression coefficient w, can also be derived from a probabilistic model
Avoid extremely large absolute values for each element wi of w
Assume each wi independently follows a normal distribution with mean 0 and variance λ²
Ridge regression equation
The value of α is meant to determine the balance between the observed noise and the coefficients with respect to fitting the training data.
2.3 Multivariate Gaussian distribution
Overview
About the multivariate Gaussian distribution
Probability density function for multivariate Gaussian distribution
A D-dimensional vector x = (x1, …, xD)ᵀ follows a Gaussian distribution N(x|μ, Σ) with mean μ and covariance matrix Σ
μ = (μ1, …, μD)ᵀ is the mean vector representing the expected value of x
μ = 𝔼[x]
Σ is the D×D covariance matrix
Its (i, j) element represents the covariance of xi and xj
Σij = 𝔼[(xi − 𝔼[xi])(xj − 𝔼[xj])] = 𝔼[xi xj] − 𝔼[xi]𝔼[xj]
Collecting these for all pairs (xi, xj) gives the elements of Σ
In matrix form, the covariance matrix can be written compactly as Σ = 𝔼[xxᵀ] − 𝔼[x]𝔼[x]ᵀ
Since the covariance matrix Σ always appears in the form Σ⁻¹ in the Gaussian distribution,
Λ = Σ⁻¹ is called the precision matrix
The distribution is sometimes parameterized by (μ, Λ) instead of (μ, Σ)
Σ is the covariance of (x − μ)
The determinant |Σ⁻¹| represents the volume of the space spanned by Σ⁻¹
Normalization constant at the top
Since μ only translates the distribution, the shape of the Gaussian distribution is determined by Σ
Function Form Example
Corresponding Equation Example
Covariance matrix of multivariate Gaussian distribution
If Σ is a diagonal matrix
Any two dimensions are independent of each other, and the probability density of the multivariate Gaussian distribution is the product of the one-dimensional Gaussian distribution
If Σ is not a diagonal matrix
The distribution of x is correlated among its components
Example 1 Positive correlation
Generate 100 points from N(0,Σ)
Strong correlation between x1 and x2
Example 2 Negative correlation
100 points are generated from N(0,Σ)
Example 3 No correlation
2.3.1 Multivariate Gaussian distribution and linear transformation
A vector x ~ N(0, Σ) ∝ exp(−½ xᵀΣ⁻¹x) following a Gaussian distribution with mean 0,
when linearly transformed by a matrix A as y = Ax, still follows a Gaussian distribution
2.3.2 Sampling from a multivariate Gaussian distribution
A sample y ˜ N(0,Σ) from a multivariate Gaussian distribution with mean 0 and covariance Σ is
In MATLAB and Octave, it can be generated with “mvnrnd”
In R, generated by mvrnorm in the MASS package
To generate the sample yourself, first compute the Cholesky decomposition Σ = LLᵀ.
Cholesky decomposition
Decompose the symmetric matrix Σ using a lower triangular matrix, i.e., a matrix L whose entries are nonzero only on and below the diagonal
“chol” in MATLAB, Octave, and R
To generate a vector y according to N(0, Σ):
First, draw x = (x1, x2, …, xN)ᵀ with each element sampled from the standard normal distribution N(0, 1)
Then simply transform it as y = Lx
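A minimal NumPy sketch of this Cholesky-based sampling procedure, with an arbitrary 2×2 covariance matrix of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])          # target covariance (positive definite)

L = np.linalg.cholesky(Sigma)           # Sigma = L L^T, L lower triangular

x = rng.standard_normal((2, 1000))      # x ~ N(0, I), one column per sample
y = L @ x                               # y = L x ~ N(0, Sigma)

print(np.cov(y))                        # empirical covariance, close to Sigma
```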
2.3.3 Marginalization of the Multivariate Gaussian Distribution
When a vector x follows a multivariate Gaussian distribution, marginalizing out some dimensions of x leaves the remaining dimensions following a multivariate Gaussian distribution
Example: Marginalizing a 2-dimensional multivariate Gaussian distribution down to 1 dimension
When a D-dimensional vector x follows a multivariate Gaussian distribution x˜N(μ,Σ)
Let the first L dimensions of x be x1 and the remaining D-L dimensions be x2.
If μ and Σ are similarly partitioned in L dimensions
The distribution of x1 obtained by marginalizing p(x) = p(x1, x2) over x2 is
Formula 2.3 Marginalization of the multivariate Gaussian distribution
p(x1) = ∫ p(x1, x2) dx2 = N(μ1, Σ11)
Marginalizing over x2 amounts to simply ignoring the mean μ2 and the parts Σ12, Σ21, Σ22 of the covariance matrix.
2.3.4 Conditional distribution for multivariate Gaussian distribution
When a vector x follows a multivariate Gaussian distribution, the distribution obtained when some dimensions of x are fixed, i.e., when the distribution is “sliced” at a value in those dimensions, is also Gaussian
Example: In a two-dimensional Gaussian distribution p(x1, x2), the distribution p(x2|x1) of x2 when sliced at x1 = c (a constant)
The distribution is Gaussian, with a mean and variance that depend on c
Consider a vector x ~ N(μ, Σ) following a Gaussian distribution in D dimensions, partitioned into x1 and x2
The distribution of x2 when x1 is fixed (the conditional distribution of x2 obtained by slicing the joint distribution p(x1, x2) at x1)
Formula 2.4 Conditional distribution of the multivariate Gaussian distribution
p(x2|x1) = N(μ2 + Σ21Σ11⁻¹(x1 − μ1), Σ22 − Σ21Σ11⁻¹Σ12)
Meaning of the formula
Mean of the conditional distribution: μ2 + Σ21Σ11⁻¹(x1 − μ1)
Given an observed value x1, this is the new mean of x2, which is correlated with it
The mean of the conditional distribution is the original mean μ2 plus the deviation of x1 from μ1, (x1 − μ1), transformed by the matrix Σ21Σ11⁻¹
The precision matrix Σ11⁻¹ of x1 is
Larger when the variance of x1 is small, i.e., when the observation is accurate
Smaller when the variance of x1 is large, i.e., when the observation is hazy
The larger Σ11⁻¹ is (the more accurate the observation), the more the deviation x1 − μ1 is magnified
It is then projected into the space of x2 by the covariance Σ21 between x1 and x2
The larger the variance (the less accurate the observation), the less information x1 carries, and the less x2 is updated
Conditional variance: Σ22 − Σ21Σ11⁻¹Σ12
Once an observation x1 is obtained, the variance is reduced from the original Σ22 by Σ21Σ11⁻¹Σ12
When the precision Σ11⁻¹ of x1 is high, a large amount is subtracted; when it is low, only a small amount is subtracted
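The conditional mean and covariance in Formula 2.4 translate directly into code; below is a minimal sketch for a 2-dimensional example, with the partition indices and numbers chosen by me.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx1, idx2, x1):
    """p(x2 | x1) for x ~ N(mu, Sigma), following Formula 2.4."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    A = S21 @ np.linalg.inv(S11)          # Sigma_21 Sigma_11^-1
    mean = mu2 + A @ (x1 - mu1)           # mean shifted by the observed deviation
    cov = S22 - A @ S12                   # variance reduced by the observation
    return mean, cov

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 2.0]])
mean, cov = gaussian_conditional(mu, Sigma, [0], [1], x1=np.array([1.5]))
print(mean, cov)
```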
Chapter 3 Gaussian Processes
Overview
A Gaussian process is obtained by integrating out the weights of a linear regression model
It is a Gaussian distribution of infinite dimension
Gaussian process is a probability distribution that generates random functions
3.1 Linear regression models and the curse of dimensionality
Representation of functions by basis functions and their weighting
Problem
Only usable when the dimension of input x is small
The number of basis functions required grows exponentially as the input dimension increases
Curse of dimensionality
3.2 Gaussian process
Overview
How can we achieve a flexible regression model even when x is high dimensional?
Integrate out (take the expectation over) the parameter w of the linear regression model.
For an input x, let φ(x) = (φ0(x), φ1(x), φ2(x), …, φH(x))ᵀ be the feature vector of x
Consider a linear regression model with a weight w = (w0, w1, …, wH)ᵀ for each feature
ŷ = wᵀφ(x) = w0φ0(x) + w1φ1(x) + … + wHφH(x)
In matrix form
ŷ = Φw
Φ: design matrix
Given x1, …, xN, Φ is a constant matrix
w: weight vector
Follows a Gaussian distribution
y ~ N(0, λ²ΦΦᵀ)
w is integrated out and disappears.
No matter how high the dimension of x or φ(x) is, there is no need to find high-dimensional weights; the distribution of y is determined by the covariance matrix λ²ΦΦᵀ, which depends on the number of data points N
ŷ = (ŷ1, …, ŷN)ᵀ is the fitted value that approximates the observed data y = (y1, …, yN)ᵀ
Let w be generated from a Gaussian distribution with mean 0 and covariance λ²I
w ~ N(0, λ²I)
Definition 3.1 Definition of Gaussian Process
For any set of N inputs (x1, x2, …, xN),
when the joint distribution p(y) of the corresponding outputs y = (y1, y2, …, yN) follows a multivariate Gaussian distribution,
the relationship between x and y is said to follow a Gaussian process.
Gaussian process is a kind of stochastic process
What is a stochastic process?
A stochastic process assigns a joint probability distribution to the set of random variables (y1, y2, …, yN) corresponding to any set of inputs (x1, x2, …, xN)
You can take N as large as you want.
Gaussian distribution of infinite dimension
3.2.1 Meaning of Gaussian process
What is the meaning of y ~ N(0, λ²ΦΦᵀ)?
Let K = λ²ΦΦᵀ be the covariance matrix
Writing φ(x) = (φ0(x), …, φH(x))ᵀ,
a constant multiple of the inner product of the feature vectors φ(xn) and φ(xn') of xn and xn' gives the (n, n') element of the covariance matrix K: Knn' = λ²φ(xn)ᵀφ(xn')
Large covariance between two variables in a multivariate Gaussian distribution
Easy to get similar values
Knn’ is large
Xn and Xn’ are similar in the space of feature vectors
Corresponding yn and yn’ also have similar values
Property 3.2 Intuitive properties of Gaussian processes
If input x is similar, output y is also similar
Input x is similar
Large inner product through feature vector Φ(x)
Output y is similar
Easy to get close values due to large covariance in Gaussian distribution
Formula 3.3 Gaussian process with mean 0
y ˜ N(0,K)
Observed data can be set to zero mean by subtracting the mean in advance, so consider this formula first
3.2.2 Kernel Trick
The distribution of y is determined only by the elements of the covariance matrix K: Knn' = φ(xn)ᵀφ(xn')
No need to find feature vectors explicitly
Kernel function
k(xn, xn') = φ(xn)ᵀφ(xn')
kernel matrix or Gram matrix of Φ
Covariance computed from the kernel function k(xn, xn')
Define the kernel function as k(x, x') = (xᵀx' + 1)²
When x = (x1, x2)ᵀ and x' = (x'1, x'2)ᵀ
k(x, x') = (x1x'1 + x2x'2 + 1)²
Expanding this defines a feature vector such that
k(x, x') = φ(x)ᵀφ(x')
No need to write down Φ(x) to get the result
Kernel trick
Calculate the inner product using only the kernel function without directly expressing the feature vector Φ(x)
Even if Φ(x) is infinite dimensional, it can be easily calculated by the kernel trick
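To make the kernel trick concrete, the sketch below checks numerically that k(x, x') = (xᵀx' + 1)² agrees with the inner product of one explicit feature map for 2-dimensional inputs; the particular feature map is a standard expansion I supply for illustration.

```python
import numpy as np

def kernel(x, xp):
    """Polynomial kernel k(x, x') = (x^T x' + 1)^2 -- no feature vector needed."""
    return (x @ xp + 1.0) ** 2

def phi(x):
    """An explicit feature map whose inner product reproduces the same kernel (2-D input)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x  = np.array([0.3, -1.2])
xp = np.array([2.0,  0.5])
print(kernel(x, xp))        # kernel trick
print(phi(x) @ phi(xp))     # same value via the explicit feature vector
```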
K is the covariance matrix of the Gaussian distribution
In order for it to be a Gaussian process
K must be a symmetric, positive definite matrix whose eigenvalues are all positive, and it therefore has an inverse K⁻¹
3.2.3 Definition of Gaussian process
A Gaussian process is a box that outputs a function f(·)
An infinite-dimensional Gaussian distribution
Determining the function f means determining its output values f(x1), f(x2), f(x3), … for every input x1, x2, x3, …
Conceptual Diagram
For the infinitely many points x1, x2, x3, … in the input space X, we can compute a kernel matrix K representing the similarity between them
From the Gaussian distribution with this as its covariance matrix, an infinite-dimensional vector f = (f(x1), f(x2), f(x3), …) is obtained
For the infinite set of inputs X = (x1, x2, x3, …), the infinite-dimensional vector f = (f(x1), f(x2), f(x3), …) follows a Gaussian process, written f ~ N(μ, K) with mean μ
Of these, only a finite number of observed values f(x1), f(x2), …, f(xN) are needed, and their distribution is a finite-dimensional Gaussian distribution
Marginal distribution
A probability distribution in which some random variables are eliminated from the joint probability distribution
A conditional probability distribution restricts certain random variables to specific values
Creating a marginal distribution from a joint probability distribution is called marginalization.
Example
Joint probability
Probability that two events will both occur
P(X∩Y) = P(X,Y)
Marginal probability
Probability of one event alone without regard to other events
P(X) or P(Y)
Summing the joint probabilities over all values of Y to obtain P(X) is called marginalizing over Y
Apply the sum rule (addition theorem)
Example: Gaussian process on 3 points X=(x1,x2,x3)
An ellipsoidal Gaussian distribution N(0, K), determined by the proximity (= covariance) between x1, x2, x3, is obtained, and f = (f(x1), f(x2), f(x3)) is drawn as a sample from it.
Definition 3.4 Exact definition of a Gaussian process
For any natural number N and any inputs x1, x2, …, xN ∈ X, if the vector of corresponding outputs f = (f(x1), f(x2), …, f(xN)) follows a multivariate Gaussian distribution with mean μ = (μ(x1), μ(x2), …, μ(xN)) and covariance matrix K with Knn' = k(xn, xn'), we write f ~ GP(μ(x), k(x, x')).
Kernel Method and Gaussian Processes
Gaussian process is a kernel method from the standpoint of Bayesian statistics
Given a kernel function and a mean function, a Gaussian process, which is a prior distribution of f, can be determined.
Posterior distribution of f can be obtained if data is sought
The predictive distribution based on a Gaussian process is
Expected values are consistent with the results of kernel ridge regression
The variance is added to this, and it becomes a Gaussian distribution.
Gaussian equivalent of SVM (Support Vector Machine) for classification
RVM (Relevance Vector Machine)
The RVM is a strong classifier that uses fewer support (relevance) vectors and can outperform the SVM
The RVM is in principle computationally expensive
Unlike SVM, the training problem is not convex
Not widely used
3.2.4 Samples from Gaussian Processes
Gaussian kernel or radial basis function (RBF) kernel
The kernel function most often used for similarity between inputs x in a Gaussian process
Θ1,Θ2∈R are parameters for determining the properties of the kernel function
Outline
RBF kernel is a function whose values decay exponentially with the distance between x and x’, like a Gaussian distribution
The result of randomly generating samples f using this RBF
Changing the sample interval also changes the value of the covariance matrix
A sample f from the Gaussian process has an infinite number of points (= it is infinite-dimensional), which when lined up form a curve
A portion of it can be extracted as shown in the figure on the right.
Such flexibility is not available in conventional parametric models, such as linear regression models with a small number of parameters
Two-dimensional Gaussian process generated using RBF kernel
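As a sketch of how such sample functions are generated in practice, the code below draws a few curves from a GP prior with an RBF kernel on a finite grid (the grid stands in for the infinitely many inputs); the kernel parameterization θ1·exp(−|x − x'|²/θ2) and the jitter term are my assumptions.

```python
import numpy as np

def rbf_kernel(x, xp, theta1=1.0, theta2=1.0):
    """RBF kernel k(x, x') = theta1 * exp(-|x - x'|^2 / theta2) for 1-D inputs."""
    d2 = (x[:, None] - xp[None, :]) ** 2
    return theta1 * np.exp(-d2 / theta2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 100)                     # finite grid of input points
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))    # kernel matrix plus jitter for stability
L = np.linalg.cholesky(K)

f = L @ rng.standard_normal((len(x), 3))        # three sample functions from GP(0, k)
print(f.shape)                                  # (100, 3): each column is one random curve
```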
3.3 Gaussian Processes and Kernels
3.3.1 RBF kernel and basis functions
Introduction.
Curves by Gaussian processes with RBF kernels, or hypersurfaces in general, can be viewed as linear models with an infinite number of weighted basis functions
In a Gaussian process, w can be integrated out, and the resulting process can be represented in finite dimensions using only the locations of the data
By linear summation of the functions at a given location of the data, the posterior distribution is expressed as a Gaussian distribution
Kernel and function form
The posterior expectation of a Gaussian process is a weighted sum of the kernel functions used by the function
General properties of Gaussian processes, not just RBF kernels
Expected value formula
The expected value is a weighted sum of the kernel functions k(x*, ·) evaluated at the training data points, viewed as functions of the prediction point x*
3.3.2 Various kernels
In addition to the RBF kernel, there are various kernel functions that express the proximity of x and x’
Linear kernel
k(x,x’)=xTx’
Sample from Gaussian process
It means that the feature vector is an identity map (a transformation that returns itself as it is) Φ(x)=x
Equivalent to multiple regression if x contains a constant x0=1 corresponding to the bias term
Exponential kernel
k(x,x’)=exp(-|x-x’|/θ)
Sample from Gaussian process
Brownian motion and related
Also called Ornstein-Uhlenbeck process
Periodic kernel
k(x,x’)=exp(θ1cos(|x-x’|/θ2))
Sample from Gaussian process
It can be viewed as the angular coordinate of a point moving on the unit circle in polar coordinates
In a periodic kernel, even if the distance |x − x'| between x and x' is large, the kernel value is large whenever this distance is close to an integer multiple of the period 2πθ2, so such points are regarded as “similar” and the values of y become nearly periodic
Gaussian kernel
k(x,x’)=exp(-|x-x’|2/θ)
Sample from Gaussian process
Property 3.5 Gaussian Process Regression and Linear Regression
Gaussian process regression involves a linear regression model
Combining Kernels
The above kernels can also be combined
The sum and product of two kernels are again valid (positive semidefinite) kernel functions
Example
Combination of linear and periodic kernels
Regression with “overall linear relationship + periodicity” can be performed
Matern kernel
Other kernels
Kernels can be defined not only for real vectors, but also for strings, graphs, tree structures, etc.
A major advantage of the kernel method is that you can think in the space of a complex data structure by simply defining a similarity kernel structure on top of it.
Character kernel
Fisher kernel
HMM marginalized kernel
3.3.3 Observation Noise
Models with noise in the observed values
yn = f(xn) + εn
εn ˜ N(0,σ2)
For y = (y1, y2, …, yN)ᵀ, the probability distribution is
p(y|f) = N(f, σ²I)
Given X = (x1, x2, …, xN), the distribution of y is
p(y|X) = ∫ p(y, f|X) df = ∫ p(y|f) p(f|X) df = ∫ N(y|f, σ²I) N(f|μ, K) df
p(y|X) = N(μ, K + σ²I)
Convolution of two independent Gaussian distributions
By the convolution (reproductive) property of the Gaussian distribution (2.p.92), the result is still Gaussian
The covariance matrix is the sum of the covariance matrices of the two Gaussian distributions
y follows a Gaussian distribution with the new kernel function k'(xn, xn') = k(xn, xn') + σ²δ(n, n')
δ(n,n’) is a function that returns 1 when n=n’ and 0 otherwise.
3.4 Gaussian Process Regression Model
Introduction
How exactly do we solve regression problems based on Gaussian processes?
N observations (N pairs of inputs x ∈ 𝝌 and outputs y ∈ ℝ): D = {(x1, y1), (x2, y2), …, (xN, yN)}
There is a relationship y = f(x) between x and y
This function f is generated from a Gaussian process f ~ GP(0, k(x, x')) with mean 0
Noise in the observed values is also included in the kernel function
Letting y = (y1, y2, …, yN)ᵀ, this y follows a Gaussian distribution whose covariance can be computed with the kernel function k(x, x') over all pairs (xn, xn') of inputs
3.4.1 Predictive Distribution for Gaussian Process Regression
How can the value of y* at x* be computed?
The linear model predicts y* = wᵀφ(x*) using the weights w
For a Gaussian process, w has been integrated out and no longer exists
Consider the extended vector y' = (y1, …, yN, y*)
and let K' be the (N+1)×(N+1) kernel matrix computed from x1, …, xN and x*
y' ~ N(0, K')
When new data come in for the Gaussian process model
A vector of similarities (kernel function values) between the new input x* and the training data input
k* = (k(x*, x1), k(x*, x2), …, k(x*, xN))ᵀ
Similarity of x* to itself
k** = k(x*, x*)
When a vector following a Gaussian distribution is partitioned as above,
given one part y1 of the vector, the remaining part y2 is Gaussian: p(y2|y1) = N(μ2 + Σ21Σ11⁻¹(y1 − μ1), Σ22 − Σ21Σ11⁻¹Σ12)
When μ1 = μ2 = 0,
p(y2|y1) = N(Σ21Σ11⁻¹y1, Σ22 − Σ21Σ11⁻¹Σ12)
Formula 3.6 Predicted distribution for Gaussian process
p(y*|x*, D) = N(k*ᵀK⁻¹y, k** − k*ᵀK⁻¹k*)
The expected value of y* is the mean of the Gaussian distribution
Formula 3.7 Expected value of the predictive distribution of a Gaussian process
𝔼[y*|x*, D] = k*ᵀK⁻¹y
If there is more than one predictor
The inputs at which we want predictions are M points X* = (x1*, …, xM*)
and the outputs to predict are y* = (y1*, …, yM*)
Consider the joint distribution of y and y*
Formula 3.8 Predictive Distribution for Gaussian Processes (Multiple Prediction Points)
p(y*|X*, D) = N(k*ᵀK⁻¹y, k** − k*ᵀK⁻¹k*)
Prediction Results for Gaussian Processes
Calculation of Predicted Distribution
How is the expected value of the actual predictive distribution calculated?
The formula calculates the product of a matrix and a vector as in the above equation
We need the inverse K⁻¹ of the N×N kernel matrix K
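Putting Formula 3.6 together, a minimal sketch of the whole prediction step might look as follows; the RBF kernel, noise level, and toy data are my own choices.

```python
import numpy as np

def rbf(x, xp, theta1=1.0, theta2=1.0):
    d2 = (x[:, None] - xp[None, :]) ** 2
    return theta1 * np.exp(-d2 / theta2)

def gp_predict(X, y, Xs, kernel, sigma2=1e-2):
    """Mean and variance of p(y*|x*, D) = N(k*^T K^-1 y, k** - k*^T K^-1 k*)."""
    K   = kernel(X, X) + sigma2 * np.eye(len(X))   # training covariance, noise on the diagonal
    Ks  = kernel(X, Xs)                            # k*: similarity of test points to training points
    Kss = kernel(Xs, Xs)                           # k**
    mean = Ks.T @ np.linalg.solve(K, y)
    cov  = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, np.diag(cov)

X  = np.array([-2.0, 0.0, 1.0, 3.0])
y  = np.sin(X)
Xs = np.linspace(-3, 4, 50)
mean, var = gp_predict(X, y, Xs, rbf)
print(mean[:3], var[:3])
```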
3.4.2 Computing Gaussian Process Regression
Basic Algorithm
Basic Definition of Gaussian Process Regression
Training data consisting of input–output pairs (x1, y1), (x2, y2), …, (xN, yN)
Kernel function k(x, x') giving the similarity, i.e., the covariance of the Gaussian distribution, between inputs x and x'
Training data X = (x1, …, xN) and y = (y1, …, yN) are given
The output y* for a new test input x* is found as p(y*|x*, X, y) = N(k*ᵀK⁻¹y, k** − k*ᵀK⁻¹k*)
It is uniquely determined once the kernel function is given
In Gaussian processes, the only “learning” is the hyperparameters of the kernel
Simple Gaussian Kernel Example
Efficient algorithm for computing covariance matrices with Gaussian kernel
X is a D×N matrix whose columns are the N D-dimensional column vectors
X.^2 is the element-wise square of the matrix X
repeat(X, a, b) replicates the matrix X a times vertically and b times horizontally (repmat in MATLAB/Octave)
Gaussian kernel with Gaussian error
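The same vectorized trick can be written in NumPy; this is a sketch assuming a D×N data matrix X, and uses the identity |x − x'|² = |x|² + |x'|² − 2xᵀx' rather than any particular routine from the book.

```python
import numpy as np

def gauss_kernel_matrix(X, theta=1.0):
    """Gaussian kernel matrix for a D x N data matrix X, without explicit loops."""
    sq = np.sum(X ** 2, axis=0)                        # |x_n|^2 for every column
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X     # pairwise squared distances
    return np.exp(-d2 / theta)

X = np.random.default_rng(0).standard_normal((3, 5))   # D = 3, N = 5
K = gauss_kernel_matrix(X)
print(K.shape)                                         # (5, 5) kernel (covariance) matrix
```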
3.4.3 Elemental Representation of Gaussian Process Regression
Expected value of the predictive distribution of a Gaussian process
Let the precision matrix be Λ = K⁻¹
Using its elements Λnn',
the expected value of the Gaussian process prediction of y is a weighted sum of the training outputs yn, as in the formula above
Predictions are expressed as a weighted sum of training data
Similar to kernel methods such as SVM
Optimize the weights themselves
Gaussian process does not optimize the weights themselves (no learning except for the kernel parameters)
Representer theorem
(Neural) network-like representation of the predictions of a Gaussian process
y* depends not only on the single input x*, but on x1, …, xN as a whole.
3.5 Hyperparameter Estimation for Gaussian Process Regression
How can one go about estimating the hyperparameters of a Gaussian process?
If we collectively set the hyperparameters as θ=(θ1,θ2,θ3), the kernel depends on θ
Kernel Equation
The kernel matrix K computed from this kernel also depends on θ and is written Kθ
The probability of the training data at this time is
Taking the logarithm
Find the θ that maximizes this probability, i.e., the θ that maximizes the expression on the right
How to find
MCMC method, in which the search is conducted by changing θ slightly so as to maximize the likelihood function
Use the gradient method
∂Kθ/∂θ is the matrix obtained by differentiating each element of the kernel matrix Kθ by the parameter of interest θ ∈ θ
Hyperparameters can be optimized using optimization routines such as SCG and L-BFGS methods.
The SCG (scaled conjugate gradient) method is an optimization method that has been used in the optimization of Gaussian processes
More stable than BFGS and L-BFGS methods
It is in the MATLAB and python libraries.
Example calculation with L-BFGS method
As the data is increased, the optimal hyperparameters also change, resulting in a more flexible regression function that better fits the data.
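As a rough sketch of this kind of hyperparameter optimization, the code below maximizes the log marginal likelihood of an RBF-plus-noise kernel with scipy's L-BFGS-B, using numerical gradients for brevity (the book works with the analytic gradient ∂Kθ/∂θ); the parameterization θ = (θ1, θ2, θ3) with θ3 as the noise variance, and the toy data, are my assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(x, theta1, theta2):
    d2 = (x[:, None] - x[None, :]) ** 2
    return theta1 * np.exp(-d2 / theta2)

def neg_log_marginal_likelihood(log_theta, x, y):
    """-log p(y|x, theta) for theta = (theta1, theta2, theta3); theta3 is the noise variance."""
    t1, t2, t3 = np.exp(log_theta)              # optimize in log space to keep parameters positive
    K = rbf(x, t1, t2) + t3 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^-1 y via Cholesky
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

x = np.linspace(0, 5, 20)
y = np.sin(x) + 0.1 * np.random.default_rng(0).standard_normal(20)
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x, y), method="L-BFGS-B")
print(np.exp(res.x))    # optimized (theta1, theta2, theta3)
```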
Method 3.9 Optimization of Hyperparameters for Gaussian Processes
Hyperparameter Estimation and Local Solutions
In gradient methods, θ is generally a local solution, not necessarily a true global optimum
In such cases, MCMC can be used to search, avoiding local solutions and approaching the global optimal solution
Kernel Selection
Kernels can be combined
Example
The weights θ1, θ2, θ3 and error variance θ5 of each kernel can be learned at the same time
No need to pre-select one specific kernel
Example
Computation of 100m track and field world record data with multiple kernel models
Overall trend of shorter times → linear kernel
A Gaussian kernel captures the fine-scale variation, and a Bayesian regression whose confidence (variance) reflects the amount of data is obtained.
Probability of training data
EVIDENCE
3.6* Generalization of Gaussian Process Regression
Introduction
Further generalization
Variations on the observation model
Observation model in which the observed value y is generated from the function value f by a Gaussian process
Expressed by conditional probability p(y|f)
Assumes Gaussian distribution in observation model
p(y|f)=N(f, σ2I)
Observation model may not be Gaussian
Cannot be solved analytically
Approximate inference such as MCMC, Laplace approximation, Expectation Propagation (EP) method, variational Bayesian, etc. is required
3.6.1 Robust Gaussian Process Regression
Cauchy distribution
Much thicker than Gaussian distribution
Dealing with outlier
Cauchy(x|x0, γ) = 1/(πγ[1 + ((x − x0)/γ)²])
Gaussian regression model with outliers
With Gaussian error, the Gaussian process regression function is dragged toward outliers
With Cauchy error, it is not dragged by outliers
Cauchy distribution is robust (robustus) against outliers
posterior distribution
Cannot include an error term in the kernel because it is not a simple Gaussian distribution
Approximate inference methods such as variational Bayesian and MCMC methods are needed to estimate F
Example of f sampled by MCMC using elliptical slice sampling
Assuming that the standard deviation σ of the Gaussian distribution further follows a gamma distribution
Distribution extended by taking the expected value for σ
Cauchy distribution is a special case of t-distribution with 1 degree of freedom
A similar extension of the Gaussian process, the Student-t process, has also been proposed.
Probabilistic model that allows f itself to be an outlier rather than an error
3.6.2 Gaussian Process Identification Model
If the observed value y is binary data such that some event has occurred (1) or not (0)
Denote the probability p(y=1|f) that y = 1 given f as p(y=1|f) = σ(f)
Such a model using a Gaussian process is called a “Gaussian process identification (classification) model”.
σ(x) is an S-shaped function
Logistic function for Gaussian process identification model
The most straightforward probability model is the cumulative distribution function of the normal distribution, also called the probit function
Since the probit function includes the integral of the normal density, the logistic function is widely used instead
σ(1.7x) is a function very close to Ψ(x)
The error σ(1.7x) − Ψ(x) is less than 0.01 over the entire range
The distribution of f given Y is
It is not a simple Gaussian distribution
Approximate inference required
Estimation results of the Gaussian process identification model computed using the EP method in the toolkit GPy
Making the Gaussian process identification model sparse by selecting a small number of relevance vectors yields the
RVM (Relevance Vector Machine)
Gaussian process identification models are less advantageous when you simply want to classify because of the complexity of both the model and the estimation.
Necessary when f is obtained in combination with other probabilistic models and the observed value y is obtained based on it
3.6.3 Poisson regression model
When the observed value y is a non-negative integer such as 2, 4, 0, 1, …, as with the number of machine failures or the decay of elementary particles
Poisson distribution
The Poisson distribution is a probability distribution over the discrete values y = 0, 1, 2, 3, …
The expected value equals the real-valued parameter λ: 𝔼[Po(y|λ)] = λ
Example of Poisson distribution
The Poisson distribution describes the total number of events that occur when there are many opportunities for an event that has a small probability of occurring, such as the decay of a subatomic particle.
Suppose we can write λ = e^f
p(y|λ) = Po(y|λ) = λ^y e^(−λ)/y!, so p(y|f) = (e^f)^y e^(−e^f)/y! = exp(fy − e^f)/y!
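A minimal sketch of this Poisson observation model with the log link λ = e^f; the particular y and f values are arbitrary.

```python
import numpy as np
from scipy.special import gammaln

def poisson_log_likelihood(y, f):
    """log p(y|f) with lambda = exp(f): sum over n of (f_n y_n - exp(f_n) - log y_n!)."""
    return np.sum(y * f - np.exp(f) - gammaln(y + 1))

y = np.array([2, 4, 0, 1])                      # non-negative counts
f = np.log(np.array([2.0, 3.0, 0.5, 1.0]))      # latent function values, lambda = exp(f)
print(poisson_log_likelihood(y, f))
```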
Posterior distribution is not Gaussian
Example: Data representing the number of individuals of a particular plant in 50 one-dimensional, side-by-side intervals, estimated using a Poisson distribution and the Laplace approximation by a Gaussian process.
MCMC here searches for (low-dimensional) hyperparameters of a Gaussian process model
Column: Neural Networks and Gaussian Processes
A neural network with one hidden layer is equivalent to a Gaussian process in the limit where the number of hidden units → ∞
Considering Gaussian processes instead of neural networks eliminates the need to optimize many weights as in neural networks
Unlike neural networks, where it is hard to predict what will be learned, prior knowledge about a problem can be expressed through kernel functions, and objects that are not plain vectors, such as time series and graphs, can be handled transparently.
Distribution of the output for the input (x1, x2) = (−0.2, 0.4) when the parameters of the neural network are randomly sampled from a prior distribution.
Each point represents one neural network
As the number of hidden elements H increases, the distribution of output approaches a Gaussian distribution
When the network is multi-layered, as in deep learning
By recursively computing the covariance matrix, an equivalent Gaussian process (NNGP, Neural Network GP) can be obtained
It was reported that the analytically derived NNGP achieves performance comparable to neural networks, without explicitly learning weights by SGD or similar methods.
For a neural network to converge to a Gaussian process, the weights only need to be drawn independently from a distribution with finite variance.
Chapter 4: Stochastic Generative Models and Gaussian Processes
Introduction.
Stochastic Generative Model
Hypothesis: “Observation Y in the real world is obtained by sampling Y˜p(Y) from some probability distribution p(Y).
Most basic model
Y = f(x) + ε
Input x fixed as constant
The function f(·), the observed value y, and the observation noise ε are random variables
The basic idea behind stochastic generative modeling
The introduction of the likelihood function reflects the shift in thinking to “the observed value y in front of us is a random variable,” and is the basis of maximum likelihood estimation.
The introduction of prior and posterior probabilities reflects the shift in thinking to “what we want to know (the hidden variable f or the unknown parameter θ) is a random variable,” and is the basis of Bayesian estimation.
4.1 Stochastic variables and stochastic generative models
4.1.1 Stochastic variable X and probability distribution p(X)
Random variable
Variable whose value is determined stochastically on a trial-by-trial basis
Example: Let the random variable X be the ideal roll of the dice
The set of values for taking X is {1,2,3,4,5,6}.
The probability of each occurrence is p(X=1)=1/6,… ,p(X=6)=1/6
The sum of the probabilities in all cases is 1
Probability distribution
In general
The properties of a random variable X are determined from the “probability distribution p(x)” that defines the set of possible values for X and the probability that X will take the given value.
The probability distribution p(X) determines the probability p(X=x) of occurrence of a realization x∈X of X for all realizations
Example
The random variable X is the distance from the center of the target to the point where a dart thrown at the center lands.
Possible values for realizations are real numbers x ∈ ℝ
The probability of occurrence of x is represented by the probability density function p(x)
The probability that x falls within a specific range is represented by the integral
The integral of the probabilities for all possible values of the random variable is 1
Probability density function of one-dimensional Gaussian distribution with mean μ and variance σ2
Stochastic process
Generator that stochastically generates the function f(x)
Definition 4.1 Stochastic process
For any N input values x1, …, xN ∈ X,
if the N output values fN = (f1, …, fN) = (f(x1), …, f(xN)), f(xn) ∈ Y, have a joint probability p(fN) = p(f1, …, fN),
this relation f(·) is called a stochastic process
where N is any natural number, and X and Y are the set of possible values for the input and output values, respectively
Definition 4.2 Gaussian process
If, for a stochastic process f(·), the joint probability p(f1, …, fN) is always obtained as an N-dimensional Gaussian distribution,
f(·) is called a Gaussian process
Memo 1.
If you want to emphasize that the probability distribution of variable x is a conditional prior distribution that depends on the parameter μ
p(x|μ) = N(x|μ, σ²), or x|μ ~ N(μ, σ²)
4.1.2 Joint probability p(X,Y) and marginalization
Consider the joint distribution of multiple random variables
Example: Write p(X,Y) for the joint distribution of random variables X and Y
Probability distribution when the combination of two random variables X and Y (X,Y) is considered as one new random variable
Same for three or more
The set of possible values for X is 𝓧, the set of possible values for Y 𝓨
The set of possible values for the combination (X, Y) of X and Y is 𝓧 × 𝓨
The direct product (Cartesian product) of 𝓧 and 𝓨
(X, Y) ∈ 𝓧 × 𝓨
The same holds for three or more variables.
Definition 4.3 Marginalization and Marginal Distribution
From the probability density function p(X,Y) of the joint distribution of multiple random variables X, Y,
the operation of obtaining the probability density function p(X) for X by the following integral is called marginalization
p(X) = ∫ p(X,Y)dY
The resulting probability distribution p(X) is called the marginal distribution
This is also described as “marginalizing (integrating) over Y” or “eliminating Y by marginalization”.
When the set of possible values 𝓨 for Y is discrete, sum all of them instead of integrating
Definition 4.4 Conditional Distribution
The conditional distribution p(Y|X) is
Probability distribution of Y when the value of X is known or given
Condition X need not necessarily be a random variable
There is no normalization constraint with respect to the condition X.
When p(Y|X) is the probability density function of Y, ∫ p(Y|X)dY =1 always holds
Relationships that can be established between conditional, simultaneous, and marginal distributions
P(Y|X)p(X) = p(X, Y) = p(X|Y)p(Y)
Definition 4.5 Bayes’ Theorem
p(Y|X) = p(X|Y)p(Y) / p(X)
Example: Chained dice dart model
Roll a die and determine the target position μ by the known functional relation μ = μ(d) from the roll d. Throw a dart at that target and let x ∈ ℝ be the horizontal coordinate of the point where it lands.
How do we plot the probability distribution of x in the form of a probability density function p(x)?
The process of generating x is determined by the probability distribution p(d) of the dice roll and the probability distribution p(x|d) of the position coordinates of the dart throwing result.
Probability distribution of dice roll
A uniform distribution with p(d=1) = 1/6, …, p(d=6) = 1/6
Sticking position of a dart thrown aiming at position μ
follows a normal distribution centered on μ with an appropriate variance σ²
p(x|μ) = N(x|μ, σ²)
The position coordinate μ of the dart target depends on the roll of the die
p(x|μ) = p(x|μ(d)) = N(x|μ(d), σ²)
Simultaneous distribution p(x,d) of X and D
Product of two probability distributions
p(x,d)=p(x|d)p(d)
The density function p(x) is obtained by marginalizing the joint probability p(x, d) with respect to d
Specific Density Functions
A probability distribution that can be written as a weighted average of multiple Gaussian distributions
Gaussian mixture distribution
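A minimal sampling sketch of this chained dice-dart model, in which the marginal p(x) becomes a mixture of six Gaussians; the mapping μ(d) and the value of σ are placeholders of my own.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
mu_of_d = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])   # assumed target position mu(d) for d = 1..6

d = rng.integers(1, 7, size=10000)                    # roll the die: d ~ uniform over {1,...,6}
x = rng.normal(loc=mu_of_d[d - 1], scale=sigma)       # dart position: x | d ~ N(mu(d), sigma^2)

# Forgetting d (marginalizing it out) leaves x distributed as a Gaussian mixture p(x)
print(x[:5])
```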
Procedure Abstraction
1 Represent unknown values in terms of random variables (e.g., x and d)
2 Represent the stochastic process by which the individual values are generated as probability distributions, respectively (e.g., p(d) and p(x|d))
3 Represent the simultaneous distribution of all random variables (e.g. p(x,d) =p(x|d)p(d))
4 Find the required marginal distribution (e.g., p(x))
probabilistic generative modeling
Work to represent the generative process of observed values in terms of probability distributions
Example 4 Gaussian process regression model
4.1.3 Independence: p(X,Y)=p(X)p(Y) and conditional independence: p(X,Y|Z)=p(X|Z)p(Y|Z)
When examining the relationship between multiple random variables, the most fundamental and important tangents are independence and conditional independence
Definition 4.6 Independence of random variables
When the following property holds between the joint and marginal distributions of the two random variables X and Y
Random variables X and Y are said to be independent
p(X,Y)=p(X)p(Y)
This relationship is equivalent to p(X)=p(X|Y)
Further equivalent to p(Y)=p(Y|X)
Definition 4.7 Conditional Independence of Random Variables
When two random variables X, Y and a condition Z satisfy the following relationship,
the random variables X and Y are conditionally independent under condition Z.
p(X, Y|Z) = p(X|Z)p(Y|Z)
Example 5 Independence or conditional independence of three or more random variables
When p(A,B,C|D,E)=p(A|D,E)p(B,C|D,E)
By renaming the combination of conditions (B,C) as F and the combination of conditions (D,E) as G
p(A,F|G)=p(A|G)p(F|G)
A and (B,C) are conditionally independent under condition (D,E)
Example 6 Independence in multivariate Gaussian distribution
Example 7 Conditional independence in Gaussian process regression
p(y|f) = p(y1|f) × ⋯ × p(yN|f)
p(y|f) = N(f, σ²IN)
Conditional independence subject to f
p(y) ≠ p(y1) × ⋯ × p(yN)
Independence between components does not hold for p(y)
Example 8: Independent Same Distribution and Conditional Independence
Obtaining multiple realizations from random variables with the same probability distribution by generating repeated samples
Independent and identically distributed (iid)
4.1.4 Graphical Model of Gaussian Process Regression Model
The graphical model is a notation for visualizing the presence or absence of independence and conditional independence among random variables in a graphical form.
Case of Example 5
Repeated random variables are abbreviated with plate notation in a graphical model
Example 9 Probability Generation Model for Linear Regression
Graphical model
Example 10 Probability-generating model for Gaussian process regression
Graphical model
The f are all connected by undirected (arrow-less) links.
4.2 Maximum Likelihood and Bayesian Estimation
4.2.1 Probabilistic Generative Models and Maximum Likelihood Estimation
Probabilistic generative model of observation Y
“An observation Y in the real world is obtained by sampling Y~p(Y) from some probability distribution p(Y).”
Stochastic generative models are hypotheses
There can be more than one probabilistic generative model for the same object.
p(Y)=P1(Y), p(Y)=P2(Y), …
The difference between hypotheses can be represented by a parameter θ, and the generative model by the conditional probability p(Y|θ)
p(Y|θ) is called a parametric model
Estimation of parameters
To determine the parameter θ based on the observation Y
Maximum likelihood estimation
For a probabilistic generative model p(Y|θ) with parameters θ,
a method of determining θ so that the likelihood function L(θ) = p(Y|θ) of the observed data is maximized.
Likelihood function L(θ)
In the probability function p(Y|θ), the observed data Y is treated as a constant and is regarded as a function of the parameter θ.
The parameter that maximizes the likelihood function is called the maximum likelihood estimate of the parameter
Arg maxθ L(θ) is the value of parameter θ such that the likelihood function L(θ) has the maximum value
Example 11 Maximum likelihood estimation of the mean of a Gaussian distribution with known covariance matrix
The N observed D-dimensional column vectors yn ∈ ℝᴰ, n = 1, …, N, are
Assumed to be generated from a Gaussian distribution model with mean μ and covariance matrix Σ, respectively
Stochastic generative model
yn ˜ N(μ, Σ)
The likelihood function can be written as a function of the unknown mean vector μ
Take the logarithm of both sides
The point where the derivative of the log-likelihood with respect to μ is zero (a stationary point)
Maximum Likelihood Estimator
Example 12 Maximum Likelihood Estimation of a Dart Throwing Model with Dice
Probabilistic Generative Model
What is the maximum likelihood estimate when the aimed dart positions θ = {μ(1), …, μ(6)} are unknown parameters?
Likelihood function
Log Likelihood Function
4.2.2 Probabilistic Generative Models and Bayesian Estimation
Bayesian estimation considers that “what we want to know (unknown parameter θ) is a random variable
In Bayesian estimation, “estimation” of the unknown parameter θ means updating the distribution of the random variable θ by obtaining an observation X
What is the distribution of a random variable?
Probability density function
The distribution p(θ) before the update is called the prior or apriori probability distribution
The distribution p(θ|X) after updating based on observation X is called the posterior or a posteriori probability distribution
Bayesian Estimation Procedure
Modeling the likelihood function.
Write down the process of stochastic generation of observation Y under unknown variables θ in terms of conditional probability p(Y|θ)
Let this be a function L(θ)=p(Y|θ) of the unknown variable θ
Modeling prior probabilities
Express the range of possible values for the unknown variable θ and the degree of its subjective likelihood in terms of a probability distribution p(θ)
Apply Bayes’ theorem to obtain the posterior probability p(θ|Y) of the unknown variable θ
p(θ|Y) = p(Y|θ)p(θ) / p(Y)
p(Y) is defined by p(Y) = ∫ p(Y|θ)p(θ)dθ
Once a realization of the observed value Y is given, this is a number independent of θ
p(Y) is called the normalization constant or marginal likelihood
Bayesian and maximum likelihood estimation are the same in that they attempt to know unknown parameters based on the likelihood function.
The main feature of Bayesian estimation is that unknown parameters and predicted distributions are obtained in the form of probability distributions.
Gaussian process method yields a Gaussian distribution with mean and covariance for unknown vector values
Example 13 Bayesian estimation of Gaussian distribution when the covariance matrix is known
Observed D-dimensional column vectors yn ∈ ℝᴰ, n = 1, …, N
Assume the above stochastic generative model is a Gaussian distribution with mean μ and covariance Σ
Of the parameters, Σ is known
Prior probability of unknown parameter μ is given by Gaussian distribution
yn ~ N(μ, Σ), n = 1, …, N
μ ˜ N(μ0, Σμ0)
μ0 is a known D-dimensional column vector and is the center of the prior distribution
The D×D symmetric matrix Σμ0 is also known and represents the covariance matrix of the prior distribution
The objective of Bayesian estimation is to find the posterior distribution p(μ|Y) of the unknown parameter μ
Bayes’ theorem (from the logarithmic representation of)
log p(μ|Y) = log p(Y|μ) + log p(μ) – log p(Y)
The first term on the right side is the log-likelihood function itself
The second term on the right side is the prior probability
The third term on the right side is a constant independent of μ
Variation of the above equation
Const summarizes constants independent of μ
The posterior probability p(μ|Y) is obtained in the form of a normal distribution N(μ,Σμ)
Inverse of variance
Precision
Inverse of the covariance matrix
Precision matrix
Gaussian distribution with precision matrix S=Σ-1 instead of covariance matrix Σ
Parameters of posterior probability
Apart from the term Sμ0 coming from the prior, the precision matrix Sμ of the posterior is N times the precision matrix S of the observation model: Sμ = Sμ0 + N·S
For each additional observation yn, the posterior precision matrix Sμ increases by S
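These update formulas for the posterior of the mean translate directly into code; below is a minimal sketch under the assumption Sμ = Sμ0 + N·S with posterior mean Sμ⁻¹(Sμ0μ0 + S Σn yn), using synthetic data of my own.

```python
import numpy as np

def posterior_of_mean(Y, Sigma, mu0, Sigma_mu0):
    """Posterior N(mu_hat, Sigma_mu) of the mean when the observation covariance Sigma is known."""
    N = Y.shape[0]
    S     = np.linalg.inv(Sigma)        # precision of the observation model
    S_mu0 = np.linalg.inv(Sigma_mu0)    # prior precision
    S_mu  = S_mu0 + N * S               # posterior precision grows by S per observation
    mu_hat = np.linalg.solve(S_mu, S_mu0 @ mu0 + S @ Y.sum(axis=0))
    return mu_hat, np.linalg.inv(S_mu)

rng = np.random.default_rng(0)
Sigma = np.eye(2)
Y = rng.multivariate_normal([1.0, -1.0], Sigma, size=50)
mu_hat, Sigma_mu = posterior_of_mean(Y, Sigma, mu0=np.zeros(2), Sigma_mu0=10.0 * np.eye(2))
print(mu_hat)       # close to the true mean (1, -1); Sigma_mu shrinks as N grows
```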
4.3 Representation of probability distributions
Introduction.
What does it mean to find a probability distribution?
What exactly do we need to do?
4.3.1 What is a nonparametric model?
There are two main ways to represent the probability distribution of a random variable by numerical values on a computer
Parametric methods
A probability distribution in which the probability distribution followed by the target of estimation is specified by a parameter of a certain dimension
Determine the probability distribution by finding this parameter
Representative
Gaussian distribution
Beta distribution
Assumes that the probability distribution follows a parametric probability distribution
Numerically obtaining the parameters is an estimation of the probability distribution
To express prior probability, likelihood probability, posterior probability, etc. in terms of parametric probability distributions
Parametric Probability Modeling
Nonparametric methods
A method of obtaining a probability distribution without assuming a known parametric probability distribution for the target of estimation
Typical example
Methods for obtaining histograms
Kernel density estimation method
K-nearest neighbor method
Gaussian process method is a kind of nonparametric method
No parametric probability distribution is assumed for the unknown function f(x) to be estimated.
4.3.2 Representing the probability distribution as a sample
A direct way to represent and visualize the probability distribution p(x) on a computer
If x is a scalar, it shows the functional shape of the density function p(x) and the distribution function F(x) = ∫_{−∞}^{x} p(t)dt
Indirect methods
Generate an appropriate number of samples x1, …, xN from the probability distribution and visualize them
Various representations of probability distribution (1D)
Various representations of probability distributions (2-D)
Utility of visualizing probability distributions with a finite number of samples
Probability density function cannot be written (or is difficult to write) in analytical form
It is easy to generate samples computationally
Example 14 Visualization of a probability distribution that cannot be written down in analytical form
A uniform distribution that produces the value d with equal probability from the range (-1,1) and a normal distribution with variance σ2 centered on d are connected by a chain system
d ˜ Unif(-1, 1)
x ˜ N(d, σ2)
What does the marginal distribution p(x) look like?
Visualization by sample
(A) Scatter plot of stochastic variable x
(B) Histogram of a sample of 1000 points
(C) Blue line is the result of kernel density estimation using Gaussian distribution kernel Red dashed line is the true probability density function
Samples can also be used to find statistics
Given a function such as f(x) = x²,
Expected value 𝔼[f(x)] = ∫ f(x)p(x)dx
Variance 𝕍[f(x)] = ∫(f(x) – 𝔼[f(x)])2p(x)dx
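A minimal sketch of estimating these statistics from samples, using f(x) = x² and samples from N(0, 1), for which the exact values are 𝔼[f] = 1 and 𝕍[f] = 2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)   # samples from p(x) = N(0, 1)

f = x ** 2                               # f(x) = x^2
print(f.mean())                          # Monte Carlo estimate of E[f(x)]  (exact: 1)
print(((f - f.mean()) ** 2).mean())      # Monte Carlo estimate of V[f(x)]  (exact: 2)
```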
Devices using weighted samples
Represent the probability distribution p(x) by N weighted samples (wn, xn), n = 1, …, N
Expectations and variances can be calculated using weighted means
Weighted sampling algorithm
Example 15 Bayesian estimation for probability distributions that cannot be written in analytical form
The parameter σ² of the model created in Example 14 is unknown
Given the prior probability p(σ²) = Unif(0.1, 2) and N i.i.d. samples XN = (x1, …, xN) of the random variable x,
how is the posterior probability obtained?
It is difficult to write down the posterior of the parameter σ² in analytical form
but it is possible to generate M weighted samples (wm, σ²m), m = 1, …, M
From Bayes’ theorem
p(σ²|XN) = p(σ²)p(XN|σ²) / p(XN)
Likelihood
EVIDENCE.
If we can compute the likelihood p(XN|σ²) numerically, given the input values XN and σ²,
it is possible to obtain a weighted sample of the posterior probability distribution p(σ²|XN)
Algorithm for obtaining a weighted sample of the posterior probability distribution
Algorithm for obtaining a weighted sample of the posterior probability distribution when hidden variables are involved
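A rough sketch of such a weighted-sample (importance sampling) scheme for Example 15: candidates σ² are drawn from the prior Unif(0.1, 2) and weighted by the likelihood, with the hidden d integrated out by an inner Monte Carlo average; the data sizes and the "true" σ² used to simulate the data are my own choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic data from the Example 14 model, with a "true" sigma^2 we pretend not to know
d_true = rng.uniform(-1, 1, size=100)
X_N = rng.normal(d_true, np.sqrt(0.5))

def log_likelihood(sigma2, X, n_inner=1000):
    """log p(X | sigma^2), with the hidden d averaged out by Monte Carlo."""
    d = rng.uniform(-1, 1, size=n_inner)
    per_point = norm.pdf(X[:, None], loc=d[None, :], scale=np.sqrt(sigma2)).mean(axis=1)
    return np.log(per_point).sum()

# Draw sigma^2 candidates from the prior Unif(0.1, 2) and weight them by the likelihood
sigma2_samples = rng.uniform(0.1, 2.0, size=100)
log_w = np.array([log_likelihood(s2, X_N) for s2 in sigma2_samples])
w = np.exp(log_w - log_w.max())
w /= w.sum()

print(np.sum(w * sigma2_samples))   # posterior mean of sigma^2 from the weighted sample
```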
How to obtain the results of Bayesian estimation in the form of a sample or weighted sample
Monte Carlo method
Markov chain Monte Carlo (MCMC) method with a twist on how to create a sample series
Example 16 Kernel density estimation
Kernel density estimation is one of the simplest methods for probability distribution estimation
Representative of nonparametric methods
If the random variable x ∈ ℝd follows a probability density function p(x)
Using B samples x1,…,xB, we obtain an estimate Ṗ(x) of the density function as above
hk(x) ≥ 0, k=1,…,K is the kth basis function
Each satisfies the normalization condition ∫ hk(x)dx = 1
By expressing each basis function in a parametric form, determining the parameters, and determining the number K appropriately, various probability densities p(x) can be approximated by Ṗ(x).
Kernel density estimation is a special case of the above
If the basis functions corresponding to each of the B samples are defined as follows
hb(x) = h((xb – x)/ σ)
h(x) is any positive-valued function: kernel function
σ: bandwidth (controls radius of basis)
The kernel function h(x) is shared across b=1,…,B and is given as an appropriate probability density function with a peak at the origin
Specific examples
Gaussian distribution kernel
h(x) = N(x|0,Id)
Epanechnikov kernel
h(x) = 3/4 (1 − ∥x∥2) 𝟙(∥x∥ < 1)
𝟙(·) is the indicator function that takes the value 1 if the condition inside holds and 0 otherwise
Gaussian distribution kernels are used to obtain a smooth and good-looking probability density function
The Epanechnikov kernel is used when a theoretical justification for the accuracy of the density estimate is required, regardless of its appearance.
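A minimal one-dimensional kernel density estimation sketch with both kernels; the bandwidth σ is assumed to be chosen by the user:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(u):
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) < 1.0)

def kde(x_eval, samples, bandwidth, kernel=gaussian_kernel):
    """1D kernel density estimate: (1/(B*sigma)) * sum_b h((x_b - x)/sigma)."""
    u = (samples[None, :] - x_eval[:, None]) / bandwidth
    return kernel(u).mean(axis=1) / bandwidth
```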
Example 17 Distribution estimation method using neural networks
Vector pattern x ˜ N(0, ID) generated from a Gaussian distribution of appropriate dimension D
Transform the above through neural network f(x;w) Probability generating model y= f(x;w)
Represent probability distribution of high-dimensional vector pattern y representing image, sound, etc. through the above
Convolutional Neural Networks for Image Classification
Deconvolutional (transposed-convolution) neural networks for image generation
Column: Brownian motion and Gaussian processes
The x-coordinate of a particle in Brownian motion is X(t) ∈ ℝ
The difference between the position at time t+∆t (∆t > 0) and the position at time t is modeled with a Gaussian distribution as X(t+∆t) − X(t) ˜ N(0, ∆tσ2)
W(t): Wiener process
W(t+∆t) – W(t) ˜ N(0, ∆t)
Gaussian process in one dimension
μ(t)=0
Covariance function k(t, t′) = min(t, t′)
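A small simulation sketch of the Wiener process built from independent Gaussian increments; the empirical covariance between two time points can be compared with min(t, t′):

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T, n_paths = 0.01, 1.0, 2000
n_steps = int(T / dt)

# W(t + dt) - W(t) ~ N(0, dt): accumulate independent Gaussian increments
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1)

# Empirical covariance at times s = 0.3 and t = 0.7 should be close to min(s, t) = 0.3
i, j = int(0.3 / dt), int(0.7 / dt)
print(np.mean(W[:, i] * W[:, j]))
```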
Chapter 5: Computational Methods for Gaussian Processes
Overview
Calculation of Gaussian process when the number of data points N is large
Computational cost for obtaining the kernel matrix and its inverse is the bottleneck
“Auxiliary variable method,” which introduces hidden variables that are not directly observed
5.1 Computational Cost of Gaussian Process Regression
Main Objective of Gaussian Process Regression
To find the expected value of output y based on the predictive distribution
Computational Formula
Compute vector k* and matrix K and store in memory
Compute and store in memory the inverse matrix K-1
Compute k*TK-1y and store in memory
Computational cost
Memory consumption
Amount of arithmetic operations
Computational cost of Gaussian process regression
The influencing factors are the number of input points N given as training data and the dimensionality D of each input
Memory Consumption
Amount of memory required to store numerical data
O(ND)
O(N),O(N2) to store vector k* and matrix K, respectively
Amount of operations
Order of operations required to compute all NxN elements of matrix K
O(N2D)
Order required to compute inverse matrix K-1
O(N3)
Computation of the vector k* and the inverse matrix K-1 is the dominant part
O notation (Big-O notation)
An indication of the strength of the relationship between scale factors such as N and D and computational complexity
When Gaussian processes are calculated as usual, the computational cost increases as the sample size N increases.
Calculation of O(N3)
1 second for N=1000
8 seconds for N=2000
1000 seconds (17 minutes) for N=10000
Bottleneck in terms of computational complexity of Gaussian process regression
Memory consumption O(N2) and arithmetic operations O(N3) required for inverse matrix computation
Effects of Devices for Gaussian Process Regression Calculations
Auxiliary variable method
By introducing M-dimensional (M<N) auxiliary variables, the Gaussian process method compresses the required inverse matrices to MxM matrices.
Reduces the order of memory consumption to O(NM + M2) and the order of operations to O(NM2 + M3).
variational method
Variational methods can make hyperparameter optimization of Gaussian processes more efficient.
Hyperparameter optimization requires updating the kernel matrix at each step of the algorithm, which updates the hyperparameters sequentially.
By introducing the variational method, it is possible to divide the data into mini batches of 100 or 500 pieces and learn them sequentially (mini-batch learning described in “Overview of mini-batch learning and examples of algorithms and implementations“).
In the case of a lattice of auxiliary input points, the lattice structure can be used to significantly improve computational efficiency.
This is especially useful when dealing with spatial data (e.g., weather) or time-series data (e.g., images, sound, and video), and when dealing with inputs and outputs that densely fill the space.
5.2 Auxiliary Variable Method
Introduction
Inducing variable method
Effective approximation method for reducing the computational complexity of Gaussian process regression methods
5.2.1 Partial data method
subset of data approximation (SOD)
Simple idea to use when N is too large to avoid NxN matrix calculation
Ordinary Gaussian Process Regression
According to the regression model y = f(x) + ε, N input points X=(x1,…,xN), xn∈ℝD, and the corresponding outputs y=(y1,…,yN)T, yn∈ℝ, are given
Behind them is a function f(·) following a Gaussian process, with output values f=(f1,…,fN)T=(f(x1),…,f(xN))T ∈ ℝN
When the Gaussian process is defined by mean μ(x) and kernel function k(x,x’)
The Gaussian process regression model assumes the following Gaussian process for unknown numbers f and y
f ˜ N(μN, KNN)
μN is an N-dimensional column vector whose nth component is μ(xn)
KNN is an N×N matrix whose (n,n′) component is k(xn, xn′)
yn ˜ N(fn, σ2)
Consider a function output value f*=f(x*) and an observed value y* at a new input point x*
The simultaneous probability of f and f* is represented by the following Gaussian distribution
μ*=μ(x*) and k**=k(x*,x*) are scalars, respectively
kN* is an N-dimensional column vector whose nth component is k(xn, x*)
Since the variance of the observed noise is σ2
The mean and variance of the predictive distribution p(f*|y) = N(ḟ*,σ*2) are as follows
ḟ* = kN*T(KNN + σ2IN)-1y
σ*2 = k** − kN*T(KNN + σ2IN)-1kN*
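A minimal numpy sketch of these predictive formulas for a single test point, assuming a zero mean function and a Gaussian (RBF) kernel with hypothetical hyperparameters; a linear solve is used rather than forming the O(N3) inverse explicitly:

```python
import numpy as np

def rbf_kernel(A, B, length=1.0, variance=1.0):
    """Gaussian (RBF) kernel matrix between row-wise input sets A (N x D) and B (M x D)."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X, y, x_star, noise=0.1, length=1.0):
    """Predictive mean and variance of zero-mean GP regression at one test point x_star (1 x D)."""
    K = rbf_kernel(X, X, length) + noise ** 2 * np.eye(len(X))   # K_NN + sigma^2 I_N
    k_star = rbf_kernel(X, x_star, length)                        # N x 1
    k_ss = rbf_kernel(x_star, x_star, length)                     # 1 x 1
    mean = k_star.T @ np.linalg.solve(K, y)                       # k_*^T (K + sigma^2 I)^{-1} y
    var = k_ss - k_star.T @ np.linalg.solve(K, k_star)
    return mean.item(), var.item()
```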
partial data method
Select M input points (M<N) from N input points so that they are well representative of the distribution of all data
Inversion of the MxM matrix yields the predicted distribution of the Gaussian process
The partial data method is a method that uses a partial data set consisting of M input points chosen to be representative of the total data.
Similar accuracy can be obtained at a much smaller computational cost O(M3) than O(N3)
5.2.2 Calculation of auxiliary input points and auxiliary variable method
A refinement of the partial data method
The auxiliary variable method generates a partial data set consisting of M points that is representative of the entire data set, and uses it to efficiently estimate the values at the test input points
In the partial data method, the data from the remaining N−M points were discarded
Auxiliary variable method preserves accuracy by effectively using all data
In the auxiliary variable method, M virtual input points Z=(z1,…,zM), called auxiliary input points (inducing points), are placed within the domain of the function f(·)
Assuming that the appropriate auxiliary points have already been obtained
The output value um = f(zm) at the auxiliary input point zm is called an “inducing variable”
Collecting these for m=1,…,M gives the “inducing vector” u=(u1,…,uM)T
Procedure for finding the predictive distribution of an unknown function f(x) known to follow a Gaussian process
1. definition of prior probabilities
Define prior probabilities p(f,u,f*) of output value f at input point X, output value (auxiliary variable vector) u at auxiliary input point Z, and output value f* at test input x*.
Derive posterior probabilities
Find the posterior probability p(u|y) of the auxiliary variable based on the observed value y at input point X
Derive predictive distribution
Approximate the predictive distribution p(f*|y) of the output f* using the posterior distribution p(u|y) of the auxiliary variable
p(f*|y) = ∫ p(f*|f)p(f|y)df ≈ ∫ p(f*|u)p(u|y)du
There are various schools of auxiliary variable methods
The method is called FITC
Definition of prior probability in the auxiliary variable method
Gaussian process where f(·) is defined by the mean function μ(x) and the covariance function k(x,x′)
The simultaneous prior probabilities of the output f and the auxiliary variable u are Gaussian as above
μM=(μ(z1),…,μ(zM))T is an M-dimensional column vector
It is the mean vector for the auxiliary variable u
Covariance matrix K is a (N+M)x(N+M) real square matrix
K contains the variance-covariance matrix KNN for vector f, the variance-covariance matrix KMM for u, and the covariance matrix KNM for f and u
In the auxiliary variable method, we want to approximate the predictive distribution at the new point x* as follows
p(f*|y) = ∫ p(f*|f)p(f|y)df ≈ ∫ p(f*|u)p(u|y)du
If the auxiliary points Z are properly placed and the above approximation holds for a new point x* located anywhere near the input points x1,…,xN, then
Substituting each input point xn in place of the new point, the following equation also holds
p(fn|y) ≈ ∫ p(fn|u)p(u|y)du
In the auxiliary variable method, based on the above considerations, the function f undergoes the following three-step generative process as a stochastic generative model
Formula 5.1 Three-step generative process assumed by the auxiliary variable method
(1) u ˜ N(0M, KMM)
(2) fn|u ˜ N(kMnTKMM-1u, kn − kMnTKMM-1kMn), n=1,…,N
(3) yn|fn ˜ N(fn, σ2), n=1,…,N
where kn = k(xn,xn), kMn = (k(z1,xn),…,k(zM,xn))T
Combine the first two stages of the three-stage generative process into a single simultaneous distribution
p(f|u)p(u) = p(f,u)
Gaussian distribution with N+M dimensions
Covariance matrix
Auxiliary variable method ignores all of its off-diagonal components and replaces them with zeros
Leaving only the diagonal components kn = k(xn,xn), n=1,…,N
A matrix with many zero entries is called a sparse matrix
The auxiliary variable method is therefore also known as a
Sparse Gaussian process
Interpretation in graphical models
In the original graphical model of stochastic generative model, there is an undirected link between all components of fN
In the graphical model of the auxiliary variable method, the components of fN no longer have direct links to each other, but instead are linked through the auxiliary variable u.
The output f* corresponding to the new input point x* is also passed through the auxiliary variable u to the observed values y1,… yN through the auxiliary variable u
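A hedged numpy sketch of FITC-style prediction following the three-step generative process of Formula 5.1; only M×M systems are solved, the posterior over u is the standard FITC form (stated here without derivation), and kernel(A, B) is assumed to be any function returning a cross-covariance matrix, such as the RBF sketch shown in Section 5.1:

```python
import numpy as np

def fitc_predict_mean(X, y, Z, x_star, kernel, noise=0.1):
    """Sketch of FITC-style predictive mean with inducing inputs Z (M << N)."""
    M = len(Z)
    K_MM = kernel(Z, Z) + 1e-8 * np.eye(M)     # M x M (small jitter for stability)
    K_MN = kernel(Z, X)                         # M x N
    K_Ms = kernel(Z, x_star)                    # M x 1
    k_nn = np.diag(kernel(X, X))                # only diag(K_NN) is needed (full matrix formed here for brevity)

    # diag(K_NM K_MM^{-1} K_MN) and the FITC per-point variances from step (2)
    Q_diag = np.sum(K_MN * np.linalg.solve(K_MM, K_MN), axis=0)
    Lam = k_nn - Q_diag + noise ** 2            # plus observation noise from step (3)

    # Posterior mean of the inducing variables u: only M x M systems are solved
    A = K_MM + (K_MN / Lam) @ K_MN.T            # K_MM + K_MN Lambda^{-1} K_NM
    mu_u = K_MM @ np.linalg.solve(A, (K_MN / Lam) @ y)

    # Predictive mean at x*, which depends on the data only through u
    return (K_Ms.T @ np.linalg.solve(K_MM, mu_u)).item()
```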
Derivation of Posterior Probabilities of Auxiliary Variables
5.2.3 Computational cost of auxiliary variable method
800 MB for GP with N=10000 observation points, 9 MB with M=100 auxiliary input points using the auxiliary variable method
5.2.4 Placement of auxiliary input points
The stochastic generative model of the auxiliary variable method is a “bold approximation” of the stochastic generative model of Gaussian process regression
Whether it is an adequate approximation
Depends on whether the auxiliary input points Z=(z1,…,zM) and their function values f(zm) represent all the data well
Comparison of approximations for different placement and density of auxiliary input points
If the function is smooth, the mean and variance can be approximated exactly with about 1/10 of the number of input points
The auxiliary variable method approximates the function by smoothly interpolating between the values at the auxiliary points
For complex shapes that are not smooth, a good approximation cannot be achieved without many auxiliary points
To maintain high approximation accuracy while minimizing the increase in the number of auxiliary input points
Requires ingenuity in proper placement of auxiliary input points
In regions where the function f(x) to be estimated is smooth, the density of auxiliary input points should be low.
If it is not smooth, set it higher.
How to create actual auxiliary input points
A certain percentage of the input data points are randomly selected as “partial data” to be used as auxiliary input points.
Clustering of input data points (K-means method, etc.) to obtain representative points to be used as auxiliary input points
The initial values selected in the above method are automatically adjusted to optimize the position of the auxiliary input points.
Arrange auxiliary input points on a grid
There is a way to greatly reduce the computational complexity by using the special characteristics of a lattice arrangement.
5.3 Variational Bayesian and Stochastic Gradient Methods
Introduction
Based on the Variational Bayesian method (VB method), we derive the stochastic gradient method algorithm.
A method that reduces a Bayesian estimation problem for a model with a complex hierarchical structure of unknown hidden variables and parameters to a numerical optimization problem
The stochastic gradient method is an algorithm that solves numerical optimization problems by sequentially updating parameters.
Essential method to efficiently determine parameters of complex parametric models such as neural networks based on large data sets.
Improves computational efficiency when optimizing the hyperparameter θ in the auxiliary variable method of the Gaussian process regression model when the number of data N is large.
5.3.1 Variational Bayesian Method and Independent Decomposition Assumption
Variational Bayesian methods are approximate computational methods of Bayesian estimation
Not limited to Gaussian process regression Situation setting includes a wide range of general machine learning
Given a stochastic generative model p(Y|w,θ) of observation Y in the form of conditional probabilities subject to an unknown variable w and a hyperparameter θ
Given a prior probability p(w) of an unknown variable
Objective of Bayesian Estimation
To find the posterior probability p(w|Y,θ) of the unknown variable and the hyperparameter θ
The posterior probability is obtained to satisfy Bayes’ theorem (left equation) under the hyperparameters
p(w|Y,θ) = p(Y|w,θ)p(w|θ) / p(Y|θ)
θ* = arg max p(Y|θ)
p(Y|θ) = ∫ p(Y|w,θ)p(w|θ)dw
Evidence (marginal likelihood)
Variational Bayesian approach
Approximating posterior probability distribution p(w|Y,θ) by variational posterior distribution q(w)
The analyst specifies the variational posterior distribution q(w) a priori, in the form of a known parametric distribution
Example
Let q(w) = N(w|m,S), a Gaussian distribution depending on the parameter θw = (m,S)
Next, the parameters are adjusted to make the variational posterior distribution closer to the true posterior distribution
On the meaning of the variational posterior distribution q(w)
From the standpoint of variational Bayesian algorithm
Variational posterior distribution is a provisional posterior distribution
The tentative posterior distribution is initialized appropriately at the start of the algorithm and approaches the true posterior distribution step by step.
Eventually converge to the desired distribution
From the standpoint of stochastic generative modeling
Variational posterior distribution implies a model for the posterior distribution
Even if the prior distribution p(w) and the likelihood function p(Y|w,θ) have a clear shape, it does not necessarily mean that the posterior distribution will have a clear shape
The analyst specifies a priori a tractable parametric probability density function; this is the variational posterior distribution
The analyst can thus arbitrarily constrain the posterior distribution to a parametric variational posterior distribution
Variational posterior distribution is part of the hypothesis
Variational Bayesian Method Computational Objectives
Minimize the difference between two probability distributions q(w) and p(w|Y,θ) for w
The difference between two probability distributions is measured using the Kullback-Leibler divergence (KL) KL[q(w)||p(w|Y,θ)].
Definition 5.5 KL information content
The KL information content between two probability distributions defined by the probability density functions q(w) and p(w) is expressed as
KL[q(w)||p(w)] = ∫ q(w)log (q(w)/p(w)) dw
KL[q(w)||p(w)] ≥ 0 for any probability density function p,q
KL[q(w)||p(w)] = 0 when q(w) = p(w) for all w
In general, KL[q(w)||p(w)] = KL[p(w)||q(w)] does not hold
Cannot call the KL information content the distance between q and p
Instead of distance, call it a “pseudo-distance”
To obtain q(w), variational Bayesian methods define an evidence lower bound (ELBO), or variational lower bound of the evidence, in the form of a functional of the variational posterior distribution q(w), and maximize it with respect to q(w)
Definition 5.6 Variational Lower Bound of Evidence
Fθ{q(w)} = ∫ q(w) log ( p(Y|w,θ)p(w|θ) / q(w) ) dw
why is it called a variational lower bound on the evidence?
The inequality Fθ{q(w)} ≤ log p(Y|θ) holds with respect to the evidence p(Y|θ) for any probability density q(w), and
Equality holds when q(w) ≡ p(w|Y,θ)
To prove that Fθ is a variational lower bound of the evidence and that the KL information content is greater than or equal to zero, we use Jensen’s inequality
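The same argument can be summarized by a standard identity (written here in the document's notation and consistent with Definitions 5.5 and 5.6):
log p(Y|θ) = Fθ{q(w)} + KL[q(w)||p(w|Y,θ)] ≥ Fθ{q(w)}
Equality holds exactly when q(w) = p(w|Y,θ), since the KL term is then zero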
Formula 5.7 Jensen’s inequality
For a random variable x ∈ ℝ and a function f(x) that is concave (convex upward) in x, f(𝔼[x]) ≥ 𝔼[f(x)] holds
Example
log x is an upward convex function
log ∫ p(x)f(x)dx ≥ ∫ p(x) log f(x) dx
If the shape of the variational posterior distribution q(w) is a function of the parameter θw, then the variational lower bound of the evidence can be expressed as a function of the parameter θw of the variational posterior distribution and the hyperparameter θ of the stochastic generative model F(θw,θ) = Fθ{q(w)}
The left-hand side is a function of θw,θ and the right-hand side is the functional of the probability density function q(w) (and a function of θ)
Evidence Lower Bound Function
In variational Bayesian methods, the two intertwined problems that make up Bayesian estimation
Estimation of posterior distribution p(w|Y,θ)
Optimization of hyperparameters θ
It boils down to a single numerical optimization problem with the evidence lower bound function as the objective function.
[θw*,θ*] = arg max F(θw, θ)
When θw*,θ is found such that the evidence lower bound function is maximized
The variational posterior probability q*(w) based on this is the best approximation of the posterior probability p(w|Y,θ*)
In particular, if the parametric probability distribution q(w)=q(w|θw) can represent the true posterior probability depending on the parameters
q*(w)=q(w|θw*) coincides with the exact posterior probability
At the same time, θ* is the hyperparameter that maximizes evidence
How to give the variational posterior probability q(w) in the variational Bayesian method?
One of the assumptions the analyst imposes a priori
Independent decomposition hypothesis (factorization model)
A convenient and frequently used way to model the variational posterior probability q(w) in variational Bayesian
Example
In the posterior probability p(w|X,θ), suppose the unknown parameter w=(w1,…,w5) is a 5-dimensional vector
Assuming q(w)=q(w1,w2)q(w3,w4,w5) for the variational posterior probability is called an independent decomposition assumption
Special cases included in the independent decomposition assumption
There is a mean field approximation
Consider independent variational posterior probabilities for each scalar variable, such as q(w)=q(w1)q(w2)q(w3)q(w4)q(w5)
The independent decomposition assumption ignores correlations between specific variable pairs
5.3.2 Applying Variational Bayesian Methods to Auxiliary Variable Methods
Applying variational Bayesian methods to the auxiliary variable method of Gaussian process regression
Variables appearing in the auxiliary variable method are decomposed into the following categories
Observed random variable
Observed values y=(y1,…,yN)T
Known constants
Input points X = (x1,…,xN)
Unknown constants
Auxiliary input point Z = (z1,…,zM)
[…] […]
Parameters to adjust the covariance function k(x,x’:θ)
Stochastic variable to be estimated
Function output values at the input points f=(f1,…,fN)T = (f(x1),…,f(xN))T
Function output values at the auxiliary input points u=(u1,…,uM)T = (f(z1),…,f(zM))T
Fitting a variational Bayesian method
The parameter w=(f,u) collects all the random variables to be estimated
Hyperparameter θ is the sum of all unknown constants
The observed random variable corresponds directly to y
Known constants are ignored because they are constants
Assumption 5.8 Independent Decomposition Assumption for Auxiliary Variables Method with Variational Bayesian Method
5.3.3 Minibatch and stochastic gradient methods
Variational Bayesian and stochastic gradient methods are powerful when the number of data points N is huge (specifically, when there are tens or hundreds of millions of data points)
Stochastic gradient descent
Divide the N input-output pairs (xn,yn), n=1,…,N into small groups of about N′=100 or 500 points
Each small unit is called a mini-batch
Compute KMN using only a mini-batch to obtain the gradient direction
The gradient is determined with a small error compared to the one calculated using all the data.
The direction of error is different for each mini-batch, but when averaged over all mini-batches, the average error (bias) is negligible.
Obtain the gradient from different minibatches at each step, and use this to advance parameter updating.
Less time required for convergence compared to the normal gradient method
The gradient method is less likely to be trapped in local solutions because it fluctuates stochastically with each mini-batch.
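A generic mini-batch stochastic gradient ascent sketch (not the book's specific update equations for the variational parameters); grad_fn is assumed to return an unbiased estimate of the full-data gradient of the objective, e.g. the evidence lower bound:

```python
import numpy as np

def minibatch_sgd(grad_fn, theta0, data, batch_size=100, lr=1e-2, epochs=10, seed=0):
    """Generic mini-batch stochastic gradient ascent.

    grad_fn(theta, batch) is assumed to return an unbiased estimate (scaled to the
    full data set) of the gradient of the objective, e.g. the evidence lower bound.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    N = len(data)
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            batch = data[order[start:start + batch_size]]
            theta += lr * grad_fn(theta, batch)        # noisy but, on average, unbiased step
    return theta
```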
5.4 Gaussian process method calculation based on a lattice-like auxiliary input point arrangement
Introduction
Regular lattice arrangement of auxiliary input points significantly reduces the computational cost of the Gaussian process method
For multidimensional inputs, a large number of auxiliary input points must be placed to maintain accuracy
Placing auxiliary input points on a grid streamlines calculations.
5.4.1 Kronecker method
Kronecker method
The multidimensional lattice is represented as the direct product of one-dimensional lattices
Using the Kronecker product property of matrices
Reduces the amount of arithmetic involved in kernel matrices
Example
Input point x is given by a lattice point x=(x(1),x(2)) ∈ 𝓧1x𝓧2 in 2-dimensional space
𝓧1 and 𝓧2 are sets of 50 and 70 elements for the vertical and horizontal lattice coordinates, respectively
x is located on a 50×70 grid point
The kernel function must be in the form of a product of kernel functions for each lattice dimension
If you can write k(xn,xn’)=k(1)(xn(1),xn'(1))k(2)(xn(2),xn'(2)) in two dimensions
A Gaussian kernel would automatically satisfy this property
The covariance matrix KMM over all auxiliary input points xm, m=1,…,M (a 3500×3500 matrix in this example)
can be written as the Kronecker product KMM = K1⊗K2 of the covariance matrices K1 (a 50×50 matrix in this example) and K2 (a 70×70 matrix in this example) constructed for each lattice dimension
Kronecker product
Matrix product defined between matrices of arbitrary shape (not limited to square matrices)
Example
The Kronecker product of a 2×3 matrix A=(aij) and a 2×2 matrix B=(bij) is
Storing KMM directly requires 3500×3500 = 12.25 million entries
Storing K1 and K2 separately requires only about 50×50 + 70×70 = 7,400 entries
The heart of the Kronecker method
Eigenvalue decomposition of the kernel matrix expanded into the Kronecker product
Eigenvalue decomposition K=P𝚪PT of NxN matrix requires O(N2) memory consumption, O(N3) operations
Formula 5.9 Eigenvalues of the Kronecker product
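The standard property behind Formula 5.9 is that the eigenvalues of K1⊗K2 are all pairwise products of the eigenvalues of K1 and K2; a small numpy check (the full 3500×3500 decomposition is formed here only to verify the property):

```python
import numpy as np

def rbf(x):
    """1D Gaussian kernel Gram matrix for lattice coordinates x."""
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

K1 = rbf(np.linspace(0, 5, 50))     # 50 x 50, one lattice axis
K2 = rbf(np.linspace(0, 7, 70))     # 70 x 70, the other lattice axis
K = np.kron(K1, K2)                  # 3500 x 3500; formed here only for verification

# Eigenvalues of the Kronecker product are all pairwise products of eigenvalues
e_kron = np.sort(np.outer(np.linalg.eigvalsh(K1), np.linalg.eigvalsh(K2)).ravel())
print(np.max(np.abs(e_kron - np.sort(np.linalg.eigvalsh(K)))))   # ~ 0, up to rounding error
```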
5.4.2 Toeplitz method
Toeplitz method
This method can be used when the auxiliary input points are equally spaced on a one-dimensional lattice and the kernel function can be written in terms of the difference between the two input values xi and xj, k(xi,xj) = h(xi – xj).
Combining the Toeplitz and Kronecker methods makes it possible to handle multi-dimensional equally spaced lattices.
Example
An equally spaced lattice such as x = 0, 0.1, 0.2, …, 10 with a Gaussian kernel
Toeplitz matrix
A square matrix A=(aij) such that aij = ai-1,j-1
diagonal-constant matrix
Once the components of the row vector of the first row and the column vector of the first column are determined, all other components are known.
Circulant matrix
Special case of a Toeplitz matrix
MxM cyclic matrix is a square matrix B=(bij) such that bij=bi-1,j-1 and bMj=b1,j+1
Examples of Toeplitz and circulant matrices
A circulant matrix has the property that its eigenvalues can be obtained by the discrete Fourier transform of its first row
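A small numpy check of this property (the first row is chosen symmetric so that the eigenvalues are real):

```python
import numpy as np

# Eigenvalues of a circulant matrix equal the DFT of its first row
first_row = np.array([4.0, 1.0, 0.5, 1.0])                       # symmetric, so eigenvalues are real
C = np.array([np.roll(first_row, k) for k in range(len(first_row))])
print(np.sort(np.linalg.eigvals(C).real))                        # [2.5, 3.5, 3.5, 6.5]
print(np.sort(np.fft.fft(first_row).real))                       # the same values
```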
5.4.3 Local Kernel Interpolation
Local kernel interpolation
To obtain the value of the kernel function at an arbitrary input point x, a minimum number of lattice points in the neighborhood of x are picked up, and the value is interpolated as a weighted sum of the values at those lattice points
For a one-dimensional lattice, the value at an arbitrary point is interpolated from its two neighboring lattice points; at most four points are needed for a two-dimensional lattice and at most eight for a three-dimensional lattice
The interpolation points must be placed on the grid so that it covers the range of possible values of the input points X and the test points x*
5.4.4 KISSGP method and its arithmetic quantities
The three techniques, Kronecker, Toeplitz, and local kernel interpolation, are complementary and can be used in combination
KISS-GP method
Implementations that use these simultaneously
Implementation Tools
GPML, a toolbox of MATLAB implementations of various algorithms for Gaussian process methods
Python toolbox called GPyTorch
Chapter 6: Application of Gaussian Processes
Overview
As practical examples of effective use of the characteristics of Gaussian processes, we will look at spatial statistics and Bayesian optimization issues.
6.1 Kriging and spatial statistics
A method for determining the total amount of minerals contained in an entire quarry from the results of a limited number of borehole inspections
Borehole inspection
It shows what quality ore can be obtained at what depth in a certain location.
Krige’s method: Kriging
Using a linear combination of the measured values f1,…,fN at coordinates x1,…,xN
Estimate, as plausibly as possible, the value f(x) that would be measured at an arbitrary spatial coordinate x
In principle, the same as the Gaussian process regression method
Variograms and semivariograms
To model the covariance function, terms such as variogram and semivariogram are used
The measurements y1,…,yN at N points x1,…,xN are given
The semivariance corresponding to any pair of two points (xn,xn′) is defined from the difference of the observed values yn, yn′ at positions xn, xn′
Snn’ = 1/2(yn – yn’)2
Visualizing the relationship between the semivariance snn′ and the distance rnn′ = ∥xn − xn′∥ as a scatter plot gives the
Variogram (semivariance cloud)
Semi-variogram
Function γ(r) expressing the expected value of the semi-variance snn’ in terms of the distance rnn’.
Empirical semi-variogram
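A minimal sketch of computing an empirical semivariogram by binning the pairwise semivariances snn′ by distance (the number of bins is an arbitrary choice here):

```python
import numpy as np

def empirical_semivariogram(X, y, n_bins=15):
    """Bin the semivariances s_nn' = (y_n - y_n')^2 / 2 by the distance r_nn'."""
    s = 0.5 * (y[:, None] - y[None, :]) ** 2
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(y), k=1)                 # each pair (n, n') counted once
    r, s = r[iu], s[iu]
    edges = np.linspace(0.0, r.max(), n_bins + 1)
    idx = np.clip(np.digitize(r, edges) - 1, 0, n_bins - 1)
    gamma = np.array([s[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, gamma                              # plot gamma against centers
```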
Interpretation of function shape
Nugget
means a mass of things
Expected value of semi-variance for repeated measurements at the same location
The value γ(0) is called the nugget
Sill
Value of semi-dispersion when the distance is far enough
The value of γ(r) for r ≫ 0 is called the sill
Range
The minimum distance beyond which the semivariance can be regarded as no longer changing
When the nugget is small and the range and sill are large
The target quantity is highly uniform within a small area, but changes significantly when viewed over a large area
It makes sense to do drilling work over a large area to look for large chunks of high quality mineralization.
Nuggets are large
High variability within a small area
It makes sense to search for small high-quality lumps by repeatedly boring within a small area, shifting the position little by little
Modeling the more general case where the semivariance is not necessarily a function of the distance r alone
Model the variogram γ(xn,xn′) as a function of the two observed position coordinates xn, xn′
6.2 Bayesian Optimization
6.2.1 What is Bayesian Optimization?
Bayesian optimization
Applied technology that takes full advantage of the features of Gaussian process regression method, which allows probabilistic forecasting based on a small number of samples and minimal assumptions
Objective
To find the input x* = arg max f(x) such that the unknown objective function f(x) has a maximum value
The user conducts experiments at chosen input values x1, x2, … to obtain clues to the shape of the unknown objective function
Since experiments are costly, determine the appropriate xi for each experiment to get close to the optimal x* while saving on experimentation costs.
Example
We want to maximize wheat yield f(x) by applying the appropriate amount of each of the three types of fertilizer (A, B, C) x=(xA, xB, xC)
I want to minimize aerodynamic drag f(x) by determining appropriate parameters x for the streamlined nose shape of a new car. It costs 500,000 yen and 3 days per test to print out a model with a 3D printer and conduct wind tunnel tests to determine aerodynamic drag.
I want to optimize the learning accuracy f(x) by appropriately determining the structural parameters x of a deep neural network. Each experiment to determine the parameters takes 3 hours and costs 60 yen for computer usage.
Before Bayesian optimization, there was a method called “experimental design”
Assume the case where the factors to be considered x1, x2, … take discrete values, e.g., x1∈{A,B,C}, x2∈{large, medium, small}, …
In Bayesian optimization, when the results of N experiments are obtained as N input/output pairs (x1,y1),…,(xN,yN), an estimate of the function f(x) is obtained in the form of a posterior probability based on the Gaussian process method.
The result of the estimation of the function f(x) is obtained in the form of the expected value of the posterior probability μ(x) and its standard deviation σ(x).
To find the true optimum
Around the tentative optimal solution with the largest expected value of μ(x)
Where the experiment is sparse (where the standard deviation σ(x) is large)
Determine xN+1 so that the acquisition function a(x), defined by an appropriate combination of μ and σ, is maximized.
Formula 6.1 How to define the acquisition function in Bayesian optimization
The acquisition function a(x) is defined as the combination of the expected posterior probability μ=μ(x) and the standard deviation σ=σ(x) of the function f(x) expressed as a Gaussian process.
The upper confidence bound is defined as
By defining an acquisition function that adds a margin proportional to the standard deviation σ to the expected value of the estimate µ, the search preferentially searches around areas with low data density.
Expected improvement is defined as
τ is the largest value among the N observations yn = f(xn) obtained so far
t=(μ – τ)/σ is the difference between μ and τ for each x, normalized by the standard deviation
I( f > τ ) is the indicator function that takes the value 1 when f > τ and 0 otherwise
Φ() and φ() are the cumulative distribution function and density function of the standard normal distribution, respectively
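Written out in the standard forms (assumed here; the book's exact constants and notation in Formula 6.1 may differ):
aUCB(x) = μ(x) + κσ(x), where κ > 0 is a constant controlling the size of the exploration margin
aEI(x) = (μ − τ)Φ(t) + σφ(t), with t = (μ − τ)/σ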
Bayesian optimization with the UCB criterion in progress
The value of f(x) at the x that maximizes the acquisition function (green) is obtained by experiment at each step, and the estimate is updated, approaching the x that maximizes f(x).
6.2.2* Automatic Relevance Determination (ARD)
Bayesian optimization assumes that the properties of the function under consideration are unknown
The Automatic Relevance Determination (ARD) mechanism is particularly effective when the input to the function is a multidimensional vector
ARD can automatically determine the vector components that affect the output of a regression with multidimensional vectors as input.
The observation point xi=(xi1,…,xiD) is represented by a D-dimensional vector, as above
Consider an ARD kernel as in the above equation
The hyperparameter θ=(𝛈1,…,𝛈D) describing the properties of this kernel is a D-dimensional vector
The dth component 𝛈d of this vector is a positive real number and determines the degree of relevance between the variation of the output y and the variation of the dth component of the observation point vector x
When the value of 𝛈d is zero or extremely small compared to others, the variation of the function f(xi) sampled from the Gaussian process is not affected by the variation of the d component of the input xi
By setting the hyperparameters 𝛈d, d=1,…,D, we can control how strongly each component of the input vector affects the variation of the function
By estimating these hyperparameters from the data, we can estimate the relative contribution of each component of the input vector
In high-dimensional data analysis using Gaussian process regression, a further extension of the above kernel function is commonly used
Formula 6.2 ARD kernel function
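One common parameterization consistent with the description above (setting 𝛈d = 0 removes the dependence on dimension d); the amplitude parameter θ0 is an assumption here, and the book's exact Formula 6.2 may differ in details:
k(x, x′; θ) = θ0 exp( − Σd=1…D 𝛈d (xd − x′d)2 )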
6.2.3* Derivation of the matrix derivative formula and ARD algorithm
The ARD algorithm is not very different from the hyperparameter determination method for Gaussian process regression models, except that it uses the standard ARD kernel function k
Combination of ARD (a device that automatically performs dimension selection) and Matern kernel (a device that provides a variety of covariance functions)
Various software libraries are available for practical applications of Bayesian optimization
Python
GPyOpt
GPFlow
GPyTorch
R
rBayesianOptimization
Chapter 7 Unsupervised Learning with Gaussian Processes
Overview
Using a Gaussian process, in a latent variable model where the observed value Y is generated from a latent variable X
The mapping from X to Y can be made nonlinear
At the same time, the problem can be defined mathematically and prospectively
Equivalent to thinking more mathematically about latent variable models with neural networks
Also explains how to sample latent variables according to Gaussian processes
7.1 Gaussian Process Latent Variable Model (GPLVM)
Overview
Case with input x and output y, where there is only y and x is unknown
Example
Let the observations y be users’ click histories on web pages
Similar users often view similar pages
No input x representing user characteristics is given
If the nth data point is yn={yn(1),yn(2),…,yn(D)}T ∈ ℝD
There are N of these.
Y={y1,y2, … , yN}T
The whole set of observations Y is an N×D matrix
Ynd = yn(d) is the dth dimension value of the nth data
Each dimension d of Y has a different meaning
For each d, close ys are close to each other
For each d, the N-dimensional column vector y(d) is assumed to be generated by Gaussian process regression from N unknown inputs X = (x1, x2, …, xN)
Assume that the mean of y(d) is zero
Consider errors due to a simple Gaussian distribution
For each dimension d=1,…,D
f(d) ˜ N(0, Kx)
Kx is the kernel matrix over the elements of X
Its (n,n′) element is k(xn, xn′)
yn(d) = f(d)(xn) + ε, ε ˜ N(0,σ2) (n=1,…,N) […]
In summary
y(d) ˜ N(0, Kx + σ2I)
The observations y follow a Gaussian process for each of the dimensions represented by ×, ◯, and △
Behind them is a latent input x, represented by ▲
If the x values are close, the corresponding observations (×, ◯, △) are also close to each other
The probability of the entire data Y is the product of the probabilities of y(1), y(2), …, y(D)
Assume each x follows a K-dimensional standard Gaussian distribution
K is the number of dimensions of the latent space in which x exists, which can be determined independently of D
x need not be a vector as long as it can be optimized
Any kernel function that can measure similarity
If we consider kernels between characters and optimize “latent strings” (e.g., MCMC method), we can theoretically consider “GPLVM in which observed values are generated from characters” etc.
The simultaneous probability of X and Y is the above equation
Finding a potential X that maximizes this is the goal of the learning problem
Gaussian Process Latent Variable Model (GPLVM)
Example: Human pose embedded non-linearly in 2D space (K=2) by GPLVM
It is possible to find X directly without assuming a prior distribution for X
However, nothing then prevents x from taking extreme values, depending on the data
So it is preferable to keep a prior distribution
7.1.1 Generative model of GPLVM
In GPLVM, data Y is generally considered to have been generated as follows
(1) For n=1,…,N: generate latent inputs xn ~ N(0, IK)
(2) Compute the kernel matrix Kx between the elements of X=(x1,…,xN)
(3) For d=1,…,D: generate function values f(d) ~ N(0, Kx), then generate observed values y(d) ~ p(y|f(d)) according to the observation model p(y|f)
If you use errors from a Gaussian distribution, you can combine the two steps together
(3)′ For d=1,…,D: generate observations y(d) ˜ N(0, Kx + σ2I)
This generates all observed values y(d) = (y1(d),…,yN(d)) at once
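A minimal numpy sketch of this generative process (the sizes N, D, K and the noise level are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, K_dim, sigma = 100, 5, 2, 0.1     # illustrative sizes and noise level

def rbf_gram(X):
    """Gaussian kernel Gram matrix K_X over the latent inputs."""
    d2 = np.sum(X ** 2, 1)[:, None] + np.sum(X ** 2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-0.5 * d2)

X = rng.normal(size=(N, K_dim))                           # (1) latent inputs x_n ~ N(0, I_K)
cov = rbf_gram(X) + sigma ** 2 * np.eye(N)                # (2) kernel matrix, plus the noise term
Y = rng.multivariate_normal(np.zeros(N), cov, size=D).T   # (3)' each y^(d) ~ N(0, K_X + sigma^2 I)
# Y is the N x D observation matrix
```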
7.1.2 Objective Function of GPLVM
How do we find x from the above equation?
We focus on the term p(Y|X)
For clarity, we include the term σ2I in the kernel, and Kx+σ2I is again Kx.
Transformation of the equation
For computational convenience, write Kx-1 = 𝚲 and abbreviate y(d) as y
The dashed line part of the above equation is
For matrices A and B
Using tr(AB) being the above equation
For matrices A and B, tr(ATB) means the “inner product” of matrices A and B
Takes a large value when the two matrices are similar to each other
From the above, the probability of generating Y from X is
Information about X is embedded in Kx
Maximizing the objective function in the left equation is
Equivalent to finding the original inputs X=(x1,x2,…,xN) of Kx so that the Gram matrix Kx approximates the correlation matrix YYT of the observed data as well as possible
Since tr(AB)=tr(BA)
From the above, the equation can be transformed as shown
It can be written in the same form as the formula for a multivariate Gaussian distribution
7.1.3 Learning GPLVM
Probability of generating Y from X
Taking the logarithm, the log likelihood of the observed data is as above
To find the X contained in Kx that maximizes this, we basically use the gradient method
From the matrix derivative formula, using the fact that Kx is a symmetric matrix, the above formula becomes
Therefore, the derivative of L with respect to Kx is as above
From the chain rule of differentiation
Given the standard Gaussian kernel
The derivative appears only in the terms k(xn,xn′) and k(xn′,xn)
Since K is symmetric, we only need to consider k(xn,xn′), n′=1,…,N, and multiply by 2
Final calculation part
Concrete example
Latent coordinates obtained by a MATLAB implementation of GPLVM for the first 200 points of the oil flow data
Compared to the two-dimensional mapping of principal component analysis, the nonlinear compression of GPLVM better separates the classes that are inherently present.
Because the objective function of GPLVM is not convex and has many local solutions, the gradient method computation depends on initial values
GPLVM calculations can also be performed with packages such as Python’s GPy
7.2 Properties of Gaussian Process Latent Variable Models
Relationship to Kernel PCA
GPLVM can be viewed as a kernelized version of Principal Component Analysis using Gaussian processes
What is the difference from Kernel Principal Component Analysis (Kernel PCA) using the kernel method?
Kernel PCA does not consider latent variables, but instead sends the observed data y to a higher dimensional feature space by projection φ(y), where the usual principal component analysis is performed
Advantages and Disadvantages of GPLVM
Because GPLVM is a probabilistic model, parameters that better explain the data (e.g., kernel parameters) can be selected by maximizing the probability of the observed data
Drawbacks
The optimization problem is not convex, and the global solution is not as easily obtained as in principal component analysis.
There is no explicit relationship between x and y, as in a linear model, but only a Gaussian process that says “if x is similar, y is similar,” so there are infinite possibilities for X to represent data Y.
Difficult to optimize
Bayesian Gaussian process latent variable model
Bayesian Gaussian Process Latent Variable Model (Bayesian GPLVM)
A model in which the posterior distribution of X and the hyperparameters are learned using variational Bayesian methods
Example
Training results on the oil flow data using GPy.models.BayesianGPLVM() included in GPy
Unlike normal dimensionality compression, a “confidence” in the latent space can be assigned
Brightness is confidence
Variance in data space Y
7.3* Extension of Gaussian Process Latent Variable Model
7.3.1 Infinite Warp Mixture Model (iWMM)
7.3.2 Gaussian Process Dynamics Model (GPDM)
7.4* Sampling of potential Gaussian processes
7.4.1 Poisson point processes and Cox processes
7.4.2 Elliptical slice sampling
Appendix A Appendix
A.1 Matrix partitioning and inversion
A.2 Eigenvalues of a cyclic matrix obtained by discrete Fourier transform
A.3 Calculation of matrix derivatives
Bibliography
Software for Gaussian Processes
Afterword