Machine Learning Professional Series Bayesian Deep Learning Reading Notes


Reading notes on "Bayesian Deep Learning" from the Machine Learning Professional Series

Preface
The goal of Bayesian deep learning
Challenges of deep learning
Emphasis on developing models that scale to training on large amounts of data and on improving speed and accuracy
The evaluation of the interpretability and reliability of the evidence for predictions takes a back seat.
Challenges of Bayesian statistics
While highly interpretable analysis is possible, practical methods that scale to large volumes of high-dimensional data have lagged behind.
Goals
Relevance of the model

Chapter 1 Introduction

1.1 Bayesian Statistics and Neural Networks in Transition
1.1.1 Neural Networks
1958
Perceptron
Learning model based on brain function
1969
Marvin Minsky pointed out the theoretical limitations of the perceptron
1980s
The overall behavior of the interaction between multiple units of a network in the study of complex systems
1986
Error back propagation method
1.1.2 Bayesian statistics
1970s
Metropolis-Hastings method
Enables approximate inference for a wide class of stochastic models by using computers
1990s
Proposed highly efficient computational algorithms such as Hamiltonian Monte Carlo methods
1980s
Bayesian networks
Graphical models with directed acyclic graphs
1.1.3 The Birth of Bayesian Neural Networks
1987
Set prior distributions for weight parameters of neural networks and extract useful information for prediction by providing training data
Late 1980s
Development of a framework for statistical models of neural networks
1991
Algorithm for prediction of neural networks using Laplace approximation (by the inventor of convolutional neural networks (CNN))
Introduced the framework of model selection in Bayesian statistics to neural networks
Demonstrated the relationship between the evidence, a quantitative evaluation measure of a model, and the generalization error, which evaluates predictive performance on test data.
1993
Development of learning algorithms for neural networks using variational inference or variational Bayesian methods
1990s
Bayesian neural networks are computationally expensive, although they can perform well even with little data
The usefulness of Bayesian inference in graphical models was discovered.
State space models
Topic models
2000s
Rise of methods that deal directly with function space for regression, such as kernel methods and Gaussian processes
1.1.4 The rise of deep learning
Machine learning interest in Bayesian inference
Nonparametric Bayesian or Bayesian nonparametrics to build analysis methods that scale to large data sets
Huge computational cost for inference
High threshold to fully understand and use the theory
2006
Deep belief networks
Layered pre-training techniques
Efficient learning for deep networks with multiple layers, which have been difficult to learn in the past due to the gradient vanishing problem.
Improved theoretical analysis and prediction accuracy for more flexible network structures
2012
Use of large convolutional neural networks with more than 60 million parameters.
Use of large amounts of training data and deep learning techniques to produce results that significantly improve on traditional recognition errors.
A technique called dropout to prevent over-fitting
Stochastic gradient descent for efficient training on large data sets
Fast convolutional operations on GPUs
1.2 Bayesian Deep Learning
Introduction
Since the 2000s, Bayesian-inference-based machine learning and neural networks have evolved independently.
1.2.1 Limitations of Deep Learning
A large amount of data is required to train a model with a large number of parameters
Uncertainty is not well handled by neural networks
It is important to know what the prediction algorithm does not know (which is not possible with deep learning)
Lack of interpretability of models
It’s a black box and hard to tell why it made the prediction it did.
The number of hyperparameters that need to be adjusted is huge, and a lot of trial and error is required to improve performance.
1.2.2 Integration with Bayesian Statistics
Three directions of fusion of Bayesian learning and deep learning
Bayesianization of deep learning models
Redefine the deep learning model as a probabilistic model
A generative model describes the process by which data is generated, rather than merely fitting a function to data
In Bayesian models, hyperparameter tuning and model selection are done by evaluating the marginal likelihood
Deep generative models provide a guide to quantitatively evaluate the data-generating ability of a model.
Can be combined with other stochastic models
Missing values can be imputed through probabilistic computation.
Can be applied to multi-task learning and transfer learning
Bayesian interpretation of existing methods
Some computational techniques in deep learning are equivalent to those in Bayesian inference
Regularization and dropout to prevent overfitting can be seen as a form of variational inference in Bayesian inference.
Application of deep learning techniques to Bayesian inference
Posterior distributions over large numbers of random variables, called latent variables, are approximated by using neural networks to predict them.
1.2.3 Notes on terminology and notation
Different names for inference

Chapter 2 Fundamentals of Neural Networks

2.1 Linear Regression Model
Introduction
Learning Based on the Error Function of Linear Regression
2.1.1 Learning by Least Squares Method
Output Function
Feature function
ε: Error between the prediction wTΦ(x) by the model and the label y
Error function
Gradient of the error function E(w) with respect to the parameter w
The least-squares solution wLS satisfying ∇wE(w) = 0 is
Optimization problem
The prediction for a new value when the optimal solution is obtained is
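As a concrete illustration (my own sketch, not from the book), the following NumPy code solves the least-squares problem for a linear regression with a fixed basis; the cubic polynomial basis and the toy data are assumptions made for the example.

```python
import numpy as np

def phi(x):
    """Polynomial feature function Phi(x) = (1, x, x^2, x^3)^T (assumed basis)."""
    return np.stack([x**0, x, x**2, x**3], axis=-1)

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 20)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(20)

Phi = phi(x_train)                    # design matrix, shape (N, M)
# Least-squares solution: w_LS = (Phi^T Phi)^{-1} Phi^T y,
# i.e. the w at which the gradient of E(w) vanishes.
w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_train)

# Prediction for a new input x*: w_LS^T Phi(x*).
print(w_ls @ phi(np.array(0.5)))
```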
2.1.2 Selection of basis functions
Regression with linear functions does not capture the detailed input-output relationships of the training data well.
Design a feature function Φ(x) that well represents the features of the training data that fluctuate up and down.
Example
Linear regression is "prediction by a linear combination of basis functions weighted by the parameters".
Linear regression model
2.1.3 Overfitting and Regularization
Introduction
Overfitting is easy when the order is large
2.1.3.1 Regularization term
Introduced to avoid overfitting
Penalty term
New cost function
Minimization by parameter w
Stationary point where ∇wJ(w)=0
L1 regularization yields sparse solutions
2.1.3.2 Problems with learning by regularization
Need to fix the feature function
Countermeasure
Introduce a new parameter into the basis function Φm and learn the basis function itself from the data at the same time.
Use a Gaussian process
Perform Bayesian estimation after preparing an infinite number of basis functions
It is difficult to determine which feature function Φ best represents the trend of the data.
Solution
Hold out part of the data and, after training, evaluate quantitatively on the held-out data
Perform a quantitative evaluation of a general-purpose model using a measure called the marginal likelihood or evidence
Unclear guidelines for setting the regularization term
Countermeasures
Gaussian process, one of the methods that combines Bayesian theory and kernel method
Instead of restricting the function in a space of parameters that are difficult to interpret, as in neural networks, directly give the function properties such as smoothness and periodicity for intuitive modeling.
Learning by error minimization and regularization cannot represent the uncertainty of prediction
2.2 Neural Networks
Introduction
In linear regression model, the basis functions are fixed in advance, so it is not possible to extract features flexibly according to the data.
In neural networks, the basis functions themselves are learned from the data by placing parameters in the basis functions.
2.2.1 Forward propagation neural network
2.2.1.1 Two-layer Forward Propagation Neural Network
This section describes the most basic neural network model, the feedforward neural network.
Construct a neural network that predicts a multidimensional label yn∈ℝD from a multidimensional input xn∈ℝH0.
Consider a model that builds another linear regression inside the feature function Φ used in the linear regression.
w(1)h1,h0∈ℝ and w(2)d,h∈ℝ are the weight parameters of the network.
A simple notation using matrices is the above equation
W(1)∈ℝH1×H0,
W(2)∈ℝD×H1,
εn∈ℝD
Φ(⋅) refers to the operation of applying the nonlinear function Φ to each element.
The detailed partitioned representation
Zn,h1∈ℝ
a(2)n,d∈ℝ and a(1)n,h1∈ℝ are the weighted sums over the hidden units and the input values, respectively.
Schematic of the model
It is called a forward propagation neural network with L=2 layers.
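To make the equations above concrete, here is a minimal NumPy sketch of the forward pass of an L=2 network; the layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

H0, H1, D = 3, 4, 2                  # assumed input, hidden, output sizes
rng = np.random.default_rng(1)
W1 = rng.standard_normal((H1, H0))   # first-layer weights W(1)
W2 = rng.standard_normal((D, H1))    # second-layer weights W(2)

def forward(x):
    a1 = W1 @ x       # pre-activations a(1): weighted sums of the inputs
    z = np.tanh(a1)   # hidden units z = Phi(a(1)), applied elementwise
    a2 = W2 @ z       # network output a(2): weighted sums of hidden units
    return a2

x_n = rng.standard_normal(H0)
print(forward(x_n))
```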
2.2.1.2 Various Activation Functions
Basis functions used in neural networks Φ
Commonly Used Activation Functions
Relationship between the sigmoid function and the hyperbolic tangent function
Sigmoid function
Function form
Hyperbolic tangent function
Function form
Relationship between cumulative distribution function of standard normal distribution and Gaussian error function.
Cumulative distribution function of standard normal distribution.
Function form
Gaussian error function
Function form
rectified linear unit (ReLU) or ramp function
Function form
exponential linear unit (ELU), a modification of the rectified linear unit
Function form
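The activation functions listed above fit in a few lines of NumPy; the following sketch (my own, with an assumed ELU coefficient α=1) also checks the stated relationship between the sigmoid and the hyperbolic tangent.

```python
import numpy as np
from scipy.stats import norm  # standard normal CDF (probit activation)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh = np.tanh                                   # hyperbolic tangent
probit = norm.cdf                                # CDF of N(0, 1)
relu = lambda x: np.maximum(0.0, x)              # rectified linear unit
elu = lambda x, a=1.0: np.where(x > 0, x, a * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
# Relationship between sigmoid and tanh: sigmoid(x) = (tanh(x/2) + 1) / 2.
assert np.allclose(sigmoid(x), (np.tanh(x / 2) + 1) / 2)
```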
2.2.1.3 Examples of Functions Represented in Neural Networks
An example of a neural network with the appropriate weight parameters W(1) and W(2) and the number of hidden units H1=1,2,4.
Hyperbolic tangent function is used as activation function
When H1=1, the total number of parameters is 2, and the input and output axes of the hyperbolic tangent function are simply scaled.
As H1 is increased, the number of parameters increases to represent complex functions combining multiple basis functions.
In a forward propagation neural network with L=2 layers, any continuous function can be approximated by increasing the number of hidden units H1.
2.2.1.4 Forward propagation neural network with multiple layers
More layers of representation
Input dimension corresponds to H0, output dimension corresponds to D=HL, and so on
Decompose the equation using active and hidden units
Model with deep network structure such that the number of layers is L>2
Model with parameters fixed as w(2)d,h1=1 when the number of layers is L=2
The link function corresponds to the inverse of the activation function.
A neural network model with a multilayer structure in deep learning is one that repeatedly applies nonlinear transformations in a generalized linear model.
2.2.2 Gradient Descent Method and Newton-Raphson Method
Introduction
Similar to linear regression models, forward propagating neural networks can be trained by minimizing the error function with respect to the parameters.
2.2.2.1 Gradient Descent Method
Major differences from linear regression
In the case of forward propagating neural networks, the parameters to be learned are introduced only in the nonlinear activation function, making analytical calculations difficult.
The most commonly used method
Let E(W) be the error function for a model with M-dimensional parameters, and
The gradient is given by the above equation
The gradient points in the direction in which the error function increases most rapidly within a small Euclidean neighborhood
In the gradient descent method, initial values are given for the parameters to be optimized.
Repeat moving the parameter slightly in the direction opposite to the gradient of the previous equation.
α>0 is called the learning rate.
An appropriate value is found by experiment
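A minimal sketch of the update rule, on an assumed quadratic error function (not an example from the book):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

def grad_E(w):
    # Gradient of the assumed error E(w) = ||A w - b||^2 / 2.
    return A.T @ (A @ w - b)

alpha = 0.02                 # learning rate, found by experiment
w = np.zeros(3)              # initial value of the parameters
for _ in range(2000):
    w -= alpha * grad_E(w)   # move slightly opposite to the gradient
print(w)
```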
2.2.2.2 Newton-Raphson method
If the number of parameters M is not large, we can use the second derivative of the error function to make the optimization more efficient.
First, the error function to be minimized is approximated to the second order by Taylor expansion around a certain w.
∇2E is the Hessian matrix for the error function E.
If the error function is approximated by the second order, the minimum value of the approximated function E(W) can be obtained analytically
The gradient of E(W) is given by
If we set ∇wE(W)=0 and solve for w, we obtain the above equation.
The final conditional expression is the above equation.
The learning rate α of the gradient descent method is replaced by the inverse of the Hessian matrix.
The Newton-Raphson method converges far more efficiently than simple gradient descent due to its quadratic convergence.
When the number of parameters M is large, computing the Hessian matrix and its inverse takes a lot of time.
Improved version
Quasi-Newton methods
Approximate the Hessian matrix
2.2.3 Error back propagation method
Training method for forward propagation neural networks using gradient descent method
Since it is not possible to find the parameter that minimizes the error analytically
Perform computer optimization using gradient
For a model with L layers, let W be the set of all weight parameters, and design the error function when the number of training data is N as shown in the above equation.
Since the error function E(W) for the entire data is simply the sum of the errors En(W) for each data point, we focus only on the derivative of a specific En(W).
The partial derivative of En(W) with respect to the weight of the top layer w(L)d,h∈W(L) is given by
where δ(L)n,d denotes the difference between the target label yn,d and the output a(L)n,d of the neural network
the partial derivative with respect to the weights w(L-1)i,j∈W(L-1) in the (L-1)-th layer
Let Φ’ be a derivative of Φ
and
Therefore
where the remaining terms are defined as above
The partial derivative calculation for each weight parameter proceeds through layers L-2, L-3, … in the same way.
In effect, we only need to calculate δ(l)n,j for each layer.
The gradient of each parameter is given by the above equation.
Finally, the algorithm of the backward propagation method
The gradient calculation of the neural network is actually just applying the chain rule of differentiation.
In many programming languages, automatic differentiation can be used for gradient calculations.
A technique for automatically deriving the partial derivatives of a function implemented in a program.
In many cases, there is no need to derive the gradient formulas by hand or implement them directly.
In practical use, the cost function J(W), formed by adding an L2 regularization term Ω(W) as in ridge regression (above equation), is minimized.
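Putting the chain rule into code, here is a hand-written backpropagation sketch for the two-layer network of Section 2.2.1 under a squared error En; the network sizes and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
H0, H1, D = 3, 4, 2
W1 = rng.standard_normal((H1, H0))
W2 = rng.standard_normal((D, H1))
x_n = rng.standard_normal(H0)
y_n = rng.standard_normal(D)

# Forward pass, caching intermediate values.
a1 = W1 @ x_n
z = np.tanh(a1)
a2 = W2 @ z

# Backward pass: deltas flow from the top layer downward.
delta2 = a2 - y_n                                 # delta(L): output minus label
delta1 = (1 - np.tanh(a1)**2) * (W2.T @ delta2)   # delta(L-1) via the chain rule
grad_W2 = np.outer(delta2, z)                     # dEn/dW(2)
grad_W1 = np.outer(delta1, x_n)                   # dEn/dW(1)
print(grad_W1.shape, grad_W2.shape)
```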
2.2.4 Learning using the Hessian matrix
Gradient method using first order derivative is the most common method for training neural networks
Optimization methods using second order derivatives can also be applied
Newton-Raphson method, etc.
Need to calculate the Hessian matrix
Computing it exactly is time-consuming, so it is approximated using the gradients of the output an(L) for each input xn, as in the above equation.
2.2.5 Learning a Classification Model
Forward propagating neural networks are mainly applied to regression problems that predict real continuous values.
When the output takes one value out of a finite number of D, it is called discrimination or classification.
In the case of D=2 (binary classification), the output a(L)n∈ℝ for input xn is transformed into the range μn∈(0,1) using the activation function of the sigmoid function as above
Evaluate the error for label data yn∈{0,1} using the cross-entropy error function as shown above.
The more the values of each label yn and the network output μn match (the more the network output correctly classifies the labels), the smaller the value of the error function E(W) becomes.
The same applies to multi-class classification with D (>2) classes
Represent a label as a vector such that yn∈{0,1}D and ∑d=1Dyn,d=1
Let an(L) ∈ ℝD be the output of D-dimensional continuous values by the neural network, and use the softmax function (above) to compute the D-dimensional vector π(an(L))=(π1(an(L)),… ,πD(an(L)))T
We can define the error using the multi-class version of the cross-entropy error function (above).
As in the case of regression, we can construct a network by minimizing the error function using the error back propagation method.
In the case of multi-class classification, the partial derivative at the output layer is the above equation.
Using the fact that the partial derivative of the softmax function for a(L)n,d∈an(L) is the above equation
This is the difference between the correct label yn,d and the classification result πd(an(L)) output by the network
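A short sketch of the softmax and the multi-class cross-entropy for a single data point, including the fact that the output-layer delta reduces to π(a(L)) − y; the toy numbers are assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # shift for numerical stability
    return e / e.sum()

a_L = np.array([2.0, -1.0, 0.5])     # network outputs a(L) for D=3
y = np.array([1.0, 0.0, 0.0])        # one-hot label, sum_d y_d = 1

pi = softmax(a_L)
cross_entropy = -np.sum(y * np.log(pi))   # error E(W) for this data point
delta_L = pi - y                          # partial derivative at the output layer
print(cross_entropy, delta_L)
```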

2.3 Efficient Learning Methods
Introduction
Learning a forward propagating neural network based on gradient descent is a very simple and versatile algorithm.
Processing speed for large data is a problem
Overfitting for models with many parameters
2.3.1 Stochastic Gradient Descent
To train a neural network with the simple gradient descent method described above, all of the given N training data D={xn,yn}n=1N are fed in at once and the gradient is calculated
Compute the gradient using all the training data in each update
For practical use, instead of processing all N input/output pairs D at once, a small subset Ds={xn,yn}n∈S of size M<N is used for each update
S is a set of M randomly selected indices
The subset Ds selected by S
Set up the error function as above and update the parameters using error back propagation method
The mini-batch error function is equivalent to the error function E(W) for all training data D in terms of expected value.
Assuming that qD(S) is uniformly distributed over S, the above equation is obtained
The term “stochastic” is derived from the fact that random numbers are used to randomly obtain mini-batches from the training data.
If the learning rate at step i is set to αi and its values are scheduled according to the above conditions, the updates by mini-batches converge to a stationary point of E(W) with probability 1.
The simplest scheduling method is the above equation
Stochastically search for a solution with ∇WE(W)=0 using ∇WEs(W), whose expected value equals the full gradient.
In order to improve the efficiency of stochastic gradient descent optimization, a method called momentum method is also often used.
Optimization is performed via a velocity vector p of the same dimension as the parameter to be updated, w, as in the above equation
β ∈ [0,1) adjusts how much past gradients affect the update.
It is known that given an appropriate β, optimization via momentum method can perform better than simple gradient descent method
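A minimal sketch of stochastic gradient descent with momentum on an assumed linear-regression error; the mini-batch size M, learning rate α, and β are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 100, 10
X = rng.standard_normal((N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

w = np.zeros(3)
p = np.zeros(3)                      # velocity vector of the momentum method
alpha, beta = 0.02, 0.9
for step in range(1000):
    S = rng.choice(N, size=M, replace=False)   # random mini-batch indices
    grad = X[S].T @ (X[S] @ w - y[S]) / M      # gradient on the mini-batch only
    p = beta * p - alpha * grad                # update the velocity
    w = w + p                                  # move along the velocity
print(w)
```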
2.3.2 Dropout
Introduction
Dropout is a technique that can significantly improve the prediction accuracy of many deep learning models, including large-scale forward propagating neural networks.
By adding a small amount of noise to the data and hidden units during model training, the goal is to reduce overfitting and increase prediction accuracy on test data.
Technology proposed by Hinton in 2012.
One of the most popular probabilistic regularization methods in the field of deep learning
Describes a method for training forward propagation networks using dropout
Dropout is usually used in the framework of stochastic gradient descent methods
Given a mini-batch Ds={xn,yn}n∈S, each unit is disabled with independent probability r∈(0,1) when calculating the gradient for each data point {xn,yn}.
The probability r is a value that is predetermined before training.
Usually, r=0.5
For each given data point {xn,yn}, dropout constructs a subgraph called a subnet
After calculating the gradient for each point in the mini-batch, we simply average them to get the gradient for the final parameter update.
In dropout, the prediction of y* for test input x* using the trained model requires
Use the original network with no missing units
Scale the output of each unit by a factor of 1-r, which gives good prediction performance
Explanation of how dropout prevents overfitting
Ensemble effect of combining subnets
In a method called bagging in machine learning
Usually, K different neural networks are trained independently using K separate data sets.
Finally, the K prediction results are combined.
In dropout
A single network is used to achieve an effect similar to bagging.
If the number of units is U, then 2^U different neural networks (subnets) are combined and trained in an approximate manner.
Efficient ensembling of substantially more networks than a simple bagging ensemble
Robustly evolves the prediction function in a way that mimics the role of sexual reproduction in genetics.
Probabilistic regularization methods similar to dropout
Instead of disabling units, multiply the output of each unit by noise m drawn from a Gaussian distribution m~N(1,1) with mean 1 and variance 1
Since the expected value of the noise is 𝔼[m]=1, no scaling of the prediction is required, unlike dropout.
DropConnect
A generalization of dropout in which the connections (scalar weights w) between hidden units z are randomly dropped.
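To summarize the training-time and test-time behavior above in code, here is a minimal dropout sketch (my own, using r=0.5 as mentioned above):

```python
import numpy as np

rng = np.random.default_rng(5)
r = 0.5                               # drop probability, fixed before training
z = rng.standard_normal(8)            # hidden-unit activations

# Training time: sample a binary mask, defining one random subnet.
mask = rng.random(8) >= r             # each unit kept with probability 1-r
z_train = z * mask

# Test time: keep all units and scale outputs by the keep probability 1-r.
z_test = z * (1 - r)
print(z_train, z_test)
```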
2.3.3 Batch Normalization
Another probabilistic regularization that is frequently used along with dropout
Applies additive and multiplicative noise to hidden units during training
Simultaneously improves the efficiency of optimization during training and provides a regularization effect through the noise.
Difficulties in training neural networks with multilayer structure
One of the reasons is that the distribution of the input to the next layer changes drastically when the parameters of the previous layer change.
This can be avoided by reducing the learning rate α, but the smaller updates significantly reduce computational efficiency.
Idea of batch normalization
Each time a mini-batch is given, each value of the hidden units {zn}n∈S is normalized to have mean 0 and variance 1, as shown in the above equation
Note that
C is a positive constant that is added to stabilize the numerical calculation.
If we simply normalize the input, the representational capability of each layer is limited to a simple one.
For example, if a sigmoid function is used as the activation function, normalizing the input maps it to the region around zero, where the function is essentially a linear transformation.
To solve this problem, a unit is added that applies a linear transformation to the normalized output źn with learnable parameters γ ∈ ℝ and β ∈ ℝ, as shown above.
When prediction is done using a network with batch normalization
The normalization statistics are computed from the entire training data instead of mini-batches.
Batch normalization, like dropout, can be interpreted as implicitly performing a kind of variational inference.
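A minimal sketch of the batch-normalization computation for one hidden unit over a mini-batch; the constant C and the values of γ and β are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.standard_normal(32) * 3.0 + 5.0   # hidden-unit values over a mini-batch
C = 1e-5                                   # positive constant for numerical stability

z_hat = (z - z.mean()) / np.sqrt(z.var() + C)   # normalize to mean 0, variance 1
gamma, beta = 1.5, 0.2                          # learnable linear-map parameters
out = gamma * z_hat + beta                      # restores expressive power
print(out.mean(), out.std())
```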
2.4 Extended Models of Neural Networks
Introduction
Important practical features of neural network models
Easy to customize to the target data and purpose
2.4.1 Convolutional Neural Networks
The most successful example among the many neural network extension models
A model that incorporates a transformation called convolution instead of the usual matrix product with weight parameters.
High performance in time series data and image recognition tasks
Example of convolutional computation
The left column is a single 2-dimensional image X represented by a matrix
On the right is a weight parameter called a filter.
The bottom is the data after convolution (the formula is the above equation)
Feature map
Example of applying various filters of size 3×3 to actual image data.
A general forward propagation neural network
Convolutional Neural Network
As shown in the diagram, an output Si,j depends only on a local region of the input X.
Dramatically fewer parameters are needed for the transformation than in the fully connected case
Feature extraction is invariant to movement of feature points in the image.
Can be rewritten as an ordinary matrix product by appropriately transforming the filter and the input image.
The filter W can be trained from the data by learning the weights with the error back propagation method.
The computed feature map is then transformed by an activation function such as the rectified linear function, as in a normal forward propagation neural network.
Then a nonlinear transformation called pooling function is often applied.
A typical pooling function is called max pooling.
Pooled values are less sensitive to small changes in the input
The pooling function itself is not trained, but is kept fixed.
When finally classifying the data, a fully connected neural network taking the hierarchically processed features as input is added as the top layer.
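A minimal sketch of the convolution and max-pooling operations described above; the 3×3 vertical-edge filter is an illustrative assumption, not a trained one.

```python
import numpy as np

def conv2d(X, W):
    """Valid 2-D convolution producing the feature map S = X * W."""
    h, w = W.shape
    H, Wd = X.shape
    S = np.zeros((H - h + 1, Wd - w + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # Each output S[i, j] depends only on a local region of X.
            S[i, j] = np.sum(X[i:i+h, j:j+w] * W)
    return S

def max_pool(S, k=2):
    """k x k max pooling; fixed, not trained."""
    H, W = S.shape[0] // k, S.shape[1] // k
    return S[:H*k, :W*k].reshape(H, k, W, k).max(axis=(1, 3))

X = np.random.default_rng(7).random((8, 8))         # a single 2-D "image"
W = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # assumed vertical-edge filter
pooled = max_pool(conv2d(X, W))
print(pooled.shape)
```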
2.4.2 Recurrent Neural Networks
Forward propagation neural networks and convolutional neural networks are based on the hypothesis that each data point is statistically independent.
Fixed number of dimensions for input data and labels
Challenges
There are many types of data that are not statistically independent (correlated in time series)
Audio data
Video data
Character string data
Difficult to handle data of different lengths
Recurrent neural networks are designed to represent time-series data.
Example of a typical recurrent neural network
Continuous-valued input data x1, x2, …, xN and categorical label data y1, y2, …, yN are given.
Each hidden unit zn at time n has a time-series dependency
Based on the hidden unit zn-1 at the previous time and the input data xn at time n, it is calculated as above equation
Wzx, Wzz, and bz are model parameters of the network that are shared across all times
Φ is a nonlinear transformation applied to each element.
From the hidden unit zn at each time, the output is determined by the softmax function π, as shown above
Wyz and by are the parameters for determining the output.
The error function is defined by the error between the output of the network and the label at each time.
The error at time n is given by
The error over the entire time series is given by
Minimization of the error function with respect to the parameters Θ uses gradients obtained by the error back propagation method, as in a normal forward propagation neural network.
Structures of recurrent neural networks in various applications
A model that regresses a single output from an input data series
Used for classification of natural language and video data
A model that generates a series of outputs from input data
Captioning from a single image
A convolutional neural network extracts image features from its middle layer, and these are used as input to a recurrent neural network that generates text
Both input and output are series data
Used in machine translation, etc.
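A minimal sketch of one recurrent cell computing zn from zn-1 and xn with shared parameters, followed by a softmax output; the sizes and random parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
Hx, Hz, D = 3, 5, 2                  # assumed input, hidden, and output sizes
Wzx = rng.standard_normal((Hz, Hx))
Wzz = rng.standard_normal((Hz, Hz))
bz = np.zeros(Hz)
Wyz = rng.standard_normal((D, Hz))
by = np.zeros(D)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

xs = rng.standard_normal((4, Hx))    # input sequence x_1, ..., x_4
z = np.zeros(Hz)                     # initial hidden state
for x_n in xs:
    z = np.tanh(Wzx @ x_n + Wzz @ z + bz)  # parameters shared across all times
    y_n = softmax(Wyz @ z + by)            # output distribution at time n
    print(y_n)
```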
2.4.3 Autoencoders
Application to unsupervised learning
One of the most commonly used methods in the field of deep learning
Learning uses only the inputs X=[x1,…,xN] as training data
Compress X into a low-dimensional representation Z=[z1,…,zN]
Used to extract characteristic structure of X
Goal of the autoencoder
Using two neural networks f and g, convert the observed data X=[x1,…,xN] into another set of variables Z=[z1,…,zN].
Each zn is called a code or, more generally, a latent variable.
Function f is a neural net called an encoder
Function g is a neural net called a decoder
In the autoencoder, these functions are learned by solving the minimization problem of a single loss function L, as shown above.
The encoder f converts the information in xn into a lower-dimensional zn
The decoder g learns to reconstruct the original input xn from zn.
The objective is to extract only the essential information other than noise from the high-dimensional xn as zn.
zn is directly used for data compression
Use it as a feature input for another supervised algorithm in a later stage.
Usually, forward propagation neural networks are used for f and g.
If its representational capacity is too high, zn will retain all the information needed to recover the original data xn, including the noise
It is necessary to restrict the representational capacity of the latent variable zn in some way.
The most commonly used method is to add a regularization term Ω(zn) to zn to construct the objective function as shown in the above equation.
By restricting each zn, we prevent autoencoder training from converging to the identity mapping.
Autoencoders can easily be trained without supervision by simply combining two neural networks.
Challenges
It is not clear how to design the encoder f, the decoder g, the error function L, and the regularization term Ω so as to obtain useful features.
Difficult to adjust to avoid overfitting
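As a concrete, deliberately simplified sketch of the objective above, the following trains a linear encoder f and decoder g by gradient descent on reconstruction error plus an L2 penalty Ω(zn)=λ||zn||²/2; in practice f and g are nonlinear networks, and all sizes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
N, Dx, Dz = 50, 10, 2
X = rng.standard_normal((N, Dx))
We = rng.standard_normal((Dz, Dx)) * 0.1   # encoder weights (f)
Wd = rng.standard_normal((Dx, Dz)) * 0.1   # decoder weights (g)
lam, alpha = 0.1, 0.01

for _ in range(200):
    Z = X @ We.T                    # codes z_n = f(x_n)
    Xr = Z @ Wd.T                   # reconstructions g(z_n)
    R = Xr - X                      # reconstruction residuals
    # Gradients of sum_n ||x_n - g(f(x_n))||^2 / 2 + lam * ||z_n||^2 / 2.
    gWd = R.T @ Z / N
    gWe = (R @ Wd + lam * Z).T @ X / N
    We -= alpha * gWe
    Wd -= alpha * gWd
print(np.mean(R**2))                # reconstruction error after training
```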
Automatic differentiation
Differentiation by computer
The first and most basic method
The derivatives of the objective function to be optimized are derived by hand and then incorporated into the program.
The second method
Numerical differentiation
According to the definition of differentiation, calculate the change in the function f(x+h)-f(x), given an actual small change in the input h.
Third method
Symbolic differentiation
A method of obtaining a representation of a derivative by analyzing a given mathematical expression using symbolic processing.
The fourth method
Automatic differentiation
Automatically calculates the derivative of a function by applying the chain rule of differentiation to the function described by the program.
Classification of methods
Forward mode: accumulating derivatives from the input variables
Reverse mode: accumulating derivatives from the function values
Reverse mode is a generalization of the error back propagation method
Complex derivative calculations should be left to automatic differentiation tools, and algorithm developers should concentrate on designing models and optimization policies.

Chapter 3 Fundamentals of Bayesian Inference

3.1 Probabilistic Inference
Introduction
Bayesian inference treats learning, prediction, and model selection as computational problems on a probability distribution.
3.1.1 Probability density function and probability mass function
When a real-valued function p(x) of an M-dimensional vector x=(x1,…,xM)T∈ℝM satisfies the above two conditions, p(x) is called a probability density function.
When the function p(x) for an M-dimensional vector x=(x1,…,xM)T whose elements take discrete values satisfies the two conditions in the above equation, p(x) is called a probability mass function.
Distribution of x determined by probability density function and probability mass function
3.1.2 Conditional and Marginal Distributions
Probability distribution p(x,y) for two variables x and y
The operation of removing one variable x by integration, as shown in the above equation, is called marginalization.
In the joint distribution p(x,y), the probability distribution of x when a specific value is fixed for y
Formula
p(x|y) is the probability distribution of x
y acts as a kind of parameter that determines the characteristics of this distribution.
Putting the above together, we get the above equation.
Conditional distribution satisfies the conditions of probability distribution
When the above equation is satisfied, x and y are said to be independent.
3.1.3 Expected Value
The expected value is used to quantitatively express the characteristics of a probability distribution.
When x is a vector, the expected value 𝔼p(x)[f(x)] of a function f(x) under a probability distribution p(x) is calculated as above.
For two probability distributions p(x) and q(x), the expected value in the above equation is called the KL divergence (Kullback-Leibler divergence)
KL divergence represents the “distance” between two probability distributions.
It does not satisfy the mathematical axiom of distance.
3.1.4 Variable transformation
A method to derive a new probability density function by performing a variable transformation on the probability density function.
Necessary to understand computational techniques for approximate inference, such as reparameterization gradient and normalizing flow.
Consider transforming a variable one-to-one, as in y=f(x), using a bijective function f:ℝM→ℝM.
If the known probability density function is px(x), the probability density function of y obtained by the transformation is the above equation
Jg is the Jacobian matrix of the inverse function g of f
det(Jg) is the determinant of Jg
Example: A random variable following a Gaussian distribution is transformed by the hyperbolic tangent function to create a new probability density function
One-dimensional Gaussian distribution is defined as above
For a variable x following a Gaussian distribution, consider a new variable y=Tanh(x) transformed by the hyperbolic tangent function
The derivative of the hyperbolic tangent is the above equation
The probability density of Y is the above equation
Probability density obtained by transformation y=Tanh(x) for Gaussian distribution
Histogram obtained from the distribution
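The tanh example can be checked numerically; the following sketch compares a histogram of transformed samples against the density from the change-of-variables formula (the verification approach is my own).

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.standard_normal(100_000)
y = np.tanh(x)                       # transformed samples

def p_y(y):
    gx = np.arctanh(y)               # inverse transformation g(y)
    px = np.exp(-gx**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density at g(y)
    return px / (1 - y**2)           # multiply by |det Jg| = 1 / (1 - y^2)

# The normalized histogram of y should match the analytic density p_y.
hist, edges = np.histogram(y, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - p_y(centers))))   # small discrepancy expected
```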
3.1.5 Graphical Model
A notation that uses nodes and arrows to represent the relationships among multiple variables in a probability model.
A directed graph without a loop structure
Example 1
A probability model with three variables x, y, and z
Graphical model
Example 2
Expression
Graphical model
Another expression
3.2 Exponential Distribution Families
Introduction
Many practical probability distributions used in Bayesian inference belong to a class with a certain form, called the exponential family.
The exponential family of distributions has a number of advantages
3.2.1 Examples of Probability Distributions
Gaussian distribution
A one-dimensional Gaussian or normal distribution is a distribution with a probability density function of x ∈ ℝ as shown above
μ ∈ ℝ is the mean parameter
σ2>0 is the variance parameter
Gaussian distribution is extended to M-dimensional multivariate as shown above
μ ∈ ℝM is the mean parameter in M dimensions
Σ is a covariance matrix of size M×M
The covariance matrix must be a positive definite matrix.
Discrete value system
Bernoulli distribution
Coin toss distribution (variable that takes two values)
A single parameter determines the nature of the distribution.
Probability mass function is the above equation.
Categorical distribution
The Bernoulli distribution extended to take one of D values
The distribution that generates the random variable s is the above equation.
π is a D-dimensional parameter that determines the distribution
Gamma distribution
Probability distribution that generates a positive real number λ>0
Equation of distribution
Both parameters a and b need to be given as positive real numbers
Γ(⋅) is the gamma function
Properties of the gamma function
3.2.2 Calculation Example of Gaussian Distribution
For the most frequently used Gaussian distribution, the following is an example of calculating the marginal and conditional distributions.
Multidimensional distribution (above equation)
Assumptions
Consider a variable x ∈ ℝD divided into two subvectors x1 ∈ ℝD1 and x2 ∈ ℝD2, where D=D1+D2.
The mean parameter μ ∈ ℝD is similarly divided into μ1 ∈ ℝD1 and μ2 ∈ ℝD2.
The covariance matrix is divided as above
Since Σ is a symmetric matrix, we have Σ12=Σ21T
If we set the precision matrix as Λ=Σ-1, we get
The above equation is valid for the elements of the covariance matrix.
In this case, the marginal distribution p(x1) becomes the above equation.
The conditional distribution p(x1|x2) is given by the above equation
where the remaining terms are defined as above
3.2.3 Exponential Family of Distributions
Introduction
Let's look at the definition and properties of the exponential family of distributions in detail.
3.2.3.1 Definition
An exponential family is a family of probability distributions that can be written in the form of the above equation
η:natural parameter
t(x): sufficient statistics
h(x):base measure
a(η):log partition function
This function is used to guarantee that the probability distribution integrates to 1.
3.2.3.2 Examples of distributions
Many distributions, such as the Gaussian, Poisson, multinomial, and Bernoulli distributions, can be expressed as members of the exponential family.
Example
Probability mass function of Bernoulli distribution
When transformed
Correspondence in exponential type expression
Probability mass function in the case of Poisson distribution
When transformed
Correspondence in exponential-family form

3.2.3.3 Relationship between the log partition function and sufficient statistics
Important properties of exponential families of distributions
The gradient of the log partition function a(η) with respect to η is the expected value of the sufficient statistics t(x).
The second-order partial derivatives of the log partition function a(η) give the covariance of the sufficient statistics.
3.2.4 Conjugacy of Distributions
Introduction
An Example of Analytical Inference Calculations for Exponential Families of Distributions
For an exponential-family likelihood, there exists a conjugate prior distribution of the form in the above equation.
3.2.4.1 Analytic Calculation of the Posterior Distribution
An important property of the conjugate prior distribution is that for a likelihood function with an exponential family of distributions, the posterior distribution has the same form as the prior distribution
Given N data points X=[x1,…,xN], the posterior distribution has the above form
Focusing on η, the same form is obtained.
Assuming that the parameters of the posterior distribution are λ1 and λ2
If we take the conjugate prior distribution for an exponential-family likelihood, the posterior distribution can be obtained analytically.
3.2.4.2 Analytical calculation of the predictive distribution
Using the conjugacy property, the predictive distribution of unobserved data x* can be obtained analytically using the posterior distribution as shown in the above equation.
In general, the resulting predictive distribution is not itself a member of the exponential family.
3.2.4.3 Example: Inference of parameters of Bernoulli distribution
As an example, we use the natural-parameter representation of the Bernoulli distribution to derive parameter learning and the predictive distribution.
The conjugate prior of the Bernoulli distribution is the beta distribution as shown above.
Rearranging the equation
Converting the expression to an exponential function type
Suppose that N values xn∈{0,1} following a Bernoulli distribution are observed.
The respective distributions in the original parameter representation are
and
where the remaining terms are defined as above
3.2.4.4 Example: Inference of the precision parameter of a Gaussian distribution
When the mean parameter μ of a one-dimensional Gaussian distribution is fixed
The conjugate prior distribution of the precision parameter γ = σ-2 is the gamma distribution
The one-dimensional Gaussian distribution is
Corresponding to the definition of exponential family of distributions, it can be expressed by natural parameters as shown in the above equation.
The gamma distribution is expressed by the above equation
The above equation is obtained.
The posterior distribution is represented by the above equation using the gamma distribution.
The predictive distribution obtained by marginalizing out the precision parameter γ
is a Student's t-distribution, as in the above equation
where the remaining terms are defined as above
The mean and variance are as in the above equation
3.3 Bayesian Linear Regression
Introduction
We will use a linear regression model to train the model and predict the test data using Bayesian inference.
3.3.1 Model
The above equation gives the joint distribution of the Bayesian linear regression model that predicts continuous-valued labels Y={y1,…,yN} from inputs X={x1,…,xN}.
The label yn is output according to a Gaussian distribution with a fixed variance σ2y as shown in the above equation.
The mean of the Gaussian distribution is determined by the feature function Φ:ℝH0→ℝH1
Only the weight parameter w ∈ ℝH1 for each feature is to be learned.
A Gaussian prior distribution with mean 0 and covariance σw2I is given as above.
The graphical model is shown in the figure above.
Example: sample functions f(x;w)=wTΦ(x) from a model with the cubic feature function Φ(x)=(x3,x2,x,1)T and prior variance σw2=1.
In the regression model in Bayesian statistics, the candidate functions are sampled from the prior distribution before the data is observed.
3.3.2 Learning and Prediction
Introduction
Learning and prediction of Bayesian linear regression models is done analytically.
3.3.2.1 Analytical Calculation of the Posterior Distribution
Let the posterior distribution be the above equation
Take the logarithm and organize it with respect to w
The result is a Gaussian distribution as shown in the above equation.
where the remaining terms are defined as above
3.3.2.2 Analytical calculation of the predictive distribution
The distribution of the predicted value y*, p(y*|x*,Y,X), when the input value x* of the test is given after learning, is the above equation.
This distribution is also Gaussian as shown in the above equation
where the remaining terms are defined as above
The prediction distribution after giving some data points
The red wavy line is the mean value of the prediction
The blue area is the interval within twice the predictive standard deviation σ*
The multiple green lines are the five samples from the predictive distribution
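A minimal sketch of the analytic posterior and predictive computations for this model; the cubic basis, noise variance, and toy data are assumptions for illustration.

```python
import numpy as np

def phi(x):
    """Cubic feature function Phi(x) = (x^3, x^2, x, 1)^T."""
    return np.stack([x**3, x**2, x, np.ones_like(x)], axis=-1)

rng = np.random.default_rng(11)
x_train = rng.uniform(-1, 1, 10)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(10)
sigma_w2, sigma_y2 = 1.0, 0.01       # prior and noise variances (assumed)

Phi = phi(x_train)
# Posterior over w: Gaussian with the covariance and mean below.
Sigma = np.linalg.inv(Phi.T @ Phi / sigma_y2 + np.eye(4) / sigma_w2)
mu = Sigma @ Phi.T @ y_train / sigma_y2

# Predictive distribution at a test input x*: also Gaussian, with a
# variance term reflecting the remaining uncertainty in w.
phi_star = phi(np.array([0.5]))[0]
mean_star = mu @ phi_star
var_star = sigma_y2 + phi_star @ Sigma @ phi_star
print(mean_star, var_star)
```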
3.3.2.3 Comparison with maximum likelihood estimation
Comparison with maximum likelihood estimation using a linear (first-order) model on the training data.
In Bayesian estimation, the uncertainty of the prediction is expressed.
Bayesian estimation reflects the information of the amount of data used for training
In maximum likelihood estimation, the information about the amount of data disappears
3.3.3 Marginal Likelihood
Given the input set X={x1,…,xN} and the label set Y={y1,…,yN} as training data, the parameter w is removed from the joint distribution by integration.
This is called the marginal likelihood or evidence in Bayesian linear regression.
A quantity that expresses the plausibility of the occurrence of data given the model.
3.3.4 Sequential learning
In training a model using Bayesian inference, we can adaptively train on new training data by storing the results of training with a posterior distribution.
In analytic learning with conjugate prior distributions, if no temporal ordering is assumed in the data-generating process, the final posterior distribution is the same whether the data is given sequentially or all at once.
In the example of Bayesian linear regression, the posterior distribution of the parameters after the first training data D1={X1,Y1} comes in is the above equation
Furthermore, if the next training data set D2={X2,Y2} is given, the above equation is obtained.
The posterior distribution p(w|Y1,X1) learned from the first data set D1 is used as the prior distribution to calculate the next posterior distribution.
3.3.5 Application to active learning
By calculating the predictive distribution of a linear regression model, the uncertainty of the predicted value can be quantitatively measured through its variance.
Efficiently collect labeled data for training Bayesian linear regression and other probabilistic inference-based prediction methods.
Select appropriate input data from an unlabeled pool Xpool and ask a human or other oracle for the label yq.
How to select the input data point xq for which you want to know the label
Choose the one with the largest uncertainty in the prediction y* given the new input x*.
Choose the xq that maximizes the entropy of the predictive distribution
Prediction Equation in a Linear Model
Active Learning Example
The same framework as for active learning can be used to search for the maximum value of an unknown function f(x).
It is common to use a Gaussian process
3.3.6 Relationship with Gaussian Process
The mean and variance of the predictive distribution in a linear regression model are the same as above.
If we rewrite them using Woodbury’s formula, we get the above equation
where the remaining terms are defined as above
The feature function Φ always appears only through the combination in the above equation for two input data points x and x'.
k(x,x'): kernel function or covariance function
Regression can be performed by directly designing the covariance function k(x,x’), instead of designing the function Φ that performs feature extraction
Design something like the similarity or closeness between different data points.
3.4 Relationship with Maximum Likelihood Estimation and MAP Estimation
Introduction
This section describes maximum likelihood estimation and MAP estimation.
3.4.1 Maximum Likelihood Estimation and Error Minimization
We derive a method for learning the parameters of a regression model using maximum likelihood estimation.
Assume that the label yn is observed as a function f(xn;w) with parameter w, plus noise ε.
The observed noise follows a Gaussian distribution with a fixed variance σ2 as shown in the above equation.
In summary, yn follows a Gaussian distribution with the mean and variance shown in the above equation.
Given the training data D=[X,Y], the likelihood of the model is given by the above equation
In maximum likelihood estimation, the parameter w is optimized so that the plausibility of the occurrence of data Y is maximized.
The maximum likelihood solution is the above equation.
If we write the log likelihood concretely, it becomes the above equation.
In the case of linear regression, the maximization of the log-likelihood function with respect to the parameter w coincides with the minimization of the error function.
The expression for the gradient is the above equation.
When the learning rate is α, the update equation becomes the above equation.
If the noise parameter σ-2 is absorbed by the learning rate, the equation is equivalent to the gradient descent method.
3.4.2 MAP Estimation and Regularization
As a method similar to maximum likelihood estimation, there exists a method called maximum a posteriori (MAP) estimation.
We show that MAP estimation is equivalent to minimizing a cost function with a regularization term.
MAP estimation views the posterior distribution of the parameters as a function of w and searches for the w that maximizes it.
Assume that the prior distribution of the parameters is given by a Gaussian distribution.
The log posterior probability becomes the above equation.
The parameter that adjusts the strength of regularization is interpreted as the above equation.
Maximization of the above equation is equivalent to minimizing the cost function of the equation with the L2 regularization term introduced.
Maximum likelihood estimation and MAP estimation yield a single point rather than a distribution over the parameters or the prediction y*.
They are clearly distinguished from Bayesian estimation, which uses only probability calculations such as conditional and marginal distributions.
3.4.3 Error Functions for Classification Models
Introduction
Based on the framework of probabilistic models, we extend regression models that predict real values to models that predict finite categories, and consider the correspondence with cross-entropy error functions.
3.4.3.1 The case of binary classification
First, we consider the case where the label takes a binary value yn∈{0,1}.
Interpreting it as a probability model, we can assume that it is generated from the Bernoulli distribution of the above equation
The parameter μn ∈ (0,1) is assumed to be obtained by setting the continuous output of the regression model to ηn ∈ ℝ and applying the sigmoid function as in the above equation
Model for the case f(xn;w)=wTΦ(xn)
Given the natural parameter representation of the Bernoulli distribution as in the above equation
The role of the sigmoid function is to convert the parameter μ of the Bernoulli distribution into the natural parameter η
The log-likelihood based on the Bernoulli distribution is given by the above equation
3.4.3.2 The case of multi-class classification
The same can be considered for multi-class classification
Using the categorical distribution
The D-dimensional parameter πn is obtained by transforming with the softmax function as above
The log-likelihood by the categorical model is the above equation
This is equivalent to minimizing the cross-entropy error function.

Chapter 4 Approximate Bayesian Inference

4.1 Sampling-based Inference Methods
Introduction
In statistical analysis using Bayesian inference, let X be the observed data and Z be the set of unobserved variables such as parameters and latent variables.
The first step is to design a probabilistic model p(X,Z).
In many models, p(Z|X) cannot be obtained analytically
A sampling algorithm is a method for investigating the characteristics of the distribution by obtaining multiple samples from p(Z|X), instead of obtaining p(Z|X) explicitly.
4.1.1 Simple Monte Carlo Method
Consider finding the expected value ∫ f(z)p(z)dz of a function f(z) with respect to a distribution p(z).
We assume that the analytical integration of the expected value is difficult, but sampling from the distribution p(z) is easy.
The most basic method approximates the expected value by drawing a sufficiently large number T of samples from p(z) and averaging, as in the above equation
Simple Monte Carlo method
For N observed data X=[x1,…,xN], evaluate the marginal likelihood p(X) of the model p(X|θ)p(θ) with parameter θ
Approximating the marginal likelihood using simple Monte Carlo
4.1.2 Rejection Sampling
Simplest way to obtain a sample from a probability distribution p(z) whose density is difficult to compute.
Consider a method for taking a sample from a target distribution such as the one above
Assume that p(z) cannot be computed, but the unnormalized function p̃(z) can be.
Set up a hypothetical distribution q(z), which is called the proposal distribution and which can be sampled easily.
Obtain a sample z(t)~q(z) from the proposal distribution.
Obtain a sample ū~Unif(0, kq(z(t))) from the uniform distribution.
If ū > p̃(z(t)), the sample z(t) is rejected
If ū ≤ p̃(z(t)), the sample z(t) is accepted.
In the above procedure, the probability that a sample is accepted is given by the above equation.
If high-dimensional sampling is required in a complex model, the sample acceptance rate becomes very low.
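A minimal rejection-sampling sketch; the unnormalized target p̃(z), the Gaussian proposal, and the bound k are assumptions chosen so that kq(z) ≥ p̃(z) everywhere.

```python
import numpy as np

rng = np.random.default_rng(12)
p_tilde = lambda z: np.exp(-z**2 / 2) * (1 + np.sin(3 * z)**2)  # unnormalized target
q_pdf = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)        # proposal N(0, 1)
k = 2.0 * np.sqrt(2 * np.pi)        # guarantees k * q(z) >= p_tilde(z)

samples = []
for _ in range(10_000):
    z = rng.standard_normal()                    # z ~ q(z)
    u = rng.uniform(0, k * q_pdf(z))             # u ~ Unif(0, k q(z))
    if u <= p_tilde(z):                          # accept; otherwise reject
        samples.append(z)
print(len(samples) / 10_000)                     # empirical acceptance rate
```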
4.1.3 Self-Normalized Importance Sampling
Used when it is difficult to obtain samples from the probability distribution p(z) over which we want to take the expected value.
Can be used in situations where samples cannot be obtained directly from p(z), as in rejection sampling.
Difference from rejection sampling
Instead of obtaining the samples z(1), z(2), … from p(z) itself, the goal is to efficiently calculate the expected value itself.
Formula for calculating the expected value
4.1.4 Markov Chain Monte Carlo Method
Rejection sampling is intuitive and easy to implement, but in practice it can only be applied to simple one-dimensional integrals.
A means for efficient sampling in high dimensional spaces
When the above equation holds for a series of random variables z(1), z(2), …, the series is called a first-order Markov chain.
When the above equation holds, the distribution p*(z) is said to be a stationary distribution.
Let the transition probability be the above equation
Assume that the stationary distribution p*(z) is the posterior distribution from which we want to take a sample.
The idea of the Markov chain Monte Carlo method is to design the transition probability such that the distribution converges to p*(z)
A sufficient condition for p(z) to be stationary is the detailed balance condition
In addition to the detailed balance condition, ergodicity is required: as t → ∞, the distribution converges to the stationary distribution p*(z) from an arbitrary initial state p(z0) under the transition probability T
It is possible to transition from any state to any other state in a finite number of steps
No state has a fixed periodicity
It is possible to return to the same state within a finite number of steps
4.1.5 Metropolis-Hastings Method
The most basic Markov chain Monte Carlo method
Algorithm
The transition probability is given by the above equation
It can be shown that the detailed balance condition is satisfied
Need a transition probability T(z’,z) that converges to a stationary distribution
If it is difficult to design directly, the proposed distribution of transitions q(z|z’) can be used instead
An example of the proposed distribution
A Gaussian distribution z*~N(z(t),I) with the previous sample z(t) as the mean is often used
The behavior of the Metropolis-Hastings method is shown when the two-dimensional Gaussian distribution p(z)=N(0,Σ) is used as the target distribution for the sample.
Since the proposal distribution is a Gaussian distribution N(z(t),I), the behavior becomes the random walk described in “Overview of Random Walks, Algorithms, and Examples of Implementations“.
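A minimal Metropolis-Hastings sketch with the Gaussian random-walk proposal N(z(t), I), targeting an assumed correlated 2-D Gaussian known only up to normalization.

```python
import numpy as np

rng = np.random.default_rng(13)
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
log_p = lambda z: -0.5 * z @ Sigma_inv @ z      # log of the unnormalized target

z = np.zeros(2)
chain = []
for t in range(5_000):
    z_star = z + rng.standard_normal(2)          # proposal z* ~ N(z(t), I)
    # Acceptance test; the symmetric proposal q cancels in the ratio.
    if np.log(rng.uniform()) < log_p(z_star) - log_p(z):
        z = z_star                               # accept
    chain.append(z)                              # otherwise keep the old z
print(np.mean(chain, axis=0))                    # should be near the target mean
```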
4.1.6 Hamiltonian Monte Carlo Method
Introduction
The Hamiltonian Monte Carlo (HMC) method, also called the hybrid Monte Carlo method, is a sampling method that combines analytical-dynamics simulation of object trajectories with the Metropolis-Hastings method.
By using the derivative information of the posterior distribution, the method avoids random walk-like behavior and searches for the posterior distribution more efficiently than the Metropolis-Hastings method, which applies a Gaussian distribution.
4.1.6.1 Simulation of the Hamiltonian
Before going into the sampling algorithm, we will describe a numerical simulation of analytical dynamics using the Hamiltonian.
Let the position vector of an object be z ∈ ℝD and the momentum vector be p ∈ ℝD.
If there is no friction or energy dissipation, the motion of an object on an undulating surface keeps the Hamiltonian (above equation) constant
U(z): potential energy determined by position
K(p): Kinetic energy
Assumption
Mass is assumed to be 1
K(p)=1/2pTp
The behavior of z and p with respect to time τ is determined by the partial derivatives of the Hamiltonian, as shown in the above equation
Integrating these equations over time gives the above equation
Since it cannot be solved analytically, it must be calculated by numerical simulation.
The simplest method is Euler’s method, which approximates the behavior at time ε > 0, as shown in the above equation.
The numerical error due to discretization is large.
Use the leapfrog method.
Can calculate the position z* and momentum p* of an object at time εL
Characteristics of the Hamiltonian simulation
Hamiltonian is time-invariant
It has reversibility.
The transition from the starting point (z,p) to the destination point (z*,p*) is one-to-one
The efficiency of the Hamiltonian Monte Carlo method is characterized by the property of volume preservation.
4.1.6.2 Application to Sampling Algorithms
We will apply the simulation using the Leapfrog method to a sampling algorithm.
For the probability distribution p(z) ∝ p̃(z) from which we want to obtain samples, we introduce an auxiliary variable p and extend it as in the above equation
Since the two distributions are independent, the samples of z obtained from the joint distribution p(z)p(p) can be viewed as identical to those obtained from the marginal distribution p(z)
If we set p(p)=N(p|0,I) and ln p̃(z)=-U(z) to calculate the joint distribution
The Hamiltonian is obtained.
After sampling the momentum p according to the Gaussian distribution, we can simulate the Hamiltonian to obtain a new candidate sample point (z*,p*).
Since the trajectory keeps the Hamiltonian H almost constant, the Metropolis acceptance ratio γ (above equation) is always close to 1.
Although the ratio is not exactly 1, Hamiltonian Monte Carlo achieves a much higher acceptance rate than random-walk methods such as Metropolis-Hastings.
The Hamiltonian Monte Carlo algorithm
The main settings that determine the behavior are the step size ε and the number of steps L
When L is fixed, the smaller ε is, the lower the simulation error and the higher the acceptance rate.
If ε is fixed to a small value and L is increased, the amount of movement can be increased while maintaining a high acceptance rate.
The Hamiltonian Monte Carlo method is very general
It has a limitation that it cannot handle non-differentiable variables such as discrete latent variables.
It has been used for Bayesian training of neural networks, since neural network models often consist of continuous latent variables
Example of Hamiltonian Monte Carlo method
Because the derivative of the target distribution is used, the sampler moves actively toward the center of the target distribution.
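Combining the pieces, a minimal HMC sketch (not from the book) built on the leapfrog function above; U(z) = −ln p̃(z), and step size, step count, and initialization are illustrative assumptions.

```python
import numpy as np

def hmc(U, grad_U, z0, n_samples, eps=0.1, L=20, rng=np.random.default_rng(0)):
    # U(z) = -ln p~(z); leapfrog() is the integrator sketched earlier.
    z = np.array(z0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=z.shape)                     # p ~ N(0, I)
        z_star, p_star = leapfrog(z, p, grad_U, eps, L)
        # accept with probability min(1, exp(H(z, p) - H(z*, p*)))
        dH = (U(z) + 0.5 * p @ p) - (U(z_star) + 0.5 * p_star @ p_star)
        if np.log(rng.uniform()) < dH:
            z = z_star
        samples.append(z)
    return np.array(samples)
```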
4.1.6.3 Langevin Dynamics Method
When L=1, the method is called the Langevin Monte Carlo method or the Langevin dynamics method.
The equation is as above
Implementation is much simpler
Random walk-like behavior becomes stronger
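In code, this corresponds to calling the HMC sketch above with L = 1; the 2D Gaussian target below is an illustrative assumption.

```python
import numpy as np

Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
U = lambda z: 0.5 * z @ Sigma_inv @ z        # potential of a 2D Gaussian
grad_U = lambda z: Sigma_inv @ z
# L = 1 turns the HMC sketch into the Langevin dynamics method
samples = hmc(U, grad_U, z0=np.zeros(2), n_samples=5000, eps=0.1, L=1)
```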
4.1.7 Gibbs sampling
When it is difficult to sample all of Z directly from the probability distribution p(Z),
the variables can be divided into M subsets Z = [Z1, Z2, …, ZM] and sampled sequentially, one subset at a time.
This is called Gibbs sampling.
If the model is constructed using conjugate prior distributions, each conditional distribution can be computed analytically and therefore efficiently.
Effective in the following cases
When the number of variables to be sampled is huge.
When you want to obtain a sample from a huge probability model in which multiple probability distributions are combined.
Example of Gibbs sampling with a two-dimensional Gaussian distribution as the target distribution for obtaining a sample
Since sampling is performed using the conditional distributions p(z1|z2) and p(z2|z1) for each variable, the obtained trajectory is vertical along the axis.
No points are rejected (no red points in the figure) because every draw is accepted.
We divide the random variable to be sampled as Z=[Z1,Z2], and consider sampling Z1 under the condition of Z2.
Taking the proposal distribution to be q(Z*|Z) = p(Z1*|Z2) and noting that Z2* = Z2 remains fixed, we obtain the above equation
The new sample Z1* obtained by Gibbs sampling is always accepted.
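A minimal Gibbs sampler sketch (not from the book) for a zero-mean 2D Gaussian with unit variances; the correlation rho is an illustrative assumption.

```python
import numpy as np

def gibbs_2d_gaussian(n_samples, rho=0.8, rng=np.random.default_rng(0)):
    z1, z2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        # p(z1|z2) and p(z2|z1) are 1D Gaussians; every draw is accepted.
        z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))
        z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))
        samples.append((z1, z2))
    return np.array(samples)
```

The axis-aligned trajectory mentioned above comes from updating one coordinate at a time in this way.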
4.2 Optimization-based Inference Methods
Introduction
Markov chain Monte Carlo methods guarantee that, in the limit of infinitely many iterations, the obtained samples follow the true distribution.
This section explains methods that instead approximate the posterior distribution by optimization.

4.2.1 Variational Inference Methods
The most widely used method among approximate inference algorithms by optimization.
It approximates the unanalyzable integrals that appear when calculating the posterior distribution by replacing them with optimization problems.
The calculation of the marginal likelihood p(x) requires the integration method p(X)=∫p(X,Z)dZ for the latent variable Z.
The variational inference method considers a lower bound L(𝝃) on the log marginal likelihood lnp(x), called the evidence lower bound (ELBO).
𝝃 denotes the variational parameters, such as the mean or variance of the approximate distribution in variational inference
F = −L(𝝃), the negative ELBO, is called the variational free energy
By maximizing L(𝝃) with respect to 𝝃 using general optimization methods such as gradient descent, we obtain an approximation to the log marginal likelihood ln p(X).
There are several ways to design an ELBO, depending on the model and the purpose.
The most commonly used methods
Approximating the posterior distribution p(Z|X) by a distribution q(Z;𝛏) defined by the parameter 𝝃.
A measure of the goodness of approximation
The approximate distribution q(Z;𝛏) is obtained by minimizing the KL divergence with respect to the variational parameters 𝛏
The log marginal likelihood ln p(X) can be decomposed into the ELBO and the KL divergence of the previous equation, as shown above
where
Since ln p(X) itself is constant regardless of the value of 𝛏, minimizing the KL divergence with respect to 𝛏 is equivalent to maximizing L(𝝃)
There are various choices of how to place the approximate distribution q
For complex models, the posterior distribution is often approximated by assuming independence, as in the above equation
The set of latent variables Z is divided into M parts as Z = [Z1, …, ZM]
This is called the mean-field approximation.
When the approximate distributions q(Z1), …, q(ZM) are updated alternately, the algorithm behaves very similarly to Gibbs sampling.
The main advantage of variational inference is its computational efficiency.
Variational inference minimizes the KL divergence with respect to the variational parameters, so the approximation accuracy always improves during the computation process.
The disadvantage, in contrast to sampling algorithms, is that the attainable approximation accuracy is limited when the true posterior distribution is complex.
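As a toy illustration (not from the book), the following sketch runs mean-field coordinate ascent for a 1D Gaussian with unknown mean μ and precision τ, assuming conjugate priors μ ~ N(μ0, (λ0τ)⁻¹) and τ ~ Gamma(a0, b0); the synthetic data and prior values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=100)          # synthetic data
N, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

m, lam = xbar, 1.0                          # q(mu) = N(m, 1/lam)
a = a0 + (N + 1) / 2                        # shape of q(tau), fixed
for _ in range(50):                         # coordinate ascent (CAVI)
    # update q(tau) = Gamma(a, b) given the current q(mu)
    e_sq = np.sum((x - m) ** 2) + N / lam   # E[sum_i (x_i - mu)^2]
    b = b0 + 0.5 * (e_sq + lam0 * ((m - mu0) ** 2 + 1 / lam))
    # update q(mu) given E[tau] = a / b
    lam = (lam0 + N) * (a / b)
    m = (lam0 * mu0 + N * xbar) / (lam0 + N)
```

Each update holds the other factor fixed, which is the alternating structure noted above.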
4.2.2 Example: Learning a Latent Variable Model with Mean Field Approximation
Introduction
A typical use case of variational inference methods
Mainly, high-dimensional observed data X = [x1, …, xN] are represented by a low-dimensional latent set Z = [z1, …, zN].
Data compression
Extraction of features
Interpretation with probability model
Principal component analysis (PCA)
Independent component analysis
K-means method
Construct a linear dimensionality reduction model
Approximate the posterior distribution using mean-field approximation
Based on the similarity with the EM algorithm (expectation maximization algorithm) for maximum likelihood estimation
The variational EM algorithm is the basis for the following techniques
A nonlinear version of the variational auto encoder (VAE)
A nonparametric Bayesian version of the Gaussian process latent variable model (GPLVM)
4.2.2.1 Application to linear dimensionality reduction
As in the supervised linear regression model, the observed data X = [x1, …, xN] are generated from the input variables Z = [z1, …, zN] by a linear transformation with fixed noise σx² (above equation).
Assume that the input variable Z is not a set of observed values, but a set of unobserved latent variables
Assume that each latent variable is generated by a Gaussian distribution as in the above equation.
The parameters are assumed to follow a Gaussian distribution as in the above equation.
Decompose and approximate the true posterior distribution by q as in the above equation
The lower bound of the logarithm of the marginal likelihood p(X)=∫p(X,Z,W)dZdW is the above equation.
In the variational inference method, as in the Markov chain Monte Carlo method, the approximate distribution is first initialized appropriately, and then the lower bound L is maximized by repeating the steps to update the approximate distribution.
At a step i+1, the approximate distribution of the previous step is qi(Z),qi(W)
The update qi+1(W) is obtained by maximizing the lower bound of the previous equation with respect to q(W) while keeping qi(Z) fixed
However
Also
Maximizing the previous equation is equivalent to minimizing the KL divergence, so the optimal solution is the above equation
Update the approximate posterior distribution of the right-hand side parameters
The optimal qi+1(Z) can be obtained by fixing qi+1(W).
where
And
Updating the approximate posterior distribution of the latent variable as in the right-hand side
In the linear dimensionality reduction model, the updated distributions at each step can be obtained analytically (both turn out to be Gaussian)
4.2.2.2 Application to a mixture Gaussian distribution
When discrete latent variables are used instead of continuous latent variables Z, a clustering algorithm can be derived
Consider dividing the observed data X = [x1, …, xN] into K groups.
As an example, consider the Gaussian mixture distribution
Probability model assuming that each data point is generated from a collection of K different Gaussian distributions
Let the likelihood function of the observed distribution be the above equation
σx2 is a fixed noise.
Each latent variable is assumed to follow a categorical distribution as in the above equation
The parameters of each group k = 1, …, K follow Gaussian distributions.
Using the mean-field approximation and decomposing as in the above equation, we can obtain the variational EM algorithm in the same way as the linear dimensionality reduction.
4.2.3 Laplace Approximation
Similar to variational inference, Laplace approximation is an inference method that approximates the posterior distribution using a simpler distribution
Letting ZMAP be the point at which the posterior distribution p(Z|X) attains its maximum, the Laplace approximation approximates the posterior by a Gaussian distribution, as in the above equation.
The precision matrix Λ(Z) of the Gaussian distribution is the negative of the Hessian matrix of the log-posterior distribution.
The procedure of Laplace approximation is as follows
First, find the maximizer ZMAP of the log-posterior distribution using an optimization algorithm such as gradient descent or the Newton-Raphson method.
Next, evaluate the above equation at that point to obtain the precision matrix 𝝠(ZMAP).
By approximating and expressing the posterior distribution as a Gaussian distribution, it becomes easier to evaluate the integrals necessary for calculating the predictive distribution and the marginal likelihood later.
The Laplace approximation corresponds to a second-order approximation of the shape of the log-posterior distribution by a Taylor expansion around ZMAP.
Use the fact that ZMAP is the point where ∇z ln p(Z|X) = 0
Taking the exponent of the previous equation, the distribution of Z is approximated by the above equation.
By normalizing the right-hand side, we can obtain a Gaussian representation.
The Laplace approximation is simple in idea and can be easily extended to learn parameters based on existing regularization and MAP estimation.
Challenges
Computing the Hessian matrix is time consuming.
Since the approximation uses a Gaussian distribution, it is not directly applicable to probability distributions defined on discrete or non-negative real numbers.
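A minimal 1D sketch (not from the book), assuming an illustrative non-Gaussian target and numerical derivatives:

```python
import numpy as np

def log_p(z):                      # example: a skewed, non-Gaussian target
    return -z**2 / 2 + np.sin(2 * z)

def laplace_1d(log_p, z0=0.0, lr=0.1, n_iter=200, h=1e-4):
    grad = lambda z: (log_p(z + h) - log_p(z - h)) / (2 * h)
    z = z0
    for _ in range(n_iter):        # gradient ascent to the MAP point
        z += lr * grad(z)
    # precision = negative second derivative of log_p at z_MAP
    hess = (log_p(z + h) - 2 * log_p(z) + log_p(z - h)) / h**2
    return z, -hess                # q(z) = N(z | z_MAP, 1 / precision)

z_map, precision = laplace_1d(log_p)
```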
4.2.4 Approximation by Moment Matching
Introduction
This section introduces an approximation method for probability distributions called moment matching, which is the basis for approximate inference methods such as assumed density filtering and the expectation propagation method.
4.2.4.1 Moment matching
Similar to variational inference and Laplace approximation, we will consider a method for approximating a complex probability distribution p(z) using a simpler distribution q(z).
The approximate distribution can be expressed as an exponential family distribution, as in the above equation.
Define the KL divergence between the target distribution p(z) to be approximated and the approximate distribution of the preceding equation as shown in the above equation.
Minimize the previous equation with respect to the natural parameter η.
Calculate the gradient for η and set it to zero to obtain the above equation.
To obtain the optimal approximation within an exponential family of distributions,
it suffices to compute the expectation of the sufficient statistics of p(z), 𝔼p[t(z)], i.e., the moments of p(z), and use the result to determine the parameters of the exponential family distribution.
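A minimal sketch (not from the book): approximate an arbitrary 1D density by a Gaussian whose mean and variance match the moments of the target, computed here on a grid (the target density and grid are illustrative assumptions).

```python
import numpy as np

z = np.linspace(-10, 10, 4001)
dz = z[1] - z[0]
p = np.exp(-z**2 / 2) * (1 + np.tanh(2 * z))   # unnormalized example target
p /= p.sum() * dz                              # normalize on the grid

mean = np.sum(z * p) * dz                      # E_p[z]
var = np.sum((z - mean) ** 2 * p) * dz         # E_p[(z - mean)^2]
# q(z) = N(z | mean, var) matches the first two moments of p(z)
```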
4.2.4.2 Assumed Density Filtering
Approximate inference using moment matching can be used in cases where sequential learning cannot be performed analytically.
This section describes the simplest approximate sequential learning method, namely, assumed density filtering.
After observing the data set D1, the posterior distribution of the parameter θ is given by
If there is conjugacy between the likelihood function p(D1|θ) and the prior distribution p(θ), then p(θ|D1) and the subsequent p(θ|D1,D2), p(θ|D1,D2,D3), … can all be computed analytically, each taking the same form as the prior distribution.
The above approach cannot be used in models where conjugacy does not hold
Set up an approximate distribution q1(θ) for p(θ|D1) and make an approximation as in the above equation
where Z1=∫p(D1|θ)p(θ)dθ
For the approximate distribution q1(θ), choose the same distribution as the prior distribution p(θ).
Calculate the moments of the right-hand side and determine q1(θ) so that its moments match them.
Then, as the data sets D2, D3, … arrive sequentially, the approximation is performed as in the above equation.
The approximate posterior distribution can be updated while maintaining the same form of distribution as q(θ).
As a practical example, consider the cases where the distribution q(θ) of a parameter θ ∈ ℝ is a Gaussian or a gamma distribution.
Let the new likelihood term to be added be fi+1(θ)=p(Di+1|θ)
As shown in the above equation, we want to update the approximate distribution from qi(θ) to qi+1(θ) by matching the moment of qi+1(θ) to the moment of ri+1(θ).
4.2.4.3 Example of Gaussian Distribution
Consider an approximate distribution with a one-dimensional Gaussian distribution (above equation).
The normalization constant Zi+1 becomes the above equation.
Partially differentiating the logarithm of the previous equation with respect to μi yields the above equation
Therefore, the first-order moment of the distribution ri+1(θ) is given by the above equation.
Partially differentiating the logarithm of the previous equation with respect to vi yields the above equation
Therefore, the second-order moment is given by the above equation
Therefore, the parameter of the new distribution qi+1 obtained by moment matching is the above equation
4.2.4.4 Example of Gamma Distribution
Similarly, consider the case of gamma distribution
The normalization constant is given by the above equation.
Using the definition of the density function of the gamma distribution, the first and second order moments to be approximated are given by the above equations.
Since the mean and variance of the gamma distribution are a/b and a/b², respectively, the moment-matching conditions determine the parameters ai+1 and bi+1 of the new distribution qi+1(θ).
If we rewrite this in terms of ai+1 and bi+1, we get the above equation
4.2.5 Example: Learning a probit regression model by moment matching
Using the classification model for predicting binary labels yn ∈ {−1, 1} called probit regression as an example,
we derive approximate inference by assumed density filtering using moment matching.
The likelihood function for probit regression is the above equation
The nonlinear function Φ is the cumulative distribution function of the standard normal distribution
The prior distribution of the parameters is assumed to be Gaussian with a fixed variance v0 (above).
The marginal likelihood of this model is the above equation
Cannot be calculated analytically
Instead, we consider adding the likelihood term fi(w)=p(yi|xi,w) one by one by updating the approximate posterior distribution by moment matching
Use a Gaussian distribution to approximate the posterior distribution of the parameters
The first update adds the first likelihood term f1(w) to the prior distribution to obtain the first approximation q1(w), as in the above equation
Thereafter, update with the above equation
The final result is the above equation
where the quantities are as defined above
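A sketch (not from the book) of assumed density filtering for probit regression with q(w) = N(m, V), processing one data point at a time using the standard closed-form Gaussian/probit moment updates; the synthetic data and prior settings are assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])
X = rng.normal(size=(200, 2))
y = np.where(X @ w_true + 0.5 * rng.normal(size=200) > 0, 1, -1)

m = np.zeros(2)            # prior mean
V = np.eye(2)              # prior covariance (v0 = 1)
for x, yi in zip(X, y):    # one moment-matching update per data point
    s = x @ V @ x + 1.0                         # probit has implicit unit noise
    t = yi * (x @ m) / np.sqrt(s)
    alpha = norm.pdf(t) / norm.cdf(t)           # N(t) / Phi(t)
    Vx = V @ x
    m = m + Vx * (yi * alpha / np.sqrt(s))      # matched first moment
    V = V - np.outer(Vx, Vx) * (alpha * (t + alpha) / s)  # matched second moment
```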
4.2.6 Expectation Propagation Method
Assumed density filtering discards each data point once it has been used for training and never revisits it
As a result, the outcome depends strongly on the order in which the training data arrive.
The expectation propagation method generalizes assumed density filtering so that it can be used for batch learning.
By visiting the same data multiple times during the optimization process, it obtains an approximate posterior distribution that is more accurate than that of assumed density filtering.
Consider a stochastic model consisting of a prior distribution p(θ) of parameter θ and a likelihood function p(x|θ).
fn is called a factor and is defined as in the above equation.
The posterior distribution of this model when using factors is shown in the above equation.
The approximate distribution for this posterior distribution is expressed as a product of approximate factors f̃n, as in the above equation
For example, suppose we choose a Gaussian probability density for each approximate factor, such as f̃n(θ) = N(θ|μn,Σn).
Then q(θ), obtained by normalizing the product of the N+1 approximate factors, is also Gaussian
Consider a procedure that successively updates the approximation by processing some i-th factor fi
Initialize the parameters of the approximate distribution q(θ) appropriately.
Let the current approximate distribution in the process of updating be qold(θ).
First, remove the i-th current approximation factor from the approximation distribution.
To the distribution from which the i-th approximate factor has been removed, add the model factor fi(θ) and normalize, giving the above equation.
Zi is a normalization constant
Calculate the moments of the distribution r(θ) and make them the moments of the new approximate distribution qnew(θ)
The same procedure as for the assumed density filtering can be used
When qnew(θ) is a Gaussian distribution, this can be done using the moment-matching formulas ① and ② above.
Using the newly updated approximate distribution qnew(θ), update the approximate factor f̃i(θ) as in the above equation.
Apply this repeatedly for all i = 0, …, N to improve the approximate distribution.
Intuitively,
the expectation propagation method extends assumed density filtering, a sequential approximate learning method, to batch learning.
The idea is to update the approximate posterior distribution repeatedly without discarding the factors already used for training.
If factors were simply added over and over using the same procedure as assumed density filtering, it would amount to artificially inflating the amount of training data (the same data would be counted many times).
Instead, each approximate factor f̃i is replaced by the corresponding factor fi of the model
Insert a process to extract the factor to be updated from the approximate distribution.
The expectation propagation method is not guaranteed to decrease any single objective function.
It has been experimentally shown to work well when used for training neural nets and Gaussian processes.
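A minimal 1D sketch (not from the book) of the EP loop, storing each approximate factor as an unnormalized Gaussian in natural parameters and computing moments of the tilted distribution on a grid; the factors and settings are illustrative assumptions, and real implementations add damping and guards against negative precisions.

```python
import numpy as np

grid = np.linspace(-10, 10, 4001)
dg = grid[1] - grid[0]
prior_prec, prior_shift = 1.0, 0.0                        # p(theta) = N(0, 1)
factors = [lambda t, c=c: np.where(c * t > 0, 0.95, 0.05)  # step-like factors
           for c in (1.0, 1.0, -0.2)]
prec = np.full(len(factors), 1e-8)                        # approx. factor precisions
shift = np.zeros(len(factors))                            # approx. factor shifts

for _ in range(10):                                       # EP sweeps over factors
    for i, f in enumerate(factors):
        cav_prec = prior_prec + prec.sum() - prec[i]      # cavity: remove factor i
        cav_shift = prior_shift + shift.sum() - shift[i]
        # tilted distribution r(theta) ∝ cavity(theta) * f_i(theta), on the grid
        log_r = -0.5 * cav_prec * grid**2 + cav_shift * grid + np.log(f(grid))
        r = np.exp(log_r - log_r.max()); r /= r.sum() * dg
        mean = (grid * r).sum() * dg                      # moment matching
        var = ((grid - mean) ** 2 * r).sum() * dg
        prec[i] = 1.0 / var - cav_prec                    # new approximate factor
        shift[i] = mean / var - cav_shift
```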

Chapter 5 Bayesian Inference in Neural Networks

Introduction.
We will apply the algorithms of the previous section to training and predicting neural network models.
Explain how to train using mini-batches to make approximate inference computations more efficient for large datasets
Present application examples.
5.1 Approximate Inference Methods for Bayesian Neural Network Models
Introduction
The approximate inference method can be directly applied to deep learning models such as forward propagating neural networks.
First, we explain the basic training method for neural networks by batch learning.
5.1.1 Bayesian Neural Network Model
How to make a forward propagating neural network Bayesian
Probabilistic learning and prediction is performed by setting a prior distribution for the parameters that govern the behavior of the network.
Using the simplest Bayesian model of a forward propagating network, we will derive various approximate inference algorithms for the posterior distribution.
For simplicity, we will focus on regression models.
The main idea of the discussion is the same for classification models with Bernoulli or categorical distributions in the output layer.
Given input data X = [x1, …, xN] and observed data Y = [y1, …, yN], the joint distribution with the parameters is given by the above equation
Assume that the regression problem is to predict yn∈ℝD from xn∈ℝH0.
For the observation model, we use a Gaussian distribution as in the above equation.
σy² is a fixed noise parameter
f(xn;W) is a neural network whose output dimension is D
For example, in the case of L=2, the output of each dth dimension is the above equation.
In the framework of Bayesian inference, we calculate the posterior distribution of the parameters after the training data is given.
We need to explicitly set the prior distribution for the parameters of the neural network.
Let each weight parameter be w ∈ W and give an independent Gaussian distribution as in the above equation
Sample cases are shown in which the hyperbolic tangent is used as the activation function, the input vector is (x, 1)ᵀ, and the number of hidden units H1 and the prior variance σw² of the weight parameters are varied
The larger H1 is, the more complex the function generated by the prior distribution becomes.
The larger the value of σw2, the more steeply varying the function is generated.
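A sketch (not from the book) of drawing functions from such a prior; the hidden-layer width H1 and prior scale σw below are illustrative assumptions.

```python
import numpy as np

def sample_prior_function(x, H1=50, sigma_w=5.0, rng=np.random.default_rng(0)):
    # x: (N,) inputs; append a bias so the input vector is (x, 1)^T
    X = np.stack([x, np.ones_like(x)], axis=1)           # (N, 2)
    W1 = rng.normal(0, sigma_w, size=(2, H1))            # input -> hidden
    W2 = rng.normal(0, sigma_w, size=(H1 + 1, 1))        # hidden + bias -> output
    Z = np.tanh(X @ W1)                                  # (N, H1)
    Z = np.concatenate([Z, np.ones((len(x), 1))], axis=1)
    return (Z @ W2).ravel()                              # one prior function draw

x = np.linspace(-3, 3, 200)
f = sample_prior_function(x)   # larger H1 / sigma_w give more complex, steeper draws
```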
5.1.2 Learning by Laplace Approximation
Introduction
First, we derive the learning and prediction by Laplace approximation for the Bayesian neural network model.
5.1.2.1 Approximating the Posterior Distribution
First, the MAP estimate of the posterior distribution of the model is obtained by optimization, and then the surrounding area is approximated by Gaussian distribution.
The posterior distribution is given by the above equation
Since the denominator p(Y|X) does not depend on the weight parameters W, it can be ignored when optimizing over W.
The local optimal solution WMAP is updated using the gradient of the log-posterior distribution as shown in the above equation.
Since the log-posterior distribution can be written as above
Calculating the partial derivative of a parameter w ∈ W yields the above equation
The L2 regularization term ΩL2(W) is derived from the Gaussian prior distribution p(w) on the parameters
E(W) is the error function of the neural network
The approximate Gaussian posterior distribution, calculated by Laplace approximation, can be expressed as
The precision matrix 𝝠 is obtained as above
H is the Hessian matrix of the error function of the neural network
5.1.2.2 Approximating the Predictive Distribution
After obtaining the approximation of the posterior distribution of the parameters, we can approximate the predictive distribution of the output y* for the input of the test as shown in the above equation.
This cannot be calculated analytically because it involves a nonlinear neural network.
To compute the predictive distribution, we perform a linear approximation of the neural network function.
Make the following assumptions
The density of the posterior distribution of the parameters is concentrated around the MAP estimate,
and within that small region the network output f(x;W) can be well approximated by a linear function of W.
If we approximate the function f(x;W) of W by Taylor expansion to the first order around WMAP, we get the above equation
g is the gradient of the function evaluated by WMAP as in the above equation
By using this approximation, the nonlinear function peculiar to neural networks is eliminated.
The approximation of the predictive distribution we want to find is the above equation
where
Example of predictive distribution obtained by Laplace approximation
The red curve represents the predictive mean f(x;WMAP).
The light blue area represents the uncertainty of the prediction (twice the standard deviation)
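A sketch (not from the book) of the linearized predictive distribution: mean f(x*;WMAP) and variance σy² + gᵀ𝝠⁻¹g. The callables f and grad_f_w, and the arguments, are assumed to be available from the earlier steps.

```python
import numpy as np

def laplace_predict(f, grad_f_w, W_map, Lambda, x_star, sigma_y2):
    mean = f(x_star, W_map)                     # predictive mean
    g = grad_f_w(x_star, W_map)                 # gradient of f wrt weights at W_MAP
    var = sigma_y2 + g @ np.linalg.solve(Lambda, g)   # g^T Lambda^-1 g + noise
    return mean, var                            # y* ~ N(mean, var), approximately
```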
5.1.3 Learning with the Hamiltonian Monte Carlo method
Introduction
Consider sampling from the posterior distribution of a Bayesian neural network using the Hamiltonian Monte Carlo method.
The Hamiltonian Monte Carlo method can be applied if the logarithmic distribution is differentiable with respect to the variable to be sampled.
Compared with the Laplace approximation and variational methods, which force a Gaussian form onto the posterior distribution, it can capture more complex posterior shapes.
5.1.3.1 Inference of Weighted Parameters
Applying the Hamiltonian Monte Carlo method to a Bayesian neural network works in the same way as applying it to ordinary generalized linear models such as logistic regression.
If we use an unnormalized posterior distribution, the corresponding potential energy is given by
The derivative of the potential energy is required in order to use the leapfrog method
Experimental results with H0=1 as the input dimension and D=1 as the output dimension.
Unlike the Laplace approximation, the uncertainty of the prediction is realized by sampling multiple functions.
5.1.3.2 Hyperparameter Inference
σw², which governs the prior distribution of the weight parameters W, and the noise parameter σy² of the observation model are treated as hyperparameters.
Usually, the appropriate ones are given as fixed values before training.
This reflects the rough scale of the data.
Prior distributions can also be set for hyperparameters and inferred simultaneously with weight parameters in a sampling framework.
Setting prior distributions for hyperparameters
Parameter W is determined according to a Gaussian distribution with variance σw2
Introduce the precision parameter γw=σw-2 and consider the gamma distribution as shown above
Let aw>0 and bw>0 be fixed values.
The precision parameter γy = σy⁻² of the observation noise is likewise given a gamma prior, as in the above equation.
Let ay>0 and by>0.
When the prior distributions of these precision parameters are introduced, the model becomes the above equation.
The overall posterior distribution is shown in the above equation.
Use Gibbs sampling to obtain a sample of each random variable
Sample each random variable W, γw, and γy using conditional probability
The distribution of W conditioned on the sampled values of γw and γy is given by the above equation
Posterior distribution of parameters itself
We can get a sample of W by running a Hamiltonian Monte Carlo method.
Given W and γy, the distribution of γw, ignoring the part not related to γw, is given by
p(W|γw) is a Gaussian distribution
Since the gamma distribution, a conjugate prior, is used for the prior p(γw) of the precision γw,
this conditional distribution can also be obtained analytically as a gamma distribution.
From the above, the above equation can be obtained as a sample of γw.
Posterior distribution of the gamma distribution (3.60)
Kw is the total number of weight parameters
The distribution of γy given W and γw is also obtained in the same way
The above equation is obtained as a sample of γy
Intuitive interpretation of the above equation
γy captures the error that cannot be expressed by the network function f(x;W).
Since the mean of a gamma distribution is a/b, a larger learned rate parameter means a lower precision of y given f(x;W), i.e., a larger learned observation variance.
The above assumes for simplicity that a common hyperparameter γw is given for all weight parameters.
The hyperparameters can be divided into several groups
Consider a hierarchy with different hyperparameters for each group
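A sketch (not from the book) of the two gamma conditional draws, following standard Gaussian-gamma conjugacy; the function name, arguments, and hyperprior values are assumptions, and in the full sampler this alternates with an HMC update of W.

```python
import numpy as np

def sample_precisions(W, resid_sq_sum, N, rng,
                      a_w=1.0, b_w=1.0, a_y=1.0, b_y=1.0):
    # W: current weight sample; resid_sq_sum: sum of squared residuals.
    Kw = W.size                                  # total number of weights
    # numpy's gamma takes (shape, scale); scale = 1 / rate
    gamma_w = rng.gamma(a_w + Kw / 2, 1.0 / (b_w + 0.5 * np.sum(W**2)))
    gamma_y = rng.gamma(a_y + N / 2, 1.0 / (b_y + 0.5 * resid_sq_sum))
    return gamma_w, gamma_y
```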

5.2 Improving the Efficiency of Approximate Bayesian Inference
Introduction
Bayesian neural networks require a large amount of computation for posterior inference over the parameters.
We will explain an advanced method that enables approximate Bayesian inference even for large networks and huge amounts of data.
Efficient sampling and variational inference using mini-batches
5.2.1 Learning with stochastic gradient Langevin dynamics
Stochastic gradient descent is one of the methods that has contributed the most to the efficient learning of large neural network models.
Methods such as optimization with additional regularization terms and MAP estimation
cannot handle parameter uncertainty.
Overfitting
Cannot make predictions or evaluate models based on uncertainty
In the field of Bayesian inference
Gradient-based sampling, such as Hamiltonian Monte Carlo, is most commonly used when the posterior distribution is differentiable
Standard Markov chain Monte Carlo methods are not computationally efficient for large data sets.
Combining the computationally efficient stochastic gradient descent method with the Langevin dynamics method for uncertainty estimation compensates for the weaknesses of both methods.
A method that combines mini-batch based learning and Markov chain Monte Carlo methods
Optimizing the cost function of a neural network with a regularization term using a stochastic gradient descent algorithm
Interpreting the regularized objective as MAP estimation, write the parameter update as Wnew = Wold + ∆W; the update step is given by the above equation
The learning rate αi is set to satisfy the above equation.
The step to obtain a sample candidate for the Langevin dynamics method, which is a batch learning algorithm, is given by
The potential energy is U = −ln p(W|Y,X)
The step size is ε=√αi
The momentum vector is p ~ N(0, I)
In the Langevin dynamics method, to correct the error due to discretization,
it is necessary to decide whether to accept each candidate point using the Metropolis-Hastings method.
If the learning rate αi is small, the acceptance rate can be made arbitrarily close to 1.
Finally, the mini-batch version of the Langevin dynamics method is given by the above equation
where
At the beginning of the algorithm, take advantage of the stochastic gradient descent method to efficiently search the space of the posterior distribution.
As T increases, the Langevin dynamics method provides an approximate sample from the true posterior distribution
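A minimal SGLD update sketch (not from the book); grad_log_prior and grad_log_lik are assumed callables, and the N/M factor rescales the mini-batch gradient into an unbiased full-data estimate.

```python
import numpy as np

def sgld_step(W, batch, N, alpha, grad_log_prior, grad_log_lik, rng):
    # One stochastic gradient Langevin dynamics step with learning rate alpha.
    M = len(batch)
    grad = grad_log_prior(W) + (N / M) * sum(grad_log_lik(W, d) for d in batch)
    noise = rng.normal(0, np.sqrt(alpha), size=W.shape)
    # injected Gaussian noise turns SGD into an approximate posterior sampler
    return W + 0.5 * alpha * grad + noise
```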
5.2.2 Learning by Probabilistic Variational Inference
Variational inference can be used to approximate the posterior distribution of parameters.
There are many applications of variational inference to neural networks.
For deep learning models that handle large amounts of data, conventional batch learning cannot be used efficiently.
Combining variational inference and stochastic gradient descent with scalable learning methods using mini-batches
Dramatically improve training of Bayesian neural networks
Let 𝛏 be the set of variational parameters, and approximate the posterior distribution of the neural network parameters W by q(W;𝛏)
ELBO becomes the above equation
When gradient descent is used to maximize with respect to the variational parameters 𝛏 = {μi,j(l), σi,j(l)}i,j,l,
evaluating the gradient at each step requires processing the entire training data set once.
This is inefficient for deep learning, which handles large amounts of data.
For simplicity, we assume an independent Gaussian distribution for approximation, as shown in the above equation.
Even in the framework of variational inference, it is desirable to apply optimization using mini-batches, such as stochastic gradient method.
Consider a mini-batch Ds={xn,yn}n∈S with M data from a data set D, and evaluate the partial ELBO
The Ls computed from such a randomly sampled mini-batch is an unbiased estimator of the full-data L.
In the stochastic variational inference method, instead of directly maximizing the lower bound L(𝛏) for the entire data set,
the mini-batch lower bound Ls(𝛏) is maximized at each step.
The posterior distribution of the parameters can be approximated efficiently even for large-scale data.
5.2.3 Monte Carlo approximation of the gradient
Introduction
In general, to maximize the ELBO,
one would first integrate the parameters out under the approximate distribution q(W),
after which maximizing the ELBO with respect to 𝛏 would be easy.
In the case of neural networks, however, even with a Gaussian q(W), the parameters W in the ELBO cannot be integrated out analytically.
To apply the gradient descent method to the above equation, we need to calculate the gradient with respect to the variational parameter 𝛏.
The KL divergence term DKL[q(W;𝛏)||p(W)] in the right equation can be calculated analytically since both distributions are Gaussian.
The log likelihood term ∫q(W;𝛏)lnpy(yn|f(xn;W))dW on the right hand side cannot be integrated analytically.
Use Monte Carlo methods to approximate the integral and obtain an estimate of the gradient
Given a function f(w) and a distribution q(w;𝛏) over w ∈ ℝ, the goal is to evaluate the gradient in the above equation
In the following, we will explain how to calculate I(𝛏).
5.2.3.1 Score Function Estimation
A simple gradient evaluation formula, called score function estimation, uses the above formula to evaluate the previous equation
Thus
That is, from the above equation,
an unbiased estimator of I(𝛏) can be obtained by sampling w from the distribution q(w;𝛏) and evaluating the resulting expression.
The score function estimator can be used
whenever the derivative of the log density ln q(w;𝛏) of the sampling distribution can be computed.
In practice, this leads to very high variance
For efficient ELBO maximization, techniques to reduce the variance, such as the control variates method, are necessary.
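A minimal score-function (log-derivative) estimator sketch (not from the book) for I(𝛏) = ∂/∂μ E_q[f(w)] with q(w;𝛏) = N(w|μ, σ²); the function f is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda w: np.sin(w) + w**2                   # illustrative integrand
mu, sigma = 0.5, 1.0
w = rng.normal(mu, sigma, size=10_000)           # w ~ q(w; xi)
# score wrt mu is d/dmu ln q = (w - mu) / sigma^2; average f(w) * score
grad_mu = np.mean(f(w) * (w - mu) / sigma**2)
```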
5.2.3.2 Reparameterization gradient
Next, we describe a method called the reparameterization gradient.
The basic idea
Instead of sampling w directly from the distribution q(w;𝛏), which depends on the variational parameter 𝛏,
we first sample ε from a distribution p(ε) that does not involve the variational parameter,
and then apply the transformation w = g(𝛏;ε) to obtain a sample of w.
Thus, we obtain an unbiased estimator of the gradient as in the above equation
Specifically, consider the example of a Gaussian distribution q(w;𝛏)=N(w|μ,σ2) with the variational parameter 𝛏={μ,σ2}.
By defining the above equation, we can sample w according to a Gaussian distribution with mean μ and variance σ2
The gradient with respect to the variational parameters 𝛏 = [µ,σ] is given by the above equation
Finally, we obtain an unbiased estimator of the gradient for each variational parameter, as shown in the above equation.
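The same toy objective as above, now with the reparameterization gradient (not from the book): w = μ + σε with ε ~ N(0, 1), so the gradient passes through f; the hand-written derivative f′ is an assumption matching the illustrative f.

```python
import numpy as np

rng = np.random.default_rng(0)
f_prime = lambda w: np.cos(w) + 2 * w        # derivative of sin(w) + w^2
mu, sigma = 0.5, 1.0
eps = rng.normal(size=10_000)
w = mu + sigma * eps                         # w ~ N(mu, sigma^2) via g(xi; eps)
grad_mu = np.mean(f_prime(w))                # d/dmu  E[f(w)] = E[f'(w)]
grad_sigma = np.mean(f_prime(w) * eps)       # d/dsig E[f(w)] = E[f'(w) * eps]
```

For smooth f this estimator typically has far lower variance than the score-function version.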
5.2.3.3 Generalization of Reparameterized Gradient
Estimating the gradient using the reparameterization gradient can reduce the variance compared to the score function estimator.
To use it, we need a base distribution p(ε) that does not depend on the variational parameter 𝛏, together with a suitable transformation g.
Such transformations exist only for a limited class of distributions.
Reparameterization cannot be applied to gamma or beta distributions in the way it is applied to the Gaussian distribution.
Methods of Improvement
Gradient estimation can be applied to a wider variety of distributions by loosening the constraints on the transformation g.
Allow 𝛏 dependence to remain, as in q(ε;𝛏) for the distribution of ε obtained by the transformation g
Even when g is difficult to obtain directly,
implicit differentiation allows the reparameterization gradient to be applied to various continuous distributions, such as the gamma distribution, the Dirichlet distribution, and the von Mises distribution (a probability distribution on the circle).
There are also methods that allow the reparameterization gradient to be applied to discrete random variables.
The categorical distribution can be continuously relaxed by the Gumbel-softmax distribution;
setting the temperature parameter of the relaxed distribution to 0 recovers the original discrete categorical distribution.
With this continuous relaxation, gradient-based optimization methods such as the error backpropagation method can be applied.
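A minimal Gumbel-softmax sampling sketch (not from the book); the logits and temperature are illustrative assumptions.

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0)):
    # Continuous relaxation of a categorical draw with the given logits.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()    # "soft" one-hot sample; tau -> 0 gives a hard one-hot

sample = gumbel_softmax(np.log(np.array([0.2, 0.3, 0.5])))
```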
5.2.4 Variational Inference with Gradient Approximation
Maximizing the ELBO of a Bayesian Neural Network Using Reparameterization Gradients
Maximizing the ELBO in variational inference corresponds to improving the accuracy of the approximation of the posterior distribution.
Rewriting the integral over the parameter set W as an integral over ε, we obtain the above equation
Let Ls(𝛏) be the ELBO for mini-batch [xn,yn]n∈S
Approximated by a single sample value ε~N(ε|0,I), the above equation becomes
The right-hand side is an unbiased estimate of ELBO based on random mini-batch extraction and sampling of the noise ε
The gradient of the right-hand side is the above equation
The overall algorithm is as shown above.
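As a sketch (not from the book), the loop below runs the reparameterization-gradient variational update for a linear-Gaussian model standing in for the neural network, so the gradients stay short and exact; the data, prior N(0, I), unit observation noise, and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 256, 32
X = rng.normal(size=(N, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(size=N)            # unit noise, assumed known

mu = np.zeros(3)
rho = np.full(3, -3.0)                         # sigma = softplus(rho) > 0
for step in range(2000):
    S = rng.choice(N, size=M, replace=False)   # random mini-batch
    sigma = np.log1p(np.exp(rho))
    eps = rng.normal(size=3)
    w = mu + sigma * eps                       # reparameterized weight sample
    err = X[S] @ w - y[S]
    g_nll = (N / M) * X[S].T @ err             # unbiased grad of NLL wrt w
    g_mu = g_nll + mu                          # + d KL(q || N(0, I)) / d mu
    g_sigma = g_nll * eps + (sigma - 1 / sigma)  # + d KL / d sigma
    d_sigma = 1 / (1 + np.exp(-rho))           # d softplus(rho) / d rho
    mu -= 1e-4 * g_mu
    rho -= 1e-4 * g_sigma * d_sigma
# mu approximates the posterior mean; sigma the posterior std of each weight
```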
5.2.5 Learning by Expectation Propagation
Introduction
How to train a network using expectation propagation method
The idea is similar to the backpropagation method of ordinary neural networks.
In the forward computation, probabilities are propagated through the network to evaluate the marginal likelihood.
In the backward computation, the gradient of the marginal likelihood is calculated in order to learn the parameters.
Since the probabilistic backpropagation method can process data sequentially, it scales to training with large amounts of data.
This framework can be used for approximate inference for precision parameters that determine the variability of observed data, and for precision parameters that govern the prior distribution of weights.
5.2.5.1 Model
Consider the prediction of a one-dimensional label yn∈ℝ, and define the likelihood function as above
The activation function of the neural network f(xn;W) is the rectified linear function.
The precision parameter γy of the observations is generated from a gamma prior distribution, as in the above equation.
The parameter W is assumed to follow an independent Gaussian distribution (above).
The precision parameter γw of the weights is also assumed to have a gamma prior distribution.
The purpose of training this model is to approximate inference of the posterior distribution (above equation).
5.2.5.2 Approximate distribution
The idea of probabilistic backpropagation is based on assumed density filtering, which is a kind of expectation propagation method.
The approximate distribution of the parameters is shown in the above equation.
For computational efficiency, the same type of distribution as the prior distribution is selected as the approximate distribution.
It is updated successively using moment matching in the assumed density filtering.
5.2.5.3 Initialization and introduction of prior distribution factors
In the initial step of learning
Initialize the approximate distribution so that it is uninformative
mi,j = 0
vi,j = ∞
αγy=1
βγy=0
αγw=0
βγw=0
The approximate distribution is updated by adding the factors of the posterior distribution in the above equation one by one.
The parameters of the updated approximate distribution become the above equation
where
The updated gamma distribution of q(γw) is also given by the above equation using the result of moment matching
Z(αγw,βγw) cannot be computed exactly, so it is approximated.
5.2.5.4 Introduction of Likelihood Factors
After each factor of the prior distribution has been added, the likelihood factors of the above equation are added one by one.
Update the approximate distribution of the weights q(W) and the approximate distribution of the observation precision q(γy).
The moment-matching update formulas for the Gaussian and gamma distributions are used.
xxx
5.2.5.5 Distribution of activity
Next, we calculate the distribution of the activity a(l).
From the central limit theorem,
if the number of hidden units in each layer is large, a(l) approximately follows a Gaussian distribution.
We therefore assume that the distribution of a(l) is Gaussian.
In general, after a variable assumed to follow a Gaussian distribution passes through a rectified linear function, it becomes a mixture distribution, as shown in the figure above
xxx
5.2.5.6 Gradient-based Learning
Finally, the output z(0) of the bottom layer is treated as having mean xj and variance 0.
Using the previous results recursively, approximate the final distribution of z(l) with a Gaussian distribution.
Once an approximate representation of the normalization constant Z is obtained, the gradients with respect to the parameters can be calculated just as in the ordinary error backpropagation method.
5.2.5.7 Related Methods
As a method similar to the probabilistic backpropagation method,
there is a deterministic variational inference method based on ELBO maximization,
which computes the required expectations by deterministic approximation.
5.3 Bayesian Inference and Probabilistic Regularization
Introduction
Various regularization methods are used for training neural networks.
Dropout and batch normalization have provided dramatic performance improvements in large-scale training of neural networks.
Can be interpreted as an example of variational Bayesian learning
5.3.1 Monte Carlo Dropout, described in “Overview of Monte Carlo Dropout and Examples of Algorithms and Implementations”
Introduction
Dropout is a technique used to prevent overfitting of neural networks.
We will start with a two-layer network as an example.
5.3.1.1 Relationship between dropout and variational estimation methods
First, we formulate dropout as it is frequently used in deep learning models
Let ṁ(1) ∈ {0,1}H0 and ṁ(2) ∈ {0,1}H1 be the vector of two masks for input xn∈ℝH0 and intermediate layer zn∈ℝH1, respectively.
The element ṁi(l) of each mask is determined according to the Bernoulli distribution during the forward propagation of the network
γ1∈(0,1) is a set value that determines the probability that the value of the first layer mask will be zero.
In the lowest layer, a portion of the input is set to zero by first setting the input xn to the above equation
⊙ denotes the element-wise product of two vectors.
Using the dropped out input xn, the next intermediate layer becomes the above equation
Φ(·) is an arbitrary activation function applied element-wise
If we apply dropout to zn, we get the above equation
The final result is determined using the dropped out źn as above
The output zn is as above
diagm(x) is an operation that returns a matrix with each element of vector x as a diagonal component.
The cost function for a mini-batch of M data points when using dropout, including the regularization term, is given by the above equation.
Introduce a likelihood function (above) with precision γ instead of an error function.
Variable transformation g is introduced as the above equation
The gradient with respect to the parameter W becomes the above equation
xxx
Dropout Summary
In the variational inference view, the approximate posterior distribution of the parameters is q(Wm;W),
the reparameterization gradient is realized through the noise sampling of (5.74) and the variable transformation of (5.84),
and training minimizes the variational energy of the mini-batch (equivalently, maximizes the ELBO) of (5.85).
5.3.1.2 Approximating the predictive distribution using dropout
By exploiting the fact that dropout is an example of variational Bayesian inference,
any implementation of a deep neural network that uses dropout can also produce Bayesian predictions.
Let the approximate predictive distribution be the above equation.
The mean of this predictive distribution is the above equation.
The covariance of the predictive distribution is likewise given by the above equation.
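A minimal Monte Carlo dropout prediction sketch (not from the book): keep dropout active at test time, run T stochastic forward passes, and use the sample mean and variance; the toy network, weights, and dropout rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(1, 50)), rng.normal(size=(50, 1))  # toy weights

def forward(x, drop=0.5):
    h = np.tanh(x @ W1)
    mask = rng.uniform(size=h.shape) > drop   # dropout stays ON at test time
    return (h * mask / (1 - drop)) @ W2

x_star = np.array([[0.3]])
ys = np.array([forward(x_star) for _ in range(100)])  # T = 100 passes
pred_mean, pred_var = ys.mean(), ys.var()
```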
5.3.2 Relationship to Other Probabilistic Regularization Methods
Introduction
This section introduces the relationship between stochastic regularization methods, such as batch normalization and stochastic gradient descent, and approximate Bayesian inference.
5.3.2.1 Bayesian Interpretation of Batch Normalization
A connection between batch normalization and variational inference methods has been pointed out.
If batch normalization is interpreted as implicitly approximating the posterior distribution,
prediction with uncertainty becomes possible simply by modifying a neural model in which batch normalization is implemented.
Batch normalization introduces noise into the training through the random selection of mini-batches.
5.3.2.2 Bayesian Interpretation of Stochastic Gradient Descent
Stochastic gradient descent method
An essential technique for efficient training of large data sets.
It has been pointed out that it is related to approximate Bayesian inference.
Avoid convergence to overfitting parameters by adding randomness
There are similarities between stochastic Markov chain Monte Carlo as a sampling algorithm and stochastic gradient descent as an optimization method.
Stochastic gradient descent with a constant learning rate heads toward a local optimum of the objective function in the early stages of the search.
Once it arrives, it fluctuates around the neighborhood of the optimum.
This behavior can be analyzed using stochastic differential equations.
5.4 Applications Using Uncertainty Estimation
5.4.1 Image Recognition
Convolutional neural networks with Monte Carlo dropouts can be used to make predictions with uncertainty.
Depth estimation of image data
Segmentation of image data
Uncertainty can be divided into two categories: one from noise in the observed data and the other from lack of training data.
Uncertainty due to noise does not change even with more data
Variance parameter of observed distribution
Uncertainty due to lack of training data
Spread of posterior distribution of parameters
Uncertainty increases when there are objects in the image that are not included in the training data.
A Gaussian distribution is chosen as the likelihood function, and the convolutional neural network f(xn;W) simultaneously outputs the predicted depth yn and the associated variance vn for each pixel.
Model
The variance of the predictive distribution with Monte Carlo dropout is estimated from T samples, as in the above equation
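A sketch (not from the book) of splitting the predictive uncertainty when the network outputs both a mean and a noise variance per stochastic pass; the function and argument names are assumptions.

```python
import numpy as np

def decompose_uncertainty(means, noise_vars):
    # means, noise_vars: arrays of length T from T stochastic forward passes
    aleatoric = np.mean(noise_vars)   # data-noise part; does not shrink with data
    epistemic = np.var(means)         # model part; shrinks as the posterior narrows
    return aleatoric, epistemic
```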
5.4.2 Sequential data
Recurrent neural networks also have a tendency to overfit the data
Stochastic regularization by dropout is applied
However, randomly disabling hidden units loses the information that should be carried through the recurrent computation
Monte Carlo dropout prevents overfitting while allowing learning without losing information in the recurrent computation
Applying approximate Bayesian inference to recurrent neural networks
Construct language models for natural language processing.
Prevent overfitting by using stochastic gradient Langevin dynamics and Monte Carlo dropout.
Provide explanatory text for image data
Generating explanatory text using samples of parameters obtained in the learning process
Different interpretations of the image can be given for each sampled parameter.
A given image data can be interpreted in multiple ways depending on the model
The posterior distribution of the parameters represents these differences as a multimodal distribution.
5.4.3 Active Learning
In the world of image learning, it is necessary to prepare a large amount of labeled data for training.
Bayesian neural network models are also used for active learning
Transfer learning
Semi-supervised learning
By performing active learning that combines convolutional neural networks and approximate Bayesian inference
Efficient learning for high-dimensional data such as images
Based on the small amount of training data obtained, selects samples of input data with high prediction uncertainty and requests correct labels from the annotator.
This method has been applied to handwritten digit recognition with the MNIST data set and to classification tasks on skin cancer image data sets.
5.4.4 Reinforcement Learning
Prediction with uncertainty is also important in reinforcement learning applications.
Using the bandit problem, one of the simplest reinforcement learning tasks, as an example,
we show how the prediction uncertainty obtained by Bayesian inference can be applied.
Difficulties of the bandit problem
If we simply continue to select actions that increase the expected value of the current reward, we will miss actions that could have resulted in a larger reward.
Trade-off between exploration and exploitation
Exploratory algorithms that exploit uncertainty in prediction
Tailor the choice between exploration and exploitation according to the state of learning.
Let p(r|x,a,W) be a prediction model such as a forward propagating neural network with weight parameters W; the algorithm using Thompson sampling is given by the above equation.
Since there is little training data in the early stage of the search, the approximate distribution q(W) will be close to the prior distribution p(W).
The choice of action will be extensive based on the prior distribution.
As more data is collected, the approximate posterior distribution q(W) concentrates on a certain value W
Preference will be given to actions with high expected reward
A natural shift from exploration to exploitation depending on the learning situation
Bandit problems are usually evaluated quantitatively by the regret, the cumulative difference in reward between the ideal action and the action actually chosen.
Thompson sampling, which exploits prediction uncertainty, yields a smaller cumulative regret than greedy strategies.
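A minimal Thompson sampling sketch (not from the book), using a Bernoulli bandit with conjugate Beta posteriors as a stand-in for the neural model p(r|x,a,W); the true arm probabilities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.7])           # unknown to the agent
alpha, beta = np.ones(3), np.ones(3)         # Beta(1, 1) posterior per arm

for t in range(1000):
    theta = rng.beta(alpha, beta)            # one posterior sample per arm
    a = int(np.argmax(theta))                # act greedily on the sample
    r = rng.uniform() < true_p[a]            # observe a Bernoulli reward
    alpha[a] += r
    beta[a] += 1 - r                         # conjugate posterior update
# early on, wide posteriors drive exploration; later, exploitation dominates
```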

Chapter 6: Deep Generative Models

Introduction
Train a model with a nonlinear structure such as a neural network in an unsupervised learning setting.
By obtaining a low-dimensional subspace representation of the observed data,
we can compress the data, extract features, and interpolate missing values.
Models used for unsupervised learning
A model constructed by layering multiple nonlinear layers, such as a neural network
A large number of unobserved variables, called latent variables or local parameters, exist
Introduces computational techniques such as amortized inference and variational models to improve the efficiency of approximate inference
In addition to latent variable inference, we will discuss nonparametric Bayes (Bayesian nonparametrics), which learns the structure of the network itself.
6.1 Variational Autoencoder
Introduction
The variational autoencoder (VAE) is an unsupervised model that introduces nonlinear transformations by neural networks into the linear dimensionality reduction model.
Applicable to image data, etc.
6.1.1 Generative Networks and Inference Networks
Introduction
This section describes the basic structure of variational autoencoders and how they are trained.
6.1.1.1 Model and Approximate Distribution
The observed data X = [x1, …, xN] are assumed to be generated by the output of a Bayesian neural network, as shown in the above equation
f(zn;W) is a forward propagating neural network
Zn: Input vector of the neural network. It is treated as an unobserved latent variable assumed to be generated according to a Gaussian distribution.
W: Weight parameter of the neural network.
Generated by an independent Gaussian distribution
Generate observed data xn from latent representation of data zn
Similar to the linear dimensionality reduction model, the objective of learning is to obtain the posterior distribution of the latent variables and parameters in the above equation.
Find the posterior distribution by variational inference
The objective of learning is to design the approximate distribution q(Z,W) and minimize the KL divergence (above equation)
How to design the approximate distribution q(Z,W)?
A simple way is to partition the approximate distributions of Z and W using mean-field approximation (above equation)
Ψ and 𝛏 are sets of variational parameters
For the approximate distribution of the parameter W, choose a Gaussian distribution that is easy to calculate.
Let 𝛏={mi,j(l),vi,j(l)} be the set of the corresponding variational parameters (mean and variance).
For the approximate distribution of the latent variable Z, we also use a mean-field approximation using a Gaussian distribution.
In the variational autoencoder, mn and vn are taken to be the outputs of a neural network with xn as input.
Neural networks to regress variational parameters of approximate distributions.
In the approximate distribution q(Z;X,Ψ) using an inference network,
the targets of optimization are not the per-data variational parameters mn and vn;
instead, the parameters Ψ of the inference network are optimized.
The inference network learns a mapping from each observed data xn to a latent representation of the data zn
Why use regression models for latent variable inference calculations?
Due to the nonlinear transformation by the neural network, we cannot obtain an analytic update equation for Z at the variational E step
Updating the approximate distribution of a single latent variable alone requires optimization using gradient descent with variational parameters
Instead of updating the variational parameters of the individual approximate distributions q(z1;x1,Ψ), …, q(zN;xN,Ψ) separately, the shared parameters Ψ are updated.
In other words,
receiving one data point xi and updating q(zi;xi,Ψ)
also affects the other factors ∏j≠i q(zj;xj,Ψ).
A method of inferring the posterior distribution of a latent variable Z from data X “while predicting” it using a regression model such as a neural network
It was used in a generative model called the Helmholtz machine before the variational autoencoder.
6.1.1.2 Learning by variational inference
Transforming the KL divergence to be minimized,
it can be written as the difference between the log marginal likelihood and the lower bound, as in the above equation
L(φ,𝛏) is the lower bound (ELBO) of the log marginal likelihood ln p(X)
Minimizing the KL divergence in the above equation and maximizing the lower bound L(φ,𝛏) are equivalent
Assume that a large amount of training data exists
Use stochastic gradient descent method
Construct a mini-batch by randomly extracting M indices S from D containing N data
The ELBO of the mini-batch becomes the above equation
Consider updating the approximate posterior distribution q(W;𝛏) of parameter W by stochastic gradient descent method
Corresponds to training the generative network f(zn;W)
The gradient for 𝛏 is
We need to take expectation values for both q(zn;xn,Ψ) and q(W;𝛏).
The expectation with respect to the former is approximated with a simple Monte Carlo sample drawn from q(zn;xn,Ψ).
Optimize the parameter approximation distribution q(W;𝛏) using the reparameterization gradient
Consider updating the approximate posterior distribution q(Z;X,Ψ) of the latent variable Z using stochastic gradient descent
Corresponding to the training of the inference network f(xn;Ψ)
The gradient due to the variational parameter Ψ is
Compute the expectation with respect to q(W;𝛏) by simply sampling W from q(W;𝛏)
Sample each zn using the reparameterization gradient
Update Ψ by finding the gradient with respect to the variational parameter Ψ
Algorithm for the variational autoencoder
An example in which a variational autoencoder was given MNIST handwritten image data and trained on the latent space
Each character x is generated by feeding the latent variable values (z1,z2)ᵀ at each coordinate into the learned generative model.
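A sketch (not from the book) of the mini-batch ELBO computation with a tiny hand-rolled encoder/decoder; all shapes and weights are assumptions, the KL term for the network weights W is omitted for brevity, and a practical implementation would optimize this with automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 10, 32, 2                              # data dim, hidden, latent dim
We = rng.normal(size=(D, H))                     # inference (encoder) weights
We_m, We_v = rng.normal(size=(H, K)), rng.normal(size=(H, K))
Wd = rng.normal(size=(K, D))                     # generative (decoder) weights

def minibatch_elbo(X_batch, N):
    h = np.tanh(X_batch @ We)                    # inference network
    m, log_v = h @ We_m, h @ We_v                # q(z|x) = N(m, exp(log_v))
    eps = rng.normal(size=m.shape)
    z = m + np.exp(0.5 * log_v) * eps            # reparameterized sample of z_n
    x_rec = z @ Wd                               # mean of p(x|z)
    log_lik = -0.5 * np.sum((X_batch - x_rec) ** 2, axis=1)  # up to constants
    kl = 0.5 * np.sum(np.exp(log_v) + m**2 - 1 - log_v, axis=1)
    return (N / len(X_batch)) * np.sum(log_lik - kl)  # unbiased ELBO estimate

X = rng.normal(size=(500, D))
elbo = minibatch_elbo(X[:32], N=500)
```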
6.1.2 Semi-supervised Learning Model
Introduction
By using a generative model, we can naturally obtain a semi-supervised learning model in which some of the labels do not exist in the input data.
Prerequisites
Let X be a set of input data (images, etc.)
Let Y be the set of corresponding category labels.
Let A be the index of labeled data, and DA={XA,YA}.
Let the index of unlabeled data be U, and let Du=XU.
Assume that the label set for XU is not available as training data.

6.1.2.1 M1 Model
The simplest semi-supervised learning using a variational autoencoder can be achieved in the following steps
Train the encoder and decoder of the variational autoencoder on all input data {XA, XU}.
Use the expected value of the latent variable 𝔼[ZA] for the labeled input data XA as the feature input
Learn the parameters to predict the label YA using some supervised learning method.
This is semi-supervised learning that reuses the analysis results of an ordinary variational autoencoder
By using not only the labeled XA but also a large amount of unlabeled XU,
we expect to capture the characteristic structure of the data in the space of Z.
6.1.2.2 M2 Model
Drawbacks of the M1 model
The label data YA are not utilized during the unsupervised learning performed by the variational autoencoder.
If information on label data YA is not given at the feature extraction stage, features that are originally required for prediction may be lost from the original input data X.
Graphical model with all the parameters
The joint distribution (generative network) of this model is shown in the above equation.
Semi-supervised learning framework using the above equation
Design the approximate posterior distribution used in variational inference as above
π(xn;φ) is a newly prepared inference network to approximate the distribution of YU.
All variational parameters are grouped together in Φ.
The prior distribution p(W) and the approximate posterior distribution q(W;𝛏) of the parameters are Gaussian.
Under the above design, we calculate the KL divergence between the true posterior distribution and the approximate distribution
However
Illustration of the learning results for the latent variables obtained by the M2 model.
The leftmost column is the image given for the test.
Subsequent columns are images generated by the generative model.
The latent variables estimated from the test images represent the typeface (writing style) of the characters, which is carried over across each label.
6.1.3 Applications and Extensions
Introduction
Variational autoencoders are based on a probabilistic generative model
They can easily be combined with other probabilistic models and with domain knowledge.
Applications include not only images but also recommendation, acoustic analysis, dialogue response systems, molecular structure search, etc.
6.1.3.1 Extending the model
An image generation model (DRAW) that introduces a recurrent neural network structure and an attention mechanism into the variational autoencoder.
Uses recurrent neural networks to shift the spatial attention region
DRAW reconstructs the image by rewriting it part by part
The variational autoencoder, in contrast, reconstructs the entire image at once
Propose a model that combines a variational autoencoder and a convolutional neural network
Convolutional layers are incorporated into the inference network, and deconvolutional (transposed convolution) layers into the generative network.
[63]
Further introduce a recursive neural network to share the latent space
[94]
Using a semi-supervised learning framework
6.1.3.2 Importance-weighted self-encoders
Theoretical extensions of variational self-encoders
The structure of the variational autoencoder itself is kept the same
Maximize a strictly larger lower bound than ELBO
By using T sampled latent variables from the inference network, we maximize a new lower bound as above
The variational autoencoder efficiently approximates the posterior distribution of the latent variables using an inference network.
However, the expressive power of this posterior approximation is quite limited, which the importance-weighted bound helps mitigate.
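A hedged sketch of the importance-weighted bound: given T joint log-densities and T inference-network log-densities per data point, the tighter bound is a logsumexp over the T log importance weights.

```python
import math
import torch

def iwae_bound(log_p_xz, log_q_z):
    # log_p_xz, log_q_z: (T, B) tensors holding ln p(x, z_t) and ln q(z_t|x)
    # for T latent samples z_t per data point, drawn from the inference net.
    log_w = log_p_xz - log_q_z                     # log importance weights
    T = log_w.shape[0]
    # ln (1/T) sum_t exp(log_w_t), computed stably with logsumexp.
    return (torch.logsumexp(log_w, dim=0) - math.log(T)).mean()
```

With T = 1 this reduces to the ordinary ELBO; larger T gives a strictly tighter bound.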
6.2 Variational Models
Introduction
In variational inference, the design of the class of distributions for the posterior distribution q determines the performance of the algorithm.
The following points are important
(1) Expected value calculation and sampling using q should be easy.
(2) q should be easy to optimize under KL divergence and other indicators.
(3) q should be flexible enough to approximate complex true posterior distributions with good accuracy.
In the mean-field approximation, well-known distributions with special properties, such as exponential-family distributions, are used as approximations.
Requirements (1) and (2) are satisfied.
In terms of requirement (3), the approximation capability is low.
In the variational autoencoder as well, the approximation capability with respect to requirement (3) is low.
The probability model used for the approximate distribution q in variational inference is called the variational model.
Variational models that satisfy requirement (3):
Method using normalizing flow
Hierarchical variational model
A model that extends the variational inference method to handle models with more degrees of freedom and approximate distributions.
6.2.1 Normalizing flow
Introduction
Problems with variational autoencoders based on the mean-field approximation
A simple distribution such as a diagonal Gaussian is assumed for the approximate distribution.
The true posterior distribution of the latent variables can be too complex to be represented by such a simple Gaussian distribution.
Normalizing flow
For a sample z0 from a simple probability distribution such as a Gaussian, apply multiple invertible and differentiable transformations f1,…,fK to obtain samples from a more complex distribution.
6.2.1.1 Transformation by reversible functions
The normalization flow is based on a transformation of the probability density function
Consider an invertible and continuous function f:ℝD→ℝD
Using this transformation ź=f(z), q(ź) becomes the above equation for the probability density function q(z)
The matrix appearing in the above equation is the Jacobian matrix
det(*) is the determinant
Apply this kind of transformation K times from z0
The final probability density of the random variable zK is given by the above equation
6.2.1.2 Example of transformation
Example of a concrete function f
Planar flow
h: Differentiable nonlinear function
λ: Parameter that determines the transformation, chosen to maximize the ELBO
The Jacobian matrix required to calculate the density of the distribution obtained by the planar flow is given by the above equation
Applying the above transformation repeatedly to z0 produces repeated contraction and expansion in the direction perpendicular to the hyperplane wTz+b=0
Example of a sample with planar flow
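A minimal NumPy sketch of the planar flow f(z) = z + u·tanh(wᵀz + b) and the log-determinant term needed for the density update; parameter shapes are assumptions for illustration.

```python
import numpy as np

def planar_flow(z, u, w, b):
    # z: (N, D) samples; u, w: (D,) parameters; b: scalar.
    a = z @ w + b                           # w^T z + b, shape (N,)
    f = z + np.outer(np.tanh(a), u)         # f(z) = z + u * tanh(w^T z + b)
    psi = np.outer(1.0 - np.tanh(a) ** 2, w)  # h'(a) * w, shape (N, D)
    # |det df/dz| = |1 + u^T psi(z)|, needed for the density update.
    log_det = np.log(np.abs(1.0 + psi @ u))
    return f, log_det

# Density update per layer: ln q(f(z)) = ln q(z) - log_det.
```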
Radial flow
Jacobian matrix for the radial flow
Example of a sample with radial flow
6.2.1.3 Application to variational inference
Normalizing flows can be combined with variational inference to obtain a much more accurate approximation of the posterior distribution than a simple mean-field approximation.
The ELBO in this case is given by the above equation
To apply the normalizing flow to inference networks such as those of autoencoders, use the above as the initial distribution and then apply the flow's sequence of transformations.
6.2.1.4 Stein variational gradient descent method
Variational inference using sequential variable transformations other than the normalizing flow
Applies gradient descent using functional gradients in a reproducing kernel Hilbert space.
Minimizes the KL divergence to the true posterior distribution.
No need to compute determinants or inverse matrices
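A small NumPy sketch of one SVGD update with an RBF kernel, under the assumption of a fixed bandwidth h (practical implementations often set h by the median heuristic):

```python
import numpy as np

def svgd_step(X, grad_logp, h=1.0, eps=0.1):
    # X: (N, D) particles; grad_logp(X): (N, D) matrix of scores d/dx ln p(x).
    diff = X[:, None, :] - X[None, :, :]        # x_i - x_j, shape (N, N, D)
    K = np.exp(-(diff ** 2).sum(-1) / h)        # RBF kernel matrix
    # Attraction: kernel-smoothed score. Repulsion: kernel-gradient term
    # that keeps the particles from collapsing onto a single mode.
    phi = (K @ grad_logp(X)
           + (2.0 / h) * (K[:, :, None] * diff).sum(1)) / len(X)
    return X + eps * phi                        # one functional-gradient step
```

No Jacobian determinants or matrix inverses appear; only the score of the target and kernel evaluations are required.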
6.2.2 Hierarchical Variational Model
Introduction
Represent complex approximate distributions by hierarchizing the approximate distribution.
6.2.2.1 Model of approximate distribution
Using the usual mean-field approximation, the approximate distribution qMF of the latent variables Z={z1,…,zM} can be expressed by the above equation, where λ is the set of variational parameters.
The M latent variables z1,… ,zM are assumed to be independent.
To hierarchize the approximation, the approximate distribution qHVC based on the hierarchical variational model is expressed by the above equation.
The distribution q(λ;𝛏), called the variational prior, is placed on the variational parameter λ
q(zm|λm) can be called the variational likelihood.
By marginalizing over λ, the approximate distribution becomes a kind of mixture distribution.
Example of a variational model using a two-dimensional Gaussian distribution for the variational prior and a Poisson distribution for the variational likelihood
In (A.1), λ1 and λ2 are generated by using independent Gaussian distributions.
As a result, the latent variables z1 and z2 in (A.2) are also independent
(B.1) generates λ1 and λ2 by using Gaussian distribution with strong correlation
The latent variables in (B.2) have complex relationships
By assuming a distribution that is not independent of the conventional variational parameter generation, we can capture the correlation of complex latent variables.
The hierarchical variational model maximizes the ELBO with respect to the newly introduced variational hyperparameter 𝛏.
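A sampling sketch of the two-dimensional example above, assuming an exponential link so that the Poisson rate stays positive (the link choice is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hvm(n_samples, mean, cov):
    # Variational prior: lambda ~ N(mean, cov). Correlation in `cov`
    # induces dependence between z_1 and z_2 even though each
    # variational likelihood q(z_m | lambda_m) is an independent Poisson.
    lam = rng.multivariate_normal(mean, cov, size=n_samples)
    z = rng.poisson(np.exp(lam))  # exp keeps the Poisson rate positive
    return z

# Case (B): a strongly correlated prior yields correlated latent variables.
z = sample_hvm(1000, mean=[1.0, 1.0], cov=[[1.0, 0.9], [0.9, 1.0]])
```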
6.2.2.2 Examples of variational prior distributions
Introduce a useful variational model.
Let K be the number of mixture elements, π be the parameters of a K-dimensional categorical distribution, and 𝛏={μk,Σk}k=1K be the set of parameters of an M-dimensional Gaussian distribution.
Using a mixture model as the variational prior, two different correlations (positive and negative) can be expressed between λ1 and λ2.
A normalizing flow can also be used as the variational prior.
We can also use a Gaussian process as a variational model.
A method that does not require the derivation of an algorithm that depends on the design of the model.
6.2.3 Non-Explicit Models and Likelihood-free Variational Inference
Introduction
Probability distributions, including exponential families of distributions, can be used in combination to model a wide variety of data generating processes.
Models that cannot compute a density but can generate samples (non-explicit, or implicit, models)
Approximate Bayesian computation (ABC)
Simulation of data generation
Methods for cases where the generative model or approximate distribution is constructed as a non-explicit model
Also used in generative adversarial networks (GANs)
An example of Bayesian inference using coin tossing
The Bernoulli distribution is used as the probability of determining the face of the coin.
For the parameter of the Bernoulli distribution, the beta distribution, which is its conjugate prior, is used.
The posterior distribution is then a beta distribution of the same form as the prior.
The actual phenomenon of coin tossing is determined through a physical process governed by the initial state of the coin before the toss (angle and speed of the toss, etc.)
Uncertainty about the initial state, or lack of information, leads to uncertainty in the final estimate
If a realistic physical model can be described as a simulator, it will be a better data generation process than the extremely simplified Beta-Bernoulli model
6.2.3.1 Non-Explicit Model
Consider a hierarchical generative model for observed data X, such that it consists of a latent variable Z and a set of parameters θ shared by all data
The density function p(xn|zn,θ) is not defined explicitly.
It has only the means to generate data xn given a latent variable zn and a parameter θ.
Suppose that zn is generated by a function g and noise ε~p(ε) as shown above.
The likelihood is the above equation.
6.2.3.2 Variational Inference without Likelihood
The posterior distribution of the non-explicit model in the previous equation is the above equation.
Since it cannot be calculated analytically, the posterior distribution is approximated in the framework of the variational method.
The approximate distribution to be processed should be highly expressive.
In likelihood-free variational inference, the constraints on the choice of approximate distribution are relaxed so that a wider class of approximate distributions can be used.
For the approximate distribution of the latent variables, we also allow a non-explicit distribution with variational parameter φ.
Each latent variable zn can be easily sampled from a distribution with variational parameter φ, as in the above equation.
The value of the variational likelihood (above) does not necessarily have to be computable.
Without an explicit density function, we can perform variational inference using only the ability to sample zn.
Use an approximate distribution q𝛏(θ) of θ with a non-explicit variational likelihood and a variational parameter 𝛏, and make the entire approximate distribution the above equation
The variational prior q𝛏(θ) is assumed to be a Gaussian distribution, which makes it easy to perform both sampling and density calculations for θ.
From the previous equation, the lower bound of the log marginal likelihood is the above equation
p(xn,zn|θ) and qφ(zn|xn,θ) are based on non-explicit distributions whose densities cannot be computed.
To solve this problem, we use the empirical distribution of the data, qD(xn).
Adding −ln qD(xn) to the lower bound of the previous equation gives the above equation
In variational inference without likelihood, the lower bound is calculated by directly estimating the logarithm of the density ratio that appears in the previous equation (above equation)
To select the density ratio estimator r(xn,zn,θ;η), we use a regression model such as a differentiable neural network with η as a parameter.
There are various possible ways to train the density ratio estimator r.
For example
Use a loss function based on the proper scoring rule (above)
The sigmoid function Sig(r(xn,zn,θ;η)) is trained to return 1 for samples from distribution p
and to return 0 for samples from distribution q
The previous equation is minimized with respect to η at the optimal estimator
The density ratio estimator r(xn,zn,θ;η) can be trained by taking the gradient of the previous equation with respect to η, using only samples of xn and zn.
From the above, the objective function to be maximized is the above equation.
The density ratio is replaced by a differentiable function r(xn,zn,θ;η).
Sampling zn and θ using the reparameterization gradient, we obtain an approximation of the gradient with respect to the variational parameters φ and η
From the above, the likelihood-free variational inference algorithm is as follows.
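A hedged sketch of the density ratio estimator r trained with the logistic proper scoring rule, as described above; the network shape and the dimensions x_dim, z_dim are illustrative assumptions.

```python
import torch
import torch.nn as nn

x_dim, z_dim = 784, 2  # illustrative dimensions (assumptions)

# Discriminator-style estimator r(x, z; eta) of the log density ratio
# ln p(x, z) / ln q(x, z), realized as a small neural network.
r = nn.Sequential(nn.Linear(x_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def ratio_loss(samples_p, samples_q):
    # samples_p: (x, z) pairs drawn from p, labeled 1;
    # samples_q: pairs drawn from q, labeled 0.
    logits_p, logits_q = r(samples_p), r(samples_q)
    bce = nn.functional.binary_cross_entropy_with_logits
    return (bce(logits_p, torch.ones_like(logits_p)) +
            bce(logits_q, torch.zeros_like(logits_q)))

# At the optimum, r's raw output approximates the log density ratio,
# which is then substituted into the objective in place of the true ratio.
```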
6.3 Structural Learning of Generative Networks
Introduction
One of the difficulties in training deep learning models
Many models require predetermination of the network structure
A large amount of trial and error is required to find a structure with good performance
Probabilistic models that generate binary matrices with infinite number of columns
Estimate the structure of the directed graph from the data
Network weights, biases, latent variables, as well as network width and depth can be learned simultaneously in the framework of Bayesian inference
6.3.1 Indian Buffet Process
Introduction
The Indian buffet process can be used to construct a generative model for matrices with no upper bound on the number of columns.
This can be used to automatically determine the dimensionality of the latent variables in dimensionality reduction models such as principal component analysis and factor analysis.
6.3.1.1 Generating an Infinite Matrix
Consider a binary matrix M of size NxH.
Construct the generation process of M when H → ∞.
Assumptions
Assume that each element mn,h ∈ {0, 1} is generated from a Bernoulli distribution Bern(πh)
Let α>0 and β>0 be hyperparameters.
Suppose the parameter πh is generated from the beta distribution Beta(αβ/H, β).
The distribution of the matrix M is the above equation
However
If the number of columns is taken to H → ∞, then p(M) → 0 (the probability of generating any particular binary matrix is 0).
Let [M] denote the equivalence class of matrices that can be made identical by reordering the columns of M.
If H → ∞ with respect to p([M]), the distribution of matrix M is as above
Hi is the number of columns whose binary history (the column's binary sequence) equals i
H+ is the number of columns h such that Nh>0
The above equation gives the expected value of H+.
This shows that the probability does not change even if the rows of M are exchanged.
The generation of a binary matrix with infinitely many columns as in the above equation can be carried out by the following procedure, called the Indian buffet process (a sampling sketch follows the list of properties below)
The first customer who comes to the restaurant takes dishes according to the Poisson distribution Poi(α) defined in the above equation
The nth customer takes each existing dish h with probability Nh/(n+β−1), and finally takes new dishes according to Poi(αβ/(n+β−1)).
The generated binary matrix M ∈ {0,1}Nx∞ has the following properties
The number of dishes per customer follows Poi(α)
The expected value of the total number of dishes to be taken is Nα
The expected number of distinct dishes taken is given by the above equation
limβ→0 H+ = α (all customers choose the same dishes)
limβ→∞ H+ = Nα (no two customers choose the same dish)
Sample binary matrices for N=20 as the hyperparameters α and β are varied
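A direct NumPy transcription of the two-parameter buffet procedure above; note that for n = 1 the Poisson rate reduces to Poi(α), matching the first customer's rule.

```python
import numpy as np

def sample_ibp(N, alpha, beta, rng=np.random.default_rng(0)):
    # Two-parameter Indian buffet process: returns a binary matrix whose
    # number of columns (dishes) is determined during sampling.
    dishes = []                      # popularity counts N_h per dish
    rows = []
    for n in range(1, N + 1):
        # Take each existing dish h with probability N_h / (n + beta - 1).
        row = [rng.random() < Nh / (n + beta - 1) for Nh in dishes]
        for h, taken in enumerate(row):
            dishes[h] += taken
        # Then try Poi(alpha * beta / (n + beta - 1)) brand-new dishes.
        new = rng.poisson(alpha * beta / (n + beta - 1))
        dishes.extend([1] * new)
        row.extend([True] * new)
        rows.append(row)
    M = np.zeros((N, len(dishes)), dtype=int)
    for n, row in enumerate(rows):
        M[n, :len(row)] = row
    return M
```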
6.3.1.2 Gibbs Sampling
Build a generative model with an infinite number of latent variables using the Indian buffet process
Example
Suppose we model the data X with an infinite dimensional parameter Θ and a binary matrix M as above
Assume that Θ can be analytically integrated out: p(X|M)=∫p(X|M,Θ)p(Θ)dΘ
Using Gibbs sampling, each mn,h can be sampled from the posterior distribution p(M|X) as in the above equation.
6.3.2 Infinite Neural Network Model
Introduction
We construct a multilayer deep network using a generative model called a nonlinear Gaussian belief network.
Further, by cascading the Indian buffet process, we construct a network of unbounded size.
6.3.2.1 Nonlinear Gaussian Belief Network
First, consider a network with a finite number of L layers
Assumptions
Let Hl be the number of units in each layer l
Let zh(l) denote the h-th unit on layer l
M(l) is a binary matrix of size Hl−1×Hl
Let the activation of layer l be given by the above equation
The activation αh(l) is assumed to include noise from a Gaussian distribution, as in the above equation.
The distribution of zh(l) is as above
The hidden unit zh(l) is transformed as in the above equation by the hyperbolic tangent function φ(·)=tanh(·).
Samples of nonlinear transformations
Nearly binary when the precision is low (a)
Nearly deterministic when the precision is high (c)
Joint distribution of the entire model
6.3.2.2 Cascading Indian Buffet Process
An infinite version of the above network model is constructed using the Indian buffet process.
Generate a network with no upper limit to the number of units in each layer and no upper limit to the depth.
Generated binary sequence and network structure
Networks composed by the cascading Indian buffet process can also be inferred approximately by Markov chain Monte Carlo methods
Alternate sampling with Gibbs sampling in the following three blocks
A set of hidden units Z
A set of binary matrices M
A set of parameters {W, b, γ}
6.4 Other Deep Generative Models
Introduction
Introduction to other deep generative models from a probabilistic-modeling perspective
6.4.1 Deep Exponential Families
Introduction
Latent variable model constructed by combining exponential families of distributions in a hierarchical manner
By layering multiple layers of conventional latent variable models, interpretability and predictive performance for complex data can be improved.
It generalizes sigmoid belief networks, nonnegative matrix factorization, and variational autoencoders.
6.4.1.1 Models
In deep exponential families, an exponential-family distribution pEF, as shown above, forms the basic building block of each layer.
Let the input data set be X, the latent variable set be Z, and the weight parameter set be W={W(1),…,W(L)}; the L-layer deep exponential family then has the joint distribution shown above.
Graphical Model
In the top layer, the HL-dimensional latent variable zn(L) is generated from the prior distribution shown above
The natural parameter η becomes a hyperparameter of the model.
In subsequent layers, zn(l) is generated as in the above equation, using a nonlinear transformation g(l)(·) of the latent variable zn(l+1) from the layer above and the weight parameter W(l).
6.4.1.2 Concrete Example
By specifically choosing each probability distribution that constitutes a deep exponential family of distributions, various models can be defined
Sigmoidal belief networks
In the deep exponential family of distributions, this model is composed of latent units that follow a Bernoulli distribution and weights that follow a Gaussian distribution.
By combining multiple families of deep exponential distributions, matrix factorization models such as linear dimensionality reduction can be deepened.
For example, for the observed data xn, the likelihood function is defined as follows
Matrix factorization methods are used in product recommendation algorithms.
A deep matrix factorization model can provide a hierarchical representation of the features of buyers and products.
6.4.1.3 Inference of Posterior Distributions
The posterior distribution of a deep exponential family cannot be computed analytically.
Approximating the posterior distribution by mean-field approximation
The approximate distribution of the latent variable uses the same exponential distribution family as the distribution set in the generative model.
For more complex models, variational models are applied.
When the number of data points is large, the ELBO is optimized by stochastic variational inference using mini-batches.
6.4.2 Boltzmann Machine
The Boltzmann machine is a method proposed for learning arbitrary probability distributions over D-dimensional binary vectors x ∈ {0,1}D.
The standard Boltzmann machine models the joint distribution of the data using an energy function, as shown in the above equation
Z is the normalizing constant that makes the probability distribution sum to 1
In the Boltzmann machine, the energy function is given by the above equation.
U and b are parameters for weight and bias, respectively
The energy function is transformed into the above equation
Usually, the random variable x is decomposed into visible units (v) and hidden units (h).
The restricted Boltzmann machine is a probabilistic model often used in deep learning.
The energy function is the above equation.
The corresponding graphical model is a bipartite graph, as shown in the figure above.
In a restricted Boltzmann machine, there are no connections between units within the same layer.
Stacking multiple layers of hidden units of restricted Boltzmann machines gives a deep Boltzmann machine
Graphical model
The energy function of a deep Boltzmann machine with two hidden layers is the above equation.
Using mean field approximation, we can approximate the above equation.
Learning can be done by minimizing the KL divergence
6.4.3 Generative Adversarial Networks
A method for learning a generative model using differentiable networks.
Generative network g and discriminative network d are trained by competing with each other
The generative network g generates virtual data x=g(z;θg) from noise z and parameters θg.
The discriminative network d outputs the probability d(x;θd) that the data x is real.
A typical learning method optimizes g and d based on a zero-sum game, as in the above equation
qD(x) is the empirical distribution of the data
Intuitively
The discriminative network d is trained, via the first term of the above equation, to judge samples from the training data as authentic.
The second term trains d to judge data produced by the generative network g as fake, by maximizing the probability 1−d(g(z)).
By the time learning converges, data generated by g are indistinguishable to the discriminator d (its output probability approaches 0.5).
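A minimal PyTorch sketch of one training step; it uses the common non-saturating generator loss rather than the raw zero-sum objective, and assumes d outputs logits (both choices are illustrative assumptions).

```python
import torch
import torch.nn as nn

bce = nn.functional.binary_cross_entropy_with_logits

def gan_step(g, d, x_real, z, opt_g, opt_d):
    # Discriminator: push d(x_real) toward 1 and d(g(z)) toward 0.
    opt_d.zero_grad()
    loss_d = (bce(d(x_real), torch.ones(len(x_real), 1)) +
              bce(d(g(z).detach()), torch.zeros(len(z), 1)))
    loss_d.backward()
    opt_d.step()
    # Generator: push d(g(z)) toward 1 (non-saturating variant).
    opt_g.zero_grad()
    loss_g = bce(d(g(z)), torch.ones(len(z), 1))
    loss_g.backward()
    opt_g.step()
```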
Advantages of generative adversarial networks
The optimization does not require approximate inference of latent variables as in variational autoencoders.
Can be used even if the likelihood function is not defined.
Disadvantages of generative adversarial networks
Learning is unstable
The generative network may produce only specific data and lose diversity (so-called mode collapse)

Chapter 7: Deep Learning and Gaussian Processes

Introduction
We introduce the Gaussian process, which is a nonparametric Bayesian model.
We also show that a multilayer forward propagating neural network with an infinite number of hidden units is equivalent to a Gaussian process.
Interpreting a deep learning model as a Gaussian process
makes a mathematically simpler analysis possible.
Preventing overfitting and evaluating uncertainty in prediction, which have been problems with deep learning.
The biggest problem with using Gaussian processes
Computational cost
We will show how to use approximation algorithms, such as variational inference, to learn large datasets as efficiently as deep learning.
Unsupervised learning models using Gaussian processes
We will also show how to combine them in a hierarchical manner.
7.1 Basics of Gaussian Processes
Introduction
We motivate the use of Gaussian processes by using the kernel representation of Bayesian linear regression models.
This section explains how to construct the covariance function, which is the most important factor in designing a Gaussian process, and how to select models and optimize parameters using the marginal likelihood.
7.1.1 Probability distribution in function space
In the Bayesian linear regression model
The parameter w is given a prior distribution
The function represented by the weighted sum of basis functions is generated stochastically.
In the Gaussian process
The prior distribution is defined directly on the function space, and inference computations such as the posterior and predictive distributions after observing data are also performed directly in function space.
Definition of a Gaussian process in the context of machine learning
Let F be a set of random variables.
For any natural number N, the N random variables {f(x1),…,f(xN)} selected from F follow a Gaussian distribution
F is called a Gaussian process.
Variables that follow a Gaussian distribution are
determined by giving the mean and covariance parameters.
In a Gaussian process
the properties of the generated function f are determined by the mean function m(x) and the covariance function (kernel function) k(x,x')
Notated as above
Concrete example
As a Gaussian process, the linear regression model has the covariance function given in the above equation
In linear regression
The function is drawn by sampling the parameter w
In a Gaussian process
There is no explicit parameter
To visualize a function sampled directly from the above equation, prepare a concrete set of input values X={x1,…,xN}.
Using these input values and the previous equation, calculate the covariance matrix (above equation)
Sample an N-dimensional vector from the N-dimensional multivariate Gaussian distribution N(0,K) and plot it as a curve.
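A NumPy sketch of exactly this procedure with an exponentiated quadratic covariance function (the grid and hyperparameters are arbitrary):

```python
import numpy as np

def rbf(x1, x2, sigma=1.0, ell=1.0):
    # k(x, x') = sigma^2 exp(-(x - x')^2 / (2 ell^2))
    return sigma**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

X = np.linspace(-5, 5, 100)
K = rbf(X, X) + 1e-8 * np.eye(len(X))   # jitter for numerical stability
# Each draw from N(0, K) is one function evaluated at the inputs X.
f = np.random.default_rng(0).multivariate_normal(np.zeros(len(X)), K, size=5)
```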
Advantages of the Gaussian process
The prior distribution over functions can be specified by directly defining the covariance function itself, without explicitly computing the inner product of the feature transformation Φ.
7.1.2 Regression Model with Added Noise
In the Gaussian process, the characteristics of the generated function are determined by defining the covariance function.
In practice, when performing regression using N sets of training data D={X,Y}, independent noise is added to the output fn=f(xn) of each function as shown above
fn: latent function value
Equivalent to defining the likelihood function as shown in the above equation
By integrating out each latent function fn, the observed data also follow a Gaussian process with the covariance function shown in the above equation.
In the Gaussian process model, the predictive distribution for a new test input after the data is given can be obtained by calculating the conditional distribution of a simple multidimensional Gaussian distribution
Assuming the training data D={X,Y}, a set of new input values X*, and the corresponding set of predictions Y*, the joint distribution of the output values is the Gaussian distribution shown above.
The index of the covariance matrix represents the input value for calculating the covariance function
For example
The covariance matrix Kx*x indicates that each element is obtained by k(x*,x)
From the formula for the Gaussian distribution (A.7) and the formula for the inverse matrix (A.3), the predictive distribution is given by the above equation
However
The biggest problem with the Gaussian process is the amount of computation.
Approach by Approximate Inference
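Before turning to approximations, here is the exact predictive computation above as a NumPy sketch; the O(N³) linear solve against the N×N matrix is the bottleneck just mentioned.

```python
import numpy as np

def gp_predict(X, y, Xs, k, beta_inv):
    # Exact GP regression predictive distribution with noise variance beta_inv:
    #   mean = K_*x (K_xx + beta_inv I)^-1 y
    #   cov  = K_** - K_*x (K_xx + beta_inv I)^-1 K_x*
    Kxx = k(X, X) + beta_inv * np.eye(len(X))
    Ksx = k(Xs, X)
    mean = Ksx @ np.linalg.solve(Kxx, y)
    cov = k(Xs, Xs) - Ksx @ np.linalg.solve(Kxx, Ksx.T)
    return mean, cov

# The O(N^3) solve against the N x N matrix Kxx is the bottleneck.
```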
7.1.3 Covariance Function
Introduction
Prediction using a Gaussian process is characterized by the design of the covariance function.
The covariance function k(x,x’) is a function that takes two inputs x and x’.
Available conditions
For any sequence of inputs {x1,…,xN}, the covariance matrix must be positive semidefinite.
7.1.3.1 Exponential Quadratic Covariance Functions
A commonly used covariance function is the exponential quadratic covariance function as shown in the above equation
The name
RBF kernel (radial basis function kernel)
Squared exponential kernel
Can be rewritten as the inner product of a feature transformation Φ for two input vectors x and x’.
Φ is an infinite-dimensional feature vector
The function f is sampled 10 times from the previous equation
The exponential quadratic covariance function has a stronger correlation the closer the input data points x and x’ are.
The generated function will be one that changes slowly.
This can be controlled by the parameters of the covariance function, such as σ and the length scale l in the previous equation.
Results of training N=10 data pairs {xn,yn}n=1N for exponential quadratic covariance function
For the exponential quadratic covariance function, the covariance function with weights wi>0 for each dimension (above equation) is also often used.
Covariance function used in an input dimensionality reduction technique called automatic relevance determination (ARD)
Thus, when the hyperparameters wi are optimized by maximizing the marginal likelihood, the weights wi of input dimensions that do not contribute to prediction shrink toward zero.
7.1.3.2 Designing a Covariance Function
As a useful application, we can create new covariance functions by combining existing ones.
For example
For example, a new covariance function can be created from two covariance functions k1 and k2 by performing the following operation
We can also create a new covariance function for the input by transforming the input with a function Φ and then applying another covariance function.
Concrete example
When we want to model periodic trends in data
For a one-dimensional input x ∈ ℝ, apply a mapping to a two-dimensional circle such as Φ(x)=(sin(x),cos(x))T, and then
and then substitute it into the exponential quadratic covariance function to create the periodic covariance function (see above).
Examples of functions generated by the above periodic covariance function
Prediction of data using these functions
Concrete example
Define a new covariance function as in the above equation
Combines the properties of the two covariance functions
with coarse periodicity
Example of a function
Predicting Data
Other examples of covariance function combinations
Convolutional Gaussian process
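A small NumPy sketch of such combinations: a periodic kernel obtained through the (sin, cos) warping described above, and a quasi-periodic kernel built as a product of two valid kernels (hyperparameters are illustrative):

```python
import numpy as np

def rbf(x1, x2, ell=1.0):
    return np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def periodic(x1, x2, period=2 * np.pi, ell=1.0):
    # Warping x through (sin x, cos x) and applying the exponentiated
    # quadratic yields the standard periodic covariance function.
    d = np.sin(np.pi * (x1[:, None] - x2[None, :]) / period)
    return np.exp(-2 * d**2 / ell**2)

def quasi_periodic(x1, x2):
    # Products (and sums) of valid covariance functions are valid:
    # a slowly varying RBF envelope times a periodic kernel gives
    # coarse periodicity, as in the example above.
    return rbf(x1, x2, ell=10.0) * periodic(x1, x2)
```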
7.1.4 Marginal Likelihood
In a Gaussian process, the marginal likelihood given the data D={X,Y} is also important
The log marginal likelihood of a Gaussian process can be easily calculated from the definition of the multidimensional Gaussian distribution (above)
Θ is a set of hyperparameters
Optimization with respect to hyperparameter Θ
Automatically adjust the rate of change of the function, periodicity, etc.
The above equation can be used to evaluate the fit of the model to the data.
Partial differentiation of the previous equation with respect to the hyperparameter θ yields the above equation
where α = Kθ−1Y
A model with many hyperparameters can overfit the data
This can be prevented by setting a prior distribution for the hyperparameters and inferring the posterior distribution probabilistically.
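A minimal sketch of hyperparameter optimization by minimizing the negative log marginal likelihood, using a Cholesky factorization and a log parameterization to keep the hyperparameters positive; here scipy.optimize.minimize with numerical gradients stands in for the analytic gradient above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal(theta, X, y):
    # theta = (ln sigma, ln ell, ln beta_inv); the log parameterization
    # keeps all hyperparameters positive during unconstrained optimization.
    sigma, ell, beta_inv = np.exp(theta)
    K = sigma**2 * np.exp(-(X[:, None] - X[None, :])**2 / (2 * ell**2))
    K += beta_inv * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -ln p(y|X) = 1/2 y^T K^-1 y + 1/2 ln|K| + N/2 ln(2 pi)
    return (0.5 * y @ alpha + np.log(np.diag(L)).sum()
            + 0.5 * len(X) * np.log(2 * np.pi))

# res = minimize(neg_log_marginal, x0=np.zeros(3), args=(X_train, y_train))
```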
7.2 Classification by Gaussian Process
Introduction
Gaussian processes can be applied to classification problems.
By changing the likelihood function in the regression model, a classification model can be realized.
As for inference, analytical calculations cannot be performed because of the complexity of the posterior distribution
Use approximation methods such as Laplace approximation and expectation propagation to solve this problem.
7.2.1 Classification Model Using Bernoulli Distribution
Construct a classification model that combines the Bernoulli and Gaussian distributions.
The regression model uses the Gaussian distribution for the likelihood function.
In the classification
After restricting the value of the function to μ(xn) ∈ (0,1) through the sigmoid function
We consider a model that generates a binary label yn ∈ {0,1} using the Bernoulli distribution
In other words
Let the likelihood function be the above equation
Let the parameter μ(xn) of the Bernoulli distribution be the above equation.
Assume that the latent function f is sampled according to a prior distribution given by a Gaussian process with the covariance function kβ(xi,xj)=k(xi,xj)+δi,jβ−1
By including noise with precision β (variance β−1) in each generated fn=f(xn), the model accounts for labeling errors that occur independently at each data point.
The latent function f(xn) sampled from the above model, the mean μ(xn) after transformation by the sigmoid function, and the binary data points generated from μ(xn).
It allows for complex classifications that cannot be handled by logistic regression.
A prediction using this model would look like this
Compute the prediction f*=f(x*) of the latent function at a test input x* after observing the training data D={X,Y}
where F={f1,…,fN}
By marginalizing f* out of the likelihood function using the previous equation, we obtain the predictive distribution of y*.
The extension to multiclass classification uses the softmax function and the categorical distribution instead of the sigmoid function and the Bernoulli distribution.
The predictive distribution is computed by applying variational inference.

7.2.2 Laplace Approximation
Introduction
One of the difficulties of classification by Gaussian processes is that the calculations described above cannot be performed analytically.
We will use the Laplace approximation to give an approximate solution to this problem.
7.2.2.1 Approximation of the posterior distribution
In Laplace approximation, a Gaussian distribution as shown in the above equation is set as the approximate posterior distribution q of the latent function F.
F̂ is the value that maximizes the log posterior distribution shown in the above equation (the MAP estimate).
Let the objective function be Ψ(F)
Kβ=Kxx+β-1I
By taking the gradient about F, the above equation is obtained
In the case of a likelihood function with Bernoulli distribution, the derivatives required for the above equation are
and
The matrix 𝝠 is the Hessian matrix of the negative log posterior at the point F̂.
7.2.2.2 Approximating the Predictive Distribution
After obtaining the MAP estimate F̂ by optimization with the gradient descent or Newton-Raphson method, approximate the predictive distribution using the approximate posterior distribution.
The predictive distribution of the latent function f* is given by the above equation.
Since it is an integral calculation using two Gaussian distributions, it can be calculated analytically.
The mean μ* and variance σ*2 are respectively given by the above equation
k*=(kβ(x*,x1),…,kβ(x*,xN))T
Using this result, the predictive distribution can be approximated from the above equation
Results of classification when the input is 2D.
On the left, some observed data points and the mean of the true distribution that generates the data are shown as contours.
On the right is the result of learning a Gaussian process classification model using Laplace approximation and drawing the predicted mean as a contour line.
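A sketch of the stable Newton iteration for the MAP latent function used in this Laplace approximation, following the standard formulation in Rasmussen and Williams; labels are assumed to be in {0,1} with a sigmoid likelihood.

```python
import numpy as np

def laplace_map(K, y, n_iter=20):
    # Newton iterations for the MAP latent function F-hat in GP
    # classification with a Bernoulli-sigmoid likelihood, y in {0, 1}.
    f = np.zeros(len(y))
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))        # sigmoid(f_n)
        W = pi * (1.0 - pi)                  # -d^2/df^2 ln p(y|f), diagonal
        grad = y - pi                        # d/df ln p(y|f)
        sw = np.sqrt(W)
        # Stable Newton step f <- K (K^-1 + W)^-1 (W f + grad), computed
        # through B = I + W^1/2 K W^1/2 to avoid inverting K directly.
        B = np.eye(len(y)) + sw[:, None] * K * sw[None, :]
        L = np.linalg.cholesky(B)
        b = W * f + grad
        a = b - sw * np.linalg.solve(L.T, np.linalg.solve(L, sw * (K @ b)))
        f = K @ a
    return f
```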
7.2.3 Expectation Propagation
In addition to the Laplace approximation, the expectation propagation method is often used as an approximate inference method for Gaussian process classification models.
Consider a classification problem with yn ∈ {−1,1} as the output label.
The cumulative distribution function Φ used in the probit regression model is used to construct the likelihood function, as in the above equation.
Assume that each function value fn is generated according to a Gaussian process with a covariance function kβ.
The posterior distribution of the latent functions in this model is given by the above equation.
Because of the difficulty in calculating the normalization of the right-hand side, the posterior distribution cannot be obtained analytically.
In the framework of the expectation propagation method, we approximate the above equation as follows
t(fn|μn,σ2n) is an approximation factor
Define the Gaussian density function as in the above equation
The parameters μn and σn2 of the approximation factor are initialized randomly in advance before executing the algorithm.
The other approximate distributions are Gaussian as in the above equation
μ=(μ1,…,μN)T
Σ is the N×N diagonal matrix with diagonal components σ12,…,σN2 and all off-diagonal components zero.
Procedure for updating the i-th approximation factor t(fi|μi,σi2)
The marginal of the i-th latent function under the approximate distribution q(F) follows from the general result for Gaussian marginals, as in the above equation
μi and σi2 are the i-th element of μ and the (i,i)-th element of Σ, respectively.
First, denote by q\i(fi) the renormalized distribution with the current approximation factor t(fi|μi,σi2) removed
The normalized distribution r(fi) with the likelihood factor p(yi|fi) reinstated can be expressed as above
From the general results of moment matching using the Gaussian distribution in (4.65) and (4.66), the mean and variance of r(fi) become the above equations
However
The final update equation is the above equation.
Typical Algorithm of Expectation Propagation for Large Data
7.3 Sparse Approximation of a Gaussian Process
Introduction
The main challenge of Gaussian processes is their computational complexity.
Just as deep learning models can be trained on large amounts of data using methods such as stochastic gradient descent, we would like to train Gaussian processes efficiently on large datasets; sparse approximations make this possible.
7.3.1 Variational Inference Methods with Inducing Points
Introduction
This section introduces variational inference and sparse approximation under the Gaussian assumption based on inducing points or pseudo input.
In the sparse approximation using inducing points, the covariance matrix K of size N×N, whose inverse is expensive to compute, is replaced by a low-rank approximation built from a smaller matrix of size M×M (M<N).
Kzz is the covariance matrix computed from the M inducing points z1,…,zM
Kxz is an N×M matrix
Each inducing point plays the role of a variational parameter
7.3.1.1 Approximating the posterior distribution by variational inference
In order to derive a variational inference method with inducing points for Gaussian processes
For simplicity, we consider a finite set of input points Xall
7.3.1.2 Approximating the Predictive Distribution
Given a new test input, the predictive distribution for y*∈ℝ is given by the above equation using the approximation distribution
However
and
Here is an example of the predictive distribution of a Gaussian process regression model with inducing points
The inducing points z are randomly selected from the same space as x.
As the number of inducing points increases, the approximation approaches the exact solution.
The ELBO of the approximation also approaches the exactly computed log marginal likelihood as the number of inducing points increases.
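The low-rank structure underlying these approximations can be sketched in a few lines of NumPy; variational methods additionally treat the inducing points Z as parameters optimized through the ELBO.

```python
import numpy as np

def nystrom(k, X, Z):
    # Low-rank approximation K_xx ~ K_xz K_zz^-1 K_zx built from the M
    # inducing points Z; the expensive N x N inverse is replaced by an
    # M x M one (plus jitter for numerical stability).
    Kzz = k(Z, Z) + 1e-6 * np.eye(len(Z))
    Kxz = k(X, Z)
    return Kxz @ np.linalg.solve(Kzz, Kxz.T)
```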
7.3.1.3 Optimization of Hyperparameters
To optimize the hyperparameters of the covariance function for a better fit to the data, apply the gradient method using partial derivatives of (7.66) instead of the log marginal likelihood
The inducing points can also be optimized at the same time.
7.3.2 Probabilistic Variational Inference
Stochastic variational inference combines variational inference, which maximizes a lower bound on the log marginal likelihood, with stochastic gradient descent, which improves learning efficiency by dividing the training data into mini-batches.
Can be applied to Gaussian processes
7.4 Interpreting Gaussian Processes in Deep Learning
Introduction
In the framework of supervised learning, Gaussian processes are widely used as models for learning nonlinear mappings from input variables to output variables (as in forward propagation neural networks, etc.).
Clarify the relationship between the two
If the number of hidden units in each layer of a fully connected feedforward neural network is taken to infinity, the network becomes equivalent to a Gaussian process.
Advantages of the Gaussian process representation of deep learning models
Inference can be performed analytically, resulting in rigorous predictions from the model.
Rigorous computation of the marginal likelihood allows for easy model selection, including optimization of hyperparameters and total number of parameters.
The space of functions can be designed directly by means of kernels (covariance functions) instead of a space of parameters, which allows for a wider choice of functions to be regressed and makes the design more intuitive.
Consider a fully connected multilayer neural network.
The above equation is used for the first layer l=1.
Subsequent layers are computed as in the above equation, using a nonlinear transformation by the activation function Φ
The weights and biases are assumed to follow a zero-mean Gaussian prior distribution, respectively.
In Section 7.4.4, we will discuss the means of fitting the covariance function of a Gaussian process with a sparsely coupled structure such as that used in convolutional networks.
7.4.1 The case of a single hidden layer
Consider the limit of a neural network with one hidden layer.
Assumptions
Forward propagating neural network with L=2
Consider the network structure such that ai(2)(x) is the output
Let the number of hidden units H1→∞.
The weight parameter and bias parameter follow a prior distribution that is independent for each element.
Consider the behavior of the output for a single input data x
Specifically, examine the mean and variance of ai(2)(x)
From the prior distribution of the parameters, the mean value is
Given the independence of each parameter, the variance is
zj(1)(x) assumes a finite variance common to each j
If we let H1→∞ in the above equation, the variance ki(2)(x,x) diverges to infinity.
To prevent this, the variance of the prior distribution of the weight parameters is scaled according to the number of hidden units H1.
Thus, the above equation is obtained.
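A quick Monte Carlo check of this argument, assuming tanh activations and the 1/√H scaling of the output-weight prior (all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def net_output(x, H, sigma_w=1.0, sigma_b=1.0, n_nets=5000):
    # Sample many independent one-hidden-layer networks from the prior
    # and record each scalar output for the same input x.
    W1 = rng.normal(0, sigma_w, (n_nets, H, len(x)))
    b1 = rng.normal(0, sigma_b, (n_nets, H))
    z = np.tanh(W1 @ x + b1)                        # hidden units
    W2 = rng.normal(0, sigma_w / np.sqrt(H), (n_nets, H))  # 1/sqrt(H) scaling
    b2 = rng.normal(0, sigma_b, n_nets)
    return (W2 * z).sum(1) + b2

# The histogram of net_output(np.ones(3), H=1000) is close to a Gaussian
# whose variance no longer grows with H, thanks to the scaled prior.
```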
7.4.2 Multilayer Case
For the deep model built by iterating the above discussion, consider making the number of hidden units in each layer H1 infinite.
Assume that each output ai(l) in the l-th layer follows a Gaussian distribution with the following moments
If the number of hidden units Hl→∞, the output of each layer has the moments shown in the above equation
However
7.4.3 Deep Kernel
Introduction
Derive a covariance function with a deep layer structure using a concrete nonlinear function Φ
In order to construct a covariance function with a deep structure using the above equation, it is necessary to calculate the expectation value (above equation) of the nonlinear transformation Φ using a two-dimensional Gaussian distribution N(a|0,Σ).
7.4.3.1 Covariance Function with Gaussian Error Function
First, we derive the recurrence equation when the Gauss error function is chosen as the nonlinear function.
The Gauss error function is defined as above.
The Gauss error function and the hyperbolic tangent function are similar in shape.
The expectation of this nonlinearity under the Gaussian distribution can be calculated analytically, as in the above equation.
By applying the preceding equation, we can obtain a recursive definition of the covariance function as shown above.
7.4.3.2 Covariance Function with Rectified Linear Function
Covariance functions with a recursive structure can also be obtained for the rectified linear function (ReLU), which is often used in deep learning models.
In general form, consider a nonlinear transformation parameterized by a natural number m
The nonlinear transformation Φm(a)=Θ(a)a^m, where Θ(·) is the Heaviside step function
The expected value calculation by Gaussian distribution can be obtained analytically as shown in the above equation
However
The function 𝑱m(Θ) becomes as above
Finally, the recursive covariance function is given by the above equation.
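A hedged NumPy sketch of the resulting recursion for the ReLU case m=1 (the arc-cosine kernel of Cho and Saul); the base-case scaling of the inputs is simplified for illustration.

```python
import numpy as np

def deep_relu_kernel(X, depth, sigma_w=1.0, sigma_b=0.0):
    # Recursive covariance function induced by infinitely wide layers with
    # ReLU activations. X: (N, D) inputs; depth: number of hidden layers.
    K = sigma_w**2 * (X @ X.T) + sigma_b**2          # base case (layer l=1)
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        cos_t = np.clip(K / np.outer(diag, diag), -1.0, 1.0)
        theta = np.arccos(cos_t)
        # J_1(theta) = sin(theta) + (pi - theta) cos(theta)
        J = np.sin(theta) + (np.pi - theta) * np.cos(theta)
        K = sigma_w**2 / (2 * np.pi) * np.outer(diag, diag) * J + sigma_b**2
    return K
```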

7.4.3.3 Prediction and Marginal Likelihood
Example of a Gaussian Process with a Covariance Function with Deep Structure
7.4.4 Convolutional Gaussian Process
Introduction
One of the advantages of a Gaussian process is that by designing the covariance function, certain properties can be given to the generated function.
We introduce the convolutional structure used in image recognition into the covariance function of a Gaussian process.
This makes it possible to efficiently learn data with high dimensionality and positional structure, such as images, which are difficult to model with existing Gaussian processes.
Compared to existing convolutional neural networks, it can estimate uncertainty based on Bayesian inference and automatically select model structure and hyperparameters.
Application of sparse approximation based on variational inference enables efficient computation while suppressing overfitting.
7.4.4.1 Models
7.4.4.2 Learning by Inter-Domain Approximation
7.4.4.3 Improving the Covariance Function
7.5 Generative Models with Gaussian Processes
Introduction
Gaussian processes are mainly used in the framework of supervised learning, where a function between input and output variables is learned.
By treating the input variable as an unobserved latent variable, we can construct a nonparametric learning model.
Nonlinear version of linear dimensionality reduction model
Complex low-dimensional subspaces that cannot be extracted by linear models such as principal component analysis and factor analysis can be extracted.
The Gaussian process latent variable model is the nonparametric model obtained when the number of hidden units in the generative network of a variational autoencoder is taken to infinity.
7.5.1 Gaussian Process Latent Variable Model
Introduction
By treating the input X of a Gaussian process as a latent variable, we can use the Gaussian process as a model for unsupervised learning.
Using a variational inference method based on inducing points, efficient approximate inference can be performed.
7.5.1.1 Models
7.5.1.2 Approximation by variational inference
7.5.2 Deep Gaussian Processes
Introduction
Hierarchical combination of latent variable models for Gaussian processes
More expressive models can be represented
7.5.2.1 Models
7.5.2.2 Approximation by Variational Inference
7.5.2.3 Further Improving the Efficiency of Inference
Gaussian Processes and Support Vector Machines
Equations are similar
Objective function of support vector machine
Objective function of MAP estimation for Gaussian process model used in Laplace approximation
In practice, they are very different

Appendix A

A.1 Calculating the Gaussian Distribution
A.1.1 Preparation
A.1.2 Conditional and marginal distributions
A.1.3 Linear transformation of the Gaussian distribution
A.2 ELBO maximization for variational inference methods using inducing points
A.3 Integral Calculations for Cumulative Distribution Functions of Normal Distributions
A.4 Calculating the covariance function
A.4.1 Covariance functions for Gaussian error functions
A.4.2 Covariance Function for Rectified Linear Functions
