Summary
Variational Bayesian learning applies the variational method to probabilistic models in Bayesian estimation in order to obtain an approximate posterior distribution. It is useful when the posterior distribution is too complex to obtain analytically, or when efficient estimation on large data sets is required. The variational method is a general-purpose optimization technique widely applied in classical and quantum mechanics, optimal control theory, economics, electrical engineering, optics, statistics, and other fields; it selects an optimal function from a set of functions by finding the extrema of a functional. Variational Bayesian estimation defines a variational distribution within a family of probability distributions on a function space, and optimizes it with respect to a distance or information measure between it and the posterior distribution given by Bayes’ theorem. Here, we describe variational Bayesian learning based on the Machine Learning Professional Series book “Variational Bayesian Learning”.
Machine Learning Professional Series “Variational Bayesian Learning” reading notes
These are reading notes on the Machine Learning Professional Series book “Variational Bayesian Learning”.
Preface
Chapter 1: Probability and Bayes’ Theorem
Introduction
Review of basic concepts of probability
1.1 Joint distribution
Example
Probability of being female and a fried-shrimp lover: Pr(a = female, b = fried shrimp) = 0.4
Probability of being male and a hamburger-steak lover: Pr(a = male, b = hamburger) = 0.3
Probability of being male and a fried-shrimp lover: Pr(a = male, b = fried shrimp) = 0.2
Joint distribution
1.2 Marginal distribution
Example
Probability that a randomly selected person, regardless of gender, is a hamburger lover
Probability that a randomly selected person likes fried shrimp regardless of gender
For some of the random variables that follow a joint distribution, sum over all the values they can take.
marginal distribution
1.3 Conditional distribution
Example
What happens if the male/female ratio changes from (0.5, 0.5) to (0.4, 0.6)?
Assuming that the respective food preferences of women and men remain the same
When a survey is conducted with only women or men, the ratio of hamburger steak lovers to fried shrimp lovers remains the same.
Normalize the joint probability distribution p(a,b) using the probability of being female (or male), i.e., the marginal probability p(a)
Conditional distribution for variable b given variable a
Example
Example of conditional probability calculation
Example
Calculation of probability of liking hamburgers (fried shrimp) after personnel change
1.4 Bayes’ Theorem
Interchange a and b in the conditional distribution
Example: the probability that one person selected at random from the group of hamburger (or fried-shrimp) lovers is a woman (or man)
Connecting the two equations above via the joint probability p(a,b)
Bayes’ theorem
Given a conditional probability distribution p(b|a) and a marginal distribution p(a), it provides a means to calculate from them a conditional distribution p(a|b) in which the random variable and the variable in the condition are interchanged.
Different Expressions of Bayes’ Theorem
Example.
Probability that this person is a woman (man) under the observation that she likes hamburgers
Bayes’ theorem interchanges the random variable and the conditioning variable in the conditional distribution p*(b|a), which depends on the unknown variable a
Example
Calculation before personnel changes
p(a) and p*(a) are prior distributions in Bayesian learning
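As a minimal sketch of this chapter's example, the joint table below uses the probabilities quoted in the notes; the fourth entry, Pr(female, hamburger) = 0.1, is implied by the requirement that the four entries sum to 1. Function and value names are mine.

```python
# Minimal sketch of Bayes' theorem on the gender/food-preference example.
# The joint probabilities follow the notes; P(female, hamburger) = 0.1 is
# implied by the requirement that the four entries sum to 1.
joint = {
    ("female", "fried_shrimp"): 0.4,
    ("female", "hamburger"):    0.1,
    ("male",   "fried_shrimp"): 0.2,
    ("male",   "hamburger"):    0.3,
}

def marginal(var_index, value):
    """Marginal probability: sum the joint over the other variable."""
    return sum(p for key, p in joint.items() if key[var_index] == value)

def conditional(b, given_a):
    """p(b | a) = p(a, b) / p(a)."""
    return joint[(given_a, b)] / marginal(0, given_a)

def bayes(a, given_b):
    """p(a | b) = p(b | a) p(a) / p(b)  (Bayes' theorem)."""
    return conditional(given_b, a) * marginal(0, a) / marginal(1, given_b)

# Probability that a randomly chosen hamburger lover is female.
print(bayes("female", given_b="hamburger"))  # 0.1 / 0.4 = 0.25
```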
Chapter 2: Bayesian Learning Framework
Introduction
Introduction to the Bayesian Learning Framework
2.1 Bayesian Posterior Distribution
A probabilistic model is a set of probabilistic laws that the observed data are (assumed to) follow.
Stochastic models in Bayesian learning
It is the conditional probability distribution of the observed data D that depends on the unknown model parameter ω ∈ W
Called model likelihood
(expresses prior knowledge about the model parameter ω before the observed data is available)
Bayesian learning
Given observed data D and a probability model {p(D|ω), p(ω)}
Compute posterior distribution p(ω|D)
Note: General stochastic model
Joint distribution of observed data and model parameters
p(D|ω):model distribution
p(ω):prior distribution
The stochastic model is given as the joint distribution of observed and unobserved variables
Unobserved variables include latent (hidden) variables (z) as well as model parameters.
When there are multiple random variables, the joint distribution depends on the “hyperparameter” k that specifies the model
General Probability Models
Bayes’ Theorem
If we can calculate the marginal likelihood p(D), we can calculate the posterior distribution using the above formula
p(D|ω)p(ω): probability model
Distribution of observed data D
The marginal likelihood is also called the partition function
Obtained by marginalizing the joint distribution p(D,ω) with respect to the parameter ω
When the parameters are discrete variables
δ: Dirac delta function
Example: if the observed value x ∈ ℝ has a 1D Gaussian distribution with variance 1
Estimate the mean value parameter ω = μ
The probability distribution for N independent observations D={x(1),…,x(N)} is given by the above equation
Gaussian distribution with mean 0 and variance 1 is used as prior distribution (above equation)
The posterior distribution is as above
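As a minimal sketch of this example (Gaussian likelihood with variance 1 and a Norm(0,1) prior on the mean), the conjugate posterior can be computed in closed form; the synthetic data and names below are illustrative only.

```python
# Sketch of the posterior for the mean of a 1D Gaussian with variance 1,
# using the Norm(0, 1) prior from the text. Data below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_mu = 1.5
x = rng.normal(true_mu, 1.0, size=20)          # D = {x(1), ..., x(N)}

N = len(x)
post_var = 1.0 / (N + 1)                       # 1 / (N/1 + 1/1)
post_mean = post_var * x.sum()                 # = N * x_bar / (N + 1)

print(f"posterior: Norm(mu; {post_mean:.3f}, {post_var:.3f})")
```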
Note: Likelihood and marginal likelihood
Substituting the observed value D into the conditional probability distribution p(D|ω) actually results in a function of the unknown parameter ω
Larger values better describe the actual observed data
A measure of plausibility of the unknown parameter ω
The marginal likelihood p(D) is a parameter-independent constant since it is marginalized with respect to the parameter ω
What does it indicate plausibility about?
It depends on the stochastic model {p(D|ω), p(ω)} and indicates the plausibility of that model
The marginal likelihood is used for model selection and hyperparameter estimation
Disadvantages of Bayesian Learning
The numerator of the right-hand side of the posterior distribution is easy to compute, but the integral operation to compute the denominator (marginal likelihood p(D)) is difficult.
The marginal likelihood p(D) is marginalized with respect to the unknown parameter ω, so it is a constant
Hence the shape of the posterior distribution (up to normalization) is easy to see
2.2 Posterior Probability Maximization Estimation Method
If we know the shape of the posterior probability, we can find the parameter that maximizes the posterior probability
Maximum a posteriori (MAP) estimation method
Generalization of maximum likelihood (ML) estimation
Note: Relationship between regularization and posterior probability maximization estimation methods
The loss term L(D,ω), which represents the degree of incompatibility between the observed data and the model, and
Define a regularization term R(ω) used to prevent overfitting
Statistical method to minimize the sum of the two (called the regularization method)
Can be interpreted as a posterior probability maximization estimation method
The posterior probability maximization (MAP) estimator is obtained by minimizing the negative logarithm of the joint distribution (above equation), which serves as its objective function
The first term is a loss function
The second term is the regularization function
2.3 Bayesian Learning
Essential differences between posterior probability maximization estimation methods and Bayesian learning
Advantages of Bayesian Learning Methods
Information on estimation accuracy of unknown variables is available at all times
Less prone to overfitting
All unknown variables can be estimated from observed data in a single framework
Allows model selection and hyperparameter estimation
To benefit from these advantages, at least one of the following quantities must be computed
Marginal likelihood (zeroth-order moment)
Normalization factor of the Bayes posterior distribution (the constant that is multiplied or divided by to perform normalization)
Hyperparameter estimation and model selection are performed by maximizing this quantity
Posterior mean (first moment)
⟨⋅⟩p denotes the expected value with respect to the distribution p
<f(ω)>p(ω)=∫f(ω)p(ω)dω for any function f(ω)
The posterior mean is also called Bayesian estimator
Used as an estimator of the parameter ω
Posterior covariance (second order moment)
T denotes the transpose of a matrix or vector
The posterior covariance is used to express confidence intervals for the estimated parameters
predictive distribution (expected value of the model distribution)
p(Dnew|ω) is the model distribution with unobserved new data Dnew assigned as a random variable
Predictive distribution directly gives the probability distribution of data that will be observed in the future
If it is computationally difficult, p(Dnew|ω̂), the model distribution with a Bayesian estimator ω̂ substituted, is used instead.
Why is the inability to compute proportionality constants a major problem?
All four quantities above depend on the marginal likelihood p(D)
All four quantities above require integral calculations of the form ∫f(ω)p(D,ω)dω with respect to some function f(ω)
Regard p(D,ω) as an unnormalized probability distribution over ω
The marginal likelihood is its zeroth-order moment
The posterior mean is a first order moment
Posterior covariance is the second moment
If the zero-order moment, the marginal likelihood, cannot be computed, the other three quantities are also difficult to compute.
Two categories of methods to approximate integral calculations
First approach
Samples following the posterior distribution ω(1),… , ω(T) ˜ p(ω|D) are generated on the computer and
Method of approximating the integral by the sample mean
Technique
Generate samples using the unnormalized distribution p(D,ω) (as a function of ω)
Gibbs sampling
Metropolis-Hastings algorithm
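As a hedged illustration of this first approach, here is a minimal Metropolis-Hastings sketch for the 1D Gaussian-mean model used earlier; the proposal scale, data, and names are illustrative, not the book's.

```python
# Sketch of the sampling approach: approximate the posterior mean of mu by
# Metropolis-Hastings using only the unnormalized density p(D, mu).
# Same toy model as above: x ~ Norm(mu, 1), prior mu ~ Norm(0, 1).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(1.5, 1.0, size=20)

def log_joint(mu):
    # log p(D, mu) up to constants: Gaussian likelihood + Gaussian prior
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * mu ** 2

mu, samples = 0.0, []
for t in range(5000):
    prop = mu + rng.normal(0.0, 0.5)           # random-walk proposal
    if np.log(rng.uniform()) < log_joint(prop) - log_joint(mu):
        mu = prop                               # accept
    samples.append(mu)

print("MC posterior mean:", np.mean(samples[1000:]))   # ~ N * x_bar / (N + 1)
```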
Second approach
Selecting a function that is as close as possible to the Bayesian posterior distribution from a class of functions whose expected values can be calculated.
Variational Bayesian learning
Expectation propagation
2.4 Empirical Bayesian Learning
Bayesian learning always requires defining a prior distribution
What if there is no specific prior knowledge to consider?
Use a “non-informative prior” as a fair prior distribution that contains as little prior knowledge as possible.
Prepare multiple prior distributions and select the one that best fits the observed data by model selection
Note: Non-informative prior distribution
When there is no prior information at all for an unknown parameter, we want to use a prior distribution π(ω) with the smallest bias possible.
Approach
Simplest method
Use flat distribution p(ω)∝1
Problem
Improper when the domain of ω is unbounded (the zeroth-order moment of the unnormalized distribution diverges, so it cannot be normalized)
Depends on how the parameters are parameterized
Jeffreys prior
Fisher information is taken into account.
Uniform in the sense of the Kullback-Leibler divergence between probability distributions, and invariant under parameter transformations
Issue.
The Jeffreys prior can be improper (non-normalizable)
Difficult to calculate expected value of posterior distribution
Empirical Bayesian learning, or type-II maximum likelihood estimation
Prior distribution p(ω|k) depending on unknown parameter k
k: hyperparameter
Marginal likelihood using a prior distribution with hyperparameters
A function of the hyperparameters
Expresses the plausibility of the hyperparameters given the observed data D
Hyperparameters are estimated by maximizing the marginal likelihood
Note: Hyperparameter Estimation and Model Selection
Preparing several candidate stochastic models and selecting the model that best fits the observed data D
Introduce continuous-valued hyperparameters and perform empirical Bayesian learning
Chapter 3: Stochastic Model Examples
Introduction
Introduction to basic and slightly more complex practical probability models
3.1 Gaussian distribution model
Suppose that an M-dimensional observation vector x ∈ ℝM follows an M-dimensional Gaussian distribution parameterized by the unknown model parameters ω = (μ, Σ)
μ ∈ ℝM: M-dimensional mean vector
ℝM is an M-dimensional real vector
Σ ∈ 𝕊++M is the MxM covariance matrix
𝕊++M is a set of positive definite symmetric matrices
A symmetric matrix is a square matrix that remains unchanged when transposed
A positive definite matrix is a symmetric matrix in which all eigenvalues are positive
|Σ| is the determinant of the matrix Σ
The observed data D={x(1),…,x(N)} are obtained by making N independent observations.
Assumptions
All x(n) are generated according to the same distribution
All pairs of observation vectors {x(n),x(n’)}, where n≠n’, are independent
The conjugate prior depends on which of the parameters ω=(μ,Σ) is Bayesian learned
Gaussian prior if only the mean parameter μ is Bayesian learned
Wishart distribution if only the covariance parameter (precision Σ-1) is Bayesian learned
If both μ and Σ-1 are Bayesian learned, the Gaussian-Wishart distribution, which combines these two, is used
μ0, Σ0, V0, ν0, λ0 are hyperparameters
If no prior knowledge is available, point estimation is done by empirical Bayes
Gaussian distribution with covariance parameter proportional to the unit matrix (Σ=σ2IM)
Equation
Isotropic Gaussian
Anisotropic Gaussian
3.2 Linear regression model
Consider a stochastic model that obeys the stochastic law above, where the combination of input x and output y depends on the unknown model parameters w=(α, σ2)
Eq.
Called a linear regression model
Substituting ε=y-αTx into the probability equation, we get the above equation
Assumptions
N input-output pairs D={(x(1),y(1)),…,(x(N),y(N))} are observed
Assume that the observation noise ε(n)=y(n)-αTx(n) is independent between different samples (n≠n′)
The model likelihood is given by the above equation
Linear regression is the most commonly used model for curve fitting the relationship between input x and output y.
A nonlinear input-output relationship in t can also be expressed by a nonlinear mapping of a low-dimensional input t to a high-dimensional input x.
For example, mapping the one-dimensional input variable t ∈ ℝ to the M-dimensional input vector (1,t,t2,…,tM-1)T ∈ ℝM gives
Eq.
Conjugate prior distribution depends on which of the parameters are Bayesian learned
Gaussian distribution if only the regression parameter α is Bayesian learned
Gamma distribution if only the noise precision σ-2 is Bayesian learned
Gaussian-Gamma distribution if both α and σ-2 are Bayesian learned
Linear regression model example
3.3 Automatic Relevance Determination Model
Consider a model generated as in the above equation
Observations y ∈ ℝL
Unknown variable α ∈ ℝL
X∈ℝLxM
Assume that each component of the noise ε ∈ ℝL independently follows a Gaussian distribution
Expression of model likelihood for observed data D=y and model parameters ω=(α,σ2)
Called a linear Gaussian model.
Linear regression model corresponds to L=N dimensional linear Gaussian model
Since y is an L-dimensional vector, the above equation consists of L equalities.
For L<M, α cannot be uniquely estimated even if there is no noise
How to solve
Using an isotropic Gaussian p(α)=NormM(α;0,σ02IM) as a prior distribution on α
The posterior probability maximization estimator for α is the above equation
This coincides with ridge regression, a regularization method (see the sketch below)
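A minimal sketch of this correspondence, assuming the standard closed form of the MAP estimator under an isotropic Gaussian prior (ridge regression with regularization coefficient σ²/σ0²); the data and names are synthetic and mine.

```python
# Sketch of the MAP (ridge) estimator for the linear Gaussian model with an
# isotropic Gaussian prior Norm_M(alpha; 0, sigma0^2 I). Data are synthetic.
import numpy as np

rng = np.random.default_rng(2)
N, M = 10, 30                                  # fewer samples than parameters (L < M case)
alpha_true = np.zeros(M); alpha_true[:3] = [2.0, -1.0, 0.5]
X = rng.normal(size=(N, M))
sigma2 = 0.1
y = X @ alpha_true + rng.normal(0, np.sqrt(sigma2), size=N)

sigma0_2 = 1.0                                 # prior variance
lam = sigma2 / sigma0_2                        # ridge coefficient
alpha_map = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

print("MAP (ridge) estimate of the first components:", alpha_map[:3].round(2))
```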
Empirical Bayesian learning of the hyperparameter C using the automatic relevance determination (ARD) prior can also yield a solution for L<M.
The hyperparameter is the prior covariance C=Diag(c12,…,cM2) ∈ 𝔻M, restricted to a diagonal matrix
Called automatic relevance determination model
What does it mean to perform empirical Bayesian learning using an auto-relevance prior?
The auto-relevance prior is a prior with mean zero and variance cm2 that differs for each component m
When cm2 is very small, the probability that the corresponding component satisfies am=0 is very high
Estimating the hyperparameter C by empirical Bayesian methods
The estimator αEB tends to be a sparse vector
A sparse vector is obtained as an estimator of α (= a linear regression model that explains the output with a small number of input components)
Example
Approximation to sparse vectors on wavelet space of natural images in image processing
3.4 Multinomial distribution model
Assume that exclusive K types of events occur with the probability of the above equation
∆K-1: (K-1)-dimensional standard simplex
The histogram obtained from N repeated trials of the above follows a multinomial distribution (above equation)
𝕀K is the set of vectors consisting of K integers
ℍNK-1 is the set of N samples and K category histograms
Multinomial distribution is a basic probability distribution along with Gaussian distribution
Commonly used in Bayesian learning as a component of mixture distribution models and latent Dirichlet allocation models.
A model in which histogram observations D=x follow a multinomial distribution
Conjugate prior distribution depends on which of the parameters are Bayesian learned
The conjugate prior distribution for the unknown parameter ω=θ is the Dirichlet distribution, which is a probability distribution on ∆K-1
3.5 Matrix decomposition model
Assumptions
Consider observed data D=V∈ℝLxM given in matrix form
Assume that the observation matrix is the sum of a low-rank signal matrix U ∈ ℝLxM and a noise matrix ε ∈ ℝLxM
To restrict the matrix U to low rank, express it in product form
A ∈ ℝMxH
B ∈ ℝLxH
H≤min(L,M)
Matrix Decomposition Model
Denote matrix column vectors in bold lowercase and row vectors in bold lowercase with tilde
Matrix Components
Assume that each component of the noise matrix ε independently follows a Gaussian distribution Norm(εl,m;0,σ2)
The probability distribution of the observation matrix V is as above
Continued
∥⋅∥Fro is the Frobenius norm, described in “Overview of the Frobenius norm and examples of algorithms and implementations“: its square is the sum of squares of all components of a matrix
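A minimal generative sketch of the matrix decomposition model described above (low-rank U = BAᵀ plus isotropic Gaussian noise); all sizes and names are illustrative assumptions, not the book's.

```python
# Sketch of the matrix decomposition (factorization) model V = B A^T + E,
# with low rank H and isotropic Gaussian noise; all sizes are illustrative.
import numpy as np

rng = np.random.default_rng(3)
L, M, H, sigma = 20, 30, 3, 0.1
A = rng.normal(size=(M, H))                    # A in R^{M x H}
B = rng.normal(size=(L, H))                    # B in R^{L x H}
E = rng.normal(0, sigma, size=(L, M))          # noise matrix

V = B @ A.T + E                                # observation matrix
# The (unnormalized) log-likelihood uses the squared Frobenius norm:
log_lik = -np.linalg.norm(V - B @ A.T, "fro") ** 2 / (2 * sigma ** 2)
print("rank of U = B A^T:", np.linalg.matrix_rank(B @ A.T), " log-lik term:", round(log_lik, 2))
```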
To allow variational Bayesian learning, use Gaussian distributions that are conditionally conjugate to the matrices A and B, respectively
tr(・) is the trace of a matrix: the sum of its diagonal components
Conjugate prior distribution depends on which of the parameters are Bayesian learned
If you want to estimate the appropriate rank by automatic relevance determination, use a diagonal prior covariance matrix (see above)
Diag(c1,…,cH) is the diagonal matrix with c1,…,cH as its diagonal components
matrix factorization model
Applications
Probabilistic principal component analysis
Stochastic extension of the classical method of Principal Component Analysis
Consider a stochastic model where the observed value v ∈ ℝL depends on the latent variable ã ∈ ℝH in the above form
B ∈ ℝLxH is a linear mapping from a low-dimensional latent variable space ∈ ℝH to a higher-dimensional data space ∈ ℝL
ε ∈ ℝL is the observation noise
Assume that each component follows an independent Gaussian distribution ε˜Norm(0, σ2IL)
Stochastic Principal Component Analysis Model Observation Vector v
M observations V=(v1,…,vM) are given
These are generated, in the form of the above equation, from latent variables AT=(ã1,…,ãM), each following ã ~ NormH(0,IH)
The resulting probability distribution coincides with the matrix decomposition model with CA=IH
Introducing a prior distribution on the linear mapping matrix B yields an interpretation of the matrix decomposition model as a stochastic principal component analysis model
To perform dimensionality reduction by principal component analysis
It is important to properly set the rank H of the matrix U, which is the dimension of the latent variable space of ã
Bayesian learning’s model selection feature allows H to be appropriately estimated from observed data
Reduced-rank regression model
Regress the relationship between multidimensional inputs x ∈ ℝM and output y ∈ ℝL on a low-rank mapping
Equation
Special case of matrix decomposition model assuming preprocessing on input/output data
Noise is assumed to follow a Gaussian distribution ε˜NormL(0,σ2IL)
Interpretation of reduced-rank regression model
Input x is mapped to the lower (H-dimensional) space by AT∈ℝHxM and then to the output space by B∈ℝLxH
Image of reduced-rank regression model
Assumptions
Suppose N input-output data (above equation) are observed
Model likelihood is the above equation
Assume inputs are pre-whitened and outputs are centralized
The covariance matrix between input and output is the observation matrix (above)
Consider noise variance with scale modified as above
The model likelihood can be written as the relationship between the unknown parameters ω=(A,B) as in the above equation
Collaborative filtering
Consider the situation where some of the observation matrices V have missing values
Let 𝚲 be the set of observed components of V ∈ ℝLxM. The model distribution is given by
continued
𝒫𝚲(V): ℝLxM → ℝLxM is the function that maps unobserved components to 0 and observed components to themselves (above equation)
#(𝚲) is the number of elements of the set 𝚲 (the number of observed components)
Example: How much a user (user) likes a product (item)
Estimates a low-rank matrix, assumed to represent user preferences, from observed components only
Predict missing values based on the estimated low-rank matrix
A method of predicting missing values by approximating the observation matrix V with the low-rank matrix U
3.6 Mixed distribution model
Models created by superposition of basic distributions such as Gaussian and multinomial
Equation
α=(α1,…,αK): mixture weight parameter
Takes values in the (K-1)-dimensional standard simplex (above equation)
Continued
The distribution p(x|τk) of the individual components is called the mixture component and has different parameters τk
The unknown parameters of this model are ω=(α, τ1,…,τK)
Given N i.i.d. observed data D={x(1),…,x(N)}, the model likelihood is given by the above equation
Difficult to handle because the product over samples of sums over mixture components becomes complicated
If the model likelihood is regarded as a marginal likelihood over auxiliary unknown variables, it becomes easier to handle.
Consider the probability model in the above equation
According to the multinomial distribution (first equation), a latent variable is generated that describes to which mixture component k the sample belongs
An observation x is generated from the mixture components specified by the latent variable according to the following (second equation)
ek∈{0,1}K is a K-dimensional binary vector whose kth component is 1 and all other components are 0
z=ek means that the sample was generated from the kth mixture component
The set of possible values {ek}k=1K corresponds to the set ℍ1K-1 of one-sample histograms
From the probability model above, integrating out the latent variable z and calculating the marginal likelihood for x yields the above equation
Using the joint distribution of the observed value x and the latent variable z (above equation)
For N i.i.d. observed data D={x(1),…,x(N)} and the corresponding N latent variables H={z(1),…,z(N)}, the joint distribution of observed data and latent variables becomes the above equation
A monomial that factorizes per observation and per mixture component
This is called the complete likelihood.
Most computations in a mixture distribution model, including maximum likelihood estimation, posterior probability maximization, and Bayesian learning, are based on this complete likelihood.
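As a minimal generative sketch of this latent-variable formulation, the following draws z from the multinomial over the mixture weights and then x from the selected component; a 1D Gaussian mixture with illustrative weights and means is used, and the names are mine.

```python
# Sketch of the latent-variable view of a mixture model: draw z from the
# multinomial over mixture weights, then draw x from the selected component.
import numpy as np

rng = np.random.default_rng(4)
alpha = np.array([0.5, 0.3, 0.2])              # mixture weights
mu = np.array([-2.0, 0.0, 3.0])                # component means (tau_k)

N = 1000
z = rng.choice(len(alpha), size=N, p=alpha)    # latent component labels (index of e_k)
x = rng.normal(mu[z], 1.0)                     # observations from the chosen components

# The complete likelihood factorizes per sample and per component:
# p(x, z | omega) = prod_n prod_k (alpha_k Norm(x_n; mu_k, 1))^{z_nk}
print("empirical mixture weights:", np.bincount(z) / N)
```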
The conditional conjugate prior distribution is
Dirichlet distribution for the mixture weight parameter α
For the mixture component parameter {τk}k=1K, it is the conjugate prior distribution of the mixture component p(x|τk).
3.7 Mixed Gaussian Distribution Model
Mixture distribution model using M-dimensional Gaussian distribution as the mixture component
Equation
Mixed Gaussian distribution model
The Gaussian-Wishart distribution is used as the conditional conjugate prior distribution
Gaussian and Wishart priors are used when learning only μ or only Σ-1, respectively
3.8 Latent Dirichlet Allocation Model
Latent Dirichlet allocation model
Often used as a dimensionality reduction method for document data
Assumptions
There is a set of M documents
Each document m consists of N(m) words {ω(n,m)}n=1N(m)
The L word types are given 1-of-L representations ω(n,m) ∈ {el}l=1L
Assume each word belongs to a latent topic z(n,m) ∈ {eh}h=1H
Each document m has its own topic distribution θm ∈ ∆H-1
Each topic has a different word distribution βh∈∆L-1
Equation for Latent Dirichlet Allocation Model
Differences from general mixed distribution models
Have the above equation as a mixture component
Graphical model of latent Dirichlet allocation model
Light blue nodes are observed variables
White nodes are unobserved variables
Arrows indicate dependencies among variables
Enclosures labeled H, N, and M are called plates, meaning that there are H, N, and M nodes in them, respectively.
As a prior distribution, the Dirichlet distribution is used, which is conditionally conjugate to θm and βh, respectively.
The per-document topic distributions are collected into an MxH matrix of document parameters Θ=(θ1,…,θM)T
The per-topic word distributions are collected into an LxH matrix of topic parameters B=(β1,…,βH)
The observed data D and the latent variables H are as in the above equations
The marginal probability for the observed data D is given by the above equation
Continued
Interpretation of Latent Dirichlet Allocation Model
It can be viewed as a matrix decomposition model that approximates the multinomial distribution parameter U with a low-rank matrix BΘT whose rank is the number of topics H
Chapter 4 Conjugacy
Introduction.
Analytically obtain the posterior distribution (including normalization factors) using the conjugacy of the model likelihood and prior distribution
4.1 Typical probability distribution
The model likelihood p(D|ω) and prior distribution p(ω) that make up the stochastic model consist of representative probability distributions p(x|ω) as shown in the table above
The blue area is the “normalization factor” that is independent of the random variable.
Constants that ensure the probabilities sum (or integrate) to 1
The complexity of a probability distribution's formula lies mostly in the normalization factor
The main part (the part that depends on the random variable) is surprisingly simple
Bayesian learning does not suffer much from the complexity of normalization factors
Normalization factors help with the integral calculations required for Bayesian learning
By definition, the integral of the main part (the black portion that depends on the random variable) equals the reciprocal of the normalization factor
Most of the integral calculations required in Bayesian learning can be done through normalization factors
The distributions in the table fall into four categories
4.2 Definition of Conjugacy
Conjugacy is defined as the relationship between the model likelihood p(D|w) and the prior distribution p(w), which constitutes the probabilistic model {p(D|w),p(w)}.
Definition: conjugate prior distribution
A prior distribution such that the prior distribution p(ω) and the posterior distribution p(ω|D) have the same functional form.
Other conditions are necessary for conjugacy to be useful
If the class of functions is taken broadly enough to contain every distribution function, any prior distribution is trivially conjugate
but this is not useful for calculations
In order for the calculation to be useful
Not only are the prior and posterior distributions in the same class of distribution functions, but also
It is implicitly assumed that expectation calculations (at least the normalization factor and the moments) can be easily performed for that class of distribution functions
Consider the functional form of the posterior distribution
When considering conjugacy, always focus on the functional form with respect to the parameter ω
It is important to consider p(D|ω) not as a function of the observed data D (i.e., model distribution), but as a function of the parameter ω (i.e., model likelihood)
Abbreviations for each distribution
Gaussian distribution
Gamma distribution
Wishart distribution
Multinomial distribution
Dirichlet distribution
4.3 Isotropic Gaussian Distribution Model
Assumptions
As the simplest example, consider the “isotropic Gaussian distribution model” (above equation)
The model likelihood for N i.i.d. observations D={x(1),…,x(N)} is given by the above equation
Isotropic Gaussian Likelihood Function
When considering conjugacy, consider the probability distribution with observed data D as the random variable as a function of the parameter ω.
Among the parameters (μ, σ2) of the isotropic Gaussian distribution, only the mean parameter μ is Bayesian learned first (i.e., ω=μ)
Omit the proportionality constant by considering the model likelihood as a function of µ
Only the part of the sum in the exponential function that depends on μ is taken out and organized.
x̄ is the sample mean
In the third equation, the factor that does not contain μ is omitted as a proportionality constant
The last equation shows that the model likelihood p(D|μ), as a function of μ, has the same form as an isotropic Gaussian distribution with mean x̄ and variance σ2/N
The maximum likelihood estimator of the mean parameter is the above equation
Model likelihood for the mean parameter of an isotropic Gaussian distribution model
The isotropic Gaussian distribution of a single sample x is also an isotropic Gaussian function of the mean parameter μ
Isotropic Gaussian function is closed with respect to the product
The product of isotropic Gaussian functions with different means is an isotropic Gaussian function
Memo.
The fact that the functional form of the model likelihood is closed with respect to the product is the key to easy Bayesian learning.
A family of distributions with these properties
A probability distribution that can be written in the above form by a suitable transformation of the random variable and the parameter (t=t(x), η=η(ω)).
A(⋅) and B(⋅) are arbitrary functions
A(⋅) must not depend on t and B(⋅) must not depend on η
The key point is that the interaction between parameters and random variables is always of the form exp(ηTt)
Using p(η) = exp(ηTt(0) – Ao(η) + Bo(t(0))) as prior distribution
If N observations D=(t(1),…,t(N))=(t(x(1)),…,t(x(N))) are obtained, then
The posterior distribution can be written in the form of the same exponential distribution family (see above)
η is the natural parameter
t is the sufficient statistic
All of the representative distributions above belong to the exponential family
Using an isotropic Gaussian prior (above equation) with k=(μ0,σ02) as hyperparameters
The functional form of the posterior distribution is the above equation
The final posterior distribution is as above
Gamma type likelihood function
If only the variance parameter σ2 is Bayesian learned
If we consider the model likelihood as a function of σ2 and omit the proportionality constant, we obtain the above equation
As a function of the inverse of the variance
It takes the form of a gamma distribution.
The maximum likelihood estimator of the variance parameter is the above equation
The model likelihood for the variance parameter of an isotropic Gaussian model is gamma-type
Gamma type is also closed with respect to the product
The functional form of the posterior distribution is the above equation
Using the gamma distribution (above equation) with K=(α0,β0) as a hyperparameter
continued
The expression for the final posterior distribution is given above
Isotropic Gaussian-Gamma type likelihood function
When Bayesian learning is performed by considering both mean and variance as parameters ω=(μ,σ-2)
Completing the square with respect to μ yields the above equation
where the isotropic Gaussian-Gamma distribution is defined as above
Continued
Isotropic Gaussian-Gamma distribution is the product of isotropic Gaussian and gamma distributions
Model in which the variance parameter of the isotropic Gaussian distribution depends on the random variable of the gamma distribution
x and γ are not independent
Isotropic Gaussian-Gamma distribution is also closed about the product
The posterior distribution is the above equation
Using the isotropic Gaussian-Gamma prior (above equation) with k=(μ0, λ0, α0,β0) as a hyperparameter
However
The final expression for the posterior distribution is the above
Although the expression is complex, the various moments can still be calculated
4.4 Gaussian distribution model
General Gaussian distribution models can be analyzed in much the same way as isotropic Gaussian distribution models
Wishart distribution, a multidimensional extension of the gamma distribution, appears in Bayesian learning of the covariance parameter Σ
Assumptions.
Consider an M-dimensional Gaussian distribution (upper equation) described by an unknown model parameter ω=(μ,Σ)
The model likelihood for N i.i.d. observations D={x(1),…,x(N)} is given by the above equation
Gaussian likelihood function
If we focus only on the mean parameter μ and consider the covariance parameter Σ to be constant, the model likelihood is
The functional form of the posterior distribution is as above
Using the Gaussian prior distribution (above) with k=(μ0,Σ0) as a hyperparameter
continued
However
The final posterior distribution is the above equation
Wishart-type likelihood function
When only the covariance parameter Σ is Bayesian learned
If the mean parameter μ is regarded as a constant, the model likelihood becomes the above equation
As with the isotropic Gaussian distribution, the model likelihood is treated as a function of the inverse Σ-1 of the covariance matrix
The functional form of the posterior distribution is as above
Using the Wishart prior distribution (above) with k=(V0,γ0) as a hyperparameter
The final posterior distribution is the above equation
The Wishart distribution is a multidimensional extension of the gamma distribution and agrees when M=1
Gaussian-Wishart type likelihood function
When both mean and covariance parameters are Bayesian learned
The model likelihood for the parameter ω = (μ, Σ-1) is given by the above equation
CONTINUED
Let us assume the above equation as Gaussian-Wishart distribution
Gauss-Wishart type functions are also closed about the product
The expression for the posterior distribution is given above
Using the Gauss-Wishart prior (above equation) with k=(μ0, λ0, V0, γ0) as hyperparameters
Continued
However
The final posterior distribution is the above equation
4.5 For a linear regression model
Consider a linear regression model with ω = (a, σ2) as a parameter (above)
For N i.i.d. observations D=y=(y(1),…,y(N))T with inputs X=(x(1),…,x(N)), the model likelihood is given by the above equation
Gaussian likelihood function
Bayesian learning of regression parameter a only
Expanding the exponential part of the model likelihood and completing the square as a function of the regression parameter a yields the above equation
When the inverse of XTX exists, the maximum likelihood estimator of a is the above equation
The functional form of the posterior distribution is
Using the Gaussian prior (above) with k=(a0, Σ0) as a hyperparameter
However
The final posterior distribution is as above
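A minimal sketch of this Gaussian posterior update, assuming the standard closed form ΣN = (XᵀX/σ² + Σ0⁻¹)⁻¹ and aN = ΣN(Xᵀy/σ² + Σ0⁻¹a0); the data, hyperparameters, and names are illustrative.

```python
# Sketch of the Gaussian posterior over the regression parameter a, with known
# noise variance sigma^2 and prior Norm_M(a; a0, Sigma0). Rows of X are samples.
import numpy as np

rng = np.random.default_rng(5)
N, M, sigma2 = 50, 3, 0.25
a_true = np.array([1.0, -0.5, 2.0])
X = rng.normal(size=(N, M))
y = X @ a_true + rng.normal(0, np.sqrt(sigma2), size=N)

a0, Sigma0 = np.zeros(M), np.eye(M)            # prior hyperparameters
Sigma0_inv = np.linalg.inv(Sigma0)

Sigma_N = np.linalg.inv(X.T @ X / sigma2 + Sigma0_inv)      # posterior covariance
a_N = Sigma_N @ (X.T @ y / sigma2 + Sigma0_inv @ a0)        # posterior mean

print("posterior mean:", a_N.round(3))
```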
Gamma type likelihood function
If only the variance parameter σ2 is Bayesian learned
Model likelihood is as above
The maximum likelihood estimator is the above equation
The functional form of the posterior distribution is as above
Using the gamma prior distribution (above) with k=(α0,β0) as a hyperparameter
continued
The final posterior distribution is given by
Gaussian-Gamma type likelihood function
When both the regression parameter a and the variance parameter σ2 are Bayesian learned
If the model likelihood is a function of ω=(a, σ-2), then the above equation becomes
Gaussian-Gamma distribution with the above equation
The posterior distribution is given by
Using the Gaussian-Gamma prior (above) with k=(μ0,Λ0, α0,β0) as a hyperparameter
However
The final posterior distribution is as above
4.6 Multinomial distribution model
Assumptions.
With the probability of the occurrence of K exclusive types of events (above equation) as parameters
on the histogram
Consider a multinomial distribution model
Dirichlet likelihood function
If we consider the model likelihood as a function of the parameter ω=θ, we obtain the above equation
1K is a K-dimensional vector with all components 1
The model likelihood of a multinomial distribution model is a Dirichlet-type function
Dirichlet-type functions are closed about the product
The posterior distribution is as in the above equation
Using the Dirichlet prior distribution (above) with k=Φ as a hyperparameter
The final posterior distribution is the above equation
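A minimal sketch of the Dirichlet posterior update for the multinomial model: the posterior concentration parameters are the prior concentrations plus the observed counts (the numbers below are illustrative).

```python
# Sketch of the Dirichlet posterior for the multinomial model.
import numpy as np

phi_prior = np.array([1.0, 1.0, 1.0])          # Dirichlet hyperparameter phi
x = np.array([7, 2, 1])                        # observed histogram, N = 10

phi_post = phi_prior + x                       # posterior: Dirichlet(theta; phi + x)
posterior_mean = phi_post / phi_post.sum()     # <theta_k> under the posterior
print("posterior mean of theta:", posterior_mean.round(3))
```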
Note: Special case of multinomial distribution
The multinomial distribution becomes the binomial distribution (above equation) when K=2
The multinomial distribution becomes the Bernoulli distribution (above equation) when K=2, N=1
The Dirichlet distribution becomes the beta distribution (above equation) when K=2
beta function
Chapter 5: Predictive Distribution and Empirical Bayesian Learning
Introduction.
Compute marginal likelihood, posterior mean, posterior covariance, and predictive distribution from posterior distribution
The posterior mean and posterior covariance can be easily calculated from the values of the parameters specifying the posterior distribution, provided the posterior distribution is of a well-known shape.
The predictive distribution and marginal likelihoods are obtained by calculations similar to those used to derive the posterior distribution
5.1 Posterior Mean (Bayesian Estimator) and Posterior Covariance
To complete the Bayesian learning, four quantities are computed as needed
marginal likelihood
Posterior mean
Posterior covariance
Predicted distribution
The shape of the posterior distribution depends on which parameters are Bayesian learned.
In all cases, the posterior distribution has the shape of a typical probability distribution
Isotropic Gaussian Distribution Model
Gaussian distribution model
Linear regression model
Multinomial distribution model
To get the posterior mean and posterior covariance, we can find the mean and covariance of a well-known distribution
First- and second-order statistics for representative probability distributions
5.2 Prediction distribution
Introduction
The predictive distribution for a new observation Dnew can again be computed using the fact that the distribution is closed with respect to the product
Practical calculations for predictive distributions in linear regression and multinomial distribution models
5.2.1 Linear regression model case
Assumptions.
For a linear regression model with ω = a ∈ ℝM as unknown parameters (above equation)
N samples
Model Likelihood
Gaussian distribution with mean 0 and covariance C is used as the prior distribution (above equation)
The posterior distribution is as above
However
Calculating the predictive distribution of output y* for new input x*.
The predictive distribution is the expected value for the posterior distribution of the model distribution (on the new inputs and outputs)
Complete the square of the function under integration as a function of the integrating variable, the mean parameter a
Since the predictive distribution is a function of the new output y*, quantities that depend on y* are not omitted and are taken out of the integral
Continued
Here we use the above equation
Calculation Continued
Continued
where
Final Predicted Distribution
Example Result
Since the observed data exist only in the central half, the confidence interval widens at both ends.
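A minimal sketch of the resulting Gaussian predictive distribution, assuming the standard form y* ~ Norm(x*ᵀaN, σ² + x*ᵀΣNx*); the posterior parameters below are placeholders. The x*ᵀΣNx* term is what widens the interval away from the observed data, as noted above.

```python
# Sketch of the Gaussian predictive distribution for a new input x*, given a
# posterior Norm_M(a; a_N, Sigma_N); the numbers below are illustrative only.
import numpy as np

def predictive(x_new, a_N, Sigma_N, sigma2):
    """Mean and variance of p(y*|x*,D) = Norm(y*; x*^T a_N, sigma^2 + x*^T Sigma_N x*)."""
    mean = x_new @ a_N
    var = sigma2 + x_new @ Sigma_N @ x_new     # noise variance + parameter uncertainty
    return mean, var

a_N = np.array([1.0, -0.5, 2.0])               # illustrative posterior mean
Sigma_N = 0.01 * np.eye(3)                     # illustrative posterior covariance
m, v = predictive(np.array([0.5, -1.0, 0.2]), a_N, Sigma_N, sigma2=0.25)
print(f"predictive: Norm(y*; {m:.3f}, {v:.3f})")
```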
5.2.2 Multinomial distribution model
Assumptions
Multinomial distribution model (above equation) on observed data D=x=(x1,…,xK) ∈ ℍNK-1 with unknown parameters ω=θ=(θ1,…,θK) ∈ ∆K-1
The posterior distribution is as above
The predictive distribution for a new sample x*∈ℍ is given by
Continued
The equation for the predictive distribution is
ditto
5.3 Marginal Likelihood
Calculating the marginal likelihood of a linear regression model
The marginal likelihood is used as a criterion for model selection and hyperparameter estimation
Proportionality constants must not be carelessly dropped in the middle of the calculation
If all model candidates are described by the hyperparameter k=C, we can focus only on the k dependence and omit the independent factors
In cases where model selection is made from several completely different stochastic models, all factors need to be considered
Marginal Likelihood Calculation
Continued
From the expression for the normalization factor
Final equation (marginal likelihood of linear regression model)
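A minimal sketch of this marginal likelihood, assuming the prior a ~ NormM(0, cI), under which marginalizing a gives y ~ NormN(0, σ²I + cXXᵀ); it is followed by empirical Bayes estimation of c by a simple grid search. The data and names are synthetic.

```python
# Sketch of the marginal likelihood of the linear regression model with prior
# a ~ Norm_M(0, c I), and empirical Bayes estimation of c by grid search.
import numpy as np

rng = np.random.default_rng(6)
N, M, sigma2 = 40, 5, 0.5
a_true = rng.normal(size=M)
X = rng.normal(size=(N, M))
y = X @ a_true + rng.normal(0, np.sqrt(sigma2), size=N)

def log_marginal_likelihood(c):
    S = sigma2 * np.eye(N) + c * X @ X.T       # covariance of the marginal of y
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))

grid = np.logspace(-3, 2, 50)
c_hat = grid[np.argmax([log_marginal_likelihood(c) for c in grid])]
print("empirical Bayes estimate of c:", round(float(c_hat), 4))
```

Maximizing this quantity over c is exactly the hyperparameter estimation discussed in the next section (equivalently, minimizing the free energy, its sign reversal).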
5.4 Empirical Bayesian Learning
In empirical Bayesian learning, the hyperparameter k is estimated by maximizing the marginal likelihood p(D|k)
The sign-reversed log marginal likelihood
Eq.
Since log(⋅) is a monotonic function
maximizing the marginal likelihood is equivalent to minimizing this free energy
The Bayesian free energy of the linear regression model is given by the above equation
Restrict the hyperparameter, the prior covariance matrix, to a diagonal matrix
When empirical Bayesian learning is performed with this prior distribution, “automatic relevance determination” occurs.
Behavior of Bayesian free energy in the automatic relevance determination model
Empirical Bayesian estimation
James-Stein type estimator
Has the property of dominating maximum likelihood estimation
Chapter 6 Variational Bayesian Learning
Introduction
For many practical stochastic models, Bayesian learning cannot be done analytically based on conjugacy
However, many of them are composed of combinations of underlying probability distributions that are conjugate
Variational Bayesian learning is a method of constraining the posterior distribution based on the partial conjugacy of the probability model and approximating the Bayes posterior distribution within those constraints.
6.1 How Variational Bayesian Learning Works
Formulate Bayesian learning as a functional minimization problem
A functional is a function with a function as a variable
Assumptions
Let r(ω) be an arbitrary probability distribution on the parameter space W
free energy, or variational free energy
Continued
where the above equation is the Kullback-Leibler divergence from the probability distribution p1(ω) to the probability distribution p2(ω)
F*≡ -logp(D) is the Bayesian free energy
Minimizing the free energy (above equation) is
Equivalent to finding the distribution closest to the posterior distribution
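For reference, the decomposition behind these lines can be written explicitly (a standard identity, stated here with r(ω) denoting the variational distribution introduced above):

$$
F(r) = \left\langle \log \frac{r(\omega)}{p(D,\omega)} \right\rangle_{r(\omega)}
     = \mathrm{KL}\big(r(\omega)\,\|\,p(\omega|D)\big) - \log p(D)
     \;\ge\; F^{*} \equiv -\log p(D),
$$

with equality if and only if r(ω) = p(ω|D), since the Kullback-Leibler divergence is nonnegative.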
The solution obtained by solving the unconstrained minimization problem (above equation) is
Bayesian posterior distribution (above equation)
Define Bayesian problem as a minimization problem
Only a limited class of distributions r allows the expectation calculations to be performed analytically
Except when r has a special functional form, even evaluating the objective function of the minimization problem is difficult
Variational Bayesian learning solves the constrained minimization problem (above equation), where the constraint is added so that the expectations in the objective function can be calculated
s.t.: abbreviation for “subject to”; it means solving the minimization problem subject to the constraint r ∈ g
If a specific family of distributions (e.g. Gaussian) is chosen as g, the expectations needed to evaluate the free energy can be calculated for all r ∈ g
In variational Bayesian learning, the
Set the weakest possible constraint (wide search region g) so that the optimal functional form is automatically selected based on partial conjugacy of the model likelihoods.
6.2 Conditional Conjugacy
Confirm that the matrix factorization, mixed Gaussian, and latent Dirichlet allocation models have no conjugate prior distribution (for which expectations can be calculated) over the full set of unknown parameters.
Model likelihood equation for matrix factorization model
The parameters to be Bayesian learned are colored red and blue
Consider it as a function of the parameter ω=(A,B)
Exponential function with a fourth order term ∥BAT∥2Fro=tr(BATABT)
Obviously different from the Gaussian distribution, which has only a second-order term in the exponential function.
Integration of a function with a fourth order term in the exponential function cannot be done analytically
There is no conjugate prior distribution for the parameter ω=(A,B)
Model likelihood for mixed Gaussian distributions
The covariance of every Gaussian component is Σk=σ2IM
σ2 is assumed not to be Bayesian learned
Latent variable H={z(n)}n=1N is estimated from data along with unknown model parameter ω=(α,{μk}k=1K)
Model Likelihood of Latent Dirichlet Allocation Model
The above model has no conjugate prior distribution over the entire set of unknown parameters
Definition 6.1 Conditional conjugate prior distribution
Divide the unknown parameters (or broadly unknown variables) ω=(ω1,ω2) into two parts and consider ω2 to be a constant
When the prior distribution p(ω1) on ω1 and the posterior distribution (above equation) have the same functional form
This prior distribution p(ω1) is called a conditional conjugate prior of the model likelihood p(D|ω) with respect to the parameter ω1 (given ω2).
Note: Other uses of conditional conjugacy
Conditional conjugacy plays an important role beyond variational Bayesian learning
In Gibbs sampling, a Markov chain Monte Carlo method
Generate a Markov chain by sampling each parameter in turn, taking advantage of the fact that the posterior distribution of ω1 is a well-known distribution (easy to sample) given other parameters ω2
In collapsed Gibbs sampling, collapsed variational Bayesian learning, and partially Bayesian learning
After marginalizing some of the parameters based on conditional conjugacy, the Gibbs sampling method, variational Bayesian learning, or posterior probability maximization estimation method is applied to the remaining parameters, respectively.
6.3 Design Guideline
Variational Bayesian Learning Design Based on Conditional Conjugacy
Assumptions.
Divide the unknown parameters ω to be Bayesian learned into S groups, ω=(ω1,…,ωS)
For every s=1,…,S, the model likelihood has a conditional conjugate prior distribution p(ωs) with respect to ωs
That is, the posterior distribution (above equation) has the same functional form as the prior p(ωs) as a function of ωs, and its expectations can be calculated when {ωs’}s’≠s are regarded as constants
Using the prior distribution (above equation) under this partition
To allow the expectation calculation for ωs to be performed independently of {ωs’}s’≠s, we impose the independence of the above equation as a constraint on the posterior distribution
This makes it possible to perform the expectation calculations in the free energy and to solve the minimization problem
Definition of variational Bayesian posterior
Optimize each factor of the posterior distribution separately
Free energy can be written as a function of a finite dimensional unknown variable (variational parameter)
6.4 Variational method
How to find the conditions that the function to be solved must satisfy from the extreme value conditions of the functional
Change in the (smooth) objective function F(γ) with respect to a small change in the variable function γ
For γ to be a minimizing solution, the variation must be zero for all possible values of ω
The variational method can also be used when the objective function F(r) contains derivatives of the variable function r(ω) (e.g. dr/dω1)
Free energy has no derivative term.
The variation δI is computed by ordinary differentiation of the function γ
Must hold at all points within the parameter’s domain W
Can be interpreted as a stationary condition in infinite dimension
The variation δI=δI(ω) corresponds to the gradient with respect to the variable function γ(ω), regarded as an infinite-dimensional vector indexed by all points in W (the values at all points are treated as independent components).
6.5 Variational Bayesian Learning Algorithm
Applicable Equations
To solve the minimization problem (above equation), we use
Free energy (above equation)
Substitute the above equations as γ(ω) and p(ω)
Using the decomposition condition, compute the variation of the free energy with respect to each factor γs(ωs)
Conditions for decomposition
If this condition holds for all s=1,…,S and all ωs∈Ws
Right side is a function of ωs
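The stationarity condition referenced here has a well-known closed form; for reference (with the factorization constraint γ(ω) = ∏s γs(ωs)), it reads:

$$
\gamma_s(\omega_s) \;\propto\; \exp\!\Big( \big\langle \log p(D,\omega) \big\rangle_{\prod_{s'\neq s}\gamma_{s'}(\omega_{s'})} \Big), \qquad s = 1,\dots,S,
$$

i.e., each factor is obtained by exponentiating the expectation of the log joint distribution taken with respect to all the other factors.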
Local search algorithm for variational Bayesian learning
Mean of variational Bayes posterior distribution
The variational Bayesian estimator is substituted into the model distribution to approximate the predictive distribution
6.6 Empirical Variational Bayesian Learning Algorithm
Since it is difficult to compute the marginal likelihood p(D), the free energy F(γ) is used as a substitute
In the framework of variational Bayesian learning
Model selection and hyperparameter estimation are performed by minimizing the free energy, which is the upper bound of the Bayesian free energy – logp(D).
When the prior distribution or the model likelihood has a hyperparameter k, the free energy is given by the above equation
6.7 For the matrix factorization model
Introduction.
Deriving Variational Bayesian and Empirical Variational Bayesian Learning Algorithms in Matrix Decomposition Models
Model Likelihood and Prior Distribution Equations
V is the observation matrix
A∈ℝMxH and B∈ℝLxH (where H≤min(L,M)) are unknown parameters
The prior distribution has an unknown diagonal covariance matrix (see above) as a hyperparameter
The observation noise parameter σ2 is treated as a hyperparameter, since how it is learned does not significantly affect the estimation accuracy.
6.7.1 Derivation of the variational Bayesian learning algorithm
Condition of minimizing free energy by imposing independence between A and B as a constraint on the posterior distribution
Equation for free energy under the above independence constraints
Continued
Applying the variational method to γA(A) and γB(B) respectively, we obtain the above equations as the corresponding stationarity conditions
Substituting the model likelihood (①) and the prior distribution of A (②) into (③), we obtain the above equation
However
Finally, γA becomes the above equation
Continued
Similarly, substituting the model likelihood (arrow ①) and the prior distribution of B (arrow ②) into (arrow ③) and focusing only on the B dependence, we obtain the above equation
However
Finally, γB becomes the above equation
Find the mean and variance of the posterior distribution
Variational Bayesian posterior distribution is determined
Since we know that γA and γB are Gaussian, from Table 5.1
From these, the variational parameters are given by the above equation
After setting appropriate initial values for the variational parameters, applying the above updates iteratively until convergence yields a local solution (see the sketch below).
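Since the update equations referenced above appear only as images in the original notes, the following is a hedged sketch based on the commonly derived stationarity conditions for this model (Gaussian γA and γB whose rows share covariances ΣA and ΣB, row-wise Gaussian priors with diagonal covariances CA and CB); variable names and data are mine.

```python
# Hedged sketch of iterative variational Bayes updates for the matrix
# factorization model V = B A^T + E (no missing values), based on the commonly
# derived stationarity conditions; not a transcription of the book's equations.
import numpy as np

rng = np.random.default_rng(7)
L, M, H, sigma2 = 20, 30, 3, 0.01
V = rng.normal(size=(L, H)) @ rng.normal(size=(H, M)) + rng.normal(0, np.sqrt(sigma2), (L, M))

C_A, C_B = np.eye(H), np.eye(H)                # prior covariances (hyperparameters)
A_hat, B_hat = rng.normal(size=(M, H)), rng.normal(size=(L, H))
Sigma_A, Sigma_B = np.eye(H), np.eye(H)

for it in range(100):
    # Update r_A: covariance then mean (uses <B^T B> = B_hat^T B_hat + L Sigma_B)
    Sigma_A = sigma2 * np.linalg.inv(B_hat.T @ B_hat + L * Sigma_B + sigma2 * np.linalg.inv(C_A))
    A_hat = V.T @ B_hat @ Sigma_A / sigma2
    # Update r_B symmetrically (uses <A^T A> = A_hat^T A_hat + M Sigma_A)
    Sigma_B = sigma2 * np.linalg.inv(A_hat.T @ A_hat + M * Sigma_A + sigma2 * np.linalg.inv(C_B))
    B_hat = V @ A_hat @ Sigma_B / sigma2

print("reconstruction error:", np.linalg.norm(V - B_hat @ A_hat.T, "fro"))
```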
6.7.2 Free energy as a function of the variational parameters
Find the free energy F not as a functional of γA and γB, but as a function of the variational parameters (Â, ΣA, Ḃ, ΣB)
Using the above equation, the problem changes from an optimization over functionals to an ordinary optimization over the variational parameters
continued
Key Points on Optimization Problems
6.7.3 Derivation of the empirical variational Bayesian learning algorithm
The empirical variational Bayesian algorithm is derived by including the hyperparameters k=(CA, CB, σ2) among the variables that minimize the free energy
Continued
Partial differentiation of the free energy with respect to the prior covariances CA and CB (diagonal components) yields the above equation
The above equation is obtained as a stationary condition
Partial differentiation of the free energy by the noise variance σ2 yields the above equation
The above equation is obtained as a stationary condition
By repeating the above two updates from appropriate initial values, a local solution of empirical variational Bayesian learning can be determined.
Empirical Variational Bayesian Learning Algorithm for Matrix Decomposition Models
6.8 The case of a matrix factorization model with missing values
Introduction.
The same policy can be used to derive a variational Bayesian learning algorithm when not all components of the observation matrix are observed
Calculations are complicated by the missing posterior covariance of A and B
Model likelihood with missing
Use the same prior distribution as the one without missing
6.8.1 Derivation of the variational Bayesian learning algorithm
Stationarity condition in the case without missing values
Equation with correction for the effect of missing values
Continued
However
∑(l,m)∈𝛬 is the sum over all observed indices (l,m) ∈ 𝚲
∑l;(l,m)∈𝛬 is the sum over all l satisfying (l,m) ∈ 𝚲 for a given m
γA(A) is a Gaussian distribution (see above) with means âm and covariances 𝚺A,m, each satisfying the above equations
Similarly, γB is modified as in the above equation
However
γB(B) is Gaussian with means b̂l and covariances 𝚺B,l, each satisfying the above equations.
For models with missing values, the covariance of each row vector of A and B is different for each row (depends on m and l)
The final equation is above
The first and second order moments of the variational Bayesian posterior distribution are as above
Repeating the above equation until convergence as in the case of no missing values yields a variational Bayesian local solution
6.8.2 Free energy as a function of variational parameters
The free energy is expressed above as a function of the variational parameter
6.8.3 Derivation of the empirical variational Bayesian learning algorithm
Partial differentiation of the free energy with respect to cah2, cbh2, and σ2 yields the hyperparameter update rules
CONTINUED
After setting appropriate initial values, the above equations are repeated until convergence, yielding a local solution of empirical variational Bayesian learning.
Algorithm: Empirical variational Bayesian learning algorithm for matrix factorization models with missing values
The posterior mean of the corresponding component (above equation) is used to predict missing values
6.9 For the mixed Gaussian distribution model
Introduction
Applying Variational Bayesian Learning to a Mixed Gaussian Distribution Model (above equation)
Assumptions
First, for simplicity, consider the case where all covariances of the mixed Gaussian components are known and are unit matrices
Symmetric (uniform) Dirichlet prior is used for the prior distribution of α
Isotropic Gaussian distribution with mean 0 is used for the prior distribution of μk
The model (complete) likelihood for N i.i.d. observed data D={x(1),…,x(N)} and the corresponding N latent variables H={z(1),…,z(N)} is given by the above equation
6.9.1 Derivation of the variational Bayesian learning algorithm
In the mixed Gaussian distribution model, the latent variable H={z(n)}n=1N is introduced to make the model likelihood easier to handle
In addition to the unknown parameters ω=(α,{μk}k=1K), an approximate posterior distribution for the latent variables is also needed
In the mixed Gaussian distribution model, conditional conjugacy can be used by splitting the unknown variables into latent variables and parameters.
Variational Bayesian learning of a mixed Gaussian model solves the minimization problem in the above equation
Under this independence constraint, the free energy decomposes as in the above equation
Applying the variational method to γH(H) and γω(ω) respectively and computing the stationarity conditions yields the above equations
Substituting the model likelihood into the above equation and focusing only on the latent variable H, we obtain the above equation
where zk(n) satisfies the above equation
The posterior distribution of the latent variables is a multinomial distribution that is independent across samples
where
On the other hand, substituting the model likelihood into the above equation and focusing only on the parameter ω, we obtain the above equation
However
Continued
We see that the posterior distribution of the parameters is a Dirichlet distribution with respect to α and a product of isotropic Gaussian distributions with respect to {μk}k=1K
However
Calculation of the final expected value
The variational parameter describing the posterior distribution is the above equation
where
Summarizing the results obtained using the expected values, the above equation becomes
However
where {z(n)}n=1N,α and {μk,σ2k}k=1K are variational parameters and satisfy the above equation
where
CONTINUED
Using equation (1) where necessary and updating the variational parameters according to equation (2) until convergence yields a local solution of variational Bayesian learning.
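As a hedged sketch of how these coupled updates might look in code: the responsibility, Dirichlet, and Gaussian update forms below follow the standard mean-field recipe for this model (symmetric Dirichlet prior, zero-mean isotropic Gaussian prior on the means, unit-covariance components); the function name, initialization, and hyperparameter names phi0 and sigma0_sq are my own assumptions, not the book's notation.

```python
import numpy as np
from scipy.special import digamma

def vb_gmm_unit_cov(X, K, phi0=1.0, sigma0_sq=10.0, n_iter=100, seed=0):
    """Mean-field VB for a Gaussian mixture with known identity covariances.

    Prior: mixing weights ~ symmetric Dirichlet(phi0), means mu_k ~ N(0, sigma0_sq * I).
    Variational posterior: Dirichlet over weights, isotropic Gaussians over means,
    and per-sample multinomial responsibilities r[n, k].
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    r = rng.dirichlet(np.ones(K), size=N)         # random soft initialization

    for _ in range(n_iter):
        # ---- update q(alpha) and q(mu_k) from the responsibilities ----
        Nk = r.sum(axis=0)                        # effective counts per component
        phi = phi0 + Nk                           # Dirichlet parameters
        s_sq = 1.0 / (1.0 / sigma0_sq + Nk)       # posterior variance of mu_k (unit noise)
        m = (r.T @ X) * s_sq[:, None]             # posterior mean of mu_k

        # ---- update responsibilities q(z) ----
        e_log_alpha = digamma(phi) - digamma(phi.sum())
        # E||x_n - mu_k||^2 = ||x_n - m_k||^2 + D * s_k^2
        sq_dist = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        log_rho = e_log_alpha[None, :] - 0.5 * (sq_dist + D * s_sq[None, :])
        log_rho -= log_rho.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_rho)
        r /= r.sum(axis=1, keepdims=True)

    return r, phi, m, s_sq
```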
6.9.2 Free energy as a function of variational parameters
Using the previous results, we express the free energy as a function of the variational parameters {z(n)}n=1N, α, {μk,σk2}k=1K
6.9.3 Derivation of the empirical variational Bayesian learning algorithm
Partial differentiation of the free energy with respect to the hyperparameters k=(Φ,σ02) yields the above equation
Setting ∂F/∂Φ=0 yields a stationarity condition that cannot be solved analytically for Φ
Based on the second derivative (above equation)
Update Φ by the Newton-Raphson method (above)
where Ψm(z) ≡ d^mΨ(z)/dz^m is the m-th order polygamma function
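A minimal, generic sketch of such a Newton-Raphson update follows; grad_fn and hess_fn are hypothetical placeholders for ∂F/∂Φ and ∂²F/∂Φ² (in the text they are built from digamma and polygamma terms), and only the availability of scipy.special.polygamma for Ψm is taken as given.

```python
from scipy.special import polygamma  # polygamma(m, z) is the m-th derivative of digamma, i.e. Psi_m(z)

def newton_update_scalar(phi, grad_fn, hess_fn, n_iter=50, tol=1e-10, min_val=1e-8):
    """Generic 1-D Newton-Raphson iteration for a hyperparameter stationarity condition.

    grad_fn/hess_fn are placeholders for dF/dphi and d^2F/dphi^2, e.g. expressions
    involving polygamma(0, .) (digamma) and polygamma(1, .) (trigamma).
    """
    for _ in range(n_iter):
        step = grad_fn(phi) / hess_fn(phi)
        phi_new = max(phi - step, min_val)   # keep the concentration parameter positive
        if abs(phi_new - phi) < tol:
            return phi_new
        phi = phi_new
    return phi
```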
The update rule for the prior variance σ02 is given by the above equation
Algorithm: Empirical variational learning algorithm for mixed Gaussian distribution models
After setting appropriate initial values, the above updates are repeated until convergence to obtain a local solution of empirical variational Bayesian learning.
Note: how not to calculate ratios of gamma functions
Handle the gamma function with care: direct evaluation quickly causes overflow
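A small illustration of this pitfall and the usual workaround: compute the ratio in log space with the log-gamma function (scipy.special.gammaln). The particular arguments below are arbitrary examples.

```python
import numpy as np
from scipy.special import gammaln  # log Gamma, stable for large arguments

# Direct evaluation of Gamma(a) / Gamma(b) overflows once a, b are moderately large
# (Gamma(200) already exceeds the float64 range), so work in log space instead.
def gamma_ratio(a, b):
    """Compute Gamma(a) / Gamma(b) as exp(log Gamma(a) - log Gamma(b))."""
    return np.exp(gammaln(a) - gammaln(b))

# Example: Gamma(500.5) / Gamma(500.0); numerator and denominator both overflow,
# but their ratio is an ordinary number (about sqrt(500)).
print(gamma_ratio(500.5, 500.0))
```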
6.10 The case of the latent Dirichlet allocation model
Introduction
Deriving Variational Bayesian Learning for Latent Dirichlet Allocation Models
Assumptions.
The model likelihood and prior distribution are expressed by the above equations
6.10.1 Derivation of the variational Bayesian learning algorithm
As in the case of the mixed Gaussian model, find the approximate posterior distribution γ(H,ω) over the latent variables H={{z(n,m)}n=1N}m=1M and the unknown parameters ω=(Θ,B)
Conditional conjugacy is obtained by splitting the unknown variables into latent variables and parameters
Under the independence constraint in the above equation, the variational distribution factorizes and the free energy is written as above
Applying the variational method to γH(H) and γω(ω) respectively, we obtain the above equation as a stationarity condition
Substituting the model likelihood (②) into the above equation (①) and focusing only on the latent variable H dependence, we obtain the above equation
where
The above equation indicates that the posterior distribution of the latent variable is a multinomial distribution
where
On the other hand, substituting the model likelihood (②) into the above equation (①) and focusing only on the parameter ω dependence, we obtain the above equation
where
Continued
The above equation implies that the variational Bayes posterior distribution is independent with respect to the parameters Θ and B, and that it further factorizes over the rows of Θ=(θ1,…,θM)T and over the columns of B=(β1,…,βH)
where
Calculating the expected value using the results so far yields the above equation
Finally, the variational Bayes posterior distribution of the latent Dirichlet allocation model can be expressed as above
CONTINUED
where
where
After setting appropriate initial values, the above updates can be repeated until convergence to obtain a local solution of variational Bayesian learning
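As a hedged sketch, the standard batch mean-field updates for latent Dirichlet allocation can be written as below. The variable names (gamma, lam, phi), the bag-of-words count-matrix input, and the dense responsibility array are my assumptions and may differ from the book's notation; only the overall structure (multinomial responsibilities plus Dirichlet updates for the document-topic and topic-word distributions) follows the derivation above.

```python
import numpy as np
from scipy.special import digamma

def vb_lda(counts, K, alpha=0.1, eta=0.01, n_iter=100, seed=0):
    """Batch mean-field VB for LDA on a bag-of-words count matrix.

    counts: (M, W) word counts per document; K: number of topics.
    Variational posteriors: Dirichlet(gamma[d]) per document over topics,
    Dirichlet(lam[k]) per topic over the vocabulary, and per (document, word)
    multinomial responsibilities over topics.
    """
    rng = np.random.default_rng(seed)
    M, W = counts.shape
    gamma = rng.gamma(100.0, 0.01, size=(M, K))   # document-topic variational parameters
    lam = rng.gamma(100.0, 0.01, size=(K, W))     # topic-word variational parameters

    for _ in range(n_iter):
        # expected log parameters under the current Dirichlet posteriors
        e_log_theta = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))  # (M, K)
        e_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))       # (K, W)

        # responsibilities phi[d, w, k] proportional to exp(E[log theta_dk] + E[log beta_kw]);
        # stored densely as (M, W, K), which is fine for a small sketch
        log_phi = e_log_theta[:, None, :] + e_log_beta.T[None, :, :]
        log_phi -= log_phi.max(axis=2, keepdims=True)
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=2, keepdims=True)

        # Dirichlet parameter updates weighted by the word counts
        gamma = alpha + np.einsum('mw,mwk->mk', counts, phi)
        lam = eta + np.einsum('mw,mwk->kw', counts, phi)

    return gamma, lam
```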
6.10.2 Free energy as a function of the variational parameter
Express the free energy as a function of the variational parameters {{z(n,m)}n=1N}m=1M, Θ, B in the above equation
6.10.3 Derivation of the empirical variational Bayesian learning algorithm
Partial differentiation of the free energy with respect to the hyperparameters k=(α,η) yields the above equation
Continued
δn,n′ is the Kronecker delta
Update the hyperparameters by the Newton-Raphson method (above equation) based on these
∂F/∂x is the gradient with respect to x
∂2F/∂x∂x’ is the Hessian with respect to x
max(·) acts on each component of the vector
In other words
After setting appropriate initial values, the above updates can be repeated until convergence to obtain a local solution of empirical variational Bayesian learning.
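A schematic sketch of the multivariate Newton-Raphson step just described: the gradient and Hessian are assumed to be computed elsewhere (in the text they involve digamma/polygamma terms), and the component-wise max keeps the hyperparameters positive, as noted above.

```python
import numpy as np

def newton_step(x, grad, hess, floor=1e-8):
    """One multivariate Newton-Raphson step with a component-wise floor.

    x: current hyperparameter vector; grad = dF/dx; hess = d2F/dx dx'.
    np.maximum(., floor) applies the max(.) operation to each component.
    """
    step = np.linalg.solve(hess, grad)      # hess^{-1} @ grad
    return np.maximum(x - step, floor)
```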
Algorithm: Empirical variational Bayesian learning of a latent Dirichlet allocation model
Note: Stochastic models where conditional conjugacy is not available
Chapter 7: Properties of Variational Bayesian Learning
Introduction
Variational Bayesian learning is an approximation method, so there is no guarantee that it inherits all the characteristics of Bayesian learning.
Experimental results have confirmed its usefulness in terms of model selection ability and resistance to overfitting.
Here we discuss the theoretical results that support this experimental success
7.1 Non-asymptotic and Asymptotic Theories
Introduction
Variational Bayesian learning allows the computation of expected values by assuming only independence among parameters, without otherwise restricting the functional form of the posterior distribution.
In practice, however, the variables assumed to be independent are often parameters that are inherently strongly correlated.
The independence constraint results in a posterior distribution with a basic functional form such as a Gaussian or Dirichlet distribution.
Theoretical results on variational Bayesian learning
A non-asymptotic theory applicable only to matrix factorization models without missing values and similar bilinear models, valid for a finite number of observations
Discovery of the global solution
Phase transition phenomena induced by sparsity
Comparison of behavior with Bayesian learning and theoretical guarantees of model selection (hyperparameter estimation) performance
A theory that evaluates the behavior of the variational free energy in the large-sample limit, applicable to many stochastic models, especially those with latent variables
Evaluation of the approximation error to the Bayesian posterior distribution
Phase transition phenomenon of variational Bayesian solutions with respect to the hyperparameters
7.2 Non-asymptotic Theory of Variational Bayesian Learning in Matrix Decomposition Models
Introduction.
Summarize variational Bayesian learning of a matrix factorization model without missing values
Consider the above equation as the model distribution for the observation matrix V ∈ ℝL×M and the prior distributions for the unknown parameters A ∈ ℝM×H and B ∈ ℝL×H
where
7.2.1 Variational Bayesian Global Solution
Suppose the singular value decomposition as described in “Overview of Singular Value Decomposition (SVD) and examples of algorithms and implementations” of the observation matrix is given by the above equation
γh ≥ 0 is the h-th largest singular value of V
ωah∈ℝM and ωbh∈ℝL are the corresponding right and left singular vectors
Corollary: The posterior covariances ΣA and ΣB of the global solution are diagonal
Corollary.
Rewriting the free energy using the new variational parameters {ah, bh, σ2ah,σ2bh}h=1H, we obtain the above equation
where
All dependence of the free energy on the variational parameters is contained in the fourth term.
Each Fh depends only on the hth component
Each Fh can be minimized independently with respect to the four variables {ah, bh, σ2ah,σ2bh}.
The stationarity conditions become the above equation
These are easily transformed into a system of simultaneous polynomial equations
Theorem.
The variational Bayesian solution is a reduced (shrunken) singular value decomposition
Its estimated singular value γhVB is 0 when the observed singular value γh is below a threshold, and a shrunken estimate when γh exceeds it
Because of this thresholding, the variational Bayesian solution is sparse in terms of singular components
The variational Bayesian posterior distribution is completely described by the above theorem
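To make the truncation/shrinkage structure of the theorem concrete, here is a schematic sketch in which threshold_fn and shrink_fn stand in for the book's analytic expressions (not reproduced here; they depend on L, M, the noise variance, and the prior variances). Only the overall "prune below a threshold, shrink above it" structure is taken from the text.

```python
import numpy as np

def vb_truncated_svd(V, threshold_fn, shrink_fn, H):
    """Illustrative form of the VB solution: a truncated, shrunken SVD of V.

    threshold_fn(gamma_h) -> bool and shrink_fn(gamma_h) -> float are placeholders
    for the analytic threshold and shrinkage expressions of the theorem.
    """
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    s_vb = np.zeros_like(s)
    for h in range(min(H, len(s))):
        if not threshold_fn(s[h]):          # below the threshold: component pruned
            continue
        s_vb[h] = shrink_fn(s[h])           # above the threshold: shrunken estimate
    # posterior-mean reconstruction built from the retained components
    return (U * s_vb) @ Vt
```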
7.2.2 Behavior of the posterior distribution
To illustrate the behavior of the Bayesian and variational Bayesian posterior distributions, consider the case L=M=H=σ2=1
Equation
Bayesian posterior distribution is as above
The shape can be illustrated by setting an appropriate proportionality constant
Specific figure for the case where the prior distribution is nearly flat (ca2=cb2=10000)
(Unstandardized) Bayesian posterior distribution
Variational Bayesian posterior distribution
Figure Explanation
When V=0 (left)
The Bayesian posterior distribution spreads symmetrically along the axes and peaks at the origin
The variational Bayes posterior distribution that approximates this is a Gaussian distribution with a peak (variational Bayes estimator) at the origin
When V=1 (middle)
The Bayesian posterior distribution has one peak each in the first (A,B>0) and third (A,B<0) quadrants
The variational Bayesian posterior distribution, with independence imposed between A and B, cannot spread across both the first and third quadrants, so it remains a Gaussian distribution peaked at the origin
When V=2 (right)
The peaks of the Bayesian posterior distribution move away from each other and the probability value at the origin becomes smaller
The variational Bayesian posterior distribution moves away from the origin and approximates one of the two peaks
When ca·cb → ∞, the threshold becomes γVB = 1
7.2.3 Empirical variational Bayesian global solution
Consider the case of empirical variational Bayesian learning for the hyperparameters k=(Ca,Cb,σ2)
Solve the minimization problem estimating only CA and CB for a given noise variance σ2
The stationarity conditions are obtained by partial differentiation with respect to cah2 and cbh2
Theorem 7.3.
Theorem 7.4.
7.2.4 Analysis of model selection performance
Supplement 7.3
Algorithm: Global Empirical Variational Bayesian Learning Algorithm for Matrix Decomposition Models
Theorem 7.5.
Numerical Experimental Results
7.3 Asymptotic Theory of Variational Bayesian Learning in Mixed Gaussian Distribution Models
Introduction.
Asymptotic form of free energy
When all K components are used
Case with only K* components
7.4 Other theoretical results
Non-asymptotic theory via the matrix decomposition model and asymptotic theory via the Gaussian distribution model
By applying the global analytical solution of the matrix factorization model directly to the asymptotic theory
Variational Bayesian learning tends to suppress overfitting more strongly than Bayesian learning
Similar stochastic models
Derivation of approximate global solutions
Sparse additive matrix factorization model
Efficient local search algorithm for matrix factorization models with missing values
Which of the unknown parameters and hyperparameters should be learned in a Bayesian manner, and which should be point-estimated (by empirical Bayesian learning)?
The paper in which the matrix factorization model was first proposed as probabilistic principal component analysis
Partial Bayesian learning
Application of asymptotic analysis method for free energy
Hidden Markov Models
Mixed exponential distribution family
Bayesian Networks
Latent Dirichlet allocation model
Marginalize some of the parameters before performing variational Bayesian learning
Comment