Variational Bayesian Learning



Variational Bayesian learning applies the variational method to a probabilistic model in Bayesian estimation to obtain an approximate posterior distribution. It is useful when the posterior distribution is too complex to obtain analytically, or when efficient estimation is needed on large data sets.

Here, we first discuss the variational method. The variational method is a general-purpose technique widely applied in fields such as classical and quantum mechanics, optimal control theory, economics, electrical engineering, optics, and statistics. It is an optimization method that selects, from a set of candidate functions, the function that minimizes or maximizes a given functional.

The basic idea of the variational method is to find the optimal function by applying small variations within a set of candidate functions and identifying the function that satisfies a variational principle. A variational principle is a basic rule for singling out the function that satisfies certain conditions; which conditions apply depends on the specific problem, as with the principle of least action or Hamilton's principle.

The general procedure of the variational method is as follows.

  1. Problem formulation: Formulate the problem mathematically, clarifying the functional to be optimized and the conditions on the candidate functions.
  2. Definition of the variation: Define the variation as the change in the functional when the function to be optimized is perturbed slightly.
  3. Stationarity condition: Derive the equation (the Euler-Lagrange equation) that a function satisfying the variational condition must obey.
  4. Solution of the equation: Solve the derived Euler-Lagrange equation to obtain the optimal function.
  5. Consideration of boundary conditions: If the problem has boundary conditions, take them into account when finding the optimal solution.
  6. Solution verification: Verify that the solution obtained is in fact optimal.
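
As a minimal illustration of this procedure, consider the functional J[y] = ∫₀¹ y′(x)² dx with boundary conditions y(0) = 0 and y(1) = 1. Its Euler-Lagrange equation is y″ = 0, so the optimal function is the straight line y(x) = x. The sketch below recovers this numerically; the discretization, step size, and iteration count are illustrative choices, not part of the procedure itself.

```python
import numpy as np

# Discretize the functional J[y] = integral of y'(x)^2 over [0, 1] with
# boundary conditions y(0) = 0, y(1) = 1.  The Euler-Lagrange equation
# y'' = 0 says the minimizer is the straight line y(x) = x; here we
# recover it by gradient descent on the discretized functional.
n = 51
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
y = np.zeros(n)
y[-1] = 1.0                     # fixed boundary values (step 5 of the procedure)

for _ in range(20000):          # steps 2 to 4: vary y and descend the gradient
    grad = np.zeros(n)
    grad[1:-1] = (2.0 / h) * (2 * y[1:-1] - y[:-2] - y[2:])
    y -= 0.2 * h * grad         # small step keeps the iteration stable

print(np.max(np.abs(y - x)))    # ~0: the optimal function is y(x) = x
```

Gradient descent on the discretized functional plays the role of "solving the Euler-Lagrange equation" numerically; for harder functionals the same scheme applies with a different discretized objective.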

Bayesian inference is a method for probabilistically estimating unknown parameters in statistics and machine learning. When the variational method is applied to Bayesian inference, a variational distribution is defined as a probability distribution chosen from a family of functions, and an optimization is performed with respect to a distance or information measure between this distribution and the posterior distribution obtained from Bayes' theorem.

The basic procedure of the variational method in Bayesian estimation is as follows.

  1. Selecting a prior: Select a prior distribution and determine its parameters.
  2. Observing the data: Compute the likelihood function from the observed data.
  3. Selecting the variational family: Select a family of variational distributions, which serve as the approximate posterior, and parameterize it.
  4. Optimizing the evidence lower bound: Optimize the parameters of the variational distribution to maximize the evidence lower bound (ELBO); maximizing the ELBO is equivalent to minimizing the KL divergence between the variational distribution and the posterior.
  5. Parameter estimation: Obtain parameter estimates from the optimized variational distribution.
  6. Evaluating the results: Evaluate the reliability of the estimation results and make corrections or improvements as necessary.
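
As a concrete sketch of this procedure, consider data from a unit-variance Gaussian with unknown mean μ, a N(0, 1) prior on μ, and a Gaussian variational family q(μ) = N(m, s²). This model is conjugate, so the exact posterior is known, and we can verify that maximizing the ELBO by gradient ascent recovers it. The learning rate, parameterization, and data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)   # observed data; true mean 2, known unit variance
n = x.size

# Model: x_i ~ N(mu, 1), prior mu ~ N(0, 1).
# Variational family: q(mu) = N(m, s2).  Up to a constant, the ELBO is
#   L(m, s2) = -0.5*sum((x - m)**2) - 0.5*n*s2 - 0.5*(m**2 + s2) + 0.5*log(s2)
m, log_s2 = 0.0, 0.0
lr = 0.01
for _ in range(2000):
    s2 = np.exp(log_s2)
    grad_m = x.sum() - (n + 1) * m        # dL/dm
    grad_s2 = -0.5 * (n + 1) + 0.5 / s2   # dL/ds2
    m += lr * grad_m / n                  # scaled step for stability
    log_s2 += lr * grad_s2 * s2           # chain rule: dL/dlog_s2 = s2 * dL/ds2
print(m, np.exp(log_s2))
# By conjugacy the exact posterior is N(sum(x)/(n+1), 1/(n+1)),
# which the optimized variational parameters match.
```

Because the variational family here contains the true posterior, the optimum of the ELBO is exact; in general the variational distribution only approximates the posterior.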

MCMC methods, such as Gibbs sampling, are also commonly used for Bayesian estimation. The differences between the variational method and the MCMC method are as follows.

  • Variational method: Since the variational method approximates the posterior distribution within a restricted family, care must be taken to ensure accuracy and reliability, and it is important to select an appropriate family of variational distributions and optimization method. The solution obtained is only as good as the chosen family allows, so the approximation error must be kept in mind.
  • MCMC method: The MCMC method approximates the posterior distribution by sampling each variable in turn from its conditional distribution in a probabilistic model with multiple variables, so the conditional distributions must be available for sampling. In addition, because it uses Markov chains, convergence diagnostics are required.
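
As a small illustration of the MCMC side, the following Gibbs sampler draws from a bivariate Gaussian with correlation ρ by sampling each variable in turn from its conditional distribution. The distribution, chain length, and burn-in are illustrative choices.

```python
import numpy as np

# Gibbs sampler for a bivariate standard Gaussian with correlation rho:
# each full conditional p(x | y) and p(y | x) is a univariate Gaussian,
# so the two variables can be sampled in turn as described above.
rng = np.random.default_rng(1)
rho = 0.8
n_keep, burn_in = 20000, 1000
x_s, y_s = 0.0, 0.0
samples = []
for t in range(n_keep + burn_in):
    x_s = rng.normal(rho * y_s, np.sqrt(1 - rho ** 2))  # draw from p(x | y)
    y_s = rng.normal(rho * x_s, np.sqrt(1 - rho ** 2))  # draw from p(y | x)
    if t >= burn_in:                                    # discard the burn-in
        samples.append((x_s, y_s))
samples = np.array(samples)
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])  # near (0, 0) and 0.8
```

The empirical mean and correlation of the retained samples approach the true values as the chain runs longer, but a burn-in period and convergence checks are needed, as noted above.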

Since the MCMC method is based on sampling, the computation time increases as the amount of data increases. Therefore, variational Bayesian learning is often used for efficient estimation of large data sets.

The theory and implementation of the variational Bayesian method are described below.

Implementation

Variational methods are used to find optimal solutions over functions or probability distributions, and are among the optimization techniques widely used in machine learning and statistics. In particular, they play an important role in machine learning models such as probabilistic generative models and variational autoencoders (VAE).

Variational Bayesian Inference is one of the probabilistic modeling methods in Bayesian statistics, and is used when the posterior distribution is difficult to obtain analytically or computationally expensive.

This section provides an overview of the various algorithms for variational Bayesian learning and their Python implementations for topic models, Bayesian regression, mixture models, and Bayesian neural networks.

Detailed Technologies

Variational Bayesian learning is one of the approximation methods for computing the posterior distribution, and has a wide range of applications, as it enables expectation calculations by selecting the approximate posterior from a set of functions satisfying certain constraints. The key to deriving a variational Bayesian learning algorithm is to find a property called conditional conjugacy in the given probability model and to design the constraints according to this property.

I will discuss the variational Bayesian algorithm below. First, as a prerequisite, I will describe the joint, marginal, and conditional distributions of probability and Bayes' theorem using them.

As an example, consider a case in which the organizers of a department's year-end party must choose between a restaurant whose signature dish is hamburger and one whose signature dish is fried shrimp. The department's 100 people are surveyed: 10 women and 30 men prefer hamburgers, while 40 women and 20 men prefer fried shrimp. From these counts we can read off the joint, marginal, and conditional probabilities.
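
The joint, marginal, and conditional probabilities implied by these counts, together with Bayes' theorem, can be checked numerically:

```python
import numpy as np

# Joint counts from the survey: rows = (woman, man), columns = (hamburger, fried shrimp)
counts = np.array([[10, 40],
                   [30, 20]])
joint = counts / counts.sum()                    # joint distribution p(gender, dish)
p_dish = joint.sum(axis=0)                       # marginal p(dish)
p_gender = joint.sum(axis=1)                     # marginal p(gender)
p_dish_given_woman = joint[0] / p_gender[0]      # conditional p(dish | woman)
# Bayes' theorem: p(woman | hamburger) = p(hamburger | woman) p(woman) / p(hamburger)
p_woman_given_burger = p_dish_given_woman[0] * p_gender[0] / p_dish[0]
print(p_dish, p_gender, p_woman_given_burger)    # [0.4 0.6] [0.5 0.5] 0.25
```

For example, p(woman | hamburger) = 0.2 × 0.5 / 0.4 = 0.25, which matches the direct count 10/40.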

In the previous article, as an introduction to variational Bayesian estimation and its algorithm, we discussed the joint distribution of probabilities, the marginal distribution, the conditional distribution, and Bayes' theorem. In this article, we discuss variational Bayesian learning using them.

In many practical probability models, Bayesian learning cannot be done analytically based on conjugacy. However, many of them are composed of combinations of underlying probability distributions with conjugacy. Variational Bayesian learning is a method of constraining the posterior distribution based on the partial conjugacy of such probability models and approximating the Bayesian posterior distribution within these constraints.

The discussion proceeds by describing the framework of variational Bayesian learning, the conditional conjugacy that plays an important role in it, specific algorithm design guidelines, the variational Bayesian learning algorithm, and the empirical variational Bayesian learning algorithm.

In this article, we describe the derivation of algorithms for variational Bayesian learning and empirical variational Bayesian learning in a matrix factorization model. Variational Bayesian learning minimizes the free energy under the constraint that the posterior distribution factorizes into independent distributions over A and B. The empirical variational Bayesian learning algorithm is derived by including the hyperparameters κ = (C_A, C_B, σ²) among the variables that minimize the free energy.

When some components of the observation matrix are unobserved, the same approach can be used to derive the variational Bayesian learning algorithm, but the posterior covariances of A and B become somewhat more complicated due to the missing entries. In this article, we discuss the derivation of those algorithms.

Here, variational Bayesian methods are applied to the Gaussian mixture model. For simplicity, we consider the case where all covariances of the mixture components are known and equal to the identity matrix. We also use a symmetric (uniform) Dirichlet prior for α and an isotropic Gaussian distribution with mean 0 as the prior for μk.
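
A minimal one-dimensional sketch of the resulting coordinate-ascent updates is shown below, for two components with known unit variance, a symmetric Dirichlet prior on the mixing weights, and a broad zero-mean Gaussian prior on the means. The data, initialization, and iteration count are illustrative choices.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(2)
# Two well-separated one-dimensional clusters with known unit variance
x = np.concatenate([rng.normal(-4.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
N, K = x.size, 2
alpha0, s0_sq = 1.0, 100.0      # symmetric Dirichlet prior; broad N(0, s0_sq) prior on means

# initial variational parameters: q(mu_k) = N(m_k, s_sq_k), q(pi) = Dir(alpha)
m = np.array([x.min(), x.max()])    # spread out to break symmetry
s_sq = np.ones(K)
alpha = np.full(K, alpha0)

for _ in range(50):
    # update responsibilities q(z) from the current q(pi) and q(mu)
    log_rho = (digamma(alpha) - digamma(alpha.sum())
               - 0.5 * ((x[:, None] - m) ** 2 + s_sq))
    r = np.exp(log_rho - log_rho.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # update q(pi) and q(mu_k) from the responsibilities
    Nk = r.sum(axis=0)
    alpha = alpha0 + Nk
    s_sq = 1.0 / (1.0 / s0_sq + Nk)
    m = s_sq * (r * x[:, None]).sum(axis=0)

print(np.sort(m))    # posterior means of the two components, near (-4, 4)
```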

This section describes the derivation of variational Bayesian learning for the latent Dirichlet allocation model. As in the Gaussian mixture model, we obtain the approximate posterior distribution r(ℋ, ω) over the latent variables \(ℋ=\{\{z^{(n,m)}\}_{n=1}^{N^{(m)}}\}_{m=1}^M\) and the unknown parameters ω = (Θ, B). Conditional conjugacy can be used by splitting the unknown variables into latent variables and parameters.

When looking at a probability distribution p(x|ω), you may feel that the formulas for these typical probability distributions are complicated. Upon closer inspection, however, it is the normalization factor that is complex, while the main body (the part that depends on the random variable x) is surprisingly simple. In Bayesian learning, we need not be bothered by the complexity of the normalization factor.

Rather, the normalizing factor helps with the integral calculations required for Bayesian learning. This is because, by definition, the integral of the body (the part that depends on the random variable) is the reciprocal of the normalizing factor. In fact, most of the integrals required in Bayesian learning can be evaluated through the normalization factor, so cumbersome analytical integration is rarely needed in practice.
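
This can be checked numerically for the Beta distribution, whose body is w^(a−1)(1−w)^(b−1) and whose normalizing factor is 1/B(a, b). The parameter values and grid size below are illustrative choices.

```python
import math
import numpy as np

# Body of the Beta(a, b) density: w^(a-1) (1-w)^(b-1); normalizing factor: 1/B(a, b).
# The integral of the body over [0, 1] equals B(a, b), the reciprocal of the factor.
a, b = 3.0, 5.0
w = np.linspace(0.0, 1.0, 200001)
body = w ** (a - 1) * (1 - w) ** (b - 1)
# trapezoidal rule, written out explicitly
integral = float(np.sum((body[1:] + body[:-1]) * np.diff(w)) / 2)
B = math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
print(integral, B)   # both are B(3, 5) = 2!*4!/7! = 48/5040
```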

Conjugacy refers to a relationship between the model likelihood p(𝒟|ω) and the prior distribution p(ω) that make up the probabilistic model {p(𝒟|ω), p(ω)}: a prior distribution p(ω) such that the posterior distribution p(ω|𝒟) belongs to the same functional family as the prior is called a conjugate prior for the model likelihood p(𝒟|ω).
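
The Bernoulli likelihood with a Beta prior is the standard example: the posterior is again a Beta distribution, so the update touches only the two parameters and no integration is needed. The prior parameters and data below are illustrative.

```python
import numpy as np

# Bernoulli likelihood p(D | w) = w^k (1-w)^(n-k) with a Beta(a, b) prior:
# the posterior is Beta(a + k, b + n - k), the same functional family as
# the prior, which is exactly the conjugacy property defined above.
a, b = 2.0, 2.0                                   # prior Beta(a, b)
data = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # illustrative coin flips
k, n = int(data.sum()), data.size
a_post, b_post = a + k, b + (n - k)               # conjugate update
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, post_mean)                  # 9.0 5.0 0.642857...
```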

In the previous article, we described a method for analytically computing the posterior distribution (including normalizing factors) using the conjugacy of the model likelihood and prior distribution in some basic probability models. In this article, we describe a method for calculating the marginal likelihood, posterior mean, posterior covariance, and predictive distribution from the posterior distribution.

The posterior mean and posterior variance can be easily calculated from the parameters specifying the posterior distribution when the posterior has a well-known form. The predictive distribution can be obtained by calculations similar to those used to derive the posterior distribution, but care must be taken regarding which factors may be omitted. Finally, we also discuss empirical Bayesian learning, which estimates the hyperparameters based on the marginal likelihood.
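
For the Beta-Bernoulli model all of these quantities are available in closed form through the normalizing (Beta) function, and a simple grid search over the prior strength illustrates empirical Bayesian hyperparameter selection. The data, the uniform Beta(1, 1) prior used for the posterior summaries, and the grid are illustrative choices.

```python
import math

def log_beta(p, q):
    # log of the Beta function, i.e. the reciprocal of the Beta normalizing factor
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)

def log_marginal(a, b, k, n):
    # marginal likelihood of k successes in n Bernoulli trials under a
    # Beta(a, b) prior, computed entirely through normalizing constants
    return log_beta(a + k, b + n - k) - log_beta(a, b)

k, n = 7, 10                       # observed successes / trials
a_p, b_p = 1.0 + k, 1.0 + (n - k)  # posterior Beta(1 + k, 1 + n - k) under Beta(1, 1)
post_mean = a_p / (a_p + b_p)      # posterior mean, read off the Beta parameters
post_var = a_p * b_p / ((a_p + b_p) ** 2 * (a_p + b_p + 1))
pred_next = post_mean              # predictive probability that the next trial succeeds
# empirical Bayes: choose the symmetric prior Beta(c, c) maximizing the marginal likelihood
grid = [0.5, 1.0, 2.0, 4.0, 8.0]
best_c = max(grid, key=lambda c: log_marginal(c, c, k, n))
print(post_mean, post_var, pred_next, best_c)
```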

  • Bayesian Learning Framework
  • Examples of Probabilistic Models

In the previous section, we discussed the derivation of the probability propagation method from the Bethe free energy function. In this article, we derive a generalized probability propagation method from the Kikuchi free energy function, which is a generalization of the Bethe free energy.

The motivation for extending the probability propagation method is that when a graph contains many short cycles, the method incurs a large approximation error; more accurate values can then be obtained by computing pseudo-marginal probabilities over slightly larger regions that include these cycles.

For these, the Hasse diagram approach is used to decompose the probability distribution.

In this article, we derive a stochastic gradient algorithm based on the variational Bayesian (VB) method.

The variational Bayesian method reduces Bayesian estimation problems with complex hierarchical structures of unknown hidden variables and parameters to numerical optimization problems. The stochastic gradient method is an algorithm for solving numerical optimization problems by sequentially updating parameters, and is an essential method for efficiently determining the parameters of complex parametric models such as neural networks from huge data sets. It can further improve the computational efficiency of the auxiliary variable method, especially when optimizing the hyperparameters θ in the auxiliary variable method of Gaussian process regression models when the number of data points N is large.
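
The sequential-update idea itself can be illustrated independently of the Gaussian process setting with minibatch stochastic gradient descent on a simple least-squares objective. The model, batch size, and learning rate below are illustrative choices, not the algorithm derived in the article.

```python
import numpy as np

# Minibatch stochastic gradient descent on a least-squares objective: the
# parameters are updated sequentially from small random subsets of the data,
# so the cost per step does not grow with the full data size N.
rng = np.random.default_rng(4)
N = 1000
X = rng.normal(size=(N, 2))
w_true = np.array([1.5, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(2)
lr, batch = 0.05, 32
for _ in range(2000):
    idx = rng.integers(0, N, size=batch)               # random minibatch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w -= lr * grad                                     # sequential update
print(w)   # close to w_true = (1.5, -3.0)
```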

In this article, we discuss variational inference algorithms for Poisson mixture distributions. To obtain the update formulas of the variational inference algorithm, the posterior distribution must be approximated by a factorized form; here, we approximate the posterior by separating it into factors over the latent variables and the parameters.
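
Under this factorization, with a Gamma prior on each Poisson rate and a Dirichlet prior on the mixing weights, the coordinate-ascent updates alternate between the responsibilities and the parameter posteriors. Below is a minimal two-component sketch; the data, priors, and initialization are illustrative choices.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(3)
# Two Poisson components with well-separated rates (illustrative data)
x = np.concatenate([rng.poisson(2.0, 150), rng.poisson(15.0, 150)])
N, K = x.size, 2
a0, b0, alpha0 = 1.0, 1.0, 1.0      # Gamma(a0, b0) prior on rates, Dir(alpha0) on weights

# initialize q(lambda_k) = Gamma(a_k, b_k) at low / high rates to break symmetry
a = np.array([1.0, 1.0 + x.max()])
b = np.ones(K)
alpha = np.full(K, alpha0)

for _ in range(100):
    # update the latent-variable factor q(z): responsibilities r
    e_log_pi = digamma(alpha) - digamma(alpha.sum())
    e_log_lam = digamma(a) - np.log(b)
    e_lam = a / b
    log_rho = e_log_pi + x[:, None] * e_log_lam - e_lam
    r = np.exp(log_rho - log_rho.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # update the parameter factors q(pi) = Dir(alpha), q(lambda_k) = Gamma(a_k, b_k)
    Nk = r.sum(axis=0)
    alpha = alpha0 + Nk
    a = a0 + (r * x[:, None]).sum(axis=0)
    b = b0 + Nk

print(np.sort(a / b))   # posterior mean rates, close to the true (2, 15)
```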
