Overview of Variational Bayesian Learning and Various Implementations

About Variational Methods in Machine Learning

Variational methods are used to find optimal solutions over functions or probability distributions and are among the most widely used optimization techniques in machine learning and statistics. They play an especially important role in probabilistic generative models and variational autoencoders (VAEs).

Variational methods are particularly useful when the true function or distribution cannot be obtained analytically or when its computation is expensive. Because they can also be applied to models that include stochastic elements, they are widely used in machine learning modeling and inference.

The idea of the variational method is to assume an approximate family of functions or distributions for the object being optimized, and to find the member of that family closest to the true function or distribution by minimizing an objective function that measures the distance between the two.

Concretely, the variational approach parameterizes the approximate function or distribution and optimizes those parameters with ordinary optimization methods (such as the gradient method described in “Overview of Gradient Method and Examples of Algorithms and Implementations” and the Newton method described in “Overview of Newton Method and Examples of Algorithms and Implementations”), adjusting them so that the approximation comes as close as possible to the true function or distribution.

In addition to minimizing the objective function, optimization by the variational method may also impose constraints on the approximate function or distribution (e.g., that it be a valid probability distribution, or that the function be smooth); such constraints ensure that the resulting approximation has the desired properties.
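
As a minimal sketch of this idea (assuming PyTorch for automatic differentiation, and an illustrative two-component Gaussian-mixture target), the code below fits a single Gaussian q(x) to a fixed target p(x) by minimizing a Monte Carlo estimate of KL(q || p) with gradient descent.

import torch

# Target density p(x): a fixed two-component Gaussian mixture (illustrative choice)
weights = torch.tensor([0.3, 0.7])
locs = torch.tensor([-2.0, 3.0])
scales = torch.tensor([0.7, 1.2])

def log_p(x):
    comp_log_prob = torch.distributions.Normal(locs, scales).log_prob(x.unsqueeze(-1))
    return torch.logsumexp(comp_log_prob + weights.log(), dim=-1)

# Variational family: a single Gaussian q(x; mu, sigma) with free parameters
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(2000):
    opt.zero_grad()
    q = torch.distributions.Normal(mu, log_sigma.exp())
    x = q.rsample((256,))                     # reparameterized samples from q
    kl = (q.log_prob(x) - log_p(x)).mean()    # Monte Carlo estimate of KL(q || p)
    kl.backward()
    opt.step()

print("fitted mean:", mu.item(), "fitted std:", log_sigma.exp().item())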

Variational methods have been applied to a variety of machine learning tasks. In variational autoencoders (VAEs), for example, the variational method is used to approximate the posterior distribution in the latent space, and in probabilistic generative models it is used to find an approximate distribution that minimizes a distance to the true distribution, such as the KL divergence or the cross-entropy described in “Overview of Cross-Entropy and Related Algorithms and Implementation Examples”.

Variational Bayesian Learning

Variational Bayesian Inference is one of the probabilistic modeling methods in Bayesian statistics and is used when the posterior distribution is difficult to obtain analytically or is computationally expensive to compute. Variational Bayesian learning also scales to large data sets and can handle high-dimensional parameter spaces.

While the posterior distribution over parameters and models is difficult to calculate directly in ordinary Bayesian statistics, variational Bayesian learning replaces it with an approximate distribution (the variational distribution), thereby approximating it in a form that can be handled analytically.

In variational Bayesian learning, the posterior distribution of the model is approximated through the following procedure (a minimal sketch follows the list).

  1. Set the prior distribution and likelihood function of the model.
  2. Determine the shape of the approximate distribution. Usually, a parameterized family of distributions (e.g., Gaussian or Dirichlet as described in “Overview of Dirichlet distribution and related algorithms and implementation examples“) is chosen.
  3. Define an objective function to optimize the parameters of the approximate distribution. Typically, the objective is set so that the distance between the approximate distribution and the true posterior distribution is minimized, and this distance is usually expressed by an information-theoretic measure such as the KL divergence.
  4. Optimize the parameters of the approximate distribution. Usually gradient-based optimization methods such as gradient descent or the conjugate gradient method are used.
  5. Once the parameters of the approximate distribution have been optimized, the approximate distribution is used to estimate the parameters and predictions.
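
The following is a minimal sketch of these five steps for a toy model, assuming PyTorch: a Gaussian prior and Gaussian likelihood for the mean of one-dimensional data (step 1), a Gaussian variational distribution (step 2), the ELBO as the objective (step 3), gradient-based optimization (step 4), and a comparison of the fitted approximation with the exact conjugate posterior (step 5). All numerical settings are illustrative.

import torch

# Step 1: prior N(0, 10^2) on the mean, Gaussian likelihood with known noise std 1.0
torch.manual_seed(0)
data = torch.randn(50) + 2.0          # observations generated around a true mean of 2
prior = torch.distributions.Normal(0.0, 10.0)
noise_std = 1.0

# Step 2: variational family q(theta) = N(m, s^2) with free parameters m, log s
m = torch.tensor(0.0, requires_grad=True)
log_s = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([m, log_s], lr=0.05)

# Steps 3-4: maximize the ELBO (equivalent to minimizing the KL to the true posterior)
for step in range(2000):
    opt.zero_grad()
    q = torch.distributions.Normal(m, log_s.exp())
    theta = q.rsample((64,))                          # reparameterized samples of the mean
    log_lik = torch.distributions.Normal(theta.unsqueeze(-1), noise_std).log_prob(data).sum(-1)
    elbo = (log_lik + prior.log_prob(theta) - q.log_prob(theta)).mean()
    (-elbo).backward()
    opt.step()

# Step 5: use the fitted q for estimation; compare with the exact conjugate posterior
post_var = 1.0 / (1.0 / 10.0**2 + len(data) / noise_std**2)
post_mean = post_var * data.sum().item() / noise_std**2
print("variational:", m.item(), log_s.exp().item())
print("exact      :", post_mean, post_var**0.5)
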
Application of Variational Bayesian Learning

Variational Bayesian learning is a method for approximating the posterior distribution of a probabilistic model and has been applied in a variety of fields. The following are examples of applications of variational Bayesian learning.

  • Topic Modeling: Variational Bayesian learning is widely used in topic modeling. Topic modeling is a method for extracting topics (themes) from large text data sets, and variational Bayesian learning enables efficient estimation of the topic model's parameters and the posterior distribution of the topics. For more information on topic models, see “Topic Model Theory and Implementation”.
  • Bayesian regression: Variational Bayesian learning has also been applied to Bayesian regression. Bayesian regression combines data and prior distributions to estimate the posterior distribution, which is used to make predictions and quantify uncertainty. Variational Bayesian learning can approximate the posterior distribution in Bayesian regression. For more information on Bayesian regression, see “Machine Learning with Bayesian Inference and Graphical Models”.
  • Mixture Models: Mixture models assume that data are generated from several different probability distributions. Variational Bayesian learning is used to estimate the posterior distribution of the parameters and class assignments of the mixture model, which can then be applied to tasks such as data clustering and anomaly detection. For more information on mixture models, see “Machine Learning with Bayesian Inference and Graphical Models”.
  • Bayesian Neural Networks: Variational Bayesian learning is also applied to training Bayesian Neural Networks (BNNs), which model uncertainty by placing a probabilistic prior on the network weights and biases and estimating their posterior distribution; variational Bayesian learning is used to approximate this posterior distribution. For more information on Bayesian neural networks, see “Bayesian Deep Learning Overview”.

Next, we describe the algorithm used for variational Bayesian learning.

Algorithms used in variational Bayesian learning

Several algorithms are used for variational Bayesian learning. Typical algorithms are described below.

  • Variational Expectation-Maximization (VEM): VEM is the basic algorithm for variational Bayesian learning and is based on the EM algorithm, alternating between updating the variational distribution and estimating the model parameters. In the E step, the posterior distribution of the hidden variables is computed (via the variational distribution) and the objective function, called the Q function, is evaluated; in the M step, the parameters are updated so as to maximize the Q function. This procedure is repeated iteratively to obtain the approximate posterior distribution and the optimal parameter values.
  • Variational Autoencoder (VAE): The VAE is a form of variational Bayesian learning used as a generative model: it is a neural network consisting of an encoder and a decoder that performs variational inference to approximate the posterior distribution over the latent variable space. The VAE uses a technique called the reparameterization trick to make gradient computation possible; during training, latent variables sampled via this trick are used to reconstruct and generate data.
  • Black Box Variational Inference (BBVI): BBVI is a stochastic variational inference method for variational Bayesian learning. Rather than deriving the updates for the posterior approximation analytically, it treats the model as a black box and optimizes the variational parameters using approximate gradient information; BBVI combines Monte Carlo sampling with stochastic gradient estimators (such as the score-function estimator) to estimate the gradient of the objective function, which allows for efficient parameter optimization.
  • Variational Inference for Input (VI for Input): VI for Input is a variational Bayesian learning technique that applies variational inference to the data points themselves. While ordinary variational Bayesian learning estimates the posterior distribution of the model parameters, VI for Input estimates the variational posterior distribution of each data point. This allows for different approximations to be made for individual data points.
  • Variational Restricted Boltzmann Machine (Variational RBM): The variational RBM combines variational inference with the Restricted Boltzmann Machine (RBM). The RBM is a kind of probabilistic energy-based model that captures the coupling between visible and hidden variables, and the variational RBM approximates this model by variational inference.

The details of each algorithm are described below.

Variational Expectation-Maximization (VEM)

Variational Expectation-Maximization (VEM) is an extension of the EM method described in “Examples of Implementations of EM Algorithm and Various Applications”. VEM combines variational inference with the EM algorithm to estimate the parameters of a probabilistic model and its hidden variables.

VEM represents the data with a probabilistic model and uses an approximate distribution (the variational distribution) to estimate the model's parameters; the goal of VEM is to find the parameters that maximize the (marginal) likelihood of the observed data.

The VEM procedure is as follows (a minimal sketch appears after the list):

  1. E Step (Expectation Step): Using the initial parameters, optimize the parameters of the variational distribution. The variational distribution is used to approximate the true data distribution, and in this step, the parameters of the variational distribution are used to compute the posterior distribution of the hidden variables.
  2. M-step (Maximization Step): The posterior distribution of the hidden variable computed in the E-step is used to find parameters that maximize the likelihood of the true data distribution. Usually, the parameters are updated using an optimization method (e.g., gradient method) to maximize the likelihood function.
  3. Iteration of E and M steps: The above steps are repeated until convergence conditions (e.g., small change in parameters) are met.
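
As a minimal sketch of this E/M alternation (assuming NumPy, and a two-component Gaussian mixture with known unit variances as an illustrative model), the E step below computes the posterior responsibilities of the hidden component assignments and the M step re-estimates the means and mixing weights.

import numpy as np

# Toy data from two Gaussian clusters (known unit variance for simplicity)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])

# Initial parameter guesses
means = np.array([-1.0, 1.0])
weights = np.array([0.5, 0.5])

for iteration in range(50):
    # E step: posterior responsibilities of each component for each data point
    log_resp = np.log(weights) - 0.5 * (data[:, None] - means) ** 2
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)

    # M step: update the parameters to maximize the expected complete-data log-likelihood
    nk = resp.sum(axis=0)
    means = (resp * data[:, None]).sum(axis=0) / nk
    weights = nk / len(data)

print("means:", means, "weights:", weights)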

VEM can be viewed as a generalization of the EM algorithm: in the EM algorithm, the exact posterior distribution of the hidden variables is computed in the E step and the model parameters are optimized in the M step; in VEM, the E step instead optimizes the parameters of the variational distribution that approximates this posterior, while the M step still optimizes the model parameters.

VEM is widely used as part of variational Bayesian inference and allows for efficient inference and learning of complex stochastic models by approximating the posterior distribution through variational inference and using the approximated distribution to estimate parameters.

Variational Autoencoder (VAE)

The Variational Autoencoder (VAE) combines variational inference as a generative model with the autoencoder described in “Autoencoder”. The VAE achieves feature representation and data generation by mapping high-dimensional data into a low-dimensional latent space.

The VAE consists of two main parts:

  1. Encoder: The encoder converts the input data into parameters of a probability distribution in the latent space. Typically, given input data, the encoder outputs a probability distribution (usually Gaussian) with mean and variance (or other parameters).
  2. Decoder: A decoder takes samples of the latent space as input and reconstructs the original data. The decoder has the necessary parameters to perform the mapping from the latent space to the original data space.

VAE learns the feature representation of the data by combining the encoder and decoder. By generating samples of the latent space, it is also possible to generate new data.

VAE training is performed from the perspective of maximum likelihood estimation, as described in “Overview of Maximum Likelihood Estimation and Algorithms and Their Implementations”. Specifically, the parameters of the model are adjusted to maximize the log-likelihood of the training data. Since it is difficult to maximize the log-likelihood directly, training is performed using variational inference, in the spirit of the EM algorithm described in “Examples of Implementations of EM Algorithm and Various Applications”.

Variational inference approximates the posterior distribution over the latent space and, using a technique called the reparameterization trick, allows optimization based on stochastic gradient descent.
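
A minimal sketch of this training loop is shown below, assuming PyTorch; the layer sizes (784-dimensional inputs, a 2-dimensional latent space) and the random mini-batch are illustrative placeholders. The key line is the reparameterization step z = mu + sigma * eps, which keeps sampling differentiable, and the loss combines a reconstruction term with the KL divergence to the standard normal prior.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal VAE components (hypothetical sizes: 784-dimensional inputs, 2-D latent space)
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 4))  # outputs [mu, log_var]
decoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 784))

def vae_loss(x):
    stats = encoder(x)
    mu, log_var = stats[:, :2], stats[:, 2:]
    # Reparameterization trick: z = mu + sigma * eps keeps the sampling step differentiable
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    x_recon = decoder(z)
    recon = F.mse_loss(x_recon, x, reduction="sum")                    # reconstruction term
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())     # KL(q(z|x) || N(0, I))
    return recon + kl

# One optimization step on a random mini-batch (placeholder data)
x = torch.rand(32, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss = vae_loss(x)
opt.zero_grad(); loss.backward(); opt.step()
print("ELBO-based loss:", loss.item())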

The VAE is used for a variety of tasks, including dimensionality reduction, anomaly detection, and data generation as a generative model.

Black Box Variational Inference (BBVI)

Black Box Variational Inference (BBVI) is a method used to approximate the posterior distribution of a probabilistic model: it finds an approximate distribution over the model parameters and hidden variables without deriving the posterior distribution analytically.

BBVI treats the model as a “black box”: the only information it requires from the model is the ability to evaluate the (log) joint density of the parameters, hidden variables, and observed data, and it updates the variational parameters using stochastic estimates of the gradient of the objective function.

The procedure for BBVI is as follows:

  1. Initialize the parameters of the variational distribution (e.g., mean and variance).
  2. Sampling: Draw samples of the parameters and hidden variables from the current variational distribution.
  3. Evaluating the objective: Use the samples to evaluate the log joint density of the model and the log density of the variational distribution (the terms of the ELBO).
  4. Computing the gradient: Form a Monte Carlo estimate of the gradient of the objective with respect to the variational parameters, for example with the score-function estimator or the reparameterization trick.
  5. Updating the parameters: Update the variational parameters using this gradient estimate.
  6. Repeat Steps 2 through 5 until convergence.

BBVI uses Monte Carlo sampling and stochastic gradient methods to approximate and optimize the gradient. This makes it possible to obtain approximate distributions efficiently even when the analytical calculations required by standard variational inference are intractable.
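
A minimal sketch of this black-box scheme is given below, assuming PyTorch and an illustrative model consisting of a Gaussian prior and a Gaussian likelihood over toy data; the only model-specific code is the log-joint function, and the gradient of the ELBO is estimated with the score-function (REINFORCE) estimator.

import torch

# Black-box treatment of the model: only the joint log-density is evaluated,
# here a Gaussian prior N(0, 1) and Gaussian likelihood for some toy data
data = torch.tensor([1.8, 2.3, 1.5, 2.9, 2.1])

def log_joint(theta):
    prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta)
    lik = torch.distributions.Normal(theta.unsqueeze(-1), 1.0).log_prob(data).sum(-1)
    return prior + lik

# Variational distribution q(theta) = N(m, s^2)
m = torch.tensor(0.0, requires_grad=True)
log_s = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([m, log_s], lr=0.05)

for step in range(3000):
    opt.zero_grad()
    q = torch.distributions.Normal(m, log_s.exp())
    theta = q.sample((128,))                              # plain sampling, no reparameterization needed
    f = (log_joint(theta) - q.log_prob(theta)).detach()   # per-sample ELBO terms
    # Score-function (REINFORCE) estimator of the ELBO gradient
    surrogate = (q.log_prob(theta) * f).mean()
    (-surrogate).backward()
    opt.step()

print("approximate posterior mean:", m.item(), "std:", log_s.exp().item())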

BBVI is a very flexible method, applicable to a wide variety of stochastic models and Bayesian networks, and useful for large data sets and high-dimensional parameter spaces.

Variational Inference for Input (VI for Input)

Variational Inference for Input (VI for Input) is a type of variational inference that approximates the posterior distribution of data (input). While the goal of ordinary variational inference is to obtain the posterior distribution of the parameters of the model, variational inference for input aims to approximate the posterior distribution of the data itself.

The input variational method combines the model parameters, which act as a generative model for the data, with an approximate distribution that represents the posterior distribution associated with the data. It proceeds in the following steps (a minimal sketch follows the list):

  1. Encoder Design: Design an encoder to approximate the posterior distribution of the data. The encoder takes data as input and outputs parameters of the approximate distribution.
  2. Encoder training: Define an objective function such as likelihood function or KL divergence to optimize the parameters of the encoder. Typically, optimization methods such as stochastic gradient descent are used to update the encoder parameters.
  3. Approximation of the posterior distribution of the data: Use trained encoders to approximate the posterior distribution of the data. Specifically, the parameters of the approximate distribution obtained as the output of the encoder are used to construct the posterior distribution.
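
The sketch below is a heavily simplified, illustrative version of this idea, assuming PyTorch: an encoder network maps each data point of a toy linear-Gaussian generative model to the mean and log-variance of its own approximate posterior over a latent variable, and the encoder is trained by maximizing a Monte Carlo ELBO. All sizes and the generative model are assumptions made for the example.

import torch
import torch.nn as nn

# Illustrative generative model: x = 3 * z + noise, with z ~ N(0, 1) per data point
torch.manual_seed(0)
z_true = torch.randn(200, 1)
x = 3.0 * z_true + 0.3 * torch.randn(200, 1)

# Encoder: maps each data point to the parameters (mean, log-variance)
# of its own approximate posterior q(z | x)
encoder = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 2))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    stats = encoder(x)
    mu, log_var = stats[:, :1], stats[:, 1:]
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)    # reparameterized sample
    log_lik = torch.distributions.Normal(3.0 * z, 0.3).log_prob(x).sum()
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    loss = -(log_lik - kl)          # negative ELBO over the data set
    loss.backward()
    opt.step()

# Each data point now has its own approximate posterior parameters over z
print(encoder(x[:3]).detach())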

The input variational method has been applied to tasks such as feature representation of data and anomaly detection. For example, it can be applied to data such as images or text to estimate the posterior distribution of the data and to perform data generation and sampling.

Input variational methods are sometimes used in combination with methods such as the variational autoencoder (VAE) or the variational RBM (Restricted Boltzmann Machine), which allows integrated variational inference over both the data and the model parameters.

Variational Restricted Boltzmann Machine

The Variational Restricted Boltzmann Machine (Variational RBM) is a generative model that combines variational inference with the Restricted Boltzmann Machine (RBM) described in the “Machine Learning Professional Series: Bayesian Deep Learning Reading Notes” and elsewhere. The RBM is a kind of probabilistic energy-based model that captures the coupling between visible and hidden variables, and the variational RBM approximates it by variational inference.

Variational RBM combines the following elements.

  • Restricted Boltzmann Machine (RBM): The RBM is a stochastic energy-based model that represents the coupling between visible and hidden variables. The visible variables represent the observed data and the hidden variables represent latent factors in the data-generating process; the RBM learns the weight parameters of the coupling between visible and hidden variables for data generation and feature representation.
  • Variational inference: Variational inference is a method for obtaining approximate distributions when the true posterior distribution is difficult to obtain analytically. Variational inference defines a variational distribution to approximate the posterior distribution of the model.

The variational RBM is trained by combining variational inference with the EM algorithm described in “Examples of Implementations of EM Algorithm and Various Applications”, following the steps below:

  1. Initialization of variational distribution: Initialize the parameters of the variational distribution.
  2. E Step (Expectation Step): Calculate the posterior distribution of the hidden variables using the parameters of the current variational distribution.
  3. M Step (Maximization Step): Optimize the RBM weight parameters using the posterior distribution of the hidden variables calculated in the E step. Usually, optimization methods such as gradient descent or contrastive divergence are used.
  4. The E and M steps are repeated until the convergence condition is satisfied.

The variational RBM is used for tasks such as feature representation as a generative model, data generation, and anomaly detection. Approximating the posterior distribution by variational inference improves the RBM's capability for feature representation and data generation.
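
As a rough, illustrative sketch of this procedure (assuming NumPy and a small Bernoulli RBM on toy binary data), the loop below uses the factorized, mean-field probabilities of the hidden units as the E-step-like computation and a single contrastive-divergence-style gradient update as the M-step-like computation; it is a sketch under these assumptions, not a full variational RBM implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = 0.01 * rng.standard_normal((n_visible, n_hidden))   # coupling weights
b = np.zeros(n_visible)                                  # visible biases
c = np.zeros(n_hidden)                                   # hidden biases
data = rng.integers(0, 2, size=(100, n_visible)).astype(float)   # toy binary data
lr = 0.1

for epoch in range(50):
    # "E step": factorized (mean-field) probabilities of the hidden units given the data
    h_prob = sigmoid(data @ W + c)

    # Negative phase: one step of Gibbs sampling (contrastive-divergence-style)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_recon = sigmoid(h_sample @ W.T + b)
    h_recon = sigmoid(v_recon @ W + c)

    # "M step": gradient-style update of the weights and biases
    W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    b += lr * (data - v_recon).mean(axis=0)
    c += lr * (h_prob - h_recon).mean(axis=0)

print("learned weight matrix shape:", W.shape)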

The following tools can be used to implement these algorithms.

About libraries and platforms applicable to variational Bayesian learning

Variational Bayesian learning is an approximate inference method for probabilistic models, which can be implemented in a variety of libraries and platforms. Below we describe some of the most popular libraries and platforms.

  • PyMC3: PyMC3 is a probabilistic programming library written in Python that supports variational inference.
  • Edward: Edward is a probabilistic programming library running on TensorFlow that specializes in variational inference and provides useful functions and model classes for variational inference.
  • Stan: Stan is a library developed as a programming language for Bayesian statistical modeling and fast approximate inference, which also supports variational Bayesian implementation and is written in C++ but can be used from Python.
  • TensorFlow Probability (TFP): TensorFlow Probability is a probabilistic programming library provided as part of TensorFlow that supports a variety of probabilistic modeling and inference methods including variational Bayesian methods.
  • Pyro: Pyro is a probabilistic programming library based on PyTorch that supports variational Bayesian inference; Pyro also provides a high-level API for flexible model description and inference.

These libraries provide a rich set of features and tools to apply to variational Bayesian learning, as well as to develop, run, and visualize models using platforms such as Jupyter Notebook. This allows for flexible and effective experimentation and application of variational Bayesian learning.

Next, we discuss specific implementations using these tools.

Example implementation in python of a topic model using variational Bayesian learning

Several Python libraries can be used to implement topic models with variational Bayesian learning; the following are among the most popular.

  1. Gensim: Gensim is a powerful library for easy topic modeling in Python. It can be used to implement Latent Dirichlet Allocation (LDA), a topic model that Gensim fits with online variational Bayesian inference. Below is an example of a simple topic model implementation using Gensim.
from gensim import corpora
from gensim.models import LdaModel

# Pre-processing of text data and conversion to topic model input format
documents = preprocess_text_data()  # Assumed helper (not shown) returning a list of tokenized documents
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Learning topic models
num_topics = 10
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)

# Display topic distribution
for topic_id in range(num_topics):
    print(f"Topic {topic_id}:")
    words = lda_model.show_topic(topic_id)
    for word, prob in words:
        print(f"{word}: {prob}")
  2. Pyro: Pyro is a probabilistic programming library that supports various Bayesian inference methods, including variational Bayesian inference. The following is a simplified, LDA-style topic model implementation using Pyro.
import torch
import pyro
import pyro.distributions as dist
import pyro.optim
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDiagonalNormal

# Simplified LDA-style model (sketch): `data` is assumed to be a 1-D tensor of
# word indices and `vocab` a list of vocabulary words, both prepared elsewhere
def topic_model(data):
    num_topics = 10
    num_words = len(vocab)

    # Per-topic distributions over the vocabulary
    with pyro.plate("topics", num_topics):
        topic_words = pyro.sample("topic_words", dist.Dirichlet(torch.ones(num_words)))

    # Per-token topic mixture; the word probabilities marginalize over topics
    with pyro.plate("data", len(data)):
        doc_topics = pyro.sample("doc_topics", dist.Dirichlet(torch.ones(num_topics)))
        word_probs = doc_topics @ topic_words
        pyro.sample("obs", dist.Categorical(word_probs), obs=data)

# Perform inference with stochastic variational inference (SVI)
guide = AutoDiagonalNormal(topic_model)
svi = SVI(topic_model, guide, pyro.optim.Adam({"lr": 0.01}), Trace_ELBO())
num_steps = 1000
for step in range(num_steps):
    loss = svi.step(data)

# The learned variational parameters live in Pyro's parameter store
for name, value in pyro.get_param_store().items():
    print(name, tuple(value.shape))
Example implementation in python of Bayesian regression using variational Bayesian learning

The Python library PyMC3 is often used to implement Bayesian regression with variational Bayesian learning. PyMC3 is a powerful library for Bayesian statistical modeling and supports variational inference. Below is an example implementation of Bayesian regression using PyMC3.

import pymc3 as pm
import numpy as np

# Creating Data Sets
np.random.seed(0)
n = 100  # Number of data points
X = np.random.randn(n, 2)  # feature vector
true_beta = np.array([1, 2])  # true regression coefficients
epsilon = 0.5 * np.random.randn(n)  # noise
y = np.dot(X, true_beta) + epsilon  # target variable

# Model Definition
with pm.Model() as model:
    # Data containers so that new inputs can be swapped in later for prediction
    X_data = pm.Data("X", X)
    y_data = pm.Data("y", y)

    # prior distributions
    sigma = pm.HalfCauchy("sigma", beta=10, testval=1.0)  # noise standard deviation
    beta = pm.Normal("beta", mu=0, sd=10, shape=2)  # regression coefficients

    # probabilistic model
    mu = pm.Deterministic("mu", pm.math.dot(X_data, beta))
    likelihood = pm.Normal("likelihood", mu=mu, sd=sigma, observed=y_data)

    # variational inference
    approx = pm.fit(n=10000, method=pm.ADVI())

# Obtaining inference results
trace = approx.sample(draws=5000)

# Display Results
print(pm.summary(trace))

# prediction
X_new = np.array([[1, 2], [3, 4]])  # new data points
with model:
    # Swap in the new inputs (y gets dummy values of the matching shape)
    pm.set_data({"X": X_new, "y": np.zeros(len(X_new))})
    post_pred = pm.sample_posterior_predictive(trace, samples=500)

# Display of forecast results
print("Predictive mean:", post_pred["likelihood"].mean(axis=0))
print("Predictive std:", post_pred["likelihood"].std(axis=0))

The above code defines a Bayesian regression model using PyMC3 and performs variational inference. Half Cauchy and normal distributions are used as prior distributions, the likelihood of the normal distribution is specified for the observed data, and summary statistics of the posterior distribution of the parameters are displayed as inference results. Predictions are also made for new data points, and the predictive distribution is generated from the sampled posterior distribution.

Example implementation in python of a mixture model using variational Bayesian learning

PyMC3 is often used to implement mixture models using variational Bayesian learning. Below is an example of a mixture model implementation using PyMC3 (with the discrete component assignments marginalized out, since ADVI cannot handle discrete latent variables).

import pymc3 as pm
import numpy as np

# Creating Data Sets
np.random.seed(0)
n = 1000  # Number of data points
k = 2  # Number of components
true_weights = np.array([0.4, 0.6])  # True mixing ratio
true_means = np.array([-2, 2])  # true average
true_stds = np.array([0.5, 1])  # True standard deviation
true_assignments = np.random.choice(k, size=n, p=true_weights)
data = np.concatenate([np.random.normal(true_means[i], true_stds[i], size=sum(true_assignments == i)) for i in range(k)])

# Model Definition
with pm.Model() as model:
    # prior distribution
    weights = pm.Dirichlet("weights", a=np.ones(k))  # Mixing Ratio
    means = pm.Normal("means", mu=0, sd=10, shape=k)  # average
    stds = pm.HalfNormal("stds", sd=10, shape=k)  # standard deviation

    # data generating distribution: a normal mixture with the discrete
    # component assignments marginalized out (ADVI cannot handle discrete latents)
    obs = pm.NormalMixture("obs", w=weights, mu=means, sd=stds, observed=data)

    # variational inference
    approx = pm.fit(n=10000, method=pm.ADVI())

# Obtaining inference results
trace = approx.sample(draws=5000)

# Display Results
print(pm.summary(trace))

# Predict component assignments from the posterior-mean parameters via responsibilities
post_w = trace["weights"].mean(axis=0)
post_mu = trace["means"].mean(axis=0)
post_sd = trace["stds"].mean(axis=0)
log_resp = np.log(post_w) - np.log(post_sd) - 0.5 * ((data[:, None] - post_mu) / post_sd) ** 2
z_pred = np.argmax(log_resp, axis=1)
print("Predicted Assignments:", z_pred)

In the above code, PyMC3 is used to define the mixture model and perform variational inference. Prior distributions are set for the mixing ratio, means, and standard deviations; the data-generating distribution is a normal mixture with the discrete component assignments marginalized out (since ADVI cannot handle discrete latent variables directly), and variational inference is used to approximate the posterior distribution of the parameters.

As a result of inference, summary statistics of the posterior distribution of the parameters are displayed, and component assignments are predicted by computing, for each data point, the component with the highest posterior responsibility under the estimated parameters.

Example implementation in python of a Bayesian neural network using variational Bayesian learning

PyMC3 and TensorFlow Probability (TFP) are commonly used to implement Bayesian neural networks (BNN) using variational Bayesian learning. Below are examples of BNN implementations using PyMC3 and TFP.

PyMC3 Case:

import pymc3 as pm
import numpy as np
import theano.tensor as tt

# Creating Data Sets
np.random.seed(0)
n = 100  # Number of data points
X = np.random.randn(n, 1)  # feature vector
y = 2 * X.squeeze() + 1 + 0.5 * np.random.randn(n)  # target variable

# Model Definition
with pm.Model() as model:
    # prior distribution for the observation noise
    sigma = pm.HalfCauchy("sigma", beta=5)

    # neural network layer
    def neural_network(X):
        w1 = pm.Normal("w1", mu=0, sd=1, shape=(1, 10))
        b1 = pm.Normal("b1", mu=0, sd=1, shape=(10,))
        h = tt.nnet.relu(pm.math.dot(X, w1) + b1)

        w2 = pm.Normal("w2", mu=0, sd=1, shape=(10, 1))
        b2 = pm.Normal("b2", mu=0, sd=1, shape=(1,))
        output = pm.math.dot(h, w2) + b2
        return output.flatten()  # flatten the (n, 1) output to match the shape of y

    # Output of Neural Network
    y_pred = neural_network(X)

    # likelihood function
    likelihood = pm.Normal("likelihood", mu=y_pred, sd=sigma, observed=y)

    # variational inference
    approx = pm.fit(n=10000, method=pm.ADVI())

# Obtaining inference results
trace = approx.sample(draws=5000)

# Display Results
print(pm.summary(trace))

TensorFlow Probability (TFP) Case:

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Creating Data Sets
np.random.seed(0)
n = 100  # Number of data points
X = np.random.randn(n, 1).astype(np.float32)  # feature vector
y = (2 * X.squeeze() + 1 + 0.5 * np.random.randn(n)).astype(np.float32)  # target variable

# Model Definition: DenseFlipout layers keep a variational posterior over their
# weights and add the corresponding KL divergence terms to model.losses
model = tf.keras.Sequential([
    tfp.layers.DenseFlipout(10, input_shape=(1,), activation=tf.nn.relu),
    tfp.layers.DenseFlipout(1, activation=None)
])

# Optimizer Definition
optimizer = tf.keras.optimizers.Adam()

# One training step: negative log-likelihood of a Gaussian observation model
# (noise std fixed at 0.5 here) plus the KL terms from the variational layers
@tf.function
def train_step():
    with tf.GradientTape() as tape:
        y_pred = tf.squeeze(model(X), axis=-1)
        neg_log_likelihood = -tf.reduce_mean(tfd.Normal(loc=y_pred, scale=0.5).log_prob(y))
        kl_loss = sum(model.losses) / n
        loss_value = neg_log_likelihood + kl_loss
    gradients = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss_value

# Model Learning
for _ in range(10000):
    loss_value = train_step()

# Display of inference results (the variational parameters of the weight posteriors)
for var in model.trainable_variables:
    print(var.name, var.shape)

The above code implements a BNN using PyMC3 and TFP respectively. In the TFP case, a probabilistic neural network is defined using variational dense layers (DenseFlipout, whose weight posteriors are learned variationally), and the parameters are trained by minimizing a loss consisting of the negative log-likelihood plus the KL divergence terms contributed by those layers.

Reference Information and Reference Books

Detailed information on variational Bayesian learning is provided in “About Variational Bayesian Learning“. Please refer to it as well.

Reference books include “Foundations of Optimization”, “Variational Principles and Methods”, and “Variational Bayesian Learning Theory”.
