Machine Learning with Bayesian Inference and Graphical Model

Machine learning using Bayesian inference is a statistical learning method that uses Bayes’ theorem, a fundamental rule of probability, to compute the posterior probability distribution of unknown variables given observed data. From the obtained posterior distribution, it then derives estimators for the unknown variables and predictive distributions for new data to be observed in the future.
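As a minimal sketch of this posterior-then-prediction flow, the following example uses a coin-flip (Bernoulli) model with a Beta prior, where the posterior is available in closed form. The data, the prior parameters `a0` and `b0`, and the function name are illustrative assumptions, not taken from the original article.

```python
import numpy as np

def beta_bernoulli_posterior(data, a0=1.0, b0=1.0):
    """Return the Beta posterior parameters (a, b) after observing 0/1 data.

    Assumes a Beta(a0, b0) prior on the success probability theta.
    """
    data = np.asarray(data)
    successes = int(data.sum())
    failures = data.size - successes
    return a0 + successes, b0 + failures

# Hypothetical observations: 6 successes and 2 failures.
data = [1, 0, 1, 1, 0, 1, 1, 1]
a, b = beta_bernoulli_posterior(data)

# Estimator for the unknown variable: the posterior mean of theta.
post_mean = a / (a + b)
# The posterior variance quantifies how uncertain that estimate is.
post_var = a * b / ((a + b) ** 2 * (a + b + 1))
# Predictive probability that the next observation is 1.
pred_next = post_mean

print(a, b, post_mean, post_var, pred_next)
```

The same posterior object yields both a point estimate and a measure of its uncertainty, which is the property the rest of this article relies on.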

The Bayesian statistics used here is based on the idea that not only the data but also the factors behind the data are generated probabilistically. This can be understood through a kind of meta-probability, using the dice example: a machine that manufactures dice (each of which generates data with certain probabilities) itself produces dice whose probabilities fluctuate from one die to the next.

In contrast to maximum likelihood estimation and maximum a posteriori (MAP) estimation, which are the basis of much of general machine learning, Bayesian learning has the following characteristics.

Information about the accuracy of the estimates of unknown variables is obtained naturally
It tends to be resistant to overfitting
All unknown variables can be estimated from observed data in a single framework.
This allows for automatic selection of model degrees of freedom and hyperparameter estimation.

In this Bayesian learning process, learning is performed in two main steps. In the first step, the relationship between observed data and unobserved variables is described as a probability distribution by combining various discrete distributions and Gaussian distributions. In the second step, the conditional distribution (posterior distribution) of the unobserved variables is obtained analytically or approximately based on the model constructed in the first step.

In order to describe the probability distribution in Step 1, it is necessary to consider a probability model. The basic concept of a probability model is that uncertain events (random variables) are connected by edges representing their relationships to form a graph, and a probability model described using such a graph is called a graphical model. There are two major types of graphical models: Bayesian networks (directed) and Markov random fields (undirected). Simply put, the former expresses probabilistic causal relationships and the latter expresses probabilistic dependency relationships.

Next, in order to obtain the conditional distribution (posterior distribution) of the unobserved variables in Step 2, either analytically or approximately, it is necessary to compute expectations with respect to the unknown variables. This calculation cannot be performed analytically except in special cases, and numerical calculation becomes difficult when the unknown variables are high-dimensional. Markov chain Monte Carlo (MCMC) and variational Bayesian learning are approximation methods for carrying out this computation.

The details of machine learning using Bayesian inference are described below.

Implementation

Uncertainty and Machine Learning Technology

Uncertainty refers to a state in which future events or outcomes are difficult to predict because of the limitations of our knowledge or information, that is, a state in which complete information or certainty is not available. Mathematical methods and models, such as probability theory and statistics, are used to deal with uncertainty. These methods are important tools for quantifying uncertainty and minimizing risk.

This section describes probability theory and various implementations for handling this uncertainty.

Overview of Bayesian Inference and Various Implementations

Bayesian inference is a method of statistical inference based on a probabilistic framework and is a machine learning technique for dealing with uncertainty. The objective of Bayesian inference is to estimate the probability distribution of unknown parameters by combining data and prior knowledge (prior distribution). This article provides an overview of Bayesian inference, its applications, and various implementations.

Bayesian Network Inference Algorithms

Bayesian network inference is the process of finding the posterior distribution based on Bayes’ theorem, and there are several types of major inference algorithms. The following is a description of typical Bayesian network inference algorithms.

Overview of Bayesian Multivariate Statistical Modeling and Examples of Algorithms and Implementations

Bayesian multivariate statistical modeling is a method of simultaneously modeling multiple variables (multivariates) using a Bayesian statistical framework, which allows the method to capture the probabilistic structure and account for uncertainty with respect to the observed data. Multivariate statistical modeling is used to address issues such as data correlation, covariance structure, and outlier detection.

Overview and Implementation of Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo (MCMC) is a statistical method for sampling from probability distributions and performing integration calculations. The MCMC is a combination of a Markov Chain and a Monte Carlo method. This section describes various algorithms, applications, and implementations of MCMC.
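To illustrate how a Markov chain and Monte Carlo averaging fit together, here is a minimal random-walk Metropolis sampler. It is a sketch under assumptions: the target (a standard normal, given only through its unnormalized log density), the step size, and the burn-in length are all chosen for illustration.

```python
import numpy as np

def metropolis(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_p is the log of an unnormalized target density.
    """
    rng = np.random.default_rng(seed)
    x = x0
    lp = log_p(x)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        # Markov chain part: propose a move that depends only on the current state.
        proposal = x + step * rng.normal()
        lp_prop = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = proposal, lp_prop
        samples[i] = x
    return samples

def log_std_normal(x):
    """Unnormalized log density of a standard normal (illustrative target)."""
    return -0.5 * x ** 2

samples = metropolis(log_std_normal, x0=0.0, n_samples=20000)
kept = samples[2000:]                 # discard an assumed burn-in period
# Monte Carlo part: approximate the integral E[x^2] by a sample average.
second_moment = np.mean(kept ** 2)
print(kept.mean(), second_moment)
```

For this target the exact values are a mean of 0 and a second moment of 1, so the printed estimates give a rough check of the sampler.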

Overview of NUTS and Examples of Algorithms and Implementations

NUTS (No-U-Turn Sampler) is a type of Hamiltonian Monte Carlo (HMC) method, which is an efficient algorithm for sampling from probability distributions, as described in “MCMC Method for Stochastic Integral Calculations: Algorithms other than Metropolis Method (HMC Method)”. HMC is based on the Hamiltonian dynamics of physics and is a type of Markov chain Monte Carlo method. NUTS improves on the HMC method by automatically deciding how long each simulated trajectory should run (the number of leapfrog steps), stopping when the trajectory starts to turn back on itself. In practice it is usually combined with automatic step-size adaptation during warm-up, so that little manual tuning is needed.

EM Algorithm and Examples of Various Application Implementations

The EM algorithm (Expectation-Maximization Algorithm) is an iterative optimization algorithm widely used in statistical estimation and machine learning. In particular, it is often used for parameter estimation of stochastic models with latent variables.

Here, we provide an overview of the EM algorithm, the flow of applying it to mixture models, HMMs, missing value estimation, and rating prediction, and an example implementation in Python.
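As a small illustration of the E-step and M-step for a mixture model, the following sketch fits a two-component one-dimensional Gaussian mixture. The synthetic data, the initialization at the data minimum and maximum, and the fixed iteration count are assumptions made for this example.

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1D Gaussian mixture (illustrative sketch)."""
    # Crude initialization: place the means at the extremes of the data.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. posterior of the latent component.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
            / np.sqrt(2.0 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted data.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / x.size
    return pi, mu, var

# Synthetic data: 30% from N(-2, 0.5^2) and 70% from N(3, 1^2).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
pi, mu, var = em_gmm_1d(x)
print(pi, mu, np.sqrt(var))
```

A production implementation would monitor the log-likelihood for convergence and guard against collapsing variances; both are omitted here for brevity.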

Solving Constraint Satisfaction Problems Using the EM Algorithm

The EM (Expectation Maximization) algorithm can also be used as a method for solving constraint satisfaction problems. This approach is particularly useful when information is incomplete, for example when data are missing. This article describes various applications of the EM algorithm to constraint satisfaction problems and their implementation in Python.

Overview of Variational Bayesian Learning and Various Implementations

Variational methods are used to find the optimal solution over a space of functions or probability distributions, and are among the optimization methods widely used in machine learning and statistics. In particular, they play an important role in machine learning models such as stochastic generative models and variational autoencoders (VAE).

Variational Bayesian Inference is one of the probabilistic modeling methods in Bayesian statistics, and is used when the posterior distribution is difficult to obtain analytically or computationally expensive.

This section provides an overview of the various algorithms for variational Bayesian learning and their Python implementations in topic models, Bayesian regression, mixture models, and Bayesian neural networks.

Black-Box Variational Inference (BBVI) Overview, Algorithm, and Implementation Examples

Black-Box Variational Inference (BBVI) is a type of variational inference method for approximating the posterior distribution of complex probabilistic models in probabilistic programming and Bayesian statistical modeling. BBVI is called “Black-Box” because the probability model to be inferred is treated as a black box: the method can be applied without knowing the internal structure of the model or the form of the likelihood function.

Overview of Hidden Markov Models (HMMs) and various applications and implementations

A hidden Markov model (HMM) is a type of probabilistic model used to represent the process that generates a sequence of observations, and is widely used for modeling sequential data, time series data in particular. The hidden state represents the latent state behind the sequence, which is not directly observed, while the observations are the directly observable data generated from the hidden state.

This section describes various algorithms and practical examples of HMMs, as well as a concrete implementation in Python.
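As one concrete example of an HMM algorithm, the sketch below computes the log-likelihood of an observation sequence with the forward algorithm and checks it against brute-force enumeration of all hidden state sequences. The two-state model and all its probabilities are made-up values for illustration.

```python
import numpy as np
from itertools import product

def forward_loglik(obs, init, trans, emit):
    """Log-likelihood of obs under a discrete HMM via the scaled forward pass.

    trans[i, j] = p(z_t = j | z_{t-1} = i), emit[k, o] = p(x_t = o | z_t = k).
    """
    alpha = init * emit[:, obs[0]]
    log_z = 0.0
    for o in obs[1:]:
        # Rescale to avoid underflow on long sequences.
        c = alpha.sum()
        log_z += np.log(c)
        alpha = alpha / c
        # Propagate one step through the transition, then weight by emission.
        alpha = (alpha @ trans) * emit[:, o]
    return log_z + np.log(alpha.sum())

def brute_force_loglik(obs, init, trans, emit):
    """Same quantity by summing over every hidden state sequence."""
    total = 0.0
    for states in product(range(len(init)), repeat=len(obs)):
        p = init[states[0]] * emit[states[0], obs[0]]
        for t in range(1, len(obs)):
            p *= trans[states[t - 1], states[t]] * emit[states[t], obs[t]]
        total += p
    return np.log(total)

# Hypothetical model: 2 hidden states, 3 observation symbols.
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]

ll_forward = forward_loglik(obs, init, trans, emit)
ll_brute = brute_force_loglik(obs, init, trans, emit)
print(ll_forward, ll_brute)
```

The forward pass costs time linear in the sequence length, whereas the brute-force sum grows exponentially, which is why it is only usable as a check on tiny examples.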

Overview of the Gelman-Rubin Statistic and Related Algorithms and Examples of Implementations

The Gelman-Rubin statistic (or Gelman-Rubin diagnostic) is a statistical method for diagnosing the convergence of Markov chain Monte Carlo (MCMC) sampling. It is used when MCMC sampling is run with multiple chains, to evaluate whether the chains are sampling from the same distribution. This technique is often used in the context of Bayesian statistics. Specifically, the Gelman-Rubin statistic compares the variability between the chains with the variability within each chain, and the resulting ratio will be close to 1 if convergence has been achieved.
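The ratio described above can be sketched in a few lines. This is the basic (non-split) form of the statistic, applied to synthetic chains; the chain count, chain length, and the artificial offset used to simulate non-convergence are assumptions for illustration.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic Gelman-Rubin statistic for an array of shape (m chains, n draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    # W: average of the within-chain variances.
    w = chains.var(axis=1, ddof=1).mean()
    # B: between-chain variance, scaled by the chain length.
    b = n * chain_means.var(ddof=1)
    # Pooled estimate of the posterior variance.
    var_hat = (n - 1) / n * w + b / n
    return np.sqrt(var_hat / w)

rng = np.random.default_rng(0)
# Four chains drawn from the same distribution (converged case).
good = rng.normal(0.0, 1.0, size=(4, 1000))
# Two of the chains shifted by 3, mimicking chains stuck in different regions.
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])

r_good = gelman_rubin(good)
r_bad = gelman_rubin(bad)
print(r_good, r_bad)
```

Libraries used in practice typically report a refined rank-normalized, split-chain variant of this statistic, so values may differ slightly from this simple form.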

Overview of the Fisher Information Matrix and Related Algorithms and Examples of Implementations

The Fisher information matrix is a concept used in statistics and information theory that measures how much information observed data carry about the parameters of a probability distribution. It is used to assess how accurately the parameters of a statistical model can be estimated. Specifically, it is the expected value of the outer product of the gradient of the log probability density function (or log probability mass function) with respect to the parameters.
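For a one-parameter model the matrix reduces to a scalar, which makes the definition easy to check numerically. The sketch below estimates the Fisher information of a Bernoulli distribution by averaging the squared score over simulated data and compares it with the known closed form 1 / (p (1 - p)). The value of `p` and the sample size are arbitrary choices.

```python
import numpy as np

def bernoulli_score(x, p):
    """Derivative of log p(x | p) with respect to p for a Bernoulli variable."""
    return x / p - (1 - x) / (1 - p)

p = 0.3
rng = np.random.default_rng(0)
x = rng.binomial(1, p, size=200_000)

# Monte Carlo estimate: expected squared score.
fisher_mc = np.mean(bernoulli_score(x, p) ** 2)
# Closed-form Fisher information of the Bernoulli distribution.
fisher_exact = 1.0 / (p * (1.0 - p))
print(fisher_mc, fisher_exact)
```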

Overview of Bayesian Structural Time Series Models and Examples of Applications and Implementations

Bayesian Structural Time Series Model (BSTS) is a type of statistical model that models phenomena that change over time and is used for forecasting and causal inference. This section provides an overview of BSTS and its various applications and implementations.

Overview of Bayesian Deep Learning and Examples of Applications and Implementations

Bayesian deep learning refers to an attempt to incorporate the principles of Bayesian statistics into deep learning. In ordinary deep learning, model parameters are treated as non-probabilistic values, and optimization algorithms are used to find optimal parameters. Treating the parameters instead as random variables with probability distributions, so that the uncertainty of the model can be expressed, is what is called “Bayesian deep learning”. For more information on the application of uncertainty to machine learning, please refer to “Uncertainty and Machine Learning Techniques” and “Overview of Statistical Learning Theory (Non-Equationary Explanation)”.

Overview of Constraint-Based Structural Learning and Examples of Algorithms and Implementations

Constraint-based structural learning is a method of learning models by introducing specific structural constraints in graphical models (e.g., Bayesian networks, Markov random fields, etc.), an approach that allows prior knowledge and domain knowledge to be incorporated into the model.

BIC, BDe, and other score-based structural learning

Score-based structural learning methods such as BIC (Bayesian Information Criterion) and BDe (Bayesian Dirichlet equivalent score) evaluate the goodness of a model by combining the complexity of the statistical model with its goodness of fit to the data, and use this score to select the optimal model structure. These methods are mainly based on Bayesian statistics and are widely used as criteria for model selection.

Bayesian Network Sampling (Sampling)

Bayesian network sampling models the stochastic behavior of unknown variables and parameters by generating random samples from the posterior distribution. Sampling is an important method in Bayesian statistics and probabilistic programming, and is used to estimate the posterior distribution of a Bayesian network and to evaluate uncertainty.

Variational Bayesian Analysis of Dynamic Bayesian Networks

A dynamic Bayesian network (DBN) is a type of Bayesian network for modeling uncertainty that changes over time. The variational Bayesian method is a statistical method for inference of complex probabilistic models, which allows estimating the posterior distribution based on uncertain information.

Overview of Variational Autoencoders (VAE) and Examples of Algorithms and Implementations

Variational Autoencoder (VAE) is a type of generative model and a neural network architecture for learning latent representations of data. The VAE learns latent representations by modeling the probability distribution of the data and sampling from it. An overview of VAE is given below.

Theory and application

This section provides an overview of Bayesian machine learning (Bayesian learning) used for inference in stochastic generative models. Bayesian learning is performed in two steps. Step 1 is to describe the relationship between the observed data D and the unobserved variables X (the joint distribution p(D,X)) by combining various discrete distributions, Gaussian distributions, and other probability distributions. Step 2 is to obtain, analytically or approximately, the conditional distribution of the unobserved variables (the posterior distribution \(\displaystyle p(X|D)=\frac{p(D,X)}{p(D)}\)) based on the model constructed in Step 1. The denominator term p(D) is called the model evidence or marginal likelihood, and represents how likely the data D are to be generated by the model. The conditional distribution p(X|D) can be discrete or continuous.
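The relation between the joint distribution, the evidence, and the posterior can be made concrete with a tiny discrete example. In this sketch X is one of two hypothetical coins (fair or biased) and D is a run of three heads; the prior and the head probabilities are assumed values.

```python
import numpy as np

# Step 1: specify the model p(D, X) = p(D | X) p(X).
prior = np.array([0.5, 0.5])        # p(X) for X in {fair, biased}
p_heads = np.array([0.5, 0.8])      # probability of heads under each X
n_heads, n_tails = 3, 0             # observed data D

likelihood = p_heads ** n_heads * (1.0 - p_heads) ** n_tails   # p(D | X)
joint = likelihood * prior                                     # p(D, X)

# Step 2: obtain the posterior by dividing by the evidence p(D).
evidence = joint.sum()              # p(D) = sum over X of p(D, X)
posterior = joint / evidence        # p(X | D)

print(evidence, posterior)
```

Here the sum defining p(D) has only two terms. In the models discussed in the following sections it becomes a high-dimensional sum or integral, which is what makes approximate inference necessary.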

Machine Learning with Bayesian Inference – Mixture Models, Data Generation Processes and Posterior Distributions

Polynomial regression, which takes advantage of the conjugacy of probability distributions, is a model that can analytically determine the posterior distribution of parameters and predictive distributions for unobserved values. In machine learning, however, data with complex statistical properties, such as those in image and natural language applications, are often the target of analysis, and it becomes necessary to construct complex probabilistic models accordingly. For such models, it is very difficult to calculate the posterior and predictive distributions analytically. In this article, we will discuss a mixture model as an example of such a complex model.

Approximate Computation of Various Models in Machine Learning by Bayesian Inference

Many algorithms for approximate inference have been proposed. Here, we discuss Gibbs sampling and variational inference based on the mean-field approximation, which are relatively simple and widely used. In Bayesian learning, a model representing the data and the corresponding approximate inference method are combined to form a single algorithm as a whole. Since the choice of the optimal approximate inference method depends on the model to be handled, the size of the data, the required computational cost, and the application, it is useful to have multiple methods in one’s arsenal in pursuit of better performance.

An Example of Machine Learning with Bayesian Inference: Inference by Gibbs Sampling of a Poisson Mixture Model

In this article, we introduce the Poisson mixture model for one-dimensional data and describe an algorithm for actually inferring the posterior distribution. The reason for describing the Poisson mixture model is that various techniques (Gibbs sampling, variational inference, and collapsed Gibbs sampling) can be derived relatively easily compared to the Gaussian mixture model. It also serves as an application of the analytical computation of posterior and predictive distributions. In addition, nonnegative matrix factorization is discussed later as an advanced model that exploits the nonnegativity of the Poisson distribution, and the same kind of calculations using the Poisson and gamma distributions appear there as well.
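A minimal sketch of Gibbs sampling for such a model is shown below. It assumes conjugate priors (a gamma prior on each Poisson rate and a Dirichlet prior on the mixing weights) and alternates between sampling the latent assignments and the parameters. The hyperparameters `a`, `b`, `alpha`, the quantile-based initialization, and the synthetic data are illustrative assumptions rather than the settings of the original article.

```python
import numpy as np

def gibbs_poisson_mixture(x, K=2, n_iter=200, a=1.0, b=1.0, alpha=1.0, seed=0):
    """Gibbs sampler for a K-component Poisson mixture (illustrative sketch).

    Priors: lam_k ~ Gamma(a, rate=b), pi ~ Dirichlet(alpha, ..., alpha).
    Returns the trace of the sampled rates and the last assignments.
    """
    rng = np.random.default_rng(seed)
    N = x.size
    # Initialize the rates at spread-out quantiles of the data.
    lam = np.quantile(x, (np.arange(K) + 0.5) / K) + 0.1
    pi = np.full(K, 1.0 / K)
    lam_trace = np.empty((n_iter, K))
    for it in range(n_iter):
        # Sample each latent assignment s_n given lam and pi.
        log_w = x[:, None] * np.log(lam) - lam + np.log(pi)
        log_w -= log_w.max(axis=1, keepdims=True)
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)
        u = rng.uniform(size=(N, 1))
        s = (u < np.cumsum(w, axis=1)).argmax(axis=1)
        # Sample the parameters given the assignments (conjugate updates).
        counts = np.bincount(s, minlength=K)
        sums = np.bincount(s, weights=x, minlength=K)
        lam = rng.gamma(a + sums, 1.0 / (b + counts))
        pi = rng.dirichlet(alpha + counts)
        lam_trace[it] = lam
    return lam_trace, s

# Synthetic one-dimensional count data from two Poisson components.
rng = np.random.default_rng(1)
x = np.concatenate([rng.poisson(3.0, 150), rng.poisson(15.0, 150)])
lam_trace, s = gibbs_poisson_mixture(x)
# Posterior mean of the rates after an assumed burn-in of 100 iterations.
lam_est = np.sort(lam_trace[100:].mean(axis=0))
print(lam_est)
```

Sorting the estimated rates sidesteps the arbitrary labeling of components. With well-separated clusters as here, label switching inside a single chain is unlikely, but it can matter for harder problems.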

An Example of Machine Learning with Bayesian Inference: Variational Inference for Poisson Mixture Models

In this article, we will discuss variational inference algorithms for Poisson mixture models. In order to obtain the update formulas of the variational inference algorithm, it is necessary to assume a factorized approximation of the posterior distribution. Here, we aim to approximate the posterior distribution by factorizing it into a term for the latent variables and a term for the parameters.

An Example of Machine Learning with Bayesian Inference: Inference by Collapsed Gibbs Sampling for Poisson Mixture Models

In this article, we describe an algorithm for collapsed Gibbs sampling for Poisson mixture models. Usually, in collapsed Gibbs sampling for a mixture model, the first step is to remove the parameters from the joint distribution by marginalization. We then compare inference results from Gibbs sampling, variational inference, and collapsed Gibbs sampling.

Learning Example of Bayesian Inference: Gaussian Mixture Model with Gibbs Sampling

In this article, we consider a multidimensional Gaussian distribution whose mean and precision matrix are unknown as the observation model, and derive Gibbs sampling, variational inference, and collapsed Gibbs sampling as algorithms for inferring the posterior distribution. Because of conjugacy, we use the Gauss-Wishart distribution as the prior distribution of the parameters of the multidimensional Gaussian distribution. Compared to the Poisson mixture model described earlier, the calculations are somewhat more complicated to carry out by hand, because the data are multidimensional and the Gauss-Wishart distribution is used. For the Poisson case, see “Machine Learning with Bayesian Inference: Inference by Collapsed Gibbs Sampling of Poisson Mixture Models”.

Bayesian Inference with Variational and Collapsed Gibbs Sampling for Gaussian Mixture Models

Even in variational inference for Gaussian mixture models, a computationally efficient algorithm can be derived by approximating the latent variables and the parameters separately. In this section, we also describe the algorithm of collapsed Gibbs sampling for Gaussian mixture models. Again, we consider a Gaussian mixture model in which all the parameters μ, Λ, and π have been marginalized out. The figure below shows the results of clustering N=200 two-dimensional observations with a Gaussian mixture model with K=3.

Image Feature Extraction and Missing Value Inference with Linear Dimensionality Reduction Model in Bayesian Inference

Linear dimensionality reduction is a basic technique for reducing the amount of data, extracting feature patterns, and summarizing and visualizing data by mapping multidimensional data to a low-dimensional space. In fact, it is known empirically that, for many real data sets, a space of dimension M, which is much smaller than the dimension D of the observed data, is sufficient to represent the main trends of the data, so the idea of dimensionality reduction has been developed and utilized in various application fields, not limited to machine learning.

The methods described here are closely related to techniques called probabilistic principal component analysis, factor analysis, and probabilistic matrix factorization, but we will focus on a model that is simpler than these commonly used methods.

In addition, as a specific application here, we will also conduct simple experiments on image data compression and interpolation of missing values using the linear dimensionality reduction model. The ideas of dimensionality reduction and missing value interpolation are common to models such as nonnegative matrix factorization and tensor decomposition.

Applied Bayesian Inference with Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF), like linear dimensionality reduction, is a method for mapping data to a low-dimensional subspace. As the name suggests, the model assumes non-negativity for the observed data and all of its unobserved variables. Non-negative matrix factorization can be applied to any non-negative data, and can be used to compress and interpolate image data in the same way that linear dimensionality reduction is used.

When handling audio data in the frequency domain using the Fast Fourier Transform, it is often possible to obtain a better representation using a model that can assume non-negativity. Moreover, since many data can be assumed to have non-negative values in recommendation algorithms and natural language processing, a wide range of applications are being attempted. Various probabilistic models have been proposed for nonnegative matrix factorization, but here we construct a model using the Poisson distribution and the gamma distribution.

Model Building and Inference in Bayesian Inference – Overview and Model of Hidden Markov Models

In this article, we will discuss the hidden Markov model (HMM), which is widely used for modeling time series data. Hidden Markov models are very important models that are being applied not only to traditional audio signals and text data, but also to nucleotide sequences and financial transaction data. In the models described so far, the data points X={x1,…,xN} are conditionally independent once the parameter θ is given, that is, \(\displaystyle p(X|\theta)=\prod_{n=1}^{N}p(x_n|\theta)\). Hidden Markov models relax this assumption by introducing dependence between successive points in time.

Construction of Hidden Markov Models and Fully Decomposed Variational Inference in Bayesian Inference

In the hidden Markov model based on the Poisson observation model constructed in “Model Construction and Inference in Bayesian Inference: Overview and Model of Hidden Markov Models”, we derive an approximation algorithm for the posterior distribution by variational inference. Since the hidden Markov model is essentially a mixture model with a time dependence on the latent variables, it is natural to factorize the posterior into parameters and latent variables and perform approximate inference based on the same idea. However, in this model, handling the state sequence jointly is complicated, so for the sake of simplicity, inference is performed here by further factorizing the latent variables across time steps.

Hidden Markov Model Construction and Structured Variational Inference in Bayesian Inference

In the previous article, we derived a variational inference algorithm for hidden Markov models relatively easily by assuming a complete factorization along the time direction. In fact, it is known that for inference of the state sequence it is not necessary to assume such a factorization over time: as in the case of mixture models, an efficient algorithm can be derived by assuming only the factorization into parameters and latent variables.

Overview of Topic Models as Applied Models of Bayesian Inference and Application of Variational Inference

A topic model is a generic term for a generative model for analyzing documents written mainly in natural language. Latent Dirichlet allocation (LDA), a representative topic model, assumes that latent topics (politics, sports, music, etc.) exist behind a document, which is treated as a list of words, and that each word in the document is generated based on one of those topics. By using topics learned from a large amount of document data, it becomes possible to classify and recommend news articles and to retrieve semantically relevant documents from a given word query. In recent years, there have also been cases where LDA has been applied not only to natural language processing but also to image and genetic data.

Inference with Gibbs Sampling in Topic Models as an Applied Model of Bayesian Inference

We now describe collapsed Gibbs sampling for LDA. In the mixture model, we considered a new model in which the parameters were marginalized out of the probabilistic model and then sampled the latent variables one by one; in LDA, the algorithm can be derived using exactly the same procedure.

Tensor Decomposition and Recommendation Techniques as Applied Models of Bayesian Reasoning

In this section, we discuss tensor factorization, which is often used in applications such as recommender systems for items (books, movies, restaurants, etc.). In the field of machine learning, a tensor often simply refers to a multidimensional array such as \(R_{n,m,k}\), and is treated as a multidimensional generalization of a matrix, which is a two-dimensional array. In this section, we first discuss the idea of collaborative filtering when using matrix factorization, and then extend it to the tensor case to derive a recommendation algorithm. The ideas presented here are closely related to the model of transition matrix reduction.

Logistic Regression as an Applied Model of Bayesian Inference

In this section, we discuss logistic regression, a model in which discrete label data y are learned directly from input variables x. In the linear regression model, the Gaussian assumptions allow for the exact computation of the posterior distribution of the parameters and the predictive distribution for new data. Logistic regression differs from linear regression in that it contains a nonlinear variable transformation, which makes such an analytical calculation impossible.

In this section, as a use of variational inference, we describe an approach that approximates the posterior distribution with a Gaussian distribution and optimizes it using gradient information, rather than the mean-field approach based on factorizing the posterior distribution that was used for linear dimensionality reduction and LDA. Exactly the same technique can be used for training neural networks, which will be discussed later.

Neural Networks as Applied Models of Bayesian Inference

Neural networks, like linear and logistic regression, are probabilistic models that directly estimate the predicted value y from the input x. In this section, we describe a continuous-value regression algorithm using a neural network. Unlike linear regression models, the main feature of neural networks is that they can learn from data a nonlinear function for predicting y from x.

As with many of the models described so far, we will treat the neural network completely Bayesian and solve all learning and prediction by probabilistic (approximate) inference. This has the advantage over general neural networks obtained by maximum likelihood or MAP estimation that overfitting can be naturally suppressed and that the degree of uncertainty and confidence in the prediction can be treated quantitatively.

A graphical model is a probabilistic model described using graphs. Probabilistic models introduce “uncertainty” into the events handled in machine learning; some of this uncertainty is inherent in the events themselves, and some is simply caused by a lack of information.

There are two main types of graphical models: Bayesian networks (directed) and Markov random fields (undirected). Simply put, the former expresses probabilistic causality and the latter probabilistic dependence.

First, let us discuss Bayesian networks, taking as an example the inheritance of human blood types. Human blood types are determined by pairs of the genes A, B, and O. In other words, there are six genotypes: AA, AO, BB, BO, AB, and OO. For the sake of simplicity, let us consider a world without B (three genotypes: AA, AO, and OO).
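The simplified world without B can be written down as a small Bayesian network in which the genotypes of the father and mother are the parents of the child's genotype. The sketch below builds the conditional probability table and performs one inference by enumeration. The assumptions that each parent passes on one of its two genes with equal probability, that genotype priors follow random mating, and that the frequency of gene A is 0.3 are introduced here for illustration.

```python
from itertools import product

GENOTYPES = ["AA", "AO", "OO"]

def prob_pass_a(genotype):
    """Probability that a parent with this genotype passes on gene A."""
    return genotype.count("A") / 2.0

def child_given_parents(father, mother):
    """Conditional table p(child genotype | father, mother)."""
    fa, ma = prob_pass_a(father), prob_pass_a(mother)
    return {
        "AA": fa * ma,
        "AO": fa * (1.0 - ma) + (1.0 - fa) * ma,
        "OO": (1.0 - fa) * (1.0 - ma),
    }

def genotype_prior(freq_a):
    """Prior over genotypes, assuming random mating (an assumption)."""
    return {
        "AA": freq_a ** 2,
        "AO": 2.0 * freq_a * (1.0 - freq_a),
        "OO": (1.0 - freq_a) ** 2,
    }

def father_given_child_oo(freq_a=0.3):
    """Posterior p(father genotype | child is OO) by enumeration."""
    prior = genotype_prior(freq_a)
    joint = {f: 0.0 for f in GENOTYPES}
    for f, m in product(GENOTYPES, GENOTYPES):
        # Sum out the mother's genotype for each father genotype.
        joint[f] += prior[f] * prior[m] * child_given_parents(f, m)["OO"]
    z = sum(joint.values())
    return {f: v / z for f, v in joint.items()}

cpt = child_given_parents("AO", "AO")
post = father_given_child_oo()
print(cpt)   # distribution of the child's genotype for two AO parents
print(post)  # an AA father is ruled out by an OO child
```

The directed edges run from parents to child, but the second function reasons in the opposite direction, from an observed child back to an unobserved parent, which is the typical use of a Bayesian network.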

Graphical models – Markov random fields

While a Bayesian network is a model in which one random variable determines another random variable in a directed manner, a Markov random field is a model in which the random variables are mutually related without direction.

The simplest Markov probability model is called the Ising model. This model is used in physics as a model for magnetic materials, and has a lattice of +/- values called spins (+1 for spin pointing up, -1 for spin pointing down), as shown below.
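As a hedged sketch of how such a lattice of spins defines a Markov random field, the following code runs Gibbs sampling on a small Ising model with energy E(s) = -sum over neighboring pairs of s_i s_j. The lattice size, periodic boundary, coupling strength of 1, inverse temperature `beta`, and all-up initial state are assumptions chosen for the example.

```python
import numpy as np

def ising_gibbs(L=8, beta=1.0, n_sweeps=50, seed=0):
    """Gibbs sampling for an L x L Ising model with periodic boundaries."""
    rng = np.random.default_rng(seed)
    s = np.ones((L, L), dtype=int)          # start with all spins up
    for _ in range(n_sweeps):
        for i in range(L):
            for j in range(L):
                # Each spin depends only on its four neighbors
                # (the Markov property of the field).
                nb = (s[(i - 1) % L, j] + s[(i + 1) % L, j]
                      + s[i, (j - 1) % L] + s[i, (j + 1) % L])
                # Conditional probability that this spin is +1.
                p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * nb))
                s[i, j] = 1 if rng.uniform() < p_up else -1
    return s

spins = ising_gibbs(beta=1.0)
magnetization = spins.mean()
print(magnetization)
```

At this large value of `beta` neighboring spins agree strongly and the lattice stays almost fully aligned. Lowering `beta` weakens the coupling and the spins become close to independent.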

Graphical model with factor graph representation

The probability distribution functions of Bayesian networks and Markov random fields can be expressed as products of local functions. Here, we describe the factor graph representation, a graphical representation that expresses this product of functions more directly. In this representation, both Bayesian networks and Markov random fields are represented by hypergraphs. Since hypergraphs are extensions of undirected graphs, the arrow information of Bayesian networks is lost. On the other hand, a factor graph can describe a more detailed structure than the graph representation of a Markov random field.

Probability distribution functions (families) with product representations corresponding to graph structures, such as Bayesian networks, Markov random fields, and factor graph-type models, are collectively called graphical models.

Computing Marginal Probability Distributions – Probability Propagation

In this article, we will discuss the task of computing the marginal probability distributions of a graphical model. This task is also called probabilistic inference. As an approach, we discuss the probability propagation method (belief propagation), which is an efficient computational method when the graph is a tree. We then describe the probability propagation method for simple systems, the generalized probability propagation method, the probability propagation method extended to factor graphs, and finally an example of application to a hidden Markov model.
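For the simplest tree, a chain, message passing can be sketched in a few lines and verified against brute-force enumeration. The potentials below are random positive numbers, and the function names are hypothetical; this is an illustration of the idea rather than the algorithm as presented in the original article.

```python
import numpy as np
from itertools import product

def chain_marginals(unary, pairwise):
    """Marginals of a chain-structured model by forward/backward messages.

    unary[i] has shape (K,); pairwise[i] has shape (K, K) and links
    variable i (rows) to variable i + 1 (columns).
    """
    n, K = len(unary), len(unary[0])
    fwd = [np.ones(K) for _ in range(n)]
    bwd = [np.ones(K) for _ in range(n)]
    for i in range(1, n):
        # Message from variable i - 1 to variable i.
        fwd[i] = pairwise[i - 1].T @ (fwd[i - 1] * unary[i - 1])
    for i in range(n - 2, -1, -1):
        # Message from variable i + 1 to variable i.
        bwd[i] = pairwise[i] @ (bwd[i + 1] * unary[i + 1])
    marg = np.array([unary[i] * fwd[i] * bwd[i] for i in range(n)])
    return marg / marg.sum(axis=1, keepdims=True)

def brute_force_marginals(unary, pairwise):
    """Same marginals by summing over every joint configuration."""
    n, K = len(unary), len(unary[0])
    marg = np.zeros((n, K))
    for xs in product(range(K), repeat=n):
        w = np.prod([unary[i][xs[i]] for i in range(n)])
        w *= np.prod([pairwise[i][xs[i], xs[i + 1]] for i in range(n - 1)])
        for i, x in enumerate(xs):
            marg[i, x] += w
    return marg / marg.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_vars, n_states = 4, 2
unary = [rng.uniform(0.5, 2.0, n_states) for _ in range(n_vars)]
pairwise = [rng.uniform(0.5, 2.0, (n_states, n_states)) for _ in range(n_vars - 1)]

bp = chain_marginals(unary, pairwise)
bf = brute_force_marginals(unary, pairwise)
print(np.abs(bp - bf).max())
```

The message-passing version needs a number of operations linear in the chain length, while the enumeration grows exponentially, which is the efficiency gain referred to above.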

Computing Marginal Probability Distributions – Bethe Approximation

Here, the algorithm of the probability propagation method is applied to the non-tree case to compute approximate marginal probability distributions. This can be understood as a Bethe approximation in terms of variational methods.

In the previous section, we discussed the algorithm of the probability propagation method on trees, which computes the marginal probability distributions efficiently using message propagation. On a graph with cycles, the same algorithm as Algorithm 1 can be applied to perform approximate computation.

Calculation of marginal probability distribution – Kikuchi approximation

In the previous section, we discussed the derivation of the probability propagation method from the Bethe free energy function. In this article, we will derive a generalized probability propagation method from the Kikuchi free energy function, which is a generalization of the Bethe free energy.

The motivation for extending the probability propagation method is that its approximation error becomes large when the graph contains many small cycles. More accurate values can be obtained by computing pseudo-marginal probabilities over slightly wider regions that include these cycles.

For this purpose, the Hasse diagram approach is used to decompose the probability distribution.

Computing Marginal Probability Distributions – Mean Field Approximation

In the previous article, we discussed how an approximation of the marginal probability distribution can be obtained by solving a variational problem. In this article, we will discuss another approximation derived from a variational problem, called the mean-field approximation.

As mentioned earlier, belief propagation on a graph with cycles can be formulated as a variational problem for the Bethe free energy function, which approximates the Gibbs free energy function. The mean-field approximation described here does not approximate the Gibbs free energy function itself; instead, it restricts the set of distributions over which the function is optimized.

In the mean-field approximation, the probability distributions passed as the argument of the Gibbs free energy function are limited to those that factorize into a product of probability distributions over the individual variables.
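
As a rough illustration, the sketch below applies the mean-field approximation to a small pairwise model using the standard coordinate-wise update, in which each factor q_i is set proportional to the exponential of the expected log-potential under the other factors. The function name, the data layout, and the fixed number of sweeps are assumptions made for this example, and all potentials are assumed to be strictly positive so that their logarithms exist.

```python
import math

def mean_field(n, k, unary, edges, sweeps=100):
    """Fully factorized approximation q(x) = q_0(x_0) * ... * q_{n-1}(x_{n-1}).

    unary[i][a]         : potential of x_i = a (assumed positive)
    edges[(i, j)][a][b] : potential of (x_i = a, x_j = b) (assumed positive)
    """
    # For each node keep (neighbour, table, flag); the flag records whether
    # the node itself is the first index of the table.
    nbrs = {i: [] for i in range(n)}
    for (i, j), t in edges.items():
        nbrs[i].append((j, t, True))
        nbrs[j].append((i, t, False))
    q = [[1.0 / k] * k for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            logits = []
            for a in range(k):
                s = math.log(unary[i][a])
                for j, t, i_first in nbrs[i]:
                    for b in range(k):
                        p = t[a][b] if i_first else t[b][a]
                        # expectation of the log-potential under q_j
                        s += q[j][b] * math.log(p)
                logits.append(s)
            m = max(logits)
            w = [math.exp(v - m) for v in logits]  # subtract max for stability
            z = sum(w)
            q[i] = [v / z for v in w]
    return q
```

Each update only changes one factor while the others are held fixed, so the procedure can stop at a local optimum. The quality of the result depends on how well a fully factorized distribution can represent the true one.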

Computation of graphical models without hidden variables

In the previous article, we described the approximate computation of the marginal probability distribution from a variational problem, called the mean-field approximation. In this article, we describe a method for learning the parameters of a graphical model. In particular, we will discuss the case where all variables of the graphical model are observed.

So far, we have described methods for computing probabilities once a specific graphical model is given. In this article, we describe a method for obtaining a graphical model from data.

However, the structure of the underlying graph is assumed to be known. In this case, the problem boils down to learning the functions that appear when the graphical model is factorized. Since these functions are often parameterized in some way, we call such a task “parameter learning” of the graphical model. On the other hand, tasks in which the graph itself is also learned are called structure learning.

In this article, we will discuss how to perform parameter learning when all variables of a graphical model are observed as data. The case in which the values of some vertices on the graph are not observed will be discussed later.
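
For a discrete Bayesian network with a known structure and fully observed data, maximum likelihood parameter learning reduces to counting. The sketch below illustrates this under assumptions of our own: the function name `fit_cpts`, the dictionary-based data format, and the variable names in the usage note are all hypothetical.

```python
from collections import Counter

def fit_cpts(data, parents):
    """Maximum likelihood conditional probability tables from complete data.

    data    : list of dicts mapping variable name -> observed value
    parents : dict mapping variable name -> tuple of parent names
              (this encodes the graph structure, assumed known)
    Returns cpts[var][(parent_values, value)] = P(var = value | parents).
    """
    cpts = {}
    for var, pa in parents.items():
        # count how often each (parent configuration, value) pair occurs
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        # count how often each parent configuration occurs
        context = Counter(tuple(row[p] for p in pa) for row in data)
        cpts[var] = {(ctx, val): c / context[ctx]
                     for (ctx, val), c in joint.items()}
    return cpts
```

For example, with the structure `{"rain": (), "wet": ("rain",)}`, the entry `cpts["wet"][((1,), 1)]` is the fraction of rows with `rain == 1` in which `wet == 1`. Combinations that never appear in the data get no entry, which is one reason smoothing or a Bayesian prior is often added in practice.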

Computation of graphical models with hidden variables

In the previous article, we discussed the computation of a graphical model with no hidden variables. In this article, we will consider a situation in which only some of the variables at the vertices of a graphical model are observed in the data. Such a problem setup is called learning a model with hidden variables (also called latent variables).

The discussion focuses on the variational EM method, and also introduces the wake-sleep algorithm, the MCEM algorithm, the stochastic EM algorithm, Gibbs sampling, the contrastive divergence method, restricted Boltzmann machines, and others.
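
To make the role of the hidden variable concrete, here is a small sketch of the plain EM algorithm, which can be viewed as the special case of variational EM in which the posterior over the hidden variable is computed exactly. The model is a toy example chosen for this illustration and is not taken from the linked article: in each trial one of two biased coins is picked (the hidden variable) and flipped `n_flips` times, and only the number of heads is observed.

```python
import math

def em_two_coins(heads, n_flips, iters=50, theta=(0.4, 0.6), pi=0.5):
    """EM for a mixture of two coins.

    heads   : list with the observed number of heads in each trial
    n_flips : number of flips per trial
    theta   : initial head probabilities of coin 0 and coin 1
    pi      : initial probability of picking coin 0
    Returns the estimates and the log-likelihood history (up to a constant,
    because the binomial coefficients are omitted).
    """
    t0, t1 = theta
    history = []
    for _ in range(iters):
        # E-step: posterior probability that each trial used coin 0
        resp = []
        ll = 0.0
        for h in heads:
            p0 = pi * t0 ** h * (1 - t0) ** (n_flips - h)
            p1 = (1 - pi) * t1 ** h * (1 - t1) ** (n_flips - h)
            resp.append(p0 / (p0 + p1))
            ll += math.log(p0 + p1)
        history.append(ll)
        # M-step: weighted maximum likelihood estimates
        w0 = sum(resp)
        w1 = len(heads) - w0
        t0 = sum(r * h for r, h in zip(resp, heads)) / (w0 * n_flips)
        t1 = sum((1 - r) * h for r, h in zip(resp, heads)) / (w1 * n_flips)
        pi = w0 / len(heads)
    return (t0, t1), pi, history
```

The log-likelihood stored in `history` should never decrease from one iteration to the next, which is a convenient sanity check when implementing EM. The result depends on the initial values, since EM only finds a local optimum.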

Specific examples of graphical models

In the previous article, we discussed parameter learning for graphical models with hidden variables. This time, we will discuss specific examples of graphical models, describing computations for Boltzmann machines, the mean-field approximation, the Bethe approximation, hidden Markov models, Bayesian hidden Markov models, and so on.

Max-Product Algorithm for Calculating MAP Assignments in Graphical Models

In the previous article, we discussed specific examples of graphical models. In this article, we will discuss the computation of MAP assignments for graphical models. The MAP assignment (maximum a posteriori assignment) is the state that maximizes the probability value, and finding it is called MAP estimation. As in the case of probabilistic inference, the MAP assignment can be computed efficiently on a tree.

Approaches include the tree-reweighted (TRW) max-product algorithm, max-product on a factor graph with cycles, max-product on a tree, and MAP estimation by message passing.
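
On a chain, which is the simplest tree, MAP estimation by message passing replaces the sum in the marginal computation with a max and adds back-pointers to recover the maximizing state. The following sketch shows this under assumptions of our own: the function names and the list-based potentials are hypothetical, and the brute-force function is included only as a reference for small examples.

```python
from itertools import product

def assignment_weight(x, unary, pairwise):
    """Unnormalized probability of the joint state x on a chain."""
    w = 1.0
    for i, a in enumerate(x):
        w *= unary[i][a]
    for i in range(len(x) - 1):
        w *= pairwise[i][x[i]][x[i + 1]]
    return w

def chain_map(unary, pairwise):
    """Max-product message passing on a chain; returns a MAP assignment."""
    n, k = len(unary), len(unary[0])
    # score[a]: best weight of any prefix that ends with x_i = a
    score = [unary[0][a] for a in range(k)]
    back = []
    for i in range(1, n):
        new_score, pointers = [], []
        for b in range(k):
            cands = [score[a] * pairwise[i - 1][a][b] for a in range(k)]
            a_best = max(range(k), key=cands.__getitem__)
            new_score.append(cands[a_best] * unary[i][b])
            pointers.append(a_best)
        score = new_score
        back.append(pointers)
    # trace the back-pointers from the best final state
    path = [max(range(k), key=score.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

def brute_force_map(unary, pairwise):
    """Reference: enumerate all k**n states and keep the heaviest one."""
    n, k = len(unary), len(unary[0])
    return list(max(product(range(k), repeat=n),
                    key=lambda x: assignment_weight(x, unary, pairwise)))
```

When several states share the maximum weight, the two functions may return different assignments with the same weight, so comparing weights is safer than comparing the assignments themselves.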

Linear programming and message passing algorithms for MAP estimation of discrete-state graphical models

Continuing from the previous article, this time we will discuss algorithms for MAP estimation. For discrete-state graphical models, the problem of MAP estimation can be formulated as a linear program. From the dual of this linear program, we derive a new message passing algorithm.
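
As a sketch of what such a formulation looks like, consider a pairwise model on a graph with vertex set V and edge set E, with p(x) proportional to the exponential of the sum of the terms θ_i(x_i) and θ_ij(x_i, x_j). This notation is an assumption made for the illustration and may differ from the one used in the linked article. Relaxing the MAP problem to pseudo-marginals μ that are only required to be locally consistent gives the following linear program, whose optimal value is an upper bound on the MAP value and is attained by a MAP assignment when the solution is integral:

```latex
\begin{aligned}
\max_{\mu}\quad
  & \sum_{i \in V} \sum_{x_i} \theta_i(x_i)\,\mu_i(x_i)
    + \sum_{(i,j) \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\,\mu_{ij}(x_i, x_j) \\
\text{s.t.}\quad
  & \sum_{x_i} \mu_i(x_i) = 1 \quad \forall i \in V, \\
  & \sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i), \quad
    \sum_{x_i} \mu_{ij}(x_i, x_j) = \mu_j(x_j) \quad \forall (i,j) \in E, \\
  & \mu_{ij}(x_i, x_j) \ge 0 \quad \forall (i,j) \in E .
\end{aligned}
```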

Specific topics include the max-sum diffusion (MSD) algorithm, generalized MPLP, the MPLP algorithm, the dual of the relaxation problem, dual decomposition, solution by message passing, separation algorithms, cycle inequalities, and the formulation of the MAP estimation problem as a linear program.

Structure Learning of Graphical Models

In this article, we will discuss methods for learning the graph structure itself from data. Specific approaches to learning the graph structures of Bayesian networks and Markov random fields from data include Max-Min Hill Climbing (MMHC), the Chow-Liu algorithm, maximization of a score function, the PC (Peter Spirtes and Clark Glymour) algorithm, the GS (Grow-Shrink) algorithm, the SGS (Spirtes, Glymour, and Scheines) algorithm, sparse regularization, and independence conditions.
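
Of the methods listed above, the Chow-Liu algorithm is compact enough to sketch here: it estimates the mutual information between every pair of variables from the data and keeps the edges of a maximum-weight spanning tree. The code below is a minimal illustration for discrete data given as tuples. The function names are our own, and Kruskal's method with a simple union-find is just one possible way to build the spanning tree.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information between columns i and j of the data."""
    n = len(data)
    ci = Counter(row[i] for row in data)
    cj = Counter(row[j] for row in data)
    cij = Counter((row[i], row[j]) for row in data)
    # sum over observed pairs of p(a, b) * log(p(a, b) / (p(a) * p(b)))
    return sum(c / n * math.log(c * n / (ci[a] * cj[b]))
               for (a, b), c in cij.items())

def chow_liu_tree(data):
    """Edges of a maximum-weight spanning tree under pairwise mutual information."""
    d = len(data[0])
    scored = sorted(((mutual_information(data, i, j), i, j)
                     for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))  # union-find forest for cycle detection

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

The returned edges are undirected. To obtain a Bayesian network, one can pick any vertex as the root and direct the edges away from it.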