Summary
Bayesian estimation is a statistical method for interpreting data and learning models from a probabilistic perspective. Machine learning using Bayesian inference is characterized by incorporating prior knowledge and experience into the model and updating that knowledge through data. It is particularly effective when data is scarce or there is uncertainty in the model, and it can suppress overfitting and account for uncertainty in statistical estimation. This section describes Bayesian inference based on “Introduction to Machine Learning with Bayesian Inference” from the Machine Learning Professional Series.
The specific procedures for Bayesian estimation are (1) setting the prior distribution of the model, which represents prior knowledge and experience as a probability distribution, (2) setting the probability that data will be generated according to the model (data likelihood), (3) calculating the posterior distribution of the model after observing data using Bayes’ theorem, (4) estimating parameters and predictions using the posterior distribution, and (5) modifying or improving the model after evaluating the estimated results.
Machine Learning Startup Series – Introduction to Machine Learning with Bayesian Inference Reading Notes
“An unprecedented introduction to the subject, plainly understandable in the shortest possible path! The book explains how to create algorithms by following the basic principles of Bayesian machine learning (Bayesian learning) in a consistent procedure of “building a model → deriving inference”. It is easy to understand everywhere!”
Chapter 1 Machine Learning and Bayesian Learning
1.1 What is Machine Learning?
Definition of Machine Learning 1
Machine learning is a research topic in artificial intelligence
Techniques and methods that attempt to realize the same functionality in a computer as in the natural human capacity to learn.
Definition of Machine Learning 2
Machine learning is the process of extracting rules and structures hidden in data
A generic term for computational techniques for making predictions about unknown phenomena and making decisions based on these predictions.
1.2 Typical tasks of machine learning
1.2.1 Regression
The task of finding a function y = f(x) from data in order to predict a continuous-valued output y ∈ ℝ from an M-dimensional input x ∈ ℝ^M
Linear regression equation: yn = wᵀxn + εn
Learning the parameter vector w
ε: Noise
One-dimensional (M=1) and three-dimensional (M=3) regression models
1D
3-dimensional
Polynomial regression
Transformation of data using polynomials, etc.
1.2.2 Classification
A model in which the output y is limited to a finite number of symbols (rather than continuous values)
2-class classification
Let yn ∈ {0, 1} denote the output value
The probability that yn takes the value 1 is μn ∈ (0, 1)
Representing 𝝁n from M-dimensional input values xn and parameters w
Example of function f: sigmoid function
Form of the sigmoid function
Using the sigmoid function, μn stays within (0, 1) for any real value of wᵀxn
Whether the observed value is yn=0 or yn=1 depends stochastically on the value of 𝝁n
2-class classification by logistic regression
Two-class classification model using sigmoid function
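As a concrete illustration of the two-class model sketched above, here is a minimal NumPy snippet (the weights and inputs are made-up values, not from the book): it computes μn = sigmoid(wᵀxn) and then draws yn stochastically from a Bernoulli distribution with that probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

# Made-up 2-dimensional weights and five input points (illustrative only)
w = np.array([1.5, -0.8])
X = rng.normal(size=(5, 2))

mu = sigmoid(X @ w)             # mu_n = sigmoid(w^T x_n), each in (0, 1)
y = rng.binomial(n=1, p=mu)     # y_n = 0 or 1, drawn stochastically according to mu_n
print(mu.round(3), y)
```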
Multi-class classification
Easily extendable to multiple classes
Instead of a sigmoid function, a “softmax function” with a ∈ ℝK as input can be used to classify into K classes
How to train a multi-class logistic regression algorithm using an optimization algorithm
1.2.3 Clustering
Task to divide given N data into K sets according to some criteria
Example: 200 two-dimensional data points clustered with K=3 using a Gaussian mixture model
Used for modeling data with several different trends
1.2.4 Linear dimensionality reduction
The task of approximating a data matrix Y ∈ ℝ^(D×N) by the product of a matrix W ∈ ℝ^(M×D) and a matrix X ∈ ℝ^(M×N) (Y ≈ WᵀX)
Image of matrix decomposition
Example of mapping Iris data (4D) to 2D and 3D by dimensionality reduction
1.2.5 Other typical tasks
Task to separate a given series of signals based on a statistical model
Recommendation of products on an e-commerce site
Interpolation of missing values in general log data
Natural language processing
Simulation (generation of virtual data)
A task to explore the behavior of a system by manually manipulating some parameters of a learned model
Artificially generate new images from models learned from images
1.3 Two approaches to machine learning
1.3.1 Machine Learning as a Toolbox
Provide data to various existing prediction algorithms and select a good algorithm among them according to some criteria
Pairs of input data and output data (correct labels) are used as training data to learn a function for prediction.
Simplest prediction algorithm
Nearest neighbor
Simply search for the input data closest to the past data and use the label (output value) corresponding to that data as the prediction result
Other Prediction Algorithms
Support vector machine
Boosting
Random forest
In most cases, a process called feature extraction, as shown in the figure above, is placed in the previous step to improve the performance of individual tasks.
Fixed transforms such as polynomial functions and Fourier transforms are used.
Dimensionality reduction is also used.
Characteristics of the Toolbox Approach
Advantages
Machine learning systems can be built without advanced mathematical knowledge and with some programming skills.
Desired results can be achieved with relatively simple tuning
Disadvantages
Requires an understanding of the principles of operation of many algorithms with different construction concepts
Limited accuracy and applicability to a limited number of domains
1.3.2 Machine Learning as Modeling
By building a model (hypothesis) of the data in advance and learning the parameters and structure of the model from the data, some useful prediction or decision can be made.
Typical models and applications
Topic model in natural language
Assume that a document has multiple potential (unobservable) topics or themes
A model in which latent words in a document appear according to the topic.
News articles and web articles are targeted
Time series model or state-space model
Hidden Markov model HMM
Model discrete changes in data
Linear dynamic system
Modeling continuous changes in state
Kalman filter
An inference algorithm derived from the linear dynamical system
A model for engineering “connected” data
Stochastic block model
Discover communities from social network or connection data
Recommend new friends based on the discovered communities
Deep learning model
Partially models hierarchical information processing in the human brain
High accuracy has been demonstrated in speech recognition and image processing tasks
Requires “design” to mathematically extend and combine models according to the problem to be solved and the state to be analyzed
Probabilistic modeling and Bayesian inference approaches
The algorithm development process can be cleanly separated into “model building × inference computation”
Superior accuracy and flexibility compared to toolbox approaches
Can interpolate missing data and can be combined with other models or different types of data
Can successfully represent system uncertainty
Various probability distributions with well-known properties (Gaussian, multinomial, etc.) can be combined like blocks
Disadvantages
Requires some mathematical knowledge
Computation time and memory costs are high
1.4 Basic Probability Calculations
1.4.1 Probability distribution
For an M-dimensional vector x = (x1, …, xM)ᵀ ∈ ℝ^M whose elements are continuous values, consider a function p(x)
Conditions under which p(x) is called a probability density function: p(x) ≥ 0 and ∫ p(x) dx = 1
Example: for M = 1, the Gaussian (normal) distribution
For an M-dimensional vector x = (x1, …, xM)ᵀ whose elements are discrete values
Conditions under which p(x) is called a probability mass function: p(x) ≥ 0 and ∑_x p(x) = 1
Example: the Bernoulli distribution, known for expressing the toss of a coin
1.4.2 Inference of Probability Distributions
Probability distribution p(x,y) for two variables x and y
An operation in which one variable is removed by integration (marginalization)
Resulting probability distribution: marginal distribution
Equation: p(x) = ∫ p(x, y) dy
Conditional distribution: the probability distribution of x when a specific value is given for y in the joint distribution p(x, y), i.e. p(x|y) = p(x, y)/p(y)
Assumptions of Bayes’ Theorem
Bayes’ Theorem
A procedure for working backwards from the probability p(y|x) of obtaining result y from cause x to the probability p(x|y) of cause x given that result y was observed
Eq.: p(x|y) = p(y|x) p(x) / p(y)
Independence in the joint distribution: p(x, y) = p(x) p(y)
The probability distribution of x does not change whether or not y is given as a condition: p(x|y) = p(x)
1.4.3 Red ball and white ball problem
Assumptions.
Probability of choosing a bag
Probability of choosing a p(x=a)=1/2
Probability of choosing b p(x=b)=1/2
Probabilities for the ball that is taken out
Probability that the ball taken out is red: y=r
Probability that the ball taken out is white: y=w
Knowing that bag a was chosen, the conditional probability is
p(y=r|x=a)=2/3
p(y=w|x=a)=1/3
If we know that bag b was chosen, the conditional probability is
p(y=r|x=b)=1/4
p(y=w|x=b)=3/4
Probability that the selected bag is a and that a red ball is produced
p(x=a,y=r)=p(y=r|x=a)p(x=a) =2/3×1/2 =1/3
Probability that the chosen bag is b and that a red ball is produced
p(x=b,y=r)=p(y=r|x=b)p(x=b) =1/4×1/2 =1/8
Probability that the ball removed is a red ball, regardless of the bag chosen
Equation: p(y=r) = ∑_x p(x, y=r)
Sum over all possible values of x
p(y=r)=1/3+1/8=11/24
If the ball taken out is found to be red, what is the probability that the bag chosen is a?
p(x=a|y=r)
p(x=a|y=r)=p(x=a,y=r)/p(y=r)
1/3*24/11=8/11
If the ball taken out turns out to be red, what is the probability that the bag chosen is b?
1-8/11=3/11
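The posterior calculation above can be checked with a few lines of Python (a sketch of the same arithmetic, not code from the book):

```python
# Prior probability of each bag and likelihood of drawing a red ball from it
p_x = {"a": 1/2, "b": 1/2}
p_r_given_x = {"a": 2/3, "b": 1/4}

# Joint probabilities p(x, y=r) and the marginal p(y=r)
joint = {x: p_r_given_x[x] * p_x[x] for x in p_x}   # a: 1/3, b: 1/8
p_r = sum(joint.values())                           # 11/24

# Posterior p(x | y=r) from Bayes' theorem
posterior = {x: joint[x] / p_r for x in joint}
print(posterior)    # a: 8/11 ≈ 0.727, b: 3/11 ≈ 0.273
```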
1.4.4 Multiple observations
Choose one of the bags
A participant draws a ball and puts it back (sampling with replacement)
The result of three participants drawing balls:
{y1, y2, y3} = {r, r, w}
Which bag x chosen is a or b?
p(x|y1=r, y2=r, y3=w)
p(x=a|y1=r, y2=r, y3=w)=256/337
p(x=b|y1=r, y2=r, y3=w)=81/337
When the eight draws {y1, …, y8} contain five red and three white balls
p(x=a|y1, …, y8) = 0.9221
When 29 red balls and 21 white balls are observed in 50 trials
p(x=a|y1, …, y50) = 0.9999
1.4.5 Sequential inference
Important properties that hold for multiple independent data
Suppose we have one observation y1
The posterior distribution is obtained from Bayes’ theorem; ignoring the denominator, p(x|y1) ∝ p(y1|x) p(x)
Observing the data y1 means that the prior distribution p(x) is updated to the posterior distribution p(x|y1)
If two data points {y1, y2} are obtained
The posterior distribution becomes the above equation
If we look at the second line and consider p(x|y1) to be some sort of prior distribution
After observing data y1, we can see that the posterior distribution is updated by further observing data y2
In general, if the observed values Y = {y1, …, yN} are independent, the above equation can be written as
The posterior distribution after receiving the Nth data can be expressed as above using the posterior distribution learned using N-1 data
A learning method can be implemented in which the previously obtained posterior distribution is used as a prior distribution for the next inference.
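A small sketch of this sequential update for the bag example (hypothetical helper names, not from the book): feeding the observations r, r, w one at a time, with each posterior becoming the next prior, reproduces the batch result 256/337 from 1.4.4.

```python
# Likelihood of each colour given the bag, as in 1.4.3
likelihood = {"a": {"r": 2/3, "w": 1/3}, "b": {"r": 1/4, "w": 3/4}}

def update(prior, y):
    # One Bayes update: posterior(x) ∝ p(y|x) * prior(x)
    unnorm = {x: likelihood[x][y] * prior[x] for x in prior}
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}

belief = {"a": 1/2, "b": 1/2}       # start from the prior
for y in ["r", "r", "w"]:           # feed the observations one at a time
    belief = update(belief, y)      # the posterior becomes the next prior
print(belief)                       # a: 256/337 ≈ 0.760, b: 81/337 ≈ 0.240
```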
1.4.6 When parameters are unknown
Issue raised
It is not certain that the organizer of the drawing chooses each bag with probability 1/2
The ratios of red and white balls cannot be known exactly in advance to be 2/3 or 1/4
Are there two types of bags in the first place?
Are there any balls other than red and white ones?
The parameter of the ratio of balls to be extracted is unknown, and we make a guess about the parameter by extracting a few balls.
There is a bag containing multiple red and white balls, and the trial of removing a ball from the bag, checking its color, and returning it to the bag is repeated N times.
Assumptions.
Consider a probability distribution p(𝛉) with respect to an unknown ratio 𝛉 ∈ (0,1) of red and white balls
Obtain knowledge about 𝛉 from the data using only probabilistic reasoning
Let Y = {y1, …, yN} be the observed data
Probability distribution
Each data point yn is generated from a probability distribution p(yn|θ) determined by the value of θ
p(θ) is a prior distribution over the parameters, reflecting some prior knowledge of the possible values of θ
Giving a fixed value such as θ = 2/3 or θ = 1/4 is also one form of prior knowledge
Inference of the likely value of 𝛉 by observing data Y
Prior knowledge p(θ) newly updated through observation of data Y
1.5 Graphical model
Introduction
A notation that uses nodes and arrows to represent the relationships among multiple variables in a probability model
Allows visual representation of various probability models
It is more convenient to consider the independence of variables in a model graphically than to use manual calculations.
Explanation of representation by a directed graph (a graph with arrows) without loop structure, called a DAG (directed acyclic graph)
1.5.1 Directed graphs
Graphical model of the red ball and white ball problem
p(x,y) = p(y|x) p(x)
Complex graphical model
Equation
When there are N variables
Graphical models are useful tools for multi-task learning
Multi-task learning is the idea of improving the prediction accuracy of individual tasks by solving several related tasks simultaneously
1.5.2 Node Conditioning
Head-to-tail model
p(x,y,z)=p(x)p(y|x)p(z|y)
When only y is observed
p(x,z|y)=p(x,y,z)/p(y) =p(x)p(y|x)p(z|y)/p(y) =p(x|y)p(z|y)
Conditional independence
If the value of y is given as an observation, then x and z become conditionally independent of each other
Tail-to-tail model
p(x,y,z)=p(x|y)p(z|y)p(y)
When only y is observed
p(x,z|y)=p(x,y,z)/p(y) =p(x|y)p(z|y)p(y)/p(y) =p(x|y)p(z|y)
Head-to-head model
Unlike the two examples above, it is not possible to decompose into two independent probability distributions
When y is observed, a dependency arises between x and z
p(x,z|y)=p(x,y,z)/p(y) =p(y|x,z)p(x)p(z)/p(y)
1.5.3 Markov blanket
Markov blanket
Co-parent relationship of x: c and d
Useful when considering sampling algorithms
1.6 Bayesian Learning Approach
1.6.1 Model Building and Inference
1 Model construction
Construct a joint distribution p(D, X) for the observed data D and the unobserved unknown variables X
2 Derivation of inference
Find the posterior distribution p(X|D)=p(D,X)/p(D) analytically or approximately
Denominator term p(D): model evidence
Marginal likelihood
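Spelled out, the two quantities referred to here are the standard ones (assuming a continuous unobserved variable X; for discrete X the integral becomes a sum):

```latex
p(X \mid D) = \frac{p(D, X)}{p(D)}, \qquad
p(D) = \int p(D, X)\, dX \quad \text{(model evidence / marginal likelihood)}
```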
1.6.2 Bayesian Inference for Each Task
Linear regression and classification
Model
X = {x1, …, xN}: input values (observed)
Y = {y1, …, yN}: output values (observed)
w: vector of unknown parameters (used to predict yn from each input xn)
Equation
Prediction distribution
Directly models the conditional distribution p(yn|xn,W) given input xn
Clustering
Model
S = {s1, …, sN}: cluster assignments (unobserved: hidden variables)
X: observed data
Θ = {θ1, …, θK}: center position of each cluster
Equation
Linear dimensionality reduction
Model
Similar to the model for clustering except W is a matrix
Equation
1.6.3 Approximation for complex posterior distributions
Cases where the posterior distribution cannot be obtained analytically
When we want the posterior distribution p(X|D) of a probability model p(D, X), for a relatively simple model
the posterior reduces to a well-known probability distribution such as a Gaussian or Bernoulli distribution
A method called sampling is used to “know” an unknown posterior distribution p(X|D) that cannot be calculated analytically
A large number of samples X⁽ⁱ⁾ ~ p(X|D) are drawn on a computer and used to examine the mean and variance of the posterior p(X|D), or the sample points are simply visualized to find trends in the distribution
Sampling methods in the field of Bayesian inference
Gibbs sampling
Hamiltonian Monte Carlo
Sequential Monte Carlo
Another approach is to approximate the marginalization using a simple, easy-to-compute formula
The difficult integrals involved in computing the posterior distribution p(X|D) are replaced by simple functions that can be computed analytically
A method that directly expresses the posterior distribution by proposing a more conveniently handled approximate probability distribution q(X), i.e. p(X|D) ≈ q(X)
Specific Methods
Laplace approximation
Variational inference
Expectation propagation
1.6.4 Decision Making Based on Uncertainty
Probabilistic inference is only a means of quantitatively expressing the uncertainty of the phenomenon under consideration
The result of inference does not by itself constitute any kind of decision making; the two must be distinguished as distinct processes
Example
When the probability of tomorrow’s weather is p(y=sunny)=0.8, p(y=rainy)=0.2
Consider the loss function
LA(y=sunny, x=no umbrella)=0, LA(y=rain, x=no umbrella)=100, LA(y=sunny, x=with umbrella)=10, LA(y=rain, x=with umbrella)=15
Calculate expected value
Without umbrella: 0.8 × 0 + 0.2 × 100 = 20
With umbrella: 0.8 × 10 + 0.2 × 15 = 11
The expected loss is smaller with the umbrella
By using as much information and knowledge as possible, the uncertainty p(y) can be estimated correctly, and by considering a loss function L(y,x) that matches the situation, the action x can be determined more logically.
The expression of uncertainty plays a very important role in the field of machine learning called reinforcement learning.
Bayesian optimization is another application of Bayesian learning that successfully handles uncertainty.
1.6.5 Advantages and Disadvantages of Bayesian Learning
Advantages
1. various problems can be solved consistently
2. can deal quantitatively with the uncertainty of the target
When prediction is not possible, the model can express that it is “unpredictable”
Can calculate expected value
3. incorporate available knowledge naturally
Can clarify assumptions by setting prior distributions
Multiple models can be combined
Dimensionality reduction + clustering
Bernoulli distribution + Gaussian distribution
Bernoulli distribution + categorical distribution
Hidden Markov Models
Deep learning models
4. difficult to over-fit
Optimization algorithms based on maximum likelihood estimation or minimization of an error function can cause overfitting
Occurs when the number of parameters is large relative to the number of data points
Bayesian learning is less prone to overfitting and is therefore effective when the number of data points is small or the dimensionality of the data is large
Disadvantages
1. mathematical knowledge is required
2. computationally expensive
Chapter 2 Basic Probability Distributions
2.1 Expectation
2.1.1 Definition of expected value
The expected value <f(x)>p(x) of a function f(x) under a probability distribution p(x), where x is a vector, is calculated as in the above equation
x is removed by integration, so it is no longer a function of x
Linearity holds for the expected value.
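For reference, the definition and the linearity property referred to above are the standard ones (continuous case shown; sums replace integrals for discrete x):

```latex
\langle f(\mathbf{x}) \rangle_{p(\mathbf{x})} = \int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}, \qquad
\langle a f(\mathbf{x}) + b g(\mathbf{x}) \rangle_{p(\mathbf{x})}
  = a \langle f(\mathbf{x}) \rangle_{p(\mathbf{x})} + b \langle g(\mathbf{x}) \rangle_{p(\mathbf{x})}
```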
2.1.2 Basic expected value
The expected value of x itself for a given probability distribution p(x)
Equation
Expected value multiplied by vector x
Variance
Alternative form of the variance
Expected values for two different variables
Expected value when x and y are independent (can be calculated separately)
If x and y are not independent, use conditional distribution to compute expected values in order from the inside out
Conditional expectation
Integral calculation removes x and becomes a function of y
2.1.3 Entropy
The expected value of −ln p(x) under a probability distribution p(x) is called the entropy
Entropy is a measure of the randomness (disorder) of a probability distribution
Example 1
Discrete distribution such as p(x=1)=1/3, p(x=0)=2/3
Calculation of entropy: −(1/3)ln(1/3) − (2/3)ln(2/3) ≈ 0.64
Example 2
Distribution q(x=1)=q(x=0)=1/2
Calculation of entropy: −(1/2)ln(1/2) − (1/2)ln(1/2) = ln 2 ≈ 0.69
Greater entropy than in Example 1.
Represents the “unpredictability” of a variable arising from a probability distribution
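The two entropy values can be checked numerically (a small sketch; values in nats):

```python
import numpy as np

def entropy(probs):
    # H[p] = -sum_x p(x) ln p(x), measured in nats
    p = np.asarray(probs)
    return float(-np.sum(p * np.log(p)))

print(entropy([1/3, 2/3]))   # Example 1: ≈ 0.637
print(entropy([1/2, 1/2]))   # Example 2: ln 2 ≈ 0.693, larger, i.e. less predictable
```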
2.1.4 KL divergence (Kullback-Leibler divergence)
The expected value of ln q(x) − ln p(x) under q(x), defined for two probability distributions p(x) and q(x), is called the Kullback-Leibler (KL) divergence
A measure of the “distance” between two probability distributions
Note that it does not satisfy the usual axioms of a distance (it is not symmetric, for example)
2.1.5 Approximate computation of expected value by sampling
For a probability distribution p(x) and a function f(x) whose expected value cannot be calculated analytically, samples drawn from p(x) can be used to obtain an approximation as in the above equation
Example
Discrete distributions such as p(x=1)=1/3, p(x=0)=2/3
Three events with x=1 and seven events with x=0 are observed
An approximate calculation of entropy is given above
Example: Approximation of entropy by samples
As the number of samples increases, the approximation approaches the exact value
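A sketch of this Monte Carlo approximation for the entropy of the distribution in the example (the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([2/3, 1/3])                 # p(x=0) = 2/3, p(x=1) = 1/3

exact = float(-np.sum(p * np.log(p)))    # exact entropy ≈ 0.637

for n in (10, 100, 10_000):
    x = rng.choice([0, 1], size=n, p=p)
    approx = float(np.mean(-np.log(p[x])))   # average of -ln p(x^(i)) over the samples
    print(f"{n:6d} samples: approx {approx:.3f}  (exact {exact:.3f})")
```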
2.2 Discrete probability distribution
2.2.1 Bernoulli distribution
Simplest discrete probability distribution
Probability distribution for generating a binary variable x ∈ {0,1} (above equation)
One parameter 𝛍∈(0,1) determines the nature of the distribution
Used to describe two mutually exclusive events, such as the two sides of a coin or winning/losing a lottery
Example.
Sample {1,0,1,1,1,1,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1} when 𝛍 is 0.9
Expected value
<x>=𝛍
<x2>=𝛍
entropy
Formula
Graph
KL divergence
Formula
Value
2.2.2 Binomial distribution
Repeat a Bernoulli trial M times
where
The probability of one particular sequence in which heads appears m times out of M tosses is μ^m (1−μ)^(M−m)
Since there are MCm sequences with the same count m, the above equation is obtained
Note that this is different from the Bernoulli distribution multiplied M times
Distributions for fixed parameters M and μ
Expected values of the binomial distribution
2.2.3 Categorical distribution
Extension of Bernoulli distribution to K dimensions
The probability distribution on s defined by the above equation is called a categorical distribution.
s is expressed as a K-dimensional vector
For a die, the roll of 5 is s = (0,0,0,0,1,0)ᵀ
π = (π1, …, πK)ᵀ is a K-dimensional parameter satisfying πk ∈ (0,1) and ∑_{k=1}^K πk = 1
If K = 6 and πk = 1/6 for every k, this represents a fair six-sided die
If K=2, then Bernoulli distribution
Expected value calculation (similar to Bernoulli)
Calculation of entropy
2.2.4 Multinomial distribution
Bernoulli distribution, binomial distribution and categorical distribution are all special cases of multinomial distribution
Consider the distribution of the number of occurrences mk for the kth event after M iterations of trials in the categorical distribution
Equation
m is a K-dimensional vector
mk is the number of times the kth event occurs
Examples of the distribution for different parameters π and M
Expected value
Relationships among various discrete distributions
2.2.5 Poisson Distribution
Probability distribution for generating a non-negative integer x
Logarithmic representation
Unlike the binomial distribution, where an upper bound on possible values is given, the probability is not completely zero with respect to any value, no matter how large that value may be.
Distribution for different parameters
Expected value
2.3 Continuous Probability Distribution
2.3.1 Beta distribution
Probability distribution that generates variables such that 𝛍∈(0,1)
CB(a, b): coefficient that ensures normalization
Γ(·): gamma function
Beta distribution with different parameters
Logarithmic representation of the beta distribution
Expected value
Ψ(⋅): digamma function
Entropy
Conjugate prior of the Bernoulli and binomial distributions
2.3.2 Dirichlet distribution
Multidimensional extension of beta distribution
Normalization term
Dirichlet distribution
Logarithmic representation of Dirichlet distribution
Expected value
Entropy
KL divergence of two different Dirichlet distributions
Conjugate prior for categorical and multinomial distributions
2.3.3 Gamma distribution
Probability distribution that generates positive real numbers 𝝺 ∈ ℝ+.
Normalization term
Gamma distribution
Logarithmic representation of the gamma distribution
Expected value
Entropy
KL divergence between two gamma distributions
The gamma distribution is a conjugate prior for the parameter λ of the Poisson distribution, and also a conjugate prior for the precision parameter (inverse of the variance) of the 1D Gaussian distribution, which is introduced next
2.3.4 1D Gaussian distribution
Probability density
One-dimensional Gaussian distribution
Logarithmic representation of Gaussian distribution
Expected value
Entropy
KL divergence between two different Gaussian distributions
2.3.5 Multivariate Gaussian distribution
An extension of the 1D Gaussian distribution to the more general D dimensions
Multidimensional Gaussian distribution
Logarithmic representation of multidimensional Gaussian distribution
Correlations between different dimensions are represented by the off-diagonal components of the matrix Σ
Σ when the dimensions are independent of each other (off-diagonal components zero)
Logarithmic representation of the independent case
Expected value
Entropy
Transformation of the equation
For D
KL divergence
Transformation of the equation
2.3.6 Wishart distribution
Probability distribution that generates a D×D positive definite matrix A
Used as a probability distribution to generate the precision matrix, which is the inverse of the covariance matrix of the multidimensional Gaussian distribution
Equation
𝛎: Degree of freedom parameter
Logarithmic representation of the Wishart distribution
Logarithmic representation of the normalization term of the Wishart distribution
Samples from the Wishart distribution
The Wishart distribution is a probability distribution that extends the gamma distribution to a matrix (a positive definite matrix from positive real numbers)
Expected value
Entropy
The Wishart distribution is the conjugate prior for the precision matrix (the inverse of the covariance matrix) of the multidimensional Gaussian distribution
Chapter 3: Learning and Prediction with Bayesian Inference
3.1 Learning and Prediction
Introduction.
Explanation on learning parameters and predicting unobserved values using probabilistic inference
3.1.1 Posterior distribution of parameters
Assumptions
When the training data set is D, the joint distribution is given by the above equation
θ: Unknown parameter included in the model
Prior uncertainty about the parameters is reflected in the model by setting the prior distribution p(θ)
The role of the term p(D|θ) is to describe how data D is generated from a particular parameter θ
When p(D|θ) is viewed as a function of θ, it is called the likelihood function
The uncertainty of the parameters after observing the data D is expressed in the above equation using Bayes’ theorem.
Computing the conditional distribution p(θ|D) is “learning” in Bayesian learning
The posterior distribution p(θ|D) reflects the characteristics of the observed data D, passed in through the likelihood function p(D|θ), more strongly than the prior p(θ) does
3.1.2 Predicted distribution
We want to use the learned distribution of the parameters to gain some insight into the unobserved data x*
This can be achieved by computing the predictive distribution in the above equation
Intuitively, the model p(x*|θ) is averaged over various values of θ, weighted by p(θ|D)
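Written out, the predictive distribution being described is the usual marginalization of the model over the posterior:

```latex
p(\mathbf{x}_\ast \mid \mathcal{D}) = \int p(\mathbf{x}_\ast \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}
```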
Graphical model implying the above equation
Both the data D and the unknown value x* are governed in their generation process by the parameter θ
No direct dependence assumed between D and x*.
The corresponding joint distribution is the above equation
If only data D is at hand, the posterior distribution of the remaining variables is as above
Integral elimination of Θ yields the predictive distribution in the above equation
Can be calculated even if data D is not observed at all
Marginalizing out the parameters from the joint distribution before observing any data yields the marginal distribution of x*
Very rough prediction relying only on prior knowledge p(θ)
Summary of Predictions
To predict x* using the model expressed using the above equation
First, learn the posterior distribution of the parameters using the above equation
Once this is done, the data D itself can be discarded
Then compute the integral over the obtained posterior distribution, using the above formula, to obtain the predictive distribution for x*
All information in the data D is condensed into the posterior distribution p(θ|D)
A probabilistic model that flexibly changes the expressive capacity of the predictive model to match the data D
Bayesian nonparametric
Assumption that the data are conditionally independent given the parameters
Example: Assuming time-series dependencies between data
3.1.3 Conjugate prior
One idea for efficiently computing the posterior or predictive distribution is to use a conjugate prior distribution
A prior distribution p(θ) chosen so that the posterior distribution p(θ|D) has the same form of probability distribution
What prior distributions can be conjugate?
Example
If the likelihood function is described by a Poisson distribution, then
If the prior distribution is set to the gamma distribution, which is a conjugate prior distribution, the posterior distribution will also be the gamma distribution.
Relationship between the likelihood function, the conjugate prior distribution, and the predictive distribution
For a distribution with two parameters, such as a Gaussian distribution, the conjugate prior distribution will differ depending on which parameter is trained.
Conjugacy plays an important role in building a framework for sequential learning on small batches of data
Posterior distribution after observing the first dataset D1
Posterior distribution after further observing a new dataset D2
With a conjugate prior distribution, p(θ), p(θ|D1) and p(θ|D1,D2) all have the same form
Even when approximation methods must be used because the posterior cannot be computed analytically, conjugacy improves computational efficiency
3.1.4 Explanation of non-conjugate prior distributions
Example: If we assume a non-conjugate gamma distribution for the mean parameter of a Gaussian distribution, the posterior distribution will not be a gamma distribution
Approach
MCMC (Markov chain Monte Carlo)
Variational inference
Assuming that the posterior distribution p(θ|D) can be approximated by some distribution q(θ;𝜂) using the variational parameter η
Attempt to minimize the KL divergence between q(θ; η) and the posterior, as in the above equation
Solve using the gradient method
Problem
Computational cost for optimization
It is impossible to know how well the approximate solution approximates the posterior distribution
Example of approximating the posterior distribution with a gamma distribution, for a model in which a non-conjugate gamma prior is placed on the mean parameter of a Gaussian likelihood
The true posterior distribution cannot be reproduced exactly, but a rough approximation based on the KL divergence criterion is obtained.
3.2 Learning and Prediction of Discrete Probability Distributions
Introduction.
Explains how to analytically calculate the posterior and predictive distributions of parameters using various discrete probability distributions as examples
3.2.1 Learning and Prediction of Bernoulli Distribution
Consider the Bernoulli distribution (above), a probability distribution on binary values x ∈ {0,1}.
Distribution used to represent discrete events that take two values, such as the toss of a coin or an attempt to pick out red and white balls
The parameter μ controls how likely one of the two events is to occur
Consider how to infer the distribution of 𝛍 from training data
Bayesian learning treats all values with uncertainty as random variables
As a prior for the parameter of the Bernoulli distribution, use a probability distribution that generates values satisfying μ ∈ (0,1): the beta distribution
a and b are parameters that control the prior distribution p(μ)
Prior distribution p(𝛍) = Beta(𝛍|a,b)
a, b: parameters for a parameter (hyperparameters)
After observing N data points X = {x1, …, xN}, the posterior distribution is given by the above equation
With this equation as it is, we do not know what form p(μ|X) will take
To reveal the shape of the distribution of p(μ|X), it is sufficient to focus only on the terms related to μ
The denominator p(X) is omitted in the third line
Calculate by taking the logarithm of the above equation
The N multiplication part in the first line is converted to N sums by taking the logarithm
In the second line, substitute a specific probability distribution definition to expand p(xn|μ) and p(μ)
All terms irrelevant to μ are absorbed by const.
The last line is organized with respect to the two terms lnμ and ln(1-μ) for μ.
Represented by the logarithmic representation of the beta distribution
In the learning model for parameters using the Bernoulli and beta distributions, the posterior distribution can be computed analytically and has the same form as the prior distribution
The beta distribution effectively stores the counts of how many times each of the two outcomes (e.g., heads and tails of a coin) has been observed
Note: Empirical Bayesian Method
In Bayesian learning, simply use the probability calculation formula to find the posterior distribution with respect to the parameter μ
The posterior distribution is a beta distribution, and the specific values of its parameters are obtained by the probability calculation
In the empirical Bayes method, hyperparameters such as a and b are themselves adjusted directly to fit the observed data
Model Selection Techniques
In the Bayesian learning framework, the hyperparameters of the prior are set to fixed values that reflect domain knowledge of the problem (e.g., the rough scale of possible values of the parameters).
For several different beta priors, a data series X = {x1, …, xN} is generated with the true parameter set to μtruth = 0.25, and the posterior distribution is computed.
In the case of N=0, there is no observed data, so the prior distribution remains unchanged.
No matter what prior distribution is set, the posteriors converge to the same shape as N increases
Interpretation
When there is little data available, each forecaster is heavily influenced by his or her own prior beliefs (prior distribution)
As the number of data increases, the opinions of each side gradually come into agreement
Compute the predictive distribution for unobserved values x*∈{0,1}
For simplicity, we first compute a rough estimate using the prior distribution p(μ)
Marginalizing out the parameter μ leads to the above equation
Finally, the formula for the predictive distribution is given above
From the formula for the beta distribution
When x*=1
When x*=0
Predictive distribution summarizing the two above equations
Result using a prior distribution p(μ) that has not yet been trained with any data
Example: setting a = 1, b = 1, the expected value of the next observation is 0.5
The posterior distribution p(μ|X) after observing N data X also has the same beta distribution as the prior distribution
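A minimal sketch of this Bernoulli/beta learning loop (the hyperparameters and the true bias are made-up values): the conjugate update turns Beta(a, b) into Beta(a + #ones, b + #zeros), and the predictive probability of the next observation being 1 is the posterior mean of μ.

```python
import numpy as np

rng = np.random.default_rng(2)

mu_truth = 0.25          # hypothetical true coin bias (as in the μ_truth = 0.25 experiment)
a, b = 1.0, 1.0          # Beta prior hyperparameters (a uniform prior)

X = rng.binomial(n=1, p=mu_truth, size=50)    # observed 0/1 data

# Conjugate update: Beta(a, b) -> Beta(a + #ones, b + #zeros)
a_hat = a + X.sum()
b_hat = b + len(X) - X.sum()

# Predictive probability that the next observation is 1 (posterior mean of mu)
p_next_is_one = a_hat / (a_hat + b_hat)
print(a_hat, b_hat, round(p_next_is_one, 3))
```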
3.2.2 Learning and Prediction of Categorical Distributions
Consider learning the parameters of a categorical distribution that takes K values in general
The probability mass function of the categorical distribution is given above
Dirichlet distribution that can generate K-dimensional variables (above equation)
Actually calculate the posterior distribution using the model above to computationally confirm that the Dirichlet distribution is a conjugate prior distribution for the categorical distribution.
N discrete-valued data points S = {s1, …, sN} are assumed to follow a categorical distribution
Using Bayes’ theorem, express the posterior distribution of π in the above equation
What is the identity of the posterior distribution p(π|S)?
Logarithm of the above equation
The posterior distribution is also expressed above as a Dirichlet distribution
How is the predictive distribution calculated?
Predictive distribution can be calculated by marginalizing the parameter π
Using the normalized term of the Dirichlet distribution, the predictive distribution can be expressed as above
The formula is full of gamma functions
If we only consider the case where s*,k’=1 for some k’, we get the above equation.
Rewritten as a categorical distribution, the above equation becomes
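A minimal sketch of the categorical/Dirichlet update and predictive distribution, using made-up die-roll counts (the standard conjugate result: add the observed counts to the Dirichlet hyperparameters, then normalize for the predictive distribution):

```python
import numpy as np

# Made-up counts of each face of a six-sided die, i.e. sum_n s_{n,k} for k = 1..6
counts = np.array([3, 5, 4, 2, 6, 4])
alpha = np.ones(6)                  # symmetric Dirichlet hyperparameters

# Conjugate update: Dir(alpha) -> Dir(alpha + counts)
alpha_hat = alpha + counts

# Predictive distribution over the next roll
p_next = alpha_hat / alpha_hat.sum()
print(np.round(p_next, 3))
```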
What does “equally likely” mean?
Example
A die whose faces are equally likely
Assume, with infinite conviction, a prior distribution saying “the probability of each face of the die is exactly 1/6”
For the Dirichlet prior of the categorical distribution, this corresponds to a huge value such as αk = 10^10 for each face k = 1, …, 6
In actual systems, there are not many problems that can be discussed with such strong conviction.
The Bayesian learning approach is to specify, even roughly, the uncertainty of the parameters and the structure of the model in advance, based on knowledge of how the observed data are likely to have been generated.
3.2.3 Learning and Prediction of Poisson Distribution
As another example of Bayesian inference about discrete distributions, we look at the computation of prior and posterior distributions using the Poisson distribution
The Poisson distribution has λ as a parameter and takes the form of the above equation
To express the uncertainty of the parameters as in the above equation, we need a probability distribution that produces positive real values.
The gamma distribution in the above equation can be used as a prior distribution for λ
Assume that the hyperparameters a and b are fixed in advance
The gamma distribution is a conjugate prior of the Poisson distribution
After observing N non-negative integer-valued data points X = {x1, …, xN}, the posterior distribution is calculated as above
Investigate the functional form with respect to λ by performing a logarithmic calculation
Rearranging the equation with attention to the ln λ and λ terms, the posterior distribution turns out to be a gamma distribution.
Finding the Predicted Distribution
The predictive distribution for unobserved data x* is given above
Substituting in the Poisson and gamma distribution formulas, we arrive at the above equation.
Find the expression for the integral part analytically from the definition of the gamma distribution, as in the above equation.
The final prediction equation to be obtained is the above equation.
This probability distribution is a negative binomial distribution with parameters r ∈ ℝ+ and p ∈ (0,1).
However, the parameters r and p are as in the above equation.
The basic expected value of this distribution is the above equation
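A small sketch of the Poisson/gamma case with made-up counts: the conjugate update is Gam(a, b) → Gam(a + Σxn, b + N), and the predictive distribution can be simulated by drawing λ from the posterior and then x* from a Poisson with that rate.

```python
import numpy as np

rng = np.random.default_rng(3)

X = np.array([2, 4, 3, 5, 1, 3, 2])   # made-up non-negative count data
a, b = 1.0, 1.0                       # gamma prior hyperparameters

# Conjugate update: Gam(a, b) -> Gam(a + sum(X), b + N)
a_hat = a + X.sum()
b_hat = b + len(X)
print("posterior mean of lambda:", round(a_hat / b_hat, 3))

# Posterior predictive by simulation: draw lambda, then x* ~ Poisson(lambda)
lam = rng.gamma(shape=a_hat, scale=1.0 / b_hat, size=100_000)
x_star = rng.poisson(lam)
print("predictive mean:", round(float(x_star.mean()), 3))   # close to a_hat / b_hat
```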
3.3 Learning and Prediction of 1D Gaussian Distribution
3.3.1 When the mean is unknown
When learning only the mean µ ∈ ℝ of a Gaussian distribution using N data that are assumed to follow a Gaussian distribution
Assumptions
The precision parameter λ (the inverse of the variance σ²) is assumed to be fixed
Set a prior distribution only for the mean parameter μ that you want to learn.
For some observed value x ∈ ℝ, consider a Gaussian distribution as in the above equation
Gaussian distribution for observed data
For µ, a Gaussian prior like the above is known to be a conjugate prior
m ∈ ℝ and λμ ∈ ℝ+ are fixed hyperparameters
Gaussian distribution with respect to the mean parameter
N 1D continuous value data X={x1,… xN} that are assumed to follow a Gaussian distribution.
Bayes’ theorem states that the posterior distribution is
At this point, we still do not know what probability distribution the posterior distribution p(μ|X) will be
Compute the logarithm and examine the functional form with respect to μ
We obtain a quadratic function of μ that is convex upward (a downward-opening parabola)
Identify parameters of the distribution (mean, precision)
Work backwards from the conclusion
Suppose the posterior distribution can be written as a Gaussian distribution of the form
Take the logarithm and organize with respect to the second-order and first-order terms of μ
Comparing the two equations and finding the coefficients yields the above equation
The posterior distribution is Gaussian, the same as the prior distribution
The precision λ’ of the posterior distribution is the prior precision λμ plus Nλ
The mean m’ of the posterior distribution is a weighted combination of the prior knowledge m and the sum of the observed data ∑_{n=1}^N xn
View the predictive distribution for unobserved data x* using the prior distribution p(μ)
Compute the marginal distribution as in the above equation
Perform calculations using logarithms
The relationship between the predictive distribution p(x*) and the prior distribution p(μ) is as follows
Taking the logarithm and finding it for p(x*), we get the above equation.
ln p(μ) is independent of x*, so put it in the constant const.
p(μ|x*) is the conditional distribution of μ given x*
The logarithmic computation gives
The predictive distribution is as above
The mean parameter of the predictive distribution is the mean m of the prior distribution itself
The precision can be converted into a variance by taking its reciprocal, as in the above equation
The uncertainty of the predictive distribution is the sum of the respective uncertainties of the observed distribution and the prior distribution
To find the predictive distribution p(x*|X) after observing N data points, use the posterior distribution p(μ|X) in place of the prior p(μ) in the same calculation
If you also want to learn the distribution of λ, you need to add the gamma prior p(λ) to the model
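A sketch of the known-precision conjugate update and the resulting predictive distribution, with made-up numbers (the formulas are the standard ones matching the λ’ and m’ expressions above):

```python
import numpy as np

X = np.array([2.1, 1.8, 2.5, 2.2, 1.9])   # made-up data, assumed Gaussian
lam = 4.0                                 # known noise precision (1 / sigma^2)
m, lam_mu = 0.0, 1.0                      # prior: mu ~ N(m, 1 / lam_mu)

# Conjugate update for the unknown mean
lam_mu_hat = len(X) * lam + lam_mu                      # posterior precision of mu
m_hat = (lam * X.sum() + lam_mu * m) / lam_mu_hat       # posterior mean of mu

# Predictive distribution for a new point x*: Gaussian, and the uncertainties add
pred_mean = m_hat
pred_var = 1.0 / lam + 1.0 / lam_mu_hat
print(round(m_hat, 3), round(pred_var, 3))
```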
3.3.2 When the precision is unknown
Assuming that the mean μ of the 1D Gaussian distribution is already known for some reason, consider a model that learns only the precision λ from the data
Assumptions
Let the observation model be the above equation
Given a gamma prior distribution as above with respect to λ
Known to be a conjugate prior for the precision parameter of the Gaussian distribution
Using Bayes’ theorem, the posterior distribution of λ is obtained as in the above equation
Perform logarithmic calculations, examine the functional form with respect to λ
Focusing on the coefficients of the λ and ln λ terms, we find that the posterior is a gamma distribution with parameters as shown in the above equation.
Calculate the predictive distribution p(x*)
Integral calculation as above
Simplified calculation using logarithms only
Using Bayes’ theorem, the above relationship holds for x* and λ
Taking the logarithm and neglecting the term p(λ), p(x*) becomes the above equation
Further calculation yields the above equation
Finally, the above equation becomes
The result is the logarithm of a distribution called Student’s t-distribution (see above)
Example plot of the one-dimensional Student’s t-distribution
Taking the logarithm of Student’s t-distribution and absorbing the terms unrelated to the random variable x into const. gives the above equation
Comparing the two equations, the predictive distribution is expressed by the above equation
3.3.3 When mean and precision are unknown
Consider the case where both the mean and the precision are unknown
Assumptions
Let the observation model be the above equation
Assuming a Gauss-gamma distribution (above equation) with fixed hyperparameters m, β, a, and b as the prior distribution, the posterior distribution has the same form
The precision of the mean parameter μ is not fixed, but is replaced by βλ
The goal is to find the four parameters m, β, a and b in the posterior distribution
Learning models for the one-dimensional Gaussian distribution
Graphical model with unknown mean
Graphical model with unknown precision
Graphical model with unknown mean and precision
Mean parameter depends on precision parameter
Calculation of actual posterior distribution
First, we focus only on the mean μ
The p(μ|λ,X) part of the posterior distribution is the above equation
Find the remaining p(λ|X)
The joint distribution is expressed as the product of conditional distributions, as in the above equation
Continued
Taking the logarithm and obtaining the functional form for λ, we obtain the above equation.
Summarized using the defining equation of the gamma distribution, the above equation is
Calculating Predictive Distributions Using Prior Distributions
As in the above equation, the two variables, mean and precision, need to be integrated out of the Gaussian distribution
After some rearrangement along the way, the formula for the predictive distribution becomes the above (a Student’s t-distribution)
3.4 Learning and Prediction of Multidimensional Gaussian Distributions
Introduction
Consider learning parameters and making predictions on new data using a D-dimensional multidimensional Gaussian distribution
Multidimensional Gaussian distribution is very important in applications
The covariance matrix Σ is characterized by its ability to capture correlations across dimensions of the data.
Derivation flow is similar to one-dimensional Gaussian distribution
3.4.1 When the mean is unknown
Discussion assuming that only the mean parameter of the D-dimensional random variable x ∈ ℝ^D is unknown and the precision matrix Λ ∈ ℝ^(D×D) is given
Assumptions.
Observation conditions are as above
Precision matrix Λ must be preconfigured as a positive definite matrix
For the mean parameter of a Gaussian distribution, set up the inference calculation as above, using the Gaussian distribution as a prior
Introduce fixed hyperparameters m ∈ ℝ^D and Λμ ∈ ℝ^(D×D)
The posterior distribution after observing N data X is as above
Taking the logarithm and rearranging it with respect to μ, we obtain the above equation
The result is a quadratic function of the D-dimensional vector μ that is convex upward (downward-opening)
The posterior distribution of μ is found to be Gaussian
Calculate the parameters of the resulting posterior distribution as above
Taking the logarithm and rearranging it for μ, we obtain the above equation.
Comparing the coefficients of the two equations yields the above equation
Find the predictive distribution p(x*) for unobserved data points x*∈ℝD
Using Bayes’ theorem, write the predictive distribution in logarithmic form
Transformation of the equation
where
The final Gaussian distribution is shown above
Parameters are as above
3.4.2 When the precision is unknown
Consider the case where the mean parameter is given (known) and only the precision matrix is to be learned from the data
Assumptions
Let the observation model be the above equation
As the probability distribution that generates the precision matrix, which is a positive definite matrix, use a Wishart prior as in the above equation
W ∈ ℝ^(D×D) is a positive definite matrix
γ is given a priori as a real value satisfying γ >D-1
Verify that the Wishart distribution is a conjugate prior to the precision matrix of a multidimensional Gaussian distribution
Using Bayes’ theorem, the posterior distribution after observing the data X = {x1, …, xN} can be expressed in logarithmic form as above
Specifically, if we proceed according to the logarithms of the Gaussian and Wishart distributions and take out only the terms related to Λ, we obtain the above equation
To simplify the expression, calculations involving the trace Tr(·) are carried out along the way
The resulting probability distribution is a Wishart distribution with parameters as in the above equation.
Calculate the predictive distribution
The predictive distribution p(x*) for an unobserved vector x* is given by Bayes’ theorem
where p(Λ|x*) can be obtained by reusing the result of the posterior-distribution calculation, as in the above equation
Substituting this result, the polynomial and trace parts in x* combine to form the above equation
The above equation is a multidimensional version of Student’s t distribution over x ∈ ℝD.
μs ∈ ℝ^D, the positive definite matrix Λs ∈ ℝ^(D×D), and γs ∈ ℝ are the parameters of this distribution
Taking the logarithm of this equation and keeping only the terms related to x gives the above equation.
The final predictive distribution is shown above
3.4.3 When the mean and precision are unknown
Consider the case where you want to learn the mean and precision
Assumptions
Let the observation model be the above equation
In the multidimensional case, using the Gauss-Wishart distribution (above) as the prior distribution, the posterior distribution takes the same form.
Find the posterior distribution after observing data X
Bayes’ theorem can be used to express the above equation
The posterior distribution of μ and Λ can be decomposed using a conditional distribution, as in the above equation
Therefore, first find the posterior distribution of μ, and then find the posterior distribution of Λ
For p(μ|Λ, X), the above equation is obtained by setting the precision matrix to βΛ
For p(Λ|X), organizing the logarithm while focusing only on Λ gives the above equation
Thus, rearranging the equation yields the above equation
Compared with the defining formula of the Wishart distribution, it can be expressed as in the above formula
Calculate the predictive distribution using the prior distribution p(μ,𝜦)
The predictive distribution for new data x*∈ℝD is the above equation
Calculate this equation using logarithmic computation
By Bayes’ theorem, the above equation holds for the predictive distribution p(x*)
The second term can be obtained by reusing the result of the Gauss-Wishart posterior distribution, giving the above equation
If we organize only the terms related to x*, we get the above equation
Consistent with the logarithmic representation of Student’s t-distribution
Corresponding to the definition of Student’s t-distribution, the above equation becomes
3.5 Linear regression example
Introduction.
As an example of analytical computation using a conjugate prior distribution, a linear regression model is constructed using Gaussian distributions to learn the coefficient parameters and predict unobserved data
3.5.1 Model construction
In a linear regression model, the above equation is modeled as
Output values yn ∈ ℝ
Input values xn ∈ ℝ^M
Parameters w ∈ ℝ^M
Noise component εn ∈ ℝ
The noise is assumed to follow a Gaussian distribution with mean zero (above equation)
λ ∈ ℝ+ is a known precision parameter of the one-dimensional Gaussian distribution
Put the Gaussian distribution of the noise components into the model and formulate the probability distribution as in the above equation.
Since we want to learn the parameter w from observed data, we set up a prior distribution as in the above equation.
m ∈ ℝM is the mean parameter
The positive definite matrix Λ ∈ ℝ^(M×M) is the precision matrix parameter
m and Λ are hyperparameters and are fixed in advance
Example
Generate data samples with M = 4
Assume the input vector is a degree-3 polynomial such as (1, x, x², x³)ᵀ
Setting m to the zero vector and Λ = I_M, values of w can be sampled
Plot of the cubic curves for several sampled values of w
Results of plotting the observed values yn for different values of the noise precision parameter
Plotting such samples is important for checking that the model and its hyperparameters are reasonable
3.5.2 Calculating posterior and predictive distributions
Using the linear regression model constructed in the previous section, determine the posterior and predictive distributions after observing the data
The posterior distribution is expressed by the above equation using Bayes’ theorem.
Taking the logarithm of the above equation and organizing it in terms of w yields the above expression.
The posterior distribution of w can be written as a Gaussian distribution
Given a new input value x*, find the predictive distribution p(y*|x*,Y,X) of the output value y*.
For a new input data vector x* and an unknown output value y*, the above equation is expressed by Bayes’ theorem
Taking the logarithm and expressing it in terms of the desired ln p(y*|x*), we obtain the above equation.
The second term on the right side is the above equation
The above equation is organized with respect to y*.
Summarized as a one-dimensional Gaussian distribution, the above formula is
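A compact sketch of the whole 3.5 pipeline with synthetic data (the polynomial features, hyperparameters, and test point are arbitrary choices): it computes the Gaussian posterior of w and the Gaussian predictive distribution for a new input, using the standard conjugate formulas for a known noise precision λ.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic training data: N = 10 noisy points from a sine wave
N, lam = 10, 10.0                                    # lam: known noise precision
x_raw = rng.uniform(0.0, 1.0, size=N)
Y = np.sin(2 * np.pi * x_raw) + rng.normal(scale=lam ** -0.5, size=N)

# Degree-3 polynomial features: x_n = (1, x, x^2, x^3)^T, so M = 4
X = np.stack([x_raw ** d for d in range(4)], axis=1)
M = X.shape[1]

# Prior w ~ N(m, Lambda^{-1})
m, Lam = np.zeros(M), np.eye(M)

# Posterior of w (standard conjugate result for known noise precision)
Lam_hat = lam * X.T @ X + Lam
m_hat = np.linalg.solve(Lam_hat, lam * X.T @ Y + Lam @ m)

# Predictive distribution p(y* | x*) for a new input
x_star = np.array([1.0, 0.3, 0.3 ** 2, 0.3 ** 3])
pred_mean = float(m_hat @ x_star)
pred_var = 1.0 / lam + float(x_star @ np.linalg.solve(Lam_hat, x_star))
print(round(pred_mean, 3), round(pred_var, 4))
```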
3.5.3 Model Comparison
Using the derived predictive distribution, polynomials of different degrees were fitted to N = 10 data points generated from a sine wave.
The model is too simple for M=1 and M=2 and does not capture the characteristics of a sine wave well.
When M=4, it is captured well
At M = 10 (the same number of parameters as training data points), the predicted mean becomes unstable
Comparing the marginal likelihoods of multiple models
How to compare models?
In Bayesian estimation, model selection is done by comparing the value p(D), called marginal likelihood or model evidence, among multiple models
The value of p(D) for a model represents the likelihood of generating data D
In the linear regression model, we can compare the value of p(Y|X), which is the denominator of the above equation
The above equation is obtained by transforming
The logarithmic marginal likelihood is as above
Plot of ln p(Y|X) for N = 10 training data points against the parameter dimension M
The marginal likelihood is highest around M = 4, i.e., that model is judged to represent the data best
This result varies greatly depending on the number of data (N).
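One convenient way to see how ln p(Y|X) can be evaluated is the chain-rule decomposition below (a general identity that also uses the model's conditional independence, not a formula specific to this book): each factor is exactly the one-step-ahead Gaussian predictive density derived in 3.5.2, so the marginal likelihood rewards models whose prediction of each new point, given the points seen so far, is good.

```latex
\ln p(\mathbf{Y} \mid \mathbf{X})
  = \sum_{n=1}^{N} \ln p\!\left(y_n \,\middle|\, \mathbf{x}_n,\ \mathbf{X}_{1:n-1},\ \mathbf{Y}_{1:n-1}\right)
```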
Example showing the relationship between model representativeness and number of training data
Assumptions.
Assume a true function consisting of a quadratic curve with a sine wave added
Situation in which observed data with noise components are generated
Curves are learned using linear regression, quadratic regression, and the nearest neighbor method for different numbers of training data.
Fitting will differ as observed data increases
Results of evaluating errors in predictions separately for training and test data
Beyond a certain point, simply increasing the training data does not improve accuracy
If overfitting is not occurring, building a more complex model to increase flexibility is effective
When the number of data is not sufficient, it is often appropriate to use a model with limited expressive power as much as possible.
Maximum Likelihood Estimation and MAP Estimation
Introduction to learning methods other than Bayesian estimation
Maximum likelihood estimation
A method to search for the best-fitting observation model by maximizing the likelihood function p(X|θ) for the observed data X with respect to the value of the parameter θ
Maximum likelihood estimation does not introduce a prior distribution p(θ)
Over-fitting occurs.
Search for Θ as a point rather than a spreading distribution
Maximum likelihood estimation may be used to train a mixture model
MAP estimation (maximum a posterior estimation)
In general, it introduces a prior distribution p(θ) of parameters and maximizes the posterior distribution with respect to θ
Although it is less prone to overfitting than maximum likelihood estimation, MAP estimation also reduces the posterior distribution to a single “point” and thus cannot handle parameter uncertainty at all.
The above two estimation methods can be interpreted as very special cases of variational inference algorithms
Maximum Likelihood Estimation assumes a constant for the prior distribution p(θ) and an infinitely sharp approximate distribution q(θ) for the posterior distribution p(θ|X)
MAP estimation introduces a pre-designed p(θ) for the prior distribution as in Bayesian inference, and assumes an infinitely sharp approximate distribution q(θ) for the posterior distribution as in maximum likelihood estimation.
Chapter 4. Mixture Model and Approximate Inference
4.1 Mixture Model and Posterior Distribution Inference
Introduction
Explain mixture models as an example of complex models
Explanation of approximate inference methods for more efficient inference
Gibbs sampling
Variational inference
Collapsed Gibbs sampling
Examples of concrete realizations include
One-dimensional Poisson mixture model
Multidimensional Gaussian mixture model
4.1.1 Reasons for using mixed models
(Common example in the real world) Multiple data clusters exist in the background
A single two-dimensional Gaussian distribution ignores the structure and covers the entire distribution, and cannot represent the characteristics of the data.
By using a mixture model, each class can be assigned a probability distribution that represents the data.
Even if the amounts of data belonging to different clusters are imbalanced, the mixing-ratio parameter allows the model to accommodate this flexibly.
Mixed models can be easily combined with other probability models such as linear regression
Multimodal data, for example, with two trends
Simple polynomial regression does not capture features well
The figure shows the result of forcibly fitting a polynomial to the data using maximum likelihood estimation.
With M = 4, the fit is a simple estimate that averages the two trends
M=30 results in a very complex function
Assuming two regression functions behind the data (e.g., separating normal data from outliers) captures the features well
Can be used to address outliers
4.1.2 Mixed Model Data Generation Process
To create a model to represent the data
Differences from DNN?
Example
Data with cluster structure as shown in the figure
N data points X = {x1, …, xN} are assumed to be generated as follows
The number of clusters K is known.
(1) The mixing ratios π = (π1, …, πK)ᵀ are generated from the prior distribution p(π) (where πk ∈ (0,1) and ∑_{k=1}^K πk = 1)
(2) For each cluster k = 1, …, K, the parameters θk (mean, variance, etc.) of the observation model are generated from the prior distribution
(3) For each n = 1, …, N, the cluster assignment sn corresponding to xn is chosen according to the ratios π
(4) For each n = 1, …, N, the data point xn is generated from the k-th probability distribution p(xn|θk) selected by sn
Model constructed based on hypotheses about the data generation process
Probabilistic simulator for generating data
sn: hidden variable (latent variable)
sn is a random variable that, unlike the xn at hand, cannot be observed directly; behind the scenes it determines which distribution generates xn.
Specifically define the equations for constructing the algorithm
Step (4)
Define the probability distribution for the point xn to be finally retrieved
For example, set up a Gaussian distribution for each of them.
Let θk={μk, Σk} be the parameters of the observation model
Mixture model is a mechanism to switch between K different observation models
Distributions other than the Gaussian can also be handled here.
Step (3)
Means to assign K observation models to each data point
It is convenient to use 1 of K representation for sn
sn is a K-dimensional vector; when sn,k=1 for some k, the kth cluster is specified.
Select a categorical distribution with parameter π as the distribution from which sn is sampled.
Which k is more likely to be selected is determined by the proportion indicated by the mixing proportion parameter π that governs this distribution.
Combining steps (3) and (4), the probability distribution that generates xn is p(xn|sn,Θ)=∏k=1K p(xn|θk)^sn,k.
The parameters of the observation model are written collectively as Θ={θ1,…,θK}.
The probability distributions are multiplied K times, but since sn takes the value 1 in only one of its K elements, exactly one of the K distributions is selected by sn.
The remaining distributions are raised to the power of zero and vanish.
Step (2)
Define a prior distribution p(θk) for the parameter θk of the observation model defined in step (4)
It is common to set up a prior distribution such that conjugacy holds for p(xn|θk)
Example
If p(xn|θk) is modeled as a Poisson distribution, it is convenient later to use the gamma distribution, which is the conjugate prior distribution of its parameter.
Step (1)
The mixing ratio π is often unknown until a sufficient amount of data has been observed.
It is better to place some prior distribution on it and learn it from the data.
Since π is the parameter of a categorical distribution, choose the conjugate prior, a K-dimensional Dirichlet distribution.
The K-dimensional vector α, whose elements are positive real values, is the hyperparameter of the Dirichlet distribution.
Joint distribution over the N data points, built from steps (1) to (4)
Graphical model
𝛩: parameters of the observation model {𝛉1,… ,𝛉K}
𝛑: mixing ratio
p(𝛑)=Dir(𝛑|𝛂)
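Below is a minimal sketch of the generative process (1)–(4) above with a Gaussian observation model; the prior on the means, the fixed covariances, and all numeric values are my own illustrative assumptions, not the book's settings.

```python
# Sketch of the mixture-model generative process (1)-(4) with Gaussian clusters.
import numpy as np

rng = np.random.default_rng(1)
N, K, D = 500, 3, 2
alpha = np.ones(K)                                  # Dirichlet hyperparameter

pi = rng.dirichlet(alpha)                           # (1) mixing ratio pi ~ Dir(alpha)
mu = rng.normal(0.0, 5.0, size=(K, D))              # (2) cluster means from a simple Gaussian prior
Sigma = np.stack([np.eye(D) for _ in range(K)])     #     covariances fixed for brevity
s = rng.choice(K, size=N, p=pi)                     # (3) cluster assignments s_n ~ Cat(pi)
X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in s])   # (4) x_n ~ N(mu_{s_n}, Sigma_{s_n})

print(pi, np.bincount(s, minlength=K))
```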
4.1.3 Posterior distribution of the mixture model
The inference problem for the mixture model is to compute the posterior distribution of the unobserved variables after observing the data.
Written as an equation, the conditional distribution we want is given by the above equation.
Difficult to compute analytically
To estimate the cluster S to which data X belongs, we can compute the above equation
The posterior distribution has a form in which the random variables S, π, and Θ are intricately entangled, making it difficult to compute the normalization term p(X).
To compute p(X) exactly, the joint distribution in the above equation must be marginalized.
In the first line, the parameters Θ and π can be eliminated analytically by using conjugate prior distributions.
The second line requires evaluating p(X,S) for all possible combinations of S
4.2 Approximation methods for probability distributions
Introduction.
Many algorithms for approximate inference have been proposed.
Relatively simple and widely used
Gibbs sampling
Variational inference by mean-field approximation
In Bayesian learning, a model representing the data is combined with the corresponding approximate inference method to form an overall algorithm.
The optimal choice of approximate inference method depends on the model to be handled, the size of the data, the required computational cost, and the application.
Having multiple methods in your arsenal can be very useful in the pursuit of better performance
4.2.1 Gibbs sampling
When we want to obtain some knowledge (e.g., various expected value calculations) about a certain probability distribution p(z1, z2, z3), we can think of a way to sample multiple realizations of Z1, Z2, Z3 from this distribution
If p(z1,z2,z3) is a well-known distribution such as a 3-dimensional Gaussian distribution, it is easy to sample variables in each dimension simultaneously
If the distribution has a more complex shape, it is difficult to sample all variables simultaneously
Methods developed to approach the above issues
One of the MCMC (Markov chain Monte Carlo) methods
Gibbs sampling
Sampling each i-th variable as in the above equation
To sample a random variable, we condition the distribution on already sampled values to obtain a simpler probability distribution that is easier to sample.
If the number of samples is sufficiently large, it can be assumed (and proven in theory) to be sampled from the true posterior distribution.
Initial values are needed at i=0.
Give some random value in advance.
Example: Gibbs sampling for Gaussian distribution
Sampling from a 1-D Gaussian distribution is easy; here the 2-D Gaussian is treated as if it were hard to sample directly, and each 1-D conditional distribution is sampled in turn.
Dotted lines indicate the order in which each sample was extracted
As the number of samples increases, the distribution can be considered to be directly from a true 2-D Gaussian distribution.
The red ellipse q(x) is the Gaussian distribution computed from the mean and variance obtained from the samples
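Below is a minimal sketch of Gibbs sampling applied to a correlated 2-D Gaussian, as in the example above; the mean, covariance, and number of samples are my own illustrative values.

```python
# Gibbs sampling for a 2-D Gaussian: each coordinate is drawn from its 1-D conditional in turn.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

T = 5000
samples = np.zeros((T, 2))
x = np.array([3.0, -3.0])                       # arbitrary initial value (i = 0)
for t in range(T):
    # p(x1 | x2) is a 1-D Gaussian
    m1 = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x[1] - mu[1])
    v1 = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
    x[0] = rng.normal(m1, np.sqrt(v1))
    # p(x2 | x1) is a 1-D Gaussian
    m2 = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (x[0] - mu[0])
    v2 = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]
    x[1] = rng.normal(m2, np.sqrt(v2))
    samples[t] = x

print(samples.mean(axis=0), np.cov(samples.T))   # should approach mu and Sigma
```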
Gibbs Sampling Challenges
It is not clear how many samples are needed to infer a complex posterior distribution
Example
The samples may not have fully explored the entire distribution.
Other ways to sample variables
Blocking Gibbs sampling
Because z2 and z3 are sampled at the same time, samples closer to the true distribution can be expected.
Requires that p(z2,z3|z1) can be sampled
A further extension
Collapsed Gibbs sampling
Parameters are marginalized and removed from the model
Perform Gibbs sampling on the marginalized p(z1,z2)
4.2.2 Variational inference
Variational inference or variational approximation
Variational inference solves an optimization problem to obtain an approximate representation of an unknown probability distribution.
Gibbs sampling sequentially samples realizations from an unknown probability distribution, whereas
Represent a complex probability distribution p(z1, z2, z3) by a simpler approximation q(z1, z2, z3)
KL divergence minimization problem (above equation)
The KL divergence measures the dissimilarity between two probability distributions.
The smaller its value, the more “similar” the two probability distributions are.
If the problem is solved without any constraints, the trivial answer qopt(z1,z2,z3)=p(z1,z2,z3) is obtained as is.
Limit the representativeness of the approximate distribution q, and use optimization to find the distribution that is closest to the true posterior distribution p within that limited distribution.
Typical usage
Variational inference based on mean-field approximation
Make independent assumptions (above equation) for each random variable in the approximate distribution q
Ignore complex dependencies between each variable z1, z2, z3 in p(z1,z2,z3)
Modify each decomposed approximate distribution q(z1),q(z2),q(z3) one by one to reduce the KL divergence
Example: Suppose q(z2) and q(z3) are already given.
The optimal q(z1) can be obtained by solving the optimization problem in the above equation
Find the KL divergence of the above equation when q(z2) and q(z3) are fixed
Find the form for determining q(z1)
obtained by logarithmic and expectation calculations.
Expression in the form of expected value from the definition formula of KL divergence
Expand each term piecemeal by logarithm
All terms irrelevant to z1 are absorbed into the constant const.
For <ln p(z1,z2,z3)>q(z2)q(z3), taking exp and then ln allows it to be combined into a single ln term.
Given q(z2) and q(z3), the KL divergence is minimized by q(z1) ∝ exp(<ln p(z1,z2,z3)>q(z2)q(z3)).
In stochastic models that make good use of conjugate prior distributions, such as mixture models, the results of the right-hand side can be attributed to the same form of distribution as the prior
A similar argument can be made for q(z2) and q(z3) to optimize
Algorithm for variational inference by mean-field approximation
The easiest way is to fix the number of iterations MAXITER in advance.
Define upper bound for computation time
Monitor an evaluation value such as the ELBO (evidence lower bound) at each iteration and terminate when its change falls below a threshold ε.
ELBO is explained in Appendix A.4
More realistic assumptions
For a general probabilistic model p(D, z1,…,zM) with observed data D, consider the posterior distribution of the unobserved variables.
Let Z\i denote the set of unobserved variables excluding the ith variable zi.
From the definition of conditional distribution, the posterior can be expressed via expectations of the joint distribution.
With the other approximate distributions q(Z\i) fixed, the approximate distribution q(zi) is given by q(zi) ∝ exp(<ln p(D,z1,…,zM)>q(Z\i)).
Example of applying variational inference to a two-dimensional Gaussian distribution
The blue ellipse is the true 2-D Gaussian distribution to be estimated.
The red ellipse is the approximate distribution obtained by variational inference.
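Below is a minimal sketch of mean-field variational inference q(x1)q(x2) for a 2-D Gaussian target (the same closed-form coordinate updates described above, cf. Bishop's PRML 10.1.1); the target mean and covariance are illustrative.

```python
# Mean-field approximation q(x1)q(x2) of a 2-D Gaussian p(x) = N(mu, inv(Lambda)).
import numpy as np

mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lam = np.linalg.inv(Sigma)                 # precision matrix of the target

m = np.zeros(2)                            # means of q(x1), q(x2)
for _ in range(50):                        # coordinate updates until convergence
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

v = 1.0 / np.diag(Lam)                     # fixed variances of the factors
print(m, v)    # means match mu, but the marginal variances are underestimated
```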
Issue
Because the assumed factorization discards correlations between the variables, distributions with strong correlations cannot be approximated well.
Example of KL[q(x)‖p(x)]
Optimizing in the direction of decreasing KL divergence
Variational inference is faster than Gibbs sampling
Good convergence performance even for large probabilistic models.
As with blocked Gibbs sampling, variables can also be grouped and updated jointly without assuming a complete factorization.
structured variational inference
4.3 Inference in Poisson Mixture Models
Introduction
Introducing the Poisson mixture model for one-dimensional data, we derive an algorithm for actually inferring the posterior distribution
Poisson mixtures are relatively easy to use for various approximate methods (Gibbs sampling, variational inference, collapsed Gibbs sampling) compared to Gaussian mixtures.
Extended models can also be built that exploit the non-negativity of the Poisson distribution.
4.3.1 Poisson mixture model
Poisson distribution is used to learn multimodal discrete non-negative data such as histograms, as shown in the figure above.
Adopt Poisson distribution as the observation model for a cluster k as in the above equation.
The conditional distribution p(xn|sn, 𝛌) in the mixture distribution is as above
Only one of the k Poisson distributions is specified by sn
For the parameters of the Poisson distributions 𝛌={𝛌1,…,𝛌K}, set the prior distribution as in the above equation.
Use the gamma distribution, a conjugate prior distribution
a and b are hyperparameters given fixed values in advance.
They are shared across clusters for simplicity,
but you can also give each cluster its own values in advance by adding subscripts such as ak and bk.
The modeling of the Poisson mixture model is completed by providing probability distributions for the remaining latent variables S and the mixing ratio parameter π
p(S|π) = ∏n=1N Cat(sn|π)
p(π) = Dir(π|α)
Putting concrete values into the hyperparameters and generating data makes it possible to check the model visually, as in the sketch below.
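A minimal sketch of such a check: generate data directly from the Poisson mixture model with concrete hyperparameter values. The values of α, a, and b are my own illustrative choices (gamma rate parameterization assumed).

```python
# Generate data from the Poisson mixture model itself for a visual sanity check.
import numpy as np

rng = np.random.default_rng(0)
N, K = 300, 2
alpha = np.array([1.0, 1.0])     # Dirichlet hyperparameter for pi
a, b = 1.0, 0.05                 # gamma hyperparameters for each lambda_k (shape, rate)

pi = rng.dirichlet(alpha)
lam = rng.gamma(shape=a, scale=1.0 / b, size=K)   # numpy uses scale = 1 / rate
s = rng.choice(K, size=N, p=pi)
X = rng.poisson(lam[s])

print(pi, lam)
print(np.bincount(X)[:20])       # histogram-like counts of the generated data
```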
4.3.2 Gibbs sampling
Using Gibbs sampling, derive an algorithm for sampling the parameters 𝛌,𝛑 and the latent variable S from the posterior distribution of the Poisson mixture model.
The conditional distribution after X is observed is the above equation
For mixture models, sampling the latent variables and the parameters separately yields conditional distributions that are simple enough to work with.
Algorithm calculation strategy
First, compute the conditional distribution for sampling S={s1,…,sN}.
The key is to treat λ and π as if they were observed in addition to the data X
Proceed with the calculation focusing only on the random variable S of interest
The first line transforms the posterior distribution, via the definition of conditional distribution, into a form proportional to the joint distribution.
The second and third lines express the distribution as a product of conditional distributions according to the definition of the mixture model (above equation), and terms not related to S are dropped.
The resulting p(S|X,λ,π) factorizes into N independent distributions, one for each of s1,…,sN.
No need to sample everything at the same time
Calculate the probability distribution from which sn is sampled.
Proceed with the calculation in log space.
Also
In conclusion
If we prepare a new parameter ηn for simplified expression, the above equation becomes
By computing each ηn,k for each sn, we can sample sn from the categorical distribution
Find the conditional distribution to sample parameters
First, the calculation is performed for the distribution of λ. Taking the logarithm yields the above equation.
λ can therefore be sampled from K independent gamma distributions, as in the above equation.
For the distribution of π, the above equation becomes
π can be sampled from the Dirichlet distribution in the above equation
Final Gibbs Sampling Algorithm
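A hedged sketch of the resulting Gibbs sampler, following the conditionals derived above (categorical for each sn, gamma for each λk in rate parameterization, Dirichlet for π). X is assumed to be a 1-D NumPy array of non-negative integers; the function name and defaults are my own.

```python
import numpy as np
from scipy.special import logsumexp

def gibbs_pmm(X, K, a=1.0, b=0.1, alpha=None, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    alpha = np.ones(K) if alpha is None else alpha
    lam = rng.gamma(a, 1.0 / b, size=K)           # initial parameter samples
    pi = rng.dirichlet(alpha)
    for _ in range(iters):
        # sample s_n ~ Cat(eta_n), with eta_nk proportional to exp(x_n ln lam_k - lam_k + ln pi_k)
        logp = X[:, None] * np.log(lam) - lam + np.log(pi)
        eta = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
        s = np.array([rng.choice(K, p=eta[n]) for n in range(N)])
        counts = np.bincount(s, minlength=K)
        sums = np.array([X[s == k].sum() for k in range(K)])
        # sample lambda_k ~ Gam(a + sum_k x, b + N_k) and pi ~ Dir(alpha + counts)
        lam = rng.gamma(a + sums, 1.0 / (b + counts))
        pi = rng.dirichlet(alpha + counts)
    return s, lam, pi
```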
4.3.3 Variational Inference
Derive a variational inference algorithm for Poisson mixture distributions
To obtain an updated variational inference formula, we need to make the assumption of a decomposition approximation for the posterior distribution
Approximate the posterior distribution by separating the parameters from the latent variables.
Variational Bayesian expectation maximization algorithm
Apply variational inference formulas
Decompose the joint distribution according to the model's overall definition (above equation), absorbing all terms irrelevant to S into const.
The logarithm of the approximate distribution q(S) is expressed as a sum over the N data points.
The sum of the respective expected values is given by the above equation
and
The approximate distribution of Sn is a categorical distribution as in the above equation
To complete this update equation, expectation calculations for λ and π are required
Find approximate formulas for parameters
Apply the variational inference formula as above
The terms for λ and π separate independently.
Take out and compute only the terms related to λ.
The approximate posterior distribution of λ further factorizes into K independent distributions.
The kth approximate distribution is a gamma distribution, as in the above equation.
If only the term related to π is taken out and calculated, the above equation becomes
Dirichlet distribution as above
The required expected value is given by the above equation
Final Variational Inference Algorithm
Very similar to each sampling step of Gibbs sampling
Gibbs sampling can be derived by simply replacing the expectation calculation portion of the variational inference update equation with sample values
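A hedged sketch of the corresponding variational inference updates, which indeed mirror the Gibbs steps above with expectations in place of samples. Gamma rate parameterization and the variable names are my own assumptions.

```python
import numpy as np
from scipy.special import digamma, logsumexp

def vi_pmm(X, K, a=1.0, b=0.1, alpha=None, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    alpha = np.ones(K) if alpha is None else alpha
    eta = rng.dirichlet(np.ones(K), size=N)       # responsibilities <s_nk>
    for _ in range(iters):
        # q(lambda_k) = Gam(a_hat_k, b_hat_k), q(pi) = Dir(alpha_hat)
        a_hat = a + eta.T @ X
        b_hat = b + eta.sum(axis=0)
        alpha_hat = alpha + eta.sum(axis=0)
        # required expectations
        E_lam = a_hat / b_hat
        E_lnlam = digamma(a_hat) - np.log(b_hat)
        E_lnpi = digamma(alpha_hat) - digamma(alpha_hat.sum())
        # q(s_n) = Cat(eta_n)
        logp = X[:, None] * E_lnlam - E_lam + E_lnpi
        eta = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
    return eta, a_hat, b_hat, alpha_hat
```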
4.3.4 Collapsed Gibbs sampling
Derive an algorithm for collapsed Gibbs sampling against a Poisson mixture model
In collapsed Gibbs sampling, the parameters are first marginalized out of the joint distribution (see the equation above).
Comparison of the model after (right) and before (left) marginalization
In the marginalized model, the variables s1,…,sN all depend on each other.
To sample s1,…,sN simultaneously from the posterior distribution above, all K^N possible combinations of S would have to be evaluated and then normalized.
Analytical approach not possible, requires Gibbs sampling approach
Once the parameters have been marginalized, all that remains is to sample S from p(S|X).
Define the set of all latent variables other than the variable sn to be sampled as S\n={s1,…,sn−1,sn+1,…,sN}.
It is enough if the conditional distribution p(sn|X,Sn) is obtained as a simple enough probability distribution
Approach
Using the definition of conditional distribution and expressing it in terms of the joint distribution gives the above equation.
Expressing the joint distribution as a product of conditional distributions gives the above equation.
Considering the term p(X\n|sn,S\n) on the graphical model,
X\n and sn are conditionally independent,
so the irrelevant parts can be removed from the condition.
The final result is the above equation
The resulting equation is decomposed into the following two equations
Equation
First, the upper equation is
The previous equation can be interpreted as the predictive distribution of sn as in the above equation
The term p(π|S\n) on the right-hand side represents the posterior distribution of π when the N−1 samples in S\n are treated as if they were observed data.
Applying Bayes’ theorem, the above equation becomes
Same as learning parameters using categorical and Dirichlet distributions
Dirichlet distribution with K-dimensional vector αn as a parameter (above equation)
Using the above expression for the posterior distribution of π,
The final result is the above equation
The lower equation is
calculated in the same way as the upper equation.
Bayes' theorem is used to compute the posterior-distribution-like term on the right-hand side.
It will be Bayesian inference using the Poisson and gamma distributions
It will be a gamma distribution as well as a prior distribution
Using the previous equation, integrate and eliminate the parameter λ to compute some sort of predictive distribution for xn
It will be a negative binomial distribution
In collapsed Gibbs sampling, the sn are sampled one at a time, so
the posterior-distribution calculation for the jth sample, obtained immediately after the ith sample, is given above.
α̂j,k is the quantity that was used to sample si;
adding the actually sampled si,k and removing the contribution of the previously sampled sj,k
gives the α̂j,k needed to sample a new sj.
Similarly, other parameters can be expressed by the above equation
Collapsed Gibbs sampling algorithm for the Poisson mixture model
4.3.5 Simple experiment
Application of the three derived approximation methods to toy data
Results of clustering simple bimodal data represented by histograms with variational inference based on a Poisson mixture model with K=2
Two Poisson distributions with different means are combined to represent the data
Data in the middle of the two clusters will have ambiguous affiliation, resulting in a soft expected value of cluster affiliation <sn>.
Results of increasing the number of clusters to K=8 and comparing the performance of the approximation methods in computation time.
Vertical axis is ELBO
Perform 10 experiments on each dataset and use the average ELBO obtained as the result
Variational inference yields fast convergence performance in the early stages
Methods based on Gibbs sampling give better final results
Collapsed Gibbs sampling is good for both initial and final
It may be better to start with Gibbs sampling and then move on to variational inference or collapsed Gibbs sampling.
4.4 Inference in Gaussian Mixture Models
Introduction.
As the observation model, consider a multidimensional Gaussian distribution parameterized by its mean and precision matrix.
Gibbs sampling, variational inference, and collapsed Gibbs sampling are derived as approximate inference algorithms for the posterior distribution.
The Gauss-Wishart distribution is used as the conjugate prior distribution for the multidimensional Gaussian.
4.4.1 Gaussian mixture model
The Gaussian mixture model uses a Gaussian distribution as the observation model for the data xn∈ℝD in each cluster k.
If the parameters are θk={μk,Λk}, the above equation can be expressed as
μk∈ℝD, Λk∈ℝDxD
The conditional distribution including latent variables is as above
The prior distribution corresponding to the parameters of this observation model is expressed above using the Gauss-Wishart distribution
m∈ℝD,β∈ℝ+,W∈ℝDxD,γ>D-1
The hyperparameters are assumed to be common across clusters.
You can also give each cluster its own values by adding subscripts such as mk∈ℝD.
4.4.2 Gibbs sampling
Deriving the Gibbs Sampling Algorithm for Gaussian Mixture Models
The posterior distribution of interest is the above equation
Separate sampling of parameters and latent variables as in the above equation
Find the conditional distribution for sampling S
Assume that the parameters μ, Λ, and π have already been sampled
The above equation is obtained by focusing only on the terms related to S.
Rewriting the posterior distribution as proportional to the joint distribution, substitute the joint distribution of the mixture model.
Taking the logarithm yields the above equation.
Since ln p(sn|π) is already given, putting the two together yields the above equation.
We see that sn should be sampled from a categorical distribution like the one above (see the sketch below).
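A hedged sketch of exactly this step: computing the categorical probabilities for each sn from already-sampled μ, Λ (precision matrices), and π, then drawing the assignments. The constant −D/2·ln 2π is dropped because it cancels in the normalization; the function name is my own.

```python
import numpy as np
from scipy.special import logsumexp

def sample_assignments(X, mu, Lam, pi, rng):
    # log N(x_n | mu_k, inv(Lam_k)) + log pi_k, normalized per data point
    N, D = X.shape
    K = len(pi)
    logp = np.zeros((N, K))
    for k in range(K):
        diff = X - mu[k]
        _, logdet = np.linalg.slogdet(Lam[k])
        quad = np.einsum('nd,de,ne->n', diff, Lam[k], diff)
        logp[:, k] = 0.5 * logdet - 0.5 * quad + np.log(pi[k])
    eta = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
    return np.array([rng.choice(K, p=eta[n]) for n in range(N)])
```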
Next, assuming that all the latent variables S are given as samples, find the conditional distributions for the remaining parameters.
Rewrite the equation in a form proportional to the joint distribution, and further decompose it into a product of conditional distributions according to the definition of the mixture distribution.
μ and Λ cannot be decomposed
The distribution of the parameters μ and Λ of the observed distribution and the distribution of π, which represents the mixing ratio, can be decomposed into separate terms
Calculation of μ
The (joint) conditional distribution of μ and Λ is
Decompose into K independent distributions
Calculation Continued
We see that μk can be sampled according to a multidimensional Gaussian distribution.
Calculation of Λ
Find p(Λk|X,S), which is the above equation from the relationship of the conditional distribution
Using the μk result from the previous section, the above equation becomes
Λk can be sampled from a Wishart distribution
Conditional distribution for sampling π
The equation obtained by Gibbs sampling of the Poisson mixture model can be used as is.
Gibbs sampling algorithm for the posterior distribution of a Gaussian mixture model
4.4.3 Variational Inference
Variational inference for Gaussian mixture models also separates latent variables and parameters, as in the above equation
First, apply the variational inference formula with respect to q(S)
The approximate distribution q(S) factorizes into N terms.
If we focus only on one sn, we can calculate
and
sn follows the categorical distribution in the above equation.
q(μ,Λ) and q(π) are Gaussian-Wishart and Dirichlet distributions
Approximate distribution with respect to parameters
It is divided into two parts: a term related to the parameters μ and Λ for the distribution of observed variables, and a term related to the parameter π for the distribution of latent variables.
First, take out only the terms related to μ and Λ
K approximate posterior distributions can be computed, one for each cluster.
First, organize the expression with respect to μk.
The approximate distribution of μk, conditioned on Λk, is expressed by the above equation.
Next, we obtain Λk. The relationship of the conditional distribution leads to the above equation
Substituting the obtained q(μk|Λk), we get the expression for Λk as in the above equation
Therefore, the above formula is obtained as the Wishart distribution
Approximate distribution of π
The equation obtained in the Poisson mixture model can be used without modification.
The expectation required for the update equation is the above equation
Variational inference algorithm in the final Gaussian mixture model
4.4.4 Collapsed Gibbs Sampling
Derivation of the algorithm for collapsed Gibbs sampling for Gaussian mixture models
Consider a Gaussian mixture model with all parameters μ, Λ, and π marginalized out.
As in collapsed Gibbs sampling for the Poisson mixture model, compute the procedure for sampling a new sn, assuming that S\n has already been sampled.
The conditional distribution for sn is decomposed into two terms as in the above equation.
For p(sn|S\n), we again use the formula obtained for the Poisson mixture model (above).
Only the likelihood term is different, so recalculate it.
In addition
The following Gauss-Wishart distribution is obtained as the posterior distribution
where the parameters are defined by the corresponding equation.
The marginalized predictive distribution is a Student's t-distribution, expressed in the above equation.
Gauss-Wishart equation
Final collapsed Gibbs sampling algorithm for the Gaussian mixture model
4.4.5 Simple experiment
Results of clustering N=200 two-dimensional observations as a Gaussian mixture model with K=3 by variational inference
Multidimensional Gaussian mixture model allows not only clustering but also K-class classification
Graphical Model
Assume a situation where the class assignment of each data point is already labeled as S
Predict S* for new input x*.
Example of class prediction by Gaussian mixture model (K=3)
Judged solid because there are no red data points in the rightmost region of the right figure
Big Data and Bayesian Learning
If you have a lot of data, why not just use maximum likelihood estimation?
Bayesian estimation also approaches maximum likelihood as N is increased
Cases where the number of data points N is sufficiently large often do not require elaborate analysis in the first place.
The more detailed the analysis, the less data is available
Bayesian learning can integrate diverse information
With high-dimensional data, maximum likelihood estimation causes significant over-fitting.
Bayesian learning shows its strength as the number of available data dimensions (number of feature types) D grows, rather than as the number of data points N grows.
Chapter 5 Applied Model Building and Inference
5.1 linear dimensionality reduction
Introduction
Mapping multidimensional data to low-dimensional space
Techniques for reducing data volume, extracting feature patterns, and summarizing and visualizing data
The main trends of the data can be adequately represented in a space of dimension M, which is much smaller than the dimension D of the observed data.
Related Technologies
probabilistic principal component analysis
Factor analysis
probabilistic matrix factorization
Specific examples
Compression of image data and interpolation of missing values
5.1.1 Model
Associate low-dimensional latent variables X={x1,…,xN} with the observed data Y={y1,…,yN}.
Define probability distribution with respect to yn (above equation)
Assume a linear relationship between yn ∈ ℝD and xn ∈ ℝM.
𝝈2y ∈ ℝ+ is the common variance parameter in all dimensions
σ2y plays the role of allowing for discrepancies between yn and the mean WTxn + μ that cannot be expressed by the linear part.
Assume fixed values this time
If you want to learn from the data, you can also infer this by using a Gamma or Wishart prior.
Assuming the above distribution for W, μ, and X
The matrix parameter W ∈ ℝMxD
Wd ∈ ℝM is the d-th column vector of matrix W
Bias parameter 𝝁 ∈ ℝD
The covariance matrices ΣW and 𝚺μ are fixed hyperparameters
W and 𝝁 are parameters that affect the entire data
xn is the latent variable for each data point yn
The overall joint distribution of all the variables is given by the above equation.
Graphical model
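Below is a minimal sketch of the generative side of this model: latent xn, matrix W, bias μ, and Gaussian noise producing yn = W^T xn + μ + ε. The prior scales and noise level are my own illustrative assumptions.

```python
# Generate data from the linear dimensionality reduction model.
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 200, 10, 2
sigma_y = 0.1                                    # fixed observation noise std

W = rng.normal(0.0, 1.0, size=(M, D))            # W: each column w_d lives in R^M
mu = rng.normal(0.0, 1.0, size=D)                # bias parameter
X_latent = rng.normal(0.0, 1.0, size=(N, M))     # latent variables x_n in R^M
Y = X_latent @ W + mu + rng.normal(0.0, sigma_y, size=(N, D))   # observed data

print(Y.shape)       # (N, D): high-dimensional data with an M-dimensional structure
```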
5.1.2 Variational inference
Posterior distribution given observed data
The multiple integral required to calculate the denominator cannot be solved analytically
Approximate the true posterior distribution in decomposed form using variational inference
Using the variational inference formula (above)
The above three equations should be solved
Find the approximate posterior distribution of μ
Organize with respect to 𝝁
If we organize the prior distribution with respect to μ
Adding the terms squared with respect to μ yields the above equation
Finally, it can be summarized as a Gaussian distribution with respect to μ
Find the approximate posterior distribution of W
Compute approximate formula for W
Decompose into D independent terms and write down
The terms of the prior distribution of W are the above equation
Can be expressed as a sum of D terms
The posterior distribution of W is expressed by the above equation
Find the approximate posterior distribution of the latent variable X
Since it is a simple sum of N, it is sufficient to find the form of the distribution for some nth latent variable xn.
The term of the prior distribution of Xn is given by the above equation if we focus on the term for xn.
Adding these results together yields the above equation
Therefore, the approximate posterior distribution of X is the above equation
The expected value required for the calculation is the above equation.
If you want to do Gibbs sampling, simply replace the expected value above with the sampled value.
5.1.3 Data lossy compression
Example of face image compression on the Olivetti faces dataset using the linear dimensionality reduction algorithm
Model Equation
M=32 compresses data size down to 3%.
Can be used for smoothing with reduced noise
M=4 compresses the data size down to about 0.4%.
The result looks almost the same.
5.1.4 Interpolation of missing values
Interpolate missing values using linear dimensionality reduction
Partial loss of data due to temporary inconvenience of sensors, communication, or servers for acquiring and storing data
When you want to integrate multiple sensors for some kind of processing and some sensors fail or are not equipped due to cost
Interpolation of incomplete data such as user questionnaires and profile information
Interpolation of missing values is a common task for systems using machine learning
Using nearby values or averages as a preprocessing step before processing at a later stage makes it difficult to distinguish between interpolated data and the real thing.
It is more consistent to interpolate missing values based on the structure (model) assumed for the data than to fill them in as a preprocessing step.
Interpolation of missing values in Bayesian learning
Treat missing values as unobserved, unconditioned variables
Suppose a partial element of some nth data yn is missing
yn,Ď ∈ ℝĎ is the missing part of the data, and the remaining (D−Ď)-dimensional part of yn is the observed part.
Let Y\n be the set obtained by removing only yn from the entire data set Y.
Ideally, we need the distribution of the missing part yn,Ď conditioned on all the observed data, with every other unobserved variable marginalized out.
Solution.
Decompose and approximate the posterior distribution as in the above equation
Update with variational inference loop
Update q(yn,Ď) using the variational inference theorem
WĎ is the matrix W taken up to the first Ď columns
μĎ is a vector from which only the first Ď elements of μ are extracted
Approximation of missing values results in a Gaussian distribution as in the above equation
The distribution uses the expected values <xn>, <W>, and <μ> to represent the mean of the missing values yn,Ď.
The formula is given above
The parameters and latent variables are learned at the same time as the missing values are supplemented with the estimated values obtained in the previous equation.
Example of completing masked (missing) pixels in a face image using the missing-value-completion version of linear dimensionality reduction derived above
Data uniformly missing with 50% probability
Roughly correct image has been recovered.
5.2 Nonnegative matrix factorization NMF
Introduction
A method for mapping data to a low-dimensional subspace
Assumes non-negativity for observed data and all other unobserved variables
Where to apply
Applicable to any non-negative data
Applicable to image data compression and interpolation
This method often produces a better representation when handling audio data in the frequency band using the Fast Fourier Transform.
Also used in recommendation algorithms and natural language processing
Various probabilistic model representations have been proposed
5.2.1 Models
Non-negative matrix factorization as described in “Overview of non-negative matrix factorisation (NMF) and examples of algorithms and implementations” approximates the decomposition of a matrix X, each element of which has a non-negative value, into W and H
A matrix X ∈ ℕD×N whose elements are non-negative
Two non-negative matrices W ∈ ℝ+D×M and H ∈ ℝ+M×N
Writing element by element in X yields the above equation
Define new auxiliary variable S as the upper expression
By introducing a latent variable S, efficient variational inference and Gibbs sampling algorithms can be derived.
Often, by introducing an auxiliary latent variable into the model, the updated equation can be efficiently derived.
Auxiliary variable S is modeled to occur according to Poisson distribution
The observed data is expressed using a delta distribution as in the above equation.
The delta function is a probability distribution for discrete values as above
From a generative point of view, the value obtained by summing S over m in the above equation is taken as the value of Xd,n as is.
For the prior distribution of W and H, set the gamma distribution, which is the conjugate prior distribution of the Poisson distribution.
aW∈ℝ+, bW∈ℝ+, aH∈ℝ+, bH∈ℝ+ are hyperparameters of the gamma distributions.
The above equation summarizes the relationships among these random variables as a joint distribution.
Graphical Model
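Below is a minimal sketch of the NMF generative model just described: gamma priors on W and H, Poisson latent counts S, and X obtained by summing S over m. Sizes and hyperparameter values are my own illustrative choices (gamma rate parameterization assumed).

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 20, 3, 50
aW, bW, aH, bH = 1.0, 1.0, 1.0, 1.0          # gamma hyperparameters (shape, rate)

W = rng.gamma(aW, 1.0 / bW, size=(D, M))
H = rng.gamma(aH, 1.0 / bH, size=(M, N))
S = rng.poisson(W[:, :, None] * H[None, :, :])   # S[d, m, n] ~ Poi(W[d,m] * H[m,n])
X = S.sum(axis=1)                                 # X[d, n] = sum_m S[d, m, n]

# the expectation of X is exactly the matrix product W @ H
print(np.allclose(X.mean(), (W @ H).mean(), rtol=0.2))
```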
Example of matrix factorization using variational inference for audio data
The central figure is a spectrogram of organ performance data X
Horizontal axis indicates time, vertical axis indicates frequency, and colors indicate energy intensity
The left two columns graph the estimation results for W:,1 and W:,2 when M=2.
Each displays the high-frequency and low-frequency patterns of the organ
The two columns of graphs above show the results of estimating H
Represents how the W pattern is used in the time series
First, a pattern representing the bass part is used in the first 0.6 seconds.
The pattern that wobbles the treble part begins to be used from around 3.3 seconds in the middle.
Non-negative factorization is used to map the spectrogram to a low-dimensional space and extract characteristic patterns
It is also used to recover high-frequency regions in the data using the idea of missing value completion.
5.2.2 Variational inference
Derive an algorithm for inferring the unobserved W, H, and S posterior distributions from the constructed model
Using variational inference, assume that the posterior distribution can be decomposed into each type of variable (equation above)
Applying the variational inference formulas and substituting the joint distribution of the model expressed in the previous equation, the update formulas for the approximate distributions are written as in the above equation.
CONTINUED
First, find the approximate posterior distribution of W
Neglecting the terms irrelevant to Wd,m, the above equation is obtained.
The logarithmic term of the gamma prior becomes the above equation
Tidying these up, we obtain the above equation.
Finally, by introducing âW and b̂W, the approximate posterior distribution of Wd,m becomes a gamma distribution as in the above equation.
Next, find the approximate posterior distribution of H
Since W and H appear symmetrically in the model definition, the result can be derived by the same procedure as the calculation for W.
Finally, find an update equation for the latent variable S
To organize with respect to S
The approximate distribution q(S) can be expressed as a multinomial distribution with Xd,n as the number of trials, as in the above equation.
The expected value required to update each approximate distribution is given by the above equation
5.3 Hidden Markov model HMM
Introduction
Modeling for time series data
Where to apply
Traditional speech signals and string data
Nucleotide sequences and financial transaction data
In previous models
Given the parameter θ, the data X={x1,…,xN} were assumed to be independent, so the probability distribution factorizes over the data points.
In the real world, they are often not time independent
Example.
Non-negative integer series data with N=500 data
Inference results with HMM
First half and second half of data show different trends
Poisson Mixture Model (PMM) Inference Results
The inference results show a lot of fine switching, because the model can only treat each data point independently.
5.3.1 Model
Hidden Markov models are obtained by adding time dependence to the latent variables of a mixture model with K clusters.
In the PMM, the latent variables S={s1,…,sN} are assumed to be independent of each other, as in the above equation.
In the HMM model, there is a dependency between adjacent latent variables sn,sn-1 (see above)
Hidden Markov process latent variable S
A: a parameter of size K×K governing the transitions between states
π: in the hidden Markov model it only determines the initial state s1 (whereas in the mixture model it was the parameter governing the mixing ratio of the entire data)
Depends only on one neighboring state
Some extensions to mth-order Markov chains
A detailed description of the initial probability parameter π and the transition probability matrix governing the state series S
Fix the number of states as K
Initial state s1 is determined by a categorical distribution with parameter π
As the prior distribution of π, we introduce the Dirichlet distribution, the conjugate prior of the categorical distribution (above equation).
The transition probability matrix A is represented as a KxK matrix in the first-order Markov chain model, and the probability value of transition from state i to state j is represented by the element Aj,i
Example
Set A as above in a 3×3 matrix
A2,3=0.4
If the n-1th state is sn-1,3=1, the probability that the next nth state is sn,2=1 is 0.4
Each column of the transition matrix must sum to 1, since there must always be a transition to one of the K=3 states.
It is also common to place hard constraints on the transitions between states, such as A1,2=0.
State transition diagram
Using the transition probability matrix A, the probability of transition from sn-1 to sn can be concretely expressed as the above equation.
The previous variable sn−1 selects the column A:,j from which the next state is drawn.
Since each column of A is simply the parameter of a categorical distribution, the corresponding prior distribution can be a K-dimensional Dirichlet distribution (above).
K-dimensional vector β:,j is a hyperparameter for the transition probability matrix
By setting each element βj,i of this value, various transition structures can be assumed for the matrix A.
Model the observation-generating distribution as switching according to the given latent variable sn
Let p(X|S,Θ) be the observation model for the data X={x1,…,xN}.
The joint distribution then becomes the above equation.
The graphical model is shown above
A series of states is generated starting from S1, and an observation xn is generated from each state.
The distribution for the observed data X in the hidden Markov model can be freely chosen.
Example
Poisson distribution
Non-negative
Gaussian distribution
Categorical distribution
When analyzing DNA, RNA, and other nucleotide sequences, the type of base is represented using a categorical distribution
Since we want to treat the observed data xn as one-dimensional non-negative integers, we use the Poisson distribution with parameters Θ=λ={λ1,…,λK} as the observation model.
Introduce a gamma prior as a conjugate prior for each parameter λk
a and b are hyperparameters given fixed values in advance.
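Below is a minimal sketch of generating a sequence from this hidden Markov model with Poisson observations; the values of π, A (columns summing to 1), and λ are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 500, 2
pi = np.array([0.5, 0.5])                    # initial-state distribution
A = np.array([[0.99, 0.03],                  # A[j, i] = p(next = j | current = i)
              [0.01, 0.97]])                 # each column sums to 1
lam = np.array([5.0, 20.0])                  # Poisson rate of each state

s = np.zeros(N, dtype=int)
s[0] = rng.choice(K, p=pi)
for n in range(1, N):
    s[n] = rng.choice(K, p=A[:, s[n - 1]])   # transition chosen by column s[n-1]
X = rng.poisson(lam[s])                      # observation from the selected state

print(np.bincount(s), X[:10])
```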
5.3.2 Fully Decomposed Variational Inference
Find an algorithm for the posterior distribution by variational inference for a hidden Markov model based on a Poisson observation model
The hidden Markov model is simply a mixture model with a time dependence on the latent variables.
Approximate inference by decomposing into parameters and latent variables based on the same idea as the mixture model
To simplify the handling of time series, the time direction is also decomposed into pieces for inference.
Approximation assuming decomposition in the time series direction for q(S)
Variational inference formula is applied to transform the equation (above equation)
Computed from an approximate distribution of parameters first
Parameter λ, related to the observation model, can be calculated independently of other parameters π and A
If only the term related to λ is taken out, the above equation is obtained.
The calculation can be done in exactly the same way as the approximate posterior distribution of the parameter λ in the variational inference for the Poisson mixture model in Chapter 4.
About π and A?
Further separate calculations for π and A
First, calculate from π
Logarithmic computation using the defining formulas for categorical and Dirichlet distributions yields the above equation.
The approximate posterior distribution is the Dirichlet distribution as above
Just add the expected value of the initial state s1 to the hyperparameter of the prior distribution.
Compute the approximate posterior distribution with respect to A
Organized by the terms related to A, the above equation becomes
The resulting approximate posterior distribution can be expressed as K independent Dirichlet distributions (above)
It means that the distribution over the transition matrix parameters is computed by counting the expected value of the transition from state i to state j over all states
Although the expected value is expressed as <sn−1,i sn,j>, under the decomposition assumption it simply factorizes into the product of the individual expectations, as in the above equation.
Calculation of approximate distribution of state series S
Organizing with respect to s1,
the posterior distribution of s1 is finally a categorical distribution as in the above equation, with η1 as the parameter of the approximate distribution.
For the other sn (1 < n < N), the distribution takes the form above, where the parameters are defined by the corresponding equation.
For n = N, the distribution takes the corresponding form, where the parameters are again defined by the corresponding equation.
All update formulas are as in the above equation
5.3.3 Structured Variational Inference
Inference on state series does not require decomposition in the time direction
As in the mixture model case, an efficient algorithm can be derived simply by assuming a decomposition into parameters and latent variables (above equation).
Structured variational inference
Applying the variational inference formula, the formula for the approximate distribution for q(S) is given above
Instead of finding the parametric probability distribution of q(S), we find the expected values such as <sn,j> and <sn-1,sn,j>, which are the minimum required for the approximate calculation of parameters such as λ, π, and A.
The goal is to find the above two marginal distributions.
S\n is the set S with sn removed.
S\{n−1,n} is the set S with sn−1 and sn removed.
Difficult to compute in a simple way
Approach using the forward-backward algorithm
An exact inference algorithm on the graphical model
From the above equation
Variational inference formula for q(S)
Decompose q(S) into a product of four terms
The above equation is obtained by exponentiating the expected value of each
Using these, the marginal distribution q(sn) can be written as
where
q(sn-1,sn) is the above equation
f(sn) is the above equation
b(sn) is the above equation
Expected values of the states when using structured variational inference
Inference divided into mini-batches
Reference: How to Evaluate an Algorithm?
The most important points of concern in application
Predictive performance for the value you want to recommend
Separate the entire data into training and validation data and use the degree of prediction for the validation data as a quantitative evaluation of the final algorithm
Regarding prediction accuracy
Squared error for continuous value estimation
Percentage of correct answers for classification problems
Choose evaluation metrics appropriate to the task and domain knowledge.
Regarding computational cost
Flexibly adjust methods according to computational cost and accuracy requirements
Maintenance and Support Perspectives
Update as environment changes
5.4 Topic Model
Introduction
Generic name for generative models for analyzing documents written primarily in natural language
The simplest example
Assume that a potential topic (politics, sports, music, etc.) exists behind a document that is a list of words, and that each word in the document is generated based on that topic.
By using topics learned with large amounts of document data
Classify and recommend news articles.
Retrieve semantically related documents from a given word query
Applied to image and genetic data as well as natural language processing
5.4.1 Model
LDA treats documents as frequency-of-occurrence-only data, ignoring word order
Assume that there are multiple latent topics in each document
In an article on Major League Baseball, topics such as sports and foreign countries exist in the background.
No actual labeling of topic names.
Each topic is a distribution of the frequency of occurrences of a vocabulary (type of words)
Example
For the topic of sports, words such as “game,” “score,” “baseball,” “soccer,” and “win/loss” appear more frequently.
For overseas topics, words such as “international,” “trade,” “leaders,” “UK,” and “Japan-U.S.” appear frequently.
Assumptions.
Let V be the total number of word types in the vocabulary, indexed by v=1,…,V.
wd,n ∈ {0,1}V denotes the nth word in document d (1-of-V representation).
A document d is the set of words Wd={wd,1,…,wd,N}.
Furthermore, the set W={W1,…,WD} is called the document set.
The document set W is the observed data in LDA.
When the total number of topics is fixed at K, the kth topic ϕk∈(0,1)V represents the probability of occurrence of each word type v=1,…,V.
If ϕk,v=0.01, then topic k (e.g., sports) contains the word v (e.g., game) at a rate of 1%.
Each document is associated with a variable called topic proportion θd∈(0,1)K
Zd,n∈{0,1}K is the topic assignment for the nth word in document d, a latent variable indicating from which of the K topics the word wd,n was generated
The generative process for these variables is as follows.
The words wd,n and topic assignments zd,n are generated by a categorical distribution as in the above equation
Φ={ϕ1,… ,ϕk)
Word wd,n is generated from the kth topic specified by topic assignment zd,n
Specific words are generated according to the frequency distribution of the vocabulary represented by Φk
Latent variable zd,n is generated according to the topic ratio θd that document d has
The higher the value of θd,k, the higher the probability that zd,n,k=1 (topic k) in document d.
The parameters of each categorical distribution are assumed to be generated from a Dirichlet prior distribution as above due to conjugacy
α and β are the hyperparameters of the respective Dirichlet distributions.
Fixed values are given in advance
Graphical model
Let W, Z, Φ, and Θ be the collective expressions for the variables wd,n, zd,n, ϕk, and θd, respectively; the overall joint distribution can then be broadly summarized as in the above equation.
Each component distribution, p(W|Z,Φ), p(Z|Θ), p(Θ), and p(Φ), is given by the corresponding equation above.
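Below is a minimal sketch of the LDA generative process just described; corpus sizes and the Dirichlet hyperparameters are my own illustrative values, and words are represented by integer indices rather than 1-of-V vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, V, Nd = 5, 3, 50, 100                  # documents, topics, vocabulary, words/doc
alpha = np.full(K, 0.5)                      # Dirichlet hyperparameter for theta_d
beta = np.full(V, 0.1)                       # Dirichlet hyperparameter for phi_k

phi = rng.dirichlet(beta, size=K)            # phi[k] = word distribution of topic k
theta = rng.dirichlet(alpha, size=D)         # theta[d] = topic ratio of document d

docs = []
for d in range(D):
    z = rng.choice(K, size=Nd, p=theta[d])               # z_dn ~ Cat(theta_d)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # w_dn ~ Cat(phi_{z_dn})
    docs.append(w)

print(theta[0], docs[0][:10])
```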
Results of estimating topic distribution Φ and latent variable Z on actual data
The figure above lists the most frequent words for each extracted topic.
Same word exists across topics
For a given document d, each word is colored according to the topic for which the expected value <zd,n> of its corresponding latent variable is highest.
A document is made up of various topics.
5.4.2 Variational Inference
Goal of inference by LDA
to compute the posterior distribution of the remaining variables given the document set (above equation).
To obtain this distribution analytically, it would be necessary to compute the marginal likelihood p(W) that appears in the denominator.
Difficult to achieve
Approximate posterior distribution by variational inference
Assuming a decomposition of the true posterior distribution with respect to the latent variable Z and other random variables as in the above equation, an analytic update law can be derived
Applying the variational inference formula based on this process yields the above equation
Continued
Perform specific calculations for each update formula
Approximate posterior distribution of latent variable Z
It is sufficient to perform the calculation independently for each element zd,n
Calculating the first term yields the above equation
The second term is the above equation
Adding these together and considering the constraint ∑ = 1, the posterior distribution of zd,n can be expressed as a K-dimensional categorical distribution as in the above equation
where the parameter is defined by the corresponding equation.
Approximate posterior distribution of parameters
The approximate distributions of Θ and Φ can be computed independently of each other
If we first consider only the terms related to θd, we obtain the above equation
If we consider the constraint Σ=1, we obtain a Dirichlet distribution as in the above equation
where the parameter is defined by the corresponding equation.
Compute the approximate posterior distribution of Φ
and
Given the constraint of ∑=1 for Φk, the Dirichlet distribution is as in the above equation
where the parameter is defined by the corresponding equation.
The final expected value calculation is as above
Initialize the variational parameters β̂k,v, α̂d,k, and ηd,n,k, and iteratively apply the update formula for each distribution.
5.4.3 Collapsed Gibbs Sampling
Deriving collapsed Gibbs sampling for LDA
As in the case of the mixture model, consider a new model in which the parameters have been marginalized out of the probabilistic model, and sample the latent variables one by one.
A similar approach is used for LDA
Graphical model with parameters Θ and Φ removed in LDA
The effect of marginalization is that the corresponding child nodes become dependent on each other.
Instead of drawing all the edges of the resulting complete graph, the nodes are enclosed with dotted ellipses.
Calculate the latent variable z corresponding to the graphical model above
New notations Zd,\n and Z\d are introduced.
Z = {zd,n} ∪ Zd,\n ∪ Z\d
Examining the conditional distribution of zd,n from the marginalized model yields the above equation
Decompose the joint distribution into smaller pieces as a product of conditional distributions.
Term on Wd,n
It cannot be simplified any further:
considering the Markov blanket, wd,n forms a complete graph with Wd,\n and W\d, so they cannot be removed from the condition.
The term Wd,n
zd,n disappears from the condition as shown in the above equation
zd,n has no co-parent relationship with any node of W once wd,n is removed.
As for the above equation
If we check for independence by Markov blanket while deleting nodes, we get the above equation.
The conditional distribution of zd,n, taking into account the terms not related to zd,n, can be summarized as the above equation.
By examining each term, computing the value for each case zd,n,k=1, and normalizing, we obtain the discrete (categorical) distribution for sampling zd,n.
Equation
This is the predictive distribution computed from the Dirichlet posterior p(θd|Zd,\n) that excludes only the nth latent variable zd,n.
Computing this for all k=1,…,K gives the K-dimensional categorical distribution for sampling zd,n (see the sketch below).
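A hedged sketch of collapsed Gibbs sampling for LDA using the usual count-based form of these predictive terms. `docs` is assumed to be a list of 1-D arrays of word indices (as generated above); the function name and scalar hyperparameters are my own.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.5, beta=0.1, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial topics
    ndk = np.zeros((D, K))                                  # document-topic counts
    nkv = np.zeros((K, V))                                  # topic-word counts
    nk = np.zeros(K)
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]
                # remove the current assignment from the counts
                ndk[d, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1
                # predictive probability of each topic given all other assignments
                p = (ndk[d] + alpha) * (nkv[:, v] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    return z, ndk, nkv
```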
5.4.4 Applications and Extensions of LDA
Application of LDA to document data
Clustering, classification and recommendation of documents using topic ratio Θ
Find documents with the highest probability of occurrence of w given a certain word query
Points where LDA is useful
Since the probability that a word w is generated from a given document is determined via the topic Φ and its ratio Θ
Even if w has never appeared in a document, it can be targeted for search based on its semantic content.
Model Extension of LDA
Introduce Markovianity between words
Inference can take into account relationships between words (e.g., idioms) that are ignored in the bag of words.
Temporal dependence of topics on a chronologically ordered set of documents
Model the evolution of the distribution of words in a topic
Chinese restaurant process (CRP)
Automatically infer the number of topics K from the data
Infer hierarchical structure among topics
Incorporate a variety of additional data into the model, rather than just treating a list of words as information
Correlate article metadata (e.g., author information) with topic ratios
Analyze similarities among multiple authors
Applications of LDA beyond documents
Computer Vision
Build image data and accompanying caption models, and
Application to automatically caption input images
Using partial image patches called “visual words”,
An approach to analyze the composition of images
Bioinformatics
Pattern analysis of genetic data using a model almost identical to LDA
5.5 Tensor Decomposition
Introduction.
Mainly used in recommender systems for items (books, movies, restaurants, etc.)
In the field of machine learning, a tensor is often simply a multidimensional array such as Rn,m,k
Treated as an extended version of a matrix, which is a two-dimensional array
Closely related to linear reduction models
5.5.1 Collaborative filtering
In the recommendation framework called collaborative filtering, the user's item purchase records and ratings (the user's evaluations of products) are used to predict which items the user will be interested in next.
There are two types of methods
Memory-based methods
Calculate similarity Sim(i,j) between different users i and j using some method
Model-based methods
This section takes the model-based approach.
Example: Consider rating data associated with fictitious users and items
Let Rn,m ∈ ℝ be the rating value of user n for item m, and assume that it is determined by a generative process as in the above equation
U:,n∈ℝD and V:,m∈ℝD are the feature vectors of user n and item m, respectively.
The more the directions of the two vectors agree and the larger their magnitudes, the higher the rating Rn,m becomes.
Add a noise term εn,m to allow for some exceptions
Matrix-based representation
R∈ℝNxM
U∈ℝDxN
V∈ℝDxM
Results of rating prediction
Subspace is set to D=2 dimensions
Interpolation results for Item7
User2 and User1 are high, User3, User4 and User5 are low
Different preferences of each user are extracted in a low-dimensional space
User5 has overall low rating
Extension to 3D
Utilize timestamps on ratings and when purchases were made
Capture temporal trends of items
Representation of ratings by tensor
Represent the rating data in a 3-dimensional array R∈ℝNxMxK
Matrix representation
Newly added Sd,k∈ℝ is the prevalence of feature d at time k
Extract potential U, V and S from observed data R
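Below is a minimal sketch of this tensor representation: the rating tensor R is built as R[n,m,k] ≈ ∑d U[d,n]·V[d,m]·S[d,k] plus Gaussian noise. The sizes, the random-walk-style time factor, and the noise level are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K, Dlat = 30, 20, 12, 2                # users, items, time steps, latent dim
U = rng.normal(size=(Dlat, N))
V = rng.normal(size=(Dlat, M))
S = np.cumsum(rng.normal(scale=0.3, size=(Dlat, K)), axis=1)   # slowly drifting trend

R_mean = np.einsum('dn,dm,dk->nmk', U, V, S)                   # noiseless ratings
R = R_mean + rng.normal(scale=0.1, size=R_mean.shape)          # observed tensor

print(R.shape)    # (N, M, K)
```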
5.5.2 Model
Consider a model of collaborative filtering that takes into account transitions in the trend in the time direction
The idea of data representation with additional time-series information can be expressed using a one-dimensional Gaussian distribution as in the above equation
Λ ∈ ℝ+ is the precision parameter of the Gaussian distribution
The user feature vector U:,n∈ℝD and the item feature vector V:,m∈ℝD are assumed to follow a multidimensional Gaussian distribution as above
μU∈ℝD and ΛU∈ℝDxD are the mean and precision parameters for the user’s feature vector
For S ∈ ℝDxK, we want to extract the trend in the time direction, so we consider a Gaussian distribution assuming Markovianity as in the above equation
Unlike the categorical state transitions of the hidden Markov model, the dependency between neighboring variables is expressed with a Gaussian distribution.
To infer the parameters from the data, a gamma prior is set for λ and Gauss-Wishart priors are set for the parameters governing U, V, and S.
The joint distribution including all variables is shown above.
Graphical model of time series tensor decomposition
5.5.3 Variational Inference
The goal is to find the posterior distribution of the remaining variables, given the observed data R
Summarizing all parameters as Θ={λ, μU, μV, μS, ΛU, ΛV, ΛS}, the posterior distribution is given by
In the framework of variational inference, the posterior distribution is approximated by a decomposition of the above equation
Because S has time dependence, an additional decomposition in the time direction is assumed here for simplicity, as in the fully decomposed variational inference for the hidden Markov model.
Using the mean-field approximation formula (4.25), the respective approximate posterior distributions are as in the above equation
Continue
CONTINUED
The term for U:,n
and
q(U:,n) is Gaussian in dimension D as in the above equation
where the parameters are defined by the corresponding equation.
The operation ∘ is the Hadamard product,
i.e., element-wise multiplication of matrices or vectors of the same size.
Similarly, q(V:,m) is a D-dimensional Gaussian distribution,
where the parameters are defined by the corresponding equation.
Approximate distribution of S with time dependence
D-dimensional Gaussian distribution as above
where the parameters for each time step are defined by the corresponding equations.
Approximate distribution of parameters
For q(λ)
lnq(λ) is the above equation
The posterior distribution is the same gamma distribution as the prior distribution
where the parameters are defined by the corresponding equation.
The expected value is the above equation
For q(μU,ΛU) and q(μV,ΛV)
Gauss-Wishart distribution
where the parameters are defined by the corresponding equation.
Gauss-Wishart distribution
where the parameters are defined by the corresponding equation.
Gauss-Wishart distribution
where the parameters are defined by the corresponding equation.
Calculation of expected value
A model in which the time series of a latent variable is represented by a chain of linear Gaussian distributions
Continuous-valued version of the distribution of state time series in the hidden Markov model
State-space model
Without assuming the factorization over time steps used here, the marginal distributions can be obtained directly from the approximation q(S) by the message passing algorithm described in “Overview of Message Passing in Machine Learning with Examples of Algorithms and Implementations“.
This message passing corresponds to the well-known Kalman filter and Kalman smoother
Reference [1]
5.5.4 Interpolation of missing values
In a real-world recommendation system, not all users rate items.
The simplest approach is to apply the idea of missing-value interpolation shown for linear dimensionality reduction
For locations with missing values, let Rn,m,k be given by the above equation
Approximate distributions can be computed as for other variables
The expected prediction precision of the rating is then <λ>=ã/ḃ
ḃ depends on the sum of the expected differences between the observed rating values and the rating values expressed by the model (see above)
The larger this value, the less well the model explains the ratings.
Instead of assuming a single precision λ for all ratings Rn,m,k, different prediction precisions λn,m can be inferred for different users and items
By making good use of this precision information, items with high uncertainty can be recommended even when the predicted mean rating is low.
For the approximate distributions of the other variables, missing values Rn,m,k are replaced by their expected values in the update equations, as in the above expression.
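As a rough sketch of the bookkeeping this implies (the boolean-mask handling below is an illustration on assumed toy data, not the book's derivation), only the observed entries contribute to the error sum that the precision estimate depends on:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 5, 7, 10
R = rng.normal(size=(N, M, K))            # toy rating tensor
observed = rng.random((N, M, K)) < 0.3    # True where a rating was actually given

def observed_squared_error(R, R_hat, mask):
    """Sum of squared differences over observed entries only;
    missing entries are excluded from the sum."""
    diff = (R - R_hat)[mask]
    return float(np.sum(diff ** 2))

R_hat = np.zeros_like(R)   # stand-in for the model's reconstruction of the ratings
print(observed_squared_error(R, R_hat, observed))
```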
5.6 Logistic regression
Introduction
A model that learns discrete label data y directly from input variables x
Predicting continuous values with a linear regression model allows for exact computation of the posterior distribution of parameters and the predictive distribution for new data
Logistic regression, unlike linear regression, involves an internal nonlinear variable transformation, which makes such exact computation impossible.
As an application of variational inference, instead of the mean-field approximation used for linear dimensionality reduction and the factorized posterior used in LDA, an optimization approach is presented that approximates the posterior distribution with a Gaussian distribution and uses gradient information.
A similar approach can be taken with neural networks.
5.6.1 Model
Consider a model that classifies input values xn∈ℝM into one of D classes
Assume that the multidimensional vector yn is output according to the categorical distribution in the above equation
The matrix W ∈ ℝMxD is a parameter of this model
Assume a Gaussian prior distribution for each element wm,d of W
f is a nonlinear function and assumes a D-dimensional softmax function
For each dimension d, it is defined as in the above equation
W:,d∈ℝM is the d-th column vector of matrix W
The softmax function SM(・) converts each of the D real-valued inputs to a non-negative value via exp(・) and returns values normalized so that ∑d=1..D SMd(WTxn) = 1
This transformation allows the parameters of the categorical distribution to be represented using the linear model WTxn
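A small sketch of this softmax parameterization of the categorical distribution (the shapes below are assumptions; the max is subtracted only for numerical stability and does not change the result):

```python
import numpy as np

def softmax(a):
    """Map a real vector a in R^D to a non-negative vector that sums to 1."""
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / e.sum()

M, D = 3, 4                     # input dimension and number of classes (assumed)
rng = np.random.default_rng(4)
W = rng.normal(size=(M, D))     # model parameter W
x_n = rng.normal(size=M)

mu_n = softmax(W.T @ x_n)       # parameters of the categorical distribution for y_n
print(mu_n, mu_n.sum())         # non-negative values summing to 1
```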
5.6.2 Variational Inference
The goal of logistic regression here is to infer the posterior distribution of W given a training data set of output and input values {X, Y}, and then to predict the output value y* given new input data x*.
The posterior distribution is expressed above using Bayes’ theorem
Inference Methods for Posterior Distribution in Logistic Regression Models
Laplace approximation
Using an approximation of a local function
Sampling methods such as Hamiltonian Monte Carlo
Approximation of posterior distribution by Gaussian distribution
Relatively simple implementation, easy to scale for increasing data size
Approximation of posterior distribution by Gaussian distribution
As the approximate posterior distribution, assume a product of M×D independent one-dimensional Gaussian distributions, as in the above formula.
μm,d and σm,d are the variational parameters of the approximate distribution
η is a notation that summarizes all those variational parameters
Goal of Variational Inference
Minimize the KL divergence between this approximate distribution and the true posterior distribution with respect to the variational parameters
Minimize the KL divergence by partial differentiation with respect to η using the gradient method
Expanding the expression for KL divergence using the notation for expected value yields the above equation
The first two terms are simply the expected value of the logarithm of the Gaussian distribution
Can be expressed as a function of η
The last term, the expectation of the likelihood function, contains an internal nonlinear function, so its expectation cannot be computed analytically.
Using a simple Monte Carlo method, it is possible to approximate the expected value using a sample of W
A method called the re-parameterization trick can be used to approximate the gradient
In general, a sample w̃ ~ N(w|μ, σ²) from a Gaussian distribution can be expressed as in the above equation, i.e., as w̃ = μ + σε with parameter-free noise ε ~ N(0, 1)
Rewriting the sample of w as a deterministic function of this parameter-free Gaussian noise gives the approximation in the above expression
Since it is then a function of the parameters η, it can be partially differentiated with respect to μ and σ
By substituting σ=ln(1+exp(ρ)), the problem becomes an unconstrained optimization over the real value ρ, without having to enforce the constraint that σ is positive.
If γ>0 is the learning rate in the gradient method, we obtain an update formula for variational inference as above.
Specific calculation of gradient
Final equation
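The sketch below illustrates the reparameterization trick for a single weight (everything here is a toy assumption: the scalar weight, the stand-in objective g(w) = w²/2, and the learning rate). A sample of w is written as a deterministic function of parameter-free Gaussian noise, so the Monte Carlo estimate can be differentiated with respect to μ and ρ via the chain rule.

```python
import numpy as np

rng = np.random.default_rng(5)

mu, rho = 0.0, 0.0          # variational parameters; sigma = ln(1 + exp(rho)) > 0
gamma = 0.1                 # learning rate (assumed)

def g(w):
    return 0.5 * w ** 2     # toy stand-in for the objective terms depending on w

for _ in range(100):
    eps = rng.normal()                       # parameter-free noise, eps ~ N(0, 1)
    sigma = np.log1p(np.exp(rho))
    w = mu + sigma * eps                     # reparameterized sample of w
    dg_dw = w                                # gradient of the toy objective at w
    # Chain rule through the reparameterization: dw/dmu = 1, dw/drho = eps * sigmoid(rho)
    grad_mu = dg_dw
    grad_rho = dg_dw * eps / (1.0 + np.exp(-rho))
    mu -= gamma * grad_mu
    rho -= gamma * grad_rho
```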
5.6.3 Predicting Discrete Values
By fully minimizing the approximate objective function using the gradient method with the reparameterization trick, we obtain the approximate posterior distribution of the parameters q(W;ηopt)
Calculate the predictive distribution of y* at a new input x* using the approximate distribution obtained
Analytical calculation is not possible due to the presence of nonlinear SM
Further approximation is needed to find y*.
Checking for trends in y* using a simple Monte Carlo method
Assuming the number of samples to be L, the parameter realizations W(1),… W(L) can be obtained from the approximate posterior distribution as in the above equation
Example
Assuming D=2 classes, a sample of the parameters can be used to draw a boundary along which the probabilities of y* taking the values 0 and 1 are both 0.5.
A plot of about 10 data points and multiple candidate boundaries on a 2-dimensional plane with M=2
Using these samples, an estimate of y* is given as in the above equation
Evaluated against x* on the plane
The darker the red, the higher the probability of red
The darker the blue, the higher the probability of blue
By averaging over the L sampled linear boundaries, probabilities that are not confined to a single straight-line boundary can be assigned.
In regions with less data, probabilities tend to be softer (closer to 0.5 than 0 or 1)
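A minimal sketch of the Monte Carlo prediction described above (the variational means and standard deviations below are placeholders, not learned values):

```python
import numpy as np

rng = np.random.default_rng(6)
M, D, L = 2, 2, 100                 # input dimension, classes, number of samples (assumed)

mu_q = np.zeros((M, D))             # placeholder variational means
sigma_q = 0.5 * np.ones((M, D))     # placeholder variational standard deviations

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

x_star = np.array([0.3, -1.2])      # new input point (assumed)

# Draw L parameter samples W^(l) from q(W) and average the class probabilities
probs = np.zeros(D)
for _ in range(L):
    W_l = rng.normal(mu_q, sigma_q)
    probs += softmax(W_l.T @ x_star)
probs /= L
print(probs)                        # approximate predictive distribution over the classes
```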
5.7 Neural Networks
Introduction
Neural networks, like linear and logistic regression, are probabilistic models that directly estimate the predicted value y from the input x
Describes a regression algorithm for continuous values using a neural network
Unlike the aforementioned models, nonlinear functions for predicting y from x can be learned from data
In contrast to the commonly used neural networks obtained by maximum likelihood or MAP estimation, the parameters are treated here as random variables
Combined with a gradient evaluation method called backpropagation, variational inference similar to that used for logistic regression can be applied.
5.7.1 Model
Use the simplest two-layer neural network
Let xn ∈ ℝM and yn ∈ ℝD; modeling yn with a Gaussian distribution gives the above equation
Λy is a fixed precision parameter
The nonlinear function is defined as above
Assume a simple Gaussian distribution as above for the model parameters W(1) ∈ ℝMxK, W(2) ∈ ℝKxD
tanh(・) is given by the above equation
A nonlinear transformation function of the form shown in the figure (essentially a shifted and rescaled sigmoid function)
The function’s input W(1)Txn is a K-dimensional vector
The model parameter K, which determines the size of W(1) and W(2), is interpreted in the field of neural networks as the number of hidden units
Plot of function samples drawn from the neural network model for different values of K
Rewriting the entire equation, we obtain the above equation
When K=1, the function is just a scaled tanh function
For K>1, a richly expressive function that adds together several different tanh functions models the mean vector f(W,xn) of the Gaussian distribution.
When K → infinity, the neural network is a smooth nonlinear function
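A short sketch of this two-layer model, drawing prior function samples f(W, x) = W(2)ᵀ tanh(W(1)ᵀ x) for a few values of K (the one-dimensional input/output, unit-scale priors, and the grid of x values are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
M, D = 1, 1                                # 1D input and output (assumed)
x = np.linspace(-3, 3, 50).reshape(-1, M)  # grid of input points

def sample_function(K):
    """Draw W(1), W(2) from unit Gaussian priors and evaluate f(W, x) on the grid."""
    W1 = rng.normal(size=(M, K))
    W2 = rng.normal(size=(K, D))
    return np.tanh(x @ W1) @ W2            # sum of K tanh units per output

for K in (1, 3, 10):
    f = sample_function(K)                 # richer, more flexible functions as K grows
    print(K, f.shape)
```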
Adding a softmax function to the nonlinear transformation and using a categorical distribution instead of the Gaussian distribution as the observation model gives a neural network for multiclass classification
Replacing the nonlinear transformation part W(2)Ttanh(・) of the multiclass classification neural network with the identity map reduces the model to multiclass logistic regression
If the neural network is set up so that its output variables are the input variables themselves (i.e., the network reconstructs its own input), a nonlinear dimensionality reduction method called an autoencoder is obtained
5.7.2 Variational inference
Consider the problem of inferring the posterior distribution of the parameters W={W(1),W(2)} of a neural network model
As in logistic regression, consider using an approximate posterior distribution q(W;η) using a diagonal Gaussian distribution
η is a set of variational parameters
Assume a one-dimensional Gaussian distribution for each of the MK + KD elements of W
Minimize the function g(W,η) using the reparameterization trick for the Gaussian distribution
The procedure is almost the same as the logistic regression approximation.
The only new calculation is the evaluation of the derivative with respect to w in the likelihood function
Using the chain rule for composite functions, the derivatives with respect to the parameters in each layer are given by the above equations
where
Implementation-wise, first evaluate the above equation using weights and data sampled from the approximate distribution
Then, in turn, evaluate the output-side error from the above equation
The result is used to calculate δ in the above equation.
The error δ is propagated in turn from the output side of the network to the input side.
Neural networks require large amounts of training data
The stochastic gradient descent method, which computes the gradient by feeding the data sequentially one item at a time, is used.
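A hedged sketch of this backpropagation computation for a single data point of the two-layer regression network (weight sampling from the approximate posterior is omitted to keep the example short; plain weights, toy sizes, and a fixed learning rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
M, K, D = 3, 4, 2
W1 = rng.normal(size=(M, K))
W2 = rng.normal(size=(K, D))
x_n = rng.normal(size=M)
y_n = rng.normal(size=D)

# Forward pass
h = np.tanh(W1.T @ x_n)                # hidden activations, shape (K,)
f = W2.T @ h                           # network output, shape (D,)

# Error at the output side, then propagated back towards the input side
err = f - y_n                          # gradient of 0.5 * ||f - y||^2 with respect to f
delta = (W2 @ err) * (1.0 - h ** 2)    # hidden-layer error (tanh' = 1 - tanh^2)

grad_W2 = np.outer(h, err)             # gradient with respect to W2, shape (K, D)
grad_W1 = np.outer(x_n, delta)         # gradient with respect to W1, shape (M, K)

# One stochastic gradient step using this single data point
lr = 0.01
W1 -= lr * grad_W1
W2 -= lr * grad_W2
```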
5.7.3 Prediction of continuous values
Make predictions for new data y* using the obtained approximate posterior distribution q(W;ηopt)
Since the analytical predictive distribution p(y*|Y,x*,X) cannot be obtained directly, a sample from the approximate posterior distribution is obtained and visualized
Prediction function plotted using L=100 sampled Ws, after the approximate posterior distribution was obtained by variational inference from N=50 one-dimensional input data and output data Y
Predictions are uncertain in regions where data are not observed
Predictions for regions where no data exist can differ greatly depending on the prior structure:
Setting the prior distribution for W
Value of K
Choice of nonlinear function f
Reference: The Future of Bayesian Learning
The origins of machine learning algorithms.
What we do in Bayesian learning
Fix (condition on) part of the modeled system
Marginalize out the parts of the system that are not of interest
Can analyze a variety of problems at any granularity and from any perspective
Applicability
Efficient learning of large, highly nonlinear models is a key challenge
Efficient computational know-how developed in deep learning and optimization theory
Appendix A Supplemental Calculations
A.1 Basic Matrix Computation
A.1.1 Transposition
A.1.2 Inverse Matrix
A.1.3 Trace
A.1.4 Matrix expressions
A.1.5 Positive definite matrices
A.2 Special Functions
A.2.1 Gamma and digamma functions
A.2.2 Sigmoidal and Softmax functions
A.3 Gradient method
A.3.1 Gradient of a function
A.3.2 Steepest descent method
A.3.3 Coordinate descent method
A.4 Lower bounds on marginal likelihood
A.4.1 Marginal Likelihood and ELBO
A.4.2 Example of Poisson mixture distribution
REFERENCES