Machine Learning Professional Series “Graphical Models” reading notes


Summary

Bayesian inference is a statistical approach that interprets data and learns models from a probabilistic perspective. Machine learning based on Bayesian inference is characterized by incorporating prior knowledge and experience into the model and updating that knowledge through data. It is particularly effective when data is scarce or the model is uncertain, and it can control overfitting and account for uncertainty in statistical estimation. This section describes Bayesian inference based on the Machine Learning Professional Series book “Graphical Models”.

The specific procedure for Bayesian inference is: (1) set a prior distribution over models expressing prior knowledge and experience as a probability distribution, (2) set the probability that the data are generated by the model (the likelihood), (3) compute the posterior distribution of the model after observing the data using Bayes’ theorem, (4) estimate parameters and make predictions using the posterior distribution, and (5) evaluate the estimation results and modify or improve the model. Within this framework, a graphical model is a graph-based representation of the random variables and their relationships, used to visualize the structure of the model.

Machine Learning Professional Series “Graphical Models” reading notes

A shift in perspective is key! Understand and exploit the benefits of graphs for representing and analyzing cause-and-effect relationships and connections between variables. The book introduces various graphical models and carefully explains how to use them in machine learning, compactly summarizing what you need to know: how to identify problems for which these methods are effective, how to handle graphs, and how to apply them to inference and learning.

Chapter 1: Introduction to Graphical Models

1.1 What is a Bayesian Network?

Overview

Graphical Model
Bayesian network (directed)
Models stochastic causality
Markov random field (undirected)
Models probabilistic dependencies
Example
Blood type (simplified to consider only types A and O)
Genotype combination (AA, AO, OO)

1.1.1 Probability model for oneself

Random variable X: one’s blood type
Random variable Z: one’s genotype
Model the probabilities when neither one’s blood type nor genotype is known
Probability P0(X = x |Z = z) of blood type x under genotype z
1 x = A, z = AA
1 x = A, z = AO
1 x = O, z = OO
0 Otherwise
Joint probability of X and Z: P(X, Z) = P0(X|Z)Pg(Z)
Pg(z) : Probability of genotype z
Probability Model
When the blood type is known (X = A)
P(Z|X = A) = P0(X = A|Z)Pg(Z)/P(X = A)
Bayes’ theorem
Calculation of probability of cause Z under result X
Probability of genotype AA when one’s blood type is A
Example: P(Z = AA | X = A) = Pg(AA)/(Pg(AA) + Pg(AO)) = 3/5
Pg(AA)=9/16, Pg(AO)=3/8, Pg(OO)=1/16
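
A minimal Python sketch (not from the book) reproducing this posterior calculation from the Pg values above via Bayes’ theorem:

```python
# Hedged sketch: Bayes' theorem for the blood-type example.
# Pg and the table P0 are taken from the text above.
from fractions import Fraction

Pg = {"AA": Fraction(9, 16), "AO": Fraction(3, 8), "OO": Fraction(1, 16)}
P0 = {("A", "AA"): 1, ("A", "AO"): 1, ("O", "OO"): 1}  # P0(X = x | Z = z); 0 otherwise

def posterior(x):
    # P(Z = z | X = x) = P0(x | z) Pg(z) / P(X = x)
    joint = {z: P0.get((x, z), 0) * Pg[z] for z in Pg}
    evidence = sum(joint.values())
    return {z: p / evidence for z, p in joint.items()}

print(posterior("A"))  # AA: 3/5, AO: 2/5, OO: 0, as in the text
```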

1.1.2 Probability model with parents

One’s genotype is determined with certain probabilities from the genotypes of one’s parents.
Father’s blood type (X1), Mother’s blood type (X2)
Table of 3 x 3 x 3 conditional probabilities
Pi(Z0|Z1,Z2)
Z0 : my genotype
Z1 : Genotype of father
Z2 : Mother’s genotype
Joint probability of the blood types and genotypes of oneself and one’s parents
P(X0,Z0,X1,Z1,X2,Z2) = P0(X0|Z0)P0(X1|Z1)P0(X2|Z2)Pi(Z0|Z1,Z2)Pg(Z1)Pg(Z2)
Probabilistic Model
The problem of estimating one’s genotype
P(Z0|X0,X1,X2) = P(Z0,X0,X1,X2) / P(X0,X1,X2)
Probability of genotype AA when one’s own and both parents’ blood types are all type A
P(Z0 = AA|X0 = A, X1 = A, X2 = A) = 8/13
The problem of estimating the genotypes of one’s parents as well
Find the most probable combination of genotypes for oneself and one’s parents
argmax P(Z0, Z1, Z2 | X0, X1, X2)
The combination of values that maximizes the probability (under the given conditioning)
MAP assignment (maximum a posteriori assignment)
When you and your parents’ blood types are all type A
Most probable combination
Z0=AA, Z1=AA, Z2=AA
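
A hedged sketch of this MAP computation by exhaustive enumeration. The inheritance table Pi(Z0 | Z1, Z2) is not given numerically in the notes, so it is reconstructed here from standard Mendelian genetics (each parent passes on one allele):

```python
# Hedged sketch: MAP assignment by brute force for the family model.
# Pi is built from Mendelian inheritance (an assumption of this sketch).
from fractions import Fraction
from itertools import product

Pg = {"AA": Fraction(9, 16), "AO": Fraction(3, 8), "OO": Fraction(1, 16)}
P0 = {("A", "AA"): 1, ("A", "AO"): 1, ("O", "OO"): 1}
pass_A = {"AA": Fraction(1), "AO": Fraction(1, 2), "OO": Fraction(0)}

def Pi(z0, z1, z2):
    # P(child genotype z0 | parent genotypes z1, z2)
    a1, a2 = pass_A[z1], pass_A[z2]
    return {"AA": a1 * a2,
            "AO": a1 * (1 - a2) + (1 - a1) * a2,
            "OO": (1 - a1) * (1 - a2)}[z0]

best, best_p = None, Fraction(0)
for z0, z1, z2 in product(Pg, repeat=3):      # all genotype triples
    p = (P0.get(("A", z0), 0) * P0.get(("A", z1), 0) * P0.get(("A", z2), 0)
         * Pi(z0, z1, z2) * Pg[z1] * Pg[z2])  # joint with X0 = X1 = X2 = A
    if p > best_p:
        best, best_p = (z0, z1, z2), p
print(best)  # ('AA', 'AA', 'AA'), matching the text
```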

1.1.3 Probability model with spouse and children

Spouse’s blood type (X3), Child’s blood type (X4)
Model Equation
P(X0,Z0,X1,Z1,X2,Z2,X3,Z3,X4,Z4) = P0(X0|Z0)P0(X1|Z1)P0(X2|Z2)P0(X3|Z3)P0(X4|Z4)Pi(Z0|Z1,Z2)Pg(Z1)Pg(Z2)Pg(Z3)Pi(Z4|Z0,Z3)
Network Model

1.1.4 Learning from Data

The above discussion assumes that the probabilities P0, Pg, and Pi are known in advance.
The case where probabilities are not known in advance is more common.
Example
Model with personality added as a random variable
“Meticulous” and “rough”
Determine the value of the (conditional) probability from the data
Parameter Learning in Graphical Models

1.1.5 Learning the structure from data

Learning the “arrows” (edges) between random variables when their relationships are not known a priori
This task is called structure learning

1.2 What is a Markov Random Field?

Random variables in a Markov random field are interrelated, rather than one determining the other.
Simplest example
Ising model
A quantity called spin (±1) sits at each lattice point
Spin xi at lattice point i is probabilistically determined by the surrounding spins
Overall probability
Z is a normalization constant
Gibbs distribution
If Jij > 0, xi and xj have a high probability of taking the same value
If Jij < 0, then the probability of different values is high
If hi > 0, there is a higher probability that xi will take the value 1
Variation of the equation
N(i) is the set of lattice points adjacent to i
The behavior of xi is determined from the values of variables at its neighboring grid points
Requires probability computation, MAP assignment computation, parameter learning, and structure learning
Parameter learning for Markov random fields
Estimate {Jij} and {hi} from data
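
A tiny brute-force sketch of the Gibbs distribution above; the chain size and the values of J and h are illustrative assumptions, not from the book:

```python
# Hedged sketch: Gibbs distribution of a small Ising chain by enumeration
# (feasible only for tiny models; J and h are made-up values).
import itertools
import math

n = 4
edges = [(0, 1), (1, 2), (2, 3)]
J = {e: 1.0 for e in edges}      # J > 0: neighbors prefer equal spins
h = [0.1] * n                    # h > 0: spins prefer the value +1

def exponent(x):
    # sum_ij J_ij x_i x_j + sum_i h_i x_i
    return (sum(J[i, j] * x[i] * x[j] for i, j in edges)
            + sum(h[i] * x[i] for i in range(n)))

states = list(itertools.product([-1, 1], repeat=n))
Z = sum(math.exp(exponent(x)) for x in states)      # normalization constant
p = {x: math.exp(exponent(x)) / Z for x in states}
print(max(p, key=p.get))         # (1, 1, 1, 1): aligned spins are most probable
```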

1.3 Advantages of Considering Graph Structure

A probability distribution can be represented with less information
Graph structure can be exploited to compute (or approximate) probabilities and MAP assignments efficiently

Chapter 2: Foundations of Probability Theory

2.1 Foundations of Probability Theory

2.1.1 Mathematical formulation of probability

Definition 2.1 (σ-additive families)
A family F of subsets of a set Ω is a σ-additive family (σ-algebra) if it satisfies the following three conditions
1. Ω ∈ F
2. If A ∈ F, then the complement Ω∖A ∈ F
3. For any countable sequence A1, A2, … ∈ F, the union ∪iAi ∈ F
The pair (Ω, F) is called a measurable space, and an element of F is called a measurable set.
Definition 2.2 (Probability space)
Given a set Ω and a σ-additive family F of its subsets, a map P from F to the real numbers is a probability if it satisfies the following three conditions (Kolmogorov’s axioms)
1. 0 ≤ P(A) ≤ 1 for any A ∈ F
2. P(Ω) = 1
3. For mutually disjoint A1, A2, …, P(∪iAi) = ∑iP(Ai)
The set Ω is called the sample space, and the triple (Ω, F, P) is called a probability space
Random variable
A function defined on the sample space
Definition 2.3 (Discrete random variables)
A map X : Ω → ℤ is a random variable if
For any x ∈ ℤ, the inverse image X⁻¹({x}) is a measurable set
Example
A six-sided die whose faces are not equally likely
One roll of this die is modeled by a random variable X : Ω → {1,2,3,4,5,6}
The event that the value is 1 is
A1 := {ω ∈ Ω | X(ω) = 1}, a subset of Ω
P(A1) is the probability that the die roll equals 1
P(X=1)
Kolmogorov’s axiom 1
The probability of each face of the die is between 0 and 1
Kolmogorov’s axiom 2
The probability that the roll is one of 1 through 6 is 1
Kolmogorov’s axiom 3
Probability of rolling 1 + probability of rolling 2 = probability of rolling 1 or 2

2.1.2 Distribution function of the random variable

For a random variable X, P(X=x) describes how the values taken by X are distributed
Viewed as a function of x, this is the probability distribution function (or probability mass function)
Given two random variables X and Y
The probability distribution function P(X=x, Y=y) is regarded as a two-variable function of x and y
Joint probability distribution function (or joint probability mass function)
Summing over Y
∑y P(X=x, Y=y) = P(X=x)
Summing the joint probability distribution over some of the variables
Marginalization
Marginalizing a joint probability function yields the distribution of the remaining variables
Applies to more than 3 variables
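
A minimal sketch of marginalization on a made-up joint table (the values are illustrative):

```python
# Hedged sketch: summing a joint table P(X, Y) over y gives the marginal P(X).
P_XY = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

P_X = {}
for (x, y), p in P_XY.items():
    P_X[x] = P_X.get(x, 0.0) + p   # sum over y for each fixed x
print(P_X)                          # {0: 0.4, 1: 0.6}
```
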
2.2 Independence of random variables
When two dice are thrown, their outcomes vary independently of each other
Independence of random variables
Definition 2.4 (Independence of random variables)
For random variables X1, X2 : Ω → ℤ and any B1, B2 ⊂ ℤ
P(X1 ∈ B1 , X2 ∈ B2) =P(X1 ∈ B1)P(X2 ∈ B2)
P(A∩B)=P(A,B)
X1 and X2 are said to be independent when they satisfy
X1 ⫫ X2
Example
Biased (unequally weighted) dice
Two random variables Xi :Ω → {1,2,3,4,5,6} (i=1,2)
If these are independent
P(dice 1 rolls 1 and dice 2 rolls 5) = P(dice 1 rolls 1) P(dice 2 rolls 5)
Expandable to N random variables

2.3 Conditional Probability

Dependencies among random variables
Can be captured by the concept of conditional probability

2.3.1 Definition of conditional probability

Probability space (Ω,F,P) and random variables X on it
Condition
The value of the random variable X is in B
Restrict to the subset X⁻¹(B) = {ω ∈ Ω | X(ω) ∈ B} of Ω, renormalizing its probability to 1 and setting the probability outside it to 0
Conditional probability QB
QB(A) = P({ω ∈ Ω | ω ∈ A, X(ω) ∈ B}) / P(X ∈ B) for all A ∈ F
Denominator is normalized
P(A|B)=P(A∩B)/P(B)
P(A∩B)=P(A|B)P(B)
When X and Y are discrete random variables, the probability of X = x under the observation of the event Y = y
P(X = x| Y = y) = P(X = x, Y = y) / P(Y = y)
Example
Throw one fair die.
X is the value rolled; Y indicates whether that value is odd or even
P(x|Y=even) = 1/3 with x=2,4,6
P(x|Y=even)=0 for x=1,3,5

2.3.2 Independence and conditional probability

If X and Y are independent, the probability distribution of X is not affected by observing Y
Independence
When P(A∩B) = P(A)P(B)
P(A|B) = P(A)
Equal to the marginal probability of A
P(B|A)=P(B)
Equal to the marginal probability of B

2.4 Conditional independence

Conditional independence is important in graphical models

2.4.1 Definition of conditional independence

Definition 2.5 (conditional independence)
For random variables X, Y, and Z, X and Y are conditionally independent given Z if
P(X,Y|Z)=P(X|Z)P(Y|Z)
In this case we write X ⫫ Y | Z
X and Y are independent under the condition that the value of Z is observed
When you want to know about Y, knowing the value of Z is sufficient as a hint, and the value of X is not necessary

2.4.2 Properties of Conditional Independence

Conditions of independence
P(X|Y,Z) = P(X|Z)
P(Y|X,Z) = P(Y|Z)
P(X,Y,Z)=Φ1(X,Z)Φ2(Y,Z) holds for some functions Φ1 and Φ2

2.4.3 Examples of Conditionally Independent Random Variables

Assumptions
There are M biased coins at hand
Let pz (z = 1, 2, …, M) denote the probability that the z-th coin lands heads when tossed
Random variable Z: the index of a coin selected at random
Toss the chosen coin twice
Random variable X: whether the first toss is heads or tails
Random variable Y: whether the second toss is heads or tails
If Z = z is fixed, the probability that Y is heads is pz regardless of the result of X
X ⫫ Y | Z
X,Y are not independent
If X is heads, it becomes more likely that a coin with high heads probability was chosen, so the probability that Y is also heads increases.
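
A small simulation sketch of this example (the two coin biases are made-up values): conditioning on the coin Z removes the dependence between the tosses, but marginally X and Y are correlated.

```python
# Hedged sketch: X and Y are conditionally independent given Z, yet dependent.
import random

p_heads = [0.9, 0.1]                   # assumed biases of two coins

def sample():
    z = random.randrange(2)            # choose a coin uniformly at random
    x = random.random() < p_heads[z]   # first toss of the chosen coin
    y = random.random() < p_heads[z]   # second toss of the same coin
    return x, y

random.seed(0)
draws = [sample() for _ in range(100_000)]
p_y = sum(y for _, y in draws) / len(draws)
y_given_x = [y for x, y in draws if x]
print(p_y, sum(y_given_x) / len(y_given_x))   # ~0.50 vs ~0.82: P(Y|X) != P(Y)
```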

2.4.4 Difference between independence and conditional independence

X ⫫ Y | Z is neither a necessary nor sufficient condition for X ⫫ Y
Even strongly related X and Y become unrelated once “stratified” by the value of Z, the remaining variation being independent.

2.5 Handling of continuous random variables

For a continuous set of values of a random variable

2.5.1 Probability density function

Probability is represented as an integral over a range, rather than a value at a particular point

2.5.2 Conditional probability density function

Chapter 3 Bayesian Networks

3.1 Terminology for directed graphs

Directed graph
A pair (V, vector E) of a finite set V and a subset vector E of its direct product set ⊂ V x V
V: set of vertices
Vector E: directed edge set
(𝓊, 𝓋) ∈ Vector E
Directed edges from vertex 𝓊 to 𝓋
For a subset V0 ⊂ V of vertices
Edge set vector E0 = {(𝓊, 𝓋) ∈ vector E | 𝓊, 𝓋 ∈ V0}, keeping only directed edges with both endpoints in V0
The graph (V0, vector E0) is called the induced subgraph
Directed graph walk
A sequence of vertices (𝓋0, 𝓋1, …, 𝓋L) with (𝓋i, 𝓋i+1) ∈ vector E
closed walk when 𝓋0=𝓋L
Directed graphs without closed paths
Directed acyclic graph (DAG)
A vertex extending a directed edge to vertex 𝓋 is the “parent” of 𝓋
pa(𝓋) := {𝓊 ∈ V | (𝓊, 𝓋) ∈ vector E}
A vertex with a directed edge extending from vertex 𝓋 is a “child” (child) of 𝓋
ch(𝓋) := {𝓊 ∈ V | (𝓋, 𝓊) ∈ vector E}
A vertex of a directed acyclic graph that has no particular parent is a “starting vertex” (source).
A vertex of a directed acyclic graph, especially one that has no children, is called an “end vertex” (sink).
For a vertex 𝓋 in a directed acyclic graph, its “descendant” is the set of vertices for which a path from 𝓋 exists
des(𝓋)
The non-descendants are ndes(𝓋) := V∖({𝓋} ∪ des(𝓋))

3.2 Characterization of directed acyclic graphs

Directed graphs without closed paths
Topological sort of directed graphs
A bijection n from the vertex set V to {1, 2, …, |V|} such that
(𝓋0, 𝓋1) ∈ vector E ⇒ n(𝓋0) < n(𝓋1)
That is, order the nodes so that every node precedes the nodes reached by its outgoing edges
Directed acyclic graphs can always be topologically sorted
Example
The graph on the left admits multiple topological sorts, including:
7, 5, 3, 11, 8, 2, 9, 10 (left to right, top to bottom in appearance)
3, 5, 7, 8, 11, 2, 9, 10 (numerically smaller nodes first)
3, 7, 8, 5, 11, 10, 2, 9
5, 7, 3, 8, 11, 10, 9, 2 (nodes with fewer edges first)
7, 5, 11, 3, 10, 8, 9, 2 (nodes with more edges first)
7, 5, 11, 2, 3, 8, 9, 10
Algorithm for constructing a topological sort
Kahn’s algorithm
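
A compact sketch of Kahn’s algorithm; the edge list below is the standard example graph over the vertices {2, 3, 5, 7, 8, 9, 10, 11} that matches the sorts listed above (an assumption about the figure):

```python
# Hedged sketch: Kahn's algorithm repeatedly removes a source vertex.
from collections import deque

edges = [(5, 11), (7, 11), (7, 8), (3, 8), (3, 10),
         (11, 2), (11, 9), (11, 10), (8, 9)]
vertices = {v for e in edges for v in e}

indeg = {v: 0 for v in vertices}
out = {v: [] for v in vertices}
for u, v in edges:
    out[u].append(v)
    indeg[v] += 1

queue = deque(v for v in vertices if indeg[v] == 0)   # current sources
order = []
while queue:
    u = queue.popleft()
    order.append(u)
    for v in out[u]:            # removing u may expose new sources
        indeg[v] -= 1
        if indeg[v] == 0:
            queue.append(v)

assert len(order) == len(vertices), "graph has a cycle"
print(order)                    # one valid topological sort
```
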
Theorem 3.1 (Characterization of directed acyclic graphs): the following are equivalent
There is no closed walk
Every subgraph has an end vertex (sink)
Every subgraph has a starting vertex (source)
A topological sort exists
3.3 Definition of Bayesian Network
Vertices of a directed acyclic graph correspond to random variables; directed edges describe causal relationships
Definition 3.1 (Bayesian network)
Given a directed acyclic graph G = (V, vector E)
A finite family X = {Xi}i∈V of random variables is a Bayesian network with graph structure G when, for every i ∈ V,
Xi ⫫ Xndes(i) | Xpa(i)
holds
The value of Xi is determined probabilistically from Xpa(i)

3.4 Factorization of Bayesian Networks

3.4.1 Factorization theorem

Definition 3.2 (Factorization of Bayesian Networks)
A family of random variables X={Xi}i∈V on a directed acyclic graph G=(V, vector E) being a Bayesian network is
equivalent to the probability distribution function P factorizing into the product
P(X) = ∏i P(Xi|Xpa(i))
Hypothesis of causal independence
Causally independent except for direct parent node
Example
Factorize from children of graph
P(X)=P(X5|X1,X2,X3,X4)P(X4|X1,X2,X3)P(X3|X1,X2)P(X2|X1)P(X1)
Replacement
Only X3 is the direct parent of X5
P(X5|X1,X2,X3,X4)=P(X5|X3)
P(X4|X1,X2,X3)=P(X4|X2,X3)
P(X3|X1,X2)=P(X3|X1)
P(X2|X1)=P(X2)
After replacement
P(X)=P(X5|X3)P(X4|X2,X3)P(X3|X1)P(X2)P(X1)
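
A small sketch verifying that the factored form above defines a normalized joint distribution; the conditional tables are made-up numbers, only the graph structure follows the example:

```python
# Hedged sketch: the product P(X5|X3)P(X4|X2,X3)P(X3|X1)P(X2)P(X1)
# with arbitrary binary tables sums to 1 over all states.
import itertools

p1 = 0.6                                   # P(X1 = 1)
p2 = 0.3                                   # P(X2 = 1)
p3 = lambda x1: 0.9 if x1 else 0.2         # P(X3 = 1 | X1)
p4 = lambda x2, x3: 0.8 if (x2 and x3) else 0.1   # P(X4 = 1 | X2, X3)
p5 = lambda x3: 0.7 if x3 else 0.4         # P(X5 = 1 | X3)

def bern(p, x):                            # P(X = x) for P(X = 1) = p
    return p if x else 1 - p

def joint(x1, x2, x3, x4, x5):
    return (bern(p1, x1) * bern(p2, x2) * bern(p3(x1), x3)
            * bern(p4(x2, x3), x4) * bern(p5(x3), x5))

total = sum(joint(*x) for x in itertools.product([0, 1], repeat=5))
print(round(total, 10))                    # 1.0
```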

3.4.2 Construction of Bayesian Networks

Multiplying tables of local conditional probabilities yields a Bayesian network.
Supplement 3.4 (Construction of a Bayesian network)

3.5 Examples of Bayesian Networks

3.5.1 The trivial Bayesian network and Markov processes

Given a family of random variables X = {X1, X2, …, XN}
The probability distribution function always factorizes as P(X1, X2, …, XN) = ∏j P(Xj|Xj-1,…,X1)
The family X = {X1, X2, …, XN} is a Markov process when
Xt+1 ⫫ (X1, …, Xt-1) | Xt holds

3.5.2 Bayesian network consisting of three variables

The three possible structures as directed acyclic graphs
Head to tail
P(X1,X2,X3)=P(X3|X2)P(X2|X1)P(X1)
Tail to tail
P(X1,X2,X3)=P(X1|X2)P(X3|X2)P(X2)
Head to head
P(X1,X2,X3)=P(X2|X1,X3)P(X1)P(X3)

3.6 Graphs and conditional independence

Random variables on a Bayesian network satisfy certain conditional independences

3.6.1 Graph Theoretic Preparation

An undirected path in a directed acyclic graph
A sequence of vertices {𝓊i} where consecutive vertices are joined by either 𝓊i → 𝓊i+1 or 𝓊i ← 𝓊i+1
V-structure
𝓊i-1 → 𝓊i ← 𝓊i+1 (head to head)
Definition 3.2 (Active undirected path)
An undirected path connecting vertices 𝓊 and 𝓋 of a directed acyclic graph is active under the vertex set W if
every vertex 𝓊i on the undirected path (𝓊1, 𝓊2, …, 𝓊L) satisfies one of the following
({𝓊i} ∪ des(𝓊i)) ∩ W ≠ ∅ when 𝓊i is the center of a V-structure
𝓊i ∉ W when 𝓊i is not the center of a V-structure
The “connection is alive” between 𝓊 and 𝓋 when conditioning on the elements of the vertex set W
If the connection is broken, the path is inactive
Situations where a path becomes inactive
A V-structure vertex 𝓊i such that neither it nor its descendants are contained in W
A non-V-structure vertex 𝓊i that is contained in W

3.6.2 Conditional independence and d-separation

Read independence between variables from the graphical model and decompose the posterior distribution into simpler products
Directed separation
d-separation
A method to check the independence of random variables in more complex models
Definition 3.3 (d-separation)
Consider a directed acyclic graph and mutually disjoint vertex sets U, W, S
U and W are d-separated under the vertex set S if
for any vertices 𝑢 ∈ U and 𝑤 ∈ W, there is no active undirected path under S connecting them
Theorem 3.4 (Conditional independence from d-separation)
Suppose the vertex sets U, T, and W of a directed acyclic graph are mutually disjoint
If U and T are d-separated under W, then
the random variables XU and XT are conditionally independent given XW
That is, XU ⫫ XT | XW

Chapter 4 Markov Random Fields

Introduction.
Probability structure reflecting interdependencies

4.1 Terms for undirected graphs

Undirected graph
A pair (V, E) of a finite vertex set V and a set E of unoriented edges
The edge of E connecting vertices 𝑢 and 𝑣 is written 𝑢𝑣 ∈ E
Path of an undirected graph (walk)
A sequence of vertices (𝑣0,𝑣1,… 𝑣L) that satisfy (𝑣i,𝑣i+1) ∈ E
In a subset A,B of a vertex set V, S separating A and B means
Any path connecting 𝑎 and 𝑏 for any 𝑎∈A, 𝑏∈B passes through a vertex of S
A neighborhood (neighbourhood) of a vertex 𝑣 ∈ V is
N(𝑣) = {𝑢 ∈ V| 𝑢𝑣 ∈ E}
A closure (closure) of a vertex 𝑣 ∈ V is
Defined by cl(𝑣) = {𝑣} ⋃ N(𝑣)

4.2 Definition of Markov Random Field

Definition 4.1 (Three definitions of Markovianity)
For a random variable (X𝑣)𝑣∈V on a graph G(V,E), we define the following three types of Markovianity
Global Markov (G)
Global Markov means that XA ⫫ XB | XS holds for mutually disjoint subsets S, A, B of V such that S separates A and B
Local Markov (L)
Local Markov means that X𝑣 ⫫ XV∖cl(𝑣) | XN(𝑣) holds for any 𝑣 ∈ V
N(𝑣) is the Markov blanket of vertex 𝑣
Pairwise Markov (P)
Pairwise Markov means that X𝑢 ⫫ X𝑣 | XV∖{𝑢,𝑣} holds for any 𝑢, 𝑣 ∈ V such that 𝑢𝑣 ∉ E
Theorem 4.1 (Relations among the Markov conditions, part 1)
(G) ⇒ (L) ⇒ (P) holds
Under the intersection property, (P) ⇒ (G) holds
The intersection property is valid whenever the probability distribution function is positive
In that case the three conditions (G), (L), (P) are equivalent

4.3 Factorization of Markov Random Fields

The previous section characterized Markov random fields in terms of conditional independence
As with Bayesian networks, Markov random fields can also be characterized by a decomposition of the distribution into a product of functions
Prepare graph theory terminology
A vertex subset U of a graph G=(V,E) is complete if every pair of its vertices is joined by an edge
A maximal complete vertex subset C of G=(V,E) is called a clique
Definition 4.2 (Factorization property)
A family of random variables with graph structure factorizes (F) with respect to the graph G when
the probability distribution function P can be written as a product of functions over the cliques, as in the above equation
Z is a normalization constant
Theorem 4.2 (Markovianity condition relations.2)
(F) ⇒ (G) holds.
Theorem 4.3 (Hammersley-Clifford theorem)
For probability distributions with P > 0, (F) and (P) are equivalent
In particular, (G) ⇒ (F) holds

4.4 Bayesian Networks and Markov Random Fields

When the random variables X={X𝑣}𝑣∈V form a Bayesian network on a directed acyclic graph (V, vector E)
They can be factorized using conditional probability distribution functions
Connecting every pair of parents 𝑢, 𝑢’ ∈ pa(𝑣) of each vertex 𝑣 ∈ V by an edge and regarding directed edges as undirected yields a graph for which the factorization (F) holds
This operation, which creates an undirected graph from a directed acyclic graph, is known as moralization
Example
The Bayesian network on the left is also a Markov random field with the graph structure shown on the right.

4.5 Example: Gaussian Markov Random Fields

Probability density function of a multidimensional Gaussian random variable
Using a positive definite symmetric matrix J and vector μ
A characteristic of this probability density function is that the exponent contains no terms of degree three or higher in x
The above equation can be transformed
Can be written as above for each pair i, j
A,B,C are constants independent of xi,xj
Conditions under which this probability density function is divided into products
If a multidimensional Gaussian distribution is viewed as a Markov random field, its graph has an edge between i and j exactly when Jij ≠ 0.

Chapter 5 Factor Graph Representation

Introduction
The probability distribution functions of Bayesian networks and Markov random fields are expressed as products of local functions
Hypergraphs give a more direct graphical representation of this product structure
Both Bayesian networks and Markov random fields can be represented by hypergraphs
An extension of undirected graphs
Arrow (direction) information is lost
Probability distribution (families) with a product structure corresponding to a graph structure are collectively called “graphical models”

5.1 Terminology for Hypergraphs

In a hypergraph, a single hyperedge connects multiple vertices
In a normal graph, an edge connects two vertices
Let V be the vertex set.
A subset of V is called a hyper-edge
A hypergraph is a pair (V, F) of the vertex set and a set F of hyperedges
V = {v0, v1, …, v4}, F = {{v0,v1,v4}, {v0,v3}, {v2,v3}}
Hypergraphs can be illustrated as ordinary graphs consisting of two types of vertices
For a vertex i ∈ V, define its neighborhood as N(i) = {α | α ∈ F, i ∈ α}
For a hyperedge α ∈ F, define its neighborhood as N(α) = {i | i ∈ V, i ∈ α}
If every hyperedge has degree 2, the hypergraph is the same as an ordinary graph

5.2 Definition of Factor Graph Type Model

Definition 5.1 (Factor graph type model)
Assumptions
Given a hypergraph H=(V,F)
Furthermore, given a state set Xi for each vertex i ∈ V and, for each hyperedge α ∈ F, a function Ψα : ∏i∈α Xi → ℝ≥0
p(x) = 1/Z ∏Ψα(xα)
factor graph model
Z: Normalization constant
Z=∑∏Ψα(xα)
The probability distribution (family) defined by the above equation is called a factor graph model
Z is a normalization constant
Hypergraph H
Factorizes into a product of functions Ψα corresponding to the hyperedges α
A hyperedge α is called a factor
The function Ψα is a factor function
Z is generally difficult to compute
If Z cannot be calculated concretely, it is impossible to calculate the probability value p(x) for a given x
Some conditional probabilities can be easily calculated
For a vertex i, define B(i) = {j | j ∈ α for some α ∋ i, j ≠ i}, the set of vertices that share a factor with i
5.3 Factor graph models and Markov random fields
The difference between a factor graph model and a Markov random field
How to obtain a Markov random field from a factor graph model
If vertices 𝓊 and 𝓋 are contained in a common factor, connect them by an edge
How to obtain a factor graph model from a Markov random field
Make a hypergraph whose hyperedges are the cliques of the graph
The graph representation of a Markov random field carries less information
Different factor graphs can be represented by the same Markov random field graph
Example
Factor graph with p(x1,x2,x3)=Φ(x1,x2)Φ(x2,x3)Φ(x3,x1)
Factor graph of p(x1,x2,x3)=Φ(x1,x2,x3)
The corresponding graphs as Markov random fields
The graph representation of a Markov random field is natural in terms of conditional independence (Markovianity), but the factor graph representation retains a more detailed decomposition into products
Two factor graphs and their corresponding Markov random field graphs
The Markov random field representation is less informative

5.4 Factor Graph Type Models and Bayesian Networks

Bayesian nets can be expressed as a product of conditional probabilities
Bayesian nets can also be interpreted as factor graph type models
Factor graphs are used to display probability distribution functions in the context of streamlining calculations

5.5 Examples of Factor Graph Type Models

5.5.1 Binary pairwise model

When all factors have exactly two vertices
The Markov random field and the factor graph are then the same graph
All factor functions can be written as Ψ(xi, xj)
Simplest of the pairwise factor graph type models
Variables on vertex are 1 or -1
A factor function is specified by 2×2 = 4 values
The factor function is the above equation
The generalization of probability p to multiple values is the above equation
Let hi’ be the sum of hi
Normalization constant

5.5.2 Combinatorial optimization problems

Local constraints that appear in many combinatorial optimization problems can be expressed by factor functions
Example: SAT (the Boolean satisfiability problem)
Assumptions
Variable xi takes a Boolean value (0: false, 1: true)
The negation of xi is written x̄i = 1 - xi
A clause is a disjunction of variables xi and their negations joined by the OR operator ∨
Clause example:(x1∨x2∨x3)
A logical expression obtained by joining several clauses with the AND operator ∧ is in conjunctive normal form (CNF)
Example CNF: (x1∨x2∨x3) ∧ (x2∨x4∨x5)
For a given CNF, find the value {xi} of the variable that satisfies it
When every clause consists of K variables
Let the factor function be the above equation
Let xα be a variable contained in clause α and define its energy function Eα as follows
0 when xα satisfies the conditions of clause α
1 When xα does not satisfy the condition of clause α
β is a constant
For a given CNF, the Boolean assignment minimizing the number of unsatisfied clauses corresponds to the maximum-probability state of this distribution.

Chapter 6 Computing Marginal Probability Distributions 1: The Probability Propagation Method

6.1 Formulation of the inference problem

Given observed random variables Y = (Y1, Y2, …, YL) and unobserved random variables X = (X1, X2, …, XM)
Compute the marginal probability distribution (above equation) of each Xi under the given observations {yj}
Example
In the case of a factor graph type model as shown above
When variables y1 and y2 are observed
Probability distribution of unobserved variables x1 and x2 is as above
Substituting y1,y2 into the original factor function
Given a factor graph type model consisting of random variables X and Y
Problem to compute the marginal probability distribution under observation Y=(Yj)
Resulting in the problem of computing the (unconditional) marginal probability distribution on a transformed factor graph
Difficult to compute in practice
Approximation required
Tree structure allows efficient computation using probability propagation method

6.2 Probability Propagation on Trees

6.2.1 Definition of a tree

Connected: there exists a path whose start and end points are any two given vertices
Cycle: a closed path; within it every vertex has degree 2
Example of a cycle (bold line)
Tree: a connected graph without cycles
Tree example
A factor graph is a tree when its bipartite graph representation is a tree
Leaf: a tree vertex with degree 1
6.2.2 Probability Propagation on Linear Graphs
Consider the simplest example (factors are pairwise, graph is linear)
Probability distribution p(x) = 1/Z Φ12(x1,x2)Φ23(x2,x3)Φ34(x3,x4)Φ45(x4,x5)
Defining equation for the marginal probability of x3
Computing p3(x3) directly for all x3 requires on the order of K⁵ operations
where K is the number of states of each variable xi
The defining expression for the marginal p3(x3) can be rearranged as above
To reduce computation
Summing over x1 for each value of x2 costs K² operations
The other sums are likewise K², so the total cost is 4K²
Calculations are done in order from right to left
Efficiency is gained by computing quantities at the ends of the graph and sending them inward
This is the common pattern of dynamic programming
Examples of dynamic programming algorithms
CKY (Cocke-Younger-Kasami) algorithm: parsing in natural language processing
Viterbi algorithm: the optimal state sequence of a hidden Markov model
Dynamic time warping: comparing similarity between sequences
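
A hedged NumPy sketch of this message computation on the 5-variable chain, checked against brute-force marginalization (K and the factor values are made up):

```python
# Hedged sketch: computing p3(x3) on the chain Φ12 Φ23 Φ34 Φ45 via messages.
import numpy as np

K = 3
rng = np.random.default_rng(0)
Phi = [rng.random((K, K)) for _ in range(4)]     # Φ12, Φ23, Φ34, Φ45

m_left = Phi[0].sum(axis=0) @ Phi[1]             # fold in x1, then x2
m_right = Phi[2] @ Phi[3].sum(axis=1)            # fold in x5, then x4
p3 = m_left * m_right
p3 /= p3.sum()                                   # normalize (divide by Z)

# brute-force check: sum the full K^5 joint over x1, x2, x4, x5
joint = np.einsum('ab,bc,cd,de->abcde', *Phi)
p3_bf = joint.sum(axis=(0, 1, 3, 4))
print(np.allclose(p3, p3_bf / p3_bf.sum()))      # True
```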

6.2.3 Probability Propagation Method in Pairwise Models

Generalizing the aforementioned calculation method to the case of trees
The same kinds of computed quantities are sent inward in order from the leaves of the graph
A factor graph type model that is pairwise on a tree-structured graph G=(V,E) is represented by the above equation
Algorithm for probability propagation method in tree graphs
For leaf i, the message mi→j is determined
After that, all messages are determined sequentially from leaf to leaf.
With the probability propagation method, once the message (probability) is calculated, it can be obtained for all vertices (variables).
(Aside) pairwise (all-pairs) testing can handle larger models than orthogonal-array-based approaches
Pairwise methods are often used in test case generation
Naively combining all parameters makes the number of test cases huge
Since it is usually the combination of the values of two parameters that has the actual impact, form pairs of parameters and test all value combinations for each pair
Example
There is a column for member ID and a column for password.
When each has states such as “not entered”, “entered correctly”, and “entered an invalid value”
3 × 3 = 9 combinations are needed
Furthermore, if there are items such as gender, prefecture, email address, etc. on the form when registering as a member
9*27=243 ways

6.2.4 Probability Propagation Method in Factor Graphs

The probability propagation method is also extended to tree factor graphs
Message updates can be complex.
Suppose the factor graph model is given as above
Algorithm for probability propagation method on tree factor graphs
Schematic of update formula for message mα→i(x)

6.3 Typical Application: Hidden Markov Models (HMM)

Calculation of the probability propagation method in the HMM model
HMM model equation
Conceptual diagram of HMM
Compute the marginal probability distribution of each hidden state z under an observation x
The graphical model under observation x is as above
Definition of Φt,t+1
Message functions of the probability propagation method (forward and backward directions)
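
A hedged sketch of these forward and backward messages (the classical forward-backward recursion); the transition matrix A, initial distribution, and per-step observation likelihoods B are illustrative assumptions:

```python
# Hedged sketch: forward-backward messages on an HMM chain.
import numpy as np

K, T = 2, 4
A = np.array([[0.9, 0.1], [0.2, 0.8]])       # transition probabilities p(z_{t+1}|z_t)
pi = np.array([0.5, 0.5])                     # initial distribution
B = np.random.default_rng(1).random((T, K))   # p(x_t | z_t) for the fixed observation x

alpha = np.zeros((T, K))                      # forward messages
alpha[0] = pi * B[0]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[t]

beta = np.ones((T, K))                        # backward messages
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[t + 1] * beta[t + 1])

gamma = alpha * beta                          # proportional to p(z_t | x)
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma)
```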

6.4 Continuous variables

Similar arguments apply to Gaussian functions
Marginalizing a Gaussian distribution yields a Gaussian distribution
Non-Gaussian continuous variables require probability propagation with approximations
Approximation methods
Methods matching the expected values of sufficient statistics for consistency
Methods representing the probability density functions
Methods using particles
Methods embedding distributions in a Hilbert space

6.5 Other Exact Calculation Methods

Convert a non-tree hypergraph into a tree called a junction tree (join tree) and perform probability propagation on it
Exponential order with respect to tree width
Cannot be used if tree width is too large
Marginalization (variable elimination) is performed in order
Eliminating factor functions one by one

Chapter 7 Computing Marginal Probability Distributions 2: The Bethe Approximation

Introduction.
Compute approximate marginal probability distributions by applying the probability propagation algorithm to graphs other than trees
In variational terms, this is the Bethe approximation
From the Kikuchi approximation, an extension of the Bethe approximation, the generalized probability propagation method is derived

7.1 Probability Propagation Methods on Graphs with Cycles

7.1.1 Algorithm for probability propagation method

Probability Propagation Method Algorithm on General Factor Graphs
On graphs that are not trees, message updating does not terminate after a finite number of steps
Continue updating until the messages mα→i(xi) at steps t and t+1 are approximately equal for all α→i
May not converge
Convergence is easier when the graph structure is close to a tree or when the dependence among random variables is weak
Approximate solutions bi,bα of the marginal probability distribution satisfy a relation in the above equation called the local consistency condition

7.1.2 Example: Probability propagation method on a graph with one cycle

Simplest graph that is not a tree: cycle-type graph
Probability distribution function equation
The probability propagation method on this graph computes rightward and leftward messages
The above equation is established under the appropriate coefficient α
Using matrix Φ and vector m0,1, the above equation becomes
It becomes a problem of solving for the eigenvalues of the matrix
If there is only one cycle, the fixed points of the probability propagation method are characterized by a linear equation
If there are two or more cycles, the equation becomes a complex nonlinear equation with many fixed points

7.2 Variational formulation

7.2.1 Gibbs free energy function

Explains that in the tree case, the true marginal probability distribution can be characterized by a variational method of the Gibbs free energy function
The Gibbs free energy function is defined above as a function of probability p
Convex function as a function of P
Attains its minimum value -log Z at the distribution given above
The probability distribution determined by the factor functions {Ψα} is thus characterized by a variational problem for the Gibbs free energy function
In the tree case, an arbitrary probability distribution p is represented by the above equation using its marginal distribution {pi, pα}.
Putting this into the Gibbs free energy formula

7.2.2 Bethe free energy function

The Bethe free energy function approximates the Gibbs free energy function
For a hypergraph G with cycles, define the convex set 𝕃(G) by the above formula
A collection of candidate marginal probability distributions satisfying the local consistency condition
Its elements are called pseudo-marginal distributions
The definition of the Bethe free energy function is given above
The marginal probabilities are approximated by solving the minimization problem of this function
In general, characterizing the solution of a problem by some minimization (maximization) problem is called a variational method
Theorem.
The points where the gradient of the free energy function is zero are the solutions
Concavity of the function
Equation
The Bethe entropy term is a concave function if the hypergraph is a tree or has just one cycle
In that case the Bethe free energy function is convex
and the solution of the probability propagation method is unique
Closely related to a quantity called the graph zeta function
Gives a dual characterization of the probability propagation method

7.3 Generalized Probability Propagation Method

Introduction.
Derive generalized probability propagation from the Kikuchi free energy function, a generalization of the Bethe free energy function
Investigation based on ladder-type graph structure

7.3.1 Subset families closed under intersection

Example of a subset family and its Hasse diagram

7.3.2 Decomposition formulas for probability in the tree case

Theorem

7.3.3 Derivation of the Kikuchi Entropy Function

Definition: Kikuchi entropy function

7.3.4 Relationship between the Kikuchi entropy function and the Bethe entropy function

The Kikuchi entropy function is an extension of the Bethe entropy function

7.3.5 Derivation of the Generalized Probability Propagation Method

Generalized Probability Propagation Algorithm

7.3.6 Computational Example of Generalized Probability Propagation Method

7.3.7 Generalized Probability Propagation Method and the Kikuchi Free Energy Function

Theorem

Chapter 8 Computing Marginal Probability Distributions 3: The Mean-Field Approximation

Introduction.
Approximate computation of the marginal probability distribution from a variational problem, the mean-field approximation

8.1 Mean-field approximation

Rather than approximating the Gibbs free energy function, narrow the range of distributions over which it is varied
Restrict the probability distributions allowed as arguments of the Gibbs free energy function to those that decompose into a product of per-variable distributions, as in the above equation
Find {qi(xi)}i∈V such that the KL divergence (above equation) with the true probability distribution is minimized
Algorithm for mean-field approximation

8.2 Example: the Ising model

Mean field approximation is an approximation technique originally devised in physics
A simple example is the Inzing model

8.3 Mean Field Approximation and Related Methods

Bethe approximation (probability propagation method) and mean field approximation are suitable for different situations
The Bethe approximation is the approximation underlying the probability propagation algorithm
Its accuracy is better when the influence of cycles is small
The mean-field approximation works well when the influence from the surroundings fluctuates little around its mean
It suits graph structures with high vertex degree, close to a complete graph

8.4 Computing the marginal probability distribution by sampling

Sampling methods can also be used
For factor graph models, Gibbs sampling is used

Chapter 9 Learning Graphical Models 1 : Models without Hidden Variables

Introduction.
Explains how to obtain a graphical model from data
Background graph structure is assumed known
Task of learning function parameters
The most basic method of learning a stochastic model pθ(x) from data
From the observed data x(1), x(2), …, x(N), determine the parameter θ as the θML that maximizes the log-likelihood (above equation)
Consistency is valid under the regularity condition on the probabilistic model
Tasks that also learn the graph itself

9.1 Learning Bayesian Networks

Introduction.
In general, the probability distribution of a Bayesian network is expressed as a product of conditional probabilities
In learning the parameters of a Bayesian network, these conditional probabilities are obtained from the data

9.1.1 Learning by Maximum Likelihood Method

When the parameter group of the Bayesian network is given by the above equation
A parameter θi is assigned for each conditional probability
The log-likelihood function for the observed data {x(n)}n=1,…,N is given by the above equation
By changing the order of summation, the above equation becomes a sum over vertices
Maximum likelihood estimation is to find θi,ML that maximizes the above equation for each i.
For each vertex i, train the model log pθ(xi|xpa(i)) on the supervised data {(xpa(i)(n), xi(n))}.
Case
Assumptions
All vertices of the Bayesian network are discrete random variables
Parameter θi is the conditional probability table itself
The probability values determined by the maximum likelihood method are given by counting occurrences in the data
where the counts Ny and Nx,y are given by the above equation
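
A minimal sketch of this counting estimator for one conditional probability table, P(x | y) = Nx,y / Ny, on made-up samples:

```python
# Hedged sketch: maximum-likelihood CPT estimation by counting.
from collections import Counter

# samples of (parent value y, child value x); illustrative data
data = [("a", 0), ("a", 1), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]

N_y = Counter(y for y, _ in data)            # N_y
N_xy = Counter(data)                          # N_{x,y}
cpt = {(y, x): N_xy[(y, x)] / N_y[y] for (y, x) in N_xy}
print(cpt)    # e.g. P(x=1 | y='a') = 2/3
```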

9.1.2 Bayesian Learning

The number of parameters increases exponentially with the number of parents of vertex i
Solved by using Bayesian estimation instead of maximum likelihood
Case
As the prior distribution of the probability table for vertex i, consider the Dirichlet distribution (above) with parameter α = (α1, …, αK)
K is the number of states of the random variable Xi
The above expression is the normalization constant
The probability distribution (posterior distribution) of θ after observing the data {x(n)}n=1,…,N is
The advantage of using the Dirichlet distribution is that the posterior distribution can be easily calculated.
If we take the θMAP that maximizes the probability value of the posterior distribution, the above equation becomes
MAP estimator (maximum a posteriori estimation)
The expected value of the posterior distribution, θEAP, is given by
EAP estimator (expected a posteriori estimation)
Using the above, a conditional probability table is learned for each vertex i of a directed acyclic graph
The hyperparameter α of the prior distribution
is chosen to maximize the probability of the data (above equation) with the parameter θ marginalized out
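
A small sketch of these Dirichlet-posterior point estimates; prior values and counts are illustrative:

```python
# Hedged sketch: posterior Dir(alpha + N) and its MAP / EAP point estimates.
import numpy as np

alpha = np.array([2.0, 2.0, 2.0])        # prior hyperparameters (assumed)
counts = np.array([5.0, 0.0, 3.0])       # observed state counts N_k

post = alpha + counts                     # posterior Dirichlet parameters
theta_map = (post - 1) / (post.sum() - len(post))   # posterior mode
theta_eap = post / post.sum()                        # posterior mean
print(theta_map, theta_eap)
```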

9.2 Learning a Factor Graph Model Basics

Introduction
Learning the parameters of a factor graph model from data
Unlike Bayesian networks, in a Markov random field the optimization cannot be performed vertex by vertex
Each factor in a factor graph model is parameterized as a log-linear model as in the above equation
Tα(xα) is called a sufficient statistic
⟨·,·⟩ is the inner product of vectors
Parameterization in the case of discrete states
δx̄(x) is the function that is 1 when x = x̄ and 0 otherwise
Parameterization in the Gaussian case

9.2.1 Learning Formulation

Given training data x(1), …, x(N), we have x(n) = {xi(n)}i∈V, since all variables of the graphical model are observed
Learning parameters by maximum likelihood estimation method
Equation for the log-likelihood function
Dependence on θ cannot be decomposed into each factor because of the second term
It is difficult to obtain the optimal θ directly
For exponential distribution families, the likelihood function is a concave function, so in principle it can be calculated using the gradient method
The differential calculation results in the above equation
Determine θ so that the expected value of Tα(xα) for the empirical distribution matches the expected value for the model distribution.

9.2.2 Maximum Likelihood Method with the IPF Algorithm

Given a multivariate function f(θ1, …, θd)
Coordinate descent: repeated minimization over one coordinate at a time
Select i in order and minimize over θi with the other variables θj (j ≠ i) fixed
In the case of a discrete state variable model, maximizing the log-likelihood function for each θ
Generalized Iterative Scaling algorithm for non-discrete states
IPF Algorithm
Repeat updates until the gap between the empirical distribution and the model distribution disappears
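
A hedged sketch of the IPF idea on the smallest possible case, a two-factor model p(x1, x2) ∝ Ψa(x1)Ψb(x2): each factor is rescaled in turn until the model's marginals match the empirical ones (the target table is made up).

```python
# Hedged sketch: iterative proportional fitting on a two-factor model.
import numpy as np

emp = np.array([[0.3, 0.1], [0.2, 0.4]])   # empirical distribution (illustrative)
psi_a = np.ones(2)                          # factor over x1
psi_b = np.ones(2)                          # factor over x2

for _ in range(100):
    model = np.outer(psi_a, psi_b)
    model /= model.sum()
    psi_a *= emp.sum(axis=1) / model.sum(axis=1)   # match the x1 marginal
    model = np.outer(psi_a, psi_b)
    model /= model.sum()
    psi_b *= emp.sum(axis=0) / model.sum(axis=0)   # match the x2 marginal

model = np.outer(psi_a, psi_b)
model /= model.sum()
print(model.sum(axis=1), emp.sum(axis=1))   # marginals now coincide
```

For this fully factorized model, IPF can only match the single-variable marginals, not the full joint; with factors over larger variable sets the same update matches the corresponding factor marginals.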

9.3 Approximation of Factor Graph Type Models by Learning and Variational Methods

9.3.1 Upper bounds on the partition function and the entropy function
9.3.2 TRW upper bound on the partition function
9.3.3 Bethe Approximation

9.4 Learning by Maximum Likelihood Function

Chapter 10 Learning Graphical Models 2 : Models with Hidden Variables

10.1 Problem Setting and Formulation

10.1.1 Differentiation of the log-likelihood function

10.2 Variational Lower Bound and Variational EM Algorithm

10.2.1 Preparation: KL divergence
10.2.2 Derivation and optimization of variational lower bounds
10.2.3 Variational EM Algorithm
10.2.4 EM Algorithm

10.3 Variational EM Algorithm with Graphical Model
10.4 Other learning methods

10.4.1 Method by sampling
10.4.2 Wake-sleep algorithm

Chapter 11 Learning Graphical Models 3 : Examples

11.1 Boltzmann Machines

Introduction
A Boltzmann machine is a binary pairwise model whose graph structure is a complete graph.
The probability distribution function is the above equation
xi∈{1, -1}
From the observed data x(1), …, x(N), determine the model parameters θ = (J, h)
The log-likelihood function is given above
x̄ij and m̄i are the means of xixj and xi under the empirical distribution, respectively

11.1.1 Mean-field approximation

Using the mean-field approximation, determine the model parameters θ = (J, h)
Equation for the minimization problem
If qi(xi) = (1 + mixi)/2, then qi can be parameterized by mi ∈ [-1, 1]
Maximization problem with respect to m
Equation for f(h,m)
Let 𝛏(x)=-xlogx
Expression of Hi
Equation for parameter J,h

11.1.2 Bethe Approximation

How to determine parameters from data in the Bethe approximation
Equation
Equation for J

11.2 Hidden Markov Model

Introduction
How to learn the parameters of a hidden Markov model
Assume Gaussian observation model (above equation)
With mean μz and variance σz for each hidden state z
Observation x is obtained according to a Gaussian distribution
Assume there are K hidden states
Let Σz be a diagonal matrix
Let τz’,z be the transition probability from hidden state z’ to z
The total number of parameters to be learned is θ=(τ, μ, σ)

11.2.1 Learning with the EM algorithm

The probability distribution p(z|x,θ) of the hidden variables under the observations can be computed efficiently
and the M-step can be computed in closed form
Equation of the Q function
Maximize the parameter θ=(τ, μ, σ) in M-step
Equation for τ
Equations for μ, σ

11.2.2 Bayesian Hidden Markov Model

Assume a prior probability distribution for the parameter θ
Dirichlet distribution for the transition probability parameter τ
Use normal and gamma distributions for μ and σ
Let Φ = (u, 𝛒, α, β) denote these hyperparameters
Variational lower bound equation
Optimization for probability distribution q cannot be solved as is
Factor q into a product of a distribution over the hidden states and distributions over each parameter
Mean field approximation (with structure)
Optimization of q(z) is E-step
q(θ) optimization is M-step
The above equation is obtained by extracting the terms involving the hyperparameter Φ from the variational lower bound
Determine Φ to maximize the variational lower bound

Chapter 12 Calculation of MAP Assignment 1 : Maximum Propagation Method

12.1 What is MAP estimation?

Given a stochastic model p(x|θ), its MAP assignment (maximum a posteriori assignment) is
To find the MAP assignment
There are exponentially many possible states x of the graphical model relative to the number of vertices in the graph.
It is usually impossible to find the MAP assignment by a simple full search.
Perform MAP estimation with some of the variables in the graphical model observed
Reduces to the problem in Equation (12.1)
Note that the vector x’ = (x’i)i∈V collecting the maximizer of each marginal probability is in general not the MAP assignment
Caution.
The marginal probability distribution for the probability distribution function p(x) is expressed by the above equation
Maximum marginal distribution
Represents the maximum attainable probability when xi is held fixed
If the MAP assignment is unique, x* is obtained by taking at each vertex the state that maximizes the max-marginal distribution
If the maximum marginal distribution is known for all vertex i in the graphical model, the MAP assignment can be computed

12.2 MAP Estimation by Message Propagation

12.2.1 Calculations for the case of linear structures

If the graphical model has a tree structure, the MAP assignment can be computed exactly by an algorithm similar to the one for marginal probability distributions
As the simplest case, consider a graph with five vertices in a straight line
The defining equation for the maximum marginal distribution of x3 is the above equation
Let a variant of the equation be the above equation
Define the message function as the above equation
Use these to form the above equation
Addition and max share the algebraic properties used in the derivation (both distribute with multiplication of nonnegative factors)
so the same message-passing derivation goes through with Σ replaced by max
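
A hedged sketch of this max-for-sum swap on the 5-variable chain of Section 6.2.2, checked against brute force (factor values are made up):

```python
# Hedged sketch: max-product messages compute the max-marginal of x3.
import numpy as np

K = 3
rng = np.random.default_rng(2)
Phi = [rng.random((K, K)) for _ in range(4)]      # Φ12, Φ23, Φ34, Φ45

# same recursions as sum-product, with sums replaced by max
m_left = (Phi[0].max(axis=0)[:, None] * Phi[1]).max(axis=0)
m_right = (Phi[2] * Phi[3].max(axis=1)[None, :]).max(axis=1)
mm3 = m_left * m_right                             # max-marginal (up to 1/Z)

joint = np.einsum('ab,bc,cd,de->abcde', *Phi)      # brute-force check
print(np.allclose(mm3, joint.max(axis=(0, 1, 3, 4))))   # True
```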

12.2.2 Maximum Propagation Method on Tree Graphs

Calculation of maximum marginal distribution by message propagation
Algorithm for maximum propagation method
When the probability distribution is the above equation
Theorem: Maximum Propagation Method for Tree Graphs

12.2.3 Maximum Propagation Method on Factor Graphs with Cycles

Maximum Propagation Method can be applied to non-tree graphs

12.3 TRW Maximum Propagation Method

TRW max-product algorithm
Algorithm TRW Maximum Propagation Method
Factor Graph Model Assumptions
Theorem: strong tree agreement (STA) conditions and the MAP assignment

Chapter 13 Calculation of MAP Assignment 2: Method by Linear Relaxation

13.1 Formulation of the MAP estimation problem as a linear programming problem

Description of the algorithm for performing MAP estimation
For discrete-state graphical models, MAP estimation can be formulated as linear programming
Derive a new message propagation algorithm as a dual of linear programming
Consider a factor graph of discrete states on G=(V,F)
The factor function is assumed to be of log-linear type
Write ε for the exponential distribution family and Xi for the set of possible values of variable xi.
Equation between MAP estimation problems
Find x that maximizes the right equation
Rewriting to linear programming problem
Using the marginal polytope (the convex polytope of realizable marginal probabilities)
Its extreme points correspond to the states, as in the above equation
T(x) = {Tα(xα)} is the vector of sufficient statistics
cl(S) is the closed convex hull of the set S
Theorem: Linear programming formulation of the MAP estimation problem

13.2 Relaxation problem

Ideas for approximate solution of the MAP estimation problem
Using a “simple” problem to solve a difficult optimization problem
Use the convex polytope 𝕃 of pseudo-marginal distributions
True marginal probability distributions always satisfy the local consistency conditions, so 𝕄 ⊂ 𝕃
Schematic diagram of 𝕃 and 𝕄
Extreme points of 𝕃 that are also extreme points of 𝕄
Extreme points of 𝕃 that are not extreme points of 𝕄
Equation from the inclusion relation between 𝕃 and 𝕄
Example in a binary pairwise model with variables xi ∈ {0,1}
The extreme points of 𝕄(G) are 0-1 vectors
𝕃(G) also has fractional extreme points with entries equal to 1/2

13.3 Improving the relaxation by the cutting plane method

Introduction.
Relaxation causes the optimal solution of the relaxed problem to deviate from the optimal solution of the original problem
To solve this problem, add inequalities to the set of inequalities that define 𝕃 as needed
Algorithm: Cutting plane method
The difficult part is Step 2: choosing from C an inequality violated by the current solution b
An algorithm for doing this (a separation algorithm) follows

13.3.1 Cycle inequalities

Consider the separation algorithm for the set C of cycle inequalities in the case of the binary pairwise model
For a value x = (xi), edge ij being a cut means that xi ≠ xj
The number of cut edges on a cycle is always even
The above equation holds for a cycle C and its subset S⊂C where |S| is odd
Taking expected values of both sides yields the above equation
The inequality determined from C,S
Adding the set of all cycle inequalities to 𝕃 does not reach 𝕄
If G is a planar graph, it coincides

13.3.2 Separation algorithm

Select the necessary one among the cycle inequalities
Solve the minimum weight path problem
Slightly abbreviated

13.4 Solution by dual decomposition and message propagation

13.4.1 Duality of the relaxation problem

Another approach to solving the relaxation problem over 𝕃
Dual decomposition
New message propagation algorithm
Suitable for distributed computation
Primal problem equation
Equation for the dual problem of relaxation
By strong duality of convex optimization problems
Marginalization constraints
Denote the above expression by J(δ)
Decompose the objective function J into parts that can be maximized separately for each i and α
On the relationship between the primal problem and the dual problem
Theorem: Complementarity Condition
Approximate solution of MAP x

13.4.2 MPLP Algorithm

Optimize with respect to the {δαi} of each factor α using the coordinate descent method
Extract from the objective function only the terms involving α
Minimize the right-hand side
Supplement: minimization conditions
MPLP algorithm

13.4.3 Related Algorithms

Under tightened constraints, a message-propagation method similar to MPLP

Chapter 14 Structural Learning of Graphical Models

14.1 What is structure learning?

Introduction
Task of learning the shape of the graph itself
Assumptions
Variables X = { Xi } i ∈ V indexed by set V
N i.i.d. samples x(1), x(2), …, x(N)
Determine to what graphical model the probability distribution followed by the above samples belongs
In the case of Markov random fields
Considering, for each pair of vertices, whether or not it is connected by an edge, the number of possible graphs is given by the above equation
The number of combinations of how to connect each vertex is enormous, making learning difficult

14.2 Learning Markov Random Fields

14.2.1 Independence-based approach, Constraint-based approach

Verify whether the Markov condition holds or not
Pairwise Markov Condition
The condition for whether there is an edge between two vertices i and j
Specific Methods
Pearson’s chi-square test of conditional independence
The number of possible values of XV∖{i,j} is exponential in |V| - 2
Requires a large amount of data to calculate
Method to find N(i) satisfying the above equation
Structural learning by finding a neighborhood N(i) for each vertex i ∈ V
Computed in two phases
In the first (grow) phase, N(i) is enlarged until it satisfies the condition
In the second (shrink) phase, unnecessary vertices are removed from N(i)
Connect i and j by an edge when j ∈ N(i) using the obtained neighborhood
Pseudo Code

14.2.2 Methods using sparse regularity

Boltzmann Machine (Binary Pairwise Model)
i and j are connected in the Markov random field exactly when Jij ≠ 0
If optimizing the above objective yields sparse coupling coefficients {Jij}, structure learning is effectively performed at the same time
The right equation adds L1 regularization to the likelihood
Computation is difficult for complete graphs such as Boltzmann machines
Starting from a graph with no edges, the approach is to gradually add edges that increase the value of the objective function.
Similarly for Gaussian Markov random fields, structured learning can be done by L1 regularization
i and j are connected by an edge exactly when
Jij ≠ 0
The coefficient matrix J of the xixj terms is the inverse of the covariance matrix
Equation for optimization problem
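
A hedged sketch of this L1-regularized structure learning using scikit-learn's GraphicalLasso (the data and the true precision matrix are synthetic):

```python
# Hedged sketch: Gaussian MRF structure learning via the graphical lasso.
# Near-zero entries of the estimated precision matrix J mean "no edge".
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
J_true = np.array([[2.0, 0.6, 0.0, 0.0],     # a sparse chain structure
                   [0.6, 2.0, 0.6, 0.0],
                   [0.0, 0.6, 2.0, 0.0],
                   [0.0, 0.0, 0.0, 2.0]])
X = rng.multivariate_normal(np.zeros(4), np.linalg.inv(J_true), size=5000)

model = GraphicalLasso(alpha=0.05).fit(X)
print(np.round(model.precision_, 2))         # estimated J; zeros indicate no edge
```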

14.3 Structural Learning of Bayesian Networks

14.3.1 Methods using conditional independence

Approaches to learning the structure of Bayesian nets
Determine directed acyclic graph structure from conditional independence of data
The most elementary of the algorithms using conditional independence
Building on the above results regarding the vertex connectivity structure and d-separability of directed acyclic graphs
Whether vertices u,v are connected by directed edges is characterized by d-separability
Orientation of directed edges is also obtained from data
SGS Algorithm
Cannot be computed unless the number of vertices |V| is quite small
Low robustness
Improved algorithm
PC (Peter Spirtes and Clark Glymour) algorithm
GS (Grow-Shrink) algorithm

14.3.2 How to maximize the score function

For each graph structure, define a score that represents the goodness of fit of the data, and find the graph that maximizes this score
Equation for the score of a directed acyclic graph G under data D
It is in the form of a sum of local scores
Log Likelihood Function
Algorithm to maximize the log-likelihood by restricting the directed acyclic graph to be searched to a tree
Search method within a restricted tree width
A method of adding or deleting edges or reversing the orientation so as to increase the score value

Appendix A Formulas

A 1 Conditional Independence Formula
A 2 Partially Ordered Sets and Möbius Functions

A 2.1 Basic Properties of Partially Ordered Sets
A 2.2 Möbius Functions

Appendix B Introduction to Convex Analysis

B 1 Definitions

B 1.1 Convex sets
B 1.2 Convex Functions

B 2 Fenchel Duals

B 2.1 Definition of Fenchel Duals
B 2.2 Properties of Fenchel Duals
B 2.3 Examples of Fenchel Duals

B 3 Duality in Convex Optimization Problems

B 3.1 Convex optimization problems
B 3.2 Strong Duality
B 3.3 KKT Vector and Optimality Conditions

Appendix C Exponential Families of Distributions

C 1 Definition of exponential distribution family
C 2 Parameter Transformations for Exponential Distribution Families

C 2.1 Derivation of Parameter Transformations for Exponential Distribution Families
C 2.2 Example 1 Multidimensional Gaussian distribution with mean 0
C 2.3 Example 2: The case of finite sets

References
