Integration of probability and logic (1) Bayesian Net, KBMC, PRM and SRL


Iwanami data science vol.2

One application of Bayesian estimation mentioned before is the Bayesian network. Bayesian nets are a modeling technique that expresses causal relationships (strictly speaking, probabilistic dependencies) between events as a graph structure, and they have been used and studied in fields such as fault diagnosis, weather prediction, medical decision support, marketing, and recommendation systems.

A Bayesian net consists of a directed acyclic graph whose nodes are the random variables X1, …, XN, with a conditional probability table (CPT) attached to each node; the joint distribution P(X1=x1, …, XN=xn) is represented through this graph structure as the product of the CPT entries.
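As a minimal illustration of this factorization, the following sketch computes a joint probability as the product of CPT entries. The three nodes, their CPT values, and the Rain/Sprinkler/WetGrass names are made up purely for illustration.

```python
# Minimal sketch of a Bayesian net as nodes + CPTs (all names and numbers are illustrative).
# The joint probability factorizes as P(x1,...,xN) = prod_i P(xi | parents(xi)).

# Each node: (list of parent names, CPT mapping (tuple of parent values) -> P(node=True))
network = {
    "Rain":      ([], {(): 0.2}),
    "Sprinkler": (["Rain"], {(True,): 0.01, (False,): 0.4}),
    "WetGrass":  (["Rain", "Sprinkler"], {(True, True): 0.99, (True, False): 0.8,
                                          (False, True): 0.9, (False, False): 0.0}),
}

def joint_probability(assignment):
    """P(X1=x1,...,XN=xn) as the product of conditional probabilities."""
    prob = 1.0
    for node, (parents, cpt) in network.items():
        parent_values = tuple(assignment[parent] for parent in parents)
        p_true = cpt[parent_values]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

print(joint_probability({"Rain": False, "Sprinkler": True, "WetGrass": True}))
# 0.8 * 0.4 * 0.9 = 0.288
```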

Various types of learning have been studied for these models, ranging from estimating the probability values of the CPTs to structure learning, which learns the graph structure itself.

Notable examples include Naive Bayes, an algorithm that computes the probability of every candidate class given the data and outputs the most probable one as its estimate, which is used for applications such as train-congestion estimation, disaster response (tsunamis and earthquakes), real-time processing of stock prices, spam mail classification, recommendation systems, and sentiment analysis; and hidden Markov models, applied to the recognition of time-series patterns in speech recognition, bioinformatics, morphological analysis (natural language processing), music tracking, and partial-discharge detection. These have become standard probabilistic modeling techniques that support data mining.
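A minimal sketch of the Naive Bayes idea described above, applied to spam classification. The training examples, counts, and Laplace smoothing constant are all illustrative, not a production implementation.

```python
from collections import Counter
import math

# Toy training data: (tokens, label). All examples are made up.
train = [
    (["win", "money", "now"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["win", "prize"], "spam"),
    (["project", "meeting", "notes"], "ham"),
]

labels = {label for _, label in train}
label_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in labels}
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def classify(tokens, alpha=1.0):
    """Return the label maximizing P(label) * prod_w P(w | label), with Laplace smoothing."""
    best_label, best_logp = None, float("-inf")
    for label in labels:
        logp = math.log(label_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in tokens:
            logp += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

print(classify(["win", "money"]))  # expected: "spam"
```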

Automatic generation of Bayesian nets by KBMC

Bayesian nets, however, offer no way to share a description between similar models: a separate net must be built for each different set of variables, which makes it difficult to describe complex and very large models.

To address this problem, research was conducted on the automatic generation of Bayesian nets, called knowledge-based model construction (KBMC). Using predicate logic and a declarative programming language such as Prolog, KBMC dynamically generates, by backward reasoning over a knowledge base, just the Bayesian net needed to answer a given question, and uses it to calculate probabilities and produce the answer. This avoids having to build one huge Bayesian net covering all possible questions in advance.
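The sketch below illustrates the KBMC idea under simplified assumptions: the knowledge base, the rule names, and the CPT placeholders are all hypothetical. Starting from a query, backward chaining collects only the rules relevant to it and assembles a small Bayesian-net fragment instead of one large net.

```python
# Hypothetical knowledge base: each entry maps a query atom to its parents and a CPT fragment.
rules = {
    "alarm":      (["burglary", "earthquake"], "P(alarm | burglary, earthquake)"),
    "john_calls": (["alarm"],                  "P(john_calls | alarm)"),
    "burglary":   ([],                         "P(burglary)"),
    "earthquake": ([],                         "P(earthquake)"),
}

def build_network(query, network=None):
    """Dynamically build the Bayesian-net fragment needed to answer `query`."""
    if network is None:
        network = {}
    if query in network or query not in rules:
        return network
    parents, cpt = rules[query]
    network[query] = (parents, cpt)
    for parent in parents:       # backward reasoning over the knowledge base
        build_network(parent, network)
    return network

print(build_network("john_calls").keys())
# dict_keys(['john_calls', 'alarm', 'burglary', 'earthquake'])
```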

KBMC can be seen as a probabilistic counterpart of expert systems, but it also inherited their problems, such as the frame problem.

Probabilistic relational model (PRM)

PRM is an evolution of the Bayesian network, a directed graphical model, and a precursor to SRL. Simply put, a PRM builds a Bayesian network from a relational database (RDB). Whereas a conventional Bayesian net deals with data represented as a single table and models probabilistic dependencies among attributes by treating the attributes as random variables, a PRM handles multiple tables in a relational database and uses the reference relationships between the tables to define probabilistic dependencies between attributes, in the same table or in different ones, for each tuple (record, i.e., a row of a table); collecting these dependencies yields a Bayesian net.

In other words, PRM is characterized by handling multiple tables and modeling probabilities at the level of individual tuples. Computing probabilities takes more work than in an ordinary Bayesian net, but PRMs can express more complex and fine-grained probabilistic dependencies than conventional Bayesian nets.

Learning also differs. A conventional Bayesian network learns its probabilities under the assumption that the tuples in the table are samples from an independent and identically distributed (i.i.d.) source, whereas in PRM the database itself is the only sample, and the probabilities are learned from that single example. The i.i.d. assumption therefore does not apply directly; instead, PRM makes a separate assumption that is practically equivalent as far as probability computation is concerned.

More specifically, PRM takes an object-oriented view: each table is regarded as a class, and the tuples belonging to a table as objects (individuals, instances) of that class. Below, tables and classes, and tuples and objects, are treated interchangeably. A tuple has multiple attributes, which divide into non-probabilistic attributes, such as IDs that identify an object within a table or values expressing reference relationships to other tables, and ordinary probabilistic descriptive attributes that express the properties of an individual.

As an example, consider a relational database of data about movies. As shown in the figure below, this relational database has three tables: actors, appearances, and movies.

The table of actors has as attributes the actor’s ID (an attribute that uniquely identifies the actor) and the actor’s gender. The attribute that uniquely identifies the actor is called the primary key in relational database terminology. The gender of an actor is a descriptive attribute. Similarly, the movie table has the movie ID and the movie genre as attributes. Furthermore, the table of appearances has as attributes the ID of the appearance, the ID of the actor, the ID of the movie, and the role that an actor played in a certain movie. Of these, the ID of the appearance is the primary key, while the ID of the actor and the ID of the movie are used to refer to other tables and are called foreign keys.
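As a concrete rendering of this schema, the three tables could be written down as follows; the IDs and attribute values are hypothetical placeholders.

```python
# Hypothetical contents of the three tables. The *_id attributes are primary keys,
# and actor_id / movie_id inside `appearances` are foreign keys.
actors = [
    {"actor_id": "a1", "gender": "female"},
    {"actor_id": "a2", "gender": "male"},
]
movies = [
    {"movie_id": "m1", "genre": "action"},
    {"movie_id": "m2", "genre": "romance"},
]
appearances = [
    {"appearance_id": "p1", "actor_id": "a1", "movie_id": "m1", "role": "lead"},
    {"appearance_id": "p2", "actor_id": "a2", "movie_id": "m1", "role": "support"},
    {"appearance_id": "p3", "actor_id": "a1", "movie_id": "m2", "role": "lead"},
]
```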

The attribute A of class X is written X.A, and the class referenced by the foreign key σ is written X.σ; in the example above, Actor.Gender, Appearance.Actor, and so on. The reference relationships can be generalized further: let τ denote a reference chain that follows foreign keys zero or more times, in either the forward or the reverse direction, and write X.τ for the class reached (traced to) from class X by the chain τ. A probabilistic descriptive attribute A of class X may then have a probabilistic dependency on a probabilistic descriptive attribute B of the class X.τ reached via τ. For example, Appearance.Role may depend probabilistically on Actor.Gender and Movie.Genre.

For each attribute A of class X, the PRM defines the set of parent attributes (parent nodes) Pa(X.A) on which A probabilistically depends, together with the associated conditional distribution (CPT) P(X.A|Pa(X.A)). The relational schema, consisting of the relation (table) names, their attributes, and the attribute domains, together with Pa(X.A) and the accompanying P(X.A|Pa(X.A)) describing the local probabilistic dependencies of the attributes, is called the PRM schema.

A skeleton σ is the RDB with all tuple attribute values deleted, so that it represents only the key-based reference relationships between tuples. A PRM is a pair of a PRM schema and a skeleton σ: the PRM schema, which describes the probabilistic dependencies among attributes, is instantiated at the tuple level by the skeleton. In other words, the PRM defines probabilistic dependencies between an attribute x.A of a tuple x ∈ X in table X of the RDB and the attribute B of the tuple y = x.τ, i.e., x.τ.B, reached from x by a reference chain τ determined by the skeleton σ.

These probabilistic dependencies are assumed to form a Bayesian net (under conditions such as the dependencies containing no cycles). In this Bayesian net, unlike an ordinary one, it is the attribute of each individual tuple, x.A, that is a random variable and forms a node of the graph, and the CPT associated with x.A is P(X.A|Pa(X.A)). In other words, in a PRM, tuples of the same class share the same CPT for the same attribute even though they are different tuples. This mechanism lets PRM treat the tuples, for the purposes of probability computation, as if they were i.i.d. samples.
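A rough sketch of this grounding step under hypothetical table contents: every tuple attribute becomes a node whose parents are found by following the foreign keys of the skeleton, and all tuples of the same class point to one shared CPT object.

```python
# Hypothetical skeleton: appearance tuples with foreign keys but no descriptive values.
appearances = [
    {"appearance_id": "p1", "actor_id": "a1", "movie_id": "m1"},
    {"appearance_id": "p2", "actor_id": "a2", "movie_id": "m1"},
]

# One CPT object for the class attribute Appearance.Role, shared by every tuple.
shared_cpt_role = {"name": "P(Appearance.Role | Actor.Gender, Movie.Genre)"}

def ground_network(skeleton):
    """Build one random variable per tuple attribute; parents follow the foreign keys."""
    nodes = {}
    for ap in skeleton:
        node = (ap["appearance_id"], "role")
        parents = [(ap["actor_id"], "gender"), (ap["movie_id"], "genre")]
        nodes[node] = {"parents": parents, "cpt": shared_cpt_role}  # same CPT for all tuples
    return nodes

net = ground_network(appearances)
print(net[("p1", "role")]["cpt"] is net[("p2", "role")]["cpt"])  # True: the CPT is shared
```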

The figure below shows the PRM schema, the PRM skeleton, and part of the Bayesian net equivalent to the PRM they define; the parent nodes of Appearance.Role are Actor.Gender and Movie.Genre.

As the number of rows (tuples) of actors, movies, and appearances increases, we can see that a Bayesian net with a very large number of nodes is constructed.

The original relational database is the skeleton σ with the deleted tuple attribute values filled back in; this corresponds to assigning a value to every node of the Bayesian net. In other words, the RDB is one instance I (a realization) of this Bayesian network. For a tuple x in table X, write \(I_{x.A}\) for the realization of x.A in I, and \(I_{Pa(x.A)}\) for the realization of the parent attributes Pa(X.A) of attribute A of X given by the PRM schema S. The probability \(P(I_{x.A}|I_{Pa(x.A)})\) is called a parameter, and the set of all parameters is denoted θ. Furthermore, let \(\mathbf{X}\) denote the set of all tables and \(\mathbf{A}(X)\) the set of all attributes of a table \(X\in\mathbf{X}\). Then the probability of the instance I, in other words the likelihood of the RDB determined by the PRM, is given by

\[P(I|\sigma,\mathbf{S},\theta)=\displaystyle\prod_{X\in\mathbf{X}}\prod_{A\in \mathbf{A}(X)}\prod_{x\in\sigma(X)}P(I_{x.A}|I_{Pa(x.A)})\]
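A small numerical sketch of this likelihood for the hypothetical movie example, restricted to the single probabilistic attribute Appearance.Role; the CPT entries and the joined rows are illustrative.

```python
import math

# Illustrative CPT shared by all appearance tuples: P(Role | Actor.Gender, Movie.Genre)
cpt_role = {
    ("female", "action"):  {"lead": 0.4, "support": 0.6},
    ("male",   "action"):  {"lead": 0.6, "support": 0.4},
    ("female", "romance"): {"lead": 0.7, "support": 0.3},
}

# Instance I: the fully filled-in database, one joined row per appearance tuple.
instance = [
    {"gender": "female", "genre": "action",  "role": "lead"},
    {"gender": "male",   "genre": "action",  "role": "support"},
    {"gender": "female", "genre": "romance", "role": "lead"},
]

# P(I | sigma, S, theta) = product over tuples of P(I_{x.role} | I_{Pa(x.role)})
log_likelihood = sum(
    math.log(cpt_role[(row["gender"], row["genre"])][row["role"]])
    for row in instance
)
print(math.exp(log_likelihood))  # 0.4 * 0.4 * 0.7 = 0.112
```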

The conditional probability values that are the parameters of the model are computed by maximum likelihood estimation, i.e., by maximizing the likelihood above. If there are missing observations, parameter learning is likewise done by maximum likelihood, for example with the EM algorithm. As for structure learning, in PRM it amounts to finding, from the data, the parent attributes Pa(X.A) for the attribute of interest X.A of table X in the RDB. PRM has been applied to recommendation systems and other tasks, and has since been extended to incorporate structural uncertainty and combined with the entity-relationship (ER) model.

Markov random fields and SRL, MLN, PSL

Looking back at PRM from the standpoint of predicate logic, if we regard the tables, which represent relations, as predicates and the tuples of the tables as individuals, then we are dealing with individuals and the relations between them. That a tuple t1,…,tn is in a table X is expressed by an atom of predicate logic, X(t1,…,tn), where X is a predicate symbol and the ti are called terms. This expression represents the proposition "t1,…,tn stand in the relation X". A relational database is then equivalent to a finite set of atoms without variables (ground atoms), and a PRM can describe, for example, a probabilistic dependency between a term ti of X(t1,…,tn) and a term sj of Y(s1,…,sm). However, logical dependencies such as X(t1,…,tn) ⇒ Y(s1,…,sm) cannot be described or exploited even when they exist, and general laws expressed with universal quantifiers, such as ∀x bird(x) ⇒ can_fly(x), are out of scope.

In predicate logic, a term is a variable such as x, y, z, …, a constant such as 0, 1, 2, …, or a more complex functional term such as father_of(x), and it denotes an individual. An atom or its negation is called a literal, and a disjunction of literals is called a clause. If a clause contains variables, they are regarded as universally quantified (bound by ∀). There is a subclass of clauses of the form A ⇐ B1∧…∧Bn or, equivalently, B1∧…∧Bn ⇒ A (n ≥ 0; A and the Bi are atoms), called definite clauses, and the case n = 0 (which coincides with an atom) is specifically called a unit clause. bird(x) ⇒ can_fly(x) is an example of a definite clause: it says that for all x (since variables are regarded as bound by ∀), if x is a bird then x can fly. Clauses are a powerful form of knowledge representation: a relational database can be represented as a set of unit clauses, and recursive transitive rules such as "a friend of a friend is also a friend" (friend(x,y) ∧ friend(y,z) ⇒ friend(x,z)) can also be expressed.

It is therefore also possible to build probabilistic models over individuals and the relations between them using clauses of predicate logic, which have more descriptive power than relational databases. One representative example is the Markov logic network (MLN), which defines a Markov random field by a set of weighted clauses.

The notion of possible worlds is useful for discussing MLNs. In general, the truth or falsehood of a logical formula is determined by fixing a domain (a set), deciding the truth values of the ground atoms over that domain, and then determining the truth values of more complex formulas in turn, so that the truth value of every formula over the domain is fixed. For clauses, it is customary to take the set of all ground terms (terms without variables), the Herbrand universe, as the domain. A Herbrand interpretation is an assignment of truth values to all the ground atoms over the Herbrand universe, and each such interpretation is a possible world. An MLN specifies a probability distribution over possible worlds by means of a set of clauses with real-valued weights. An example is as follows.
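A representative weighted clause set for the smoking and friendship example (the weights shown here are purely illustrative) is:

\[\begin{array}{ll}1.5: & \forall x\ \mathrm{smokes}(x)\Rightarrow \mathrm{cancer}(x)\\ 1.1: & \forall x\,\forall y\ \mathrm{friends}(x,y)\Rightarrow(\mathrm{smokes}(x)\Leftrightarrow \mathrm{smokes}(y))\end{array}\]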

This is the weighted clause set that defines an MLN about smoking and friendship (A ⇒ B is equivalent to ¬A∨B, and A⇔B to (A⇒B)∧(B⇒A), so the two formulas above can be rewritten as a clause set). The greater the weight of a clause, the higher the probability of possible worlds in which the clause is true.

Consider the case where the Herbrand domain is finite and consists of two individuals {a, b}. Taking all ground instances of each clause (instances in which every variable of the clause is replaced by a ground term) and connecting by edges the ground atoms that appear together in the same ground instance of a clause, we obtain an undirected graph as shown in the figure.

For example, instantiating friends(x,y) ⇒ (smokes(x) ⇔ smokes(y)) with a for x and b for y gives friends(a,b) ⇒ (smokes(a) ⇔ smokes(b)), in which the ground atoms friends(a,b), smokes(a), and smokes(b) are connected by edges and form a clique.

In this undirected graph, the ground atoms are random variables taking the values 1 (true) and 0 (false), and the graph represents a Markov random field that defines the probabilistic dependencies among the ground atoms. The clique corresponding to a ground instance of clause C is given the weight \(W_C\) associated with that clause. The probability of a possible world ω is then given by the log-linear model \(p(\omega)\propto \exp\left(\displaystyle\sum_C W_C\cdot N_C(\omega)\right)\), where \(N_C(\omega)\) is the number of ground instances of clause C that are true in ω.
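A minimal sketch of this log-linear computation for the two-individual domain {a, b}. The clauses and weights follow the illustrative example above; everything else is made up, and the normalizing constant is computed by brute force over all possible worlds.

```python
import itertools
import math

individuals = ["a", "b"]
# Ground atoms of the illustrative clauses: smokes(x), cancer(x), friends(x, y).
atoms = ([f"smokes({x})" for x in individuals]
         + [f"cancer({x})" for x in individuals]
         + [f"friends({x},{y})" for x in individuals for y in individuals])

def n_smokes_cancer(world):
    # Ground instances of  smokes(x) => cancer(x)  that are true in this world.
    return sum((not world[f"smokes({x})"]) or world[f"cancer({x})"] for x in individuals)

def n_friends_smokes(world):
    # Ground instances of  friends(x,y) => (smokes(x) <=> smokes(y))  true in this world.
    return sum((not world[f"friends({x},{y})"])
               or (world[f"smokes({x})"] == world[f"smokes({y})"])
               for x in individuals for y in individuals)

def unnormalized(world, w1=1.5, w2=1.1):
    return math.exp(w1 * n_smokes_cancer(world) + w2 * n_friends_smokes(world))

# P(world) = unnormalized(world) / Z, with Z summed over all 2^|atoms| possible worlds.
worlds = [dict(zip(atoms, values))
          for values in itertools.product([False, True], repeat=len(atoms))]
Z = sum(unnormalized(w) for w in worlds)
example = {atom: False for atom in atoms}   # the world where every ground atom is false
print(unnormalized(example) / Z)
```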

An MLN can thus be viewed as a conventional log-linear model whose features are the ground clauses, taking the values 1 and 0. MLNs have been applied extensively in natural language processing as well as in recommendation systems.

A more recently proposed class of Markov random fields besides MLNs is probabilistic soft logic (PSL), which, like MLN, uses programs consisting of weighted clauses such as 0.3 : friend(B,A) ∧ votesFor(A,P) → votesFor(B,P). The fundamental difference is that truth values are continuous, which gives PSL a computational advantage over MLN. Specifically, in PSL the truth values are continuous in the interval [0,1]; after assigning such values as parameters to the ground atoms, the truth value of any logical expression built from the ground atoms is computed recursively as follows.

\begin{eqnarray}\omega(A\land B)&=&\max(0,\omega(A)+\omega(B)-1)\\ \omega(A\lor B)&=&\min(1,\omega(A)+\omega(B))\\ \omega(\neg A)&=&1-\omega(A)\end{eqnarray}
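These rules transcribe directly into code; a minimal sketch in which soft truth values are plain floats in [0,1], with an illustrative evaluation at the end:

```python
def soft_and(a, b):
    """omega(A AND B) = max(0, omega(A) + omega(B) - 1)"""
    return max(0.0, a + b - 1.0)

def soft_or(a, b):
    """omega(A OR B) = min(1, omega(A) + omega(B))"""
    return min(1.0, a + b)

def soft_not(a):
    """omega(NOT A) = 1 - omega(A)"""
    return 1.0 - a

# Example: omega(friend AND votesFor) with omega(friend) = 0.9, omega(votesFor) = 0.7
print(soft_and(0.9, 0.7))  # 0.6
```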

For a weighted clause r = λ : B → H of the program, the "distance from being true" of the clause is defined as dr(ω) = max(0, ω(B) − ω(H)). The condition dr(ω) = 0 is equivalent to ω(B) ≤ ω(H), which corresponds to treating the truth value of B → H as in a Heyting algebra (a generalization of Boolean algebra).

The probability of occurrence of a possible world ω is given by \(P(\omega)\propto \exp\left(-\displaystyle\sum_{r\in R}\lambda_r\cdot d_r(\omega)\right)\), where R is the set of ground instances of the weighted clauses in the PSL program, restricted to the ground atoms relevant to the probabilistic model at hand. Worlds ω for which dr(ω) = 0 holds for many r thus have high probability.

One advantage of PSL over MLN is that the most probable explanation (MPE) problem can be solved faster: when part of the world ω is observed and its truth values are fixed, the MPE problem is to find the truth values (parameters) of the remaining atoms that maximize P(ω).

In PSL, solving the MPE amounts to finding the parameter values that minimize \(\phi=\displaystyle\sum_{r\in R}\lambda_r\cdot d_r(\omega)\); since each dr(ω) is a piecewise linear function of the parameters, φ is convex in them, and the MPE can therefore be solved quickly.
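A minimal sketch of this minimization for the single hypothetical rule 0.3 : friend(B,A) ∧ votesFor(A,P) → votesFor(B,P), with friend(B,A) and votesFor(A,P) observed and votesFor(B,P) left free; the observed values are illustrative, and scipy is used only as a convenient convex solver.

```python
from scipy.optimize import minimize

LAMBDA = 0.3
friend_ba, votes_ap = 1.0, 0.8           # observed soft truth values (illustrative)

def phi(x):
    votes_bp = x[0]                       # the free parameter we optimize over
    body = max(0.0, friend_ba + votes_ap - 1.0)   # Lukasiewicz conjunction of the body
    d_r = max(0.0, body - votes_bp)               # distance from the rule being true
    return LAMBDA * d_r

result = minimize(phi, x0=[0.5], bounds=[(0.0, 1.0)])
print(result.x)  # any value >= 0.8 gives phi = 0; the solver reaches that minimum
```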

The above is an overview of SRL. In general, although SRL introduced logical expressions to improve the descriptive power of Bayesian nets, the logic is used only as a kind of convenient (macro-like) device, and the connection with logic proper remains weak.

In the next article, we will discuss probabilistic logic learning (PLL), which was developed in Europe.
