Modeling that combines probability and logic (2) PLL (Probabilistic Logical Learning)

From Iwanami Data Science Vol. 3, "Probabilistic Modeling Combining Probability and Logic".

Predicate Logic and PLL

In the previous article, we discussed SRL (statistical relational learning), which was developed in North America. This time, we will discuss its European counterpart, probabilistic logic learning (PLL).

SRL methods such as PRM (probabilistic relational models) and MLN (Markov logic networks) are based on the idea of enriching probabilistic models with relational and logical expressions. Although SRL uses such expressions, it does not directly aim to enrich predicate logic itself with probability. On the other hand, knowledge representation based on predicate logic has long been studied in the field of artificial intelligence, and attempts to incorporate probability into it, so as to represent not only logical knowledge that always holds but also probabilistic knowledge, date back to before the statistical machine learning boom.

The approach from logic to probability in artificial intelligence research arguably began with the work of Nilsson in 1986. Nilsson considered a knowledge base consisting of logical formulas annotated with probabilities, viewed it as a set of constraints on the probabilities of occurrence of possible worlds (the models of the knowledge base), and from it computed an interval, consisting of upper and lower bounds, for the probability of logical formulas derived as theorems from the knowledge base. Later, in the second half of the 1990s, research on probabilistic logic programming, which introduces probability into logic programming, began to flourish.

Logic programming is a programming paradigm that uses definite clauses as program code; its representative language is Prolog, a distant relative of functional languages such as Lisp and an influence on languages such as Erlang.

Two types of approaches to making logic programming probabilistic were proposed. One, like Nilsson's, assigns a probability interval to a logical formula; the other assigns a single probability value to a logical formula. For the latter, Poole showed that discrete Bayesian networks can be represented by logical formulas. In both approaches, however, the probabilities were assumed to be given in advance, and learning them from data was not considered.

Later, Sato proposed a method for statistically learning probabilities from data based on a probabilistic extension of the least model semantics, enabling statistical learning of hidden Markov models and probabilistic context-free grammars. Meanwhile, inductive logic programming (ILP), which grew out of logic programming and performs structural learning based on logic, began to be integrated with Bayesian networks in the 1990s. By incorporating Bayesian networks, it was expected to handle uncertainty in real-world data, such as noise and missing values, while retaining the high expressive power of predicate logic. These studies, together with the probabilistic logic programming mentioned above, are now referred to as PLL.

PLL has been studied mainly in Western Europe, but in the 2000s, SRL in North America and PLL in Western Europe began to interact, driven by their shared interest in probability and logic, and a single community formed across the Atlantic. Research then continued to develop steadily; in PLL, for example, so-called lifted inference, which exploits the symmetry of logical expressions to make probability computation more efficient, was studied intensively. By around 2010, however, both SRL and PLL research had reached a pause.

Around the same time, alongside the SRL/PLL trend, a movement toward probabilistic programming (PP) emerged, aiming to realize machine learning in high-level languages, which can be seen as the original ambition of SRL/PLL, and in the United States DARPA launched the PPAML (Probabilistic Programming for Advanced Machine Learning) project. Probabilistic programming uses Turing-complete general-purpose programming languages that go beyond graphical models and probabilistic grammars, and aims at highly flexible probabilistic modeling in high-level languages such as Church, based on Scheme, and Anglican, based on Clojure. These languages make nonparametric Bayesian models relatively easy for non-specialists to handle.

Both SRL and PLL have tried to link logic and probability by way of probabilistic models, but they still have some way to go. Some of the ways of linking the two are ad hoc, such as simply attaching a numerical value called a probability to a logical expression, and there have been few attempts to rigorously integrate the two as (discrete) probabilistic modeling.

Inductive logic programming (ILP) and probability

ILP is a knowledge acquisition method proposed by Muggleton in the UK in the early 1990s. It represents positive and negative examples as ground atoms and extracts the regularities hidden in those examples in the form of definite clauses. It has since developed and diversified, spreading mainly in Europe, and is now established as a form of structural learning within machine learning.

Meanwhile, the PLL movement to introduce probability into ILP began in the 1990s and continues to this day. For example, Muggleton focused on the similarity between definite clauses and CFG rules and, just as a PCFG is obtained by assigning probabilities to CFG rules, proposed assigning probabilities to definite clauses so that probabilities are assigned to the atoms being proved; he called this SLP (stochastic logic programming). In the 2000s, a group of ILP researchers led by De Raedt in Belgium began introducing probability into ILP in a way different from Muggleton's. In many of their publications, however, definite clauses were used to define Bayesian networks, and probability computation and learning depended on the Bayesian network.

Both Muggleton and De Raedt started from predicate logic formulas and followed the path of assigning probabilities to them, but they were strongly oriented toward applications such as data mining and did not choose predicate logic as a tool for handling the infinite. Logical formulas were often used merely as convenient structured names (albeit ones containing variables), and despite probabilities being assigned to logical formulas, there was little awareness of their relation to probability measures on possible worlds.

ILP is still actively studied, and dedicated conferences continue to be held.

Connecting Probability and Logic

SRL/PLL attempts to link knowledge expressed in predicate logic with probability computation and learning through probabilistic models. However, logic and probability are mutually independent disciplines, and simply using logical expressions in a probabilistic model does not mean that the two are integrated without contradiction. One theorem that links probability and logic mathematically and without contradiction is Fenstad's representation theorem.

Roughly speaking, this theorem says that to assign probabilities to logical formulas in a consistent way, one should define a probability measure P(·) over the possible worlds and, for a logical formula φ, take the measure of the set of possible worlds that make it true, {ω | ω ⊨ φ}, as the probability of φ. This suggests that if we implement computable possible worlds in a computer and define probability measures on them, it will be possible to compute and learn probabilities of logical expressions in a mathematically consistent way.

Based on Fenstad's representation theorem, we can also artificially create a finite set of possible worlds (Herbrand interpretations {ω1, ω2, …, ωn}) equipped with a probability measure and use it to simulate probability calculations. Specifically, the occurrence probability of each possible world \(\lambda(\omega_1),\dots,\lambda(\omega_n)\ \left(\sum_{i=1}^n\lambda(\omega_i)=1\right)\) is defined, and for a closed logical formula φ, the sum of the probabilities of the possible worlds that make φ true is taken as the probability of φ:

\[P(\phi)=\displaystyle\sum_{\omega_i\models\phi,\ 1\le i\le n}\lambda(\omega_i)\]

As a concrete example, assume that the Herbrand domain consists of two constants {a, b} and that the language 𝓛 has only R(·,·) as a predicate symbol. There are then 4 ground atoms (atoms without variables), {R(a,a), R(a,b), R(b,a), R(b,b)}, and a total of 2⁴ = 16 possible worlds (Herbrand interpretations) determined by their truth values. The occurrence probabilities of the possible worlds are fixed so that they sum to 1; in the calculation below, three worlds ω1, ω2, ω3 receive probabilities λ(ω1) = 0.2, λ(ω2) = 0.3, λ(ω3) = 0.5, and the remaining worlds receive probability 0.

Then the probability is calculated as follows.

\begin{eqnarray}P(\exists x_1 R(x_1,a))&=&\lambda(\omega_1)+\lambda(\omega_2)=0.2+0.3=0.5\\P(\forall x_1 R(x_1,a))&=&\lambda(\omega_2)=0.3\\P(\forall x_1\exists x_2 R(x_1,x_2))&=&\lambda(\omega_1)+\lambda(\omega_2)+\lambda(\omega_3)=1.0\end{eqnarray}

Using these, we finally obtain P(R(x1,x2)) = 0.2·0.5 + 0.3·0.9 + 0.5·0.4 = 0.57, so under this distribution over possible worlds the predicate R(x1,x2) is true with an average probability of 0.57.
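
The closed-formula calculations above can be reproduced with a short script. The following Python sketch enumerates the 2⁴ = 16 possible worlds, places the probability mass 0.2, 0.3, 0.5 on three worlds ω1, ω2, ω3, and computes P(φ) by summing over the worlds that satisfy φ. The concrete truth assignments chosen for ω1, ω2, ω3 are illustrative assumptions that are merely consistent with the sums in the text, not the worlds from the original article.

```python
from itertools import product

# Ground atoms over the Herbrand domain {a, b} with the single predicate R/2.
atoms = [("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")]

# A possible world (Herbrand interpretation) is a truth assignment to the 4 atoms,
# so there are 2**4 = 16 worlds in total.
worlds = [dict(zip(atoms, bits)) for bits in product([True, False], repeat=len(atoms))]

# Probability measure lambda over the worlds. The three worlds below are an
# illustrative choice consistent with the sums in the text (0.2 + 0.3 + 0.5 = 1);
# all other worlds get probability 0.
def measure(w):
    if w == {("a", "a"): True, ("a", "b"): False, ("b", "a"): False, ("b", "b"): True}:
        return 0.2  # omega_1: Ex.R(x,a) holds, Ax.R(x,a) does not
    if w == {("a", "a"): True, ("a", "b"): False, ("b", "a"): True, ("b", "b"): False}:
        return 0.3  # omega_2: Ax.R(x,a) holds
    if w == {("a", "a"): False, ("a", "b"): True, ("b", "a"): False, ("b", "b"): True}:
        return 0.5  # omega_3: Ax.Ey.R(x,y) holds, Ex.R(x,a) does not
    return 0.0

# P(phi) = sum of lambda(omega) over the worlds omega that satisfy phi.
def prob(phi):
    return sum(measure(w) for w in worlds if phi(w))

exists_x_Rxa  = lambda w: any(w[(x, "a")] for x in ("a", "b"))
forall_x_Rxa  = lambda w: all(w[(x, "a")] for x in ("a", "b"))
forall_exists = lambda w: all(any(w[(x, y)] for y in ("a", "b")) for x in ("a", "b"))

print(prob(exists_x_Rxa))   # 0.5 = lambda(omega_1) + lambda(omega_2)
print(prob(forall_x_Rxa))   # 0.3 = lambda(omega_2)
print(prob(forall_exists))  # 1.0 = lambda(omega_1) + lambda(omega_2) + lambda(omega_3)
```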

Logic-based probability modeling

In the previous section, based on Fenstad's representation theorem, we created a finite set of possible worlds with an artificial probability measure and simulated probability calculations. In other words, we created a probability model. If we generalize and implement this idea, we should be able to realize probability models based on logic.

For simplicity, we consider only the probabilities of closed logical formulas. In this case, by Fenstad's representation theorem, it suffices to define a single probability measure over the set of possible worlds, and there is no need to consider a further probability measure inside each possible world. Logic-based probability modeling therefore needs to do the following three things.

  • Definition: define a probability measure over the possible worlds.
  • Computation: compute the probability of a logical formula.
  • Learning: learn the probabilities from data.

The first task is to define a probability measure over the possible worlds, which becomes difficult when the possible worlds are infinite. For simplicity, we denote the set of all ground atoms of the first-order language 𝓛 by B and call it the Herbrand base. A possible world is a map from the Herbrand base to {1 (true), 0 (false)}, and thus there are 2^|B| possible worlds (where |B| is the cardinality of B). If there are no function symbols, B is a finite set, so we can enumerate the possible worlds as in the simulation above and assign a probability to each world so that they sum to 1.

When there are function symbols, however, B is countably infinite, and there are 2^|B| possible worlds, that is, as many as there are real numbers. It is therefore impossible to enumerate them all and assign probabilities one by one.

From a purely mathematical point of view, it is possible to define a probability measure on a set of possible worlds with the cardinality of the continuum. From the standpoint of machine learning, however, the second task matters: this probability measure must be computable, and there is also the engineering requirement that it be computationally efficient. Related to the third task, it must, as a probability model, be parameterized in a way that makes statistical learning possible.

Distribution semantics

To take full advantage of what Fenstad's representation theorem offers for machine learning, it is necessary to go beyond finite graphical models and realize probabilistic modeling at the level of programming languages. There are many probabilistic models that deal with infinite worlds, such as Markov chains and PCFGs (probabilistic context-free grammars), each of which has been studied as a separate field, but all of them, finite or infinite, including graphical models, can be absorbed into logic-based probability modeling.

In the world of logic, a logic-based programming language called Prolog has existed for a long time, and various processing systems have been developed and maintained.

First, if the objective is only to make a programming language probabilistic, it is enough to build probabilistic elements into the program. Consider Turing machines, the basis of programming languages: their state transitions are non-deterministic, and if we replace the non-deterministic choices with probabilistic ones, we obtain a probabilistic Turing machine. Similarly, if we make the choice of which clause to call in Prolog probabilistic, we obtain a probabilistic Prolog.

However, this is merely a probabilization of computation. To arrive at a probability space as described by Fenstad's theorem, a probability measure must be defined over the possible worlds of a probabilistic logic program; this position is called the "distribution semantics". The distribution semantics is a type of denotational semantics, which defines the mathematical meaning of a program, and is currently the standard semantics for probabilistic logic programs.

The distribution semantics starts from probabilistic choices made by something like dice rolls, that is, categorical distributions. The direct product of countably many categorical distributions is called the basic distribution. For example, when the categorical distributions are Bernoulli distributions, the basic distribution is a probability measure on {0,1}^∞, the set of all infinite sequences of 0s and 1s. In the distribution semantics, this fixed basic distribution is processed by a logic program to create more complex distributions that approximate the real world.

Technically, the basic distribution is extended to a distribution over possible worlds by using the standard semantics of non-probabilistic logic programs, namely the least model semantics, together with Kolmogorov's extension theorem. As a result, arbitrarily complex distributions can be expressed simply by writing logic programs, and a very wide range of generative probability models can be represented within the distribution semantics, including current standard models such as BN (Bayesian networks), LDA, HMM (hidden Markov models), and PCFG.
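
As a rough illustration of this idea (a simplified sketch, not Sato's formal construction), the following Python snippet treats an i.i.d. supply of Bernoulli switches as the basic distribution and lets a small recursive generative procedure, standing in for a logic program, consume those switches. The induced distribution over the emitted strings plays the role of the complex distribution, here that of a toy two-state hidden Markov model; all parameters and names are hypothetical.

```python
import random
from collections import Counter

def switch(p):
    """One draw from the basic distribution: a Bernoulli(p) probabilistic switch."""
    return random.random() < p

def hmm(state="s0", remaining=5):
    """Generate a length-5 string from a toy two-state HMM, using only
    Bernoulli switches for emissions and state transitions."""
    if remaining == 0:
        return ""
    # Emission: state s0 prefers 'a', state s1 prefers 'b' (illustrative parameters).
    if state == "s0":
        symbol = "a" if switch(0.8) else "b"
    else:
        symbol = "b" if switch(0.8) else "a"
    # Transition: stay in the current state with probability 0.7.
    next_state = state if switch(0.7) else ("s1" if state == "s0" else "s0")
    return symbol + hmm(next_state, remaining - 1)

# Repeated sampling approximates the distribution over strings that the
# program, together with the basic distribution of switches, defines.
print(Counter(hmm() for _ in range(10000)).most_common(5))
```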
