Machine Learning Professional Series – Statistical Causal Search Reading Notes

Summary

Techniques for examining "causal relationships," as opposed to mere "correlations," are "causal inference" and "causal search." Both are methods for analyzing causal relationships, but they differ in purpose and approach: causal inference is a technique for verifying causal relationships, while causal search is a technique for discovering them.

Causal search is a method for discovering potential causal relationships by exploring data to identify variables that may predict outcomes. Specific approaches include (1) correlation analysis, which computes the correlation coefficient between two variables; (2) causal graph methods, which build a directed graph representing the causal relationships between variables and estimate causal relationships from the graph structure; (3) structural equation models, which represent the causal relationships between variables as equations and interpret their coefficients; and (4) regression analysis, which estimates causal relationships by selecting an appropriate regression model. Various other machine learning algorithms have also been applied to causal search.

Here, we will discuss various theories and applications of causal search based on “Statistical Causal Search” in the Machine Learning Professional Series.

In this article, I present my reading notes.

Machine Learning Professional Series – Statistical Causal Search Reading Notes

"How do you find cause-and-effect relationships from huge amounts of data? In this book, the leading researcher who developed LiNGAM (Linear Non-Gaussian Acyclic Model) explains the basics and advanced topics in an easy-to-understand manner. A must-have for causal inference and causal search."

Chapter 1: Starting Point for Statistical Causal Search

1.1 Introduction
1.2 The Greatest Difficulty in Causal Search: Spurious Correlation
Scatterplot of Chocolate Consumption and Nobel Prize Winners
Multiple causal relationships may give the same correlation
There are three possible causal relationships, but they all give the same correlation
The more chocolate consumed, the more Nobel Prize winners there are.
Meaning of Terms
Observed variable
A variable that has been observed
Unobserved variable
An unobserved variable for which no data have been collected.
Causal graph
A graph in which the variable at the tail of an arrow is the cause and the variable at the head of the arrow is the effect.
Quantitative information such as “how large is the causal effect” is not included in a causal graph.
Common cause
Common cause of multiple variables
Unobserved common cause
Unobserved common cause
Spurious correlation
A phenomenon in which correlation appears even though there is no causal relationship
Data generating process
The process by which data is generated.
The procedure of how the “value” of a variable is determined.
1.3 Numerical examples of spurious correlation
The process of generating data in three types of causal graphs
Let the number of Nobel laureates be y
Let the amount of chocolate consumed be x
Let GDP be z
Let the error variables be e_x and e_y
e_y: a variable that collects into one term all variables other than chocolate consumption and GDP that contribute to determining the number of Nobel laureates y
e_x: a variable that collects into one term all variables other than GDP that contribute to determining chocolate consumption x
Generating process of Y
y = b_yx x + λ_yz z + e_y
For the sake of simplicity, let’s assume linearity
Linearity
The value of the variable y can be written as a sum of terms in the variable x, the variable z, and the error variable e_y
y, x, z, and e_y are random variables
b_yx and λ_yz are constants
Subscripts: the first letter denotes the left-hand-side variable and the second letter the corresponding right-hand-side variable
x = λ_xz z + e_x
Equations and scatter plots of three types of causal graphs
Although the equations for the generation process of x and y are different
The correlation between x and y will be the same
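To make this concrete, here is a minimal Python sketch (with made-up coefficients; it is not from the book) that draws data from the three candidate structures. All three produce a sizable correlation between x and y, so the correlation coefficient alone cannot tell them apart:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)    # GDP (common cause)
e_x = rng.normal(size=n)  # error for chocolate consumption
e_y = rng.normal(size=n)  # error for Nobel laureates

# (1) x -> y with common cause z
x1 = z + e_x
y1 = 0.5 * x1 + z + e_y
# (2) y -> x with common cause z
y2 = z + e_y
x2 = 0.5 * y2 + z + e_x
# (3) no direct causal relation: z is the only source of association
x3 = z + e_x
y3 = z + e_y

for name, (x, y) in {"x->y": (x1, y1), "y->x": (x2, y2), "z only": (x3, y3)}.items():
    print(name, round(float(np.corrcoef(x, y)[0, 1]), 3))
```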
1.4 Summary of Essence

Chapter 2 Fundamentals of Statistical Causal Inference

2.1 Introduction
Substantive science
Basic sciences, such as the natural and social sciences, and applied sciences, such as engineering and medicine
Main purpose
To clarify causal relationships
Methodology
Statistics and machine learning
Main objective
Research on the methods themselves used to achieve the goals of substantive science
Statistical causal inference
Methodology for inferring causal relationships from data
If you change something, and something else changes, the two are causally related
How can this be expressed mathematically?
2.2 Defining Causality with Counterfactual Models
Introduction
In order to clarify the meaning of causality, we will explain the counterfactual model.
2.2.1 Individual-level causality
Introduction of the concept of "individual-level causation"
(A) Explanation by example
An individual (human species)
High aldehyde dehydrogenase activity and a strong tolerance for alcohol
Name: Keiko Ishida
Age: 35 years old
Gender: Female
Speaks Japanese
Occupation “Data Scientist”.
Works in the city, lives in the suburbs
Commute is one hour
Etc.
Individual A has a disease, and we are interested in whether a certain drug will cure her disease.
To find out, we compare the results of two behaviors.
Have individual A take the medicine
Have individual A not take the medicine
The causal relationships we consider for individuals are called
Individual-level causality
(B) Explanation by symbols
Whether or not to take the medicine is represented by x
If x=1, take the medicine
If x = 0, do not take the medicine
Whether or not the person is sick in 3 days is indicated by y
If y=1, the person is sick in 3 days
If y=0, the person is not sick in 3 days
2.2.2 The Fundamental Problem of Causal Inference
The problem with using data to investigate the causality of individuals
It is impossible to observe both outcomes of two actions.
We can always observe only one of them
Fundamental problem of causal inference
2.2.3 Group-level causality
Introduction of the concept of "population-level causation"
2.3 Describing the data generation process using structural equation models
Structural equation models
Procedures for determining the values of variables
Mathematical tools to describe the data generation process
An equation called a structural equation is used to describe how the value of a variable is determined.
Example: Medicine and disease
The data generating process of how the value of the variable y, which indicates whether a person is ill, is determined
y = f_y(x, e_y)
Structural equation
The left-hand side is defined by the right-hand side
y: whether the person is sick (1: sick, 0: not sick)
e_y: an error variable that collectively represents all variables other than x that can contribute to determining the value of y
x and y are observed variables; e_y is an unobserved variable
Data generation process for x
x = f_x(e_x)
The error variable e_x collects into one term all variables that can contribute to determining the value of x.
A simpler expression
x = e_x
Example of a data generating process in a structural equation model
The bidirectional arrow between e_x and e_y indicates that e_x and e_y may be dependent rather than independent.
If they are dependent, then knowing the e_x of an observation lets us predict the value of its e_y to some extent
Variable y
Endogenous variable
A variable that appears on the left-hand side of the structural equation
The structural equation describes which variables are determined by the values of the other variables.
Error variables e_x and e_y
Exogenous variable
Variables that appear only on the right-hand side of the structural equation
The procedure by which they are generated from any variable is not described.
Four components of a structural equation model
Endogenous variables
Exogenous variables
Functions connecting endogenous and exogenous variables
Probability distributions of the exogenous variables
The qualitative relationships among the values of the variables represented by a structural equation model are expressed using a "causal graph."
When a variable on the right-hand side of a structural equation may be needed to compute the value of the variable on the left-hand side
Draw a directed edge from it to the left-hand-side variable
When the existence of an unobserved common cause is suspected between two variables in the model
Draw a bidirectional edge between the error variables associated with the two variables.
Generalization of the structural equation
v_i = f_i(v, u)
v_i: endogenous variable
v: vector of p endogenous variables, v = [v_1, ..., v_p]^T
u: vector of q exogenous variables, u = [u_1, ..., u_q]^T
f = {f_1, ..., f_p}: the set of functions
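As a side note, these structural equations translate naturally into code: exogenous variables are sampled first, then each endogenous variable is computed from its equation. A minimal sketch (the coefficient 0.8 and the Gaussian errors are illustrative assumptions, not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n observations from a toy structural equation model.

    Exogenous variables are sampled first ("left to nature"), then each
    endogenous variable is computed from its structural equation.
    """
    e_x = rng.normal(size=n)  # exogenous variable
    e_y = rng.normal(size=n)  # exogenous variable
    x = e_x                   # structural equation: x = e_x
    y = 0.8 * x + e_y         # structural equation: y = f_y(x, e_y)
    return x, y

x, y = sample(10_000)
```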
2.4 A Framework for Statistical Causal Inference: Structural Causal Models
Introduction
Structural causal model
A typical framework for causal inference.
It is based on two models
A counterfactual model of causation
Structural equation model, a model of the data generating process
2.4.1 Representing population-level causality
Representation of population-level causality in the counterfactual model by a structural equation model
A behavior called “intervention” is represented using structural equations.
What does it mean to intervene on a variable x?
To intervene on a variable x means to "set the value of the variable x to a constant c, no matter what value any other variable takes."
Other variables: all variables, observed and unobserved
This intervention is denoted by the symbol do, as do(x=c).
Interventions are done from outside the model
Example: Medicine and disease
To intervene in the variable x, which represents whether or not to take medication, means
To make sure that the patient takes the medication regardless of age, gender, severity of the disease, etc., i.e., to set the value of x to 1 (or never take the medication (set the value of x to 0))
In the structural equation model, the intervention do(x=c) means "replacing the structural equation representing the data generation process of x with another structural equation, x = c, representing the new data generation process."
The value of x is always generated from the structural equation x=c.
Structural equation model M_{x=c} with the intervention
The x=c in the subscript of M indicates that we intervened on x and set its value to c
x = c
y = f_y(x, e_y)
A causal graph with interventions
Before the intervention the generation of x is "left to nature"; after the intervention the value of x becomes the constant c
Unobserved common factors no longer exist
Autonomy assumption
Replacing any of the structural equations does not change the distribution of the other structural equation functions or exogenous variables
Example: Medicine and disease
Intervention to have the patient take the medicine does not change the function fy connecting the disease and the medicine
The distribution of the error variable ey associated with y also remains unchanged
The probability distribution of y after the intervention on x, p(y|do(x=c))
p(y|do(x=c)) := p_{M_{x=c}}(y)
The symbol := means that the left-hand side is defined by the right-hand side
The right-hand side, p_{M_{x=c}}(y), is the distribution of y in the structural equation model M_{x=c}
The distribution of y when we intervene by setting the value of x to c is the distribution of y in the new structural equation model M_{x=c} created by the intervention.
Representation of population-level causality in a structural equation model
If there are two values of x, c and d, such that the probability distribution of y after the intervention is different
In this population, x is the cause of y
If there are values c and d such that p(y|do(x=c)) ≠ p(y|do(x=d))
In this population, x is the cause of y
x and y are causally related
Example: Medicine and disease
If the distribution of whether or not a person is sick after 3 days differs between the case where the person takes the medicine and the case where the person does not take the medicine
p(y|do(x=1)) ≠ p(y|do(x=0))
In this population, whether or not to take the medicine is the cause of whether or not the disease is cured
If the probability that the disease will be cured if the patient takes the medicine is greater than if the patient does not take the medicine
p(y=0|do(x=1)) > p(y=0|do(x=0))
In this population, the behavior of taking the medicine has the effect of curing the disease
2.4.2 Quantifying the magnitude of causal effects
What is the magnitude of the causal effect, if any?
One way to quantify the magnitude of the causal effect from variable x to variable y
Evaluating the average difference
E(y|do(x=d)) − E(y|do(x=c))
Average causal effect
The difference between E(y|do(x=d)) and E(y|do(x=c)), the expected values of y in the hypothetical populations M_{x=d} and M_{x=c} obtained by setting x=d and x=c.
It shows how much the value of the variable y changes on average when the value of the variable x is changed from the constant c to the constant d.
If you want to know the magnitude of the causal effect, calculate the average causal effect instead of the correlation coefficient.
Example: Medicine and disease
Intervention on x
Average causal effect
E(y|do(x=d)) − E(y|do(x=c)) = E(b_yx d + e_y) − E(b_yx c + e_y) = b_yx(d − c)
The change in x, the difference between d and c, multiplied by the coefficient of x, b_yx.
When intervening on y
Average causal effect from y to x when the value of y is changed from c to d
E(x|do(y=d)) − E(x|do(y=c)) = E(e_x) − E(e_x) = 0
Changing the value of y does not change x
There is no directed edge from y to x in the graph.
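The intervention do(x=c) can be mimicked in simulation by literally replacing the structural equation for x with the constant c, which makes the average causal effect easy to check numerically. A small sketch under the linear model above (b_yx = 0.8 is a made-up value):

```python
import numpy as np

rng = np.random.default_rng(2)
b_yx = 0.8     # made-up coefficient for y = b_yx * x + e_y
n = 200_000

def y_under_do_x(c, n):
    """Distribution of y in the post-intervention model M_{x=c}:
    the structural equation for x is replaced by the constant c,
    while f_y and the distribution of e_y stay unchanged (autonomy)."""
    e_y = rng.normal(size=n)
    return b_yx * c + e_y

c, d = 0.0, 1.0
ace = y_under_do_x(d, n).mean() - y_under_do_x(c, n).mean()
print(round(ace, 3), "=~", b_yx * (d - c))  # empirical vs. b_yx(d - c)
```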
2.4.3 Representation of individual-level causality
Representation of causality at the individual level in structural equation models
Example: Medicine and disease
Whether or not individual A is ill after 3 days when we have A take the medicine (intervening and setting the value of x to 1)
is represented by y_{x=1}(A).
The (A) in the superscript indicates the value for individual A.
The x=1 in the subscript of y indicates that we intervened on x and set its value to 1.
If y_{x=1}(A) and y_{x=0}(A) are different
In individual A, whether or not to take the medicine is the cause of whether or not the disease is cured.
Expression in structural equation model
Preparation
Equation of the structural equation model
x = e_x
y = f_y(x, e_y)
The structural equation model before the intervention generates the values of x and y for individual A, a member of the population.
The values of the error variables e_x and e_y for individual A are e_x(A) and e_y(A)
The values of x and y are obtained from the above equations
x = e_x(A)
y = f_y(e_x(A), e_y(A))
The structural equation model M_{x=1} obtained by intervening on x and setting its value to 1 for this population
x = 1
y = f_y(x, e_y)
Representation of causality in individuals
If there are values c and d such that y_{x=d}(A) ≠ y_{x=c}(A)
then "in individual A, x is the cause of y."
2.4.4 Explanation of Events
Both population-level and individual-level causality express predictions about "what would happen if we intervened."
Structural causal models can express not only such predictions, but also explanations of what caused events in the past.
Example
Structural equation model of study time (x) and score (y)
2.5 Randomized experiments
"Randomized experiment" is the simplest method for inferring causal relationships.
Example: Medicine and disease
In a randomized experiment, whether or not each individual takes the medication (the value of x) is decided at random.
In this way, whether or not to take the medication is determined independently of any other variable.
Then, each individual is tested to see if he or she is sick (y) after 3 days.
Randomized experiment
Structural equation
x = ē_x
The error variable ē_x is a random variable that follows a Bernoulli distribution with success probability 1/2.
It represents the random decision of whether or not to take the drug.
Since the value of x is randomly determined, no other variable contributes to the generation of its value.
The two error variables ē_x and e_y are independent
y = f_y(x, e_y)
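A hedged sketch of this randomized experiment (coefficients and distributions are illustrative, not from the book): because randomization makes ē_x independent of e_y, the plain difference of group means estimates the average causal effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
b_yx = -0.5  # made-up "effect of the drug" coefficient

# Randomized experiment: x is a fair coin flip, independent of e_y
x = rng.binomial(1, 0.5, size=n)  # e_x-bar ~ Bernoulli(1/2)
e_y = rng.normal(size=n)
y = b_yx * x + e_y                # y = f_y(x, e_y), linear for simplicity

# Thanks to randomization, E(y|do(x=1)) - E(y|do(x=0)) equals the
# ordinary difference of the two group means.
ace_hat = y[x == 1].mean() - y[x == 0].mean()
print(round(ace_hat, 3), "=~", b_yx)
```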
2.6 Summary of this chapter
The language of probability theory is not sufficient to describe mathematically the causal relationships defined on the basis of interventions
Introduction of the symbol "do"
To measure the magnitude of a causal effect, calculate the average causal effect rather than the correlation coefficient
Randomized experiments are one way to estimate the average causal effect
Prior knowledge of the temporal order determines the possible causal directions
Randomization makes the two error variables independent, so that the expected value with do is the same as the normal conditional expectation
Estimate the average causal effect from the data

Chapter 3 Fundamentals of Statistical Causal Search

3.1 Motivation
Two directions of statistical causal research
(1) Research to clarify under what conditions predictions and explanations about causality are possible when the causal graph is known.
(2) Research to clarify under what conditions the causal graph can be inferred when the causal graph is unknown.
Statistical causal search
Examples where statistical causal search is needed
Depression and sleep disorder
Correlation coefficient is 0.77
Candidate causal relationships
Sleep disturbance causes depressed mood
Depressed mood causes sleep disorders
Depressed mood and sleep disorder are not causally related
Causal graph representing the three candidate hypotheses
It is practically difficult to conduct a randomized experiment
How to infer a causal graph without a randomized experiment
Special assumptions must be put in place instead of randomization
These are not a perfect substitute for randomization
3.2 Framework for Causal Search
Use of spurious correlation
Correlation and causality gap
What conditions must be met for differences to appear in the distribution of observed variables
Framework for statistical causal search
3.3 Basic Problems of Causal Search
Basic problems of causal search
Variables
x, y: observed variables, endogenous variables
z: unobserved common cause, exogenous variable
e_x, e_y: error variables, unobserved exogenous variables
Model A
x = f_x(z, e_x)
y = f_y(x, z, e_y)
p(z, e_x, e_y) = p(z)p(e_x)p(e_y)
The exogenous variables z, e_x, e_y are independent
Model B
x = f_x(y, z, e_x)
y = f_y(z, e_y)
p(z, e_x, e_y) = p(z)p(e_x)p(e_y)
The exogenous variables z, e_x, e_y are independent
Model C
x = f_x(z, e_x)
y = f_y(z, e_y)
p(z, e_x, e_y) = p(z)p(e_x)p(e_y)
The exogenous variables z, e_x, e_y are independent
Assume autonomy
Replacing any of the structural equations does not change the distribution of the other structural equation functions or the exogenous variables
Assume that the data matrix X was generated from one of the three models A, B, or C
Guess which of the three models generated the data matrix X
The exogenous variables z, e_x, e_y are generated according to the probability density function p(z, e_x, e_y)
The endogenous variables x and y are generated by the functions f_x and f_y
Inference of the original causal graph from the probability distribution p(x,y) of the observed x and y
Major research topics in statistical causal search
The main research question of statistical causal search is: "To what extent can the original causal graph be inferred under what assumptions on the functional forms and the distributions of the exogenous variables?"
3.4 Three approaches to the basic problem of causal search
Criteria distinguishing the approaches
What assumptions are placed on the functions f_x and f_y?
What assumptions are placed on the distributions p(z), p(e_x), and p(e_y) of the exogenous variables z, e_x, and e_y?
3.4.1 Nonparametric Approach
An approach that “makes no assumptions” on the function form or on the distribution of the exogenous variables.
The functional form can be linear or nonlinear.
The distribution of the exogenous variables can be Gaussian or otherwise.
It is not possible to guess which of the three models generated the data matrix X.
It does not try to "recover the causal model"
Instead it identifies the limits of how much can be estimated from the data alone, with as few assumptions as possible
The question is what can and cannot be estimated.
3.4.2 Parametric Approach
An approach that makes assumptions about both the functional form and the distribution of the exogenous variables
Incorporate the analyst’s prior knowledge into the model as assumptions
Often assumes linearity in the functional form and Gaussian distribution in the distribution of exogenous variables
A combination of parametric and nonparametric approaches is often used
A parametric approach is used to infer the causal graph, which represents the qualitative causal relationships.
Based on the graph, the average causal effect, which quantifies the magnitude of the causal effect, is inferred using a nonparametric approach.
Even if we assume linearity of the functions and Gaussian distributions of the exogenous variables, all the models give the same distribution of the observed variables.
It is impossible to guess which model among A, B, and C generated the data matrix X
3.4.3 Semiparametric Approach
An approach in which assumptions are made about the functional form, but not about the distribution of the exogenous variables.
An approach in which linearity is assumed for the functional form and non-Gaussianity is assumed for the distribution of the exogenous variables
LiNGAM (Linear Non-Gaussian Acyclic Model)
Exogenous variables can be anything other than Gaussian distribution
Can infer which of the three models A, B, or C the data matrix X was generated from
Since a non-Gaussian distribution contains more information than a Gaussian distribution, the distributions of the observed variables differ across models A, B, and C.
(A) Gaussian and non-Gaussian distributions
In a Gaussian distribution, all the information is contained in the mean and the variances/covariances.
Non-Gaussian distributions can differ in skewness and kurtosis even when the mean, variances, and covariances are fixed.
Information on skewness, kurtosis, and other higher-order structure is needed
The LiNGAM approach uses information not available in a Gaussian distribution to infer the causal graph
Use of non-Gaussianity
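A quick numerical illustration of point (A) (my own toy example, not from the book): a uniform and a Gaussian sample can share the same mean and variance, yet differ in kurtosis; this higher-order information is what the LiNGAM approach exploits.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 1_000_000

g = rng.normal(size=n)                        # Gaussian, variance 1
u = rng.uniform(-np.sqrt(3), np.sqrt(3), n)   # uniform, also variance 1

for name, s in [("gaussian", g), ("uniform", u)]:
    print(name,
          round(float(s.mean()), 3), round(float(s.var()), 3),
          round(float(stats.skew(s)), 3), round(float(stats.kurtosis(s)), 3))
# Same mean and variance, but the excess kurtosis differs (0 vs. -1.2):
# this extra higher-order information is what LiNGAM exploits.
```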
(B) The choice of assumptions is based on an overall assessment of goodness of fit.
3.5 Three approaches and identifiability
Compare the three approaches in terms of the identifiability of the causal graph
Under a certain assumption, if the structure of the causal graph is different, the distribution of the observed variables will be different.
Under this assumption, the causal graph is identifiable.
It is possible to recover the original causal graph based on the difference in the distribution of observed variables.
The assumption is that there are no unobserved common causes.
All are observed
Assume acyclicity when selecting among the causal graph candidates
Assume that there are no cycles
Directed acyclic graph (DAG)
Opposite concept
Directed cyclic graph
3.5.1 Basic problem setup when there are no unobserved common factors
When there are p observed variables
The p variables are denoted x_1, x_2, ..., x_p
Each variable x_i (i = 1, ..., p) is written using the structural equation
x_i = f_i(x_1, ..., x_{i-1}, x_{i+1}, ..., x_p, e_i) (i = 1, ..., p)
The arguments are the group of variables excluding x_i
Causal graph
Left: Case where there are unobserved variables and cycling is allowed.
Right: No unobserved variables and no cycling allowed
Relationships between variables in the form of a family tree
Typical examples of additional assumptions
The error variables e_i (i = 1, ..., p) are independent
This corresponds to assuming that "there is no unobserved common cause"
Causality is acyclic
Structural equation under the above two assumptions
x_i = f_i(pa(x_i), e_i) (i = 1, ..., p)
pa(x_i): the set of observed variables that are parents of the observed variable x_i; pa stands for parents
Example
The set of parents of the observed variable x_3 is pa(x_3) = {x_2}
The set of parents of the observed variable x_4 is pa(x_4) = {x_3, x_7}
Since the observed variable x_1 has no parent, pa(x_1) = {}
Example of structural equation model in the above example
Example 1
x_1 = e_1
x_2 = log(x_1/e_1)
x_3 = 5x_1 + e_3
Example 2
x_1 = e_1
x_2 = 3x_1x_3 + e_2
x_3 = e_3
3.5.2 Basic problem setup for the linear case without unobserved common factors
Introduction
Comparing the differences between the three approaches
Try to use as common a setup as possible for comparison
Error variable is a continuous variable
Function is a linear function
Structural equation: x_i = Σ_{x_j ∈ pa(x_i)} b_ij x_j + e_i
The coefficients b_ij are constants
The value of each observed variable x_i is determined by a linear sum of the values of the observed variables x_j ∈ pa(x_i) that are its parents and the value of its error variable e_i
(A) Average direct effect
The size of the coefficient b_ij indicates the size of the "direct" causal effect from the observed variable x_j to the observed variable x_i.
"Direct" means that the causal effect remains even if the values of the other variables are fixed to some values.
Example
The original structural equation model
Model for calculating the average causal effect (intervention on x_1)
If we rewrite the model using the error variables e_2 and e_3
we can calculate the average causal effect from the resulting equation
"If the intervention changes the value of x_1 from c to d, the value of x_3 changes on average by (b_31 + b_32 b_21)(d − c)"
Collectively quantify the magnitude of causal effects transmitted through multiple pathways
Those that do not go through other variables
Direct effects
Those that are transmitted through other variables
Indirect effects
Model for calculating the average direct effect (interventions on x_1 and x_2)
Average direct effect
Intervene with x_2 = c_2 and x_1 = c
Calculate x_3 after the intervention
Use this to calculate the average direct effect from x_1 to x_3
If the value of x_2 is held at c_2 by intervention and the value of x_1 is changed from c to d by intervention, the value of x_3 changes by b_31(d − c) on average, as the simulation sketch below illustrates.
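The distinction between the average causal effect (b_31 + b_32 b_21)(d − c) and the average direct effect b_31(d − c) can be checked by simulation, intervening on x_1 alone versus on both x_1 and x_2. A sketch with made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000
b21, b31, b32 = 0.7, 0.4, 0.6  # made-up coefficients

def mean_x3_do_x1(c):
    """Intervene on x1 only: x2 still follows its own equation."""
    e2, e3 = rng.normal(size=n), rng.normal(size=n)
    x2 = b21 * c + e2
    return (b31 * c + b32 * x2 + e3).mean()

def mean_x3_do_x1_x2(c, c2):
    """Intervene on x1 AND x2: the path through x2 is frozen at c2."""
    e3 = rng.normal(size=n)
    return (b31 * c + b32 * c2 + e3).mean()

c, d = 0.0, 1.0
print("average causal effect:", round(mean_x3_do_x1(d) - mean_x3_do_x1(c), 3),
      "=~", b31 + b32 * b21)
print("average direct effect:",
      round(mean_x3_do_x1_x2(d, 2.0) - mean_x3_do_x1_x2(c, 2.0), 3), "=~", b31)
```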
(B) Representation by matrices
Structural equation: x_i = Σ_{x_j ∈ pa(x_i)} b_ij x_j + e_i, with constant coefficients b_ij
The value of each observed variable x_i is determined by a linear sum of the values of its parent observed variables x_j ∈ pa(x_i) and the value of its error variable e_i.
There are a total of p structural equations in this model.
We can write them all together using matrices as x = Bx + e
The vectors x and e are column vectors of dimension p
The observed variable x_i and the error variable e_i are their i-th components
The matrix B is a p×p square matrix
The matrix B has the coefficient b_ij as its (i, j) component
Which components of the matrix B are zero and which are non-zero indicates between which variables there is, or is not, a directed edge
The diagonal components b_ii are always 0
There is no edge that leaves a variable and returns to the same variable
Example: Expression in the case of three variables
Objectives of statistical causal search
Under the assumption that the p×n data matrix X was generated from the structural equation model above
Estimate the unknown coefficient matrix B from the data matrix X
To be able to draw a causal graph
Knowing which observation variables are the parents of which observation variables
If we know the set pa(x_i) of parents of an observed variable x_i, we can express the effect of an intervention on x_i using conditional expectations.
The mean E(x_j|do(x_i=c)) of x_j when we intervene and set the value of x_i to c is
obtained by taking the conditional expectation E(x_j|x_i=c, pa(x_i)) given x_i and all of its parents, and averaging it over the distribution of the parent variables pa(x_i).
Example: Chocolate and the Nobel Prize
x_j is the number of Nobel laureates
x_i is the amount of chocolate consumed
pa(x_i) is GDP
Example of a causal graph
Calculating the average causal effect from xi to xj
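A small sketch of this adjustment in the chocolate/Nobel example (all coefficients invented for illustration): regressing the outcome on both x_i and its parent pa(x_i) recovers the causal coefficient, while the unadjusted regression is confounded.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
z = rng.normal(size=n)                              # GDP: pa(x_i)
choc = 0.8 * z + rng.normal(size=n)                 # chocolate consumption x_i
nobel = 0.3 * choc + 0.9 * z + rng.normal(size=n)   # Nobel laureates x_j

# Naive regression of nobel on choc alone is confounded by z.
naive_slope = np.polyfit(choc, nobel, 1)[0]

# Adjustment: condition on x_i AND its parents, then average over pa(x_i).
# For a linear model this reduces to the choc-coefficient of the joint fit.
X = np.column_stack([choc, z, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, nobel, rcond=None)
print("naive:", round(naive_slope, 3),
      " adjusted:", round(beta[0], 3), " true: 0.3")
```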
Examples of variables that should not be added as explanatory variables
x_h is an "intermediate variable"
x_h is a child of x_i and a parent of x_j
Variables that are descendants of x_i and ancestors of x_j must not be added as explanatory variables
This is because the causal effect transmitted via such a variable would be excluded.
The same caution applies to children of x_j.
When the causal graph is known, which variables may or may not be added as explanatory variables is
A major topic in statistical causal inference
If it is not known, the analyst uses “statistical causal search” to infer the causal graph
3.5.3 Nonparametric Approaches and Identifiability
Standard principles for inferring causal graphs
Causal Markov condition
An inference principle based on conditional independence between observed variables
Each variable, conditioned on its parent variables, is independent of its non-descendants.
pa(x_i) is the set of observed variables that are the parents of x_i
Equivalently, the joint distribution factorizes as p(x_1, ..., x_p) = ∏_{i=1}^p p(x_i|pa(x_i)); the symbol ∏ denotes a product
This holds generally for acyclic structural equation models without unobserved common causes
Not only for linear equations, but also for nonlinear equations and discrete variables
In nonparametric approaches, where no assumptions are made about the functional form or distribution of error variables, causal Markov conditions are used to infer causal graphs
Example: Explanation of the meaning of the causal Markov condition
When there are three observed variables
x_3 is not a descendant of x_2 (nor is it a parent of x_2)
Under the causal Markov condition, x_2 is independent of its non-descendant x_3 when conditioned on its parent x_1
The observed variable x_3 has no parents (and no non-descendants)
p(x_1, x_2, x_3) = p(x_1, x_2|x_3)p(x_3) = p(x_2|x_1, x_3)p(x_1|x_3)p(x_3) = p(x_2|x_1)p(x_1|x_3)p(x_3)
Faithfulness
The conditional independences among the variables x_i (i = 1, ..., p) are only those derived from the structure of the causal graph, i.e., those implied by the causal Markov condition.
Assuming linearity in the functional form
What is a causal graph structure?
The pattern of zero and non-zero coefficients b_ij (i, j = 1, ..., p)
What is a violation of faithfulness?
A conditional independence that does not follow from the structure of the causal graph but arises from the particular values of the coefficients b_ij (i, j = 1, ..., p).
When the faithfulness assumption is violated
Structural equations
x = e_x
y = −x + e_y
z = x + y + e_z
The error variables e_x, e_y, and e_z follow Gaussian distributions and are independent
b_yx = −1, b_zx = 1, b_zy = 1
Applying the causal Markov condition to this causal graph, there is no conditional independence for the three observed variables x,y,z
Express the structural equations in terms of e_x, e_y, e_z
x = e_x
y = −e_x + e_y
z = e_x − e_x + e_y + e_z = e_y + e_z
e_x, e_y, e_z are independent of each other
E(e_x e_y) = E(e_x)E(e_y)
E(e_x e_z) = E(e_x)E(e_z)
The covariance of x and z, cov(x, z), is
cov(x, z) = E(xz) − E(x)E(z) = E(e_x(e_y + e_z)) − E(e_x)E(e_y + e_z) = E(e_x e_y) + E(e_x e_z) − E(e_x)E(e_y) − E(e_x)E(e_z) = 0
x and z are uncorrelated
Since x and z are jointly Gaussian, and for Gaussian distributions uncorrelatedness and independence are equivalent, x and z are independent
The "faithfulness" assumption is introduced to rule out such coincidental independences, so that inference based on the causal Markov condition remains valid; the sketch below checks this example numerically
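Simulating this example confirms the cancellation (a toy check, not from the book):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
e_x, e_y, e_z = rng.normal(size=(3, n))

x = e_x
y = -x + e_y       # b_yx = -1
z = x + y + e_z    # b_zx = 1, b_zy = 1: the two paths cancel exactly

print("cov(x, z) =", round(float(np.cov(x, z)[0, 1]), 4))  # ~ 0
# x and z look independent even though the graph has the directed edges
# x -> z and x -> y -> z: faithfulness is violated.
```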
Example of Inference by Causal Markov Condition
Undetermined edge directions are represented by undirected edges
Connected by undirected edges
x1 and x3 are connected by an undirected edge
In every candidate causal graph, there is a directed edge between x1 and x3, but the direction of the edge varies from graph to graph
The presence of undirected edges means that there are multiple causal graphs left as candidates.
Can’t guess all at once
Causal graphs are not identifiable
An approach for inferring causal graphs from real data in the framework of nonparametric approaches
The goal is to infer the "Markov equivalence class," the set of causal graphs that imply the same conditional independences as the correct causal graph.
Two guessing approaches
Constraint-based approach
Steps
1. infer from the data what conditional independence is possible for the observed variables
2. use the inferred conditional independence as a constraint and search for a causal graph that satisfies it
Typical estimation algorithms
PC algorithm (named for its authors, Peter and Clark)
Extended to the FCI algorithm (fast causal inference)
for the case where there is an unobserved common cause
Extended to the CCD algorithm (cyclic causal discovery)
In the presence of cyclicity
Satisfiability problem (SAT)
An inference framework into which the constraints can be integrated
to search for causal graphs that satisfy them
Advantages
Relatively easy to extend in various ways
Problems
Hypothesis testing is used to decide whether conditional independence holds
This is not the original purpose of such tests
Fairly strong assumptions are needed to make the estimator of the Markov equivalence class consistent
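For step 1 of the constraint-based recipe, a common concrete choice is a partial-correlation test with the Fisher z-transform, which is valid under linear-Gaussian assumptions. A minimal sketch (this is one standard option, not necessarily the book's choice):

```python
import numpy as np
from scipy import stats

def ci_test(data, i, j, cond, alpha=0.05):
    """Test x_i independent of x_j given x_cond via partial correlation
    and the Fisher z-transform (valid under linear-Gaussian assumptions).

    data: (n, p) array; i, j: column indices; cond: list of column indices.
    Returns True if conditional independence is accepted."""
    n = data.shape[0]
    if cond:
        # Residualize x_i and x_j on the conditioning set.
        Z = np.column_stack([data[:, cond], np.ones(n)])
        r_i = data[:, i] - Z @ np.linalg.lstsq(Z, data[:, i], rcond=None)[0]
        r_j = data[:, j] - Z @ np.linalg.lstsq(Z, data[:, j], rcond=None)[0]
    else:
        r_i, r_j = data[:, i], data[:, j]
    r = np.corrcoef(r_i, r_j)[0, 1]
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_value > alpha
```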
Score-based approach
Evaluate the goodness of the model for each Markov equivalence class, a set of causal graphs that imply the same conditional independences
Use the Bayesian information criterion (BIC) to assign each Markov equivalence class a score measuring the goodness of the model
To calculate the BIC score, the type of distribution of the error variables must be specified.
The Markov equivalence class with the highest score is taken as the estimate of the class containing the correct causal graph
Typical algorithms
GES algorithm (greedy equivalence search)
3.5.4 Parametric Approaches and Identifiability
Nonparametric approaches can be used without making assumptions about the functional form or the distribution of the error variables.
However, causal graphs that imply the same conditional independences cannot be distinguished
Consider a "parametric approach" that assumes linearity of the functions and Gaussian distributions of the error variables
More information is available
Yet identifiability does not improve even under these assumptions
E.g., models with opposite causal directions cannot be distinguished
3.5.5 Semiparametric Approaches and Identifiability
An approach to improve identifiability
An approach that makes assumptions about the functional form but not about the distribution of the error variables
Semiparametric approach
Linearity is assumed for the functional form and non-Gaussianity for the distribution of the error variables
LiNGAM approach
When the error variable is non-Gaussian, the distributions of the observed variables for models with different causal directions are different
This difference in the distribution of the observed variables can be used to infer the causal graph.
3.6 Summary of this chapter

Chapter 4 LiNGAM

LiNGAM model (Linear Non-Gaussian Acyclic Model)
First, a signal processing technique called independent component analysis is introduced.
Using results from independent component analysis, the LiNGAM model is shown to be identifiable.
Finally, methods for estimating the LiNGAM model are presented.
4.1 Independent Component Analysis
Independent component analysis (ICA)
A data analysis method that has been developed in the field of signal processing
In independent component analysis, the values of unobserved variables are considered to be mixed to produce the values of the observed variables.
Example
The voices of multiple speakers are mixed and observed by multiple microphones.
Example of a model
x_1 = a_11 s_1 + a_12 s_2
x_2 = a_21 s_1 + a_22 s_2
s_1 and s_2 on the right-hand side are unobserved continuous variables
x_1 and x_2 on the left-hand side are observed variables
The coefficients a_11, a_12, a_21, and a_22 are constants
They indicate how the unobserved variables are mixed to form the observed variables x_1 and x_2
Representation by causal graph
Characteristics of Independent Component Analysis
The unobserved variables s_1 and s_2 are independent and follow non-Gaussian continuous distributions
s_1 and s_2 are called independent components
The goal is to recover the unobserved variables s_1 and s_2
General independent component analysis model
p observed variables x_i (i = 1, ..., p; p ≥ 2)
Equation: x_i = Σ_{j=1}^q a_ij s_j (i = 1, ..., p)
s_j (j = 1, ..., q) are unobserved continuous variables that are independent and follow non-Gaussian distributions
Expression using matrices: x = As
The matrix A is a p×q matrix
Its components are a_ij (i = 1, ..., p; j = 1, ..., q)
The matrix A represents how the independent components are mixed to become the observed variables
It is called the mixing matrix.
The i-th component of the vector x is the observed variable x_i
The j-th component of the vector s is the independent component s_j
Any two columns of the mixing matrix A are assumed to be linearly independent
Identifiability of the mixing matrix A
The mixing matrix A of the independent component analysis model is identifiable up to the order and scale of its columns.
The mixing matrix A itself cannot be estimated uniquely.
Even if we change the order and scale of the columns of the mixing matrix A, if we change the order and scale of the components of the independent component vector s to match, the modified mixing matrix and independent component vector still satisfy the model's assumptions.
Example
A matrix that may differ from A only in the order and scale of its columns can be estimated
Relationship between the mixing matrix A and a matrix A_ICA that may differ only in the order and scale of its columns: A_ICA = APD
The matrix P is a q×q permutation matrix
It changes the order of the columns
D is a q×q diagonal matrix
Its diagonal components are non-zero
It changes the scale of the columns
Difference between Gaussian and Non-Gaussian Distributions
Uncorrelated or independent?
Independence is the stronger and more informative condition
How to estimate the mixing matrix A from actual data
Assume that the mixing matrix A is a square matrix
p = q
The number of observed variables and the number of independent components are the same
Estimate the independent component vector s via the vector y = Wx, obtained by linearly transforming the observed variable vector x with a p×p matrix W
W: the demixing (restoration) matrix
If the demixing matrix W is equal to the inverse of the mixing matrix A,
then s can be recovered as y = Wx (= A^{-1}As = s)
To estimate the restoration matrix, we evaluate the independence of the components y_j (j = 1, ..., p) of the vector y
because the components of the independent component vector s that y is meant to estimate are independent
Standard independence measure
Mutual information
Takes non-negative values
Equals 0 when the variables are independent
Estimate the W that minimizes the mutual information of the components of the vector y = Wx
The mutual information can be written in terms of entropies as I(y_1, ..., y_p) = Σ_j H(y_j) − H(y), where H(y) is the entropy of y
Entropy
In information theory, entropy is a measure of the amount of information a random variable has.
The amount of information is larger the more diverse the values the random variable X takes.
It measures the "randomness" of the values of X
The mutual information is estimated using the data matrix X
The vector y(m) is the m-th column of the matrix Y = WX, the linear transformation of the data matrix X by W
It is the m-th observation of the variable vector y
y_j(m) is the j-th component of the vector y(m) and the m-th observation of the variable y_j
The row vector w_j^T is the j-th row of the matrix W
The vector x(m) is the m-th column of the data matrix X, the m-th observation of the observed variable vector x
Algorithm for the minimization
Fixed-point method (as in FastICA)
Characterized by the fact that the non-Gaussian distributions of the independent components need not be specified in advance
Estimating the restoration matrix W that minimizes the mutual information yields the estimate W_ICA
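As a practical aside, scikit-learn ships a fixed-point ICA implementation. A hedged sketch (the mixing matrix and uniform sources below are made up): the estimated mixing matrix matches the true one only up to column permutation and scaling, exactly the indeterminacy described above.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(8)
n = 10_000

# Two independent NON-Gaussian (uniform) sources, mixed linearly.
S = rng.uniform(-1, 1, size=(n, 2))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])   # true mixing matrix
X = S @ A.T                  # observed signals: x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)  # estimated independent components
A_hat = ica.mixing_           # estimated mixing matrix

# A_hat matches A only up to column permutation and scaling (A P D).
print(np.round(A_hat, 2))
```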
4.2 LiNGAM Model
LiNGAM model for p observed variables x_1, x_2, ..., x_p
Each observed variable x_i is a linear sum of the other observed variables x_j and its error variable e_i
If the coefficient b_ij is 0, there is no direct causal effect from x_j to x_i
The error variables e_i are independent and follow non-Gaussian continuous distributions
No unobserved common cause
Causal graph of LiNGAM model
Directed Acyclic Graph
Equation using matrices: x = Bx + e
LiNGAM model is identifiable
The coefficient matrix B can be uniquely estimated from the distribution p(x) of the observed variables.
The causal graph can be drawn from the pattern of zero and non-zero components of the coefficient matrix B.
The relationship between the acyclicity of causal graphs and the coefficient matrix B
In preparation
Define the causal order of the observed variables (causal order)
What is a causal order?
An ordering such that, if the variables are rearranged according to it,
no variable later in the order causes a variable earlier in the order
The causal order of the observed variables x_1, x_2, ..., x_p is denoted k(1), k(2), ..., k(p)
The number in parentheses of the symbol k is the index of the observed variable.
Directed path
A series of directed edges with the same orientation.
Example 1
Matrix representation
Rearrangement of causal order
After reordering, the coefficient matrix B is a lower triangular matrix with all diagonal components zero (a strictly lower triangular matrix).
In this example, there is only one causal order that yields a strictly lower triangular matrix.
Example 2
A case where multiple causal orders yield a strictly lower triangular matrix
(1) The order x_3, x_1, x_2: k(3)=1, k(1)=2, k(2)=3
(2) The order x_1, x_3, x_2: k(1)=1, k(3)=2, k(2)=3
Solve each for the observed variable x
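The "strictly lower triangular under some permutation" characterization is easy to check mechanically: permute the rows and columns of B by a candidate order and test for strict lower-triangularity. A brute-force sketch (fine for small p; the example matrix is made up):

```python
import numpy as np
from itertools import permutations

def causal_orders(B, tol=1e-8):
    """Return every ordering of the variables that makes the permuted
    coefficient matrix strictly lower triangular."""
    p = B.shape[0]
    orders = []
    for perm in permutations(range(p)):
        Bp = B[np.ix_(perm, perm)]  # reorder rows and columns together
        if np.allclose(Bp, np.tril(Bp, k=-1), atol=tol):
            orders.append(perm)
    return orders

# Toy graph x3 -> x1 -> x2 (made-up coefficients)
B = np.array([[0.0, 0.0, 0.8],
              [0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
print(causal_orders(B))  # [(2, 0, 1)], i.e. x3 first, then x1, then x2
```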
4.3 Estimating the LiNGAM Model
4.3.1 Approach using independent component analysis
4.3.2 Approach using regression analysis and independence evaluation
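The notes give no detail here, but the idea behind this second approach (known in the literature as DirectLiNGAM) can be sketched for two variables: regress each variable on the other and prefer the direction whose residual looks independent of the regressor. The dependence measure below is a crude nonlinear-correlation proxy chosen for illustration, not the book's actual criterion:

```python
import numpy as np

rng = np.random.default_rng(9)

def dependence(a, b):
    """Crude proxy for dependence: nonlinear correlations that vanish
    when a and b are independent (NOT the book's exact measure)."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return (abs(np.corrcoef(np.tanh(a), b)[0, 1])
            + abs(np.corrcoef(a, np.tanh(b))[0, 1])
            + abs(np.corrcoef(a**2, b**2)[0, 1]))

def direction(x, y):
    """Regress each way; prefer the direction whose residual is more
    independent of the regressor."""
    x, y = x - x.mean(), y - y.mean()
    c = np.cov(x, y, bias=True)[0, 1]
    r_y = y - (c / x.var()) * x   # residual of regressing y on x
    r_x = x - (c / y.var()) * y   # residual of regressing x on y
    return "x -> y" if dependence(x, r_y) < dependence(y, r_x) else "y -> x"

# Two-variable LiNGAM with non-Gaussian (uniform) errors
x = rng.uniform(-1, 1, 100_000)
y = 0.8 * x + rng.uniform(-1, 1, 100_000)
print(direction(x, y))  # expected: "x -> y"
```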
4.4 Summary of this chapter

Chapter 5 LiNGAM in the Presence of Unobserved Common Cause

5.1 Difficulties due to unobserved common causes
5.2 LiNGAM Model with Unobserved Common Causes
5.3 Assuming Independent Unobserved Common Causes Does Not Lose Generality
5.4 Approach Based on Independent Component Analysis
5.5 Approaches Based on Mixture Models
5.5.1 Rewriting the model for each observation
5.5.2 Evaluating the goodness of the model with the log marginal likelihood
5.5.3 Prior distribution
5.5.4 Numerical Examples
5.6 The multivariate case
5.7 Summary of this chapter

Chapter 6 Related Topics

6.1 Relaxing the Assumptions of a Model
6.1.1 Cyclic model
6.1.2 Time series models
6.1.3 Nonlinear models
6.1.4 Discrete variable models
6.2 Model evaluation
6.3 Statistical reliability evaluation
6.4 Software
6.5 Conclusion
