Iwanami Data Science Series vol.3 “Causal Inference: Reading Causality from Real-World Data” Reading Memo

Summary

Techniques for examining “causal relationships,” as opposed to mere “correlations,” are “causal inference” and “causal search.” Both are methods for analyzing causal relationships, but they differ in purpose and approach: causal inference is a technique for verifying causal relationships, while causal search is a technique for discovering them.

Causal inference is a formal methodology for identifying causal relationships from experimental or observational data. For example, one may conduct an experiment on two separate groups and, from a statistical difference between them, infer that a particular intervention produced a causal effect. The main approaches to causal inference are (1) “randomized experiments,” in which the intervention is randomly assigned; (2) “natural experiments,” in which a naturally occurring event is treated as the intervention; (3) “propensity score matching”; and (4) “linear regression models,” which statistically estimate causal effects from observed data.

In this section, based on Iwanami Data Science Series vol.3, “Causal Inference: Reading Causality from Real-World Data,” we discuss various theories and applications with a focus on causal inference.

This article presents my reading notes.

Iwanami Data Science Series vol.3 “Causal Inference: Reading Causality from Real-World Data” Reading Memo

Vol.3 Special Issue: Causal Inference – Reading Causality from Real-World Data. How do we read causality from observational data, especially real-world data that cannot be reproduced? Analysis is difficult because of missing data, bias, and other constraints, but that is exactly why expectations from the field are high. This volume provides an introduction to “causal direction,” “confounding,” and “intervention” — essential knowledge for scientists and basic literacy for citizens — as well as analysis methods that are immediately useful.

Introduction to Causal Inference

Causality has a direction
Correlation
When X is large, Y also tends to be large (and vice versa)
Causal relationship
When X is increased (by intervention), Y also increases
Increasing height increases weight, but increasing weight does not increase height
Two methods of causal inference
Stratified analysis
Use of regression models
What’s the trouble with misinterpreting correlation as causation?
We want to touch the cause and control the effect.
Different results depending on how the data is compiled
When examining the relationship between two variables, we have to look at the influence of other factors
Simpson’s Paradox
The correlation in the aggregated data and the correlations within each subgroup can have opposite signs
Causality, causal effects, and counterfactuals
Correlation
There is a linear relationship between two variables such that the higher the value of one variable, the higher (or lower) the value of the other variable.
Causal relationship
The relationship between X and Y when a change in factor X causes a change in factor Y.
Guidelines for determining causality
Causal variable
A variable representing the cause
Outcome variable
A variable representing the effect (result)
Intervention
Manipulation of a factor to change it.
Causal effect
The strength of a causal relationship
The difference between the value of the outcome variable when the same person receives the intervention and the value of the outcome variable when the person does not receive the intervention
Counterfactual
The value the outcome variable would have taken if, contrary to fact, the same person had (or had not) received the intervention — the outcome under the scenario that did not actually occur
Confounding
Causal diagram (causal graph)
Observed variables are drawn inside squares (boxes).
Example
Confounding
A situation in which two factors that are not directly related both receive the influence of a common factor
Confounding factor
A common factor C that affects X and Y.
spurious correlation
A correlation that occurs due to confounding between variables that are not really related.
Addressing confounding with RCTs
Randomized controlled trial
Randomize whether or not to implement an intervention
How to tell whether an association is causal or due to confounding
Cutting the path = fixing (conditioning on) the variable
Causal inference from observational studies
It is always safer to assume that unknown confounders may exist
Stratified analysis
When an outcome variable changes, it is not clear whether the change is due to (1) a change in the causal variable or (2) a change in the confounding variable
Fix the confounding variable.
Stratified analysis
Divide the target into several strata according to the values of the confounders
Perform analysis for each strata
Integrate the results to obtain the correct relationship between the causal variable and the outcome variable as a whole
Example of stratified analysis
Discretize a confounding factor that is a continuous quantity and perform stratified analysis to correctly estimate the causal effect of x on y (see the sketch below)
1. generate random numbers x and z with a correlation coefficient of about 0.8
2. generate y from the following model y = 1.5x + 1.1z + e (e: normal random number)
Without stratified analysis
With stratified analysis: discretize z into 4 strata
Weight the regression coefficients by the inverse of their variances
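As a check on this procedure, here is a minimal Python sketch (numpy/scipy assumed; the quartile stratification of z and the inverse-variance pooling follow the items above, while the seed and sample size are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

# 1. x and z: standard normal with correlation of about 0.8 (the confounding structure)
z = rng.normal(size=n)
x = 0.8 * z + np.sqrt(1 - 0.8**2) * rng.normal(size=n)

# 2. outcome model: y = 1.5x + 1.1z + e
y = 1.5 * x + 1.1 * z + rng.normal(size=n)

# Without stratified analysis: the regression of y on x is biased by the confounder z
naive = stats.linregress(x, y)
print(f"naive slope of y on x: {naive.slope:.2f}")        # clearly larger than 1.5

# With stratified analysis: discretize z into 4 strata and regress within each stratum
strata = np.digitize(z, np.quantile(z, [0.25, 0.5, 0.75]))
slopes, weights = [], []
for s in range(4):
    m = strata == s
    res = stats.linregress(x[m], y[m])
    slopes.append(res.slope)
    weights.append(1.0 / res.stderr**2)    # weight by the inverse of the slope's variance

pooled = np.average(slopes, weights=weights)
print(f"stratified (inverse-variance weighted) slope: {pooled:.2f}")  # closer to 1.5 than the naive estimate
```

With only four coarse strata, some confounding by z remains inside each stratum, so the pooled estimate is still somewhat above 1.5; finer strata, or the regression adjustment sketched in the next block, remove more of the bias.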
Using the regression model
Approaches other than stratified analysis
Add the confounders to the model to form a multiple regression model
A generalized linear model can also be used: by entering the causal variable and the covariates into the model simultaneously, the effect of the causal variable on the outcome variable can be examined with the influence of the covariates removed (see the sketch after this block)
This works only if linearity between the dependent variable (outcome variable) and the explanatory variables (causal variable and covariates) can be assumed
Backdoor criteria
Examples of the use of regression models
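Continuing the same simulated data, a minimal sketch of the regression approach (assuming the statsmodels formula API): entering x and the confounder z into the model simultaneously recovers the coefficient of x, and the same adjustment can equally be written as a Gaussian generalized linear model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)
x = 0.8 * z + np.sqrt(1 - 0.8**2) * rng.normal(size=n)
y = 1.5 * x + 1.1 * z + rng.normal(size=n)
df = pd.DataFrame({"x": x, "y": y, "z": z})

# Multiple regression with the confounder z entered alongside x:
# the coefficient of x now estimates the causal effect (about 1.5)
print(smf.ols("y ~ x + z", data=df).fit().params)

# The same adjustment written as a generalized linear model
# (Gaussian family with identity link, which reduces to ordinary regression here)
print(smf.glm("y ~ x + z", data=df, family=sm.families.Gaussian()).fit().params)
```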
Other methods of causal inference
Other methods: Matching
Pair up subjects whose confounder values are identical or close (this operation is called matching), and compare the outcome variable between the member of each pair that receives the intervention and the member that does not
A useful method when many covariates must be considered simultaneously
A method that combines the covariates into a one-dimensional score called a propensity score (see the sketch below)
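A minimal sketch of matching on an estimated propensity score, with an illustrative data-generating process and scikit-learn assumed for the logistic regression and the nearest-neighbour search:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 5_000

# Two covariates that affect both treatment assignment and the outcome
c = rng.normal(size=(n, 2))
p_treat = 1 / (1 + np.exp(-(c[:, 0] + 0.5 * c[:, 1])))      # true assignment probability
t = rng.binomial(1, p_treat)                                 # intervention indicator
y = 2.0 * t + c[:, 0] + 0.7 * c[:, 1] + rng.normal(size=n)   # true causal effect = 2.0

# Combine the covariates into a one-dimensional propensity score
ps = LogisticRegression().fit(c, t).predict_proba(c)[:, 1]

# For each treated unit, find the control unit with the closest propensity score
treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_control = control[idx.ravel()]

print(f"naive difference of means: {y[t == 1].mean() - y[t == 0].mean():.2f}")
print(f"matched estimate of the effect: {np.mean(y[treated] - y[matched_control]):.2f}")  # close to 2.0
```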

Causality and Correlation in Time Series

Be cautious about taking causality from time series
Examples of time series data that appear to be causal

Correlation, Causation, Circles and Arrows: My First Backdoor Criterion

Overview

The Backdoor Criterion
A criterion for deciding which variables to adjust for (condition on) so that the causal relationship can be correctly estimated

Let’s start with the definition of “causality”.

If factor Y also changes when we artificially change (intervene on) factor X, we say that there is a causal relationship from factor X to factor Y.
When you press a switch on the wall, which light comes on?

Let’s get an idea of the pattern where correlation ≠ causation.

(1) Cases in which the direction of causation is reversed

Correlation between coral survival rate and individual density of coral predator O
Predator O only eats coral that is dying
Decline in coral survival rate → increase in predator O
The reverse does not exist

(2) Cases in which common factors exist upstream of causality

Example of a causal structure with confounding factors or confluence points (colliders)
Backdoor criteria
Which variables should not be added as explanatory variables

(3) Cases in which variables at a confluence point (collider) of causality are selected

Variable Z “collider” in Figure 2-C
Even if there is no causal relationship between X and Y, there may be a correlation between X and Y.
When a correlation is created without causation
When there is no causality, but there is correlation

Review in the framework of regression analysis – when regression coefficients and intervention effects “deviate”

(1) Let’s use the example of the water level in a man-made pond on a hill to illustrate.

An example to visualize the relationship between the “intervention effect” and the causal structure
There is an X pond at the top of the hill and a Y pond at the bottom of the hill, connected by a channel
Y = β_XY X + γ + error
The effect of intervening on X appears in Y through β_XY, but the effect of intervening on Y on X is 0 cm (it cannot be obtained by reading the above equation backwards); see the simulation below
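The asymmetry can be seen in a small simulation (the numbers are illustrative): both regression directions give non-zero coefficients, but only one of them corresponds to an intervention effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 10_000

x = rng.normal(50, 10, size=n)                  # water level of pond X on the hill (cm)
y = 1.5 * x + 3.0 + rng.normal(0, 5, size=n)    # Y = beta_XY * X + gamma + error

# Correlation is symmetric: both regressions give non-zero coefficients ...
print(stats.linregress(x, y).slope)   # ~1.5 = the intervention effect of X on Y
print(stats.linregress(y, x).slope)   # ~0.6 = just a regression coefficient

# ... but because Y never appears in the equation generating X, adding water to
# pond Y (intervening on Y) changes nothing about X: that intervention effect is 0 cm.
```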

(2) “Deviation” caused by the existence of common factors upstream of the causal path
(3) “Deviation” caused by selection at a confluence point (collider)
(4) “Deviation” caused by using intermediate variables

4. preparation for landing – 5 seconds before the “backdoor criterion”

(1) The flow from common factors upstream should be “blocked”
(2) Confluence points (colliders) should not be added
(3) Intermediate variables should not be added

Landing on the “backdoor criterion”

(1) Now, let’s restate the “backdoor criterion” in everyday language.
(2) Let’s check your understanding by solving the exercises.

5. after the landing – a guide for those who want to learn more

If “the backdoor criterion is satisfied,” then “strong ignorability holds.”
Simpson’s paradox (the problem of changing estimation results depending on which variables you stratify by)
The backdoor criterion provides clear guidance on which variables to stratify by if the causal structure behind it is known

Designing a Quasi-Experiment

Overview

How to derive causal relationships from observational data
Quasi-experiment
Randomized controlled trials can prove causality

1. instrumental variable (IV) method

IV: a variable that influences the intervention, and influences the outcome only through the intervention.
A method of analysis that uses the IV to estimate the causal effect of the intervention on the outcome.
The causal effect of the intervention on the outcome is estimated by dividing the association between the IV and the outcome by the association between the IV and the intervention (see the sketch below).
Diagram of the instrumental variable method
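A minimal sketch of this ratio (often called the Wald estimator), under an assumed data-generating model with an unobserved confounder u and a valid instrument z:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

u = rng.normal(size=n)                        # unobserved confounder
z = rng.binomial(1, 0.5, size=n)              # instrument: affects x, but y only through x
x = 1.0 * z + u + rng.normal(size=n)          # intervention, confounded by u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # true causal effect of x on y is 2.0

naive = np.cov(x, y)[0, 1] / np.cov(x, y)[0, 0]    # biased by the confounder u
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]       # (z-y association) / (z-x association)

print(f"naive regression coefficient: {naive:.2f}")  # well above 2.0
print(f"IV (Wald) estimate:           {iv:.2f}")     # close to 2.0
```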

Regression discontinuity design (RDD)

Whether the value of a continuous variable (Z) is above or below a specific cutoff determines whether the subject is assigned to the intervention group (X=1) or the control group (X=0); see the sketch below.
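A minimal sketch of a sharp RDD on simulated data, with an assumed cutoff at 0 and local linear fits within a bandwidth on each side:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 20_000

z = rng.uniform(-1, 1, size=n)              # continuous assignment variable
x = (z >= 0).astype(int)                    # intervention group if z is above the cutoff
y = 0.8 * z + 1.5 * x + rng.normal(0, 0.5, size=n)   # true jump at the cutoff = 1.5

# Local linear fits within a bandwidth h on each side of the cutoff
h = 0.2
left = (z < 0) & (z > -h)
right = (z >= 0) & (z < h)
f_left = stats.linregress(z[left], y[left])
f_right = stats.linregress(z[right], y[right])

# The RDD estimate is the difference of the two fitted values at the cutoff z = 0
print(f"estimated effect at the cutoff: {f_right.intercept - f_left.intercept:.2f}")  # close to 1.5
```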

Interrupted time-series analysis (ITS)

The continuous variable Z for intervention assignment in RDD is “time”.
Autocorrelation needs to be taken into account
Generalized estimating equations (GEE) using correlation structure of time series
ARIMA (Autoregressive integrated moving average model)
Relationship between level and trend in an interrupted time series design
Example of an interrupted time series design
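A minimal sketch of the segmented regression used in an interrupted time-series design, with simulated monthly data and autocorrelation ignored for simplicity (in practice GEE or ARIMA models, as listed above, would handle it):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
t = np.arange(48)                       # 48 monthly observations
t0 = 24                                 # intervention starts at month 24
d = (t >= t0).astype(int)               # post-intervention indicator

# True model: baseline trend +0.5/month, level drop of 4, trend change of -0.3/month
y = 10 + 0.5 * t - 4 * d - 0.3 * d * (t - t0) + rng.normal(0, 1, size=48)
df = pd.DataFrame({"y": y, "t": t, "d": d, "t_after": d * (t - t0)})

# Coefficients: t = pre-intervention trend, d = level change, t_after = trend change
fit = smf.ols("y ~ t + d + t_after", data=df).fit()
print(fit.params)
```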

Difference-in-differences analysis (DID)

Fundamentals of estimating statistical causal effects, especially using propensity scores and instrumental variables

1. counterfactual – no regrets and no end in sight

Counterfactual (what would have happened if we had done something differently than we actually did)
Rubin causal model
A statistical framework for making counterfactuals from data

2. potential outcome variables and the Rubin causal model – a framework for evaluating regret and the effects of measures

Variable Z
A binary indicator variable (Z = 0, 1) for the presence or absence of the intervention
Variable Y
Outcome variable
Y0: the outcome if the person receives Z = 0
Y1: the outcome if the person receives Z = 1
Example: Testing the effectiveness of a smartphone game commercial
Z=1 if the person is exposed to the commercial during the two weeks, Z=0 if the person is not exposed
Usage time during the following two weeks
Y1: Time spent playing the game if the user saw the TV commercial
Y0: Time spent playing the game when the user did not see the TV commercial
Y1 and Y0 cannot be measured simultaneously.
Potentially exists, but what can be observed differs depending on the value of Z
Potential outcomes
Y = Z × Y1 + (1-Z) × Y0
Y1 for Z = 1, Y0 for Z = 0
Effect of the commercial on usage time
Y1 – Y0

3. the fundamental problem of causal inference and the causal effect as a group

For an individual, both potential outcomes cannot be observed at the same time, but causal effects can be considered for a specific population
Causal effects
Average Treatment Effect (ATE)
ATE = E(Y1-Y0) = E(Y1) – E(Y0)
Preliminaries for estimating the population causal effect from data
There are N people for whom data are available (the sample size is N).
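A small simulation of this framework (the game-commercial example is reused with made-up numbers): both potential outcomes are generated for everyone, only Y = Z × Y1 + (1−Z) × Y0 is observed, and with randomized Z the difference of group means still recovers ATE = E(Y1) − E(Y0).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000

# Potential outcomes: playing time (hours) without and with seeing the commercial
y0 = rng.gamma(shape=2.0, scale=3.0, size=n)     # time without the commercial
y1 = y0 + rng.normal(2.0, 1.0, size=n)           # individual effects average 2 hours

z = rng.binomial(1, 0.5, size=n)                 # randomized exposure to the commercial
y = z * y1 + (1 - z) * y0                        # only one potential outcome is observed

print(f"true ATE  E(Y1) - E(Y0):   {np.mean(y1 - y0):.2f}")
print(f"difference of group means: {y[z == 1].mean() - y[z == 0].mean():.2f}")
# The individual effect y1 - y0 can never be computed from (z, y) alone:
# for each person, the other potential outcome is the unobserved counterfactual.
```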

4. assumptions for estimating causal effects from survey observation data

5. methods for estimating causal effects

(1) Methods using regression models
(2) Matching and stratified analysis
(3) Propensity scores
1) Propensity score matching and stratified analysis
2) Analysis of covariance by propensity score
3) Inverse Probability Weighting (IPW) by propensity score
4) Doubly robust estimation methods
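A minimal sketch of the IPW estimator with a logistic-regression propensity-score model (illustrative data; the doubly robust estimator would additionally combine this with outcome-regression models):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 50_000

c = rng.normal(size=(n, 2))                                   # observed covariates
p = 1 / (1 + np.exp(-(0.8 * c[:, 0] - 0.5 * c[:, 1])))        # true assignment probability
z = rng.binomial(1, p)
y = 2.0 * z + c[:, 0] + c[:, 1] + rng.normal(size=n)          # true ATE = 2.0

e = LogisticRegression().fit(c, z).predict_proba(c)[:, 1]     # estimated propensity score

# IPW estimator: weight each observed outcome by the inverse of the
# probability of receiving the treatment value actually received
ate_ipw = np.mean(z * y / e) - np.mean((1 - z) * y / (1 - e))

print(f"naive difference of means: {y[z == 1].mean() - y[z == 0].mean():.2f}")
print(f"IPW estimate of the ATE:   {ate_ipw:.2f}")            # close to 2.0
```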

6. covariate selection problem

7. instrumental variable method

(1) Conditions under which causal effects can be estimated by the instrumental variable method
(2) Local treatment effects – dealing with the “inconvenient truth” of noncompliance in RCTs
(3) Some challenges
(4) Takeaway and discussion

[Appendix 1] Can we really estimate causality with RCTs?
[Appendix 2] On M-bias

Application of Causal Effect Estimation: Causal Effects and Adjustment Effects of Commercial Contact

1. adjustment effect and general peripatetic model
2. introduction to the data and problems with simple mean difference analysis
3. selection bias
4. estimation of the average treatment effect ATE
5. estimation of the average treatment effect on the treated (ATT)
6. reanalysis using adjustment effects

Estimating the bunt effect using propensity scores: Does bunting with no outs increase the probability of scoring?

1. what to do to show the effectiveness of the bunt strategy

Inference of missing data in (2) and (3) using propensity score
(1) Causal effects when the bunt strategy can be regarded as a randomized controlled experiment
(2) When allocation is affected by covariates
The causal effect of the bunt strategy is estimated using the Inverse Probability Weighting estimator (IPW estimator) and the Doubly Robust estimator (DR estimator).
(3) Estimating the causal effect of bunt strategy on the probability of scoring runs
Calculation results of the IPW estimator and the estimator of the error variance
Cautions for inference based on the IPW estimator
Analysis based on the Doubly Robust estimator and its results
List of covariates used in the analysis of the data
(4) Summary of the analysis results

2. summary

The Effect of “Nursery School Development” Examined by the Difference-in-Differences Method: An Application of Causal Inference in Social Science

1. What is the difference-in-differences method?

1.1 Basic idea

An increase in the number of nursery school places is treated as the intervention: prefectures with a large increase in places are regarded as the treatment group, while those with only a small increase are regarded as the control group (see the sketch below).
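A minimal sketch of the two-group, two-period DID regression on fabricated prefecture-level numbers (purely illustrative, not the actual data): the coefficient of the treatment × post interaction is the DID estimate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 2_000   # prefecture-year observations (illustrative)

treated = rng.binomial(1, 0.5, size=n)     # 1 = large increase in nursery places
post = rng.binomial(1, 0.5, size=n)        # 1 = period after the expansion

# Women's employment rate (%): common time trend +3, group gap +2, true effect +1.5
y = 50 + 2 * treated + 3 * post + 1.5 * treated * post + rng.normal(0, 1, size=n)
df = pd.DataFrame({"y": y, "treated": treated, "post": post})

fit = smf.ols("y ~ treated + post + treated:post", data=df).fit()
print(fit.params["treated:post"])          # DID estimate, close to 1.5
```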

1.2 Use of covariates

1.3 Continuous treatment variable

2. Did the construction of childcare centers increase women’s employment?

2.1 Data

2.2 Regression analysis

3. Cautions in applying the difference-in-differences method

For the difference-in-differences method to be valid, a certain assumption must hold (essentially, that the two groups would have followed parallel trends in the absence of the treatment).
To increase the credibility of the result, one should consider scenarios in which this assumption fails and show with other data that they did not actually occur.

4. extension of the difference-in-differences method – the triple-difference method

5. Conclusion

Monte Carlo methods and propensity scores

Importance sampling
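A minimal sketch of importance sampling, the Monte Carlo idea that underlies the IPW estimator: an expectation under a target distribution p is estimated from samples drawn from a proposal distribution q by reweighting with p/q (the distributions here are illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 100_000

# Target p = N(2, 1); proposal q = N(0, 2).  We want E_p[X^2] = 2^2 + 1 = 5.
xq = rng.normal(0, 2, size=n)                               # samples drawn from q only
w = stats.norm.pdf(xq, 2, 1) / stats.norm.pdf(xq, 0, 2)     # importance weights p/q

print(np.mean(w * xq**2))      # importance-sampling estimate, close to 5
print(np.mean(w))              # the weights average to about 1 (sanity check)
```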

A quick introduction to graph representations

Undirected Graphs
Correlation also reflects indirect influences.
Use the concept of “conditional independence” to define “direct influence”
Conditional independence p(x_i, x_j | x_{-ij}) = p(x_i | x_{-ij}) p(x_j | x_{-ij}) holds ⇔ there is no edge connecting vertex i and vertex j
When the structure of multivariate normal distribution is specified by an undirected graph
Gaussian Graphical Model: GGM
Directed Graph – Represents the model as a product of conditional probabilities
DAG model (Directed Acyclic Graphical model)
Directed Graph – Represents a causal relationship
Causal Diagram

Information geometry of positive definite matrices

1. Gaussian Graphical Model

Graphical model
Statistical model that explicitly reflects and describes dependencies between variables
Dependency relations between variables are represented by a mathematical object called a graph, which consists of vertices (points) corresponding to the variables and edges (branches) connecting them.
Gaussian graphical model
Graphical model with normal distribution
A typical example of a Markov random field
Example: Temperature and sales of fans and air conditioners
There appears to be a correlation between sales of fans and sales of air conditioners.
The correlations between temperature and fan sales and between temperature and air-conditioner sales are indirectly reflected as a correlation between fan and air-conditioner sales
Given temperature, fan sales and air-conditioner sales are conditionally independent
p(fan, air conditioner | temperature) = p(fan | temperature) p(air conditioner | temperature)
Sample Variance-Covariance Inverse Matrix
Inverse of the sample variance-covariance matrix of the data
Partial correlation matrix
Rescale the rows and columns of the inverse of the sample covariance matrix so that the diagonal elements are 1
Application to an Example
The (2,3) element of the partial correlation matrix, corresponding to fan and air-conditioner sales, is zero
There is no direct relationship between fan sales and air-conditioner sales
Partial correlation coefficient
The correlation between variables i and j under the condition that all variables other than i and j are fixed
Estimate the graph structure of the Gaussian graphical model from the data
Calculate the partial correlation matrix and set the off-diagonal elements that are close to 0 to exactly 0
Example with 4 variables (temperature, economy, fan, air conditioner); see the sketch below
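A minimal Python sketch of this procedure on simulated fan / air-conditioner / temperature data (the coefficients are made up): invert the sample covariance matrix, rescale it into a partial correlation matrix, and look for off-diagonal entries close to zero.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 20_000

temp = rng.normal(size=n)                       # temperature (standardized)
fan = 0.8 * temp + 0.6 * rng.normal(size=n)     # fan sales depend only on temperature
ac = 0.7 * temp + 0.7 * rng.normal(size=n)      # air-conditioner sales likewise
data = np.column_stack([temp, fan, ac])

S = np.cov(data, rowvar=False)                  # sample variance-covariance matrix
K = np.linalg.inv(S)                            # its inverse (precision matrix)

# Partial correlation matrix: rescale K so that the diagonal becomes 1
d = np.sqrt(np.diag(K))
pcor = -K / np.outer(d, d)
np.fill_diagonal(pcor, 1.0)
print(np.round(pcor, 2))
# The (fan, air-conditioner) entry is close to 0: no direct edge between them,
# even though their ordinary correlation is clearly positive.
print(np.round(np.corrcoef(data, rowvar=False), 2))
```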

Maximum Likelihood Estimation and Convex Optimization of Gaussian Graphical Models

Advantages of the Gaussian model
Maximum likelihood estimation becomes a convex optimization problem on the space of positive definite matrices
Maximum likelihood estimation
Estimate by adjusting parameters to make the “probability” of generating data as large as possible
Probability that the data is generated
Likelihood
Maximum Likelihood Method for Gaussian Graphical Model
N items of n-dimensional data x(1), …, x(N)
Non-zero elements d_ij for the pairs (i, j) in the edge set of the graph 𝒢
Likelihood function
Logarithmic likelihood function
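Written out (a standard form, assuming mean-zero data, a precision matrix Θ whose zero pattern is given by the graph, and the sample covariance matrix S), the log-likelihood being maximized is:

```latex
% Log-likelihood of a Gaussian graphical model with precision matrix \Theta,
% observations x^{(1)},\dots,x^{(N)} and sample covariance S (constants omitted):
\ell(\Theta) = \frac{N}{2}\Bigl(\log\det\Theta - \operatorname{tr}(S\,\Theta)\Bigr),
\qquad
S = \frac{1}{N}\sum_{t=1}^{N} x^{(t)}\,{x^{(t)}}^{\top},
\qquad
\Theta_{ij} = 0 \ \text{for}\ (i,j)\notin \mathcal{G}.
```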

3. information geometry of Gaussian graphical model

Characteristic function
Potential
Dual potentials
Semidefinite programming
Divergence
Kullback-Leibler (KL) information content
Legendre transformation

4. Conclusion

Gaussian graphical modeling is a method to explore data structures focusing on conditional independence from a sample covariance matrix based on maximum likelihood estimation.
When the corresponding graph is a chordal graph, the maximum likelihood estimate can be obtained using only linear algebra operations.

The Road to Probabilistic Modeling by Integrating Probability and Logic (3)

1. probability modeling based on logic

Possible worlds are maps from the Herbrand base to {1 (true), 0 (false)}.
There are 2^|B| possible worlds (|B| is the cardinality of the set B)
If B is finite, enumerate the possible worlds and assign probabilities so that the probabilities of all worlds sum to 1

2. distributional semantics

Probability model for infinite worlds
Markov chain
PCFG (probabilistic context-free grammar)
Expansion to the computer world
Extension of Prolog to probabilistic programming language
For the purpose of making programming languages probabilistic
Introduce probabilistic elements into programs
Probabilistic Turing Machine
Introduce probability into state transitions
The case of Prolog
Probabilistic selection of call clauses in Prolog
Just probabilize the computation
Distributional Semantics
Probabilistic Selection with Dice
Basic Distribution
Direct product distribution of countably many categorical distributions
Example: Categorical distribution is Bernoulli distribution
The basic distribution is a probability distribution over the whole set {0, 1}^∞ of infinite 0-1 sequences
In distribution semantics, this fixed basic distribution can be modified by logic programs to create more complex distributions that approximate the real world.
A probabilistic logic program must define a probability measure over the possible worlds.
Distributional Semantics
A type of denotational semantics that defines the mathematical meaning of programs.
Uses the standard semantics of ordinary non-probabilistic logic programs (least model semantics) and Kolmogorov’s extension theorem to extend the basic distribution to a distribution over possible worlds.
By writing logic programs, it is possible to express any number of complex distributions.
example
BN
LDA
HMM
PCFG
PRISM
A built-in probabilistic predicate of the form msw( * , * ), which follows a categorical distribution and can be used in Prolog.

Examples of PRISM models and probability calculations

Example: Modeling the mechanism of inheritance of ABO blood types
Alleles (a, b, o) for A, B, and O blood types
If the genotype of an individual is (a, a), (a, o), or (o, a), the blood type is A
The distribution (frequency) of alleles in the parental generation is P(a)=0.5, P(b)=0.2, P(o)=0.3
According to the Hardy-Weinberg law of population genetics, the frequencies of alleles remain constant across generations.
The offspring generation receives one allele each from the father’s side and the mother’s side, according to this distribution.
The pair received determines the blood type
PRISM program for the above example
Probabilistic base atom
msw(abo, *)
values(abo, [a, b, o], [0.5, 0.2, 0.3])
P(msw(abo, a)) = 0.5
P(msw(abo, b)) = 0.2
P(msw(abo, o)) = 0.3
(2)
Find the genotype [Gf, Gm] by gtype(Gf, Gm)
Calculate the phenotype blood type X from the genotype [Gf, Gm] by pg_table(X, [Gf, Gm])
(3)
Definition of the predicate pg_table(X, [Gf, Gm])
Describes the logical relationship between genotype [Gf, Gm] and phenotype X.
(4)
PRISM-specific programming using the probabilistic built-in predicate msw( * , * )
When msw(abo, Gf) is executed, it rolls the die named abo and returns a gene Gf for the father’s side, sampled from (a, b, o) according to the probabilities specified in (1); see the sketch below.
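The same generative story can be sketched in Python as an illustrative analogue of the PRISM program (this is not PRISM itself; the function name and seed are arbitrary): sample one allele from each parent's side according to the frequencies in (1) and map the genotype to the phenotype.

```python
import numpy as np

rng = np.random.default_rng(11)
alleles, freqs = ["a", "b", "o"], [0.5, 0.2, 0.3]   # P(a), P(b), P(o) from (1)

def blood_type(gf: str, gm: str) -> str:
    """Map a genotype [Gf, Gm] to the ABO phenotype (the pg_table relation)."""
    pair = {gf, gm}
    if pair == {"a", "b"}:
        return "AB"
    if "a" in pair:
        return "A"            # (a, a), (a, o), (o, a)
    if "b" in pair:
        return "B"
    return "O"                # (o, o)

# Sample genotypes: one allele from the father's side, one from the mother's side
n = 100_000
gf = rng.choice(alleles, size=n, p=freqs)   # corresponds to msw(abo, Gf)
gm = rng.choice(alleles, size=n, p=freqs)   # corresponds to msw(abo, Gm)
types, counts = np.unique([blood_type(f, m) for f, m in zip(gf, gm)], return_counts=True)
print(dict(zip(types, counts / n)))
# Expected under Hardy-Weinberg: A ~ 0.55, AB ~ 0.20, B ~ 0.16, O ~ 0.09
```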

4. parameter learning

Parameters can be estimated inversely from data
Maximum likelihood/MAP estimation
EM algorithm
Viterbi learning method

5. Bayesian Inference

Bayesian inference using Dirichlet distribution
MCMC methods
Metropolis-Hastings Algorithm
Variational Bayesian Methods
Similarity to the EM algorithm

6. developmental topics

6.1 Generative CRFs

Useful for logistic regression, linear-chain CRFs, etc.
Discriminative probability models

6.2 Infinite Computation

A stochastic system in which elements are interacting stochastically, such as metabolites and enzymes in a metabolic network in bioinformatics
Loops in dependency relationships arise, and infinite sums of probabilities must be calculated.

7 Conclusion
