Iwanami Data Science – The World of Bayesian Modeling Reading Notes


Memo for reading “Iwanami Data Science: The World of Bayesian Modeling”.

Introduction
Part I. The World of Bayesian Modeling
From Mean to Individuality: Statistical Modeling Opens a New Vision of the World Yukito Iba

Overview
To statistics that incorporate as much individuality as possible
The One and Only You
When you seek data that is as close to you as possible, rather than the average, you end up with yourself.
The finer the granularity, the closer you get to a single individual
Each piece of data, on its own, is like a luminous star in the night sky, with infinitely deep darkness in between
Modeling is a way of filling in the gaps by grouping similar things together in some way
Without modeling, no laws can be drawn out or predicted
Modeling Examples
Figure 2: Example of data analysis
When taking an average (a representative value), should the data be divided into two groups, three groups, and so on?
Modeling 1
Linear approximation.
Modeling 2
Represent knowledge in a probabilistic model
yi = (mean value at location xi) + (random fluctuations)
Random fluctuations are assumed to be “normally distributed”
Mean is m, variance is σ²
p(Y) = p(y1, y2, …, yN)
p(yi) = 1/√(2πσ²) · exp(−(yi − m)²/(2σ²))
Model represented by random only
Model split in two in Figure 2b
Model representing a linear approximation in Figure 2d
General solution is not a special solution
Statistical Science at Work
Developing models and modeling with them
Various forms of probability distributions and functions
Estimation, evaluation, and use of models
Model selection by AIC criteria
Samples generated from the model can be used to evaluate the model
Traditional statistics is the main body here.
Because the model itself was limited to a small number of normal distributions and linear functions
What is a “smooth curve?”
How to represent complex positional dependencies?
One direction
Consider second- and third-order polynomials and choose the model that seems optimal, such as the AIC minimum criterion
Mathematical expression of the concept of “smooth curve” itself
Consider the curve f itself to be a sample from a probability distribution.
p(yi|f) = 1/√(2πσ²) · exp(−(yi − f(xi))²/(2σ²))
p*(f)
Probability density function of f
Originally infinite dimension
To process on a computer, divide into a large number of subintervals, and set a constant value on each of the subintervals.
Example: 100 divisions: {f1, f2, …, f100}, where f is a vector of dimension 100.
p*(f) = C exp(−(1/2δ²) Σi (fi+1 − 2fi + fi−1)²)
Represents a small second-order difference
When δ² is small, f stays close to a straight line
When δ² is large, f can take an arbitrary shape
Choosing f is like drawing from an “urn” containing many curves that are “wobbly” to a certain degree (a minimal numerical sketch follows below)
Application in spatial and temporal f
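As a minimal numerical sketch of this smoothness prior (all values below are assumptions chosen only for illustration): with Gaussian observations yi ~ N(f(xi), σ²) and the second-difference prior above, the MAP estimate of f reduces to solving a linear system.

```python
import numpy as np

# Minimal sketch: MAP estimate of a "smooth curve" f under the second-difference
# prior p*(f) ∝ exp(-(1/2δ²) Σ (f_{i+1} - 2 f_i + f_{i-1})²)
# with Gaussian observations y_i ~ N(f_i, σ²).  (Illustrative values only.)
rng = np.random.default_rng(0)
n = 100
x = np.linspace(0, 1, n)
true_f = np.sin(2 * np.pi * x)           # hypothetical "true" curve
sigma, delta = 0.3, 0.05                 # assumed noise / smoothness scales
y = true_f + rng.normal(0, sigma, n)

# Second-difference operator D (shape (n-2, n))
D = np.zeros((n - 2, n))
for i in range(n - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]

# Maximizing log p(f|y) = -||y - f||²/(2σ²) - ||D f||²/(2δ²) + const
# gives the linear system (I/σ² + DᵀD/δ²) f = y/σ².
A = np.eye(n) / sigma**2 + D.T @ D / delta**2
f_map = np.linalg.solve(A, y / sigma**2)

print("residual std:", np.std(y - f_map))
```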
Use to represent “individuality.”
Divide customers into K groups
Average of K groups m={mk} (k=1, … ,K)
mk is normally distributed around the “mean of the means of each group” M
p*(m)
Distribution that incorporates group differences but uses overall information
Hierarchical Bayesian model
Consider multi-stage data generation process by reading constants as random variables
How can I make inferences using the hierarchical model?
Given δ², data are generated first as f, then y, in that order
Equivalent to giving the joint probability density function of y and f
p(y,f) = p(y|f)p*(f)
p(y|f)
A few more steps of probability calculation.
For any pair of random variables, the joint probability density function p(r, s) satisfies
p(r|s)p(s) = p(r, s) = p(s|r)p(r)
(1)
Integrating both sides, the left-hand side becomes
∫p(r|s)p(s)dr = p(s)∫p(r|s)dr = p(s)
Therefore
p(s) = ∫ p(r, s)dr = ∫ p(s|r)p(r)dr
(2)
(1) / (2)
p(r|s) = p(r, s) / ∫ p(r, s)dr = p(s|r)p(r) / ∫ p(s|r)p(r)dr
Bayes’ law
If the entire probability structure is assumed, most of the “estimation, evaluation, and use of the model” is taken care of
Using the above, the probability distribution one step up is obtained
The joint distribution of f and σ², given only y, is
If you model the simultaneous probabilities of all variables and put data into it, you automatically get the answer.
Utility of “individuality” Micro and macro
How can hierarchical models be used?
Parameters, such as “personality,” assigned to small groups, individuals, or locations, are “micro parameters.”
If you are interested in predicting the future by individual or by location
Probability of surgical success for an individual
Prediction of purchasing behavior by subdivided groups
Density of plant distribution at each location
Parameters that describe characteristics of the entire system equivalent to σ2 or m “macroscopic parameters”
Ignoring microstructure may bias estimated macro parameters
In some cases, correlations that are clearly present on an individual basis are underestimated by performing averaging operations that ignore individuals.
Connecting Islands of Science to Unknown Continents
The sciences are organized by core propositions
Law of conservation of energy
Coding of genetic information by DNA
Existence of atoms
In between each of these areas, there remain areas that do not clearly belong to any one area.
Problem-specific decisions require maximum use of immediate knowledge
Example
Has the ecosystem changed due to river modifications?
What should I sell for tomorrow’s big sale?
Is it dangerous to put a baby on its back?
How to choose the right treatment for each constitution
Predicting the Global Environment
Automatic processing of large numbers of texts
Thermodynamic Worldview
Directly observe the macroscopic variables governing a phenomenon and unravel the laws between them
Science and engineering missions
Statistics of “averages”
There are “microscopic” statistics as a counter/complementary
memo
Reproducing kernel method
Trying out the “mean to individuality” example
Overview
MCMC software like JAGS and STAN
Kalman filter based software like KFAS
Artificial data
Capturing “Individuality” in Hierarchical Models Takuya Kubo
Observational Data and Statistical Modeling
Examples in Ecology
What patterns are seen in simplified fictitious data for fictitious plants?
I want to know how many seeds an individual plant produces when I select it.
Every plant always has exactly 10 ovules, the organs from which seeds are produced.
Fruiting is the process by which an ovule becomes a seed
The probability that a given ovule becomes a seed is called the fruiting probability.
Data on the number of seeds when 100 individual plants are observed
100 individuals x 10 ovules = 1000 ovules
496 become seeds
Fruiting probability = 496/1000 = 0.496?
Division estimation and its statistical model
What is the mean value calculation (496/1000=0.496)?
Let i be an individual of a fictitious plant
Assume all individuals have the same probability of fruiting q
The probability that yi of the 10 ovules of individual i become seeds follows a binomial distribution
Binomial distribution
Each ovule either fruits (1) or does not (0)
The likelihood for the whole set of 100 individuals is the product of f(yi|q) over the 100 individuals
In this case, the maximum likelihood estimate of q is the total number of seeds divided by the total number of ovules
Predictions of the model ignoring individual differences
Probability distribution in the above example
Binomial distribution does not represent the phenomenon well
The assumed process, “every individual’s ovules fruit with the same probability q,” is incorrect.
Variation in the number of seeds yi of individual i deviates from the prediction of the binomial distribution model.
Overdispersion
The binomial distribution needs to be extended
The fruiting probability q varies among individual plants (see the simulation sketch below)
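A small simulation sketch (all parameter values assumed) of how individual differences produce overdispersion: each plant gets its own fruiting probability through a logit-normal random effect, and the variance of the resulting counts clearly exceeds the binomial variance implied by the pooled estimate.

```python
import numpy as np

# Sketch: overdispersion from individual differences (illustrative values).
# Each of 100 plants has 10 ovules; individual i fruits with probability
# q_i = logistic(beta + alpha_i), alpha_i ~ N(0, sigma²).
rng = np.random.default_rng(1)
n_plants, n_ovules = 100, 10
beta, sigma = 0.0, 1.5                      # assumed values
alpha = rng.normal(0, sigma, n_plants)
q = 1 / (1 + np.exp(-(beta + alpha)))
y = rng.binomial(n_ovules, q)

q_hat = y.sum() / (n_plants * n_ovules)     # pooled estimate, like 496/1000
binom_var = n_ovules * q_hat * (1 - q_hat)  # variance if all q_i were equal
print("pooled q:", q_hat)
print("binomial variance:", round(binom_var, 2),
      " observed variance:", round(y.var(ddof=1), 2))  # observed >> binomial
```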
Models that account for individual differences
Express probability of fruiting as a logistic function q(z)=1/(1+exp(-z))
The variable that represents the fruiting susceptibility for a given individual i is zi
zi=β + αi
β: common to all individuals
αi: individual differences
Plugging this into the likelihood of the binomial distribution model
Maximum likelihood estimation is not possible
Individual differences expressed in a hierarchical Bayesian model
Role of the Hierarchical Bayesian Model
A method that eliminates the need for maximum likelihood estimation of the individual differences αi for all 100 individuals.
Leave αi undetermined and treat it as a random variable
Let’s determine αi so that the individuals are as close to each other as possible (αi is close to zero) within the range that can successfully explain the observed data.
The probability distribution given the role of constraining {αi} is called a prior distribution.
The probability distribution of αi is determined by the observed data {yi} and the rule that “the observed fruiting probabilities of the 100 individuals are, on the whole, somewhat similar”
Probability distribution of αi and β determined by observed data and prior distribution is posterior distribution
For simplicity, the prior distribution of individual differences αi is assumed to be normally distributed with mean zero and standard deviation σ.
σ represents how similar the individuals of this plant are to each other.
If σ is close to zero, all individuals are similar to each other
If σ is large, αi takes a value that matches the number of fruiting yi of each individual
Posterior distribution for prior distribution
What should we do with the parameter σ?
We postpone and simply assume that σ also follows some probability distribution h(σ).
This is called a hyperprior distribution, since it is a prior on the parameters of a prior distribution
Joint distribution of the parameters given the observed data {yi}
The denominator is a constant, since it is the sum over all cases
The probability density of the posterior distribution is proportional to the product of the likelihood (given the observed data), the prior distribution, and the density of the hyperprior distribution
Empirical Bayesian Maximum Likelihood Method
Method for estimating the parameters of hierarchical Bayesian
Empirical Bayesian method
In the posterior distribution p(β, {αi}, σ|{yi})
The prior distribution gβ(β) of the parameter β common to all individuals and the hyperprior h(σ) of σ, which represents the spread of the individual differences, are assumed to be “uniform distributions with very large variance”
Both β and σ can take whatever values fit the observed data
The individual difference αi of each individual is constrained by the prior distribution gα(αi|σ), a normal distribution with mean zero and standard deviation σ
Because gβ(β) and h(σ) are constants
Posterior distribution
Quantity integrated with respect to αi
This becomes the likelihood for the parameters β and σ given the observed data {yi}
Exactly the same form as the generalized linear mixed model (GLMM) class of models
Easily computed with the R package glmmML
Distribution in GLMM
Markov Chain Monte Carlo (MCMC) method
Used for more complex models that the empirical Bayes approach cannot handle (a quadrature-based sketch of the empirical Bayes computation itself follows below)
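A simplified sketch of the empirical-Bayes/GLMM computation on simulated data: each αi is integrated out numerically by Gauss–Hermite quadrature, giving a marginal likelihood in (β, σ) that is then maximized. This is only an illustration of the idea, not the actual implementation inside glmmML, and all data and settings are assumptions.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.optimize import minimize
from scipy.special import comb, expit

# Marginal likelihood L(beta, sigma) =
#   prod_i ∫ Binom(y_i | 10, expit(beta + a)) N(a | 0, sigma²) da,
# with each integral over alpha_i approximated by Gauss-Hermite quadrature.
rng = np.random.default_rng(2)
n, m = 100, 10
alpha_true = rng.normal(0, 1.5, n)
y = rng.binomial(m, expit(0.0 + alpha_true))          # simulated fruiting counts

nodes, weights = hermgauss(30)                         # ∫ e^{-t²} g(t) dt ≈ Σ w_k g(t_k)

def neg_log_marginal(params):
    beta, log_sigma = params
    sigma = np.exp(log_sigma)
    a = np.sqrt(2.0) * sigma * nodes                   # change of variables a = √2 σ t
    p = expit(beta + a)                                # fruiting prob. at each node
    # Binomial pmf for every (individual, node) pair
    pmf = comb(m, y)[:, None] * p**y[:, None] * (1 - p)**(m - y[:, None])
    lik_i = pmf @ weights / np.sqrt(np.pi)             # marginal likelihood per individual
    return -np.sum(np.log(lik_i))

res = minimize(neg_log_marginal, x0=[0.0, 0.0], method="Nelder-Mead")
beta_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print("beta_hat =", round(beta_hat, 2), " sigma_hat =", round(sigma_hat, 2))
```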
Ecology and Statistics of Individuality
What are individual differences?
Different body size and age of each observed individual, different genes in each individual
Bright or dark growing sites, and high or low nutrient content in the soil
“Individuality” and Parameter Estimation Yukito Iba
An example in which parameters cannot be estimated well when data are pooled without regard to “individual differences” such as differences between individuals or between groups.
Data consisting of four groups (j=1,2,3,4) of pairs of observed values (x,y)
There is a linear relationship y = ax + bj + η behind each group of data
η is random noise with zero expected value
Slope a is common to all
Intercept bj is different for each group
The person observing the data knows which group each observation belongs to
Data
Looking at each group shows that x and y are correlated
If you throw away the group information and mix everything together, the points clump into a single blob and no correlation is visible.
To estimate the shared parameters, such as the slope a, correctly and efficiently, including their error
A model that “incorporates a different bj for each group and loosely relates them” is needed.
For each group, take an appropriate reference point (e.g., the mean of the x, y data belonging to the group), take differences from that point, and then pool everything for analysis (see the simulation sketch at the end of this subsection).
Classic method when there are individual, personal, and group differences
Differences, ratios, etc., can be successfully taken to eliminate individual differences and differences among individuals.
Eliminate individual differences by taking the difference between before and after treatment of the same person in medical care.
Difficult when the problem is complex
Modeling using “variables that are not directly observed to represent individual, personal, or group differences” without forcing elimination
In this example, the bj
Bind the many introduced parameters with a prior distribution, and later integrate them out (marginalize), for example by MCMC
This is the idea of hierarchical Bayesian modeling
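A simulation sketch of the four-group setting above (all values assumed): with a common slope and group-specific intercepts, pooling the data hides the x–y relationship, while centering within each group, the classical fix mentioned above, recovers the slope.

```python
import numpy as np

# Common slope a, group-specific intercepts b_j.  Pooling while ignoring groups
# obscures the relationship; within-group centering recovers the slope.
rng = np.random.default_rng(3)
a, b = 1.0, np.array([0.0, 5.0, 10.0, 15.0])      # assumed values
groups = np.repeat(np.arange(4), 25)
x = rng.normal(-b[groups], 1.0)                   # group means of x shift opposite to b
y = a * x + b[groups] + rng.normal(0, 0.5, 100)

def ols_slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print("pooled slope:", round(ols_slope(x, y), 2))              # badly attenuated
xc = x - np.array([x[groups == j].mean() for j in range(4)])[groups]
yc = y - np.array([y[groups == j].mean() for j in range(4)])[groups]
print("within-group slope:", round(ols_slope(xc, yc), 2))      # close to a = 1
```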

Statistical Science Incorporating Individual and Regional Differences: A Case Study in Medicine Toshiro Tango

Introduction.
“Drugs work” does not equal “disease is cured”
Does not work uniformly for all patients
Unpredictable individual differences exist
Variables or random variables that vary from person to person
Mixed-effects model
Fixed-effects in which it is natural to assume that the effect is constant regardless of the individual, such as sex, age, etc.
Variable effects (random-effects), such as the effect of a drug, that vary from person to person
Bayes model
Assumes random variables (prior distribution) for all factor effects
System incorporates individual differences
The reference interval for a test is set so as to include approximately 95% of healthy people
Individual physiological variability is significantly narrower than that of the population
Plotting five readings of red blood cells at one screening center
Greater variability within the population than within individuals
Large individual differences
The distribution of test data for any given individual follows a normal distribution N(μi, σi²) after an appropriate variable transformation
That there are individual differences.
H0: the null hypothesis that μi = μj, σi² = σj² is rejected
linear random-effects model
xij is the jth measurement for individual i
xij = μi + εij = (μ + βi) + εij
i = 1, …, n (individuals); j = 1, 2, …, r (repetitions)
βi: random effect of individual i, representing individual differences
βi ~ N(0, σβ²) (σβ² is the between-individual variance)
εij: repetition (measurement) error
εij ~ N(0, σε²) (σε² is the within-individual variance)
Variance of the population σ²
σ² = σβ² + σε²
Individual difference index
Evaluate the magnitude of individual differences in test items
η = σβ / σε
Between-individual and within-individual sums of squares Vβ, Vε (see the sketch below)
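A sketch of the linear random-effects model and the classical ANOVA (method-of-moments) estimates of the variance components and the index η = σβ/σε; all numerical values are assumptions chosen only for illustration.

```python
import numpy as np

# One-way random-effects model x_ij = mu + beta_i + eps_ij and the classical
# ANOVA estimates of the variance components (illustrative values only).
rng = np.random.default_rng(4)
n, r = 50, 5                                   # individuals, repeated measurements
mu, sigma_beta, sigma_eps = 450.0, 30.0, 10.0  # assumed values
x = mu + rng.normal(0, sigma_beta, (n, 1)) + rng.normal(0, sigma_eps, (n, r))

xbar_i = x.mean(axis=1)
ms_between = r * np.sum((xbar_i - x.mean())**2) / (n - 1)
ms_within = np.sum((x - xbar_i[:, None])**2) / (n * (r - 1))

sigma_eps2_hat = ms_within
sigma_beta2_hat = max((ms_between - ms_within) / r, 0.0)
eta = np.sqrt(sigma_beta2_hat / sigma_eps2_hat)     # individual-difference index
print("sigma_beta^2 ≈", round(sigma_beta2_hat, 1),
      " sigma_eps^2 ≈", round(sigma_eps2_hat, 1), " eta ≈", round(eta, 2))
```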
Evaluation of treatment effects incorporating individual differences
Estimation of disease risk by incorporating regional differences
Hierarchical Bayesian Model
A hierarchical Bayesian model for evaluating the efficacy of the drug Progabide given as add-on therapy
yji ~ Poisson(μji); log μji = log(T) + b0i + b1i xji2 + γ xji1 xji2; b0i ~ N(μλ, σλ²), b1i ~ N(μθ, σθ²)
The prior distributions of the parameters, which are random variables, are chosen to be “uninformative”
γ, μλ, μθ ~ N(0, 100²); 1/σλ², 1/σθ² ~ Gamma(0.001, 0.001)
Solving by MCMC method

From Global Model to Local Model State Space Model and Simulation Tomoyuki Higuchi

Introduction.
Time-localized “local models”
Well-connected local models
Modeling Tokyo Temperature Data
Data analysis using the overall model
Statistical model with a few fixed parameters
Local linear model
Extract time-localized information
Local nonlinear model
Extension to nonlinear
Stochastic difference equation
A “soft” model that allows stochastic deviations from the constraints expressed by the equation
Generalizes the distributional form of the noise terms, which produce stochastic fluctuations, to non-Gaussian distributions
Handles infrequent stochastic events such as jumps and anomalies
Non-Gaussianity
General state space model
Particle Filter
Proposed for the numerical solution of general state space models
Faithfully handles the nonlinearity and non-Gaussianity of the general state space model
Significant simplicity of computational implementation
Practical applications in robotics, ITS (Intelligent Transportation Systems), finance, marketing, etc.
Simulation using particle filters
Phase space composed of variables included in the simulation model
When the numerical solution of a simulation is plotted, what was a single point when the initial values and boundary conditions were given becomes a “string” (trajectory) as the computation proceeds.
Different trajectories are drawn in phase space when initial conditions are changed.
Multiple paths (solution varieties) arise as the simulation continues
Observational data can be used to narrow down the simulation choices
From global model to local model
Change in average temperature in Tokyo over time
N number of data
Y={y1, y2, …,yN} Temperatures are on an upward trend
Fit a straight line a n + b
wn the difference term between the straight line and the observed values
Gaussian distribution with mean 0 and variance σ2
wn ~ N(0, σ2)
Probability distribution of data set Y
Probability distribution of data set Y
N independent samples of data yn from a Gaussian distribution with mean μn = a·n + b and variance σ²
The unknown parameters are collected into a vector θ = (a, b, σ²)ᵀ
A function of θ
Likelihood function
The value of θ is determined by maximizing the likelihood function or log-likelihood function
The slope of the line is not constant
Represents a “locally approximately straight line” for µn, rather than a linear model for the entire data set
μn = 2μn−1 − μn−2 + vn, vn ~ N(0, τ²)
vn is the noise representing the deviation from a straight line over three consecutive points
λ = τ²/σ²
A straight line when λ = 0
Far from a straight line when λ → ∞ (σ² → 0)
State Space Model
System Model
xn = Fn(xn−1, vn)
xn = (μn, μn−1)ᵀ, vn = (vn)
xn: state vector
vn: system noise vector
White noise following q(v|θsys)
θsys: parameter vector describing the distribution
Fn: nonlinear function
Observation model
yn = Hn(xn, wn)
yn = (yn), wn = (wn)
wn: observation noise vector
White noise following r(w|θobs)
θobs: parameter vector describing the distribution
Hn: nonlinear function
Local nonlinear model
Local increases and decreases are amplified by the urban heat island effect
Introduce an unobservable quantity ρn
Discrete-time random walk
xn = (μn, μn−1, ρn)ᵀ
vn = (vμ,n, vρ,n)ᵀ
Sequential Bayesian filter
Z1:n : Quantity of all vectors z from the first time to time n
Explanation of conditional distribution
Predictive Distribution
Predicts this year using data up to last year
Filter distribution
Distribution of this year’s state vector based on data up to this year
Smoothed distribution
Distribution of this year’s state vector under all data at hand
Schematic of the recursive equations for state estimation
(1) One-step-ahead prediction
We have last year’s filter distribution p(xn−1|y1:n−1)
p(xn|y1:n−1) = ∫ p(xn|xn−1) p(xn−1|y1:n−1) dxn−1
p(xn|xn-1) is the system model of the general state space model
(2) Filtering
Once this year’s predictive distribution is obtained, this year’s data comes in and a filtering calculation using Bayes’ theorem yields a filter distribution
Monte Carlo approximation and particle filter
In the general state space model, the conditional distribution p(xj|y1:k) can exhibit any shape, so representation using analytic functions is not possible
How to represent the state vector when it is high dimensional
Dealing with integrals of the dimensions of the state vector that appear in sequential expressions is also an issue.
How can we simplify the implementation of the sequential update equation while making it possible to represent p(xj) in ultra-high dimensions on a computer?
We can approximate the conditional distribution with a large number of independent realizations (e.g., hundreds to a million) that we consider to be derived from it.
Monte Carlo approximation
Each realization is called a “particle”
Example
Predicted distribution p(xn|y1:n-1)
Xn|n-1 = {xn|n-1(1), xn|n-1(2),… ,xn|n-1(m)}
Predicted particle
Filter distribution p(xn|y1:n)
Xn|n = {xn|n(1), xn|n(2), …, xn|n(m)}
Filter particles
xj|k(i)
State vector at time j
The last time of the observation data used to estimate the state vector is k
I-th particle
Sequential update formula in particle-approximated system
Resample the predicted particles (sampling with replacement) with probabilities given by the normalized likelihoods; the resulting particles are the filter particles at time n
Particles with low goodness of fit die out
Particles with high goodness of fit are duplicated and increase in number (a minimal particle-filter sketch follows below)
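A minimal bootstrap particle filter for a local-level model, following the predict / weight / resample steps just described; the model, noise levels, and particle count are assumptions chosen only for illustration.

```python
import numpy as np

# Minimal bootstrap particle filter for a local-level model (illustrative):
#   system:      x_t = x_{t-1} + eta_t,  eta_t ~ N(0, tau²)
#   observation: y_t = x_t + eps_t,      eps_t ~ N(0, sigma²)
rng = np.random.default_rng(5)
T, tau, sigma, M = 100, 0.3, 1.0, 1000

x_true = np.cumsum(rng.normal(0, tau, T))          # simulate states and data
y = x_true + rng.normal(0, sigma, T)

particles = rng.normal(0, 5.0, M)                  # diffuse initial particles
filt_mean = np.empty(T)
for t in range(T):
    particles = particles + rng.normal(0, tau, M)  # (1) one-step-ahead prediction
    w = np.exp(-(y[t] - particles)**2 / (2 * sigma**2))   # likelihood of each particle
    w /= w.sum()                                   # normalized weights
    idx = rng.choice(M, size=M, p=w)               # (2) resample with replacement:
    particles = particles[idx]                     #     low-fit die, high-fit multiply
    filt_mean[t] = particles.mean()

print("RMSE of filter mean vs true state:",
      round(np.sqrt(np.mean((filt_mean - x_true)**2)), 3))
```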
Simulation and Data Assimilation
The simulation model takes the mathematical form of a time-evolution model
Partial differential equations in continuous time and space
In space, a longitude–latitude grid system (lattice)
Various variables are defined and computed on the grid system in science and engineering

Modeling Living Words: The Interface between Natural Language Processing and Mathematics Daichi Mochihashi

Introduction.
Hypotheses and testing based on linguists’ experience and subjectivity
The field of statistical thinking about language
Computational Linguistics
From an engineering standpoint, natural language processing
Automatically model complex and vast linguistic phenomena with computers by taking a statistical view of language
Mathematically handle ambiguities, exceptions, and context-dependencies that cannot be captured by rules
Statistical Model of Language
Language is a sequence of symbols
First, we assume that language consists of words
Word frequency is highly skewed
Rank and frequency are inversely proportional
Zipf’s law
Power law common to many discrete phenomena in nature
If word i appears ni times in a text of N words in total, its probability is
pi = ni / N
The rate of occurrence of each word is represented by a V-dimensional vector (V is the vocabulary size): p = (p1, p2, …, pV)
Example: If there are only three vocabulary words (w1,w2,w3)=(“bouquet”, “voyage”, “Viagra”)
p1=(0.3, 0.7, 0)
p2=(0.4, 0.3, 0.2)
p3=(0.1, 0.1, 0.8)
Advertising e-mails are generated from a probability distribution such that p3, normal e-mails are generated from p1 and p2
Probability distribution of p: the Dirichlet distribution
Consider the word-occurrence distribution p itself to be generated from a Dirichlet distribution
Procedure
Generate p ~ Dir(p|α)
~ : “distributed according to”
Words wi ~ p (i = 1, …, N) are generated
The probability of w is obtained by integrating over the possible values of p, i.e., computing an expectation
ni is the number of times word i appears in w
For a large number of documents w1, …, wD, take the product of their probabilities
Convex with respect to α. Using Newton’s method, we can find the parameter α of the prior distribution that maximizes the probability of the data
Given W, the probability distribution of p that gave rise to it is estimated from Bayes’ theorem as the Dirichlet posterior distribution
The expected value is
E[pi|w] = (ni + αi) / (N + Σk αk) (checked numerically in the sketch below)
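A small numerical check of this posterior mean using the Dirichlet–multinomial conjugacy (the denominator uses the sum of the αi); the counts and the symmetric prior below are assumptions.

```python
import numpy as np

# Dir(alpha) prior + multinomial counts n  ->  Dir(alpha + n) posterior,
# whose mean is (n_i + alpha_i) / (N + sum(alpha)).
rng = np.random.default_rng(6)
alpha = np.array([1.0, 1.0, 1.0])        # assumed symmetric prior
n = np.array([3, 7, 0])                  # word counts, e.g. ("bouquet","voyage","Viagra")
N = n.sum()

analytic = (n + alpha) / (N + alpha.sum())
mc = rng.dirichlet(alpha + n, size=100_000).mean(axis=0)   # Monte Carlo from posterior
print("analytic:", np.round(analytic, 3), " Monte Carlo:", np.round(mc, 3))
```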
Generation of infinite discrete probability distribution G by Dirichlet process
Using a Dirichlet process also assumes the probability of generation of unknown words
N-gram model and infinite n-gram model
Words do not arise independently (the aforementioned assumption is not true)
N-gram model
A model in which a word depends on the (n-1) words before it (captures the relationship between adjacent n words)
1 gram model
Words are generated independently of each other
The probability of a sentence w=w1w2…wT in the N-gram model is
Product of conditional probabilities
Example: In the case of 3 grams
p(she sees a dream)
= p(she) × p(sees | she) × p(a | she sees) × p(dream | sees a)
Markov process of (n-1)-order of words
Very simple model, but extremely effective for reducing the probability of linguistic ineligibility in speech recognition, statistical machine translation, etc.
In the n-gram model, the more n is increased, the more precisely the relationships between words are captured
If n is increased too much, simple estimation leads to zero conditional probability due to lack of data
Example: “a tortoiseshell cat with a fish in its mouth”
p(tortoiseshell cat | fish in its mouth) = n(tortoiseshell cat with a fish in its mouth) / n(fish in its mouth) = 0 when the full phrase never appears in the data (see the smoothing sketch below)
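A sketch of the zero-count problem with maximum-likelihood n-gram estimates, and additive (Laplace) smoothing as a crude fix; the hierarchical Dirichlet / Pitman–Yor constructions described next are the principled version of this idea. The toy corpus is an assumption.

```python
from collections import Counter

# Maximum-likelihood n-gram probabilities go to zero for unseen continuations;
# additive (Laplace) smoothing is a crude fix.
corpus = "she sees a dream she sees a cat the cat sees her".split()
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_ml(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

def p_laplace(w, prev, k=1.0):
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * len(vocab))

print("ML      p(dream | sees) =", round(p_ml("dream", "sees"), 3))      # 0.0: never seen
print("Laplace p(dream | sees) =", round(p_laplace("dream", "sees"), 3)) # small but nonzero
```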
Hierarchical Dirichlet process
1-gram distribution p(*)
2-gram distribution p(*|w1) by Dirichlet process with base measure G0
Furthermore, a 3-gram distribution p(*|w2w1) is generated by a Dirichlet process with this as the base measure
Conceptual Diagram
In the case of language, the fit of the Dirichlet process is not perfect
Its Extension
Two-parameter Poisson–Dirichlet process (Pitman–Yor process)
Hierarchical Pitman–Yor process
Variable n (depth of tree hierarchy) of n-grams by stochastic modeling with context length as a hidden variable.
Sentences generated by random sampling (a random walk) from the model
Statistical Model of Word Meaning
How to statistically handle the “meaning” of words?
Probability distribution θ = (θ1, …, θK), generated for each document from a Dirichlet distribution
K: Total number of potential topics (usually around 100)
LDA Model
Choose a topic distribution θ
Generate topic distribution θ~Dir(θ|α) that the sentence has
Randomly select topics according to θ
For n = 1, …, N
Select a topic kn ~ θ
A word is generated from the topic
Generate the word wn ~ p(w|kn) from topic kn
The probability of each document w = w1w2…wN is (a toy generative sketch follows below)
Bayesian estimation of posterior probabilities
By sampling over millions to hundreds of millions of words of text, the correct topic generated for each word and the topic distribution of the document can all be estimated
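A toy sketch of the LDA generative procedure listed above; the two topic–word distributions, the vocabulary, and the hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np

# LDA generative process: theta ~ Dir(alpha); for each word, topic k ~ theta,
# then word w ~ p(w | k).  All values below are toy assumptions.
rng = np.random.default_rng(7)
vocab = ["tunnel", "glass", "signal", "window", "snow", "train"]
phi = np.array([                       # p(w | k) for K = 2 toy topics
    [0.30, 0.25, 0.20, 0.15, 0.05, 0.05],   # topic 0: "scenery through the window"
    [0.05, 0.05, 0.05, 0.10, 0.40, 0.35],   # topic 1: "snow-country travel"
])
alpha = np.array([0.5, 0.5])

theta = rng.dirichlet(alpha)                     # topic distribution of one document
topics = rng.choice(2, size=15, p=theta)         # k_n ~ theta for each word position
words = [vocab[rng.choice(len(vocab), p=phi[k])] for k in topics]
print("theta =", np.round(theta, 2))
print(" ".join(words))
```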
Results of data processing in Kawabata Yasunari’s Snow Country
K=100
Topic 58: Tunnel – glass – signal – window
LDA processing results for Mainichi Shimbun text
References
Statistical Latent Semantic Analysis with Topic Models
Natural Language Processing Series
Topic Model
Machine Learning Professional Series
Iwanami Data Science vol2
Word Meaning and word2vec
Word2vec word vectors
Predicts its own word vector from the word vectors of several words before and after each word in the text
Arora (Princeton University) suggested in 2016 that text is explained by a statistical model in which it is “generated by a random walk in a latent semantic space”
“A Latent Variable Model Approach to PMI-based Word Embeddings”
Assume that the word wt at time t in the text is generated according to how close its word vector wt is to the “context vector” ct at that time
p(wt|ct) = exp(wtᵀct) / Σv exp(vᵀct)
A dynamic log-linear model (softmax / logistic-regression form); see the sketch below
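A small sketch of this log-linear (softmax) form with toy random word vectors; the vectors and dimensions are assumptions, and this is not the actual training objective of word2vec.

```python
import numpy as np

# p(w_t | c_t) = exp(w_t·c_t) / Σ_v exp(v·c_t), with toy random word vectors.
rng = np.random.default_rng(8)
d, V = 10, 5                                  # embedding dimension, vocabulary size
word_vecs = rng.normal(0, 1, (V, d))          # one vector per vocabulary word
context = rng.normal(0, 1, d)                 # "context vector" c_t

scores = word_vecs @ context
probs = np.exp(scores - scores.max())         # softmax, numerically stabilized
probs /= probs.sum()
print("p(w | c_t):", np.round(probs, 3), " sum =", probs.sum())
```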
In Closing.
Language has grammatical structure, and acquiring this structure from linear word sequences without supervision (seikai nashi de, “without correct answers”) is the state of the art of research.

Statistical Science as a Post-Modern Science Kunio Tanabe

Don’t understand statistics? Statistics is “a method of inference”
“What is rational inference?”
Perspectives on various approaches to hypothesis (model) formulation and inference methods
Newton–Descartes paradigm
Modern scientific reasoning is based on the “hypothetico-deductive method”
Hypothesis formulation → deduction → experiment
Deduction-driven reasoning
(Elemental) reductionism
The properties of a complex object are the sum of its component properties
The whole can be grasped by examining each element
It is sufficient to decompose the object into its components and build and verify a model for each individual element.
Limitations of deductive reasoning
Only deductive reasoning is allowed in proving mathematical knowledge
Deductive reasoning guarantees absolute certainty in mathematics
In mathematics, concepts such as “∞” and “converging to 0,”
and quantifiers such as “for any (all),”
are handled by finite notation (representation) and finite operations, such as the ε–δ argument
Finite and infinite worlds
The proof that a particular mathematical proof has been carried out correctly lies outside the mathematics in question
Mathematics is not a closed formal system, and its own consistency cannot be guaranteed from within
using the original sin of induction as a lever
Methods of acquiring knowledge other than deductive methods
Inductive Reasoning
Derive generalized knowledge from limited empirical data to predict situations outside of experience
Inference by induction is logically fallacious
No matter how many black crows you have seen, you cannot conclude that “crows are black”.
The fallacy of the conclusion does not change when the proposition “crows are black” is expressed in terms of probability
Not only do we not know the total number of crows, we do not know if the crows we saw were random.
Attempts have been made in the past to bring “objectivity” to “inductive reasoning”
Aristotle’s “simple enumeration”
Duns Scotus’s “method of agreement”
William of Ockham’s “method of difference”
Robert Grosseteste’s and Roger Bacon’s procedures of “testing (verification)”
Hypothesis-testing methods
It is wrong to derive general propositions referring to an infinite number of events from a finite number of data
Knowledge acquired by humans
Obtained by the operation of deductive reasoning on a symbolic representation of an object based on a definition, as in the case of mathematics
By the hypothetico-deductive method, which presupposes an a priori correspondence between the object and its symbolic representation, as in classical physics.
Knowledge that can be made conscious and symbolized
Tacit intuition that does not rise to the surface of consciousness
Humans also live “inductively” rather than “deductively”
Inductive Reasoning” in the Mathematical Sciences
The “problem of deductively inferring the result from the cause” in empirical science → the “forward problem”
The problem of inferring back from the result to the cause → the “inverse problem”
Example: Solving the “heat equation” in reverse to find the past temperature distribution after knowing the current temperature distribution
An increase in entropy occurs during the transition from cause to effect
Most inverse problems that trace the causal chain in reverse are “ill-posed inverse problems”
J. S. Hadamard
When the operator leading from cause to effect is ill-conditioned or irreversible
If you express the causal relationship as an equation, treat the cause as the unknown, substitute observed data for the effect, and try to solve it, errors are amplified and no solution is obtained
Many problems in experimental science and engineering are also ill-posed inverse problems.
Find the charge distribution that creates the field from the information of the electric field
Find the distribution of the mass that makes up the field from the information of the gravitational field
Statistics is inductive inference
Pseudo-deduction
“Hypothesis testing theory” by Fisher, Pearson, and Neyman
Mathematical Statistics
Introduced probability theory into statistics and made some of the inference procedures into mathematical formulas
Only a portion of the inference process is expressed as a mathematical equation
The essence of statistics as induction from data remains the same.
Significance tests
Posit a specific probability distribution (model) behind the observable events
The assumed probability distribution is a self-assumed assumption (modeling) made by humans
Probability distributions of raw observable events cannot be derived from this
Hypothesis can be wrong
Null hypothesis
Based on probability calculations, find a set of events that would rarely occur if this model were true
If the observed data fall into this set of events, the hypothesis is “rejected”
It is not clear how small the probability must be to count as “rare”.
Expanding Knowledge and Statistical Science
The subject of modern science is a complex, hierarchically coupled, multi-degree-of-freedom system of diverse elements that vary on different time scales
Statistics is not only a descriptive study of techniques for collecting and organizing data
Modern statistics is concerned not only with the quantitative understanding of single events; it models the structure of the relationships underlying intertwined, complex events, and integrates prior knowledge with finite empirical data to provide methods for recognizing, predicting, and controlling events.
Three components of inductive inference in statistical science
Statistical and probabilistic models
Probability and statistical models that can flexibly represent the structure and laws of the subject
Data
Algorithm
Developments in solving ill-posed problems
“Equation model” → “optimization (variational) model” → “statistical/probabilistic model”
For modeling systems with multiple subsystems
“This part I know well, that part is rather dubious, and that other part I do not know at all”
With the equation approach, the parts that are “not known” bring the modeling to a halt.
With an optimization model, the unfamiliar or unknown parts can be roughly modeled, weighted, and added to the function to be optimized, but there is no guidance on how the weights should be chosen.
Exponentiating the negative of the function to be optimized (minimized) and normalizing it yields a statistical model called the Gibbs distribution.
Empirical Bayesian Method
The purpose of statistics is not to identify parameters, but the probability distribution itself
Finished logistic regression model
[Appendix] Bayesian Statistics and Machine Learning
Machine learning as a departure from the Newton–Descartes paradigm
Inference in machine learning does not necessarily require precise measurement data
Machine learning does not make a priori discoveries of the mechanisms to be posited between variables
Models as tools of cognition and inference
Models are essential tools for consciously selecting, interpreting, and representing events in the world and inferring their consequences.
There are explicit as well as unconscious models
Models and the real world diverge.
Machine learning, without positing essential factors or compressing the dimension of the data, implicitly captures meaningful co-occurrence structures from ultra-high-dimensional data sets that are simply arrays of observed phenomena (including irrelevant data), and draws inferences based on those structures.
Models and Bayesian Statistics in Machine Learning
Models in machine learning have a “flexible” plasticity that can fit any data by adjusting internal parameters, so there is a risk of over-fitting.
Machine learning has a mechanism to avoid this and keep predictability (generalizability) intact.
Bayesian models ensure generalizability by indirectly softening and restricting internal parameters.
Hierarchy can also be used to create hyper-parameters for more automatic processing.
Application areas and limitations of machine learning
No firm predictive power for data that deviate significantly from the training data
It is not known in advance whether the data are outliers or not.
The mechanisms behind machine-learning inference are not interpretable to humans

Part II: Hierarchical Bayesian Lectures Yukito Iba

Introduction
Hierarchical Bayesian Modeling
Three Features
Represent the process of data generation in terms of probability distributions
Consider many hidden elements (latent variables) that are not directly observed
Treats discrete and continuous equally, regardless of linear model or normal distribution
Composition and Literature Citation
Iwanami DS (Iwanami Data Science)
Statistical Frontiers (Frontiers in Statistical Science)
Kubo Green Book (Statistical Modeling for Data Analysis)
PRML (Pattern Recognition and Machine Learning)
Bayesian Data Analysis
Bayesian Statistical Modeling with Stan and R
Symbols and Terms
Lecture 0 Bayes, Hierarchical Bayes, Empirical Bayes
Introduction
Bayesian Super Fast Learning Course in Iwanami DS1
Bayes
Data y is generated from probability distribution p(y|x)
Parameter x, which determines the probability distribution, is also a sample from another probability distribution p(x)
Let p(x) be the prior distribution of x
→x → y
p(x|y) = p(x,y) / ∫ p(x,y)dx = p(y|x)p(x) / ∫ p(y|x)p(x)dx
The range of integration is the entire range over which x is defined
If x is a multivariate vector, the integral is a multiple integral
If it is a discrete variable, it can be read as a sum.
posterior distribution
If we are only interested in x, we can use the denominator as 1/C
Can be written as p(x|y) =Cp(y|x)p(x)
C is the normalization constant for the posterior distribution
How to extract information from the posterior distribution
Calculate the expected value, median, and quartiles of the statistic of interest A(x) under the posterior distribution
The value of x that maximizes the probability density of the posterior distribution is the estimator of x (MAP estimator)
Maximum Likelihood Estimator
The goal is to predict future data z based on the estimated x, not the parameter x itself
p(z|y) = ∫ p(z|x)p(x|y)dx = ∫ p(z|x)p(y|x)p(x)dx / ∫ p(y|x)p(x)dx
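A small sketch that evaluates the posterior p(x|y) ∝ p(y|x)p(x) and the predictive distribution p(z|y) on a grid, for a normal likelihood with unknown mean; the prior and the data values are assumptions.

```python
import numpy as np
from scipy.stats import norm

# Posterior and predictive distribution on a grid, for y_i ~ N(x, 1)
# with prior x ~ N(0, 10²).  Data and prior are assumed.
y = np.array([1.8, 2.3, 1.1, 2.9])
grid = np.linspace(-10, 10, 2001)
dx = grid[1] - grid[0]

prior = norm.pdf(grid, 0, 10)
lik = np.prod(norm.pdf(y[:, None], grid, 1.0), axis=0)   # p(y | x) on the grid
post = lik * prior
post /= post.sum() * dx                                  # normalize: p(x | y)

z_grid = np.linspace(-5, 10, 1501)
# p(z | y) = ∫ p(z | x) p(x | y) dx, approximated by a grid sum
pred = (norm.pdf(z_grid[:, None], grid, 1.0) * post).sum(axis=1) * dx

print("posterior mean:", round((grid * post).sum() * dx, 3))
print("predictive mean:", round((z_grid * pred).sum() * (z_grid[1] - z_grid[0]), 3))
```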
Effect of prior distribution
Uninformative prior distribution (diffuse prior distribution)
A prior distribution with a large spread, expressing “no prior knowledge about the parameter x”
Difficult to set up properly
Knowledge can be actively incorporated into the prior distribution
When the number of components of the parameter size is fixed and the sample size of data y is increased, the effect of the prior distribution is reduced and Bayesian or maximum likelihood estimation approaches the same result.
When the sample size is small, the estimation result is determined by the balance between the prior distribution p(x) and the portion p(y|x) that contains the data
Hierarchical Bayes
Put the parameter γ into the prior distribution p(x) to make it p(x|γ), and
Assume again a prior distribution p(γ) on this γ
γ → x → y
Determine the prior distribution of x adaptively to the data, rather than just subjectively
Joint posterior distribution of x and γ
p(x,γ|y) = p(y|x)p(x|γ)p(γ) / ∫ p(y|x)p(x|γ)p(γ)dxdγ
Empirical Bayes
Full Bayes: the estimation method that directly uses the joint posterior distribution p(x,γ|y) of x and γ
Empirical Bayes instead fixes γ at the value that maximizes the
Marginal likelihood (evidence)
Hierarchical Bayesian Objectives
Interested in a local parameter x, for which a prior distribution p(x|γ) is obtained from the data
Interested in the global parameter γ and the entire data generating process (mixture distribution) determined from it
The p(x|y) part contains parameters like “slope of the regression curve” and what we mainly want to know is its value

Lecture 1: The Two Faces of Hierarchical Bayesianism

1 From Stein estimator to Hierarchical Bayes
Introduction.
One Origin of Hierarchical Bayesian Modeling
The problem of “shrinkage estimation”
A method of estimation that reduces the effect of parameters corresponding to features that are not relevant to the estimation is called parameter shrinkage or simply the shrinkage method.
Problem Setup
N observed values {yi}
Assume that each yi is obtained from a normal distribution with separate expected value θi
p(yi|θi) = 1/√(2πσ²) · exp(−(yi − θi)²/(2σ²))
Example: if yi is a “weight,” then y1, y2, … might be “the weight of a student in a certain class,” “the weights of various kinds of pets,” “the weights of pieces of furniture in a house”
Variance (measurement error) of each normal distribution is assumed to be known, all with the same σ2
Sample size n is greater than or equal to 4
Goodness measures for estimator θi*({yi})
Expected value of the squared error between the true value of the parameter θi and the estimated value
Stein estimator
In the above setup there is only one measurement yi for each i, so the obvious estimator of θi is the measured value itself: θ̂i = yi
For any {θi}
Stein estimator
Intuitive meaning
Pull each yi by a factor a toward the average of all {yi}
Average value of {yi}
The degree of pull a is determined adaptively from the data by the second line of the equation, rather than being chosen by a human
Stein estimation is a type of “shrinkage estimator”
“Shrink” the “difference from the average”
How the Stein estimator works
“Estimate unrelated things together and the result improves”?
If the expected values θi are very different (the quantities are unrelated to each other), s² is large compared with σ² and a becomes close to zero
If the expected values θi are close to each other (the measured quantities are “related”), s² becomes small, a is no longer negligible, and each estimate is pulled toward the overall average (a Monte Carlo sketch follows below)
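A Monte Carlo sketch comparing the naive estimator θ̂i = yi with Stein-type shrinkage toward the grand mean. The shrinkage constant (N − 3) follows the standard James–Stein form with an estimated mean; since the book's exact expression is not reproduced in these notes, treat it as an assumption.

```python
import numpy as np

# Stein-type shrinkage toward the grand mean vs. the naive estimator.
# sigma² is taken as known; true means are fairly close to each other.
rng = np.random.default_rng(9)
N, sigma = 20, 1.0
theta = rng.normal(5.0, 0.5, N)          # true means (assumed)
trials = 20_000

se_naive = se_stein = 0.0
for _ in range(trials):
    y = rng.normal(theta, sigma)
    ybar = y.mean()
    s2 = np.sum((y - ybar) ** 2)
    a = (N - 3) * sigma**2 / s2          # degree of pull toward the grand mean
    stein = ybar + (1 - a) * (y - ybar)
    se_naive += np.sum((y - theta) ** 2)
    se_stein += np.sum((stein - theta) ** 2)

print("mean squared error, naive:", round(se_naive / trials, 3))
print("mean squared error, Stein:", round(se_stein / trials, 3))   # smaller
```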
From Paradox to Application
“Balancing bias and variance”
Example
Efron and Morris’s example of estimating the batting averages of major-league hitters from a small number of at-bats
The batting average at the end of the season (excluding the at-bats used for the estimation) is taken as the “true batting average”
Applies to cases where the sample size per region is too small and the error too large if the data are split by region, but regional differences are invisible if everything is averaged together
The problem of “small-area estimation”
Interpretation by Bayesian statistics
In the above, “variation of θi” is taken into account, but {θi} itself is a parameter of the normal distribution, not a random variable.
View {θi} as a random variable
Assume the prior distribution of θi is a normal distribution with mean θ0 and variance δ²
p(θi|δ², θ0) = 1/√(2πδ²) · exp(−(θi − θ0)²/(2δ²))
{yi} is generated independently for each i from p(yi|θi)
From Bayes’ formula, the posterior distribution of θi is
p(θi|yi, δ², θ0) = C p(yi|θi) p(θi|δ², θ0) = C′ exp[−(yi − θi)²/(2σ²) − (θi − θ0)²/(2δ²)], where C, C′ are normalization constants that depend on δ² and θ0
The value of θi that maximizes the expression in [ ] is the estimator that maximizes the posterior density (the MAP estimator)
Differentiating with respect to θi and setting the result to zero gives θi^MAP = (1 − b) yi + b θ0, b = σ²/(δ² + σ²)
Equation similar to Stein estimator
Empirical Bayes
The prior mean θ0 and variance δ² are unknown
θ0 and δ² are obtained from the data
From a Bayesian modeling perspective, additional prior distributions are assumed for θ0 and δ²
Approximation by the empirical Bayes method
Maximize the marginal likelihood instead of dealing directly with the posterior distribution of θ0 and δ²
Difference between the two approaches
Subsequent developments
In the Bayesian model
The guarantee that “at least it cannot get worse” is given up in exchange for broader coverage
Inappropriate models can lead to worse estimates than a “simple estimation method that does not take anything into account”
Arguments in the style of Stein’s result are valuable as a logical method for obtaining rigorous conclusions when the sample size is finite
Example of a hierarchical Bayesian model with group structure and multiple factors
2 From Overdispersion to Hierarchical Bayes
Introduction.
Another path for hierarchical Bayesian models
Random-effects models, mixed-distribution models, starting with the “overdispersion” problem
What is overdispersion?
Binomial and Poisson distributions have the property that “there is a relationship between mean and variance: given the population mean, the population variance is also determined”
In the Poisson distribution, since the population mean is θ, the population variance is also equal to θ.
If the sample size is large, the sample mean and variance will be approximately equal
Example: a Geiger counter, with ten 30-second counts
Data A has a mean of 21.2 and an unbiased variance of 24.2, almost the same
Data A was measured at a fixed point
Data B has a mean of 28 but an unbiased variance of 54.4
Poisson distribution process and deviation
Data B was measured by moving around the house
In addition to variation in Poisson distribution, variation in measurement location is added
overdispersion
When individual and regional differences work in the number of disease outbreaks
When individual and soil differences matter in the number of plants in bloom
When there are hidden non-uniformities
Mixed distribution
Consider a probability distribution for “non-uniformity not directly observed” and consider a “mixture” of the distributions assumed at the outset
Assume that for non-negative integer data {yi}, the observed value yi follows a Poisson distribution
p(yi|θi) = θi^yi / yi! · exp(−θi)
Consider a distribution p(θi|γ) that includes the parameter γ for the parameter θi representing intensity
γ → θi → yi
p(yi|γ) = ∫ p(yi|θi) p(θi|γ) dθi
Use of conjugate prior distribution
How to choose a prior distribution p(θi|γ)
One of two typical ways
Choose a mixture distribution so that the form of the formula is simple
Conjugate prior distribution
Choose gamma distribution for Poisson distribution
Beta distribution for binomial distribution
Dirichlet distribution for multinomial distribution
In natural language processing, the Dirichlet process, an extension of the Dirichlet distribution to infinite dimensions, is used as a prior distribution.
Use of Link Functions
Another way
Using a link function, transform the intensity θi as μi = log θi, and then
Express μi as the sum of a constant β and variation γi around it: μi = β + γi
Assume a normal distribution for γi
γi is a quantity not directly observed
Random effects
Characteristics: Easily extended to regression (e.g. Poisson regression)
Conjugate prior distribution can also be extended to Poisson regression (negative binomial regression), but the more complex the model, the more difficult it becomes to handle.
Using a link function is more flexible.
Generalized Linear Mixed Model (GLMM)
Generalized Linear Models (GLMs) such as Poisson regression with random effects
Also useful for incorporating into state space models and CAR models
Meaning to be considered a hierarchical Bayesian model
Introduce {θi} to represent complex distributions
It is more realistic to calculate numerically with MCMC using {θi}.
Actual problem is maximum likelihood estimation of mixed distribution of parameters such as γ, α, β, etc.
Difficult to fit a model using a combination of high-dimensional numerical integration and optimization
Subsequent developments
Under-dispersion and over-dispersion
Causes of under-dispersion
Intentionally made uniform by human or machine action
Repulsive forces are at work between elements.
Example: Number of couples lined up along the riverbank at regular intervals
Summary of Lecture 1
Both can be handled by hierarchical Bayesian
Local parameters to express individual differences and non-uniformity
Global parameters such as γ, α, β Translated with www.DeepL.com/Translator (free version)

Lecture 2: Prior distributions expressing correlations

1 State Space Model
Introduction
The state space model is a unified review of time series analysis methods from a statistical modeling perspective
What is a state-space model?
State Space Model
Behind the time series data {yt}, t = 1, …, n, there is a sequence of states {xt}, t = 1, …, n, which are not directly observed.
Equations describing the time evolution of xt (system equations)
xt+1 = F(xt) + ηt
Equation expressing the generation of data yt (observation process) (observation equation)
yt = H(xt) + εt
H, F are arbitrary functions, system noise ηt and observation noise εt are independent random variables at each time point
Conditional probability distribution of the system model
p(xt+1|xt) = 1/√(2πδ²) · exp(−(xt+1 − F(xt))²/(2δ²)), t = 0, …, n−1
Conditional probability distribution of the observation model
p(yt|xt) = 1/√(2πσ²) · exp(−(yt − H(xt))²/(2σ²)), t = 0, …, n
Example: xt is a scalar (one component) and F(xt)=xt
System equation
xt+1=xt + ηt
Random walk equation described in “Overview of Random Walks, Algorithms, and Examples of Implementations
Local-level model
Small difference in x values at adjacent times t and t+1
Generalization and parameter estimation
Cauchy distribution, etc., may be employed as the distribution of the noise
Poisson or binomial distribution can be used to replace the observation equation part
Time series model version of the Generalized Linear Model (GLM)
yt ~ p(yt|g(xt))
By defining the set of states at multiple times as a new state, we can build a model in which “the conditional probability of xt+1 depends on xt and xt−1”
Lag Coordinates
Time delay coordinate
Representation of “smooth change in x
Second-order difference equation
xt+1 = 2xt – xt-1 +ηt
Can be written in first-order difference form
Local trend model
Arbitrary AR or ARMA models can be incorporated within the framework of a state space model
Parameters that determine the magnitude of the noise, such as σ² and δ², and parameters included in H and F, are estimated from the data
Let α, β denote these collectively.
The parameters enter the conditional probabilities as p(yt|xt; α), p(xt+1|xt; β)
Maximum Likelihood Approach
α,β to maximize the expression
Bayesian interpretation
Bayesian interpretation of state space models
Let state x={xt} be a local parameter
The part representing data generation from x
Bayesian equation
The Kalman filter computes the “linear minimum-variance estimator”
The Kalman filter is the Bayesian computation specialized to a linear model (both the system and observation equations are linear and all noise terms are independent and normally distributed); see the minimal sketch below
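A minimal Kalman filter sketch for the local-level model, the linear-Gaussian case in which the prediction/filtering recursion has a closed form; parameter values are assumptions.

```python
import numpy as np

# Local-level model:
#   x_t = x_{t-1} + eta_t, eta_t ~ N(0, tau²);  y_t = x_t + eps_t, eps_t ~ N(0, sigma²).
rng = np.random.default_rng(10)
T, tau2, sigma2 = 200, 0.05, 1.0
x_true = np.cumsum(rng.normal(0, np.sqrt(tau2), T))
y = x_true + rng.normal(0, np.sqrt(sigma2), T)

m, P = 0.0, 100.0                 # diffuse initial filter mean and variance
filt = np.empty(T)
for t in range(T):
    m_pred, P_pred = m, P + tau2              # prediction step: p(x_t | y_1:t-1)
    K = P_pred / (P_pred + sigma2)            # Kalman gain
    m = m_pred + K * (y[t] - m_pred)          # filtering step: p(x_t | y_1:t)
    P = (1 - K) * P_pred
    filt[t] = m

print("RMSE of filter mean:", round(np.sqrt(np.mean((filt - x_true)**2)), 3))
```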
Smoothing in one dimension, estimating functions and curves
State space models can also be used to analyze spatial data, such as observation points placed on a straight line
Extension to two-dimensional space
Introduction.
The idea of prior distribution to express correlation can be extended to express spatial structure
CAR Model
The same construction works in one spatial dimension as in time, but it does not carry over well to two dimensions
The approach is to use a model that specifies, for each point i, the conditional probability p(xi|{xj}j∈N(i)) given the values of x in the neighborhood N(i)
Sometimes this does not work.
The idea of “expression as a product of microscopic conditional probability density functions” is abandoned, and instead the logarithm of the prior density is given as a “sum of microscopic terms”
Markov Random Field (MRF)
When the defined distribution is multivariate normal
Gaussian Markov random field (GMRF)
CAR model defined on a square lattice
Gaussian Processes and Kernel Regression
Using a Gaussian process as a prior distribution
Computational complexity is determined by sample size and is dimension-independent, so it is advantageous in higher dimensions
General Markov random field models
Paper by Geman et al.
A Markov random field with discrete random variables (the Potts model of statistical physics) together with an auxiliary discrete variable called a line process is used as the prior distribution
MCMC is used to maximize the posterior distribution and obtain a MAP estimate
Problems
Hard to handle natural images well
3 Ill-posed Inverse Problems
Various examples of inverse problems
Many important ill-posed inverse problems involve time and space
Typical examples
Computed tomography (CT)
Exploring the Earth’s interior using seismic waves
Measuring the height of the mountains of the moon from shadows
Determine distance from binocular disparity
Reconstruction of object surfaces and recognition of object motion
Interpretation of “optical illusions”
Non-appropriate inverse problems and regularization
In many inverse problems, the solution is not uniquely determined by the observed data, or it varies wildly under small amounts of noise
Ill-posed
The term originated with initial value problems for partial differential equations
Requires some form of prior knowledge of the target to be estimated
Target Modeling
When y is the data and x is the target to be estimated, add a penalty term (regularization term) f(x) to the log-likelihood log p(y|x)
l(x) = logp(y|x) – λf(x)
λ: Strength of penalty (strength of regularization)
Penalized Maximum Likelihood Estimation
Ill-posed Inverse Problems and Hierarchical Bayes
To interpret in Bayesian terms, the prior distribution of x
p(x|λ) = exp(-λf(x)) / Z(λ)
If the penalty term f(x) is quadratic in the components of x, the prior distribution of x is Gaussian (see the regularization sketch below)
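A small sketch of an ill-posed inverse problem and penalized estimation: a smoothing (blur) operator is hard to invert directly because small noise is amplified, while a quadratic penalty λ‖Dx‖², i.e., a Gaussian prior on x, stabilizes the solution. The operator and all values are assumptions chosen for illustration.

```python
import numpy as np

# y = A x + noise with a smoothing (blur) operator A; naive inversion amplifies
# noise, while penalized least squares with lambda * ||D x||² stabilizes it.
rng = np.random.default_rng(11)
n = 100
t = np.linspace(0, 1, n)
x_true = np.sin(2 * np.pi * t) + (t > 0.5)          # signal with a jump

A = np.array([[np.exp(-((i - j) / 2.0) ** 2) for j in range(n)] for i in range(n)])
A /= A.sum(axis=1, keepdims=True)                   # row-normalized Gaussian blur
y = A @ x_true + rng.normal(0, 0.01, n)

D = np.eye(n) - np.eye(n, k=1)                      # first-difference penalty matrix
naive = np.linalg.solve(A, y)                       # blows up: A is ill-conditioned
lam = 1e-3
reg = np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ y)   # penalized least squares

print("max |naive solution|:", round(np.abs(naive).max(), 1))
print("max |regularized|   :", round(np.abs(reg).max(), 2))
print("RMSE regularized    :", round(np.sqrt(np.mean((reg - x_true) ** 2)), 3))
```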

Lecture 3: Outliers, Clustering, and Missing Measurements

Introduction.
Hierarchical Bayesian models can systematically absorb outliers, clustering based on probabilistic models, defects, etc.
1 Use of parameters that take discrete values
Introduction.
By introducing discrete parameters (latent variables, labels), problems such as outliers and cluster classification can be handled within the framework of a hierarchical Bayesian model.
Bayesian framework has a low barrier between discrete and continuous
Outlier models
Outlier Example
xi∈{0,1} is the “label” (or “discrete parameter”) that determines whether the result of observation i is an “outlier” or not
Conditional probability (likelihood function) of the data {yi}, i = 1, …, n
If xi = 0, yi is generated from the model p(y|γ) for non-outliers
If xi = 1, yi is generated from the “model for outliers” p1(y) (e.g., a uniform distribution with a large width)
If the prior distribution of γ is p(γ), the posterior probability is
p(γ,{xi}|{yi}) = Cp({yi}|γ,{xi})p({xi}|Q)p(γ)
Sampling by MCMC to obtain
Posterior probability that the i-th observation is an outlier
Estimated value of γ taking into account the effect of outliers
Finite mixture distribution model
Can be used not only for “outlier or not,” but also for modeling in general by dividing into two clusters
xi=0 if observation i belongs to one cluster, xi=1 if it belongs to the other cluster
The value determines whether the observed value yi is from distribution p(yi|γ0) or p(yi|γ1).
When divided into K clusters
p(yi|γk) is the distribution corresponding to each cluster
Posterior distribution
Finite mixture distribution model
Hidden Markov model
Data {yi} is time series
Latent variable behind {xi} is also time series
Prior distribution of {xi}
Hidden Markov model
A “Markov chain”: the prior distribution of the states, which are not directly observed, is determined by the transition probabilities p(xt+1|xt)
Applications
Speech Recognition
Represents the progression of a disease or the effectiveness of a treatment
A model that treats “regimes” such as “expansion” and “recession” as hidden states behind an economic time series
Markov switching model
The AR model that the data follow switches depending on the hidden state
Capable of modeling “series of events”
Natural Language Processing
N-gram model
DNA sequences (sequence alignment)
Future developments and issues
Hybrid of hidden Markov and state space models
Extension to include both discrete and continuous variables as state xi
Challenges
Complex modeling often has posterior distributions with many maxima (multimodal distributions)
Causes difficulties in MCMC calculations
There are MCMC algorithms that are robust to multi-modal distributions, but they have not been incorporated into MCMC tools such as JAGS.
Stan does not directly support sampling of discrete variables in the first place
Label-switching problems
2 Missing measurements
Introduction.
Treat “missing measurements” statistically as “states that are not observed,” rather than handling them ad hoc
Treatment in hierarchical Bayesian models
Random missing measurements
Examples of missing measurements
Model for data {(yi,zi)}
The problem of estimating the parameter γ
However, the value of zi is randomly missing
Behind {zi} is a “hidden state” called {xi}.
zi is generated from xi with the following probabilities
With probability q: zi = NA
NA: missing
With probability 1 − q: zi = xi
NA = {i | zi = NA} is the set of indices i that are missing.
The number of elements #NA of this set is m
Dirac delta function
A generalized function
Can be thought of as a probability density with an extremely sharp peak
e.g., the limit of a normal density with vanishingly small variance
Used to represent the constraint s = s′
The joint distribution of yi, zi, xi is viewed as the generative process γ → (yi, xi) → (yi, zi)
p(yi, zi, xi|γ) = p(yi, xi|γ) [(1 − q)·δ(zi − xi) + q·δ(zi = NA)]
The bracketed term expresses “zi = xi with probability 1 − q; zi = NA with probability q”
Let p(γ) be the prior distribution of γ
The joint distribution of {yi}, {zi}, {xi}, and γ is
Missing completely at random (MCAR)
Whether zi is missing does not depend on the data values
Models including “censoring”
When the condition for censoring is clearly known
Example: the value is missing whenever zi exceeds a known threshold ξ, and never missing otherwise
E.g., the measurement time is capped at one hour, at which point the measurement is terminated (see the likelihood sketch after this block)
zi = NA if xi > ξ; zi = xi if xi ≤ ξ
I(condition) = 1 if the condition is satisfied, 0 otherwise
p(yi, zi, xi|γ) = p(yi, xi|γ) [I(xi ≤ ξ)·δ(zi − xi) + I(xi > ξ)·δ(zi = NA)]; zi = xi if xi ≤ ξ
zi = NA if xi > ξ
Joint distribution
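A small sketch of a likelihood with a known censoring threshold ξ, as in the one-hour example above: observed values contribute the density, censored ones contribute the tail probability P(x > ξ). The distribution and values are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Right-censoring at a known threshold xi: uncensored values contribute the
# density, censored ones contribute log P(x > xi).
rng = np.random.default_rng(12)
mu_true, sd, xi = 55.0, 10.0, 60.0            # minutes; measurement capped at 60
x = rng.normal(mu_true, sd, 500)
observed = x[x <= xi]
n_censored = np.sum(x > xi)

def neg_loglik(mu):
    ll = norm.logpdf(observed, mu, sd).sum()             # uncensored part
    ll += n_censored * norm.logsf(xi, mu, sd)            # censored part: log P(x > xi)
    return -ll

mu_hat = minimize_scalar(neg_loglik, bounds=(0, 120), method="bounded").x
print("naive mean of observed values:", round(observed.mean(), 2))   # biased low
print("censoring-aware estimate     :", round(mu_hat, 2))            # close to 55
```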
Missing Measurements and Bayesian Modeling
Consider the state behind the missing values as a local parameter {xi}.
Modeling missingness is almost identical to hierarchical Bayesian modeling
It can be viewed either as a “prior distribution of parameters” or as the “distribution of the complete data before anything went missing”
Advantages of fusion
Advantages of integrating missing-data modeling and hierarchical Bayesian modeling
The treatment of missing values and the treatment of “hidden states” already present in the model can be combined into a single unified approach: “integrate over the variables that are not directly observed”
Common mathematical methods can be used.
Application of Gibbs Sampler
data augmentation
Points to consider
Behind Bayesian modeling is the idea of “balancing bias and variation
With missing data, priority is given to “correcting for bias”
Causal Inference
Rubin’s framework views “counterfactuals” in causal inference as a type of missing measurement
Correction of bias by intersection

Appendix A Predicted Distributions for Hierarchical Bayesian Models

Two types of predictive distributions
How “different parameters of interest” are reflected in different Hierarchical Bayesian predictive distributions
If you are interested in a local parameter x
“Interested in x” can be baked into a “prediction,” “assuming x takes the value estimated from the data, from which the future data z arises independently.”
Distribution when x is sampled from the posterior distribution and new data is generated using it
If we are interested in the global parameter γ
Ignore the current values of x as incidental
Distribution obtained when γ is sampled from the posterior distribution, x′ is newly generated from the conditional distribution given that γ, and the future data z are generated using x′
If you are interested in parameters such as “the slope of the line” in the regression model
Evaluating Predicted Distributions
Methods often used to evaluate performance
How to consider a large sample size limit
Asymptotic Theory
In the case of a hierarchical Bayesian model, whether asymptotic theory results in a good approximation when the sample size is increased
Example: Two limits in a model with group structure
Increase the number of groups and keep the members of each group constant
Keep the number of groups constant and increase the number of members in each group

Appendix B Proof that the Stein estimator improves the expected value of the squared error

General Considerations
Evaluating the cross term
Shrinking toward a given fixed value
Shrinking toward the mean value obtained from the data

Appendix C Empirical Bayesian Estimation when the prior distribution is an exponential family of distributions

Here is the question to consider
Intuitive meaning of equation (2)
Problems with empirical Bayesian methods when the generative model is not good
Derivation of the equation
