Machine Learning Professional Series – Nonparametric Bayesian Point Processes and the Mathematics of Statistical Machine Learning Reading Notes

Summary

Nonparametric Bayes is a branch of Bayesian statistics in which the form (and dimension) of the probability model is not fixed in advance; instead, the model is built from the data itself, so the estimated probability distribution adjusts flexibly and automatically to fit the data rather than assuming a particular form for the true generating distribution. Here we describe this nonparametric technique based on the Machine Learning Professional Series book “Nonparametric Bayes – The Mathematics of Point Processes and Statistical Machine Learning“.

This is a post-reading memo.

Machine Learning Professional Series – Nonparametric Bayesian Point Processes and the Mathematics of Statistical Machine Learning

“Now, open the door to infinite dimensions!
Clearly explains the basics of probability distributions and their application to time series data and sparse modeling. The book is kindly designed to provide a detailed explanation of the theoretical background of measure theory from the basics as well. Written by an up-and-coming ace researcher. A must for all Bayesians!”

Chapter 1: Basic Knowledge about Probability Distributions

1.1 Preparation for notation and basic mathematics
Set-related
Probability
Matrix-vector related
1.2 Bernoulli Distribution and Binomial Distribution
1.3 Poisson distribution
1.4 Multinomial Distribution
1.5 Beta Distribution
1.6 Dirichlet Distribution
1.7 Gamma and Inverse Gamma Distributions
1.8 Gaussian Distribution
1.9 Wishart Distribution
1.10 Student’s t distribution

Chapter 2 Probabilistic Generative Models and Learning

2.1 Probabilistic Generative Models and Notation
2.2 Graphical Models
2.3 Statistical Learning
A measure of the “closeness” of probabilistic models
Definition KL divergence
p*(x): the true generative model
Find p(x|Φ) that minimizes KL[p*(x) ∥ p(x|Φ)]
Optimization Problem
KL divergence formula using expectation value
The term 𝔼p*(x)[log p*(x)] does not depend on Φ, so it does not contribute to the optimization of p(x|Φ)
Final optimization equation
Cannot be solved as is because it contains p*(x)
Treat the data as samples from the true distribution and approximate the expected value (see the sketch below)
xi ~ p*(x)
𝔼p*(x)[log p(x|Φ)] ≈ (1/n) ∑i log p(xi|Φ)
Maximum likelihood estimation
The resulting solution is called ΦML
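To make the empirical approximation above concrete, here is a minimal sketch (my own illustration, assuming a univariate Gaussian model and NumPy; not code from the book) of computing ΦML by maximizing the average log likelihood:

```python
import numpy as np

# Toy data assumed to come from the unknown true distribution p*(x).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Model p(x | phi) = N(x | mu, sigma^2); phi = (mu, sigma^2).
# Maximizing (1/n) * sum_i log p(x_i | phi) over phi gives the ML estimate.
mu_ml = x.mean()                       # closed-form maximizer for the mean
sigma2_ml = ((x - mu_ml) ** 2).mean()  # closed-form maximizer for the variance

# Average log likelihood at the optimum, for reference.
avg_loglik = -0.5 * (np.log(2 * np.pi * sigma2_ml) + 1.0)
print(mu_ml, sigma2_ml, avg_loglik)
```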
ΦML in terms of the generative model
Generative process for parameter Φ
Assume that Φ˜p(Φ|η)
Generative probability is
Optimization problem
log p(Φ|η) acts as a regularization term
It is called maximum a posteriori estimation (MAP estimation)
The solution obtained by MAP estimation is written as ΦMAP
Bayes’ theorem (posterior distribution of Φ)
Likelihood of observed data p(x1:n|Φ)
Prior distribution of Φ p(Φ|η)
Redefine the optimization problem
Since p(x1:n|η) does not depend on the optimization of Φ
Estimation
Point estimation (estimate a single value of Φ)
Maximum likelihood estimation
MAP estimation
Estimation weighted by probability (keep the whole posterior over Φ rather than a single point)
Difficult to perform integral calculation analytically
Generate S samples in some way from the posterior distribution, and calculate the predictive distribution using the sample mean
Estimate using multiple Φ
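A minimal sketch (my own illustration, not from the book) of this Monte Carlo approximation p(x*|x1:n) ≈ (1/S) Σs p(x*|Φ(s)), assuming the S posterior samples are already available (e.g. from Gibbs sampling) and, purely for simplicity, a unit-variance Gaussian likelihood:

```python
import numpy as np

def predictive_density(x_star, phi_samples):
    """Approximate p(x*|x_1:n) by averaging the likelihood over S posterior samples.

    Assumes a unit-variance Gaussian likelihood p(x | phi) = N(x | phi, 1)
    purely for illustration.
    """
    phi = np.asarray(phi_samples)                       # shape (S,)
    lik = np.exp(-0.5 * (x_star - phi) ** 2) / np.sqrt(2 * np.pi)
    return lik.mean()

# Example usage with hypothetical posterior samples of the mean parameter.
samples = np.random.default_rng(1).normal(1.0, 0.1, size=500)
print(predictive_density(1.2, samples))
```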
2.4 Marginalization
Eliminating a specific variable from the joint distribution by integration
Example
Integrating x2 out of the joint distribution p(x1,x2,x3) of random variables x1,x2,x3
In Bayesian estimation, the likelihood p(x1:n|Φ) of the observed data x1:n is marginalized over the prior distribution p(Φ|η) (equation above); the result p(x1:n|η) is called the marginal likelihood (written out below).
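Written out explicitly (standard formulas in the notation used in these notes), the two kinds of marginalization are:

```latex
% Integrating x_2 out of the joint distribution:
p(x_1, x_3) = \int p(x_1, x_2, x_3)\, dx_2
% Marginal likelihood: the data likelihood marginalized over the prior on \Phi:
p(x_{1:n} \mid \eta) = \int p(x_{1:n} \mid \Phi)\, p(\Phi \mid \eta)\, d\Phi
```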
2.5 Gibbs Sampling
In Bayesian estimation, samples are generated from the posterior distribution, and the predictive distribution is constructed by the sample mean.
How to efficiently generate samples from the posterior distribution
Method of alternately sampling from the conditional distribution of each random variable, given the others, for a multivariate posterior distribution
Example
Assuming a generative model for data x1:n with random variables Φ, 𝛙, and μ
We want to find the posterior distribution p(Φ, Ψ, μ|x1:n)
It is unlikely that p(Φ, Ψ, μ|x1:n) will be some well-known probability distribution
In general p(Φ, Ψ, μ|x1:n) ≠ p(Φ|x1:n)p(Ψ|x1:n)p(μ|x1:n)
In Gibbs sampling, the conditional distribution for each random variable is used and the sampling proceeds as above.
To facilitate sampling, the conjugate prior distribution is often used as the prior distribution of the probability distribution.
By Bayes’ theorem, the posterior distribution becomes the above equation
Let the distribution followed by a random variable x be p(x|Θ)
Let the prior distribution of Θ be p(Θ)
When the prior and posterior distributions belong to the same distribution family
Sampling is easy
When calculating the conditional distribution for each variable, Bayes’ theorem reduces it to the joint distribution, and decomposing the joint distribution into a product of conditional distributions makes the calculation tractable.
Graphical models are useful
Conditional Independence
Conditional independence of three patterns
Tail-to-tail type
Head-to-tail type
Head-to-head type (the three patterns are written out below)
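For reference, the three conditional-independence patterns can be summarized as follows (standard graphical-model results, stated here since the book's equations are not reproduced in these notes):

```latex
% Tail-to-tail (a <- c -> b): a and b are conditionally independent given c
p(a, b \mid c) = p(a \mid c)\, p(b \mid c)
% Head-to-tail (a -> c -> b): again conditionally independent given c
p(a, b \mid c) = p(a \mid c)\, p(b \mid c)
% Head-to-head (a -> c <- b): a and b are independent a priori,
% but become dependent once c is observed
p(a, b) = p(a)\, p(b), \qquad p(a, b \mid c) \neq p(a \mid c)\, p(b \mid c)\ \text{in general}
```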

Chapter 3 Bayesian Estimation

3.1 Exchangeability and De Finetti’s Theorem
Definition: Exchangeability
When variables are exchangeable, changing the order of the variables does not change their joint probability
Theorem: de Finetti’s theorem
When random variables are exchangeable, the joint probability of any n of them can be expressed using a random variable Φ
When exchangeability is assumed for p(x1:n), it can be expressed in a form in which x1:n are conditionally independent given Φ
p(xi|Φ) represents the likelihood of the observed data, and p(Φ) represents the prior distribution.
3.2 Bayesian Estimation
Assuming that the likelihood of the observed data and the parameters constituting the likelihood are random variables and their prior distributions
Bayes’ theorem above can be used to calculate the posterior distribution
Posterior distribution of Φ given the observed data x1:n
Predictive distribution can be constructed from the posterior distribution
Estimation of the true source distribution p*(x)
3.3 Dirichlet-Multinomial Distribution Model
Bayesian Estimation in the Dirichlet-Multinomial Distribution Model
Assumptions
The generative model describes the faces that appear when a die is rolled n times.
Consider a die with K possible faces.
The probability of each face is given by π = (π1, π2, …, πK) with ∑k πk = 1
A biased die whose faces have different probabilities
zi represents the face shown by the i-th roll.
z2 = 6 means that the second roll showed a 6.
The probability of generating the set of faces zi (i = 1, …, n) from n rolls.
Assume a Dirichlet distribution as the prior on the face probabilities π.
Dirichlet-Multinomial Distribution Model
Graphical model
What is π when data z1:n is obtained?
Posterior distribution of π
In summary, it is a simple equation
What is p(π|z1:n,α)?
The predictive distribution is
Variant of Eq.
Part (B) is the empirical frequency of each face in z1:n, normalized to a probability
Part (C) is the mean of the Dirichlet prior distribution.
The predictive distribution mixes these two with mixing proportion given by part (A) (a sketch follows).
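A minimal sketch (my own illustration, with assumed variable names) of the resulting posterior and predictive distribution for the die example, using the standard Dirichlet-multinomial update nk + αk:

```python
import numpy as np

def dirichlet_multinomial_posterior(z, alpha, K):
    """Posterior Dir(alpha + counts) and predictive p(z_new = k | z_1:n) for a K-faced die.

    z     : array of observed faces in {0, ..., K-1}
    alpha : Dirichlet prior parameters, shape (K,)
    """
    counts = np.bincount(z, minlength=K)
    post_alpha = alpha + counts                 # posterior Dirichlet parameters
    predictive = post_alpha / post_alpha.sum()  # (n_k + alpha_k) / (n + sum_k alpha_k)
    return post_alpha, predictive

# Example: a biased 6-faced die with a symmetric prior.
rng = np.random.default_rng(0)
z = rng.choice(6, size=100, p=[0.3, 0.25, 0.2, 0.1, 0.1, 0.05])
print(dirichlet_multinomial_posterior(z, alpha=np.ones(6), K=6))
```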
3.4 Gamma-Gaussian Distribution Model
Introduction
Find the posterior distribution of the Gaussian distribution
Posterior distribution of the mean when the mean is a random variable and the variance is fixed
Posterior distribution of variance when mean is fixed and variance is a random variable
Posterior distribution of mean and variance when mean and variance are random variables
Likelihood of sample x1:n for a D-dimensional Gaussian distribution with mean µ and covariance matrix σ2I
3.4.1 When the mean (µ) is a random variable and the covariance matrix (σ2I) is fixed
Assume a prior distribution for µ
Graphical model
For a sample size n=1
Writing the expression up to proportionality in the variable of interest
The equation of Gaussian distribution is derived.
When the number of samples is more than 2
The equation is
The predictive distribution is
3.4.2 When the mean (µ) is fixed and the covariance matrix (σ2I) is a random variable
An inverse gamma distribution is assumed as the prior distribution of the random variable σ2.
The graphical model is
The posterior distribution of σ2 given a sample x1:n is
The final distribution is
The predictive distribution is
St: Student t-distribution
3.4.3 When both the mean (µ) and the covariance matrix (σ2I) are random variables
Introduce the precision parameter τ (τ = 1/σ2)
The priors for µ and τ are not assumed to be independent
Prior distribution for µ and τ
Prior distribution
The posterior distribution is again a product of Gaussian and gamma distributions
Graphical Model
Computation of the posterior distribution
Final Solution
Marginalization
τ
μ
Predictive distribution
3.5 Marginal Likelihood
Bayes’ Theorem
Marginal Likelihood
Equation of Integration
The Role of the Marginal Likelihood
The difficulty of calculating the posterior distribution depends on whether or not the marginal likelihood can be easily calculated.
The marginal likelihood also plays an important role as one of the indices to determine the values of the parameters of the prior distribution.
When the marginal likelihood can be calculated, the parameter η of the prior distribution can be obtained by finding η that maximizes p(x1:n|η).
General calculation method when the marginal likelihood can be calculated
Omitted (later)

Chapter 4 Clustering

4.1 k-means algorithm
Classify each data point into K pre-defined classes.
Let μk ∈ ℝd (k = 1, 2, …, K) be a point representing each class.
Each data point xi is classified into the class with the highest similarity to μk.
Define the distance between xi and μk, and classify the data points into classes with close distance (high similarity).
Use the square Euclidean distance as the distance measure
When a data point xi belongs to class k, we introduce the variable zi ∈ {1, 2, …, K} and express it as zi = k.
Since the class information of each data point is not given in advance, zi is called a latent variable or hidden variable.
Find z1:n = (z1, z2, …, zn) and μ1:K = (μ1, μ2, …, μK) so that the squared Euclidean distance between the mean vector of each class and the data points assigned to that class becomes small.
Formulated by optimization problem
Algorithm
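A minimal sketch of the k-means algorithm as formulated above, alternating nearest-mean assignment and mean updates (a generic implementation, not the book's code):

```python
import numpy as np

def kmeans(x, K, n_iter=100, seed=0):
    """Alternate between assigning each point to the nearest mean (z_i)
    and recomputing each class mean (mu_k)."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), K, replace=False)]               # initialize means from data
    for _ in range(n_iter):
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances, shape (n, K)
        z = d2.argmin(axis=1)                                  # class assignments
        for k in range(K):
            if np.any(z == k):
                mu[k] = x[z == k].mean(axis=0)                 # update class means
    return z, mu

# Example usage with three synthetic clusters.
data = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans(data, K=3)
print(centers)
```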
4.2 Clustering by Gibbs Sampling for Mixed Gaussian Models
Introduction
The objective function of the optimization problem can be transformed into the above equation
The final optimization problem can be rewritten as
N(xi|μzi, I) is the probability that xi is generated from a Gaussian distribution with mean μzi and covariance matrix I.
The above equation is the joint probability (likelihood) of x1:n given Z1:n and μ1:K.
The problem of maximizing the log likelihood
The K-means clustering method is based on
Among K Gaussian distributions N(xi|μk,I), select the distribution with the highest log likelihood for each data
Maximum likelihood estimation of µk given Z1:K
K-means clustering method is
Mixed Gaussian model with fixed variance
The mean parameter µ1:K and the class assignment z1:n are each estimated greedily to have the highest likelihood
Prone to local optimal solutions
Estimate variance and mean parameters and class assignments stochastically
4.2.1 The Fixed Variance Case
Gibbs sampling in the case of fixed variance
Assume that data x1:n is generated as above
Assume a Gaussian distribution for xi and a Gaussian prior distribution for μk
Assume that zi is generated from a multinomial (categorical) distribution
Graphical model
Data generation by Gibbs sampling
Key Points of Gibbs Sampling
Compute the Joint Distribution
Use graphical model, conditional independence, and Bayes theorem to turn the joint distribution into a product of conditional distributions.
Compute the conditional distribution from the product of the conditional distributions, leaving only the portion that is relevant to the random variable of interest.
Procedure
Calculate the joint distribution of all random variables based on the graphical model.
Equation of Joint Distribution
Compute the conditional distribution for zi
Summarize only the distributions that are related to zi
Final conditional distribution for zi
Calculate Normalization Constant
Condition
Compute Conditional Distribution for µk
Summarize only distributions related to µk
Calculation Continued
Condition
Conditional Distributions Required for Gibbs Sampling
Gibbs Sampling Algorithm for Mixed Gaussian Model with Fixed Variance
Difference from K-means cluster algorithm (above)
The K-means method selects the k that maximizes N(xi|μk, I) at each step
In the K-means method, μk is deterministically set to the class mean x̄k at each step.
In Gibbs sampling, k is chosen in proportion to N(xi|μk, I) at each step.
In Gibbs sampling for the mixed Gaussian model, μk is sampled at each step from a Gaussian distribution whose mean is nk/(nk+1) · x̄k.
Stochastic noise is thus introduced (see the sketch that follows).
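A minimal sketch contrasting with k-means: assignments zi are sampled in proportion to N(xi|μk, I) and each μk is sampled from its Gaussian conditional. It assumes unit-variance Gaussians, N(0, I) priors on the means, and uniform mixing weights, which is my own simplification of the model in the notes:

```python
import numpy as np

def gibbs_gmm_fixed_variance(x, K, n_iter=200, seed=0):
    """Gibbs sampler for a mixture of unit-variance Gaussians with N(0, I) priors on the means."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    z = rng.integers(K, size=n)
    mu = rng.normal(size=(K, d))
    for _ in range(n_iter):
        # Sample z_i in proportion to N(x_i | mu_k, I) (uniform mixing weights assumed).
        logp = -0.5 * ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=pi) for pi in p])
        # Sample mu_k from its Gaussian conditional: mean n_k*xbar_k/(n_k+1), cov I/(n_k+1).
        for k in range(K):
            nk = np.sum(z == k)
            xbar = x[z == k].mean(axis=0) if nk > 0 else np.zeros(d)
            mu[k] = rng.normal(nk * xbar / (nk + 1.0), 1.0 / np.sqrt(nk + 1.0))
    return z, mu
```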
4.2.2 When the variance is also a random variable
Bayesian estimation with variance as a random variable
Assumptions
Assume that the data x1:n is generated by the above equation
To estimate the parameter π of the multinomial distribution in Bayesian terms, we assume a K-dimensional Dirichlet distribution as the prior for π.
Assume that the parameters of the Dirichlet distribution are all equal: α = (α1, α2, …, αK).
Graphical model
Calculation
Compute samples from the posterior distribution p(z1:n,µ1:K,τ,π|µ0,p0,a0,b0,α) using Gibbs sampling
Compute the joint distribution of all random variables based on the graphical model.
Compute Conditional Distributions for Zi
Summarize only distributions related to Zi.
Conditional Distributions
Compute Normalization Constants
Conditional
Compute Conditional Distribution for µk
Summarize only distributions related to µk
Calculation continued
Conditional Distribution Required for Gibbs Sampling
Compute Conditional Distribution for Τ
Summarize only distributions related to Τ
Calculation continued
Conditional Distributions Required for Gibbs Sampling
Compute Conditional Distribution for π
Cut out only distributions related to Π
Calculation continued
Conditional Distribution Required for Gibbs Sampling
Gibbs Sampling Algorithm for the Mixed Gaussian Model with Mean and Variance as Random Variables
By looking at the sampling history of each zi as a histogram, we can select the most frequent class among them.
By looking at the histogram, we can analyze the stability of the clustering of the data.
4.3 Clustering by Marginalized Gibbs Sampling for Mixed Gaussian Models
If we just want to do clustering, we only need the sampling results for z1:n; the sampling results for μ, τ, and π are unnecessary.
Sampling only z1:n
In Gibbs sampling, marginalizing out specific random variables reduces the number of random variables to be sampled
Marginalize out μ, τ, and π and sample only z1:n
Basic approach
Reduce the problem to the joint distribution
Decompose into products using Bayes’ theorem, and keep only the part related to zi in the calculation
Introduce marginalization (integrating variables out)
Introduce the conditional distribution for zi
When decomposing the joint distribution, remove μ, τ, and π from the graphical model and consider the resulting dependencies.
Distributions for x and z can be decomposed into product distributions for xi and zi, respectively
Example
p(z1:n|π) can be formulated as above from conditional independence
As for p(z1:n|α), it cannot be decomposed into a product over i, because the product of the distributions appears inside the integral over π
Omitted in the middle
Final result
Marginalized Gibbs Sampling Algorithm for the Mixed Gaussian Model

Chapter 5 Opening the Door to Infinite Dimension: An Introduction to Nonparametric Bayesian Models and Its Application to Clustering

Introduction
Consider the extension of the Dirichlet distribution to infinite dimensions as an introduction to nonparametric Bayesian models.
5.1 Considering the Dirichlet Distribution in Infinite Dimensions
This section describes the Dirichlet process mixture model, which plays a central role in nonparametric Bayesian models.
Extending finite mixture models to infinite dimensions
Why infinite dimensions?
Application of Dirichlet process mixture model to clustering
If the number of classes is not properly determined, even simple clustering cannot be performed.
In real world problems, it is difficult to determine the number of dimensions in advance
If the data changes dynamically, the number of classes K may also need to change dynamically
First, we assume finite dimension, and consider the Dirichlet distribution when K→∞ for the final result.
First, we assume the above equation
By setting Dir(π|α/K), we can expand to infinity.
Two properties of Dir(π|α/K)
Since αk takes the same value for every k, the prior does not distinguish between the components k.
As the dimension K increases, the parameter αk of the Dirichlet distribution becomes smaller
Sample example from Dirichlet distribution for varying α=(α1, α2, α3)
As K becomes larger (αk becomes smaller), points are distributed near the vertices of the triangle.
π is generated such that one particular element πk takes a large value while the other elements are small.
Marginalized Gibbs sampling
Does not contain π
Let the set of values from {1, 2, …, K} that appear in z1:n excluding zi be defined as in the above equation.
Taking the limit of K→∞, we get
A value (class) that has already appeared is sampled with the probability in the above equation
Any other (new) value is sampled with the probability in the above equation (a sketch of this step follows below).
The last one is unknown (again).
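In the K → ∞ limit the assignment rule becomes Chinese-restaurant-like: an existing class is chosen with probability proportional to its count and a new class with probability proportional to α. A minimal sketch of that single sampling step (the function name and data structure are my own):

```python
import numpy as np

def sample_assignment(counts, alpha, rng):
    """Sample a class for one data point under the K -> infinity Dirichlet limit.

    counts : dict {class_id: n_k} built from z_1:n excluding the current point
    alpha  : concentration parameter
    """
    classes = list(counts)
    weights = np.array([counts[k] for k in classes] + [alpha], dtype=float)
    weights /= weights.sum()
    idx = rng.choice(len(weights), p=weights)
    if idx == len(classes):                 # a brand-new class is created
        return max(classes, default=-1) + 1
    return classes[idx]

rng = np.random.default_rng(0)
print(sample_assignment({0: 5, 1: 2}, alpha=1.0, rng=rng))
```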
5.2 Infinite Mixture Gaussian Model
A model that extends the Dirichlet distribution to infinite dimensions
Marginalized Gibbs sampling allows only the latent variables to be sampled.
Each time a new class is sampled, it becomes a candidate class for subsequent sampling.
Mean and variance sampling
The covariance matrix of the Gaussian distribution that generates the data is simplified to the isotropic form τ−1I.
How to estimate the covariance matrix more rigorously.
Explained in “Easy to Understand Pattern Recognition”.
5.3 Infinite Dimensionality of the Dirichlet Distribution from the Viewpoint of the Marginal Likelihood
On the marginal likelihood when K → ∞
Summary
5.4 Probability Model of Segmentation
The partition probability above was proposed, independently of the Dirichlet-multinomial model, as the Chinese restaurant process (CRP).
Example of CRP in action
5.5 Dirichlet Process
Explanation of the Dirichlet distribution behind the CRP
De Finetti’s theorem reveals the existence of a G behind the CRP and of the p(G) that generates it
This p(G) is called the Dirichlet process.
5.6 Estimation of the concentration parameter α
5.7 Other Topics
Besides the CRP, the Dirichlet process also has a stick-breaking process (SBP) representation.
Variational Bayes algorithms have been proposed in addition to sampling algorithms.
Variational Bayes is deterministic and fast
Variational Bayes requires an upper bound on the number of dimensions
Hierarchical Dirichlet processes are used for hidden Markov models

Chapter 6 Applications to Structural Change Estimation

6.1 Structural Change Estimation Using Statistical Models
Structural change in data is one of the most common problems in analyzing time series data.
The problem of analyzing changes in the properties of data is widely studied in the field of change detection.
When considering changes in data, it is not possible to predict how many changes there will be in the data.
Examples of structural changes in data
6.2 Structural Change Estimation by Infinite Mixed Linear Regression Model Based on Dirichlet Process
Structural Change Estimation by Infinite Mixed Linear Regression Models
6.3 Gibbs sampling in infinite mixed linear regression models based on Dirichlet processes
Gibbs Sampling for Infinite Mixed Linear Regression Models
6.4 Experimental Examples
Experimental example with artificial data
Plot of artificial data
Results

Chapter 7: Applications to Factor Analysis and Sparse Modeling

Introduction
About Nonparametric Bayesian Models in Sparse Modeling
Explanation of Beta Processes
7.1 Factor Analysis
Factor analysis is a technique that assumes the observed data are composed of a combination of hidden factors and analyzes those individual factors.
The observed data yi ∈ ℝD (i = 1, …, N) can be expressed as in the above equation using zi,k ∈ {0, 1} and xk ∈ ℝD (k = 1, …, K).
zi,k = 1 means that observation i has factor k.
The information characterizing the factor k is represented by xk
Y = ZX + E
An example of factor analysis where Z has only 0 and 1 components
Matrix factorization can be done in various ways by placing constraints on the matrix to be decomposed.
Singular value decomposition is based on the constraint of orthogonality
Real numbers are taken as elements
Non-negative matrix factorization places the constraint that the matrix elements are non-negative
In the example above, one of the matrices Z has only 0 and 1 components.
If there are many zero components, it is called a sparse matrix.
By using a prior distribution that assumes infinite dimension K for the columns of Z, we can estimate the number K+ of factors that represent the observed data; this chapter describes such a nonparametric Bayesian model.
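A small numerical illustration of the decomposition Y = ZX + E with a binary matrix Z (made-up sizes and random data, only to show the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 6, 3, 4                     # observations, latent factors, observed dimension
Z = rng.integers(0, 2, size=(N, K))   # binary matrix: Z[i, k] = 1 means item i has factor k
X = rng.normal(size=(K, D))           # each row x_k characterizes factor k
E = 0.1 * rng.normal(size=(N, D))     # observation noise
Y = Z @ X + E                         # observed data
print(Y.shape)                        # (6, 4)
```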
7.2 Generative Model for Infinite Dimensional Binary Matrices
Consider an infinite dimensional prior distribution for Z
Estimate the number of dimensions a posteriori, from the posterior distribution based on the data
Ideas on how to generate an infinite dimensional binary matrix (binary matrix without fixed K)
Consider based on exchangeability in binary matrix
Given the staircase matrix on the right, we can consider a generative model in which the number of columns increases as the number of rows increases.
Assume the above equation as the generating process of Zi,k
Beta-Bernoulli distribution model
Posterior distribution p(π1:K|z,α)
Continued
mk
Calculation continued
Result
Formula
If K→∞, then
To summarize
Infinite dimensional binary matrix generation process
Indian buffet process (IBP)
Example of binary matrix generation by IBP
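A minimal sketch of IBP matrix generation in its standard culinary description (customer i takes an existing dish k with probability mk / i and then tries Poisson(α / i) new dishes); this is my own illustration rather than the book's algorithm:

```python
import numpy as np

def indian_buffet_process(n_customers, alpha, seed=0):
    """Generate a binary matrix Z whose number of columns (dishes) is not fixed in advance."""
    rng = np.random.default_rng(seed)
    dishes = []                                   # list of columns, each a list of 0/1 entries
    for i in range(1, n_customers + 1):
        for col in dishes:                        # existing dish k taken with prob m_k / i
            m_k = sum(col)
            col.append(int(rng.random() < m_k / i))
        for _ in range(rng.poisson(alpha / i)):   # new dishes tried by customer i
            dishes.append([0] * (i - 1) + [1])
    if not dishes:
        return np.zeros((n_customers, 0), dtype=int)
    return np.array(dishes, dtype=int).T          # rows = customers, columns = dishes

print(indian_buffet_process(10, alpha=2.0))
```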
7.3 Generative Model and Exchangeability of the Infinite Dimensional Binary Matrix from the Viewpoint of the Marginal Likelihood
Analyzing the marginal likelihood of the beta-Bernoulli distribution model as K→∞
The generative process of zi,k in the beta-Bernoulli distribution model is given by the above equation.
In the beta-Bernoulli distribution model, the zi,k are generated independently.
All matrices that match by swapping columns have the same probability.
In the finite case, the marginal likelihood is the above equation.
If we look at the binary matrix for each latent feature k, which is a column, we can see it as an N-dimensional binary vector
An N-dimensional binary vector can be taken as a binary vector of 2N types
The possible binary vectors (histories) and the number of columns with each history
Omitted in the middle (don’t know)
Equation of marginal likelihood
Exchanging the order of zi and zj does not change the values of {Kh} and {mk}
Just as the CRP corresponds to the Dirichlet process, the IBP corresponds to a beta process
7.4 Infinite Latent Feature Model
Infinite latent feature model
Generative process
Derivation of the Gibbs sampling equation

Chapter 8 Foundations of Measure Theory

8.1 Measurable spaces, measure spaces, and probability spaces
Examples
Assumptions
Consider a trial of rolling a die once.
Let Ω = {1,2,3,4,5,6} be the set of possible outcomes.
An outcome ω ∈ Ω is called a sample
The set Ω is called the sample space
A subset A of Ω is called an event
Example: A={2,4,6} is an event whose outcome is an even number
A set F of subsets is called a family of events
Example
If Ω={1,2,3,4,5,6}, then {1}∈F, {2}∈F, … {6}∈F and so on. The set of only one sample belongs to F.
{1,3,5} ∈ F for odd numbers, {2,4,6} ∈ F for even numbers, and the sample space Ω also belongs to F
The probability P is calculated by counting the number of elements in the subset, P(A)=|A|/|Ω| etc. (where |A| is the number of elements in A)
Example
The probability that a single roll of the die gives an even number:
P(A)=|{2,4,6}|/|{1,2,3,4,5,6}|=3/6=1/2
The world of real numbers
What is the probability that a value falls into the interval (0,0.5) when a real number is taken at random from the interval (0,1)?
Counting the number of real numbers does not work, since it is infinite.
Focus on the length of the interval, not the number of cases in the interval.
0.5/1=0.5
What to do in 2 or 3 dimensions?
Area? Volume?
A measure quantifies the size of sets in a space, generalizing the properties of counting points, length, and area.
Probability based on measure
Measure theory is the field that considers “measuring” mathematically.
What are the things (sets) that can be measured?
What properties should the value (measure) have as a result of measurement?
Definition (σ-additive family, measurable set, measurable space)
Definition (measure, measure space, finite measure, σ-finite measure)
What properties should a measure have for a measurable set?
Example
Measurable space([0,+∞], F)
The measure μ([a, b]) = b − a that assigns to each interval its length
It is not a finite measure, because μ([0, +∞]) = ∞
Taking An = [n, n+1], μ(An) = 1 < ∞ for every n, so it is a σ-finite measure.
Definition (probability measure, probability space, sample space, event, family of events)
We can create a probability measure by using a finite measure µ and setting P(A)=µ(A)/µ(Ω).
8.2 Measurable Functions and Random Variables
If a function f(x) on the real numbers is continuous on the interval [a, b], then the Riemann integral ∫_a^b f(x) dx is meaningfully defined.
A measurable function is, roughly, “a function f on a measure space (Ω, F, μ) for which, given a measurable set A ∈ F (A ⊂ Ω), the integral ∫_A f(ω) μ(dω) returns a real number”.
The reason for µ(dω) instead of µ(ω)dω is that
Because a measure measures sets (intervals), μ(dω) represents the measure of the infinitesimal set dω ⊂ Ω.
If we regard Ω as the time axis, μ(dω) represents the measure of the small interval dω (= [ω, ω + dω]) on the time axis.
A concrete example
The Lebesgue integral using the Lebesgue measure
Specific definition of a measurable function
In order to make the explanation of random variables easier to understand, we use half-open intervals, but there is no need to limit ourselves to that.
If we use a probability measure as the measure, the above equation means that we are calculating an expected value.
A measurable function is a class of functions that can calculate the expected value.
Definition: (real-valued) random variable
A random variable X satisfies the above equation for any real number α<β
A random variable X is a function that can calculate the probability of having a value in an arbitrary range [α,β)
The Borel family of sets of a topological space S consists of all sets obtained from the intervals (half-open, closed, etc.) by taking unions, intersections, and complements finitely or countably infinitely many times.
Elements of the Borel family are called Borel sets
Example: ℝ and ℝD as S
Using the Borel set family, the definition of a measurable function can be rephrased as above
For any Borel set B ∈ B(S), the random variable
We can say that a random variable is a function for which the probability P({ω ∈ Ω | X(ω) ∈ B}) of taking a value in B ∈ B(S) can be computed.
General definition of a random variable
Image of a Borel set
S=ℝ
Half-open interval B = [α, β)
and replace it with
8.3 Simple functions, nonnegative measurable functions, and the monotone convergence theorem
Definition: Simple function
Let the indicator (defining) function of a set A be the above equation.
A simple function is a step (staircase) function
If the measure of Ai is µ(Ai), the integral of a simple function is given by the above equation.
Theorems used to extend the integral from simple functions to general measurable functions (first analyze with simple functions, then pass to the limit).
Theorem: Approximation theorem by simple functions
Theorem: Monotone convergence theorem
In the monotone convergence theorem, taking fn = 𝜑n, the integral of a non-negative measurable function can be defined as above using the integrals of simple functions
8.4 Distribution of Random Variables (Probability Distribution)
The distribution of a random variable
Let X:Ω↦X be a random variable on the probability space (Ω,F,P).
For any Borel set B ∈ B(X), we define the above equation as
PX is a probability measure on the measurable space (X, B(X)), called the distribution of X
Or we can say that the random variable X follows the distribution PX
Example: A random variable X on a probability space (Ω, F, P) follows a one-dimensional Gaussian distribution with mean 0 and variance 1
For any Borel set B ∈ B(ℝ), e.g. B = [−1, 2]
It can be calculated as above
For any Borel set B ∈ B(X), when we can compute P({ω ∈ Ω | X(ω) ∈ B}) = PX(B) = ∫_B p(x) dx
8.5 Expected value
Assumptions
Let X:Ω⟼X be a random variable on (Ω,F,P).
Let Φ(x) be a Borel measurable function on X.
The expected value of Φ(x) is defined as above
Theorem of expectation calculation for random variables
If we specify a random variable and the distribution it follows (Gaussian distribution, gamma distribution, etc.), we can calculate expected values without considering the probability space behind it.
8.6 Laplace Transform of Probability Distribution
Assumptions
Let P be the probability distribution followed by the random variable X:Ω⟼X.
P has a probability density function p(x)
For t ∈ ℝ, the function defined above is called the Laplace transform of the probability distribution / probability density.
Example
In the case of the Poisson distribution, the above equation is obtained from X ~ Po(λ): P(n) = (λ^n / n!) e^{−λ}
In the case of the gamma distribution, the above equation is obtained from X ~ Ga(a, b): p(x) = (b^a / Γ(a)) x^{a−1} e^{−bx}
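Worked out explicitly (standard results, consistent with the densities quoted above):

```latex
% Poisson distribution, X ~ Po(\lambda), P(n) = \frac{\lambda^n}{n!} e^{-\lambda}:
\mathbb{E}[e^{-tX}] = \sum_{n=0}^{\infty} e^{-tn} \frac{\lambda^n}{n!} e^{-\lambda}
                    = \exp\!\bigl(\lambda (e^{-t} - 1)\bigr)
% Gamma distribution, X ~ Ga(a, b), p(x) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}:
\mathbb{E}[e^{-tX}] = \int_0^{\infty} e^{-tx} \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}\, dx
                    = \left(\frac{b}{b + t}\right)^{a}
```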
8.7 Propositions that hold with “probability 1”
Let (Ω, F, P) be a general probability space.
If the set {ω ∈ Ω | ¬prop.(ω)} of ω ∈ Ω is a measurable set and its probability satisfies the above equation (i.e. it is zero)
then the proposition prop.(ω) is said to hold with probability 1.
Example
For random variables X and Y, “X and Y are equal with probability 1” means
f=g with probability 1
When f=g with probability 1, the expected value is equal
8.8 Random measure
Let (Ω,F,P) be a probability space
Let (X,S) be a measurable space
Introduce the map M: Ω × S ⟼ [0, +∞] (i.e., M(ω, A) ∈ [0, +∞] for ω ∈ Ω, A ∈ S)
Assume that the map M has the above properties.
Random measure
Two types of integrals for random measures
1
2
Expectation measure
The analogue, for random measures, of the usual concept of expectation for random variables
Variant
8.9 Laplace Functional of Random Measures
Consider the concept corresponding to the Laplace transform for random measures
Laplace functional of a random measure M
Monotone convergence theorem for Laplace functionals
Independence of random measures

Chapter 9 Nonparametric Bayesian Models from the Perspective of Point Processes

9.1 What is a Point Process?
The stochastic processes that make up a nonparametric Bayesian model are
Point process
A statistical model of a set of “points” abstracting discrete events and “some quantity” of each point
It is useful to analyze the stochastic mechanism of the arrangement of “points” on a time axis, a plane, or a general space.
A point process is a statistical model of “points” and “bars”.
Example
Location of traffic accidents at a given point
We want to predict how it will be in the future based on the current data.
Consider a variable N(A) that outputs the number of points in an area A.
Formulation of N(A)
The reason for the infinite sum is to include the points that may occur in the future.
Add up the bars of length 1 attached to each point (assuming that the bars (weights) also have stochastic properties)
Marked point process
9.2 Poisson process
A point process on a time axis (1D)
Let T denote the time axis.
A set of random variables {Xt} defined on a probability space (Ω, F, P) is called a stochastic process.
When the sample ω is fixed, a function with only time t as a variable is obtained.
This function is called the sample function, sample path, etc.
Example: Time variation of cat’s footprint
Let Ω be a set of animals
Consider the probability space (Ω, F, P) describing which animal probabilistically appears in the yard
Let ω ∈ Ω be a cat
Let Xt(ω) denote the number of its footprints
ω is often omitted from the notation.
A sample function is said to be continuous if it has the above equation.
Definition: Additive process
Introduce a non-negative function λ:T⟼[0,∞) on T called the intensity function
When a counting process (Nt)t∈T has the above properties, it is called a Poisson process.
Example of a Poisson process
The degree of occurrence of cat visits depends on the intensity function λ(t) on the time axis
The larger the value of the intensity function λ([s,t]) in a certain interval [s,t], the higher the probability of the occurrence of footprints
By estimating the intensity function λ, we can estimate the number of cat visits that will occur in the future
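A minimal sketch of simulating such a process on a time interval by thinning, which makes the role of the intensity function λ(t) concrete (the intensity used here is an arbitrary example, not one from the book):

```python
import numpy as np

def simulate_poisson_process(lam, lam_max, t_end, seed=0):
    """Simulate event times on [0, t_end] with intensity lam(t) by thinning.

    lam     : intensity function, assumed to satisfy lam(t) <= lam_max for all t
    lam_max : upper bound used for the homogeneous proposal process
    """
    rng = np.random.default_rng(seed)
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)       # next candidate from the bounding process
        if t > t_end:
            break
        if rng.random() < lam(t) / lam_max:       # accept with probability lam(t) / lam_max
            events.append(t)
    return np.array(events)

# Example: cat visits whose intensity oscillates over the day.
times = simulate_poisson_process(lambda t: 2.0 + np.sin(t), lam_max=3.0, t_end=10.0)
print(len(times), times[:5])
```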
Considerations for generalization
N([s, t]) was defined with ω fixed; we now also let ω vary
Generalize the interval [s, t] to a general subset of T
T can also be extended to two or three dimensions instead of one
Definition: Poisson process and random Poisson measure
Intuitive explanation of Poisson random measure
9.3 Laplace Functional of the Poisson Random Measure
Probability distributions can be expressed differently by Laplace transform
Theorem: Laplace Transform of Poisson Random Measure
9.4 Gamma Processes
Adding Weight wi to a Point Process
Weight wi=1 for Poisson process
Definition: Gamma process, gamma random measure
Explanation of gamma random measure
Weighted gamma process
9.5 Laplace Functional of the Gamma Random Measure
Laplace functional of the gamma random measure
9.6 Discreteness of Gamma Random Measure
The gamma random measure can be expressed as a discrete measure
9.7 Normalized gamma process
9.8 Dirichlet Processes
Definition: Dirichlet process and Dirichlet random measure
Explanation of Dirichlet random measure
Theorem
9.9 Completely random measures
Definition: Completely random measure
Theorem: Laplace functional of a completely random measure
Explanation of the beta random measure
Theorem: Lévy–Itô decomposition of a completely random measure
Summary of point processes
