Summary
A topic model is a probabilistic generative model for extracting latent topics from a set of documents and understanding their content. With a topic model it is possible to estimate which topics a given document covers, and when applied to large-scale text analysis, to grasp which topics appear in large numbers of news articles and blog posts and what trends they show.
The simplest such models, and the base of the topic model, are the "unigram model" and the "mixed unigram model". These have been extended to "Probabilistic Latent Semantic Analysis (PLSA)" and "Latent Dirichlet Allocation (LDA)", and further to nonparametric models such as the Chinese Restaurant Process (CRP), the Stick-Breaking Process (SBP), and the Hierarchical Dirichlet Process (HDP).
This section describes various theories (models) and applications of topic models based on the “Machine Learning Professional Series: Topic Models“.
This article presents notes taken after reading the book.
Machine Learning Professional Series: Topic Models Post-Reading Notes
Chapter 1 Fundamentals of Probability
1.1 Probability
1.1.1 Probability and Probability Distribution
1.1.2 Joint probability
1.1.3 Marginalization
1.1.4 Bayes’ Theorem
1.1.5 Independence
1.1.6 Calculation Examples
1.1.7 Expected value
Expected value
Mean
Mode
Variance
Covariance
1.1.8 Kullback-Leibler divergence
1.1.9 Continuous random variables
1.1.10 Jensen’s inequality
Figure
1.1.11 Method of Lagrange multipliers
Used to estimate the parameters of a stochastic model.
Can be used to solve constrained maximization problems
The problem of maximizing a function f(Φ) with respect to Φ under the I constraints gi(Φ)=0, i=1,…,I (a worked form is sketched below)
1.2 Probability distribution
1.2.1 Bernoulli Distribution
Coin toss (probability distribution of a variable taking two values x ∈ {0,1})
When tossing a coin N times, the distribution is binomial
1.2.2 Categorical distribution
A categorical distribution is the probability distribution of a variable that takes one value from multiple discrete values {1,2,…,V}.
Example: Dice
The result of doing the above N times is called multinomial distribution.
1.2.3 Beta distribution
The parameter Φ of Bernoulli distribution or binomial distribution satisfies 0 ≤ Φ ≤ 1.
The distribution for generating such a parameter is
where
Properties of the beta distribution
1.2.4 Dirichlet distribution
Probability distribution of the parameter Φ of a categorical or multinomial distribution
Properties of the Dirichlet Distribution
Chapter 2 Unigram Model
2.1 Document Representation
In the topic model, a document is represented as a multiset (bag) of the words that appear in it.
If the same word appears twice, it is treated as two separate words.
BOW (bag-of-words)
Word order information is lost
Processing
Morphological analysis
Stop-word removal
Convert inflected words to their base form
Symbols used in this document
2.2 Unigram Model
A topic model is a probabilistic model for generating a set of BOW-represented documents.
The simplest generative model is the “unigram model”.
All words are generated according to a single categorical distribution
Φ=(Φ1,…,ΦV)
Φv is the probability of occurrence of word v
Example of generating a document set in the unigram model
The probability of a document set W given a parameter Φ
Description
2.3 Maximum Likelihood Estimation
Given a document set W, it is necessary to estimate the unknown parameter Φ of the unigram model
Use “maximum likelihood estimation”, which is a basic estimation method for probability models
Estimate the parameters so that the likelihood is the highest
Likelihood expresses how plausible it is that the observed data were generated from a given probability model
The likelihood is the conditional probability p(W|Φ) viewed as a function of Φ
Let the parameters of the probability model be Φ
the data is W
Since logarithm is a monotonically increasing function, even if we take the logarithm, the parameters that maximize the likelihood remain the same
Function to be maximized
The maximum is attained at the point where the derivative becomes zero
Maximum likelihood estimate
This works when the amount of data is large relative to the number of unknown parameters, but problems arise when the data are few
2.4 Maximum posterior probability estimation
Overfitting occurs when the amount of data is small
Example of the difference between maximum likelihood estimation and MAP estimation
Methods for generalization
Maximum a posteriori (MAP) estimation
Find the parameter that maximizes the posterior probability, which is the probability of the parameter Φ after observing the data W
The posterior probability is calculated using Bayes’ theorem.
MAP estimation requires a prior probability p(Φ|β)
Supports any probability distribution, but
Conjugate prior distribution is easy to calculate
Conjugate prior distribution
A prior distribution for which the posterior distribution has the same functional form as the prior
In the case of exponential family of distributions, a conjugate prior distribution exists.
Normal distribution
Poisson distribution
Categorical distribution
Conjugate prior is Dirichlet distribution
Multinomial distribution
A Dirichlet distribution with parameters (β,…,β) is used as the prior distribution
Optimization problem in MAP estimation
Transformation of equation with Dirichlet distribution
Omit the terms irrelevant to the Φ being maximized (from the second to the third line)
Finding the Φ that maximizes this with the method of Lagrange multipliers gives the MAP estimate in the above equation
To prevent the probability from becoming negative for vocabulary with no observations, the parameter of the Dirichlet distribution must be set to β ≥ 1.
When β=2, it is called Laplace smoothing.
Example
Set the value of the parameter of the prior distribution to β=2
The MAP estimate after observing the face 3 once is Φ=(1/7, 1/7, 2/7, 1/7, 1/7, 1/7)
Compared to the maximum likelihood estimation, it is an estimate that is not extremely dependent on observations.
Example when the number of observations becomes large
When Nv is large compared to β, the maximum likelihood estimate and the MAP estimate become close
2.5 Bayesian Estimation
Maximum likelihood estimation and MAP estimation seek the best parameter for each criterion.
Bayesian estimation seeks the distribution of the parameters
By obtaining the distribution, the certainty of the estimated parameters is expressed.
It also alleviates the problem of overfitting
Example: Problems with MAP estimation using dice
Assume that the value of the parameter of the Dirichlet prior distribution is β=2.
When a single roll of the die shows a 3 (Experiment 1)
The MAP estimate is Φ=(1/7, 1/7, 2/7, 1/7, 1/7, 1/7)
When the die is rolled 64 times, showing 19 "3"s and 9 of each other face (Experiment 2)
The MAP estimate is again Φ=(1/7, 1/7, 2/7, 1/7, 1/7, 1/7)
Using Bayes’ theorem, estimate the posterior distribution of the parameters Φ after observing the data W
Posterior distribution of parameters in the unigram model
The first equation is Bayes’ theorem
The third equation uses the normalization term of the Dirichlet distribution
Both the posterior and prior distributions have the same Dirichlet distribution.
Posterior distributions of Experiment 1 and Experiment 2
In Experiment 1, p(Φ|W_exp1) = Dirichlet(Φ|2,2,3,2,2,2)
In Experiment 2, p(Φ|W_exp2) = Dirichlet(Φ|11,11,21,11,11,11)
Estimation becomes more reliable
Relationship between Generative Models and Bayesian Estimation
The probability model p(W|Φ) is
Given the model parameters Φ, it specifies the probability that the data W are generated
Probability p(Φ|W) of the model parameters given the data W
Posterior distribution
MAP estimation is not invariant to parameter transformation
Bayesian estimation is invariant to parameter transformation
Example where the parameter with the highest posterior probability, which is the MAP estimate, does not represent the characteristics of the posterior distribution
For example, a posterior with a broad, low peak holding most of the probability mass and a narrow, high peak holding little mass
The narrow, high peak becomes the MAP estimate, not the broad, low peak that holds 99% of the probability mass
2.6 Bayesian Predictive Distribution
By using the distribution of the estimated parameters, it is possible to make predictions that take into account the variability of the estimates.
Given a document set W, compute the Bayesian predictive distribution, i.e. the probability that a new word w* is vocabulary v
In the first equation, we use the fact that w* does not depend on the data W once the parameter Φ is given, and marginalize over the parameter Φ.
The fourth equation uses the normalization term of the Dirichlet distribution.
In the fifth equation, the equation is simplified based on the nature of the gamma function.
The results take into account the variability of the estimated values.
In the data of Experiment 1
In the data of experiment 2
MAP estimation and maximum likelihood estimation can be regarded as approximating the posterior with a Dirac delta function.
The MAP estimate does not change with the amount of observed data (it is the same in Experiments 1 and 2), unlike the Bayesian predictive distribution
2.7 Hyperparameter estimation
Since MAP estimation and Bayesian estimation require a prior distribution, the parameters of the prior distribution, called hyperparameters, must be set.
Example: β=2 for Dirichlet distribution
Methods for estimating hyperparameters from data
Empirical Bayes estimation
Find the hyperparameter β that maximizes the marginal likelihood p(W|β), which marginalizes the parameter Φ
In the case of the unigram model, the formula for the marginal likelihood is
This distribution is called the Polya distribution.
The marginal likelihood can be maximized with respect to β by repeating the above update equation until convergence (fixed-point iteration)
By assuming a gamma distribution as the prior of the hyperparameter β and maximizing the posterior probability of the hyperparameter with the parameters marginalized out, more robust hyperparameter estimation can be expected.
Update equation of the fixed-point iteration
Hyperparameter cannot be estimated by maximizing the posterior probability p(Φ|W,β)
2.8 Model Selection
"Model selection" can be performed using the marginal likelihood.
What is model selection?
Selecting a good model from among several candidate models
Calculation of the marginal likelihood of the data W given the model M
The product of the likelihood p(W|Φ,M) and the prior distribution p(Φ|M) is marginalized with respect to the parameter Φ.
If there are multiple models, the model with the highest marginal likelihood becomes the good model
Another method
Given candidate models M1,…,MK, prior probabilities p(M1),…,p(MK) can be set
It is possible to select the model that maximizes the posterior probability p(Mk|W), which can be derived using Bayes’ theorem.
Concept of model selection
The more complex the model, the higher the likelihood for the observed data
On the other hand, complex models overfit and lose generalization ability
Occam's razor
Chapter 3 Mixed Unigram Model
3.1 The Mixed Unigram Model
The unigram model assumes that all words are generated from the same distribution in all documents.
In real documents, vocabulary is used in different ways in different documents
Example
In political articles, “prime minister”, “parliament”, “bill”
In sports, “victory”, “soccer”, “stadium”.
Word distribution differs depending on the topic of the document
Generative models for representing document topics
Mixture of unigram models
Each topic has a different word distribution.
Let the word distribution of topic k be Φk=(Φk1,…,ΦkV).
Φkv=p(v|Φk) is
the probability that the vocabulary v appears in topic k.
Assume that each document has one topic zd∈{1,…,K}.
Words are assumed to be generated according to the word distribution Φzd of the topic of the document.
Specific generation process
θk=p(k|θ) is the probability that topic k is assigned to a document
θ=(θ1,…,θK) is called the topic distribution.
Assume that each is generated from a Dirichlet distribution
Example of generating a set of documents by the mixed unigram model
Given a word distribution, use the fact that each word is independent
3.2 Mixture Model
A model in which data is assumed to be generated from one of several probability models.
Mixture model
Formula for mixture model
p(k) is the mixture weight
Probability that topic k is chosen
Corresponds to θk in the mixture unigram model
p(w|k) represents the probability that observation w is generated when topic k is chosen
In the mixed unigram model, p(w|k) is given by the unigram model of topic k
A typical mixture model
A mixture of normal distributions (Gaussian mixture model), which mixes multiple normal distributions
An extension of the k-means clustering method to a probabilistic model
Mixture models are often used for clustering to divide data into multiple groups.
In the mixed unigram model, the topic zd of a document serves as an index indicating which cluster it belongs to.
3.3 EM Algorithm
A method for maximum likelihood estimation of the parameters of a mixture model.
The formula for the log-likelihood of a mixed unigram model is
In the unigram model, the parameter that maximizes the log likelihood was obtained analytically.
In the mixture model, it cannot be obtained analytically.
Maximum likelihood estimation using the EM (Expectation-Maximization) algorithm
The local optimum of the parameters is obtained by maximizing the lower bound of the log likelihood.
Calculating the lower bound F of the log likelihood using Jensen’s inequality
Transformation of the equation
In the inequality, Jensen's inequality is used since log(⋅) is a concave (upward-convex) function.
By using Jensen’s inequality, ∑ can be taken out of the logarithm of log(∑), which makes the calculation easier.
In the first equation, multiply by qdk/qdk=1
where qdk is the probability that document d belongs to topic k.
This is called the responsibility.
When the responsibilities are fixed, the parameters that maximize the lower bound F of the log likelihood can be found analytically
EM Algorithm
Estimates parameters by repeating E step and M step until convergence
E step
Compute the responsibilities (q11,…,qDK) that maximize the lower bound F, given by the above equation
These can be derived using the method of Lagrange multipliers.
Specifically, the problem is to find the extreme value of the above equation.
The derivative with respect to qdk is the above equation
Setting ∂F(qdk)/∂qdk = 0 and solving for qdk
Summing both sides over k=1,…,K gives 1 (the normalization constraint), which eliminates the Lagrange multiplier
M step
Update the parameters θ and Φ so that the lower bound F is maximized under the given responsibilities.
These updates can be derived with the method of Lagrange multipliers
The topic distribution θk is proportional to the sum of the responsibilities for topic k over all documents
The word distribution Φkv is proportional to the number of occurrences of vocabulary v, weighted by the responsibilities for topic k
Algorithm for mixed unigram model
For each document, compute the responsibility qdk, then update the topic distribution as θk_new ← θk_new + qdk and the word distribution as Φk,wdn_new ← Φk,wdn_new + qdk
This can also be done after computing the responsibilities for all documents, but updating after each document is more efficient.
Meaning of the responsibility qdk
The difference between the log likelihood L and the lower bound F is given by the above equation
is the KL divergence between the responsibilities qd=(qd1,…,qdK) and the topic posterior distribution p(zd|wd,θ,Φ)
Maximizing the lower bound F in the E step is therefore equivalent to minimizing the KL divergence between the responsibilities and the topic posterior distribution
The responsibility qdk is an approximation of the posterior probability p(k|wd,θ,Φ) that document d belongs to topic k.
3.4 Variational Bayesian Estimation
3.4.1 Maximizing the marginal likelihood
Consider Bayesian estimation of a mixed unigram model.
Since it cannot be calculated analytically, we approximate the posterior distribution using “variational Bayesian estimation”.
In variational Bayesian estimation, we seek approximations to the posterior distribution of topics and parameters that maximize the marginal likelihood
Let 𝚿={θ,Φ} be a summary of the parameters
Posterior distribution of topics p(z|W)
Z=(z1,…,zD) denotes the topic set.
Posterior distribution of parameters p(𝚿|W)
The formula for the lower bound of the log marginal likelihood
In the second line, multiply by q(z,𝚿)/q(z,𝚿)=1
In the third line, use Jensen's inequality
The lower bound F on the log marginal likelihood is called the variational lower bound.
In variational Bayesian estimation, we obtain an approximation q(z,𝚿) of the posterior distribution that maximizes the variational lower bound
q(z,𝚿) is called the variational posterior distribution.
To simplify the computation, variational Bayesian estimation assumes that the posterior distribution q(z,𝚿) can be decomposed into the product of the variational posterior of the topics q(z) and the variational posterior of the parameters q(𝚿)
Calculating the difference between the log marginal likelihood and the variational lower bound yields the above equation
Since the log marginal likelihood logp(W) is constant with respect to the variational posterior function q(z,𝚿)
The q(z,𝚿) that maximizes the variational lower bound F minimizes the KL divergence with the true posterior distribution p(z,𝚿|W).
By maximizing the variational lower bound, variational Bayesian estimation thus obtains an approximate posterior distribution that minimizes the KL divergence to the true posterior distribution.
3.4.2 Estimating the variational posterior distribution
Estimation of the variational posterior distribution q(z) of a topic
Maximizing the variational lower bound F is a problem of finding the extremum of the above equation using the method of Lagrange multipliers
The functional derivative (variation) of F(q(z)) with respect to q(z) is
𝔼q(x)[f(x)] = ∫ q(x)f(x)dx
the expected value of f(x) under the distribution q(x)
Solving for the q(z) at which this variation becomes zero (the maximizer) gives
Estimation of the variational posterior distribution q(𝚿) of the parameters
Maximizing the variational lower bound F under the constraint ∫q(𝛙)d𝛙=1 is a problem of finding the extremum of the above equation using the method of Lagrange multipliers
The variation of F(q(𝛙)) with respect to q(𝛙) is
Solving for the q(𝛙) at which this becomes zero, we obtain
In the mixed unigram model, the joint probability can be decomposed according to the generative process as in the above equation.
The posterior distribution is as shown in the above equation
Extracting the factors involving θ from this equation gives the above equation
Now
Similarly, if we extract the factor related to Φ, we obtain
Here we obtain
The final variational posterior distribution of the parameters is
In the case of mixed unigram model, the variational posterior distribution of topics is as above
The probability that the topic of document d is k is given by
Variational Bayesian Estimation Algorithm for Mixed Unigram Models
In the EM algorithm
The topic of each document is obtained as a distribution (the responsibilities) qd=(qd1,…,qdK).
The parameters θ and 𝚽 are obtained as point estimates.
In variational Bayesian estimation, not only the topics q(z) but also the parameters q(θ) and q(Φ) are estimated as distributions
The variational posterior distribution in a general probabilistic model with observed variables W and latent variables z=(z1,…,zK) is given by the above equation
𝔼q(z)[f(z)] = ∫ q(z)f(z)dz
the expected value of f(z) under the distribution q(z)
q(z\k)=q(z1,…,zk−1,zk+1,…,zK)
denotes the variational posterior distribution of the latent variables other than zk.
3.4.3 Variational lower bound
The variational lower bound does not decrease when updating the variational posterior distribution.
By calculating the variational lower bound, we can check if the algorithm is working properly.
The variational lower bound, which is a lower bound on the marginal likelihood, can also be used as a criterion for model selection.
Equation for the variational lower bound of a mixed unigram model
First term
Second term
Third term
Fourth term
(continued)
Fifth term
Sixth term
Seventh term
3.5 Gibbs sampling
3.5.1 Markov Chain Monte Carlo Method
Variational Bayesian estimation sought to approximate the posterior distribution by maximizing a lower bound on the marginal likelihood
Another method for Bayesian estimation of mixed unigram models
Gibbs sampling
A type of Markov chain Monte Carlo (MCMC) method.
A method that allows sampling from the true posterior distribution, rather than an approximation.
In principle, the true posterior distribution can be obtained by sampling an infinite number of cases
Example: the distribution p(z1,…,zD|W) is to be estimated
In Gibbs sampling
Sample zd from its conditional probability p(zd|z1,…,zd−1,zd+1,…,zD,W) given all the other variables, and repeat this for all variables d=1,…,D to obtain samples from the target distribution.
If enough cases are obtained, the target distribution can be approximated by the empirical distribution above
S is the number of cases sampled
zd(s) is the value of variable zd sampled in the s-th iteration
The sampled values (z1(s),…,zD(s)) can be regarded as samples from the distribution p(z1,…,zD|W)
3.5.2 Parameter Marginalization
There are three unknown variables in the mixed unigram model: z, θ, and Φ.
Gibbs sampling in which the parameters θ and Φ are marginalized out (integrated out) to estimate the posterior distribution of the topic set z is described.
Gibbs sampling with the parameters marginalized out is called collapsed Gibbs sampling.
By marginalizing out the parameters, the number of variables to be sampled is reduced, allowing more efficient estimation.
When the parameters θ and Φ are marginalized out, the joint distribution of the document set W and the topic set z is as follows
The topic set z is generated from α (through θ)
Dk is the number of documents to which topic k is assigned
Word set W is generated from z and β (through Φ)
Nkv is the number of occurrences of vocabulary v in the documents to which topic k is assigned
Nk is the total number of words in the document to which topic k is assigned
3.5.3 Sampling Equation
The conditional probability required for Gibbs sampling can be calculated using Bayes’ theorem as shown above
z\d=(z1,…,zd−1,zd+1,…,zD) is the topic set z excluding only zd
p(zd=k|z\d,α) becomes as above using equation (3.24) and the property of the gamma function Γ(x+1)=xΓ(x)
The superscript \d denotes a count computed with document d excluded
p(wd|W\d,zd=k,z\d,β) becomes the above equation using equation (3.25).
The Gibbs sampling equation for the mixed unigram model is
The first factor is proportional to the number of documents assigned to topic k. Topics with a large number of documents are more likely to be assigned.
The second and subsequent factors are the likelihood (Polya distribution) of document d given topic k. The more similar document d is to the set of documents already assigned to topic k, the more likely it is to be assigned to topic k.
The hyperparameters α and β of the mixed unigram model can be estimated by maximizing the joint probability p(W,z|α,β) of the document set W and the topic set z using fixed-point iteration.
Update equation
A method for estimating a model by alternately sampling the latent variables z and maximizing the joint probability p(W,z|α) of the data W and the sampled latent variables with respect to the parameter α.
Stochastic EM algorithm
In the usual EM algorithm
In the E step, compute the responsibilities, which are the posterior probabilities of the latent variables
In the M step, maximize the expected joint probability under the responsibilities
In the probabilistic EM algorithm
In the E step, sample the latent variables
In the M step, maximize the joint probability using the sampled latent variables
Gibbs sampling algorithm for mixed unigram models
Start with no topics assigned to the documents (zd=0)
Sequentially assign topics to documents
The sampling probability is calculated using only the information of the set of documents to which topics have been assigned so far, as shown in the above equation
The superscript <d denotes counts computed only from the documents preceding document d
Using a single sampled topic set z, the marginalized-out parameters θ and Φ can be estimated as above.
Difference between EM algorithm, variational Bayesian, and Gibbs sampling estimation
Chapter 4 Topic Models
4.1 Topic Model
The mixed unigram model assumes that each single document has a single topic
There is one topic distribution for the whole document set.
Topic model
A model that assumes that a single document has multiple topics.
Each document d has its own topic distribution θd=(θd1,…,θdK).
θdk=p(k|θd)
is the probability that topic k is assigned to a word of document d
A topic zdn is assigned to each word of document d according to the topic distribution θd
Words are generated according to the word distribution Φzdn of the assigned topic
The set of per-topic word distributions Φ=(Φ1,…,ΦK) is the same as in the mixed unigram model
Φk=(Φk1,…,ΦkV) denotes the word distribution of topic k
Φkv=p(v|Φk) denotes the probability that vocabulary v is generated by topic k
Since words in the same document can be assigned different topics, it is possible for a document to have multiple topics.
A topic is assigned not to each vocabulary item but to each word token.
The same word may be assigned different topics.
Process of generating a concrete document set
Example of document set generation by topic model
Given a topic distribution and a word distribution set Φ, the probability of document wd
4.2 Graphical Model
Graphical model representations are often used to illustrate generative models.
Graphical model representations of unigram models, mixed unigram models, and topic models
In the mixed unigram model
There is only one topic distribution θ for a set of documents
There is one topic z per document
In the topic model
There is a topic distribution θ for each document
There is one topic z for each word.
4.3 Maximum Likelihood Estimation
Methods for maximum likelihood estimation of topic models
Probabilistic latent semantic analysis (PLSA)
Maximum likelihood estimation of topic models uses the EM algorithm
Find the Θ=(θ1,…,θD) and 𝚽 that maximize the log likelihood (above).
The lower bound of the log likelihood can be calculated using Jensen’s inequality as shown above
E step
Compute the responsibility that the n-th word of document d belongs to topic k, which maximizes the lower bound F, as in the above equation
The responsibility qdnk is proportional to the document's topic probability θdk and the probability Φk,wdn of the word under topic k.
Words of the same vocabulary in the same document have the same responsibility.
M step
Find the parameter that maximizes the lower bound F
Update the probability θdk of the topic of document d as in the above equation
Proportional to the sum of the responsibilities for topic k over the words in the document
Update the probability Φkv of the vocabulary v in topic k as above
where ∑ is the sum with respect to n where wdn=v
The probability Φkv that vocabulary v occurs in topic k is proportional to the sum of the responsibilities for vocabulary v over all documents
The above update equations are obtained by maximizing the lower bound under the constraints using the method of Lagrange multipliers.
EM Algorithm Procedure for Topic Models
4.4 Variational Bayesian Estimation
Methods for Bayesian estimation of topic models
Latent Dirichlet allocation (LDA)
4.4.1 Maximizing the marginal likelihood
Topic models use variational Bayesian estimation to estimate the posterior distribution of unknown variables.
The unknown variables in the topic model are
Topic set Z=(z11,…,z1N1,z21,…,zDND)
Topic distribution set 𝚯
Word distribution set 𝚽
Find the variational posterior distribution q(Z,𝚯,𝚽), which is an approximation of these posterior distributions.
The log marginal likelihood of the topic model is given by the above equation.
The variational lower bound F can be calculated using Jensen’s inequality
For ease of calculation, we assume that the variational posterior distribution can be decomposed as q(Z,𝚯,𝚽)=q(Z)q(𝚯,𝚽).
From the generative process of the topic model, we use the fact that the joint distribution can be decomposed as in the above equation
The word distribution 𝚽 is generated from the Dirichlet distribution with parameter β
The topic distribution set 𝚯 is generated from the Dirichlet distribution with parameter α
Topic set Z is generated from topic distribution set 𝚯.
The word set W is generated from the topic set Z and the word distribution set 𝚽.
4.4.2 Estimating the variational posterior distribution
The variational posterior distribution q(𝚯) of the topic distribution that maximizes the variational lower bound F is obtained as in the above equation.
αdk is a parameter of the variational posterior distribution q(θd), which is a Dirichlet distribution
Let Qdnk≡ q(zdn=k) be the variational posterior probability that the topic zdn of the nth word in document d is k.
The variational posterior distribution of the word distribution, q(𝚽), is as shown above
βkv is a parameter of the variational posterior distribution q(𝚽k), which is a Dirichlet distribution.
The posterior distribution of the topic is
By transforming Eq.
I(⋅) denotes the indicator function where I(A)=1 if A is true and I(A)=0 otherwise
Variational Bayesian Estimation for Topic Models
4.4.3 Variational Lower Bound
The variational lower bound of the topic model
First term
(continued)
Second term
(continued)
Third term (continued)
Fourth term
Fifth term
Sixth term
(continued)
Seventh term
4.5 Gibbs Sampling
4.5.1 Parameter Marginalization
As in the mixed unigram model
Estimate the posterior distribution p(Z|W,α,β) of the topic set Z by integrating out the parameters 𝚯 and 𝚽.
The joint distribution of the document set W and the topic set Z with the parameters 𝚯 and 𝚽 marginalized out is given by the above equation.
The first factor can be estimated using the normalization term of the Dirichlet distribution
(continued)
Ndk is the number of words assigned to topic k in document d
Nd=∑Ndk is the number of words in document d
The second factor is
Nkv is the number of words in vocabulary v assigned to topic k
Nk=∑Nkv is the number of words assigned to topic k
4.5.2 Sampling Formula
Sampling a topic word by word
Probability of sampling a topic for the nth word in document d
Given the topic set excluding that topic, Z\dn=(z11,…,zd,n−1,zd,n+1,…,zDNd), and the document set W, the conditional probability of the topic zdn is given by
The first factor is
(continued)
where Ndk\dn is the number of words assigned to topic k in document d when the n-th word is excluded.
The second factor is
The final formula
The first factor represents the number of words assigned topic k in document d when the nth word is excluded
Topics frequently assigned in that document are more likely to be chosen.
The second factor is proportional to the number of times the vocabulary wdn has been assigned to topic k over the whole document set
Topics to which this vocabulary is often assigned are therefore more likely to be chosen.
This sampling probability has a very simple form that can be computed simply by counting topic assignments.
The program is easy to implement.
Example of collapsed Gibbs sampling for the topic model
The hyperparameters α and β can be estimated by maximizing the marginal joint likelihood.
Update equations using the fixed-point iteration method
Gibbs sampling for topic models
4.6 Relationship between MAP estimation, variational Bayesian estimation, and Gibbs sampling
The relationship between MAP estimation using the EM algorithm, variational Bayesian estimation, and Gibbs sampling for topic models
4.7 Latent Semantic Analysis
The relationship between latent semantic analysis (LSA) and probabilistic latent semantic analysis (PLSA) is as follows
PLSA is an extended version of LSA
The document set is represented by a (D×V) matrix N
D is the number of documents
V is the vocabulary size
Element (d,v) of the matrix N is Ndv, the number of times vocabulary v appears in document d
In LSA
Matrix N is factorized into UᵀH, the product of two low-rank matrices, so that the Frobenius norm (described in "Overview of the Frobenius norm and examples of algorithms and implementations") of the residual is minimized
U=(u1,…,uD) is a (K×D) real matrix
ud is a K-dimensional vector
H=(h1,…,hV) is a (K×V) real matrix
hv is a K-dimensional vector
∥·∥Fro denotes the Frobenius norm
the square root of the sum of squared elements
The above decomposition can be obtained by singular value decomposition, as described in "Overview of Singular Value Decomposition (SVD) and examples of algorithms and implementations". PLSA carries out the corresponding decomposition as a probabilistic model.
Relationship between Latent Semantic Analysis and Topic Models
4.8 Evaluation Measures
A measure to evaluate the performance of a probabilistic model
Perplexity for test data
Perplexity
A value that can be calculated from the negative log likelihood
The subscript "test" denotes the test data
M represents a probability model
Perplexity is the per-word geometric mean of the inverse likelihood; on average the model predicts each word with probability 1/perplexity
The perplexity of a model in which all words have uniform occurrence probability equals the vocabulary size V
The perplexity of a model that predicts every word perfectly is 1
Low perplexity indicates that the model is a good probabilistic model that can predict the test data with high accuracy
Likelihood of each model
4.9 Experiments
Experiments on topic models using Japanese Wiki data
Vocabulary set of 5000 nouns
Randomly selected 10000 documents from a set of sentences with more than 100 words
Generate BOW-represented document data
For 10% of the documents, 50% of the words were used as test data and the rest as training data.
Use uniform Dirichlet distribution as prior distribution of word and topic distributions
The parameters of the Dirichlet prior distribution are estimated from the data using fixed-point iteration.
Collapsed Gibbs sampling is used to estimate the topic model and the mixed unigram model.
Experimental Results
Log joint likelihood
The log likelihood increases as Gibbs sampling proceeds and settles at a high value after around 500 iterations.
Perplexity
The topic model has lower perplexity (more suitable for real documents) than the mixed unigram and unigram models.
Examples of topics extracted by topic models
In order of appearance rate
Chapter 5 Extending the Topic Model: Using Other Information
5.1 Joint Topic Model
Adding information other than the word information contained in the document
Supplemental information (side information)
e.g., product category and rating for a product review article
For an academic article, the author, journal name, and year of publication
Consider a model for generating a set of documents with supplementary information
A document d has Md pieces of auxiliary information xd=(xd1,…,xdMd)
Auxiliary information is a discrete variable, such as a classification category: xdm∈{1,…,S}
Topic model for document data with auxiliary information
Joint topic model (JTM)
Each topic has its own auxiliary information distribution
Auxiliary information is generated according to the topic
Concrete flow
After determining the topic ydm of auxiliary information according to the topic distribution θd
Generate auxiliary information xdm according to the topic auxiliary information distribution 𝛙ydm
𝜳k=(𝜳k1,…,𝜳kS)
𝜳ks=p(s|𝜳k) denotes the probability that auxiliary information s is generated when the topic is k
Word generation is the same as in the regular topic model
Generated sequence
Generation process
The joint distribution of the word set wd and the auxiliary information set xd of document d is given by
Sampling probability
X=(x1,…,xD) is the auxiliary information set
Y=(y11,…,yDMD) is the auxiliary information topic set.
Mdk is the number of times auxiliary information topic k has been assigned in document d
Mks is the number of times topic k is assigned to auxiliary information s
Mk is the number of times topic k is assigned to auxiliary information.
Graphical Model
The observable variables are word w and auxiliary information x
z is word topic, θ is word topic distribution, Φ is word distribution
y is auxiliary information topic, 𝛹 is auxiliary information distribution
α, β, and γ are the parameters of the Dirichlet distributions that generate θ, Φ, and Ψ, respectively
What can be done with the joint topic model
Predict classification categories
Predict the grade of a review article
Predict auxiliary information from word information
Analysis of Multilingual Document Sets
Apply the joint topic model to Japanese documents, treating the English documents as the auxiliary information set.
Extract topic models with Japanese and English correspondence without using a dictionary.
Flow
Estimate a joint topic model using a set of documents with both word information and auxiliary information (training data) and a set of documents with only word information (test data)
Using the estimated topic distributions θd of the test documents and the auxiliary information distributions Ψ, compute the probability that the auxiliary information of document d is s with the above formula.
s with the highest probability becomes the predicted value.
5.2 Correspondence Topic Model
Task to predict auxiliary information from word information
In the joint topic model, the topics of words and of auxiliary information are chosen independently according to the topic distribution.
In the same document, some topics are used only for words, and some topics are used only for auxiliary information.
The relationship between words and auxiliary information cannot be modeled properly
Predictive performance is low
Correspondence topic model
Generate supplementary information using only the topics that generated the words
Generation process
Word generation is the same as in normal topic models
As in the joint topic model, there is an auxiliary information distribution Ψ=(Ψ1,…,ΨK) for each topic.
Auxiliary information is generated by assigning auxiliary topics according to the ratio Ndk/Nd of topics that generated words in the document.
The auxiliary information is generated from the auxiliary information distribution Ψydm for that topic ydm.
Since the topic used to generate auxiliary information is always the topic that generated the word
The topic of the word is determined so that auxiliary information is generated appropriately.
Generation Example
Graphical Model
In the generation process, a single word topic is determined by a K-dimensional categorical distribution with parameters proportional to the number of words assigned.
A word is selected uniformly at random from among the Nd words in the document, and the topic of that word is chosen as the word topic.
Joint distribution of the word set wd and the auxiliary information set xd of document d in the correspondence topic model
zd=(zd1,…,zdNd) is the topic set of document d.
Ndk/Nd is the probability that the topic of an auxiliary information item is k, given the word topics.
Sampling probability of a word topic in collapsed Gibbs sampling
In the joint topic model, the probability is proportional to the sum of the word and auxiliary information topic counts (Ndk\dn+Mdk+α).
Even if no word in the document is assigned topic k, there is still a probability that topic k is assigned.
In the correspondence topic model, words and topics of auxiliary information are not treated equally
Probability proportional to (Ndk+Mdkdm+α)
If a word does not have a topic k (Ndk=0), the probability of assigning a topic k is zero
In the Correspondence Topic Model, each auxiliary information is assumed to be generated from a topic-specific distribution of auxiliary information for a word topic.
Related Models
Supervised topic model
Regress auxiliary information from a vector of word topics
Assume that the auxiliary information yd, a continuous variable, is generated from a normal distribution
5.3 Correspondence topic model with noise
Tagging that is not related to the content (words)
Example
"Read later" tags in social bookmarking
"This is awesome" tags expressing subjective evaluation
Camera-model tags such as "Nikon" or "Canon" regardless of what is in the photo
Noisy correspondence topic model
Extension of the correspondence topic model
Automatically determine whether content and auxiliary information are related or not.
Expected to improve accuracy of prediction of supplementary information and search using supplementary information
Generated sequence
In the noisy correspondence topic model, in addition to the per-topic auxiliary information distributions Ψ1,…,ΨK,
the model has a topic-independent general auxiliary information distribution 𝚿0.
For each auxiliary information item, there is a binary latent variable rdm∈{0,1} indicating whether it is related to the content.
If the content is relevant (rdm=1), then the auxiliary information is generated using one of the topics that generated the word.
If it is not related to the content (rdm=0), then auxiliary information is generated from the general auxiliary information distribution without being related to the word topic
Generation Process
Graphical Model
The parameter λ of the Bernoulli distribution that generates the relation r is generated from the beta distribution, which is a conjugate prior distribution.
The joint distribution of the word set wd and the auxiliary information set xd of document d in the noisy correspondence topic model is given by
(continued)
The unknown variables are
Word topic set Z
Auxiliary information topic set Y
The relevance set R=(r11,…,rDMD)
The sampling probabilities of the unknown variables in collapsed Gibbs sampling are given by the above equations
M0 (M1) is the number of auxiliary information items that are unrelated (related) to the content
M0s is the number of times auxiliary information s appears unrelated to the content
Mks is the number of times auxiliary information s is related to the content and assigned topic k.
Example of data analysis
5.4 Author Topic Model
A model in which topics and topic distributions are generated depending on auxiliary information.
In the joint topic model the opposite holds: the auxiliary information is generated depending on the topics.
Example
A topic model for a set of documents with author information (author topic model)
The topic distribution is determined depending on the author information
Generation process
In the author topic model, the document set W is generated given the author set A=(a1,…,aD)
ad=(ad1,…,adMd) denotes the author set of document d.
adm is the m-th author index of document d
Md is the number of authors of document d
Let S be the total number of authors in the whole document set; adm ∈ {1,…,S}
For each author, there is a unique topic distribution θs
Flow
For each word, first choose one author ydn∈{1,…,Md} uniformly at random from the authors of the document.
Next, choose a topic zdn according to the topic distribution θad,ydn of the chosen author.
Finally, generate a word wdn according to the word distribution Φzdn of the chosen topic
Generation Example
Graphical Model
In the author topic model, the author information is observed (given), not generated
Observable variables
Author information a
Word w
y is the author assigned to each word
Z is a word topic
Θ is the topic distribution for each author
Φ is the word distribution
α and β are the parameters of the Dirichlet distributions that generate Θ and Φ, respectively.
The conditional probability for jointly sampling the topic zdn and the author ydn is given by the above equation
Nadmk is the number of times topic k was assigned to a word of author adm
Nadm is the number of words of author adm
The topic distribution of author s becomes the above equation using the sampled topic set
By examining the similarity of the topic distributions for each author, we can determine the relationships among the authors.
A model that uses general auxiliary information, not limited to authors, to determine the topic distribution
Dirichlet-multinomial regression model
The parameters of the Dirichlet distribution that generates the topic distribution are determined depending on the auxiliary information
𝛈k is a real-valued vector of the same size as the auxiliary information vector xd
Regression parameters for topic k
5.5 Topic Tracking Model
Focus on temporal information in estimating the topic of a document
Example
A blog about sports
The articles a blog covers vary greatly depending on when the article was written
Example: Olympics during the Olympics, soccer during the World Cup
The blogger’s interests may also change
Example: I used to be interested in movies and wrote articles about movies
Topic models for estimating time-varying topics
Topic tracking model
Example
Assume blog data
Not only blogs but also purchase histories can be treated in the same way by regarding wtd as the set of products purchased by user d at time t
Model the document sets W1,…,WT written by D users at times t=1,…,T.
Wt=(wt1,…,wtD) is the document set at time t
wtd=(wtd1,…,wtdNtd) is the word set of user d's articles at time t
Ntd is the number of words in user d’s article at time t
User’s interest is represented by topic distribution
Because interests change, we have a different topic distribution θtd at each time
Interests do not change at random; at most times they hardly change
Assume that θtd is generated from a Dirichlet distribution Dirichlet(θtd|αtd θ̂t−1,d) whose mean is the estimate θ̂t−1,d of the topic distribution at the previous time
The parameter αtd is the inverse of the variance, the precision
which indicates how difficult it is for the interest to change.
Consistency of interest
If αtd is high, interest does not change much
Assume that the word distribution for each topic changes over time as well as the topic distribution for each user
Assume that Φtk is generated from a Dirichlet distribution Dirichlet(Φtk|βtk Φ̂t−1,k) whose mean is the estimate Φ̂t−1,k of the word distribution at the previous time
The parameter βtk represents the consistency of the word distribution.
Generation process
Graphical model
An example of the change in topic distribution per user and word distribution per topic is shown.
Probability of word set wd at time t for user d in the topic tracking model
The probability in Gibbs sampling is
In the topic tracking model, past information can be used appropriately by estimating the precision hyperparameters αtd and βtk.
αtd and βtk can be estimated using fixed-point iteration.
So far only the estimated parameters from one time step earlier were used, but it is also possible to depend on estimates from multiple past time steps.
Example of Topic Extraction from Blog Data
The topic tracking model assumes that the topic distribution changes.
Model in which the topic distribution does not change, but the prior distribution of the topic distribution changes
Used to analyze data where a single document is generated at a single time, such as newspaper articles or papers
Topic tracking models are used to track documents written by a single user.
Chapter 6 Extending the Topic Model: Putting Structure into Topics
6.1 Correlated Topic Model
When there is a correlation between topics
Example
Politics and economics in a newspaper article
The correlated topic model makes it possible to handle correlations between topics.
In a normal topic model, the topic distribution is generated from a Dirichlet distribution
Dirichlet distribution cannot model correlation
Correlated topic model is generated using normal distribution
Specific approach
A K-dimensional real vector 𝛈d=(𝛈d1,…,𝛈dK)
is generated from a normal distribution Normal(µ,𝚺) with mean µ=(µ1,…,µK) and covariance matrix 𝚺
𝚺 is a matrix of size (KxK)
Its element σkk' specifies the correlation between topics k and k'.
The elements of 𝛈d generated from the normal distribution can be negative, and they do not sum to 1
To make the real vector 𝛈d satisfy the conditions of a probability distribution, non-negative and summing to 1, take the exponential and normalize.
This transformation is called the "softmax function".
Convert the real vector 𝛈d to the topic distribution θd with the softmax function (a sketch follows this list)
Similar approach to ordinary topic model
Generating process of correlated topic model
Graphical model
Distribution of document wd in correlated topic model
Variational Bayesian Estimation of Correlated Topic Model
The variational lower bound on the log marginal likelihood of the correlated topic model is
(continued)
H=(𝛈1,…,𝛈D)
To simplify the computation, assume that the variational posterior distribution factorizes as in the above equation
The E step, which finds the variational posterior q(Z,H,Φ) that maximizes the lower bound F,
and the M step, which finds the parameters µ and 𝚺 that maximize the lower bound F given the variational posterior, are repeated until convergence.
6.2 Pachinko Allocation Model
In the correlated topic model, the covariance matrix is used to model the correlation between two topics
The pachinko allocation model
Models the relationship between topics by introducing a hierarchical structure to the topics.
When a pachinko ball falls, where it passes at the top determines where it can pass at the bottom.
Example
A pachinko allocation model with two levels of topics: upper-level and lower-level topics
Lower-level topics: "cooking", "health", "insurance", "medicine"
Upper-level topics: "cooking, health", "health, insurance, medicine"
Specific flow
First, for each word, choose an upper-level topic ydn according to the upper-level topic distribution θd
The upper-level topic distribution of document d is θd=(θd1,…,θdS)
θds is the probability that upper-level topic s is chosen in document d
S is the number of upper-level topics.
Next, we select the lower-level topic zdn using the lower-level topic distribution θd,ydn corresponding to the upper-level topic.
There are S subordinate topic distributions for each document
Θd=(θd1,…,θdS).
By using lower-level topic distributions that depend on the upper-level topic, lower-level topics that tend to appear together in the same document can be modeled.
Decide the vocabulary according to the word distribution Φzdn of the chosen subtopic zdn.
Generative process of the pachinko allocation model
Graphical model
Given the parameters θd, Θd, and Φ, the probability of a document wd is
The pachinko allocation model can be estimated by collapsed Gibbs sampling
Upper topic ydn and lower topic zdn are sampled simultaneously
A=(α0,α1,…,αS) is the parameter set of the Dirichlet priors of the topic distributions
Since A models the relationship between topics, it needs to be inferred from the data
It can be estimated using fixed-point iteration
(continued)
6.3 Probabilistic Latent Semantic Visualization
A topic model for visualizing documents and topics
Probabilistic latent semantic visualization (PLSV)
Visualize documents with similar topics so that they are placed close to each other
PLSV assumes that documents and topics have coordinates in the visualization space, and that their positional relationships define the topic distribution.
Let the visualization coordinates of document d be rd=(rd1,…,rdC)
Let the visualization coordinates of topic k be sk=(sk1,…,skC).
C is the number of dimensions of the visualization space, usually 2 or 3
Let the probability that topic k is selected in document d be given by the above equation
S=(s1,…,sK) is the set of topic coordinates.
If the squared Euclidean distance between document d and topic k, ∥rd−sk∥²=∑c(rdc−skc)², is small, the proportion of topic k in document d is high
Relationship between visualization and topic distribution in PLSV
Generative Process of Probabilistic Latent Visualization
Graphical Model
Given document coordinate rd, topic coordinate set S, and word distribution set Φ, the probability of wd in PLSV is given by
Estimate PLSV by MAP estimation using the EM algorithm
Chapter 7 Application to Data Other Than Documents
7.1 Images
7.1.1 BOW Representation of Images
Applying the Topic Model to the Analysis of Image Data
Image information is usually not represented as a BOW.
Each pixel is represented by a 3-dimensional real vector (RGB)
SIFT (scale-invariant feature transform) features as local features of the image
Each feature point is represented by a 128-dimensional vector
By using vector quantization, a collection of vectors can be represented as a BOW
Apply clustering methods such as K-means
After vector quantization, apply the topic model
7.1.2 Coordinate Recommendation
Example of application to image data
Coordinate recommendation of clothes using a topic model
BOW representation of clothing coordination
Specific procedure
The joint topic model can be applied
7.2 Networks
Application of network data to probabilistic models
Social networks such as friendships
Citation relationships in papers
Links in web pages
Food chain
7.2.1 Probabilistic Block Model
Typical stochastic model of a network
Stochastic block model
Example of representation of a network link
A network with D nodes can be represented by a (D×D) adjacency matrix W
The (d,d’) element wdd’ of W is
If there is a link between nodes d and d’, wdd’=1
If there is no link between nodes d and d’, wdd’=0
In the stochastic block model, the probability that a node of topic k is connected to a node of topic k’ is assumed to be Φkk’.
Generative process of stochastic block model
Since it is a conjugate prior distribution
θ, a parameter of the categorical distribution, is generated from the Dirichlet distribution.
The link probability Φkk’, a parameter of the Bernoulli distribution, is generated from the beta distribution.
After generating the topic zd of each node from the categorical distribution with parameter θ, the link wdd' is generated from the Bernoulli distribution with parameter Φzdzd'.
Example of a stochastic block model
The joint distribution of the network W and the topic set z=(z1,…,zD) in the stochastic block model is given by the above equation
θ=(θ1,…,θK)
where θk represents the probability that a node has topic k
Φ=(Φ11,…,ΦKK)
Since conjugate prior distributions are used, the unknown parameters, the topic distribution θ and the link probabilities Φ, can be integrated out.
The unknown variable to be estimated is the topic set z.
Gibbs sampling formula for estimating the topic set
B(x,y)=Γ(x)Γ(y)/Γ(x+y) is the beta function
Dk is the number of nodes to which the topic k has been assigned
Dkl is the number of links from topic k nodes to topic l nodes
Ḋkl is the number of node pairs from topic k nodes to topic l nodes that have no link (non-links)
Dkl:zd=k is the value of Dkl when zd is set to k
7.2.2 Mixed Membership Stochastic Block Model
Consider a social network. There are friendships in various communities (topics).
The stochastic block model, in which each node is associated with only one topic, cannot model such relationships involving multiple topics
Extend the stochastic block model so that nodes can have multiple topics
Mixed membership stochastic block model
Each node has a unique topic distribution θd
Similar to stochastic block model, with link probability Φ between topics
Generation flow of a link wdd’ from node d to node d’ (see the sketch after this list)
Choose one topic zdd’1 according to the topic distribution θd of node d
Similarly, choose another topic zdd’2 according to the topic distribution θd’ of node d’
Generate the link according to the link probability Φzdd’1zdd’2 corresponding to topics zdd’1 and zdd’2
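A minimal sketch of this link-generation flow (illustrative sizes and hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 30, 3                 # nodes and topics (illustrative)
alpha, a, b = 1.0, 1.0, 1.0  # hyperparameters (assumed values)

Theta = rng.dirichlet(alpha * np.ones(K), size=D)  # per-node topic distributions theta_d
Phi = rng.beta(a, b, size=(K, K))                  # link probability between topic pairs

W = np.zeros((D, D), dtype=int)
for d in range(D):
    for d2 in range(D):
        if d == d2:
            continue
        z1 = rng.choice(K, p=Theta[d])   # topic chosen on the side of node d
        z2 = rng.choice(K, p=Theta[d2])  # topic chosen on the side of node d'
        W[d, d2] = rng.binomial(1, Phi[z1, z2])
```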
Generative process of the mixed membership stochastic block model
The joint distribution of the network W and the topic set Z=(z11,…,zDND) is given by the above equation
The conditional probability used in Gibbs sampling is given by the above equation
Chapter 8 Estimating the Number of Topics
8.1 Estimating the Number of Topics in a Mixture Model
The hierarchical Dirichlet process, which is used to estimate the number of topics in topic models, is based on the Dirichlet process
The Dirichlet process is used to estimate the number of topics in a mixture model
8.1.1 Dirichlet Process
The Dirichlet process (DP) can be used to estimate the number of topics in a mixture model.
The Dirichlet process is
A process G ~ DP(α, H) that generates a discrete distribution G, defined by a base distribution H and a concentration parameter α
The base distribution
determines what the generated discrete distribution looks like on average
The concentration parameter
expresses how concentrated the generated discrete distribution is
In the limit α → 0, all probability is concentrated on a single value
As α increases, probability is assigned to more values
Example: discrete distributions generated by a Dirichlet process whose base distribution is a beta distribution
When concentration parameter is small, large probability is assigned to only a few discrete values.
When concentration parameter is large, probability is assigned to many discrete values
Generates a distribution similar to the base distribution.
A mixture model with an infinite number of component models using the Dirichlet process
Infinite mixture model
Dirichlet process mixture model
Uses a Dirichlet distribution as the base distribution and a unigram model as the observed data distribution
Infinite mixture unigram model
No need to set the number of topics in advance; a mixed unigram model with the number of topics appropriate for the data can be estimated
8.1.2 Chinese Restaurant Process
An infinite mixture model contains an infinite-dimensional mixing ratio and an infinite number of component models
By using the Chinese restaurant process (CRP), estimation is possible while handling only the finite number of mixing ratios and component models actually used
An infinite mixture model can be constructed by CRP.
In the CRP, the probability that the topic zd of the dth document is k depends on the topics z1,…,zd−1 of the first d−1 documents: P(zd=k|z1,…,zd−1) = Dk/(d−1+α) for an existing topic k, and α/(d−1+α) for a new topic
Dk is the number of documents to which topic k is assigned
α is a concentration parameter
This probability does not depend on the order of the documents and is exchangeable.
The probability that a topic k to which at least one document has been assigned so far is selected depends on the number of documents belonging to the topic k, Dk.
It can be compared to the process of a customer (document) being assigned to a table in a Chinese restaurant with an infinite number of tables (topics).
The probability that a customer will be assigned to a table is proportional to the number of people Dk who are already at the table.
The probability of getting a new table is proportional to α
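A minimal simulation of the CRP seating process described above (the concentration parameter value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def crp(num_docs, alpha):
    """Sequentially assign documents to topics (tables) by the Chinese restaurant process."""
    z = []       # topic (table) of each document
    counts = []  # counts[k] = number of documents at table k
    for d in range(num_docs):
        # existing table k: D_k / (d + alpha), new table: alpha / (d + alpha)
        probs = np.array(counts + [alpha], dtype=float) / (d + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(0)  # open a new table
        counts[k] += 1
        z.append(k)
    return z, counts

z, counts = crp(num_docs=100, alpha=2.0)
print(len(counts), "topics used; sizes:", counts)
```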
The process of generating an infinite mixture unigram model using CRP.
Estimating topics using collapsed Gibbs sampling
Given the topic set excluding document d, z\d=(z1,…,zd−1,zd+1,…,zD), the conditional probability of the topic zd of document d is needed
The conditional probability is given by the above equation
Gibbs Sampling Algorithm for Infinite Mixture Unigram Model
8.1.3 Stick-Breaking Process
Construction of infinite mixture models other than CRP
Stick breaking process
Generates a topic distribution θ=(θ1,…,θ∞) with an infinite number of elements
The process of breaking a stick of length 1 in sequence
Generate ν1 from the beta distribution Beta(1,α)
Break the stick at proportion ν1 from the left and take the left piece as the first element θ1=ν1
The right piece of length 1−ν1 remains
Break the remaining stick again at proportion ν2 generated from Beta(1,α)
Take the piece of length ν2(1−ν1) as the second element θ2=ν2(1−ν1)
Repeating this breaking operation, the piece obtained at the kth break gives the kth element θk=νk(1−ν1)(1−ν2)…(1−νk−1)
Breaking the stick infinitely many times yields a vector θ with infinitely many elements whose sum is 1
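A minimal truncated stick-breaking sketch; because the breaking is truncated at K steps, the weights sum to slightly less than 1 (the remainder is the unbroken stick):

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, K):
    """Truncated stick-breaking: break a unit-length stick K times."""
    nu = rng.beta(1.0, alpha, size=K)                            # breaking proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - nu)[:-1]])
    return nu * remaining                                        # theta_k = nu_k * prod_{j<k} (1 - nu_j)

theta_small_alpha = stick_breaking(alpha=1.0, K=20)   # a few elements with large probability
theta_large_alpha = stick_breaking(alpha=20.0, K=20)  # many elements with small probability
print(theta_small_alpha.round(3))
print(theta_large_alpha.round(3))
```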
The parameter α of the beta distribution in the stick-breaking process corresponds to the concentration parameter α of the Dirichlet process
If α of the beta distribution Beta(1,α) is small, the generated ν values tend to be close to 1
Only a few topics, each with large probability, are generated
If α is large, the generated ν values tend to be close to 0
Many topics with small probabilities are generated.
An infinite mixture model using the stick-breaking process can be estimated with a variational Bayesian method
Set a sufficiently large number K* as the upper limit on the number of topics
By setting νK*=1, the elements θk=0 for k>K*
The model is thus approximated using a topic distribution of finite dimension K*
The Dirichlet process corresponds to the limit where the dimension K of the Dirichlet distribution is infinite.
An approximation of an infinite mixture model using a Dirichlet distribution of sufficiently large dimension K as a prior to the topic distribution is also possible.
8.2 Estimating the Number of Topics in a Topic Model
8.2.1 Hierarchical Dirichlet Process
The number of topics in a mixture model can be estimated by using the Dirichlet process.
A topic model is a mixture model in which each document has a topic distribution.
If a Dirichlet process with a Dirichlet distribution as the base distribution is used for each document, the number of topics per document can be estimated
However, since the Dirichlet distribution is a continuous distribution, the same word distribution is never generated for different documents
Topics are not shared, so this cannot be used as a topic model
Hierarchical Dirichlet process (HDP)
Allows topics to be shared among documents
The number of topics in the entire document set and the number of topics per document can be estimated.
In the hierarchical Dirichlet process, a Dirichlet process is used for each document
A shared distribution H is generated by a Dirichlet process from a common base distribution H0, and each document's distribution Gd is generated by a Dirichlet process with H as its base distribution
Since the shared distribution H generated by the Dirichlet process is a discrete distribution
The distribution Gd generated from the Dirichlet process for each document can share values with other documents
Example: distributions generated from a hierarchical Dirichlet process with concentration parameters γ=1 and α=1, where the base distribution of the shared distribution is the beta distribution H0=Beta(2,2)
Shared distribution H
Six individual distributions Gd
The individual distributions Gd, generated from Dirichlet processes with the shared distribution as the base distribution, assign probability only to values that have high probability in the shared distribution
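A minimal sketch of the two-level construction above, using truncated stick-breaking for both levels; because the shared distribution is discrete, the per-document distributions reuse its atoms (values are shared across documents):

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, K):
    """Truncated stick-breaking weights (sum to slightly less than 1)."""
    nu = rng.beta(1.0, alpha, size=K)
    return nu * np.concatenate([[1.0], np.cumprod(1.0 - nu)[:-1]])

gamma, alpha, K = 1.0, 1.0, 100  # concentration parameters and truncation level (assumed)

# shared distribution H: stick-breaking weights, atoms drawn from the base H0 = Beta(2, 2)
shared_w = stick_breaking(gamma, K)
shared_atoms = rng.beta(2.0, 2.0, size=K)

# per-document distribution Gd: stick-breaking weights, atoms drawn from the *discrete*
# shared distribution H, so the same atoms reappear across documents
def document_distribution():
    w = stick_breaking(alpha, K)
    atom_idx = rng.choice(K, size=K, p=shared_w / shared_w.sum())
    return w, shared_atoms[atom_idx]

G = [document_distribution() for _ in range(6)]  # six individual distributions Gd
```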
8.2.2 Chinese Restaurant Franchise
A hierarchical Dirichlet process can be constructed by the “Chinese restaurant franchise”
Consider a franchise of Chinese restaurants (documents) with a shared menu
Each restaurant has an infinite number of tables
The first customer (word) at a table chooses a dish (topic) from the menu, and all customers at that table get the same dish
As in the Chinese restaurant process, the probability of being seated at a table is proportional to the number of customers (words) already at that table
The probability that a dish (topic) is chosen for a new table is proportional to the number of tables serving that dish
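A minimal simulation of the franchise analogy (prior only, without the word likelihood that the actual Gibbs sampler would use; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def chinese_restaurant_franchise(num_words_per_doc, alpha, gamma):
    """Simulate table and dish (topic) assignments in the Chinese restaurant franchise."""
    dish_counts = []   # number of tables serving each dish, across all restaurants
    topics = []        # topics[d][n] = dish of the n-th customer (word) of document d
    for N_d in num_words_per_doc:
        table_counts, table_dish = [], []  # per-restaurant table occupancy and served dish
        doc_topics = []
        for n in range(N_d):
            # seat the customer: existing table with prob ∝ N_dl, new table with prob ∝ alpha
            p = np.array(table_counts + [alpha], dtype=float)
            l = rng.choice(len(p), p=p / p.sum())
            if l == len(table_counts):
                # new table: choose its dish ∝ number of tables serving it, new dish ∝ gamma
                q = np.array(dish_counts + [gamma], dtype=float)
                k = rng.choice(len(q), p=q / q.sum())
                if k == len(dish_counts):
                    dish_counts.append(0)
                dish_counts[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[l] += 1
            doc_topics.append(table_dish[l])
        topics.append(doc_topics)
    return topics, len(dish_counts)

topics, num_topics = chinese_restaurant_franchise([50] * 10, alpha=1.0, gamma=1.0)
print("number of topics:", num_topics)
```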
Example of the Chinese restaurant franchise
The topic model based on the Chinese restaurant franchise process is
By using Gibbs sampling, the number of topics can be estimated without the need to determine the number of topics in advance.
Specifically, we sample a table for each word and the topic assignment for each table.
The probability of choosing table l for the nth word in document d is given by
α is the concentration parameter of the Dirichlet process for each document
γ is the concentration parameter of the Dirichlet process for the whole document set
tdn is the table to which the nth word of document d is assigned
Zdl is the topic to which the lth table of document d is assigned
T=(t11,…,tDNd) is the set of table assignments for each word
Z is the set of topic assignments for the tables
Ndl is the number of words that chose table l in document d
Mk is the number of tables that chose topic k
M is the total number of tables
The probability that topic k is chosen for the lth table of document d is given by
8.2.3 Posterior Probability Expression
Sampling tables and topics in the Chinese restaurant franchise
In the posterior representation
Estimate the hierarchical Dirichlet process by sampling topic Z, topic distribution Θ, and shared topic distribution π
The shared topic distribution represents the likelihood of a topic being selected in the entire document set.
Appendix A Typical Probability Distributions
A.1 Bernoulli distribution
A.2 Categorical distribution
A.3 Normal distribution
A.4 Multidimensional normal distribution
A.5 Poisson distribution
A.6 Exponential distribution
A.7 Gamma distribution
A.8 Wishart distribution
A.9 Beta distribution
A.10 Dirichlet distribution