Topic Model Theory and Implementation


About a Topic Model

A topic model is a probabilistic generative model for extracting latent topics from a set of documents and understanding their content. Using a topic model, it is possible to estimate what topics a given document covers; applied to large-scale text analysis, for example, it can reveal what topics appear across large numbers of news articles and blog posts and what trends they show. It is also among the simplest of probabilistic generative models.

The base models of the topic model are the “unigram model” and the “mixed unigram model.” These have been extended to “Probabilistic Latent Semantic Analysis (PLSA)” and “Latent Dirichlet Allocation (LDA)”, and further to nonparametric (infinite-dimensional) extensions such as the Chinese Restaurant Process (CRP), the Stick Breaking Process (SBP), and the Hierarchical Dirichlet Process (HDP).

The algorithms for these models are built on machine learning models based on probability distributions, as described in “About Probability Generation Models”.

In addition to text analysis, topic models are applied to various fields such as music, images, videos, and bioinformatics.

Topic models can be used for applications such as the following:

  1. News article analysis: By applying a topic model to the text data of news articles, it is possible to determine what topics are covered. For example, it is possible to analyze what topics were popular at a particular time of the year, or what politicians or companies were covered.
  2. Social media analysis: By applying a topic model to posts from Twitter, Facebook, and other social media sites, it is possible to determine which topics are popular and what kinds of emotions are often expressed. A topic model can also be used to extract posts related to specific keywords.
  3. Recommendations: Topic models can be used to make product and content recommendations by estimating the topics a user is interested in. For example, the genres of books a user is interested in can be estimated from the books they have read, and recommendations made accordingly.
  4. Image Classification: Using topic models, potential topics can be extracted from image features to classify images. For example, latent features can be extracted from facial images to classify facial expressions.
  5. Music genre classification: Using topic models, latent topics can be extracted from music waveforms to classify music genres. For example, it is possible to classify what genre of music a song is based on its rhythmic pattern and pitch.

This blog discusses this topic model in more detail below.

Implementation

A topic model is a statistical model for automatically extracting topics (themes or categories) from large amounts of text data. Examples of text data here include news articles, blog posts, tweets, and customer reviews. A topic model works by analyzing patterns of word occurrences in the data to estimate which topics exist and how strongly each word relates to each topic.

This section provides an overview of this topic model and various implementations (topic extraction from documents, social media analysis, recommendations, topic extraction from image information, and topic extraction from music information), mainly using the python library.

Variational methods are used to find optimal solutions over functions or probability distributions, and are among the optimization methods widely used in machine learning and statistics. In particular, they play an important role in machine learning models such as probabilistic generative models and variational autoencoders (VAE).

Variational Bayesian Inference is one of the probabilistic modeling methods in Bayesian statistics, and is used when the posterior distribution is difficult to obtain analytically or computationally expensive.

This section provides an overview of the various algorithms for this variational Bayesian learning and their python implementations in topic models, Bayesian regression, mixture models, and Bayesian neural networks.

Theory

Classically, the study of language is the domain of linguistics, where research has accumulated through hypotheses generated from the experience and subjectivity of linguists and refined through counterexamples. In contrast, the field that treats language statistically is called statistical linguistics, or, from an engineering standpoint, natural language processing, and it is a rapidly growing area of research thanks to the recent increase in electronic texts and the need to process them. Although this field can be considered part of linguistics, it differs from traditional linguistics in that it involves statistical and mathematical modeling and large-scale experimental verification based on purely objective data. By taking a statistical view of language, it becomes possible to model vast numbers of linguistic phenomena automatically by computer, and to deal mathematically with ambiguities, exceptions, and contextual structures that cannot be captured by rules.

In many areas of language processing, language models have emerged as the key to processing language. In textbooks, the “language model” is introduced with a mathematical definition such as “a language is a subset L ⊆ Σ* of the set of sequences of symbols x ∈ Σ.” In a more concrete image, however, a language model is something familiar to everyone who speaks a language and continues to use it unconsciously.

Objectively, language can be thought of as a sequence of symbols. When looked at in detail it is composed of letters, but here we will assume that language, like English, is composed of words.

One thing that is immediately noticeable when looking at a language’s word sequence is that the frequency of words is highly skewed. The inverse relationship between word rank and word frequency is known as Zipf’s law, and is one of the basic facts discovered in the 1930s. In recent years, this law has become known as a power law common to many discrete phenomena in nature beyond language.
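Stated as a formula, Zipf’s law says that the frequency of the word of rank r is roughly inversely proportional to that rank (this is a sketch of the standard formulation, with the exponent s close to 1 for natural-language corpora):

\[
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1 .
\]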

To express such uncertainty, we need a probability distribution over the parameter p itself. The simplest such distribution is the Dirichlet distribution, given by the following equation.
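The equation itself does not survive in this excerpt; the standard form of the Dirichlet distribution over a probability vector p = (p1, …, pV) with parameters α1, …, αV is

\[
\mathrm{Dir}(\boldsymbol{p} \mid \alpha_1,\dots,\alpha_V)
= \frac{\Gamma\!\left(\sum_{v=1}^{V}\alpha_v\right)}{\prod_{v=1}^{V}\Gamma(\alpha_v)}
\prod_{v=1}^{V} p_v^{\alpha_v - 1}.
\]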

To “notice” something means to observe or perceive it carefully; when a person notices a situation or thing, it means that he or she has become aware of some information or phenomenon and has formed a feeling or understanding about it. Becoming aware is an important process of gaining new information and understanding by paying attention to changes and events in the external world. In this article, we discuss this awareness and the application of artificial intelligence technology to it.

A topic model is a probabilistic model for documents. In order to understand the concept of probabilistic models, we will discuss the unigram model, which is the simplest probabilistic model for documents. We will also discuss how to estimate a probabilistic model, taking the unigram model as the running example.
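For reference, in the standard unigram model every word of every document is assumed to be drawn independently from a single word distribution φ = (φ1, …, φV), so the probability of a document set W is

\[
p(\boldsymbol{W} \mid \boldsymbol{\phi}) = \prod_{d=1}^{D} \prod_{n=1}^{N_d} \phi_{w_{dn}},
\]

where w_{dn} is the n-th word of document d and N_d is the length of document d.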

In this article, we will discuss the mixed unigram model, which introduces topics into the unigram model.

The mixed unigram model assumes that a single document has a single topic. However, in reality, a single document may have multiple topics. For example, a newspaper article about “the deliberation of a bill on medical care in the Diet” has two topics, “medical care” and “politics,” and a newspaper article on “the economic effects of the Olympic Games” has two topics, “sports” and “economy.” If we wanted to represent such topic combinations in a mixed unigram model, we would need to prepare word distributions for all topic combinations such as “medical care + politics” and “sports + economy.” In that case, the number of word distributions to be estimated would be huge, and it would be impossible to estimate them properly.

The solution to this problem is the topic model, which assumes that a single document has multiple topics. In the mixed unigram model, there is one topic distribution shared by the entire set of documents, while in the topic model there is a topic distribution θd = (θd1, …, θdK) for each document d.
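As a concrete illustration, here is a minimal sketch (using NumPy, with made-up hyperparameters and sizes) of the generative process that a topic model such as LDA assumes: each document draws its own topic distribution θd, and each word first draws a topic from θd and then a word from that topic’s word distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3        # number of topics (assumed)
V = 1000     # vocabulary size (assumed)
D = 5        # number of documents
N_d = 100    # words per document (fixed here for simplicity)

alpha = np.full(K, 0.1)   # Dirichlet prior over per-document topic distributions
beta = np.full(V, 0.01)   # Dirichlet prior over per-topic word distributions

# Per-topic word distributions phi_k, shared by all documents
phi = rng.dirichlet(beta, size=K)          # shape (K, V)

documents = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)         # per-document topic distribution
    words = []
    for _ in range(N_d):
        z = rng.choice(K, p=theta_d)       # draw a topic for this word
        w = rng.choice(V, p=phi[z])        # draw a word from that topic
        words.append(w)
    documents.append(words)
```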

In the previous section, we assumed a situation where only the words contained in a document are given, but there are cases where other information is available as well. For example, a product review article may be accompanied by information such as the product category and a rating score, and an academic paper may include information such as the author, journal name, and year of publication. This kind of information other than words is called side (supplementary) information. In this section, we describe a model for generating a set of documents with supplementary information.

One topic model for documents with supplementary information is the joint topic model. In the joint topic model, each topic has its own auxiliary information distribution, and auxiliary information is assumed to be generated according to the topic.

In the correspondence topic model, auxiliary information is generated using the topics that generated the words. However, there are cases where auxiliary information is not related to the content (the words). In particular, in social bookmarking, where tags (supplementary information) can be freely added and shared on the Web, a tag used as a reminder to “read later” can be attached to articles on politics or entertainment alike. Subjective evaluation tags such as “this is great” or “★★★ (three stars)” can also be attached regardless of the topic. Likewise, in photo-sharing services, the name of the camera model, such as “Nikon” or “Canon,” can be attached regardless of what is in the photo.

Therefore, a topic model that can handle supplementary information unrelated to the content is the noisy correspondence topic model, an extension of the correspondence topic model. By using the noisy correspondence topic model, it is possible to automatically determine whether the content and the supplementary information are related, which is expected to improve the accuracy of supplementary-information prediction and of retrieval using the supplementary information.

In considering various tasks, it is possible that the topics may be correlated. For example, in the case of newspaper articles, there are many articles that have two topics, politics and economics, but few articles that have two topics, politics and entertainment. Ordinary topic models cannot handle such correlations between topics, but the correlated topic model can.

In the correlated topic model, a covariance matrix is used to model correlations between topics. The pachinko allocation model instead captures relationships between topics by introducing a hierarchical structure over them.
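Concretely, in the standard correlated topic model formulation, the per-document topic proportions come from a logistic normal distribution rather than a Dirichlet, so that the covariance matrix Σ can capture correlations between topics:

\[
\boldsymbol{\eta}_d \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad
\theta_{dk} = \frac{\exp(\eta_{dk})}{\sum_{k'=1}^{K}\exp(\eta_{dk'})}.
\]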

A topic model for visualizing documents and topics is called probabilistic latent semantic visualization (PLSV). PLSV places documents with similar topics near each other in the visualization, which provides an overall picture of large-scale data and enables intuitive browsing and retrieval.

The topic model can be applied to any data that can be expressed as a BOW, even if it is not text. For example, in the case of purchase histories, if each user is regarded as a document and each product as a vocabulary item, the data can be handled in the same way as documents. In addition, even data that is not originally represented as a BOW, such as images, can be converted into a BOW representation, and topic models can also be applied to network data.

By using vector quantization, a collection of various vectors can be converted into a BOW representation, and a topic model can be applied. In vector quantization, all vectors are clustered using a clustering method, and the vectors are rewritten with cluster labels to convert them into a BOW representation.
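A minimal sketch of this idea, assuming scikit-learn and collections of local feature vectors as input (the feature extraction itself is out of scope here): k-means assigns every feature vector to a cluster, and counting the cluster labels per item yields a “visual word” BOW.

```python
import numpy as np
from sklearn.cluster import KMeans

def vectors_to_bow(feature_sets, n_words=100, random_state=0):
    """Convert per-item collections of feature vectors into BOW vectors
    via vector quantization (k-means over all vectors)."""
    all_vectors = np.vstack(feature_sets)
    kmeans = KMeans(n_clusters=n_words, random_state=random_state, n_init=10)
    kmeans.fit(all_vectors)

    bows = []
    for vectors in feature_sets:
        labels = kmeans.predict(vectors)                 # cluster label = "visual word"
        counts = np.bincount(labels, minlength=n_words)  # word counts for this item
        bows.append(counts)
    return np.array(bows)

# Example: 3 "images", each with random 64-dimensional descriptors
rng = np.random.default_rng(0)
feature_sets = [rng.normal(size=(200, 64)) for _ in range(3)]
bow_matrix = vectors_to_bow(feature_sets, n_words=50)
```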

We will discuss the stochastic block model, which is a typical probabilistic model of networks. A network consists of a set of nodes and a set of links between the nodes. In the case of social networks, a person is represented by a node, and two nodes are linked when the people are friends. In the stochastic block model, each node has one topic (group), and the existence of a link depends on the topics of the nodes. For example, in the case of social networks, the model assumes that each person belongs to a group and that friendships with other people are determined by which groups they belong to.
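As a rough sketch of this generative assumption (with hypothetical group-to-group link probabilities, using NumPy): each node is assigned a group, and the probability of a link between two nodes depends only on the pair of groups.

```python
import numpy as np

rng = np.random.default_rng(0)

n_nodes, n_groups = 30, 3
group_prob = np.array([0.5, 0.3, 0.2])          # mixing ratio over groups (assumed)
link_prob = np.array([[0.80, 0.10, 0.05],       # group-to-group link probabilities (assumed)
                      [0.10, 0.70, 0.05],
                      [0.05, 0.05, 0.60]])

groups = rng.choice(n_groups, size=n_nodes, p=group_prob)

# Sample a symmetric adjacency matrix: link probability depends only on the groups
adjacency = np.zeros((n_nodes, n_nodes), dtype=int)
for i in range(n_nodes):
    for j in range(i + 1, n_nodes):
        if rng.random() < link_prob[groups[i], groups[j]]:
            adjacency[i, j] = adjacency[j, i] = 1
```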

Estimating the number of topics in a mixture model is done by using the Dirichlet process (DP). The Dirichlet process is specified by the base distribution H and the concentration parameter α.
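In notation, drawing a random distribution G from a Dirichlet process with base distribution H and concentration parameter α is written

\[
G \sim \mathrm{DP}(\alpha, H),
\]

where H determines the average shape of the drawn distributions and α controls how concentrated they are around H.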

A mixture model with an infinite number of element models using a Dirichlet process is called an infinite mixture model or a Dirichlet process mixture model.

By using the infinite mixture unigram model, it is not necessary to set the number of topics in advance, and a mixture unigram model with the appropriate number of topics for the data can be estimated.

The infinite mixture model has both an infinite-dimensional mixing ratio and an infinite number of component models, but by using the Chinese restaurant process (CRP), the model can be estimated while representing only a finite number of mixing ratios and component models.
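Concretely, in the standard CRP formulation, when the n-th customer (data point) arrives, the probability of joining an existing table k already occupied by n_k customers, or of opening a new table, is

\[
p(\text{existing table } k) = \frac{n_k}{n - 1 + \alpha}, \qquad
p(\text{new table}) = \frac{\alpha}{n - 1 + \alpha},
\]

so only the finitely many tables actually occupied by the data need to be represented.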

A topic model is a generic term for a generative model for analyzing documents written mainly in natural language. LDA assumes that a potential topic (politics, sports, music, etc.) exists behind a document that is a list of words, and that each word in the document is generated based on that topic. By using topics learned with a large amount of document data, it will be possible to classify and recommend news articles and retrieve semantically relevant documents from a given word query. In recent years, there have also been cases where LDA has been applied not only to natural language processing but also to image and genetic data.

We now describe collapsed Gibbs sampling for LDA. For the mixture model, we considered a new model in which the parameters of the probabilistic model were marginalized out and sampled the latent variables one by one; for LDA, the algorithm can be derived by exactly the same procedure.
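For reference, the standard collapsed Gibbs sampling update for LDA with symmetric hyperparameters α and β samples the topic assignment z_{dn} of word w_{dn} according to

\[
p(z_{dn} = k \mid \boldsymbol{W}, \boldsymbol{Z}_{\setminus dn}, \alpha, \beta)
\;\propto\;
\bigl(N_{dk\setminus dn} + \alpha\bigr)\,
\frac{N_{k w_{dn}\setminus dn} + \beta}{N_{k\setminus dn} + V\beta},
\]

where N_{dk} is the number of words in document d assigned to topic k, N_{kv} is the number of times word v is assigned to topic k over the corpus, N_k is the total number of words assigned to topic k, V is the vocabulary size, and the subscript \setminus dn means the current word is excluded from the counts.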

Implementation

Link to an OSS topic model implementation in Java, containing various tutorials and information on a range of applications. (If you are working in Python, see Antoniak’s Little Mallet Wrapper.) For more information on the Java language and environment setup, see “Java, Scala and Kotlin as general-purpose application building environments.”

A CRP (Chinese restaurant process) is a stochastic process that describes a particular data-generating process. Mathematically, at each step the process samples an integer: an integer that has already appeared is sampled with probability proportional to the number of times it has been sampled so far, while a new, previously unseen integer is sampled with a probability controlled by a constant concentration parameter.
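A minimal sketch of this process in Python (with a hypothetical concentration parameter alpha), returning the sequence of sampled integers (table assignments):

```python
import random
from collections import Counter

def sample_crp(n_steps, alpha=1.0, seed=0):
    """Sample table assignments from a Chinese restaurant process."""
    rng = random.Random(seed)
    counts = Counter()          # counts[k] = how many times integer k has been sampled
    assignments = []
    for _ in range(n_steps):
        total = sum(counts.values())
        # New integer with probability alpha / (total + alpha);
        # existing integer k with probability counts[k] / (total + alpha).
        if rng.random() < alpha / (total + alpha):
            k = len(counts)     # new, previously unseen integer
        else:
            r = rng.random() * total
            cumulative = 0.0
            for k, c in counts.items():
                cumulative += c
                if r < cumulative:
                    break
        counts[k] += 1
        assignments.append(k)
    return assignments

print(sample_crp(20, alpha=2.0))
```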

In this article, we describe the implementation of this CRP using Anglican, a framework for probabilistic programming of Clojure, and its combination with a mixed Gaussian model.

A link to a collection of topic model libraries in python. See “Python and Machine Learning” for an overview of python and its environment settings.

In this section, we will provide a brief overview of LDA (Latent Dirichlet Allocation), the most well-known topic model, and how it is implemented using Python.
We will also visualize the results of the implementation using PyLDAvis and word clouds.
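As a rough sketch of what such an implementation can look like (assuming the gensim library and a tiny toy corpus; preprocessing and the PyLDAvis/word-cloud visualization steps are omitted here):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: each document is already tokenized
documents = [
    ["politics", "election", "government", "policy"],
    ["soccer", "game", "team", "score"],
    ["election", "vote", "policy", "parliament"],
    ["team", "player", "game", "league"],
]

dictionary = corpora.Dictionary(documents)                 # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]    # BOW representation

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

# Show the top words of each estimated topic
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```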

This tutorial is aimed at beginners and intermediate users of R with the aim of showcasing how to perform basic topic modeling on textual data using R and how to visualize the results of such a model. The aim is not to provide a fully-fledged analysis but rather to show and exemplify selected useful methods associated with topic modeling. 

For an overview of the R language and its environment settings, see “R Language and Machine Learning.”

 
