Hierarchical Dirichlet Process (HDP) Overview, Algorithm and Implementation Examples


Hierarchical Dirichlet Process (HDP) Overview

The Hierarchical Dirichlet Process (HDP) is a Bayesian nonparametric method for infinite mixture models. It is used when multiple groups of data should share a common set of clusters while each group keeps its own cluster weights. An overview is given below.

The basic idea of HDP is as follows:

  • Dirichlet Process (DP): The Dirichlet process, described in “Overview of the Dirichlet Process (DP), its algorithms, and examples of implementations,” is a stochastic process that generates an infinite-dimensional probability distribution. The number of clusters does not need to be fixed in advance, and new clusters may be created as new data arrive.
  • Hierarchical structure: In HDP, there are multiple groups (e.g., each document in a document set), each with its own cluster distribution, but sharing a common “parent” distribution as a whole. This parent distribution itself follows a Dirichlet process.
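
The cluster-growth behavior of the underlying Dirichlet process can be illustrated with a short simulation of the Chinese restaurant process, a standard representation of the DP. This is a minimal sketch: the concentration parameter and the number of data points below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0   # DP concentration parameter (arbitrary choice)
tables = []   # number of customers at each "table" (cluster)

for n in range(100):
    # A new customer joins an existing table with probability proportional
    # to its size, or opens a new table with probability proportional to alpha.
    weights = np.array(tables + [alpha], dtype=float)
    probs = weights / weights.sum()
    choice = rng.choice(len(probs), p=probs)
    if choice == len(tables):
        tables.append(1)      # a new cluster is created on the fly
    else:
        tables[choice] += 1

print(f"{len(tables)} clusters emerged for 100 data points")
```

Note that the number of clusters is never fixed in advance: it grows (roughly logarithmically in the number of points) as data are added, which is exactly the property HDP inherits at both levels of its hierarchy.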

As a mathematical structure, the HDP is described as follows:

  • Parent Dirichlet process: \(G_0\sim DP(\gamma,H)\)
    • \(\gamma\): A parameter indicating the concentration of the parent distribution
    • \(H\): Base distribution (prior distribution of clusters)
  • Dirichlet process for each group: \(G_j\sim DP(\alpha,G_0)\) for each group \(j\)
    • \(\alpha\): Parameter that determines the cluster concentration within each group
    • \(G_0\): Cluster distribution sampled from the parent distribution
  • Observations: Each data point \(x_{ij}\) in group \(j\) is generated from a cluster parameter \(\theta_{ij}\)
    • \(\theta_{ij}\sim G_j,\quad x_{ij}\sim F(\theta_{ij})\)
    • \(F(\theta_{ij})\): the data generation model (likelihood)
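
The two-level construction above can be sketched with a truncated stick-breaking simulation. This is a minimal illustration, not a full inference procedure: the truncation level K, the concentration values, and the number of groups are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(concentration, num_atoms, rng):
    """Truncated stick-breaking weights: w_k = v_k * prod_{l<k} (1 - v_l)."""
    v = rng.beta(1.0, concentration, size=num_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w = v * remaining
    return w / w.sum()  # renormalize because of the truncation

gamma, alpha, K = 1.0, 1.0, 20   # concentrations and truncation level (assumptions)

# Global weights of the parent distribution G0 ~ DP(gamma, H)
beta = stick_breaking(gamma, K, rng)

# Because G0 is discrete, each group's weights over the *same* atoms
# follow pi_j ~ Dirichlet(alpha * beta): shared clusters, group-specific weights.
pi = np.array([rng.dirichlet(alpha * beta) for _ in range(3)])
```

Each row of `pi` is one group's cluster distribution \(G_j\): all groups place mass on the same K candidate clusters drawn from the parent, but with different weightings.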

Intuitive image

  • The parent distribution \(G_0\) is like a “menu” that determines which clusters exist overall.
  • Each group's distribution \(G_j\) is like an individual's selections from that menu: the clusters are shared across groups, but each group has its own weighting over them.
Implementation Example

The following is a simple example of using a Hierarchical Dirichlet Process (HDP) in Python with the gensim library.

Basic implementation of HDP (for the topic model)

from gensim import corpora, models

# Sample documents (text data from which you want to extract topics)
documents = [
    "apple orange banana",
    "apple apple orange",
    "banana mango apple",
    "car bus train",
    "car bike bus",
    "bus train plane",
    "apple car orange",
    "banana bus car"
]

# Word tokenization
texts = [doc.split() for doc in documents]

# Create a dictionary
dictionary = corpora.Dictionary(texts)

# Corpus (each document is converted to a vector)
corpus = [dictionary.doc2bow(text) for text in texts]

# Learning HDP model
hdp = models.HdpModel(corpus, id2word=dictionary)

# Displays each topic and its word distribution
print("=== Topic and word distribution ===")
for topic_id, words in hdp.show_topics(num_topics=5, num_words=5):
    print(f"topic {topic_id}: {words}")

Code Description

  1. Data preparation: Define a small set of documents and split each into word tokens.
  2. Preprocessing: Index the words with gensim’s Dictionary and convert each document to bag-of-words (BoW) format.
  3. HDP model training: Train an HdpModel using the BoW corpus and the dictionary as input.
  4. Result output: Display the top words for each topic.

Example output (topic weights will vary between runs)

=== Topic and word distribution ===
topic 0: 0.400*"apple" + 0.300*"orange" + 0.200*"banana" + 0.100*"car"
topic 1: 0.500*"car" + 0.250*"bus" + 0.150*"train" + 0.100*"bike"

Application Points

  • Custom settings: control the sparsity of the topic distribution by adjusting alpha and gamma parameters.
  • Visualization: pyLDAvis allows you to intuitively see the relationship between topics and words.
  • Automatic determination of the number of clusters: A strength of HDP is that the number of topics is not fixed in advance, but is determined dynamically based on the data.
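
The effect of the concentration parameter on topic sparsity (the “custom settings” point above) can be illustrated with a small NumPy experiment, independent of any particular library. The dimension, parameter values, and sample counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics (arbitrary)

def mean_entropy(concentration, draws=2000):
    """Average entropy of topic-weight vectors from a symmetric Dirichlet."""
    samples = rng.dirichlet(np.full(K, concentration), size=draws)
    p = np.clip(samples, 1e-12, None)  # avoid log(0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

# A small concentration yields sparse (peaked, low-entropy) topic weights;
# a large one yields near-uniform (high-entropy) weights.
sparse = mean_entropy(0.1)
dense = mean_entropy(10.0)
print(sparse < dense)  # sparse distributions have lower average entropy
```

The same intuition applies to HDP’s parameters: smaller concentrations push each group toward a few dominant topics, larger ones spread mass more evenly.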
Application Examples

The following are specific examples of actual applications where HDP is used.

1. Anomaly detection and design review in the automotive industry

Application examples:

    • Sensor anomaly detection: Sensor data (engine temperature, battery voltage, etc.) collected from vehicles are clustered using HDP to find unknown anomaly patterns.
    • Design review: Extract common design issues and frequent failure patterns by topic modeling of past design change logs and failure reports submitted by engineers.

Point of View:

    • Since the number of topics is not known in advance, HDP, which automatically determines the number of clusters, is useful.
    • Anomalous sensor patterns (new topics) can be detected automatically.
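
One simple post-processing heuristic consistent with the bullets above is to flag points assigned to very small, newly formed clusters as candidate anomalies. The cluster labels and the size threshold below are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np
from collections import Counter

# Hypothetical cluster assignments produced by a fitted nonparametric model
assignments = np.array([0, 0, 0, 0, 1, 1, 1, 2, 0, 1, 3])

counts = Counter(assignments.tolist())
min_size = 2  # clusters smaller than this are treated as candidate anomalies (assumption)

# Flag every point whose cluster has fewer than min_size members
anomalies = [i for i, c in enumerate(assignments.tolist()) if counts[c] < min_size]
print(anomalies)  # → [7, 10]: the singleton clusters 2 and 3
```

In an HDP setting, such singleton clusters often correspond to sensor patterns the model has not seen before, which is what makes them useful anomaly candidates.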

2. Document classification and topic models

Application examples:

    • News article classification: Classify a large number of news articles by topic using HDP. For example, automatically generate categories for politics, economics, sports, etc.
    • Organizing FAQs: Classify inquiries received by customer support based on topics to find unclassified inquiry patterns.

Point of View:

    • With LDA, the number of topics must be determined in advance, but with HDP, new topics are added naturally.
    • Even if the number of documents increases, the number of clusters can be readjusted while learning.

3. Genetic data analysis

Application examples:

    • Oncogene classification: clustering gene expression data with HDP to discover new genotypes and mutation patterns.
    • Classification of cell types: Analyze single-cell RNA sequencing data to identify unknown cell types.

Point of View:

    • Even if the number of clusters is unknown, HDP can add new cell types and gene groups according to the data.
    • Promotes new discoveries in medical research.

4. Recommendation systems

Application examples:

    • Recommendation of movies and music: Personalized recommendations are made by clustering preferences based on users’ past viewing history using HDP and discovering new categories.
    • Product recommendation for e-commerce sites: Recommendations that reflect undiscovered patterns of purchasing behavior based on topic modeling of purchase history.

Point of View:

    • HDP can automatically respond to the emergence of new preference clusters whenever trends change.
    • Flexibly update clustering as more content is added.

5. Social media analysis

Application examples:

    • Trend extraction: Detect new trending words and topics in real-time by topic decomposition of social networking postings with HDP.
    • Early detection of flare-ups: Instantly find new topics that attract unusual attention (sudden buzz or flare-ups).

Point of View:

    • Because new topics appear one after another on social media, HDP is better suited to it than LDA, where the number of topics is fixed.
    • It can also detect sudden topics that occur in a short period of time.

HDP’s strengths are especially evident in the following situations:

  • When an unknown number of clusters needs to be handled
  • When you want to dynamically adjust the model as data increases
  • When you want to automatically detect new patterns and anomalies

This makes it applicable to a wide range of fields, including design review and anomaly detection in the automotive industry, medical data analysis, and social media analysis.

Reference Books

This section describes reference books on the Hierarchical Dirichlet Process (HDP).

Theory

Bayesian Nonparametrics

Foundations of Machine Learning, Second Edition (2018)

Author(s): Mohri, Rostamizadeh, Talwalkar
Description: While covering machine learning broadly, it includes chapters on Bayesian and nonparametric models, and HDP is mentioned as a related topic.
Recommendation: For those who want to deepen their understanding of machine learning in general and consider its application to HDP.

Implementation and Applications

Bayesian Analysis with Python (Second Edition, 2018)

Author: Osvaldo Martin
Description: A practical book on Bayesian statistics with Python, including examples of HDP implementation using PyMC3, with many concrete examples for practical use.
Recommendation: For those who want to deepen their understanding of Bayesian statistics by going back and forth between formulas and implementation.

Probabilistic Machine Learning: Advanced Topics (2022)

Author: Kevin P. Murphy
Description: Covers the latest probabilistic machine learning techniques, including HDP and Bayesian nonparametric models, with theory and Python implementation examples.
Recommendation: The book is rich in application examples and is useful for applying to real-world data.

Papers and Online Resources

Hierarchical Dirichlet Processes (2006)

Author: Teh, Jordan, Beal, Blei
Description: The original HDP paper. Covers the basic theory, the Chinese restaurant franchise representation, sampling methods, etc.
Recommendation: A must-read for those who want to understand the mathematical background.
