Overview of Chinese Restaurant Process
The Chinese Restaurant Process (CRP) is a stochastic model used to intuitively explain the Dirichlet Process (DP) described in “Overview of the Dirichlet Process (DP), its algorithms, and examples of implementations“. It is especially frequently used for clustering problems.
Rules of CRP: The following is a parable about a Chinese restaurant.
- There is a Chinese restaurant with an infinite number of tables.
- The first customer sits at one table at random.
- The second and subsequent customers follow these rules
- The probability of sitting at a table where someone is already sitting is proportional to the number of people sitting at that table.
- The probability of sitting at a new table is proportional to some constant \(\alpha\).
These can be expressed in the formula as follows.
- Probability of sitting at existing table \(k\):\[P(probability of sitting at existing table k)=\frac{n_k}{n+\alpha}\]\(n_k\):table, k:number of people sitting at k, n:number of all customers who have visited the restaurant so far, \(\alpha\):parameter controlling probability of sitting at new table Parameters to control the probability of sitting at a new table
- Probability of sitting at a new table: \[P(probability of sitting at a new table)=\frac{\alpha}{n+\alpha}\].
Characteristics of CRP
- Rich-get-richer effect: The more people at a table, the higher the probability that a new customer will sit at the table, which tends to create large clusters.
- Infinite dimensionality: Since there are an infinite number of tables, there is always the possibility of adding new tables (clusters).
- Relationship to Dirichlet process: CRP is a concrete representation of distributional sampling by a Dirichlet process. The CRP is used to describe the behavior of clustering in a Dirichlet process\(DP(\alpha,H\)).
Applications include the following.
- Infinite mixture models: data clustering when the number of clusters is not a priori determined
- Natural language processing: topic models (e.g. HDP-LDA)
- Recommendation systems: dynamic assignment of user preferences to an infinite number of categories
- Genetic analysis: modeling to discover unknown genotypes
Implementation Example
An example of a simple implementation of the Chinese restaurant process (CRP) using Python is described.
import random
import matplotlib.pyplot as plt
# Parameter α (controls the probability of sitting at a new table)
alpha = 1.0
n_customers = 100 # Number of guests
# List to store the number of people in each table
tables = []
# Simulation of CRP
for i in range(n_customers):
if random.random() < alpha / (len(tables) + alpha):
# Sitting at a new table
tables.append(1)
else:
# Sit at an existing table (proportional to probability)
table_probabilities = [count / (len(tables) + alpha) for count in tables]
chosen_table = random.choices(range(len(tables)), weights=table_probabilities)[0]
tables[chosen_table] += 1
# Visualize results
plt.bar(range(1, len(tables) + 1), tables)
plt.xlabel('Table (Cluster) Number')
plt.ylabel('Number of Customers')
plt.title('Chinese Restaurant Process Simulation')
plt.show()
print(f"Total tables (clusters) formed: {len(tables)}")
Code Description
- Parameter settings
- alpha: Controls the probability of sitting at a new table. The higher the value, the more likely a new table will be created.
- n_customers: Number of customers (= number of data points).
- CRP rules
- Probability of sitting at a new table: \(\frac{\alpha}{n+\alpha}\)
- Probability of sitting at an existing table\(k\):\(\frac{n_k}{n+\alpha}\)
- Visualization: Visualize how many tables (clusters) were created in the end and check them in the bar chart.
Image of the result of the run
- If \(\alpha\) is small → large tables dominate (small number of clusters)
- \When the number of clusters is large → New tables tend to increase and the number of clusters also increases
Application Examples
Specific applications of the Chinese Restaurant Process (CRP) will be described.
1. clustering in a mixture model
Examples: text and image classification
Objective: To dynamically estimate clusters from data without a priori determination of the number of clusters.
Applications:
-
- Text classification: Discover unknown categories from text (e.g., automatic classification of news articles)
- Image classification: Grouping from unlabeled data based on image features
Effectiveness: Enables clustering based on data, so adding new categories (tables) is flexible
2. topic model (HDP-LDA)
Example: Natural Language Processing (NLP)
Objective: To automatically estimate the number of topics in a document without pre-setting the number of topics.
Application:
-
- Topic extraction of news articles
- Topic estimation of social media posts
- Topic analysis of word-of-mouth reviews
Effectiveness: Conventional LDA (Latent Dirichlet Allocation Method) requires manual setting of the number of topics (k), but with CRP, topics can be increased or decreased dynamically
3. genetic analysis
Example: Bioinformatics
Objective: To discover unknown gene groups by clustering gene expression data
Application:
-
- Discovery of unknown genotypes
- Extraction of gene patterns for each disease
Grouping of DNA sequences
Effectiveness: New clusters (tables) are added each time a new gene pattern is discovered, allowing flexible models to be built
4. recommendation system
Example: E-commerce and video streaming services
Objective: To dynamically generate categories (tables) according to user preferences and find new preference groups.
Applications:
-
- Movie recommendation: Increase clusters (genres and themes) each time a new movie is introduced.
- Recommendation for e-commerce sites: Detect new purchasing trends based on user behavior
Effectiveness: No fixed number of clusters, so flexible response to new groups even if user preferences change
5. anomaly detection
Example: Cyber security
Objective: To dynamically identify new clusters (tables) of anomalous behaviors that do not belong to conventional patterns.
Applications:
-
- Network anomaly detection
- Detection of unauthorized access
Effectiveness: When an unknown attack method occurs, it is immediately added as a new cluster, enabling real-time detection
reference book
I will discuss reference books on the Chinese restaurant process (CRP) and related Bayesian nonparametric models.
2. Pattern Recognition and Machine Learning – Christopher M. Bishop
Summary: Covers the basics of probabilistic models and clustering, including CRPs and Dirichlet processes. You can also learn about graphical models and variational inference.
Recommendation: Ideal for those who want to understand the relevance of machine learning.
3. Machine Learning: A Probabilistic Perspective – Kevin P. Murphy
Summary: Explains Bayesian nonparametric methods including CRPs, Dirichlet processes, and Hierarchical Dirichlet Processes (HDP) described in “Hierarchical Dirichlet Process (HDP) Overview, Algorithm and Implementation Examples” with plenty of examples.
Recommendation: For those who want to apply the methods to practical work, as there are explanations with implementations. 4.
5. Chinese Restaurant Process, Stick Breaking Process, and Dirichlet Process Mixture Model
コメント