Thompson Sampling Algorithm Overview and Example Implementation

Thompson Sampling Algorithm

The UCB algorithm described in “Overview and Example Implementation of the Upper Confidence Bound (UCB) Algorithm” is based on frequentist ideas: it uses Hoeffding’s inequality to evaluate how far the sample mean of the rewards obtained from each arm can deviate from the true expected value, and constructs a confidence interval from this bound. At each trial, the arm with the largest upper confidence bound is selected. The algorithm is deterministic in the sense that it always selects the same arm whenever the known information about the arms is identical.

Thompson Sampling, on the other hand, is designed to take uncertainty into account when choosing the best option among multiple alternatives (often called actions or arms). It solves the multi-armed bandit problem described in “Overview of the Multi-armed Bandit Problem, Applicable Algorithms, and Examples of Implementations” by selecting each arm, at every time step, with “the probability that the arm has the maximum expected value”. This is realized by a heuristic called probability matching, which uses Bayesian estimation as described in “About Probabilistic Generative Models”.

With the UCB algorithm, when reward data arrives in batches (for example, the information is only updated once every B = 100 trials), the same arm is often pulled for all B trials of a batch because nothing is updated in between. Probability matching, by contrast, selects the next arm randomly even when the observed data is fixed, so the pulls are distributed appropriately across the arms and stable behavior is obtained even when the batch size is large.
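As a minimal sketch of this behavior (the two arms and their posterior parameters below are hypothetical), the following snippet freezes the posteriors for a batch of B = 100 rounds and shows that Thompson Sampling still spreads its choices across the arms, whereas a deterministic rule such as UCB would pick the same arm all B times:

import random
from collections import Counter

# Hypothetical Beta posteriors for two arms (successes + 1, failures + 1)
posteriors = {"arm0": (8, 4), "arm1": (6, 6)}

B = 100  # batch size: the posteriors are NOT updated inside this loop
choices = []
for _ in range(B):
    samples = {arm: random.betavariate(a, b) for arm, (a, b) in posteriors.items()}
    choices.append(max(samples, key=samples.get))

print(Counter(choices))  # both arms get pulled, e.g. Counter({'arm0': 73, 'arm1': 27})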

The following is an overview of the Thompson Sampling algorithm.

Because the Thompson Sampling algorithm is formulated in the framework of Bayesian estimation, the reward is treated as a random variable following a probability distribution. As an example, assume that the reward \(x_{i,t}\) of arm i follows a Bernoulli distribution \(Ber(x_i|\mu_i)\) with true expected-value parameter \(\mu_i\in[0,1]\).

The conjugate prior distribution of the Bernoulli distribution is the beta distribution, as described in “Various Probability Distributions Used in Stochastic Generative Models”. Using this prior, if arm k is pulled \(n_k\) times and hits \(m_k\) times, the posterior distribution of its true expected-value parameter \(\mu_k\) is obtained as follows.

\begin{eqnarray}p(\mu_k|X_{n_k})&=&\frac{p(X_{n_k}|\mu_k)p(\mu_k)}{p(X_{n_k})}\\&\propto&p(X_{n_k}|\mu_k)p(\mu_k)\\&=&\left(\prod_{t=1}^{n_k}p(x_t|\mu_k)\right)p(\mu_k)\\\ln p(\mu_k|X_{n_k})&=&\sum_{t=1}^{n_k}\ln p(x_t|\mu_k)+\ln p(\mu_k)+const.\\&=&\sum_{t=1}^{n_k}\ln Ber(x_t|\mu_k)+\ln Beta(\mu_k|\alpha,\beta)+const.\\\Rightarrow p(\mu_k|X_{n_k})&=&Beta(\mu_k|m_k+\alpha,\ n_k-m_k+\beta)\end{eqnarray}

According to this result, the posterior distribution of the expected-value parameter \(\mu_k\) of arm k, which has hit \(m_k\) times in \(n_k\) trials, is again a beta distribution whose parameters are the prior parameters \(\alpha\) and \(\beta\) increased by the number of hits \(m_k\) and the number of misses \(n_k-m_k\), respectively.
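As a concrete numerical illustration (the counts below are hypothetical), suppose a uniform prior \(\alpha=\beta=1\) and that arm k has been pulled \(n_k=10\) times with \(m_k=7\) hits. The posterior and its mean are then

\begin{eqnarray}p(\mu_k|X_{n_k})&=&Beta(\mu_k|m_k+\alpha,\ n_k-m_k+\beta)=Beta(\mu_k|8,\ 4)\\\mathbb{E}[\mu_k|X_{n_k}]&=&\frac{m_k+\alpha}{n_k+\alpha+\beta}=\frac{8}{12}\approx 0.67\end{eqnarray}

so the more hits an arm accumulates, the more the posterior mass concentrates around a high value of \(\mu_k\).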

After computing this posterior distribution for every arm, Thompson Sampling determines the arm \(a_t\) to pull at time t as follows.

\begin{eqnarray}a_t&=&arg\max_{k\in\{1,\dots,K\}}\hat{\mu}_k\\\hat{\mu}_k&\sim&Beta(m_k+\alpha,\ n_k-m_k+\beta)\end{eqnarray}

Incidentally, Thompson Sampling does not need to compute the probability that each arm has the maximum expected value; it only needs to select each arm with that probability. Drawing one sample from each posterior and taking the arm with the largest sample achieves exactly this, which is why the procedure above works well.
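The selection rule above can be written in a few lines of Python (a minimal sketch; the counts \(m_k\), \(n_k\) and the prior parameters below are hypothetical):

import random

alpha, beta = 1, 1          # prior Beta(alpha, beta)
m = [7, 3, 5]               # hypothetical numbers of hits m_k for each arm
n = [10, 10, 10]            # hypothetical numbers of trials n_k for each arm

# Draw one sample from each posterior Beta(m_k + alpha, n_k - m_k + beta) and take the argmax
mu_hat = [random.betavariate(m[k] + alpha, n[k] - m[k] + beta) for k in range(len(m))]
a_t = mu_hat.index(max(mu_hat))
print("selected arm:", a_t)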

The concrete procedure of Thompson Sampling is as follows.

1. Initialization: For each action, parameters representing the probability distribution are initialized. Typically, a probability distribution such as a beta distribution is used.

2. Action Selection: For each action, a value is sampled from its current probability distribution, and the action with the largest sampled value is selected. This favors actions that appear more rewarding while still leaving room for uncertainty and exploration.

3. Action execution: The selected action is executed and the resulting reward is observed.

4. Parameter updating: Using the observed rewards, the parameters of the probability distribution for each action are updated. Typically, Bayes’ theorem is used to update the probability distribution, increasing belief in the action with the higher reward and decreasing belief in the action with the lower reward.

5. Iteration: Steps 2 through 4 are repeated.

Application examples of Thompson Sampling

Thompson Sampling has been used successfully in many applications, primarily for stochastic decision-making problems. The following are examples of applications of Thompson Sampling.

1. Multi-armed bandit problem:

Thompson Sampling is particularly well suited to the multi-armed bandit problem, in which the goal is to maximize the reward obtained from multiple actions (arms) whose reward distributions are uncertain.

2. Online advertising:

Thompson Sampling is used to optimize click-through rates in online advertising. Different ad variations can be tested, and Thompson Sampling can be used to select the most successful ads, allowing advertisers to maximize click-through rates and optimize the allocation of their ad budgets. For more information, see also “Recommendation Technology”.

3. Clinical trials:

Clinical trials of new drugs may apply different treatment protocols to different patient groups; Thompson Sampling can be used to identify the most promising treatment protocols and improve the success rate of clinical trials.

4. Automated recommendation systems:

Product and content recommendations are important for online platforms and e-commerce sites, and Thompson Sampling can be used to provide the best recommendations, taking into account the user’s past behavior and preferences. See also “Recommendation Technology” for more information.

5. Internet of Things (IoT):

Thompson Sampling can help with issues related to proper configuration of IoT devices and optimal use of resources. This includes adjusting device parameters and selecting communication channels. See also “Sensor Data & IOT Technology” for more information.

This algorithm is a powerful tool for dealing with uncertain decision problems and is a widely used and effective method for making optimal choices in real time.

Example implementation of Thompson Sampling

The implementation of Thompson Sampling depends on the specific problem and programming environment, but a common approach involves sampling from probability distributions and updating their parameters. Below are the general steps for implementing Thompson Sampling and a simple example using Python.

Example implementation of Thompson Sampling using Python (the reward simulator and its true success probabilities in the code below are assumptions added for illustration):

import random

# Assumed reward simulator for this sketch: each action pays out 1 with a
# hypothetical success probability (unknown to the algorithm), otherwise 0.
true_probabilities = [0.1, 0.3, 0.5, 0.7, 0.9]

def simulate_reward(action):
    return 1 if random.random() < true_probabilities[action] else 0

# Set initial parameters (parameters of the beta distribution) for each action
num_actions = 5
action_parameters = [(1, 1) for _ in range(num_actions)]  # Beta(1, 1) uniform prior

# Initialize cumulative rewards for each action
action_rewards = [0] * num_actions

# Main Thompson Sampling loop
num_rounds = 1000
for t in range(num_rounds):
    # Draw one sample from each action's current beta posterior
    sampled_values = [random.betavariate(a, b) for a, b in action_parameters]

    # Select the action with the highest sampled value
    selected_action = sampled_values.index(max(sampled_values))

    # Perform the selected action and observe the reward (0 or 1)
    reward = simulate_reward(selected_action)

    # Update the selected action's beta parameters (success/failure pseudo-counts)
    a, b = action_parameters[selected_action]
    action_parameters[selected_action] = (a + reward, b + 1 - reward)

    # Update cumulative rewards
    action_rewards[selected_action] += reward

# Analyze the final estimates and rewards
print("Posterior mean (estimated expected reward) for each action:",
      [a / (a + b) for a, b in action_parameters])
print("Cumulative reward for each action:", action_rewards)

In this example, the reward of each action is a 0/1 (Bernoulli) outcome, the belief about each action’s success probability is modeled with a beta distribution, and actions are selected according to the Thompson Sampling algorithm. After sampling and observing the reward in each round, the beta-distribution parameters of the selected action are updated.

Challenges of Thompson Sampling

While Thompson Sampling is a powerful probability-based decision-making algorithm and has been successful in many settings, there are some challenges and limitations. These challenges are described below.

1. Initial exploration problem:

Thompson Sampling can take a long time to find the optimal action. It is sometimes combined with other approaches, such as the epsilon-greedy method, to mitigate this initial exploration problem.

2. Prior knowledge of parameters required:

Thompson Sampling uses a Bayesian approach, which requires prior information about the parameters. If accurate prior information is not available, the performance of the algorithm may be limited.

3. Computational cost:

Because it relies on Bayesian sampling, Thompson Sampling can be computationally expensive. In particular, when the number of actions is large, the computation becomes heavier, making it difficult to ensure real-time performance.

4. Difficulty of generalization beyond multi-armed bandits:

Thompson Sampling is designed specifically for the multi-armed bandit problem. Generalizing it to other types of reinforcement learning or decision-making problems is difficult, and finding an appropriate model and parameter-update method can be challenging.

5. Adaptation to non-stationary environments:

When the environment changes over time, Thompson Sampling has limited adaptability. Because the algorithm updates its model based on historical data, it has difficulty following rapid changes.

6. Increased number of trials:

Because Thompson Sampling adjusts the trade-off between exploration and exploitation, exploration decreases as the number of times an action has been tried grows, and the algorithm may overfit to past data.

To overcome these challenges and limitations, research has been conducted to improve Thompson Sampling or combine it with other reinforcement learning algorithms. In appropriate applications and environments, Thompson Sampling can be an excellent choice, but it is important to understand its limitations and to design and tune it appropriately.

Addressing Thompson Sampling’s Challenges

The following methods and approaches have been used to address the challenges of Thompson Sampling.

1. Addressing the initial exploration problem:

To deal with insufficient initial exploration, the epsilon-greedy method described in “Overview of the epsilon-greedy method (epsilon-greedy) and examples of algorithms and implementations” and the UCB (Upper Confidence Bound) method described in “Overview of the UCB (Upper Confidence Bound) algorithm and examples of implementations” may be combined with Thompson Sampling. These approaches encourage more exploration in the early stages and enhance the collection of information.
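For example, one simple way to strengthen early exploration is to pull every arm a fixed number of times in round-robin fashion before switching to ordinary Thompson Sampling (a minimal sketch; the function name and the warm-up length init_pulls are hypothetical choices):

import random

def select_arm(action_parameters, t, init_pulls=1):
    """Warm-start rule: pull each arm init_pulls times first (round-robin),
    then fall back to ordinary Thompson Sampling."""
    num_actions = len(action_parameters)
    if t < num_actions * init_pulls:
        return t % num_actions  # forced initial exploration
    samples = [random.betavariate(a, b) for a, b in action_parameters]
    return samples.index(max(samples))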

2. Prior knowledge of parameters:

If prior knowledge of the parameters is lacking, a non-informative prior (for example a uniform prior, or in the categorical case the Dirichlet distribution described in “Overview of Dirichlet distribution and related algorithms and implementation examples”) may be used. Also, since Thompson Sampling updates its parameters as data is collected, an initial prior can be corrected by later observations.

3. Reduction of computational cost:

If the computational cost is high, the computation can be optimized by introducing approximation algorithms or fast sampling methods. Parallel processing or the use of GPUs can also be considered to improve efficiency. See also “Parallel and Distributed Processing in Machine Learning” for more details.
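For example, when the number of arms is large, the per-round sampling step can be vectorized with NumPy instead of looping over arms in pure Python (a sketch under the assumption that the beta posterior parameters are held in two arrays):

import numpy as np

num_actions = 10000                 # hypothetical, large number of arms
a = np.ones(num_actions)            # successes + alpha for each arm
b = np.ones(num_actions)            # failures + beta for each arm

# One round of Thompson Sampling: draw all posterior samples at once and take the argmax
samples = np.random.beta(a, b)      # vectorized beta sampling over all arms
selected_action = int(np.argmax(samples))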

4. Generalization beyond multi-armed bandits:

Extending Thompson Sampling to other problems requires designing customizable models and algorithms; the algorithm can be adapted to a specific problem by modifying the reward distribution and the parameter-update method.

5. Adaptation to non-stationary environments:

When the environment changes, Thompson Sampling has limited adaptability, and different strategies need to be considered, for example methods that detect changes in the underlying concept and adjust the parameter updates appropriately.
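One commonly studied adjustment along these lines is discounted Thompson Sampling, in which the accumulated pseudo-counts are decayed toward the prior at every step so the posterior can track a drifting reward probability (a sketch; the discount factor gamma and this particular update form are one possible choice, not the only one):

def discounted_update(action_parameters, selected_action, reward,
                      gamma=0.99, alpha0=1.0, beta0=1.0):
    """Decay every arm's accumulated evidence toward the prior, then add the new observation."""
    new_parameters = []
    for k, (a, b) in enumerate(action_parameters):
        a = gamma * (a - alpha0) + alpha0   # shrink accumulated successes
        b = gamma * (b - beta0) + beta0     # shrink accumulated failures
        if k == selected_action:
            a += reward
            b += 1 - reward
        new_parameters.append((a, b))
    return new_parameters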

6. Handling an increase in the number of trials:

As the number of trials increases, it is important to maintain the balance between exploration and exploitation; the optimal balance can be found by adjusting the parameter-update method and the exploration parameters.

Reference Information and Reference Books

More details on the bandit problem are given in “Theory and Algorithms of the Bandit Problem“.

Information on the web includes “Netflix Uses Contextual Bandit Algorithm”, “Contextual Bandit LinUCB LinTS”, “Composition of a Recommender System Using the Bandit Algorithm”, and “Rental Property Search System Using Contextual Bandit”.

A reference book is “Bandit problems: Sequential Allocation of Experiments“.

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Conservation Laws, Extended Polymatroids and Multi-armed Bandit Problems: A Unified Approach to Indexable Systems
