Overview of the contextual bandit problem and examples of algorithms/implementations

What is Contextual bandit?

Contextual bandit is a type of reinforcement learning and a framework for solving the multi-armed bandit problem described in “Overview of the Multi-armed Bandit Problem, Applicable Algorithms, and Implementation Examples“. The contextual bandit problem consists of the following elements:

Context (situation): A context is given that represents the information or features relevant to the choice among the options. For example, when an advertisement is displayed on a website, the context could be user attributes, page content, or other available information.
Choice (arm): There are multiple choices (arms). For example, when displaying an ad, each arm represents a different variation of the ad.
Rewards: Each arm is associated with a reward that can be obtained if selected. Rewards can be defined in different ways depending on the objective; for example, in the case of advertising, rewards may be associated with user click rates, purchase rates, etc.

The goal of the contextual bandit is to find the most rewarding option in each situation based on the given context, and various algorithms have been developed to solve this problem. Typical algorithms include the epsilon-greedy method as described in “Overview of the ε-Greedy Method (ε-Greedy) and Examples of Algorithms and Implementations“, the Upper Confidence Bound (UCB) method as described in “Overview of the Upper Confidence Bound (UCB) Algorithm and Examples of Implementation“, and Thompson sampling as described in “Thompson Sampling Algorithm Overview and Example Implementation“.

Contextual bandit application examples

Contextual bandit has many applications. Some specific examples are described below.

  • Online ad serving: Contextual bandit is used for ad serving on websites and mobile apps to display the most appropriate ads, taking into account contextual information such as user attributes and behavioral history. Each advertisement is an option (arm), and the choice is made to maximize user response, click-through rates, and other rewards.
  • Recommendation systems: Contextual bandits are also applied to recommendation systems for online shopping, music streaming services, etc., and are used to recommend the most appropriate items and content, taking into account user characteristics and past behavior. The items to be recommended become choices (arms), and the choices are made to maximize user satisfaction, click-through rates, and other rewards.
  • Media Personalization: Contextual bandit is also applied to personalized offerings of media (articles, news, videos, etc.) and is used to select and offer the most appropriate content, taking into account user interests and preferences. Different content becomes a choice (arm), and the choice is made to maximize user engagement, click-through rates, and other rewards.
  • Clinical Trials: Contextual bandits are also being used in healthcare. In clinical trials, where different treatment options need to be proposed to patients, Contextual bandit is used to select the most appropriate treatment, taking into account patient characteristics and medical conditions. The treatment options (arms) are chosen to maximize the patient’s recovery rate, improvement in health indicators, and other rewards.

Contextual bandits can be used in many application areas where choice problems arise, and can maximize rewards and improve performance by making the best choice based on context.

Algorithms used for Contextual bandit

Various algorithms have been developed to solve the Contextual bandit problem. Typical algorithms are described below.

Epsilon-greedy method:

The ε-greedy method is one of the most popular exploration strategies in reinforcement learning and is a simple and effective contextual bandit algorithm. Reinforcement learning is a machine learning technique that learns optimal behavior through trial and error while interacting with the environment. The ε-greedy algorithm can be summarized as follows:

    1. Initialization: Set the exploration probability ε (a value between 0 and 1). Usually, ε is set to a relatively small value (e.g., 0.1).
    2. Action selection: A random action is selected with probability ε, and the currently optimal action (the action with the maximum estimated action value) is selected with probability 1-ε.
    3. Exploration and exploitation: The ε-greedy method balances exploration and exploitation through the exploration probability ε. For small values of ε, the optimal action is selected in most cases and the knowledge already obtained is exploited. For large values of ε, unknown regions are explored by selecting actions at random.
    4. Decay of ε: Normally, a decay schedule is set up so that ε is gradually reduced as learning progresses. In this way, the early stages involve a lot of exploration and behavior gradually becomes more exploitation oriented (a sketch of such a schedule is given below).

Although the ε-greedy method is a simple and effective exploration strategy, care must be taken in setting the exploration probability ε. If ε is too small, not enough exploration is performed and the agent may converge on a locally optimal solution. Conversely, if ε is too large, a lot of unnecessary random exploration is performed, which may hinder efficient learning.
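The decay schedule mentioned in step 4 can be sketched as follows. This is a minimal illustration, and the initial value, minimum value, and decay rate are arbitrary choices rather than values from any particular system.

    import numpy as np

    epsilon = 0.5        # start with a lot of exploration
    epsilon_min = 0.01   # never stop exploring entirely
    decay_rate = 0.995   # multiplicative decay per round

    for t in range(1000):
        if np.random.random() < epsilon:
            action = np.random.randint(5)   # explore: random action (5 arms assumed)
        else:
            action = 0                      # exploit: placeholder for the current best action
        # ... execute the action, observe the reward, update estimates ...
        epsilon = max(epsilon_min, epsilon * decay_rate)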

The ε-greedy method is used in combination with various reinforcement learning methods, such as Q-learning and SARSA as described in “Overview of SARSA and its algorithm and implementation system“, where the appropriate ε setting and decay schedule are tailored to the specific problem.

Upper Confidence Bound (UCB) Method:

The Upper Confidence Bound (UCB) method is an algorithm for efficient action selection in exploration-exploitation problems such as reinforcement learning and the multi-armed bandit problem, and it is a contextual bandit algorithm that makes selections while taking uncertainty into account. The UCB method uses historical information to select the most promising action: an uncertainty term is added to the estimated value of each arm, and this term is calculated from quantities such as the number of times the arm has been selected and the variance of its rewards.

The basic idea of the UCB method is to calculate an upper confidence bound for each action and select the action with the highest bound. The specific procedure is shown below.

    1. Initialization: The number of trials for each action is initialized to 0.
    2. Calculation of UCB value: The UCB value for each action is calculated from the average reward of the action plus an upper-confidence term, and is generally expressed as: UCB value = average reward + exploration term. The exploration term becomes larger for actions that have been tried fewer times, thereby encouraging exploration of less-tested actions (see the sketch after this list).
    3. Action selection: The action with the largest UCB value is selected; that is, the action with the largest UCB value is considered the most promising.
    4. Action execution and reward observation: Execute the selected action and observe the obtained reward.
    5. Update the number of action trials: Increase the number of trials of the selected action by 1.
    6. Recalculate UCB value: Recalculate the UCB value based on the new reward information.
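As a concrete illustration of step 2, the widely used UCB1 score can be computed as follows. This is a minimal sketch: the constant 2 and the handling of actions that have never been tried vary between formulations.

    import numpy as np

    def ucb1_values(avg_rewards, counts, total_trials):
        # UCB1 score: average reward plus an exploration bonus that shrinks
        # as an action is selected more often.
        bonus = np.sqrt(2 * np.log(max(total_trials, 1)) / np.maximum(counts, 1e-12))
        return np.array(avg_rewards) + bonus

    # Example: three arms, after 10 trials in total
    print(ucb1_values([0.4, 0.5, 0.3], counts=np.array([5, 4, 1]), total_trials=10))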

By repeating these steps, it is possible to efficiently explore and exploit actions that are expected to yield higher rewards. The UCB method is particularly effective in exploration-exploitation problems with a small number of trials, such as the multi-armed bandit problem. It is theoretically well founded and, in many cases, is said to obtain an appropriate reward with a minimal number of trials. However, it should be kept in mind that parameter tuning is still important and that performance may vary depending on the problem.

Thompson sampling:

Thompson sampling is a probabilistic algorithm for efficient action selection in exploration-exploitation problems such as the multi-armed bandit problem, and it is a contextual bandit algorithm that makes selections stochastically. Like the UCB method, it aims to find the action that maximizes the reward, but the approach is different.

The basic idea of Thompson sampling is to place a prior distribution on the value of each arm, sample a value for each arm given the context, and then select the arm whose sampled value (expected reward) is highest. The algorithm aims to maximize the reward while accounting for uncertainty. The specific steps are as follows:

    1. Initialization: The parameters of the reward distribution for each action are initialized with an appropriate prior distribution. For binary (success/failure) rewards, a beta distribution is usually used.
    2. Updating the reward distribution: The reward distribution for each action is updated based on previous action executions and reward observations. For example, it is common to update the parameters of the beta distribution by counting the numbers of successes and failures (see the sketch after this list).
    3. Sampling: Sample from the reward distribution of each action and select the action with the highest sampled reward.
    4. Action execution and reward observation: Execute the selected action and observe the obtained reward.
    5. Updating the reward distribution: Update the reward distribution based on the new reward information.
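A minimal non-contextual sketch of steps 2, 3, and 5 for binary rewards, using a Beta-Bernoulli model (all variable names are illustrative), is shown below.

    import numpy as np

    num_arms = 3
    successes = np.ones(num_arms)  # alpha parameters of each arm's Beta posterior
    failures = np.ones(num_arms)   # beta parameters of each arm's Beta posterior

    # Step 3: sample one value per arm from its Beta posterior and pick the best arm
    sampled = np.random.beta(successes, failures)
    arm = int(np.argmax(sampled))

    # Steps 4-5: observe a binary reward and update only the chosen arm's posterior
    reward = 1  # assume a reward of 1 was observed this round
    if reward == 1:
        successes[arm] += 1
    else:
        failures[arm] += 1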

Compared to the UCB method, Thompson sampling is said to balance exploration and exploitation naturally because of its stochastic element. Whereas the UCB method requires tuning of the exploration term and the ε-greedy method requires setting ε, Thompson sampling does not require tuning of such parameters and is known to have good theoretical performance.

Since Thompson sampling uses a Bayesian approach, it can explore while taking uncertainty into account and is considered an effective method for real-world problems. However, its performance may vary depending on the prior distribution placed on the rewards and on how it is updated, so it needs to be designed appropriately for the specific problem.

Procedure for implementing the Contextual bandit problem

The implementation steps for the Contextual bandit problem are as follows:

    1. Data collection: To solve a contextual bandit problem, context, choice, and reward data need to be collected. For example, in the case of ad serving, data on user attributes and behavior, ad variations, and user responses should be collected.
    2. Model design: Solving the contextual bandit requires a model that predicts the reward of each choice (arm) from the context. Typically, machine learning models (e.g., linear models or neural networks) are used to model the relationship between context and reward (a sketch of a simple per-arm linear model is given below).
    3. Algorithm implementation: Implement an algorithm to determine the choices, for example the ε-greedy method, the UCB method, or Thompson sampling. Each algorithm uses the reward predictions obtained from the model to make the best choice.
    4. Learning and updating: While running the algorithm, the model is trained on the collected data and the reward predictions are updated. Given a new context, the model is used to make the optimal choice, and the predicted rewards are updated with the observed outcome.
    5. Evaluation and improvement: Evaluate the implemented contextual bandit system and measure its performance. Based on reward maximization and other evaluation metrics, improve the algorithm and model, and repeat the cycle, collecting new data or adjusting the model as needed.

Thus, the implementation procedure for the Contextual bandit problem consists of data collection, model design, algorithm implementation, learning and updating, and evaluation and improvement.
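As a concrete illustration of step 2 (model design), the following is a minimal sketch of one linear reward model per arm, fitted by online ridge regression. The class name, the regularization parameter reg, and the overall structure are illustrative assumptions, not part of any particular library.

    import numpy as np

    class PerArmLinearModel:
        def __init__(self, num_arms, num_features, reg=1.0):
            # Sufficient statistics for ridge regression, one set per arm
            self.A = [reg * np.eye(num_features) for _ in range(num_arms)]  # X^T X + reg * I
            self.b = [np.zeros(num_features) for _ in range(num_arms)]      # X^T y

        def predict(self, context):
            # Predicted reward of each arm for the given context
            return np.array([np.linalg.solve(self.A[a], self.b[a]) @ context
                             for a in range(len(self.A))])

        def update(self, arm, context, reward):
            # Online update of the chosen arm's statistics with the observed reward
            self.A[arm] += np.outer(context, context)
            self.b[arm] += reward * context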

Example implementation in python of the Contextual bandit problem using the epsilon-greedy method

Below is a simple example of a Contextual bandit problem implementation using Python. This example performs arm selection using the epsilon-greedy method.

    import numpy as np
    
    class ContextualBandit:
        def __init__(self, num_arms, num_features):
            self.num_arms = num_arms
            self.num_features = num_features
            self.true_rewards = np.random.randn(num_arms, num_features)
        
        def get_reward(self, arm, context):
            true_reward = np.dot(self.true_rewards[arm], context)
            noisy_reward = np.random.normal(true_reward, 1.0)
            return noisy_reward
    
    class EpsilonGreedy:
        def __init__(self, num_arms, epsilon):
            self.num_arms = num_arms
            self.epsilon = epsilon
            self.arm_counts = np.zeros(num_arms)
            self.arm_rewards = np.zeros(num_arms)
        
        def select_arm(self):
            if np.random.random() < self.epsilon:
                # Exploration: Choose a random arm
                arm = np.random.randint(self.num_arms)
            else:
                # Exploitation: Choose the arm with the highest average reward
                arm = np.argmax(self.arm_rewards)
            return arm
        
        def update(self, arm, reward):
            self.arm_counts[arm] += 1
            self.arm_rewards[arm] += (reward - self.arm_rewards[arm]) / self.arm_counts[arm]
    
    # Parameter Setting
    num_arms = 5
    num_features = 10
    epsilon = 0.2
    num_rounds = 1000
    
    # Creating a Contextual bandit
    bandit = ContextualBandit(num_arms, num_features)
    
    # Creation of epsilon-greedy method
    agent = EpsilonGreedy(num_arms, epsilon)
    
    # execution loop
    for _ in range(num_rounds):
        # Context Generation
        context = np.random.randn(num_features)
        
        # Arm Selection
        chosen_arm = agent.select_arm()
        
        # Obtaining Rewards
        reward = bandit.get_reward(chosen_arm, context)
        
        # Update on Arm's Rewards
        agent.update(chosen_arm, reward)
    

In this example, the ContextualBandit class represents the contextual bandit environment, and the EpsilonGreedy class implements the ε-greedy method, selecting arms and updating their average rewards. The execution loop runs the specified number of rounds, generating a context in each round, selecting an arm, obtaining the reward, and updating the estimate. Note that this simple agent keeps only a per-arm average reward and does not use the context in its decision; the context only influences the reward returned by the environment.

Example implementation in python of the Contextual bandit problem using the Upper Confidence Bound (UCB) method

The Upper Confidence Bound (UCB) method is one of the most effective algorithms for this type of problem. Below is a simple example implementation of the UCB method for solving the Contextual bandit problem in Python.

    import numpy as np
    
    class UCBContextualBandit:
        def __init__(self, num_actions, num_context_features):
            self.num_actions = num_actions
            self.num_context_features = num_context_features
            self.total_rewards = np.zeros(num_actions)
            self.num_actions_selected = np.zeros(num_actions)
            self.context_action_values = np.zeros((num_actions, num_context_features))
    
        def select_action(self, context_features):
            # UCB value = estimated reward for this context + exploration bonus
            # (the bonus shrinks as an action is selected more often; 1e-6 avoids division by zero)
            ucb_values = self.context_action_values.dot(context_features) + np.sqrt(2 * np.log(sum(self.num_actions_selected) + 1) / (self.num_actions_selected + 1e-6))
    
            # Select the action with the largest UCB value
            action = np.argmax(ucb_values)
    
            return action
    
        def update(self, context_features, action, reward):
            # Update the number of attempts for the selected action
            self.num_actions_selected[action] += 1
    
            # Update rewards for selected actions
            self.total_rewards[action] += reward
    
            # Update the estimated value: the average reward is broadcast to every
            # context weight of this action (a crude simplification; a contextual model
            # such as LinUCB would fit per-feature weights instead)
            self.context_action_values[action] = self.total_rewards[action] / self.num_actions_selected[action]
    
    def simulate_contextual_bandit(context_features_matrix, true_action_rewards, num_actions, num_context_features, num_rounds):
        bandit = UCBContextualBandit(num_actions, num_context_features)
    
        for t in range(num_rounds):
            context_features = context_features_matrix[t]
            optimal_action = np.argmax(true_action_rewards[t])
    
            # Select Action
            selected_action = bandit.select_action(context_features)
    
            # Get actual rewards
            reward = true_action_rewards[t][selected_action]
    
            # Behavioral Value Update
            bandit.update(context_features, selected_action, reward)
    
            print(f"Round {t+1}: Selected Action = {selected_action}, Optimal Action = {optimal_action}, Reward = {reward}")
    
    # Data generation for testing
    np.random.seed(42)
    num_rounds = 1000
    num_actions = 5
    num_context_features = 10
    context_features_matrix = np.random.rand(num_rounds, num_context_features)
    true_action_rewards = np.random.rand(num_rounds, num_actions)
    
    # Simulation Execution
    simulate_contextual_bandit(context_features_matrix, true_action_rewards, num_actions, num_context_features, num_rounds)

In this example implementation, the UCBContextualBandit class implements the processing required for the UCB method, and the simulate_contextual_bandit function runs the UCB method on test data. Note that the context features and true rewards are randomly generated here, so appropriate data must be prepared for a real problem. Also, since the performance of the UCB method depends on its parameters and design, the parameters should be tuned to the actual problem and the method compared with other algorithms.

Example implementation in python of the Contextual bandit problem using Thompson sampling

An example Python implementation of the Contextual bandit problem with Thompson sampling is shown below.

    import numpy as np
    
    class ThompsonSamplingContextualBandit:
        def __init__(self, num_actions, num_context_features):
            self.num_actions = num_actions
            self.num_context_features = num_context_features
            self.context_action_successes = np.ones((num_actions, num_context_features))
            self.context_action_failures = np.ones((num_actions, num_context_features))
    
        def select_action(self, context_features):
            # Model the reward distribution of each action with a beta distribution
            sampled_rewards = np.zeros(self.num_actions)
            for action in range(self.num_actions):
                sampled_rewards[action] = np.random.beta(self.context_action_successes[action].dot(context_features),
                                                         self.context_action_failures[action].dot(context_features))
            
            # Select the action with the largest sampled reward
            action = np.argmax(sampled_rewards)
    
            return action
    
        def update(self, context_features, action, reward):
            # Update rewards for selected actions
            if reward == 1:
                self.context_action_successes[action] += context_features
            else:
                self.context_action_failures[action] += context_features
    
    def simulate_contextual_bandit(context_features_matrix, true_action_rewards, num_actions, num_context_features, num_rounds):
        bandit = ThompsonSamplingContextualBandit(num_actions, num_context_features)
    
        for t in range(num_rounds):
            context_features = context_features_matrix[t]
            optimal_action = np.argmax(true_action_rewards[t])
    
            # Select Action
            selected_action = bandit.select_action(context_features)
    
            # Get actual rewards
            reward = true_action_rewards[t][selected_action]
    
            # Behavioral Value Update
            bandit.update(context_features, selected_action, reward)
    
            print(f"Round {t+1}: Selected Action = {selected_action}, Optimal Action = {optimal_action}, Reward = {reward}")
    
    # Data generation for testing
    np.random.seed(42)
    num_rounds = 1000
    num_actions = 5
    num_context_features = 10
    context_features_matrix = np.random.rand(num_rounds, num_context_features)
    true_action_rewards = np.random.randint(0, 2, size=(num_rounds, num_actions))
    
    # Simulation Execution
    simulate_contextual_bandit(context_features_matrix, true_action_rewards, num_actions, num_context_features, num_rounds)

In this example implementation, the ThompsonSamplingContextualBandit class implements the processing required for Thompson sampling, and the simulate_contextual_bandit function runs Thompson sampling on test data. The rewards are modeled with a beta distribution whose parameters are formed from the context features, and the action is chosen by sampling a reward for each action and taking the maximum. It is important to model the context features and the reward distribution appropriately for the actual problem. In addition, since the performance of Thompson sampling depends on the prior and on the update scheme, the design should be adjusted to the actual problem and compared with other algorithms.

Reference Information and Reference Books

More details on the bandit problem are given in “Theory and Algorithms of the Bandit Problem“.

Information on the web includes “Netflix Uses Contextual Bandit Algorithm“, “Contextual Bandit LinUCB LinTS“, “Composition of a Recommender System Using the Bandit Algorithm“, and “Rental Property Search System Using Contextual Bandit“.

Reference books include “Bandit problems: Sequential Allocation of Experiments“, “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems“, and “Conservation Laws, Extended Polymatroids and Multi-armed Bandit Problems: A Unified Approach to Indexable Systems“.
