Theory and Algorithms for the Bandit Problem


Bandit Problem Overview

Life is a repetition of choices. We encounter similar situations over and over again and, based on past experience, choose a certain path. We sometimes wonder what would have happened had we chosen differently at the time, but we can never be sure that the present would really be better if we had taken that other path.

How to make a choice “without regret” is a difficult question, but one thing that can be said is that a balance between exploration (“search”) and exploitation (“knowledge use”) is important. Among the possible paths, there are paths whose quality we do not yet know, and paths we already know to be reasonably good. If we choose a path we know, we may end up well, yet we might also regret it, because an unknown path could have led to an even better present.

Conversely, if we choose an unknown path in the hope of finding something better, it may turn out to be worse than the path we knew, and we may regret not having relied on our experience. To make the best choices, it is necessary to strike a balance between exploring for a better path and exploiting past experience; this trade-off is the essence of the bandit problem.

In today’s Internet society, it has become easy to deliver news and product advertisements tailored to each individual’s preferences. The designer of such a system aims to deliver the news or advertisement with the highest click-through rate to each user, but since the true click-through rate is unknown, estimates of it are used instead.

In this case, an item whose current estimated click-through rate is low but which has only been delivered a few times may in fact have a high true click-through rate, so it should be delivered more aggressively than items that have already been delivered many times, in order to observe user responses (exploration). On the other hand, to increase the total number of clicks, items with high estimated click-through rates should be delivered more often (exploitation). The multi-armed bandit problem is formulated as the problem of optimizing the balance between such exploration and exploitation, and together with its generalizations it is collectively called the bandit problem.

The bandit problem has long been studied because of its application to the evaluation of new drug treatments, but in recent years research on it has exploded because it fits Internet-related applications such as the one above very well. With the recent discovery of superior exploration policies such as Thompson sampling, the question of optimal policies in the basic setting has largely been settled, and much research now addresses various settings that are closer to real-world problems.

Even so, the settings treated in existing studies do not always apply directly to real-world problems. Whether the goal is application to a real problem or research into a new formulation, it is therefore often more important to understand the basis on which the policies for a particular setting are constructed than to focus narrowly on the analysis of that one setting.

The bandit problem is a type of reinforcement learning problem in which an agent must decide which arm to choose among multiple alternatives (arms). Each arm generates a reward according to some probability distribution that is unknown to the agent, and the agent discovers which arm is the most rewarding by drawing arms repeatedly and observing the rewards. The problem is typically treated under the following assumptions: (1) the agent selects each arm independently, (2) each arm generates rewards according to some probability distribution, (3) the rewards of the arms are observable to the agent, but their distributions are unknown, and (4) the agent receives rewards by drawing arms repeatedly.
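To make this setting concrete, here is a minimal sketch (not from the original article) of such an environment in Python; the class name BernoulliBandit and the arm probabilities are illustrative assumptions.

```python
import random

class BernoulliBandit:
    """K-armed bandit: arm i pays reward 1 with (hidden) probability probs[i], else 0."""

    def __init__(self, probs):
        self.probs = probs            # true success rates, unknown to the agent
        self.n_arms = len(probs)

    def pull(self, arm):
        # The agent observes only this 0/1 reward, never self.probs itself.
        return 1 if random.random() < self.probs[arm] else 0

# Example: three arms whose true click-through rates are hidden from the agent.
env = BernoulliBandit([0.03, 0.05, 0.10])
print(env.pull(2))                    # draw arm 2 once and observe the reward
```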

In the bandit problem, the agent decides which arm to select and learns a policy for choosing the arm with the maximum expected reward using algorithms such as the following: (1) the ε-greedy method (select an arm at random with constant probability ε, and select the arm with the highest estimated reward with the remaining probability 1−ε), (2) the UCB algorithm (preferentially select the arm with the largest upper confidence bound on its reward, so that uncertain arms are explored), and (3) Thompson sampling (sample from the posterior distribution of each arm’s reward parameter and select the arm whose sample is largest).

The details of these bandits are presented below in this blog.

Implementation

The Bandit problem is a type of reinforcement learning problem in which a decision-making agent learns which action to choose in an unknown environment. The goal of this problem is to find a method for selecting the optimal action among multiple actions.

In this section, we provide an overview and implementations of the main algorithms for the bandit problem, including the ε-greedy method, the UCB algorithm, Thompson sampling, softmax selection, the substitution rule method, and the Exp3 algorithm, and describe application examples such as online advertisement delivery, drug discovery, stock investment, and clinical trial optimization, together with their implementation procedures.

The Multi-Armed Bandit Problem is a type of decision-making problem that involves finding the most rewarding option among multiple alternatives (arms). It is used in real-time decision-making and in applications that must deal with the trade-off between exploration and exploitation.

The ε-greedy method is a simple and effective strategy for dealing with the trade-off between exploration and exploitation that arises in reinforcement learning and related problems. The algorithm adjusts the probability of choosing the currently best-looking action versus the probability of choosing a random action.
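As an illustration, a minimal ε-greedy sketch for Bernoulli rewards; the function names, the success rates in TRUE_PROBS, and the choice ε = 0.1 are assumptions for this example.

```python
import random

TRUE_PROBS = [0.03, 0.05, 0.10]                       # hidden success rates (illustrative)

def pull(arm):
    return 1 if random.random() < TRUE_PROBS[arm] else 0

def epsilon_greedy(pull, n_arms, steps=10_000, eps=0.1):
    counts = [0] * n_arms                             # number of draws per arm
    values = [0.0] * n_arms                           # running mean reward per arm
    for _ in range(steps):
        if random.random() < eps:                     # explore with probability eps
            arm = random.randrange(n_arms)
        else:                                         # otherwise exploit the best estimate
            arm = max(range(n_arms), key=lambda i: values[i])
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update
    return values, counts

print(epsilon_greedy(pull, len(TRUE_PROBS)))
```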

The Boltzmann distribution is one of the important probability distributions in statistical mechanics and physics, describing how the states of a system are distributed over energy levels. It also plays an important role in machine learning and optimization, particularly in stochastic approaches and Monte Carlo based methods with a wide range of applications. The softmax algorithm can be regarded as a generalization of the Boltzmann distribution and can be applied to the machine learning approaches in which the Boltzmann distribution appears. The application of the softmax algorithm to the bandit problem is described in detail below.
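As a sketch of how Boltzmann (softmax) selection can be used for arm choice, assuming estimated mean rewards `values` and a temperature parameter `tau` (both names and values are illustrative):

```python
import math, random

def softmax_select(values, tau=0.1):
    """Pick an arm with probability proportional to exp(value / tau)."""
    m = max(values)                                   # subtract the max for numerical stability
    weights = [math.exp((v - m) / tau) for v in values]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(range(len(values)), weights=probs)[0]

# Example: arm 2 is chosen most often, but the other arms still get explored.
print(softmax_select([0.03, 0.05, 0.10], tau=0.05))
```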

    The Upper Confidence Bound (UCB) algorithm is an algorithm for choosing among different actions (arms) in the Multi-Armed Bandit Problem (MAB) while taking into account the uncertainty in the estimated value of each action. The method aims to select the optimal action by appropriately adjusting the trade-off between exploration and exploitation.
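    A minimal UCB1-style sketch follows; the original text does not fix a particular UCB variant, so the exploration bonus \(\sqrt{2\log t / N_i}\) used here is an assumption.

```python
import math, random

TRUE_PROBS = [0.03, 0.05, 0.10]                       # hidden success rates (illustrative)

def pull(arm):
    return 1 if random.random() < TRUE_PROBS[arm] else 0

def ucb1(pull, n_arms, steps=10_000):
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for t in range(1, steps + 1):
        if t <= n_arms:                               # draw every arm once first
            arm = t - 1
        else:
            # Mean estimate plus a confidence width that shrinks as the arm is drawn more often.
            arm = max(range(n_arms),
                      key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return values, counts

print(ucb1(pull, len(TRUE_PROBS)))
```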

    Thompson Sampling is an algorithm used in probabilistic decision-making problems such as reinforcement learning and multi-armed bandit problems. It selects the best option among multiple alternatives (often called actions or arms) in a way that accounts for uncertainty, and it is particularly useful when the reward of each action is stochastic.
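    A minimal Thompson sampling sketch for Bernoulli rewards with a Beta prior follows; the uniform Beta(1, 1) prior and the helper names are assumptions for this example.

```python
import random

TRUE_PROBS = [0.03, 0.05, 0.10]                       # hidden success rates (illustrative)

def pull(arm):
    return 1 if random.random() < TRUE_PROBS[arm] else 0

def thompson_sampling(pull, n_arms, steps=10_000):
    alpha = [1] * n_arms        # Beta(1, 1) uniform prior on each arm's success rate
    beta = [1] * n_arms
    for _ in range(steps):
        # Sample one plausible success rate per arm from its posterior, then act greedily.
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        r = pull(arm)
        alpha[arm] += r
        beta[arm] += 1 - r
    return [a / (a + b) for a, b in zip(alpha, beta)]   # posterior mean estimates

print(thompson_sampling(pull, len(TRUE_PROBS)))
```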

    • Count-Based Multi-Armed Bandit Problem Approach

    The Count-Based Multi-Armed Bandit Problem is a type of reinforcement learning problem in which rewards are obtained from different actions (arms) and the reward distribution of each arm is assumed to be unknown. The main goal is to find a policy that maximizes the reward obtained through arm selection.

    Boltzmann Exploration is a method for balancing exploration and exploitation in reinforcement learning. It computes selection probabilities based on the action values and uses them to select actions.

      Contextual bandit is a type of reinforcement learning and a framework for making the best choice among multiple alternatives when side information is available. The contextual bandit problem consists of contexts, actions (arms), and rewards. This section describes various algorithms for the contextual bandit and an example implementation in Python.

      • EXP3 (Exponential-weight algorithm for Exploration and Exploitation) Algorithm Overview and Implementation Example

      EXP3 (Exponential-weight algorithm for Exploration and Exploitation) is one of the algorithms for the multi-armed bandit problem, designed for the adversarial setting. It aims to find the optimal arm while balancing the trade-off between exploration and exploitation.
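      A minimal Exp3 sketch for rewards in [0, 1] follows; the mixing/learning parameter γ = 0.1 and the helper names are assumptions for this example.

```python
import math, random

TRUE_PROBS = [0.03, 0.05, 0.10]                       # illustrative Bernoulli rewards in [0, 1]

def pull(arm):
    return 1 if random.random() < TRUE_PROBS[arm] else 0

def exp3(pull, n_arms, steps=10_000, gamma=0.1):
    weights = [1.0] * n_arms
    draws = [0] * n_arms
    for _ in range(steps):
        total = sum(weights)
        # Mix the exponential weights with a uniform distribution for forced exploration.
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        x = pull(arm)                                 # observed reward in [0, 1]
        x_hat = x / probs[arm]                        # importance-weighted reward estimate
        weights[arm] *= math.exp(gamma * x_hat / n_arms)
        # Rescale the weights to avoid overflow; this leaves the probabilities unchanged.
        m = max(weights)
        weights = [w / m for w in weights]
        draws[arm] += 1
    return draws

print(exp3(pull, len(TRUE_PROBS)))
```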

      • Libraries available for Bandit issues

      There are few libraries for bandit problems, and the main ones are provided in R.

      • Bandit: A library for solving bandit problems in R. It supports a large number of algorithms and implements epsilon-greedy, UCB1, Thompson Sampling, and other algorithms.
      • arm: A library for solving multi-armed bandit problems in R. It supports a large number of algorithms, selecting the best algorithm, and adjusting algorithm parameters.
      • MAB: A library provided by google for solving bandit problems in R.

      Theory

      The bandit problem is a sequential decision problem in which one repeatedly selects an element from a set of alternatives and obtains the reward of the selected alternative, but no reward information about the others, with the aim of maximizing the sum of rewards. The name comes from the “one-armed bandit,” a slang term for a classic slot machine with a single arm, which “steals” the player’s money.

      The problem of selecting the next target to be sampled based on past observations, as in the bandit problem, has long been studied under the names adaptive allocation and sequential allocation.

      The main question considered in the stochastic bandit problem takes a form such as: “If the current empirical click-through rate \(\hat{\mu}\) of an ad is less than 5%, how plausible is it that its true click-through rate is actually μ = 10%?” In more standard statistical language, this is the question “If the true click-through rate is μ, what is the probability (likelihood) that the sample mean satisfies \(\hat{\mu}\leq x\)?” for some x ∈ [0, 1].

      In this article, we will discuss the most basic settings in stochastic bandit and describe the achievable theoretical limits and the ε-greedy method.

      As mentioned in the previous section, the ε-greedy method cannot achieve O(log T) regret for some parameters \(\{\mu_i\}_i\) of the true distributions, no matter how ε is chosen. The most basic policies that do achieve it are those based on likelihood comparisons, such as the UCB policy.

      The important point of the UCB policy is that it controls the rate at which an arm without the maximum expected value is mistakenly selected to be about 1/t, and computing a confidence interval for the true expected value is not strictly essential for this. There are therefore MED policies that aim to control this mis-selection rate directly, of which the DMED policy (Deterministic Minimum Empirical Divergence policy) is the most intuitive and easy to understand.

      The KL-UCB policy can be interpreted as computing the upper bound of a confidence interval for the true expected value at significance level 1/t using the Chernoff-Hoeffding inequality, but the Chernoff-Hoeffding inequality overestimates this probability by a factor of order \(O(\sqrt{n})\) in the number of samples n, which can cause extra exploration of arms that have very little chance of having the maximum expected value. Furthermore, the significance level 1/t is a value derived from the asymptotic argument about the theoretical limit; replacing it with another value of order O(1/t) still yields a proof of asymptotic optimality, but the finite-time performance can vary greatly.
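      For reference, the Chernoff-Hoeffding bound referred to here states that, for n i.i.d. Bernoulli samples with true mean μ and sample mean \(\hat{\mu}_n\),
      \[
      \mathbb{P}(\hat{\mu}_n \geq x) \leq e^{-n\, d(x,\mu)} \quad (x \geq \mu), \qquad d(x,\mu) = x\log\frac{x}{\mu} + (1-x)\log\frac{1-x}{1-\mu},
      \]
      and a commonly used form of the KL-UCB index (written here as an assumption about the variant, since the text does not fix one) is
      \[
      \bar{\mu}_i(t) = \max\Bigl\{ q \in [\hat{\mu}_i(t), 1] : N_i(t)\, d(\hat{\mu}_i(t), q) \leq \log t \Bigr\},
      \]
      sometimes with an additional lower-order \(\log\log t\) term.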

      One heuristic for achieving good performance without relying on asymptotic theoretical limits is the probability matching method: the probability that each arm has the maximum expected value is modelled in some way, and the arm to be drawn is selected randomly according to this probability. For example, the softmax policy assumes that the probability that arm i has the maximum expected value is proportional to \(e^{\hat{\mu}_i/r}\), a Gibbs distribution with temperature parameter r > 0, and selects arms randomly with the normalized probabilities \(p_i = e^{\hat{\mu}_i/r}/\sum_j e^{\hat{\mu}_j/r}\).

      In this article, we describe a general framework for evaluating the performance of algorithms for stochastic bandit problems and the conditions that a “good” bandit algorithm must satisfy. Based on this framework, we will also compute the actual regret upper bound.

      In the previous section, we discussed how policies such as UCB and Thompson sampling can all be interpreted as “drawing any arm whose significance level or posterior probability of having the maximum expected value is at least about 1/t.” These intuitive interpretations apply to a broader class of bandit problems and are useful for getting an idea of which policies will perform well, but their concrete performance evaluation requires a more detailed discussion. In this section, we discuss the basic concepts of regret analysis for stochastic bandit problems, using the Bernoulli reward model as an example, and evaluate the UCB and Thompson sampling policies based on these concepts.
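      For reference, writing \(\mu^*=\max_i\mu_i\) for the largest expected reward, \(\Delta_i=\mu^*-\mu_i\) for the gap of arm i, and \(N_i(T)\) for the number of times arm i is drawn up to time T, the (pseudo-)regret analyzed here can be decomposed as
      \[
      \mathrm{Regret}(T) = T\mu^* - \mathbb{E}\Bigl[\sum_{t=1}^{T}\mu_{i(t)}\Bigr] = \sum_{i:\Delta_i>0}\Delta_i\,\mathbb{E}[N_i(T)],
      \]
      so bounding the regret reduces to bounding the expected number of draws of each suboptimal arm.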

      In this article, we discuss the adversarial bandit setting and describe the Exp3 policy and its variant, the Exp3.P policy, as bandit versions of the Hedge algorithm, an online learning algorithm for the full-information setting in computational learning theory, and prove bounds on the pseudo-regret and the high-probability regret. We also prove a lower bound on the pseudo-regret and describe the Poly INF policy, which achieves a pseudo-regret of the same order as this lower bound.

      The Exp3 policy, which weights arm j in proportion to \(\frac{1}{K}e^{\eta\hat{G}_{j,t}}\) for the estimated cumulative reward \(\hat{G}_{j,t}\), achieves a pseudo-regret of order \(O(\sqrt{TK\log K})\), but it does not necessarily achieve a good regret with high probability; the Exp3.P policy addresses this by using an upper confidence bound on the cumulative reward instead.

      The pseudo-regret of the Exp3.P policy with well-chosen parameters is \(O(\sqrt{TK\log K})\). Is this a good pseudo-regret for the adversarial bandit problem? One way to check is to compare it with the pseudo-regret lower bound for the adversarial multi-armed bandit problem, i.e., the pseudo-regret that must arise under any policy.

      In the bandit problems discussed so far, the objective was to maximize the cumulative reward by selecting the machine currently estimated to have the maximum expected value at an appropriate frequency (exploitation) while estimating the expected value of each slot machine (exploration). In practical situations, however, many problems arise in which a period dedicated purely to exploration is clearly separated, and the goal is to identify, with high probability, the slot machine with the largest expected value within that period. In this article, we discuss efficient strategies for such pure exploration problems.

      In this section, we consider strategies for ε-optimal arm identification with a predefined ε ≥ 0, a setting that includes exact optimal arm identification as the special case ε = 0. We first discuss strategies for the fixed-confidence setting and then describe how to apply them to the fixed-budget case.

      As in the case of regret minimization, the concept of confidence bounds plays a central role in the construction of these strategies. While using the Upper Confidence Bound (UCB) as the criterion for arm selection was effective for cumulative reward maximization, for optimal arm identification it is important to also consider the Lower Confidence Bound (LCB), as described below.
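      As one concrete, Hoeffding-based illustration (the exact form of the confidence width depends on the strategy and is an assumption here), with \(\hat{\mu}_i(t)\) the sample mean of arm i after \(N_i(t)\) draws and \(\delta_t\) an error-probability parameter, one may take
      \[
      \mathrm{UCB}_i(t) = \hat{\mu}_i(t) + \sqrt{\frac{\log(1/\delta_t)}{2N_i(t)}}, \qquad
      \mathrm{LCB}_i(t) = \hat{\mu}_i(t) - \sqrt{\frac{\log(1/\delta_t)}{2N_i(t)}},
      \]
      and, in the fixed-confidence setting, stop and return the empirically best arm \(\hat{i}\) once \(\mathrm{LCB}_{\hat{i}}(t) \geq \max_{j\neq\hat{i}} \mathrm{UCB}_j(t)\), i.e., once its lower bound exceeds every other arm’s upper bound.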

      The simple bandit setting considered so far assumes that the reward from one slot machine carries no information about the other machines. In practical applications, however, each slot machine is often associated with features from which the nature of one machine’s rewards can be partially inferred from another’s. As a representative formulation of such a setting, we consider here the case where the expected reward is represented by a linear model of the features. Contextual bandits, in which the features of the slot machines change over time, can also be formulated in the same framework.
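      As a minimal sketch of a UCB-type policy for this linear-reward setting (a LinUCB-style rule; the ridge regularizer, the width parameter alpha, and the toy features are illustrative assumptions):

```python
import numpy as np

def linucb_choose(A, b, arm_features, alpha=1.0):
    """Pick the arm maximizing the linear UCB score x^T theta_hat + alpha * sqrt(x^T A^-1 x)."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                        # ridge-regression estimate of the weight vector
    scores = [x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x) for x in arm_features]
    return int(np.argmax(scores))

def linucb_update(A, b, x, reward):
    A += np.outer(x, x)                          # accumulate the design matrix
    b += reward * x                              # accumulate feature-weighted rewards

# Illustrative 2-dimensional features for three arms.
d = 2
A = np.eye(d)                                    # regularizer lambda = 1
b = np.zeros(d)
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
true_theta = np.array([0.05, 0.10])              # hidden weights (illustrative)

rng = np.random.default_rng(0)
for _ in range(2000):
    i = linucb_choose(A, b, arms, alpha=1.0)
    r = float(rng.random() < arms[i] @ true_theta)   # Bernoulli reward with linear mean
    linucb_update(A, b, arms[i], r)
```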

      In the previous section, we discussed the LinUCB policy as a natural generalization of the UCB policy to linear models. Here we similarly describe a generalization of Thompson sampling to linear models.
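      A minimal sketch of Thompson sampling for the linear model follows; the Gaussian posterior \(N(\hat{\theta}, v^2 A^{-1})\) and the scale v are illustrative assumptions, and A, b denote the same ridge-regression statistics as in the LinUCB sketch above.

```python
import numpy as np

def lin_ts_choose(A, b, arm_features, v=0.5, rng=np.random.default_rng()):
    """Sample a weight vector from N(theta_hat, v^2 * A^-1) and act greedily on it."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                            # ridge-regression point estimate
    theta_tilde = rng.multivariate_normal(theta_hat, v ** 2 * A_inv)
    return int(np.argmax([x @ theta_tilde for x in arm_features]))

# The update after observing a reward is the same as in the LinUCB sketch:
#   A += np.outer(x, x);  b += reward * x
```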

      In this article, we will discuss the formulation and strategy of the bandit problem when the number of possible player actions, such as adjusting the learning parameters, is large or continuous. We also discuss Bayesian optimization, which assumes that the reward is generated by a Gaussian process.

      For the continuous-armed bandit, natural extensions of the finite-armed bandit policies are possible; in particular, various policies for the noise-free setting and for simple regret minimization have been proposed in the context of black-box optimization research. In the following, we first consider the case where the kernel from which f(a) is generated and the scale parameters σ0, λ are known, and then discuss how to estimate them.

      There are various extensions and generalizations of the bandit problem in addition to the linear bandit. In settings such as news recommendation, the probability distribution of the quantity corresponding to the reward, such as the presence or absence of clicks, may vary with time. Several formulations of such settings are possible, and various policies are possible depending on the formulation. Here we discuss some of the most representative of them.

      In addition to the linear bandit problem, there are various extensions and generalizations of the bandit problem, and here we discuss partial observation problems and other extensions that are of practical importance or have recently attracted attention.

      In this section, we discuss how game tree search can be treated as a bandit problem and what techniques are used as applications of the bandit method.

      In this article, we will discuss a case in which the Bandit Problem method was applied to find the optimal billing and compensation for Internet advertising.

      In this article, we describe a case in which the Bandit problem method is applied to find the optimal recommended solution.
