Machine Learning Professional Series “Theory and Algorithms for Bandit Problems” Reading notes

The Bandit Problem refers to the problem of selecting the optimal choice among multiple alternatives with probabilistic rewards. The name comes from the problem of deciding which slot machine to play in a casino (a slot machine is also known as a "one-armed bandit"). The bandit problem is a general framework for mathematically modeling decision-making problems and has been widely studied in fields such as machine learning and reinforcement learning; its applications include optimization of ad delivery, medical diagnosis, and fund allocation.

Here we discuss the Bandit Problem based on the Machine Learning Professional Series "Theory and Algorithms for Bandit Problems".

The bandit problem is built on the trade-off between "exploration" and "exploitation" (knowledge use): the agent performs exploration to gather information about unknown alternatives, and exploitation to leverage knowledge about known alternatives and make the optimal choice. Algorithms for solving the bandit problem include the ε-greedy method, UCB (Upper Confidence Bound), and Thompson sampling. The reading notes follow below.
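As a concrete illustration of this trade-off, here is a minimal ε-greedy sketch for a Bernoulli K-armed bandit; the arm probabilities, horizon, and value of ε are illustrative assumptions rather than values from the book.

```python
import random

def epsilon_greedy(true_probs, horizon=10_000, epsilon=0.1):
    """Minimal epsilon-greedy sketch for a Bernoulli K-armed bandit (illustrative)."""
    k = len(true_probs)
    counts = [0] * k           # number of pulls per arm
    means = [0.0] * k          # empirical mean reward per arm
    total_reward = 0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(k)                        # exploration: pick a random arm
        else:
            arm = max(range(k), key=lambda i: means[i])      # exploitation: best empirical arm
        reward = 1 if random.random() < true_probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]    # incremental mean update
        total_reward += reward
    return total_reward, means

# Example: three ads with unknown click-through rates (assumed values)
print(epsilon_greedy([0.03, 0.05, 0.08]))
```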

Introduction

Reveal the choice you won't regret! A careful explanation of basic theory, including policies and regret analysis. The book covers not only optimal arm identification and continuous-arm bandits, but also more concrete settings such as Monte Carlo tree search and Internet advertising.
Preface
Balancing "exploration" and "knowledge use (exploitation)" is essential for making choices "without regret"
Example
How can news and ads with the highest click-through rate be delivered to users?
Ads whose current estimated click-through rate is low but which have been delivered only a few times may in fact have a higher true click-through rate, so they need to be delivered more aggressively than frequently delivered ads and the user responses observed (exploration).
To increase the overall number of clicks, news and ads with a high estimated click-through rate should be delivered more often (knowledge use).
Application to New Drug Trials
The multi-armed bandit problem is the problem of optimizing the balance between exploration and knowledge use

Chapter 1 What is the Bandit Problem?

1.1 Introduction
Bandit problem
A sequential decision problem in which one repeatedly selects an element from a set of alternatives and obtains the reward for that element, but not for the other alternatives,
with the aim of maximizing the sum of rewards
Example
There are K = 5 slot machines
A customer can pull the arm of any machine
One arm is selected each time, for a total of 100 pulls
There is no information on which arm is more likely to win
Draw each arm n times, then draw the most successful arm for the remaining (100 - 5n) pulls
What value should n take? (A sketch of this explore-then-commit strategy follows at the end of this section.)
One-armed bandit
Classic slot machine with one arm
Multi-armed bandit problem
Aims to maximize profit by repeatedly selecting one of K slot machines and pulling its arm
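A small sketch of the explore-then-commit strategy from the example above (pull each arm n times, then commit to the empirically best arm); the reward probabilities are assumed purely for illustration.

```python
import random

def explore_then_commit(true_probs, n, horizon=100):
    """Draw each arm n times, then commit to the empirically best arm (illustrative)."""
    pull = lambda i: 1 if random.random() < true_probs[i] else 0
    k = len(true_probs)
    wins = [sum(pull(i) for _ in range(n)) for i in range(k)]   # exploration: n pulls per arm
    best = max(range(k), key=lambda i: wins[i])                 # empirically best arm
    total = sum(wins) + sum(pull(best) for _ in range(horizon - k * n))  # commit for the rest
    return best, total

# K = 5 arms, 100 draws in total; the choice of n trades exploration against exploitation
for n in (1, 5, 10):
    print(n, explore_then_commit([0.2, 0.3, 0.4, 0.5, 0.6], n))
```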
1.2 Examples of Bandit Problems
Examples of Bandit Problems
Clinical trial
Minimize the number of patients whose treatment fails by sequentially deciding which of K treatments to administer to each patient who arrives one after another with a given disease
Internet advertising
Maximize the number of ads clicked (or the profit from those clicks) by repeatedly selecting which of K ads to display on a web page or on the search results page for a certain keyword
Recommender system
A web service that, on each user's visit, recommends several products selected from all products available at that time based on the user's past purchase history, aiming to maximize the number of recommended products actually purchased (or the profit from those purchases)
Game tree search
To find the best move among K candidate moves in a game, the average win rate over randomly sampled playouts to final positions is used as the evaluation value of each candidate
Online routing
When transferring data repeatedly from one location to another over a network, choose one of several routes so as to minimize the cumulative delay
1.3 Probabilistic Bandit and Adversarial Bandit
The bandit problem can be broadly divided into two categories: the stochastic setting, where the reward from each arm is generated according to a probability distribution, and the adversarial setting, where an adversary who knows the player's strategy determines the rewards.
Stochastic bandit
Rewards from each arm are generated according to some probability distribution
Adversarial bandit
Assume a case where an adversary who knows the player’s strategy determines the reward
Consider measures that will work in the worst case scenario
Types of adversaries
Oblivious adversary
Rewards are determined independently of the player’s past choices
Adaptive adversary
Determines the next reward depending on the player's past choices
1.4 Evaluation of Player Policies
1.5 History of the Bandit Problem
1.6 Related Fields
1.7 Organization of this book

Chapter 2: Fundamentals of the Stochastic Bandit Problem

Introduction
Explains the concept of the large deviation principle, the theory underlying the stochastic bandit problem
Explains how it differs from the central limit theorem
2.1 Probability Approximation by the Central Limit Theorem
Key issue to consider in the stochastic bandit problem
When an ad's current estimated click-through rate is below 5%, what is the likelihood that its true click-through rate is actually μ = 10%?
In the language of statistics, this can be stated in general terms as:
What is the probability (likelihood) that the sample mean satisfies μ̂n ≤ x for some x ∈ [0, 1] when the true click-through rate is μ?
A commonly used approximation for the probability distribution of the sample mean μ̂n when the number of samples n is large
Central Limit Theorem
Even if this approximation is used, it is not unconditionally "accurate" once the number of samples is large.
What the theorem shows is that,
if you want to approximate the probability within an error ε,
the normal approximation is justified only once the number of samples exceeds some nε determined by ε.
Even if an event with a true probability of occurrence of 0.01% is approximated with an (absolute) error of ε = 1%,
the relative error can be as large as a factor of 100.
A small absolute error therefore does not necessarily mean good approximation accuracy in practice.
A large number of samples is required to evaluate low-probability events with small relative error using the normal approximation.
Berry-Essen Theorem
The approximation by the central limit theorem has an error of about ε = O(1/√n)
Approximating a probability with error ε therefore requires on the order of O(1/ε²) samples
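For reference, the Berry-Esseen theorem is usually stated as follows (standard form, not quoted from the book): for i.i.d. X1, ..., Xn with mean μ, variance σ², and third absolute central moment ρ = E|Xi − μ|³ < ∞,

\[
\sup_{x \in \mathbb{R}} \left| \Pr\!\left( \frac{\sqrt{n}\,(\hat{\mu}_n - \mu)}{\sigma} \le x \right) - \Phi(x) \right| \;\le\; \frac{C\,\rho}{\sigma^{3}\sqrt{n}},
\]

where Φ is the standard normal distribution function and C is a universal constant. The normal approximation error thus decays only as O(1/√n), so achieving error ε requires on the order of 1/ε² samples.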
2.2 Evaluation of Tail Probabilities
The central limit theorem is useful for approximating the probability of events that occur with relatively high probability
It is not suitable for evaluating, with small relative error, the probability of low-frequency events such as "the probability that the sample mean deviates significantly from the expected value" (the tail probability)
The simplest bound for evaluating the tail probability
Theorem.
From this inequality, the probability that the sample mean μ̂n deviates from the true mean μ by more than Δ decreases exponentially
Moreover, the exponent 2Δ² cannot be improved as a quantity independent of the distribution of the Xi and of μ
A more accurate evaluation is possible if dependence on these is allowed
Chernoff-Hoeffding’s inequality
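The tail bound referred to here is, in its standard form (stated as a reference rather than quoted from the book), Hoeffding's inequality: for i.i.d. rewards Xi ∈ [0, 1] with mean μ and sample mean μ̂n,

\[
\Pr\left( \hat{\mu}_n \ge \mu + \Delta \right) \le e^{-2n\Delta^{2}},
\qquad
\Pr\left( \hat{\mu}_n \le \mu - \Delta \right) \le e^{-2n\Delta^{2}},
\]

which matches the exponential decay with exponent 2Δ² mentioned above.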
2.3 Large Deviation Principle

Chapter 3: Measures for Stochastic Bandit Problems

Introduction
Describes the most basic setting of the stochastic bandit problem
Discusses the achievable theoretical limits
Introduces typical policies that achieve these theoretical limits, such as the UCB policy and Thompson sampling, and explains their intuitive interpretations (sketches of both follow below)
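As an informal companion to the UCB policy discussed in this chapter, here is a minimal UCB1-style sketch for Bernoulli rewards; the exploration term constant and reward probabilities are illustrative assumptions.

```python
import math
import random

def ucb1(true_probs, horizon=10_000):
    """Minimal UCB1 sketch: pull the arm with the highest upper confidence bound."""
    k = len(true_probs)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                                          # pull each arm once first
        else:
            ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i]) for i in range(k)]
            arm = max(range(k), key=lambda i: ucb[i])            # optimism in the face of uncertainty
        reward = 1 if random.random() < true_probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means

print(ucb1([0.03, 0.05, 0.08]))
```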
3.1 Formulation
3.2 Theoretical limits
3.3 ε-Greedy method
3.4 Likelihood-based policies
3.4.1 UCB policy
3.4.2 MED policy
3.5 Probability Matching Method and Thompson Sampling
3.5.1 Characteristics and interpretation of probability matching methods
3.5.2 Thompson Sampling
3.5.3 Relationship between Thompson Sampling and the UCB Policy
3.6 Worst-Case Assessment
3.6.1 Worst case evaluation example
3.6.2 Optimal policies in the worst case
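For comparison with the UCB sketch above, here is a minimal Thompson sampling sketch for Bernoulli rewards with Beta(1, 1) priors; the priors and reward probabilities are illustrative assumptions.

```python
import random

def thompson_sampling(true_probs, horizon=10_000):
    """Minimal Thompson sampling sketch for Bernoulli arms with Beta(1, 1) priors."""
    k = len(true_probs)
    alpha = [1] * k   # posterior successes + 1
    beta = [1] * k    # posterior failures + 1
    for _ in range(horizon):
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])    # play the arm whose posterior sample is largest
        reward = 1 if random.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta

print(thompson_sampling([0.03, 0.05, 0.08]))
```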

Chapter 4: Regret Analysis of Stochastic Bandit Problems

Introduction
Introduces a general framework for evaluating the performance of algorithms for stochastic bandit problems
Explains the conditions that a "good" algorithm must satisfy
Based on this framework, gives regret upper bounds for the algorithms introduced in the previous chapter
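The quantity being bounded is the regret; in the usual notation (stated here as the standard definition, not a quotation from the book), with μ* the largest expected reward, Δi = μ* − μi, i(t) the arm chosen at round t, and Ni(T) the number of pulls of arm i up to round T,

\[
\mathbb{E}[\mathrm{regret}(T)]
= \mathbb{E}\Bigl[\sum_{t=1}^{T} \bigl(\mu^{*} - \mu_{i(t)}\bigr)\Bigr]
= \sum_{i:\ \Delta_i > 0} \Delta_i\, \mathbb{E}[N_i(T)],
\]

so bounding the expected number of pulls of each suboptimal arm (ideally at O(log T)) bounds the regret; this is the decomposition used in Section 4.1.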
4.1 Decomposition of Regret
4.1.1 Behavior after convergence
4.1.2 Behavior before convergence
4.2 Cumulative distribution function and expected value
4.3 Performance Analysis of the UCB Policy
4.4 Performance Analysis of Thompson Sampling
4.4.1 Tail probability of the posterior distribution
4.4.2 Regret analysis

Chapter 5: Adversarial Bandit Problem

Introduction
Describes the setting of the adversarial bandit problem
Introduces the Exp3 policy and its variant, the Exp3.P policy, as bandit versions of the Hedge algorithm, an online learning algorithm for the full-information setting in computational learning theory
Proves upper bounds on the pseudo-regret and on the regret that hold with high probability
Proves a lower bound on the pseudo-regret, and
introduces the PolyINF policy, which achieves a pseudo-regret of the same order as the lower bound
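A minimal Exp3 sketch for rewards in [0, 1]; the mixing parameter γ and the toy adversary below are illustrative assumptions.

```python
import math
import random

def exp3(reward_fn, k, horizon, gamma=0.1):
    """Minimal Exp3 sketch: exponential weights with importance-weighted reward estimates."""
    weights = [1.0] * k
    for t in range(horizon):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]  # mix in uniform exploration
        arm = random.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)                # observed reward in [0, 1] for the chosen arm only
        estimate = reward / probs[arm]            # importance-weighted estimate of the unobserved vector
        weights[arm] *= math.exp(gamma * estimate / k)
    return weights

# Example: an oblivious adversary that favours arm 2 (assumed for illustration)
print(exp3(lambda t, arm: 1.0 if arm == 2 else random.random() * 0.5, k=3, horizon=5_000))
```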
5.1 Problem Setup
5.2 Online learning theory and Hedge algorithm
5.3 Exp3 policy
5.4 Exp3.P policy
5.5 Regret Lower Bound for the Adversarial Multi-Armed Bandit Problem
5.6 Policies of optimal order

Chapter 6: Optimal Arm Identification and A/B Testing

Introduction
The bandit problem described in the previous chapters
estimates (explores) the expected value of each slot machine and, at an appropriate frequency, selects (exploits) the machine currently estimated to have the largest expected value,
with the objective of maximizing the cumulative reward
In practical situations, however,
there is often a clear distinction between an exploration period and a period of knowledge use (exploitation),
and we want to identify, with high probability, the slot machine with the largest expected value during the exploration period
This chapter discusses efficient policies for such pure exploration problems
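A minimal sketch of the pure-exploration (fixed-budget) setting: sample every arm uniformly within a budget and report the empirically best arm; the budget and reward probabilities are assumed for illustration.

```python
import random

def uniform_best_arm(true_probs, budget=1_000):
    """Fixed-budget best-arm identification by uniform sampling (illustrative sketch)."""
    k = len(true_probs)
    counts = [0] * k
    wins = [0] * k
    for t in range(budget):
        arm = t % k                                   # sample every arm equally often
        counts[arm] += 1
        wins[arm] += 1 if random.random() < true_probs[arm] else 0
    return max(range(k), key=lambda i: wins[i] / counts[i])  # recommended arm

print(uniform_best_arm([0.45, 0.5, 0.55]))
```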
6.1 Formulation
6.1.1 Difference from Cumulative Reward Maximization
6.1.2 (ε, δ)-optimal arm identification
6.1.3 Simple Regret
6.2 Sample complexity
6.3 Optimal Arm Identification Policies
6.3.1 Methods based on uniform selection
6.3.2 Score-based methods
6.4 Fixed budget setting

Chapter 7 Bandit Problem on a Linear Model

Introduction
In the settings of the previous chapters,
we considered the case where the reward from one slot machine carries no information about another machine
In actual applications, however,
each slot machine is associated with some features,
and in many cases properties of the rewards of one slot machine can be inferred from the rewards of another
In this chapter,
as a typical formulation of such a setting,
we consider the case where the expected reward is represented by a linear model over the features
Contextual bandits, where the slot machines' features change from round to round, can be formulated in the same framework
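A minimal LinUCB-style sketch using a ridge-regression estimate plus an exploration bonus; the regularization λ, confidence width α, feature vectors, and hidden parameter are illustrative assumptions.

```python
import numpy as np

def linucb(features, reward_fn, horizon=1_000, alpha=1.0, lam=1.0):
    """Minimal LinUCB sketch: ridge-regression estimate plus an exploration bonus."""
    d = features.shape[1]
    A = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)
    for t in range(horizon):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b                                    # current ridge estimate of the parameter
        scores = features @ theta + alpha * np.sqrt(
            np.einsum("ij,jk,ik->i", features, A_inv, features))  # UCB score per arm
        arm = int(np.argmax(scores))
        x = features[arm]
        r = reward_fn(t, arm)
        A += np.outer(x, x)
        b += r * x
    return theta

# Example: 3 arms with 2-dimensional features and a hidden linear parameter (assumed values)
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
theta_true = np.array([0.2, 0.5])
print(linucb(X, lambda t, a: float(X[a] @ theta_true + 0.1 * np.random.randn())))
```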
7.1 Linear Bandit
7.2 Contextual Bandit
7.3 LinUCB policy
7.4 Thompson Sampling on a Linear Model
7.4.1 Calculating posterior probabilities in a normally distributed model
7.4.2 Random number generation from multivariate normal distribution
7.4.3 When the error term is not normally distributed
7.5 Bandit on logistic regression model

Chapter 8: Continuous Arm Bandit and Bayesian Optimization

Introduction
In this chapter
We introduce formulations of the bandit problem and policies for the case where the player's possible actions are numerous or continuous, as in tuning learning parameters.
Bayesian optimization, which assumes that rewards are generated by a Gaussian process, is also discussed.
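A minimal GP-UCB-style sketch on a discretized one-dimensional action space; the RBF kernel, its length scale, the β parameter, and the toy objective are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, ell=0.2):
    """Squared-exponential (RBF) covariance between two sets of 1-D points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_ucb(objective, horizon=30, beta=2.0, noise=1e-2):
    """Minimal GP-UCB sketch on a 1-D grid of candidate actions (illustrative)."""
    grid = np.linspace(0.0, 1.0, 200)          # discretized continuous arm
    X, y = [], []
    for t in range(horizon):
        if X:
            Xa = np.array(X)
            K = rbf(Xa, Xa) + noise * np.eye(len(X))
            k_star = rbf(grid, Xa)
            alpha = np.linalg.solve(K, np.array(y))
            mean = k_star @ alpha                              # GP posterior mean on the grid
            var = 1.0 - np.einsum("ij,ji->i", k_star, np.linalg.solve(K, k_star.T))
            score = mean + beta * np.sqrt(np.clip(var, 0.0, None))
            x = grid[int(np.argmax(score))]                    # maximize the upper confidence bound
        else:
            x = float(np.random.choice(grid))                  # first point: pick at random
        X.append(x)
        y.append(objective(x))
    return X[int(np.argmax(y))]

# Example objective with a maximum near x = 0.7, observed with noise (assumed for illustration)
print(gp_ucb(lambda x: np.exp(-20 * (x - 0.7) ** 2) + 0.05 * np.random.randn()))
```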
8.1 Formulation and observation model
8.2 Regret setup
8.3 Classes of expected value functions
8.3.1 Smoothness constraints
8.3.2 Bayesian optimization
8.4 Continuous-arm bandit policies
8.4.1 GP-UCB policy
8.4.2 Thompson sampling
8.4.3 Expected improvement policy
8.4.4 Polynomial-time feasible policies
8.5 Parameter Estimation of Covariance Function

Chapter 9 Extensions to the Bandit Problem

Introduction
In addition to the linear bandit problem,
there are various other extensions and generalizations of the bandit problem
This chapter introduces some of them that are of practical importance or have recently attracted attention.
9.1 Bandit problem with time variation
9.1.1 Contextual Bandit-based Methods
9.1.2 Methods based on adversarial bandits
9.1.3 Cases with finite time variation
9.1.4 Other methods
9.2 Comparative Bandit
9.2.1 Formulation
9.2.2 Comparative Bandit Policies
9.3 Partial observation problem
9.3.1 Examples of partial observation problems
9.3.2 Classification and theoretical limits
9.3.3 Policies for partial observation problems
9.4 Other extensions

Chapter 10: Application of Bandit Method

Introduction
This chapter covers three areas: game tree search, Internet ad serving, and recommendation systems.
Explains what kinds of problems can be treated as bandit problems
and introduces the methods used
10.1 Monte Carlo Tree Search
AI programs that play games such as shogi and Go
search extensively to find the best possible next move
In game theory, shogi and Go are
two-player zero-sum perfect-information games
The current position is represented by a node
The moves possible from that position are connected by branches as child nodes
From each of those positions, the positions reachable in one more move are added as child nodes
Repeating this yields a representation as a tree (game tree)
Since the winner and loser are clearly determined at the final positions, which correspond to the leaves of the tree, the goodness of a position from the current player's point of view can be given as an evaluation value derived from those final positions
The best possible move is found by computing the evaluation value of every node in the game tree
Example: the game tree of tic-tac-toe
Minimax search performs a full search that computes the evaluation values of all nodes
Unnecessary parts of the tree are pruned away
When the game tree is not expanded all the way to the final positions,
it is expanded only a few moves ahead, and
the leaves where expansion was cut off are evaluated heuristically using knowledge of the game
This is hard to apply to games such as Go, where it is difficult to assign appropriate evaluation values until very close to the end of the game
Brügmann applied the Monte Carlo method:
play out the game randomly, using random numbers, from the position to be evaluated until a final position is reached,
thereby randomly sampling the evaluation values of reachable final positions,
and use the average of the resulting samples as the evaluation of that position
Monte Carlo tree search uses this Monte Carlo evaluation for the leaf nodes of a game tree whose expansion was cut off partway
The Go program CrazyStone was developed by Rémi Coulom along these lines
General procedure of Monte Carlo tree search
1. Selection of a leaf node vt of the tree T
2. Expansion of the tree T (add a child node v′ to the leaf node vt and make it the new vt)
3. Random sampling, by playout, of the evaluation value of a final position reachable from the leaf node vt
4. Backpropagation of the sampled evaluation value from the leaf node vt back to the root node v0
There are many variations of Monte Carlo tree search, depending on the strategy used for each of these four steps
UCT Algorithm
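The UCT algorithm applies a UCB1-style score to node selection: in its standard form (not quoted from the book), from node v the child v′ maximizing

\[
\mathrm{UCT}(v') \;=\; \frac{w(v')}{n(v')} \;+\; c \sqrt{\frac{\ln n(v)}{n(v')}}
\]

is selected, where w(v′) is the accumulated win-based evaluation at v′, n(·) is the visit count, and c > 0 is an exploration constant (often c = √2).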
10.2 Internet advertising
A portion of a web page is rented out as advertising space
The advertiser pays the ad-serving company,
and the provider of the ad space receives that payment minus the brokerage fee
Various billing and compensation methods
Per-click billing and compensation method
It is not possible to know in advance which ads will get the most clicks
The click-through rate cannot be estimated with any accuracy until the ad has actually been served to some extent
A bandit-problem policy is effective here, balancing exploration (serving the ads whose click-through-rate estimates are least accurate because they have been served few times so far) and knowledge use (serving the ads with the highest estimated click-through rate so far)
If ads are served simply by estimating click-through rates and serving the highest,
only popular ads (ads with high click-through rates no matter which page they are placed on) are likely to be served
In reality, at the advertisers' discretion, even popular ads are only served up to a budgeted frequency
Even an ad that is not broadly popular may have good compatibility with a particular page, and its click-through rate will be higher when it is served on a compatible page
Algorithms for auction-based ad-serving design
10.3 Recommendation System
A system that recommends products and other items suited to each individual, based on the user's purchase and browsing history on the site
Recommendation systems are studied in a research field called information filtering
Depending on what information is used:
using user demographics (gender, age, region of residence, income, occupation, education)
using content attributes
using the ratings of users with similar browsing and purchase histories
Typically, the parameters of a linear model are learned from training data
Bandit-based recommendation is effective for evaluating new users by balancing exploration and knowledge use

Appendix A Updating the Inverse Matrix
Appendix B Tail Probability of the Beta Distribution
