Overview of the Upper Confidence Bound (UCB) Algorithm
In the ε-greedy method described in “Overview of the ε-greedy method, algorithm, and implementation examples” and the softmax algorithm described in “Boltzmann Distribution, Softmax Algorithm, and Bandit Problem”, the arm to play is chosen probabilistically, and the algorithm focuses only on “how much reward is obtained from the arm”.
Therefore, if the reward from an arm is noisy and the true value is hidden behind a noisy estimate, these algorithms are easily fooled by the first few observations and can make a wrong decision after a few negative experiences. However, because they rely on randomness, they can make up for those mistakes in later plays.
The Upper Confidence Bound (UCB) algorithm is an algorithm for selecting the optimal action (or arm) in the Multi-Armed Bandit Problem (MAB) while taking into account the uncertainty in the value of each action. The method aims to select the optimal action by appropriately adjusting the trade-off between exploration and exploitation.
In other words, the approach is to evaluate and track how confident the algorithm is in each arm’s estimated value, so that it is not deceived by noisy estimates.
This confidence is specifically calculated based on Hoeffding’s inequality, a type of probability inequality, and intuitively takes the following form.
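As a sketch (the exact constants differ between UCB variants, but this is the form the implementation below also uses), the UCB value of arm $a$ after $t$ total plays is the empirical mean reward plus a Hoeffding-style bonus:

$\mathrm{UCB}_a(t) = \bar{x}_a + \sqrt{\frac{2 \ln t}{n_a}}$

where $\bar{x}_a$ is the average reward observed from arm $a$ and $n_a$ is the number of times arm $a$ has been played; the arm with the largest value is selected. The bonus is large for arms that have been played only a few times and shrinks as $n_a$ grows, which is how the algorithm balances exploration and exploitation.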
An overview of the UCB algorithm is given below.
First, as with the epsilon-greedy and softmax algorithms, a class is set up to store all the information that the algorithm tracks.
import math

class UCB():
    def __init__(self, counts, values):
        self.counts = counts  # number of times each arm has been pulled
        self.values = values  # running average reward for each arm
        return
The field parameters of UCB are similar to those of the epsilon-greedy method and the softmax algorithm; the difference lies in how counts is used.
1. Initialization: For each action, an initial estimated value and an exploration degree are set. Typically, the estimated value is initialized to zero or some other appropriate value, and the exploration degree (the width of the confidence interval) is set to infinity for actions that have not yet been tried.
    def initialize(self, n_arms):
        self.counts = [0 for col in range(n_arms)]
        self.values = [0.0 for col in range(n_arms)]
        return
2. Action selection: For each action, the estimated value plus the exploration bonus is calculated. This sum is called the UCB value, and the action with the largest UCB value is selected.
    def select_arm(self):
        n_arms = len(self.counts)
        # play every arm once before applying the confidence-based rule
        for arm in range(n_arms):
            if self.counts[arm] == 0:
                return arm
        ucb_values = [0.0 for arm in range(n_arms)]
        total_counts = sum(self.counts)
        for arm in range(n_arms):
            bonus = math.sqrt((2 * math.log(total_counts)) / float(self.counts[arm]))
            ucb_values[arm] = self.values[arm] + bonus
        return ucb_values.index(max(ucb_values))
    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] = self.counts[chosen_arm] + 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        # incrementally update the running average reward of the chosen arm
        new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value
        return
In the above code, the check if self.counts[arm] == 0 makes sure that every given arm is selected at least once. By doing this, UCB avoids having no data at all before applying the confidence-based decision rule. This is characteristic of UCB, which takes longer to initialize when the number of arms to be explored is large. However, the benefit of having a little information about all the given choices is that clearly disadvantaged arms can be ruled out from the beginning.
The select_arm method of UCB is distinctive in that it uses a quantity called ucb_value, which is the estimated value of each arm plus a bonus value. The bonus reflects how much information is still missing for that arm relative to the total number of plays: the less often an arm has been pulled, the larger its bonus.
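As a rough numeric illustration (the figures below are hypothetical, assuming 100 total plays), the bonus shrinks as an arm accumulates plays:

import math

total_counts = 100                 # hypothetical total plays across all arms
for plays in (2, 10, 50):          # number of plays of a single arm
    bonus = math.sqrt((2 * math.log(total_counts)) / float(plays))
    print(plays, round(bonus, 2))  # 2 -> 2.15, 10 -> 0.96, 50 -> 0.43

An arm pulled only twice therefore receives a much larger boost than one pulled fifty times, which is what drives the algorithm to keep checking under-explored arms.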
A plot of the likelihood that UCB will select the appropriate arm at a given point in time is shown below.
This curve is noisier than those of the epsilon-greedy and softmax algorithms, even though there is no randomness in the algorithm itself.
3. Action execution: Perform the selected action and observe the resulting reward.
4. Updating the estimated value: The observed reward is used to update the estimated value of the chosen action. Typically, the running average of the rewards is used as the estimate, and the exploration bonus of that action shrinks as its count grows.
5. Iteratively repeat steps 2 through 4 (a minimal driver loop for the class above is sketched below).
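As a minimal sketch of this loop, assuming five Bernoulli-reward arms whose success probabilities in means are chosen purely for illustration, the UCB class above can be driven as follows:

import random

means = [0.1, 0.1, 0.1, 0.1, 0.9]  # hypothetical true success probabilities of the arms
algo = UCB([], [])
algo.initialize(len(means))

for t in range(1000):
    chosen_arm = algo.select_arm()   # step 2: pick the arm with the largest UCB value
    reward = 1.0 if random.random() < means[chosen_arm] else 0.0  # step 3: observe a reward
    algo.update(chosen_arm, reward)  # step 4: update the running average of that arm

print(algo.counts)  # after enough rounds, the best arm should dominate the counts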
A characteristic of the UCB algorithm is that it actively explores unknown actions through the exploration degree (the width of the confidence interval), thereby taking uncertainty into account. As each action is tried in the early stages and more reliable estimates are obtained, the exploration degree decreases, and the algorithm converges to the optimal action.
Below are the results of a comparison of the UCB algorithm with the epsilon-greedy method and the Softmax algorithm.
Compared to the epsilon-greedy method, the Softmax algorithm is more accurate; the UCB algorithm approaches the Softmax algorithm, although its curve is noisier, and overtakes it as the number of trials increases.
Application of the Upper Confidence Bound (UCB) Algorithm
The Upper Confidence Bound (UCB) algorithm is widely used in the Multi-Armed Bandit Problem (MAB) and has a variety of applications. The following are some of the major applications of the UCB algorithm.
1. Online ad selection:
The UCB algorithm is used for online ad selection. It helps to maximize advertisers’ revenue by selecting the ads with the highest click-through rates among different ad variations. The click rate of an ad is uncertain, and the UCB algorithm adjusts the trade-off between exploration and utilization to select the best ad.
2. Medical diagnosis and treatment planning:
In the medical field, the UCB algorithm is being used to select the optimal diagnosis and treatment plan. This takes into account uncertainties about the patient’s medical condition and response and contributes to optimizing treatment selection.
3. Renewable energy management:
Research is being conducted to apply the UCB algorithm to the optimal control of renewable energy generation. In cases where energy supply is uncertain, such as wind and solar power generation, the UCB algorithm can help achieve optimal power generation control.
4. Online education and automated tutoring:
The UCB algorithm has been applied to online education and automated tutoring systems on learning platforms. Based on the student’s ability and level of understanding, the most appropriate questions and assignments can be selected to support effective learning.
5. Unmanned aerial vehicle (drone) path planning:
UCB algorithms are used for unmanned aerial vehicle (drone) path planning to optimize the drone’s flight path and determine the best path to achieve a goal.
These are just some of the common applications of the UCB algorithm. The UCB algorithm is very useful for decision-making and optimization problems involving uncertainty, and it is widely used in a variety of real-world domains as a powerful tool for taking uncertainty into account when selecting actions and making optimal choices.
Example implementation of the Upper Confidence Bound (UCB) algorithm
The implementation of the UCB (Upper Confidence Bound) algorithm is relatively simple and is usually done in a programming language such as Python. The following is a simple example of a UCB algorithm implementation; here the reward is simulated as a Bernoulli draw with hypothetical success probabilities.
import numpy as np

# Number of actions
num_actions = 5

# Hypothetical true success probability of each action (used only for the reward simulation)
true_means = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

def simulate_reward(action):
    # Return a Bernoulli reward of 0 or 1 drawn from the action's true success probability
    return float(np.random.rand() < true_means[action])

# Initialization for each action
action_rewards = np.zeros(num_actions)  # Total reward for each action
action_counts = np.zeros(num_actions)   # Number of times each action has been selected

# Main UCB loop
num_rounds = 1000                        # Number of rounds
ucb_values = np.zeros(num_actions)       # UCB value for each action

for t in range(num_rounds):
    # Action selection: estimated mean reward plus exploration bonus
    for a in range(num_actions):
        if action_counts[a] == 0:
            ucb_values[a] = np.inf       # Unselected actions get an infinite UCB value
        else:
            ucb_values[a] = action_rewards[a] / action_counts[a] + np.sqrt(2 * np.log(t) / action_counts[a])
    selected_action = np.argmax(ucb_values)

    # Perform the selected action and observe the reward
    reward = simulate_reward(selected_action)

    # Update the statistics of the selected action
    action_rewards[selected_action] += reward
    action_counts[selected_action] += 1

# Analyze the final selection probabilities and rewards
action_selection_probabilities = action_counts / num_rounds
print("Selection probability for each action:", action_selection_probabilities)
print("Cumulative reward for each action:", action_rewards)