Overview of Advantage Learning and examples of algorithms and implementations

Overview of Advantage Learning

Advantage Learning is an extension of Q-learning, described in ‘Overview of Q-learning, algorithms and implementation examples’, and of the policy gradient method, described in ‘Overview of the policy gradient method and examples of algorithms and implementations’, and is a method for learning the difference between the action value and the state value, the ‘advantage’. In conventional Q-learning, the expected reward (Q-value) for a state-action pair is learned directly, whereas in advantage learning an advantage function \(A(s,a)\) is computed to evaluate how good each action is relative to the state in which it is taken.

The advantage function \(A(s,a)\) represents the difference between the Q-value \(Q(s,a)\) obtained when action \(a\) is chosen in state \(s\) and the reference value (state value) \(V(s)\) of that state.

\[
A(s,a) = Q(s,a) - V(s)
\]

This difference makes it possible to assess the ‘relative goodness’ of an action: a higher advantage indicates that the action is a better choice than the average action in that state.
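
As a simple worked example (the numbers are purely illustrative), suppose a state \(s\) has two actions with estimated action values \(Q(s,a_1)=5\) and \(Q(s,a_2)=2\), and the state value is \(V(s)=3.5\). Then

\[
A(s,a_1) = 5 - 3.5 = 1.5, \qquad A(s,a_2) = 2 - 3.5 = -1.5
\]

so action \(a_1\) is better than average in this state, while \(a_2\) is worse.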

Advantages of advantage learning include increased stability, achieved by focusing on the advantage rather than directly on the Q-value itself, and the combination of the advantage with the policy gradient method, which makes gradient-based learning more efficient and also improves sample efficiency. In the latter case, especially in Actor-Critic algorithms, the advantage is used to update the actor (policy) network.

Typical approaches to advantage learning include Advantage Actor-Critic (A2C) and Generalised Advantage Estimation (GAE), which are often used especially in the field of deep reinforcement learning.

Algorithms related to Advantage Learning

The algorithms related to advantage learning are described below.

1. Advantage Actor-Critic (A2C): A2C, also described in ‘Overview of Advantage Actor-Critic (A2C) and examples of algorithms and implementations’, is an advantage-based Actor-Critic algorithm that is split into an actor, which updates the policy, and a critic, which evaluates values.

– Actor: a network that outputs the policy and selects actions.
– Critic: a network that evaluates state (and action) values, i.e. how good the actions chosen by the actor are.

The advantage function \( A(s,a) = Q(s,a) - V(s) \) allows the actor to improve its policy with the policy gradient method while referring to the values output by the critic.

A2C is characterised by a stable learning process, as the actor and the critic are updated simultaneously, and by the use of the advantage, so that the policy is updated based on the relative goodness of actions.
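
Concretely, the actor in this setting is updated with the advantage-weighted policy gradient, which can be written as

\[
\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a|s)\, A(s,a) \right]
\]

where \(\pi_\theta\) is the policy output by the actor and \(A(s,a)\) is estimated from the critic's value predictions.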

2. Asynchronous Advantage Actor-Critic (A3C): A3C, also described in ‘Overview of A3C (Asynchronous Advantage Actor-Critic) and examples of algorithms and implementations’, is an asynchronous version of A2C in which multiple agents learn in parallel and their results are aggregated to update a shared model. Several threads each explore the environment and learn independently, which improves the diversity of exploration and sample efficiency.

Features include asynchronous updating, where each thread explores a different copy of the environment and updates the policy with its own gradients, and parallel learning, which reduces the variance of the gradients and enables more stable learning.
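
The following is a minimal structural sketch of this asynchronous update pattern in Python, using a toy shared parameter vector and a random placeholder gradient in place of the actor-critic gradients that a real A3C worker would compute from its own environment copy.

import threading
import numpy as np

# Toy shared parameters; in real A3C these would be the global actor-critic weights
global_params = np.zeros(4)
lock = threading.Lock()

def worker(worker_id, n_steps=100, lr=0.01):
    rng = np.random.default_rng(worker_id)
    for _ in range(n_steps):
        # Each worker explores its own environment copy and computes a local gradient
        # (here replaced by a random placeholder gradient for illustration)
        local_grad = rng.normal(size=global_params.shape)
        # Asynchronously apply the local gradient to the shared parameters
        with lock:
            global_params[:] = global_params - lr * local_grad

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_params)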

3. Generalised Advantage Estimation (GAE): GAE, also described in ‘Generalised Advantage Estimation (GAE) Overview and Examples of Algorithms and Implementations’, is a variant of advantage estimation used in particular to optimise the bias-variance trade-off in reinforcement learning. Instead of relying on a single estimate of future rewards, a more stable advantage estimate is obtained by taking a weighted average of advantage estimates over several different time horizons.

The main idea of GAE is to introduce a discount rate \( \gamma \) and a smoothing parameter \( \lambda \) into the calculation of the advantage, allowing flexibility in the temporal dependence of rewards. It is characterised by flexible estimation, balancing short-term and long-term rewards by adjusting \( \lambda \), and by increased stability, because incorporating rewards over multiple time scales reduces noise and allows stable updates of the policy.
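
Concretely, GAE starts from the one-step TD error \( \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \) and forms an exponentially weighted sum of these errors:

\[
\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}
\]

With \( \lambda = 0 \) this reduces to the one-step TD error (low variance, higher bias), while \( \lambda = 1 \) approaches the Monte Carlo return minus \( V(s_t) \) (low bias, higher variance). A minimal sketch in Python (the arrays rewards, values and dones are assumed to come from a collected trajectory, with values holding one extra bootstrap entry) might look like this:

import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values has length len(rewards) + 1 (the last entry is the bootstrap value)
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages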

4. Trust Region Policy Optimisation (TRPO): TRPO, also described in ‘Overview of Trust Region Policy Optimisation (TRPO) and examples of algorithms and implementations’, is a method introduced to make policy updates safer, with a trust region set as a constraint to prevent policy changes from becoming too large. While using the advantage to assess the quality of actions, TRPO limits the range of policy updates, thereby preventing large updates from destabilising learning.

Features include the use of an advantage function when updating the policy, and optimisation within the trust region, which prevents destructive policy changes by constraining the size of each update.
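
The TRPO update is usually written as a constrained optimisation over the new policy parameters \( \theta \):

\[
\max_\theta \; \mathbb{E}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\, A(s,a) \right] \quad \text{s.t.} \quad \mathbb{E}\left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \right] \le \delta
\]

where \( \delta \) is the size of the trust region.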

5. Proximal Policy Optimisation (PPO): PPO, also described in ‘Overview of Proximal Policy Optimisation (PPO) and examples of algorithms and implementations’, is an improved version of TRPO that uses a simpler update rule while still limiting the size of each policy update. Specifically, it uses an advantage-based clipping technique to prevent policy updates from becoming too large.

Features include the use of the advantage when the actor updates its policy, and update control by clipping, which limits the update width, simplifying the algorithm and improving stability at the same time.
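
Using the probability ratio \( r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t) \), PPO's clipped surrogate objective can be written as

\[
L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\big( r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \big) \right]
\]

where \( \hat{A}_t \) is the estimated advantage and \( \epsilon \) (typically around 0.1 to 0.2) limits how far the new policy can move from the old one.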

These algorithms are characterised by the use of an advantage function to improve learning stability and enable sample-efficient policy updates; each method is designed for a different purpose and strengthens the foundation of advantage learning.

Case studies of the application of Advantage Learning

Advantage learning makes use of the advantage function as part of reinforcement learning to improve the stability and efficiency of learning and to help solve complex problems. Examples of its application are described below.

1. robotics: reinforcement learning algorithms that use advantage learning are applied to robot control. For example, Advantage Actor-Critic (A2C) and Proximal Policy Optimisation (PPO) are used for balance control and arm manipulation, learning the optimal action in each state in real time; evaluating the relative goodness of actions through the advantage enables efficient learning.

Case study:
– Task in which a robot walks while maintaining balance on unstable terrain.
– Reinforcement learning for industrial robots to adapt to different environments when learning to grasp objects.

2. game AI: Advantage learning is heavily used in game AI to learn complex strategies. Especially in real-time strategy games and board games, where there is a wide range of possible actions to choose from, advantage functions can be used to learn while evaluating how the current action compares to other options.

Case study:
– OpenAI’s Dota 2 AI: The AI developed by OpenAI is known as a powerful game player and uses algorithms based on PPO. It uses advantage learning to adapt to the game’s complexities and learns strategic behaviours.
– Atari games: an A3C algorithmic approach is used in AI development for various Atari games, where multiple agents can learn efficient play strategies by exploring in parallel.

3. automated driving: advantage learning is also used in the control systems of self-driving vehicles. The automated driving environment involves complex factors such as road conditions and the movements of other vehicles, but optimal driving manoeuvres can be learnt through reinforcement learning. Advantage learning is an effective approach for assessing whether driving behaviour is safe and efficient.

Case study:
– Reinforcement learning of complex decisions at lane changes and intersections, taking into account the relative advantages of the actions.
– Advantage learning is used to control distance and speed on motorways to ensure safe and efficient driving.

4. optimising financial transactions: in the financial sector, advantage learning is applied to decision-making in equity and options trading. Advantage functions are used to assess the relative merits and demerits of market fluctuations and trading strategies in real time and to make optimal investment decisions.

Case study:
– When forecasting stock prices and optimising trading strategies, A2C and PPO are used to learn the optimal behaviour in each state.
– Risk management of portfolios with multiple assets is optimised through reinforcement learning.

5. healthcare: in healthcare, advantage learning is used to optimise treatment planning. Reinforcement learning algorithms are used to optimise treatment plans and medication schedules based on the patient’s condition, and advantage functions are used to assess the relative effectiveness of treatments.

Case study:
– In the treatment of a chronic disease, how well each step of treatment compares to other options is assessed in order to optimise long-term health status.
– Medical robots use advantage learning to support surgical operations in a safe and efficient manner.

6. optimising advertising and marketing: advantage learning is applied in optimising online advertising and personalised marketing. Reinforcement learning is used to learn which ads are most effective based on the timing of the ad display and the behaviour of the target user.

Case study:
– Predicting whether a user will click on a particular ad and optimising ad display strategies based on that behaviour.
– To maximise the effectiveness of retargeting ads, an advantage function is used to compare the effectiveness of different ads and determine the best ad display.

These examples show that the ‘relative behaviour evaluation’ feature of advantage learning is used to solve complex decision-making problems.

Examples of Advantage Learning implementations

As a typical implementation example of advantage learning, we describe how a simple version of Advantage Actor-Critic (A2C) can be implemented in Python with TensorFlow or PyTorch. Here, OpenAI Gym’s CartPole is used as the reinforcement learning environment, and an Actor-Critic network learns the policy using the advantage.

Implementation overview:

  1. Environment initialisation: set up OpenAI Gym’s CartPole environment.
  2. Actor-Critic networks: define an actor network (the policy) and a critic network (the value estimate).
  3. Calculation of the advantage: compute the advantage as the difference between the estimated action value and the state value.
  4. Updating the policy and value functions: apply the actor and critic gradients to improve the policy.

Install required libraries: first, install the required libraries.

pip install gym tensorflow

Alternatively, if you use PyTorch:

pip install gym torch

Example A2C implementation (TensorFlow version):

import gym
import tensorflow as tf
import numpy as np

# Initialising the environment
# (the classic Gym API, gym<0.26, is assumed: reset() returns an observation and step() returns four values)
env = gym.make('CartPole-v1')

# Definition of the actor network.
class Actor(tf.keras.Model):
    def __init__(self, action_space):
        super(Actor, self).__init__()
        self.dense1 = tf.keras.layers.Dense(24, activation='relu')
        self.dense2 = tf.keras.layers.Dense(24, activation='relu')
        self.logits = tf.keras.layers.Dense(action_space, activation=None)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.logits(x)  # apply the output layer to produce action logits

# Definition of the critic network.
class Critic(tf.keras.Model):
    def __init__(self):
        super(Critic, self).__init__()
        self.dense1 = tf.keras.layers.Dense(24, activation='relu')
        self.dense2 = tf.keras.layers.Dense(24, activation='relu')
        self.value = tf.keras.layers.Dense(1, activation=None)

    def call(self, inputs):
        x = self.dense1(inputs)
        x = self.dense2(x)
        return self.value(x)  # apply the output layer to produce the state value

# hyperparameter
gamma = 0.99  # discount rate
learning_rate = 0.001  # learning rate

# Initialisation of actors and critics.
num_actions = env.action_space.n
actor = Actor(num_actions)
critic = Critic()
optimizer = tf.keras.optimizers.Adam(learning_rate)

# Calculation of the advantage (a one-step TD-error estimate: r + gamma * V(s') - V(s))
def compute_advantage(reward, next_value, done, value):
    return reward + gamma * next_value * (1 - int(done)) - value

# Policy update
def train_step(state, action, reward, next_state, done):
    with tf.GradientTape(persistent=True) as tape:
        state = tf.convert_to_tensor([state], dtype=tf.float32)
        next_state = tf.convert_to_tensor([next_state], dtype=tf.float32)

        value = critic(state)[0, 0]
        next_value = critic(next_state)[0, 0]
        
        advantage = compute_advantage(reward, next_value, done, value)

        # Actor's policy gradient
        logits = actor(state)
        action_probs = tf.nn.softmax(logits)
        action_log_prob = tf.math.log(action_probs[0, action])
        actor_loss = -advantage * action_log_prob
        
        # Critic value function loss (squared TD error)
        critic_loss = advantage**2

    # Actor and critic updates.
    actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
    critic_grads = tape.gradient(critic_loss, critic.trainable_variables)
    
    optimizer.apply_gradients(zip(actor_grads, actor.trainable_variables))
    optimizer.apply_gradients(zip(critic_grads, critic.trainable_variables))

# Main loop of learning
num_episodes = 1000
for episode in range(num_episodes):
    state = env.reset()
    total_reward = 0

    while True:
        # Select an action by sampling from the policy
        state_tensor = tf.convert_to_tensor([state], dtype=tf.float32)
        logits = actor(state_tensor)
        action = np.random.choice(num_actions, p=tf.nn.softmax(logits[0]).numpy())

        # Perform the action and observe the next state
        next_state, reward, done, _ = env.step(action)

        # Learning with Advantage.
        train_step(state, action, reward, next_state, done)

        state = next_state
        total_reward += reward

        if done:
            print(f"Episode: {episode}, Total Reward: {total_reward}")
            break

Key points of implementation:

  1. Actor and critic networks:
    • The actor outputs the policy, i.e. which action to take in the current state (softmax turns the logits into a probability distribution).
    • The critic outputs the value of the current state (the value function).
  2. Advantage function:
    • The compute_advantage() function estimates the advantage as the one-step TD error \(r + \gamma V(s') - V(s)\), i.e. the difference between the estimated action value and the state value, and the actor is updated based on this advantage.
  3. Loss functions:
    • The actor's loss is the negative advantage multiplied by the log-probability of the chosen action and is minimised with the policy gradient method.
    • The critic is updated to minimise the squared TD error, i.e. the difference between the bootstrapped target and the predicted state value.

Implementation with PyTorch (simple alternative example): the basic structure is the same when using PyTorch; the model definition, the optimiser and the gradient calculation differ slightly. A minimal sketch is shown below.
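
The following is a minimal sketch of the same one-step A2C update in PyTorch, assuming the classic Gym API (reset() returning only the observation and step() returning four values); the network sizes and hyperparameters mirror the TensorFlow example above and are illustrative only.

import gym
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

env = gym.make('CartPole-v1')  # classic Gym API assumed
gamma = 0.99

class Actor(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 24), nn.ReLU(),
                                 nn.Linear(24, 24), nn.ReLU(),
                                 nn.Linear(24, n_actions))
    def forward(self, x):
        return self.net(x)  # action logits

class Critic(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 24), nn.ReLU(),
                                 nn.Linear(24, 24), nn.ReLU(),
                                 nn.Linear(24, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)  # state value V(s)

obs_dim = env.observation_space.shape[0]
actor = Actor(obs_dim, env.action_space.n)
critic = Critic(obs_dim)
actor_opt = optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(1000):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        s = torch.as_tensor(state, dtype=torch.float32)
        dist = Categorical(logits=actor(s))
        action = dist.sample()

        next_state, reward, done, _ = env.step(action.item())
        s_next = torch.as_tensor(next_state, dtype=torch.float32)

        # One-step advantage (TD error); the bootstrap value is not backpropagated through
        with torch.no_grad():
            target = reward + gamma * critic(s_next) * (1.0 - float(done))
        advantage = target - critic(s)

        # Critic: minimise the squared TD error
        critic_loss = advantage.pow(2)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor: policy gradient weighted by the (detached) advantage
        actor_loss = -advantage.detach() * dist.log_prob(action)
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        state = next_state
        total_reward += reward
    print(f"Episode {episode}, Total Reward: {total_reward}")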

Advantage learning challenges and measures to address them

Advantage learning is an effective method for improving the efficiency and performance of reinforcement learning, but there are also several challenges. Understanding these challenges and taking appropriate countermeasures are essential for building a successful reinforcement learning system. These challenges and their countermeasures are described below.

1. high variance of the advantage:

– Challenge: The advantage function is used to assess the ‘relative goodness’ of an action, but if the prediction of Q-values or state values is unstable, the advantage can become excessively large or small. This can lead to exploding or vanishing gradients during learning, making convergence difficult.

– Solution:
1. normalisation: normalising the advantage reduces extreme values and promotes stable learning; for example, standardising with the mean and standard deviation of the advantage is an effective method.
2. scaling rewards: scaling rewards so that advantage values fall within an appropriate range can also help reduce variance.

– Example: in Proximal Policy Optimisation (PPO), normalising the advantage is a commonly used method to ensure learning stability.
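As a minimal sketch (using NumPy; the function and argument names are illustrative), batch-wise advantage normalisation can be written as:

import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    # Standardise a batch of advantage estimates to zero mean and unit variance
    advantages = np.asarray(advantages, dtype=np.float32)
    return (advantages - advantages.mean()) / (advantages.std() + eps)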

2. difficulties with off-policy advantage learning:

– Challenge: Advantage learning essentially works ‘on-policy’ (i.e. data is collected with the current policy and learned from immediately). In an off-policy setting, however, it is difficult to reuse previously collected data for learning, which reduces data efficiency.

– Solution:
1. algorithms such as Trust Region Policy Optimisation (TRPO) and PPO can be used, which allow collected data to be reused over several update steps in a near off-policy manner. These techniques stabilise learning by limiting policy changes and avoiding large deviations from the previous policy.
2. introducing a replay buffer, or switching to off-policy learning algorithms (such as DQN or SAC) that reuse past experience, can also be considered.

3. different convergence speeds of actors and critics:

– Challenge: The actor (policy) and the critic (value function) learn towards different goals and may converge at different rates. If one lags behind the other, learning becomes unstable; in particular, if the critic cannot predict the value function accurately, the advantage is not calculated correctly and the actor’s learning is negatively affected.

– Solution:
1. set different learning rates so that the actor and the critic learn at an appropriate pace. For example, a higher learning rate for the critic, so that the value function is learned faster, enables stable advantage calculation.
2. use a target network to prevent excessive updates of the value function and obtain stable learning (see the sketch below).
3. the approach used in DQN and DDPG, where the critic’s target is held fixed and updated at regular intervals, is particularly effective.
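
As a minimal sketch of the target-network idea (the function name and the coefficient tau are illustrative), a soft update that slowly tracks the online critic can be written as:

def soft_update(target_weights, online_weights, tau=0.005):
    # Move each target weight a small step towards the corresponding online weight
    return [tau * w + (1.0 - tau) * tw for tw, w in zip(target_weights, online_weights)]

# Usage with Keras models (illustrative):
# target_critic.set_weights(soft_update(target_critic.get_weights(), critic.get_weights()))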

4. balancing exploration and exploitation:

– Challenge: In reinforcement learning, an agent must act according to its current policy (exploitation) while also trying out new actions (exploration). However, advantage learning tends to reinforce the action that looks best under the current policy, which can lead to insufficient exploration.

– Solution:
1. ε-greedy policies: increase exploration by having the actor choose a random action with a certain probability instead of always acting according to the current policy.
2. entropy regularisation: adding an entropy term to the actor’s loss increases the randomness of the policy and thus the breadth of exploration, strengthening the incentive to try new actions instead of repeating the same one.

– Example: entropy regularisation is introduced in the PPO and A3C algorithms, thereby maintaining a balance between exploration and exploitation.
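
As a minimal sketch in TensorFlow (the function name and the weight entropy_coef are illustrative), an entropy bonus can be added to the policy-gradient loss used by the actor roughly as follows:

import tensorflow as tf

def actor_loss_with_entropy(logits, actions, advantages, entropy_coef=0.01):
    # Policy-gradient loss with an entropy bonus; entropy_coef weights the exploration term
    action_probs = tf.nn.softmax(logits)                      # [batch, n_actions]
    log_probs = tf.math.log(action_probs + 1e-8)
    chosen_log_probs = tf.reduce_sum(
        tf.one_hot(actions, tf.shape(logits)[-1]) * log_probs, axis=-1)
    entropy = -tf.reduce_sum(action_probs * log_probs, axis=-1)
    return tf.reduce_mean(-advantages * chosen_log_probs - entropy_coef * entropy)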

5. sparsity of rewards:

– Challenge: When rewards are low or sparse, it may take longer for the agent to learn appropriate actions and advantage learning may not progress. This is due to the lack of feedback for learning the value of actions in low reward environments.

– Solution:
1. reward shaping: accelerate learning by providing intermediate rewards during the learning process and giving feedback to the agent even before the goal is reached.
2. imitation learning or inverse reinforcement learning: facilitating the onset of learning even in sparse reward environments by mimicking the behaviour of other agents.

Reference information and reference books

Details of reinforcement learning are described in “Theories and Algorithms of Various Reinforcement Learning Techniques and Their Python Implementations”. Please also refer to this page.

A reference book is “Reinforcement Learning: An Introduction, Second Edition”.

Deep Reinforcement Learning with Python: With PyTorch, TensorFlow and OpenAI Gym

Reinforcement Learning: Theory and Python Implementation

Asynchronous Methods for Deep Reinforcement Learning

Advantage Actor-Critic Algorithms

DeepMind’s Reinforcement Learning Lectures

OpenAI Spinning Up

OpenAI Baselines

Stable Baselines3

Reinforcement Learning: An Introduction

Deep Reinforcement Learning Hands-On

Algorithms for Reinforcement Learning

Applied Reinforcement Learning: With Python Examples

Probabilistic Machine Learning: Advanced Topics

Handbook of Reinforcement Learning and Control
