Overview of Generalised Advantage Estimation (GAE) and examples of algorithms and implementations

Overview of Generalised Advantage Estimation (GAE)

Generalised Advantage Estimation (GAE) is one of the methods used for policy optimisation in reinforcement learning, especially in algorithms that utilise state value functions or action value functions, such as the Actor-Critic approach. GAE adjusts the trade-off between bias and variance to achieve more efficient policy updates.

The main purpose of GAE is to prevent excessive noise when calculating the advantage function (a value that indicates how good a particular action is) by making the computation smoother.

In reinforcement learning, the agent takes an action \( a \) from some state \( s \) and improves its policy based on the reward obtained, using two key quantities in this process.
– State value function \( V(s) \): the expected cumulative reward for being in state \( s \)
– Advantage function \( A(s, a) \): a quantity that indicates how good it is to take action \( a \) in state \( s \)

The advantage function is defined as follows.

\[ A(s_t, a_t) = Q(s_t, a_t) - V(s_t) \]

Here, the Q function \( Q(s_t, a_t) \) indicates the cumulative reward after taking action \( a_t \) in state \( s_t \). Calculating this advantage function accurately requires considering all future rewards, which involves high variance and can make learning unstable.

In contrast, GAE computes a more stable advantage by partially discounting future rewards to mitigate error accumulation; it uses λ-discounted sums, which give different weights to different future rewards, to reduce the variance of the advantage and enable better estimation.

Specifically, the GAE calculates the advantage in the following form.

\[ \hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V \]

where \( \delta_t^V \) is the TD error (Temporal Difference Error) at time \( t \):

\[ \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t) \]

– \( \gamma \): discount rate (e.g. 0.99)
– \( \lambda \): GAE parameter (\( 0 \leq \lambda \leq 1 \))

Adjusting this λ manages the trade-off between bias and variance: the closer λ is to 1, the more strongly the estimate depends on future rewards and the greater the variance, while the closer λ is to 0, the more strongly it depends on immediate rewards, giving smaller variance but greater bias.
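
For reference, substituting the two extremes of \( \lambda \) into the definition above makes this trade-off explicit.

\[ \lambda = 0:\quad \hat{A}_t = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t) \]

\[ \lambda = 1:\quad \hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t) \]

The \( \lambda = 0 \) case is the one-step TD residual (small variance, larger bias), while the \( \lambda = 1 \) case is the Monte Carlo advantage estimate (small bias, larger variance).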

The benefits of GAE include control over the bias-variance trade-off, since adjusting the λ parameter allows longer-term rewards to be taken into account while keeping the estimate from becoming overly dispersed, and a smoother learning process in which policy updates become more stable.

GAE is often employed in modern reinforcement learning algorithms, notably Proximal Policy Optimisation (PPO) and Trust Region Policy Optimisation (TRPO), and contributes to the improved performance of these algorithms.

Algorithms associated with Generalised Advantage Estimation (GAE)

Generalised Advantage Estimation (GAE) is a method for improving the stability and efficiency of policy optimisation in reinforcement learning, and is closely related to the following algorithms, among others.

1. Trust Region Policy Optimisation (TRPO):

– Overview: TRPO, also described in ‘Overview of Trust Region Policy Optimisation (TRPO) and examples of algorithms and implementations’, is an algorithm that limits the range of policy updates in reinforcement learning so that updates do not become too large and performance does not deteriorate rapidly. Rather than updating the policy significantly in one step, it aims to update the policy gradually in a direction that ensures improvement.
– Relationship with GAE: TRPO uses GAE to estimate the advantage function in policy updates, which allows for more stable policy improvement, and GAE smoothes the estimation of the advantage function and facilitates effective policy updates.

2. Proximal Policy Optimisation (PPO):

– Overview: PPO, also described in ‘Overview of Proximal Policy Optimisation (PPO) and Examples of Algorithms and Implementations’, is an improved version of TRPO that aims to improve performance consistently while simplifying policy updating. Instead of the explicit trust-region constraint used in TRPO, it directly clips the update range, which reduces computational cost and simplifies implementation.
– Relationship with GAE: GAE is also commonly used in PPO to estimate the advantage function; it balances bias and variance in the learning process while considering long-term rewards, making policy updates more stable (a minimal code sketch of this combination is shown after this list).

3. Actor-Critic algorithm:

– Overview: The Actor-Critic algorithm, described in ‘Actor-Critic overview, algorithm and implementation examples’, is a method for learning both a policy (Actor) and a value function (Critic), combining policy-based and value-based approaches. The Critic evaluates the value function and the Actor updates the policy based on that evaluation.
– Relationship with GAE: In the Actor-Critic framework, GAE is used when the Critic evaluates the advantage function; using GAE makes the advantage estimation more stable and more effective at improving the policy based on the Critic’s value assessment.

4. Advantage Actor-Critic (A2C):

– Overview: A2C, described in ‘Overview of Advantage Actor-Critic (A2C), Algorithm and Implementation Examples’, is a type of Actor-Critic algorithm that makes explicit use of the advantage function and improves computational efficiency by learning policies in multiple environments simultaneously.
– Relationship with GAE: GAE is often used for advantage estimation in A2C as well, making A2C’s advantage estimation more stable and improving learning efficiency.

5. Deep Deterministic Policy Gradient (DDPG):

– Overview: DDPG, described in ‘Overview of Deep Deterministic Policy Gradient (DDPG) and examples of algorithms and implementations’, is an algorithm suited to reinforcement learning in continuous action spaces. It is an off-policy method that uses a learned value function to improve the policy.
– Relationship with GAE: GAE is sometimes used alongside algorithms such as DDPG to estimate the advantage function; where it is introduced, it can increase the accuracy of advantage estimation and is expected to result in more stable updates.

6. Soft Actor-Critic (SAC):

– Overview: SAC, described in ‘Soft Actor-Critic (SAC) Overview, Algorithm and Example Implementation’, is an Actor-Critic algorithm that incorporates entropy regularisation to learn policies while maintaining explorability. Entropy is added to the probability distribution of the policy so that the agent can explore a greater variety of behaviours.
– Relationship with GAE: GAE can also be incorporated in SAC for the estimation of the advantage function, and using GAE makes the advantage estimation smoother and policy updates more stable.

GAE has been used in various reinforcement learning algorithms to improve the estimation of the advantage function and to achieve more stable policy updates. In particular, the use of GAE has proven to be very effective in optimisation methods such as TRPO and PPO, and in Actor-Critic type algorithms.
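
As a concrete illustration of the PPO relationship above (item 2), the following is a minimal sketch of a PPO-style clipped policy loss driven by GAE advantages. It is not a full PPO implementation; log_probs_new, log_probs_old and advantages are assumed to be tensors produced elsewhere by the current policy, the rollout policy and a GAE computation such as the one shown later in this article.

import torch

def ppo_clipped_policy_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """
    PPO-style clipped surrogate loss computed from GAE advantages (illustrative sketch).
    
    Args:
    log_probs_new: log-probabilities of the taken actions under the current policy
    log_probs_old: log-probabilities of the same actions stored at rollout time
    advantages: GAE advantage estimates for the same time steps
    clip_eps: clipping range (0.2 is a common default)
    """
    # Probability ratio between the new and old policies
    ratio = torch.exp(log_probs_new - log_probs_old)
    
    # Normalising the advantages is a common practical choice for stability
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    
    # PPO maximises the surrogate objective, so the loss is its negation
    return -torch.min(unclipped, clipped).mean()

In a full training loop this loss is typically combined with a value-function loss and an entropy bonus and minimised with an optimiser such as Adam.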

Generalised Advantage Estimation (GAE) application examples

Generalised Advantage Estimation (GAE) has been widely used for policy optimisation in the field of reinforcement learning in the following applications.

1. robotics:

– Example: GAE is used for motion control and motion planning of robots.
– Application: when a robot learns complex tasks in real-time, GAE can be used to efficiently estimate the advantage function, enabling stable learning. For example, in a task where a robot arm grabs and carries an object, GAE can be used to improve the control policy while appropriately taking future rewards into account.
– Benefit: GAE reduces learning instability and enables the robot to complete tasks quickly and accurately.

2. game AI:

– Example: application of GAE to the decision-making of a character in a game.
– Application: GAE has been used to help AI agents learn, especially in challenging game environments. For example, in complex real-time games such as Dota 2, which OpenAI’s agents learned to play, GAE can help agents make decisions in real-time while evaluating long-term strategies.
– Benefit: the use of GAE improves the estimation of future state values and allows agents to act more effectively in the game.

3. automated driving:

– Examples: optimising route selection and vehicle control in automated driving systems.
– Application: policy optimisation is important for automated vehicles to drive safely and efficiently in complex environments; by using GAE, vehicle control policies can be learnt more smoothly and adapted in real-time according to traffic and road conditions.
– Benefit: GAE enables automated driving systems to learn more stable driving behaviour while taking into account long-term rewards (e.g. safety and energy efficiency).

4. optimising financial transactions:

– Examples: portfolio management and algorithmic trading using reinforcement learning.
– Application: in financial markets, it is important to predict future risks and rewards, and GAE can be used to optimise policies based on historical data, allowing portfolio management and automated trading systems to generate stable returns.
– Benefit: using GAE allows trading agents to adapt quickly to market movements and optimise the balance between risk management and rewards for trading.

5. navigation and mobility optimisation:

– Example: navigation systems for drones and unmanned vehicles.
– Application: in the task of a drone reaching a target point while avoiding obstacles, GAE is used to learn the optimal path with consideration of future rewards; by using GAE, a policy is learnt that minimises fuel consumption and travel time while effectively avoiding obstacles and complex terrain.
– Benefit: GAE makes learning more stable and helps navigation agents learn efficient routes.

6. dialogue systems in natural language processing (NLP):

– Example: training a dialogue agent using reinforcement learning.
– Application: dialogue systems utilise GAE to provide appropriate responses to user utterances, and GAE enables agents to learn policies that improve user satisfaction in the long term and improve the quality of dialogue.
– Benefit: the use of GAE enables response generation that takes into account the outcomes throughout the entire dialogue, not just short-term interactions.

7. treatment policy optimisation in healthcare:

– Example: reinforcement learning model for optimising a patient’s treatment plan.
– Application: In the medical field, policies to optimise the long-term health of patients are important, and GAE can be used to learn optimal treatment plans that take into account the side effects of treatment and changes in the patient’s physical condition. For example, in cancer treatment, GAE has been applied in optimising drug administration schedules with reinforcement learning.
– Benefit: GAE improves the stability of treatment policies and learns effective treatments to maintain long-term health status.

GAE has been applied to train reinforcement learning models in many areas, particularly robotics, automated driving, game AI, financial trading, navigation systems and medicine, to achieve stable policy updates while taking long-term rewards into account. Using GAE allows learning to balance bias and variance, leading to more efficient and stable performance.

Example implementation of Generalised Advantage Estimation (GAE)

Generalised Advantage Estimation (GAE) is typically implemented using Python reinforcement learning libraries. An example of a GAE implementation is given below, showing a simple form of the GAE computation using PyTorch and TensorFlow, two libraries commonly used in reinforcement learning.

Implementation overview: GAE is a method for approximately estimating the advantage function at each step within an episode, and the main formulas are as follows.

GAE formula: the GAE is calculated in the following form

\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]

\[ \hat{A}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l} \]

Where:

  • \(\delta_t\) is the TD error (Temporal Difference Error).
  • \(\hat{A}_t\) is the Advantage Estimate.
  • \(r_t\) is the reward, \(\gamma\) is the discount rate and \(V(s_t)\) is the state value function.
  • \(\lambda\) is a parameter used in GAE to adjust the extent to which future rewards are taken into account.

Example implementation (PyTorch version):

import torch

def generalized_advantage_estimation(rewards, values, gamma=0.99, lam=0.95):
    """
    Function to calculate GAE.
    
    Args:
    rewards: List of rewards (earned at each time step during the episode)
    values: List of state values (state value function values per time step)
    gamma: Discount rate (normally 0.99)
    lam: GAE lambda parameter (typically 0.95)
    
    Returns:
    advantages: Advantages of each step.
    """
    
    # Calculate TD error at each time step
    deltas = [r + gamma * v_next - v for r, v_next, v in zip(rewards, values[1:], values[:-1])]
    
    advantages = []
    gae = 0
    # Calculate the advantage backwards from the end of the episode.
    for delta in reversed(deltas):
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)  # Calculated in the reverse direction, added at the beginning
    
    return torch.tensor(advantages, dtype=torch.float32)

# Example: calculate GAE by setting rewards, values
rewards = [1, 0, -1, 2]  # List of rewards (provisional data)
values = [0.5, 0.6, 0.7, 0.8, 0]  # List of state values (provisional data)

# Calculation of GAE
advantages = generalized_advantage_estimation(rewards, values)

print("Advantages gained from GAE.:")
print(advantages)

Description:

  1. Function generalized_advantage_estimation
    • Receives a list of reward and state value functions, calculates the GAE and returns an advantage estimate at each time step.
    • First, the TD error is calculated and then the GAE values are calculated in reverse order.
    • lam is a parameter that determines the importance of future rewards and is usually set to around 0.95.
  2. Execution example
    • In rewards, a list of rewards during an episode is entered (e.g. [1, 0, -1, 2]).
      In values, the state value of each corresponding step is entered.
    • When the function is executed, the output is the advantage calculated by the GAE.

Example implementation (TensorFlow version):

import tensorflow as tf

def generalized_advantage_estimation(rewards, values, gamma=0.99, lam=0.95):
    """
    Function to compute GAE in TensorFlow.
    
    Args:
    rewards: List of rewards (earned at each time step during the episode)
    values: List of state values (state value function values per time step)
    gamma: Discount rate (normally 0.99)
    lam: GAE lambda parameter (typically 0.95)
    
    Returns:
    advantages: Advantages of each step.
    """
    
    deltas = [r + gamma * v_next - v for r, v_next, v in zip(rewards, values[1:], values[:-1])]
    
    advantages = []
    gae = 0
    for delta in reversed(deltas):
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    
    return tf.convert_to_tensor(advantages, dtype=tf.float32)

# Example: calculate GAE by setting rewards, values
rewards = [1, 0, -1, 2]  # List of rewards (provisional data)
values = [0.5, 0.6, 0.7, 0.8, 0]  # List of state values (provisional data)

# Calculation of GAE
advantages = generalized_advantage_estimation(rewards, values)

print("Advantages gained from GAE:")
print(advantages)

Description:

  • This code is the TensorFlow version of the GAE calculation. The basic algorithmic logic is the same as in the PyTorch version, but the output is obtained as a TensorFlow tensor.

Generalised Advantage Estimation (GAE) challenges and measures to address them

Generalised Advantage Estimation (GAE) is effective in policy optimisation for reinforcement learning algorithms, but several challenges exist. Measures to address these challenges can make GAE more effective. The main challenges and their countermeasures are listed below.

Challenge 1: Bias-variance trade-off due to the setting of λ

λ, an important parameter of GAE, determines the extent to which future rewards are taken into account.
– Large λ (close to 1): future rewards are taken into account in the long term, but this may lead to unstable estimation and high variance.
– Small λ (close to 0): emphasises short-term rewards, but this may increase bias and make accurate policy updating more difficult.

Solution:
– Adjustment of λ: the value of λ needs to be adjusted appropriately. Typically, a range of 0.9 to 0.95 should be tried and the optimal value determined by looking at the performance of the model.
– Hyper-parameter tuning: the optimal settings can be found by tuning λ and γ (discount rate) using grid search or random search.
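
As an illustration of such tuning, the following is a minimal grid-search sketch. The function train_and_evaluate is a hypothetical placeholder, not part of any library: here it returns a dummy score so the sketch runs end-to-end, and in practice it would train an agent with the given gamma and lam and return a mean episode return.

import itertools

def train_and_evaluate(gamma, lam):
    # Hypothetical placeholder for an actual training/evaluation loop.
    # The dummy score below simply prefers values near (0.99, 0.95) so the
    # sketch is runnable; replace it with real training and evaluation.
    return -abs(gamma - 0.99) - abs(lam - 0.95)

candidate_gammas = [0.97, 0.99, 0.995]
candidate_lambdas = [0.90, 0.95, 0.97]

best_score, best_params = float("-inf"), None
for gamma, lam in itertools.product(candidate_gammas, candidate_lambdas):
    score = train_and_evaluate(gamma=gamma, lam=lam)
    if score > best_score:
        best_score, best_params = score, (gamma, lam)

print("Best (gamma, lambda):", best_params)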

Challenge 2: Computational costs in high-dimensional environments

GAE performs a cumulative calculation at each time step to account for future rewards. This causes increased computational costs when learning in high-dimensional and complex environments (especially in robotics and 3D simulations).

Solution:
– Batch processing of computations: processing multiple episodes or batches at a time can increase computational efficiency. In particular, the utilisation of hardware such as GPUs and TPUs can reduce computational costs.
– Adopting parallel processing: by adopting an environment where multiple agents are trained in parallel (e.g. A3C or IMPALA), the computation of the GAE can be parallelised to increase its speed.
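
As one way of batching the computation, the backward pass over time can be run once over arrays that stack several parallel environments. The sketch below assumes rewards and values are NumPy arrays of shape (T, N) and (T + 1, N), where N is the number of environments; handling of episode termination (done flags) is omitted for brevity.

import numpy as np

def batched_gae(rewards, values, gamma=0.99, lam=0.95):
    """
    Compute GAE for N parallel environments at once.
    
    rewards: array of shape (T, N) - rewards from N parallel environments
    values: array of shape (T + 1, N) - value estimates including the bootstrap value
    Returns: array of advantages with shape (T, N)
    """
    T, N = rewards.shape
    advantages = np.zeros((T, N), dtype=np.float32)
    gae = np.zeros(N, dtype=np.float32)
    # Single backward pass over time; each step updates all N environments together
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example with dummy data: 4 time steps, 3 parallel environments
rewards = np.random.randn(4, 3).astype(np.float32)
values = np.random.randn(5, 3).astype(np.float32)
print(batched_gae(rewards, values).shape)  # (4, 3)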

Challenge 3: Noisy reward functions

Because GAE uses future rewards to estimate advantage, it is susceptible to noisy reward functions. If there is a lot of noise, the estimated advantage will vary and policy updates will not be stable.

Solution:
– Normalise rewards: normalise rewards to reduce the effect of noise. Use techniques such as reward scaling and clipping to adjust the range of reward values and stabilise policy updates.
– GAE normalisation: ensure that extreme values do not adversely affect policy updates by also normalising the acquired advantage itself.
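
A minimal sketch of reward clipping and advantage normalisation using NumPy is shown below; the clipping range and the small epsilon are illustrative values.

import numpy as np

def clip_rewards(rewards, low=-1.0, high=1.0):
    # Reward clipping: limit the range of raw rewards to dampen outliers and noise
    return np.clip(rewards, low, high)

def normalize_advantages(advantages, eps=1e-8):
    # Advantage normalisation: zero mean and unit variance, so that extreme
    # values do not dominate the policy update
    return (advantages - advantages.mean()) / (advantages.std() + eps)

raw_rewards = np.array([0.2, 5.0, -3.0, 0.1], dtype=np.float32)
print(clip_rewards(raw_rewards))          # values limited to [-1, 1]
print(normalize_advantages(raw_rewards))  # roughly zero mean, unit variance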

Challenge 4: Learning instability in long episodes

In long episodes, the influence of far-future rewards diminishes and advantage estimation can become unstable. In particular, in complex, long-term tasks, rewards across episodes tend to be smaller, which negatively affects GAE estimation.

Solution:
– Splitting episodes: if the episodes are very long, it is recommended to split the training into shorter sub-episodes, which will result in more accurate estimation of future rewards and stabilise the training.
– Adjusting the discount rate: it is also useful to adjust the discount rate γ according to the length of the episode, with a slightly higher discount rate for longer episodes allowing the impact of the reward to be transmitted further.

Challenge 5: Balancing policy exploration and exploitation

GAE determines the extent to which agents should improve their current policy based on rewards, but if an agent is already following a reasonably good policy, exploration may be lacking and the optimal solution may be missed.

Solution:
– Add exploration methods: enhance elements of exploration to allow agents to explore the behaviour space more widely. For example, adding noise during action selection or using exploration methods such as the ε-Greedy method or the Upper Confidence Bound (UCB) can provide a balance.
– Entropy regularisation: it can also be useful to introduce entropy regularisation to encourage agents to try a variety of behaviours. The higher the entropy, the more random the agent’s behaviour becomes and the more exploration is enhanced.

Challenge 6: Balancing policy and value function learning

Algorithms that use GAE need to learn both the policy and the state value function, but if these two learning processes are not balanced, policy optimisation may not be performed properly.

Solution:
– Introduce a shared network: learning the policy and the value function with a shared network allows representations to be shared and the two objectives to be balanced, improving learning efficiency. The learning of the two then complements each other and leads to greater optimisation.
– Adjusting loss function weights: by giving appropriate weights to policy losses and value function losses, balanced learning can be achieved, thereby preventing biased learning in one direction or the other.
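
As a sketch of how such weighting is often expressed in code, the following combines the policy loss, the value-function loss and an entropy bonus into a single objective. The coefficients value_coef and entropy_coef are illustrative defaults and should be tuned per task.

import torch

def combined_actor_critic_loss(policy_loss, value_loss, entropy,
                               value_coef=0.5, entropy_coef=0.01):
    # Weighted sum of the three terms used in many Actor-Critic / PPO setups.
    # The entropy term is subtracted so that maximising entropy (more exploration)
    # reduces the total loss.
    return policy_loss + value_coef * value_loss - entropy_coef * entropy

# Dummy scalar tensors standing in for losses computed elsewhere in a training loop
policy_loss = torch.tensor(0.8)
value_loss = torch.tensor(1.2)
entropy = torch.tensor(0.5)
print(combined_actor_critic_loss(policy_loss, value_loss, entropy))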

Reference information and reference books

Details of reinforcement learning are described in “Theories and Algorithms of Various Reinforcement Learning Techniques and Their Python Implementations”. Please also refer to this page.

A reference book is “Reinforcement Learning: An Introduction, Second Edition”.

Deep Reinforcement Learning with Python: With PyTorch, TensorFlow and OpenAI Gym

Reinforcement Learning: Theory and Python Implementation

Reinforcement Learning: An Introduction

Advances in Financial Machine Learning

Proximal Policy Optimization Algorithms

Deep Reinforcement Learning Hands-On

Artificial Intelligence: A Guide for Thinking Humans
