Introduction
Reinforcement learning is another key technology behind OpenAI, which is best known for ChatGPT. GPT, described in “Overview of GPT and examples of algorithms and implementations”, is the basis of ChatGPT, and its core is said to be the combination of the transformer architecture described in “Overview of Transformer Models, Algorithms, and Examples of Implementations”, which builds on the attention mechanism described in “Attention in Deep Learning”, with the refinement of deep learning models through reinforcement learning. When one hears the term “deep learning”, one immediately thinks of applications such as AlphaGo or self-driving cars, but this time we will take a more in-depth look at reinforcement learning itself.
What are the cases in which reinforcement learning is needed?
Reinforcement learning is considered to be particularly useful in the following cases
<Scenarios requiring trial and error>
When the solution to a problem or the optimal action is not known in advance, an agent needs to learn through trial and error. Reinforcement learning is a method for agents to find the optimal course of action while interacting with the environment, and is the best approach for such cases. The following are specific examples where trial-and-error is required.
- Game play: When the optimal strategy for a game is not known in advance, the agent learns the optimal behavior through trial and error, as in AlphaGo and the Dota 2 AI (an AI that plays against professional human players).
- Robotics: Robot manipulation and control must be learned through trial and error in real-world environments. Reinforcement learning is used to learn optimal behavior in complex situations, such as operating a robotic arm or driving a self-driving car.
- Optimization of chemical reactions: Reinforcement learning is sometimes used to find optimal experimental conditions and reaction paths in chemical processes when the optimal conditions and reaction paths are not known in advance.
- Medical treatment planning: Complex medical treatment plans, such as cancer treatments or drug administration, require trial and error to find the optimal schedule for each patient. Reinforcement learning is being explored to optimize the timing and amount of treatment.
- Traffic signal optimization: In urban transportation systems, control of traffic signals needs to be optimized through trial and error. Traffic systems involve a variety of complex parameters, and reinforcement learning is used to optimize them and adjust signal timing according to traffic volume and congestion.
- Energy Management: In power plants and power networks, reinforcement learning is used to optimize the complex parameters of the power network to coordinate the supply and demand of electricity and optimize the efficient use of energy.
In these cases, the optimal behavior depends on the nature of the problem and the situation, producing complex settings that are difficult to handle with prior knowledge or analytical approaches (or with the learning in static and relatively simple situations at which general machine learning excels). In such cases that require trial and error, reinforcement learning lets the agent learn by interacting with the environment.
<Scenarios requiring real-time decision making>
Reinforcement learning is a method for developing the ability of agents to make decisions while interacting with their environment in real time. Specific examples are described below.
- Automated Driving: Automated vehicles need to be able to respond immediately to road conditions and obstacles to drive safely, and automated driving systems are being developed that use reinforcement learning to determine optimal driving behavior in real time.
- Robotics: Robots need to make real-time decisions when operating and moving in a fluctuating environment; for example, autonomous movement in a warehouse and evacuation behavior in dangerous situations are realized using reinforcement learning.
- Fintech: Financial trading and investment situations require optimal trading strategies that respond quickly to market conditions and fluctuations, and algorithms are being built to make trading decisions based on real-time market data using reinforcement learning.
- Network Management: In the management of computer and telecommunication networks, traffic control and resource optimization must be performed in real time. Efforts are underway to use reinforcement learning to optimize control according to network conditions.
- Energy control: In the management of power networks and smart grids, power supply and demand must be coordinated in real time to achieve efficient use of energy. Efforts are underway to use reinforcement learning to optimize control of the power supply.
In cases such as these, where the situation changes quickly and immediate decision-making is required, making it difficult to respond with advance planning or static approaches, reinforcement learning provides the ability for agents to learn through real-time interactions with the environment and immediately select appropriate actions.
<Scenarios that maximize long-term rewards>
The goal of reinforcement learning is for the agent to learn a course of action that maximizes rewards over the long term. In such cases, it is necessary to focus on long-term rewards rather than short-term losses or costs. Specific cases are discussed below.
- Resource management: In the optimal use and protection of natural resources, decisions to maximize long-term rewards are important. Sustainable harvesting of forests and appropriate use of water resources are examples.
- Environmental Protection: Decisions need to be made with a long-term perspective to address environmental issues, such as reducing emissions and adopting renewable energy sources. Reinforcement learning is being used in this approach to find strategies that maximize rewards while minimizing environmental impacts.
- Medical treatment planning: In the medical field, treatment planning is required to maximize patient health from a long-term perspective, such as cancer treatment and chronic disease management, and approaches are being taken to use reinforcement learning to consider multiple treatment options and propose optimal treatment plans.
- Investment Portfolio Optimization: Investors need to build a long-term asset portfolio that balances risk and return, and efforts are underway to use reinforcement learning to find optimal investment strategies that take market fluctuations into account.
- Traffic Flow Optimization: Urban transportation systems require long-term optimization of traffic flow, and reinforcement learning is being used to reduce congestion and achieve efficient travel in signal control and traffic route planning.
In these instances, it is important to consider long-term consequences and impacts, not just temporary gains or losses. Reinforcement learning contributes to issues such as sustainable development and efficient resource use as a method for learning appropriate strategies and decisions to maximize long-term rewards.
<Situations of High Uncertainty>
Reinforcement learning is useful in situations where uncertainty is high and the environment is complex and variable. In such situations, it is difficult to determine appropriate actions based on human expertise alone, so agents must be able to adapt through experience. The following are specific examples of such situations.
- Weather Forecasting: Weather forecasting is a highly uncertain problem because atmospheric conditions are complex and variable. Reinforcement learning is being used to find optimal responses in fields such as agriculture and energy supply, which are affected by weather conditions.
- Robotics: When robots work in unknown environments, they must be able to adapt to environmental fluctuations and obstacles. Robot control that uses reinforcement learning to select optimal actions while adapting to changes in the environment is being studied.
- Financial Market Prediction: Financial markets are highly uncertain, and prices and trends fluctuate depending on a variety of factors. Algorithms are being developed that use reinforcement learning to predict market fluctuations and determine optimal trading strategies.
- Natural disaster response: When natural disasters such as earthquakes and floods occur, there is a need for rapid response and evacuation planning. Approaches that use reinforcement learning to learn evacuation strategies that adapt to different disaster scenarios are being studied.
- Medical Diagnosis: In medical diagnosis, there are individual differences and uncertainties in patient conditions and disease progression. Using reinforcement learning, decision support for selecting appropriate tests and treatment plans is being investigated.
In these cases, the environment is complex, variable, and uncertain, and conventional algorithms and static methods may be difficult to handle. Reinforcement learning provides the ability for agents to learn while adapting to real-world situations and is an effective approach as a method for finding appropriate actions in highly uncertain environments.
<Situations where human expert knowledge is limited>
Reinforcement learning is also useful when the knowledge of the human expert is limited or when the problem is too complex to solve analytically. Agents can understand the problem through experience and learn appropriate actions. Specific examples include the following
- Gameplay: In complex games, it is sometimes difficult to find the optimal strategy based solely on the knowledge and playing experience of human experts. AI agents are being developed that use reinforcement learning to learn optimal tactics that adapt to in-game situations and opponent behavior.
- Drug Design: Designing drugs and discovering new compounds requires knowledge of complex molecular interactions. However, it is difficult to test all possible combinations, so research is being conducted to use reinforcement learning to predict the properties and effects of compounds and explore promising drugs.
- Environmental monitoring: Environmental monitoring and ecological surveys in remote areas can be difficult for experts to collect all the information. Approaches that use robots and sensors to collect data and use reinforcement learning to analyze changes in the environment are being employed.
- Logistics management: In complex logistics networks and supply chains, it can be difficult for experts to plan appropriate routes and delivery schedules. Systems are being studied that use reinforcement learning to determine optimal logistics strategies based on real-time information.
- Autonomous Robots: Autonomous behavior and interaction of robots can make it difficult for experts to pre-direct appropriate behavior in complex environments. Reinforcement learning has been used in research to help robots learn as they interact with their environment and gain the ability to select appropriate behaviors.
In these cases, the agent must be able to adapt through experience because the problem is complex and difficult to solve with expert knowledge alone. Reinforcement learning is used as a powerful tool for learning optimal behavior in situations that cannot be covered by human knowledge alone.
<Summary>
In summary, reinforcement learning is useful as an adaptive learning method for unknown situations, complex problems, and problems with a long-term perspective. Such problems cannot be solved by the general machine learning approach of learning patterns and relationships from static data sets. Thus, there are many cases where reinforcement learning is needed to develop systems and agents that make autonomous decisions or to solve optimization problems.
Why reinforcement learning is a useful approach to the above issues
Reinforcement learning is useful as an adaptive learning method for unknown situations, complex problems, and problems with a long-term perspective for the following reasons
- Learning through trial and error: Reinforcement learning occurs through repeated trial and error as the agent interacts with its environment. Thus, even in unknown situations where prior knowledge is limited, the agent can learn appropriate behavior through actual experience.
- Maximize long-term rewards: The goal of reinforcement learning is to learn the optimal course of action that maximizes long-term rewards. This allows for finding optimal strategies not only for short-term gains, but also for the long term.
- Responding to environmental variability: Reinforcement learning has the ability to respond to environmental variability and uncertainty because the agent learns as it interacts with the real environment. This allows them to adapt to complex situations and fluctuating problems.
- Overcoming the limitations of human expertise: Reinforcement learning provides a means to learn appropriate behavior through experience, even in situations where the knowledge of human experts is limited or the problem is too complex to be solved analytically.
- Balancing Exploration and Exploitation: Reinforcement learning finds optimal behavior by balancing exploration (trying out unknown behaviors) and exploitation (established behaviors based on past experience). This allows optimization to proceed while gathering new information even in unknown situations.
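To make these points concrete, here is a minimal Q-learning sketch in Python on a hypothetical one-dimensional corridor environment (the environment, reward values, and hyperparameters are illustrative assumptions, not taken from any particular library or article): the agent learns purely by trial and error, balances exploration and exploitation with an ε-greedy rule, and the discount factor γ makes it optimize long-term rather than immediate reward.

```python
import random

# Hypothetical 1-D corridor: states 0..4, start at state 0, reward +1 only at the right end.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                       # move left / move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment dynamics: stay inside the corridor, reward 1 only when the goal is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def greedy_action(state):
    """Greedy choice with random tie-breaking."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(500):               # learning through repeated trial and error
    state, done = 0, False
    while not done:
        # Exploration vs. exploitation: random action with probability EPSILON, otherwise greedy
        action = random.choice(ACTIONS) if random.random() < EPSILON else greedy_action(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: GAMMA makes the agent value long-term (discounted) reward
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

print({s: greedy_action(s) for s in range(N_STATES)})  # learned greedy policy (+1 before the goal)
```

Even in this tiny example, the agent starts with no prior knowledge of the environment and discovers the rewarding behavior only through repeated interaction, which is the essence of the points listed above.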
Challenges of Reinforcement Learning Technology
Thus, reinforcement learning is a very useful technique, but there are some challenges. Below we discuss some of the main challenges associated with reinforcement learning techniques.
- Sample efficiency problem: Reinforcement learning is a trial-and-error learning method, and it is important for the agent to experience many episodes (trials). However, if it is difficult to conduct trials in a real environment in terms of cost and time, it is difficult to collect the data necessary for learning.
- Trade-off between exploration and exploitation: Reinforcement learning requires selecting the optimal behavior based on past experience while also exploring new behaviors to collect new information. This balance between exploration and exploitation is difficult to achieve, and insufficient exploration may lead to locally optimal solutions.
- Reward design: The design of an appropriate reward function is critical to the success of reinforcement learning. If the reward function is designed incorrectly, the agent may take unwanted actions and learning may fail to converge. Designing an appropriate reward function can be difficult in practice.
- Complexity of the state space: If the state space is very complex and high-dimensional, learning can be difficult. In high-dimensional state spaces, it can be difficult for the agent to collect sufficient data, making it difficult to learn appropriate behaviors.
- Sampling bias: If the data collected by the agent is biased, the learning results may also be biased. Therefore, appropriate data collection methods and bias elimination techniques are needed.
- Adaptability constraints: Reinforcement learning is an experience-based learning method, but in the real world, the environment can change. If the agent’s ability to adapt to new situations is limited, effective learning may be difficult.
The approaches to each of these issues are described below.
Approaches to Reinforcement Learning Challenges
The following approaches are being considered to address these reinforcement learning issues.
- Approaches to the sample efficiency problem: As described in “Overview of Weaknesses and Countermeasures in Deep Reinforcement Learning and Two Approaches to Improve Environment Recognition”, the sample efficiency problem can be categorized as follows.
- Caused by model
- Learning ability: The ability to learn efficiently from given data. In addition to devising the model itself, there are various methods for sampling training data, such as Experience Replay (a minimal replay-buffer sketch is shown after this list), as well as optimization methods.
- Transferability: The ability to learn in a short period of time by leveraging previously learned content, which may be stored in the model or acquired from other models.
- Caused by data
- Environmental perspective (improving environment recognition): This is a method to make information obtained from the environment easier for the agent to understand (i.e., easier to learn). Specifically, the states and rewards obtained from the environment are processed so that they are easier to learn. This can be achieved with the model-based approach (modeling the environment’s transition function and reward function) described in “Overview of Reinforcement Learning Using Model-based Approach and its Implementation in Python” and “Overview of Weaknesses and Countermeasures in Deep Reinforcement Learning and Two Approaches to Improve Environment Recognition”, or with the representation learning approach (creating a new “state” as a vector (representation) that captures the characteristics of the state). This approach is also used to improve the efficiency of learning in the real environment by pre-training with simulations of the environment.
- From the agent’s point of view (improving exploration behavior): This is a method that allows the agent to obtain samples that are highly effective for learning. Examples include Rainbow, described in “Rainbow Overview, Algorithm and Implementation Examples”, Noisy Nets (a method for learning how much exploration to perform), described in “New Developments in Reinforcement Learning (2) – Approaches Using Deep Learning”, and Intrinsic Reward/Intrinsic Motivation, described in “Research Trends in Deep Reinforcement Learning: Meta-Learning and Transfer Learning, Intrinsic Motivation and Curriculum Learning”, which motivates the agent to actively transition to unknown states.
- From the perspective of the learner: External encouragement is used to promote the agent’s learning. Two such methods are curriculum learning and imitation learning.
- Curriculum Learning: As described in “Research Trends in Deep Reinforcement Learning: Meta-Learning and Transfer Learning, Intrinsic Motivation and Curriculum Learning”, curriculum learning is a technique for adaptively adjusting the order and difficulty of the agent’s learning tasks.
- Bonsai: Bonsai is a curriculum learning platform based on reinforcement learning that supports curriculum design, expert demonstration, iteration and evolution, debugging, and visualization. For more information, please refer to “Reinforcement Learning Application Areas (1) Optimizing Behavior”.
- Imitation Learning: There is a method called imitation learning, in which learning samples (examples) are given, as described in “Overcoming Weaknesses in Deep Reinforcement Learning: Locally Optimal Behavior / Overlearning (1) Imitation Learning”. Details are described in “Approaches to reward design” below.
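As a concrete illustration of the Experience Replay idea mentioned above, the following is a minimal replay-buffer sketch (the class name, capacity, and batch size are illustrative assumptions, not the API of any specific library): transitions gathered through interaction are stored and reused as randomly sampled mini-batches, which improves sample efficiency and breaks the correlation between consecutive experiences.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer: store transitions, sample mini-batches for reuse."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the temporal correlation of consecutive transitions
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# Usage sketch: fill the buffer with hypothetical transitions and reuse them for learning.
buf = ReplayBuffer()
for t in range(100):
    buf.add(state=t, action=random.choice([0, 1]), reward=random.random(),
            next_state=t + 1, done=False)
batch = buf.sample(8)
print(len(batch), batch[0])
```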
- Approaches to the trade-off between exploration and exploitation: As described in “Resolving the Trade-off between Search and Exploitation: Regret, Stochastic Optimal Measures, and Heuristics”, there are two evaluation metrics for the trade-off between exploration and exploitation: regret and sample complexity. Regret is a measure for evaluating a reinforcement learning algorithm that learns strategies while interacting with its environment, taking both exploration and exploitation into account. Sample complexity corresponds to the number of times a wrong strategy is adopted; if the algorithm is too biased toward exploitation and keeps adopting a wrong strategy because of insufficient exploration, or if exploration continues forever, the sample complexity diverges.
- Realization in reinforcement learning using function approximation:
- An approach that indirectly specifies policies by preparing a utility function that outputs the utility of state-action pairs. The following policies are derived from such a utility function.
- Greedy policy: A greedy policy always chooses the action that maximizes utility.
- ε-greedy policy: A policy that selects an action at random with probability ε and otherwise follows the greedy policy. Its implementation is described in “Implementing Model-Free Reinforcement Learning in Python (1) The epsilon-Greedy Method”.
- Softmax policy: A policy that extends the greedy policy to a probabilistic one, in which actions are selected probabilistically using the softmax function described in “Overview of softmax functions and related algorithms and implementation examples”.
- Optimistic policy: Actions are selected using the heuristic of “optimism in the face of uncertainty”: if there are actions whose utility is uncertain, they are preferentially selected.
- UCB (Upper Confidence Bound) method: Proposed for the multi-armed bandit problem, a special reinforcement learning problem with only one state. The method balances exploration and exploitation by considering an upper confidence bound on the value of each action. For details, see “Measures for Stochastic Bandit Problems: Likelihood-based Measures (UCB and MED Measures)”. A minimal sketch comparing these action-selection rules is shown after this list.
- Approaches that directly specify policies using linear models such as generalized linear models, or neural networks: For deterministic policies, mathematical models are used that output actions in response to inputs such as states; for probabilistic policies, models are used that output a probability distribution over actions.
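As a concrete illustration of the action-selection rules listed above, here is a minimal sketch for a one-state (bandit) setting with hypothetical estimated action values and visit counts; the numbers and parameter values are assumptions made for illustration.

```python
import math
import random

q = [0.2, 0.5, 0.1]      # hypothetical estimated values of three actions
counts = [10, 5, 1]      # how many times each action has been tried
total = sum(counts)

def epsilon_greedy(q, epsilon=0.1):
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def softmax_policy(q, tau=1.0):
    """Select an action with probability proportional to exp(q / tau)."""
    prefs = [math.exp(v / tau) for v in q]
    z = sum(prefs)
    return random.choices(range(len(q)), weights=[p / z for p in prefs])[0]

def ucb(q, counts, total, c=1.0):
    """Optimism in the face of uncertainty: add an exploration bonus to rarely tried actions."""
    scores = [q[a] + c * math.sqrt(math.log(total) / counts[a]) for a in range(len(q))]
    return max(range(len(scores)), key=lambda a: scores[a])

print(epsilon_greedy(q), softmax_policy(q), ucb(q, counts, total))
```

Note how UCB tends to pick the rarely tried third action despite its low estimated value, which is exactly the optimistic exploration behavior described above.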
- Approaches to reward design: Recent deep learning models are often optimized by gradient descent, in which case the “gradient” must, of course, be computable. While a gradient can be computed for the squared error, there are evaluation metrics for which it cannot. In reinforcement learning, even if the gradient cannot be calculated, learning is still possible by setting the value of the evaluation metric as the “reward”. The following methods are available for determining this reward.
- Imitation Learning: Imitation learning is similar to supervised learning. The aim is to record an expert’s behavior and train the agent to behave in a similar manner. However, simply imitating the expert’s behavior is not sufficient, for two reasons: first, it is difficult to capture all of the expert’s behavior when the number of states is very large, and second, there are states whose behavior is difficult to record in the first place. The goal of imitation learning is therefore to learn to take appropriate actions from a limited set of examples, including in cases not covered by the examples. There are four methods of imitation learning, described below (a minimal behavioral cloning sketch is shown after this list). The details of these methods are described in “Overcoming Weaknesses in Deep Reinforcement Learning: Locally Optimal Behavior / Overlearning (1) Imitation Learning”.
- Forward Training: Forward Training is a method in which a strategy is created for each time step individually, and these strategies are then linked together to form an overall strategy.
- SMILe: SMILe is a method that improves on the problems of Forward Training. As the name “Mixing” suggests, SMILe mixes multiple strategies. Since the strategies are integrated into a single strategy, they are not divided by time step, and there is no need to determine the length of the time step.
- DAgger: Unlike Forward Training and SMILe, which mix strategies, DAgger is data-based: it mixes data instead.
- GAIL: GAIL trains the agent so that its imitation of the expert “avoids detection”. In other words, there are two models: one that imitates the expert and one that tries to detect the imitation.
- Inverse Reinforcement Learning (IRL): While “imitation learning” learns the behavior itself, inverse reinforcement learning estimates the reward function (what people perceive as rewarding) behind the behavior shown in the example. For details on inverse reinforcement learning, please refer to “Overview of Inverse Reinforcement Learning and Examples of Algorithms and Implementations“.
- Linear Programming: Linear programming evaluates behavior by reward. Since the expert’s action should be the best action, the reward is estimated so that the reward obtained for the expert’s actions is high and the reward obtained for other actions is low. To maximize (Max) the difference in rewards (Margin), the problem is formulated as a MaxMargin problem. For details on linear programming, please refer to “Overcoming Weaknesses in Deep Reinforcement Learning: Locally Optimal Behavior / Dealing with Overlearning (2) Inverse Reinforcement Learning”.
- AIRL (Apprenticeship learning via Inverse Reinforcement Learning): AIRL focuses on state transitions (the abbreviation AIRL is not commonly used). There is a clear difference between the states visited by experts and those visited under other strategies, so the reward is set high for states that experts often visit and low for those they do not. Like linear programming, AIRL tries to widen the “reward difference” as much as possible, but it differs in that the reward depends on the characteristics of the state transitions. See “Overcoming Weaknesses in Deep Reinforcement Learning: Locally Optimal Behavior / Dealing with Overlearning (2) Inverse Reinforcement Learning” for details.
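To illustrate the basic idea behind imitation learning (closest in spirit to plain behavioral cloning, not to Forward Training, SMILe, DAgger, or GAIL specifically), the following sketch fits a classifier that maps states to an expert’s actions using synthetic demonstration data; the data, the assumed expert rule, and the choice of scikit-learn’s logistic regression are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical expert demonstrations: 2-D states and the expert's discrete action (0 or 1).
rng = np.random.default_rng(0)
expert_states = rng.normal(size=(200, 2))
expert_actions = (expert_states[:, 0] + expert_states[:, 1] > 0).astype(int)  # assumed expert rule

# Behavioral cloning = supervised learning of the mapping state -> expert action.
policy = LogisticRegression().fit(expert_states, expert_actions)

# The cloned policy can now be queried in states that were not in the demonstrations.
new_states = rng.normal(size=(3, 2))
print(policy.predict(new_states))
```

The limitation discussed above is visible even here: the cloned policy only generalizes well near the states covered by the demonstrations, which is why methods such as DAgger and GAIL go beyond plain imitation.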
- Approaches to the complexity of the state space: As described in “Reinforcement Learning with Function Approximation (1) – Function Approximation of Value Functions (Batch Learning)” and related articles, value functions and policy functions are approximated with function approximators in order to cope with cases where the number of states and actions is huge or the state-action space is continuous (a minimal sketch of a neural Q-function is shown after this list).
- Function approximation of the value function: Fitted Q Iteration and similar methods. For details, see “Reinforcement Learning with Function Approximation (1) – Function Approximation of Value Functions (Batch Learning)” and related articles.
- Function approximation of policy functions: There are methods using linear function approximation and others. For details, see “Reinforcement Learning with Function Approximation (1) – Function Approximation of Value Functions (Batch Learning)” and related articles.
- Application of deep learning: There are methods that use neural networks as function approximators. For details, see “Application of Neural Networks to Reinforcement Learning (1) Overview” and related articles.
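As a minimal sketch of function approximation for the value function, the following uses a small neural network Q-function written with PyTorch (the architecture, dimensions, and the fake batch of transitions are illustrative assumptions): instead of a table, a parameterized function of the state outputs one Q-value per action, which is what makes huge or continuous state spaces tractable, and a single TD-style update is shown.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2   # illustrative sizes

# A small neural network that maps a state vector to one Q-value per action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 32),
    nn.ReLU(),
    nn.Linear(32, N_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One illustrative TD-style update on a fake batch of transitions.
states = torch.randn(8, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (8,))
rewards = torch.randn(8)
next_states = torch.randn(8, STATE_DIM)
gamma = 0.99

q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)      # Q(s, a)
with torch.no_grad():
    q_target = rewards + gamma * q_net(next_states).max(dim=1).values  # r + γ max_a' Q(s', a')
loss = nn.functional.mse_loss(q_pred, q_target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```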
- Approaches to sampling bias: See also “Small Data Machine Learning Approaches and Examples of Various Implementations” for approaches to dealing with data bias.
- Weighted sampling: This is a method of learning by appropriately weighting biased data (a minimal sketch of this idea is shown after this list).
- Batch re-training: An approach that does not collect new data, but reuses past data to train agents.
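As a minimal sketch of the weighted-sampling idea referred to above (the data, group frequencies, and weights are illustrative assumptions), each sample from a biased data set is given an importance weight, here the ratio of the desired to the observed group frequency, so that the biased data contributes to the estimate in its corrected proportion.

```python
import numpy as np

# Hypothetical biased data set: group 0 is heavily over-represented relative to a desired 50/50 mix.
rng = np.random.default_rng(0)
groups = rng.choice([0, 1], size=1000, p=[0.9, 0.1])        # observed (biased) group frequencies
values = np.where(groups == 0, 1.0, 5.0) + rng.normal(scale=0.1, size=1000)

# Importance weights = desired probability / observed probability of each group.
observed_p = np.array([0.9, 0.1])
desired_p = np.array([0.5, 0.5])
weights = (desired_p / observed_p)[groups]

naive_mean = values.mean()                                   # biased estimate (close to 1.4)
weighted_mean = np.average(values, weights=weights)          # bias-corrected estimate (close to 3.0)
print(naive_mean, weighted_mean)
```

The same weights can be passed as per-sample weights to a loss function or learner so that training on biased data behaves as if the data followed the desired distribution.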
- Approaches to adaptability constraints:
- Online Learning: An approach that learns and applies the results in real time to improve the ability to adapt to new situations. For more information, see “Online Learning and Online Forecasting”.
- Sequential learning: An approach in which the agent learns as it collects new information and adapts to changing conditions. For more information, see “Online Learning and Online Prediction”.
Specific implementation and reference information and reference books
For specific implementations of reinforcement learning, see “Overview of Reinforcement Learning Techniques and Various Implementations”, “Overview of the Bandit Problem, Application Examples, and Implementation Examples”, and “Combination of Simulation and Machine Learning and Various Implementation Examples”. For detailed information including application examples, see “Theory, Algorithms, and Python Implementations of Various Reinforcement Learning Techniques”.
Reference books include “Reinforcement Learning: An Introduction, Second Edition”,
“Deep Reinforcement Learning with Python: With PyTorch, TensorFlow and OpenAI Gym”, and
“Reinforcement Learning: Theory and Python Implementation”.