Machine Learning Professional Series “Reinforcement Learning” Reading Memo

Reinforcement learning is a field of machine learning in which an agent, the subject of learning, interacts with its environment and selects actions. In an unknown environment or an environment with complex problems, the agent learns to choose the optimal action from the available actions based on the observed state and reward. Typical methods include Q-learning, SARSA, DDPG, DQN, and PPO. These methods evaluate the value of the agent's actions and policies and provide algorithms for selecting the optimal action.

Reinforcement learning has been applied in various fields such as robot control, game play, natural language processing, and financial trading, and can realize advanced decisions that exceed human judgment.

This section is based on the Machine Learning Professional Series “Reinforcement Learning”.

“The book covers a wide range of the mathematics required for reinforcement learning. Consistent and careful explanations make for a thorough read. The formulation of the Bellman equation, Markov decision processes as described in “Overview of Markov Decision Processes (MDP) and Examples of Algorithms and Implementations“, and the policy gradient method are covered in depth! Finally, distributional reinforcement learning and deep reinforcement learning are introduced.”

Chapter 1 Preparation

Introduction

Reinforcement learning deals with learning of sequential decision rules.
There are two main types of models (components): systems to be controlled and strategy models to be learned.
In this chapter
Introduction to the overview of reinforcement learning
We explain the basic components of reinforcement learning.
Section 1.2 introduces the Markov decision process, a standard mathematical model for describing the system to be controlled
Section 1.3 describes a function that defines a decision rule, called a strategy model
In Section 1.4, we define the objective function of policy learning and formulate a sequential decision-making problem.

1.1 What is Reinforcement Learning?

A field of learning that aims to define rules for optimal decision making.
Generally, it is regarded as a branch of machine learning alongside “supervised learning” and “unsupervised learning”.
Conventional machine learning is often interpreted as maximization of likelihood or posterior probability.
Even in reinforcement learning, there are approaches that approximate the problem as a likelihood maximization problem and use traditional machine learning methods.
The concept of “reward” appears, which is not present in other machine learning methods.
The most important feature of reinforcement learning is to learn sequential decision rules that maximize the expected value of the reward.
Basically, it is solved as reward maximization
In decision making:
Decisions are made based on a representation of the current situation, called a state.
As a result, a reward and a new state are observed.
Decision making is then repeated.
Sequential decision rules are called policies.
The problem of optimizing a policy is called a “sequential decision-making problem”.
Example
A simplified problem of deciding whether or not to run a campaign (e.g., a price-reduction sale) in a retail store
The goal is to maximize the long-term average of sales by deciding whether or not to run a campaign in each sales period
Campaign Optimization Problem
In reinforcement learning, we treat each period's sales as the “reward”, the customers' willingness to buy as the “state”, and whether or not the campaign is run as the “action”.
If we assume that the strategy depends on the state and is deterministic (non-probabilistic), we can consider the following patterns
Strategy A
Always run the campaign. Average sales are 1
Strategy B
Never run the campaign. Average sales are 3
Strategy C
Run the campaign when the willingness to buy is moderate. Average sales are about 2
Strategy D
Run the campaign only when the willingness to buy is high. Average sales are 4
Strategy D maximizes average sales (average reward).
It is better to wait until the willingness to buy is high before running a campaign.
The general case is not as simple as the above: there are many states and actions, and the state transitions are stochastic and complex.
Reinforcement learning is a mathematical framework to deal with such general sequential decision making problems.
It is not realistic to consider a learning method for an arbitrary system to be controlled.
A process is assumed for the system
A typical assumption is “Markovianity”.
In most cases, we consider learning methods for “Markov decision processes,” which assume that the state in which Markovianity holds can be observed.
There is also a mathematical model called the “partially observable Markov decision process (POMDP)” that relaxes the observability assumption of the Markov decision process.
As for the policy model to be learned:
in Example 1.1, we dealt with a simple model that depends only on the current state and selects actions deterministically.
Policy models that select actions depending on the history of past states and actions,
or more complex models that select actions probabilistically, can also be considered.

1.2 Markov Decision Processes and Sequential Decision Problems

Introduction
We will explain stochastic processes, Markovianity, and Markov decision processes, which are the basis of reinforcement learning.
We then introduce a typical example of a sequential decision-making problem.
1.2.1 Stochastic Processes and Markovianity
Before introducing stochastic processes, we will first review the basics of probability using dice rolls.
Probability is a quantitative measure of the likelihood of an event occurring with randomness, such as “the dice will roll a 1” or “the dice will roll an even number”.
The probability of event A is denoted as Pr(A).
Random variable: a variable that does not take a fixed value deterministically, but whose possible values and the probabilities of taking those values are given.
Realization: the actual value taken by a random variable.
Probability distribution: the correspondence between a random variable's values and their probabilities.
Stochastic process
A process in which the dice are rolled repeatedly, rather than singly.
A series of random variables such that the values of the variables change stochastically with time
It is often denoted by {Xt, t∈T} with time step t as a parameter.
In a general stochastic process,
the probability that the random variable Xt at time step t takes a value x ∈ X
depends on all the realizations before time step t: Pr(Xt = x | Xt-1 = xt-1, …, X1 = x1).
Here Pr(A|B) denotes the conditional probability of event A given event B.
The simplest stochastic process, obtained by imposing a strong constraint, is the i.i.d. (independent and identically distributed) process:
each random variable X1, X2, … is independent of the others and follows the same probability distribution,
so that Pr(Xt = x | Xt-1 = xt-1, …, X1 = x1) = Pr(Xt = x) holds for any x1, …, xt-1 ∈ X.
If the data in hand can be assumed to follow an i.i.d. process, standard pattern recognition and machine learning methods can be used without considering the order or time series of the data.
In many decision problems, the i.i.d. process cannot be used.
Reinforcement learning assumes Markov property, which is a weaker constraint than i.i.d.
The Markov property means that the conditional probability distribution of the future random variable depends only on the value xt at the current time step t:
given Xt, it does not depend on the earlier values x1, …, xt-1, i.e. Pr(Xt+1 = x’ | Xt = xt, …, X1 = x1) = Pr(Xt+1 = x’ | Xt = xt).
For a stochastic process with the Markov property, Pr(Xt+1 = x’ | Xt = x) represents the probability of a transition from state x to state x’ at the next step.
A stochastic process with the Markov property whose state variables take discrete (finite or countable) values is called a Markov chain.
Examples of stochastic processes
A stochastic process has different characteristics depending on the definition of the random variable, even though the process behind it (rolling the dice) is the same.
The stochastic process in (a) satisfies the i.i.d. property.
Yt in (b) satisfies the Markov property
(c) satisfies neither i.i.d. nor Markovianity
If Markovianity is not satisfied, the complexity of the probability distribution of state transitions will increase combinatorially with time step t, and it will become unmanageable in general.
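The memo does not reproduce the definitions of processes (a)–(c); as an illustrative sketch only (my own assumption, not necessarily the book's exact example), suppose Xt is the value of the t-th roll and Yt is the running sum of the rolls. The following Python snippet simulates both from the same sequence of dice rolls.

```python
import random

def simulate_dice_processes(num_steps=10, seed=0):
    """Simulate two stochastic processes driven by the same dice rolls:
    X_t: the value of the t-th roll (i.i.d.),
    Y_t: the running sum of the rolls up to step t (Markov, but not i.i.d.)."""
    rng = random.Random(seed)
    xs, ys, total = [], [], 0
    for _ in range(num_steps):
        roll = rng.randint(1, 6)   # one roll of a fair six-sided die
        total += roll
        xs.append(roll)            # X_t does not depend on the past: i.i.d.
        ys.append(total)           # Y_{t+1} depends only on Y_t and the new roll: Markov
    return xs, ys

if __name__ == "__main__":
    xs, ys = simulate_dice_processes()
    print("X_t (i.i.d.):    ", xs)
    print("Y_t (Markov sum):", ys)
```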
1.2.2 Markov Decision Processes
Reinforcement learning is
To optimize the action selection rule
Instead of a conventional stochastic process with only “states”, we consider a stochastic control process with actions added:
a stochastic control process that incorporates “actions” and “rewards” (the criterion for good or bad decisions) into a Markov chain.
A Markov decision process is defined by M ≙ {S, A, ps0, pT, g}:
Finite state set: S ≙ {s1,…,s|S|} ∋ s
Finite action set: A ≙ {a1,…,a|A|} ∋ a
Initial state probability function: ps0 : S → [0,1], ps0(s) ≙ Pr(S0 = s)
State transition probability function: pT : S × S × A → [0,1], pT(s’|s,a) ≙ Pr(St+1 = s’ | St = s, At = a)
Reward function: g : S × A → ℝ
The random variables St and At represent the state and action variables at time step t∈ℕ0
The number of states in a Markov decision process is |S| and the number of actions is |A|.
|X| denotes the number of elements of X if X is a finite set
A Markov reward process is a Markov chain with only a reward added, or equivalently a Markov decision process with |A| = 1.
We assume that the reward function g is bounded, i.e., there exists a constant Rmax ∈ ℝ such that |g(s,a)| ≤ Rmax for all s and a.
The set of rewards R is defined accordingly.
To avoid notational complexity, we treat the action set as a single state-independent set A, except in a few places.
The definition above can easily be extended to the case where the set of selectable actions As differs depending on the state s.
Reward function
For simplicity, we use g(St,At), which depends only on the state and action at time step t.
A more general reward is
A reward g(St,At,St+1) that also depends on the next state, or
The reward distribution function Pr(Rt≤r|St=s,At=a) may be used.
A reward with its sign reversed is called a cost.
A function that specifies the action selection rule is defined as an input to the Markov decision process.
This is called a policy.
Various forms of policy can be considered.
In this book, we use the stochastic policy π : A × S → [0,1], π(a|s) ≙ Pr(At = a | St = s), which selects an action probabilistically depending only on the state s at the current time step.
The Markov decision process M, including the policy π, is assumed to be as defined above.
The set of policies containing any stochastic policy π is defined as above.
To confirm our understanding of the Markov decision process, the specific steps of its time evolution (s0,a0,r0,…,st,at,rt) are shown below.
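To make the time evolution concrete, here is a minimal Python sketch of one rollout of a finite Markov decision process M ≙ {S, A, ps0, pT, g} under a stochastic policy π. The two-state MDP used here is a made-up illustration, not an example from the book.

```python
import random

# A toy MDP with states S = {0, 1} and actions A = {0, 1} (illustrative values only).
P_S0 = [1.0, 0.0]                     # initial state distribution p_s0(s)
P_T = {                               # transition probabilities p_T(s' | s, a)
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}

def g(s, a):                          # reward function g(s, a)
    return 1.0 if (s == 1 and a == 1) else 0.0

def pi(s):                            # stochastic policy pi(a | s)
    return [0.5, 0.5] if s == 0 else [0.1, 0.9]

def rollout(T=5, seed=0):
    """Generate one trajectory (s0, a0, r0, ..., s_{T-1}, a_{T-1}, r_{T-1})."""
    rng = random.Random(seed)
    s = rng.choices([0, 1], weights=P_S0)[0]              # s0 ~ p_s0
    trajectory = []
    for _ in range(T):
        a = rng.choices([0, 1], weights=pi(s))[0]         # a_t ~ pi(. | s_t)
        r = g(s, a)                                       # r_t = g(s_t, a_t)
        trajectory.append((s, a, r))
        s = rng.choices([0, 1], weights=P_T[(s, a)])[0]   # s_{t+1} ~ p_T(. | s_t, a_t)
    return trajectory

if __name__ == "__main__":
    print(rollout())
```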
1.2.3 Typical problem setting for sequential decision making
The sequential decision-making problem is a policy optimization problem:
given a set of policies,
the problem is to find the policy that maximizes a function for evaluating policies, called the objective function.
As it stands, the abstraction level of the problem is too high to give an efficient solution method.
Typically, we assume that the system follows a Markov decision process.
The objective function is typically either the expected reward
or the expected discounted cumulative reward, called the expected return.
The typical problem setting described above may not fit the real problem at hand:
in addition to maximizing the expected value, we may want to avoid large losses,
or, conversely, we may want to increase the probability of making a large profit.
In order to formulate real-world problems mathematically as sequential decision-making problems,
careful consideration of the problem setting is important.
It is also necessary to examine which class of policy sets needs to be considered.
1.2.4 Reinforcement Learning and Markov Decision Processes
Reinforcement learning is based on the results of research on Markov decision processes (planning).
The main objective of both Markov decision process research and reinforcement learning is the same:
for a system to be controlled that is described by a Markov decision process,
to find the optimal policy that maximizes the expected value of the cumulative reward,
in other words, to solve a sequential decision-making problem.
In the study of Markov decision processes, the system is often assumed to be known.
Reinforcement learning often deals with problems where the system is unknown.
Reinforcement learning considers learning strategies from interactions with the system as described above.
The system to be controlled is called the environment.
The controller or decision maker is called the agent.

1.3 Strategies

Introduction
In this section, we define several sets of policies for Markov decision processes.
We also discuss their complexity and relationships, and which set of policies is sufficient to consider.
1.3.1 Classification of policies
For the set Π of stochastic policies π, we can consider the set Πd of deterministic policies πd as a subset of Π.
The superscript d in πd and Πd stands for deterministic policy.
Since any πd can be rewritten in the form of a stochastic policy π, Πd is contained in Π.
The policies π and πd introduced so far depend only on the state s and choose actions independently of past experience,
and their decision rule (policy function) does not change as the time step t progresses.
A general Markov policy is a non-stationary policy sequence in which the policy function changes as the time step t progresses.
A time-invariant stationary policy is defined as follows.
For simplicity, we sometimes denote ΠS as Π and ΠSD as Πd.
The superscript S stands for stationary.
We can also consider non-Markov policies that select actions depending not only on the current state but also on past experience.
The most complex and expressive of the non-Markov policies is the history-dependent policy, which determines the action selection probability based on the history of all experiences up to the current time step t.
The set of policies containing any history-dependent policy πht is denoted as above.
A sequence of policies πdt and its set are defined as above.
Relationships among sets of policy sequences
1.3.2* Characteristics of policies
From the above inclusion relations,
if we identify the best policy sequence πh* from the history-dependent policy set ΠH,
there is no better policy.
The above relation holds for any objective function that takes a policy sequence as its argument.
Therefore, to find the best policy, it suffices to optimize over ΠH.
In general, however, optimization over ΠH is hard to handle because the policy size grows combinatorially with the time step.
Example 1.2
Checking the size of each policy set.
To simplify the comparison of sizes, we consider the following sets of deterministic policies, which are finite sets, rather than stochastic policies:
the set of stationary deterministic Markov policies,
the set of non-stationary deterministic Markov policies with time-step length T, and
the set of history-dependent deterministic policies with time-step length T.
Πh,dt is the set of history-dependent deterministic policies at time step t.
Policy set sizes in a Markov decision process of finite length T with |S| = 2 states and |A| = 2 actions (a counting sketch follows).
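One way to reproduce such a count (my own counting convention: a history at step t is taken to be (s0, a0, ..., st), so the exact numbers may differ from the book's table) is sketched below; it illustrates how quickly the history-dependent set explodes compared with the Markov sets.

```python
def num_stationary_det(S, A):
    """Stationary deterministic Markov policies: one fixed action per state."""
    return A ** S

def num_nonstationary_det(S, A, T):
    """Non-stationary deterministic Markov policies over T steps:
    an independent mapping S -> A at every time step."""
    return (A ** S) ** T

def num_history_dependent_det(S, A, T):
    """History-dependent deterministic policies over T steps, assuming a
    history at step t is (s0, a0, ..., st), i.e. S**(t+1) * A**t histories."""
    total = 1
    for t in range(T):
        num_histories = (S ** (t + 1)) * (A ** t)
        total *= A ** num_histories
    return total

if __name__ == "__main__":
    S, A, T = 2, 2, 3
    print(num_stationary_det(S, A))           # 4
    print(num_nonstationary_det(S, A, T))     # 64
    print(num_history_dependent_det(S, A, T)) # 4398046511104 (= 2**42)
```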
Proposition 1.1 Validity of Markov policies
No matter which history-dependent policy is used to select actions,
the joint probability of St and At at each time step t, which is an important characteristic of the environment (the stochastic control process),
can be reproduced exactly by using a simpler Markov policy.
Hereinafter XXXX

1.4 Formulation of a Sequential Decision-Making Problem

Introduction
In this chapter, we first review the problem setup of sequential decision making and the classification of Markov decision processes.
Next, the standard objective function is explained and the sequential decision problem is formulated.
We will also briefly introduce other formulations of the sequential decision-making problem.
1.4.1 Problem formulation
The problem of optimizing a strategy is called a sequential decision problem.
The only thing that can be learned is the strategy π
The environment model, a Markov decision process M≙{S,A,Ps0,PT,g}, is determined by the task to which the reinforcement learning is applied and is assumed to be time invariant.
If the environment model is known, we can optimize the policy from the environment model alone, even without data.
This is often distinguished from learning a policy from data:
finding the optimal policy from an environment model is often referred to as planning rather than learning.
A typical planning approach
Dynamic programming or linear programming using knowledge of the state transition probability pT, the reward function g, etc.
There are also planning methods that treat the environment model as a black-box model whose internal structure is unknown and that only returns an output (r,s’) for an input (s,a), and search for the optimal policy.
When the environment model is unknown,
unlike the case of planning,
the problem cannot be formulated as an optimization problem to which conventional optimization solvers can be directly applied;
learning from data (the results of interaction with the environment) is required.
In this book, we refer to the problem of learning strategies when the environmental model is unknown as the reinforcement learning problem.
There are two major settings for reinforcement learning problems.
One is a setting similar to that of conventional machine learning.
There is a large amount of data obtained from interactions with the environment.
Batch learning: learning strategies from that data
This is also called offline learning.
The other is
Learning by sequentially interacting with the environment and collecting data
There are two stances in decision making, and the balance between them must be considered (a minimal example of balancing them is sketched after this list):
Data collection (exploration)
The stance that there is not yet enough data.
A strategy of choosing actions that reduce uncertainty about the environment and allow for new discoveries.
Data utilization (exploitation)
The stance that there is already enough data.
A strategy of choosing the action that can be judged best from the data.
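As a minimal illustration of this balance (a common heuristic, not necessarily the policy model used in the book), an ε-greedy rule explores with probability ε and exploits otherwise; the action-value estimates Q used here are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon (exploration),
    otherwise pick the action with the highest estimated value (exploitation).
    q_values: estimated action values Q(s, a) for the current state."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# With Q = [0.2, 0.5, 0.1] the greedy action is 1, but about 10% of the
# time a uniformly random action is chosen instead.
print(epsilon_greedy([0.2, 0.5, 0.1]))
```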
Classification of Sequential Decision Making Problems
1.4.2 Unification of Markov Decision Processes
Depending on the setting of the target sequential decision-making problem, Markov decision processes with different termination conditions can be considered, as follows.
(A) There is a goal state, and the process ends when the goal state is reached.
(B) The process ends when a predetermined time step is reached.
Example: maximizing total sales under a predetermined deadline such as a quarter.
By extending the state to incorporate the elapsed (or remaining) time-step information and transitioning to an absorbing state with probability 1 when the specified end time step is reached, (B) can be transformed into a Markov decision process of infinite time length.
(C) The process does not terminate (a Markov decision process of infinite time length).
Without changing the meaning of the Markov decision processes in (A) and (B),
they can be reformulated as the Markov decision process in (C) with only a slight change of representation.
An absorbing state is a state that does not transition to any other state.
For example, if the original state set is S ≙ {i,j,k} (three states), the end time step is T = 2, and the absorbing state is z,
the expanded state set has seven elements, S = {i0,j0,k0,i1,j1,k1,z}.
The size of the expanded state set grows in proportion to the end time step T, as |S| × T + 1 (a small sketch of this expansion follows).
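A minimal sketch of this state expansion (the state-naming convention is my own, chosen to match the {i0, …, z} example above):

```python
def expand_states(states, T, absorbing="z"):
    """Augment each state with the elapsed time step (0 .. T-1) and add a single
    absorbing state, turning a fixed-horizon MDP into an infinite-horizon one.
    The expanded size is |S| * T + 1."""
    expanded = [f"{s}{t}" for t in range(T) for s in states]
    expanded.append(absorbing)
    return expanded

print(expand_states(["i", "j", "k"], T=2))
# ['i0', 'j0', 'k0', 'i1', 'j1', 'k1', 'z']  ->  3 * 2 + 1 = 7 states
```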
1.4.3 Return and Objective Functions
1.4.4* Other Sequential Decision Making Problems

Chapter 2 Planning

Introduction

This chapter deals with the planning problem, which is a sequential decision-making problem when the environment is known.
The characteristics and solution methods of the planning problem provide a basis for dealing with reinforcement learning problems in which the environment is unknown.
By examining the characteristics of the planning problem, we can also see which class of policies we need to consider even when the environment is unknown.
As stochastic approximations of planning methods, we can derive typical reinforcement learning methods such as the TD method and the Q-learning method.
In this chapter
First, we introduce the objective function, the optimal value function, and the optimal policy based on expected returns, and then introduce dynamic programming (a general term for a class of methods, not a specific method) as a way of finding the optimal policy.
Next, the value iteration method and the policy iteration method are presented as methods based on dynamic programming (a minimal value-iteration sketch follows this introduction).
We also touch on the solution by linear programming and discuss the properties of the objective functions introduced.
In Chapter 5, we introduce planning that treats the environment as a black box model.
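As a minimal, generic sketch of the value-iteration idea (not code from the book): repeatedly apply the Bellman optimality backup to every state until the value function stops changing, assuming the transition probabilities pT and reward function g are known.

```python
def value_iteration(S, A, p_T, g, gamma=0.9, tol=1e-8):
    """Generic value iteration for a finite MDP.
    p_T[(s, a)] is a dict {s_next: probability}; g(s, a) is the reward."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            backup = max(
                g(s, a) + gamma * sum(p * V[s2] for s2, p in p_T[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

# Single state with a self-loop and reward 1: V converges to 1 / (1 - 0.9) = 10.
print(value_iteration(S=[0], A=[0], p_T={(0, 0): {0: 1.0}}, g=lambda s, a: 1.0))
```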

2.1 Preparation
2.2 Dynamic Programming
2.3 Solution by Dynamic Programming
2.4 Solution by Linear Programming

Chapter 3: Tradeoffs between Exploration and Utilization

Introduction

This chapter explains the trade-off between exploration and exploitation of data, which arises in sequential learning,
and describes typical policy models that take this trade-off into account (one common example is sketched below).
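One common policy model for handling this trade-off (a standard example; whether it matches the book's exact catalogue is not confirmed here) is the softmax (Boltzmann) policy, which turns action-value estimates into selection probabilities with a temperature parameter: higher temperature means more exploration.

```python
import math
import random

def softmax_policy(q_values, temperature=1.0, rng=random):
    """Boltzmann action selection: higher-valued actions are chosen more often;
    a larger temperature flattens the distribution (more exploration)."""
    prefs = [q / temperature for q in q_values]
    m = max(prefs)                                   # subtract max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return rng.choices(range(len(q_values)), weights=weights)[0]

print(softmax_policy([0.2, 0.5, 0.1], temperature=0.5))
```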

3.1 Overview
3.2 Trade-off between exploration and exploitation
3.3 Policy Models

Chapter 4: Model-Free Reinforcement Learning

Introduction

In Chapter 2, we considered the optimization of a strategy (planning problem) under perfect information, assuming that the environment (Markov decision process) is known.
Here, the environment is unknown, and we consider learning strategies from data obtained through interaction between the environment and the agent.
Model-free reinforcement learning, also called environment-non-identifying reinforcement learning,
is an approach that learns policies without explicitly estimating the environment.
In the next chapter, we introduce model-based (environment-identifying) reinforcement learning, which explicitly estimates the environment.
Most of the methods introduced in this chapter are based on the dynamic programming methods presented in Chapter 2 and follow the idea of stochastic approximation, learning from stochastically observed data.
Methods derived from the value iteration method:
Q-learning method
Methods related to the policy iteration method:
SARSA method
Actor-critic method
A minimal tabular Q-learning update is sketched after this list.
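A minimal sketch of the tabular Q-learning update in its standard form (step size α and discount factor γ are generic parameters, not values from the book):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target
    r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Usage with a table that defaults to 0 for unseen (state, action) pairs:
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(Q[(0, 1)])   # 0.1
```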

4.1 Data-Based Decision Making
4.2 Estimating the Value Function
4.3 Learning Strategies and Action Value Functions
4.4 Convergence
4.5 Actor-critic method

Chapter 5: Model-Based Reinforcement Learning

Introduction

In Chapter 4, we described model-free reinforcement learning without estimating the environment.
This chapter deals with model-based reinforcement learning, in which the environment model is explicitly estimated and the policy is obtained using the estimated environment model (a minimal model-estimation sketch follows).
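As a minimal sketch of the model-based idea (a generic maximum-likelihood estimate, not the book's specific algorithm): estimate pT and g from observed transitions, then plan on the estimated model, for example with a value-iteration routine like the one sketched under Chapter 2 above.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Maximum-likelihood estimate of the environment model from data.
    transitions: iterable of (s, a, r, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    p_T = {sa: {s2: c / visits[sa] for s2, c in nexts.items()}
           for sa, nexts in counts.items()}
    g_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return p_T, g_hat

data = [(0, 1, 1.0, 1), (0, 1, 0.0, 0), (1, 0, 0.5, 1)]
p_T, g_hat = estimate_model(data)
print(p_T[(0, 1)], g_hat[(0, 1)])   # {1: 0.5, 0: 0.5} 0.5
```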

5.1 Organizing the Problem
5.2 Environment Estimation
5.3 Planning for Black Box Generative Models
5.4 Online model-based reinforcement learning

Chapter 6: Reinforcement Learning Using Function Approximation

Introduction

When the number of states is large or the state space is continuous,
the number of table entries becomes too large for the conventional table-style functions that hold a value for each state,
and learning becomes difficult.
In this chapter, we consider approximating the value function and the policy function with function approximators and learning them (a minimal linear-approximation sketch follows).
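A minimal sketch of value-function approximation with a linear model and a semi-gradient TD(0) update (a standard construction; the feature map phi is a hypothetical placeholder, not one defined in the book):

```python
def td0_linear_update(w, phi, s, r, s_next, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with a linear value function V(s) = w . phi(s).
    phi(s) returns a feature vector (list of floats) for state s."""
    x, x_next = phi(s), phi(s_next)
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    td_error = r + gamma * v_next - v
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]

# Hypothetical 2-dimensional features over a continuous state s in [0, 1]:
phi = lambda s: [1.0, s]
w = [0.0, 0.0]
w = td0_linear_update(w, phi, s=0.3, r=1.0, s_next=0.7)
print(w)   # [0.01, 0.003]
```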

6.1 Overview
6.2 Function Approximation for Value Functions
6.3 Function Approximation for Policies

Chapter 7 Partially Observed Markov Decision Processes

Introduction

Until now, the agent has been assumed to be able to observe a Markovian state.
This assumption is impractical for some real problems:
the state can only be partially observed, and
future events depend not only on the current observation but also on the past history.
A mathematical model of such situations is
the partially observable Markov decision process (POMDP).
In this chapter, we introduce the partially observable Markov decision process and its solution methods (a minimal belief-update sketch follows).
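One core computation in POMDPs is the belief update: maintaining a probability distribution over the hidden states from the history of actions and observations. Below is a minimal, generic Bayes-filter sketch (the transition model p_T and observation model p_O are hypothetical inputs, not the book's notation).

```python
def belief_update(belief, action, observation, p_T, p_O):
    """Update the belief b(s) after taking `action` and receiving `observation`.
    belief: {s: probability}; p_T[(s, action)] = {s_next: prob};
    p_O[(s_next, action)] = {obs: prob}."""
    states = list(belief)
    # Prediction step: push the belief through the transition model.
    predicted = {s2: sum(belief[s] * p_T[(s, action)].get(s2, 0.0) for s in states)
                 for s2 in states}
    # Correction step: weight by the observation likelihood and renormalize.
    unnormalized = {s2: predicted[s2] * p_O[(s2, action)].get(observation, 0.0)
                    for s2 in states}
    z = sum(unnormalized.values())
    return {s2: v / z for s2, v in unnormalized.items()} if z > 0 else predicted
```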

7.1 Basics of Partially-Observed Markov Decision Processes (POMDPs)
7.2 Planning a POMDP
7.3 Learning a POMDP

Chapter 8 Recent Topics

Introduction

Recent new developments in reinforcement learning include the following
Distributional reinforcement learning based on return distributions rather than expected returns (value functions)
Deep reinforcement learning using deep neural network models as function approximators (a minimal Q-network sketch follows)
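As a minimal sketch of the function-approximator side of deep reinforcement learning (a generic DQN-style Q-network assuming PyTorch is available; not code from the book):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """A small fully connected network mapping a state vector to one Q-value
    per action, as used in DQN-style deep reinforcement learning."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork(state_dim=4, num_actions=2)
print(q(torch.zeros(1, 4)).shape)   # torch.Size([1, 2])
```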

8.1 Distributional Reinforcement Learning
8.2 Deep Reinforcement Learning

Appendix A Appendix

A.1 Proof
A.2 Norm
A.3 Linear programming
A.4 Supplement to the Natural Gradient Method

 
