Machine Learning Professional Series “Online Prediction” Reading Memo

Summary

Online prediction is a theoretical framework in machine learning for receiving data sequentially and making predictions in real time. It is used not only for straightforward forecasting problems, such as literally predicting the weather or stock price movements, but also for building probabilistic language models, designing high-throughput network routing, determining reasonable prices in auctions, deriving optimal mixed strategies in games, and other decision-making problems in which future events must be predicted from past experience.

Online prediction theory has also been developed so that good predictions can be made even when the distribution of the given data is not known in advance, and it can be used to find near-optimal prediction methods and maximize prediction performance.

These techniques are closely related to the multi-armed bandit problem described in “Theory and Algorithms of the Bandit Problem” and to the reinforcement learning techniques described in “Theory and Algorithms of Various Reinforcement Learning Techniques and Their Implementation in Python“.

Here we discuss the various algorithms and applications of online prediction, based on the Machine Learning Professional Series book “Online Prediction”.

This article contains my reading notes.

Machine Learning Professional Series “Online Prediction” Reading Memo

How do we capture and learn from the constant flow of information? Online prediction now has a wide range of applications, including ad serving, securities investing, and routing. …… Understand and use a variety of algorithms including Hedge, OGD, and ONS.

Preface

Temporal prediction and decision-making problems, such as weather forecasting and stock price prediction, have long been studied in statistics and related fields.
Since the 1990s, theoretical researchers in computer science have approached learning and prediction problems from an algorithmic perspective,
and a new discipline called “computational learning theory” was born.
Within it, the theory that deals with sequential prediction problems is online prediction theory.
Notable feature of online prediction theory:
No assumptions are placed on the data generation process.
Without such assumptions, it is generally impossible to evaluate the performance of a prediction algorithm in absolute terms.
Instead of an absolute evaluation of performance,
online prediction theory evaluates performance relative to a prediction strategy that is “good in hindsight,”
formulating and bounding a quantity called regret, in the sense of “I wish I had done this back then.”
This makes the theory very useful for online prediction problems in which it is difficult to model how the data are generated.
Online prediction theory provides effective design guidelines not only for online prediction but also for optimization problems.

Chapter 1: Expert Integration Problems

Introduction.

As an introduction to online prediction theory,
this chapter presents its most basic framework, the expert integration problem.
Despite its simplicity, the expert integration problem can represent decision-making problems in various domains,
such as the problem of specifying the probability distribution of an information source in universal coding,
and action selection problems in repeated games.

1.1 The Problem of N Quiz Kings

Introduction
Assumptions
You are participating in the first round of a quiz competition held at the Tokyo Dome.
The first question: “The Statue of Liberty in New York used to be a lighthouse. True (○) or false (×)?”
Many true/false (○/×) quizzes follow, and the top performers advance to the next round.
You spot N quiz kings (experts) who have won this competition before, and you decide your own answer after watching what they do.
The experts move to their answer positions quickly, so you have plenty of time to peek at their answers before committing to your own.
How should you integrate the experts’ answers?
This section discusses the problem formulation and the design and analysis of integration algorithms.
Characters in the N Quiz Kings problem
Player
N experts
The environment (the quiz master)
Entities other than the player are external to it, and their behavior cannot be controlled by the player.
Definition of the problem
The environment presents a vector zt = (zt,1, …, zt,N) ∈ {0,1}N to the player.
The i-th component zt,i ∈ {0,1} of zt represents the i-th expert’s answer to the quiz in trial t
The player predicts the answer as xt∈{0,1} and outputs this
The environment presents the player with the correct answer yt ∈ {0,1} to the quiz
In each trial t, the player determines the answer xt based only on the data series presented so far, St = ((z1,y1), …, (zt-1,yt-1), zt).
The algorithm to consider is the part that finds xt from St
The player’s obvious goal:
To keep the number of mistakes (the above equation) as small as possible.
What if we try statistical learning methods?
Introduce some hypothesis class H⊆{h:{0,1}N→{0,1}}
After estimating the hypothesis ht∈H by considering St as a sample
Let xt=ht(zt) be the predicted value
A typical binary classification learning problem
If learning is successful, the generalization error Pr(yt≠xt) converges to 0
The expected number of mistakes can then be kept small
For statistical learning methods to work, there must be a stochastic assumption about how the samples are generated
Example
Assumption: There is a distribution D on {0,1}Nx{0,1}, and each case (zt,yt) is generated independently according to the distribution D.
In the N Quiz Kings problem, it is unnatural to make such an assumption.
Adversarial environments
If no assumptions are made about how the samples are generated, even the best player algorithm will get all the questions wrong in the worst case.
Let A be any (deterministic) algorithm of the player.
Algorithm A may even be a learning algorithm whose generalization error converges to zero under the stochastic assumption above.
Let A(St) ∈ {0,1} denote the player’s answer output by algorithm A when given the data series St as input.
Consider Algorithm B in the above environment
Even an environment that is not intentionally adversarial can behave like this adversary by chance, with non-zero probability.
We cannot rule out the possibility that real experts and quiz masters happen to behave exactly like Algorithm B, even if they believe they are acting fairly.
If the environment knows the player’s Algorithm A, it knows the player’s answer before posing each quiz, so it can pose quizzes for which the player’s answer is wrong.
Introducing adversarial algorithms for worst-case analysis.
From this point on, we will only consider environments that behave in an adversarial manner.
The environment is called an adversary.

1.2 When there is an expert who answers all questions correctly

Introduction.
What is a good strategy for the problem of N quiz kings in the presence of an adversary?
First, consider the case where there is an expert who answers all questions correctly
A simple strategy for the player
Pick one of the experts who keeps answering all the questions correctly and imitate him/her.
Make only N-1 mistakes at worst.
Each time the player makes a mistake, the number of experts who have answered all questions correctly so far decreases by at least one.
1.2.1 Dichotomy
Best strategy
Follow the majority of experts who continue to answer all questions correctly.
If a player makes a mistake, then the majority of experts who continue to answer all questions correctly will also make a mistake.
Each time the player makes a mistake, the number of experts who have answered all questions correctly is reduced to at most half (a code sketch of this strategy follows at the end of this subsection).
Theorem on the number of mistakes of the dichotomy method
Under the assumption that there is an expert who answers all questions correctly,
the dichotomy method is the optimal deterministic algorithm.
Optimality of the dichotomy method
For any deterministic algorithm A of the player,
Construct an adversary’s algorithm B’.
The goal of B’ is to force the player to make at least log2 N mistakes while
guaranteeing the existence of an expert who answers all questions correctly.
For simplicity,
assume the number of experts N is a power of 2 (N = 2^k for some natural number k).
Algorithm B’ sets the total number of trials to T = k and, in the same way as Algorithm B, forces the player to answer every question incorrectly.
Pseudo-Code of Algorithm B’
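Returning to the dichotomy strategy itself (the classical Halving algorithm), here is a minimal Python sketch of following the majority vote among the experts that have been correct so far — my own illustration, not code from the book:

```python
import numpy as np

def dichotomy_predict(expert_answers, still_correct):
    """Follow the majority vote among experts that have answered
    every previous quiz correctly (the dichotomy / Halving strategy)."""
    votes = expert_answers[still_correct]
    return int(votes.sum() * 2 >= votes.size)   # 1 iff at least half vote 1

def run_dichotomy(expert_answers, correct_answers):
    """expert_answers: (T, N) array of z_{t,i} in {0,1};
    correct_answers: (T,) array of y_t in {0,1}.
    Returns the player's number of mistakes; if some expert answers
    every quiz correctly, this is at most log2 N."""
    T, N = expert_answers.shape
    still_correct = np.ones(N, dtype=bool)
    mistakes = 0
    for t in range(T):
        x_t = dichotomy_predict(expert_answers[t], still_correct)
        if x_t != correct_answers[t]:
            mistakes += 1
        # eliminate experts that got this quiz wrong
        still_correct &= (expert_answers[t] == correct_answers[t])
    return mistakes
```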
1.2.2 Random Selection Dichotomy
When there is an expert who answers all questions correctly,
the dichotomy method is optimal among deterministic algorithms,
with an upper bound of log2 N on the number of mistakes.
In this section and the next, we consider randomized algorithms and evaluate them by the expected number of mistakes.
Here we consider the following simple strategy.
Choose one expert at random from among those who have answered all questions correctly, and follow his decision.
Performance of randomly chosen dichotomy
Since lnN≃0.693log2N
The randomly chosen dichotomy method improves performance by about 30% over the dichotomy method
A series like the one in the above equation is called a telescoping sum,
named after the nested-tube mechanism of a telescope.
It is often used in the analysis of online prediction algorithms.
1.2.3 c-Random Choice Dichotomy
Performance can be improved by further refinement
In the dichotomy and randomly chosen dichotomy methods, consider how the probability pt of outputting xt = 1 on trial t is determined.
Let Wt be the number of experts who have answered every quiz correctly up to trial t-1.
Let Kt be the number of experts who predicted the answer to be 1 in the quiz of trial t.
Then pt can be expressed as pt = f(Kt/Wt) using a function f: [0,1] → [0,1].
Concretely,
for the dichotomy method, which decides deterministically by majority vote, f is a step function at 1/2,
and for the randomly chosen dichotomy method, which picks an expert uniformly at random, f is the identity function.
As shown in the figure above, the algorithm that generalizes this to pt = fc(Kt/Wt), using a function fc parameterized by a real number c ∈ [0, 1/2], is called the c-randomly chosen dichotomy method.
The c-randomly chosen dichotomy algorithm
The dichotomy method is the special case c = 1/2, and the randomly chosen dichotomy method is the special case c = 0.
By optimizing the value of c, it is possible to derive an algorithm with better performance (a concrete candidate form of fc is sketched at the end of this subsection).
To analyze the performance of the c-randomly chosen dichotomy method,
Consider the relationship between the error probability qt=Pr(yt≠xt) and Wt+1/Wt at trial t
When yt=0
When yt=1
The above holds in both cases.
Consider an exponential function bc^(-qt) that bounds the function on the right-hand side from above.
Here bc is the largest real number b satisfying 1 - fc^(-1)(x) ≤ b^(-x) for all x ∈ [0,1],
and it is uniquely determined by the parameter c.
Using this bc, we can evaluate the performance of the c-randomly chosen dichotomy method.
The c-randomly chosen dichotomy with the parameter c that maximizes bc is optimal.
Relationship between the function 1 - fc^(-1)(x) and bc^(-x) for the dichotomy method, the randomly chosen dichotomy method, and the c-randomly chosen dichotomy method with general c.
Optimal c randomly chosen dichotomy
Optimal value of c is about 0.15
This result shows that the following strategy is superior
If at least 85% of the experts who have answered all questions correctly so far agree on an answer, follow that answer.
If not (there is more disagreement),
follow the majority answer with a probability slightly amplified beyond the majority’s share,
and follow the minority answer with the remaining probability.
The optimal c-randomly chosen dichotomy is the best among all randomized algorithms.
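As a concrete candidate consistent with the behavior described above, one natural choice for fc is a clipped-linear function (this specific form is my assumption; the book’s exact definition of fc may differ):

$$
f_c(x) =
\begin{cases}
0, & x \le c, \\
\dfrac{x - c}{1 - 2c}, & c < x < 1 - c, \\
1, & x \ge 1 - c.
\end{cases}
$$

With c = 1/2 this reduces to the majority vote of the dichotomy method, with c = 0 to the identity function of the randomly chosen dichotomy method, and with c ≈ 0.15 the prediction becomes deterministic once roughly 85% of the surviving experts agree.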

1.3 When there is not always an expert who answers all questions correctly

Introduction
The general case where there is not always an expert with all correct answers
In this case,
for any deterministic player algorithm A,
an adversary can use Algorithm B to make A answer every question incorrectly.
Against a randomized player,
by choosing yt adversarially,
the adversary can make the expected number of mistakes at least T/2.
Conversely, the expected number of mistakes of a player who predicts the answer uniformly at random on each trial is exactly T/2 against any adversary.
From this, we can conclude that the algorithm that answers randomly on each trial is optimal under the following criteria
The number of player errors (expected value) is the evaluation metric.
No assumption is made about how the data series ((z1,y1), …, (zT,yT)) is generated.
Evaluate the player’s performance in the worst case
The above discussion assumes the worst case, including the behavior of the N experts.
1.3.1 Regret
Consider a more realistic problem setting
What the N Quiz Kings problem requires is
an algorithm that is guaranteed to achieve a reasonably good score whenever at least one expert achieves a good score (even if not all answers are correct).
In other words,
if all the experts perform poorly, we give up;
if any one expert performs well, the algorithm is guaranteed to also achieve a reasonably good result.
Based on this philosophy, we will revise the algorithm’s metrics.
Until now,
the number of mistakes itself was used as the evaluation metric.
From now on,
the algorithm’s number of mistakes is evaluated
relative to the number of mistakes of the best performing expert.
The definition of regret, widely accepted in the field of online prediction
The first and second terms of RegretA(S) are the expected number of mistakes of algorithm A and the number of mistakes of the best performing expert, respectively.
Regret is defined as the difference between them.
Therefore, the smaller the regret, the better the algorithm performed, comparable to the best expert.
RegretA(T) denotes the worst-case regret of algorithm A when the total number of trials is T.
From this point on, the book adopts this concept of relative evaluation for online prediction problems,
defining a regret for each problem based on it.
Throughout, the goal of algorithm design is to minimize the regret (RegretA(S) and RegretA(T)).
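For reference, in the N Quiz Kings setting the regret takes the following standard form (my paraphrase of the definition described above; the book’s notation may differ slightly):

$$
\mathrm{Regret}_A(S) = \mathbb{E}\!\left[\sum_{t=1}^{T} \mathbf{1}[x_t \neq y_t]\right] - \min_{1 \le i \le N} \sum_{t=1}^{T} \mathbf{1}[z_{t,i} \neq y_t],
\qquad
\mathrm{Regret}_A(T) = \max_{S} \mathrm{Regret}_A(S).
$$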
1.3.2 The Meaning of Regret
After all the trials are done, the best expert i* is revealed.
At this point, the player regrets that if they had answered exactly as expert i* did from start to finish,
the number of mistakes would have been only the quantity in the above equation.
This regret grows as the gap between that score and the player’s own score (the above equation) grows.
Regret is a quantification of the magnitude of this feeling of regret.
In general
The player’s predictions can influence the adversary’s subsequent choice of data
Even for an algorithm with small regret,
the number of mistakes itself may not improve in the worst case.
In that case (when even the best expert performs poorly), we “give up”;
in other words, we take the position that the algorithm’s performance in such cases is irrelevant.
1.3.3 Weighted Average Algorithm
By slightly modifying the randomly chosen dichotomy method, we can obtain an algorithm with small regret.
First, review how the randomly chosen dichotomy method is expressed.
Let the number of experts who answered all questions correctly up to trial t-1 be Wt
Let Kt be the number of experts who predicted the answer to be 1 on trial t.
With probability pt=Kt/Wt, xt=1
With probability 1-pt, xt=0
At this time, Wt and Kt can be expressed as follows using the indicator variables (above equation)
The above equation can be expressed as follows
Thus, pt in the above equation is the weighted average of the experts’ predictions zt,i, where each expert i is weighted by wt,i/Wt.
When the environment provides the correct answer yt, the indicator variables can be updated as in the above equation using the constant β = 0.
If the constant β is instead a parameter taking an appropriate value 0 < β < 1,
we obtain an algorithm with small regret: the weighted average algorithm (WAA).
The above equation is used to update the weights.
WAA outputs pt itself,
omitting the post-processing step of “randomly selecting xt ∈ {0,1} with Pr(xt = 1) = pt and outputting it as the answer.”
In fact, the error probability in trial t of WAA is Pr(xt≠yt)=|yt-pt|
This can be easily checked by dividing the case of yt∈{0,1}.
The expected number of mistakes over all trials is given by the above equation.
The regret can thus be evaluated without using xt.
Algorithm
Performance of WAA
A theorem gives an upper bound on the expected number of mistakes of WAA,
indicating that WAA achieves, to some extent, the goal of being “comparable to the best expert’s performance.”
Another version of the algorithm that is equivalent to WAA
One that keeps and updates the normalized weights w’t,i=wt,i/Wt instead of wt,i
Algorithm in detail
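To make the update concrete, here is a minimal Python sketch of a weighted average algorithm in the spirit of WAA (my own illustration based on the description above; details such as the bookkeeping are not taken from the book):

```python
import numpy as np

def waa(expert_predictions, answers, beta=0.5):
    """Weighted average algorithm for the 0/1 expert setting.

    expert_predictions: (T, N) array of expert answers z_{t,i} in {0, 1}
    answers:            (T,)   array of correct answers y_t in {0, 1}
    beta:               weight decay parameter, 0 < beta < 1
    Returns the sequence of outputs p_t and the expected number of mistakes.
    """
    T, N = expert_predictions.shape
    w = np.ones(N)                      # w_{1,i} = 1 for every expert
    expected_mistakes = 0.0
    outputs = []
    for t in range(T):
        z_t, y_t = expert_predictions[t], answers[t]
        p_t = np.dot(w, z_t) / w.sum()  # weighted average of expert predictions
        outputs.append(p_t)
        expected_mistakes += abs(y_t - p_t)   # Pr(x_t != y_t) = |y_t - p_t|
        w = w * beta ** (z_t != y_t)    # multiply weight by beta when expert errs
    return outputs, expected_mistakes
```

With beta = 0 this reduces to the randomly chosen dichotomy method, in which an expert is eliminated after its first mistake; with 0 < beta < 1 an expert’s weight instead decays geometrically with its number of mistakes.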
1.3.4 Improved Weighted Average Algorithm c-WAA
As with the extension of the randomly chosen dichotomy method to the c-randomly chosen dichotomy method,
applying the transformation function fc to WAA yields the improved algorithm c-WAA.
c-WAA uses the above equation instead of the third line of Algorithm 1.4.
The above equation is exactly the same as the one that appeared when we evaluated the performance of the c-randomly chosen dichotomy method.
By bounding it from above by bc^(-qt) using the constant bc, we obtain an upper bound on the expected number of mistakes of c-WAA.
Since this upper bound becomes smaller as bc increases, the optimal c-WAA can be obtained by choosing c that maximizes bc.
The theorem summarizes this result
1.3.5 Choosing the Parameter β and Deriving the Regret Upper Bound for WAA
Deriving the regret upper bound for WAA
1.3.6 Automatic Adjustment of β – The Doubling Trick

1.4 Expert Integration Problems

Introduction
We have discussed the problem of the N quiz kings as an example of an expert integration problem.
We now consider an extension of this problem, the expert integration problem.
An expert integration problem is defined by a triple (X, Y, ℓ)
X and Y are sets:
the set of predictions and the set of outcome values, respectively
ℓ is a function from X × Y to [0, ∞)
called the loss function
The expert integration problem (X, Y, ℓ) is described as the above protocol between a player and an adversary
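For reference, the protocol of the expert integration problem (X, Y, ℓ) can be sketched as follows (my paraphrase of the protocol referred to above; the book’s exact protocol box may differ):

$$
\begin{aligned}
&\text{For each trial } t = 1, 2, \dots, T:\\
&\quad \text{the adversary presents the experts' predictions } z_t = (z_{t,1}, \dots, z_{t,N}) \in X^N;\\
&\quad \text{the player outputs a prediction } x_t \in X;\\
&\quad \text{the adversary reveals the outcome } y_t \in Y;\\
&\quad \text{the player incurs the loss } \ell(x_t, y_t).
\end{aligned}
$$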

1.4.1 N Quiz King Questions
1.4.2 Online Allocation Problem
1.4.3 Weighted Average Algorithm
1.4.4 Logarithmic Loss and Information Compression

1.5 Formulation with Weights and Loss Functions

1.5.1 Standard Form of the Expert Integration Problem
1.5.2 Online Allocation Problem and Hedging Algorithm
1.5.3 Formulation with Weights and Loss Functions
1.5.4 Preparation for Chapter 2

1.6 Literature Notes

Chapter 2 Online Convex Optimization

Introduction

What is online convex optimization?
A problem framework that abstracts and handles various online prediction problems in a unified manner.

2.1 Framework of online convex optimization

An online convex optimization problem (X, F) is defined by
a convex set X and
a case space F, a set of convex functions on X, F ⊆ {f : X → ℝ | f is convex}.
The online convex optimization problem is described above as a protocol between a player and an adversary.
We assume that the convex function ft is explicitly given to the player by the adversary.
That is, not only the function value ft(xt), but also
The function ft itself is given
Therefore,
for any x ∈ X,
we can compute not only ft(x)
but also the gradient ∇ft(x) of the function ft, its subgradients, and the like.
This setting is called the full-information setting.
The goal of the player is to minimize the regret
Regret for the online convex optimization problem
Image of the protocol
On the other hand,
the setting in which the function ft itself is not given
and only the function value ft(xt) is revealed is called the bandit setting.
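As a reference point, the regret in online convex optimization is typically defined as follows (the standard form; the book’s notation may differ slightly):

$$
\mathrm{Regret}_A(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in X} \sum_{t=1}^{T} f_t(x).
$$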

2.2 Follow The Leader (FTL) Strategy

2.2.1 Effectiveness of the FTL Strategy
2.2.2 Limitations of the FTL Strategy

2.3 Follow The Regularized Leader (FTRL) Strategy

2.3.1 When the loss function is linear
2.3.2 Returning to Online Linear Optimization Problems
2.3.3 Online Gradient Descent Method
2.3.4 Bregman divergence
2.3.5 Generalization of FTRL Strategy
2.3.6 Regret Lower Bound for Online Linear Optimization Problems
2.3.7 When the Loss Function is Strongly Convex

2.4 Online Newton Method and Follow The Approximate Leader Strategy

2.4.1 Online Newton Method (ONS)
2.4.2 Follow The Approximate Leader (FTAL) Strategy

2.5 Application to Offline Optimization
2.6 Literature Notes

Chapter 3 Online Prediction Based on Randomness

Introduction

This chapter describes online prediction based on randomness.

3.1 Follow The Perturbed Leader (FPL) Strategy

In the previous chapter, we discussed the properties of the greedy strategy FTL and of regularized greedy strategies such as FTRL.
Are there other strategies that guarantee an upper bound on the regret?
The Follow The Perturbed Leader (FPL) strategy takes a rather different approach from the previous ones.
Like FTRL, FPL is a kind of extension of FTL.
Where FTRL used a regularization term,
FPL instead runs a greedy strategy after adding random noise to the objective function.
FPL Algorithm
Regret upper bound for the FPL strategy
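As a rough illustration of this idea, here is a minimal FPL-style sketch for the expert setting (my own example; the exponential noise distribution and the scale parameter eta are illustrative choices, not taken from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

def fpl_choose(cumulative_loss, eta=1.0):
    """Follow the Perturbed Leader: add random noise to the cumulative
    losses and follow the expert whose perturbed loss is smallest."""
    noise = rng.exponential(scale=1.0 / eta, size=cumulative_loss.shape)
    return int(np.argmin(cumulative_loss - noise))

# Usage sketch: loss_matrix[t, i] is the loss of expert i on trial t.
loss_matrix = rng.random((100, 5))          # toy data: 100 trials, 5 experts
cumulative = np.zeros(loss_matrix.shape[1])
total_loss = 0.0
for t in range(loss_matrix.shape[0]):
    i_t = fpl_choose(cumulative)            # decide before seeing trial t's losses
    total_loss += loss_matrix[t, i_t]
    cumulative += loss_matrix[t]            # then observe and accumulate losses
```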

3.2 Exponentially Weighted Follow The Perturbed Leader (FPL*) Strategy
3.3 Relevance to the FTRL Strategy
3.4 Literature Notes

Chapter 4 Combinatorial Online Prediction

Introduction

This chapter discusses efficient online learning when the decision space is some combinatorial set or set of discrete structures.

4.1 What is combinatorial online prediction?

Examples of combinatorial sets used in combinatorial online prediction problems
Experts
Expert problems can also be considered as combinatorial online prediction
The set of all ways of choosing one element out of n elements is the decision space.
The size of the set is n.
k-sets
The set of all ways of choosing k elements out of n elements.
The problem of multiple ad selection can be regarded as online prediction on k-sets
The size of the set is O(n^k)
Permutation
A set consisting of the permutations of n elements.
Permutations are used in online prediction problems for ranking, assignment, and scheduling.
The size of the set is n! = Ω((n/e)^n).
Paths
Given a fixed graph, a starting point, and an ending point, the path set is the entire set of paths from the starting point to the ending point.
It is used in online shortest paths.
The size of the set is generally exponential to the size of the graph.
If we naively treat every combination as an expert and apply the Hedge algorithm,
each trial can take hours of computation in the worst case.
In combinatorial online problems,
it is therefore important to design online prediction algorithms that run in polynomial time even when the decision space is exponentially large.
Assume that the combinatorial set can be represented by a vector set
The decision space C is a subset of the n-dimensional Euclidean space, C ⊂ ℝ^n.
Vector set representation in the previous example
Expert (n=4)
K-set (n=4,k=2)
Permutation(n=3)
Permutation (matrix representation)
Path (paths from the starting point to the end point in the left figure)
Directed graph
By treating the combinatorial set as a vector set,
we can formulate the problem in the same way as an online convex optimization problem.
Definition of the combinatorial online prediction problem
In this chapter, for simplicity, we consider only online linear optimization problems (when the loss function is linear).
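As a small illustration of the vector-set representation, the k-set example above (n = 4, k = 2) can be written as 0/1 indicator vectors (my own sketch):

```python
from itertools import combinations
import numpy as np

n, k = 4, 2
# Each k-set is represented as a 0/1 indicator vector in R^n.
decision_space = []
for subset in combinations(range(n), k):
    v = np.zeros(n, dtype=int)
    v[list(subset)] = 1
    decision_space.append(v)
# 6 vectors such as [1,1,0,0], [1,0,1,0], ...; a linear loss is then
# the inner product of a loss vector with the chosen indicator vector.
print(len(decision_space))  # C(4, 2) = 6
```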
In the next section, we will discuss approaches to efficient online prediction methods for combinatorial sets.

4.2 Sampling-based Approach

First, we describe a sampling-based approach.

4.3 Off-line linear optimization-based approach
4.4 Continuous Relaxation and Discretization Based Approaches
4.5 Literature Notes

Appendix A Mathematical Preparation

A.1 Inner product, semidefinite matrix
A.2 Norms
A.3 Convex sets, convex functions
A.4 Convex optimization
A.5 Probability inequalities
