Summary
Online learning is one of the learning frameworks in machine learning. In the online learning framework, the model is improved sequentially, each time one data point (or a portion of all the data) arrives, using only the given data rather than all the data at once. Because of this processing style, it is applied to data analysis at a scale where all the data cannot be stored in memory or cache, and to learning in environments where data is generated continuously.
Online learning requires the ability to receive sequential data and change algorithms and data models accordingly. Algorithms used include perceptron, ADALINE, naive Bayes, linear regression, and deep learning.
These techniques are closely related to the multi-armed bandit problem described in “Theory and Algorithms of the Bandit Problem” and to the reinforcement learning techniques described in “Theory and Algorithms of Various Reinforcement Learning Techniques and Their Implementation in Python”.
This section describes various algorithms and applications of online learning based on the “Machine Learning Professional Series: Online Machine Learning”.
In this article, I present my reading notes.
Machine Learning Professional Series “Online Machine Learning” Reading Memo
Online learning is one of the learning frameworks in machine learning. In the online learning framework, the model is improved sequentially, each time one data point (or a part of all the data) is given, using only the given data rather than all the data at once. Because of this processing style, it can be used to analyze data at a scale where all the data cannot be stored in memory or cache, and to learn in environments where data is generated continuously. As advances in infrastructure and sensor/network technologies have led to larger-scale data, and continuous operation of services using machine learning has become common, research and development in the field of online learning has accelerated. In papers in the machine learning field, this framework is often referred to simply as online learning. However, the term “online learning” in general usage also refers to systems that let users access educational content such as university lectures on the Web. To distinguish the machine learning term from the educational one, the former is sometimes called online machine learning.
Online learning is one framework of machine learning, and only one of many possible means. It is not a framework specific to any particular task or purpose. The effective way to apply it therefore differs from task to task, and there are many problem settings where frameworks other than online learning should be applied instead.
Chapter 1 Introduction
1.1 What is Machine Learning?
1.2 What is Online Learning?
Introduction
Learning sequentially from data that appears one after another
Learning the whole data at once
Batch learning
1.2.1 The Early Days of Online Learning
With the birth of artificial intelligence, online learning was born.
The Perceptron
1957
Frank Rosenblatt
Modeling vision and brain function
Unable to learn the problem of determining whether a shape is connected or not
Minsky’s refutation
1.2.2 Rediscovering Online Learning
Application to natural language processing
Method
Maximum entropy method
Hidden Markov Models
Support vector machines
Applications
Language models
Document Classification
Part-of-Speech Estimation
Syntax analysis
Machine translation
Support Vector Machines and Logistic Regression are one layer models
Deep learning is a “deep” model
1.3 Characteristics of Online Learning
1.3.1 Training data can be discarded quickly
1.3.2 Learning Speed is Fast
1.3.3 Learning results are always available
1.3.4 Easy to implement
1.3.5 Performance analysis is easy
1.4 Disadvantages of Online Learning
Learning results are greatly affected by the order of the data to be learned.
Affected by data bias
Tends to be more sensitive to noise than batch learning
Chapter 2 Preparation
2.1 Preparing to read a mathematical expression
2.2 Promises about Mathematical Expressions
2.2.1 Notation of numbers
It is important to consider the “type” of an expression
Integers, vectors of integers, real numbers, etc.
It is difficult to think about the “type” of a function
2.2.2 Vectors
Inner product of vectors
x1y1+x2y2+… +xnyn
2.2.3 Summing Symbols
Summing symbols
2.2.4 Minimum value, maximum value, argmin, argmax
Variables that minimize or maximize the value of each expression
2.2.5 Absolute value
The absolute value is the magnitude of a real number after removing the sign.
2.2.6 Norm
A generalized concept of distance
L1 norm
∥x∥1 = |x1| + |x2| + … + |xn|
∥w−v∥1 = |w1−v1| + |w2−v2| + … + |wn−vn|
L2 norm
∥x∥2 = √(x1² + x2² + … + xn²)
∥w−v∥2 = √((w1−v1)² + (w2−v2)² + … + (wn−vn)²)
2.2.7 Exponential function
The exponential function e^x, whose base is Napier’s constant e (2.718…), is written as exp(x).
2.2.8 Partial Differentiation and Gradient
Partial differentiation is the process of
Differentiating a multivariable function by assuming that all variables except the one of interest are constants.
Partial differentiation of a function f(x,y) by x
Example
f(x,y)=2x+xy+y
∂f(x,y)/∂x=2+y
∂f(x,y)/∂y=x+1
The gradient of a function is
The result of partial differentiation of a function by a variable, arranged as a vector
The gradient of a function f(x) is written as ∇f(x)
∇f(x)=(∂f(x)/∂x1, ∂f(x)/∂x2, … , ∂f(x)/∂xn)
2.2.9 Techniques Used for Partial Differentiation
Techniques for partial differentiation of complex functions
Differentiating a product of functions
If the function is a product of functions, f(x) = g(x)h(x)
∂f(x)/∂x = (∂g(x)/∂x) × h(x) + g(x) × (∂h(x)/∂x)
Differentiation of a composite function
If f(x) = h(g(x)), i.e., one function is the argument of another function
∂f(x)/∂x = (∂h(g)/∂g) × (∂g(x)/∂x)
Differentiation of exp and log
∂ exp(x)/∂x = exp(x)
∂ log(x)/∂x = 1/x
2.2.10 Computational complexity
O-notation
Description of computational complexity
O(n):The computational complexity increases linearly
O(x^n): the computational complexity increases exponentially
2.3 Convex Functions and Nonconvex Functions
Downwardly convex
For any x, y ∈ Rm and any t satisfying 0≤t≤1
f(tx+(1-t)y) ≤ tf(x) + (1-t)f(y)
where x is a locally optimal point of the function f
There exists some real number ε > 0
For any X’ ∈ Rm
We have ∥x-x’∥< ε ⇒ f(x) ≤ f(x’)
For any x’ sufficiently close to x, f(x) ≤ f(x’), i.e., x is a local minimum
x is a global optimal point of the function f
For any x’ ∈ Rm
f(x)≤f(x’) holds
There is no restriction by ε
Chapter 3 Basics
3.1 Binary Classification
Consider a function f(x) that returns a real number for an input x.
The output y of the classifier is represented by the above equation in the case of binary classification
Classification is based on whether the value of f(x) is greater than 0 or not
Depending on the sign of the argument, either +1 or -1 is obtained by using the sign function sign().
3.2 Linear Classifier
For f(x) above
Linear classifier
w ∈ Rm is a weight vector
b ∈ R is a bias term
It is often omitted for computational purposes.
If the parameters w and b can be adjusted to appropriate values, the binary classification problem is solved
It is better to treat bias b as special and optimize it separately.
In the software Sgd, the learning rate is set to 1/100 only when updating the bias term
Passive-Aggressive algorithm with bias
3.3 Perceptron
The most basic online learning algorithm
Algorithm in Perceptron
Let the training data given at the t-th step be the pair (x(t), y(t))
Let w(t+1) be the parameter after receiving the data t times
Calculate the inner product of the weight vector w and the input x (wTx)
The change in wTx before and after the parameter update is
Before and after updating the parameters, the value of the inner product changes by y(t)x(t)Tx(t)
The update rule is later derived as a (gradient) descent method
3.4 Objective Function and Optimization Method
Introduction
Objective function
A function that expresses the objective to be achieved
Optimization method
A method to achieve the objective
Even if we don’t know the objective function well, we can achieve good performance if we have a learning algorithm based on some idea.
Stochastic gradient methods
FOBOS, RDA
3.4.1 Notation of the objective function
The objective function L(w) characterized by the parameter w
Loss term
Regularization term
The specific form of ℓ(x,y,w) and r(w) is different for each learning algorithm
3.4.2 Online Learning and Convergence
Learning a classifier is equivalent to adjusting the values of the parameters to minimize the objective function.
Incremental Gradient Method
When does it converge?
3.5 Stochastic Gradient Descent Method
Introduction.
The most basic optimization method for online learning
Stochastic gradient descent
3.5.1 What is the gradient method?
A method for minimizing the value of the objective function L(w) with parameter w as argument.
If you want to maximize it, you can turn it into a minimization problem by multiplying by −1.
If the objective function is convex, keep moving in the direction in which the function decreases.
Key Points in the Gradient Method
How much to move in which direction at which time
Is there a guarantee that we will reach the minimum value?
Whether the objective function is convex or not is important.
In the case of a convex function, the key is how quickly you can reach the optimum.
In the case of a non-convex function, the key is how good the solution is.
3.5.2 Gradient Descent Method
Consider the objective function L(w) for some linear classifier.
We assume that the objective function can be decomposed into the above equation
The specific formula for ℓ(x,y,w) depends on the algorithm
e.g. 0 for successful classification, value of the distance to the separating hyperplane if wrong
In the gradient descent method, the parameters are updated using the above formula until the objective function L(w) converges
η(t) is the learning rate, a suitable value greater than or equal to zero.
If |L(w(t+1)) − L(w(t))| < ε, convergence is considered to have been achieved.
3.5.3 Stochastic Gradient Descent Method
In the usual gradient method, it is necessary to recompute the gradient using all the data for one parameter update.
Computational cost is high
Each time one data point is received, the gradient is calculated using only that data point, and the parameters are updated.
Example algorithm for stochastic gradient descent
w(t+1) = w(t) – η(t)∇ℓ(x(t),y(t),w(t))
Instead of L(w), we use ℓ(x,y,w) as an approximation
3.5.4 Objective Function of the Perceptron
Objective Function of the Perceptron
Perceptron Criterion
3.5.5 Deriving the Learning Algorithm for the Perceptron
ℓ(x,y,w)=max(-ywTx,0)
x and y are given and cannot be moved during training
The only variable that can be moved is w
Partial differentiation on each element of w
Put the elements together to get the gradient
Let’s consider different cases
-ywTx>0
The value of the equation can be written as -ywTx without max
Partial differentiation with respect to wk
-ywTx<0
The function is identically zero in this region, so differentiating with respect to each wk gives zero, and the gradient is zero
If -ywTx = 0
You can’t differentiate at a point that isn’t smooth.
Cannot determine the value of the gradient
Introduction of the concept of subgradient
Subgradient can be used instead of gradient
When -ywTx = 0, we can use -yx as the subgradient
The (sub)gradient of the perceptron’s objective function is the above equation
The update equation is the above equation
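As a concrete illustration of this derivation, here is a minimal NumPy sketch of the perceptron update w ← w + y x, applied only when the example is misclassified (i.e., when −y wᵀx ≥ 0, where −yx is the subgradient). The toy data and variable names are my own, not from the book.

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Online perceptron: update w <- w + y*x whenever y*w^T*x <= 0."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if y_t * w.dot(x_t) <= 0:      # misclassified: subgradient of the loss is -y*x
                w += y_t * x_t             # w(t+1) = w(t) + y(t) x(t)
    return w

# toy linearly separable data (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron_train(X, y)
print(w, np.sign(X.dot(w)))
```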
3.6 Support Vector Machines
3.6.1 What is linearly separable?
When all data can be correctly separated by drawing a separating hyperplane between the data at +1 and -1
The data is said to be linearly separable
Linearly separable when there exists w,b satisfying y(n)(wTx(n) + b) ≥ 0
When linear separation is possible and when it is not
3.6.2 Derivation of Hard Margin SVM
When the plane separating the +1 and -1 classes is drawn, the vector in each class closest to the plane is called the “support vector”.
The distance from the support vector to the hyperplane is called the margin.
In the separating hyperplane, f(x)=y(wTx+b)=0
The distance between the support vector x* and the separating hyperplane f(x)=0 is
Transformation of the distance from the support vector to the hyperplane (margin)
Since the distance from a sample other than the support vector to the separating hyperplane is always greater than the margin
Constraint
Maximizing the margin under this constraint is the hard margin SVM
Since squaring the margin does not change the result
The goal is to minimize the squared L2 norm ∥w∥2²
3.6.3 Derivation of Soft Margin SVM
Instead of a constraint, we give a penalty 𝛏(t) in case of classification failure
Incorporate the penalty into the objective function
C is a suitable positive constant
All 𝛏(t) satisfy the constraints of the above equation
Eliminate 𝜉 from the objective function
𝜉(t) = max(1-y(t)wTx(t), 0)
The first term is the loss term
Hinge loss
Second term is the regularization term
L2 regularization
3.6.4 Interpreting the SVM objective function
What does the objective function mean?
The loss term is zero when y(t)wTx(t)≥1, otherwise it is greater than zero
If y(t)wTx(t)>0, classification is successful, but the loss may not be zero.
The regularization term is there to prevent overfitting.
3.6.5 Learning SVM by Stochastic Gradient Descent
In SVMs, an objective function is given, and it is up to the implementer to decide how to achieve the objective.
SVMs also include max in the objective function, so it is necessary to separate cases when calculating the gradient
Since it is not differentiable at the point where the condition 1-y(t)wTx(t)=0 is satisfied, we introduce a subgradient
Calculating the gradient of the loss term
Case separation
ℓ(w)>0
Since the value of the equation is greater than zero, the gradient of 1-ywTx with respect to w is used
Expression for the gradient of the loss term
Expression for the gradient of the regularization term
If we expand C∥w∥2² as a formula
C∥w∥2² = C(w1² + w2² + … + wk² + …)
If we take the partial derivative with respect to wk, we can ignore the terms that do not contain wk
Expression for the gradient of the regularization term
Algorithm for SVM by stochastic gradient descent
If part is the gradient of the loss term
The last part to find w(t+1) is the calculation for the regularization term
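A minimal sketch of one stochastic-gradient step for the soft-margin SVM objective (hinge loss plus C∥w∥²), separating the two cases as described above. The constants, learning rate, and toy data are my own illustrative choices, not the book's.

```python
import numpy as np

def svm_sgd_step(w, x, y, eta, C):
    """One stochastic (sub)gradient step on hinge loss + C*||w||^2."""
    if 1.0 - y * w.dot(x) > 0:             # margin violated: subgradient of the loss is -y*x
        grad = -y * x + 2.0 * C * w
    else:                                   # loss term is 0, only the regularizer contributes
        grad = 2.0 * C * w
    return w - eta * grad

w = np.zeros(2)
data = [(np.array([1.0, 1.0]), +1), (np.array([-1.0, -0.5]), -1)]   # toy data
for t in range(100):
    for x, y in data:
        w = svm_sgd_step(w, x, y, eta=0.1, C=0.01)
print(w)
```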
3.7 Logistic Regression
Introduction
Logistic regression
A very popular algorithm along with SVM.
Characteristics
Output values satisfy probability constraints
Output values are non-negative
The sum of the outputs for two classes equals 1
Consider the probability P(y|x) that the value of output y is +1 or -1 for input x
P(y=+1|x)=1/(1+exp(-wTx))
P(y=-1|x) = 1 − P(y=+1|x) = 1 − 1/(1+exp(-wTx))
When the value of wTx is large (high confidence), P(y=+1|x) is close to 1.
When the value of wTx is small (low confidence), P(y=+1|x) will be close to 0.
Expression of P using the standard sigmoid function
P(y=+1|x)=σ(wTx)
P(y=-1|x)=1-σ(wTx)
Integrated expression
Can be used for both y=+1 and -1
Objective function of logistic regression
Likelihood or likelihood function
Take logarithm to make the calculation more convenient
Because it is logarithmic, it is the sum of the logarithms of the standard sigmoid functions.
Since the log function is monotonically increasing, maximizing the likelihood and maximizing the log-likelihood give the same result.
3.7.1 Parameter estimation in logistic regression
In the case of parameter estimation, the numerical results may be unstable.
The objective function with L1 or L2 regularization is often used.
The objective function is the log-likelihood multiplied by −1 (to turn maximization into minimization) plus an L2 regularization term
Optimization by stochastic gradient method
Define the function ℓ(w) as in the above equation
Partially differentiate ℓ(w) with respect to wk, the k-th dimension of w
The partial derivative of the regularization term is 2Cwk
Gradient at each data point
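A minimal sketch of one stochastic-gradient step for L2-regularized logistic regression with y ∈ {+1, −1}, using the per-example gradient −y x σ(−y wᵀx) + 2Cw. Learning rate, C, and the toy points are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_sgd_step(w, x, y, eta, C):
    """One SGD step on -log P(y|x) + C*||w||^2, with y in {+1, -1}."""
    # d/dw [-log sigmoid(y w^T x)] = -y * x * sigmoid(-y w^T x)
    grad = -y * x * sigmoid(-y * w.dot(x)) + 2.0 * C * w
    return w - eta * grad

w = np.zeros(2)
for t in range(200):
    w = logreg_sgd_step(w, np.array([1.0, 2.0]), +1, eta=0.1, C=0.01)
    w = logreg_sgd_step(w, np.array([-1.5, -1.0]), -1, eta=0.1, C=0.01)
print(w, sigmoid(w.dot(np.array([1.0, 2.0]))))   # P(y=+1|x) for one training point
```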
3.7.2 Interpretation of logistic regression
In SVM, the hinge loss max(1-y(t)wTx(t),0)
Logistic regression loss
Hinge loss vs. logistic loss
In SVM, the loss is exactly zero once y(t)wTx(t) is sufficiently large; in logistic regression, the loss gets very close to zero but never becomes exactly zero
In SVM, the loss is non-differentiable at y(t)wTx(t)=1; in logistic regression, there is no non-differentiable point
3.8 Effects of Regularization
3.8.1 Preventing Overfitting
Why does minimizing the sum of the loss term and the regularization term as the objective function prevent overfitting?
Overfitting is a state of low generalization performance
It occurs when there are too many parameters relative to the training data, i.e., when there is not enough training data or too many parameters
Suppose we have a coin, with y=+1 corresponding to heads and y=-1 corresponding to tails
If we toss the coin once and it comes up heads, an over-fitted model concludes that the coin always lands heads
When a parameter value tries to diverge to infinity, the value of the regularization term also goes to infinity, so the point that minimizes the objective function is pulled back toward the origin
3.8.2 Regularization to create a sparse solution
The effect of regularization is not only to prevent overfitting.
If we use a regularization term such as the L1 norm, most of the parameters can be set to zero with little performance loss.
Sparse solutions
Form of the equation of the objective function with L1 norm as the regularization
If multidimensional, some dimensions of the optimal solution will be zero
If the loss function is convex, the origin is the optimal solution and the solution is zero
Equation of the objective function with the L2 norm as regularization
No sparse solution can be obtained
3.8.3 Regularization and Prior Distribution
When the loss function is L and the regularization term is r, consider the problem of minimizing the objective function
3.9 Methods for Evaluating the Performance of Binary Classifiers
Terms commonly used in binary classification
Typical metrics to evaluate the performance of a classifier
Accuracy (correct answer rate)
How to calculate accuracy
Example
Processing 380 cases of data
Succeeded in classifying 350 cases of data
The accuracy is 350/380 × 100 ≈ 92.1%
Cases where performance cannot be measured by accuracy alone
Misclassifying spam as non-spam
Misclassifying non-spam as spam
Accuracy also overlooks cases where the number of examples in the two classes is heavily imbalanced
Four Patterns of True/False
True positives
False positive
True negative
False negative
Precision
Formula for precision
Of the items classified as spam, those that were really spam
Recall
Formula for recall
The denominator is the number of true spam items, and the numerator is the number of spam items that were classified as spam
F-value
Formula for F-value
Precision-recall curve
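A minimal sketch that computes accuracy, precision, recall, and F-value from the four true/false counts defined above. The spam-filter counts are illustrative values, not taken from the book.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-value from the four true/false counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_value

# illustrative spam-filter counts (not from the book)
print(binary_metrics(tp=90, fp=10, tn=250, fn=30))
```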
3.10 Summary of Binary Classification
Objective Functions of Binary Classification Algorithms
3.11 Multiclass Classification
3.11.1 What is the concept?
3.11.2 One-versus-rest method
One versus rest
The most natural extension of the binary classifier
Classification from input x ∈ Rm to output y ∈ {1, 2, …, K}
Prepare a weight vector wy ∈ Rm for each output
Define the score for each output y as S(x,y;w)=wyTx
The classification result for input x is the output with the largest score, y* = argmax_y S(x,y;w)
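A minimal sketch of one-versus-rest prediction: one weight vector per class and an argmax over the scores S(x,y;w) = wyᵀx. The weight matrix and input here are toy values of my own.

```python
import numpy as np

def ovr_predict(W, x):
    """One-versus-rest: score S(x,y;w) = w_y^T x for each class y, return the argmax."""
    scores = W.dot(x)                        # W holds one weight vector per class (K x m)
    return int(np.argmax(scores)) + 1        # classes numbered 1..K

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # 3 classes, 2-dim input (toy values)
print(ovr_predict(W, np.array([0.2, 0.9])))             # -> class 2
```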
3.11.3 One versus one method
One versus one
A round-robin (all-pairs) approach
Prepare a classifier for every combination
Vote on all classifiers during classification
Select the one with the highest number of votes
3.11.4 Error Correcting Output Code Method
A method for determining the final classification class based on the output results of a predetermined number of binary classifiers.
Example: 4 classes and 3 classifiers
The class with the closest Hamming distance is used as the classification result.
ECOC Algorithm
3.11.5 Multi-class SVM
Crammer et al. have shown that SVMs for binary classification can be naturally extended to multiclass classification.
Objective function of multiclass SVM
The objective function of the multiclass SVM is given, with ξ(t) defined over all ȳ ∈ {1, 2, …, K}, as in the above equation
δy(t),ȳ is 1 if ȳ = y(t) and 0 otherwise
Φ is a function called the feature function or characteristic function that is used to shift x to a position corresponding to y.
Example
Input x is 2-dimensional and output y is 1,2,3.
If x=(1,2) and y=2, then Φ(x,y)=(0,0,1,2,0,0)
Put x in the part for the position of y and set the rest to 0
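A minimal sketch of this feature function, reproducing the worked example above (x=(1,2), y=2 with K=3 classes gives (0,0,1,2,0,0)); the function and variable names are mine.

```python
import numpy as np

def phi(x, y, num_classes):
    """Feature function: place x in the block corresponding to class y, zeros elsewhere."""
    m = len(x)
    out = np.zeros(m * num_classes)
    out[(y - 1) * m:(y - 1) * m + m] = x
    return out

print(phi(np.array([1.0, 2.0]), y=2, num_classes=3))   # -> [0. 0. 1. 2. 0. 0.]
```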
Derivation of the unconstrained objective function
Expression of the objective function
Online Learning of Multi-class SVM
Perform optimization approach using stochastic gradient method
Expression of gradient
3.11.6 Log-Linear Model
A natural extension of logistic regression to multiclass classification
Maximization of the likelihood function is the objective
Logistic regression defines probabilities for two ways of output y=1 and y=-1
The log-linear model considers K possible values for y
To define the probability for each of the K possible values of y
Prepare a parameter for each class
The parameter for class c is written as wc.
Equation of true probability in log-linear model
The objective of the log-linear model is to maximize the log likelihood using this probability.
Objective function with a regularization term
3.11.7 Learning the Log-Linear Model
Perform an optimization approach using the stochastic gradient method
Equation for the result of partial differentiation with respect to wk
The gradient is obtained by arranging these partial derivatives into a vector
3.12 Examples of Applications to Natural Language Processing
3.12.1 Document Classification
Supervised Classification Problem
Characterizing Document Data
Bag of words (BOW)
A set of words
Does not take word order into account
Not all problems can be solved by BOW
Needs to be adapted to the task
Eliminate stop words
Remove words that occur too frequently to contribute to classification
Normalize word form changes (inflections)
Map inflected word forms back to a single base form
Stemming
Use adjacent words
Use neighboring words as features
N adjacent words
Word N-gram
Risk of overfitting increases due to the increase in the number of dimensions
Do not use N-grams with too few occurrences
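A small sketch of BOW feature extraction with stop-word removal and optional word bigrams, as outlined above. The stop-word list, tokenization by whitespace, and the example sentence are simplifying assumptions of mine.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of"}          # illustrative stop-word list

def bow_features(text, use_bigrams=True):
    """Bag-of-words features with stop-word removal and optional word bigrams."""
    words = [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]
    feats = Counter(words)                      # unigram counts
    if use_bigrams:
        feats.update(" ".join(pair) for pair in zip(words, words[1:]))
    return feats

print(bow_features("The cat sat on the mat"))
```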
3.12.2 Word Segmentation
Estimating complex structures using a classifier
Example
Word segmentation
Task that returns a sequence of words when a sentence is given a string
Find the boundaries of words
Given a string of length n, predict an array of boundary labels {t1, …, tn−1}
Each label is binary, ti ∈ {0,1}
ti = 1 if there is a word boundary between characters xi and xi+1, 0 otherwise
SVM or logistic regression can be used to analyze with sufficient accuracy
There is also a way to solve this using dictionary information and grammar
Some methods estimate simultaneous probabilities instead of solving for each variable independently
Chapter 4 Development
4.1 Online Learning with High Accuracy
Introduction.
The learning problem of estimating a binary output y ∈ {-1,+1} from an m-dimensional input x ∈ Rm.
Among binary classifiers, we consider linear classifiers.
characterized by a model parameter w ∈ Rm of the same m dimensions as the input.
Given an input x ∈ Rm
Estimate by sign(wTx)
Take a weighted majority vote for each input and set the output to +1 if the result is positive and -1 if the result is negative
The goal of learning is to obtain a weight vector that estimates the correct output.
Common algorithms for online learning
α,E are real numbers satisfying α>0 and E≥0
A ∈ Rm×m is a positive semi-definite matrix
That is, xTAx ≥ 0 holds for any x ∈ Rm
If the number of input dimensions is large, a diagonal matrix is used as an approximation instead of a full positive semi-definite matrix
In the case of z=Ax
Let ak be the kth diagonal component of A.
zk = ak xk is obtained
Meaning of the algorithm
In the first line, initialize the weights to 0
No matter what input comes in, the score wTx is zero.
No training has been done.
In the third line, we check if the current classifier correctly estimates the given training data.
If the sign of the correct output y matches that of the estimated result wTx, it is correct.
ywTx is positive if the classification is correct, and negative if it is wrong
If |ywTx| is close to 0, the classifier is not confident.
The smaller (more negative) ywTx is, the more severely the classifier is mistaken.
We need to change the weights significantly
In the fourth line (similar to gradient descent), α is the update width common to all parameters, and A is the update width for each feature
The meaning of the update formula
The weight at the t-th time is w(t), and the training data is (x,y).
Suppose the update is done and w(t+1) is obtained
yw(t+1)Tx = y(w(t) + yαAx)Tx = yw(t)Tx + y²α(Ax)Tx = yw(t)Tx + αxTAx ≥ yw(t)Tx (since y² = 1 and xTAx ≥ 0)
yW(t+1)Tx ≥ yW(t)Tx
The value becomes larger after the update
so the degree of error becomes smaller
4.1.1 Perceptron
When the current classifier estimates differently from the training data, it updates the weight vector as shown in the above equation.
In the general algorithm
One with E=0
One that sets α=1
No matter how wrong it is, it always updates with a constant update width.
A=I
I: Unit matrix
Instead of using a different update width for each feature, update with the same update matrix
Simple, but effective for many problems
Finds a correct solution after a finite number of updates (if the data is linearly separable)
Improved version
Use a perceptron that averages all weights
Higher accuracy than the simple one
Use average of weights of all steps instead of last weight for classification
4.1.2 Passive-Aggressive (PA)
Uses a hinge loss function
The hinge loss is the loss function most commonly used in SVMs
Update formula for weight vector
Find the weight that is closest in Euclidean distance to the previous weight among the weights such that the hinge loss is zero.
Intuitive explanation
Hinge loss is 0
Find the weights that can correctly classify the current training data with a sufficient margin
There are many weights that satisfy the above
Find the weights that can classify the previous training data correctly as well.
Consider each case separately
If ℓhinge(x(t), y(t), w(t)) > 0
To find a solution to the constrained optimization problem
Use the method of Lagrange multipliers
How to find the extreme values of a function with constraints
Under the constraint g
The problem of finding the extremum of a function f can be reduced to the problem of finding the extremum of f − τg
Formula for finding the extremes of an optimization problem
Condition obtained by taking the partial derivative with respect to w and setting it to zero
Substitute the result of the partial differentiation into the above equation
The condition obtained by taking the partial derivative with respect to τ and setting it to zero is
Final update formula
If we apply the general algorithm
E=1
α is the above equation
A=I
The perceptron updates only when it makes a mistake, and the update width is always constant
PA, on the other hand, updates even when it classifies correctly but is unsure (the margin is insufficient), and changes its update width depending on the degree of error.
The update width is normalized by the norm of the input
When the training data contains noise
The update forces the hinge loss on the current training example to exactly zero
so an update may be applied that effectively forgets everything learned from previous data
Constraints need to be relaxed
PA-1 update rule
When the penalty for breaking the constraint is considered linearly
PA-2 update rule
When the penalty for breaking the rule is considered to be squared
Final PA-1 and PA-2 update equations
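A minimal sketch of the PA update as I understand it from Crammer et al.: the update width τ is min(C, loss/∥x∥²) for PA-1 and loss/(∥x∥² + 1/(2C)) for PA-2, and the weight moves by τ·y·x. The constant C and the toy example are illustrative.

```python
import numpy as np

def pa_update(w, x, y, C=1.0, variant="PA-1"):
    """Passive-Aggressive update with hinge loss; variant selects PA-1 or PA-2."""
    loss = max(0.0, 1.0 - y * w.dot(x))
    if loss == 0.0:
        return w                                # passive: constraint already satisfied
    sq_norm = x.dot(x)
    if variant == "PA-1":
        tau = min(C, loss / sq_norm)            # linear penalty for violating the constraint
    else:                                       # "PA-2": squared penalty
        tau = loss / (sq_norm + 1.0 / (2.0 * C))
    return w + tau * y * x                      # aggressive: move just enough toward zero loss

w = np.zeros(2)
w = pa_update(w, np.array([1.0, 2.0]), +1)
print(w)
```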
4.1.3 Confidence Weighted Learning (CW)
Considering that the weights w are generated according to a normal distribution
N(µ,Σ) is a normal distribution with mean µ ∈ Rm and covariance matrix Σ ∈ Rm×m.
The most trusted weight at the moment is µ, and the confidence of each weight is held by ∑.
If the value of ∑ii is large, the variance of that weight is large
Expresses a lack of confidence in the weight
Find distributions that are close to the previous distributions
Use Kullback-Leibler divergence as a measure of closeness between distributions
Replace the condition that the training data is never misclassified with the condition that it is classified correctly with probability at least η
Formulate the problem of finding the distribution
PW~N(μ,∑) is the probability that w is generated by a normal distribution N(μ,∑)
This optimization can be expressed in a closed form as above
The update width for each feature can be changed during the update.
When the confidence level is high, the update width is small.
When the confidence level is low, the update width can be increased.
High accuracy, but computational complexity increases
4.1.4 Adaptive Regularization of Weight Vectors (AROW)
This method overcomes the problem that CW is sensitive to noise.
AROW takes the following three conditions into account simultaneously during training and optimizes them
First
Classify the current training data correctly
Second
Find a distribution close to the existing distribution
Third
Increase the confidence of each feature with each update
Update rule
4.1.5 Soft Confidence-Weighted Learning (SCW)
A method combining the features of CW and AROW
Loosens the constraints as in PA
SCW-1
SCW-2
4.1.6 Summary
4.2 Online Distributed Parallel Learning
4.2.1 Parallelization of Stochastic Gradient Descent Method
Review the formulation of the stochastic gradient descent method.
We define the parameters we want to learn as w
In particular, we denote the value at step t as w(t).
At each step, the training data is sampled independently according to some distribution.
We can think of the loss function as sampling from some distribution
The expected value of the loss function ℓ(t)(w) derived from it matches the objective function L(w)
Find the parameter w that minimizes L(w)
Iterative formula
Compared with the gradient descent method, which uses the gradient of the full objective function directly, the computational cost per iteration is small.
However, because the sampled loss function can deviate substantially from its expectation (the objective function), a single iteration may move in a direction that deviates from the gradient of the objective function,
so many iterations are needed and the total computational cost can still be high.
If we can handle a loss function with a small deviation from the objective function while keeping the computational cost per iteration low, we can perform efficient calculations.
Minibatch stochastic gradient descent
X(t) is a set of loss functions corresponding to B training data
At each iteration t, the mini-batch X(t) is assumed to be sampled from an independent distribution
Compute the gradient for each sample in the mini-batch in parallel
Parallelized mini-batch stochastic gradient descent method
Algorithm
Synchronize the entire process each time each process performs an update
After the t-th update of each process is completed and the result is shared among all processes, the (t+1)-th update is performed.
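A toy sketch of a synchronized mini-batch step: the per-example gradients are computed in parallel and averaged before one update. Using a thread pool and a hinge-loss subgradient here is my own illustrative choice, not the book's implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def minibatch_sgd_step(w, batch, grad_fn, eta, pool):
    """One synchronized mini-batch step: gradients computed in parallel, then averaged."""
    grads = list(pool.map(lambda ex: grad_fn(w, *ex), batch))
    return w - eta * np.mean(grads, axis=0)

def hinge_grad(w, x, y):
    """Hinge-loss subgradient for one example (illustrative loss choice)."""
    return -y * x if 1.0 - y * w.dot(x) > 0 else np.zeros_like(w)

w = np.zeros(2)
batch = [(np.array([1.0, 1.0]), +1), (np.array([-1.0, -0.5]), -1)]
with ThreadPoolExecutor(max_workers=2) as pool:
    for t in range(50):
        w = minibatch_sgd_step(w, batch, hinge_grad, eta=0.1, pool=pool)
print(w)
```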
4.2.2 ParallelSGD
How to reduce the number of synchronizations between BSP nodes
Do not synchronize until training is complete
Divide the training data evenly into p pieces and assign them to each of the p processes.
Compute each until the end and average the last p parameter vectors.
Algorithm
If the loss function is convex and its gradient is Lipschitz continuous, the ParallelSGD algorithm converges to the optimal solution.
Since the training data for each process is chosen randomly, the parameters resulting from each process's training can be regarded as random variables.
The expected values of these parameters are consistent with the optimal solution.
The variance as an estimator of the optimal solution will remain large
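A toy sketch of the ParallelSGD idea: each worker trains independently on its own shard, and only the final parameter vectors are averaged. The sequential loop over shards stands in for real parallel execution, and the hinge-loss update is an illustrative assumption.

```python
import numpy as np

def sgd_on_shard(w, shard, eta=0.1):
    """Independent training on one shard (hinge-loss subgradient steps, illustrative)."""
    for x, y in shard:
        if 1.0 - y * w.dot(x) > 0:
            w = w + eta * y * x
    return w

def parallel_sgd(shards, dim):
    """ParallelSGD sketch: train on each shard separately, then average the results."""
    results = [sgd_on_shard(np.zeros(dim), shard) for shard in shards]   # would run in parallel
    return np.mean(results, axis=0)

shards = [[(np.array([1.0, 2.0]), +1)], [(np.array([-2.0, -1.0]), -1)]]
print(parallel_sgd(shards, dim=2))
```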
4.2.3 Iterative Parameter Mixture (IPM)
Synchronize parameters between processes after several parameter updates
IPM
Distribute the dataset across the processes in disjoint parts, and each process starts training with the perceptron
Synchronize each time the processes go through the dataset
Replace the parameters held by each process with the average
Algorithm
4.2.4 Stale Synchronous Parallel (SSP)
Stale Synchronous Parallel (SSP)
A framework for parallel computing that generalizes BSP.
An algorithm can be parallelized with SSP if its parameter update can be expressed as adding some vector to the current parameters.
Algorithm
4.3 Online Learning Used in Deep Learning
Efficient Learning Methods for Complex Nonlinear Models
4.3.1 Feed-forward neural networks
Feed-forward neural network
A multilayer computational procedure consisting of an input layer, one or more hidden layers, and an output layer.
Each layer consists of a number of units, and each unit basically has an activation function applied to its input as its activation.
In the input layer, each dimension value of the input case x is directly used as the activation.
In the output layer, special functions are used according to the purpose.
In a D-stage feed-forward neural network
For i ∈ {1, …, D}, let hi be the activation vector of the i-th layer
Let the activation function be fi
hi is expressed as above using the weight matrix Wi and bias vector bi of the i-th layer.
The activation function fi is applied to each component.
If the identity function id(x)=x is used as the activation function in the hidden layers, the whole network reduces to a linear model
Using a nonlinear function increases the expressive power
Typical activation function
With a sigmoidal function, the gradient value becomes closer to 0 as learning progresses, and convergence to the local solution becomes slower
Piecewise linear activation functions are used
A simple hinge function f(x) = max(0, x) (ReLU)
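A minimal sketch of the forward pass hi = fi(Wi h(i−1) + bi) with a ReLU hidden layer and an identity output layer. The layer sizes and random weights are purely illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                 # piecewise-linear activation f(x) = max(0, x)

def forward(x, layers):
    """Feed-forward pass: h_i = f_i(W_i h_{i-1} + b_i) for each layer."""
    h = x
    for W, b, f in layers:
        h = f(W.dot(h) + b)
    return h

# toy 2-3-1 network with randomly chosen weights (illustrative only)
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 2)), np.zeros(3), relu),
          (rng.normal(size=(1, 3)), np.zeros(1), lambda z: z)]   # identity output layer
print(forward(np.array([1.0, -0.5]), layers))
```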
4.3.2 Gradient Calculation by Error Back Propagation Method
To train with stochastic gradient descent, we need to calculate the gradient for the activation.
The parameters to be optimized are the weight matrix Wi for each layer and the bias vector bi
The goal is to find the partial differential coefficient of the loss function ℓ
Equation
Let hi-1,j be the activity of the unit on the input side of the edge
The partial differential coefficient of the loss function with respect to the activity of the unit on the output side is ∆hik
Partial differential coefficient of the loss function with respect to the weight Wijk of the edge
Delta rule
The partial differential coefficient
is the product of the activation of the input-side unit, the partial differential coefficient propagated from the output-side unit, and the derivative of the activation function
Calculate the partial differential coefficient ∆hik with respect to the activation of each hidden-layer unit
Can be calculated by propagating the neural net in the opposite direction
Must be calculated for each layer in turn
Use stochastic gradient descent or other approaches that require fewer gradient calculations for training data
Other approaches
Hessian-Free method
Efficient method based on second-order differentiation
4.3.3 Mini-batch stochastic gradient descent method
Mini-batch is often used in neural networks
Rationale
Faster convergence in the neighborhood of the local solution
Stochastic gradient descent is efficient in the early stages of optimization, but once it gets close to a local solution, the stochastic fluctuation of the loss term gets in the way and convergence becomes slow.
By using mini-batches, stochastic fluctuations of the loss term are suppressed, and the local solution can be reached quickly.
Suitable for speed-up by parallel computing
The error back propagation method for neural networks can be basically computed using only matrix operations and local nonlinear operations.
It is easy to accelerate by using GPU.
In general, batch sizes of tens to hundreds are used.
4.3.4 Momentum Method
To reduce the variance in the update direction
How to reduce the change in gradient
Take an average of the previous update direction and the gradient at the current parameters as the new update direction
Momentum method
Commonly used for optimization of large neural networks
Instead of using the gradient of the loss function directly to update the parameters, a moving average of the gradients is used.
μ: Momentum coefficient
Indicates how much importance is given to the past update vector
When μ = 0, it is consistent with stochastic gradient descent method
When μ is close to 1, the update vectors are less likely to change
Visualizing Updates in the Momentum Method
Momentum method is easy to implement
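A minimal sketch of the momentum update, where the update vector is a moving average of past gradients; the toy quadratic objective and the values of η and μ are my own choices.

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.1, mu=0.9):
    """Momentum update: the update vector is a moving average of past gradients.
    mu = 0 reduces to plain SGD; mu close to 1 keeps the update direction stable."""
    v = mu * v - eta * grad
    return w + v, v

w, v = np.zeros(2), np.zeros(2)
for t in range(100):
    grad = 2.0 * (w - np.array([1.0, -2.0]))   # gradient of ||w - a||^2 (toy objective)
    w, v = momentum_step(w, v, grad)
print(w)   # approaches [1, -2]
```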
4.3.5 Accelerated Gradient Method
The momentum method can be thought of as a gradient method in which the update vector has inertia in its current direction
Update equation
Nesterov’s accelerated gradient (Nesterov’s method)
Slightly shift the position where the gradient is calculated
4.3.6 Parameter Instability
In order to optimize a deep neural network with stochastic gradient descent and the error backpropagation method, there are some issues to be resolved
If the derivatives of the activation functions and the weights of each unit are too small, the gradient propagated back toward the input layer becomes very small
and optimization does not proceed.
If the values are too large, numerical accuracy cannot be maintained.
Sensitive to initial values of parameters and learning rate
Hyperparameter optimization can be viewed as a model selection problem.
Meta-algorithms such as cross-validation and Bayesian optimization are commonly used
4.3.7 Learning Rate Decay Methods
When the objective function is a convex function, in order for the stochastic gradient descent method to converge to the optimal solution, the learning rate must be a sequence such that the sum diverges to infinity and the sum of squares converges.
In deep learning, we optimize a non-convex function such that there are many local solutions.
The goal is not to converge quickly to an arbitrary local solution, but to converge quickly to a good local solution.
In large-scale deep learning, manually adjust the decay factor of the learning rate so that it decays when the value of the objective function is sufficiently stable.
4.3.8 AdaGrad
How to adjust the learning rate automatically
Update equation for stochastic gradient descent method
A(t) ∈ ℝd×d is a nonsingular (regular) matrix used to transform the gradient
It is used in place of the learning rate
Newton’s method corresponds to using the Hessian matrix ∇2L of the objective function as A(t)
Automatic adjustment of the learning rate is achieved by A(t)
When training a neural network by stochastic gradient descent method
Depending on the training data, some parameters may not be used
Example
When using ReLU as activation function
If the input to a unit is negative, the activity of that unit will be zero
In this case, the parameters of the edge adjacent to this inactive unit will have a gradient of 0
Parameters are not updated.
Intuitively, for parameters that have been updated a lot, a smaller learning rate will lead to faster convergence to a local solution.
Scale each dimension of the gradient using that dimension's past gradient history.
Let A be a diagonal matrix as in the above equation
Let g(t) be the average gradient vector of the mini-batch at step t
η is a constant that defines the scale of the learning rate
AdaGrad algorithm
Look at the update history for each dimension, and shrink the update vector for dimensions that have moved significantly
Large movement means
Considered relative to the average gradient size of the dimension
The amount of update of dimension i by AdaGrad is expressed by the above equation
In practice, the manually adjusted momentum method is often more accurate.
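A minimal sketch of the AdaGrad idea described above: each dimension accumulates its squared gradients, and the per-dimension step size shrinks as that history grows. The base learning rate, the epsilon term, and the toy quadratic are my assumptions.

```python
import numpy as np

def adagrad_step(w, acc, grad, eta=0.5, eps=1e-8):
    """AdaGrad: the per-dimension learning rate shrinks for dimensions that have moved a lot."""
    acc = acc + grad ** 2                       # history of squared gradients per dimension
    w = w - eta * grad / (np.sqrt(acc) + eps)
    return w, acc

w, acc = np.zeros(2), np.zeros(2)
for t in range(200):
    grad = 2.0 * (w - np.array([3.0, -1.0]))    # toy quadratic objective
    w, acc = adagrad_step(w, acc, grad)
print(w)   # moves toward [3, -1]
```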
4.3.9 RMSprop
The overall shape of a nonconvex function cannot be predicted from local information such as the function value and the gradient
One approach to the non-convexity problem
Learning rate is not determined from all histories, but from the most recent history
Assuming that the function is smooth to some extent, the above approach is valid.
Algorithm to update parameters
RMSprop
RMS stands for root mean square
Methods using heuristics
There is another algorithm called “ADADELTA” which is very similar.
4.3.10 vSGD
Introduction
Efficiency is improved by placing hypotheses of good quality on the shape of the loss function around the local solution, rather than on the history of previous updates.
vSGD for non-mini-batch cases
vSGD in the mini-batch case
Chapter 5 Performance Analysis
5.1 Performance of Online Learning
By analyzing the performance of online learning, we can determine whether updating the parameters moves them in the right direction.
5.2 Learning Theorems for the Perceptron
Assumptions
Input is an m-dimensional vector x ∈ Rm
Let the parameter weights be w ∈ Rm
Let sign(wTx) be the output of the perceptron
sign(z) is a function such that it is 1 for z≥0 and -1 for z<0
The perceptron will initially initialize the weights to w(1)=0
The perceptron receives training data (x,y), x ∈ Rm, y ∈ {-1,+1}.
If the output of the training data differs from the result of the perceptron estimation (sign(wTx)≠y)
Update the weights w
Preparation for performance analysis
Definition of terms
Margin of training data
Radius of the training data
Perceptron Learning Theorem
The upper bound on the number of perceptron updates depends only on the radius of the training data and the size of the margin.
Does not depend on the number or dimension of training data
Even if the data is high-dimensional, if the data is sparse, the radius will be small and the number of perceptron updates will be small
5.3 Learning Theorems for the Perceptron when the Data is Not Linearly Separable
Perceptron can classify a lot of data even when the training data is not linearly separable
Training Data Penalty
Norm of Training Data Penalty
Perceptron Learning Theorem for Non Linearly Separable Data
Even when the training data is not linearly separable, the number of perceptron updates (number of wrong updates) depends only on the radius R and the penalty norm D
5.4 Regret Analysis
Introduction
Regret analysis
We measure the performance of an algorithm by measuring its regret, i.e., how much worse the algorithm did than if it had taken the optimal strategy.
Regret Analysis Problem Setup
Each time a player chooses an action θ ∈ K from a set of possible actions K
After that, a cost function f is presented and the cost f(θ) is determined for that
Example: Investing in stocks
The player chooses an action, how much to invest in which company
The cost function is the decrease in the price of each company’s stock on that day
If the decrease is -100 yen, it means that the stock price increased by 100 yen
The total cost is the sum of the decreases
The way a player decides what to do is called a strategy.
What kind of strategy can the player use to minimize the total cost ∑(t)f(t)(θ(t))?
Even if the cost function f(t) is not known, is it possible to adopt a strategy that minimizes the total cost?
Strategy that only takes the same action
Let θ* be the best fixed action (the single action minimizing the total cost in hindsight)
The regret of a strategy is defined as above
It measures how much worse the current strategy is compared to the optimal (fixed) strategy,
i.e., how much one regrets not having taken that strategy.
What does regret mean?
When the regret γ(A) is linear in the number of trials T
The per-trial gap from the optimal fixed strategy does not shrink no matter how much the number of trials is increased.
When the regret γ(A) grows more slowly than a linear function of T (i.e., sublinearly)
As the number of trials increases, the average gap from the optimal fixed strategy shrinks.
Regret analysis includes online learning as a special case.
The parameter θ(t) ∈ Rm of online learning is the player’s action
Action is not data (x,y), but a parameter
In online learning, the learner takes the action of selecting parameters
Let the loss function f(t)(θ) = ℓ(x(t), y(t), θ) be the cost function
Regret analysis is applied when the problem does not change much
Difference from fixed strategy
When the problem changes greatly, even the cost of the best fixed strategy becomes large
5.4.1 Follow The Leader (FTL)
Simplest strategy
The next action is the one that minimizes the sum of all previous costs
Define the t-th action θ(t) as above
Follow The Leader (FTL)
Examples of what can go wrong
The action is a point θ in the one-dimensional interval -1 ≤ θ ≤ 1
Let the t-th cost function be f(t)(θ) = (1/2)(-1)^t θ
The actions taken by the FTL are
First, θ(1)=0
Next, θ(2)=-1
θ(3)=1
It oscillates back and forth between -1 and 1.
The cost is always 1/2 except at the beginning
Optimal fixed value is θ=0, total cost is 0
Doesn’t come close to optimal fixed value
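A small sketch in the spirit of this example, showing how FTL oscillates on alternating linear costs. The exact sign convention and tie-breaking of the book's example are not reproduced; instead, a slightly smaller first coefficient (my own assumption) breaks the ties, which makes the oscillation explicit.

```python
# Costs f_t(theta) = c_t * theta on the interval [-1, 1], with alternating signs.
# FTL picks the theta minimizing the cumulative past cost, so it keeps jumping
# between the endpoints and pays roughly 1/2 per round, while theta = 0 costs 0.
coeffs = [0.25] + [0.5 * (-1) ** (t + 1) for t in range(2, 12)]   # c_1, c_2, ... (assumed)

def ftl_action(t):
    """FTL: choose theta in [-1, 1] minimizing the cumulative cost of rounds 1..t-1."""
    slope = sum(coeffs[:t - 1])
    return 0.0 if slope == 0 else (-1.0 if slope > 0 else 1.0)

actions = [ftl_action(t) for t in range(1, len(coeffs) + 1)]
total = sum(c * a for c, a in zip(coeffs, actions))
print(actions)   # oscillates between -1 and +1 after the first round
print(total)     # grows roughly like T/2, whereas the fixed action theta = 0 costs 0
```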
5.4.2 Regularized Follow The Leader (RFTL)
Improvement of FTL
FTL becomes problematic when the action oscillates
Add some constraint other than the cost function to stabilize the action
Define the action θ(t) for the tth time as in the above equation
R(θ) is a convex function
Regularization function
η ≥ 0 is a parameter that determines how strongly regularization is taken into account.
Regularized Follow The Leader (RFTL)
Regret of RFTL
Assumptions
Norm based on a positive semi-definite matrix A
Norm of the cost function
Diameter of the feasible region
Regret of RFTL
A framework that can dynamically respond to any cost function that comes along
5.4.3 Regret with a strongly convex loss function
If the loss function is strongly convex, we can use gradient descent to obtain an even smaller regret
The concept of strongly convex
A function f(x) is α-strongly convex if and only if
∇2f(x) ≥ αI
∇2f(x) is the Hessian matrix of f
A ≥ B means that the matrix A − B is positive semi-definite
Example:
The squared error function f(x) = ∥x−a∥2² is a 1-strongly convex function
Regret of the gradient descent method for a strongly convex loss function
Chapter 6 Implementation
6.1 Implementation of Vectors
Introduction
In the BOW of natural language processing, the dimensionality is very high (on the order of 100,000 to 1,000,000).
Only tens to hundreds of elements are non-zero
i.e., only about a hundredth to a thousandth of all dimensions
Sparse vectors
Dense vector
Most of the elements are non-zero
6.1.1 Internal representation of vectors
In the case of dense vectors
All dimension information is placed in a contiguous region of memory as a floating point array
In case of sparse vectors
Keep only non-zero elements in memory
Store only non-zero dimensions (integer values) and their value pairs
Amount of computation
6.1.2 Inner product calculation
Vector operations
Inner product
Sparse and Sparse
Calculate only non-zero elements in both
Compute inner product of sparse vectors
Dense and Dense
Simplest
Algorithm for calculating inner product
Just multiply each element by itself
Vectorization makes it more efficient by using the vector arithmetic instructions of the CPU
Several times faster
Vectorize at compile time to improve efficiency
Sparse and Dense
0 multiplied by anything is 0
Inner product calculation algorithm
Compute only with non-zero elements in sparse vector x
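A minimal sketch of these inner products using a Python dict for the sparse vector (index to value) and a list for the dense one; the data structures and values are my own illustrative choices.

```python
def dot_sparse_dense(sparse_x, dense_y):
    """Inner product of a sparse vector (index -> value dict) with a dense vector:
    only the non-zero dimensions of the sparse vector contribute."""
    return sum(v * dense_y[i] for i, v in sparse_x.items())

def dot_sparse_sparse(x, y):
    """Inner product of two sparse vectors: iterate over the smaller one."""
    if len(x) > len(y):
        x, y = y, x
    return sum(v * y[i] for i, v in x.items() if i in y)

x = {0: 1.0, 5: 2.0}                    # sparse: only non-zero dimensions are stored
w = [0.5, 0.0, 0.0, 0.0, 0.0, -1.0]     # dense: all dimensions stored contiguously
print(dot_sparse_dense(x, w), dot_sparse_sparse(x, {5: 3.0, 7: 1.0}))
```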
6.1.3 Adding Vectors
Performed element-wise for each dimension, as in the inner product calculation
Addition of dense vectors
Algorithm
Add sparse vectors to dense vectors
Use the fact that the result does not change no matter what is added to 0
Algorithm
Add sparse vectors to each other
Merge only non-zero elements of both vectors
An element that is non-zero in either vector appears in the result
6.2 Implementation of the Algorithm
6.2.1 Efficient implementation of the averaging method
Keep the past parameters and average them at the end
Deal with problems where parameter vectors are not stable
Averaging Perceptron
Algorithm
Averaging stochastic gradient descent (ASGD)
Simple averaging is computationally inefficient
Add up the differences in each update while applying weights to them
Formula
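A minimal sketch of the weighted-accumulator idea for the averaged perceptron (the commonly used trick of keeping w and a step-weighted accumulator u, and correcting at the end); the exact ASGD formula in the book differs, and boundary details are glossed over here.

```python
import numpy as np

def averaged_perceptron(data, epochs=5, dim=2):
    """Averaged perceptron without storing every intermediate weight vector:
    u accumulates c * (update), so the average is w - u / c at the end (up to boundary details)."""
    w = np.zeros(dim)
    u = np.zeros(dim)       # step-weighted accumulator of the updates
    c = 1                   # step counter
    for _ in range(epochs):
        for x, y in data:
            if y * w.dot(x) <= 0:
                w += y * x
                u += c * y * x
            c += 1
    return w - u / c        # average of w over all steps

data = [(np.array([1.0, 2.0]), +1), (np.array([-2.0, -1.0]), -1)]   # toy data
print(averaged_perceptron(data))
```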
6.2.2 Lazy update
When the input data is a sparse vector, the calculation can be made more efficient.
When the regularization term is applied, all weights have to be updated.
Support by lazy update
Used in natural language processing, etc.
Algorithm
6.3 Constraints on floating-point numbers
6.3.1 Mechanism of Floating-Point Numbers
Inside a computer, all information (including real values) is in binary form, 0 and 1.
Real numbers are represented by “floating-point numbers”
What is floating point?
One of the methods for expressing real numbers
In addition to the positive/negative sign, it is divided into two parts, the mantissa part and the exponent part.
Representation method of a real number x
s: Sign part (binary)
1 bit
0 or 1
f: Mantissa part (binary)
52 bits
1 or more but less than 2
e:Exponent part (binary)
11 bits
Signed integer value
B:radix
Fixed at 2
Specified in IEEE754
Called normalized number
Number other than normalized number
0
Infinite
Non-number (NaN)
A value that is not determined as a normal number
Example: the result of computing 0/0
Non-normalized number
A value that is so close to 0 that it cannot be expressed by the normalized exponent.
Computation with denormalized numbers can be nearly 100 times slower than with normalized numbers.
6.3.2 Handling huge real numbers
Points to be aware of when implementing
Overflow and underflow due to floating point decimal constraints
The exponential function that appears in logistic regression can easily overflow.
If the result is an overflow, it is treated as infinity, and thereafter it is infinite no matter what you add or subtract.
Underflow is treated as zero.
Workaround with logsumexp function
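A minimal sketch of the logsumexp trick: factoring out the maximum before exponentiating keeps the computation of log(Σ exp(zi)) from overflowing. The example values are my own.

```python
import numpy as np

def logsumexp(z):
    """log(sum(exp(z))) computed without overflow: factor out the maximum first."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([1000.0, 1001.0, 999.0])
print(logsumexp(z))    # finite, whereas np.log(np.sum(np.exp(z))) would overflow to inf
```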
6.3.3 Digit loss
When a program adds floating-point numbers of opposite sign (effectively a subtraction)
loss of significant digits (cancellation) may occur
If two floating-point numbers with very close values are subtracted, the result is not accurate.
Example: x=1.001001, y=1.000000
If the significant digits are 4 decimal digits, then x=1.001, y=1.000, and x−y becomes 0.001, so the information in the lower digits is lost.
√(x+e)-√x will lose digits.
Transforming it to e/(√(x+e) + √x) avoids the loss of digits
Used in the CW calculations
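A tiny numeric sketch of this transformation with values chosen by me to make the cancellation visible: the naive difference loses essentially all of its significant digits, while the algebraically equivalent form keeps them.

```python
import math

x, e = 1.0e8, 1.0e-8
naive = math.sqrt(x + e) - math.sqrt(x)            # catastrophic cancellation
stable = e / (math.sqrt(x + e) + math.sqrt(x))     # algebraically equal, no cancellation
print(naive, stable)   # naive loses (almost) all significant digits; stable is about 5e-13
```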
Appendix A
A.1 Derivation of Equation (3.35)
A.2 The method of Lagrange multipliers and the KKT conditions
A.3 Variables in CW
A.4 Variables in AROW
A.5 Variables of SCW
A.6 Regret analysis of the parallel stochastic gradient descent method using SSP