Summary
Stochastic optimization in machine learning refers to solving optimization problems using random samples. In an ordinary optimization problem, all of the training data must be used to minimize the objective function, which becomes very computationally expensive when the data set is large. Stochastic optimization instead performs each update on a randomly selected subset (mini-batch) of the data, greatly reducing the computational cost per update; moreover, because the samples are chosen at random, the optimization can also converge faster. This section describes various stochastic optimization methods based on the Machine Learning Professional Series volume "Stochastic Optimization".
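As a minimal sketch of this idea (my own illustration, not code from the book; the least-squares objective and all constants are assumptions), the following compares one full-gradient update, which touches all n samples, with one mini-batch update, which touches only 32:

```python
import numpy as np

# Sketch: one full-gradient step vs. one mini-batch stochastic step on
# the least-squares objective (1/2n) * ||X beta - y||^2.
rng = np.random.default_rng(0)
n, p, batch = 10_000, 20, 32
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

beta = np.zeros(p)
eta = 0.1

# Full gradient: touches all n samples per update.
grad_full = X.T @ (X @ beta - y) / n

# Mini-batch gradient: touches only `batch` samples; since the indices
# are drawn uniformly, it is an unbiased estimate of the full gradient.
idx = rng.choice(n, size=batch, replace=False)
grad_sgd = X[idx].T @ (X[idx] @ beta - y[idx]) / batch

beta -= eta * grad_sgd  # one cheap update
print(np.linalg.norm(grad_full), np.linalg.norm(grad_full - grad_sgd))
```

The mini-batch gradient is noisy but unbiased, which is why the method still makes progress in expectation at a fraction of the per-update cost.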
In this article, I present my reading notes.
Machine Learning Professional Series "Stochastic Optimization" Reading Notes
From the introduction to the Machine Learning Professional Series volume "Stochastic Optimization": "This book provides a detailed explanation, from a review of supervised learning and convex analysis to the parallel and distributed stochastic optimization useful for large-scale data analysis. Both basic techniques and new topics can be well understood from this book. It shows how the formulas are derived and what the algorithms mean, with as few omitted steps as possible in the proofs: a real fast track to performing large-scale computations."
Chapter 1: Supervised Learning and Regularization

1.1 Supervised Learning
- Introduction; feature extraction and learning
- 1.1.1 Regression
  - Loss functions for regression (figures)
- 1.1.2 Classification
  - Generally, an input is classified as 1 if f(x) is positive and as -1 if it is negative
  - Loss functions for binary classification (figures)
  - The 0-1 loss is the natural choice, but it is neither a continuous function nor a convex function, so it is difficult to optimize; it is common to use a "surrogate loss" that closely approximates the 0-1 loss
  - Surrogate losses other than the above (figure)
- 1.1.3 Overfitting

1.2 Regularized learning methods
- Methods for avoiding overfitting: regularization
- Instead of minimizing the error function as-is, add a "penalty" according to the model's "complexity"; what counts as complexity is determined by the structure of the data
- General form of a regularized learning method: λ > 0 is the "regularization parameter" that adjusts the strength of regularization; the first term measures the fit to the data, and the second term is a penalty for model complexity
- Examples of regularization functions (figures)
- Linear models are simple, but higher dimensionality means more degrees of freedom and hence overfitting
- L1 regularization has the property that it tends to make the learned β sparse; β being "sparse" means that many of its components are zero, so useless features that are unnecessary for prediction are ignored (its proximal operator is sketched after this chapter outline)
- In what sense are features "necessary"? Feature selection that minimizes the generalization error; Akaike information criterion (AIC)
- Linear regression methods with regularization

1.3 Various sparse regularizations
- 1.3.1 Group regularization
  - Sparse regularization that treats features belonging to the same group as if they were a single variable
  - Equation for group regularization; ∥β_gk∥_2 can be replaced by the L_r norm ∥β_gk∥_r for r > 1
  - Group regularization applies L1 regularization between groups and the L2 norm within groups, so variables belonging to the same group tend to become zero together
  - Going one level up hierarchically, given a group g' of groups, we obtain hierarchical group regularization
- 1.3.2 Generalized fused regularization
  - Regularization that makes variables adjacent in a graph take the same value
  - Equation for generalized fused regularization
- 1.3.3 Trace norm regularization
  - Described in "Overview of the trace norm and related algorithms and implementation examples"
  - Appears when we want to learn a low-rank matrix
  - Equation for trace norm regularization: L1 regularization applied to the vector of singular values, so matrices with sparse singular values, i.e. low-rank matrices, are learned
  - Also called the nuclear norm
  - Applications: collaborative filtering, multi-task learning
- 1.3.4 Combinations of regularization functions
  - Various combinations of regularizations are possible
  - Example: learning a sparse and low-rank matrix by using L1 regularization and trace norm regularization at the same time, R(β) = ∥β∥_1 + ∥β∥_tr
  - Ways to combine regularization functions: the sum type (example: a sparse and low-rank matrix) and the convolution type (example: the sum of a sparse matrix and a low-rank matrix)
  - Application example: removing sparse noise from the components of a low-rank matrix, which is common in image processing
  - Duality holds
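As a small illustration of why L1 regularization produces sparse β (my own sketch, not code from the book): the proximal operator of λ∥·∥_1 is coordinate-wise soft thresholding, which sets small coefficients exactly to zero.

```python
import numpy as np

# Sketch: the proximal operator of lam * ||.||_1 is coordinate-wise
# soft thresholding. Entries smaller than lam in magnitude are set
# exactly to zero, which is the mechanism behind L1-induced sparsity.
def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([0.3, -0.05, 1.2, -0.8, 0.01])
print(soft_threshold(v, 0.1))  # small entries collapse to exactly 0
```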
Chapter 2: Basics of Convex Analysis

2.1 Convex functions and convex sets
- Convexity of sets; convex sets (figure); convex functions (figure)
- Epigraph; effective domain
- Proper convex functions, closed convex functions
- Strictly convex functions, strongly convex functions; strong convexity guarantees a lower bound on the curvature of a convex function
- Smooth convex functions
- Examples: f(x) = x^2 is strongly convex and smooth; f(x) = log(1 + exp(-x)) is smooth but not strongly convex; f(x) = |x| + x^2 is strongly convex but not smooth

2.2 Subdifferentials and dual functions
- Conjugate functions and the Legendre transform
- A conjugate function views a function in terms of its slope information
- f(x) - ⟨x, y⟩ is the intercept (the value at x = 0) of the line with slope y that takes the value f(x) at the point x
- Legendre transform: the convex function y = f(x) can be represented by the set of points (x, y), but equally well by the set of tangent lines specified by their slopes and intercepts (see the numerical sketch after this outline)
- The Legendre transform was originally used to transform Lagrangians into Hamiltonians in analytical mechanics; its generalization is the Legendre-Fenchel transform
- The Legendre transform defines a new function only when the minimum of g(x) = f(x) - px is well defined; even then, if the original function is not convex, the newly defined function f*(p) does not recover the original under the inverse transform
- Supplementary definitions:
  - Convex hull: the convex hull of a set C ⊆ ℝ^p is the smallest convex set containing C, written conv(C); for a function f, the convex function whose epigraph is the convex hull of the epigraph of f (conv(epi(f))) is called the convex hull of f and written conv(f)
  - Closure: the closure of a set C is the smallest closed set containing C; the closure of a convex function f is the closed convex function whose epigraph is the closure of the epigraph of f (cl(epi(f))), denoted cl(f)
  - Affine set: a set A such that for any two points x, y ∈ A, λx + (1-λ)y lies in A for all λ; equivalently, a set such that the line through any two of its points never leaves the set; a linear subspace with its origin shifted
  - Affine hull: the smallest affine set containing a set C
  - Relative interior: let A be the affine hull of a convex set C; the set of x ∈ C for which there exists some ε > 0 with {x' | ∥x' - x∥ ≤ ε} ∩ A ⊆ C is called the relative interior of C, denoted ri(C)
- Theorem 2.2.2: the biconjugate f**, obtained by applying the Legendre transform twice, is the largest closed convex function bounded above by the original function f
- Corollary 2.2.3: the conjugate function is another way of looking at the closed convex hull from the viewpoint of slope information
- Definition of "slope": the subdifferential; an element of the subdifferential is called a subgradient
- A subgradient does not always exist, but it always exists on ri(dom(f)) (subdifferentiability)
- Properties of the subdifferential: when the Young-Fenchel inequality holds with equality, x and y are connected through the subdifferential and recover each other under the Legendre transform
- The subgradient expresses the slope at non-differentiable points: after restricting f to convex functions, the slope at each point is represented not by a single value but by "multiple values"
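A small numerical sketch of the Legendre-Fenchel transform f*(y) = sup_x {⟨x, y⟩ - f(x)} (my own illustration; the grid-based approximation is an assumption, not the book's method). For f(x) = x^2/2 the conjugate is y^2/2, i.e. f is self-conjugate, which the printout confirms:

```python
import numpy as np

# Sketch: approximate the Legendre-Fenchel transform on a grid.
# For f(x) = x^2/2 the exact conjugate is f*(y) = y^2/2.
x = np.linspace(-10, 10, 100_001)
f = 0.5 * x**2

def conjugate(y):
    # sup over the grid of x*y - f(x)
    return np.max(x * y - f)

for y in (-2.0, 0.5, 3.0):
    print(y, conjugate(y), 0.5 * y**2)  # numeric vs. exact value
```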
The "multiple values" = the set of values represented as the subdifferential One of its elements is the inferior gradient Properties of Conjugate Functions Conjugate Functions of Various Convex Functions Continued 2.3 Fenchel's duality theorem Fenchel's dual theorem When we want to solve the optimal problem Info{f(x)+g(Ax)}, we call it the primal problem. The equivalent problem supy{-f*(ATy)-g*(-y)} derived from Fenchel's dual theorem is called the dual problem. For some problems, conjugate functions are easier to handle and dual problems are easier to solve. When both the main problem and the dual problem are being solved in the middle of the algorithm Finding the difference between the main problem and the dual problem provides a guarantee of how close they are to the optimal solution. The difference between the main problem and the dual problem is called the duality gap. Conjugacy of summation and convolution using the Fenchel duality theorem 2.4 Proximity mapping In stochastic optimization in machine learning, proximal mapping, as represented by proximity gradient, plays an important role. About proximal mapping Let f be a true closed convex function Definition of proximal mapping proxf Extension of projection to sets Proximal mapping can be defined uniquely as a "mapping" to non-differentiable convex functions Replaces gradient descent operations Examples of Proximity Mappings Indicator function Moreau decomposition 2.5 Properties of Strongly Convex and Smoothly Convex Functions If one or both of strongly convex and smoothly convex functions are satisfied, we can construct optimization methods that converge more quickly Chapter 3: What is Stochastic Optimization? If a general-purpose optimization solver is applied as-is to the optimization required for learning, it takes a lot of time. If we divide the data appropriately and solve small subproblems at random, we end up solving the optimization using the entire data. 
Chapter 3: What is Stochastic Optimization?
- Applying a general-purpose optimization solver as-is to the optimization required for learning takes a great deal of time
- By splitting the data appropriately and solving small subproblems at random, we end up solving the optimization over the entire data set
- Online stochastic optimization: useful when samples are observed sequentially; very low computational cost per update
- Batch stochastic optimization: assumes all data are available; faster convergence can be achieved by using the same samples again; some methods split the features instead
- Purpose of machine learning: reducing the generalization error; if the generalization error is small, there is no need to strictly optimize the training error on the data at hand
- History of stochastic optimization: in 1951, the statisticians Robbins and Monro treated maximum likelihood estimation in statistics as an optimization problem in which the parameters are updated with each sample observation
- Classification of the various methods

Chapter 4: Online Stochastic Optimization

4.1 Framework for online stochastic optimization
- Update the parameters each time one or a few samples are observed
- Let z_t ∈ Z be the sample observed at time t (Z is the space of samples); in stochastic optimization, the z_t are assumed to be drawn independently and identically from some probability distribution P
- Let ℓ_t(β) = ℓ(z_t, β) be the value of the loss function with parameter β ∈ ℝ^p for z_t; example: ℓ_t(β) = ∥z_t - β∥^2 (instantiated in the sketch after this outline)
- In supervised learning, z_t = (x_t, y_t) is an input (feature vector) and label pair, and we set ℓ_t(β) = ℓ(y_t, ⟨x_t, β⟩), where ⟨x_t, β⟩ is the inner product of x_t and β
- Minimize the training error after observing the samples up to time t, with the regularization term included
- Basic procedure of online stochastic optimization; stochastic optimization in regularized learning corresponds to minimizing the expected value of the above objective
- In practice, training proceeds with several regularization strengths (regularization parameters), and after the t-th update the appropriate regularization parameter is chosen using validation data

4.2 Relationship between online learning and stochastic optimization
- Online learning minimizes the regret by sequentially updating the parameters
- Regret: the difference between the cumulative training error incurred during sequential training and the minimum training error achievable with a single fixed parameter chosen after seeing all samples
- No assumption is made that the samples are random variables
- Application example: advertising strategy, where subsequent consumer behavior changes depending on how the advertisement is presented; what we want to minimize is the "cumulative" loss incurred each time an ad is run, not the generalization error of the model obtained at time t
- Differences between "online stochastic optimization" and "online learning": in online stochastic optimization, z_t is assumed to be drawn i.i.d. from a fixed distribution; in online learning, z_t need not be a random variable, and its behavior may change according to the learner's actions; what online stochastic optimization minimizes is the expected error, not the regret
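A sketch of the framework in 4.1 using the example loss above (the sample distribution, the 1/2 scaling of the loss, and the step size are my own choices): one gradient step per observed sample converges to the minimizer of the expected loss, which here is the distribution mean.

```python
import numpy as np

# Sketch of the online framework in 4.1: samples z_t arrive i.i.d. and
# we take one gradient step per sample on the loss 0.5*||z_t - beta||^2,
# whose expected minimizer is the distribution mean. With step size
# eta_t = 1/t this update reduces to the running average of the z_t.
rng = np.random.default_rng(1)
true_mean = np.array([2.0, -1.0])

beta = np.zeros(2)
for t in range(1, 100_001):
    z_t = true_mean + rng.normal(size=2)  # sample observed at time t
    g_t = beta - z_t                      # gradient of the loss at beta
    beta -= (1.0 / t) * g_t               # SGD update

print(beta)  # close to true_mean
```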
4.3 Stochastic gradient descent (SGD)
- 4.3.1 Framework and algorithm of stochastic gradient descent
  - Algorithm for stochastic gradient descent without a regularization term
  - g_t is a subgradient of the loss function ℓ_t; the above relation follows from the definition of the subdifferential
  - Taking the expectation of both sides with respect to z_t, the expected value of g_t is a subgradient at β_{t-1} of the expected loss L(β) = E_{z_t}[ℓ_t(β)]
  - Approximate the (sub)gradient of the objective function, the expected loss L(β), and update β_t by moving slightly in that direction
  - Moving along the gradient is moving in the direction in which the objective function is most steeply inclined: gradient descent, the steepest descent method
  - Project back onto B so as not to leave the feasible region B: projected gradient descent
  - Stochastic gradient descent algorithm with a regularization term; if we eliminate the stochastic element and take g_t ∈ ∂L(β_{t-1}), the method reduces to the proximal gradient method for minimizing L + Ψ
  - Convergence of stochastic gradient descent (SGD) and the proximal gradient method
- 4.3.2 Convergence rate of stochastic gradient descent: assumptions, theorem
- 4.3.3 Convergence rate of stochastic gradient descent (strongly convex case): assumptions, theorem
- 4.3.4 Proof of the convergence rate of stochastic gradient descent (general form): theorem
- 4.3.5 Stochastic mirror descent
  - Bregman divergence; example: the KL divergence
  - Stochastic mirror descent algorithm and update; example of the mirror descent algorithm
- 4.3.6 Application of Nesterov's acceleration method

4.4 Stochastic dual averaging (SDA)
- 4.4.1 Algorithm and convergence rate of stochastic dual averaging
- 4.4.2 Stochastic dual averaging for strongly convex regularization terms: algorithm
- 4.4.3 Proof of the convergence rate of stochastic dual averaging (general form)
- 4.4.4 Extension of stochastic dual averaging to the mirror descent method: algorithm

4.5 AdaGrad
- An adaptive method that adjusts the update width for each coordinate
- In online training with sparse regularization, if a feature appears infrequently, the coefficient corresponding to that feature is collapsed to zero by the sparse regularization at every step; AdaGrad addresses this
- A slight modification of stochastic gradient descent and stochastic dual averaging; AdaGrad update formula and algorithm (sketched below)
- Used not only for learning convex losses but also for deep learning optimization: near a plateau, where the descent direction of the objective corresponds to a locally nearly flat spot, AdaGrad escapes by taking advantage of its property of emphasizing directions with small gradients
- Adaptive approaches in deep learning other than AdaGrad: AdaDelta, RMSProp (reference 38)
- Theorem: adaptive scaling of the variables attains better accuracy even when the features differ in their frequency of occurrence
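A sketch of the per-coordinate AdaGrad update from section 4.5 (diagonal variant; the toy objective and all constants are my own assumptions):

```python
import numpy as np

# Sketch of the diagonal AdaGrad update: each coordinate j gets its own
# step size eta / sqrt(sum of squared past gradients of j), so rarely
# updated (infrequent-feature) coordinates keep larger step sizes.
rng = np.random.default_rng(2)
p, eta, eps = 5, 0.5, 1e-8
beta = np.zeros(p)
G = np.zeros(p)  # accumulated squared gradients, per coordinate

for t in range(1000):
    # toy stochastic gradient: noisy pull toward the target [0,1,2,3,4]
    g = (beta - np.arange(p)) + 0.1 * rng.normal(size=p)
    G += g**2
    beta -= eta * g / (np.sqrt(G) + eps)

print(beta)  # approaches [0, 1, 2, 3, 4]
```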
4.6 Minimax optimality
- A lower bound that no stochastic optimization method using only function values and gradients can beat
- Definitions: first-order stochastic oracle; set of Lipschitz continuous functions; set of optimization algorithms; minimax error

4.7 Generalization error of online stochastic optimization

Chapter 5: Batch Stochastic Optimization

5.1 Problem setup for batch stochastic optimization
- Optimization methods that prioritize minimizing the training error
- To avoid using all samples in a single update, randomly select a small number of samples and use only those for the update
- If the objective function is strongly convex, these methods have the advantage of achieving exponential convergence with respect to the number of updates
- Characteristics of batch stochastic optimization methods

5.2 Stochastic dual coordinate descent (SDCA)
- 5.2.1 Algorithm for stochastic dual coordinate descent
  - A method using the dual problem; assume a (proper closed) convex function f_i and set ℓ_i(β) = f_i(x_i^T β)
  - Solve the dual problem obtained via Fenchel's duality theorem; the coordinate descent technique can be applied naturally
  - Block coordinate descent; SDCA (stochastic dual coordinate descent) algorithm
- 5.2.2 Convergence proof of stochastic dual coordinate descent

5.3 Stochastic variance-reduced gradient descent
- 5.3.1 Algorithm for stochastic variance-reduced gradient descent
  - Efficient computation in the primal problem
  - Stochastic variance-reduced gradient descent (SVRG); algorithm for SVRG (sketched below)
- 5.3.2 Convergence proof of the stochastic variance-reduced gradient method
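A sketch of the SVRG update rule from section 5.3 (my own least-squares instance, not the book's code): the outer loop computes a full gradient at a snapshot, and each inner step uses ∇f_i(β) - ∇f_i(snapshot) + full gradient, an unbiased gradient estimate whose variance shrinks as β approaches the optimum.

```python
import numpy as np

# Sketch of SVRG on (1/2n) * ||X beta - y||^2 with f_i the per-sample
# squared loss. The variance-reduced direction v is unbiased and its
# variance vanishes near the optimum, enabling linear convergence.
rng = np.random.default_rng(3)
n, p, eta = 2000, 10, 0.005
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p)

def grad_i(beta, i):
    return X[i] * (X[i] @ beta - y[i])  # gradient of the i-th loss

beta = np.zeros(p)
for epoch in range(20):
    snapshot = beta.copy()
    full_grad = X.T @ (X @ snapshot - y) / n  # full gradient, once per epoch
    for _ in range(n):
        i = rng.integers(n)
        v = grad_i(beta, i) - grad_i(snapshot, i) + full_grad
        beta -= eta * v

print(np.linalg.norm(X @ beta - y) / np.sqrt(n))  # near 0
```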
5.4 Stochastic average gradient (SAG) method
- Algorithm

Chapter 6: Stochastic Optimization in Distributed Environments

6.1 Distributed online stochastic optimization
- 6.1.1 Simple averaging
- 6.1.2 Synchronous and mini-batch methods
- 6.1.3 Asynchronous distributed SGD: Hogwild!

6.2 Distributed batch stochastic optimization: stochastic coordinate descent
- 6.2.1 Parallel coordinate descent in the primal problem
- 6.2.2 Parallel coordinate descent in the dual problem: CoCoA

Appendix A
- A.1 Useful inequalities
- A.2 First-order optimization for regularized learning methods (proximal gradient method)
  - A.2.1 Minimizing non-smooth convex functions
  - A.2.2 Minimizing smooth convex functions
  - A.2.3 Minimizing smooth convex functions: Nesterov's acceleration method