Summary
Stochastic optimization in machine learning refers to solving optimization problems using random samples. In an ordinary optimization problem, all of the training data must be used to minimize the objective function, which becomes very computationally expensive when the data set is large. Stochastic optimization instead performs each update on a randomly selected subset (mini-batch) of the data, greatly reducing the computational cost per update; moreover, because the samples are chosen at random, the optimization can also converge faster. This section describes various stochastic optimization methods based on the Machine Learning Professional Series volume "Stochastic Optimization".
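As a minimal sketch of this idea (my own illustration, not code from the book; the least-squares objective and all constants are assumptions), the following compares one full-gradient update, which touches all n samples, with one mini-batch update, which touches only 32:

```python
import numpy as np

# Sketch: one full-gradient step vs. one mini-batch stochastic step on
# the least-squares objective (1/2n) * ||X beta - y||^2.
rng = np.random.default_rng(0)
n, p, batch = 10_000, 20, 32
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

beta = np.zeros(p)
eta = 0.1

# Full gradient: touches all n samples per update.
grad_full = X.T @ (X @ beta - y) / n

# Mini-batch gradient: touches only `batch` samples; since the indices
# are drawn uniformly, it is an unbiased estimate of the full gradient.
idx = rng.choice(n, size=batch, replace=False)
grad_sgd = X[idx].T @ (X[idx] @ beta - y[idx]) / batch

beta -= eta * grad_sgd  # one cheap update
print(np.linalg.norm(grad_full), np.linalg.norm(grad_full - grad_sgd))
```

The mini-batch gradient is noisy but unbiased, which is why the method still makes progress in expectation at a fraction of the per-update cost.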
In this article, I present my reading notes.
Machine Learning Professional Series "Stochastic Optimization" Reading Notes
From the introduction to the Machine Learning Professional Series volume "Stochastic Optimization": "This book provides a detailed explanation, from a review of supervised learning and convex analysis to the parallel and distributed stochastic optimization useful for large-scale data analysis. Both basic techniques and new topics can be well understood from this book. It shows how the formulas are derived and what the algorithms mean, with as few omitted steps as possible in the proofs: a real fast track to performing large-scale computations."
Chapter 1: Supervised Learning and Regularization

1.1 Supervised Learning
- Introduction; feature extraction and learning
- 1.1.1 Regression
  - Loss functions for regression (figures)
- 1.1.2 Classification
  - Generally, an input is classified as 1 if f(x) is positive and as -1 if it is negative
  - Loss functions for binary classification (figures)
  - The 0-1 loss is the natural choice, but it is neither a continuous function nor a convex function, so it is difficult to optimize; it is common to use a "surrogate loss" that closely approximates the 0-1 loss
  - Surrogate losses other than the above (figure)
- 1.1.3 Overfitting

1.2 Regularized learning methods
- Methods for avoiding overfitting: regularization
- Instead of minimizing the error function as-is, add a "penalty" according to the model's "complexity"; what counts as complexity is determined by the structure of the data
- General form of a regularized learning method: λ > 0 is the "regularization parameter" that adjusts the strength of regularization; the first term measures the fit to the data, and the second term is a penalty for model complexity
- Examples of regularization functions (figures)
- Linear models are simple, but higher dimensionality means more degrees of freedom and hence overfitting
- L1 regularization has the property that it tends to make the learned β sparse; β being "sparse" means that many of its components are zero, so useless features that are unnecessary for prediction are ignored (its proximal operator is sketched after this chapter outline)
- In what sense are features "necessary"? Feature selection that minimizes the generalization error; Akaike information criterion (AIC)
- Linear regression methods with regularization

1.3 Various sparse regularizations
- 1.3.1 Group regularization
  - Sparse regularization that treats features belonging to the same group as if they were a single variable
  - Equation for group regularization; ∥β_gk∥_2 can be replaced by the L_r norm ∥β_gk∥_r for r > 1
  - Group regularization applies L1 regularization between groups and the L2 norm within groups, so variables belonging to the same group tend to become zero together
  - Going one level up hierarchically, given a group g' of groups, we obtain hierarchical group regularization
- 1.3.2 Generalized fused regularization
  - Regularization that makes variables adjacent in a graph take the same value
  - Equation for generalized fused regularization
- 1.3.3 Trace norm regularization
  - Described in "Overview of the trace norm and related algorithms and implementation examples"
  - Appears when we want to learn a low-rank matrix
  - Equation for trace norm regularization: L1 regularization applied to the vector of singular values, so matrices with sparse singular values, i.e. low-rank matrices, are learned
  - Also called the nuclear norm
  - Applications: collaborative filtering, multi-task learning
- 1.3.4 Combinations of regularization functions
  - Various combinations of regularizations are possible
  - Example: learning a sparse and low-rank matrix by using L1 regularization and trace norm regularization at the same time, R(β) = ∥β∥_1 + ∥β∥_tr
  - Ways to combine regularization functions: the sum type (example: a sparse and low-rank matrix) and the convolution type (example: the sum of a sparse matrix and a low-rank matrix)
  - Application example: removing sparse noise from the components of a low-rank matrix, which is common in image processing
  - Duality holds
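As a small illustration of why L1 regularization produces sparse β (my own sketch, not code from the book): the proximal operator of λ∥·∥_1 is coordinate-wise soft thresholding, which sets small coefficients exactly to zero.

```python
import numpy as np

# Sketch: the proximal operator of lam * ||.||_1 is coordinate-wise
# soft thresholding. Entries smaller than lam in magnitude are set
# exactly to zero, which is the mechanism behind L1-induced sparsity.
def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([0.3, -0.05, 1.2, -0.8, 0.01])
print(soft_threshold(v, 0.1))  # small entries collapse to exactly 0
```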
Chapter 2: Basics of Convex Analysis

2.1 Convex functions and convex sets
- Convexity of sets; convex sets (figure); convex functions (figure)
- Epigraph; effective domain
- Proper convex functions, closed convex functions
- Strictly convex functions, strongly convex functions; strong convexity guarantees a lower bound on the curvature of a convex function
- Smooth convex functions
- Examples: f(x) = x^2 is strongly convex and smooth; f(x) = log(1 + exp(-x)) is smooth but not strongly convex; f(x) = |x| + x^2 is strongly convex but not smooth

2.2 Subdifferentials and dual functions
- Conjugate functions and the Legendre transform
- A conjugate function views a function in terms of its slope information
- f(x) - ⟨x, y⟩ is the intercept (the value at x = 0) of the line with slope y that takes the value f(x) at the point x
- Legendre transform: the convex function y = f(x) can be represented by the set of points (x, y), but equally well by the set of tangent lines specified by their slopes and intercepts (see the numerical sketch after this outline)
- The Legendre transform was originally used to transform Lagrangians into Hamiltonians in analytical mechanics; its generalization is the Legendre-Fenchel transform
- The Legendre transform defines a new function only when the minimum of g(x) = f(x) - px is well defined; even then, if the original function is not convex, the newly defined function f*(p) does not recover the original under the inverse transform
- Supplementary definitions:
  - Convex hull: the convex hull of a set C ⊆ ℝ^p is the smallest convex set containing C, written conv(C); for a function f, the convex function whose epigraph is the convex hull of the epigraph of f (conv(epi(f))) is called the convex hull of f and written conv(f)
  - Closure: the closure of a set C is the smallest closed set containing C; the closure of a convex function f is the closed convex function whose epigraph is the closure of the epigraph of f (cl(epi(f))), denoted cl(f)
  - Affine set: a set A such that for any two points x, y ∈ A, λx + (1-λ)y lies in A for all λ; equivalently, a set such that the line through any two of its points never leaves the set; a linear subspace with its origin shifted
  - Affine hull: the smallest affine set containing a set C
  - Relative interior: let A be the affine hull of a convex set C; the set of x ∈ C for which there exists some ε > 0 with {x' | ∥x' - x∥ ≤ ε} ∩ A ⊆ C is called the relative interior of C, denoted ri(C)
- Theorem 2.2.2: the biconjugate f**, obtained by applying the Legendre transform twice, is the largest closed convex function bounded above by the original function f
- Corollary 2.2.3: the conjugate function is another way of looking at the closed convex hull from the viewpoint of slope information
- Definition of "slope": the subdifferential; an element of the subdifferential is called a subgradient
- A subgradient does not always exist, but it always exists on ri(dom(f)) (subdifferentiability)
- Properties of the subdifferential: when the Young-Fenchel inequality holds with equality, x and y are connected through the subdifferential and recover each other under the Legendre transform
- The subgradient expresses the slope at non-differentiable points: after restricting f to convex functions, the slope at each point is represented not by a single value but by "multiple values"
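A small numerical sketch of the Legendre-Fenchel transform f*(y) = sup_x {⟨x, y⟩ - f(x)} (my own illustration; the grid-based approximation is an assumption, not the book's method). For f(x) = x^2/2 the conjugate is y^2/2, i.e. f is self-conjugate, which the printout confirms:

```python
import numpy as np

# Sketch: approximate the Legendre-Fenchel transform on a grid.
# For f(x) = x^2/2 the exact conjugate is f*(y) = y^2/2.
x = np.linspace(-10, 10, 100_001)
f = 0.5 * x**2

def conjugate(y):
    # sup over the grid of x*y - f(x)
    return np.max(x * y - f)

for y in (-2.0, 0.5, 3.0):
    print(y, conjugate(y), 0.5 * y**2)  # numeric vs. exact value
```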
The "multiple values" = the set of values represented as the subdifferential One of its elements is the inferior gradient Properties of Conjugate Functions Conjugate Functions of Various Convex Functions Continued 2.3 Fenchel's duality theorem Fenchel's dual theorem When we want to solve the optimal problem Info{f(x)+g(Ax)}, we call it the primal problem. The equivalent problem supy{-f*(ATy)-g*(-y)} derived from Fenchel's dual theorem is called the dual problem. For some problems, conjugate functions are easier to handle and dual problems are easier to solve. When both the main problem and the dual problem are being solved in the middle of the algorithm Finding the difference between the main problem and the dual problem provides a guarantee of how close they are to the optimal solution. The difference between the main problem and the dual problem is called the duality gap. Conjugacy of summation and convolution using the Fenchel duality theorem 2.4 Proximity mapping In stochastic optimization in machine learning, proximal mapping, as represented by proximity gradient, plays an important role. About proximal mapping Let f be a true closed convex function Definition of proximal mapping proxf Extension of projection to sets Proximal mapping can be defined uniquely as a "mapping" to non-differentiable convex functions Replaces gradient descent operations Examples of Proximity Mappings Indicator function Moreau decomposition 2.5 Properties of Strongly Convex and Smoothly Convex Functions If one or both of strongly convex and smoothly convex functions are satisfied, we can construct optimization methods that converge more quickly Chapter 3: What is Stochastic Optimization? If a general-purpose optimization solver is applied as-is to the optimization required for learning, it takes a lot of time. If we divide the data appropriately and solve small subproblems at random, we end up solving the optimization using the entire data. 
Chapter 3: What is Stochastic Optimization?
- Applying a general-purpose optimization solver as-is to the optimization required for learning takes a great deal of time
- By splitting the data appropriately and solving small subproblems at random, we end up solving the optimization over the entire data set
- Online stochastic optimization: useful when samples are observed sequentially; very low computational cost per update
- Batch stochastic optimization: assumes all data are available; faster convergence can be achieved by using the same samples again; some methods split the features instead
- Purpose of machine learning: reducing the generalization error; if the generalization error is small, there is no need to strictly optimize the training error on the data at hand
- History of stochastic optimization: in 1951, the statisticians Robbins and Monro treated maximum likelihood estimation in statistics as an optimization problem in which the parameters are updated with each sample observation
- Classification of the various methods

Chapter 4: Online Stochastic Optimization

4.1 Framework for online stochastic optimization
- Update the parameters each time one or a few samples are observed
- Let z_t ∈ Z be the sample observed at time t (Z is the space of samples); in stochastic optimization, the z_t are assumed to be drawn independently and identically from some probability distribution P
- Let ℓ_t(β) = ℓ(z_t, β) be the value of the loss function with parameter β ∈ ℝ^p for z_t; example: ℓ_t(β) = ∥z_t - β∥^2 (instantiated in the sketch after this outline)
- In supervised learning, z_t = (x_t, y_t) is an input (feature vector) and label pair, and we set ℓ_t(β) = ℓ(y_t, ⟨x_t, β⟩), where ⟨x_t, β⟩ is the inner product of x_t and β
- Minimize the training error after observing the samples up to time t, with the regularization term included
- Basic procedure of online stochastic optimization; stochastic optimization in regularized learning corresponds to minimizing the expected value of the above objective
- In practice, training proceeds with several regularization strengths (regularization parameters), and after the t-th update the appropriate regularization parameter is chosen using validation data

4.2 Relationship between online learning and stochastic optimization
- Online learning minimizes the regret by sequentially updating the parameters
- Regret: the difference between the cumulative training error incurred during sequential training and the minimum training error achievable with a single fixed parameter chosen after seeing all samples
- No assumption is made that the samples are random variables
- Application example: advertising strategy, where subsequent consumer behavior changes depending on how the advertisement is presented; what we want to minimize is the "cumulative" loss incurred each time an ad is run, not the generalization error of the model obtained at time t
- Differences between "online stochastic optimization" and "online learning": in online stochastic optimization, z_t is assumed to be drawn i.i.d. from a fixed distribution; in online learning, z_t need not be a random variable, and its behavior may change according to the learner's actions; what online stochastic optimization minimizes is the expected error, not the regret
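A sketch of the framework in 4.1 using the example loss above (the sample distribution, the 1/2 scaling of the loss, and the step size are my own choices): one gradient step per observed sample converges to the minimizer of the expected loss, which here is the distribution mean.

```python
import numpy as np

# Sketch of the online framework in 4.1: samples z_t arrive i.i.d. and
# we take one gradient step per sample on the loss 0.5*||z_t - beta||^2,
# whose expected minimizer is the distribution mean. With step size
# eta_t = 1/t this update reduces to the running average of the z_t.
rng = np.random.default_rng(1)
true_mean = np.array([2.0, -1.0])

beta = np.zeros(2)
for t in range(1, 100_001):
    z_t = true_mean + rng.normal(size=2)  # sample observed at time t
    g_t = beta - z_t                      # gradient of the loss at beta
    beta -= (1.0 / t) * g_t               # SGD update

print(beta)  # close to true_mean
```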
4.3 Stochastic gradient descent (SGD)
- 4.3.1 Framework and algorithm of stochastic gradient descent
  - Algorithm for stochastic gradient descent without a regularization term
  - g_t is a subgradient of the loss function ℓ_t; the above relation follows from the definition of the subdifferential
  - Taking the expectation of both sides with respect to z_t, the expected value of g_t is a subgradient at β_{t-1} of the expected loss L(β) = E_{z_t}[ℓ_t(β)]
  - Approximate the (sub)gradient of the objective function, the expected loss L(β), and update β_t by moving slightly in that direction
  - Moving along the gradient is moving in the direction in which the objective function is most steeply inclined: gradient descent, the steepest descent method
  - Project back onto B so as not to leave the feasible region B: projected gradient descent
  - Stochastic gradient descent algorithm with a regularization term; if we eliminate the stochastic element and take g_t ∈ ∂L(β_{t-1}), the method reduces to the proximal gradient method for minimizing L + Ψ
  - Convergence of stochastic gradient descent (SGD) and the proximal gradient method
- 4.3.2 Convergence rate of stochastic gradient descent: assumptions, theorem
- 4.3.3 Convergence rate of stochastic gradient descent (strongly convex case): assumptions, theorem
- 4.3.4 Proof of the convergence rate of stochastic gradient descent (general form): theorem
- 4.3.5 Stochastic mirror descent
  - Bregman divergence; example: the KL divergence
  - Stochastic mirror descent algorithm and update; example of the mirror descent algorithm
- 4.3.6 Application of Nesterov's acceleration method

4.4 Stochastic dual averaging (SDA)
- 4.4.1 Algorithm and convergence rate of stochastic dual averaging
- 4.4.2 Stochastic dual averaging for strongly convex regularization terms: algorithm
- 4.4.3 Proof of the convergence rate of stochastic dual averaging (general form)
- 4.4.4 Extension of stochastic dual averaging to the mirror descent method: algorithm

4.5 AdaGrad
- An adaptive method that adjusts the update width for each coordinate
- In online training with sparse regularization, if a feature appears infrequently, the coefficient corresponding to that feature is collapsed to zero by the sparse regularization at every step; AdaGrad addresses this
- A slight modification of stochastic gradient descent and stochastic dual averaging; AdaGrad update formula and algorithm (sketched below)
- Used not only for learning convex losses but also for deep learning optimization: near a plateau, where the descent direction of the objective corresponds to a locally nearly flat spot, AdaGrad escapes by taking advantage of its property of emphasizing directions with small gradients
- Adaptive approaches in deep learning other than AdaGrad: AdaDelta, RMSProp (reference 38)
- Theorem: adaptive scaling of the variables attains better accuracy even when the features differ in their frequency of occurrence
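A sketch of the per-coordinate AdaGrad update from section 4.5 (diagonal variant; the toy objective and all constants are my own assumptions):

```python
import numpy as np

# Sketch of the diagonal AdaGrad update: each coordinate j gets its own
# step size eta / sqrt(sum of squared past gradients of j), so rarely
# updated (infrequent-feature) coordinates keep larger step sizes.
rng = np.random.default_rng(2)
p, eta, eps = 5, 0.5, 1e-8
beta = np.zeros(p)
G = np.zeros(p)  # accumulated squared gradients, per coordinate

for t in range(1000):
    # toy stochastic gradient: noisy pull toward the target [0,1,2,3,4]
    g = (beta - np.arange(p)) + 0.1 * rng.normal(size=p)
    G += g**2
    beta -= eta * g / (np.sqrt(G) + eps)

print(beta)  # approaches [0, 1, 2, 3, 4]
```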
4.6 Minimax optimality
- A lower bound that no stochastic optimization method using only function values and gradients can beat
- Definitions: first-order stochastic oracle; set of Lipschitz continuous functions; set of optimization algorithms; minimax error

4.7 Generalization error of online stochastic optimization

Chapter 5: Batch Stochastic Optimization

5.1 Problem setup for batch stochastic optimization
- Optimization methods that prioritize minimizing the training error
- To avoid using all samples in a single update, randomly select a small number of samples and use only those for the update
- If the objective function is strongly convex, these methods have the advantage of achieving exponential convergence with respect to the number of updates
- Characteristics of batch stochastic optimization methods

5.2 Stochastic dual coordinate descent (SDCA)
- 5.2.1 Algorithm for stochastic dual coordinate descent
  - A method using the dual problem; assume a (proper closed) convex function f_i and set ℓ_i(β) = f_i(x_i^T β)
  - Solve the dual problem obtained via Fenchel's duality theorem; the coordinate descent technique can be applied naturally
  - Block coordinate descent; SDCA (stochastic dual coordinate descent) algorithm
- 5.2.2 Convergence proof of stochastic dual coordinate descent

5.3 Stochastic variance-reduced gradient descent
- 5.3.1 Algorithm for stochastic variance-reduced gradient descent
  - Efficient computation in the primal problem
  - Stochastic variance-reduced gradient descent (SVRG); algorithm for SVRG (sketched below)
- 5.3.2 Convergence proof of the stochastic variance-reduced gradient method
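A sketch of the SVRG update rule from section 5.3 (my own least-squares instance, not the book's code): the outer loop computes a full gradient at a snapshot, and each inner step uses ∇f_i(β) - ∇f_i(snapshot) + full gradient, an unbiased gradient estimate whose variance shrinks as β approaches the optimum.

```python
import numpy as np

# Sketch of SVRG on (1/2n) * ||X beta - y||^2 with f_i the per-sample
# squared loss. The variance-reduced direction v is unbiased and its
# variance vanishes near the optimum, enabling linear convergence.
rng = np.random.default_rng(3)
n, p, eta = 2000, 10, 0.005
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p)

def grad_i(beta, i):
    return X[i] * (X[i] @ beta - y[i])  # gradient of the i-th loss

beta = np.zeros(p)
for epoch in range(20):
    snapshot = beta.copy()
    full_grad = X.T @ (X @ snapshot - y) / n  # full gradient, once per epoch
    for _ in range(n):
        i = rng.integers(n)
        v = grad_i(beta, i) - grad_i(snapshot, i) + full_grad
        beta -= eta * v

print(np.linalg.norm(X @ beta - y) / np.sqrt(n))  # near 0
```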
5.4 Stochastic average gradient (SAG) method
- Algorithm

Chapter 6: Stochastic Optimization in Distributed Environments

6.1 Distributed online stochastic optimization
- 6.1.1 Simple averaging
- 6.1.2 Synchronous and mini-batch methods
- 6.1.3 Asynchronous distributed SGD: Hogwild!

6.2 Distributed batch stochastic optimization: stochastic coordinate descent
- 6.2.1 Parallel coordinate descent in the primal problem
- 6.2.2 Parallel coordinate descent in the dual problem: CoCoA

Appendix A
- A.1 Useful inequalities
- A.2 First-order optimization for regularized learning methods (proximal gradient method)
  - A.2.1 Minimizing non-smooth convex functions
  - A.2.2 Minimizing smooth convex functions
  - A.2.3 Minimizing smooth convex functions: Nesterov's acceleration method