Stochastic optimization methods for solving large-scale learning problems with large amounts of data

Mathematics Artificial Intelligence Digital Transformation Online Learning Machine Learning Navigation of this blog

stochastic optimization

With the development of the Internet, sensor technology, and computers, a large amount of data of various types is now available, and machine learning technology has been attracting attention and making great progress in extracting meaningful information from large amounts of data. While the availability of a wide variety of data has made it possible to solve a wide range of problems, it has also created the problem of increased computational complexity due to the increased size of the data.

Stochastic optimization in machine learning will refer to the solution of optimization problems using random samples. In a normal optimization problem, all training data must be used to optimize in order to minimize the objective function, but when the data set is large, this can be very computationally intensive, so stochastic optimization allows the use of a randomly selected subset (mini-batch) of data for optimization, the computational complexity can be greatly reduced. In addition, because it uses randomly selected data, the optimization may converge faster.

A typical stochastic optimization method is Stochastic Gradient Descent (SGD) described in “Overview of Stochastic Gradient Descent (SGD), its algorithms and examples of implementation“, which updates parameters according to the direction of the gradient of the objective function to be minimized. SGD is particularly effective in training neural networks and is now a widely used method in machine learning.

There are also improved versions of SGD, such as SGD with Momentum and Adam, which are known to converge faster, and these methods improve the performance of SGD by adjusting the learning rate and reducing noise.

This section describes these various methods of stochastic optimization in machine learning.

Implementation

Overview and Implementation of Stochastic Optimization in Machine Learning

Stochastic optimization represents a method for solving optimization problems involving stochastic elements, and stochastic optimization in machine learning is a widely used method for optimizing the parameters of a model. Whereas in general optimization problems, the goal is to find optimal values of parameters to minimize or maximize the objective function, stochastic optimization is particularly useful when the objective function contains noise or randomness caused by various factors, such as data variability or observation error .

In stochastic optimization, random factors and stochastic algorithms are used to find the optimal solution. For example, in the field of machine learning, stochastic optimization methods are frequently used to optimize parameters such as weights and biases of neural networks. In SGD (Stochastic Gradient Descent), a typical method, optimization is performed by randomly selecting samples of the data set and updating parameters based on those samples, so that the model can be efficiently trained without using the entire data set The model can be trained without using the entire dataset.

This section describes implementations in python for SGD and mini-batch gradient descent, Adam, genetic algorithms, and Monte Carlo methods and examples of their application to parameter tuning, feature selection and dimensionality reduction, and k-means.

Theory

Machine Learning Professional Series Probabilistic Optimization Reading Notes

The book carefully explains everything from a review of supervised learning and convex analysis to parallel distributed stochastic optimization, which is useful for large data analysis. This book provides a sound understanding of both basic techniques and new topics. The book shows how formulas are formed and what algorithms mean. A real fast track to running large-scale calculations that shows proofs as short as possible.

Supervised Learning and Regularization

There are two main settings for machine learning: supervised learning and unsupervised learning. Supervised learning is the problem of predicting the corresponding output y (label) from a certain input x (feature), such as image discrimination or disease prediction from genetic data. Typical problems in supervised learning are regression and classification.

In addition to selecting an appropriate model, regularization is another way to avoid overlearning. Both model selection and habituation are essentially the same in that they select a relationship of appropriate complexity for the data. The regularization method does not minimize the training error as it is in the model, but minimizes it by adding penalties according to its “complexity.

Fundamentals of Convex Analysis in Stochastic Optimization (1) Convex Functions, Subdifferentials, and Dual Functions

In this article, I will discuss the basics of convex analysis. Various techniques of stochastic optimization are based on convex analysis. In this article, we will discuss the basics of convex analysis necessary for stochastic optimization. In particular, we introduce Fenchel’s dual theorem, proximity maps, strongly convex functions, and smooth convex functions, which are important in algorithm construction (for more details on convex analysis, please refer to the book by Rockerfeller). For a (non-stochastic) optimization of regularized learning methods, see “Machine Learning with Sparsity”.

Fundamentals of Convex Analysis in Stochastic Optimization (2) Fenchel’s Dual Theorem, Proximity Maps, and Strongly Convex Functions

In this article, I will discuss Fenchel’s dual theorem, proximity maps and strongly convex functions, which are important elements of convex analysis.

Stochastic Optimization and Online Optimization

Stochastic optimization in machine learning is mainly used to facilitate learning on large data sets. Current machine learning applications frequently require the handling of high-dimensional and large amounts of data, such as natural language, images, and speech.

If a general-purpose solver is applied directly to the optimization required for learning, each update will take at least the order of sample size x dimensions, so when the data is large, it will take a long time until the end of each update. Moreover, it is difficult to predict when the update will be completed.

To solve these problems, the idea of stochastic optimization is to divide the data appropriately and solve small subproblems at random, resulting in optimization using the entire data. Stochastic optimization can be broadly divided into online stochastic optimization and batch stochastic optimization.

Online Stochastic Optimization and Stochastic Gradient Descent for Machine Learning

In this article, we will discuss stochastic gradient descent SGD, the most basic algorithm for on-line stochastic optimization (Nesterov’s acceleration method, solving convex function optimization by gradient method, Lagrange’s undetermined multiplier method, Eulerian undetermined multiplier method, and Eulerian undetermined multiplier method). (Nesterov’s acceleration method, gradient method for optimization of convex functions, Lagrange’s undetermined multiplier method, Euclidean norm, convergence rate, KL divergence, exponential gradient descent method, Newton-Raphson method, Bregman divergence, stochastic mirror descent method, narrowly convex functions, Lipschitz continuous, loss function, projection gradient method, SGD, Cauchy-Schwartz inequality, minimax optimal, steepest descent method)

Online Stochastic Optimization and Stochastic Dual Averaging (SDA) for Machine Learning

Stochastic dual averaging (SDA) is an important optimization technique along with stochastic gradient descent. The dual averaging method was originally proposed by Nesterov in the context of non-stochastic optimization, but it can also be transformed into an Enline-type establishment optimization. In the theoretical analysis of the stochastic gradient descent method, the condition \\(E_{z_{1:t}}\{||\beta_t-\beta*||^2\}\leq D^2(\forall t\geq 1)\) appeared, but such a condition is not necessary in the stochastic dual averaging method. In other words, the boundedness of the variables is automatically guaranteed. It also has the practical advantage of being more sparse when using sparse regularization such as L1 regularization.

Online Stochastic Optimization, AdaGrad, and Minimax Optimization for Machine Learning

In this section, we describe a method called AdaGrad, which adaptively adjusts the update width of each coordinate. In online learning with sparse regularization, if there is a feature that appears infrequently, the coefficient corresponding to that feature will be collapsed to zero by sparse regularization every time. Considering the update formula for gradient descent, βt is obtained through the soft thresholding function every time when L1 regularization is used.

All of the methods described so far have generally converged to \(O(1/\sqrt{T})\)convergence, or O(1/T) if convex today. It has been shown that these are in fact optimal in the sense of mini-max optimal (mini-max optimal). To put it simply and without fear of misinterpretation, they have the property that they cannot be defeated (except by constant multiplications) by “any stochastic optimization method that uses function values and gradients”.

Batch-type Stochastic Optimization – Stochastic Dual Coordinate Descent Method

In this article, we will discuss stochastic dual-coordinate descent methods for batch stochastic optimization. Batch stochastic optimization is a way to achieve fast convergence when all samples are already available.

Batch Stochastic Optimization – Stochastic Variance-Reduced Gradient Descent and Stochastic Mean Gradient Methods

In “Batch Stochastic Optimization – Stochastic Dual Coordinate Descent,” we have constructed a set of fictional optimizations that achieve exponential-order convergence by using a dual problem. Here we describe a method called Stochastic Variance Reduction and Descent (SVRG), which achieves exponential order in the face of exponential loss.

Distributed Processing of Online Stochastic Optimization

Parallel computation and stochastic optimization go relatively well together, and parallel computation is possible by modifying the methods described so far. Here, a variety of parallel computations are possible. In this section, we list methods for various situations. An overview of the various methods is as follows. Simple averaging, mini-batch method, asynchronous distributed SGD, stochastic gradient descent, stochastic optimization in a distributed environment

Stochastic Coordinate Descent Method as a Distributed Process for Batch-Type Stochastic Optimization

In the previous sections, we have discussed parallel distributed computation methods with a focus on on-line stochastic optimization. This time, we will discuss parallel approximation errors in batch stochastic optimization. In particular, I will discuss parallel approximation errors in the coordinate descent method, which is one of the easiest methods to implement, and describe how to handle the main problem and the dual problem.