On online learning and online prediction

Artificial Intelligence Digital Transformation Image Information Processing Deep Learning Probabilistic Generative Model Machine Learning Sensor Data/IOT Reinforcement Learning Navigation of this blog

Online learning and online prediction

Online learning and online prediction are described below.

Online learning is one of the learning frameworks in machine learning. In the framework of online learning, the model is improved sequentially using only the given data each time one piece of data (or a portion of all the data) is given, without using all the data at once. Due to the nature of this data processing method, it is applicable to data analysis on a scale where all data cannot be stored in memory or cache, and to learning in an environment where data is generated permanently.

Online learning requires the ability to receive sequential data and change algorithms and data models accordingly. Algorithms used in them include perceptron, ADALINE, naive Bayes, linear regression, deep learning, etc.

Online forecasting is a theoretical framework used in the field of machine learning to receive data sequentially and make forecasts in real time. Online forecasting is not only used for simple forecasting problems, such as literally predicting the weather or stock price fluctuations, but also for building probabilistic language models, designing high throughput network routing, determining reasonable prices in auctions, deriving optimal mixing strategies in games, etc. but also a combination of decision-making problems with various problems of predicting future events based on past experience.

Online forecasting has also been developed to make optimal forecasts even when the distribution of the given data is not known in advance, and online forecasting can be used to find optimal forecasting methods and maximize forecasting performance.

These techniques are closely related to the multi-armed bandit problem described in “Theory and Algorithms of the Bandit Problem” and to the reinforcement learning techniques described in “Theory and Algorithms of Various Reinforcement Learning Techniques and Their Implementation in Python“.

This blog discusses the following topics.

Implementation

Overview of Online Learning, Various Algorithms, Application Examples and Specific Implementations

Online learning is a method of learning by sequentially updating a model in a situation where data arrives sequentially. Unlike batch learning in ordinary machine learning, this algorithm is characterized by the fact that the model is updated each time new data arrives. This section describes various algorithms and examples of applications of on-run learning, as well as examples of implementations in python.

Overview of Online Prediction Technology and Various Applications and Implementations

Online Prediction (Online Prediction) is a technique that uses models to make predictions in real time under conditions where data arrive sequentially.” Online learning, as described in “Overview of Online Learning, Various Algorithms, Application Examples, and Specific Implementations,” is characterized by the fact that models are learned sequentially but the immediacy of model application is not clearly defined, whereas online prediction is characterized by the fact that predictions are made immediately upon the arrival of new data and the results are used. characteristic.

This section discusses various applications and specific implementation examples for this online forecasting.

Active Learning Techniques in Machine Learning

Active learning in machine learning (Active Learning) is a strategic approach to effectively selecting labeled data to improve model performance. Typically, training machine learning models requires large amounts of labeled data, but since labeling is costly and time consuming, active learning increases the efficiency of data collection.

Overview of mini-batch learning and examples of algorithms and implementations

Mini-batch learning is one of the most widely used and efficient learning methods in machine learning, which is computationally more efficient and applicable to large data sets compared to the usual Gradient Descent method. This section provides an overview of mini-batch learning. Mini-batch learning is a learning method in which multiple samples (called mini-batches) are processed in batches, rather than the entire dataset at once, and the gradient of the loss function is calculated for each mini-batch and the parameters are updated using the gradient.

Online Learning

As an introduction to online learning, a sequential machine learning method, the paper describes the basic algorithms and the establishment gradient descent method up to its introduction.

Fundamentals of Online Learning – Perceptron, SVM, and Logistic Regression

It is based on stochastic gradient descent, which is a fundamental method for online learning, and describes its application to the perceptron, SVM, and logistic regression.

Advanced online learning (1) High accuracy Approach (Perceptron, PA, PA-I, PA-II, CW, AROW, SCW)

Even if the same function is used for the discriminant function and the amount of training data is the same, its discriminant performance will vary greatly depending on the training method. The performance also depends not only on whether the generalization performance is excellent, but also on whether it is robust to data noise and whether it can handle differences in the variability of each dimension of the features. The high-precision approaches described here (Perceptron, PA, PA-I, PA-II, CW, AROW, and SCW) are all realized on the same framework and have different features only due to the difference in parameters at the time of update.

Advanced online learning (2)Distributed Parallel Processing(Parallelized mini-batch stochastic gradient method, IPM, BSP, SSP)

In general, the larger the training data, the better the performance of machine learning. In order to use up the information in a large amount of training data, it is necessary to increase the size of the training model and to increase its expressive power. We will discuss distributed parallel processing (parallelized mini-batch stochastic gradient method, IPM, BSP, SSP) to achieve these goals.

Advanced Online Learning(3)Application to deep learning (mini-batch stochastic gradient descent, momentum method, accelerated gradient method)

Deep learning consists of deep and wide multilayer neural nets, and high performance is obtained by using up the information in a large amount of training data. In this section, we discuss online learning methods for deep learning, such as mini-batch stochastic gradient descent, momentum method, and accelerated gradient method.

Advanced online learning (4) Application to deep learning (AdaGrad, RMSprop, ADADELTA, vSGD)

Continuing from the previous article, we will discuss AdaGrad, RMSprop, ADADELTA, and vSGD as online learning methods for deep learning.

Evaluating the Performance of Online Learning(Perceptron, Regret Analysis, FTL, RFTL)

It discusses theories (Perceptron, Riglet Analysis, FTL, RFTL) for evaluating the performance of online learning. By using the weapon of theory, online learning can be guaranteed to produce learning results that are as close as if it had been trained after looking at all the training data, and can be guided by how much training data is needed.

Implementation in online learning – sparse vector computation and averaging perceptron, averaged stochastic gradient descent, lazy update

It discusses the elements required when dropping various online learning algorithms into an implementation (sparse vector computation and averaging perceptrons, averaged stochastic gradient descent, and lazy update as efficient vector computation).

Article on Online Learning at the Conference on Artificial Intelligence

Content by Hidekazu Oiwa in my bookmarks where topics are described in various technologies in the Artificial Intelligence Society. International conferences where online learning has been presented, links to technical surveys, and various algorithms are summarized.

Sequential Update Anomaly Detection with Mixture Distribution Model -Jensen’s Inequality and EM Method

This article describes a sequential update EM approach using mixed normal distribution for failure and change detection by sequential learning for systems with multiple operating modes, such as air conditioning systems.

Stochastic Optimization and Online Optimization

Stochastic optimization in machine learning is mainly used to facilitate learning on large data sets. Current machine learning applications frequently require the handling of high-dimensional and large amounts of data, such as natural language, images, and speech.

If a general-purpose solver is applied directly to the optimization required for learning, each update will take at least the order of sample size x dimensions, so when the data is large, it will take a long time until the end of each update. Moreover, it is difficult to predict when the update will be completed.

To solve these problems, the idea of stochastic optimization is to divide the data appropriately and solve small subproblems at random, resulting in optimization using the entire data. Stochastic optimization can be broadly divided into online stochastic optimization and batch stochastic optimization.

Online Stochastic Optimization and Stochastic Gradient Descent for Machine Learning

In this article, we will discuss stochastic gradient descent SGD, the most basic algorithm for on-line stochastic optimization (Nesterov’s acceleration method, solving convex function optimization by gradient method, Lagrange’s undetermined multiplier method, Eulerian undetermined multiplier method, and Eulerian undetermined multiplier method). (Nesterov’s acceleration method, gradient method for optimization of convex functions, Lagrange’s undetermined multiplier method, Euclidean norm, convergence rate, KL divergence, exponential gradient descent method, Newton-Raphson method, Bregman divergence, stochastic mirror descent method, narrowly convex functions, Lipschitz continuous, loss function, projection gradient method, SGD, Cauchy-Schwartz inequality, minimax optimal, steepest descent method)

Online Stochastic Optimization and Stochastic Dual Averaging (SDA) for Machine Learning

Stochastic dual averaging (SDA) is an important optimization technique along with stochastic gradient descent. The dual averaging method was originally proposed by Nesterov in the context of non-stochastic optimization, but it can also be transformed into an Enline-type establishment optimization. In the theoretical analysis of the stochastic gradient descent method, the condition \\(E_{z_{1:t}}\{||\beta_t-\beta*||^2\}\leq D^2(\forall t\geq 1)\) appeared, but such a condition is not necessary in the stochastic dual averaging method. In other words, the boundedness of the variables is automatically guaranteed. It also has the practical advantage of being more sparse when using sparse regularization such as L1 regularization.

Online Stochastic Optimization, AdaGrad, and Minimax Optimization for Machine Learning

In this section, we describe a method called AdaGrad, which adaptively adjusts the update width of each coordinate. In online learning with sparse regularization, if there is a feature that appears infrequently, the coefficient corresponding to that feature will be collapsed to zero by sparse regularization every time. Considering the update formula for gradient descent, βt is obtained through the soft thresholding function every time when L1 regularization is used.

All of the methods described so far have generally converged to \(O(1/\sqrt{T})\)convergence, or O(1/T) if convex today. It has been shown that these are in fact optimal in the sense of mini-max optimal (mini-max optimal). To put it simply and without fear of misinterpretation, they have the property that they cannot be defeated (except by constant multiplications) by “any stochastic optimization method that uses function values and gradients”.

Distributed Processing of Online Stochastic Optimization

Parallel computation and stochastic optimization go relatively well together, and parallel computation is possible by modifying the methods described so far. Here, a variety of parallel computations are possible. In this section, we list methods for various situations. An overview of the various methods is as follows. Simple averaging, mini-batch method, asynchronous distributed SGD, stochastic gradient descent, stochastic optimization in a distributed environment

Online Prediction

Overview of online forecasting technology. The theory that deals with sequential prediction problems is called online prediction theory, and it has been used for sequential prediction and decision-making problems such as weather forecasting and stock prediction.

One of the characteristics of online learning is that it does not make any assumptions about the data generation process. However, without assumptions, it is impossible to evaluate forecasting algorithms, and to compensate for these assumptions, online forecasting theory evaluates “relative” to “hindsight-optimal” forecasting strategies. To compensate for this, online forecasting theory provides a “relative” evaluation of the “optimal” forecasting strategy in hindsight, which is called “Riglet Analysis”.

In this paper, I first describe the basic framework of “expert prediction theory,” which is a case in which expert opinions are used to answer a quiz show.

Online Convex Optimization(1)FTL strategy and BTL supplement

Online convex optimization is a framework for abstracting and unifying various online prediction problems.

An online convex optimization problem is defined by a convex set χ and a set of convex functions on χ such that F⊂{f:χ→ℝ|f is convex} (χ, F). We will refer to χ as the case space. The online convex optimization problem can be represented as a protocol between a player and an adversary as follows.

For each trial t=1,…. (1) the player chooses a point xt ∈ χ, (2) the adversary chooses a convex function ft ∈ F and gives it to the player, and (3) the player suffers a loss ft(xt).

We assume that the convex function ft is explicitly given to the player by the adversary. That is, not only the function value ft(xt), but also the function ft itself is given. Thus, not only do we know ft(x) for any x ∈ χ, but we can also compute the gradient ∇ft(x), subgradient, and Hensian of the function ft. This assumption is called the full information setting.

Online Convex Optimization (2) Complementing FTL Strategies with Regularization

In the previous article, we discussed the effectiveness and limitations of the FTL strategy, and the lesson of the FTL strategy is that a simple and greedy forecasting strategy may not work. In this article, we will discuss strategies that can compensate for the weaknesses of FTL strategies.

One of the most important concepts in machine learning is “regularization”. Many of the successful methods in machine learning, such as support vector machines, do not simply learn a hypothesis that minimizes the empirical loss (e.g. training error), but learn a hypothesis that simultaneously minimizes the empirical loss and some function (called the regularization term). By taking into account the regularization term, this approach can be seen as preparing for future losses that may appear in the future, rather than being limited to past empirical losses. In fact, in the field of statistical learning theory, it is possible to evaluate the generalization error of regularization-based methods under reasonable assumptions.

Something similar holds true in the context of online prediction. What we are about to describe is called the Follow The Regularized Leader strategy (FTRL strategy), which incorporates the idea of regularization into the FTL strategy.

Online convex optimization (3) exp concavity and ONS

In the previous section, we discussed the possibility of obtaining a riglet upper bound O(logT) stronger than \(O(\sqrt{T})\) when the loss function is strongly convex. In this article, I will discuss the possibility of strong quadratic approximation by exp-concavity and the Online Newton Step (ONS) as a case where a riglet upper bound of O(logT) can be obtained even when the loss function is not strongly convex.

Randomness-Based Online Prediction

Previously, we have discussed greedy strategies such as FTL and FTRL strategies, and greedy strategies with regularization terms. In this article, we will discuss the Follow the Perturbed Leader (FPL) strategy, which is an extension of FTL similar to FTRL, based on the method of adding random noise to the objective function and using a greedy strategy instead of using a regularization term when considering FTRL. FPL is based on the method of adding random noise to the objective function and using a greedy strategy. It is known that the FPL strategy of incubating random noise has the same guarantee of Riglet as the FTRL strategy of adding a regularization term.

Combinatorial Online Forecasting

In this article, we will discuss online prediction problems in which some combinatorial set or a set of discrete structures is used as the decision space.

Examples of combinatorial sets that can be used in combinatorial online prediction problems are as follows

- Expert: The expert problem described above can also be regarded as a combinatorial online prediction problem. The expert problem can be regarded as a combinatorial online forecasting problem, because the entire combination of selecting one element from n elements is a decision space. The size of the set is n.
- k-sets: A k-set is a set consisting of the entire combination of choosing k from n elements. The problem of multiple ad selection can be regarded as an online prediction problem on the k-set. The size of the set is O(nk).
- Permutation: A set consisting of an entire vertical sequence of n elements. Permutations are used in online prediction problems in ranking, assignment, and scheduling problems. The size of the set is n!=Ω((n/e)n).
- Path: Given a fixed graph, a starting point and an ending point, the path set is the entire set of paths from the starting point to the ending point. It is used in online shortest path problems. The size of the set is generally exponential to the size of the graph.