Continuous optimization in machine learning


In today’s highly information-oriented society, extracting useful insights from data has become essential for making scientific discoveries and for advancing industry, the economy, and society. Data today are growing larger, higher-dimensional, and more diverse. As a fundamental technology for handling such data appropriately, the field of machine learning has emerged in recent years and is developing rapidly. Traditional academic fields that deal with data, such as statistics and data science, have developed modeling techniques mainly in order to perform statistical inference appropriately.

Machine learning, on the other hand, is unique in that it focuses more on the development of efficient computational methods. Against the backdrop of the big data era, various methods developed in the field of machine learning are having a significant impact on society.

Here, we describe the method of continuous optimization, which is an indispensable computational method for constructing machine learning algorithms.

Continuous optimization in machine learning is a family of methods for solving optimization problems in which the variables take real values, such as optimizing the weights and biases of a neural network, estimating the parameters of a regression model, and estimating the parameters of a support vector machine (SVM).

Typical methods for continuous optimization include the gradient descent, steepest descent, conjugate gradient, quasi-Newton, and L-BFGS methods. These algorithms use information such as the gradient and the Hessian matrix of the objective function, as described in “Hesse Matrices and Regularity”, to iteratively update the variables so that they approach an optimal solution.
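As a minimal sketch of how such a gradient-based update loop looks in practice (the quadratic objective and all parameter values here are illustrative assumptions, not taken from any specific method above):

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Minimize a differentiable function given its gradient `grad`,
    starting at `x0`, using a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # stop when the gradient is nearly zero
            break
        x = x - step * g             # step in the direction of steepest descent
    return x

# Example: minimize f(x) = (1/2) x^T A x - b^T x, whose gradient is Ax - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = gradient_descent(lambda x: A @ x - b, np.zeros(2))
```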

In continuous optimization, it is often necessary to consider not only the value of the objective function but also constraints. To solve constrained optimization problems, we use dedicated methods such as methods based on the KKT conditions, interior point methods, projected gradient methods, penalty methods, and barrier function methods.
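As one concrete example from this family, the following is a minimal sketch of the projected gradient method for a box constraint; the objective, the constraint set, and the step size are illustrative assumptions:

```python
import numpy as np

def projected_gradient(grad, project, x0, step=0.1, max_iter=1000):
    """Projected gradient method: take an unconstrained gradient step,
    then project the iterate back onto the feasible set."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        x = project(x - step * grad(x))
    return x

# Example: minimize ||x - c||^2 subject to 0 <= x <= 1 (a box constraint),
# where the projection is simple elementwise clipping.
c = np.array([1.5, -0.3])
x_star = projected_gradient(
    grad=lambda x: 2 * (x - c),
    project=lambda x: np.clip(x, 0.0, 1.0),
    x0=np.zeros(2),
)
```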

By selecting these methods appropriately and tuning their parameters, higher prediction accuracy and faster learning can be achieved. Care is required when the optimization problem is nonconvex, however, since some initial values may lead only to a locally optimal solution.

The details are as follows.

Fundamental Theory for Optimization in Machine Learning

To describe optimization algorithms and investigate their theoretical properties, we use tools from calculus and linear algebra. In this section, we review these fundamentals. First, we fix the notation for matrices. The set of all n×n matrices (square matrices of order n) with real entries is denoted ℝn×n. More generally, ℝn×m denotes the set of all n×m matrices. When the entries of a matrix A∈ℝn×m are written out explicitly as A=(aij)i=1,…,n, j=1,…,m, we sometimes abbreviate this to A=(aij).

In this article, we define convex functions and convex sets and discuss their properties. Convexity of a function or a set is of great importance for optimization problems: when an optimization problem is convex, every locally optimal solution can be shown to be globally optimal. This property can be used to design efficient optimization algorithms. (subgradient, subdifferential, conjugate function, closed proper convex function, strongly convex function, upper and lower bounds on function values, Hessian matrix, epigraph, Taylor’s theorem, relative interior, affine hull, continuity, convex envelope, convex function, convex set)
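For reference, a function f: ℝn → ℝ is convex if, for all x, y ∈ ℝn and all λ ∈ [0,1],

\[f(\lambda x + (1-\lambda)y)\ \leq\ \lambda f(x) + (1-\lambda)f(y),\]

and it is this inequality that rules out local minima that are not global.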

Unconstrained optimization for continuous optimization in machine learning (machine epsilon, stopping conditions without scaling, stopping conditions with scaling, Taylor’s theorem, stopping conditions for optimization algorithms, Hessian matrix)

In this article, we will discuss stopping conditions for iterative methods among the optimality conditions for optimization problems. (scaling, influence, machine epsilon, algorithm stopping conditions, iterative methods, convex optimal solutions, constrained optimization problems, global optimal solutions, local optimal solutions, convex functions, second-order sufficient conditions, second-order necessary conditions, first-order necessary conditions)

In this article, we will discuss the optimality conditions for optimization problems and describe the stopping conditions for iterative methods. Line search, coordinate descent, and the steepest descent method are described as basic algorithms for unconstrained optimization problems.
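The following is a minimal sketch of a backtracking (Armijo) line search, the basic building block shared by these methods; the constants alpha, rho, and c are conventional illustrative choices:

```python
import numpy as np

def backtracking_line_search(f, grad, x, d, alpha=1.0, rho=0.5, c=1e-4):
    """Shrink the step size until the Armijo sufficient-decrease condition
    f(x + alpha*d) <= f(x) + c*alpha*grad(x)^T d holds.
    `d` must be a descent direction, e.g. d = -grad(x) for steepest descent."""
    fx, g = f(x), grad(x)
    while f(x + alpha * d) > fx + c * alpha * (g @ d):
        alpha *= rho  # shrink the step until sufficient decrease is achieved
    return alpha
```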

This section describes Newton’s method, an optimization method that uses the Hessian matrix. Newton’s method is a computational method that uses not only the gradient but also the information in the Hessian matrix. When the Hessian matrix can be computed efficiently, Newton’s method is useful as a fast optimization method.

To derive Newton’s method, suppose that an iterative method is being used to minimize a twice continuously differentiable function f(x) and that the computation has progressed to the point xk. Taylor expansion of f at the point xk yields
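\[f(x) \approx f(x_k) + \nabla f(x_k)^\top(x - x_k) + \frac{1}{2}(x - x_k)^\top \nabla^2 f(x_k)(x - x_k).\]

Minimizing this quadratic model with respect to x (assuming ∇2f(xk) is positive definite) gives the Newton update

\[x_{k+1} = x_k - \nabla^2 f(x_k)^{-1}\nabla f(x_k).\]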

In this article, we will discuss the Gauss-Newton method and the natural gradient method, which are optimization methods based on approximations of the Hessian matrix. In the Gauss-Newton method, the second term of the Hessian matrix ∇2f is omitted, and the following equation is used as the search direction at the point xk.
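For the least-squares objective f(x) = (1/2)Σi ri(x)², to which the Gauss-Newton method is typically applied, with residual vector r(x) and its Jacobian J(x), the full Hessian is ∇2f = JᵀJ + Σi ri∇2ri; omitting the second term gives the standard Gauss-Newton direction

\[d_k = -\left(J(x_k)^\top J(x_k)\right)^{-1} J(x_k)^\top r(x_k).\]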

In machine learning and statistics, a distance different from the Euclidean distance may arise naturally on the space of the optimization parameters x. For example, suppose that two probability density functions on the sample space Ω are given as p(⋅,x), p(⋅,x′) with parameters x, x′ ∈ ℝn. The Hellinger distance is sometimes used to measure the distance between them.
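With one common normalization, the squared Hellinger distance between the two densities is

\[H\big(p(\cdot,x),\,p(\cdot,x')\big)^2 = \frac{1}{2}\int_\Omega \left(\sqrt{p(\omega,x)} - \sqrt{p(\omega,x')}\right)^2 d\omega.\]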

Conjugate gradient methods are used for large-scale optimization problems because of their low computational cost and low memory requirements. We describe the conjugate gradient method, an optimization method with better convergence than the steepest descent method. First, we describe the conjugate direction and conjugate gradient methods for the optimization of convex quadratic functions. Next, we discuss some extensions to general nonlinear functions.
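A minimal sketch of the linear conjugate gradient method for minimizing a convex quadratic f(x) = (1/2)xᵀAx − bᵀx, equivalently solving Ax = b for a symmetric positive definite A; the matrix and vector in the usage comment are illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Minimize (1/2) x^T A x - b^T x for symmetric positive definite A
    (equivalently, solve Ax = b), storing only a few vectors."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x            # residual = negative gradient at x
    d = r.copy()             # first search direction: steepest descent
    for _ in range(len(b)):  # at most n steps in exact arithmetic
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # makes the new direction A-conjugate
        d = r_new + beta * d
        r = r_new
    return x

# Example: A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, 1.0])
# conjugate_gradient(A, b) agrees with np.linalg.solve(A, b).
```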

The quasi-Newton method avoids the computation of the Hessian matrix in Newton’s method and provides an efficient optimization method. In this article, we will review typical algorithms, discuss their relation to geometric structures on the space of positive definite matrices, and describe how to exploit the sparsity of the Hessian matrix.
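As an illustration, the following is a minimal sketch of the single update step of BFGS, the most widely used quasi-Newton algorithm, which maintains an approximation H to the inverse Hessian:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H, given
    the step s = x_{k+1} - x_k and the gradient change
    y = grad f(x_{k+1}) - grad f(x_k).
    Requires the curvature condition y^T s > 0 to keep H positive definite."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    # H_{k+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    return V @ H @ V.T + rho * np.outer(s, s)
```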

In the trust-region method, the objective function f(x) is first approximated in a neighborhood of the current iterate by a function m(x) that is easy to optimize, and a trust region is set in which the approximation is considered valid. Then m(x) is approximately minimized over the trust region, and the solution is updated. By contrast, in gradient methods that use a line search, the descent direction is determined first and the step size afterwards; the trust-region method reverses this order, bounding the step length first and then choosing the direction.
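In symbols, each iteration approximately solves the subproblem

\[\min_{d\in\mathbb{R}^n}\ m_k(d) = f(x_k) + \nabla f(x_k)^\top d + \frac{1}{2} d^\top B_k d\quad s.t.\ \|d\|\leq \Delta_k,\]

where Bk is the Hessian or an approximation to it, and the trust-region radius Δk is enlarged or shrunk according to how well mk predicted the actual decrease in f.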

In this article, we will discuss optimality conditions for equality-constrained optimization problems. First, consider the following optimization problem.

\[\min_{x\in\mathbb{R}^n}f(x)\ s.t.\ g_i(x)=0,\ i=1,\dots,p\quad(1)\]

Assume that the number of equality constraints p is smaller than the dimension n of the variables (p<n). Also assume that the functions f, g1,…,gp are all differentiable. The set of points satisfying the equality constraints gi(x)=0 (i=1,…,p) is in general not a convex set. However, if f is a convex function and the gi(x) are all affine (linear), then (1) is a convex optimization problem.
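Under a suitable constraint qualification (for example, linear independence of the constraint gradients ∇gi(x*)), the first-order necessary condition for a local optimum x* of (1) is the existence of Lagrange multipliers λ1,…,λp such that

\[\nabla f(x^*) + \sum_{i=1}^p \lambda_i\nabla g_i(x^*) = 0,\qquad g_i(x^*) = 0\ (i=1,\dots,p).\]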

We discuss optimality conditions for inequality-constrained optimization problems. In this section, we deal with the following optimization problem. \[\min_{x\in\mathbb{R}^n} f(x)\ s.t.\ g_i(x)\leq 0,\ i=1,\dots,p\quad(2)\] The functions f, g1,…,gp are assumed to be differentiable. The first step is to state the first-order necessary condition for a locally optimal solution; from this condition, stationary points can be obtained. To further confirm that a solution is locally optimal, second-order necessary and sufficient conditions are presented. If the functions f, g1,…,gp are all convex, then (2) is a convex optimization problem, and we show that the first-order necessary condition is then also a sufficient condition for a solution to be globally optimal.
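For reference, the first-order (KKT) conditions for this problem state that at a local optimum x*, under a suitable constraint qualification, there exist multipliers μ1,…,μp such that

\[\nabla f(x^*) + \sum_{i=1}^p \mu_i\nabla g_i(x^*) = 0,\quad g_i(x^*)\leq 0,\quad \mu_i\geq 0,\quad \mu_i g_i(x^*) = 0\ (i=1,\dots,p).\]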

This section describes optimization algorithms for constrained optimization problems: the active-set (effective constraint) method, the barrier function method, and the penalty function method. These methods are applied mainly to the primal problem.

Methods based on the Lagrangian function are useful for solving constrained optimization problems. Moreover, as in sparse regularized learning, even an unconstrained optimization problem can often be solved easily by recasting it as a linearly constrained optimization problem and applying a Lagrangian method. In this section, we discuss such methods based on the Lagrangian function and its augmented variants.

First, we describe the dual ascent method.
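For the equality-constrained problem (1), with the ordinary Lagrangian L(x,λ) = f(x) + Σi λi gi(x), the dual ascent method alternates

\[x^{k+1} = \arg\min_x\ L(x,\lambda^k),\qquad \lambda^{k+1} = \lambda^k + \alpha_k\, g(x^{k+1}),\]

where αk > 0 is a step size and g(x) = (g1(x),…,gp(x))ᵀ; the multiplier update is a gradient ascent step on the dual function.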

We describe the augmented (extended) Lagrangian method, which improves on the plain Lagrangian function method. In order for the penalty function method and the barrier function method described in “Optimization for the Main Problem in Machine Learning” to work well, it was necessary to let the strength ck of the penalty or barrier diverge to infinity (ck→∞). This gradually worsens the conditioning of the optimization problem: a slight change in a variable can cause a large change in the objective function. Therefore, to keep the optimization stable, appropriate scheduling according to the penalty strength is necessary.

The augmented Lagrangian method described here does not need to increase the penalty in this way. Therefore, the condition number of the internal problem to be solved does not get worse with each update, and optimization can be performed stably.

The ordinary Lagrangian function used in the dual ascent method has no built-in mechanism that drives the iterates of the primal problem toward satisfying the constraints. Therefore, the function f must have properties such as strong convexity to guarantee convergence. The augmented Lagrangian function, on the other hand, is defined by adding to the ordinary Lagrangian a penalty term that pushes the values of the constraint functions toward zero.
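For the equality-constrained problem (1), the augmented Lagrangian with penalty parameter ρ>0 takes the standard form

\[L_\rho(x,\lambda) = f(x) + \sum_{i=1}^p \lambda_i g_i(x) + \frac{\rho}{2}\sum_{i=1}^p g_i(x)^2,\]

and each iteration minimizes Lρ(x,λk) over x and then updates the multipliers by λk+1 = λk + ρ g(xk+1).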

In this article, we describe the alternating direction method of multipliers (ADMM) within the framework of the augmented Lagrangian method.
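In its standard form, ADMM addresses problems of the form min f(x)+g(z) s.t. Ax+Bz=c and cycles through the updates

\[\begin{aligned} x^{k+1} &= \arg\min_x\ L_\rho(x, z^k, \lambda^k),\\ z^{k+1} &= \arg\min_z\ L_\rho(x^{k+1}, z, \lambda^k),\\ \lambda^{k+1} &= \lambda^k + \rho\,(Ax^{k+1} + Bz^{k+1} - c), \end{aligned}\]

where Lρ is the augmented Lagrangian of the problem.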

The upper-bound minimization algorithm (also known as majorization-minimization, MM) is a widely used iterative method that generates a sequence of points along which the objective function decreases monotonically. It is used in particular when the objective function is difficult to optimize directly. In addition, understanding the upper-bound minimization algorithm provides a unified view of the other algorithms that fall within this framework.
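Formally, given a surrogate u(x; xk) with u(x; xk) ≥ f(x) for all x and u(xk; xk) = f(xk), the update xk+1 = argminx u(x; xk) guarantees monotone descent:

\[f(x_{k+1}) \leq u(x_{k+1}; x_k) \leq u(x_k; x_k) = f(x_k).\]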

  • Support Vector Machines and Optimization
  • Sparse learning
  • Optimization on Matrix Space
