Machine Learning Professional Series “Continuous Optimization for Machine Learning” Reading Memo

Summary

Continuous optimization in machine learning refers to methods for solving optimization problems in which the variables take real values, such as optimizing the weights and biases of a neural network, estimating the parameters of a regression model, or estimating the parameters of an SVM. Typical methods for continuous optimization include the gradient descent, steepest descent, conjugate gradient, quasi-Newton, and L-BFGS methods; these algorithms use information such as the gradient and the Hessian matrix of the objective function to update the variables so that they approach an optimal solution. In continuous optimization, it is necessary to consider not only the value of the objective function but also the constraints. To solve optimization problems with constraints, methods for constrained optimization are used, such as methods based on the KKT conditions, interior point methods, projected gradient methods, penalty methods, and barrier function methods. Here we discuss continuous optimization in machine learning based on the Machine Learning Professional Series volume "Continuous Optimization for Machine Learning".

Reading Memo.

Here's where you can learn the most with the least effort! The boundary field is interesting! A solid book covering everything from the basics to the cutting edge. You can learn the fundamental knowledge essential for machine learning. It is not an easy read. There has never been a textbook as good as this.

Part I. Introduction

Chapter 1 Introduction

1.1 Inference and Computation in Machine Learning

In machine learning and statistics, one of the main goals is to make accurate estimates and predictions
An approach that is often used is to properly model the process by which data is generated and minimize a statistically valid loss measure (error metric).
What kind of loss can be used to achieve the desired level of statistical accuracy?
Computational efficiency is also important
Algorithm Examples
Maximum likelihood method
Fisher's score method (an optimization method similar to Newton's method)
Standard method for estimating parameters of generalized linear models in regression analysis
Bayesian method
Markov chain Monte Carlo methods
Computation of high-dimensional integrals for obtaining the posterior distribution
Computational methods for efficient inference and prediction using the structure of statistical models.
EM algorithm for parameter estimation of Gaussian mixture distributions
Error backpropagation for hierarchical neural network models
Belief propagation for Bayesian networks
Boosting, an example of ensemble learning

1.2 Description of Optimization Problems

1.2.1 Various optimization problems
A set of real numbers ℝ
N-dimensional Euclidean space ℝn
Consider an element x of n-dimensional space as a column vector
x = (x1,...,xn)ᵀ
T is the transpose of a vector or matrix
The norm (length) ∥x∥ of a vector x
Euclidean norm
The distance between two points x, y ∈ ℝn is ∥x - y∥ using the Euclidean norm
An optimization problem is the problem of optimizing (specifically, minimizing or maximizing) a given function 𝒇 : ℝn → ℝ under a constraint x ∈ 𝑆 ⊂ ℝn
If minimization is the goal
Min" means to minimize the function
x under Min is the variable to be optimized
The "subject to ~" means "under the condition of ~".
Sometimes it is abbreviated to "s.t.".
Let 𝑓 be the objective function
𝒇(x) is defined on Euclidean space and often satisfies properties such as continuity and differentiability
Continuous optimization
𝑺 is a feasible region
The condition "x ∈ 𝑺" defined by 𝑺 is called a constraint.
Example 1.1 : School placement problem
In a village where there is no elementary school, there are 5 houses, and their locations are represented by coordinates αi ∈ ℝ2 (i=1,2,3,4,5).
If we are to build the school at a point in 𝑺 ⊂ ℝ2, where should we put it?
Candidate objective functions
Build it where the square of the distance to the school is minimized on average.
Build it where the distance to the school is minimized on average.
Minimize the distance from the farthest house to the school
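As an illustration (not from the book), the three candidate objectives can be compared numerically. A minimal Python sketch with hypothetical house coordinates; note that the mean-squared-distance objective is minimized exactly by the centroid, while the other two generally have no closed form:

```python
import numpy as np
from scipy.optimize import minimize

# House locations a_i in R^2 (hypothetical example data)
A = np.array([[0.0, 0.0], [2.0, 1.0], [1.0, 3.0], [4.0, 2.0], [3.0, 4.0]])

f_sq = lambda x: np.mean(np.sum((A - x) ** 2, axis=1))      # mean squared distance
f_avg = lambda x: np.mean(np.linalg.norm(A - x, axis=1))    # mean distance
f_max = lambda x: np.max(np.linalg.norm(A - x, axis=1))     # distance to farthest house

x0 = A.mean(axis=0)  # centroid: the exact minimizer of f_sq
for f, name in [(f_sq, "mean squared"), (f_avg, "mean"), (f_max, "max")]:
    # Nelder-Mead is derivative-free, which also handles the nonsmooth f_max
    res = minimize(f, x0, method="Nelder-Mead")
    print(name, res.x, res.fun)
```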
Example 1.2 Transportation problem
When transporting goods from production sites S1 and S2 to consumption sites T1, T2, and T3, what is the delivery plan that will result in the lowest transportation cost?
The total transportation cost is
The transportation cost from the production site Si to the consumption site Tj is cij per unit of goods.
Let the quantity transported be xij
Let x ∈ ℝ6 be x = (x11,...,x23)ᵀ
Each production site produces at most ai(i=1,2)
At each consumption location, at least bj(j=1,2,3) is consumed
Model
Constraints
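This transportation problem is a small linear program; a minimal sketch with scipy.optimize.linprog, where the costs cij, supplies ai, and demands bj are hypothetical numbers:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical unit costs c_ij from sources S1, S2 to sinks T1, T2, T3
c = np.array([4.0, 6.0, 9.0,   # c11, c12, c13
              5.0, 3.0, 8.0])  # c21, c22, c23
a = [30, 40]        # supply limits a_i
b = [20, 25, 15]    # demand lower bounds b_j

# Supply constraints: sum_j x_ij <= a_i
A_sup = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)
# Demand constraints: sum_i x_ij >= b_j  <=>  -sum_i x_ij <= -b_j
A_dem = np.array([[1, 0, 0, 1, 0, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 0, 1]], dtype=float)

res = linprog(c,
              A_ub=np.vstack([A_sup, -A_dem]),
              b_ub=np.concatenate([a, -np.asarray(b, float)]),
              bounds=[(0, None)] * 6)
print(res.x.reshape(2, 3), res.fun)  # optimal shipment plan and total cost
```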
Feasible solution
A point that satisfies the constraints of the optimization problem
Optimal solution
Global optimal solution
Global solution
Minimizer (maximizer)
Optimal value
Infimum (supremum) of the objective function
The existence of an optimal value does not necessarily mean the existence of an optimal solution.
Neighborhood
ε-neighborhood of a point x ∈ ℝn: B(x,ε) = {y ∈ ℝn | ∥x − y∥ < ε}
Local optimal solution
Unconstrained optimization problem
Equality-constrained optimization problem
Inequality-constrained optimization problem
Possibility of the existence of an optimal solution
S is the empty set, and there are no feasible points
S is not empty, and an optimal solution exists
S is not empty, and no optimal solution exists
Minimum value case: if f can be made arbitrarily small on S, the optimal value is −∞
1.2.2 Iterative methods and convergence speed
Many optimization algorithms are described with an "iterative method" as the basic procedure
The basic iterative algorithm
Initialization: set the initial value x0, k ← 0
1. If the stopping condition is satisfied, output the result and terminate the calculation
2. Use the function f(x), the set S, the history of points x0,...,xk, etc. to compute xk+1
3. Set k ← k+1 and return to step 1
In the iterative method, we want the numerical solutions x0, x1,... to converge to the optimal solution x* as quickly as possible
The stopping condition of the algorithm should be set appropriately
Evaluation measures of convergence speed
Measure the error of the numerical solution xk from the optimal solution by ∥xk − x*∥
Linear convergence of {xk}∞k=0 to x*
limk→∞ ∥xk − x*∥ = 0 is guaranteed
Superlinear convergence
Quadratic convergence
Memo: meaning of "sup" and "inf" and the difference from "max" and "min"
max A : maximum value of A
For any x ∈ A, x ≤ c (c is not less than any element, so c is an upper bound of A)
min A : minimum value of A
sup A : supremum of A (least upper bound)
For any x ∈ A, x ≤ c (c is an upper bound of A)
For any real number r less than c, there exists an x ∈ A such that r < x
If xᵀAx > 0 for any x ≠ 0,
A is called a positive definite matrix
A ≻ O
2.1.4 Projection onto a subspace
For a subspace S of n-dimensional space, the orthogonal complement S⊥ of S
Projection x1 of a vector x onto a subspace S
The projection x1 is characterized as the point on S that minimizes the distance to x
The eigenvalues of the projection matrix are 0 or 1
2.1.5 Rank-One Update of a Matrix
The sum of a matrix A and a rank-one matrix, A + xyᵀ, is called a rank-one update of A
It is used in the natural gradient method and the quasi-Newton method
Formulas have been devised to perform various operations on a rank-one-updated matrix efficiently, using information about the matrix before the update
Sherman-Morrison formula
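For reference, the Sherman-Morrison formula states (standard form; A invertible, 1 + yᵀA⁻¹x ≠ 0):

(A + xyᵀ)⁻¹ = A⁻¹ − (A⁻¹xyᵀA⁻¹) / (1 + yᵀA⁻¹x)

Given A⁻¹, the right-hand side costs only O(n²) to evaluate, instead of O(n³) for a fresh inversion.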
2.1.6 Various Norms
A norm is a quantity that corresponds to the length of a vector.
The most basic norm is the Euclidean norm
Definition 2.3 Definition of norm
Schwarz's inequality holds for the Euclidean norm
Hölder's inequality, an extension of Schwarz's inequality, holds for the p-norm
2.1.7 Functions on Matrix Space

2.2 Fundamentals of Convex Analysis

2.2.1 Convex Sets and Convex Functions
Definition of a convex function
For any points P=(x,f(x)) and Q=(y,f(y)) on the graph of a function f(x), we say that the function f(x) is convex (downward) if the graph of f(x) lies below the line segment PQ in the region between these two points.
For any x, y, and any λ satisfying 0≤λ≤1
λf(x) + (1-λ)f(y) ≥ f(λx + (1-λ)y)
Let z = λx + (1−λ)y; then R = (z, f(z)) and S = (z, λf(x) + (1−λ)f(y))
The inequality shows that S is above R
Definition 2.4 Convex set
A set S is called a convex set if u, v ∈ S, 0≤λ≤1 ⇨(1-λ)u + λv ∈ S for any non-empty set S⊂ℝn.
Examples of Convex Sets and Non-Convex Sets
Properties of convex sets
Any convex combination of finitely many points x1,...,xk ∈ S contained in a convex set S is itself contained in S
If S and T are convex sets, then S ∩ T is also a convex set
For any set S that is not necessarily a convex set, we can define the smallest convex set that contains S
Definition 2.5 Convex hull (convex-hull)
Corollary 2.6
Definition 2.7 Convex functions
Properties of convex functions
Definition 2.8 µ-Strongly Convex Functions
2.2.2 Minimization of convex functions
Definition 2.9 Optimization of Convex Functions
Corollary 2.10
2.2.3 Continuity of Convex Functions
Affine hull
Theorem 2.11 Continuity of convex functions
2.2.4 Differentiable convex functions
Theorem 2.12 Characterization of convex functions by gradient
Theorem 2.13 Characterization of convex functions by Hesse matrices
Theorem 2.14 Upper and lower bounds for function values
2.2.5 Conjugate functions
If a function f is differentiable, the tangent plane to its graph at each point is determined
Closed convex functions and nonconvex functions
Theorem 2.15 Properties of conjugate functions
Theorem 2.16 Strong convexity and smoothness
2.2.6 Subgradient and subdifferential
Definition 2.17 Subgradient and subdifferential
Theorem 2.18 Differentiable convex functions and the subdifferential
Theorem 2.19 Subgradient and conjugate functions

Part II: Unconstrained Optimization

Chapter 3: Optimality Conditions and Algorithm Stopping Conditions

3.1 Local Optimal Solutions and Optimality Conditions

Unconstrained optimization problem for function f:ℝn→ℝ
In general optimization of f, which is not necessarily convex, finding a local optimal solution is a realistic goal
If the objective function is differentiable, then the gradient at a local optimal solution is the zero vector
Theorem: First order necessary condition
x* satisfying equation 3.2 is called the stationary point of function f
Whether a stationary point is a local optimal solution can also be determined using the information of the Hesse matrix
Theorem: Second order sufficient condition
By approximating the objective function f by a quadratic function using Taylor's theorem, we can derive various properties
Theorem: Second-order necessary condition
Using the first- and second-order necessary conditions, we can narrow down the candidates for a local optimal solution x* from among the stationary points
The candidate points for the locally optimal solution satisfy ∇f(x)=0
Candidate points whose Hesse matrices have negative eigenvalues are not local optimal solutions
Example
Function f(x1,x2) = x1⁴ + x2⁴ + 3x1²x2² − 2x2²
First-order condition
(x1,x2) = (0,0), (0,±1)
Hesse matrix
By the second-order sufficient condition, f attains the minimum value −1 at (x1,x2) = (0,±1)
(x1,x2) = (0,0) is not a local optimum because it does not satisfy the second-order necessary condition
Among the stationary points, the points (0,±1) attain a local minimum
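A quick symbolic check of this example (a sketch using sympy; not from the book):

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2", real=True)
f = x1**4 + x2**4 + 3*x1**2*x2**2 - 2*x2**2

grad = [sp.diff(f, v) for v in (x1, x2)]
stationary = sp.solve(grad, (x1, x2), dict=True)  # (0,0), (0,1), (0,-1)
H = sp.hessian(f, (x1, x2))
for pt in stationary:
    print(pt, H.subs(pt).eigenvals(), f.subs(pt))
# At (0,±1) the Hessian is positive definite and f = -1 (local minima);
# at (0,0) the Hessian has a negative eigenvalue, so it is not a local minimum.
```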
Theorem: First order sufficient condition for convex optimization
Summary of the minimization problem

3.2 Optimality conditions for set constraints

Consider a constrained optimization problem defined by a function f:ℝn→ℝ and a set S ⊂ ℝn
In many applications, S is considered as a convex set
Definition: feasible direction
Theorem: First-order necessary condition for problem (3.4)
When the above condition holds for every feasible direction d of S at a point x*, x* is called a stationary point of problem (3.4)
For an interior point of S, this is equivalent to ∇f(x*) = 0
Theorem: First-order sufficient condition for problem (3.4)
When problem (3.4) is a convex optimization problem, being a stationary point is equivalent to being an optimal solution
The optimal solution can thus be obtained by finding a stationary point

3.3 Stopping Condition of Optimization Algorithm

If ∇f(xk) = 0 and ∇²f(xk) ≻ O are satisfied, xk is a local optimal solution, and the iterative method stops
In actual computation, it is difficult to check whether ∇f(xk) = 0 holds exactly because of numerical error
Instead, the algorithm stops when ∥∇f(xk)∥ < ε
How to choose ε
Let εmach be the machine epsilon of the computer's floating-point arithmetic
Example: εmach ≈ 10⁻¹⁶ (double precision)
Assumptions
The function f is assumed to be twice differentiable
We have ∇f(x*)=0 and ∇2f(x*)≻O
The difference of the function values is f(xk)-f(x*)=h
The difference in solutions is δ=xk-x*
By Taylor's theorem, there exists a suitable t ∈ (0,1) such that the above equation holds
If x* and xk are close enough, then ∥δ∥=O(√h)
From Taylor's theorem for gradient, the above equation becomes
∥∇f(xk)∥ is of the same order as ∥δ∥
Stopping condition when computing the function f(x) with εmach accuracy
Considering the effect of scaling, we can use the above equation
Since the left-hand side is too strict, the above stopping condition is proposed.
In the actual numerical calculation, we need to decide whether to stop or not by observing not only the gradient but also the behavior of the function value f(xk) and the iterative solution xk.
The final proposed stopping condition
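A sketch of such a combined stopping test (function names and tolerances are illustrative, not the book's):

```python
import numpy as np

def should_stop(grad_k, f_prev, f_k, x_prev, x_k, eps=1e-8):
    """Stop on a small gradient, or on a stagnating function value
    together with a stagnating iterate; relative scaling makes the
    test robust to the overall magnitude of f and x."""
    small_grad = np.linalg.norm(grad_k) < eps * max(1.0, abs(f_k))
    small_step = np.linalg.norm(x_k - x_prev) < eps * max(1.0, np.linalg.norm(x_k))
    small_decrease = abs(f_prev - f_k) < eps * max(1.0, abs(f_k))
    return small_grad or (small_step and small_decrease)
```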

Chapter 4: Basics of the Gradient Method

Overview

Algorithms that use information about the gradient of a function to find a solution to an optimization problem.
Representative examples include the coordinate descent method and the steepest descent method.

4.1 Line search method

Line search method (line search)
First, find a direction in which the objective function decreases
Next, calculate the step size, which determines how far x should be moved in that direction.
The point is then updated using the step size.
Steepest descent method
Newton method
Quasi-Newton method
Determine if the function value has decreased sufficiently
Armijo condition
Wolfe condition
Allowable range of step width
Strong Wolfe condition
Goldstein condition
A method to obtain an appropriate step size under the Armijo condition by devising a search strategy for the parameter α.
Backtracking Method Algorithm
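A minimal Armijo backtracking sketch (parameter names α0, ρ, c1 are illustrative defaults):

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha0=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha until the Armijo condition
    f(x + alpha*d) <= f(x) + c1*alpha*grad_f(x)^T d holds.
    d must be a descent direction, or the loop may not terminate."""
    alpha, fx, slope = alpha0, f(x), c1 * (grad_f(x) @ d)
    while f(x + alpha * d) > fx + alpha * slope:
        alpha *= rho
    return alpha
```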
Supplement
Approaches to the step size
Methods that find the exact minimizer along the search direction
Conjugate gradient method
Methods that only require a sufficient decrease
Backtracking method
Line search may be combined with an annealing method to escape local minima

4.2 Iterative Methods with Line Search

Optimize the function f using an iterative method with line search
Suppose the computation has proceeded to the point xk
Proceed from xk in the search direction dk, choose an appropriate step size αk, and update the point xk as in the above equation
The search direction dk is chosen to satisfy the above inequality.
dk is called a descent direction of f at xk.
From Taylor's theorem, the above equation becomes
For sufficiently small αk, f(xk + αkdk) − f(xk) < 0, so the method proceeds in a direction of decreasing function value.
Algorithm
When the Wolfe condition is used for the line search, the above theorem holds and guarantees convergence of the algorithm.

4.3 Coordinate Descent Method

The method of optimizing the objective function along each coordinate (coordinate descent method)
Let the vector ei be the unit vector whose i-th element is 1 and whose other elements are 0.
The search direction of the coordinate descent method is selected from ±e1,...,±en.
How to choose the search direction
Select uniformly at random
Cycle through 1,2,...,n,1,2,...,n
When searching cyclically, convergence can be guaranteed by determining the step size appropriately.
Select the coordinate where the absolute value of the corresponding element of the gradient ∇f is largest.
Find the lower bound of the cosine cosθk in Theorem 4.2
Algorithm
No convergence guarantee when the objective function is non-differentiable

4.4 Steepest Descent Method

When the gradient vector is easy to calculate, the steepest descent method is a convenient choice
4.4.1 Optimization Algorithm
A type of iterative method that uses line search
Use dk = −∇f(xk) as the search direction
Proof
Algorithm
Consider the Wolfe condition as the criterion for determining the step size
4.4.2 Convergence Speed of the Steepest Descent Method
Example: applying the steepest descent method to a convex quadratic function
Q ∈ ℝn×n is a positive definite symmetric matrix
The optimal solution is x*, and the optimal value is f(x*) = 0
How the step size is determined
Using this result, we get
Finally
Let λ be an eigenvalue of Q
Theorem: convergence of the steepest descent method
Condition number: the ratio of the maximum to the minimum eigenvalue
The closer the condition number is to 1, the more circular the contour lines of f(x); the larger it is, the more elongated they become
Corollary: Kantorovich's inequality
4.4.3 The Steepest Descent Method with Backtracking
The steepest descent method using backtracking line search is widely used as a practical algorithm.
Convergence of this method

4.5 Applications to Machine Learning

4.5.1 Coordinate Descent and Boosting
Application of the coordinate descent method to machine learning
Suppose we have observed binary-labeled data.
Predict the output b ∈ {+1,−1} for new data a by the sign of the above equation
If H(a) is positive, the output b is +1; if H(a) is negative, the output b is −1
h1(a),...,hn(a) are basis functions taking values in {+1,−1}.
Each basis function can be simple, but complex functions can also be used depending on the application.
Boosting is a general method for constructing a classifier with high prediction accuracy by successively combining simple basis functions.
The coefficients x = (x1,...,xn) of the function Hx(a) are determined appropriately from the observed data.
If biHx(ai) > 0 for the data (ai,bi), the function Hx classifies it correctly.
Determine x so that biHx(ai) > 0 holds for as many observed data points as possible.
A learning method that minimizes the loss function (above equation) has been proposed.
Calculation of exponential loss
Let the weights of the data be the above equation
1{A} takes the value 1 if A is true and 0 if A is false
The coordinate descent direction that decreases the objective function the most corresponds to the direction among ±h1,...,±hn that minimizes the above equation
Algorithm
Figure 1: learning algorithm using the coordinate descent method as the optimization method
Algorithm of Adaboost
Boosting
Generate a strong learning machine by combining a series of weak learning machines.
Add weights to the weak learning machines.
The difference between various types of boosting is the weighting of data points and hypotheses.
AdaBoost
AdaBoost is a typical example of boosting.
Adaptive boosting
Apply the weak classifiers in order from t=1 to t=T
Determine whether each classifier is correct or not
Affected by noisy data and outliers
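A compact sketch of AdaBoost over a fixed pool of weak classifiers (illustrative; the `stumps` pool and data are assumptions, and the weight update follows the exponential-loss view described above):

```python
import numpy as np

def adaboost(X, y, stumps, T):
    """stumps: list of callables h(X) -> predictions in {+1,-1}; y in {+1,-1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # data weights
    coef = np.zeros(len(stumps))     # coefficients x_j of H_x
    for _ in range(T):
        # weighted error of each weak classifier under the current weights
        errs = [np.sum(w[h(X) != y]) / w.sum() for h in stumps]
        j = int(np.argmin(errs))     # pick the best weak classifier
        eps = max(errs[j], 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        coef[j] += alpha
        w *= np.exp(-alpha * y * stumps[j](X))  # exponential-loss reweighting
    return lambda Xn: np.sign(sum(c * h(Xn) for c, h in zip(coef, stumps)))
```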
4.5.2 Error Back Propagation Method
Assume binary-labeled data is observed
Learning a multilayer perceptron
Various input-output relationships can be represented by adjusting the parameters
The parameters are matrices W(ℓ), ℓ=1,...,L, of appropriate sizes
The relation between input a and output b is defined as the above equation
Expressed in terms of the equation for b
The function Φi is a suitable nonlinear function
Example
Diagram of the calculation process
Equation
Compute the multilayer perceptron from the data
The function Φi is defined as above using the same function 𝞿:ℝ→ℝ for all ℓ
Define the objective function as the above equation
The chain rule propagates the error from the output layer to the input layer, as required for the gradient calculation.
Determine the gradient
Differentiation of the multilayer perceptron
Let u be the above equation
From the derivative of the ℓ+1 layer, we can find the derivative of the ℓ layer
Differentiation with respect to parameters
Differentiate sequentially from output to input
Problem
When the number of layers is large, the updates of the parameters near the input layer stagnate due to numerical error (vanishing gradients)
Stochastic gradient method
Updating with the full sum of squared errors f(W) at every step is costly
A method that updates the parameters each time a data point is observed is used
A type of algorithm that updates using only a small number of data points at each step in a stochastic setting
Consider the problem of learning the parameters W of the function Φ(a,W) from input and output data (a,b)
One data (at,bt) is given at time t
Update the parameter W as in the above equation
Suppose the observed data (a1,b1),...,(aN,bN) have already been collected
At each time t=1,2,..., draw (at,bt) uniformly at random from the data
With high probability, the above update converges to a local solution of f(W)
memo
A typical error back propagation method is equivalent to the gradient method
Use fast automatic differentiation to compute the gradient
Online algorithms related to the stochastic gradient method
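A minimal numpy sketch of a two-layer perceptron trained by backpropagation with the stochastic update described above (layer sizes, step size, and data are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 2)), rng.normal(size=(1, 16))
phi = np.tanh  # nonlinearity; phi' = 1 - tanh^2

def sgd_step(a, b, eta=0.1):
    global W1, W2
    # forward pass
    u1 = W1 @ a
    z1 = phi(u1)
    out = (W2 @ z1)[0]
    # backward pass: chain rule from the output layer to the input layer
    err = out - b                        # d(loss)/d(out) for the squared error
    gW2 = err * z1[None, :]
    delta1 = (W2.T * err)[:, 0] * (1 - z1**2)
    gW1 = np.outer(delta1, a)
    W2 -= eta * gW2
    W1 -= eta * gW1

# draw (a_t, b_t) uniformly at random from the data at each step
data = [(rng.normal(size=2), 1.0), (rng.normal(size=2), -1.0)]
for t in range(1000):
    a, b = data[rng.integers(len(data))]
    sgd_step(a, b)
```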

Chapter 5: The Newton Method

Overview

Optimization method using Hesse matrix
Calculations using not only the gradient but also the Hesse matrix information
memo
An iterative root finding algorithm for solving equations numerically.
The only conditions on the system of equations to be solved are differentiability on the domain and conditions on the sign of the derivatives; linearity is not required.

5.1 Derivation of the Newton Method

Newton's method is a computational method that uses not only the gradient but also the Hesse matrix information.
When the Hesse matrix is easy to compute, the Newton method is an efficient optimization method.
Minimization of twice-differentiable functions
Computation proceeds up to xk by iterative method
Taylor expansion of f at point xk yields the above equation
Assuming that the Hesse matrix ∇2f(xk) is positive definite, the δ that minimizes the quadratic part of the Taylor expansion is given by the above equation.
If we update xk with the above equation, we can expect the objective function value to decrease.
The method to set the search direction and step width as shown in the above equation is called the Newton method.
Calculation Algorithm of Newton's Method
The Hesse matrix of function f is positive definite.
xk is assumed to be sufficiently close to the optimal solution.
We need to be careful to apply Newton's method in practice.
The above equation is not guaranteed.
Theorem: Convergence of Newton's method
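A bare Newton iteration as derived above (a sketch; it assumes ∇²f(xk) remains positive definite and the starting point is good enough):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """x_{k+1} = x_k - [∇²f(x_k)]^{-1} ∇f(x_k) with unit step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # solve ∇²f δ = ∇f instead of inverting
    return x

# Example: f(x) = x1^4 + x2^2 has minimizer (0, 0)
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, [1.0, 1.0]))
```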

5.2 Covariance for Coordinate Transformations

When constructing an optimization problem based on observed real data, changing the unit system of the data transforms the optimization problem accordingly
How does the behavior of the optimization algorithm relate to the coordinate transformation?
In the case of Newton's method, it is "covariant" to the coordinate transformation.
The steepest descent method is not covariant

5.3 Modified Newton Method

Solves the problem of Newton's method
Applying the modified Cholesky factorization
Algorithm
Modified Newton method is not covariant

5.4 Gauss-Newton Method and Related Topics

Overview
In statistics and machine learning, the problem of minimizing the sum or mean of error functions for data often arises.
Write the objective function as a sum of squares of m functions
Calculating the gradient and the Hesse matrix
An optimization method that simplifies the computation and improves efficiency when the Hesse matrix is difficult to compute.
Gauss-Newton method
Levenberg-Marquardt method
5.4.1 Derivation of the Gauss-Newton Method
Search direction of Gauss-Newton method
No need to calculate the Hesse matrix
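A Gauss-Newton sketch for an objective of the form f(x) = ½Σi ri(x)², with an illustrative one-parameter curve-fitting example:

```python
import numpy as np

def gauss_newton(r, J, x0, tol=1e-10, max_iter=100):
    """Use J^T J in place of the Hessian: solve (J^T J) δ = -J^T r."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        res, jac = r(x), J(x)
        delta = np.linalg.solve(jac.T @ jac, -jac.T @ res)
        x = x + delta
        if np.linalg.norm(delta) < tol:
            break
    return x

# Fit y = exp(c*t) to data: residuals r_i(c) = exp(c*t_i) - y_i
t = np.linspace(0, 1, 20)
y = np.exp(0.7 * t)
r = lambda c: np.exp(c[0] * t) - y
J = lambda c: (t * np.exp(c[0] * t)).reshape(-1, 1)  # Jacobian of the residuals
print(gauss_newton(r, J, [0.0]))  # ≈ [0.7]
```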
5.4.2 Levenberg-Marquardt Method
Algorithm

5.5 Natural Gradient Method

5.5.1 Descent Direction Determined from Fisher Information Matrix
In machine learning and statistics, a distance different from the Euclidean distance may be naturally defined in the space of optimization parameters x
Hellinger distance
Approximation of the Hellinger distance
When x and x' are close
G(x): Fisher information matrix
A gradient method for parameter estimation in statistics, where the distance structure is locally determined by dG(x,x').
A generalization of the steepest descent method.
Example: Fisher information matrix of a normal distribution with mean µ and variance v
Probability density of parameter x
Calculating the gradient of log p(a;x) with respect to x
The Fisher information matrix is
The distance dG(x,x') between x=(µ,v) and x'=(µ+δµ,v+δv) is
For variations in parameters, the larger the variance v, the smaller the distance between distributions
The larger the variance, the more difficult it becomes to distinguish the distributions from the data.
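Concretely, for x = (µ, v) with v the variance, the standard result is

G(x) = diag(1/v, 1/(2v²)),  dG(x,x′)² ≈ δµ²/v + δv²/(2v²)

which shows directly that the same parameter variation (δµ, δv) corresponds to a smaller distance when the variance v is large.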
The natural gradient method is the steepest descent method when the distance structure of the parameter space is determined by the Fisher information matrix.
The natural gradient method is a generalization of the steepest descent method; depending on the distance structure and the objective function, its search direction can be almost identical to that of the Newton method.
5.5.2 Natural Gradient Method in Online Learning
Online Algorithm for Natural Gradient Method
See details later.

Chapter 6: Conjugate Gradient Method

Overview

Conjugate gradient method is used for optimization of large scale problems as it is computationally inexpensive and saves memory space.
memo
It is used as an iterative method
Algorithm for solving simultaneous linear equations with symmetric positive definite matrices as coefficients
Optimization method with better convergence than the steepest descent method
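A sketch of the linear conjugate gradient method for Qx = b with Q symmetric positive definite (equivalently, minimizing ½xᵀQx − bᵀx):

```python
import numpy as np

def conjugate_gradient(Q, b, x0=None, tol=1e-10):
    x = np.zeros_like(b) if x0 is None else np.asarray(x0, float)
    r = b - Q @ x          # residual = negative gradient of (1/2)x^T Q x - b^T x
    d = r.copy()           # first search direction: steepest descent
    rs = r @ r
    for _ in range(len(b)):
        Qd = Q @ d
        alpha = rs / (d @ Qd)          # exact line search along d
        x += alpha * d
        r -= alpha * Qd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d      # new direction, Q-conjugate to the previous one
        rs = rs_new
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b), np.linalg.solve(Q, b))
```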

6.1 Conjugate direction method

6.2 Conjugate gradient method

6.3 Nonlinear Conjugate Gradient Method

memo
Conjugate Gradient Method - Qiita
Introduction: this article is based on Section 3.3 "Conjugate Gradient Method" of "Optimization Mathematics You Can Understand: From Fundamental Principles to Computational Methods" (Kenichi Kanatani, Kyoritsu Shuppan). It was written for a session of a self-study seminar, so some parts refer to material explained in a previous session...
Univariate gradient method
Case of n variables
Conjugate gradient method
What is "conjugate"?
"Conjugate" is an extension of the concept of "orthogonal".
Explanation: for a positive definite symmetric matrix Q, if vectors x and y satisfy (x, Qy) = 0, then x and y are said to be conjugate with respect to Q.
Avoiding the calculation of Hesse matrices
Since the Hesse matrix contains second-order partial derivatives, it is expensive to compute; there are several ways to avoid this.

Chapter 7: Quasi-Newton Method

Overview

Algorithms for finding solutions of nonlinear simultaneous equations, or maxima and minima of functions, in continuous optimization problems.
How to avoid calculating the Hesse matrix in the Newton method and perform the optimization efficiently

7.1 Optimization Methods Using a Variable Metric and the Secant Condition

7.2 Proximal Update of a Positive Definite Matrix

7.2.1 Formulation of the Update Rule by Divergence Minimization
7.2.2 Relationship between the Properties of the Divergence and the Update Rule
7.2.3 Distance Minimization and Divergence Minimization
7.2.4 Derivation of the Update Rule

7.3 Convergence of the Quasi-Newton Method

7.4 Limited-Memory Quasi-Newton Method

7.5 Exploiting the Sparsity of Hesse Matrices

7.5.1 Graphical Representation of Functions
Representing relationships between variables in graph-theoretic terms
A subgraph is a clique (complete subgraph) when every pair of its vertices is joined by an edge.
Maximal cliques
Chordal graphs
7.5.2 Positive Definite Matrix Completion
7.5.3 Update Rule by Sparse Clique Decomposition
memo
One of the methods mainly used to solve unconstrained minimization problems; the outline is the same as Newton's method.
The difference is that instead of the Hesse matrix ∇2f(xk) of the objective function f at the point xk, an approximation matrix Bk is used, and the search direction dk is obtained by solving Bk dk = −∇f(xk).
Instead of solving this equation, the search direction can also be obtained directly as dk = −Hk∇f(xk), where Hk approximates the inverse of the Hesse matrix.
As update rules for the approximation matrices Bk and Hk, the BFGS and DFP formulas, which are update formulas satisfying the secant condition, are well known.

Chapter 8: Trust Region Method

Overview

Trust region: a term used in mathematical optimization for a subregion in which it is considered reasonable to approximate the objective function by a model function (often a quadratic function).
Trust region method
Contrast with line search methods
8.1 Structure of the Algorithm
8.2 Approximate Solution of the Subproblem
8.2.1 Optimality Conditions for the Subproblem
8.2.2 The Dogleg Method
8.3 Convergence
memo
One of the gradient methods for solving unconstrained optimization problems.
It is designed so that the Newton method converges globally even when the Hesse matrix is not positive definite, and it has also been extended to the framework of quasi-Newton methods and constrained optimization.
Given the approximate solution xk at the k-th iteration, find the step sk that minimizes the quadratic approximation of the objective function within the trust region, where that approximation is considered reasonable. Then, based on the observed reduction of the function value, adjust the size of the trust region and update the approximate solution as xk+1 := xk + sk.
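As a practical note (not from the book), the unconstrained methods of Part II are all available through scipy.optimize.minimize; a sketch comparing them on the Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])
for method in ["CG", "BFGS", "L-BFGS-B", "Newton-CG", "trust-ncg"]:
    # Newton-type and trust-region methods use the Hessian; the rest only gradients
    kw = dict(jac=rosen_der)
    if method in ("Newton-CG", "trust-ncg"):
        kw["hess"] = rosen_hess
    res = minimize(rosen, x0, method=method, **kw)
    print(f"{method:10s} f={res.fun:.2e} iters={res.nit}")
```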
Part III: Constrained Optimization

Chapter 9: Optimality Conditions for Equality-Constrained Optimization

9.1 First-Order Optimality Conditions

9.2 Second-Order Optimality Conditions

9.3 Optimality Conditions and Duality for Convex Optimization Problems

9.4 Sensitivity Analysis

Chapter 10: Optimality Conditions for Inequality-Constrained Optimization

10.1 First-Order Optimality Conditions

10.2 Second-Order Optimality Conditions

10.3 Optimality Conditions for Convex Optimization Problems

10.4 Primal Problems and Dual Problems

Chapter 11: Optimization Methods for the Primal Problem

11.1 Active Set Method

11.1.1 Choosing the Search Direction
11.1.2 Sign of the Lagrange Multipliers
11.1.3 Updating the Active Constraints and the Optimization Algorithm

11.2 Penalty Function Method

11.2.1 Formulation with a Penalty Function
11.2.2 Properties of the Unconstrained Optimization Problems in the Penalty Function Method
11.2.3 Exact Penalty Function Method

11.3 Barrier Function Method

11.3.1 Formulation Using the Barrier Function Method
11.3.2 Properties of the Barrier Function Method

Chapter 12: Optimization Methods Using Lagrangian Functions

12.1 Dual Ascent Method

12.1.1 Introduction of the Lagrangian and Derivation of the Dual Problem
12.1.2 Optimization of the Dual Problem Using the Gradient Method
12.1.3 Dual Decomposition
12.1.4 Dual Ascent Method for Inequality Constraints
12.1.5 Convergence of the Dual Ascent Method
12.1.6 Dual Ascent for Nonlinear Constraints

12.2 Augmented Lagrangian Method

12.2.1 The Augmented Lagrangian
12.2.2 The Augmented Lagrangian Function Method
12.2.3 The Augmented Lagrangian Method as a Dual Ascent Method
12.2.4 Handling Inequality Constraints
12.2.5 Convergence Theory of the Augmented Lagrangian Method
12.2.6 Convergence Rate Theory for Convex Objective Functions
12.2.7 Relationship to the Proximal Point Algorithm

12.3 Alternating Direction Method of Multipliers (ADMM)

12.3.1 The ADMM Algorithm
12.3.2 Parallel Computation with ADMM

Part IV: Optimization as a Learning Algorithm

Chapter 13: Upper-Bound Minimization Algorithms

Introduction

A sequential solution method that generates a sequence of points along which the objective function decreases monotonically
One of the approaches used when it is difficult to optimize the objective function directly

13.1 The Upper-Bound Minimization (MM) Algorithm

A general framework for sequential methods that generate a monotonically decreasing sequence of points for the objective function.
The expectation-maximization (EM) algorithm, one of the most famous methods for maximum likelihood estimation of the parameters of a probabilistic model, is an instance of the MM algorithm framework.
Definition: directional derivatives and stationary points
A stationary point is not necessarily an extreme point or a point of maximum or minimum value.
Consider the minimization problem (above equation)
Consider a setting where it is difficult to minimize f(x) directly
Instead of solving this minimization problem directly, the MM algorithm generates a sequence of candidate solution points by successively solving minimization problems for approximate functions of f(x).
Specifically, it proceeds as follows
Let u(x;xt) be an approximate function of the objective function f(x) around the t-th candidate solution point xt.
Find the (t+1)-th candidate solution point xt+1 as in the above equation
Conditions that the approximate function u must satisfy in order to solve the optimization problem minx∈X f(x)
Conditions, continued
A function u satisfying the above conditions is called a surrogate function.
An algorithm that generates a sequence of candidate solution points by repeating the two steps of majorization and minimization using the surrogate function
Example of the MM algorithm in action
MM algorithm
Properties of the MM algorithm
Theorem: the MM algorithm decreases the objective monotonically
Theorem: convergence of the MM algorithm

13.2 Examples of Surrogate Functions

Typical methods for constructing a surrogate function
Specifically, several inequalities that can be used to construct surrogate functions
Use of linearization
Let h(x) be a once-differentiable convex function
From Theorem 2.12, the above inequality holds
By using it for h(x), we can construct a surrogate function
Use of quadratic approximation
Let h(x) be a twice-differentiable convex function whose Hesse matrix is bounded
If M is a positive definite matrix and M − ∇²h(x) is nonnegative definite for every x, then the above inequality is valid
Use of Jensen's inequality
Let h(x) be a convex function; for a random variable X, Jensen's inequality (above) holds
Used to derive the EM algorithm

13.3 EM Algorithm

Introduction
A typical algorithm for maximum likelihood estimation in statistical models that include unobservable variables, called latent variables.
Maximum likelihood estimation of the parameters can be derived in the framework of the MM algorithm by regarding it as minimization of the negative log-likelihood.
Explanation of the EM algorithm
Assumptions
Let the observed data be y1:n = (y1,...,yn)
Consider a probabilistic model p(y|θ) with parameter θ
The maximum likelihood estimator is given by the above equation
Assume a latent variable zi for each observed data point yi
Compute the upper bound of the negative log-likelihood; u(θ;θt) is given by the above equation
EM algorithm
E step: compute the posterior distribution p(z1:n|y1:n, θt) of z1:n given θt, and derive u(θ;θt)
This yields the surrogate function
M step: find the optimal solution θ of u(θ;θt)
This corresponds to maximizing the likelihood lower bound
That is, minimize the surrogate function
Repeat the E and M steps
Latent variables are auxiliary variables introduced to construct the surrogate function for optimizing the objective function −log p(y1:n|θ)

13.4 Optimizing the Difference of Two Convex Functions

Suppose the function f(x) can be written, as in the above equation, in terms of once-differentiable convex functions g(x) and h(x)
The function f(x) is in general non-convex when h(x) is not a linear function.
Such a minimization problem for f(x) is a DC (difference of convex) programming problem
Theorem: transformation to a DC programming problem
Very important for the optimization of nonconvex functions
Solution of DC programming problems
By generating candidate solution points {xt} such that ∇g(xt+1) = ∇h(xt) is satisfied,
the objective function f(x) decreases monotonically and a stationary point can be found
Analysis of the CCCP in terms of the MM algorithm
Using the linearization of h(x), derive the above equation as a surrogate function for f(x)
The method iteratively minimizes the right-hand side of the DC programming problem

13.5 Proximal Point Algorithm

With f(x) a convex function, the minimization problem becomes equivalent to the above equation
Since f(x) + (1/2c)∥x − y∥² is strongly convex with respect to x, the minimization of a convex function can be replaced by minimization problems for strongly convex functions
The proximal point algorithm obtains the minimum of the objective function by repeatedly applying the update in the above equation
It is the same as the MM algorithm with the surrogate function given by the above equation.

Chapter 14: Support Vector Machines and Optimization

14.1 Formulation of SVM and the Optimization Problem

Introduction
Hard-margin SVM and soft-margin SVM
Linear classification boundary of the SVM
w is the coefficient vector of the linear classification boundary
b is the bias
SVM is an optimization problem that finds the optimal w and b from the training data
14.1.1 Primal Problem of the SVM
Optimization problem for the linear SVM based on soft-margin maximization
The regularization term and the loss term, respectively
The loss function is called the hinge loss
The 0-1 loss is a non-convex and discontinuous function, so it cannot be optimized as is
The hinge loss is a surrogate loss function for the 0-1 loss
Since the hinge function is convex, it can be optimized
However, the hinge function is not differentiable where 1 − yif(xi) = 0
Two approaches for dealing with this
Convert the unconstrained optimization problem into a constrained optimization problem
Convert to constrained optimization by introducing artificial variables (slack variables)
The hinge loss is expressed as a piecewise linear function with two segments
With the vector 𝝃 = (𝛏1,𝛏2,...,𝛏n) of n artificial variables, the problem can be transformed into a differentiable constrained optimization problem
Modify the hinge loss to make it a smooth loss function
Squared hinge loss
Huber hinge loss
Vanilla hinge loss
Logistic regression
14.1.2 The Dual Problem of the SVM
Derivation of the dual problem
Lagrangian of the constrained optimization problem
Objective function of the dual problem
The problem of maximizing the dual objective function under nonnegativity constraints on the Lagrange multipliers α and µ
The Lagrangian is convex with respect to the primal variables w, b, and 𝝃
It is minimized at the point where the partial derivatives with respect to the primal variables become zero
Dual problem of the SVM
14.1.3 Optimality Conditions for SVMs
One of the features of SVMs is that the classification boundary is not determined by all training cases, but is characterized by a subset of cases called support vectors (SVs)
From the optimality conditions, the relationship between the dual variables α1,...,αn and the SVs becomes clear.
Relationship between the case sets and the margin
Training cases whose margin yi(wᵀxi + b) is larger than 1 do not affect the classification boundary, because the corresponding dual variable αi becomes 0.
Removing non-SVs does not change the final classification boundary.
14.1.4 Nonlinear Modeling with Kernel Functions
One of the features of the SVM is the ability to perform nonlinear modeling using kernel functions
Consider a function Φ that maps an input x to some feature space F
The linear classification boundary when Φ(x) is interpreted as a new feature vector
Note that the coefficient vector w is also defined as an element of the feature space F
Replace x with Φ(x) in the dual problem
Dual representation of the classification boundary
In the dual problem and the classification boundary, the feature Φ(x) does not appear alone, but only in the form of inner products Φ(x)ᵀΦ(x′)
Define the inner product as a kernel function
By using a function that satisfies certain properties, the inner product can be computed directly without using Φ(x)
RBF (radial basis function) kernel
Expression in terms of the kernel function

14.2 Optimization Algorithms for SVM Learning

Introduction
Optimization algorithms specific to SVM learning
Used in LIBSVM and LIBLINEAR
SMO (sequential minimal optimization) algorithm
Squared hinge loss
DCDM (dual coordinate descent method) algorithm
Huber hinge loss
14.2.1 Solving the Kernel SVM Dual Problem: the SMO Algorithm
The dual problem of the SVM is an optimization problem in n unknown variables α1,...,αn.
When the number of training cases n is not large, the dual problem can be solved with a general-purpose algorithm for constrained optimization.
When n is large, the cost is huge
Approach: treat only a subset of the n variables (the working set) as optimization variables and fix the rest as constants
Decomposition method with working-set size 2
Iteratively solves a minimal optimization problem with only two variables.
Consider the dual problem in which only two of the n variables, αs and αt, are treated as variables and the rest are fixed as constants
Let the dual variables be βi = yiαi
Using the variables β1,...,βn, the dual problem of SV classification becomes the above equation
Kij = K(xi,xj)
Let the indices of the two dual variables to be changed be s and t
To satisfy the equality constraint (zero sum), Δβt = −Δβs
Effectively, the variable that can be moved is βs
Rewriting the dual problem as a constrained optimization problem for Δβs,
the above equation is a constrained quadratic minimization problem in the single variable Δβs
The extremum of the quadratic function is given by the above equation
The optimal solution is
The two variables may be chosen at random.
14.2.2 Solving the Dual Problem of the Linear SVM: the DCDM Algorithm
An optimization algorithm specific to linear SVMs
Primal problem when considering a classification boundary without the constant term b
Dual problem
Variant of the dual problem
The DCDM algorithm is a kind of decomposition method, like the SMO algorithm.
The SMO algorithm had an equality constraint, so it considered a working set of two variables.
In the DCDM there is no equality constraint, so a single αs is updated.
At a given step of the DCDM, only αs is treated as a variable and the rest are fixed as constants.
Organize the dual problem with respect to the update width Δαs
A constrained quadratic minimization problem in the single variable Δαs
Extremum of the quadratic function
Optimal solution
In the DCDM, the working set is selected by taking the n variables α1,...,αn in turn and repeating the solution process.
14.2.3 Comparison of Learning Algorithms
Comparison of the SVM-specific DCDM method with the general-purpose trust region method and the limited-memory quasi-Newton method
The Huber hinge loss is used because differentiability is required
Primal problem of the Huber hinge loss SVM
Dual problem
Experimental data used for the comparison
Experimental result 1: small-scale data
Experimental result 2: medium-scale data
Experimental result 3: large-scale data

14.3 Regularization Path Tracking

Introduction
Hyperparameters
Example: the regularization parameter C that controls the balance between the loss function and the regularization term
Proper selection of hyperparameters is needed
Using a warm start is an efficient way to perform model selection
When solving multiple optimization problems with different hyperparameter values, using the solution of one optimization problem as the initial value of another allows the latter to be solved efficiently.
Regularization path tracking
Multiple optimization problems can in effect be solved at once.
It is possible to compute exactly how the optimal solution changes as the regularization parameter C changes continuously.
One of the approaches called parametric optimization (parametric programming)
Consider a class of optimization problems in parameterized form, and obtain their optimal solutions in parametric form.
The goal is to obtain the optimal solution of the dual problem as a function of the regularization parameter C
In regularization path tracking, the optimality conditions play an important role.
To emphasize that the three case sets change depending on C, we write them as above.
The three case sets are expressed as in the above equation
Regularization path tracking algorithm
14.3.1 Parametric Representation of the Optimal Solution (Step 1)
Consider an interval of C on which the case set T(C) is constant
14.3.2 Event Detection (Step 2)
Events at which the case set T(C) changes as the regularization parameter C is gradually increased

14.4 Screening with Optimality Guarantees

The cases belonging to M and I are called support vectors (SVs)
The set of cases belonging to O is called non-support vectors (non-SVs)
Cases in O have αi = 0, so excluding them does not change the result
Many SVM implementations use heuristics to estimate the non-SV cases and solve the optimization problem with the rest
Safe screening
Find the non-SV cases with a guarantee
Example of screening with optimality guarantees
Basic idea
Optimization problem
Gradient vector
If the margin is greater than 1, the case is a non-SV (included in O)
What do we do when the optimal solution w* is unknown?
Consider a situation where a region containing the optimal solution is known
When the optimal solution w* is contained in Ω, a case is a non-SV if the lower bound of its margin is greater than 1
Two challenges in achieving screening with optimality guarantees
How to solve the minimization problem efficiently once the region Ω is obtained
How to obtain a region Ω that contains the optimal solution w*
Approach to the minimization problem when the region Ω is a hypersphere
Hypersphere
Theorem: lower and upper bounds for linear scores
Lower and upper bounds on ηᵀw* when the hypersphere contains the solution w*.
Setting η = yixi, we obtain a lower bound on the margin and can make the non-SV decision as above
How do we find Ω?
Theorem: a hypersphere containing the optimal solution, constructed from an approximate solution
The center and radius of the hypersphere Ω depend on an arbitrary vector ŵ
ŵ is called an approximate solution
What approximate solution will reduce the radius of the hypersphere?
When the approximate solution ŵ is sufficiently close to the optimal solution, the radius r(ŵ) becomes small.
If we can choose an approximate solution ŵ close to the optimal solution w*,
the upper and lower bounds become tighter,
and many non-SVs can be safely identified.
The approximate solution can be found, for example, as follows
A solution obtained partway through the optimization algorithm serves as an approximate solution.
Identify non-SV cases in the process of solving for the optimal solution and remove them from the training data
Applications of the same idea
Sequential learning, online learning, and streaming learning, where training data arrive one after another
Regarding the optimal solution before a change to the training data as an approximate solution, we can screen for non-SVs of the optimal solution after the change
Optimization problems for multiple regularization parameters C in model selection
Regard the optimal solution for one regularization parameter as an approximate solution for the optimization problem with another regularization parameter
Theorem: screening with optimality guarantees along the regularization path
Research on screening with optimality guarantees: El Ghaoui et al. [17]; actively studied as of 2016

Chapter 15: Sparse Learning

15.1 Sparse Modeling

Methods for extracting low-dimensional subspaces, such as sparse modeling, have long been used for model selection in statistics.
Criteria for model selection
Akaike's Information Criterion (AIC)
Assumptions
A statistical model {pθ | θ ∈ Θ} is given.
Θ is a set of parameters
pθ is the probability density function corresponding to the parameter θ
A sequence of k-dimensional submodels Θk ⊆ Θ (k=1,2,...,d) is given
Objective: choose the "right" model among these models
The model with the largest dimension has the highest expressive power, but overfitting occurs if the model is too complex
AIC is a criterion that avoids overfitting by correcting for the gap between the prediction error (generalization error) and the training error.
The AIC of model Θk is given by the above equation
If θk is the maximum likelihood estimator in model Θk, then for n observed data points {z1,z2,...,zn}, AIC(k) is an asymptotically unbiased estimator of the expected log-likelihood
AIC corrects the training error for the "complexity" of the model by adding the penalty 2k for its dimension
Minimizing the AIC approximately minimizes the prediction error
AIC can be regarded as regularized learning using the dimension of the model as the regularization term
If the loss function satisfies submodularity, it can be computed by a greedy method
Bayesian Information Criterion (BIC)

15.2 L1 Regularization and Various Types of Sparse Regularization

15.2.1 L1 Regularization
A sparse modeling technique, alternative to AIC, that can be formulated as a simple optimization problem.
L1 regularization uses the L1 norm ∥θ∥1 = Σj|θj| instead of the L0 norm ∥θ∥0
C is the regularization parameter
The regularization parameter is determined by cross-validation, information criteria, etc.
The L1 norm is known to be the convex hull of the L0 norm on [−1,1]d
The optimal solution is in fact sparse
Example: sparse regression (Lasso)
L1-regularized learning with the squared loss, applied to regression
The Lasso estimator is given as the solution of the optimization problem (above)
Example: sparse classification: L1-regularized logistic regression
L1 regularization can be applied to classification in the same way as to regression
If the loss function is the logistic loss, the problem can be formulated as above
15.2.2 Other Sparse Regularizations
Generalized problem with a loss function ℓ(z,w)
If the loss function ℓ is convex with respect to w, then this optimization problem is a convex optimization problem
∥w∥1 is not smooth, and its derivative cannot be computed everywhere
The subgradient method can be applied, but convergence is slow
Because the L1 regularization term is not smooth, the solution becomes sparse
Guidelines for solving this problem
Use the structure of the regularization term
In the case of L1 regularization, take advantage of the separability of the function across variables
Separate the difficulty of optimization between the loss term and the regularization term
Optimize the loss term as if the regularization term did not exist
Generalization
Formula
Sparse regularization function Ψ(w)
Let f(w) = Σi ℓ(zi,w)
Regularization terms other than L1
Example: trace norm regularization
A regularization method used to learn low-rank matrices
Trace norm regularization
B = (WᵀW)^(1/2) is the positive semidefinite symmetric matrix satisfying BB = WᵀW
The trace norm is the sum of the singular values
It is L1 regularization applied to the vector of singular values (σ1,σ2,...,σr)
Matrices learned with trace norm regularization have sparse singular values
Many of the singular values are 0
Example: group regularization
Assumptions
The d indices {1,...,d} are divided into K groups G1,...,GK
The groups Gk may overlap
The subvector corresponding to each group Gk is written wGk = (wj)j∈Gk
The group regularization along this grouping is given by the above equation
q > 1
∥wGk∥q = (Σj|wj|^q)^(1/q)
Group regularization is L1 regularization applied to the vector of Lq norms of the subvectors corresponding to the groups
Example: elastic net regularization
A method that lies between L1 and L2 regularization
Equation
Compared with L1 regularization alone, the added L2 regularization provides stable learning that is less sensitive to noise
15.2.3 Numerical Evaluation of L1 Regularization
Experiment with artificial data
Dimension p = 2000
Generate data from the linear model in the above equation
X is a randomly generated design matrix
ε is noise following an independent and identically distributed standard normal distribution
w* is a sparse vector in which only the first 100 components are nonzero and the rest are zero
The sample size is set to n = 1000, 2000, 4000 to observe the behavior of the estimation error
Results
Lasso's estimation error does not become too large even when the sample size is small
This is because Lasso performs variable selection and estimates using only the variables it needs.
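A smaller-scale sketch of the experiment described above (sizes and the regularization value are arbitrary; scikit-learn's Lasso is used in place of the book's implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, k = 200, 10                     # dimension, number of nonzero components
w_true = np.zeros(p)
w_true[:k] = rng.normal(size=k)    # sparse ground truth

for n in (100, 200, 400):          # sample sizes
    X = rng.normal(size=(n, p))    # random design matrix
    y = X @ w_true + rng.normal(size=n)   # standard normal noise
    w_hat = Lasso(alpha=0.1).fit(X, y).coef_
    # estimation error and number of selected (nonzero) components
    print(n, np.linalg.norm(w_hat - w_true), np.sum(w_hat != 0))
```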
15.3 Solving with the Proximal Gradient Method

15.3.1 Algorithm of the Proximal Gradient Method
A method that uses the structure of the regularization term
A slightly modified version of the steepest descent method
Algorithm: ISTA (Iterative Shrinkage-Thresholding Algorithm)
Optimize the function f via the linear approximation f(w) ≃ f(wk) + gkᵀ(w − wk) around wk
The approximation is accurate only around wk, where the derivative is taken
To avoid moving too far, the proximity term (ηk/2)∥w − wk∥² is added
The update equation of the proximal gradient method is the above equation
The proximal gradient method finally takes the form of the above equation
Let the proximal mapping associated with the function Ψ be defined by the above equation
wk − gk/ηk is exactly the steepest descent update for minimizing the function f
In the proximal gradient method, f is first decreased by a steepest descent step, which is then corrected by the proximal mapping determined by Ψ
The computational efficiency of the proximal gradient method is determined by the cost of computing the proximal mapping of the regularization term Ψ
Example: L1 regularization
The proximal mapping for L1 regularization, y = prox(q | c∥·∥1)
It is called soft thresholding
Example: group regularization without overlap
Example: trace norm regularization
Example: no regularization term
The update equation of the proximal gradient method reduces to the steepest descent method
Example: indicator function
Continued
15.3.2 Convergence Theory of the Proximal Gradient Method
Convergence rate of the proximal gradient method
In practical applications it is difficult to compute the smoothness constant L of the function f directly, so the backtracking method is used to estimate L
Algorithm
FISTA (Fast ISTA)
Algorithm
Convergence rate of the accelerated proximal gradient method
Proximal gradient algorithm for strongly convex functions
Restart method
15.3.3 Proof of the Convergence Rate of the Proximal Gradient Method
15.3.4 Numerical Experiments with the Proximal Gradient Method
Experiments on artificial data
Sample size n = 700
Dimension p = 1000
Data generation formula
Only the first 100 components have values; the remaining 900 are 0
Optimization problem
The loss function is the logistic loss
Figure showing the convergence of the objective function
Nesterov (restart) is the best
Generalization performance and number of nonzero components
Nesterov (restart) is the best

15.4 Solving with the Coordinate Descent Method

The coordinate descent method is also effective for sparse regularized learning
Coordinate descent is useful when the partial derivative for each coordinate is cheap to compute and the regularization term separates into blocks
In coordinate descent, each coordinate (or block of coordinates) is updated in turn so that the objective function decreases.
Many sparse regularization functions can be expressed as a sum of functions separated by coordinates (or blocks)
Example
L1 regularization takes the absolute value of each coordinate and sums them
Group regularization is also a sum of functions separated by groups of coordinates
An expression representing the structure separated by coordinates
Algorithm
How to choose the coordinates
Choose randomly
Choose cyclically
Difficult

15.5 Solving with the Alternating Direction Method of Multipliers

15.5.1 ADMM and Structured Regularization
Separate the optimization of the loss function from the optimization of the regularization term
Example: group regularization with overlap
Group regularization
Example: generalized fused regularization
Regularization along a graph
ADMM is a method for optimizing the augmented Lagrangian by alternating over x and y
Augmented Lagrangian of the optimization problem
Algorithm of the alternating direction method of multipliers
Example: solving the Lasso problem with ADMM
Example: solving problems whose regularization is given by a sum of convex functions with ADMM
15.5.2 Numerical Experiments on Image Recovery
Example of TV (total variation) noise reduction
Assume the true image is smooth, with no significant change between neighboring pixels
Deblurring and noise removal using TV regularization

15.6 Methods Using the Proximal Point Algorithm

15.6.1 The Proximal Point Algorithm and Its Dual Problem in Sparse Learning
The proximal point algorithm is also useful for optimizing sparse regularization
Two theorems for solving the dual problem
Theorem: properties of conjugate functions
Theorem: properties of proximal mappings
Moreau decomposition
The Moreau decomposition replaces the computation of a proximal mapping with the computation of the proximal mapping of the conjugate function
Use the steepest descent method or Newton's method to optimize the dual problem
The proximal point algorithm for the primal problem can be implemented via the dual problem by solving a linear equality-constrained optimization problem with the method of multipliers
Regularization functions that are not smooth in the primal problem become smooth in the dual problem
Algorithm of the dual augmented Lagrangian method
Using Fenchel's duality theorem, the dual augmented Lagrangian method can be derived directly
15.6.2 Alternating Direction Method of Multipliers for the Dual Problem
Optimize the augmented Lagrangian of the dual problem alternately over y and u
Dual ADMM for regularized learning

Chapter 16: Optimization on Matrix Spaces

16.1 Stiefel and Grassmann Manifolds

16.2 Matrix Optimization in Machine Learning

16.2.1 Independent Component Analysis
16.2.2 Density Ratio Estimation with Dimensionality Reduction

16.3 Concepts of Manifolds

16.4 Optimization on Manifolds

16.4.1 Steepest Descent Method
16.4.2 Conjugate Gradient Method

16.5 Retraction and Vector Transport

16.6 Optimization on Matrix Manifolds

16.6.1 Properties of Stiefel Manifolds
16.6.2 Construction of Retractions
16.6.3 Vector Transport by Projection
16.6.4 Numerical Examples

 
