Machine Learning Professional Series “Continuous Optimization for Machine Learning” Reading Memo

Summary

Continuous optimization in machine learning refers to methods for solving optimization problems in which the variables take real values, such as optimizing the weights and biases of neural networks, estimating parameters in regression analysis, and estimating parameters in SVMs. Typical methods for continuous optimization include the gradient descent, steepest descent, conjugate gradient, quasi-Newton, and L-BFGS methods; these algorithms use information such as the gradient and the Hessian matrix of the objective function to update the variables so that they approach an optimal solution. In continuous optimization, it is necessary to consider not only the value of the objective function but also the constraint conditions. To solve optimization problems with constraints, methods for constrained optimization problems are used, such as methods based on the KKT conditions, interior point methods, projected gradient methods, penalty methods, and barrier function methods. Here we discuss continuous optimization in machine learning based on the Machine Learning Professional Series book "Continuous Optimization for Machine Learning".

Reading Memo.

The place where you can learn the most with the least effort. The boundary between fields is interesting. A solid book from the basics to the cutting edge. You can learn the basic knowledge essential for machine learning. It is not an easy read, but there has never been a textbook as good as this.

Part I. Introduction

Chapter 1 Introduction

1.1 Inference and Computation in Machine Learning

In machine learning and statistics, one of the main goals is to make accurate estimates and predictions
An approach that is often used is to properly model the process by which data is generated and minimize a statistically valid loss measure (error metric).
What kind of loss can be used to achieve the desired level of statistical accuracy?
Computational efficiency is also important
Algorithm Examples
Maximum likelihood method
Fisher's scoring method (an optimization method similar to Newton's method)
Standard method for estimating parameters of generalized linear models in regression analysis
Bayesian method
Markov chain Monte Carlo methods
Computation of high-dimensional integrals for obtaining the posterior distribution
Computational methods for efficient inference and prediction using the structure of statistical models.
EM algorithm for parameter estimation of Gaussian mixture distributions
Error Back Propagation for Hierarchical Neural Network Models
Probability propagation for Bayesian networks.
Boosting, an example of ensemble learning

1.2 Description of Optimization Problems

1.2.1 Various optimization problems
A set of real numbers ℝ
N-dimensional Euclidean space ℝn
Consider an element x of n-dimensional space as a column vector
x = (x1, …, xn)ᵀ
ᵀ denotes the transpose of a vector or matrix
The norm (length) of a vector x is written ∥x∥
Euclidean norm
The distance between two points x, y ∈ ℝn is ∥x - y∥ using the Euclidean norm
Optimization problem is the problem of optimizing (specifically minimizing or maximizing) a given function 𝒇 : ℝn → ℝ under some constraint x ∈ 𝑆 ⊂ ℝn
If minimization is the goal
Min" means to minimize the function
x under Min is the variable to be optimized
The "subject to ~" means "under the condition of ~".
Sometimes it is abbreviated to "s.t.".
Let 𝑓 be the objective function
f(x) is defined on Euclidean space and often satisfies properties such as continuity and differentiability
Continuous optimization
𝑺 is a feasible region
The condition "x ∈ 𝑺" defined by 𝑺 is called a constraint.
Example 1.1 : School placement problem
In a village with no elementary school there are 5 houses, whose locations are given by coordinates αi ∈ ℝ2 (i=1,2,3,4,5)
If the school is to be built at a point in S ⊂ ℝ2, where should it be placed?
Candidate objectives
Build it where the average squared distance from the houses to the school is minimized
Build it where the average distance from the houses to the school is minimized
Build it where the distance from the farthest house to the school is minimized
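As a quick illustration of Example 1.1 (with hypothetical house coordinates, not from the book), the three candidate objectives can be compared numerically; minimizing the average squared distance even has a closed form, the centroid. A sketch:

```python
# Sketch of Example 1.1 with hypothetical house coordinates α_i.
import numpy as np
from scipy.optimize import minimize

houses = np.array([[0.0, 0.0], [2.0, 1.0], [1.0, 3.0], [4.0, 2.0], [3.0, 4.0]])

def avg_sq_dist(x):  # (1/5) Σ ∥x − α_i∥²  -> minimized by the centroid
    return np.mean(np.sum((houses - x) ** 2, axis=1))

def avg_dist(x):     # (1/5) Σ ∥x − α_i∥   -> minimized by the geometric median
    return np.mean(np.linalg.norm(houses - x, axis=1))

def max_dist(x):     # max_i ∥x − α_i∥     -> minimax location (nonsmooth)
    return np.max(np.linalg.norm(houses - x, axis=1))

x0 = houses.mean(axis=0)
print("centroid:", x0)
print("geometric median:", minimize(avg_dist, x0).x)
print("minimax point:", minimize(max_dist, x0, method="Nelder-Mead").x)
```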
Example 1.2 Transportation problem
When transporting goods from production sites S1 and S2 to consumption sites T1, T2, and T3, what is the delivery plan that will result in the lowest transportation cost?
The total transportation cost is
The transportation cost from the production site Si to the consumption site Tj is cij per unit of goods.
Let the quantity transported be xij
Let x ∈ ℝ6 collect the transported quantities: x = (x11, …, x23)ᵀ
Each production site produces at most ai(i=1,2)
At each consumption location, at least bj(j=1,2,3) is consumed
Model
Constraints
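The transportation problem is a linear program; a sketch with hypothetical costs cij, supplies ai, and demands bj (the book's numbers are not reproduced here), solved with scipy.optimize.linprog:

```python
# Sketch of Example 1.2 as a linear program with hypothetical data.
import numpy as np
from scipy.optimize import linprog

c = np.array([4.0, 6.0, 9.0, 5.0, 3.0, 8.0])   # costs c_11..c_13, c_21..c_23
a = np.array([40.0, 50.0])                      # supply at S1, S2
b = np.array([20.0, 30.0, 25.0])                # demand at T1, T2, T3

# Supply constraints:  x_i1 + x_i2 + x_i3 <= a_i
A_supply = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]])
# Demand constraints:  x_1j + x_2j >= b_j  ->  -x_1j - x_2j <= -b_j
A_demand = -np.array([[1, 0, 0, 1, 0, 0],
                      [0, 1, 0, 0, 1, 0],
                      [0, 0, 1, 0, 0, 1]])

res = linprog(c, A_ub=np.vstack([A_supply, A_demand]),
              b_ub=np.concatenate([a, -b]), bounds=(0, None))
print(res.x.reshape(2, 3))   # optimal transport plan
print(res.fun)               # minimum total cost
```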
Feasible solution
A point that satisfies the constraints of the optimization problem
Optimal solution
Global optimal solution
Global solution
Minimizer (maximizer)
Optimal value
The infimum (supremum) of the objective function
The existence of an optimal value does not necessarily mean the existence of an optimal solution.
Neighborhood
The ε-neighborhood of a point x ∈ ℝn is B(x,ε) = {y ∈ ℝn | ∥x − y∥ < ε}
Local optimal solution
Unconstrained optimization problems, equality constrained optimization problems, inequality constrained optimization problems
Possibility of the existence of an optimal solution
If S is the empty set, there are no feasible solutions
If S is not empty, an optimal solution may exist
If S is not empty, an optimal solution may still fail to exist
If f can be made arbitrarily small on S, the optimal value is −∞
1.2.2 Iterative methods and convergence speed
Many optimization algorithms are described as iterative methods
Basic iterative algorithm
Initialization: set the initial value x0, k ← 0
1. If the stopping condition is satisfied, output the result and terminate the calculation
2. Compute xk+1 from the function f(x), the set S, and the history of points x0,…,xk
3. Set k ← k+1 and return to step 1
In an iterative method, we want the numerical solutions x0, x1, … to converge to the optimal solution x* as quickly as possible
The stopping condition of the algorithm should be set appropriately
Measures of the convergence speed of {xk}∞k=0 to x*
Measure the error of the numerical solution xk by ∥xk − x*∥; limk→∞ ∥xk − x*∥ = 0 is guaranteed
Linear convergence
Superlinear convergence
Quadratic convergence
Memo: meaning of "sup" and "inf" and the difference from "max" and "min"
max A: maximum value of A
min A: minimum value of A
sup A: supremum (least upper bound) of A
For any x ∈ A, x ≤ c (c is an upper bound of A)
For any real number r less than c, there exists an x ∈ A such that r < x (c is the least upper bound)
inf A: infimum (greatest lower bound) of A

Chapter 2 Fundamentals

2.1 Basics of Linear Algebra

If xᵀAx > 0 for any x ≠ 0
A is called a positive definite matrix
A ≻ O
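As a practical aside, positive definiteness can be tested numerically via the eigenvalues or by attempting a Cholesky factorization; a minimal sketch:

```python
# A symmetric matrix A is positive definite iff all eigenvalues are > 0,
# or equivalently iff the Cholesky factorization succeeds.
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])  # example symmetric matrix
print(np.linalg.eigvalsh(A))               # [1. 3.] -> all positive, so A ≻ O
try:
    np.linalg.cholesky(A)
    print("A is positive definite")
except np.linalg.LinAlgError:
    print("A is not positive definite")
```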
2.1.4 Projection onto a subspace
For a subspace S of n-dimensional space, the orthogonal complement S⊥ of S
Projection x1 of a vector x onto a subspace S
The projection x1 is characterized as the point on S that minimizes the distance to x
The eigenvalues of the projection matrix are 0 or 1
2.1.5 Rank-one update of a matrix
A matrix A plus a rank-one matrix, A + xyᵀ, is called a rank-one update of A
It is used in the natural gradient method and the quasi-Newton method
Formulas have been devised to perform various operations on a rank-one-updated matrix efficiently, using information about the matrix before the update
Sherman-Morrison formula
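A quick numerical check of the Sherman-Morrison formula (A + xyᵀ)⁻¹ = A⁻¹ − A⁻¹xyᵀA⁻¹ / (1 + yᵀA⁻¹x) on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 4 * np.eye(4)   # a well-conditioned example matrix
x, y = rng.normal(size=4), rng.normal(size=4)

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(x, y))       # direct inverse of the updated matrix
rhs = Ainv - (Ainv @ np.outer(x, y) @ Ainv) / (1.0 + y @ Ainv @ x)
print(np.allclose(lhs, rhs))  # True: the update costs O(n²) instead of O(n³)
```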
2.1.6 Various Norms
A norm is a quantity that corresponds to the length of a vector.
The most basic norm is the Euclidean norm
Definition 2.3 Definition of norm
Schwarz's inequality holds for the Euclidean norm
Hölder's inequality, an extension of Schwarz's inequality, holds for the p-norm
2.1.7 Functions on Matrix Space

2.2 Fundamentals of Convex Analysis

2.2.1 Convex Sets and Convex Functions
Definition of a convex function
For any points P=(x,f(x)) and Q=(y,f(y)) on the graph of a function f(x), we say that the function f(x) is convex (downward) if the graph of f(x) lies below the line segment PQ in the region between these two points.
For any x, y, and any λ satisfying 0≤λ≤1
λf(x) + (1-λ)f(y) ≥ f(λx + (1-λ)y)
Let z = λx + (1−λ)y, and set R = (z, f(z)), S = (z, λf(x) + (1−λ)f(y))
The inequality says that S lies above R
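A tiny numeric illustration of the inequality, with f(x) = x² and arbitrarily chosen points:

```python
# λf(x) + (1−λ)f(y) ≥ f(λx + (1−λ)y) for the convex function f(x) = x².
f = lambda x: x ** 2
x, y, lam = -1.0, 3.0, 0.3
print(lam * f(x) + (1 - lam) * f(y) >= f(lam * x + (1 - lam) * y))  # True
```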
Definition 2.4 Convex set
A non-empty set S ⊂ ℝn is called a convex set if u, v ∈ S and 0 ≤ λ ≤ 1 imply (1−λ)u + λv ∈ S
Examples of Convex Sets and Non-Convex Sets
Properties of convex sets
Any convex combination of finitely many points x1,…,xk in a convex set S is contained in S
If S and T are convex sets, then S ∩ T is also a convex set
For any set S that is not necessarily a convex set, we can define the smallest convex set that contains S
Definition 2.5 Convex hull (convex-hull)
Corollary 2.6
Definition 2.7 Convex functions
Properties of convex functions
Definition 2.8 µ-Strongly Convex Functions
2.2.2 Minimization of convex functions
Definition 2.9 Optimization of Convex Functions
Corollary 2.10
2.2.3 Continuity of Convex Functions
Affine hull
Theorem 2.11 Continuity of convex functions
2.2.4 Differentiable convex functions
Theorem 2.12 Characterization of convex functions by gradient
Theorem 2.13 Characterization of convex functions by Hesse matrices
Theorem 2.14 Upper and lower bounds for function values
2.2.5 Conjugate functions
If a function f is differentiable, then the tangent plane of its graph is determined at each point
Closed convex functions and nonconvex functions
Theorem 2.15 Properties of conjugate functions
Theorem 2.16 Strong convexity and smoothness
2.2.6 Subgradient and subdifferential
Definition 2.17 Subgradient and subdifferential
Theorem 2.18 Differentiable convex functions and the subdifferential
Theorem 2.19 Subgradient and conjugate functions

Part II: Unconstrained Optimization

Chapter 3: Optimality Conditions and Algorithm Stopping Conditions

3.1 Local Optimal Solutions and Optimality Conditions

Unconstrained optimization problem for function f:ℝn→ℝ
In general optimization of f, which is not necessarily convex, finding the local optimal solution is a realistic solution.
If the objective function is differentiable, the gradient at a local optimal solution is the zero vector
Theorem: First order necessary condition
An x* satisfying equation (3.2) is called a stationary point of the function f
Whether a stationary point is a local optimum can be determined using the Hessian matrix
Theorem: Second order sufficient condition
By approximating the objective function f by a quadratic function using Taylor's theorem, various properties can be derived
Theorem: Second order necessary condition
Using the first- and second-order necessary conditions, candidates for the local optimal solution can be narrowed down from among the stationary points
Candidate points satisfy ∇f(x) = 0
Candidate points whose Hessian matrix has a negative eigenvalue are not local optimal solutions
Example
Function f(x1,x2) = x1⁴ + x2⁴ + 3x1²x2² − 2x2²
First order condition
(x1,x2) = (0,0), (0,±1)
Hessian matrix
From the second order sufficient condition, f takes the minimum value −1 at (x1,x2) = (0,±1)
(x1,x2) = (0,0) is not a local optimum because it does not satisfy the second-order necessary condition
Among the stationary points, the points (0,±1) attain the minimum value
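The claims of this example are easy to verify numerically at the three stationary points; a sketch:

```python
import numpy as np

def grad(x1, x2):   # gradient of f(x1,x2) = x1⁴ + x2⁴ + 3x1²x2² − 2x2²
    return np.array([4*x1**3 + 6*x1*x2**2, 4*x2**3 + 6*x1**2*x2 - 4*x2])

def hess(x1, x2):   # Hessian matrix of f
    return np.array([[12*x1**2 + 6*x2**2, 12*x1*x2],
                     [12*x1*x2, 12*x2**2 + 6*x1**2 - 4]])

for p in [(0, 0), (0, 1), (0, -1)]:
    print(p, grad(*p), np.linalg.eigvalsh(hess(*p)))
# (0,0):  gradient 0, eigenvalues [-4, 0] -> fails the second-order necessary condition
# (0,±1): gradient 0, eigenvalues [6, 8]  -> local minima with f = -1
```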
Theorem: First order sufficient condition for convex optimization
Summary of the minimization problem

3.2 Optimality conditions for set constraints

Consider a constrained optimization problem defined by a function f:ℝn→ℝ and a set S ⊂ ℝn
In many applications, S is a convex set
Definition: feasible direction
Theorem: First order necessary condition for problem (3.4)
When the above equation holds for every feasible direction d of S at a point x*, x* is called a stationary point of problem (3.4)
If x* is an interior point of S, this is equivalent to ∇f(x*) = 0
Theorem: First order sufficient condition for problem (3.4)
When problem (3.4) is a convex optimization problem, being a stationary point is equivalent to being an optimal solution
The optimal solution can be obtained from the stationary points

3.3 Stopping Condition of Optimization Algorithm

If ∇f(xk) = 0 and ∇²f(xk) ≻ O are satisfied, xk is a local optimal solution, and the iterative method stops
In actual computation, it is difficult to check whether ∇f(xk) = 0 holds exactly because of numerical error
Instead, the algorithm stops when ∥∇f(xk)∥ < ε
How to choose ε
Let εmach be the machine epsilon of the computer's floating-point arithmetic
Example: εmach ≈ 10⁻¹⁶
Assumptions
The function f is twice differentiable
∇f(x*) = 0 and ∇²f(x*) ≻ O hold
Write the difference of function values as h = f(xk) − f(x*)
Write the difference of solutions as δ = xk − x*
By Taylor's theorem, there exists a suitable t ∈ (0,1) such that the above equation holds
If x* and xk are close enough, then ∥δ∥ = O(√h)
Applying Taylor's theorem to the gradient gives the above equation
∥∇f(xk)∥ is of the same order as ∥δ∥
Stopping condition when the function f(x) is computed with accuracy εmach
Taking scaling into account, the above equation can be used
Since the left-hand side alone is too strict, the above stopping condition is proposed
In the actual numerical calculation, we need to decide whether to stop or not by observing not only the gradient but also the behavior of the function value f(xk) and the iterative solution xk.
The final proposed stopping condition
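A minimal sketch of an iterative method with the stopping condition ∥∇f(xk)∥ < ε, here plain gradient descent with a fixed step on a toy quadratic:

```python
import numpy as np

def grad_descent(grad_f, x0, step=0.1, eps=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:    # stop when the gradient is numerically zero
            return x, k
        x = x - step * g
    return x, max_iter

# f(x) = ½ xᵀQx with Q = diag(1, 10), so ∇f(x) = Qx and x* = 0
Q = np.diag([1.0, 10.0])
x_star, iters = grad_descent(lambda x: Q @ x, x0=[5.0, 5.0], step=0.15)
print(x_star, iters)
```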

Chapter 4: Basics of the Gradient Method

Overview

Algorithms that use information about the gradient of a function to find a solution to an optimization problem.
Representative examples are the coordinate descent method and the steepest descent method.

4.1 Line search method

Line search method (line search)
First, find the direction in which the objective function becomes small
Next, calculate the step size that represents how much x should be moved in that direction.
The step size is then used to update the point
Steepest descent method
Newton method
Quasi-Newton method
Determine if the function value has decreased sufficiently
Armijo condition
Wolfe condition
Allowable range of step sizes
Strong Wolfe condition
Goldstein condition
A method to obtain an appropriate step size under the Armijo condition by devising a search procedure for the parameter α
Backtracking Method Algorithm
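A minimal sketch of the backtracking method: shrink the step α by a factor ρ until the Armijo condition f(x + αd) ≤ f(x) + c1·α·∇f(x)ᵀd holds (the parameter values are common defaults, not taken from the book):

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha0=1.0, rho=0.5, c1=1e-4):
    alpha, fx, g = alpha0, f(x), grad_f(x)
    slope = g @ d                      # directional derivative; d must be a descent direction
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= rho                   # shrink until sufficient decrease holds
    return alpha

f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([3.0, -1.0])
d = -grad_f(x)                          # steepest descent direction
print(backtracking(f, grad_f, x, d))    # 0.5 for this example
```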
Supplement
Approaches to choosing the step size
Method to find the exact minimum solution
Conjugate gradient method
A method that only needs to obtain a sufficient decrease
Backtracking method
Direct search may be combined with annealing method to escape local minima

4.2 Iterative method with linear search

Optimize function f using iterative method and linear search method
Suppose the iteration has proceeded to the point xk
Proceed from xk in the search direction dk, choose an appropriate step size αk, and update the point xk as in the above equation
The search direction dk is chosen to satisfy the above equation
dk is called a descent direction of f at xk
From Taylor's theorem, the above equation becomes
For sufficiently small αk, f(xk+αkdk) − f(xk) < 0, so the iteration proceeds in a direction that decreases the function value
Algorithm
When the Wolfe condition is used for the line search, the above theorem guarantees convergence of the algorithm

4.3 Coordinate Descent Method

The coordinate descent method optimizes the objective function along each coordinate
Let ei be the unit vector whose i-th element is 1 and whose other elements are 0
The search direction of the coordinate descent method is selected from ±e1,…,±en
How to choose the search direction
Select uniformly at random
Cycle through 1,2,…,n,1,2,…,n
When searching cyclically, convergence can be guaranteed by choosing the step size appropriately
Select the coordinate where the absolute value of the corresponding element of the gradient ∇f is largest
Find the lower bound of the cosine cos θk in Theorem 4.2 (equations continued)
Algorithm
May fail to converge when the objective function is non-differentiable

4.4 Steepest Descent Method

When the gradient vector is easy to compute, the steepest descent method is a convenient choice
4.4.1 Optimization Algorithm
A type of iterative method that uses a line search
Use dk = −∇f(xk) as the search direction
Proof
Algorithm
Use the Wolfe condition as the criterion for choosing the step size
4.4.2 Convergence Speed of the Steepest Descent Method
Example: applying the steepest descent method to a convex quadratic function
Q ∈ ℝn×n is a positive definite symmetric matrix
The optimal solution is x*, and the optimal value is f(x*) = 0
Condition on the step size
Using this result, and finally letting λ denote the eigenvalues of Q
Theorem: Convergence of the steepest descent method
The rate is governed by the ratio of the minimum to the maximum eigenvalue (the condition number)
The closer the condition number is to 1, the more circular the contour lines of f(x); the larger it is, the more elongated they become
Corollary: Kantorovich's inequality
4.4.3 Steepest Descent with the Backtracking Method
The steepest descent method with backtracking line search is widely used as a practical algorithm
Convergence of this method

4.5 Applications to Machine Learning

4.5.1 Coordinate Descent and Boosting
Application of the coordinate descent method to machine learning
Suppose binary-labeled data has been observed
Predict the output b ∈ {+1,−1} for new data a by the sign of the above equation
If H(a) is positive, the output b is +1; if H(a) is negative, the output b is −1
h1(a),…,hn(a) are basis functions taking values in {+1,−1}
Each basis function can be simple, but complex functions can also be used depending on the application
Boosting is a general method for constructing a classifier with high prediction accuracy by sequentially combining simple basis functions
The coefficients x = (x1,…,xn) of the function Hx(a) are determined appropriately from the observed data
If bi Hx(ai) > 0 for a data point (ai,bi), the function Hx classifies it correctly
Determine x so that bi Hx(ai) > 0 holds for as many observed data points as possible
A learning method that minimizes the loss function (above equation) has been proposed.
Calculation of exponential loss
Let the weights of the data be the above equation
1{A} takes the value 1 if A is true and 0 if A is false
The coordinate direction that decreases the objective function the most corresponds to the direction minimizing the above equation among ±h1,…,±hn
Algorithm
Figure: learning algorithm using the coordinate descent method as the optimization method
Algorithm of Adaboost
Boosting
Generate a strong learning machine by combining a series of weak learning machines.
Add weights to the weak learning machines.
The difference between various types of boosting is the weighting of data points and hypotheses.
AdaBoost
AdaBoost is a typical example of boosting.
Adaptive boosting
Apply the weak classifiers in order from t=1 to t=T
Determine whether each classifier is correct or not
Affected by noisy data and outliers
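A compact sketch of AdaBoost with threshold stumps as the weak learners, on hypothetical one-dimensional data; the data reweighting follows the exponential-loss view above:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=40)
b = np.where(a + 0.3 * rng.normal(size=40) > 0, 1, -1)   # noisy labels in {+1,-1}

def best_stump(a, b, w):
    """Pick threshold t and sign s minimizing the weighted error of s*sign(a-t)."""
    best = None
    for t in a:
        for s in (1, -1):
            pred = s * np.sign(a - t + 1e-12)
            err = np.sum(w * (pred != b))
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

T = 20
w = np.full(len(a), 1.0 / len(a))       # data weights
H = np.zeros(len(a))                     # accumulated strong classifier on training data
for _ in range(T):
    err, t, s = best_stump(a, b, w)
    err = max(err, 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)   # weight of the weak learner
    pred = s * np.sign(a - t + 1e-12)
    H += alpha * pred
    w *= np.exp(-alpha * b * pred)          # upweight misclassified points
    w /= w.sum()

print("training accuracy:", np.mean(np.sign(H) == b))
```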
4.5.2 Error Back Propagation Method
Suppose input-output data pairs (a, b) are observed
Learning a multilayer perceptron
Various input-output relationships can be represented by adjusting the parameters
The parameters are matrices W(ℓ), ℓ=1,…,L of appropriate sizes
The relation between input a and output b is defined as the above equation
Expressed in terms of the equation for b
The function Φℓ is a suitable nonlinear function
Example
Diagram of the calculation process
Equation
Compute the multilayer perceptron from the data
The function Φℓ is defined as above using the same function φ:ℝ→ℝ for all ℓ
Define the objective function as the above equation
Because the gradient computation is needed to compute the error, the chain rule propagates derivatives from the output layer back to the input layer
Determine the gradient
Differentiation of the multilayer perceptron
Let u be the above equation
From the derivative of the ℓ+1 layer, we can find the derivative of the ℓ layer
Differentiation with respect to parameters
Differentiate sequentially from output to input
Problem
When the number of layers is large, parameter updates near the input layer stagnate because the propagated gradients vanish due to numerical error
Gradient method with stochastic optimization
Evaluating the full sum of squared errors f(W) at every update is costly
A method that updates the parameters each time a data point is observed is used instead
A class of algorithms that use only a small number of data points at each step to update the parameters in a stochastic setting
Consider the problem of learning the parameters W of the function Φ(a,W) from input and output data (a,b)
One data (at,bt) is given at time t
Update the parameter W as in the above equation
The observed data (a1,b1),…,(aN,bN) have already been recorded
At each time t=1,2,…, randomly draw (at,bt) from the data according to a uniform distribution
With high probability, the above update converges to a local solution of f(W)
memo
A typical error back propagation method is equivalent to the gradient method
Use fast automatic differentiation to compute the gradient
Online algorithms related to the stochastic gradient method
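A minimal sketch of Section 4.5.2: a backward (chain rule) pass combined with stochastic updates for a small two-layer network, on hypothetical regression data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(-2, 2, size=(200, 1));  B = np.sin(A)        # data pairs (a_t, b_t)

W1 = rng.normal(size=(16, 1)) * 0.5                           # layer parameters
W2 = rng.normal(size=(1, 16)) * 0.5
phi, dphi = np.tanh, lambda u: 1 - np.tanh(u) ** 2            # activation and derivative
eta = 0.05                                                    # learning rate

for epoch in range(200):
    for t in rng.permutation(len(A)):                         # one data point at a time
        a, b = A[t:t+1].T, B[t:t+1].T
        u1 = W1 @ a; h = phi(u1)                              # forward pass
        y = W2 @ h
        delta2 = y - b                                        # backward pass (chain rule)
        delta1 = (W2.T @ delta2) * dphi(u1)
        W2 -= eta * delta2 @ h.T                              # stochastic gradient updates
        W1 -= eta * delta1 @ a.T

print("final squared error:", float(np.mean((phi(A @ W1.T) @ W2.T - B) ** 2)))
```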

Chapter 5: The Newton Method

Overview

An optimization method using the Hessian matrix
Uses not only the gradient but also the information in the Hessian matrix
memo
Newton's method is an iterative root-finding algorithm for solving equations numerically
The only conditions on the equations to be solved concern differentiability on the domain and the sign of derivatives; linearity is not required

5.1 Derivation of the Newton Method

Newton's method is a computational method that uses not only the gradient but also the Hessian matrix
When the Hessian matrix is easy to compute, the Newton method is an efficient optimization method
Minimization of twice-differentiable functions
Suppose the iterative method has proceeded to xk
Taylor expansion of f around the point xk yields the above equation
Assuming the Hessian matrix ∇²f(xk) is positive definite, the δ minimizing the second-order Taylor approximation is given by the above equation
If we update xk by the above equation, the objective function can be expected to decrease
The method that sets the search direction and step size as in the above equation is called the Newton method
Calculation algorithm of Newton's method
Assumes the Hessian matrix of f is positive definite
Assumes xk is sufficiently close to the optimal solution
Care is needed when applying Newton's method in practice
Away from the solution, the decrease of the objective function is not guaranteed
Theorem: Convergence of Newton's method
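A minimal sketch of the Newton iteration x ← x − [∇²f(x)]⁻¹∇f(x), on a toy function whose Hessian is positive definite everywhere:

```python
# Newton's method on f(x,y) = x⁴ + y⁴ + x² + y², minimized at the origin.
import numpy as np

def grad(z):
    x, y = z
    return np.array([4 * x**3 + 2 * x, 4 * y**3 + 2 * y])

def hess(z):
    x, y = z
    return np.diag([12 * x**2 + 2, 12 * y**2 + 2])   # positive definite for all (x, y)

z = np.array([2.0, -1.5])
for k in range(20):
    g = grad(z)
    if np.linalg.norm(g) < 1e-10:
        break
    z = z - np.linalg.solve(hess(z), g)   # Newton step: solve ∇²f δ = −∇f
print(z, k)                                # converges quadratically to (0, 0)
```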

5.2 Covariance under Coordinate Transformations

When constructing an optimization problem based on observed real data, changing the unit system of the data transforms the optimization problem accordingly
How does the behavior of the optimization algorithm relate to the coordinate transformation?
Newton's method is "covariant" under coordinate transformations
The steepest descent method is not

5.3 Modified Newton Method

Solves the problem of Newton's method
Applying the Cholesky factorization
Algorithm
Modified Newton method is not covariant

5.4 Gauss-Newton Method and Related Topics

Overview
In statistics and machine learning, the problem of minimizing the sum or mean of error functions for data often arises.
Write the objective function as a sum of squares of m functions
Calculating the gradient and the Hessian matrix
An optimization method that simplifies the computation and improves efficiency when the Hessian matrix is difficult to compute
Gauss-Newton method
Levenberg-Marquardt method
5.4.1 Derivation of the Gauss-Newton Method
Search direction of Gauss-Newton method
No need to compute the Hessian matrix
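A minimal sketch of the Gauss-Newton iteration on a hypothetical curve-fitting problem: the Hessian is replaced by JᵀJ, so only first derivatives of the residuals are needed:

```python
# Gauss-Newton for f(x) = ½ Σ r_i(x)²: solve JᵀJ δ = −Jᵀr each step.
# Hypothetical model: fit y ≈ exp(x1 * t) + x2 to data (t_i, y_i).
import numpy as np

t = np.linspace(0, 1, 20)
y = np.exp(0.7 * t) + 0.3                    # data generated with x = (0.7, 0.3)

x = np.array([0.0, 0.0])                     # initial parameters (x1, x2)
for _ in range(30):
    r = np.exp(x[0] * t) + x[1] - y          # residuals r_i(x)
    J = np.stack([t * np.exp(x[0] * t),      # ∂r/∂x1
                  np.ones_like(t)], axis=1)  # ∂r/∂x2
    delta = np.linalg.solve(J.T @ J, -J.T @ r)
    x = x + delta
print(x)                                      # ≈ [0.7, 0.3]
```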
5.4.2 Levenberg-Marquardt Method
Algorithm

5.5 Natural Gradient Method

5.5.1 Descent Direction Determined from the Fisher Information Matrix (described in "Overview of the Fisher Information Matrix and Related Algorithms and Examples of Implementations")
In machine learning and statistics, a distance different from the Euclidean distance may be naturally defined in the space of optimization parameters x
Hellinger distance
Approximation of the Hellinger distance
When x and x' are close
G(x): Fisher information matrix
A gradient method for parameter estimation in statistics, where the distance structure is locally determined by dG(x,x').
A generalization of the steepest descent method.
Example: the Fisher information matrix of a normal distribution with expected value µ and variance v
Probability density of parameter x
Calculating the gradient of log p(a; x) with respect to x
The Fisher information matrix is
The distance dG(x,x') between x=(µ,v) and x'=(µ+δµ,v+δv) is
For variations in parameters, the larger the variance v, the smaller the distance between distributions
The larger the variance, the more difficult it becomes to distinguish the distributions from the data.
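A quick Monte Carlo check of this example: for N(µ, v) with v the variance, the Fisher information matrix is G = diag(1/v, 1/(2v²)), so the same perturbation of the parameters gives a smaller distance when v is large:

```python
import numpy as np

mu, v = 1.0, 2.0
rng = np.random.default_rng(0)
a = rng.normal(mu, np.sqrt(v), size=2_000_000)

# score = ∇_x log p(a; x) for x = (µ, v)
score = np.stack([(a - mu) / v,
                  (a - mu) ** 2 / (2 * v**2) - 1 / (2 * v)])
G = score @ score.T / a.size       # E[score scoreᵀ] ≈ Fisher information matrix
print(G)                            # ≈ [[1/v, 0], [0, 1/(2v²)]] = [[0.5, 0], [0, 0.125]]
```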
The natural gradient method is the steepest descent method in which the distance structure of the parameter space is determined from the Fisher information matrix
It is a generalization of the steepest descent method; depending on the distance structure and the objective function, its search direction can be almost identical to that of the Newton method
5.5.2 Natural Gradient Method in Online Learning
Online Algorithm for Natural Gradient Method
See details later.

Chapter 6: Conjugate Gradient Method

Overview

The conjugate gradient method is used for optimization of large-scale problems because it is computationally inexpensive and memory-efficient.
memo
It is used as an iterative method
Algorithm for solving simultaneous linear equations with symmetric positive definite matrices as coefficients
Optimization method with better convergence than the steepest descent method

6.1 Conjugate direction method

6.2 Conjugate gradient method

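A minimal sketch of the (linear) conjugate gradient method for Qx = b with Q symmetric positive definite, equivalently for minimizing (1/2)xᵀQx − bᵀx:

```python
import numpy as np

def conjugate_gradient(Q, b, x0=None, eps=1e-10):
    x = np.zeros_like(b) if x0 is None else x0.astype(float)
    r = b - Q @ x                       # residual = −gradient
    d = r.copy()                        # first search direction
    while np.linalg.norm(r) > eps:
        alpha = (r @ r) / (d @ Q @ d)   # exact line search along d
        x += alpha * d
        r_new = r - alpha * (Q @ d)
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d            # new direction, Q-conjugate to the old ones
        r = r_new
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(Q, b), np.linalg.solve(Q, b))  # both ≈ [0.0909, 0.6364]
```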
6.3 Nonlinear Conjugate Gradient Method

memo
Conjugate gradient method (共役勾配法) - Qiita
Introduction: that article is based on section 3.3 "Conjugate gradient method" of "Optimization Mathematics You Can Understand: From Basic Principles to Calculation Methods" (Kenichi Kanatani, Kyoritsu Shuppan), and was written for a session of a self-study seminar
Univariate gradient method
Case of n variables
Conjugate gradient method
What does "conjugate" mean?
"Conjugate" is an extension of the concept of "orthogonal"
Explanation: for a positive definite symmetric matrix Q, if vectors x and y satisfy (x, Qy) = 0, then x and y are said to be conjugate with respect to Q
Avoiding the calculation of Hessian matrices
Since the Hessian matrix contains second-order partial derivatives, it is expensive to compute; there are several ways to avoid this

Chapter 7: Quasi-Newton Method

Overview

An algorithm for finding solutions of nonlinear simultaneous equations, or maxima and minima of functions, in continuous optimization problems
How to avoid computing the Hessian matrix in Newton's method and perform the optimization efficiently

7.1 Optimization Methods Using a Variable Metric and the Secant Condition

7.2 Proximal Update of a Positive Definite Matrix

7.2.1 Formulation of the Update Rule by Divergence Minimization
7.2.2 Relationship between Properties of the Divergence and the Update Rule
7.2.3 Distance Minimization and Divergence Minimization
7.2.4 Derivation of the Update Rule

7.3 Convergence of the Quasi-Newton Method

7.4 Limited-Memory Quasi-Newton Method

7.5 Exploiting the Sparsity of Hessian Matrices

7.5.1 Graphical Representation of Functions
Representing relationships between variables in graph-theoretic terms
A subgraph is a clique (complete subgraph) when every pair of its vertices is joined by an edge
Maximal cliques
Chordal graphs
7.5.2 Positive Definite Matrix Completion
7.5.3 Update Rule by Sparse Clique Decomposition
memo
The quasi-Newton method is one of the methods mainly used to solve unconstrained minimization problems
The outline is the same as Newton's method, but instead of using the Hessian matrix ∇²f(xk) of the objective function f at the point xk, an approximation matrix Bk of the Hessian is used, and the search direction is obtained by solving the corresponding linear system
Instead of solving that system, the search direction can also be obtained directly by using an approximation matrix Hk of the inverse Hessian at the point xk
As update rules for the approximation matrices Bk and Hk, the BFGS and DFP formulas, which are update formulas satisfying the secant condition, are well known

Chapter 8: Trust Region Method

Overview

Trust region
A term used in the context of mathematical optimization: a subregion in which it is assumed reasonable to approximate the objective function by some model function (often a quadratic function)
Trust region method
Relationship (a kind of duality) with line search methods

8.1 Structure of the Algorithm

8.2 Approximate Solution of the Subproblem

8.2.1 Optimality Conditions for the Subproblem
8.2.2 The Dogleg Method

8.3 Convergence

memo
One of the gradient methods for solving unconstrained optimization problems
A method designed so that Newton's method converges globally even if the Hessian matrix is not positive definite; it has also been extended to the framework of quasi-Newton methods and constrained optimization
Given the approximate solution x_k at the k-th iteration, find the step s_k that minimizes a quadratic approximation of the objective function within the trust region where that approximation seems reasonable
Then, based on the reduction of the function value, adjust the size of the trust region or update the approximate solution as x_{k+1} := x_k + s_k
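To close the chapter, a sketch of a trust region loop with the dogleg step on a hypothetical Rosenbrock-like function; the radius-update constants are common choices, not the book's:

```python
import numpy as np

def f(x):    return (1 - x[0])**2 + 5 * (x[1] - x[0]**2)**2
def grad(x): return np.array([-2*(1 - x[0]) - 20*x[0]*(x[1] - x[0]**2),
                              10*(x[1] - x[0]**2)])
def hess(x): return np.array([[2 - 20*x[1] + 60*x[0]**2, -20*x[0]],
                              [-20*x[0], 10.0]])

def dogleg(g, B, radius):
    gBg = g @ B @ g
    if gBg <= 0:                                    # model nonconvex along −g: fall back
        return -radius * g / np.linalg.norm(g)
    pC = -(g @ g) / gBg * g                         # Cauchy (steepest descent) step
    pN = np.linalg.solve(B, -g)                     # Newton step
    if np.linalg.norm(pN) <= radius:
        return pN
    if np.linalg.norm(pC) >= radius:
        return radius * pC / np.linalg.norm(pC)
    d = pN - pC                                     # walk from pC toward pN to the boundary
    a, b, c = d @ d, 2 * pC @ d, pC @ pC - radius**2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return pC + tau * d

x, radius = np.array([-1.0, 1.0]), 1.0
for _ in range(200):
    g, B = grad(x), hess(x)
    if np.linalg.norm(g) < 1e-8:
        break
    s = dogleg(g, B, radius)
    pred = -(g @ s + 0.5 * s @ B @ s)               # decrease predicted by the model
    rho = (f(x) - f(x + s)) / pred if pred > 0 else -1.0
    if rho < 0.05:
        radius *= 0.25                               # poor agreement: shrink the region
    elif rho > 0.75 and np.isclose(np.linalg.norm(s), radius):
        radius = min(2 * radius, 10.0)               # good agreement at the boundary: expand
    if rho > 0:
        x = x + s                                     # accept the step
print(x)                                              # ≈ [1, 1]
```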
Part III Constrained Optimization

Chapter 9: Optimality Conditions for Equality-Constrained Optimization

9.1 First-Order Optimality Conditions
9.2 Second-Order Optimality Conditions
9.3 Optimality Conditions and Duality for Convex Optimization Problems
9.4 Sensitivity Analysis

Chapter 10: Optimality Conditions for Inequality-Constrained Optimization

10.1 First-Order Optimality Conditions
10.2 Second-Order Optimality Conditions
10.3 Optimality Conditions for Convex Optimization Problems
10.4 Primal Problems and Dual Problems

Chapter 11: Optimization Methods for the Primal Problem

11.1 Active Set Methods
11.1.1 Choosing the Search Direction
11.1.2 Sign of the Lagrange Multiplier
11.1.3 Updating the Active Set and the Optimization Algorithm
11.2 Penalty Function Method
11.2.1 Formulation with a Penalty Function
11.2.2 Properties of the Unconstrained Optimization Problems in the Penalty Function Method
11.2.3 Exact Penalty Function Method
11.3 Barrier Function Method
11.3.1 Formulation Using the Barrier Function Method
11.3.2 Properties of the Barrier Function Method

Chapter 12: Optimization Methods Using Lagrangian Functions

12.1 Dual Ascent Method
12.1.1 Introduction of the Lagrangian and Derivation of the Dual Problem
12.1.2 Optimization of the Dual Problem Using the Gradient Method
12.1.3 Dual Decomposition
12.1.4 Dual Ascent Method for Inequality Constraints
12.1.5 Convergence of the Dual Ascent Method
12.1.6 Dual Ascent for Nonlinear Constraints
12.2 Augmented Lagrangian Method
12.2.1 Augmented Lagrangian
12.2.2 Augmented Lagrangian Function Method
12.2.3 Augmented Lagrangian Method as a Dual Ascent Method
12.2.4 Handling Inequality Constraints
12.2.5 Convergence Theory of the Augmented Lagrangian Method
12.2.6 Convergence-Rate Theory for Convex Objective Functions
12.2.7 Relationship to the Proximal Point Algorithm
12.3 Alternating Direction Method of Multipliers (ADMM)
12.3.1 The ADMM Algorithm
12.3.2 Parallel Computation with ADMM

Part IV Optimization as a Learning Algorithm

Chapter 13: Upper-Bound Minimization (MM) Algorithm

Introduction
A sequential solution method that generates a sequence of points monotonically decreasing the objective function
One of the approaches when it is difficult to optimize the objective function directly

13.1 Upper-Bound Minimization Algorithm

A general framework for sequential methods that generate a monotonically decreasing sequence of points for the objective function
The expectation-maximization (EM) algorithm is one instance of the MM framework
It is one of the most famous methods for maximum likelihood estimation of the parameters of a probabilistic model
Definition: Directional derivatives and stationary points
A stationary point is not necessarily an extremum or a point of maximum or minimum value
Consider the minimization problem (above equation)
Consider a setting where it is difficult to minimize f(x) directly
Instead of solving this minimization problem directly, the MM algorithm generates a sequence of candidate solution points by successively minimizing approximate functions of f(x)
Specifically
Let u(x;xt) be an approximation of the objective function f(x) around the t-th candidate solution point xt
Find the (t+1)-th candidate solution point xt+1 by the above equation
Conditions that the approximation u must satisfy in order to solve the optimization problem minx∈X f(x)
A function u satisfying the above conditions is called a surrogate function
The MM algorithm generates a sequence of candidate solution points by repeating the two steps of majorization and minimization using the surrogate function
Example of the MM algorithm in action
Properties of the MM Algorithm
Theorem: The MM algorithm is monotonically decreasing
Theorem: Convergence of the MM algorithm

13.2 Examples of Surrogate Functions

Typical methods for finding a surrogate function
Specifically, several inequalities that can be used to construct surrogate functions
Use of linearization
Let h(x) be a once-differentiable convex function
From Theorem 2.12, the above equation holds
By using −h(x), a surrogate function can be constructed
Use of quadratic approximation
Let h(x) be a twice-differentiable convex function whose Hessian matrix is bounded
If M is a positive definite matrix and M − ∇²h(x) is positive semidefinite for any x, then the above inequality holds
Use of Jensen's inequality
Let h(x) be a convex function
For a random variable X, Jensen's inequality (above) holds
Used to derive the EM algorithm

13.3 EM Algorithm

Introduction
A typical algorithm for maximum likelihood estimation of statistical models that include unobservable variables called latent variables
Maximum likelihood estimation of the parameters can be derived in the MM framework by regarding it as minimization of the negative log likelihood
Explanation of the EM Algorithm
Assumptions
Let the observed data be y1:n = (y1,…,yn)
Consider a probabilistic model p(y|θ) with parameter θ
Maximum likelihood estimation is the above equation
Assume a latent variable zi for each observed data point yi
Compute the upper bound of the negative log likelihood; u(θ;θt) is the above equation
EM Algorithm
E step: compute the posterior distribution p(z1:n|y1:n, θt) of z1:n given θt, and derive u(θ;θt)
This finds the surrogate function
M step: find the optimal solution θ of u(θ;θt)
This amounts to maximizing the likelihood bound, i.e., minimizing the surrogate function
Repeat the E and M steps
Latent variables are auxiliary variables introduced to construct the surrogate function for optimizing the objective function −log p(y1:n|θ)

13.4 Optimizing the Difference of Two Convex Functions

Suppose the function f(x) can be written in terms of once-differentiable convex functions g(x) and h(x) as in the above equation (f = g − h)
The function f(x) is in general non-convex unless h(x) is a linear function
Such a minimization problem for f(x) is called a DC programming problem
Theorem: Transformation to a DC programming problem
Very important for the optimization of nonconvex functions
Solution of DC programming problems (CCCP)
Generate candidate solution points {xt} such that ∇g(xt+1) = ∇h(xt) is satisfied
By monotonically decreasing the objective function f(x), a stationary point can be found
Analysis of the CCCP in terms of the MM algorithm
Using the linearization of h(x), derive the above equation as a surrogate function for f(x)
The DC programming problem is solved by iteratively minimizing the right-hand side

13.5 Proximal Point Algorithm

With f(x) a convex function, the minimization problem becomes equivalent to the above equation
Since f(x) + (1/2c)∥x − y∥² is strongly convex with respect to x, the minimization of a convex function can be solved as a sequence of strongly convex minimization problems
The proximal point algorithm obtains the minimizer of the objective function by repeatedly applying the above update
It is the same as the MM algorithm with the surrogate function given by the above equation

Chapter 14: Support Vector Machines and Optimization

14.1 Formulation of SVM and the Optimization Problem

Introduction
Hard-margin SVM and soft-margin SVM
Linear classification boundary of the SVM
w is the coefficient vector of the linear classification boundary
b is the bias
SVM learning is an optimization problem to find the optimal w and b from the training data
14.1.1 Primal Problem of the SVM
Optimization problem for the linear SVM based on soft-margin maximization
The two terms are the regularization term and the loss term, respectively
The loss function is called the hinge loss
The 0-1 loss is a non-convex, discontinuous function, so it cannot be optimized as is
The hinge loss is a surrogate loss function for the 0-1 loss
Since the hinge loss is a convex function, it can be optimized
However, the hinge loss is not differentiable at yi f(xi) = 1
Two approaches for dealing with this problem
Convert the unconstrained optimization problem into a constrained optimization problem
Introduce artificial (slack) variables to convert it into a constrained problem
The hinge loss is expressed as a piecewise linear function with two pieces
With the vector of n slack variables 𝝃 = (ξ1, ξ2,…, ξn), the problem can be transformed into a differentiable constrained optimization problem
Modify the hinge loss to make it a smooth loss function
Squared hinge loss
Huber hinge loss
(Vanilla) hinge loss
Logistic regression
14.1.2 The Dual Problem of the SVM
Derivation of the dual problem
Lagrangian for the constrained optimization problem
Objective function of the dual problem
The problem of maximizing the dual objective function under nonnegativity constraints on the Lagrange multipliers α and µ
The Lagrangian is convex with respect to the primal variables w, b, and 𝝃
It is minimized at the point where the partial derivatives with respect to the primal variables vanish
Dual problem of the SVM
14.1.3 Optimality Conditions for SVMs
One of the features of SVMs is that the classification boundary is not determined by all training cases but is characterized by a subset of cases called support vectors
From the optimality conditions of the SVM, the relationship between the dual variables α1,…,αn and the support vectors becomes clear
Relationship between the case sets and the margin
Training cases with margin yi(wTxi + b) larger than 1 do not affect the classification boundary because the corresponding dual variable αi becomes 0
Removing non-SVs does not change the final classification boundary
14.1.4 Nonlinear Modeling with Kernel Functions
One of the features of the SVM is the ability to do nonlinear modeling using kernel functions
Consider a function Φ that maps an input x to some feature space F
The linear classification boundary when Φ(x) is interpreted as a new feature vector
Note that the coefficient vector w is also defined as an element of the feature space F
Replace x by Φ(x) in the dual problem
Dual representation of the classification boundary
In the dual problem and the classification boundary, the feature Φ(x) never appears alone but only in the form of an inner product Φ(x)TΦ(x')
Define this inner product as a kernel function
By using a function that satisfies certain properties, the inner product can be computed directly without evaluating Φ(x)
RBF (radial basis function) kernel
Expression as a kernel function

14.2 Optimization Algorithms for SVM Learning

Introduction
Optimization algorithms specific to SVM learning
Used in LIBSVM and LIBLINEAR
SMO (sequential minimal optimization) algorithm
Squared hinge loss
DCDM (dual coordinate descent method) algorithm
Huber hinge loss
14.2.1 Solving the Kernel SVM Dual Problem: the SMO Algorithm
The dual problem of the SVM is an optimization problem in n unknown variables α1,…,αn
When the number of training cases n is not large, the dual problem can be solved by a general-purpose algorithm for constrained optimization
When n is large, the cost is huge
Approach
Optimize only a subset of the n variables (the working set) and fix the remaining variables as constants
A decomposition method with working-set size 2
Iteratively solves a minimal optimization problem with only two variables
Consider the dual problem in which only two variables αs and αt are treated as variables and the rest are fixed as constants
Let the dual variables be βi = yiαi
In terms of the variables β1,…,βn, the dual problem of SV classification becomes the above equation, with Kij = K(xi,xj)
Let the indices of the two dual variables to be changed be s and t
To satisfy the equality constraint (zero sum), ∆βt = −∆βs
Effectively, the only free variable is βs
Rewriting the dual problem as a constrained optimization problem in ∆βs
The above equation is a constrained quadratic minimization problem in the single variable ∆βs
Let the unconstrained extremum of the quadratic function be the above equation; the optimal solution is obtained by clipping it to the constraint
The two variables may also be chosen at random
14.2.2 Solving the Dual Problem for the Linear SVM: the DCDM Algorithm
An optimization algorithm specific to linear SVMs
Primal problem when considering a classification boundary without the constant term b
Dual problem
Variant of the dual problem
The DCDM algorithm is a decomposition method like the SMO algorithm
The SMO algorithm had an equality constraint, so it considered a working set of two variables
In DCDM there is no equality constraint, so a single αs is updated
At each step of DCDM, only αs is treated as a variable and the rest are fixed as constants
Organize the dual problem with respect to the update width ∆αs
A constrained quadratic minimization problem in the single variable ∆αs
Extremum of the quadratic function
Optimal solution
In DCDM, the working set is selected by taking the n variables α1,…,αn in turn and repeating the update
14.2.3 Comparison of Learning Algorithms
Comparison of the SVM-specific DCDM method with the general-purpose trust region method and the limited-memory quasi-Newton method
Use the Huber hinge loss because differentiability is required
Primal problem with the Huber hinge loss SVM
Dual problem
Experimental data used for comparison
Experimental result 1: small-scale data
Experimental result 2: medium-scale data
Experimental result 3: large-scale data

14.3 Regularization Path Tracking

Introduction
Hyperparameters
Example: the regularization parameter C controls the balance between the loss function and the regularization term
Proper selection of hyperparameters is needed
Using a warm start is an efficient way to perform model selection
When solving multiple optimization problems with different hyperparameter values, the solution of one optimization problem can be used as the initial value to obtain the solution of another efficiently
Regularization path tracking
Multiple optimization problems can in effect be solved at once
It is possible to compute exactly how the optimal solution changes as the regularization parameter C changes continuously
One of the approaches called parametric optimization (parametric programming)
Consider a class of optimization problems in parameterized form and obtain their optimal solutions as functions of the parameter
The goal is to obtain the optimal solution of the dual problem as a function of the regularization parameter C
In regularization path tracking, the optimality conditions play an important role
To emphasize that the three case sets change depending on C, they are written as above
The three case sets are expressed as in the above equations
Regularization path tracking algorithm
14.3.1 Parameterization of the Optimal Solution (Step 1)
Consider an interval of C on which the case set T(C) is constant
14.3.2 Event Detection (Step 2)
Events at which the case set T(C) changes as the regularization parameter C is gradually increased

14.4 Safe Screening with Optimality Guarantees

The cases belonging to M and I are the support vectors (SVs)
The cases belonging to O are called non-support vectors (non-SVs)
Cases in O have αi = 0, so excluding them does not change the result
Many SVM solvers use heuristics to guess the non-SV cases and solve the optimization problem with the rest
Safe screening
Identify non-SV cases with a guarantee
Example of safe screening
Basic idea
Optimization problem
Gradient vector
If the margin is greater than 1, the case is a non-SV (belongs to O)
What can be done when the optimal solution w* is unknown?
Consider a situation where a region containing the optimal solution is known
If the optimal solution w* is contained in Ω and the lower bound of the margin over Ω is greater than 1, the case is a non-SV
Two challenges for realizing safe screening
How to solve the minimization problem efficiently once the region Ω is obtained
How to obtain a region Ω that contains the optimal solution w*
Approach to the minimization problem when the region Ω is a hypersphere
Hypersphere
Theorem: Lower and upper bounds for linear scores
Lower and upper bounds on ηTw* when the hypersphere contains the solution w*
Setting η = yixi gives a lower bound on the margin and enables the non-SV decision above
How to find Ω?
Theorem: a hypersphere containing the optimal solution, constructed from an approximate solution
The center and radius of the hypersphere Ω depend on an arbitrary vector ŵ
ŵ is called an approximate solution
Which approximate solution makes the radius of the hypersphere small?
When the approximate solution ŵ is sufficiently close to the optimal solution, the radius r(ŵ) becomes small
If an approximate solution ŵ close to the optimal solution w* can be chosen, the upper and lower bounds become tighter and many non-SVs can be safely identified
The approximate solution can be obtained, for example, as follows
A solution obtained partway through the optimization algorithm serves as an approximate solution
Non-SV cases identified while solving for the optimal solution can be removed from the training data
Applications of the same idea
Sequential learning, online learning, and streaming settings where training data arrive one after another
The optimal solution before a change to the training data serves as an approximate solution after the change, allowing screening of non-SVs of the new optimal solution
Optimization problems for multiple regularization parameters C in model selection
The optimal solution for one regularization parameter serves as an approximate solution for the optimization problem with another regularization parameter
Theorem: Safe screening along the regularization path
Research on safe screening: El Ghaoui (17); actively researched as of 2016

Chapter 15: Sparse Learning

15.1 Sparse Modeling

Methods that extract low-dimensional subspaces, such as sparse modeling, have long been used for model selection in statistics
Criteria for model selection
Akaike's Information Criterion (AIC)
Assumptions
A statistical model {pθ | θ∈Θ} is given
Θ is a set of parameters
pθ is the probability density function corresponding to the parameter θ
A sequence Θk ⊆ Θ (k=1,2,…,d) of k-dimensional submodels of Θ is given
Objective: choose the "right" model among these models
The model with the largest dimensionality has the highest expressive power, but overfitting occurs if the model is too complex
AIC is a criterion that avoids overfitting by correcting for the gap between the prediction error (generalization error) and the training error
The AIC of model Θk is the above equation
If θ̂k is the maximum likelihood estimator in model Θk, then for n observed data points {z1,z2,…,zn}, AIC(k) is an asymptotically unbiased estimator of the expected log likelihood
AIC corrects the training error for the model's "complexity" by adding the penalty 2k for the dimensionality
Minimizing the AIC approximately minimizes the prediction error
AIC can be regarded as regularized learning using the dimension of the model as the regularization term
If the loss function satisfies submodularity, it can be computed by a greedy method
Bayesian Information Criterion (BIC)

15.2 L1 Regularization and Various Types of Sparse Regularization

15.2.1 L1 Regularization
A sparse modeling technique that can be formulated as a simple optimization problem, as an alternative to AIC
L1 regularization uses the L1 norm ∥θ∥1 = ∑|θj| instead of the L0 norm ∥θ∥0
C is the regularization parameter
The regularization parameter is determined by cross-validation, information criteria, etc.
The L1 norm is known to be the convex hull of the L0 norm on [−1,1]d
The optimal solution is in fact sparse
Example: sparse regression: Lasso
L1-regularized learning with the squared loss applied to regression
The Lasso estimator is given as the solution of the optimization problem (above)
Example: sparse classification: L1-regularized logistic regression
L1 regularization can be applied to classification in the same way as to regression
If the loss function is the logistic loss, it can be formulated as above
15.2.2 Other Sparse Regularizations
Generalized problem with loss function ℓ(z,w)
If the loss function ℓ is convex with respect to w, then this optimization problem is a convex optimization problem
∥w∥1 is not smooth, so the derivative cannot be computed everywhere
The subgradient method can be applied, but convergence is slow
Guideline for solving this problem
Use the structure of the regularization term
In the case of L1 regularization, exploit the separability of the function over the variables
Separate the difficulty of optimization between the loss-function term and the regularization term
Optimize as if the regularization term did not exist, then correct for it
Generalized formulation
Sparse regularization function Ψ(w)
Let f(w) = ∑ℓ(zi,w)
Regularization terms other than L1
Example: trace norm regularization, described in "Overview of the trace norm and related algorithms and implementation examples"
A regularization method used to learn low-rank matrices
Trace-norm regularization
B = (WTW)1/2 is a positive semidefinite symmetric matrix satisfying BB = WTW
The trace norm is the sum of the singular values
It is the L1 regularization of the vector of singular values (σ1,σ2,…,σr)
Matrices learned with trace-norm regularization have sparse singular values (many singular values tend to be zero)
Example: group regularization
Assumptions
The d indices {1,…,d} are divided into K groups G1,…,GK
The groups Gk may overlap
The subvector corresponding to each group Gk is written wGk = (wj)
The group regularization along this grouping is given by the above equation, with q > 1 and ∥wGk∥q = (∑|wj|q)1/q
Group regularization is L1 regularization applied to the vector of Lq norms of the subvectors corresponding to the groups
Example: elastic net regularization
A method that interpolates between L1 and L2 regularization
Equation
Compared with pure L1 regularization, adding L2 regularization gives stable learning that is less sensitive to noise
15.2.3 Numerical Evaluation of L1 Regularization
Experiment with artificial data
Dimension p = 2000
Generate data from the linear model in the above equation
X is a randomly generated design matrix
ε is noise following independent, identical standard normal distributions
w* is the true vector (a sparse vector in which only the first 100 components are nonzero and the rest are zero)
Sample sizes n = 1000, 2000, 4000 are used to observe the behavior of the estimation error
Results
Lasso's estimation error does not become too large even when the sample size is small
This is because Lasso performs variable selection and estimates using only the necessary variables
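A scaled-down sketch of this experiment using scikit-learn's Lasso (the sizes and noise level here are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 200, 500, 10                     # samples, dimension, true nonzeros
X = rng.normal(size=(n, p))
w_true = np.zeros(p); w_true[:k] = rng.normal(size=k)
y = X @ w_true + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.05).fit(X, y)
print("nonzero coefficients:", np.sum(model.coef_ != 0))
print("estimation error:", np.linalg.norm(model.coef_ - w_true))
```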
15.3 Solving with the Proximal Gradient Method

15.3.1 Algorithm of the Proximal Gradient Method
A method using the structure of the regularization term
A slightly modified version of the steepest descent method
Algorithm: ISTA (Iterative Shrinkage-Thresholding Algorithm)
Optimize the function f via the linear approximation f(w) ≃ f(wk) + gkT(w − wk) around wk
The approximation is accurate only around wk, where the derivative is taken
To avoid moving too far, the proximity term (ηk/2)∥w − wk∥² is added
The update equation of the proximal gradient method is the above equation
The proximal gradient method finally takes the form of the above equation
Let the proximal mapping associated with the function Ψ be the above equation
wk − gk/ηk is exactly the update of the steepest descent method for minimizing the function f
In the proximal gradient method, f is first decreased by a steepest descent step, which is then corrected by the proximal mapping determined by Ψ
The computational efficiency of the proximal gradient method is determined by the cost of the proximal mapping determined by the regularization term Ψ
Example: L1 regularization
The proximal mapping for L1 regularization, y = prox(q | c∥·∥1), is called soft thresholding
Example: group regularization without overlap
Example: trace norm regularization
Example: no regularization term
The update equation of the proximal gradient method reduces to the steepest descent method
Example: indicator function
(continued)
15.3.2 Convergence Theory of the Proximal Gradient Method
Convergence rate of the proximal gradient method
In practical applications, it is difficult to compute the smoothness constant L of the function f directly, so the backtracking method is used to find L
Algorithm: FISTA (Fast ISTA)
Algorithm
Convergence rate of the accelerated proximal gradient method
Proximal gradient algorithm for strongly convex functions
Restart method
15.3.3 Proof of the Convergence Rate of the Proximal Gradient Method
15.3.4 Numerical Experiments with the Proximal Gradient Method
Experiments on artificial data
Sample size n = 700
Dimension p = 1000
Data generation formula
Only the first 100 components have values; the remaining 900 are 0
Optimization problem
The loss function is the logistic loss
Figure showing the convergence of the objective function
Nesterov (restart) is the best
Generalization performance and number of nonzero components
Nesterov (restart) is the best
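A minimal sketch of ISTA for the Lasso objective (1/2)∥Xw − y∥² + c∥w∥1: a gradient step on the smooth term followed by soft thresholding, the proximal mapping of the L1 term:

```python
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def ista(X, y, c, n_iter=500):
    w = np.zeros(X.shape[1])
    eta = np.linalg.norm(X, 2) ** 2           # Lipschitz constant of the gradient
    for _ in range(n_iter):
        g = X.T @ (X @ w - y)                 # gradient of the smooth part
        w = soft_threshold(w - g / eta, c / eta)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
w_true = np.zeros(50); w_true[:5] = 3.0
y = X @ w_true + 0.1 * rng.normal(size=100)
w = ista(X, y, c=5.0)
print("nonzeros:", np.flatnonzero(w))         # concentrated in the first 5 coordinates
```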
15.4 Solving with the Coordinate Descent Method

The coordinate descent method is also effective for sparse regularized learning
Coordinate descent is useful when the computation of the partial gradient for each coordinate is light and the regularization term separates into blocks
In coordinate descent, each coordinate (or block of coordinates) is updated in turn so that the objective function decreases
Many sparse regularization functions can be expressed as a sum of functions separated by coordinates (or blocks)
Examples
L1 regularization takes absolute values coordinate-wise and then sums them
Group regularization also separates into a function for each group of coordinates
An expression representing a structure separated by coordinates
Algorithm
How to choose the coordinates
Choose randomly
Choose cyclically
More elaborate selection rules are difficult

15.5 Solving with the Alternating Direction Method of Multipliers

15.5.1 ADMM and Structured Regularization
Separate the optimization of the loss function from the optimization of the regularization term
Example: group regularization with overlaps
Group regularization
Example: generalized fused regularization
Regularization along a graph
ADMM optimizes the augmented Lagrangian by alternating over x and y
Augmented Lagrangian of the optimization problem
ADMM algorithm
Example: ADMM applied to the Lasso problem
Example: ADMM applied to regularization expressed as a sum of convex functions
15.5.2 Numerical Experiments on Image Recovery
Example of TV (total variation) denoising
Assume the true image is smooth, with no significant change between neighboring pixels
Deblurring and denoising using TV regularization

15.6 Methods Using the Proximal Point Algorithm

15.6.1 The Proximal Point Algorithm and Its Dual Problem in Sparse Learning
The proximal point algorithm is also useful for optimizing sparse regularization
Two theorems for solving the dual problem
Theorem: Properties of conjugate functions
Theorem: Properties of proximal mappings
Moreau decomposition
The Moreau decomposition replaces the computation of a proximal mapping with the computation of the proximal mapping of the conjugate function
Use the steepest descent method or Newton's method to optimize the dual problem
The proximal point algorithm for the primal problem can be implemented on the dual by solving a linearly equality-constrained optimization problem with the method of multipliers
Regularization functions that are not smooth in the primal problem become smooth in the dual problem
Algorithm of the dual augmented Lagrangian method
Using Fenchel's duality theorem, the dual augmented Lagrangian method can be derived directly
15.6.2 ADMM for the Dual Problem
Optimize the augmented Lagrangian of the dual problem alternately over y and u
Dual ADMM for regularized learning

Chapter 16: Optimization on Matrix Space

16.1 Stiefel and Grassmann Manifolds
16.2 Matrix Optimization in Machine Learning
16.2.1 Independent Component Analysis
16.2.2 Density Ratio Estimation with Dimensionality Reduction
16.3 Concepts of Manifolds
16.4 Optimization on Manifolds
16.4.1 Steepest Descent Method
16.4.2 Conjugate Gradient Method
16.5 Retraction and Vector Transport
16.6 Optimization on Matrix Manifolds
16.6.1 Properties of Stiefel Manifolds
16.6.2 Construction of Retractions
16.6.3 Vector Transport by Projection
16.6.4 Numerical Examples

 
