Fundamentals of Continuous Optimization – Calculus and Linear Algebra

Summary

Continuous optimization in machine learning is a method for solving optimization problems in which the variables take real values, such as optimization of neural network weights and biases, parameter estimation in regression analysis, and parameter estimation in SVMs. Typical methods for continuous optimization include gradient descent, steepest descent, the conjugate gradient method, quasi-Newton methods, and L-BFGS, which are algorithms that use information such as the gradient and Hessian matrix of the objective function to update the variables toward an optimal solution. In continuous optimization, it is necessary to consider not only the value of the objective function but also the constraint conditions. To solve optimization problems with constraints, we use constrained optimization methods such as those based on the KKT conditions, interior point methods, projected gradient methods, penalty methods, and barrier function methods. Here we discuss continuous optimization in machine learning based on the Machine Learning Professional Series “Continuous Optimization for Machine Learning“.

In the previous article, I gave an overview of “continuous optimization”. In this article, we discuss calculus and linear algebra as the basics of continuous optimization.

Introduction

To describe optimization algorithms and investigate their theoretical properties, knowledge of calculus and linear algebra is needed. In this section, we review these basics. First, we define the notation for matrices. The set of all nxn matrices (nth-order matrices) with real elements is denoted by ℝnxn. More generally, ℝnxm denotes the set of all nxm matrices. When the elements of a matrix A∈ℝnxm are written explicitly as A=(aij)i=1,…,n,j=1,…,m, we sometimes write A=(aij) for simplicity.

Taylor’s theorem

The gradient of a differentiable function f:ℝn→ℝ is defined as follows.

\[\nabla f(x)=\left(\frac{\partial f}{\partial x_1}(x),\dots,\frac{\partial f}{\partial x_n}(x)\right)^T\in \mathbb{R}^n\]

The gradient ∇f(x) at a point x is a vector orthogonal to the contour (level set) of the function f. This can be shown as follows. Let fc={x∈ℝn|f(x)=c} be the set of points where the function f takes a constant value c∈ℝ. Let x(t) be a curve in fc parameterized by a real parameter t∈ℝ. Since f(x(t))=c for any t, differentiating both sides with respect to t yields the following equation at t=0.

\[\nabla f(x(0))^T\frac{dx(0)}{dt}=0\]

That is, the vectors ∇f(x(0)) and \(\frac{dx(0)}{dt}\) are orthogonal. The “velocity vectors \(\frac{dx(0)}{dt}\)” of the various curves x(t) on fc together form the tangent plane of fc at x(0), so the gradient ∇f(x(0)) is orthogonal at x(0) to the contour (isosurface) fc.

The Hessian matrix ∇2f(x) ∈ ℝnxn of a twice differentiable function f is defined as follows.

\[(\nabla^2f(x))_{ij}=\frac{\partial^2f}{\partial x_i\partial x_j}(x),\quad i,j=1,\dots,n\]

The commutativity of partial derivatives shows that the Hessian matrix is a symmetric matrix.
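As a small numerical illustration (my own sketch, not from the reference), the gradient and Hessian of a smooth function can be approximated by finite differences, and the resulting Hessian is symmetric up to numerical error. The function f below is an arbitrary example chosen only for illustration.

```python
import numpy as np

def f(x):
    # Example smooth function f: R^2 -> R (chosen only for illustration)
    return x[0] ** 2 * x[1] + np.sin(x[0]) * np.exp(x[1])

def num_grad(f, x, h=1e-5):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def num_hess(f, x, h=1e-5):
    """Finite-difference approximation of the Hessian via the gradient."""
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros_like(x); e[j] = h
        H[:, j] = (num_grad(f, x + e) - num_grad(f, x - e)) / (2 * h)
    return H

x = np.array([0.7, -0.3])
H = num_hess(f, x)
print(np.allclose(H, H.T, atol=1e-4))  # symmetry of the Hessian, up to numerical error
```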

Using the gradient and Hessian matrix, the function f can be approximated locally by a linear or quadratic expression. We first state the underlying Taylor’s theorem.

<Theorem 1: Taylor’s theorem>

When the function f:ℝn→ℝ is differentiable, for x,δ∈ℝn there exists a real number c∈(0,1) such that the following holds. \[f(x+\delta)=f(x)+\nabla f(x+c\delta)^T\delta\] When f is twice differentiable, there exists a real number c ∈ (0,1) such that the following holds. \[f(x+\delta)=f(x)+\nabla f(x)^T\delta+\frac{1}{2}\delta^T\nabla ^2f(x+c\delta)\delta\]

We introduce Landau’s symbols O(⋅) and o(⋅), which are useful for rough evaluation of function values. When the functions f:ℝn→ℝ and g:ℝn→ℝ satisfy

\[\lim_{x\rightarrow a}\left|\frac{f(x)}{g(x)}\right|=0\]

we write

\[f(x)=o(g(x)),\quad (x \rightarrow a)\]

Similarly, when

\[\lim_{x\rightarrow a}\left|\frac{f(x)}{g(x)}\right|\lt \infty\]

we write

\[f(x)=O(g(x)),\quad (x \rightarrow a)\]

These represent the relationship between the magnitudes of f(x) and g(x) in the neighborhood of point a. When it is clear which point neighborhood is being considered, x→a may be omitted. From the definition, if f(x)=o(g(x)), then f(x)=O(g(x)), but f(x)=o(g(x)) describes more detailed information. By evaluating the function f(x) in a neighborhood of the origin as f(x)=O(||x||k) or f(x)=o(||x||k), we can also roughly estimate the function value.

If the first-order derivative of the function f:ℝn→ℝ is continuous (first-order continuously differentiable), the following equation holds by Taylor’s theorem.

\[f(x+\delta)=f(x)+\nabla f(x)^T\delta+o(||\delta||)\]

Similarly, if the second-order derivative is continuous (twice continuously differentiable), the following equation holds.

\[f(x+\delta)=f(x)+\nabla f(x)^T\delta+\frac{1}{2}\delta^T\nabla^2 f(x)\delta+o(||\delta||^2)\]

Now consider the situation where δ → 0. These expansions can be regarded as approximating the function f by a linear or quadratic expression using the gradient and Hessian matrix, and they provide important information for the optimization of f.
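As a rough numerical check (my own sketch, not from the reference), the error of the first-order approximation shrinks like o(||δ||) and that of the second-order approximation like o(||δ||²) as δ → 0. The function, gradient, and Hessian below are an arbitrary example.

```python
import numpy as np

def f(x):
    return np.exp(x[0]) + x[0] * x[1] ** 2          # example function

def grad(x):
    return np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])

def hess(x):
    return np.array([[np.exp(x[0]), 2 * x[1]],
                     [2 * x[1],     2 * x[0]]])

x = np.array([0.5, 1.0])
d = np.array([1.0, -2.0])
for t in [1e-1, 1e-2, 1e-3]:
    delta = t * d
    err1 = abs(f(x + delta) - (f(x) + grad(x) @ delta))
    err2 = abs(f(x + delta) - (f(x) + grad(x) @ delta
                               + 0.5 * delta @ hess(x) @ delta))
    # err1 shrinks like t^2 = o(t), err2 shrinks like t^3 = o(t^2)
    print(f"t={t:.0e}  first-order error={err1:.2e}  second-order error={err2:.2e}")
```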

A similar expansion formula holds for the gradient ∇f(x): for a twice differentiable function f,

\[\nabla f(x+\delta)=\nabla f(x)+\displaystyle\int_0^1\nabla^2f(x+t\delta)\delta\, dt\]

where the integration is done for each element of the vector.

In optimization problems, one may assume Lipschitz continuity of the function f or of the gradient ∇f. A function f:ℝn→ℝ is Lipschitz continuous if there exists a positive real number L>0 such that for any x,y∈ℝn the following inequality holds.

\[|f(x)-f(y)|\leq L||x-y||\]

The constant L is called the Lipschitz constant. When the gradient ∇f satisfies the following Lipschitz continuity, f is called a γ-smooth function.

\[||\nabla f(x)-\nabla f(y)||\leq \gamma||x-y||,\quad \gamma\gt 0\quad(1)\]
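As a sketch under my own example setting: for a quadratic function f(x)=½xᵀAx with a symmetric matrix A, the gradient is ∇f(x)=Ax and ||∇f(x)−∇f(y)||=||A(x−y)||≤||A|| ||x−y||, so the spectral norm of A can serve as the constant γ in equation (1).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                      # symmetric matrix

grad = lambda x: A @ x                 # gradient of f(x) = 0.5 * x^T A x
gamma = np.linalg.norm(A, 2)           # spectral norm, a Lipschitz constant of the gradient

# empirical check of equation (1) on random pairs of points
for _ in range(5):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lhs = np.linalg.norm(grad(x) - grad(y))
    rhs = gamma * np.linalg.norm(x - y)
    assert lhs <= rhs + 1e-12
print("gamma =", gamma)
```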

Implicit function theorem

For a function F:ℝkxℝn→ℝk, consider variables x∈ℝk, y∈ℝn satisfying F(x,y)=0∈ℝk. If a constant is substituted for y, this can be regarded as a system of k equations in the k-dimensional variable x. Since the number of variables equals the number of equations, it is expected that there exists an x satisfying the equations. For example, suppose F is given by the following linear expression.

\[F(x,y)=Ax+By,\quad A\in \mathbb{R}^{k\times k},B\in\mathbb{R}^{k\times n}\]

If the matrix A is regular, then for a given y, F(x,y)=0 holds when x is taken as follows.

\[x=-A^{-1}By\]

The implicit function theorem is useful for examining how an x satisfying the general equation F(x,y)=0 depends on the variable y. For the variables x∈ℝk, y∈ℝn, the derivatives of the function F(x,y)∈ℝk are defined as follows.

\[\nabla_x F=\left(\frac{\partial F_j}{\partial x_i}\right)_{ij}\in\mathbb{R}^{k\times k},\ \nabla_y F=\left(\frac{\partial F_j}{\partial y_i}\right)_{ij}\in\mathbb{R}^{n\times k}\]

These matrices are called Jacobian matrices.

<Theorem 2: Implicit function theorem>

Let the function F:ℝkxℝn→ℝk be r-times continuously differentiable, and suppose the following conditions hold at a point (a,b) ∈ ℝkxℝn. \[F(a,b)=0,\quad det(\nabla_x F(a,b))\neq 0\] The second condition states that the Jacobian matrix ∇xF is regular at (a,b). In this case, there exist a neighborhood Ux⊂ℝk of a, a neighborhood Uy⊂ℝn of b, and a function 𝜑:Uy→Ux such that \[𝜑(b)=a,\quad y\in U_y\Rightarrow F(𝜑(y),y)=0\] Furthermore, 𝜑 is r-times continuously differentiable.

From the implicit function theorem, the derivative of 𝜑(y) can be expressed in terms of the derivatives of F. In fact, by differentiating both sides of F(𝜑(y),y)=0 with respect to y, we obtain

\[\nabla 𝜑(y)=-\nabla_y F(𝜑(y),y)(\nabla_x F(𝜑(y),y))^{-1}\]

Here, ∇𝜑(y) is defined as follows.

\[\nabla𝜑(y)=(\nabla𝜑_1(y),\dots,\nabla𝜑_k(y))=\left(\frac{\partial 𝜑_j}{\partial y_i}\right)_{ij}\in\mathbb{R}^{n\times k}\]

It can be seen that this is an extension of the linear case.
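A minimal numerical check of the linear case (my own sketch; the matrices A and B are random examples): φ(y)=-A⁻¹By satisfies F(φ(y),y)=0, and the formula ∇φ(y)=-∇yF(∇xF)⁻¹ reproduces -(A⁻¹B)ᵀ under the index convention used above.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 2
A = rng.standard_normal((k, k)) + 3 * np.eye(k)   # regular k x k matrix
B = rng.standard_normal((k, n))

y = rng.standard_normal(n)
phi_y = -np.linalg.solve(A, B @ y)                 # phi(y) = -A^{-1} B y
print(np.allclose(A @ phi_y + B @ y, 0))           # F(phi(y), y) = 0

# Jacobians in the convention of the text: (grad_x F)_{ij} = dF_j/dx_i, etc.
grad_x_F = A.T                                     # k x k
grad_y_F = B.T                                     # n x k
grad_phi = -grad_y_F @ np.linalg.inv(grad_x_F)     # n x k
print(np.allclose(grad_phi, -(np.linalg.solve(A, B)).T))
```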

Diagonalization and eigenvalues

An nth-order matrix A=(aij)∈ℝnxn is called a symmetric matrix when aij=aji for all i,j (i.e., AT=A). Since continuous optimization often deals with Hessian matrices, it is essential to be familiar with handling symmetric matrices.

For a matrix A=(aij)∈ℝnxn, a value λ and a nonzero n-dimensional vector x satisfying the following equation are called an eigenvalue and an eigenvector, respectively.

\[\mathbf{Ax}=\lambda \mathbf{x}\]

In general, even if all the elements of A are real numbers, λ and the elements of x are not necessarily real; they can be complex numbers. For example, a rotation matrix must be considered over the complex numbers.

If A is an nth-order symmetric matrix, then λ is a real number and x is an element of ℝn. Furthermore, the n eigenvectors can be chosen to be orthogonal to each other. That is, for an nth-order symmetric matrix A, there exist λi, xi satisfying

\begin{eqnarray}\mathbf{Ax}_i&=&\lambda_ix_i,\ x_i\in\mathbb{R}^n,\ \lambda_i\in\mathbb{R},\ i=1,\dots,n\\\mathbf{x}_i^T\mathbf{x}_j&=&\begin{cases}1\quad (i=j)\\0\quad (i\neq j)\end{cases}\end{eqnarray}

Letting the matrix Q be as follows, Q is an orthogonal matrix since QTQ=QQT=I.

\[\mathbf{Q}=\left(\mathbf{x}_1\ \mathbf{x}_2\ \dots\ \mathbf{x}_n\right)\in\mathbb{R}^{n\times n}\]

Therefore, a symmetric matrix can be diagonalized by an orthogonal matrix. In fact, letting Λ be the nth-order diagonal matrix with λ1,…,λn as its diagonal elements, the following equation is obtained from the relation between the eigenvalues and eigenvectors of A, and hence Λ=QTAQ.

\[\mathbf{AQ}=\mathbf{Q\Lambda}\]

A can also be expressed as in the following equation.

\[\mathbf{A}=\mathbf{Q\Lambda Q}^T=\displaystyle\sum_{i=1}^n\lambda_ix_ix_i^T\quad(2)\]
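A minimal numerical sketch of equation (2) (my own example, using a random symmetric matrix): numpy.linalg.eigh returns orthonormal eigenvectors, Q is orthogonal, and A=QΛQᵀ=Σλᵢxᵢxᵢᵀ.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                       # symmetric matrix

lam, Q = np.linalg.eigh(A)              # eigenvalues and orthonormal eigenvectors
print(np.allclose(Q.T @ Q, np.eye(4)))  # Q is orthogonal
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))                              # A = Q Lambda Q^T
print(np.allclose(sum(l * np.outer(q, q) for l, q in zip(lam, Q.T)), A))   # equation (2)
```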

When a symmetric matrix A ∈ ℝnxn satisfies xTAx≥0 for any x ∈ ℝn, A is called a nonnegative definite (positive semidefinite) matrix and is denoted A⪰O (or O⪯A). A nonnegative definite matrix is called a positive definite matrix when it further satisfies the following condition, and is denoted A≻O (or O≺A).

\[x\neq 0\ \Rightarrow\ x^TAx\gt 0\]

For example, for a matrix X ∈ ℝnxm, XXT is an nth-order nonnegative definite matrix, and it is positive definite when the rank of X is n.

For nth-order symmetric matrices A and B, we write A⪰B (or B⪯A) when the following inequality holds for any vector x ∈ ℝn.

\[\mathbf{x}^T\mathbf{Ax}\geq \mathbf{x}^T\mathbf{Bx}\]

This is equivalent to A-B⪰O. Similarly, A-B≻O is written A≻B (or B≺A).

Nonnegative definiteness and positive definiteness can be expressed as conditions on the eigenvalues: A is nonnegative definite if and only if all eigenvalues of A are nonnegative, and A is positive definite if and only if all eigenvalues of A are positive. In fact, using equation (2), we obtain the following relationship between the sign of xTAx and the signs of the eigenvalues.

\[\mathbf{x}^T\mathbf{Ax}=\displaystyle\sum_{i=1}^n\lambda_i(\mathbf{x}^T\mathbf{x}_i)^2\]

Given a function f:ℝ→ℝ and a symmetric matrix A, we diagonalize A as in equation (2) and define the transformation \(\mathbf{A}⟼\bar{f}(\mathbf{A})\) as follows.

\[\bar{f}(\mathbf{A})=\displaystyle\sum_{i=1}^nf(\lambda_i)\mathbf{x}_i\mathbf{x}_i^T\]

If f(x)=xk for a natural number k, then \(\bar{f}(\mathbf{A})\) coincides with the ordinary matrix power Ak, owing to the orthogonality of the eigenvectors. If the symmetric matrix A is regular, then \(\bar{f}(\mathbf{A})\) for f(x)=1/x coincides with the inverse of A. If f(x)=x1/2, then for a nonnegative definite matrix A we obtain the following.

\[\mathbf{A}^{1/2}=\displaystyle\sum_{i=1}^n\lambda_i^{1/2}\mathbf{x}_i\mathbf{x}_i^T\]

From the definition, the following holds

\[\mathbf{A}^{1/2}\mathbf{A}^{1/2}=\mathbf{A}\]

For a nonnegative definite (positive definite) matrix A, the matrix A1/2 is also nonnegative definite (positive definite). The identity matrix I can be expressed using a positive definite matrix A as follows.

\[\mathbf{I}=\mathbf{A}^{1/2}\mathbf{A}^{-1/2}=\mathbf{A}^{-1/2}\mathbf{A}^{1/2}\]

where A-1/2 means (A1/2)-1, which is the same as (A-1)1/2.
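A sketch of the transform \(\bar{f}(\mathbf{A})\) via the eigendecomposition (my own example; the matrix A is random): applying f(λ)=λ1/2 to a nonnegative definite matrix yields a square root satisfying A1/2A1/2=A, and f(λ)=1/λ yields the inverse.

```python
import numpy as np

def matrix_function(A, f):
    """Apply a scalar function f to a symmetric matrix A through its eigendecomposition."""
    lam, Q = np.linalg.eigh(A)
    return Q @ np.diag(f(lam)) @ Q.T

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 6))
A = X @ X.T                                   # nonnegative definite (here positive definite)

A_half = matrix_function(A, np.sqrt)
print(np.allclose(A_half @ A_half, A))        # A^{1/2} A^{1/2} = A
A_inv = matrix_function(A, lambda t: 1.0 / t)
print(np.allclose(A_inv, np.linalg.inv(A)))   # f(x) = 1/x gives the inverse
```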

Projection onto a subspace

For a subspace S of an n-dimensional space, we define the orthogonal complement space S⊥ of S as follows.

\[S^{⊥}=\{\mathbf{x}\in \mathbb{R}^n|\forall \mathbf{y}\in S\quad \mathbf{x}^T\mathbf{y}=0\}\]

In this case, S⊥ is also a subspace of ℝn, and the intersection of S and S⊥ contains only the zero vector. When the vector x ∈ ℝn is decomposed as follows, we call x1 the projection of x onto S.

\[\mathbf{x}=\mathbf{x}_1+\mathbf{x}_2,\quad \mathbf{x}_1\in S,\ \mathbf{x}_2\in S^⊥\]

In the following, we show that the projection exists and is unique. Let z1,…,zp (p≤n) be a basis of the subspace S and let the matrix Z be as follows; then ZTZ∈ℝpxp is regular since the rank of Z is p.

\[Z=\left(z_1\ \dots\ z_p\right)\in\mathbb{R}^{n\times p}\]

Let the nth-order matrix P be

\[\mathbf{P}=\mathbf{Z}(\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\]

Then the following equations hold.

\[\mathbf{P}^T=\mathbf{P},\quad \mathbf{P}^2=\mathbf{P}\quad(3)\]

At this point, it can be decomposed as follows

\[\mathbf{x}=\mathbf{Px}+(\mathbf{I}-\mathbf{P})\mathbf{x},\quad\mathbf{Px}\in S,\quad (\mathbf{I}-\mathbf{P})\mathbf{x}\in S^⊥\]

In fact, Px∈S holds by definition, and since ZT(I-P)x = ZTx - ZTZ(ZTZ)-1ZTx = 0, we have (I-P)x∈S⊥, which shows the existence of the projection. Next, we show the uniqueness of the projection. If x can be decomposed in two ways as \(\mathbf{x}=\mathbf{x}_1+\mathbf{x}_2=\mathbf{x}'_1+\mathbf{x}'_2\), then from the following relation, x1=x'1 (and hence x2=x'2) follows.

\[\mathbf{x}_1-\mathbf{x}’_1=\mathbf{x}_2-\mathbf{x}’_2\in S\cap S^⊥=\{0\}\]

The projection x1 is characterized as the point in S that minimizes the distance to x. This is confirmed by the fact that for any z ∈ S the following equations hold.

\begin{eqnarray}||\mathbf{x}-\mathbf{z}||^2&=&||\mathbf{x}-\mathbf{x}_1+\mathbf{x}_1-\mathbf{z}||^2\\&=&||\mathbf{x}-\mathbf{x}_1||^2+||\mathbf{x}_1-\mathbf{z}||^2+2(\mathbf{x}-\mathbf{x}_1)^T(\mathbf{x}_1-\mathbf{z})\\&=&||\mathbf{x}-\mathbf{x}_1||^2+||\mathbf{x}_1-\mathbf{z}||^2\end{eqnarray}

The last equality used the following

\[\mathbf{x}-\mathbf{x}_1=\mathbf{x}_2\in S^⊥,\quad \mathbf{x}_1-\mathbf{z}\in S\]

A matrix satisfying equation (3) is called a projection matrix. The eigenvalues of a projection matrix are either 0 or 1.
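A minimal sketch of the projection matrix (my own example with a random basis Z): P=Z(ZᵀZ)⁻¹Zᵀ satisfies equation (3), its eigenvalues are 0 or 1, and x decomposes into Px∈S and (I-P)x∈S⊥.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 2
Z = rng.standard_normal((n, p))                       # basis of S as columns
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T                  # projection matrix onto S

print(np.allclose(P, P.T), np.allclose(P @ P, P))     # equation (3)
print(np.round(np.linalg.eigvalsh(P), 8))             # eigenvalues are 0 or 1

x = rng.standard_normal(n)
x1, x2 = P @ x, (np.eye(n) - P) @ x                   # x = x1 + x2
print(np.allclose(Z.T @ x2, 0))                       # x2 is orthogonal to S
```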

Rank-one update of a matrix

The matrix A+xyT, obtained by adding a rank-one matrix to A, is called a rank-one update of A. Such update formulas are used in the natural gradient and quasi-Newton methods. Formulas have been devised to efficiently perform various operations on the rank-one-updated matrix using information about the matrix before the update.

The Sherman-Morrison formula is expressed as follows for an nth-order regular matrix A and x,y∈ℝn

\[(\mathbf{A}+\mathbf{xy}^T)^{-1}=\mathbf{A}^{-1}-\frac{1}{1+\mathbf{y}^T\mathbf{A}^{-1}\mathbf{x}}\mathbf{A}^{-1}\mathbf{xy}^T\mathbf{A}^{-1}\quad(4)\]

By multiplying the right-hand side by A+xyT, it can be confirmed that the formula holds. If the inverse matrix A-1 is known, the formula can be used to efficiently compute the inverse matrix (A+xyT)-1 of the rank-one update.

Equation (4) shows that the necessary and sufficient condition for A+xyT to be regular is 1+yTA-1x≠0. Furthermore, the following equation also holds for the determinant.

\[det(\mathbf{A}+\mathbf{xy}^T)=(1+\mathbf{y}^T\mathbf{A}^{-1}\mathbf{x})det(\mathbf{A})\quad(5)\]

Equation (5) follows from the fact that the eigenvalues of the matrix I+A-1xyT are 1 (with multiplicity n-1) and 1+yTA-1x.
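A small numerical check of equations (4) and (5) (my own example with random A, x, y): the rank-one updated inverse and determinant agree with direct computation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)              # regular matrix
x, y = rng.standard_normal(n), rng.standard_normal(n)

A_inv = np.linalg.inv(A)
denom = 1.0 + y @ A_inv @ x
sm_inv = A_inv - np.outer(A_inv @ x, y @ A_inv) / denom      # equation (4)
print(np.allclose(sm_inv, np.linalg.inv(A + np.outer(x, y))))

det_update = denom * np.linalg.det(A)                        # equation (5)
print(np.allclose(det_update, np.linalg.det(A + np.outer(x, y))))
```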

Various Norms

A norm is a quantity that corresponds to the length of a vector. Machine learning and statistics deal with various norms. The most basic norm is the Euclidean norm. To recall the definition, for a vector x=(x1,…,xn)T∈ℝn it is defined as follows.

\[||\mathbf{x}||=(\mathbf{x}^T\mathbf{x})^{1/2}=\left(\displaystyle\sum_{i=1}^n|x_i|^2\right)^{1/2}\]

We now define a general norm.

<Definition 3: Norm>

A function mapping a vector x to a nonnegative value n(x)≥0 is called a norm if it satisfies the following properties

    1. n(x)≥0, n(x)=0 ⇔ x=0
    2. n(αx)=|α|n(x) for any α∈ℝ
    3. n(x+y)≤n(x)+n(y) for any x,y

The Euclidean norm n(x)=||x|| satisfies the definition of a norm. In general, for p≥1, we define the p-norm of the vector x as follows.

\[||\mathbf{x}||_p=\left(\displaystyle\sum_{i=1}^n|x_i|^p\right)^{1/p}\]

We further define the following for p = ∞.

\[||\mathbf{x}||_{\infty}=\max_{i=1,\dots,n}|x_i|\]

The Euclidean norm is the 2-norm. The p-norm also satisfies the definition of a norm. The Schwarz inequality holds for the Euclidean norm.

\[|\mathbf{x}^T\mathbf{y}|\leq||\mathbf{x}||\,||\mathbf{y}||\]

This is derived from the nonnegativity of the quadratic function ||x+ty||2 in the variable t∈ℝ. For the p-norm, Hölder’s inequality, an extension of the Schwarz inequality, holds.

\[|\mathbf{x}^T\mathbf{y}|\leq||\mathbf{x}||_p||\mathbf{y}||_q\]

where p and q are pairs of real numbers satisfying the following, including the cases where p=1 and q=∞.

\[p,q\geq 1,\quad \frac{1}{p}+\frac{1}{q}=1\]

Setting p=q=2 in Hölder’s inequality yields the Schwarz inequality.

Using the Euclidean norm ||·|| for vectors, we can define the norm of a matrix A ∈ ℝnxm as follows.

\[||\mathbf{A}||=\max_{\mathbf{x}\in\mathbb{R}^m,\mathbf{x}\neq 0}\frac{||\mathbf{Ax}||}{||\mathbf{x}||}\]

Similarly, the norm of a matrix can be defined from the p-norm of a vector.
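A sketch (my own example with a random matrix): the matrix norm induced by the Euclidean norm equals the largest singular value, which the maximum of ||Ax||/||x|| over random directions approaches.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 5))

op_norm = np.linalg.norm(A, 2)           # induced 2-norm = largest singular value
ratios = []
for _ in range(10000):
    x = rng.standard_normal(5)
    ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))
print(op_norm, max(ratios))              # the sampled maximum approaches the operator norm
```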

Functions on matrix space

We describe formulas for functions f:ℝnxm→ℝ whose domain is a set of matrices. The gradient is expressed in matrix form: for X=(xij)∈ℝnxm, the gradient ∇f(X) ∈ ℝnxm is defined as follows.

\[(\nabla f(\mathbf{X}))_{ij}=\frac{\partial f}{\partial x_{ij}}(\mathbf{X})\]

The following is the gradient for a concrete function. For a matrix A=(aij) ∈ ℝnxm, define f:ℝnxm→ℝ as follows.

\[f(\mathbf{X})=tr(\mathbf{XA}^T)=\displaystyle\sum_{i=1}^n\sum_{j=1}^ma_{ij}x_{ij}\]

Then the gradient is as follows.

\[\nabla f(\mathbf{X})=\mathbf{A}\]

This can be verified by direct computation.

Next, we compute the gradient of the log-determinant. Let \(\tilde{\mathbf{X}}=(\Delta_{ij})\) be the adjugate (cofactor) matrix of an nth-order regular matrix X. From its properties, we have

\[\mathbf{X}\tilde{\mathbf{X}}=det(\mathbf{X})\cdot\mathbf{I}\]

Thus, for any j=1,…,n, we have

\[\det(\mathbf{X})=\displaystyle\sum_{i=1}^nx_{ij}\Delta_{ji}\]

Since the cofactors Δji (i=1,…,n) do not depend on xij, we obtain

\[\frac{\partial}{\partial x_{ij}}\det(\mathbf{X})=\Delta_{ji}=(\tilde{\mathbf{X}}^T)_{ij}\]

Thus, for the function f(X)=log|det(X)| defined below, we have the following.

\[(\nabla f(\mathbf{X}))_{ij}=\frac{(\tilde{\mathbf{X}}^T)_{ij}}{\det(\mathbf{X})}=((\mathbf{X}^T)^{-1})_{ij}\]

That is, for an nth-order regular matrix X ∈ ℝnxn, the function defined as follows

\[f(\mathbf{X})=log|det(\mathbf{X})|\]

has the following gradient.

\[\nabla f(\mathbf{X})=(\mathbf{X}^T)^{-1}\]
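A finite-difference check of the two gradients above (my own sketch; the matrices X and A are random examples): the gradient of tr(XAᵀ) is A, and the gradient of log|det(X)| is (Xᵀ)⁻¹.

```python
import numpy as np

def num_grad_matrix(f, X, h=1e-6):
    """Entrywise central-difference gradient for a function of a matrix."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(7)
X = rng.standard_normal((3, 3)) + 3 * np.eye(3)    # regular matrix
A = rng.standard_normal((3, 3))

f_tr = lambda Y: np.trace(Y @ A.T)
print(np.allclose(num_grad_matrix(f_tr, X), A, atol=1e-4))                        # grad = A

f_logdet = lambda Y: np.log(abs(np.linalg.det(Y)))
print(np.allclose(num_grad_matrix(f_logdet, X), np.linalg.inv(X).T, atol=1e-4))   # grad = (X^T)^{-1}
```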

In the next article, we will discuss the basics of convex analysis as a fundamental matter for continuous optimization in machine learning.
