Summary
Sparse learning is a machine learning technique that builds a model by selecting only the important features in a data set. Sparsity means that most of the parameters included in the model are zero or nearly zero. The benefit of sparse learning is that the model becomes more interpretable: because important features are selected and unnecessary features are excluded, it is easier to explain the model's predictions. Sparsity also helps prevent overfitting.
In this section, we discuss sparse learning based on the Machine Learning Professional Series book "Sparsity-Based Machine Learning."
The most common approach to sparse learning is linear regression with L1 regularization. This article provides reading notes on the book.
Machine Learning Professional Series Sparsity-Based Machine Learning
Chapter 1 Introduction
Sparseness
Most elements of a high-dimensional vector are zero.
Typical examples where sparsity is useful
Prediction from individual differences in the genome (susceptibility to a particular disease, efficacy of treatment)
Inter-individual variation in the genome occurs at millions of sites, but only a small fraction is associated with susceptibility to a particular disease
To estimate how all of the millions of mutations affect a particular disease, we would need to collect a sample whose size is of the same order of magnitude as the number of mutations.
Using the sparsity assumption that only a small number of mutations are associated with a disease, we can study a small sample.
Issues to consider
Statistical issues
How can we estimate with a realistically small number of samples?
Computational issues
How to avoid combinatorial explosion and to estimate with realistic computational complexity?
Three types of sparsity
Element-wise sparsity
Many elements have zero values and a few elements have non-zero values
Group sparsity
Each row corresponds to one predefined group, and each group is either all zero or all non-zero
Predefined groups constrain the possible zero/non-zero patterns and allow estimation with fewer samples.
In prediction from genomic variation, genes act through pathways: gene A activates gene B, which in turn activates another gene C.
Treating each pathway as a group, we consider whether each pathway contributes to the disease, rather than whether each individual mutation does
Low rank property of matrices
If we want to solve T related learning tasks defined on a common set of d variables simultaneously
The number of coefficients to be estimated is dT
Consider it as a simple problem of estimating a vector of dT dimensions
Or estimate a d-by-T matrix with d coefficients in each column
If the matrix has rank r, it means that there are r factors that are common to T learning tasks
Low rank matrices are not sparse as vectors.
Looking at the singular values of the matrix, only a number of singular values equal to the rank of the matrix are non-zero; the other singular values are zero.
Even without defining groups, it is possible to learn groups of variables and factors common to multiple tasks together.
Useful for analysis of hyperspace data, recommendation systems, etc.
For ultra-high-dimensional and sparse feature vectors such as n-grams in natural language processing, a classifier cannot be built from a small number of variables, so such problems are not the subject of this book.
Chapter 2 Learning from Data
2.1 Training Data and Generalization
When you want to understand something new, how do you do it?
Learn a new word
Learn a wine preference.
In the case of wine, taste it and if you like it, remember the region, the farmer, the type of grape, the words on the label, etc., and then order something similar.
Things that are easy to illustrate, but difficult to express as rules
Handwriting variations
Training data consisting of n pairs (xi, yi), i = 1, …, n
xi ∈ ℝd is the input vector
In the example of wine, xi is a numerical representation of the region, type of grape, etc.
For handwritten text, xi is the shading value of the input image
yi is the label
In the example of wine, yi is a 5-point rating of the wine’s taste.
In the case of handwritten letters, it is an integer between 0 and 9.
Learning from data means that, when the training data is generated by some rule, we mimic and reproduce that rule as well as possible.
Statistical Machine Learning
A situation in which the pairs (xi, yi) are generated independently and identically from a joint distribution P(x, y), which is the rule generating the data
Discriminant Model
Given an input vector, how well does it predict the label y?
In the case of wine evaluation, this is the “expected squared error” L(f)=𝔼x,y[(y – f(x))2].
The function f that minimizes this is the conditional expectation 𝔼[y|x]
In the case of handwritten characters, it is common to consider the expected misclassification rate L(f) = 𝔼x,y[I{yf(x) < 0}] or the relative entropy
regression
Estimating a continuous function of the input vector x from the training data
Classification
Estimates a function that takes a discrete output y, as in handwritten character recognition
Even if the labels are discrete, the conditional probability P(y|x) is a continuous function, and if we estimate it, we can solve the classification problem.
2.2 Variance and Bias
2.3 Regularization
A technique to prevent overfitting and improve generalization performance (the ability to deal with unknown data). A "regularization term" is added to the objective so that the model does not become overly complex.
Control the size of the hypothesis set by constraining the norm of the parameter vector.
Norm
Generalization of the concept of “length” of a geometric vector in the plane or space
A mathematical tool to give a “distance” to a vector space.
A space with a defined norm is called a linear normed space or simply a normed space.
Various Norms
Euclidean norm
∥x∥2 := √(|x1|2 + … + |xn|2)
Maximal norm
∥x∥∞ := max(|x1|, …, |xn|)
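As a minimal sketch (not from the book), these norms can be computed directly with numpy; the vector below is an illustrative example, and the ℓ1 norm is included because it appears later as the sparsity-inducing regularizer.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0])

# Euclidean (l2) norm: sqrt(|x1|^2 + ... + |xn|^2)
l2 = np.sqrt(np.sum(x ** 2))      # same as np.linalg.norm(x, 2)

# Max (l-infinity) norm: max(|x1|, ..., |xn|)
linf = np.max(np.abs(x))          # same as np.linalg.norm(x, np.inf)

# l1 norm, used later for sparse regularization
l1 = np.sum(np.abs(x))            # same as np.linalg.norm(x, 1)

print(l2, linf, l1)               # ~5.099, 4.0, 8.0
```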
Description
Reduction of variance by using or constraining the norm of the parameter vector as a penalty term.
2.4 Cross-Validation
2.5 Equivalence of Constrained Minimization Problems and Minimization Problems with Penalty Terms
A constrained minimization problem and a minimization problem with a penalty term are equivalent for corresponding choices of the constraint radius and the regularization parameter
Chapter 3 Introduction to Sparsity
3.1 Occam’s Razor
Do not assume unnecessarily many factors to prove a thing.
If you can prove the same thing, you should use as few factors as possible.
Analogy with overfitting in machine learning
A mathematical expression for explaining something with a small number of factors
The problem of predicting susceptibility to disease from genomic variation
The target to be predicted is x
The value to be predicted is y
All possible hypotheses Φ1(x), …, Φd(x) derived from x are combined into a vector Φ(x)
The predicted value of y is expressed as a linear combination wᵀΦ(x)
The number of factors is the number of nonzero elements of w
s(w)=|supp(w)|
3.2 L1-norm regularization
The ℓ1-norm of a vector w
∥w∥1 = |w1| + … + |wd|
Manhattan distance
The distance in a world where one cannot move diagonally and must travel along a grid of vertical and horizontal streets, like a city laid out like a go board
Learning with L1-norm regularization
Illustration in 1D
Illustration in 2 dimensions
Definition of the subdifferential
3.3 Illustration using artificial data
Experimental results on artificial data with dimension d=200 and the number of true nonzero elements k=10
For Optimal, the test error is drastically reduced when the number of samples is between 10 and 20.
The next best is L1, where the error drops below half of its maximum value around n = 50.
In L1, the number of non-zero elements decreases to 70 (from 200)
At L2, the number of non-zero elements is 200
Coefficients of the weight vector w obtained from the artificial data experiment
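A rough sketch of this kind of experiment using scikit-learn; the sample size, noise level, and regularization strengths below are my own illustrative choices, not the book's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
d, k, n = 200, 10, 50                      # dimension, true nonzeros, samples

w_true = np.zeros(d)
w_true[:k] = rng.normal(size=k)            # sparse true coefficient vector

X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)  # noisy linear observations

lasso = Lasso(alpha=0.05).fit(X, y)        # l1-regularized least squares
ridge = Ridge(alpha=1.0).fit(X, y)         # l2-regularized, for comparison

print("nonzeros (L1):", np.sum(lasso.coef_ != 0))  # typically far fewer than d
print("nonzeros (L2):", np.sum(ridge.coef_ != 0))  # essentially all d
```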
3.4 Addendum to the literature
The history of ℓ1-norm minimization goes back to Galileo and Laplace.
Theoretical approach is based on Logan’s PhD thesis in 1965
A signal and its Fourier transform cannot be sparse at the same time
Approach
Astronomy
Geophysics
Radio Astronomy
Statistics
Computational Neuroscience
Sparse Coding
Machine Learning
Bayesian Theory
Support Vector Machines
Chapter 4 Theory of Noiseless L1 Norm Minimization
4.1 Problem Setup
In a linear equation, if the dimension of the observed variable is smaller than the dimension of the unknown vector, the solution of the equation is not unique.
If the unknown vector is expected to be sparse, it is natural to seek the sparsest vector that satisfies the equation.
Searching over all combinations of non-zero patterns is not efficient.
Efficient computation by ℓ1-norm minimization
Let w*∈ℝd be an unknown sparse high-dimensional vector
Recover the vector w* from a small number of linear observations (above)
Linear equations are “underdetermined”.
inverse problem
Number of samples n is small compared to dimension d
Cannot determine w* from linear equation
Example
Estimating brain activity from EEG and MEG signals
Generalized minimization problem
y = (y1, …, yn)T ∈ ℝn is the vector of observations
X = [x1, …, xn]T ∈ ℝn×d is the matrix whose rows are the input vectors xi of the observation model
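The noiseless ℓ1-minimization problem min ∥w∥1 subject to Xw = y can be recast as a linear program by splitting w = u − v with u, v ≥ 0. The sketch below, using scipy.optimize.linprog on randomly generated data, illustrates this reformulation; it is not the book's procedure, and the problem sizes are arbitrary.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, k, n = 60, 5, 30                        # dimension, sparsity, observations

w_star = np.zeros(d)
w_star[rng.choice(d, k, replace=False)] = rng.normal(size=k)
X = rng.normal(size=(n, d))
y = X @ w_star                             # noiseless underdetermined system

# min sum(u) + sum(v)  s.t.  X(u - v) = y,  u >= 0, v >= 0
c = np.ones(2 * d)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * d))

w_hat = res.x[:d] - res.x[d:]
print("recovery error:", np.linalg.norm(w_hat - w_star))  # near zero when n is large enough
```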
4.2 Geometric Considerations
Geometric consideration of the minimization problem
The pink line is the set of w’s that satisfy the equality constraints of the minimization problem.
The light blue region is the cone of directions at the true vector w* along which the ℓ1 norm decreases
Let D(∥·∥1; w*) denote this descent cone of the ℓ1-norm at the point w*
Descent cone
(A) intersects the light blue region only at w*
(B) intersects the interior of the light blue region along a line segment whose endpoint is w*.
Inside the light blue region is the direction where the ℓ1 norm is less than the true sparse vector w*.
The fact that there is a common area means that w* is not a solution to the ℓ1-norm minimization problem
Necessary and sufficient condition that the true sparse vector w* is the only solution to the ℓ1-norm minimization problem
The smaller the subspace N(w*) and the descent cone D(∥·∥1; w*), the better
The dimension of the subspace N(w*) is d-n, so N(w*) becomes smaller as the number of samples increases
The descent cone gets smaller the sparser the true sparse vector w* is
If it is not sparse, D will be a half-plane
4.3 Performance on Random Problems
Hypothesis that the elements of matrix X are randomly obtained from some distribution
Necessary condition for the sparse vector w* to be the unique solution of the ℓ1-norm minimization problem
Statistical dimension of the set C
Statistical dimension of a typical convex cone
Theorem
Success probability of a minimization problem for a randomly given problem
The success probability goes from 0% to 100% over a relatively small range of sample sizes
The number of samples at which this transition occurs corresponds well to the statistical dimension
4.4 Addendum to the Literature
What properties must be satisfied to obtain the desired result?
Under what conditions are these properties satisfied (with high probability)?
Chapter 5: Theory of L1 norm minimization with noise
Introduction.
The problem of recovering a sparse vector from a noisy linear observation.
Considering the observation noise and dealing with the case where the true vector is not strictly sparse
5.1 Problem Setup
Estimate an unknown high-dimensional vector w* from a linear observation with noise (above)
Assume that 𝝃i is a noise term and independent with mean 0 and variance σ2
Estimating the regression coefficient vector w* based on ℓ1-norm regularization leads to the minimization problem above
Called the lasso or basis pursuit denoising
Difference from noise-free model
Observed model includes noise term
Instead of the probability that the solution w of the convex optimization problem exactly matches the true regression coefficient vector w*, quantitatively evaluate the squared error ∥w − w*∥22
Do not assume that the regression coefficient vector w* is strictly sparse
Assume that the observed noise 𝛏i follows a normal distribution
Sub-Gaussian noise (e.g., a bounded uniform distribution) can also be handled
The estimator w will never exactly match the true regression coefficient vector w* because the observations contain noise
Analyze non-asymptotically how the error behaves as the number of samples n increases
To accommodate the case where w* is not strictly sparse, the upper bound on the error is the sum of the estimation error and the approximation error, as above for any k
Let w*k be the vector consisting of the k coefficients of the true regression coefficient vector w* that are largest in absolute value.
The vector consisting of the remaining coefficients is defined as w* − w*k.
The estimation error term Eest is an increasing function of k
The approximation error term Eapprox is a decreasing function of k
Estimation error term
The quantity appearing in the numerator of the estimation error term gives an indication of the number of samples needed to achieve a given (constant) error
Even if the dimension d of the true regression coefficient vector w* is very large, as long as k is small, we only need a sample of the order of the logarithm of the dimension.
5.2 Performance on Random Problems
The vector w must have at most k non-zero elements
Theorem
Even if the dimension d increases exponentially, the number of samples required increases only polynomially
The condition on the regularization parameter λn depends on the noise term σ2
It does not depend on the number of nonzero elements k
No prior knowledge of the true vector of regression coefficients w* is required when choosing the regularization parameter λn
It is probabilistic.
Corollary
The parameter q interpolates between the number of non-zero elements (q = 0) and the ℓ1 norm (q = 1)
5.3 Preparation
Hölder's inequality
A special case of Hölder's inequality
Cauchy-Schwarz inequality
Dual norm
A generalization of Hölder's inequality holds between any norm and its dual norm
Jensen’s inequality
Definition of a convex function
In general, when a relation like Hölder's inequality between ℓp and ℓq holds between two norms
Corollary: ℓ1-norm of a sparse vector
Corollary: Tail probability of the normal distribution
Compatibility of norms
Corollary: ℓ∞ norm of multidimensional Gaussian vectors
Boole's inequality
Union bound
Supplementary problem: tail probability of the chi-square distribution
Markov’s Inequality
Hoeffding's Inequality
Bernstein’s inequality
5.4 Basic Properties
Corollary
Decomposability of norm
5.5 Restricted strong convexity
Assumption: restricted strong convexity
Strong convexity
An explanation of how a sparse high-dimensional vector w can be distinguished from a low-dimensional observation vector y when the matrix X satisfies restricted strong convexity
Theorem
5.6 Proof of Theorem 5.1 and Corollary 5.1
Theorem
5.7 Proof of Theorem 5.2
5.8 Numerical examples
Results of lasso evaluation on artificially generated data
Experiment 2
Chapter 6 Optimization Methods for L1-norm Regularization
6.1 Types of Optimization Methods
Minimization problems
Equivalent to ℓ1-constrained minimization problem if L(w) is a convex function
Equivalent to the above equation in the limit λ → 0
Four algorithms depending on assumptions about the function L
Iterative weighted reduction
Assume that the sum of the loss term and the ridge regularization term (λ∥w∥22) can be minimized efficiently
(Accelerated) Proximal Gradient Method
Need only smoothness of the loss term L(w)
Dual Augmented Lagrangian
Requires that the function L can be decomposed into a loss function fℓ and a data matrix X such that L(w)=fℓ(Xw)
Dual alternating direction multiplier method
Same assumptions as for the dual augmented Lagrangian
In the case of squared errors, such as L(w)=1/2∥Xw-y∥22 as a function L, we can precompute the Cholesky decomposition, which is more efficient
6.2 Preparation
We prepare the necessary tools to handle the optimization
Variational representation of the ℓ1-norm
Prox operator
Convex conjugate
Corollary: Variational representation of the ℓ1-norm
The ℓ1-norm can be rewritten using the parameter η
Definition: prox operator
For a convex function g, we define the prox operator proxg
Image of Corollary 6.1
Corollary: prox-operator for ℓ1-norm
The prox-operator for the ℓ1-norm is called the soft-threshold function
The regularization parameter λ plays the role of a threshold
Coefficients yi whose absolute value is less than or equal to λ are set to zero, and the remaining coefficients are shrunk by λ toward the origin.
Comparison between the prox-operator for the ℓ1-norm (solid line) and the prox-operator for the square of the ℓ2-norm (dashed line)
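A minimal implementation of the soft-threshold function described above (the prox operator of the ℓ1 norm); the test vector is illustrative.

```python
import numpy as np

def soft_threshold(y, lam):
    """Prox operator of lam * ||.||_1: set coefficients with |y_i| <= lam to zero,
    and shrink the remaining coefficients toward the origin by lam."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
print(soft_threshold(y, 1.0))   # [-1. -0.  0.  0.  2.]
```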
Summary of properties of the ℓ1-norm
Definition: H-smooth
Definition: convex conjugate
Convex conjugate functions of the logistic loss function and the absolute value function
6.3 Iterative weighted reduction methods
6.4 Proximal gradient methods and their acceleration
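As a hedged sketch of the proximal gradient method (ISTA) for the lasso objective 1/2∥Xw − y∥2² + λ∥w∥1: each iteration takes a gradient step on the smooth loss term and then applies the soft-threshold prox step. The accelerated variant (FISTA) adds a momentum term, omitted here for brevity; the data, step-size choice, and λ below are illustrative.

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def ista(X, y, lam, n_iter=500):
    """Proximal gradient method (ISTA) for 1/2*||Xw - y||^2 + lam*||w||_1."""
    n, d = X.shape
    L = np.linalg.eigvalsh(X.T @ X).max()          # Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                   # gradient of the smooth loss term
        w = soft_threshold(w - grad / L, lam / L)  # prox step on the l1 term
    return w

# illustrative synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))
w_true = np.zeros(100); w_true[:5] = 1.0
y = X @ w_true + 0.01 * rng.normal(size=50)
print(np.nonzero(ista(X, y, lam=0.1))[0])  # non-zero coefficients (ideally near the true support)
```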
6.5 Dual Augmented Lagrangian Methods
6.5.1 Derivation of Equation (6.22)
6.6 Dual alternating direction multiplier method
6.7 Numerical Examples
Chapter 7 Machine Learning Based on Group L1 Norm Regularization
Introduction
A norm for selecting zeros/non-zeros in a group unit using a predefined group structure as prior knowledge.
Useful for multi-task learning, multi-kernel learning and vector field estimation
7.1 Definition and Examples
Introduction.
Let 𝓞 be an appropriate partition of d variables.
For example, if d is an even number, 𝓞 = {{1,2}, {3,4}, …, {d−1,d}} is the partition of the d variables into consecutive pairs
The group ℓ1-norm regularization term is the above equation
wg denotes the |g|-dimensional subvector corresponding to an element g of the partition
Example
Let g={1,2}.
Wg is a two-dimensional subvector wg=(w1,w2)T
The norm ∥·∥p can be any norm, but p = 2 and p = ∞ are often used
When p=1 norm is used, or when each group consists of one variable, the partition structure is lost and we return to the ℓ1 norm regularization
An extension of the lasso to account for group structure
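A sketch of the prox operator of the group ℓ1 norm (block-wise soft thresholding) for a non-overlapping partition, assuming groups are given as index lists; the vector and groups below are illustrative.

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Prox operator of lam * sum_g ||w_g||_2 for a non-overlapping partition `groups`.
    Each group's subvector is either set to zero or shrunk toward zero as a block."""
    out = np.zeros_like(w)
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        if norm_g > lam:
            out[g] = (1.0 - lam / norm_g) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, -0.2, 0.0, 2.0])
groups = [[0, 1], [2, 3], [4, 5]]            # pairs, as in the example above
print(group_soft_threshold(w, groups, lam=1.0))  # [2.4 3.2 0.  0.  0.  1. ]
```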
7.1.1 Multitask Learning
There are T learning tasks, each with a d′-dimensional parameter vector: w1, …, wT ∈ ℝd′.
We want to select a small number of features that are commonly useful for all T learning tasks.
If we define a group of size T corresponding to each feature, the minimization problem can be defined by the above equation
The vectors w1, …, wT can be estimated so that they share a common support (set of non-zero elements)
This image shows how group ℓ1-norm regularization can be used to select common variables across multiple learning tasks.
7.1.2 Estimating the vector field
Electroencephalography (EEG) and magnetoencephalography (MEG)
Measurement of electromagnetic waves generated by the activity of many spatially adjacent neurons working in synchrony
The signals travel through the brain and skull and can be measured with sensors attached to the surface of the head
The electric and magnetic fields observed at a particular sensor i are modeled above as a linear superposition of the activity at various locations in the brain
j is the index of the spatial location in the brain
xj is a 3-dimensional vector representing the activity at the jth location
aij is a coefficient representing the propagation of electromagnetic waves from the j-th spatial location to the i-th sensor
The coefficients are different between electroencephalography and magnetoencephalography, but the mathematical model is the same.
Modeling by electric dipole of the activity of a population of neurons at a specific location in the brain
This is a kind of inverse problem because the number of electrodes that can be measured, n, is generally smaller than the number of voxels that are assumed to represent brain activity, N.
Limit the cognitive activity to some extent by designing the task appropriately
The sparsity hypothesis that the number of activity sources is much smaller than the number of voxels is valid
Regularizing the norm, rather than regularizing the coefficients of the vectors, allows for regularization independent of the spatial coordinates.
Minimization problem
Estimate the vector field (xj), j = 1, …, N, from the EEG/MEG signal y = (yi), i = 1, …, n
7.1.3 Multi-kernel learning
The kernel method is a mechanism for learning a linear model in the function space defined by a kernel function k: X×X → ℝ.
It is always a question of what kernel function to use, and equivalently, what reproducing kernel Hilbert space to consider.
Multi-kernel learning
Base kernel functions km(x, x′): X×X → ℝ (m = 1, …, M) are given
Given kernel weights dm, the base kernels are linearly combined into the kernel function (above), from which a predictor is derived
Depending on how the kernel weights are chosen, it is possible to select a useful kernel function from among the base kernels
It is also possible to combine multiple kernel functions.
Minimization problem with regularization term for kernel weights
ℓ is the loss function, which can be thought of as the hinge loss or logistic loss used in SVM
H is a regularization term for the kernel weight dm, which is a convex function
A typical pair of group norm regularization term g and kernel weight regularization term h
7.2 Mathematical Properties
7.2.1 Relationship to the number of nonzero groups
Corollary
7.2.2 Dual norm
Complementary Problem
7.2.3 Variational representation
Corollary
7.2.4 Prox Operators
Corollary
7.3 Optimization
7.3.1 Iterative Weighted Reduction
Using the variational representation, we can give an iterative weighted reduction similar to the ℓ1-norm case
7.3.2 (Accelerated) Proximal Gradient Method
The proximal gradient method can be written using the prox operator
7.3.3 Dual Augmented Lagrangian
Chapter 8 Machine Learning Based on Trace Norm Regularization
Introduction
Low-rank matrices have many applications, such as collaborative filtering, system identification, and classification problems using matrices as input.
Dealing with low-rank constraints directly leads to a non-convex optimization problem that is hard to solve
Trace norms described in “Overview of the trace norm and related algorithms and implementation examples“, obtained as natural extensions of ℓ1-norm regularization, induce low-rankness
Many properties of the ℓ1-norm, such as the dual norm and the prox operator, extend to the singular values of matrices
memo
Frobenius norm of matrices described in “Overview of the Frobenius norm and examples of algorithms and implementations”
A measure of the “size” of a matrix
The square root of the sum of squares of all components
∥A∥F
It can also be thought of as the length (ℓ2 norm) of the vector obtained by arranging all the components of the matrix in a row
Frobenius and Trace
Property 1
AT is the transpose matrix of A
Frobenius norm and orthogonal transformation
Property 2
Multiplying by an orthogonal matrix does not change the norm
Frobenius norm and singular values
Property 3
The square of the Frobenius norm is equal to the sum of the squares of the singular values
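The three properties of the Frobenius norm listed in this memo can be checked numerically; the random matrices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))

fro = np.linalg.norm(A, "fro")

# Property 1: ||A||_F^2 = tr(A^T A)
print(np.isclose(fro ** 2, np.trace(A.T @ A)))

# Property 2: multiplying by an orthogonal matrix Q does not change the norm
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
print(np.isclose(np.linalg.norm(Q @ A, "fro"), fro))

# Property 3: ||A||_F^2 equals the sum of squared singular values
s = np.linalg.svd(A, compute_uv=False)
print(np.isclose(fro ** 2, np.sum(s ** 2)))
```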
Classification of matrix data by logistic regression with trace norm regularization term
http://yamagensakam.hatenablog.com/entry/2018/06/15/064758
Application Examples for Sensors
Other methods used for classification of time series data
HMM
LSTM
8.1 Definitions and Examples
8.1.1 Collaborative Filtering
A type of scoring method for recommendation systems in online shopping and search.
A matrix in which users are the row indices and candidate items are the column indices, and the (i, j) element is the rating or click count of user i for item j
Scoring is performed by predicting the unobserved elements of the matrix
Use data from a large number of users
For the number of users d1 and items d2, consider a matrix Y ∈ {-1, +1}d1xd2 with binary values of -1 and +1 for simplicity
Example: -1 means dislike, +1 means like
Let Ω be the set of observed user-item pairs. We can consider the empirical error minimization problem (above)
ℓ is a loss function for binary classification
Hinge function or logistic function can be used
For the matrix W, we add the constraint that it can be written as the product W = UVᵀ of tall (low-rank) matrices U ∈ ℝd1×r, V ∈ ℝd2×r
View the relationship between users and items as a relationship of vectors embedded in an r-dimensional space.
If the inner product of the vector ui corresponding to the i-th user and the vector vj corresponding to the j-th item is
If it is positive, user i prefers item j
If it is negative, user i dislikes item j.
Treating the constraint that the matrix W has low rank explicitly is a non-convex optimization problem and a bit awkward
Instead of constraining the rank, we regularize with the trace norm (or, equivalently, constrain the trace norm)
λ>0 is a regularization parameter
When the loss function is a hinge loss
8.1.2 Multi-task learning
Method to select common variables across multiple tasks
The parameter vectors of the T tasks are arranged as the columns of a matrix
Group ℓ1-norm regularization with each row as one group
Common variables are not the only way to model the commonality of multiple tasks
Assume that T tasks use a common subspace
Matrix W will be of low rank as W=UVT
U∈ℝd’xr is the basis of the subspace common to all tasks
V∈ℝTxr is a vector of coefficients specific to each task in this r-dimensional space
The rank r of the matrix can never exceed the number of tasks T
If tasks are similar, r will be small
Selecting variables that are common to all tasks using group ℓ1 norm
Formulation of multi-task learning based on low rankness
Replace the rank constraint on W with a trace-norm regularization (or trace-norm constraint)
8.1.3 Classification Problem with Matrix as Input
Binary classification problem based on training data (Xi,yi)i=1n
Xi ∈ ℝd1xd2 is a matrix
yi∈{-1, +1} is a binary label
The problem could be treated by regarding Xi as a d1d2-dimensional vector
When the structure of a matrix is meaningful, such as spatio-temporal data (rows correspond to space and columns to time) or covariance matrices, the structure is lost in vectorization.
Linear model with matrix X as input
W is the matrix of coefficients
b is the bias term
The input matrices X1, …, Xn are not necessarily low rank
The assumption that the coefficient matrix W has low rank is a relatively mild assumption
In the case of spatio-temporal data where the rows of X are in space and the columns are in time
Given the singular value decomposition as described in “Overview of Singular Value Decomposition (SVD) and examples of algorithms and implementations” W=∑j=1rσjujvjT
The singular vectors uj and vj are filters that extract features in the spatial and temporal directions, respectively
The singular value σj is the weight of the feature
Instead of the rank constraint on the coefficient matrix W, we use the trace-norm regularization
ℓ is the loss function, which can be the hinge loss or the logistic loss
8.2 Mathematical Properties
8.2.1 Various definitions
Corollary: equivalent representation of the trace norm
Positive semidefinite matrices
The matrix P is positive semidefinite
All eigenvalues of P are non-negative (a symmetric matrix has no complex eigenvalues)
The singular values of a positive semidefinite matrix coincide with its eigenvalues
For a positive semidefinite matrix, the trace norm coincides with the trace tr(X)
An optimization problem consisting of positive semidefinite constraints on the variables, equality constraints, and a linear objective function
Trace norm is the dual norm of the spectral norm
The trace norm is a norm and a convex function
The trace norm can be expressed as a linear matrix inequality
Loss function is piecewise linear, like hinge loss, or squared error
Learning based on trace norm regularization can be reduced to a semidefinite programming problem.
Trace norm minimization is equivalent to minimizing the square of the Frobenius norm of the factor matrices U and V.
If the number of columns of matrices U and V is sufficiently large, relatively simple minimization methods such as alternating minimization can be used.
Regularizing the squared Frobenius norms of U and V is equivalent to placing independent normal prior distributions on U and V.
8.2.2 Relationship with Rank
Corollary: For a matrix of rank r, the ratio of the trace norm to the Frobenius norm is bounded by √r
8.2.3 Variational representation
Corollary: The trace norm can be represented variationally in the same way as the ℓ1 norm and the group ℓ1 norm
Square roots of matrices
8.2.4 The prox operator
Corollary
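The corollary is not reproduced here; as a hedged sketch, the prox operator of the trace norm can be computed by applying the soft-threshold function to the singular values (singular value thresholding). The matrix and threshold below are illustrative.

```python
import numpy as np

def trace_norm_prox(Y, lam):
    """Prox operator of lam * ||.||_tr: soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(6, 5))
W = trace_norm_prox(Y, lam=1.0)
print(np.linalg.matrix_rank(W))   # typically lower than the rank of Y
```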
8.3 Theory
The performance of matrix estimation using the trace norm is well understood.
Theorem
8.4 Optimization
8.4.1 Iterative Weighted Reduction
The iterative weighted reduction method can be extended to trace-norm regularization
8.4.2 (Accelerated) Proximal Gradient Method
Iterative form of the proximal gradient method for general learning problems with trace norm regularization
Efficient Computation 1
Efficient Computation 2
8.4.3 Dual Augmented Lagrangian
When the prox operator can be computed efficiently, as in the proximal gradient method, the dual augmented Lagrangian method can also be used.
Example: Classification problem with a matrix as input
Chapter 9: Sparse Regularization with Overlaps
Introduction.
Sparse regularization with overlapping groups
Combines sparse regularization on subvectors and linear transformations of a vector w∈ℝd
Applications in image processing, statistics, tensor decomposition, etc.
9.1 Definitions and Examples
Introduction
When the regularization term R(w) can be expressed as a sum of m regularization terms, R is called an overlapping regularization term.
If the functions Rj act on non-overlapping elements of the vector w, this reduces to the group ℓ1 norm
9.1.1 Elastic nets
For a vector w, the ℓ1 norm and the ℓ2 norm are used simultaneously as penalty terms.
Elasticnet
Comparison between the ℓ1-norm regularization term and the elasticnet regularization term
Like the ℓ1 norm, it is pointed at four vertices.
The contours are rounded between the endpoints.
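A minimal sketch of elastic-net regression using scikit-learn; the data, alpha, and l1_ratio below are illustrative, and scikit-learn's parameterization of the combined ℓ1/ℓ2 penalty (shown in the comment) may differ slightly from the book's notation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
w_true = np.zeros(50); w_true[:5] = 2.0
y = X @ w_true + 0.1 * rng.normal(size=100)

# objective: 1/(2n) ||y - Xw||^2 + alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio)/2 * ||w||_2^2)
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("nonzero coefficients:", np.sum(model.coef_ != 0))
```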
9.1.2 Total variation
Assume that the image W*∈ℝd1xd2 is observed as Y=W*+E with noise.
To remove the noise from this image, we consider a minimization problem
λ>0 is a regularization parameter
∥·∥TV is the isotropic total variation
Linear sum of the norm of the gradient vector at each pixel
∇xWx,y and ∇yWx,y are the derivative coefficients in x and y direction
Using the Sobel operator
Even with the simplest 3×3 filter, each element of W will be included in the regularization term for the eight surrounding pixels.
Since the isotropic total variation is defined via the ℓ2 norm of the gradient vector, it is invariant under rotation of the coordinates, as with the group ℓ1 norm.
If the ℓ2 norm is replaced by the ℓ1 norm, this rotation invariance is lost.
Example of image denoising using isotropic total variation
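A sketch of total-variation denoising using scikit-image's Chambolle implementation; the synthetic piecewise-constant image and the weight parameter are illustrative choices, not the book's experiment.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0                    # piecewise-constant test image
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# `weight` plays the role of the regularization parameter lambda
denoised = denoise_tv_chambolle(noisy, weight=0.2)
print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```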
Other Applications
Compressed sensing applied to MRI images
2D wavelet transformation
9.1.3 Group ℓ1 Norm Regularization with Overlaps
Group ℓ1 norm with overlapping groups
9.1.4 The Multilinear Rank of Tensors
Trace norm with overlaps
9.2 Mathematical Properties
9.2.1 Relation to the number of nonzero groups
Corollary
9.2.2 Dual norm
Corollary
Summary of properties of the dual type group ℓ1 norm
9.3 Optimization
Introduction.
The regularization terms are not separable with respect to the variables wj, making it difficult to compute the prox-operators
9.3.1 The case of elastic nets
Corollary
9.3.2 Alternating Direction Multiplier Method
Chapter 10 Atomic Norms
Introduction.
The ℓ1-norm induces element-wise sparsity of vectors
The group ℓ1 norm induces group-wise sparsity
Trace norm induces low-rank (sparsity) of matrices
The unit ball of such a norm has sharp corners at sparse points, which is why it is associated with sparsity
Toward a more general formulation; see "Overview of the atomic norm and examples of applications and implementations" for details.
10.1 Definitions and Examples
10.1.1 Group regularization with overlaps
10.1.2 Robust Principal Component Analysis
10.1.3 Multitask Learning
10.1.4 Nuclear Type Norm of Tensors
10.2 Mathematical Properties
10.3 Optimization
10.3.1 Frank-Wolfe Method
10.3.2 Alternating Direction Multiplier Method in Duals
10.4 Foreground Image Extraction Using Robust Principal Component Analysis
Chapter 11 Conclusion
11.1 What induces sparsity?
The ℓ1-norm is used instead of the hard-to-optimize quantity “number of non-zero elements”.
The ℓ1-norm is continuous and non-differentiable at the point where the number of non-zero elements is small (sparse), so the optimal solution will generally be sparse
A sum of functions g that are concave (upwardly convex) and monotonic with respect to the absolute value|wj| of each element of the vector w is valid as a sparse regularization term
The ℓp norm induces sparsity for p ≤ 1, but not for p = 2
For any atom set, we can define a norm whose unit ball is the convex hull of the atom set
which gives the tightest convex relaxation with respect to the sparsity represented by the atom set.
The atomic norm
includes already known norms such as the ℓ1-norm and the trace norm as special cases
provides a systematic framework for deriving new kinds of sparsity-inducing norms, such as for the set of matrices that are simultaneously sparse and low rank
In principle, for any set of atoms, it is possible to determine how many samples are needed to recover a sparse solution.
11.2 For what kinds of problems is sparsity appropriate?
Empirical checklist to determine if sparsity is effective
Consider a learning/estimation problem where dimension d is much larger than the number of samples
It is important to be able to explain not only the prediction performance, but also why the prediction is possible.
The distribution of the signal to be detected has a heavy tail
The distribution of the noise to be removed has a light tail.
Why sparsity is actively used in fields such as bioinformatics
The number of hypothesis candidates d is much larger than the number of samples
The ultimate goal is not to make predictions, but to understand the biological system.
Why sparsity has been used in geophysical surveys and image denoising since relatively early times.
Heavy tail of the detected signal
Distribution of the norm of the gradient between the original image and the normal noise image
Natural images contain many edges, so the gradients are distributed both close to zero and large
The gradient of the noise image is concentrated around the mean value.
When we view the set of coefficients of a vector as a distribution, a vector with a heavy-tailed coefficient distribution has a smaller ℓ1-norm than a vector whose coefficients are concentrated around the mean (at the same overall scale).
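A small numerical illustration of this point, under my own assumption (not stated in the book) that both vectors are normalized to the same ℓ2 norm: the heavy-tailed (nearly sparse) vector ends up with a noticeably smaller ℓ1 norm than the vector whose coefficients are all of similar magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000

heavy = rng.standard_cauchy(d)            # heavy-tailed coefficients (nearly sparse)
flat = rng.normal(1.0, 0.05, size=d)      # coefficients concentrated around the mean

# normalize both to unit l2 norm so the comparison is fair
heavy /= np.linalg.norm(heavy)
flat /= np.linalg.norm(flat)

print("l1 norm, heavy-tailed:", np.linalg.norm(heavy, 1))
print("l1 norm, concentrated:", np.linalg.norm(flat, 1))   # close to sqrt(d), the maximum
```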
11.3 In the end, which optimization algorithm should we use?
It is better to use an existing package
Dual Augmented Lagrangian (DAL) method
SPAMS (sparse modeling software)