Machine learning with sparsity


Machine Learning with Sparsity

Sparsity-based machine learning is a machine learning method that uses sparsity to perform tasks such as feature selection and dimensionality reduction of high-dimensional data.

Sparsity here refers to the property that most elements of the data are close to zero and only a few take non-zero values. It is exploited through models such as linear regression and logistic regression with L1 regularization to perform feature selection and dimensionality reduction. Because machine learning based on sparsity improves the interpretability of high-dimensional data in this way, it is widely used in areas such as data analysis and predictive model building.

Machine learning based on sparsity is applied to various applications, such as

  • Sensor data processing: When processing data from multiple sensors, sparsity-based methods can remove unnecessary data and extract only the necessary information.
  • Image processing: Using sparse representation for images, tasks such as image compression, restoration, and feature extraction can be performed.
  • Natural language processing: Using sparse representation for text data enables tasks such as word feature selection and document classification.

This sparsity is also used in regularization, a well-known method for preventing overfitting in machine learning. Regularization is widely used to control model complexity and is one of the key elements of machine learning.

In machine learning, as the complexity of a model increases, it tends to overfit the training data and generalization performance deteriorates.

There are two typical regularization methods:

  • L1 regularization (Lasso): L1 regularization penalizes the absolute values of the regression coefficients, driving small coefficients exactly to 0 and producing sparse models. Because it automatically removes unnecessary features, L1 regularization is useful for feature selection.
  • L2 regularization (Ridge): L2 regularization penalizes the squared magnitudes of the regression coefficients, shrinking them toward 0 without setting them exactly to 0. It suppresses overfitting by reducing the weights while retaining all features.
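
As a minimal illustration of the difference between the two (assuming scikit-learn and NumPy are available; the data below are synthetic), the following sketch fits Lasso and Ridge to the same problem and counts the coefficients that are driven exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 samples, 50 features, only 5 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
true_coef = np.zeros(50)
true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y = X @ true_coef + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 regularization
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 regularization

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # most are exactly 0
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none are exactly 0
```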

Various other regularization methods exist, including Elastic Net and Max Norm regularization.

This blog discusses the theory and implementation of machine learning that makes use of this sparsity.

Sparse Modeling

From the Iwanami Data Science series “Sparse Modeling and Multivariate Data Analysis”, on sparse modeling.

“In the first half of the 2000s, sparse modeling filled the presentations at overseas conferences in theoretical fields such as statistics, information theory, and signal processing. The centerpiece was regularization by the L1 norm, defined as the sum of absolute values. Candes, one of the architects of compressed sensing, said, ‘If the L2 norm is the norm of the 20th century, then the L1 norm is the norm of the 21st century.’ This statement is a bit of an exaggeration, but it is true that L1-norm regularization has made it possible to process information that had previously been given up on.

Once a theoretical framework is established, it can be applied to problems in other fields. Sparse modeling has accordingly been incorporated into all kinds of data analysis, including bioinformatics, engineering, and big data. Nowadays it is hard to find a field where sparsity is not used, and sparse modeling is well established as a fundamental method.

Sparse modeling is a technique that automatically extracts the necessary parts of a statistical model according to the given data. In the case of multiple regression analysis, this means extracting the few necessary variables from a long list of candidate explanatory variables. There are, on the other hand, traditional terms such as “model selection” and “variable selection”: “variable selection” refers to selecting the variables to be included in a model, while “model selection” is used more generally and includes, for example, the choice of noise distribution.

Here we will refer to the approach of “fitting multiple models and then choosing among them” as variable selection/model selection, and the approach of “adding a penalty term so that unnecessary parameters are eliminated simultaneously with parameter estimation”, as in L1 regularization (lasso), as sparse modeling.

To consider this problem, it is helpful to distinguish the following three levels:

  1. Why do we want a simple model (fundamental problem)?
  2. What kind of equation is optimal (mathematical expression)?
  3. How do we find the best model (algorithm)?

It is part 3 where important progress has been made in sparse modeling. However, this also includes the question of how the mathematical expression in 2 should be set up so that 3 works.

This sparse modeling has been cited as a breakthrough in modern machine learning, along with deep learning and kernel methods.

In this blog, I will discuss the following about this sparse modeling.

Implementation

Sparse modeling is a technique that takes advantage of sparsity in the representation of signals and data. Sparsity refers to the property that non-zero elements in data or signals are limited to a very small portion. The purpose of sparse modeling is to efficiently represent data by utilizing sparsity, and to perform tasks such as noise removal, feature selection, and compression.

This section provides an overview of sparse modeling algorithms such as Lasso, compressed sensing, Ridge regularization, elastic nets, Fused Lasso, group regularization, message-passing algorithms, and dictionary learning, and describes their implementation in various applications such as image processing, natural language processing, recommendation, machine learning, signal processing, and brain science.

  • Overview of Overlapping Group Regularization and Implementation Examples

Overlapping group regularization (Overlapping Group Lasso) is a regularization method used in machine learning and statistical modeling for feature selection and for estimating model coefficients, in which a feature is allowed to belong to more than one group at the same time. This section provides an overview of overlapping group regularization and various implementations.
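
As a reference point, the penalty used in (overlapping) group regularization can be written as follows, where \(\mathcal{G}\) is a predefined collection of (possibly overlapping) index groups, \(\omega_g\) is the subvector of \(\omega\) restricted to group \(g\), and the \(\lambda_g\) are group-wise weights (the notation here is illustrative):

\[
\Omega(\omega) = \sum_{g \in \mathcal{G}} \lambda_g \,\|\omega_g\|_2 .
\]

When the groups are disjoint this reduces to the ordinary group lasso; allowing overlaps lets the sparsity pattern be expressed in terms of the group structure rather than individual variables.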

In this article, we discuss sparse estimation using the lasso (Least Absolute Shrinkage and Selection Operator) in a linear regression setting, where we want to estimate the relationship between a disease and genes by examining blood samples from a large number of patients. We describe the computation with the L1 norm in the case where the number of blood samples N is not very large, while, thanks to advances in measurement technology, the number of genes p that can be examined from a blood sample is very large.
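
The following is a minimal sketch of this p ≫ N setting using scikit-learn (synthetic data standing in for the gene-expression example; all names are illustrative): cross-validated lasso picks out a handful of nonzero coefficients from thousands of candidates.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# p >> N: 100 "patients", 2000 "genes", only 10 genes truly related to the outcome.
rng = np.random.default_rng(1)
N, p = 100, 2000
X = rng.normal(size=(N, p))
beta = np.zeros(p)
beta[rng.choice(p, size=10, replace=False)] = rng.normal(scale=2.0, size=10)
y = X @ beta + 0.5 * rng.normal(size=N)

# LassoCV selects the regularization strength by cross-validation.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("alpha chosen by CV:", model.alpha_)
print("number of selected genes:", selected.size)
```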

This article describes the application of the lasso using glmnet and genlasso, sparse modeling tools in R. As an application example, I use the Boston housing data, which is often used in R to explain variable selection in multiple regression analysis, to solve the task of predicting the median home price (medv, the objective variable, in units of $1000) from the other variables.

In addition, we use the fused lasso (genlasso) to estimate the true values underlying noisy series data.
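
A minimal fused-lasso sketch of this kind of series denoising, written in Python with cvxpy assumed to be installed (the linked article uses genlasso in R; this is just the same penalized objective):

```python
import numpy as np
import cvxpy as cp

# Noisy piecewise-constant series: the fused lasso penalizes differences
# between neighboring values, recovering a step-like underlying signal.
rng = np.random.default_rng(2)
true = np.concatenate([np.full(40, 0.0), np.full(30, 2.0), np.full(30, -1.0)])
y = true + 0.3 * rng.normal(size=true.size)

beta = cp.Variable(y.size)
lam = 2.0  # regularization strength (illustrative value)
objective = cp.Minimize(0.5 * cp.sum_squares(y - beta) + lam * cp.norm1(cp.diff(beta)))
cp.Problem(objective).solve()

estimate = beta.value  # approximately piecewise constant
print(np.round(estimate[:5], 2), np.round(estimate[45:50], 2))
```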

When the data are given as pairs of a scalar output y and an M-dimensional vector x, \(D=\{(y^{(n)},x^{(n)})\mid n=1,\dots,N\}\), we discuss how to obtain a sparse solution in linear regression (sparsifying the regression coefficients, or sparsifying in the sense of eliminating extra samples rather than variables).

As use cases, I discuss the situation where the observed data are contaminated with noise and not all samples can be trusted, and, by extending the approach to graph data, the situation where a machine is constantly monitored by a fixed number of sensors.

We describe the graphical lasso (sparse structure learning of Gaussian graphical models), which introduces sparsity into the relationships among variables, and its application to anomaly detection (an extension of the Hotelling T2 score).
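
As a sketch of the structure-learning step only (scikit-learn's GraphicalLasso on synthetic sensor-like data; the anomaly-scoring extension described in the linked article is not reproduced here):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Multivariate sensor-like data with a low-dimensional shared structure.
rng = np.random.default_rng(3)
n, d = 500, 6
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, d)) + 0.5 * rng.normal(size=(n, d))

model = GraphicalLasso(alpha=0.1).fit(X)
precision = model.precision_  # sparse inverse covariance matrix
# Nonzero off-diagonal entries indicate direct (conditional) dependencies.
print(np.round(precision, 2))
```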

An overview of dictionary learning by machine learning based on the Sparseland model, rather than dictionaries derived from signal-processing knowledge (such as the DCT basis used in JPEG), for the sparse representation of images using dictionary data.

Actual Removal of Noise from Image Information Using Sparseland Model
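
A rough sketch of this patch-based denoising workflow, using scikit-learn's dictionary learning as a stand-in for the Sparseland-style pipeline described in the linked article (the "image" below is synthetic and all parameter values are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

# Toy grayscale image with additive noise (stand-in for a real photograph).
rng = np.random.default_rng(4)
clean = np.tile(np.linspace(0, 1, 64), (64, 1))
noisy = clean + 0.1 * rng.normal(size=clean.shape)

# Learn a dictionary from small patches of the noisy image.
patch_size = (8, 8)
patches = extract_patches_2d(noisy, patch_size, max_patches=2000, random_state=0)
data = patches.reshape(patches.shape[0], -1)
data -= data.mean(axis=1, keepdims=True)  # work on zero-mean patches

dico = MiniBatchDictionaryLearning(
    n_components=50, alpha=1.0,
    transform_algorithm="omp", transform_n_nonzero_coefs=3,
    random_state=0,
).fit(data)

# Sparse-code all patches of the image and reconstruct a denoised version.
all_patches = extract_patches_2d(noisy, patch_size)
flat = all_patches.reshape(all_patches.shape[0], -1)
m = flat.mean(axis=1, keepdims=True)
codes = dico.transform(flat - m)
denoised_patches = (codes @ dico.components_ + m).reshape(all_patches.shape)
denoised = reconstruct_from_patches_2d(denoised_patches, noisy.shape)
print("mean abs error, noisy vs denoised:",
      np.abs(noisy - clean).mean(), np.abs(denoised - clean).mean())
```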

The phenomenon of consumers “switching” among n different manufacturers is represented by an n×n transition matrix and analyzed with the fused lasso. As an example, it is applied to data on changes in the market share of automobile companies over time.

Multivariate data can often be represented in the form of a matrix. For example, data collected for a large number of students on their performance in various subjects can naturally be represented as a matrix of the form student x subject. The value of an element of the matrix is the grade of a student in a certain subject (e.g. English). Data on how many products a customer has purchased (or not), or how many words appear (or not) in a document, collected for many customers or documents, can also be represented as a matrix of the form customer x product or document x word.

This article describes a type of data analysis in which data in the form of such matrices are decomposed into the product of several matrices. This expresses each individual data value as a sum of products of a few underlying values and extracts the essential variables. This is where the sparsity property is introduced.

In this article, the theory of singular value decomposition (SVD) is described as a method of matrix decomposition.

Although a matrix admits infinitely many possible factorizations, a decomposition that gives an expansion of each row vector of the original matrix in terms of an orthonormal basis, and that approximates the original data in the least-squares sense when the expansion is truncated, can be obtained by singular value decomposition.
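
A minimal NumPy sketch of this truncated expansion (synthetic data): keep only the top-r singular values and the result is the best rank-r approximation in the least-squares sense.

```python
import numpy as np

# A "student x subject" style matrix: approximately rank 2 plus noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 8)) + 0.05 * rng.normal(size=(30, 8))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 2  # truncation rank
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # best rank-r approximation

print("singular values:", np.round(s, 3))
print("relative error of rank-2 approximation:",
      np.linalg.norm(X - X_r) / np.linalg.norm(X))
```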

Low-rank approximation through such a decomposition has been applied not only to customer x product data but also to various other data. For example, if X is document x word data, the method of decomposing the matrix by singular values is called latent semantic analysis (LSA) or latent semantic indexing (LSI).

By applying singular value decomposition to this matrix, we obtain a new basis for representing the meaning of a document (a concept such as a weighted combination of multiple word meanings, i.e., a topic) and a representation of each document in the new basis. For example, viewing the shopping example as a document x word matrix, we can interpret the result as extracting a bread topic and a fruit topic from documents about bread and documents about fruit.
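
A minimal LSA/LSI sketch of the kind described here, using scikit-learn (the tiny corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Tiny corpus: two "bread" documents and two "fruit" documents.
docs = [
    "bread butter toast bakery",
    "bakery bread flour toast",
    "apple banana fruit orange",
    "fruit apple orange juice",
]

X = TfidfVectorizer().fit_transform(docs)   # document x word matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
topics = lsa.fit_transform(X)               # documents expressed in the new "topic" basis

print(topics.round(2))  # bread and fruit documents separate along the two axes
```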

SVD (Singular Value Decomposition), PMD (Penalized Matrix Decomposition), and NMF (Non-negative Matrix Factorization) using R (the arules, PMA, and NMF packages).

Theory of Machine Learning with Sparsity

A brief description of sparsity is given.

In this article, I discuss what it means to learn from data, explain common terms in machine learning such as the expected error and the empirical error, and explain why regularization is needed to suppress overfitting.

In this article, we introduce sparsity starting from Occam’s razor (the so-called principle of parsimony) and intuitively introduce L1-norm regularization as its convex approximation. We also compare L1-norm regularization with some simple heuristics on artificial data and show that it performs better.

In this article, we discuss the problem of finding a sparse solution that satisfies an underdetermined system of linear equations, obtained from a small number of observations, by minimizing the L1 norm. We consider geometrically the conditions under which the true sparse vector can be recovered from a given observation, and then give an expression for the probability that this condition is satisfied when the problem is generated randomly.
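
A small sketch of this recovery problem (basis pursuit), again assuming cvxpy is available and using a random synthetic instance: minimize the L1 norm subject to the underdetermined linear observations.

```python
import numpy as np
import cvxpy as cp

# Underdetermined system: 40 observations of a 100-dimensional vector with 5 nonzeros.
rng = np.random.default_rng(6)
d, n, k = 100, 40, 5
A = rng.normal(size=(n, d))
x_true = np.zeros(d)
x_true[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)
b = A @ x_true

# Basis pursuit: minimize the L1 norm subject to the observation equations.
x = cp.Variable(d)
cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == b]).solve()

print("recovery error:", np.linalg.norm(x.value - x_true))
```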

In the previous section, we investigated the conditions under which a high-dimensional sparse vector can be recovered from a low-dimensional observation in the absence of observation noise. In this section, we consider the problem of recovering sparse vectors from noisy linear observations and analyze the performance of estimators based on L1-norm regularization. The differences from the previous setting are that we take observation noise into account and also deal with the case where the true vector is not strictly sparse. Through these analyses, we also clarify why the L1 norm is effective in the high-dimensional case (n ≪ d).

Let \(S_k \subset \mathbb{R}^d\) be the k-dimensional subspace spanned by the coordinates of the k largest (in absolute value) coefficients of the true regression coefficient vector \(\omega^*\), and let \(S_k^\perp\) be its orthogonal complement. The vector then decomposes as \(\omega^* = \omega_{S_k}^* + \omega_{S_k^\perp}^*\). For example, if \(\omega^* = (0.5, 1, -2, 0.1)^T\), the decomposition for k = 2 is \(\omega_{S_2}^* = (0, 1, -2, 0)^T\) and \(\omega_{S_2^\perp}^* = (0.5, 0, 0, 0.1)^T\).

In this article, we discuss the iteratively reweighted shrinkage method, the (accelerated) proximal gradient update, the dual augmented Lagrangian method, and the dual alternating direction method of multipliers as optimization methods for the optimization problems that frequently appear in machine learning. As prerequisites, concepts such as the variational representation of norms, prox operators, and convex conjugates are also discussed.

The (accelerated) proximal gradient method is an extension of the gradient method to problems with non-smooth regularization terms, and it therefore also inherits the weaknesses of the gradient method: for example, convergence can be quite slow when the condition number of the data matrix X is poor.
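
As a concrete instance, here is a minimal, unaccelerated proximal gradient (ISTA) iteration for the lasso in NumPy on synthetic data; the prox of the L1 norm is elementwise soft-thresholding. This is a plain sketch, not the accelerated variant discussed in the linked article.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - X w||^2 + lam*||w||_1 by proximal gradient descent."""
    L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)       # gradient of the smooth part
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Small synthetic problem.
rng = np.random.default_rng(7)
X = rng.normal(size=(80, 200))
w_true = np.zeros(200)
w_true[:5] = [2, -3, 1.5, -1, 2.5]
y = X @ w_true + 0.1 * rng.normal(size=80)

w_hat = ista_lasso(X, y, lam=1.0)
print("nonzeros found:", np.flatnonzero(w_hat)[:10])
```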

In this article, as a further development, we describe norms that select zero/non-zero coefficients group by group using a predefined group structure, and show that they are effective for multi-task learning, multiple kernel learning, and vector field estimation.

Low-rank matrices have a great many applications, including collaborative filtering, system identification, and learning problems that take matrices as input. However, rank constraints are nonconvex and lead to optimization problems that are difficult to handle directly. In this article, we discuss how the trace norm, obtained as a natural extension of L1-norm regularization, induces low-rankness, and how many of the notions associated with the L1 norm, such as the dual norm and the prox operator, extend to the singular values of a matrix.
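
A minimal NumPy sketch of how the prox operator extends from the L1 norm (soft-thresholding of elements) to the trace norm (soft-thresholding of singular values); the matrix and threshold below are illustrative.

```python
import numpy as np

def prox_trace_norm(W, t):
    """Prox of t * (trace norm): soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - t, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

# A noisy low-rank matrix: the prox step shrinks small singular values to zero,
# inducing low rank just as soft-thresholding induces element-wise sparsity.
rng = np.random.default_rng(8)
W = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15)) + 0.1 * rng.normal(size=(20, 15))
W_lowrank = prox_trace_norm(W, t=1.0)
print("rank before:", np.linalg.matrix_rank(W),
      "rank after:", np.linalg.matrix_rank(W_lowrank, tol=1e-6))
```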

In this article, we discuss sparse regularization with overlaps. Overlapping sparse regularization is a combination of sparse regularization terms defined, for example, on subvectors or linear transformations of a vector \(\omega \in \mathbb{R}^d\), and it has applications in image processing, statistics, tensor decomposition, and elsewhere.

The L1 norm, the group L1 norm, and the trace norm induce different kinds of sparsity: element-wise sparsity of vectors, group-wise sparsity, and low-rankness of matrices, respectively. Intuitively, the unit balls of these norms are associated with their respective kinds of sparsity because they have corners at points that are sparse in each sense. In this article, I describe this geometric intuition in terms of the atomic norm and discuss how the norms above can not only be recovered from this framework, but also how norms that induce new kinds of sparsity can be defined.

We discuss two mathematical properties of the atomic norm: the equivalence between the atomic norm and the norm whose unit ball is the convex hull of the atom set, and the representation of its dual norm. Although the atomic norm is mathematically elegant and covers norms that induce various kinds of sparsity, the norm itself and its prox operator are difficult to compute except in special cases such as the L1 norm, the group L1 norm, and the trace norm. We discuss the Frank-Wolfe method, which is effective when optimization to moderate accuracy is sufficient, and the dual alternating direction method of multipliers, which is effective when a somewhat more accurate solution is desired. Finally, a concrete example of foreground extraction from images using robust principal component analysis is presented.

In sparse learning, the L1 norm is used in place of the hard-to-optimize quantity “number of nonzero elements”; because the L1 norm is continuous yet non-differentiable at points with few nonzero elements (sparse points), the optimal solution is generally sparse. In this sense, any sum over the elements of functions g of the absolute values |ωj| that is continuous, concave (convex upward), and monotone is similarly valid as a sparse regularization term.
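
For example, the log penalty below is one such continuous, concave, and monotone choice of g (the constant \(\epsilon > 0\) is illustrative); minimizing it by repeatedly linearizing around the current estimate yields the familiar reweighted L1 scheme, in which coordinate j receives the weight \(1/(|\omega_j^{\mathrm{old}}|+\epsilon)\) in the next L1-regularized subproblem.

\[
\sum_{j=1}^{d} g(|\omega_j|), \qquad g(t) = \log(t + \epsilon).
\]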

Multi-Task Learning is a machine learning method that simultaneously learns multiple related tasks. Usually, each task has a different data set and objective function, but Multi-Task Learning aims to incorporate these tasks into a model at the same time so that they can complement each other by utilizing their mutual relevance and shared information.

Here, we provide an overview of methods for multi-task learning such as shared-parameter models, model distillation, transfer learning, and multi-objective optimization, and discuss application examples in natural language processing, image recognition, speech recognition, and medical diagnosis, together with a simple implementation in python.
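
On the sparsity side, scikit-learn's MultiTaskLasso gives one compact instance of the shared-structure idea: a mixed L1/L2 penalty selects the same features across all tasks. A minimal sketch with synthetic data (all values illustrative):

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

# 3 related tasks that share the same 5 relevant features.
rng = np.random.default_rng(9)
n, d, T = 100, 30, 3
X = rng.normal(size=(n, d))
W = np.zeros((d, T))
W[:5, :] = rng.normal(size=(5, T))
Y = X @ W + 0.1 * rng.normal(size=(n, T))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)
# coef_ has shape (n_tasks, n_features); all rows share the same zero pattern.
print("features kept:", np.flatnonzero(np.any(model.coef_ != 0, axis=0)))
```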

Basic ideas for sparse modeling and an introduction to Lasso. Explanation of the removal of noise factors in linear regression models, with real-world examples.

The world’s leading textbook on sparse modeling

Sparse modeling and multivariate data analysis

The bias-variance trade-off means that the more complex the model being fitted, the greater the effect of the variability of each parameter, and when the number of samples is small this effect outweighs the improvement in the model’s expressive power obtained by increasing the number of parameters (as opposed to the bias that distorts the result by forcing the data into the model’s own mold).

In data analysis, the “true model” is not known, so in order to select a model, methods such as CV, AIC, and Cp are used to estimate the “goodness of prediction” without the true model.

When there are M candidate explanatory variables, searching over all models with the above methods requires choosing among 2^M models (including the constant function); sparse modeling is an efficient way to select among them.
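
As a small illustration of this efficiency, the lasso path below traverses a continuum of models with a single call instead of enumerating all 2^M subsets (scikit-learn's lasso_path on synthetic data):

```python
import numpy as np
from sklearn.linear_model import lasso_path

# M = 20 candidate explanatory variables -> 2^20 (about a million) possible subsets.
rng = np.random.default_rng(10)
X = rng.normal(size=(100, 20))
coef = np.zeros(20)
coef[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = X @ coef + 0.1 * rng.normal(size=100)

# One call computes the whole regularization path; each alpha corresponds to
# a different subset of selected variables.
alphas, coefs, _ = lasso_path(X, y)
n_selected = (coefs != 0).sum(axis=0)   # coefs has shape (n_features, n_alphas)
print(list(zip(np.round(alphas[:5], 3), n_selected[:5])))
```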

    In a meta-analysis, data from multiple trials are integrated to enable a better-informed assessment of treatment effects in cases where individual trials lack sufficient information to draw inferences with adequate precision. The problem here is the assumption that “the relative risk is the same across all trials”. In general, the trials collected in this way are conducted at different times, in different regions (countries), and at different sites, and it is common for various factors, such as the background of the participants and the definitions of study-drug doses and outcomes, to differ between trials.

      Applications using sparse learning

      In the monitoring of systems represented by multiple variables, it is important to know the contribution of each variable to an individual anomalous event. There are not many ways to do this, apart from a few methods such as naive Bayes. In this section, we discuss means of calculating the degree of anomaly of individual variables based on the idea of linking anomalies to the breakdown of dependencies among variables. In learning the dependencies among variables, we devise a calculation method that efficiently extracts the essential dependencies, which makes it possible to automatically extract the modular structure inherent in the system even when the apparent dimension is high.

      Structural regularization learning is a framework for exploiting the structure among data variables in supervised learning. In this section, we describe the relationship between structural regularization and submodular functions and an algorithm for structural regularization through submodular optimization using this relationship.

      When we have data at hand to which we want to apply machine learning methods, it is unlikely that each variable in the data is completely unrelated to each other. For example, when dealing with image data, the variables in the model correspond to pixels in the image, so it can be said that the adjacency structure of pixels exists as a relationship between variables. Also, when dealing with genetic data, each variable corresponds to a gene, so there are known gene interactions and so on. In learning, it is expected that taking these relationships into account will improve accuracy and produce models that are easy to interpret.

      Here, we describe structural regularization, a machine learning method that takes the structure among the data variables into account. Since such structure is, so to speak, prior information for learning, it can be used effectively to reduce the real dimension of the problem and improve learnability. In this sense, structural regularization can be regarded as an approach to so-called sparse modeling.

      Recently, structural regularization has attracted attention because of its close relationship with submodular optimization. This relationship is not only theoretically interesting but also practically important, since structural regularization learning can be reduced to network-flow problems that can be computed extremely fast.

      We now discuss the structural sparsity obtained from submodular functions. The aforementioned regularization by the 𝓵p norm treats each variable ω1,…,ωd uniformly, as can be seen from its form. However, this uniformity is not essential to regularization. In this section, we describe structured regularization, a method that extends regularization by explicitly using the relationships among the variables in the data.

      There are two main types of sparsity obtained by regularization. The first is sparsity in the sense that many parameters are driven to exactly 0, as in the 𝓵1 regularization described above; here the unit of sparsity is each variable, but as described below this can be extended to groups of variables given in advance. The other type is sparsity in the sense that parameters with similar properties tend to take the same value. The former is called group-type regularization and the latter coupling-type regularization, and they correspond to the penalties sketched below.
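
      Schematically, the two types correspond to penalties of the following forms, where the groups \(\mathcal{G}\) and the edge set \(E\) of a graph over the variables are assumed to be given in advance (the notation is illustrative): the first zeroes out whole groups of parameters, while the second pulls related parameters toward a common value.

      \[
      \Omega_{\mathrm{group}}(\omega) = \sum_{g \in \mathcal{G}} \|\omega_g\|_2,
      \qquad
      \Omega_{\mathrm{couple}}(\omega) = \sum_{(i,j) \in E} |\omega_i - \omega_j| .
      \]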

      Application of submodular function optimization, an optimization method for discrete information, to structural regularization problems, and formulations using submodular optimization (linear approximation and the steepest descent method, the accelerated proximal gradient method, FISTA, parametric submodular minimization, and partition algorithms).

      In this article, we will discuss nonparametric Bayes in factor analysis and sparse modeling. Here, we will focus on beta processes, another stochastic process that constitutes a nonparametric Bayesian model.

      The approach is to consider an infinite dimensional binary matrix generating process using the beta-Bernoulli distribution model. The specific algorithm is called the Indian buffet process (IBP). These are computed using Gibbs sampling.
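
A minimal NumPy sketch of the IBP generative process itself (the Gibbs-sampling inference described in the linked article is not reproduced here): customer i takes an existing dish k with probability m_k/i and then tries a Poisson(α/i) number of new dishes.

```python
import numpy as np

def sample_ibp(num_customers, alpha, seed=None):
    """Draw a binary feature matrix Z (customers x dishes) from the Indian buffet process."""
    rng = np.random.default_rng(seed)
    counts = []   # m_k: how many previous customers took dish k
    rows = []
    for i in range(1, num_customers + 1):
        row = []
        # Existing dishes: take dish k with probability m_k / i.
        for k in range(len(counts)):
            take = int(rng.random() < counts[k] / i)
            row.append(take)
            counts[k] += take
        # New dishes: a Poisson(alpha / i) number, all taken by this customer.
        new = rng.poisson(alpha / i)
        row.extend([1] * new)
        counts.extend([1] * new)
        rows.append(row)
    Z = np.zeros((num_customers, len(counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(num_customers=10, alpha=2.0, seed=0)
print(Z.shape)  # (10, K): the number of active features K is itself random
print(Z)
```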

      While compressed sensing is a decoding method that estimates sparse original data from a small number of observations, sparse coding, an encoding method that represents the original data with a set of basis vectors called a dictionary and sparse coefficient vectors, was also proposed in the 2000s. Sparse coding not only compresses data but also has various applications, such as removing noise from images and increasing image resolution through super-resolution.
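
As a small sketch of the encoding step, orthogonal matching pursuit from scikit-learn recovers a sparse code for a signal over a given dictionary (the random dictionary and 3-atom signal below are purely illustrative stand-ins for a learned dictionary):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

# A random, column-normalized dictionary and a signal built from 3 of its atoms.
rng = np.random.default_rng(11)
n_features, n_atoms = 64, 256
D = rng.normal(size=(n_features, n_atoms))
D /= np.linalg.norm(D, axis=0)
code_true = np.zeros(n_atoms)
code_true[rng.choice(n_atoms, size=3, replace=False)] = [1.5, -2.0, 1.0]
signal = D @ code_true

# OMP finds a 3-sparse code that reproduces the signal from the dictionary.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3, fit_intercept=False).fit(D, signal)
print("atoms used:", np.flatnonzero(omp.coef_))
print("reconstruction error:", np.linalg.norm(D @ omp.coef_ - signal))
```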

      Some problems for which a sparsity-based approach is worth considering are as follows.

      1. Learning/estimation problems where the dimension d is much larger than the number of samples n
      2. Problems where it is important to be able to explain not only the prediction performance but also why a given prediction was made
      3. Problems where the distribution of the signal to be detected is heavy-tailed
      4. Problems where the distribution of the noise to be removed is light-tailed

      Sparsity is actively used in fields such as bioinformatics because the number of candidate hypotheses d is much larger than the number of samples n (problem 1) and because the ultimate goal is not prediction itself but an understanding of the biological system (problem 2).

      Also, the reason sparsity was adopted relatively early in geophysical exploration and image denoising is not only that these are inverse problems of type 1, but also that the heavy-tailed distribution of the signal to be detected (problem 3) is a key working hypothesis, as noted in the list above.

      Algorithms for computing these are available in the glmnet library for R and in scikit-learn for Python. There is also a package for the dual augmented Lagrangian (DAL) method published by Tomioka, as well as SPAMS (SPArse Modeling Software) from a French group.
