Statistical Feature Extraction (PCA, LDA, CCA, PLS)

Summary

Image recognition technology refers to technology in which a computer analyzes a digital image and identifies objects, people, landscapes, etc. in the image. Among these, DNN algorithms such as CNN are commonly used because feature extraction and classification can be achieved simultaneously, and high accuracy can be obtained. However, since DNN requires a large amount of training data, approaches combining other algorithms are also being considered when only a small amount of data is available. Image recognition technology is used in a wide variety of fields, including security surveillance, medical imaging, automated driving technology, robotics, and image retrieval.

Here, we discuss various methods and applications of this image recognition technology based on the Machine Learning Professional Series “Image Recognition”.

Previously, we discussed the first step, local features. This time, we will discuss the next step, statistical features.

Statistical Feature Extraction (PCA, LDA, CCA, PLS)

In actual images, disturbances and noise are inevitably added, and if we use local features obtained from an image affected by such disturbances, we may not achieve the recognition accuracy we expect. Statistical feature extraction is therefore needed to convert the observed data into features that are advantageous for recognition, based on the statistical structure established from the data.

What is statistical feature extraction?

Statistical feature extraction means that the extracted local features are further transformed, based on the probabilistic and statistical structure of the data, into robust features that are not easily affected by noise or disturbances. Statistical feature extraction can be applied not only to local features but also to various other features in image recognition.

Statistical feature extraction can be classified according to the presence or absence of an external criterion, i.e., teacher information such as which class each data point belongs to. When there is no external criterion, principal component analysis is used for feature extraction. Since the scale of the features has a significant impact on the recognition results, feature normalization is also important, and whitening by principal component analysis is used for this purpose. On the other hand, when an external criterion is available, Fisher’s linear discriminant analysis is used for feature extraction in class recognition, canonical correlation analysis is used to maximize the correlation between two sets of variables, and the partial least squares method is used to maximize the covariance between two sets of variables. At first glance, these methods appear to be different, but they are deeply interrelated.

Principal component analysis

First, let us discuss principal component analysis (PCA), a dimensionality reduction method that finds a low-dimensional subspace which best approximates, under a certain criterion, the distribution of the data in a linear space. Several criteria are used in principal component analysis, such as maximizing the variance of the projected data or minimizing the projection error, but the results obtained are essentially the same regardless of which criterion is used. In this section, we discuss principal component analysis using variance maximization as the criterion.

Principal Component Analysis, also known as KL expansion (Karhunen-Loeve expansion), is used not only for feature extraction, but also for data compression and data visualization.

First, we consider a data set \(D=\{x_n \in \mathbb{R}^{D_x}\}_{n=1}^N \). The mean of the data \(\tilde{x}\) and the covariance matrix \(\mathbf{C}_{xx}\) can be calculated as follows.

\begin{eqnarray}\tilde{x}&=&\displaystyle\frac{1}{N}\sum_{n=1}^N x_n\quad(1)\\C_{xx}&=&\displaystyle\frac{1}{N}\sum_{n=1}^N (x_n-\tilde{x})(x_n-\tilde{x})^T\quad(2)\end{eqnarray}

Here, the data average \(\tilde{x}\) in equation (1) is represented using a data matrix. The data matrix is a matrix of data arranged in rows, as shown in the following equation.

\[\mathbf{X}=\begin{pmatrix}x_1^T\\\vdots\\x_N^T\end{pmatrix}\quad(3)\]

Using X in equation (3) to express equation (1), we get the following equation

\[\tilde{x}=\frac{1}{N}\mathbf{X}^T\mathbf{1}_N\quad(4)\]

where \(\mathbf{1}_N=(1,\dots,1)^T\) is an N-dimensional vector whose elements are all 1. Now, we define the mean deviation matrix as shown in the following equation.

\[\tilde{\mathbf{X}}=\mathbf{X}-\mathbf{1}_N\tilde{\mathbf{x}}^T\quad(5)\]

Using the data matrix and the mean deviation matrix, the covariance matrix in equation (2) is as follows

\[\mathbf{C}_{xx}=\frac{1}{N}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}=\frac{1}{N}\mathbf{X}^T\mathbf{X}-\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T=\frac{1}{N}\mathbf{X}^T\mathbf{P}_c\mathbf{X}\quad(6)\]

where \(\mathbf{P}_c=\mathbf{I}-\frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T\) is the centering matrix.
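As a quick illustration, the following NumPy sketch (not from the original article; the variable names are my own) numerically checks that the three expressions for the covariance matrix in equation (6) agree on randomly generated data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Dx = 100, 5                          # sample size and feature dimension (arbitrary)
X = rng.normal(size=(N, Dx))            # data matrix with one sample per row, eq. (3)

x_mean = X.mean(axis=0)                 # data mean, eq. (1)/(4)
X_tilde = X - x_mean                    # mean deviation matrix, eq. (5)

C_direct = X_tilde.T @ X_tilde / N                      # (1/N) X~^T X~
C_uncentered = X.T @ X / N - np.outer(x_mean, x_mean)   # (1/N) X^T X - x~ x~^T
P_c = np.eye(N) - np.ones((N, N)) / N                   # centering matrix P_c
C_centered = X.T @ P_c @ X / N                          # (1/N) X^T P_c X

assert np.allclose(C_direct, C_uncentered)
assert np.allclose(C_direct, C_centered)
```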

Here, the data set is projected to one dimension by the coefficient vector \(\mathbf{u}\in\mathbb{R}^{D_x}\).

\[s_n=\mathbf{u}^T(\mathbf{x}_n-\tilde{\mathbf{x}})\quad(7)\]

Considering all data, the following equation is obtained.

\[\mathbf{s}=\tilde{\mathbf{X}}\mathbf{u}\quad(8)\]

where \(\mathbf{s}=(s_1,s_2,\dots,s_N)^T\) is the N-dimensional vector of projection components.

Next, we consider maximizing the variance of the projection components \(s_n\). The variance of the projection components can be obtained as follows

\begin{eqnarray} var(s)&=&\displaystyle\frac{1}{N}\sum_{n=1}^N s_n^2=\frac{1}{N}\mathbf{s}^T\mathbf{s}\\&=&\frac{1}{N}(\tilde{\mathbf{X}}\mathbf{u})^T(\tilde{\mathbf{X}}\mathbf{u})=\frac{1}{N}\mathbf{u}^T\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\mathbf{u}\\&=&\mathbf{u}^T\mathbf{C}_{xx}\mathbf{u}\quad(9)\end{eqnarray}

The projection axis \(\mathbf{u}\) is assumed to satisfy \(\mathbf{u}^T\mathbf{u}=1\), since only its direction, not its magnitude, is meaningful. Therefore, principal component analysis is the following problem of maximizing the variance var[s] of the projection components under the condition that the norm of the coefficient vector \(\mathbf{u}\) is 1.

\[\max_{\mathbf{u}}\ var[\mathbf{s}]\quad s.t.\ \mathbf{u}^T\mathbf{u}=1\quad(10)\]

This optimization reduces to finding a stationary point of the following objective function, where λ is a Lagrange multiplier.

\[J(\mathbf{u})=\mathbf{u}^T\mathbf{C}_{xx}\mathbf{u}-\lambda(\mathbf{u}^T\mathbf{u}-1)\quad(11)\]

Differentiating this objective function with respect to \(\mathbf{u}\) and setting \(\frac{\partial J}{\partial\mathbf{u}}=0\), we get the following

\[\frac{\partial J}{\partial\mathbf{u}}=2\mathbf{C}_{xx}\mathbf{u}-2\lambda\mathbf{u}=0\quad(12)\]

Therefore, the problem of finding the optimal coefficient vector comes down to the following eigenvalue problem.

\[\mathbf{C}_{xx}\mathbf{u}=\lambda\mathbf{u}\quad(13)\]

Since the variance is \(var[\mathbf{s}]=\mathbf{u}^T\mathbf{C}_{xx}\mathbf{u}=\lambda\), choosing the eigenvector corresponding to the largest eigenvalue maximizes the variance. This eigenvector is called the first principal component vector. The second and third principal component vectors can be obtained sequentially by maximizing the variance under the condition that they are orthogonal to the principal component vectors already obtained.

Let the M (\(M\le D_x\)) largest eigenvalues of \(\mathbf{C}_{xx}\) be \(\lambda_1\ge\lambda_2\ge\dots\ge\lambda_M\). Using the corresponding eigenvectors \(\mathbf{u}_1,\dots,\mathbf{u}_M\) \((\mathbf{u}_i^T\mathbf{u}_j=\delta_{ij})\), we obtain an M-dimensional principal component subspace.

Since \(\lambda_i\) can be regarded as the variance along \(\mathbf{u}_i\), the ratio of the i-th eigenvalue to the sum of all eigenvalues, \(\lambda_i/\sum_{j=1}^{D_x} \lambda_j\), is called the contribution rate, and its cumulative sum over the leading eigenvalues, the cumulative contribution rate, is used as a criterion to determine the dimensionality of the linear subspace.
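To make this concrete, here is a minimal sketch of principal component analysis via the eigenvalue problem of equation (13), assuming a NumPy data matrix X with one sample per row; the function and variable names are illustrative, not from the article.

```python
import numpy as np

def pca(X, M):
    """Top-M principal component vectors, their eigenvalues, and the projections."""
    x_mean = X.mean(axis=0)
    X_tilde = X - x_mean                             # mean deviation matrix
    C_xx = X_tilde.T @ X_tilde / X.shape[0]          # covariance matrix, eq. (2)/(6)
    eigvals, eigvecs = np.linalg.eigh(C_xx)          # eigenvalue problem, eq. (13)
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()  # cumulative contribution rate
    S = X_tilde @ eigvecs[:, :M]                     # projection onto the subspace, cf. eq. (8)
    return eigvecs[:, :M], eigvals[:M], cumulative[:M], S
```

Plotting the cumulative contribution rate against M is a common way to decide how many principal components to keep.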

Whitening

Next, we will discuss whitening as an application of principal component analysis. Whitening is a method for decorrelating different variables and aligning their scales, and it is frequently used as a data preprocessing step. Specifically, it is expressed by the following formula.

\[\mathbf{s}_n=\mathbf{\Delta}^{-1/2}\mathbf{U}^T(\mathbf{x}_n-\tilde{\mathbf{x}})\quad(15)\]

where \(\mathbf{U}=(\mathbf{u}_1,\dots,\mathbf{u}_{D_x})\) is the matrix whose columns are the eigenvectors of \(\mathbf{C}_{xx}\) and \(\mathbf{\Delta}\) is the diagonal matrix of the corresponding eigenvalues. Summarizing all the data, we obtain the following.

\[\mathbf{S}=\tilde{\mathbf{X}}\mathbf{U}\mathbf{\Delta}^{-1/2}\quad(16)\]

The mean of the transformed data set is clearly zero, and its covariance matrix is the identity matrix.
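A corresponding whitening sketch, following equations (15) and (16) and using the same conventions as the PCA example above (one sample per row, eigenvectors in U, eigenvalues in Δ), might look as follows; the small eps term is my own guard against zero eigenvalues.

```python
import numpy as np

def whiten(X, eps=1e-10):
    x_mean = X.mean(axis=0)
    X_tilde = X - x_mean                             # mean deviation matrix
    C_xx = X_tilde.T @ X_tilde / X.shape[0]          # covariance matrix
    eigvals, U = np.linalg.eigh(C_xx)                # eigenvectors U, eigenvalues Delta
    return X_tilde @ U / np.sqrt(eigvals + eps)      # eq. (16): X~ U Delta^{-1/2}
```

The covariance matrix of the returned data is, up to the eps regularization, the identity matrix, as stated above.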

Fisher Linear Discriminant Analysis

Next, we discuss Fisher linear discriminant analysis (Fisher LDA). This is a dimensionality reduction method that seeks a projection emphasizing the separation between classes when teacher information is available. Principal component analysis reduces dimensionality while preserving as much information as possible about the distribution of the data in the original space, but the principal component subspace is not necessarily an effective space for classification.

Emphasizing the separation between classes can be thought of as gathering data belonging to the same class closer together and placing data belonging to different classes farther apart, which can be formulated as maximizing the inter-class variance relative to the intra-class variance.

Like principal component analysis, Fisher’s linear discriminant analysis also reduces to a generalized eigenvalue problem: once the intra-class and inter-class variances are formulated, applying the Lagrangian method to maximize their ratio yields that eigenvalue problem.
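The sketch below, which is my own summary rather than the article's derivation, formulates Fisher LDA as the generalized eigenvalue problem \(\mathbf{S}_b\mathbf{w}=\lambda\mathbf{S}_w\mathbf{w}\) with within-class scatter \(\mathbf{S}_w\) and between-class scatter \(\mathbf{S}_b\); it assumes SciPy is available for the generalized symmetric eigensolver.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_lda(X, y, M):
    """Top-M discriminant axes from data X (one sample per row) with class labels y."""
    Dx = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((Dx, Dx))                         # within-class scatter
    S_b = np.zeros((Dx, Dx))                         # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        d = (mc - overall_mean)[:, None]
        S_b += Xc.shape[0] * (d @ d.T)
    eigvals, W = eigh(S_b, S_w)                      # generalized eigenvalue problem S_b w = lambda S_w w
    order = np.argsort(eigvals)[::-1]                # largest between/within ratio first
    return W[:, order[:M]]
```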

Canonical Correlation Analysis

Canonical correlation analysis (CCA) is a dimensionality reduction method that extracts the information commonly contained in multiple sources of information. It finds a linear subspace such that the correlation between the two projected variables is high. This problem also eventually reduces to a generalized eigenvalue problem.
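One common way to realize this, sketched below under my own naming and with a small eps added for numerical stability, is to whiten each view with its own covariance and then take the singular value decomposition of the whitened cross-covariance; the singular values are the canonical correlations.

```python
import numpy as np

def cca(X, Y, M, eps=1e-10):
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    N = X.shape[0]

    def inv_sqrt(C):                                 # C^{-1/2} for a symmetric matrix
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T

    Wx = inv_sqrt(Xc.T @ Xc / N)                     # whitening transform for the first view
    Wy = inv_sqrt(Yc.T @ Yc / N)                     # whitening transform for the second view
    U, corr, Vt = np.linalg.svd(Wx @ (Xc.T @ Yc / N) @ Wy)
    return Wx @ U[:, :M], Wy @ Vt[:M].T, corr[:M]    # projections and canonical correlations
```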

Partial Least Squares

Partial least squares (PLS), like canonical correlation analysis, is a method for extracting information that is commonly contained in multiple sources of information. The difference is that CCA is formulated as maximizing the correlation between two variables, while PLS is formulated as maximizing the covariance between two variables. This problem also eventually reduces to a generalized eigenvalue problem.
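For contrast with the CCA sketch above, the leading PLS directions can be read off directly from a singular value decomposition of the unwhitened cross-covariance matrix, as in the following simplified sketch (full PLS algorithms such as NIPALS involve deflation steps that are omitted here).

```python
import numpy as np

def pls_directions(X, Y, M):
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    C_xy = Xc.T @ Yc / X.shape[0]                    # cross-covariance matrix
    U, cov, Vt = np.linalg.svd(C_xy)
    return U[:, :M], Vt[:M].T, cov[:M]               # direction pairs maximizing covariance
```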

In the next article, we will discuss coding and pooling.
