Overview of Kernel Methods and Support Vector Machines


Overview of kernel methods and their applications (SVM, KPCA, GP)

The kernel method is a technique used to handle nonlinear relationships in machine learning. Normally, linear models such as linear regression and logistic regression cannot capture complex relationships among data, but kernel methods make it possible to express nonlinear relationships.

To realize this, the kernel method uses a function called a kernel function to measure the similarity between data points: the similarity of two inputs is evaluated as an inner product between their feature representations. By computing these inner products, the kernel method can capture relationships between features without constructing the feature representations explicitly.
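As a small illustration (a minimal sketch assuming NumPy; the vectors and the degree-2 polynomial kernel are chosen arbitrarily), the following shows that a kernel value coincides with an inner product in an implicitly defined feature space:

```python
# For the degree-2 polynomial kernel K(x, z) = (x^T z)^2, the kernel value equals
# an ordinary inner product after an explicit feature map, so similarity can be
# computed without ever building that map.
import numpy as np

def explicit_feature_map(x):
    # Explicit feature map for K(x, z) = (x^T z)^2 in 2 dimensions:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly2_kernel(x, z):
    # Kernel trick: the same similarity from the inner product of the inputs alone.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(np.dot(explicit_feature_map(x), explicit_feature_map(z)))  # 16.0
print(poly2_kernel(x, z))                                        # 16.0
```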

Advantages of the kernel method include its ability to handle nonlinear data, its ability to compute in high-dimensional feature spaces, and the possibility of improving model accuracy through the choice of kernel function. On the other hand, it can overfit depending on the choice of kernel function, and because it is computationally expensive, it can be time-consuming for large data sets.

Kernel methods are mainly used in algorithms such as support vector machines (SVM), kernel principal component analysis (KPCA), and Gaussian processes (GP). SVM uses a kernel function to map the input data into a higher-dimensional space and separates the data with a hyperplane there; KPCA uses a kernel function to compute the principal components of the input data in that higher-dimensional space; and GP uses kernel functions to probabilistically model the relationship between observed and unknown data, representing how the output values are distributed for a given input value.

Support Vector Machine (SVM) is a supervised learning method that can be used for classification and regression. With kernel functions, SVM can also classify data that is not linearly separable, making it applicable to nonlinear classification.

In principle, SVM performs classification by searching for a hyperplane that separates the two classes. In the linear case, for example, the most appropriate boundary for classifying the data points is sought, and the boundary is defined based on the data points closest to it (the support vectors).

In the case of a nonlinear classifier, the kernel method is used to search for nonlinear boundaries. This takes the form of a kernel method that maps the original feature space to a higher dimensional space and uses a higher dimensional linear hyperplane to classify data that is not linearly separable.

Advantages of SVM include its applicability to nonlinear as well as linear classification, its high generalization performance, and its relative resistance to overfitting. On the other hand, issues include overfitting when the number of data is very large or when the kernel function is poorly chosen, as well as high computational cost and long processing time depending on the kernel settings.
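A hedged example (assuming scikit-learn is installed; the dataset and parameter values are illustrative only) comparing a linear SVM and an RBF-kernel SVM on data that is not linearly separable:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
```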

Kernel Principal Component Analysis (Kernel PCA, KPCA) is a method for applying PCA to nonlinear data. By mapping the data through a nonlinear function into a higher-dimensional feature space and performing PCA in that space, KPCA can extract nonlinear structure that ordinary PCA cannot capture.

Specifically, KPCA uses a kernel function to map the original data to a higher-dimensional space, and then performs principal component analysis in that space to extract the structure of data with nonlinear features. The kernel function is essentially a function that computes an inner product of the original data, and typical choices include polynomial kernels, Gaussian kernels, and sigmoid kernels.

Advantages of kernel PCA include the ability to extract nonlinear data structures and, by moving to higher-dimensional feature spaces than linear PCA, to handle more complex data structures. Challenges include possible overfitting depending on the choice of kernel function, as well as high computational cost and long processing time.
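A minimal sketch (assuming scikit-learn; the concentric-circles data and the gamma value are illustrative) of kernel PCA with an RBF kernel applied to a structure that linear PCA cannot unfold:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_pca = PCA(n_components=2).fit_transform(X)
kernel_pca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

# After the kernel mapping, the two rings become approximately separable
# along the first principal component.
print(kernel_pca[:5])
```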

Gaussian Process (GP) is a regression and classification method that uses probability distributions and is realized using the kernel method. Gaussian processes are used to represent continuous functions, and based on a pre-defined kernel function, the relationship between observed and unknown data is modeled probabilistically to represent how output values are distributed with respect to a certain input value.

In the kernel method, the similarity of input values is measured by computing the inner product of vectors that represent the characteristics of the inputs. A Gaussian process uses this similarity, defined by a kernel function, to express the relationship between input and output values; the model is trained on observed data and used to make predictions for unknown inputs.

The advantages of Gaussian processes include the ability to express the relationship between input and output values without a complex model, the ability to use probability distributions for prediction and thus express uncertainty about the predicted values, and the fact that overfitting is less likely to occur, so high prediction performance can be achieved even with small data sets.

On the other hand, the disadvantages of the Gaussian process are that it is computationally expensive and difficult to apply to large data sets, the prediction performance may deteriorate depending on the choice of kernel function, and the design of kernel functions is difficult for high-dimensional data.
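A hedged sketch of GP regression (assuming scikit-learn; the toy function and kernel settings are illustrative only), showing how the predictive standard deviation expresses uncertainty:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X_train = np.linspace(0, 10, 15).reshape(-1, 1)
y_train = np.sin(X_train).ravel()

# RBF kernel for smooth functions plus a small noise term.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X_train, y_train)

X_test = np.linspace(0, 10, 100).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
print(mean[:3], std[:3])  # posterior mean and uncertainty at test points
```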

For more information on Gaussian processes, see “Nonparametric Bayesian and Gaussian Processes”.

About Support Vector Machines

From the Machine Learning Professional Series, “Support Vector Machines”.

Kernel methods have attracted a great deal of attention with the invention of support vector machines. However, the essence of the kernel method is an extension of the old methods, so the term “learning from the past” is more appropriate to describe the kernel method. In other words, while the kernel method can be considered to be basically on the same footing as traditional linear multivariate analysis, it has the flexibility to be applied to problems that were not possible in the past.

When given complex nonlinear data, the simple idea is to create a nonlinear model and develop a method to fit the data to it, but in most cases, we are faced with complex optimization problems, making it difficult to analyze large-scale data.

The kernel method is a reversal of this idea: it solves nonlinear problems with linear models. The data is first mapped to a higher-dimensional space and then processed there, but high-dimensional spaces suffer from the so-called curse of dimensionality, so this would seem to be a forbidden operation by conventional wisdom. To avoid the curse of dimensionality, the kernel method introduces a framework called regularization, which smooths out overly complex models in the high-dimensional space.

The introduction of regularization, on the other hand, has the side effect of increasing computational complexity, but here is another reversal of thinking. The kernel method does not model the structure of a given problem precisely and then derive a data analysis method, but rather adapts the problem to the method so that it is easy to compute.

In this blog, we mainly discuss the theory and applications of support vector machines in the following sections.

Implementation

Support Vector Machine (SVM) is a supervised learning algorithm widely used in pattern recognition and machine learning. Its goal is to find the best separating hyperplane between the classes in the feature vector space, chosen so that the margin to the data points is maximized. The margin is the distance between the separating hyperplane and the nearest data points (the support vectors), and in SVM the optimal separating hyperplane is found by solving this margin maximization problem.
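A minimal sketch (assuming scikit-learn; the blob data and C value are illustrative) of inspecting the support vectors and the hyperplane that result from margin maximization:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)
# For a linear kernel the separating hyperplane is w^T x + b = 0:
print("w =", clf.coef_, " b =", clf.intercept_)
```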

This section describes various practical examples of this support vector machine and their implementation in python.

  • Overview of Rank SVM, Algorithm and Implementation Example

Rank SVM (Ranking Support Vector Machine) is a type of machine learning algorithm applied to ranking tasks, especially for ranking problems in information retrieval and recommendation systems. Related papers include “Optimizing Search Engines using Clickthrough Data” and “Ranking Support Vector Machine with Kernel Approximation”.

Theory and Applications of Support Vector Machines

One of the major tools in the history of machine learning is the support vector machine (SVM), which was originally developed for two-class classification problems, but has been extended to multiclass classification, regression, and unsupervised learning problems as well. The concept of “margin maximization” at the root of the algorithm has been applied to various machine learning domains. Another feature of the algorithm, the “kernel function” approach, has been applied and utilized in nonlinear modeling and in domains where structural data appear.

All information is converted into a vector called a “feature vector” and then classified. Therefore, the accuracy of classification depends largely on the “feature extraction” applied to the information. One of the methods for classifying this vectorized data is the “support vector machine (SVM)”.

In this issue, we describe classification problems with three or more classes using SVM. In the two-class classification problem, each case belonged to either the positive class or the negative class; here, we describe the multi-class classification problem in which each example belongs to one of c classes, numbered 1 to c.

There are two major ways to achieve multi-class classification with SVM. The first is to combine multiple two-class classifiers, and the second is to extend the SVM formulation itself. When combining multiple two-class classifiers, multi-class problems can be handled without modifying the SVM itself. The following three methods of this type are described: (1) the one-versus-rest method, (2) the one-versus-one method, and (3) error-correcting output codes. These are general-purpose methods that can also be used with classification methods other than SVM.
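A hedged example (assuming scikit-learn; the iris data and parameters are illustrative) of building a multi-class classifier by combining binary SVMs with the one-versus-rest and one-versus-one schemes:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)  # c binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)   # c(c-1)/2 binary classifiers

print("one-versus-rest accuracy:", ovr.score(X, y))
print("one-versus-one accuracy: ", ovo.score(X, y))
```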

In this issue, we will discuss regression analysis using support vector machines. In particular, I will focus on how support vector regression differs from traditional regression analysis such as the least squares method, and I will also discuss quantile regression analysis, which can be understood in a similar way to support vector regression.

Regression analysis is a problem whose output is a real number. For example, the problem of predicting the value of a financial instrument or the blood pressure of a person taking a new drug can be formulated as a regression problem because the price and blood pressure are represented by real numbers.

Although SV regression allows predictions based on a nonlinear model, we will first illustrate SV regression based on a linear regression function as follows. A dual problem approach with a Lagrangian function is used for optimization.

In the previous article, we discussed a model based on the linear regression model \(f(x)=\omega^Tx+b\). This time, we will discuss regression based on a nonlinear model by using a kernel function, as in the classification problem.

The kernel function can be used because of the dual problem and dual representation mentioned above. In both the dual problem and the dual representation, the input x never appears alone but only in the form of inner products; to make SV regression nonlinear, we generalize the inner product using a kernel function \(K:\mathbb{R}^d\times\mathbb{R}^d\rightarrow\mathbb{R}\), as in SV classification. Replacing the inner product with the kernel function, the dual problem becomes as follows.
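A minimal sketch (assuming scikit-learn; the toy data and the C/gamma/epsilon values are illustrative) of support vector regression with the inner product replaced by an RBF kernel, so a nonlinear function can be fit:

```python
import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.RandomState(0).randn(100)

svr_linear = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
svr_rbf = SVR(kernel="rbf", C=1.0, gamma=1.0, epsilon=0.1).fit(X, y)

print("linear SVR R^2:", svr_linear.score(X, y))
print("RBF SVR R^2:   ", svr_rbf.score(X, y))
```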

While the classification and regression problems were supervised learning problems in which input-output pairs were given as training data, we now describe an unsupervised learning problem in which only the inputs are given as training examples.

In unsupervised learning, only the inputs \(\{x_i\}_{i\in[n]}\) are given as training data. What can be learned, and what can be done, with only the inputs? In this section, we first discuss three typical unsupervised learning tasks and clarify the purpose of unsupervised learning, and then describe the one-class SVM (single-class SVM). Typical unsupervised learning tasks are clustering, dimensionality reduction, and anomaly detection.
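A hedged example (assuming scikit-learn; the nu and gamma values and the toy data are illustrative) of a one-class SVM for anomaly detection, trained only on inputs without labels:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)            # "normal" data around the origin
X_test = np.array([[0.1, 0.0], [3.0, 3.0]])  # one normal point, one outlier

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)
print(oc_svm.predict(X_test))  # +1 = inlier, -1 = outlier
```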

SVM and related derived methods relied only on the inner product of the feature vectors as input. Due to this property, complex models can be realized without explicitly computing \(\Phi(x)\) by replacing the inner product of the feature vector \(\Phi(x)\) mapped to the feature space F with a kernel function \(K(x_i,x_j)=\Phi(x_i)^T\Phi(x_j)\). However, not just any function can be used as a kernel function.

In this article, we will discuss what kind of functions can be used as kernel functions and what kind of operations can be performed on kernel functions. Kernel functions are also closely related to regularization in learning.

The kernel functions discussed here include general kernel functions (the linear kernel, polynomial kernel, and RBF kernel) and kernel functions for probabilistic data, string data, and graph data (the p-spectrum kernel, all-subsequences kernel, gap-weighted kernel, Fisher kernel, graph Laplacian, commute-time kernel, diffusion kernel, regularized Laplacian, and random walk kernel).

In this article, we will discuss conditions that characterize the optimality of a solution in SVM. These conditions can be used to check whether the computed solution is optimal or not, especially around the optimal solution. We also describe how to solve the SV classification optimization problem using the active set method and the interior point method, which are common methods in the field of numerical optimization.

The SV classification and SV regression methods described so far have all been formulated as numerical optimization problems. In the case of SVM, it is usually not possible to obtain the solution analytically; instead, an iterative process of starting from appropriate initial values and searching for a solution must be performed.

In this section, we will discuss how the optimality of the solution is determined in the SVM optimization problem and how to obtain the solution by standard optimal methods. We will start with the most basic case of 2-class SV classification, but regression and other cases can be treated in almost the same way.

To train an SVM on large data sets, a large optimization problem must be solved. In such cases, it is sometimes better to use SVM-specific rather than general-purpose optimization methods. In this section, we describe an approach called the partitioning method among optimization methods for SVMs. This approach does not perform optimization over the entire training set, but instead selects a subset of the training set and iteratively solves a small-scale optimization problem. The variant of the partitioning method in which the size of the subset is set to the minimum of 2 is called the SMO algorithm, which is the best-known learning algorithm for SVMs that use kernel functions (kernel SVMs).

This section describes a partitioning algorithm specialized for linear SVM. This algorithm is called the DCDM algorithm (dual coordinate descent method), and is used in LIBLINEAR, a linear SVM library created by the same research group that developed LIBSVM.

One of the advantages of SVM was to capture complex features using kernel functions. On the other hand, in situations where useful features are already known, there are many cases where a linear SVM may provide the same level of performance as a kernel SVM. In addition, recent large-scale data analysis often deals with high-dimensional and sparse features.

For example, in natural language processing problems, whether or not a word appears in a document may be used as a feature. In this case, the dimensionality of the feature space equals the number of distinct words, which makes the data very high-dimensional. On the other hand, since the number of distinct words in any one document is limited, many features will be sparse, i.e., zero.

The linear SVM algorithm can make good use of feature sparsity. Therefore, the use of linear SVM is effective when dealing with large-scale high-dimensional sparse data.
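A minimal sketch (assuming scikit-learn; the tiny corpus and labels are illustrative only) of a linear SVM operating directly on high-dimensional, sparse bag-of-words features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(docs)   # scipy sparse matrix
print("feature matrix shape:", X.shape, " stored nonzeros:", X.nnz)

clf = LinearSVC(C=1.0).fit(X, labels)        # works directly on sparse input
print(clf.predict(X))
```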

In real-world applications of SVM, hyperparameters such as regularization parameters need to be determined appropriately. This problem is called model selection and is one of the important topics in machine learning. To address this issue, we first describe the cross-validation method, which is one of the model selection methods.

In k-fold cross-validation, training and evaluation are performed k times. To illustrate, in the first round the training data are the \(\frac{k-1}{k}n_{all}\) examples collected from D2 to Dk, and the set of \(\frac{1}{k}n_{all}\) examples in D1 is the evaluation data. The classifier is trained using the training data, and the resulting classifier is evaluated using the evaluation data. Similarly, in the second round the training data are the \(\frac{k-1}{k}n_{all}\) examples in D1 and D3 to Dk, and the \(\frac{1}{k}n_{all}\) examples in D2 are used as evaluation data. The same steps are repeated up to the kth round.
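A hedged sketch (assuming scikit-learn; the grid values and k=5 are illustrative) of using k-fold cross-validation for model selection of the regularization parameter C and the RBF parameter gamma:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold CV
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```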

Here we describe a regularization path tracking algorithm for SV classification. This algorithm works on the dual problem of SV classification. First, we review the dual representation of the decision function and the dual problem, and then derive the fact that the optimal dual variables are piecewise linear in the regularization parameter C. We then use this to describe the regularization path tracking algorithm for SV classifiers.

When the data set used for training changes, the optimal solution has to be recalculated accordingly. In this situation, a method called sequential learning efficiently computes the new solution by making use of the previously obtained solution.

After training a classifier on given training data, it may become necessary to add or delete training cases. In this case, the optimization problem could be solved again from scratch for the changed training data, but incremental/decremental learning computes the new optimal solution more efficiently by making use of the solution already obtained. For example, one may want to update the model whenever new measurement data arrive, or delete old time-series data as new data are added.

We will discuss the concept of sequential learning, which is different from online machine learning. In online learning, we consider the situation where data is given sequentially and basically do not consider deletion. In addition, in many online learning methods, when a new case is given and the classifier is updated, the updating method is designed to obtain a classifier with as high accuracy as possible using only the current classifier and the new case, without retaining the previous data. This method requires very little computation cost and memory, but it is difficult to guarantee optimality for the entire data set.

SVM has become a standard tool for data analysis and is applied in a variety of fields. SVM is implemented in many statistical analysis software and can be easily used for small- and medium-scale data. Here we describe the kernlab package of the R statistical analysis environment.

On the other hand, knowledge about the implementation of the learning algorithm is necessary when using SVM on large-scale data or when modifying parts of SVM for a particular purpose. With this goal in mind, here we describe in detail the implementation of the SVM software LIBSVM, which is created and maintained by Professor C.J. Lin’s group at National Taiwan University. LIBSVM is implemented in C++ and its code is publicly available, so it is relatively easy to modify it for your own purposes and integrate it with other systems.

Prediction for data with structure can be done with SVM by using an extension called structural support vector machines. In this method, the constraints, whose number grows as complex structures are introduced, can be handled efficiently by an optimization technique called the cutting plane method. Structured data can be, for example, data represented as trees or sequences. In syntactic tree prediction in natural language processing, grammatical relations between words are expressed as a tree structure; or, in the problem of finding similar protein sequences, proteins may be represented as sequences of amino acids.

In this article, we will discuss a problem setting called weak label learning, which is positioned between supervised and unsupervised learning. In weak label learning, problems such as classification and regression are considered in situations where label information is only partially available. First, we describe a problem called semi-supervised learning, in which label information is given only for a part of the training cases.

In this section, we describe SVM for semi-supervised learning. In semi-supervised classification problems, both labeled and unlabeled training cases are given. Finally, we describe an example of semi-supervised SVM applied to two-dimensional artificial data. Two examples each with positive and negative labels are given as labeled examples, and 100 unlabeled examples are given as unlabeled examples.

Here we describe an SVM approach to a weak label learning problem called multi-instance learning.

Multi-instance learning is a type of 2-class classification problem. The difference from the usual 2-class classification problem is that labels are given to a set of training cases called a bag, instead of to each individual training case. Each bag consists of multiple instances, and each instance belongs to either the positive class or the negative class, and is called a positive case or a negative case, respectively. If a bag contains at least one positive instance, it is called a positive bag. Conversely, if a bag contains all negative cases, it is called a negative bag. In multi-instance learning, only the label for a bag is given. In a negative bag, we know that all the cases in the bag are negative, but in a positive bag, we do not know which cases are positive and which are negative. For this reason, we must find the classification boundary while estimating the labels of the cases contained in the positive bag.

Kernel functions and the reproducing kernel Hilbert spaces used in kernel methods are described. The representer theorem, which is important from the viewpoint of applications, is also discussed. In addition, the basic properties of universal kernels, which have high expressive power, are described.

The estimator \(\hat{f}(x)\) obtained from the kernel function is included in the linear model M and is given by a linear sum of the functions \(\Phi(x)^T\Phi(x')\). When a general kernel function is used instead of \(\Phi(x)^T\Phi(x')\), the estimator is given by a linear sum of kernel functions. Therefore, we define the linear space H0 generated by linear sums of kernel functions as follows.

When the estimator is constructed using the linear model M, the learned function \(\hat{f}(x)\) is represented as a linear sum of the functions \(k(x_i,\cdot), i=1,\dots,n\) corresponding to the data points \(x_1,\dots,x_n\). This property is summarized as the representer theorem in general reproducing kernel Hilbert spaces.
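A minimal sketch (assuming NumPy; kernel ridge regression is used here as an illustrative example, and the data, kernel width, and regularization value are not from the original text) of the representer theorem in action: the learned function is a linear combination of kernel functions placed at the training points, with coefficients solving \((K+\lambda I)\alpha=y\).

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * (a - b) ** 2)

x_train = np.linspace(0, 5, 20)
y_train = np.sin(x_train)
lam = 0.1

K = rbf_kernel(x_train[:, None], x_train[None, :])   # Gram matrix K_ij = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def f_hat(x):
    # hat{f}(x) = sum_i alpha_i k(x_i, x) -- a linear combination of k(x_i, .)
    return rbf_kernel(x_train, x).dot(alpha)

print(f_hat(2.5), np.sin(2.5))
```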

Next, we describe a learning algorithm that uses the reproducing kernel Hilbert space H as a statistical model. The uniform law of large numbers is used to evaluate the prediction accuracy, and for this it is necessary to obtain the Rademacher complexity. Here, we evaluate the Rademacher complexity for a bounded subset of the reproducing kernel Hilbert space H.

When the reproducing kernel Hilbert space is infinite dimensional, it can be expected to approximate a wide class of functions. In this section, we describe the reproducing kernel Hilbert space corresponding to a kernel function called a universal kernel, as a statistical model with sufficiently small approximation error for continuous functions.

As a representative example of a kernel method, we discuss the C-support vector machine, for which a proof of statistical consistency is given. Support vector machine (SVM) is a generic term for a family of typical learning algorithms in machine learning. In this section, we describe the C-support vector machine and the ν-support vector machine, which are learning algorithms for binary classification. Here, C and ν refer to regularization parameters.

    The ν-support vector machine is an algorithm in which the regularization parameter C of the C-support vector machine is replaced by a parameter ν that has a clearer meaning.

Let the loss of the discriminant function \(f(x)+b\) for data \((x,y)\) be measured by the hinge loss \(\Phi_{hinge}\). Even if \(\mathrm{sign}(f(x_i)+b)=y_i\), a nonzero loss is incurred when the margin \(m_i=y_i(f(x_i)+b)\) is less than 1. The threshold value 1 is fixed in advance as a constant. Alternatively, a loss function can be used in which the threshold varies with the data, so that a loss is incurred when the margin is less than some positive number ρ. This can be achieved by setting the loss to \(\max\{\rho-m_i,0\}\). However, if ρ is simply made variable, one need only choose an arbitrarily small ρ for any data, and the loss can no longer be measured properly. Therefore, we add a penalty term \(-\nu\rho\) that discourages a small threshold ρ and define the loss function as follows.
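A hedged sketch (assuming NumPy; the margin values, ρ, and ν are illustrative only) contrasting the ordinary hinge loss with fixed threshold 1 and the ν-type loss \(\max\{\rho-m_i,0\}\) with the penalty \(-\nu\rho\) that keeps the threshold from simply shrinking to zero:

```python
import numpy as np

def hinge_loss(margins):
    # loss is nonzero whenever the margin m_i = y_i (f(x_i) + b) is below 1
    return np.maximum(1.0 - margins, 0.0).mean()

def nu_type_objective(margins, rho, nu):
    # average of max(rho - m_i, 0), plus the penalty -nu*rho on small thresholds
    return np.maximum(rho - margins, 0.0).mean() - nu * rho

m = np.array([1.5, 0.8, -0.2, 2.0])   # example margins y_i (f(x_i) + b)
print(hinge_loss(m))                   # fixed threshold 1
print(nu_type_objective(m, rho=0.5, nu=0.3))
```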

In this article, we describe norms that select zero/non-zero coefficients in each group using a predefined group structure, and show that they are effective for multi-task learning, multiple kernel learning, and vector field estimation.

    Ontology matching by clustering using machine learning techniques (SVM)

    On local learning using logistic regression, softmax regression and kernel methods for image information classification

Coding is an operation that converts local features of image information into vectors of a dimensionality effective for recognition, and pooling is an operation that combines multiple coded vectors obtained from an image region into a single vector representing that region. For coding, we discuss methods using non-parametric approaches such as histogram density estimation and kernel density estimation, and methods using parametric approaches such as parameter estimation of a Gaussian mixture distribution.

liblinear is an open-source SVM (support vector machine) library developed at National Taiwan University. SVM is a pattern recognition model based on supervised learning, and the algorithm can be applied to both classification and regression. liblinear can be used for a variety of tasks and is capable of fast linear classification of data with millions of instances and features.

    In this article, I used clj-liblinear, a wrapper for Clojure, in combination with kuromoji, a natural language processing tool, to perform the task of sentence classification.

In Hotelling’s T² method, a classical technique, the anomaly detection model was built by assuming that all data follow a single normal distribution. On the other hand, approaches using mixture distribution models and Bayesian estimation gave up on using a single distribution and focused on the local scatter of data around the point of interest to create the anomaly detection model. Here, I describe an approach that returns to the spirit of Hotelling’s T² method, but instead uses the so-called “kernel trick” to indirectly represent the variations in density of the distribution.

    The theory and algorithms of stochastic gradient descent applied to perceptron, SVM, and logistic regression for online learning are described.

Ordinary principal component analysis can only compress linearly correlated data (compression is possible even without correlation, but it works better when correlation is present).
Kernel principal component analysis, which projects the data into a higher dimension and then performs ordinary principal component analysis, makes it possible to handle nonlinearly correlated data.
