Machine Learning Professional Series – Support Vector Machines Reading Notes

Summary

The kernel method is a technique used in machine learning to handle nonlinear relationships. It measures the similarity between data using a function called a kernel function, and evaluates the similarity between two sets of data by calculating the inner product between features of the input data. Kernel methods are mainly used in algorithms such as support vector machines (SVM), kernel principal component analysis (KPCA), or Gaussian processes (GP).

Support Vector Machines (SVMs) are machine learning techniques used for supervised classification and regression; combined with kernel functions, they can also classify data that is not linearly separable.

Advantages of SVM include its applicability to nonlinear as well as linear classification, its high generalization performance, and its resistance to overfitting. Challenges include the risk of overfitting when the number of data points is very large or depending on the choice of kernel function, as well as high computational cost and long processing times depending on the kernel settings.
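As a concrete illustration of the points above, the following is a minimal sketch of nonlinear SV classification with an RBF kernel, assuming scikit-learn is available; the dataset and the values of C and gamma are illustrative, not taken from the book.

```python
# Minimal sketch: nonlinear SV classification with an RBF kernel (scikit-learn assumed).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # linearly inseparable data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # C: regularization parameter, gamma: RBF hyperparameter
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```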

Here we describe the theory and applications of support vector machines based on the Machine Learning Professional Series book “Support Vector Machines”.

Machine Learning Professional Series – Support Vector Machines Reading Notes

Reading notes are given below.

Chapter 1: Two-Class Classification

1.1 Introduction
Two-class classification (binary classification problem)
The problem of identifying which of two categories a given case belongs to.
A category is called a “class”.
Example: Identifying whether an image is a person or not
Preprocessing
How to define the window
What format should the image be represented in?
Example:(xi,yi)
xi: image features
Some numeric vector that represents the image
feature vector
input vector
Example: Create a sequence of numbers by simply lining up each pixel value
Generate features from local changes in pixel values in the image
Feature extraction
Depends on the specific task
Yi:label
Label to represent a class
Classifier
The part of the process that estimates the class to which the received image belongs.
A function that returns the predicted value of label y when some feature vector x is given.
Post-processing to present the image to the user.
Post-processing and pre-processing are very different depending on the application of the system
SVM (Support vector machine)
Originally a two-class classification
Extension to regression problems and unsupervised learning
SV classification (support vector classification)
Classification problems
1.2 Linear SV Classification
Introduction
Suppose that the training set {(xi,yi)}i∈[n] consists of n cases, where xi ∈ ℝd is a d-dimensional real vector and yi ∈ {-1,1} is a label with a value of 1 or -1.
Using the real function f:ℝd → ℝ, called the decision function, define the classifier g(x) as above
Consider the above function as f(x)
The d-dimensional vector w ∈ ℝd and the scalar b ∈ ℝ are unknown variables that are not known in advance.
The scalar b is sometimes called the bias
In the space of x, the x for which f(x)=0 is the boundary that separates the classes
Classification boundary
We need to figure out how to estimate these.
On what basis does SV classification estimate w and b?
1.2.1 Hard Margin
If there is a pair of w and b that can correctly classify all points in the training set
The training set is separable by f(x)
If classification is successful for some yi
The above equation is equivalent to the fact that the signs of yi and f(xi) match
In a separable training set, there exists f(x) such that yif(xi)>0 for all i∈[n].
In general, there are multiple classification boundaries that can separate the training set
In SVM, the classification boundary is defined in such a way that the data of each class is as far away from the classification boundary as possible.
The distance between two classes across the classification boundary is called “margin”.
In the figure, boundary (b) has the larger margin
Margin maximization
Find a classification boundary with as large a margin as possible
Margin maximization means maximizing the distance from the classification boundary to the closest xi.
The distance from a given xi to the classification boundary is given by the above formula for the distance between a point and a plane.
Margin maximization can be expressed by the above optimization problem
s.t. (abbreviation of subject to)
Constraints (constraint)
Maximize within the range where the constraint is satisfied
In order to maximize the value of the objective function M/∥w∥, we want M to be as large as possible
M must be less than or equal to all yi(wTxi + b) by constraint yi(wTxi + b) ≥ M
M will be the same as the smallest value of yi(wTxi + b) for all cases
If we replace w/M and b/M, which are w and b divided by M, with w̃ and b̃, respectively
Considering the following two equations, the optimization problem further becomes as above
The maximization of 1/∥w∥ is equivalent to the minimization of its inverse, ∥w∥.
Minimizing ∥w∥ is equivalent to minimizing the squared norm ∥w∥²
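Since the equations referenced above are not reproduced in these notes, the standard form of the hard-margin problem they describe can be reconstructed as follows (margin maximization, then rescaling by M):

```latex
% Margin maximization (hard margin), reconstructed in standard form
\max_{w,\,b,\,M>0}\ \frac{M}{\|w\|}
\quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge M,\quad i \in [n]
% Dividing w and b by M gives the equivalent problem
\min_{w,\,b}\ \frac{1}{2}\|w\|^2
\quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\quad i \in [n]
```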
1.2.2 Soft Margin
When SV classification is applied to non-separable data
Relax the constraints that SV classification has
yi(wTxi + b) ≥ 1 – ξi, i ∈ [n].
Introduce a new non-negative variable 𝛏i
Allow xi to be in a different class beyond the margin
Redefine the optimization problem for support vector machines
Coefficient C is the regularization parameter
(left side) When C is large, the penalty on ξ is strong, so data points rarely cross the margin into the other class
(right side) When C is small and the margin is wide, data points are more readily allowed to cross the margin
The reason why we divide ∥w∥2 by 2 is to simplify the calculation later on
In the second term, we suppress ξi, the degree of violation of the original constraint yi(wTxi+b) ≥ 1, so that it is as small as possible
The role of the regularization factor C as a parameter to adjust the degree of suppression
The larger C, the closer to the hard margin
The specific value of C to use is chosen by cross-validation or other methods.
In soft margin, each training set is divided into three types according to the value of yi(wTxi+b).
xi such that yi(wTxi+b)>1 outside the margin
xi such that yi(wTxi+b)=1 on the margin
Inside the margin xi such that yi(wTxi+b)<1
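For reference, the standard form of the soft-margin primal problem described above (reconstructed, since the equations themselves are not reproduced in these notes):

```latex
% Soft-margin primal problem, reconstructed in standard form
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i \in [n]
```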
1.3 Dual Representation
Introduction
The optimization problem above is the primal problem of SV classification.
By solving a dual problem, we can sometimes get a different view of the same optimization problem.
In SVM, classifiers can also be obtained by solving the dual problem instead of the primal problem.
This is related to the usefulness of the dual problem form in the context of nonlinearization.
1.3.1 Dual Problems
Rewrite the optimization problem as above
Lagrange function (Lagrangian)
We introduce two new non-negative variables, αi∈ℝ+, i∈[n] and μi∈ℝ+, i∈[n].
Let α and µ be the vectors α=(α1,…,αn)T and µ=(µ1,…,µn)T.
Multiply αi by the left-hand side of the first constraint, -(yi(wTxi+b)-1+ξi),
and μi by the left-hand side of the second constraint, -ξi, and add them to the objective function
Primal variables w, b, ξ
Dual variables α, µ
Define P(w, b, ξ) as the function obtained by maximizing the Lagrange function with respect to the dual variables.
Optimization problem to minimize this function with respect to the primal variables
Since the max in the function P(w, b, ξ) involves only the last two terms of L, the above relation holds.
The optimization problem is said to be feasible if all the constraints are satisfied.
If the primal variables are not feasible – there exists i such that -(yi(wTxi+b) -1 +ξi) > 0 or -ξi > 0
The Lagrangian function L can be made as large as desired, and there is no maximum value.
If the primal variables are feasible – -(yi(wTxi+b) -1 +ξi) ≤ 0 and -ξi ≤ 0 for all i, so
The maximum value of the term in the product of αi and μi is 0
The equation of the general optimization problem appears
Define D(α, µ) as the function obtained by minimizing the Lagrange function with respect to the primal variables.
The problem of maximizing D(α, µ) with respect to the dual variables α and µ (the dual problem)
Minimize the interior of the right-hand side (the partial derivatives of w, b, and 𝝃 in L are zero)
L is a convex quadratic function with respect to w and is minimized at the point where the derivative becomes 0
Since L is linear with respect to b and ξ, if their coefficients (obtained by differentiation) were non-zero, L could be made arbitrarily small by moving these variables; in the unconstrained minimization these coefficients must therefore be zero.
Transformation of equation
An expression in which not only the primal variables but also μ is eliminated
Expression with only α
Expression for the dual problem
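For reference, the standard form of the dual problem obtained by the steps above (reconstructed from the surrounding description):

```latex
% Dual problem of soft-margin SV classification, reconstructed in standard form
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
 - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^\top x_j
\quad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le C,\quad i \in [n]
```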
1.3.2 Duality and Saddle Points
Relationship between the primal problem and the dual problem
Assume that there exist pairs of primal variables w, b, ξ and dual variables α and µ
Relationship between the optimal solution of the primal problem and the optimal solution of the dual problem
Continued
We can see that the optimal value of the dual problem is less than or equal to the optimal value of the primal problem, as shown in the above equation
Weak duality
In the case of support vector machines
In the case of support vector machines, a stronger property called strong duality holds.
The objective function values of the primal problem and the dual problem coincide at the optimal solution
The above equation holds
Furthermore, the above equation holds
L(w*,b*,𝛏*,α*,µ*) is a minimum for the main variables w, b, and 𝛏, and a maximum for the dual variables α and µ.
Saddle point of a function
The function takes a maximum value in one direction (axis 1) and a minimum value in another direction (axis 2).
1.3.3 Optimality conditions
In order to obtain a solution to an optimization problem, it is necessary to know the conditions for determining the optimality of the solution.
Duality also appears when describing these conditions.
The Karush-Kuhn-Tucker condition (KKT condition) is a necessary and sufficient condition for optimality of the solution of SV classification.
Differentiation of the Lagrangian with respect to the main variables
Constraint conditions for the main problem
Non-negative condition for dual variables
Complementarity condition
Continued
Positional Relationship between Margin and Feature Vectors and Dual Variables
When many feature vectors lie on the outer side of the margin, the vector of dual variables α has many zero elements (it is sparse)
1.4 Generalization by Kernel
The dual problem plays an important role in considering the nonlinearization of SVMs.
Consider a function Φ:ℝd→ F that maps an input x to some feature space F.
Let this Φ(x) be a new feature vector
f(x) = wTΦ(x) + b
The parameter w is also an element in the feature space F (w ∈ F)
If the transformation by the map Φ is nonlinear, then the classification boundary defined by f(x)=0 will be nonlinear in the original space
Since f(x) is linear with respect to Φ(x), f(x)=0 forms a linear classification boundary in the space F of Φ(x).
If we replace x with Φ(x) in the dual problem, the argument of Section 1.3 carries over.
In the left equation, Φ appears only in the form of the inner product Φ(xi)TΦ(xj)
We do not need to calculate Φ(x) directly, we only need to calculate the inner product Φ(xi)TΦ(xj)
The inner product Φ(xi)TΦ(xj) is defined as a kernel function.
Example of Kernel Function
RBF (radial basis function) kernel
Also called Gaussian kernel
γ is a hyperparameter (needs to be set in advance)
SV classification example using RBF kernel
Using the kernel function, the duality problem can be described as in the above equation
f(x) can also be written as above using the kernel function
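For reference, the kernelized dual problem, decision function, and RBF kernel described above, written in standard form (reconstructed, since the equations are not reproduced here):

```latex
% Kernelized dual problem and decision function, reconstructed in standard form
\max_{\alpha}\ \sum_{i}\alpha_i
 - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.}\quad \sum_{i}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le C
% Decision function and RBF (Gaussian) kernel
f(x) = \sum_{i=1}^{n}\alpha_i y_i K(x_i, x) + b,
\qquad K(x_i, x_j) = \exp\!\bigl(-\gamma\|x_i - x_j\|^2\bigr)
```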
1.5 Computational Features
Desirable properties of SV for implementation calculations
Reduction to a convex quadratic optimization problem
Optimization of the SV classification problem results in a type of optimization problem called a convex quadratic optimization problem
Convex quadratic optimization problems are relatively easy to deal with, and can start from any initial value and arrive at a global optimal solution.
Sparsity of dual variables
The fact that some of the dual variables α become zero is advantageous in terms of computational efficiency
Representation of data by inner product
The dual problem of SV classification depends on the feature vectors x only through inner products
Nonlinearization by kernel function becomes possible
When a problem is expressed only through inner products of feature vectors, the model can be made nonlinear by replacing the inner product with a kernel function
Kernel trick
It has also been applied in principal component analysis.
1.6 Properties of SV Classification
Introduction
How is the classification problem interpreted as a general statistical estimation problem?
1.6.1 Expected loss minimization
Consider inputs and labels as random variables
Actual training data is a realization of the random variable
Suppose the data is generated based on a probability density p(X,Y)
What kind of classifier g(x) is desirable?
Introduce a function called “loss function” ℓ(y,g(x)) to measure the goodness of classification
In two-class classification, the 0-1 loss is typically used
Define the expected loss as the expected value for all possible X and Y as above
In the case of 0-1 loss, a classifier that minimizes the expected loss is given by the above equation using conditional probability
Bayesian classifier
Bayes decision boundary
The boundary of the classification made by the Bayes classifier.
1.6.2 Loss Function and Regularization
In practice, it is very difficult to find a classifier that minimizes the expected loss, including the expected values for X and Y.
Consider the empirical loss, which approximates the expected value by the available training information, as in the above equation.
This value is easy to calculate compared to the expected loss.
It is a practical criterion to measure the accuracy of a classifier.
By minimizing the empirical loss, we can estimate a classifier that classifies the training set well.
As an approximate proxy for the 0-1 loss, we define a loss function called “hinge loss”.
Hinge function
Approximation of the 0-1 loss by a continuous and convex function
Comparison of hinge loss and 0-1 loss
Let f(x) = wTΦ(x) + b be the decision function.
The optimization problem of minimizing the hinge loss for the training set is as above
The inner max can be expressed as above by introducing a new variable ξi
The final result of the optimization problem
Pursuing only the classification of the training set does not necessarily reduce the expected loss
Comparison of the Bayes decision boundary (blue dashed line) and the classification boundary learned to minimize the empirical loss on the training data (solid black line)
The solid black classification boundary classifies the training data exactly.
Very different from the Bayes decision boundary
It does not achieve an optimal classifier in the sense of expected loss.
Over-fitting
In order to prevent overfitting, we need to put some restrictions on the classifier
As a regularization, we restrict the L2 norm of the parameter w of the decision function so that it does not become too large
Optimization problem for the method on the left
λ:regularization parameter
The degree to which the classifier is restricted is determined
The form of dividing λ by 2 is not essential.
The form using L2 norm is easy to calculate and often produces high practical results
It is often used
It is equivalent to the soft margin SV classification
Assuming λ=1/C
Examples of three different classification boundaries learned by changing the value of λ
The larger the value of λ, the smoother the classification boundary becomes due to regularization
Framework for minimizing the sum of the loss function and the regularization term
Hinge loss
Different methods using different functions for the loss function
Squared error loss
A loss function that has been used in regression analysis for a long time
Consider the square of the difference between Y and the decision function f(x) as the loss
Problems
Loss occurs even in cases where yf(x)>1.
Because it is a quadratic function, the effect of outliers is large.
Logistic loss
An old method in statistics called logistic regression.
A loss function derived from the negative log likelihood of a binomial distribution.
Clear probabilistic interpretation
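The loss functions above can be compared numerically as functions of the margin m = y f(x); the following is a small sketch using only numpy (the grid of margin values is illustrative):

```python
# Sketch comparing the loss functions above as functions of the margin m = y f(x).
import numpy as np

def zero_one_loss(m):
    return (m <= 0).astype(float)          # 1 if misclassified (or on the boundary)

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)        # convex upper bound on the 0-1 loss

def squared_error_loss(m):
    return (1.0 - m) ** 2                  # (y - f(x))^2 = (1 - yf(x))^2 when y in {-1, +1}

def logistic_loss(m):
    return np.log1p(np.exp(-m))            # negative log-likelihood of the logistic model

m = np.linspace(-2, 3, 6)
for name, loss in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                   ("squared", squared_error_loss), ("logistic", logistic_loss)]:
    print(name, np.round(loss(m), 3))
```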
1.6.3 Conditional Probability Estimation
Although 0-1 loss could be associated with Bayesian classifier, it was replaced with hinge loss for simplicity of optimization calculation.
Comparison of f(x) that minimizes the expected loss 𝔼X,Y[ℓ(Y,f(X))] for each loss function
Squared error loss estimates a linear function of conditional probability
Logistic loss estimates a ratio of conditional probabilities
Hinge loss estimates the conditional probability converted to a discrete value
Any f(x) that minimizes the expected value realizes a Bayesian classifier
This holds for expected values; with a finite sample, the behavior differs, for example in how outliers are treated.
If conditional probability is obtained, Bayesian classification is achieved.
Direct estimation using Bayes’ theorem is also possible
Bayes’ theorem
Since the denominator does not depend on Y
Since the denominator does not depend on Y, we can calculate p(X=x|Y)p(Y) to compare which label y should be assigned to the observed input x
By the basic formula of probability, p(X|Y)p(Y)=p(X,Y)
The joint probability p(X,Y) is the probability with which the data are generated.
A method that estimates this probability density is called a “generative model”
There are various methods of generative models, depending on how the probability density p(X,Y) is modeled.
p(Y)
p(Y) is the distribution of Y independent of X
It can be simply estimated from the examples of each class in the training data
Let n+ be the number of cases in the positive class and n- be the number of cases in the negative class
p(Y=1)=n+/n and p(Y=-1)=n-/n
p(X|Y)
If we model p(X|Y) for each class with a normal distribution having a different mean and a common covariance matrix,
Linear discriminant analysis is derived.
A feature of SV classification is that it directly estimates the minimum necessary for classification.
Vapnik’s principle
When solving a problem, do not solve a more general problem as an intermediate step
Only the classification boundary is required, and it is estimated directly.

Chapter 2 Multiclass Classification

2.1 Introduction
Multi-class classification problem where each case belongs to one of the c classes from 1 to c.
A label can be represented as yi∈{1,…,c}.
Applications of multi-class classification
How to achieve multi-class classification
Combining ordinary two-class classifiers
One-to-many method
One-to-one method
Error-correcting output codes
How to extend the SVM formulation itself
2.2 One-versus-rest Method
The simplest way to achieve multiclass classification by combining multiple two-class classifications.
Train a classifier for each class to separate x that belongs to a class from x that does not belong to a class.
Consider xi belonging to the kth class as a positive class and other xi as a negative class, and denote the decision function of the learned 2-class SV classification as fk(x).
In the one-to-many method, classification is performed by selecting the maximum value of fk(x).
The output g(x) of the classifier for input x is expressed in the above equation.
The larger the output fk(x) of each decision function, the stronger the confidence that x belongs to class k; this interpretation gives the classification rule.
Multiclass classification can be achieved by training only as many two-class classifiers as there are classes
Easy to implement
Problems
It is not clear whether it is appropriate to compare the magnitudes of the decision functions fk(x) of different SV classifiers
Asymmetry in the number of class labels may need to be taken into account when training classifiers for each class
When the number of examples in class k is small, fk(x) tends to be negatively biased.
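A minimal sketch of the one-versus-rest scheme described above, assuming scikit-learn; the function names and the RBF kernel choice are illustrative:

```python
# One-versus-rest: one binary SV classifier per class, prediction by the largest decision value.
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, C=1.0):
    # one classifier per class k: class k is positive, all other classes are negative
    return {k: SVC(kernel="rbf", C=C).fit(X, np.where(y == k, 1, -1)) for k in classes}

def predict_one_vs_rest(models, X):
    keys = sorted(models)
    scores = np.column_stack([models[k].decision_function(X) for k in keys])
    return np.array(keys)[np.argmax(scores, axis=1)]   # class with the largest f_k(x)
```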
2.3 One-to-One Method
Introduction
One-to-one (one-versus-one) method
A method based on classification of pairs of different classes.
For each pair of classes i and j, only the cases belonging to those two classes are taken out.
Classify the two classes.
The resulting decision function is denoted fij(x)
Considering all pairs of classes
We can create (c-1)c/2 classifiers.
The class to which x is classified is determined by a vote (majority vote) of the (c-1)c/2 classifiers
Problem
Need to prepare a large number of two-class classifiers.
Classification is impossible if multiple classes with the same number of votes appear.
Advantages
Less training data is needed, so the computational cost for one training is less.
How to determine multi-class classification based on pairs
Simple voting method
Acyclic directed graph method
Pairwise coupling method
2.3.1 Directed acyclic graph method
In the voting method, all (c-1)c/2 decision functions are evaluated to classify an x.
In the directed acyclic graph method, multiclass classification can be performed with fewer decision function evaluations by performing two-class classifications sequentially.
Example of a directed acyclic graph for multiclass classification
To prepare a classifier for each vertex, it is necessary to train SV classification (c-1)c/2 times, but the number of classifiers that need to compute an output to classify a new input x is small.
2.3.2 Pairwise coupling
A method for estimating multiclass classification rules from one-to-one classification results based on a probabilistic interpretation.
In multiclass classification
If the conditional probability in the above equation is obtained, a Bayesian classifier can be realized.
In the one-to-one method
Given an x that belongs to class i or class j
Which class should be assigned to i or j?
Conditional probability formula
From the pair-wise estimation result p(Y=i|Y∈{i,j}, X=x), estimate the truly necessary conditional probability p(Y=i|X=x)
The decision function f(x) of SV classification does not estimate the conditional probability.
Model the conditional probability p(Y=i|Y∈{i,j}, X=x) from the decision function fij(x) for a pair of classes i and j
Let A∈ℝ and B∈ℝ be parameters.
Parameters can be estimated by maximum likelihood estimation
Expression for minimization of negative log likelihood
To solve this optimization problem, we apply standard methods such as Newton’s method.
By estimating A and B, we can calculate ṗij(x) for any x
Estimate p(Y=i|X=x) from this ṗij(x)
From basic probability formulas
A measure of the difference between two probability density functions
Kullback-Leibler divergence (Kullback-Leibler divergence)
Always non-negative
0 when two given probability density functions are identical
The criterion is defined as the sum of KL divergences weighted by the number of cases in each class pair, subject to the constraints that {pi} is a probability distribution (it sums to 1) and pi ≥ 0
Minimizing this criterion over {pi}i∈[c]
yields an estimate of the conditional probability pi = p(Y=i|X=x).
2.4 Error Correcting Output Codes
An error-correcting code is a mechanism for removing errors contained in transmitted signals.
A more general framework can be given by “error correcting output code”.
2.4.1 Multi-class classification by encoding class labels
Each class is assigned a different sequence of numbers called a code word.
Consider a code word of length m that takes the value 1 or -1.
If we create a code word of length m for each of the c classes
and arrange them, we obtain a c×m matrix S called the encoding matrix.
Example: Recognition of handwritten numbers from 0 to 9
Example of encoding matrix S
Assign a sign according to the shape characteristics of each digit
Use each of the m columns of the encoding matrix as labels yi
and train m two-class classifiers
Let the m decision functions for an input x be f1(x),…,fm(x)
Decide which class to assign based on which row of S (which class's code word) the outputs are closest to.
Consider the Hamming distance to determine the closeness.
The closest one is chosen even if they are not exactly the same.
Contrast with one-to-many and one-to-one methods
Coding approach is more general
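A minimal sketch of the decoding step described above, assuming the m decision-function outputs and the ±1 encoding matrix S are already available; names are illustrative:

```python
# Decode an error-correcting output code: pick the class whose code word is
# closest in Hamming distance to the signs of the m decision-function outputs.
import numpy as np

def ecoc_decode(decision_values, code_matrix):
    """decision_values: (m,) outputs f_1(x)..f_m(x); code_matrix S: (c, m) array of +1/-1."""
    predicted_bits = np.sign(decision_values)              # predicted +1/-1 for each bit
    hamming = np.sum(predicted_bits != code_matrix, axis=1)
    return int(np.argmin(hamming))                         # index of the closest code word
```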
2.4.2 Use with pairwise coupling
Coding approach combined with pairwise methods
2.5 Simultaneous Formulation of Multi-class Problem
Multiclass approach to the formulation of SV classification itself
Dual problem
Number of decision functions
Time-consuming to compute due to the increase in the number of variables

Chapter 3 Regression Analysis

3.1 Regression Problems
A regression problem is a problem where the output is a real number.
Example: Predicting the price of a financial product
Predicting the blood pressure of a person who takes a new drug
SV regression can be used to make predictions based on a nonlinear model
3.2 Regression using the Least Squares Method and the Least Absolute Error Method
Least squares method
In the least-squares method, the model parameters w and b are determined so that the sum of the squared errors of the actual output value yi and the predicted value of the model is minimized.
The least-squares method is formulated as the optimization problem above.
The error in the least-squares method is replaced by the absolute error.
Formulation
Loss Function of Least Squares Method and Least Absolute Error Method
Differences between the Least Squares Method and the Least Absolute Error Method
The solution of the least-squares method can be obtained analytically by solving linear equations. The least absolute error method is formulated as a linear programming problem, so it must be solved with some algorithm.
When the noise term z is normally distributed, the least squares method coincides with maximum likelihood estimation; when the noise follows a Laplace distribution, the least absolute error method coincides with maximum likelihood estimation.
The least squares method is an estimator of the mean of the conditional distribution p(Y|x); the least absolute error method is an estimator of the conditional median.
The least squares method is sensitive to outliers, while the least absolute error method is robust to them.
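For reference, the two criteria compared above, written for a linear model f(x) = wᵀx + b:

```latex
% Least squares and least absolute error criteria
\min_{w,\,b}\ \sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2
\qquad\text{vs.}\qquad
\min_{w,\,b}\ \sum_{i=1}^{n}\bigl|y_i - f(x_i)\bigr|
```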
3.3 Formulation of SV regression
3.3.1 Loss function of SV regression
3.3.2 Main problem of SV regression
3.3.3 The duality problem of SV regression
3.4 Nonlinear Modeling with SV Regression
Kernel functions can be used to perform regression based on nonlinear models
3.5 Properties of SV Regression
3.5.1 Sparsity and Support Vectors
A model with high sparsity allows for faster evaluation of the regression model and faster processing for applications that require real-time performance.
3.5.2 Relationship between SV regression and least squares and least absolute error methods
Selecting the loss function in a regression problem can be understood as selecting the noise model to be assumed behind it.
3.6 Quantile Regression Analysis
Kernel Quantile Regression

Chapter 4 Support Vector Machines for Unsupervised Learning

4.1 Tasks in Unsupervised Learning
4.1.1 Clustering
A method for finding dense group structures in the input data.
Purpose
Also used as preprocessing for analysis
Commonly used methods
K-means method
Extensions
Large margin clustering (LMC)
Incorporates the concept of margin maximization into clustering
Kernel k-means method
K-means method with kernel function for non-linear clustering
4.1.2 Dimensionality reduction
The conversion of high-dimensional data into low-dimensional data.
Purpose of dimensionality reduction
Used as data preprocessing
Noise reduction
Data compression
Commonly used methods
Principal component analysis (PCA)
Dimensionality reduction by linear transformation of data
Consider the dimension with the largest data variability as the important dimension
Efficient because it results in eigenvalue problems to be solved
Extensions
Kernel principal component analysis (KPCA)
Introduces a kernel function to principal component analysis
Dimensionality reduction can be performed by non-linear transformation
Unsupervised dimensionality reduction
Supervised dimensionality reduction
4.1.3 Anomaly detection
Determine whether a new case is normal or abnormal when it is given
In situations where only input examples are given as training data, it is not clear how to define “abnormal”.
Intuitive interpretation
Normal is what is probabilistically likely to happen
Abnormal is what is probabilistically unlikely to happen
The simplest approach is based on similarity
Similarity to many cases is normal
Anomalies are those that are not similar to any of the examples
Unsupervised Anomaly Detection
Given only input cases
Supervised anomaly detection
When both abnormal and normal values are given
Anomalies are called positives and normalities are called negatives
False positives are normal cases judged abnormal, and false negatives are abnormal cases judged normal
It is important to maintain a balance between the false positive rate and the false negative rate.
In general, the false positive rate is set to a small value such as 1% or 5%.
4.1.4 Unsupervised Learning and Probability Density Estimation
Common features of unsupervised learning tasks
Make inferences about the data source of the input data {xi}i∈[n]
If we know the probability distribution of the input data sources, we can easily perform all the tasks.
High-dimensional input data (little knowledge of the data)
Difficult to estimate the probability distribution of data sources accurately.
Approaches that directly perform the desired task without estimating probability distributions are preferred.
When we have a priori knowledge of the data source or need to explain or interpret the data itself
First, estimate the probability distribution of the data source
Then, an approach is taken to perform the task based on the estimated probability distribution.
Generative Model Approach
4.2 One-Class SVM
Introduction
One-Class SVM is a method designed in terms of using the SV classification approach for unsupervised learning.
Can be considered as an unsupervised anomaly detection algorithm
+1 if the input value is normal, -1 if it is abnormal
Only normal cases can be given as training data
Controlling the false positive rate should also be considered
4.2.1 Concept of One-Class SVM
In 1-Class SVM, the linear model in the input space is not directly used for anomaly detection.
Consider the following models
Linear model of feature space
Φ is a transformation that represents a mapping from the input space to the feature space
and its dual representation
We start our discussion with the dual representation
Objective of 1-class SVM
Find a decision rule that returns
+1 if the vector x is normal
-1 if the vector x is abnormal
Learning objective
To find a parameter α such that the above decision rule is as correct as possible for unknown data.
4.2.2 Formulation of One-Class SVM
Example of a one-class SVM
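A minimal sketch of one-class SVM anomaly detection as described above, assuming scikit-learn; the data and the hyperparameters nu and gamma are illustrative:

```python
# Unsupervised anomaly detection with a one-class SVM (scikit-learn assumed).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))                # only "normal" training cases
X_new = np.array([[0.1, -0.2], [4.0, 4.0]])        # the second point is an obvious outlier

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_train)
print(ocsvm.predict(X_new))                        # +1 for normal, -1 for abnormal
```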

Chapter 5 Kernel Functions

5.1 Properties of Kernel Functions
Introduction
By replacing the inner product of the feature vectors Φ(x) mapped to the feature space F with the kernel function K(xi,xj)=Φ(xi)TΦ(xj), complex models can be realized without explicitly calculating Φ(x).
5.1.1 Mercer’s Theorem
In order for a function to be a kernel function
For any inputs xi, xj, it must correspond to some inner product Φ(xi)TΦ(xj) in some feature space F
If the space X of feature vectors has only a finite number of elements
If the number of elements is m, then X={x1,…,xm}
Consider the matrix K∈ℝm×m whose (i,j) element is the value K(xi,xj)
In order for K(xi,xj) to be a kernel function
K must be a positive semidefinite matrix
What is a positive semidefinite matrix?
A matrix with non-negative eigenvalues
The matrix of eigenvectors of K is called V
Let Λ be a diagonal matrix with the eigenvalues λ arranged along the diagonal
The eigenvalue decomposition is K=VΛVT
The (i,j) element of K, K(xi,xj), is as shown in the above equation.
This can be interpreted as the inner product of the case where Φ is defined.
If we consider a more general space X of feature vectors that can have an infinite number of elements
Mercer’s theorem
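In the finite case above, positive semidefiniteness of the Gram matrix can be checked numerically; the following is a small sketch using numpy with an RBF kernel (the data are illustrative):

```python
# Check positive semidefiniteness of an RBF Gram matrix numerically.
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.RandomState(0).normal(size=(30, 3))
K = rbf_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)                    # K = V Λ V^T, eigenvalues are real since K is symmetric
print("smallest eigenvalue:", eigvals.min())       # should be >= 0 up to numerical error
```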
5.1.2 Operations on Kernel Functions
Any function can be used as a kernel function, as long as it satisfies Mercer’s theorem.
The basic form of a kernel function and the general operations to derive a new kernel function from a kernel function are shown below.
For a symmetric positive semidefinite matrix B, the above equation becomes a kernel function
If the function g(x) is a real-valued function on X, the above equation becomes a kernel function
If K1(xi,xj) and K2(xi,xj) are kernel functions satisfying Mercer’s theorem and a≥0 is an arbitrary non-negative scalar, the above equations are all kernel functions
To be continued
Polynomials with kernel functions are also kernel functions
Any function that can be approximated to arbitrary precision by Taylor expansion is a kernel function, as long as it is composed of kernel functions.
5.2 Various Kernel Functions
Introduction
By varying the kernel function, the learned model will be completely different.
A variety of kernel functions have been proposed for different purposes.
5.2.1 Basic Kernel Functions
Very frequently used basic kernel functions
Linear kernel
Simple kernel function derived when Φ(xi)=xi
Polynomial kernel
C:Hyperparameter
d: natural number
Allows for non-linear models
RBF Kernel
γ:Hyperparameter
Allows for non-linear models
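The three basic kernel functions above, written out as a minimal numpy sketch (hyperparameter values are illustrative defaults):

```python
# Basic kernel functions: linear, polynomial, and RBF.
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                  # corresponds to Φ(x) = x

def polynomial_kernel(xi, xj, c=1.0, d=3):
    return (xi @ xj + c) ** d                       # c and the degree d are hyperparameters

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))  # gamma is a hyperparameter
```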
5.2.2 Kernel Functions Based on Probabilistic Models
There is a concept of “defining a kernel function by reflecting the probability distribution in the input space X”.
Fisher kernel
Consider a generative distribution p(x|θ) with parameters θ=(θ1,…,θγ)T
The distribution is arbitrary, but the probability density must be differentiable with respect to θ.
θ is determined appropriately, for example by maximum likelihood estimation.
The gradient of the log probability of p(x|θ) with respect to θ, called the Fisher score, is denoted Ux.
The Fisher score is expressed by the gradient of the parameter θ of the distribution that generates a feature vector x.
Using the natural metric on the manifold created by the probability model for the parameter θ of the probability distribution, we can derive the kernel function as above
Iθ is the Fisher information matrix
The relationship between the probabilistic model and the metric is treated in the theory of information geometry
Intuitive interpretation of the Fisher kernel
When the probability distribution belongs to the exponential family
Density function
s(x) is a function of x and a vector of the same dimension as θ
Ψ(θ) is a normalization term independent of x
The probability density function p(x|θ) depends on the input x only through s(x)
s(x) determines the relationship between the probability density function and the random variable
s(x): Sufficient statistic
In general, when estimating parameters of a probability distribution, it is sufficient to know the sufficient statistic.
Fisher score of an exponential distribution family
Fisher kernels are often applied to more complex models such as hidden Markov models.
5.2.3 Kernel Functions for Strings
Applying Kernel Functions to Problems with Strings as Input
Assumptions
Input space X, a set of strings of finite length
The individual characters that make up a string are contained in the set A
Denote the length of a string as|s|.
Represent a set of strings of length p as Ap
Various Kernels
p-spectrum kernel
Entire substring kernel
Gap-weighted substring kernel
Define a kernel function by counting the substrings shared by two strings.
Methods based on dynamic programming are often used to make the counting efficient.
(1) p-spectrum kernel
P-spectrum kernel
Examine the frequency of substrings of length p that are shared by two given strings.
What is a substring?
A substring is a sequence of p characters in a string.
Example
A substring of length p=3
The number of occurrences of a substring u in a string s is denoted as Φu(s).
The mapping Φ that defines the P-spectrum kernel is defined by arranging Φu for all substrings of length p
(Φu(s))u∈Ap is a vector of Φu(s) for each element u of Ap.
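A minimal sketch of the p-spectrum kernel described above: count the length-p substrings of each string and take the inner product of the two count vectors (the example strings are illustrative):

```python
# p-spectrum kernel: inner product of substring-count vectors.
from collections import Counter

def p_spectrum_features(s, p):
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))   # Φ_u(s) for each substring u

def p_spectrum_kernel(s, t, p):
    phi_s, phi_t = p_spectrum_features(s, p), p_spectrum_features(t, p)
    return sum(phi_s[u] * phi_t[u] for u in phi_s if u in phi_t)

print(p_spectrum_kernel("statistics", "computation", p=3))      # shared substrings: "tat", "ati"
```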
(2) All-subsequence kernel
All-subsequence kernel
(3) Gap-weighted subsequence kernel
Gap-weighted subsequence kernel
5.2.4 Kernel Functions for Graphs
Representation of a graph as a matrix
Example : Kernel functions for graphs
(1) Kernel between vertices
Given a graph as input, where each vertex corresponds to a case.
Example: The problem of predicting the function of a protein on a protein network
Label each vertex with the functional category of the corresponding protein
Predict the functional category of a protein with unknown function from a protein with known function
One vertex is one case
We need a kernel function defined for any two vertices
If we consider learning on a graph
A matrix called graph Laplacian is often used
D is a diagonal matrix whose diagonal elements are the row sums of W
Adjacency matrix can be a weighted one
Expanding the quadratic form of the graph Laplacian, we can confirm that it is positive semidefinite
f is an arbitrary real vector
The quadratic form becomes smaller as the difference between fi and fj becomes smaller for vertices i and j connected by an edge
The quadratic form of the graph Laplacian expresses how smoothly {fi}i∈[n] varies on the graph when each element fi of the vector f is associated with a vertex of the graph.
By considering the eigenvectors of the graph Laplacian
By considering the eigenvectors of the graph Laplacian, we can extract the components that represent the smoothness on the graph.
To derive the kernel from the graph Laplacian
Define the kernel matrix from the eigenvalue decomposition
Let the i-th eigenvalue be λi
Let r:ℝ+ → ℝ+ be an appropriate monotonically decreasing function.
The kernel matrix for the vertices of the graph is defined as above
Since the function r is monotonically decreasing, the smoother the component on the graph (the one corresponding to a small eigenvalue), the more strongly it affects the kernel matrix.
The above kernel matrix is semi-positive definite and the above equation is often used as the function r
ε∈ℝ++ and σ∈ℝ++ are hyperparameters
Only for the commute-time kernel, r is not monotonically decreasing at λ = 0
It is monotonically decreasing for components corresponding to non-zero eigenvalues.
For the graph Laplacian, the above formula with normalization may be used.
The diagonal elements of the graph Laplacian L are normalized to 1.
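A minimal sketch of building a kernel matrix over graph vertices from the graph Laplacian as described above; the regularized-Laplacian choice r(λ) = 1/(1 + σλ) is used as one illustrative monotonically decreasing function, and the small adjacency matrix is made up for the example:

```python
# Kernel between graph vertices from the graph Laplacian: L = D - W, L = V Λ V^T, K = V r(Λ) V^T.
import numpy as np

def laplacian_kernel_matrix(W, sigma=1.0):
    D = np.diag(W.sum(axis=1))                      # degree matrix
    L = D - W                                       # graph Laplacian
    eigvals, V = np.linalg.eigh(L)                  # eigenvalue decomposition L = V Λ V^T
    r = 1.0 / (1.0 + sigma * eigvals)               # r(λ): monotonically decreasing in λ
    return V @ np.diag(r) @ V.T                     # positive semidefinite kernel matrix

# adjacency matrix of a small weighted graph with 4 vertices (illustrative)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(laplacian_kernel_matrix(W).round(3))
```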
(2) Kernel between graphs
Consider the case where each element of the input space X is a graph.
Example: A problem to predict the toxicity and other properties of a compound by representing its molecular structure as a graph
Kernel functions need to be prepared for two given graphs
The simplest method
Based on whether the two graphs are isomorphic or not
Does not distinguish between vertices of the graphs
If the two graphs match completely when the vertex indices are rearranged,
the two graphs are defined to be isomorphic
Example: Define an element ΦH(G) of a feature map Φ(G) using the number of subgraphs in which a graph G is isomorphic to a graph H ∈ X.
This method requires a very large amount of computation.
Easier to compute kernel functions
Methods based on walks on graphs
What is a walk on a graph?
A walk on a graph is a movement like walking along an edge on the graph.
A walk on a graph is represented as a sequence of vertices and edges that you walk along.
Consider the case where labels are attached to vertices and edges of a graph
For example, the type of molecule or bond that makes up a compound is a label.
A walk on a graph can be represented as a sequence of labels
The kernel function for the label sequence generated by a walk on two graphs can define the kernel function between the graphs
Let s be an arbitrary label sequence, and consider constructing a feature map Φ(G) for a graph G by a feature map ΦS(G) defined for each label sequence
We introduce a walk-dependent weight parameter λ(w) and define ΦS(G) as above
W(G) is the set of walks in the graph G
s(w) is the label sequence generated by the walks w
For two graphs G1 and G2, the kernel function of this feature map can be expressed as a sum over a set S of possible label sequences
It can also be expressed as a sum over possible walks
Since there is no limit to the length of walks, the computability depends on how to determine λ(w).
For example, if λ(w) is set to 1 if s(w) has length ℓ, and 0 otherwise, then the sum of all walks with length ℓ is computable
If we set λ(w) to decay as its length ℓ increases, we may be able to calculate the value at which the infinite series converges.
A random walk described in “Overview of Random Walks, Algorithms, and Examples of Implementations“, which introduces stochastic transitions into the walk on the graph, can also be incorporated into the kernel function
λ(w) is determined by the transition probability of w on the graph
Kernel computation between two input graphs can be expressed in a unified manner using the adjacency matrix of a graph called the direct product graph.

Chapter 6: Introduction to Optimization: Optimality Conditions and Generic Solution Methods

6.1 Introduction
All methods such as SV classification and SV regression are formulated as numerical optimization problems.
In the case of SVM, it is usually not possible to obtain the solution analytically, but it is necessary to start from a suitable initial solution and iteratively search for the solution.
How do we obtain a solution to a standard optimization problem?
The SVM optimization problem is a convex quadratic optimization problem, in which the objective function is a convex quadratic function and the constraints are linear.
The primal problem of two-class SV classification is defined as above
Defining Q∈ℝnxn as the matrix whose (i,j) element is yiyjK(xi,xj), the dual problem becomes as above
1 is a vector of n 1’s
y=(y1,…,yn)T
6.2 Optimality Conditions
Optimality conditions play an important role in considering solution methods for SVMs.
Strong duality
Equations for objective function and constraints
The notation for the primal problem is the above equation
When the dual variables are collectively denoted as λ=(αT, μT)T, the Lagrangian becomes the above equation
If we denote the objective function of the dual problem as D(λ), the dual problem is expressed as above
Karush-Kuhn-Tucker (KKT) condition
Using the decision function in the above equation with dual variables, we can express the problem with a simple condition as shown in the above equation.
6.3 Generic Solution Method
Introduction
In continuous-valued numerical optimization, the following methods are commonly used to reduce (or increase) the objective function based on the gradient
Steepest descent (Steepest descent)
Newton method
In the case of SVM optimization, it is necessary to consider how to handle constraints.
Active set method
Interior point method
6.3.1 Active set method
Standard method for solving constrained optimization problems
When an optimization problem has multiple inequalities
In many cases, some of the inequality constraints hold with equality at the optimal solution
The set of inequality constraints for which the equality sign is satisfied is called an active set.
In SVM, if the optimal active set is found, the optimal solution can be expressed as a linear equation
Omission
In the active set method, an appropriate initial value is given and the active set is updated.
Algorithm
6.3.2 Interior Point Method
Overview
General purpose solution to inequality constrained optimization problems.
It is called the interior point method because it aims at the optimal solution while passing through the interior of the feasible region.
Originally devised as a solution method for linear programming problem
Method
A new vector of non-negative variables s=(s1,…,sn)T∈ℝn is added to the primal problem of the SVM.
Rewrite the above equation as
We add a non-negative variable si to the left-hand side of the original inequality constraint -yi(wTΦ(xi)+b)+1-ξi≤0 so that the equality holds
This operation has no effect on the optimal solution.
This rewrite turns the inequality constraints into only non-negativity constraints on each variable
Instead of this non-negative constraint, consider a problem with a log barrier function as shown above
μ>0 is a positive parameter
Logarithmic barrier function
The closer the argument of the log is to 0, the more the barrier tends toward ∞.
It is not defined for values less than or equal to 0
The logarithmic barrier function is an approximation of the non-negative constraint
The closer μ is to 0, the closer it is to the original optimization problem
The interior point method starts with an appropriate initial value of μ, solves the optimization problem above, and obtains a solution by gradually bringing μ closer to 0.
The Lagrange function of the optimization problem becomes the above equation
Obtain the above equation from the derivative with respect to each variable
KKT condition is obtained
Eliminating w using ∂L/∂w and rearranging the KKT condition, we get the above equation
Diag(·) is a diagonal matrix whose diagonal elements are the entries of the vector given as its argument.
Since it contains nonlinear terms, it is commonly obtained by using a linear approximation.
SV classification optimization algorithm using the interior point method

Chapter 7 Decomposition Methods

Introduction
Among the optimization methods for SVMs, an approach called the decomposition method is presented.
Instead of optimizing over the entire training set at once, we select a portion of the training set and repeat small-scale optimizations.
Method
SMO algorithm
DCDM algorithm
Speed up by optimizing specifically for SVM training
Optimization methods for solving dual problems
7.1 Decomposition Method
In the dual problem of SV classification, we optimize n unknown variables {αi}i∈[n].
Basic policy for efficiently solving the dual problem
Use the fact that the final classifier depends only on the support vector
If we know which training cases become support vectors, we can consider a subset of the training set consisting only of support vectors as training cases and solve a small number of optimization problems.
Until we get the optimal solution, we can’t know which training case will be the support vector.
decomposition method or chunking method
Consider a small subset of training cases that are expected to have a large impact on the classifier, like support vectors
Solve a small optimization problem on this subset
Update the working subset using the tentative solutions obtained along the way.
Working set
The subset of the variables {αi}i∈[n] selected at each step of the decomposition method.
Let S⊆[n] be the working set.
Only those variables included in the working set are considered as unknown variables
The rest is denoted S̄ = [n]\S
Those that are not included are regarded as constants
Formulation of optimization problem
Matrix Q is an nxn matrix whose (i,j) elements are Qi,j=yiyjK(xi,xj)
Matrix Q ∈ ℝnxn
Vector α ∈ ℝn+
vector y∈ℝn
A dual problem with only the working set as the unknown variable is also a convex quadratic programming problem
If the size of the working set is small enough, the problem can be solved efficiently using interior point method or active set method.
7.2 SMO Algorithm for Kernel SVM
Introduction
Introduction to the SMO algorithm (sequential minimal optimization algorithm), a kind of partitioning method
The SMO algorithm is a partitioning method in which the size of the working set S is set to|S|=2
Advantages of setting the size of the working set to 2
The solution of the dual problem for S solved in each step can be obtained analytically.
No need to use solvers such as interior point method or active set method.
Very efficient training of SVMs for large problems
Also used in LIBSVM
Sequentially explain how to analytically find the optimal solution for two variables and how to choose two variables
There is a more efficient decomposition method when using linear SVM and kernel function K(xi,xj)=xiTxj
7.2.1 Optimization of two variables
Transform the dual variables {αi}i∈[n] into new variables {βi}i∈[n] as in the above equation
Since the label is yi∈{±1}, αi and βi are related by the above equation
Using the variables {βi}i∈[n], the dual problem of SV classification becomes the above equation
Kij=K(xi,xj).
In the SMO algorithm, out of n unknown variables {αi}i∈[n], different variables βs, βt, s≠t are used as the working set.
Why choose two variables instead of one as the working set?
Because of the equality condition above
If we fix n-1 variables except for one variable βs
In order to satisfy the equality, the value of βs cannot be changed.
If we can change the two variables βs and βt, we can satisfy the equality constraint as long as the value of βs+βt remains constant.
Consider the case where the two variables are updated as in the above equation
In order to satisfy the equality constraint, we need to update the above equation
We need to satisfy ∆βt=-∆βs
Only βs is a freely movable variable
Substituting in the other constraints
If we rewrite the problem as a constrained optimization problem for βs
The final solution is
7.2.2 Selection of two variables
How to choose two variables βs and βt
Even if we choose randomly, the SMO algorithm will converge to the optimal solution
By choosing βs and βt well, we can improve the speed of convergence.
The optimal solution for SV classification must satisfy the conditions in the above equation.
The basic strategy for selecting the two variables is to choose the pair that most violates the optimality condition.
The pair that most violates the optimality condition is given by the above equation.
Finally, we need to calculate the above equation.
7.2.3 Summary of the SMO Algorithm
SMO algorithm for SV classification
At each step, we need to compute O(n) elements of the kernel matrix K∈ℝnxn
In order to avoid computing the same element of the kernel matrix many times, it should be precomputed and stored in memory.
In case of large data, store only a part of the kernel matrix in the cache.
7.3 DCDM Algorithm for Linear SVM
Introduction
We introduce an algorithm for partitioning methods specific to linear SVMs.
DCDM algorithm (Dual coordinate descent method algorithm)
It is also used in LIBLINEAR.
The group that developed LIBSVM applied this algorithm to linear SVM.
One of the advantages of SVM is that it uses kernel functions to capture complex features.
In situations where useful features are already known
Linear SVMs can often be used to achieve the same level of performance as kernel SVMs
Data with high-dimensional and sparse features
Example: Whether a word appears in a document or not in natural language processing
The number of feature dimensions is very high.
On the other hand, it has sparsity where many features are zero
Using a linear SVM is effective
7.3.1 Linear SV Classification
Bias-free linear model
The primal problem of SV classification is the above equation
The dual problem is as above
7.3.2 DCDM Algorithm
The DCDM algorithm is an algorithm to solve the above dual problem.
The basic policy of the algorithm is the same as the SMO algorithm
Repeat the optimization for the smallest working set S
In SMO, the working set consists of two variables
Since the dual problem above has no equality constraint, a single variable αs, s ∈ [n], can be updated at a time
Constrained quadratic minimization problem for univariate ∆αs
Can be solved analytically as in the above equation
DCDM algorithm
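A minimal sketch of the DCDM (dual coordinate descent) update for the bias-free linear SVM described above; this is a simplified illustration under the stated assumptions (hinge loss, 0 ≤ αi ≤ C), not the LIBLINEAR implementation:

```python
# Dual coordinate descent for a bias-free linear SVM (simplified sketch).
import numpy as np

def dcdm_linear_svm(X, y, C=1.0, n_epochs=20, seed=0):
    n, d = X.shape
    alpha, w = np.zeros(n), np.zeros(d)
    Qii = np.sum(X ** 2, axis=1)                     # diagonal of Q: Q_ii = x_i^T x_i
    rng = np.random.RandomState(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(n):                 # update one dual variable alpha_i at a time
            if Qii[i] == 0:
                continue
            G = y[i] * (w @ X[i]) - 1.0              # gradient of the dual objective w.r.t. alpha_i
            new_alpha = np.clip(alpha[i] - G / Qii[i], 0.0, C)
            w += (new_alpha - alpha[i]) * y[i] * X[i]  # keep w = sum_j alpha_j y_j x_j up to date
            alpha[i] = new_alpha
    return w, alpha
```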

Chapter 8 Model Selection and Regularized Path Tracking

Introduction
In practical applications of SVMs, hyperparameters such as regularization parameters need to be determined appropriately.
Model Selection
Use the cross-validation method
Track how the optimal solution changes when the regularization parameter C is gradually changed
Regularization path tracking method
8.1 Model Selection and Cross-validation Method
Model selection and cross-validation methods for SVM
8.1.1 Model Selection
In practical applications of SVMs, it is necessary to appropriately determine hyperparameters such as regularization parameters and kernel parameters.
Model selection
Learn SVMs with various hyperparameters (solve optimization problems) and select the best model.
Reference [8], chapter 7
Classification bounds of the SV classifier for various regularization parameters C
If C is too small, it is “under-fitting”.
If C is too large, it is “over-fitting”.
In general, prepare several candidate hyperparameters and select the best one among them.
When there are only one or two hyperparameters
When there are only one or two hyperparameters, we may divide the candidate range evenly into a grid and select from the grid candidates (grid search).
When the number of hyperparameters is large
Select randomly
Bayesian optimization approach
Approach using gradient information
In order to select an appropriate model, we need to estimate the generalization error of the classifier
Generalization error
Expected value of the error on unknown data
8.1.2 Cross-validation method
Train a classifier using training data and evaluate the trained classifier using evaluation data
Advantages
Less susceptible to overtraining because training data and evaluation data are different
Problems
Only a part of the available data can be used for training.
If the number of data is not large enough, the accuracy of the classifier will decrease
Cross-validation method is a solution to the above problems.
Image of cross-validation method
The cross-validation method with k divisions is called k-fold cross-validation
In general, k=5, 10, etc.
Procedure
Assume that we have a data set consisting of nall training cases at hand.
In the k-fold cross-validation method, Dall is randomly divided into k non-overlapping subsets D1,…,Dk.
Perform training and evaluation k times
The estimated generalization error by the K-fold cross-validation method is given by
The classifier trained with the training set other than the cases included in Dk is denoted as g(-Dk).
In model selection for regularization parameters, the above equation is calculated for various C, and the one with the smallest value is selected.
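A minimal sketch of selecting the regularization parameter C by k-fold cross-validation, assuming scikit-learn; the data, the candidate grid, and k = 5 are illustrative:

```python
# Model selection of C by k-fold cross-validation (grid search, scikit-learn assumed).
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}                    # candidate regularization parameters
search = GridSearchCV(SVC(kernel="rbf", gamma=0.5), param_grid, cv=5)   # k = 5 folds
search.fit(X, y)
print("best C:", search.best_params_["C"])
```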
Leave-one-out cross-validation
An extreme version of k-fold cross-validation, where k = nall.
Computational cost of cross-validation is a problem for large scale data
The sequential learning approach is more efficient.
8.2 Regularized Path Tracking Algorithm
8.2.1 Overview of Regularized Path Tracking Algorithm
Regularized Path Tracking Algorithm for SV Classification
Often used to solve the duality problem of SV classification
Dual representation of the decision function and the dual problem
The decision function of SV classification is expressed as above using the kernel function K and the dual variables (α,b).
Using the matrices and vectors in the left equation, the dual problem of SV classification can be transformed as above
To prepare for the introduction of the regularized path tracking algorithm, we define the matrix and vector in the above equation.
0,1 is a vector where all elements are 0 or 1.
An inequality between two vectors means that the inequality holds for all of their elements.
We denote α*(C) to specify that the optimal solution of the dual variable is the one with regularization parameter C.
In the regularization path tracking algorithm, we track how the set {O,M,I} changes as we gradually increase the regularization parameter C.
In the dual problem of SV classification, if the set {O,M,I} is known, the optimal solution (α*(C),b*(C)) can be obtained analytically as a function of C.
Regularization path tracking algorithm
Step 1
Given a fixed set {O,M,I}, find the optimal solution (α*(C),b*(C)) in parameterized form with regularization parameter C
Step 2
Detect the event that the set {O,M,I} changes when the regularization parameter C is changed.
8.2.2 Parameter representation of the optimal solution (Step 1)
Set {O,M,I} is known
Find the optimal solution (α*(C),b*(C)) of the dual problem in the form of a parameter representation with regularization parameter C
In the optimal solution, the optimality condition is expressed as in the above equation
For the third and fourth lines, we use the properties of the above equation
By arranging all the elements in the set M, the optimality condition can be expressed as above
Optimality conditions using matrices and vectors
The above equation summarizes the optimality conditions
The equation is a linear system of |M|+1 equations in the |M|+1 unknowns (α*M(C),b*(C))
Denote the (|M|+1)x(|M|+1) coefficient matrix by A
If A has an inverse A-1, the dual variables (α*M(C),b*(C)) can be obtained as a function of C, as shown in the above equation.
8.2.3 Event detection (Step 2)
Detect events that change the set {O,M,I} that characterizes the optimal solution when the regularization parameter C is gradually increased.
Four events that need to be considered
The value of C at which the four events occur can be detected as shown in the above equation.
8.2.4 Piecewise Linearity of the Regularization Path Tracking Algorithm
Regularization path tracking algorithm for SV classification
Assume that the optimal solution (α*(Cmin), b*(Cmin)) for a small Cmin is obtained
Find the path of the optimal solution at C ∈ [Cmin,∞)
8.2.5 Numerical computation and computational complexity
8.2.6 Example of the Regularization Path Tracking Algorithm
Example of the classification boundaries of the SV classifier for various C, on an artificially generated training set (n=100, d=2) drawn from a mixture of normal distributions.
Piecewise-linear path of the generated solution (a rough code illustration follows below)
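The exact regularization path algorithm obtains the solution analytically between events; as a rough illustration only (not the book's algorithm), the following Python sketch refits an SVM on a grid of C values and reports how the sizes of the index sets O, M, I change with C. scikit-learn and the synthetic data are assumptions.

```python
# Rough illustration only: refit an SVM on a grid of C values and watch how the
# index sets O (alpha_i = 0), M (0 < alpha_i < C) and I (alpha_i = C) change.
# The actual path-tracking algorithm detects these changes analytically as events.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)

for C in np.logspace(-2, 2, 9):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    alpha = np.zeros(len(X))
    alpha[clf.support_] = np.abs(clf.dual_coef_).ravel()   # alpha_i of support vectors
    n_O = int(np.sum(np.isclose(alpha, 0.0)))
    n_I = int(np.sum(np.isclose(alpha, C)))
    n_M = len(X) - n_O - n_I
    print(f"C = {C:8.3f}   |O| = {n_O:3d}   |M| = {n_M:3d}   |I| = {n_I:3d}")
```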

Chapter 9 Sequential Learning

9.1 Introduction
When there is a change in the data set used for training, the optimal solution is recalculated accordingly.
Incremental decremental learning (ILC)
Efficiently compute a new optimal solution by using the already obtained solution.
E.g., update the model by adding data in a timely manner whenever new measurement information becomes available.
A case where the old data is deleted as new data is added to the time series.
Examples that may occur regardless of the type of data
Calculation of cross-validation method
Iterative optimization on nearly identical data
To speed up the optimization calculation
First, optimize SVM with training data
Then, when optimizing without the cases used for evaluation, the SVM can be learned sequentially starting from the optimal solution obtained on the full data.
In SVM, there is a correspondence between each dual variable αi and case (xi,yi), so adding or reducing a case corresponds to adding or reducing a dual variable.
In SVM, non-support vectors (i.e., cases with αi = 0) do not affect the solution
Adding or removing such cases to or from the training data has no effect
αi for a new case (xi,yi) is not known in advance.
If the current solution results in yif(xi)≥1
The KKT condition is satisfied by setting αi=0
No need to update (a code sketch follows at the end of this section)
Different concept from “online machine learning”.
Online machine learning only increases data, does not consider deletion
In many cases, past data is not retained; only the classifier is kept and updated with each new example
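A minimal sketch of the KKT check described above, assuming scikit-learn and a hypothetical new case x_new: if the current classifier already gives y f(x) ≥ 1, the case can be absorbed with α = 0 and no update is needed; a full incremental–decremental solver would otherwise update the dual variables directly.

```python
# Sketch of the KKT check when a new case arrives (x_new, y_new are hypothetical).
# If y*f(x) >= 1 under the current solution, alpha = 0 satisfies the KKT
# conditions and the solution does not need to be updated.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=2)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = X[0] + 0.01                    # hypothetical new case near an existing one
y_new = 2 * y[0] - 1                   # convert its label from {0,1} to {-1,+1}

margin = y_new * clf.decision_function(x_new.reshape(1, -1))[0]
if margin >= 1.0:
    print("y*f(x) >= 1: alpha = 0 satisfies the KKT conditions, no update needed")
else:
    print("margin violated: the dual solution must be updated (e.g. warm start)")
```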
9.2 Warm Start
Warm start
When solving an optimization problem, the solution of a similar optimization problem is used as the initial point to speed up the optimization.
Also called hot start
Let α0∈ℝn be the currently obtained optimal dual variable.
In the case of adding a case, we directly use α0 as the initial value of the dual variable
Example
Suppose k cases are added and the last k elements of the new dual variable α ∈ ℝn+k correspond to these new cases
Initialize with the vector α = (α0T, 0, …, 0)T (a code sketch follows at the end of this section)
If we want to reduce the number of cases
Described later ...
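A minimal numpy sketch of the warm-start initialization just described (the numbers are made up for illustration): the previous optimal dual variables are reused and the entries for the k added cases are set to 0, which keeps the box constraints 0 ≤ αi ≤ C satisfied, and also the equality constraint Σ yiαi = 0 that α0 already satisfied.

```python
# Minimal numpy sketch of warm-start initialization when k new cases are added.
# alpha0 below is a made-up previous optimal dual vector (n = 5), with C = 1.
import numpy as np

n, k, C = 5, 2, 1.0
alpha0 = np.array([0.0, 0.3, 1.0, 0.0, 0.7])           # previous optimal dual variables

alpha_init = np.concatenate([alpha0, np.zeros(k)])     # alpha = (alpha0^T, 0, ..., 0)^T
# the new zeros keep the box constraints 0 <= alpha_i <= C (and the equality
# constraint sum_i y_i alpha_i = 0, already satisfied by alpha0) intact
assert np.all((alpha_init >= 0.0) & (alpha_init <= C))
print(alpha_init)                                      # initial point for re-optimization
```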
9.3 Methods Based on Active Sets
Introduction.
Different approaches to sequential learning
How to update from an initial optimal solution to a new solution while monitoring changes in the active set
9.3.1 Derivation of the update direction
9.3.2 Event Detection
Sequential Learning Algorithms Based on Active Sets

Chapter 10 Software and Implementation of Support Vector Machines

10.1 SVM Using the Statistical Analysis Environment R
Introduction
The kernlab package of the statistical analysis environment R
10.1.1 SV Classification
We use the “circle” data provided in the “mlbench” package as benchmark data for two-class SV classification.
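The book's example uses R (the ksvm function in the kernlab package, with the circle data from mlbench). As a rough analogue only, here is a Python sketch on similar circular data; scikit-learn and make_circles are assumptions, not the book's software.

```python
# Rough Python analogue of nonlinear SV classification on circular benchmark data.
# scikit-learn's make_circles stands in for mlbench's "circle" data; the book's
# own example uses R and kernlab.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)   # RBF-kernel SV classification
print("test accuracy:", clf.score(X_te, y_te))
```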
10.2 Implementation of LIBSVM Software
10.3 Flow of the LIBSVM Algorithm
10.3.1 Initialization
10.3.2 Stop Condition
10.3.3 Shrinking
10.3.4 Selecting the Second Working Set

Chapter 11 Structured Support Vector Machines

11.1 Introduction
We deal with the case where the target variable y to be predicted is not a simple number but has structure.
Examples of structured data
For example, data represented as a tree structure or a sequence
In natural language processing, the grammatical relationship between words is represented as a tree structure.
In the problem of protein similarity search, proteins are represented as amino acid sequences
Example: Consider prediction of a syntactic tree
Two possible parse trees for "Fruit flies like a banana"
Let the input space be X and the output space of structured data be Y
The problem of predicting data with structure
It can be viewed as a problem of learning a function g:X→Y that returns y∈Y from x∈X
Let f: X×Y → ℝ be a function that evaluates how well an output y ∈ Y fits an input x ∈ X.
The predictive model g(x) is the above equation
The space Y is often huge, making the argmax and the optimization computationally difficult (a toy sketch with a small enumerable Y follows at the end of this section)
Structured support vector machine
Apply the concept of maximum margin to problems where the output variable has a structure
Efficient use of the “cutting plane method” to deal with constraints that increase with the number of elements in Y
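A toy sketch of the prediction rule g(x) = argmax_y f(x, y) with a linear f(x, y) = ⟨w, Ψ(x, y)⟩, assuming (unrealistically) that the output space Y is small enough to enumerate; the feature map, weights, and class names below are made up for illustration. In real structured problems the argmax requires a combinatorial algorithm such as CKY.

```python
# Toy sketch: f(x, y) = <w, Psi(x, y)> and g(x) = argmax_y f(x, y) over a small,
# explicitly enumerable output space Y.  The feature map, weights and the output
# names are made up; real structured outputs (e.g. parse trees) need a
# combinatorial argmax such as the CKY algorithm.
import numpy as np

Y = ["A", "B", "C"]                                   # hypothetical tiny output space
d = 2                                                 # input feature dimension

def psi(x, y):
    """Joint feature vector Psi(x, y): put x into the block belonging to output y."""
    v = np.zeros(d * len(Y))
    j = Y.index(y)
    v[j * d:(j + 1) * d] = x
    return v

def g(x, w):
    """Prediction g(x) = argmax_{y in Y} <w, Psi(x, y)>."""
    return max(Y, key=lambda y: w @ psi(x, y))

w = np.array([1.0, -1.0, 0.5, 0.5, -1.0, 1.0])        # would normally be learned
print(g(np.array([1.0, -0.5]), w))                    # predicted structured output
```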
11.2 Maximum Margin in Coupled Feature Vector Space
Consider a vector Ψ(x,y) defined by the combination of x and y.
Consider f(x,y) as a linear function of Ψ(x,y) as shown in the above equation
The vector Ψ(x,y) is called the joint feature vector.
Example of a joint feature vector
In order to make a correct prediction for a case i, the inequality above must hold
Y\yi denotes the set Y with yi removed
If we define a new vector δΨi(y) = Ψ(xi,yi) − Ψ(xi,y), the inequality can be rewritten as above
If we regard δΨi(y) as a feature vector, the margin maximization problem can be formulated as shown in the above equation.
As in ordinary SVM, we can rewrite the variables to obtain the above equation
In this formulation, all yi are assumed to be correctly predictable (separable), so we add a relaxation analogous to the soft margin to the above equation
To solve the convex quadratic optimization problem as in the usual SVM, we derive the dual problem
Lagrange function
Equations obtained by differentiating with respect to the primal variables
The dual problem is derived from the above two equations, as shown above
Organize the elements of the matrix Q
Q can be expressed by the kernel as above
11.3 Optimization Method
Introduction
An optimization method called the cutting plane method
Instead of dealing with all constraints at once, the constraint most violated by the current solution is added one at a time (a toy sketch follows at the end of this section)
Can be optimized by adding only some of the constraints
11.3.1 Single Slack Variable Formulation
11.3.2 Cutting Plane Method
Algorithm for the computational procedure of structured SVMs based on the cutting plane method
Parse tree prediction using the Cocke–Kasami–Younger (CKY) algorithm
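The following is a toy Python sketch of the cutting plane idea on a deliberately simple problem (minimize ½‖w‖² subject to many linear constraints), not the book's structured-SVM formulation: only the constraint most violated by the current solution is added at each step, and the small working-set problem is re-solved. scipy and the random constraint pool are assumptions for illustration.

```python
# Toy sketch of the cutting plane method: minimize 0.5*||w||^2 subject to a large
# pool of constraints a_j . w >= 1, adding only the constraint most violated by
# the current solution each round and re-solving over the working set.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 3.0, size=(1000, 2))       # pool of constraints a_j . w >= 1

def solve_working_set(W):
    """Solve min 0.5*||w||^2 s.t. a . w >= 1 for every a in the working set W."""
    cons = [{"type": "ineq", "fun": (lambda w, a=a: a @ w - 1.0)} for a in W]
    return minimize(lambda w: 0.5 * w @ w, x0=np.zeros(2), constraints=cons).x

w = np.zeros(2)
working_set = []
for it in range(20):
    violation = 1.0 - A @ w                     # how much each constraint is violated
    j = int(np.argmax(violation))               # most violated constraint
    if violation[j] <= 1e-6:                    # (almost) all constraints satisfied
        break
    working_set.append(A[j])
    w = solve_working_set(working_set)

print("constraints actually used:", len(working_set), "out of", len(A), "; w =", w)
```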
11.4 Introducing the Loss Function
11.5 Application Example: Ranking Learning
"Learning to rank" as an example application of structured SVM
Given a search term (query) and a set of items (mainly a document set), the problem is to sort the items in order of relevance to the search term.
The function for ranking can be expressed as g:X→Y
Let C = {d1,…,d|C|} be the set of documents di (the corpus).
Let Y be the space of possible rankings for documents in C.
For each search term, each document is given a label {1,0} indicating whether or not it is related to the search term.
We use average precision (AP) as an evaluation criterion for ranking; AP is expressed by the above equation.
Let the true ranking (order) be π and the estimated ranking be π'.
Let r(πi) and r(π'i) denote the relevance {0,1} to the search term of the document ranked i-th in π and π', respectively.
Prec@j is defined by the above equation
Prec@j is the fraction of relevant documents among the top j documents of the estimated ranking π' (a code sketch follows at the end of this section)
From AP, we define the loss function as above
Let π(y) and π(ŷ) be the rankings created by y and ŷ∈Y, respectively.
We calculate AP for multiple search terms and call the average value MAP (mean average precision).
Following the method of structured SVM, we define g(x) as above
f: X×Y → ℝ is a linear model of the joint feature vector
The ranking y is represented as a matrix of pairwise relations
y is a |C|x|C| matrix whose (i,j) element is 1 if di is ranked higher than dj and -1 if it is ranked lower
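A minimal sketch of Prec@j and AP for a single query, assuming the relevance labels r ∈ {0,1} of the estimated ranking are given in rank order; the example ranking is made up.

```python
# Minimal sketch of Prec@j and average precision (AP) for one query.
# r[i] is the {0,1} relevance of the document at rank i+1 of the estimated ranking.
import numpy as np

def average_precision(r):
    r = np.asarray(r, dtype=float)
    ranks = np.arange(1, len(r) + 1)
    prec_at_j = np.cumsum(r) / ranks              # Prec@j = #relevant in top j / j
    return prec_at_j[r == 1].mean() if r.sum() > 0 else 0.0

# relevant documents estimated at ranks 1, 3 and 6
print(average_precision([1, 0, 1, 0, 0, 1]))      # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722
# MAP (mean average precision) is the mean of AP over multiple queries.
```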

Chapter 12 Support Vector Machines for Weak Label Learning

12.1 Introduction.
Consider an input feature x and an output label y
What to do when the output labels are partially missing?
Weakly labeled data
Various formats
When only a small portion of the training examples are given input feature x and output label y pairs, and most of the rest are given only input feature x
When the label information is given to a set of cases rather than to each individual case.
The problem of determining whether a person is in a general image
Candidate objects are extracted from the image by preprocessing such as segmentation
The extracted objects are used as cases
Semi-supervised learning
memo
https://products.sint.co.jp/aisia/blog/vol1-20
Comparison of supervised learning and unsupervised learning
Models for semi-supervised learning
Classifier-based methods (Bootstrap)
Here, "bootstrap" means doing something automatically, by itself.
A model similar to a child learning to recognize a dog or a cat on its own.
There are two types of methods
Self Training with a single classifier
Co-Training (or Multiview) with multiple classifiers
An old rule-based classification method.
The image is that of becoming friends with the people you meet one after another.
Flow
Semi-supervised SVM
Self-training
Example: train first on the labeled data only
Then use both the labeled and the unlabeled data
Assign labels to the unlabeled data classified with high confidence (a code sketch follows at the end of this memo)
Co-Training
Use two classifiers
Example: To classify a web page
A classifier that uses the written document as a feature (view1)
Classifier that uses the text of hyperlinks as features (view2)
Transductive learning and inductive learning
Active learning
Data-based methods Graph-based algorithms
Semi-supervised k-Nearest Neighbor
Around each labeled data point, take a region containing k unlabeled points and assign them the same label
A semi-supervised mixture-of-Gaussians model is also available
Closeness is evaluated as a probability
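A minimal sketch of the self-training (bootstrap) procedure described in this memo: train on the labeled data, pseudo-label the unlabeled cases the classifier is most confident about, and retrain. scikit-learn, the synthetic data, and the batch size of 20 are assumptions for illustration.

```python
# Minimal sketch of self-training: fit on the labeled data, pseudo-label the most
# confidently classified unlabeled cases, and refit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_lab, y_lab = X[:30], y[:30]                    # small labeled set
X_unl = X[30:]                                   # unlabeled pool (labels withheld)

for _ in range(5):                               # a few self-training rounds
    clf = SVC(kernel="rbf", C=1.0).fit(X_lab, y_lab)
    if len(X_unl) == 0:
        break
    conf = np.abs(clf.decision_function(X_unl))  # distance from the boundary = confidence
    pick = np.argsort(conf)[-20:]                # the 20 most confident unlabeled cases
    X_lab = np.vstack([X_lab, X_unl[pick]])
    y_lab = np.concatenate([y_lab, clf.predict(X_unl[pick])])   # pseudo-labels
    X_unl = np.delete(X_unl, pick, axis=0)

print("labeled set grew from 30 to", len(X_lab), "cases")
```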
12.2 SVM for Semi-supervised Learning
Introduction
In S3VM, unlabeled data is also taken into account so that the decision boundary passes through regions with as few unlabeled points as possible.
Introduction to Semi-Supervised Learning (Synthesis Lectures on Artificial Intelligence and Machine Learning)
Semi-Supervised Learning (Adaptive Computation and Machine Learning series)
http://yamaguchiyuto.hatenablog.com/entry/machine-learning-advent-calendar-2014
12.2.1 Semi-Supervised Two-Class Classification Problem
SVM for semi-supervised learning
Semi-supervised Two-Class Classification Problem Formulation
Labeled Training Example
Labeled case
Unlabeled training case
Unlabeled case
Let the input be xi ∈ ℝd, i ∈ L∪U
Let the output be yi∈{-1,+1},i∈L
12.2.2 Semi-supervised SVM
12.2.3 Non-Convex Optimization of Semi-Supervised SVM
12.2.4 Example of a Semi-supervised SVM
Figure 12.3.
12.3 SVM for Multi-instance Learning
12.3.1 What is multi-instance learning?
The labels are not given to the training examples here, but to a set of training examples called a bag.
A bag consists of multiple cases (instances).
If a bag contains at least one positive case, it is called a positive bag.
Find the classification boundary while estimating the labels of the cases in the positive bag.
If all the cases included in a bag are negative, we call it a negative bag
Example of multi-instance learning
There are three positive bags represented by red
Some of the reds are negative
There are two negative bags represented by blue
We know that all blue are negative
Application domain
Drug activity analysis to identify molecules that act as drugs
Individual molecules are the bags
Substructures within a molecule are the cases
Substructures that bind to the target protein are positive cases
Substructures that do not bind are negative cases
A molecule that has at least one binding substructure is a positive bag
A molecule consisting only of non-binding substructures is a negative bag
Image processing
Whether a particular object is included among multiple objects
Example: human or not
12.3.2 Multi-instance SVM
Introduction
Classification Examples
(1) mi-SVM
Estimates labels for all instances and learns based on the estimated labels for each instance (a toy sketch follows at the end of this section)
Algorithm
(2) MI-SVM
Determine a representative case for each bag, and use the representative case and the label of the bag.
Algorithm
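As a rough illustration of the mi-SVM idea above (not the book's exact algorithm), the following Python sketch alternates between training an SVM on the current instance labels and re-estimating the labels of instances in positive bags, forcing at least one positive instance per positive bag; the toy bags and scikit-learn are assumptions.

```python
# Rough sketch of the mi-SVM idea: alternate between training an SVM on the current
# instance labels and re-estimating labels of instances in positive bags, keeping
# at least one positive instance per positive bag.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_bag(positive):
    # positive bags: 2 instances near (+2,+2) and 3 near (-2,-2); negative bags: all 5 near (-2,-2)
    n_pos = 2 if positive else 0
    pos = rng.normal(loc=2.0, scale=0.5, size=(n_pos, 2))
    neg = rng.normal(loc=-2.0, scale=0.5, size=(5 - n_pos, 2))
    return np.vstack([pos, neg])

bags = [make_bag(True) for _ in range(3)] + [make_bag(False) for _ in range(3)]
X = np.vstack(bags)
bag_of = np.repeat(np.arange(6), 5)                     # bag index of each instance
y = np.repeat([1, 1, 1, -1, -1, -1], 5).astype(float)   # start from the bag labels

for _ in range(10):                                     # alternating optimization
    f = SVC(kernel="linear", C=1.0).fit(X, y).decision_function(X)
    y_new = np.where(f >= 0, 1.0, -1.0)
    y_new[bag_of >= 3] = -1.0                           # negative bags: all instances negative
    for b in range(3):                                  # positive bags: at least one positive
        idx = np.where(bag_of == b)[0]
        if np.all(y_new[idx] == -1.0):
            y_new[idx[np.argmax(f[idx])]] = 1.0
    if np.array_equal(y_new, y):
        break
    y = y_new

print("estimated instance labels:", y.astype(int))
```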
