Statistical Learning Theory


Theories about the statistical properties of machine learning algorithms make it possible to explain their performance theoretically, to clarify how data set size and model complexity affect that performance, and thereby to improve model selection and the learning process. These theories include the following.

  • Theory of generalization error: Machine learning requires high performance not only on the training data but also on unseen data. The error on unseen data is called the generalization error, and the theory of generalization error studies how factors such as model complexity and training data size affect it.
  • Statistical Learning Theory: This theory shows that an optimal learning algorithm exists and makes it possible to derive an upper bound on its performance. It also allows one to study how the performance of a learning algorithm changes as the training data size increases.
  • Theory of Stochastic Gradient Descent: Stochastic gradient descent is an effective learning algorithm for large data sets. The theory provides probabilistic guarantees that the parameters obtained by stochastic gradient descent converge to the true parameters (a minimal sketch follows this list).
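
As a concrete illustration of the last item, the following is a minimal stochastic gradient descent sketch for least-squares regression; the data-generating model, step-size schedule, and iteration count are illustrative assumptions rather than part of the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = <w_true, x> + noise (an illustrative assumption).
n, d = 1000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

# Stochastic gradient descent on the squared loss with a decreasing step size.
w = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                      # draw one example at random
    grad = (X[i] @ w - y[i]) * X[i]          # gradient of 0.5 * (x_i^T w - y_i)^2
    w -= grad / t                            # step size 1/t
print("parameter error:", np.linalg.norm(w - w_true))
```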

The main focus here is on statistical learning theory: (a) the uniform law of large numbers, (b) universal kernels, and (c) discriminant fitting loss. Many works, both Japanese and international, describe (a), and Foundations of Machine Learning (Mohri et al., MIT Press) develops the theory of (a) with Rademacher complexity as its main axis. On the other hand, few works systematically explain (b) and (c).

The theory of the statistical properties of machine learning algorithms is known as statistical learning theory. Statistical learning theory provides a theoretical framework for the probabilistic properties of, and the optimization involved in, learning from data, and it covers a variety of topics related to the statistical properties of machine learning algorithms, as outlined below.

For the mathematical approach, we refer the reader to the book “Machine Learning Professional Series: Statistical Learning Theory Reading Memorandum”.

Basic theory rooted in probability and statistics is essential to mastering learning methods, and this course provides a natural introduction to important concepts such as the kernel method, support vector machines, and boosting. You can learn methods that apply to real-world data, from binary to multi-valued discrimination.

Statistical learning theory deals mainly with two types of losses: prediction loss and empirical loss. By examining the relationship between these losses, the prediction accuracy and other properties of a learning algorithm can be evaluated quantitatively. For the case where the hypothesis set is a finite set, we evaluate the prediction loss of the learned hypothesis. What we are about to describe is the essence of statistical learning theory.
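
As a hedged numerical illustration of the two losses: below, the empirical loss is the average 0-1 loss on a training sample, and the prediction loss is approximated by the average loss on a large independent sample from the same distribution. The distribution and the fixed hypothesis are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(m):
    """Draw m labeled points: x ~ N(0,1), label sign(x) flipped with probability 0.1."""
    x = rng.normal(size=m)
    y = np.sign(x)
    y[rng.random(m) < 0.1] *= -1
    return x, y

h = lambda x: np.sign(x - 0.2)        # a fixed hypothesis (threshold chosen arbitrarily)

x_tr, y_tr = sample(50)               # training sample
x_big, y_big = sample(100_000)        # large sample standing in for the data distribution

empirical_loss = np.mean(h(x_tr) != y_tr)     # empirical 0-1 loss on the training data
prediction_loss = np.mean(h(x_big) != y_big)  # Monte Carlo estimate of the prediction loss
print(empirical_loss, prediction_loss)
```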

The VC dimension and Rademacher complexity are measures of the complexity of a hypothesis set. These measures control the relationship between predictive loss and empirical loss. Each is discussed in detail below.
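
To make the second measure concrete, here is a hedged Monte Carlo sketch of the empirical Rademacher complexity \( \hat{\mathcal{R}}_n(H) = \mathbb{E}_\sigma\bigl[\sup_{h\in H} \frac{1}{n}\sum_i \sigma_i h(x_i)\bigr] \) for a small finite hypothesis set; the sample and the hypothesis set of threshold classifiers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# A fixed sample and a small finite hypothesis set of one-dimensional threshold
# classifiers; both are illustrative assumptions.
x = rng.normal(size=30)
H = np.array([np.sign(x - t) for t in np.linspace(-2, 2, 9)])   # shape (|H|, n)

def empirical_rademacher(H, n_draws=5000):
    """Monte Carlo estimate of E_sigma[ sup_{h in H} (1/n) sum_i sigma_i h(x_i) ]."""
    n = H.shape[1]
    sups = [np.max(H @ rng.choice([-1.0, 1.0], size=n)) / n for _ in range(n_draws)]
    return float(np.mean(sups))

print("empirical Rademacher complexity:", empirical_rademacher(H))
```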

Although discrimination problems evaluate the accuracy of hypotheses in terms of the 0-1 loss, many learning algorithms use other losses that are easier to minimize. Here we discuss the justification for replacing the loss in binary discrimination and one of the most important classes of losses, discriminant fitting losses. (Ramp loss, convex margin loss, non-convex Φ-margin loss, discriminant conformity, robust support vector machine, discriminant conformity theorem, L2-support vector machine, squared hinge loss, logistic loss, hinge loss, boosting, exponential loss, discriminant conformity theorem for convex margin losses, Bayes rule, prediction Φ-loss, prediction discriminant error, monotone non-increasing convex function, empirical Φ-loss, empirical discriminant error)
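
The following sketch tabulates several of the margin losses named above as functions of the margin \(m = y f(x)\); the grid of margin values is only for illustration.

```python
import numpy as np

# Each surrogate loss is a function of the margin m = y * f(x); the 0-1 loss is
# included for comparison.
def zero_one(m):      return (m <= 0).astype(float)
def hinge(m):         return np.maximum(1.0 - m, 0.0)                   # hinge loss
def squared_hinge(m): return np.maximum(1.0 - m, 0.0) ** 2              # squared hinge (L2-SVM)
def logistic(m):      return np.log1p(np.exp(-m))                       # logistic loss
def exponential(m):   return np.exp(-m)                                 # exponential loss (boosting)
def ramp(m):          return np.minimum(np.maximum(1.0 - m, 0.0), 1.0)  # ramp loss (non-convex)

margins = np.linspace(-2, 2, 9)
for name, loss in [("0-1", zero_one), ("hinge", hinge), ("sq-hinge", squared_hinge),
                   ("logistic", logistic), ("exponential", exponential), ("ramp", ramp)]:
    print(f"{name:12s}", np.round(loss(margins), 2))
```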

Kernel functions and the reproducing kernel Hilbert spaces used in kernel methods are described. The representer theorem, which is important from the viewpoint of applications, will also be discussed. In addition, the basic properties of universal kernels, which have high expressive power, will be described.

The estimator \(\hat{f}(x)\) based on the kernel function is contained in the linear model M and is given by a linear combination of the functions \(\Phi(x)^{\top}\Phi(x')\). When a general kernel function is used in place of \(\Phi(x)^{\top}\Phi(x')\), the estimator is given by a linear combination of kernel functions. We therefore define the linear space \(H_0\) generated by linear combinations of kernel functions as follows.
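
A standard way to write this space, assuming the usual construction from a kernel function \(k\) on the input space \(\mathcal{X}\), is

\[
H_0 = \Bigl\{\, f(\cdot) = \sum_{i=1}^{m} \alpha_i\, k(\cdot, x_i) \;\Bigm|\; m \in \mathbb{N},\ \alpha_1,\dots,\alpha_m \in \mathbb{R},\ x_1,\dots,x_m \in \mathcal{X} \,\Bigr\}.
\]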

When the estimator is constructed using a linear model M, the learned function \(\hat{f}(x)\) is represented as a linear combination of the functions \(k(\cdot,x_i),\ i=1,\dots,n\), corresponding to the data points \(x_1,\dots,x_n\). This property can be summarized as the representer theorem in general reproducing kernel Hilbert spaces.
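
In its standard form (stated here as an assumption about the intended theorem), the representer theorem says that any minimizer of a regularized empirical loss over the reproducing kernel Hilbert space \(H\),

\[
\hat{f} = \operatorname*{arg\,min}_{f \in H}\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(y_i, f(x_i)\bigr) + \lambda \|f\|_H^2 ,
\]

can be written as a linear combination of kernel functions centered at the data points:

\[
\hat{f}(\cdot) = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i), \qquad \alpha_1,\dots,\alpha_n \in \mathbb{R}.
\]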

Next, we describe a learning algorithm that uses the reproducing kernel Hilbert space H as a statistical model. The uniform law of large numbers is used to evaluate the prediction accuracy, and for this it is necessary to obtain the Rademacher complexity. Here, we evaluate the Rademacher complexity of a bounded subset of the reproducing kernel Hilbert space H.
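
A commonly used bound of this type (stated here under the usual assumptions, not quoted from the text) is: for the ball \(B_M = \{ f \in H : \|f\|_H \le M \}\) and a sample \(x_1,\dots,x_n\), the empirical Rademacher complexity satisfies

\[
\hat{\mathcal{R}}_n(B_M) \;\le\; \frac{M}{n} \sqrt{\sum_{i=1}^{n} k(x_i, x_i)} .
\]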

When the reproducing kernel Hilbert space is infinite dimensional, it can be expected to approximate a wide class of functions. In this section, we describe the reproducing kernel Hilbert space corresponding to a kernel function called a universal kernel, as a statistical model whose approximation error for continuous functions is sufficiently small.
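
A standard example of a universal kernel is the Gaussian (RBF) kernel. The sketch below uses kernel ridge regression with a Gaussian kernel to approximate a continuous function on a compact interval; the target function, kernel width, and ridge parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_kernel(x1, x2, gamma=10.0):
    """Gaussian (RBF) kernel k(x, x') = exp(-gamma * (x - x')^2) on the real line."""
    return np.exp(-gamma * (x1[:, None] - x2[None, :]) ** 2)

# Target: a continuous function on [0, 1], approximated by kernel ridge regression.
f_target = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * x)
x_train = rng.uniform(0, 1, size=100)
y_train = f_target(x_train)

lam = 1e-3                                               # ridge parameter (assumed)
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

x_grid = np.linspace(0, 1, 200)
f_hat = gaussian_kernel(x_grid, x_train) @ alpha
print("max approximation error on the grid:", np.max(np.abs(f_hat - f_target(x_grid))))
```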

As a representative example of a kernel method, we discuss the C-support vector machine, for which statistical consistency can be proved. Support vector machine (SVM) is a generic term for a family of typical learning algorithms in machine learning. In this section, we describe the C-support vector machine and the ν-support vector machine, which are learning algorithms for binary discrimination. Here, C and ν denote regularization parameters.
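
The following is a hedged sketch of the C-support vector machine in its primal form, minimizing the regularized hinge loss by subgradient descent; the data, step-size schedule, and iteration count are illustrative assumptions, and practical implementations usually solve the dual problem instead.

```python
import numpy as np

rng = np.random.default_rng(4)

# Roughly linearly separable two-dimensional data (an illustrative assumption).
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

# Subgradient descent on the primal C-SVM objective
#   (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
C = 1.0
w, b = np.zeros(2), 0.0
for t in range(1, 2001):
    margins = y * (X @ w + b)
    active = margins < 1.0                               # points with nonzero hinge loss
    grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    eta = 0.01 / np.sqrt(t)                              # step size (assumed schedule)
    w -= eta * grad_w
    b -= eta * grad_b

print("training error:", np.mean(np.sign(X @ w + b) != y))
```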

The ν-support vector machine is an algorithm in which the regularization parameter C of the C-support vector machine is replaced by a parameter ν that has a clearer meaning.

Let the loss of the discriminant function f(x)+b for data (x,y) be measured by the hinge loss Φhinge. In this case, even if sign(f(xi)+b)=yi, a nonzero loss is incurred when the margin mi=yi(f(xi)+b) is less than 1; the threshold 1 is fixed in advance as a constant. Alternatively, one can use a loss function whose threshold varies with the data, incurring a loss whenever the margin falls below some positive number ρ. This is achieved by setting the loss function to max{ρ-mi,0}. However, if ρ is simply left variable, it suffices to choose a small ρ for any data, and the loss can no longer be measured properly. Therefore, we add a penalty term -νρ that discourages a small threshold ρ, and define the loss function as follows.
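
In the standard ν-support vector machine formulation (assumed here, with \(m_i = y_i(f(x_i)+b)\) and \(\rho \ge 0\)), the data-dependent part of the objective becomes

\[
\frac{1}{n}\sum_{i=1}^{n}\max\{\rho - m_i,\, 0\} \;-\; \nu\rho ,
\]

which is minimized together with the regularization term over \(f\), \(b\), and \(\rho\).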

In this article, we discuss ensemble learning, a learning method that combines simple learning algorithms. As a representative example, we discuss boosting and give a theoretical evaluation of its prediction accuracy.
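
As a hedged illustration of boosting, the sketch below implements AdaBoost with decision stumps as the simple base learner; the toy data and the number of rounds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: the label marks whether a point lies inside a diamond-shaped region
# (an illustrative assumption).
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = np.where(np.abs(X[:, 0]) + np.abs(X[:, 1]) < 1.0, 1.0, -1.0)

def stump_predict(X, j, thr, s):
    """Decision stump: predict s if X[:, j] > thr, otherwise -s."""
    return np.where(X[:, j] > thr, s, -s)

def best_stump(X, y, w):
    """Return the stump (j, thr, s) minimizing the weighted 0-1 error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in X[:, j]:
            for s in (1.0, -1.0):
                err = np.sum(w[stump_predict(X, j, thr, s) != y])
                if err < best_err:
                    best, best_err = (j, thr, s), err
    return best, best_err

w = np.ones(n) / n                 # example weights
stumps, alphas = [], []
for _ in range(30):                # 30 boosting rounds (assumed)
    (j, thr, s), err = best_stump(X, y, w)
    err = max(err, 1e-12)
    alpha = 0.5 * np.log((1.0 - err) / err)        # weight of this stump
    pred = stump_predict(X, j, thr, s)
    w = w * np.exp(-alpha * y * pred)              # exponential-loss reweighting
    w /= w.sum()
    stumps.append((j, thr, s))
    alphas.append(alpha)

# Final classifier: sign of the weighted vote over the stumps.
F = sum(a * stump_predict(X, j, thr, s) for (j, thr, s), a in zip(stumps, alphas))
print("training error:", np.mean(np.sign(F) != y))
```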

In the multi-valued discrimination problem, we learn a discriminant h:𝒳→𝒴 that assigns elements of the input space 𝒳 to elements of a finite set 𝒴={1,…,L}. Tasks such as reading digits and letters from image data, or tagging text data, can be formulated as multi-valued discrimination. Such problems can sometimes be treated in the same way as binary discrimination, but they also present difficulties unique to the multi-valued case. In this section, we discuss standard loss functions and statistical models for multi-valued discrimination, as well as theoretical methods for analyzing the prediction discriminant error.
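
In a standard formulation (assumed here), the discriminant is built from class-wise score functions \(g_y\), one per label, and the prediction discriminant error is the probability of misclassification:

\[
h(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} g_y(x), \qquad \mathcal{R}(h) = \Pr\bigl(h(X) \neq Y\bigr).
\]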

In order to learn discriminant functions efficiently, it is necessary to design algorithms using loss functions that are easy to compute. On the other hand, prediction accuracy is usually evaluated in terms of the prediction discriminant error. Therefore, the loss function used during learning and the loss function used to evaluate prediction accuracy generally differ. For binary discrimination, this was discussed in “Overview of Discriminant Adaptive Loss in Statistical Mathematics Theory”; here, we discuss the discriminant adaptation loss for multi-valued discrimination.
