Machine Learning Professional Series Statistical Learning Theory Reading Notes

Summary

Statistical learning theory studies the statistical properties of machine learning algorithms. It can be used to analyze the performance of learning algorithms and the effects of data set size and model complexity, and thereby to improve model selection and the learning process. This article describes statistical learning theory based on “Statistical Learning Theory” in the Machine Learning Professional Series.

In this article, I will discuss the reading notes.

Machine Learning Professional Series Statistical Learning Theory Reading Notes

“Impressive, well explained! Clear, concise, and easy to understand! Basic theory rooted in probability and statistics is essential to mastering learning methods. This book provides a natural introduction to key concepts such as the kernel method, support vector machines, and boosting, and allows you to learn methods for real-world data, from binary to multi-class discrimination.”

Preface

Theory of Statistical Properties of Machine Learning Algorithms
Uniform law of large numbers
Numerous national and international publications
Foundations of machine learning
Universal kernel
Discriminant-conformal loss

Chapter 1 Framework of Statistical Learning Theory

1.1 Problem Setting

Introduction
We want to use past experience for future decision making and prediction.
There must be some correlation between what has happened in the past and what is likely to happen in the future.
Nothing exactly the same as what happened in the past will happen in the future
Probability theory and statistics are used to describe the "loose connection" between past experiences and future events.
The term "learning" is used to describe the process of extracting and using useful information from observed events
Data
Information obtained by observation is called "data"
Input data
The set of possible values for input x is called the "input space"
When the output data takes values in a finite set, the values are called labels.
When there are two possible labels, input-output data (x,y) is called binary data, and when there are three or more labels, it is called multivalued data.
Hypotheses
A function from the input space to the output set is called a "hypothesis"
A set of hypotheses is called a "hypothesis set"
Discriminant function
A hypothesis that takes values in a finite set is called a "discriminant"
A real-valued function or vector-valued function used to describe a discriminant is called a discriminant function.
Loss function
A function that measures the error between an output value and a prediction result is called a loss function.
The larger the value of the loss function, the larger the error or loss.
The main theme of "machine learning" is to design algorithms that learn appropriate hypotheses from observed data.
Statistical learning theory
A background framework that evaluates the predictive accuracy of hypotheses obtained by learning algorithms and provides guidelines for improving performance

1.1.1 Discrimination Problem

When the output takes values in a finite set of labels 𝒴, the problem of predicting the corresponding label from the input data is called the classification (discrimination) problem
When there are two candidate labels (|𝒴| = 2), it is a binary classification problem
When there are three or more candidate labels (|𝒴| ≥ 3), it is a multi-class classification problem
Examples of classification problems
Unsolicited email classification
Input data
Email text
Output Label
Unsolicited email (spam) and normal email (non-spam)
ℓerr(ŷ, y) = 1[ŷ ≠ y] = { 1 (ŷ ≠ y), 0 (ŷ = y) }
This loss is called the 0-1 loss
Credit card nonpayment classification
The damage when a customer who will not pay is classified as one who will pay differs from the damage when a customer who will pay is classified as one who will not
Extend the 0-1 loss
ℓ(ŷ, y) = { l (ŷ ≠ y), 0 (ŷ = y) }
The loss for a misclassification is set to a cost l (which may depend on the label) rather than 1

1.1.2 Regression Problem

The problem of predicting output from input data when the output is a real number is called a regression problem
Example: Predicting stock prices or electricity
Cannot predict a value that exactly matches the output
Use squared loss as a loss function to measure the accuracy of prediction
ℓ(ŷ, y) = (ŷ - y)²
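As a concrete illustration of the loss functions above, here is a minimal Python sketch (my own illustration, not from the book; the data values are made up) that evaluates the 0-1 loss for a discrimination problem and the squared loss for a regression problem.

    import numpy as np

    def zero_one_loss(y_pred, y_true):
        # 0-1 loss: 1 if the predicted label differs from the true label, else 0
        return np.where(y_pred != y_true, 1.0, 0.0)

    def squared_loss(y_pred, y_true):
        # squared loss used in regression: (y_hat - y)^2
        return (y_pred - y_true) ** 2

    y_true = np.array([+1, -1, +1, +1])
    y_pred = np.array([+1, +1, +1, -1])
    print(zero_one_loss(y_pred, y_true))         # [0. 1. 0. 1.]
    print(zero_one_loss(y_pred, y_true).mean())  # average 0-1 loss on this data: 0.5

    z_true = np.array([1.0, 2.0, 3.0])
    z_pred = np.array([1.5, 2.0, 2.0])
    print(squared_loss(z_pred, z_true).mean())   # mean squared error ≈ 0.417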

1.1.3 Ranking Problem

A pair of two input data (x, x') ∈ 𝓧2
y=+1 if x is preferred to x', y=-1 otherwise
Input/output data (x, x', y)
h(x)>h(x') if x is preferable, otherwise h(x)≤h(x')
h1=h(x)
h2=h(x')
ĥ=(h1,h2) ∈ ℝ2
Output label y ∈ {+1, -1}
0-1 loss
ℓ(ĥ, y) = { 1 (y(h1 - h2) ≤ 0), 0 (otherwise) }
Example: Web search
For a given search term
the search engine returns a list of pages (x1, x2, ...) as the search results
Quantify which page xi selected by the user is preferred over others
Example: computerized Shogi game
Good or bad candidate moves

1.2 Prediction Loss and Empirical Loss

Let 𝔼(X,Y)~D[...] be the expected value under the probability distribution D that the data (X,Y) follows
When the distribution is clear from the context, this is abbreviated as 𝔼[...]
Prediction loss
Loss function ℓ(ŷ, y)
The prediction loss R(h) of hypothesis h is the expected loss of the prediction h(X) under the distribution of the test data (X,Y)
R(h) = 𝔼(X,Y)~D[ℓ(h(X),Y)].
Prediction discrimination error
The prediction loss under the 0-1 loss ℓerr is written Rerr(h)
In a typical learning problem setting, the distribution of the data is unknown, so predictive loss cannot be computed
Empirical loss
The observed data (X1,Y1), ..., (Xn,Yn)
Explain the input-output relationship with hypothesis h
When the error is measured with the loss function ℓ(ŷ, y), the empirical loss Ȓ(h) of hypothesis h on the observed data is the average loss between the predictions h(Xi) and the observed values Yi, i.e. Ȓ(h) = (1/n) Σi ℓ(h(Xi), Yi)
Empirical classification error
Relationship between predictive loss and empirical loss
Many problems are formulated as problems of finding a hypothesis that minimizes the prediction loss
The exact value of the prediction loss is unknown, but the empirical loss is obtained as an approximation.
Therefore, minimizing the empirical loss is the key to learning an appropriate hypothesis.
To evaluate the accuracy of the learned hypothesis, the difference between the empirical loss and the prediction loss is estimated.
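The following hedged sketch (my own illustration, not from the book; the distribution and hypothesis are made up) estimates the empirical discrimination error of a fixed hypothesis for increasing sample sizes and compares it with the prediction error, showing how the empirical loss approaches the prediction loss as n grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        # toy distribution: X ~ Uniform(-1, 1), Y = sign(X) flipped with probability 0.1
        x = rng.uniform(-1.0, 1.0, size=n)
        y = np.sign(x)
        flip = rng.random(n) < 0.1
        y[flip] *= -1
        return x, y

    def h(x):
        # fixed hypothesis: predict the sign of x
        return np.sign(x)

    # empirical discrimination error R_hat(h) for increasing sample sizes
    for n in [10, 100, 10_000, 1_000_000]:
        x, y = sample(n)
        print(n, np.mean(h(x) != y))
    # the prediction error R(h) of this hypothesis is 0.1 (the label-flip probability),
    # and the empirical error converges to it as n grows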

1.3 Bayes Rule and Bayes Error

The goal of learning is to find a hypothesis that minimizes the prediction error as much as possible
Definition 1.1 Bayes Rule and Bayes Error
Fix a loss function ℓ
The greatest lower bound (infimum) of the prediction loss over all measurable functions h: 𝒳 → 𝒴
is called the Bayes error for the loss function ℓ
A hypothesis that achieves this lower bound, when it exists,
is called a Bayes rule
Given a loss function, Bayes error is a value determined from the probability distribution of the data
Which hypothesis minimizes the prediction loss?
Loss function ℓ(ŷ,y)
Distribution of test data P
Conditional expected value
R(h) = 𝔼X[𝔼Y[ℓ(h(X),Y)|X]]
Conditional expectation at each input X = x
The hypothesis that minimizes the conditional expectation 𝔼Y[ℓ(h(x),Y)|X=x] = ∫ ℓ(h(x),y) dP(y|x) at each x also minimizes R(h)
That is, the Bayes rule is given pointwise by h0(x) = arg min over ŷ ∈ 𝒴 of ∫ ℓ(ŷ, y) dP(y|x)
Example 1.1 (Discriminant Problem) Using 0-1 loss as loss function
Given an input x, the Bayes rule predicts the label with the largest conditional probability of occurrence given x
Example 1.2 (Regression Problem) Bayes Rule in Regression Problems
Example 1.3 (Ranking Problem) As a discriminant problem
Receiver operating characteristic curve (ROC curve)
The area under the curve (AUC) is the area under the ROC curve
If hypothesis h is random regardless of input, the ROC curve is the 45° line (true positive rate = false positive rate) and AUC(h) = 0.5
If we choose a hypothesis that is better than the random prediction, the value of AUC will be greater than 0.5
Bayes rule h0 is given by the hypothesis that maximizes AUC
Bayes error is 1-AUC(h0)
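A minimal sketch (my own illustration, not from the book; the scores and labels are hypothetical) of AUC computed as the fraction of (positive, negative) pairs that the score function orders correctly, which is the pairwise interpretation used above.

    import numpy as np

    def auc(scores, labels):
        # AUC of a score function h: fraction of (positive, negative) pairs
        # ordered correctly, counting ties as 1/2
        pos = scores[labels == +1]
        neg = scores[labels == -1]
        diff = pos[:, None] - neg[None, :]
        return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

    labels = np.array([+1, +1, -1, -1, +1, -1])
    scores = np.array([0.9, 0.4, 0.3, 0.8, 0.7, 0.1])
    print(auc(scores, labels))  # 7/9 ≈ 0.778 on this toy data
    # a score function that ignores the input gives AUC ≈ 0.5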
Bayes Discriminant Rule
Applied to identification problems where a probability distribution is assumed between the observed data X and the class to which it belongs
Example: medical tests: there is a correlation between the value of a test item and health status, but the effect is probabilistic
Bayes' theorem
Posterior probability
Given observed data x, the conditional probability that it belongs to class Ci
Prior probability
Probability of occurrence of class Ci
Likelihood distribution
Probability distribution of the observed data x given the class
Marginal probability
Probability of occurrence of the observed data x
Bayes' theorem: P(Ci|x) = p(x|Ci)P(Ci) / p(x)
Maximum posterior probability criterion
Let x be the observed data and Ci (i = 1, ..., K) the classes; then
Bayes' discriminant rule classifies x into the class with the largest posterior probability defined by the above equation
Bayes' discriminant rule
The discriminant boundary between class Ci and class Cj is where the posterior probabilities are equal
P(Ci|x) = p(x|Ci)P(Ci)/p(x) = p(x|Cj)P(Cj)/p(x) = P(Cj|x)
Discriminant class = arg max over i of p(x|Ci)P(Ci)
Example
Data from a random sample of 1000 people in a town
Derive Bayes' discriminant rules for predicting health status based on drinking and smoking status
Find posterior probabilities for all combinations of features (S,T)
P(G|S,T) = P(S,T|G)P(G) / P(S,T)
Prior probability for each class
P(G=1)=800/1000
P(G=2)=200/1000
For class conditional probability P(S,T|G), assume conditional independence P(S,T|G)=P(S|G)P(T|G) between S and T
Class conditional probability P(S|G) for smoking
P(S=1|G=1)=320/800=2/5
p(S=0|G=1)=480/800=3/5
P(S=1|G=0)=160/200=4/5
P(S=0|G=0)=40/200=1/5
Class conditional probability of drinking P(T|G)
P(T=1|G=1)=640/800=4/5
P(T=0|G=1)=480/800=3/5
P(T=1|G=0)=160/200=4/5
P(T=0|G=0)=40/200=1/5
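To make the maximum-posterior-probability criterion concrete, here is a small sketch (my own illustration with made-up probability values, not the book's table) that classifies a feature pair (S, T) into the class maximizing P(x|Ci)P(Ci), using the conditional-independence assumption P(S,T|G) = P(S|G)P(T|G) described above.

    # hypothetical prior and class-conditional probabilities (illustrative only)
    prior = {1: 0.8, 0: 0.2}                             # P(G)
    p_s   = {1: {1: 0.4, 0: 0.6}, 0: {1: 0.8, 0: 0.2}}   # P(S=s | G)
    p_t   = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.6, 0: 0.4}}   # P(T=t | G)

    def map_class(s, t):
        # Bayes discriminant rule: arg max_g P(s, t | G=g) P(G=g),
        # with the conditional-independence assumption P(s, t | g) = P(s|g) P(t|g)
        scores = {g: p_s[g][s] * p_t[g][t] * prior[g] for g in prior}
        return max(scores, key=scores.get), scores

    for s in (0, 1):
        for t in (0, 1):
            g, scores = map_class(s, t)
            print(f"S={s}, T={t} -> class {g}, unnormalized posteriors {scores}")
    # the values scores[g] are proportional to P(G=g | S, T), since p(S, T) is common to all classes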

1.4 Evaluating the Performance of Learning Algorithms

A learning algorithm is interpreted as a function from a set of observed data to a set of hypotheses
When a learning algorithm is used, let hS denote the hypothesis obtained from the data S = {(X1,Y1), ..., (Xn,Yn)}
How to evaluate the performance of the learning algorithm
Given a loss function, R(hs) is determined as the predicted loss of the learned hypothesis hs.
For various probabilistically obtained observation data S, the prediction loss takes various values
Expected value of the prediction loss over the distribution D of the observed data S
Expected prediction loss
Evaluation of the average performance of the algorithm
Definition 1.2 Statistical Consistency
If a hypothesis that achieves a prediction loss close to the Bayes error can be found, highly accurate prediction can be achieved

1.5 Learning with a finite set of hypotheses

1.5.1 Evaluation of Prediction Discriminant Error

Evaluate the prediction loss of the learned hypothesis when the hypothesis set is finite
Binary Discriminant Problem
For a finite set of hypotheses H = {h1, ..., hT}
Each hypothesis is a function from the input space X to the binary labels {+1, -1}
Given training data S = {(X1,Y1), ..., (Xn,Yn)} following some distribution P, the learning algorithm outputs the hypothesis that minimizes the empirical discrimination error
hS = arg min over h ∈ H of Ȓerr(h)
Corollary 1.3 Hoeffding's inequality
The random variable Z takes values in the bounded interval [0, 1]
The random variables Z1, ..., Zn independently follow the same distribution as Z
For any ε > 0, P(|(1/n) Σi Zi - 𝔼[Z]| ≥ ε) ≤ 2 exp(-2nε²)
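A quick numerical sanity check of Hoeffding's inequality (my own sketch, not from the book): for [0,1]-valued i.i.d. variables, the probability that the sample mean deviates from the expectation by at least ε is at most 2 exp(-2nε²); the simulated frequency should stay below this bound.

    import numpy as np

    rng = np.random.default_rng(1)
    n, eps, trials = 100, 0.1, 100_000

    # Z_i ~ Bernoulli(0.3): bounded in [0, 1], so Hoeffding's inequality applies
    z = rng.random((trials, n)) < 0.3
    deviation = np.abs(z.mean(axis=1) - 0.3)

    empirical = np.mean(deviation >= eps)
    bound = 2 * np.exp(-2 * n * eps**2)
    print(empirical, bound)  # the observed frequency should not exceed the bound (≈ 0.2707)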

1.5.2 Approximation and estimation errors

1.5.3 Regularization

Typical Methods for Learning Hypothesis Sets of Appropriate Size
The larger the set of hypotheses, the larger the estimation error tends to be, and thus the worse the prediction accuracy of the hypothesis obtained as a result of learning can become.
It is necessary to avoid using a large set of hypotheses when a small set of hypotheses is sufficient.
Learning should be conducted in such a way as to minimize the empirical discriminant error while taking into account the penalty for hypotheses.
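A hedged sketch of this idea of regularized learning (my own illustration, not the book's algorithm; it uses the squared loss rather than the discrimination error): minimize the empirical loss plus a penalty on the hypothesis, here a ridge-style squared-norm penalty on the weights of a linear model.

    import numpy as np

    rng = np.random.default_rng(2)

    # toy regression data (illustrative only)
    X = rng.normal(size=(50, 10))
    w_true = np.zeros(10)
    w_true[:2] = [1.0, -2.0]
    y = X @ w_true + 0.1 * rng.normal(size=50)

    def ridge(X, y, lam):
        # minimize (1/n) * ||X w - y||^2 + lam * ||w||^2  (closed-form solution)
        n, d = X.shape
        return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

    for lam in [0.0, 0.1, 10.0]:
        w = ridge(X, y, lam)
        print(lam, np.round(np.linalg.norm(w), 3))
    # a larger penalty weight lam shrinks the weights, effectively restricting the hypothesis set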

Chapter 2. Complexity of Hypothesis Sets

Introduction

We introduce the VC dimension and the Rademacher complexity as measures of the complexity of a hypothesis set.
We explain how these measures control the relationship between prediction loss and empirical loss.

2.1 VC dimension

VC is taken from the initials of Vapnik and Chervonenkis
A complexity measure defined for a hypothesis set, mainly for binary discrimination problems
Can also be extended to the case of multi-valued and regression problems
VC dimension is a quantity to capture the combinatorial nature of a set family, so it has applications in combinatorics, etc.
A hypothesis h ∈ H is a function from the input space X to a label set Y such that |Y|=2
The VC dimension is the largest number of data points for which there exist hypotheses in H realizing every possible labeling, i.e. the largest number of points that H can shatter
Corollary 2.1 Sauer's lemma
When the VC dimension of the hypothesis set H taking values in the binary labels is d, the number of labelings that H can realize on n points is at most Σi=0..d (n choose i); for n ≥ d this is at most (en/d)^d
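To make shattering and the count of realizable labelings concrete, here is a small brute-force sketch (my own illustration, not from the book) for one-dimensional threshold classifiers hθ(x) = sign(x - θ): a single point can be shattered but two points cannot, so this class has VC dimension 1.

    import numpy as np

    def realizable(points):
        # all labelings of `points` realizable by the thresholds h_theta(x) = sign(x - theta)
        pts = np.sort(points)
        # candidate thresholds: below all points, between neighbours, above all points
        thetas = np.concatenate(([pts[0] - 1.0], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1.0]))
        return {tuple(np.where(points > t, 1, -1)) for t in thetas}

    def is_shattered(points):
        # shattered iff every one of the 2^n sign patterns is realized
        return len(realizable(points)) == 2 ** len(points)

    print(is_shattered(np.array([0.0])))       # True  -> VC dimension at least 1
    print(is_shattered(np.array([0.0, 1.0])))  # False -> thresholds cannot shatter 2 points
    # len(realizable(points)) is the number of realizable labelings bounded by Sauer's lemma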
Theorem 2.2
Let d < ∞ be the VC dimension of the hypothesis set H ⊂ {h: 𝓧 → {+1, -1}} taking values in the binary labels, and let the training data (X1,Y1), ..., (Xn,Yn) independently follow the same distribution. Then a bound on the gap between the prediction discrimination error and the empirical discrimination error holds uniformly over H, in terms of the VC dimension d and the sample size n (equation in the book).
Theorem 2.3 Radon's theorem
For any point set S = {x1, ..., xd+2} ⊂ ℝd, there exist partitions S1, S2 such that S = S1 ∪ S2, S1 ∩ S2 = ∅, and conv(S1) ∩ conv(S2) ≠ ∅.
Example: when the dimension of the parameter and the VC dimension do not match
In a linear discriminator, the dimension of the parameter specifying the discriminator and the VC dimension match

2.2 Rademacher Complexity

Rademacher complexity and empirical Rademacher complexity, unlike the VC dimension, are defined naturally for a set of real-valued functions
Definition 2.4 Empirical Rademacher complexity
Let 𝓖 ⊂ {f: X → ℝ} be a set of real-valued functions on the input space X
Denote the set of input points by S = {x1, ..., xn} ⊂ X
Let σ1, ..., σn be independent random variables that take the values +1 and -1 with equal probability
The empirical Rademacher complexity of 𝓖 is Ȓ(𝓖) = 𝔼σ[sup over g ∈ 𝓖 of (1/n) Σi σi g(xi)], where 𝔼σ[∙] is the expectation with respect to σ1, ..., σn
Intuitive interpretation of the empirical Rademacher complexity
If σi g(xi) > 0, the prediction is correct
If σi g(xi) takes large values, g fits the random labels σi well; the complexity thus measures how well the class 𝓖 can fit random labels
Definition 2.5 Rademacher Complexity
If the input points S = (x1, ..., xn) are a random variable following a distribution D
the expected value of the empirical Rademacher complexity of 𝓖 over S is called the Rademacher complexity
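The empirical Rademacher complexity can be approximated numerically. The sketch below (my own illustration, not from the book; the function class and input points are made up) estimates 𝔼σ[sup over g of (1/n) Σi σi g(xi)] by Monte Carlo for a small finite class of functions.

    import numpy as np

    rng = np.random.default_rng(3)

    def empirical_rademacher(G_values, n_draws=10_000):
        # G_values: array of shape (num_functions, n) holding g(x_i) for each g in the class
        # returns a Monte Carlo estimate of E_sigma[ sup_g (1/n) sum_i sigma_i g(x_i) ]
        num_g, n = G_values.shape
        sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
        correlations = sigma @ G_values.T / n        # shape (n_draws, num_functions)
        return correlations.max(axis=1).mean()

    x = np.linspace(-1, 1, 20)
    # a small finite function class: a few fixed functions evaluated at the input points
    G = np.stack([np.sign(x), -np.sign(x), np.tanh(3 * x), np.ones_like(x)])
    print(empirical_rademacher(G))
    # a richer class (more functions) yields a larger value; a single function gives a value near 0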

2.3 Uniform Law of Large Numbers

The uniform law of large numbers is an extension of Theorem 2.2, which evaluates the prediction discrimination error using the VC dimension
The Rademacher complexity corresponds to the error term in the uniform law of large numbers
Theorem 2.7 Uniform law of large numbers

2.4 Proof of Talagrand's Lemma

Chapter 3 Discriminant-Conformal Losses

3.1 Margin loss

Definition 3.1 Margin loss

3.2 Discriminant-Conformal Loss

Definition 3.2 Discriminant-conformal loss

3.3 Discriminant-Conformity Theorem: Convex Margin Loss

Theorem 3.5 Discriminant-conformity theorem for convex margin loss

3.4 Discriminant-Conformity Theorem: General Margin Loss

Theorem 3.6 Discriminant-conformity theorem

Chapter 4: Foundations of Kernel Methods

4.1 Learning with linear models

In discriminant and regression problems, the main objective is to learn the functional relationship between inputs and outputs from the input-output data (x, y).
It is necessary to set up a set of functions in the input space as a statistical model.
Linear model
M = {f(x) = βTΦ(x) | β ∈ ℝD}
Φ(x) = (Φ1(x), ..., ΦD(x))T ∈ ℝD
Map from the input space X to the D-dimensional space ℝD
A function in the model is a linear combination of the basis functions Φ1(x), ..., ΦD(x)
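A minimal sketch of learning with a linear model (my own illustration; the basis functions and data are made up): with polynomial basis functions Φ(x) = (1, x, x², x³), the model f(x) = βTΦ(x) is fit by least squares.

    import numpy as np

    rng = np.random.default_rng(4)

    def phi(x):
        # feature map Phi: R -> R^4 with polynomial basis functions (1, x, x^2, x^3)
        return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

    # toy regression data (illustrative only)
    x = rng.uniform(-1, 1, size=30)
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=30)

    # least-squares fit of f(x) = beta^T Phi(x)
    beta, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
    print(np.round(beta, 3))

    x_test = np.array([-0.5, 0.0, 0.5])
    print(np.round(phi(x_test) @ beta, 3))  # predictions of the learned linear model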

4.2 Kernel functions

4.3 Reproducing kernel Hilbert spaces

4.3.1 Inner product spaces generated from kernel functions
4.3.2 Complete inner product spaces
4.3.3 Reproducing kernel Hilbert spaces and kernel functions
4.3.4 Classification of Hilbert Spaces and Reproducing Kernel Hilbert Spaces

4.4 Representer Theorem
4.5 Rademacher Complexity of Reproducing Kernel Hilbert Spaces
4.6 Universal kernel

Chapter 5 Support Vector Machines

5.1 Introduction
5.2 Hinge Loss
5.3 C-Support Vector Machines

5.3.1 Optimality Conditions for C-Support Vector Machines
5.3.2 Support Vector
5.3.3 Support Vector Ratio and Prediction Discriminant Error
5.3.4 Upper Bound of Prediction Discriminant Error
5.3.5 Statistical Consistency

5.4 ν-Support Vector Machines

5.4.1 Properties of ν-Support Vector Machines
5.4.2 Dual Representation and Minimum Distance Problems
5.4.3 Prediction Discriminant Error Evaluation and Statistical Consistency

Chapter 6 Boosting

6.1 Ensemble Learning
6.2 AdaBoost
6.3 Nonlinear Optimization and Boosting

6.3.1 Derivation of boosting by coordinate descent method
6.3.2 Learning with Weighted Data and Generalized Linear Models

6.4 AdaBoost Error Evaluation

6.4.1 Empirical Discriminant Error
6.4.2 Prediction Discriminant Error

Chapter 7 Multiclass Discrimination

7.1 Discriminant Functions and Discriminators
7.2 Evaluation of Rademacher Complexity and Prediction Discriminant Error
7.3 Discriminant-Conformal Loss
7.4 Loss Functions

7.4.1 Multiclass Margin Loss
7.4.2 Examples of Discriminant-Conformal Losses

7.5 Statistical Consistency
7.6 Proof of the Discriminant-Conformity Theorem for Multiclass Discrimination

Appendix A Probability Inequalities

Appendix B Convex Analysis and Convex Optimization

B.1 Convex Sets
B.2 Convex functions
B.3 Convex optimization

Appendix C Elementary Functional Analysis

C.1 Lebesgue integrals
C.2 Normed spaces and Banach spaces
C.3 Inner product spaces and Hilbert spaces
