Machine Learning by Ensemble Methods: Foundations and Algorithms (Reading Notes)
Ensemble learning is said to be the next trend after deep learning. It is a state-of-the-art family of machine learning methods that trains multiple learners, typically with boosting or bagging, and then combines them. Ensembles are known to be much more accurate than single learners and have been used successfully in many real-world applications.
This book is the Japanese translation of a book by Zhi-Hua Zhou, a world leader in machine learning. Chapter 1 covers the background knowledge needed for ensemble methods, Chapters 2 to 5 cover their core (boosting, bagging, combination methods, and diversity), and Chapters 6 to 8 describe more advanced topics such as ensemble pruning, clustering ensembles, and further extensions. It is a must-read for researchers, engineers, and students involved in artificial intelligence and machine learning.
1. Introduction
1.1 Basic Concepts
The primary goal of machine learning, pattern recognition, and data mining is to create a good model from a data set.
A dataset typically consists of multiple feature vectors
Sometimes called a sample
Each feature vector
A representation of an object consisting of a set of features
Sometimes called an instance
Number of features in a dataset
Sometimes called dimension or dimensionality
Feature
Sometimes called attributes
The process of generating a model from data
Also called learning or training
Performed by a learning algorithm
Learned model
called a hypothesis
or learner
Learning situation
Supervised learning
Unsupervised learning
Predicts the value of a target feature for instances that have not yet been observed
Learned model is a predictor
Does not require labels
Aim is to find data-specific distribution information
Clustering
When labels are categorical (qualitative)
Classification
Learner
Classifier
If the labels are numerical (quantitative)
regression
Learner
fitted regression model
Instances with known labels
Example
Binary classification
Classification with two labels: positive and negative
Basic Learning Objectives
Generalization
Generalize “knowledge” learned from training data to unobserved instances
A good learner
Small generalization error
Requires knowledge of ground-truth label information
Uses test error as an estimate of generalization error
Small prediction error
1.2 Commonly used learning algorithms
1.2.1 Linear discriminant analysis (linear classifier)
Consists of weight vector and bias
y = sign(w^T x + b), where y is the predicted label, w the weight vector, x the instance, and b the bias
Classical linear learning algorithm
Fisher’s Linear Discriminant Analysis (LDA)
Instances of different classes are farther apart, instances of the same class are closer together
Increase the distance between centers of different classes
Reduce the variance within each class
LDA maximizes J(w) = S_B(w) / S_W(w), the ratio of the between-class distance S_B(w) to the within-class variance S_W(w)
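A minimal sketch of Fisher's LDA for the binary case, assuming X1 and X2 are NumPy arrays holding the instances of each class; the closed-form weight vector w proportional to S_W^{-1}(m1 - m2) maximizes J(w), and prediction uses y = sign(w^T x + b).

```python
# Minimal sketch of Fisher's LDA (binary case); X1, X2 are assumed to be
# NumPy arrays of shape (n_i, d) holding the instances of each class.
import numpy as np

def fisher_lda(X1, X2):
    """Weight vector maximizing J(w) = w'S_B w / w'S_W w (closed form)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: sum of the two class scatter matrices
    S_W = (np.cov(X1, rowvar=False) * (len(X1) - 1)
           + np.cov(X2, rowvar=False) * (len(X2) - 1))
    w = np.linalg.solve(S_W, m1 - m2)   # w is proportional to S_W^{-1}(m1 - m2)
    b = -w @ (m1 + m2) / 2              # threshold midway between the projected class means
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)           # y = sign(w'x + b): positive for the X1 class
```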
1.2.2 Decision tree
A set of decision tests in a tree structure
Works as a divide-and-conquer method
Non-leaf-nodes are given a “feature test” called a split
The data given to the node is split into different subsets according to different values for the feature test
Leaf nodes are labeled
They are assigned to instances
A series of feature tests are performed starting from the root node, and the result is obtained when the leaf node is reached
Learning Algorithm
Recursive process
ID3 algorithm
Information gain criterion is used for segmentation selection
The entropy of the training set D is Ent(D) = -Σ_k p_k log2 p_k, where p_k is the proportion of class k in D
When D is partitioned into subsets D1, ..., DK, the entropy decreases; the decrease, Ent(D) - Σ_i (|Di| / |D|) Ent(Di), is the information gain (see the sketch at the end of this subsection)
Issue
Features with many possible values tend to be selected regardless of their relevance to classification
Only categorical features can be handled
C4.5 Algorithm
Selection using gain ratio
A variant of the information gain criterion, normalized by the feature's split information, which reduces the bias toward features with many values
Can also handle numerical features
CART (Classification and Regression Tree) algorithm
Selection using Gini coefficients (Gini index)
Can also handle numerical features
Overfitting
Good performance for training set, but generalization performance is poor
Fits noise in the training data as if it were the ground truth
Pruning
Removes branches that were grown to fit noise
Pre-pruning
Pruning is done when the tree is growing
Post-pruning
Pruning is done after re-examination of a fully grown tree
When the height of the decision tree is 1
Decision stump
Linear classifier
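A minimal sketch of the split-selection quantities named above: entropy and information gain (ID3) and the Gini index (CART). Here `labels` is any 1-D array of class labels and `splits` a list of label arrays for the subsets D1, ..., DK; both names are illustrative.

```python
# Minimal sketch of ID3-style information gain and the CART Gini index.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()          # Ent(D) = -sum_k p_k log2 p_k

def information_gain(labels, splits):
    # Gain = Ent(D) - sum_i |Di|/|D| * Ent(Di)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

def gini(labels):
    # Gini index used by CART: 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()
```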
1.2.3 Neural network (artificial neural network)
Neurons are also called units
Common Neuron Models
M-P model (McCulloch-Pitts Model)
Input signal and connection weight
The sum of the signals is compared to a threshold value called the neuron’s bias
If the sum of the signals is greater than the threshold, the neuron is activated and an output signal is generated by an activation function called the transfer function or squashing function
multi-layer feed-forward network
With a hidden layer
Activation function is a sigmoid function
Training objective is to determine connection weights and neuron biases
If the activation function is differentiable, the entire neural network becomes a differentiable function and can be optimized by gradient descent method
Originally Successful Algorithm
Error back-propagation (BP)
Processes forward from input to hidden layer to output
Error is computed
Back-propagate to optimize connection weights and biases
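A minimal sketch of error back-propagation for a one-hidden-layer feed-forward network with sigmoid activations, trained by gradient descent on a squared-error loss; the XOR toy data, layer sizes, and learning rate are illustrative assumptions.

```python
# Minimal sketch of back-propagation: forward pass, output error,
# then gradient-descent updates of connection weights and biases.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, lr = 2, 4, 1, 0.5
W1, b1 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR as a toy target

for _ in range(5000):
    # forward pass: input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # error at the output, back-propagated through the sigmoids
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent updates
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)
```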
1.2.4 Naive Bayes Classifier
Used as a method to classify text instances
Maximum a Posteriori (MAP) rule
Point estimation of unknowns based on measured data
Statistical parameters themselves are also random variables (parameters are determined probabilistically from the data at hand)
Prior beliefs about what the data should look like (subjective knowledge) can be reflected in the probability distribution, independently of the data at hand
Parameters are estimated probabilistically from both the data and this subjective prior
Empirical or statistical support may be included in the subjective view
Maximum Likelihood Method
Assumption that there is only one statistical parameter that is correct
Likelihood
The probability that the observed data would actually be generated from the probability distribution
The parameter value that maximizes the likelihood is chosen as the estimate
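A minimal sketch contrasting maximum-likelihood and MAP estimation for a Bernoulli parameter; the Beta(2, 2) prior stands in for the "subjective" prior knowledge mentioned above and is an illustrative assumption.

```python
# Minimal sketch: MLE vs. MAP estimation of a Bernoulli parameter theta.
import numpy as np

data = np.array([1, 1, 1, 0, 1])        # observed coin flips (toy data)
heads, n = data.sum(), len(data)

# Maximum likelihood: the single theta that maximizes P(data | theta)
theta_mle = heads / n

# MAP: treat theta itself as a random variable with a Beta(a, b) prior and
# maximize the posterior P(theta | data), which is Beta(a + heads, b + tails)
a, b = 2, 2                              # prior "pseudo-counts" (subjective belief)
theta_map = (heads + a - 1) / (n + a + b - 2)   # mode of the posterior

print(theta_mle, theta_map)              # 0.8 vs. about 0.714
```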
1.2.5 k-Nearest Neighbor Method
Objects that are close in the input space are also close in the output space
Lazy learning approach
No explicit training process, instead simply storing a training set
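A minimal sketch of the lazy-learning idea: `fit` only stores the training set, and prediction votes over the k nearest stored instances (Euclidean distance and k = 3 are illustrative choices).

```python
# Minimal sketch of k-nearest neighbors as lazy learning.
import numpy as np
from collections import Counter

class KNN:
    def fit(self, X, y):                # no explicit training: just store the set
        self.X, self.y = np.asarray(X), np.asarray(y)
        return self

    def predict(self, X, k=3):
        preds = []
        for x in np.asarray(X):
            dist = np.linalg.norm(self.X - x, axis=1)   # distance to every stored instance
            nearest = self.y[np.argsort(dist)[:k]]      # labels of the k closest
            preds.append(Counter(nearest).most_common(1)[0][0])
        return np.array(preds)
```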
1.2.6 Support Vector Machines and Kernel Methods
Large margin classifiers
Separates instances of different classes with the hyperplane that has the maximum margin
Margin
Minimum distance from instances of different classes to the classification hyperplane
Hinge loss
Used to evaluate how well the classifier fits the data
A linear classifier cannot adequately separate classes that are not linearly separable
Maps data points to higher-order feature spaces, linearly separating data that cannot be linearly separated in the original feature space
Computing inner products in the high-dimensional feature space directly is usually expensive, so kernel functions are used to compute them efficiently (the kernel trick)
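A minimal sketch using scikit-learn's SVC on a toy problem that is not linearly separable in the original space; the RBF kernel implicitly maps the data to a higher-dimensional feature space via the kernel trick (the data and kernel choice are illustrative).

```python
# Minimal sketch of a large-margin classifier with the kernel trick.
import numpy as np
from sklearn.svm import SVC

# Toy data: points inside a circle vs. points outside it (not linearly separable).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

linear = SVC(kernel="linear").fit(X, y)     # struggles on this data
rbf = SVC(kernel="rbf").fit(X, y)           # implicit high-dimensional mapping

print(linear.score(X, y), rbf.score(X, y))  # the kernelized model fits far better
```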
1.3 Evaluation and comparison
1.4 Ensemble methods
Three early lines of research contributed to ensemble methods
Combining classifiers
Much studied in the field of pattern recognition
Ensembles of weak learners
AdaBoost
Bagging
Much studied in the field of machine learning
Mixture of experts
Most studied in the field of neural networks
Divide-and-conquer
Mixture of parametric models
Two steps of construction
Generate base learners
They should be accurate and diverse
Combine the base learners
No Free Lunch Theorem
There is no ideal situation where one learning algorithm is consistently better than another
1.5 Application of ensemble methods
1.6 References
2. Boosting
2.1 General Boosting Methods
Boosting is a family of algorithms that can boost a weak learner into a strong learner
It works by repeatedly invoking the given weak learner on data distributions that the boosting procedure constructs
Process
Suppose the instance space is split into three parts X1, X2, X3, each with probability 1/3
Obtain a classifier h1 that can correctly classify in spaces X1 and X2, but misclassifies in X3
Focusing on the points where h1 is wrong, we obtain a classifier h2 that correctly classifies in the spaces X1 and X3, but incorrectly classifies in X2.
Focusing on the incorrect h2, we obtain a classifier h3 that correctly classifies in the spaces X2 and X3, but incorrectly classifies in X1.
Combine h1, h2, and h3
2.2 AdaBoost algorithm
Achieved by minimizing the exponential loss function
Uses an additive weighted combination of the weak learners (see the sketch at the end of this section)
Performed by reweighting
Base learners that cannot handle weighted samples are trained by "re-sampling" instead of re-weighting
A base learner that fails the sanity check is discarded and a new data sample is generated
Sanity check
Checks that the current base learner is better than random guessing
When the number of weak learners is small, it is easy to terminate early because there are no more that satisfy the sanity check
Often happens in multi-class problems
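A minimal sketch of AdaBoost with decision stumps, showing the reweighting step, the additive weighted combination, and the sanity check that stops training when a base learner is no better than random; labels are assumed to be in {-1, +1} and the stump comes from scikit-learn.

```python
# Minimal sketch of AdaBoost with reweighting and a sanity check.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    m = len(X)
    D = np.full(m, 1.0 / m)                 # initial uniform distribution over examples
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        err = D[pred != y].sum()
        if err >= 0.5:                      # sanity check: no better than random guessing
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        D *= np.exp(-alpha * y * pred)      # reweight: emphasize misclassified examples
        D /= D.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # additive weighted combination of the weak learners
    agg = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(agg)
```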
2.3 Real-world examples
2.4 Theoretical Issues
2.4.1 Initial Analysis
2.4.2 Interpreting Margins
2.4.3 Statistical perspective
2.5 Extension to Multiple Classes
2.6 Noise Tolerance
2.7 References
3. Bagging
3.1 Two Ensemble Frameworks
Depending on how the basic learner is generated, there are two frameworks for ensemble methods
Sequential ensemble method
AdaBoost, etc.
Exploits the dependence between base learners
Overall performance is boosted in a residual-decreasing manner
Parallel ensemble method
Base learners are generated in parallel
Bagging, etc.
Exploits the independence between base learners
Incorporates randomness to obtain base learners with low mutual dependence
Errors are dramatically reduced by combining independent base learners
3.2 Bagging Algorithm
Key Elements
Bootstrap
Aggregation
Uses the bootstrap distribution to generate different base learners
Uses bootstrap sampling to obtain the data samples used to train the base learners
Given a training data set containing m examples
From it, m training examples are generated by sampling with replacement
Some original examples appear in the sample more than once, while others do not appear at all
This process is repeated T times
T samples containing m training examples are obtained
Train a base learner from each sample
Aggregate the outputs of the base learners (see the sketch at the end of this section)
Classification
Voting
Regression
Averaging
Example for classification
Put test instances into each classifier and collect all outputs
Take a vote for each label and use the label with the most votes as the prediction
out-of-bag
Out-of-bag prediction for x
Prediction by a trained learner based on a sample that does not contain x
Example
Application to decision trees
Estimation of posterior probabilities at each node of each tree
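A minimal sketch of Bagging as described in this section: T bootstrap samples drawn with replacement, one base learner per sample, and majority voting at prediction time (the decision-tree base learner and non-negative integer class labels are illustrative assumptions).

```python
# Minimal sketch of Bagging: bootstrap sampling + voting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, base=DecisionTreeClassifier, seed=0):
    m = len(X)
    rng = np.random.default_rng(seed)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # bootstrap: m indices drawn with replacement
        learners.append(base().fit(X[idx], y[idx]))
    return learners

def bagging_predict(X, learners):
    votes = np.array([h.predict(X) for h in learners])   # T x n matrix of label votes
    # majority vote per test instance (assumes non-negative integer class labels)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```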
3.3 Real-world examples
3.4 Theoretical problems
3.5 Ensembles of Random Trees
3.5.1 Random Forest
Random Forest is a typical state-of-the-art ensemble method
One of the extensions of Bagging
The main difference from Bagging is that it incorporates random feature selection
In each step of the partition selection, which is a component of the decision tree, a subset of features is first randomly selected, and then the usual partition selection is performed on the selected subset of features.
Algorithm
The parameter K controls the randomness to be incorporated
The number proposed for K is the logarithm of the number of features
Randomness is introduced only in the feature selection and not in the partitioning over the selected features
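A minimal sketch of this configuration with scikit-learn's RandomForestClassifier, where max_features="log2" corresponds to setting K to the logarithm of the number of features (training data names are placeholders).

```python
# Minimal sketch of the Random Forest configuration described above.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,        # number of random trees in the ensemble
    max_features="log2",     # K = log2(number of features) candidates per split
    bootstrap=True,          # each tree is trained on a bootstrap sample, as in Bagging
)
# rf.fit(X_train, y_train); rf.predict(X_test)   # X_train / X_test are placeholders
```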
3.5.2 Spectrum of randomization
RF generates a random decision tree by randomly selecting a subset of features at each node
Selection of the split point within the selected feature subset remains deterministic
VR-Tree ensemble
Randomness is added in both feature selection and partition selection
At each node of the tree, a coin that comes up heads with probability α is tossed
If the coin comes up heads, a deterministic node is constructed
As in a traditional decision tree, the best split point is selected among all possible split points
If it comes up tails, a random node is constructed, i.e., a feature is selected at random and a random split point is chosen for that feature
Algorithm
Parameter α controls randomness
Optimal values can be found by looking at the spectrum
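A minimal sketch of the VR-Tree coin flip described above: with probability α the usual deterministic split selection is used, otherwise both the feature and the split point are chosen at random (the Gini-based best_split helper is an illustrative stand-in for the usual split criterion).

```python
# Minimal sketch of the VR-Tree node construction: deterministic with
# probability alpha, random otherwise.
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_split(X, y):
    """Deterministic: the (feature, threshold) pair with the lowest weighted Gini."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:       # thresholds between observed values
            left = X[:, j] <= t
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]

def choose_split(X, y, alpha, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < alpha:                    # "heads": deterministic node
        return best_split(X, y)
    j = int(rng.integers(X.shape[1]))           # "tails": random feature ...
    return j, rng.uniform(X[:, j].min(), X[:, j].max())   # ... and random threshold
```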
3.5.3 Ensemble of random trees in density estimation
Ensembles of random trees can be used for density estimation
Density estimation is an unsupervised learning problem
There is no label information for training instances
Completely random trees are used
Instead of checking whether instances belong to the same class, the tree is grown until each leaf node contains only a single instance or instances that cannot be distinguished from one another
Density estimation by ensemble of random trees
Applications for high-dimensional data and more complex data analysis
Can also be applied to problems with growing ensembles, such as online learning and streaming data
The data distribution can be estimated by averaging the density estimates of the completely random trees
3.5.4 Ensemble of random trees in anomaly detection
Anomaly
A data point that does not follow the behavior generally expected of most of the data
"Anomaly" and "outlier" are used interchangeably
Well-established method
Separability is a good indicator
Better separation with fewer divisions
iForest method
Number of divisions required to separate data points in each random tree
Can be measured by the path length from the root node to the leaf node containing the data point
The shorter the path length, the easier it is to separate the data points.
SCiForest method
The goal is to obtain smooth decision boundaries similar to oblique decision trees.
An isolation forest that uses hyperplane (oblique) splits
Can be complicated
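A minimal sketch of the isolation-based anomaly detection idea (iForest) with scikit-learn's IsolationForest: points isolated after few random splits, i.e. with short path lengths, receive lower scores; the toy data are an illustrative assumption.

```python
# Minimal sketch of isolation-based anomaly detection.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # normal cluster
               [[8.0, 8.0]]])                     # one obvious outlier

iforest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = iforest.score_samples(X)   # lower score = shorter average path = more anomalous
print(scores.argmin())              # the injected outlier should get the lowest score
```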
3.6 References
4. Combination methods
4.1 Advantages of combination methods
The combination method plays an important role in the ensemble method
Advantages of the Combination Method
Statistical point of view
The hypothesis space is too large to explore with limited training data, and there are often several different hypotheses that give similar accuracy on the training data
Combining hypotheses reduces the risk of selecting the wrong hypothesis
Computational perspective
Reduces the risk of local search and selection of local optimal solutions
Representational perspective
The true hypothesis may not be representable by any hypothesis in the hypothesis space
Combining hypotheses can expand the space of representable functions
4.2 Averaging
The most commonly used basic combination method for numerical outputs
Given a set of T learners {h1, ..., hT}, the output of hi for instance x is hi(x)
4.2.1 simple averaging
The outputs of the individual learners are directly averaged to obtain a combined output
4.2.2 weighted averaging
Averages the outputs of individual learners with weights of different importance
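A minimal sketch of simple and weighted averaging for numerical outputs; the matrix of learner outputs and the weights are illustrative placeholders.

```python
# Minimal sketch of simple and weighted averaging.
import numpy as np

outputs = np.array([          # h_i(x) for T = 3 learners on 4 test instances
    [2.0, 3.1, 0.5, 1.2],
    [1.8, 2.9, 0.7, 1.0],
    [2.4, 3.3, 0.4, 1.5],
])

simple = outputs.mean(axis=0)          # H(x) = (1/T) * sum_i h_i(x)
w = np.array([0.5, 0.3, 0.2])          # importance weights, summing to 1
weighted = w @ outputs                 # H(x) = sum_i w_i * h_i(x)
```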
4.3 Voting
The most commonly used basic combination method for nominal outputs
Given the set of T classifiers {h1, ..., hT} and a set of l possible class labels {c1, ..., cl}, voting combines the classifier outputs to predict a single label from {c1, ..., cl}
Classifiers that output unnormalized margins
SVM, etc.
Calibration methods can be used to convert outputs to probabilities
Platt scaling
Monotonic regression (isotonic regression)
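A minimal sketch contrasting the voting rules listed in the following subsections: majority/plurality voting over hard labels, weighted voting, and soft voting over class-probability estimates (the votes, weights, and probabilities are illustrative placeholders).

```python
# Minimal sketch of majority, weighted, and soft voting for one test instance.
import numpy as np

labels = np.array([0, 0, 1, 2, 0])                  # hard votes of T = 5 classifiers
majority = np.bincount(labels).argmax()             # majority / plurality voting -> class 0

w = np.array([0.4, 0.1, 0.2, 0.2, 0.1])             # per-classifier weights
weighted = np.bincount(labels, weights=w).argmax()  # weighted voting

proba = np.array([                                  # soft voting: class-probability estimates
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
])
soft = proba.mean(axis=0).argmax()                  # class with the highest mean probability
```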
4.3.1 Majority voting
4.3.2 Plurality voting (relative majority)
4.3.3 Weighted voting
4.3.4 Soft voting
4.3.5 Theoretical issues
4.4 Combining by Learning
4.4.1 Stacking
4.4.2 Infinite ensembles
4.5 Other combination methods
4.5.1 Algebraic methods
4.5.2 Behavioral knowledge space methods
4.5.3 Decision template method
4.6 Related Methods
4.6.1 Error-Correcting Output Codes
4.6.2 Dynamic Classifier Selection
4.6.3 Mixture of Experts
4.7 References
5. Diversity
5.1 Ensemble Diversity
5.2 Error Decomposition
5.2.1 Error Ambiguity Decomposition
5.2.2 Bias-variance-covariance decomposition
5.3 Diversity Measures
5.3.1 Pairwise measures
5.3.2 Non-pairwise measures
5.3.3 Summary and visualization
5.3.4 Limitations of diversity measures
5.4 Information Theoretic Diversity
5.4.1 Information Theory and Ensembles
5.4.2 Interaction Information Diversity
5.4.3 Multi-Information Diversity
5.4.4 Estimation Method
5.5 Generating diversity
5.6 References
6. Ensemble pruning
6.1 What is ensemble pruning?
6.2 Many Could Be Better Than All
6.3 Classification of pruning methods
6.4 Ordering-based pruning
6.5 Clustering-based pruning
6.6 Optimization-based pruning
6.6.1 Heuristic optimization pruning
6.6.2 Mathematical programming pruning
6.6.3 Probabilistic pruning
6.7 References
7. Clustering ensembles
7.1 Clustering
7.1.1 Clustering Methods
7.1.2 Clustering Evaluation
7.1.3 Why Clustering Ensemble?
7.2 Classification of Clustering Ensemble Methods
7.3 Similarity-Based Methods
7.4 Graph-based Methods
7.5 Relabeling-based methods
7.6 Transformation-based methods
7.7 References
8. Further topics
8.1 Semi-supervised Learning
8.1.1 Usefulness of unlabeled data
8.1.2 Semi-supervised learning with ensembles
8.2 Active Learning
8.2.1 Usefulness of Human Intervention
8.2.2 Active Learning with Ensembles
8.3 Cost-Aware Learning
8.3.1 Learning with Unequal Costs
8.3.2 Ensembles in Cost-Aware Learning
8.4 Class Imbalanced Learning
8.4.1 Learning in Class Imbalance
8.4.2 Performance Evaluation in Class Imbalance
8.4.3 Ensemble Methods for Class Imbalance Learning
8.5 Comprehensibility Improvements
8.5.1 Reducing Ensembles to a Single Model
8.5.2 Rule Extraction from Ensembles
8.5.3 Visualizing the Ensemble
8.6 Future of Ensembles
8.7 References