Challenges and implementation of achieving 100% recall for risk-related tasks

What is 100% recall in machine learning?

In machine learning, recall is a metric used mainly for classification tasks. It is the fraction of actual positive samples that the model detects. Achieving 100% recall means that the classification model correctly detects all positive samples and that there are no false negatives (i.e., actual positives mistaken for negatives). More generally, it means extracting all the data (positives) that should be found, without omission, a requirement that appears frequently in tasks involving real-world risks.

However, 100% recall is generally difficult to achieve, as it is limited by the characteristics of the data and the complexity of the problem. In addition, the pursuit of 100% recall may increase the proportion of false positives (i.e., actual negatives mistaken for positives), so the balance between the two must be considered. Various issues can hinder the realization of 100% recall.
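
The trade-off described above can be made concrete with a few lines of Python. This is a minimal sketch: the function name `recall_and_precision` and the toy labels are illustrative assumptions, not part of any library.

```python
def recall_and_precision(y_true, y_pred):
    """Return (recall, precision) for binary labels where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical labels: 3 positives among 10 samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
cautious = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]    # misses one positive, no false alarms
aggressive = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # catches every positive, 3 false alarms

print(recall_and_precision(y_true, cautious))    # recall 2/3, precision 1.0
print(recall_and_precision(y_true, aggressive))  # recall 1.0, precision 0.5
```

The second predictor reaches 100% recall only by accepting three false positives, which is the balance the text refers to.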

Next, we consider algorithms that can help achieve it.

Which machine learning algorithm is best suited to achieve 100% recall?

Whether 100% recall can be achieved depends on the data set, and the optimal algorithm varies with the characteristics and distribution of the data. As a general trend, however, the following algorithms are effective in improving recall.

  • Support Vector Machines (SVM): SVMs can be applied to high-dimensional data and data with non-linear decision boundaries. With well-chosen hyperparameters and an appropriate kernel function, a high recall can be achieved.
  • Random Forest: Random forests may achieve high recall by combining a large number of decision trees. In particular, they show robust performance even when the data set contains noise and outliers.
  • Gradient Boosting: Gradient boosting algorithms (e.g., XGBoost and LightGBM) are ensemble learning algorithms, as described in “Overview of Ensemble Learning and Examples of Algorithms and Implementations”, which combine multiple weak learners to build a model. This approach has the potential to achieve high recall because it can automatically learn the importance of features that help improve recall.

Even with such algorithms, achieving 100% recall is generally difficult, because noise, outliers, and incomplete data in real data sets reduce model performance. In addition, increasing recall involves trade-offs with other performance metrics (e.g., accuracy, F1 score, etc.), and an appropriate algorithm must be selected based on the data and problem characteristics.

Each case is discussed below.

The elements necessary to achieve 100% recall using SVM

To achieve 100% recall using SVM, the following elements are required. (For more information on SVM, see “Overview of Kernel Methods and Support Vector Machines”.)

  • Selection of an appropriate kernel function: SVMs use linear or nonlinear kernels (RBF, polynomial, sigmoid, etc.) to map data to a higher-dimensional space. To improve recall, the kernel function should be selected based on the characteristics of the data.
  • Tuning Hyperparameters: SVMs have hyperparameters such as the regularization parameter (C) and kernel-specific parameters (gamma, polynomial degree, etc.). To maximize recall, it is necessary to find the best combination of parameters using methods such as cross-validation, grid search, or Bayesian optimization as described in “Implementation of Bayesian Optimization Tools Using Clojure” and others.
  • Dealing with class imbalances: Data sets with class imbalances can lead to poor recall, so with SVMs as well the imbalance needs to be adjusted for using methods such as undersampling, oversampling, or class weighting.
  • Model Evaluation and Validation: Model evaluation and validation are important for optimizing recall. Holdout and cross-validation should be used to evaluate model performance, with recall as the primary metric.
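
The elements above can be combined in a short scikit-learn sketch. It is an illustration under assumptions rather than a recommended configuration: the data is synthetic, and the grid values are arbitrary placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data (about 10% positives); a stand-in for a real data set.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

# Scale features, then fit an SVM whose class weights offset the imbalance.
pipeline = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))

# Candidate kernels and hyperparameters (illustrative values only).
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1.0, 10.0],
    "svc__gamma": ["scale", 0.1],
}

# Stratified folds keep the class ratio in each fold; recall is the selection metric.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(f"cross-validated recall: {search.best_score_:.3f}")
```

Selecting on recall alone can favor models that flag almost everything as positive, so in practice it is worth inspecting precision for the chosen parameters as well.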

Elements necessary to achieve 100% recall using random forests

To achieve 100% recall using random forests, the following elements are required. (For details on random forests, see “Classification (4) Group Learning (Ensemble Learning, Random Forests) and Evaluation of Learning Results (Cross-validation Method)” etc.)

  • Selection of appropriate features: The performance of a model is highly dependent on the quality of the features provided as input. To improve recall, it is important to create meaningful features using feature engineering methods. For more information on feature extraction, see “Various Feature Engineering Methods and Their Python Implementations”.
  • Tuning Hyperparameters: Random forests have hyperparameters such as the number of trees, the depth of each tree, and how nodes are split. Setting these appropriately is important for improving recall, which means searching for the best combination of hyperparameters using methods such as grid search, random search, or Bayesian optimization as described in “Implementing a Bayesian Optimization Tool Using Clojure” and others.
  • Dealing with unbalanced data: Recall may be reduced for data sets with unbalanced classes. To improve recall, the imbalance should be adjusted for using techniques such as undersampling, oversampling, or class weighting.
  • Class weighting: Random forests allow for class weighting, whereby increasing the weight of minority classes can improve recall.
  • Model evaluation and validation: To optimize recall, model evaluation and validation are important. Holdout and cross-validation should be used to evaluate model performance, with recall as the primary metric.
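
As a rough illustration of the class weighting and validation points above, the following sketch compares cross-validated recall for a few `class_weight` settings of scikit-learn's `RandomForestClassifier`. The data set is synthetic and the weight values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly 5% positives (a stand-in for real data).
X, y = make_classification(n_samples=500, n_features=12, weights=[0.95, 0.05],
                           random_state=1)

results = []
for class_weight in (None, "balanced", {0: 1, 1: 50}):
    forest = RandomForestClassifier(n_estimators=100, max_depth=5,
                                    class_weight=class_weight, random_state=1)
    # For classifiers, cv=5 uses stratified folds, so each fold keeps some positives.
    scores = cross_val_score(forest, X, y, cv=5, scoring="recall")
    results.append((class_weight, scores.mean()))
    print(f"class_weight={class_weight}: mean recall {scores.mean():.3f}")
```

How much the weighting helps depends on the data, so the comparison should be repeated on the actual data set rather than taken from this toy example.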

Elements necessary to achieve 100% recall using LightGBM

LightGBM is a machine learning tool designed to build fast and accurate models on large data sets, primarily using gradient boosting and decision tree algorithms (see “LightGBM Overview and Implementation in Various Languages” for more information).

To achieve 100% recall using LightGBM, the following elements are required:

  • Selection of appropriate features: Since the performance of the model is highly dependent on the quality of the features provided as input, the selection of appropriate features is important for improving recall. This calls for feature engineering methods that create meaningful features. For more information on feature extraction, see “Various Feature Engineering Methods and Python Implementation”.
  • Parameter Tuning: LightGBM has many hyperparameters that need to be tuned properly to maximize recall. This includes selecting parameters such as the number of trees, depth, and learning rate, and searching for optimal values using methods such as grid search and Bayesian optimization as described in “Implementing a Bayesian Optimization Tool Using Clojure” and elsewhere.
  • Dealing with imbalanced data: To increase recall, imbalanced data sets must be dealt with. Recall can be reduced by class imbalances, in which case techniques such as oversampling, undersampling, and class weighting are used to adjust for the imbalance.
  • Model Evaluation and Validation: Model evaluation and validation are important for optimizing recall. This involves using holdout and cross-validation to evaluate the model’s performance, with recall as the primary metric.

A common element among the above is the approach to dealing with unbalanced data sets. In real-world problems, this issue appears frequently and is an obstacle to improving recall. In the following, we describe in detail approaches to dealing with imbalanced data sets.

Means to improve machine learning recall reduced by data (class) imbalance

When data (class) imbalances reduce recall, the following approaches can be considered to improve it:

  • Undersampling: The class balance is adjusted by reducing the number of samples in the majority class. Methods include randomly removing part of the majority class or removing samples based on class characteristics.
  • Oversampling: The class balance is adjusted by increasing the number of samples in the minority class. Methods include random replication and synthetic methods (SMOTE, ADASYN, etc.).
  • Class Weighting: In learning algorithms, the balance can be adjusted through class weights. This technique improves recall by increasing the weights of minority classes. Many machine learning libraries support class weighting.
  • Anomaly Detection: This method treats the minority class samples as anomalies and detects them as such. By regarding the majority class as normal and detecting how minority samples differ from it, recall can be improved.
  • Custom Threshold Adjustment: The predicted class can be adjusted by changing the threshold applied to the prediction score. If recall is important, the threshold can be lowered so that more samples are predicted as positive.

The details and challenges of each method are described below.

Challenges of Undersampling

In unbalanced data sets, data for minority classes is extremely scarce, and the model may have difficulty learning the minority classes adequately. Undersampling seeks to improve class balance and model performance by reducing the majority class data. However, undersampling has several challenges:

  • Loss of information: Undersampling removes some data from the data set in order to reduce the majority class data. The removed data may contain important patterns or trends, so methods that minimize information loss should be considered when undersampling.
  • Model Bias: While undersampling improves the balance of the data set, it significantly reduces the majority class data, so the class distribution seen during training no longer matches the real one. The model's predictions may then be biased (for example, predicting the minority class too often), and when applying undersampling, a proper balance must be maintained while being mindful of this bias.
  • Choice of sampling method: With undersampling, it is necessary to determine which data to remove. There are a variety of sampling methods, such as random removal or retaining the majority samples most relevant to the minority class boundary, but failure to select an appropriate method can lead to problems such as loss of information and increased bias. In addition, if too little data is removed, undersampling may not be effective and class imbalances may remain.
  • Reduced generalization performance: Undersampling improves the balance of the data set and helps the model learn minority classes, but it reduces the overall amount of training data, which may reduce the generalization performance of the model. In particular, if the minority class itself carries too little information, the model will still have difficulty predicting it accurately. Therefore, when undersampling, the gain in class balance must be weighed against the loss of training data.

Approaches to address these issues include combining undersampling with other methods (e.g., oversampling, SMOTE), weighting, and ensemble learning. It is also important to implement appropriate pre-processing methods such as feature extraction.
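
Random undersampling itself is simple to express. The following is a minimal NumPy sketch; the function `random_undersample` and its `ratio` and `seed` parameters are illustrative assumptions, not a library API.

```python
import numpy as np


def random_undersample(X, y, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority samples until n_majority is about ratio * n_minority."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]


# Toy data: 90 majority samples (label 0) and 10 minority samples (label 1).
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = random_undersample(X, y, majority_label=0, ratio=2.0)
print(np.bincount(y_bal))  # 20 majority samples kept, all 10 minority samples kept
```

Keeping a ratio above 1.0, as here, discards less majority data than full balancing and is one way to limit the information loss discussed above.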

Challenges of Oversampling and Solutions

<Challenges>

Oversampling in machine learning is an approach that improves class balance by replicating or synthesizing data from minority classes, and is one of the methods used to resolve class imbalances in a data set. Oversampling has the following challenges:

  • Risk of overfitting: Oversampling increases the weight of the minority class data by replicating or synthesizing it. This tends to make the model learn the minority class patterns more strongly. However, with an excess of replicated or synthetic minority-class data, the model runs the risk of over-fitting those data and degrading generalization performance on new data. To avoid overfitting, it is important to choose an appropriate balance and model regularization method.
  • Duplication of information: Oversampling may produce data that is similar to the original data because it duplicates or synthesizes data from a minority class. This may result in duplicate information in the data set and the model may learn redundant information. In addition, if the duplicate data does not contribute to the performance of the model, training efficiency may be reduced. Appropriate selection of synthesis methods and parameter tuning are important to minimize duplication.
  • Model Bias: Oversampling can reduce model bias by increasing data from minority classes. However, excessive oversampling may cause the model to make predictions that are biased toward minority classes. The model becomes too sensitive to minority class patterns and fails to adequately capture the characteristics of the actual data. To address these challenges, it is important to use a balanced oversampling approach and to adjust hyperparameters to account for model bias.

Approaches to address these challenges include, as in the case of undersampling, combining oversampling and undersampling and improving synthesis methods (e.g., using SMOTE or GANs, described in “Overview of GANs and their various applications and implementations”), as well as cross-validation, regularization, feature selection, and ensemble learning. Some of these are described in detail below.

<Methods for Increasing the Number of Minority Class Samples>

The following synthetic approaches are used to increase the number of minority class samples.

  • SMOTE (Synthetic Minority Over-sampling Technique): SMOTE is a technique that generates synthetic samples based on minority class samples, balancing the data set by creating new samples that supplement the minority class. Specifically, for a minority class sample, SMOTE randomly selects one of its nearest minority class neighbors, computes the vector connecting the two, and generates a new sample at a random point along that vector.
  • ADASYN (Adaptive Synthetic Sampling): ADASYN is an extension of SMOTE that generates samples according to the local density of data points. It compensates for data bias by generating more synthetic samples in regions where the minority class is sparse and fewer where it is dense.
  • SMOTE-NC (SMOTE for Nominal and Continuous features): SMOTE-NC is an extension of SMOTE that can be applied to data sets containing both nominal (categorical) and continuous variables. It distinguishes between the two kinds of variables and applies an appropriate synthesis method to each.

These synthesis methods carry the risk of overfitting and may introduce noise, so care should be taken to set parameters appropriately for the characteristics of the data.
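
The interpolation idea behind SMOTE can be sketched in NumPy as follows. This is a simplified illustration for continuous features only, not the reference algorithm: the function name `smote_sketch` is hypothetical, and base samples are drawn at random rather than iterated over in order.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    X_min holds only minority class rows with continuous features.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_new)                      # starting samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one random neighbor each
    gap = rng.random((n_new, 1))                               # random position in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_min = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote_sketch(X_min, n_new=40)
print(X_new.shape)  # (40, 3)
```

Because every synthetic point lies on a segment between two existing minority samples, the method cannot create information outside the region the minority class already occupies, which is one source of the overfitting risk noted above.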

Challenges of Class Weighting

Class weighting usually involves assigning higher weights to minority classes and lower weights to majority classes, so that the model learns with more emphasis on minority classes. Class weighting has the following challenges:

  • Balancing performance: Class weighting is used to emphasize the importance of minority classes, but requires that appropriate weights be set. If the weights are improperly set, the model may over-fit minority classes or, conversely, ignore majority classes. Setting appropriate weights should be carefully considered based on class importance and dataset characteristics.
  • Poor Performance: Class weighting is a technique aimed at improving performance in unbalanced data sets, but setting inappropriate weights can degrade performance. For example, assigning too much weight to a minority class may cause the model to over-fit the minority class, resulting in poor predictive performance for the majority class. Therefore, class weighting should be appropriately set while maintaining balance.
  • Difficulty in adjusting parameters: Class weighting requires appropriate setting of weight values. The weight values are treated as hyper-parameters and require tuning. Finding appropriate weight settings requires evaluation of the model’s performance and objective function, while properly tuning weight values requires domain knowledge and understanding of the data.

Careful setting of weights and tuning of parameters is necessary to address these issues, and class weighting can be used in combination with other methods (undersampling, oversampling) or with model evaluation methods such as cross-validation.
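
A common starting point for the weight values is to make them inversely proportional to class frequency. The sketch below computes `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`; the function name `balanced_class_weights` is an illustrative assumption.

```python
import numpy as np


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Toy labels: 95 negatives and 5 positives.
y = np.array([0] * 95 + [1] * 5)
print(balanced_class_weights(y))  # class 0 gets about 0.526, class 1 gets 10.0
```

These values are only a baseline; as noted above, the weights behave like hyperparameters and may need tuning against validation recall.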

Challenges of Anomaly Detection Approaches and Solutions

<Challenges>

Anomaly detection approaches for imbalanced classes also present the following challenges:

  • Lack of minority class data: With unbalanced classes, anomalous data usually belong to the minority class. The scarcity of minority class data makes it difficult to model anomalous data properly. The model may become biased toward normal data, resulting in poor performance in detecting anomalous data.
  • Label imbalance: Anomaly detection typically assigns positive labels to anomalous data, but since anomalous data is much less common than normal data, a label imbalance occurs. In such cases, the model cannot learn enough from the anomalous data and tends to make predictions that are biased toward normal data.
  • Feature Selection and Extraction: In anomaly detection, it is important to capture the features of anomalous data properly. However, anomalous data often have different features from normal data. This requires appropriate feature selection and extraction, for which domain knowledge and understanding of the data are essential.
  • Evaluation difficulties: In anomaly detection for unbalanced classes, common evaluation metrics (e.g., accuracy, recall, etc.) are difficult to use as they are, because the scarcity of anomalous data may distort the evaluation of the model. Care must be taken in selecting appropriate evaluation metrics and in interpreting the results.

To address these issues, it is necessary to ensure appropriate data balance, consider sampling strategies, generate anomalous data and use synthetic methods, optimize feature selection and extraction, and select appropriate evaluation indicators.

<Methods>

Anomaly detection approaches for improving recall that has been reduced by class imbalance include the following methods:

  • Anomaly score-based methods: Anomaly detection methods detect data that are considered “anomalous” because they behave differently from normal data. Anomaly score-based methods use a degree of anomaly calculated by the model to identify anomalous data, with data points that score highly being more likely to be classified as anomalous. This allows anomalous instances of the minority class to be detected and improves recall.
  • Combined with undersampling: In unbalanced data sets, samples from the majority class may be so dominant that anomalous instances of the minority class are buried. In this case, the majority class can be undersampled to balance the classes. By applying anomaly detection methods in combination with undersampling, anomalous instances of the minority class can be detected more effectively and recall improved.
  • Supervised Anomaly Detection Techniques: In supervised anomaly detection, the anomalous instances of the minority class are labeled, allowing the anomaly detection model to be trained with supervised learning. This allows minority class anomalies to be identified more accurately and improves recall.
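
A minimal sketch of an anomaly score-based method follows. The scoring rule (standardized distance from the mean of the normal data) and the 99th-percentile threshold are simplifying assumptions chosen for illustration; practical systems typically use dedicated anomaly detection models.

```python
import numpy as np


def anomaly_scores(X_normal, X):
    """Score each row of X by its standardized distance from the normal data."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0) + 1e-12  # avoid division by zero
    return np.linalg.norm((X - mean) / std, axis=1)


rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 2))     # majority ("normal") class
X_anomalous = rng.normal(6.0, 1.0, size=(10, 2))   # minority class, shifted far away

# Threshold at the 99th percentile of scores on normal data (an arbitrary choice).
threshold = np.quantile(anomaly_scores(X_normal, X_normal), 0.99)
flagged = anomaly_scores(X_normal, X_anomalous) > threshold
print(f"recall on the anomalous samples: {flagged.mean():.2f}")
```

Only normal data is needed to fit the scorer, which is why this style of method remains usable when minority class samples are too scarce to train a classifier.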

Challenges in the approach with custom threshold settings

While custom thresholding is an important approach in machine learning for imbalanced classes, the following challenges exist:

  • Label imbalance: In imbalanced classes, anomalous data usually belong to a minority class. Therefore, when setting custom thresholds, the number of anomalous samples is very small, making it difficult to find an appropriate threshold. Setting a low threshold may increase the number of false positives (normal data falsely detected as abnormal), while setting a high threshold will decrease the number of true positives (abnormal data correctly detected).
  • Selection of evaluation metrics: For unbalanced classes, the usual evaluation metrics (e.g., accuracy, recall, etc.) are not adequate on their own. In particular, when custom thresholds are set, the trade-off between true and false positive rates needs to be considered. For example, if the emphasis is on detecting anomalous data, recall is important, but the false positive rate may increase at the same time.
  • Lack of domain knowledge: Domain knowledge and understanding of the data is important for setting custom thresholds. Without a deep understanding of the characteristics of the data and the distribution of abnormal data, setting appropriate thresholds can be challenging and may require relying on the advice and experience of domain experts.

The following approaches can be used to address these challenges:

  • Appropriate selection of model metrics: Select appropriate metrics by balancing the true positive and false positive rates. For example, the F1 score or AUC-ROC (area under the Receiver Operating Characteristic curve) may be considered.
  • Adjustment and Cross-Validation of Custom Thresholds: When setting custom thresholds, cross-validation is performed to evaluate performance and make adjustments to find the appropriate threshold. In selecting thresholds, it is also important to balance the importance of anomalous and normal data and to reflect business goals and constraints.
  • Leverage domain expert knowledge: It is also important to leverage the knowledge and experience of domain experts to help set appropriate thresholds. Understanding anomalous patterns and important characteristics may help to find more appropriate thresholds.
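
Threshold selection for a target recall can be sketched as follows. The helper `threshold_for_recall` and the validation scores are hypothetical; in practice the scores would come from a model's predicted probabilities on held-out data.

```python
import numpy as np


def threshold_for_recall(y_true, scores, target_recall=1.0):
    """Return the highest threshold t such that `scores >= t` reaches target_recall."""
    positive_scores = np.sort(scores[y_true == 1])[::-1]  # positives, descending
    n_needed = max(int(np.ceil(target_recall * len(positive_scores))), 1)
    return positive_scores[n_needed - 1]


# Hypothetical validation labels and scores (3 positives among 10 samples).
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90])

t = threshold_for_recall(y_val, scores, target_recall=1.0)
y_pred = (scores >= t).astype(int)
recall = (y_pred[y_val == 1] == 1).mean()
false_positives = int(((y_pred == 1) & (y_val == 0)).sum())
print(t, recall, false_positives)  # threshold 0.4, recall 1.0, 2 false positives
```

With the default threshold of 0.5 the positive scored 0.40 would be missed; lowering the threshold to 0.4 recovers it at the cost of two false positives. A threshold chosen this way only guarantees the target recall on the validation data, so it should be checked with cross-validation as described above.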

The challenges of dealing with unbalanced data sets are so varied that a simple approach will not provide a solution, and a combination of methods will be necessary.

On the combination of oversampling and undersampling

Some common oversampling and undersampling combination techniques are described below.

  • Combined oversampling and undersampling: This technique uses a combination of oversampling to increase the minority class data and undersampling to decrease the majority class data. Oversampling includes methods such as SMOTE and ADASYN, which synthetically increase the minority class data to improve the balance of the data set. Undersampling, on the other hand, reduces the size of the data set by randomly removing data from the majority class.
  • Iterative oversampling and undersampling: In this technique, oversampling and undersampling are iterated multiple times. The first iteration oversamples and the next iteration undersamples to balance the data set. Repeating this iterative process may result in more stable model learning and improved performance.
  • Combination of oversampling and ensemble learning: This approach uses a combination of oversampling and ensemble learning. Multiple models are trained using the synthetic data generated by oversampling and the original data, and their predictions are combined to produce more accurate forecasts. Bagging and boosting are commonly used ensemble learning methods.

In these combined methods, appropriate sampling strategies and parameter settings are important, and domain knowledge and understanding of the data are essential. In addition, oversampling may pose the risk of overfitting, where the model over-fits the synthetic data, so care must be taken.

The Python library imbalanced-learn provides a convenient way to implement these methods.

Implementation with imbalanced-learn

<Overview>

imbalanced-learn is a Python library for dealing with imbalanced data sets (data sets where the number of samples differs greatly between classes). Its main features and methods include the following:

  • Oversampling methods: As oversampling algorithms, synthetic methods such as SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) are provided. These methods compensate for class imbalances by synthesizing minority class samples.
  • Undersampling Methods: These methods correct for class imbalances by removing samples from the majority class, for example random undersampling (RandomUnderSampler) and neighbor-based methods such as NearMiss and Tomek links.
  • Combination sampling methods: Combining oversampling and undersampling provides a more effective method to correct for class imbalances in a dataset.
  • Classifier: imbalanced-learn provides wrappers for classifiers that can be applied to imbalanced data sets, as well as evaluation metrics for imbalanced data. This supports classification tasks on imbalanced data sets.

<Implementation>

imbalanced-learn conforms to Scikit-learn’s API style and can be used with Scikit-learn’s models and pipelines. It is also compatible with Python’s general machine learning framework. Specific implementation examples are described below.

First, install imbalanced-learn using the pip command.

pip install imbalanced-learn

Next, we describe a Python implementation of SMOTE-NC, one of the oversampling methods. The algorithm proceeds as follows.

  1. First, the usual SMOTE is applied to the continuous variable part.
    1. For each sample of the minority class, one of its k nearest neighbors within the minority class is randomly selected (k=5 is a common choice). (In ADASYN, the number of generated samples is adjusted to the local sample density of the class.)
    2. Compute the vector connecting the selected minority class sample and its neighbor.
    3. The vector is scaled by a random factor between 0 and 1 to generate a new sample, which therefore lies between the minority class sample and its neighbor.
    4. This operation is repeated for each minority class sample in the data set.
  2. Next, synthesis is performed on the nominal variable part. A nominal variable is a feature with a category value, usually represented by an integer or a string. SMOTE-NC looks at the nearest neighbors of the minority class sample and gives the synthetic sample, for each nominal variable, the category value that occurs most frequently among those neighbors.
  3. The synthetic values of the continuous and nominal variables are combined to create new synthetic data points.

With imbalanced-learn, these steps are carried out internally, and only the parameters need to be set.

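import numpy as np
from imblearn.over_sampling import SMOTENC

# Prepare features and class labels for the dataset
# (toy imbalanced data for illustration; replace with your own arrays)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))                        # Feature matrix (numpy array)
X[:, [0, 2, 4]] = rng.integers(0, 3, size=(200, 3))  # Columns 0, 2, 4 hold category codes
y = np.array([0] * 180 + [1] * 20)                   # Class labels (numpy array)

# Specify indices of features for nominal and continuous variables
categorical_features = [0, 2, 4]  # Index of features of nominal variables
continuous_features = [1, 3, 5]  # Index of features for continuous variables (for reference only; not passed to SMOTENC)

# Apply SMOTE-NC
smotenc = SMOTENC(categorical_features=categorical_features, random_state=42)
X_resampled, y_resampled = smotenc.fit_resample(X, y)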

In the above example, SMOTE-NC is performed using the SMOTENC class in imbalanced-learn. categorical_features holds the indices of the nominal variable features and is passed to SMOTENC; continuous_features lists the indices of the continuous variable features for reference only, since SMOTENC treats every column not listed in categorical_features as continuous. The fit_resample method is then called to obtain the oversampled feature matrix X_resampled and the corresponding class labels y_resampled.

Summary: To achieve 100% recall

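Achieving 100% recall is a very difficult task on real data sets. This is because 100% recall requires every anomalous instance to be detected without exception, and in most cases achieving perfect recall is not realistic. In practice, however, 100% recall, i.e. a search with no omissions, is a goal that is frequently demanded in risk-related tasks.

To realize it, the approaches described above are important, in particular the handling of imbalanced data, and they need to be optimized using domain knowledge. The handling of imbalanced data described here also offers hints for machine learning with small data, as discussed in “Small Data Learning, Fusion of Logic and Machine Learning, Local/Population Learning”.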
