Overview of SMOTE (Synthetic Minority Over-sampling Technique), its algorithm and examples of implementation

Overview of SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE (Synthetic Minority Over-sampling Technique) is an over-sampling technique that addresses imbalanced class distributions by synthesizing new minority class samples. It is used to improve model performance, primarily in machine learning classification tasks. An overview of SMOTE is given below.

1. Imbalanced data set: a situation in which, in a classification task, one class (usually the majority class) has far more samples than the other class.

2. The class imbalance problem: when trained on an imbalanced dataset, the model may become biased toward the majority class, resulting in poor performance on the minority class.

3. Synthetic data generation: SMOTE generates new synthetic data points from minority class samples and their nearest neighbors, improving the class balance of the entire data set.

While SMOTE improves class balance, selecting appropriate parameters is important because overgenerating synthetic data can lead to overfitting. Improved variants of SMOTE beyond the simple version have also been proposed.

Algorithms related to SMOTE (Synthetic Minority Over-sampling Technique)

The basic algorithmic steps of SMOTE are described below.

1. Minority class sample selection: Select each minority class sample from the dataset.

2. Selection of neighboring samples using the k-nearest neighbor method: For each selected minority class sample, select its k nearest neighbors from among the other minority class samples, where k is a hyperparameter that specifies the number of neighbors to consider.

3. Generate synthetic data points: For each minority class sample, a new data point is generated between it and one of its neighbors. The generated data points are calculated as follows:

  • For each feature, a new value is calculated as original_feature + alpha * (neighbor_feature - original_feature), where alpha is a random number between 0 and 1.
  • This calculation is performed for each feature to obtain the new data point.

4. Add synthetic data: Add the generated synthetic data points to the dataset to improve class balance.

A practical example is as follows.

If the minority class sample is A and its nearest neighbor samples are B and C, and B is randomly chosen, a new data point is generated as follows:

New data point = A + alpha * (B - A)

The algorithm is expected to improve model performance by increasing the number of minority class samples in the dataset. Improvements to SMOTE beyond the simple version have been proposed, including methods that constrain where data is generated near class boundaries and methods that use different distance metrics. A minimal sketch of the synthesis step follows.
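The code below is a minimal, illustrative sketch of SMOTE's synthesis step in plain NumPy; the function name simple_smote and all parameter values are assumptions for illustration, not part of any library.

import numpy as np

def simple_smote(X_min, n_synthetic, k=5, random_state=0):
    # Minimal, illustrative sketch of SMOTE's synthesis step.
    # X_min: minority class samples, shape (n_samples, n_features)
    # n_synthetic: number of synthetic points to generate
    # k: number of nearest neighbors to consider
    rng = np.random.default_rng(random_state)
    n = len(X_min)
    k = min(k, n - 1)  # cannot use more neighbors than remaining samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)  # pick a minority sample A at random
        # Euclidean distances from A to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # k nearest, skipping A itself
        j = rng.choice(neighbors)  # pick a neighbor B
        alpha = rng.random()  # random factor in [0, 1)
        # New point lies on the segment between A and B: A + alpha * (B - A)
        synthetic.append(X_min[i] + alpha * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Tiny usage example with a four-point minority class
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
print(simple_smote(X_min, n_synthetic=3, k=2))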

Application of SMOTE (Synthetic Minority Over-sampling Technique)

The following are examples of SMOTE applications.

1. Medical diagnosis:

Medical data typically shows an imbalanced class distribution because disease incidence is low. SMOTE can be used to increase the number of minority class samples in medical diagnostic datasets and help improve model performance, for example in cancer detection and disease prediction.

2. Credit evaluation:

Since defaults are rare in credit evaluation datasets, SMOTE addresses the imbalanced class distribution and helps the model perform better at predicting default.

3. Security:

Attack events are typically rare in security-related datasets, yet detecting them is important. SMOTE addresses the imbalanced class distribution of security data and supports the training of detection models.

4. Fraud detection:

Fraudulent transactions are usually rare in datasets such as financial and online transactions. SMOTE addresses these imbalanced class distributions and contributes to improving the performance of fraud detection models.

5. Image processing:

Class imbalance can also occur in image data, for example when training models to detect anomalies in medical images.

SMOTE is a useful approach in these cases, helping to mitigate imbalanced class distributions and improve model performance. However, applying SMOTE requires consideration of the characteristics of the dataset and the domain, and it is important to tune the hyperparameters and validate the results.

Examples of SMOTE (Synthetic Minority Over-sampling Technique) implementations

An example implementation of SMOTE (Synthetic Minority Over-sampling Technique) is shown below. In the following example, SMOTE is implemented using Python's imbalanced-learn library, which builds on scikit-learn. First, install imbalanced-learn.

pip install -U imbalanced-learn

Next, here is a simple example of oversampling a dataset using SMOTE.

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import numpy as np

# Sample data generation
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Plotting Original Data
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], label="Minority Class", alpha=0.7)
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], label="Majority Class", alpha=0.7)
plt.title("Original Data")
plt.legend()
plt.show()

# Apply SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Plotting data after oversampling
plt.scatter(X_resampled[y_resampled == 0][:, 0], X_resampled[y_resampled == 0][:, 1], label="Minority Class", alpha=0.7)
plt.scatter(X_resampled[y_resampled == 1][:, 0], X_resampled[y_resampled == 1][:, 1], label="Majority Class", alpha=0.7)
plt.title("SMOTE Over-sampled Data")
plt.legend()
plt.show()

In this example, the make_classification function is used to generate a synthetic imbalanced dataset, and then SMOTE is applied. The distributions of the original data and the data after applying SMOTE are visualized. Note that the imbalanced-learn package is required for SMOTE, and it is important to adjust its parameters according to the characteristics of the data and the specific use case. A quick way to verify the resampling is shown below.
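As a quick check (a small sketch continuing the example above, reusing y and y_resampled), the class counts before and after resampling can be compared with collections.Counter:

from collections import Counter

# Class distribution before and after SMOTE
print("Before SMOTE:", Counter(y))            # roughly {1: 900, 0: 100} here
print("After SMOTE: ", Counter(y_resampled))  # both classes at the majority count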

Challenges of SMOTE (Synthetic Minority Over-sampling Technique) and its countermeasures

Although SMOTE (Synthetic Minority Over-sampling Technique) is a useful technique, it has several challenges. The main challenges and their countermeasures are listed below.

Challenges:

1. Introduction of noise: The synthetic data generated by SMOTE may differ from the distribution of the original data, which can introduce noise.

2. Approximation of class boundaries: SMOTE is based on simple linear interpolation, which may limit its effectiveness for nonlinear class boundaries.

3. Overgeneration: SMOTE can easily overgenerate synthetic data, which may lead to overfitting of the model.

Countermeasures:

1. Borderline-SMOTE: To reduce the introduction of noise, improved versions such as Borderline-SMOTE have been proposed; it aims to reduce outliers by generating synthetic data only from samples close to class boundaries.

2. ADASYN (Adaptive Synthetic Sampling): To deal with the class boundary approximation, adaptive methods such as ADASYN adjust the number of synthetic samples generated per minority sample based on how difficult that sample is to learn, coping better with nonlinear class boundaries.

3. Control of overgeneration: To deal with overgeneration, it is important to tune SMOTE's parameters appropriately, such as the number of nearest neighbors and the target sampling ratio. Methods that control overgeneration and model regularization may also be considered.

4. Combination with other class imbalance learning methods: SMOTE is one oversampling method and can benefit from being combined with others; for example, combining it with under-sampling or anomaly detection methods can be expected to improve model performance. A sketch of some of these variants is shown below.
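As a sketch of these countermeasures, the imbalanced-learn library provides Borderline-SMOTE and ADASYN alongside plain SMOTE, and the k_neighbors and sampling_strategy parameters give direct control over how much synthetic data is produced. The snippet below reuses X and y from the earlier example; the parameter values are illustrative assumptions, not recommendations.

from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

# Borderline-SMOTE: synthesize only from minority samples near the class boundary
X_bl, y_bl = BorderlineSMOTE(random_state=42).fit_resample(X, y)

# ADASYN: generate more samples for minority points that are harder to learn
X_ada, y_ada = ADASYN(random_state=42).fit_resample(X, y)

# Controlling overgeneration: fewer neighbors and a target ratio below full balance
# (the minority class is resampled to half the size of the majority class)
X_half, y_half = SMOTE(k_neighbors=3, sampling_strategy=0.5,
                       random_state=42).fit_resample(X, y)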

Reference Information and Reference Books

For reference information, see "General Machine Learning and Data Analysis," "Small Data Learning, Combining Logic and Machine Learning, Local/Group Learning," and "Machine Learning with Sparsity."

Reference books include "Advice for machine learning part 1: Overfitting and High error rate,"

"Machine Learning Design Patterns,"

"Machine Learning Solutions: Expert techniques to tackle complex machine learning problems using Python,"

and "Machine Learning with R."

Basis and Original Papers

1. original paper (proposal of SMOTE)

Title: SMOTE: Synthetic Minority Over-sampling Technique

Authors: Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer

Published in: Journal of Artificial Intelligence Research (JAIR), 2002

A good first read, containing details of the SMOTE algorithm and evaluation experiments.

Books (including applications and implementations)

2. Imbalanced Learning: Foundations, Algorithms, and Applications

Authors: Haibo He, Yunqian Ma

Publisher: Wiley-IEEE Press, 2013

3. Beginning Anomaly Detection Using Python-Based Deep Learning: Implement Anomaly Detection Applications with Keras and PyTorch

4. Imbalanced Classification with Python: Better Metrics, Balance Skewed Classes, Cost-Sensitive Learning

Application Papers and Extensions

Borderline-SMOTE

Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning.

Proposal for an extended version of SMOTE. Focuses on data near decision boundaries.

ADASYN: Adaptive Synthetic Sampling

Haibo He et al. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning.

A method that generates more synthetic minority class samples for instances that are harder to learn, based on their "difficulty."

Implementation Resources

imbalanced-learn (Python library)

Provides implementations of SMOTE, BorderlineSMOTE, SMOTENC, SVMSMOTE, and many other variants.
