Overview of ADASYN and examples of algorithms and implementations

Overview of ADASYN

ADASYN (Adaptive Synthetic Sampling), proposed by Haibo He et al. in 2008, is a data-level technique designed to address class imbalance in classification tasks. The primary goal of ADASYN is to reduce the learning bias caused by severely underrepresented minority class samples and to construct a more balanced and effective classification model.

Unlike basic oversampling methods that uniformly generate synthetic samples, ADASYN focuses on generating more samples in regions where minority class examples are harder to classify. Specifically, it evaluates the distribution of majority class samples in the neighborhood of each minority class instance. The more imbalanced the local region, the more synthetic samples are generated for that point. This allows the model to better learn complex decision boundaries and improves classification accuracy in difficult areas.

By adopting this adaptive sampling strategy, ADASYN can outperform traditional oversampling techniques such as SMOTE in practical applications, particularly by strengthening the model’s performance on critical and challenging regions of the data space.

How ADASYN Works (Summary)

The operation of ADASYN can be summarized in the following steps:

  1. For each sample in the minority class, a set of k nearest neighbors (typically k = 5) is identified.

  2. The number of majority class samples within this neighborhood is counted to evaluate the local class imbalance.

  3. A difficulty score is computed: samples surrounded by more majority class instances are considered harder to classify and are therefore assigned more synthetic samples.

  4. The synthetic samples are generated via linear interpolation, similar to SMOTE, using the original minority class sample and its minority class neighbors.

Through this process, ADASYN places more emphasis on hard-to-learn areas and enhances the classifier’s ability to correctly distinguish the minority class, especially near decision boundaries.
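To make steps 1 to 3 concrete, the following is a minimal sketch (not the imbalanced-learn implementation) of how the per-sample difficulty scores and synthetic-sample quotas could be computed for a binary problem. The function name adasyn_weights and the choice of full balancing (beta = 1) are illustrative assumptions; the interpolation step (step 4) is omitted.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label=1, k=5):
    # Sketch of ADASYN steps 1-3: local difficulty scores and per-sample quotas.
    # Use imbalanced-learn's ADASYN for actual resampling.
    X_min = X[y == minority_label]
    # total number of synthetic samples needed for full balance (beta = 1)
    G = (y != minority_label).sum() - (y == minority_label).sum()

    # k nearest neighbours of each minority sample, searched over the whole data set
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)  # the first neighbour is the point itself

    # difficulty r_i = fraction of majority-class samples among the k neighbours
    r = np.array([(y[neigh[1:]] != minority_label).mean() for neigh in idx])
    r_hat = r / r.sum()                  # normalise so the weights sum to 1 (assumes r.sum() > 0)
    g = np.rint(r_hat * G).astype(int)   # synthetic samples to generate around each minority point
    return r_hat, g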

Differences Between ADASYN and SMOTE

While both ADASYN and SMOTE (Synthetic Minority Over-sampling Technique) described in “Overview of SMOTE (Synthetic Minority Over-sampling Technique), its algorithm and examples of implementation” aim to mitigate class imbalance by generating synthetic minority class samples, they differ in their strategies:

  • Location of Generation:
    SMOTE generates samples uniformly across all minority class samples, without regard to local classification difficulty.
    ADASYN, by contrast, focuses on areas with a high chance of misclassification, i.e., where the majority class is dominant in the neighborhood.

  • Weighting Strategy:
    SMOTE applies no weighting, treating all minority class samples equally.
    ADASYN assigns weights based on the local class imbalance, resulting in more samples for harder regions.

  • Targeted Regions:
    SMOTE aims to enhance the entire minority class distribution.
    ADASYN explicitly targets hard-to-classify regions of the minority class, strengthening the classifier in areas that are more likely to contribute to errors.

In summary, while SMOTE is a general-purpose balancing method, ADASYN is a more strategic and focused technique that adaptively learns where support is needed most.
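The difference is easy to observe with imbalanced-learn: both samplers expose the same fit_resample interface, but SMOTE balances the classes exactly, while ADASYN's generated count follows the local difficulty and may deviate slightly from a perfect balance. The following is a minimal sketch on synthetic data.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)
print("Original:", Counter(y))

# SMOTE: samples uniformly over all minority points, yielding an exact balance
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("SMOTE   :", Counter(y_sm))

# ADASYN: allocates more synthetic samples to harder (majority-dominated) regions,
# so the resulting count is adaptive rather than exactly balanced
X_ad, y_ad = ADASYN(random_state=42).fit_resample(X, y)
print("ADASYN  :", Counter(y_ad))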

Advantages and Disadvantages of ADASYN

Improved Accuracy

  • Advantage: ADASYN enhances classification accuracy for the minority class by focusing on regions with high misclassification risk.

  • Disadvantage: Overemphasizing difficult regions may lead to overlap between classes, potentially degrading performance by blurring class boundaries.

Local Adaptivity

  • Advantage: ADASYN evaluates the local neighborhood difficulty of each sample, allowing for targeted and context-aware oversampling.

  • Disadvantage: This adaptivity may cause the model to overreact to noise or outliers, generating synthetic samples in unrepresentative or misleading regions.

Automatic Sample Adjustment

  • Advantage: ADASYN dynamically determines the number of synthetic samples based on the difficulty of each instance, optimizing resource allocation during training.

  • Disadvantage: The method’s effectiveness depends on careful tuning of hyperparameters such as the number of neighbors (k), which can significantly affect performance.

In conclusion, ADASYN is a powerful and adaptive oversampling strategy for imbalanced learning, offering notable improvements in minority class learning. However, to achieve its full potential, one must carefully manage noise sensitivity and parameter configuration.

Related Algorithms

ADASYN (Adaptive Synthetic Sampling) is part of a broader family of techniques designed to address imbalanced data problems, particularly through oversampling. Below, we describe key related algorithms categorized by their strategies.

<Oversampling-Based Algorithms (Generate Synthetic Samples for Minority Class)>

Oversampling methods aim to enhance the learning capability of the minority class by generating synthetic samples and adding them to the training data. While ADASYN is one such method, other notable approaches each have their own unique characteristics and strategies.

SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE generates new samples by performing linear interpolation between a minority class sample and its neighbors. It treats all minority samples equally, without considering classification difficulty. In contrast, ADASYN adjusts the number of samples based on difficulty.

Borderline-SMOTE

This method focuses on minority class samples located near the decision boundary, which are more prone to misclassification. It generates synthetic samples only in these high-risk regions, similar to ADASYN, but restricts the target explicitly to borderline areas.

SMOTE-NC (SMOTE for Nominal and Continuous features)

SMOTE-NC extends SMOTE to support mixed data types (both continuous and categorical). Categorical features are handled via majority voting, while continuous features are interpolated. ADASYN typically applies only to continuous numerical data.

KMeans-SMOTE

This technique applies KMeans clustering before SMOTE-style interpolation, generating synthetic samples within clusters where the minority class is underrepresented. It focuses on the global data structure, unlike ADASYN, which targets local classification difficulty.

SOMO (Self-Organizing Map based Oversampling)

SOMO leverages Self-Organizing Maps (SOM) to learn the topological distribution of the minority class and generates samples near cluster centers. It emphasizes topological structure over local classification difficulty, prioritizing spatial balance over local adaptation.

These oversampling methods each differ in where, how many, and how samples are generated. ADASYN stands out as a strategic and adaptive approach, intensifying support in regions where classification is inherently more difficult.
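The oversamplers above that ship with imbalanced-learn all share the fit_resample interface, so they can be swapped in with a single line each. The sketch below shows Borderline-SMOTE and SMOTE-NC on synthetic data; the binning of column 0 into a fake categorical feature is purely an illustrative assumption.

import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, SMOTENC

X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_features=4, n_informative=3, n_redundant=1,
                           n_samples=500, random_state=0)

# Borderline-SMOTE: oversamples only minority points near the decision boundary
X_bs, y_bs = BorderlineSMOTE(kind="borderline-1", random_state=0).fit_resample(X, y)
print("Borderline-SMOTE:", Counter(y_bs))

# SMOTE-NC: handles mixed data; here column 0 is binarised to mimic a categorical feature
X_mixed = X.copy()
X_mixed[:, 0] = (X_mixed[:, 0] > 0).astype(int)
X_nc, y_nc = SMOTENC(categorical_features=[0], random_state=0).fit_resample(X_mixed, y)
print("SMOTE-NC        :", Counter(y_nc))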

<Undersampling-Based Algorithms (Reduce Majority Class Samples)>

Undersampling methods address imbalance by reducing the number of majority class samples. This can improve learning efficiency and prevent overfitting, although it may result in information loss if not carefully applied.

• Random UnderSampling (RUS)

RUS randomly removes samples from the majority class to balance the dataset. It is simple and computationally inexpensive, but can remove informative samples, risking degraded performance.

• Tomek Links

This method identifies Tomek Links, which are pairs of samples from opposite classes that are each other's nearest neighbors. It removes the majority sample from each pair to clarify decision boundaries and reduce overlap.

• Edited Nearest Neighbors (ENN)

ENN removes samples that disagree with the majority of their k nearest neighbors, acting as a noise filtering method. When applied to the majority class, it can eliminate noisy or borderline points, enhancing classifier stability.

• NearMiss

NearMiss selects majority class samples that are closest to the minority class, emphasizing difficult regions. Other majority samples are discarded. This forces the classifier to focus on decision boundaries, improving robustness in challenging zones.

These techniques help reduce the dominance of the majority class while preserving meaningful patterns, and are often combined with oversampling for enhanced results.
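These undersamplers are likewise available in imbalanced-learn through the same fit_resample interface. The following sketch compares how many majority samples each one keeps on the same synthetic data set.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import (RandomUnderSampler, TomekLinks,
                                     EditedNearestNeighbours, NearMiss)

X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=0)
print("Original:", Counter(y))

samplers = {
    "RandomUnderSampler": RandomUnderSampler(random_state=0),
    "TomekLinks": TomekLinks(),
    "EditedNearestNeighbours": EditedNearestNeighbours(),
    "NearMiss (version 1)": NearMiss(version=1),
}
for name, sampler in samplers.items():
    _, y_res = sampler.fit_resample(X, y)
    print(f"{name:24s} -> {Counter(y_res)}")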

<Hybrid & Advanced Ensemble Approaches>

Advanced methods combine oversampling/undersampling with ensemble learning (e.g., Boosting or Bagging) to handle imbalanced data in a more powerful and robust way.

• SMOTEBoost

SMOTEBoost integrates SMOTE with AdaBoost. In each boosting round, SMOTE generates synthetic samples to balance the training data, improving the model’s sensitivity to the minority class while progressively enhancing weak learners.

• ADASYNBoost

ADASYNBoost replaces SMOTE with ADASYN in the SMOTEBoost framework. ADASYN is applied at each stage to focus sample generation on hard-to-classify regions, leading to faster convergence and improved learning on the minority class.

• EasyEnsemble

EasyEnsemble combines random undersampling and Bagging. It creates multiple balanced subsets by randomly sampling from the majority class, trains a classifier on each, and ensembles them. This improves generalization while avoiding excessive information loss.

• BalancedBaggingClassifier (from imbalanced-learn)

This implementation from the Python imbalanced-learn library extends Bagging to automatically apply balanced bootstrapped subsets using undersampling. It is practical for real-world use and supports pipelines that need imbalance-aware classifiers.

These ensemble-based methods provide strong performance even in nonlinear, high-dimensional, and extremely imbalanced scenarios. Their strength lies in combining the diversity and robustness of ensemble learning with data-level balancing.
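SMOTEBoost and ADASYNBoost are not shipped with imbalanced-learn, but EasyEnsemble and BalancedBagging are. A minimal sketch of the two built-in ensemble classifiers, using default settings on synthetic data, looks like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.ensemble import EasyEnsembleClassifier, BalancedBaggingClassifier

X, y = make_classification(n_classes=2, weights=[0.95, 0.05],
                           n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Each classifier internally undersamples the majority class for every base estimator
for clf in (EasyEnsembleClassifier(random_state=0),
            BalancedBaggingClassifier(random_state=0)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__)
    print(classification_report(y_test, clf.predict(X_test)))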

<Variants and Extensions of ADASYN>

Several enhanced versions of ADASYN have been proposed to improve its robustness and applicability. These include:

• Weighted ADASYN

This version scales the number of synthetic samples not only based on local imbalance, but also by incorporating misclassification probabilities or loss values from a classifier. It offers more refined control, emphasizing regions that truly hinder classification performance.

• Cluster-based ADASYN

This method applies clustering (e.g., KMeans) before sampling, enabling synthesis within cluster boundaries. It allows for better distribution-aware sampling, reducing over-concentration and decision boundary blur.

• Noise-filtered ADASYN

    Here, noise filtering (e.g., ENN, Tomek Links) is applied before synthesis to remove outliers or mislabeled samples. This reduces the risk of generating poor-quality synthetic data, addressing one of ADASYN’s known weaknesses; a minimal sketch of this idea is given at the end of this subsection.

• ADASYN for Regression

Although ADASYN is designed for classification, some extensions have adapted it for regression tasks involving rare or extreme values. It generates samples based on local error or density, helping regressors handle imbalanced target distributions (e.g., anomaly detection, rare event forecasting).

These variations leverage the core strength of ADASYN—adaptive, difficulty-aware sampling—while expanding its flexibility across domains, data types, and tasks.
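As a rough sketch of the noise-filtered idea above (not a published reference implementation), one possible realization is to clean noisy majority samples with ENN and then run ADASYN on the cleaned data; the specific combination and ordering here are illustrative assumptions.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=0)

# Step 1: remove noisy / borderline majority samples with ENN before synthesis
X_clean, y_clean = EditedNearestNeighbours().fit_resample(X, y)

# Step 2: adaptive oversampling with ADASYN on the cleaned data
X_res, y_res = ADASYN(random_state=0).fit_resample(X_clean, y_clean)

print("Original    :", Counter(y))
print("After ENN   :", Counter(y_clean))
print("After ADASYN:", Counter(y_res))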

By understanding and selecting from this rich ecosystem of algorithms, practitioners can tailor imbalance handling strategies to fit the specific structure, domain, and challenges of their dataset.

Implementation

An example implementation of ADASYN is shown below, starting with library installation.

1. Library Installation

To implement ADASYN for handling imbalanced data in a binary classification task (e.g., fraud detection), you need to install the following Python libraries:

pip install imbalanced-learn scikit-learn matplotlib seaborn

2. ADASYN Implementation (scikit-learn + imbalanced-learn)

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Create an imbalanced dataset (class 0: 90%, class 1: 10%)
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1,
                           n_clusters_per_class=1, n_samples=1000, random_state=42)

# Hold out an untouched test set; ADASYN is applied to the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 2. Generate synthetic minority samples with ADASYN
adasyn = ADASYN(random_state=42, n_neighbors=5)
X_resampled, y_resampled = adasyn.fit_resample(X_train, y_train)

# 3. Train and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict(X_test)

# 4. Output the evaluation report
print("Classification report on the imbalanced test data (model trained after ADASYN)")
print(classification_report(y_test, y_pred))

3. Visualization of the Class Distribution (Oversampling Effect)

from collections import Counter

# Compare class distributions before and after resampling
print("Before:", Counter(y_train))
print("After :", Counter(y_resampled))

# Plot the resampled class distribution
sns.countplot(x=y_resampled)
plt.title("Class Distribution after ADASYN")
plt.xlabel("Class Label")
plt.ylabel("Sample Count")
plt.show()

4. Real-World Application Examples

Typical domains and use cases include:

  • Finance: Fraudulent credit card transaction detection, identifying credit defaulters
  • Healthcare: Diagnosis of rare diseases (e.g., cancer, genetic disorders)
  • Cybersecurity: Augmenting rare attack logs or intrusion patterns with synthetic samples
  • IoT / Manufacturing: Anomaly detection and failure prediction based on rare event labels
  • Customer Analytics: Predicting churned users or high-LTV customers (rare but strategically important cases)

Practical Considerations When Using ADASYN

  • Sensitive to Noise and Mislabeled Data:
    ADASYN can be affected by noise or mislabeled samples.
    It is recommended to perform data cleaning or anomaly filtering before applying ADASYN.

  • Less Effective in High-Dimensional Spaces:
    In high-dimensional datasets, ADASYN may lose effectiveness.
    Consider combining it with dimensionality reduction (e.g., PCA) or feature selection techniques.

  • Apply Only to Training Data:
    Synthetic samples generated by ADASYN should only be used in the training set.
    Never apply ADASYN to test or validation data, as it would lead to data leakage and biased evaluation.
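Following the last point above, one convenient way to guarantee that resampling touches only the training data is imbalanced-learn's Pipeline, which runs ADASYN only during fit, so cross-validation folds used for evaluation never contain synthetic samples. A minimal sketch:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline

X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)

# The resampling step runs only when the pipeline is fitted,
# so each validation fold is evaluated without synthetic samples
pipe = Pipeline([
    ("adasyn", ADASYN(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(pipe, X, y, scoring="f1", cv=5)
print("F1 per fold:", scores)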
