Overview of ADASYN
ADASYN (Adaptive Synthetic Sampling), proposed by Haibo He et al. in 2008, is a data-level technique designed to address class imbalance in classification tasks. The primary goal of ADASYN is to reduce the learning bias caused by severely underrepresented minority class samples and to construct a more balanced and effective classification model.
Unlike basic oversampling methods that uniformly generate synthetic samples, ADASYN focuses on generating more samples in regions where minority class examples are harder to classify. Specifically, it evaluates the distribution of majority class samples in the neighborhood of each minority class instance. The more imbalanced the local region, the more synthetic samples are generated for that point. This allows the model to better learn complex decision boundaries and improves classification accuracy in difficult areas.
By adopting this adaptive sampling strategy, ADASYN can outperform traditional oversampling techniques such as SMOTE in practical applications, particularly by strengthening the model’s performance on critical and challenging regions of the data space.
How ADASYN Works (Summary)
The operation of ADASYN can be summarized in the following steps:
- For each sample in the minority class, a set of k nearest neighbors (typically k = 5) is identified.
- The number of majority class samples within this neighborhood is counted to evaluate the local class imbalance.
- A difficulty score is computed: samples surrounded by more majority class instances are considered harder to classify and are therefore assigned more synthetic samples.
- The synthetic samples are generated via linear interpolation, similar to SMOTE, using the original minority class sample and its minority class neighbors.
Through this process, ADASYN places more emphasis on hard-to-learn areas and enhances the classifier’s ability to correctly distinguish the minority class, especially near decision boundaries.
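To make these steps concrete, the sketch below illustrates, in plain NumPy and scikit-learn (not the imbalanced-learn implementation), how the per-sample difficulty ratios and synthetic-sample counts might be computed. The function name, variable names, and the balance parameter beta are introduced here purely for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label=1, k=5, beta=1.0):
    """Sketch of ADASYN's adaptive allocation of synthetic samples.

    X, y are NumPy arrays. Returns, for each minority sample, how many
    synthetic points to generate. beta sets the desired balance level
    (1.0 = fully balanced after oversampling).
    """
    X_min = X[y == minority_label]
    n_min = len(X_min)
    n_maj = int(np.sum(y != minority_label))

    # Total number of synthetic samples needed to reach the desired balance
    G = int((n_maj - n_min) * beta)

    # k nearest neighbors of each minority sample, searched over the whole dataset
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)       # the first neighbor is the point itself
    neighbor_labels = y[idx[:, 1:]]     # labels of the k true neighbors

    # r_i: fraction of majority-class neighbors, i.e. local classification difficulty
    r = np.mean(neighbor_labels != minority_label, axis=1)
    r_hat = r / r.sum() if r.sum() > 0 else np.full(len(r), 1.0 / len(r))

    # g_i: number of synthetic samples to create around each minority sample
    return np.round(r_hat * G).astype(int)
```

Each minority sample would then receive its allotted synthetic points by interpolating toward randomly chosen minority-class neighbors, exactly as in SMOTE.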
Differences Between ADASYN and SMOTE
While both ADASYN and SMOTE (Synthetic Minority Over-sampling Technique, described in “Overview of SMOTE (Synthetic Minority Over-sampling Technique), its algorithm and examples of implementation”) aim to mitigate class imbalance by generating synthetic minority class samples, they differ in their strategies:
- Location of Generation: SMOTE generates samples uniformly across all minority class samples, without regard to local classification difficulty. ADASYN, by contrast, focuses on areas with a high chance of misclassification, i.e., where the majority class is dominant in the neighborhood.
- Weighting Strategy: SMOTE applies no weighting, treating all minority class samples equally. ADASYN assigns weights based on the local class imbalance, resulting in more samples for harder regions.
- Targeted Regions: SMOTE aims to enhance the entire minority class distribution. ADASYN explicitly targets hard-to-classify regions of the minority class, strengthening the classifier in areas that are more likely to contribute to errors.
In summary, while SMOTE is a general-purpose balancing method, ADASYN is a more strategic and focused technique that adaptively learns where support is needed most.
Advantages and Disadvantages of ADASYN
Improved Accuracy
- Advantage: ADASYN enhances classification accuracy for the minority class by focusing on regions with high misclassification risk.
- Disadvantage: Overemphasizing difficult regions may lead to overlap between classes, potentially degrading performance by blurring class boundaries.
Local Adaptivity
- Advantage: ADASYN evaluates the local neighborhood difficulty of each sample, allowing for targeted and context-aware oversampling.
- Disadvantage: This adaptivity may cause the model to overreact to noise or outliers, generating synthetic samples in unrepresentative or misleading regions.
Automatic Sample Adjustment
- Advantage: ADASYN dynamically determines the number of synthetic samples based on the difficulty of each instance, optimizing resource allocation during training.
- Disadvantage: The method’s effectiveness depends on careful tuning of hyperparameters such as the number of neighbors (k), which can significantly affect performance (a tuning sketch follows at the end of this section).
In conclusion, ADASYN is a powerful and adaptive oversampling strategy for imbalanced learning, offering notable improvements in minority class learning. However, to achieve its full potential, one must carefully manage noise sensitivity and parameter configuration.
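Since the neighborhood size is the main knob, a common practical recipe is to tune ADASYN’s n_neighbors with cross-validation inside an imbalanced-learn pipeline, so that resampling happens only on the training folds. The following is a minimal sketch on synthetic data; the parameter grid and scoring choice are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline  # resampling is applied to training folds only

# Illustrative imbalanced dataset (90% / 10%)
X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)

pipe = Pipeline([
    ("adasyn", ADASYN(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Tune the neighborhood size k used by ADASYN; F1 emphasizes the minority class
grid = GridSearchCV(pipe,
                    param_grid={"adasyn__n_neighbors": [3, 5, 7, 11]},
                    scoring="f1", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```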
Related Algorithms
ADASYN (Adaptive Synthetic Sampling) is part of a broader family of techniques designed to address imbalanced data problems, particularly through oversampling. Below, we describe key related algorithms categorized by their strategies.
<Oversampling-Based Algorithms (Generate Synthetic Samples for Minority Class)>
Oversampling methods aim to enhance the learning capability of the minority class by generating synthetic samples and adding them to the training data. While ADASYN is one such method, other notable approaches each have their own unique characteristics and strategies.
• SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE generates new samples by performing linear interpolation between a minority class sample and its neighbors. It treats all minority samples equally, without considering classification difficulty. In contrast, ADASYN adjusts the number of samples based on difficulty.
• Borderline-SMOTE
This method focuses on minority class samples located near the decision boundary, which are more prone to misclassification. It generates synthetic samples only in these high-risk regions, similar to ADASYN, but restricts the target explicitly to borderline areas.
• SMOTE-NC (Nominal and Continuous)
SMOTE-NC extends SMOTE to support mixed data types (both continuous and categorical). Categorical features are handled via majority voting, while continuous features are interpolated. ADASYN typically applies only to continuous numerical data.
• KMeans-SMOTE
This technique uses KMeans clustering on the minority class to identify representative points and then generates synthetic samples from them. It focuses on the global data structure, unlike ADASYN, which targets local classification difficulty.
• SOMO (Self-Organizing Map based Oversampling)
SOMO leverages Self-Organizing Maps (SOM) to learn the topological distribution of the minority class and generates samples near cluster centers. It emphasizes topological structure over local classification difficulty, prioritizing spatial balance over local adaptation.
These oversampling methods each differ in where, how many, and how samples are generated. ADASYN stands out as a strategic and adaptive approach, intensifying support in regions where classification is inherently more difficult.
<Undersampling-Based Algorithms (Reduce Majority Class Samples)>
Undersampling methods address imbalance by reducing the number of majority class samples. This can improve learning efficiency and prevent overfitting, although it may result in information loss if not carefully applied.
• Random UnderSampling (RUS)
RUS randomly removes samples from the majority class to balance the dataset. It is simple and computationally inexpensive, but can remove informative samples, risking degraded performance.
• Tomek Links
This method identifies Tomek Links, which are pairs of samples from opposite classes that are closest to each other. It removes the majority sample from each pair to clarify decision boundaries and reduce overlap.
• Edited Nearest Neighbors (ENN)
ENN removes samples that disagree with the majority of their k nearest neighbors, acting as a noise filtering method. When applied to the majority class, it can eliminate noisy or borderline points, enhancing classifier stability.
• NearMiss
NearMiss selects majority class samples that are closest to the minority class, emphasizing difficult regions. Other majority samples are discarded. This forces the classifier to focus on decision boundaries, improving robustness in challenging zones.
These techniques help reduce the dominance of the majority class while preserving meaningful patterns, and are often combined with oversampling for enhanced results.
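As a simple illustration of such a combination, ADASYN can be followed by Tomek Links cleaning using imbalanced-learn’s samplers. The sketch below uses synthetic data and default parameters and is meant only to show the chaining, not a tuned recipe.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import TomekLinks

# Illustrative imbalanced dataset (90% / 10%)
X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)

# Step 1: oversample the minority class adaptively
X_over, y_over = ADASYN(random_state=42).fit_resample(X, y)

# Step 2: remove Tomek Links to clean up overlap created near the class boundary
X_res, y_res = TomekLinks().fit_resample(X_over, y_over)

print("Original:", Counter(y))
print("After ADASYN:", Counter(y_over))
print("After Tomek Links:", Counter(y_res))
```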
<Hybrid & Advanced Ensemble Approaches>
Advanced methods combine oversampling/undersampling with ensemble learning (e.g., Boosting or Bagging) to handle imbalanced data in a more powerful and robust way.
• SMOTEBoost
SMOTEBoost integrates SMOTE with AdaBoost. In each boosting round, SMOTE generates synthetic samples to balance the training data, improving the model’s sensitivity to the minority class while progressively enhancing weak learners.
• ADASYNBoost
ADASYNBoost replaces SMOTE with ADASYN in the SMOTEBoost framework. ADASYN is applied at each stage to focus sample generation on hard-to-classify regions, leading to faster convergence and improved learning on the minority class.
• EasyEnsemble
EasyEnsemble combines random undersampling and Bagging. It creates multiple balanced subsets by randomly sampling from the majority class, trains a classifier on each, and ensembles them. This improves generalization while avoiding excessive information loss.
• BalancedBaggingClassifier (from imbalanced-learn)
This implementation from the Python imbalanced-learn library extends Bagging to automatically apply balanced bootstrapped subsets using undersampling. It is practical for real-world use and supports pipelines that need imbalance-aware classifiers.
These ensemble-based methods provide strong performance even in nonlinear, high-dimensional, and extremely imbalanced scenarios. Their strength lies in combining the diversity and robustness of ensemble learning with data-level balancing.
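As a concrete illustration of the last item, BalancedBaggingClassifier can be used as a drop-in scikit-learn-style estimator. A minimal sketch on synthetic data, with hyperparameters left mostly at their defaults:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.ensemble import BalancedBaggingClassifier

# Illustrative imbalanced dataset (90% / 10%)
X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Each bootstrap sample is rebalanced by undersampling the majority class
clf = BalancedBaggingClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```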
<Variants and Extensions of ADASYN>
Several enhanced versions of ADASYN have been proposed to improve its robustness and applicability. These include:
• Weighted ADASYN
This version scales the number of synthetic samples not only based on local imbalance, but also by incorporating misclassification probabilities or loss values from a classifier. It offers more refined control, emphasizing regions that truly hinder classification performance.
• Cluster-based ADASYN
This method applies clustering (e.g., KMeans) before sampling, enabling synthesis within cluster boundaries. It allows for better distribution-aware sampling, reducing over-concentration and decision boundary blur.
• Noise-filtered ADASYN
Here, noise filtering (e.g., ENN, Tomek Links) is applied before synthesis to remove outliers or mislabeled samples. This reduces the risk of generating poor-quality synthetic data, addressing one of ADASYN’s known weaknesses (a minimal sketch appears at the end of this subsection).
• ADASYN for Regression
Although ADASYN is designed for classification, some extensions have adapted it for regression tasks involving rare or extreme values. It generates samples based on local error or density, helping regressors handle imbalanced target distributions (e.g., anomaly detection, rare event forecasting).
These variations leverage the core strength of ADASYN—adaptive, difficulty-aware sampling—while expanding its flexibility across domains, data types, and tasks.
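To make the noise-filtered idea concrete, the sketch below simply chains imbalanced-learn’s EditedNearestNeighbours filter with standard ADASYN. It is a hand-rolled approximation of the filter-then-synthesize workflow, not a published Noise-filtered ADASYN implementation.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.over_sampling import ADASYN

# Illustrative imbalanced dataset (90% / 10%)
X, y = make_classification(n_classes=2, weights=[0.9, 0.1],
                           n_samples=1000, random_state=42)

# Step 1: remove noisy / borderline points with ENN before synthesizing
X_clean, y_clean = EditedNearestNeighbours().fit_resample(X, y)

# Step 2: run ADASYN on the cleaned data
X_res, y_res = ADASYN(random_state=42).fit_resample(X_clean, y_clean)

print("Original:", Counter(y))
print("After ENN:", Counter(y_clean))
print("After ADASYN:", Counter(y_res))
```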
By understanding and selecting from this rich ecosystem of algorithms, practitioners can tailor imbalance handling strategies to fit the specific structure, domain, and challenges of their dataset.
Implementation
The following example shows how to implement ADASYN step by step, starting with library installation.
1. Library Installation
To implement ADASYN for handling imbalanced data in a binary classification task (e.g., fraud detection), you need to install the following Python libraries:
pip install imbalanced-learn scikit-learn matplotlib seaborn
2. ADASYN Implementation (using scikit-learn + imbalanced-learn)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.9, 0.1],  # class 0: 90%, class 1: 10%
                           n_informative=3, n_redundant=1,
                           n_clusters_per_class=1, n_samples=1000, random_state=42)

# 2. Hold out a test set first, then generate synthetic data with ADASYN
#    on the training split only (synthetic samples must never reach the test set)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
adasyn = ADASYN(random_state=42, n_neighbors=5)
X_resampled, y_resampled = adasyn.fit_resample(X_train, y_train)

# 3. Train and evaluate
clf = RandomForestClassifier(random_state=42)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict(X_test)

# 4. Print the evaluation report
print("Classification report on the imbalanced test data (model trained after ADASYN)")
print(classification_report(y_test, y_pred))
3. Visualization of Class Distribution (Oversampling Effect)
from collections import Counter

# Compare class distributions before and after resampling
print("Before:", Counter(y_train))
print("After :", Counter(y_resampled))

# Plot the resampled class distribution
sns.countplot(x=y_resampled)
plt.title("Class Distribution after ADASYN")
plt.xlabel("Class Label")
plt.ylabel("Sample Count")
plt.show()
4. Real-World Application Examples
| Domain | Use Cases |
| --- | --- |
| Finance | Fraudulent credit card transaction detection, identifying credit defaulters |
| Healthcare | Diagnosis of rare diseases (e.g., cancer, genetic disorders) |
| Cybersecurity | Synthetic generation of attack logs or intrusion patterns |
| IoT / Manufacturing | Anomaly detection and failure prediction based on rare event labels |
| Customer Analytics | Predicting churned users or high-LTV customers (rare but strategically important cases) |
Practical Considerations When Using ADASYN
- Sensitive to Noise and Mislabeled Data: ADASYN can be affected by noise or mislabeled samples. It is recommended to perform data cleaning or anomaly filtering before applying ADASYN.
- Less Effective in High-Dimensional Spaces: In high-dimensional datasets, ADASYN may lose effectiveness. Consider combining it with dimensionality reduction (e.g., PCA) or feature selection techniques.
- Apply Only to Training Data: Synthetic samples generated by ADASYN should only be used in the training set. Never apply ADASYN to test or validation data, as this would lead to data leakage and biased evaluation (a leakage-safe pipeline sketch follows this list).
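The last two points can be handled together with an imbalanced-learn pipeline, which applies the resampler only to the training folds during cross-validation and lets you prepend dimensionality reduction. A minimal sketch; the dataset, the number of PCA components, and the scoring metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline  # resampling applied to training folds only

# Illustrative high-dimensional, imbalanced dataset
X, y = make_classification(n_classes=2, weights=[0.9, 0.1], n_features=50,
                           n_informative=5, n_samples=1000, random_state=42)

pipe = Pipeline([
    ("pca", PCA(n_components=10)),        # reduce dimensionality first
    ("adasyn", ADASYN(random_state=42)),  # oversample within each training fold
    ("clf", RandomForestClassifier(random_state=42)),
])

# Validation folds never contain synthetic samples, so the estimate is not inflated
scores = cross_val_score(pipe, X, y, scoring="f1", cv=5)
print("Mean F1 on the minority class:", scores.mean())
```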
Application Examples of ADASYN
ADASYN is widely applied in domains where minority classes are critical and class imbalance poses significant challenges. Below are real-world use cases across various industries.
1. Medical Diagnosis & Healthcare
Background
- Rare diseases and anomaly detection are inherently minority-class problems (e.g., cancer, heart disease, arrhythmia).
- Models often prioritize the majority class (healthy) due to imbalance, reducing sensitivity to critical conditions.
Use Cases and ADASYN Contributions
| Task | ADASYN’s Contribution |
| --- | --- |
| Arrhythmia detection from ECG signals | Synthesizes abnormal waveform samples to boost recall. |
| Chest X-ray diagnosis for lung disease | Improves classification accuracy for rare conditions like lung cancer. |
| Genetic disorder diagnosis | Complements limited patient data for low-incidence diseases. |
2. Finance & Insurance: Fraud Detection and Credit Risk
Background
- Fraud and anomalies often occur at rates below 1%.
- False negatives can lead to significant financial losses.
Use Cases and ADASYN Contributions
| Task | ADASYN’s Contribution |
| --- | --- |
| Credit card fraud detection | Increases recall by synthesizing fraud examples and reduces false negatives. |
| Insurance fraud detection | Augments rare fraud samples to build more robust classifiers. |
| Credit scoring for SMEs | Mitigates imbalance in default prediction for small business lending. |
3. Manufacturing & IoT: Anomaly Detection and Predictive Maintenance
Background
- Normal operation logs dominate; abnormal events (e.g., failure, overheating) are rare but critical.
- High-precision real-time monitoring is essential.
Use Cases and ADASYN Contributions
| Task | ADASYN’s Contribution |
| --- | --- |
| Vibration-based machine fault detection | Synthesizes failure data to train a detector sensitive to anomalies. |
| Detecting sudden sensor value changes | Expands rare fault logs to reduce false positives. |
4. Cybersecurity & Intrusion Detection
Background
- Attack or intrusion logs are buried under large volumes of normal data.
- Learning from known threats is essential for detecting novel attacks.
Use Cases and ADASYN Contributions
| Task | ADASYN’s Contribution |
| --- | --- |
| Real-time DoS attack detection | Learns patterns from limited abnormal traffic samples. |
| Malware classification | Enhances rare but dangerous samples for robust threat recognition. |
5. Education & Churn Prediction
Background
- In education and SaaS platforms, dropout users or high-value users are few but important.
- The goal is not just accuracy but high recall for minority cases.
Use Cases and ADASYN Contributions
| Task | ADASYN’s Contribution |
| --- | --- |
| Predicting student dropout in online learning | Boosts recall by synthesizing behavior data from dropouts. |
| Detecting low-performing students early | Improves classification for early intervention alerts. |
References
1. Foundational Paper (Original ADASYN Proposal)
- He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN). → The original paper that introduced ADASYN, proposing an adaptive sampling approach based on classification difficulty.
2. Comparative Studies & Theoretical Foundations
- Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research (JAIR). → This foundational work underpins many later oversampling techniques including ADASYN; it generates synthetic samples uniformly across the minority class.
- Fernández, A., et al. (2018). Learning from Imbalanced Data Sets (book). → A comprehensive theoretical and practical comparison of imbalanced learning methods including SMOTE and ADASYN.
- Sun, Y., Wong, A. K. C., & Kamel, M. S. (2009). Classification of Imbalanced Data: A Review. International Journal of Pattern Recognition and Artificial Intelligence. → A broad review of imbalanced learning strategies, comparing various methods including ADASYN.
3. Implementation References (Python Library)
- imbalanced-learn (a Python library that extends scikit-learn) → imblearn.over_sampling.ADASYN is implemented as a class in this library. The documentation includes usage examples, parameters, and sample code.
4. Applied Research Cases
- Chen, C., Liaw, A., & Breiman, L. (2004). Using Random Forest to Learn Imbalanced Data. → Demonstrates the application of resampling and cost-sensitive techniques in combination with Random Forest for tasks like fraud detection and medical diagnosis.
- Liu, X. Y., Wu, J., & Zhou, Z. H. (2009). Exploratory Undersampling for Class-Imbalance Learning. → Introduces and compares ensemble-based undersampling methods such as EasyEnsemble and BalanceCascade.
5. Other Related Algorithms (For Comparison)
- Borderline-SMOTE: Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A New Over-sampling Method in Imbalanced Data Sets Learning. → Focuses on generating samples near decision boundaries to improve minority class recall.
- KMeans-SMOTE: Douzas, G., Bacao, F., & Last, F. (2018). Oversampling for Imbalanced Learning Based on K-Means and SMOTE. → Combines clustering and oversampling to generate better-distributed synthetic samples across the minority class.