How to deal with inaccurate training data in machine learning

What is the problem with inaccurate machine learning training data and why does it happen?

In real-world machine learning tasks, we often encounter cases where different labels have been assigned to items that should carry the same label. Such cases are commonly handled by simply selecting a model and optimizing its parameters, but that approach cannot satisfy strict accuracy requirements (e.g., those described in “Challenges in Achieving 100% Reproducibility for Risk Tasks” and their implementation).

In this article, we discuss how to deal with such cases, in which the training data for machine learning is inaccurate. Inaccurate training data can cause the following problems:

  • Poor model performance: Training a model on inaccurate training data can degrade its performance; if the data contain incorrect labels or noise, the model will tend to make incorrect predictions.
  • Introduce bias: Training data may reflect the subjectivity or bias of the annotator. If inaccurate data has a bias, the model may learn a similar bias, which may bias the model’s prediction results.
  • Poor generalization performance: Using inaccurate training data can lead to poor generalization performance of the model, where the model overfits the inaccurate data and cannot predict well on new data.
  • Ethical issues: If inaccurate training data contains bias or discrimination, models trained on it may reflect similar bias or discrimination. This may raise issues related to fairness and ethics.

Such inaccuracies in the training data can be caused by the following factors:

  • Labeling errors: training data may contain labeling errors due to annotator errors, subjective judgments, and inconsistent labeling.
  • Unbalanced datasets: datasets with unbalanced class distributions may contain less data for minority classes, which may result in inaccurate labeling for them.
  • Noise and outliers: Data sets may contain noise and outliers, which may make accurate labeling difficult or convey incorrect information to the model.
  • Bias or discrimination: Datasets may contain bias or discriminatory elements, which could result in inaccurate labeling or data collection.
  • Outdated data sets: If the training data is outdated, it may not reflect new trends or changes, which could lead to inaccurate results.
Overview of approaches for improving machine learning with inaccurate training data

The following approaches can be used to correct or mitigate inaccurate training data arising from these factors.

  1. Improve data quality: To improve the accuracy of the training data, it is important to re-evaluate the data set and identify and correct inaccurate data. Improving data quality includes correcting inaccurate data, processing missing values, and removing outliers.
  2. Identifying and correcting inaccurate data: Another important approach to improving the accuracy of the training data is to analyze the dataset in detail and identify inaccurate data points. This may involve, for example, finding problems with mislabeling, noise, duplicates, etc. Inaccurate data can then be corrected or removed to improve the quality of the data set.
  3. Data augmentation: To add diversity to the dataset, data augmentation is another approach to improving the accuracy of the training data. For image data, for example, operations such as rotation, flipping, cropping, and brightness changes can generate new data so that the model becomes robust to different variations.
  4. Use ensemble learning: Combining multiple models can reduce the effect of inaccurate training data. Ensemble learning, as described in “Overview of Ensemble Learning and Examples of Algorithms and Implementations”, allows multiple models with different algorithms and parameter settings to be trained and their predictions to be combined to produce more reliable predictions.
  5. Use semi-supervised learning: Even if some of the supervised data is inaccurate, improvements can be made by applying semi-supervised learning, which can take advantage of unlabeled data. By using unlabeled data to train models and then using the predictions to modify the training data, model performance can be improved.
  6. Use active learning: Active learning, in which the model itself requests new labels, focuses on particularly difficult or near-boundary samples in order to reduce the impact of inaccurate training data, and obtains the correct labels from a human expert.
  7. Improving model robustness: There are also approaches to make the model more robust. These include, for example, using models that are tolerant of outliers and noise, introducing regularization and dropouts, etc. This can reduce the impact of inaccurate training data.
  8. Perform domain adaptation: If inaccurate training data is biased toward a particular domain, it may be improved by an approach that trains the model using accurate training data from another domain and adapts it to the target domain. This domain adaptation can reduce the impact of inaccurate training data.

We discuss them in more detail below.

1. Approaches to improve the quality of training data

Possible approaches to improve the quality of training data include

  • Identify and correct inaccurate data: Examine the dataset in detail and identify inaccurate data points. This could be, for example, incorrect labeling, missing values, outliers, etc. Correcting or removing inaccurate data can improve the quality of the data set.
  • Manual review by a human expert: Another approach is to have the dataset manually reviewed by a human expert to identify inaccurate data. This would involve the expert correcting mislabeling or obvious errors and confirming the quality of the data.
  • Crowdsourcing: Crowdsourcing platforms can be used to outsource the work to identify and correct inaccurate data. This could be, for example, to correct inaccurate labels or to check the quality of annotations.
  • Cross-checking and validation: Use another reliable data source and compare it to the original training data to identify data points that do not match, then filter out inaccurate data by relying on information that is consistent across sources.
  • Ensemble Learning and Voting Mechanisms: Ensemble learning by combining multiple models and using voting mechanisms when integrating the predictions of different models can reduce the impact of inaccurate data.
  • Application of noise removal and missing value treatment methods: improve by applying methods to handle noise and missing values in the data set, such as filling in missing values with the mean or median, using filtering methods to remove noise, and other approaches.
  • Combination of reinforcement learning and human manual review: As described in “Reinforcement Learning Application Areas (1) Optimization of Behavior”, there is an approach that improves models through learning from human feedback combined with reinforcement learning. This approach has been used in recent years in ChatGPT and has attracted much attention.

The first several of these approaches rely mainly on manual checking, while the latter ones rely on machine learning and statistical processing. Manual checks are subject to human error and are time-consuming, while the machine learning and statistical approaches leave concerns about accuracy.
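As a concrete illustration of the noise removal and missing-value treatment listed above, the following is a minimal sketch in Python, assuming pandas and scikit-learn are available; the data frame and threshold values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric training data with missing values and one suspicious row.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 4.0, 5.0, 200.0],
    "feature_b": [10.0, 11.0, 12.0, np.nan, 13.0, 14.0],
})

# Fill missing values with the column median.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Drop rows whose z-score exceeds a threshold in any column (simple outlier filter).
z = (filled - filled.mean()) / filled.std()
cleaned = filled[(z.abs() <= 2.0).all(axis=1)]
print(cleaned)
```

Median imputation and a simple z-score filter are only a starting point; domain knowledge should decide which rows are truly erroneous.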

2. Approaches to identifying and correcting inaccurate data

Approaches to identifying and correcting inaccurate data in the training data are described below.

  • Manual Validation and Correction: Improve data quality by examining the dataset in detail, manually identifying inaccurate data, and identifying and correcting mislabeling, obvious errors, missing data, outliers, etc.
  • Apply data quality rules: establish data quality rules and validate data accordingly. This allows, for example, data points with values outside a specific data range or inconsistent data format to be identified and corrected.
  • Crowdsourcing-based corrections: Crowdsourcing platforms can be used to ask outside parties to validate and correct data. This could be, for example, instructing a worker to correct an inaccurate label or to check the quality of annotations.
  • Cross-checking and Validation: Identify inaccurate data by using another reliable data source and comparing it to the original training data. This allows for the identification of inconsistent data points between data sources and correcting them with the correct information.
  • Expert Review: Review data by subject matter experts and domain experts. Their expertise can be leveraged to identify inaccurate data and correct it with the correct information.
  • Automatic error detection and correction: This approach would use machine learning algorithms and statistical methods to automatically detect and correct outliers and inconsistent patterns in the data. This can include, for example, applying outlier detection or missing value completion techniques to correct inaccurate data (see “Anomaly and Change Detection Techniques” for more information on anomaly detection techniques).
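As a sketch of the automatic error detection approach above, the following uses scikit-learn's IsolationForest as the outlier detector; the data and contamination rate are hypothetical, and flagged rows would normally be passed to a human reviewer rather than deleted outright.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical feature vectors with a few injected anomalies standing in for corrupted records.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X[:5] += 8.0

detector = IsolationForest(contamination=0.05, random_state=0)
flags = detector.fit_predict(X)           # -1 = flagged as outlier, 1 = inlier
suspect_indices = np.where(flags == -1)[0]

# Flagged rows are candidates for manual review or correction, not automatic deletion.
print(f"{len(suspect_indices)} suspect rows:", suspect_indices[:10])
```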
3. On the approach to data augmentation in cases where the training data is imprecise

Even in cases where training data is inaccurate, data augmentation methods can be used to increase the diversity of the data set and lead to improvements. Some common data augmentation approaches are described below.

  • Image data
    • Random Rotation: Introduce viewpoint diversity by randomly rotating the image.
    • Mirroring: Flip the image horizontally to create a mirror image variation.
    • Cropping: Randomly crop an image to create data focused on different regions.
    • Brightness and Contrast Variation: Simulate variations in lighting conditions by varying the brightness and contrast of an image.
  • Textual data:
    • Synonym replacement: Create variations in sentence expression by replacing words in a sentence with synonyms.
    • Random Insertions and Deletions: Introduce variations in sentence length and structure by randomly inserting or deleting words from a sentence.
    • Shuffling: Create variations in word or sentence order by randomly reordering words within a sentence or sentences within a document.
  • Audio data:
    • Add Noise: Introduce variations in environmental conditions by adding random noise to the audio data.
    • Time stretch: Create variations in the tempo of the speech by varying the playback speed.
    • Volume variation: Introduce variations in the intensity of the audio by varying the volume of the audio.

Using these data augmentation approaches, it is possible to increase the variation in the data set and improve the generalization performance of the model. For more information on data augmentation, including oversampling, see also “Challenges of Achieving 100% Reproducibility for Risk Tasks and Implementation”. When applying this approach, it is important to select the data augmentation method appropriate to the domain and task, and, if inaccurate data are included, to correct or remove them before applying augmentation.
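The image augmentations listed above can be expressed, for example, with torchvision transforms. The following is a minimal sketch assuming torchvision is installed; the image tensor and parameter values are placeholders.

```python
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                     # mirroring
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),   # cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # brightness/contrast variation
])

image = torch.rand(3, 256, 256)          # stand-in for a real RGB image
augmented = augment(image)
print(augmented.shape)                   # torch.Size([3, 224, 224])
```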

4. On the approach of using ensemble learning

Ensemble learning can be a useful technique when the training data is inaccurate. Ensemble learning combines multiple models to make predictions and integrates the results to complement the weaknesses of individual models and improve performance. Several ensemble learning methods are described below.

  • Bagging: Bagging is an approach that creates multiple models using a technique called bootstrap sampling. Specifically, models are trained using different bootstrap samples, and the predictions of each model are averaged to produce the final prediction. This reduces noise from inaccurate data and produces more stable forecasts.
  • Boosting: Boosting builds models sequentially. A weak model is created and used to make predictions, the next model is then trained with emphasis on the samples that were mispredicted, and the results are integrated into the final prediction. Iterating this process improves model performance.
  • Stacking: In stacking, multiple different models are combined and a meta-model learns to integrate their prediction results into the final prediction. Stacking can improve model performance by leveraging the strengths of different models.

Ensemble learning is an effective method for reducing the effects of inaccurate training data; combining different models can control model bias and variance to achieve more robust predictions. For more information, see “Machine Learning with Ensemble Methods – Fundamentals and Algorithms Reading Memo”. When applying ensemble learning, however, it is important to select a variety of models and an appropriate integration method, and to ensure model diversity by using different training data sets and learning algorithms.
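As a minimal sketch of bagging and a voting ensemble, the following uses scikit-learn on a synthetic dataset with deliberately flipped labels to stand in for noisy supervision; the models and parameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with ~10% of the labels flipped to simulate inaccurate supervision.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many trees on bootstrap samples, averaging out some of the label noise.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Voting: combine models with different inductive biases.
voting = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)), ("bag", bagging)],
    voting="soft",
)
voting.fit(X_train, y_train)
print("test accuracy:", voting.score(X_test, y_test))
```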

5. On the approach of using semi-supervised learning

Semi-supervised learning (Semi-Supervised Learning) can also be an effective approach when the supervised data is inaccurate. Semi-supervised learning combines a small amount of accurately labeled data with a large amount of unlabeled data.

  • Self-training: In self-training, a model is first trained on the labeled data and is then used to make predictions on the unlabeled data. Data that the model predicts with high confidence are treated as new labeled data, and the model is re-trained using this enlarged data set. For details, see “Overview of Self-Supervised Learning, Various Algorithms, and Examples of Implementations”.
  • Semi-supervised support vector machines (S3VM): In S3VM, unlabeled data are clustered and each cluster is assigned a provisional label, which is then used together with the labeled data to train the support vector machine. This can improve prediction accuracy even when correct labels are scarce.
  • Semi-supervised deep learning: Semi-supervised deep learning uses deep neural networks as in regular supervised learning, but also incorporates unlabeled data into the learning process. This includes, for example, using deep generative models or generative adversarial networks (GANs) to generate additional data, which is then used as training data for the model.

Semi-supervised learning is one approach to small-data machine learning; see “Small-Data Learning, Combining Logic and Machine Learning, and Local/Group Learning” for more details. For more information on deep learning approaches, see “About Deep Learning”.
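The self-training idea above can be sketched with scikit-learn's SelfTrainingClassifier, which treats samples labeled -1 as unlabeled and only accepts pseudo-labels above a confidence threshold; the dataset and threshold here are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pretend only ~10% of the labels are trustworthy; mask the rest as unlabeled (-1).
rng = np.random.default_rng(0)
y_partial = y.copy()
unlabeled_mask = rng.random(len(y)) > 0.1
y_partial[unlabeled_mask] = -1

base = SVC(probability=True, random_state=0)
model = SelfTrainingClassifier(base, threshold=0.9)   # only accept confident pseudo-labels
model.fit(X, y_partial)
print("accuracy against the true labels:", model.score(X, y))
```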

6. On the approach of using active learning

In cases where the training data is imprecise, a useful approach is to use Active Learning. Active learning is a method in which the model itself requests labels and selects the most informative data for labeling.

  • Uncertainty sampling: Uncertainty sampling selects and labels data about which the model's predictions are highly uncertain. In a classification task, for example, the data with the lowest prediction confidence, or data near the boundaries between classes, are selected. This focuses labeling on data the model cannot predict with confidence and makes the best use of the expert's labeling effort.
  • Query by committee: Query by committee creates several different models or learners (a committee) and selects data based on how much their predictions agree. Data on which the models make different predictions, or on which the predictions are least consistent, are the most informative to label.
  • Buddy System: In the buddy system, data with questionable labeling are paired with high-confidence data and submitted to an expert. The expert compares the questionable data with the high confidence data and decides whether to confirm or correct the labeling of the questionable data. This improves the quality of the suspect data.

Active learning improves model performance by focusing limited labeling cost and effort on the most informative data, which allows inaccurate training data to be relabeled efficiently and effectively. Active learning can also be combined with ensemble learning as described in “Machine Learning with Ensemble Methods – Fundamentals and Algorithms” and with question-and-answer learning combined with reinforcement learning as described in “Application Areas of Reinforcement Learning (1) Optimization of Behavior”.
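A minimal sketch of uncertainty sampling is shown below, assuming scikit-learn; the initial labeled pool, model, and batch size are hypothetical, and in practice the selected samples would be sent to a human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
labeled_idx = np.arange(50)            # small initial labeled pool
unlabeled_idx = np.arange(50, 1000)    # pool the annotator has not touched yet

# Train on the currently labeled pool.
model = LogisticRegression(max_iter=1000)
model.fit(X[labeled_idx], y[labeled_idx])

# Uncertainty = 1 - confidence of the most probable class.
proba = model.predict_proba(X[unlabeled_idx])
uncertainty = 1.0 - proba.max(axis=1)
query_idx = unlabeled_idx[np.argsort(uncertainty)[-10:]]   # 10 most uncertain samples

print("indices to send for expert labeling:", query_idx)
```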

7. On approaches to improve model robustness

When the training data is inaccurate, approaches that make the model itself more robust can also help. Typical approaches are described below.

  • Data normalization and preprocessing: Data normalization and preprocessing can reduce the effects of noise and inaccurate data. This can be done, for example, by scaling and normalizing features, handling missing values, and removing outliers to improve model stability and performance.
  • Ensemble Learning: Ensemble learning is a method of combining multiple models to make predictions, where the combination of different models can complement the weaknesses of individual models and improve robustness. Ensemble learning includes methods such as bagging, boosting, and stacking.
  • Regularization: Regularization is a method for controlling model complexity, and its use can improve generalization performance by preventing models from overfitting to noisy or inaccurate data. Common regularization methods include L1 regularization, L2 regularization, and dropout.
  • Noise Tolerant Training: Noise tolerant training is a method of training a model by intentionally adding noise to the training data, allowing the model to learn more robust features and reduce the effect of noise. This improves the robustness of the model even with imprecise data.
  • Out-of-distribution detection: Out-of-distribution detection identifies data that fall outside the distribution the model was trained on, or whose labels are unreliable. Detecting and excluding such data minimizes the impact of inaccurate data.
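As a small illustration of the regularization and noise-tolerant training ideas above, the following PyTorch sketch combines dropout, L2 regularization via weight decay, and mild label smoothing; it assumes a recent PyTorch version, and the network and data are placeholders.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),        # dropout to discourage overfitting to noisy samples
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)   # mild smoothing for noisy labels

# One hypothetical training step on random stand-in data.
x = torch.randn(32, 20)
y = torch.randint(0, 2, (32,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print("loss:", loss.item())
```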
8. On the approach to domain adaptation

When training data is inaccurate, domain adaptation can be used to improve model performance. Domain adaptation is a method of adapting a model to data in a different domain than the training data.

  • Transfer Learning: Transfer learning is a method of reusing knowledge learned in different domains. By pre-training a model in the domain where the training data is accurate rather than in the domain where it is inaccurate, and then adapting the learned model to the inaccurate domain, the performance of the model can be improved.
  • Domain adaptation with domain confusion: Domain confusion minimizes the feature differences between the training data and the unknown domain data. It reduces the differences between domains and increases the domain adaptability of the model by mapping the training data and the unknown domain data into the same feature space.
  • Unsupervised Domain Adaptation: In unsupervised domain adaptation, adaptation between domains is performed without using supervised data. Methods include maximizing similarity between domains and adjusting the distribution among domains to extract common features among domains.
  • Domain Generation Models: Domain generation models are methods that generate training data and unknown domain data. By using a generative model to synthesize unknown domain data and using it as training data, it is expected to improve the domain adaptability of the model.
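As a minimal sketch of the transfer-learning approach above, the following freezes a torchvision ResNet-18 backbone pretrained on a source domain and retrains only a new output layer for a hypothetical target task; it assumes torchvision ≥ 0.13 and downloads pretrained weights on first use.

```python
import torch
from torch import nn
from torchvision import models

# Backbone pretrained on a larger, cleaner source domain (ImageNet weights).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False                      # freeze source-domain knowledge

num_target_classes = 5                               # hypothetical target task
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)

# Only the new head is optimized on the (smaller, noisier) target-domain data.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
x = torch.randn(8, 3, 224, 224)                      # stand-in for target-domain images
logits = backbone(x)
print(logits.shape)                                  # torch.Size([8, 5])
```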

There are various approaches to transfer learning, such as those described in “Research Trends in Deep Reinforcement Learning: Meta-Learning and Transfer Learning, Intrinsic Motivation and Curriculum Learning” and “Application Areas of Reinforcement Learning (2) Learning Optimization”, and various approaches to multitask learning, such as those described in “Overview of Multitask Learning and Examples of Application and Implementation”. For domain generative models, there is deep generative learning as described in “Overview of Automatic Sentence Generation Using Huggingface” and “PyTorch Deep Learning for Evolution”, simulation as described in “Simulation, Data Science, and Artificial Intelligence”, and probabilistic generative models as described in “On Probabilistic Generative Models”.

Conclusion

The approaches described here can serve as a basis for machine learning work in general, and I believe that referring to them when solving specific tasks will lead to more efficient solutions.
