Small data learning, fusion of logic and machine learning, local/population learning



Machine learning techniques such as deep learning, which are currently the mainstream, are premised on big data, i.e., large amounts of data, and have been applied successfully across many domains (problems). However, when it comes to solving real-world problems, there are many problem domains where such large amounts of data simply do not exist.

In Shannon’s information theory, the information content (self-information) of an event is defined so that frequent, highly probable events, i.e., information that is already well known, carry little information, while rare, low-probability events carry a great deal. From this perspective, it is often the small amount of rare information sitting in the long tail, rather than the large volume of already well-known information, that generates value.
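
As a minimal illustration, the self-information of an event can be computed directly from its probability; the rare, long-tail event carries far more information than the frequent one (the probabilities below are purely illustrative).

```python
import math

def self_information(p):
    """Self-information (surprisal), in bits, of an event with probability p."""
    return -math.log2(p)

# A frequent, well-known event carries little information;
# a rare, long-tail event carries much more.
print(self_information(0.9))    # ~0.15 bits
print(self_information(0.001))  # ~9.97 bits
```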

In order to efficiently handle this scarce information with computers, existing big data approaches are insufficient, and a different approach is needed.

The simplest approach is to use “interpretable models” such as linear regression and logistic regression, as described in “Explainable Machine Learning”. Because the model is simple, learning is computationally cheap and can be applied to relatively small datasets, but if the mechanism behind the data is complex, accuracy naturally drops.
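
A minimal sketch of this “interpretable model” approach, using scikit-learn’s logistic regression on a small synthetic dataset (the data and feature names are illustrative assumptions); the learned coefficients can be read directly as feature influences.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # only 50 samples
y = (X[:, 0] - 0.5 * X[:, 2] > 0).astype(int)  # simple underlying rule

model = LogisticRegression().fit(X, y)
for name, coef in zip(["x0", "x1", "x2"], model.coef_[0]):
    print(f"{name}: {coef:+.2f}")  # signs and magnitudes are directly interpretable
```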

The next possible approach is the “probabilistic generative model” approach, which assumes a probability distribution over the data and performs inference on it. This method uses MCMC or variational methods to carry out machine learning on a predefined probability model, and it can provide answers, together with their uncertainty (variance), from relatively small amounts of data. Its disadvantages are that it becomes computationally expensive for complex models and that, as the number of parameters grows, the variance also grows, increasing the uncertainty of the estimates.
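
As a minimal sketch of this idea, a conjugate Beta-Binomial model estimated from a handful of observations returns not just a point answer but a posterior whose variance makes the remaining uncertainty explicit (the flat prior and the counts are illustrative assumptions).

```python
from scipy import stats

successes, trials = 3, 8           # a very small observed dataset
# Flat Beta(1, 1) prior; the Beta posterior follows by conjugacy.
posterior = stats.beta(1 + successes, 1 + trials - successes)

print("posterior mean:", posterior.mean())   # ~0.4
print("posterior var :", posterior.var())    # the uncertainty stays visible
```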

There are also weak-label learning approaches such as semi-supervised learning and multi-instance learning. These can be seen as combining supervised and unsupervised learning, and methods such as mi-SVM exist for them.

In this blog, the following pages discuss these approaches for small and long-tail data.

Implementations

The problem of having only a small amount of training data (small data) appears in a wide range of tasks as a factor that reduces the accuracy of machine learning. Machine learning with small data can be approached in various ways that take data limitations and the risk of overfitting into account. This section discusses each approach in detail and gives implementation examples.

  • Overview of Transfer Learning, Algorithms, and Examples of Implementations

Transfer learning, a type of machine learning, is a technique for applying a model or knowledge learned on one task to a different task. Transfer learning is usually useful when little data is available for the new task or when high performance is required. This section provides an overview of transfer learning along with various algorithms and implementation examples.
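
A hedged sketch of this idea in Keras, assuming an ImageNet-pretrained backbone is reused as a frozen feature extractor and only a small task-specific head is trained (the backbone choice, input shape, and class count are assumptions for illustration):

```python
from tensorflow import keras

# Pretrained backbone, used here only as a fixed feature extractor.
base = keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                      input_shape=(160, 160, 3), pooling="avg")
base.trainable = False                            # keep the pretrained features

model = keras.Sequential([
    base,
    keras.layers.Dense(2, activation="softmax"),  # new task-specific head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # train only the head on the small target dataset
```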

  • Active Learning Techniques in Machine Learning

Active learning in machine learning (Active Learning) is a strategic approach to effectively selecting labeled data to improve model performance. Typically, training machine learning models requires large amounts of labeled data, but since labeling is costly and time consuming, active learning increases the efficiency of data collection.
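
A minimal sketch of pool-based active learning with uncertainty sampling, in which the model repeatedly asks for the label of the unlabeled point it is least confident about (the synthetic data, initial label set, and query budget are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 2))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)   # oracle labels

# Tiny initial labeled set: five examples per class.
pos = np.where(y_pool == 1)[0][:5]
neg = np.where(y_pool == 0)[0][:5]
labeled = list(np.concatenate([pos, neg]))

for _ in range(20):                                       # query budget
    model = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    proba = model.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(proba - 0.5)                     # closest to 0.5 = least certain
    candidates = np.argsort(uncertainty)
    query = next(i for i in candidates if i not in labeled)
    labeled.append(query)                                 # "ask the oracle" for this label

print("accuracy on the pool:", model.score(X_pool, y_pool))
```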

In machine learning, recall is a metric used mainly for classification tasks. Achieving 100% recall means extracting, without omission, all of the data (positives) that should be found, a requirement that frequently appears in tasks involving real-world risk.

However, achieving 100% recall is generally difficult, as it is limited by the characteristics of the data and the complexity of the problem. In addition, pursuing 100% recall tends to increase the proportion of false positives (mistaking genuinely negative cases for positives), so the balance between the two must be considered.

This section describes the issues that must be considered in order to achieve 100% recall, as well as approaches and concrete implementations for addressing them.
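
As a minimal illustration of this trade-off, lowering the decision threshold of a probabilistic classifier pushes recall toward 100% while precision falls (the dataset and thresholds below are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score

# Imbalanced synthetic data: roughly 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]

for threshold in (0.5, 0.2, 0.05):
    pred = (proba >= threshold).astype(int)      # lower threshold -> more positives predicted
    print(f"threshold {threshold}: "
          f"recall {recall_score(y, pred):.2f}, precision {precision_score(y, pred):.2f}")
```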

When performing real-world machine learning tasks, one often encounters cases where different labels have been assigned to items that should have received the same label. This article discusses how to deal with such inaccurate labeled (teacher) data in machine learning.

Meta-Learners are one of the key concepts in the domain of machine learning and can be understood as “algorithms that learn learning algorithms.” In other words, Meta-Learners can be described as an approach to automatically acquiring learning algorithms that can adapt to different tasks and domains. This section describes the Meta-Learner concept, various algorithms, and concrete implementations.

Self-Supervised Learning is a type of machine learning that can be seen as a form of supervised learning in which the supervisory signal is derived from the data itself: whereas ordinary supervised learning trains models on externally provided labels, self-supervised learning creates its training targets from the data. This section describes various algorithms, applications, and implementations of self-supervised learning.

Machine learning with small data

As typified by the term “big data,” huge amounts of data accumulate on servers every day thanks to the explosive spread of networks and computers. On the other hand, there is a theoretical result that when the number of data points N is very large, the parameter estimates obtained by maximum likelihood estimation and the posterior distribution of the parameters in Bayesian estimation asymptotically agree.

Yet it is precisely in situations where data are scarce that useful insights from data analysis are most needed. An important approach in machine learning is to “integrate the available information,” and in such cases a constructive solution is to combine inferences from different axes of data, such as a user’s profile information and the purchase histories of many other users.

When considering class identification, if the discriminant function can predict the posterior probability of a class as a value between 0 and 1, it becomes possible to quantify the degree to which an input belongs to the class under consideration. However, since the output of a discriminant function ranges from -∞ to +∞, it is difficult to interpret directly as a posterior probability. Probabilistic discriminant functions such as logistic regression and softmax regression address this, and are also important building blocks of neural networks.
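
A minimal sketch of this mapping: the logistic (sigmoid) function used in logistic regression squashes an unbounded discriminant score into a value in (0, 1) that can be read as a class posterior probability.

```python
import numpy as np

def sigmoid(score):
    """Squash a score in (-inf, +inf) into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

for s in (-5.0, 0.0, 2.0):
    print(f"score {s:+.1f} -> posterior {sigmoid(s):.3f}")
```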

In the previous examples we discussed classification problems, whose goal is to predict a single discrete label for each input data point. Another common type of machine learning problem is regression, which predicts continuous values rather than discrete labels: for example, predicting tomorrow’s temperature from weather data, or the time needed to complete a project from the specifications of a software project. Here the task is to predict housing prices in Boston neighborhoods in the mid-1970s, using data points about those neighborhoods at the time, such as crime rates and local property tax rates. This dataset differs significantly from the previous two examples: it contains only 506 data points, which is relatively small, split into 404 training samples and 102 test samples, and the input features (e.g., crime rate) use different scales, with some expressed as proportions between 0 and 1, some taking values between 1 and 12, and some between 0 and 100.
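
A hedged sketch of this setup, following the common Keras treatment of the Boston housing data: each feature is normalized with training-set statistics before a small network is fit (the network size and number of epochs are illustrative assumptions, and the example assumes the keras.datasets.boston_housing helper is available):

```python
from tensorflow import keras

(train_x, train_y), (test_x, test_y) = keras.datasets.boston_housing.load_data()

# Normalize each feature to zero mean and unit variance;
# the test data must reuse the training-set statistics.
mean, std = train_x.mean(axis=0), train_x.std(axis=0)
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),                # single continuous output for regression
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
model.fit(train_x, train_y, epochs=20, batch_size=16, verbose=0)
print("test MAE:", model.evaluate(test_x, test_y, verbose=0)[1])
```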

In practice it is common to have to train image classification models on very small amounts of data. A “small” sample might mean a few hundred images, or it might mean tens of thousands. As a practical example, take a dataset containing 4000 images of dogs and cats (2000 dog images and 2000 cat images) and classify the images into dogs and cats, using 2000 images for training, 1000 for validation, and 1000 for testing.

The first step in tackling this problem is simply to train a small CNN (without regularization) on the 2000 training samples to establish a baseline; this yields a classification accuracy of 71%, at which point overfitting becomes the main issue. We therefore apply data augmentation, which is effective at suppressing overfitting in computer vision, and this raises the CNN’s accuracy to 82%.
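
A hedged sketch of the data-augmentation step using Keras’ ImageDataGenerator: random geometric transforms mean the CNN never sees exactly the same training image twice (the directory path and parameter values are illustrative assumptions):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random transforms applied on the fly to each training image.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
train_generator = train_datagen.flow_from_directory(
    "cats_and_dogs_small/train",   # hypothetical directory layout with one subfolder per class
    target_size=(150, 150),
    batch_size=32,
    class_mode="binary",
)
# model.fit(train_generator, epochs=100, validation_data=validation_generator)
```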

A well-known and efficient approach for DNNs on small image datasets is to use a pretrained network, i.e., a network that has been trained on a large dataset and then saved. Such an approach is not directly suitable when the new problem involves classes completely different from the original task: for example, ImageNet consists mainly of classes representing animals and everyday objects, so when the network is reused for a very different purpose such as furniture identification, some tuning work is required.

Here we describe the VGG16 architecture, developed in 2014 by Karen Simonyan and Andrew Zisserman. VGG16 is a simple CNN architecture widely used on ImageNet. It is an older model, not far from state-of-the-art performance, but somewhat heavier than many newer models.
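
A hedged sketch of reusing the pretrained VGG16 convolutional base as a frozen feature extractor with a small classifier trained on top, along the lines of the dogs-vs-cats task above (the input size and head architecture are illustrative assumptions):

```python
from tensorflow import keras

conv_base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                     input_shape=(150, 150, 3))
conv_base.trainable = False               # freeze the pretrained convolutional base

model = keras.Sequential([
    conv_base,
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # binary output: dog vs. cat
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_generator, epochs=30, validation_data=validation_generator)
```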

In this article, we will discuss a problem setting called weak label learning, which is positioned between supervised and unsupervised learning. In weak label learning, problems such as classification and regression are considered in situations where label information is only partially available. First, we describe a problem called semi-supervised learning, in which label information is given only for a part of the training cases.

In this section, we describe SVMs for semi-supervised learning. In semi-supervised classification problems, both labeled and unlabeled training examples are given. Finally, we describe an example of a semi-supervised SVM applied to two-dimensional artificial data, where two labeled examples are given for each of the positive and negative classes, along with 100 unlabeled examples.
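
The S3VM itself is not part of scikit-learn, so as a hedged stand-in the sketch below uses self-training wrapped around an SVM on a comparable setup: two labeled examples per class plus 100 unlabeled points, with -1 marking the unlabeled examples (the synthetic data is an illustrative assumption):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], size=(52, 2))
X_neg = rng.normal(loc=[-2, -2], size=(52, 2))
X = np.vstack([X_pos, X_neg])

y = np.full(len(X), -1)        # -1 means "unlabeled" for SelfTrainingClassifier
y[:2] = 1                      # two labeled positive examples
y[52:54] = 0                   # two labeled negative examples

model = SelfTrainingClassifier(SVC(probability=True))
model.fit(X, y)
print(model.predict([[1.5, 1.5], [-1.5, -1.5]]))   # expected: [1 0]
```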

Here we describe an SVM approach to a weak label learning problem called multi-instance learning.

Multi-instance learning is a type of two-class classification problem. The difference from the usual two-class classification problem is that labels are given to sets of training instances called bags, instead of to each individual instance. Each bag consists of multiple instances, and each instance belongs to either the positive class or the negative class. A bag that contains at least one positive instance is called a positive bag; conversely, a bag containing only negative instances is called a negative bag. In multi-instance learning, only the bag labels are given. In a negative bag we know that all instances are negative, but in a positive bag we do not know which instances are positive and which are negative. For this reason, we must find the classification boundary while simultaneously estimating the labels of the instances contained in the positive bags.
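
A hedged, minimal sketch of the mi-SVM idea: instance labels inside positive bags are treated as latent, initialized from the bag label, and re-estimated between SVM fits, while each positive bag is forced to retain at least one positive instance (the toy bags and iteration count are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# bags[i] is an array of instances; bag_labels[i] is the bag label (1 or 0).
bags = [rng.normal(loc=m, size=(5, 2)) for m in ([2, 2], [2, 2], [-2, -2], [-2, -2])]
bag_labels = np.array([1, 1, 0, 0])

# Initialize instance labels from their bag labels.
inst_labels = [np.full(len(b), lab) for b, lab in zip(bags, bag_labels)]

for _ in range(10):                                   # alternate between fitting and relabeling
    X = np.vstack(bags)
    y = np.concatenate(inst_labels)
    svm = SVC(kernel="linear").fit(X, y)
    for i, bag in enumerate(bags):
        if bag_labels[i] == 1:                        # only positive-bag labels are latent
            scores = svm.decision_function(bag)
            new = (scores > 0).astype(int)
            if new.sum() == 0:                        # keep at least one positive instance per positive bag
                new[np.argmax(scores)] = 1
            inst_labels[i] = new

print(svm.predict([[2.0, 2.0], [-2.0, -2.0]]))        # expected: [1 0]
```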

Semi-supervised learning of image data and natural language using back translation

The following methods exist for realizing semi-supervised learning: (1) self-training, (2) semi-supervised Gaussian mixture models, (3) co-training, (4) graph-based semi-supervised learning, (5) S3VM (semi-supervised support vector machine), and (6) PNU learning.

  • About Data Augmentation (external link): This article is about data augmentation, a technique in data science that has been attracting a lot of attention recently and is sometimes referred to as data inflation. The technique is closely related to overfitting, a universal problem in machine learning, and also offers a clue to why deep learning can learn and perform so well.

Integration of logic and rules with probability/machine learning

    Various approaches have been taken to the problem of knowledge representation, i.e., how to represent, acquire, and use knowledge, which is a fundamental problem of artificial intelligence technology. These include machine learning technologies such as deep learning, recognition technologies such as speech recognition and image recognition, and inference technologies such as expert systems.

    Today, knowledge is available via the Internet in large volumes of diverse, unstructured symbolic form, such as academic journals, dictionaries, Wikipedia, social media, and news articles.

    This knowledge can be classified into various categories, one of which is logical knowledge and the other is probabilistic knowledge.

    In the following pages of this blog, we discuss the flow of modeling complex reality on a computer using probabilistic models that combine these two major categories of knowledge, logical and probabilistic, as well as the connections between probability, logic, computation, and machine learning that lie behind such modeling.

    Local and collective learning

    When considering class identification, if the discriminant function can predict the posterior probability of a class as a value between 0 and 1, it becomes possible to quantify the degree to which an input belongs to the class under consideration. However, since the output of a discriminant function ranges from -∞ to +∞, it is difficult to interpret directly as a posterior probability. Probabilistic discriminant functions such as logistic regression and softmax regression address this, and are also important building blocks of neural networks.

    When data is distributed in a complex way in the feature space, nonlinear classifiers are effective. Nonlinear classifiers can be constructed with kernel methods, neural networks, and other techniques. Here we describe ensemble learning (also called group learning), in which multiple simple classifiers are combined to form a nonlinear classifier.

    As a form of ensemble learning, we describe bagging, in which subsets are generated from a set of training data and a predictor is trained on each subset. This method is especially effective for unstable learning algorithms. An unstable learning algorithm is an algorithm in which a small change in the training data set has a large impact on the structure and parameters of the predictor to be learned. Unstable learning algorithms include neural networks and decision trees.

    The bootstrap method is a way of generating diverse subsets from a finite dataset: M new datasets are created by repeatedly sampling from the original dataset with replacement.
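
    A minimal sketch of the bootstrap and bagging: each of the M subsets is drawn by sampling with replacement from the original data, an unstable learner (here a decision tree) is trained on each, and the predictions are aggregated by majority vote (the synthetic data and M are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

M = 25
trees = []
for _ in range(M):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sampling with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate the predictors by majority vote (bagging).
votes = np.mean([t.predict(X) for t in trees], axis=0)
print("bagged accuracy:", np.mean((votes > 0.5).astype(int) == y))
```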

    In this article, we discuss how to construct a model that describes the whole based on local models. Specifically, we first take temperature data for Tokyo as an example and try to extract meaning from it using a statistical model with a small number of fixed parameters, i.e., starting with data analysis using a global model. Next, we introduce a local linear model to extract temporally localized information, and then extend it to a nonlinear model to see how the expressive power of the model can be enriched. This local nonlinear model is a “soft” model in which the constraints, usually expressed as exact equations, are replaced by stochastic difference equations that allow probabilistic deviations. Furthermore, by generalizing the distribution followed by the noise term that generates these stochastic fluctuations to a non-Gaussian distribution, rarely occurring stochastic events such as jumps and outliers can be handled better. This property, which cannot be expressed by a noise term following a Gaussian distribution, is called non-Gaussianity.

    Explainable Machine Learning

    There are three patterns of “explaining”: (1) identifying the cause, (2) deriving a more specific hypothesis from a general hypothesis, and (3) finding out what it is. In this context, “explaining” machine learning mainly means clarifying the causes (parameters affecting the results) of (1).

    The current technical trend is to use two main approaches to explain machine learning: (a) interpretation by interpretable machine learning models, and (b) extraction of important features by interpreting the machine learning results (mainly by statistical approaches).

    In the following pages of this blog, we discuss the various approaches in this explainable machine learning technique.

    Probabilistic Generative Models (Bayesian Estimation)

    A probabilistic generative model is one that considers that data in the real world is backed by a mechanism (model) that generates the data, and that the data is not generated deterministically and strictly, but is generated with a certain variability and fluctuation.

    As approaches to probabilistic models, there are Bayesian inference based on Bayesian statistics, graphical models for handling complex probabilistic models with intuitive images, and Markov chain Monte Carlo (MCMC) methods, variational methods, and nonparametric Bayesian methods for calculating probability distributions.

    These methods can be applied to natural language processing as represented by topic models, speech recognition using hidden Markov models, sensor analysis, and analysis of various statistical information including geographic information.

    In the following pages of this blog, we discuss Bayesian modeling, topic modeling, and various applications and implementations using these probabilistic generative models.

    Online Learning / Online Prediction

    Online learning is a learning method in which the model is improved sequentially, using only the data given at each step (a single example or a small portion of the full dataset), rather than using all the data at once. Because of this processing style, it can be applied to data analysis at scales where the full dataset cannot be held in memory or cache, and to learning in environments where data is generated continuously.
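
    A minimal sketch of online learning using scikit-learn's partial_fit interface: the model is updated one mini-batch at a time, so the full dataset never needs to be held in memory (the simulated stream and batch size are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])                       # all classes must be declared up front

for step in range(100):                          # simulated data stream
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)   # sequential update

print("coefficients after streaming:", model.coef_.round(2))
```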

    Reinforcement learning and online prediction are frameworks that handle various decision-making problems by building machine learning on top of this kind of sequential learning.

    In the following pages of this blog, we provide a theoretical overview of online learning, reinforcement learning, and online prediction, as well as various implementations and applications.

    Machine Learning Based on Sparsity

    Sparse modeling is a technique that automatically extracts the necessary parts of a statistical model according to the given data. In multiple regression analysis, this means extracting the few necessary items from a long list of explanatory variables. There are also the traditional terms “model selection” and “variable selection”: “variable selection” refers to choosing the variables to include in a model, while “model selection” is used more generally and includes, for example, the choice of noise distribution.

    Sparse modeling methods, such as L1-norm regularization, have been incorporated into all kinds of data analysis and machine learning, including bioinformatics, engineering, and big data, and it is now hard to find a field where sparsity is not used.
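
    A minimal sketch of sparsity via L1 regularization: the Lasso drives most coefficients to exactly zero, automatically selecting the few explanatory variables that matter (the synthetic data and regularization strength are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                  # 20 candidate explanatory variables
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.1, size=100)  # only two of them matter

model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(model.coef_))     # expected: [0 5]
```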

    In the following pages of this blog, we discuss theoretical explanations, concrete implementations, and various applications of machine learning based on sparsity.
