Explainable Machine Learning


About Explainable Machine Learning

Explainable machine learning (EML) refers to the ability to present the results output by machine learning algorithms in a format that allows the reasons and rationale for the results to be explained.

Conventional machine learning algorithms make predictions by extracting complex patterns from data, which makes the process opaque and makes it difficult to confirm or explain why a particular result was produced. As a consequence, in many real-world tasks, even highly accurate machine learning results cannot be used to support human decision making. For machine learning models to support human decision making, the results output by machine learning algorithms need to be presented in an explainable format.

“Explaining” involves three types of actions: (1) determining the cause, (2) deriving a more specific hypothesis from a general hypothesis, and (3) determining the true nature of the hypothesis. Presenting machine learning results in an explainable form means identifying the causes and clarifying the hypotheses that influence the results, in other words, simplifying the predicted patterns to a level of granularity that is understandable to humans.

Current technological trends in explainable machine learning are dominated by two approaches: (A) interpretation using interpretable machine learning models, and (B) post-hoc interpretation models (model-independent interpretation methods).

Interpretation by interpretable machine learning models includes, for example, the following model-based approaches.

  • Decision Tree Models: Decision trees classify and predict by splitting the data into a tree structure, so the learned branching rules can be read and analyzed directly.
  • Logistic regression model: Logistic regression is a model for linear classification that expresses classification results as probabilities. It can be used to analyze the importance of variables when making predictions.
  • Random forest: Random forest is an ensemble model that combines multiple decision trees and provides measures such as feature importance that aid interpretation.

In addition, post-hoc interpretation models (model-independent interpretation methods) include the following approaches.

  • Approach using statistical properties of features: Quantify the impact of features by analyzing their distributions (mean and variance), the correlation coefficients among features, feature importance, data visualization, and so on, thereby increasing the interpretability of the model.
  • Approach using surrogate models: Use a simpler, alternative model (surrogate model) fitted to the same inputs and outputs as the original machine learning model, and explain the behavior through this simpler model.
  • Approach using a bandit problem: Prepare a separate arm for each feature and map the reward obtained for each arm to the importance of that feature; by accumulating the rewards of the arms that were selected, the importance of each feature is estimated, and this importance indicates which features most strongly influence the outcome.
  • Approach using game theory: Each data point in the training data set is considered an agent, and the action taken by each agent is the value of the feature. The predicted results output by the model are treated as rewards, and the importance of each feature is estimated by defining a strategy under which each agent maximizes its reward.

This blog discusses each of these topics in detail.

Implementation

Explainable Machine Learning (EML) refers to methods and approaches that explain the predictions and decision-making results of machine learning models in an understandable way. In many real-world tasks, model explainability is often important. This can be seen, for example, in solutions for finance, where it is necessary to explain on which factors the model bases its credit score decisions, or in solutions for medical diagnostics, where it is important to explain the basis and reasons for predictions for patients.

In this section, we discuss various algorithms for explainable machine learning and examples of their implementation in Python.

An adversarial attack is one of the most common attacks on machine learning models, targeting input data such as images, text, and audio. Adversarial attacks aim to cause the model to misclassify by applying slight perturbations (noise or manipulations) to the input. Such attacks can reveal security vulnerabilities and help assess model robustness.

Decision Tree is a tree-structured classification and regression method used as a predictive model in machine learning and data mining. Since decision trees construct conditional branching rules in the form of a tree to predict classes (classification) and values (regression) based on data characteristics (features), they make machine learning results white-box, as described in “Explainable Machine Learning”. This section describes various algorithms for decision trees and concrete examples of their implementation.

Causal Forest is a machine learning model for estimating causal effects from observed data, based on Random Forest and extended based on conditions necessary for causal inference. This section provides an overview of the Causal Forest, application examples, and implementations in R and Python.

Statistical Hypothesis Testing is a method in statistics that probabilistically evaluates whether a hypothesis is true or not, and is used not only to evaluate statistical methods, but also to evaluate the reliability of predictions and to select and evaluate models in machine learning. It is also used in the evaluation of feature selection as described in “Explainable Machine Learning,” and in the verification of the discrimination performance between normal and abnormal as described in “Anomaly Detection and Change Detection Technology,” and is a fundamental technology. This section describes various statistical hypothesis testing methods and their specific implementations.

GNNs (Graph Neural Networks) are neural networks for handling graph-structured data. They use node and edge information to capture patterns and structures in graph data and can be applied to social network analysis, chemical structure prediction, recommendation systems, graph-based anomaly detection, and more.

Technical Topics

What does it mean to think scientifically? Here we discuss this question.

The first step is to clarify the difference between “the language of science” and “the language that describes science”. The “language of science” consists of “scientific concepts” such as DNA and entropy, which are defined within scientific theories. On the other hand, “words that describe science” are “meta-scientific concepts” that appear across theories, such as theory, hypothesis, law, and equation, and their meaning must be understood precisely in order to think scientifically.

Among these “meta-scientific concepts,” “theory” and “fact” are the first to be taken up. Scientific theories and hypotheses are based on the premise that the world is indeterminate and ambiguous (100% truth either does not exist or will take a lifetime to know), and are created from a relative perspective of whether or not they are better theories/hypotheses, rather than an absolute one or zero.

The functions expected of science here are “prediction,” “application,” “explanation,” and so on. Among these, I will discuss “explain,” which appears frequently in the foregoing.

There are three patterns of “explaining,” as shown below.

Identifying the cause
To derive a more specific hypothesis/theory from a general/universal hypothesis/theory
Identifying the true cause

To “become aware” means to observe or perceive something carefully; when a person notices a situation or thing, it means that he or she has registered some information or phenomenon and formed a feeling or understanding about it. Becoming aware is an important process of gaining new information and understanding by paying attention to changes and events in the external world. In this article, I will discuss this awareness and the application of artificial intelligence technology to it.

With the development of information technology (IT), vast amounts of data have been accumulated, and attempts to create new knowledge and value by analyzing this “big data” are spreading. In this context, “prescriptive analysis” has been attracting attention as a method for analyzing big data. Prescriptive analysis is an analytical method that attempts to derive the optimal solution for an “objective” from a complex combination of conditions. This article looks at the characteristics of prescriptive analysis, including its differences from “explanatory analysis” and “predictive analysis,” which are used in many big data analyses, and the scope of benefits that can be derived from its application.

Machine learning engineers tackling problems such as classification and regression are often asked by their superiors or by the departments they work with, “How far off could the predictions be?” or “How do I assess the risk of the learning model’s predictions?” Quantile regression can be useful in such situations. A quantile expresses the value below which a given percentage of the sorted data falls.

First, let’s discuss the “linear regression” case. The linear regression model expresses the prediction as a weighted sum of the features. It models how much the target variable y depends on each feature x. For a single instance i, the prediction can be expressed as follows.
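
In its standard form, the prediction for a single instance \(i\) with \(p\) features is

\[
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i ,
\]

where the \(\beta_j\) are the learned weights and \(\epsilon_i\) is the error term.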

The linear regression model is interpreted through its weights. For a numerical feature, an increase of one unit in its value changes the prediction by exactly its weight. For a categorical feature, the weight is interpreted relative to a reference category rather than per unit of the feature itself; in addition, the proportion of variance explained by the linear model (R²) indicates how well it captures the variability in the data.

While the linear regression model fits a straight line or hyperplane by minimizing the distance to the data, the logistic regression model, which is used for classification, applies a logistic function to the output of the linear equation to squash it between 0 and 1, converting it into a probability.

Thus, since the output of the logistic regression model is expressed as a probability between 0 and 1, the interpretation of the weights is different. In order to interpret these, we introduce the concept of odds. Odds are the probability that an event will occur divided by the probability that it will not occur, and the logarithm of the odds is called log odds.
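
Concretely, if \(p\) denotes the predicted probability of the event, then

\[
\mathrm{odds} = \frac{p}{1-p}, \qquad \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k ,
\]

so increasing feature \(x_j\) by one unit multiplies the odds by \(e^{\beta_j}\); this is the usual way logistic regression weights are read.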

While linear models have the advantage of simple interpretation, they have the disadvantage of being difficult to apply to real-world problems, such as when the outcomes do not follow a normal distribution, when there are interactions between features, or when the true relationship between features and outcomes is nonlinear. To deal with these issues, Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs) have been proposed.

For models whose results do not follow a normal distribution, Generalized Linear Models (GLMs) are used. The core concept of GLMs is “to retain a weighted sum of features, but allow for non-normality in the distribution of the results, and to relate the mean of this distribution to the weighted sum by some non-linear function.”
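
Written as a formula, a GLM keeps the linear predictor but connects it to the mean of the outcome distribution through a link function \(g\):

\[
g\bigl(\mathbb{E}[y \mid x]\bigr) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p .
\]

Logistic regression (logit link with a Bernoulli outcome) and Poisson regression (log link with a Poisson outcome) are familiar special cases.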

Another extension of the linear model is the Generalized Additive Model (GAM). Whereas in an ordinary linear model an increase of one unit in a feature always has the same effect on the prediction, in reality the effect can differ, for example, when the temperature rises from 10 to 11 degrees Celsius versus when it rises from 40 to 41 degrees Celsius; GAMs are designed to capture such nonlinear effects.
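
In formula form, a GAM replaces each linear term with a flexible function \(f_j\) learned from the data while keeping the additive structure:

\[
g\bigl(\mathbb{E}[y \mid x]\bigr) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p) ,
\]

so the temperature example above can be handled by letting \(f_{\mathrm{temp}}\) bend instead of forcing a single constant slope.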

Simple linear or logistic regression models fail when the relationship between the features and the outcome is nonlinear or when there are interactions between the features. Models based on decision trees offer a solution to these problems.

The tree-based model splits the data multiple times based on cutoff values of the features, and through this splitting the data set is partitioned into different subsets, with each instance belonging to exactly one of them. The final subsets are called terminal nodes or leaf nodes, and the intermediate subsets are called internal nodes or split nodes.

There are various algorithms for growing a decision tree. They differ mainly in (1) the structure of the tree (e.g., the number of branches per node), (2) the criteria used to find splits, (3) when a split is made, (4) how the simple models in the leaves make predictions, and (5) the stopping criteria.

Decision trees allow us to capture the interaction between features in the data, and to ensure that the interpretation is transparent. However, since decision trees approximate the relationship between input features and results with a branched step function, they lack smoothness, and slight changes in input features can have a significant impact on prediction results. Also, the decision tree is quite unstable, and a slight change in the training data may result in a completely different decision tree. Furthermore, the deeper the tree, the more difficult it becomes to understand the decision rules of the tree.
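
To see what this transparency looks like in practice, the following minimal sketch (assuming scikit-learn and its bundled Iris dataset) trains a shallow tree and prints its branching rules as readable IF-THEN text:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Keep the tree deliberately shallow: deeper trees quickly become hard to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as human-readable IF-THEN rules.
print(export_text(tree, feature_names=list(iris.feature_names)))
```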

A decision rule is a simple IF-THEN statement consisting of a condition (also called a premise) and a prediction. For example: if it is raining today and it is April (condition), then it will rain tomorrow (prediction). Predictions can be made with a single decision rule or with a combination of several decision rules.

There are three representative ways to learn rules from data (there are many more than these):

    1. OneR: Learns rules from a single feature; OneR is characterized by its simplicity and ease of understanding (a minimal sketch follows this list).
    2. Sequential covering: Learns rules iteratively, removing the data points covered by each new rule.
    3. Bayesian Rule Lists: Uses Bayesian statistics to integrate pre-discovered frequent patterns into a decision list. The use of pre-discovered patterns is also a common approach in algorithms that learn many rules.
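
As noted in the OneR item above, here is a minimal, self-contained sketch of the OneR idea (not a particular library implementation): for each feature, build a value-to-majority-class rule table, count its training errors, and keep the single feature with the fewest errors.

```python
import numpy as np
from collections import Counter

def one_r(X, y):
    """Pick the single categorical feature whose value -> majority-class
    rule table makes the fewest errors on the training data."""
    best_feature, best_rules, best_error = None, None, np.inf
    for j in range(X.shape[1]):
        rules, errors = {}, 0
        for value in np.unique(X[:, j]):
            labels = y[X[:, j] == value]
            majority, count = Counter(labels).most_common(1)[0]
            rules[value] = majority          # IF feature j == value THEN majority class
            errors += len(labels) - count    # instances this rule gets wrong
        if errors < best_error:
            best_feature, best_rules, best_error = j, rules, errors
    return best_feature, best_rules, best_error

# Toy example: two categorical features and a binary label.
X = np.array([["sunny", "hot"], ["sunny", "mild"], ["rainy", "mild"], ["rainy", "hot"]])
y = np.array(["no", "yes", "yes", "no"])
print(one_r(X, y))   # the second feature predicts the label perfectly here
```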

The RuleFit algorithm, proposed by Friedman and Popescu in 2008, trains a sparse linear model on an augmented set of features consisting of the original features plus new features derived from decision rules, thereby capturing interactions between the features. The new features are generated automatically by converting each path through a decision tree, i.e., the combination of its split decisions, into a decision rule.

In RuleFit, a number of trees are generated using a random forest-like technique, and each tree is decomposed into decision rules, which are additional features used in sparse linear regression models (Lasso).
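
The sketch below is a rough scikit-learn-only approximation of this idea (it is not the reference RuleFit implementation): shallow boosted trees generate candidate rules, each leaf membership becomes a binary rule feature, and Lasso then selects a sparse combination of original and rule features. The dataset and hyperparameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import OneHotEncoder

X, y = make_regression(n_samples=500, n_features=8, noise=0.3, random_state=0)

# 1. Grow many shallow trees; each root-to-leaf path acts as a candidate decision rule.
ensemble = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
ensemble.fit(X, y)

# 2. One binary column per leaf: "this instance satisfies the rule defined by this path".
leaf_ids = ensemble.apply(X).reshape(X.shape[0], -1)
rule_features = OneHotEncoder(sparse_output=False).fit_transform(leaf_ids)  # sklearn >= 1.2

# 3. Sparse linear model (Lasso) over the original features plus the rule features.
X_aug = np.hstack([X, rule_features])
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_aug, y)
print("selected terms:", int(np.sum(lasso.coef_ != 0)), "of", X_aug.shape[1])
```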

Post-hoc interpretation models (model-independent interpretation methods) have the advantage over the interpretable models described above that they allow the predictive model to be chosen flexibly and interpreted while maintaining a high degree of predictive accuracy. The desirable properties of model-independent interpretation methods are as follows.

    • Model flexibility: The interpretation method can be used with any machine learning model, such as a random forest or a deep neural network.
    • Explanation flexibility: The explanation is not restricted to a particular form. In some cases a linear relationship may be useful; in others, a visualization of feature importances.
    • Representation flexibility: The explanation should be able to use a different feature representation than the model being explained. For text classification based on abstract word-embedding vectors, for example, it is preferable to use individual words in the explanation.

The first model-independent interpretation model is the partial dependence plot (PDP, PD plot), which in a nutshell shows the marginal effect of one or two features on the prediction results of a machine learning model. It can also express whether the relationship between input and output is linear, monotonic, or more complex. When applied to a linear regression model, for example, the partial dependence plot will always show a linear relationship.

A partial dependence plot shows the average effect of a feature and is therefore a global method, because it focuses on the overall average rather than on specific instances. The PDP equivalent for individual instances is called an individual conditional expectation (ICE) plot, which visualizes the effect of a feature on the prediction for each instance separately. The PDP corresponds to the average of the ICE curves.

The point of looking at individual curves instead of the PDP is that a PDP can obscure heterogeneous relationships created by interactions, whereas ICE plots provide much more insight when interactions are present.
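
With scikit-learn (version 1.0 or later is assumed for the PartialDependenceDisplay API), the PDP and its ICE curves for a single feature can be drawn together; the dataset and model below are placeholders, and the housing data is downloaded on first use.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

data = fetch_california_housing()
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(data.data, data.target)

# kind="both" overlays the individual ICE curves with their average, which is the PDP.
PartialDependenceDisplay.from_estimator(
    model, data.data, features=[0],            # feature index 0 = median income
    feature_names=data.feature_names, kind="both"
)
plt.show()
```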

In this article, we discuss Accumulated Local Effects (ALE) plots, which show how much a feature affects the predictions of a machine learning model on average. ALE plots are a faster and less biased alternative to partial dependence plots (PDPs). Applying a PDP to correlated features includes predictions for instances that are unlikely to occur in reality, which introduces a large bias when estimating feature effects; we explain what goes wrong with PDPs by following the actual computation steps.
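
A compact way to see the mechanics is to compute a first-order ALE curve by hand for one numeric feature. The sketch below is a simplified version (quantile bins, unweighted centering), not a full implementation such as those in dedicated ALE libraries; `predict` is any fitted model's prediction function.

```python
import numpy as np

def ale_1d(predict, X, feature, n_bins=10):
    """Simplified first-order ALE for one numeric feature."""
    x = X[:, feature]
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    # Assign every instance to the bin its feature value falls into.
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
    local_effects = np.zeros(len(edges) - 1)
    for b in range(len(edges) - 1):
        members = X[bins == b]
        if len(members) == 0:
            continue
        lo, hi = members.copy(), members.copy()
        lo[:, feature], hi[:, feature] = edges[b], edges[b + 1]
        # Average prediction change when moving the feature across this bin only,
        # keeping the other (possibly correlated) features at their observed values.
        local_effects[b] = np.mean(predict(hi) - predict(lo))
    ale = np.cumsum(local_effects)
    ale -= ale.mean()                          # unweighted centering, for simplicity
    return (edges[:-1] + edges[1:]) / 2, ale   # bin centers and accumulated effects

# Usage: centers, ale = ale_1d(model.predict, X, feature=0); then plot centers vs. ale.
```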

In this article, I would like to discuss feature interactions. When there is feature interaction in a predictive model, a prediction cannot be expressed simply as the sum of the individual feature effects, because the effect of one feature depends on the value of another feature. When a machine learning model makes a prediction based on two features, this prediction can be decomposed into four terms: a constant term, a term for the first feature, a term for the second feature, and a term for the interaction of the two features. The interaction of the two features is the change in prediction that occurs by varying the features jointly, after the individual feature effects have been accounted for.

In the previous article, we discussed feature interactions among the model-independent interpretation methods. Permutation feature importance measures the increase in prediction error after permuting a feature’s values, which breaks the relationship between the feature and the true outcome. The concept is very simple: the importance of a feature is measured by the change in the model’s prediction error after permuting its values. If shuffling the values increases the model error, the feature is “important”, because the model’s predictions depend on it. If the model error does not change after shuffling the values, the feature is “unimportant”. Permutation feature importance was introduced for random forests by Breiman (2001).
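
scikit-learn ships this as sklearn.inspection.permutation_importance; the sketch below (placeholder dataset and model) measures importance on held-out data, where it is least optimistic.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature column several times and record the drop in test accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```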

A global surrogate model is an interpretable model that is trained to approximate the predictions of a black box model. By interpreting the surrogate model, conclusions can be drawn about the black box model. Surrogate models are also used in engineering: when the quantity of interest is expensive, time-consuming, or otherwise difficult to measure (e.g., because it relies on complex computer simulations), it is replaced by the output of a cheap and fast surrogate model. The difference between surrogate models used in engineering and those used in interpretable machine learning is that here the surrogate is not a simulation but an interpretable machine learning model. The goal of an (interpretable) surrogate model is to approximate the predictions of the original model as accurately as possible while itself remaining interpretable. The idea of surrogate models can be found under various names, such as approximation models, metamodels, response surface models, and emulators.
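
A minimal sketch of the procedure, assuming a random forest as the black box and a shallow decision tree as the surrogate: the surrogate is trained on the black box's predictions, and we report its fidelity (agreement with the black box) rather than its accuracy on the true labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The surrogate imitates the black box's *predictions*, not the original labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

print("fidelity to the black box:",
      accuracy_score(black_box.predict(X), surrogate.predict(X)))
```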

In this article, I would like to discuss local surrogate models (LIME). Local surrogate models are interpretable models that are used to explain individual predictions of black-box machine learning models. The paper on Local Interpretable Model-agnostic Explanations (LIME) proposes a concrete implementation of local surrogate models. Surrogate models are trained to approximate the predictions of the underlying black box model. Instead of learning a global surrogate model, LIME focuses on learning local surrogate models to explain individual predictions.
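
The lime package provides the reference implementation; the sketch below instead shows the underlying idea for tabular data in a few lines (sample around the instance, weight the samples by proximity, fit a weighted linear surrogate), so the function name, kernel, and settings here are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_locally(predict_proba, x, X_train, n_samples=2000, kernel_width=0.75):
    """Simplified LIME-style local surrogate for a single instance x."""
    rng = np.random.default_rng(0)
    scale = X_train.std(axis=0) + 1e-12
    # Perturb the instance to build a local neighbourhood around it.
    Z = x + rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    target = predict_proba(Z)[:, 1]                  # black-box probability of class 1
    # Exponential kernel: nearby samples count more in the surrogate fit.
    dist = np.linalg.norm((Z - x) / scale, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0).fit(Z, target, sample_weight=weights)
    return surrogate.coef_                           # local feature effects around x
```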

Anchors explain individual predictions of a black-box classification model by finding decision rules sufficient to “fix” the prediction: a rule fixes a prediction if changes in the values of the other features do not affect the prediction. Anchor combines reinforcement learning with a graph search algorithm to minimize the number of model calls while avoiding local optima. The algorithm was proposed by Ribeiro, Singh, and Guestrin in 2018.

Like LIME, Anchor perturbs the data to provide local explanations for the predictions of black-box machine learning models. However, whereas LIME uses surrogate models for explanation, Anchor uses easier-to-understand if-then rules, called anchors. These rules are scoped and can be reused: Anchor includes a notion of coverage that indicates precisely to which other, possibly unseen, instances a rule applies. Finding an anchor involves a search formulated as a multi-armed bandit problem, which comes from the field of reinforcement learning.

In this article, I would like to discuss the Shapley value. A prediction can be explained by assuming a game in which the feature values of an instance are the “players” and the prediction is the reward. The Shapley value (a method from cooperative game theory) tells us how to distribute the “reward” fairly among the features. To understand the Shapley value, assume the following scenario.

In this article, I would like to discuss SHAP (SHapley Additive exPlanations) by Lundberg and Lee (2016), a method for explaining individual predictions. SHAP is based on the game-theoretically optimal Shapley values. One of its estimation methods, KernelSHAP, is a kernel-based approach for estimating Shapley values inspired by local surrogate models.
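
Assuming the shap package is installed, a typical workflow for a tree ensemble looks like the following (dataset and model are placeholders); TreeExplainer exploits the tree structure to compute Shapley values efficiently.

```python
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shapley values for the first 200 rows: one additive contribution per feature and instance.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:200])

# Global summary built by aggregating many local Shapley explanations.
shap.summary_plot(shap_values, X.iloc[:200])
```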

In the previous article, we discussed SHAP (SHapley Additive exPlanations), an extension of the Shapley value based on game theory. In this article, I will describe an approach to counterfactual explanations based on causal inference, as described in “Statistical Causal Inference and Causal Search”.

In this article, we discuss Adversarial Examples and security.

A prototype is a data instance that is representative of all data. A criticism is a data instance that cannot be well represented by a collection of prototypes, and the purpose of criticism is to provide insight with prototypes, especially about data points that prototypes cannot represent well. Although prototypes and criticisms can be used independently of machine learning models to describe data, they can be used to create interpretable models or to make black box models interpretable.

Since 2013, a wide range of methods have been developed to visualize and interpret the representations learned by convolutional neural networks. In this article, we will focus on three of the most useful and easy-to-use methods.

(1) Visualization of the intermediate outputs of a CNN (activations of intermediate layers): this helps us understand how the input is transformed by the successive layers of the CNN and provides insight into the meaning of individual filters. (2) Visualization of the CNN’s filters: this helps us understand which visual patterns and concepts each filter responds to. (3) Visualization of a heatmap of class activations in an image: this helps us understand which parts of an image contributed to a particular class decision, and thus to localize objects in the image.
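
For method (1), a common Keras pattern is to wrap an existing CNN in a second model whose outputs are the intermediate layer activations; the model path and input shape below are placeholders for whatever trained network is being inspected.

```python
import numpy as np
from tensorflow import keras

# Load a trained CNN (placeholder path).
model = keras.models.load_model("my_cnn.h5")

# Build a model that maps the same input to the outputs of the first few layers.
layer_outputs = [layer.output for layer in model.layers[:8]]
activation_model = keras.Model(inputs=model.input, outputs=layer_outputs)

# One input image, preprocessed the same way as during training (placeholder shape).
img = np.random.rand(1, 150, 150, 3).astype("float32")
activations = activation_model.predict(img)

# activations[i][0, :, :, k] is the response map of filter k in layer i;
# showing it with matplotlib's imshow reveals what that filter reacts to.
print([a.shape for a in activations])
```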

    An application of Bayesian estimation previously mentioned is Bayesian nets. Bayesian nets are a modeling method that expresses causal relationships (strictly speaking, probabilistic dependencies) among various events in a graph structure, and are used and studied in various fields such as failure diagnosis, weather prediction, medical decision support, marketing, and recommendation systems.

    To express this mathematically, a Bayesian network consists of a finite number of random variables X1, …, XN as nodes, a conditional probability table (CPT) associated with each node, and the joint distribution P(X1 = x1, …, XN = xn) represented by the graph structure.
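
    The joint distribution encoded by the graph factorizes over the nodes and their parents, which is the standard Bayesian network factorization:

    \[
    P(X_1 = x_1, \ldots, X_N = x_N) = \prod_{i=1}^{N} P\bigl(X_i = x_i \mid \mathrm{pa}(X_i)\bigr),
    \]

    where \(\mathrm{pa}(X_i)\) denotes the parent nodes of \(X_i\) in the graph and each factor is given by that node's CPT.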

    In the previous article, we discussed SRL (statistical relational learning), which was developed in North America. This time, we will discuss its European counterpart, probabilistic logic learning (PLL).

    SRL approaches such as PRM (probabilistic relational model) and MLN (Markov logic network) are based on the idea of enriching probabilistic models with relations and logic formulas. However, enriching predicate logic with probability is not the direct goal of SRL. On the other hand, knowledge representation with predicate logic has long been studied in the field of artificial intelligence, and attempts to incorporate probability into it, so as to represent not only logical knowledge that is always valid but also probabilistic knowledge, date back to before the statistical machine learning boom.

    A decision tree learner is a powerful classifier that uses a tree structure to model the relationships between the features and the possible outcomes.

    A key feature of the decision tree algorithm is that the flowchart-like tree structure is not used only internally by the learner: the output of the model can be read by humans, providing strong hints as to why or how the model works (or does not work) well for a particular task.

    Such a mechanism can be particularly useful when the classification process must be transparent for legal reasons, or when the results are shared with others in order to make business practices explicit across organizations.

    In the previous article, I gave an overview of the decision tree algorithm. In this article, we will discuss classification using R, with data from a German financial credit survey (1,000 instances, 17 variables).

    To improve model performance, we use “adaptive boosting”. This is an improvement over the C4.5 algorithm in which a number of decision trees are built and these trees vote on the best class for each instance.

    In this article, we will discuss the extraction of rules using a rule classifier.

    Classification rules represent knowledge in the form of logical if-else statements that assign classes to unlabeled instances. They are composed of an “antecedent” and a “consequent”, forming the hypothesis that “if this happens, then that happens”. A simple rule asserts something like, “If the hard disk is ticking, it will soon fail.” The antecedent consists of a specific combination of feature values, whereas the consequent specifies the class value to be assigned when the conditions of the rule are met.

    Classification rule learning is often used in the same way as decision tree learning. Classification rules can be specifically used in applications that generate knowledge for future actions, such as the following

    Identification of conditions that cause hardware errors in mechanical devices
    Describing the key characteristics of a group of people belonging to a customer segment
    Extraction of conditions that are harbingers of a significant drop or rise in stock market prices

    The difference between classification rule learning and decision tree learning is that a classification rule is a proposition that can be read in exactly the same way as a statement of fact, whereas in a decision tree, decisions must be followed in order from top to bottom.

    In this article, we will discuss the extraction of rules using R. The data used for rule extraction concern whether a mushroom is edible or poisonous.

    As an algorithm, we will use RWeka’s RIPPER algorithm for evaluation.

    Global explanation: explaining a complex machine learning model by replacing it with a highly readable model. Local explanation: explaining the behavior of a complex machine learning model for a given input-output example (for instance, using a highly readable model).

    The ubiquitous non-semantic web includes a vast array of unstructured information such as HTML documents. The semantic web provides more structured knowledge such as hand-built ontologies and semantically aware databases. To leverage the full power of both the semantic and non semantic portions of the web, software systems need to be able to reason over both kinds of information. Systems that use both structured and unstructured information face a significant challenge when trying to convince a user to believe their results: the sources and the kinds of reasoning that are applied to the sources are radically different in their nature and their reliability. Our work aims at explaining conclusions derived from a combination of structured and unstructured sources. We present our solution that provides an infrastructure capable of encoding justifications for conclusions in a single format. This integration provides an end-to-end description of the knowledge derivation process including access to text or HTML documents, descriptions of the analytic processes used for extraction, as well as descriptions of the ontologies and many kinds of information manipulation processes, including standard deduction. We produce unified traces of extraction and deduction processes in the Proof Markup Language (PML), an OWL-based formalism for encoding provenance for inferred information. We provide a browser for exploring PML and thus enabling a user to understand how some conclusion was reached.

    Explainability has been a goal for Artificial Intelligence (AI) systems since their conception, with the need for explainability growing as more complex AI models are increasingly used in critical, high-stakes settings such as healthcare. Explanations have often added to an AI system in a non-principled, post-hoc manner. With greater adoption of these systems and emphasis on user-centric explainability, there is a need for a structured representation that treats explainability as a primary consideration, mapping end user needs to specific explanation types and the system’s AI capabilities. We design an explanation ontology to model both the role of explanations, accounting for the system and user attributes in the process, and the range of different literature-derived explanation types. We indicate how the ontology can support user requirements for explanations in the domain of healthcare. We evaluate our ontology with a set of competency questions geared towards a system designer who might use our ontology to decide which explanation types to include, given a combination of users’ needs and a system’s capabilities, both in system design settings and in real-time operations. Through the use of this ontology, system designers will be able to make informed choices on which explanations AI systems can and should provide.
