Nonparametric Bayesian and Gaussian Processes


Overview

Nonparametric Bayes is a branch of Bayesian statistics, an "old yet new" technique whose theory was essentially completed in the 1970s, and a statistical method for data analysis and forecasting using flexible, data-dependent probability models. It is called "nonparametric" because the form and number of parameters do not have to be fixed in advance.

Nonparametric Bayes builds a probability model from the data itself: rather than assuming the form of the true distribution that generated the data, it estimates the distribution from the data. This allows a flexible model whose probability distribution automatically adjusts to fit the data.

There are several nonparametric Bayesian methods, but one of the most common is the Dirichlet Process Mixture Model (DPMM), which is built on the Dirichlet process. The Dirichlet process is a stochastic process for defining probability distributions over an infinite-dimensional space, and inference in such models can be carried out efficiently with modern sampling algorithms such as Markov chain Monte Carlo methods. The DPMM is one of the leading nonparametric Bayesian methods and has been applied to tasks such as clustering, structural change estimation with statistical models, density estimation, factor analysis, and sparse modeling.

A Gaussian Process (GP) is a nonparametric regression and classification method grounded in probability theory, and a type of stochastic process used for modeling continuous data. Like other nonparametric Bayesian methods, Gaussian processes perform data analysis and forecasting by defining an infinite-dimensional probability distribution and estimating that distribution from the data.

In the Gaussian process approach, the data-generating process is modeled by a probability distribution, and kernels (kernel functions and the resulting kernel matrices) are used to express the relationships among data points and to estimate those relationships together with their uncertainty. The kernel of a Gaussian process can take a variety of shapes and can be customized to the characteristics of the data, allowing a flexible model to be constructed.

Gaussian processes can also estimate the uncertainty (confidence interval) of a prediction, which allows one to evaluate how reliable the prediction is, and they can be applied even when only a small number of data points is available.
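As a concrete illustration of these points, the following is a minimal sketch of Gaussian process regression with uncertainty estimates using scikit-learn; the toy data, kernel choice, and noise level are assumptions made for the example.

```python
# A minimal sketch of Gaussian process regression with uncertainty estimates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Small, noisy toy dataset: a GP can still give useful uncertainty here.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(15, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(15)

# RBF kernel for smoothness plus a white-noise term for observation noise.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Predictive mean and standard deviation define a confidence band.
X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # approx. 95% interval
```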

In this blog, we will discuss the details of nonparametric Bayesian learning and Gaussian processes.

Implementation

  • Dirichlet Process Mixture Model (DPMM) Overview, Algorithm and Implementation Examples

The Dirichlet Process Mixture Model (DPMM) is one of the most important models in clustering and cluster analysis. The DPMM is characterized by its ability to automatically estimate clusters from data without the need to determine the number of clusters in advance.
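The sketch below illustrates this with scikit-learn's variational Dirichlet process mixture, which approximates a DPMM; the toy data, truncation level, and concentration parameter are illustrative assumptions.

```python
# A minimal sketch of DPMM-style clustering without fixing the number of clusters.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs; the model is not told there are three.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (-2, 0, 2)])

# n_components is only a truncation level; unneeded components get weights near 0.
dpmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,  # concentration parameter alpha
    max_iter=500,
    random_state=0,
).fit(X)

labels = dpmm.predict(X)
effective_k = np.sum(dpmm.weights_ > 0.01)  # number of clusters actually used
print(effective_k)
```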

Bayesian inference is a method of statistical inference based on a probabilistic framework and a machine learning technique for dealing with uncertainty. The objective of Bayesian inference is to estimate the probability distribution of unknown parameters by combining data with prior knowledge (the prior distribution). This section provides an overview of Bayesian estimation, its applications, and various implementations.

Markov Chain Monte Carlo (MCMC) is a statistical method for sampling from probability distributions and performing integration calculations. MCMC combines Markov chains with Monte Carlo methods. This section describes various algorithms, applications, and implementations of MCMC.
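As a minimal sketch of the idea, the following implements random-walk Metropolis-Hastings, one of the simplest MCMC algorithms; the target density and step size are illustrative assumptions.

```python
# A minimal random-walk Metropolis-Hastings sampler.
import numpy as np

def log_target(x):
    # Unnormalized log-density of a two-component Gaussian mixture.
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_samples=10000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal()
        # Accept with probability min(1, p(proposal) / p(x)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        samples[i] = x
    return samples

samples = metropolis_hastings()
print(samples.mean(), samples.std())
```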

Kullback-Leibler variational estimation is a method for estimating an approximate probabilistic model of data by evaluating and minimizing the difference between probability distributions. It is widely used in the context of variational inference and related approximate Bayesian methods.
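The toy sketch below shows the core idea of "minimize a KL divergence": a Gaussian q is fitted to a fixed target p by minimizing a Monte Carlo estimate of KL(q || p). The target distribution and optimizer settings are illustrative assumptions.

```python
# A minimal sketch of fitting q = N(mu, sigma^2) to a target p by minimizing KL(q || p).
import numpy as np
from scipy import optimize, stats

target = stats.norm(loc=3.0, scale=2.0)   # "true" distribution p(x)

def kl_q_p(params, n_samples=5000, seed=0):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    rng = np.random.default_rng(seed)
    x = mu + sigma * rng.standard_normal(n_samples)   # samples from q
    log_q = stats.norm.logpdf(x, loc=mu, scale=sigma)
    log_p = target.logpdf(x)
    return np.mean(log_q - log_p)                     # Monte Carlo KL(q || p)

result = optimize.minimize(kl_q_p, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # should approach 3.0 and 2.0
```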

Bayesian Structural Time Series Model (BSTS) is a type of statistical model that models phenomena that change over time and is used for forecasting and causal inference. This section provides an overview of BSTS and its various applications and implementations.

  • Black-Box Variational Inference (BBVI) Overview, Algorithm, and Implementation Examples

Black-Box Variational Inference (BBVI) is a variational inference method for approximating the posterior distribution of complex probabilistic models in probabilistic programming and Bayesian statistical modeling. It is called "black-box" because the model to be inferred is treated as a black box: BBVI can be applied independently of the internal structure of the model and the form of its likelihood function, so inference is possible without knowing the model's internals.
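To make the "black-box" gradient concrete, the following is a minimal sketch of BBVI with the score-function (REINFORCE) estimator for a toy model whose exact posterior is known: y_i ~ N(theta, 1) with prior theta ~ N(0, 1). The model, step size, and sample sizes are illustrative assumptions, not a production implementation.

```python
# A minimal score-function BBVI sketch; the model is only queried through log p(y, theta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=20)          # observed data

def log_joint(theta):
    # log p(y, theta) = log prior + log likelihood, evaluated pointwise.
    return stats.norm.logpdf(theta, 0.0, 1.0) + \
           np.sum(stats.norm.logpdf(y[:, None], theta, 1.0), axis=0)

mu, log_sigma = 0.0, 0.0                             # variational parameters of q
lr, n_mc = 0.01, 200
for step in range(2000):
    sigma = np.exp(log_sigma)
    theta = mu + sigma * rng.standard_normal(n_mc)   # samples from q
    log_q = stats.norm.logpdf(theta, mu, sigma)
    # Score functions d/d(mu) and d/d(log_sigma) of log q(theta).
    score_mu = (theta - mu) / sigma**2
    score_ls = ((theta - mu) ** 2 / sigma**2) - 1.0
    weight = log_joint(theta) - log_q                # the only "black-box" call
    mu += lr * np.mean(score_mu * weight)
    log_sigma += lr * np.mean(score_ls * weight)

# Exact posterior for comparison: N(sum(y) / (n + 1), 1 / (n + 1)).
print(mu, np.exp(log_sigma), y.sum() / (len(y) + 1), np.sqrt(1 / (len(y) + 1)))
```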

A Gaussian process is like a box (a stochastic process) that randomly outputs a function. For example, if the process by which a die produces the numbers 1 through 6 depends on how the die is warped, then the function describing the probability of each face coming up can be thought of as depending on a parameter (in this case, the warp of the die).
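The "box that outputs random functions" view can be illustrated directly by drawing sample functions from a Gaussian process prior; the grid and kernel parameters below are illustrative assumptions.

```python
# A minimal sketch of sampling random functions from a GP prior with an RBF kernel.
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    # k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

x = np.linspace(-5, 5, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability

rng = np.random.default_rng(0)
# Each row is one random "function" evaluated on the grid.
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=5)
print(samples.shape)
```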

Gaussian process regression is built on the correlations between data points, so algorithms based on kernel methods, as well as MCMC combined with Bayesian analysis, are applied. The tools for these analyses are available as open source in various languages such as Matlab, Python, R, and Clojure. In this article, we discuss the approach in Clojure.

In this article, I describe frameworks for Gaussian processes in Python. There are two main options: the general-purpose scikit-learn framework and the dedicated GPy framework. GPy is more versatile than scikit-learn, so we focus on GPy in this article.
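For orientation, the following is a minimal sketch of GP regression with GPy, assuming GPy is installed; the toy data and kernel settings are illustrative.

```python
# A minimal sketch of Gaussian process regression with GPy.
import numpy as np
import GPy

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(30, 1))
Y = np.sin(X) + 0.1 * rng.standard_normal((30, 1))

kernel = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=1.0)
model = GPy.models.GPRegression(X, Y, kernel)
model.optimize(messages=False)            # maximize the marginal likelihood

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
mean, var = model.predict(X_test)         # predictive mean and variance
```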

Bayesian optimization is an applied technology that makes full use of the characteristics of Gaussian process regression, which can make probabilistic predictions from a small number of samples with relatively little computation.

Specific examples include sequentially choosing the best combination of experimental parameters to try next while running experiments in experimental design for medicine, chemistry, and materials research; sequentially optimizing hyperparameters in machine learning while iterating the training/evaluation cycle; and optimizing functions through the matching of parts in the manufacturing industry. It is a technology that can be used in a wide range of applications.
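The sketch below shows the basic loop of Bayesian optimization with a Gaussian process surrogate and the expected-improvement criterion (here for minimization); the objective function, search grid, and number of iterations are illustrative assumptions.

```python
# A minimal Bayesian optimization loop: GP surrogate + expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Expensive black-box function we pretend not to know.
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X_grid = np.linspace(-3, 3, 400).reshape(-1, 1)
X = rng.uniform(-3, 3, size=(3, 1))       # a few initial evaluations
y = objective(X).ravel()

for _ in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, std = gp.predict(X_grid, return_std=True)
    best = y.min()
    # Expected improvement over the current best (minimization).
    z = (best - mu) / np.maximum(std, 1e-9)
    ei = (best - mu) * norm.cdf(z) + std * norm.pdf(z)
    x_next = X_grid[np.argmax(ei)].reshape(1, 1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best x:", X[np.argmin(y)], "best value:", y.min())
```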

Nonparametric Bayesian

Now, open the door to infinite dimensions!
The book clearly explains the basics of probability distributions and their application to time series data and sparse modeling, and it is carefully designed to explain the theoretical background of measure theory from the basics as well. Written by an up-and-coming ace researcher. A must-have for all Bayesians!

Nonparametric Bayesian techniques are "old yet new" techniques whose theory was essentially completed in the 1970s. More than 40 years later, they are still used in a variety of fields, and their key characteristics are (1) the flexibility and breadth of the models used to represent phenomena and (2) the development of algorithms that efficiently explore vast modeling spaces.

Nonparametric Bayesian modeling is, in a word, stochastic modeling in an "infinite-dimensional" space, made practical by modern search algorithms, represented by Markov chain Monte Carlo methods, that can compute these models efficiently. Its applications include clustering with flexible generative models, structural change estimation with statistical models, and applications to factor analysis and sparse modeling.

Overview of the various probability distributions used in stochastic generative models (Student's t distribution, Wishart distribution, Gaussian distribution, gamma distribution, inverse gamma distribution, Dirichlet distribution, beta distribution, categorical distribution, Poisson distribution, Bernoulli distribution)

A stochastic generative model is a mathematical model in which the generative process of data is represented by a stochastic model. In this article, we will describe the representation of the process of data generation used in stochastic generative models and statistical learning as an estimation problem for generative models.

We work through the fundamentals of Bayesian estimation used in stochastic generative models (exchangeability, de Finetti's theorem, conjugate prior distributions, posterior distributions, marginal likelihood, etc.) with concrete examples (the Dirichlet-multinomial model and the gamma-Gaussian model).
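As a minimal sketch of conjugacy for the Dirichlet-multinomial model mentioned above: the posterior is again Dirichlet and the marginal likelihood has a closed form. The prior and counts below are illustrative assumptions.

```python
# Conjugate Bayesian updating for the Dirichlet-multinomial model.
import numpy as np
from scipy.special import gammaln

alpha = np.array([1.0, 1.0, 1.0])     # symmetric Dirichlet prior over 3 categories
counts = np.array([12, 5, 3])         # observed category counts

# Conjugacy: Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts).
posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()

def log_marginal_likelihood(counts, alpha):
    # Log marginal likelihood of the observed sequence under the
    # Dirichlet-multinomial model (multinomial coefficient omitted).
    n = counts.sum()
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

print(posterior_mean, log_marginal_likelihood(counts, alpha))
```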

In this article, we will discuss clustering with finite mixture models as preparation for nonparametric Bayesian models. Clustering is a data mining technique that groups similar data points into the same class, and it is the most basic application of nonparametric Bayesian models. We describe the K-means algorithm, a representative clustering method, and its Bayesian counterpart, the finite Gaussian mixture model.
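The short sketch below contrasts K-means with a finite Gaussian mixture fitted by EM (the fully Bayesian treatment additionally places priors on the mixture parameters); the data and the choice K = 3 are illustrative assumptions.

```python
# K-means vs. a finite Gaussian mixture on the same toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(150, 2)) for c in (-2, 0, 2)])

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
gmm_labels = gmm.predict(X)              # hard assignments
responsibilities = gmm.predict_proba(X)  # soft (probabilistic) assignments
```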

In this article, we will discuss the extension of the Dirichlet distribution to infinite dimensions as an introduction to nonparametric Bayesian models. The emphasis here is on intuitive understanding; the mathematical details are discussed later.

The Dirichlet process mixture model plays a central role in nonparametric Bayesian modeling. Dirichlet process mixture models are also called infinite mixture models because they can be viewed as infinite-dimensional extensions of finite mixture models.

Why is such an infinite mixture model needed in the first place? In general clustering, it is necessary to determine the number of classes K in advance, which is equivalent to determining the number of dimensions of the Dirichlet distribution in advance as a prior distribution.

However, in real-life problems, it is often difficult to know how to set the dimension of the Dirichlet distribution, and when the number of data changes dynamically, the number of clusters K may also need to change dynamically.

The Chinese restaurant process (CRP), a stochastic model for nonparametric Bayesian partitioning; the Dirichlet process behind the CRP; and estimation of the concentration parameter α. Other technical topics include the stick-breaking process (SBP) and sequential Monte Carlo methods.
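The following minimal sketch samples a partition from the Chinese restaurant process and mixture weights by stick breaking, the two classic constructions mentioned here; the concentration parameter alpha is an illustrative choice.

```python
# CRP partition sampling and stick-breaking weights in plain NumPy.
import numpy as np

def chinese_restaurant_process(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    tables = []                                   # customers per table
    assignments = []
    for n in range(n_customers):
        probs = np.array(tables + [alpha], dtype=float)
        probs /= n + alpha                        # P(existing) ∝ size, P(new) ∝ alpha
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):
            tables.append(1)                      # open a new table
        else:
            tables[choice] += 1
        assignments.append(choice)
    return assignments, tables

def stick_breaking(alpha, truncation, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining                      # weights sum to (almost) 1

assignments, tables = chinese_restaurant_process(100, alpha=2.0)
weights = stick_breaking(alpha=2.0, truncation=20)
print(len(tables), weights.sum())
```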

In this article, we will discuss structural change estimation for time series data as an application of nonparametric Bayesian models. One of the problems in the analysis of time series data is estimating changes in the structure of the data. Analyzing changes in the properties of data is an important topic that has been extensively studied as change detection. Here, we describe a method using a statistical model such as the Dirichlet process.

The basic idea is to assume that each data point is generated from one of several models with a certain probability, and to estimate structural changes in the data by estimating how the generating process changes over time.

In this article, we will discuss nonparametric Bayes in factor analysis and sparse modeling. Here, we will focus on beta processes, another stochastic process that constitutes a nonparametric Bayesian model.

The approach is to consider an infinite dimensional binary matrix generating process using the beta-Bernoulli distribution model. The specific algorithm is called the Indian buffet process (IBP). These are computed using Gibbs sampling.
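The sketch below samples a binary feature-assignment matrix from the Indian buffet process described above; the number of customers and alpha are illustrative assumptions (the Gibbs sampler used for inference is not shown).

```python
# Sampling a binary matrix from the Indian buffet process (IBP).
import numpy as np

def indian_buffet_process(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    dish_counts = []                               # how many customers took each dish
    rows = []
    for n in range(1, n_customers + 1):
        # Existing dishes: customer n takes dish k with probability m_k / n.
        row = [rng.uniform() < m / n for m in dish_counts]
        # New dishes: Poisson(alpha / n) previously untried dishes.
        n_new = rng.poisson(alpha / n)
        for k, taken in enumerate(row):
            if taken:
                dish_counts[k] += 1
        dish_counts.extend([1] * n_new)
        rows.append(row + [True] * n_new)
    # Pad rows into a rectangular binary matrix.
    K = len(dish_counts)
    Z = np.zeros((n_customers, K), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = indian_buffet_process(n_customers=10, alpha=3.0)
print(Z.shape)   # the number of active features (columns) is itself random
```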

This section describes the fundamentals of measure theory as the basic theory behind nonparametric Bayesian models: σ-algebras (σ-additive families), Lebesgue measure, and the Lebesgue integral.

The stochastic processes that constitute nonparametric Bayesian models can be viewed within a unified framework called point processes. A point process is a statistical model of a set of "points" that abstracts discrete events together with some quantity attached to each point. It is useful for analyzing the stochastic mechanisms behind arrangements of "points" on a time axis, in the plane, and more generally in space. Here is an overview of point processes (additive processes, Poisson random measures, gamma random measures, discreteness, the Laplace functional, point processes).

In this section, we discuss the normalized gamma process, which is obtained by normalizing the gamma process. The normalized gamma process is closely related to the Dirichlet process.

Poisson random measures and gamma random measures can be understood in a unified way through the concept of completely random measures.

Gaussian process

    The paper "Deep Neural Networks as Gaussian Processes" (arXiv:1711.00165) by researchers at Google Brain is introduced, along with reference sites for Gaussian processes in general and for Bayesian deep learning in particular.

    Machine learning is the computation by which a machine estimates f(x), the output y (= f(x)) for an input x, from actual pairs of x and y. The simplest approach is the analytical one, which treats y − f(x) as an error e and minimizes it. However, this approach has the problem that the computation becomes more difficult as f(x) becomes more complex and acquires more parameters. An alternative approach treats the parameters of f(x) as random variables with probability distributions, assumes a prior probability (the state in which the parameter values are unknown) and a posterior probability (the state in which the range of plausible values has been narrowed down by actual data), and estimates those random variables (the parameters) by Bayesian estimation; this is called the probabilistic approach.

    The Gaussian process takes this probabilistic approach further: it makes the class of candidate functions f(x) flexible, considering "any function with a certain degree of smoothness" (Gaussian process regression), and obtains the probability distribution over such functions by Bayesian estimation. A Gaussian process can be thought of as a "box that pops out a function f() when shaken," and a cloud of posterior functions can be obtained by fitting this box to real data.

    A Gaussian process can be viewed as a linear regression model with its weights integrated out, and can be thought of as an infinite-dimensional Gaussian distribution. However, since the data are always finite, the Gaussian process is in practice just a finite-dimensional multivariate Gaussian distribution. A Gaussian process is also a probability distribution that generates random functions.

    In this article, we will give an overview of Gaussian processes and their relationship to the kernel trick.

    In this article, we will discuss the relationship between the kernels of Gaussian processes and the basis functions of linear models, and then look at various kernel functions (the Matérn kernel, string kernel, Fisher kernel, marginalized kernel for HMMs, linear kernel, exponential kernel, periodic kernel, and RBF kernel).
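    A few of the kernels listed above are simple enough to write down directly; the sketch below implements the linear, exponential, periodic, and RBF kernels as plain NumPy functions for 1-D inputs, with illustrative parameter values.

```python
# Minimal implementations of a few standard kernel functions (1-D inputs).
import numpy as np

def linear_kernel(x1, x2):
    return np.outer(x1, x2)

def exponential_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-np.abs(x1[:, None] - x2[None, :]) / lengthscale)

def periodic_kernel(x1, x2, period=1.0, lengthscale=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

def rbf_kernel(x1, x2, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.linspace(0, 5, 50)
for k in (linear_kernel, exponential_kernel, periodic_kernel, rbf_kernel):
    K = k(x, x)                     # each call builds a 50 x 50 kernel matrix
    print(k.__name__, K.shape)
```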

    Previously, we discussed the derivation of Gaussian processes and their properties. From here on, we discuss how regression problems can be solved with Gaussian processes. Specifically, we cover the calculation of the predictive distribution of a regression problem for a single prediction point and for multiple prediction points, and we also discuss the relationship between this predictive distribution and neural networks.
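    The predictive distribution can be computed directly from the standard GP regression formulas, mean = K*ᵀ (K + σ²I)⁻¹ y and cov = K** − K*ᵀ (K + σ²I)⁻¹ K*; the sketch below does this with a Cholesky factorization, using an illustrative RBF kernel, noise level, and toy data.

```python
# GP regression predictive mean and covariance computed by hand.
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
X_test = np.linspace(-3, 3, 100)
noise = 0.1 ** 2

K = rbf(X, X) + noise * np.eye(len(X))      # training covariance + noise
K_s = rbf(X, X_test)                        # train-test covariance
K_ss = rbf(X_test, X_test)                  # test covariance

# Solve with Cholesky for numerical stability instead of explicit inversion.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
v = np.linalg.solve(L, K_s)

pred_mean = K_s.T @ alpha
pred_cov = K_ss - v.T @ v
pred_std = np.sqrt(np.diag(pred_cov))
```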

    In the previous examples, the hyperparameters of the kernel were given by hand as θ1=1, θ2=0.4, and θ3=0.1. How can we estimate these parameters? If we put the hyperparameters together as θ=(θ1, θ2, θ3), the kernel depends on θ, so the kernel matrix K calculated from k also depends on θ and is Kθ.

    These can be optimized using gradient descent, the scaled conjugate gradient (SCG) method, L-BFGS, and similar methods.
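    The sketch below estimates the hyperparameters θ = (θ1, θ2, θ3) of an RBF kernel plus noise by minimizing the negative log marginal likelihood with L-BFGS-B (gradients are computed numerically for brevity); the data, parameterization, and added jitter are illustrative assumptions.

```python
# Kernel hyperparameter estimation by maximizing the log marginal likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)

def neg_log_marginal_likelihood(log_theta):
    variance, lengthscale, noise = np.exp(log_theta)   # theta1, theta2, theta3 > 0
    d = X[:, None] - X[None, :]
    # Small jitter keeps the Cholesky factorization stable during the search.
    K = variance * np.exp(-0.5 * (d / lengthscale) ** 2) + (noise + 1e-6) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log p(y | theta) = 0.5 y^T K^-1 y + sum(log diag(L)) + const
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), method="L-BFGS-B")
print(np.exp(result.x))   # estimated (variance, lengthscale, noise)
```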

    In this article, as generalizations of the Gaussian process, we discuss Gaussian processes with a Cauchy distribution for ensuring robustness, Gaussian process classification models, and Poisson distributions for events such as machine breakdowns and the decay of elementary particles.

    The hypothesis that an observation Y in the real world is obtained by sampling Y ~ p(Y) from some probability distribution p(Y) is called a stochastic generative model of the observation Y. In this section, we discuss the basics of stochastic models (independence, conditional independence, joint probability, marginalization, and graphical models) as an approach to stochastic generative modeling.

    The hypothesis that “observation Y in the real world is obtained by sampling Y~p(Y) from some probability distribution p(Y)” is called a probabilistic generative model or probabilistic model of observation Y. The probabilistic generative model is a hypothesis.

    It is important to note that a probabilistic generative model is a hypothesis. Because it is a hypothesis, there is no guarantee that it is true. Probabilistic generative models may be used to explain the win-loss record data not only in mahjong and backgammon games, which include a probabilistic component, but also in fully deterministic games such as Go and Shogi, which should not include a probabilistic component. Since a probabilistic generative model is a hypothesis, it does not matter if there are multiple hypotheses p(Y)=p1(Y), p(Y)=p2(Y), … for the same object.

    Furthermore, the difference between hypotheses may be represented by the parameter θ, and the stochastic generative model may be represented by the conditional probability p(Y|θ). In this case, the parametric conditional probability p(Y|θ) is called a parametric model. In a parametric stochastic generative model, determining the parameter θ based on the observation Y is called estimation of the parameter.

    Maximum likelihood estimation is the method of determining the parameter θ of a parametric stochastic generative model p(Y|θ) so that the likelihood function L(θ)=p(Y|θ) of the observed data is maximized. Bayesian estimation, by contrast, treats the unknown parameter θ itself as a random variable and estimates its distribution.

    Bayesian estimation considers that "what we want to know (the unknown parameter θ) is a random variable." In Bayesian estimation, "estimating" the unknown parameter θ means updating the distribution of the random variable θ when the observation X is obtained. (There are various ways of expressing the distribution of a random variable, including the distribution function, the cumulative distribution function, and the probability density. Here, unless otherwise noted, we identify the "distribution of a random variable" with the "probability density function" that represents it.)

    The primary goal of Bayesian estimation is to find the posterior probability distribution of the unknown values. What does it mean to "find a probability distribution," and what exactly do we need to do to feed data into a computer and compute the posterior probability? We consider these questions below.

    There are two main ways to represent the probability distribution of a random variable numerically on a computer, called parametric and nonparametric methods. Concrete techniques include weighted sampling, kernel density estimation, and distribution estimation using neural networks.
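    The sketch below shows two of these nonparametric representations in miniature: a weighted sample (importance-sampling style) and a kernel density estimate; the target and proposal distributions are illustrative assumptions.

```python
# Representing a distribution numerically: weighted sample and KDE.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Target p(x) = N(1.0, 0.5^2) is the distribution we want to represent;
# proposal q(x) = N(0.0, 2.0^2) is the one we can easily sample from.
# (1) Weighted sample: particles from q, weighted by p(x) / q(x).
particles = rng.normal(0.0, 2.0, size=5000)
weights = stats.norm.pdf(particles, 1.0, 0.5) / stats.norm.pdf(particles, 0.0, 2.0)
weights /= weights.sum()
weighted_mean = np.sum(weights * particles)      # should be close to 1.0

# (2) Kernel density estimate built from an (unweighted) sample of p.
sample = rng.normal(1.0, 0.5, size=1000)
kde = stats.gaussian_kde(sample)
density_at_1 = kde(np.array([1.0]))              # approximate value of p(1.0)
print(weighted_mean, density_at_1)
```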

    When the number of data points N is large, the calculation of Gaussian processes is bottlenecked by the computational cost of forming the kernel matrix and its inverse. This has led many people to believe that the Gaussian process method is theoretically interesting but not practical. However, there exist approaches that can significantly reduce the computational cost through various devices. Here, we describe a contrivance called the "auxiliary variable method," which introduces hidden variables that are not directly observed, together with its extensions.
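    The inducing-point idea behind the auxiliary variable method can be tried with GPy's sparse GP regression, assuming GPy is installed; the data size and the number of inducing points M are illustrative assumptions.

```python
# A minimal sketch of sparse GP regression with inducing points in GPy.
import numpy as np
import GPy

rng = np.random.default_rng(0)
N = 2000                                     # a "large" N for illustration
X = rng.uniform(-3.0, 3.0, size=(N, 1))
Y = np.sin(3 * X) + 0.1 * rng.standard_normal((N, 1))

kernel = GPy.kern.RBF(input_dim=1)
# M = 20 inducing inputs summarize the N training points, reducing the
# dominant cost from O(N^3) toward O(N M^2).
model = GPy.models.SparseGPRegression(X, Y, kernel=kernel, num_inducing=20)
model.optimize(messages=False)

X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
mean, var = model.predict(X_test)
```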

    In this article, we derive a stochastic gradient algorithm based on the variational Bayesian (VB) method.

    The variational Bayesian method reduces Bayesian estimation problems with complex hierarchical structures of unknown hidden variables and parameters to numerical optimization problems. The stochastic gradient method is an algorithm for solving numerical optimization problems by updating parameters sequentially, and it is an essential method for efficiently fitting complex parametric models such as neural networks to huge datasets. It can further improve the computational efficiency of the auxiliary variable method, especially when one wants to optimize the hyperparameters θ of the auxiliary variable method for Gaussian process regression models and the number of data points N is large.

    In this article, we will discuss a method to significantly reduce the computational cost of Gaussian processes by arranging auxiliary input points in a regular grid pattern. In the auxiliary variable method for Gaussian processes, there are cases in which the number of auxiliary input points must be set large, but if the total number of auxiliary input points M is increased, the computational efficiency of the auxiliary variable method, which assumes that M is small, is lost.

    In such cases, the calculation can be made more efficient even when M is large (even when M > N) by arranging the auxiliary input points on a lattice. Three ideas, the Kronecker method, the Toeplitz method, and local kernel interpolation, and the KISS-GP method, which combines them, are described below.

    In this article, I will describe an approach to the problems of spatial statistics and Bayesian optimization as an example of a real-world application that effectively uses the characteristics of Gaussian processes.

    Specifically, I describe how to combine ARD (a mechanism that automatically performs dimensionality selection) with Matérn kernels (a family that provides a variety of covariance functions) to build a model of a function whose shape is unknown (a black-box model) in Gaussian process regression. The combination of ARD and the Matérn kernel deserves consideration not only for Bayesian optimization but for any problem that requires a sufficiently flexible black-box model, as sketched below.
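    The sketch uses scikit-learn's anisotropic Matérn kernel, where one length scale per input dimension plays the role of ARD: after fitting, irrelevant dimensions receive large length scales. The toy data (only the first of three dimensions matters) are an illustrative assumption.

```python
# ARD via an anisotropic Matern kernel in scikit-learn.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(80, 3))
# Only the first dimension actually matters for the target.
y = np.sin(2 * X[:, 0]) + 0.05 * rng.standard_normal(80)

kernel = Matern(length_scale=[1.0, 1.0, 1.0], nu=2.5) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# After fitting, the length scales of the irrelevant dimensions grow large,
# which is how ARD performs automatic (soft) dimensionality selection.
print(gp.kernel_)
```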

    In this article, we will discuss unsupervised learning with Gaussian processes. Using a Gaussian process, the mapping from the latent variable X to the observation Y can be made nonlinear in a latent variable model in which Y is generated from X, while the problem remains mathematically well defined and tractable. This amounts to treating latent variable models of the kind built with neural networks in a more mathematically principled way. In addition, sampling of the latent variables associated with Gaussian processes is also discussed.

    Specifically, the Gaussian Process Latent Variable Model (GPLVM) and the Bayesian Gaussian Process Latent Variable Model (Bayesian GPLVM) are described.

    GPLVM assumes that the latent coordinates X=(x1,…,xN) of the observed data are independent of each other and that the prior distribution of X is \(p(\mathbf{X})=\prod_{n=1}^N p(\mathbf{x}_n)\), as in equation (5). In practice, however, X often has a cluster structure or is related as a time series. In such cases, if we design an appropriate probability model for p(X), we can learn the structure in the latent space hidden behind the observed values.
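    The following is a minimal sketch of fitting a plain GPLVM with GPy, assuming GPy is installed; the toy data (10-dimensional observations generated from a 2-D latent variable), the latent dimensionality, and the way the latent coordinates are read off the model are illustrative assumptions.

```python
# A minimal GPLVM sketch with GPy.
import numpy as np
import GPy

rng = np.random.default_rng(0)
# Observations Y: a 2-D latent variable mapped nonlinearly to 10 dimensions plus noise.
latent = rng.standard_normal((100, 2))
W = rng.standard_normal((2, 10))
Y = np.tanh(latent @ W) + 0.05 * rng.standard_normal((100, 10))

model = GPy.models.GPLVM(Y, input_dim=2, kernel=GPy.kern.RBF(input_dim=2, ARD=True))
model.optimize(messages=False)

# The optimized latent coordinates are stored on the model (here accessed as model.X).
X_latent = model.X
print(np.asarray(X_latent).shape)   # expected (100, 2)
```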

    Specific extended models discussed are the infinite warp mixture model, the Gaussian process dynamics model, the Poisson point process, the log Gaussian Cox process, the latent Gaussian process, and elliptic slice sampling.

    A Gaussian process is like a box that randomly outputs a function, and such a box is called a stochastic process. There are three advantages to obtaining a cloud of functions from a Gaussian process as a stochastic process: (1) when a cloud of functions is obtained, we know the degree of uncertainty; (2) we can tell the difference between regions where the model is confident and regions where it is not; and (3) model selection and feature selection can be performed.

    There is a close relationship between regression models and correlation coefficients. In the kernel methods closely related to Gaussian processes, measures such as HSIC (the Hilbert-Schmidt Independence Criterion) have been proposed against this background, and it has been shown that nonlinear correlations not captured by the usual product-moment correlation coefficient can be measured with high precision. Another advantage of kernel methods is that they can be applied not only to real values and their vectors but also to general data structures such as strings, graphs, and trees, so correlations between such objects can also be captured. These kernels can likewise be used in Gaussian processes.

    Stochastic processes, including Gaussian processes, were originally developed as models for describing the stochastically fluctuating processes of physical motion. It is therefore natural to think of Gaussian processes in connection with particles undergoing Brownian motion.

    In "Bayesian Learning for Neural Networks," Neal showed that a one-hidden-layer neural network is equivalent to a Gaussian process in the limit as the number of hidden units → ∞. Therefore, by considering a Gaussian process instead of a neural network, the optimization of the many weights in a neural network becomes unnecessary, and the predictive distribution can be obtained analytically. In addition, Gaussian processes have a natural structure as a stochastic model: unlike neural networks, where it is hard to predict what will be learned, Gaussian processes can express prior knowledge about a problem through kernel functions and can handle, in a principled way, objects that cannot trivially be vectorized, such as time series and graphs.

    Stan, BUGS, and similar tools, previously described in the context of probabilistic generative models such as Bayesian models, are also called probabilistic programming (PP) languages. PP is a programming paradigm in which probabilistic models are specified in some form and inference on those models is performed automatically. Its purpose is to integrate probabilistic modeling with general-purpose programming in order to build systems, combined with various AI techniques, for handling uncertain information in areas such as stock price prediction, movie recommendation, computer diagnostics, cyber intrusion detection, and image detection.

    In this article, we describe our approach to this probabilistic programming in Clojure.

     
