Bayesian inference and MCMC open source software


In Bayesian statistics, not only the data but also the factors behind the data are treated as being generated probabilistically. This is easy to picture as "a device that manufactures dice (each of which generates data with certain probabilities) produces those dice with some fluctuation in their probabilities," as mentioned previously. In other words, it is modeling that places a meta-probability on top of a probability distribution.

Computing with such Bayesian models requires evaluating probabilities, and MCMC (Markov chain Monte Carlo) is one approach to this. It is a family of algorithms for drawing samples (generating random numbers) from a multivariate probability distribution.
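As a minimal sketch of the idea, the following Python code implements a random-walk Metropolis-Hastings sampler for a one-dimensional target density; the function name, step size, and the standard-normal target are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def metropolis_hastings(log_target, n_samples=5000, init=0.0, step=1.0, seed=0):
    """Draw samples from an unnormalized log-density using a random-walk proposal."""
    rng = np.random.default_rng(seed)
    x = init
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + rng.normal(scale=step)        # propose a nearby point
        log_alpha = log_target(proposal) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:        # accept with probability min(1, ratio)
            x = proposal
        samples[i] = x                               # record the current state (Markov chain)
    return samples

# Example: sample from a standard normal (log-density up to an additive constant).
samples = metropolis_hastings(lambda x: -0.5 * x**2)
print(samples[1000:].mean(), samples[1000:].std())   # roughly 0 and 1 after discarding burn-in
```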

When complex probability distributions are handled by Bayesian estimation, the number of parameters to be estimated grows, and the computational cost of MCMC and similar methods becomes large. To address this, the "hierarchical Bayesian" approach is used, which effectively limits the number of free parameters by imposing common constraints on "similar parameters."

In addition, when applying probabilistic models to time-series data, autocorrelation must be taken into account: time-series data track changes within a single sequence, whereas general probabilistic models assume a large number of mutually independent samples.

One practical approach to Bayesian estimation is to use Stan, which can be thought of as a kind of random simulation that uses MCMC to generate sample data. While ordinary machine learning mainly optimizes a function so as to minimize the gap between the model and the actual data, this modeling approach estimates the model's parameters by simulating from an initially specified model, which is a strength when training data is scarce.

Bayesian statistics is sometimes described as "simple statistics that do not require much logic" compared with ordinary statistics. One reason for this simplicity is that it "makes more assumptions than ordinary (frequentist) statistics."

In Bayesian statistics, not only the data but also the factors behind the data are generated probabilistically. This is easy to grasp with the image, mentioned before, of "a device that manufactures dice (each of which generates data with certain probabilities) producing those dice with some fluctuation in their probabilities." In other words, a meta-probability is applied on top of a probability distribution.
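The dice-factory image can be written down as a short generative sketch in Python; the Dirichlet concentration and the number of dice below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Meta level: the "die factory" draws each die's face probabilities from a Dirichlet prior.
face_probs = rng.dirichlet(alpha=np.ones(6) * 50, size=3)   # three slightly different dice

# Data level: each manufactured die then generates rolls with its own probabilities.
for i, p in enumerate(face_probs):
    rolls = rng.choice(np.arange(1, 7), size=10, p=p)
    print(f"die {i}: probs ~ {np.round(p, 2)}, rolls = {rolls}")
```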

MCMC stands for the Markov chain Monte Carlo method, a family of algorithms for drawing samples (generating random numbers) from a multivariate probability distribution.

MCMC is merely a means of computation; to solve a task, the problem must first be organized so that it can be solved by MCMC (statistical modeling). This is similar to an arithmetic word problem, where deciding "what to calculate" matters more than the addition, subtraction, and multiplication themselves.

Matrix decomposition and optimization are needed to fit a curve by least squares or to estimate the parameters of a probability distribution by maximum likelihood. In Bayesian statistics, by contrast, the "answer" is first given in the form of a distribution called the "posterior distribution." Bayesian statistics is therefore a natural match for MCMC, which generates samples from a distribution, and the information ultimately needed, such as point estimates, errors, and prediction intervals, is obtained from the large number of samples that MCMC generates.
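Concretely, once posterior draws are available, such summaries are just statistics of the sample array. The sketch below uses stand-in normal draws in place of a real MCMC output.

```python
import numpy as np

# Suppose `samples` holds posterior draws of a parameter from an MCMC run
# (here replaced by stand-in draws for illustration).
samples = np.random.default_rng(1).normal(loc=2.0, scale=0.5, size=4000)

point_estimate = samples.mean()                         # posterior mean as a point estimate
posterior_sd = samples.std(ddof=1)                      # spread of the posterior as an "error"
ci_low, ci_high = np.percentile(samples, [2.5, 97.5])   # 95% credible interval

print(f"mean={point_estimate:.2f}, sd={posterior_sd:.2f}, "
      f"95% interval=({ci_low:.2f}, {ci_high:.2f})")
```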

Statistical models that deal with complex data tend to have many parameters to estimate. The hierarchical Bayesian model is characterized by imposing common constraints on "similar parameters," which allows the model to fit the data well even when, for example, the number of parameters exceeds the number of data points.
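As a sketch of what such a common constraint looks like, the following hierarchical model ties the group-level means together through a shared hyperprior (partial pooling). It assumes the PyMC library; the group sizes, priors, and synthetic data are illustrative choices only.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(8), 5)                        # 8 groups, 5 observations each
y = rng.normal(loc=rng.normal(50, 5, size=8)[groups], scale=2.0)

with pm.Model() as hierarchical_model:
    mu = pm.Normal("mu", mu=50, sigma=20)                  # shared hyperprior (common constraint)
    tau = pm.HalfNormal("tau", sigma=10)                   # how much the group means may vary
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=8)  # group-level "similar parameters"
    sigma = pm.HalfNormal("sigma", sigma=5)
    pm.Normal("obs", mu=theta[groups], sigma=sigma, observed=y)

    idata = pm.sample(1000, tune=1000, chains=2)           # MCMC draws from the posterior
```

Because all `theta` values share the hyperparameters `mu` and `tau`, groups with little data borrow strength from the others instead of being estimated completely independently.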

Once such a model has been built, several methods have been proposed for estimating its parameters from actual data, one of which is the Markov chain Monte Carlo (MCMC) method.

To consider these methods concretely, assume the following fictitious data: the heights of a group of 12-year-old elementary school boys (108 in total) were measured, and a second measurement was taken one year later.

Standard statistical analysis assumes that the data were sampled independently. If the data are not independent and there is some structure among them, a corresponding statistical analysis is needed. For example, if the data come from several groups and data values differ among the groups, a mixed model is used in which the groups are treated as random effects. Similarly, if there is temporal or spatial structure in the data, a statistical method that takes this into account is required.

When dealing with time-series data, that is, data that change over time such as temperature or stock prices, it is common for the value measured at one point in time to be correlated (autocorrelated) with the values just before and after it. Such data are not independent in time, and applying statistical methods that treat them as independent may lead to erroneous conclusions.
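A quick way to see this is to simulate a series where each value depends on the previous one and measure the lag-1 autocorrelation; the AR(1) coefficient and series length below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1)-like series: each value depends on the previous one, so observations are not independent.
n = 500
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + rng.normal()

def lag1_autocorr(series):
    s = series - series.mean()
    return np.sum(s[1:] * s[:-1]) / np.sum(s * s)

print(f"lag-1 autocorrelation: {lag1_autocorr(x):.2f}")   # close to 0.8, far from independence
```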

For this reason, various statistical methods for handling time-series data have been developed and are used in practice. The state-space model is one such method; its characteristic feature is that "observed values" are considered to be generated from "states" that change along the time series.
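The simplest case is a local-level model: a latent state follows a random walk and each observation is a noisy reading of that state. The simulation below is a minimal sketch of this generative story; the noise scales are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
state_sd, obs_sd = 0.5, 2.0          # assumed state and measurement noise scales

# Latent "state" evolves along the time series as a random walk.
state = np.cumsum(rng.normal(scale=state_sd, size=n))

# "Observed values" are generated from the state with additional measurement noise.
observed = state + rng.normal(scale=obs_sd, size=n)

print(observed[:5])                  # what the analyst actually sees; the state stays hidden
```

Estimation then works in the opposite direction: given only `observed`, the model infers the hidden states and the noise scales.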

A practical approach using actual tools is Stan, which can be regarded as a kind of random simulation that uses MCMC to generate sample data. While ordinary machine learning mainly optimizes a function to minimize the gap between the model and the actual data, this modeling approach estimates the model's parameters by simulating from the initially specified model, which is an advantage when training data is scarce.

To actually use Stan from R, install and use a package called 'RStan'.
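Stan can also be driven from Python, for example through CmdStanPy (one of the Python MCMC options linked below). The sketch below assumes CmdStanPy and the CmdStan toolchain are installed; the model, file name, and data are illustrative.

```python
from pathlib import Path
from cmdstanpy import CmdStanModel

# Minimal Stan program estimating the mean and sd of a normal sample.
stan_code = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"""
Path("normal_model.stan").write_text(stan_code)

model = CmdStanModel(stan_file="normal_model.stan")              # compiles the Stan program
fit = model.sample(data={"N": 5, "y": [4.9, 5.1, 5.3, 4.8, 5.0]}, chains=2)
print(fit.summary())                                             # posterior summaries per parameter
```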

  • MCMC library in python
  • Comparison of MCMC software
  • Various Representation Forms of Bayesian Models Including Time and Space
