Analysis with R and Clojure using Stan in MCMC model

Machine Learning Artificial Intelligence Digital Transformation Probabilistic Generative Models Navigation of this blog Algorithm Natural Language Processing Deep Learning Topic Model Markov Chain Monte Carlo Method C/C++ Clojure

Analysis in R and Clojure using Stan for Markov Chain Monte Carlo (MCMC) models

It is difficult to compute the integral of a complex complex multiple distribution such as Bayesian estimation analytically, and the Markov Chain Monte Carlo (MCMC) method is often used. This is a type of basic random selection algorithm. The simplest Monte Carlo method generates random numbers as candidate parameters and calculates the probability (integral) corresponding to the random numbers, but such a brute-force method is inefficient because the computational complexity increases explosively as the number of parameters increases, while the MCMC method is not completely random and can be used to calculate the probability of a random number. The algorithm is not random, but rather generates the next random number next random number while gradually searching for a value with a larger probability based on one previous random number (search for a Markov probability field). For details, please refer to the Markov Chain Monte Carlo (MCMC) method in this blog.

A well-known framework for calculating MCMC is STAN, which was developed in C++ by Andrew Gelman and his colleagues and first released in 2012. STAN has a variety of language interfaces, including RStan for integration with R, PyStan for integration with Python, CmdStan for command line executables, and clj-stan for integration with Clojure.

We have previously introduced Rstan, and Pystan and PyMC3 are MCMC implementations in Python. Also, since Rstan has not been updated since 2020 and the Stan team is apparently encouraging CmdStan, I would like to discuss clj-stan, a simple wrapper for CmdStan, as well as CmdStanr, a wrapper for CmdStan’s r.

Running CmdStan with R

First, about CmdStanr. Start the R tool and install CmdStanr.

>install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))

In addition, install other necessary libraries.

>install.packages("posterior")
>install.packages("bayesplot")

Load downloaded libraries

>library(cmdstanr)
>check_cmdstan_toolchain(fix = TRUE, quiet = TRUE)
>library(posterior)
>library(bayesplot)
>color_scheme_set("brightblue")

Install CmdStan.

>install_cmdstan(cores=2)

After the download is complete, check the path and version.

>cmdstan_path()
>cmdstan_version()

If both come up successfully, the installation is complete.

Next, let’s try using the sample models in the Stan library.

>file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
>mod <- cmdstan_model(file)

The contents can be checked with the $print() method.

>mod$print()
data {
  int<lower=0> N;
  array[N] int<lower=0,upper=1> y; // or int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}

See Bayesian Estimation with STAN in R for data, parametert, and model in the above model.

I will try to run MCMC as soon as possible.

>data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))

>fit <- mod$sample(
  data = data_list, 
  seed = 123, 
  chains = 4, 
  parallel_chains = 4,
  refresh = 500 # print update every 500 iters
)

Since the model is simple and the number of data points is small, the calculation is completed immediately.

Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
Chain 4 Iteration:  500 / 2000 [ 25%]  (Warmup) 
Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.3 seconds.

To summarize the results, use the summary() method, which calls summarize_draws() in the posterior package. This uses the first argument to specify the variable to summarize, and subsequent arguments are passed to posterior::summarize_draws() to specify which summary to compute, whether to use multicore, and so on.

>fit$summary()
# A tibble: 2 × 10
  variable   mean median    sd   mad      q5    q95  rhat ess_bulk ess_tail
                         
1 lp__     -7.27  -7.00  0.709 0.344 -8.70   -6.75   1.00    1852.    2114.
2 theta     0.247  0.232 0.119 0.123  0.0804  0.466  1.00    1611.    1678.

Execution in Clojure

Use Clojisr as described above. See “Statistical Learning with Clojure and R” for details on setting up the environment.

The project.clj file is as follows.

(defproject clj-stan "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "EPL-2.0 OR GPL-2.0-or-later WITH Classpath-exception-2.0"
            :url "https://www.eclipse.org/legal/epl-2.0/"}
  :dependencies [[org.clojure/clojure "1.11.1"]
                 [scicloj/clojisr "1.0.0-BETA20"]]
  :repl-options {:init-ns clj-stan.core}
  :jvm-opts ["-Dclojure.tools.logging.factory=clojure.tools.logging.impl/jul-factory"])

The src/core.clj file should look like this

(ns clj-stan.core)

(require '[clojisr.v1.r :as r :refer
            [r eval-r->java r->java java->r java->clj java->native-clj
             clj->java r->clj clj->r ->code r+ colon require-r]]
          '[clojisr.v1.robject :as robject]
          '[clojisr.v1.session :as session])

(require-r '[cmdstanr]
           '[posterior]
           '[bayesplot])

(r "file <- file.path(cmdstan_path(), 'examples', 'bernoulli', 'bernoulli.stan')")
(r "mod <- cmdstan_model(file)")

(r "mod$print()")

(r "data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))")
(r "fit <- mod$sample( data = data_list, seed = 123, chains = 4, parallel_chains = 4, refresh = 500 )")

(println (r "fit$summary()"))

We can confirm that the same behavior can be achieved by evaluating the R code with the r function, except that the library introduction part becomes “labray(cmdstanr)→(require-r ‘[cmdstanr])”.

As previously discussed in “Clojure-Python Collaboration and Machine Learning,” existing python libraries can also be used on Clojure, so it is possible to use these to combine the rich library processing of R⇄Clojure⇄Python It is possible to use all of them in combination on Clojure. (e.g. → Clojure (real-time data input processing such as IOT) → R (CmdStan) → Python (visualization with Pandas library), etc.)