R language and Machine Learning


About R language and Machine Learning

R was created in 1993 by Robert Gentleman and Ross Ihaka of the University of Auckland, building on the S language invented by John Chambers of Bell Labs. It was designed to be interactive, so that users could enter commands, see results, and then run further commands. It has since evolved into a language that can be integrated into systems and used to tackle complex problems.

R is now used in the full stack of data analysis, including data extraction and transformation, model application, inference and prediction, and result plotting and reporting.

R offers many packages for machine learning, covering algorithm implementation, model evaluation, and hyperparameter optimization. It also has excellent data processing capabilities, making it well suited to data preprocessing and feature engineering.

Recently, R has also been used for deep learning, with libraries such as Keras and TensorFlow available for building deep learning models. Packages such as ggplot2 provide excellent graphical data visualization, which is useful for evaluating and visualizing machine learning models.

The popularity of R skyrocketed in the late 2000s, spreading not only to academia but also to banking, marketing, medical, political, genetic, and other industries, and many add-on packages (libraries that extend the functionality of R) have been released.

This blog describes various applications and implementations of R.

Technical Topics

In order to program, it is necessary to create a development environment for each language. This section describes how to set up specific development environments for Python, Clojure, C, Java, R, LISP, Prolog, JavaScript, and PHP, as described in this blog. Each language has its own platform to facilitate development, which makes it easy to set up the environment, but this section focuses on the simplest case.

R is a programming language and execution environment for statistical analysis. It is a relatively old tool, created at the University of Auckland, New Zealand, in 1993. It handles vectors and matrices simply and quickly, which makes it easy to construct mathematical processing algorithms. In addition, since its main purpose is computation, programs can be kept relatively simple. Because of these features, it is used by research institutes around the world, and many libraries (packages) have been created and released. The number of libraries is less than half that of Python, but considering that R is a language specialized for statistical analysis, I think it is safe to say that this is sufficient.

Many reference books have been published. Japanese books include “Machine Learning with R,” a translation of Brett Lantz’s book, and “A New Textbook of Data Analysis and Statistical Analysis in R for Everyone,” a translation of Jared P. Lander’s book. Books in other languages include “R in a Nutshell, Second Edition” from O’Reilly, “Statistical Analysis with R Beginner’s Guide” and “Big Data Analytics with R and Hadoop” from Packt, “Machine Learning with R” by Brett Lantz, “R for Everyone” by Jared P. Lander, and “An Introduction to Statistical Learning: with Applications in R” from Springer.

This section describes the R environment setup and simple clustering operation.

In this article, we discuss integration between Clojure and R. There are several tools for accessing R libraries from Clojure, which are summarized in the following links. For each tool, we discuss (a) the API and parsing provided, (b) the type of R backend used (JRI+REngine / Rserve+REngine / OpenCPU / running R from a shell, etc.), and (c) whether any Clojure concepts are used as equivalents of the R “data frame” or “matrix,” and if so, which ones.

Of these, we will discuss Clojisr, which is relatively stable and available.

File input/output functions are the most basic and indispensable functions when programming. Since file input/output functions are procedural instructions, each language has its own way of implementing them. Concrete implementations of file input/output in various languages are described below.

As described in the “History of Programming Languages” section, the basic functionality of structured languages consists of three elements: (1) sequential progression, (2) conditional branching, and (3) repetition. Here, we show implementations of repetition and branching in various languages.

Hierarchical clustering using R is described.

In this article, we will introduce k-means, a non-hierarchical clustering.
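As a rough illustration of what k-means does, here is a minimal sketch in Python using only the standard library (the linked article works in R). The function name, the toy points, and the choice of passing the initial centers in by hand are all assumptions made for this example.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied initial centers."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centers.append(center)  # leave the center of an empty cluster alone
        centers = new_centers
    return centers, clusters


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

In practice the initial centers are usually chosen at random and the algorithm is run several times, since the result depends on the starting point.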

This section describes clustering by decision tree using R.

Rule extraction using R is described.

LightGBM is a Gradient Boosting Machine (GBM) framework developed by Microsoft, designed to build fast and accurate models for large data sets. Here we describe its implementation in Python, R, and Clojure.

Time-series data is data whose values change over time, such as stock prices, temperatures, and traffic volumes. By applying machine learning to time-series data, models can learn from large amounts of data and make predictions on unknown data, which can be used for business decision making and risk management. This section describes implementations of time-series analysis using Python and R.

Time-series data is data whose values change over time, such as stock prices, temperatures, and traffic volumes. By applying machine learning to time-series data, models can learn from large amounts of data and make predictions on unknown data, which can be used for business decision making and risk management. In this article, we focus on state-space models among these approaches.

The Dynamic Factor Model (DFM) is a statistical model used in the analysis of multivariate time series data. It explains the variation in the data by decomposing multiple time series variables into common factors and individual (specific) factors. This article describes various algorithms and applications of DFM, as well as their implementations in R and Python.

Bayesian Structural Time Series Model (BSTS) is a type of statistical model that models phenomena that change over time and is used for forecasting and causal inference. This section provides an overview of BSTS and its various applications and implementations.

The Vector Autoregression model (VAR model) is one of the time series modeling methods used in fields such as statistics and economics. It is applied when multiple variables interact with each other. The ordinary autoregression (AR) model expresses the value of a variable as a linear combination of its past values; the VAR model extends this idea to multiple variables, predicting current values from the past values of multiple variables.
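To make the idea concrete, the following is a minimal Python sketch of a one-step forecast from a first-order VAR model, y_t = c + A y_{t-1}, with the noise term omitted. The function name and the coefficient values are hypothetical and chosen only for illustration; in practice the coefficients would be estimated from data.

```python
def var1_forecast(c, A, y_prev):
    """One-step VAR(1) forecast: y_t = c + A * y_{t-1} (noise term omitted)."""
    return [c[i] + sum(A[i][j] * y_prev[j] for j in range(len(y_prev)))
            for i in range(len(c))]


# Hypothetical coefficients for two interacting variables.
c = [0.1, 0.0]            # intercepts
A = [[0.5, 0.2],          # variable 1 depends on the past of both variables
     [0.0, 0.8]]          # variable 2 depends only on its own past
y_next = var1_forecast(c, A, y_prev=[1.0, 2.0])
```

The off-diagonal entry A[0][1] is what lets the past of the second variable influence the first; in a plain AR model that cross term does not exist.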

The Generalized Linear Model (GLM) is one of the statistical modeling and machine learning methods used for stochastic modeling of the relationship between response variables (objective variables) and explanatory variables (features). This section provides an overview of the generalized linear model and its implementation in various languages (Python, R, and Clojure).

Game theory is a theory for determining the optimal strategy when there are multiple decision makers (players) who influence each other, such as in competition or cooperation, by mathematically modeling their strategies and their outcomes. It is used primarily in economics, social sciences, and political science.

Various methods are used as algorithms for game theory, including minimax methods, Monte Carlo tree search described in “Overview of Monte Carlo Tree Search and Examples of Algorithms and Implementations”, deep learning, and reinforcement learning. Here we describe examples of implementations in R, Python, and Clojure.

Particle Swarm Optimization (PSO) is a type of evolutionary computation algorithm inspired by swarming behavior in nature, modeling the behavior of flocks of birds and fish. PSO is characterized by its ability to search a wider search space than genetic algorithms, which tend to fall into local solutions. PSO is widely used to solve machine learning and optimization problems, and numerous studies and practical examples have been reported.
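The basic PSO update can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the inertia weight w, the acceleration coefficients c1 and c2, the search range, and the sphere test function are all assumptions chosen for the example.

```python
import random


def pso(objective, dim, n_particles=20, iterations=200,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_x, best_val = pso(sphere, dim=2)
```

Each particle is pulled toward both its own best position and the best position found by the whole swarm, which is the "swarming" behavior the method is named after.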

Causal Forest is a machine learning model for estimating causal effects from observed data, based on Random Forest and extended based on conditions necessary for causal inference. This section provides an overview of the Causal Forest, application examples, and implementations in R and Python.

We describe linking QGIS with R as a machine learning approach for location information.

Implementation of Bayesian models in R (KFAS and field)

The implementation of PCA using R

Introduction to STAN, Bayesian estimation using R

About genlasso and lasso, sparse modeling using R

SVD (Singular Value Decomposition), PMD (Penalized Matrix Decomposition), and NMF (Non-negative Matrix Factorization) as sparse modeling in practice

SVM has become a standard tool for data analysis and is applied in a variety of fields. SVM is implemented in many statistical analysis software and can be easily used for small- and medium-scale data. Here we describe the kernlab package of the R statistical analysis environment.

On the other hand, using SVM for large-scale data, or changing parts of SVM to suit a particular purpose, requires knowledge of how the learning algorithms are implemented. With this goal in mind, here we describe in detail the implementation of the SVM software LIBSVM, which is created and maintained by Professor C.J. Lin’s group at National Taiwan University. LIBSVM is implemented in C++ and the code is publicly available, so it is relatively easy to modify it for your purposes and integrate it with other systems.

The most commonly used packages for handling state-space models in R are dlm and KFAS. Both allow filtering, smoothing, and prediction using the Kalman filter, but they differ in some respects. Here, we discuss the analysis of state-space models using the dlm package, which was developed by Giovanni Petris and handles the dynamic linear model, a linear Gaussian state-space model.
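For orientation, the Kalman filtering step for the simplest dynamic linear model, the local-level model, can be sketched in plain Python as follows. This is not the dlm API; the function name, the diffuse initial variance, and the toy observations are assumptions made for illustration.

```python
def local_level_filter(observations, var_state, var_obs, m0=0.0, c0=1e7):
    """Kalman filter for a local-level model.

    State:       level_t = level_{t-1} + w_t,  w_t ~ N(0, var_state)
    Observation: y_t     = level_t + v_t,      v_t ~ N(0, var_obs)
    Returns the filtered mean of the level at each time point.
    """
    m, c = m0, c0          # prior mean and (diffuse) prior variance of the level
    means = []
    for y in observations:
        # Predict: the level is a random walk, so only the variance grows.
        r = c + var_state
        # Update: blend prediction and observation using the Kalman gain.
        k = r / (r + var_obs)
        m = m + k * (y - m)
        c = (1 - k) * r
        means.append(m)
    return means


# Hypothetical observations.
filtered = local_level_filter([1.0, 1.2, 0.9, 1.4, 1.1], var_state=0.1, var_obs=0.5)
```

With a very large initial variance the first filtered value is almost equal to the first observation, after which the filter smooths between its prediction and each new data point.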

Since the data analyzed in the previous dlm analysis clearly shows seasonal variations, we will try to add seasonal adjustment to the model. dlmModSeas or dlmModTrig functions are used to handle the seasonal adjustment component in dlm. The former uses dummy variables to represent the seasonal adjustment component, while the latter uses trigonometric functions. Here we used the dlmModSeas function.

KFAS is a package developed by Jouni Helske. It differs from dlm in that it has a coefficient matrix Rt applied to the system noise, which is used to select which states receive system noise.

KFAS is also used to analyze the same seasonal adjustment model as described above. In KFAS, the model is defined by the SSModel function and, as in dlm, is built by combining functions. Here the model is constructed by combining SSMtrend, a function that handles polynomial trend components, and SSMseasonal, a function that handles seasonal components. The degree argument is set to 1 to make it a local-level model.

Create a “particle filter” in R without using any packages. If it is just filtering, it can be written in almost 3 lines, except for the initialization and parameter setting parts. Basically, for the data (observed values), the closer the particles are to the observation at time t, the greater the weight, the more likely they are to be selected in the resampling step, and thus the closer the path is to the data. The distribution of particles at each time point then represents the posterior distribution (more precisely, the filtered distribution) obtained from the model.
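The predict, weight, and resample cycle described above can be sketched as follows. This is a Python illustration using only the standard library rather than the R code of the linked article, and it assumes a local-level (random walk plus noise) model; the function name, noise settings, initialization around the first observation, and toy data are all hypothetical.

```python
import math
import random


def particle_filter(observations, n_particles=1000,
                    sd_state=1.0, sd_obs=1.0, seed=0):
    """Bootstrap particle filter for a local-level model.

    Returns the mean of the filtered particle distribution at each time point.
    """
    rng = random.Random(seed)
    # Assumption: start the particles scattered around the first observation.
    particles = [rng.gauss(observations[0], sd_obs) for _ in range(n_particles)]
    means = []
    for y in observations:
        # 1. Predict: move every particle forward with system noise.
        particles = [x + rng.gauss(0.0, sd_state) for x in particles]
        # 2. Weight: particles closer to the observation get larger weights.
        weights = [math.exp(-0.5 * ((y - x) / sd_obs) ** 2) for x in particles]
        # 3. Resample: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        means.append(sum(particles) / n_particles)
    return means


# Hypothetical observations that drift upward.
obs = [0.2, 0.5, 1.1, 1.9, 3.2, 4.1, 5.0, 5.8]
filtered_means = particle_filter(obs)
```

The three commented steps inside the loop are the "almost 3 lines" of filtering; everything else is initialization and bookkeeping. If an observation lies very far from every particle, all weights can underflow to zero, so a practical version would work with log-weights.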

The question of whether there is a causal relationship from one time series to another assumes that at least two time series data are of interest. Causal inference based on time series is essentially a multivariate time series problem, and multivariate autoregression models (Vector AutoRegression model, VAR model) are often used as models.

Here, we describe the procedure for analyzing causality based on the VAR model with the free software R, using the causal relationship between the Cabinet approval rating and stock prices as an example. Data often contain missing values, and the Cabinet approval rating, which is the subject of this article, is no exception. In this section, we discuss interpolation of missing values using the function decomp included in the timsac package of R. In addition, since causal analysis using the VAR model assumes stationarity of the time series, it is necessary to check for stationarity and to perform preprocessing to make the time series stationary. The procedure for this and the use of the unit root test are also described. The R package vars is used for estimation of unrestricted/restricted VAR models, lag selection, causality tests, and calculation of impulse response functions.

In this section, we introduce the multivariate autoregressive model (VAR model) as a framework for analyzing the causality of time series. A time series xt is said to cause yt “in the Granger sense” when the past values of xt are useful in predicting yt.

Stan can be used to detect change points in time series data. One of the time series data sets built into R is the Nile River flow data (Nile), which is known to have changed abruptly between 1898 and 1899. This page shows how the Nile River flow data can be used to detect change points.
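As a much simpler stand-in for the Bayesian approach with Stan, the idea of a single change point in the mean can be illustrated with a least-squares search in Python. Note that this is a different and cruder technique than the one in the linked article, and the series below is synthetic, not the actual Nile data.

```python
def best_change_point(series):
    """Return the index k that best splits the series into two constant-mean parts.

    series[:k] is treated as "before" and series[k:] as "after"; the best k
    minimizes the total squared error around the two segment means.
    """
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    return min(range(1, len(series)),
               key=lambda k: sse(series[:k]) + sse(series[k:]))


# Synthetic series with an abrupt drop in level after the fifth value.
series = [10.2, 9.8, 10.1, 10.4, 9.9, 6.1, 5.8, 6.3, 5.9, 6.0]
k = best_change_point(series)
```

For this synthetic series the split falls at index 5, where the level drops. A Bayesian model goes further by giving a posterior distribution over the change point instead of a single best index.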
