Simulation, Data Science and Artificial Intelligence


Simulation, Data Science and Artificial Intelligence

Large-scale computer simulations have become an effective tool in a wide range of fields, from astronomy, meteorology, materials science, and biology to epidemics and urban growth, but only a limited number of simulations can be performed purely from fundamental laws (first principles). The power of data science is therefore needed to set the parameter values and initial values that are the preconditions of the calculation. Modern data science, however, is intertwined with simulation even more deeply than that.

Two such concepts are “data assimilation” and “emulators.” Data assimilation, which can be related to state-space models and Bayesian models, is characterized by starting from the simulation side and incorporating “data” into the course of the simulation. An emulator, for its part, is essentially a (high-dimensional) regression model, but one fitted to mimic the output of the simulation rather than real-world data.
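As a minimal sketch of the emulator idea (the toy “simulator” and all parameter values below are hypothetical), one can run the expensive simulation at a few design points and fit a cheap regression model to its outputs:

```python
import numpy as np

def expensive_simulation(theta):
    # Stand-in for a costly simulator run (hypothetical toy model).
    return np.sin(3.0 * theta) + 0.5 * theta ** 2

# Run the simulator only at a handful of design points.
design = np.linspace(0.0, 2.0, 8)
outputs = np.array([expensive_simulation(t) for t in design])

# Fit a cheap surrogate (here a cubic polynomial) to the simulator outputs.
emulator = np.poly1d(np.polyfit(design, outputs, deg=3))

# The emulator can now be evaluated at many new parameter values almost for free.
theta_new = np.linspace(0.0, 2.0, 200)
print(emulator(theta_new)[:5])
```

The essential point is that the regression target is the simulator output rather than real-world data; in practice Gaussian processes are a common choice of surrogate.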

Individual applications include “mobile observation,” which actively collects data in weather forecasting, an advanced area of data assimilation; “multivariate time series analysis of simulation results” in the protein field; and agent-based urban simulation. In the field of so-called “artificial intelligence,” such as game programming and high-dimensional image recognition, simulation is often used in Go and Shogi programs, and approaches such as deep learning, reinforcement learning, and GANs also play a role.

In addition, recent years have seen the emergence of breakthroughs such as reinforcement learning using simulated nuclear fusion data, as previously discussed in “Nuclear Fusion and AI Technology”.

This blog discusses simulation, data science, and artificial intelligence under the following topics.

Implementation

Simulation involves modeling a real-world system or process and executing it virtually on a computer. Simulations are used in a variety of domains, such as physical phenomena, economic models, traffic flows, and climate patterns, and can be built in steps that include defining the model, setting initial conditions, changing parameters, running the simulation, and analyzing the results. Simulation and machine learning are different approaches, but they can interact in various ways depending on their purpose and role.
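As a concrete illustration of those steps (define the model, set initial conditions, vary parameters, run, analyze), here is a minimal sketch using a toy SIR epidemic model; the parameter values are arbitrary placeholders:

```python
import numpy as np

def run_sir(beta, gamma, s0=0.99, i0=0.01, days=160, dt=1.0):
    """Run a discrete-time SIR epidemic simulation and return the infected curve."""
    s, i, r = s0, i0, 0.0
    infected = []
    for _ in range(int(days / dt)):
        new_inf = beta * s * i * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        infected.append(i)
    return np.array(infected)

# Change parameters, run the simulation, and analyze the results.
for beta in (0.2, 0.3, 0.4):
    curve = run_sir(beta=beta, gamma=0.1)
    print(f"beta={beta}: peak infected fraction = {curve.max():.3f}")
```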

This section describes application examples and various implementations of this combination of simulation and machine learning.

Individual Topics

To “become aware” means to observe or perceive something carefully; when a person notices a situation or thing, it means that he or she has perceived some information or phenomenon and gained some feeling or understanding about it. Becoming aware is thus an important process of gaining new information and understanding by paying attention to changes and events in the external world. In this article, I will discuss this awareness and the application of artificial intelligence technology to it.

The two main types of reasoning in research are deduction and induction. In physics and related fields, deduction is the method of inferring an unknown state by solving fundamental equations called governing equations. The governing equations vary with the subject and with the time and space scales: even for the same mechanical phenomenon, they range from the Schrödinger equation of quantum mechanics to the Navier-Stokes equations of fluid mechanics.

Usually, the equations are discretized in space and time; initial values, boundary conditions, and parameter values are set; and the time evolution or spatial coupling is computed on a supercomputer at enormous computational cost. This is so-called simulation. On the other hand, the method of predicting the future from a finite set of past events, such as experience and data, is called induction. Data science methods such as statistics, machine learning, and deep neural networks (DNNs) are all inductive. In this section, we discuss how to fuse and integrate these two approaches.
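As a toy illustration of this deductive route, the sketch below assumes a one-dimensional diffusion equation as the governing equation, discretizes it in space and time, fixes initial values, boundary conditions, and a parameter, and computes the time evolution explicitly (a caricature of what a supercomputer does at far larger scale):

```python
import numpy as np

# Discretization of u_t = alpha * u_xx on [0, 1] with fixed boundaries.
alpha, nx, nt = 0.01, 51, 2000
dx = 1.0 / (nx - 1)
dt = 0.4 * dx**2 / alpha          # step chosen for explicit-scheme stability

# Initial value: a localized bump in the middle of the domain.
u = np.exp(-((np.linspace(0, 1, nx) - 0.5) ** 2) / 0.01)

for _ in range(nt):
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    u = u + dt * alpha * lap
    u[0] = u[-1] = 0.0            # boundary condition: fixed at zero

print(u.max())                     # the bump diffuses and flattens over time
```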

There are several methods used by weather agencies and other institutions to produce weather forecasts. One of the main ones is the kinematic method, which predicts the future rain distribution by extrapolating from the current rain distribution. Kinematic forecasting is effective only for very short lead times, within a few hours, because the winds that drive the rain clouds themselves change from moment to moment due to various physical processes. Forecasts more than a few hours ahead are therefore made by constructing a discretized physical model based on a system of partial differential equations, such as the basic equations of fluid dynamics that describe atmospheric flow, and obtaining approximate solutions by numerical computation. This is referred to here as the physics-based method, as distinguished from the kinematic method.
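A caricature of the kinematic method, assuming a 2D rain-intensity grid and a single estimated motion vector (both made up here), simply advects the current field forward in time:

```python
import numpy as np

def kinematic_nowcast(rain, motion, steps):
    """Extrapolate a 2D rain field by shifting it along an estimated motion vector."""
    dy, dx = motion
    forecast = rain.copy()
    for _ in range(steps):
        # Shift the whole field; a toy periodic domain stands in for real advection.
        forecast = np.roll(forecast, shift=(dy, dx), axis=(0, 1))
    return forecast

rain_now = np.zeros((50, 50))
rain_now[20:25, 10:15] = 5.0            # a small rain cell (toy data)
rain_in_1h = kinematic_nowcast(rain_now, motion=(1, 2), steps=6)  # 6 x 10-minute steps
print(rain_in_1h.sum())
```

Because the real wind field changes from moment to moment, such pure extrapolation degrades quickly beyond a few hours, which is exactly the limitation noted above.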

In the physics-based method, numerical simulations are performed to imitate the physical laws. To start a numerical simulation, however, the current state of the atmosphere must be given to the system as the initial value of the calculation. The first step in creating these initial values is to collect observation data, but it would be wasteful to simply use the observed data as the initial values, so a process called data assimilation is used. Data assimilation is a framework that uses the error variances and covariances of physical quantities, estimated from observations and simulation results, to spread the information in the observed data to locations where there are no observation points.
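The core of this idea can be sketched as a single analysis step in the spirit of the Kalman filter update (a minimal illustration with made-up numbers, not an operational scheme): the background error covariance spreads the influence of a point observation to neighboring, unobserved grid points.

```python
import numpy as np

n = 5                                   # tiny 1D grid of model variables
xb = np.full(n, 20.0)                   # background (first-guess) state from the simulation

# Background error covariance: nearby grid points have correlated errors.
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
B = 1.0 * np.exp(-dist / 2.0)

H = np.zeros((1, n)); H[0, 2] = 1.0     # one observation, located at grid point 2
R = np.array([[0.25]])                  # observation error variance
y = np.array([21.5])                    # observed value

# Analysis update: xa = xb + K (y - H xb), with K built from B and R.
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
xa = xb + (K @ (y - H @ xb)).ravel()
print(xa)    # the correction is largest at point 2 and decays at unobserved neighbors
```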

In animals, muscle is made up largely of proteins. Indeed, proteins serve as major components throughout living things, whether animals or plants, acting as catalysts (enzymes) for chemical reactions within organisms and as receptors on biological membranes. The facts that “information is transcribed from DNA to RNA, from which the amino-acid sequence of a protein (its primary structure) is determined” and that “the three-dimensional shape a protein takes (its higher-order structure) is essential to its function as a component” are the starting point of biology.

The next question, then, is how the higher-order structure is determined from the primary structure. For proteins that are not very large molecules, the answer is that the higher-order structure is formed by spontaneous folding of the chain. This is called protein folding.

To reproduce this phenomenon on a computer, realistic simulations of proteins have been constructed based on Newton’s equations of motion. This technique is called molecular dynamics (MD), and it is similar in spirit to Hamiltonian Monte Carlo (HMC) in data science.
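A minimal sketch of the MD idea, assuming a single particle in a harmonic potential rather than a real protein force field, integrates Newton’s equation of motion with the velocity Verlet scheme:

```python
import numpy as np

def force(x, k=1.0):
    # Harmonic potential U(x) = k x^2 / 2, so F = -dU/dx = -k x (toy force field).
    return -k * x

# Velocity Verlet integration of Newton's equation m a = F(x).
m, dt, steps = 1.0, 0.01, 1000
x, v = 1.0, 0.0
a = force(x) / m
traj = []
for _ in range(steps):
    x = x + v * dt + 0.5 * a * dt**2
    a_new = force(x) / m
    v = v + 0.5 * (a + a_new) * dt
    a = a_new
    traj.append(x)

print(max(traj), min(traj))   # oscillates between roughly +1 and -1 (energy conserved)
```

The same numerical integration of Hamiltonian dynamics is what Hamiltonian Monte Carlo uses to propose new samples, which is why the two techniques resemble each other.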

SimCity is an urban development simulation game that has been a long-running hit since its release in 1989.

In this game, players take on the role of mayor and build a virtual city, constructing power plants, town halls, and other infrastructure the city needs. As the city grows to a certain size, the complexity of the issues that must be addressed, such as resident satisfaction and disaster response, increases, and the game becomes more challenging.

SimCity is a game made for entertainment, but if a simulator based on real-world data and realistic models could be used through a SimCity-like interface, even non-experts could become widely and deeply involved in planning and formulating marketing and policy measures.

The emulator concept has many applications: (1) designing something with desired properties, (2) connecting simulations at different levels and timescales, and (3) collecting data efficiently by making use of the results available at each point in time. Items (1) and (3) can also be carried out with ordinary regression analysis, but combining them with emulation expands their range of application considerably.

If “giving a model and examining the results generated from it” is called a “forward problem,” then “giving data and estimating the parameters of the model,” as is done in data science, is called an “inverse problem.” Here, we discuss what corresponds to the inverse problem of protein simulation.

It is conceivable to combine mesh-based population data, aggregated by time period from real-time mobile-terminal observations, with urban flow data based on person-trip (PT) survey data to obtain real-time estimates of the flow of people. These approaches are also known as data assimilation, in the sense that they integrate simulation models with observation data, and they have the advantage that predictions of the future, which cannot be obtained from observation data alone, can be made from the models.

Specifically, we use a state-space model whose state variable is the locations of the people moving about the city, and assimilate the data using a particle filter. When observed data are obtained, the likelihood (goodness of fit to the observations) of each particle is calculated, and the particles are resampled with weights proportional to their likelihoods; these particles are then used to predict the next time step. The simulation assumes that each person’s destination is given to some extent from past PT data, and speed and other factors are adjusted according to the traffic conditions on the way to that destination.

In this article, we discuss how to construct a model that describes the whole on the basis of local models. Specifically, we first take temperature data for Tokyo as an example and extract meaning from the data using a statistical model with a small number of fixed parameters; the discussion starts, so to speak, with data analysis using an overall model. Next, we introduce a local linear model to extract temporally localized information, and then extend it to a nonlinear model to see how the expressive power of the model can be enriched. This local nonlinear model is a “soft” model, expressed as stochastic difference equations, that allows stochastic deviations from constraints that would normally be written as exact equations. Furthermore, by generalizing the distribution followed by the noise term that generates the stochastic fluctuations to a non-Gaussian distribution, we can better handle rarely occurring stochastic events such as jumps and outliers. This characteristic, which cannot be expressed by a noise term following a Gaussian distribution, is called non-Gaussianity.
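To see what the non-Gaussian noise term buys, a minimal sketch (with arbitrary scales, not the Tokyo temperature data) compares a random-walk model driven by Gaussian noise with one driven by a heavy-tailed Cauchy distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

# Local-level model x_t = x_{t-1} + w_t with two choices of system noise w_t.
gauss_noise = rng.normal(loc=0.0, scale=0.1, size=T)
cauchy_noise = 0.1 * rng.standard_cauchy(size=T)   # heavy-tailed: rare large jumps

x_gauss = np.cumsum(gauss_noise)
x_cauchy = np.cumsum(cauchy_noise)

# The heavy-tailed path shows occasional abrupt level shifts (jumps/outliers)
# that a Gaussian noise term of comparable scale almost never produces.
print("largest single step (Gaussian):", np.abs(gauss_noise).max())
print("largest single step (Cauchy):  ", np.abs(cauchy_noise).max())
```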

This article describes an anomaly detection technique for systems whose input-output pairs can be observed. In this case, the relationship between input and output is modeled as a response surface (or regression curve), and anomalies are detected as deviations from that surface. In practice, the relationship between input and output is often nonlinear, so we discuss Gaussian process regression, which has a wide range of engineering applications among nonlinear regression techniques.

When both input and output are observed, the most natural way to detect anomalies is to check whether the output for a given input deviates significantly from the value expected under normal behavior. In this sense, the problem can be called a response anomaly detection problem.
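As a minimal sketch of this response-anomaly idea, the code below fits a hand-rolled Gaussian process regression with an RBF kernel to normal input-output pairs (hyperparameters fixed by hand rather than learned) and scores new pairs by their standardized deviation from the predicted response surface:

```python
import numpy as np

def rbf(a, b, length=1.0, var=1.0):
    d2 = np.subtract.outer(a, b) ** 2
    return var * np.exp(-0.5 * d2 / length**2)

# Normal training data: noisy observations of an unknown input-output relationship.
rng = np.random.default_rng(1)
x_train = np.linspace(0, 5, 40)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=x_train.size)
noise = 0.1 ** 2

# GP posterior predictive mean and variance at test inputs.
K = rbf(x_train, x_train) + noise * np.eye(x_train.size)
L = np.linalg.cholesky(K)

def gp_predict(x_test):
    Ks = rbf(x_train, x_test)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(x_test, x_test)) - np.sum(v * v, axis=0) + noise
    return mean, var

# Anomaly score: squared deviation from the response surface, scaled by predictive variance.
x_new = np.array([2.5, 4.0])
y_new = np.array([np.sin(2.5), 2.0])          # the second output deviates strongly
mean, var = gp_predict(x_new)
print((y_new - mean) ** 2 / var)              # the anomalous pair gets a much larger score
```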

Here we create a “particle filter” in R without using any packages. If only the filtering step is needed, it can be written in almost three lines, aside from the initialization and parameter-setting parts. Basically, the closer a particle is to the observation at time t, the greater its weight and the more likely it is to be selected in the resampling step, so the filtered path stays close to the data. The distribution of particles at each time point then represents the posterior distribution (more precisely, the filtered distribution) obtained from the model.
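The linked article writes the filter in R; the sketch below is an analogous minimal bootstrap particle filter in Python (NumPy), assuming a random-walk state model and Gaussian observation noise (both toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a slowly drifting signal observed with noise.
T = 100
true_x = np.cumsum(rng.normal(0, 0.1, T))
y = true_x + rng.normal(0, 0.5, T)

# Bootstrap particle filter with a random-walk system model.
n_particles, sys_sd, obs_sd = 1000, 0.1, 0.5
particles = rng.normal(0, 1, n_particles)
filtered_mean = []
for t in range(T):
    particles = particles + rng.normal(0, sys_sd, n_particles)      # predict
    w = np.exp(-0.5 * ((y[t] - particles) / obs_sd) ** 2)           # weight by likelihood
    w /= w.sum()
    particles = rng.choice(particles, size=n_particles, p=w)        # resample
    filtered_mean.append(particles.mean())

print(filtered_mean[-1], true_x[-1])   # the filtered estimate tracks the hidden state
```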

Markov chain Monte Carlo methods

    MCMC is an abbreviation for Markov Chain Monte Carlo Method, which is an algorithm for extracting samples (generating random numbers) from a multivariate probability distribution.

    MCMC is just a computational method, and in order to solve a task, it is necessary to organize the problem so that it can be solved by MCMC (statistical modeling). This is similar to the way that “what to calculate” is more important than the calculation of addition, subtraction, and multiplication itself in arithmetic problems.

    “Matrix decomposition” and “optimization” are what is needed to fit a curve by the least-squares method or to estimate the parameters of a probability distribution by maximum likelihood. In Bayesian statistics, by contrast, the “answer” is first given in the form of a distribution called the “posterior distribution,” which makes it a perfect match for MCMC, which generates samples from a distribution. The information that is ultimately needed, such as point estimates, errors, and prediction intervals, is obtained from the many samples generated by MCMC.

    The difference between MCMC and optimization is that MCMC keeps moving and generating samples forever, while optimization stops somewhere (at the true optimal solution when it works, and elsewhere when it does not). In fact, MCMC may also behave much like optimization, but even in that case it keeps moving around the optimal solution forever, expressing the error in the Bayesian sense.

    MCMC is a general-purpose method for handling probability distributions and is also used in statistical physics and frequentist statistics. In Bayesian statistics, on the other hand, various computational methods are used, such as the Laplace approximation, the Kalman filter, sequential Monte Carlo methods, and variational Bayes; MCMC is one of them.

    In the following pages of this blog, we discuss the basic theory of the Markov chain Monte Carlo method, the specific algorithm and the implementation code.
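As a complement to those pages, here is a minimal random-walk Metropolis sketch, one of the simplest MCMC algorithms, sampling from an arbitrarily chosen unnormalized target density:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log density of a two-component Gaussian mixture (toy target).
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

# Random-walk Metropolis: propose a local move, accept with probability min(1, ratio).
n_samples, step = 20000, 1.0
x = 0.0
samples = []
for _ in range(n_samples):
    proposal = x + step * rng.normal()
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.std())   # summaries (point estimates, intervals) come from the samples
```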
