On the use of emulators and the inverse problem of molecular simulation

2024.05.31 2022.06.23

From the Iwanami Data Science series, “Time Series Analysis – State Space Model, Causal Analysis, and Business Applications.” In the previous article, we discussed the dream of a realistic SimCity. This article discusses the use of emulators and the inverse problem of molecular simulation.

First, we will discuss the “use of emulators.”

The emulator concept has many applications. Items 1 and 3 below can be carried out freely with regression analysis in general, but combining them with emulation expands the range of applications further.

1. Designing something with desired properties: We can design various objects by optimizing the quantities obtained from a simulation, but the drawback is that this is computationally expensive. One possible method is therefore to build an emulator once and then optimize its output. As a concrete example, using the results of quantum chemical calculations as training data, an emulator can be built that outputs predicted values of the desired physical quantities when a chemical structural formula is entered, and molecules can be designed by searching for the structural formulas that optimize that output.
2. Connecting simulations at different levels and timescales: For example, when simulating the entire human body, if the results of the simulators for individual organs are represented as emulators, the overall simulation only needs to call the emulators, which reduces the computational load. The individual simulators then serve, in effect, as a database for the overall simulation.
3. Effective data collection using the results at each time point: The emulator concept is also useful for sequential experimental design (active learning), which aims to get the whole picture with the fewest possible experiments. An emulator is fitted to the results of the simulations conducted up to that point, and it is used to determine what information is missing and to plan the “next step,” i.e., the initial values and parameters for additional simulations and the data collection plan in the real world.
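The “build an emulator once, then optimize its output” idea in the first item can be sketched in a few lines. This is only a minimal illustration under stated assumptions: `expensive_simulation` is a hypothetical toy function standing in for a costly simulator, the design variable is a single number between 0 and 1, and the kernel-regression emulator is one simple choice among many, not a method taken from the original text.

```python
import math

def expensive_simulation(x):
    """Hypothetical stand-in for a costly simulator (toy function, assumed)."""
    return math.exp(-((x - 0.62) ** 2) / 0.05)

def fit_emulator(xs, ys, bandwidth=0.1):
    """Return a cheap emulator: Gaussian kernel regression on simulator runs."""
    def emulator(x):
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
        return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    return emulator

# Step 1: run the expensive simulator only at a handful of design points.
design_points = [i / 10 for i in range(11)]
simulated = [expensive_simulation(x) for x in design_points]
emulator = fit_emulator(design_points, simulated)

# Step 2: search a dense grid using the cheap emulator instead of the simulator.
candidates = [i / 1000 for i in range(1001)]
best_x = max(candidates, key=emulator)

print("design suggested by the emulator:", best_x)
```

The simulator is called 11 times, while the search over 1001 candidates touches only the emulator. In a sequential design setting, the suggested point would typically be fed back to the simulator and the emulator refitted.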

Next, we discuss the “inverse problem” of molecular simulation.

If “giving a model and examining the results generated from it” is called a “forward problem,” then “giving data and estimating the parameters of the model,” which is what data science does, is an “inverse problem.” Here, we will discuss what corresponds to the “inverse problem of protein simulation.”

The first thing that comes to mind is how to choose the functional form of the “force” or “potential energy” (the “force field,” in the field's terminology) used in the simulation. It is believed that the difficulty of the folding problem lies not only in the computation time, but also in the choice of force field.

Although a force field can be obtained by approximately solving the equations of quantum mechanics (so-called first-principles calculations), from the standpoint of data science it is important to combine this with an “inverse problem” approach, in which the force field is determined empirically by comparing experiments with simulations.

In this case, simply taking the known results for individual proteins and tuning the simulation to match each of them would be unreliable, so “generalization performance,” as in machine learning, is required. Adjusting the parameters of a simulation in a principled way so that it fits experiments is called calibration in statistical science.
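Calibration with a generalization check can be illustrated with a toy example. The sketch below assumes a Lennard-Jones pair potential as the “force field” and calibrates its well depth `epsilon` against synthetic “experimental” energies. The data, the noise level, the grid search, and the train/held-out split are all assumptions made for illustration, not something described in the original text.

```python
import random

def lj_energy(r, epsilon, sigma=1.0):
    """Lennard-Jones pair potential, a simple and widely used force-field term."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def squared_error(epsilon, data):
    """Sum of squared differences between simulated and observed energies."""
    return sum((lj_energy(r, epsilon) - e) ** 2 for r, e in data)

# Synthetic "experiment": energies generated from a hidden epsilon plus noise.
rng = random.Random(0)
hidden_epsilon = 0.8  # used only to fabricate the toy observations
distances = [1.0 + 0.05 * i for i in range(1, 21)]
observations = [(r, lj_energy(r, hidden_epsilon) + rng.gauss(0.0, 0.01))
                for r in distances]

# Keep half of the observations out of the fit to check generalization.
train, held_out = observations[::2], observations[1::2]

# Calibration: choose the epsilon that best reproduces the training data.
grid = [0.5 + 0.001 * i for i in range(601)]
fitted_epsilon = min(grid, key=lambda eps: squared_error(eps, train))

# Generalization: error on observations that were not used for fitting.
held_out_rmse = (squared_error(fitted_epsilon, held_out) / len(held_out)) ** 0.5

print("fitted epsilon:", fitted_epsilon)
print("held-out RMSE:", held_out_rmse)
```

In a real setting the held-out set would consist of different proteins rather than different distances, so that the check reflects whether the calibrated force field transfers to molecules it was not tuned on.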

Another “inverse problem” could be to “give the shape of a microscopic object as a target and find a sequence of elements that would yield that shape.” In the case of proteins, this is the inverse of folding and is called the inverse folding problem.

It may seem like a pipe dream to solve the inverse problem when even the prediction of folding is beyond our ability, but from an engineering point of view, the original goal should be “to be able to build microscopic objects at will.” A similar problem can be considered for macromolecules other than proteins (e.g., RNA) and for molecular aggregates.
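To make the inverse folding problem concrete, here is a deliberately tiny sketch that uses the two-dimensional HP lattice model, a standard toy model of proteins that is swapped in here for illustration and is not discussed in the original text. Each bead is hydrophobic (`H`) or polar (`P`), and the energy drops by 1 for every pair of `H` beads that touch on the lattice without being neighbors along the chain. The 6-bead target shape `"LLFL"` is a hypothetical choice. The code enumerates every sequence and keeps those whose unique lowest-energy shape is the target.

```python
from itertools import product

DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # east, north, west, south
TURN = {"F": 0, "L": 1, "R": -1}           # relative turns along the chain

def coords(turns):
    """Bead coordinates for a turn string, or None if the chain overlaps itself."""
    path, heading = [(0, 0), (1, 0)], 0
    for t in turns:
        heading = (heading + TURN[t]) % 4
        nxt = (path[-1][0] + DIRS[heading][0], path[-1][1] + DIRS[heading][1])
        if nxt in path:
            return None
        path.append(nxt)
    return path

def energy(seq, path):
    """-1 per pair of H beads adjacent on the lattice but not along the chain."""
    e = 0
    for i in range(len(seq)):
        for j in range(i + 2, len(seq)):
            dist = abs(path[i][0] - path[j][0]) + abs(path[i][1] - path[j][1])
            if seq[i] == seq[j] == "H" and dist == 1:
                e -= 1
    return e

def conformations(n):
    """All self-avoiding shapes of n beads, one per rotation/mirror class."""
    shapes = []
    for turns in product("FLR", repeat=n - 2):
        # Fix the mirror symmetry: the first bend, if any, must be a left turn.
        first_bend = next((t for t in turns if t != "F"), "L")
        if first_bend == "L" and coords(turns) is not None:
            shapes.append("".join(turns))
    return shapes

def inverse_fold(target, n):
    """Forward problem inside a loop: keep sequences that fold uniquely to target."""
    shapes = conformations(n)
    target_path = coords(target)
    designs = []
    for seq in map("".join, product("HP", repeat=n)):
        e_target = energy(seq, target_path)
        if all(energy(seq, coords(s)) > e_target for s in shapes if s != target):
            designs.append(seq)
    return designs

target = "LLFL"  # hypothetical 6-bead target shape
designs = inverse_fold(target, 6)
print(len(designs), "sequences fold uniquely to the target")
```

Brute force works only because the chain is so short. The number of shapes and sequences grows exponentially with chain length, which is one way to see why the problem is hard at realistic sizes.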

In the next article, we will discuss the application of state-space models to marketing.