Shaky Proteins and Old Me: Data Science in the Age of Misfolding


Summary

According to an article on AlphaFold2 by DeepMind, the developer of AlphaGo, the analysis of protein 3D structures, which used to take years, can now be done in a few hours by the protein structure prediction AI “AlphaFold,” revolutionizing drug development and industrial applications. In this article, following the previous article on weather forecasting and data science, we discuss the simulation and machine-learning analysis of the 3D shapes of proteins (and their misfolding), based on the Iwanami Data Science Series volume “Time-Series Analysis: State Space Models, Causal Analysis, and Business Applications.”

Iwanami Data Science vol. 6

3D shape and function of proteins

In animals, muscle consists largely of protein. In fact, proteins serve as major components in all parts of life, in animals and plants alike, acting as catalysts (enzymes) for chemical reactions within organisms and as receptors on biological membranes. The starting point of molecular biology is that “information is transcribed from DNA to RNA, which determines the sequence of amino acids in a protein (its primary structure),” and that “the three-dimensional shape a protein takes (its higher-order structure) is essential to its function as a component.”

The next question, then, is how the higher-order structure is determined from the primary structure. For proteins that are not very large molecules, the answer is that the higher-order structure forms by spontaneous folding. This is called protein folding, and it is not limited to cases where the molecule is synthesized in order from one end: it has been observed experimentally that when a correctly shaped protein is heated until it unravels and then cooled down again, it coils up like a shape-memory alloy and returns to its original shape.

From the viewpoint of biophysics, the questions are by what principle such a complex shape can be created fully automatically, and what kind of “design” the evolutionary process must have produced to make this possible. From the standpoint of computer and information science, folding is examined as an optimization problem of minimizing energy (more precisely, free energy). In this picture, “folding” is essentially another word for “optimization.”

Protein Fluctuation

The word “folding” gives the impression of settling into a certain shape and then functioning. In reality, however, because the temperature is finite, protein molecules must be constantly changing their form due to molecular motion. If this change of form were only a slight wobble around a fixed shape, the picture of “folding = optimization” would be essentially correct, but developments over the past few decades have shown that this is not always the case.

It is often the case that, even when the molecule as a whole is properly folded, certain parts wobble greatly, in ways related to the protein's function or characteristics. There are also proteins that are not folded at all, in part or in whole, and exist in an unfolded state. Such proteins are called IDPs (intrinsically disordered proteins).

IDPs are not uncommon, and it is now believed that quite a few proteins in vivo have such properties. For example, the protein α-synuclein, which has attracted attention in connection with Parkinson’s disease, is (according to various reports) an IDP. In the case of enzymes, some fold up only when they approach their target molecule (substrate), and thereby fulfill their function.

Proteins are not just “microscopic machines,” but “machines that fluctuate and do work.”

Protein misfolding

Another thing that has greatly changed our image of proteins is the realization that there is not necessarily a single way for a protein to fold. A molecule with the same amino acid sequence can have two or more very different higher-order structures.

This kind of thing appears to occur even in normal organisms, but its importance became clear with the discovery that diseases such as Creutzfeldt-Jakob disease and bovine spongiform encephalopathy (the prion diseases) are caused by proteins called prions folding into shapes different from the normal one. This phenomenon is called misfolding.

In general, misfolded proteins form abnormal small aggregates (oligomers) or stick together into water-insoluble clumps (aggregates), which can be toxic to cells. Prions, however, also have the ability to “autocatalytically” convert the proteins around them into the same shape as themselves, just as a seed crystal in solution grows into a crystal (see figure below).

The autocatalytic shape change of prions is often compared to the scenario in Kurt Vonnegut’s novel “Cat’s Cradle,” in which a “seed” of a new kind of ice, more stable than liquid water at room temperature, causes all the water in the world to turn solid.

While one might think that this is a rare phenomenon unique to prion diseases, recent research has raised the possibility that Parkinson’s disease, Alzheimer’s disease, ALS, multiple system atrophy (MSA), and even certain types of diabetes may be related to protein misfolding. In Parkinson’s disease, for example, the idea has been put forward that misfolding of α-synuclein is the true nature of the disease.

So, in these diseases, are changes in protein shape transmitted autocatalytically, as with the original prion? This is not yet known, although there are results suggesting it in in vitro and animal experiments. Of course, this does not mean that the diseases listed above are passed between people through ordinary contact. (Even in Creutzfeldt-Jakob disease, no transmission between spouses or family members has been reported.)

Protein Simulation

As a way to reproduce these phenomena in a computer, realistic simulations of proteins have been constructed based on Newton’s equations of motion. This technique is called molecular dynamics (MD), and it is a close relative of Hamiltonian MCMC in data science.
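As a rough sketch of what an MD integrator does (not the realistic all-atom force fields used for actual proteins), the toy Python example below integrates Newton's equations of motion with the velocity Verlet scheme for a single particle in an assumed double-well potential; the potential, mass, time step, and step count are arbitrary illustration values.

```python
import numpy as np

def force(x):
    # Toy double-well potential U(x) = (x**2 - 1)**2; the force is -dU/dx
    return -4.0 * x * (x**2 - 1.0)

def velocity_verlet(x0, v0, mass=1.0, dt=0.005, n_steps=10000):
    """Integrate Newton's equation of motion m*a = F(x) (the core loop of an MD code)."""
    x, v = x0, v0
    a = force(x) / mass
    traj = np.empty(n_steps)
    for i in range(n_steps):
        x = x + v * dt + 0.5 * a * dt**2   # position update
        a_new = force(x) / mass            # force at the new position
        v = v + 0.5 * (a + a_new) * dt     # velocity update
        a = a_new
        traj[i] = x
    return traj

trajectory = velocity_verlet(x0=-1.0, v0=0.8)
print(trajectory[:5])
```

A real protein MD code performs the same kind of update, but for tens of thousands of interacting atoms with an empirical force field, usually together with a thermostat that keeps the temperature finite.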

Simulations are used for more than just studying folding, but if one were to seriously try to follow the entire folding process with a realistic molecular model, a tremendous amount of computation would be required. In the real world, the folding time of even a small protein is typically on the order of milliseconds (thousandths of a second).

The first MD simulations of proteins, reported in the late 1970s, treated the protein (BPTI) alone, without water molecules, and covered just under 10 picoseconds (less than one hundred-billionth of a second). By the turn of the century, calculations of one nanosecond (one billionth of a second) including water molecules had become commonplace, and in 1998 a simulation reaching one microsecond (one millionth of a second) in water was reported. However, this is still not nearly enough.
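To get a feel for the gap, note that all-atom MD typically uses an integration time step of around 1–2 femtoseconds (a standard but adjustable value, assumed here); reaching one millisecond at 2 fs per step then requires

\begin{eqnarray}& &\frac{1\ \mathrm{ms}}{2\ \mathrm{fs}}=\frac{10^{-3}\ \mathrm{s}}{2\times 10^{-15}\ \mathrm{s}}=5\times 10^{11}\ \mathrm{steps}\end{eqnarray}

each of which involves evaluating the forces on every atom, including the surrounding water.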

One approach to this problem is to increase the efficiency of sampling with methods such as the replica exchange Monte Carlo method or the multicanonical method. Another is to make the model itself discrete, in the spirit of the Ising model of magnetic materials (the lattice protein model).
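As a minimal sketch of the replica exchange idea, the example below runs Metropolis Monte Carlo at several temperatures in parallel on an assumed toy double-well energy (standing in for a protein's energy landscape) and occasionally swaps configurations between neighboring temperatures; the temperatures, step sizes, and sweep count are illustration values, not those of any real protein study.

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    # Toy double-well energy landscape standing in for a protein's energy surface
    return (x**2 - 1.0)**2

temperatures = np.array([0.05, 0.1, 0.2, 0.4])   # one replica per temperature
betas = 1.0 / temperatures
x = rng.normal(size=len(betas))                  # current configuration of each replica

def metropolis_step(x_i, beta, step=0.3):
    # Ordinary Metropolis update at inverse temperature beta
    x_new = x_i + rng.normal(scale=step)
    d_e = energy(x_new) - energy(x_i)
    if d_e <= 0 or rng.random() < np.exp(-beta * d_e):
        return x_new
    return x_i

for sweep in range(5000):
    for k, beta in enumerate(betas):
        x[k] = metropolis_step(x[k], beta)
    # Attempt to exchange configurations between a random pair of neighboring temperatures
    k = rng.integers(len(betas) - 1)
    log_alpha = (betas[k] - betas[k + 1]) * (energy(x[k]) - energy(x[k + 1]))
    if log_alpha >= 0 or rng.random() < np.exp(log_alpha):
        x[k], x[k + 1] = x[k + 1], x[k]

print("final replica configurations:", x)
```

The exchanges let a configuration trapped in one energy well at low temperature escape via the high-temperature replicas, which is exactly the sampling bottleneck that plagues straightforward simulation.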

Another direction is to use brute-force computing power. General-purpose machines such as the K computer can of course be used, but there are also approaches that build dedicated machines, represented by MDGRAPE, or that use GPUs. What was striking in this line of work was that the dedicated computer Anton finally achieved millisecond-scale calculations for a small protein (BPTI), first announced around 2009. Subsequently, BPTI was reported to go back and forth between multiple states, and various small proteins were reported to fold from an unfolded state. However, the folding of larger proteins and the generation of thermal equilibrium states are still beyond reach.

Multivariate analysis of simulations

There are various ways in which data science can contribute here; among them, I will discuss the approach of “multivariate analysis of simulations.”

A long molecular dynamics simulation produces a large amount of data in the form of a “time series of the positions of all atoms.” The secrets of “fluctuating proteins and proteins that return to their shape” are hidden in this data, and it is the job of data science to extract and visualize the necessary information from it.

The first method to consider is principal component analysis (PCA): one solves the eigenvalue problem C0z = λz, where C0 is the covariance matrix of the vector x of coordinates of all or some of the atoms obtained from the simulation. Then, by projecting the data onto the eigenvectors belonging to the large eigenvalues, dimensionality reduction and visualization are performed.

The example below shows the result of analyzing a 5-microsecond simulation of the LAO-binding protein (3,649 atoms, or 40,978 atoms if the surrounding water molecules are included).

Its structural units (domains) are thought to be important for the recognition of substrates. PCA of the coordinates of the 238 main-chain carbon atoms (Cα atoms) of the amino acids succeeded in extracting, without any prior knowledge, coordinates that represent the open/close motion of these domains.
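The procedure itself is short. The sketch below runs the PCA described above on a stand-in trajectory array (random numbers here; in practice the frames would first be superposed to remove overall translation and rotation, and `n_frames` and the atom selection are placeholder choices):

```python
import numpy as np

# Stand-in for an MD trajectory: n_frames snapshots of n_atoms 3D coordinates
rng = np.random.default_rng(0)
n_frames, n_atoms = 1000, 238
X = rng.normal(size=(n_frames, 3 * n_atoms))    # shape: (frames, flattened coordinates)

# Covariance matrix C0 of the mean-removed coordinate vector x
X_centered = X - X.mean(axis=0)
C0 = X_centered.T @ X_centered / (n_frames - 1)

# Eigenvalue problem C0 z = lambda z; keep the eigenvectors with the largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(C0)
order = np.argsort(eigvals)[::-1]
pc_axes = eigvecs[:, order[:2]]                 # first two principal axes

# Projection of every frame onto the leading axes, for dimensionality reduction and plotting
projection = X_centered @ pc_axes
print(projection.shape)                          # (n_frames, 2)
```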

Extracting dynamic information from time series

As a method to deal with dynamic behavior beyond PCA, we describe here relaxation mode analysis (RMA), which was introduced by Takano et al.

RMA is related to methods developed in other fields, such as dynamic mode decomposition (DMD) and time-lagged independent component analysis (tICA). Specifically, instead of the eigenvalue problem of the covariance matrix C0 = C(0) used in PCA, we estimate the following time-lagged correlation from the time-series data x(t) and consider the generalized eigenvalue problem C(t0+τ)z = λC(t0)z:

\begin{eqnarray}& &C(\tau)=\mathbf{E}\left[(x(t)-\bar{x})(x(t+\tau)-\bar{x})^{\top}\right]\\& &(\mathbf{E}\ is\ the\ time\ average\ assuming\ stationarity,\ \bar{x}\ is\ the\ mean\ of\ x)\end{eqnarray}

The axes onto which the data are projected are then determined from the generalized eigenvectors belonging to the large eigenvalues.

In RMA, it can be shown that the eigenvalue λ obtained above and the characteristic time (relaxation time) τR with which the shape of the molecule fluctuates in the corresponding direction are related by λ = exp(−τ/τR). Therefore, the larger the eigenvalue, the larger τR, and coordinates representing slow motions can be extracted.
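A rough sketch of this computation is shown below, in a simplified tICA-like form with t0 = 0 (the actual RMA of Takano et al. uses a nonzero evolution time t0, and the symmetrization of the lagged covariance is an implementation choice assumed here); the trajectory is again a random stand-in, and the lag and frame spacing are placeholder values.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
tau = 10            # lag time in frames (placeholder choice)
frame_dt = 0.002    # time per frame, in arbitrary units (placeholder)

# Stand-in trajectory: frames x coordinates (e.g. flattened Calpha positions)
X = rng.normal(size=(5000, 30))
X = X - X.mean(axis=0)

# Equal-time and time-lagged covariance matrices
C0 = X.T @ X / len(X)
Ct = X[:-tau].T @ X[tau:] / (len(X) - tau)
Ct = 0.5 * (Ct + Ct.T)      # symmetrize the lagged covariance

# Generalized eigenvalue problem  C(tau) z = lambda C(0) z
eigvals, eigvecs = eigh(Ct, C0)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# lambda = exp(-tau / tau_R)  =>  relaxation time tau_R = -tau / log(lambda)
relax_times = -tau * frame_dt / np.log(np.clip(eigvals, 1e-12, 1.0 - 1e-12))
slow_axes = eigvecs[:, :3]   # axes of the slowest motions (Y1, Y2, Y3)
projection = X @ slow_axes
print(relax_times[:3])
```

On real trajectory data, the projections onto the slowest axes are what would be plotted as the Y1, Y2, Y3 components described below.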

At first glance one might think that motion in the direction in which the shape of the molecule fluctuates most is also the slowest, but this is not always the case, and when it is not, PCA and RMA give different results. For example, consider the case where the probability density is as shown on the left in the figure below.

In this case, PCA takes the direction with the largest variance of the data as the first axis, whereas RMA chooses as its first axis the direction with smaller scatter but slower motion (middle and right in the figure).

The figure below reproduces the results of RMA applied by Mitsutake and Takano to a protein called chignolin.

Chignolin is a protein synthesized in 2004 by Honda and co-workers at the National Institute of Advanced Industrial Science and Technology (AIST), and is composed of 10 amino acids. Despite its small size it is known to fold into a definite native structure, and molecular simulations have yielded both native and misfolded structures. The paper analyzes 750 nanoseconds of simulation, using as x the 30-dimensional vector of the positions of the 10 main-chain carbon atoms out of the 138 atoms.

The figure above shows the axis components obtained by RMA (Y1, Y2, and Y3, in order of slowest motion). In panel (a), the two differently folded states (Native and Misfolded) line up almost along the direction in which the Y1 component (the slowest motion) changes. In addition, an intermediate state (Intermediate), which is not apparent in PCA, becomes visible in RMA.

In addition, various methods from statistical science and machine learning, such as canonical correlation analysis, independent component analysis, Bayesian estimation, and hidden Markov models, are also used.

In the next article, I will discuss the dream of a realistic SimCity.
