This is a good introduction to deep learning (Machine Learning Startup Series) - Reading Notes

Reading notes from "This is a good introduction to deep learning" (Machine Learning Startup Series).

1 Introduction

2 Machine Learning and Deep Learning

2.1 Why Deep Learning?
No Free Lunch Theorem
No matter what recognition algorithm you devise, if you average its performance over the myriad of tasks with different recognition targets, every algorithm ends up with the same performance.
No algorithm can be superior to all others over the entire space of problems.
It is therefore better to restrict the range of tasks.
2.2 What is machine learning?
Introduction
What is machine learning?
For a given task T, performance on the task, as measured by P, improves through experience E.
Generalization ability
How well the task can be performed on unknown data that did not appear in E
The aim is for a computer program to acquire experience that generalizes.
2.2.1 Typical tasks
(1) Class classification
Dividing the data into several categories
(2) Regression
Predicting a vector of corresponding real values from data.
2.2.2 Various Data Sets
(1) MNIST
Mixed National Institute of Standards and Technology database
(2) ImageNet
2.3 Introduction to Statistics
Introduction
Machine learning is the process of making a program capable of performing various tasks based on data (experience).
Statistics is a method of scientifically analyzing data.
2.3.1 Sampling and Estimation
Generating statistical data
2.3.2 Point Estimation
Statistical estimation: Considering that data are generated from some unknown distribution, and determining the parameters of this probability distribution
Point estimation: Calculate the most likely values of the parameters from the finite set of data on hand.
Desirable properties of estimators (low bias, low variance, consistency)
Various generating distributions (Gaussian, Bernoulli, etc.)
2.3.3 Maximum Likelihood Estimation
Method for estimating complex distributions
Finding the parameter that maximizes the likelihood function
Maximum Likelihood Method in Practice
Gaussian distribution (likelihood function, maximum)
Bernoulli distribution (likelihood function, maximum value)
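To make the maximum likelihood idea concrete, here is a minimal NumPy sketch (my own illustration, not from the book; the sample sizes, seed, and true parameters are arbitrary) of ML estimation for a Gaussian and a Bernoulli distribution:

```python
import numpy as np

def gaussian_mle(x):
    """Maximum likelihood estimates for a 1-D Gaussian N(mu, sigma^2)."""
    mu = x.mean()                      # maximizes the log likelihood w.r.t. mu
    sigma2 = ((x - mu) ** 2).mean()    # ML variance (biased; divides by N, not N-1)
    return mu, sigma2

def bernoulli_mle(x):
    """Maximum likelihood estimate for a Bernoulli parameter from 0/1 data."""
    return x.mean()                    # the fraction of ones maximizes the likelihood

rng = np.random.default_rng(0)
print(gaussian_mle(rng.normal(1.5, 2.0, size=10_000)))   # close to (1.5, 4.0)
print(bernoulli_mle(rng.binomial(1, 0.3, size=10_000)))  # close to 0.3
```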
2.4 Fundamentals of Machine Learning
2.4.1 Supervised Learning
Target Variables
Qualitative variables
Category
Discrete values
Quantitative variables
2.4.2 Linear Regression by Least Squares Method
Minimizing the Error Function
2.4.3 Stochastic Approach to Linear Regression
Error function expressed as a function of random variables
Expected value
If the mean squared error is the performance, the best predictor of y is the expected value for the generating distribution
Two approaches to predicting data
Generative model
Model the generative mechanism behind the data directly as a joint distribution p(x,y)
After estimating the modeled distribution P(x,y) from the data, calculate the conditional distribution P(y|x) using the marginal distribution P(x)=∑y P(x,y) and Bayes' formula
Using the above results, calculate the expected value.
Find the answer indirectly
Discriminant Model
Model the conditional distribution P(y|x) directly and estimate this value from the data
2.4.4 Least Squares and Maximum Likelihood Methods
Probabilistic Origin of Least Squares Error
Starting point
Data are given by a functional regularity f plus a noise contribution ε
ε is assumed to follow a Gaussian distribution with mean 0 and variance σ²
y itself is then sampled from a Gaussian distribution with the estimate y(x;w) as its mean
To bring this model closer to the distribution that produces the data, we use the maximum likelihood method
The log likelihood is
2.4.5 Overfitting and Generalization
Example of linear regression of data
Increasing the size of the feature vector makes the model fit the data more closely, leading to overfitting
2.4.6 Regularization
Limit the degrees of freedom to avoid overfitting
Weight decay
2.4.7 Class Classification
Binary classification
Multi-class classification
One-hot representation
Representation using the Kronecker Delta Function
2.4.8 Approaches to Classification
Approach for discrete target variables
Functional Model
Binary target variable
Generative model
Probabilistic representation of data
Discriminant model
Represents direct conditional probability
2.4.9 Logistic Regression
Explaining the discriminative model approach to classification
Two-class classification
The probability distributions satisfy P(C1|x)+P(C2|x)=1
Logistic sigmoid function
or sigmoid function
The conditional distribution can be written from the sigmoid function as shown above
u is a quantity called the log odds
Odds Ratio
Ratio of the probability that C1 is true to the probability that it is not true
When the log odds u is greater than 0 (odds ratio greater than 1), the probability of being C1 exceeds the probability of being C2
Many classifications use models that assume that the log odds u is a linear function of the inputs.
w is a vector of parameters of the model
This model is called logistic regression.
Maximum likelihood method is used for fitting to the training data
P(y|x) is a Bernoulli distribution, so it can be written as the above equation in general
Maximum likelihood method is to maximize the likelihood function of the above equation
The error function is defined by the negative log likelihood.
This is called the cross-entropy between the empirical distribution of the data and the model distribution.
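A minimal NumPy sketch of the pieces above (my own illustration, not from the book; the weights w, b and the toy data are arbitrary): the sigmoid of a linear log-odds model and the cross-entropy error that maximum likelihood minimizes.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Negative log likelihood of a Bernoulli model (binary cross-entropy)."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Log odds assumed linear in the inputs: u = w^T x + b
w, b = np.array([1.0, -2.0]), 0.5
X = np.array([[0.2, 0.1], [1.0, 0.3], [-0.5, 0.8]])
y = np.array([1, 1, 0])
p = sigmoid(X @ w + b)           # P(C1 | x) for each sample
print(cross_entropy(y, p))       # the error to be minimized by maximum likelihood
```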
2.4.10 Softmax Regression
Discriminant model for multiclass classification
Generalization of Logistic Regression
Natural extension of the Bernoulli distribution to multiple classes
Categorical distribution is more general
Rewriting the distribution of the previous equation
Extension of log-odds to multiclass
Parameters for describing distributions
From the definition of log-odds, the above equation is obtained.
Summing the right-hand side, we get
Thus
From the above, we can derive the above equation using the softmax function.
Softmax function
Rewriting the equation
If the lth variable is much larger than the others, the above equation is obtained.
The categorical distribution can be written using the softmax function with log odds as an argument.
Discriminant model using linear log odds
Maximum Likelihood Method
Optimize parameters by minimizing the cross-entropy (above equation)
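A corresponding sketch for softmax regression (my own illustration, not from the book; the weight matrix, targets, and class count are arbitrary): the softmax of linear log odds and the cross-entropy to be minimized.

```python
import numpy as np

def softmax(u):
    """Softmax along the last axis, shifted for numerical stability."""
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(t_onehot, p_pred, eps=1e-12):
    """Negative log likelihood of a categorical model."""
    return -np.mean(np.sum(t_onehot * np.log(p_pred + eps), axis=-1))

# Log odds assumed linear: u_k = w_k^T x + b_k for K classes
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                    # 4 samples, 3 features
W, b = rng.normal(size=(3, 5)), np.zeros(5)    # K = 5 classes
T = np.eye(5)[[0, 2, 4, 1]]                    # one-hot targets
P = softmax(X @ W + b)
print(categorical_cross_entropy(T, P))
```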
2.5 Advances in Representation Learning and Deep Learning
2.5.1 Representation Learning
Designing features that will successfully extract the essence of the data.
Cannot be applied naively when using complex data
Useful features are acquired automatically through learning
Data is used for two purposes: feature extraction and regression
2.5.2 Emergence of Deep Learning
Year 2006
G. Hinton
Overcame difficulties in deepening Boltzmann machines
Then applied to simpler neural nets
2012
Hinton et al.
Extraction of cat images

3 Neural Networks

3.1 Networks of Neurons
Neurons and synapses in the brain
Action potentials and pulses
3.2 Formal Neurons
Formal Neurons and Simple Neuron Circuits
The Final Equation
The Weight Equation
3.3 Perceptron
3.3.1 Perceptron with Formal Neurons
Example of a perceptron consisting of two layers
3.3.2 Perceptron and Minsky
Minsky
Proved that a simple perceptron with only input and output layers cannot solve the class of linearly inseparable problems in a finite number of computations
A perceptron with two or more layers of weights can solve linearly inseparable problems
3.4 Structure of a Forward Propagation Neural Network
3.4.1 Units and Forward Propagation Neural Networks
Unit
An example of a forward propagating neural network with a combination of units
Sigmoid Function
3.4.2 Input Layer
The input layer consists of five units
The input layer has exactly as many units as the dimension of the input vector
Each input unit outputs each component of the input vector xT=(x1,x2,x3,x4,x5) to the whole neural network
3.4.3 Intermediate Layer
Consider the jth unit of layer l
It accepts as input zi(l-1) from each unit of layer l-1
The activity is given by the above equation
wji(l) is the weight of the connection between unit i in layer l-1 and unit j in layer l
The output of the unit is the activation function f(l) applied to the activity uj(l) plus the unit-specific bias bj(l)
The activation function f(l) could be defined per unit, but here it is taken to be common to each layer.
Matrix representation
Weight matrix
Activation function
Equation with the bias absorbed into the weights
3.4.4 Output Layer
The output layer receives the input from layer L-1 and outputs y=z(L)
3.4.5 Function
Final Model
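A minimal sketch of forward propagation through such a network (my own illustration, not from the book; it assumes sigmoid hidden units and an identity output unit, and the layer sizes are arbitrary):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, params, out_activation=lambda u: u):
    """Forward propagation: z(l) = f(W(l) z(l-1) + b(l)), layer by layer."""
    z = x
    for (W, b) in params[:-1]:
        z = sigmoid(W @ z + b)        # hidden layers use a sigmoid here
    W, b = params[-1]
    return out_activation(W @ z + b)  # output-layer activation depends on the task

rng = np.random.default_rng(0)
layer_sizes = [5, 4, 3, 1]            # input layer of 5 units, two hidden layers, 1 output
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
x = rng.normal(size=5)
print(forward(x, params))             # identity output unit, i.e. regression
```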
3.5 Machine Learning with Neural Networks
Introduction.
A forward propagating neural net receives an input x and produces the above output
The output value changes by changing the values of the weights W(l) and bias b(l).
These are collectively written as w.
Fitting w to data
Adjust the weights and bias so that the discrepancy between the output value y(xn;w) of the neural net when training data xn is input and the target value yn becomes as small as possible.
The choice of the error function E(w) to measure the deviation depends on the task to be considered and the structure of the output layer.
Minimization of the error function corresponds to learning.
Separate the layers up to L-1 from the final layer.
Up to layer L-1, each layer plays the role of constructing a sequentially higher-order representation from the input x
The output from layer L-1 is a deep representation h of the input x
The output layer is responsible for performing ordinary machine learning, such as regression, using this representation.
3.5.1 Regression
Perform a linear regression on the representation h
Use a trivial activation function (just the identity mapping)
Linear units
Some nontrivial activation function
Set f(L) according to the problem
Fit the output with data
Minimize the least squares error so that the prediction of the model using y(x;w) is as close as possible to the actual data yn
3.5.2 Binary Classification
Binary classification classifies the data into two classes.
Label is 0 or 1
In a neural net, the output layer should give a logistic regression of the representation h=z(L-1)
The target value yn of the training data D={xn,yn} is 0 or 1
Logistic regression estimates the probability y=P(y=1|x) that y is 1, not the binary variable itself.
To do logistic regression, consider a neural net with one output layer unit whose output estimates p(y=1|x)
Sigmoid Unit
The activation function of a sigmoid unit is the sigmoid function
The total input into the unit is the log odds
How do we train the above neural network?
Use the maximum likelihood method
Estimate Bernoulli distribution
Just use the negative log likelihood as the error function
3.5.3 Multi-class Classification
If there are K classes
Create an output layer for softmax regression of the expression h=z(L-1).
The kth unit of the output layer should have the output of the above equation.
The activation function of the output layer is a softmax function.
Put the data you want to analyze into the trained neural network, and set the k that maximizes the output value yk as the class to which it belongs.
Vector representation of the target variable
If we also give the observed data for this t
Cross-entropy of the neural network's softmax output
Error function
3.6 Activation Function
Introduction
The activation function of the middle layer
There is no general criterion that uniquely determines the activation function of the hidden layer.
3.6.1 Sigmoid Function and its Companions
Differentiable functions have been used as activation functions
Sigmoid function
A smooth version of the step function
Properties of functions
The derivative can be obtained from the value of σ(u) without further calculation: σ'(u)=σ(u)(1−σ(u))
Hyperbolic tangent function
Can be used when negative values are also desired
A piecewise linear approximation of this function is also used.
3.6.2 Rectified Linear Functions
The step function, sigmoid function, and hyperbolic tangent function are not used so much now.
How to make learning proceed smoothly
It is known to be better to use a rectified linear function (ReLU) as the activation function to ensure smooth learning.
Units with this activation function
ReLU and PReLU
The softplus function is a smoothed version of this
Not used very often
When the total input is negative, ReLU produces no output
Set the initial value of the bias to a small positive value (so that units are initially active)
Generalizations of the rectified linear function
Leaky ReLU / Parametric ReLU (PReLU)
ReLU is sufficient in many cases
3.6.3 Maxout
In PReLU, the appropriate slope C for the task is determined through learning
Going further, introduce piecewise-linear activation functions that can learn a general family of functions
Commonly used in recent years
Role of maxout
Example: k=3
A 3-segment piecewise-linear approximation of a convex function
By increasing k, we can approximate arbitrary convex activation functions as finely as we like.
The shape of the piecewise linear function is determined by the weights and biases
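A minimal sketch of a maxout unit (my own illustration, not from the book; k=3 pieces and the input dimension are arbitrary), showing the max over k affine pieces:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the max of k affine pieces, a learnable piecewise-linear activation.

    W has shape (k, d), b has shape (k,); with k = 3 this is a 3-segment
    linear approximation of a convex activation function.
    """
    return np.max(W @ x + b, axis=0)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)   # k = 3 pieces, 4 inputs
print(maxout(rng.normal(size=4), W, b))
```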
3.7 Why is it important to go deeper?
When you want to increase the expressive power of the network, increasing the number of layers is more effective than increasing the number of units per layer.
Learning is more likely to be successful with multiple layers
Advanced representation learning can be achieved

4 Learning by gradient descent method

4.1 Gradient Descent Method
Introduction
Learning a neural network is formulated as minimizing a loss function
Approximate solution is obtained numerically on a computer
Newton-Raphson method
The differential coefficients vanish in all directions at the minimum value
Approximated by Taylor expansion
Taylor expansion around a point w0=((w0)j)
Ignoring the higher-order terms and writing the result in matrix form
H0 is the Hessian matrix
Setting the gradient to zero and assuming the Hessian matrix is regular (invertible), solve for w
The above equation is used sequentially in the order of t to determine w(t).
Evaluation of the Hessian inverse is expensive, so it is rarely used in deep learning.
4.1.1 Gradient Descent Method
A method that uses only information about the first-order derivative of the loss function
A method to find the minimum point of the error function E(w)
The method is like rolling a ball down from a high point on the bowl-shaped graph of the error function and waiting until the ball reaches the lowest point.
Rolling the ball means moving the position in the opposite direction of the gradient of the graph.
The rule for moving a ball with position w(t) at time t in the opposite direction of the gradient is as above
η: parameter that determines the distance ∆w(t) traveled in one step
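A minimal sketch of the update rule (my own illustration, not from the book; the toy error function and learning rate are arbitrary):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Roll the ball downhill: w(t+1) = w(t) - eta * grad E(w(t))."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy error function E(w) = (w1 - 3)^2 + 10 * (w2 + 1)^2
grad = lambda w: np.array([2 * (w[0] - 3), 20 * (w[1] + 1)])
print(gradient_descent(grad, [0.0, 0.0], eta=0.05))   # approaches (3, -1)
```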
4.1.2 The local minimum problem
If the error function is a convex (bowl-shaped) function, as in simple situations, any local minimum coincides with the global minimum.
For complex non-convex functions, there is a huge number of local minima in addition to the global minimum.
Neural networks have high symmetry, so minima are duplicated.
Permuting units within a layer does not change the result.
A duplicate minimum exists for every such permutation of units.
In deep learning, it is sufficient to find good minima of the error function without finding the true minimum.
4.1.3 Stochastic Gradient Descent Method
If optimization gets trapped at a critical point where the error value is still large, the result is useless.
In order to avoid being trapped in a critical point as much as possible, a random element is introduced to escape from the trapped position.
Mechanism for adding random effects
Preconditions
Training data D={(xn,yn)}, n=1,…,N, is given.
The error function can be expressed as the sum of the errors calculated for each training sample element (xn,yn)
Example: If the mean square error is
Example: Cross entropy for K-class classification
Normal gradient descent method uses all data
Use only some of the training samples
Prepare a subset B(t) of training samples to be used at time t
Use the error function (above) averaged over the mini-batch at time t
n∈B(t) is the label of the training samples in the mini-batch
|B(t)| is the total number of sample elements in the mini-batch
Using this, we update the parameters as in batch learning.
The case where there is only one training sample in each mini-batch at each time is called online learning or stochastic gradient descent method (SGD).
The mini-batch method does not get stuck in undesirable critical points and has the advantage of parallel computing.
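A minimal sketch of mini-batch SGD (my own illustration, not from the book; linear regression with mean squared error, and the batch size, learning rate, and toy data are arbitrary), including reshuffling the data into mini-batches each epoch:

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.1, batch_size=8, epochs=50, seed=0):
    """Mini-batch SGD on the mean squared error of a linear model y = Xw."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                 # new random mini-batches each epoch
        for batch in np.array_split(order, len(X) // batch_size):
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb) / len(batch)    # gradient averaged over the mini-batch
            w -= eta * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=256)
print(sgd_linear_regression(X, y))   # close to [2, -1, 0.5]
```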
4.1.4 How to make a mini-batch
In mini-batch learning, the learning time is divided into units called epochs.
At the start of each epoch, the data is randomly divided into mini-batches of appropriate size.
4.1.5 Scheduling Convergence and Learning Rate
In previous discussions, we have assumed that the learning rate η is constant throughout time.
If it remains constant forever, it will be difficult to reach the point of convergence.
The mini-batch approach gives an approximate estimate of the true error function E(w) as the expected value on the mini-batch
The random effect does not go away forever.
As we approach the minima, we need to decrease the learning rate to reduce the statistical fluctuation of the gradient.
Introducing a time-dependent learning rate
Another way to take the learning rate
Another case
4.2 Improved Gradient Descent Method
4.2.1 Challenges of the Gradient Descent Method
Besides the problem of local optimal solutions, there are other problems such as oscillation, plateaus, and cliffs.
As a countermeasure against oscillation, we can add a contribution like friction or inertia in physics to calm the motion.
As a countermeasure against cliffs, gradient clipping caps the size of the parameter update when it exceeds a threshold.
Another problem is the existence of the saddle point.
The critical point where the slope becomes zero
Slope increases in one direction and decreases in another direction
4.2.2 Momentum method
A method to suppress the oscillations of the gradient method and improve the convergence to the minimum value
To prevent oscillation, the slope of the previous parameter update is added to the next slope.
Equation
The initial value is 0, and μ typically takes a value between 0.5 and 0.99, i.e., close to 1.
The momentum method not only prevents oscillation, but also accelerates the parameter update of the gradient method on a normal slope.
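A minimal sketch of the momentum update (my own illustration, not from the book; μ=0.9 and the toy gradient are arbitrary):

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.1, mu=0.9):
    """Momentum update: keep a fraction mu of the previous step and add the new gradient step."""
    v = mu * v - eta * grad(w)
    return w + v, v

grad = lambda w: np.array([2 * (w[0] - 3), 20 * (w[1] + 1)])  # same toy error as above
w, v = np.zeros(2), np.zeros(2)
for _ in range(100):
    w, v = momentum_step(w, v, grad)
print(w)   # oscillation in the steep direction is damped compared with plain gradient descent
```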
4.2.3 Nesterov’s accelerated gradient method
Modified version of the momentum method
Different location to evaluate the gradient value
Nesterov’s accelerated gradient method
4.2.4 AdaGrad
Learning can be made more efficient if multiple learning rates can be introduced for each parameter direction.
Reduce the learning rate for directions that have already had large gradients, and increase the learning rate for directions with small gradients.
If the gradient is large at the beginning of the learning process, the effective learning rate quickly becomes small.
4.2.5 RMSprop
Method introduced by Hinton
Exponential decay factor cancels out information about gradients that are sufficiently far in the past
AdaGrad
RMSprop
Expression for RMSprop
4.2.6 AdaDelta
RMSprop has the problem of being sensitive to the overall learning rate η, due to a dimension mismatch.
Use Newton-Raphson approximation to unify the dimensions
4.2.7 Adam
Replace not only the RMS of the gradient in the denominator of the above equation but also the gradient itself with an exponential moving average
That is, apply momentum with exponential decay to the gradient term as well
Adam’s gradient descent method
Recommended values for parameters
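A minimal sketch of the Adam update (my own illustration, not from the book; it uses the commonly recommended default parameters and the same toy gradient as above):

```python
import numpy as np

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam: exponential moving averages of the gradient (m) and its square (v),
    with bias correction; the commonly recommended defaults are used."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

grad = lambda w: np.array([2 * (w[0] - 3), 20 * (w[1] + 1)])
print(adam(grad, [0.0, 0.0]))
```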
4.2.8 Natural gradient method
If the space of weight parameters is not an ordinary Euclidean space with orthonormal coordinates
The differential distance on a curved Riemannian manifold is measured by a metric tensor
Gradient descent method using this as gradient is the natural gradient method
Natural distance in weight space
4.3 Initial values of the weight parameter
Introduction
The initial values of the weight parameters greatly affect the success of learning by gradient descent.
4.3.1 LeCun Initialization
As shown in the above equation, the initial values of the weights should be sampled from a uniform or Gaussian distribution with variance 1/d.
4.3.2 Glorot Initialization
Proposed from the analysis of neural networks with only linear units.
4.3.3 He Initialization
Derived from the analysis of neural networks with ReLU
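A minimal sketch contrasting the three initialization scales (my own illustration, not from the book; Gaussian sampling and the fan-in/fan-out sizes are arbitrary):

```python
import numpy as np

def init_weights(fan_in, fan_out, scheme="he", rng=np.random.default_rng(0)):
    """Gaussian weight initialization with the variance recommended by each scheme."""
    if scheme == "lecun":        # variance 1 / fan_in
        std = np.sqrt(1.0 / fan_in)
    elif scheme == "glorot":     # variance 2 / (fan_in + fan_out), from linear-unit analysis
        std = np.sqrt(2.0 / (fan_in + fan_out))
    elif scheme == "he":         # variance 2 / fan_in, from the ReLU analysis
        std = np.sqrt(2.0 / fan_in)
    else:
        raise ValueError(scheme)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

print(init_weights(256, 128, "he").std())   # roughly sqrt(2/256)
```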
4.4 Preprocessing of Training Samples
Introduction.
Preprocessing of training data to accelerate learning and bring it closer to generalization
Remove task-irrelevant information and statistical bias from the training data.
4.4.1 Normalize the data
Normalize the mean and variance of the training sample.
4.4.2 Whitening the data
Normalize the covariance matrix
4.4.3 Local Contrast Normalization of Image Data
Align the inherent brightness and contrast strength of each image

5 Regularization for Deep Learning

Introduction
Neural nets used in deep learning contain significantly more weight parameters than other machine learning methods.
Overfitting occurs easily if the neural network is trained naively.
Methods to prevent overfitting have been developed.
Regularization
5.1 Generalization Performance and Regularization
5.1.1 Generalization error and overfitting
The goal of machine learning is to make the predictions derived by the model as close as possible to the behavior of the data generating distribution.
Define the generalization error as the error function taken as an expectation over the data generating distribution
Example: When using the squared error
The generalization error becomes the above equation
Since the information of the data generating distribution itself cannot be used, we use the training data at hand as a statistical substitute
The empirical distribution represents the frequency of occurrence of each data point in the training data D={(xn,yn)}, n=1,…,N
Sample approximation of data generating distribution
As a substitute for the data generating distribution, an approximate estimate of the generalization error using the empirical distribution at hand.
Problems caused by mismatch between generalization error and training error
How do we know if learning is going well?
Consider the test error as a guide for generalization.
Divide the data in hand into two parts: D for training and Dtest for testing.
At each stage of training, calculate the test error measured on the test data.
Plot the learning curve by calculating the training error and test error at each epoch during the learning process.
5.1.2 Regularization
Continuing to minimize the training error causes overfitting
The large degrees of freedom in the model cause it to learn the statistical variability in the data.
Regularization means modifying the learning algorithm so that the generalization error, not just the training error, is reduced as much as possible.
The simplest approach is to add a penalty term R(w) to the error function.
5.2 Weight decay
5.2.1 Effect of Weight Decay
The simplest regularization is weight decay
Make the weight parameter as small as possible
A concrete example
Approximate the error function by a second-order Taylor expansion
The minimum value is wi=wi0
The actual point where the derivative coefficient vanishes
Coordinates of the minimum value after regularization
For parameter directions where the component of the Hessian matrix is very small compared to α
Regularization using the L1 norm
When it is used for regression, it is called LASSO regression.
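A minimal sketch of adding an L2 (weight decay) or L1 (LASSO-style) penalty to an error value (my own illustration, not from the book; the regularization constant α and the toy weights are arbitrary):

```python
import numpy as np

def penalized_error(E, w, alpha=0.01, norm="l2"):
    """Error function plus a penalty term: weight decay (L2) or sparsity-inducing L1."""
    if norm == "l2":
        return E + 0.5 * alpha * np.sum(w ** 2)
    return E + alpha * np.sum(np.abs(w))

w = np.array([1.5, -0.2, 0.0, 3.0])
print(penalized_error(0.8, w, norm="l2"), penalized_error(0.8, w, norm="l1"))
```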
5.2.2 Sparse Regularization and the Bad Condition Problem
In machine learning, there should not be more degrees of freedom in the model parameters than in the data.
In the real case, there is not enough information to find the answer
When there is only one data point (x1,x2,y)=(3,2,5) for finding w1, w2 in w1x1+w2x2=y
We have to solve the above equation.
There are countless solutions
Replace it with the minimization of the error function with the addition of the L1 regularization term
Calculate the gradient of the error function
Stationary points are restricted to the coordinate axes
The minimum of the regularized error is restricted to (w1,w2)=(14/9,0)
L1 regularization narrows down the infinitely many real-valued solutions to a single one.
The solution after regularization is not the real solution to the original problem.
5.3 Early Termination
5.3.1 What is Early Termination?
Regularization from a different perspective
A sign of overfitting is the phenomenon of increasing test error
Weight parameters updated in past epochs are stored on the computer
Continue learning and stop learning when the test error tends to increase for a while
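A minimal sketch of early termination (my own illustration, not from the book; the per-epoch update and test-error callables and the patience value are placeholders):

```python
import numpy as np

def train_with_early_stopping(update, test_error, w0, patience=5, max_epochs=1000):
    """Stop when the test error has not improved for `patience` epochs and
    return the stored best parameters."""
    w = np.asarray(w0, dtype=float)
    best_w, best_err, wait = w.copy(), np.inf, 0
    for _ in range(max_epochs):
        w = update(w)                  # one epoch of training on the training set
        err = test_error(w)            # error measured on held-out test data
        if err < best_err:
            best_w, best_err, wait = w.copy(), err, 0
        else:
            wait += 1
            if wait >= patience:       # test error kept rising: a sign of overfitting
                break
    return best_w, best_err

# Toy usage: a hypothetical "test error" that starts rising once w passes 2.0
update = lambda w: w + 0.1
test_error = lambda w: float((w - 2.0) ** 2)
print(train_with_early_stopping(update, test_error, np.array(0.0)))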
5.3.2 Relationship between Early Termination and Weight Decay
Early termination is closely related to regularization by penalty term
Discussion on weight decay
The error function can be reduced to the second order of the Taylor expansion around the minimum value.
The update equation in the gradient descent method is the above equation
If we write this equation in terms of components, we get the above equation
Early termination at time T gives the above weight parameter as the solution
The optimal parameter value when there is weight decay is
Assuming both are equal
Early termination can be regarded as weight decay, with the regularization parameter given by the above equation.
5.4 Weight Sharing
Regularization can be used when some assumptions are made about the values that the weight parameters should take.
Weight sharing does not make all the weights of the neural network independent, but imposes a fixed relationship between them.
5.5 Data Expansion and Noise Addition
5.5.1 Data Expansion and Generalization
In supervised learning, it is necessary to prepare training samples with annotations.
Approach to create pseudo data based on samples you have
In image recognition, adding rotation or translation to an image does not compromise the essence of the captured object.
To increase the robustness, add the modified image to the training data.
Noise injection is a method of data expansion.
Noise can also be added to weights
Each time the parameters are updated in the mini-batch process, a small randomly generated noise is injected into the weights of the network.
5.5.2 Noise Injection and Penalty Terms
Adding noise to the training data can also be viewed as introducing a penalty term
Sample the noise ε i.i.d. from a zero-mean Gaussian distribution
For this noise-added data x+ε, the above equation is the output of the model
The change in the squared error in this case is the above equation
By taking the expected value in the distribution P(x,y)N(ε), the error function changes to the above equation by adding noise.
Finally, we obtain the above equation
The noise addition has almost the same effect as the penalty term ∇yT∇y
Regularization equivalent to that with noise added
5.6 Bagging
One of the methods to improve the stability of post-learning behavior in machine learning and to increase the prediction performance
Sampling from the set of training samples, allowing each element to be drawn more than once (sampling with replacement; bootstrap sampling)
Multiple independent models are trained using each bootstrap sample.
The final prediction is obtained based on the predictions of all the models.
In the case of classification, majority vote of all models
In the case of regression, the average of the outputs (model averaging)
In this approach, fluctuations (non-systematic errors) cancel out when many models, each blurred by its own inherent error, are added together
5.7 Dropout
Introduction
Model averaging can be applied to neural nets
Computational cost is prohibitive and impractical
A method to approximate the ensemble method without directly using the model average
One of the key techniques in deep learning
5.7.1 Learning in Dropout
First, consider a partial network ensemble
A network formed by randomly removing units from the input and middle layers of a neural network to be trained.
To remove a unit, the output zj of that unit is externally multiplied by 0.
Set of all possible subnetworks
To create one subnetwork at random,
multiply the output of each unit of the neural network by a binary variable μj∈{0,1} randomly sampled from a Bernoulli distribution
Call the set of these variables μ={μj} a mask.
Define the Hadamard product for two arbitrary d-dimensional vectors v=(vi), w=(wi) by the above equation
Hadamard product
Dropout is an ensemble method that calculates the model average for all subnetworks
Computing all of them increases computational cost
Use approximation method
Only one subnetwork needs to be trained per mini-batch
Note that dropout is a different method than bagging.
5.7.2 Inference in Dropout
In ensemble methods, after training, the majority vote or arithmetic mean of the inference results for each model is used for prediction
You need to run a large number of models for the next inference.
It has been shown that it is better to use the geometric mean instead of the arithmetic mean.
Has probabilistic meaning
The geometric mean Pens. can be approximated by only one neural net
Inference in Dropout
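A minimal sketch of dropout (my own illustration, not from the book; the keep probability is arbitrary): a Bernoulli mask at training time, and scaling the outputs by the keep probability at inference time as an approximation to the ensemble average.

```python
import numpy as np

def dropout_train(z, keep_prob, rng):
    """Training time: sample a Bernoulli mask and zero out dropped units."""
    mask = rng.binomial(1, keep_prob, size=z.shape)
    return z * mask                      # Hadamard product with the mask

def dropout_infer(z, keep_prob):
    """Inference time: use the full network but scale outputs by the keep probability,
    approximating the geometric-mean ensemble over all subnetworks."""
    return z * keep_prob

rng = np.random.default_rng(0)
z = np.ones(10)
print(dropout_train(z, 0.5, rng))
print(dropout_infer(z, 0.5))
```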
5.7.3 Theoretical Justification for Dropout
5.8 Sparsity of Deep Representations
Regularize the deep representation induced by the neural network, not the weights
Deep learning is expected to learn a “good” representation of the input data
Expose the information behind the data in a form that is easy to analyze for the task at hand
A good representation of the important information contained in the data, stripped of any noise or information that is unnecessary or distracting for the task being performed.
One of the properties to be realized
Sparsity
A representation in which the vector components have few non-zero values and the information is represented by only a few signals.
The essence of the data can be captured in a compact form even in high dimensional space.
A means to achieve sparsity
Impose constraints on the representation h, the output of the intermediate layer
5.9 Batch Normalization
Introduction
A powerful regularization technique that has been used in place of dropout
5.9.1 Internal Covariate Shift
A situation in which there is a discrepancy between the distribution generated by sampling the training data and the distribution of the data used for inference.
In deep learning, covariate shift occurs due to the multilayer structure.
During training, training samples are sequentially sampled from the data generating distribution and input to the network.
In response to the input, each layer produces an output
The output of layer l, z(l), is also considered to be sampled from a certain distribution Pl(z(l)).
What is learning?
Pl(z(l)) of the middle layer is determined by the data generating distribution.
Adjust the weight parameters so that each layer can fit it.
If we try to achieve this goal simultaneously in each layer,
by the time layer l has learned to approach Pl(z(l)), the parameters of the layers before it have also been updated.
The pattern of the output z(l) of layer l therefore changes as well.
Pl(z(l)) has a different shape by the time it has been learned.
Internal covariate shift
5.9.2 Batch Normalization
To prevent internal covariate shift, the distribution followed by the output values of each layer must be adjusted so that it stays constant during training.
Normalizing the middle-layer outputs forces them to always follow a distribution with mean 0 and variance 1.
Batch normalization is used together with mini-batch learning.
To adjust the output distribution of the intermediate layer, we need estimates of the mean and variance to use for normalization.
The mean and variance are estimated over the mini-batch, since the same forward propagation runs in parallel over its samples in mini-batch learning.
Normalize the output with this mean and variance (batch normalization)
Linearly transform the normalized output with the learnable parameters γ and β to get the final result of the batch normalization transformation.
It has been reported that multilayer networks can be successfully trained without dropout if batch normalization is used.
Batch normalization has been successfully applied to architectures such as ResNet and GAN.
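A minimal sketch of the batch normalization transform (my own illustration, not from the book; γ and β are set to their initial values 1 and 0, and the mini-batch is random toy data):

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Batch normalization of a mini-batch of layer outputs Z (shape: batch x units):
    normalize each unit to mean 0 and variance 1 over the mini-batch, then apply the
    learnable linear transform gamma * z_hat + beta."""
    mu = Z.mean(axis=0)
    var = Z.var(axis=0)
    Z_hat = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_hat + beta

rng = np.random.default_rng(0)
Z = rng.normal(3.0, 2.0, size=(32, 4))          # mini-batch of 32 samples, 4 units
out = batch_norm(Z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per unit
```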

6 Error back propagation method

Introduction
In training a neural network, we move the parameters in the direction of the gradient, as if we were moving down the slope of the graph of the error function.
Calculating the gradient is non-trivial in neural nets
6.1 The Perceptron Learning Rule and Delta Rule
Background of the Error Back Propagation Method
The Delta Rule
6.2 Error Back Propagation
Introduction
An extension of the delta rule to multilayer neural nets
Published by Rumelhart et al. in 1986
Rediscovered
6.2.1 Complexity of parameter differentiation and a toy model
On learning by gradient descent
What is gradient?
Differential coefficient of the error function
The error function E is a measure of the misalignment between the output and the training signal
Depends on the parameters of the neural network through the output y(x;w)
Example: Squared error function
Using the chain rule of differentiation, it comes down to calculating the derivative of y(x;w) with respect to the parameter wij, as shown in the above equation
Each layer is one neural net
If you try to do partial differentiation on the variables in the red part, it will be very difficult.
It becomes a composition of many mappings
Understanding the error back propagation method
Focus on the relation U(l)=W(l)Z(l-1)
Variation in w(l) directly affects the value of U(l).
We can rewrite the above equation from the chain rule.
If we know the quantity in the above equation, we can determine the gradient
Since U(l+1)=W(l+1)f(U(l))
This delta is related to the next layer, δ(l+1)
The gradient of each layer can be determined by solving the equation from the output side to the input side.
The error function is expressed as the sum of the errors En(w) calculated for one sample n.
Focus on the error in the output given by a training sample n
6.2.2 Gradient Calculation of the Error Function
Ideas to arrive at the inverse error propagation method
Don't compute the derivative of a huge composite function all at once
Split the difficulty
Rewrite the gradient in terms of how much a variation of the weight wji(l) of interest affects its surroundings through the local structure of the network
That is, rewrite the gradient in terms of
the impact of variations of the connection weight wji(l) on the inputs of each layer
A change in the weight wji(l) first changes the total input of unit j in layer l
This variation of the total input uj(l) propagates sequentially through the network, changing the output and finally giving rise to a variation of the error function.
Computational Process of Inverse Error Propagation
Expressing these in terms of differential operations, the error function En(uj(l)(wij(l))) is indirectly dependent on the weights wji(l) through the total input uj(l), so it is the same as above.
The first factor on the right-hand side is called the delta.
It is a measure of how much the signal from unit j ultimately affects the error.
The second derivative factor on the right-hand side is just the output value zi(l-1)
The weight dependence of the total input (above) is linear
It is automatically determined by inputting a training sample xn into the neural network and propagating the signal.
Once the value of the delta is determined, the gradient can be determined.
Using the local structure of the network, divide the gradient into delta and unit output
Next, the algorithm to find the delta
Find an asymptotic expression that the delta satisfies
The delta is the error function differentiated by the total input to unit j, uj(l)
A change in uj(l) causes the error function to vary through a change in the total input uk(l+1) of the next layer of units
The differential coefficient satisfies the properties of the above equation.
The final result is summarized in the above equation
The quantity delta is propagated backwards from layer l+1 to layer l according to the relation above.
Information on the “initial value” of delta calculated in the output layer is propagated sequentially toward the input layer.
First, the input is propagated forward to the output layer
Calculate the error at the output layer
Based on the result, the delta is propagated backward.
Mechanism of the gradient calculation by the error back propagation method
Calculating Gradients with Deltas
In the various neural network libraries, essentially the same algorithm as the error back propagation method is implemented via automatic differentiation.
6.2.3 Initial Values for Back Propagation Calculation
In the back propagation method, the initial value for solving the asymptotic equation is the delta in the output layer.
Calculation method
Let f be the activation function of the output layer.
The final output is y=f(uj(L))
The delta of unit j in the output layer is En(yj(uj(L))) differentiated by uj(L)
In the regression problem, the error function is En=(y(x;w)-y)2/2
The specific form of the delta in this case is
If the error function is cross-entropy, then
The derivative of the softmax function is written using the Kronecker delta
6.2.4 Gradient Calculation
The purpose of gradient computation is to train a neural network by the gradient descent method.
It doesn’t stop with error back propagation
The gradient ∂En/∂wji(l) computed by each training sample must be averaged over the mini-batch D used for training to obtain the actual parameter updates.
Then, using the ∆wji(t,l) calculated with the parameter values at each time t, update the weight parameters as described above.
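A minimal sketch of error back propagation for a one-hidden-layer regression net (my own illustration, not from the book; sigmoid hidden units, a linear output unit, squared error, and arbitrary toy data):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(x, y, W1, b1, W2, b2, eta=0.1):
    """One gradient step for a 1-hidden-layer regression net with squared error.
    Deltas are computed at the output layer and propagated backwards."""
    # forward propagation
    u1 = W1 @ x + b1
    z1 = sigmoid(u1)
    y_hat = W2 @ z1 + b2                        # linear output unit (regression)
    # backward propagation of the deltas
    delta2 = y_hat - y                          # delta of the output layer
    delta1 = (W2.T @ delta2) * z1 * (1 - z1)    # delta of the hidden layer
    # gradients: delta times the output of the previous layer
    W2 -= eta * np.outer(delta2, z1); b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x);  b1 -= eta * delta1
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
for _ in range(500):
    W1, b1, W2, b2 = backprop_step(np.array([0.5, -1.0]), np.array([2.0]), W1, b1, W2, b2)
print(W2 @ sigmoid(W1 @ np.array([0.5, -1.0]) + b1) + b2)   # approaches the target 2.0
```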
6.2.5 Meaning of delta
The meaning of the delta as it appears in the error back propagation method
The delta in the output layer is given by the above equation
The difference between the final output y of the input training sample x and the target signal y of the training sample
The gradient descent method modifies the weight parameter wpq(L) of the output layer and the layer before it to cancel out this error as much as possible.
δ(L) is a measure of how much the weights leading to the output node contribute to the error.
The case of the intermediate layer
It is a measure of the local error associated with a node in layer l.
6.3 Why is Error Back Propagation Fast?
6.4 The Vanishing Gradient Problem, Parameter Explosion, and How to Deal with It
Introduction.
Other Challenges in Deep Learning
Vanishing gradient problem
In forward propagation, the signal neither explodes nor vanishes
In back propagation, which is a linear transformation, the signal rapidly vanishes or explodes
Exploding gradient problem
Because of these problems, neural networks went through their second winter period.
Now that these problems have been solved, the current prosperity is achieved.
6.4.1 Pre-training
One idea to avoid vanishing gradients
In pre-training, instead of training the network precisely from scratch, good initial values of the weight parameters are computed first and then fine-tuned to the training data for efficient training.
This gives a good starting point for the parameters in gradient descent.
It is now replaced by other methods.
6.4.2 ReLU Function
Currently, a generic method to solve the vanishing gradient problem.
A method to devise an activation function
ReLU function can be used to solve the vanishing gradient problem

7 Autoencoders

Introduction
Application of Neural Networks to Unlabeled Data (Unsupervised Learning)
7.1 Data Compression and Principal Component Analysis
A common method for reducing the dimensionality of data
The data are biased in the space they occupy,
with little variance along certain axes or directions
Principal component analysis is a general method for this kind of analysis
Assumptions
Data points are represented as vectors
The orthonormal basis is represented by the above equation.
Project the general data point xn to the subspace you want to find
C0 corresponds to the mean value of the data
The center point of the distribution of data points
The squared error function is the above equation
By solving the minimum value problem (above equation), we can determine the direction in which the data is actually spread.
Solution of the optimization problem
Using the method of Lagrange multipliers
Define the Lagrangian function in the above equation, with the constraint imposed using an undetermined multiplier λh
Maximize it with respect to eh and λh
The Lagrange coefficients λh are the eigenvalues of the covariance matrix in the corresponding direction.
It tells us the magnitude of the variance
Only the directions with large variance are regarded as the essence of the data, and the other directions are discarded.
7.2 Autoencoders
Introduction
In principal component analysis, we looked for a flat subspace of the input space (where the input vector x lives) around which the data points are actually scattered.
In the real case, the data points are scattered around a subspace with a complex geometric structure.
Approaches to dimensionality reduction of data onto a manifold
There are several specific approaches
Methods using deep representations
Converting input data into a representation
A neural net trained to produce a good representation can be used for this
Find the manifold of the distribution of data points in that representation
7.2.1 Hourglass Neural Networks
Let’s start with a twist on the idea of compressing data with neural nets.
A neural net needs a “teacher” to tell it that the target output is yn
Can we train a neural net unsupervised from just the input?
Create training data {(xn,xn)} with identical input and output from only input data
Use this to train a neural net whose output reproduces another input value
What is the point of creating something like the above?
Consider a two-layer hourglass-type neural network with a symmetric structure
Input x is converted to y in the middle layer, which is then converted to final output x
x→y
Encoder
y→x
Decoder
Autoencoder (AE)
7.2.2 Learning by reconstruction error
Train the autoencoder using the training data {(xn,xn)}.
Pay attention to the design of the output layer
The domain of the activation function f in the output layer must be aligned according to the range of possible values of xj in the components of x.
(1) Real-valued input
When x components are real-valued, use a linear unit for the output layer.
The activation function is an identity operation.
The error function is the mean squared error
(2) Binary input
(2) When the components of the binary input x=(x1,…,xd)T take binary values of 0 or 1, use sigmoid units for the output layer.
Activation function is a sigmoid function
There can be multiple sigmoid units
If there is only one, use logistic regression
There are as many sigmoid units as components of x, each of which outputs a prediction for xi (above equation)
Learning is done by a log-likelihood function on the product of a number of Bernoulli distributions associated with each sigmoid unit (above)
Find the sum of the cross-entropies as in the above equation
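A minimal sketch of an autoencoder with sigmoid output units and the cross-entropy reconstruction error (my own illustration, not from the book; the input and code dimensions are arbitrary and the weights are untrained):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def autoencoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encoder x -> y, decoder y -> x_hat; sigmoid output units for binary inputs."""
    y = sigmoid(W_enc @ x + b_enc)          # code (middle-layer representation)
    x_hat = sigmoid(W_dec @ y + b_dec)      # reconstruction of the input
    return y, x_hat

def reconstruction_error(x, x_hat, eps=1e-12):
    """Sum of cross-entropies, one Bernoulli term per input component."""
    return -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))

rng = np.random.default_rng(0)
d_in, d_code = 8, 3
W_enc, b_enc = rng.normal(size=(d_code, d_in)), np.zeros(d_code)
W_dec, b_dec = rng.normal(size=(d_in, d_code)), np.zeros(d_in)
x = rng.binomial(1, 0.5, size=d_in).astype(float)
y, x_hat = autoencoder_forward(x, W_enc, b_enc, W_dec, b_dec)
print(reconstruction_error(x, x_hat))
```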
7.2.3 Role of Coding
What does the coder do after learning?
Divides the weight matrix into row vectors wjT
Suppose the input layer of the autoencoder is made up of d1 units and the middle layer of d2 units
W is a d2×d1 matrix
There are d2 d1-dimensional vectors wj
Suppose that any input vector can be approximated by a linear combination of the vectors wj.
The expansion coefficient cj measures how much of the component wj is contained in the input.
The activations of the middle layer play the role of these expansion coefficients.
If learning goes well, an arbitrary input can be decomposed into its components along the weight row vectors {wj}.
7.2.4 Autoencoders and Principal Component Analysis
When the activation function f is an identity map, the input-output relationship is
The reconstruction error is the mean squared error and is given by
The vectors that create the weights are assumed to be orthogonal
The above equation is the same as the equation for principal component analysis.
The autoencoder approach extends principal component analysis to nonlinear data analysis.
7.3 Sparse Autoencoders
7.3.1 Sparsifying the Autoencoder
Sparse regularization can be used to achieve nontrivial data compression even when d1≤d2
The middle layer is larger than the input (overcomplete).
As is, there are infinite solutions
Putting sparsity constraints on the middle layer uniquely determines the solution
Making the intermediate layer sparse means
No matter what the input is, most of the units in the intermediate layer will only output 0
Sparse Regularization
Sparse representations represent rich data structures by indirectly exploiting large dimensions, even though few units are activated.
7.3.2 Error back propagation for sparse autoencoders
The error back propagation method for sparse autoencoders requires some technical care.
Error back propagation with sparse regularization
7.4 Multilayer Autoencoders and Pre-training
7.4.1 Multilayer autoencoders
An hourglass-shaped neural network with one intermediate layer is a shallow network, so it does not suffer from the vanishing gradient.
Shallow autoencoders can be used to train a multilayer autoencoder.
Using a multilayer autoencoder, the data can be compressed iteratively, layer by layer.
Consider the multilayer autoencoder as a collection of two-layer autoencoders, and train the whole thing by iterating over those two-layer pieces.
Illustration of the algorithm
(1) Take out only the first and second layers, and add a layer with the same number of units as the input layer to the output layer.
This is trained as a self-encoder with training data, and after training, the added layers are removed again to make a coder.
(2) Take out only the second and third layers and train it as a self-encoder again.
(3) Repeat this operation toward the upper layer.
(4) After learning is completed up to the output layer, use the learned weight parameters of each autoencoder as the weight parameters of the multilayer autoencoder.
Greedy layer-wise training
Autoencoder trained by the greedy method (stacked autoencoder)
This method is also used for pre-training.
7.4.2 Pre-training
Useful tricks to improve the efficiency of gradient descent
Optimize initial values
Consider training an L-layer deep neural net
The output layer does regression analysis, so the deep representation is up to the previous layer
Train layers 1 through L-1 as a stacked autoencoder (SAE) using the greedy method, and use the learned weight parameters as initial values of the weights of the deep neural network
The weights between the L-1 and L layers are chosen randomly.
After that, compute using normal gradient descent method
Efficiency is greatly improved
7.5 Denoising Autoencoders
Neural nets are not robust to noise in the input, and the autoencoder is no exception
Train the autoencoder on input data corrupted by noise
Adding Gaussian noise to the input
Add noise to the input and make the output the same as before adding the noise
Masking noise (randomly setting some components to 0)
Salt-and-pepper noise
7.6 Contractive Autoencoders
7.6.1 Contractive autoencoders and manifold learning
Motivation: extend PCA with autoencoders
Regularization term added to facilitate manifold learning
Penalties of CAE
The Frobenius norm of the Jacobian
A CAE is similar to a DAE, but it serves to contract similar data points, scattered in some way, toward a single point in the code space.
Relationship between CAE and manifold learning in the representation space
7.6.2 Relationship with other autoencoders

8 Convolutional Neural Networks

Introduction
A neural network developed specifically for image recognition is a convolutional neural network.
This neural network has a special structure with sparse connections.
A hint for the development of this structure is the structure of the visual cortex of a real animal.
8.1 Primary Vision and Convolution
8.1.1 The Hubel-Wiesel Hierarchy Hypothesis
Humans have a high capacity to recognize patterns from visual information
Due to the special structure of the neuronal network
In 1958, Hubel and Wiesel discovered that some cells respond only when a line segment with a specific slope is presented in a cat's visual field
Simple cells and complex cells
Schematic diagram
Local patterns are first detected by simple cells, and by integrating this information, stable pattern detection can be achieved against misalignment.
Inspired by the hierarchy hypothesis of Hubel and Wiesel
Neocognitron by Fukushima in 1979
High performance by applying error back propagation method to neural network in 1989
Handwritten character recognition
8.1.2 Neural Networks and Convolution
Convolutional neural networks are made up of two main types of layers.
The first is a layer made up of units corresponding to simple cells
The location in the image where the pattern to be extracted will appear is not known in advance.
The layer on the input side is covered with simple cells with local receptive fields.
Wherever local receptive fields and simple cells are combined, they all have the same weight
Weight sparsification and weight sharing
Complex cells are only responsible for coordinating the output of many simple cells
Pooling layer
Schematic diagram
8.2 Convolutional Neural Networks
8.2.1 Image Data and Channels
In a relatively simple handwritten character recognition such as MNIST
Even if two-dimensional image data is input to the model as vectors arranged in a one-dimensional form, the training classification model shows a reasonable performance.
For serious pattern recognition such as classification of natural images, the 2D structure of the image is used.
An image sample is represented as a 2D array of real pixel values xij.
For an image of size WxW, the index that specifies the image position (i,j) is as shown in the above equation.
In the actual image data, color component information is assigned to a single position (i,j).
If there are K channels, it is considered to be made up of K images.
8.2.2 Convolutional Layer
Simple case with one channel
The convolution layer plays the role of applying a filter to the matrix input (zij(l-1))
What is the role of a filter?
The filter can be defined as a HxH image with a size smaller than the input
The pixel value is written as hpq
Convolution of the filter (hpq) means the operation of the above equation
Schematic diagram of filter convolution
Extract a pattern from an image
A large value of the convolution corresponds to a location in the image where the filter's pattern appears.
Extracting a specific pattern from an image by convolution of filters
This figure shows filters obtained by training a network that performs convolution and pooling twice each.
The above filtering is performed on K channels
Convolutional layer 1
The final output of the convolutional layer will be the activation function f applied to this image.
This is called the feature map from the convolutional layer.
Nowadays, the ReLU function is often used as the activation function.
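A minimal sketch of convolving a single-channel image with an H×H filter and applying ReLU to get a feature map (my own illustration, not from the book; the image, the hand-made vertical-edge filter, and the stride are arbitrary):

```python
import numpy as np

def conv2d(z, h, stride=1):
    """Valid convolution of a single-channel image z with an H x H filter h (no padding)."""
    H = h.shape[0]
    out_w = (z.shape[0] - H) // stride + 1
    out_h = (z.shape[1] - H) // stride + 1
    u = np.zeros((out_w, out_h))
    for i in range(out_w):
        for j in range(out_h):
            patch = z[i * stride:i * stride + H, j * stride:j * stride + H]
            u[i, j] = np.sum(patch * h)       # filter response at position (i, j)
    return u

relu = lambda u: np.maximum(u, 0.0)
image = np.random.default_rng(0).normal(size=(8, 8))
edge_filter = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])  # vertical-edge pattern
feature_map = relu(conv2d(image, edge_filter))   # feature map from the convolutional layer
print(feature_map.shape)                          # (6, 6)
```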
8.2.3 1 x 1 Convolution
In modern CNNs, convolution with a filter of size 1×1 is also used
Since the local receptive field is only one pixel, it does not seem to play any role in terms of pattern extraction.
In the channel direction, however, it acts non-trivially because it sums over the channels at each pixel.
Using fewer filters than the number of channels in the original image reduces the number of channels in the transformed image
8.2.4 Factorized convolution
A technique to reduce the amount of computation involved in convolution.
Usually the convolution layer of a 5×5 filter has 5×5=25 parameters.
Almost the same operation as 5×5 convolution can be achieved with a small number of parameters
Schematic of factorization of convolution
5×5 convolution can be done in two 3×3 convolutions
Furthermore, the asymmetric 3×1 and 1×3 convolutions can be used instead of the 3×3 convolution.
8.2.5 Stride
Moving the filter one pixel at a time extracts the pattern in detail
To capture the structure of the image more coarsely,
there is no need to move the filter one pixel at a time
Move the filter by skipping S-1 extra pixels
Convolution with stride
8.2.6 Padding
Convolution always reduces the size of the image.
Because the filter cannot overlap regions outside the original image
Methods that can increase the size of the convolved image to some extent
An operation to add a border of thickness P around the original image
Types of padding
Zero padding: fill all the added pixels with the value 0
No padding at all, image is reduced
Ensure that the size of the image does not change before and after the convolution
Maximum padding so that the width of the image after the convolution is W+H-1
Example of padding
8.2.7 Pooling Layer
We discuss the pooling layer for complex cells.
Local patterns captured by the convolutional layer can be captured even if the position is moved to some extent.
To make the network robust against translation of the input position, we aggregate the outputs from the simple cells that extracted the local pattern at each position.
In pooling, first,
for each channel of the W×W×K input image zijk(l-1), consider a region Pij of size H×H centered at each pixel position (i,j).
From the pixel values in Pij, determine the representative pixel value eij(l) for the region.
How to determine the representative value
Extract the largest pixel value in the region as the representative value
Instead of the largest value, use the average of the pixel values in the region
An extended version of the two pooling methods
Stochastic pooling (Stochastic pooling)
The pooling layer merely aggregates the outputs of the convolutional layer
No learning of weights
The pooling equation is not changed at all during learning.
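A minimal sketch of max and average pooling over non-overlapping regions (my own illustration, not from the book; the region size of 2 and the toy input are arbitrary):

```python
import numpy as np

def pool2d(z, size=2, mode="max"):
    """Non-overlapping pooling over size x size regions of a single-channel map;
    pooling has no weights to learn."""
    w = z.shape[0] // size
    out = np.zeros((w, w))
    for i in range(w):
        for j in range(w):
            region = z[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

z = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(z, mode="max"))       # largest value in each region
print(pool2d(z, mode="average"))   # mean value in each region
```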
8.2.8 Local Contrast Normalization Layer
In image processing, there is a technique called local contrast normalization to improve the quality of training samples.
In the intermediate layer of a convolutional neural network, there is an intermediate layer that performs this normalization.
For K-channel images, local contrast normalization is performed for each channel.
8.2.9 Local Response Normalization Layer
In AlexNet, the model that won the ILSVRC in 2012, the local response normalization layer (LRN layer) was used instead of the LCN.
Definition of local response normalization layer
8.2.10 Network Structure
A convolutional neural network is constructed by repeatedly stacking convolutional layers and pooling layers.
Then some fully connected layers are added, and finally an output layer is added to perform tasks such as image recognition.
After the pooling layer, a local contrast normalization layer or a local response normalization layer may be inserted.
As a typical example, we show the VGG example from Oxford University.
8.3 Error Back Propagation for CNNs
8.3.1 Convolutional Layers
Since the convolutional layer is just a special forward propagation layer, back propagation can be computed from the chain rule, taking weight sharing into account.
Delta back propagation rule in the convolutional layer
8.3.2 Pooling layer
The pooling layer can also be back propagated by rewriting it as a normal forward propagation layer.
The average pooling layer can be regarded as a forward propagation layer with the weights of the above equation.
For maximum pooling, it is necessary to remember the pixel location of the maximum value adopted in the forward propagation.
8.4 Trained Models and Transfer Learning
Models that have been trained on large data sets such as ImageNet produce excellent results.
Neural nets have a high generalization capability.
Deep network structures are expected to acquire a variety of features in the intermediate layers.
Remove the output layer of the trained model
Use the output from the intermediate layer as a representation of the input image
By applying the representation directly to regression or support vector machines, we can expect performance that would not be obtained by analyzing the input image directly.
Replace only the output layer and recompute it for other tasks
This is the initial value for the new task.
8.5 What kind of patterns are captured by CNNs?
As a pattern extractor, CNN is a black box.
Several techniques have been developed to look inside the CNN.
An image corresponding to a particular convolutional layer of a trained VGG
Image of the input pattern corresponding to each filter
As we go from lower layers to higher layers, each filter captures increasingly complex patterns
Unnatural decisions for noise
8.6 Deconvolutional Networks
Various inverse operations of convolution have been studied so far
To visualize what a CNN has learned
or to generate a large image from a small one
To visualize the inside of a CNN, we can use a model derived from the one introduced by Zeiler et al.
It consists of a combination of unpooling (upsampling) and convolution.
8.7 Inception Module
To achieve a high performance CNN, we can create a deep network with many layers.
The deeper the network, the more difficult it is to learn and the more computation is required.
Consider leveraging both depth and width of the network
GoogLeNet
Inception module

9 Recurrent Neural Networks

Introduction
We introduce recurrent neural nets as a method for learning time-series data such as text.
9.1 Time Series Data
The key to successful deep learning
How to design a network structure that is appropriate for the data
In the case of CNNs, consider a network that realizes the convolution of filters according to the characteristics of the image.
How to handle time-series data such as videos, texts, and conversations?
Write (time) series data of length T as x1, x2, …, xT.
The data corresponding to time t is xt
Consider a conversational sentence as time series data
The first word heard is x1, and then words x2, …, xT are added in time order until the sentence ends.
Each word is represented as a vector.
Example: 1-of-K vector
Only the component of the corresponding word is 1, the others are 0
It has as many dimensions as the number of words in the dictionary
Length T of time series data varies from sample to sample
We need generalized results for length T
9.2 Recurrent Neural Networks
9.2.1 Loops and Recursion
A general graph without a feedforward structure
We can’t define the concept of layers for the graph as it is
Arrows from a unit may come back to itself
Neural nets with the above structure are useful when dealing with time series data.
In order to handle time series data well, we need to retain information about the past.
A forward propagating neural network cannot capture the complex information carried by a series of words.
The white unit is the input layer
There is no arrow coming into the white unit
Orange units are output layers
Gray units are hidden layers
Graph rewritten in the form of layers
A two-layer forward propagating neural network with a new feedback path
Recurrent neural network (RNN)
In the forward direction, the signal propagates immediately as in a normal neural network.
For the loop corresponding to the return path, the signal is delayed by one unit time
The time-delayed loop retains the past time information.
Representation as an expression
Assumptions
Input to the RNN at time t (output of the input layer at time t)
Unit output of intermediate layer
Unit output of output layer
Weights of input layer and intermediate layer
Weight of intermediate layer and output layer
Weight of the feedback path between the intermediate layers
Weight of the feedback path from the output layer to the intermediate layer
Propagation equation of RNN
Equation of output layer
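A minimal sketch of the propagation equations above (weight names W_in, W_rec, W_out are assumptions, and the feedback path from the output layer mentioned above is omitted for simplicity):

```python
import numpy as np

def rnn_forward(xs, W_in, W_rec, W_out, b, c):
    """Simple RNN: z_t = tanh(W_in x_t + W_rec z_{t-1} + b), y_t = softmax(W_out z_t + c)."""
    z = np.zeros(W_rec.shape[0])          # hidden state, initially zero
    ys = []
    for x in xs:                          # xs: list of input vectors x_1 ... x_T
        z = np.tanh(W_in @ x + W_rec @ z + b)   # the loop carries past information with a one-step delay
        u = W_out @ z + c
        ys.append(np.exp(u) / np.exp(u).sum())  # softmax output at each time
    return ys
```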
9.2.2 Real-time recurrent learning method
RNNs can be trained without the error backpropagation method by using the real-time recurrent learning (RTRL) method.
In order to define RTRL
First, let the error function at time t be the above equation
The error function at each time is
In the case of regression
For softmax regression
Calculate the gradient
Assumptions
Define the alphabetic subscripts
For weights
The general gradient is determined by the chain rule
According to the propagation rule of RNN
The coefficients are as above
The value of p^j_rs is computed recursively,
and is given by the above equation.
Using the obtained p, use the gradient descent method at each time.
9.2.3 Network Expansion
A method to make it easier to see how information is retained in a loop.
Illustration of the expansion
The unfolded network has no loops and is a graph with good properties.
9.2.4 Backpropagation Through Time (BPTT)
After introducing the RNN expansion, we formulate the error backpropagation method as for a feedforward network.
In BPTT method, we consider the error over the whole time (above equation)
Consider the gradient of this error
Let the delta between the intermediate unit and the output unit be the above equation.
To derive the back propagation rule of the delta, we use the chain rule.
From the above, we can get the formula of delta of back propagation of RNN.
Next, we reconstruct the gradient using the delta.
Since the expanded RNN can be regarded as a weight-shared neural network
The gradient for Win is the above equation
The final formula for the BPTT method is
To be precise, it is called epochwise BPTT (EBPTT) method.
Only the error function at the current time is considered
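A minimal sketch of BPTT for the simple RNN above, with a squared-error loss at each time step (weight names are assumptions, the output-to-hidden feedback path is again omitted, and bias gradients are not returned):

```python
import numpy as np

def bptt(xs, ts, W_in, W_rec, W_out, b, c):
    """Unfold the RNN over time, then backpropagate deltas through the unfolded graph."""
    T, z = len(xs), [np.zeros(W_rec.shape[0])]
    ys = []
    for x in xs:                                   # forward pass over the unfolded network
        z.append(np.tanh(W_in @ x + W_rec @ z[-1] + b))
        ys.append(W_out @ z[-1] + c)               # linear output, squared-error loss
    gW_in, gW_rec, gW_out = np.zeros_like(W_in), np.zeros_like(W_rec), np.zeros_like(W_out)
    delta_next = np.zeros_like(b)                  # delta flowing back from time t+1
    for t in reversed(range(T)):
        delta_out = ys[t] - ts[t]                  # output-layer delta at time t
        gW_out += np.outer(delta_out, z[t + 1])
        delta = (W_out.T @ delta_out + W_rec.T @ delta_next) * (1 - z[t + 1] ** 2)
        gW_in += np.outer(delta, xs[t])            # weight sharing: gradients are summed over time
        gW_rec += np.outer(delta, z[t])
        delta_next = delta
    return gW_in, gW_rec, gW_out
```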
9.3 Application to Machine Translation
Many applications of RNNs have been studied for natural language processing
The idea of using RNNs for machine translation
The input sentence is divided into T words x1, x2, …, xT.
This is input to the RNN
The output layer of the RNN is a softmax layer with as many units as there are words in the dictionary.
The most likely output word yt can be predicted from the unit with the largest output value
The past information carried by the loop gives the RNN information about the preceding word sequence x1, x2, …, xt to help it choose yt.
The sequence y1, y2, …, yT collected from these outputs at each time is the translated sentence.
The idea is just a toy model
It can only be used when the number of words in the input and output sentences are the same.
9.4 Problems with RNNs
The RNN is effectively a deep network because of the feedback paths.
Because the same weight W is applied many times around the loop,
the signal or gradient easily vanishes or explodes.
Devices such as gradient clipping (sketched below) are necessary.
LSTM, an improvement of RNN, has succeeded in suppressing the problem by devising a network structure.
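A minimal sketch of the gradient-clipping device mentioned above (norm-based clipping; the threshold value is an arbitrary choice):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient if its norm exceeds the threshold, to tame exploding gradients."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad
```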
9.5 Long Short-Term Memory (LSTM)
Introduction
The purpose of RNN is to accumulate long term time series information.
9.5.1 Memory Cells
The middle layer of RNNs is replaced by a memory unit.
Memory units contain mechanisms to retain and forget information.
The memory unit itself is a complex circuit consisting of cells, gates, and units.
Memory cell
A loop is used to pass the output s_j^t of the cell (C) at each time to the cell at the next time.
The wavy line in the figure means that the loop propagation is delayed by one time.
The gate also receives input from the outside.
At each time t, it receives the following inputs
xit from the input layer
zjt-1 from the middle layer at the previous time
The state of the cell when no gates are considered
9.5.2 Gates
Another component of a memory unit is a gate.
The gate unit receives an input from the outside and first outputs the gate value gjt.
The incoming signal is then multiplied by the gate value gjt.
9.5.3 LSTM
Overall structure of the memory unit
There are four places where the external inputs xt and zt-1 are taken in.
One is for input to the cell
The rest are for input to the gate
The output from the cell goes not only to the outside, but also to the gate unit to control the gate
Each input unit has a unique weight.
An intermediate layer made up of such memory units is called an LSTM or LSTM block.
9.5.4 Forward Propagation in LSTMs
We will explain the structure of the memory units by looking at the forward propagation of the LSTM in detail.
We start with the input unit (I).
The external input to I is summed up to the total input ujt, and then output as zj0,t under the action of the activation function.
The signal from the input unit is multiplied by the gate value from the input gate (IG) before entering the cell (C).
The loop in the memory cell (C) has the action of the gate (FG)
Forgetting gate
From the above, the state of the memory cell changes with time as shown in the equation above
The output from the cell is the final output after the activation function is applied by the unit (U) and the gate value from the output gate (OG) is multiplied.
The expression for the output gate is given above.
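A minimal sketch of one forward step of the memory unit described above (weight names and the grouping into input/forget/output gates follow the text; biases are omitted for brevity):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(x, z_prev, s_prev, W, R):
    """One LSTM step. W[...] act on the input x, R[...] on the previous hidden output z_prev.
    s is the memory-cell state, z the output of the memory unit."""
    i = sigmoid(W["i"] @ x + R["i"] @ z_prev)   # input gate (IG)
    f = sigmoid(W["f"] @ x + R["f"] @ z_prev)   # forget gate (FG)
    o = sigmoid(W["o"] @ x + R["o"] @ z_prev)   # output gate (OG)
    g = np.tanh(W["c"] @ x + R["c"] @ z_prev)   # input unit (I) feeding the cell
    s = f * s_prev + i * g                      # cell state: keep part of the past, add new input
    z = o * np.tanh(s)                          # gated output of the memory unit
    return z, s
```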
9.5.5 Back propagation of LSTM
The LSTM back propagation method is no different from the usual one, as long as we are careful about the presence of gates.
First, the delta of the unit (U)
Input from the LSTM to the outside, output layer at the same time and input to the intermediate layer at the next time
If we write the forward propagation in the vicinity of the above figure
The delta for unit U is the above equation
Therefore, the back propagation law for delta is given by
The delta for the gate (OG) is similarly
Delta for a cell
Since there are five arrows coming out of C, variations in the output of the cell will change the activity of the unit to which these five arrows are going.
The delta about F is
The delta for I is
With respect to the input gate
9.5.6 Gated Recurrent Unit (GRU)
A unit that uses gates as in LSTM, but has long-term memory in a simpler way.
Circuit diagram of GRU
Equations for the reset gate and the update gate
Behavior of the GRU
Forgetting unnecessary information by reset gate
Like the LSTM, the GRU controls the state of its memory with gates.
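For comparison, a minimal sketch of one GRU step in its standard reset-gate/update-gate form (weight names are assumptions):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gru_step(x, z_prev, W, R):
    """One GRU step with reset gate r and update gate u."""
    r = sigmoid(W["r"] @ x + R["r"] @ z_prev)        # reset gate: forget unnecessary information
    u = sigmoid(W["u"] @ x + R["u"] @ z_prev)        # update gate: how much of the memory to keep
    h = np.tanh(W["h"] @ x + R["h"] @ (r * z_prev))  # candidate state
    return (1 - u) * z_prev + u * h                  # interpolate old state and candidate
```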
9.6 Recurrent Neural Networks and Natural Language Processing
Introduction
In the field of statistical natural language processing, probabilistic models are used to model sentences.
Translate a sentence x1, …, xT into a sentence y1, …, yT.
To generate a translation from the input sentence x1, …, xT, consider the conditional probability between sentences in the source language and sentences in the target language.
Model each probability distribution on the right-hand side of the previous equation and train it using training data.
If the probabilities can be estimated
With y1 to yt-1 already translated
use P(yt | y1, …, yt-1, x1, …, xT) to predict which word to adopt as the next word yt.
By repeating this process for t = 1, 2, …, T, we obtain the whole translated sentence.
Approach for the case where the input sentence and the output sentence to be translated are different in size
Idea of encoders and decoders
Use an RNN to encode the whole input sentence x1, x2, …, xT into a certain representation C.
State of the middle layer of the RNN
The RNN that does this work is called the encoder.
The RNN, called decoder, takes the context and gives a probability distribution to determine the translation
The output of the RNN is a probability distribution (the above equation) using softmax or other methods.
9.6.1 Seq2Seq Learning
This section describes the representative models described above.
Overall structure of Seq2Seq (sequence to sequence)
An RNN is used to create the context, from which the words of the translated sentence are generated one word at a time.
The output of the previous time is input to the middle layer of the decoder
Since the output layer of the decoder is a softmax layer, the word yt-1 with the highest probability is sampled from the distribution P(yt-1 | y1, …, yt-2).
Then, use that word as the input for the next time t.
When the Seq2Seq decoder reaches the end of the sentence, it outputs <EOF> to finish the translation.
The entire structure is learned using the maximum likelihood method.
Consider the distribution of the entire encoder and decoder
Minimize the error function above over the training data.
Use different RNNs for encoder and decoder.
How to use Seq2Seq
Improve performance by reversing the input.
The input sentence is fed in reversed order.
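A minimal sketch of the encoder-decoder idea with greedy decoding; the recurrent cell, weight names, vocabulary, and end-of-sentence handling below are placeholders (a real Seq2Seq model would use LSTM units and learned embeddings):

```python
import numpy as np

def rnn_step(x, z, W_in, W_rec):
    """Placeholder recurrent cell shared by the encoder and decoder in this sketch."""
    return np.tanh(W_in @ x + W_rec @ z)

def translate(src, enc, dec, W_out, vocab, eos_id, max_len=20):
    """Encode the source word vectors into a context, then greedily decode word by word."""
    z = np.zeros(enc["W_rec"].shape[0])
    for x in src:                                   # encoder: read x_1 ... x_T
        z = rnn_step(x, z, enc["W_in"], enc["W_rec"])
    context, out, y_prev = z, [], np.zeros(len(vocab))
    for _ in range(max_len):                        # decoder: emit one word per step
        context = rnn_step(y_prev, context, dec["W_in"], dec["W_rec"])
        probs = np.exp(W_out @ context)
        probs /= probs.sum()                        # softmax over the vocabulary
        w = int(np.argmax(probs))                   # greedy choice of the most likely word
        if w == eos_id:                             # stop at the end-of-sentence symbol
            break
        out.append(vocab[w])
        y_prev = np.eye(len(vocab))[w]              # feed the chosen word back in (1-of-K)
    return out
```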
9.6.2 Neural Conversation Model
Applications of Seq2Seq
Neural conversation model
Seq2Seq trained to have natural conversations with people
Seq2Seq is trained on conversational data rather than a dialogue corpus.
Trained on OpenSubtitles, a dataset of conversational text from movies.
The model used is Seq2Seq, which has two layers of LSTM with 4096 memory units in the middle layer.
Paper 53

10 Boltzmann machine

Introduction
Unlike neural nets, Boltzmann machines are network models in which each unit behaves probabilistically.
A Boltzmann machine is essentially the Ising model of classical statistical physics.
10.1 Graphical Models and Probabilistic Inference
Introduction
A neural network is fitted with observed data as training data
An approach to inference based on observed data
For phenomena with uncertainty, it is more natural to describe a probabilistic model
Concept of Generative Model
Suppose there is a phenomenon with uncertainty and N data points x(n) = (x1(n), x2(n), …)^T as the observed values.
In order to understand the causes of this phenomenon and the causal relationships associated with it, we introduce a probability model.
Assume that the observed values are realizations of the random variable x = (x1, x2, …)^T.
Assume that this observed value is independently generated from the underlying probability distribution Pdata(x).
The generating distribution, if it exists at all, is usually not directly knowable
In order to infer the shape of this distribution from observed data
First, we need to formulate a hypothesis that approximates the unknown distribution.
Model distribution P(x|θ), an approximation of the generative distribution
Consider a collection of model distributions.
Among this collection of models, learning is to identify the model that best explains the observed data (most appropriate parameter θ*).
How do we choose the model distribution p(x|θ)?
A graphical model is often used
10.1.1 Directed Graphical Models
In mathematics, a graph is a structure of units (also called nodes or vertices) connected by edges (links).
Schematic diagram of a directed graph
A graphical model for a directed graph
Consider the acyclic case, where the graph has no loops.
Each unit has a random variable attached to it.
Causal relationships between random variables are modeled by conditional probabilities.
By collecting all causal relationships determined from the graph, we represent the simultaneous probability distribution of the graphical model.
Let Ni be the set of parent units.
The probability distribution for a general directed graph is as above
However, for variables with no parent units, the above is given.
Advantages of considering this type of probability model
The random variables are intricately correlated and not independent of each other.
Some of the random variables have a structure called conditional independence.
Conditional independence can be easily seen visually in the structure of a graph.
What is conditional independence?
A situation in which variables that are not originally independent become independent after the realization of another variable is determined.
Suppose we have a graph with only three variables: x3, x4, and x5.
When there is only this subgraph, the simultaneous distribution of these three variables is
Looking at the calculation result of this distribution P(x3,x4,x5), we can see that the three variables are not independent of each other.
However, when the value of variable x3 is observed and takes a certain definite value, the remaining two variables become independent.
The simultaneous distributions of x4 and x5 are decomposed.
Conditional on variable x3, x4 and x5 are independent
The graph shows that when x3 is removed, they become disjoint.
Another structure to express conditional independence
Look at the subgraph consisting of x3, x4, x5, and x6.
The arrow from x3 goes to x4 and x5, and from them to x6
If x4 and x5 are removed, x3 and x6 fall apart
Conditional Independence
On Inference with Probability Distributions
Inference is the process of searching for the cause of a result based on the result
In a directed graphical model, the task is to follow the arrow backwards.
On Generative Process and Inference Process
Variable y is an observed variable
Variable x is an unobserved explanatory variable
Generative process on the left
As illustrated by the arrow going from x to y, the process in which y is generated from x by the simultaneous distribution described above
x is given by the prior distribution P(x)
This determines the distribution of y.
What is observed is y
Inference process on the right
Based on the observed value y, infer the value that the explanatory factor x would have taken
Working backwards through the graph of the wavy line
Determining the posterior probability P(x|y)
Find the right conditional probability from the known information P(x),P(y|x)
The distribution P(y) is obtained from the marginalization of the simultaneous distribution.
10.1.2 Undirected Graphical Model
The arrow in a directed graphical model represents the causal relationship of which unit is affecting which unit.
The structure of the arrow extending from the unit at x to y has a probability distribution P(y|x) attached to it.
It is not always possible to know in advance the causal relationship between random variables.
Undirected graphs do not have an edge orientation.
The concept of conditional independence can be applied to undirected graphs.
A model on an undirected graph that satisfies the conditions for graphical independence
Conditional Independence
Consider three subgraphs a, b, c in an undirected graph
Assume that they do not overlap with each other.
When c nodes (and the edges leading to them) are removed from the graph
a and b are separated into two unconnected graphs
In this case, a and b are assumed to be independent conditional on c.
Global Markov property
Identify conditional independence by finding a set of nodes c to decompose into two subgraphs
Identify conditional independence
For any node i, if we fix the random variables for its neighbors j ∈ Ni, then xi is independent of the random variables for all the remaining nodes
Any two nodes that are not directly connected by an edge are independent of each other if the other random variables are fixed.
A graph in which all pairs of nodes are connected by an edge.
A subgraph of an undirected graph that is a complete graph.
A graph in which three nodes are connected by an edge to form a triangle.
The smallest clique
A clique that is no longer a clique if any other node in the graph is added.
Examples of complete graphs and maximal cliques
Define a probability model using the concept of maximal cliques.
Write C for the set of maximal cliques c in an undirected graph.
At this point, consider the probability distribution model above
Xc denotes the random variables corresponding to the nodes in the clique c.
Z is the distribution function: the coefficient needed to normalize P to be a probability
sum over x
The factor Ψc(xc) is a positive-valued function called the clique potential.
This potential can be written in terms of the clique energy function Φc(xc).
The distribution P(x) becomes the Gibbs-Boltzmann distribution in statistical mechanics.
This will be connected with Boltzmann machine later.
Introduction of the model via cliques
This model is closely related to Markov networks in general.
Conditional independence in an undirected graphical model is defined as
is defined as a separation condition for graphs
The probability distribution P(x) over an arbitrary undirected graph satisfies the local Markovianity over this graph.
The distribution P(x) is given by the Gibbs distribution for the maximal clique of this graph.
Theorems that guarantee the above
To create a probability distribution model in which conditional independence is nicely realized as a separation condition, consider a distribution in the form of a product of clique potentials.
Rough reasons why the Hammersley-Clifford theorem holds.
Using Pairwise Markovianity
Two variables xi and xj that are not linked become independent given all other variables.
Nodes that are not directly connected by a link are separated by a set of other nodes
Consider the model in the figure above.
The distribution needs to satisfy the factorization conditions above.
Every pair of [x2,x3,x4,x5] is always linked.
Therefore, the four variables have a factor of Ψ23435, which cannot be further factored
In the same way, [x1.x4.x5] must have a certain factor Ψ145.
From the above, the distribution must take the form of the above equation.
This will be the distribution given by the creek potential.
Determining the set of maximal cliques and introducing a potential function for each of them every time a new graph is studied is cumbersome in practice.
From a statistical mechanics point of view
introducing a clique potential that involves many random variables corresponds to including higher-order interactions in the Hamiltonian.
If you do not need a very complex probability distribution model
It is more convenient to use a pairwise Markov random field, a simplified version of a Markov network.
In this model, we consider only the special situation where the clique potential Ψc(xc) of each maximal clique is represented by the product (above) of potential functions associated with the edges (i, j) ∈ c in the clique.
A pairwise Markov network is represented by the product in the above equation.
Ε is the set of all edges in the graph
Pairwise Markov networks are
Although some generality is lost, the structure is simplified.
Because of its simplicity, it has many applications
10.2 Boltzmann Machines
10.2.1 Boltzmann machine without hidden variables
A Boltzmann machine is a model for probabilistic machine learning defined on an undirected graph, where the edges have no orientation.
An example of a Boltzmann machine
Each unit i ∈ N is assigned a binary random variable xi that can take the value 0 or 1.
This random variable is called the state of the unit.
The Boltzmann machine describes the simultaneous probability distribution of the random variables on this graph.
In order to write the concrete form of the distribution, I will explain the meaning of the edges connecting the units.
Suppose that two units i and j are connected by an edge.
Let the set of all edges of the graph be ε
(i, j) ∈ ε
Similar to neural nets, each edge in a Boltzmann machine is given a weight parameter wij.
Since the graph is undirected, the order of the indices i and j is meaningless.
Consider the energy function above, which is determined by the variables corresponding to these units and the weights corresponding to the edges
bi is the bias for unit i
The last row is a matrix representation using the symmetric matrix W formed by Wij
b and x are vectors
The above probability distribution model based on this energy is the Boltzmann machine.
Borrowing a term from statistical physics
The distribution represented by the energy function is called the Gibbs-Boltzmann distribution.
The total parameter including weights and biases is written as θ
Z(θ) is the distribution function
It is the factor needed for P to satisfy normalization as a probability.
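A minimal sketch of the energy function and the normalized distribution for a small Boltzmann machine; enumerating all 2^M states to compute Z is only feasible for tiny models, which is exactly the combinatorial problem discussed later:

```python
import numpy as np
from itertools import product

def energy(x, b, W):
    """Phi(x) = -b^T x - (1/2) x^T W x, with W symmetric and zero on the diagonal."""
    return -b @ x - 0.5 * x @ W @ x

def boltzmann_distribution(b, W):
    """Exact P(x|theta) by brute-force enumeration of all binary states (tiny models only)."""
    M = len(b)
    states = [np.array(s, dtype=float) for s in product([0, 1], repeat=M)]
    unnorm = np.array([np.exp(-energy(x, b, W)) for x in states])
    Z = unnorm.sum()                      # partition (distribution) function Z(theta)
    return states, unnorm / Z
```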
We will discuss machine learning using Boltzmann machines.
Using a distribution of the form described above
Determine the machine (model) that best explains the given observed data x(n).
Adjust the parameters so that the Boltzmann machine P(x|θ) is closest to the generating distribution of the data, Pdata(x).
Boltzmann Machines and Physics
Boltzmann machine is nothing but a canonical distribution when xj is considered as a spin variable
Bias is magnetic field
Weights are spin interactions
Unlike ordinary statistical mechanics
The values of the magnetic field and the interaction coupling constant can have different values in different places.
In statistical physics
Given a Hamiltonian (energy function), the distribution is fine-tuned and the expected values of physical quantities such as spin variables are calculated.
In machine learning for Boltzmann machines
First, the various configurations of the system (the values taken by various physical quantities and their averages) are given, and then
Then, the parameters of the Hamiltonian are determined in such a way that the given data are most easily realized.
This process is the opposite of the usual analysis of the Ising model.
It is similar to the analysis of experimental physics.
10.2.2 Boltzmann machine with hidden variables
In the previous section, it was assumed that all variables corresponding to nodes correspond to observed data.
In the actual case, there are two types
Hidden variables and latent variables that do not directly correspond to the observed data
visible variables that correspond to the observed data
Divide the random variables x associated with the nodes of the graph into visible variables v = (v1, v2, …)^T and hidden variables h = (h1, h2, …)^T.
An example of a Boltzmann machine with hidden variables
In the case of hidden variables, we can do the same as above
For the above energy function
Since the energy function contains two types of variables, we can rewrite it in detail
The model distribution can be expressed as above
Reasons to introduce hidden variables
Take missing data into account
Using hidden variables can improve the performance of Boltzmann machines.
There is no guarantee that the probability model you have prepared is a good model to explain the given training data.
No matter how many parameters you adjust, it will not be a good approximation of the generative distribution.
The unbridgeable gap between the assumed collection of model distributions and the training data appears as a limit to the expressive power of the model.
To improve the situation, we can increase the coverage by increasing the complexity of the model.
What complexity can do immediately
The introduction of hidden variables gives an extension of the model with good visibility and performance.
Consider the marginal distribution used for training (above equation) as an extended Boltzmann machine with only visible variables.
The energy function of this extended model P(v|θ) is given above
If we expand the right hand side, we can find any number of terms of high order for the visible variables.
It becomes a complicated energy function.
Even for a complex energy function, the number of independent parameters is finite
The coefficients of all the terms in this energy function are determined from only the few parameters of the original Boltzmann machine with hidden variables.
Although the distribution is generalized, the parameters are effectively weight-shared.
Despite the high expressive power, overfitting is prevented, resulting in a model with good properties.
10.3 Learning a Boltzmann Machine and the Computational Explosion
Introduction
In machine learning with Boltzmann machines
First, prepare a graph that seems to be suitable for the problem you want to consider.
People think about it using their brains.
After that, we move on to learning.
Consider that the given training data is generated i.i.d. from an unknown generating distribution.
Bring the Boltzmann machine P(x|θ) of the prepared graph closer to this generating distribution.
Optimize the parameter θ for the given training data.
Technique
Maximum likelihood estimation
Method based on the Kullback-Leibler divergence
Hereafter, the visible variable is denoted as v
The N observed data points are labeled n = 1, 2, …, N and written v(n).
When we write “x”, we mean the combination of visible and hidden variables.
10.3.1 The case without hidden variables
Consider training a Boltzmann machine with no hidden variables.
Maximum likelihood estimation is used to choose the parameter that best explains the data.
Likelihood function
The parameter that maximizes this is
The Boltzmann machine is
When the number of data is large, the likelihood becomes a very small number
Maximizing the log likelihood
If we set θ=(θI)T for both weights and bias
Definition of Boltzmann machine
Differentiating this by the bias gives
First term on the right-hand side
The term derived from the derivative of the energy function
Second term on the right-hand side derived from the parameter derivative of the distribution function
Model average by the Boltzmann machine P(v|θ) we are considering
Similarly differentiated by weights
Substituting the data v=v(n) into the aforementioned differentiation coefficients, and adding them together for all the data
The positive phase is the data average, so use <vi>data,<vivj>data, etc.
In the end, the gradient is the above equation
Given the extremes
Between the Boltzmann machine and the sample distribution of the data, we can adjust the parameters so that the two statistics, expected value and second order moment, match.
Rewrite the equation using the empirical distribution
The mean for the empirical distribution q(v) is the sample mean of the data itself.
Data Mean Appears in the Learning Equation
10.3.2 Convexity of the Log Likelihood Function
Theoretical Aspects of Learning for Boltzmann Machines
Learning is implemented by the gradient ascent method for the log likelihood of a Boltzmann machine
The log likelihood of a Boltzmann machine (above) satisfies convexity
There is only one extreme value
Proof: omitted
Using Hölder's inequality
10.3.3 Gradient ascent method and computational complexity
A method for finding approximate numerical solutions
Update the parameters in the direction of the gradient of the log likelihood function, and find the maximum point as the point of convergence.
Each time the parameters are updated, the gradient needs to be calculated.
Obtained from the difference between the data mean and the model mean
Data mean is easy to calculate
The model mean requires calculation of the above equation and is difficult to calculate.
Combinatorial explosion occurs
Processing time with 1GHz clock when there are 100 variables
In recent years, there has been a discovery of efficient Monte Carlo methods (CD and PCD methods) to approximate the expected value of Boltzmann machines.
10.3.4 Learning by Divergence
Derived in terms of divergence rather than maximum likelihood estimation
A quantity that measures the “distance” (similarity or difference) between two probability distributions
Minimizing the divergence makes the two distributions similar.
In order to train a Boltzmann machine, the distribution should be close to the distribution of the observed data
Consider the divergence between the empirical distribution q of the data and the Boltzmann machine P
The parameter θ* that minimizes this is adopted as the optimal value.
To do so, we transform the divergence.
The first term is the entropy of the empirical distribution
It has nothing to do with the parameters of the Boltzmann machine, so it has no effect on the minimization of the divergence.
If we rewrite the second term,
it reduces to the log likelihood up to a constant factor.
That is, minimizing the divergence is exactly the same as maximizing the log likelihood.
10.3.5 The case with hidden variables
For the case with an unobserved random variable h
A Boltzmann machine with hidden variables is
Describes the simultaneous distribution P(v,h|θ) of all variables x=(v,h)
Given in the form of Gibbs distribution
What can be observed are the values of the visible variable v and its empirical distribution.
Compare the observed data with Boltzmann machine
Consider the distribution of only the visible variables by marginalizing the simultaneous distribution.
Learning means bringing the marginalized distribution P(v|θ) over the visible variables closer to the empirical distribution q(v).
Maximize the log likelihood function
θ*, which maximizes the log likelihood function, is the solution to the above equation
The weights and bias parameters are combined as θ=(θI)T=(bi,wij)T and expressed as above.
Differentiate the log likelihood by writing it specifically in terms of a distribution.
The distribution of the visible variable used for the right-hand side is the same as the above for the case with separated variables.
x is a notation for visible and hidden variables together
Differentiating this distribution by the parameter
The introduced fI is the quantity obtained by differentiating the exponent of the Gibbs distribution with respect to a parameter (a weight or a bias).
Substituting this derivative coefficient into the derivative coefficient of the log likelihood, we get
The first term on the right side is the positive phase, and the second term is the negative phase
To rewrite the negative phase, we use the above equation obtained from Z(θ)=∑xe-Φ(x,θ).
Finally, the gradient becomes the above equation
The learning equation is as above
The right-hand side of the learning equation derived from the negative phase has a combinatorial explosion problem regarding the sum of states of the Boltzmann machine.
When there are hidden variables, the left-hand side is not just the sample mean, but the expected value under the probability P(h|v,θ)q(v), which takes into account the conditional distribution obtained from the model distribution.
When the empirical distribution is expressed as a distribution of the frequency of the given data, the expected value of the positive phase becomes the above
It is expressed using the conditional prior distribution when the visible variable realizes the data value v(n).
For each data, we need to calculate the expected value of the distribution P(h|v(n),θ) and add the result across all the data.
A Boltzmann machine with hidden variables will cause further combinatorial explosion, making the calculation more difficult.
Convexity of the log likelihood is lost when hidden variables are introduced.
Iterations of the gradient ascent method are no longer guaranteed to lead to a maximum value.
10.4 Gibbs Sampling and Boltzmann Machines
Introduction.
Learning a Boltzmann machine is challenging.
Need to introduce an approximation method
Numerical Simulation of Probability Distribution with Random Numbers
Random Number Simulation
Generate a large number of random samples independently from the distribution to be studied.
The sample mean of these samples is used to approximate the expected value of the original distribution.
In other words
When we want to evaluate the expected value of f(x) for a sample P(x)
First, independent samples x(1), x(2), …, x(N) are generated from P(x).
Use the sample mean (above) as a proxy for the expected value.
By the law of large numbers, the sample mean converges to the expected value in the limit where the number of samples N goes to infinity.
Monte Carlo methods for generating samples
Markov chain Monte Carlo methods
Gibbs sampling
10.4.1 Markov Chains
Markov chain Monte Carlo method (MCMC)
Using a Markov chain to search the entire state space extensively
to generate a good sample.
Explanation of Markov chains
Stochastic process consisting of a time series of random variables
x(0) → x(1) → x(2) → … → x(t) → x(t+1) → …
When a stochastic process satisfies the following
The future state depends only on the present state and not on the past history
Mathematical expression
The conditional probability of the transition to the tth step (time t) satisfies the following property
The transition element of the stochastic process is P(x(t)|x(t-1)), which is determined only from the information of the previous step.
The distribution P(x(t)|x(t-1)) is always the same regardless of t.
Conditional prior probability P(x(t)|x(t-1))
In a Markov process, the probability distribution P(x(t)) at each step (each time) transitions in the above manner.
In general, P(x(t)) has a different distribution at each step, so strictly it should be written P(t)(x(t)) with a step label.
In the following, this label is omitted.
The formula for this transition follows immediately from the fact that P(x(t)) is obtained by marginalization and from the Markov property.
In a Markov chain, by repeatedly using the Markov property and the chain rule, the simultaneous distribution takes the above form.
This last equation is a directed graphical model of a linear graph connected in a straight line, as shown in the top.
10.4.2 Google and Markov Chains
Google’s search ranking system for web pages, PageRank, is also designed as a Markov chain.
Each web page in the world is labeled by a natural number α = 1, 2, 3, …, p.
The probability P(x(t) = α) that a certain visitor (random web surfer) visits web page α at time t
is modeled, as in the above equation, by the probabilities of having visited the various pages β = 1, 2, 3, … at time t-1
and the transition probability P(α|β) of jumping to page α by following a link from page β.
When idly surfing the web,
the page you jump to next is chosen from the links on the current page,
regardless of which page you were on a few minutes ago.
The value of the stationary distribution of this Markov chain, P(α), is called the page rank, which is a score of the importance of each page.
About Page Rank
A model of how to determine the transition probability P(α|β) for random surfers
C.D. Meyer et al.’s model of P(α|β), a modification of the original proposal by S.M. Brin and L.E. Page
If page β has links {α ∈ Aβ}, then
with probability d (0 ≤ d ≤ 1) the surfer follows one of the links.
With the remaining probability 1-d, the surfer jumps to an arbitrary page at random (e.g., by choosing a favorite bookmark or typing in a random address).
When we follow a link with ratio d to jump to another page
We assume that the transition to any linked page α ∈ Aβ is equally probable.
We write such a transition probability P(α|β) again as Gαβ
δ(α ∈ Aβ) is a quantity that is 1 when α is in the linked page and 0 otherwise.
|Aβ| is the total number of links in β.
If page β does not have a single link (when Aβ = ∅),
the surfer jumps to any page at random with equal probability.
If there are a total of p pages in the world, the probability Gαβ is given by
G = (Gαβ) is called the Google matrix.
The Google matrix defines a Markov chain that converges uniquely.
Its convergence point is the stationary distribution P(∞)(α).
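A minimal sketch of the Google matrix and its stationary distribution computed by power iteration (the link structure is a toy example, and d = 0.85 is an assumed, commonly quoted value):

```python
import numpy as np

def google_matrix(links, p, d=0.85):
    """G[a, b] = d * delta(a in A_b) / |A_b| + (1 - d)/p, with uniform jumps when A_b is empty."""
    G = np.full((p, p), (1 - d) / p)
    for b, targets in links.items():
        if targets:
            G[targets, b] += d / len(targets)
        else:
            G[:, b] = 1.0 / p              # dangling page: jump anywhere uniformly
    return G

links = {0: [1, 2], 1: [2], 2: [0], 3: []}    # toy web of p = 4 pages
G = google_matrix(links, p=4)
pi = np.full(4, 0.25)
for _ in range(100):                          # power iteration: pi <- G pi
    pi = G @ pi
print(pi)                                     # stationary distribution = PageRank scores
```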
10.4.3 Stationary Distribution
Label the various states that a random variable can take with the integers α = 1, 2, ….
Collecting the realization probabilities of each state at time t, P(x(t)=1), P(x(t)=2), P(x(t)=3), …, we can create the vector above.
This is called the state probability distribution.
The transition probability P(x(t)=α|x(t-1)=β) from state β to state α is represented by a matrix T with α and β components.
It is called the transition probability matrix.
From the above, the equation describing the transition is a linear evolution equation for the transition probability distribution.
Since we are considering a homogeneous Markov chain, the matrix T does not depend on the number of steps t, so we end up with
If this transition matrix (T)t converges to some finite matrix as t→∞, then we can expect the distribution π(t) of the Markov chain to also converge to some (equilibrium) distribution
Such a distribution should no longer be altered by the action of t
Because the convergence value should satisfy the equilibrium condition π(∞) = Tπ(∞)
If we write P(∞)(α) for the αth component of π(∞), the stationary equation becomes the above condition
This distribution is no longer changing in the transition matrix.
The Markov chain does not always converge to an invariant distribution
It converges to a unique invariant distribution when the following two conditions are satisfied
Irreducibility: any state can transition to any other state in a finite number of steps (T cannot be reduced to a smaller block structure).
Aperiodicity: the chain can never be trapped in a cycle of state transitions.
Condition for detailed balance (sufficient condition for a given distribution P to be invariant)
If P satisfying the detailed balance is obtained for a given transition matrix T
then it is the invariant distribution to which the chain converges.
10.4.4 Markov Chain Monte Carlo Method
Consider a Markov chain Monte Carlo (MCMC) method
Generate a sample from a distribution P(x) using Monte Carlo methods.
Since P(x) itself is complex, it is not easy to sample random numbers directly from it.
Suppose we have a Markov chain with this distribution as the stationary distribution.
MCMC generates a sample not from P(x) itself, but from the Markov chain P(t)(x) that converges to it.
After letting the chain run for a while, the sample can be regarded as a sample from an approximately stationary distribution.
The Markov chain used here is generally designed by utilizing the detailed balance condition.
In practice, the most difficult part is to design a Markov chain with fast convergence (fast mixing).
How to implement MCMC
Metropolis-Hastings method
Slice sampling method
Gibbs sampling method
10.4.5 Gibbs sampling and Boltzmann machines
Gibbs sampling uses a Markov chain designed in a simple way.
It also simplifies the sampling algorithm.
Consider sampling a multivariate random variable x = (x1, x2, …, xM) from its distribution.
Sampling all components at once from the simultaneous distribution P(x1, x2, …, xM) is difficult.
Therefore, sample the xi one at a time, for each i.
The order of sampling (labeling by i) can be whatever you want.
Instead of updating (x1(t), x2(t), …, xM(t)) all at once, use a Gibbs chain in which the components transition one at a time.
Monte Carlo method for sampling from multivariate distributions using MCMC, divided into small blocks
This type of sampling can be used when the shape of the conditional probability distribution can be specified.
Boltzmann machine example
Boltzmann machines satisfy the local Markov property
The values of the variables other than unit i in a Boltzmann machine are written x-i = (x1, x2, …, xi-1, xi+1, …)^T, and so on.
From the definition of conditional probability, the fully conditional distribution of xi when the values of the realizations other than unit i are given by x-i is
The factors e^-Φ appearing in the numerator and denominator cancel except for the term exp(bixi + ∑j∈Ni wij xi xj), which depends on xi.
In the end, the probability can be written compactly as above
For simplicity, we use the above.
This quantity is locally determined only from the state of the units around i
It is a kind of bias that is corrected from the surrounding information.
The conditional probability, which at first sight could not be determined without knowing all the states x-i other than i,
is in fact P(xi|x-i,θ) = P(xi|xj∈Ni,θ),
i.e. it can be determined from the states of the surrounding units j ∈ Ni alone.
Correlations between variables are only local.
This property of Boltzmann machines is called local Markovianity.
Thanks to the local Markov property
Once the states of the units in Ni are determined, xi no longer depends on the other variables.
xi and {xj | j ∉ Ni ∪ {i}} are conditionally independent given {xj | j ∈ Ni}.
The above probabilities can be efficiently calculated from only the information of Ni in i
Explanation of how to sample xi from P(xi=1|x-i,θ)=σ(λi)
Generate a uniform (pseudo)random number from the interval [0,1].
If the value of this random number is less than or equal to σ(λi), xi=1; if it is greater than σ(λi), xi=0 is the sampled value.
The values sampled in this way follow the distribution P(xi|x-i,θ).
Based on the above, let us explain the Gibbs sampling algorithm.
Suppose the random variables of the model distribution are divided into M single variables, x = (x1, x2, …, xM)^T.
The Gibbs chain (above) is created by the conditional probability P(xi|x-i,θ) created from the model distribution.
Gibbs sampling algorithm
Repeat sampling while moving i and t.
If we consider the obtained series of samples as a Markov chain, we get the above process
According to the MCMC concept, x(T) after enough time can be used as a sample from the stationary distribution.
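A minimal sketch of Gibbs sampling for a Boltzmann machine without hidden variables, using the local conditional P(xi = 1 | x-i, θ) = σ(λi) with λi = bi + Σj∈Ni wij xj:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gibbs_sample(b, W, T=1000, rng=np.random.default_rng(0)):
    """Run a Gibbs chain for T sweeps; W is symmetric with zero diagonal."""
    M = len(b)
    x = rng.integers(0, 2, size=M).astype(float)    # random binary initial state
    for _ in range(T):
        for i in range(M):                          # update one unit at a time
            lam = b[i] + W[i] @ x                   # local field from the neighbouring units
            x[i] = 1.0 if rng.random() <= sigmoid(lam) else 0.0
    return x                                        # approximately a sample from P(x|theta)
```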
Proof that the stationary distribution of the Gibbs chain is actually the original distribution (Boltzmann machine) P(x|θ)
When the Gibbs chain with M = 2
is regarded as a Markov chain, its transition probability is given by the above equation from the structure of the chain.
If x(t) follows the Boltzmann distribution P(x(t)) at time t, then the distribution at the next time is the above equation
Transforming this equation, we obtain
The invariant distribution of the Gibbs chain is now a Boltzmann machine.
Run the Markov chain until it reaches equilibrium and can be regarded as giving an invariant distribution.
It takes time.
The time it takes for a chain to lose its initial state information and reach equilibrium.
There is no way to estimate these times.
Two approaches
Generate a series of samples successively from a single chain.
Collect samples one by one from many chains run independently from a large number of random initial values.
In machine learning, an intermediate (mini-batch) approach between the two is used.
10.5 Mean Field Approximation
The Monte Carlo method is a numerical approximation
Theoretical approximation
Mean-field approximation
Why is it difficult to calculate the expected value of a Boltzmann machine?
Because the random variables associated with each unit are correlated in a complex way
Interpreted as an Ising model, it is a difficult many-body problem because of the interactions between the spins.
The term responsible for these interactions is the weight term in the energy function (Hamiltonian) that mixes the different units together.
If the weights are all zero, the distribution of the Boltzmann machine will be as above.
Simplify to a product of independent univariate distributions
In mean-field approximation
In order to simplify the expectation calculation dramatically, a set of distribution models (test distributions) in which each variable is independent is first prepared separately from the Boltzmann machine.
Among these test distributions, find the one closest to the Boltzmann machine and adopt it as an approximation of the Boltzmann machine.
How do we find the answer?
Since the test distribution is a distribution in which all variables are independent, the shape of the distribution is generally the same as above.
Qi(xi) is a certain probability distribution for one variable
We don’t know the specific form of Qi(xi) yet.
It is determined by the condition that Q(x) is closest to the Boltzmann machine.
Minimize the Kullback-Leibler divergence
The divergence between the test distribution Q and the Boltzmann machine P will be the above
We need to find the minimum value of DKL(Q||P) under the condition (above) that Q should satisfy as probability
This constrained optimization problem is solved using the method of Lagrange multipliers.
Minimize the Lagrange function L
Λ: Lagrange multiplier
Varying the Lagrangian L with respect to the multiplier Λi, the resulting Euler-Lagrange equation reproduces the constraint condition.
The Euler-Lagrange equation obtained by varying L with respect to Qi(xi) is as above.
Solving these two equations, we can solve for the distribution we are looking for.
In order to solve the form of the distribution, the first term in the right parenthesis of the right equation, k=i, is transformed as shown above.
In the part where k≠i, since the expected value is taken for all the random variables that appear, the final result will be a constant independent of the random variables.
Substituting the concrete form of the Boltzmann machine, we get
The term independent of xi in the last line is a constant determined by the form Qj≠i, etc.
Now introduce the mean value of each variable under the test distribution.
This is the mean field.
The equation for the mean-field approximation is
The form of the test distribution obtained as a solution to this equation is
The freedom in the Lagrange multiplier Λi is used to fix the proportionality constant so that the other condition, normalization (above), is satisfied.
Finally, the above distribution is obtained.
Conditional distribution of the original Boltzmann machine
The random variables other than the i of interest can be replaced by their mean values.
From the above, the form of distribution to realize mean field approximation is as above.
The distribution for xi is given by the expected value of units other than i, μj(≠i).
Calculate this expectation value
The distribution for xi is determined by the unit x(j≠i)
Determine the distribution of x(j≠I) by the expected value of units xk(≠j), including xi
In order for the above definitions to be consistent
The expectation value of xi must be consistent with the mean field µi that appears in the distribution of the other units
Assuming that μi is the calculated expected value of xi under the above
The conditions would be the above
Self-consistent equation (mean-field equation)
In terms of the sigmoid function, the formula is
In the Ising model of physics, the function that appears in the self-consistent equation is tanh(⋅).
In the case of Boltzmann machine, it is a sigmoid function
Since xi is a binary variable that can take the value 0 or 1
The self-consistent equation determines the values μi of all the mean fields.
Solving it will also give us the specific values of the distributions.
Since it is a nonlinear system of equations, it cannot be solved analytically.
Solve by successive substitution method
Start from random initial values for the expectations.
Iterate the substitution calculation in the order t = 1, 2, 3, ….
After the iteration converges, the resulting values μi (T ≫ 1) are taken as approximate solutions of the mean-field equations.
The approximation uses the independent distribution given by the above equation as a substitute for the Boltzmann machine.
This approximation simplifies the calculation of the expected value significantly.
The expectation value appearing in the learning equation simply approximates the above.
Since all variables are independent, we only need to calculate under the distribution of one variable, Qi(xi).
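A minimal sketch of solving the mean-field (self-consistent) equations μi = σ(bi + Σj wij μj) by successive substitution:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mean_field(b, W, iters=200):
    """Iterate mu_i <- sigmoid(b_i + sum_j w_ij mu_j) from a random start until (hopefully) convergence."""
    rng = np.random.default_rng(0)
    mu = rng.random(len(b))            # random initial expectations
    for _ in range(iters):
        mu = sigmoid(b + W @ mu)       # successive-substitution update of the mean fields
    return mu                          # approximate expectations <x_i>
```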
The physical interpretation is
Approximate the many-body problem of the Ising model as a collection of one-body problems.
At the cost of simplification, we lose the correlation between variables.
It is not a very accurate approximation.
A slightly relaxed version of the independence requirement has been proposed
Bethe approximation
Cluster variational method
10.6 Restricted Boltzmann Machines
Introduction.
Impose restrictions on the Boltzmann machine that mitigate the problems arising from hidden variables.
Boltzmann machine with hidden variables, with strong restrictions on graph structure.
Structural Diagram
Visible variables are white, hidden variables are gray
In RBMs, interactions between visible variables and between hidden variables are prohibited.
The energy function with this constraint is shown above.
The simultaneous distribution is
Despite the strong constraints placed on the graph structure, it is possible to approximate arbitrary distributions with desired accuracy by increasing the number of hidden variables.
One of the remarkable properties of RBMs
Conditional independence due to graph structure
Compute the distribution of the visible variables when the hidden variables are fixed.
From the multiplication formula, the above becomes
Since bTv + vTWh is a linear function of each vi, the right-hand side factorizes over the components.
As a result, the conditional distribution is
When the values of the hidden variables are fixed, the visible variables become mutually independent.
The same property holds for the hidden variables.
Bernoulli distribution given by sigmoid
However
In RBM, P(v|h,θ) and P(h|v,θ) are given by the product of the one-variable Bernoulli distribution due to conditional independence
Gibbs sampling of RBMs can be simplified by using conditional independence.
10.6.1 Learning a Restricted Boltzmann Machine
Specializing the gradient of the likelihood and the learning equation of the general Boltzmann machine gives a formula for the RBM
The gradient with respect to the bias bi of the RBM is
The gradient with respect to the bias cj is
Using conditional independence, we can simplify it as above
Therefore, the gradient is as above
The gradient with respect to the weights is the same as above
In the gradient ascent method, the parameters are brought toward the maximum by the update ∆θ = (η/N) ∂L(θ)/∂θ, so the update equation is
The learning equation is
10.6.2 Blocked Gibbs Sampling
MCMC that samples variables one by one from the full conditional distribution P(xi|x-i,θ).
We need the states x-i of every variable except the one we are trying to sample.
Local Markovianity makes this computation light
Conditional independence (above), a property of RBMs, represents a kind of local Markovianity
Since the visible variables v are completely independent after the hidden variables are fixed
After all the realizations of the hidden variables are given, sampling the visible variables is simply a matter of sampling each independent component vi from each P(vi|h,θ)
Similarly for hidden variables
By alternately Gibbs sampling the visible and hidden variables
Sampling from a simple distribution with no correlation
Blocked Gibbs sampling (blocked Gibbs sampling)
Procedure for blocked Gibbs sampling
Randomly select a binary value {0,1} for each variable to create the initial value v(0).
Calculate the distribution P(hj|v(0),θ) from this value.
Let hj(0) ∈ {0,1} be the value of hj sampled from it.
Write h(0) for the collection of these values over all j.
Then calculate the distribution P(vi|h(0),θ)
and sample vi(1) ∈ {0,1} from it to create v(1).
Repeat the sampling process.
A sequence of samples is created in this way.
After running this chain sufficiently, adopt multiple sample values (v(T), h(T)) and approximate the model mean by their sample means.
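A minimal sketch of blocked Gibbs sampling for an RBM, alternating between P(h|v) and P(v|h), each of which factorizes into independent Bernoulli variables:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def blocked_gibbs(b, c, W, T=1000, rng=np.random.default_rng(0)):
    """b: visible biases, c: hidden biases, W: (n_visible, n_hidden) weight matrix."""
    v = rng.integers(0, 2, size=len(b)).astype(float)                   # random initial v(0)
    for _ in range(T):
        h = (rng.random(len(c)) < sigmoid(c + W.T @ v)).astype(float)   # sample all h_j given v
        v = (rng.random(len(b)) < sigmoid(b + W @ h)).astype(float)     # sample all v_i given h
    return v, h
```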
It consumes a lot of computational resources.
Can be improved by the Contrastive Divergence method (CD method)
10.7 Contrastive divergence method and its theory
Introduction.
A high-performance parameter update method that greatly approximates the gradient ascent method using Gibbs sampling
CD method is a minor modification of Gibbs sampling
T-step Contrastive Divergence Method (CD-T method)
The weight parameter wij is updated as above.
The proportionality coefficient is the learning rate η
In particular, we considered the case where only one training sample, v(n), is used.
In the case of mini-batch training with multiple samples
We use the average of the training samples of this update over the (mini)batch.
In the positive phase on the right-hand side, there are no changes from the results of the exact gradient calculation.
The negative phase is an approximation of the model mean with MCMC in mind
Sampling from the empirical distribution of the data q(v) (a randomly selected training sample v(n) is used as the initial value of the chain, v(0) = v(n)).
How is the gradient ascent method by Gibbs sampling modified or changed in the CD method?
First modification
How to take the initial value of the Gibbs chain
In Gibbs sampling, the initial value of v(0) is a vector of randomly assigned binary values of 0 or 1 for each component.
The initial value of the chain in the CD method is a single sample v(0)=v(n) selected from the training data.
The chain for the CD method is as above.
Run the chain for T steps and use the collected sample values v(T) for gradient update.
The T-step CD method only needs one sample
Boldly represent the expected value with a single sample
Parameters (not only the weights) are updated in the same way with a training sample v(n).
Update the parameters as θ(t+1) ← θ(t) + ∆θ(t), with ∆θ(t) calculated using the sample v(nt).
A new training sample v(nt+1) is then selected and the update θ(t+2) ← θ(t+1) + ∆θ(t+1) is carried out in the same way; this process is repeated.
This is a kind of online learning.
The CD method is sufficient for a single chain, but can also be combined with mini-batches
100 samples at most
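A minimal sketch of a CD-1 update for an RBM, starting the chain at a training sample v(n) as described above (the learning-rate value is arbitrary):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cd1_update(v_data, b, c, W, eta=0.01, rng=np.random.default_rng(0)):
    """One CD-1 step: positive phase at the data, negative phase after a single Gibbs step."""
    ph_data = sigmoid(c + W.T @ v_data)                                  # P(h_j = 1 | v_data)
    h = (rng.random(len(c)) < ph_data).astype(float)                     # sample h(0)
    v_model = (rng.random(len(b)) < sigmoid(b + W @ h)).astype(float)    # sample v(1)
    ph_model = sigmoid(c + W.T @ v_model)                                # P(h_j = 1 | v_model)
    W += eta * (np.outer(v_data, ph_data) - np.outer(v_model, ph_model)) # positive - negative phase
    b += eta * (v_data - v_model)
    c += eta * (ph_data - ph_model)
    return W, b, c
```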
10.7.1 Why Does the Contrastive Divergence Method Work?
Theoretical explanation of why it works
Show that the CD update can actually be obtained as an approximation to the gradient of the log likelihood
The original gradient ascent method for maximum likelihood uses the parameter update wij ← wij + η ∂L/∂wij based on the gradient of the log likelihood (above)
10.7.2 Minimization of Contrastive Divergence
We will interpret the Contrastive Divergence method in terms of optimization of a special objective function.
The CD method is a learning method that approximates the gradient ascent method for an objective function called contrastive divergence.
10.7.3 Persistent Contrastive Divergence Method (PCD Method)
In online learning with the CD method,
after updating the parameter θ once, a training sample (or mini-batch) is selected again,
the update amount ∆θ is calculated, and the parameters are updated once more.
If we repeat this process, the Markov chain must be initialized with a new training sample and re-run at every update step.
Therefore the chain never reaches equilibrium, and even though it is run repeatedly, we never sample from a stationary distribution.
In the PCD method, once the Markov chain is started it is not re-initialized; it is kept running while the gradient updates are repeated.
The sample value obtained at the end of the previous run (rather than a fresh training sample, as in the CD method) serves as the initial value of the chain for the next update.
The parameter update for the tth time in the PCD-T method will be as above
Using the training sample v(nt) selected at the t-th step and the sample v(tT) from the Markov chain that keeps running independently of the training samples, an update amount similar to that of the CD method is computed
Idea of the PCD-T method
Suppose the learning rate is not very large
The values of parameters are not updated so dramatically
The value of the model distribution, P(v,h|θ+∆θ), does not change much.
The sample taken from the distribution at the previous update step is not so different from the sample from the distribution at the update step.
The Gibbs chain initialized by the samples from the previous step is also expected to mix soon.
Chains that have already reached a stationary state will remain almost stationary after updating the parameters.
A good sample can be obtained.
The PCD method tries to utilize the chains that have been run so far as much as possible.
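A corresponding sketch of a PCD-1 update: the only change from CD-1 is that the chain state v_chain is passed in, advanced by one Gibbs step, and returned for the next update instead of being reset to a training sample (helpers reused from the sketches above; names are illustrative):

```python
# One PCD-1 update: the negative phase comes from a persistent Gibbs chain.
def pcd1_update(W, b, c, v_data, v_chain, lr=0.01, rng=np.random.default_rng()):
    # Positive phase: computed from the training sample, exactly as in CD
    p_h_data = sigmoid(c + v_data @ W)
    # Negative phase: advance the persistent chain by one blocked Gibbs step
    p_h = sigmoid(c + v_chain @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    v_chain = (rng.random(b.shape) < sigmoid(b + W @ h)).astype(float)
    p_h_chain = sigmoid(c + v_chain @ W)
    # Parameter update; the advanced chain state is returned for the next update
    W = W + lr * (np.outer(v_data, p_h_data) - np.outer(v_chain, p_h_chain))
    b = b + lr * (v_data - v_chain)
    c = c + lr * (p_h_data - p_h_chain)
    return W, b, c, v_chain
```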
Methods other than PCD that speed up the mixing of Gibbs chains
Fast persistent contrastive divergence (FPCD)
Replica exchange method (parallel tempering)
10.8 Deep Belief Networks
Introduction
The field of machine learning has long been dominated by the study of simple models for which optimization of the objective function is guaranteed.
Multilayer neural networks, which are extremely difficult to learn due to the problem of local optimal solutions, have been considered impractical for practical use.
In 2006, Hinton et al. succeeded in learning a deep neural network.
This led to deep learning.
Deeper architecture
An example
Hierarchical multi-layer probabilistic model
In general, there are L layers h(1), h(2), …, h(L).
The number of units in each layer need not be the same
Characterized by a directed graph from top to bottom, except for the top layer
Only the top two layers are undirected graphs like a Boltzmann machine
There are no connections between the same layers.
A hybrid of a Bayesian net (sigmoid belief net) and a restricted Boltzmann machine
The distribution of the graphical model will be as above
W(l) is the matrix of weights connecting layers l-1 and l
We write θ as the sum of these parameters for all layers.
Bias is omitted for simplicity, but it can be included.
The visible layer, which is the lowest layer, is counted as layer 0.
Specific shape of the distribution
Top two layers (h(L-1), h(L)), coupled by the weights W(L)
Same form of distribution as restricted Boltzmann machine with two layers
Conditional probability for the middle layer
Bernoulli distribution given by sigmoid
This distribution is the same as the Bernoulli distribution given a variable value in the upper layer.
This distribution expresses how much probability the lower layer variable value is generated when the upper layer variable value is given.
The variable h(l) in the upper layer is like an explanatory factor in the deeper layer.
DBN is a generative model that goes from upper to lower layers
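For reference, this factorization is commonly written as follows (biases omitted as in the book, with h(0) identified with the visible layer v):

```latex
P(\boldsymbol v, \boldsymbol h^{(1)},\dots,\boldsymbol h^{(L)})
 = P(\boldsymbol h^{(L-1)}, \boldsymbol h^{(L)}) \prod_{l=0}^{L-2} P(\boldsymbol h^{(l)} \mid \boldsymbol h^{(l+1)}),
\qquad
P\bigl(h_j^{(l)} = 1 \mid \boldsymbol h^{(l+1)}\bigr) = \sigma\Bigl( \sum_k w_{jk}^{(l+1)} h_k^{(l+1)} \Bigr)
```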
10.8.1 Pre-training of DBNs
Training of deep architectures such as DBNs and DBMs is also done in principle by maximum likelihood estimation.
The (log) likelihood is that of the marginal distribution over the visible variables (above equation)
This (log) likelihood function is maximized
This is problematic in two ways: in terms of computational complexity due to combinatorial explosion, and the danger of falling into a local optimum.
Strategy to prepare good initial values to start learning
Learning Process
1. Pre-training
Find good initial values for parameters and set those values
2. Fine-tuning
Fine-tune the parameters using the gradient method, etc.
First, let’s discuss pre-training.
Pre-training of DBNs is done by the greedy layer-wise training method.
In this method, each pair of adjacent layers of a multilayer model is considered as an independent RBM.
Each pair is trained as an RBM and its parameters are updated.
These operations are performed sequentially from the lower layer to the upper layer.
Finally, the updated parameters obtained from each pair are used as the result of the pre-training.
Updating the approximate distribution as an RBM provides good initial values for full-scale training.
Pre-training for Deep Belief Nets
Belief nets have difficulties due to the “explaining away effect”.
Inference process (wavy line, left) and generation process (solid line, right) in the figure
Even if the component variables of h(l) are independent a priori, the posterior used for inference, P(h(l)|h(l-1)), does not factorize into independent components once h(l-1) is given; it becomes a complex distribution.
In DBN, the conditional distribution from top to bottom is given by a sigmoid and has a simple form.
Conditional probability in the opposite direction has a complex form as shown above.
During training, the pair of two layers is considered as an RBM, and the posterior probability is approximated by the sigmoidal probability as above.
Algorithm for pre-training
Set all initial values of parameters to 0 before pre-training.
Using the training data {v(n)}, update the weight parameters from the lower layer
1. The two layers v and h(1)
First, extract only (v,h(1)) and consider it as an RBM.
While training these two layers, ignore the other layers.
Since (v,h(1)) is considered as RBM, the simultaneous distribution is also approximated as RBM
To this RBM, we apply learning using the training data {v(n)} and obtain
Update the weights W(1).
2. The two layers h(1) and h(2)
Once the two layers of v and h(1) have been trained
Next, we will focus only on (h(1), h(2))
These are considered as RBMs.
Update the weights W(2) by training this RBM using CD method etc.
How do we prepare the training data to be set in the “visible layer” h(1) for this learning?
The data for h(1) are obtained by sampling h(1) from P(h(1)|v(n),W(1)) of the already-trained lower RBM, with the training samples clamped to the visible layer
3. The two layers h(l) and h(l+1)
Repeat the same training for the upper pairs (h(l), h(l+1)) with l = 2, 3, 4, …
The two layers considered as RBMs are
Update the weights W(l+1) by training them with the training data sampled by the above formula.
By dividing the DBN into RBMs and training them separately, all weights can be updated in the end.
This is a good initial value for training the entire DBN.
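A self-contained sketch of this greedy layer-wise procedure using CD-1 (biases are omitted, as in the book's simplified presentation; layer sizes, hyperparameters, and names are illustrative assumptions):

```python
# Greedy layer-wise pre-training: each adjacent pair of layers is treated as an
# RBM, trained with CD-1, and its hidden probabilities become the "data" for
# the next pair, going from the bottom layer upward.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dbn(data, layer_sizes, n_epochs=5, lr=0.01, seed=0):
    """data: (N, layer_sizes[0]) binary array; returns one weight matrix per pair."""
    rng = np.random.default_rng(seed)
    weights = []
    x = data.astype(float)                       # "visible" data for the current pair
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = 0.01 * rng.standard_normal((n_in, n_out))
        for _ in range(n_epochs):
            for v0 in x:                          # online CD-1 over the samples
                p_h0 = sigmoid(v0 @ W)
                h0 = (rng.random(n_out) < p_h0).astype(float)
                v1 = (rng.random(n_in) < sigmoid(W @ h0)).astype(float)
                p_h1 = sigmoid(v1 @ W)
                W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        weights.append(W)
        # Hidden probabilities of this pair become the data for the next pair
        x = sigmoid(x @ W)
    return weights
```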
10.8.2 Fine-tuning the DBN
The parameter values obtained in the pre-training are used as initial values to train the entire DBN.
By pre-training
The parameters roughly incorporate the information from the training data.
Building on the pre-training, the parameters are then fine-tuned to further improve performance
Methods for fine-tuning
How to convert DBN into a forward propagation neural network
Use a mean field approximation Q(h(l-1)|h(l),W(l)) for the approximate posterior probability P(h(l-1)|h(l),θ) of the DBN after pre-training
The self-consistency equation that gives the mean field µj(l) of hj(l) is
The propagation equation of a forward propagating neural network whose activation function is a sigmoid itself
As shown in the figure above, there is also a top-to-bottom generation process P(h(l-1)|h(l),W(l)), so both directions are unrolled as a neural network and replaced by an autoencoder
The DBN is thus converted into an autoencoder
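The self-consistency equation mentioned above has the familiar sigmoid form; writing μj for the mean-field value of a unit and μ'i for the mean-field values of the adjacent layer it depends on (the index convention here is an assumption), it reads:

```latex
\mu_j = \sigma\Bigl( \sum_i w_{ij}\, \mu'_i \Bigr)
```

which is exactly the propagation rule of a feed-forward network with sigmoid activations.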
10.8.3 Sampling from a DBN
DBNs in machine learning are models of generative distributions
After training, the model is used to generate realizations of the variables in the visible layer.
To do so, sampling is repeated for each layer
In the top two layers (the RBM part), we sample h(L-1) by running a blocked Gibbs chain for a while
In the lower layers, the top-down conditionals are conditionally independent sigmoid probabilities, so each layer is sampled in turn going down
From this, we finally get a sample v in the visible layer
This is called ancestral sampling or sampling from ancestors.
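A rough sketch of this ancestral sampling, reusing sigmoid(), np, and the weight list produced by the pre-training sketch above (biases omitted; names illustrative):

```python
# Ancestral sampling from a DBN: blocked Gibbs in the top RBM, then directed
# top-down sampling through the remaining layers.
def sample_dbn(weights, n_gibbs=100, seed=0):
    rng = np.random.default_rng(seed)
    W_top = weights[-1]                                   # the undirected top RBM
    h_low = rng.integers(0, 2, size=W_top.shape[0]).astype(float)
    # Blocked Gibbs chain in the top two layers
    for _ in range(n_gibbs):
        h_top = (rng.random(W_top.shape[1]) < sigmoid(h_low @ W_top)).astype(float)
        h_low = (rng.random(W_top.shape[0]) < sigmoid(W_top @ h_top)).astype(float)
    # Directed, top-down generation through the remaining layers
    h = h_low
    for W in reversed(weights[:-1]):
        h = (rng.random(W.shape[0]) < sigmoid(W @ h)).astype(float)
    return h                                              # a sample of the visible layer v
```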
10.8.4 Inference in DBNs
For a given input, compute the value of the explanatory factor (hidden variable) in the upper layer, following the arrows of the graph backwards.
Conditional probabilities (posterior probabilities) from lower to upper layers are needed
Very complex because of the explaining-away effect
The conditional distribution toward the upper layers is therefore approximated by a sigmoid probability to make sampling easy
10.9 Deep Boltzmann Machines
Introduction
DBN is a slightly artificial multi-layered architecture consisting of directed graphs and undirected RBMs stitched together.
A more natural model is a deep RBM where all links are undirected.
Deep Boltzmann machine (DBM)
The graph structure of DBM is simply the DBN graph with the orientation removed.
An example of a Boltzmann machine with a special coupling pattern.
Energy function
Gibbs distribution given by
Like RBMs, DBMs have no couplings within the same layer
There are only couplings between adjacent layers
The fully conditional distribution for each variable is
Because the middle layer is coupled to the upper and lower layers, two types of terms appear in the argument
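For a middle-layer unit this full conditional is usually written as follows (biases omitted), with one sum over the layer below and one over the layer above:

```latex
P\bigl(h_j^{(l)} = 1 \,\big|\, \boldsymbol h^{(l-1)}, \boldsymbol h^{(l+1)}\bigr)
 = \sigma\Bigl( \sum_i w_{ij}^{(l)} h_i^{(l-1)} + \sum_k w_{jk}^{(l+1)} h_k^{(l+1)} \Bigr)
```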
10.9.1 Pre-training a DBM
Similar to DBNs, DBMs are pre-trained using greedy layer-wise learning.
The technical details used are the same as for DBNs
There is only one difference
In pre-training, the original DBM graph is not used as it is, but the parameters are updated using the augmented graph.
The expansion method doubles the visible layer and the top layer by adding a copy of each layer, as shown in the figure above.
The copies of the newly added links share weights with the original ones, so
The independent parameters are not increased.
The weights in the middle layer, where no copies are introduced, are doubled in value.
We consider this new Boltzmann machine as an RBM with two layers each, and train it from the bottom to the top.
The resulting updated parameter values are the parameters of the DBM after pre-training (corresponding to the original graph).
We now explain what this layer-wise pre-training means for each pair of layers
1. The two layers v and h(1)
Focus on (v,h(1))
The full conditional distribution of these two variables is
The term shown in red in the conditional for hj(1) is the part added relative to a plain RBM
This term depends on W(2) and h(2)
In other words, the h(1) layer includes the effect of the coupling coming from above
If we extract only these two layers (v,h(1)) and consider them as RBM, the effect of the coupling from above will be lost
To compensate for the lost effect of ∑k wjk(2) hk(2), it is substituted by another copy of the coupling ∑i wij(1) vi between (v, h(1))
In other words, during pre-training, the above model is used to approximate
This is nothing but an RBM with a doubled visible layer.
2. The two layers h(l) and h(l+1)
Next, we focus on the middle layer (h(l), h(l+1))
The full conditional probabilities for these two layers are as above
including the extra term in red.
Similarly, if we take out only the two layers of h(l) and h(l+1), the connection with h(l-1) and h(l+2) will be lost.
During training, compensate as above
During pre-training, the weight W(l+1) of the middle layer is doubled to 2W(l+1).
3. The two layers h(L-1) and h(L)
Finally, we focus on the top layer (h(L-1),h(L))
These fully conditional distributions become the above equation
To compensate for the missing (red) term again, the model is modified as above during pre-training.
10.9.2 Fine-tuning the DBM
Explanation of parameter fine-tuning
The entire DBM is trained at once by maximum likelihood estimation.
Due to the computational complexity, a learning method that approximates the maximum likelihood method is necessary.
Separate the two terms that appear in the gradient of the log likelihood and discuss each
The positive phase of the gradient requires the calculation of the expected value (above equation).
To eliminate the combinatorial explosion that appears in this expectation, mean field approximation is used.
Apply mean field approximation to the distribution when the visible layer is fixed to the training data v(n)
The self-consistency equation that determines the mean field is the above equation.
This can be solved sequentially, but the bi-directional coupling significantly increases the computational complexity.
Once the mean field values are determined, the positive phase can be approximated by their sample mean.
The negative phase of the gradient is approximated by the expected value of the model distribution
approximated by Gibbs sampling.
For fine-tuning the DBM, the parameter values obtained from the pre-training are used as initial values for the gradient ascent method.
To evaluate the negative phase, we first run M independent Gibbs chains from the distribution (above) for the initial values.
The initial value for each chain is a randomly initialized value.
After burn-in, generate a sample set from each of these multiple chains.
Each sample is labeled by the index m of the chain it was drawn from.
The resulting set of M samples is used to approximate the negative phase.
Update the parameters with the above negative and positive phases.
10.9.3 Conversion to a Forward Propagation Neural Network
After training, the DBM can be used as a deterministic model by converting it into a forward-propagating neural network.
As with DBNs, the parameters can be adjusted using error back propagation using the replacement with a forward propagating neural network.
Since the graph of the DBM is undirected, some new elements enter this conversion.
The resulting forward-propagating neural network can be used as a classifier.

11 Deep Reinforcement Learning

Introduction
The algorithm learns spontaneously through trial and error, even without correctly labeled training data
Learn about deep reinforcement learning with the goal of understanding the AlphaGo algorithm.
11.1 Reinforcement Learning
Introduction
AlphaGo by Google DeepMind
In a reinforcement learning setup, the learning target behaves in various ways in an unknown environment.
Robot behavior in an unknown environment
Letting the computer learn how to play (win) games
11.1.1 Markov Decision Process (Overview of Markov Decision Processes (MDP) and Examples of Algorithms and Implementations)

Basic setup for reinforcement learning
In reinforcement learning, the sequence of events from the beginning to the end of a game is called an episode.
Consider a model with discrete time t = 0, 1, 2, …
A sequence of events
An agent observes the state s(t) of the environment at time t, and selects and executes an action a(t) according to some action principle
The environment that received the action changes its state to s(t+1)
Reward r(t+1) is returned to the agent
How is the action chosen?
It is assumed that the agent chooses according to a criterion called policy π
A policy is a representation of what action is chosen when a state is observed
The above map or probability distribution is given according to whether the model is deterministic or probabilistic.
π(a|s) is a conditional probability
In this book, policies are assumed to be stationary.
Learn π that optimizes (maximizes) the reward received from the environment
What is actually maximized is
not the immediate reward r(t),
but the total reward obtained by continuing to follow a given policy
Consider the gain (return), defined as the cumulative reward discounted to its present value (above equation).
Maximize the right equation
0<γ≤1 is the discount rate.
The discounting simplifies the calculation and incorporates unforeseen changes in the future.
Expresses the attitude of how much weight to put on future rewards
The gain R(t) is modeled as a random variable
It is the expected value that is maximized
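The gain referred to above is commonly defined as the discounted sum of future rewards:

```latex
R(t) = r(t+1) + \gamma\, r(t+2) + \gamma^2 r(t+3) + \cdots
     = \sum_{k=0}^{\infty} \gamma^{k}\, r(t+k+1), \qquad 0 < \gamma \le 1
```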
In the framework of reinforcement learning, the state of the environment, the behavior of the agent, and the reward are all treated as random variables.
Processes are placed in the stochastic model
State changes can be described by transition probabilities
Agent behavior is determined by information from the environment
The probability that the environment, which was in state s(t)=s at time t, transitions to the next state s(t+1)=s' given the agent's chosen action a(t)=a is written as the transition probability P(s'|s,a)
Assumption of Markov process
The next state is determined only by the current information, not by any earlier history.
The probability of a state transition is given by adding all possible probabilities together
The reward r(t+1) is given to the agent as soon as the state transitions to s(t+1)=s’ by the action.
From the above, the conditional expectation of the reward obtained at time t+1 becomes the above equation.
Furthermore, if we take the expectation values for possible action a and the next state, we obtain
11.1.2 Bellman Equation and Optimal Policies
The state value function: the expected value of the gain obtained in the future if we continue to follow the policy π
equation
Using the property of conditional expectation, we can transform this slightly to
Furthermore
From the definition of the gain, we have an asymptotic equation
In summary, the above equation is the Bellman equation for the state value function
This is an important equation that forms the basis for dynamic programming and other methods.
And the Bellman equation for the action value function
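For reference, these two Bellman equations are usually written as follows (writing P(s'|s,a) for the transition probability and r(s,a,s') for the expected reward, a notation assumed here):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\bigl[\, r(s,a,s') + \gamma V^{\pi}(s') \,\bigr]
\qquad
Q^{\pi}(s,a) = \sum_{s'} P(s' \mid s, a)\Bigl[\, r(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a') \,\Bigr]
```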
How are the value function and the Bellman equation useful in reinforcement learning?
Define a good strategy using the value function
Consider two policies π and π'.
Assume that the above equation is satisfied for all possible states s
Then π is defined to be a policy that is as good as or better than π'.
The optimal policy π*
Equation
Applying the Bellman equation to an optimal policy π*, we obtain
where
Bellman Optimal Equation
An important equation that is the basis for dynamic programming and other methods
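In the same notation, the Bellman optimality equation is usually written:

```latex
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\bigl[\, r(s,a,s') + \gamma V^{*}(s') \,\bigr]
```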
11.1.3 TD Error Learning
Learning the state value function Vπ(s) when the policy is fixed, and predicting the future gain
To learn, at time t, under state s(t)=s, act according to a certain strategy, and update the value function Vπ(s) based on the resulting experience (reward)
Monte Carlo method
First observe the state s(t)=s, then repeat the observation and action according to the strategy until the gain is determined
After the gain R(t) is calculated, the value of the value function is updated as in the above equation
η is the learning rate
Only a fraction η of the current estimate of the value function is replaced by the gain R(t) observed in the episode
Bring the state value function closer to the actual observed gain
The Monte Carlo method requires the value of R(t), so we need to wait until the end of the episode to do one update
This is costly.
To solve this problem, R(t) is replaced by an estimate built from the current value function
Equation
The value function is updated so as to reduce this TD error
TD method
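Minimal sketches of both updates (Monte Carlo and TD(0)), assuming V is a dict or array indexed by state; the learning rate and discount are illustrative:

```python
# Two ways of updating the state value function under a fixed policy.
def mc_update(V, s, R_t, lr=0.1):
    # Monte Carlo: move V(s) a fraction lr toward the observed episode gain R(t)
    V[s] += lr * (R_t - V[s])
    return V

def td0_update(V, s, r, s_next, lr=0.1, gamma=0.99):
    # TD(0): replace R(t) by the estimate r(t+1) + gamma * V(s(t+1))
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += lr * td_error
    return V
```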
11.1.4 Q Learning
In the TD method, Vπ(s) was learned in order to predict the gain under a given policy π
Next, we consider a method to find the optimal policy by instead learning the action value function Qπ(s,a)
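Although the detailed update rule is developed in the following sections, a minimal sketch of the tabular Q-learning update this leads to (the ε-greedy helper and the hyperparameters are illustrative assumptions) looks like:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    # Off-policy target: bootstrap with the greedy (max) action value in s'
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += lr * (target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, eps=0.1, rng=np.random.default_rng(0)):
    # Behave mostly greedily but explore with probability eps
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```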
11.2 Function Approximation and Deep Q-Net
11.2.1 Q Learning and Function Approximation
11.2.2 Deep Q Learning
11.3 Atari Game and DQN
11.4 Policy Learning
11.4.1 Policy Learning by Gradient Ascent Method
11.4.2 Proof of the Policy Gradient Theorem
11.5 Alpha Go
11.5.1 The Idea of Monte Carlo Tree Search
11.5.2 SL policy network pσ
11.5.3 Rollout policy pπ
11.5.4 RL policy network pρ
11.5.5 Value Network V
11.5.6 Monte Carlo Tree Search with Policy and Value Networks

Appendix A: Fundamentals of Probability
A.1 Random Variables and Probability Distributions
A.2 Continuous random variables and probability mass functions
A.3 Expected value and variance
A.4 Information content and divergence
Appendix B: Variational Methods
B.1 Functionals
B.2 Euler-Lagrange equation
