About ATTENTION in Deep Learning

About “Attention Is All You Need”

“Attention Is All You Need” is the paper, published by Google researchers in 2017, that proposed the neural network model called the Transformer, described in “Overview of Transformer Models, Algorithms, and Examples of Implementations”. The paper was a breakthrough in the field of machine learning, particularly in natural language processing and deep learning, as the Transformer model it proposed brought significant improvements in accuracy. Open-source models built on this architecture are collected on Hugging Face, described in “Overview of Automatic Sentence Generation Using Huggingface”, and ChatGPT-like models are being developed from them.

Detailed explanations of this paper can be found in “[Paper] Explanation of “Attention is all you need”” and “[Paper] Explanation of Attention Is All You Need (1)”, but here I would like to provide an explanation that avoids mathematical formulas as much as possible.

The following is a summary.

Conventional neural network models use recurrent structures (RNNs), described in “Overview of RNN and examples of algorithms and implementations”, and convolutional layers (CNNs), described in “Overview of CNN and examples of algorithms and implementations”, but they are not efficient for long sequences and large data sets. This paper describes a new network architecture called the Transformer, a structure that is faster and more parallelizable than conventional models.

The Transformer model is an architecture built around the Attention mechanism, which extracts information by appropriately focusing attention on the important parts of the input. The paper proposes a special variation of Attention called Self-Attention, which extracts the information needed to process long sentences and other sequential data by calculating the relevance of each position of the input to all other positions, allowing such data to be processed efficiently.

The Transformer model consists of two main parts: the encoder and the decoder. The encoder is responsible for extracting feature representations of the input data, while the decoder generates the next output based on the encoder’s representation and previous outputs. The model uses the Attention mechanism between the input and output to ensure that important information is given appropriate attention while processing.

The characteristics of the Transformer model can be summarized as follows

  • It is an efficient method for processing long sequences and large data sets
  • Compared to RNNs and CNNs, parallel processing is possible, enabling faster learning and inference.
  • Utilizes Self-Attention to extract important information from the input
  • Demonstrates superior performance in transfer learning and learning with large data sets

The details of these techniques are described below. First, we describe the Attention mechanism and one of its variations, Self-Attention.

Overview of Attention Mechanism

Attention in deep learning is an important concept used as part of neural networks. The Attention mechanism refers to a model’s ability to assign different importance to different parts of its input, and in recent years it has been recognized as particularly useful in tasks such as natural language processing and image recognition.

Before discussing the Attention mechanism, consider the challenges of conventional deep learning. Conventional deep learning models without Attention typically receive a fixed-size input and process it as a whole, which creates difficulties when the length of the input is variable or when some parts of the input are more important than others. Such challenges appear in a variety of tasks, such as interactive natural-language input/output, word alignment in machine translation, and attention to specific regions of an image in image captioning.

Attention mechanisms are usually built into an “encoder-decoder” model, which consists of two main parts: the encoder transforms the input data into a fixed-dimensional representation, and the decoder uses that representation to generate the output. The Attention mechanism computes the relevance of the encoder’s intermediate representations to the decoder’s state and indicates (as weights) how relevant each part of the encoder is to the decoder’s current state.

This ability of the Attention mechanism to select points of attention allows the model to focus on a portion of the input, enabling it to process long input sequences and data with complex relationships effectively and to address the challenges of existing deep learning described above.

The specific steps of the Attention mechanism are as follows

  • Relevance computation: compute the relevance between the encoder’s intermediate representation and the decoder’s state. The relevance computation mainly uses inner products and similarity functions (e.g., dot product, cosine similarity).
  • Calculation of weights: Weights are calculated by normalizing the relevance. A softmax function described in “Overview of softmax functions and related algorithms and implementation examples” is usually used here. The softmax function is used to interpret the relevance values as probabilities, which can then be interpreted as weights.
  • Computing the weighted average: Compute the weighted average of the encoder’s intermediate representation and the weights. The appropriate information is extracted, paying more attention to the parts of the encoder with higher values of the weights.
  • Generate output: Using the weighted average, the decoder generates the next output. This allows the output to reflect the appropriate information while paying more attention to the important information.

These details are described below.

Calculating Relevance in Attention Mechanisms

The method of calculating relevance in the Attention mechanism is generally done using inner products and similarity functions. The specific procedure is as follows.

  1. Obtain the intermediate representation of the encoder and the state of the decoder. The encoder usually takes as input the series data or feature representation of the image and generates an intermediate representation. The decoder state is information such as the current state and output history of the decoder.
  2. To compute relevance, an inner product or similarity function is computed between the encoder’s intermediate representation and the decoder’s state. The inner product is achieved by taking the element-wise product of the intermediate representation and the state and calculating their sum. Similarity functions are used to compute the similarity between intermediate representations and states. Typical similarity functions include dot product and cosine similarity.
  3. To normalize the results of inner products and similarity functions, softmax functions are applied. The softmax function allows the relevance values to be interpreted as probabilities. The softmax function normalizes the relevance values so that they sum to 1.
  4. After applying the softmax function, weights are assigned to each encoder’s intermediate representation. The weights indicate the portion of the encoder that the decoder should focus on. A higher weight means that more attention is given to that encoder’s part.

The method of calculating relevance may vary depending on the task and model design. For example, in the Transformer model, relevance is obtained by computing a matrix product (inner product) between the encoder and decoder representations and applying a softmax function, which is used to weight the encoder’s intermediate representation with respect to each decoder position. Other methods include Global/Local Attention, which computes relevance over different ranges of the input; Context Attention, which computes relevance while taking contextual information into account; and Key-Value Attention, which computes relevance based on key/value pairs.
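
As a small sketch of this relevance computation (the vectors below are made up purely for illustration), relevance between a decoder state and each encoder position can be computed with an inner product or with cosine similarity:

import torch

# Hypothetical encoder intermediate representations: 4 positions, dimension 3
encoder_states = torch.tensor([[1.0, 0.0, 0.5],
                               [0.2, 1.0, 0.1],
                               [0.9, 0.3, 0.7],
                               [0.0, 0.5, 1.0]])

# Hypothetical current decoder state, dimension 3
decoder_state = torch.tensor([1.0, 0.2, 0.4])

# Relevance by inner (dot) product: one score per encoder position
dot_scores = encoder_states @ decoder_state  # shape (4,)

# Relevance by cosine similarity, an alternative similarity function
cosine_scores = torch.cosine_similarity(encoder_states, decoder_state.unsqueeze(0), dim=1)

print(dot_scores, cosine_scores)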

How to calculate weights in the Attention mechanism

The procedure for computing weights in the Attention mechanism is generally performed using a softmax function. The procedure for computing weights is described below.

  • Relevance computation: The relevance between the encoder’s intermediate representation and the decoder’s state is computed. This relevance is computed using inner products and similarity functions (e.g., dot product, cosine similarity).
  • Applying a softmax function: To normalize the association, a softmax function is applied. The softmax function is used to interpret the relevance value as a probability and is represented by the following equation

\[
w_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)}
\]

where \(w_i\) is the output of the softmax function and \(e_i\) is the relevance value for the i-th position. The softmax function normalizes the relevance values so that they sum to 1. This assigns a weight to the intermediate representation of each encoder.

  • Use of weights: The weights obtained from the softmax function indicate the importance of each encoder’s intermediate representation. More attention is paid to the parts of the encoder with higher weights, which allows the decoder to extract the appropriate information from the parts of the encoder that need attention.

Although the procedure for computing weights is generally as described above, subtle changes are often made depending on the actual model and task. For example, in some models additional nonlinear transformations (e.g., a multilayer perceptron) are applied to the relevance values, and beyond that, various variants and refinements of the Attention mechanism have been proposed to improve model performance.
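
As a small numeric illustration of this weight computation (the relevance values below are made up), the softmax turns raw relevance values into weights that sum to 1:

import torch

relevance = torch.tensor([2.0, 1.0, 0.1])   # hypothetical relevance values e_i
weights = torch.softmax(relevance, dim=0)   # w_i = exp(e_i) / sum_j exp(e_j)
print(weights)                              # approximately [0.659, 0.242, 0.099]
print(weights.sum())                        # 1.0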

Calculation of the weighted average in the Attention mechanism

The computation of the weighted average in the Attention mechanism refers to averaging the encoder’s intermediate representations using the weights assigned to them. The procedure for computing the weighted average is described below.

  • Obtain the encoder’s intermediate representation and weights. Encoders usually receive series data or feature representations of images as input and generate intermediate representations. Weights are computed by the Attention mechanism and assigned to each encoder’s intermediate representation.
  • The weights are multiplied to the intermediate representation of each encoder. This yields a weighted intermediate representation.
  • A weighted average of the weighted intermediate representations is computed. Typically, the weighted average is calculated by the following equation

\[
c = \sum_{i} w_i h_i
\]

where \(w_i\) is the weight assigned to the intermediate representation of the i-th encoder position and \(h_i\) is the corresponding intermediate representation.

The calculation of this weighted average allows the appropriate information to be extracted while more attention is given to the portion of the encoder with the greater weight. This weighted average is used by the decoder to produce the next output.

Note that the above calculation procedure for the weighted average is a general one, and may be slightly modified depending on the actual model and task. Also, different variations of the Attention mechanism may be processed differently in place of the weighted average.
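
A small numeric sketch of this weighted average (the weights and representations below are made up for illustration):

import torch

# Hypothetical encoder intermediate representations h_i: 3 positions, dimension 4
encoder_states = torch.tensor([[1.0, 0.0, 0.5, 0.2],
                               [0.2, 1.0, 0.1, 0.4],
                               [0.9, 0.3, 0.7, 0.6]])

# Hypothetical weights w_i from the softmax step (they sum to 1)
weights = torch.tensor([0.6, 0.3, 0.1])

# Weighted average c = sum_i w_i * h_i -> the context vector passed to the decoder
context = (weights.unsqueeze(1) * encoder_states).sum(dim=0)  # shape (4,)
print(context)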

Generating Outputs in the Attention Mechanism

The generation of outputs in the Attention mechanism is done using the weighted average of the encoder’s intermediate representations. The procedure for generating outputs is described below.

  1. Obtain an intermediate representation of the weighted encoders. This is the weighted average of the intermediate representation of the encoders weighted by the Attention mechanism. More attention is paid to the part of the encoder with the higher weights, and the appropriate information is aggregated.
  2. The intermediate representation of the weighted average obtained is passed to the decoder as input. The decoder uses the intermediate representation of the weighted average to produce the next output.
  3. Based on the intermediate representation of the weighted average and its own state (e.g., past outputs), the decoder performs processing appropriate to the task and generates the next output. Decoders usually have a recurrent structure (RNN) or a Transformer-style structure and predict the next output by referring to previous outputs and internal states.
  4. Each time the decoder generates an output, it updates its own state as the next input and accumulates information to generate the next output. This enables the generation of series data (e.g., text, audio, time-series data, etc.).

The generation of output can be more appropriately predicted and generated because the information of interest is provided to the decoder through the Attention mechanism. The Attention mechanism is also characterized by its ability to effectively process long sequences and complex data by appropriately focusing attention on the important parts of the input.

However, the specific output-generation procedure may differ from model to model and task to task. For example, in natural language processing tasks it is common for the decoder to predict words or characters, while in image generation tasks the decoder generates pixel values for the image, and so on, as appropriate for the task.
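
As a rough illustration of how the context vector from the Attention step might be used, the following is a minimal sketch assuming a hypothetical GRU-based decoder and a small vocabulary; for simplicity only the context vector is fed into the GRU cell, whereas an actual model would normally also feed an embedding of the previous output:

import torch
import torch.nn as nn

hidden_dim, context_dim, vocab_size = 8, 8, 100  # hypothetical sizes

# Hypothetical decoder: a GRU cell plus an output projection over the vocabulary
gru_cell = nn.GRUCell(input_size=context_dim, hidden_size=hidden_dim)
output_layer = nn.Linear(hidden_dim + context_dim, vocab_size)

decoder_state = torch.zeros(1, hidden_dim)  # previous decoder state
context = torch.randn(1, context_dim)       # weighted average from the Attention step

# Update the decoder state with the context and predict the next token
decoder_state = gru_cell(context, decoder_state)
logits = output_layer(torch.cat([decoder_state, context], dim=-1))
next_token = logits.argmax(dim=-1)           # greedy choice of the next output
print(next_token)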

Python implementation of the Attention mechanism

Below is a concrete example of the Attention mechanism in Python.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, query_dim, key_dim, value_dim):
        super(Attention, self).__init__()
        self.query_dim = query_dim
        self.key_dim = key_dim
        self.value_dim = value_dim

        self.query_layer = nn.Linear(query_dim, key_dim)
        self.key_layer = nn.Linear(key_dim, key_dim)
        self.value_layer = nn.Linear(value_dim, value_dim)  # values already have value_dim as their last dimension

    def forward(self, query, keys, values):
        """
        query: shape (batch_size, query_dim)
        keys: shape (batch_size, seq_length, key_dim)
        values: shape (batch_size, seq_length, value_dim)
        """
        batch_size, seq_length, _ = keys.size()

        # Convert query to key_dim dimension
        query = self.query_layer(query)  # shape (batch_size, key_dim)

        # Convert Keys to key_dim dimension
        keys = self.key_layer(keys)  # shape (batch_size, seq_length, key_dim)

        # Convert Values to value_dim dimension
        values = self.value_layer(values)  # shape (batch_size, seq_length, value_dim)

        # Calculate Attention score
        attention_scores = torch.matmul(query.unsqueeze(1), keys.transpose(1, 2))  # shape (batch_size, 1, seq_length)

        # Normalized Attention score
        attention_scores = F.softmax(attention_scores, dim=2)  # shape (batch_size, 1, seq_length)

        # Calculate weighted sum
        attended_values = torch.matmul(attention_scores, values)  # shape (batch_size, 1, value_dim)

        # SQUEEZE the result and RETURN
        return attended_values.squeeze(1)  # shape (batch_size, value_dim)

The above example defines a class named Attention, where query_dim, key_dim, and value_dim represent the sizes of the respective dimensions. In the forward method, the Attention mechanism is applied to the given query, keys, and values.

The shapes of the forward method’s inputs and outputs are described in the comments; as input, query has the shape (batch size, query_dim), keys has the shape (batch size, sequence length, key_dim), and values has the shape (batch size, sequence length, value_dim).

In this implementation, the Attention score is obtained by computing the inner product of query and keys. The Attention score is then normalized and a weighted sum is calculated to obtain the final output.
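
A minimal usage sketch of the Attention class above (the tensor sizes are arbitrary and chosen only for illustration):

import torch

batch_size, seq_length = 2, 5
query_dim, key_dim, value_dim = 16, 16, 32

attention = Attention(query_dim, key_dim, value_dim)
query = torch.randn(batch_size, query_dim)
keys = torch.randn(batch_size, seq_length, key_dim)
values = torch.randn(batch_size, seq_length, value_dim)

attended = attention(query, keys, values)
print(attended.shape)  # torch.Size([2, 32])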

Variations of the Attention mechanism

Many variations of the Attention mechanism exist. Some of the major variations are described below.

  • Dot Product Attention: Calculates the inner product of the intermediate representations of the encoder and decoder to find relevance. This variation is widely used in the Transformer model.
  • Additive Attention: The intermediate representations of the encoder and decoder are combined and a nonlinear transformation (usually a small neural network) is applied to compute relevance. This variant is suited to tasks involving structured data and images.
  • Multiplicative Attention: Computes relevance through a multiplicative interaction between the intermediate representations of the encoder and decoder, for example an inner product after a linear projection. This is sometimes used as an alternative to additive Attention.
  • Self-Attention: Self-Attention is used to compute the relevance between elements in the series data. This method calculates the relevance between elements in an encoder and aggregates the information by highlighting the important elements. It is widely used in the Transformer model and has been highly successful in natural language processing tasks.
  • Multi-head Attention: Multi-head Attention uses multiple Attention mechanisms in parallel to capture important parts of different representations. Each head has its own weight matrices and extracts different information.

Although these are some of the major variations, various improvements and applications of the Attention mechanism have been proposed. For example, there are approaches that make the processing of long sequence data more efficient, such as scaling self-attention over time or combining it with positional encoding, and customized architectures and mechanisms have also been proposed for different tasks.

Next, we discuss Self-Attention, the subject of the “Attention Is All You Need” paper, in more detail.

Self-Attention

Self-Attention is a type of Attention mechanism that computes the relevance of each position of the input to all other positions, and it is widely used in natural language processing tasks, especially in the Transformer model.

The conventional Attention mechanism computes the relevance of the intermediate representations of the encoder and decoder and weights them. Self-Attention, on the other hand, calculates the relevance of the intermediate representation of the input itself. This allows each position to have relevance to itself and other positions.

The basic procedure of Self-Attention is as follows.

  1. The sequence given as input is represented as a set of vectors. Each element is called a position vector and corresponds to each position in the original sequence.
  2. Three different linear transformations (Wq, Wk, Wv) are applied to the position vectors to produce a query vector (Query), a key vector (Key), and a value vector (Value).
  3. The inner product of the query vector and key vector is computed to obtain a relevance score. The relevance score represents the degree of similarity between the query vector and the key vector.
  4. The relevance score is normalized by a softmax function and weights are calculated. This expresses the importance of each position.
  5. The weighted value vectors are multiplied by the weights to compute a weighted average. This average becomes the context vector extracted by Self-Attention.

Self-Attention allows each element in a sequence to extract information while considering the relationship between itself and other elements, thereby automatically learning important elements and dependencies and enabling appropriate abstraction of the input representation. It is also efficient in processing long sequences and large data sets because of its parallel computing capability.

Python implementation of Self-Attention

An example implementation of Self-Attention in Python is described below.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SelfAttention, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim

        self.query_layer = nn.Linear(input_dim, hidden_dim)
        self.key_layer = nn.Linear(input_dim, hidden_dim)
        self.value_layer = nn.Linear(input_dim, hidden_dim)

    def forward(self, inputs):
        """
        inputs: shape (batch_size, seq_length, input_dim)
        """
        batch_size, seq_length, _ = inputs.size()

        # Convert query to hidden_dim dimension
        queries = self.query_layer(inputs)  # shape (batch_size, seq_length, hidden_dim)

        # Convert Keys to hidden_dim dimension
        keys = self.key_layer(inputs)  # shape (batch_size, seq_length, hidden_dim)

        # Convert Values to hidden_dim dimensions
        values = self.value_layer(inputs)  # shape (batch_size, seq_length, hidden_dim)

        # Calculate Attention score
        attention_scores = torch.matmul(queries, keys.transpose(1, 2))  # shape (batch_size, seq_length, seq_length)

        # Normalized Attention score
        attention_probs = torch.softmax(attention_scores, dim=-1)  # shape (batch_size, seq_length, seq_length)

        # Calculate weighted sum
        attended_values = torch.matmul(attention_probs, values)  # shape (batch_size, seq_length, hidden_dim)

        # RETURN RESULT
        return attended_values

In the above example, a class named SelfAttention is defined, where input_dim is the dimensionality of the input and hidden_dim is the dimensionality of the internal representation of Self-Attention. The forward method applies Self-Attention to the input sequence inputs.

The shape of the inputs and outputs of the forward method are described in the comments. As input, inputs is assumed to be a tensor with shape (batch size, sequence length, input_dim).

In this implementation, Query, Key, and Value are computed for each element of the input sequence to obtain an Attention score. The Attention score is then normalized and a weighted sum is calculated to obtain the final output.
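
A minimal usage sketch of the SelfAttention class above (the tensor sizes are arbitrary, for illustration):

import torch

batch_size, seq_length, input_dim, hidden_dim = 2, 5, 16, 32

self_attention = SelfAttention(input_dim, hidden_dim)
inputs = torch.randn(batch_size, seq_length, input_dim)
outputs = self_attention(inputs)
print(outputs.shape)  # torch.Size([2, 5, 32])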

The implementation of Attention mechanisms other than Self-Attention is also described below.

About Dot Product Attention

Dot Product Attention calculates the inner product between Query and Key to obtain an Attention score. An example implementation in Python is described below.

import torch
import torch.nn as nn

class DotProductAttention(nn.Module):
    def __init__(self):
        super(DotProductAttention, self).__init__()

    def forward(self, query, keys, values):
        """
        query: shape (batch_size, query_length, hidden_dim)
        keys: shape (batch_size, key_length, hidden_dim)
        values: shape (batch_size, key_length, value_dim)
        """
        batch_size, query_length, _ = query.size()
        key_length = keys.size(1)

        # Calculate inner product of Query and Key
        attention_scores = torch.matmul(query, keys.transpose(1, 2))  # shape (batch_size, query_length, key_length)

        # Normalized Attention score
        attention_probs = torch.softmax(attention_scores, dim=-1)  # shape (batch_size, query_length, key_length)

        # Calculate weighted sum
        attended_values = torch.matmul(attention_probs, values)  # shape (batch_size, query_length, value_dim)

        # RETURN RESULT
        return attended_values, attention_probs

In the above example, a class named DotProductAttention is defined, and the forward method applies DotProductAttention to the given Query, Keys, and Values.

The input and output shapes of the forward method are described in the comments. As input, query is assumed to be a tensor with shape (batch size, query length, hidden dimension), keys is assumed to be a tensor with shape (batch size, key length, hidden dimension), and values is assumed to be a tensor with shape (batch size, key length, value dimension).

In this implementation, the Attention score is obtained by computing the inner product of Query and Key. The Attention score is then normalized and a weighted sum is calculated to obtain the final output. The Attention score is also output.
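
Note that in the “Attention Is All You Need” paper, the dot-product scores are additionally divided by the square root of the key dimension before the softmax (scaled dot-product attention) so that the inner products do not grow too large. A minimal sketch of that scaling, as an optional adjustment to the score computation above:

import math
import torch

def scaled_dot_product_scores(query, keys):
    """query: (batch_size, query_length, hidden_dim), keys: (batch_size, key_length, hidden_dim)"""
    d_k = query.size(-1)
    # Divide the raw inner products by sqrt(d_k), as in the original Transformer paper
    return torch.matmul(query, keys.transpose(1, 2)) / math.sqrt(d_k)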

About Additive Attention

Additive Attention uses a nonlinear function of Query and Key to compute Attention scores. An example Python implementation is shown below.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, query_dim, key_dim, hidden_dim):
        super(AdditiveAttention, self).__init__()
        self.query_dim = query_dim
        self.key_dim = key_dim
        self.hidden_dim = hidden_dim

        self.query_layer = nn.Linear(query_dim, hidden_dim)
        self.key_layer = nn.Linear(key_dim, hidden_dim)
        self.energy_layer = nn.Linear(hidden_dim, 1)

    def forward(self, query, keys, values):
        """
        query: shape (batch_size, query_length, query_dim)
        keys: shape (batch_size, key_length, key_dim)
        values: shape (batch_size, key_length, value_dim)
        """
        batch_size, query_length, _ = query.size()
        key_length = keys.size(1)

        # Convert Query and Key to hidden_dim dimension
        processed_query = self.query_layer(query)  # shape (batch_size, query_length, hidden_dim)
        processed_keys = self.key_layer(keys)  # shape (batch_size, key_length, hidden_dim)

        # Calculate Attention score: add every query position to every key position
        # (broadcast to shape (batch_size, query_length, key_length, hidden_dim))
        energy = torch.tanh(processed_query.unsqueeze(2) + processed_keys.unsqueeze(1))
        attention_scores = self.energy_layer(energy).squeeze(-1)  # shape (batch_size, query_length, key_length)

        # Normalize Attention scores
        attention_probs = torch.softmax(attention_scores, dim=-1)  # shape (batch_size, query_length, key_length)

        # Calculate weighted sum
        attended_values = torch.matmul(attention_probs, values)  # shape (batch_size, query_length, value_dim)

        # RETURN RESULT
        return attended_values, attention_probs

In the above example, a class named AdditiveAttention is defined, where query_dim, key_dim, and hidden_dim represent the dimension sizes. The forward method applies Additive Attention to the given Query, Keys, and Values.

The input and output shapes of the forward method are described in the comments. As input, query is assumed to be a tensor with shape (batch size, query length, query dimension), keys is assumed to be a tensor with shape (batch size, key length, key dimension), and values is assumed to be a tensor with shape (batch size, key length, value dimension).

In this implementation, Query and Key are projected to a common hidden dimension, added for every query/key position pair, and passed through a tanh nonlinearity; the resulting energy is projected to a single value to obtain the Attention score. The Attention scores are then normalized and a weighted sum is calculated to obtain the final output, and the Attention scores are also returned.
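
A minimal usage sketch of the AdditiveAttention class above (the tensor sizes are arbitrary, for illustration; note that the query and key lengths may differ):

import torch

batch_size, query_length, key_length = 2, 3, 5
query_dim, key_dim, value_dim, hidden_dim = 16, 16, 32, 8

additive_attention = AdditiveAttention(query_dim, key_dim, hidden_dim)
query = torch.randn(batch_size, query_length, query_dim)
keys = torch.randn(batch_size, key_length, key_dim)
values = torch.randn(batch_size, key_length, value_dim)

attended, probs = additive_attention(query, keys, values)
print(attended.shape, probs.shape)  # torch.Size([2, 3, 32]) torch.Size([2, 3, 5])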

About Multiplicative Attention

Multiplicative Attention projects the Query into the Key dimension and calculates the inner product between Query and Key to obtain an Attention score. An example implementation in Python is shown below.

import torch
import torch.nn as nn

class MultiplicativeAttention(nn.Module):
    def __init__(self, query_dim, key_dim):
        super(MultiplicativeAttention, self).__init__()
        self.query_dim = query_dim
        self.key_dim = key_dim

        self.query_layer = nn.Linear(query_dim, key_dim)

    def forward(self, query, keys, values):
        """
        query: shape (batch_size, query_length, query_dim)
        keys: shape (batch_size, key_length, key_dim)
        values: shape (batch_size, key_length, value_dim)
        """
        batch_size, query_length, _ = query.size()
        key_length = keys.size(1)

        # Convert Query to Key dimension
        processed_query = self.query_layer(query)  # shape (batch_size, query_length, key_dim)

        # Calculate inner product of Query and Key
        attention_scores = torch.matmul(processed_query, keys.transpose(1, 2))  # shape (batch_size, query_length, key_length)

        # Normalized Attention score
        attention_probs = torch.softmax(attention_scores, dim=-1)  # shape (batch_size, query_length, key_length)

        # Calculate weighted sum
        attended_values = torch.matmul(attention_probs, values)  # shape (batch_size, query_length, value_dim)

        # RETURN RESULT
        return attended_values, attention_probs

In the above example, a class named MultiplicativeAttention is defined, where query_dim and key_dim represent the dimension sizes. The forward method applies Multiplicative Attention to the given Query, Keys, and Values.

The shapes of the forward method’s inputs and outputs are described in the comments. As input, query is assumed to be a tensor with shape (batch size, query length, query dimension), keys is assumed to be a tensor with shape (batch size, key length, key dimension), and values is assumed to be a tensor with shape (batch size, key length, value dimension).

In this implementation, Query is converted to the dimension of Key, and the Attention score is obtained by computing the inner product of Query and Key. The Attention score is then normalized and a weighted sum is calculated to obtain the final output, which also outputs the Attention score.

About Multi-head Attention

Multi-head Attention provides richer expressiveness and flexibility by using multiple attention heads in parallel. An example implementation in Python is shown below.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads

        # Define different weight matrices for each head
        self.query_layers = nn.ModuleList([nn.Linear(input_dim, hidden_dim) for _ in range(num_heads)])
        self.key_layers = nn.ModuleList([nn.Linear(input_dim, hidden_dim) for _ in range(num_heads)])
        self.value_layers = nn.ModuleList([nn.Linear(input_dim, hidden_dim) for _ in range(num_heads)])

        # Linear transformation layer for combining head results
        self.linear = nn.Linear(hidden_dim * num_heads, hidden_dim)

    def forward(self, inputs):
        """
        inputs: shape (batch_size, seq_length, input_dim)
        """
        batch_size, seq_length, _ = inputs.size()

        # Calculate Query, Key, and Value for each head
        queries = [query_layer(inputs) for query_layer in self.query_layers]  # List of tensors with shape (batch_size, seq_length, hidden_dim)
        keys = [key_layer(inputs) for key_layer in self.key_layers]  # List of tensors with shape (batch_size, seq_length, hidden_dim)
        values = [value_layer(inputs) for value_layer in self.value_layers]  # List of tensors with shape (batch_size, seq_length, hidden_dim)

        # Calculate Attention score per head
        attention_scores = [torch.matmul(queries[i], keys[i].transpose(1, 2)) for i in range(self.num_heads)]  # List of tensors with shape (batch_size, seq_length, seq_length)

        # Normalized Attention score per head
        attention_probs = [torch.softmax(score, dim=-1) for score in attention_scores]  # List of tensors with shape (batch_size, seq_length, seq_length)

        # Calculate weighted sum per head
        attended_values = [torch.matmul(attention_probs[i], values[i]) for i in range(self.num_heads)]  # List of tensors with shape (batch_size, seq_length, hidden_dim)

        # Combine head results and apply linear transformation
        concatenated_values = torch.cat(attended_values, dim=-1)  # shape (batch_size, seq_length, hidden_dim * num_heads)
        output = self.linear(concatenated_values)  # shape (batch_size, seq_length, hidden_dim)

        # RETURN RESULT
        return output

The above example defines a class named MultiHeadAttention, where input_dim is the number of dimensions of the input, hidden_dim is the number of dimensions of the hidden layer of each head, and num_heads is the number of heads.

In the forward method, multi-head Attention is applied to the given input. The input inputs is assumed to be a tensor of shape (batch size, sequence length, input dimension).

In Multi-Head Attention, Query, Key, and Value are computed for each head to obtain an Attention score. For each head, inner product, normalization, and weighted sum are performed, and the results are combined to obtain the final output. The above implementation omits elements such as position encoding and residual connections, but these could be added to make the model fully functional as a Transformer model.
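
A minimal usage sketch of the MultiHeadAttention class above (the tensor sizes are arbitrary, for illustration):

import torch

batch_size, seq_length = 2, 5
input_dim, hidden_dim, num_heads = 16, 8, 4

multi_head_attention = MultiHeadAttention(input_dim, hidden_dim, num_heads)
inputs = torch.randn(batch_size, seq_length, input_dim)
output = multi_head_attention(inputs)
print(output.shape)  # torch.Size([2, 5, 8])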

About Softmax Attention

Softmax Attention uses a softmax function to calculate the relevance of the encoder’s intermediate representation to the decoder’s state and to determine the weights. The steps of the Softmax Attention algorithm are shown below.

  1. The intermediate representation of the encoder is received as input and its relevance to the current state of the decoder is computed. The relevance is usually computed using inner products or similarity functions (e.g., dot product or cosine similarity).
  2. To normalize the obtained relevance, a softmax function is applied. The softmax function is used to interpret the relevance values as probabilities. This allows the sum of the associations to be equal to 1 and interpreted as weights.
  3. After applying the softmax function, weights are assigned to each encoder’s intermediate representation. The weights indicate the parts of the encoder that the decoder should focus on.
  4. The decoder calculates a weighted average of the encoder’s intermediate representation and weights and uses this weighted average to generate the next output. More attention can be paid to the parts of the encoder with higher weights, and the appropriate information can be extracted and reflected in the output.

Softmax Attention is widely used as a flexible mechanism for selecting portions of interest because it expresses the importance to different parts of the input as probabilities of continuous values. However, care must be taken in the case of long sequences or large input data, as the computational cost may increase.
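
As a compact sketch that ties these four steps together with PyTorch tensors (the sizes are arbitrary, for illustration):

import torch

batch_size, seq_length, dim = 2, 6, 8

encoder_states = torch.randn(batch_size, seq_length, dim)  # encoder intermediate representations
decoder_state = torch.randn(batch_size, dim)               # current decoder state

# 1. Relevance: inner product between the decoder state and each encoder position
relevance = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # (batch, seq)

# 2. Softmax: interpret the relevance values as probabilities (weights sum to 1 per example)
weights = torch.softmax(relevance, dim=-1)  # (batch, seq)

# 3. Weighted average of the encoder intermediate representations
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)  # (batch, dim)

# 4. The context vector is passed to the decoder to generate the next output
print(context.shape)  # torch.Size([2, 8])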

Reference Information and Reference Books

For details on automatic generation by machine learning, see “Automatic Generation by Machine Learning”.

Reference books include “Natural Language Processing with Transformers, Revised Edition”,

“Transformers for Machine Learning: A Deep Dive”,

“Transformers for Natural Language Processing”, and

“Vision Transformer入門 Computer Vision Library”.
