
From Seq2Seq to Transformer - Part.1

Kwanwoo · Wed Oct 04 2023

Seq2Seq Model

💡
[2014] Sequence to Sequence Learning with Neural Networks
The attention mechanism was proposed to improve the Seq2Seq model. To understand the attention mechanism well, you must understand the Seq2Seq model.

How the Seq2Seq Model Works

The Seq2Seq model is an RNN-based model designed for tasks such as translation and summarization, where a sequence is taken as input and a sequence is output.
Fig.01 Seq2Seq Model
This is a Seq2Seq model that translates the English sentence "I am a student." to French "Je suis étudiant." The left orange rectangle represents the encoder, and the right green rectangle represents the decoder. The encoder takes the input sentence ("I am a student.") and outputs a context vector, while the decoder takes the context vector (and the <sos> token) as input and outputs the sentence ("Je suis étudiant.").
The Seq2Seq model is characterized by separating the 'part that receives the sequence' from the 'part that outputs the sequence'. The part that takes in the sequence (the left orange RNN module) is called the encoder, and the part that outputs the sequence (the right green RNN module) is called the decoder. The encoder converts the input sequence (original text) into a fixed-size vector called the context vector. The decoder outputs the translated sequence using the context vector produced by the encoder.
In more detail, the Seq2Seq model runs as follows during inference. First, the encoder's hidden state is initialized to a suitable value (e.g., a zero vector). At each time step, the encoder reads one word (embedding) of the source sentence and updates its hidden state. After the last word of the input sequence has been read, the encoder's final hidden state is a compressed summary of the entire input. This final state is the context vector, which is passed to the decoder.
The decoder initializes its hidden state with the received context vector. At each time step, it takes the word it emitted in the previous step as input, updates its hidden state, and predicts the next word (the <sos> token serves as the input for the first step). This continues for a maximum number of steps or until the decoder emits the <eos> token that marks the end of the sequence.
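The encode-then-decode loop described above can be sketched with a toy numpy RNN. The vanilla tanh cell, the tiny vocabulary, and the random weights here are illustrative assumptions, not the architecture of the original paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<sos>", "<eos>", "je", "suis", "etudiant"]   # toy vocabulary (assumed)
V, E, H = len(vocab), 4, 8                             # vocab, embedding, hidden sizes
embed = rng.normal(size=(V, E))                        # token embedding table
W_xh, W_hh = rng.normal(size=(E, H)), rng.normal(size=(H, H))
W_hy = rng.normal(size=(H, V))                         # hidden-to-vocab projection

def rnn_step(x, h):
    # one vanilla RNN update: h' = tanh(x W_xh + h W_hh)
    return np.tanh(x @ W_xh + h @ W_hh)

def encode(src_embeddings):
    # the encoder's final hidden state is the context vector
    h = np.zeros(H)
    for x in src_embeddings:
        h = rnn_step(x, h)
    return h

def decode(context, max_len=10):
    # greedy decoding: feed each predicted token back in as the next input
    h, token, out = context, vocab.index("<sos>"), []
    for _ in range(max_len):
        h = rnn_step(embed[token], h)
        token = int(np.argmax(h @ W_hy))   # pick the most likely next token
        if vocab[token] == "<eos>":
            break
        out.append(vocab[token])
    return out

src = rng.normal(size=(4, E))   # stand-in embeddings for "I am a student ."
translation = decode(encode(src))
```

With untrained random weights the output is meaningless, but the control flow is the point: the encoder folds the whole source into one vector, and the decoder consumes only that vector plus its own previous outputs.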

Aside: Training Method of Seq2Seq Model - Teacher Forcing

When training the Seq2Seq model, a slightly different approach is needed. Feeding the decoder its own previous outputs works poorly during training: early on, those outputs are mostly wrong, so errors compound across time steps. Instead, the actual correct word from the target sequence, not the decoder's previous prediction, is used as the decoder's input. This method is called teacher forcing.
Fig.02 Seq2Seq - Teacher Forcing
Training the Seq2Seq model requires the teacher forcing method. That is, when the input is "<sos>je suis étudiant", the output should be "je suis étudiant<eos>".
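The input/target shift that teacher forcing requires can be written out directly; the token lists below are the article's running example:

```python
def teacher_forcing_pairs(target_tokens):
    # During training, the decoder is fed the ground-truth sequence shifted
    # right by <sos>, and is trained to emit the same sequence ending in <eos>.
    decoder_inputs = ["<sos>"] + target_tokens
    decoder_targets = target_tokens + ["<eos>"]
    return decoder_inputs, decoder_targets

inp, tgt = teacher_forcing_pairs(["je", "suis", "etudiant"])
# inp: ['<sos>', 'je', 'suis', 'etudiant']
# tgt: ['je', 'suis', 'etudiant', '<eos>']
```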

Limitations of the Seq2Seq Model

The Seq2Seq model showed high performance in tasks like translation and chatbots. However, it has two major limitations.
  • It compresses all the information of the input sequence into a single fixed-size vector (the context vector), so information is lost, and the loss grows as the sequence gets longer.
  • As an RNN-based model, it inevitably suffers from vanishing/exploding gradients.

Attention Mechanism

💡
[2015] Neural Machine Translation by Jointly Learning to Align and Translate
The attention mechanism was proposed to address the challenges of the Seq2Seq model. The core idea of the attention mechanism is to relieve the encoder's burden of having to encapsulate all information of the input sequence into a single fixed-sized vector (context vector). Instead of just using the encoder's final hidden state (context vector) to predict the next word in the decoder, it leverages all the hidden states from each timestep of the encoder.
Specifically, the attention mechanism assumes:
Just before the decoder outputs a word, its hidden state will be similar to the encoder's hidden state right after the encoder reads a word from the input sequence that is deeply associated with the decoder's output word.
For instance, consider translating the English sentence "I am a student." into the French "Je suis étudiant.". The output word "étudiant" (French for "student") is closely related to the input word "student", rather than to "I", "am", or "a". Under the attention mechanism's assumption, just before the decoder outputs "étudiant", its hidden state will be similar to the encoder's hidden state right after the encoder reads "student". By focusing more on that encoder state, a much higher-quality translation can be produced.
Following this assumption, the decoder in the Seq2Seq + attention model predicts the next word in the following order:
  1. To determine which encoder hidden states to focus on, it calculates the 'similarity' between the current decoder hidden state and the encoder's hidden state at every timestep.
  2. These similarities are converted into a probability distribution, and a weighted sum of the encoder's hidden states is computed with these probabilities to produce a 'refined context vector'.
  3. The next word is predicted using this 'refined context vector'.
This approach addresses both major concerns of the Seq2Seq model. Since the decoder receives not just the encoder's last hidden state but the hidden states from all timesteps, there is hardly any information loss even for long sequences. And because the decoder connects directly to every encoder timestep through the attention weights, gradients no longer have to flow through the entire RNN chain, which mitigates the vanishing/exploding gradient problem.

How the Seq2Seq + Attention Model Works

Let's delve deeper into how the Seq2Seq + Attention model operates.
Define the following notation:
  • n: the number of words (tokens) in the input sequence, i.e., the total number of encoder hidden states.
  • h_i: the encoder's hidden state at timestep i.
  • s_t: the decoder's hidden state at timestep t.
  • x_t: the decoder's input word embedding at timestep t. In the Seq2Seq model, the decoder's input is the decoder's output from the previous timestep, so x_t is also the embedding of the decoder's output word at timestep t-1.
Given these, the decoder's output word at timestep t is predicted as follows:
  1. Calculate the vector e^t = [score(s_{t-1}, h_1), ..., score(s_{t-1}, h_n)], which collects the similarities (referred to as attention scores) between each encoder hidden state and the previous decoder hidden state s_{t-1}.
      • The score function (also called the alignment model) takes two vectors as input and computes their similarity. For the types of score functions, refer to the later section.
  2. Compute the attention distribution α^t = softmax(e^t).
      • The attention distribution is the attention-score vector transformed into a probability distribution by the softmax function.
  3. Calculate the attention value a_t = Σ_{i=1}^{n} α^t_i · h_i.
      • This is the weighted sum of all encoder hidden states, using the previously calculated attention distribution as weights. The resulting vector a_t is known as the attention value.
      • Encoder hidden states with higher similarity to s_{t-1} receive larger attention scores and thus contribute more to the attention value; those with lower similarity contribute less. By referencing the attention value, the decoder focuses on the encoder hidden states that are most 'related' under the assumption above.
      • Collecting all h_i as the rows of a matrix H, the attention value can be computed in a single operation: a_t = H^T α^t.
  4. Concatenate the attention value and the decoder input to produce v_t = [a_t ; x_t]. This concatenated vector serves as the input to the RNN cell, producing the new decoder hidden state s_t. Then, exactly as in the plain Seq2Seq model, the word at timestep t is predicted from s_t.
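Steps 1-4 above fit in a few lines of numpy. Dot-product similarity is used as the score function, and the random stand-in states and sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                      # n encoder states, hidden size d (arbitrary)
H = rng.normal(size=(n, d))      # encoder hidden states h_1..h_n, one per row
s_prev = rng.normal(size=d)      # previous decoder hidden state s_{t-1}
x_t = rng.normal(size=d)         # decoder input embedding at timestep t

# 1. attention scores e^t_i = score(s_{t-1}, h_i), here a dot product
e = H @ s_prev
# 2. attention distribution: softmax over the scores
alpha = np.exp(e - e.max())
alpha /= alpha.sum()
# 3. attention value: weighted sum of the encoder states
a_t = alpha @ H
# 4. concatenate with the decoder input to form the RNN cell's input
v_t = np.concatenate([a_t, x_t])
```

From here, v_t would be fed to the decoder's RNN cell to produce s_t, just as in the plain Seq2Seq model.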

Generalization: Attention(query, key, value)

The operation of the attention mechanism can be generalized using the terms query, key, and value.
  1. Similarity Calculation: Using the score function, compute the attention scores between the query and each key.
  2. Normalization: Convert the attention scores between the query and each key into an attention distribution using the softmax function.
  3. Weighted Sum Calculation: Use the attention distribution to compute a weighted sum (attention value) of the values.
In essence, the attention operation summarizes the values for a given query. In other words, the attention operation determines which value to "focus" on. When the attention operation is performed, it focuses more on important values (those with keys that have a higher similarity to the query).
Fig.03 Attention
The attention operation can be understood as a computation that, based on the similarity between a query (orange) and keys (green), calculates a weighted sum of the values (blue) paired with the keys. The similarity between the query and the keys is calculated using a score function.
In the Seq2Seq + attention model, the decoder's hidden state (s_{t-1}) is used as the query, and all encoder hidden states (h_1, ..., h_n) are used as both keys and values.
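The three-step query/key/value recipe can be written as a single function. Dot-product scoring and the orthogonal toy keys below are assumptions chosen to keep the example deterministic:

```python
import numpy as np

def attention(query, keys, values):
    # 1. similarity: dot-product score between the query and each key
    scores = keys @ query
    # 2. normalization: softmax turns the scores into an attention distribution
    dist = np.exp(scores - scores.max())
    dist /= dist.sum()
    # 3. weighted sum: summarize the values under that distribution
    return dist @ values, dist

keys = values = np.eye(4)   # four orthogonal key/value vectors (key = value)
query = 5.0 * keys[2]       # a query strongly aligned with the third key
summary, dist = attention(query, keys, values)
# dist concentrates on index 2, so the summary is close to values[2]
```

Because the query aligns with only one key, the softmax puts nearly all of its weight there; a query between two keys would instead yield a blend of their values.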

Regarding the Names 'query', 'key', and 'value':

One of the initial confusing aspects of learning the attention mechanism is the naming. Why the terms query, key, and value? Why different names, especially when, in the Seq2Seq + attention model, the key and value are the same, and in self-attention, all three are the same?
Think of a Python dictionary: a data structure that stores key-value pairs and returns the value for a given key. Looking up a key that doesn't exist raises an error. Now imagine a "relaxed" version of this dictionary. Given a request (query), this "relaxed dictionary" compares the query against the stored keys and returns a blend of the values whose keys are most similar, even when no exact match exists. This is precisely what the attention mechanism does.
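The "relaxed dictionary" analogy can be made concrete. The numeric keys and the Gaussian-of-distance similarity below are illustrative choices, not part of the attention literature:

```python
import math

def relaxed_lookup(table, query, temperature=1.0):
    # A normal dict raises KeyError for a missing key; this "relaxed dictionary"
    # instead blends all stored values, weighting each by how close its key is
    # to the query (a softmax over negative squared distances).
    weights = {k: math.exp(-((query - k) ** 2) / temperature) for k in table}
    total = sum(weights.values())
    return sum((w / total) * table[k] for k, w in weights.items())

prices = {1.0: 10.0, 2.0: 20.0, 3.0: 30.0}   # key -> value (toy data)
mid = relaxed_lookup(prices, 2.0)       # exact hit: dominated by key 2.0
between = relaxed_lookup(prices, 2.5)   # missing key: blend of keys 2.0 and 3.0
```

Replace "closeness of numbers" with a score function on vectors and this is exactly attention: the query is softly matched against keys, and the answer is a weighted sum of values.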
 

Score Function:

Various functions can be used to calculate the similarity between a query and a key.
The simplest is the dot product: for vectors of a given norm, it is largest when the two vectors point in the same direction. Attention that scores with the dot product is termed dot-product attention or Luong attention.
Other score functions are also possible. The following table summarizes a few well-known ones:
Name                                      Formula                                        Reference
content-based attention                   score(s, h) = cosine(s, h)                     Graves, 2014
additive attention (Bahdanau attention)   score(s, h) = v^T tanh(W_1 s + W_2 h)          Bahdanau, 2015
dot-product attention (Luong attention)   score(s, h) = s^T h                            Luong, 2015
scaled dot-product attention              score(s, h) = s^T h / √d                       Vaswani, 2017

Cross Attention vs. Self Attention:

Attention operations that use the same values for key and value, but a different value for query (i.e., query ≠ key = value), are termed cross attention. The attention operation in the Seq2Seq + attention model or the attention used in the decoder of the Transformer model is cross attention.
On the other hand, attention operations that use the same value for query, key, and value (i.e., query = key = value) are termed self-attention. The attention operation used in the encoder of the Transformer model is self-attention.
 
© 2023 Wooz Labs.