Transformer (Language Model): Detailed Explanation

The Illustrated Transformer

Nimasha Attanayake
10 min read · Sep 5, 2024

Introduction

Transformers have completely changed the game in artificial intelligence. They process sequential data efficiently and excel at handling complex relationships across a sequence. The idea was introduced in a landmark 2017 paper called “Attention Is All You Need” by Vaswani et al. Instead of relying on traditional methods like recurrent or convolutional layers, Transformers use something called self-attention, which figures out which parts of the input are most important for generating the output. The cool thing is, they can do this for any two parts of the input, no matter how far apart they are, which makes them great at capturing long-range relationships in the data. Transformers are now the go-to architecture for cutting-edge AI models across many fields, from language processing to image recognition. Their ability to handle complex data and deliver outstanding performance has made them a fundamental tool in modern AI research and applications.

To dive deeper into how Transformers work, let’s break down their architecture.

Model Architecture

A High-Level Look

Let’s consider an example of machine translation. Imagine that we’re translating a French sentence (‘Je suis étudiant’) into English (‘I am a student’). Let’s initiate our exploration by considering the model as a black box.

Figure 1: Transformer as a black box

Opening up the black box, we can see there’s an encoding part, a decoding part, and connections linking them together.

Figure 2: Encoding and decoding components inside the transformer

According to the original research paper (Attention Is All You Need), both the encoder and decoder are composed of a stack of six identical layers (Figure 3). However, this is a hyperparameter, and one can experiment with other arrangements.
The output of the final Encoder in the stack is passed to the Decoders to guide the generation of the output sequence.

Figure 3: Encoder and Decoder Layers

This is just a high-level overview. Now, it’s time to dive deeper into each transformer component.

Encoder

The encoder’s job is to process the input sequence and create a representation that the decoder can use to generate the output sequence.

First, the input sequence goes through Input Embedding and Position Encoding, which generates an encoded version of each word, capturing both its meaning and position in the sequence.

All encoders have the same structure and are made up of two main parts: the Self-Attention layer and the Feed-Forward Neural Network.

Figure 4: Main sub-layers in the Encoders

Initially, the encoder processes its inputs using a self-attention layer, which helps it understand the relationships between words in a sentence as it encodes each word. After this, the output from the self-attention layer is sent through a feed-forward neural network, which is applied independently to each position in the sequence.

To make the model train better, the researchers added residual connections: each of the two sub-layers (self-attention and feed-forward) has a residual connection around it, followed by layer normalization, which helps stabilize training.

Figure 5: Encoder of the Transformer

Now that you have some idea of what an encoder is, let’s take a closer look at the encoding process.

Process of Encoding

Embedding

First, we tokenize our input text. Tokenizing is the process of breaking down text into smaller units called tokens. Tokens can be words, subwords, characters, or even sentences. The main goal of tokenization is to convert a piece of text into manageable chunks that can be further processed.

Then, as in most NLP applications, we convert each token into a vector using an embedding algorithm. Embedding is the process of converting tokens into dense vector representations. These vectors capture semantic meaning and relationships between tokens. The main goal of embedding is to transform tokens into a numerical format that retains semantic information and can be used as input for machine learning models.

Figure 6: Embedding

This embedding process takes place exclusively in the bottom-most encoder. All encoders share a common feature: they receive a list of vectors, each of size 512. For the bottom encoder, these vectors are word embeddings, whereas for the other encoders, they are the outputs from the encoder immediately below.
By padding or truncating all input sequences to the same length, we ensure that the output embeddings maintain a consistent size, which is essential for batch processing and model training. The length of this list (the sequence length) is a hyperparameter that we can set, typically corresponding to the longest sentence in our training dataset.
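
To make this concrete, here is a tiny sketch of tokenization, embedding lookup, and padding. The whitespace tokenizer, the four-entry vocabulary, and the randomly initialized embedding table are all illustrative stand-ins (real models use subword tokenizers and learned embeddings); only the 512-dimensional vector size matches the original paper.

import numpy as np

# Hypothetical toy vocabulary and whitespace tokenizer (illustrative only).
vocab = {"<pad>": 0, "je": 1, "suis": 2, "étudiant": 3}
d_model = 512                     # embedding size used in the original paper
max_len = 5                       # fixed sequence length we pad/truncate to

def tokenize(text):
    return text.lower().split()   # real models use subword tokenizers (e.g. BPE)

# Randomly initialized embedding table; in practice these weights are learned.
embedding_table = np.random.randn(len(vocab), d_model) * 0.01

tokens = tokenize("Je suis étudiant")
ids = [vocab[t] for t in tokens]                     # [1, 2, 3]
ids = ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad to the fixed length
x = embedding_table[ids]                             # shape: (max_len, d_model)
print(x.shape)                                       # (5, 512)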

Another key aspect is that the model processes all input tokens in parallel. By incorporating positional encoding, we retain information about word order, so the position of each word in the sentence still matters. Because the positional encodings are summed element-wise with the input embeddings, their dimensions must be identical to those of the embeddings. These positional encodings are added to the input embeddings at the bottom of the encoder stack.
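
The original paper uses fixed sinusoidal positional encodings (learned positional embeddings are another common choice). Here is a minimal sketch of that formula; it produces a (sequence length × d_model) matrix that is added to the embeddings from the previous sketch.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

# Added element-wise to the (5, 512) embedding matrix from the sketch above.
pe = sinusoidal_positional_encoding(5, 512)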

Self-Attention

We know that each word in the input sequence is first converted into a vector using embeddings as described above. As the next step, we create three vectors: a Query vector (Q), a Key vector (K), and a Value vector (V) for each word in the sequence. These vectors are obtained by multiplying the input vector by three different weight matrices that are learned during training.

Figure 7: Query vector (Q), Key vector (K), and the Value vector (V)

To assess the significance of each word in the sequence relative to the others, we calculate the dot product of the Query vector of a word with the Key vectors of all words in the sequence. This generates a set of scores. These scores indicate the amount of attention we should give to other parts of the input sentence while encoding a word at a specific position. (Figure 8)

After calculating the score, it is divided by the square root of the dimension of the key vector (√d_k), which helps achieve more stable gradients. In the original paper, with d_k set to 64, the score is divided by 8. These scores are then passed through a softmax operation, which normalizes them so they are all positive and sum to 1. The softmax function transforms the scores into a probability distribution, highlighting the most relevant tokens and reducing the influence of less relevant ones.

The next step involves multiplying each value vector (V) by the softmax score before summing them. This process ensures that the values of the important words are preserved while the influence of irrelevant words is minimized by multiplying them by very small numbers, like 0.001.

The final step is to sum the weighted value vectors, resulting in the output of the self-attention layer (Z) at a specific position. The resulting vector is ready to be sent to the feed-forward neural network.

Figure 8: self-attention calculation
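
Here is a small NumPy sketch of these steps for a single position: project the inputs into Q, K, and V, score one query against every key, scale by √d_k, apply softmax, and take the weighted sum of the value vectors. The weight matrices are random stand-ins for learned parameters.

import numpy as np

d_model, d_k, seq_len = 512, 64, 3
np.random.seed(0)

# Random stand-ins for the learned projection matrices W_Q, W_K, W_V.
W_Q = np.random.randn(d_model, d_k) * 0.01
W_K = np.random.randn(d_model, d_k) * 0.01
W_V = np.random.randn(d_model, d_k) * 0.01

x = np.random.randn(seq_len, d_model)        # embeddings + positional encodings

Q, K, V = x @ W_Q, x @ W_K, x @ W_V          # each has shape (seq_len, d_k)

# Self-attention output for the word at position 0.
scores = Q[0] @ K.T                           # dot product with every key
scores = scores / np.sqrt(d_k)                # scale by sqrt(d_k) = 8
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the sequence
z0 = weights @ V                              # weighted sum of value vectors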

Matrix Calculation of Self-Attention

Although we’ve described the self-attention mechanism at the word level for clarity, in practice, transformers perform these calculations using matrix operations. This approach is much faster and more efficient, especially for processing large amounts of data.

By using matrices, the model can handle multiple words and their relationships simultaneously, significantly speeding up the computation.

Figure 9: Matrix Calculation of Self-Attention
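
In matrix form, the whole computation collapses into a single expression, softmax(QKᵀ / √d_k) · V. A minimal sketch, assuming Q, K, and V are (sequence length × d_k) matrices produced as in the previous sketch:

import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) score matrix
    return softmax(scores) @ V                # one output row per position

# Z = attention(Q, K, V) gives the self-attention output for every position at once.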

Multi-headed attention

Multi-headed attention extends this self-attention concept by running multiple attention mechanisms (heads) in parallel, allowing the model to focus on different parts of the input sequence simultaneously.

Suppose we have an input sentence: “The cat sat on the mat.” In a single-headed attention mechanism, the model might focus on the relationship between “cat” and “mat” when predicting the next word. In a multi-headed attention mechanism, one head might focus on the relationship between “cat” and “mat,” while another head focuses on the relationship between “cat” and “sat,” and so on. This multi-faceted approach helps the model capture more nuances and context.

Figure 10: Calculating Attention

Multi-headed attention introduces multiple sets of Query/Key/Value weight matrices, rather than just one. In the case of the Transformer model, there are 8 attention heads, resulting in 8 sets of these matrices for each encoder. Each set is initialized randomly and, after training, projects the input embeddings or vectors from lower encoders into different representation subspaces. This diversity in representation helps the model capture a richer and more detailed understanding of the input.

Figure 11: Multi-Headed Attention

This presents a challenge because the feed-forward layer expects a single matrix (a vector for each word), not eight separate matrices. Therefore, we need a method to combine these eight matrices into one.

To achieve this, we concatenate the matrices and then multiply them by an additional weight matrix, W0.

Figure 12: Combine eight matrices into one
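
Here is a rough sketch of the full multi-head computation, assuming 8 heads of size d_k = d_model / 8 = 64, with random matrices standing in for the learned per-head projections and the output matrix W0:

import numpy as np

d_model, num_heads, seq_len = 512, 8, 3
d_k = d_model // num_heads                    # 64 dimensions per head
x = np.random.randn(seq_len, d_model)         # input to this attention layer

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(num_heads):
    # Each head has its own (randomly initialized, normally learned) projections.
    W_Q, W_K, W_V = [np.random.randn(d_model, d_k) * 0.01 for _ in range(3)]
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # (seq_len, d_k)

W_0 = np.random.randn(num_heads * d_k, d_model) * 0.01  # output projection
Z = np.concatenate(heads, axis=-1) @ W_0                 # back to (seq_len, d_model)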

Benefits of Multi-Headed Attention

  • Enhanced Representation: By having multiple heads, the model can capture different aspects of the input sequence, improving its ability to understand complex dependencies.
  • Parallel Processing: Multi-headed attention allows the model to process different parts of the input sequence simultaneously, making it more efficient.
  • Rich Feature Extraction: Each head can learn to attend to different features, providing a richer representation of the input sequence.

Feed-forward network

After the self-attention operation, the output is passed through a feed-forward network. This FFN is a simple two-layer fully connected network that is applied independently to each position in the sequence. It introduces non-linearity and helps the model capture complex patterns in the data.

Figure 13: The output is passed through a feed-forward network
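
In the original paper this is a two-layer network with a ReLU in between, with d_model = 512 and an inner dimension of 2048. A minimal sketch, with random weights standing in for learned ones:

import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

def feed_forward(x):
    # Applied to every position independently: FFN(x) = max(0, xW1 + b1)W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

z = np.random.randn(3, d_model)               # self-attention output for 3 positions
out = feed_forward(z)                         # shape stays (3, d_model)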

Residual Connections and Layer Normalization

As I mentioned earlier, both the self-attention mechanism and the feed-forward network are followed by residual connections and layer normalization. The residual connections help preserve the information from the original input while adding the output of the sub-layer, ensuring the model retains important features. Layer normalization is then applied to stabilize and accelerate training by normalizing the output of the previous step.
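
In other words, each sub-layer's output is LayerNorm(x + Sublayer(x)). A minimal sketch (the learned gain and bias parameters of layer normalization are omitted for brevity):

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x))

# One encoder layer, schematically:
#   x = add_and_norm(x, self_attention)
#   x = add_and_norm(x, feed_forward)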

Decoder

While we’ve already explored the components of the encoder, the decoder plays an equally crucial role in generating the output sequence. The primary function of the decoder is to take the encoded input and generate the output tokens one at a time, in an iterative process.

The decoding process begins with the first step, where the decoder receives the encoder’s output along with a special start token (like <START>). Since no output tokens have been generated yet, this first step relies solely on the input from the encoder.

Figure 14: First Decoding Step

As the model moves through later steps, it generates the next token by using both the encoder’s output and the sequence of previously generated tokens. The decoder uses a masked self-attention mechanism, ensuring that each token can only attend to tokens that came before it, preserving the sequential nature of the output.

Figure 15: Decoding steps

Instead of attending to future tokens, the decoder focuses on previously generated tokens, computing a weighted sum of these tokens to emphasize the most relevant ones. This allows the model to maintain context, producing a coherent and meaningful sequence at each step.
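
Masking is typically implemented by setting the scores for future positions to negative infinity before the softmax, so their attention weights become zero. A minimal sketch:

import numpy as np

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    # Lower-triangular mask: position i may only attend to positions <= i.
    mask = np.tril(np.ones_like(scores))
    scores = np.where(mask == 1, scores, -np.inf)    # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # masked self-attention output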

While the self-attention layer helps the decoder focus on the previously generated tokens in the output sequence, the decoder also incorporates another important mechanism: the encoder-decoder attention sub-layer. This sub-layer enables the decoder to attend to the relevant parts of the input sequence encoded by the encoder.

The key idea here is cross-attention, where the decoder generates queries (Q) from its previous layer’s output, while the keys (K) and values (V) are derived from the encoder’s output. This allows the decoder to retrieve the most relevant information from the encoded input, helping it generate a contextually accurate output sequence.

Like self-attention, encoder-decoder attention uses multi-headed attention, which enables the model to focus on different aspects of the input sequence in parallel. This mechanism ensures that the decoder can leverage the information encoded from the input effectively, generating coherent and contextually accurate outputs.
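
Here is a rough sketch of cross-attention, where the decoder's states supply the queries and the encoder's output supplies the keys and values; the projection matrices are again random stand-ins for learned parameters:

import numpy as np

d_model, d_k = 512, 64
enc_out = np.random.randn(3, d_model)          # encoder output (source length 3)
dec_state = np.random.randn(2, d_model)        # decoder states so far (2 tokens)

W_Q, W_K, W_V = [np.random.randn(d_model, d_k) * 0.01 for _ in range(3)]

Q = dec_state @ W_Q                            # queries come from the decoder
K = enc_out @ W_K                              # keys come from the encoder output
V = enc_out @ W_V                              # values come from the encoder output

scores = Q @ K.T / np.sqrt(d_k)                # (2, 3): each target token vs. source
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
context = weights @ V                          # (2, d_k) attended encoder context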

Final Linear Layer and the Softmax Layer

Once the decoder stack produces an output vector, this needs to be converted into a word or token. The final Linear layer projects this output into a much larger vector, known as the logits vector, with each element corresponding to a possible word from the model’s vocabulary. For instance, if the model knows 30,000 words, the logits vector will have 30,000 values, each representing a score for a specific word. The Softmax layer then transforms these scores into a probability distribution, where the word with the highest probability is chosen as the next token in the sequence.

Figure 16: Output of the decoder stack to output word

Through this process, the model generates text in a step-by-step manner, ensuring each word is contextually relevant based on the input and previously generated tokens.
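
As a rough sketch of this last step, assuming a toy vocabulary of 30,000 words, a randomly initialized projection matrix, and greedy decoding (real systems often use beam search or sampling instead):

import numpy as np

d_model, vocab_size = 512, 30000
W_vocab = np.random.randn(d_model, vocab_size) * 0.01   # final linear layer (learned)

decoder_output = np.random.randn(d_model)      # decoder vector for the current step

logits = decoder_output @ W_vocab              # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                    # softmax: a probability distribution
next_token_id = int(np.argmax(probs))          # greedy choice of the next token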

Now that we’ve covered how a forward pass through the entire Transformer architecture works, I hope you found this article informative and helpful.
If you learned something new, feel free to hit the clap icon below! In the next article, we’ll dive into implementing the Transformer model. Until then, happy learning!

Written by Nimasha Attanayake

Artificial Intelligence Researcher | Quantitative Developer
