Evolution of Sequence Modeling, Pt. 2: Transformers

Author: Apoorv Jain

Originally published in Towards AI.

In the previous blog of this series, we examined the early revolutionary idea of recurrent neural networks (RNNs) for sequence modeling. We discussed their core intuition, their strengths, and their key limitations, especially the challenge of maintaining gradient flow across the long sequences often required in sequence-modeling tasks.

In this article, we turn to the simple but powerful ideas that made transformers unique and highly scalable.

Fig. 1. A reader's eyes scan a sentence to understand the discourse.

We discussed the idea of a "mental summary" that tracks the context of the previous words in a sentence. Human eyes, however, can do more than maintain such a summary: they can easily scan many previous words and directly infer which of them matter most for understanding the current word. The next approach we discuss is inspired by this idea.

Self-attention

Self-attention is a very powerful technique for determining the relevance of a token with respect to every other token. It lets the model learn complex relationships between tokens while keeping the model simple and efficient.

Sentence in the spotlight

Scenario (A):

The mouse froze for a moment and then darted across the floor in a panic, its little body trembling as the cat lunged at it. Startled by the sudden movement and sensing danger, the scared mouse scurried feverishly, desperately trying to escape the predator's looming presence.

How do you know what "it" refers to? Your brain instantly connects "it" to "the mouse", using the context provided by the word "scared".

Instead of using a single vector representation per token, self-attention derives three vectors with different purposes for each token (Q, K, V).

  1. Query (Q): What am I looking for?
    The query vector of the word "it" says: "I need to know who I refer to."
  2. Key (K): What can I offer?
    The key vector of "mouse" says: "I am a noun, an animal, a potential subject of fear."
  3. Value (V): The actual offering.
    The value vector of "mouse" carries its rich semantic meaning.
Attention scores are computed using scaled dot-product attention: Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V

The dot product measures the similarity between two vectors. In the context of attention, the dot product between a query vector and a key vector measures the match between what one token is looking for (its query) and what another token offers (its key). To avoid excessively large values as the dimensionality of these vectors (dₖ) grows, the dot product is scaled by √dₖ. The scaled scores are then passed through a softmax layer, which turns them into normalized weights. Finally, these weights are used to compute a weighted average of the actual offerings (the value vectors), producing a new representation of the token.
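Here is a minimal sketch of scaled dot-product attention in PyTorch. The toy tensor shapes and random inputs are purely illustrative, and the learned projections that produce Q, K, and V are omitted.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a set of token vectors."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len) similarity matrix
    weights = F.softmax(scores, dim=-1)            # normalized attention weights per query
    return weights @ v, weights                    # weighted average of the value vectors

# Toy example: 5 tokens, 8-dimensional Q/K/V vectors (random placeholders)
torch.manual_seed(0)
q, k, v = torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([5, 8]) torch.Size([5, 5])
```

Each row of `attn` sums to 1 and tells us how much the corresponding token attends to every other token.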

Evolving features across layers

The attention scores, computed from the learned Q, K, and V projections of the transformer, reflect the different kinds of features the model has learned during training, and they become richer and more abstract as information passes through successive layers.

In earlier layers, attention may capture low-level patterns such as syntactic relationships or positional dependencies, while deeper layers gradually focus on more complex semantic structures, contextual understanding, and task-specific representations. In essence, the progression of attention across layers allows the model to refine its understanding of the input sequence, moving from surface-level connections to higher-order, meaningful patterns that contribute to better predictions.

Visualizing self-attention

We used the BertViz library to analyze this sentence. We chose a BERT encoder to tokenize the sentence and compute contextual token embeddings, and we visualized the attention scores across different layers and attention heads of the transformer model.
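For reference, attention weights like the ones in Fig. 2 can be extracted and rendered with BertViz roughly along these lines. This is a sketch that assumes a Jupyter notebook environment, and the sentence string below is just a placeholder example.

```python
from transformers import BertTokenizer, BertModel
from bertviz import head_view

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
# output_attentions=True makes the model return per-layer attention weights
model = BertModel.from_pretrained(model_name, output_attentions=True)

sentence = "The cat chased the mouse because it was scared."  # placeholder sentence
inputs = tokenizer.encode(sentence, return_tensors="pt")
outputs = model(inputs)
attention = outputs.attentions                        # one (batch, heads, seq, seq) tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0].tolist())
head_view(attention, tokens)                          # interactive layer/head visualization in the notebook
```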

  1. Earlier layers
    Fig. 2 (a): Entity relationship, represented by a strong connection between "cat" (subject) and "chased" (verb).
    Fig. 2 (b): Object relationship, represented by the connection between "mouse" (object) and "chased" (verb).
    These are low-level relationships, which we found in layer 0 across different attention heads.
  2. Deeper layers
    Fig. 2 (c): Discourse understanding: who was scared?
    Fig. 2 (d): Coreference resolution, to find out what "it" refers to.
    These features are complex and require deeper understanding, and so they appear in the deeper layers 9 and 10.
Fig. 2. Low-level to high-level features learned across layers
Part-of-speech (POS) tag analysis

Alternative scenario (B):

The mouse scurried feverishly across the floor while the cat raced after it, not out of hunger but out of fear, its flicking tail and frantic movements betraying a startled instinct.

In scenario A, the cat saw the scared mouse and began to chase it, while in scenario B the cat itself panicked and started chasing the mouse.

Fig. 2 (c) and Fig. 2 (d) capture this ambiguity beautifully in the attention scores: you can see dual connections from "scared" and "it" to both "mouse" and "cat".

Fig. 3. Two possible scenarios arising from the ambiguity in the sentence. So who was scared, the cat or the mouse? Source: Gemini

Transformers

The transformer architecture revolutionized sequence modeling by introducing a structure built entirely around attention mechanisms, a departure from the traditional reliance on recurrent or convolutional neural networks. Instead of processing the sequence step by step, the transformer lets each token attend directly to all other tokens in the input simultaneously through self-attention.

Fig. 4. A single transformer encoder block

Key elements

  • Input layer: Transforms input tokens into high-dimensional vector representations that capture semantic information.
  • Positional encoding: Adds information based on each token's position to preserve the order of the sequence during training. This is needed because the recurrent connections across time steps have been removed.
  • Encoder block (a minimal code sketch follows this list):
    Multi-head self-attention: Each token attends to all the others, capturing contextual relationships within the input sequence. This is done in parallel by multiple heads, each with its own Q, K, and V matrices.
    Feed-forward network (MLP layer): A position-wise dense network refines each token's representation independently.
    Layer normalization and residual connections: Stabilize training, enable efficient gradient flow, and improve numerical stability, much like a student glancing back at earlier notes to recall a concept.
Fig. 5. A transformer decoder block
  • Decoder block:
    Masked multi-head self-attention: Prevents attending to future tokens during generation, enabling autoregressive modeling.
    Encoder-decoder (cross) attention: Lets the decoder focus on the encoded input representations while generating the output.
    Feed-forward network: As in the encoder, refines token-level representations.
    Layer normalization and residual connections: As above, for stabilization and convergence.
  • Output linear + softmax: Produces the probability distribution over the next token.
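To make the encoder block concrete, here is a minimal PyTorch sketch of one block: multi-head self-attention followed by a position-wise MLP, each wrapped with a residual connection and layer normalization. The hyperparameters mirror the base model from "Attention Is All You Need", but this is an illustrative simplification (positional encoding and the decoder are omitted), not the reference implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: multi-head self-attention + position-wise MLP,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Multi-head self-attention: every token attends to every other token.
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(attn_out))     # residual + layer norm
        # Position-wise feed-forward network applied to each token independently.
        x = self.norm2(x + self.drop(self.mlp(x)))  # residual + layer norm
        return x

# Toy usage: batch of 2 sequences, 10 tokens each, embedding size 512
x = torch.randn(2, 10, 512)
print(EncoderBlock()(x).shape)  # torch.Size([2, 10, 512])
```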

Were transformers the first to use attention?

No, but they were the first to rely solely on self-attention, without recurrent connections across time steps.

Fig. 6. Removing recurrent connections across time steps
  • The original attention mechanism was introduced by Bahdanau et al. in 2014 to improve RNN encoder-decoder models for tasks such as machine translation, allowing the model to focus on the relevant parts of the input sequence.
  • It was regarded more as an enhancement technique for RNN-based architectures, which still relied on recurrent and convolutional layers to learn relationships between tokens.
  • The transformer architecture (introduced in "Attention Is All You Need", Vaswani et al., 2017) was the first model to completely replace recurrence and convolution with attention alone.
  • That said, removing recurrent connections is not without consequences. There is an ongoing debate about whether the lack of recurrence limits transformers' ability to reason. A recent paper on the Hierarchical Reasoning Model (a 27M-parameter model) revisits this question.

Training objective

Fig. 7. Parallel training for next-token prediction

To train the transformer architecture, we need a task that can instill knowledge of the world into it. One widely used task is next-token prediction, in which the model is trained to predict a probability distribution over the next word, which is then compared with the distribution of the actual next word using the cross-entropy loss.
This loss is used to backpropagate gradients through the layers, adjusting the weights just as in a standard neural network, although training is less complex than for an RNN.

Cross-entropy loss between the predicted probability distribution and the true distribution of the next token: L = −Σₜ log p(xₜ₊₁ | x₁, …, xₜ)
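Below is a minimal sketch of the next-token prediction objective: the model's output at position t is compared against the token actually observed at position t+1 using cross-entropy, for all positions at once. The logits here are random placeholders standing in for a transformer's output.

```python
import torch
import torch.nn.functional as F

# Placeholder setup: a batch of token-id sequences and model logits over the vocabulary.
vocab_size, seq_len, batch = 100, 6, 2
token_ids = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # would come from the transformer

# Next-token prediction: position t predicts the token at position t+1,
# so shift the predictions and the targets by one step.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = token_ids[:, 1:].reshape(-1)

# Cross-entropy between the predicted distribution and the true next token,
# computed for every position in parallel.
loss = F.cross_entropy(pred, target)
print(loss.item())
```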

What is particularly fascinating is how an LLM can generate thousands of coherent sentences from a single prompt, producing them one token at a time while maintaining fluency, consistency, and logical coherence. At first glance, this might seem to require explicit planning in advance.
In fact, besides next-token prediction, there is another pretraining objective that has recently gained a lot of attention for structured tasks. We will explore it in upcoming blog posts.

Why transformers succeeded

  1. No bottleneck problem: The attention mechanism removes the burden of a single "mental summary", letting the model track many dependencies directly. In addition, gradients can flow directly to the relevant tokens without passing through unnecessary tokens in between. This solves the problem of long-range dependencies.
  2. Parallel training: Relying only on attention removes the sequential dependency; processing the n-th token does not require the output of the (n−1)-th token. This makes training far more scalable and parallelizable, because the computation can be expressed efficiently as matrix operations on GPUs (see the sketch after this list).
  3. Transfer learning: Because the model retains domain knowledge after pretraining on next-token prediction, we can fine-tune it for downstream tasks with a small number of samples.
  4. Scalability: Transformers have exhibited remarkable scaling behavior: you can keep increasing the number of parameters and obtain incremental gains in performance.
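As referenced in point 2 above, here is a small sketch of how a causal mask lets the model compute attention for every position of a training sequence in a single batched matrix operation (toy tensors, learned projections omitted).

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Self-attention over a whole sequence at once, with a causal mask so that
    token n only attends to tokens 1..n. Everything is a single matrix product,
    which is why training parallelizes well on GPUs."""
    seq_len, d_k = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # block attention to future tokens
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(8, 16)  # 8 tokens, 16-dim vectors (toy values)
print(causal_self_attention(q, k, v).shape)  # torch.Size([8, 16])
```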

Ongoing challenges

  1. Sequential inference: Even though the model can be trained in parallel on GPUs, there is still a sequential dependency at inference time: you need the (n−1)-th token before you can compute the n-th token. This limitation makes inference slow.
  2. Error accumulation: This stems from the fact that an LLM cannot go back. Once a token is produced, it cannot be revised or replaced, which means any error propagates. As a result, the margin for error is extremely narrow, and token-level inaccuracies can be costly across the whole sequence.
  3. Limited diversity: By default, the generated text is produced by greedy decoding, which yields repetitive and less diverse outputs. We need temperature sampling to adjust the output probability distribution, which alleviates the diversity problem, but only up to a point (see the sketch after this list).
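As referenced in point 3 above, here is a minimal sketch of temperature sampling versus greedy decoding over a placeholder logit vector.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0):
    """Greedy decoding when temperature is 0; higher temperatures flatten the
    distribution and increase output diversity."""
    if temperature <= 0:
        return int(torch.argmax(logits))
    probs = F.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # placeholder logits over a 4-token vocabulary
print(sample_next_token(logits, temperature=0.0))  # deterministic, always token 0
print(sample_next_token(logits, temperature=1.0))  # stochastic, more diverse
```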

References:

  1. Attention Is All You Need: https://arxiv.org/abs/1706.03762
  2. Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473
  3. The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
  4. StatQuest: https://www.youtube.com/watch?v=ZXQYTK8QSEY&T=1550S
  5. 3Blue1Brown: https://www.youtube.com/watch?v=emlx5ffnoyc
  6. BertViz: https://github.com/jessevig/bertviz
  7. Hierarchical Reasoning Model: https://arxiv.org/abs/2506.21734

Published via Towards AI
