Must-Know Concepts About Transformers

Naman Anand

Here is a list of questions and answers about transformers that every beginner should know.

Image source: https://medium.com/machine-intelligence-and-deep-learning-lab/transformer-the-self-attention-mechanism-d7d853c2c621

Que: Why does the vanishing gradient problem occur in RNNs, and how do transformers solve it?

Answer: The vanishing gradient problem arises in RNNs during backpropagation when gradients are multiplied across many timesteps. If the gradients are less than 1, their repeated multiplication causes them to shrink exponentially, making it difficult for the model to learn long-range dependencies.

Transformers solve this problem by using the attention mechanism, which directly relates all input tokens to one another without relying on sequential gradient propagation. This eliminates the need for recurrent computations and mitigates the vanishing gradient issue.
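
As a quick numeric illustration (not an RNN implementation), the sketch below shows how repeatedly multiplying per-timestep gradient factors smaller than 1 shrinks the overall gradient exponentially; the factor 0.9 is an arbitrary assumed value.

```python
# Minimal sketch: a product of per-timestep gradient factors below 1
# decays exponentially with sequence length, which is the core of the
# vanishing gradient problem in RNNs.
import numpy as np

per_step_factor = 0.9            # assumed magnitude of each timestep's local gradient
for T in [10, 50, 100]:
    total_gradient = per_step_factor ** T    # product over T timesteps
    print(f"T={T:4d}: gradient magnitude ~ {total_gradient:.2e}")
# T=  10: ~3.49e-01, T=  50: ~5.15e-03, T= 100: ~2.66e-05
```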

Que: Why do RNNs fail with large sequences, and how do transformers address this limitation?

Answer: RNNs struggle with large sequences because they process input tokens sequentially, passing a single hidden state forward. This approach makes it hard to capture long-range dependencies and interactions between distant tokens effectively.

Transformers address this limitation by employing self-attention mechanisms. These mechanisms allow the model to compute interactions between all tokens simultaneously, capturing global dependencies in the sequence regardless of its length.
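
A minimal sketch of this idea, using random token embeddings and omitting the learned projections: self-attention builds an n × n score matrix in a single step, so even distant tokens interact directly rather than through a long chain of hidden states.

```python
# Simplified pairwise interaction scores (no learned Q/K/V projections here).
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                      # sequence length and embedding size (illustrative)
X = rng.normal(size=(n, d))      # token embeddings

scores = X @ X.T / np.sqrt(d)    # (n, n): every token scored against every other token
print(scores.shape)              # (6, 6) -> token 0 and token 5 interact in one step
```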

Que: Why do positional embeddings exist, and why do they use sine and cosine functions?

Answer: Positional embeddings are necessary in transformers because the architecture processes tokens in parallel, losing the sequential order information inherent to RNNs.

Sine and cosine functions are used because they provide bounded, smooth variations that encode positional information in a way that generalizes to sequences of different lengths. Unlike raw position indices, sinusoidal encodings at multiple frequencies preserve relative-position information: the encoding of an offset position can be expressed as a linear function of the original encoding, which helps the model reason about token distances.
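
A minimal sketch of the sinusoidal positional encoding described in the original Transformer paper: even dimensions use sine and odd dimensions use cosine, each at a different frequency. The sizes below are illustrative.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    dims = np.arange(d_model)[None, :]                         # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                           # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # sine on even indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # cosine on odd indices
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16); these values are added to the token embeddings
```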

Que: Why do we divide and combine in multi-head attention?

Answer: In multi-head attention, the input embeddings are divided into smaller subspaces (heads) to capture different aspects of relationships between tokens. Each head computes attention separately, focusing on different features of the input.

The outputs of all heads are then combined through concatenation and a linear transformation. This aggregation ensures that the model integrates diverse contextual information into the final representation.
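
A minimal sketch of the split-and-combine step, assuming d_model is divisible by the number of heads; the projection matrices are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads

X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(M):
    # (n, d_model) -> (n_heads, n, d_head): each head works in a smaller subspace
    return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, n, n)
heads_out = softmax(scores) @ V                        # (n_heads, n, d_head)

# Combine: concatenate the heads back to (n, d_model), then mix with a linear map.
combined = heads_out.transpose(1, 0, 2).reshape(n, d_model) @ W_o
print(combined.shape)   # (6, 16)
```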

Que: What is the ideology of query, key, and value vectors in attention mechanisms? Can you explain with an example?

Answer: Query, key, and value vectors are the core components of the attention mechanism:

  • Query (Q): Represents what the current token is looking for; it is the token for which attention is being calculated.
  • Key (K): Represents how each token in the sequence describes itself, so that queries can be matched against it.
  • Value (V): Contains the information that is aggregated, weighted by the attention scores.

Example: Imagine a dictionary where the genre is the key, and the list of movies is the value. If you describe the kind of movie you want to watch (query), the dictionary retrieves the most relevant movie based on the matching genre (key). Similarly, in attention, tokens interact by matching queries to keys to determine the relevance of values.
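
A minimal sketch of scaled dot-product attention with explicit Q, K, and V, using tiny random matrices in place of learned projections: each token's query is matched against every key, and the resulting weights mix the values.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 4, 8                                # 4 tokens, query/key dimension 8
Q = rng.normal(size=(n, d_k))                # what each token is looking for
K = rng.normal(size=(n, d_k))                # how each token describes itself
V = rng.normal(size=(n, d_k))                # the content each token offers

weights = softmax(Q @ K.T / np.sqrt(d_k))    # (n, n): relevance of every key to every query
output = weights @ V                         # weighted mix of values per token
print(weights.round(2))
print(output.shape)                          # (4, 8)
```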

Que: Why do transformers use layer normalization, and what is the role of gamma and beta?

Answer: Layer normalization stabilizes training by normalizing the input of each layer to have a consistent mean and variance. This ensures that the network’s activations remain within a stable range, speeding up convergence and improving performance.

  • Gamma (γ): A learnable parameter that scales the normalized output.
  • Beta (β): A learnable parameter that shifts the normalized output.

Together, gamma and beta allow the model to adjust the normalized values dynamically, enhancing its expressive power.
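
A minimal sketch of layer normalization over the feature dimension, with gamma (scale) and beta (shift) shown as plain arrays initialized to their usual starting values.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance per token
    return gamma * x_hat + beta                # learned rescale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(2, 8))    # two tokens, 8 features each
gamma = np.ones(8)                                 # starts as identity scaling
beta = np.zeros(8)                                 # starts as no shift

out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1).round(4), out.std(axis=-1).round(4))   # ~0 and ~1 per token
```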

Que: What do encoders provide to decoders in an encoder-decoder architecture?

Answer: Encoders provide contextual representations of the input sequence to decoders. Specifically, they supply keys and values to the decoder’s cross-attention mechanism, enabling the decoder to focus on relevant parts of the input while generating output.

The decoder also uses masked self-attention over the tokens generated so far, so each position can attend only to earlier positions; this is what allows the output to be generated one token at a time in a coherent left-to-right manner.
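
A minimal sketch of cross-attention: queries come from the decoder states, while keys and values come from the encoder output. The shapes and random weights are illustrative, not a full encoder-decoder implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
src_len, tgt_len, d = 7, 3, 8
encoder_output = rng.normal(size=(src_len, d))    # contextual input representations
decoder_states = rng.normal(size=(tgt_len, d))    # states for the output generated so far

Q = decoder_states                                # queries from the decoder
K = V = encoder_output                            # keys and values from the encoder

weights = softmax(Q @ K.T / np.sqrt(d))           # (tgt_len, src_len)
context = weights @ V                             # what each output step reads from the input
print(context.shape)                              # (3, 8)
```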

Que: What is the role of the feed-forward neural network in transformers?

Answer: The feed-forward neural network (FFN) refines the token embeddings after attention by applying two linear transformations with a ReLU activation in between. This allows the model to capture complex, non-linear relationships in the data, enriching each token’s representation.
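
A minimal sketch of the position-wise feed-forward network: two linear maps with a ReLU in between, applied independently to every token. The 4× expansion of the hidden size is the common convention; the weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_ff = 6, 16, 64                      # d_ff is typically 4 * d_model

X = rng.normal(size=(n, d_model))                 # token embeddings after attention
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

hidden = np.maximum(0.0, X @ W1 + b1)             # first linear map + ReLU
output = hidden @ W2 + b2                         # project back to d_model
print(output.shape)                               # (6, 16)
```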

Que: What is the significance and physical meaning of the output of the feed-forward neural network?

Answer: The output of the feed-forward neural network represents refined token embeddings enriched with non-linear features. These embeddings are context-aware and serve as the final representation of tokens before further transformations or output generation.

Que: How is the final output generated after attention and feed-forward processing?

Answer: After attention and feed-forward processing, the transformer passes the refined embeddings through a linear transformation layer followed by a softmax function. This produces a probability distribution over the vocabulary for each token position, allowing the model to predict the most likely word or token at each step.
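
A minimal sketch of this output head: a linear projection to the vocabulary size followed by a softmax, giving a probability distribution per token position. The vocabulary size and weights are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, vocab_size = 6, 16, 100

hidden = rng.normal(size=(n, d_model))            # refined embeddings from the last layer
W_out = rng.normal(size=(d_model, vocab_size))    # output projection

logits = hidden @ W_out                           # (n, vocab_size)
probs = softmax(logits)                           # each row sums to 1
next_token = probs[-1].argmax()                   # e.g. greedy pick at the last position
print(probs.shape, probs.sum(axis=-1).round(3), next_token)
```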

Que: Does ChatGPT require an encoder during inference?

Answer: No, ChatGPT does not require an encoder during inference. It is a decoder-only transformer architecture, which means it uses its decoder stack for both understanding input and generating output. The model processes the input tokens, applies self-attention, and generates tokens sequentially until the output is complete.
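
A minimal sketch of the causal (masked) self-attention that decoder-only models rely on: each position may attend only to itself and earlier positions, which is enforced by masking out future positions before the softmax.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal (future tokens)
scores[mask] = -np.inf                             # block attention to future positions

weights = softmax(scores)                          # exp(-inf) = 0, so future weights vanish
print(weights.round(2))                            # upper triangle is exactly 0
```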
