
[AI from Scratch] Episode 254: Implementing the Transformer Model


Recap and Today’s Theme

Hello! In the previous episode, we covered the implementation of the Attention mechanism, a technique that lets a model focus on the most relevant information within sequence data. Attention is especially effective at capturing long contexts in natural language processing (NLP) and is the foundational technology of the Transformer model, improving both computational efficiency and the extraction of contextual information.

Today, we will discuss the implementation of the Transformer model, an innovative architecture that leverages Attention. Widely used in NLP, the Transformer model has transformed the field, replacing traditional RNN and LSTM architectures. This episode covers the basic structure of Transformers, the application of Self-Attention, and a practical implementation.

What is the Transformer?

1. Basic Concept of the Transformer

The Transformer is an NLP architecture introduced by Google researchers in the 2017 paper "Attention Is All You Need", offering an alternative to traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. Its defining feature is that it is built entirely on the Attention mechanism, allowing sequences to be processed in parallel, unlike RNNs, which process data sequentially. This design significantly improves computation speed and scalability.

2. Encoder-Decoder Structure

The Transformer consists of two main components:

  • Encoder: Processes the input sequence and generates feature vectors. It consists of multiple encoder blocks, each containing Self-Attention and a Feed-Forward Neural Network (FFN).
  • Decoder: Uses the output from the encoder to generate the target sequence. It also consists of multiple blocks, each combining masked Self-Attention over the target sequence with attention over the encoder output to make predictions.

3. Self-Attention and Multi-Head Attention

At the core of the Transformer is Self-Attention, which captures relationships between words in the input sequence. Multi-Head Attention extends this by executing multiple Self-Attention mechanisms simultaneously, enabling the model to capture information from different perspectives.

Mathematical Formulation and Layer Descriptions

1. Self-Attention Calculation

Self-Attention uses three vectors: Query (Q), Key (K), and Value (V). The calculation is as follows:

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
\]

Here, \( d_k \) is the dimension of the Key vectors, used to scale the scores. This operation computes attention weights that are applied to the Value vectors, enabling the model to focus on relevant information.
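
The formula above translates almost line for line into TensorFlow. Below is a minimal sketch of scaled dot-product attention; the function name and the toy tensor shapes are chosen only for this illustration, and the mask handling shown in the full implementation later is omitted here:

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # q, k, v: tensors of shape (..., seq_len, depth)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)  # (..., seq_len_q, seq_len_k)
    weights = tf.nn.softmax(scores, axis=-1)  # each row sums to 1 over the Keys
    return tf.matmul(weights, v)              # weighted sum of the Value vectors

# Toy check: batch of 1, sequence of 3 tokens, depth 4
q = k = v = tf.random.normal((1, 3, 4))
print(scaled_dot_product_attention(q, k, v).shape)  # (1, 3, 4)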

2. Multi-Head Attention

Multi-Head Attention performs multiple Self-Attention operations in parallel, combining them for diverse information processing:

\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
\]
\[
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\]

Each head has its own projection matrices \( W_i^Q, W_i^K, W_i^V \) and is computed in parallel with the others; the results are then concatenated and projected by \( W^O \) to form the final output.

3. Positional Encoding

Since the Transformer processes sequences in parallel, it requires a method to retain positional information. Positional Encoding adds this information to the input vectors using sine and cosine functions:

\[
\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\]
\[
\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\]

Here, \( pos \) denotes the position and \( i \) is the dimension index. This encoding helps the Transformer maintain sequence order information.
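
These formulas can be implemented with the helper function below, a minimal sketch using NumPy and TensorFlow. The name positional_encoding is chosen to match the call made in the encoder implementation later in this article:

import numpy as np
import tensorflow as tf

def positional_encoding(max_position, d_model):
    # Angle for each (position, dimension) pair: pos / 10000^(2i / d_model)
    positions = np.arange(max_position)[:, np.newaxis]  # (max_position, 1)
    dims = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
    angle_rads = positions * angle_rates                # (max_position, d_model)

    # Apply sin to even dimension indices and cos to odd dimension indices
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Add a batch dimension so it can be broadcast over (batch, seq_len, d_model)
    return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)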

Implementation Example of the Transformer

In this section, we implement a simple Transformer encoder model using Python and TensorFlow.

1. Implementing the Transformer Encoder

First, we define the basic structure of the Transformer encoder, including Self-Attention and feed-forward layers.

import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, Dropout, LayerNormalization

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0

        self.depth = d_model // num_heads
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        self.dense = Dense(d_model)

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention: QK^T / sqrt(depth)
        scaled_attention_logits = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(self.depth, tf.float32))
        if mask is not None:
            # Masked positions get a large negative value so their softmax weight becomes ~0
            scaled_attention_logits += (mask * -1e9)

        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)
        output = tf.transpose(output, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(output, (batch_size, -1, self.d_model))

        return self.dense(concat_attention)

class TransformerEncoderBlock(Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerEncoderBlock, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = tf.keras.Sequential([
            Dense(dff, activation='relu'),
            Dense(d_model)
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, x, training, mask):
        attn_output = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)

        return out2

This implementation defines the Multi-Head Attention layer and the Transformer encoder block, which uses residual connections, Layer Normalization, and Dropout for training stability.
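
As a quick sanity check, the block can be applied to a dummy batch; the hyperparameter values below are arbitrary and chosen only for illustration:

sample_block = TransformerEncoderBlock(d_model=128, num_heads=8, dff=512)
dummy_input = tf.random.normal((2, 10, 128))  # (batch_size, seq_len, d_model)
output = sample_block(dummy_input, training=False, mask=None)
print(output.shape)  # (2, 10, 128): the block preserves the input shape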

2. Constructing the Encoder Model

Next, we stack multiple encoder blocks to build the full encoder, which can serve as the backbone for tasks such as text classification:

class TransformerEncoder(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        # Precomputed sinusoidal positional encodings (see positional_encoding above); 1000 is the maximum sequence length
        self.pos_encoding = positional_encoding(1000, d_model)
        self.enc_layers = [TransformerEncoderBlock(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]
        self.dropout = Dropout(dropout_rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))  # scale embeddings by sqrt(d_model), as in the original paper
        x += self.pos_encoding[:, :seq_len, :]                # add positional information

        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x
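
To confirm the encoder works end to end, it can be run on a batch of dummy token IDs; again, the vocabulary size and other hyperparameters below are arbitrary example values:

encoder = TransformerEncoder(num_layers=2, d_model=128, num_heads=8, dff=512, input_vocab_size=8000)
dummy_tokens = tf.random.uniform((2, 20), maxval=8000, dtype=tf.int32)  # (batch_size, seq_len) of token IDs
encoded = encoder(dummy_tokens, training=False, mask=None)
print(encoded.shape)  # (2, 20, 128): one d_model-dimensional vector per input token

For a text classification task, a pooling layer (for example, GlobalAveragePooling1D) and a Dense output layer would typically be placed on top of this encoder output.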

Benefits of the Transformer

1. Parallel Processing Capability

Unlike traditional RNNs that process data sequentially, the Transformer processes entire sequences in parallel, significantly improving computation speed.

2. Handling Long Sequences

Because Self-Attention connects every position in the sequence directly to every other position, the Transformer retains important information over long sequences and mitigates the long-range dependency problems that affect RNNs.

Summary

This episode explained the Transformer model, its core structure, and the implementation of Self-Attention and a Transformer encoder. As one of the most influential architectures in NLP, the Transformer demonstrates remarkable performance across a wide variety of tasks.

Next Episode Preview

Next time, we will dive into fine-tuning BERT, learning how to apply pre-trained models to specific tasks for optimal performance.


Notes

  1. Feed-Forward Neural Network (FFN): A network in which information flows only forward from input to output, with no recurrent connections; in the Transformer, it is applied to each position independently.
  2. Positional Encoding: A method to incorporate sequence order information in Transformers.
  3. Self-Attention: A form of attention that captures relationships between words within an input sequence.