Part 27 - Implementing Transformers for Sentiment Analysis

Machine Learning Algorithms Series - Building a Transformer Model in Python for Enhanced Natural Language Processing

Feb 06, 2025

Following our exploration of LSTMs, this article introduces Transformers, a deep learning architecture designed for handling sequential data without relying on recurrence. Transformers use a mechanism called self-attention to process all tokens in the sequence simultaneously, capturing dependencies between tokens regardless of their distance in the sequence. This article guides you through implementing a Transformer model for sentiment analysis, using the IMDb dataset to classify movie reviews.

Understanding Transformers

Sequential Data Handling: Transformers handle sequential data without recurrence, unlike RNNs.
Self-Attention Mechanism: Transformers use self-attention to process all tokens simultaneously, capturing dependencies between tokens regardless of their distance in the sequence.
Foundation of NLP Tasks: Transformers have become the foundation of many NLP tasks and models, including BERT and GPT.
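
To make the self-attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is an illustration under assumptions, not the Keras implementation: the helper names and the random projection matrices w_q, w_k, and w_v are made up for the example, and masking and multiple heads are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, embed_dim); w_q, w_k, w_v: (embed_dim, head_dim).
    Returns the attended output and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token scores every other token, whatever their distance
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

# Hypothetical toy input: 5 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
seq_len, embed_dim, head_dim = 5, 8, 4
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, head_dim)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)   # (5, 4)
print(weights.shape)  # (5, 5)
```

Each row of weights says how strongly one token attends to every token in the sequence, and all rows are computed in one matrix product rather than step by step as in an RNN.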

Step-by-Step Implementation

Import Libraries:
- tensorflow: The main library for building and training deep learning models.
- tensorflow.keras.layers: The Keras module for creating different types of neural network layers.
- tensorflow.keras.datasets.imdb: The Keras module for loading the IMDb dataset used for sentiment analysis, classifying movie reviews as positive or negative.
- tensorflow.keras.preprocessing.sequence: A Keras utility for pre-processing sequence data, especially useful for padding or truncating sequences to a uniform length.
Load and Pre-process the IMDb Dataset:
- Set max_features = 10000 to limit the vocabulary to the 10,000 most frequent words in the dataset; any less frequent word is replaced by a shared out-of-vocabulary index rather than kept as its own token.
- Set max_length = 200 to restrict each movie review to 200 words, truncating longer reviews and padding shorter ones.
- Load the IMDb dataset using imdb.load_data(num_words=max_features).
- Pad or truncate each review in the training and testing sets to ensure all reviews have exactly max_length words using sequence.pad_sequences().
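
The following pure-Python sketch mimics what these two pre-processing steps do to a single review. It is an approximation for illustration only: encode_review, pad_pre, and the toy word_rank table are hypothetical helpers, and the constants (start index 1, out-of-vocabulary index 2, offset 3, padding and truncation at the start of the sequence) are the defaults of imdb.load_data and pad_sequences as I understand them.

```python
def encode_review(words, word_rank, num_words=10000,
                  start_char=1, oov_char=2, index_from=3):
    """Map words to IMDb-style integer ids.

    word_rank is a hypothetical dict: word -> frequency rank (1 = most frequent).
    """
    ids = [start_char]  # every review starts with the start marker
    for word in words:
        rank = word_rank.get(word)
        idx = rank + index_from if rank is not None else oov_char
        # Ids outside the vocabulary limit collapse to the OOV index
        ids.append(idx if idx < num_words else oov_char)
    return ids

def pad_pre(ids, maxlen, value=0):
    """Keep the last maxlen ids and left-pad with value."""
    ids = ids[-maxlen:]
    return [value] * (maxlen - len(ids)) + ids

# Toy frequency ranks, invented for this example
word_rank = {"the": 1, "was": 13, "movie": 17, "great": 84, "serendipitous": 25000}

encoded = encode_review(["the", "movie", "was", "great"], word_rank)
print(encoded)                    # [1, 4, 20, 16, 87]
print(encode_review(["serendipitous"], word_rank))  # [1, 2]
print(pad_pre(encoded, maxlen=8)) # [0, 0, 0, 1, 4, 20, 16, 87]
print(pad_pre(encoded, maxlen=3)) # [20, 16, 87]
```

Note that with these defaults a long review keeps its final max_length tokens, so the ending of the review is what the model sees.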
Define a Transformer Block:
- Create a custom layer representing a Transformer block: class TransformerBlock(layers.Layer).
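- Define the initializer method, __init__, which receives the block's hyperparameters:
  - embed_dim: The embedding dimension for each word vector.
  - num_heads: The number of attention heads in the multi-head attention layer.
  - The third argument sets the number of units in the hidden layer of the feed-forward network. It is passed to layers.Dense, so it must be a positive integer (the complete code passes 32), not a dropout rate.
  - Dropout with a rate of 0.1 is applied after the attention and feed-forward sublayers to reduce overfitting.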
- Implement the call function for the forward pass.
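
The order of operations in the forward pass can be sketched in NumPy as below. This is a simplified stand-in rather than the Keras layers themselves: layer_norm omits the learned scale and offset, dropout is left out (it is inactive at inference time), and uniform_attention and feed_forward are hypothetical placeholders for the real sublayers.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize each token vector to zero mean and (near) unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def block_forward(x, attention_fn, ffn_fn):
    """Attention -> add & norm -> feed-forward -> add & norm."""
    attn_out = attention_fn(x)
    out1 = layer_norm(x + attn_out)     # first residual connection
    ffn_out = ffn_fn(out1)
    return layer_norm(out1 + ffn_out)   # second residual connection

rng = np.random.default_rng(0)
seq_len, embed_dim, ff_dim = 5, 8, 16
x = rng.normal(size=(seq_len, embed_dim))
w1 = rng.normal(size=(embed_dim, ff_dim))
w2 = rng.normal(size=(ff_dim, embed_dim))

def uniform_attention(t):
    # Placeholder attention: every token receives the average of all tokens
    return np.tile(t.mean(axis=0), (t.shape[0], 1))

def feed_forward(t):
    # Two dense layers with a ReLU in between, projecting back to embed_dim
    return np.maximum(t @ w1, 0.0) @ w2

out = block_forward(x, uniform_attention, feed_forward)
print(out.shape)  # (5, 8)
```

The output keeps the input shape, which is what allows the residual additions and would allow several blocks to be stacked.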
Define the Model:
- Use an embedding layer.
- Create a Transformer block instance.
- Add layers for global average pooling, dropout, and dense connections.
Compile the Model:
- Configure the model for training.
- Specify the optimizer (Adam), loss function (binary cross-entropy), and metrics (accuracy).
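
For intuition about the loss being minimized, here is a small pure-Python sketch of binary cross-entropy averaged over a batch. The function name, the clipping constant eps, and the example probabilities are assumptions made for this illustration; Keras computes the same quantity on tensors.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and probabilities in [0, 1]."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip so that log(0) can never occur
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1]               # 1 = positive review, 0 = negative
confident = [0.9, 0.1, 0.8]      # probabilities close to the labels
unsure = [0.6, 0.4, 0.5]         # probabilities close to 0.5

loss_confident = binary_cross_entropy(labels, confident)
loss_unsure = binary_cross_entropy(labels, unsure)
print(loss_confident < loss_unsure)  # True
```

The loss shrinks as the sigmoid output moves toward the correct label, which is the signal Adam uses to update the weights.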
Train the Model:
- Train the model using model.fit() with training data, batch size, epochs, and validation split.
Evaluate the Model:
- Evaluate the model on the test data using model.evaluate() to calculate the test loss and accuracy.
- Print the test accuracy to show the model's generalization performance on unseen data.
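
The accuracy metric can be pictured as follows: each sigmoid output is turned into a class by comparing it with a threshold, and the fraction of matches is reported. This is a hedged sketch with a hypothetical helper and invented probabilities, assuming the usual 0.5 threshold for binary accuracy.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

# Invented example: four reviews with their true labels and model outputs
y_true = [1, 0, 1, 0]
y_prob = [0.92, 0.35, 0.41, 0.08]

print(binary_accuracy(y_true, y_prob))  # 0.75
```

The third review is positive but scores below 0.5, so it counts as the one error out of four.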

Complete Code Example:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Load and pre-process the IMDb dataset
max_features = 10000
max_length = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=max_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_length)

# Define a Transformer block
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        # Multi-head self-attention over the token embeddings
        self.at = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # Position-wise feed-forward network; ff_dim is the hidden layer size
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation='relu'),
            layers.Dense(embed_dim),
        ])
        self.layer_norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = layers.LayerNormalization(epsilon=1e-6)
        # rate is the dropout probability applied after each sublayer
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        # Self-attention: the sequence is both query and value
        attn_output = self.at(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        # Residual connection followed by layer normalization
        out1 = self.layer_norm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        # Second residual connection and normalization
        return self.layer_norm2(out1 + ffn_output)

# Define the model
embed_dim = 32  # Embedding dimension for each word vector
num_heads = 2   # Number of attention heads
ff_dim = 32     # Hidden layer size in the feed-forward network

inputs = layers.Input(shape=(max_length,))
# The sequence length comes from the Input layer, so Embedding needs no input_length
embedding_layer = layers.Embedding(input_dim=max_features, output_dim=embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation='relu')(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, batch_size=64, epochs=3, validation_split=0.2)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_accuracy}')

Conclusion

Transformers enhance sentiment analysis by capturing dependencies in text through self-attention. The model built here places a single Transformer block on top of a learned embedding, trains it for three epochs on the IMDb reviews, and reports its accuracy on the held-out test set. Because Transformers process all tokens simultaneously, they are well suited to complex NLP tasks and outperform traditional RNNs and LSTMs in many scenarios.
