Part 27 - Implementing Transformers for Sentiment Analysis
Machine Learning Algorithms Series - Building a Transformer Model in Python for Enhanced Natural Language Processing
Following our exploration of LSTMs, this article introduces Transformers, a deep learning architecture designed for handling sequential data without relying on recurrence. Transformers use a mechanism called self-attention to process all tokens in the sequence simultaneously, capturing dependencies between tokens regardless of their distance in the sequence. This article guides you through implementing a Transformer model for sentiment analysis, using the IMDb dataset to classify movie reviews.
Understanding Transformers
Sequential Data Handling: Transformers handle sequential data without recurrence, unlike RNNs.
Self-Attention Mechanism: Transformers use self-attention to process all tokens simultaneously, capturing dependencies between tokens regardless of their distance in the sequence.
Foundation of NLP Tasks: Transformers have become the foundation of many NLP tasks and models, including BERT and GPT.
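To make the self-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention. It assumes the queries, keys, and values are the token vectors themselves; the learned projection matrices and multiple heads used by a real Transformer layer are left out to keep the arithmetic visible. The function names and toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Assumption: no learned projections, single head.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Score this token against every token in the sequence, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of all token vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                        for i in range(d)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d token vectors
outputs, weights = self_attention(tokens)
```

Each row of weights sums to 1 and describes how strongly one token attends to every other token, with no dependence on how far apart they sit in the sequence.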
Step-by-Step Implementation
Import Libraries:
tensorflow: The main library for building and training deep learning models.
tensorflow.keras.layers: The Keras module for creating different types of neural network layers.
tensorflow.keras.datasets.imdb: The Keras module for loading the IMDb dataset used for sentiment analysis, classifying movie reviews as positive or negative.
tensorflow.keras.preprocessing.sequence: A Keras utility for pre-processing sequence data, especially useful for padding or truncating sequences to a uniform length.
Load and Pre-process the IMDb Dataset:
Set max_features = 10000 to limit the vocabulary to roughly the 10,000 most frequent words in the dataset; rarer words are not dropped but replaced by a single out-of-vocabulary index.
Set max_length = 200 to fix each movie review at 200 tokens, truncating longer reviews and padding shorter ones.
Load the IMDb dataset using imdb.load_data(num_words=max_features).
Pad or truncate each review in the training and testing sets to exactly max_length tokens using sequence.pad_sequences(). By default, both padding and truncation are applied at the start of the sequence.
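The padding step can be illustrated without Keras. The sketch below mimics the default behaviour of pad_sequences for a single sequence (padding='pre', truncating='pre', value 0); pad_pre is a hypothetical helper written for this illustration, not part of the library, and it assumes maxlen is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic default pad_sequences behaviour for one sequence."""
    seq = list(seq)[-maxlen:]                    # keep the LAST maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # pad on the left with zeros

short_review = [14, 22, 16]
long_review = [1, 2, 3, 4, 5, 6]

print(pad_pre(short_review, 5))  # [0, 0, 14, 22, 16]
print(pad_pre(long_review, 4))   # [3, 4, 5, 6]
```

Truncating from the front means that, for a long review, the model sees the final max_length tokens rather than the opening ones.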
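Define a Transformer Block:
Create a custom layer representing a Transformer block: class TransformerBlock(layers.Layer).
Define the initializer method: def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1).
embed_dim: The embedding dimension for each word vector.
num_heads: The number of attention heads in the multi-head attention layer.
ff_dim: The hidden layer size of the feed-forward network inside the block.
rate: The dropout rate used to reduce overfitting.
Implement the call function for the forward pass: self-attention, dropout, and a residual connection with layer normalization, followed by the feed-forward network with the same dropout, residual, and normalization steps.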
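The residual-plus-normalization step that appears twice in the block's forward pass can be sketched for a single token vector as follows. This is a simplified illustration: the learnable scale and offset of a real layer normalization are assumed to be at their initial values (1 and 0), and the input numbers are made up.

```python
import math

def layer_norm(vec, epsilon=1e-6):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + epsilon) for v in vec]

def add_and_norm(inputs, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(inputs, sublayer_output)])

token = [0.5, -1.0, 2.0, 0.0]        # toy embedding of one token
attn_out = [0.1, 0.3, -0.2, 0.4]     # toy attention output for that token
out1 = add_and_norm(token, attn_out)
```

The residual connection keeps the original token information flowing through the block, while normalization keeps activations on a stable scale.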
Define the Model:
Use an embedding layer.
Create a Transformer block instance.
Add layers for global average pooling, dropout, and dense connections.
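Global average pooling is what turns the per-token outputs of the Transformer block into one vector per review. A minimal sketch, assuming a toy sequence of three 2-dimensional token vectors and hypothetical output-layer weights:

```python
import math

def global_average_pooling_1d(sequence):
    """Average over the time axis: (seq_len, embed_dim) -> (embed_dim,)."""
    n = len(sequence)
    return [sum(column) / n for column in zip(*sequence)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

encoded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy Transformer output
pooled = global_average_pooling_1d(encoded)
print(pooled)  # [3.0, 4.0]

# Hypothetical weights for the final single-unit dense layer.
weights, bias = [0.5, -0.25], 0.0
probability = sigmoid(sum(w * x for w, x in zip(weights, pooled)) + bias)
```

The sigmoid squeezes the final score into the range (0, 1), which is read as the probability that the review is positive.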
Compile the Model:
Configure the model for training.
Specify the optimizer (Adam), loss function (binary cross-entropy), and metrics (accuracy).
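To show what the binary cross-entropy loss measures, here is a small pure-Python sketch. The clipping constant epsilon is an assumption chosen for this illustration to avoid taking log(0); it is not presented as the library's exact internals.

```python
import math

def binary_cross_entropy(y_true, y_pred, epsilon=1e-7):
    """Mean binary cross-entropy over a batch of predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, epsilon), 1 - epsilon)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong, which is the signal the Adam optimizer uses to update the weights.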
Train the Model:
Train the model using model.fit() with training data, batch size, epochs, and validation split.
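The effect of the batch size and validation split can be worked out by hand. The sketch below assumes the 25,000-review IMDb training set, a validation split of 0.2 taken from the end of the data, and a batch size of 64; fit_plan is a hypothetical helper for this arithmetic only.

```python
import math

def fit_plan(num_samples, batch_size, validation_split):
    """Return (training samples, validation samples, batches per epoch)."""
    split_at = int(num_samples * (1 - validation_split))
    num_val = num_samples - split_at
    steps_per_epoch = math.ceil(split_at / batch_size)
    return split_at, num_val, steps_per_epoch

train_n, val_n, steps = fit_plan(25000, 64, 0.2)
print(train_n, val_n, steps)  # 20000 5000 313
```

With these settings each epoch runs 313 weight updates on 20,000 reviews, and the remaining 5,000 are used only to monitor validation loss and accuracy.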
Evaluate the Model:
Evaluate the model on the test data using model.evaluate() to calculate the test loss and accuracy.
Print the test accuracy to show the model's generalization performance on unseen data.
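For a sigmoid output, accuracy comes from thresholding each predicted probability and comparing it with the label. A minimal sketch, assuming a threshold of 0.5 and made-up predictions:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(p == y) for p, y in zip(preds, y_true))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_prob = [0.92, 0.35, 0.48, 0.71, 0.66]  # hypothetical model outputs
print(binary_accuracy(y_true, y_prob))   # 0.6
```

Here the third and fifth predictions fall on the wrong side of the threshold, so three of five reviews are classified correctly.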
Complete Code Example:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Load and pre-process the IMDb dataset
max_features = 10000  # Vocabulary size
max_length = 200      # Tokens kept per review
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=max_length)
x_test = sequence.pad_sequences(x_test, maxlen=max_length)

# Define a Transformer block
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # Position-wise feed-forward network: expand to ff_dim, project back to embed_dim
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation='relu'),
            layers.Dense(embed_dim),
        ])
        self.layer_norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        # Self-attention: the sequence serves as both query and value
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layer_norm1(inputs + attn_output)  # Residual connection + normalization
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layer_norm2(out1 + ffn_output)     # Residual connection + normalization

# Define the model
embed_dim = 32  # Embedding dimension for each word vector
num_heads = 2   # Number of attention heads
ff_dim = 32     # Hidden layer size in feed-forward network

inputs = layers.Input(shape=(max_length,))
# The sequence length is taken from the Input layer, so input_length is not passed here
embedding_layer = layers.Embedding(input_dim=max_features, output_dim=embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation='relu')(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, batch_size=64, epochs=3, validation_split=0.2)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_accuracy}')
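To score a new review with the trained model, the text first has to be converted into the same integer encoding as the dataset. The sketch below assumes the default load_data conventions (start marker 1, out-of-vocabulary marker 2, word ranks shifted by 3). The function encode_review and the toy word ranks are hypothetical; real ranks would come from imdb.get_word_index().

```python
def encode_review(text, word_index, num_words,
                  index_from=3, start_char=1, oov_char=2):
    """Encode raw text into IMDb-style integer ids (simplified tokenizer)."""
    ids = [start_char]
    for word in text.lower().split():
        rank = word_index.get(word)
        idx = rank + index_from if rank is not None else oov_char
        # Words outside the kept vocabulary map to the OOV marker.
        ids.append(idx if idx < num_words else oov_char)
    return ids

toy_word_index = {"the": 1, "movie": 17, "was": 13, "great": 84}  # made-up ranks
encoded = encode_review("The movie was great fun", toy_word_index, num_words=10000)
print(encoded)  # [1, 4, 20, 16, 87, 2]
```

The encoded list would then be padded to max_length with sequence.pad_sequences() and passed to model.predict() to obtain a probability that the review is positive.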
Conclusion
Transformers enhance sentiment analysis by capturing dependencies in text through self-attention. The model built here, a single Transformer block on top of an embedding layer, classifies IMDb movie reviews as positive or negative and reports its test accuracy at the end of the script; the exact figure varies from run to run. Because Transformers process all tokens in parallel rather than one step at a time, they are well suited to complex NLP tasks and outperform traditional RNNs and LSTMs in many scenarios.