Part 21 - Policy Gradient Methods in Reinforcement Learning
Machine Learning Algorithms Series - Implementing REINFORCE with TensorFlow and Gym
Following our exploration of Deep Q-Networks (DQN), this article turns to policy gradient methods, a different class of reinforcement learning algorithms. Unlike value-based methods such as Q-learning, which learn action values and derive a policy from them, policy gradient methods learn the policy directly by optimizing the parameters of a policy network. The goal is to find the action selection strategy that maximizes cumulative reward. We'll implement the REINFORCE algorithm using TensorFlow and Gym.
Understanding Policy Gradient Methods
Direct Policy Optimization: Policy gradient methods optimize the policy parameters directly, rather than deriving a policy from a learned value function.
Action Sampling: Actions are sampled from the probability distribution produced by the policy network, so exploration is built into the policy itself.
Gradient-Based Updates: The policy parameters are adjusted by gradient ascent on the expected return, using rewards collected from the environment.
REINFORCE Algorithm: A classic Monte Carlo policy gradient method that weights each action's log-probability by the discounted return that followed it, as formalized just below this list.
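In standard notation, with policy π_θ and discounted return G_t from time step t, the REINFORCE gradient estimate for one episode of length T is:

\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t

Intuitively, actions followed by high returns have their probabilities pushed up, and actions followed by low returns have theirs pushed down.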
Step-by-Step Implementation of REINFORCE
Importing Libraries:
numpy for arrays and numerical computations.
tensorflow (as tf) and tensorflow.keras.layers for building the neural network.
gym for the environment interface.
Setting up the Environment:
Create the Gym environment (e.g., "CartPole-v1").
Get the number of state features from env.observation_space.shape[0].
Define the number of possible actions from env.action_space.n. (A quick check of these values for CartPole-v1 is shown below.)
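A minimal sanity check of these two values for CartPole-v1 (the variable name env here is just for illustration):

import gym

env = gym.make("CartPole-v1")
print(env.observation_space.shape[0])  # 4 state features: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)              # 2 actions: push the cart left or right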
Defining Hyperparameters:
learning_rate: Step size for the optimizer (e.g., 0.01).
gamma: Discount factor for future rewards (e.g., 0.99).
Building the Policy Network:
Use tf.keras.Sequential to create a sequential model.
Add dense layers with ReLU activation as hidden layers.
Include an output layer with one unit per action and a softmax activation, so the network outputs a probability distribution over actions.
Compile the model with the Adam optimizer; no loss is specified at compile time because the REINFORCE loss is computed manually inside tf.GradientTape.
Creating the Action Selection Function:
Reshape the state so the model receives a batch of one observation.
Use the policy model to predict the probability distribution over actions.
Sample an action from that distribution using np.random.choice (a tiny illustration of this sampling step follows this list).
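A minimal illustration of the sampling step, using a made-up probability vector in place of the network's output:

import numpy as np

probabilities = np.array([[0.7, 0.3]])  # hypothetical policy output for a single state (batch of one)
action = np.random.choice(2, p=probabilities[0])  # returns 0 roughly 70% of the time, 1 roughly 30%
print(action)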
Discounted Reward Calculation:
Walk backwards through the episode's rewards, accumulating a running sum discounted by gamma, so each time step receives the discounted return of everything that followed it.
Normalize the discounted returns by subtracting their mean, which reduces the variance of the updates and stabilizes training (see the formula below).
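For an episode of length T, the value computed at each time step t and its mean-centered version are:

G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}, \qquad \hat{G}_t = G_t - \frac{1}{T}\sum_{j=0}^{T-1} G_j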
Training Function:
Calculate the discounted returns for the episode.
Use tf.GradientTape for automatic differentiation.
Calculate action probabilities for each state in the episode.
Prepare index pairs that select the probability of the action actually taken at each step.
Extract those probabilities with tf.gather_nd.
Calculate the policy gradient loss: the negative mean of the log-probabilities weighted by the discounted returns (written out below).
Compute gradients and apply them to update the model's parameters.
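The loss minimized inside the tape is the sample-based negative of the REINFORCE objective, using the mean-centered returns:

L(\theta) = -\frac{1}{T} \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)\, \hat{G}_t

Minimizing this loss with gradient descent is equivalent to ascending the policy gradient estimate given earlier.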
Main Training Loop:
Iterate through a specified number of episodes.
Reset the environment at the start of each episode.
Choose an action using the policy.
Take the action in the environment and receive the next state, reward, and the termination and truncation flags.
Store the state, chosen action, and received reward.
Update the state to the next state.
When the episode ends, convert episode states into a NumPy array and train on the episode.
Log the episode number and total reward.
Complete Code Example:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import gym

# Set up the environment
environment = gym.make("CartPole-v1")
state_dim = environment.observation_space.shape[0]  # number of state features (4 for CartPole-v1)
num_actions = environment.action_space.n            # number of discrete actions (2 for CartPole-v1)

# Define hyperparameters
learning_rate = 0.01
gamma = 0.99

# Build the policy network: state in, probability distribution over actions out
def build_policy_model():
    model = tf.keras.Sequential([
        layers.Dense(24, activation='relu', input_shape=(state_dim,)),
        layers.Dense(24, activation='relu'),
        layers.Dense(num_actions, activation='softmax')
    ])
    # No loss is given here; the REINFORCE loss is computed manually in train_on_episode
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate))
    return model

policy_model = build_policy_model()

# Action selection function: sample an action from the policy's distribution
def choose_action(state):
    state = np.asarray(state, dtype=np.float32).reshape([1, state_dim])
    probabilities = policy_model.predict(state, verbose=0)[0]
    probabilities = probabilities / np.sum(probabilities)  # guard against float32 rounding in the softmax
    action = np.random.choice(num_actions, p=probabilities)
    return int(action)

# Discounted reward calculation
def discounted_rewards(rewards):
    discounted = np.zeros(len(rewards), dtype=np.float32)
    cumulative = 0.0
    for i in reversed(range(len(rewards))):
        cumulative = cumulative * gamma + rewards[i]
        discounted[i] = cumulative
    discounted -= np.mean(discounted)  # Normalize by subtracting the mean
    return discounted

# Training function: one policy gradient update per episode
def train_on_episode(states, actions, rewards):
    returns = discounted_rewards(rewards)
    with tf.GradientTape() as tape:
        action_probabilities = policy_model(states, training=True)
        # Pair each time step index with the action taken so its probability can be selected
        action_indices = tf.stack([tf.range(len(actions)), actions], axis=1)
        selected_action_probabilities = tf.gather_nd(action_probabilities, action_indices)
        # REINFORCE loss: negative log-probability of each action, weighted by its return
        loss = -tf.reduce_mean(tf.math.log(selected_action_probabilities) * returns)
    gradients = tape.gradient(loss, policy_model.trainable_variables)
    policy_model.optimizer.apply_gradients(zip(gradients, policy_model.trainable_variables))

# Main training loop
num_episodes = 1000
for episode in range(num_episodes):
    state, _ = environment.reset()
    episode_states, episode_actions, episode_rewards = [], [], []
    done = False
    while not done:
        action = choose_action(state)
        next_state, reward, terminated, truncated, _ = environment.step(action)
        done = terminated or truncated
        episode_states.append(state)
        episode_actions.append(action)
        episode_rewards.append(reward)
        state = next_state
    episode_states = np.vstack(episode_states).astype(np.float32)
    episode_actions = np.array(episode_actions, dtype=np.int32)
    train_on_episode(episode_states, episode_actions, episode_rewards)
    total_reward = sum(episode_rewards)
    print(f"Episode {episode + 1}, Total Reward: {total_reward}")
Conclusion
Policy gradient methods offer a powerful alternative to value-based methods like DQN. By directly optimizing the policy, these methods can handle continuous action spaces and complex environments. The REINFORCE algorithm, as implemented in this article, provides a clear example of how policy gradients can be used to train an agent to maximize cumulative rewards. This approach is fundamental in reinforcement learning and serves as a building block for more advanced policy gradient algorithms.