Part 21 - Policy Gradient Methods in Reinforcement Learning
Machine Learning Algorithms Series - Implementing REINFORCE with TensorFlow and Gym
Following our exploration of Deep Q-Networks (DQN), this article turns to policy gradient methods, a different class of reinforcement learning algorithms. Unlike value-based methods such as Q-learning, which learn action values and derive a policy from them, policy gradient methods learn the policy directly by optimizing the parameters of a policy network. The goal is to find the action selection strategy that maximizes cumulative reward. We'll implement the REINFORCE algorithm using TensorFlow and Gym.
Understanding Policy Gradient Methods
Direct Policy Optimization: Policy gradient methods optimize the policy parameters directly, rather than deriving a policy from a learned value function.
Action Sampling: Actions are sampled from the probability distribution produced by the policy network, so exploration is built into the policy itself.
Gradient-Based Updates: The policy parameters are adjusted by gradient ascent on the expected return, using rewards collected from the environment.
REINFORCE Algorithm: A classic Monte Carlo policy gradient method that weights each action's log-probability by the discounted return that followed it, as formalized just below this list.
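In standard notation, with policy π_θ and discounted return G_t from time step t, the REINFORCE gradient estimate for one episode of length T is:

\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t

Intuitively, actions followed by high returns have their probabilities pushed up, and actions followed by low returns have theirs pushed down.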
Step-by-Step Implementation of REINFORCE
Importing Libraries:
numpy for arrays and numerical computations.
tensorflow (as tf) and tensorflow.keras.layers for building the neural network.
gym for the environment interface.
Setting up the Environment:
Create the Gym environment (e.g., "CartPole-v1").
Get the number of state features from env.observation_space.shape[0].
Define the number of possible actions from env.action_space.n. (A quick check of these values for CartPole-v1 is shown below.)
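A minimal sanity check of these two values for CartPole-v1 (the variable name env here is just for illustration):

import gym

env = gym.make("CartPole-v1")
print(env.observation_space.shape[0])  # 4 state features: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)              # 2 actions: push the cart left or right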
Defining Hyperparameters:
learning_rate: Step size for the optimizer (e.g., 0.01).
gamma: Discount factor for future rewards (e.g., 0.99).
Building the Policy Network:
Use tf.keras.Sequential to create a sequential model.
Add dense layers with ReLU activation as hidden layers.
Include an output layer with one unit per action and a softmax activation, so the network outputs a probability distribution over actions.
Compile the model with the Adam optimizer; no loss is specified at compile time because the REINFORCE loss is computed manually inside tf.GradientTape.
Creating the Action Selection Function:
Reshape the state so the model receives a batch of one observation.
Use the policy model to predict the probability distribution over actions.
Sample an action from that distribution using np.random.choice (a tiny illustration of this sampling step follows this list).
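A minimal illustration of the sampling step, using a made-up probability vector in place of the network's output:

import numpy as np

probabilities = np.array([[0.7, 0.3]])  # hypothetical policy output for a single state (batch of one)
action = np.random.choice(2, p=probabilities[0])  # returns 0 roughly 70% of the time, 1 roughly 30%
print(action)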
Discounted Reward Calculation:
Walk backwards through the episode's rewards, accumulating a running sum discounted by gamma, so each time step receives the discounted return of everything that followed it.
Normalize the discounted returns by subtracting their mean, which reduces the variance of the updates and stabilizes training (see the formula below).
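For an episode of length T, the value computed at each time step t and its mean-centered version are:

G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}, \qquad \hat{G}_t = G_t - \frac{1}{T}\sum_{j=0}^{T-1} G_j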
Training Function:
Calculate the discounted returns for the episode.
Use tf.GradientTape for automatic differentiation.
Calculate action probabilities for each state in the episode.
Prepare index pairs that select the probability of the action actually taken at each step.
Extract those probabilities with tf.gather_nd.
Calculate the policy gradient loss: the negative mean of the log-probabilities weighted by the discounted returns (written out below).
Compute gradients and apply them to update the model's parameters.
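The loss minimized inside the tape is the sample-based negative of the REINFORCE objective, using the mean-centered returns:

L(\theta) = -\frac{1}{T} \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)\, \hat{G}_t

Minimizing this loss with gradient descent is equivalent to ascending the policy gradient estimate given earlier.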
Main Training Loop:
Iterate through a specified number of episodes.
Reset the environment at the start of each episode.
Choose an action using the policy.
Take the action in the environment and receive the next state, reward, and the termination and truncation flags.
Store the state, chosen action, and received reward.
Update the state to the next state.
When the episode ends, convert episode states into a NumPy array and train on the episode.
Log the episode number and total reward.
Complete Code Example:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import gym

# Set up the environment
environment = gym.make("CartPole-v1")
state_dim = environment.observation_space.shape[0]  # number of state features (4 for CartPole-v1)
num_actions = environment.action_space.n            # number of discrete actions (2 for CartPole-v1)

# Define hyperparameters
learning_rate = 0.01
gamma = 0.99

# Build the policy network: state in, probability distribution over actions out
def build_policy_model():
    model = tf.keras.Sequential([
        layers.Dense(24, activation='relu', input_shape=(state_dim,)),
        layers.Dense(24, activation='relu'),
        layers.Dense(num_actions, activation='softmax')
    ])
    # No loss is given here; the REINFORCE loss is computed manually in train_on_episode
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate))
    return model

policy_model = build_policy_model()

# Action selection function: sample an action from the policy's distribution
def choose_action(state):
    state = np.asarray(state, dtype=np.float32).reshape([1, state_dim])
    probabilities = policy_model.predict(state, verbose=0)[0]
    probabilities = probabilities / np.sum(probabilities)  # guard against float32 rounding in the softmax
    action = np.random.choice(num_actions, p=probabilities)
    return int(action)

# Discounted reward calculation
def discounted_rewards(rewards):
    discounted = np.zeros(len(rewards), dtype=np.float32)
    cumulative = 0.0
    for i in reversed(range(len(rewards))):
        cumulative = cumulative * gamma + rewards[i]
        discounted[i] = cumulative
    discounted -= np.mean(discounted)  # Normalize by subtracting the mean
    return discounted

# Training function: one policy gradient update per episode
def train_on_episode(states, actions, rewards):
    returns = discounted_rewards(rewards)
    with tf.GradientTape() as tape:
        action_probabilities = policy_model(states, training=True)
        # Pair each time step index with the action taken so its probability can be selected
        action_indices = tf.stack([tf.range(len(actions)), actions], axis=1)
        selected_action_probabilities = tf.gather_nd(action_probabilities, action_indices)
        # REINFORCE loss: negative log-probability of each action, weighted by its return
        loss = -tf.reduce_mean(tf.math.log(selected_action_probabilities) * returns)
    gradients = tape.gradient(loss, policy_model.trainable_variables)
    policy_model.optimizer.apply_gradients(zip(gradients, policy_model.trainable_variables))

# Main training loop
num_episodes = 1000
for episode in range(num_episodes):
    state, _ = environment.reset()
    episode_states, episode_actions, episode_rewards = [], [], []
    done = False
    while not done:
        action = choose_action(state)
        next_state, reward, terminated, truncated, _ = environment.step(action)
        done = terminated or truncated
        episode_states.append(state)
        episode_actions.append(action)
        episode_rewards.append(reward)
        state = next_state
    episode_states = np.vstack(episode_states).astype(np.float32)
    episode_actions = np.array(episode_actions, dtype=np.int32)
    train_on_episode(episode_states, episode_actions, episode_rewards)
    total_reward = sum(episode_rewards)
    print(f"Episode {episode + 1}, Total Reward: {total_reward}")
Conclusion
Policy gradient methods offer a powerful alternative to value-based methods like DQN. By directly optimizing the policy, these methods can handle continuous action spaces and complex environments. The REINFORCE algorithm, as implemented in this article, provides a clear example of how policy gradients can be used to train an agent to maximize cumulative rewards. This approach is fundamental in reinforcement learning and serves as a building block for more advanced policy gradient algorithms.