Configure Exploration For Reinforcement Learning Agents

This example shows how to use visualization to configure exploration settings for reinforcement learning agents.

Fix Random Seed Generator to Improve Reproducibility

The example code may involve computation of random numbers at various stages such as initialization of the agent, creation of the actor and critic, resetting the environment during simulations, generating observations (for stochastic environments), generating exploration actions, and sampling mini-batches of experiences for learning. Fixing the random number stream preserves the sequence of the random numbers every time you run the code and improves reproducibility of results. You will fix the random number stream at various locations in the example.
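As a small illustration of this idea, the following sketch uses only base MATLAB functions to show that resetting the stream with the same seed reproduces the same random numbers. The variable names savedState, a, and b are chosen here for illustration and are not used elsewhere in the example.

```matlab
% Save the current generator state so this sketch leaves it unchanged
savedState = rng;

% Draw three numbers after seeding the generator
rng(0,"twister");
a = rand(1,3);

% Reseed with the same seed and algorithm, then draw again
rng(0,"twister");
b = rand(1,3);

% The two draws are identical because the stream was reset
sameSequence = isequal(a,b);

% Restore the generator state saved at the start of the sketch
rng(savedState);
```

Here sameSequence is true. Without the second call to rng, b would contain the next three numbers in the stream instead.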

Fix the random number stream with the seed 0 and the Mersenne Twister random number algorithm. For more information on random number generation, see rng.

previousRngState = rng(0,"twister");

The output previousRngState is a structure that contains information about the previous state of the stream. You will restore the state at the end of the example.

Overview

Exploration in reinforcement learning refers to the strategy that an agent uses to discover new knowledge about its environment. Configuring exploration involves adjusting the parameters that govern how the agent explores the environment and typically involves numerous iterations before satisfactory training performance is achieved. Visualizing data can reduce overhead in such scenarios by helping you configure exploration.

In this example, you will visualize and configure exploration metrics for the following reinforcement learning agents:

  • Deep Q-network (DQN) agent.

  • Deep deterministic policy gradient (DDPG) agent.

Create two cart-pole reinforcement learning environments, one with a continuous action space and the other with a discrete action space.

cEnv = rlPredefinedEnv("cartpole-continuous");
dEnv = rlPredefinedEnv("cartpole-discrete");

For more information about these environments, see rlPredefinedEnv.

Create a variable doTraining to enable or disable training of the agents in the example. Because training can be computationally intensive, the value is set to false. To enable training, set the value to true.

doTraining = false;

Epsilon Greedy Exploration

A deep Q-network (DQN) agent performs exploration with an epsilon-greedy policy. The parameters of interest are:

  • Initial epsilon value (default 1.0).

  • Minimum epsilon value (default 0.01).

  • Epsilon decay rate (default 0.005).

When you create a DQN agent, the above default values are assigned to the parameters. First you will train the agent with the default values.

For more information see Deep Q-Network (DQN) Agent.
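To make the role of these three parameters concrete, the following sketch performs one epsilon-greedy step by hand. It is an illustration only and is not how the agent is implemented internally. The Q-values are hypothetical, and the multiplicative decay rule is assumed to be the one described in the rlDQNAgentOptions documentation.

```matlab
% Hypothetical Q-values for the two discrete cart-pole actions (illustrative only)
qValues = [0.2 0.7];

% Exploration parameters, set to the default values listed above
epsilon      = 1.0;
epsilonMin   = 0.01;
epsilonDecay = 0.005;

% With probability epsilon take a random action, otherwise the greedy action
if rand < epsilon
    actionIdx = randi(numel(qValues));   % explore
else
    [~,actionIdx] = max(qValues);        % exploit
end

% Assumed decay rule: shrink epsilon after the step until it reaches the minimum
if epsilon > epsilonMin
    epsilon = epsilon*(1 - epsilonDecay);
end
```

After this single step, epsilon is 0.995. Because the initial epsilon is 1, early actions are effectively all random, and the decay rate determines how quickly the agent shifts toward greedy actions.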

Create the agent object using the observation and action input specifications of the dEnv environment. The agent has the following options:

  • A learning rate of 1e-4 and 20 hidden units for the critic neural network.

  • The double-DQN algorithm is not used for learning.

  • A mini-batch size of 256 is used for learning.

  • The target critic network is updated using a smoothing factor of 1.0 every 4 learning iterations.

% fix the random seed for reproducibility
rng(0,"twister");

% observation and action input specifications
dEnvObsInfo = getObservationInfo(dEnv);
dEnvActInfo = getActionInfo(dEnv);

% agent initialization options
dqnInitOpts = rlAgentInitializationOptions(NumHiddenUnit=20);

% DQN agent options
criticOpts = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1);
dqnOpts = rlDQNAgentOptions( ...
    CriticOptimizerOptions=criticOpts, ...
    MiniBatchSize=256, ...
    TargetSmoothFactor=1, ...
    TargetUpdateFrequency=4, ...
    UseDoubleDQN=false);

% create the agent
dqnAgent = rlDQNAgent(dEnvObsInfo, dEnvActInfo, ...
    dqnInitOpts, dqnOpts);

For more information see rlDQNAgent.
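With a smoothing factor of 1 and an update frequency of 4, the target critic update reduces to a periodic hard copy of the critic parameters. The following sketch illustrates this with hypothetical parameter vectors. The names critic and target, and the constant increment that stands in for a learning update, are assumptions made for illustration.

```matlab
% Target update settings used for the DQN agent above
tau = 1;               % smoothing factor
updateFrequency = 4;   % learning iterations between target updates

% Hypothetical parameter vectors for the critic and the target critic
critic = zeros(1,3);
target = zeros(1,3);

for k = 1:6
    % Stand-in for one learning iteration that changes the critic
    critic = critic + 0.1;

    % Every updateFrequency iterations, blend the critic into the target.
    % With tau = 1 the blend is a plain copy.
    if mod(k,updateFrequency) == 0
        target = tau*critic + (1 - tau)*target;
    end
end
```

After six iterations, the target still holds the critic values from iteration 4 (0.4), while the critic has moved on to 0.6. A smoothing factor below 1 would instead move the target only part of the way toward the critic at each update.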

Create a data logger object to log data during training. The callback function logEpsilon (provided at the end of the example) logs the epsilon values from the training. The logged data is saved in the current directory under the folder named dqn.

dqnLogger = rlDataLogger();
dqnLogger.LoggingOptions.LoggingDirectory = "dqn";
dqnLogger.AgentStepFinishedFcn = @logEpsilon;

Train the agent for a maximum of 500 episodes. Training stops early if the average reward reaches 480.

dqnTrainOpts = rlTrainingOptions( ...
    MaxEpisodes=500, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);
if doTraining
    dqnResult = train(dqnAgent, dEnv, dqnTrainOpts, ...
        Logger=dqnLogger);
end
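The stopping criterion compares a trailing average of the episode rewards against the value 480. The following sketch shows the idea on a hypothetical reward history. The reward values and the window length of 5 episodes are assumptions made for illustration; check the rlTrainingOptions settings in your release for the averaging window actually used.

```matlab
% Hypothetical episode rewards (illustrative values only)
episodeRewards = [100 300 450 500 500 500 500 500];

% Assumed averaging window length, in episodes
windowLength = 5;
stopValue = 480;

% Trailing average over the last windowLength episodes
% (the window shrinks automatically for the first few episodes)
averageReward = movmean(episodeRewards,[windowLength-1 0]);

% First episode at which the average reward reaches the stop value
stopEpisode = find(averageReward >= stopValue,1);
```

For this reward history, stopEpisode is 7. Individual episodes reach 500 earlier, but the average lags behind until enough high-reward episodes fill the window.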

An example of the training is shown in the Reinforcement Learning Training Monitor window. Depending on your system configuration, you may get a different training result.

To visualize exploration, first click View Logged Data in the Reinforcement Learning Training Monitor window.

In the Reinforcement Learning Data Viewer window, select Epsilon and choose the Line plot type from the toolstrip.

As seen in the plots:

  • The average reward received over 500 episodes did not reach the desired value of 480.

  • The epsilon value decayed to its minimum value after around 1000 iterations. After that point, the agent explores only rarely (with probability equal to the minimum epsilon value of 0.01).

Close the Reinforcement Learning Data Viewer and Reinforcement Learning Training Monitor windows.

Inspect the exploration parameters.

dqnOpts.EpsilonGreedyExploration
ans = 
  EpsilonGreedyExploration with properties:

    EpsilonDecay: 0.0050
         Epsilon: 1
      EpsilonMin: 0.0100

The default value of epsilon decay rate is 0.005. Specify a smaller decay rate so that the agent performs more exploration.

dqnOpts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
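To get a rough sense of the effect of this change, the following sketch estimates how many training steps it takes for epsilon to fall from its initial value to its minimum value for the two decay rates. It assumes the multiplicative decay rule Epsilon = Epsilon*(1-EpsilonDecay) applied once per training step; the variable names are chosen here for illustration.

```matlab
% Initial and minimum epsilon values (defaults shown above)
epsilon0   = 1;
epsilonMin = 0.01;

% Default decay rate and the smaller decay rate specified above
decayRates = [0.005 1e-3];

% Solve epsilon0*(1 - decay)^n <= epsilonMin for the smallest integer n
stepsToMin = ceil(log(epsilonMin/epsilon0) ./ log(1 - decayRates));
```

Under this assumption, epsilon reaches its minimum after about 920 steps with the default decay rate, which is consistent with the plot from the first training run, and after about 4600 steps with the smaller decay rate.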

Configure and train the agent with new exploration parameters.

% fix the random seed for reproducibility
rng(0,"twister");

% create the agent
dqnAgent = rlDQNAgent(dEnvObsInfo, dEnvActInfo, ...
    dqnInitOpts, dqnOpts);

% train the agent
dqnLogger.LoggingOptions.LoggingDirectory = "dqnTuned";
if doTraining
    dqnResult = train(dqnAgent, dEnv, dqnTrainOpts, ...
        Logger=dqnLogger);
end

The training with new exploration parameters is shown below.

Open the Reinforcement Learning Data Viewer window and plot the Epsilon values again.

As seen in the plots:

  • This time the average reward reached the desired value of 480.

  • The epsilon value decayed more slowly than in the previous training run. Increasing the exploration helped improve the training performance.

Close the Reinforcement Learning Data Viewer and Reinforcement Learning Training Monitor windows.

Ornstein-Uhlenbeck (OU) Noise

A deep deterministic policy gradient (DDPG) agent uses the Ornstein-Uhlenbeck (OU) noise model for exploration.

The parameters of interest for the noise model are:

  • Mean of the noise (default 0).

  • Mean attraction constant (default 0.15).

  • Initial standard deviation (default 0.3).

  • Standard deviation decay rate (default 0).

  • Minimum standard deviation (default 0).

When you create a DDPG agent, the above default values are assigned to the parameters. First you will train the agent with the default values.

For more information see Deep Deterministic Policy Gradient (DDPG) Agent.
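To see how the mean, mean attraction constant, and standard deviation shape the noise, the following sketch simulates a discrete-time Ornstein-Uhlenbeck process with the default values listed above. The update rule is assumed to have the form described in the rlDDPGAgentOptions documentation, the sample time of 0.02 s matches the cart-pole environment, and the variable names are chosen here for illustration.

```matlab
% Noise parameters set to the default values listed above
noiseMean         = 0;
attractionConst   = 0.15;
standardDeviation = 0.3;

% Assumed sample time (seconds) and number of simulated steps
Ts = 0.02;
numSteps = 500;

noise = zeros(1,numSteps);
x = 0;   % initial noise value

for k = 1:numSteps
    % Pull the noise toward its mean and add a scaled Gaussian increment
    x = x + attractionConst*(noiseMean - x)*Ts ...
          + standardDeviation*sqrt(Ts)*randn;
    noise(k) = x;
end
```

You can inspect the result with plot(noise). Because each sample depends on the previous one, the noise is correlated in time, unlike independent Gaussian noise. A smaller attraction constant lets the noise wander further from the mean before it is pulled back.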

Create the agent object using the observation and action input specifications of the cEnv environment. The agent has the following options:

  • A learning rate of 1e-4 and 200 hidden units for the actor neural network.

  • A learning rate of 1e-3 and 200 hidden units for the critic neural network.

  • A mini-batch size of 64 is used for learning.

  • A sample time of 0.02 s.

% fix the random stream for reproducibility
rng(0,"twister");

% observation and action input specifications
cEnvObsInfo = getObservationInfo(cEnv);
cEnvActInfo = getActionInfo(cEnv);

% agent initialization options
ddpgInitOpts = rlAgentInitializationOptions(NumHiddenUnit=200);

% DDPG agent options
actorOpts = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1);
criticOpts = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1)
criticOpts = 
  rlOptimizerOptions with properties:

                  LearnRate: 1.0000e-03
          GradientThreshold: 1
    GradientThresholdMethod: "l2norm"
     L2RegularizationFactor: 1.0000e-04
                  Algorithm: "adam"
        OptimizerParameters: [1x1 rl.option.OptimizerParameters]

ddpgOpts = rlDDPGAgentOptions( ...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    MiniBatchSize=64, ...
    SampleTime=cEnv.Ts);

% create the agent
ddpgAgent = rlDDPGAgent(cEnvObsInfo, cEnvActInfo, ...
    ddpgInitOpts, ddpgOpts);

For more information see rlDDPGAgent.

Create a data logger object to log data during training. The function logOUNoise (provided at the end of the example) logs the noise and standard deviation values from the training. Save the logged data in the folder named ddpg.

ddpgLogger = rlDataLogger();
ddpgLogger.LoggingOptions.LoggingDirectory = "ddpg";
ddpgLogger.AgentStepFinishedFcn = @logOUNoise;

Train the agent for a maximum of 500 episodes. Training stops early if the average reward reaches 480.

ddpgTrainOpts = rlTrainingOptions( ...
    MaxEpisodes=500, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);
if doTraining
    ddpgResult = train(ddpgAgent, cEnv, ddpgTrainOpts, ...
        Logger=ddpgLogger);
end

An example of the training is shown in the Reinforcement Learning Training Monitor window. Depending on your system configuration, you may get a different training result.

Click the View Logged Data button in the Reinforcement Learning Training Monitor window.

In the Reinforcement Learning Data Viewer window:

  • Select OUNoise and choose the Line plot type from the toolstrip.

  • Select StandardDeviation and choose the Line plot type from the toolstrip.

As seen in the plots:

  • The agent did not achieve the desired average reward of 480.

  • The noise value generally remains within ±1.

  • The standard deviation value remains constant throughout the training.

Close the Reinforcement Learning Data Viewer and Reinforcement Learning Training Monitor windows.

Inspect the default exploration parameters.

ddpgOpts.NoiseOptions
ans = 
  OrnsteinUhlenbeckActionNoise with properties:

                 InitialAction: 0
                          Mean: 0
        MeanAttractionConstant: 0.1500
    StandardDeviationDecayRate: 0
             StandardDeviation: 0.3000
          StandardDeviationMin: 0

It may be useful to decay the exploration and gradually shift the agent's behavior from exploration to exploitation as it learns more about the environment. Early on, more exploration ensures a diverse range of experiences, which is crucial for the agent to learn a robust policy. However, as learning progresses, too much exploration can introduce unnecessary variance and instability into the learning process. Decaying exploration helps to stabilize learning by gradually reducing this variance.

  • Specify a mean attraction constant value of 0.1. A smaller value reduces the attraction of the noise process towards the mean value.

  • Specify an initial standard deviation value of 0.3 (the same as the default value).

  • Specify a standard deviation decay rate of 1e-4.

ddpgOpts.NoiseOptions.MeanAttractionConstant = 0.1;
ddpgOpts.NoiseOptions.StandardDeviation = 0.3;
ddpgOpts.NoiseOptions.StandardDeviationDecayRate = 1e-4;
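To estimate how quickly the exploration noise shrinks with these settings, the following sketch evaluates the standard deviation after a given number of agent steps. It assumes the multiplicative rule StandardDeviation = StandardDeviation*(1-StandardDeviationDecayRate) applied once per step, bounded below by the minimum standard deviation. The step count of 10000 and the variable names are chosen here for illustration.

```matlab
% Noise options specified above
sigma0    = 0.3;    % initial standard deviation
decayRate = 1e-4;   % standard deviation decay rate
sigmaMin  = 0;      % minimum standard deviation (default)

% Standard deviation after a hypothetical number of agent steps
numSteps = 10000;
sigmaAfter = max(sigmaMin, sigma0*(1 - decayRate)^numSteps);

% Number of steps needed for the standard deviation to halve
halfLifeSteps = ceil(log(0.5)/log(1 - decayRate));
```

Under this assumption, the standard deviation falls from 0.3 to roughly 0.11 after 10000 steps and halves about every 6900 steps. A larger decay rate moves the agent toward exploitation sooner, at the risk of ending exploration before the policy has improved enough.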

Train the agent with the new exploration options.

% fix the random seed for reproducibility
rng(0,"twister");

% create the agent
ddpgAgent = rlDDPGAgent(cEnvObsInfo, cEnvActInfo, ...
    ddpgInitOpts, ddpgOpts);

% train the agent
ddpgLogger.LoggingOptions.LoggingDirectory = "ddpgTuned";
if doTraining
    ddpgResult = train(ddpgAgent, cEnv, ddpgTrainOpts, ...
        Logger=ddpgLogger);
end

The training with new exploration parameters is shown below.

Open the Reinforcement Learning Data Viewer window and plot the OUNoise and StandardDeviation values again.

As seen in the plots:

  • This time the average reward reached the desired value of 480.

  • The standard deviation value was decayed. Consequently, the noise values were larger towards the beginning and smaller towards the end of the training.

Close the Reinforcement Learning Data Viewer and Reinforcement Learning Training Monitor windows.

Restore the random number stream using the information stored in previousRngState.

rng(previousRngState);

Logging Functions

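function dataToLog = logEpsilon(data)
% Log the current epsilon value of the agent's exploration policy.
policy = getExplorationPolicy(data.Agent);
pstate = getState(policy);
dataToLog.Epsilon = pstate.Epsilon;
end

function dataToLog = logOUNoise(data)
% Log the current noise value and standard deviation of the
% Ornstein-Uhlenbeck noise in the agent's exploration policy.
policy = getExplorationPolicy(data.Agent);
pstate = getState(policy);
dataToLog.OUNoise = pstate.Noise{1};
dataToLog.StandardDeviation = pstate.StandardDeviation{1};
end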
