MATLAB Answers

DDPG agent has saturated actions with diverging Q value

James Norris on 4 Sep 2020
I have created an environment in Simulink with 3 observations and 2 actions, all of them continuous. The critic state path has 3 inputs, one for each observation, and 2 fully connected layers of 24 neurons each. The critic action path has 2 inputs, one for each action, and one fully connected layer of 24 neurons. The common path has one addition layer and one output layer. The critic network is below
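In code, a critic of roughly that shape would look something like the following. This is a sketch only, not the exact network from the post: it assumes the three observations and the two actions each enter as a single vector input (featureInputLayer needs R2020b or later; earlier examples typically use imageInputLayer), ReLU activations are added for illustration, and all layer names are arbitrary.
statePath = [
    featureInputLayer(3,'Name','observation')   % 3 observations as one vector
    fullyConnectedLayer(24,'Name','fc_state1')
    reluLayer('Name','relu_state1')
    fullyConnectedLayer(24,'Name','fc_state2')];
actionPath = [
    featureInputLayer(2,'Name','action')         % 2 actions as one vector
    fullyConnectedLayer(24,'Name','fc_action')];
commonPath = [
    additionLayer(2,'Name','add')                % merge state and action paths
    reluLayer('Name','relu_common')
    fullyConnectedLayer(1,'Name','QValue')];      % scalar Q-value output
criticNet = layerGraph(statePath);
criticNet = addLayers(criticNet,actionPath);
criticNet = addLayers(criticNet,commonPath);
criticNet = connectLayers(criticNet,'fc_state2','add/in1');
criticNet = connectLayers(criticNet,'fc_action','add/in2');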
The agent was configured with the following options:
agentOptions = rlDDPGAgentOptions(...
    'SampleTime',0.5,...
    'TargetSmoothFactor',1e-2,... %1e-3
    'DiscountFactor',1.0, ...
    'MiniBatchSize',32, ...
    'ExperienceBufferLength',1e5); %1e6
agentOptions.NoiseOptions.Variance = 0.4;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
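For context, options like these are typically combined with critic and actor representations along the following lines. This is only a sketch under stated assumptions: env is the already-created Simulink environment object, criticNet is a network like the sketch above, actorNet is an actor network whose final scaling layer is named 'scale', and all names are illustrative.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
critic = rlQValueRepresentation(criticNet,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'action'});
actor  = rlDeterministicActorRepresentation(actorNet,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'scale'});
agent  = rlDDPGAgent(actor,critic,agentOptions);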
After a few thousand episodes of training, it is apparent that the agent is choosing to output the most extreme action values available throughout the whole episode. For example, if action1 is continuous in the range [0 1] and action2 is continuous in the range [0 10], then the actions settle to (0,0), (0,10), (1,0) or (1,10) for entire episodes. Which extreme values it settles on appears to be random. In addition, the Q value diverges during training, in one instance reaching as high as 10^10 before training was terminated. See below:
Training progress plot: yellow = Q value, blue = episode reward, red = 20-episode average.
Can anyone offer a suggestion as to what is causing the high Q values and the agent's lack of exploration?
Thanks

  1 Comment

Jordan Hamilton on 5 Sep 2020
I’ve experienced a similar issue when using a custom reinforcement learning environment. In my case, when the agent is initialised, the weights and biases seem to ensure that, no matter what the observation is, the initial action is always the maximum action (this is before any training has been conducted). After thousands of episodes of training the agent fluctuates between the maximum and minimum actions rather than finding a value in between. This largely seems to defeat the purpose of having a continuous action space.


Answers (1)

Emmanouil Tzorakoleftherakis
For the actor switching between extreme actions, please refer to this answer, which sounds relevant. In short, make sure you include a tanh layer followed by a scaling layer at the end of your actor network, and make sure the noise options you are using make sense and lead to a stable noise model. For example, a variance of 0.4 sounds a bit too large if your action range is (0,1). This post may also be helpful for setting noise parameters; a rough sketch is below.
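As a rough illustration only: layer names are arbitrary, the Scale/Bias values assume the [0 1] and [0 10] action ranges from the question, and the noise numbers are just one plausible starting point rather than values from the original post.
actorNet = [
    featureInputLayer(3,'Name','observation')
    fullyConnectedLayer(24,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(24,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(2,'Name','fc_out')
    tanhLayer('Name','tanh')                  % bounds the raw output to [-1,1]
    scalingLayer('Name','scale', ...          % maps [-1,1] to [0 1] and [0 10]
        'Scale',[0.5;5],'Bias',[0.5;5])];

% Noise sized relative to each action's range (a common rule of thumb is a
% standard deviation of roughly 1-10% of the range), decaying slowly:
agentOptions.NoiseOptions.Variance = [0.05; 0.5];
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
The exact numbers matter less than the two ideas: the actor output is explicitly bounded to the action range, and the exploration noise is small relative to that range.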
Lastly, for the critic, this post summarizes a few options.

  0 Comments

