MATLAB Answers

DDPG agent has saturated actions with diverging Q value

James Norris on 4 Sep 2020
I have created an environment in Simulink with 3 observations and 2 actions, all of them continuous. The critic state path has 3 inputs, one for each observation, and 2 fully connected layers of 24 neurons each. The critic action path has 2 inputs, one for each action, and one fully connected layer of 24 neurons. The common path has one addition layer and one output layer. The critic network is below
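In code, a critic of roughly that shape would look something like the following. This is a sketch only, not the exact network from the post: it assumes the three observations and the two actions each enter as a single vector input (featureInputLayer needs R2020b or later; earlier examples typically use imageInputLayer), ReLU activations are added for illustration, and all layer names are arbitrary.
statePath = [
    featureInputLayer(3,'Name','observation')   % 3 observations as one vector
    fullyConnectedLayer(24,'Name','fc_state1')
    reluLayer('Name','relu_state1')
    fullyConnectedLayer(24,'Name','fc_state2')];
actionPath = [
    featureInputLayer(2,'Name','action')         % 2 actions as one vector
    fullyConnectedLayer(24,'Name','fc_action')];
commonPath = [
    additionLayer(2,'Name','add')                % merge state and action paths
    reluLayer('Name','relu_common')
    fullyConnectedLayer(1,'Name','QValue')];      % scalar Q-value output
criticNet = layerGraph(statePath);
criticNet = addLayers(criticNet,actionPath);
criticNet = addLayers(criticNet,commonPath);
criticNet = connectLayers(criticNet,'fc_state2','add/in1');
criticNet = connectLayers(criticNet,'fc_action','add/in2');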
The agent was configured with the following options:
agentOptions = rlDDPGAgentOptions(...
    'SampleTime',0.5,...
    'TargetSmoothFactor',1e-2,... %1e-3
    'DiscountFactor',1.0, ...
    'MiniBatchSize',32, ...
    'ExperienceBufferLength',1e5); %1e6
agentOptions.NoiseOptions.Variance = 0.4;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
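For context, options like these are typically combined with critic and actor representations along the following lines. This is only a sketch under stated assumptions: env is the already-created Simulink environment object, criticNet is a network like the sketch above, actorNet is an actor network whose final scaling layer is named 'scale', and all names are illustrative.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
critic = rlQValueRepresentation(criticNet,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'action'});
actor  = rlDeterministicActorRepresentation(actorNet,obsInfo,actInfo, ...
    'Observation',{'observation'},'Action',{'scale'});
agent  = rlDDPGAgent(actor,critic,agentOptions);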
After a few thousand episodes of training, it is apparent that the agent is choosing to output the most extreme action values available throughout the whole episode. For example, if action1 is continuous in the range [0 1] and action2 is continuous in the range [0 10], then the actions settle to (0,0), (0,10), (1,0) or (1,10) for entire episodes. Which extreme values it settles on appears to be random. In addition, the Q value diverges during training, in one instance reaching as high as 10^10 before training was terminated. See below:
Training progress plot: yellow = Q value, blue = episode reward, red = 20-episode average.
Can anyone offer a suggestion as to what is causing the high Q values and the agent's lack of exploration?
Thanks

  1 Comment

Jordan Hamilton on 5 Sep 2020
I’ve experienced a similar issue when using a custom reinforcement learning environment. In my case, when the agent is initialised, the weights and biases seem to ensure that, no matter what the observation is, the initial action is always the maximum action (this is before any training has been conducted). After thousands of episodes of training the agent fluctuates between the maximum and minimum actions rather than finding a value in between. This largely seems to defeat the purpose of having a continuous action space.


Answers (1)

Emmanouil Tzorakoleftherakis
For the actor switching between extreme actions, please refer to this answer, which sounds relevant. In short, make sure you include a tanh layer followed by a scaling layer at the end of your actor network, and make sure the noise options you are using make sense and lead to a stable noise model. For example, a variance of 0.4 sounds a bit too large if your action range is (0,1). This post may also be helpful for setting noise parameters; a rough sketch is below.
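As a rough illustration only: layer names are arbitrary, the Scale/Bias values assume the [0 1] and [0 10] action ranges from the question, and the noise numbers are just one plausible starting point rather than values from the original post.
actorNet = [
    featureInputLayer(3,'Name','observation')
    fullyConnectedLayer(24,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(24,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(2,'Name','fc_out')
    tanhLayer('Name','tanh')                  % bounds the raw output to [-1,1]
    scalingLayer('Name','scale', ...          % maps [-1,1] to [0 1] and [0 10]
        'Scale',[0.5;5],'Bias',[0.5;5])];

% Noise sized relative to each action's range (a common rule of thumb is a
% standard deviation of roughly 1-10% of the range), decaying slowly:
agentOptions.NoiseOptions.Variance = [0.05; 0.5];
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
The exact numbers matter less than the two ideas: the actor output is explicitly bounded to the action range, and the exploration noise is small relative to that range.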
Lastly, for the critic, this post summarizes a few options.

  0 Comments

