Reinforcement Learning Toolbox: DDPG Agent, Q0 diverging to very high values during training

Question

Johan Andreas Stendal on 23 Oct 2019


Direct link to this question

https://in.mathworks.com/matlabcentral/answers/486956-reinforcement-learning-toolbox-ddpg-agent-q0-diverging-to-very-high-values-during-training

Answered: Emmanouil Tzorakoleftherakis on 25 Oct 2019

Attached figure: q0.png

I made a DDPG reinforcement learning agent to control a Simulink environment. It's similar to the water tank level example problem: the agent adjusts the process speed, receives a reward if an output parameter is inside a specified range, and receives a large negative reward if this output parameter goes over a specified threshold.

I started with simple network architectures, around 15-20 neurons and one to three layers, then I went all the way up to 100 neurons in each layer and four to five layers in both the critic and actor networks. I also tried reducing the learning rate to 1e-4.

The process takes around 150-300 timesteps, and the reward is 1 point for each timestep that the output parameter is inside the specified range, so the maximum reward possible should be around 150-300, depending on the process speed.

However, regardless of the chosen network architecture, Q0 diverges to very high values in every training session, around 10e8, then flattens out, while the episode reward bounces around between -1000 and 150 (see attached figure). This pattern persists even after 80,000+ episodes (three days of training). I have read that Q0 and the episode reward should converge if everything is set up correctly, so something is definitely wrong.
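As a rough sanity check on those numbers: with a reward of at most 1 per timestep and at most 300 timesteps, the discounted return from the initial state cannot exceed the geometric sum of the discount factor, so a Q0 near 10e8 cannot be a meaningful value estimate. The sketch below is plain arithmetic in Python, not Reinforcement Learning Toolbox code. The function names are hypothetical, and the discount factor of 0.99 is an assumption, since the value actually used is not stated here.

```python
def max_discounted_return(reward_per_step: float, num_steps: int, gamma: float) -> float:
    """Upper bound on the discounted return when every step earns reward_per_step."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    if gamma == 1.0:
        # No discounting: the bound is simply the undiscounted sum.
        return reward_per_step * num_steps
    # Closed form of sum_{t=0}^{T-1} gamma^t * r
    return reward_per_step * (1.0 - gamma ** num_steps) / (1.0 - gamma)


def q0_is_plausible(q0: float, bound: float, slack: float = 2.0) -> bool:
    """Return True if q0 is within slack times the theoretical bound (hypothetical check)."""
    return q0 <= slack * bound


# Assumed values: 1 point per step, longest episode of 300 steps, gamma = 0.99.
bound_discounted = max_discounted_return(1.0, 300, 0.99)
bound_undiscounted = max_discounted_return(1.0, 300, 1.0)

print(f"Bound with gamma = 0.99: {bound_discounted:.1f}")
print(f"Bound with gamma = 1.00: {bound_undiscounted:.1f}")
print("Is Q0 = 10e8 plausible?", q0_is_plausible(10e8, bound_undiscounted))
```

Even the undiscounted bound of 300 is many orders of magnitude below the observed Q0, which points to a diverging critic rather than an optimistic but sensible estimate.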

The optimal process speed should follow some sort of S-shape to collect the most rewards. However, every time I stop the training, the agent just predicts a constant action value for every time step or no action at all, resulting in a linearly increasing or constant process speed, which does poorly in terms of reward.

Any idea what I am doing wrong?

Thank you for your time!


Answer 1

Emmanouil Tzorakoleftherakis on 25 Oct 2019


Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/486956-reinforcement-learning-toolbox-ddpg-agent-q0-diverging-to-very-high-values-during-training#answer_398247

Hi Johan,

It makes sense that stopping the training leads to bad actions since the blown-up critic values probably don't lead to any significant learning. Could you share a repro example? It is hard to guess what's wrong here otherwise.

Also, have a look at this answer for some additional suggestions. My guess is that you are using too many layers/neurons for the critic.
