DDPG Agent not converging, how to improve?

I have a custom Simulink environment and am implementing a DDPG agent. I am simply trying to get a quadcopter to level off at z = 5, with the action being total thrust. I am using an exponential reward function that peaks at 2 when z = 5 and flattens out to 0 as z gets farther from 5. My reset function places the quadcopter at a random height between z = 0 and z = 10. Looking at the training reward and the q0 trend, I can see that the agent is learning, and at times it achieves almost the maximum reward, but I can't get it to converge. Here is my training result plot, where the maximum reward per episode is 800: [training plot attached]
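Concretely, the reward has roughly this shape (the Gaussian form and the width sigma shown here are illustrative, not the exact function I use):

    function r = heightReward(z)
    % Exponential (Gaussian-shaped) reward: peaks at 2 when z = 5 and
    % flattens toward 0 far from the target. sigma is an assumed width.
    sigma = 1.0;
    r = 2 * exp(-(z - 5).^2 / (2 * sigma^2));
    end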
Actor and critic networks and agent options: [attached as images; not reproduced here]
I am wondering:
1) Why does the agent perform fairly well around episode 500, but only get worse after that? Is my learning rate too low?
2) Near episode 1250, why did the model suddenly drop to very low reward values while q0 started to increase? Did I just not have enough episodes?
3) Why is there so much fluctuation in reward, especially in later episodes? It seems as though in one episode it will almost reach 800, and then in the next it will drop to 0.
4) The reward function design makes it pretty much impossible for the quad to achieve the max reward of 800 unless the random initial height is right at 5. Even if the agent performs optimally, the reward will decrease as the starting position gets farther from 5. Does this interfere with the training process? Should I redesign the reward function so that the initial position does not influence the overall reward? (See the sketch after this list.)
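For example, a reward that pays a fixed per-step bonus only while the quad holds a tolerance band around the target would make the optimal return roughly independent of the start height (the band width below is just an illustration):

    % Hypothetical alternative reward, computed each step from the current
    % height z: a flat bonus inside a tolerance band around z = 5, so an
    % optimal policy earns about the same return wherever the episode starts.
    tol    = 0.25;                     % assumed tolerance band, same units as z
    reward = 2 * (abs(z - 5) < tol);   % 2 per step inside the band, else 0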
Looking at the training plot, if anyone has advice on parameter tuning that would help convergence and performance, please let me know. I am exploring some options on my own, but since training takes many hours I was hoping to get outside opinions as well.

1 Comment

Hi,
For the first question, I think that is related to your reward function and possibly also to the L2 regularization.
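For reference, the learning rates and the L2 regularization factor are set through the optimizer options (the option names below follow recent Reinforcement Learning Toolbox releases, which use rlOptimizerOptions; older releases used rlRepresentationOptions instead, and the values are only examples):

    % Illustrative optimizer settings; the values are examples, not a recommendation
    criticOpts = rlOptimizerOptions('LearnRate',1e-3, ...
        'L2RegularizationFactor',1e-4,'GradientThreshold',1);
    actorOpts  = rlOptimizerOptions('LearnRate',1e-4, ...
        'L2RegularizationFactor',1e-4,'GradientThreshold',1);
    agentOpts  = rlDDPGAgentOptions( ...
        'CriticOptimizerOptions',criticOpts, ...
        'ActorOptimizerOptions',actorOpts);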
For the second question: the agent is searching for the policy that obtains the maximum reward, and to do that the algorithm has to try different states and actions. Also, q0 is a diagnostic metric; at the end of training, the desired condition is that it converges to the same value as the average reward.
For the third question, I can only say that you should share your model and network code; the 1st, 2nd, and 3rd questions actually depend on each other.
For the last question: random initial conditions help the agent train more efficiently, so they are randomized precisely to reduce overfitting. Exploration noise can also be added in the algorithm to achieve a similar effect.
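As a sketch of both ideas for a Simulink environment (the variable name 'z0' is a placeholder for however your model parameterizes the start height, and the noise property names follow recent toolbox releases):

    % env is the rlSimulinkEnv object created for the model; randomize the
    % initial height on every episode through the reset function.
    env.ResetFcn = @(in) setVariable(in,'z0',10*rand);   % start in [0, 10]

    % Shape the Ornstein-Uhlenbeck exploration noise used by DDPG.
    agentOpts = rlDDPGAgentOptions;
    agentOpts.NoiseOptions.StandardDeviation          = 0.3;   % exploration scale
    agentOpts.NoiseOptions.StandardDeviationDecayRate = 1e-5;  % anneal during training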
If you can share your model, clearer advice can be given.
