Epsilon greedy algorithm and environment reset do not work during DQN agent training

I'm using the Reinforcement Learning Toolbox to design and train a DQN agent. The action space of the agent is composed of 24 discrete actions that represent 24 locations on a grid map; at each step the agent selects one of them as a target point and moves there. The environment is a custom environment that I designed by defining custom reset and step functions. As suggested in the documentation and in some answers here, to promote exploration the reset function sets the starting point of the agent randomly.
The issue I'm facing is that during training (except for the very early episodes), at the beginning of each episode the agent's starting position is always the same, as if the environment reset had not run. Furthermore, the agent performs the same actions in a loop, as if the epsilon-greedy exploration were not working. Here is an example of what I mean:
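For reference, this is roughly the shape of my reset function (the location list and signal names here are simplified placeholders, not my exact code):
function [InitialObservation, LoggedSignals] = myResetFunction()
    % 24 grid locations the agent can start from (placeholder values)
    startLocations = 1:24;
    % Pick one of them at random at the beginning of each episode
    startIdx = randi(numel(startLocations));
    LoggedSignals.State = startLocations(startIdx);
    InitialObservation = LoggedSignals.State;
end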
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 6/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 6.79 | Step Count : 40 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 7/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 6.30 | Step Count : 45 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 8/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 7.46 | Step Count : 50 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 9/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 7.69 | Step Count : 55 | Episode Q0 : 0.19
% And so on for 20 Episodes.....
Every time an episode starts the environment should be reset, so I would expect a different starting point for every episode. Moreover, the epsilon-greedy exploration does not seem to work: since epsilon is very high during the first episodes, I would expect the agent to perform random actions to explore the action space. These are my exploration settings:
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.001;
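From what I understand of the documentation, epsilon is decayed once per training step as Epsilon = Epsilon*(1-EpsilonDecay) until EpsilonMin is reached, so with the values above it should stay close to 1 for well over a thousand steps. A quick back-of-the-envelope check (my own script, not toolbox code):
epsilon      = 1;      % EpsilonGreedyExploration.Epsilon
epsilonMin   = 0.1;    % EpsilonGreedyExploration.EpsilonMin
epsilonDecay = 0.001;  % EpsilonGreedyExploration.EpsilonDecay
steps = 0;
while epsilon > epsilonMin
    epsilon = epsilon*(1 - epsilonDecay); % decay applied once per training step
    steps = steps + 1;
end
fprintf('Epsilon reaches EpsilonMin after about %d steps\n', steps) % ~2300 steps with these values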
Therefore, my questions are the following:
  1. In my reset function the initial position is selected randomly from a set of locations using randi(). Could there be a problem with the rng settings for reproducibility? Is there a special setting needed to make the initial position truly random? (See the small rng check after this list.)
  2. I would like to understand how the epsilon-greedy algorithm works and whether there is a way to make the agent explore more intensively during the first episodes, avoiding repeated selection of the same actions.
  3. Are there other agent or training parameters that may affect the exploration/exploitation trade-off during training?
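Regarding question 1, this is the kind of rng behaviour I'm wondering about (a minimal check, not my actual code):
rng(0)           % fixed seed: the sequence below is identical on every run
randi(24, 1, 5)
rng('shuffle')   % seed from the current time: different sequence on each run
randi(24, 1, 5)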
Thank you in advance for your help and your time!
  1 Comment
Weihao Zhou on 14 Apr 2021
Hello, I am a beginner in reinforcement learning. I would like to ask where you can observe the detailed results of each episode during DQN training.
Thank you in advance for your help and your time!


Accepted Answer

Emmanouil Tzorakoleftherakis
Hello,
Here are some comments:
1. The reset function should not produce the same output. You should first double-check that the reset function works as expected by calling it as a standalone function outside of RL training (see the snippet below). Right now it seems there is some implementation issue in the function itself. Are you maybe somehow providing a fixed random seed that affects randi?
2. Looking at the numbers you provided, the number of steps per episode changes, so it does not seem like exactly the same actions are being performed. I would give it some more time, and maybe increase the EpsilonMin value a bit as well, just to double-check.
3. For DQN, epsilon is the main one, but there could also be issues with the network design.
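Something along these lines is enough to check point 1 (replace myResetFunction with whatever reset function you registered with your environment):
% Call the reset function directly, outside of training
for k = 1:5
    [obs, ~] = myResetFunction();  % use your actual reset function here
    disp(obs)                      % should print a different start position each time
end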
  2 Comments
Matteo Padovani on 5 Mar 2021
Thank you for your hints!
1. The problem was a fixed seed in a function used for mapping.
3. Regarding possible problems with DQN, I noticed that at some point Q0 diverges drastically from the episode rewards. How can I modify my architecture?
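For anyone else running into this, the bug was roughly of this form (simplified, names are illustrative):
function loc = mapToGrid(pos)   % illustrative name, not my real mapping function
    rng(0)                      % fixed seed re-applied on every call -- this was the bug
    loc = mod(pos - 1, 24) + 1; % placeholder mapping, just to keep the example self-contained
end
% Removing the rng(0) call (or seeding once at the top of the training script)
% restored the random starting positions.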
Emmanouil Tzorakoleftherakis
The ideal convergence scenario for DQN would be for Q0 to approximately track the average episode reward (not individual episode rewards). There is no standard recipe for this; it's all about hyperparameter tuning.





Release

R2020b
