Epsilon greedy algorithm and environment reset do not work during DQN agent training

I'm using the Reinforcement Learning Toolbox to design and train a DQN agent. The action space of the agent is composed of 24 discrete actions that represent 24 locations on a grid map; at each step the agent selects one of them as a target point and moves there. The environment is a custom environment that I designed by defining custom reset and step functions. As suggested in the documentation and in some answers here, to promote exploration the reset function sets the starting point of the agent randomly.
The issue I'm facing is that during training (except for the very early episodes), at the beginning of each episode the agent's starting position is always the same, as if the environment reset had not run. Furthermore, the agent performs the same actions in a loop, as if the epsilon-greedy exploration were not working. Here is an example of what I mean:
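For reference, this is roughly the shape of my reset function (the location list and signal names here are simplified placeholders, not my exact code):
function [InitialObservation, LoggedSignals] = myResetFunction()
    % 24 grid locations the agent can start from (placeholder values)
    startLocations = 1:24;
    % Pick one of them at random at the beginning of each episode
    startIdx = randi(numel(startLocations));
    LoggedSignals.State = startLocations(startIdx);
    InitialObservation = LoggedSignals.State;
end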
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 6/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 6.79 | Step Count : 40 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 7/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 6.30 | Step Count : 45 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 8/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 7.46 | Step Count : 50 | Episode Q0 : 0.19
% starting position 14
% Action 4 Action 17 Action 22 Action 18 Action 1
% Episode: 9/1500 | Episode Reward : 7.72 | Episode Steps: 5 | Avg Reward : 7.69 | Step Count : 55 | Episode Q0 : 0.19
% And so on for 20 Episodes.....
Every time an episode starts the environment should be reset, so I would expect a different starting point for every episode. Moreover, the epsilon-greedy exploration does not seem to work: since epsilon is very high during the first episodes, I would expect the agent to perform random actions to explore the action space. These are my exploration settings:
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.001;
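From what I understand of the documentation, epsilon is decayed once per training step as Epsilon = Epsilon*(1-EpsilonDecay) until EpsilonMin is reached, so with the values above it should stay close to 1 for well over a thousand steps. A quick back-of-the-envelope check (my own script, not toolbox code):
epsilon      = 1;      % EpsilonGreedyExploration.Epsilon
epsilonMin   = 0.1;    % EpsilonGreedyExploration.EpsilonMin
epsilonDecay = 0.001;  % EpsilonGreedyExploration.EpsilonDecay
steps = 0;
while epsilon > epsilonMin
    epsilon = epsilon*(1 - epsilonDecay); % decay applied once per training step
    steps = steps + 1;
end
fprintf('Epsilon reaches EpsilonMin after about %d steps\n', steps) % ~2300 steps with these values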
Therefore, my questions are the following:
  1. In my reset function the initial position is selected randomly from a set of locations using randi(). Could there be a problem with the rng settings for reproducibility? Is there a special setting needed to make the initial position truly random? (See the small rng check after this list.)
  2. I would like to understand how the epsilon-greedy algorithm works and whether there is a way to make the agent explore more intensively during the first episodes, avoiding repeated selection of the same actions.
  3. Are there other agent or training parameters that may affect the exploration/exploitation trade-off during training?
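Regarding question 1, this is the kind of rng behaviour I'm wondering about (a minimal check, not my actual code):
rng(0)           % fixed seed: the sequence below is identical on every run
randi(24, 1, 5)
rng('shuffle')   % seed from the current time: different sequence on each run
randi(24, 1, 5)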
Thank you in advance for your help and your time!
  1 Comment
Weihao Zhou on 14 Apr 2021
Hello, I am a beginner in reinforcement learning. I would like to ask where you can observe the detailed results of each episode during DQN training.
Thank you in advance for your help and your time!


Accepted Answer

Emmanouil Tzorakoleftherakis
Hello,
Here are some comments:
1. The reset function should not produce the same output. You should first double-check that the reset function works as expected by calling it as a standalone function outside of RL training (see the snippet below). Right now it seems there is some implementation issue in the function itself. Are you maybe somehow providing a fixed random seed that affects randi?
2. Looking at the numbers you provided, the number of steps per episode changes, so it does not seem like exactly the same actions are being performed. I would give it some more time, and maybe increase the EpsilonMin value a bit as well, just to double-check.
3. For DQN, epsilon is the main one, but there could also be issues with the network design.
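Something along these lines is enough to check point 1 (replace myResetFunction with whatever reset function you registered with your environment):
% Call the reset function directly, outside of training
for k = 1:5
    [obs, ~] = myResetFunction();  % use your actual reset function here
    disp(obs)                      % should print a different start position each time
end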
  2 Comments
Matteo Padovani on 5 Mar 2021
Thank you for your hints!
1. The problem was a fixed seed in a function used for mapping.
3. Regarding possible problems with DQN, I noticed that at some point Q0 diverges drastically from the episode rewards. How can I modify my architecture?
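For anyone else running into this, the bug was roughly of this form (simplified, names are illustrative):
function loc = mapToGrid(pos)   % illustrative name, not my real mapping function
    rng(0)                      % fixed seed re-applied on every call -- this was the bug
    loc = mod(pos - 1, 24) + 1; % placeholder mapping, just to keep the example self-contained
end
% Removing the rng(0) call (or seeding once at the top of the training script)
% restored the random starting positions.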
Emmanouil Tzorakoleftherakis
The ideal convergence scenario for DQN would be for Q0 to approximately track the average episode reward (not individual episode rewards). There is no standard recipe for this; it's all about hyperparameter tuning.





Release

R2020b
