I'm currently training a DQN agent for my RL problem. As the training progresses, I can see that the episode reward, running average and Q0 converge to (approximately) the same value, which is a good sign. However, I am uncertain if indeed it is able to find the optimal policy, or it just gets stuck in a local minimum.
With this, I have the following questions about exploration using the epsilon-greedy algorithm (with configurable parameters in rlDQNAgentOptions).
1. Does the epsilon decay every time step and continuously over all episodes (meaning, it does not reset to epsilon-max at the start of every new episode)?
2. Do the number of time steps per episode and the total number of episodes have a direct impact on the exploration process? Or are there other parameters which affect exploration besides the epsilon parameters?
3. How is the Q0 estimate calculated? Is it solely based on the output of my DNN policy representation?
4. How is the episode reward calculated? My understanding is that, it is just the sum of the actual rewards for all time steps within an episode.
Thank you in advance for your help! :)