This is a common misconception. The episode reward you see during training is not something you should expect to reproduce exactly: if you were to stop training and run the same episode again with the same agent, you would most likely get different actions from the agent and thus a different episode reward.
There are a few reasons for that:
1) During training, the agent explores different options based on some probabilistic exploration strategy. For DQN, this is epsilon-greedy exploration, controlled by the EpsilonGreedyExploration settings (Epsilon, EpsilonMin, EpsilonDecay) in rlDQNAgentOptions. After training, the agent does not explore anymore and relies only on the greedy policy of the underlying neural network for inference, so it makes sense that the results are different (see the first sketch after this list).
2) It is best practice to randomize various elements of the environment during training to get a more robust policy. In that case there will also be differences in how the environment behaves during and after training, so the agent will respond differently as well (the second sketch after this list shows one common way to set this up).
3) Some agents are stochastic, i.e. they use a stochastic policy (for example PG, AC, or PPO agents with stochastic actors). This by itself implies different decisions/behavior even if everything else in the environment remains deterministic.
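To make the first point concrete, here is a minimal MATLAB sketch. The environment, network, and training settings are illustrative placeholders, not a recommendation; the key contrast is that training follows the epsilon-greedy policy, while sim by default runs the agent greedily with no exploration. Fixing the random seed before sim also removes the run-to-run variability mentioned in the third point.

% Example environment with discrete actions (any such environment would do).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% DQN agent with default networks (available in recent toolbox releases).
agent = rlDQNAgent(obsInfo, actInfo);

% Epsilon-greedy exploration settings; these apply only during training.
opts = agent.AgentOptions;
opts.EpsilonGreedyExploration.Epsilon      = 1.0;   % initial exploration rate
opts.EpsilonGreedyExploration.EpsilonMin   = 0.01;  % exploration floor
opts.EpsilonGreedyExploration.EpsilonDecay = 0.005; % per-step decay
agent.AgentOptions = opts;

% Train (episode counts here are illustrative).
trainOpts = rlTrainingOptions("MaxEpisodes", 200, "MaxStepsPerEpisode", 500);
train(agent, env, trainOpts);

% Evaluate: sim follows the greedy policy, with no epsilon exploration.
% Fixing the seed makes the rollout repeatable from run to run.
rng(0)
simOpts = rlSimulationOptions("MaxSteps", 500);
experience = sim(env, agent, simOpts);
episodeReward = sum(experience.Reward.Data)   % total reward of the rollout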
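For the second point, one common pattern is to randomize the initial state in the environment's reset function during training, and pin it to a fixed value for evaluation. The sketch below uses rlFunctionEnv, where the reset function is supplied explicitly; the state layout, myStepFcn, and the specs obsInfo/actInfo are hypothetical placeholders for your own environment.

% Randomized reset for training, fixed reset for evaluation; myStepFcn is
% a placeholder for the environment dynamics, and obsInfo/actInfo are the
% observation/action specs of your environment.
trainEnv = rlFunctionEnv(obsInfo, actInfo, @myStepFcn, @randomResetFcn);
evalEnv  = rlFunctionEnv(obsInfo, actInfo, @myStepFcn, @fixedResetFcn);

function [initialObs, loggedSignals] = randomResetFcn()
    % Hypothetical: random initial pole angle so training sees varied starts.
    theta0 = 0.1*(2*rand - 1);    % uniform in [-0.1, 0.1] rad
    state  = [0; 0; theta0; 0];   % assumed layout: [x; xdot; theta; thetadot]
    initialObs = state;
    loggedSignals.State = state;
end

function [initialObs, loggedSignals] = fixedResetFcn()
    % Fixed initial state, so evaluation rollouts start identically.
    state = [0; 0; 0.05; 0];
    initialObs = state;
    loggedSignals.State = state;
end

An agent trained on trainEnv and evaluated on evalEnv will generally produce different episode rewards than those logged during training, simply because the starting conditions differ.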
Hope that helps