To choose an action, is it correct to compute the value of the successor state, or do we need to compute the values of the states along the entire path to the end state?

While selecting an action, the action whose Q(s,a) is maximum is chosen, where Q(s,a) is the sum of the immediate reward and the discounted value of the next state.
From a given state, when I compute the best action, do I need to keep computing (iterating) the values of successor states along the path until the end state,
or
is it enough to compute the value of the immediate successor state alone and choose the action that yields the maximum value?
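In symbols, the relation described above can be written as Q(s,a) = r + γ·V(s'), where r is the immediate reward, γ the discount factor, and V(s') the value of the next state (this notation is standard shorthand for the description above, not taken from the original post).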

Accepted Answer

Emmanouil Tzorakoleftherakis
Hi Gowri,
The Q value for a state-action pair encodes all the information up to 'the end of the path', weighted by the discount factor (assuming you keep following the same policy).
So, assuming you have a critic that approximates the Q function relatively well, you shouldn't need to check the Q values of successor states.
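To make this concrete, here is a minimal sketch of greedy action selection from a learned Q function with a tabular critic; the variable names and the 4-state/2-action sizes are illustrative assumptions, not part of any toolbox API.

% Minimal sketch: greedy action selection from a learned Q-table.
% Assumed setup: 4 discrete states, 2 discrete actions, Q already learned.
numStates  = 4;
numActions = 2;
Q = rand(numStates, numActions);   % stand-in for a learned critic

s = 3;                             % current state
[~, a] = max(Q(s, :));             % pick the action with the largest Q(s,a)

% Only the Q values of the current state are consulted.
% No successor states are evaluated, because Q(s,a) already
% contains the discounted value of everything that follows.
fprintf('Greedy action in state %d: %d\n', s, a);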
  3 Comments
Emmanouil Tzorakoleftherakis
If the approximation of the Q function is relatively accurate (whether it is a table, a neural network, a polynomial, or something else), then yes, looking at the Q value of the current state/action pair should be sufficient when you are trying to 'extract' the policy.
In fact, if you look at vanilla DQN, even during training the Bellman equation only looks one step ahead. I am not saying that n-step learning is not an option, but you certainly don't need all subsequent Q values.
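As a rough illustration of the one-step lookahead mentioned above, this is what a Q-learning-style update with a DQN-like target looks like for a single tabular transition; the discount factor, learning rate, and transition values are assumptions made up for the example.

% Minimal sketch: one-step (vanilla DQN / Q-learning style) Bellman update.
% Assumed example values; in practice (s, a, r, sPrime) come from experience.
gamma = 0.99;                       % discount factor
alpha = 0.1;                        % learning rate
Q = zeros(4, 2);                    % tabular critic: 4 states x 2 actions

s = 1; a = 2; r = 1.0; sPrime = 3;  % one observed transition
isTerminal = false;

% The target looks only one step ahead: r + gamma * max over next-state actions.
if isTerminal
    target = r;
else
    target = r + gamma * max(Q(sPrime, :));
end
Q(s, a) = Q(s, a) + alpha * (target - Q(s, a));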


