
When selecting an action, the action with the maximum Q(s,a) is chosen. Q(s,a) is the sum of the immediate reward and the discounted value of the next state.

From a given state, when computing the best action, do I need to keep computing (iterating) the values of successor states along the path until the end state

(or)

is it enough to compute the value of the immediate successor state alone and choose the action that yields the maximum value?

Emmanouil Tzorakoleftherakis
on 6 Jul 2020

Hi Gowri,

The Q value of a state-action pair encodes all the information up to 'the end of the path', weighted by a discount factor (assuming you are following the same policy).

So, assuming you have a critic that approximates the Q function relatively well, you shouldn't need to check the Q values of successor states.
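To make this concrete, here is a minimal sketch (a hypothetical 3-state, 2-action Q-table with made-up values) showing that extracting the greedy policy only reads the Q values of the current state, with no lookahead into successors:

```python
import numpy as np

# Assumed toy Q-table: rows are states, columns are actions.
Q = np.array([
    [1.0, 2.5],   # state 0: action 1 has the higher Q value
    [0.5, 0.1],   # state 1: action 0
    [3.0, 2.0],   # state 2: action 0
])

def greedy_action(Q, state):
    # No successor-state lookups: each Q value already encodes
    # the discounted return until the end of the path.
    return int(np.argmax(Q[state]))

print([greedy_action(Q, s) for s in range(3)])  # [1, 0, 0]
```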

Emmanouil Tzorakoleftherakis
on 6 Jul 2020

If the approximation of the Q function is relatively accurate (whether through a table, a neural network, a polynomial, or something else), then yes, looking at the Q value of the current state/action pair should be sufficient when you are trying to 'extract' the policy.

In fact, if you look at vanilla DQN, even during training the Bellman equation only looks one step ahead. I am not saying that n-step learning is not an option, but you certainly don't need all subsequent Q values.
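The one-step target mentioned above can be sketched in tabular form (toy 3-state, 2-action table; gamma and alpha are assumed values, not from the thread). The update uses only the immediate reward and the Q values of the next state:

```python
import numpy as np

gamma = 0.9   # discount factor (assumed)
alpha = 0.1   # learning rate (assumed)

Q = np.zeros((3, 2))  # toy zero-initialized Q-table

def td_update(Q, s, a, r, s_next, done):
    # One-step Bellman target: r + gamma * max_a' Q(s', a').
    # No Q values beyond the immediate successor are needed.
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])

td_update(Q, s=0, a=1, r=1.0, s_next=1, done=False)
print(Q[0, 1])  # 0.1 after one update of the zero-initialized table
```

Vanilla DQN does the same thing with a neural network in place of the table, regressing Q(s,a) toward this one-step target.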
