SARSA Agents

The SARSA algorithm is a model-free, online, on-policy reinforcement learning method. A SARSA agent is a value-based reinforcement learning agent that trains a critic to estimate the return or future rewards. For a given observation, the agent selects and outputs the action for which the estimated return is greatest.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

SARSA agents can be trained in environments with the following observation and action spaces.

Observation Space      | Action Space
Continuous or discrete | Discrete

SARSA agents use the following critic.

Critic | Actor
Q-value function critic Q(S,A), which you create using rlQValueFunction or rlVectorQValueFunction | SARSA agents do not use an actor.

During training, the agent explores the action space using epsilon-greedy exploration. During each control interval, the agent selects a random action with probability ϵ; otherwise, with probability 1 − ϵ, it selects the action for which the value function is greatest.
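The exploration rule can be sketched in a few lines of Python. This is an illustration only and not Reinforcement Learning Toolbox code: the function name epsilon_greedy and the list q_values (one estimated return per discrete action) are assumptions made for this example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated returns (hypothetical helper).

    With probability epsilon a uniformly random action is returned (explore);
    otherwise the action with the largest estimate is returned (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties are resolved in favor of the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example call: mostly greedy selection among three actions.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.05)
```

Setting epsilon to 0 in this sketch makes the choice purely greedy, while setting it to 1 makes every choice random.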

Critic Function Approximator

To estimate the value function, a SARSA agent maintains a critic Q(S,A;ϕ), which is a function approximator with parameters ϕ. The critic takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.

For critics that use table-based value functions, the parameters in ϕ are the actual Q(S,A) values in the table.

For more information on creating critics for value function approximation, see Create Policies and Value Functions.

During training, the agent tunes the parameter values in ϕ. After training, the parameters remain at their tuned values, and the trained value function approximator is stored in the critic Q(S,A).
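To make the idea of a critic with parameters ϕ more concrete, the following Python sketch defines a hypothetical linear critic with one weight row per discrete action. It is not toolbox code: the class LinearCritic, the helper one_hot, and the feature encoding are assumptions made for this example. With one-hot state features, each parameter coincides with one table entry, which mirrors the table-based case.

```python
import random

class LinearCritic:
    """Hypothetical critic: Q(S, A; phi) = dot(phi[A], features(S))."""

    def __init__(self, num_features, num_actions, seed=0):
        rng = random.Random(seed)
        # phi holds one weight vector per discrete action, randomly initialized.
        self.phi = [[rng.gauss(0.0, 0.1) for _ in range(num_features)]
                    for _ in range(num_actions)]

    def value(self, features, action):
        """Estimated long-term reward for one observation-action pair."""
        return sum(w * x for w, x in zip(self.phi[action], features))

    def all_values(self, features):
        """Estimates for every discrete action at the given observation."""
        return [self.value(features, a) for a in range(len(self.phi))]

def one_hot(index, size):
    """Encode a discrete state as a one-hot feature vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]

critic = LinearCritic(num_features=4, num_actions=2)
s = one_hot(2, 4)
# With one-hot features, critic.value(s, 1) equals the parameter phi[1][2],
# so the parameters play the role of the Q(S, A) table entries.
table_entry = critic.phi[1][2]
```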

Agent Creation

To create a SARSA agent:

  1. Create a critic using an rlQValueFunction object.

  2. Specify agent options using an rlSARSAAgentOptions object.

  3. Create the agent using an rlSARSAAgent object.

Training Algorithm

SARSA agents use the following training algorithm. To configure the training algorithm, specify options using an rlSARSAAgentOptions object.

  • Initialize the critic Q(S,A;ϕ) with random parameter values in ϕ.

  • For each training episode:

    1. Get the initial observation S from the environment.

    2. For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest.

      $A = \arg\max_{A} Q(S,A;\phi)$

      To specify ϵ and its decay rate, use the EpsilonGreedyExploration option.

    3. Repeat the following for each step of the episode until S is a terminal state:

      1. Execute action A. Observe the reward R and next observation S'.

      2. For the next observation S', select a random action A' with probability ϵ. Otherwise, select the action for which the critic value function is greatest.

        $A' = \arg\max_{A'} Q(S',A';\phi)$

      3. If S' is a terminal state, set the value function target y to R. Otherwise, set it to

        y=R+γQ(S',A';ϕ)

        To set the discount factor γ, use the DiscountFactor option.

      4. Compute the difference ΔQ between the value function target and the current Q(S,A;ϕ) value.

        $\Delta Q = y - Q(S,A;\phi)$

      5. Update the critic using the learning rate α. To set the learning rate, use the LearnRate option of the CriticOptimizerOptions property in the agent options object.

        • For table-based critics, update the corresponding Q(S,A) value in the table.

          Q(S,A)=Q(S,A;ϕ)+αΔQ

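        • For all other types of critics, compute the gradient of the loss function with respect to the parameters ϕ. Then, update the parameters by taking a step against that gradient. In this case, the loss function is half the square of ΔQ, and Δϕ denotes its negative gradient, so that for a table-based critic the update reduces to the rule in the previous bullet.

          $\Delta\phi = -\frac{1}{2}\nabla_{\phi}(\Delta Q)^{2}$

          $\phi = \phi + \alpha\Delta\phi$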

      6. Set the observation S to S'.

      7. Set the action A to A'.
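The steps above can be sketched for a table-based critic as follows. This Python example is a minimal illustration under stated assumptions, not the toolbox implementation: the corridor environment (step), the reward of -1 per move, the helper names, and the hyperparameter values are all hypothetical, and ϵ is held fixed rather than decayed.

```python
import random

NUM_STATES = 5            # hypothetical corridor: states 0..4, episode starts at 0
ACTIONS = (0, 1)          # 0 = move left, 1 = move right
GOAL = NUM_STATES - 1     # terminal state

def step(state, action):
    """Hypothetical environment: every move costs -1; reaching GOAL ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return next_state, -1.0, next_state == GOAL

def select_action(q, state, epsilon, rng):
    """Epsilon-greedy selection from the table-based critic q."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[state][a])

def train_sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize the critic table with small random values.
    q = [[rng.uniform(-0.01, 0.01) for _ in ACTIONS] for _ in range(NUM_STATES)]
    for _ in range(episodes):
        s = 0                                      # initial observation S
        a = select_action(q, s, epsilon, rng)      # initial action A
        done = False
        while not done:
            s_next, r, done = step(s, a)           # execute A, observe R and S'
            a_next = select_action(q, s_next, epsilon, rng)    # select A'
            # Target: R at a terminal state, otherwise R + gamma * Q(S', A').
            y = r if done else r + gamma * q[s_next][a_next]
            delta_q = y - q[s][a]                  # difference between target and Q(S, A)
            q[s][a] += alpha * delta_q             # table update with learning rate alpha
            s, a = s_next, a_next                  # S <- S', A <- A'
    return q

q_table = train_sarsa()
```

Because the target uses the action A' that the agent actually selects next, rather than the greedy maximum over actions, the update is on-policy. In this toy setup, one would expect the entry for moving right from the state next to the goal to approach -1, since its target is always the terminal reward.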
