SARSA Agent
The SARSA algorithm is an on-policy reinforcement learning method for environments with a discrete action space. A SARSA agent trains a value-function critic to estimate the expected discounted cumulative long-term reward when following the current policy. SARSA can be regarded as the on-policy counterpart of Q-learning. For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
In Reinforcement Learning Toolbox™, a SARSA agent is implemented by an rlSARSAAgent object.
Note
SARSA agents do not support recurrent networks.
SARSA agents can be trained in environments with the following observation and action spaces.
| Observation Space | Action Space |
|---|---|
| Discrete, continuous, or hybrid. | Discrete |
SARSA agents use the following critic.
| Critic | Actor |
|---|---|
| Q-value function critic Q(S,A), which you create using rlQValueFunction | SARSA agents do not use an actor. |
During training, the agent explores the action space using epsilon-greedy exploration. During each control interval, the agent either selects a random action with probability ϵ or, with probability 1–ϵ, selects the action for which the action-value function is greatest.
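The epsilon-greedy selection rule can be sketched as follows. This is an illustrative Python sketch of the exploration mechanism only, not toolbox code; the function name is hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Select a random action with probability epsilon, otherwise the greedy one.

    q_values: action-value estimates Q(S, a) for the current observation S.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit
```

With ϵ = 0 this always returns the greedy action; with ϵ = 1 it always explores.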
Critic Used by the SARSA Agent
To estimate the value of the current policy, a SARSA agent uses a critic. The critic is a function approximator object that implements the parameterized action-value function Q(S,A;ϕ), with parameters ϕ. For a given observation S and action A, the critic returns the corresponding estimate of the expected discounted cumulative long-term reward under the current policy. During training, the critic tunes the parameters in ϕ to improve its action-value estimates. After training, the parameters remain at their tuned values in the critic internal to the trained agent.
For critics that use table-based value functions, the parameters in ϕ are the actual Q(S,A) values in the table.
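To make this concrete, the following Python sketch shows a table-based critic whose parameters are the table entries themselves; the dimensions are arbitrary and chosen only for illustration.

```python
# For a table-based critic, the parameters phi are the Q(S,A) entries:
# one tunable value per observation-action pair (sizes here are arbitrary).
n_states, n_actions = 4, 2
q_table = [[0.0] * n_actions for _ in range(n_states)]

# "Tuning a parameter" during training amounts to editing a table entry.
q_table[3][1] = 0.5
```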
For more information on creating critics, see Create Actors, Critics, and Policy Objects.
SARSA Agent Creation
To create a SARSA agent:
1. Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using getObservationInfo.
2. Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using getActionInfo.
3. Create an approximation model for your critic. Depending on the type of problem and on the specific critic you use in the next step, this model can be an rlTable object (only for discrete observation spaces), a custom basis function with initial parameter values, or a neural network object. The inputs and outputs of the model you create depend on the type of critic you use in the next step.
4. Create a critic using rlQValueFunction or rlVectorQValueFunction. Use the model you created in the previous step as a first input argument.
5. Specify agent options using an rlSARSAAgentOptions object. Alternatively, you can skip this step and modify the agent options later using dot notation.
6. Create the agent using rlSARSAAgent.
SARSA Agent Initialization
When you create a SARSA agent, the critic Q(S,A;ϕ) is initialized with random parameter values in ϕ.
The agent uses these initial critic parameter values at the beginning of the first training session. For each subsequent training session, the critic retains the parameters from the previous session.
SARSA Training Algorithm
The SARSA agent uses the following training algorithm. To configure the training
algorithm, specify options using an rlSARSAAgentOptions object.
For each training episode:
At the beginning of the episode, get the initial observation from the environment.
For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest.
To specify ϵ and its decay rate, use the EpsilonGreedyExploration option.

Repeat the following operations for each step of the episode until S is a terminal state:
Execute action A. Observe the reward R and the next observation S'.
For the current observation S', select a random action A' with probability ϵ. Otherwise, select the action for which the critic value function is greatest.
If S' is a terminal state, set the value function target y to R. Otherwise, set it to

y = R + γQ(S',A';ϕ)

To set the discount factor γ, use the DiscountFactor option.

Compute the difference ΔQ between the value function target and the current value:

ΔQ = y − Q(S,A;ϕ)
Update the critic using the learning rate α. Specify the learning rate when you create the critic by setting the LearnRate option in the rlCriticOptimizerOptions property within the agent options object.

For table-based critics, update the corresponding Q(S,A) value in the table.
For all other types of critics, compute the gradients Δϕ of the loss function with respect to the parameters ϕ. Then, update the parameters based on the computed gradients. In this case, the loss function is the square of ΔQ.
If ϵ is greater than its minimum value, perform the decay operation as described in EpsilonGreedyExploration.

Set the observation S to S'.
Set the action A to A'.
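Putting the steps above together, a minimal tabular SARSA training loop might look like the following Python sketch. The environment interface (`reset`/`step`), the function name, and the default option values are illustrative assumptions, not the toolbox API; the toolbox handles all of this internally when you call its training functions.

```python
import random

def train_sarsa(env, n_states, n_actions, episodes=200,
                alpha=0.1, gamma=0.99, epsilon=0.9,
                epsilon_decay=0.99, epsilon_min=0.01):
    """Tabular SARSA sketch. `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); these names are illustrative."""
    q = [[0.0] * n_actions for _ in range(n_states)]

    def select(s):
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            return random.randrange(n_actions)                       # explore
        return max(range(n_actions), key=lambda a: q[s][a])          # exploit

    for _ in range(episodes):
        s = env.reset()                 # initial observation S
        a = select(s)                   # initial action A
        done = False
        while not done:
            s2, r, done = env.step(a)   # execute A, observe R and S'
            a2 = select(s2)             # next action A'
            y = r if done else r + gamma * q[s2][a2]   # value function target
            q[s][a] += alpha * (y - q[s][a])           # table update with dQ = y - Q(S,A)
            if epsilon > epsilon_min:
                epsilon *= epsilon_decay               # decay exploration
            s, a = s2, a2               # S <- S', A <- A'
    return q
```

Note that the update target uses Q(S',A') for the action the agent will actually take (on-policy); replacing it with max over actions would give Q-learning instead.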
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. Adaptive Computation and Machine Learning. Cambridge, Mass: The MIT Press, 2018.
See Also
Objects
rlSARSAAgent | rlSARSAAgentOptions | rlQValueFunction | rlVectorQValueFunction | rlQAgent | rlLSPIAgent | rlDQNAgent
Topics
- Train Reinforcement Learning Agent in Basic Grid World
- Train Reinforcement Learning Agent in MDP Environment
- Compare Agents on Deterministic Waterfall Grid World
- Reinforcement Learning Agents
- Q-Learning Agent
- LSPI Agent
- Deep Q-Network (DQN) Agent
- Create Actors, Critics, and Policy Objects
- Train Reinforcement Learning Agents