Reinforcement Learning Agents

The goal of reinforcement learning is to train an agent to complete a task within an uncertain environment. At each time interval, the agent receives observations and a reward from the environment and sends an action to the environment. The reward is a measure of how successful the previous action (taken from the previous state) was with respect to completing the task goal.

The agent contains two components: a policy and a learning algorithm.

  • The policy is a mapping from the current environment observation to a probability distribution of the actions to be taken. Within an agent, the policy is implemented by a function approximator with tunable parameters and a specific approximation model, such as a deep neural network.

  • The learning algorithm continuously updates the policy parameters based on the actions, observations, and rewards. The goal of the learning algorithm is to find an optimal policy that maximizes the expected cumulative long-term reward received during the task.

Diagram showing an agent that interacts with its environment. The observation signal goes from the environment to the agent, and the action signal goes from the agent to the environment. The reward signal goes from the environment to the reinforcement learning algorithm inside the agent. The reinforcement learning algorithm uses the available information to update a policy. The agent uses a policy to map an observation to an action.
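
The interaction described above can be sketched as a simple loop. The following is a minimal illustration in Python, not Reinforcement Learning Toolbox code. The CorridorEnv environment, the RandomAgent class, and their reset, step, act, and learn methods are hypothetical names chosen for this sketch.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial observation

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1
        return self.position, reward, done


class RandomAgent:
    """Agent = policy + learning algorithm (the learning step is a no-op here)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, observation):
        # policy: maps an observation to an action (here, uniformly at random)
        return self.rng.choice([-1, 1])

    def learn(self, observation, action, reward, next_observation):
        # a real learning algorithm would update the policy parameters here
        pass


def run_episode(env, agent, max_steps=200):
    """Run one episode and return the cumulative (undiscounted) reward."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)                     # agent -> environment
        next_observation, reward, done = env.step(action)   # environment -> agent
        agent.learn(observation, action, reward, next_observation)
        total_reward += reward
        observation = next_observation
        if done:
            break
    return total_reward
```

In this sketch the learn method does nothing. In a real agent, it is the place where the learning algorithm updates the policy parameters from each observation, action, and reward.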

Depending on the learning algorithm, an agent maintains one or more parameterized function approximators for training the policy. Approximators can be used in two ways.

  • Critic: For a given observation and action, a critic returns the predicted discounted value of the cumulative long-term reward.

  • Actor: For a given observation, an actor returns the action that typically maximizes the predicted discounted cumulative long-term reward.
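
Both definitions refer to the discounted cumulative long-term reward. As a sketch, assuming the standard definition with a discount factor gamma between 0 and 1 (the symbol and the function name are assumptions, not taken from this page), the quantity that a critic tries to predict can be computed for a known reward sequence as follows. Python is used for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma**2*r_2 + ..., accumulated from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

A smaller gamma makes the agent weigh immediate rewards more heavily than distant ones.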

Agents that use only critics to select their actions rely on an indirect policy representation. These agents are also referred to as value-based, and they use an approximator to represent a value function (value as a function of the observation) or Q-value function (value as a function of observation and action). In general, these agents work better with discrete action spaces but can become computationally expensive for continuous action spaces.
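
A value-based agent derives its policy indirectly, by comparing the critic's Q-values across actions. The sketch below, in Python for illustration, assumes a discrete action space where the critic output is available as a list q_values with one entry per action. The epsilon-greedy exploration rule is a common choice and an assumption here, not something this page specifies.

```python
import random


def greedy_action(q_values):
    """Return the index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


rng = random.Random(0)
print(greedy_action([0.1, 0.7, 0.3]))                    # 1
print(epsilon_greedy_action([0.1, 0.7, 0.3], 0.0, rng))  # 1 (no exploration)
```

Because the greedy step enumerates every action, this approach does not extend directly to continuous action spaces, where finding the maximizing action becomes an optimization problem at every step.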

Agents that use only actors to select their actions rely on a direct policy representation. These agents are also referred to as policy-based. The policy can be either deterministic or stochastic. In general, these agents are simpler and can handle continuous action spaces, though the training algorithm can be sensitive to noisy measurements and can converge to local optima.
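
To illustrate the difference between a deterministic and a stochastic policy on a continuous action space, the following Python sketch uses a linear actor with hypothetical parameters weights, bias, and std. A deep neural network would normally take the place of the linear map, and none of these names come from the Toolbox.

```python
import random


def mean_action(weights, bias, observation):
    """Linear actor: weighted sum of the observation plus a bias."""
    return sum(w * x for w, x in zip(weights, observation)) + bias


def deterministic_action(weights, bias, observation):
    """Deterministic policy: always return the actor output itself."""
    return mean_action(weights, bias, observation)


def stochastic_action(weights, bias, std, observation, rng):
    """Stochastic (Gaussian) policy: sample around the actor output."""
    return rng.gauss(mean_action(weights, bias, observation), std)


obs = [2.0, 0.5]
print(deterministic_action([1.0, -2.0], 0.5, obs))  # 1.5
```

The direct representation returns a continuous action in a single evaluation, with no search over candidate actions.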

Agents that use both an actor and a critic are referred to as actor-critic agents. In these agents, during training, the actor learns the best action to take using feedback from the critic (instead of using the reward directly). At the same time, the critic learns the value function from the rewards so that it can properly criticize the actor. In general, these agents can handle both discrete and continuous action spaces.
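
The division of labor between actor and critic can be sketched with a one-step update. This is a minimal tabular illustration in Python, assuming a softmax actor over per-state action preferences and a state-value critic. The function names, the learning rates, and the use of the temporal-difference error as the critic's feedback are assumptions of this sketch, not a description of how the built-in agents are implemented.

```python
import math


def softmax(preferences):
    """Convert action preferences into a probability distribution."""
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(prefs, values, s, a, reward, s_next, done,
                      gamma=0.99, actor_lr=0.1, critic_lr=0.1):
    """Update prefs (actor) and values (critic) in place; return the TD error."""
    target = reward + (0.0 if done else gamma * values[s_next])
    td_error = target - values[s]        # feedback that the critic gives the actor
    values[s] += critic_lr * td_error    # the critic learns from the reward
    probs = softmax(prefs[s])
    for b in range(len(prefs[s])):
        # gradient of log pi(a|s) with respect to the preference of action b
        grad = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += actor_lr * td_error * grad  # the actor learns from the critic
    return td_error
```

A positive TD error means the action turned out better than the critic expected, so the actor raises the preference for that action. A negative error lowers it.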

Built-In Agents

Reinforcement Learning Toolbox™ software provides the following built-in agents. You can train these agents in environments with either continuous or discrete observation spaces and the following action spaces.

The following tables summarize the type, action space, and approximators used for each built-in agent. For each agent, the observation space can be discrete, continuous, or mixed.

Built-In Agents: Type and Action Space

| Agent | Type | Action Space | On/Off Policy |
|---|---|---|---|
| Q-Learning Agents (Q) | Value-Based | Discrete | Off-policy |
| SARSA Agents | Value-Based | Discrete | On-policy |
| Deep Q-Network (DQN) Agents | Value-Based | Discrete | Off-policy |
| Policy Gradient Agents (PG) | Policy-Based | Discrete or continuous | On-policy |
| Actor-Critic Agents (AC) | Actor-Critic | Discrete or continuous | On-policy |
| Deep Deterministic Policy Gradient (DDPG) Agents | Actor-Critic | Continuous | Off-policy |
| Twin-Delayed Deep Deterministic Policy Gradient Agents (TD3) | Actor-Critic | Continuous | Off-policy |
| Soft Actor-Critic Agents (SAC) | Actor-Critic | Continuous | Off-policy |
| Proximal Policy Optimization Agents (PPO) | Actor-Critic | Discrete or continuous | On-policy |
| Trust Region Policy Optimization Agents (TRPO) | Actor-Critic | Discrete or continuous | On-policy |
| Model-Based Policy Optimization Agents (MBPO) | Actor-Critic | Discrete or continuous | Off-policy |
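
The difference between the off-policy and on-policy labels in the table can be seen in the update targets of the two simplest agents. The following Python sketch assumes a tabular Q-value function stored as a nested list q[state][action], a discount factor gamma, and a learning rate lr. These names are illustrative and are not Toolbox API.

```python
def q_learning_target(q, reward, next_state, gamma):
    """Off-policy: bootstrap from the best next action, whatever the agent does next."""
    return reward + gamma * max(q[next_state])


def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the current policy actually takes next."""
    return reward + gamma * q[next_state][next_action]


def td_update(q, state, action, target, lr):
    """Move Q(state, action) a fraction lr of the way toward the target."""
    q[state][action] += lr * (target - q[state][action])


q = [[0.0, 0.0], [1.0, 3.0]]
print(q_learning_target(q, 1.0, 1, 0.5))  # 2.5, uses max(q[1]) = 3.0
print(sarsa_target(q, 1.0, 1, 0, 0.5))    # 1.5, uses q[1][0] = 1.0
```

The two targets coincide only when the action chosen by the current policy is also the greedy one.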

Built-In Agents: Approximators Used by Each Agent

| Approximator | Q, DQN, SARSA | PG | AC, PPO, TRPO | SAC | DDPG, TD3 |
|---|---|---|---|---|---|
| Value function critic V(S), which you can create using rlValueFunction | | X (if baseline is used) | X | | |
| Q-value function critic Q(S,A), which you can create using rlQValueFunction | X | | | X | X |
| Multi-output Q-value function critic Q(S), for discrete action spaces, which you can create using rlVectorQValueFunction | X | | | | |
| Deterministic policy actor π(S), which you can create using rlContinuousDeterministicActor | | | | | X |
| Stochastic (Multinoulli) policy actor π(S), for discrete action spaces, which you can create using rlDiscreteCategoricalActor | | X | X | | |
| Stochastic (Gaussian) policy actor π(S), for continuous action spaces, which you can create using rlContinuousGaussianActor | | X | X | X | |

Agent with default networks — All agents except Q-learning and SARSA agents support default networks for actors and critics. You can create an agent with a default actor and critic based on the observation and action specifications from the environment. To do so, at the MATLAB® command line, perform the following steps.

  1. Create observation specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getObservationInfo.

  2. Create action specifications for your environment. If you already have an environment interface object, you can obtain these specifications using getActionInfo.

  3. If needed, specify the number of neurons in each learnable layer or whether to use an LSTM layer. To do so, create an agent initialization option object using rlAgentInitializationOptions.

  4. If needed, specify agent options by creating an options object for the specific agent. This options object in turn includes rlOptimizerOptions objects that specify the optimization settings for the agent actor or critic.

  5. Create the agent using the corresponding agent creation function. The resulting agent contains the appropriate actor and critics listed in the table above. The actor and critic use default agent-specific deep neural networks as internal approximators.

For more information on creating actor and critic function approximators, see Create Policies and Value Functions.

You can use the Reinforcement Learning Designer app to import an existing environment and interactively design DQN, DDPG, PPO, or TD3 agents. The app allows you to train and simulate the agent within your environment, analyze the simulation results, refine the agent parameters, and export the agent to the MATLAB workspace for further use and deployment. For more information, see Create Agents Using Reinforcement Learning Designer.

Choose Agent Type

When choosing an agent, a best practice is to start with a simpler (and faster to train) algorithm that is compatible with your action and observation spaces. You can then try progressively more complicated algorithms if the simpler ones do not perform as desired.

  • Discrete action and observation spaces: For environments with discrete action and observation spaces, the Q-learning and SARSA agents are the simplest compatible agents, followed by DQN, PPO, and TRPO.

    Arrow going from left to right showing first a vertical stack containing a Q-learning agent on top and a SARSA agent on the bottom, continuing to the right are a DQN agent, a PPO agent, and then a TRPO agent.

  • Discrete action space and continuous observation space: For environments with a discrete action space and a continuous observation space, DQN is the simplest compatible agent, followed by PPO and then TRPO.

    Arrow showing a DQN agent on the left followed by a PPO agent in the middle and a TRPO agent on the right.

  • Continuous action space: For environments with continuous action and observation spaces, DDPG is the simplest compatible agent, followed by TD3, PPO, and SAC, and then TRPO. For such environments, try DDPG first. In general:

    • TD3 is an improved, more complex version of DDPG.

    • PPO has more stable updates but requires more training.

    • SAC is an improved, more complex version of DDPG that generates stochastic policies.

    • TRPO is a more complex version of PPO that is more robust for deterministic environments with fewer observations.

    Arrow showing a DDPG agent on the left, followed by a vertical stack in the middle containing a TD3 agent, PPO agent, and a SAC agent, then a TRPO agent on the right.
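
The recommendations in this section can be condensed into a small lookup. The Python helper below is a hypothetical convenience for illustration, not part of the Toolbox. It encodes only the three cases listed above and raises an error for any other combination.

```python
def suggested_agents(action_space, observation_space):
    """Return tiers of agent names, ordered from simplest to most complex.

    Agents inside the same tier are alternatives at a similar level of complexity.
    """
    table = {
        ("discrete", "discrete"): [["Q-learning", "SARSA"], ["DQN"], ["PPO"], ["TRPO"]],
        ("discrete", "continuous"): [["DQN"], ["PPO"], ["TRPO"]],
        ("continuous", "continuous"): [["DDPG"], ["TD3", "PPO", "SAC"], ["TRPO"]],
    }
    key = (action_space, observation_space)
    if key not in table:
        raise ValueError(f"no recommendation listed for {key}")
    return table[key]


print(suggested_agents("discrete", "continuous"))  # [['DQN'], ['PPO'], ['TRPO']]
```

Starting from the first tier and moving right only when performance is unsatisfactory follows the best practice stated at the start of this section.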

Model-Based Policy Optimization

If you are using an off-policy agent (DQN, DDPG, TD3, SAC), consider using a model-based policy optimization (MBPO) agent to improve your training sample efficiency. An MBPO agent contains an internal model of the environment, which it uses to generate additional experiences without interacting with the environment.

During training, the MBPO agent generates real experiences by interacting with the environment. These experiences are used to train the internal environment model, which is used to generate additional experiences. The training algorithm then uses both the real and generated experiences to update the agent policy.

An MBPO agent can be more sample efficient than model-free agents because the model can generate large sets of diverse experiences. However, MBPO agents require much more computational time than model-free agents, because they must train the environment model and generate samples in addition to training the base agent.
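
The idea of mixing real and generated experiences can be illustrated with a much simpler stand-in. The Python sketch below is a tabular, Dyna-style loop, not the MBPO algorithm itself. It assumes a deterministic environment, stores the learned model as a dictionary, and uses Q-learning as the base update. All names and parameters are hypothetical.

```python
import random


def q_update(q, s, a, r, s_next, done, gamma=0.9, lr=0.5):
    """Q-learning update on a nested list q[state][action]."""
    best_next = 0.0 if done else max(q[s_next])
    q[s][a] += lr * (r + gamma * best_next - q[s][a])


def train_with_model(real_transitions, n_states, n_actions,
                     generated_per_real=5, seed=0):
    """Learn from real transitions plus experiences replayed from a learned model."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}  # learned model: (state, action) -> (reward, next_state, done)
    for s, a, r, s_next, done in real_transitions:
        q_update(q, s, a, r, s_next, done)   # learn from the real experience
        model[(s, a)] = (r, s_next, done)    # train the internal model on it
        for _ in range(generated_per_real):  # generate additional experiences
            ms, ma = rng.choice(list(model))
            mr, ms_next, mdone = model[(ms, ma)]
            q_update(q, ms, ma, mr, ms_next, mdone)
    return q, model
```

Each real transition triggers generated_per_real extra updates, which is where both the potential gain in sample efficiency and the additional computational cost come from.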

For more information, see Model-Based Policy Optimization Agents.

Extract Policy Objects from Agents

You can extract a policy object from an agent and then use getAction to generate deterministic or stochastic actions from the policy, given an input observation. Working with policy objects can be useful for application deployment or custom training purposes. For more information, see Create Policies and Value Functions.

Custom Agents

You can also train policies using other learning algorithms by creating a custom agent. To do so, you create a subclass of a custom agent class, and define the agent behavior using a set of required and optional methods. For more information, see Create Custom Reinforcement Learning Agents. For more information about custom training loops, see Train Reinforcement Learning Policy Using Custom Training Loop.
