rlMBPOAgent
Description
A model-based policy optimization (MBPO) agent uses a model-based, online, off-policy reinforcement learning method. An MBPO agent contains an internal model of the environment, which it uses to generate additional experiences without interacting with the environment. The action space can be either discrete or continuous, depending on the base agent.
During training, the MBPO agent generates real experiences by interacting with the environment. These experiences are used to train the internal environment model, which is used to generate additional experiences. The training algorithm then uses both the real and generated experiences to update the agent policy.
Note
MBPO agents do not support recurrent networks.
Creation
Description
agent = rlMBPOAgent(baseAgent,envModel) creates a model-based policy optimization agent with default options and sets the BaseAgent and EnvModel properties.
agent = rlMBPOAgent(___,agentOptions) creates a model-based policy optimization agent using the specified options and sets the AgentOptions property.
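For example, assuming you have already constructed a base off-policy agent baseAgent and a neural network environment model envModel (as shown in the example below), a minimal creation sketch is:
agent = rlMBPOAgent(baseAgent,envModel);
To use nondefault options, create and modify an options object first. The transition optimizer settings here are illustrative values only.
agentOptions = rlMBPOAgentOptions;
agentOptions.TransitionOptimizerOptions = rlOptimizerOptions(LearnRate=1e-4);
agent = rlMBPOAgent(baseAgent,envModel,agentOptions);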
Properties
BaseAgent
— Base reinforcement learning agent
rlDQNAgent
| rlDDPGAgent
| rlTD3Agent
| rlSACAgent
Base reinforcement learning agent, specified as an off-policy agent object.
For environments with a discrete action space, specify a DQN agent using an
rlDQNAgent
object.
For environments with a continuous action space, use one of the following agent objects.
rlDDPGAgent — DDPG agent
rlTD3Agent — TD3 agent
rlSACAgent — SAC agent
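For example, for a discrete action space you might create a default DQN base agent directly from the environment specifications. This is a minimal sketch; the predefined "CartPole-Discrete" environment is only an illustrative choice.
% Discrete-action environment and its specifications
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Default DQN agent to use as the MBPO base agent
baseAgent = rlDQNAgent(obsInfo,actInfo);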
EnvModel
— Environment model
rlNeuralNetworkEnvironment
Environment model, specified as an rlNeuralNetworkEnvironment
object. This environment contains transition
functions, a reward function, and an is-done function.
AgentOptions
— Agent options
rlMBPOAgentOptions
object
Agent options, specified as an rlMBPOAgentOptions
object.
RolloutHorizon
— Current roll-out horizon value
positive integer
Current roll-out horizon value, specified as a positive integer. For more
information on setting the initial horizon value and the horizon update method, see
rlMBPOAgentOptions
.
ModelExperienceBuffer
— Model experience buffer
rlReplayMemory
object
Model experience buffer, specified as an rlReplayMemory
object. During training, the agent stores each of its generated experiences
(S,A,R,S',D)
in a buffer. Here:
S is the current observation of the environment.
A is the action taken by the agent.
R is the reward for taking action A.
S' is the next observation after taking action A.
D is the is-done signal after taking action A.
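As a sketch of how such experiences are stored, you can create and populate an rlReplayMemory buffer yourself, assuming obsInfo and actInfo are the observation and action specifications. Treat the constructor arguments and experience field names below as assumptions to verify against the rlReplayMemory documentation; they follow the (S,A,R,S',D) tuple.
% Buffer sized for 10,000 experiences (assumed constructor signature)
buffer = rlReplayMemory(obsInfo,actInfo,10000);

% One experience following the (S,A,R,S',D) tuple (field names assumed)
exp.Observation = {rand(obsInfo.Dimension)};
exp.Action = {rand(actInfo.Dimension)};
exp.Reward = 1;
exp.NextObservation = {rand(obsInfo.Dimension)};
exp.IsDone = 0;
append(buffer,exp);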
UseExplorationPolicy
— Option to use exploration policy
true
| false
Option to use exploration policy when selecting actions, specified as one of the following logical values.
true — Use the base agent exploration policy when selecting actions.
false — Use the base agent greedy policy when selecting actions.
The initial value of UseExplorationPolicy
matches the value
specified in BaseAgent
. If you change the value of
UseExplorationPolicy
in either the base agent or the MBPO agent,
the same value is used for the other agent.
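For instance, assuming agent is an MBPO agent created as in the example below, you can switch to the greedy policy before evaluating the agent.
% Use the base agent greedy policy when selecting actions.
agent.UseExplorationPolicy = false;

% getAction now returns the greedy action for a given observation.
obsInfo = getObservationInfo(agent);
action = getAction(agent,{rand(obsInfo.Dimension)});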
ObservationInfo
— Observation specifications
specification object | array of specification objects
This property is read-only.
Observation specifications, specified as an rlFiniteSetSpec
or rlNumericSpec
object or an array containing a mix of such objects. Each element in the array defines
the properties of an environment observation channel, such as its dimensions, data type,
and name.
The value of ObservationInfo
matches the corresponding value
specified in BaseAgent
.
ActionInfo
— Action specification
rlFiniteSetSpec
object | rlNumericSpec
object
This property is read-only.
Action specifications, specified either as an rlFiniteSetSpec
(for discrete action spaces) or rlNumericSpec
(for continuous action spaces) object. This object defines the properties of the
environment action channel, such as its dimensions, data type, and name.
Note
Only one action channel is allowed.
The value of ActionInfo
matches the corresponding value
specified in BaseAgent
.
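You can retrieve both specifications from an existing agent, for example:
% Specifications stored in the agent (these match BaseAgent)
obsInfo = getObservationInfo(agent);
actInfo = getActionInfo(agent);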
SampleTime
— Sample time of agent
1
(default) | positive scalar | -1
Sample time of agent, specified as a positive scalar or as -1
. Setting this
parameter to -1
allows for event-based simulations.
Within a Simulink® environment, the RL Agent block in which the agent is specified executes every SampleTime seconds of simulation time. If SampleTime
is -1
, the
block inherits the sample time from its parent subsystem.
Within a MATLAB® environment, the agent is executed every time the environment advances. In
this case, SampleTime
is the time interval between consecutive
elements in the output experience returned by sim
or
train
. If
SampleTime
is -1
, the time interval between
consecutive elements in the returned output experience reflects the timing of the event
that triggers the agent execution.
Example: SampleTime=-1
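A minimal sketch, assuming the property can be assigned directly on an existing agent object:
% Inspect the current sample time (default is 1).
Ts = agent.SampleTime;

% Use event-based execution; in Simulink, the RL Agent block then
% inherits its sample time from the parent subsystem.
agent.SampleTime = -1;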
Object Functions
Examples
Create MBPO Agent
Create an environment interface and extract observation and action specifications.
env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Create a base off-policy agent. For this example, use a SAC agent.
agentOpts = rlSACAgentOptions;
agentOpts.MiniBatchSize = 256;
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
baseagent = rlSACAgent(obsInfo,actInfo,initOpts,agentOpts);
Check your agent with a random input observation.
getAction(baseagent,{rand(obsInfo.Dimension)})
ans = 1x1 cell array
{[-7.2875]}
The neural network environment uses a function approximator object to approximate the environment transition function. The function approximator object uses one or more neural networks as its approximation model. To account for modeling uncertainty, you can specify multiple transition models. For this example, create a single transition model.
Create a neural network to use as the approximation model within the transition function object. Define each network path as an array of layer objects. Specify names for the input and output layers, so you can later explicitly associate them with the appropriate channels.
% Observation and action paths
obsPath = featureInputLayer(obsInfo.Dimension(1),Name="obsIn");
actionPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");

% Common path: concatenate along dimension 1
commonPath = [concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(obsInfo.Dimension(1),Name="nextObsOut")];

% Add layers to layerGraph object
transNet = layerGraph(obsPath);
transNet = addLayers(transNet,actionPath);
transNet = addLayers(transNet,commonPath);

% Connect layers
transNet = connectLayers(transNet,"obsIn","concat/in1");
transNet = connectLayers(transNet,"actIn","concat/in2");

% Convert to dlnetwork object
transNet = dlnetwork(transNet);

% Display number of weights
summary(transNet)
   Initialized: true
   Number of learnables: 4.8k
   Inputs:
      1   'obsIn'   4 features
      2   'actIn'   1 features
Create the transition function approximator object.
transitionFcnAppx = rlContinuousDeterministicTransitionFunction( ...
    transNet,obsInfo,actInfo, ...
    ObservationInputNames="obsIn", ...
    ActionInputNames="actIn", ...
    NextObservationOutputNames="nextObsOut");
Create a neural network to use as a reward model for the reward function approximator object.
% Observation and action paths
actionPath = featureInputLayer(actInfo.Dimension(1),Name="actIn");
nextObsPath = featureInputLayer(obsInfo.Dimension(1),Name="nextObsIn");

% Common path: concatenate along dimension 1
commonPath = [concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)];

% Add layers to layerGraph object
rewardNet = layerGraph(nextObsPath);
rewardNet = addLayers(rewardNet,actionPath);
rewardNet = addLayers(rewardNet,commonPath);

% Connect layers
rewardNet = connectLayers(rewardNet,"nextObsIn","concat/in1");
rewardNet = connectLayers(rewardNet,"actIn","concat/in2");

% Convert to dlnetwork object
rewardNet = dlnetwork(rewardNet);

% Display number of weights
summary(rewardNet)
   Initialized: true
   Number of learnables: 8.8k
   Inputs:
      1   'nextObsIn'   4 features
      2   'actIn'       1 features
Create the reward function approximator object.
rewardFcnAppx = rlContinuousDeterministicRewardFunction( ...
    rewardNet,obsInfo,actInfo, ...
    ActionInputNames="actIn", ...
    NextObservationInputNames="nextObsIn");
Create a neural network to use as an is-done model for the is-done function approximator object.
% Define main path
net = [featureInputLayer(obsInfo.Dimension(1),Name="nextObsIn")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer(Name="isdoneOut")];

% Convert to layerGraph object
isDoneNet = layerGraph(net);

% Convert to dlnetwork object
isDoneNet = dlnetwork(isDoneNet);

% Display number of weights
summary(isDoneNet)
   Initialized: true
   Number of learnables: 4.6k
   Inputs:
      1   'nextObsIn'   4 features
Create the is-done function approximator object.
isdoneFcnAppx = rlIsDoneFunction(isDoneNet,obsInfo,actInfo, ...
    NextObservationInputNames="nextObsIn");
Create the neural network environment using the observation and action specifications and the three function approximator objects.
generativeEnv = rlNeuralNetworkEnvironment( ...
    obsInfo,actInfo, ...
    transitionFcnAppx,rewardFcnAppx,isdoneFcnAppx);
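To account for modeling uncertainty, you can instead supply an array of transition function approximators. The following sketch assumes a second approximator, transitionFcnAppx2, built the same way as transitionFcnAppx but from a separately initialized network; the name is hypothetical and used only for illustration.
% Hypothetical two-model ensemble of transition approximators
generativeEnvEnsemble = rlNeuralNetworkEnvironment( ...
    obsInfo,actInfo, ...
    [transitionFcnAppx,transitionFcnAppx2], ...
    rewardFcnAppx,isdoneFcnAppx);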
Specify options for creating an MBPO agent. Specify the optimizer options for the transition network and use default values for all other options.
MBPOAgentOpts = rlMBPOAgentOptions;
MBPOAgentOpts.TransitionOptimizerOptions = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1.0);
Create the MBPO agent.
agent = rlMBPOAgent(baseagent,generativeEnv,MBPOAgentOpts);
Check your agent with a random input observation.
getAction(agent,{rand(obsInfo.Dimension)})
ans = 1x1 cell array
{[7.8658]}
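You can then train the agent against the real environment using rlTrainingOptions and train. The option values below are placeholders to adjust for your task.
% Illustrative training options; tune these values for your task.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=200, ...
    MaxStepsPerEpisode=500, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=470);

% Train the MBPO agent. The agent collects real experiences from env and
% augments them with experiences generated by its internal model.
trainResult = train(agent,env,trainOpts);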
Version History
Introduced in R2022a