rlQValueFunction

Q-value function approximator object with a continuous or discrete action space, for reinforcement learning agents

Since R2022a

Description

This object implements a Q-value function approximator that you can use as a critic for a reinforcement learning agent. A Q-value function (also known as action-value function) is a mapping from an environment observation-action pair to the value of a policy. Specifically, its output is a scalar that represents the expected discounted cumulative long-term reward when an agent starts from the state corresponding to the given observation, executes the given action, and keeps on taking actions according to the given policy afterwards. A Q-value function critic therefore needs both the environment state and an action as inputs. After you create an rlQValueFunction critic, use it to create an agent such as rlQAgent, rlDQNAgent, rlSARSAAgent, rlDDPGAgent, or rlTD3Agent. For more information on creating actors and critics, see Create Actors, Critics, and Policy Objects.
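As a rough illustration of the "discounted cumulative long-term reward" that a Q-value estimates, the following sketch computes the discounted return of a single reward sequence. The reward values and the discount factor gamma are hypothetical and chosen only for this illustration. A Q-value is the expectation of this quantity over all trajectories that start with the given observation-action pair and then follow the policy.

```matlab
% Hypothetical rewards collected after taking action a from state s
% and following the policy afterwards (assumed values).
rewards = [1 0 2];

% Assumed discount factor, not specified on this page.
gamma = 0.9;

% Discounted return: rewards(1) + gamma*rewards(2) + gamma^2*rewards(3)
k = 0:numel(rewards)-1;
G = sum((gamma.^k).*rewards);   % 1 + 0 + 0.81*2 = 2.62
```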

Creation

Syntax

critic = rlQValueFunction(net,observationInfo,actionInfo)

critic = rlQValueFunction(tab,observationInfo,actionInfo)

critic = rlQValueFunction({basisFcn,W0},observationInfo,actionInfo)

critic = rlQValueFunction(___,Name=Value)

Description

critic = rlQValueFunction(net,observationInfo,actionInfo) creates the Q-value function object critic. Here, net is the deep neural network used as an approximation model, and it must have both observation and action as input layers and a single scalar output layer. The network input layers are automatically associated with the environment observation and action channels according to the dimension specifications in observationInfo and actionInfo. This function sets the ObservationInfo and ActionInfo properties of critic to the observationInfo and actionInfo input arguments, respectively.


critic = rlQValueFunction(tab,observationInfo,actionInfo) creates the Q-value function object critic with discrete action and observation spaces from the Q-value table tab. tab is an rlTable object containing a table with as many rows as the number of possible observations and as many columns as the number of possible actions. The function sets the ObservationInfo and ActionInfo properties of critic to the observationInfo and actionInfo input arguments, respectively, which in this case must be scalar rlFiniteSetSpec objects.


critic = rlQValueFunction({basisFcn,W0},observationInfo,actionInfo) creates a Q-value function object critic using a custom basis function as underlying approximator. The first input argument is a two-element cell array whose first element is the handle basisFcn to a custom basis function and whose second element is the initial weight vector W0. Here the basis function must have both observation and action as inputs and W0 must be a column vector. The function sets the ObservationInfo and ActionInfo properties of critic to the observationInfo and actionInfo input arguments, respectively.


critic = rlQValueFunction(___,Name=Value) specifies names of the action or observation input layers (for network-based approximators) or sets the UseDevice property of critic using one or more name-value arguments. Specifying the input layer names allows you to explicitly associate the layers of your network approximator with specific environment channels. For all types of approximators, you can specify the device where computations for critic are executed, for example UseDevice="gpu".


Input Arguments


`net` — Deep neural network
array of `Layer` objects | `layerGraph` object | `DAGNetwork` object | `SeriesNetwork` object | `dlnetwork` object (preferred)

Deep neural network used as the underlying approximator within the critic, specified as one of the following:

Array of Layer objects
layerGraph object
DAGNetwork object
SeriesNetwork object
dlnetwork object

The network must have as many input layers as the number of environment observation channels plus one. Specifically, there must be one input layer for each observation channel, and one additional input layer for the action channel. The network must have a single output layer returning a scalar value.

Note

Among the different network representation options, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other representations to dlnetwork explicitly before using them to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any Deep Learning Toolbox™ neural network object. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice allows a greater level of insight and control for cases in which the conversion is not straightforward and might require additional specifications.

rlQValueFunction objects support recurrent deep neural networks.

The learnable parameters of the critic are the weights of the deep neural network. For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Actors, Critics, and Policy Objects.

`tab` — Q-value table
`rlTable` object

Q-value table, specified as an rlTable object containing an array with as many rows as the possible observations and as many columns as the possible actions. The element (s,a) is the expected cumulative long-term reward for taking action a from observed state s. The elements of this array are the learnable parameters of the critic.

`basisFcn` — Custom basis function
function handle

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user-defined function can either be an anonymous function or a function on the MATLAB path. The output of the critic is the scalar c = W'*B, where W is a weight vector containing the learnable parameters, and B is the column vector returned by the custom basis function.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2, ...,obsN,act)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the environment observation channels defined in observationInfo and act is an input with the same data type and dimension as the environment action channel defined in actionInfo.

For an example on how to use a basis function to create a Q-value function critic with a mixed continuous and discrete observation space, see Create Hybrid Observation Space Q-Value Function Critic from Custom Basis Function.

Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]

`W0` — Initial value of the basis function weights
column vector

Initial value of the basis function weights W, specified as a column vector having the same length as the vector returned by the basis function.
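The relation c = W'*B can be checked by hand with plain MATLAB. This sketch reuses the anonymous basis function from the example above. The input values and the weights are hypothetical, and the calculation only mirrors the formula stated on this page, not the internal implementation of the critic.

```matlab
% Basis function from the example above: two observation channels, one action channel.
basisFcn = @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))];

% Hypothetical inputs (values assumed for illustration only).
obs1 = [2; 0; 0];
obs2 = [0; 0; 0; 0; -3];
act  = [1; 0.5];

% B is a 2-by-1 column vector, so W0 must also have two elements.
B  = basisFcn(obs1,obs2,act);   % [0.5*2^2; abs(-3+1)] = [2; 2]
W0 = [0.1; 0.2];

% Scalar critic output for this observation-action pair.
c = W0'*B;                      % 0.1*2 + 0.2*2 = 0.6
```

If W0 had a different length than B, the product W0'*B would fail, which is why W0 must match the length of the vector returned by the basis function.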

Name-Value Arguments


Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: UseDevice="gpu"

`ActionInputNames` — Network input layer name corresponding to the environment action channel
string | character vector

Network input layer name corresponding to the environment action channel, specified as a string or character vector. The function assigns the environment action channel specified in actionInfo to the layer whose name is specified in the value assigned to this argument. Therefore, the specified network input layer must have the same data type and dimensions as the action channel, as defined in actionInfo.

This name-value argument is supported only when the approximation model is a deep neural network.

Example: ActionInputNames="actInLyr_Force"

`ObservationInputNames` — Network input layer names corresponding to the environment observation channels
string array | cell array of strings | cell array of character vectors

Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of strings or character vectors. The function assigns, in sequential order, each environment observation channel specified in observationInfo to each layer whose name is specified in the array assigned to this argument. Therefore, the specified network input layers, ordered as indicated in this argument, must have the same data type and dimensions as the observation channels, as ordered in observationInfo.

This name-value argument is supported only when the approximation model is a deep neural network.

Example: ObservationInputNames={"obsInLyr1_airspeed","obsInLyr2_altitude"}

Properties


`ObservationInfo` — Observation specifications
`rlFiniteSetSpec` object | `rlNumericSpec` object | array

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name.

This property is read-only. When you create the approximator object, the constructor function sets the ObservationInfo property to the input argument observationInfo.

You can extract observationInfo from an existing environment, function approximator, or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

`ActionInfo` — Action specifications
`rlFiniteSetSpec` object | `rlNumericSpec` object

Action specifications, specified either as an rlFiniteSetSpec (for discrete action spaces) or rlNumericSpec (for continuous action spaces) object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name.

Note

For this approximator object, only one action channel is allowed.

This property is read-only. When you create the approximator object, the constructor function sets the ActionInfo property to the input argument actionInfo.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

`Normalization` — Normalization method
`"none"` (default) | string array

Normalization method, returned as an array in which each element (one for each input channel defined in the ObservationInfo and ActionInfo properties, in that order) is one of the following values:

"none" — Do not normalize the input.
"rescale-zero-one" — Normalize the input by rescaling it to the interval between 0 and 1. The normalized input Y is (U–LowerLimit)./(UpperLimit–LowerLimit), where U is the nonnormalized input. Note that nonnormalized input values lower than LowerLimit result in normalized values lower than 0. Similarly, nonnormalized input values higher than UpperLimit result in normalized values higher than 1. Here, UpperLimit and LowerLimit are the corresponding properties defined in the specification object of the input channel.
"rescale-symmetric" — Normalize the input by rescaling it to the interval between –1 and 1. The normalized input Y is 2*(U–LowerLimit)./(UpperLimit–LowerLimit) – 1, where U is the nonnormalized input. Note that nonnormalized input values lower than LowerLimit result in normalized values lower than –1. Similarly, nonnormalized input values higher than UpperLimit result in normalized values higher than 1. Here, UpperLimit and LowerLimit are the corresponding properties defined in the specification object of the input channel.

Note

When you specify the Normalization property of rlAgentInitializationOptions, normalization is applied only to the approximator input channels corresponding to rlNumericSpec specification objects in which both the UpperLimit and LowerLimit properties are defined. After you create the agent, you can use the setNormalizer function to assign normalizers that use any normalization method. For more information on normalizer objects, see rlNormalizer.

Example: myActor.Normalization = "rescale-symmetric" sets to "rescale-symmetric" the Normalization property of the function approximator myActor.
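The two rescaling methods can be reproduced with plain MATLAB arithmetic. In this sketch the limits and input values are hypothetical, and the formulas assume that the offset subtracted from the input is LowerLimit in both methods. The last input lies above UpperLimit, so it maps outside the nominal interval, as described for out-of-range inputs.

```matlab
% Hypothetical channel limits and nonnormalized inputs (assumed values).
LowerLimit = -2;
UpperLimit = 6;
U = [-2 2 6 8];

% "rescale-zero-one": maps [LowerLimit, UpperLimit] to [0, 1].
Y01 = (U - LowerLimit)./(UpperLimit - LowerLimit);          % [0 0.5 1 1.25]

% "rescale-symmetric": maps [LowerLimit, UpperLimit] to [-1, 1].
Ysym = 2*(U - LowerLimit)./(UpperLimit - LowerLimit) - 1;   % [-1 0 1 1.5]
```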

`UseDevice` — Computation device used for training and simulation
`"cpu"` (default) | `"gpu"`

Computation device used to perform operations such as gradient computation, parameter update and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs see GPU Computing Requirements (Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB®.

Note

Training or simulating an agent on a GPU involves device-specific numerical round-off errors. Because of these errors, you can get different results on a GPU and on a CPU for the same operation.

To speed up training by using parallel processing over multiple cores, you do not need to use this argument. Instead, when training your agent, use an rlTrainingOptions object in which the UseParallel option is set to true. For more information about training using multicore processors and GPUs, see Train Agents Using Parallel Computing and GPUs.

Example: myCritic.UseDevice = "gpu" sets to "gpu" the UseDevice property of the function approximator myCritic.
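One possible way to choose the device defensively is to query GPU availability first and fall back to the CPU. This is a minimal sketch: canUseGPU is a MATLAB function that returns true only when a supported GPU and the required software are available, and the variable name device is hypothetical. The constructor call is left as a comment because it needs net, obsInfo, and actInfo from your own workflow.

```matlab
% Pick "gpu" only when MATLAB reports that a usable GPU is available.
if canUseGPU
    device = "gpu";
else
    device = "cpu";
end

% Then pass the choice when creating the critic, for example:
% critic = rlQValueFunction(net,obsInfo,actInfo,UseDevice=device);
```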

`Learnables` — Learnable parameters of approximator object
cell array of `dlarray` objects

Learnable parameters of the approximator object, specified as a cell array of dlarray objects. This property contains the learnable parameters of the approximation model used by the approximator object.

Example: myActor.Learnables = {dlarray(rand(256,4)),dlarray(rand(256,1))} sets the learnable parameters of the function approximator myActor.

`State` — State of approximator object
cell array of `dlarray` objects

State of the approximator object, specified as a cell array of dlarray objects. For dlnetwork-based models, this property contains the Value column of the State property table of the dlnetwork model. The elements of the cell array are the state of the recurrent neural network used in the approximator (if any), as well as the state for the batch normalization layer (if used).

For model types that are not based on a dlnetwork object, this property is an empty cell array, since these model types do not support states.

Example: myCritic.State={dlarray(rand(256,1)),dlarray(rand(256,1))} sets the state of the function approximator myCritic.

Object Functions

`rlDDPGAgent`	Deep deterministic policy gradient (DDPG) reinforcement learning agent
`rlTD3Agent`	Twin-delayed deep deterministic (TD3) policy gradient reinforcement learning agent
`rlDQNAgent`	Deep Q-network (DQN) reinforcement learning agent
`rlQAgent`	Q-learning reinforcement learning agent
`rlSARSAAgent`	SARSA reinforcement learning agent
`rlSACAgent`	Soft actor-critic (SAC) reinforcement learning agent
`getValue`	Obtain estimated value from a critic given environment observations and actions
`getMaxQValue`	Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete action space, given environment observations
`evaluate`	Evaluate function approximator object given observation (or observation-action) input data
`getLearnableParameters`	Obtain learnable parameter values from agent, function approximator, or policy object
`setLearnableParameters`	Set learnable parameter values of agent, function approximator, or policy object
`setModel`	Set approximation model in function approximator object
`getModel`	Get approximation model from function approximator object
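The following sketch shows how some of these functions might be combined on a small table-based critic. It requires Reinforcement Learning Toolbox and reuses the specifications from the table example on this page. The variable names maxIdx, params, and critic2 are hypothetical, and zeroing the parameters is done only to show the read-modify-write pattern.

```matlab
% Table-based critic, as in the table example on this page.
obsInfo = rlFiniteSetSpec([7 5 3 1]);
actInfo = rlFiniteSetSpec([4 8]);
qTable = rlTable(obsInfo,actInfo);
qTable.Table = reshape(1:8,4,2);
critic = rlQValueFunction(qTable,obsInfo,actInfo);

% Value of a single observation-action pair.
v = getValue(critic,{5},{8});

% Largest value over all actions for the same observation,
% together with the index of the maximizing action.
[maxQ,maxIdx] = getMaxQValue(critic,{5});

% Read the learnable parameters, modify them, and write them
% back into a new critic object.
params = getLearnableParameters(critic);
params{1} = 0*params{1};
critic2 = setLearnableParameters(critic,params);
```

Note that setLearnableParameters returns a new approximator object, so the original critic is left unchanged.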

Examples


Create Q-Value Function Critic from Deep Neural Network


Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that there is a single observation channel that carries a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as a continuous two-dimensional space, so that the action channel carries a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

A Q-value function critic takes the current observation and an action as inputs and returns a single scalar as output (the estimated discounted cumulative long-term reward for taking the action from the state corresponding to the current observation, and following the policy thereafter).

To model the parameterized Q-value function within the critic, use a neural network with two input layers (one receiving the content of the observation channel, as specified by obsInfo, and the other receiving the content of the action channel, as specified by actInfo) and one output layer (returning the scalar value).

Define each network path as an array of layer objects. Assign names to the input and output layers of each path, so you can properly connect the paths. Get the dimensions of the observation and action spaces from the environment specification objects. Regardless of whether the observation space is a column vector, row vector, or matrix, prod(obsInfo.Dimension) is its total number of dimensions (four in this example). Similarly, prod(actInfo.Dimension) is the number of dimensions of the action space (two in this example).

% Observation path layers
obsPath = [
    featureInputLayer(prod(obsInfo.Dimension))
    fullyConnectedLayer(5)
    reluLayer
    fullyConnectedLayer(5,Name="obsout")
    ];

% Action path layers
actPath = [
    featureInputLayer(prod(actInfo.Dimension))
    fullyConnectedLayer(5)
    reluLayer
    fullyConnectedLayer(5,Name="actout")
    ];

% Common path to output layers
% Concatenate two layers along dimension one.
comPath = [
    concatenationLayer(1,2,Name="cct")
    fullyConnectedLayer(5)
    reluLayer    
    fullyConnectedLayer(1, Name="output")
    ];

Assemble dlnetwork object.

net = dlnetwork;
net = addLayers(net,obsPath);
net = addLayers(net,actPath); 
net = addLayers(net,comPath);

Connect layers.

net = connectLayers(net,"obsout","cct/in1");
net = connectLayers(net,"actout","cct/in2");

Plot network.

plot(net)


Initialize network and display the number of weights.

net = initialize(net);
summary(net)

   Initialized: true

   Number of learnables: 161

   Inputs:
      1   'input'     4 features
      2   'input_1'   2 features
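The reported number of learnables can be verified by hand, assuming each fully connected layer contributes nIn*nOut weights plus nOut biases and the other layers have no learnable parameters. The helper name fc is hypothetical.

```matlab
% Learnables of a fully connected layer: weight matrix plus bias vector.
fc = @(nIn,nOut) nIn*nOut + nOut;

nObs = fc(4,5) + fc(5,5);     % observation path: 25 + 30 = 55
nAct = fc(2,5) + fc(5,5);     % action path:      15 + 30 = 45
nCom = fc(10,5) + fc(5,1);    % common path (10 concatenated inputs): 55 + 6 = 61

total = nObs + nAct + nCom;   % 161, matching the summary output
```

The six fully connected layers each hold one weight array and one bias array, which gives twelve parameter arrays in total.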

Create the critic with rlQValueFunction, using the network as well as the observation and action specification objects. When using this syntax, the network input layers are automatically associated with the components of the observation and action signals according to the dimension specifications in obsInfo and actInfo.

critic = rlQValueFunction(net,obsInfo,actInfo)

critic = 
  rlQValueFunction with properties:

    ObservationInfo: [1×1 rl.util.rlNumericSpec]
         ActionInfo: [1×1 rl.util.rlNumericSpec]
      Normalization: ["none"    "none"]
          UseDevice: "cpu"
         Learnables: {12×1 cell}
              State: {0×1 cell}

To check your critic, use getValue to return the value of a random observation and action, given the current network weights.

v = getValue(critic, ...
    {rand(obsInfo.Dimension)}, ...
    {rand(actInfo.Dimension)})

v = single

-1.1006

Obtain values for a random batch of 5 observations.

v = getValue(critic, ...
    {rand([obsInfo.Dimension 5])}, ...
    {rand([actInfo.Dimension 5])})

v = 1×5 single row vector

   -0.8916   -0.3538   -0.8732   -1.0594   -0.9641

You can now use the critic (along with an actor) to create an agent for the environment described by the given specification objects. Examples of agents that can work with continuous action and observation spaces, and use a Q-value function critic, are rlDDPGAgent, rlTD3Agent, and rlSACAgent.

For more information on creating approximator objects such as actors and critics, see Create Actors, Critics, and Policy Objects.

Create Q-Value Function Critic from Deep Neural Network Specifying Layer Names


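Create the observation and action specification objects. For this example, define both spaces as continuous, with the observation channel carrying a column vector of four doubles and the action channel carrying a column vector of two doubles.

obsInfo = rlNumericSpec([4 1]);

actInfo = rlNumericSpec([2 1]);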

Define each network path as an array of layer objects. Assign names to the input and output layers of each path. This assignment allows you to connect the paths and explicitly associate the network input and output layers with the appropriate environment channel.

Get the dimensions of the observation and action spaces from the environment specification objects. Regardless of whether the observation space is a column vector, row vector, or matrix, prod(obsInfo.Dimension) is its total number of dimensions (four in this example). Similarly, prod(actInfo.Dimension) is the number of dimensions of the action space (two in this example).

% Observation path layers
obsPath = [
    featureInputLayer( ...
         prod(obsInfo.Dimension), ...
         Name="obsInLyr") 
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(5,Name="obsPthOutLyr")
    ];

% Action path layers
actPath = [
    featureInputLayer( ...
        prod(actInfo.Dimension), ...
        Name="actInLyr") 
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(5,Name="actPthOutLyr")
    ];

% Common path to output layers 
% Concatenate two layers along dimension one
comPath = [
    concatenationLayer(1,2,Name="cct")
    fullyConnectedLayer(8)
    reluLayer
    fullyConnectedLayer(1, Name="qvfOutLyr")
    ];

Create dlnetwork object and add layers.

net = dlnetwork();
net = addLayers(net,obsPath);
net = addLayers(net,actPath);
net = addLayers(net,comPath);

Connect layers.

net = connectLayers(net,"obsPthOutLyr","cct/in1");
net = connectLayers(net,"actPthOutLyr","cct/in2");

Plot network.

plot(net)


Initialize network and display the number of weights.

net = initialize(net);
summary(net)

   Initialized: true

   Number of learnables: 395

   Inputs:
      1   'obsInLyr'   4 features
      2   'actInLyr'   2 features

Create the critic with rlQValueFunction, using the network, the observation and action specification objects, and the names of the network input layers to be associated with the observation and action from the environment.

critic = rlQValueFunction(net, ...
             obsInfo,actInfo, ...
             ObservationInputNames="obsInLyr", ...
             ActionInputNames="actInLyr")

critic = 
  rlQValueFunction with properties:

    ObservationInfo: [1×1 rl.util.rlNumericSpec]
         ActionInfo: [1×1 rl.util.rlNumericSpec]
      Normalization: ["none"    "none"]
          UseDevice: "cpu"
         Learnables: {12×1 cell}
              State: {0×1 cell}

To check your critic, use getValue to return the value of a batch of 5 random observations and actions, given the current network weights.

obs = rand([obsInfo.Dimension 5]);
act = rand([actInfo.Dimension 5]);
v = getValue(critic,{obs},{act})

v = 1×5 single row vector

    0.2419    0.2933    0.0893   -0.3399   -0.0223

For more information on creating approximator objects such as actors and critics, see Create Actors, Critics, and Policy Objects.

Create Q-Value Function Critic from Table


Create a finite set observation specification object (or alternatively use the getObservationInfo function to extract the specification object from an environment with a discrete observation space). For this example, define the observation space as a scalar belonging to a finite set with four possible values.

obsInfo = rlFiniteSetSpec([7 5 3 1]);

Create a finite set action specification object (or alternatively use the getActionInfo function to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set with two possible values.

actInfo = rlFiniteSetSpec([4 8]);

Because both observation and action spaces are discrete and low-dimensional, use a table to model the Q-value function within the critic. rlTable creates a value table object from the observation and action specification objects.

qTable = rlTable(obsInfo,actInfo);

The table stores a value (representing the expected cumulative long-term reward) for each possible observation-action pair. Each row corresponds to an observation and each column corresponds to an action. You can access the table using the Table property of the qTable object. The initial value of each element is zero.

qTable.Table

You can initialize the table to any value, in this case an array containing the integers from 1 through 8.

qTable.Table=reshape(1:8,4,2)

qTable = 
  rlTable with properties:

    Table: [4×2 double]

Create the critic using the table as well as the observation and action specification objects.

critic = rlQValueFunction(qTable,obsInfo,actInfo)

critic = 
  rlQValueFunction with properties:

    ObservationInfo: [1×1 rl.util.rlFiniteSetSpec]
         ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
      Normalization: ["none"    "none"]
          UseDevice: "cpu"
         Learnables: {[4×2 dlarray]}
              State: {}

To check your critic, use the getValue function to return the value of a given observation and action, using the current table entries.

v = getValue(critic,{5},{8})

v = 
6

Obtain values for a batch of 5 observations.

v = getValue(critic,{[5,3,1,7,7]},{[8,4,4,8,4]})

v = 1×5

     6     3     4     5     1
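The batch result above is consistent with a simple index lookup, in which each observation value selects a row and each action value selects a column, following the order of the elements in the specification objects. The sketch below reproduces the lookup in plain MATLAB. It illustrates the indexing convention implied by the example output and is not a description of how the toolbox implements it internally. The variable names are hypothetical.

```matlab
% Element order as defined in obsInfo and actInfo in this example.
obsValues = [7 5 3 1];
actValues = [4 8];
T = reshape(1:8,4,2);          % same table content as qTable.Table

% Batch of observation-action pairs from the getValue call above.
obs = [5 3 1 7 7];
act = [8 4 4 8 4];

% Map each value to its position in the corresponding finite set.
[~,row] = ismember(obs,obsValues);   % [2 3 4 1 1]
[~,col] = ismember(act,actValues);   % [2 1 1 2 1]

% Look up one table entry per pair.
v = T(sub2ind(size(T),row,col));     % [6 3 4 5 1]
```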

You can now use the critic to create an agent for the environment described by the given specification objects. Examples of agents that can work with discrete action and observation spaces, and use a Q-value function critic, are rlQAgent, rlDQNAgent, and rlSARSAAgent.

For more information on creating approximator objects such as actors and critics, see Create Actors, Critics, and Policy Objects.

Create Q-Value Function Critic from Custom Basis Function


Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous three-dimensional space, so that there is a single observation channel that carries a column vector containing three doubles.

obsInfo = rlNumericSpec([3 1]);

actInfo = rlNumericSpec([2 1]);

For this example, to model the parameterized Q-value function within the critic, use a custom basis function. The function must have two input arguments: the first receives the content of the observation channel, as specified by obsInfo, and the second receives the content of the action channel, as specified by actInfo.

Note that using local functions to implement a custom basis function is not recommended if you want to save an agent and load it later. This is because local functions are available only in the file in which they are defined, and when you load an agent in the workspace the function is no longer available to the agent. Additionally, local functions are not supported for code generation.

Write a simple custom basis function as a string (alternatively, write your own custom basis function in a file).

str =  "function out = myBasisFcn(myobs,myact)" + newline + ... 
        " out = [myobs(2,1,:).^2;" + newline + ...
        "        myobs(1,1,:).*myact(1,1,:);" + newline + ...
        "        abs(myact(2,1,:));" + newline + ...
        "        myobs(3,1,:) ];" + newline + ...
        "end"

str = 
    "function out = myBasisFcn(myobs,myact)
      out = [myobs(2,1,:).^2;
             myobs(1,1,:).*myact(1,1,:);
             abs(myact(2,1,:));
             myobs(3,1,:) ];
     end"

Here, the first two dimensions of the observation and action are 3-by-1 and 2-by-1, respectively (as defined in obsInfo and actInfo), while the third dimension is the batch dimension. Since the training algorithm normally executes on batches of observations and actions at the same time, you have to take the batch dimension into account when writing your custom basis function. For each element of the batch dimension, the function returns a vector of four elements.

Write the string to the myBasisFcn.m file and check that the file exists.

fid=fopen("myBasisFcn.m","w");
fwrite(fid,str,"char");
fclose(fid);
exist("myBasisFcn.m","file")

ans = 
2

For each element of the batch dimension, the output of the critic is the scalar W'*myBasisFcn(myobs,myact), which represents the estimated value of the observation-action pair, under the given policy. Here W is a weight column vector which must have the same size as the custom function output. The elements of W are the learnable parameters.
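The per-element product can be sketched in plain MATLAB to show how the batch dimension flows through the basis function. This sketch uses an anonymous function with the same expressions as myBasisFcn, so it does not depend on the generated file. The weights and inputs are random, and the reshape-based product is only one possible way to apply W to every batch element, not necessarily what the critic does internally.

```matlab
% Same expressions as myBasisFcn, written as an anonymous function.
basisFcn = @(myobs,myact) [ ...
    myobs(2,1,:).^2;
    myobs(1,1,:).*myact(1,1,:);
    abs(myact(2,1,:));
    myobs(3,1,:)];

% Batch of 10 random observations (3-by-1) and actions (2-by-1).
robs = rand([3 1 10]);
ract = rand([2 1 10]);

% One four-element column per batch element: B is 4-by-1-by-10.
B = basisFcn(robs,ract);

% Apply the weights to every batch element at once.
W = rand(4,1);
v = W'*reshape(B,4,[]);    % 1-by-10 row vector of values

% The seventh value equals W'*B for the seventh pair alone.
v7 = W'*basisFcn(robs(:,:,7),ract(:,:,7));
```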

Define an initial parameter vector.

W0 = rand(4,1);

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial weight vector. The second and third arguments are, respectively, the observation and action specification objects.

critic = rlQValueFunction({@myBasisFcn,W0}, ...
    obsInfo,actInfo)

critic = 
  rlQValueFunction with properties:

    ObservationInfo: [1×1 rl.util.rlNumericSpec]
         ActionInfo: [1×1 rl.util.rlNumericSpec]
      Normalization: ["none"    "none"]
          UseDevice: "cpu"
         Learnables: {[1×4 dlarray]}
              State: {}

To check your critic, use the getValue function to return the value of a batch of 10 random observation and action inputs.

robs = rand([3 1 10]);
ract = rand([2 1 10]);
v = getValue(critic,{robs},{ract});

Display the seventh element of the batch.

v(7)

ans = 
0.8311

For more information on creating approximator objects such as actors and critics, see Create Actors, Critics, and Policy Objects.

Create Hybrid Observation Space Q-Value Function Critic from Custom Basis Function


Create an observation specification object (or alternatively use the getObservationInfo function to extract the specification object from an environment). For this example, define the observation space as a hybrid (that is, mixed discrete-continuous) space, with the first channel carrying a vector over a continuous two-dimensional space and the second channel carrying a vector over a three-dimensional space that can assume only four values.

obsInfo = [rlNumericSpec([1 2]) 
           rlFiniteSetSpec({[1 0 -1], ...
                            [-1 2 1], ...
                            [0.1 0.2 0.3], ...
                            [0 0 0] ...
                            })
          ];

Create an action specification object (or alternatively use the getActionInfo function to extract the specification object from an environment). For this example, define the action space as a discrete set consisting of three possible actions, labeled 1, 2, and 3.

actInfo = rlFiniteSetSpec({1,2,3});

A Q-value function critic takes a batch of observations and a batch of actions as inputs and returns as output a corresponding batch of scalars. Each scalar represents the estimated discounted cumulative long-term reward (the value) obtained by taking the action from the state corresponding to the current observation, and following the policy thereafter.

To model the parameterized Q-value function within the critic, use a custom basis function. The basis function must have three input arguments: the first two receive the content of the two observation channels, as specified by obsInfo, and the third receives the content of the action channel, as specified by actInfo.

Write a simple custom basis function as a string (alternatively, write your own custom basis function in a file).

str =  "function out = myBasisFcn(obsC,obsD,act)" ...
    + newline + ... 
    " out = [obsC(1,1,:).*obsD(1,2,:)+obsD(1,3,:)+act(1,1,:);" ...
    + newline + ...
    "        obsC(1,2,:).*obsD(1,1,:)+obsD(1,2,:)-act(1,1,:);" ...
    + newline + ...
    "        obsC(1,1,:).*obsD(1,2,:)+obsD(1,3,:)+act(1,1,:).^2;" ... 
    + newline + ...
    "        obsC(1,1,:).*obsD(1,1,:)+obsD(1,2,:)-act(1,1,:).^2 ];" ...
    + newline + ...
    "end"

str = 
    "function out = myBasisFcn(obsC,obsD,act)
      out = [obsC(1,1,:).*obsD(1,2,:)+obsD(1,3,:)+act(1,1,:);
             obsC(1,2,:).*obsD(1,1,:)+obsD(1,2,:)-act(1,1,:);
             obsC(1,1,:).*obsD(1,2,:)+obsD(1,3,:)+act(1,1,:).^2;
             obsC(1,1,:).*obsD(1,1,:)+obsD(1,2,:)-act(1,1,:).^2 ];
     end"

Here, the first two dimensions of the observation and action channels are the ones defined in the obsInfo and actInfo elements, while the third dimension is the batch dimension. Because the training algorithm normally operates on batches of observations and actions at the same time, you must take the batch dimension into account when writing your custom basis function. For each element of the batch dimension, the function returns a vector of four elements. Each output element can be any combination of the three inputs, depending on your application.
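To see how the batch dimension propagates, the following sketch evaluates the same expressions on plain numeric arrays. It uses an anonymous function, basisFcn, as a stand-in for myBasisFcn so that it runs without the file; the batch size N and the random inputs are arbitrary choices for illustration.

```matlab
% Stand-in for myBasisFcn, written as an anonymous function (illustration only)
basisFcn = @(obsC,obsD,act) [ ...
    obsC(1,1,:).*obsD(1,2,:)+obsD(1,3,:)+act(1,1,:); ...
    obsC(1,2,:).*obsD(1,1,:)+obsD(1,2,:)-act(1,1,:); ...
    obsC(1,1,:).*obsD(1,2,:)+obsD(1,3,:)+act(1,1,:).^2; ...
    obsC(1,1,:).*obsD(1,1,:)+obsD(1,2,:)-act(1,1,:).^2 ];

N = 5;                  % arbitrary batch size
obsC = rand(1,2,N);     % first observation channel, 1-by-2 per sample
obsD = rand(1,3,N);     % second observation channel, 1-by-3 per sample
act  = rand(1,1,N);     % action channel, 1-by-1 per sample

% Each row of the output is 1-by-1-by-N, so the result is 4-by-1-by-N:
% one four-element column vector per batch element
out = basisFcn(obsC,obsD,act);
size(out)
```

Indexing with a colon in the third dimension is what lets a single call handle any batch size, including a batch of one.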

Write the string to the myBasisFcn.m file and check that the file exists.

fid=fopen("myBasisFcn.m","w");
fwrite(fid,str,"char");
fclose(fid);
exist("myBasisFcn.m","file")

ans = 
2

The output of the critic is the scalar W'*myBasisFcn(obsC,obsD,act), which represents the estimated value of the observation-action pair under the given policy. Here, W is a weight column vector that must have the same size as the custom function output. The elements of W are the learnable parameters.
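As a small worked instance of this linear combination, the sketch below uses the feature vector that the basis function expressions give for the observation {[-0.5 0.6],[1 0 -1]} and the action 3, worked out by hand. The weight vector W is hypothetical and fixed only so that the arithmetic is easy to follow; it is not the random W0 used in this example.

```matlab
% Features for obsC = [-0.5 0.6], obsD = [1 0 -1], act = 3, computed by hand:
%   -0.5*0 + (-1) + 3   =  2
%    0.6*1 +   0  - 3   = -2.4
%   -0.5*0 + (-1) + 3^2 =  8
%   -0.5*1 +   0  - 3^2 = -9.5
f = [2; -2.4; 8; -9.5];

% Hypothetical weights (illustration only)
W = [0.5; 0.25; 0.1; 0.2];

% The critic output for a single observation-action pair is the inner product
q = W'*f                % 1 - 0.6 + 0.8 - 1.9 = -0.7, up to rounding
```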

Define an initial parameter vector.

W0 = rand(4,1);

Create the critic. The first argument is a two-element cell array containing the handle to the custom function and the initial parameter vector. The second and third arguments are the observation and action specification objects, respectively.

critic = rlQValueFunction({@myBasisFcn,W0}, ...
    obsInfo,actInfo)

critic = 
  rlQValueFunction with properties:

    ObservationInfo: [2×1 rl.util.RLDataSpec]
         ActionInfo: [1×1 rl.util.rlFiniteSetSpec]
      Normalization: ["none"    "none"    "none"]
          UseDevice: "cpu"
         Learnables: {[1×4 dlarray]}
              State: {}

To check your critic, use the getValue function to return the value of a given observation-action pair, using the current parameter vector.

val = getValue(critic,{[-0.5 0.6],[1 0 -1]},{3})

val = 
-8.2056

Note that the critic does not enforce the set constraint for the discrete set elements.

val = getValue(critic,{[-0.5 0.6],[10 -10 -0.05]},{33})

val = 
-872.0453
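Because the critic accepts inputs outside the finite set, you might want to validate discrete inputs yourself before calling getValue. The following is a minimal sketch of one way to do that, assuming the set elements are available as a plain cell array (here copied from the definition of obsInfo above); the helper name isInSet is hypothetical. The comparison is exact, so values that differ only by rounding error are treated as outside the set.

```matlab
% Elements of the discrete observation channel, copied from the obsInfo definition
elements = {[1 0 -1], [-1 2 1], [0.1 0.2 0.3], [0 0 0]};

% Hypothetical helper: true if x exactly matches one of the set elements
isInSet = @(x) any(cellfun(@(e) isequal(e,x), elements));

okValid   = isInSet([1 0 -1])           % member of the set
okInvalid = isInSet([10 -10 -0.05])     % not a member of the set
```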

Obtain values for a random batch of 10 observation-action pairs.

robsC = rand([obsInfo(1).Dimension 10]);
robsD = rand([obsInfo(2).Dimension 10]);
ract = rand([actInfo.Dimension 10]);
val = getValue(critic,{robsC,robsD},{ract});

Display the seventh value of the batch.

val(7)

ans = 
1.8189

You can now use the critic (along with an actor, if the agent type requires one) to create an agent for the environment described by the given specification objects. Examples of agents that can work with a discrete action space and a mixed observation space, and that use a Q-value function critic, are rlQAgent, rlDQNAgent, and rlSARSAAgent.

For more information on creating approximator objects such as actors and critics, see Create Actors, Critics, and Policy Objects.

Version History

Introduced in R2022a
