rlQValueRepresentation

Q-Value function critic representation for reinforcement learning agents

Description

This object implements a Q-value function approximator to be used as a critic within a reinforcement learning agent. A Q-value function maps an observation-action pair to a scalar value representing the expected cumulative long-term reward the agent accumulates when it starts from the given observation and executes the given action. Q-value function critics therefore need both observations and actions as inputs. After you create an rlQValueRepresentation critic, use it to create an agent relying on a Q-value function critic, such as an rlQAgent, rlDQNAgent, rlSARSAAgent, rlDDPGAgent, or rlTD3Agent. For more information on creating representations, see Create Policy and Value Function Representations.

Creation

Description

Scalar Output Q-Value Critic

example

critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',obsName,'Action',actName) creates the Q-value function critic. net is the deep neural network used as an approximator, and must have both observations and action as inputs, and a single scalar output. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively to the inputs observationInfo and actionInfo, containing the observations and action specifications. obsName must contain the names of the input layers of net that are associated with the observation specifications. The action name actName must be the name of the input layer of net that is associated with the action specifications.

example

critic = rlQValueRepresentation(tab,observationInfo,actionInfo) creates the Q-value function based critic with discrete action and observation spaces from the Q-value table tab. tab is a rlTable object containing a table with as many rows as the possible observations and as many columns as the possible actions. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively to the inputs observationInfo and actionInfo, which must be rlFiniteSetSpec objects containing the specifications for the discrete observations and action spaces, respectively.

example

critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo) creates a Q-value function based critic using a custom basis function as the underlying approximator. The first input argument is a two-element cell array in which the first element contains the handle basisFcn to a custom basis function, and the second element contains the initial weight vector W0. Here the basis function must have both observations and action as inputs, and W0 must be a column vector. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively to the inputs observationInfo and actionInfo.

Multi-Output Discrete Action Space Q-Value Critic

example

critic = rlQValueRepresentation(net,observationInfo,actionInfo,'Observation',obsName) creates the multi-output Q-value function critic for a discrete action space. net is the deep neural network used as an approximator, and must have only the observations as input and a single output layer having as many elements as the number of possible discrete actions. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively to the inputs observationInfo and actionInfo, containing the observations and action specifications. Here, actionInfo must be an rlFiniteSetSpec object containing the specifications for the discrete action space. The observation names obsName must be the names of the input layers of net.

example

critic = rlQValueRepresentation({basisFcn,W0},observationInfo,actionInfo) creates the multi-output Q-value function critic for a discrete action space using a custom basis function as the underlying approximator. The first input argument is a two-element cell array in which the first element contains the handle basisFcn to a custom basis function, and the second element contains the initial weight matrix W0. Here the basis function must have only the observations as inputs, and W0 must have as many columns as the number of possible actions. This syntax sets the ObservationInfo and ActionInfo properties of critic respectively to the inputs observationInfo and actionInfo.

Options

critic = rlQValueRepresentation(___,options) creates the Q-value function based critic using the additional option set options, which is an rlRepresentationOptions object. This syntax sets the Options property of critic to the options input argument. You can use this syntax with any of the previous input-argument combinations.
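For instance, a minimal sketch of the options syntax (the layer names, option values, and specification dimensions here are illustrative, not prescribed by this page):

```matlab
% Illustrative observation and discrete action specifications
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

% Simple multi-output network: one Q-value per possible action
net = [featureInputLayer(4,'Normalization','none','Name','obs')
       fullyConnectedLayer(3,'Name','value')];

% Option set controlling the optimizer and learning rate
opts = rlRepresentationOptions('LearnRate',1e-3,'Optimizer','adam');

% Append the option set to the usual input-argument combination
critic = rlQValueRepresentation(net,obsInfo,actInfo, ...
    'Observation',{'obs'},opts);
```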

Input Arguments

Deep neural network used as the underlying approximator within the critic, specified as a deep neural network object.

For single output critics, net must have both observations and actions as inputs, and a scalar output, representing the expected cumulative long-term reward when the agent starts from the given observation and takes the given action. For multi-output discrete action space critics, net must have only the observations as input and a single output layer having as many elements as the number of possible discrete actions. Each output element represents the expected cumulative long-term reward when the agent starts from the given observation and takes the corresponding action. The learnable parameters of the critic are the weights of the deep neural network.

The network input layers must be in the same order and with the same data type and dimensions as the signals defined in ObservationInfo. Also, the names of these input layers must match the observation names listed in obsName.

For single-output critics, the network input layer associated with the action must have the same data type and dimensions as the signal defined in ActionInfo. Its name must be the action name specified in actName.

rlQValueRepresentation objects support recurrent deep neural networks for multi-output discrete action space critics.

For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Policy and Value Function Representations.

Observation names, specified as a cell array of strings or character vectors. The observation names must be the names of the input layers in net.

Example: {'my_obs'}

Action name, specified as a single-element cell array that contains a character vector. It must be the name of the input layer of net that is associated with the action.

Example: {'my_act'}

Q-value table, specified as an rlTable object containing an array with as many rows as the possible observations and as many columns as the possible actions. The element (s,a) is the expected cumulative long-term reward for taking action a from observed state s. The elements of this array are the learnable parameters of the critic.

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user defined function can either be an anonymous function or a function on the MATLAB path. The output of the critic is c = W'*B, where W is a weight vector or matrix containing the learnable parameters, and B is the column vector returned by the custom basis function.

For a single-output Q-value critic, c is a scalar representing the expected cumulative long term reward when the agent starts from the given observation and takes the given action. In this case, your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN,act)

For a multiple-output Q-value critic with a discrete action space, c is a vector in which each element is the expected cumulative long term reward when the agent starts from the given observation and takes the action corresponding to the position of the considered element. In this case, your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are observations in the same order and with the same data type and dimensions as the signals defined in observationInfo, and act has the same data type and dimensions as the action specifications in actionInfo.

Example: @(obs1,obs2,act) [act(2)*obs1(1)^2; abs(obs2(5)+act(1))]

Initial value of the basis function weights, W. For a single-output Q-value critic, W is a column vector having the same length as the vector returned by the basis function. For a multiple-output Q-value critic with a discrete action space, W is a matrix which must have as many rows as the length of the basis function output, and as many columns as the number of possible actions.

Properties

Representation options, specified as an rlRepresentationOptions object. Available options include the optimizer used for training and the learning rate.

Observation specifications, a reinforcement learning specification object or an array of specification objects defining properties such as the dimensions, data type, and names of the observation signals.

rlQValueRepresentation sets the ObservationInfo property of critic to the input observationInfo.

You can extract observationInfo from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using a specification command such as rlFiniteSetSpec or rlNumericSpec.

Action specifications, a reinforcement learning specification object, defining properties such as the dimensions, data type and name of the action signals.

rlQValueRepresentation sets the ActionInfo property of critic to the input actionInfo.

You can extract actionInfo from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

For custom basis function representations, the action signal must be a scalar, a column vector, or a discrete action.

Object Functions

rlDDPGAgent - Deep deterministic policy gradient reinforcement learning agent
rlTD3Agent - Twin-delayed deep deterministic policy gradient reinforcement learning agent
rlDQNAgent - Deep Q-network reinforcement learning agent
rlQAgent - Q-learning reinforcement learning agent
rlSARSAAgent - SARSA reinforcement learning agent
rlSACAgent - Soft actor-critic reinforcement learning agent
getValue - Obtain estimated value function representation
getMaxQValue - Obtain maximum state-value function estimate for Q-value function representation with discrete action space
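For a multi-output discrete action space critic, getMaxQValue returns both the maximum Q-value estimate and the index of the corresponding action. A minimal sketch (the layer names and specification values here are illustrative):

```matlab
% Illustrative specifications: 4-element observation, 3 possible actions
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);

% Multi-output network: one Q-value estimate per possible action
net = [featureInputLayer(4,'Normalization','none','Name','obs')
       fullyConnectedLayer(3,'Name','value')];
critic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',{'obs'});

% maxQ is the largest of the three Q-value estimates for this observation;
% actIdx is the position of the corresponding action in actInfo.Elements
[maxQ,actIdx] = getMaxQValue(critic,{rand(4,1)});
```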

Examples

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as a continuous two-dimensional space, so that a single action is a column vector containing two doubles.

actInfo = rlNumericSpec([2 1]);

Create a deep neural network to approximate the Q-value function. The network must have two inputs, one for the observation and one for the action. The observation input (here called myobs) must accept a four-element vector (the observation vector defined by obsInfo). The action input (here called myact) must accept a two-element vector (the action vector defined by actInfo). The output of the network must be a scalar, representing the expected cumulative long-term reward when the agent starts from the given observation and takes the given action.

% observation path layers
obsPath = [featureInputLayer(4, 'Normalization','none','Name','myobs') 
    fullyConnectedLayer(1,'Name','obsout')];

% action path layers
actPath = [featureInputLayer(2, 'Normalization','none','Name','myact') 
    fullyConnectedLayer(1,'Name','actout')];

% common path to output layers
comPath = [additionLayer(2,'Name', 'add')  fullyConnectedLayer(1, 'Name', 'output')];

% add layers to network object
net = addLayers(layerGraph(obsPath),actPath); 
net = addLayers(net,comPath);

% connect layers
net = connectLayers(net,'obsout','add/in1');
net = connectLayers(net,'actout','add/in2');

% plot network
plot(net)

Create the critic with rlQValueRepresentation, using the network, the observations and action specification objects, as well as the names of the network input layers.

critic = rlQValueRepresentation(net,obsInfo,actInfo, ...
    'Observation',{'myobs'},'Action',{'myact'})
critic = 
  rlQValueRepresentation with properties:

         ActionInfo: [1x1 rl.util.rlNumericSpec]
    ObservationInfo: [1x1 rl.util.rlNumericSpec]
            Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a random observation and action, using the current network weights.

v = getValue(critic,{rand(4,1)},{rand(2,1)})
v = single
    0.1102

You can now use the critic (along with an actor) to create an agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).

This example shows how to create a multi-output Q-value function critic for a discrete action space using a deep neural network approximator.

This critic takes only the observation as input and produces as output a vector with as many elements as the possible actions. Each element represents the expected cumulative long term reward when the agent starts from the given observation and takes the action corresponding to the position of the element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that a single observation is a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of three possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);

Create a deep neural network approximator to approximate the Q-value function within the critic. The input of the network (here called myobs) must accept a four-element vector, as defined by obsInfo. The output must be a single output layer having as many elements as the number of possible discrete actions (three in this case, as defined by actInfo).

net = [featureInputLayer(4,'Normalization','none','Name','myobs') 
       fullyConnectedLayer(3,'Name','value')];

Create the critic using the network, the observations specification object, and the name of the network input layer.

critic = rlQValueRepresentation(net,obsInfo,actInfo,'Observation',{'myobs'})
critic = 
  rlQValueRepresentation with properties:

         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
    ObservationInfo: [1x1 rl.util.rlNumericSpec]
            Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the values of a random observation, using the current network weights. There is one value for each of the three possible actions.

v = getValue(critic,{rand(4,1)})
v = 3x1 single column vector

    0.7232
    0.8177
   -0.2212

You can now use the critic (along with an actor) to create a discrete action space agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Create a finite set observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment with a discrete observation space). For this example, define the observation space as a finite set with 4 possible values.

obsInfo = rlFiniteSetSpec([7 5 3 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example define the action space as a finite set with 2 possible values.

actInfo = rlFiniteSetSpec([4 8]);

Create a table to approximate the value function within the critic. rlTable creates a value table object from the observation and action specifications objects.

qTable = rlTable(obsInfo,actInfo);

The table stores a value (representing the expected cumulative long-term reward) for each possible observation-action pair. Each row corresponds to an observation and each column corresponds to an action. You can access the table using the Table property of the qTable object. The initial value of each element is zero.

qTable.Table
ans = 4×2

     0     0
     0     0
     0     0
     0     0

You can initialize the table to any value; in this case, an array containing the integers from 1 through 8.

qTable.Table=reshape(1:8,4,2)
qTable = 
  rlTable with properties:

    Table: [4x2 double]

Create the critic using the table as well as the observations and action specification objects.

critic = rlQValueRepresentation(qTable,obsInfo,actInfo)
critic = 
  rlQValueRepresentation with properties:

         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
    ObservationInfo: [1x1 rl.util.rlFiniteSetSpec]
            Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a given observation and action, using the current table entries.

v = getValue(critic,{5},{8})
v = 6

You can now use the critic (along with an actor) to create a discrete action space agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous three-dimensional space, so that a single observation is a column vector containing 3 doubles.

obsInfo = rlNumericSpec([3 1]);

Create an action specification object (or alternatively use getActionInfo to extract the specification object from an environment). For this example, define the action space as a continuous two-dimensional space, so that a single action is a column vector containing 2 doubles.

actInfo = rlNumericSpec([2 1]);

Create a custom basis function to approximate the value function within the critic. The custom basis function must return a column vector. Each vector element must be a function of the observations and actions respectively defined by obsInfo and actInfo.

myBasisFcn = @(myobs,myact) [myobs(2)^2; myobs(1)+exp(myact(1)); abs(myact(2)); myobs(3)]
myBasisFcn = function_handle with value:
    @(myobs,myact)[myobs(2)^2;myobs(1)+exp(myact(1));abs(myact(2));myobs(3)]

The output of the critic is the scalar W'*myBasisFcn(myobs,myact), where W is a weight column vector that must have the same size as the custom basis function output. This output is the expected cumulative long-term reward when the agent starts from the given observation and takes the given action. The elements of W are the learnable parameters.

Define an initial parameter vector.

W0 = [1;4;4;2];

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial weight vector. The second and third arguments are, respectively, the observation and action specification objects.

critic = rlQValueRepresentation({myBasisFcn,W0},obsInfo,actInfo)
critic = 
  rlQValueRepresentation with properties:

         ActionInfo: [1×1 rl.util.rlNumericSpec]
    ObservationInfo: [1×1 rl.util.rlNumericSpec]
            Options: [1×1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the value of a given observation-action pair, using the current parameter vector.

v = getValue(critic,{[1 2 3]'},{[4 5]'})
v = 
  1×1 dlarray

  252.3926

You can now use the critic (along with an actor) to create an agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, rlSARSAAgent, or rlDDPGAgent agent).

This example shows how to create a multi-output Q-value function critic for a discrete action space using a custom basis function approximator.

This critic takes only the observation as input and produces as output a vector with as many elements as the possible actions. Each element represents the expected cumulative long term reward when the agent starts from the given observation and takes the action corresponding to the position of the element in the output vector.

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous two-dimensional space, so that a single observation is a column vector containing 2 doubles.

obsInfo = rlNumericSpec([2 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of 3 possible values (named 7, 5, and 3 in this case).

actInfo = rlFiniteSetSpec([7 5 3]);

Create a custom basis function to approximate the value function within the critic. The custom basis function must return a column vector. Each vector element must be a function of the observations defined by obsInfo.

myBasisFcn = @(myobs) [myobs(2)^2; myobs(1); exp(myobs(2)); abs(myobs(1))]
myBasisFcn = function_handle with value:
    @(myobs)[myobs(2)^2;myobs(1);exp(myobs(2));abs(myobs(1))]

The output of the critic is the vector c = W'*myBasisFcn(myobs), where W is a weight matrix which must have as many rows as the length of the basis function output, and as many columns as the number of possible actions.

Each element of c is the expected cumulative long term reward when the agent starts from the given observation and takes the action corresponding to the position of the considered element. The elements of W are the learnable parameters.

Define an initial parameter matrix.

W0 = rand(4,3);

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial parameter matrix. The second and third arguments are, respectively, the observation and action specification objects.

critic = rlQValueRepresentation({myBasisFcn,W0},obsInfo,actInfo)
critic = 
  rlQValueRepresentation with properties:

         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
    ObservationInfo: [1x1 rl.util.rlNumericSpec]
            Options: [1x1 rl.option.rlRepresentationOptions]

To check your critic, use the getValue function to return the values of a random observation, using the current parameter matrix. Note that there is one value for each of the three possible actions.

v = getValue(critic,{rand(2,1)})
v = 
  3x1 dlarray

    2.1395
    1.2183
    2.3342

You can now use the critic (along with an actor) to create a discrete action space agent relying on a Q-value function critic (such as an rlQAgent, rlDQNAgent, or rlSARSAAgent agent).

Create an environment and obtain observation and action information.

env = rlPredefinedEnv('CartPole-Discrete');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numDiscreteAct = numel(actInfo.Elements);

Create a recurrent deep neural network for your critic. To create a recurrent neural network, use a sequenceInputLayer as the input layer and include at least one lstmLayer.

Create a recurrent neural network for a multi-output Q-value function representation.

criticNetwork = [
    sequenceInputLayer(numObs,'Normalization','none','Name','state')
    fullyConnectedLayer(50, 'Name', 'CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    lstmLayer(20,'OutputMode','sequence','Name','CriticLSTM');
    fullyConnectedLayer(20,'Name','CriticStateFC2')
    reluLayer('Name','CriticRelu2')
    fullyConnectedLayer(numDiscreteAct,'Name','output')];

Create a representation for your critic using the recurrent neural network.

criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation','state',criticOptions);
Introduced in R2020a