rlHindsightPrioritizedReplayMemory
Description
An off-policy reinforcement learning agent stores experiences in a circular experience buffer.
During training the agent stores each of its experiences (S,A,R,S',D) in the buffer. Here:
S is the current observation of the environment.
A is the action taken by the agent.
R is the reward for taking action A.
S' is the next observation after taking action A.
D is the is-done signal after taking action A.
The agent then samples mini-batches of experiences from the buffer and uses these mini-batches to update its actor and critic function approximators.
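As a concrete illustration of this store-and-sample workflow, the sketch below appends a single experience to a generic rlReplayMemory buffer and then samples a mini-batch from it. The specification dimensions, buffer length, and mini-batch size are placeholder assumptions; the same append and sample pattern applies to the hindsight buffers described below.

% Placeholder observation and action specifications (assumed for illustration).
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);

% Create a generic circular experience buffer with an assumed capacity.
buffer = rlReplayMemory(obsInfo,actInfo,10000);

% Store one experience (S,A,R,S',D) in the buffer.
experience.Observation     = {rand(4,1)};   % S
experience.Action          = {rand(1,1)};   % A
experience.Reward          = 1;             % R
experience.NextObservation = {rand(4,1)};   % S'
experience.IsDone          = 0;             % D
append(buffer,experience);

% Sample a mini-batch of experiences (here a single experience) for training.
miniBatch = sample(buffer,1);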
By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object as their experience buffer. For goal-conditioned tasks, where the observation includes both the goal and a goal measurement, you can use an rlHindsightReplayMemory object. rlHindsightReplayMemory objects sample experiences uniformly from the buffer. To use prioritized nonuniform sampling, which can improve sample efficiency, use an rlHindsightPrioritizedReplayMemory object.
A hindsight replay memory experience buffer:
Generates additional experiences by replacing goals with goal measurements
Improves sample efficiency for tasks with sparse rewards
Requires a ground-truth reward function and is-done function
Is not necessary when you have a well-shaped reward function
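Because the buffer recomputes rewards and is-done signals for the relabeled goals, you must supply ground-truth functions for both. The sketch below shows one plausible pair of functions for a sparse-reward, goal-conditioned task; the three-argument signature, the observation layout (goal in elements 1:2, goal measurement in elements 3:4 of a single channel), and the success threshold are assumptions for illustration, not a prescribed interface.

% Hypothetical ground-truth reward function, saved as myRewardFcn.m.
% Assumed layout: one observation channel in which elements 1:2 are the goal
% and elements 3:4 are the goal measurement.
function reward = myRewardFcn(obs,action,nextObs)
    goal        = nextObs{1}(1:2);
    measurement = nextObs{1}(3:4);
    % Sparse reward: 0 when the measurement reaches the goal, -1 otherwise.
    if norm(measurement - goal) < 0.1
        reward = 0;
    else
        reward = -1;
    end
end

% Matching hypothetical is-done function, saved as myIsDoneFcn.m,
% using the same assumed observation layout and threshold.
function isDone = myIsDoneFcn(obs,action,nextObs)
    goal        = nextObs{1}(1:2);
    measurement = nextObs{1}(3:4);
    isDone = norm(measurement - goal) < 0.1;
end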
For more information on hindsight experience replay and prioritized sampling, see Algorithms.
Creation
Syntax
buffer = rlHindsightPrioritizedReplayMemory(obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo)
Description
buffer = rlHindsightPrioritizedReplayMemory(obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo) creates a hindsight prioritized replay memory experience buffer that is compatible with the observation and action specifications in obsInfo and actInfo, respectively. This syntax sets the RewardFcn, IsDoneFcn, and GoalConditionInfo properties.
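As a hedged creation sketch, the call below assumes a single four-element observation channel (goal in elements 1:2, goal measurement in elements 3:4), a scalar action, and the hypothetical myRewardFcn and myIsDoneFcn shown earlier. The nested cell array used for goalConditionInfo is an illustrative placeholder; consult the goalConditionInfo input argument for the exact format it expects.

% Assumed observation specification: one channel, elements 1:2 goal,
% elements 3:4 goal measurement. Scalar action.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);

% Assumed mapping between goal-measurement elements and goal elements.
% This nested cell array is an illustrative placeholder; see the
% goalConditionInfo input argument for the documented encoding.
goalConditionInfo = {{1,3:4,1,1:2}};

% Create the hindsight prioritized replay memory buffer using the
% hypothetical reward and is-done functions sketched earlier.
buffer = rlHindsightPrioritizedReplayMemory(obsInfo,actInfo, ...
    @myRewardFcn,@myIsDoneFcn,goalConditionInfo);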
Input Arguments
Properties
Object Functions
append | Append experiences to replay memory buffer |
sample | Sample experiences from replay memory buffer |
resize | Resize replay memory experience buffer |
reset | Reset environment, agent, experience buffer, or policy object |
allExperiences | Return all experiences in replay memory buffer |
validateExperience | Validate experiences for replay memory |
generateHindsightExperiences | Generate hindsight experiences from hindsight experience replay buffer |
getActionInfo | Obtain action data specifications from reinforcement learning environment, agent, or experience buffer |
getObservationInfo | Obtain observation data specifications from reinforcement learning environment, agent, or experience buffer |
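As a brief, hedged illustration of a few of these functions, the sketch below queries the specifications stored in an existing buffer, retrieves its contents, and then resizes and resets it; the new maximum length is an arbitrary placeholder.

% Query the specifications stored in the buffer.
obsInfo = getObservationInfo(buffer);
actInfo = getActionInfo(buffer);

% Retrieve every experience currently stored in the buffer.
experiences = allExperiences(buffer);

% Grow the buffer to an assumed new maximum length, then clear its contents.
resize(buffer,100000);
reset(buffer);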
Examples
Limitations
Hindsight prioritized experience replay does not support agents that use recurrent neural networks.
Algorithms
References
[1] Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. "Prioritized Experience Replay." arXiv:1511.05952 [cs], 25 February 2016. https://arxiv.org/abs/1511.05952.
[2] Andrychowicz, Marcin, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. "Hindsight Experience Replay." 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017.
Version History
Introduced in R2023a