
generateHindsightExperiences

Generate hindsight experiences from hindsight experience replay buffer

Since R2023a

    Description

    experience = generateHindsightExperiences(buffer,trajectoryLength) generates hindsight experiences from the last trajectory added to the specified hindsight experience replay memory buffer.


    Examples


    When you use a hindsight replay memory buffer within your custom agent training loop, you generate hindsight experiences at the end of each training episode.

    Create an observation specification for an environment with a single observation channel with six observations. For this example, assume that the observation channel contains the signals [a, xm, ym, xg, yg, c], where:

    • xg and yg are the goal observations.

    • xm and ym are the goal measurements.

    • a and c are additional observations.

    obsInfo = rlNumericSpec([6 1],...
        LowerLimit=0,UpperLimit=[1;5;5;5;5;1]);

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1],...
        LowerLimit=0,UpperLimit=10);

    To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel and the goals are in elements 4 and 5 of the observation channel.

    % {goal measurement channel, goal measurement elements, goal channel, goal elements}
    goalConditionInfo = {{1,[2 3],1,[4 5]}};

    For this example, use hindsightRewardFcn1 as the ground-truth reward function and hindsightIsDoneFcn1 as the termination condition function.
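
    These helper functions are not shown here. One possible shape for them is sketched below; the four-argument signature, the distance threshold, and the reward values are assumptions for illustration, so adapt them to your environment and to the function signatures required by rlHindsightReplayMemory.

    function reward = hindsightRewardFcn1(obs,action,nextObs,goal)
        % Illustrative reward (assumed logic): return 1 when the goal
        % measurement in the next observation (elements 2 and 3) is close
        % to the goal, and 0 otherwise.
        goalMeasurement = nextObs{1}(2:3);
        reward = double(norm(goalMeasurement - goal) < 0.1);
    end

    function isdone = hindsightIsDoneFcn1(obs,action,nextObs,goal)
        % Illustrative termination condition (assumed logic): end the
        % episode when the goal measurement reaches the goal.
        goalMeasurement = nextObs{1}(2:3);
        isdone = norm(goalMeasurement - goal) < 0.1;
    end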

    Create the hindsight replay memory buffer.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);

    As you train your agent, you add experience trajectories to the experience buffer. For this example, add a random experience trajectory of length 10.

    for i = 1:10
        exp(i).Observation = {obsInfo.UpperLimit.*rand(6,1)};
        exp(i).Action = {actInfo.UpperLimit.*rand(1)};
        exp(i).NextObservation = {obsInfo.UpperLimit.*rand(6,1)};
        exp(i).Reward = 10*rand(1);
        exp(i).IsDone = 0;
    end
    exp(10).IsDone = 1;
    
    append(buffer,exp);

    At the end of the training episode, generate hindsight experiences from the last trajectory added to the buffer, specifying the length of that trajectory.

    newExp = generateHindsightExperiences(buffer,10);

    For each experience in the final trajectory, the default "final" sampling strategy generates a new experience in which the goals in Observation and NextObservation are replaced with the goal measurements from the final experience in the trajectory.

    To validate this behavior, first view the final goal measurements from exp.

    exp(10).NextObservation{1}(2:3)
    ans = 2×1
    
        0.7277
        0.6803
    
    

    Next, view the goal values for one of the generated experiences. These values should match the final goal measurements.

    newExp(6).Observation{1}(4:5)
    ans = 2×1
    
        0.7277
        0.6803
    
    

    After generating the new experiences, append them to the buffer.

    append(buffer,newExp);
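
    In a custom training loop, these steps run at the end of each training episode. The following is a minimal sketch of that pattern, assuming hypothetical env and agent objects with reset, getAction, and step functions, and hypothetical numEpisodes and maxSteps values.

    for episode = 1:numEpisodes
        obs = reset(env);                                % hypothetical environment reset
        traj = [];                                       % trajectory for this episode
        for t = 1:maxSteps
            action = getAction(agent,obs);               % hypothetical agent
            [nextObs,reward,isDone] = step(env,action);  % hypothetical environment step
            traj(t).Observation = {obs};
            traj(t).Action = {action};
            traj(t).NextObservation = {nextObs};
            traj(t).Reward = reward;
            traj(t).IsDone = isDone;
            obs = nextObs;
            if isDone
                break
            end
        end

        % Append the trajectory, then generate and append hindsight experiences.
        append(buffer,traj);
        newExp = generateHindsightExperiences(buffer,numel(traj));
        append(buffer,newExp);
    end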

    Input Arguments


    Hindsight experience buffer, specified as an rlHindsightReplayMemory or rlHindsightPrioritizedReplayMemory object.

    Length of last trajectory in buffer, specified as a positive integer.

    Output Arguments


    Generated hindsight experiences, returned as a structure array with the following fields.

    Observation, returned as a cell array with length equal to the number of observation specifications specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-SequenceLength array, where DO is the dimension of the corresponding observation specification.

    Agent action, returned as a cell array with length equal to the number of action specifications specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-SequenceLength array, where DA is the dimension of the corresponding action specification.

    Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-SequenceLength array.

    Next observation reached by taking the specified action from the observation, returned as a cell array with the same format as Observation.

    Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of IsDone has one of the following values.

    • 0 — This experience is not the end of an episode.

    • 1 — The episode terminated because the environment generated a termination signal.

    • 2 — The episode terminated by reaching the maximum episode length.
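
    For example, to find which generated experiences end an episode, you can check the IsDone values. This is an illustrative snippet that uses the newExp structure array from the example above.

    % Indices of generated experiences that terminate an episode (illustrative).
    terminalIdx = find([newExp.IsDone] ~= 0)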


    Version History

    Introduced in R2023a