rlNeuralNetworkEnvironment

Environment model with deep neural network transition models

    Description

    Use an rlNeuralNetworkEnvironment object to create a reinforcement learning environment that computes state transitions using deep neural networks.

    Using an rlNeuralNetworkEnvironment object, you can:

    • Create an internal environment model for a model-based policy optimization (MBPO) agent.

    • Create an environment for training other types of reinforcement learning agents. You can identify the state-transition network using experimental or simulated data.

    Such environments can compute environment rewards and termination conditions using deep neural networks or custom functions.

    Creation

    Description

    env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,rewardFcn,isDoneFcn) creates a model for an environment with the observation and action specifications specified in obsInfo and actInfo, respectively. This syntax sets the TransitionFcn, RewardFcn, and IsDoneFcn properties.

    Input Arguments


    Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data type, and names of the observation signals.

    You can extract the observation specifications from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.
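    For example, the following sketch constructs the specifications manually. The dimensions, limits, and names are illustrative, not tied to a particular environment.

    % Sketch: construct specifications manually (dimensions and values
    % are illustrative, not tied to a particular environment).
    obsInfo = rlNumericSpec([4 1], ...
        LowerLimit=-inf,UpperLimit=inf);
    obsInfo.Name = "observations";

    actInfo = rlFiniteSetSpec([-10 10]);
    actInfo.Name = "force";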

    Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data type, and names of the action signals.

    You can extract the action specifications from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

    Properties


    Environment transition function, specified as one of the following:

    • rlContinuousDeterministicTransitionFunction object — Use this option when you expect the environment transitions to be deterministic.

    • rlContinuousGaussianTransitionFunction object — Use this option when you expect the environment transitions to be stochastic.

    • Array of such transition function objects — Use multiple transition models when creating an internal environment model for an MBPO agent.

    Environment reward function, specified as one of the following:

    • rlContinuousDeterministicRewardFunction object — Use this option when you do not know a ground-truth reward signal for your environment and you expect the reward signal to be deterministic.

    • rlContinuousGaussianRewardFunction object — Use this option when you do not know a ground-truth reward signal for your environment and you expect the reward signal to be stochastic.

    • Function handle — Use this option when you know a ground-truth reward signal for your environment. When you use an rlNeuralNetworkEnvironment object to create an rlMBPOAgent object, the custom reward function must return a batch of rewards given a batch of inputs.

    Environment is-done function, specified as one of the following:

    • rlIsDoneFunction object — Use this option when you do not know a ground-truth termination signal for your environment.

    • Function handle — Use this option when you know a ground-truth termination signal for your environment. When you use an rlNeuralNetworkEnvironment object to create an rlMBPOAgent object, the custom is-done function must return a batch of termination signals given a batch of inputs.

    Observation values, specified as a cell array with length equal to the number of specification objects in obsInfo. The order of the observations in Observation must match the order in obsInfo. Also, the dimensions of each element of the cell array must match the dimensions of the corresponding observation specification in obsInfo.

    To evaluate whether the transition models are well-trained, you can manually evaluate the environment for a given observation value using the step function. Specify the observation values before calling step.

    When you use this neural network environment object within an MBPO agent, this property is ignored.

    Transition model index, specified as a positive integer.

    To evaluate whether the transition models are well-trained, you can manually evaluate the environment for a given observation value using the step function. To select which transition model in TransitionFcn to evaluate, specify the transition model index before calling step.

    When you use this neural network environment object within an MBPO agent, this property is ignored.
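    The following sketch shows such a manual evaluation. It assumes env is an rlNeuralNetworkEnvironment with a 4-by-1 observation channel and at least two transition models, and that the observation and transition-model-index properties are named Observation and TransitionModelNum as described above; all values are illustrative.

    % Sketch: manually evaluate a single transition model (assumes env has
    % a 4-by-1 observation channel and at least two transition models;
    % values are illustrative).
    env.Observation = {[0.1; 0; 0.05; 0]};  % current observation
    env.TransitionModelNum = 2;             % select the second model
    act = 0.5;                              % action consistent with actInfo
    [nextObs,reward,isDone] = step(env,act);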

    Object Functions

    rlMBPOAgent — Model-based policy optimization reinforcement learning agent

    Examples


    Create an environment interface and extract observation and action specifications. Alternatively, you can create specifications using rlNumericSpec and rlFiniteSetSpec.

    env = rlPredefinedEnv("CartPole-Continuous");
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);

    Get the dimension of the observation and action spaces.

    numObservations = obsInfo.Dimension(1);
    numActions = actInfo.Dimension(1);

    Create a deterministic transition function based on a deep neural network with two input channels (current observations and actions) and one output channel (predicted next observation).

    % Create network layers.
    statePath = featureInputLayer(numObservations, ...
        Normalization="none",Name="state");
    actionPath = featureInputLayer(numActions, ...
        Normalization="none",Name="action");
    commonPath = [concatenationLayer(1,2,Name="concat")
        fullyConnectedLayer(64,Name="FC1")
        reluLayer(Name="CriticRelu1")
        fullyConnectedLayer(64, Name="FC3")
        reluLayer(Name="CriticCommonRelu2")
        fullyConnectedLayer(numObservations,Name="nextObservation")];
    
    % Combine network layers.
    transitionNetwork = layerGraph(statePath);
    transitionNetwork = addLayers(transitionNetwork,actionPath);
    transitionNetwork = addLayers(transitionNetwork,commonPath);
    transitionNetwork = connectLayers( ...
        transitionNetwork,"state","concat/in1");
    transitionNetwork = connectLayers( ...
        transitionNetwork,"action","concat/in2");
    
    % Create dlnetwork object.
    transitionNetwork = dlnetwork(transitionNetwork);
    
    % Create transition function object.
    transitionFcn = rlContinuousDeterministicTransitionFunction(...
        transitionNetwork,obsInfo,actInfo,...
        ObservationInputNames="state", ...
        ActionInputNames="action", ...
        NextObservationOutputNames="nextObservation");

    Create a stochastic reward function based on a deep neural network with three input channels (current observations, current action, and next observations) and two output channels (predicted mean and standard deviation of the reward).

    % Create network layers.
    nextStatePath = featureInputLayer( ...
        numObservations,Name="nextState");
    commonPath = [concatenationLayer(1,3,Name="concat")
        fullyConnectedLayer(32,Name="fc")
        reluLayer(Name="relu1")
        fullyConnectedLayer(32,Name="fc2")];
    meanPath = [reluLayer(Name="rewardMeanRelu")
        fullyConnectedLayer(1,Name="rewardMean")];
    stdPath = [reluLayer(Name="rewardStdRelu")
        fullyConnectedLayer(1,Name="rewardStdFc")
        softplusLayer(Name="rewardStd")];
    
    % Combine network layers.
    rewardNetwork = layerGraph(statePath);
    rewardNetwork = addLayers(rewardNetwork,actionPath);
    rewardNetwork = addLayers(rewardNetwork,nextStatePath);
    rewardNetwork = addLayers(rewardNetwork,commonPath);
    rewardNetwork = addLayers(rewardNetwork,meanPath);
    rewardNetwork = addLayers(rewardNetwork,stdPath);
    
    rewardNetwork = connectLayers( ...
        rewardNetwork,"nextState","concat/in1");
    rewardNetwork = connectLayers( ...
        rewardNetwork,"action","concat/in2");
    rewardNetwork = connectLayers( ...
        rewardNetwork,"state","concat/in3");
    rewardNetwork = connectLayers( ...
        rewardNetwork,"fc2","rewardMeanRelu");
    rewardNetwork = connectLayers( ...
        rewardNetwork,"fc2","rewardStdRelu");
    
    % Create dlnetwork object.
    rewardNetwork = dlnetwork(rewardNetwork);
    
    % Create reward function object.
    rewardFcn = rlContinuousGaussianRewardFunction(...
        rewardNetwork,obsInfo,actInfo,...
        ObservationInputNames="state",...
        ActionInputNames="action", ...
        NextObservationInputNames="nextState", ...
        RewardMeanOutputNames="rewardMean", ...
        RewardStandardDeviationOutputNames="rewardStd");

    Create an is-done function with one input channel (next observations) and one output channel (predicted termination signal).

    % Create network layers.
    commonPath = [featureInputLayer(numObservations, ...
            Normalization="none",Name="nextState");
        fullyConnectedLayer(64,Name="FC1")
        reluLayer(Name="CriticRelu1")
        fullyConnectedLayer(64,Name="FC3")
        reluLayer(Name="CriticCommonRelu2")
        fullyConnectedLayer(2,Name="isdone0")
        softmaxLayer(Name="isdone")];
    isDoneNetwork = layerGraph(commonPath);
    
    % Create dlnetwork object.
    isDoneNetwork = dlnetwork(isDoneNetwork);
    
    % Create is-done function object.
    isDoneFcn = rlIsDoneFunction(isDoneNetwork, ...
        obsInfo,actInfo, ...
        NextObservationInputNames="nextState");

    Create a neural network environment using the transition, reward, and is-done functions.

    env = rlNeuralNetworkEnvironment( ...
        obsInfo,actInfo, ...
        transitionFcn,rewardFcn,isDoneFcn);

    Create an environment interface and extract observation and action specifications. Alternatively, you can create specifications using rlNumericSpec and rlFiniteSetSpec.

    env = rlPredefinedEnv("CartPole-Continuous");
    obsInfo = getObservationInfo(env);
    numObservations = obsInfo.Dimension(1);
    actInfo = getActionInfo(env);
    numActions = actInfo.Dimension(1);

    Create a deterministic transition function based on a deep neural network with two input channels (current observations and actions) and one output channel (predicted next observation).

    % Create network layers.
    statePath = featureInputLayer(numObservations,...
        Normalization="none",Name="state");
    actionPath = featureInputLayer(numActions,...
        Normalization="none",Name="action");
    commonPath = [concatenationLayer(1,2,Name="concat")
        fullyConnectedLayer(64,Name="FC1")
        reluLayer(Name="CriticRelu1")
        fullyConnectedLayer(64, Name="FC3")
        reluLayer(Name="CriticCommonRelu2")
        fullyConnectedLayer(numObservations,Name="nextObservation")];
    
    % Combine network layers.
    transitionNetwork = layerGraph(statePath);
    transitionNetwork = addLayers(transitionNetwork,actionPath);
    transitionNetwork = addLayers(transitionNetwork,commonPath);
    transitionNetwork = connectLayers(transitionNetwork,"state","concat/in1");
    transitionNetwork = connectLayers(transitionNetwork,"action","concat/in2");
    
    % Create dlnetwork object.
    transitionNetwork = dlnetwork(transitionNetwork);
    
    % Create transition function object.
    transitionFcn = rlContinuousDeterministicTransitionFunction(...
        transitionNetwork,obsInfo,actInfo,...
        ObservationInputNames="state", ...
        ActionInputNames="action", ...
        NextObservationOutputNames="nextObservation");

    You can define a known reward function for your environment using a custom function. Your custom reward function must take the observations, actions, and next observations as cell-array inputs and return a reward value (a batch of reward values when the inputs are batched). For this example, use the following custom reward function, which computes the reward based on the next observation.

    type cartPoleRewardFunction.m
    function reward = cartPoleRewardFunction(obs,action,nextObs)
    % Compute reward value based on the next observation.
    
        if iscell(nextObs)
            nextObs = nextObs{1};
        end
    
        % Distance at which to fail the episode
        xThreshold = 2.4;
    
        % Reward each time step the cart-pole is balanced
        rewardForNotFalling = 1;
    
        % Penalty when the cart-pole fails to balance
        penaltyForFalling = -5;
    
        x = nextObs(1,:);
        distReward = 1 - abs(x)/xThreshold;
    
        isDone = cartPoleIsDoneFunction(obs,action,nextObs);
    
        reward = zeros(size(isDone));
        reward(logical(isDone)) = penaltyForFalling;
        reward(~logical(isDone)) = ...
            0.5 * rewardForNotFalling + 0.5 * distReward(~logical(isDone));
    end
    

    You can define a known is-done function for your environment using a custom function. Your custom is-done function must take the observations, actions, and next observations as cell-array inputs and return a logical termination signal (a batch of termination signals when the inputs are batched). For this example, use the following custom is-done function, which computes the termination signal based on the next observation.

    type cartPoleIsDoneFunction.m
    function isDone = cartPoleIsDoneFunction(obs,action,nextObs)
    % Compute termination signal based on next observation.
    
        if iscell(nextObs)
            nextObs = nextObs{1};
        end
    
        % Angle at which to fail the episode
        thetaThresholdRadians = 12 * pi/180;
    
        % Distance at which to fail the episode
        xThreshold = 2.4;
    
        x = nextObs(1,:);
        theta = nextObs(3,:);
        
        isDone = abs(x) > xThreshold | abs(theta) > thetaThresholdRadians;
    end
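
    You can evaluate the custom functions directly on a small batch of transitions before using them in the environment. The following sketch uses illustrative values; each column is one batch element, and only the cart position (row 1) and pole angle (row 3) are nonzero.

    % Sketch: evaluate the custom functions on a batch of illustrative
    % transitions (columns are batch elements).
    obs     = {zeros(4,3)};
    act     = {zeros(1,3)};
    nextObs = {[0.5 2.5 0;     % cart positions x
                0   0   0;
                0.1 0   0.3;   % pole angles theta (rad)
                0   0   0]};
    done = cartPoleIsDoneFunction(obs,act,nextObs)  % [0 1 1]
    rwd  = cartPoleRewardFunction(obs,act,nextObs)  % approx. [0.90 -5 -5]

    The second column terminates because abs(x) = 2.5 exceeds the 2.4 threshold, and the third because abs(theta) = 0.3 rad exceeds 12 degrees (about 0.21 rad).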
    

    Create a neural network environment using the transition function object and the custom reward and is-done functions.

    env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn,...
        @cartPoleRewardFunction,@cartPoleIsDoneFunction);

    Version History

    Introduced in R2022a