Main Content

Custom Training Loop with Simulink Action Noise

This example shows how to tune a controller for vehicle platooning applications using a custom reinforcement learning (RL) training loop. For this application, action noise is generated in the Simulink® model to promote exploration during training.

For an example on tuning a PID-based vehicle platooning system, see Design Controller for Vehicle Platooning (Simulink Control Design).

Platooning has the following control objectives [1].

  • Individual vehicle stability — Spacing error for each following vehicle converges to zero if the preceding vehicle is traveling at constant speed.

  • String stability — Spacing errors do not amplify as they propagate towards the tail of the vehicle string.

Platooning Environment Model

In this example, there are five vehicles in the platoon. Every vehicle is modeled as a truck-trailer system with the following parameters. All lengths are in meters.

L1 = 6;               % Truck length
L2 = 10;              % Trailer length
M1 = 1;               % Hitch length
L = L1 + L2 + M1 + 5; % Desired front-to-front vehicle spacing

The lead vehicle follows a given acceleration profile. Each trailing vehicle has a controller that controls its acceleration.

Open the Simulink® model.

mdl = "fiveVehiclePlatoonEnv";

The model contains an RL Agent block with its last action input port enabled. This input port allows the specification of custom noise in the Simulink model for off-policy RL agents, such as deep deterministic policy gradient (DDPG) agents.

Specify the path to the RL Agent block.

agentBlk = mdl + "/RL Agent";

Controller Structure

In this example, each trailing vehicle (ego vehicle) has the same continuous-time controller structure and parameterization.



  • aego, vego, and xego are the respective acceleration, velocity, and position of the ego vehicle.

  • afront, vfront, and xfront are the respective acceleration, velocity, and position of the vehicle directly in front of the ego vehicle.

Each vehicle has full access to its own velocity and position states but can only access the acceleration, velocity, and position of the vehicle directly in front using wireless communication.

The controller minimizes the velocity error vfront-vego using the velocity gain K2 and minimizes the spacing error xego-xfront+L using the spacing gain K3. The feedforward gain K1 is used to improve tracking of the front vehicle.

The lead vehicle' acceleration is assumed to be a sine wave



  • alead is the lead vehicle acceleration (m/s^2).

  • A is the amplitude of the sine wave (m/s).

  • w is the frequency of the sine wave (rad/s).

  • t is the simulation time (s).

Reinforcement Learning Agent Design

The objective of the agent is to compute adaptive gains so that each vehicle can track the desired spacing with respect to the vehicle immediately in front. Therefore, the model is configured such that:

  • The action signal consists of the gains K=[K1K2K3] shared by all vehicles except the lead vehicle. Each gain has a lower bound of 0 and upper bounds of 1, 20, and 20, respectively. The agent calculates new gains once per second. To encourage exploration during training, the gains are perturbed by random noise with a zero-mean normal distribution: Knoisy=K+Ν(0,σ2) where the variance σ2=[0.02,0.1,0.1].

  • The observation signal consists of the vehicle spacing (diff(x)) minus the target spacing (L), the vehicle velocities (v), and the vehicle accelerations (a).

  • The reward calculated at every time step t is



  • st is the vehicle spacing (diff(xt)) at time step t.

  • maxOvershoot(st,L) calculates the max overshoot of all vehicles given the actual vehicle spacing st and the desired spacing L. In this case, overshoot is defined as when the vehicle spacing is less than the desired spacing st<L.

  • ct=any((st+L1+L2+M1-L)0)) indicates if a vehicle collision occurs. The simulation will terminate if ctis true.

The first term in the reward function encourages the vehicle spacing to match L. The second term penalizes large changes in gain between time steps. The third term penalizes overshooting the target spacing (getting too close to the front vehicle). Finally, the fourth term penalizes collisions.

For this example, to accommodate the custom noise specified in the model, you implement a custom DDPG training loop.

Define Model Parameters

Define the training and simulation parameters that remain fixed during training.

Ts = 1;                       % Sample time (seconds)
Tf = 100;                     % Simulation length (seconds)
AccelNoiseV = ones(1,5)*0.01; % Acceleration input noise variance
VelNoiseV = ones(1,5)*0.01;   % Velocity sensor noise variance
PosNoiseV = ones(1,5)*0.01;   % Position sensor noise variance
ParamLowerLimit = [0 0 0]';   % Lower limits for the controller gains
ParamUpperLimit = [1 20 20]'; % Upper limits for the controller gains
UseParamNoise = 1;            % Option to indicate if noise is injected

Define the parameters that change every training episode. The values for these parameters are updated in the environment reset function resetFunction.

LeadA = 2;                    % Lead vehicle acceleration amplitude
LeadF = 1;                    % Lead vehicle acceleration frequency
ParamNoiseV = [0.02 0.1 0.1]; % Variance for controller gains

% Random noise seeds
ParamNoiseSeed = 1:3;       % Controller gain noise seed
AccelNoiseSeed = 1:5 + 100; % Acceleration input noise seed
VelNoiseSeed = 1:5 + 200;   % Velocity sensor noise seed
PosNoiseSeed = 1:5 + 300;   % Position sensor noise seed

% Initial position and velocity of each vehicle
InitialPositions  = [200 150 100 50 0] + 50; % Positions
InitialVelocities = [10 10 10 10 10];        % Velocities

Create Environment

Create an environment using rlSimulinkEnv.

To do so, first define the observation and action specifications for the environment.

obsInfo = rlNumericSpec([14 1]);
actInfo = rlNumericSpec([3  1],...
obsInfo.Name = "measurements";
actInfo.Name = "control_gains";

Next, create the environment object.

env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);

Set the environment reset function to the local function resetFunction included with this example. This function varies the training conditions for each episode.

env.ResetFcn = @resetFunction;

Noise in the model is specified using the Random Number (Simulink) block. Each block has its own random number generator and thus its own starting seed parameter. To ensure the noise stream varies across episodes, the seed variables are updated using resetFunction .

Create Actor, Critic, and Policy

Create actor and critic function approximators for the agent using the local function createNetworks included with this example.

[critic,actor] = createNetworks(obsInfo,actInfo);

Create optimizer objects for updating the actor and critic. Use the same options for both optimizers.

optimizerOpt = rlOptimizerOptions(...
    LearnRate=1e-3, ...
    GradientThreshold=1, ...
criticOptimizer = rlOptimizer(optimizerOpt);
actorOptimizer  = rlOptimizer(optimizerOpt);

Create a deterministic policy for the actor approximator.

policy = rlDeterministicActorPolicy(actor);

Specify the policy sample time.

policy.SampleTime = Ts;

Create Experience Buffer

Create an experience buffer for the agent with a maximum length of 1e6.

replayMemory = rlReplayMemory(obsInfo,actInfo,1e6);

Data Required for Learning

To update the actor and critic during training, the runEpisode function processes each experience as it is received from the environment. For this example, the processing function is the processExperienceFcn local function.

This function requires additional data to perform its processing. Create a structure to store this additional data.

processExpData.Critic = critic;
processExpData.TargetCritic = critic;
processExpData.Actor = actor;
processExpData.TargetActor = actor;
processExpData.ReplayMemory = replayMemory;
processExpData.CriticOptimizer = criticOptimizer;
processExpData.ActorOptimizer = actorOptimizer;

processExpData.MiniBatchSize = 128;
processExpData.DiscountFactor = 0.99;
processExpData.TargetSmoothFactor = 1e-3;

Each episode, the processExperienceFcn function updates the critics, actors, replay memory, and optimizers. The updated data is used as the input for the next episode.

Training Loop

To train the agent, the custom training loop simulates the agent in the environment for a maximum of maxEpisodes episodes.

maxEpisodes = 1000;

Compute the maximum number of steps per episode using the simulation time and sample time.

maxSteps = ceil(Tf/Ts);

For this custom training loop:

  • The runEpisode function simulates the agent in the environment for one episode.

  • Experiences are processed as they are received from the environment using the processExperienceFcn function.

  • Experiences are not logged by runEpisode since the experiences are processed as they are received.

  • To speed up training, when calling runEpisode, the CleanupPostSim option is set to false. Doing so keeps the model compiled between episodes.

  • The PlatooningTrainingCurvePlotter object is a helper object to plot training data while the training is running.

  • You can stop the training using a Stop button in the training plot.

  • After all the episodes are complete, the cleanup function cleans up the environment and terminates the model compilation.

Training the policy is a computationally intensive process that can take several minutes to hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the policy yourself, set doTraining to true.

doTraining = false;
if doTraining
    % Create plotting helper object.
    plotObj = PlatooningTrainingCurvePlotter();

    % Training loop
    for i = 1:maxEpisodes
        % Run the episode.
        out = runEpisode(...
        % Extract episode information for updating the training curves.
        episodeInfo = out.AgentData.EpisodeInfo;
        % Extract updated processExpData for the next episode.
        processExpData = out.AgentData.ProcessExperienceData;
        % Extract the updated policy for the next episode.
        policy = out.AgentData.Agent;
        % Extract the critic and actor networks from processExpData.
        critic = processExpData.Critic;
        actor  = processExpData.Actor;

        % Extract the cumulative reward and calculate average reward 
        % per step for this episode.
        cumulativeRwd = episodeInfo.CumulativeReward;
        avgRwdPerStep = cumulativeRwd/episodeInfo.StepsTaken;
        % Evaluate q0 from the initial observation of the episode.
        obs0 = episodeInfo.InitialObservation;
        q0 = evaluate(critic,[obs0,evaluate(actor,obs0)]);
        q0 = double(q0{1});
        % Update the plot.
        % Exit training if button pushed.
        if plotObj.StopTraining

    % Clean up the environment.

    % Save the policy.
    % Load the pretrained policy.

Validate Trained Policy

Validate the learned policy by running five simulations with random initial conditions specified by the reset function.

First, turn off parameter noise in the model.

UseParamNoise = 0;

Simulate the model against the trained policy five times.

N = 5;
simOpts = rlSimulationOptions(...
    MaxSteps=maxSteps, ...
experiences = sim(env,policy,simOpts);

Plot the vehicle spacing error, gains, and reward from the experiences output structure.

f = figure(Position=[100 100 1024 768]);
for i = 1:N
    % get the spacing
    tspacing = experiences(i).Observation.measurements.Time;
    spacing  = squeeze(experiences(i).Observation.measurements.Data(1:4,:,:));
    % get the gains
    tgains = experiences(i).Action.control_gains.Time;
    gains  = squeeze(experiences(i).Action.control_gains.Data);
    % get the reward
    trwd = experiences(i).Reward.Time;
    rwd  = experiences(i).Reward.Data;

    % plot the spacing
    title(sprintf("Vehicle Spacing Error Simulation %u",i))
    grid on

    % plot the gains
    title(sprintf("Vehicle Gains Simulation %u",i))
    grid on

    % plot the reward
    title(sprintf("Vehicle Reward Simulation %u",i))
    grid on

Figure contains 15 axes objects. Axes object 1 with title Vehicle Spacing Error Simulation 1 contains 4 objects of type stair. Axes object 2 with title Vehicle Gains Simulation 1 contains 3 objects of type stair. Axes object 3 with title Vehicle Reward Simulation 1 contains an object of type stair. Axes object 4 with title Vehicle Spacing Error Simulation 2 contains 4 objects of type stair. Axes object 5 with title Vehicle Gains Simulation 2 contains 3 objects of type stair. Axes object 6 with title Vehicle Reward Simulation 2 contains an object of type stair. Axes object 7 with title Vehicle Spacing Error Simulation 3 contains 4 objects of type stair. Axes object 8 with title Vehicle Gains Simulation 3 contains 3 objects of type stair. Axes object 9 with title Vehicle Reward Simulation 3 contains an object of type stair. Axes object 10 with title Vehicle Spacing Error Simulation 4 contains 4 objects of type stair. Axes object 11 with title Vehicle Gains Simulation 4 contains 3 objects of type stair. Axes object 12 with title Vehicle Reward Simulation 4 contains an object of type stair. Axes object 13 with title Vehicle Spacing Error Simulation 5 contains 4 objects of type stair. Axes object 14 with title Vehicle Gains Simulation 5 contains 3 objects of type stair. Axes object 15 with title Vehicle Reward Simulation 5 contains an object of type stair.

From the plots, you can see that the trained policy generates adaptive gains that adequately track the desired spacing for all vehicles.

Local Functions

The process experience function is called every time an experience is processed by the RL Agent block. Here, processExperienceFcn appends the experience to the replay memory, samples a mini-batch of experiences from the replay memory, and updates the critic, actor, and target networks.

function [policy,processExpData] = processExperienceFcn(exp,episodeInfo,policy,processExpData)

% Append experience to replay memory buffer.

% Sample a mini-batch of experiences from replay memory.
miniBatch = sample(processExpData.ReplayMemory, ...

if ~isempty(miniBatch)
    % Update network parameters using the mini-batch.
    [processExpData,actorParams] = learnFcn(processExpData,miniBatch);
    % Update the policy parameters using the actor parameters.
    policy = setLearnableParameters(policy,actorParams);

The learnFcn function updates the critic, actor, and target networks given a sampled mini-batch

function [processExpData,actorParams] = learnFcn(processExpData,miniBatch)

% Find the terminal experiences.
doneidx = (miniBatch.IsDone == 1);

% Compute target next actions against the next observations.
nextAction = evaluate(processExpData.TargetActor,miniBatch.NextObservation);

% compute qtarget = reward + gamma*Q(nextObservation,nextAction)
%                 = reward + gamma*expectedFutureReturn
targetq = miniBatch.Reward;

% Bootstrap the target at nonterminal experiences.
expectedFutureReturn = ...
targetq(~doneidx) = targetq(~doneidx) + ...

% Compute critic gradient using deepCriticLoss function.
criticGradient = gradient(processExpData.Critic,@deepCriticLoss,...

% Update the critic parameters.
[processExpData.Critic,processExpData.CriticOptimizer] = update(...

% Compute the actor gradient using the deepActorGradient function. To
% accelerate the deepActorGradient function, the critic network is
% extracted outside the function and is passed in as a field to the
% actorGradData input struct.
actorGradData.CriticNet = getModel(processExpData.Critic);
actorGradData.MiniBatchSize = processExpData.MiniBatchSize;
actorGradient = customGradient(processExpData.Actor,@deepActorGradient,...

% Update the actor parameters.
[processExpData.Actor,processExpData.ActorOptimizer] = update(...
actorParams = getLearnableParameters(processExpData.Actor);

% Update targets using the given TargetSmoothFactor hyperparameter.
processExpData.TargetCritic = syncParameters(processExpData.TargetCritic,...
processExpData.TargetActor  = syncParameters(processExpData.TargetActor ,...
    processExpData.Actor ,processExpData.TargetSmoothFactor);

The critic gradient is computed against the deepCriticLoss function.

function loss = deepCriticLoss(q,targetq)
q = q{1};
% Loss is the half mean-square error of q = Q(observation,action)
%against  qtarget
loss = mse(q,reshape(targetq,size(q)));

The actor gradient is computed to maximize the expected value of an observation-action pair given the policy parameters. Here, the negative sign is used to maximize Q with respect to θ.



  • s is the batch observations

  • a is the batch actions

  • Qϕ is the critic network parameterized by ϕ

  • πθ is the actor network parameterized by θ

  • N is the mini batch size

function dQdTheta = deepActorGradient(actorNet,observation,gradData)

% Evaluate actions from current observations.
action = forward(actorNet,observation{:});

% Compute: q = Q(s,a)
q = predict(gradData.CriticNet,observation{:},action);

% Compute: qsum = -sum(q)/N to maximize q
qsum = -sum(q,"all")/gradData.MiniBatchSize;

% Compute: d(-sum(q)/N)/dActorParams
dQdTheta = dlgradient(qsum,actorNet.Learnables);

The environment reset function varies the initial conditions, reference trajectory, and noise seeds for every episode.

function in = resetFunction(in)
% Perturb the nominal reference amplitude and frequency.
LeadA = max(2 + 0.1*randn,0.1);
LeadF = max(1 + 0.1*randn,0.1);

% Perturb the nominal spacing.
L = 22 + 3*randn;

% Perturb the initial states.
InitialPositions  = [250 200 150 100 50] + 5*randn(1,5);
InitialVelocities = [10  10  10  10  10] + 1*randn(1,5);

% Update the noise seeds.
ParamNoiseSeed  = randi(100,1,3);
AccelNoiseSeed  = randi(100,1,5) + 100;
VelNoiseSeed    = randi(100,1,5) + 200;
PosNoiseSeed    = randi(100,1,5) + 300;

% Update the model variables.
in = setVariable(in,"L",L);
in = setVariable(in,"LeadA",LeadA);
in = setVariable(in,"LeadF",LeadF);
in = setVariable(in,"InitialPositions",InitialPositions);
in = setVariable(in,"InitialVelocities",InitialVelocities);
in = setVariable(in,"ParamNoiseSeed",ParamNoiseSeed);
in = setVariable(in,"AccelNoiseSeed",AccelNoiseSeed);
in = setVariable(in,"VelNoiseSeed",VelNoiseSeed);
in = setVariable(in,"PosNoiseSeed",PosNoiseSeed);

Create the critic and actor networks.

function [critic,actor] = createNetworks(obsInfo,actInfo)

hiddenLayerSize = 64;

numObs = prod(obsInfo.Dimension);
numAct = prod(actInfo.Dimension);

% Create the critic network.
obsInput = featureInputLayer(numObs, ...
    Normalization="none", ...
actInput = featureInputLayer(numAct, ...
    Normalization="none", ...
catPath = [
net = layerGraph();
net = addLayers(net,obsInput);
net = addLayers(net,actInput);
net = addLayers(net,catPath);
net = connectLayers(net,obsInfo.Name,"concat/in1");
net = connectLayers(net,actInfo.Name,"concat/in2");
net = dlnetwork(net);

% Create the critic function approximator.
critic = rlQValueFunction(net,obsInfo,actInfo);
critic = accelerate(critic,true);

% Create the actor network.
scale = (actInfo.UpperLimit - actInfo.LowerLimit)/2;
bias  = actInfo.LowerLimit + scale;
obsPath = [
    featureInputLayer(numObs, ...
        Normalization="none", ...
net = layerGraph;
net = addLayers(net,obsPath);
net = dlnetwork(net);

% Create the actor function approximator.
actor = rlContinuousDeterministicActor(net,obsInfo,actInfo);
actor = accelerate(actor,true);


[1] Rajamani, Rajesh. Vehicle Dynamics and Control. 2. ed. Mechanical Engineering Series. New York, NY Heidelberg: Springer, 2012.

See Also



Related Topics