Train PPO Agent with Curriculum Learning for a Lane Keeping Application
This example shows how to use a curriculum to train a Proximal Policy Optimization (PPO) agent for Lane Keeping Assist (LKA) in Simulink®. A curriculum is a training strategy that begins with an easy task and gradually increases the task difficulty to make learning more efficient. In this example, task difficulty is controlled by the initial states of the ego vehicle and by the curvature of the lane.
For more information on PPO agents, see Proximal Policy Optimization (PPO) Agent.
Simulink Model
The training goal for the lane-keeping application is to keep the ego vehicle traveling along the centerline of a lane by adjusting the front steering angle. This example uses the same ego vehicle dynamics and sensor dynamics as the Train DQN Agent for Lane Keeping Assist example.
m = 1575;   % total vehicle mass (kg)
Iz = 2875;  % yaw moment of inertia (mNs^2)
lf = 1.2;   % longitudinal distance from CG to front tires (m)
lr = 1.6;   % longitudinal distance from CG to rear tires (m)
Cf = 19000; % cornering stiffness of front tires (N/rad)
Cr = 33000; % cornering stiffness of rear tires (N/rad)
Vx = 15;    % longitudinal velocity (m/s)
Define the sample time, Ts, and simulation duration, T, in seconds.
Ts = 0.1;
T = 15;
The output of the LKA system is the front steering angle of the ego vehicle. Considering the physical limitations of the ego vehicle, constrain its steering angle to the range [-60,60] degrees. Specify the constraints in radians.
u_min = -1.04;
u_max = 1.04;
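As a quick optional check, you can confirm the relationship between the degree bound and the radian limits used above (deg2rad(60) is approximately 1.0472, which the example rounds to 1.04).

% Optional check: the 60-degree steering bound expressed in radians.
deg2rad(60)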
Define the curvature of the road, rho, as a constant 0.001 (1/m).
rho = 0.001;
Set initial values for the lateral deviation (e1_initial) and the relative yaw angle (e2_initial). During training, these initial conditions are set to random values for each training episode.
e1_initial = 0.2;
e2_initial = -0.1;
Open the model.
mdl = "rlCurriculumLKAMdl";
open_system(mdl)
Define the path to the RL Agent block within the model.
agentblk = mdl + "/RL Agent";
Create Environment
Create a reinforcement learning environment interface for the ego vehicle. To do so, first define the observation and action specifications. These observations and actions are the same as the features for supervised learning used in Imitate MPC Controller for Lane Keeping Assist.
The six observations for the environment are the lateral velocity, the yaw rate, the lateral deviation e1, the relative yaw angle e2, the steering angle from the previous time step, and the road curvature rho.
obsInfo = rlNumericSpec([6 1],...
    LowerLimit=-inf*ones(6,1), ...
    UpperLimit= inf*ones(6,1))
obsInfo = 
  rlNumericSpec with properties:
     LowerLimit: [6x1 double]
     UpperLimit: [6x1 double]
           Name: [0x0 string]
    Description: [0x0 string]
      Dimension: [6 1]
       DataType: "double"
obsInfo.Name = "observations";
The action for the environment is the front steering angle. Specify the steering angle constraints when creating the action specification object.
actInfo = rlNumericSpec([1 1], ...
    LowerLimit=u_min, ...
    UpperLimit=u_max);
actInfo.Name = "steering";
In the model, the Signal Processing for LKA block creates the observation vector signal, computes the reward function, and calculates the stop signal.
In this example, the reward, provided at every time step, consists of three components: a cost term, a penalty for reaching the terminal condition, and an incentive for keeping the lateral error within a threshold. The cost term also depends on the control input from the previous time step. The simulation stops when the lateral deviation becomes too large; the stop signal is computed in the Signal Processing for LKA block.
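The reward itself is computed inside the Signal Processing for LKA block, so its exact expression is not reproduced here. Purely as an illustration of a reward with these three components (a hypothetical sketch, not the block's actual expression; the coefficients and the threshold are placeholders):

function r = rewardSketch(e1,e2,u,isTerminal)
    % Hypothetical reward: a quadratic cost, a terminal penalty, and an
    % incentive for keeping the lateral error small. Placeholder values only.
    cost      = 10*e1^2 + 5*e2^2 + 2*u^2;   % cost term
    penalty   = 10*double(isTerminal);      % penalty at the terminal condition
    incentive = double(abs(e1) < 0.01);     % bonus while |e1| is within the threshold
    r = -cost - penalty + incentive;
end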
Create the reinforcement learning environment.
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);
To define the initial conditions for the lateral deviation and relative yaw angle, specify an environment reset function using an anonymous function handle. The localResetFcn function, which is defined at the end of the example, sets the initial lateral deviation and relative yaw angle to random values that depend on the task number. Each subsequent task is more difficult.
Task 1: Straight road, starting with small errors.
Task 2: Road with small curvature, starting with relatively small errors.
Task 3: Final task, with the full road curvature and larger initial errors.
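To see what the reset function produces for a given task, you can apply it to a Simulink.SimulationInput object and inspect the variables it sets (an optional check, shown here as a sketch):

% Inspect the variables set by the reset function for task 1.
in = Simulink.SimulationInput(mdl);
in = localResetFcn(in,1);
in.Variables   % e1_initial, e2_initial, and rho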
Create PPO Agent
Fix the random generator seed for reproducibility.
rng(0)
Create a PPO agent.
initOptions = rlAgentInitializationOptions("NumHiddenUnit",256);
agentNoCrm = rlPPOAgent(obsInfo, actInfo, initOptions);
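If you want to examine the default actor and critic networks created with 256 hidden units per layer, you can extract them from the agent (an optional step; summary requires Deep Learning Toolbox™):

% View a summary of the default actor and critic networks.
actorNet  = getModel(getActor(agentNoCrm));
criticNet = getModel(getCritic(agentNoCrm));
summary(actorNet)
summary(criticNet)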
Specify the PPO agent options.
agentNoCrm.AgentOptions.MiniBatchSize = 512;
agentNoCrm.AgentOptions.LearningFrequency = 4096;
agentNoCrm.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-4;
agentNoCrm.AgentOptions.ActorOptimizerOptions.GradientThreshold = 5;
agentNoCrm.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-4;
agentNoCrm.AgentOptions.CriticOptimizerOptions.GradientThreshold = 5;
agentNoCrm.AgentOptions.SampleTime = Ts;
Train Agent Without a Curriculum
As a baseline, train the agent for task 3 without a curriculum. Set the reset function with taskNumber = 3.
taskNumber = 3;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
To train the agent, first specify the training options. For this example, use the following options.
Run training for at most 50,000 episodes, with each episode lasting at most 150 time steps.
Display the training progress in the Reinforcement Learning Training Monitor.
Stop training when the evaluation statistic (the mean reward over the evaluation episodes) reaches 2.9. The agent is evaluated over 10 evaluation episodes every 100 training episodes.
For more information, see rlTrainingOptions.
maxepisodes = 50000;
maxsteps = T/Ts;
trainingOpts = rlTrainingOptions(...
    MaxEpisodes=maxepisodes,...
    MaxStepsPerEpisode=maxsteps,...
    Verbose=false,...
    Plots="training-progress",...
    StopTrainingCriteria="EvaluationStatistic",...
    StopTrainingValue=2.9);

% Create an evaluator object.
evl = rlEvaluator( ...
    NumEpisodes=10, ...
    EvaluationFrequency=100, ...
    RandomSeeds=101:110);
Train the agent using the train function. To save time while running this example, only the training result is shown. To train the agent yourself, set doTrainingWithoutCurriculum to true.
doTrainingWithoutCurriculum = false;
if doTrainingWithoutCurriculum
    % Train the agent without a curriculum.
    trainingStats = train(agentNoCrm,env,trainingOpts,Evaluator=evl);
end
Train Agent with a Curriculum
Curriculum learning starts with simpler tasks and gradually increases the complexity to train the agent more efficiently. In this example, the agent is first trained on two simpler tasks (task 1 and task 2) before moving on to the final task.
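The tasks are trained step by step in the sections below so that you can adjust the agent and training options between them. Conceptually, the workflow amounts to the following loop (a compact sketch using the agent created in the next section and the environment, training options, and evaluator defined above; it is not code that the example runs):

% Curriculum sketch: train the same agent on tasks of increasing difficulty.
taskList    = [1 2 3];
maxEpisodes = [3000 3000 50000];
stopValues  = [2.5 2.5 2.9];
learnFreqs  = [1024 2048 4096];
learnRates  = [1e-3 1e-3 5e-4];
for k = 1:numel(taskList)
    env.ResetFcn = @(in) localResetFcn(in, taskList(k));
    trainingOpts.MaxEpisodes       = maxEpisodes(k);
    trainingOpts.StopTrainingValue = stopValues(k);
    agent.AgentOptions.LearningFrequency = learnFreqs(k);
    agent.AgentOptions.ActorOptimizerOptions.LearnRate  = learnRates(k);
    agent.AgentOptions.CriticOptimizerOptions.LearnRate = learnRates(k);
    results{k} = train(agent,env,trainingOpts,Evaluator=evl);
end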
Create PPO Agent
Fix the random generator seed for reproducibility.
rng(0)
Create a PPO agent.
initOptions = rlAgentInitializationOptions("NumHiddenUnit",256);
agent = rlPPOAgent(obsInfo, actInfo, initOptions);
agent.AgentOptions.SampleTime = Ts;
agent.AgentOptions.MiniBatchSize = 512;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 5;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 5;
Train the agent using the train function. To save time while running this example, the training is skipped. To train the agent yourself, set doTraining to true.
doTraining = false;
Task 1: Straight road, starting with small errors.
Set the task to 1, the max number of episodes to 3000, and the stop training value to 2.5.
taskNumber = 1;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
trainingOpts.MaxEpisodes = 3000;
trainingOpts.StopTrainingValue = 2.5;
Specify the PPO agent options for task 1. A larger LearnRate is used in the initial task to update the agent quickly. A larger LearnRate can cause convergence issues; however, the goal is to train the agent on the final task, so fine-tuning the agent on the initial task is not necessary. To update the agent more often in the first task, set LearningFrequency to 1024.
agent.AgentOptions.LearningFrequency = 1024;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
Train the PPO agent.
if doTraining
    trainingStats1 = train(agent,env,trainingOpts,Evaluator=evl);
end
Task 2: Road with small curvature, starting with relatively small errors.
Set the task to 2, the max number of episodes to 3000, and the stop training value to 2.5.
taskNumber = 2;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
trainingOpts.MaxEpisodes = 3000;
trainingOpts.StopTrainingValue = 2.5;
Specify the PPO agent options. To collect more trajectories before each update, set LearningFrequency to 2048.
agent.AgentOptions.LearningFrequency = 2048;
Train the PPO agent.
if doTraining
    trainingStats2 = train(agent,env,trainingOpts,Evaluator=evl);
end
Task 3: Final task
Set the task to 3, the maximum number of episodes to 50,000, and the stop training value to 2.9.
taskNumber = 3;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
trainingOpts.MaxEpisodes = 50000;
trainingOpts.StopTrainingValue = 2.9;
Specify the PPO agent options to match those used for the agent trained without a curriculum.
agent.AgentOptions.LearningFrequency = 4096;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-4;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-4;
Train the PPO agent.
if doTraining
    trainingStats3 = train(agent,env,trainingOpts,Evaluator=evl);
else
    load("LKAPPOAgent.mat")
end
The PPO agent with a curriculum met the criteria after 5000 episodes in total (300 episodes in task 1, 2700 episodes in task 2, and 2000 episodes in task 3). In contrast, the agent trained without a curriculum needed 8300 episodes to reach the criteria. The total training time was shorter with curriculum learning compared to the non-curriculum approach.
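If you run all four training sessions yourself, you can confirm these episode counts from the returned result objects (a sketch that assumes both doTrainingWithoutCurriculum and doTraining were set to true):

% Compare the total number of training episodes with and without a curriculum.
episodesWithCurriculum    = numel(trainingStats1.EpisodeIndex) + ...
                            numel(trainingStats2.EpisodeIndex) + ...
                            numel(trainingStats3.EpisodeIndex);
episodesWithoutCurriculum = numel(trainingStats.EpisodeIndex);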
Simulate PPO Agent
To validate the performance of the trained agent, simulate it within the environment. Because exploration is not necessary during simulation, set UseExplorationPolicy to false. For more information on agent simulation, see rlSimulationOptions and sim.
agent.UseExplorationPolicy = false;
simOptions = rlSimulationOptions(MaxSteps=maxsteps);

% Set the task number to 4 to evaluate the agent with a specific initial
% condition. The curvature is the same as in task 3.
taskNumber = 4;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
experience = sim(env,agent,simOptions);
As shown below, the lateral error (middle plot) and relative yaw angle (bottom plot) are both driven to zero. The vehicle starts with a lateral deviation from the centerline (0.2 m) and a nonzero yaw angle error (-0.1 rad). The lane-keeping controller makes the ego vehicle travel along the centerline after around two seconds. The steering angle shows that the controller reaches a steady state within the same time frame.
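Alternatively, you can plot the simulation results directly from the logged experience output instead of using the scopes in the model. The following sketch assumes the observation ordering listed earlier, with the lateral deviation as the third element and the relative yaw angle as the fourth:

% Plot the steering angle, lateral deviation, and relative yaw angle.
obs = experience.Observation.observations;  % timeseries of 6-by-1 observations
act = experience.Action.steering;           % timeseries of steering commands
figure
subplot(3,1,1)
plot(act.Time, squeeze(act.Data)), ylabel("steering (rad)")
subplot(3,1,2)
plot(obs.Time, squeeze(obs.Data(3,1,:))), ylabel("e1 (m)")
subplot(3,1,3)
plot(obs.Time, squeeze(obs.Data(4,1,:))), ylabel("e2 (rad)"), xlabel("time (s)")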
Local Functions
function in = localResetFcn(in, taskNumber)
switch taskNumber
    case 1
        % Task 1: Straight road,
        % starting with small errors.
        e1_initial = 0.05*(-1+2*rand);
        e2_initial = 0.01*(-1+2*rand);
        rho = 0;
    case 2
        % Task 2: Road with small curvature,
        % starting with relatively small errors.
        e1_initial = 0.25*(-1+2*rand);
        e2_initial = 0.05*(-1+2*rand);
        rho = 0.0005;
    case 3
        % Task 3: Final task.
        e1_initial = 0.5*(-1+2*rand);
        e2_initial = 0.1*(-1+2*rand);
        rho = 0.001;
    case 4
        % Fixed initial condition for the final simulation.
        e1_initial = 0.2;
        e2_initial = -0.1;
        rho = 0.001;
end
% Set random value for lateral deviation.
in = setVariable(in,"e1_initial", e1_initial);
% Set random value for relative yaw angle.
in = setVariable(in,"e2_initial", e2_initial);
% Set curvature.
in = setVariable(in,"rho", rho);
end
See Also
Functions
train | sim | rlSimulinkEnv
Objects
rlPPOAgent | rlPPOAgentOptions | rlAgentInitializationOptions | rlTrainingOptions | rlOptimizerOptions | rlSimulationOptions
Blocks
RL Agent
Related Examples
- Train DQN Agent for Lane Keeping Assist
- Train DDPG Agent for Path-Following Control
- Lane Keeping Assist System Using Model Predictive Control (Model Predictive Control Toolbox)