Train PPO Agent with Curriculum Learning for a Lane Keeping Application
This example shows how to use a curriculum to train a Proximal Policy Optimization (PPO) agent for Lane Keeping Assist (LKA) in Simulink®. A curriculum is a training strategy that begins with an easy task and gradually increases the task difficulty to make learning more efficient. In this example, task difficulty is controlled by the initial states of the ego vehicle and by the curvature of the lane.
For more information on PPO agents, see Proximal Policy Optimization (PPO) Agent.
Simulink Model
The training goal for the lane-keeping application is to keep the ego vehicle traveling along the centerline of a lane by adjusting the front steering angle. This example uses the same ego vehicle dynamics and sensor dynamics as the Train DQN Agent for Lane Keeping Assist example.
m = 1575;   % total vehicle mass (kg)
Iz = 2875;  % yaw moment of inertia (mNs^2)
lf = 1.2;   % longitudinal distance from CG to front tires (m)
lr = 1.6;   % longitudinal distance from CG to rear tires (m)
Cf = 19000; % cornering stiffness of front tires (N/rad)
Cr = 33000; % cornering stiffness of rear tires (N/rad)
Vx = 15;    % longitudinal velocity (m/s)
Define the sample time, Ts, and simulation duration, T, in seconds.
Ts = 0.1;
T = 15;
The output of the LKA system is the front steering angle of the ego vehicle. Considering the physical limitations of the ego vehicle, constrain its steering angle to the range [-60,60] degrees. Specify the constraints in radians.
u_min = -1.04;
u_max = 1.04;
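As a quick optional check, you can confirm the relationship between the degree bound and the radian limits used above (deg2rad(60) is approximately 1.0472, which the example rounds to 1.04).

% Optional check: the 60-degree steering bound expressed in radians.
deg2rad(60)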
Define the curvature of the road, rho, as a constant 0.001 (1/m).
rho = 0.001;
Set initial values for the lateral deviation (e1_initial) and the relative yaw angle (e2_initial). During training, these initial conditions are set to random values for each training episode.
e1_initial = 0.2;
e2_initial = -0.1;
Open the model.
mdl = "rlCurriculumLKAMdl";
open_system(mdl)
Define the path to the RL Agent block within the model.
agentblk = mdl + "/RL Agent";
Create Environment
Create a reinforcement learning environment interface for the ego vehicle. To do so, first define the observation and action specifications. These observations and actions are the same as the features for supervised learning used in Imitate MPC Controller for Lane Keeping Assist.
The six observations for the environment are the lateral velocity, the yaw rate, the lateral deviation e1, the relative yaw angle e2, the steering angle from the previous time step, and the road curvature rho.
obsInfo = rlNumericSpec([6 1],...
    LowerLimit=-inf*ones(6,1), ...
    UpperLimit= inf*ones(6,1))
obsInfo = 
  rlNumericSpec with properties:
     LowerLimit: [6x1 double]
     UpperLimit: [6x1 double]
           Name: [0x0 string]
    Description: [0x0 string]
      Dimension: [6 1]
       DataType: "double"
obsInfo.Name = "observations";
The action for the environment is the front steering angle. Specify the steering angle constraints when creating the action specification object.
actInfo = rlNumericSpec([1 1], ...
    LowerLimit=u_min, ...
    UpperLimit=u_max);
actInfo.Name = "steering";
In the model, the Signal Processing for LKA block creates the observation vector signal, computes the reward function, and calculates the stop signal.
In this example, the reward, provided at every time step, consists of three components: a cost term, a penalty for reaching the terminal condition, and an incentive for keeping the lateral error within a threshold. The cost term also depends on the control input from the previous time step. The simulation stops when the lateral deviation becomes too large; the stop signal is computed in the Signal Processing for LKA block.
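The reward itself is computed inside the Signal Processing for LKA block, so its exact expression is not reproduced here. Purely as an illustration of a reward with these three components (a hypothetical sketch, not the block's actual expression; the coefficients and the threshold are placeholders):

function r = rewardSketch(e1,e2,u,isTerminal)
    % Hypothetical reward: a quadratic cost, a terminal penalty, and an
    % incentive for keeping the lateral error small. Placeholder values only.
    cost      = 10*e1^2 + 5*e2^2 + 2*u^2;   % cost term
    penalty   = 10*double(isTerminal);      % penalty at the terminal condition
    incentive = double(abs(e1) < 0.01);     % bonus while |e1| is within the threshold
    r = -cost - penalty + incentive;
end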
Create the reinforcement learning environment.
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);
To define the initial conditions for the lateral deviation and relative yaw angle, specify an environment reset function using an anonymous function handle. The localResetFcn function, which is defined at the end of the example, sets the initial lateral deviation and relative yaw angle to random values that depend on the task number. Each subsequent task is more difficult.
Task 1: Straight road, starting with small errors.
Task 2: Road with small curvature, starting with relatively small errors.
Task 3: Final task, with the full road curvature and larger initial errors.
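To see what the reset function produces for a given task, you can apply it to a Simulink.SimulationInput object and inspect the variables it sets (an optional check, shown here as a sketch):

% Inspect the variables set by the reset function for task 1.
in = Simulink.SimulationInput(mdl);
in = localResetFcn(in,1);
in.Variables   % e1_initial, e2_initial, and rho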
Create PPO Agent
Fix the random generator seed for reproducibility.
rng(0)
Create a PPO agent.
initOptions = rlAgentInitializationOptions("NumHiddenUnit",256);
agentNoCrm = rlPPOAgent(obsInfo, actInfo, initOptions);
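If you want to examine the default actor and critic networks created with 256 hidden units per layer, you can extract them from the agent (an optional step; summary requires Deep Learning Toolbox™):

% View a summary of the default actor and critic networks.
actorNet  = getModel(getActor(agentNoCrm));
criticNet = getModel(getCritic(agentNoCrm));
summary(actorNet)
summary(criticNet)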
Specify the PPO agent options.
agentNoCrm.AgentOptions.MiniBatchSize = 512;
agentNoCrm.AgentOptions.LearningFrequency = 4096;
agentNoCrm.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-4;
agentNoCrm.AgentOptions.ActorOptimizerOptions.GradientThreshold = 5;
agentNoCrm.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-4;
agentNoCrm.AgentOptions.CriticOptimizerOptions.GradientThreshold = 5;
agentNoCrm.AgentOptions.SampleTime = Ts;
Train Agent Without a Curriculum
As a baseline, train the agent for task 3 without a curriculum. Set the reset function with taskNumber = 3.
taskNumber = 3;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
To train the agent, first specify the training options. For this example, use the following options.
Run training for at most 50,000 episodes, with each episode lasting at most 150 time steps.
Display the training progress in the Reinforcement Learning Training Monitor.
Stop training when the evaluation statistic (the mean reward over the evaluation episodes) reaches 2.9. The agent is evaluated over 10 evaluation episodes every 100 training episodes.
For more information, see rlTrainingOptions.
maxepisodes = 50000;
maxsteps = T/Ts;
trainingOpts = rlTrainingOptions(...
    MaxEpisodes=maxepisodes,...
    MaxStepsPerEpisode=maxsteps,...
    Verbose=false,...
    Plots="training-progress",...
    StopTrainingCriteria="EvaluationStatistic",...
    StopTrainingValue=2.9);

% Create an evaluator object.
evl = rlEvaluator( ...
    NumEpisodes=10, ...
    EvaluationFrequency=100, ...
    RandomSeeds=101:110);
Train the agent using the train function. To save time while running this example, only the training result is shown. To train the agent yourself, set doTrainingWithoutCurriculum to true.
doTrainingWithoutCurriculum = false;
if doTrainingWithoutCurriculum
    % Train the agent without a curriculum.
    trainingStats = train(agentNoCrm,env,trainingOpts,Evaluator=evl);
end
Train Agent with a Curriculum
Curriculum learning starts with simpler tasks and gradually increases the complexity to train the agent more efficiently. In this example, the agent is first trained on two simpler tasks (task 1 and task 2) before moving on to the final task.
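The tasks are trained step by step in the sections below so that you can adjust the agent and training options between them. Conceptually, the workflow amounts to the following loop (a compact sketch using the agent created in the next section and the environment, training options, and evaluator defined above; it is not code that the example runs):

% Curriculum sketch: train the same agent on tasks of increasing difficulty.
taskList    = [1 2 3];
maxEpisodes = [3000 3000 50000];
stopValues  = [2.5 2.5 2.9];
learnFreqs  = [1024 2048 4096];
learnRates  = [1e-3 1e-3 5e-4];
for k = 1:numel(taskList)
    env.ResetFcn = @(in) localResetFcn(in, taskList(k));
    trainingOpts.MaxEpisodes       = maxEpisodes(k);
    trainingOpts.StopTrainingValue = stopValues(k);
    agent.AgentOptions.LearningFrequency = learnFreqs(k);
    agent.AgentOptions.ActorOptimizerOptions.LearnRate  = learnRates(k);
    agent.AgentOptions.CriticOptimizerOptions.LearnRate = learnRates(k);
    results{k} = train(agent,env,trainingOpts,Evaluator=evl);
end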
Create PPO Agent
Fix the random generator seed for reproducibility.
rng(0)
Create a PPO agent.
initOptions = rlAgentInitializationOptions("NumHiddenUnit",256);
agent = rlPPOAgent(obsInfo, actInfo, initOptions);
agent.AgentOptions.SampleTime = Ts;
agent.AgentOptions.MiniBatchSize = 512;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 5;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 5;
Train the agent using the train function. To save time while running this example, the training is skipped. To train the agent yourself, set doTraining to true.
doTraining = false;
Task 1: Straight road, starting with small errors.
Set the task to 1, the max number of episodes to 3000, and the stop training value to 2.5.
taskNumber = 1;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
trainingOpts.MaxEpisodes = 3000;
trainingOpts.StopTrainingValue = 2.5;
Specify the PPO agent options for task 1. A larger LearnRate is used in the initial task to update the agent quickly. A larger LearnRate can cause convergence issues; however, the goal is to train the agent on the final task, so fine-tuning the agent on the initial task is not necessary. To update the agent more often in the first task, set LearningFrequency to 1024.
agent.AgentOptions.LearningFrequency = 1024;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-3;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-3;
Train the PPO agent.
if doTraining
    trainingStats1 = train(agent,env,trainingOpts,Evaluator=evl);
end
Task 2: Road with small curvature, starting with relatively small errors.
Set the task to 2, the max number of episodes to 3000, and the stop training value to 2.5.
taskNumber = 2;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
trainingOpts.MaxEpisodes = 3000;
trainingOpts.StopTrainingValue = 2.5;
Specify the PPO agent options. To collect more trajectories before each update, set LearningFrequency to 2048.
agent.AgentOptions.LearningFrequency = 2048;
Train the PPO agent.
if doTraining
    trainingStats2 = train(agent,env,trainingOpts,Evaluator=evl);
end
Task 3: Final task
Set the task to 3, the maximum number of episodes to 50,000, and the stop training value to 2.9.
taskNumber = 3;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
trainingOpts.MaxEpisodes = 50000;
trainingOpts.StopTrainingValue = 2.9;
Specify the PPO agent options to match those used for the agent trained without a curriculum.
agent.AgentOptions.LearningFrequency = 4096;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 5e-4;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-4;
Train the PPO agent.
if doTraining
    trainingStats3 = train(agent,env,trainingOpts,Evaluator=evl);
else
    load("LKAPPOAgent.mat")
end
The PPO agent with a curriculum met the criteria after 5000 episodes in total (300 episodes in task 1, 2700 episodes in task 2, and 2000 episodes in task 3). In contrast, the agent trained without a curriculum needed 8300 episodes to reach the criteria. The total training time was shorter with curriculum learning compared to the non-curriculum approach.
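If you run all four training sessions yourself, you can confirm these episode counts from the returned result objects (a sketch that assumes both doTrainingWithoutCurriculum and doTraining were set to true):

% Compare the total number of training episodes with and without a curriculum.
episodesWithCurriculum    = numel(trainingStats1.EpisodeIndex) + ...
                            numel(trainingStats2.EpisodeIndex) + ...
                            numel(trainingStats3.EpisodeIndex);
episodesWithoutCurriculum = numel(trainingStats.EpisodeIndex);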
Simulate PPO Agent
To validate the performance of the trained agent, simulate it within the environment. Because exploration is not necessary during simulation, set UseExplorationPolicy to false. For more information on agent simulation, see rlSimulationOptions and sim.
agent.UseExplorationPolicy = false;
simOptions = rlSimulationOptions(MaxSteps=maxsteps);

% Set the task number to 4 to evaluate the agent with a specific initial
% condition. The curvature is the same as in task 3.
taskNumber = 4;
env.ResetFcn = @(in)localResetFcn(in, taskNumber);
experience = sim(env,agent,simOptions);
As shown below, the lateral error (middle plot) and relative yaw angle (bottom plot) are both driven to zero. The vehicle starts with a lateral deviation from the centerline (0.2 m) and a nonzero yaw angle error (-0.1 rad). The lane-keeping controller makes the ego vehicle travel along the centerline after around two seconds. The steering angle shows that the controller reaches a steady state within the same time frame.
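Alternatively, you can plot the simulation results directly from the logged experience output instead of using the scopes in the model. The following sketch assumes the observation ordering listed earlier, with the lateral deviation as the third element and the relative yaw angle as the fourth:

% Plot the steering angle, lateral deviation, and relative yaw angle.
obs = experience.Observation.observations;  % timeseries of 6-by-1 observations
act = experience.Action.steering;           % timeseries of steering commands
figure
subplot(3,1,1)
plot(act.Time, squeeze(act.Data)), ylabel("steering (rad)")
subplot(3,1,2)
plot(obs.Time, squeeze(obs.Data(3,1,:))), ylabel("e1 (m)")
subplot(3,1,3)
plot(obs.Time, squeeze(obs.Data(4,1,:))), ylabel("e2 (rad)"), xlabel("time (s)")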
Local Functions
function in = localResetFcn(in, taskNumber)
switch taskNumber
    case 1
        % Task 1: Straight road,
        % starting with small errors.
        e1_initial = 0.05*(-1+2*rand);
        e2_initial = 0.01*(-1+2*rand);
        rho = 0;
    case 2
        % Task 2: Road with small curvature,
        % starting with relatively small errors.
        e1_initial = 0.25*(-1+2*rand);
        e2_initial = 0.05*(-1+2*rand);
        rho = 0.0005;
    case 3
        % Task 3: Final task.
        e1_initial = 0.5*(-1+2*rand);
        e2_initial = 0.1*(-1+2*rand);
        rho = 0.001;
    case 4
        % Fixed initial condition for the final simulation.
        e1_initial = 0.2;
        e2_initial = -0.1;
        rho = 0.001;
end
% Set random value for lateral deviation.
in = setVariable(in,"e1_initial", e1_initial);
% Set random value for relative yaw angle.
in = setVariable(in,"e2_initial", e2_initial);
% Set curvature.
in = setVariable(in,"rho", rho);
end
See Also
Functions
train | sim | rlSimulinkEnv
Objects
rlPPOAgent | rlPPOAgentOptions | rlAgentInitializationOptions | rlTrainingOptions | rlOptimizerOptions | rlSimulationOptions
Blocks
RL Agent
Related Examples
- Train DQN Agent for Lane Keeping Assist
- Train DDPG Agent for Path-Following Control
- Lane Keeping Assist System Using Model Predictive Control (Model Predictive Control Toolbox)