# Train PPO Agent to Land Rocket

This example shows how to train a proximal policy optimization (PPO) agent with a discrete action space to land a rocket on the ground. For more information on PPO agents, see Proximal Policy Optimization Agents.

### Environment

The environment in this example is a 3-DOF rocket represented by a circular disc with mass. The rocket has two thrusters for forward and rotational motion. Gravity acts vertically downwards, and there are no aerodynamic drag forces. The training goal is to make the rocket land on the ground at a specified location.

For this environment:

• Motion of the rocket is bounded in X (horizontal axis) from -100 to 100 meters and Y (vertical axis) from 0 to 120 meters.

• The goal position is at (0,0) meters and the goal orientation is 0 radians.

• The sample time is 0.1 seconds.

• The observations from the environment are the rocket position $(x, y)$, orientation $\theta$, velocity $(\dot{x}, \dot{y})$, angular velocity $\dot{\theta}$, and a sensor reading that detects a rough landing (-1), soft landing (1), or airborne (0) condition. The observations are normalized between -1 and 1.

• At the beginning of every episode, the rocket starts from a random initial $\mathit{x}$ position and orientation. The altitude is always reset to 100 meters.

• The reward $r_t$ provided at time step $t$ is as follows.

$$
\begin{aligned}
\hat{d}_t &= \sqrt{x_t^2 + y_t^2}\,/\,d_{\max}\\
\hat{v}_t &= \sqrt{\dot{x}_t^2 + \dot{y}_t^2}\,/\,v_{\max}\\
r_1 &= 1 - \left(\frac{\sqrt{\hat{d}_t} + \sqrt{\hat{v}_t}}{2}\right)\\
r_2 &= \frac{1}{2}\,e^{-20\theta_t^2}\\
r_3 &= 1 - \left(\frac{L_t + R_t}{20}\right)\\
r_4 &= 10000\left[\left(y_t \le 0\right) \;\&\&\; \left(\dot{y}_t \ge -0.5 \;\&\&\; |\dot{x}_t| \le 0.5\right)\right]\\
r_t &= r_1 + r_2 + r_3 + r_4
\end{aligned}
$$

Here:

• $x_t$, $y_t$, $\dot{x}_t$, and $\dot{y}_t$ are the positions and velocities of the rocket along the x and y axes.

• $\hat{d}_t$ is the normalized distance of the rocket from the goal position.

• $\hat{v}_t$ is the normalized speed of the rocket.

• $d_{\max}$ and $v_{\max}$ are the maximum distance and speed within the state space.

• $\theta_t$ is the orientation with respect to the vertical axis.

• $L_t$ and $R_t$ are the action values for the left and right thrusters.

• $r_1$ is a reward for minimizing distance and speed simultaneously.

• $r_2$ is a reward for minimizing the orientation of the rocket.

• $r_3$ is a reward for minimizing control effort.

• $r_4$ is a sparse reward for a soft landing, awarded when the rocket touches down with vertical velocity greater than -0.5 m/s and horizontal speed less than 0.5 m/s.
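The reward terms above can be sketched as a MATLAB function. This is an illustrative reimplementation, not the `RocketLander` class's own code; the values of `dMax` and `vMax` are assumptions chosen from the stated position bounds.

```
function r = rocketReward(x,y,xdot,ydot,theta,L,R)
% Illustrative reward computation following the equations above.
dMax = sqrt(100^2 + 120^2); % assumed max distance, from the X/Y bounds
vMax = 60;                  % assumed max speed (illustrative value)

dHat = sqrt(x^2 + y^2)/dMax;       % normalized distance from goal
vHat = sqrt(xdot^2 + ydot^2)/vMax; % normalized speed

r1 = 1 - (sqrt(dHat) + sqrt(vHat))/2;  % distance and speed shaping
r2 = 0.5*exp(-20*theta^2);             % orientation shaping
r3 = 1 - (L + R)/20;                   % control effort penalty
r4 = 10000*((y <= 0) && (ydot >= -0.5 && abs(xdot) <= 0.5)); % soft landing
r  = r1 + r2 + r3 + r4;
end
```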

### Create MATLAB Environment

Create a MATLAB environment for the rocket lander using the `RocketLander` class.

`env = RocketLander;`

Obtain the observation and action information from the environment.

```
actionInfo = getActionInfo(env);
observationInfo = getObservationInfo(env);
numObs = observationInfo.Dimension(1);
numAct = numel(actionInfo.Elements);
```

Fix the random generator seed for reproducibility.

`rng(0)`

### Create PPO Agent

The PPO agent in this example operates on a discrete action space. At every time step, the agent selects one of the following discrete action pairs.

• L,L - do nothing

• L,M - fire right (med)

• L,H - fire right (high)

• M,L - fire left (med)

• M,M - fire left (med) + right (med)

• M,H - fire left (med) + right (high)

• H,L - fire left (high)

• H,M - fire left (high) + right (med)

• H,H - fire left (high) + right (high)

Here, $L = 0.0$, $M = 0.5$, and $H = 1.0$ are normalized thrust values for each thruster.
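The nine action pairs can be enumerated as rows of a matrix, each row holding the normalized `[left right]` thrust values. The ordering shown here is illustrative; the actual ordering is defined by the `RocketLander` class.

```
% Normalized thrust levels
L = 0.0; M = 0.5; H = 1.0;

% Each row is one discrete action pair [left right] (illustrative ordering).
actionPairs = [ ...
    L L;   % do nothing
    L M;   % fire right (med)
    L H;   % fire right (high)
    M L;   % fire left (med)
    M M;   % fire left (med) + right (med)
    M H;   % fire left (med) + right (high)
    H L;   % fire left (high)
    H M;   % fire left (high) + right (med)
    H H];  % fire left (high) + right (high)
```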

To estimate the policy and value function, the agent maintains function approximators for the actor and critic, which are modeled using deep neural networks. The training can be sensitive to the initial network weights and biases, and results can vary with different sets of values. In this example, the network weights are randomly initialized to small values. To load a set of predefined values instead, set `predefinedWeightsandBiases` to `true`.

```
criticLayerSizes = [200 100];
actorLayerSizes = [200 100];

predefinedWeightsandBiases = false;
if predefinedWeightsandBiases
    load('PredefinedWeightsAndBiases.mat');
else
    createNetworkWeights;
end
```

Create the critic deep neural network with six inputs and one output. The output of the critic network is the estimate of the discounted long-term reward for the input observations.

```
criticNetwork = [
    imageInputLayer([numObs 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(criticLayerSizes(1),'Name','CriticFC1', ...
        'Weights',weights.criticFC1, ...
        'Bias',bias.criticFC1)
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(criticLayerSizes(2),'Name','CriticFC2', ...
        'Weights',weights.criticFC2, ...
        'Bias',bias.criticFC2)
    reluLayer('Name','CriticRelu2')
    fullyConnectedLayer(1,'Name','CriticOutput', ...
        'Weights',weights.criticOut, ...
        'Bias',bias.criticOut)];
```

Create the critic representation.

```
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
critic = rlValueRepresentation(criticNetwork,env.getObservationInfo, ...
    'Observation',{'observation'},criticOpts);
```

Create the actor using a deep neural network with six inputs and one output. The output of the actor network is the probability of taking each of the nine possible action pairs. Each action pair contains normalized action values for each thruster. The environment `step` function scales these values to determine the actual thrust values.

```
actorNetwork = [
    imageInputLayer([numObs 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(actorLayerSizes(1),'Name','ActorFC1', ...
        'Weights',weights.actorFC1, ...
        'Bias',bias.actorFC1)
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(actorLayerSizes(2),'Name','ActorFC2', ...
        'Weights',weights.actorFC2, ...
        'Bias',bias.actorFC2)
    reluLayer('Name','ActorRelu2')
    fullyConnectedLayer(numAct,'Name','Action', ...
        'Weights',weights.actorOut, ...
        'Bias',bias.actorOut)
    softmaxLayer('Name','actionProbability')];
```

Create the actor using a stochastic actor representation.

```
actorOptions = rlRepresentationOptions('LearnRate',1e-3);
actor = rlStochasticActorRepresentation(actorNetwork,env.getObservationInfo,env.getActionInfo, ...
    'Observation',{'observation'},actorOptions);
```

The PPO agent generates experiences for `ExperienceHorizon` number of steps or until the end of the episode. Then it trains on those experiences using mini-batches for a specified number of epochs.

• Set the `ExperienceHorizon` to 512 steps, which is slightly less than half the total simulation duration of 1200 steps.

• Use the `"gae"` method, which computes the advantage function based on the smoothed discounted sum of temporal differences.

• To improve training stability, use an objective function clip factor of 0.2.

• Discount factors that are close to 1 encourage long term rewards. Use a discount factor of 0.9995, which has a half-life of log(0.5)/log(0.9995) ≈ 1386 steps.
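The half-life quoted above follows directly from the discount factor: it is the number of steps after which a reward's discounted weight falls to one half.

```
gamma = 0.9995;
halfLife = log(0.5)/log(gamma); % approximately 1386 steps
% At a sample time of 0.1 s, this corresponds to roughly 139 seconds.
```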

Specify the agent options using `rlPPOAgentOptions`.

```
opt = rlPPOAgentOptions('ExperienceHorizon',512, ...
    'ClipFactor',0.2, ...
    'EntropyLossWeight',0.02, ...
    'MiniBatchSize',64, ...
    'NumEpoch',3, ...
    'AdvantageEstimateMethod','gae', ...
    'GAEFactor',0.95, ...
    'SampleTime',env.Ts, ...
    'DiscountFactor',0.9995);
```

Create the PPO agent.

`agent = rlPPOAgent(actor,critic,opt);`

### Train Agent

To train the PPO agent, specify the following training options.

• Run the training for at most 20000 episodes, with each episode lasting at most 1200 time steps.

• Stop the training when the average reward over 100 consecutive episodes is 10000. This value indicates that the agent is consistently receiving the reward for soft landing.

• Save a copy of the agent for each episode where the episode reward is 11000 or more.

```
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes',20000, ...
    'MaxStepsPerEpisode',1200, ...
    'Verbose',false, ...
    'Plots','training-progress', ...
    'StopTrainingCriteria','AverageReward', ...
    'StopTrainingValue',10000, ...
    'ScoreAveragingWindowLength',100, ...
    'SaveAgentCriteria',"EpisodeReward", ...
    'SaveAgentValue',11000);
```

Plot the rocket lander environment to visualize the training or simulation.

`plot(env)`

Train the agent using the `train` function. Due to the complexity of the environment, training is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting `doTraining` to `false`.

```
doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load pretrained parameters for the example.
    load('RocketLanderPPOAgentParams.mat');
    loadPretrainedParams(agent,actorParams,criticParams)
end
```
```
ans = 
  rlPPOAgent with properties:

    AgentOptions: [1x1 rl.option.rlPPOAgentOptions]
```

### Simulate

Simulate the trained agent within the environment. For more information on agent simulation, see `rlSimulationOptions` and `sim`.

```
Tf = 120; % Total simulation length (seconds)
simOptions = rlSimulationOptions('MaxSteps',ceil(Tf/env.Ts));
experience = sim(env,agent,simOptions);
```

### Local Function

Function to update the agent with pretrained parameters.

```
function agent = loadPretrainedParams(agent,actorParams,criticParams)
% Set actor parameters.
actor = getActor(agent);
pretrainedActor = setLearnableParameters(actor,actorParams);

% Set critic parameters.
critic = getCritic(agent);
pretrainedCritic = setLearnableParameters(critic,criticParams);

% Set actor and critic in agent.
agent = setActor(agent,pretrainedActor);
agent = setCritic(agent,pretrainedCritic);
end
```