How to train model-based reinforcement learning agents
Create and train model-based policy optimization (MBPO) agents. An MBPO agent uses neural networks to internally approximate the environment. This reusable internal model gives the agent greater sample efficiency than a typical model-free agent.
Published: 6 Oct 2022
In this video, we'll demonstrate how to apply model-based reinforcement learning with Reinforcement Learning Toolbox. We will use the MBPO, or model-based policy optimization, agent, introduced in R2022a, to balance a Cart-Pole system. Unlike model-free methods, model-based reinforcement learning uses transition models as a part of the training algorithm.
For example, in addition to training the policy, the MBPO agent trains a neural network transition model of the environment using experiences collected from interactions with the real environment and stored in a memory buffer. The MBPO agent can also optionally train neural network models of the environment reward and isDone signal if they are not known before training.
The trained models are then used to generate additional simulated experiences. The generated experiences are used by a base agent, for example, soft actor-critic (SAC), along with the experiences collected from the real environment to train the policy. Compared to model-free agents, the MBPO agent can be more sample-efficient because the generated data reduces the number of interactions needed between the agent and the real environment.
Now let's see how we can use the MBPO agent to balance the Cart-Pole system. First, we define a Cart-Pole environment. Then we construct an MBPO agent. This part consists of six steps.
First, we create a base agent. You can use DDPG, TD3, or SAC for continuous actions and DQN for discrete actions. We use an SAC agent as the base agent in this example. Then, we need to construct a neural network transition model of the Cart-Pole system.
The transition model predicts the next observation given the current observation and the action. To deal with model uncertainty, the MBPO agent allows you to specify and train multiple transition models. These can be either deterministic or stochastic. In this example, we use three deterministic transition models, but you can use more.
Let's look at how to create a transition model. First, we create a neural network with the desired architecture. The inputs for the deterministic transition network are the current observation and action. The output is the next observation.
After the network is created, we define our deterministic transition function using rlContinuousDeterministicTransitionFunction. We repeat the same steps for two more transition functions. Note that the neural network architecture does not need to be the same across the transition functions.
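To make the idea of several deterministic transition models concrete, here is a minimal Python sketch. It is not the Reinforcement Learning Toolbox API: the toy linear models stand in for trained neural networks, and `make_linear_model`, `gain`, and `predict_next_observation` are hypothetical names. Choosing one ensemble member uniformly at random per prediction is a common approach in the MBPO literature and is assumed here, not taken from the video.

```python
import random


def make_linear_model(gain):
    """Build a toy deterministic transition model.

    The returned function maps (observation, action) to the next observation.
    It is a stand-in for a trained network; `gain` is a hypothetical parameter.
    """
    def model(observation, action):
        return [o + gain * action for o in observation]
    return model


# Three deterministic models, as in the video. They do not need to be identical.
ensemble = [make_linear_model(g) for g in (0.9, 1.0, 1.1)]


def predict_next_observation(observation, action, rng=random):
    """Pick one ensemble member uniformly at random and return its prediction."""
    model = rng.choice(ensemble)
    return model(observation, action)


rng = random.Random(0)  # seeded only to make the example repeatable
next_obs = predict_next_observation([0.0, 0.0, 0.05, 0.0], action=1.0, rng=rng)
```

Because each model disagrees slightly, sampling among them spreads the generated experiences across the range of plausible dynamics instead of committing to a single model's errors.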
Next, we construct the reward function. You can use the ground-truth reward function if it is available. If you cannot access the ground-truth reward function or the reward is too complex to compute, you can specify a neural network model for the agent to train using rlContinuousDeterministicRewardFunction. In this example, we use the ground-truth reward function. The agent gets more reward when the cart is closer to the original position.
Next, we construct an isDone function. This function predicts the termination signal, given the current observation, action, and the next observation. Similar to the reward, you can use a ground-truth isDone function if it is available. If you cannot access the ground-truth isDone function, you can use rlIsDoneFunction with a neural network that you define.
In this example, isDone is true when the pole is more than 12 degrees from the upright position or the cart moves more than 2.4 meters from the original position. After you have created the transition model, reward model, and isDone model, you can define the overall environment model with rlNeuralNetworkEnvironment.
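The termination rule described above can be sketched as a small Python function. This is an illustration only, not the function used in the video. The two thresholds come from the narration; the observation ordering (cart position, cart velocity, pole angle, pole angular velocity) and the use of radians are assumptions.

```python
import math

# Thresholds stated in the video: 12 degrees from upright, 2.4 m from the origin.
ANGLE_LIMIT_RAD = math.radians(12.0)
POSITION_LIMIT_M = 2.4


def is_done(next_observation):
    """Return True when the episode should terminate.

    Assumes the observation is ordered as
    [cart position (m), cart velocity, pole angle (rad), pole angular velocity].
    """
    cart_position, _, pole_angle, _ = next_observation
    return abs(pole_angle) > ANGLE_LIMIT_RAD or abs(cart_position) > POSITION_LIMIT_M
```

For instance, `is_done([2.5, 0.0, 0.0, 0.0])` evaluates to `True` because the cart is past the 2.4 meter limit, while an upright pole with the cart at the origin does not terminate the episode.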
Now we define the MBPO agent options. In this setting, the MBPO agent trains the environment model at the beginning of each episode using 15 mini-batches from the real replay memory. Using the RealSampleRatio parameter, you can specify the ratio of real and generated experiences in each batch used by the base agent when training the policy.
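As a rough illustration of what a real-sample ratio means for one training batch, the Python sketch below mixes experiences from two buffers. It is not the toolbox implementation: `sample_mixed_batch`, the buffer contents, the batch size of 20, and the ratio of 0.2 are all hypothetical values chosen for the example.

```python
import random


def sample_mixed_batch(real_buffer, generated_buffer, batch_size,
                       real_sample_ratio, rng=random):
    """Draw a batch in which a fixed fraction comes from real experiences.

    real_sample_ratio is a fraction between 0 and 1. Each buffer must hold
    at least as many items as are requested from it.
    """
    num_real = round(batch_size * real_sample_ratio)
    num_generated = batch_size - num_real
    batch = (rng.sample(real_buffer, num_real)
             + rng.sample(generated_buffer, num_generated))
    rng.shuffle(batch)  # avoid ordering the batch by data source
    return batch


# Placeholder experiences tagged by origin, for illustration only.
real = [("real", i) for i in range(100)]
generated = [("generated", i) for i in range(1000)]
batch = sample_mixed_batch(real, generated, batch_size=20,
                           real_sample_ratio=0.2, rng=random.Random(0))
```

With these values the batch holds 4 real and 16 generated experiences. A lower ratio leans more heavily on the learned model, which helps sample efficiency only as long as the model is accurate.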
We use these settings to generate 2,000 trajectories at the beginning of each episode using the environment model. We also increase the generated trajectory length every 100 epochs as our environment model becomes more accurate. Finally, we construct an MBPO agent by specifying the base agent, the environment model, and the MBPO agent options.
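The idea of lengthening generated trajectories as training progresses can be sketched as a simple step schedule. Only the 100-epoch update interval comes from the video; the function name, the initial length of 1, the increment of 1, and the cap of 20 are assumptions made for illustration.

```python
def rollout_horizon(epoch, initial_horizon=1, update_every=100,
                    increment=1, max_horizon=20):
    """Step schedule: grow the horizon by `increment` every `update_every` epochs.

    The horizon is capped at `max_horizon` so rollouts never outgrow what the
    learned model can predict reliably.
    """
    horizon = initial_horizon + (epoch // update_every) * increment
    return min(horizon, max_horizon)


horizons = [rollout_horizon(e) for e in (0, 99, 100, 250, 5000)]
# horizons == [1, 1, 2, 3, 20]
```

Short rollouts early on limit how far model errors can compound, and longer rollouts later take advantage of the improved model.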
After specifying the training options, we can begin training the agent. After approximately 500 episodes, the training converges. Let's verify the trained policy in simulation. The policy is working fine.
Now let's check one of the three transition models that the MBPO agent trained to approximate the Cart-Pole system. You can set which transition model to use here and then call the step method of the environment model to predict the next observation, reward, and isDone signal.
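Once predictions are available, one way to compare them against the real system numerically is a per-dimension mean absolute error. The Python sketch below is an assumption about how such a check could be done, not something shown in the video, and the trajectories in it are made-up numbers rather than results from the demonstration.

```python
def mean_absolute_error_per_dimension(true_trajectory, predicted_trajectory):
    """Return one mean absolute error per observation dimension.

    Both arguments are lists of observations of equal length, one per time step.
    """
    num_steps = len(true_trajectory)
    num_dims = len(true_trajectory[0])
    return [
        sum(abs(true_trajectory[t][d] - predicted_trajectory[t][d])
            for t in range(num_steps)) / num_steps
        for d in range(num_dims)
    ]


# Made-up values purely for illustration; not data from the video.
truth = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 0.0, 4.0]]
predicted = [[0.0, 0.5, 0.0, 0.0], [1.0, 1.5, 0.0, 3.0]]
errors = mean_absolute_error_per_dimension(truth, predicted)
# errors == [0.0, 0.5, 0.0, 0.5]
```

Reporting the error separately for each dimension matters because positions and velocities have different scales, so a single aggregate number can hide a poorly predicted dimension.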
These four figures show each dimension of the observation. Blue lines show ground-truth values and orange lines show predicted values. As you can see, the predictions of the trained models are close to the ground-truth values. This concludes the MBPO agent demonstration.