How can I interpret an oscillating average reward plot in the RL training process?
Hi all,
I have tried to train a DDPG agent for a reference tracking problem whose environment is in Simulink, but the agent cannot track the reference. I have changed the hyperparameters and tried many times; in most attempts, however, the average reward plot oscillates around or below the Episode Q0 curve, as in the following plots (the first for NoiseOptions.Variance = 0.1, the second for 0.05, and the last for 0.01).
I have also tried many different reward functions and observations. Almost all of them show the same problem.
I am sharing the most important part of my code.
For a well-trained agent, is the average reward curve required to follow Episode Q0?
How should the given training plots be interpreted? What should I change in my RL algorithm?
Thanks for any help.
obsInfo = rlNumericSpec([5 1],...
    'LowerLimit',[-inf -inf 0 -inf 0]',...
    'UpperLimit',[ inf inf inf inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';
numActions = actInfo.Dimension(1);
env = rlSimulinkEnv('sz_rlforward','sz_rlforward/RL Agent',...
    obsInfo,actInfo);
statePath = [
    featureInputLayer(numObservations,Normalization='none',Name='State') % alternatives: 'rescale-symmetric' (range [-1, 1]) or 'rescale-zero-one'
    fullyConnectedLayer(50,Name='CriticStateFC1')
    reluLayer %('Name','CriticRelu1')
    fullyConnectedLayer(25,Name='CriticStateFC2')];
actionPath = [
    featureInputLayer(numActions,Normalization='none',Name='Action1')
    fullyConnectedLayer(25,Name='CriticActionFC1')];
commonPath = [
    additionLayer(2,Name='add')
    reluLayer %('Name','CriticCommonRelu')
    fullyConnectedLayer(1,Name='CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticNetwork = dlnetwork(criticNetwork);
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo,...
    ObservationInputNames="State",ActionInputNames="Action1");
actorNetwork = [
    featureInputLayer(numObservations,Normalization='none',Name='State')
    fullyConnectedLayer(5,Name='actorFC')
    reluLayer
    fullyConnectedLayer(50)
    reluLayer
    fullyConnectedLayer(numActions)
    sigmoidLayer
    scalingLayer(Scale=0.5,Bias=0.5) % I need an action value in the range 0-1
    ];
actorNetwork = dlnetwork(actorNetwork);
actor = rlContinuousDeterministicActor(actorNetwork, ...
    obsInfo,actInfo);
criticOptions = rlOptimizerOptions( ...
    LearnRate=1e-3, ...
    GradientThreshold=1, ...
    L2RegularizationFactor=1e-4);
actorOptions = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1, ...
    L2RegularizationFactor=1e-4);
agentOptions = rlDDPGAgentOptions(...
    SampleTime=Ts,...
    ActorOptimizerOptions=actorOptions,...
    CriticOptimizerOptions=criticOptions,...
    MiniBatchSize=128, ...
    DiscountFactor=0.95, ...
    ExperienceBufferLength=1e6);
agentOptions.NoiseOptions.Variance = 0.05;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
agent = rlDDPGAgent(actor,critic,agentOptions)
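One thing worth checking in the actor above: sigmoidLayer already outputs values in (0, 1), and scalingLayer computes Scale.*input + Bias, so with Scale=0.5 and Bias=0.5 the actor output would lie in (0.5, 1) rather than 0-1. A minimal sketch in plain MATLAB (no toolbox needed; the pre-activation range z is an illustrative assumption, not taken from the model) that compares the posted combination with tanh followed by the same scaling:

```matlab
% Illustrative pre-activation values feeding the last nonlinearity (assumed range)
z = linspace(-20, 20, 401);

% sigmoidLayer: output lies in (0, 1)
s = 1 ./ (1 + exp(-z));

% scalingLayer computes Scale.*input + Bias; values as in the posted actor
yPosted = 0.5 .* s + 0.5;

% Alternative: tanh output lies in (-1, 1), so the same scaling maps it to (0, 1)
yTanh = 0.5 .* tanh(z) + 0.5;

fprintf('sigmoid + scaling: [%.3f, %.3f]\n', min(yPosted), max(yPosted));
fprintf('tanh + scaling:    [%.3f, %.3f]\n', min(yTanh), max(yTanh));
```

If the flow has to go below 0.5 to track the reference, this alone might prevent tracking. Replacing sigmoidLayer with tanhLayer, or dropping the scalingLayer, would be two possible ways to get the full 0-1 range.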



Answers (2)
awcii
on 18 Jul 2023
2 Comments
Emmanouil Tzorakoleftherakis
on 18 Jul 2023
A trained agent does not necessarily need to have overlapping Q0 and reward values. It could be the case that the actor converges faster than the critic in which case it's totally ok to stop the training process early.
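One reason the two curves need not overlap is that they measure different quantities, assuming the training plot follows its documented definitions: the episode reward is the plain sum of the rewards collected in an episode, while Episode Q0 is the critic's estimate of the discounted long-term reward from the initial observation. A minimal sketch with made-up numbers (a hypothetical constant reward of -1 over a hypothetical 200-step episode; only DiscountFactor=0.95 comes from the post):

```matlab
gamma    = 0.95;             % DiscountFactor from the posted agent options
numSteps = 200;              % hypothetical episode length
r        = -ones(1, numSteps); % hypothetical reward of -1 at every step

% Undiscounted sum of rewards over the episode
episodeReward = sum(r);

% Discounted return from the first step, the quantity a critic is trained to estimate
discountedReturn = sum(gamma.^(0:numSteps-1) .* r);

fprintf('Episode reward:    %.2f\n', episodeReward);
fprintf('Discounted return: %.2f\n', discountedReturn);
```

With these numbers the undiscounted sum is -200 while the discounted return is close to -1/(1-gamma) = -20, so even a perfectly accurate critic would sit well away from the episode reward curve.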
The last graph you shared seems promising. How does the trained agent perform in that case?
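To check how a trained agent performs, one option is to simulate it against the environment and look at the total reward and the logged signals. A minimal sketch, assuming env and agent from the question are available; evaluateAgent and maxSteps are hypothetical names, not part of the original code:

```matlab
function totalReward = evaluateAgent(env, agent, maxSteps)
% evaluateAgent  Hypothetical helper: run one simulation episode (no training)
% and return the undiscounted total reward.
    simOpts = rlSimulationOptions(MaxSteps=maxSteps);  % cap on the episode length
    experience = sim(env, agent, simOpts);             % simulate the agent in the environment
    totalReward = sum(experience.Reward.Data);         % Reward is logged as a timeseries
end
```

Calling totalReward = evaluateAgent(env, agent, 200) (200 is an arbitrary step cap) with an agent taken from the run in the last graph, and then comparing the measured height with the reference in the Simulink model, should show whether the tracking is actually acceptable.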