Why is this PPO agent not able to learn its task?
Hi, I am currently trying to learn the leg coordination of a hexapod robot, meaning it should learn when to lift its legs so that an efficient gait (tripod, wave, etc.) emerges.
I am relatively new to RL, and for the last two weeks I have tried to get this to work, but no matter the algorithm parameters or reward-function definition, the agent does not learn at all.

- I am running a Simscape physics simulation of the hexapod robot (step size = 0.25 ms)
- The hexapod has 3 joints per leg; the agent receives the α-angle (rad) of each leg as an observation
- The movement sequence of a hexapod leg consists of a swing phase (lift the leg and place it in front) and a stance phase (push the leg back to move forward)
- The movement of a leg, i.e. the swing and stance phase, is predefined; the agent only has to decide when to initiate the swing
- As said above, the agent receives the α-angles as observations and has to output a 1 to initiate the swing of a leg as an action (all other output values do nothing)
- The reward currently combines three terms: the movement speed in x-direction (to reward moving forward), a penalty on the y-position (to discourage deviation from a straight line), and a penalty on the difference between the nominal and current height (to discourage stumbling or falling)
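Written out as a MATLAB sketch of the reward terms described above (the weights w1–w3 and the signal names are placeholders of mine, not values from the actual reward function):

```matlab
function r = hexapodReward(vx, y, h0, h)
% Sketch of the three reward terms described above.
%   vx : movement speed in x-direction (forward)
%   y  : lateral position (deviation from a straight line)
%   h0 : nominal body height, h : current body height
w1 = 1; w2 = 1; w3 = 1;   % placeholder weights, not from the original post
r = w1*vx - w2*abs(y) - w3*abs(h0 - h);
end
```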
I appreciate any advice you can give me; I just find it very odd that there seems to be no progress. I have run this simulation about 10 times with parameter changes, each for about 2000 episodes, but it always looks just like the graph above.
Does the agent lack information, or is the reward poorly defined?
Thank you in advance for any tips.
To give you as much information as possible, here are all the RL parameters:
- I use the PPO architecture used here, but greatly reduced the layer sizes from 300/400 to 32 (I tested larger networks as well, without any success)
- Agent options:
ExperienceHorizon = 512
MiniBatchSize = 128
ClipFactor = 0.2
EntropyLossWeight = 0.01
NumEpoch = 3
AdvantageEstimateMethod = "gae"
GAEFactor = 0.95
NormalizedAdvantageMethod = "none"
AdvantageNormalizingWindow = 1e6
ActorOptimizerOptions = actorOpts
CriticOptimizerOptions = criticOpts
SampleTime = 0.05
DiscountFactor = 0.99
- actorOpts and criticOpts only contain: LearnRate = 0.02
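For reference, the agent options listed above assembled into one call (a sketch using the Reinforcement Learning Toolbox functions rlPPOAgentOptions and rlOptimizerOptions; name=value syntax assumes R2021a or newer):

```matlab
% Optimizer options for actor and critic (only the learn rate is set).
actorOpts  = rlOptimizerOptions(LearnRate=0.02);
criticOpts = rlOptimizerOptions(LearnRate=0.02);

% PPO agent options as listed above.
agentOpts = rlPPOAgentOptions( ...
    ExperienceHorizon=512, ...
    MiniBatchSize=128, ...
    ClipFactor=0.2, ...
    EntropyLossWeight=0.01, ...
    NumEpoch=3, ...
    AdvantageEstimateMethod="gae", ...
    GAEFactor=0.95, ...
    NormalizedAdvantageMethod="none", ...
    AdvantageNormalizingWindow=1e6, ...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    SampleTime=0.05, ...
    DiscountFactor=0.99);
```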
- Training options:
MaxEpisodes = 10000
MaxStepsPerEpisode = 512
ScoreAveragingWindowLength = 50
Verbose = true
Plots = "training-progress"
StopTrainingCriteria = "EpisodeCount"
StopTrainingValue = maxEpisodes
SaveAgentCriteria = "EpisodeReward"
SaveAgentValue = 65
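And the training options above assembled into the corresponding rlTrainingOptions call (again a sketch of how I read the list, not copied verbatim from my script):

```matlab
% Training options as listed above.
maxEpisodes = 10000;
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=maxEpisodes, ...
    MaxStepsPerEpisode=512, ...
    ScoreAveragingWindowLength=50, ...
    Verbose=true, ...
    Plots="training-progress", ...
    StopTrainingCriteria="EpisodeCount", ...
    StopTrainingValue=maxEpisodes, ...
    SaveAgentCriteria="EpisodeReward", ...
    SaveAgentValue=65);
```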
- During an episode (512 steps, 0.05 s sample time), a hexapod with a predefined tripod gait (no learning) receives a reward of >120
1 Comment
Muhammad Fairuz Abdul Jalal
on 2 Nov 2023
Hi. I have provided a comment on a similar topic here: https://www.mathworks.com/matlabcentral/answers/1629850-ppo-reinforcement-learning-agent-doesn-t-learn?s_tid=srchtitle
Hope it helps.
Answers (0)