Why is this PPO agent not able to learn its task?

Hi, I am currently trying to learn the leg coordination of a hexapod robot, meaning it should learn when to lift its legs so that an efficient gait (tripod, wave, etc.) emerges.
I am relatively new to RL, and for the last two weeks I have been trying to get this to work, but no matter the algorithm parameters or reward function definition, the agent does not learn at all.
  • I am running a Simscape physics simulation of the hexapod robot (step size = 0.25 ms)
  • The hexapod has 3 joints per leg; the agent receives the α-angle (rad) of each leg as observations
  • The movement sequence of a hexapod leg consists of a swing phase (lift the leg and place it in front) and a stance phase (push the leg back to move forward)
  • The movement of a leg, i.e. the swing and stance phases, is predefined; the agent only has to decide when to initiate the swing
  • As said above, the agent receives the α-angles as observations and has to output a 1 to initiate the swing of a leg as an action (all other output values do nothing)
  • The reward is currently defined as a weighted sum of three terms: the movement speed in the x-direction (to reward moving forward), minus the y-position deviation (to discourage drifting from a straight line), minus the height difference between the nominal and current body height (to discourage stumbling or falling)
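For reference, such a reward could be computed per control step roughly as follows. This is only a sketch: the weights w1–w3 and the signal names vx, yPos, dHeight are assumptions for illustration, not the original implementation from the post.

```matlab
% Hypothetical per-step reward shaping for the hexapod.
% vx      : forward velocity in x (m/s), rewarded
% yPos    : lateral position (m), 0 on the straight line, penalized
% dHeight : |nominal body height - current body height| (m), penalized
w1 = 1.0;  w2 = 0.5;  w3 = 0.5;   % assumed weights, to be tuned

reward = w1*vx - w2*abs(yPos) - w3*dHeight;
```

If the forward-speed term is much smaller in magnitude than the penalty terms, the agent can maximize reward by standing still, which is a common cause of "no progress" curves; checking the relative scale of the three terms is a cheap first diagnostic.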
I appreciate any advice you can give me; I just find it very odd that there seems to be no progress. I have probably run this simulation about 10 times with parameter changes, for about 2000 episodes each, but the result always looks just like the graph above.
Does the agent lack information, or is the reward poorly defined?
Thank you in advance for any tips.
To give you as much information as possible, here are all the RL parameters:
  • I use the PPO architecture used here, but greatly reduced the size of the layers from 300/400 to 32 (I tested larger networks as well, without any success)
  • Agent options:
ExperienceHorizon=512, ...
MiniBatchSize=128, ...
ClipFactor=0.2,...
EntropyLossWeight=0.01,...
NumEpoch=3,...
AdvantageEstimateMethod="gae",...
GAEFactor=0.95,...
NormalizedAdvantageMethod="none",...
AdvantageNormalizingWindow=1e6,...
ActorOptimizerOptions=actorOpts,...
CriticOptimizerOptions=criticOpts,...
SampleTime=0.05,...
DiscountFactor=0.99
  • actorOpts and criticOpts only set LearnRate=0.02
  • Training options:
MaxEpisodes=10000,...
MaxStepsPerEpisode=512,...
ScoreAveragingWindowLength=50,...
Verbose=true,...
Plots="training-progress",...
StopTrainingCriteria="EpisodeCount",...
StopTrainingValue=maxEpisodes,...
SaveAgentCriteria="EpisodeReward",...
SaveAgentValue=65
  • During an episode (512 steps, 0.05 s sample time), a hexapod with a predefined tripod gait (no learning) receives a reward of >120
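For completeness, here is roughly how the options listed above would be assembled with MATLAB's Reinforcement Learning Toolbox. This is a sketch based on the parameters in the post; the variable names actorOpts, criticOpts, and maxEpisodes are taken from the post, everything else follows the standard rlPPOAgentOptions / rlTrainingOptions API.

```matlab
% Optimizer options for actor and critic (only the learning rate is set).
actorOpts  = rlOptimizerOptions(LearnRate=0.02);
criticOpts = rlOptimizerOptions(LearnRate=0.02);

% PPO agent options as listed above.
agentOpts = rlPPOAgentOptions( ...
    ExperienceHorizon=512, ...
    MiniBatchSize=128, ...
    ClipFactor=0.2, ...
    EntropyLossWeight=0.01, ...
    NumEpoch=3, ...
    AdvantageEstimateMethod="gae", ...
    GAEFactor=0.95, ...
    NormalizedAdvantageMethod="none", ...
    AdvantageNormalizingWindow=1e6, ...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts, ...
    SampleTime=0.05, ...
    DiscountFactor=0.99);

% Training options as listed above.
maxEpisodes = 10000;
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=maxEpisodes, ...
    MaxStepsPerEpisode=512, ...
    ScoreAveragingWindowLength=50, ...
    Verbose=true, ...
    Plots="training-progress", ...
    StopTrainingCriteria="EpisodeCount", ...
    StopTrainingValue=maxEpisodes, ...
    SaveAgentCriteria="EpisodeReward", ...
    SaveAgentValue=65);
```

Note that with ExperienceHorizon=512 and MaxStepsPerEpisode=512, each PPO update sees at most one episode of data, which may make the advantage estimates noisy.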

Answers (0)

Release: R2023a
Asked: 25 Sep 2023
