MATLAB Answers

Deep Q Learning to Control an arm model

Leif Eric Goebel on 30 Nov 2019
Answered: Vimal Rathod on 13 Dec 2019
Hello,
I am new to Deep Q-Learning (and to deep neural networks in general). My task is to create an (optimal) controller for a mathematical arm model that I implemented. I based my attempt on the "Cart Pole Environment" example that ships with the MATLAB Reinforcement Learning Toolbox. Hence I have a step function and a reset function, and I use them to create my environment. That is working fine. My actions are cell arrays (correctly initialized) that encode the biceps and triceps controls in all pairwise combinations of two arrays.
In short, my step function receives the inputs:
α (the arm angle, measured from fully stretched to fully bent),
dα (the angular velocity of the arm while in motion), and
Δα (the difference between the current angle and the desired angle).
My reset function sets the initial position of the arm randomly, with initial velocity 0 and a random (but then fixed) desired position.
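For reference, a reset function along these lines (a minimal sketch only; the angle limits of 0 and π and the field names in LoggedSignals are my assumptions, not taken from the original code) could look like:

```matlab
function [InitialObservation, LoggedSignals] = resetFunction()
% Sketch of a reset function compatible with rlFunctionEnv.
% Angle limits (0 to pi) and field names are assumptions.
alpha0   = pi*rand;          % random initial arm angle
dalpha0  = 0;                % arm starts at rest
alphaDes = pi*rand;          % random but then fixed desired angle

% Observation: [alpha; dalpha; DiffState]
InitialObservation = [alpha0; dalpha0; alphaDes - alpha0];

% Store everything the step function needs between calls
LoggedSignals.State  = [alpha0; dalpha0];
LoggedSignals.Target = alphaDes;
end
```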
I set my networks up as follows:
ObservationInfo = rlNumericSpec([3 1]);
ObservationInfo.Name = 'Arm States';
ObservationInfo.Description = 'alpha, dalpha, DiffState';
a = P.Param.a;
b = P.Param.b;
[A,B] = meshgrid(a,b);
actions = reshape(cat(2,A',B'),[],2);
ActionInfo = rlFiniteSetSpec(num2cell(actions,2));
ActionInfo.Name = 'actions';
env = rlFunctionEnv(ObservationInfo,ActionInfo,'simulateStep','resetFunction');
rng(0);
InitialObs = reset(env);
hiddenLayerSize = 128;
statePath = [
imageInputLayer([3 1 1],'Normalization','none','Name','state')
tanhLayer('Name','CriticRelu1')
fullyConnectedLayer(hiddenLayerSize,'Name','CriticStateFC1')
tanhLayer('Name','CriticRelu2')
fullyConnectedLayer(hiddenLayerSize,'Name','CriticStateFC2')];
actionPath = [
imageInputLayer([1 2 1],'Normalization','none','Name','action')
fullyConnectedLayer(hiddenLayerSize,'Name','CriticActionFC1')
tanhLayer('Name','tanh1')
fullyConnectedLayer(hiddenLayerSize,'Name','CriticActionFC2')
reluLayer('Name','ActionRelu1')
fullyConnectedLayer(hiddenLayerSize,'Name','CriticActionFC3')
fullyConnectedLayer(hiddenLayerSize,'Name','CriticActionFC4')];
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','output')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork, actionPath);
criticNetwork = addLayers(criticNetwork, commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC4','add/in2');
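For completeness, the remaining steps that the Cart Pole example uses to turn this critic into a DQN agent would look roughly like the following (a sketch using the R2019b syntax; the option values are untuned placeholders, not recommendations for this problem):

```matlab
% Wrap the layer graph in a critic representation
criticOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
critic = rlRepresentation(criticNetwork,ObservationInfo,ActionInfo, ...
    'Observation',{'state'},'Action',{'action'},criticOpts);

% Create the DQN agent
agentOpts = rlDQNAgentOptions('UseDoubleDQN',true, ...
    'ExperienceBufferLength',1e5,'MiniBatchSize',64);
agent = rlDQNAgent(critic,agentOpts);

% Train against the function environment
trainOpts = rlTrainingOptions('MaxEpisodes',2000,'MaxStepsPerEpisode',500);
trainStats = train(agent,env,trainOpts);
```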
This is largely copied from the given example. However, my agent takes a very long time to learn, and in the end the resulting controller does not work at all. It simply does not learn anything (such as minimizing the distance between the current angle and the desired angle).
My reward function is basically the following:
Reward = -(P.Param.finish(1) - P.State(1))^2 - 0.1*(P.Param.finish(2) - P.State(2))^2 - sum(sum(Action));
Reward = Reward *~IsDone - tooFar* 1000 + stateCorrect*100;
The variable tooFar is set to 1 if the angle exceeds the allowed range in either the negative or the positive direction, and stateCorrect is 1 if the angle α is within 1 degree of the desired angle.
In the end I have the following questions:
1) Is there a better way to set up my Network to get the result I want?
2) Is the reward function "functional"? Or should I use something more like "get +1 if the distance is smaller than in the step before"?
Thank you very much in advance; I hope the question is detailed enough to be answered. If anything is unclear, let me know.


Answers (1)

Vimal Rathod on 13 Dec 2019
Your network seems fine, and I hope you have set your hyperparameters properly. As for the rewards: when you start training a deep Q-learning network, it is often suggested to begin with a "discrete" reward function to push the agent towards the desired outputs. You could do that by giving a smaller reward and a higher penalty whenever the arm is not in the desired position or angle. In later stages of training, you can switch to a continuous reward; generally, continuous rewards are preferred for fine-tuning network performance.
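As an illustration of the discrete-reward idea (a sketch only; the thresholds and reward magnitudes are placeholders, and P, tooFar follow the naming used in the question):

```matlab
% Discrete (sparse) reward sketch: reward only near the target,
% penalize leaving the allowed range. All magnitudes are placeholders.
angleError = abs(P.Param.finish(1) - P.State(1));
if tooFar
    Reward = -1000;              % large penalty for leaving the range
elseif angleError < deg2rad(1)
    Reward = 100;                % bonus for reaching the target band
else
    Reward = -1;                 % small constant step penalty
end
```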


Release

R2019b