
Hello everyone,

I am trying to implement the following custom environment for my RL agent. There is a rectangular area, as shown in the figure, divided into smaller grids of some fixed dimension (say 20m x 20m). Every grid has a profit. The agent starts from the lower-left corner (x=0, y=0), moves in each step (say every 0.25s), and finally stays in the position that gives the highest profit sum. The total duration is 30s, so once it finds the best place, it stays there for the rest of the duration and enjoys the profit. The profit sum is calculated from the coverage of the agent (say the coverage range of the agent is 50m). However, the agent not only wants to maximize the profit sum but also wants to minimize its travel distance, so essentially the optimization is maximizing (profit sum / traveled distance). The agent can take two continuous actions: distance (D) and angle of movement (Theta). D is in [0, 12.5m] and Theta in [0, 359.9999deg). This means the position of the agent updates in the following way in each step:

x(new) = x(old) + D*cos(Theta)

y(new) = y(old) + D*sin(Theta)
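As a quick sanity check, the update rule above can be sketched in Python (the environment itself is in MATLAB; this is just an illustrative sketch, with the angle taken in degrees and the distance clipped to its 12.5 m per-step limit):

```python
import math

MAX_D = 12.5  # m per 0.25 s step (50 m/s * 0.25 s)

def step_position(x, y, d, theta_deg):
    """Advance the agent by distance d (metres) at heading theta_deg (degrees)."""
    d = max(0.0, min(d, MAX_D))            # clip the distance action to its valid range
    theta = math.radians(theta_deg % 360)  # keep the angle in [0, 360) and convert to radians
    return x + d * math.cos(theta), y + d * math.sin(theta)
```

For example, `step_position(0, 0, 12.5, 0)` moves the agent 12.5 m along the x axis, and a requested distance above 12.5 m is clipped rather than rejected.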

I thought the observation space could be like this. It's a tuple of four elements:

[Profit collected, Distance traveled, x pos of the agent, y pos of the agent].
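For concreteness, the coverage rule described above (a grid contributes its profit when its centre lies within the agent's coverage range) can be sketched in Python; the grid layout follows the description, while the profit values here are deterministic placeholders, not the actual randomly generated ones:

```python
import math

GRID_SIZE = 20.0   # m, side of each square grid cell
COV_RANGE = 50.0   # m, coverage radius of the agent
N = 10             # 10 x 10 grid

# Centre coordinates and (placeholder) profit of each cell
cells = [{"x": GRID_SIZE / 2 + j * GRID_SIZE,
          "y": GRID_SIZE / 2 + i * GRID_SIZE,
          "profit": (i + j) % 21}          # illustrative profits in [0, 20]
         for i in range(N) for j in range(N)]

def profit_sum(x, y):
    """Total profit of all cells whose centre is within COV_RANGE of (x, y)."""
    return sum(c["profit"] for c in cells
               if math.hypot(x - c["x"], y - c["y"]) <= COV_RANGE)
```

From the corner (0, 0) only the four nearest cell centres fall inside the 50 m radius, so the agent is naturally pushed toward the interior where more cells are covered.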

And my reward is something like delta(profit)/delta(distance traveled), where delta(profit) means the profit gained (or lost) by moving in a step, and likewise for delta(distance traveled), because my target is to gain high profit with little movement. I am trying to implement the custom environment by following

https://in.mathworks.com/help/reinforcement-learning/ug/create-custom-matlab-environment-from-template.html
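One subtlety with the ratio reward above: delta(distance traveled) is zero whenever the agent stays put, so the division needs a guard. A minimal Python sketch of one possible guarded reward (the scale factor, the stay-put branch, and the epsilon threshold are arbitrary illustrative choices, not from the original post):

```python
def reward(delta_profit, delta_dist, out_of_bounds,
           scale=10.0, penalty=-100.0, eps=1e-6):
    """Profit gained per metre travelled, with a guard against delta_dist == 0."""
    if delta_dist > eps:
        r = scale * delta_profit / delta_dist
    else:
        # Agent stayed (almost) still: no distance cost, reward the raw profit change
        r = scale * delta_profit
    return r + (penalty if out_of_bounds else 0.0)
```

With this shaping, a stationary agent that keeps its coverage gets a neutral reward instead of a NaN, and leaving the boundary is penalized regardless of the ratio.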

I have implemented part of it, but the last two methods are not clear to me, and I am doubtful about the overall code as well. The respective sections are commented in the code. Any suggestions?

Another thing, how do I introduce multiple agents into the scenario?

Thanks for reading such a long question.

classdef MyEnvironment < rl.env.MATLABEnvironment

%MYENVIRONMENT: Template for defining custom environment in MATLAB.

%% Properties (set properties' attributes accordingly)

properties

% Specify and initialize environment's necessary properties

% X and Y grid numbers

XGrid = 10

YGrid = 10

% Grid size in meter (square grids)

GridSize = 20.0

% Full dimension of the Grid

XMax = 200

YMax = 200

% Max and Min Angle the agent can move in degree (in each step)

MaxAngle = 359.9999

MinAngle = 0

% Sample time (S)

Ts = 0.25

% Max Distance the agent can travel in meter (in each sample time)

MaxD = 50 % in 1s agent can travel 50m

MaxDistance = 12.50 % = 50*0.25

MinDistance = 0

% System dynamics don't change for this interval in sec

FixDuration = 30

SimuDuration = 30 % for now Simulation time is same as Fix duration i.e. no change in profit over time

no_of_steps = 120 % no. of steps possible in one episode (30/0.25)

% Coverage range (in m)--agent can cover upto this range.

CovRange = 50

% Penalty when the agent goes outside boundary

PenaltyForGoingOutside = -100

% Grid centres and profits (struct array), filled in the constructor

Grid

% Total number of grids (XGrid*YGrid)

Total_Grids = 100

end

properties

% Initialize system state 4 values. All zeros.

% They are -- [Collected Profit sum, Traveled distance, x, and y pos of the

% agent]

% [ProfitSum, DistTravel, agent_x, agent_y]'

State = zeros(4,1)

end

properties(Access = protected)

% Initialize internal flag to indicate episode termination

IsDone = false

end

%% Necessary Methods

methods

% Constructor method creates an instance of the environment

function this = MyEnvironment()

% Initialize Observation settings

ObservationInfo = rlNumericSpec([4 1]);

ObservationInfo.Name = 'Grid States';

ObservationInfo.Description = 'Profit, Distance, x, y';

% Initialize Action settings

% Two actions -- distance and angle. Both limited by the

% provided range

ActionInfo = rlNumericSpec([2 1],'LowerLimit',[0;0],'UpperLimit',[12.5;359.9999]);

ActionInfo.Name = 'dist;angle';

% The following line implements built-in functions of RL env

this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);

% Grids' centre position and profit details

% Grid is a struct which stores 3 values. x pos, y pos and profit

% for each grid.

% This information is necessary to compute the coverage profit sum

this.Total_Grids = this.XGrid*this.YGrid;

Grid(this.Total_Grids) = struct();

G = 1;

for i = 1:this.XGrid

for j = 1:this.YGrid

Grid(G).X = (this.GridSize/2) + (i-1)*this.GridSize; % x pos of each grid centre

Grid(G).Y = (this.GridSize/2) + (j-1)*this.GridSize; % y pos of each grid centre

G = G + 1;

end

end

Profits = randi([0,20],this.Total_Grids,1); % profit of each grid

for G = 1:this.Total_Grids

Grid(G).Profit = Profits(G); % stored in Grid structure

end

this.Grid = Grid; % store in the object so step() can access it

% Initialize property values and pre-compute necessary values

updateActionInfo(this);

end

% Apply system dynamics and simulates the environment with the

% given action for one step.

function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)

LoggedSignals = [];

% n is used to count total number of steps taken.

% when n == possible time steps in one episode then end the

% episode

persistent n

if isempty(n)

n = 1;

else

n = n+1;

end

% Get actions

[dist,angle] = getMovement(this,Action);

% Unpack state vector

Profit = this.State(1);

Distance = this.State(2);

x = this.State(3);

y = this.State(4);

% Computation of the necessary values

CosTheta = cosd(angle);

SinTheta = sind(angle);

x_new = x + dist*CosTheta;

y_new = y + dist*SinTheta;

% To compute the new profit after taking the actions

% Idea is if the centre of a grid is within the coverage range

% of the agent, then it is covered and its profit is obtained.

P = 0;

for k = 1: this.Total_Grids

if sqrt((x_new-this.Grid(k).X)^2 + (y_new-this.Grid(k).Y)^2)<= this.CovRange

P = P + this.Grid(k).Profit;

end

end

new_Profit = P;

dist_Traveled = Distance + dist; % cumulative distance traveled

delta_profit = new_Profit - Profit;

delta_dist = dist;

% New Observation (column vector, matching the state definition)

Observation = [new_Profit; dist_Traveled; x_new; y_new];

% Update system states

this.State = Observation;

% Check terminal condition

if n == this.no_of_steps

this.IsDone = true;

n = 0; % reset the step counter for the next episode

end

% The 'step' output argument must be set in addition to the class property

IsDone = this.IsDone;

% Reward::

% If goes outside the region, penalize the agent

if (x_new > this.XMax || y_new > this.YMax || x_new < 0 || y_new < 0)

penalty = this.PenaltyForGoingOutside;

else

penalty = 0;

end

if delta_dist > 0

Reward = 10*(delta_profit/delta_dist) + penalty;

else

% Agent did not move: avoid division by zero

Reward = 10*delta_profit + penalty;

end

end

% Reset environment to initial state and output initial observation

function InitialObservation = reset(this)

% Profit sum goes to 0

P0 = 0;

% Distance travelled goes to 0

D0 = 0;

% Initial x pos of the robot

X0 = 0;

% Initial y pos of the robot

Y0 = 0;

InitialObservation = [P0;D0;X0;Y0];

this.State = InitialObservation;

this.IsDone = false; % clear the termination flag for the new episode

end

end

methods

% Helper methods to create the environment

% Not sure how to update these two methods???

function [dist,angle] = getMovement(this,action)

% Unpack the two continuous actions and check they are in range

dist = action(1);

angle = action(2);

if dist < this.MinDistance || dist > this.MaxDistance || angle < this.MinAngle || angle > this.MaxAngle

error('Action must be within the valid range');

end

end

% Update the action info based on the property values

function updateActionInfo(this)

this.ActionInfo.LowerLimit = [this.MinDistance; this.MinAngle];

this.ActionInfo.UpperLimit = [this.MaxDistance; this.MaxAngle];

end

end

end

Emmanouil Tzorakoleftherakis
on 29 Nov 2020

Hello,

Based on the updated files you sent on this post, you are setting this.IsDone; however, this is a class property, which is different from the IsDone output argument that 'step' must return. You need to set both to eliminate the error you are seeing.

There is an additional error which happens after that, due to how the reward (line 168 in the attached file) is defined. Specifically, there is a division by zero - make sure you account for that in your reward logic.

Hope that helps
