Custom RL environment creation

Question

laha_M on 19 Oct 2020

0
Link

Direct link to this question

https://in.mathworks.com/matlabcentral/answers/619138-custom-rl-environment-creation

Commented: laha_M on 29 Nov 2020

Accepted Answer: Emmanouil Tzorakoleftherakis

Hello everyone,

I am trying to implement the following custom environment for my RL agent. There is a rectangular area, as shown in the figure, with smaller grids of some fixed dimension (say 20m x 20m). Every grid has a profit. The agent starts from the lower left bottom (x=0, y=0) and moves in each step (say each 0.25s), and finally stays in a position that gives the highest profit sum. The total duration is 30s. So once it finds the best place, it stays there for the rest of the duration and enjoys the profit. The profit sum is calculated as the coverage of the agent (say the coverage range of the agent is 50m). However, the agent not only wants to maximize the profit sum but also wants to minimize its travel distance. So essentially, the optimization is maximizing (profit sum/traveled distance). The agent can take two continuous actions- distance (D) and angle of movement (Theta). D is [0, 12.5m] and Theta[0, 359.9999deg). This means the position of the agent updates the following way in each step.

x(new) = x(old) + D*cos(Theta)

y(new) = y(old) + D*sin(Theta)

I thought the observation space to be like these. It's a tuple of four elements.

[Profit collected, Distance traveled, x pos of the agent, y pos of the agent].

And my reward is something like = [delta(profit)/delta(distance traveled)]. Where delta(profit) means profit gained(or lost) by moving in a step and the same thing for delta(distance traveled). Because my target is to gain high profit with less movement. I am looking and trying to implement the custom environment by looking into

https://in.mathworks.com/help/reinforcement-learning/ug/create-custom-matlab-environment-from-template.html

I have implemented some part of it but the last two methods are not clear to me. And doubtful about the overall code as well. The respective section is commented in the code. Any suggestions?

Another thing, how do I introduce multiple agents into the scenario?

Thanks for reading such a long question.

classdef MyEnvironment < rl.env.MATLABEnvironment
    
    %MYENVIRONMENT: Template for defining custom environment in MATLAB.    
    
    %% Properties (set properties' attributes accordingly)
    properties
        % Specify and initialize environment's necessary properties  
        
        % X and Y grid numbers
        XGrid = 10
        YGrid = 10
        
        % Grid size in meter (square grids)
        GridSize = 20.0
        
        % Full dimension of the Grid
        XMax = 200
        YMax = 200
        
        % Max and Min Angle the agent can move in degree (in each step)
        MaxAngle = 359.9999
        MinAngle = 0
        
        % Sample time (S)
        Ts = 0.25
        
        % Max Distance the agent can travel in meter (in each sample time)
        MaxD = 50 % in 1s agent can travel 50m
        MaxDistance = 12.50 % = 50*0.25
        MinDistance = 0
        
        % System dynamics dont change for this interval in sec
        FixDuration = 30
        
        SimuDuration = 30 % for now Simulation time is same as Fix duration i.e. no change in profit over time
        
        no_of_steps = 120 % no. of steps possible in one episode (30/0.25)
        
        
        % Coverage range (in m)--agent can cover upto this range.
        CovRange = 50
        
        
             
        % Penalty when the agent goes outside boundary
        PenaltyForGoingOutside = -100
        
    end
    
    properties
        % Initialize system state 4 values. All zeros.
        % They are -- [Collected Profit sum, Traveled distance, x, and y pos of the
        % agent]
        % [ProfitSum, DistTravel, agent_x, agent_y]'
        State = zeros(4,1)
    end
    
    properties(Access = protected)
        % Initialize internal flag to indicate episode termination
        IsDone = false        
    end
    %% Necessary Methods
    methods              
        % Contructor method creates an instance of the environment
        
        function this = MyEnvironment()
            % Initialize Observation settings
            ObservationInfo = rlNumericSpec([4 1]);
            ObservationInfo.Name = 'Grid States';
            ObservationInfo.Description = 'Profit, Distance, x, y';
            
            % Initialize Action settings
            % Two actions -- distance and angle. Both limited by the
            % provided range
            ActionInfo = rlNumericSpec([2 1],'LowerLimit',[0;0],'UpperLimit',[12.5;359.9999]);
            ActionInfo.Name = 'dist;angle';
            
            % The following line implements built-in functions of RL env
            this = this@rl.env.MATLABEnvironment(ObservationInfo,ActionInfo);
            
            
            % Grids' centre position and profit details
            % Grid is a struct which stores 3 values. x pos, y pos and profit
            % for each grid. 
            % This information is necessary to compute the coverage profit sum
            Total_Grids = this.XGrid*this.YGrid;
            Grid(Total_Grids) = struct();
            G = 1;
            for i = 1:this.XGrid
              for j = 1:this.YGrid
                  Grid(G).X = (this.GridSize/2)+ (j-1)*(this.GridSize); % x pos of each grid centre
                  Grid(G).Y = (this.GridSize/2)+ (i-1)*(this.GridSize); % y pos of each grid centre
                  G = G + 1;
              end
            end
            Profits = randi([0,20],100,1); % profit of each grid, 100 grids are there
            G = 1;
            for i = 1:Total_Grids
                Grid(G).Profit = Profits(G); % stored in Grid structure
                G = G+1;
            end
            
            
            
            % Initialize property values and pre-compute necessary values
            updateActionInfo(this);
        end
        
        % Apply system dynamics and simulates the environment with the 
        % given action for one step.
        
        function [Observation,Reward,IsDone,LoggedSignals] = step(this,Action)
            
            LoggedSignals = [];
            
            % n is used to count total number of steps taken.
            % when n == possible time steps in one episode then end the
            % episode
            persistent n
            if isempty(n)
                 n = 1;
            else
                 n = n+1;
            end
            
            % Get actions
            [dist,angle] = getMovement(this,Action);            
            
            % Unpack state vector
            Profit = this.State(1);
            Distance = this.State(2);
            x = this.State(3);
            y = this.State(4);
            
            % Computation of the necessary values
            CosTheta = cosd(angle);
            SinTheta = sind(Theta);            
            x_new = x + dist*CosTheta;
            y_new = y + dist*SinTheta;
            
            % To compute the new profit after taking the actions
            
            % Idea is if the centre of a grid is within the coverage range
            % of the agent, then it is covered and its profit is obtained.
            P  = 0;
            for k = 1: this.Total_Grids
                if sqrt((x_new-this.Grid(k).X)^2 + (y_new-this.Grid(k).Y)^2)<= this.CovRange
                    P = P + this.Grid(k).Profit;
                end
            end
            
            new_Profit = P;
            dist_Traveled = dist;
            
            delta_profit = new_Profit-Profit;
            delta_dist = dist;
            
            % New Observation
            Observation = [new_profit, dist_traveled, x_new, y_new];
            % Update system states
            this.State = Observation;
            
            % Check terminal condition
            if n == this.no_of_steps
                this.IsDone = true;
            end
            
            % Reward::
            % If goes outside the region, penalize the agent
            if (x_new > this.XMax || y_new > this.YMax)
                penalty = this.PenaltyForGoingOutside;
            else 
                penalty = 0;
            end
            Reward = 10*(delta_profit/delta_dist)+ penalty;
            
        end
        
        % Reset environment to initial state and output initial observation
        function InitialObservation = reset(this)
            % Profit sum goes to 0
            P0 = 0;  
            % Distance travelled goes to 0
            D0 = 0;
            % Initial x pos of the robot
            X0 = 0;
            % Initial y pos of the robot
            Y0 = 0;
            
            InitialObservation = [P0;D0;X0;Y0];
            this.State = InitialObservation;
        end
    end
    
    methods               
        % Helper methods to create the environment
        % Not sure how to update this two methods???
        function [dist,angle] = getMovement(this,action)
            if ~ismember(action,this.ActionInfo.Elements)
                error('Action must be limited by the valid range');
            end
            [dist,angle] = action;           
        end
        % Update the action info based on Values
        % Not sure how to update this in my case??? kept same as cart pole
        function updateActionInfo(this)
            this.ActionInfo.Elements = this.MaxForce*[-1 1];
        end
        
     
        
    end
      
end

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Emmanouil Tzorakoleftherakis on 29 Nov 2020

0
Link

Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/619138-custom-rl-environment-creation#answer_559283

MyEnvironment.m

Hello,

Based on the updated files you sent on this post, you are setting this.IsDone, however this is a class variable which is different than IsDone that is needed as output of 'step'. You need to set both to eliminate the error you are seeing.

There is an additional error which happends after that and it's due to how the reward (line 168 in attached) is defined. Specifically there is division by zero - make sure you account for that in your reward logic.

Hope that helps

1 Comment
Show -1 older commentsHide -1 older comments

laha_M on 29 Nov 2020

Thanks Emmanouil.

Sign in to comment.

Custom RL environment creation

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

1 Comment
Show -1 older commentsHide -1 older comments

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Custom RL environment creation

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

1 Comment Show -1 older commentsHide -1 older comments

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

1 Comment
Show -1 older commentsHide -1 older comments