I don't know why my DDPG scheduling model takes the worst actions.

Hi, I'm studying an ESS scheduling algorithm using reinforcement learning.
I wrote the code and algorithm for ESS scheduling to save money (the reward is the amount of money saved).
The idea of the schedule is to reduce the daily electricity fee by discharging the ESS during the high-price hours and charging during the lowest-price hours. But I don't know why my code just discharges all day.
Can someone help me with this problem?
function [NextObs, Reward, IsDone, LoggedSignals] = essDDPGStepFcn(Action, LoggedSignals)
    % Parameters
    eff_cha = 0.95;           % charging efficiency
    eff_dch = 0.95;           % discharging efficiency
    SOC_min = 0;              % minimum state of charge
    SOC_max = 3000;           % maximum state of charge
    delta_max = 1000;         % maximum charge/discharge energy per step
    p_crt = 100;              % per-hour SOC reserve used in the constraint below
    soc_penalty_coef = 1e3;   % soft-penalty weight (only used by the commented-out lines)

    % Unpack logged signals
    soc   = LoggedSignals.SOC;
    t     = LoggedSignals.T;
    price = LoggedSignals.Price;
    load  = LoggedSignals.Load;

    u = Action;
    if u >= 0
        soc = soc + u*delta_max*eff_cha;             % u > 0: charging
        reward = -price(t) * (u*delta_max / eff_cha);
    else
        soc = soc + u*delta_max/eff_dch;             % u < 0: discharging
        reward = price(t) * (-u*delta_max * eff_dch);
    end

    % SOC constraint penalties
    if soc <= SOC_min
        reward = -1e2;
        %reward = reward - soc_penalty_coef * (SOC_min + 0.01 - soc);
    elseif soc >= SOC_max
        reward = -1e2;
        %reward = reward - soc_penalty_coef * (soc - (SOC_max - 0.01));
    end
    if soc <= p_crt*(24-t)
        reward = -1e2;
        %reward = reward - soc_penalty_coef * (SOC_min + 0.01 - soc);
    end

    % Time update
    t = t + 1;
    IsDone = t > 24;

    % Observation
    NextObs = [soc;
               t/24;
               price(min(t,24))/150;
               load(min(t,24))/max(load)];

    % Log update
    LoggedSignals.SOC = soc;
    LoggedSignals.T = t;
    LoggedSignals.Price = price;
    LoggedSignals.Load = load;
    Reward = reward;
    assignin('base', sprintf('log_SOC_%d', t), soc);
    assignin('base', sprintf('log_action_%d', t), Action);
    assignin('base', sprintf('log_reward_%d', t), Reward);
end
function [InitialObs, LoggedSignals] = essDDPGResetFcn()
    LoggedSignals.T = 1;
    LoggedSignals.SOC = 2500;

    % Time-of-use price profile (24 hours)
    price = 140.5*ones(1,24);
    price(1:7)   = 87.3;
    price(22:24) = 87.3;
    price(8:10)  = 109.8;
    price(12)    = 109.8;
    price(18:21) = 109.8;

    load = table2array(readtable('48_consumption_6.1.xlsx'));

    LoggedSignals.Price = price;
    LoggedSignals.Load = load;
    InitialObs = [LoggedSignals.SOC;
                  LoggedSignals.T;
                  price(1);
                  load(1)];
end
previousRngState = rng(0,"twister");
obsInfo = rlNumericSpec([4 1]);
obsInfo.Name = 'observations';
actInfo = rlNumericSpec([1 1], 'LowerLimit', -1, 'UpperLimit', 1);
actInfo.Name = 'action';
env = rlFunctionEnv(obsInfo, actInfo, @essDDPGStepFcn, @essDDPGResetFcn);
% === 3. Actor ===
actorNet = [
    featureInputLayer(4, "Name", "state")
    fullyConnectedLayer(64, "Name", "fc1")
    reluLayer("Name", "relu1")
    fullyConnectedLayer(64, "Name", "fc2")
    reluLayer("Name", "relu2")
    fullyConnectedLayer(1, "Name", "fc3")
    tanhLayer("Name", "tanh")];
actorOpts = rlRepresentationOptions('LearnRate',1e-4);
actor = rlDeterministicActorRepresentation(actorNet, obsInfo, actInfo, ...
    'Observation', {'state'}, actorOpts);
% === 4. Critic ===
statePath = [
    featureInputLayer(4, "Name", "state")
    fullyConnectedLayer(64, "Name", "fcState")
    reluLayer("Name", "reluState")];
actionPath = [
    featureInputLayer(1, "Name", "action")
    fullyConnectedLayer(64, "Name", "fcAction")];
commonPath = [
    additionLayer(2, "Name", "addition")
    reluLayer("Name", "reluCommon")
    fullyConnectedLayer(1, "Name", "qValue")];
criticNet = layerGraph(statePath);
criticNet = addLayers(criticNet, actionPath);
criticNet = addLayers(criticNet, commonPath);
criticNet = connectLayers(criticNet, 'reluState', 'addition/in1');
criticNet = connectLayers(criticNet, 'fcAction', 'addition/in2');
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
critic = rlQValueRepresentation(criticNet, obsInfo, actInfo, ...
    'Observation', {'state'}, 'Action', {'action'}, criticOpts);
agentOpts = rlDDPGAgentOptions(...
    'SampleTime', 1, ...
    'TargetSmoothFactor', 1e-3, ...
    'ExperienceBufferLength', 1e5, ...
    'MiniBatchSize', 64, ...
    'DiscountFactor', 0.99);
agent = rlDDPGAgent(actor, critic, agentOpts);
stop_reward_val = 20e4;
trainOpts = rlTrainingOptions(...
    'MaxEpisodes', 1000, ...
    'MaxStepsPerEpisode', 24, ...
    'StopTrainingCriteria', 'EpisodeReward', ...
    'StopTrainingValue', stop_reward_val, ...
    'Verbose', false, ...
    'Plots', 'training-progress');
trainingStats = train(agent, env, trainOpts);
[bestReward, bestEp] = max(trainingStats.EpisodeReward)
filename = sprintf('Agent%d.mat', bestReward);
save(filename, "agent");
simOptions = rlSimulationOptions('MaxSteps', 24);
load(filename, 'agent');
sim(env, agent, simOptions);
T = 24;
soc_log = zeros(1, T);
action_log = zeros(1, T);
reward_log = zeros(1, T);
for t = 1:T
    soc_var    = sprintf('log_SOC_%d', t);
    action_var = sprintf('log_action_%d', t);
    reward_var = sprintf('log_reward_%d', t);
    if evalin('base', sprintf('exist(''%s'', ''var'')', soc_var))
        soc_log(t) = evalin('base', soc_var);
    end
    if evalin('base', sprintf('exist(''%s'', ''var'')', action_var))
        action_log(t) = evalin('base', action_var);
    end
    if evalin('base', sprintf('exist(''%s'', ''var'')', reward_var))
        reward_log(t) = evalin('base', reward_var);
    end
end
figure;
subplot(3,1,1);
plot(1:T, soc_log, '-o');
title('State of Charge (SOC)');
xlabel('Time step');
ylabel('SOC');
subplot(3,1,2);
plot(1:T, action_log, '-x');
title('Action');
xlabel('Time step');
ylabel('Action value');
subplot(3,1,3);
plot(1:T, reward_log, '-s');
title('Reward');
xlabel('Time step');
ylabel('Reward');
sgtitle('DDPG ESS Scheduling Results');
  1 Comment
Torsten on 2 Aug 2025 (edited 2 Aug 2025)
"But I don't know why my code just discharges all day."
Maybe your objective function is formulated to minimize profit instead of maximizing it.

Answers (1)

shantanu on 5 Aug 2025
Hi,
I see you are trying to optimize the daily electricity fee of an Energy Storage System (ESS) using the DDPG algorithm.
  • In your step function, the reward computed from the action u is always negative for charging and always positive for discharging. Since the RL algorithm maximizes the net cumulative reward, this is exactly why the agent picks the discharging action at every timestep.
  • Since your goal is to reduce the daily electricity fee by discharging during high-price hours and charging during low-price hours, the reward should reflect the net cost reduction or profit. At each timestep, the profit or loss of the action taken should be expressed at the current price (see the sketch after this list).
  • Also, in the SOC constraints penalty section, setting reward = -1e2 completely overwrites the reward of the action taken at that timestep, so the agent loses that information. Instead, subtract a penalty from the reward; with enough training steps the agent can then learn to avoid the violation.
  • To prevent the SOC from overflowing or underflowing, you can clamp it to the maximum and minimum values respectively.
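For illustration, here is a minimal sketch of how the reward and SOC update inside essDDPGStepFcn could look along these lines. It reuses the variable names from your step function, treats u*delta_max as the battery-side energy change (a slightly different convention from your original code), and reuses soc_penalty_coef as the soft-penalty weight; the exact weighting and scaling are assumptions you would need to tune.
% Minimal sketch, assuming the same variables as essDDPGStepFcn;
% penalty weight and sign convention are assumptions to be tuned.
u = max(min(Action, 1), -1);               % keep the action in [-1, 1]
if u >= 0                                  % u > 0: charging
    grid_energy = u*delta_max / eff_cha;   % energy bought from the grid
else                                       % u < 0: discharging
    grid_energy = u*delta_max * eff_dch;   % (negative) energy offset or sold
end
soc = soc + u*delta_max;                   % battery-side energy change
reward = -price(t) * grid_energy;          % pay while charging, earn while discharging

% Soft penalty added to the reward instead of overwriting it
if soc < SOC_min
    reward = reward - soc_penalty_coef * (SOC_min - soc);
elseif soc > SOC_max
    reward = reward - soc_penalty_coef * (soc - SOC_max);
end

% Clamp SOC so it stays in the physical range
soc = max(min(soc, SOC_max), SOC_min);
Because the penalty is subtracted rather than replacing the reward, the price signal is preserved at every step, and clamping keeps the SOC physically meaningful even when the penalty fires.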
I hope this information helps!
