I don't know why my DDPG scheduling model takes the worst actions.

Hi, I'm studying an ESS scheduling algorithm using reinforcement learning.
I wrote the code and algorithm for ESS scheduling to save money (the reward is the amount of money saved).
The idea of the schedule is to reduce the daily electricity fee by discharging the ESS during the high-price hours and charging during the lowest-price hours. But I don't know why my code just discharges all day.
Can someone help me with this problem?
function [NextObs, Reward, IsDone, LoggedSignals] = essDDPGStepFcn(Action, LoggedSignals)
    % Parameters
    eff_cha = 0.95;           % charging efficiency
    eff_dch = 0.95;           % discharging efficiency
    SOC_min = 0;              % minimum state of charge
    SOC_max = 3000;           % maximum state of charge
    delta_max = 1000;         % maximum charge/discharge energy per step
    p_crt = 100;              % per-hour SOC reserve used in the constraint below
    soc_penalty_coef = 1e3;   % soft-penalty weight (only used by the commented-out lines)

    % Unpack logged signals
    soc   = LoggedSignals.SOC;
    t     = LoggedSignals.T;
    price = LoggedSignals.Price;
    load  = LoggedSignals.Load;

    u = Action;
    if u >= 0
        soc = soc + u*delta_max*eff_cha;             % u > 0: charging
        reward = -price(t) * (u*delta_max / eff_cha);
    else
        soc = soc + u*delta_max/eff_dch;             % u < 0: discharging
        reward = price(t) * (-u*delta_max * eff_dch);
    end

    % SOC constraint penalties
    if soc <= SOC_min
        reward = -1e2;
        %reward = reward - soc_penalty_coef * (SOC_min + 0.01 - soc);
    elseif soc >= SOC_max
        reward = -1e2;
        %reward = reward - soc_penalty_coef * (soc - (SOC_max - 0.01));
    end
    if soc <= p_crt*(24-t)
        reward = -1e2;
        %reward = reward - soc_penalty_coef * (SOC_min + 0.01 - soc);
    end

    % Time update
    t = t + 1;
    IsDone = t > 24;

    % Observation
    NextObs = [soc;
               t/24;
               price(min(t,24))/150;
               load(min(t,24))/max(load)];

    % Log update
    LoggedSignals.SOC = soc;
    LoggedSignals.T = t;
    LoggedSignals.Price = price;
    LoggedSignals.Load = load;
    Reward = reward;
    assignin('base', sprintf('log_SOC_%d', t), soc);
    assignin('base', sprintf('log_action_%d', t), Action);
    assignin('base', sprintf('log_reward_%d', t), Reward);
end
function [InitialObs, LoggedSignals] = essDDPGResetFcn()
    LoggedSignals.T = 1;
    LoggedSignals.SOC = 2500;

    % Time-of-use price profile (24 hours)
    price = 140.5*ones(1,24);
    price(1:7)   = 87.3;
    price(22:24) = 87.3;
    price(8:10)  = 109.8;
    price(12)    = 109.8;
    price(18:21) = 109.8;

    load = table2array(readtable('48_consumption_6.1.xlsx'));

    LoggedSignals.Price = price;
    LoggedSignals.Load = load;
    InitialObs = [LoggedSignals.SOC;
                  LoggedSignals.T;
                  price(1);
                  load(1)];
end
previousRngState = rng(0,"twister");
obsInfo = rlNumericSpec([4 1]);
obsInfo.Name = 'observations';
actInfo = rlNumericSpec([1 1], 'LowerLimit', -1, 'UpperLimit', 1);
actInfo.Name = 'action';
env = rlFunctionEnv(obsInfo, actInfo, @essDDPGStepFcn, @essDDPGResetFcn);
% === 3. Actor ===
actorNet = [
    featureInputLayer(4, "Name", "state")
    fullyConnectedLayer(64, "Name", "fc1")
    reluLayer("Name", "relu1")
    fullyConnectedLayer(64, "Name", "fc2")
    reluLayer("Name", "relu2")
    fullyConnectedLayer(1, "Name", "fc3")
    tanhLayer("Name", "tanh")];
actorOpts = rlRepresentationOptions('LearnRate',1e-4);
actor = rlDeterministicActorRepresentation(actorNet, obsInfo, actInfo, ...
    'Observation', {'state'}, actorOpts);
% === 4. Critic ===
statePath = [
    featureInputLayer(4, "Name", "state")
    fullyConnectedLayer(64, "Name", "fcState")
    reluLayer("Name", "reluState")];
actionPath = [
    featureInputLayer(1, "Name", "action")
    fullyConnectedLayer(64, "Name", "fcAction")];
commonPath = [
    additionLayer(2, "Name", "addition")
    reluLayer("Name", "reluCommon")
    fullyConnectedLayer(1, "Name", "qValue")];
criticNet = layerGraph(statePath);
criticNet = addLayers(criticNet, actionPath);
criticNet = addLayers(criticNet, commonPath);
criticNet = connectLayers(criticNet, 'reluState', 'addition/in1');
criticNet = connectLayers(criticNet, 'fcAction', 'addition/in2');
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
critic = rlQValueRepresentation(criticNet, obsInfo, actInfo, ...
    'Observation', {'state'}, 'Action', {'action'}, criticOpts);
agentOpts = rlDDPGAgentOptions(...
    'SampleTime', 1, ...
    'TargetSmoothFactor', 1e-3, ...
    'ExperienceBufferLength', 1e5, ...
    'MiniBatchSize', 64, ...
    'DiscountFactor', 0.99);
agent = rlDDPGAgent(actor, critic, agentOpts);
stop_reward_val = 20e4;
trainOpts = rlTrainingOptions(...
    'MaxEpisodes', 1000, ...
    'MaxStepsPerEpisode', 24, ...
    'StopTrainingCriteria', 'EpisodeReward', ...
    'StopTrainingValue', stop_reward_val, ...
    'Verbose', false, ...
    'Plots', 'training-progress');
trainingStats = train(agent, env, trainOpts);
[bestReward, bestEp] = max(trainingStats.EpisodeReward)
filename = sprintf('Agent%d.mat', bestReward);
save(filename, "agent");
simOptions = rlSimulationOptions('MaxSteps', 24);
load(filename, 'agent');
sim(env, agent, simOptions);
T = 24;
soc_log = zeros(1, T);
action_log = zeros(1, T);
reward_log = zeros(1, T);
for t = 1:T
    soc_var    = sprintf('log_SOC_%d', t);
    action_var = sprintf('log_action_%d', t);
    reward_var = sprintf('log_reward_%d', t);
    if evalin('base', sprintf('exist(''%s'', ''var'')', soc_var))
        soc_log(t) = evalin('base', soc_var);
    end
    if evalin('base', sprintf('exist(''%s'', ''var'')', action_var))
        action_log(t) = evalin('base', action_var);
    end
    if evalin('base', sprintf('exist(''%s'', ''var'')', reward_var))
        reward_log(t) = evalin('base', reward_var);
    end
end
figure;
subplot(3,1,1);
plot(1:T, soc_log, '-o');
title('State of Charge (SOC)');
xlabel('Time step');
ylabel('SOC');
subplot(3,1,2);
plot(1:T, action_log, '-x');
title('Action');
xlabel('Time step');
ylabel('Action value');
subplot(3,1,3);
plot(1:T, reward_log, '-s');
title('Reward');
xlabel('Time step');
ylabel('Reward');
sgtitle('DDPG ESS Scheduling Results');
  1 Comment
Torsten on 2 Aug 2025 (edited 2 Aug 2025)
"But I don't know why my code just discharges all day."
Maybe your objective function is formulated to minimize profit instead of maximizing it.

Answers (1)

shantanu on 5 Aug 2025
Hi,
I see you are trying to optimize the daily electricity fee of an Energy Storage System (ESS) using the DDPG algorithm.
  • In your step function, the reward computed from the action u is always negative for charging and always positive for discharging. Since the RL algorithm maximizes the net cumulative reward, this is exactly why the agent picks the discharging action at every timestep.
  • Since your goal is to reduce the daily electricity fee by discharging during high-price hours and charging during low-price hours, the reward should reflect the net cost reduction or profit. At each timestep, the profit or loss of the action taken should be expressed at the current price (see the sketch after this list).
  • Also, in the SOC constraints penalty section, setting reward = -1e2 completely overwrites the reward of the action taken at that timestep, so the agent loses that information. Instead, subtract a penalty from the reward; with enough training steps the agent can then learn to avoid the violation.
  • To prevent the SOC from overflowing or underflowing, you can clamp it to the maximum and minimum values respectively.
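For illustration, here is a minimal sketch of how the reward and SOC update inside essDDPGStepFcn could look along these lines. It reuses the variable names from your step function, treats u*delta_max as the battery-side energy change (a slightly different convention from your original code), and reuses soc_penalty_coef as the soft-penalty weight; the exact weighting and scaling are assumptions you would need to tune.
% Minimal sketch, assuming the same variables as essDDPGStepFcn;
% penalty weight and sign convention are assumptions to be tuned.
u = max(min(Action, 1), -1);               % keep the action in [-1, 1]
if u >= 0                                  % u > 0: charging
    grid_energy = u*delta_max / eff_cha;   % energy bought from the grid
else                                       % u < 0: discharging
    grid_energy = u*delta_max * eff_dch;   % (negative) energy offset or sold
end
soc = soc + u*delta_max;                   % battery-side energy change
reward = -price(t) * grid_energy;          % pay while charging, earn while discharging

% Soft penalty added to the reward instead of overwriting it
if soc < SOC_min
    reward = reward - soc_penalty_coef * (SOC_min - soc);
elseif soc > SOC_max
    reward = reward - soc_penalty_coef * (soc - SOC_max);
end

% Clamp SOC so it stays in the physical range
soc = max(min(soc, SOC_max), SOC_min);
Because the penalty is subtracted rather than replacing the reward, the price signal is preserved at every step, and clamping keeps the SOC physically meaningful even when the penalty fires.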
I hope this information helps!
