I made a DDPG reinforcement learning agent to control a Simulink environment. It's similar to the water tank level example problem: the agent adjusts the process speed and receives a reward for every timestep that an output parameter stays inside a specified range, and a large negative reward if this output parameter exceeds a specified threshold.
I started with simple network architectures (one to three layers of around 15-20 neurons each), then went all the way up to four or five layers of 100 neurons in both the critic and the actor networks. I also tried reducing the learning rate to 1e-4.
The process takes around 150-300 timesteps, and the reward is 1 point for each timestep the output parameter spends inside the specified range, so the maximum possible return should be around 150-300, depending on the process speed.
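To make the reward scheme concrete, here is a minimal Python sketch of the per-step reward as I described it above. The function and parameter names (and the range/threshold values) are just illustrative, not the actual Simulink implementation:

```python
def step_reward(output, low, high, threshold, breach_penalty=-100.0):
    """+1 while the output stays inside [low, high]; a large negative
    reward if it crosses the hard threshold; 0 otherwise.
    All numeric values here are placeholders, not my real setup."""
    if output > threshold:
        return breach_penalty
    if low <= output <= high:
        return 1.0
    return 0.0

# Over a 150-300 step episode the best possible return is one point
# per step, e.g. 200 for a 200-step episode that never leaves the range.
episode = [step_reward(x, low=4.0, high=6.0, threshold=8.0) for x in [5.0] * 200]
print(sum(episode))  # → 200.0
```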
However, regardless of the chosen network architecture, Q0 diverges to very large values (on the order of 1e8) in every training session and then flattens out, while the episode reward bounces around between -1000 and 150 (see attached figure). This pattern persists even after 80,000+ episodes (three days of training). I have read that Q0 and the episode reward should converge toward each other if everything is set up correctly, so something is definitely wrong.
The optimal process speed should follow some sort of S-shape to collect the most reward. However, every time I stop the training, the agent predicts either a constant action for every timestep or no action at all, resulting in a linearly increasing or constant process speed, which does poorly in terms of reward.
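For clarity, by "S-shape" I mean something qualitatively like a logistic curve: slow at the start, steep in the middle, flattening out at the end. This sketch is purely an illustration of that shape (the parameters `t_total`, `v_max`, and `steepness` are made up, and the true optimal profile is unknown), not what the agent actually outputs:

```python
import math

def s_shaped_speed(t, t_total=200, v_max=1.0, steepness=0.05):
    """Illustrative logistic speed profile: ramps from ~0 up to v_max,
    with its steepest slope at the episode midpoint."""
    midpoint = t_total / 2
    return v_max / (1 + math.exp(-steepness * (t - midpoint)))

profile = [s_shaped_speed(t) for t in range(200)]
# Monotonically increasing, passing through v_max/2 at the midpoint.
```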
Any idea what I am doing wrong?
Thank you for your time!