I've been trying to train an agent on a custom gym environment where the goal is to resolve voltage violations in a power grid by adjusting the active power (loads) at each node. I have tried two algorithms via stable-baselines3, PPO and DDPG, but I'm getting very poor results with both (for example, rewards decreasing over time), and I was hoping someone could point me in a better direction.

The agent receives an observation containing the voltage at each node plus some other continuous values. It then takes an action that adjusts the load on each node of the grid (an array of 24 continuous values). A power flow is run to determine the new voltages, and a reward is computed from these new voltage values.
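For concreteness, here is a minimal sketch of how the observation and action spaces might be declared. The 24-node size is from the question; the number of extra features and the bounds are assumptions, not the actual implementation:

```python
import numpy as np
from gym import spaces

N_NODES = 24   # one controllable load per node (from the question)
N_EXTRA = 4    # assumed number of additional continuous features

# Observation: per-node voltages (per unit) plus the extra continuous values
observation_space = spaces.Box(low=0.0, high=2.0,
                               shape=(N_NODES + N_EXTRA,), dtype=np.float32)

# Action: continuous load adjustment at each node, normalized to [-1, 1]
action_space = spaces.Box(low=-1.0, high=1.0,
                          shape=(N_NODES,), dtype=np.float32)
```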

I want my agent to act as little as possible and solve the violations in a minimal amount of timesteps. So, I structured the reward function this way:

  • If there are no violations, the agent gets a reward of +10 and the episode terminates.
  • At each step where violations remain, I impose a base penalty plus an additional penalty proportional to the magnitude of the adjustment (see the sketch after this list).
  • If the adjustments are so extreme that my power flow algorithm fails to converge, I impose a penalty of -10 and the episode ends.

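A sketch of that reward logic as a standalone function. The voltage limits and penalty scales are placeholders I made up; only the three cases themselves come from the list above:

```python
import numpy as np

V_MIN, V_MAX = 0.95, 1.05   # assumed per-unit voltage limits
BASE_PENALTY = 1.0          # assumed scale of the per-step base penalty
ADJ_PENALTY_COEF = 0.1      # assumed scale of the adjustment penalty

def compute_reward(voltages, action, converged):
    """Return (reward, done) following the three cases described above."""
    if not converged:                        # power flow diverged
        return -10.0, True
    violations = (voltages < V_MIN) | (voltages > V_MAX)
    if not violations.any():                 # all voltages back within limits
        return 10.0, True
    # Violations remain: base penalty plus a term proportional to |action|
    return -(BASE_PENALTY + ADJ_PENALTY_COEF * np.abs(action).sum()), False
```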
Episode setup: each episode begins with an initial observation that contains violations. When an episode ends, the next one starts from a different initial voltage profile (again with some voltages in violation).
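A toy version of that initialization might look like the following; the sampling range is invented, and the real environment presumably draws from actual grid scenarios:

```python
import numpy as np

N_NODES = 24
V_MIN, V_MAX = 0.95, 1.05   # assumed per-unit limits

def sample_initial_voltages(rng=np.random.default_rng()):
    """Draw a voltage profile guaranteed to contain at least one violation."""
    while True:
        v = rng.uniform(0.90, 1.10, size=N_NODES).astype(np.float32)
        if np.any((v < V_MIN) | (v > V_MAX)):
            return v
```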

My PPO model has the following parameters (note the raw string for the Windows path; "C:\Users\..." is not a valid Python string literal because of the \U escape):

  • from stable_baselines3 import PPO
  • model = PPO("MlpPolicy", env, verbose=1, n_steps=256, tensorboard_log=r"C:\Users\antonio\Downloads\RL", ent_coef=0.01, gamma=0.9)

I chose a lower gamma since the agent needs to prioritize resolving violations quickly.

Here are the metrics for a PPO run of 10k steps: [TensorBoard plots: "PPO 10k steps"]

For DDPG I used the SB3 default values and got this: [TensorBoard plots: "DDPG try"]
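For reference, the DDPG setup with SB3 defaults would look roughly like this (a sketch; the step budget is an assumption, and env stands for the custom grid environment above). Note that with the defaults, action_noise is None, so no exploration noise is added to the deterministic policy:

```python
from stable_baselines3 import DDPG

# DDPG with SB3 defaults, mirroring what is described above.
# env is the custom voltage-control environment from the question.
model = DDPG("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)  # assumed step budget, matching the PPO run
```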

That's it, sorry for the long post. Any suggestions you could give me would be appreciated.


1 Answer

Looking at the episodic reward and the episode length, they appear to be inversely related: longer episodes get lower rewards and vice versa. The proportional penalty at each timestep might be so large that it overshadows the effect of the +/-10 for the successful and unsuccessful cases. Perhaps you could consider reducing its scale.
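For example, one might shrink the per-step coefficients so that the accumulated step penalties stay small relative to the terminal +10/-10. The names and values below are illustrative only, not tuned for your grid:

```python
import numpy as np

BASE_PENALTY = 0.1        # illustrative: small per-step cost
ADJ_PENALTY_COEF = 0.01   # illustrative: keep the action term well below 10

def step_penalty(action):
    # With these scales, even a few dozen steps accumulate far less than 10,
    # so the +10 success / -10 divergence signals still dominate the return.
    return -(BASE_PENALTY + ADJ_PENALTY_COEF * np.abs(action).sum())
```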
