I've been trying to train an agent on a custom gym environment where the goal is to resolve voltage violations in a power grid by adjusting the active power (loads) at each node. I have tried two algorithms via stable-baselines3, PPO and DDPG, but I'm getting very poor results with both (for example, rewards decreasing over time), and I was hoping someone could point me in a better direction.

The agent receives an observation containing the voltage at each node plus some other continuous values. It then takes an action that adjusts the load on each node of the grid (an array of 24 continuous values). A power flow is run to determine the new voltages, and a reward is computed from these new voltage values.
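For concreteness, here is a minimal sketch of how the observation and action spaces might be declared. The 24-node size is from the question; the number of extra features and the bounds are assumptions, not the actual implementation:

```python
import numpy as np
from gym import spaces

N_NODES = 24   # one controllable load per node (from the question)
N_EXTRA = 4    # assumed number of additional continuous features

# Observation: per-node voltages (per unit) plus the extra continuous values
observation_space = spaces.Box(low=0.0, high=2.0,
                               shape=(N_NODES + N_EXTRA,), dtype=np.float32)

# Action: continuous load adjustment at each node, normalized to [-1, 1]
action_space = spaces.Box(low=-1.0, high=1.0,
                          shape=(N_NODES,), dtype=np.float32)
```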

I want my agent to act as little as possible and solve the violations in a minimal amount of timesteps. So, I structured the reward function this way:

  • If there are no violations, the agent gets a reward of +10 and the episode terminates.
  • At each step where violations remain, I impose a base penalty plus an additional penalty proportional to the magnitude of the adjustment (see the sketch after this list).
  • If the adjustments are so extreme that my power flow algorithm fails to converge, I impose a penalty of -10 and the episode ends.

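A sketch of that reward logic as a standalone function. The voltage limits and penalty scales are placeholders I made up; only the three cases themselves come from the list above:

```python
import numpy as np

V_MIN, V_MAX = 0.95, 1.05   # assumed per-unit voltage limits
BASE_PENALTY = 1.0          # assumed scale of the per-step base penalty
ADJ_PENALTY_COEF = 0.1      # assumed scale of the adjustment penalty

def compute_reward(voltages, action, converged):
    """Return (reward, done) following the three cases described above."""
    if not converged:                        # power flow diverged
        return -10.0, True
    violations = (voltages < V_MIN) | (voltages > V_MAX)
    if not violations.any():                 # all voltages back within limits
        return 10.0, True
    # Violations remain: base penalty plus a term proportional to |action|
    return -(BASE_PENALTY + ADJ_PENALTY_COEF * np.abs(action).sum()), False
```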
Episode setup: each episode begins with an initial observation that contains violations. When an episode ends, the next one starts from a different initial voltage profile (again with some voltages in violation).
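A toy version of that initialization might look like the following; the sampling range is invented, and the real environment presumably draws from actual grid scenarios:

```python
import numpy as np

N_NODES = 24
V_MIN, V_MAX = 0.95, 1.05   # assumed per-unit limits

def sample_initial_voltages(rng=np.random.default_rng()):
    """Draw a voltage profile guaranteed to contain at least one violation."""
    while True:
        v = rng.uniform(0.90, 1.10, size=N_NODES).astype(np.float32)
        if np.any((v < V_MIN) | (v > V_MAX)):
            return v
```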

My PPO model has the following parameters (note the raw string for the Windows path; "C:\Users\..." is not a valid Python string literal because of the \U escape):

  • from stable_baselines3 import PPO
  • model = PPO("MlpPolicy", env, verbose=1, n_steps=256, tensorboard_log=r"C:\Users\antonio\Downloads\RL", ent_coef=0.01, gamma=0.9)

I chose a lower gamma since the agent needs to prioritize resolving violations quickly.

Here are the metrics for a PPO run of 10k steps: [TensorBoard plots: "PPO 10k steps"]

For DDPG I used the SB3 default values and got this: [TensorBoard plots: "DDPG try"]
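For reference, the DDPG setup with SB3 defaults would look roughly like this (a sketch; the step budget is an assumption, and env stands for the custom grid environment above). Note that with the defaults, action_noise is None, so no exploration noise is added to the deterministic policy:

```python
from stable_baselines3 import DDPG

# DDPG with SB3 defaults, mirroring what is described above.
# env is the custom voltage-control environment from the question.
model = DDPG("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)  # assumed step budget, matching the PPO run
```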

That's it, sorry for the long post. Any suggestions you could give me would be appreciated.


1 Answer

Looking at the episodic reward and the episode length, they appear to be inversely related: longer episodes get lower rewards and vice versa. The proportional penalty at each timestep might be so large that it overshadows the effect of the +/-10 for the successful and unsuccessful cases. Perhaps you could consider reducing its scale.
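For example, one might shrink the per-step coefficients so that the accumulated step penalties stay small relative to the terminal +10/-10. The names and values below are illustrative only, not tuned for your grid:

```python
import numpy as np

BASE_PENALTY = 0.1        # illustrative: small per-step cost
ADJ_PENALTY_COEF = 0.01   # illustrative: keep the action term well below 10

def step_penalty(action):
    # With these scales, even a few dozen steps accumulate far less than 10,
    # so the +10 success / -10 divergence signals still dominate the return.
    return -(BASE_PENALTY + ADJ_PENALTY_COEF * np.abs(action).sum())
```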
