Help with Reward Plateaus and Missing Initial Episodes in PG Algorithm Training

Hi everyone,

I’m currently training an agent on a custom environment using RLlib’s Policy Gradient (PG) algorithm. However, I’m encountering some unexpected behavior when analyzing the training rewards:

  1. Plateaus in Rewards: The mean reward per episode shows plateau-like behavior, staying exactly the same across multiple episodes (I verified this against the raw numerical values; see the plotting snippet after this list).
  2. Missing Initial Episodes: The plotted rewards start from episode 750 instead of 0, which seems odd.
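
For reference, this is roughly how I read and plot the rewards (a simplified sketch; it assumes Tune's default progress.csv logging with the episode_reward_mean and episodes_total columns, and the path is just a placeholder):

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder path to the trial's progress.csv written by Tune's default CSV logger.
df = pd.read_csv("path/to/my_trial/progress.csv")

# Mean episode reward (averaged over recently completed episodes) vs. total episode count.
plt.plot(df["episodes_total"], df["episode_reward_mean"])
plt.xlabel("episodes_total")
plt.ylabel("episode_reward_mean")
plt.show()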

I suspect these issues might be related to the parallelization of environments across different workers, but I’m struggling to pinpoint the root cause or figure out how to address it effectively.
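
One sanity check I’m considering (hypothetical, I haven’t run it yet) is turning off parallel sampling entirely to see whether both symptoms disappear:

# Copy of the config shown further down, with all sampling done in the driver process.
debug_config = dict(config)
debug_config["num_workers"] = 0          # no remote rollout workers
debug_config["num_envs_per_worker"] = 1  # a single, non-vectorized env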

Has anyone faced similar issues, or does anyone have insights on how to resolve them?

Here’s my config:

config = {
    "env": Cleaning_Diffusion,
    "env_config": env_config,
    "exploration_config": {
        "type": "StochasticSampling"
    },
    "model": {
        "custom_model": CustomNet,
    },
    "lr": learning_rate,
    "gamma": discount_factor,
    "num_workers": 12,          # 12 remote rollout workers
    "num_envs_per_worker": 4,   # 4 vectorized envs per worker -> 48 envs sampling in parallel
    "train_batch_size": 2048,
    "entropy_coeff": 0.1,
    "num_gpus": 1
}
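
For completeness, this is roughly how I launch training (simplified; the stop criterion is just a placeholder, and I’m assuming the ray.tune.run API with the built-in "PG" trainer string):

import ray
from ray import tune

ray.init()

analysis = tune.run(
    "PG",                              # RLlib's built-in Policy Gradient trainer
    config=config,
    stop={"training_iteration": 200}   # placeholder stopping rule
)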

Thanks in advance for your help!

Best regards,
L.E.O.