Lack of convergence when increasing the number of workers

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all,

I am currently experiencing an issue where increasing the number of workers from 1 or 2 to 7 significantly affects the performance of the algorithm: the 1/2-worker models obtain a higher reward and correctly complete the episodes, while the 7-worker model does not.

Below are images of the mean reward and of a custom callback metric that shows the number of successful episodes. The 2-worker model converges to a high success rate, falls back down, but then returns to a success rate of 1, essentially completing all episodes successfully.

On the other hand, the 7-worker model fails to complete even a single successful episode.

[Image: 7 Workers - Success Rate (when it successfully completed the episode)]
[Image: 7 Workers - Reward Mean]
[Image: 2 Workers - Reward Mean]
[Image: 2 Workers - Success Rate (when it successfully completed the episode)]

I am using the SAC algorithm with the following configuration options:

# Works for both torch and tf.
num_workers: 7
num_gpus: 1
num_cpus_per_worker: 2
framework: torch
gamma: 1
twin_q: True
# these probably do nothing
q_model_config:
  "fcnet_hiddens": [ 512, 512, 1024 ]
  "fcnet_activation": "relu"
policy_model_config:
  "fcnet_hiddens": [ 512, 512, 1024 ]
  "fcnet_activation": "relu"
#model:
#  "fcnet_hiddens": [ 256, 512 ]
#  "fcnet_activation": "tanh"

#batch_mode: "complete_episodes"

# temp change because carla crashed for some reason
recreate_failed_workers: True
# Do hard syncs.
# Soft-syncs seem to work less reliably for discrete action spaces.
tau: 1
#lr: 0.001
target_network_update_freq: 8000
#initial_alpha: 0.2
# auto = 0.98 * -log(1/|A|)
target_entropy: auto
clip_rewards: False
n_step: 1
rollout_fragment_length: 1
replay_buffer_config:
  type: MultiAgentPrioritizedReplayBuffer
  capacity: 400000
  prioritized_replay_alpha: 0.6
  prioritized_replay_beta: 0.4
  prioritized_replay_eps: 0.000001
store_buffer_in_checkpoints: False
# How many steps of the model to sample before learning starts.
num_steps_sampled_before_learning_starts: 10000
train_batch_size: 256
min_sample_timesteps_per_iteration: 4
# Paper uses 20k random timesteps, which is not exactly the same, but
# seems to work nevertheless. We use 100k here for the longer Atari
# runs (DQN style: filling up the buffer a bit before learning).
optimization:
    actor_learning_rate: 0.00005
    critic_learning_rate: 0.00005
    entropy_learning_rate: 0.00005

"exploration_config": {
  "type": "EpsilonGreedy",
  "initial_epsilon": 1.0,
  "final_epsilon": 0.01,
  "epsilon_timesteps": 500000
}

I have tried changing the target_network_update_freq, but it doesn’t seem to make much difference apart from producing a smoother reward curve, which still doesn’t contain a single successful episode.

I am leaning towards the rollout_fragment_length being incorrect, but I am not sure what values to try. Is there a way to know what rollout_fragment_length should be based on the other values, or what I should compare it against? Could train_batch_size be affecting this in any way?

Is there any other behavior you noticed that could lead to something?

Thank you for any help in advance. Any leads will be highly appreciated.

  1. What version of Ray are you using?
  2. Can you post the number of complete episodes and the number of timesteps vs. the number of training iterations? All off-policy algorithms are sensitive to the ratio of “number of sampled experiences / number of steps trained on”. So if that ratio stays constant, the performance should not be a function of the number of workers. (A quick way to print those counters is sketched after this list.)
  3. How do you calculate the “difficult_custom_done_ar…” thingy? Depending on how long the rollouts are etc, that may break these calculations. So please check on the underlying data for the 1/2 worker case and the 7 worker case.
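
For point 2, a rough way to eyeball that ratio is to print the sampled/trained counters from the training result dict each iteration. A minimal sketch (assuming the num_env_steps_sampled / num_env_steps_trained keys RLlib reports in its results; the environment and worker count below are just placeholders):

from ray.rllib.algorithms.sac import SAC

# Watch the sample/train ratio per iteration.
algo = SAC(config={"env": "Pendulum-v1", "num_workers": 7})

for i in range(10):
    result = algo.train()
    sampled = result["num_env_steps_sampled"]
    trained = result["num_env_steps_trained"]
    print(f"iter={i}: sampled={sampled}, trained={trained}, "
          f"ratio={trained / max(sampled, 1):.2f}")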

Hi Arthur thank you for your reply.

I am currently using Ray 2.1.0.

Is the below data what you are looking for? I am not sure how to interpret this information, but it seems to stay constant like you suggested it should.

[Image: 7 Workers - Iterations]
[Image: 7 Workers - Episodes]

[Image: 1 Worker - Episodes]
[Image: 1 Worker - Iterations]

Not sure if how I coded the ‘difficult_custom_done_arrived’ metric makes a difference. It is a callback where, in on_episode_end(), I check a variable inside the experiment class to see whether the episode was of the difficult type and whether it was successful. With that info I do the following:

if not worker.env.experiment.custom_done_arrived:
    episode.custom_metrics["difficult_custom_done_arrived"] = 0

elif worker.env.experiment.custom_done_arrived:
    episode.custom_metrics["difficult_custom_done_arrived"] = 1
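
For completeness, the whole callback is roughly of this shape (just a sketch; the class name here is made up, and it assumes the worker.env.experiment attribute from the snippet above together with RLlib's DefaultCallbacks API):

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class SuccessRateCallbacks(DefaultCallbacks):
    """Logs 1/0 per episode depending on whether it ended with a successful arrival."""

    def on_episode_end(self, *, worker, base_env, policies, episode, env_index=None, **kwargs):
        # The experiment object on the env tracks whether this episode arrived successfully.
        arrived = worker.env.experiment.custom_done_arrived
        episode.custom_metrics["difficult_custom_done_arrived"] = 1 if arrived else 0

# Registered via the "callbacks" key of the algorithm config,
# e.g. config["callbacks"] = SuccessRateCallbacks.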

Edit: I just noticed the significant difference in training iterations between the two experiments, although I am not sure how that came about. Would a different number of workers require different algorithm parameters, such as the learning rate, to compensate for the smaller number of training iterations for the same amount of sampled data? Just as an FYI, I increased the number of workers to decrease the training time and got the times below. But seeing the difference in training iterations, am I just able to obtain more data faster, rather than actually training faster, with a greater number of workers?

Timesteps | 600K     | 1.1 million
2 Workers | 14.3 hrs | 1.1 days
7 Workers | 6.6 hrs  | /

Edit 2: Searching around, it seems that the number of training iterations is somewhat linked to the trainer. Just as an FYI, I am using a custom trainer implemented as follows:

#!/usr/bin/env python

# Copyright (c) 2021 Computer Vision Center (CVC) at the Universitat Autonoma de
# Barcelona (UAB).
#
# This work is licensed under the terms of the MIT license.
# For a copy, see <https://opensource.org/licenses/MIT>.

import torch
import os

from ray.rllib.algorithms.sac import SAC


class CustomSACTrainer(SAC):
    """
    Modified version of SACTrainer with the added functionality of saving the torch model for later inference
    """
    def save_checkpoint(self, checkpoint_dir):
        checkpoint_path = super().save_checkpoint(checkpoint_dir)

        model = self.get_policy().model
        torch.save(model.state_dict(),
                   os.path.join(checkpoint_dir, "checkpoint_state_dict.pth"))

        return checkpoint_path

The custom trainer should not make a difference in this case.
The graphs that you posted would need to show both the two-worker and the seven-worker experiments in the same plot; otherwise we can’t compare them.
For example, if one of your experiments has more timesteps per training iteration than the other, the ratio between sampling and training may be different and that will affect learning.

If you leave all other parameters untouched, the number of workers should not affect how the mean episodic reward looks when plotted over the number of timesteps; it should only drive the wall-clock time down.

Thank you for your reply. How would I go about adjusting the ratio of sampling to training?

See the min_train_timesteps_per_iteration and min_sample_timesteps_per_iteration config settings in the AlgorithmConfig object.
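
For example, something along these lines (a rough sketch using SACConfig; the environment and the concrete timestep values are placeholders, not recommendations):

from ray.rllib.algorithms.sac import SACConfig

config = (
    SACConfig()
    .environment(env="Pendulum-v1")
    .rollouts(num_rollout_workers=7)
    # Force a minimum number of sampled and trained env steps per iteration,
    # so the sample/train ratio stays comparable across worker counts.
    .reporting(
        min_sample_timesteps_per_iteration=1000,
        min_train_timesteps_per_iteration=1000,
    )
)
algo = config.build()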

I have a similar problem. I am trying to scale my experiments, setting num_workers = 10, 20, 39, and I see drastically worse performance when I go from 20 to 39 workers. The reward does not reach the values I get with 10 and 20 workers.

The plots of env steps sampled and trained look as follows:

If I just set min_train_timesteps_per_iteration to some value to guarantee enough training, does that mean I will also get even more samples in one iteration, so the ratio would not change? Can I manage the ratio somehow so that it is maintained independently of the number of workers?

UPD: for context, I am using SAC with a multi-agent environment with 5 identical agents and episodes of length up to 1000. The config file has the following parameters set:

{
  "num_workers": 10,
  "train_batch_size": 1024,
  "num_cpus_per_worker": 1,
  "gamma": 0.99,
  "observation_filter": "MeanStdFilter",
  "batch_mode": "truncate_episodes",
  "horizon": 10000,
  "num_gpus": 0,
  "framework": "torch",
  "no_done_at_end": true,
  "target_network_update_freq": 1000,
  "tau": 0.01,
  "num_steps_sampled_before_learning_starts": 500,
  "initial_alpha": 1,
  "twin_q": true,
  "optimization": {
    "actor_learning_rate": 0.0003,
    "critic_learning_rate": 0.0003,
    "entropy_learning_rate": 0.0003
  },
  "replay_buffer_config": {
    "capacity": 10000000,
    "prioritized_replay": false
  }
}

Hi @Elena,

See that dip around 0.75 in the 39 workers case? When I see that, I usually suspect there is something weird going on with the gradients and assume the training is broken from then on until I investigate. I would look for NaNs in either the policy logits, the log_prob of the action distribution or the value function.
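
A plain-PyTorch check like the one below, dropped wherever you can get your hands on those tensors, is usually enough to catch it early (just a sketch, not tied to any specific RLlib hook):

import torch

def assert_finite(name: str, t: torch.Tensor) -> None:
    """Raise right away if a tensor contains NaN or Inf values."""
    if not torch.isfinite(t).all():
        raise ValueError(f"{name} contains NaN/Inf values")

# e.g. assert_finite("policy_logits", logits)
#      assert_finite("action_log_prob", log_prob)
#      assert_finite("q_values", q_values)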

Hi @mannyv, thank you for your reply.

Indeed, the q-mean and the gradients are very high, but I notice such a huge jump only with the large number of workers, so I suspect there is some imbalance introduced by increasing the number of workers, and I cannot find another reason for it.

@Elena,

If this were PPO, I would suggest that it was the std logits for the variance approaching zero. This problem has been reported many times. I have not seen it arise as an issue with SAC, though, and I do not know enough about the RLlib implementation details of SAC off the top of my head to know whether it is likely the issue.

There is an easy test and fix if that is the problem. You can set this in the model section of the config:
"model": {"free_log_std": True}

You can find a more detailed discussion here:

Hi @mannyv,

Thank you for your response. The thread you mentioned is very useful. I investigated this matter and found out that SAC uses a squashed Gaussian distribution whose log_std values are clipped to a fixed range. However, they are indeed very negative before clipping, and the means are well beyond the limits as well. This leads to a very extreme policy with actions always equal to the minimum or maximum allowed values. The more workers I use, the higher the chance of ending up with this policy.

@Elena,
That is about what I expected. Glad you were able to track down the issue. Have you tried the free_log_std option yet?

Looking at the trend of grad_gnorm in the 10-worker case, I would guess that if you let it train long enough it would also show this issue.

Yes, I have tried free_log_std without success. I believe this is because my action distribution goes to very large means (around 20-100) even though my actions are bounded by the interval [-0.1, 0.1]. Now I am a bit confused about how the policy network reaches such extreme numbers (all weights in the network are reasonably small, no NaNs). For instance, it can suggest the action (20, 30, -10), which gets squashed to (1, 1, -1), and all actions in the period of instability and high grad_gnorm that we observed in the plot are of the type (+/-1, +/-1, +/-1).
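
To illustrate the saturation (a small self-contained sketch of the tanh squashing plus rescaling that a squashed Gaussian applies to bounded actions; not the actual RLlib code):

import torch

raw_mean = torch.tensor([20.0, 30.0, -10.0])  # unsquashed means from the policy net
low, high = -0.1, 0.1                          # my action bounds

# tanh saturates for large |x|, so any big mean collapses to +/-1...
squashed = torch.tanh(raw_mean)                        # ~ ( 1.0,  1.0, -1.0)
# ...and rescaling to the bounds pins the action at the interval edges.
actions = low + (squashed + 1.0) * 0.5 * (high - low)  # ~ ( 0.1,  0.1, -0.1)
print(squashed, actions)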

Hi @Elena

There are three things I would think to investigate in this case:

  1. You have a bad set of hyperparameters and you need to adjust them. Good luck figuring out which ones.

  2. There is a bug in the rllib implementation of SAC loss somewhere. Good luck hunting it down.

  3. You can adjust the loss term to fight the large logits. This is the one I would try first. I would start by adding an L2 norm penalty on the policy logits to the loss function. There are two ways to do this, and they can be used together.

The first is to penalize the outputs directly. Subclass the policy, add a custom_loss implementation, and add l2_coef * torch.norm(logits, 2) to the loss. Recall that L1 regularization encourages sparsity, and I am guessing that is not what you want here.

Your other option is to add L2 regularization to your parameters. Here you can do that by overriding optimizer_fn and setting the Adam weight_decay parameter to some non-zero value for whichever sets of parameters you want to regularize. While you are at it, you might consider switching over to AdamW.
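
In plain PyTorch terms, the two ideas look roughly like this (a sketch only; the network, l2_coef, and the stand-in loss are placeholders, and none of this is wired into RLlib's SAC policy):

import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 3))
l2_coef = 1e-3

# Option 2: L2 regularization on the parameters via weight decay
# (AdamW decouples the decay from the adaptive gradient update).
optimizer = torch.optim.AdamW(policy_net.parameters(), lr=3e-4, weight_decay=1e-4)

obs = torch.randn(32, 8)          # dummy batch of observations
logits = policy_net(obs)          # raw, unsquashed policy outputs
base_loss = -logits.mean()        # stand-in for the real SAC actor loss

# Option 1: directly penalize large policy outputs with an L2 norm term.
loss = base_loss + l2_coef * torch.norm(logits, 2)

optimizer.zero_grad()
loss.backward()
optimizer.step()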

Are the q_values also large?