Lack of convergence when increasing the number of workers

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi all,

I am currently experiencing an issue where increasing the number of workers from 1/2 to 7 significantly affects the performance of the algorithm: the 1/2-worker models obtain a higher reward and correctly complete the episode, while the 7-worker model does not.

Below are images of the mean reward and of a custom callback that shows the number of successful episodes. The 2-worker model converges to a high success rate, falls back down, but then returns to a success rate of 1, i.e. it completes essentially all episodes successfully.

On the other hand, the 7-worker model fails to complete even a single successful episode.

[Image: 7 Workers Success Rate (when the episode was successfully completed)]
[Image: 7 Workers Reward Mean]
[Image: 2 Workers Reward Mean]
[Image: 2 Workers Success Rate (when the episode was successfully completed)]

I am using the SAC algorithm with the following configuration options:

# Works for both torch and tf.
num_workers: 7
num_gpus: 1
num_cpus_per_worker: 2
framework: torch
gamma: 1
twin_q: True
# these probably do nothing
q_model_config:
  "fcnet_hiddens": [ 512, 512, 1024 ]
  "fcnet_activation": "relu"
policy_model_config:
  "fcnet_hiddens": [ 512, 512, 1024 ]
  "fcnet_activation": "relu"
#model:
#  "fcnet_hiddens": [ 256, 512 ]
#  "fcnet_activation": "tanh"

#batch_mode: "complete_episodes"

# temp change because carla crashed for some reason
recreate_failed_workers: True
# Do hard syncs.
# Soft-syncs seem to work less reliably for discrete action spaces.
tau: 1
#lr: 0.001
target_network_update_freq: 8000
#initial_alpha: 0.2
# auto = 0.98 * -log(1/|A|)
target_entropy: auto
clip_rewards: False
n_step: 1
rollout_fragment_length: 1
replay_buffer_config:
  type: MultiAgentPrioritizedReplayBuffer
  capacity: 400000
  # Alpha/beta/eps parameters of the prioritized replay buffer.
  prioritized_replay_alpha: 0.6
  prioritized_replay_beta: 0.4
  prioritized_replay_eps: 0.000001
store_buffer_in_checkpoints: False
# How many steps to sample before learning starts.
# (The paper uses 20k random timesteps, which is not exactly the same, but
# seems to work nevertheless; RLlib's longer Atari example runs use 100k,
# DQN style: filling up the buffer a bit before learning.)
num_steps_sampled_before_learning_starts: 10000
train_batch_size: 256
min_sample_timesteps_per_iteration: 4
optimization:
    actor_learning_rate: 0.00005
    critic_learning_rate: 0.00005
    entropy_learning_rate: 0.00005

"exploration_config": {
  "type": "EpsilonGreedy",
  "initial_epsilon": 1.0,
  "final_epsilon": 0.01,
  "epsilon_timesteps": 500000
}

I have tried changing target_network_update_freq, but it does not seem to make much difference apart from producing a smoother reward curve, which still does not yield a single successful episode.

I am leaning towards an incorrect rollout_fragment_length, but I am not sure what values to try. Is there a way to determine what rollout_fragment_length should be based on the other settings, or something I should compare it against? Could train_batch_size be affecting this in any way?

Is there any other behavior you noticed that might lead to something?

Thank you in advance for any help. Any leads will be highly appreciated.

  1. What version of Ray are you using?
  2. Can you post the number of complete episodes and the number of timesteps vs. the number of training iterations? All off-policy algorithms are sensitive to the ratio of “number of sampled experiences / number of steps trained on”, so if that ratio stays constant, the performance should not be a function of the number of workers. (See the sketch after this list for one way to print those counters.)
  3. How do you calculate the “difficult_custom_done_ar…” thingy? Depending on how long the rollouts are, etc., that may break these calculations, so please check the underlying data for the 1/2-worker case and the 7-worker case.
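
If it helps, something along these lines prints those counters every iteration (a rough sketch; the result-dict keys follow Ray ~2.x and Pendulum-v1 is just a stand-in env):

from ray.rllib.algorithms.sac import SACConfig

# Sketch only: print sampled vs. trained env steps and their ratio per iteration.
algo = SACConfig().environment(env="Pendulum-v1").build()
for _ in range(3):
    result = algo.train()
    sampled = result.get("num_env_steps_sampled", 0)
    trained = result.get("num_env_steps_trained", 0)
    print(
        f"iter={result['training_iteration']} "
        f"episodes={result.get('episodes_total')} "
        f"sampled={sampled} trained={trained} "
        f"ratio={trained / max(sampled, 1):.2f}"
    )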

Hi Arthur, thank you for your reply.

I am currently using Ray 2.1.0.

Is the data below what you are looking for? I am not sure how to interpret it, but the ratio seems to stay constant, as you suggested it should.

[Image: 7 Workers - Training Iterations]
[Image: 7 Workers - Episodes]

[Image: 1 Worker - Episodes]
[Image: 1 Worker - Training Iterations]

I am not sure if how I coded the ‘difficult_custom_done_arrived’ metric makes a difference. It is a callback where, in on_episode_end(), I check a variable inside the experiment class to see whether the episode was of the difficult type and whether it was successful. With that info I do the following:

# Record 1 when the episode arrived successfully, 0 otherwise.
if worker.env.experiment.custom_done_arrived:
    episode.custom_metrics["difficult_custom_done_arrived"] = 1
else:
    episode.custom_metrics["difficult_custom_done_arrived"] = 0
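
For reference, this is roughly how that snippet sits inside the callbacks class (a sketch only; the on_episode_end signature is the Ray 2.x one, and experiment.custom_done_arrived is a flag my own environment sets):

from ray.rllib.algorithms.callbacks import DefaultCallbacks


class ArrivalCallbacks(DefaultCallbacks):
    """Records whether each episode ended with a successful arrival."""

    def on_episode_end(self, *, worker, base_env, policies, episode, env_index=None, **kwargs):
        # custom_done_arrived is set by my environment's experiment class; the
        # "difficult episode" check is omitted here for brevity.
        arrived = worker.env.experiment.custom_done_arrived
        episode.custom_metrics["difficult_custom_done_arrived"] = 1 if arrived else 0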

Edit: I just noticed the significant difference in training iterations between the two experiments, although I am not sure how that came about. Would a different number of workers require different algorithm parameters, such as the learning rate, to compensate for the smaller number of training iterations over the same amount of sampled data? As an FYI, I increased the number of workers to decrease the training time and got the times below; seeing the difference in training iterations, though, am I just obtaining data faster rather than actually training faster with a greater number of workers?

Workers   | 600K timesteps | 1.1M timesteps
2 workers | 14.3 hrs       | 1.1 days
7 workers | 6.6 hrs        | /

Edit 2: Searching around, it seems that the training iterations are somewhat linked to the trainer. Just as an FYI, I am using a custom trainer implemented as follows:

#!/usr/bin/env python

# Copyright (c) 2021 Computer Vision Center (CVC) at the Universitat Autonoma de
# Barcelona (UAB).
#
# This work is licensed under the terms of the MIT license.
# For a copy, see <https://opensource.org/licenses/MIT>.

import torch
import os

from ray.rllib.algorithms.sac import SAC


class CustomSACTrainer(SAC):
    """
    Modified version of SACTrainer with the added functionality of saving the torch model for later inference
    """
    def save_checkpoint(self, checkpoint_dir):
        checkpoint_path = super().save_checkpoint(checkpoint_dir)

        model = self.get_policy().model
        torch.save(model.state_dict(),
                   os.path.join(checkpoint_dir, "checkpoint_state_dict.pth"))

        return checkpoint_path
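
For completeness, loading that exported state dict later for inference would look roughly like this (a sketch; it assumes the same model architecture is rebuilt, e.g. by building the algorithm from the same config, and that checkpoint_dir is the directory used above):

import os

import torch

# Sketch: rebuild the same policy model via the training config, then load the
# exported weights. `config` stands for the SAC AlgorithmConfig used in training.
algo = config.build()
model = algo.get_policy().model
model.load_state_dict(
    torch.load(os.path.join(checkpoint_dir, "checkpoint_state_dict.pth"),
               map_location="cpu"))
model.eval()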

The custom trainer should not make a difference in this case.
The graphs that you posted would need to contain both experiments, the two-worker and the seven-worker run, otherwise we can't compare them.
For example, if one of your experiments has more timesteps per training iteration than the other, the ratio between sampling and training may be different, and that will affect learning.

If you leave all other parameters untouched, the number of workers should not affect how the mean episodic reward looks when plotted over the number of timesteps; it should only drive the wall-clock time down.

Thank you for your reply. How would I go about adjusting the ratio of sampling to training?

See the min_train_timesteps_per_iteration and min_sample_timesteps_per_iteration config settings in the AlgorithmConfig object.
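
For example, something along these lines (a sketch using the Ray ~2.1 AlgorithmConfig API; the numbers are placeholders, so adjust them to your setup):

from ray.rllib.algorithms.sac import SACConfig

# Sketch: only report a training iteration once at least this many env steps
# have been sampled AND at least this many timesteps have been trained on.
config = SACConfig().reporting(
    min_sample_timesteps_per_iteration=1000,
    min_train_timesteps_per_iteration=1000,
)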

I have a similar problem. I am trying to scale my experiments, setting num_workers = 10, 20, 39, and I see drastically worse performance when I go from 20 to 39: the reward does not reach the values I get with 10 and 20 workers.

The plots of env steps sampled and trained look as follows:

If I just set min_train_timesteps_per_iteration to some value to guarantee enough training, does that mean I will also get even more samples in one iteration, so the ratio would not change? Can I manage the ratio somehow so that it is maintained independently of the number of workers?

UPD: For context, I am using SAC with a multi-agent environment with 5 identical agents and episodes of length up to 1000. The config file has the following parameters set:

{
  "num_workers": 10,
  "train_batch_size": 1024,
  "num_cpus_per_worker": 1,
  "gamma": 0.99,
  "observation_filter": "MeanStdFilter",
  "batch_mode": "truncate_episodes",
  "horizon": 10000,
  "num_gpus": 0,
  "framework": "torch",
  "no_done_at_end": true,
  "target_network_update_freq": 1000,
  "tau": 0.01,
  "num_steps_sampled_before_learning_starts": 500,
  "initial_alpha": 1,
  "twin_q": true,
  "optimization": {
    "actor_learning_rate": 0.0003,
    "critic_learning_rate": 0.0003,
    "entropy_learning_rate": 0.0003
  },
  "replay_buffer_config": {
    "capacity": 10000000,
    "prioritized_replay": false
  }
}

Hi @Elena,

See that dip around 0.75 in the 39-worker case? When I see that, I usually suspect there is something weird going on with the gradients and assume the training is broken from then on until I investigate. I would look for NaNs in either the policy logits, the log_prob of the action distribution, or the value function.
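
A quick probe for that, as a sketch (the tensor here is a stand-in for whatever you pull out of the policy's forward pass):

import torch

# Stand-in tensor; in practice inspect the policy logits, the action
# distribution's log_prob, or the Q/value outputs.
logits = torch.randn(32, 6)
if torch.isnan(logits).any() or torch.isinf(logits).any():
    print("NaN/Inf detected in policy outputs")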

Hi @mannyv, thank you for your reply.

Indeed, the q-mean and the gradients are very high, but I notice such a huge jump only with the large number of workers, so I suspect there is some imbalance introduced by increasing the number of workers, and I cannot find another reason for it.

@Elena,

If this were PPO, I would suggest that it was the std logits for the variance approaching zero. This problem has been reported many times. I have not seen it arise as an issue with SAC, though, and I do not know the RLlib implementation details of SAC well enough off the top of my head to know whether it is likely the issue.

There is an easy test and fix if that is the problem. You can set this in the config.
model_config_dict: {"free_log_std": True}

You can find a more detailed discussion here:

Hi @mannyv,

Thank you for your response. The thread you mentioned is very useful. I investigated this and found out that SAC uses a squashed Gaussian distribution whose log_std values are clipped to a fixed range. However, they are indeed very negative before clipping, and the means are well beyond the limits as well. This leads to a very extreme policy, with actions always equal to the minimum or maximum allowed values. The more workers I use, the higher the chance of ending up with this policy.
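
To illustrate the effect with toy numbers (not RLlib's exact clipping constants, and placeholder action bounds rather than my real ones):

import numpy as np

# Toy illustration of tanh squashing: very different unbounded means all map to
# (essentially) the same boundary action, so nothing pushes them back down.
low, high = -1.0, 1.0  # placeholder action bounds

def squash(x):
    return low + (np.tanh(x) + 1.0) / 2.0 * (high - low)

for mean in [2.0, 5.0, 30.0]:
    print(mean, squash(mean))  # ~0.964, ~0.9999, ~1.0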

@Elena,
That is about what I expected. Glad you were able to track down the issue. Have you tried the free_log_std option yet?

Looking at the trend of grad_gnorm in the 10-worker case, I would guess that if you let it train long enough, it would also show this issue.

Yes, I have tried free_log_std without success. I believe this is because my action distribution goes to very large means (around 20-100) although my actions are bounded by the interval [-0.1, 0.1]. Now I am a bit confused about how the policy network reaches such extreme numbers (all weights in the network are reasonably small, no NaNs). For instance, it can suggest the action (20, 30, -10), which will be squashed to (1, 1, -1), and all actions in the period of instability and high grad_gnorm that we observed in the plot are of the type (±1, ±1, ±1).

Hi @Elena

There are three things I would think to investigate in this case:

  1. You have a bad set of hyperparameters and you need to adjust them. Good luck figuring out which ones.

  2. There is a bug in the rllib implementation of SAC loss somewhere. Good luck hunting it down.

  3. You can adjust the loss term to fight the large logits. This is the one I would try first. I would start by adding an L2 norm penalty on the policy logits to the loss function. There are two ways to do this, and they can be used together.

The first is to penalize the outputs directly. Subclass the policy, add a custom_loss implementation, and add l2_coef * torch.norm(logits, 2) to the loss. Recall that L1 regularization encourages sparsity, and I am guessing that is not what you want here.
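
As a conceptual sketch in plain PyTorch (not tied to a specific RLlib hook; l2_coef and the tensors are placeholders):

import torch

# Conceptual sketch: add an L2 penalty on the policy's distribution inputs
# (the unbounded means/log_stds) to the existing loss.
l2_coef = 1e-3                                    # placeholder coefficient
logits = torch.randn(32, 6, requires_grad=True)   # stand-in for policy outputs
base_loss = torch.zeros(())                       # stand-in for the SAC actor loss
loss = base_loss + l2_coef * torch.norm(logits, 2)
loss.backward()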

Your other option is to add L2 regularization to your parameters. Here you can do that by overriding optimizer_fn and setting the Adam weight_decay parameter to some non-zero value for whichever sets of parameters you want to regularize. While you are at it, you might consider switching over to AdamW.
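
The optimizer-side variant, again only as a sketch (which parameter group you regularize is up to you; the tensors are placeholders):

import torch

# Sketch: decoupled L2 regularization via AdamW's weight_decay.
policy_params = [torch.nn.Parameter(torch.randn(256, 64))]  # placeholder params
optimizer = torch.optim.AdamW(policy_params, lr=3e-4, weight_decay=1e-4)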

Are the q_values also large?

Hi @mannyv,

Thank you for your response. It took me a while to investigate the problem, and one finding was particularly curious. It turned out that a good policy in my MARL task should suggest relatively big action vectors at the beginning of an episode (with components around 0.5) and tiny action vectors at the end (with components around 1e-4 or even 1e-5). My actions are bounded by the interval (-0.1, 0.1) and the preferable actions are beyond this interval, so the policy is never penalized for numbers that are too high, and the action-distribution parameters rocket to the sky (like (mean, std) = (30, -50)). These numbers keep increasing until I stop training. At the same time, the alpha loss decreases drastically, probably because such odd distributions help achieve a very low target entropy.
The natural solution would be to widen the boundaries for the actions, for example to (-1, 1). However, in that case I can never achieve good performance at the end of the episode, when the preferable actions have orders of magnitude of 1e-4 or 1e-5. But that is a completely separate problem.

Hi @Elena! I'm having a similar problem and was wondering if you have solved yours. I started off with 24 env runners because I didn't know that having many workers could be worse! After seeing your results, I tried using only 2, and it's working wonderfully.

My previous understanding was that for SAC an update-to-data ratio of about 1-2 is best, according to "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier". So I matched the number of steps trained to the number of steps collected.

However, now that I use 2 workers, the ratio is 10,000,000 to 150,000, which is about 66, and that should give much worse performance. So I'm confused.

Hi @iykim. I remember that increasing the number of workers significantly contributed to instability in learning. If a learning problem occurred occasionally with 1 worker, it would be reproduced every time I ran the same code with 39 workers: a small, unexpected increase in grad_gnorm with one worker becomes a gradient explosion with 39 workers. So I stopped working with 1 worker, stuck with 39 workers, and made them work. In this particular case, I had a few problems, each of which led to a gradient explosion with 39 workers (although I did not check some of them with 1 worker):

  1. My action space is continuous and the limits for the actions were too narrow. The “optimal” policy was to choose the maximum allowable value for the action in some states, which led to huge numbers in the mean of the action distribution from the policy network, because they were squashed to the limits anyway. I explained this problem in the comments above.
  2. My target entropy was too low or too high; a too-high entropy in particular was not good for the policy. I noticed that SAC can do really weird things trying to reach the target entropy while sacrificing the reward. If this happens, the gradient can explode and the alpha_loss becomes very low.
  3. Occasionally, I had very large numbers (50-500) in my observation vector. Although I applied the MeanStdFilter observation filter, I got more stable learning, without jumps in grad_gnorm, once I started both clipping the observation vector manually AND applying the filter (see the sketch after this list).
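
The manual clipping from point 3 is nothing fancy; roughly the following (the clip range is a placeholder for my own observation scale):

import numpy as np

# Sketch of point 3: clip raw observations to a sane range before the
# MeanStdFilter normalization sees them; the bound is a placeholder.
OBS_CLIP = 10.0

def preprocess_obs(obs: np.ndarray) -> np.ndarray:
    return np.clip(obs, -OBS_CLIP, OBS_CLIP)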

I also experimented with the training_intensity parameter but did not get any improvement so I let it be calculated automatically.

I hope this will help.

Thank you. After looking carefully, I found out that I had an unnormalized value in my observation, and I was ignoring the warning message from Ray complaining that the observation was outside the observation space. I'm embarrassed by my ignorance.

Maybe in your case too, even if you normalize with a moving mean and std, large numbers occur occasionally, so I think they would not be normalized well. That is why it started working better when you also clipped the values. Maybe you can try log-scaling the observation? Anyway, thanks for your input. I was able to learn a lot!

Hi @iykim,
You are right, occasionally I have very large numbers, but in my project I cannot log-scale some components of the observation because I need to preserve the relations between them. I have to clip them in a way that preserves those relations: for instance, if I have two related components, say [100, 5], I clip the maximum component, which is 100, down to 1 and multiply the other component by 1/100, so instead of [100, 5] I get [1, 0.05] (see the sketch below).
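
Roughly like this (a sketch; the target scale of 1 is just my choice):

import numpy as np

# Sketch: rescale a group of related components by the largest magnitude so
# their ratios are preserved, instead of clipping each one independently.
def rescale_group(components: np.ndarray, target_max: float = 1.0) -> np.ndarray:
    peak = np.max(np.abs(components))
    if peak <= target_max:
        return components
    return components * (target_max / peak)

print(rescale_group(np.array([100.0, 5.0])))  # -> [1.   0.05]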