Lack of convergence when increasing the number of workers

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi all,

I am currently experiencing an issue where increasing the number of workers from 1 or 2 to 7 significantly affects the performance of the algorithm: the model trained with 1 or 2 workers achieves a higher reward and completes episodes correctly.

Below are images of the mean reward and of a custom callback metric that shows the rate of successful episodes. The 2-worker model converges to a high success rate, falls back down, and then returns to a success rate of 1, i.e. it successfully completes all episodes.

On the other hand, the 7 worker model fails to even complete one single successful episode.

[Images: 7 Workers Success Rate; 7 Workers Reward Mean; 2 Workers Reward Mean; 2 Workers Success Rate. The success rate counts episodes that were completed successfully.]

I am using the SAC algorithm with the following configuration options:

# Works for both torch and tf.
num_workers: 7
num_gpus: 1
num_cpus_per_worker: 2
framework: torch
gamma: 1
twin_q: True
# these probably do nothing
  "fcnet_hiddens": [ 512, 512, 1024 ]
  "fcnet_activation": "relu"
  "fcnet_hiddens": [ 512, 512, 1024 ]
  "fcnet_activation": "relu"
#  "fcnet_hiddens": [ 256, 512 ]
#  "fcnet_activation": "tanh"

#batch_mode: "complete_episodes"

# temp change because carla crashed for some reason
recreate_failed_workers: True
# Do hard syncs.
# Soft-syncs seem to work less reliably for discrete action spaces.
tau: 1
#lr: 0.001
target_network_update_freq: 8000
#initial_alpha: 0.2
# auto = 0.98 * -log(1/|A|)
target_entropy: auto
clip_rewards: False
n_step: 1
rollout_fragment_length: 1
replay_buffer_config:
  type: MultiAgentPrioritizedReplayBuffer
  capacity: 400000
  # How many steps of the model to sample before learning starts.
  # If True prioritized replay buffer will be used.
  prioritized_replay_alpha: 0.6
  prioritized_replay_beta: 0.4
  prioritized_replay_eps: 0.000001
store_buffer_in_checkpoints: False
num_steps_sampled_before_learning_starts: 10000
train_batch_size: 256
min_sample_timesteps_per_iteration: 4
# Paper uses 20k random timesteps, which is not exactly the same, but
# seems to work nevertheless. We use 100k here for the longer Atari
# runs (DQN style: filling up the buffer a bit before learning).
optimization:
    actor_learning_rate: 0.00005
    critic_learning_rate: 0.00005
    entropy_learning_rate: 0.00005

"exploration_config": {
  "type": "EpsilonGreedy",
  "initial_epsilon": 1.0,
  "final_epsilon": 0.01,
  "epsilon_timesteps": 500000
}

I have tried changing target_network_update_freq, but it doesn’t seem to make much difference: the reward curve becomes smoother, but the model still doesn’t produce a single successful episode.

I suspect that my rollout_fragment_length is incorrect, but I am not sure what values to try. Is there a way to work out what rollout_fragment_length should be from the other settings, or something I should compare it against? Could train_batch_size be affecting this in any way?

Have you noticed any other behavior that might point to the cause?

Thank you for any help in advance. Any leads will be highly appreciated.

  1. What version of Ray are you using?
  2. Can you post the number of complete episodes and the number of timesteps vs the number of training iterations? All off-policy algorithms are sensitive to the ratio of “number of sampled experiences / number of steps trained on”. So if that stays constant, the performance should not be a function of the number of workers.
  3. How do you calculate the “difficult_custom_done_ar…” thingy? Depending on how long the rollouts are etc, that may break these calculations. So please check on the underlying data for the 1/2 worker case and the 7 worker case.
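The ratio from point 2 can be checked directly on the training results. Below is a minimal sketch, assuming the result dict returned by each training iteration exposes the counters `num_env_steps_sampled` and `num_env_steps_trained` (check the key names against your Ray version). The helper name and the numbers are hypothetical and only illustrate the comparison.

```python
def sampled_per_trained(result):
    """Return sampled env steps per trained step for one result dict.

    Assumes the counters are stored under the keys used below; adjust
    them if your Ray version reports the counters under other names.
    """
    sampled = result["num_env_steps_sampled"]
    trained = result["num_env_steps_trained"]
    if trained == 0:
        # Learning has not started yet (e.g. the buffer is still filling).
        return float("inf")
    return sampled / trained


# Hypothetical counters from two runs, for illustration only.
run_a = {"num_env_steps_sampled": 20_000, "num_env_steps_trained": 2_560_000}
run_b = {"num_env_steps_sampled": 70_000, "num_env_steps_trained": 2_560_000}

ratio_a = sampled_per_trained(run_a)
ratio_b = sampled_per_trained(run_b)
print(f"run A: {ratio_a:.5f}, run B: {ratio_b:.5f}")
```

If the two runs print clearly different ratios, they are not being trained under the same conditions, even if every other hyperparameter is identical.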

Hi Arthur, thank you for your reply.

I am currently using Ray 2.1.0.

Is the data below what you are looking for? I am not sure how to interpret this information, but it seems to stay constant, as you suggested it should.

7 Workers (two graphs)

1 Worker (two graphs)

I am not sure whether the way I coded the ‘difficult_custom_done_arrived’ metric makes a difference. It is set in a callback: in on_episode_end(), the callback reads a variable from the experiment class to check whether the episode was of the difficult type and whether it was successful. With that information I do the following:

# Record 1 for a successful episode, 0 otherwise
if worker.env.experiment.custom_done_arrived:
    episode.custom_metrics["difficult_custom_done_arrived"] = 1
else:
    episode.custom_metrics["difficult_custom_done_arrived"] = 0

Edit: I just noticed the significant difference in training iterations between the two experiments, although I am not sure how it came about. Would a different number of workers require different algorithm parameters, such as the learning rate, to compensate for the lower number of training iterations over the same amount of sampled data? As an FYI, I increased the number of workers to reduce the training time and got the times below. Given the difference in training iterations, though, does a greater number of workers just let me collect data faster rather than train faster?

          | 600K timesteps | 1.1 million timesteps
2 Workers | 14.3 hrs       | 1.1 days
7 Workers | 6.6 hrs        | /
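
As a quick worked calculation on the 600K-timestep column above, the sketch below compares the sampling throughput of the two runs. It only uses the numbers from the table; the variable names are illustrative.

```python
timesteps = 600_000
hours = {2: 14.3, 7: 6.6}  # workers -> wall-clock hours to reach 600K timesteps

# Sampled timesteps per hour for each run.
throughput = {workers: timesteps / h for workers, h in hours.items()}

# Wall-clock speedup of the 7-worker run relative to the 2-worker run.
speedup = hours[2] / hours[7]

# Fraction of the ideal speedup (7 / 2 = 3.5x) that was actually achieved.
efficiency = speedup / (7 / 2)

for workers, steps_per_hour in throughput.items():
    print(f"{workers} workers: {steps_per_hour:,.0f} timesteps/hour")
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

So 3.5 times as many workers gave roughly a 2.2x reduction in wall-clock time for the same number of sampled timesteps. This measures how fast data was collected, not how many gradient updates were performed per sampled timestep.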

Edit 2: After searching around, it seems that the training iterations are somewhat linked to the trainer. As an FYI, I am using a custom trainer implemented as follows:

#!/usr/bin/env python

# Copyright (c) 2021 Computer Vision Center (CVC) at the Universitat Autonoma de
# Barcelona (UAB).
# This work is licensed under the terms of the MIT license.
# For a copy, see <>.

import torch
import os

from ray.rllib.algorithms.sac import SAC

class CustomSACTrainer(SAC):
    """
    Modified version of SACTrainer with the added functionality of saving the torch model for later inference
    """
    def save_checkpoint(self, checkpoint_dir):
        checkpoint_path = super().save_checkpoint(checkpoint_dir)

        # Also save the policy model's weights as a plain torch state dict
        model = self.get_policy().model
        torch.save(model.state_dict(),
                   os.path.join(checkpoint_dir, "checkpoint_state_dict.pth"))

        return checkpoint_path

The custom trainer should not make a difference in this case.
The graphs that you posted would need to show both experiments (two workers and seven workers) in the same plot, otherwise we can’t compare them.
For example, if one of your experiments has more timesteps per training iteration than the other, the ratio between sampling and training may be different and that will affect learning.

If you leave all other parameters untouched, the number of workers should not affect how the mean episodic reward looks when plotted over the number of timesteps; it should only drive the wall-clock time down.
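
One way the ratio can change with the worker count: RLlib's description of the training_intensity setting gives the "natural" value, used when the setting is left unset, as train_batch_size / (rollout_fragment_length x num_workers x num_envs_per_worker). The sketch below evaluates that formula for the posted settings. It assumes num_envs_per_worker is 1, since the posted config does not set it, and the helper name is hypothetical. Whether this is the actual cause of the difference has not been confirmed in this thread.

```python
def natural_training_intensity(train_batch_size, rollout_fragment_length,
                               num_workers, num_envs_per_worker=1):
    """Trained timesteps per sampled timestep when training_intensity is unset.

    Follows the formula quoted in the lead-in. num_workers=0 is treated as
    one sampler, on the assumption that the local worker samples then.
    """
    sampled_per_round = (
        rollout_fragment_length * max(num_workers, 1) * num_envs_per_worker
    )
    return train_batch_size / sampled_per_round


# Values taken from the posted config: train_batch_size=256,
# rollout_fragment_length=1. num_envs_per_worker=1 is an assumption.
intensities = {
    workers: natural_training_intensity(256, 1, workers)
    for workers in (1, 2, 7)
}

for workers, intensity in intensities.items():
    print(f"{workers} worker(s): {intensity:.1f} trained steps per sampled step")
```

Under these assumptions the 7-worker run trains on each sampled timestep far less often than the 1- or 2-worker runs, which would be consistent with the lower number of training iterations noted earlier.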

Thank you for your reply. How would I go about adjusting the ratio of sampling to training?

See the min_train_timesteps_per_iteration and min_sample_timesteps_per_iteration config settings in the AlgorithmConfig object.
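
A minimal sketch of what such overrides could look like when the config is kept as a plain dict, as in the original post. The two min_*_timesteps_per_iteration keys are the ones named above; training_intensity is an additional SAC setting that is assumed here to be available in your Ray version. All values are hypothetical starting points, not recommendations.

```python
# Subset of the posted config that matters for the sampling/training ratio.
base_config = {
    "num_workers": 7,
    "train_batch_size": 256,
    "rollout_fragment_length": 1,
    "min_sample_timesteps_per_iteration": 4,
}

# Hypothetical overrides, for illustration only.
overrides = {
    # Minimum number of env timesteps to sample within one training iteration.
    "min_sample_timesteps_per_iteration": 1000,
    # Minimum number of timesteps to train on within one training iteration.
    "min_train_timesteps_per_iteration": 1000,
    # Assumption: pin the trained/sampled ratio explicitly instead of letting
    # it depend on the worker count. 128 = 256 / (1 * 2), i.e. the value the
    # 2-worker setup would get from the default formula.
    "training_intensity": 128,
}

# Later keys win, so the overrides replace the matching base values.
config = {**base_config, **overrides}
print(config)
```

Comparing the 2-worker and 7-worker runs again with the same overrides applied to both would show whether the ratio explains the difference in learning.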