While running DDPPO training across two servers, cluster initialization failed with the following error:

ERROR worker.py:756 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::DDPPO.__init__() (pid=126499, ip=10.19.xxx.xx, repr=DDPPO)
(DDPPO pid=126499) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/algorithms/ddppo/ddppo.py", line 179, in __init__
(DDPPO pid=126499) super().__init__(
(DDPPO pid=126499) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm.py", line 308, in __init__
(DDPPO pid=126499) super().__init__(config=config, logger_creator=logger_creator, **kwargs)
(DDPPO pid=126499) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 157, in __init__
(DDPPO pid=126499) self.setup(copy.deepcopy(self.config))
(DDPPO pid=126499) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/algorithms/ddppo/ddppo.py", line 264, in setup
(DDPPO pid=126499) ray.get(
(DDPPO pid=126499) ray.exceptions.RayTaskError(RuntimeError): ray::RolloutWorker.setup_torch_data_parallel() (pid=3420304, ip=10.20.xx.xx, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f17b66a6640>)
(DDPPO pid=126499) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1680, in setup_torch_data_parallel
(DDPPO pid=126499) torch.distributed.init_process_group(
(DDPPO pid=126499) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 602, in init_process_group
(DDPPO pid=126499) default_pg = _new_process_group_helper(
(DDPPO pid=126499) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 703, in _new_process_group_helper
(DDPPO pid=126499) pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
(DDPPO pid=126499) RuntimeError: [/opt/conda/conda-bld/pytorch_1659484683044/work/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.1.1]:4505: Connection refused
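
The [127.0.1.1] in the last line looks like the key detail: on Debian/Ubuntu-based images the container's hostname is usually mapped to 127.0.1.1 in /etc/hosts, so Gloo on one node advertises a loopback address that the peer on the other server then fails to connect to. A quick check that can be run inside each container (plain Python, nothing RLlib-specific; which interface to use for GLOO_SOCKET_IFNAME is an assumption about the setup):

```python
# Diagnostic sketch: show what this container's hostname resolves to and
# which network interfaces are visible. If the hostname resolves to a
# loopback address, Gloo will tell remote peers to connect to 127.0.1.1.
import socket

hostname = socket.gethostname()
resolved = socket.gethostbyname(hostname)
print(f"hostname {hostname!r} resolves to {resolved}")

if resolved.startswith("127."):
    print("WARNING: hostname resolves to loopback; remote Gloo peers will "
          "try to connect back to this address and be refused.")

# GLOO_SOCKET_IFNAME should name the interface that actually carries the
# inter-node traffic (assumption: it is visible inside the container,
# e.g. when running with `docker run --network=host`).
print("interfaces:", [name for _, name in socket.if_nameindex()])
```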

My config:
"num_workers": 4,  # 2
"num_envs_per_worker": 1,  # 5
"num_cpus_per_worker": 2,  # 16
"framework": "torch",
"no_done_at_end": True,
"sample_async": False,
"placement_strategy": "SPREAD",
"keep_local_weights_in_sync": True,
"num_gpus_per_worker": 0.4,  # 0.5
"rollout_fragment_length": 200,
"torch_distributed_backend": "gloo",
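
For context, a minimal sketch of how a dict like this ends up in DDPPO (simplified; the environment name below is a placeholder, not my actual env):

```python
import ray
from ray.rllib.algorithms.ddppo import DDPPO

config = {
    "env": "CartPole-v1",                 # placeholder environment
    "framework": "torch",
    "num_workers": 4,
    "num_envs_per_worker": 1,
    "num_cpus_per_worker": 2,
    "num_gpus_per_worker": 0.4,
    "rollout_fragment_length": 200,
    "placement_strategy": "SPREAD",
    "keep_local_weights_in_sync": True,
    "sample_async": False,
    "no_done_at_end": True,
    "torch_distributed_backend": "gloo",  # or "nccl", see below
}

ray.init(address="auto")  # connect to the existing two-node cluster
algo = DDPPO(config=config)

for _ in range(10):
    print(algo.train()["episode_reward_mean"])
```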

On both servers the training runs inside Docker containers.
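
Since the rollout workers live in containers on two different hosts, I suspect two things have to hold: the containers must be reachable from each other on arbitrary ports (e.g. started with --network=host), and Gloo has to bind to the interface that carries the inter-node traffic rather than the loopback. One way to push the interface choice down to every rollout worker is Ray's runtime_env; a sketch, assuming the inter-node interface is eth0:

```python
import ray

# Propagate the socket-interface hint to every worker process in the job.
# Exporting the variable only in the driver's shell is not enough, because
# the RolloutWorker actors are separate processes started by the raylets.
ray.init(
    address="auto",
    runtime_env={
        "env_vars": {
            # Assumption: eth0 is the interface connecting the two servers;
            # check with `ip addr` inside each container.
            "GLOO_SOCKET_IFNAME": "eth0",
            "NCCL_SOCKET_IFNAME": "eth0",
        }
    },
)
```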

When I switch "torch_distributed_backend" to "nccl" instead, initialization succeeds but the training step fails:

(DDPPO pid=128397) 2023-02-23 08:23:22,790 ERROR algorithm.py:2173 -- Error in training or evaluation attempt! Trying to recover.
(DDPPO pid=128397) Traceback (most recent call last):
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/algorithms/algorithm.py", line 2373, in _run_one_training_iteration
(DDPPO pid=128397) results = self.training_step()
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 466, in _resume_span
(DDPPO pid=128397) return method(self, *_args, **_kwargs)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/algorithms/ddppo/ddppo.py", line 290, in training_step
(DDPPO pid=128397) sample_and_update_results = self._ddppo_worker_manager.get_ready()
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/execution/parallel_requests.py", line 173, in get_ready
(DDPPO pid=128397) objs = ray.get(ready_requests)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(DDPPO pid=128397) return func(*args, **kwargs)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/_private/worker.py", line 2275, in get
(DDPPO pid=128397) raise value.as_instanceof_cause()
(DDPPO pid=128397) ray.exceptions.RayTaskError(ValueError): ray::RolloutWorker.apply() (pid=128448, ip=10.19.xxx.xx, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f7ef1967790>)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
(DDPPO pid=128397) work = default_pg.allreduce([tensor], opts)
(DDPPO pid=128397) RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484683044/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
(DDPPO pid=128397) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
(DDPPO pid=128397)
(DDPPO pid=128397) The above exception was the direct cause of the following exception:
(DDPPO pid=128397)
(DDPPO pid=128397) ray::RolloutWorker.apply() (pid=128448, ip=10.19.xxx.xx, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f7ef1967790>)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1669, in apply
(DDPPO pid=128397) return func(self, *args, **kwargs)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/algorithms/ddppo/ddppo.py", line 351, in _sample_and_train_torch_distributed
(DDPPO pid=128397) info = do_minibatch_sgd(
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/utils/sgd.py", line 129, in do_minibatch_sgd
(DDPPO pid=128397) local_worker.learn_on_batch(
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 919, in learn_on_batch
(DDPPO pid=128397) info_out[pid] = policy.learn_on_batch(batch)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
(DDPPO pid=128397) return func(self, *a, **k)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/policy/torch_policy_v2.py", line 606, in learn_on_batch
(DDPPO pid=128397) grads, fetches = self.compute_gradients(postprocessed_batch)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
(DDPPO pid=128397) return func(self, *a, **k)
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/policy/torch_policy_v2.py", line 789, in compute_gradients
(DDPPO pid=128397) tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1179, in _multi_gpu_parallel_grad_calc
(DDPPO pid=128397) raise last_result[0] from last_result[1]
(DDPPO pid=128397) ValueError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484683044/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
(DDPPO pid=128397) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
(DDPPO pid=128397) tracebackTraceback (most recent call last):
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1137, in _worker
(DDPPO pid=128397) torch.distributed.all_reduce(
(DDPPO pid=128397) File "/opt/conda/envs/rl_decision/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
(DDPPO pid=128397) work = default_pg.allreduce([tensor], opts)
(DDPPO pid=128397) RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484683044/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
(DDPPO pid=128397) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
(DDPPO pid=128397)
(DDPPO pid=128397) In tower 0 on device cuda:0
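
ncclSystemError by itself only says that some socket or system call failed; the actual reason is only visible in NCCL's own logging, which is off by default. A sketch of the debug variables worth propagating to the rollout workers, via the same runtime_env mechanism as above (the interface name and NCCL_IB_DISABLE are assumptions to try, not a confirmed fix):

```python
import ray

# Turn on NCCL's own logging in the rollout-worker processes so the real
# cause of the ncclSystemError shows up in their logs, and force NCCL onto
# plain TCP sockets on the inter-node interface.
ray.init(
    address="auto",
    runtime_env={
        "env_vars": {
            "NCCL_DEBUG": "INFO",             # print why the system call failed
            "NCCL_DEBUG_SUBSYS": "INIT,NET",  # focus on init and network transport
            "NCCL_SOCKET_IFNAME": "eth0",     # assumption: the inter-node interface
            "NCCL_IB_DISABLE": "1",           # rule out InfiniBand inside the containers
        }
    },
)
```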

Hi @zhangzhang,

Thanks for raising this. Can you report this issue to the torch.distributed folks?
We don't "mess" with torch DDP ourselves, so this is unlikely to be an RLlib-related error.
Have you found anything so far?