Output with cluster_resource:
{'object_store_memory': 29328530227.0, 'memory': 61890834023.0, 'CPU': 24.0, 'node:10.231.13.71': 1.0, 'node:10.231.21.63': 1.0}
(pid=352113) 2021-06-14 17:44:28,902 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=0]
(pid=352103) 2021-06-14 17:44:28,900 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=4]
(pid=352101) 2021-06-14 17:44:28,906 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=7]
(pid=352106) 2021-06-14 17:44:28,899 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=9]
(pid=352123) 2021-06-14 17:44:28,901 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=12]
(pid=352112) 2021-06-14 17:44:28,911 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=6]
(pid=352102) 2021-06-14 17:44:28,912 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=11]
(pid=352107) 2021-06-14 17:44:28,917 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=10]
(pid=352104) 2021-06-14 17:44:28,960 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=1]
(pid=352105) 2021-06-14 17:44:28,928 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=2]
(pid=352109) 2021-06-14 17:44:28,943 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=3]
(pid=352108) 2021-06-14 17:44:29,036 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=5]
(pid=352255) 2021-06-14 17:44:29,076 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=14]
(pid=352343) 2021-06-14 17:44:29,053 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=15]
(pid=352119) 2021-06-14 17:44:29,132 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=8]
(pid=352120) 2021-06-14 17:44:29,095 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=13]
(pid=5913, ip=10.231.21.63) 2021-06-14 17:44:29,189 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=16]
(pid=5912, ip=10.231.21.63) 2021-06-14 17:44:29,180 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=17]
(pid=5910, ip=10.231.21.63) 2021-06-14 17:44:29,233 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=18]
(pid=5911, ip=10.231.21.63) 2021-06-14 17:44:29,231 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=19]
Traceback (most recent call last):
  File "sgd_example.py", line 37, in <module>
    trainer1 = TorchTrainer(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/torch_trainer.py", line 265, in __init__
    startup_success = self._start_workers(self.max_replicas)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/torch_trainer.py", line 329, in _start_workers
    return self.worker_group.start_workers(num_workers)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    ray.get(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/worker.py", line 1494, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::DistributedTorchRunner.setup_process_group() (pid=352113, ip=10.231.13.71)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 60, in setup_process_group
    setup_process_group(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/utils.py", line 38, in setup_process_group
    dist.init_process_group(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 439, in init_process_group
    _default_pg = _new_process_group_helper(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 517, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [/tmp/pip-req-build-ojg3q6e4/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [fe80::f816:3eff:fe09:d41d]:17796: Connection timed out
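
A note on reading the last line: the address Gloo tried to reach is in brackets, and it looks like an IPv6 address in the fe80::/10 range rather than one of the 10.231.x.x node addresses shown earlier. A minimal sketch for checking that, using only the Python standard library. The helper name `gloo_peer_address` is made up for illustration, and the regular expression assumes the `connect [addr]:port` shape seen in this particular message:

```python
import ipaddress
import re


def gloo_peer_address(error_message):
    """Return the peer IP found in a 'connect [addr]:port' error, or None.

    Hypothetical helper: the pattern is based only on the single
    error line in this log, not on a documented Gloo format.
    """
    match = re.search(r"connect \[([^\]]+)\]:(\d+)", error_message)
    if match is None:
        return None
    return ipaddress.ip_address(match.group(1))


message = (
    "RuntimeError: [/tmp/pip-req-build-ojg3q6e4/third_party/gloo/gloo/"
    "transport/tcp/pair.cc:769] connect [fe80::f816:3eff:fe09:d41d]:17796: "
    "Connection timed out"
)

addr = gloo_peer_address(message)
# is_link_local is True for addresses that are only valid on the local
# network segment, which are generally not routable between hosts.
print(addr.version, addr.is_link_local)
```

If the address does turn out to be link-local, one thing that may be worth trying (an assumption, not something this log confirms) is pointing Gloo at an interface with a routable address through the `GLOO_SOCKET_IFNAME` environment variable, set before the workers start. The right interface name depends on the machine.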