RaySGD example: connection timeout on a 2-node cluster?

I was using the RaySGD example, and it works perfectly well on a single node. When I extend it to a multi-machine cluster, though, it eventually fails with "connection timed out" before running anything. Is there any extra configuration I need to make it run on multiple machines?

Ray version 1.4; Python version: tried both 3.7 and 3.8.

My code is below; I pretty much copy-pasted it from the example here (RaySGD: Distributed Training Wrappers — Ray v2.0.0.dev0). The only change is setting num_workers=20, so that workers overflow to the 2nd node. Thanks!

import ray
from ray.util.sgd import TorchTrainer
from ray.util.sgd.torch import TrainingOperator
from ray.util.sgd.torch.examples.train_example import LinearDataset

import torch
from torch.utils.data import DataLoader

class CustomTrainingOperator(TrainingOperator):
    def setup(self, config):
        # Load data.
        train_loader = DataLoader(LinearDataset(2, 5), config["batch_size"])
        val_loader = DataLoader(LinearDataset(2, 5), config["batch_size"])

        # Create model.
        model = torch.nn.Linear(1, 1)

        # Create optimizer.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

        # Create loss.
        loss = torch.nn.MSELoss()

        # Register model, optimizer, and loss.
        self.model, self.optimizer, self.criterion = self.register(
            models=model,
            optimizers=optimizer,
            criterion=loss)

        # Register data loaders.
        self.register_data(train_loader=train_loader, validation_loader=val_loader)


ray.init(address="auto")

trainer1 = TorchTrainer(
    training_operator_cls=CustomTrainingOperator,
    num_workers=20,
    use_gpu=False,
    config={"batch_size": 64})

stats = trainer1.train()
print(stats)
trainer1.shutdown()
print("success!")

Can you also print ray.cluster_resources() for me?
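
For reference, a minimal snippet that should print it, run against the same cluster the trainer connects to:

import ray

ray.init(address="auto")
# Totals across every node currently attached to the cluster.
print(ray.cluster_resources())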

Another thing I suspect is some sort of race condition. Can you also try using num_workers=32?

Output with cluster_resources():

{'object_store_memory': 29328530227.0, 'memory': 61890834023.0, 'CPU': 24.0, 'node:10.231.13.71': 1.0, 'node:10.231.21.63': 1.0}
(pid=352113) 2021-06-14 17:44:28,902	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=0]
(pid=352103) 2021-06-14 17:44:28,900	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=4]
(pid=352101) 2021-06-14 17:44:28,906	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=7]
(pid=352106) 2021-06-14 17:44:28,899	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=9]
(pid=352123) 2021-06-14 17:44:28,901	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=12]
(pid=352112) 2021-06-14 17:44:28,911	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=6]
(pid=352102) 2021-06-14 17:44:28,912	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=11]
(pid=352107) 2021-06-14 17:44:28,917	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=10]
(pid=352104) 2021-06-14 17:44:28,960	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=1]
(pid=352105) 2021-06-14 17:44:28,928	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=2]
(pid=352109) 2021-06-14 17:44:28,943	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=3]
(pid=352108) 2021-06-14 17:44:29,036	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=5]
(pid=352255) 2021-06-14 17:44:29,076	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=14]
(pid=352343) 2021-06-14 17:44:29,053	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=15]
(pid=352119) 2021-06-14 17:44:29,132	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=8]
(pid=352120) 2021-06-14 17:44:29,095	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=13]
(pid=5913, ip=10.231.21.63) 2021-06-14 17:44:29,189	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=16]
(pid=5912, ip=10.231.21.63) 2021-06-14 17:44:29,180	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=17]
(pid=5910, ip=10.231.21.63) 2021-06-14 17:44:29,233	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=18]
(pid=5911, ip=10.231.21.63) 2021-06-14 17:44:29,231	INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.231.13.71:35451 [rank=19]
Traceback (most recent call last):
  File "sgd_example.py", line 37, in <module>
    trainer1 = TorchTrainer(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/torch_trainer.py", line 265, in __init__
    startup_success = self._start_workers(self.max_replicas)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/torch_trainer.py", line 329, in _start_workers
    return self.worker_group.start_workers(num_workers)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/worker_group.py", line 227, in start_workers
    ray.get(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/worker.py", line 1494, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::DistributedTorchRunner.setup_process_group() (pid=352113, ip=10.231.13.71)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 60, in setup_process_group
    setup_process_group(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/ray/util/sgd/torch/utils.py", line 38, in setup_process_group
    dist.init_process_group(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 439, in init_process_group
    _default_pg = _new_process_group_helper(
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 517, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [/tmp/pip-req-build-ojg3q6e4/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [fe80::f816:3eff:fe09:d41d]:17796: Connection timed out

I have 16 CPUs and 8 CPUs respectively on my cluster, so I also tried num_workers=24. Same result.

Ray status output:

(py38) [centos@n231-013-071 poc]$ ray status
======== Cluster status: 2021-06-14 18:22:26.633479 ========
Node status
------------------------------------------------------------
 1 node(s) with resources: {'CPU': 16.0, 'memory': 39254419047.0, 'node:10.231.13.71': 1.0, 'object_store_memory': 19627209523.0}
 1 node(s) with resources: {'memory': 22636414976.0, 'object_store_memory': 9701320704.0, 'node:10.231.21.63': 1.0, 'CPU': 8.0}

Resources
------------------------------------------------------------
Usage:
 0.0/24.0 CPU
 0.00/57.640 GiB memory
 0.00/27.314 GiB object_store_memory

Demands:
 (no resource demands)
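
Since Ray itself clearly reaches the second node (ranks 16–19 start there and print their setup lines), but the Gloo TCP connection between nodes times out, one thing that seems worth checking is whether arbitrary high ports on the head node are reachable from the second node; Gloo listens on a dynamically chosen port (35451 in the run above), so a firewall that only opens Ray's fixed ports would produce exactly this error. A rough sketch of such a check, with the node IPs hard-coded from the output above:

import socket
import ray

ray.init(address="auto")

# Open a listening socket on an ephemeral high port on the head node,
# similar to how Gloo picks a dynamic port for its TCP pairs.
@ray.remote(resources={"node:10.231.13.71": 0.01})
class Listener:
    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.bind(("0.0.0.0", 0))  # let the OS pick a free port
        self.sock.listen(1)

    def port(self):
        return self.sock.getsockname()[1]

# Try to reach that port from the second node.
@ray.remote(resources={"node:10.231.21.63": 0.01})
def try_connect(ip, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(10)
    try:
        s.connect((ip, port))
        return "ok"
    except OSError as e:
        return "failed: " + str(e)
    finally:
        s.close()

listener = Listener.remote()
port = ray.get(listener.port.remote())
print(ray.get(try_connect.remote("10.231.13.71", port)))

If this also times out, the problem is in the network setup between the nodes rather than in RaySGD itself.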

To report back my findings:

I switched to a different cluster (referred to as cluster-2 below), and now the example works…

There are some differences between the two clusters regarding network configuration and Linux distribution. But cluster-1 (the previous cluster) runs the Ray cluster and other applications without any issue; it is just the PyTorch backend that somehow causes trouble.

I'm not sure about the root cause that makes the underlying PyTorch backend get stuck; for now I'm just moving on.

Hope this helps if anybody else runs into the same situation.
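
One more observation for anyone debugging this: the Gloo error above is a connect timeout to a link-local IPv6 address ([fe80::...]), which suggests Gloo resolved the peer to the wrong network interface on cluster-1. A rough sketch of a per-node check (the node IPs are taken from the output above):

import socket
import ray

ray.init(address="auto")

# Per-node check: which addresses does each node resolve its own hostname to?
# If the routable 10.231.x.x address is not among them (or not first), Gloo can
# end up advertising an unreachable endpoint such as a link-local IPv6 address.
@ray.remote
def resolve_self():
    hostname = socket.gethostname()
    return hostname, socket.getaddrinfo(hostname, None)

for ip in ["10.231.13.71", "10.231.21.63"]:
    # Pin one task to each node via its node resource.
    ref = resolve_self.options(resources={"node:" + ip: 0.01}).remote()
    print(ip, ray.get(ref))

If a node does resolve to the wrong interface, pinning Gloo to the right one via the GLOO_SOCKET_IFNAME environment variable (exported on every node before starting Ray) might be worth a try; I did not get to verify that on cluster-1.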