Failing to set up a Ray cluster from inter-connectable machines

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi,

I tried to set up a “simple” Ray cluster by connecting two AWS machines using the following commands:

# machine1 designated as head_node
ray start --head
# it starts successfully and prints machine1_ip:6379
# on machine2
ray start --address='machine1_ip:6379'
Local node IP: machine2_ip

2023-01-14 20:56:15,350 WARNING utils.py:1346 -- Unable to connect to GCS at machine1_ip:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
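To rule out point (1) of the warning, the versions can be printed on each machine and compared by eye. Below is a minimal sketch that uses only the standard library; the helper name `installed_versions` is made up for illustration and is not part of Ray.

```python
import platform
from importlib import metadata


def installed_versions(packages=("ray",)):
    """Return the Python version plus the installed version of each package.

    A package that is not installed in this environment maps to None.
    (Hypothetical helper for illustration, not a Ray API.)
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Run this on both machine1 and machine2 and compare the two outputs.
print(installed_versions())
```

If the two machines print different values for `python` or `ray`, that mismatch would be the first thing to fix.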

This is a known error (see Launching an On-Premise Cluster — Ray 2.2.0), so I followed the troubleshooting step given there, nc -vv -z $HEAD_ADDRESS $PORT, which reports a successful connection. Replacing machine1_ip with the domain name does not work either. Both machines can SSH into each other using public keys, without typing a password. They are also running matching versions: python-3.10.0 and ray-2.1.0.
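For anyone who wants to repeat the reachability check without nc, here is a rough Python equivalent. It is only a sketch: `can_connect` is a made-up helper name, and it only tests whether a TCP connection to the given port can be opened, which is less than what Ray needs to join a cluster.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Roughly mirrors `nc -z host port`: it only checks that the port accepts
    a connection, not that a Ray GCS is actually listening on it.
    (Hypothetical helper for illustration, not a Ray API.)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running `can_connect("machine1_ip", 6379)` on machine2, with `machine1_ip` replaced by the real head node address, should return `True` whenever the nc check succeeds.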

What else can I do to pin down the point of failure? I would appreciate any help!