Cannot connect to cluster other than using loopback address on head node

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I created a cluster on AWS using the template from here. The only change relevant to this topic (though I don’t think it’s the root cause) is the following:

 provider:
     type: aws
     // xxx
     cache_stopped_nodes: True
+    security_group:
+        GroupName: test_ray_client_security_group
+        IpPermissions:
+              - FromPort: 10001
+                ToPort: 10001
+                IpProtocol: TCP
                  // xxx

With this YAML, I created a Ray cluster from my laptop using the following command and got the head node IP addresses (suppose the public IP is 3.3.3.3 and the private IP is 10.0.0.20):

ray up example-full.yml

Then, on my laptop, I tried to connect to the cluster using the code below, but it doesn’t work:

from ray.job_submission import JobSubmissionClient
client = JobSubmissionClient("http://3.3.3.3:8265")

I checked my security group setup but had no luck. What troubles me now is that even on the cluster head node itself (the security group accepts all traffic from within the security group), I can’t connect to the cluster through any address other than loopback:

This works (on the head node 10.0.0.20, outside of the container):

client = JobSubmissionClient("http://127.0.0.1:8265")

This doesn’t (on the head node 10.0.0.20, outside of the container):

client = JobSubmissionClient("http://10.0.0.20:8265")

ConnectionError: Failed to connect to Ray at address: http://10.0.0.20:8265

Digging a little deeper, it seems port 6379 (the Ray GCS address) is bound to all available IP addresses, but port 8265 (the Ray web UI/dashboard) only accepts connections on the loopback address:

(base) ubuntu@ip-10-0-0-20:~$ sudo netstat -tlnp | grep 6379
tcp6       0      0 :::6379                 :::*                    LISTEN      7012/gcs_server

(base) ubuntu@ip-10-0-0-20:~$ sudo netstat -tlnp | grep 8265
tcp        0      0 127.0.0.1:8265          0.0.0.0:*               LISTEN      7088/python
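
The same check can be done from Python without netstat. The following is a minimal sketch, not part of Ray: `can_connect` is a hypothetical helper that only tests whether a TCP connection to a host and port succeeds.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it is not a Ray API.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

On the head node described above, `can_connect("127.0.0.1", 8265)` would be expected to succeed while `can_connect("10.0.0.20", 8265)` fails, as long as the dashboard is bound only to the loopback address.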

The port forwarding mentioned here should work, but it doesn’t seem programmatic enough; I was expecting something as simple as this.

Is there any configuration that lets the dashboard process listen on all available IP addresses? Or is any other setup needed to access the job submission API from outside the cluster?

I believe the issue is that the dashboard’s HTTP server is only listening on 127.0.0.1. Apologies, I think this example should have better defaults.

Can you try making your head_start_ray_commands look like this (or remove that section entirely; the default should be right)?

head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
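
After restarting the cluster with that change, a quick way to confirm the dashboard answers on a non-loopback address is a plain HTTP request. This is a rough sketch using only the standard library. `dashboard_responds` is a hypothetical helper, and the `/api/version` path is an assumption about the dashboard's HTTP API; substitute any path you know your Ray version serves.

```python
import http.client
import urllib.request


def dashboard_responds(base_url: str, path: str = "/api/version",
                       timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to base_url + path returns a 2xx status.

    Hypothetical helper for illustration; the default path is an assumption.
    """
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses, so this also covers
        # refused connections, timeouts, and non-2xx responses.
        return False
```

With the addresses from the question, `dashboard_responds("http://10.0.0.20:8265")` on the head node, or `dashboard_responds("http://3.3.3.3:8265")` from the laptop, should return True once the dashboard listens on 0.0.0.0. The laptop check additionally assumes the security group allows inbound traffic on port 8265.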