I want to start a Ray head node on an on-prem Linux machine, connect multiple nodes (also on-prem machines) to it, and give my coworkers access to Ray. We all have user accounts on the machine running the head node, and those accounts belong to the same group. What I ended up doing to give the others access to Ray is changing the permissions of /tmp/ray/session_2021-01-13_19-38-03_898565_24443/sockets/plasma_store
and /tmp/ray/session_2021-01-13_19-38-03_898565_24443/sockets/raylet
to be group writable. Is this the “proper” way of doing this? Would Ray Cluster Launcher help me in this use case?
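For concreteness, the permission change described above amounts to roughly the following Python sketch. It only illustrates the workaround and is not a recommended or official Ray procedure. The helper names add_group_rw and open_up_ray_sockets are made up for this example, and the session directory name changes every time Ray is started.

```python
import os
import stat


def add_group_rw(path):
    """Add group read/write bits to path, leaving the other mode bits unchanged."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode | stat.S_IRGRP | stat.S_IWGRP
    os.chmod(path, new_mode)
    return new_mode


def open_up_ray_sockets(session_dir):
    """Apply add_group_rw to the two socket files mentioned in the post.

    session_dir is the /tmp/ray/session_... directory (an input, not guessed here).
    Returns a dict mapping each changed path to its new mode.
    """
    changed = {}
    for name in ("plasma_store", "raylet"):
        sock_path = os.path.join(session_dir, "sockets", name)
        if os.path.exists(sock_path):
            changed[sock_path] = add_group_rw(sock_path)
    return changed
```

It would be called as open_up_ray_sockets("/tmp/ray/session_2021-01-13_19-38-03_898565_24443") by the user who owns the session directory.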
Why can’t each of you just run drivers with ray.init(address='auto')?
I thought that would work too, but alas, it doesn't. I started a Ray process on the non-head nodes with
ray start --address <ip_of_head_node>
If I start a Python REPL on the non-head node and run ray.init(address='auto'), the process just hangs and eventually gives me a backtrace. I had port 6379 open on the head node for Redis (the head node was started with --port 6379). What other ports need to be open on the machines in the cluster for this to work?
So are you saying that if you try running a driver on a non-head node, it crashes? Can you explain your environment a bit more? This is not supposed to happen normally.
Also here is the information about the port number; Configuring Ray — Ray v1.1.0
@sangcho thanks for the link, I followed it and made sure ports are open on the head node (IP 10.70.21.30) by
sudo ufw allow 6379:20000/tcp
sudo ufw allow 6379:20000/udp
and started the head node with
ray start --head --dashboard-host 0.0.0.0 --include-dashboard true --dashboard-port 8265 --gcs-server-port 6380 --node-manager-port 6381 --object-manager-port 6382
and on the regular node I did
ray start --address 10.70.21.30:6379
which seemed to succeed. But if I subsequently run ray status, it hangs at
➜ ray status
2021-01-21 02:07:06,676 INFO scripts.py:1355 -- Connecting to Ray instance at 10.70.21.30:6379.
2021-01-21 02:07:06,683 INFO worker.py:650 -- Connecting to existing Ray cluster at address: 10.70.21.30:6379
and the same thing happens when I call ray.init(address='auto') in a script.
In the meantime, I made sure that the node can reach the head on these ports:
➜ nmap -p 6379-6382 rl-lambda-1
Starting Nmap 7.80 ( https://nmap.org ) at 2021-01-21 02:12 PST
Nmap scan report for rl-lambda-1 (10.70.21.30)
Host is up (0.073s latency).
PORT STATE SERVICE
6379/tcp open redis
6380/tcp open unknown
6381/tcp open unknown
6382/tcp open metatude-mds
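The same reachability check can be done from Python without nmap. The following is a minimal sketch using only the standard library; the function names port_is_open and scan_ports are invented for this example. Note that a successful TCP connect only shows that something accepts connections on that port from this node. It says nothing about the reverse direction (head to node) or about any ports outside the range that was checked.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def scan_ports(host, ports, timeout=2.0):
    """Check each port in turn and return a {port: bool} mapping."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For the setup in this thread, scan_ports("10.70.21.30", range(6379, 6383)) would cover the same four ports as the nmap call.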
Am I missing something?
Both nodes are running Ray 1.0.1.post1 and Python 3.8.6; the head node's OS is Ubuntu 18, and the other node's is Manjaro.