Security best practice for ray tune with on-premise cluster

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am using ray.tune on a local cluster with several GPU instances. Since the cluster is small, I manage the nodes manually, with something like:

# Head
ray start --head --port PORT --include-dashboard false --num-cpus 2 --num-gpus 2 --system-config '{"worker_niceness": 0}' --block
# Worker
ray start --address='HEAD-IP:PORT' --num-cpus 2 --num-gpus 2 --block

But I’m worried about its security. If I understand correctly, the recent Ray version (2.6) does not offer authentication, and I could not find a way to limit the network interfaces it binds to. So any computer that can reach the cluster could connect to it and execute arbitrary code.

Defining some firewall rules may be an option, but that requires root privileges, which means contacting the system administrator, and I would have to know which ports to protect.

Some other options I considered (without success):

  • Protect with a custom password
  • Bind to localhost, then use SSH port forwarding to connect with each host.

Please give me some suggestions or point out anything I misunderstand.

Well, actually there is a --node-ip-address option to explicitly set the network interface used for Ray's interconnection. However, in my case, with two different network interfaces and Ray set to use only one of them, it did not work (I got spurious errors/exceptions, see below). You might have better luck.

PS: Just found that ray allows encrypted connections (unfortunately, not for manual deployment):

https://docs.ray.io/en/latest/ray-core/configure.html#tls-authentication

I was aware of the --node-ip-address option (I went through most of ray cli options to see if there is a solution for my problem), but it did not work as expected.

When running

ray start --head --node-ip-address='127.0.0.1' --port=26379 --include-dashboard false --block

it prints Local node IP: <My IP of enp71s0>, which is my public IP. Ray apparently replaced the loopback address with the public one internally.

If I run

ray start --head --node-ip-address='172.17.0.1' --port=26379 --include-dashboard false --block

where 172.17.0.1 is docker0’s IP (the alternative private IP most easily available to me), Ray did print Local node IP: 172.17.0.1, but the port was still accessible from another machine.

IN BOTH CASES, lsof -i :26379 includes

COMMAND       PID        USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
......
gcs_serve ....... ...........   ...  IPv6 ..........      0t0  TCP *:26379 (LISTEN)
......

indicating ray still binds to all interfaces.
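To see every socket that is bound to all interfaces, not just port 26379, one option is to filter the full list of listening sockets. This is a minimal sketch, assuming the iproute2 `ss` tool is installed; `wildcard_listeners` is a hypothetical helper name, not something Ray provides:

```shell
# Read `ss -ltn` style lines on stdin and print only the listeners whose
# local address (4th column) is a wildcard: *, 0.0.0.0 or [::].
wildcard_listeners() {
  awk '{
    addr = $4
    sub(/:[0-9]+$/, "", addr)   # strip the trailing :PORT
    if (addr == "*" || addr == "0.0.0.0" || addr == "[::]") print $0
  }'
}

# Usage (adding -p shows process names for sockets you own):
#   ss -ltnp | wildcard_listeners
```

Any line this prints for a Ray process (gcs_server, raylet, workers) is a port reachable on every interface unless a firewall blocks it.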

Currently, I mitigate this with

sudo iptables -A INPUT -p tcp --dport 26379 ! -i lo ! -s aaa.bbb.ccc.0/24 -j DROP

But other ports opened by Ray may still be exposed.
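One way to narrow this down might be to pin the ports Ray would otherwise pick at random, so that a single rule can cover them. The sketch below only prints the flags and the rule; every port number is an arbitrary placeholder, the helper names are hypothetical, and the flag names are the ones I believe `ray start --help` lists, so check them against your Ray version:

```shell
# Hypothetical port plan: all values are placeholders, not Ray defaults.
GCS_PORT=26379
NODE_MANAGER_PORT=26380
OBJECT_MANAGER_PORT=26381
AGENT_GRPC_PORT=26382
AGENT_LISTEN_PORT=26383
METRICS_PORT=26384
MIN_WORKER_PORT=26400
MAX_WORKER_PORT=26499
TRUSTED_SUBNET="aaa.bbb.ccc.0/24"   # same placeholder as in the rule above

# Print the flags that pin per-node ports; append them to `ray start`
# on the head and on every worker.
ray_port_flags() {
  echo "--node-manager-port=${NODE_MANAGER_PORT}" \
       "--object-manager-port=${OBJECT_MANAGER_PORT}" \
       "--dashboard-agent-grpc-port=${AGENT_GRPC_PORT}" \
       "--dashboard-agent-listen-port=${AGENT_LISTEN_PORT}" \
       "--metrics-export-port=${METRICS_PORT}" \
       "--min-worker-port=${MIN_WORKER_PORT}" \
       "--max-worker-port=${MAX_WORKER_PORT}"
}

# Print (do not run) one iptables rule covering the whole pinned range.
ray_firewall_rule() {
  echo "iptables -A INPUT -p tcp --dport ${GCS_PORT}:${MAX_WORKER_PORT}" \
       "! -i lo ! -s ${TRUSTED_SUBNET} -j DROP"
}

# Example (head node):
#   ray start --head --port="$GCS_PORT" $(ray_port_flags) --include-dashboard false --block
#   sudo $(ray_firewall_rule)
```

This still needs root for the rule itself, and it is worth re-checking the listening sockets afterwards in case some Ray component opens a port outside the pinned range.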

The TLS solution you mentioned is interesting (I didn’t notice it before). I think it only involves setting several environment variables, so it should be applicable to manual deployment. But it feels like overkill to me: unnecessary and tedious for simple usage.
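If that reading of the linked docs is right, a manual setup would export the same variables on every node (and for the driver) before `ray start`. A rough sketch, where the variable names are the ones given on the linked TLS page, while the certificate paths and the `check_ray_tls_files` helper are hypothetical:

```shell
# Hypothetical paths; point them at certificates you generated yourself.
export RAY_USE_TLS=1
export RAY_TLS_SERVER_CERT="$HOME/ray-tls/server.crt"
export RAY_TLS_SERVER_KEY="$HOME/ray-tls/server.key"
export RAY_TLS_CA_CERT="$HOME/ray-tls/ca.crt"

# Return non-zero if any of the three files is missing or unreadable,
# so Ray is not started with a half-configured TLS setup.
check_ray_tls_files() {
  for f in "$RAY_TLS_SERVER_CERT" "$RAY_TLS_SERVER_KEY" "$RAY_TLS_CA_CERT"; do
    [ -r "$f" ] || { echo "missing TLS file: $f" >&2; return 1; }
  done
}

# Usage:
#   check_ray_tls_files && ray start --head --port=26379 --include-dashboard false --block
```

I have not verified this on a manually deployed cluster, so treat it as a starting point rather than a confirmed recipe.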

I hope Ray can make some improvements here. A safe-by-default setup is always desirable, especially for software that can execute arbitrary tasks on demand.