Security best practice for ray tune with on-premise cluster

jjyyxx · August 5, 2023, 5:18am

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am using ray.tune on a local cluster with several GPU instances. Since the cluster is small, I just manage them manually, with something like

# Head
ray start --head --port PORT --include-dashboard false --num-cpus 2 --num-gpus 2 --system-config '{"worker_niceness": 0}' --block
# Worker
ray start --address='HEAD-IP:PORT' --num-cpus 2 --num-gpus 2 --block

But I’m worried about its security. If I understand correctly, the recent Ray version (2.6) does not offer authentication, and I did not find a way to restrict which network interface it binds to. So any computer that can reach the cluster could connect to it and execute arbitrary code.

Defining some firewall rules may be an option, but that requires root privilege, which means contacting the system administrator, and I would also have to know which ports to protect.

Some other options I could think of (but did not succeed):

Protect with a custom password
Bind to localhost, then use SSH port forwarding to connect with each host.

Please give me some suggestions or point out anything I misunderstand.

hm-gmail · August 8, 2023, 9:20am

Well, actually there is a --node-ip-address option to explicitly set the network interface for Ray interconnection. However, in my case, with two different network interfaces and Ray set to use only one of them, it was not successful (I got spurious errors/exceptions, see below). You might manage it better.

PS: Just found that ray allows encrypted connections (unfortunately, not for manual deployment):

https://docs.ray.io/en/latest/ray-core/configure.html#tls-authentication

jjyyxx · August 8, 2023, 12:32pm

I was aware of the --node-ip-address option (I went through most of ray cli options to see if there is a solution for my problem), but it did not work as expected.

When running

ray start --head --node-ip-address='127.0.0.1' --port=26379 --include-dashboard false --block

it prints Local node IP: <My IP of enp71s0> which is my public IP. Ray did something internally to change localhost to a public IP.

If I run

ray start --head --node-ip-address='172.17.0.1' --port=26379 --include-dashboard false --block

where 172.17.0.1 is docker0’s IP (the alternative private IP most easily available to me), Ray did print Local node IP: 172.17.0.1, but it was still accessible from another machine.

IN BOTH CASES, lsof -i :26379 includes

COMMAND       PID        USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
......
gcs_serve ....... ...........   ...  IPv6 ..........      0t0  TCP *:26379 (LISTEN)
......

indicating ray still binds to all interfaces.
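To check every port rather than just 26379, the same lsof check can be widened to all listening TCP sockets. This is a minimal sketch: the function name is my own, it is not filtered by process name, and it only parses lsof output in the column layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lsof -nP -iTCP -sTCP:LISTEN` output on stdin and
# print the ports that are bound to all interfaces (NAME column is "*:PORT").
wildcard_listen_ports() {
  awk '$NF == "(LISTEN)" && $(NF-1) ~ /^\*:/ {
         port = $(NF-1)
         sub(/^\*:/, "", port)   # strip the "*:" prefix, keep the port number
         print port
       }' | sort -un
}

# Usage (run on each node while Ray is up):
#   lsof -nP -iTCP -sTCP:LISTEN | wildcard_listen_ports
```

Any port in that list which belongs to a Ray process is reachable from other machines unless a firewall rule covers it.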

Currently, I mitigate with

sudo iptables -A INPUT -p tcp --dport 26379 ! -i lo ! -s aaa.bbb.ccc.0/24 -j DROP

But some other ports opened by ray may still be vulnerable.
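One way to narrow this down would be to pin the ports Ray lets you choose on the command line and generate one rule per pinned port. A rough sketch, assuming the port numbers below (arbitrary placeholders, not Ray defaults) and the same subnet placeholder as in my rule above. It only prints the commands, it does not run them, and other Ray components may still open ports that are not covered here.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions): pick ports that are free on your hosts.
GCS_PORT=26379
NODE_MANAGER_PORT=26380
OBJECT_MANAGER_PORT=26381
MIN_WORKER_PORT=26400
MAX_WORKER_PORT=26499
TRUSTED_SUBNET="aaa.bbb.ccc.0/24"

# Print the head-node start command with these ports pinned.
ray_head_cmd() {
  echo "ray start --head --port=$GCS_PORT" \
       "--node-manager-port=$NODE_MANAGER_PORT" \
       "--object-manager-port=$OBJECT_MANAGER_PORT" \
       "--min-worker-port=$MIN_WORKER_PORT" \
       "--max-worker-port=$MAX_WORKER_PORT" \
       "--include-dashboard false --block"
}

# Print one DROP rule per pinned port (or port range) for traffic that is
# neither on loopback nor from the trusted subnet.
firewall_rules() {
  local spec
  for spec in "$GCS_PORT" "$NODE_MANAGER_PORT" "$OBJECT_MANAGER_PORT" \
              "$MIN_WORKER_PORT:$MAX_WORKER_PORT"; do
    echo "iptables -A INPUT -p tcp --dport $spec ! -i lo ! -s $TRUSTED_SUBNET -j DROP"
  done
}

ray_head_cmd
firewall_rules
```

The printed iptables lines still need root to apply, so this only helps with the question of which ports to protect.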

The TLS solution you mentioned is interesting (I didn’t notice it before). I think it only involves setting several environment variables, so it should also be applicable to manual deployment. But it feels like overkill to me: unnecessary and tedious for simple usage.
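As far as I can tell from the linked docs, the setup would look roughly like the following on every node, in the same shell that runs ray start. The certificate directory and file names are hypothetical, and generating and distributing the certificates is a separate step not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical certificate location; create and distribute these files yourself.
CERT_DIR="${CERT_DIR:-$HOME/ray-tls}"

# Environment variables named in the Ray TLS documentation.
export RAY_USE_TLS=1
export RAY_TLS_SERVER_CERT="$CERT_DIR/server.crt"
export RAY_TLS_SERVER_KEY="$CERT_DIR/server.key"
export RAY_TLS_CA_CERT="$CERT_DIR/ca.crt"

# Then start Ray as before from this shell, e.g. on the head node:
#   ray start --head --port PORT --include-dashboard false --block
```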

I hope Ray can make some improvements here. A safe-by-default solution is always desirable, especially for software that can execute arbitrary tasks on demand.
