This might be simple, but I couldn’t find in the documentation what the proper way is to connect to a deployed cluster. (I’ve successfully deployed two different clusters in GCP, one on GKE and the other on VMs, with the provided yaml files.)
I see that you can use commands to trigger a one-off run script via ray submit, and that I can ssh into the instances and run simple python commands there. But what is the recommended way to connect and keep a stable connection to a cluster that I can use as a ray.util.multiprocessing.Pool?
I’ve tried creating an ssh tunnel and using pool = Pool(ray_address="127.0.0.1:6379"), but besides not being stable, this gets me a lot of connection timeouts.
What would be the best way to expose my ray cluster and connect to it from outside to run processes there? What am I missing here?
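For anyone debugging the same ssh-tunnel setup, here is a minimal sketch for checking whether the forwarded port accepts TCP connections at all, which helps separate tunnel problems from Ray-level problems. It uses only the Python standard library; the helper name port_is_reachable and the 3-second default timeout are my own choices, not part of Ray. Note that a reachable port does not mean a driver can use the cluster through it.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration: it only tests TCP reachability,
    not whether Ray itself is healthy behind the port.
    """
    try:
        # create_connection raises OSError (including timeouts and refusals) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_is_reachable("127.0.0.1", 6379) while the tunnel is up should return True; if it returns False, the timeouts come from the tunnel rather than from Ray.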
Unfortunately, you’ll have to run your scripts on the cluster head node for now! We’re currently working on a client interface for your ideal workflow (with the ssh tunnel), but this is still a work in progress.
ray submit and ray rsync-up are my current typical commands for execution.
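As a rough sketch of what a script meant to run on the head node could look like (for example, one sent there with ray submit), assuming Ray is installed on the cluster: the function names square and run_on_head_node are made up for illustration, and ray_address="auto" assumes the script runs on a machine that is already part of the running cluster.

```python
def square(x):
    """Toy workload; stands in for whatever function you want to parallelize."""
    return x * x


def run_on_head_node(values):
    """Map `square` over `values` using the Ray cluster this machine belongs to.

    Assumption: this is called on a cluster node (e.g. the head node),
    so "auto" can locate the already-running Ray instance.
    """
    # Imported here so the module can still be imported where Ray is not installed
    from ray.util.multiprocessing import Pool

    pool = Pool(ray_address="auto")
    try:
        return pool.map(square, values)
    finally:
        # Stop accepting work and wait for the pool's workers to finish
        pool.close()
        pool.join()
```

On the head node, print(run_on_head_node(range(10))) would then distribute the calls across the cluster instead of going through an ssh tunnel from your laptop.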
Hi @rliaw,
I think I ran into the same problem. Is it documented somewhere that you must run your scripts on the cluster head node? I only found out randomly via this thread.
Also, in the documentation it says: “To run a distributed Ray program, you’ll need to execute your program on the same machine as one of the nodes.”, i.e. not restricted to running it on the head node.
No, you can actually run it on any node. The long-standing github issue (specifically, this example) doesn’t actually run the Ray script properly. For containerized settings, the script must be run within the same container in which you call ray start.
Happy to answer any more questions you might have, though I think we should move this discussion to another thread!
import ray
import ray.util

ray.util.connect("0.0.0.0:50051")  # replace with the appropriate host and port

# Normal Ray code follows
@ray.remote
def f(x):
    return x ** x

f.remote(2)  # call the remote function defined above
#....