Persistent SSH connection needed for ray.get() or ray.wait()?

Hi,

For distributed training work, if we simply want to kick off a script on different actors, will a .remote() call suffice, or must we have at least one object that we .get() or .wait() for?

In one specific application, we followed the default actor pattern, similar to this: mesh-transformer-jax/TPU_cluster.py at e2f3420163f02e90591f2a3c8c05e6387113703e · kingoflolz/mesh-transformer-jax · GitHub

It seems to keep a persistent SSH connection from the head to the actors.
Is there a way, perhaps by using .wait with shorter timeouts and repeated calls, to avoid the need for a persistent connection?

Hey @Vaibhav_Singh , thanks a bunch for posting!

If we simply want to kick off a script on different actors, will a .remote() call suffice, or must we have at least one object that we .get() or .wait() for?

A .remote() call should be enough to simply kick off a script on different actors.
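For illustration, here is a minimal fire-and-forget sketch (the `Worker` actor and `train.py` are placeholders, not anything from your setup):

```python
import subprocess

import ray

ray.init(address="auto")

@ray.remote
class Worker:
    def run_script(self):
        # Placeholder: launch whatever script you want to run on this actor's node.
        return subprocess.run(["python", "train.py"]).returncode

workers = [Worker.remote() for _ in range(4)]

# Each .remote() call returns an ObjectRef immediately; the tasks run on the
# actors whether or not the driver ever calls ray.get()/ray.wait() on them.
refs = [w.run_script.remote() for w in workers]
```

One caveat: if you never .get()/.wait() on the refs, exceptions raised inside `run_script` won't surface on the driver, so you lose visibility into failures.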

Is there a way, perhaps by using .wait with shorter timeouts and repeated calls, to avoid the need for a persistent connection?

I don’t think we use SSH specifically in Ray during .wait/etc. We do open a couple of persistent connections – what’s the reason you want to avoid the persistent connection?
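That said, if you’d rather poll than block on a long ray.get, the pattern you describe would look roughly like this (a sketch only; `refs` is assumed to be the list of ObjectRefs from your .remote() calls, and this doesn’t change which connections Ray keeps open under the hood):

```python
import ray

# `refs` is assumed to come from earlier .remote() calls.
pending = list(refs)
while pending:
    # Return as soon as at least one task finishes, or after a 5-second timeout.
    done, pending = ray.wait(pending, num_returns=1, timeout=5.0)
    for ref in done:
        result = ray.get(ref)  # the result is already available, so this doesn't block
        print("finished:", result)
    # The driver can do other work between polls here.
```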

Also, I’d love to hear what you’re doing with Ray!

Hey Richard, thank you for the super quick response.

A follow-up question: how does .remote ensure that the script it starts continues to run? Are there any connection timeouts or other settings relevant there?

In an oversimplified mental model, I would want to launch those scripts with something like nohup to make sure they’re not interrupted due to a lost connection.

How does .remote() ensure this?

More context:
The use case is distributed training with Cloud TPU VMs. In this setup, whether we are using a framework like JAX or PyTorch, all we want to do is set up the required environment on all the nodes of the cluster and then launch some scripts from each of the nodes. Any metric aggregation etc. can be delegated to those scripts. The head can play some special roles, but that can be managed using PyTorch/JAX APIs.
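Roughly, what we do looks like the sketch below (`NodeRunner`, `train.py`, and the flags are simplified placeholders for our actual setup):

```python
import subprocess

import ray

ray.init(address="auto")

@ray.remote
class NodeRunner:
    def setup(self):
        # Placeholder: prepare the environment on this node.
        subprocess.run(["pip", "install", "-r", "requirements.txt"], check=True)

    def launch(self, rank, world_size):
        # Placeholder: launch the framework-level training script; metric
        # aggregation and collectives are handled inside it via JAX/PyTorch APIs.
        return subprocess.run(
            ["python", "train.py", f"--rank={rank}", f"--world_size={world_size}"]
        ).returncode

num_nodes = 4  # assumed cluster size
runners = [NodeRunner.remote() for _ in range(num_nodes)]
ray.get([r.setup.remote() for r in runners])
exit_codes = ray.get([r.launch.remote(i, num_nodes) for i, r in enumerate(runners)])
```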

Using the .remote() → .get() pattern, some users have reported errors due to lost SSH connections. I assumed it could be resulting from .get, because .get seems to be blocking and doing something that looks like it would need a persistent connection.
But it could well be originating from something .remote tries to do as well.
Hence the follow-up question.

Generally, the Ray runtime requires each node/VM to have an internet connection to run. Ray nodes will self-destruct if their heartbeats are lost/not received; when that happens, .remote and .get will fail.

It’s exciting that you’re getting a bunch of users reporting this :slight_smile: What are the SSH connection errors that they are seeing? It’d be great to get a stack trace.

Thanks. I don’t have the logs yet.
But if I could get at the heartbeats more explicitly (via some API call?) and prevent the nodes from self-destructing, that would be a good scenario for this use case.

I guess you can actually set the heartbeat timeout to a much higher value: [gcp] Node mistakenly marked dead: increase heartbeat timeout? · Issue #16945 · ray-project/ray · GitHub
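As a rough sketch (the exact config key and value depend on your Ray version, so please double-check against that issue and the docs for your release), the timeout can be raised when starting the head node:

```python
import ray

ray.init(
    _system_config={
        # Hypothetical value: number of missed heartbeats tolerated before a
        # node is marked dead; see issue #16945 for what's appropriate.
        "num_heartbeats_timeout": 300,
    }
)
```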

BTW, this seems like a recurring issue on GCP. cc @Dmitri