Persistent SSH connection needed for ray.get() or ray.wait()?

Hi,

For distributed training work, if we simply want to kick off a script on different actors, will a .remote() call suffice, or must we have at least one object that we .get() or .wait() for?

In one specific application, we followed the default actor pattern, similar to this: mesh-transformer-jax/TPU_cluster.py at e2f3420163f02e90591f2a3c8c05e6387113703e · kingoflolz/mesh-transformer-jax · GitHub

It seems to keep a persistent SSH connection from the head to the actors.
Is there a way, perhaps by using .wait with shorter timeouts and repeated calls, to avoid the need for a persistent connection?

Hey @Vaibhav_Singh , thanks a bunch for posting!

If we simply want to kick off a script on different actors, will a .remote() call suffice, or must we have at least one object that we .get() or .wait() for?

A .remote() call should be enough to simply kick off a script on different actors.
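For illustration, here is a minimal fire-and-forget sketch (the `Worker` actor and `train.py` are placeholders, not anything from your setup):

```python
import subprocess

import ray

ray.init(address="auto")

@ray.remote
class Worker:
    def run_script(self):
        # Placeholder: launch whatever script you want to run on this actor's node.
        return subprocess.run(["python", "train.py"]).returncode

workers = [Worker.remote() for _ in range(4)]

# Each .remote() call returns an ObjectRef immediately; the tasks run on the
# actors whether or not the driver ever calls ray.get()/ray.wait() on them.
refs = [w.run_script.remote() for w in workers]
```

One caveat: if you never .get()/.wait() on the refs, exceptions raised inside `run_script` won't surface on the driver, so you lose visibility into failures.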

Is there a way, perhaps by using .wait with shorter timeouts and repeated calls, to avoid the need for a persistent connection?

I don’t think we use SSH specifically in Ray during .wait/etc. We do open a couple of persistent connections – what’s the reason you want to avoid the persistent connection?
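That said, if you’d rather poll than block on a long ray.get, the pattern you describe would look roughly like this (a sketch only; `refs` is assumed to be the list of ObjectRefs from your .remote() calls, and this doesn’t change which connections Ray keeps open under the hood):

```python
import ray

# `refs` is assumed to come from earlier .remote() calls.
pending = list(refs)
while pending:
    # Return as soon as at least one task finishes, or after a 5-second timeout.
    done, pending = ray.wait(pending, num_returns=1, timeout=5.0)
    for ref in done:
        result = ray.get(ref)  # the result is already available, so this doesn't block
        print("finished:", result)
    # The driver can do other work between polls here.
```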

Also, I’d love to hear what you’re doing with Ray!

Hey Richard, thank you for the super quick response.

A follow-up question: how does .remote ensure that the script it starts continues to run? Are there any connection timeouts or other settings relevant there?

In an oversimplified mental model, I would want to launch those scripts with something like nohup to make sure they’re not interrupted due to a lost connection.

How does .remote() ensure this?

More context:
The use case is distributed training with Cloud TPU VMs. In this setup, whether we are using a framework like JAX or PyTorch, all we want to do is set up the required environment on all the nodes of the cluster and then launch some scripts from each of the nodes. Any metric aggregation etc. can be delegated to those scripts. The head can play some special roles, but that can be managed using PyTorch/JAX APIs.
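Roughly, what we do looks like the sketch below (`NodeRunner`, `train.py`, and the flags are simplified placeholders for our actual setup):

```python
import subprocess

import ray

ray.init(address="auto")

@ray.remote
class NodeRunner:
    def setup(self):
        # Placeholder: prepare the environment on this node.
        subprocess.run(["pip", "install", "-r", "requirements.txt"], check=True)

    def launch(self, rank, world_size):
        # Placeholder: launch the framework-level training script; metric
        # aggregation and collectives are handled inside it via JAX/PyTorch APIs.
        return subprocess.run(
            ["python", "train.py", f"--rank={rank}", f"--world_size={world_size}"]
        ).returncode

num_nodes = 4  # assumed cluster size
runners = [NodeRunner.remote() for _ in range(num_nodes)]
ray.get([r.setup.remote() for r in runners])
exit_codes = ray.get([r.launch.remote(i, num_nodes) for i, r in enumerate(runners)])
```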

Using the .remote() → .get() pattern, some users have reported errors due to lost SSH connections. I assumed it could be resulting from .get, because .get seems to be blocking and doing something that looks like it would need a persistent connection.
But it could well be originating from something .remote tries to do as well.
Hence the follow-up question.

Generally, the Ray runtime requires each node/VM to have an internet connection to run. Ray nodes will self-destruct if their heartbeats are lost/not received; when that happens, .remote and .get will fail.

It’s exciting that you’re getting a bunch of users reporting this :slight_smile: What are the SSH connection errors that they are seeing? It’d be great to get a stack trace.

Thanks. I don’t have the logs yet.
But if I could get at the heartbeats more explicitly (via some API call?) and prevent the nodes from self-destructing, that would be a good scenario for this use case.

I guess you can actually set the heartbeat timeout to a much higher value: [gcp] Node mistakenly marked dead: increase heartbeat timeout? · Issue #16945 · ray-project/ray · GitHub
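As a rough sketch (the exact config key and value depend on your Ray version, so please double-check against that issue and the docs for your release), the timeout can be raised when starting the head node:

```python
import ray

ray.init(
    _system_config={
        # Hypothetical value: number of missed heartbeats tolerated before a
        # node is marked dead; see issue #16945 for what's appropriate.
        "num_heartbeats_timeout": 300,
    }
)
```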

BTW, this seems like a recurring issue on GCP. cc @Dmitri