For a distributed training job, if we simply want to kick off a script on different actors, will a bare `.remote()` call suffice, or must there be at least one returned object that we `.get()` or `.wait()` on?
In one specific application we followed the default actor pattern, similar to `TPU_cluster.py` at commit e2f3420163f02e90591f2a3c8c05e6387113703e of kingoflolz/mesh-transformer-jax on GitHub.
It seems to keep a persistent SSH connection from the head node to the actors.
Is there a way, perhaps by calling `.wait()` repeatedly with short timeouts, to avoid the need for that persistent connection?