Hi~
Sorry for the late reply.
You are absolutely correct.
The connection time is short under normal circumstances.
I’m running on a worker node in the cluster, and when the head node is down, the connection attempt waits for up to 60 seconds before giving up.
Compared to normal operation, a 60-second wait seems a bit long to me.
I have a similar problem. Users are allowed to create a Ray cluster with an external cluster manager, and then the main software connects to that cluster and does whatever it was designed to do. If the user has forgotten to launch their cluster, or the address/port is misconfigured, the main software gets stuck in ray.init for some time because the head node is not alive.
A timeout option for ray.init would be a nice plus, but I’ll manage without it for now.
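
In case it helps anyone else, one possible workaround is a quick TCP pre-check against the head node before calling ray.init, so a missing or misconfigured cluster fails fast. This is only a rough sketch: the helper names `head_node_reachable` and `init_ray_or_fail_fast` are made up for illustration, the 3-second timeout is arbitrary, and port 6379 is assumed to be the head node's port (adjust it to whatever your cluster actually uses). A successful TCP connect also does not guarantee that Ray itself is healthy, so ray.init can still block in other failure modes.

```python
import socket


def head_node_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical helper, not part of Ray. It only checks that something is
    listening on the port, not that it is a working Ray head node.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


def init_ray_or_fail_fast(host: str, port: int = 6379, timeout_s: float = 3.0) -> None:
    """Call ray.init only if the head node port answers; otherwise raise quickly.

    Hypothetical wrapper; the default port 6379 is an assumption.
    """
    if not head_node_reachable(host, port, timeout_s):
        raise ConnectionError(
            f"Ray head node at {host}:{port} did not answer within {timeout_s}s"
        )
    import ray  # imported lazily so the pre-check works without Ray installed

    ray.init(address=f"{host}:{port}")
```

With this, the main software can catch the ConnectionError and tell the user to check that the cluster is running and that the address/port is correct, instead of hanging in ray.init.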