Failing to launch workers due to docker pull timeout

Yoav · May 29, 2021, 9:10pm

We recently moved to a larger docker image for the ray head and workers, and see many failures:

The head node often fails to start (“ray up” fails). The failure is usually in the middle of the docker pull command, and the returned error is “SSH command failed”.
Worker nodes started by the autoscaler fail to launch, again while pulling the docker image and again with “SSH command failed”.

I suspect this is due to a timeout when running the SSH commands. I see the code has a default timeout of 120 seconds. How can I increase this default without changing the Ray code?
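To test the timeout suspicion, one option is to time the pull by hand on a node and compare it against the 120-second figure. Below is a minimal sketch of that check. It is not part of Ray: `time_command` and `ASSUMED_TIMEOUT_S` are hypothetical names, and the 120-second value is only the default I saw in the code, so it may not be the limit that actually applies to the pull.

```python
import subprocess
import sys
import time

# Assumption: 120 s is the default timeout mentioned above. It may not be
# the limit that actually governs the docker pull step.
ASSUMED_TIMEOUT_S = 120.0


def time_command(cmd, limit_s=ASSUMED_TIMEOUT_S):
    """Run cmd and return (elapsed_seconds, exceeded_limit, return_code)."""
    start = time.monotonic()
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    elapsed = time.monotonic() - start
    return elapsed, elapsed > limit_s, result.returncode


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. On a real node, replace
    # it with the actual pull, e.g. ["docker", "pull", "<your image>"].
    cmd = [sys.executable, "-c", "pass"]
    elapsed, exceeded, code = time_command(cmd)
    print(f"exit={code} elapsed={elapsed:.1f}s exceeded_limit={exceeded}")
```

If the pull regularly takes longer than the assumed limit, that would support the timeout theory. If it finishes well under it, the cause is probably elsewhere.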

