We recently moved to a larger docker image for the ray head and workers, and see many failures:
- head node often fails to start (“ray up” fails), the failure is often in the middle of the docker pull command (the returned error is “SSH command failed”.
- worker nodes started by the autoscaler are failing to get launched, again while pulling the docker, again with “SSH command failed”.
I suspect this is due to some timeout when running the ssh commands. I see the code has a default timeout of 120 seconds, how can I increase this default without changing ray code?