Hi guys. I am using the RayService feature of KubeRay, and I hope to get high availability and zero-downtime upgrades for my Ray clusters. Here are my questions:
- If some Ray jobs are still running when the Ray cluster begins to upgrade, is it possible for these jobs to resume and continue executing?
- Can using an external Redis as the GCS backend make this possible? Where can I learn more about using an external storage backend for the Ray GCS?
- I heard that some teams run the job driver on nodes other than the head node. How can I do that? And does this make resuming jobs possible?