How severely does this issue affect your experience of using Ray?
Low: It annoys or frustrates me for a moment.
Hello everyone,
My understanding of Ray clusters is that once I have started one with ray up cluster.yaml, the only ways to shut it down are to run ray down cluster.yaml or to remove the resources somehow outside of Ray.
Is there a way to shut down a cluster once a job is done (in the sense that the VM is deleted/deallocated)?
My use case is that I will occasionally have jobs that run for a very long time. The cluster should exist only for the runtime of the job and then remove itself. One option would be to remove the head node if it is idle for too long, but I could only find that option for the workers.
Is that possible in any way? Ideally just with ray, but if there is an option e.g. with Kubernetes that would also be okay.
If you’re looking for a quick and dirty solution, you can use ray submit --stop to run your job. The drawback is that if the process running that command dies, nothing stops the cluster.
The battle-hardened recommendation is to handle this with a workflow engine of your choice (e.g. Airflow).
In the Kubernetes world, you could run your Ray job submitter in a k8s job and have an operator tear down the cluster when the k8s job completes successfully.
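The same idea can be written as a small wrapper script. This is only a minimal sketch and not an official Ray pattern. It assumes the ray CLI is on the PATH. The function name run_job_then_teardown and the injectable run parameter are made up for illustration.

```python
import subprocess


def run_job_then_teardown(cluster_yaml, script, run=subprocess.run):
    """Submit `script` to the cluster, then always attempt `ray down`.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without a real cluster.
    """
    try:
        # `ray submit <cluster.yaml> <script>` runs the script on the cluster.
        result = run(["ray", "submit", cluster_yaml, script], check=False)
        return result.returncode
    finally:
        # Runs whether the job succeeded, failed, or raised an exception.
        # `-y` skips the interactive confirmation prompt.
        run(["ray", "down", "-y", cluster_yaml], check=False)


# Example call (needs a real cluster config and job script):
# exit_code = run_job_then_teardown("cluster.yaml", "job.py")
```

The finally block covers job failures and Python exceptions, but it has the same weakness as ray submit --stop: if the machine running the wrapper dies or the process is killed hard, the teardown never runs.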
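Without writing a full operator, the Kubernetes variant could be approximated with a script that waits for the k8s job and then deletes the cluster. This is a rough sketch. It assumes kubectl is configured and that the cluster is a KubeRay RayCluster resource. The function name, its parameters, and the default timeout are hypothetical.

```python
import subprocess


def teardown_after_k8s_job(job_name, cluster_name, namespace="default",
                           timeout="24h", run=subprocess.run):
    """Delete the RayCluster only after the k8s Job completes successfully.

    `run` is injectable (hypothetical parameter) so the sequence can be
    checked without a real Kubernetes cluster.
    """
    ns = ["--namespace", namespace]
    # Blocks until the Job reports the Complete condition. kubectl exits
    # non-zero on timeout, so check=True raises and the delete is skipped.
    run(["kubectl", "wait", "--for=condition=complete",
         f"job/{job_name}", f"--timeout={timeout}", *ns], check=True)
    # Assumes the KubeRay CRDs are installed so `raycluster` is a known kind.
    run(["kubectl", "delete", "raycluster", cluster_name, *ns], check=True)


# Example call (needs a real Job and RayCluster):
# teardown_after_k8s_job("ray-submitter", "my-raycluster")
```

A failed job never reaches the Complete condition, so the wait runs until the timeout and the cluster is left in place, which matches the "completes successfully" condition above. The script is itself a single point of failure, which is why an operator or workflow engine is the more robust choice.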
Is this something that will come at some point?
I would find it really useful if Ray internally had some way to fully tear down its cluster at the end of a script.
@M_S, as @Alex mentioned, we’re focusing on making HA easier to accomplish with KubeRay. We haven’t estimated a timeline for out-of-the-box HA, but these soon-to-come KubeRay improvements will be a step towards that.